Graph databases have improved significantly since the 1990s, with new developments and a clearer sense of best practices. Graph technology has become one of the most popular methods of performing big data research: its focus on finding relationships and its flexibility make it ideal for a variety of research projects. Staying aware of new developments and understanding best practices will streamline any work with graph databases.
Graph databases are typically considered a NoSQL, or non-relational, technology, which allows storage and research to expand in any direction without migrating the project to a different structure. Although SQL systems can support graph databases, especially with recent improvements, NoSQL architectures are typically much more effective. It should be noted that a relational/SQL database can work alongside a NoSQL graph database, with the two complementing one another by tapping the strengths of both systems.
The Basic Principles
A graph database is designed to assign equal value to both the data and the relationships connecting the data. Graph structures (nodes and edges) are used to represent and store data: a node represents a record/object/entity, while an edge represents a relationship between nodes. Querying relationships is quite fast, because the relationships are stored inside the database itself rather than computed at query time.
Nodes can be described as the entities within a graph. These nodes can be tagged with labels that represent different roles in the domain. Node labels can also be used to attach metadata (index or identification information) to certain nodes.
The edges, or relationships, provide connections between two node entities (for example, Volunteer-SCHEDULE-Weekdays or Car-DIRECTIONS-Destination). Relationships always have a direction, with a start node, an end node, and a type. Relationships can also have properties, generally quantitative ones such as distances, weights, costs, ratings, strengths, or time intervals. Because of the way relationships are saved, two nodes can share any number of relationships of any type. And although each relationship is stored with a specific direction, it can be navigated efficiently in either direction.
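The node-and-edge structure described above can be sketched as a small in-memory property graph. This is a minimal illustration, not any real database's API; the class and method names are invented for the example:

```python
from collections import defaultdict

class PropertyGraph:
    """Illustrative property graph: labeled nodes, typed directed edges with properties."""

    def __init__(self):
        self.nodes = {}                      # node_id -> {"labels": set, "props": dict}
        self.out_edges = defaultdict(list)   # start node -> [(type, end, props)]
        self.in_edges = defaultdict(list)    # end node -> [(type, start, props)]

    def add_node(self, node_id, labels=(), **props):
        self.nodes[node_id] = {"labels": set(labels), "props": props}

    def add_edge(self, start, rel_type, end, **props):
        # Every relationship has a direction, a type, and optional properties.
        self.out_edges[start].append((rel_type, end, props))
        # Indexing the reverse direction is what makes traversal efficient both ways.
        self.in_edges[end].append((rel_type, start, props))

g = PropertyGraph()
g.add_node("volunteer1", labels=["Volunteer"], name="Avery")
g.add_node("weekdays", labels=["Schedule"])
g.add_edge("volunteer1", "SCHEDULE", "weekdays", hours=4)

# Navigate against the stored direction (who is scheduled for weekdays?):
print(g.in_edges["weekdays"])  # [('SCHEDULE', 'volunteer1', {'hours': 4})]
```

Storing both an outgoing and an incoming index trades a little write-time work for fast traversal in either direction, which is the behavior graph databases advertise.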
Using Graph Databases
Graphs can be used in a variety of day-to-day applications, such as representing optical fiber mapping, designing a circuit board, or something as simple as roads and streets on a map. Facebook uses graphs to form a data network, with nodes representing a person or a topic, and edges representing processes, activities, or methods that connect the nodes.
Lockheed Martin Space uses graph technologies for supply chain management, making it easier for them to uncover potential weaknesses and boost supply chain resilience. Their CDAO, Tobin Thomas, stated in an interview, “Think about the lifecycle of how a product is created. We’re using technologies like graphs to connect the relationships together, so we can see the lifecycle based on particular parts or components and the relationships between every element.”
Gartner predicts that the market for graph technologies will grow to $3.2 billion by 2025. The growing popularity of graph databases is, in part, the result of well-designed algorithms that make sorting through the data much, much easier. The infamous Panama Papers scandal provides an excellent example of how algorithms were used to seek out information from thousands of shell companies. These shells provided movie stars, criminals, and politicians, such as Iceland’s former prime minister Sigmundur David Gunnlaugsson, with a place to deposit money in offshore accounts. Graph databases, with their algorithms, made the research of these shell companies possible.
Problems with Graph Databases
The most common problems that develop when working with graph databases involve inaccurate or inconsistent data and poorly written queries. Accurate results rely on accurate and consistent information: if the data going in isn't reliable, the results coming out cannot be considered trustworthy.
Queries can also fail when the stored data uses specific, non-generic terms while the query uses generic terminology. Additionally, every query must be designed to meet the system's requirements.
Inaccurate data contains information that is simply wrong: a wrong address, a wrong gender, or any number of other blatant errors. Inconsistent data, on the other hand, describes a situation in which multiple tables in a database work with the same data but receive it from different inputs in slightly different versions (misspellings, abbreviations, etc.). Inconsistencies are often compounded by data redundancy.
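One common defense against inconsistent inputs is canonicalizing values before they are loaded into the graph, so misspellings and abbreviations collapse into a single node. A minimal sketch, assuming a hypothetical alias table and field name:

```python
# Hypothetical alias table: maps known variants to one canonical value.
ALIASES = {
    "ny": "New York", "n.y.": "New York", "new york": "New York",
    "la": "Los Angeles", "los angeles": "Los Angeles",
}

def canonicalize_city(raw: str) -> str:
    """Normalize a raw city string to its canonical form, if a known alias."""
    key = raw.strip().lower()
    return ALIASES.get(key, raw.strip().title())

# Three inconsistent inputs collapse to a single value (and thus one node).
records = [{"city": "NY"}, {"city": "new york"}, {"city": "N.Y."}]
cities = {canonicalize_city(r["city"]) for r in records}
print(cities)  # {'New York'}
```

Without a step like this, the same city would appear as three separate nodes, and relationship queries would silently miss connections.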
Graph queries interrogate the graph database, and these queries need to be accurate, precise, and designed to fit the database model. The queries should also be as simple as possible. The simpler the query, the more tightly focused its results. The more complicated the query, the broader – and perhaps more confusing – the results.
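The advice to keep queries simple can be pictured by composing small, focused traversals instead of writing one sprawling query. The toy adjacency list and function names below are illustrative:

```python
# Illustrative adjacency list: node -> [(relationship type, neighbor)].
edges = {
    "Alice": [("FRIEND", "Bob"), ("FRIEND", "Carol")],
    "Bob": [("FRIEND", "Dave")],
    "Carol": [],
    "Dave": [],
}

def neighbors(start, rel_type):
    """One simple, tightly focused query: direct neighbors by relationship type."""
    return [end for rtype, end in edges.get(start, []) if rtype == rel_type]

# Two simple queries composed, rather than one complicated multi-hop query:
friends = neighbors("Alice", "FRIEND")
friends_of_friends = [fof for f in friends for fof in neighbors(f, "FRIEND")]
print(friends)             # ['Bob', 'Carol']
print(friends_of_friends)  # ['Dave']
```

Each call answers exactly one question, so each result stays tightly focused and easy to check; the composition does the complex work.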
Best Practices at the Start
For research purposes, most free or purchased bulk data is reasonably accurate. Inaccurate and inconsistent data tends to be the result of human error, such as a salesperson or website chat agent completing various forms. Training staff to habitually double-check their entries (and double-checking their work during the training process) can produce dramatic improvements.
Queries should start out simple, and remain simple. If the research becomes more complex, don't create a more complex query; create a new, simple query to research separately. CrowdStrike offers a useful example of the value of simple queries from the development of its security analytics tool, Threat Graph. CrowdStrike authors Marcus King and Ralph Caraveo wrote:
“At the outset of this project, the main issue we needed to address was managing an extremely large volume of data with a highly unpredictable write rate. At the time, we needed to analyze a few million events per day – a number that we knew would grow and is now in the hundreds of billions. The project was daunting, which is why we decided to step back and think not about how to scale, but how to simplify. We determined that by creating a data schema that was extraordinarily simple, we would be able to create a strong and versatile platform from which to build. So our team focused on iterating and refining until we got the architecture down to something that was simple enough to scale almost endlessly.”
Artificial Intelligence, Machine Learning, and Graph Databases
Applying graph enhancements to artificial intelligence is improving both model accuracy and modeling speed. An AI platform merged with a graph database has been shown to enhance machine learning models, supporting more complex decision-making processes. Graph technology meshes well with artificial intelligence and machine learning, making data relationships simpler to work with, more scalable, and more efficient.
Amazon has turned its attention to using machine learning for classifying nodes and edges based on their attributes. The process can also be used to predict the most probable connections. Some versions of this machine learning/graph technology option include maps of the physical world, such as researching the best routes for getting from one place to another. Some versions focus on more abstract tasks – for example, knowledge synthesis – and use graph models based on text, or conceptual networks.
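One simple heuristic behind predicting probable connections is common-neighbor counting: the more neighbors two unconnected nodes share, the more likely a link between them. The sketch below is an illustrative toy, not Amazon's actual method:

```python
# Illustrative undirected graph: node -> set of neighbors.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D", "E"},
    "D": {"A", "C"},
    "E": {"C"},
}

def common_neighbor_score(u, v):
    """Score a candidate link by how many neighbors the two nodes share."""
    return len(graph[u] & graph[v])

# Rank node pairs that are not yet connected.
candidates = [
    (u, v) for u in graph for v in graph
    if u < v and v not in graph[u]
]
ranked = sorted(candidates, key=lambda p: common_neighbor_score(*p), reverse=True)
print(ranked[0])  # ('B', 'D') -- they share two neighbors, A and C
```

Real link-prediction systems use machine-learned models over node and edge attributes, but they often start from structural signals like this one.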
Graph databases have evolved to the point where they can resolve some of the more complicated challenges of the telecommunications industry. Combating fraud has become a high priority, with AI and machine learning the first choice for staying ahead of threats, and graph databases now support the analytical techniques those systems use to detect fraud.