A wide range of tools is available with which users can analyze their data stored in data warehouses and production databases. These tools range from straightforward reporting tools via interactive online analytical processing tools to advanced statistical tools. All these tools help users in some way to improve their business operations and business decisions. They help by presenting data in a textual or graphical way by summarizing data, by grouping data, or by making predictions.
But there are things most of these tools canít do, and that is analyze data when itís structured as a graph or network and when that data must be analyzed by traversing the graph. For example, imagine a manager of a social networking website wants to know who the central members of the social network are. And let's define the term "central member" as a member who has the shortest paths to most of the other members. This problem canít be solved by simply summarizing data, nor does it have anything to do with predicting. Instead, the data must be organized as a graph and a tool must be able to traverse that graph; it has to be able "walk" from node to node. And today, this is not a feature found in most reporting and analytical tools.
Analyzing network structures or graph-based structures is the domain of graph analytics. This is a special form of analytics that has been around for a very long time. In fact, the history of graph analytics and the underlying graph theory goes back to the early 18th century. Today, powerful tools and database servers are available that support graph analytics. Examples of graph database servers are InfiniteGraph, AllegroGraph RDFStore, Neo4j, and vertexdb. Whatís special about them is that they allow fast online graph traversal even if the graphs consist of hundreds of millions of nodes.
Graph Analytics and Database Servers
The challenge of graph analytics on massive graphs is how to store and access data in such a way that graph traversal is fast. Currently, classic SQL database servers donít offer the right features for graph analytics. The way in which data is organized as tables and columns doesnít make them ideal for graph traversal. As an example, imagine that flight data is stored in a table with the following structure:
Also imagine that users want to know what the cheapest flights from Amsterdam to Phoenix are leaving on March 1, 2007, with a maximum of two stops, and each stop should be less than 4 hours. Using SQL, the query would look as follows:
Although this query returns the right result, itís not an easy query to write. And, more importantly, it will be hard for a SQL database server to process this query quickly because the query is hard to optimize and to parallelize. So especially when the graphs become large, these queries might take a long time to process and probably consume a lot of resources.
Graph database servers have been designed for this form of analytics. First of all, they would store this data more like a graph:
Secondly, their database languages and APIs are designed for traversing graphs which leads to straightforward and simple queries.
Application Areas of Graph Analytics
Some think that graph analytics is only relevant for a small number of organizations because they don't see examples of graph-based structures in their own organizations. This is not the case however. Graph-based data structures can be found in almost any industry, including the world of social networking sites, bio-engineering, drug development, fraud detection, and traffic optimization. Here some examples where data can be organized as graphs and where graph analytics might be useful:
- All the flights from and to airports can be organized as a graph. In this case, the airports are the objects and the flights the relationships between the objects. Such a graph can be created for all the flights of one airline, or for a set of airlines (such as Expediaís website).
- The network of all the members of a social networking site, such as LinkedIn and Facebook, plus all their relationships, can be arranged as a graph.
- The accounts of a bank with all the inter-account money transfers form a graph.
- In the context of a parcel service, all the parcel shipments between addresses worldwide can be organized as a graph.
- A visitorís journey on an organizationís website can also be seen as a graph, where webpages form the objects and the visitorís clicks become the relationships.
- In the context of a telecommunication company, all the call detail records between callers can be viewed as relationships between objects, and together they form an incredibly large graph.
Different Forms of Graph Analytics
Different forms of graph analytics exist. For example, a graph can be traversed to find an indirect link between two vertices, or the importance of a vertex in a graph can be determined. Here are some popular forms:
With single path analysis the goal is to find a path through a graph starting with a specific vertex. The path is determined in steps. First, all the edges plus corresponding vertices that can be reached by one hop are evaluated. From the vertices found, one is selected and the first hop is made. Next, the edges and vertices of this vertex found is determined, and this process continues. The result of such an exercise is a path consisting of a number of vertices and edges.
When shortest path analysis is used, the shortest path is found between two vertices. Shortest means the smallest number of hops. Evidently, the shortest path between two vertices consists of one hop.
Optimal path analysis can be used to find the "best" path between two vertices. The best path could be the fastest, the safest, or the cheapest. The best is based on the properties of the vertices and the edges.
With path existence analysis, you determine whether paths exist between two vertices. In other words, if we start with two vertices and their edges are followed, will the paths meet somewhere? An example of path existence analysis is the challenge called the Six Degrees of Kevin Bacon. If a graph is created in which all the movie stars are the vertices and the edges represent the movies in which they played together, it is claimed that anyone can be linked to Kevin Bacon within six hops.
The last one we mention here is vertex centrality analysis. Various measures of the centrality of a vertex within a graph have been defined in graph theory. The higher such a measure is, the more "important" the vertex is in the graph. The following measures have been defined:
- Degree centrality: This measure indicates how many edges a vertex has. The more edges, the higher the degree centrality. A vertex with high degree centrality is generally an active vertex or a hub.
- Closeness centrality: This measure is the inverse of the sum of all shortest paths to other vertices. In other words, it indicates for a vertex the smallest number of hops to make to reach all other vertices individually. A vertex with high closeness centrality has short paths to many vertices.
- Betweenness centrality: This measure indicates the number of shortest paths a vertex is on. It shows a vertexís position within a graph in terms of its ability to make connections to other groups in the graph.
- Eigenvector centrality: This measure indicates the importance of a vertex in a graph. Scores are assigned to vertices based on the principle that connections to high-scoring vertices contribute more to the score than equal connections to low-scoring vertices.
To summarize, graph analytics is a powerful new form of analytics that clearly enriches the set of reporting and analytical capabilities already available. Many organizations can benefit from this form of analytics. The tools are available, and, more importantly, the database servers that make it possible to analyze massive graphs online are also available. If you work in the world of business intelligence
and you haven't studied this topic yet, my recommendation would be to check out what it could mean for your organization.
Note: For more information on graph analytics, see InfiniteGraph: Extending Business, Social and Government Intelligence with Graph Analytics, a technical white paper.
Recent articles by Rick van der Lans