[A version of this post appears on the O’Reilly Strata blog.]
The popular open source project GraphLab received a major boost early this week when a new company comprised of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.
While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data, is at the upcoming GraphLab Workshop (on July 1st in SF).
Data wrangling: creating graphs
Before you can take advantage of the other tools mentioned in this post, you’ll need to turn your data (e.g., web pages) into graphs. GraphBuilder is an open source project from Intel, that uses Hadoop MapReduce1 to build graphs out of large data sets. Another option is the combination of GraphX/Spark described below. (A startup called Trifacta is building a general-purpose, data wrangling tool, that could help as well. )
Data management and search
Once you have a graph, there are many options for how to store it. The choice of database largely depends on amount of data (# of nodes, edges, along with the size of data associated with them), the types of tasks (pattern-matching and search, analytics), and workload. In the course of evaluating alternatives to MySQL (for storing social graph data), Facebook’s engineering team developed and released Linkbench – a data set that can be used to study how graph databases handle production workloads. Assembled by a team of researchers from academia, industry, and national labs, the Graph 500 benchmark targets concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set) tasks.
Most graph databases (such as Neo4j2, AllegroGraph, Yarcdata, and InfiniteGraph) come with tools for facilitating and speeding up search – Neo4j comes with a simple query language (Cipher) for search, other graph databases support SPARQL. The Titan distributed graph database supports different storage engines (including HBase and Cassandra) and comes with tools for search and traversal (based on Lucene and Gremlin). Used by Twitter to store graph data, FlockDB targets operations involving adjacency lists.
Update (6/11/2013): Apache Accumulo is an open source key/value store originally developed by the NSA. In a recent talk at CMU’s Parallel Data Lab, NSA researchers gave an overview of how they’ve experimented with using Accumulo to store and traverse graphs with trillions of vertices and edges. A startup called sqrrl, is developing a commercial version of Accumulo.
Graph-parallel frameworks: Pregel, PowerGraph, and GraphX
BSP is a parallel computing model that has inspired many graph analytics tools. Just like Hadoop’s map and reduce, Pregel4, Giraph and Pregelix, come with primitives that let neighboring nodes send/receive messages to one another, or change the state of a node (based on the state of its neighboring nodes). Efficient graph algorithms are a sequence of iterations built from such primitives. GraphLab uses similar primitives (called PowerGraph) but allows for asynchronous iterative computations, leading to an expanded set of (potentially) faster algorithms.
GraphX is a new, fault-tolerant, framework that runs within Spark. Its core data structure is an immutable graph5 (Resilient Distributed Graph – or RDG), and GraphX programs are a sequence of transformations on RDG’s (with each transformation yielding a new RDG). Transformations on RDG’s can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower6 than those written in GraphLab/PowerGraph.
Machine-learning and analytics
Machine-learning tools that target graph data lead to familiar applications such as detecting influential users (PageRank) and communities, fraud detection, and recommendations (collaborative filtering is popular among GraphLab users). Moreover techniques developed in one domain are often reused in other settings. Besides GraphLab, distributed analytics have been implemented in Giraph, GraphX, Faunus, and Grappa. In addition, graph databases like Neo4j and Yarcdata come with some analytic capabilities. As I noted in a recent post, open source, single-node systems like Twitter’s Cassovary7 are being used for computations involving massive8 graphs.
When you’re dealing with large graphs, being able to zoom in/out helps with clutter, but so do clever layout algorithms. Popular tools for visualizing nodes and edges include Gephi and GraphViz. Users who want to customize their graphs turn to packages like d3.
(1) I would love to see a version of GraphBuilder that’s built on top of Spark.
(2) Many of these systems are quite efficient. For example a single instance of Neo4j can handle very large graphs (“into the tens of billions of nodes/ relationships/ properties”).
(3) Note that using standard Hadoop for graph processing may not be the most efficient option. This talk by Hadapt co-founder Daniel Abadi describes an advanced approach to graph analysis using Hadoop.
(4) Related frameworks include GoldenOrb and Hama.
(5) Resilient Distributed Graphs (RDG) extend Spark’s Resilient Distributed Dataset (RDD).
(6) As the developers of GraphX note: “We emphasize that it is not our intention to beat PowerGraph in performance. … We believe that the loss in performance may, in many cases, be ameliorated by the gains in productivity achieved by the GraphX system. … It is our belief that we can shorten the gap in the near future, while providing a highly usable interactive system for graph data mining and computation”
(7) On the plus side, being single-node means Cassovary doesn’t have to deal with finding the optimal way to partition a graph. On the other hand, it is limited to graphs that fit in the memory of a server – a limitation it alleviates through the use of efficient data structures.
(8) For streaming graphs, see STINGER, a collection of algorithms and a data structure for storing sparse dynamic graphs.