The popular graph analytics framework extends its coverage of the data science workflow
[A version of this post appears on the O’Reilly Data blog and Forbes.]
GraphLab’s SFrame, an interesting and somewhat under-the-radar tool, was unveiled (1) at Strata Santa Clara. It is a disk-based, flat table representation that extends GraphLab to tabular data. With the addition of SFrame, users can apply GraphLab’s many algorithms to data stored as either graphs or tables. More importantly, SFrame increases GraphLab’s coverage of the data science workflow: it allows users with terabyte-sized datasets to clean their data and create new features directly within GraphLab (SFrame performance can scale linearly with the number of available cores).
The beta version of SFrame can read data from local disk, HDFS, S3, or a URL, and save to a human-readable .csv or a more efficient native format. Once an SFrame is created and saved to disk, no reprocessing of the data is needed. Below is Python code that illustrates how to read a .csv file into an SFrame, create a new data feature, and save the result to S3:
[Source: GraphLab notebook from GraphLab, Inc.]
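The notebook code itself is not reproduced in this text. As a rough stand-in, here is a minimal sketch of the same three steps (read a .csv, derive a new feature, save the result) using only Python’s standard library. It is not the SFrame API: the function name `add_feature`, the file names, the toy ratings data, and the `liked` column are all hypothetical, and the output goes to a local temporary directory rather than S3. What it does share with the disk-based idea is that rows are streamed one at a time, so the table never has to fit in memory.

```python
import csv
import os
import tempfile


def add_feature(src_path, dst_path, new_column, feature_fn):
    """Stream src_path row by row, append one derived column, write dst_path.

    Only one row is held in memory at a time, so the input may be larger
    than RAM. Returns the number of data rows written.
    (Hypothetical helper for illustration; not part of GraphLab.)
    """
    rows_written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + [new_column])
        writer.writeheader()
        for row in reader:
            row[new_column] = feature_fn(row)
            writer.writerow(row)
            rows_written += 1
    return rows_written


# Hypothetical toy data standing in for a much larger ratings table.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "ratings.csv")
dst_path = os.path.join(workdir, "ratings_with_feature.csv")

with open(src_path, "w", newline="") as f:
    f.write("user,item,rating\nu1,i1,5\nu1,i2,2\nu2,i1,4\n")

# New feature: 1 if the rating is 4 or higher, otherwise 0.
n_rows = add_feature(src_path, dst_path, "liked", lambda r: int(int(r["rating"]) >= 4))

# Read the saved file back to confirm the new column is present.
with open(dst_path, newline="") as f:
    result = list(csv.DictReader(f))
```

Because the derived table is written to disk once, later steps can read `ratings_with_feature.csv` directly instead of recomputing the feature, which is the same "no reprocessing" benefit described above.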
SFrame is part of GraphLab Create (2), a Python package due out in March (3) that simplifies the creation of scalable analytic products (e.g., recommenders and graph analytics). With GraphLab Create, users will be able to build and maintain analytic pipelines from within Python or IPython (4) (“GraphLab Notebook”), and deploy them on single servers or across clusters, either locally or in the cloud.
In the past, GraphLab was regarded as scalable and fast, but hard to use and limited in scope. Over the past several months, the startup GraphLab, Inc. has tackled both problems head-on, and the resulting tools should greatly increase GraphLab’s appeal among data scientists (5). Integration with IPython opens up GraphLab’s fast and scalable analytics modules to the PyData community (“Build an end-to-end recommender in six lines of Python”). SFrame and GraphLab Create expand GraphLab’s coverage of the data science workflow to include data wrangling and ingestion.
Since SFrames are similar to Pandas (PyData) and R dataframes, data scientists can become productive with them very quickly. What intrigued several Strata attendees I spoke with was SFrame’s ability to scale to large datasets: it lets users wrangle very large tabular datasets without being limited by available memory (reminiscent of the efficiency of the GraphChi project). If you’re interested in trying out SFrames and GraphLab’s many graph analytic tools and machine learning algorithms, sign up for the GraphLab Create beta.
- GraphLab Notebooks from Carlos Guestrin’s recent Strata Santa Clara tutorial
- Improving options for unlocking your graph data
- GraphChi: Graph analytics over billions of edges using your laptop
(1) Alice Zheng mentioned SFrame in her talk, and Carlos Guestrin showed it in action in his tutorial.
(2) GraphLab Create is currently not open source.
(3) This will be a beta release; expect more tools and features in the near future. The founders of GraphLab, Inc. told me they are building tools that will “… extend the value of GraphLab across the entire data science pipeline”.
(4) Several GraphLab Notebooks (IPython notebooks) on GraphLab.com demonstrate how to use GraphLab Create to build end-to-end recommender systems. The members of GraphLab, Inc. have found that “… users prefer to use their IDE for production-ready development and the IPython notebook for communicating their methodology and approach.”
(5) As I noted in a recent post, Python is very popular among data scientists.