[A version of this post appears on the O’Reilly Radar blog.]
Editor’s note: Carlos Guestrin will be part of the team teaching Large-scale Machine Learning Day at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.
I only really started playing around with GraphLab when the companion project GraphChi came onto the scene. By then I’d heard from many avid users and admired how their user conference instantly became a popular San Francisco Bay Area data science event. For this podcast episode, I sat down with Carlos Guestrin, co-founder/CEO of Dato, a start-up launched by the creators of GraphLab. We talked about the early days of GraphLab, the evolution of GraphLab Create, and what he’s learned from starting a company.
MATLAB for graphs
Guestrin remains a professor of computer science at the University of Washington, and GraphLab originated when he was still a faculty member at Carnegie Mellon. GraphLab was built by avid MATLAB users who needed to do large scale graphical computations to demonstrate their research results. Guestrin shared some of the backstory:
“I was a professor at Carnegie Mellon for about eight years before I moved to Seattle. A couple of my students, Joey Gonzalez and Yucheng Low, were working on large-scale distributed machine learning algorithms, especially with things called graphical models. We tried to implement them to show off the theorems that we had proven. We tried to run those things on top of Hadoop and it was really slow. We ended up writing those algorithms on top of MPI, which is a high-performance computing library, and it was just a pain. It took a long time and it was hard to reproduce the results, and the impact it had on us is that writing papers became a pain. We wanted a system for my lab that allowed us to write more papers more quickly. That was the goal. In other words, so they could implement these machine learning algorithms more easily, more quickly, specifically on graph data, which is what we focused on.”
The original killer app = recommenders
Many of the machine learning projects and start-ups I interact with find initial traction in automatic recommender systems. GraphLab is no exception. (In fact, I first heard about GraphLab from users of its Collaborative Filtering library.) Recommenders are an easy entry point because product recommendations are so common on web sites and they are conceptually easy to explain to non-experts. Guestrin explained how a recommender library that started as an afterthought became a project in its own right:
“We put out this software in the open source community, and it was not something that we decided to do with a lot of ambition. We just put it out there. My postdoc at the time, Danny Bickson, came and said, ‘I’m going to write a recommender library on top of the system.’ We didn’t work on recommender systems in my lab, so it wasn’t really a high priority for me as a professor, but he really wanted to do it. He started implementing something called matrix factorization on top of it. The performance was incredibly good … He also put his recommender library out in the open source. Somehow, we started getting emails from folks saying, ‘we tried that and this doesn’t work, or this was fast, or we want these other things.’ We started getting feature requests for something that was an afterthought for us. … Danny started engaging with that community and getting lots of positive feedback, and what started as an afterthought — let’s put something on the open source — became a project of its own with growing adoption and really nice feedback from folks like Pandora and others.”
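For readers unfamiliar with the matrix factorization Guestrin mentions, here is a minimal sketch of the idea in plain Python. It is not GraphLab's implementation; the toy ratings, the function names (`factorize`, `predict`, `rmse`), and the hyperparameters are all assumptions chosen for illustration. Each user and each item gets a short vector of latent factors, and stochastic gradient descent nudges those vectors so that their dot product approximates the observed ratings.

```python
import math
import random

# Hypothetical toy data: (user_id, item_id, rating) triples.
RATINGS = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 1.0), (2, 2, 5.0),
    (3, 0, 1.0), (3, 2, 4.0),
]


def predict(U, V, user, item):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(a * b for a, b in zip(U[user], V[item]))


def rmse(ratings, U, V):
    """Root mean squared error over the observed ratings."""
    squared = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(squared / len(ratings))


def factorize(ratings, n_users, n_items, rank=2, lr=0.01, reg=0.02,
              epochs=2000, seed=0):
    """Fit user factors U and item factors V with plain SGD.

    All defaults are illustrative assumptions, not tuned values.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on squared error with L2 regularization.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V


if __name__ == "__main__":
    U, V = factorize(RATINGS, n_users=4, n_items=3)
    print("training RMSE:", rmse(RATINGS, U, V))
    # Score an unobserved pair: how might user 0 rate item 2?
    print("user 0, item 2:", predict(U, V, 0, 2))
```

Once the factors are fit, recommending items for a user amounts to scoring the items that user has not rated and returning the highest-scoring ones. Production systems typically use more scalable solvers (alternating least squares, for example) and hold out data to measure error, both of which this sketch omits.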
Beyond graphs: analytics for tabular data, advanced algorithms, and model deployment
GraphLab’s tabular data structure (SFrames) was unveiled at last February’s Strata Conference in California. It definitely caught the attention of many attendees I spoke with, particularly PyData users. With an API similar to Pandas, and a growing library of algorithms that are as easy to use as scikit-learn, more Python users will start gravitating toward GraphLab Create. Among other things, it scales to much larger data sets (even on a single machine), it is much faster than comparable PyData tools (the Python API calls a C++ backend), its library of algorithms is expanding, and tools for model management are on the way. (It’s a great time to be a Python data enthusiast, as there are other emerging frameworks, like Apache Spark, that are targeting PyData users.) Guestrin noted:
“I started talking to some customers and they said, ‘yeah, we have graph data from social networks, but we also have this data with user profile information,’ which turns out to be tables. We have these images of pictures people take and we have this text information from product reviews and then I realized … Let’s design something that is highly scalable both for tabular and graph data and text and images.
… “For example, boosted decision trees is a well-known model for machine learning that can do well with data that requires non-linear features. We incorporated a very efficient implementation of boosted decision trees. … Similarly, deep learning has been getting really amazing performance, especially on things like image data and audio data, so we wanted to incorporate the library where you could do things like deep learning networks easily.
… “One of the things that’s been a big focus for us is the deployment piece of machine learning. If you think about machine learning, there is the training, there’s the data exploration, the data engineering, the training of the models, the intelligence, but eventually, your goal should be to deploy your solution as a system that runs on tons and tons and tons of data, maybe even on a cluster, or deploy that as a service that can be created in real time.”
Many of the tools that PyData users have come to depend upon are open source. I asked Guestrin which components of GraphLab Create will be open source. The answer will be revealed at Strata + Hadoop World next month, but I think it’s safe to guess that the components for data transformation (SFrames, SGraphs) and many basic machine learning algorithms will be open source. Guestrin stressed their commitment to open source:
“We benefitted from the open source community giving us feedback, contributing to our code, and we continue to be committed to that community. We’re inspired by companies like MongoDB and Elasticsearch that have an open source core and add-on tools. That’s how I view the company. However, when we started the company, we wrote GraphLab Create from scratch. It wasn’t a next version of GraphLab or PowerGraph or GraphChi. … We wanted to make sure the code was in good shape before we put it out there for people to contribute and participate in that way. We’ve only made GraphLab Create available as a free binary thus far. … [At Large-scale Machine Learning Day at Strata + Hadoop World] you’ll also be able to use the open source version of the core components of GraphLab Create.”