The evolution of GraphLab

[A version of this post appears on the O’Reilly Radar blog.]

Editor’s note: Carlos Guestrin will be part of the team teaching Large-scale Machine Learning Day at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.

I only really started playing around with GraphLab when the companion project GraphChi came onto the scene. By then I’d heard from many avid users and admired how their user conference instantly became a popular San Francisco Bay Area data science event. For this podcast episode, I sat down with Carlos Guestrin, co-founder/CEO of Dato, a start-up launched by the creators of GraphLab. We talked about the early days of GraphLab, the evolution of GraphLab Create, and what he’s learned from starting a company.

MATLAB for graphs

Guestrin remains a professor of computer science at the University of Washington, and GraphLab originated when he was still a faculty member at Carnegie Mellon. GraphLab was built by avid MATLAB users who needed to do large scale graphical computations to demonstrate their research results. Guestrin shared some of the backstory:

“I was a professor at Carnegie Mellon for about eight years before I moved to Seattle. A couple of my students, Joey Gonzalez and Yucheng Low, were working on large-scale distributed machine learning algorithms, especially with things called graphical models. We tried to implement them to show off the theorems that we had proven. We tried to run those things on top of Hadoop and it was really slow. We ended up writing those algorithms on top of MPI, which is a high-performance computing library, and it was just a pain. It took a long time, it was hard to reproduce the results, and the impact it had on us was that writing papers became a pain. We wanted a system for my lab that allowed us to write more papers more quickly. That was the goal. In other words, so they could implement these machine learning algorithms more easily and more quickly, specifically on graph data, which is what we focused on.”

Continue reading

Building and deploying large-scale machine learning pipelines

[A version of this post appears on the O’Reilly Radar blog.]

There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization problem then you’re almost done.
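As a concrete illustration of that framing, logistic regression really is just a small optimization problem: minimize the average log loss over the training data. Here is a minimal NumPy sketch using plain gradient descent (the toy data and tuning values are illustrative, not from any of the systems discussed here):

```python
import numpy as np

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Fit logistic regression by minimizing the average log loss
    with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (p - y) / len(y)        # gradient of the average log loss
        w -= lr * grad
    return w

# Toy data: the label is 1 exactly when the first feature is positive
X = np.array([[1.5, 0.2], [2.0, -1.0], [-1.0, 0.5], [-2.0, 1.0]])
y = np.array([1, 1, 0, 0])
w = logistic_regression_gd(X, y)
```

Once the problem is posed this way, swapping in a scalable solver (stochastic gradient descent, L-BFGS, a distributed implementation) is largely an engineering exercise, which is the point the experts are making.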

Of course, in practice, most machine learning projects can’t be reduced to simple optimization problems. Data scientists have to manage and maintain complex data projects, and the analytic problems they need to tackle usually involve specialized machine learning pipelines. Decisions at one stage affect things that happen downstream, so interactions between parts of a pipeline are an area of active research.


Some common machine learning pipelines. Source: Ben Recht, used with permission.
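The AMPLab projects target Spark, but the core idea of a pipeline, chaining stages so that decisions made upstream shape what downstream stages see, can be sketched with scikit-learn’s `Pipeline` API (the data and choice of stages here are illustrative, not the AMPLab projects themselves):

```python
# A minimal pipeline sketch: the scaler fitted in the first stage
# determines exactly what features the downstream classifier sees.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # upstream: feature scaling
    ("clf", LogisticRegression()),    # downstream: model sees scaled features
])
pipe.fit(X, y)
accuracy = pipe.score(X, y)
```

Because the stages are coupled, changing the scaler (or dropping it) changes the classifier’s behavior, which is exactly why interactions between pipeline stages are an active research area.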

In his Strata+Hadoop World New York presentation, UC Berkeley Professor Ben Recht described new UC Berkeley AMPLab projects for building and managing large-scale machine learning pipelines. Given AMPLab’s ties to the Spark community, some of the ideas from their projects are starting to appear in Apache Spark. Continue reading

A brief look at data science’s past and future

[A version of this post appears on the O’Reilly Radar blog.]

Back in 2008, when we were working on what became one of the first papers on big data technologies, one of our first visits was to LinkedIn’s new “data” team. Many of the members of that team went on to build interesting tools and products, and team manager DJ Patil emerged as one of the best-known data scientists. I recently sat down with Patil to talk about his new ebook (written with Hilary Mason) and other topics in data science and big data.

Subscribe to the O’Reilly Data Show Podcast

iTunes, SoundCloud, RSS

Here are a few of the topics we touched on:

Proliferation of programs for training and certifying data scientists

Patil and I are both ex-academics who learned “data science” in industry. In fact, up until a few years ago, one acquired data science skills via “on-the-job training.” But a new job title that catches on usually leads to an explosion of programs (I was around when master’s programs in financial engineering took off). Are these programs the right way to acquire the necessary skills? Continue reading

“Humans-in-the-loop” machine learning systems

Next week I’ll be hosting a webcast featuring Adam Marcus, one of the foremost experts on the topic of “humans-in-the-loop” machine learning systems. It’s a subject many data scientists have heard about, but very few have had the experience of building production systems that leverage humans:

Crowdsourcing marketplaces like Elance-oDesk or CrowdFlower give us access to people all over the world who can take on all sorts of tasks: acting as virtual personal assistants, labeling images, or cleaning up gnarly datasets. Humans can solve tasks that artificial intelligence is not yet able to solve, or needs help in solving, without having to resort to complex machine learning or statistics. But humans are quirky: give them bad instructions, allow them to get bored, or make them do too repetitive a task, and they will start making mistakes. In this webcast, I’ll explain how to effectively benefit from crowd workers to solve your most challenging tasks, using examples from the wild and from our work at GoDaddy.

Machine learning and crowdsourcing are at the core of most of the problems we solve on the Locu team at GoDaddy. When possible, we automate tasks with the help of trained regressions and classifiers. However, it’s not always possible to build machine-only decision-making tools, and we often need to marry machines and crowds. During the webcast, I will highlight how we build human-machine hybrids and benefit from active learning workflows. I’ll also discuss lessons from 17 conversations with companies that make heavy use of crowd work, which Aditya Parameswaran and I have collected for our upcoming book.
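The specifics of Locu’s active learning workflows aren’t spelled out here, but a common pattern behind such human-machine hybrids is uncertainty sampling: route the examples the model is least sure about to human workers. A minimal sketch (the model scores and the helper name are illustrative assumptions, not Locu’s system):

```python
import numpy as np

def most_uncertain(probs, k=2):
    """Pick the k unlabeled examples whose predicted probability is
    closest to 0.5 -- the ones the model is least sure about -- so
    they can be sent to human labelers."""
    uncertainty = np.abs(probs - 0.5)
    return np.argsort(uncertainty)[:k]

# Hypothetical model scores for 6 unlabeled items
probs = np.array([0.95, 0.52, 0.10, 0.48, 0.80, 0.30])
to_label = most_uncertain(probs, k=2)   # indices routed to the crowd
```

The labels the crowd returns then go back into the training set, so each round of human effort is spent where the model benefits most.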

A recent article in the NYTimes Magazine mentioned a machine-learning system built by some neuroscience researchers that is an excellent example of having “humans-in-the-loop”:

In 2012, Seung started EyeWire, an online game that challenges the public to trace neuronal wiring — now using computers, not pens — in the retina of a mouse’s eye. Seung’s artificial-­intelligence algorithms process the raw images, then players earn points as they mark, paint-by-numbers style, the branches of a neuron through a three-dimensional cube. The game has attracted 165,000 players in 164 countries. In effect, Seung is employing artificial intelligence as a force multiplier for a global, all-volunteer army that has included Lorinda, a Missouri grandmother who also paints watercolors, and Iliyan (a.k.a. @crazyman4865), a high-school student in Bulgaria who once played for nearly 24 hours straight. Computers do what they can and then leave the rest to what remains the most potent pattern-recognition technology ever discovered: the human brain.

For more on this important topic, join me and Adam on January 22nd!

Lessons from next-generation data wrangling tools

[A version of this post appears on the O’Reilly Radar blog.]

One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from both consumer apps as well as tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is in data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.

At Strata + Hadoop World in New York, two presentations from academic spinoff start-ups — Mike Stonebraker of Tamr, and Joe Hellerstein and Sean Kandel of Trifacta — focused on data preparation and curation. While data wrangling is just one component of a data science pipeline, and we’re still in the early days of productivity tools in data science, some of the lessons these companies have learned extend beyond data preparation.

Scalability ~ data variety and size

Not only are enterprises faced with many data stores and spreadsheets; data scientists also have many more (public and internal) data sources they want to incorporate. In the absence of a global data model, integrating data silos and data sources requires tools for consolidating schemas.

Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.
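Trifacta’s internals aren’t described here, but the sample-first workflow can be illustrated in pandas: develop the wrangling steps interactively against a small random sample, then run the same “script” on the full data set. (The column names and the `wrangle` function below are hypothetical.)

```python
import pandas as pd

def wrangle(df):
    """The 'script' developed interactively against the sample."""
    out = df.copy()
    out["city"] = out["city"].str.strip().str.title()  # normalize messy strings
    return out[out["visits"] > 0]                      # drop empty records

full = pd.DataFrame({
    "city": [" austin", "SEATTLE ", "boston", " denver "],
    "visits": [10, 0, 3, 7],
})
sample = full.sample(n=2, random_state=42)  # iterate quickly on a small sample
clean = wrangle(full)                       # then apply the script to everything
```

Because the wrangling logic is captured as a reusable function rather than ad hoc edits, the transition from sample to full data set is just a change of input.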
Continue reading

Spark 1.2 and Beyond

Next week I’ll be hosting a webcast with Spark’s release manager – and Databricks co-founder – Patrick Wendell. (Full disclosure: I’m an advisor to Databricks.)

In this webcast, Patrick Wendell from Databricks will be speaking about Spark’s new 1.2 release. Spark 1.2 brings performance and usability improvements in Spark’s core engine, a major new API for MLlib, expanded ML support in Python, and a fully high-availability (H/A) mode in Spark Streaming, along with several other features. The talk will outline what’s new in Spark 1.2 and leave plenty of time for Q&A about the release or about Spark in general.

We’ll be debuting a new Spark track at the forthcoming Strata + Hadoop World in San Jose. We’re also offering a three-day Advanced Spark training course, a one-day hands-on tutorial (Spark Camp), and the Databricks/O’Reilly Spark developer certification exam.