Building the next-generation big data analytics stack

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Franklin on the lasting legacy of AMPLab.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode I spoke with Michael Franklin, co-director of UC Berkeley’s AMPLab and chair of the Department of Computer Science at the University of Chicago. AMPLab is well known in the data community for having originated Apache Spark, Alluxio (formerly Tachyon), and many other open source tools. Today marks the start of a two-day symposium commemorating the end of AMPLab, and we took the opportunity to reflect on its impressive accomplishments.

AMPLab is the latest in a series of UC Berkeley research labs, each designed with clear goals, a multidisciplinary faculty, and a fixed timeline (for more details, see David Patterson’s interesting design document for research labs). Many of AMPLab’s principals were involved in its precursor, the RAD Lab. As Franklin describes in our podcast episode:

The insight that Dave Patterson and the other folks who founded the RAD Lab had was that modern systems were so complex that you needed serious machine learning—cutting-edge machine learning—to be able to do that [to basically allow the systems to manage themselves]. You couldn’t take a computer systems person, give them an intro to machine learning book, and hope to solve that problem. They actually built this team that included computer systems people sitting next to machine learning people. … Traditionally, these two groups had very little to do with each other. That was a five-year project. The way I like to say it is—they spent at least four of those years learning how to talk to each other.

Toward the end of the RAD Lab, we had probably the best group in the world of combined systems and machine learning people, who actually could speak to each other. In fact, Spark grew out of that relationship, because there were machine learning people in the RAD Lab who were trying to run iterative algorithms on Hadoop and were just getting terrible performance.

… AMPLab in some sense was a flip of that relationship. If you considered RAD Lab as basically a setting where “machine learning people were consulting for the systems people”, in AMPLab, we did the opposite—machine learning people got help from the systems people in how to make these things scale. That’s one part of the story.

In the rest of this post, I’ll describe some of my interactions with the AMPLab team. These recollections are based on early meetups, retreats, and conferences.

The speed gains were addictive

I first tried Spark around the version 0.4 and 0.5 releases. At the time, I was using Hive and Pig for data processing, while evaluating Mahout for machine learning. Although I was a bit resistant to learning a new programming language (Scala, which I later came to love), I immediately became a user and fan of Spark. The speed gains were addictive! AMPLab was also starting to roll out useful examples and libraries at a steady pace, and I soon found myself looking for reasons to use Spark on more tasks and projects.

Interacting with and getting feedback from developers at local meetups was important to the students and professors of AMPLab. Around mid-2012, there was a San Francisco meetup where the audience got to see a preview of Spark Streaming. I remember the reaction to the presentation very clearly. There was immediate interest and enthusiasm, and it was clear to me that Spark Streaming was going to be popular. At the time, many in the audience used Storm, and the prospect of a simplified infrastructure (due to Spark’s ability to handle both batch and streaming) was attractive to many in attendance. It was at this meetup that I first broached the idea of a Spark book to Matei Zaharia (the creator of Spark). That initial conversation led to the popular O’Reilly title, Learning Spark.

Discoveries at AMP Camp

In the fall of 2012, I was fortunate enough to get invited to the first AMP Camp, and while I was en route to that event, I wrote my first post on Spark (“Seven reasons why I like Spark”). AMP Camps combined talks with hands-on tutorials, and in the early days of Spark they became the de facto community gathering for users. A few things stood out for me in that first AMP Camp. First, the tutorials were cloud-friendly from the beginning: AMP Camp tutorials provided tools to help users play with Spark on AWS. Second, the unveiling of PySpark came at a time when most of the early users had JVM (Java, Scala, Clojure) backgrounds. This opened up Spark to the large number of data scientists who use Python as their primary language. This has worked out extremely well: the most recent user survey suggests Python and Scala have the same number of users in the Spark community. Finally, machine learning was featured prominently at that first AMP Camp. From the early days of Spark, many users, including myself, were drawn to its potential for machine learning tasks.

While Spark is the project AMPLab is always identified with, the lab has always been about building the next-generation big data analytics stack. As Franklin noted in our conversation, prior to the establishment of AMPLab, both he and his co-director Ion Stoica spent time on separate startups. Their experiences helped inform the initial design of what became known as the Berkeley Data Analytics Stack (BDAS).

I was fortunate enough to attend several AMPLab retreats where many BDAS components were first revealed. Following that first AMP Camp in 2012, I wrote about a few other projects that caught my attention:

  • Alluxio (formerly Tachyon) is a storage-backed, distributed, shared memory system that cuts across compute frameworks. In recent months, I’ve come across several companies, here and in Asia, that are starting to use Alluxio in production.
  • BlinkDB was a query engine built with the idea that in many situations approximate answers suffice. It inspired the introduction of approximate algorithms in later versions of Spark.
  • KeystoneML was about reproducible and interpretable end-to-end machine learning pipelines, with some notion of auto-tuning (systems optimizations) and error bounds. The early results were promising, but the project never quite caught on with the external community. This is one of my favorite projects out of AMPLab and it inspired ML Pipelines in Spark.
  • Succinct is a “compressed” data store that enables a wide range of point queries (search, count, range, random access) directly on a compressed representation of input data.
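
To make the "approximate answers suffice" idea behind BlinkDB concrete, here is a minimal sketch in plain Python. It is not BlinkDB's implementation: it swaps in plain uniform random sampling with a normal-approximation confidence interval, and the function name `approximate_mean`, the synthetic `latencies` data, and the sample size are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin). With z=1.96, estimate +/- margin is an
    approximate 95% confidence interval (normal approximation; the
    finite-population correction is ignored for simplicity).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sample without replacement
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Hypothetical data: 100,000 synthetic "latency" measurements.
data_rng = random.Random(42)
latencies = [data_rng.uniform(0, 100) for _ in range(100_000)]

exact = statistics.fmean(latencies)  # scans every row
estimate, margin = approximate_mean(latencies, sample_size=1_000)  # reads 1% of rows

print(f"exact mean:       {exact:.2f}")
print(f"approximate mean: {estimate:.2f} +/- {margin:.2f}")
```

The trade-off is the point: the approximate query touches a small fraction of the data and reports an error bar alongside its answer, which is often enough for exploratory analysis.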

What’s ahead

As I look to the future and to AMPLab’s successor (the RISE Lab), I’m thankful for having had a front row seat to the projects at AMPLab. The model of a university research lab listening to and working with industrial partners, while continuing to produce highly cited academic papers, is something that other institutions should emulate. Apache Spark has emerged as the most popular open source project in big data, adopted and promoted by companies across many countries and industries. Many of the other AMPLab projects have influenced other aspects of Spark or other open source projects. In the case of Alluxio, we may have yet another AMPLab project that emerges to be a popular project in its own right.

Full disclosure: Michael Franklin and I are advisors to both Databricks and Alluxio, companies created by current and former members of the AMPLab.
