Building the next-generation big data analytics stack

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Franklin on the lasting legacy of AMPLab.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode I spoke with Michael Franklin, co-director of UC Berkeley’s AMPLab and chair of the Department of Computer Science at the University of Chicago. AMPLab is well known in the data community for having originated Apache Spark, Alluxio (formerly Tachyon), and many other open source tools. Today marks the start of a two-day symposium commemorating the end of AMPLab, and we took the opportunity to reflect on its impressive accomplishments.

AMPLab is the latest in a series of UC Berkeley research labs each designed with clear goals, a multidisciplinary faculty, and a fixed timeline (for more details, see David Patterson’s interesting design document for research labs). Many of AMPLab’s principals were involved in its precursor, the RAD Lab. As Franklin describes in our podcast episode:

The insight that Dave Patterson and the other folks who founded the RAD Lab had was that modern systems were so complex that you needed serious machine learning—cutting-edge machine learning—to be able to do that [to basically allow the systems to manage themselves]. You couldn’t take a computer systems person, give them an intro to machine learning book, and hope to solve that problem. They actually built this team that included computer systems people sitting next to machine learning people. … Traditionally, these two groups had very little to do with each other. That was a five-year project. The way I like to say it is—they spent at least four of those years learning how to talk to each other.

Toward the end of the RAD Lab, we had probably the best group in the world of combined systems and machine learning people, who actually could speak to each other. In fact, Spark grew out of that relationship, because there were machine learning people in the RAD Lab who were trying to run iterative algorithms on Hadoop and were just getting terrible performance.

… AMPLab in some sense was a flip of that relationship. If you considered RAD Lab as basically a setting where “machine learning people were consulting for the systems people”, in AMPLab, we did the opposite—machine learning people got help from the systems people in how to make these things scale. That’s one part of the story.

In the rest of this post, I’ll describe some of my interactions with the AMPLab team. These recollections are based on early meetups, retreats, and conferences.

Visual tools for overcoming information overload

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Dafna Shahaf on information cartography and AI, and Sam Wang on probabilistic methods for forecasting political elections.

In this special two-segment episode of the Data Show, I spoke with Dafna Shahaf, assistant professor at the School of Computer Science and Engineering at the Hebrew University of Jerusalem. Her research focuses on tools and techniques for overcoming information overload, an area of increasing importance in an attention economy. With the U.S. presidential election right around the corner, I also included a conversation between Jenn Webb, host of the O’Reilly Radar Podcast, and Sam Wang, co-founder of the Princeton Election Consortium and professor of neuroscience and molecular biology at Princeton University.

Below are highlights from my conversation with Dafna Shahaf: