Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

[This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau, one of the lead engineers in Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].
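To make "out-of-order processing" concrete, here is a minimal Python sketch. It is an illustration only, not MillWheel's implementation: the function names, the 60-second window size, and the sample events are all assumptions made for the example. The idea it shows is that records are grouped by the time the event occurred (event time) rather than by the order in which they arrive, so late or shuffled arrivals still land in the correct window.

```python
from collections import defaultdict


def window_start(event_time, size):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def count_by_event_time(events, size):
    """Count events per fixed event-time window, regardless of arrival order."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        counts[window_start(event_time, size)] += 1
    return dict(counts)


# Hypothetical (event_time_seconds, payload) pairs, listed in ARRIVAL order.
# The event stamped 59 arrives after the event stamped 61.
arrivals = [(12, "a"), (3, "b"), (61, "c"), (59, "d"), (7, "e")]

counts = count_by_event_time(arrivals, size=60)
print(counts)
```

Because grouping depends only on each record's own timestamp, sorting the input first gives the same counts. A real streaming engine also has to decide when a window is complete enough to emit, which this sketch leaves out.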

The Dataflow model

There are two projects that we say Dataflow came out of. The FlumeJava project, which, for anybody who is not familiar, is a higher level language for describing large-scale, massive-scale data processing systems and then running it through an optimizer and coming up with an execution plan. … We had all sorts of use cases at Google where people were stringing together these series of MapReduce [jobs]. It was complex and difficult to deal with, and you had to try to manually optimize them for performance. If you do what the database folks have done, [you] run it through an optimizer. … Flume is the primary data processing system, so as part of that for the last few years, we’ve been moving MillWheel to be essentially a secondary execution engine for FlumeJava. You can either do it on batch mode and run on MapReduce or you can execute it on MillWheel. … FlumeJava plus MillWheel — it’s this evolution that’s happened internally, and now we’ve externalized it.

Balancing correctness, latency, and cost

There’s a wide variety of use cases out there. Sometimes you need high correctness; sometimes you don’t; sometimes you need low latency; sometimes higher latency is okay. Sometimes you’re willing to pay a lot for those other two features; sometimes you don’t want to pay as much. The real key, at least as far as having a system that is broadly applicable, is being able to be flexible and give people the choices to make the trade-offs they have to make. … There is a single knob which is, which runner am I going to use: batch or streaming? Aside from that, the other level at which you get to make these choices is when you’re deciding exactly when you materialize your results within the pipeline. … Once you have a streaming system or streaming execution engine that gives you this automatic-scaling, like Dataflow does, and it gives you consistency and strong tools for working with your data, then people start to build these really complicated services on them. It may not just be data processing. It actually becomes a nice platform for orchestrating events or orchestrating distributed state machines and things like that. We have a lot of users internally doing this stuff.

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS
