Moving from Batch to Continuous Computing at Yahoo!

[A version of this post appeared on the O’Reilly Strata blog.]

My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting factoids about the scale of their big data systems. Notably, many of their production systems now run on MapReduce 2.0 (MRv2) or YARN – a resource manager that lets multiple frameworks share the same cluster.

Yahoo! was the first company to embrace Hadoop in a big way, and it remains a trendsetter within the Hadoop ecosystem. In the early days the company used Hadoop for large-scale batch processing (the key example being the computation of its web index for search). More recently, many of its big data models require low-latency alternatives to Hadoop MapReduce. In particular, Yahoo! leverages user and event data to power its targeting, personalization, and other “real-time” analytic systems. Continuous Computing is a term Yahoo! uses to refer to systems that perform computations over small batches of data (over short time windows), in between traditional batch computations that still use Hadoop MapReduce. The goal is to move quickly from raw data, to information, to knowledge.
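To make "computations over small batches of data (over short time windows)" a bit more concrete, here is a minimal sketch of a tumbling-window count in plain Python. This is only my illustration of the general idea, not Yahoo!'s implementation (their stream processing runs on Storm). The function name, the event tuples, and the five-second window are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count event types inside fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_in_seconds, event_type) pairs.
    Both the function and the event layout are hypothetical; they only
    illustrate computing over small batches instead of one large batch.
    """
    windows = defaultdict(Counter)
    for timestamp, event_type in events:
        # Map each event to the start of the window that contains it.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][event_type] += 1
    # Return plain dicts, ordered by window start time.
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Made-up sample events: (seconds since start, event type).
events = [
    (0.5, "click"),
    (1.2, "impression"),
    (4.9, "click"),
    (5.1, "click"),
    (9.8, "impression"),
    (10.0, "click"),
]

result = tumbling_window_counts(events, window_seconds=5)
for window_start, counts in result.items():
    print(window_start, counts)
```

Each window yields a small result as soon as it closes, rather than waiting for a nightly batch job. A real streaming system would also have to handle late and out-of-order events, which this sketch ignores.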

On a side note: many organizations are beginning to use cluster managers that let multiple frameworks share the same cluster. In particular I’m seeing many companies – notably Twitter – use Mesos¹ (instead of YARN) to run similar services (Storm, Spark, Hadoop MapReduce, HBase) on the same cluster.

Going back to Bruno’s presentation, here are some interesting bits – current big data systems at Yahoo! by the numbers:

100 billion events (clicks, impressions, email content & meta-data, etc.) are collected daily, across all of the company’s systems

A subset of collected events gets passed to a stream processing engine over a Hadoop/YARN cluster: 133K events/second are processed, using Storm-on-YARN across 320 nodes. This involves roughly 500 processors and 12,000 threads.

Iterative computations are performed with Spark-on-YARN, across 40 nodes

Sparse data store: 2 PB of data stored in HBase, across 1,900 nodes. I believe this is one of the largest HBase deployments in production.

365 PB of available raw storage on HDFS, spread across 30,000 nodes (about 150 PB is currently utilized).

About 400,000 jobs/day run on YARN, corresponding to about 10M hours of compute time per day.
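A few back-of-the-envelope ratios help put these numbers in context. The short script below derives them from the figures above. The derived values are my own rough arithmetic, not numbers from the keynote, and they rest on assumptions: that the 133K events/second rate is sustained over a full day, that 1 PB = 1,000 TB, and that "hours of compute time" is an aggregate across all jobs.

```python
# Figures quoted in the list above.
EVENTS_COLLECTED_PER_DAY = 100_000_000_000  # 100 billion events/day
STREAM_EVENTS_PER_SECOND = 133_000          # Storm-on-YARN throughput
STORM_NODES = 320
HDFS_RAW_PB = 365
HDFS_USED_PB = 150
HDFS_NODES = 30_000
YARN_JOBS_PER_DAY = 400_000
YARN_COMPUTE_HOURS_PER_DAY = 10_000_000

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: the streaming rate holds for a whole day.
stream_events_per_day = STREAM_EVENTS_PER_SECOND * SECONDS_PER_DAY
stream_share = stream_events_per_day / EVENTS_COLLECTED_PER_DAY
events_per_node_per_second = STREAM_EVENTS_PER_SECOND / STORM_NODES

hdfs_utilization = HDFS_USED_PB / HDFS_RAW_PB
# Assumption: 1 PB = 1,000 TB.
raw_tb_per_node = HDFS_RAW_PB * 1000 / HDFS_NODES

# Assumption: compute hours are summed over all jobs.
avg_compute_hours_per_job = YARN_COMPUTE_HOURS_PER_DAY / YARN_JOBS_PER_DAY

print(f"Streamed events/day: {stream_events_per_day:,}")
print(f"Share of collected events streamed: {stream_share:.1%}")
print(f"Events/second per Storm node: {events_per_node_per_second:.0f}")
print(f"HDFS utilization: {hdfs_utilization:.0%}")
print(f"Raw HDFS storage per node: {raw_tb_per_node:.1f} TB")
print(f"Average compute hours per job: {avg_compute_hours_per_job:.0f}")
```

Under those assumptions, a bit over a tenth of the collected events flow through the streaming path, HDFS is roughly 41% utilized, and the average job accounts for about 25 hours of compute time, which suggests that many jobs are spread across a large number of parallel tasks.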

Related posts:

Seven reasons why I like Spark

Simpler workflow tools enable the rapid deployment of models

It’s getting easier to build Big Data applications

(1) I first wrote about Mesos over two years ago, when I learned that Twitter was using it heavily. Since then many organizations have deployed Mesos in production, including Twitter, Airbnb, Conviva, UC Berkeley, UC San Francisco, and a slew of startups that I’ve talked with.
