Architecting and building end-to-end streaming applications

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Karthik Ramasamy, adjunct faculty member at UC Berkeley, former engineering manager at Twitter, and co-founder of Streamlio. Ramasamy managed the team that built Heron, an open source, distributed stream processing engine, compatible with Apache Storm.  While Ramasamy has seen firsthand what it takes to build and deploy large-scale distributed systems (within Twitter, he worked closely with the team that built DistributedLog), he is first and foremost interested in designing and building end-to-end applications. As someone who organizes many conferences, I’m all too familiar with the vast array of popular big data frameworks available. But, I also know that engineers and architects are most interested in content and material that helps them cut through the options and decisions.

Ramasamy and I discussed the importance of designing systems that can be combined to produce end-to-end applications with the requisite characteristics and guarantees.

Here are some highlights from our conversation:

Moving from Apache Storm to Heron

A major consideration was that we had to fundamentally change a lot of things. So, the team weighed the cost: should we go with an existing code base or develop a new code base? We thought that even if we developed a new code base, we would be able to get it done very quickly and the team was excited about it. That’s what we did and we got the first version of Heron done in eight or nine months.

I think it was one of the quickest transitions that ever happened in the history of Twitter. Apache Storm was hit by a lot of failures. There was a strong incentive to move to a new system. Once we proved the new system was highly reliable, we created a compelling value for the engineering teams. We also made it very painless for people to move. All they had to do was recompile a job and launch it. So, when you make a system like that, then people are just going to say, ‘let me give it a shot.’ They just compile it, launch it, then they say, ‘for a week, my job has been running without any issues; that’s good, I’m moving.’ So, we got migration done, from Storm to Heron, in less than six months. All the teams cooperated with us, and it was just amazing that we were able to get it done in less than six months. And we provided them a level of reliability that they never had with Storm.

Continue reading “Architecting and building end-to-end streaming applications”

Structured streaming comes to Apache Spark 2.0

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Armbrust on enabling users to perform streaming analytics, without having to reason about streaming.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

With the release of Spark version 2.0, streaming starts becoming much more accessible to users. By adopting a continuous processing model (on an infinite table), the developers of Spark have enabled users of its SQL or DataFrame APIs to extend their analytic capabilities to unbounded streams.

Within the Spark community, Databricks Engineer, Michael Armbrust is well-known for having led the long-term project to move Spark’s interactive analytics engine from Shark to Spark SQL. (Full disclosure: I’m an advisor to Databricks.) Most recently he has turned his efforts to helping introduce a much simpler stream processing model to Spark Streaming (“structured streaming”).

Tackling these problems at large-scale, in a popular framework with many, many production deployments is a challenging problem. So think of Spark 2.0 as the opening salvo. Just as it took a few versions before a majority of Spark users moved over to Spark SQL, I expect the new structured streaming framework to improve and mature over the next releases of Spark.

Here are some highlights from our conversation:

Continue reading “Structured streaming comes to Apache Spark 2.0”

Stream processing and messaging systems for the IoT age

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: M.C. Srivas on streaming, enterprise grade systems, the Internet of Things, and data for social good.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with M.C. Srivas, co-founder of MapR and currently Chief Architect for Data at Uber. We discussed his long career in data management and his experience building a variety of distributed systems. In the course of his career, Srivas has architected key components that now comprise many data platforms (distributed file system, database, query engine, messaging system, etc.).

As Srivas is quick to point out, for these systems to be widely deployed in the enterprise, they require features like security, disaster recovery, and support for multiple data centers. We covered many topics in big data, with particular emphasis on real-time systems and applications. Below are some highlights from our conversation:

Applications and systems that run on multiple data centers

Ad serving has a limit of about 70 to 80 milliseconds where you have to reply to an advertisement. When you click on a Web page, the banner ads and the ads on the side and the bottom have to be served within 80 milliseconds. People place data centers across the world near each of the major centers where there’s a lot of people, where there’s a lot of activity. Many of our customers have data centers in Japan, in China, in Singapore, in Hong Kong, in India, in Russia, in Germany, across the United States, and worldwide. However, the billing is typically consolidated so that they bring data from all these data centers into central data centers where they process the entire clickstream and understand how to bill it back to their customers.

… Then they need a clean way to bring these clickstreams back into the central data centers, maybe running in the U.S. or in Japan or Germany, or somewhere where the consolidation on the overall view of the customer is created. Typically, this has been done by running completely independent Kafka systems in each place. As soon as that happens, the producers and the consumers are not coordinated across data centers. Think about a data center in Japan that has a Kafka cluster running. Well, it cannot failover to the Kafka cluster in Hong Kong because that’s a completely independent cluster and doesn’t understand what has been consumed and what has been produced in Japan. If a consumer who was consuming things from the Japanese Kafka moved to the Hong Kong Kafka, they would get garbage. This is the main problem that a lot of customers asked us to solve.

… The data sources have now gone not into a few data centers, but into millions of data centers. Think about every self-driving car. Every self-driving car is a data center in itself. It generates so much data. Think about a plane flying. A plane flying is a full data center. There’s 400 people on the plane, it’s a captive audience, and there’s enough data generated just for the preventative maintenance kind of stuff on the plane anyways. That’s the thinking behind MapR Streams—what do we need for the Internet of Things scale.

Streaming and messaging systems for IoT

A file system is very passive. You write some files, read some files, and how interesting could that get? If I look at a streaming system, what we’re looking for is completely real time. That is, if a publisher publishes something, then all listeners who want to listen to what the publisher is saying will get notified within five milliseconds inside the same data center. Five milliseconds to get a notification saying, “Hey, this was published.” Instantaneous almost. If I cross data centers, let’s say our data center halfway across the world, and you publish something, let’s say, in Japan and the person in South Africa or somewhere can get that information in under a second. They’ll be notified of that. There’s a push that we do so they get notified of it under a second, at a scale that’s billions of messages per second.

… We have learned from Kafka, we have learned from Tibco, we have learned from RabbitMQ and so many other technologies that have preceded us. We learned a lot from watching all those things, and they have paved a way for us. I think what we’ve done is now taking it to the next level, which is what we really need for IoT.

Powering the world’s largest biometric identity system

We implemented this thing in Aadhaar, a biometric identity project. It links you to your banking, your hospital admissions, all the records—whether it’s school admissions, hospital admissions, even airport entry, passport, pension payments.

… There’s about a billion people online right now. There’s another 300 million to go, but what I wanted to point out is that it’s completely digitized. If you want to withdraw money from an ATM, you put your fingerprint and take the money out. You don’t need a card.

… There was a flood in Chennai last November/December. Massive floods. It rained like it’s never rained before. It rained continuously for two months, and the houses were submerged in 10 feet of water. People lost everything—the entire Tamil Nadu state in India, people lost everything. But when they were rescued, they still had their fingerprints and they could access everything. Their bank accounts, their records, and everything because the Aadhaar project was biometrics-based. Really, they lost everything, but they still had it. They could get to everything right away. Think about what happens here if you lose your wallet. All your credit cards, your driver’s license, everything. You don’t have that kind of an issue anymore. That problem was solved.

Editor’s note: This interview took place in mid January 2016, at that time M.C. Srivas served as CTO of MapR. Srivas currently serves as the Chief Architect for Data at Uber and is a member of the Board of Directors of MapR.

Related resources:

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

[This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau one of the lead engineers in Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

The Dataflow model

There are two projects that we say Dataflow came out of. The FlumeJava project, which, for anybody who is not familiar, is a higher level language for describing large-scale, massive-scale data processing systems and then running it through an optimizer and coming up with an execution plan. … We had all sorts of use cases at Google where people were stringing together these series of MapReduce [jobs]. It was complex and difficult to deal with, and you had to try to manually optimize them for performance. If you do what the database folks have done,[you] run it through an optimizer. … Flume is the primary data processing system, so as part of that for the last few years, we’ve been moving MillWheel to be essentially a secondary execution engine for FlumeJava. You can either do it on batch mode and run on MapReduce or you can execute it on MillWheel. … FlumeJava plus MillWheel — it’s this evolution that’s happened internally, and now we’veexternalized it.

Balancing correctness, latency, and cost

There’s a wide variety of use cases out there. Sometimes you need high correctness; sometimes you don’t; sometimes you need low latency; sometimes higher latency is okay. Sometimes you’re willing to pay a lot for those other two features; sometimes you don’t want to pay as much. The real key, at least as far as having a system that is broadly applicable, is being able to be flexible and give people the choices to make the trade-offs they have to make. … There is a single knob which is, which runner am I going to use: batch or streaming? Aside from that, the other level at which you get to make these choices is when you’re deciding exactly when you materialize your results within the pipeline. … Once you have a streaming system or streaming execution engine that gives you this automatic-scaling, like Dataflow does, and it gives you consistency and strong tools for working with your data, then people start to build these really complicated services on them. It may not just be data processing. It actually becomes a nice platform for orchestrating events or orchestrating distributed state machines and things like that. We have a lot of users internally doing this stuff.

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes,SoundCloud, RSS

Related resources:

Celebrating the real-time processing revival

[A version of this article appears on the O’Reilly Radar.]

Register for Strata + Hadoop World NYC, which will take place September 29 to Oct 1, 2015.

A few months ago, I noted the resurgence in interest in large-scale stream-processing tools and real-time applications. Interest remains strong, and if anything, I’ve noticed growth in the number of companies wanting to understand how they can leverage the growing number of tools and learning resources to build intelligent, real-time products.

This is something we’ve observed using many metrics, including product sales, the number of submissions to our conferences, and the traffic to Radar and newsletter articles.

As we looked at putting together the program for Strata + Hadoop World NYC, we were excited to see a large number of compelling proposals on these topics. To that end, I’m pleased to highlight a strong collection of sessions on real-time processing and applications coming up at the event. Continue reading “Celebrating the real-time processing revival”

Expanding options for mining streaming data

[A version of this post appears on the O’Reilly Data blog.]

Stream processing was in the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks, are behind the recent chatter. Through a combination of simpler deployment tools, programming interfaces, and libraries, recently released tools make it easier for companies to process and mine streaming data sources.

Of the distributed stream processing systems that are part of the Hadoop ecosystem0, Storm is by far the most widely used (more on Storm below). I’ve written about Samza, a new framework from the team that developed Kafka (an extremely popular messaging system). Many companies who use Spark express interest in using Spark Streaming (many have already done so). Spark Streaming is distributed, fault-tolerant, stateful, and boosts programmer productivity (the same code used for batch processing can, with minor tweaks, be used for realtime computations). But it targets applications that are in the “second-scale latencies”. Both Spark Streaming and Samza have their share of adherents and I expect that they’ll both start gaining deployments in 2014.

Continue reading “Expanding options for mining streaming data”

Stream Mining essentials

[A version of this post appears on the O’Reilly Strata blog.]

A series of open source, distributed stream processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These tools excel at data processing and are also used for data mining – in many cases users have to write a bit of code1 to do stream mining. The good news is that easy-to-use stream mining libraries will likely emerge in the near future.

High volume data streams (data that arrive continuously) arise in many settings, including IT operations, sensors, and social media. What can one learn by looking at data one piece (or a few pieces) at a time? Can techniques that look at smaller representations of data streams be used to unlock their value? In this post, I’ll briefly summarize a recent overview given by stream mining pioneer Graham Cormode.

Generate Summaries
Massive amounts of data arriving at high velocity pose a challenge to data miners. At the most basic level, stream mining is about generating summaries that can be used to answer fundamental questions:

Stream Mining

Properly constructed summaries are useful for highlighting emerging patterns, trends, and anomalies. Common summaries (frequency moments in stream mining parlance) include a list of distinct items, recently trending items, heavy hitters (items that have appeared frequently), and the top k (most popular) items.

Continue reading “Stream Mining essentials”