Architecting and building end-to-end streaming applications

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Karthik Ramasamy, adjunct faculty member at UC Berkeley, former engineering manager at Twitter, and co-founder of Streamlio. Ramasamy managed the team that built Heron, an open source, distributed stream processing engine, compatible with Apache Storm.  While Ramasamy has seen firsthand what it takes to build and deploy large-scale distributed systems (within Twitter, he worked closely with the team that built DistributedLog), he is first and foremost interested in designing and building end-to-end applications. As someone who organizes many conferences, I’m all too familiar with the vast array of popular big data frameworks available. But, I also know that engineers and architects are most interested in content and material that helps them cut through the options and decisions.

Ramasamy and I discussed the importance of designing systems that can be combined to produce end-to-end applications with the requisite characteristics and guarantees.

Here are some highlights from our conversation:

Moving from Apache Storm to Heron

A major consideration was that we had to fundamentally change a lot of things. So, the team weighed the cost: should we go with an existing code base or develop a new code base? We thought that even if we developed a new code base, we would be able to get it done very quickly and the team was excited about it. That’s what we did and we got the first version of Heron done in eight or nine months.

I think it was one of the quickest transitions that ever happened in the history of Twitter. Apache Storm was hit by a lot of failures. There was a strong incentive to move to a new system. Once we proved the new system was highly reliable, we created a compelling value for the engineering teams. We also made it very painless for people to move. All they had to do was recompile a job and launch it. So, when you make a system like that, then people are just going to say, ‘let me give it a shot.’ They just compile it, launch it, then they say, ‘for a week, my job has been running without any issues; that’s good, I’m moving.’ So, we got migration done, from Storm to Heron, in less than six months. All the teams cooperated with us, and it was just amazing that we were able to get it done in less than six months. And we provided them a level of reliability that they never had with Storm.

Building DistributedLog

At the time when we were using Apache Kafka, we ran into a lot of issues in terms of durability as well as the ability to serve data for a high fan-out of streams. Essentially, if some data streams were being used by several jobs, and some job is falling behind, that leads to a lot of issues in terms of affecting other perfectly running jobs. This was a big pain point for us at the time.

Then, we decided we had to build a new system in order to address some important issues, and that is when DistributedLog was born. DistributedLog is more generic in the sense that it’s purely a logging system from a distributed perspective — it duplicates or partitions logs in some fashion, etc. Then, the idea was to build the application on top of it. In comparison, if you look at Kafka, Kafka is a pure pub/sub kind of a system. … It’s an application on top of a distribution log.

You can even think about using DistributedLog for something like two key-value stores, running in two different data centers (DC): if you want to migrate the data from one key-value store system running in one DC and replicate it to a key-value system store running in a different DC, you can use DistributedLog for this.

DistributedLog serves as a pub/sub system, and it’s also used for some amount of data to store for search as well. It supports a variety of different applications compared to what Kafka does today. That’s why we have been very happy with it so far. And to give some numbers, the amount of data replication we do with DistributedLog is about 20 petabytes a day.

Large-scale, end-to-end, streaming applications

If you consider end-to-end systems, there are a lot of interactions within the frameworks, the storage, and the pub/sub systems, etc. An end-to-end system needs to be robust across the entire collection of frameworks and interactions between frameworks.

… I would like to see something that wraps everything end-to-end. That way, it’s not only a stream processing framework. … If there were some kind of end-to-end solution that would give the illusion of a single service, that would be even better for the novice, and also for the many small and medium sized businesses that cannot afford to hire experienced engineers.

Related resources:

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s