Architecting and building end-to-end streaming applications

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Karthik Ramasamy, adjunct faculty member at UC Berkeley, former engineering manager at Twitter, and co-founder of Streamlio. Ramasamy managed the team that built Heron, an open source, distributed stream processing engine compatible with Apache Storm. While Ramasamy has seen firsthand what it takes to build and deploy large-scale distributed systems (within Twitter, he worked closely with the team that built DistributedLog), he is first and foremost interested in designing and building end-to-end applications. As someone who organizes many conferences, I’m all too familiar with the vast array of popular big data frameworks available. But I also know that engineers and architects are most interested in content and material that helps them cut through the options and decisions.

Ramasamy and I discussed the importance of designing systems that can be combined to produce end-to-end applications with the requisite characteristics and guarantees.

Here are some highlights from our conversation:

Moving from Apache Storm to Heron

A major consideration was that we had to fundamentally change a lot of things. So, the team weighed the cost: should we go with an existing code base or develop a new code base? We thought that even if we developed a new code base, we would be able to get it done very quickly and the team was excited about it. That’s what we did and we got the first version of Heron done in eight or nine months.

I think it was one of the quickest transitions that ever happened in the history of Twitter. Apache Storm was hit by a lot of failures. There was a strong incentive to move to a new system. Once we proved the new system was highly reliable, we created a compelling value for the engineering teams. We also made it very painless for people to move. All they had to do was recompile a job and launch it. So, when you make a system like that, then people are just going to say, ‘let me give it a shot.’ They just compile it, launch it, then they say, ‘for a week, my job has been running without any issues; that’s good, I’m moving.’ So, we got migration done, from Storm to Heron, in less than six months. All the teams cooperated with us, and it was just amazing that we were able to get it done in less than six months. And we provided them a level of reliability that they never had with Storm.

Continue reading “Architecting and building end-to-end streaming applications”

Structured streaming comes to Apache Spark 2.0

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Armbrust on enabling users to perform streaming analytics, without having to reason about streaming.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

With the release of Spark 2.0, streaming becomes much more accessible to users. By adopting a continuous processing model (on an infinite table), the developers of Spark have enabled users of its SQL or DataFrame APIs to extend their analytic capabilities to unbounded streams.
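To make that concrete, here is a minimal sketch of the DataFrame-based streaming API that ships with Spark 2.0: a word count over a socket source, where the host, port, and console sink are placeholders. The streaming DataFrame is treated like an ever-growing table, and the aggregation is updated incrementally as new lines arrive.

```python
# Minimal Structured Streaming sketch: the same DataFrame operations used on
# static tables run over an unbounded input. Host, port, and the console sink
# are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Treat a stream of text lines as an ever-growing (infinite) table.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Ordinary DataFrame transformations express the streaming computation.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Results are materialized incrementally as new rows arrive.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```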

Within the Spark community, Databricks engineer Michael Armbrust is well known for having led the long-term project to move Spark’s interactive analytics engine from Shark to Spark SQL. (Full disclosure: I’m an advisor to Databricks.) Most recently, he has turned his efforts to helping introduce a much simpler stream processing model to Spark Streaming (“structured streaming”).

Tackling these problems at large scale, in a popular framework with many production deployments, is challenging. So, think of Spark 2.0 as the opening salvo. Just as it took a few versions before a majority of Spark users moved over to Spark SQL, I expect the new structured streaming framework to improve and mature over the next few releases of Spark.

Here are some highlights from our conversation:

Continue reading “Structured streaming comes to Apache Spark 2.0”

Stream processing and messaging systems for the IoT age

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: M.C. Srivas on streaming, enterprise-grade systems, the Internet of Things, and data for social good.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with M.C. Srivas, co-founder of MapR and currently Chief Architect for Data at Uber. We discussed his long career in data management and his experience building a variety of distributed systems. In the course of his career, Srivas has architected key components that now comprise many data platforms (distributed file system, database, query engine, messaging system, etc.).

As Srivas is quick to point out, for these systems to be widely deployed in the enterprise, they require features like security, disaster recovery, and support for multiple data centers. We covered many topics in big data, with particular emphasis on real-time systems and applications. Below are some highlights from our conversation:

Applications and systems that run on multiple data centers

Ad serving has a limit of about 70 to 80 milliseconds where you have to reply to an advertisement. When you click on a Web page, the banner ads and the ads on the side and the bottom have to be served within 80 milliseconds. People place data centers across the world near each of the major centers where there’s a lot of people, where there’s a lot of activity. Many of our customers have data centers in Japan, in China, in Singapore, in Hong Kong, in India, in Russia, in Germany, across the United States, and worldwide. However, the billing is typically consolidated so that they bring data from all these data centers into central data centers where they process the entire clickstream and understand how to bill it back to their customers.

… Then they need a clean way to bring these clickstreams back into the central data centers, maybe running in the U.S. or in Japan or Germany, or somewhere where the consolidation on the overall view of the customer is created. Typically, this has been done by running completely independent Kafka systems in each place. As soon as that happens, the producers and the consumers are not coordinated across data centers. Think about a data center in Japan that has a Kafka cluster running. Well, it cannot failover to the Kafka cluster in Hong Kong because that’s a completely independent cluster and doesn’t understand what has been consumed and what has been produced in Japan. If a consumer who was consuming things from the Japanese Kafka moved to the Hong Kong Kafka, they would get garbage. This is the main problem that a lot of customers asked us to solve.

… The data sources have now gone not into a few data centers, but into millions of data centers. Think about every self-driving car. Every self-driving car is a data center in itself. It generates so much data. Think about a plane flying. A plane flying is a full data center. There’s 400 people on the plane, it’s a captive audience, and there’s enough data generated just for the preventative maintenance kind of stuff on the plane anyways. That’s the thinking behind MapR Streams—what do we need for the Internet of Things scale.
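A minimal sketch may help make the coordination gap Srivas describes concrete. The example below uses the kafka-python client (an assumption; the interview does not prescribe a client), and the broker addresses, topic, consumer group, and processing stub are all hypothetical placeholders. The point is simply that a consumer group’s committed offsets live in whichever cluster it reads from, so an independent cluster in another data center cannot resume where the first left off.

```python
# Illustrative only: offsets committed by a consumer group are stored in the
# cluster it reads from, so an independent cluster knows nothing about them.
# Broker addresses, topic, group id, and process() are hypothetical.
from kafka import KafkaConsumer

def process(record):
    # Stand-in for real clickstream/billing logic.
    print(record.topic, record.partition, record.offset)

def consume_clickstream(bootstrap_servers):
    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers=bootstrap_servers,
        group_id="billing-consolidation",
        enable_auto_commit=False,
        auto_offset_reset="earliest",
        consumer_timeout_ms=10_000,  # stop iterating when no new records arrive
    )
    for record in consumer:
        process(record)
        consumer.commit()  # position is recorded only in *this* cluster

# Offsets committed against the Tokyo cluster...
consume_clickstream(["kafka-tokyo.example.com:9092"])

# ...are invisible to the independent Hong Kong cluster, so "failing over"
# the same group there re-reads (or skips) data instead of resuming cleanly.
consume_clickstream(["kafka-hongkong.example.com:9092"])
```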

Streaming and messaging systems for IoT

A file system is very passive. You write some files, read some files, and how interesting could that get? If I look at a streaming system, what we’re looking for is completely real time. That is, if a publisher publishes something, then all listeners who want to listen to what the publisher is saying will get notified within five milliseconds inside the same data center. Five milliseconds to get a notification saying, “Hey, this was published.” Almost instantaneous. If I cross data centers, let’s say a data center halfway across the world, and you publish something in Japan, then a person in South Africa or somewhere can get that information in under a second. They’ll be notified of that. There’s a push that we do so they get notified of it in under a second, at a scale of billions of messages per second.

… We have learned from Kafka, we have learned from Tibco, we have learned from RabbitMQ and so many other technologies that have preceded us. We learned a lot from watching all those things, and they have paved a way for us. I think what we’ve done is now taking it to the next level, which is what we really need for IoT.

Powering the world’s largest biometric identity system

We implemented this thing in Aadhaar, a biometric identity project. It links you to your banking, your hospital admissions, all the records—whether it’s school admissions, hospital admissions, even airport entry, passport, pension payments.

… There’s about a billion people online right now. There’s another 300 million to go, but what I wanted to point out is that it’s completely digitized. If you want to withdraw money from an ATM, you put your fingerprint and take the money out. You don’t need a card.

… There was a flood in Chennai last November/December. Massive floods. It rained like it’s never rained before. It rained continuously for two months, and the houses were submerged in 10 feet of water. People lost everything—the entire Tamil Nadu state in India, people lost everything. But when they were rescued, they still had their fingerprints and they could access everything. Their bank accounts, their records, and everything because the Aadhaar project was biometrics-based. Really, they lost everything, but they still had it. They could get to everything right away. Think about what happens here if you lose your wallet. All your credit cards, your driver’s license, everything. You don’t have that kind of an issue anymore. That problem was solved.

Editor’s note: This interview took place in mid-January 2016; at that time, M.C. Srivas served as CTO of MapR. Srivas currently serves as the Chief Architect for Data at Uber and is a member of the Board of Directors of MapR.

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

[This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau, one of the lead engineers working on Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

The Dataflow model

There are two projects that we say Dataflow came out of. The FlumeJava project, which, for anybody who is not familiar, is a higher-level language for describing large-scale, massive-scale data processing systems and then running it through an optimizer and coming up with an execution plan. … We had all sorts of use cases at Google where people were stringing together these series of MapReduce [jobs]. It was complex and difficult to deal with, and you had to try to manually optimize them for performance. If you do what the database folks have done, [you] run it through an optimizer. … Flume is the primary data processing system, so as part of that, for the last few years, we’ve been moving MillWheel to be essentially a secondary execution engine for FlumeJava. You can either do it in batch mode and run on MapReduce, or you can execute it on MillWheel. … FlumeJava plus MillWheel — it’s this evolution that’s happened internally, and now we’ve externalized it.
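For readers who want to see what the externalized model looks like in code, below is a minimal word count written with the Apache Beam Python SDK, which is where the Dataflow programming model was eventually open sourced. The file paths and runner choice are placeholders; the point is that a single pipeline definition can be handed to different execution engines rather than being rewritten for each one.

```python
# A minimal word count in the Apache Beam Python SDK, the open source home of
# the Dataflow model discussed above. Paths and runner are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for "DataflowRunner" (plus project/region options) to
# execute the very same pipeline on the hosted Dataflow service.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("input.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Pair" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: "%s: %d" % kv)
     | "Write" >> beam.io.WriteToText("counts"))
```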

Balancing correctness, latency, and cost

There’s a wide variety of use cases out there. Sometimes you need high correctness; sometimes you don’t; sometimes you need low latency; sometimes higher latency is okay. Sometimes you’re willing to pay a lot for those other two features; sometimes you don’t want to pay as much. The real key, at least as far as having a system that is broadly applicable, is being able to be flexible and give people the choices to make the trade-offs they have to make. … There is a single knob, which is: which runner am I going to use, batch or streaming? Aside from that, the other level at which you get to make these choices is when you’re deciding exactly when you materialize your results within the pipeline. … Once you have a streaming system or streaming execution engine that gives you this automatic scaling, like Dataflow does, and it gives you consistency and strong tools for working with your data, then people start to build these really complicated services on them. It may not just be data processing. It actually becomes a nice platform for orchestrating events or orchestrating distributed state machines and things like that. We have a lot of users internally doing this stuff.
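One concrete reading of “deciding exactly when you materialize your results” is the windowing and trigger configuration exposed by the (now open source) Dataflow/Beam model. The sketch below uses the Apache Beam Python SDK; the element shape, durations, and counts are illustrative assumptions rather than anything Akidau specifies.

```python
# Illustrative sketch of the latency/cost/correctness knobs: the trigger
# decides *when* per-window results are emitted, and the accumulation mode
# decides *what* each firing contains. All values below are made up.
import apache_beam as beam
from apache_beam.transforms import window, trigger

def windowed_counts(events):
    return (events
            | "Window" >> beam.WindowInto(
                window.FixedWindows(60),          # 1-minute event-time windows
                trigger=trigger.AfterWatermark(
                    early=trigger.AfterProcessingTime(30),  # speculative, low-latency firings
                    late=trigger.AfterCount(1)),            # refinements as late data arrives
                accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
                allowed_lateness=600)             # keep window state for 10 extra minutes
            | "Pair" >> beam.Map(lambda event: (event["key"], 1))  # hypothetical element shape
            | "Count" >> beam.CombinePerKey(sum))
```

In this framing, the early firing buys latency, the late firing and accumulation mode buy correctness on late-arriving data, and allowed lateness is largely a cost knob, since it determines how long per-window state is kept around.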

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS

Building self-service tools to monitor high-volume time-series data

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Phil Liu on the evolution of metric monitoring tools and cloud computing.

One of the main sources of real-time data processing tools is IT operations. In fact, a previous post I wrote on the re-emergence of real time was, to a large extent, prompted by my discussions with engineers and entrepreneurs building monitoring tools for IT operations. In many ways, data centers are perfect laboratories in that they are controlled environments managed by teams willing to instrument devices and software, and to monitor fine-grained metrics.

During a recent episode of the O’Reilly Data Show Podcast, I caught up with Phil Liu, co-founder and CTO of SignalFx, an SF Bay Area startup focused on building self-service monitoring tools for time series. We discussed hiring and building teams in the age of cloud computing, building tools for monitoring large numbers of time series, and lessons he’s learned from managing teams at leading technology companies.

Evolution of monitoring tools

Having worked at LoudCloud, Opsware, and Facebook, Liu has seen firsthand the evolution of real-time monitoring tools and platforms. Liu described how he has watched the number of metrics grow to volumes that require large compute clusters:

One of the first services I worked on at LoudCloud was a service called MyLoudCloud. Essentially, that was a monitoring portal for all LoudCloud customers. At the time, [the way] we thought about monitoring was still in a per-instance-oriented monitoring system. [Later], I was one of the first engineers on the operational side of Facebook and eventually became part of the infrastructure team at Facebook. When I joined, Facebook basically was using a collection of open source software for monitoring and configuration, so these are things that everybody knows — Nagios, Ganglia. It started out basically using just per-instance monitoring techniques, basically the same techniques that we used back at LoudCloud, but interestingly and very quickly as Facebook grew, this per-instance-oriented monitoring no longer worked because we went from tens of thousands of servers to hundreds of thousands of servers, from tens of services to hundreds and thousands of services internally.

Continue reading “Building self-service tools to monitor high-volume time-series data”

A real-time processing revival

[A version of this post appears on the O’Reilly Radar blog.]

Things are moving fast in the stream processing world.

There’s renewed interest in stream processing and analytics. I write this based on some data points (attendance in webcasts and conference sessions; a recent meetup) and many conversations with technologists, startup founders, and investors. Certainly, applications are driving this recent resurgence. I’ve written previously about systems that come from IT operations, as well as about how the rise of cheap sensors is producing stream mining solutions from wearables (mostly health-related apps) and the IoT (consumer, industrial, and municipal settings). In this post, I’ll provide a short update on some of the systems that are being built to handle large amounts of event data.

Apache projects (Kafka, Storm, Spark Streaming, Flume) continue to be popular components in stream processing stacks (I’m not yet hearing much about Samza). Over the past year, many more engineers started deploying Kafka alongside one of the two leading distributed stream processing frameworks (Storm or Spark Streaming). Among the major Hadoop vendors, Hortonworks has been promoting Storm, Cloudera supports Spark Streaming, and MapR supports both. Kafka is a high-throughput distributed pub/sub system that provides a layer of indirection between “producers” that write to it and “consumers” that take data out of it. A new startup (Confluent) founded by the creators of Kafka should further accelerate the development of this already very popular system. Apache Flume is used to collect, aggregate, and move large amounts of streaming data, and is frequently used with Kafka (Flafka, or Flume + Kafka). Spark Streaming continues to be one of the more popular components within the Spark ecosystem, and its creators have been adding features at a rapid pace (most recently Kafka integration, a Python API, and zero data loss).

Continue reading “A real-time processing revival”

Building Apache Kafka from scratch

[A version of this post originally appeared on the O’Reilly Radar blog.]

In this episode of the O’Reilly Data Show Podcast, Jay Kreps talks about data integration, event data, and the Internet of Things.

At the heart of big data platforms are robust data flows that connect diverse data sources. Over the past few years, a new set of (mostly open source) software components has become critical to tackling data integration problems at scale. By now, many people have heard of tools like Hadoop, Spark, and NoSQL databases, but there are a number of lesser-known components that are “hidden” beneath the surface.

In my conversations with data engineers tasked with building data platforms, one tool stands out: Apache Kafka, a distributed messaging system that originated from LinkedIn. It’s used to synchronize data between systems and has emerged as an important component in real-time analytics.
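For readers who have not used it, the sketch below shows the basic producer/consumer pattern behind that layer of indirection, written with the kafka-python client; the broker address, topic, payload, and consumer group are illustrative placeholders.

```python
# Minimal producer/consumer sketch with the kafka-python client. Broker
# address, topic, payload, and group id are illustrative placeholders.
from kafka import KafkaProducer, KafkaConsumer

# Producers append records to a topic without knowing who will read them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"source": "web", "action": "click"}')
producer.flush()  # make sure the record is actually written

# Consumer groups read from the topic independently, at their own pace.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",            # e.g. feeding a Storm or Spark Streaming job
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,      # stop iterating when no new records arrive
)
for record in consumer:
    print(record.offset, record.value)  # hand off to downstream processing
```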

Subscribe to the O’Reilly Data Show Podcast

iTunes, SoundCloud, RSS

In my travels over the past year, I’ve met engineers across many industries who use Apache Kafka in production. A few months ago, I sat down with O’Reilly author and Radar contributor Jay Kreps, a highly regarded data engineer, former technical lead for Online Data Infrastructure at LinkedIn, and now CEO and co-founder of Confluent.

Continue reading “Building Apache Kafka from scratch”