Data is only as valuable as the decisions it enables

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Ion Stoica on building intelligent and secure applications on live data.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode I spoke with Ion Stoica, cofounder and chairman of Databricks. Stoica is also a professor of computer science at UC Berkeley, where he serves as director of the new RISE Lab (the successor to AMPLab). Fresh off the incredible success of AMPLab, RISE seeks to build tools and platforms that enable sophisticated real-time applications on live data, while maintaining strong security. As Stoica points out, users will increasingly expect security guarantees on systems that rely on online machine learning algorithms that make use of personal or proprietary data.

As with AMPLab, the goal is to build tools and platforms, while producing high-quality research in computer science and its applications to other disciplines. Below are highlights from our conversation:
Continue reading “Data is only as valuable as the decisions it enables”

Structured streaming comes to Apache Spark 2.0

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Armbrust on enabling users to perform streaming analytics, without having to reason about streaming.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

With the release of Spark version 2.0, streaming starts becoming much more accessible to users. By adopting a continuous processing model (on an infinite table), the developers of Spark have enabled users of its SQL or DataFrame APIs to extend their analytic capabilities to unbounded streams.
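The “infinite table” idea can be illustrated with a toy sketch in plain Python (this is not the Spark API; the class and method names here are invented for illustration): each micro-batch is treated as rows appended to an ever-growing input table, and the query result is updated incrementally rather than recomputed from scratch.

```python
from collections import defaultdict

class UnboundedTable:
    """Toy model of structured streaming's 'infinite table': a word count
    expressed as a query over a table that only grows, with the result
    maintained incrementally as each micro-batch arrives."""

    def __init__(self):
        self.counts = defaultdict(int)  # the continuously updated result table

    def append_batch(self, rows):
        # A new micro-batch is just rows appended to the input table; the
        # engine updates the existing result instead of recomputing everything.
        for word in rows:
            self.counts[word] += 1
        return dict(self.counts)

table = UnboundedTable()
table.append_batch(["spark", "stream"])
result = table.append_batch(["spark"])  # result covers all rows seen so far
```

The same mental model is what lets users write an ordinary SQL or DataFrame query and have the engine worry about incremental execution.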

Within the Spark community, Databricks engineer Michael Armbrust is well known for having led the long-term project to move Spark’s interactive analytics engine from Shark to Spark SQL. (Full disclosure: I’m an advisor to Databricks.) Most recently, he has turned his efforts to helping introduce a much simpler stream processing model to Spark Streaming (“structured streaming”).

Tackling these problems at large scale, in a popular framework with many production deployments, is challenging, so think of Spark 2.0 as the opening salvo. Just as it took a few versions before a majority of Spark users moved over to Spark SQL, I expect the new structured streaming framework to improve and mature over the next few releases of Spark.

Here are some highlights from our conversation:

Continue reading “Structured streaming comes to Apache Spark 2.0”

Stream processing and messaging systems for the IoT age

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: M.C. Srivas on streaming, enterprise-grade systems, the Internet of Things, and data for social good.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with M.C. Srivas, co-founder of MapR and currently Chief Architect for Data at Uber. We discussed his long career in data management and his experience building a variety of distributed systems. In the course of his career, Srivas has architected key components that now comprise many data platforms (distributed file system, database, query engine, messaging system, etc.).

As Srivas is quick to point out, for these systems to be widely deployed in the enterprise, they require features like security, disaster recovery, and support for multiple data centers. We covered many topics in big data, with particular emphasis on real-time systems and applications. Below are some highlights from our conversation:

Applications and systems that run on multiple data centers

Ad serving has a limit of about 70 to 80 milliseconds where you have to reply to an advertisement. When you click on a Web page, the banner ads and the ads on the side and the bottom have to be served within 80 milliseconds. People place data centers across the world near each of the major centers where there’s a lot of people, where there’s a lot of activity. Many of our customers have data centers in Japan, in China, in Singapore, in Hong Kong, in India, in Russia, in Germany, across the United States, and worldwide. However, the billing is typically consolidated so that they bring data from all these data centers into central data centers where they process the entire clickstream and understand how to bill it back to their customers.

… Then they need a clean way to bring these clickstreams back into the central data centers, maybe running in the U.S. or in Japan or Germany, or somewhere where the consolidation on the overall view of the customer is created. Typically, this has been done by running completely independent Kafka systems in each place. As soon as that happens, the producers and the consumers are not coordinated across data centers. Think about a data center in Japan that has a Kafka cluster running. Well, it cannot failover to the Kafka cluster in Hong Kong because that’s a completely independent cluster and doesn’t understand what has been consumed and what has been produced in Japan. If a consumer who was consuming things from the Japanese Kafka moved to the Hong Kong Kafka, they would get garbage. This is the main problem that a lot of customers asked us to solve.

… The data sources have now gone not into a few data centers, but into millions of data centers. Think about every self-driving car. Every self-driving car is a data center in itself. It generates so much data. Think about a plane flying. A plane flying is a full data center. There’s 400 people on the plane, it’s a captive audience, and there’s enough data generated just for the preventative maintenance kind of stuff on the plane anyways. That’s the thinking behind MapR Streams—what do we need for the Internet of Things scale.
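The offset mismatch Srivas describes can be modeled in a few lines of Python (a toy stand-in, not the Kafka client API): each cluster’s log keeps its own offsets, so a consumer’s position in one cluster is meaningless in another.

```python
class ClusterLog:
    """Toy stand-in for one Kafka cluster's partition: an append-only list
    where a consumer's position is an integer offset local to this log."""
    def __init__(self):
        self.messages = []
    def produce(self, msg):
        self.messages.append(msg)
    def consume_from(self, offset):
        return self.messages[offset:]

japan = ClusterLog()
hong_kong = ClusterLog()  # independent cluster: separate log, separate offsets

for i in range(5):
    japan.produce(f"click-{i}")
hong_kong.produce("click-hk-0")

offset = 5  # the consumer has read everything in the Japan cluster

# Failing over to Hong Kong with a Japan offset is meaningless: the consumer
# appears 'caught up' there even though it has read none of that cluster's data.
assert japan.consume_from(offset) == []
assert hong_kong.consume_from(offset) == []
assert len(hong_kong.messages) == 1  # ...yet an unread message exists
```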

Streaming and messaging systems for IoT

A file system is very passive. You write some files, read some files, and how interesting could that get? If I look at a streaming system, what we’re looking for is completely real time. That is, if a publisher publishes something, then all listeners who want to listen to what the publisher is saying will get notified within five milliseconds inside the same data center. Five milliseconds to get a notification saying, “Hey, this was published.” Instantaneous almost. If I cross data centers, let’s say our data center halfway across the world, and you publish something, let’s say, in Japan and the person in South Africa or somewhere can get that information in under a second. They’ll be notified of that. There’s a push that we do so they get notified of it under a second, at a scale that’s billions of messages per second.

… We have learned from Kafka, we have learned from Tibco, we have learned from RabbitMQ and so many other technologies that have preceded us. We learned a lot from watching all those things, and they have paved a way for us. I think what we’ve done is now taking it to the next level, which is what we really need for IoT.
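The publish/subscribe behavior Srivas describes can be sketched minimally in Python (a toy model only; the five-millisecond and per-second figures above describe the real system, not this sketch): publishing pushes the message to every registered listener.

```python
class Topic:
    """Minimal publish/subscribe topic: publishing pushes the message to
    every registered listener, modeling the notify-on-publish behavior."""
    def __init__(self):
        self.listeners = []
    def subscribe(self, callback):
        self.listeners.append(callback)
    def publish(self, message):
        for notify in self.listeners:  # push to all subscribers immediately
            notify(message)

received = []
topic = Topic()
topic.subscribe(received.append)
topic.publish("sensor-reading-42")
assert received == ["sensor-reading-42"]
```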

Powering the world’s largest biometric identity system

We implemented this thing in Aadhaar, a biometric identity project. It links you to your banking, your hospital admissions, all the records—whether it’s school admissions, hospital admissions, even airport entry, passport, pension payments.

… There’s about a billion people online right now. There’s another 300 million to go, but what I wanted to point out is that it’s completely digitized. If you want to withdraw money from an ATM, you put your fingerprint and take the money out. You don’t need a card.

… There was a flood in Chennai last November/December. Massive floods. It rained like it’s never rained before. It rained continuously for two months, and the houses were submerged in 10 feet of water. People lost everything—the entire Tamil Nadu state in India, people lost everything. But when they were rescued, they still had their fingerprints and they could access everything. Their bank accounts, their records, and everything because the Aadhaar project was biometrics-based. Really, they lost everything, but they still had it. They could get to everything right away. Think about what happens here if you lose your wallet. All your credit cards, your driver’s license, everything. You don’t have that kind of an issue anymore. That problem was solved.

Editor’s note: This interview took place in mid-January 2016; at that time, M.C. Srivas served as CTO of MapR. Srivas currently serves as the Chief Architect for Data at Uber and is a member of the Board of Directors of MapR.


Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

[This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau, one of the lead engineers on Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

The Dataflow model

There are two projects that we say Dataflow came out of. The FlumeJava project, which, for anybody who is not familiar, is a higher-level language for describing large-scale, massive-scale data processing systems and then running it through an optimizer and coming up with an execution plan. … We had all sorts of use cases at Google where people were stringing together these series of MapReduce [jobs]. It was complex and difficult to deal with, and you had to try to manually optimize them for performance. If you do what the database folks have done, [you] run it through an optimizer. … Flume is the primary data processing system, so as part of that for the last few years, we’ve been moving MillWheel to be essentially a secondary execution engine for FlumeJava. You can either do it in batch mode and run on MapReduce or you can execute it on MillWheel. … FlumeJava plus MillWheel — it’s this evolution that’s happened internally, and now we’ve externalized it.
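The deferred-execution idea behind FlumeJava can be sketched in a few lines of Python (the class and method names here are invented, not the FlumeJava or Dataflow API): transforms are recorded as a plan, and nothing runs until a runner executes it, which is what lets an optimizer step in, or a different engine (batch vs. streaming) be chosen at run time.

```python
class Pipeline:
    """Toy deferred pipeline in the spirit of FlumeJava/Dataflow: transforms
    are recorded as a plan rather than executed eagerly."""
    def __init__(self):
        self.plan = []
    def map(self, fn):
        self.plan.append(("map", fn))
        return self
    def filter(self, pred):
        self.plan.append(("filter", pred))
        return self
    def run(self, data):
        # A real runner would optimize the plan (e.g., fuse adjacent stages)
        # before handing it to MapReduce or MillWheel; here we just interpret it.
        out = list(data)
        for op, fn in self.plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

result = Pipeline().map(lambda x: x * 2).filter(lambda x: x > 4).run([1, 2, 3])
assert result == [6]
```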

Balancing correctness, latency, and cost

There’s a wide variety of use cases out there. Sometimes you need high correctness; sometimes you don’t; sometimes you need low latency; sometimes higher latency is okay. Sometimes you’re willing to pay a lot for those other two features; sometimes you don’t want to pay as much. The real key, at least as far as having a system that is broadly applicable, is being able to be flexible and give people the choices to make the trade-offs they have to make. … There is a single knob which is, which runner am I going to use: batch or streaming? Aside from that, the other level at which you get to make these choices is when you’re deciding exactly when you materialize your results within the pipeline. … Once you have a streaming system or streaming execution engine that gives you this automatic-scaling, like Dataflow does, and it gives you consistency and strong tools for working with your data, then people start to build these really complicated services on them. It may not just be data processing. It actually becomes a nice platform for orchestrating events or orchestrating distributed state machines and things like that. We have a lot of users internally doing this stuff.
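The “when do you materialize results” knob Akidau mentions can be illustrated with a small sketch (plain Python, not the Dataflow API): a trigger that fires often gives low latency but more downstream work; one that fires rarely gives higher latency at lower cost.

```python
def running_counts(events, trigger_every):
    """Materialize (emit) the running count after every `trigger_every`
    events: a small trigger fires often (low latency, more output to
    process); a large one fires rarely (higher latency, lower cost)."""
    count, emitted = 0, []
    for i, _ in enumerate(events, start=1):
        count += 1
        if i % trigger_every == 0:
            emitted.append(count)
    return emitted

events = list(range(6))
low_latency = running_counts(events, trigger_every=1)  # emits on every event
low_cost = running_counts(events, trigger_every=3)     # emits a third as often
```

Both runs compute the same final answer; they differ only in how often intermediate results are pushed downstream.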

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS


Celebrating the real-time processing revival

[A version of this article appears on the O’Reilly Radar.]

Register for Strata + Hadoop World NYC, which will take place September 29 to October 1, 2015.

A few months ago, I noted the resurgence in interest in large-scale stream-processing tools and real-time applications. Interest remains strong, and if anything, I’ve noticed growth in the number of companies wanting to understand how they can leverage the growing number of tools and learning resources to build intelligent, real-time products.

This is something we’ve observed using many metrics, including product sales, the number of submissions to our conferences, and the traffic to Radar and newsletter articles.

As we looked at putting together the program for Strata + Hadoop World NYC, we were excited to see a large number of compelling proposals on these topics. To that end, I’m pleased to highlight a strong collection of sessions on real-time processing and applications coming up at the event. Continue reading “Celebrating the real-time processing revival”

Building big data systems in academia and industry

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Mikio Braun on stream processing, academic research, and training.

Mikio Braun is a machine learning researcher who also enjoys software engineering. We first met when he co-founded a real-time analytics company called streamdrill. Since then, I’ve always had great conversations with him on many topics in the data space. He gave one of the best-attended sessions at Strata + Hadoop World in Barcelona last year on some of his work at streamdrill.

I recently sat down with Braun for the latest episode of the O’Reilly Data Show Podcast, and we talked about machine learning, stream processing and analytics, his recent foray into data science training, and academia versus industry (his interests are a bit on the “applied” side, but he enjoys both).

Figure: An example of a big data solution. Source: Mikio Braun, used with permission.

Continue reading “Building big data systems in academia and industry”

A real-time processing revival

[A version of this post appears on the O’Reilly Radar blog.]

Things are moving fast in the stream processing world.

There’s renewed interest in stream processing and analytics. I write this based on some data points (attendance in webcasts and conference sessions; a recent meetup), and many conversations with technologists, startup founders, and investors. Certainly, applications are driving this recent resurgence. I’ve written previously about systems that come from IT operations, as well as how the rise of cheap sensors is producing stream mining solutions, from wearables (mostly health-related apps) to the IoT (consumer, industrial, and municipal settings). In this post, I’ll provide a short update on some of the systems being built to handle large amounts of event data.

Apache projects (Kafka, Storm, Spark Streaming, Flume) continue to be popular components in stream processing stacks (I’m not yet hearing much about Samza). Over the past year, many more engineers started deploying Kafka alongside one of the two leading distributed stream processing frameworks (Storm or Spark Streaming). Among the major Hadoop vendors, Hortonworks has been promoting Storm, Cloudera supports Spark Streaming, and MapR supports both. Kafka is a high-throughput distributed pub/sub system that provides a layer of indirection between “producers” that write to it and “consumers” that take data out of it. A new startup (Confluent) founded by the creators of Kafka should further accelerate the development of this already very popular system. Apache Flume is used to collect, aggregate, and move large amounts of streaming data, and is frequently used with Kafka (Flafka or Flume + Kafka). Spark Streaming continues to be one of the more popular components within the Spark ecosystem, and its creators have been adding features at a rapid pace (most recently Kafka integration, a Python API, and zero data loss). Continue reading “A real-time processing revival”
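Kafka’s “layer of indirection” can be sketched in plain Python (a toy model, not the Kafka client API): producers append to a shared log, and each consumer pulls at its own pace while tracking its own offset, so neither side needs to know about the other.

```python
class Broker:
    """Sketch of pub/sub indirection: producers append to a shared log;
    each consumer pulls at its own pace and tracks its own offset."""
    def __init__(self):
        self.log = []
        self.offsets = {}
    def produce(self, msg):
        self.log.append(msg)
    def poll(self, consumer_id, max_records=10):
        # Return the next batch for this consumer and advance only its offset.
        start = self.offsets.get(consumer_id, 0)
        batch = self.log[start:start + max_records]
        self.offsets[consumer_id] = start + len(batch)
        return batch

broker = Broker()
for i in range(3):
    broker.produce(f"event-{i}")

# Two consumers read the same stream independently, at different rates.
fast = broker.poll("storm-job")                 # all three events at once
slow = broker.poll("spark-job", max_records=1)  # just the first event
```

This decoupling is why a single Kafka deployment can feed both a Storm topology and a Spark Streaming job without either affecting the other’s progress.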