[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show Podcast: Ion Stoica on building intelligent and secure applications on live data.
In this episode I spoke with Ion Stoica, cofounder and chairman of Databricks. Stoica is also a professor of computer science at UC Berkeley, where he serves as director of the new RISE Lab (the successor to AMPLab). Fresh off the incredible success of AMPLab, RISE seeks to build tools and platforms that enable sophisticated real-time applications on live data, while maintaining strong security. As Stoica points out, users will increasingly expect security guarantees on systems that rely on online machine learning algorithms that make use of personal or proprietary data.
As with AMPLab, the goal is to build tools and platforms, while producing high-quality research in computer science and its applications to other disciplines. Below are highlights from our conversation:
Intelligent applications on live data
The RISE Lab is about real-time decisions on live data. There are two differences: one is the transition from analytics to decisions, the other is transitioning from what was mostly queries on batch data to live data. If you look at what people try to do with their data, it’s to use it to make decisions or to take some actions that will improve their product, business processes, and things like that. You hear more and more today that data is only as valuable as the decision it enables. Now, if you buy this premise, then what the RISE Lab is doing follows naturally. If you think about decisions: on the one hand, in general, faster decisions are better than slower decisions; decisions on fresh data are typically better than decisions on stale data; and also, more controversially, but still true, decisions on personalized data are better than on aggregate data.
The goal of the RISE Lab is to build platforms, tools, and algorithms to support applications that depend on decisions on live data. In particular, our goal is to support real-time decisions on live data with strong security. By real-time, we are thinking about making decisions in milliseconds or tens of milliseconds. When we say ‘on live data,’ what we mean is we want to make the decision not only on historical data, but on the current state of the environment. When we say ‘strong security,’ what we mean is to provide privacy and confidentiality, and both data and computation integrity, for these computations.
Quality and robustness in real-time settings
We are not talking about just rule-based decisions; we are not talking about some SQL queries where you compare the result with some fixed threshold. Think about fraud detection, think about forecasts, think about coordinating—in real-time—a fleet of drones. There are several aspects of the decision process. … Typically, we characterize decisions by what we call quality and accuracy. You want decisions that have no false positives or false negatives.
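To make the false positive and false negative language concrete, here is a minimal sketch of how one might score a batch of fraud decisions against known outcomes. The function names and the sample labels are hypothetical and are not taken from the conversation or from any RISE Lab tool.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary decisions.

    y_true and y_pred are equal-length sequences of 0/1 values,
    where 1 means "fraud" (hypothetical labeling convention).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            tp += 1          # flagged, and it really was fraud
        elif pred and not truth:
            fp += 1          # flagged, but the transaction was legitimate
        elif truth:
            fn += 1          # missed fraud
        else:
            tn += 1          # correctly left alone
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def error_rates(counts):
    """Return (false positive rate, false negative rate)."""
    negatives = counts["fp"] + counts["tn"]
    positives = counts["fn"] + counts["tp"]
    fp_rate = counts["fp"] / negatives if negatives else 0.0
    fn_rate = counts["fn"] / positives if positives else 0.0
    return fp_rate, fn_rate


# Illustrative data only: six transactions, three of them fraudulent.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
fp_rate, fn_rate = error_rates(counts)
```

In practice both rates cannot usually be driven to zero at once, so an application would pick a threshold that trades one kind of error against the other.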
One of the most important aspects of the decision process is robustness. Robustness means not only to be robust with respect to noisy inputs. You also want to be robust with respect to unforeseen data. If you are talking about online machine learning, in many of these cases, you are going to train the model based on certain examples. As long as you get inputs that are similar to the examples you use to train your model, you’re fine. But what happens if you get an input that is extremely different from the examples you used to train your model?
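One simple way to picture the "extremely different input" problem is a guard that checks each incoming feature against the range seen during training before the model is trusted. The sketch below uses a per-feature z-score test; the function names, the `max_z` threshold, and the sample data are all assumptions made for illustration, and this is not a description of how RISE Lab systems handle robustness.

```python
from statistics import mean, stdev


def fit_feature_stats(training_rows):
    """Record the mean and sample standard deviation of each feature column."""
    columns = list(zip(*training_rows))
    return [(mean(col), stdev(col)) for col in columns]


def is_unfamiliar(row, stats, max_z=4.0):
    """Return True if any feature lies more than max_z standard
    deviations from its training mean (max_z is an assumed threshold)."""
    for value, (mu, sigma) in zip(row, stats):
        if sigma == 0:
            # Feature was constant in training; any other value is new.
            if value != mu:
                return True
            continue
        if abs(value - mu) / sigma > max_z:
            return True
    return False


# Illustrative training data: (amount, items_per_order)
training_rows = [(10.0, 1.0), (12.0, 2.0), (11.0, 1.5), (9.0, 1.2)]
stats = fit_feature_stats(training_rows)

typical = is_unfamiliar((11.0, 1.4), stats)    # resembles the training data
outlier = is_unfamiliar((500.0, 1.4), stats)   # amount far outside training range
```

A system with no human in the loop could use a check like this to fall back to a conservative default instead of acting on a prediction the model was never trained to make.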
In real-time applications you don’t have a human in the loop. This means that the robustness and the security of the algorithms are even more important than they used to be.
The importance of secure execution
There are several angles here about security. One is that, typically, when you want to make these intelligent decisions, you want to actually target per user decisions, per customer decisions. … In order to make these kinds of decisions, it helps you tremendously if you look across customers’ data. … There is a question: how are you going to learn across customers, based on their data, while maintaining their confidentiality and privacy?
We cannot see how one can provide support for this large set of decisions without ensuring security. … Let me put it another way: we all know that a decision would be better if you could make use of individual-level information. It’s a no-brainer. The challenge is that more and more people are “privacy aware,” and it’s harder and harder to use individual-level information.
… If I tell you I’m going to use your data, but it’s going to be private, it’s not going to leak—and that even I, as the owner of this application, am not going to know your data—you are going to be much more likely to share that information with my tools than if I don’t give you that guarantee.
Related resources:
- Michael Franklin on the lasting legacy of AMPLab
- Evaluating machine learning models
- The Spark video collection: 2016
- Apache Spark 2.0: introduction to structured streaming