Vehicle-to-vehicle communication networks can help fuel smart cities

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Bruno Fernandez-Ruiz on the importance of building the ground control center of the future.

In this episode of the Data Show, I spoke with Bruno Fernandez-Ruiz, co-founder and CTO of Nexar. We first met when he was leading Yahoo! technical teams charged with delivering a variety of large-scale, real-time data products. His new company is helping build out critical infrastructure for the emerging transportation sector.

While some question whether V2X communication is necessary to get to fully autonomous vehicles, Nexar is already paving the way by demonstrating how a vehicle-to-vehicle (V2V) communication network can be built efficiently. As Fernandez-Ruiz points out, there are many applications for such a V2V network (safety being the most obvious one). I’m particularly fascinated by what such a network, and the accompanying data, opens up for future, smarter cities. As I pointed out in a post on continuous learning, simulations are an important component of training AI applications. It seems reasonable to expect that the data sets collected by V2V networks will be useful for smart city planners of the future.

Continue reading “Vehicle-to-vehicle communication networks can help fuel smart cities”

Transforming organizations through analytics centers of excellence

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Carme Artigas on helping enterprises transform themselves with big data tools and technologies.

In this episode of the Data Show, I spoke with Carme Artigas, co-founder and CEO of Synergic Partners (a Telefonica company). As more companies adopt big data technologies and techniques, it’s useful to remember that the end goal is to extract information and insight. In fact, as with any collection of tools and technologies, the main challenge is identifying and prioritizing use cases.

As Artigas describes, one can categorize use cases for big data into the following types:

  • Improve decision-making or operational efficiency
  • Generate new or additional revenue
  • Predict or prevent fraud (forecasting or minimizing risks)

Artigas has spent many years helping large organizations develop best practices for how to use data and analytics. We discussed some of the key challenges faced by organizations that wish to adopt big data technologies, centers of excellence for analytics, and AI in the enterprise.
Continue reading “Transforming organizations through analytics centers of excellence”

The state of machine learning in Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.

In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.

We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley, Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered around machine learning. Like many in the audience, I was first attracted to Spark because it simultaneously allowed me to scale machine learning algorithms to large data sets while providing reasonable latency.

Here is a partial list of the items we discussed:

  • The current state of machine learning in Spark.
  • Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), we discussed the role of Spark ML moving forward.
  • The plan to make it easier to integrate advanced analytics libraries that aren’t “textbook machine learning,” like NLP, time series analysis, and graph analysis into Spark and Spark ML pipelines.
  • Some upcoming projects from Berkeley and Stanford that target AI applications (including newer systems that provide lower latency, higher throughput).
  • Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning — lack of training data, and deploying and monitoring models in production.

[Full disclosure: I am an advisor to Databricks.]

Related resources:

Effective mechanisms for searching the space of machine learning algorithms

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Kenneth Stanley on neuroevolution and other principled ways of exploring the world without an objective.

In this episode of the Data Show, I spoke with Ken Stanley, founding member of Uber AI Labs and associate professor at the University of Central Florida. Stanley is an AI researcher and a leading pioneer in the field of neuroevolution—a method for evolving and learning neural networks through evolutionary algorithms. In a recent survey article, Stanley went through the history of neuroevolution and listed recent developments, including its applications to reinforcement learning problems.

Stanley is also the co-author of a book entitled Why Greatness Cannot Be Planned: The Myth of the Objective—a book I’ve been recommending to anyone interested in innovation, public policy, and management. Inspired by Stanley’s research in neuroevolution (into topics like novelty search and open endedness), the book is filled with examples of how notions first uncovered in the field of AI can be applied to many other disciplines and domains.

The book closes with a case study that hits closer to home—the current state of research in AI. One can think of machine learning and AI as a search for ever better algorithms and models. Stanley points out that gatekeepers (editors of research journals, conference organizers, and others) impose two objectives that researchers must meet before their work gets accepted or disseminated: (1) empirical: their work should beat incumbent methods on some benchmark task, and (2) theoretical: proposed new algorithms are better if they can be proven to have desirable properties. Stanley argues this means that interesting work (“stepping stones”) that fail to meet either of these criteria fall by the wayside, preventing other researchers from building on potentially interesting but incomplete ideas.
Continue reading “Effective mechanisms for searching the space of machine learning algorithms”

The current state of applied data science

[A version of this post appears on the O’Reilly Radar.]

Recent trends in practical use and a discussion of key bottlenecks in supervised machine learning.

As we enter the latter part of 2017, it’s time to take a look at the common challenges faced by companies interested in using data science and machine learning (ML). Let’s assume your organization is already collecting data at a scale that justifies the use of analytic tools, and that you’ve managed to identify and prioritize use cases where data science can be transformative (including improvements to decision-making or business operations, increasing revenue, etc.). Data gathering and identifying interesting problems are non-trivial, but assuming you’ve gotten a healthy start on these tasks, what challenges remain?

Data science is a large topic, so I’ll offer a disclaimer: this post is mainly about the use of supervised machine learning today, and it draws from a series of conversations over the last few months. I’ll have more to say about AI systems in future posts, but such systems clearly rely on more than just supervised learning.

It all begins with (training) data

Even assuming you have a team that handles data ingestion and integration, and a team that maintains a data platform (“source of truth”) for you, new data sources continue to appear, and it’s incumbent on domain experts to highlight them. Moreover, since we’re dealing mainly with supervised learning, it’s no surprise that lack of training data remains the primary bottleneck in machine learning projects.

There are some good research projects and tools for quickly creating large training data sets (or augmenting existing ones). Stanford researchers have shown that weak supervision and data programming can be used to train models without access to a lot of hand-labeled training data. Preliminary work on generative models (by deep learning researchers) have produced promising results in unsupervised learning in computer vision and other areas.
Continue reading “The current state of applied data science”

How Ray makes continuous learning accessible and easy to scale

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Robert Nishihara and Philipp Moritz on a new framework for reinforcement learning and AI applications.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on StitcherTuneIniTunesSoundCloudRSS.

In this episode of the Data Show, I spoke with Robert Nishihara and Philipp Moritz, graduate students at UC Berkeley and members of RISE Lab. I wanted to get an update on Ray, an open source distributed execution framework that makes it easy for machine learning engineers and data scientists to scale reinforcement learning and other related continuous learning algorithms. Many AI applications involve an agent (for example a robot or a self-driving car) interacting with an environment. In such a scenario, an agent will need to continuously learn the right course of action to take for a specific state of the environment.

What do you need in order to build large-scale continuous learning applications? You need a framework with low-latency response times, one that is able to run massive numbers of simulations quickly (agents need to be able explore states within an environment), and supports heterogeneous computation graphs. Ray is a new execution framework written in C++ that contains these key ingredients. In addition, Ray is accessible via Python (and Jupyter Notebooks), and comes with many of the standard reinforcement learning and related continuous learning algorithms that users can easily call.

As Nishihara and Moritz point out, frameworks like Ray are also useful for common applications such as dialog systems, text mining, and machine translation. Here are some highlights from our conversation:

Tools for reinforcement learning

Ray is something we’ve been building that’s motivated by our own research in machine learning and reinforcement learning. If you look at what researchers who are interested in reinforcement learning are doing, they’re largely ignoring the existing systems out there and building their own custom frameworks or custom systems for every new application that they work on.

… For reinforcement learning, you need to be able to share data very efficiently, without copying it between multiple processes on the same machine, you need to be able to avoid expensive serialization and deserialization, and you need to be able to create a task and get the result back in milliseconds instead of hundreds of milliseconds. So, there are a lot of little details that come up.
Continue reading “How Ray makes continuous learning accessible and easy to scale”

Why continuous learning is key to AI

[A version of this post appears on the O’Reilly Radar.]

A look ahead at the tools and methods for learning from sparse feedback.

As more companies begin to experiment with and deploy machine learning in different settings, it’s good to look ahead at what future systems might look like. Today, the typical sequence is to gather data, learn some underlying structure, and deploy an algorithm that systematically captures what you’ve learned. Gathering, preparing, and enriching the right data—particularly training data—is essential and remains a key bottleneck among companies wanting to use machine learning.

I take for granted that future AI systems will rely on continuous learning as opposed to algorithms that are trained offline. Humans learn this way, and AI systems will increasingly have the capacity to do the same. Imagine visiting an office for the first time and tripping over an obstacle. The very next time you visit that scene—perhaps just a few minutes later—you’ll most likely know to look out for the object that tripped you.

There are many applications and scenarios where learning takes on a similar exploratory nature. Think of an agent interacting with an environment while trying to learn what actions to take and which ones to avoid in order to complete some preassigned task. We’ve already seen glimpses of this with recent applications of reinforcement learning (RL). In RL, the goal is to learn how to map observations and measurements to a set of actions, while trying to maximize some long-term reward. (The term RL is frequently used to describe both a class of problems and a set of algorithms.) While deep learning gets more media attention, there are many interesting recent developments in RL that are well known within AI circles. Researchers have recently applied RL to game play, robotics, autonomous vehicles, dialog systems, text summarization, education and training, and energy utilization.
Continue reading “Why continuous learning is key to AI”