Responsible deployment of machine learning

[A version of this post appears on the O’Reilly Radar.]

We need to build machine learning tools to augment our machine learning engineers.

In this post, I share slides and notes from a talk I gave in December 2017 at the Strata Data Conference in Singapore offering suggestions to companies that are actively deploying products infused with machine learning capabilities. Over the past few years, the data community has focused on infrastructure and platforms for data collection, including robust pipelines and highly scalable storage systems for analytics. According to a recent LinkedIn report, the top two emerging jobs are “machine learning engineer” and “data scientist.” Companies are starting to staff up to put their data infrastructures to work, and machine learning is going to become more prevalent in the years to come.

As more companies start using machine learning in products, tools, and business processes, let’s take a quick tour of model building, model deployment, and model management. It turns out that once a model is built, deploying and managing it in production requires engineering skills. So much so that earlier this year, we noted that companies have created a new job role—machine learning (or deep learning) engineer—for people tasked with productionizing machine learning models.

Modern machine learning libraries and tools like notebooks have made model building simpler. New data scientists need to make sure they understand the business problem and optimize their models for it. In a diverse region like Southeast Asia, models need to be localized, as conditions and contexts differ across ASEAN countries.
Continue reading “Responsible deployment of machine learning”

What lies ahead for data in 2018

[A version of this post appears on the O’Reilly Radar.]

How new developments in algorithms, machine learning, analytics, infrastructure, data ethics, and culture will shape data in 2018.

1. New tools will make graphs and time series easier, leading to new use cases

Graphs and time series have been a crucial part of the explosion in big data. 2018 will see the emergence of a new generation of tools for storing and analyzing graphs and time series at large scale. These new analytic and visualization tools will help product groups devise new offerings, especially for use cases in security and fraud detection.

2. More companies will join data partnerships to share data

In 2016, I started hearing companies express interest in data sharing platforms, and startups have now begun to build data exchanges that allow companies to share data across organizational boundaries while protecting privacy and IP. Ideas from the blockchain world, particularly crypto and distributed control, have inspired some of these initiatives. Data partnerships are taking hold in financial services, and I expect this trend to spread to other industries this year.
Continue reading “What lies ahead for data in 2018”

How machine learning will accelerate data management systems

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Tim Kraska on why ML will change how we build core algorithms and data structures.

In this episode of the Data Show, I spoke with Tim Kraska, associate professor of computer science at MIT. To take advantage of big data, we need scalable, fast, and efficient data management systems. Database administrators and users often find themselves tasked with building index structures (“indexes” in database parlance), which are needed to speed up data access.

Some common examples include:

  • B-Trees—used for range requests (e.g., assemble all sales orders within a certain time frame)
  • Hash maps—used for key-based lookups
  • Bloom filters—used to check whether an element or piece of data is present in a set (see the sketch after this list)
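
To make the last item concrete, here is a minimal Bloom filter sketch in Python (an illustrative toy, not a production implementation): the structure answers “possibly present” or “definitely absent,” trading a small false-positive rate for a compact bit array.

    # Minimal Bloom filter sketch (illustrative only; not a production implementation).
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = [False] * num_bits

        def _positions(self, item):
            # Derive several bit positions from one item using salted SHA-256 digests.
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item):
            # False means definitely absent; True means possibly present (false positives allowed).
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("order-42")
    print(bf.might_contain("order-42"))   # True
    print(bf.might_contain("order-999"))  # almost certainly False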

Index structures take up space in a database, so you need to be selective about what to index; moreover, traditional indexes do not take advantage of the underlying data distribution. I’ve worked in settings where an administrator or expert user carefully implements a strategy for building indexes for a data warehouse based on important and common queries.

Indexes are really models or mappings—for instance, a Bloom filter can be thought of as a classifier. In a recent paper, Kraska and his collaborators approach indexing as a learning problem. As such, they are able to build indexes that take into account the underlying data distribution, are smaller (thus allowing for a more liberal indexing strategy), and execute faster. Software and hardware for computation are getting cheaper and better, so using machine learning to create index structures may indeed become routine.
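
As a rough illustration of the learned-index idea (a toy approximation, not the authors’ implementation), one can fit a simple model of key-to-position over a sorted array, record its worst-case error, and correct each prediction with a bounded local search:

    # Toy sketch of a "learned index": a model predicts the position of a key in a
    # sorted array, and a bounded search around the prediction fixes any error.
    import numpy as np

    keys = np.sort(np.random.uniform(0, 1_000_000, size=100_000))
    positions = np.arange(len(keys))

    # A linear model of position as a function of key approximates the data's CDF.
    slope, intercept = np.polyfit(keys, positions, deg=1)

    # Record the model's worst-case error so lookups only search a small window.
    max_err = int(np.max(np.abs(slope * keys + intercept - positions))) + 1

    def lookup(key):
        guess = int(slope * key + intercept)
        lo = max(0, guess - max_err)
        hi = min(len(keys), guess + max_err + 1)
        idx = lo + np.searchsorted(keys[lo:hi], key)
        return idx if idx < len(keys) and keys[idx] == key else None

    print(lookup(keys[12345]) == 12345)  # True
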
Continue reading “How machine learning will accelerate data management systems”

The current state of Apache Kafka

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Neha Narkhede on data integration, microservices, and Kafka’s roadmap.

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “The Age of Machine Learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics—including machine learning—remains an area of focus for most companies. It turns out “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. With tools that make it easier to create and productionize data refinement pipelines over both batch and streaming sources, analysts and data scientists can focus on analytics that unlock value from data.

On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.
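
For readers new to Kafka, here is a minimal sketch of what producing and consuming messages looks like with the kafka-python client; the topic name, broker address, and payload are placeholders:

    # Minimal produce/consume sketch using the kafka-python client.
    # The topic name, broker address, and payload are illustrative placeholders.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", b'{"user": "u123", "url": "/pricing"}')
    producer.flush()

    consumer = KafkaConsumer("page-views",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)
        break  # read just one record for the example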

Here are some highlights from our conversation:

The first engineering project that made use of Apache Kafka

If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and now sort of make them Hadoop-specific? And what happens when a third and a fourth and a fifth system is put into place?

So, the initial motivating use case was: ‘we are putting this Hadoop thing into place. That’s the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka’s very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka.
Continue reading “The current state of Apache Kafka”

Building a natural language processing library for Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: David Talby on a new NLP library for Spark, and why model development starts after a model gets deployed to production.

When I first discovered and started using Apache Spark, most of my use cases involved unstructured text. The absence of libraries meant rolling my own NLP utilities, and, in many cases, implementing a machine learning library (this was pre-deep learning, and MLlib was much smaller). I’d always wondered why no one bothered to create an NLP library for Spark when so many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries.
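
For context, what Spark offered out of the box for text was fairly basic; a sketch using PySpark’s built-in Tokenizer and StopWordsRemover transformers is below (the data and column names are illustrative):

    # Basic text processing with Spark's built-in ML feature transformers.
    # The DataFrame contents and column names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, StopWordsRemover

    spark = SparkSession.builder.appName("basic-text").getOrCreate()
    df = spark.createDataFrame(
        [(0, "Spark makes large scale text processing possible")],
        ["id", "text"])

    tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
    cleaned = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
    cleaned.select("filtered").show(truncate=False)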

In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago, I mentioned to Talby the need for an NLP library within Spark; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception BigDL received and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will become a standard tool among Spark users.

Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.

Here are some highlights from our conversation:

The state of NLP in Spark

Here are your two choices today. Either you want to leverage all of the performance and optimization that Spark gives you, which means you want to stay basically within the JVM and use a Java-based library. In that case, you have options that include OpenNLP, which is open source, or Stanford NLP, which requires licensing for use in a commercial product. These are older and more academically oriented libraries, so they have limitations in performance and in what they do.
Continue reading “Building a natural language processing library for Apache Spark”

How companies can navigate the age of machine learning

[A version of this post appears on the O’Reilly Radar.]

To become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.

Over the last few years, the data community has focused on gathering and collecting data, building infrastructure for that purpose, and using data to improve decision-making. We are now seeing a surge in interest in advanced analytics and machine learning across many industry verticals.

In this post, I share slides and notes from a talk I gave this past September at Strata Data NYC offering suggestions to companies interested in adding machine learning capabilities. The information stems from conversations with practitioners, researchers, and entrepreneurs at the forefront of applying machine learning across many different problem domains.
Continue reading “How companies can navigate the age of machine learning”

How Ray makes continuous learning accessible and easy to scale

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Robert Nishihara and Philipp Moritz on a new framework for reinforcement learning and AI applications.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, or RSS.

In this episode of the Data Show, I spoke with Robert Nishihara and Philipp Moritz, graduate students at UC Berkeley and members of RISE Lab. I wanted to get an update on Ray, an open source distributed execution framework that makes it easy for machine learning engineers and data scientists to scale reinforcement learning and other related continuous learning algorithms. Many AI applications involve an agent (for example a robot or a self-driving car) interacting with an environment. In such a scenario, an agent will need to continuously learn the right course of action to take for a specific state of the environment.

What do you need in order to build large-scale continuous learning applications? You need a framework with low-latency response times, one that can run massive numbers of simulations quickly (agents need to be able to explore states within an environment) and that supports heterogeneous computation graphs. Ray is a new execution framework written in C++ that contains these key ingredients. In addition, Ray is accessible via Python (and Jupyter Notebooks), and it comes with many of the standard reinforcement learning and related continuous learning algorithms that users can easily call.
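
Here is a minimal sketch of Ray’s programming model; the rollout function and its reward are stand-ins for an agent interacting with an environment, not part of Ray itself:

    # Minimal Ray sketch: run simulated "rollouts" as parallel remote tasks.
    # The rollout logic and reward here are placeholders, not part of Ray itself.
    import random
    import ray

    ray.init()

    @ray.remote
    def rollout(seed, num_steps=100):
        # Stand-in for an agent interacting with an environment for num_steps.
        rng = random.Random(seed)
        return sum(rng.random() for _ in range(num_steps))

    # Launch several rollouts in parallel; futures return immediately,
    # and ray.get() blocks until the results are ready.
    futures = [rollout.remote(seed) for seed in range(8)]
    print(ray.get(futures))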

As Nishihara and Moritz point out, frameworks like Ray are also useful for common applications such as dialog systems, text mining, and machine translation. Here are some highlights from our conversation:

Tools for reinforcement learning

Ray is something we’ve been building that’s motivated by our own research in machine learning and reinforcement learning. If you look at what researchers who are interested in reinforcement learning are doing, they’re largely ignoring the existing systems out there and building their own custom frameworks or custom systems for every new application that they work on.

… For reinforcement learning, you need to be able to share data very efficiently, without copying it between multiple processes on the same machine; you need to avoid expensive serialization and deserialization; and you need to be able to create a task and get the result back in milliseconds instead of hundreds of milliseconds. So, there are a lot of little details that come up.
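
To make the data-sharing point concrete, here is a small standalone sketch: a large array is placed once in Ray’s shared object store with ray.put, and multiple tasks read it without per-task copies (the array and the column-sum task are illustrative):

    # Sketch of sharing one large object across many Ray tasks without copying it
    # into each worker process; the array and the column-sum task are illustrative.
    import numpy as np
    import ray

    ray.init()

    big_array = np.zeros((10_000, 100))
    array_ref = ray.put(big_array)  # stored once in the shared-memory object store

    @ray.remote
    def column_sum(arr, col):
        return float(arr[:, col].sum())

    # Each task receives the object by reference; Ray resolves it before the call.
    print(ray.get([column_sum.remote(array_ref, c) for c in range(4)]))
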
Continue reading “How Ray makes continuous learning accessible and easy to scale”