Building accessible tools for large-scale computation and machine learning

[A version of this post appears on the O’Reilly Radar.]

In this episode of the Data Show, I spoke with Eric Jonas, a postdoc in the new Berkeley Center for Computational Imaging. Jonas is also affiliated with UC Berkeley’s RISE Lab. It was at a RISE Lab event that he first announced Pywren, a framework that lets data enthusiasts proficient with Python run existing code at massive scale on Amazon Web Services. Jonas and his collaborators are working on a related project, NumPyWren, a system for linear algebra built on a serverless architecture. Their hope is that by lowering the barrier to large-scale (scientific) computation, we will see many more experiments and research projects from communities that have been unable to easily marshal massive compute resources. We talked about Bayesian machine learning, scientific computation, reinforcement learning, and his stint as an entrepreneur in the enterprise software space.

Here are some highlights from our conversation:

Pywren

The real enabling technology for us was when Amazon announced the availability of AWS Lambda, their microservices framework, in 2014. Following this prompting, I went home one weekend and thought, ‘I wonder how hard it is to take an arbitrary Python function and marshal it across the wire, get it running in Lambda; I wonder how many I can get at once?’ Thus, Pywren was born.
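The core of that weekend experiment, serializing a plain Python function, shipping its bytes "across the wire," and fanning it out to many workers at once, can be sketched locally with the standard library. This is an illustrative stand-in, not Pywren's actual API: a thread pool plays the role of the fleet of Lambda invocations, and `marshal` plays the role of the real serialization machinery.

```python
import marshal
import types
from concurrent.futures import ThreadPoolExecutor

def ship_and_run(payload):
    # Stand-in for a Lambda worker: rebuild the function from its
    # serialized bytecode and run it on the shipped argument.
    code_bytes, arg = payload
    fn = types.FunctionType(marshal.loads(code_bytes),
                            {"__builtins__": __builtins__})
    return fn(arg)

def parallel_map(fn, args, workers=8):
    # Serialize the function's bytecode once, pair it with each
    # argument, and fan the payloads out to a pool of workers,
    # locally here, where Pywren would use one Lambda call each.
    code_bytes = marshal.dumps(fn.__code__)
    payloads = [(code_bytes, a) for a in args]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ship_and_run, payloads))

def square(x):
    return x * x

results = parallel_map(square, range(5))
```

The same map-style call shape is what makes the model attractive: existing code that loops over independent inputs needs almost no restructuring to scale out.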
Continue reading “Building accessible tools for large-scale computation and machine learning”

Simplifying machine learning lifecycle management

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Harish Doddi on accelerating the path from prototype to production.

In this episode of the Data Show, I spoke with Harish Doddi, co-founder and CEO of Datatron, a startup focused on helping companies deploy and manage machine learning models. As companies move from machine learning prototypes to products and services, tools and best practices for productionizing and managing models are just starting to emerge. Today’s data science and data engineering teams work with a variety of machine learning libraries, data ingestion, and data storage technologies. Risk and compliance considerations mean that the ability to reproduce machine learning workflows is essential to meet audits in certain application domains. And as data science and data engineering teams continue to expand, tools need to enable and facilitate collaboration.

As someone who specializes in helping teams turn machine learning prototypes into production-ready services, I wanted to hear what Doddi has learned while working with organizations that aspire to “become machine learning companies.”

Here are some highlights from our conversation:

A central platform for building, deploying, and managing machine learning models

In one of the companies where I worked, we had built infrastructure related to Spark. We were a heavy Spark shop. So we built everything around Spark and other components. But later, when that organization grew, a lot of people came from a TensorFlow background. That suddenly created a little bit of frustration in the team because everybody wanted to move to TensorFlow. But we had invested a lot of time, effort and energy in building the infrastructure for Spark.

… We suddenly had hidden technical debt that needed to be addressed. … Let’s say right now you have two models running in production and you know that in the next two or three years you are going to deploy 20 to 30 models. You need to start thinking about this ahead of time.
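One way to get ahead of that growth is a central registry that tracks each model's framework, version, and deployment stage, so the 30th model is as routine to manage as the 2nd. The sketch below is purely illustrative; the class and field names are hypothetical, not Datatron's API.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    name: str
    version: int
    framework: str          # e.g. "spark" or "tensorflow"
    stage: str = "staging"  # lifecycle stage for audits/rollbacks

@dataclass
class ModelRegistry:
    _records: dict = field(default_factory=dict)

    def register(self, name, version, framework):
        # Every model version gets a record, regardless of framework,
        # so a mixed Spark/TensorFlow shop shares one source of truth.
        self._records[(name, version)] = ModelRecord(name, version, framework)

    def promote(self, name, version):
        self._records[(name, version)].stage = "production"

    def production_models(self):
        return [r for r in self._records.values() if r.stage == "production"]
```

Keeping the registry framework-agnostic is the point: it decouples lifecycle management from whichever library a given team happens to favor.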
Continue reading “Simplifying machine learning lifecycle management”

Notes from the first Ray meetup

[A version of this post appears on the O’Reilly Radar.]

Ray is beginning to be used to power large-scale, real-time AI applications.

Machine learning adoption is accelerating due to the growing number of large labeled data sets, languages aimed at data scientists (R, Julia, Python), frameworks (scikit-learn, PyTorch, TensorFlow, etc.), and tools for building infrastructure to support end-to-end applications. While some interesting applications of unsupervised learning are beginning to emerge, many current machine learning applications rely on supervised learning. In a recent series of posts, Ben Recht makes the case for why some of the most interesting problems might actually fall under reinforcement learning (RL), specifically systems that are able to act based upon past data and do so in a way that is safe, robust, and reliable.

But first we need RL tools that are accessible to practitioners. Unlike supervised learning, RL has lacked a good open source tool for easily experimenting at scale. I think things are about to change. I was fortunate enough to receive an invite to the first meetup devoted to Ray, RISE Lab's high-performance, distributed execution engine, which targets emerging AI applications, including those that rely on reinforcement learning. This was a small, invite-only affair held at OpenAI, and most of the attendees were interested in reinforcement learning.


Here’s a brief rundown of the program:

  • Robert Nishihara and Philipp Moritz gave a brief overview and update on the Ray project, including a description of items on the near-term roadmap.
  • Eric Liang and Richard Liaw gave a quick tour of two libraries built on top of Ray: RLlib (scalable reinforcement learning) and Tune (a hyperparameter optimization framework). They also pointed to a recent ICML paper on RLlib. Both libraries are easily accessible to anyone familiar with Python, and both should prove popular among industrial data scientists.

RLlib and reinforcement learning. Image courtesy of RISE Lab.

  • Eugene Vinitsky showed some amazing videos of how Ray is helping his team understand and predict traffic patterns in real time, and in the process helping researchers study large transportation networks. The videos were some of the best examples of the combination of IoT, sensor networks, and reinforcement learning that I’ve seen.
  • Alex Bao of Ant Financial described three applications they’ve identified for Ray. I’m not sure I’m allowed to describe them here, but they were all very interesting and important use cases. The most important takeaway of the evening was that Ant Financial is already using Ray in production in two of the three use cases (and is close to deploying Ray to production for the third)! Given that Ant Financial is the largest unicorn company in the world, this is amazing validation for Ray.

With the buzz generated by the evening’s presentations and the first production deployments already underway, I expect meetups on Ray to start springing up in other geographic areas. We are still in the early stages of adoption of machine learning technologies. The presentations at this meetup confirm that an accessible and scalable platform like Ray opens up many possible applications of reinforcement learning and online learning.
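The core loop that a library like Tune scales out, sample a configuration, evaluate it, keep the best, can be sketched in a few lines of plain Python. The `train` objective below is a hypothetical stand-in for a real training run, and the function names are mine, not Tune's API; what Tune adds is distributing these trials across a Ray cluster and scheduling them intelligently.

```python
import random

def train(config):
    # Hypothetical objective standing in for a real training run:
    # the score peaks when the learning rate is near 0.1.
    return -(config["lr"] - 0.1) ** 2

def random_search(space, trials=50, seed=0):
    # Sample a config from the search space, evaluate it, keep the
    # best; each iteration is an independent trial a framework could
    # run on a separate worker.
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = train(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best, score = random_search({"lr": (0.001, 1.0)})
```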

How privacy-preserving techniques can lead to more robust machine learning models

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Chang Liu on operations research, and the interplay between differential privacy and machine learning.

In this episode of the Data Show, I spoke with Chang Liu, applied research scientist at Georgian Partners. In a previous post, I highlighted early tools for privacy-preserving analytics, both for improving decision-making (business intelligence and analytics) and for enabling automation (machine learning). One of the tools I mentioned is an open source project for SQL-based analysis that adheres to state-of-the-art differential privacy (a formal guarantee that provides robust privacy assurances). Since business intelligence typically relies on SQL databases, this open source project is something many companies can already benefit from today.

What about machine learning? While I didn’t have space to point this out in my previous post, differential privacy has long been an area of interest to machine learning researchers. Most practicing data scientists aren’t aware of the research results, and popular data science tools haven’t incorporated differential privacy in meaningful ways (if at all). But things will change over the coming months. For example, Liu wants to make ideas from differential privacy accessible to industrial data scientists, and she is part of a team building tools to make this happen.
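For readers new to the topic, the canonical differential privacy building block is the Laplace mechanism: add noise calibrated to a query's sensitivity (how much one person's data can change the answer). The sketch below is a generic illustration of that mechanism, and the function names are mine, not part of any tool mentioned here.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(values, predicate, epsilon, rng=None):
    # A counting query has sensitivity 1 (one person changes the
    # answer by at most 1), so adding Laplace noise with scale
    # 1/epsilon yields an epsilon-differentially-private release.
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [34, 41, 29, 55, 62, 38, 47]
noisy = private_count(ages, lambda a: a > 40, epsilon=0.5,
                      rng=random.Random(0))
```

Smaller epsilon means stronger privacy and noisier answers; the engineering work Liu describes is largely about making that trade-off usable for working data scientists.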

Here are some highlights from our conversation:
Continue reading “How privacy-preserving techniques can lead to more robust machine learning models”