Simplifying machine learning lifecycle management

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Harish Doddi on accelerating the path from prototype to production.

In this episode of the Data Show, I spoke with Harish Doddi, co-founder and CEO of Datatron, a startup focused on helping companies deploy and manage machine learning models. As companies move from machine learning prototypes to products and services, tools and best practices for productionizing and managing models are just starting to emerge. Today’s data science and data engineering teams work with a variety of machine learning libraries, data ingestion, and data storage technologies. Risk and compliance considerations mean that the ability to reproduce machine learning workflows is essential to meet audits in certain application domains. And as data science and data engineering teams continue to expand, tools need to enable and facilitate collaboration.

As someone who specializes in helping teams turn machine learning prototypes into production-ready services, I wanted to hear what Doddi has learned while working with organizations that aspire to “become machine learning companies.”

Here are some highlights from our conversation:

A central platform for building, deploying, and managing machine learning models

In one of the companies where I worked, we had built infrastructure related to Spark. We were a heavy Spark shop. So we built everything around Spark and other components. But later, when that organization grew, a lot of people came from a TensorFlow background. That suddenly created a little bit of frustration in the team because everybody wanted to move to TensorFlow. But we had invested a lot of time, effort and energy in building the infrastructure for Spark.

… We suddenly had hidden technical debt that needed to be addressed. … Let’s say right now you have two models running in production and you know that in the next two or three years you are going to deploy 20 to 30 models. You need to start thinking about this ahead of time.
Continue reading “Simplifying machine learning lifecycle management”

Data collection and data markets in the age of privacy and machine learning

[A version of this post appears on the O’Reilly Radar.]

While models and algorithms garner most of the media coverage, this is a great time to be thinking about building tools in data.

In this post I share slides and notes from a keynote I gave at the Strata Data Conference in London at the end of May. My goal was to remind the data community about the many interesting opportunities and challenges in data itself. Much of the focus of recent press coverage has been on algorithms and models, specifically the expanding utility of deep learning. Because large deep learning architectures are quite data hungry, the importance of data has grown even more. In this short talk, I describe some interesting trends in how data is valued, collected, and shared.

Economic value of data

It’s no secret that companies place a lot of value on data and the data pipelines that produce key features. In the early phases of adopting machine learning (ML), companies focus on making sure they have sufficient amount of labeled (training) data for the applications they want to tackle. They then investigate additional data sources that they can use to augment their existing data. In fact, among many practitioners, data remains more valuable than models (many talk openly about what models they use, but are reticent to discuss the features they feed into those models).

But if data is precious, how do we go about estimating its value? For those among us who build machine learning models, we can estimate the value of data by examining the cost of acquiring training data:
Continue reading “Data collection and data markets in the age of privacy and machine learning”

Data regulations and privacy discussions are still in the early stages

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Aurélie Pols on GDPR, ethics, and ePrivacy.

In this episode of the Data Show, I spoke with Aurélie Pols of Mind Your Privacy, one of my go-to resources when it comes to data privacy and data ethics. This interview took place at Strata Data London, a couple of days before the EU General Data Protection Regulation (GDPR) took effect. I wanted her perspective on this landmark regulation, as well as her take on trends in data privacy and growing interest in ethics among data professionals.

Here are some highlights from our conversation:

GDPR is just the starting point

GDPR is not an end point. It’s a starting point for a journey where a balance between companies and society and users of data needs to be redefined. Because when I look at my children, I look at how they use technology, I look at how smart my house might become or my car or my fridge, I know that in the long run this idea of giving consent to my fridge to share data is not totally viable. What are we going to be build for the next generations?
Continue reading “Data regulations and privacy discussions are still in the early stages”

The evolution of data science, data engineering, and AI

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: A special episode to mark the 100th episode.

This episode of the Data Showmarks our 100th episode. This podcast stemmed out of video interviews conducted at O’Reilly’s 2014 Foo Camp. We had a collection of friends who were key members of the data science and big data communities on hand and we decided to record short conversations with them. We originally conceived of using those initial conversations to be the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born.

To mark this milestone, my colleague Paco Nathan, co-chair of Jupytercon, turned the tables on me and asked me questions about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses and companies. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people machine learning is AI. I’m trying to, in the AI Conference series, convince people that a true AI system will involve many components, machine learning being one. Many of the guests I have seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential things in their infrastructure. I think while that might still be the case, there are more alternatives these days. I think a lot of people are going to object stores in the cloud. Another examples is that before, people maintained specialized systems. There’s still that, but people are trying to see if they can combine some of these systems, or come up with systems that can do more than one workload. For example, this whole notion in Spark of having a unified system that is able to do batch in streaming caught on during the span of this podcast.

Related resources:

How to build analytic products in an age when data privacy has become critical

[A version of this post appears on the O’Reilly Radar.]

Privacy-preserving analytics is not only possible, but with GDPR about to come online, it will become necessary to incorporate privacy in your data products.

In this post, I share slides and notes from a talk I gave in March 2018 at the Strata Data Conference in California, offering suggestions for how companies may want to build analytic products in an age when data privacy has become critical. A lot has changed since I gave this presentation: numerous articles have been written about Facebook’s privacy policies, its CEO testified twice before the U.S. Congress, and I deactivated my mostly dormant Facebook account. The end result being that there’s even a more heightened awareness around data privacy, and people are acknowledging that problems go beyond a few companies or a few people.

Let me start by listing a few observations regarding data privacy:

Which brings me to the main topic of this presentation: how do we build analytic services and products in an age when data privacy has emerged as an important issue? Architecting and building data platforms is central to what many of us do. We have long recognized that data security and data privacy are required features for our data platforms, but how do we “lock down” analytics?

Once we have data securely in place, we proceed to utilize it in two main ways: (1) to make better decisions (BI) and (2) to enable some form of automation (ML). It turns out there are some new tools for building analytic products that preserve privacy. Let me give a quick overview of a few things you may want to try today.
Continue reading “How to build analytic products in an age when data privacy has become critical”

What machine learning engineers need to know

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jesse Anderson and Paco Nathan on organizing data teams and next-generation messaging with Apache Pulsar.

In this episode of the Data Show, I spoke Jesse Anderson, managing director of the Big Data Institute, and my colleague Paco Nathan, who recently became co-chair of Jupytercon. This conversation grew out of a recent email thread the three of us had on machine learning engineers, a new job role that LinkedIn recently pegged as the fastest growing job in the U.S. In our email discussion, there was some disagreement on whether such a specialized job role/title was needed in the first place. As Eric Colson pointed out in his beautiful keynote at Strata Data San Jose, when done too soon, creating specialized roles can slow down your data team.

We recorded this conversation at Strata San Jose, while Anderson was in the middle of teaching his very popular two-day training course on real-time systems. We closed the conversation with Anderson’s take on Apache Pulsar, a very impressive new messaging system that is starting to gain fans among data engineers.

Here are some highlights from our conversation:

Why we need machine learning engineers

Jesse Anderson: One of the issues I’m seeing as I work with teams is that they’re trying to operationalize machine learning models, and the data scientists are not the one to productionize these. They simply don’t have the engineering skills. Conversely, the data engineers don’t have the skills to operationalize this either. So, we’re seeing this kind of gap in between the data science and the data engineering, and the gap I’m seeing and the way I’m seeing it being filled, is through a machine learning engineer.

… I disagree with Paco that generalization is the way to go. I think it’s hyper-specialization, actually. This is coming from my experience having taught a lot of enterprises. At a startup, I would say that super-specialization is probably not going to be as possible, but at an enterprise, you are going to have to have a team that specializes in big data, and that is a part from a team, even a software engineering team, that doesn’t work with data.

Putting Apache Pulsar on the radar of data engineers

Key features of Apache Pulsar. Image by Karthik Ramasamy, used with permission.


Jesse Anderson: A lot of my time, since I’m really teaching data engineering is spent on data integration and data ingestion. How do we move this data around efficiently? For a lot of that time Kafka was really the only open source game in town for that. But now there’s another technology called Apache Pulsar. I’ve spent a decent amount of time actually going through Pulsar and there are some things that I see in it that Kafka will either have difficulty doing or won’t be able to do.

… Apache Pulsar separates pub-sub from storage. When I first read about that, I didn’t quite get it. I didn’t quite see, why is this so important or why this is so interesting. It’s because you can individually scale your pub-sub and your storage resources independently. Now you’ve got something. Now you can say, “Well, we originally decided I wanted to store data for seven days. All right, let’s spin up some more bookkeeper processes and now we can store fourteen days, now we can store twenty one days.” I think that’s going to be a pretty interesting addition there. Where the other side of that, the corollary to that is, “Okay, we’re hitting Black Friday and we don’t have so much more data coming through as we have way more consumption and have way more things hitting our pub-sub. We could spin up more pub-sub with that.” This separation is actually allowing some interesting use cases.

Related resources:

Responsible deployment of machine learning

[A version of this post appears on the O’Reilly Radar.]

We need to build machine learning tools to augment our machine learning engineers.

In this post, I share slides and notes from a talk I gave in December 2017 at the Strata Data Conference in Singapore offering suggestions to companies that are actively deploying products infused with machine learning capabilities. Over the past few years, the data community has focused on infrastructure and platforms for data collection, including robust pipelines and highly scalable storage systems for analytics. According to a recent LinkedIn report, the top two emerging jobs are “machine learning engineer” and “data scientist.” Companies are starting to staff to put their data infrastructures to work, and machine learning is going become more prevalent in the years to come.


As more companies start using machine learning in products, tools, and business processes, let’s take a quick tour of model building, model deployment, and model management. It turns out that once a model is built, deploying and managing it in production requires engineering skills. So much so that earlier this year, we noted that companies have created a new job role—machine learning (or deep learning) engineer—for people tasked with productionizing machine learning models.

Modern machine learning libraries and tools like notebooks have made model building simpler. New data scientists need to make sure they understand the business problem and optimize their models for it. In a diverse region like Southeast Asia, models need to be localized, as conditions and contexts differ across countries in the ASEAN.
Continue reading “Responsible deployment of machine learning”