Managing risk in machine learning

[A version of this post appears on the O’Reilly Radar.]

Considerations for a world where ML models are becoming mission critical.

In this post, I share slides and notes from a keynote I gave at the Strata Data Conference in New York last September. As the data community begins to deploy more machine learning (ML) models, I wanted to review some important considerations.

Let’s begin by looking at the state of adoption. We recently conducted a surveywhich garnered more than 11,000 respondents—our main goal was to ascertain how enterprises were using machine learning. One of the things we learned was that many companies are still in the early stages of deploying machine learning (ML):

As far as reasons for companies holding back, we found from a survey we conducted earlier this year that companies cited lack of skilled people, a “skills gap,” as the main challenge holding back adoption.

Interest on the part of companies means the demand side for “machine learning talent” is healthy. Developers have taken notice and are beginning to learn about ML. In our own online training platform (which has more than 2.1 million users), we’re finding strong interest in machine learning topics. Below are the top search topics on our training platform:
Continue reading “Managing risk in machine learning”

Lessons learned while helping enterprises adopt machine learning

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Francesca Lazzeri and Jaya Mathew on digital transformation, culture and organization, and the team data science process.

In this episode of the Data Show, I spoke with Francesca Lazzeri, an AI and machine learning scientist at Microsoft, and her colleague Jaya Mathew, a senior data scientist at Microsoft. We conducted a couple of surveys this year—“How Companies Are Putting AI to Work Through Deep Learning” and “The State of Machine Learning Adoption in the Enterprise” — and we found that while many companies are still in the early stages of machine learning adoption, there’s considerable interest in moving forward with projects in the near future. Lazzeri and Mathew spend a considerable amount of time interacting with companies that are beginning to use machine learning and have experiences that span many different industries and applications. I wanted to learn some of the processes and tools they use when they assist companies in beginning their machine learning journeys.

Here are some highlights from our conversation:

Team data science process

Francesca Lazzeri: The Data Science Process is a framework that we try to apply in our projects. Everything begins with a business problem, so external customers come to us with a business problem or a process they want to optimize. We work with them to translate these into realistic questions, into what we call data science questions. And then we move to the data portion: what are the different relevant data sources, is the data internal or external? After that, you try to define the data pipeline. We start with the core part of the data science process—that is, data cleaning—and proceed to feature engineering, model building, and model deployment and management.
Continue reading “Lessons learned while helping enterprises adopt machine learning”

Machine learning on encrypted data

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Alon Kaufman on the interplay between machine learning, encryption, and security.

In this episode of the Data Show, I spoke with Alon Kaufman, CEO and co-founder of Duality Technologies, a startup building tools that will allow companies to apply analytics and machine learning to encrypted data. In a recent talk, I described the importance of data, various methods for estimating the value of data, and emerging tools for incentivizing data sharing across organizations. As I noted, the main motivation for improving data liquidity is the growing importance of machine learning. We’re all familiar with the importance of data security and privacy, but probably not as many people are aware of the emerging set of tools at the intersection of machine learning and security. Kaufman and his stellar roster of co-founders are doing some of the most interesting work in this area.

Here are some highlights from our conversation:

Running machine learning models on encrypted data

Four or five years ago, techniques for running machine learning models on data while it’s encrypted were being discussed in the academic world. We did a few trials of this and although the results were fascinating, it still wasn’t practical.

… There have been big breakthroughs that have led to it becoming feasible. A few years ago, it was more theoretical. Now it’s becoming feasible. This is the right time to build a company. Not only because of the technology feasibility but definitely because of the need in the market.

Continue reading “Machine learning on encrypted data”

How social science research can inform the design of AI systems

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jacob Ward on the interplay between psychology, decision-making, and AI systems.

In this episode of the Data Show, I spoke with Jacob Ward, a Berggruen Fellow at Stanford University. Ward has an extensive background in journalism, mainly covering topics in science and technology, at National Geographic, Al Jazeera, Discovery Channel, BBC, Popular Science, and many other outlets. Most recently, he’s become interested in the interplay between research in psychology, decision-making, and AI systems. He’s in the process of writing a book on these topics, and was gracious enough to give an informal preview by way of this podcast conversation.

Here are some highlights from our conversation:

Psychology and AI

I began to realize there was a disconnect between what is a totally revolutionary set of innovations coming through in psychology right now that are really just beginning to scratch the surface of how human beings make decisions; at the same time, we are beginning to automate human decision-making in a really fundamental way. I had a number of different people say, ‘Wow, what you’re describing in psychology really reminds me of this piece of AI that I’m building right now,’ to change how expectant mothers see their doctors or change how we hire somebody for a job or whatever it is.

Transparency and designing systems that are fair

I was talking to somebody the other day who was trying to build a loan company that was using machine learning to present loans to people. He and his company did everything they possibly could to not redline the people they were loaning to. They were trying very hard not to make unfair loans that would give preference to white people over people of color.

They went to extraordinary lengths to make that happen. They cut addresses out of the process. They did all of this stuff to try to basically neutralize the process, and the machine learning model still would pick white people at a disproportionate rate over everybody else. They can’t explain why. They don’t know why that is. There’s some variable that’s mapping to race that they just don’t know about.

But that sort of opacity—this is somebody explaining it to me who just happened to have been inside the company, but it’s not as if that’s on display for everybody to check out. These kinds of closed systems are picking up patterns we can’t explain, and that their creators can’t explain. They are also making really, really important decisions based on them. I think it is going to be very important to change how we inspect these systems before we begin trusting them.

Continue reading “How social science research can inform the design of AI systems”

Why it’s hard to design fair machine learning models

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Sharad Goel and Sam Corbett-Davies on the limitations of popular mathematical formalizations of fairness.

In this episode of the Data Show, I spoke with Sharad Goel, assistant professor at Stanford, and his student Sam Corbett-Davies. They recently wrote a survey paper, “A Critical Review of Fair Machine Learning,” where they carefully examined the standard statistical tools used to check for fairness in machine learning models. It turns out that each of the standard approaches (anti-classification, classification parity, and calibration) has limitations, and their paper is a must-read tour through recent research in designing fair algorithms. We talked about their key findings, and, most importantly, I pressed them to list a few best practices that analysts and industrial data scientists might want to consider.

Here are some highlights from our conversation:

Calibration and other standard metrics

Sam Corbett-Davies: The problem with many of the standard metrics is that they fail to take into account how different groups might have different distributions of risk. In particular, if there are people who are very low risk or very high risk, then it can throw off these measures in a way that doesn’t actually change what the fair decision should be. … The upshot is that if you end up enforcing or trying to enforce one of these measures, if you try to equalize false positive rates, or you try to equalize some other classification parity metric, you can end up hurting both the group you’re trying to protect and any other groups for which you might be changing the policy.
Continue reading “Why it’s hard to design fair machine learning models”

Building accessible tools for large-scale computation and machine learning

[A version of this post appears on the O’Reilly Radar.]

In this episode of the Data Show, I spoke with Eric Jonas, a postdoc in the new Berkeley Center for Computational Imaging. Jonas is also affiliated with UC Berkeley’s RISE Lab. It was at a RISE Lab event that he first announced Pywren, a framework that lets data enthusiasts proficient with Python run existing code at massive scale on Amazon Web Services. Jonas and his collaborators are working on a related project, NumPyWren, a system for linear algebra built on a serverless architecture. Their hope is that by lowering the barrier to large-scale (scientific) computation, we will see many more experiments and research projects from communities that have been unable to easily marshal massive compute resources. We talked about Bayesian machine learning, scientific computation, reinforcement learning, and his stint as an entrepreneur in the enterprise software space.

Here are some highlights from our conversation:

Pywren

The real enabling technology for us was when Amazon announced the availability of AWS Lambda, their microservices framework, in 2014. Following this prompting, I went home one weekend and thought, ‘I wonder how hard it is to take an arbitrary Python function and marshal it across the wire, get it running in Lambda; I wonder how many I can get at once?’ Thus, Pywren was born.
Continue reading “Building accessible tools for large-scale computation and machine learning”

How privacy-preserving techniques can lead to more robust machine learning models

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Chang Liu on operations research, and the interplay between differential privacy and machine learning.

In this episode of the Data Show, I spoke with Chang Liu, applied research scientist at Georgian Partners. In a previous post, I highlighted early tools for privacy-preserving analytics, both for improving decision-making (business intelligence and analytics) and for enabling automation (machine learning). One of the tools I mentioned is an open source project for SQL-based analysis that adheres to state-of-the-art differential privacy(a formal guarantee that provides robust privacy assurances).  Since business intelligence typically relies on SQL databases, this open source project is something many companies can already benefit from today.

What about machine learning? While I didn’t have space to point this out in my previous post, differential privacy has been an area of interest to many machine learning researchers. Most practicing data scientists aren’t aware of the research results, and popular data science tools haven’t incorporated differential privacy in meaningful ways (if at all). But things will change over the next months. For example, Liu wants to make  ideas from differential privacy accessible to industrial data scientists, and she is part of a team building tools to make this happen.

Here are some highlights from our conversation:
Continue reading “How privacy-preserving techniques can lead to more robust machine learning models”