Creating large training data sets quickly

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Alex Ratner on why weak supervision is the key to unlocking dark data.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Alex Ratner, a graduate student at Stanford and a member of Christopher Ré’s Hazy research group. Training data has always been important in building machine learning algorithms, and the rise of data-hungry deep learning models has heightened the need for labeled data sets. In fact, the challenge of creating training data is ongoing for many companies; specific applications change over time, and what were gold standard data sets may no longer apply to changing situations.

Ré and his collaborators proposed a framework for quickly building large training data sets. In essence, they observed that high-quality models can be constructed from noisy training data. Some of these ideas were discussed in a previous episode featuring Mike Cafarella (jump to minute 24:16 for a description of an earlier project called DeepDive).

By developing a framework for mining low-quality sources in order to build high-quality machine learning models, Ré and his collaborators help researchers extract information previously hidden in unstructured data sources (so-called “dark data” buried in text, images, charts, and so on).

Here are some highlights from my conversation with Ratner:

Weak supervision and transfer learning

Weak supervision is a term that people have used before, especially around Stanford, to talk about methods where we have lower-quality training data, or noisier training data. … At a high level, machine learning models are meant to be robust to some noise in the distribution they’re trained on. … One of the really important trends we’ve seen is that more people than ever are using deep learning models. Deep learning models can automate the feature engineering process, but they are more complex and they need more training data to fit to their parameters.

If you look at the very remarkable, empirical successes that deep learning has had over the last few years, they have been mostly (or almost entirely) predicated on these large label training sets that took years to create. … Our motivation with weak supervision is really: how do we weaken this bottleneck? … For weak supervision, our ultimate goal is to make it easier for the human to provide supervision to the model. That’s where the human comes into the loop. This might be an iterative process.

… In the standard transfer learning paradigm, you’d take one nicely collecting training set, and you’d train your model on that in the standard way. Then you just try to apply your model to a new data distribution.

Data programming

Data programming is a general, flexible framework for using weak supervision to train some end model that you want to train without necessarily having any hand-labeled training data. The basic way it works is, we actually have two modeling stages in this pipeline. The first is that we get input from the domain expert or user in the form of what we call labeling functions. Think of them as Python functions. … The user writes a bunch of labeling functions, which are just black box functions that take in a data point, take in one of these objects, and output a label, or they could abstain. These labeling functions can encode all the types of weak supervision, like distant supervision, or crowd labels, or various heuristics. There’s a lot of flexibility because we don’t make any assumptions about what is inside them.

In our first modeling stage, we use a generative model to learn which of the labeling functions are more or less accurate by observing where they overlap, where they agree and disagree. Intuitively, if we have 20 labeling functions from a user and we see that one labeling function is always agreeing with its co-labelers on various data points, we think we should trust it. When a labeling function is always disagreeing in a minority, then we downweight this. Basically, we learn this model that tells us how to weight the difference labeling functions the user has provided. Then, the output of this model is a set of probabilistic training labels.

Then we feed these into the end model we’re trying to train. To give you some intuition on the probabilistic labels: all we’re basically saying is that we want the end model to learn more from data points that got a lot of high confidence votes, rather than the ones that were sort of in contention, from the labeling functions that the user provided. … One goal is to generate data, but often our ultimate goal is to train some end discriminative model, say to do image classification.

Snorkel is a system for using this data programming technique to quickly generate training data. A lot of the tooling and the use cases that are publicly part of Snorkel right now are around text extraction.

Data programming in Snorkel
Data Programming in Snorkel. Slide from Alex Ratner, used with permission.

Related resources: