Building a business that combines human experts and data science

The O’Reilly Data Show podcast: Eric Colson on algorithms, human computation, and building data science teams.

[A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In this episode of the O’Reilly Data Show, I spoke with Eric Colson, chief algorithms officer at Stitch Fix, and former VP of data science and engineering at Netflix. We talked about building and deploying mission-critical, human-in-the-loop systems for consumer Internet companies. Knowing that many companies are grappling with incorporating data science, I also asked Colson to share his experiences building, managing, and nurturing large data science teams at both Netflix and Stitch Fix.

Augmented systems: “Active learning,” “human-in-the-loop,” and “human computation”

We use the term ‘human computation’ at Stitch Fix. We have a team dedicated to human computation. It’s a little bit coarse to say it that way because we do have more than 2,000 stylists, and these are very much human beings that are very passionate about fashion styling. What we can do is, we can abstract their talent into—you can think of it like an API; there’s certain tasks that only a human can do or we’re going to fail if we try this with machines, so we almost have programmatic access to human talent. We are allowed to route certain tasks to them, things that we could never get done with machines.

… We have some of our own proprietary software that blends together two resources: machine learning and expert human judgment. The way I talk about it is, we have an algorithm that’s distributed across the resources. It’s a single algorithm, but it does some of the work through machine resources, and other parts of the work get done through humans.

… You can think of even the classic recommender systems, collaborative filtering, which people recognize as, ‘people that bought this also bought that.’ Those things break down to nothing more than a series of rote calculations. Being a human, you can actually do them by hand—it’ll just take you a long time, and you’ll make a lot of mistakes along the way, and you’re not going to have much fun doing it—but machines can do this stuff in milliseconds. They can find these hidden relationships within the data that are going to help figure out what’s relevant to certain consumer’s preferences and be able to recommend things. Those are things that, again, a human could, in theory, do, but they’re just not great at all the calculations, and every algorithmic technique breaks down to a series of rote calculations.
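
As an editorial illustration (not taken from the interview), here is a minimal sketch of the "people that bought this also bought that" idea as plain co-occurrence counting. The basket data and the function names `co_purchase_counts` and `also_bought` are hypothetical, and production recommenders are considerably more elaborate; the point is only that each step is a rote calculation.

```python
from collections import defaultdict
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each pair of distinct items appears in the same basket."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_bought(item, counts, top_n=3):
    """Return up to top_n items most often bought together with `item`."""
    neighbours = counts.get(item, {})
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical purchase histories, invented for illustration only.
baskets = [
    ["jeans", "belt", "boots"],
    ["jeans", "belt"],
    ["jeans", "scarf"],
    ["dress", "scarf"],
]
counts = co_purchase_counts(baskets)
print(also_bought("jeans", counts))
```

Every operation here is a count or a sort, which a person could do by hand for four baskets but not for millions.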

… What machines can’t do are things around cognition, things that have to do with ambient information, or appreciation of aesthetics, or even the ability to relate to another human—those things are strictly in the purview of humans. Those types of tasks we route over to stylists. … I would argue that our humans could not do their jobs without the machines. We keep our inventory very large so that there are always many things to pick from for any given customer. It’s so large, in fact, that it would take a human too long to sift through it on her own, so what machines are doing is narrowing down the focus.
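
As an editorial illustration (not Stitch Fix's actual software), the division of labor described above might be sketched as follows: a machine step narrows a large inventory to a shortlist, and a human step makes the final choice. All names, scores, and data here are hypothetical, and `stylist_pick` is only a stand-in for a real person's judgment.

```python
def machine_shortlist(inventory, score, k):
    """Machine step: rank the whole inventory and keep the k best-scoring items."""
    return sorted(inventory, key=score, reverse=True)[:k]


def style_for_customer(inventory, score, stylist_pick, k=3):
    """One procedure split across two resources: a machine narrows, a human chooses."""
    shortlist = machine_shortlist(inventory, score, k)
    return stylist_pick(shortlist)


# Hypothetical inventory with made-up match scores for one customer.
inventory = [
    {"name": "navy blazer", "match": 0.91},
    {"name": "floral dress", "match": 0.42},
    {"name": "grey cardigan", "match": 0.88},
    {"name": "neon leggings", "match": 0.15},
    {"name": "white sneakers", "match": 0.79},
]


def score(item):
    """Stand-in for a learned model's relevance score."""
    return item["match"]


def stylist_pick(shortlist):
    """Stand-in for a human decision; a real system would route this task to a person."""
    for item in shortlist:
        if "cardigan" in item["name"]:
            return item
    return shortlist[0]


choice = style_for_customer(inventory, score, stylist_pick)
print(choice["name"])
```

The human never sees the full inventory, and the machine never makes the final aesthetic call, which is the split the excerpt describes.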

Combining art and science

Our business model is different. We are betting big on algorithms. We do not have the barriers to competition that other retailers have, like Wal-Mart has economies of scale that allow them to do amazing things; that’s their big barrier. … What is our protective barrier? It’s [to be the] best in the world at algorithms. We have to be the very best. … More than any other company, we are going to suffer if we’re wrong.

… Our founder wanted to do this from the very beginning, combine empiricism with what can’t be captured in data, call it intuition or judgment. But she really wanted to weave those two things together to produce something that was better than either can do on their own. She calls it art and science, combining art and science.

Defining roles in data science teams

[Job roles at Stitch Fix are] built on three premises that come from Dan Pink’s book Drive. Autonomy, mastery, purpose: those are the fundamental things you need to have for high job satisfaction. With autonomy, that’s why we dedicate them to a team. You’re going to now work on what’s called ‘marketing algorithms.’ You may not know anything about marketing to begin with, but you’re going to learn it pretty fast. You’re going to pick up the domain expertise. By autonomy, we want you to do the whole thing so you have the full context. You’re going to be the one sourcing the data, building pipelines. You’re going to be applying the algorithmic routine. You’re going to be the one who frames that problem, figures out what algorithms you need, and you’re going to be the one delivering the output and connecting it back to some action, whatever that action may be. Maybe it’s adjusting our multi-channel strategy. Whatever that algorithmic output is, you’re responsible for it. So, that’s mastery. Now, you’re autonomous because you do all the pieces. You’re getting mastery over one domain, in that case, say marketing algorithms. You’re going to be looked at as you’re the best person in the company to go talk about how these things work; you know the end-to-end.

Then, purpose—that’s the impact that you’re going to make. In the case that we gave, marketing algorithms, you want to be accountable. You want to be the one who can move the needle when it comes to how much we should do. What channels are more effective at acquiring new customers? Whatever it is, you’re going to be held accountable for a real number, and that is motivating, that’s what makes people love their jobs.

Editor’s note: Eric Colson will speak about augmenting machine learning with human computation for better personalization, at Strata + Hadoop World in San Jose this March.

Compressed representations in the age of big data

[A version of this post appears on the O’Reilly Radar.]

Emerging trends in intelligent mobile applications and distributed computing

When developing intelligent, real-time applications, one often has access to a data platform that can wade through and unlock patterns in massive data sets. The back-end infrastructure for such applications often relies on distributed, fault-tolerant, scale-out technologies designed to handle large data sets. But there are situations when compressed representations are useful and even necessary. The rise of mobile computing and sensors (IoT) will lead to devices and software that push computation from the cloud toward the edge. In addition, in-memory computation tends to be much faster, and thus many popular (distributed) systems operate on data sets that can be cached.

To drive home this point, let me highlight two recent examples that illustrate the importance of efficient compressed representations: one from mobile computing, the other from a popular distributed computing framework.

Deep neural networks and intelligent mobile applications

In a recent presentation, Song Han, of the Concurrent VLSI Architecture (CVA) group at Stanford University, outlined an initiative to help optimize deep neural networks for mobile devices. Deep learning has produced impressive results across a range of applications in computer vision, speech, and machine translation. Meanwhile, the growing popularity of mobile computing platforms means many mobile applications will need to have capabilities in these areas. The challenge is that deep learning models tend to be too large to fit into mobile applications (these applications are downloaded and often need to be updated frequently). Relying on cloud-based solutions is an option, but network delay and privacy can be issues in certain applications and domains.

One solution is to significantly reduce the size of deep learning models. CVA researchers recently proposed a general scheme for compressing deep neural networks in three steps:

  • prune the unimportant connections,
  • quantize the network and enforce weight sharing,
  • and finally apply Huffman encoding.
Figure 1. Sample diagram comparing compression schemes on neural network sizes. Image courtesy of Ben Lorica.


Initial experiments showed their compression scheme reduced neural network sizes by 35 to 50 times, and the resulting compressed models were able to match the accuracy of the corresponding original models. CVA researchers also designed an accompanying energy-efficient ASIC accelerator for running compressed deep neural networks, hinting at next-generation software + hardware designed specifically for intelligent mobile applications.
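
To make the three steps listed above concrete, here is a toy sketch on a flat list of weights: magnitude pruning, weight sharing, and Huffman coding of the resulting cluster indices. It is not the CVA implementation. The threshold, the use of 1-D k-means for weight sharing, the starting centroids, and the sample weights are all my assumptions for illustration, and any retraining of the network between steps is omitted.

```python
import heapq
from collections import Counter


def prune(weights, threshold):
    """Step 1: zero out connections whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize(weights, centroids, iterations=10):
    """Step 2: cluster the nonzero weights with 1-D k-means so that many
    connections share one value. Returns (codes, codebook); code 0 is
    reserved for pruned weights."""
    centroids = list(centroids)
    nonzero = [w for w in weights if w != 0.0]

    def nearest(w):
        return min(range(len(centroids)), key=lambda i: abs(w - centroids[i]))

    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for w in nonzero:
            groups[nearest(w)].append(w)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    codes = [0 if w == 0.0 else 1 + nearest(w) for w in weights]
    return codes, [0.0] + centroids


def huffman_code(symbols):
    """Step 3: build a Huffman code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: partial code}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Made-up weights for illustration only.
weights = [0.02, -0.51, 0.48, 0.01, 0.53, -0.49, -0.03, 0.50, 0.0, 0.47]
pruned = prune(weights, threshold=0.1)
codes, codebook = quantize(pruned, centroids=[-0.5, 0.5])
table = huffman_code(codes)
bits = "".join(table[c] for c in codes)
restored = [codebook[c] for c in codes]
print(len(bits), "bits for the codes, versus", 32 * len(weights), "bits as 32-bit floats")
```

The stored model is then the short codebook plus the Huffman-coded index stream; `restored` shows the approximate weights a device would reconstruct from them.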

Succinct: search and point queries on compressed data over Apache Spark

Succinct is a “compressed” data store that enables a wide range of point queries (search, count, range, random access) directly on a compressed representation of input data. Succinct uses a compression technique that empirically achieves compression close to that of gzip, and supports the above queries without storing secondary indexes, without data scans, and without data decompression. Succinct does not store the input file, just the compressed representation. By letting users query compressed data directly, Succinct combines low latency and low storage:

Figure 2. Qualitative comparison of data scans, indexes, and Succinct. Since it stores and operates on compressed representations, Succinct can keep data in-memory for much larger-sized input files. Source: Rachit Agarwal, used with permission.


While this AMPLab project had been around as a research initiative, Succinct became available on Apache Spark late last year. This means Spark users can leverage Succinct against flat files and immediately execute search queries (including regex queries directly on compressed RDDs), compute counts, and do range queries. Moreover, abstractions have been built on top of Succinct’s basic flat (unstructured) file interface, allowing Spark to be used as a document or key-value store, and a DataFrames API currently exposes search, count, range, and random access queries. Having these new capabilities on top of Apache Spark simplifies the software stack needed to build many interesting data applications.
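
To give a feel for what search, count, and random access queries on a flat file mean, here is a toy sketch built on a plain suffix array. It does not use Succinct's API and it is not compressed: unlike Succinct, it keeps both the raw text and an uncompressed index in memory. The class name `ToyTextStore` and its method names are hypothetical.

```python
class ToyTextStore:
    """Uncompressed stand-in that answers search, count, and extract queries."""

    def __init__(self, text):
        self.text = text
        # Suffix array: start offsets of all suffixes, in lexicographic order.
        self.suffixes = sorted(range(len(text)), key=lambda i: text[i:])

    def _prefix(self, rank, m):
        """First m characters of the suffix at the given rank."""
        start = self.suffixes[rank]
        return self.text[start:start + m]

    def _range(self, pattern):
        """Binary-search the contiguous block of suffixes that start with `pattern`."""
        m = len(pattern)
        lo, hi = 0, len(self.suffixes)
        while lo < hi:  # first rank whose prefix is >= pattern
            mid = (lo + hi) // 2
            if self._prefix(mid, m) < pattern:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, len(self.suffixes)
        while lo < hi:  # first rank whose prefix is > pattern
            mid = (lo + hi) // 2
            if self._prefix(mid, m) <= pattern:
                lo = mid + 1
            else:
                hi = mid
        return start, lo

    def count(self, pattern):
        """Number of occurrences of `pattern`, found without scanning the text."""
        start, end = self._range(pattern)
        return end - start

    def search(self, pattern):
        """Offsets of every occurrence of `pattern`, in increasing order."""
        start, end = self._range(pattern)
        return sorted(self.suffixes[start:end])

    def extract(self, offset, length):
        """Random access: `length` characters starting at `offset`."""
        return self.text[offset:offset + length]


store = ToyTextStore("banana bandana")
print(store.count("an"), store.search("ana"), store.extract(7, 7))
```

Each query is answered by binary search over the index rather than by a scan of the data; the contribution described above is supporting the same kinds of queries while storing only a compressed representation.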

Early comparisons with ElasticSearch have been promising and, most importantly for its users, Succinct is an active project. The team behind it plans many enhancements in future releases, including Succinct Graphs (for queries on compressed graphs), support for SQL on compressed data, and further improvements in preprocessing/compression (currently at 4 gigabytes per hour, per core). They are also working on a research project called Succinct Encryption (for queries on compressed and encrypted data).

Is 2016 the year you let robots manage your money?

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: Vasant Dhar on the race to build “big data machines” in financial investing.

In this episode of the O’Reilly Data Show, I sat down with Vasant Dhar, a professor at the Stern School of Business and Center for Data Science at NYU, founder of SCT Capital Management, and editor-in-chief of the Big Data Journal (full disclosure: I’m a member of the editorial board). We talked about the early days of A.I. and data mining, and recent applications of data science to financial investing and other domains.

Dhar’s first steps in applying machine learning to finance

I joke with people, I say, ‘When I first started looking at finance, the only thing I knew was that prices go up and down.’ It was only when I actually went to Morgan Stanley and took time off from academia that I learned about finance and financial markets. … What I really did in that initial experiment is I took all the trades, I appended them with information about the state of the market at the time, and then I cranked it through a genetic algorithm and a tree induction algorithm. … When I took it to the meeting, it generated a lot of really interesting discussion. … Of course, it took several months before we actually finally found the reasons for why I was observing what I was observing.
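
As an editorial illustration (not Dhar's actual experiment), the workflow he describes can be sketched in two parts: append the market state to each trade, then run one step of tree induction to find the single split that best separates profitable from unprofitable trades. The trades, the market features, and the choice of Gini impurity as the split criterion are all assumptions made for this example, and the genetic algorithm is not shown.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, features, target):
    """One step of tree induction: return (impurity, feature, threshold) for the
    split `feature <= threshold` with the lowest weighted Gini impurity."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [r[target] for r in rows if r[f] <= threshold]
            right = [r[target] for r in rows if r[f] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, f, threshold)
    return best


# Hypothetical trades and market state, invented for illustration only.
trades = [
    {"day": 1, "profitable": True},
    {"day": 2, "profitable": True},
    {"day": 3, "profitable": True},
    {"day": 4, "profitable": False},
    {"day": 5, "profitable": False},
    {"day": 6, "profitable": False},
]
market = {
    1: {"volatility": 0.10, "volume": 5},
    2: {"volatility": 0.12, "volume": 9},
    3: {"volatility": 0.15, "volume": 6},
    4: {"volatility": 0.30, "volume": 8},
    5: {"volatility": 0.35, "volume": 5},
    6: {"volatility": 0.40, "volume": 9},
}

# Append the state of the market at the time to each trade.
rows = [{**t, **market[t["day"]]} for t in trades]
split = best_split(rows, ["volume", "volatility"], "profitable")
print(split)
```

A full tree would apply `best_split` recursively to each side; even a single split like this one is the kind of pattern that then needs a human explanation of why it holds.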

Robots as decision makers

The general research question I really ask is when do computers make better decisions than humans? That’s really sort of the core question. … I’ve applied it to finance, but there are other areas. I’m involved in a project on education, and one might ask the same thing. When do computers make better teachers than humans? It’s an equally interesting question. … Should you trust your money to a robot? The flip side of that question is when do computers make better decisions than humans?

One of the things I did was to break up the investment landscape into three different types of holding periods. On the one hand, you have high-frequency trading, and on the other extreme, you have very long-term investing. In high-frequency trading, your holding periods are sort of minutes to a day. In very long-term investing, your holding periods are months to years, that Warren Buffett style of investing. Then there’s sort of a space in the middle, which is the part I find most interesting, where there’s a lot of action, which is sort of days to weeks holding period. … The strategy one uses for these different horizons tends to be very different. In the high-frequency trading space, for example, humans don’t really stand a chance against computers, there’s just so much information.

Image via the Internet Archive on Wikimedia Commons.