Programming collective intelligence for financial trading

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Geoffrey Bradway on building a trading system that synthesizes many different models.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Geoffrey Bradway, VP of engineering at Numerai, a new hedge fund that relies on contributions of external data scientists. The company hosts regular competitions where data scientists submit machine learning models for classification tasks. The most promising submissions are then added to an ensemble of models that the company uses to trade in real-world financial markets.

To minimize model redundancy, Numerai filters out entries that produce signals that are already well-covered by existing models in their ensemble. The company also plans to use (Ethereum) blockchain technology to develop an incentive system to reward models that do well on live data (not ones that overfit and do well on historical data).
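To make that redundancy filter concrete, here is a minimal sketch (my own illustration, not Numerai’s actual pipeline) of screening a candidate model by checking how correlated its predictions are with each model already in the ensemble:

```python
import numpy as np

def is_redundant(candidate_preds, ensemble_preds, threshold=0.9):
    """Return True if the candidate's signal is already well-covered.

    candidate_preds: 1-D array of the candidate model's predictions on a shared validation set.
    ensemble_preds:  list of 1-D arrays, one per model already in the ensemble.
    threshold:       absolute correlation above which the candidate adds little new signal.
    """
    for existing in ensemble_preds:
        corr = np.corrcoef(candidate_preds, existing)[0, 1]
        if abs(corr) > threshold:
            return True
    return False

# Accept a submission only if it is sufficiently decorrelated from the ensemble
rng = np.random.default_rng(0)
ensemble = [rng.random(1000) for _ in range(5)]
candidate = rng.random(1000)
if not is_redundant(candidate, ensemble):
    ensemble.append(candidate)
```

The 0.9 threshold is arbitrary; the point is simply that the filter looks at the signal a model produces, not at the model itself.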

Here are some highlights from our conversation:
Continue reading “Programming collective intelligence for financial trading”

Data preparation in the age of deep learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Lukas Biewald on why companies are spending millions of dollars on labeled data sets.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Lukas Biewald, co-founder and chief data scientist at CrowdFlower. In a previous episode, we covered how the rise of deep learning is fueling the need for large labeled data sets and high-performance computing systems. Many leading companies have come to rely on CrowdFlower’s service to provide the labeled data sets they need to train machine learning models. As deep learning models get larger and more complex, they require training data sets that are bigger than those required by other machine learning techniques.

The CrowdFlower platform combines the contributions of human workers and algorithms. Through a process called active learning, it sends difficult tasks or edge cases to humans and lets the algorithms handle the more routine examples. But how do you decide when to use human workers? In a simple example involving an automatic classifier, you will probably want to route cases to human workers when your machine learning model signals uncertainty (probability scores are on the low side) or when your ensemble of models signals disagreement. As Biewald describes in our conversation, active learning is much more subtle in practice, and the CrowdFlower platform, in particular, is able to combine humans and algorithms to handle more sophisticated tasks.
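As a rough sketch of that routing decision (my own toy example, not CrowdFlower’s implementation), the rule can be as simple as flagging an example when the ensemble’s top probability is low or when the individual models vote for different classes:

```python
import numpy as np

def needs_human_review(prob_vectors, confidence_threshold=0.7):
    """Decide whether an example should be routed to a human worker.

    prob_vectors: list of class-probability vectors, one per model in the ensemble.
    """
    probs = np.mean(prob_vectors, axis=0)            # ensemble-averaged probabilities
    uncertain = probs.max() < confidence_threshold   # low top-class probability
    votes = {int(np.argmax(p)) for p in prob_vectors}  # predicted class per model
    return uncertain or len(votes) > 1               # uncertainty or disagreement

# Two models agree confidently -> handled by machines
print(needs_human_review([[0.9, 0.1], [0.85, 0.15]]))  # False
# The models disagree and the average is uncertain -> routed to a human worker
print(needs_human_review([[0.6, 0.4], [0.3, 0.7]]))    # True
```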

Here are some highlights from our conversation:
Continue reading “Data preparation in the age of deep learning”

Building a business that combines human experts and data science

The O’Reilly Data Show podcast: Eric Colson on algorithms, human computation, and building data science teams.

[A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In this episode of the O’Reilly Data Show, I spoke with Eric Colson, chief algorithms officer at Stitch Fix, and former VP of data science and engineering at Netflix. We talked about building and deploying mission-critical, human-in-the-loop systems for consumer Internet companies. Knowing that many companies are grappling with incorporating data science, I also asked Colson to share his experiences building, managing, and nurturing large data science teams at both Netflix and Stitch Fix.

Augmented systems: “Active learning,” “human-in-the-loop,” and “human computation”

We use the term ‘human computation’ at Stitch Fix. We have a team dedicated to human computation. It’s a little bit coarse to say it that way because we do have more than 2,000 stylists, and these are very much human beings that are very passionate about fashion styling. What we can do is, we can abstract their talent into—you can think of it like an API; there’s certain tasks that only a human can do or we’re going to fail if we try this with machines, so we almost have programmatic access to human talent. We are allowed to route certain tasks to them, things that we could never get done with machines.

… We have some of our own proprietary software that blends together two resources: machine learning and expert human judgment. The way I talk about it is, we have an algorithm that’s distributed across the resources. It’s a single algorithm, but it does some of the work through machine resources, and other parts of the work get done through humans.

… You can think of even the classic recommender systems, collaborative filtering, which people recognize as, ‘people that bought this also bought that.’ Those things break down to nothing more than a series of rote calculations. Being a human, you can actually do them by hand—it’ll just take you a long time, and you’ll make a lot of mistakes along the way, and you’re not going to have much fun doing it—but machines can do this stuff in milliseconds. They can find these hidden relationships within the data that are going to help figure out what’s relevant to certain consumers’ preferences and be able to recommend things. Those are things that, again, a human could, in theory, do, but they’re just not great at all the calculations, and every algorithmic technique breaks down to a series of rote calculations.

… What machines can’t do are things around cognition, things that have to do with ambient information, or appreciation of aesthetics, or even the ability to relate to another human—those things are strictly in the purview of humans. Those types of tasks we route over to stylists. … I would argue that our humans could not do their jobs without the machines. We keep our inventory very large so that there are always many things to pick from for any given customer. It’s so large, in fact, that it would take a human too long to sift through it on her own, so what machines are doing is narrowing down the focus.
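Colson’s point that collaborative filtering “breaks down to nothing more than a series of rote calculations” is easy to make concrete. A toy item-to-item version of “people that bought this also bought that” is just co-occurrence counting (my own illustration, not Stitch Fix’s recommender):

```python
from collections import Counter
from itertools import combinations

# Toy purchase histories: each set is one customer's basket
baskets = [
    {"jeans", "blouse", "scarf"},
    {"jeans", "scarf"},
    {"jeans", "blazer"},
]

# Count how often each pair of items is bought together
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1

def also_bought(item, top_n=3):
    """Items most frequently bought alongside `item`, ranked by co-occurrence."""
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return scores.most_common(top_n)

print(also_bought("jeans"))  # [('scarf', 2), ('blouse', 1), ('blazer', 1)]
```

A machine can run this over millions of baskets in milliseconds; a person could do the same arithmetic by hand, just not quickly or reliably, which is exactly the distinction Colson draws.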

Combining art and science

Our business model is different. We are betting big on algorithms. We do not have the barriers to competition that other retailers have, like Wal-Mart has economies of scale that allow them to do amazing things; that’s their big barrier. … What is our protective barrier? It’s [to be the] best in the world at algorithms. We have to be the very best. … More than any other company, we are going to suffer if we’re wrong.

… Our founder wanted to do this from the very beginning, combine empiricism with what can’t be captured in data, call it intuition or judgment. But she really wanted to weave those two things together to produce something that was better than either can do on their own. She calls it art and science, combining art and science.

Defining roles in data science teams

[Job roles at Stitch Fix are] built on three premises that come from Dan Pink’s book Drive. Autonomy, mastery, purpose—those are the fundamental things you need to have for high job satisfaction. With autonomy, that’s why we dedicate them to a team. You’re going to now work on what’s called ‘marketing algorithms.’ You may not know anything about marketing to begin with, but you’re going to learn it pretty fast. You’re going to pick up the domain expertise. By autonomy, we want you to do the whole thing so you have the full context. You’re going to be the one sourcing the data, building pipelines. You’re going to be applying the algorithmic routine. You’re going to be the one who frames that problem, figures out what algorithms you need, and you’re going to be the one delivering the output and connecting it back to some action, whatever that action may be. Maybe it’s adjusting our multi-channel strategy. Whatever that algorithmic output is, you’re responsible for it. So, that’s mastery. Now, you’re autonomous because you do all the pieces. You’re getting mastery over one domain, in that case, say marketing algorithms. You’re going to be looked at as you’re the best person in the company to go talk about how these things work; you know the end-to-end.

Then, purpose—that’s the impact that you’re going to make. In the case that we gave, marketing algorithms, you want to be accountable. You want to be the one who can move the needle when it comes to how much we should do. What channels are more effective at acquiring new customers? Whatever it is, you’re going to be held accountable for a real number, and that is motivating, that’s what makes people love their jobs.

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS

Editor’s note: Eric Colson will speak about augmenting machine learning with human computation for better personalization, at Strata + Hadoop World in San Jose this March.


“Humans-in-the-loop” machine learning systems

Next week I’ll be hosting a webcast featuring Adam Marcus, one of the foremost experts on the topic of “humans-in-the-loop” machine learning systems. It’s a subject many data scientists have heard about, but very few have had the experience of building production systems that leverage humans:

Crowdsourcing marketplaces like Elance-oDesk or CrowdFlower give us access to people all over the world who can take on a variety of tasks: acting as virtual personal assistants, labeling images, or cleaning up gnarly datasets. Humans can solve tasks that artificial intelligence is not yet able to solve, or needs help in solving, without having to resort to complex machine learning or statistics. But humans are quirky: give them bad instructions, allow them to get bored, or make them do too repetitive a task, and they will start making mistakes. In this webcast, I’ll explain how to effectively benefit from crowd workers to solve your most challenging tasks, using examples from the wild and from our work at GoDaddy.

Machine learning and crowdsourcing are at the core of most of the problems we solve on the Locu team at GoDaddy. When possible, we automate tasks with the help of trained regressions and classifiers. However, it’s not always possible to build machine-only decision-making tools, and we often need to marry machines and crowds. During the webcast, I will highlight how we build human-machine hybrids and benefit from active learning workflows. I’ll also discuss lessons that Aditya Parameswaran and I have collected for our upcoming book from 17 conversations with companies that make heavy use of crowd work.

A recent article in the NYTimes Magazine mentioned a machine-learning system built by some neuroscience researchers that is an excellent example of having “humans-in-the-loop”:

In 2012, Seung started EyeWire, an online game that challenges the public to trace neuronal wiring — now using computers, not pens — in the retina of a mouse’s eye. Seung’s artificial-intelligence algorithms process the raw images, then players earn points as they mark, paint-by-numbers style, the branches of a neuron through a three-dimensional cube. The game has attracted 165,000 players in 164 countries. In effect, Seung is employing artificial intelligence as a force multiplier for a global, all-volunteer army that has included Lorinda, a Missouri grandmother who also paints watercolors, and Iliyan (a.k.a. @crazyman4865), a high-school student in Bulgaria who once played for nearly 24 hours straight. Computers do what they can and then leave the rest to what remains the most potent pattern-recognition technology ever discovered: the human brain.

For more on this important topic, join me and Adam on January 22nd!

Real-world Active Learning

Beyond building training sets for machine learning, crowdsourcing is being used to enhance the results of machine learning models: in active learning, humans take care of uncertain cases while models handle the routine ones. Active learning is one of those topics that many data scientists have heard of, few have tried, and a small handful know how to do well. As data problems increase in complexity, I think active learning is a topic that many more data scientists need to familiarize themselves with.

Next week I’ll be hosting a FREE webcast on Active Learning featuring data scientist and entrepreneur Lukas Biewald:

Machine learning research is often not applied to real world situations. Often the improvements are small and the increased complexity is high, so except in special situations, industry doesn’t take advantage of advances in the academic literature.

Active learning is an example where research proposes a simple strategy that makes a huge difference and almost everyone applying machine learning to real world use cases is doing it or should be doing it. Active learning is the practice of taking cases where the model has low confidence, getting them labeled, and then using those labels as input data.

Webcast attendees will learn simple, practical ways to improve their models by cleaning up and tweaking the distribution of their training data. They will also learn about best practices from real world cases where active learning and data selection turned models that were completely unusable in production into extremely effective ones.
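The loop Biewald describes (train, find the examples the model is least confident about, send them out for labels, and fold the labels back into the training set) fits in a few lines. Below is a minimal sketch using scikit-learn; the get_human_labels helper is a hypothetical stand-in for whatever labeling workflow you use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def get_human_labels(examples):
    """Hypothetical stand-in: send examples to crowd workers and return their labels."""
    raise NotImplementedError("wire this up to your labeling workflow")

def active_learning_round(model, X_labeled, y_labeled, X_pool, batch_size=100):
    """One round: train, pick the least confident pool examples, label them, fold them back in."""
    model.fit(X_labeled, y_labeled)
    confidence = model.predict_proba(X_pool).max(axis=1)  # top-class probability per example
    query_idx = np.argsort(confidence)[:batch_size]       # least confident examples first
    new_labels = get_human_labels(X_pool[query_idx])
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    return model, X_labeled, y_labeled, X_pool

model = LogisticRegression(max_iter=1000)
# Repeat active_learning_round(model, X_labeled, y_labeled, X_pool)
# until the labeling budget runs out or validation accuracy stops improving.
```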

Crowdsourcing feature discovery

More than algorithms: companies gain access to models that incorporate ideas generated by teams of data scientists

[A version of this post appears on the O’Reilly Data blog and Forbes.]

Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons he started CrowdFlower was that, as a data scientist, he got frustrated with having to create training sets for many of the problems he faced. More recently, companies have been experimenting with active learning (humans take care of uncertain cases, models handle the routine ones). Along those lines, Adam Marcus described in detail how Locu uses crowdsourcing services to perform structured extraction (converting semi/unstructured data into structured data).

Another area where crowdsourcing is popping up is feature engineering and feature discovery. Experienced data scientists will attest that generating features is as important as (if not more important than) the choice of algorithm. Startup CrowdAnalytix uses public/open data sets to help companies enhance their analytic models. The company has access to several thousand data scientists spread across 50 countries and counts a major social network among its customers. Its current focus is on providing “enterprise risk quantification services to Fortune 1000 companies”.

CrowdAnalytix breaks up projects into two phases: feature engineering and modeling. During the feature engineering phase, data scientists are presented with a problem (dependent variable(s)) and are asked to propose features (predictors) along with brief explanations for why they might prove useful. A panel of judges evaluates features based on the accompanying evidence and explanations. Typically, 100+ teams enter this phase of the project, and 30+ teams propose reasonable features.

Continue reading “Crowdsourcing feature discovery”