Programming collective intelligence for financial trading

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Geoffrey Bradway on building a trading system that synthesizes many different models.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Geoffrey Bradway, VP of engineering at Numerai, a new hedge fund that relies on contributions of external data scientists. The company hosts regular competitions where data scientists submit machine learning models for classification tasks. The most promising submissions are then added to an ensemble of models that the company uses to trade in real-world financial markets.

To minimize model redundancy, Numerai filters out entries whose signals are already well covered by existing models in its ensemble. The company also plans to use (Ethereum) blockchain technology to develop an incentive system that rewards models that do well on live data (rather than ones that overfit and only do well on historical data).
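To make the redundancy filter concrete, here is a minimal sketch of one way such a check could work. It is not Numerai's actual method, which the episode notes above do not spell out: the function names, the use of Pearson correlation, and the 0.95 threshold are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant signal as uncorrelated
    return cov / sqrt(var_x * var_y)


def is_redundant(candidate, ensemble, threshold=0.95):
    """Hypothetical filter: flag a candidate whose signal closely tracks
    (or closely mirrors) the signal of any model already in the ensemble."""
    return any(abs(pearson(candidate, existing)) >= threshold
               for existing in ensemble)


# Toy signals: each list holds one model's predictions on the same five examples.
ensemble = [
    [0.1, 0.4, 0.35, 0.8, 0.9],
    [0.9, 0.2, 0.6, 0.1, 0.5],
]
near_copy = [0.12, 0.41, 0.33, 0.79, 0.91]  # almost identical to the first model
fresh = [0.5, 0.9, 0.1, 0.4, 0.3]           # weakly related to both models

print(is_redundant(near_copy, ensemble))
print(is_redundant(fresh, ensemble))
```

A real system would likely look at more than pairwise correlation (for example, whether a new model improves the ensemble's out-of-sample performance), but the sketch captures the basic idea of rejecting signals the ensemble already has.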

Here are some highlights from our conversation:

Creating large training data sets quickly

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Alex Ratner on why weak supervision is the key to unlocking dark data.

In this episode of the Data Show, I spoke with Alex Ratner, a graduate student at Stanford and a member of Christopher Ré’s Hazy research group. Training data has always been important in building machine learning algorithms, and the rise of data-hungry deep learning models has heightened the need for labeled data sets. In fact, the challenge of creating training data is ongoing for many companies; specific applications change over time, and what were gold standard data sets may no longer apply to changing situations.

Ré and his collaborators proposed a framework for quickly building large training data sets. In essence, they observed that high-quality models can be constructed from noisy training data. Some of these ideas were discussed in a previous episode featuring Mike Cafarella (jump to minute 24:16 for a description of an earlier project called DeepDive).

By developing a framework for mining low-quality sources in order to build high-quality machine learning models, Ré and his collaborators help researchers extract information previously hidden in unstructured data sources (so-called “dark data” buried in text, images, charts, and so on).
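As a rough illustration of how noisy sources can be turned into training labels, here is a minimal sketch that combines a few heuristic "labeling functions" by simple majority vote. This is a deliberately simplified stand-in and not the framework from Ré's group; the task (flagging customer complaints), the heuristics, and all names are hypothetical.

```python
# Label values: each heuristic may vote positive, vote negative, or abstain.
ABSTAIN, NEG, POS = 0, -1, 1


def lf_mentions_refund(text):
    """Hypothetical heuristic: asking for a refund suggests a complaint."""
    return POS if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text):
    """Hypothetical heuristic: thanking the company suggests no complaint."""
    return NEG if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    """Hypothetical heuristic: repeated exclamation marks suggest a complaint."""
    return POS if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]


def weak_label(text):
    """Combine the noisy votes by majority; ties and all-abstain stay unlabeled."""
    total = sum(lf(text) for lf in LABELING_FUNCTIONS)
    if total > 0:
        return POS
    if total < 0:
        return NEG
    return ABSTAIN


docs = [
    "I want a refund now!!",
    "Thanks, the order arrived.",
    "Where is my order?",
]
# Keep only the documents that received a label; these become training data.
training_set = [(d, weak_label(d)) for d in docs if weak_label(d) != ABSTAIN]
print(training_set)
```

Each heuristic is individually unreliable, but applied to a large unlabeled corpus they can produce a sizable (if noisy) training set without hand labeling every example.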

Here are some highlights from my conversation with Ratner:

Language understanding remains one of AI’s grand challenges

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: David Ferrucci on the evolution of AI systems for language understanding.

In this episode of the Data Show, I spoke with David Ferrucci, founder of Elemental Cognition and senior technologist at Bridgewater Associates. Ferrucci served as principal investigator of IBM’s DeepQA project and led the team behind Watson, the system that became champion of the Jeopardy! quiz show. Elemental Cognition (EC) is a research group focused on building an AI system that will be equipped with state-of-the-art natural language understanding technologies. Ferrucci envisions that EC will ship with foundational knowledge in many subject areas, but will be able to acquire knowledge in other (specialized) domains very quickly with the help of “human mentors.”

Because Ferrucci has built and deployed several prominent AI systems through the years, I also wanted to get his perspective on the evolution of AI technologies, and on how enterprises can take advantage of all the exciting recent developments.

Here are some highlights from our conversation:

Data preparation in the age of deep learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Lukas Biewald on why companies are spending millions of dollars on labeled data sets.

In this episode of the Data Show, I spoke with Lukas Biewald, co-founder and chief data scientist at CrowdFlower. In a previous episode we covered how the rise of deep learning is fueling the need for large labeled data sets and high-performance computing systems. CrowdFlower has a service that many leading companies have come to rely on to provide them with labeled data sets to train machine learning models. As deep learning models get larger and more complex, they require training data sets that are bigger than those required by other machine learning techniques.

The CrowdFlower platform combines the contributions of human workers and algorithms. Through a process called active learning, they send difficult tasks or edge cases to humans, and they let the algorithms handle the more routine examples. But how do you decide when to use human workers? In a simple example involving building an automatic classifier, you will probably want to send cases to human workers when your machine learning algorithms signal uncertainty (probability scores are on the low side) or when your ensemble of machine learning algorithms signals disagreement. As Biewald describes in our conversation, active learning is much more subtle, and the CrowdFlower platform, in particular, is able to combine humans and algorithms to handle more sophisticated tasks.
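The simple routing rule described above (send a case to a person when the models are uncertain or when they disagree) can be sketched in a few lines. This is only an illustration of that simple example for a binary classifier, not CrowdFlower's implementation; the function name and the 0.8 confidence threshold are assumptions.

```python
def needs_human(model_probs, min_confidence=0.8):
    """Decide whether one example should be routed to a human worker.

    model_probs: positive-class probabilities for the same example,
    one per model in a (hypothetical) ensemble of binary classifiers.
    """
    # Disagreement: the models do not all predict the same label.
    predicted_labels = {p >= 0.5 for p in model_probs}
    if len(predicted_labels) > 1:
        return True

    # Uncertainty: the averaged probability is too close to 0.5.
    mean_prob = sum(model_probs) / len(model_probs)
    confidence = max(mean_prob, 1 - mean_prob)
    return confidence < min_confidence


examples = {
    "clear positive": [0.95, 0.90, 0.97],
    "clear negative": [0.05, 0.10, 0.02],
    "uncertain": [0.55, 0.60, 0.58],
    "disagreement": [0.90, 0.20, 0.85],
}
for name, probs in examples.items():
    route = "human" if needs_human(probs) else "algorithm"
    print(f"{name}: {route}")
```

In this setup the confident cases stay with the algorithms, while the uncertain and contested ones go to people, whose labels can then be fed back in as new training data.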

Here are some highlights from our conversation:

Scaling machine learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Reza Zadeh on deep learning, hardware/software interfaces, and why computer vision is so exciting.

In this episode of the Data Show, I spoke with Reza Zadeh, adjunct professor at Stanford University, co-organizer of ScaledML, and co-founder of Matroid, a startup focused on commercial applications of deep learning and computer vision. Zadeh is also the co-author of the forthcoming book TensorFlow for Deep Learning (now in early release). Our conversation took place on the eve of the recent ScaledML conference, and much of it focused on practical, real-world strategies for scaling machine learning. In particular, we spoke about the rise of deep learning, hardware/software interfaces for machine learning, and the many commercial applications of computer vision.

Prior to starting Matroid, Zadeh was immersed in the Apache Spark community as a core member of the MLlib team. As such, he has firsthand experience trying to scale algorithms from within the big data ecosystem. Most recently, he’s been building computer vision applications with TensorFlow and other tools. While most of the open source big data tools of the past decade were written in JVM languages, many emerging AI tools and applications are not. Because Zadeh has spent time in both the big data and AI communities, I was interested to hear his take on the topic.

Here are some highlights from our conversation:

Natural language analysis using Hierarchical Temporal Memory

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Francisco Webber on building HTM-based enterprise applications.

In this episode of the Data Show, I spoke with Francisco Webber, founder of Cortical.io, a startup that is applying tools based on Hierarchical Temporal Memory (HTM) to natural language understanding. While HTM has been around for more than a decade, there aren’t many companies that have released products based on it (at least compared to other machine learning methods). Numenta, an organization developing open source machine intelligence based on the biology of the neocortex, maintains a community site featuring showcase applications. Webber’s company has been building tools based on HTM and applying them to big text data in a variety of industries; financial services has been a particularly strong vertical for Cortical.

Here are some highlights from our conversation:

Deep learning that’s easy to implement and easy to scale

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Anima Anandkumar on MXNet, tensor computations and deep learning, and techniques for scaling algorithms.

In this episode of the Data Show, I spoke with Anima Anandkumar, a leading machine learning researcher and currently a principal research scientist at Amazon. I took the opportunity to get an update on the latest developments in the use of tensors in machine learning. Most of our conversation centered around MXNet, an open source, efficient, scalable deep learning framework. I’ve been a fan of MXNet dating back to when it was a research project out of CMU and UW, and I wanted to hear Anandkumar’s perspective on its recent progress as a framework for enterprises and practicing data scientists.

Here are some highlights from our conversation: