Transforming organizations through analytics centers of excellence

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Carme Artigas on helping enterprises transform themselves with big data tools and technologies.

In this episode of the Data Show, I spoke with Carme Artigas, co-founder and CEO of Synergic Partners (a Telefonica company). As more companies adopt big data technologies and techniques, it’s useful to remember that the end goal is to extract information and insight. In fact, as with any collection of tools and technologies, the main challenge is identifying and prioritizing use cases.

As Artigas describes, one can categorize use cases for big data into the following types:

  • Improve decision-making or operational efficiency
  • Generate new or additional revenue
  • Predict or prevent fraud (forecasting or minimizing risks)

Artigas has spent many years helping large organizations develop best practices for how to use data and analytics. We discussed some of the key challenges faced by organizations that wish to adopt big data technologies, centers of excellence for analytics, and AI in the enterprise.

The state of machine learning in Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.

In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.

We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley; Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered around machine learning. Like many in the audience, I was first attracted to Spark because it allowed me to scale machine learning algorithms to large data sets while still providing reasonable latency.

Here is a partial list of the items we discussed:

  • The current state of machine learning in Spark.
  • Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), we discussed the role of Spark ML moving forward.
  • The plan to make it easier to integrate advanced analytics libraries that aren’t “textbook machine learning,” like NLP, time series analysis, and graph analysis into Spark and Spark ML pipelines.
  • Some upcoming projects from Berkeley and Stanford that target AI applications (including newer systems that provide lower latency and higher throughput).
  • Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning — lack of training data, and deploying and monitoring models in production.

[Full disclosure: I am an advisor to Databricks.]

The current state of applied data science

[A version of this post appears on the O’Reilly Radar.]

Recent trends in practical use and a discussion of key bottlenecks in supervised machine learning.

As we enter the latter part of 2017, it’s time to take a look at the common challenges faced by companies interested in using data science and machine learning (ML). Let’s assume your organization is already collecting data at a scale that justifies the use of analytic tools, and that you’ve managed to identify and prioritize use cases where data science can be transformative (including improvements to decision-making or business operations, increasing revenue, etc.). Data gathering and identifying interesting problems are non-trivial, but assuming you’ve gotten a healthy start on these tasks, what challenges remain?

Data science is a large topic, so I’ll offer a disclaimer: this post is mainly about the use of supervised machine learning today, and it draws from a series of conversations over the last few months. I’ll have more to say about AI systems in future posts, but such systems clearly rely on more than just supervised learning.

It all begins with (training) data

Even assuming you have a team that handles data ingestion and integration, and a team that maintains a data platform (“source of truth”) for you, new data sources continue to appear, and it’s incumbent on domain experts to highlight them. Moreover, since we’re dealing mainly with supervised learning, it’s no surprise that lack of training data remains the primary bottleneck in machine learning projects.

There are some good research projects and tools for quickly creating large training data sets (or augmenting existing ones). Stanford researchers have shown that weak supervision and data programming can be used to train models without access to a lot of hand-labeled training data. Preliminary work on generative models (by deep learning researchers) has produced promising results in unsupervised learning in computer vision and other areas.

A framework for building and evaluating data products

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Pinterest data scientist Grace Huang on lessons learned in the course of machine learning product launches.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Grace Huang, data science lead at Pinterest. With its combination of a large social graph, enthusiastic users, and multimedia data, I’ve long regarded Pinterest as a fascinating lab for data science. Huang described the challenge of building a sustainable content ecosystem and shared lessons from the front lines of machine learning product launches. We also discussed recommenders, the emergence of deep learning as a technique used within Pinterest, and the role of data science within the company.

Here are some highlights from our conversation:

Programming collective intelligence for financial trading

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Geoffrey Bradway on building a trading system that synthesizes many different models.

In this episode of the Data Show, I spoke with Geoffrey Bradway, VP of engineering at Numerai, a new hedge fund that relies on contributions of external data scientists. The company hosts regular competitions where data scientists submit machine learning models for classification tasks. The most promising submissions are then added to an ensemble of models that the company uses to trade in real-world financial markets.

To minimize model redundancy, Numerai filters out entries that produce signals that are already well-covered by existing models in their ensemble. The company also plans to use (Ethereum) blockchain technology to develop an incentive system to reward models that do well on live data (not ones that overfit and do well on historical data).
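
The post does not say how Numerai measures whether a signal is "already well-covered," so the following is only a minimal sketch of one plausible approach: reject a submission when its predictions are highly correlated with any model already in the ensemble. The function names, the use of Pearson correlation, and the 0.95 threshold are all assumptions made for illustration, not Numerai's actual method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length prediction vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no usable correlation
    return cov / sqrt(var_x * var_y)


def is_redundant(candidate, ensemble, threshold=0.95):
    """Assumption: a signal is 'well-covered' if it is strongly correlated
    (positively or negatively) with at least one existing ensemble member."""
    return any(abs(pearson(candidate, member)) >= threshold for member in ensemble)


def filter_submissions(submissions, ensemble, threshold=0.95):
    """Return the names of submissions that add a new signal.

    `submissions` is a list of (name, predictions) pairs. Accepted
    predictions join the pool, so later entries are also compared
    against them.
    """
    pool = list(ensemble)
    accepted = []
    for name, predictions in submissions:
        if not is_redundant(predictions, pool, threshold):
            accepted.append(name)
            pool.append(predictions)
    return accepted


existing = [[0.1, 0.2, 0.3, 0.4]]
entries = [
    ("rescaled_copy", [0.2, 0.4, 0.6, 0.8]),  # same signal, different scale
    ("novel", [0.4, 0.1, 0.3, 0.2]),
]
print(filter_submissions(entries, existing))
```

In this toy run only the "novel" entry survives, because the rescaled copy is perfectly correlated with the existing model. A production system would likely compare signals on held-out or live data rather than on a handful of points.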

Here are some highlights from our conversation:

Creating large training data sets quickly

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Alex Ratner on why weak supervision is the key to unlocking dark data.

In this episode of the Data Show, I spoke with Alex Ratner, a graduate student at Stanford and a member of Christopher Ré’s Hazy research group. Training data has always been important in building machine learning algorithms, and the rise of data-hungry deep learning models has heightened the need for labeled data sets. In fact, the challenge of creating training data is ongoing for many companies; specific applications change over time, and what were gold standard data sets may no longer apply to changing situations.

Ré and his collaborators proposed a framework for quickly building large training data sets. In essence, they observed that high-quality models can be constructed from noisy training data. Some of these ideas were discussed in a previous episode featuring Mike Cafarella (jump to minute 24:16 for a description of an earlier project called DeepDive).

By developing a framework for mining low-quality sources in order to build high-quality machine learning models, Ré and his collaborators help researchers extract information previously hidden in unstructured data sources (so-called “dark data” buried in text, images, charts, and so on).
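
To make the idea of building labels from noisy sources concrete, here is a minimal sketch of weak supervision using hand-written labeling functions. Each function votes on a label or abstains, and the votes are combined by simple majority. Note the substitution: the data programming work from Ré's group learns the accuracy of each labeling function with a generative model, whereas this sketch uses an unweighted majority vote as a stand-in. The labeling functions, the complaint-detection task, and all names are hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions for a toy task:
# label 1 = customer complaint, label 0 = not a complaint.
def lf_mentions_refund(text):
    return 1 if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text):
    return 0 if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return 1 if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]


def majority_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine noisy votes by unweighted majority; abstain on ties or no votes."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting sources with equal support
    return ranked[0][0]


def build_training_set(texts):
    """Keep only the examples that received a label."""
    labeled = [(text, majority_label(text)) for text in texts]
    return [(text, label) for text, label in labeled if label is not ABSTAIN]


unlabeled = [
    "I want a refund now!!",
    "Thanks for the help",
    "Hello",
    "Thanks, but I want a refund",
]
for example in build_training_set(unlabeled):
    print(example)
```

The resulting (text, label) pairs are noisy, but they can be produced for large unlabeled collections at little cost and then used to train a downstream model.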

Here are some highlights from my conversation with Ratner:

What are machine learning engineers?

[A version of this post appears on the O’Reilly Radar.]

A new role focused on creating data products and making data science work in production.

by Ben Lorica and Mike Loukides

We’ve been talking about data science and data scientists for a decade now. While there’s always been some debate over what “data scientist” means, we’ve reached the point where many universities, online academies, and bootcamps offer data science programs: master’s degrees, certifications, you name it. The world was a simpler place when we only had statistics. But simplicity isn’t always healthy, and the diversity of data science programs demonstrates nothing if not the demand for data scientists.

As the field of data science has developed, any number of poorly distinguished specialties have emerged. Companies use the terms “data scientist” and “data science team” to describe a variety of roles, including:

  • individuals who carry out ad hoc analysis and reporting (including BI and business analytics)
  • people who are responsible for statistical analysis and modeling, which, in many cases, involves formal experiments and tests
  • machine learning modelers who increasingly develop prototypes using notebooks

And that listing doesn’t include the people DJ Patil and Jeff Hammerbacher were thinking of when they coined the term “data scientist”: the people who are building products from data. These data scientists are most similar to the machine learning modelers, except that they’re building something: they’re product-centric rather than research-centric. They typically work across large portions of data products. Whatever the role, data scientists aren’t just statisticians; they frequently have doctorates in the sciences, with a lot of practical experience working with data at scale. They are almost always strong programmers, not just specialists in R or some other statistical package. They understand data ingestion, data cleaning, prototyping, bringing prototypes to production, product design, setting up and managing data infrastructure, and much more. In practice, they turn out to be the archetypal Silicon Valley “unicorns”: rare and very hard to hire.

What’s important isn’t that we have well-defined specialties; in a thriving field, there will always be huge gray areas. What made “data science” so powerful was the realization that there was more to data than actuarial statistics, business intelligence, and data warehousing. Breaking down the silos that separated data people from the rest of the organization—software development, marketing, management, HR—is what made data science distinct. Its core concept was that data was applicable to everything. The data scientist’s mandate was to gather, and put to use, all the data. No department went untouched.