The current state of applied data science

[A version of this post appears on the O’Reilly Radar.]

Recent trends in practical use and a discussion of key bottlenecks in supervised machine learning.

As we enter the latter part of 2017, it’s time to take a look at the common challenges faced by companies interested in using data science and machine learning (ML). Let’s assume your organization is already collecting data at a scale that justifies the use of analytic tools, and that you’ve managed to identify and prioritize use cases where data science can be transformative (including improvements to decision-making or business operations, increasing revenue, etc.). Data gathering and identifying interesting problems are non-trivial, but assuming you’ve gotten a healthy start on these tasks, what challenges remain?

Data science is a large topic, so I’ll offer a disclaimer: this post is mainly about the use of supervised machine learning today, and it draws from a series of conversations over the last few months. I’ll have more to say about AI systems in future posts, but such systems clearly rely on more than just supervised learning.

It all begins with (training) data

Even assuming you have a team that handles data ingestion and integration, and a team that maintains a data platform (“source of truth”) for you, new data sources continue to appear, and it’s incumbent on domain experts to highlight them. Moreover, since we’re dealing mainly with supervised learning, it’s no surprise that lack of training data remains the primary bottleneck in machine learning projects.

There are some good research projects and tools for quickly creating large training data sets (or augmenting existing ones). Stanford researchers have shown that weak supervision and data programming can be used to train models without access to a lot of hand-labeled training data. Preliminary work on generative models (by deep learning researchers) has produced promising results in unsupervised learning in computer vision and other areas.
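
To make the idea of data programming concrete, here is a minimal sketch in Python. Instead of hand-labeling examples, you write small heuristic "labeling functions" that each vote on an example or abstain. The three functions and the toy spam task below are hypothetical, and the votes are combined by simple majority. The Stanford work instead learns how much to trust each function, so treat this as an illustration of the workflow rather than their method.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions for a toy spam task (1 = spam, 0 = not spam).
def lf_mentions_prize(text):
    return 1 if "prize" in text.lower() else ABSTAIN

def lf_has_many_exclamations(text):
    return 1 if text.count("!") >= 3 else ABSTAIN

def lf_mentions_meeting(text):
    return 0 if "meeting" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [
    lf_mentions_prize,
    lf_has_many_exclamations,
    lf_mentions_meeting,
]

def weak_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine noisy votes by simple majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels: safer to leave unlabeled
    return ranked[0][0]

unlabeled = [
    "You won a PRIZE!!! Claim now!!!",
    "Agenda for Monday's meeting attached",
    "Lunch?",
]

# Keep only the examples that received a label; these become training data.
training_set = [
    (text, weak_label(text))
    for text in unlabeled
    if weak_label(text) is not ABSTAIN
]
```

Examples on which every function abstains (or the votes tie) are simply left out, which is usually preferable to training on a coin flip.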

The adage “think about features, not algorithms” is another useful way to assess data in the context of machine learning. Here’s a friendly reminder: data enrichment can potentially improve your existing models, and in some situations, it can even help ease the cold start problem. Most data scientists probably already enrich their existing data sets with open data or through third-party data providers, but I find that data enrichment can sometimes be overlooked. Obtaining external data, normalizing it, and experimenting with it are not considered as glamorous as model and algorithm development.

From prototype to production

In many use cases the goal is to productionalize a data science project. We’ve pointed out that a new job role, machine learning engineer, has recently emerged to streamline this process. There is also a new set of tools to help ease the transition from prototype to production and to help track the context and metadata that accompany analytic products.

We are still in the early stages of deploying machine learning into products, and best practices are just beginning to emerge. As advanced analytic models get more widely used, there are several considerations to keep in mind, including:

Deployment environment: You’ll likely need to integrate with the existing logging or A/B testing infrastructure. Besides deploying robust and performant models on a server, teams increasingly face the question of how and when to deploy models to the edge (mobile devices are a common example). There are new tools and strategies for deploying models to edge devices.
Scale, latency, freshness: How much data is needed to train the models? What is a reasonable response time for model inference? How often should models be retrained and data sets be refreshed? The latter implies you have reproducible data pipelines in place.
Bias: If your training data is not representative of the current population, you’ll get poor (and even unfair) results. In some situations, you might be able to use propensity scores or other methods to adjust your data set accordingly.
Monitoring models: I think people underestimate the importance of monitoring models, and this is an area where people trained in statistics have a competitive advantage. It can get tricky to figure out when and how much models have degraded. Concept drift might be a factor. In the case of classifiers, one strategy is to compare the distribution of classes predicted by your models to the observed distribution of predicted classes. You can also have business goals that are distinct from the metrics used to evaluate machine learning models. For example, a recommendation system might be tasked with helping surface “dark or long-tail” content.
Mission-critical applications: Models deployed in mission-critical settings will need to be much more robust than your typical consumer application. In addition, machine learning applications in such settings need to be designed to run “continuously” for months on end (e.g., without memory leaks).
Privacy and security: Generally speaking, users and companies are more likely to share data if you can convince them their data is secure. And as I noted, data enriched with extra features tends to lead to better results. For companies conducting business in the European Union, one issue looms in the short term: the General Data Protection Regulation (GDPR) takes effect in May 2018. On other fronts, practical research in adversarial ML and secure ML (including being able to work with encrypted data) is beginning to appear.
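
As a rough illustration of the monitoring strategy for classifiers mentioned above, the sketch below compares two class distributions (for example, a reference distribution from validation data or observed outcomes against recent production predictions) using total variation distance. The distance measure, the 0.1 alert threshold, and the churn example are all assumptions chosen for illustration; a real system would tune the threshold and likely account for sample size.

```python
from collections import Counter

def class_distribution(labels, classes):
    """Return the share of each class in `labels`, keyed by class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {c: counts.get(c, 0) / total for c in classes}

def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same classes.

    0.0 means the distributions are identical; 1.0 means they do not overlap.
    """
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)

def drift_alert(predicted, reference, classes, threshold=0.1):
    """Return (distance, alert) where alert is True if distance > threshold.

    The threshold is an arbitrary illustrative default, not a recommendation.
    """
    p = class_distribution(predicted, classes)
    q = class_distribution(reference, classes)
    distance = total_variation_distance(p, q)
    return distance, distance > threshold

# Hypothetical example: the model predicts 10% churn, the reference shows 30%.
classes = ["churn", "stay"]
predicted = ["churn"] * 10 + ["stay"] * 90
reference = ["churn"] * 30 + ["stay"] * 70

distance, alert = drift_alert(predicted, reference, classes)
print(f"distance={distance:.2f} alert={alert}")
```

A gap like this does not say why the model has drifted, only that the predicted mix no longer matches the reference and is worth investigating.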

Model development

Model and algorithm development get much more media coverage, but when you talk with data scientists, most of them will tell you lack of training data and productionalizing data science are more pressing concerns. Often, there are enough straightforward use cases that you can start with your favorite (basic or advanced) algorithm and tweak or replace it later.

Because tools make it easy to apply algorithms, as a first step it’s good to brush up on how to evaluate the results of machine learning models. With that said, never lose sight of your business metrics and objectives, as they need not completely coincide with having the best-tuned or best-performing model. Follow developments pertaining to fairness and transparency that are beginning to be examined and addressed by researchers and companies. Privacy concerns and the proliferation of devices are giving rise to techniques that do not rely on centralized data sets.
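
As a small refresher on model evaluation, the sketch below computes accuracy, precision, recall, and F1 for a binary classifier, and shows why accuracy alone can mislead when classes are imbalanced. The data is made up for illustration: 5 positives out of 100 examples, a "model" that always predicts the negative class, and a second model that finds most positives at the cost of some false alarms.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A never predicts the positive class.
always_negative = [0] * 100

# Model B finds 4 of the 5 positives but raises 6 false alarms.
some_recall = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 89

metrics_a = binary_metrics(y_true, always_negative)
metrics_b = binary_metrics(y_true, some_recall)
print(metrics_a)
print(metrics_b)
```

Model A has the higher accuracy (0.95 versus 0.93) yet never finds a positive case. Which model is better depends on the business cost of a missed positive compared with a false alarm, which is exactly why the metric should follow the objective.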

Deep learning is slowly becoming part of the class of algorithms data scientists need to know about. Originally used in computer vision and speech recognition, it is starting to show up in examples and use cases involving data types and problems that data scientists can relate to. Challenges include choosing the right network architecture (architecture engineering is the new feature engineering), hyperparameter tuning, and casting problems and transforming data so they lend themselves to deep learning. (Coincidentally, one of the more interesting large-scale data products I’ve encountered this year isn’t based on deep learning.)

In many cases, users prefer models that are explainable (in some settings, black box models simply aren’t acceptable). Given that their underlying mechanisms are somewhat understandable, explainable models are also potentially easier to improve. With the recent rise of deep learning, I’m seeing companies use tools that explain how models produce their predictions and tools that can explain where a model comes from by tracing predictions back to the learning algorithm and training data.
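
To give a flavor of what a prediction-explaining tool does, here is a deliberately simple sketch: treat the model as a black box, replace one feature at a time with a baseline value, and record how much the prediction changes. This one-at-a-time ablation is much cruder than the tools alluded to above (it ignores feature interactions, for one), and the scoring function, feature names, and baseline values are all hypothetical.

```python
def local_sensitivity(predict, instance, baseline):
    """Explain one prediction by swapping each feature for a baseline value.

    Returns, per feature, how much the prediction drops when that feature
    is replaced (a positive value means the feature pushed the score up).
    """
    original = predict(instance)
    effects = {}
    for name in instance:
        perturbed = dict(instance)        # copy so the input is untouched
        perturbed[name] = baseline[name]  # change exactly one feature
        effects[name] = original - predict(perturbed)
    return effects

# Hypothetical black-box model; only its predictions are used below.
def credit_score(x):
    return 0.5 * x["income"] - 2.0 * x["late_payments"] + 0.0 * x["zip_digit"]

applicant = {"income": 80, "late_payments": 3, "zip_digit": 7}
typical = {"income": 60, "late_payments": 1, "zip_digit": 5}

effects = local_sensitivity(credit_score, applicant, typical)

# List features from most to least influential for this applicant.
for name, effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {effect:+.1f}")
```

Even this crude view shows which inputs mattered for a given prediction and which were ignored, which is the kind of output users tend to ask for when a black box is not acceptable.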

Tools

I won’t attempt to create a list, as there are simply too many tools to enumerate. Tools that help you with data ingestion, integration, processing, preparation, and storage, as well as model deployment, are all critical. Here are a few observations on machine learning tools:

Python and R are the most popular languages. Keras is the most popular entry point for those wanting to use deep learning (Keras now comes bundled when you install TensorFlow).
While notebooks seem to be the model development tool of choice, IDEs are popular among R users.
There are a lot of libraries for general machine learning and deep learning; some are better at easing the transition from prototype to production.
Ease of scaling from a laptop to a cluster is an important consideration, and Apache Spark is a popular execution framework for making that happen. It’s also often the case that after a series of data wrangling steps you are able to fit your data set into a single, beefy server.
Vendors are starting to support collaboration and version control.
At the end of the day, you may need data science tools that seamlessly integrate with your existing ecosystem and data platform.

This is a great time for companies to assess what problems and use cases lend themselves to machine learning. I’ve attempted to summarize some recent trends and remaining bottlenecks, and your main takeaway should be that you can start using machine learning. Start with a problem for which you already have some data. The fancy models come later.

Thanks to David Talby for comments and suggestions to a draft version of this post.

Related content:

When models go rogue: Hard-earned lessons about using machine learning in production (an upcoming Strata Data NYC talk by David Talby)
Deep learning for recommender systems: an upcoming 3-hour tutorial at Strata Data NYC
Creating large training data sets quickly: why weak supervision is the key to unlocking dark data
Mastering Feature Engineering: Principles and techniques for data scientists
Use deep learning on data you already have: putting deep learning into practice with new tools, frameworks, and future development
What are machine learning engineers: examining new role focused on creating data products and making data science work in production.
Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls
Building and deploying large-scale machine learning pipelines: why we need primitives, pipeline synthesis tools, and most importantly, error analysis and verification
Why should I trust you? Carlos Guestrin on tools for explaining the predictions of machine-learning models
Introduction to Local Interpretable Model-Agnostic Explanations (LIME): a technique to explain the predictions of any machine learning classifier
