Model Monitoring Enables Robust Machine Learning Applications


Key features of ML monitoring solutions, why companies need a holistic MLOps platform that includes model monitoring, and challenges companies face in making that happen.

By Ben Lorica and Paco Nathan.

According to the 2020 Gartner Hype Cycle for Artificial Intelligence, machine learning (ML) is entering the Trough of Disillusionment phase. This is the phase where the real work begins—best practices, infrastructures, and tools are being developed to facilitate the technology’s integration into real-world production environments. Today, ML technologies have secured a central role in many companies.

Chart 1: A glimpse at how a few large companies use machine learning. Graphic: GradientFlow.

ML technologies are also beginning to gain footholds across industries as enterprise adoption widens. For example, advances in speech and natural language models are fueling growth in voice applications. Demand for ML talent continues to rise, too: the roles topping LinkedIn’s 2020 Emerging Jobs Report list machine learning among their required skills. A quick scan of Fortune 1000 companies offers a snapshot of average ML engagement across industries.

Chart 2: A snapshot of the average engagement by Fortune 1000 industry category in April 2021. Data and Graphic: GradientFlow. (Fortune 1000 is a trademark of Fortune Media IP Limited.)

Model Degradation

Model deployment isn’t a destination—models need near-constant monitoring and retraining. Manasi Vartak, CEO of Verta.ai, points out that model degradation begins upon deployment. “In our experience helping organizations deploy hundreds of ML models,” she explains, “models begin to degrade the moment they get deployed. This is particularly true for models built on time-varying data, but it also holds for models built on so-called static data, like natural images, because the deployed model is used on new and unseen data.”

Models interacting with the real world and failing to make sense of it can have serious consequences. Consider the following examples:

  • Healthcare: Engineers at John Snow Labs found that a predictive readmission model that had been trained, optimized, and deployed at a hospital would start degrading sharply, and predicting poorly, within two to three months. Issues for the company and its customers grew in proportion to the number of hospitals where the model was deployed.
  • Security: Adversaries and threats are constantly changing. To stay far enough ahead to prevent attacks, companies and researchers need to continuously monitor and retrain their ML models.
  • Watson for Oncology: This system was designed to recommend treatments for cancer patients. It was withdrawn from the market after the model degraded to the point that it was suggesting unsafe treatments.

How Model Degradation Happens

You never really know how well a model will work until it is deployed and interacting with users. As we’ve noted, real-world user interactions and live data often differ from historical or training data. For example, degradation can occur when a model fails to generalize to real-world data it hasn’t encountered before. Many catalysts can cause a model to degrade.

Chart 3: Examples of real-world situations that can cause a model to degrade. Chart: GradientFlow.

Model Monitoring is Challenging

Monitoring technology performance is nothing new. For example, there’s a long history of application performance management (APM) — the monitoring and management of software applications. While there are many lessons and approaches ML teams can glean from software monitoring, machine learning monitoring has unique challenges that require specialized tools and techniques.

Measurement

It is inherently difficult to obtain or establish ground-truth when training machine learning models. Training usually requires labeled data, which is subject to “measurement errors,” subjectivity (“expert opinion”), or even bias. This lack of ground-truth makes measuring model quality difficult—even defining “accuracy” in the context of an ML model can be a challenge.

The ability to measure ML model performance is also affected by the addition of new metrics that represent new sources of risk. The emergence of Responsible AI is a good example: in addition to ML and business metrics, companies now are tasked with measuring metrics around security and privacy, safety and reliability, fairness, and transparency and accountability.

As regulators and lawmakers increasingly require organizations to continuously review AI and ML models, even after initial approval for deployment, it’s important for companies to have structures in place to accurately and responsibly measure the quality of their ML models. This living document tracking AI incidents, actively maintained by Patrick Hall, co-founder of BNH.ai, highlights not only the increasing number of model failure incidents but also the increasing breadth in the types of incidents as machine learning becomes a more pervasive technology across industries.

Customization

There is no one-size-fits-all solution to ML monitoring. The quality metrics an organization needs to track are unique to each model type and domain. There are three general considerations when determining what metrics to monitor.

  1. The model itself. You need to determine which ML or statistical metrics to measure to ensure your model is addressing the intended business problem; candidates include F1, precision, recall, MSE, and many others (see the sketch after this list). It’s also important to measure the performance of the data science team, using metrics such as key performance indicators (KPIs); key risk indicators (KRIs), which are used in high-risk sectors such as finance and manufacturing; and business metrics tied to model performance.
  2. The specific domain and application. False negatives and false positives can have significant consequences, depending on the application. Service-level agreements (SLAs) and metrics can be specific to a domain.
  3. Additional domain-specific considerations. These include Responsible AI metrics, and regulatory and compliance considerations.
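
To make the first consideration concrete, here is a minimal sketch, assuming scikit-learn is available, of how a team might compute a few of the candidate model-quality metrics named above. The function names and toy labels are illustrative and not part of any particular monitoring product.

```python
# A minimal sketch: computing candidate quality metrics for monitoring.
# Assumes scikit-learn is installed; names and toy data are illustrative only.
from sklearn.metrics import f1_score, precision_score, recall_score, mean_squared_error

def classification_metrics(y_true, y_pred):
    """Return a dict of common classification metrics worth tracking over time."""
    return {
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }

def regression_metrics(y_true, y_pred):
    """Return a dict of common regression metrics worth tracking over time."""
    return {"mse": mean_squared_error(y_true, y_pred)}

if __name__ == "__main__":
    # Toy labels and predictions, purely for demonstration.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    print(classification_metrics(y_true, y_pred))
```

In practice, which of these numbers matters, and how often it needs to be recomputed, depends on the second and third considerations above.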

Complexity

Machine learning approaches involve multiple complex, distributed systems. Convoluted model lineage and complex data pipelines make root cause analysis extremely hard. For this reason, model monitoring tools should integrate with or include systems for monitoring data and data quality.

Organizational structures can also add to the complexity—oftentimes, companies use separate teams to train, test, deploy, and manage/monitor their models.

Accessibility

As machine learning technologies become more widespread, there are more types of users and more users of varying skill levels. Non-experts are increasingly using and deploying ML models, and as we noted, ML has unique challenges pertaining to measurement, variety of metrics, and complexity of root cause analysis. For non-experts to deploy and use models, monitoring must be easy to plug in and interpret.

Scale

As with most technologies, scaling is an important consideration when implementing machine learning approaches. Monitoring tools need to scale to large datasets, a large number of statistics, and to both live and batch inference. Each of the challenges we’ve mentioned so far, however, applies even if you only have a handful of models. Some companies have thousands of models deployed, perhaps even millions, in the case of big tech platforms that are constantly testing and deploying highly customized and personalized models.

Desirable features of a model monitoring system

In a recent post on DataOps, we outlined three facets that any “ops”-related function usually involves: monitoring, automation, and incident response. The same components underpin a robust model monitoring process:

  1. Establish timely alerts to quickly know when models are failing or degrading.
  2. Identify the root cause of a failure or degradation.
  3. Enable an agile response. A fast recovery or fast update closes the loop to minimize the mean time to recovery (MTTR).

Let’s take a look at each area in the context of machine learning.

Know when models are failing or degrading

The five challenges we outlined in the section “Model Monitoring is Challenging” all bear on knowing when a model is failing. Failing models usually require a more robust response than regular retraining. Frequent retraining can mitigate risks stemming from deployed models, but it’s also important to have a strong model monitoring system in place to help you (1) detect problems early and (2) diagnose and address problems quickly.
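
As one illustration of early detection, here is a minimal sketch, assuming SciPy and NumPy, of flagging input drift with a two-sample Kolmogorov-Smirnov test that compares a training feature’s distribution to live traffic. The significance threshold and simulated data are assumptions for the example; real systems typically track many features and many statistics at once.

```python
# A minimal sketch of early drift detection. Assumes scipy and numpy are
# available; the threshold and simulated feature are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, live_values, alpha=0.01):
    """Flag drift when a two-sample KS test rejects 'same distribution'."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}

# Example: simulate a live feature whose distribution has shifted.
rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.5, scale=1.2, size=1_000)

report = detect_feature_drift(train_feature, live_feature)
if report["drifted"]:
    print(f"ALERT: possible drift (KS={report['statistic']:.3f}, p={report['p_value']:.4f})")
```

A check like this only raises a flag; deciding whether the flag reflects genuine degradation is where the diagnosis work below begins.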

Identify the root cause

Chart 3 illustrates examples of real-world situations that can cause a model to degrade, and it’s important to distinguish among these reasons. Detecting a root cause is challenging because models are part of complex workflows, and the “root cause” is often a data quality issue or a broken data pipeline (see Facebook). Thus, model monitoring solutions should include or integrate with data quality and other relevant DataOps systems. Organizational structures can compound the difficulty of identifying the root cause as well: the team that trained the model might not be the team responsible for updating or fixing it. What’s clear is that the more tightly you can connect model development with your model deployment and monitoring tools, the better off you are.
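
To illustrate the data-quality angle, here is a minimal sketch, assuming pandas, of the kind of schema and null-rate check a monitoring system might run on a batch of inference inputs to help narrow down a root cause. The column names and tolerances are hypothetical.

```python
# A minimal sketch of a data-quality check used to narrow down a root cause.
# Assumes pandas; column names and tolerances are illustrative assumptions.
import pandas as pd

def data_quality_report(batch: pd.DataFrame, expected_columns, max_null_rate=0.02):
    """Flag schema and null-rate problems in a batch of inference inputs."""
    issues = []
    missing = set(expected_columns) - set(batch.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    for col in expected_columns:
        if col in batch.columns:
            null_rate = batch[col].isna().mean()
            if null_rate > max_null_rate:
                issues.append(f"{col}: null rate {null_rate:.1%} exceeds {max_null_rate:.0%}")
    return issues

# Example usage with a toy batch of inputs:
batch = pd.DataFrame({"age": [34, None, 51], "income": [72000, 55000, None]})
print(data_quality_report(batch, expected_columns=["age", "income", "zip_code"]))
```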

Automation

In 2017, as we observed the rapid growth in the number of models being deployed and in the number of companies engaging with ML, we predicted that machine learning would increasingly be used to monitor machine learning. Automation in this context means two things:

  1. Facilitating intelligent detection and alerting to pre-emptively identify issues in order to trigger remediations.
  2. Having the ability to execute retraining workflows, intelligently deploy fallback models, and alert teams when human intervention is needed (see the sketch after this list).
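
As a rough illustration of the second point, the sketch below encodes a simple response policy. The trigger_retraining, deploy, and alert_team callables are placeholders for whatever workflow orchestrator, model registry, and paging system a team actually uses; the thresholds and fallback name are illustrative assumptions, not a recommended configuration.

```python
# A minimal sketch of an automated response policy, not a production system.
# trigger_retraining, deploy, and alert_team stand in for real orchestration,
# registry, and paging systems; thresholds are illustrative.
def automated_response(live_f1, baseline_f1, drift_detected,
                       trigger_retraining, deploy, alert_team,
                       fallback_model="previous_stable", max_drop=0.05):
    """Decide how to respond when monitoring flags a problem."""
    degraded = (baseline_f1 - live_f1) > max_drop
    if not (degraded or drift_detected):
        return "no_action"
    if degraded:
        # Serious degradation: fall back first, then retrain.
        deploy(fallback_model)
        trigger_retraining()
        alert_team(f"F1 dropped from {baseline_f1:.2f} to {live_f1:.2f}; "
                   f"fell back to {fallback_model} and kicked off retraining.")
        return "fallback_and_retrain"
    # Drift without a confirmed metric drop: retrain proactively.
    trigger_retraining()
    alert_team("Input drift detected; retraining triggered for review.")
    return "retrain"

# Example usage with print-based stand-ins for the real systems:
action = automated_response(
    live_f1=0.71, baseline_f1=0.82, drift_detected=True,
    trigger_retraining=lambda: print("retraining triggered"),
    deploy=lambda name: print(f"deployed {name}"),
    alert_team=print,
)
print(action)
```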

Closing Thoughts

Machine learning applications are complex, and models degrade and fail even at the most mature ML companies. In this post, we described key challenges in monitoring ML models, outlined several key components of a model monitoring platform, and offered concrete reasons why any machine learning program needs a well-structured monitoring practice. We believe companies need a holistic MLOps platform that includes model monitoring, rather than one stitched together from disparate components: monitoring solutions with access to training information (algorithms, datasets, and experiments) are much more attractive. This is also what brings the holy grail of an end-to-end MLOps solution within reach.

Chart 4: A well-structured model monitoring platform. Chart: GradientFlow

In closing, we’ll leave you with actionable steps to approach the challenges we’ve laid out in this post. Manasi Vartak, CEO of Verta.ai, offers three succinct pieces of advice for companies developing or implementing machine learning technologies:

  1. Operate defensively; assume your model will fail and put in early-warning systems to detect failure.
  2. Have a Plan B: what will you do if your model fails? Retrain, fall back to an older model, or serve no-op predictions?
  3. Invest in robust ML infrastructures and processes, so you can respond to incidents immediately.

In a future post, we’ll offer more context by describing the current state of model monitoring solutions. We will provide a description and taxonomy of available solutions, and detail how they meet the challenges we described in this post.

This post is part of a collaboration between Gradient Flow and Verta. See our statement of editorial independence.
