A team from Google Research shares lessons learned from high-stakes domains.
Data has been an undervalued component of AI development since the dawn of AI. We are now seeing the beginnings of a much-needed shift in how data is viewed. In a recent post, we described the growing interest in metadata management systems as a potential substrate for powering the foundational data technologies required to build robust machine learning (ML) and AI applications. While the ability to tune, customize, deploy, and manage ML models is important, without access to reliable, high-quality data, companies aren’t able to build high-impact data and AI products and services. Data scientists have long recognized that data assets, processes, and infrastructure are much more critical than ML models to an organization’s long-term success.
Recent research continues to underscore the importance of data, and the perils of underestimating data quality’s role in AI development. A recent Google Research paper, “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI,” compiles data challenges from 53 practitioners in India, East and West African countries, and the US, who are working on high-stakes AI projects with significant consequences for human safety and wellbeing. We believe the authors’ conclusions and lessons learned from these interviews are helpful to AI development at all levels in all sectors, beyond high-stakes applications.
The paper describes the prevalence and dangers of data cascades: “compounding events causing negative, downstream effects from data issues that result in technical debt over time.” The authors note that 92% of the practitioners interviewed had experienced at least one cascade. Data cascades often originate upstream (during data gathering and collection) and can have serious consequences downstream (during model deployment and inference). The consequences often take time to materialize, making cascades a costly, time-consuming AI development issue.
In this post, we take the four data cascade challenges outlined in Google Research’s paper and frame them in the context of everyday AI development. We believe the authors’ observations and analyses will be influential and informative for any team working on AI projects.
Challenge: Interacting with physical world brittleness
The team from Google Research interviewed practitioners who deploy AI applications that engage with data in the real world: upstream with data discovery and collection, and downstream with sensors, cameras, and other data-gathering hardware. As the authors note, the real world is rife with issues that can result in data cascades: “Data cascades often appeared in the form of hardware, environmental, and human knowledge drifts.”
Hardware drifts include such issues as cameras or sensors shifting or malfunctioning. Environmental drifts come from environmental or climate changes; one participant in Google Research’s report shared that “the presence of cloud cover, new houses or roads, or vegetation growth posed challenges because their model was comparing pre- and post-images and misconstruing the changes as landslides.” The “human drift” factor includes situations where social, political, or community behavior or expectations result in changes to live data. This particular factor has already been experienced on a large scale with the enactment of the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).
The need to build resilience into data collection and extraction should sound familiar to developers and engineers who build web crawlers, data pipelines, and extract, load, transform (ELT) tools. The rise of a new generation of companies and startups focused on data pipelines and ELT also underscores the importance of data transport, processing, and storage. This further validates the need to monitor data quality end to end, from the point of collection to model deployment and inference. Organizations should also ensure their data collection infrastructure is flexible enough to accommodate changing regulations and other changing societal situations.
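As a rough illustration of what monitoring incoming data for drift can look like, here is a minimal sketch that compares a new batch of sensor readings against a reference sample. The function names (`mean_shift`, `has_drifted`), the threshold of three standard deviations, and the sample readings are all assumptions made for this example; they are not taken from the Google Research paper, and a production system would likely use a more robust statistical test.

```python
from statistics import mean, stdev


def mean_shift(reference, incoming):
    """Distance between the incoming mean and the reference mean,
    measured in reference standard deviations."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference values have no spread")
    return abs(mean(incoming) - mean(reference)) / ref_std


def has_drifted(reference, incoming, max_shift=3.0):
    """Flag a batch whose mean moved more than max_shift
    reference standard deviations (hypothetical threshold)."""
    return mean_shift(reference, incoming) > max_shift


# Hypothetical temperature readings from a field sensor.
reference = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
stable_batch = [20.0, 20.1, 19.9]
shifted_batch = [24.9, 25.2, 25.0]  # e.g. the sensor was moved or miscalibrated

for name, batch in [("stable", stable_batch), ("shifted", shifted_batch)]:
    status = "drift suspected" if has_drifted(reference, batch) else "ok"
    print(f"{name}: {status}")
```

A check like this only catches shifts in the average value. It is meant to show where a monitoring hook could sit in a pipeline, not to cover the environmental or human drifts the authors describe.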
Challenge: Inadequate application-domain expertise
It is fairly common practice to enlist the aid of domain experts when labeling data. The authors emphasize the importance of adopting an end-to-end engagement with experts, relying on their expertise throughout the AI development pipeline. The data cascade in this challenge is triggered by a practitioner having to make decisions, discard data, merge data, interpret data, etc., without the benefit of domain expertise. Experts bring context to data and processes that practitioners may not have. The authors note that this is one of the more expensive data cascades:
- ❛ Application-domain expertise cascades were costly: impacts came largely after building models, through client feedback and system performance, and long-winded diagnoses. Impacts included costly modifications like going back to collect more data, improving labels, adding new data sources, or severe unanticipated downstream impacts if the model had already been deployed.
The takeaway here is that tools for data collection, data labeling, data preparation, and data inspection need to be made accessible to non-technical experts. For the most part, this means developing ML models alongside domain experts and potential users. Experts can help inspect data to ensure a model isn’t biased, for instance. They can also help make informed ground truth decisions and help gather or simulate representative data to build more accurate models. Domain experts bring a level of understanding to data that technologists cannot replicate.
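One lightweight way to bring experts into ground truth decisions is to route the items that annotators disagree on to a domain expert. The sketch below is an assumption-laden illustration, not a method from the paper: the function names, the 0.8 agreement threshold, and the sample labels are hypothetical.

```python
from collections import Counter


def agreement(labels):
    """Fraction of annotators who chose the most common label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def items_for_expert_review(labels_by_item, min_agreement=0.8):
    """Return the ids of items whose annotator agreement falls
    below min_agreement (hypothetical threshold)."""
    return sorted(
        item
        for item, labels in labels_by_item.items()
        if agreement(labels) < min_agreement
    )


# Hypothetical labels from three annotators per item.
labels_by_item = {
    "scan_001": ["benign", "benign", "benign"],
    "scan_002": ["benign", "malignant", "benign"],
    "scan_003": ["malignant", "benign", "unclear"],
}

review_queue = items_for_expert_review(labels_by_item)
print(review_queue)
```

Unanimous agreement does not guarantee a correct label, so a queue like this complements, rather than replaces, the end-to-end expert involvement the authors recommend.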
Challenge: Poor cross-organizational documentation
Most readers are likely familiar with this challenge. Poor documentation has long been an albatross in data science fields (and many other industries). In Google Research’s interviews, a participant working in robotics described frustrations: “a lack of metadata and collaborators changing schema without understanding context led to a loss of four months of precious medical robotics data collection.” Participants stressed the importance of metadata “to assess quality, representativeness, and fit for use cases,” and noted frustrations with a lack of standards to properly document datasets.
While high-stakes and niche applications are especially prone to data cascades due to insufficient or missing documentation, a lack of metadata and missing context affect AI applications across the board. As we noted in our recent post on metadata, there is a growing awareness of the need to build tools that facilitate data discovery, understanding, and sharing. A 2019 survey of data scientists at Lyft found that 25% of their time was spent on data discovery. As the appreciation for metadata expands, we’re seeing new startups that hope to build on lessons learned implementing and using metadata services within technology companies. But we are clearly still in the early days of tools for data discovery, documentation, and sharing.
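To make the documentation point concrete, the following sketch keeps a small metadata record next to a dataset and reports when the observed schema no longer matches the documented one, the kind of silent schema change the robotics participant described. The `DatasetCard` class, its fields, and the example columns are hypothetical and are not drawn from any specific metadata tool.

```python
from dataclasses import dataclass


@dataclass
class DatasetCard:
    """Minimal, hypothetical metadata record for a dataset."""
    name: str
    owner: str
    description: str
    schema: dict  # column name -> type name, as documented


def schema_changes(card, observed_schema):
    """Compare the documented schema with the one observed in new data."""
    documented = card.schema
    shared = documented.keys() & observed_schema.keys()
    return {
        "added": sorted(set(observed_schema) - set(documented)),
        "removed": sorted(set(documented) - set(observed_schema)),
        "retyped": sorted(
            col for col in shared if documented[col] != observed_schema[col]
        ),
    }


card = DatasetCard(
    name="arm_telemetry",
    owner="robotics-data-team",
    description="Joint angles recorded during assisted procedures.",
    schema={"timestamp": "datetime", "joint_angle": "float", "operator_id": "str"},
)

# Schema found in a newly delivered file (hypothetical).
observed = {"timestamp": "str", "joint_angle": "float", "session_id": "str"}

changes = schema_changes(card, observed)
print(changes)
```

Running a comparison like this before new data is merged gives collaborators a chance to ask why a column changed, instead of discovering the mismatch months later.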
Challenge: Conflicting reward systems
Priorities of practitioners, domain experts, and stakeholders often are not aligned. Google Research found that this misalignment resulted in data cascades with impacts “discovered well into deployment, through costly iterations, moving to an alternate data source, or quitting the project altogether.” When data is not properly prioritized, data collection is often treated as a non-technical task and relegated to staff who might lack data literacy, or the job is unceremoniously added to the tasks of those working in the field, who already have full-time responsibilities. This lack of prioritization can lead to issues with representative data, data provenance, and incomplete data, all of which, as Google Research notes, can lead to data cascades.
We are seeing a few companies start to roll out tools that address many of these aspects, including data labeling, data quality, data pipelines, and ELT, which suggests that there is growing awareness of and appreciation for data-related tasks. As AI appears in more domains and settings, companies need to take a more end-to-end perspective on data, from collection to utilization. Companies should also provide data literacy training for all stakeholders, and the contributions of domain experts and data collectors should be integrated into data pipelines constructed by ML experts.
Several years ago, a team at Google published a highly influential paper, “Hidden Technical Debt in Machine Learning Systems,” that motivated companies to re-examine their machine learning infrastructures. We believe this new paper on data cascades should be equally influential for those working in data science and data engineering fields.
The authors of the paper note that we need organizations to increase emphasis on tools and processes that ensure data quality from the point of collection to model deployment and inference. We need to better incentivize stakeholders to strive for what the authors term “overall data excellence.” As one participant in the Google study observed: “Everyone wants to do the model work, not the data work.”
With the growing importance of machine learning and AI, overall data excellence will be essential for organizations that want to build AI-enabled products and services. The authors found that teams who are able to tame data cascades address the challenges outlined in this post:
- ❛ The teams with the least data cascades had step-wise feedback loops throughout, ran models frequently, worked closely with application-domain experts and field partners, maintained clear data documentation, and regularly monitored incoming data. Data cascades were by-and-large avoidable through intentional practices, modulo extrinsic resources (e.g., accessible application-domain experts in the region, access to monetary resources, relaxed time constraints, stable government regulations, and so on). Although the behaviour of AI systems is critically determined by data, even more so than code; many of our practitioner strategies mirrored best practices in software engineering. Anticipatory steps like shared style guides for code, emphasizing documentation, peer reviews, and clearly assigned roles—adapted to data—reduced the compounding uncertainty and build-up of data cascades.
To remain competitive and successful, any company developing machine learning products and services must have systems and processes to check quality from data collection to model training and deployment. Data cascades can start anywhere, but as the Google Research authors found, they are typically triggered upstream. To develop robust applications, companies need to integrate tools and processes for working with domain experts who can provide insights through the various stages of an AI/ML pipeline, including data collection, data labeling, data transformation and processing, and data inspection. Finally, metadata and documentation need to be prioritized and recognized as essential elements of an effective, successful machine learning lifecycle.
[Image: Chain Reaction from Wikimedia.]