A Guide to Data Annotation and Synthetic Data Generation Tools

Trends to consider when evaluating data annotation and synthetic data generation systems.

As we noted in a recent post (“Machine Learning trends you need to know”), researchers are increasingly interested in tools and techniques for labeling, cleaning, augmenting, and enhancing datasets used by machine learning models. In fact, data scientists and machine learning engineers have long known that focusing on data is often more effective than focusing on modeling. Numerous surveys through the years have shown that data teams spend most of their time acquiring, cleaning, and augmenting their data sets.

In this post we’ll examine the landscape of tools for building training datasets – specifically tools for data annotation and synthetic data generation. Taking into account emerging trends in machine learning and artificial intelligence, we provide guidelines to help you navigate the explosion of tools in these areas.

Tools for building training data need to be tightly integrated into the model tuning workflow

In areas like computer vision and NLP, model hubs and foundation models reorient the focus from collecting massive amounts of data to collecting and labeling data for specific use cases and applications. As we discovered through a series of surveys, teams are clamoring for tools for tuning pre-trained models. A common approach is to take a pre-trained model, use a data annotation provider, and then tune the model yourself. This strategy introduces friction (a call out to an external data tool) while also requiring model fine-tuning expertise.

Figure 1: Tuning a pre-trained model usually requires task-specific training data.

An encouraging development is that in areas like NLP or computer vision, data annotation is being embedded into low-code/no-code tools that allow users to tune models incrementally (e.g. John Snow Labs, Matroid). This means subject matter experts can focus on annotating application-specific data sets while the platform automatically fine-tunes models. (Model fine-tuning and data annotation happen within the same workflow and interface.) We expect that synthetic data generators will be incorporated into similar model tuning tools in the near future.

More importantly, this pattern of tight integration between data annotation and model tuning will become common as other settings and data types gain their own model hubs and foundation models. As David Talby (CTO of John Snow Labs) recently noted: “Using an intuitive, accurate, no-code environment, we enable medical doctors, lawyers, and financial analysts to train & tune NLP models. Our no-code environment resonates exceptionally well with our users, and this type of tool should be applicable to most real-world business scenarios.”

Platforms are beginning to support multiple data types and synthetic data

While the use of foundation models and model tuning tools is growing, there are still many problems that require large amounts of training data or domain-specific annotators. We did a quick search and came up with a plethora of data annotation startups and synthetic data providers. Note that the graphic below doesn’t include the many data markets and data exchanges that have come online in recent years.

Figure 2: A representative sample of startups that help companies with their training data needs.

Having a multitude of options is great, but working with several providers is not the most efficient strategy. One option is to select a data provider that has broad capabilities. Fortunately, we are seeing a growing number of companies able to handle a variety of data types, making it easier for teams to work on ML problems of all types. However, we urge you to proceed with caution when working with companies that claim support for multiple data types! You’ll need to carefully investigate the quality of their work for each data type. Still, we believe that as multi-modal models become more prevalent, data providers will need to support a variety of data types in order to remain relevant.

Intermediate layers give you access to best-of-breed data platforms through a single API and interface

Synthetic data is promising but not a cure-all. While synthetic data is not a replacement for real data, synthetic data generation can improve and accelerate model development. From a workflow perspective, it makes sense to consolidate and opt for providers who offer both synthetic data generators and data annotation solutions.

Figure 3: Synthetic data can help with a variety of problems.

It is encouraging to note that the startups in this space aren’t standing still. Many of the companies listed in Figure 2 continue to invest in technologies to improve their products and services. Customers and users will benefit as data providers roll out new automation tools and novel data quality and data management solutions.

Figure 4: Interest in “data-centric” AI has fueled innovation in tools for building data sets.

Intermediate layers simplify sourcing, management, maintenance, and automation

Intermediate layers are an emerging option worth considering. For teams who prefer to work with best-in-class suppliers, there are now companies that simplify how teams can take advantage of multiple data providers. MLtwist is building a platform that enables teams to access best-of-breed training data platforms through a single API and interface. By abstracting away the details of working with individual training data providers, these intermediate layers help keep your pipelines stable and up to date even as data providers come and go.

Figure 5: Intermediate layers (“middleware”) help companies keep pace with the exploding number of data annotation and synthetic data generation tools.
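
To make the intermediate-layer idea more concrete, here is a minimal sketch of the underlying design, assuming a simple adapter pattern. Every name in it (AnnotationProvider, KeywordTextProvider, FilenameImageProvider, AnnotationRouter) is hypothetical and invented for illustration; it is not the API of MLtwist or of any real vendor, and the two "providers" are trivial stand-ins rather than real labeling services.

```python
from dataclasses import dataclass, field
from typing import Protocol


class AnnotationProvider(Protocol):
    """Hypothetical interface that every provider adapter implements."""

    def annotate(self, items: list[str]) -> list[dict]: ...


class KeywordTextProvider:
    """Stand-in for a text-labeling vendor; uses a trivial keyword rule."""

    def annotate(self, items: list[str]) -> list[dict]:
        return [
            {"item": item,
             "label": "positive" if "good" in item.lower() else "negative"}
            for item in items
        ]


class FilenameImageProvider:
    """Stand-in for an image-labeling vendor; reads the label from the file name."""

    def annotate(self, items: list[str]) -> list[dict]:
        return [{"item": item, "label": item.split("_")[0]} for item in items]


@dataclass
class AnnotationRouter:
    """Single entry point that hides which provider handles each data type."""

    providers: dict[str, AnnotationProvider] = field(default_factory=dict)

    def register(self, data_type: str, provider: AnnotationProvider) -> None:
        # Registering again under the same data type swaps the provider.
        self.providers[data_type] = provider

    def annotate(self, data_type: str, items: list[str]) -> list[dict]:
        if data_type not in self.providers:
            raise KeyError(f"no provider registered for {data_type!r}")
        return self.providers[data_type].annotate(items)


router = AnnotationRouter()
router.register("text", KeywordTextProvider())
router.register("image", FilenameImageProvider())

text_labels = router.annotate("text", ["Good service", "Slow delivery"])
image_labels = router.annotate("image", ["cat_001.jpg"])
```

Because pipeline code only ever calls router.annotate, replacing a provider is a one-line change to the registration step rather than a rewrite of the pipeline.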

Be on the lookout for data tools for Responsible AI

As more companies deploy machine learning and AI into products and services, teams will need to integrate audits and checks for a variety of risks including bias, safety and reliability, privacy and security, and trust and interpretability. A number of books, guidelines, and tools have been published on how to measure and mitigate risks that stem from the construction of training data. But we are still in the early stages and we need more tools to assess and minimize risks that arise from training data sets, including those from data annotation and synthetic data generation systems. A recent post from OpenAI describes in detail the pre-training mitigations they conducted to directly modify the data that DALL·E 2 learns from.
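
As one small illustration of what a training-data audit might check, the sketch below compares how often a given label appears across groups in an annotated dataset and flags large gaps. It is an assumption-laden toy: the function names, the record fields (group, label), and the 0.2 threshold are all hypothetical, and it is not the mitigation process described by OpenAI or a substitute for a proper bias assessment.

```python
from collections import Counter


def label_rates_by_group(records, group_key, label_key, positive_label):
    """Return the share of records carrying positive_label within each group."""
    totals = Counter()
    positives = Counter()
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[label_key] == positive_label:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def has_large_gap(rates, max_gap=0.2):
    """Flag the dataset when the rates differ by more than max_gap (arbitrary default)."""
    return max(rates.values()) - min(rates.values()) > max_gap


# Hypothetical annotated records: group A is labeled "approve" far more often.
records = (
    [{"group": "A", "label": "approve"}] * 3
    + [{"group": "A", "label": "deny"}]
    + [{"group": "B", "label": "approve"}]
    + [{"group": "B", "label": "deny"}] * 3
)

rates = label_rates_by_group(records, "group", "label", "approve")
needs_review = has_large_gap(rates)
```

A flagged gap is only a prompt for human review: it may reflect a real difference in the underlying data, an annotation guideline problem, or a skewed sample.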

Since training data is an ongoing concern, tighter integration is warranted

As models transition to production, teams need to operationalize the end-to-end pipelines needed to retrain or fix a model. This means that ML teams will need to switch from data annotation and synthetic data generation scripts and processes that do not scale to ones that are easier to automate, monitor, and diagnose. Ideally, your tools for building datasets integrate seamlessly with the rest of your data and ML infrastructure. In the future this could mean opting for tools that integrate with, or are offered by, your data management system (e.g. Databricks recently announced a data market).
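
One way to move from ad hoc scripts toward something that can be automated and monitored is to put a validation gate in front of every incoming batch of labeled data. The sketch below is a minimal example under assumed conventions: the record fields (id, text, label) and the validate_batch function are hypothetical, and a production pipeline would likely rely on a dedicated data validation tool instead.

```python
def validate_batch(batch, allowed_labels):
    """Return a list of human-readable problems; an empty list means the batch passes."""
    problems = []
    seen_ids = set()
    for index, record in enumerate(batch):
        record_id = record.get("id")
        if record_id is None:
            problems.append(f"record {index}: missing id")
        elif record_id in seen_ids:
            problems.append(f"record {index}: duplicate id {record_id!r}")
        else:
            seen_ids.add(record_id)
        if not record.get("text"):
            problems.append(f"record {index}: empty text")
        if record.get("label") not in allowed_labels:
            problems.append(
                f"record {index}: unexpected label {record.get('label')!r}"
            )
    return problems


# Hypothetical batch returned by an annotation or synthetic data step.
batch = [
    {"id": 1, "text": "win a prize now", "label": "spam"},
    {"id": 1, "text": "lunch at noon?", "label": "ham"},
    {"id": 2, "text": "", "label": "other"},
]

problems = validate_batch(batch, allowed_labels={"spam", "ham"})
```

Returning a list of problems, rather than raising on the first one, makes the gate easier to log and diagnose when it runs unattended in a retraining pipeline.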

Closing thoughts

ML teams in need of training or test data now have access to many open source and proprietary tools for building data sets. To help you keep up with emerging trends in machine learning, we have provided guidelines for evaluating data annotation and synthetic data generation solutions. These systems need to be tightly integrated into your model tuning workflows, as well as with your MLOps and data infrastructure. Models that get deployed to production need up-to-date data to ensure they are able to respond to changing conditions or to edge cases not found in training data. One way to simplify and scale is to take advantage of new intermediate layers (like MLtwist) designed to tame the growing number of synthetic data generation and data annotation systems.

To stay up to date, subscribe to the Gradient Flow newsletter.