According to the 2022 AI Index Report, the state-of-the-art systems on nine of the ten benchmarks it covers are trained with extra data.
By Ben Lorica.
Stanford’s AI Index Report has just come out – one of my favorite annual reads. The report tracks several metrics, including performance on machine learning benchmarks, the volume of publications and patents, new regulations and legislation pertaining to AI, and industry metrics such as the amount of VC investment. This year there is a chapter dedicated to AI Ethics that focuses on new benchmarks and metrics developed to measure bias in AI systems.
One of my favorite sections is the chapter on Technical Performance. It’s a comprehensive snapshot of recent progress in machine learning (mainly deep learning) and covers all the major AI research areas, including:
- Computer vision (images, video)
- Speech recognition
- NLP and language models
- Recommendation systems
- Reinforcement learning
- Robotics
- Hardware
Let me briefly highlight a couple of findings that caught my eye. First is the importance of data: according to the report, “nine state-of-the-art AI systems out of the ten benchmarks they tested against are trained with extra data”.
Extra training data has become increasingly important for achieving state-of-the-art results across a range of technical benchmarks. But while the best models are being built with data-centric approaches, the report notes that this trend favors large companies with access to vast datasets. Here’s a similar chart showing results in speech recognition:
The second area I want to highlight is Ethics. Algorithmic fairness and bias have gone from an academic pursuit to a practical concern for data and machine learning teams. The report has a standalone chapter on Ethics (focused on fairness) that highlights metrics which have been adopted by AI researchers for reporting progress in reducing bias and promoting fairness. Researchers continue to refine their understanding of how fairness and bias change as AI systems improve, an important consideration as AI models are increasingly used in real-world settings.
An interesting research area is the “detoxification” of large language models:
- “Detoxification methods aim to mitigate toxicity by changing the underlying training data as in domain-adaptive pretraining (DAPT), or by steering the model during generation as in Plug and Play Language Models (PPLM) or Generative Discriminator Guided Sequence Generation (GeDi).”
Let’s use a popular tool to examine how toxicity is defined and measured in practice. The following commonly used definition of a toxic comment comes from Perspective: “A rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion.” Perspective uses this definition to assign scores that represent the “likelihood that the patterns in text resemble patterns in comments that people have tagged as toxic”.
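To make this concrete, here is a minimal sketch of scoring a piece of text against Perspective’s TOXICITY attribute via its Comment Analyzer REST endpoint. The endpoint, request fields, and response path below follow Perspective’s published format, but treat the details as assumptions and check the current documentation; the API key is a placeholder.

```python
# Minimal sketch: scoring a comment with Perspective's Comment Analyzer endpoint.
# Assumes an API key with the Comment Analyzer API enabled; the request/response
# field names follow Perspective's published format but may change over time.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={API_KEY}"

def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY summary score (a value between 0 and 1)."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    result = response.json()
    # The summary score estimates how closely the text resembles comments
    # that human raters have tagged as toxic.
    return result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

Note that a score close to 1 does not mean a comment is certainly toxic; it means the text strongly resembles comments raters have labeled as toxic, which is exactly the framing Perspective’s definition uses.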
However, the safety of language models is very much audience and domain specific. A recent paper from DeepMind points out a couple of challenges with measuring toxicity. First, with the approach taken by Perspective and others, toxicity judgments are subjective: they depend on the raters assessing toxicity, their cultural backgrounds, and the inferred context. Second, language models can perpetuate negative stereotypes and display biases that only become apparent statistically, over large samples of generations.
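One way to surface these sample-level effects is to aggregate toxicity scores over many generations per prompt, in the spirit of the expected-maximum-toxicity style of metric used in the detoxification literature. The sketch below is illustrative only: `generate` and `score_toxicity` are placeholder hooks for your model’s sampling function and a toxicity scorer (for example, the Perspective call above).

```python
# Sketch of an aggregate toxicity measurement: biases that are invisible in a
# single generation can show up when you score many continuations per prompt
# and aggregate. `generate` and `score_toxicity` are placeholder hooks.
from statistics import mean
from typing import Callable, Sequence

def expected_max_toxicity(
    prompts: Sequence[str],
    generate: Callable[[str, int], list[str]],  # (prompt, k) -> k continuations
    score_toxicity: Callable[[str], float],     # text -> toxicity in [0, 1]
    k: int = 25,
) -> float:
    """Average, over prompts, of the worst-case toxicity among k sampled continuations."""
    per_prompt_max = []
    for prompt in prompts:
        continuations = generate(prompt, k)
        per_prompt_max.append(max(score_toxicity(c) for c in continuations))
    return mean(per_prompt_max)
```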
Setting aside the challenges of formalizing toxicity metrics, data teams need tools that let them systematically reduce the risk of toxic degeneration before they deploy language models. Alas, according to the AI Index Report we are still in the early stages of understanding how to detoxify language models. Recent papers show that current debiasing methods are far from perfect:
Given the many potential applications of large language models, detoxification is an active research area. Inspired by ideas from recommender systems, a recent paper from UCSD proposes a simple yet effective “self-detoxification framework to further detoxify the generation by truncating the original distribution and re-rank”.
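To give a feel for the general truncate-and-re-rank idea (a rough sketch, not the UCSD paper’s exact algorithm), the snippet below samples several candidate continuations, drops the most toxic ones, and re-ranks what remains. The `generate`, `score_toxicity`, and `score_fluency` hooks are hypothetical placeholders.

```python
# Rough sketch of generate-then-re-rank detoxification: sample candidates,
# truncate (drop) the most toxic ones, then re-rank the survivors.
# All three hooks are hypothetical placeholders, not a specific library's API.
from typing import Callable

def detoxified_continuation(
    prompt: str,
    generate: Callable[[str, int], list[str]],  # (prompt, k) -> k sampled continuations
    score_toxicity: Callable[[str], float],     # text -> toxicity in [0, 1]
    score_fluency: Callable[[str], float],      # text -> higher is more fluent
    k: int = 10,
    toxicity_threshold: float = 0.5,
) -> str:
    candidates = generate(prompt, k)
    # Truncate: discard candidates whose toxicity exceeds the threshold.
    safe = [c for c in candidates if score_toxicity(c) <= toxicity_threshold]
    if not safe:
        # Fall back to the least toxic candidate if everything was filtered out.
        return min(candidates, key=score_toxicity)
    # Re-rank the remaining candidates by fluency (or any task-specific score).
    return max(safe, key=score_fluency)
```

The trade-off is the usual one for filtering approaches: a stricter threshold lowers toxicity but can hurt fluency and diversity, which is one reason detoxification remains an active research area.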
Bonus Content: global talent pool in computer vision and reinforcement learning
Prompted by the results of the AI Index Report, I explored the talent pool for reinforcement learning and computer vision around the world. The charts below count individuals who list RL or computer vision as a skill on their profile (they likely underestimate the number of people based in China).
Even though we are still in the early days of practical applications of reinforcement learning, there are close to 12,000 people who list RL as a skill on their profile:
Computer vision is much more mature, with more established real-world and enterprise applications. I was able to identify over 200,000 people who list computer vision (and/or related technologies) on their profile, including over 28,000 in the U.S. alone:
Update (2022-04-14): Jack Clark, co-director of the AI Index Steering Committee, was a guest on the Data Exchange podcast.
Related content:
- Speech synthesis technologies will drive the next wave of innovative voice applications
- What Is Graph Intelligence?
- Resurgence of Conversational AI
- Navigate the road to Responsible AI
- Applications of Reinforcement Learning: Recent examples from large US companies
- Data Remains the Key Challenge In Computer Vision Projects