Holistic Evaluation of Language Models

Stanford researchers develop tools to help understand language models in their totality.

As general-purpose models become more prevalent and important, there’s a growing need for tools to help developers select which models are appropriate for their use case and, more importantly, to understand the limitations of these models. As someone who uses these models, I’ve long wanted simple, systematic, and principled tools that could help me assess and compare various models. Along those lines, the startup Hugging Face recently released low-code tools that make it simple to assess the performance of a set of models along an axis such as FLOPS or model size, and to see how each model performs in comparison to the others.

Today marks an important milestone. Researchers at Stanford’s Center for Research on Foundation Models just unveiled the results of a study that evaluated the strengths and weaknesses of thirty well-known large language models using a variety of scenarios and metrics. In the process, they developed a new benchmarking framework, Holistic Evaluation of Language Models (HELM), which can be described as follows:

They organize the space of scenarios (use cases) and metrics (desiderata).
They then select a subset of scenarios and metrics based on societal relevance (e.g. user-facing applications), coverage (e.g. different English dialects/varieties), and feasibility (i.e. amount of compute).

Previous benchmarks specify a fixed set of scenarios and metrics. By locating its choices within a broader taxonomy, HELM also makes clear what is currently missing.
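To make the taxonomy idea concrete, here is a minimal sketch in Python. It treats the space as every pairing of a scenario with a metric, marks a subset as selected, and lists what is left uncovered. The scenario names, metric names, and the selection itself are made up for illustration; they are not HELM's actual lists, and `coverage_gaps` is a hypothetical helper, not part of any HELM tooling.

```python
from itertools import product

# Hypothetical, illustrative labels; not HELM's actual scenario or metric lists.
SCENARIOS = ["question answering", "summarization", "sentiment analysis"]
METRICS = ["accuracy", "robustness", "fairness", "toxicity"]

# The full space is every (scenario, metric) pair.
taxonomy = set(product(SCENARIOS, METRICS))

# A benchmark evaluates only some pairs (a made-up selection for illustration).
selected = {
    ("question answering", "accuracy"),
    ("question answering", "robustness"),
    ("summarization", "accuracy"),
    ("summarization", "toxicity"),
}


def coverage_gaps(taxonomy, selected):
    """Return the (scenario, metric) pairs in the taxonomy that are not evaluated."""
    return sorted(taxonomy - selected)


gaps = coverage_gaps(taxonomy, selected)
print(f"{len(selected)} of {len(taxonomy)} pairs covered; {len(gaps)} missing")
for scenario, metric in gaps:
    print(f"  missing: {scenario} / {metric}")
```

The point of writing the full grid down first is that the gaps become something you can list, rather than something you have to remember to ask about.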

Key Findings

The HELM team evaluated language models from twelve organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University, and Yandex. Several of these models are open source, some are available through commercial APIs, and others are private. Here are some of the findings that caught my eye:

Public vs. Private models. The most accurate public models were always less accurate than the most accurate private (e.g. limited access, closed) models. There can be especially large gaps in scenarios that involve knowledge and reasoning. As the HELM team notes, this may in part stem from the fact that their evaluation only included private instruction-tuned models.
Accuracy and Fairness. They found that the most accurate models also tend to be the most robust and the most fair. More precisely, across models and scenarios they found very strong correlations between accuracy, robustness, and fairness.
Toxic and harmful content. At least in the context of realistic use cases (e.g. summarizing news articles), researchers found that models actually rarely exhibited problematic behavior such as generating toxic, racist, biased, or otherwise harmful text.
Copyrighted and licensed material. The HELM team found that models rarely generated long sequences verbatim. An important caveat: they noted that the rate of regurgitation correlates with model accuracy.
Machine-generated disinformation. In human evaluations using crowdworkers, researchers found that models can effectively generate disinformation headlines to support a point of view.
Prompt engineering. The tested models were quite sensitive to the way prompts were written.
Model Size and Accuracy. It is not useful to compare the size of models across model providers when predicting model accuracy. The HELM team found that, across providers, the relationship between model scale and accuracy is not very clear. Then again, for models belonging to the same family (e.g. different-sized Cohere models), accuracy is highly correlated with model size.
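The last finding, that size predicts accuracy within a model family but not across providers, is easy to see with a toy calculation. The sketch below is in Python and uses entirely made-up sizes and accuracy numbers for two hypothetical families; nothing here comes from the HELM results. Using Pearson correlation on log-scaled parameter counts is my own assumption, since the summary above does not say how the HELM team measured the relationship.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up (parameters in billions, accuracy) pairs for two hypothetical families.
FAMILIES = {
    "family_a": [(1, 0.40), (6, 0.50), (50, 0.60)],
    "family_b": [(10, 0.30), (70, 0.38), (500, 0.45)],
}


def size_accuracy_correlation(points):
    """Correlate log10(model size) with accuracy for a list of (size, accuracy) pairs."""
    log_sizes = [math.log10(size) for size, _ in points]
    accuracies = [accuracy for _, accuracy in points]
    return pearson(log_sizes, accuracies)


# Correlation computed separately inside each family.
within = {name: size_accuracy_correlation(pts) for name, pts in FAMILIES.items()}

# Correlation computed after pooling all models, ignoring which family they belong to.
pooled = size_accuracy_correlation([p for pts in FAMILIES.values() for p in pts])

for name, r in within.items():
    print(f"{name}: r = {r:.2f}")
print(f"pooled across families: r = {pooled:.2f}")
```

With these toy numbers each family shows a strong size-accuracy relationship on its own, while the pooled figure is much weaker, because one family is larger yet less accurate at every size. That is the pattern the finding describes, though real data will be messier.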

The advent of large language models has revolutionized AI. These models are being rapidly productionized into significant and widely available language applications, whose use will only grow in the near term. HELM is an important step towards the evaluation tools needed to provide better transparency for language models. I hope other researchers build upon this exciting suite of evaluation tools and ideas.

Related Content:

Foundation Models: A Primer for Investors and Builders
Resurgence of Conversational AI
A Guide to Data Annotation and Synthetic Data Generation Tools
The AI $100M Revenue Club
Machine Learning Trends You Need to Know
Mark Chen: How DALL·E works
Connor Leahy and Yoav Shoham: Large Language Models
Jack Clark: The 2022 AI Index

[Image: Portrait (Marcel Wanders), by Ben Lorica.]
