Gradient Flow

Holistic Evaluation of Language Models


Stanford researchers develop tools to help understand language models in their totality.

As general-purpose models become more prevalent and important, there’s a growing need for tools that help developers select which models are appropriate for their use case and, more importantly, understand the limitations of those models. As someone who uses these models, I’ve long wanted simple, systematic, and principled tools for assessing and comparing them. Along those lines, the startup Hugging Face recently released low-code tools that make it simple to assess a set of models along axes such as FLOPS and model size, and to compare how well one set of models performs against another.

Today marks an important milestone. Researchers at Stanford’s Center for Research on Foundation Models just unveiled the results of a study that evaluated the strengths and weaknesses of thirty well-known large language models using a variety of scenarios and metrics. In the process, they developed a new benchmarking framework, Holistic Evaluation of Language Models (HELM), which can be described as follows:

  1. They organize the space of scenarios (use cases) and metrics (desiderata).
  2. They then select a subset of scenarios and metrics based on societal relevance (e.g. user-facing applications), coverage (e.g. different English dialects/varieties), and feasibility (i.e. amount of compute).

Unlike previous benchmarks, which commit to particular scenarios and metrics, HELM locates its choices within a broader taxonomy, which clarifies what current evaluations are still missing.
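The two steps above can be sketched in a few lines of Python. Everything here is illustrative: the scenario and metric names, and the priority filter, are stand-ins I chose for the example, not HELM's actual taxonomy or selection criteria.

```python
from itertools import product

# Step 1: organize the space -- cross scenarios (use cases)
# with metrics (desiderata). Names are hypothetical examples.
scenarios = ["question_answering", "summarization", "toxicity_detection"]
metrics = ["accuracy", "calibration", "robustness", "fairness"]
full_space = list(product(scenarios, metrics))

# Step 2: select a feasible subset. Here a simple priority set
# stands in for HELM's relevance/coverage/feasibility criteria.
priorities = {"question_answering", "summarization"}
selected = [(s, m) for s, m in full_space if s in priorities]

# Because the full space is explicit, the gaps -- pairs the
# benchmark does not evaluate -- are explicit too.
gaps = [pair for pair in full_space if pair not in selected]
print(f"{len(selected)} of {len(full_space)} scenario-metric pairs selected")
```

The point of the sketch is the last step: by enumerating the full space before selecting from it, the framework makes its blind spots visible rather than implicit.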

Key Findings

The HELM team evaluated language models from twelve organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University, and Yandex. Several of these models are open source, some are available through commercial APIs, and others are private. Here are some of the findings that caught my eye:

The advent of large language models has revolutionized AI. These models are being rapidly productionized into significant and widely available language applications, whose use will only grow in the near term. HELM is an important step towards the evaluation tools needed to provide better transparency for language models. I hope other researchers build upon this exciting suite of evaluation tools and ideas.


If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:


[Image: Portrait (Marcel Wanders), by Ben Lorica.]
