Efficient Learning with Distilling Step-by-Step

In an era where data is abundant yet precious, a new technique, “Distilling Step-by-Step,” turns Large Language Models (LLMs) from mere label predictors into reasoning agents that provide intermediate rationales, bridging the gap between inputs and final answers. These rationales make it possible to train efficient task-specific models that need less training data and less compute, and that can still outperform LLMs.

Distilling Step-by-Step goes beyond model refinement: it extracts task-relevant knowledge from LLMs in the form of natural language rationales. These rationales enrich the training of smaller models, improving their performance on complex tasks while reducing reliance on extensive and expensive human labeling.

Distilling Step-by-Step is remarkably straightforward. Imagine teaching someone to solve a problem step by step. First, you write a “prompt” that explains the task and includes a few worked examples: input-output pairs, each accompanied by the reasoning that leads from input to output. Then, using this “few-shot Chain-of-Thought” prompt, you ask an LLM to work through new inputs, and it returns step-by-step explanations, or “rationales,” along with its answers. Finally, you use these rationales to train smaller, specialized models to perform specific tasks more efficiently.
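The prompting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Demo` class, the `Q:`/`A:` layout, the "So the answer is" marker, and the sample completion are all assumptions made for the example, and no real LLM is called.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One hand-written demonstration (hypothetical structure)."""
    question: str
    rationale: str
    answer: str


# Assumed convention: every demonstration ends with this marker,
# so the LLM's completion is likely to end the same way.
ANSWER_MARKER = "So the answer is"


def build_cot_prompt(instruction, demos, new_question):
    """Assemble a few-shot Chain-of-Thought prompt as plain text."""
    parts = [instruction.strip(), ""]
    for d in demos:
        parts.append(f"Q: {d.question}")
        parts.append(f"A: {d.rationale} {ANSWER_MARKER} {d.answer}.")
        parts.append("")
    parts.append(f"Q: {new_question}")
    parts.append("A:")
    return "\n".join(parts)


def split_rationale_and_label(completion):
    """Split a completion into (rationale, label) at the answer marker."""
    rationale, marker, tail = completion.partition(ANSWER_MARKER)
    if not marker:
        # No marker found: keep the text as a rationale, with no label.
        return completion.strip(), None
    return rationale.strip(), tail.strip().rstrip(".").strip()


demos = [
    Demo(
        question="Sam has 3 apples and buys 2 more. How many apples does Sam have?",
        rationale="Sam starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
        answer="5",
    )
]
prompt = build_cot_prompt(
    "Answer the question and explain your reasoning.",
    demos,
    "A shelf holds 4 books and 6 more are added. How many books are on the shelf?",
)

# Stand-in for what an LLM might return for the prompt above.
fake_completion = " The shelf starts with 4 books. Adding 6 gives 4 + 6 = 10. So the answer is 10."
rationale, label = split_rationale_and_label(fake_completion)
```

In practice the prompt would be sent to an LLM for every unlabeled input, and the resulting (rationale, label) pairs would become training data for the smaller model.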

Distilling step-by-step: Use Chain-of-Thought prompting to extract rationales from an LLM, and then use them to train task-specific models using multi-task learning, by prepending task prefixes to inputs.
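To make the task-prefix idea concrete, here is a small sketch of how one input might be turned into two training examples for a text-to-text model such as T5. The prefix strings `[label]` and `[rationale]`, the field names `source` and `target`, and the sample data are assumptions chosen for illustration.

```python
# Assumed prefix strings; any pair of distinct markers would work the same way.
LABEL_PREFIX = "[label]"
RATIONALE_PREFIX = "[rationale]"


def make_multitask_examples(text, label, rationale):
    """Turn one input into two seq2seq examples, one per task.

    The same input appears twice: with the label prefix the model is
    trained to output the answer, and with the rationale prefix it is
    trained to output the explanation.
    """
    return [
        {"source": f"{LABEL_PREFIX} {text}", "target": label},
        {"source": f"{RATIONALE_PREFIX} {text}", "target": rationale},
    ]


examples = make_multitask_examples(
    text="A shelf holds 4 books and 6 more are added. How many books are on the shelf?",
    label="10",
    rationale="The shelf starts with 4 books. Adding 6 gives 4 + 6 = 10.",
)

for ex in examples:
    print(ex["source"], "->", ex["target"])
```

One consequence of this design is that, at inference time, only the label prefix needs to be used, so the small model does not have to generate a rationale to produce an answer.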

Distilling step-by-step is more efficient and more effective than standard fine-tuning for training smaller models. It requires fewer labeled examples, reduces computational costs and data needs, and can match or exceed the performance of LLMs on unlabeled data using just 12.5% of a full dataset. In tests on 220M T5 models, that result also surpasses standard fine-tuning on the entire dataset. Distilling step-by-step consistently outperforms standard fine-tuning and distillation on task accuracy, with improvements averaging 8% and 13% on certain datasets.

Unlike fine-tuning, which relies solely on labeled examples, distilling step-by-step utilizes natural language rationales to elucidate connections between inputs and outputs. This enables higher data efficiency and effectiveness in tasks requiring advanced reasoning and planning, where fine-tuning may be inadequate.
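The contrast with fine-tuning can be expressed as a difference in training objectives. The sketch below assumes the two tasks are combined as a weighted sum of losses, which is a common choice in multi-task learning. The function names, the `rationale_weight` parameter, and the toy token probabilities are hypothetical and only meant to show the shape of the computation.

```python
import math


def mean_token_nll(token_probs):
    """Average negative log-likelihood of the target tokens.

    `token_probs` holds the probability the model assigned to each
    correct target token (toy values here, not real model outputs).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def finetuning_loss(label_probs):
    """Standard fine-tuning: only the label prediction is penalized."""
    return mean_token_nll(label_probs)


def step_by_step_loss(label_probs, rationale_probs, rationale_weight=1.0):
    """Multi-task objective: label loss plus a weighted rationale loss.

    `rationale_weight` is an assumed hyperparameter for this sketch.
    """
    label_loss = mean_token_nll(label_probs)
    rationale_loss = mean_token_nll(rationale_probs)
    return label_loss + rationale_weight * rationale_loss


# Toy probabilities for one training example.
label_probs = [0.9]
rationale_probs = [0.8, 0.7, 0.9, 0.6]

print("fine-tuning loss:  ", round(finetuning_loss(label_probs), 4))
print("step-by-step loss: ", round(step_by_step_loss(label_probs, rationale_probs), 4))
```

Setting `rationale_weight` to 0 recovers the fine-tuning objective, which makes it clear that the rationale term is the extra training signal the method adds.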

