
Model Distillation

Model distillation is a technique for creating a smaller, faster model that approximates the behavior of a larger, more capable one. The large model is called the teacher and the smaller model the student. During distillation, the student is trained not just on labeled data but on the teacher's output probability distributions. Instead of learning 'the answer is cat,' the student learns 'the teacher was 85% confident it was cat, 10% dog, 5% fox.' This richer signal transfers more of the teacher's knowledge than hard labels alone.

The result is a student model that performs close to the teacher on most tasks while requiring far less compute to run. Distilled models make deployment practical on mobile devices, edge hardware, and in cost-sensitive applications. Many small, fast models available today, including variants of Llama and Mistral, use distillation in their training pipelines. Distillation is also how reasoning capabilities are transferred: DeepSeek-R1's reasoning behavior was distilled into smaller models by training them on reasoning traces generated by the full R1 model.
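The soft-target idea above can be sketched as a loss function. This is a minimal, framework-free illustration in NumPy of the common formulation (temperature-softened distributions plus a blend of hard-label cross-entropy and teacher-student KL divergence); the function name and the parameters `T` and `alpha` are illustrative, not from any particular library.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T spreads probability mass."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL divergence.

    alpha weights the hard-label term; (1 - alpha) weights the
    soft-target term, which is scaled by T^2 so its gradient
    magnitude stays comparable as the temperature changes.
    """
    # Soft targets: the teacher's temperature-softened distribution.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student): how far the student's softened
    # distribution is from the teacher's.
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    # Standard cross-entropy against the hard label at T = 1.
    ce = -np.log(softmax(student_logits)[hard_label])
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

When the student's logits exactly match the teacher's, the KL term vanishes, so only the hard-label cross-entropy remains; in practice the student minimizes both terms jointly over the training set.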