veda.ng

Knowledge Distillation

Knowledge distillation is a model compression technique in which a smaller student model learns to replicate the behavior of a larger teacher model by training on the teacher's output probability distributions rather than on hard labels alone. When a teacher model classifies an image as 90% cat and 10% dog, that soft distribution carries more information than the bare label "cat": the near-miss classifications, the confidence levels, and the relationships between classes all transfer to the student. The student learns not just what the right answer is, but how the teacher reasons about uncertainty.

This produces smaller models that punch above their weight: a distilled model with 10% of the parameters might achieve 95% of the teacher's accuracy. The technique was formalized by Geoffrey Hinton and colleagues in 2015 and has become essential for deploying AI in resource-constrained environments. Mobile applications, edge devices, and real-time systems all benefit from distilled models that run faster and use less memory. Modern applications include distilling large language models like GPT-4 into smaller models, distilling vision transformers into efficient CNNs, and multi-teacher distillation, where students learn from ensembles.

The temperature parameter in distillation controls how much the soft labels are softened: higher temperatures flatten the distribution, revealing more about the structure the teacher has learned across non-target classes.
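The mechanics above can be sketched in a few lines. This is a minimal illustration, not a production training loop: the logits, class names, and the temperature and alpha values are hypothetical, and the loss follows the standard formulation (KL divergence between temperature-softened teacher and student distributions, scaled by T², plus ordinary cross-entropy on the true label).

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Softened softmax: a higher temperature flattens the distribution,
    # exposing more of the teacher's knowledge about non-target classes.
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.7):
    # Soft targets: teacher and student distributions at the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between teacher and student soft distributions.
    soft_loss = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    # Hard loss: ordinary cross-entropy against the ground-truth label.
    hard_loss = -np.log(softmax(student_logits)[true_label])
    # The T^2 factor keeps soft-target gradients at a consistent
    # magnitude as the temperature changes.
    return alpha * (temperature ** 2) * soft_loss + (1 - alpha) * hard_loss

# The 90% cat / 10% dog example: teacher logits favoring class 0 ("cat").
teacher = np.array([2.2, 0.0, -3.0])  # hypothetical classes: cat, dog, car
student = np.array([1.0, 0.5, -1.0])
loss = distillation_loss(student, teacher, true_label=0)
```

Raising the temperature makes the teacher's distribution less peaked, so the student is penalized not only for missing the top class but also for getting the relative ordering of the remaining classes wrong.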