Model quantization converts a neural network's numerical representations from high-precision formats (like 32-bit floating point) to lower-precision formats (like 8-bit integers or even binary values). The model's structure, layers, connections, and learned patterns stay intact, but each parameter shrinks to a smaller data type. The result is a model that uses a fraction of the memory and runs faster, because integer arithmetic is simpler and more power-efficient than floating-point math.

Quantized models fit on devices with limited RAM, such as smartphones, wearables, and microcontrollers, enabling on-device AI inference without sending data to the cloud. This matters for latency (predictions in milliseconds), privacy (data stays on-device), and cost (less compute needed).

A quantized model's predictions may change slightly compared to its full-precision version, but for most applications the accuracy loss is negligible. As real-time AI demands grow, from voice assistants to autonomous drones, quantization is the practical bridge between research models and deployable products.
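The core idea can be sketched in a few lines. Below is a minimal, illustrative example of symmetric 8-bit quantization for a single weight tensor; real frameworks add per-channel scales, zero points, and calibration, and the function names here are our own, not any library's API.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with one symmetric scale factor."""
    scale = np.max(np.abs(weights)) / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.82, -1.27, 0.003, 0.5], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each value is now stored in 1 byte instead of 4 (4x smaller),
# and the rounding error per value is bounded by half the scale step.
```

This illustrates the trade-off described above: storage drops by 4x, while each reconstructed value differs from the original by at most half a quantization step.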