Quantization is a model compression technique that reduces the numerical precision of a neural network's weights, making models smaller, faster, and cheaper to run. Neural networks store their parameters as floating-point numbers, typically 32-bit or 16-bit values. Quantization reduces these to lower-precision formats like 8-bit integers or even 4-bit values.
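To make the idea concrete, here is a minimal sketch of symmetric int8 quantization in NumPy. The function names (`quantize_int8`, `dequantize_int8`) and the single per-tensor scale are illustrative simplifications; production schemes typically use per-channel or per-group scales.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 using one per-tensor scale (sketch)."""
    scale = np.abs(weights).max() / 127.0   # largest magnitude maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)

print(q.nbytes / weights.nbytes)       # 0.25: int8 is a quarter of float32
print(np.abs(weights - approx).max())  # small per-weight rounding error
```

Each weight is stored as an integer plus a shared scale factor; dequantizing multiplies the integers back out, recovering the original values up to rounding error.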
The size reduction is dramatic: a 32-bit float model shrinks to one-eighth the size when quantized to 4-bit. This matters enormously for deployment. Running a 70-billion-parameter LLM at full precision requires hundreds of gigabytes of GPU memory. Quantized to 4-bit, the same model needs roughly 35 GB, bringing it within reach of a single high-end GPU.
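As a quick sanity check on those numbers, a sketch of the arithmetic (weights only; activations and KV cache add more on top):

```python
# Back-of-the-envelope memory footprint for a 70B-parameter model.
params = 70e9
for bits in (32, 16, 8, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit: {gb:,.0f} GB")
# 32-bit: 280 GB   16-bit: 140 GB   8-bit: 70 GB   4-bit: 35 GB
```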
The trade-off is accuracy: lower precision captures less detail in the weights, which can degrade performance on complex tasks. But techniques like GPTQ and AWQ, along with the GGUF format, have made quantization nearly lossless at 8-bit and remarkably effective even at 4-bit. Tools like llama.cpp and Ollama brought quantized models to consumer hardware, democratizing access to powerful LLMs. Quantization is now essential to local AI deployment.
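The accuracy trade-off can be seen directly by measuring round-trip error at different bit widths. This is an illustrative sketch (the `roundtrip_error` helper is hypothetical); it uses the naive per-tensor scheme from above, whereas methods like GPTQ and AWQ use finer-grained scales and calibration data to push the error lower.

```python
import numpy as np

def roundtrip_error(weights: np.ndarray, bits: int) -> float:
    """Mean absolute error after naive quantize/dequantize at a given width."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for int8, 7 for int4
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return float(np.abs(weights - q * scale).mean())

w = np.random.randn(100_000).astype(np.float32)
for bits in (8, 4, 2):
    print(f"{bits}-bit error: {roundtrip_error(w, bits):.4f}")
# Error grows sharply as bits shrink, which is why careful 4-bit methods matter.
```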
[Interactive visualizer: Neural Network Quantization. Reduce model size by lowering the numerical precision of weights.]