Quantization is a model compression technique that reduces the numerical precision of a neural network's weights, making models smaller, faster, and cheaper to run. Neural networks store their parameters as floating-point numbers, typically 32-bit or 16-bit values; quantization reduces these to lower-precision formats such as 8-bit integers or even 4-bit values. The size reduction is dramatic: a 32-bit float model shrinks to one-eighth the size when quantized to 4-bit.

This matters enormously for deployment. Running a 70-billion-parameter LLM at full precision requires hundreds of gigabytes of GPU memory; quantized, the same model might fit on a single consumer GPU. The trade-off is accuracy: lower precision leaves less nuance in the weights, which can degrade performance on complex tasks. But techniques like GPTQ and AWQ, along with the GGUF file format, have made quantization nearly lossless at 8-bit and often even at 4-bit. The emergence of llama.cpp and Ollama brought quantized models to consumer hardware, democratizing access to powerful LLMs. Quantization is now essential to local AI deployment.
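The core idea can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor 8-bit quantization, not GPTQ or AWQ themselves (which add calibration data and more sophisticated error correction); the function names are illustrative.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights onto the int8 range [-127, 127] with one shared scale."""
    scale = float(np.max(np.abs(weights))) / 127.0  # one float stored per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

# Each 32-bit float becomes an 8-bit integer: a 4x memory reduction,
# at the cost of a small rounding error bounded by scale / 2.
w = np.array([0.52, -1.30, 0.07, 0.98], dtype=np.float32)
q, scale = quantize_int8(w)
w_approx = dequantize(q, scale)
```

The single scale factor is the key storage trick: the expensive float precision is amortized over the whole tensor, while every individual weight is stored in one byte.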