
Quantization

Quantization is a model compression technique that reduces the precision of a neural network's numerical weights, making models smaller, faster, and cheaper to run. Neural networks store their parameters as floating-point numbers, typically 32-bit or 16-bit values. Quantization reduces these to lower precision formats like 8-bit integers or even 4-bit values.
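
At its simplest, quantization maps each float to the nearest value on a coarser grid. Below is a minimal sketch of symmetric "absmax" int8 quantization in NumPy; real schemes add refinements like per-channel scales, zero points, and calibration data:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric 'absmax' quantization: map float32 weights to int8."""
    scale = np.abs(weights).max() / 127.0   # one scale for the whole tensor
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

weights = np.array([3.1416, -2.7183, 1.4142, -0.5772], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)                                  # [ 127 -110   57  -23]
print(np.abs(weights - restored).max())   # worst-case rounding error
```

Each weight now occupies one byte instead of four, at the cost of a small rounding error bounded by half the scale.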

The size reduction is dramatic: a 32-bit float model shrinks to one-eighth the size when quantized to 4-bit. This matters enormously for deployment. Running a 70-billion-parameter LLM at full precision requires hundreds of gigabytes of GPU memory. Quantized, the same model might fit on a single consumer GPU.
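
A back-of-envelope check of that memory arithmetic (weights only, ignoring activations and the KV cache):

```python
PARAMS = 70e9  # 70-billion-parameter model

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = PARAMS * bits / 8 / 2**30   # bits -> bytes -> GiB
    print(f"{name}: {gib:,.0f} GiB")
# fp32: 261 GiB, fp16: 130 GiB, int8: 65 GiB, int4: 33 GiB
```

At 4-bit, the weights fit in roughly 33 GiB, which is within reach of a single high-end consumer GPU plus offloading.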

The trade-off is accuracy: lower precision means less detail in the weights, which can degrade performance on complex tasks. But techniques like GPTQ and AWQ, and formats like GGUF, have made quantization nearly lossless at 8-bit and remarkably robust even at 4-bit. Tools like llama.cpp and Ollama brought quantized models to consumer hardware, democratizing access to powerful LLMs. Quantization is now essential to local AI deployment.
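
As one illustration, Hugging Face's transformers library can load a model in 4-bit through bitsandbytes. The model id below is just a stand-in, and exact options may vary by library version; treat this as a sketch rather than a recipe:

```python
# Requires: pip install transformers accelerate bitsandbytes (and a CUDA GPU)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # assumed model id, for illustration only
    quantization_config=config,
    device_map="auto",            # let accelerate place layers across devices
)
```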

Interactive Visualizer

Neural Network Quantization: reduce model size by lowering the numerical precision of weights. The interactive demo quantizes ten sample weights (w[0] = 3.1416, w[1] = -2.7183, ...) and reports the resulting model size, size reduction, and per-weight quantization error. Its precision-versus-accuracy readout: 32-bit (original) 100%, 16-bit 99.5%, 8-bit 97.9%, 4-bit 91.7%.
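
The visualizer's readout can be approximated with a uniform symmetric quantizer: sweep the bit width and measure the worst-case rounding error on the ten sample weights. This is a simplified sketch, not necessarily the exact scheme the widget uses:

```python
import numpy as np

weights = np.array([3.1416, -2.7183, 1.4142, -0.5772, 2.3026,
                    -1.7321, 0.6931, 1.6180, -1.1235, 0.8660])

for bits in (16, 8, 4):
    levels = 2 ** (bits - 1) - 1            # signed range, e.g. 127 for 8-bit
    scale = np.abs(weights).max() / levels
    restored = np.round(weights / scale) * scale
    err = np.abs(weights - restored).max()
    print(f"{bits}-bit: max error ±{err:.6f}")
```

The error grows as bits shrink: negligible at 16-bit, around ±0.012 at 8-bit, and roughly ±0.22 at 4-bit for these weights, which mirrors the accuracy drop-off shown above.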