Quantization is a model compression technique that reduces the numerical precision of a neural network's weights, making models smaller, faster, and cheaper to run. Neural networks store their parameters as floating-point numbers, typically 32-bit or 16-bit values; quantization reduces these to lower-precision formats such as 8-bit integers or even 4-bit values. The size reduction is dramatic: a 32-bit float model shrinks to one-eighth the size when quantized to 4-bit.

This matters enormously for deployment. Running a 70-billion-parameter LLM at full precision requires hundreds of gigabytes of GPU memory; quantized, the same model might fit on a single consumer GPU. The trade-off is accuracy: lower precision leaves less nuance in the weights, which can degrade performance on complex tasks. But techniques like GPTQ and AWQ, along with the GGUF file format, have made quantization nearly lossless at 8-bit and often even at 4-bit. The emergence of llama.cpp and Ollama brought quantized models to consumer hardware, democratizing access to powerful LLMs. Quantization is now essential to local AI deployment.
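The core idea can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor 8-bit quantization, not GPTQ or AWQ themselves (which add calibration data and more sophisticated error correction); the function names are illustrative.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights onto the int8 range [-127, 127] with one shared scale."""
    scale = float(np.max(np.abs(weights))) / 127.0  # one float stored per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

# Each 32-bit float becomes an 8-bit integer: a 4x memory reduction,
# at the cost of a small rounding error bounded by scale / 2.
w = np.array([0.52, -1.30, 0.07, 0.98], dtype=np.float32)
q, scale = quantize_int8(w)
w_approx = dequantize(q, scale)
```

The single scale factor is the key storage trick: the expensive float precision is amortized over the whole tensor, while every individual weight is stored in one byte.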