
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen model weights, enabling task-specific adaptation at a fraction of the cost of full fine-tuning. Weight updates during fine-tuning tend to have low intrinsic rank, so the meaningful changes can be captured by much smaller matrices.

LoRA decomposes the weight update into two small matrices: A (r x d) and B (d x r), where r is the rank (typically 8-64) and d is the weight dimension. The effective update is B x A, which has the same shape as the original weight but far fewer parameters. During inference, the original weight W operates alongside the LoRA modification: output = W(x) + B(A(x)). This adds minimal latency.
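The decomposition can be sketched in a few lines of NumPy. This is a minimal illustration, not a training loop; the dimensions, the zero-initialization of B (a common convention so the adapter starts as a no-op), and the name `lora_forward` are illustrative assumptions, not taken from a specific library:

```python
import numpy as np

d, r = 512, 8  # weight dimension and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weight, never updated
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection, r x d
B = np.zeros((d, r))                     # trainable up-projection, d x r (zero-init: adapter starts as a no-op)

def lora_forward(x):
    # Frozen path plus low-rank update: W x + B (A x)
    return W @ x + B @ (A @ x)

# Trainable-parameter comparison: d*d for full fine-tuning vs 2*r*d for LoRA
full_params = d * d        # 262144
lora_params = 2 * r * d    # 8192, ~3.1% of full
```

Note that B(A(x)) is computed as two skinny matrix-vector products, so the extra cost per layer is O(r·d) rather than O(d²).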

LoRA adapters can be merged into the base weights for zero-overhead deployment or kept separate for easy switching between tasks. A 7B-parameter model needs roughly 14GB just to store a full set of fine-tuned weights in 16-bit precision, while LoRA adapters typically come to under 100MB. Multiple LoRA adapters can coexist for different tasks, swapped at inference time without reloading the base model.
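Merging is a single addition of the low-rank product into the base weight, after which inference uses one matmul with no adapter path. A small sketch verifying that the merged and separate forms agree (dimensions and names are illustrative):

```python
import numpy as np

d, r = 512, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))   # base weight
A = rng.standard_normal((r, d))   # trained down-projection
B = rng.standard_normal((d, r))   # trained up-projection

# Merge the adapter into the base weight: W' = W + B A
W_merged = W + B @ A

x = rng.standard_normal(d)
separate = W @ x + B @ (A @ x)    # adapter kept separate (easy task switching)
merged = W_merged @ x             # adapter merged (zero inference overhead)
```

Keeping adapters separate means task switching is just a choice of which (A, B) pair to apply; merging trades that flexibility for a forward pass identical in cost to the base model.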

LoRA has become the dominant approach for fine-tuning large language models, enabling customization on consumer hardware. QLoRA combines LoRA with quantization for even greater efficiency.

Interactive Visualizer

LoRA: Low-Rank Adaptation

Interactive visualization of how LoRA decomposes weight updates into smaller matrices, dramatically reducing trainable parameters while maintaining model performance.

[Interactive demo: compares traditional fine-tuning (updating the entire 8 × 8 weight matrix ΔW, 64 parameters) against LoRA (ΔW ≈ B × A) at an adjustable rank, reporting the resulting parameter counts and percentage reduction.]
Key Insight: By constraining updates to low-rank decompositions, LoRA captures the essential adaptations while using drastically fewer parameters. Adjust the rank to see the parameter trade-off.