LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen model weights, enabling task-specific adaptation at a fraction of the cost of full fine-tuning. The key insight is that weight updates during fine-tuning tend to have low intrinsic rank: the meaningful changes can be captured by much smaller matrices.

LoRA decomposes the weight update into two small matrices: a down-projection A (r x d) and an up-projection B (d x r), where r is the rank (typically 8-64) and d is the weight dimension. The effective update BA has the same shape as the original weight but far fewer parameters. During inference, the original weight W operates alongside the LoRA path: output = Wx + B(Ax). This adds minimal latency, and the adapter can be merged into the base weights (W + BA) for zero-overhead deployment or kept separate for easy switching between tasks.

The storage savings are dramatic. The weights of a 7B-parameter model occupy roughly 14GB in 16-bit precision, all of which full fine-tuning must update and store per task, while LoRA adapters for the same model typically total under 100MB. Multiple LoRA adapters can coexist for different tasks and be swapped at inference time without reloading the base model.

LoRA has become the dominant approach for fine-tuning large language models, enabling customization on consumer hardware. QLoRA combines LoRA with quantization of the frozen base weights for even greater memory efficiency.
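The forward pass and merge described above can be sketched in a few lines. This is a minimal illustration with numpy, not any library's actual API; the class name, scaling convention (alpha / r, as in the original LoRA paper), and initialization are assumptions chosen so the adapter starts as a no-op.

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA-augmented linear layer (hypothetical, not a real API)."""

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.01      # trainable down-projection
        self.B = np.zeros((d_out, r))                       # trainable up-projection
        self.scale = alpha / r  # common scaling; with B = 0, the update starts as a no-op

    def forward(self, x):
        # output = Wx + scale * B(Ax); only A and B receive gradient updates
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def merge(self):
        # Fold the adapter into the base weight for zero-overhead inference
        return self.W + self.scale * (self.B @ self.A)

# Parameter count: for d_in = d_out = 4096 and r = 8, the adapter holds
# r * d_in + d_out * r = 65,536 values versus 16.7M in the full weight matrix.
layer = LoRALinear(4096, 4096, r=8)
adapter_params = layer.A.size + layer.B.size
print(adapter_params)  # 65536
```

Because B is initialized to zero, forward() reproduces the frozen model exactly at the start of training, and merge() shows why a merged adapter adds no inference cost: it is just an ordinary weight matrix again.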