veda.ng

The KV cache (Key-Value cache) stores the Key and Value vectors computed for previous tokens during autoregressive generation, avoiding redundant computation and greatly speeding up inference. In transformer attention, generating token N requires attending over all previous tokens 0 through N-1. Without caching, the Keys and Values for every previous token would be recomputed at each generation step, so the total recomputation grows quadratically with sequence length.

KV caching stores each token's Key and Value vectors after computing them once. When generating token N, you compute the Query, Key, and Value only for the new token, then attend over the cached Keys and Values from all previous tokens. This reduces the redundant Key/Value computation from quadratic to linear in sequence length.

The memory cost is significant: the cache holds one Key and one Value vector per token per layer, so its size is 2 (for K and V) times the number of layers times the sequence length times the model dimension (number of attention heads times head dimension) times the bytes per element. For large models with long contexts, the KV cache can consume tens of gigabytes. Memory optimization techniques include KV cache quantization (storing in lower precision), paged attention (virtual memory for the KV cache), and sliding window attention (caching only recent tokens). Understanding the KV cache is essential for deploying LLMs efficiently: it is often the primary memory bottleneck during inference with long contexts.
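The caching loop described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with made-up dimensions and random weights, not a real model: at each step only the new token is projected, and its Key and Value rows are appended to the cache.

```python
import numpy as np

def attend(q, K, V):
    # scaled dot-product attention: one query over all cached Keys/Values
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8                                   # head dimension (illustrative)
Wq, Wk, Wv = rng.standard_normal((3, d, d))  # toy projection weights

X = rng.standard_normal((5, d))         # 5 token embeddings, one per step
K_cache = np.empty((0, d))              # cached Keys, one row per past token
V_cache = np.empty((0, d))              # cached Values

outputs = []
for x in X:
    q, k, v = x @ Wq, x @ Wk, x @ Wv    # project only the NEW token
    K_cache = np.vstack([K_cache, k])   # append instead of recomputing
    V_cache = np.vstack([V_cache, v])   # K/V for all previous tokens
    outputs.append(attend(q, K_cache, V_cache))
```

The cached result is identical to recomputing all Keys and Values from scratch each step; the cache only removes the redundant projections.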
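The memory formula is easy to check with concrete numbers. Below is a back-of-the-envelope calculation for an assumed 7B-class configuration (32 layers, model dimension 4096, fp16 storage, 4096-token context); the specific figures are illustrative, not tied to any particular model.

```python
def kv_cache_bytes(num_layers, seq_len, hidden_dim, bytes_per_elem):
    # factor of 2: one Key and one Value vector stored per token per layer
    return 2 * num_layers * seq_len * hidden_dim * bytes_per_elem

# illustrative 7B-class config: 32 layers, dim 4096, fp16, 4096-token context
size = kv_cache_bytes(num_layers=32, seq_len=4096, hidden_dim=4096, bytes_per_elem=2)
print(f"{size / 2**30:.1f} GiB")  # → 2.0 GiB
```

Note that this scales linearly with both context length and batch size, which is why long-context, high-throughput serving hits memory limits quickly.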
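Of the optimizations listed, sliding window attention is the simplest to sketch: the cache is capped at the most recent W tokens and older entries are evicted. The helper below is a hypothetical illustration of that eviction policy, using a toy cache where row i stands for token i.

```python
import numpy as np

def trim_cache(K_cache, V_cache, window):
    # keep only the most recent `window` rows; older tokens are evicted
    return K_cache[-window:], V_cache[-window:]

K = np.arange(10, dtype=float).reshape(10, 1)  # toy cache: row i -> token i
V = K.copy()
K, V = trim_cache(K, V, window=4)
print(K.ravel())  # → [6. 7. 8. 9.]
```

This bounds memory at O(window) regardless of total sequence length, at the cost of the model no longer attending to tokens outside the window.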