Throughput measures the rate at which a system processes work over time, typically expressed in requests per second, tokens per second, or transactions per minute. It is distinct from but related to latency: latency measures how long each individual request takes, while throughput measures how many requests complete in a given period. A system can have low latency yet low throughput if it processes requests one at a time.

Batching, grouping multiple requests and processing them together, is the primary technique for improving throughput. GPUs are highly parallel, so processing 32 requests together often takes only slightly longer than processing one, greatly increasing throughput. However, batching increases latency for individual requests because each must wait for the whole batch to complete. This creates a fundamental tension: optimizing for throughput (large batches, high parallelism) conflicts with optimizing for latency (small batches, immediate response). Production systems typically segment traffic: interactive users get low-latency processing with small batches, while offline batch workloads maximize throughput with large batches.

Throughput also depends on hardware utilization: a system achieving only 50% GPU utilization has headroom to roughly double its throughput. Continuous batching and speculative decoding are advanced techniques that maintain high throughput while keeping latency acceptable.
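The throughput/latency tension can be made concrete with a toy cost model. The sketch below is illustrative only: the numbers (`setup_ms`, `per_request_ms`) are assumptions standing in for a GPU's fixed per-batch overhead and small per-request cost, not measurements of any real system.

```python
def batch_time_ms(batch_size: int, setup_ms: float = 10.0,
                  per_request_ms: float = 0.5) -> float:
    """Assumed cost model: a batch pays a fixed setup cost plus a small
    per-request cost, mimicking a highly parallel GPU."""
    return setup_ms + per_request_ms * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests per second when running batches of this size back to back."""
    return batch_size / (batch_time_ms(batch_size) / 1000.0)

for b in (1, 8, 32):
    # Every request in the batch waits for the whole batch, so the
    # batch time is also the per-request latency.
    print(f"batch={b:2d}  throughput={throughput_rps(b):7.1f} req/s  "
          f"latency={batch_time_ms(b):5.1f} ms")
```

Under these assumed numbers, moving from a batch of 1 to a batch of 32 multiplies throughput by more than 10x while latency grows from 10.5 ms to 26 ms, which is exactly the tradeoff described above.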