Latency measures the time delay between initiating a request and receiving a response, a critical metric for user experience and system design. In AI systems, latency breaks down into several components: network latency (time for data to travel between client and server), queue latency (time spent waiting for processing resources), and inference latency (time for the model to generate output).

Three finer-grained metrics matter for generation. First-token latency measures the time until the first token appears, which drives perceived responsiveness. Inter-token latency measures the time between subsequent tokens. Total latency is the complete time from request to final response. Streaming mitigates perceived latency by delivering partial results incrementally rather than waiting for complete generation: users see tokens appearing immediately even if total generation takes seconds.

For interactive applications, latency under 200ms feels instant, 200-500ms feels responsive, and over 1 second feels slow.

Optimizing latency starts with profiling to identify bottlenecks. Common strategies include caching, edge deployment (running inference closer to users), model quantization (reducing computation), speculative decoding (predicting tokens ahead and verifying them), and hardware optimization. The latency-throughput tradeoff is fundamental: batching requests improves throughput but increases each individual request's latency.
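First-token and inter-token latency can be measured directly from any token stream. A minimal sketch in Python, using a simulated generator (`stream_tokens` is an illustrative stand-in for a streaming model API, not a real one):

```python
import time

def stream_tokens(n_tokens=5, delay=0.01):
    """Stand-in for a streaming model API: yields tokens with a fixed delay."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure_stream_latency(token_iter):
    """Record first-token latency, inter-token gaps, and total latency."""
    start = time.perf_counter()
    first_token_latency = None
    gaps = []
    prev = start
    for _ in token_iter:
        now = time.perf_counter()
        if first_token_latency is None:
            first_token_latency = now - start  # time to first token
        else:
            gaps.append(now - prev)            # time between subsequent tokens
        prev = now
    total = time.perf_counter() - start
    return first_token_latency, gaps, total
```

In a real deployment the same instrumentation wraps the client-side iterator returned by a streaming API, so the measurements include network and queue latency as well as inference.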
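The latency-throughput tradeoff from batching can be seen in a toy cost model; the `setup`, `per_item`, and `arrival_interval` values below are illustrative assumptions, not measurements of any real system:

```python
def batch_stats(batch_size, setup=5.0, per_item=1.0, arrival_interval=0.5):
    """Toy batching model.

    A batch of B requests costs setup + B * per_item time units to run,
    so per-request compute shrinks as B grows. But the earliest request
    must wait (B - 1) * arrival_interval for the batch to fill.
    """
    service_time = setup + batch_size * per_item
    fill_wait = (batch_size - 1) * arrival_interval  # wait of the earliest request
    worst_latency = fill_wait + service_time
    throughput = batch_size / service_time           # requests per unit compute time
    return worst_latency, throughput

# Larger batches raise throughput and raise individual latency:
# batch_stats(1) -> (6.0, ~0.17); batch_stats(8) -> (16.5, ~0.62)
```

Under these assumptions, going from a batch of 1 to a batch of 8 roughly triples throughput while nearly tripling the worst-case request latency, which is why interactive services cap batch size or add a fill-timeout.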