Latency measures the time delay between initiating a request and receiving a response, a critical metric for user experience and system design. In AI systems, latency breaks down into several components: network latency (time for data to travel between client and server), queue latency (time spent waiting for processing resources), and inference latency (time for the model to generate output).

Three finer-grained metrics matter for generation. First-token latency measures the time until the first token appears, which drives perceived responsiveness. Inter-token latency measures the time between subsequent tokens. Total latency is the complete time from request to final response. Streaming mitigates perceived latency by delivering partial results incrementally rather than waiting for complete generation: users see tokens appearing immediately even if total generation takes seconds.

For interactive applications, latency under 200ms feels instant, 200-500ms feels responsive, and over 1 second feels slow.

Optimizing latency starts with profiling to identify bottlenecks. Common strategies include caching, edge deployment (running inference closer to users), model quantization (reducing computation), speculative decoding (predicting tokens ahead and verifying them), and hardware optimization. The latency-throughput tradeoff is fundamental: batching requests improves throughput but increases each individual request's latency.
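First-token and inter-token latency can be measured directly from any token stream. A minimal sketch in Python, using a simulated generator (`stream_tokens` is an illustrative stand-in for a streaming model API, not a real one):

```python
import time

def stream_tokens(n_tokens=5, delay=0.01):
    """Stand-in for a streaming model API: yields tokens with a fixed delay."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure_stream_latency(token_iter):
    """Record first-token latency, inter-token gaps, and total latency."""
    start = time.perf_counter()
    first_token_latency = None
    gaps = []
    prev = start
    for _ in token_iter:
        now = time.perf_counter()
        if first_token_latency is None:
            first_token_latency = now - start  # time to first token
        else:
            gaps.append(now - prev)            # time between subsequent tokens
        prev = now
    total = time.perf_counter() - start
    return first_token_latency, gaps, total
```

In a real deployment the same instrumentation wraps the client-side iterator returned by a streaming API, so the measurements include network and queue latency as well as inference.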
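The latency-throughput tradeoff from batching can be seen in a toy cost model; the `setup`, `per_item`, and `arrival_interval` values below are illustrative assumptions, not measurements of any real system:

```python
def batch_stats(batch_size, setup=5.0, per_item=1.0, arrival_interval=0.5):
    """Toy batching model.

    A batch of B requests costs setup + B * per_item time units to run,
    so per-request compute shrinks as B grows. But the earliest request
    must wait (B - 1) * arrival_interval for the batch to fill.
    """
    service_time = setup + batch_size * per_item
    fill_wait = (batch_size - 1) * arrival_interval  # wait of the earliest request
    worst_latency = fill_wait + service_time
    throughput = batch_size / service_time           # requests per unit compute time
    return worst_latency, throughput

# Larger batches raise throughput and raise individual latency:
# batch_stats(1) -> (6.0, ~0.17); batch_stats(8) -> (16.5, ~0.62)
```

Under these assumptions, going from a batch of 1 to a batch of 8 roughly triples throughput while nearly tripling the worst-case request latency, which is why interactive services cap batch size or add a fill-timeout.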