Inference is the process of running a trained AI model to generate predictions or outputs. It is distinct from training, which is the process of building the model. When you send a message to ChatGPT or Claude, what happens on the server is inference: the model takes your input, passes it through billions of parameters, and generates a response token by token.

Training a large model can take weeks and cost millions of dollars in compute. Inference happens in seconds and costs a fraction of a cent per query. The economics of AI products are largely determined by inference costs: a model that is cheap to run at inference can be deployed at massive scale, while an expensive one requires either high pricing or subsidized access.

Inference optimization is its own field. Techniques like quantization, which reduces numerical precision, and batching, which processes multiple requests together, significantly reduce inference costs. Dedicated inference chips from companies like Groq are designed specifically to run models faster and cheaper than general-purpose GPUs.
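As a rough sketch of why quantization cuts inference costs, the toy example below (illustrative only, not any particular framework's implementation) converts a float32 weight matrix to int8 using symmetric quantization. Storage drops 4x, and the values can be recovered at inference time to within a small rounding error:

```python
import numpy as np

# Illustrative float32 "weight matrix" (real models have billions of values).
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)

# Symmetric int8 quantization: map [-max_abs, max_abs] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)

# At inference time, dequantize (or compute directly in int8 on supported hardware).
dequantized = quantized.astype(np.float32) * scale

# int8 uses 1 byte per value versus 4 bytes for float32: a 4x memory saving.
print(weights.nbytes // quantized.nbytes)  # 4

# The rounding error per value is bounded by half the scale factor.
print(float(np.max(np.abs(weights - dequantized))) <= scale / 2)  # True
```

Smaller weights mean less memory bandwidth per token, which is often the bottleneck in serving large models; the trade-off is a small loss of numerical precision.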