
Diffusion Model

A diffusion model generates data by learning to reverse a gradual noising process, iteratively denoising random noise into coherent samples through many small refinement steps.

The forward diffusion process progressively adds Gaussian noise to training data over many steps until the original signal is completely destroyed, leaving pure noise. The model learns to predict and remove the noise added at each step. Generation runs in reverse: start from random noise, repeatedly apply the learned denoising, and gradually reveal a sample from the data distribution. This iterative refinement is the key to diffusion model quality: each denoising step makes a small correction, and many small corrections accumulate into dramatic transformations.

DALL-E 2, Midjourney, Stable Diffusion, and Imagen are diffusion models that produce photorealistic images from text descriptions, achieving higher sample quality than GANs with more stable, reliable training. The architecture typically uses a U-Net that processes the noisy image conditioned on the timestep and on optional guidance signals such as text embeddings. Classifier-free guidance trades sample diversity for closer adherence to the conditioning prompt.

The main drawback is slow generation: producing one image requires hundreds or thousands of neural network evaluations. Techniques such as DDIM, progressive distillation, and consistency models accelerate sampling by reducing the number of required steps. Diffusion models now dominate image, video, and audio generation.
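The forward noising, reverse denoising loop, and classifier-free guidance described above can be sketched in a few lines of NumPy. This is a minimal toy illustration in the style of DDPM, not any library's API: the schedule constants, function names, and the stand-in noise predictor are all assumptions made for the example.

```python
import numpy as np

# Toy DDPM-style setup (hypothetical constants, not from any real model).
# A linear beta schedule sets how much Gaussian noise each forward step adds.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative signal retention per step

def forward_noise(x0, t, rng):
    """Jump directly to forward step t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # the model is trained to predict eps from (xt, t)

def reverse_sample(predict_eps, shape, rng):
    """Ancestral sampling: start from pure noise, denoise step by step."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = predict_eps(x, t)
        # Subtract the predicted noise (DDPM posterior mean).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # add fresh noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate toward the conditional
    prediction; larger w means closer prompt adherence, less diversity."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
x0 = np.zeros((4, 4))                  # stand-in "image"
xT, _ = forward_noise(x0, T - 1, rng)  # near-pure noise by the final step
# A real model would be a U-Net; here a zero predictor keeps the demo runnable.
sample = reverse_sample(lambda x, t: np.zeros_like(x), x0.shape, rng)
```

Note that `reverse_sample` calls the noise predictor once per step, which is exactly the cost the entry describes: `T = 1000` network evaluations for a single sample, and why step-reduction techniques like DDIM matter.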
