veda.ng

SAM (Segment Anything Model) is a foundation model for image segmentation from Meta AI. Given a point, box, or rough-mask prompt, it can segment any object in any image, generalizing across domains without task-specific training. SAM was trained on over 1 billion masks from 11 million images (the SA-1B dataset) using a data engine that iteratively improved both the model and the dataset. This scale enables remarkable zero-shot generalization: SAM segments objects it has never seen, in domains far from its training data.

The architecture separates the heavy image encoder (run once per image) from the lightweight mask decoder (run once per prompt), enabling interactive segmentation: users click to indicate what to segment and receive near-instant mask predictions. SAM supports multiple prompt types: point prompts specify locations of interest; box prompts define regions; and mask prompts provide coarse outlines to refine. (The original paper also explored text prompts as a proof of concept, though that capability was not part of the public release.)

The 'Segment Anything' name reflects its goal of being the foundation model for segmentation: just as GPT-3 became a general-purpose language foundation, SAM aims to be a general-purpose segmentation foundation. Applications include annotation tools that greatly speed up labeling, editing tools for precise object selection, and use as a component in larger vision systems. SAM 2 extends the approach to video, tracking segmented objects across frames.
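The encode-once / decode-per-prompt split can be sketched in a few lines. The stub functions below stand in for SAM's real networks and are purely illustrative; the actual `segment_anything` package exposes this pattern through `SamPredictor.set_image` (expensive, once per image) and `SamPredictor.predict` (cheap, once per prompt).

```python
import numpy as np

def heavy_image_encoder(image):
    # Stand-in for SAM's ViT image encoder: the expensive step, run once
    # per image. Here it just produces a fake fixed-size embedding.
    seed = abs(hash(image.tobytes())) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((256, 64, 64))

def light_mask_decoder(embedding, point, radius=50):
    # Stand-in for SAM's lightweight prompt-conditioned mask decoder,
    # run once per prompt. Here it fakes a circular mask around the click.
    h, w = 512, 512
    ys, xs = np.mgrid[0:h, 0:w]
    return (xs - point[0]) ** 2 + (ys - point[1]) ** 2 < radius ** 2

image = np.zeros((512, 512, 3), dtype=np.uint8)
embedding = heavy_image_encoder(image)            # once per image

masks = []
for click in [(100, 200), (300, 300), (400, 50)]:
    mask = light_mask_decoder(embedding, click)   # near-instant per click
    masks.append(mask)
```

Because the embedding is cached, each additional click costs only a decoder pass, which is what makes the interactive click-to-segment loop feel instant.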