Semantic similarity measures how alike two pieces of text are in meaning, beyond surface-level word overlap, enabling applications like duplicate detection, paraphrase identification, and search relevance scoring. The key insight is that similar meanings should map to similar points in embedding space: models learn to produce embeddings where cosine similarity correlates with human judgments of semantic relatedness. 'The cat sat on the mat' and 'A feline rested on the rug' share no words but are semantically similar, so they should have nearby embeddings.

Sentence-BERT (SBERT) pioneered efficient sentence embeddings for similarity by fine-tuning BERT on natural language inference data. Modern embedding models like E5, BGE, and OpenAI's text-embedding models are specifically optimized for semantic similarity tasks. Evaluation uses human-annotated benchmarks like STS (Semantic Textual Similarity), where annotators rate sentence pair similarity on scales from 0 to 5.

Beyond sentence-level similarity, document similarity compares longer texts, and cross-lingual similarity compares texts in different languages. Semantic similarity underlies search engines, recommendation systems, plagiarism detectors, and customer service routing. The distinction from lexical similarity (word overlap) matters for choosing the right approach.
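As a minimal sketch of the core idea, the snippet below computes cosine similarity between embedding vectors. The 4-dimensional vectors are illustrative stand-ins, not output from a real model; in practice you would obtain embeddings from a model such as SBERT, E5, or BGE.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- hypothetical values chosen for illustration only.
cat_mat = np.array([0.8, 0.1, 0.6, 0.2])   # "The cat sat on the mat"
feline  = np.array([0.7, 0.2, 0.5, 0.3])   # "A feline rested on the rug"
stocks  = np.array([0.1, 0.9, 0.0, 0.7])   # "Stock prices fell sharply"

print(cosine_similarity(cat_mat, feline))  # high: paraphrases land close together
print(cosine_similarity(cat_mat, stocks))  # low: unrelated meanings land far apart
```

A well-trained embedding model produces exactly this pattern at scale: paraphrase pairs score high despite zero word overlap, which is what lexical measures like word overlap cannot capture.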