Semantic similarity measures how alike two pieces of text are in meaning, beyond surface-level word overlap, enabling applications like duplicate detection, paraphrase identification, and search relevance scoring. The key insight is that similar meanings should map to similar points in embedding space: models learn to produce embeddings where cosine similarity correlates with human judgments of semantic relatedness. 'The cat sat on the mat' and 'A feline rested on the rug' share no words but are semantically similar, so they should have nearby embeddings.

Sentence-BERT (SBERT) pioneered efficient sentence embeddings for similarity by fine-tuning BERT on natural language inference data. Modern embedding models like E5, BGE, and OpenAI's text-embedding models are specifically optimized for semantic similarity tasks. Evaluation uses human-annotated benchmarks like STS (Semantic Textual Similarity), where annotators rate sentence pair similarity on scales from 0 to 5.

Beyond sentence-level similarity, document similarity compares longer texts, and cross-lingual similarity compares texts in different languages. Semantic similarity underlies search engines, recommendation systems, plagiarism detectors, and customer service routing. The distinction from lexical similarity (word overlap) matters for choosing the right approach.
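As a minimal sketch of the core idea, the snippet below computes cosine similarity between embedding vectors. The 4-dimensional vectors are illustrative stand-ins, not output from a real model; in practice you would obtain embeddings from a model such as SBERT, E5, or BGE.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- hypothetical values chosen for illustration only.
cat_mat = np.array([0.8, 0.1, 0.6, 0.2])   # "The cat sat on the mat"
feline  = np.array([0.7, 0.2, 0.5, 0.3])   # "A feline rested on the rug"
stocks  = np.array([0.1, 0.9, 0.0, 0.7])   # "Stock prices fell sharply"

print(cosine_similarity(cat_mat, feline))  # high: paraphrases land close together
print(cosine_similarity(cat_mat, stocks))  # low: unrelated meanings land far apart
```

A well-trained embedding model produces exactly this pattern at scale: paraphrase pairs score high despite zero word overlap, which is what lexical measures like word overlap cannot capture.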