Speech recognition (Automatic Speech Recognition, ASR) converts spoken language into text, enabling voice interfaces, transcription services, and accessibility tools. The task is challenging: speech varies by accent, speed, emotion, and recording quality; spoken words blend together without clear boundaries; and background noise interferes with the signal.

Traditional ASR pipelines combined an acoustic model, a language model, and a pronunciation dictionary as separate components, each trained on a different objective. Modern end-to-end systems such as Whisper, Wav2Vec 2.0, and Conformer map audio directly to text using transformers, simplifying the pipeline and improving accuracy. OpenAI's Whisper is particularly notable for strong multilingual recognition, trained on 680,000 hours of web audio covering 97 languages. In such systems, the audio waveform is first converted to a log-mel spectrogram, processed by transformer encoder layers, and then decoded to text. Real-time transcription requires streaming architectures that emit text incrementally as speech arrives, and speaker diarization identifies who spoke when in multi-speaker recordings.

Applications include voice assistants, meeting transcription, video captioning, call center analytics, voice typing, and accessibility features for deaf and hard-of-hearing users. Accuracy is typically measured with Word Error Rate (WER), which compares the recognized words against a ground-truth transcript.
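The spectrogram step described above can be sketched with plain NumPy. This is a minimal magnitude-spectrogram via a short-time Fourier transform, not Whisper's actual front end (which additionally applies an 80-channel mel filterbank and a log transform); the 25 ms window and 10 ms hop at 16 kHz are the conventional ASR framing parameters:

```python
import numpy as np

def spectrogram(waveform, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform.
    frame_len=400 and hop=160 samples correspond to 25 ms windows
    with a 10 ms stride at a 16 kHz sample rate."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the waveform into overlapping windowed frames
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Real FFT of each frame -> (n_frames, frame_len // 2 + 1) frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)            # (98, 201): 98 time frames, 201 frequency bins
print(int(np.argmax(spec[0])))  # 11: 440 Hz falls in bin 440 / (16000/400)
```

An end-to-end model's encoder consumes this time-frequency representation rather than the raw waveform, which makes the input far more compact and easier to learn from.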
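WER is defined as (substitutions + deletions + insertions) / reference length, computed with a word-level edit distance. A minimal self-contained sketch (production systems typically also normalize case and punctuation first, which this omits):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by
    the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("jumped" for "jumps") over 5 reference words -> 0.2
print(wer("the quick brown fox jumps", "the quick brown fox jumped"))  # 0.2
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than a bounded accuracy score.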