Speech recognition (Automatic Speech Recognition, ASR) converts spoken language into text, enabling voice interfaces, transcription services, and accessibility tools. The task is challenging: speech varies by accent, speed, emotion, and recording quality; spoken words blend together without clear boundaries; and background noise interferes with the signal.

Traditional ASR pipelines combined an acoustic model, a language model, and a pronunciation dictionary as separate components, each trained on a different objective. Modern end-to-end systems such as Whisper, Wav2Vec 2.0, and Conformer map audio directly to text using transformers, simplifying the pipeline and improving accuracy. OpenAI's Whisper is particularly notable for strong multilingual recognition, trained on 680,000 hours of web audio covering 97 languages. In such systems, the audio waveform is first converted to a log-mel spectrogram, processed by transformer encoder layers, and then decoded to text. Real-time transcription requires streaming architectures that emit text incrementally as speech arrives, and speaker diarization identifies who spoke when in multi-speaker recordings.

Applications include voice assistants, meeting transcription, video captioning, call center analytics, voice typing, and accessibility features for deaf and hard-of-hearing users. Accuracy is typically measured with Word Error Rate (WER), which compares the recognized words against a ground-truth transcript.
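The spectrogram step described above can be sketched with plain NumPy. This is a minimal magnitude-spectrogram via a short-time Fourier transform, not Whisper's actual front end (which additionally applies an 80-channel mel filterbank and a log transform); the 25 ms window and 10 ms hop at 16 kHz are the conventional ASR framing parameters:

```python
import numpy as np

def spectrogram(waveform, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform.
    frame_len=400 and hop=160 samples correspond to 25 ms windows
    with a 10 ms stride at a 16 kHz sample rate."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the waveform into overlapping windowed frames
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Real FFT of each frame -> (n_frames, frame_len // 2 + 1) frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)            # (98, 201): 98 time frames, 201 frequency bins
print(int(np.argmax(spec[0])))  # 11: 440 Hz falls in bin 440 / (16000/400)
```

An end-to-end model's encoder consumes this time-frequency representation rather than the raw waveform, which makes the input far more compact and easier to learn from.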
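WER is defined as (substitutions + deletions + insertions) / reference length, computed with a word-level edit distance. A minimal self-contained sketch (production systems typically also normalize case and punctuation first, which this omits):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by
    the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("jumped" for "jumps") over 5 reference words -> 0.2
print(wer("the quick brown fox jumps", "the quick brown fox jumped"))  # 0.2
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than a bounded accuracy score.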