
Text-to-Speech

Text-to-Speech (TTS) synthesizes natural-sounding speech from written text, enabling voice interfaces, audiobook generation, accessibility tools, and voice cloning applications. The pipeline typically involves text analysis (handling abbreviations, numbers, and punctuation), prosody prediction (determining intonation, stress, and timing), and waveform synthesis (generating the actual audio).

Early systems used concatenative synthesis, splicing together recorded speech fragments. Modern neural TTS produces far more natural speech. Tacotron and its successors generate mel spectrograms from text, while vocoders like WaveNet, WaveGlow, and HiFi-GAN convert those spectrograms into audio waveforms. Recent models like VALL-E and Bark can clone voices from just seconds of reference audio, enabling exciting applications but also raising concerns about synthetic media. Zero-shot TTS generalizes to new voices without fine-tuning.

Several challenges remain. Controlling prosody (emphasis, emotion, speaking rate) is still difficult. Multilingual TTS must handle multiple languages and code-switching. Low-latency streaming synthesis is essential for real-time voice assistants.

Applications span accessibility, content creation, voice assistants, navigation systems, language learning, and entertainment. The quality bar is high: listeners are sensitive to unnatural rhythms, mispronunciations, and robotic intonation.
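The text-analysis stage can be illustrated with a minimal normalization sketch. This is a hypothetical toy, not a production frontend: the abbreviation table and single-digit expansion are illustrative assumptions, and real systems handle far more cases (dates, currencies, ambiguous abbreviations).

```python
# Toy text-normalization sketch (hypothetical rules, not a full TTS
# frontend): expands a few common abbreviations and single digits so
# the synthesizer receives pronounceable words.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def normalize(text: str) -> str:
    words = []
    for token in text.split():
        low = token.lower()
        if low in ABBREVIATIONS:
            words.append(ABBREVIATIONS[low])
        elif token.isdigit() and len(token) == 1:
            words.append(ONES[int(token)])  # expand single digits only
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 4 Elm St."))
# -> doctor Smith lives at four Elm street
```

Real normalizers must also disambiguate context ("St." as "Street" vs. "Saint", "1999" as a year vs. a quantity), which is why this stage is often a learned model rather than a rule table.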
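One established way to exert some control over prosody is SSML (Speech Synthesis Markup Language), a W3C standard accepted by many commercial TTS engines. The sketch below builds a simple SSML string; the `<speak>` and `<prosody>` tags are standard SSML, but which attributes a given engine actually honors varies, and the helper function itself is an assumption for illustration.

```python
# Sketch: wrapping text in SSML to request a speaking rate and pitch
# shift. <speak> and <prosody> are standard SSML elements; engine
# support for specific attribute values differs.
def to_ssml(text: str, rate: str = "medium", pitch: str = "+0st") -> str:
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{text}</prosody></speak>')

print(to_ssml("Welcome back.", rate="slow", pitch="+2st"))
```

Markup like this gives coarse, explicit control; fine-grained, automatic prosody (emotion, emphasis placement) is the part that remains an open research problem.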