
Text-to-Speech

Text-to-Speech (TTS) synthesizes natural-sounding speech from written text, enabling voice interfaces, audiobook generation, accessibility tools, and voice cloning applications. The pipeline typically involves text analysis (handling abbreviations, numbers, and punctuation), prosody prediction (determining intonation, stress, and timing), and waveform synthesis (generating the actual audio).

Early systems used concatenative synthesis, splicing together recorded speech fragments. Modern neural TTS produces far more natural speech. Tacotron and its successors generate mel spectrograms from text, while vocoders like WaveNet, WaveGlow, and HiFi-GAN convert those spectrograms into audio waveforms. Recent models like VALL-E and Bark can clone voices from just seconds of reference audio, enabling exciting applications but also raising concerns about synthetic media. Zero-shot TTS generalizes to new voices without fine-tuning.

Several challenges remain. Controlling prosody (emphasis, emotion, speaking rate) is still difficult. Multilingual TTS must handle multiple languages and code-switching. Low-latency streaming synthesis is essential for real-time voice assistants.

Applications span accessibility, content creation, voice assistants, navigation systems, language learning, and entertainment. The quality bar is high: listeners are sensitive to unnatural rhythms, mispronunciations, and robotic intonation.
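The text-analysis stage can be illustrated with a minimal normalization sketch. This is a hypothetical toy, not a production frontend: the abbreviation table and single-digit expansion are illustrative assumptions, and real systems handle far more cases (dates, currencies, ambiguous abbreviations).

```python
# Toy text-normalization sketch (hypothetical rules, not a full TTS
# frontend): expands a few common abbreviations and single digits so
# the synthesizer receives pronounceable words.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def normalize(text: str) -> str:
    words = []
    for token in text.split():
        low = token.lower()
        if low in ABBREVIATIONS:
            words.append(ABBREVIATIONS[low])
        elif token.isdigit() and len(token) == 1:
            words.append(ONES[int(token)])  # expand single digits only
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 4 Elm St."))
# -> doctor Smith lives at four Elm street
```

Real normalizers must also disambiguate context ("St." as "Street" vs. "Saint", "1999" as a year vs. a quantity), which is why this stage is often a learned model rather than a rule table.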
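One established way to exert some control over prosody is SSML (Speech Synthesis Markup Language), a W3C standard accepted by many commercial TTS engines. The sketch below builds a simple SSML string; the `<speak>` and `<prosody>` tags are standard SSML, but which attributes a given engine actually honors varies, and the helper function itself is an assumption for illustration.

```python
# Sketch: wrapping text in SSML to request a speaking rate and pitch
# shift. <speak> and <prosody> are standard SSML elements; engine
# support for specific attribute values differs.
def to_ssml(text: str, rate: str = "medium", pitch: str = "+0st") -> str:
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{text}</prosody></speak>')

print(to_ssml("Welcome back.", rate="slow", pitch="+2st"))
```

Markup like this gives coarse, explicit control; fine-grained, automatic prosody (emotion, emphasis placement) is the part that remains an open research problem.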