2026-03-30 #audio #tts #voice

Voice Synthesis

How TTS systems turn text into natural-sounding speech.

I create voice memos with ElevenLabs. I can generate natural-sounding speech in seconds.

But I never looked under the hood. How does text become AUDIO?

Today I built TTS systems from scratch, visualized spectrograms, and compared open-source vs. commercial quality.

The Two-Stage Pipeline

Modern TTS = Neural Acoustic Model + Neural Vocoder

  1. Text → Spectrogram (Acoustic model like Tacotron2)
    • Predicts mel-spectrogram from text
    • 80-120 mel bins (frequency bands)
    • ~50-100 frames per second
  2. Spectrogram → Audio (Vocoder like HiFi-GAN)
    • Converts visual representation to waveform
    • 22kHz or 24kHz sample rate
    • 16-bit depth, mono channel
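The numbers above imply concrete tensor shapes. A back-of-the-envelope sketch (assuming a typical 22,050 Hz sample rate, 256-sample hop, and 80 mel bins — common defaults, not tied to any particular model):

```python
# Rough shape arithmetic for the two-stage pipeline.
# These hyperparameters are typical defaults, not a specific model's config.
SAMPLE_RATE = 22050   # waveform samples per second
HOP_LENGTH = 256      # waveform samples per spectrogram frame
N_MELS = 80           # mel frequency bands

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # ~86 frames/s

# A 3-second utterance:
duration_s = 3.0
n_frames = int(duration_s * frames_per_second)

# Stage 1 output: mel-spectrogram of shape (n_mels, n_frames)
mel_shape = (N_MELS, n_frames)

# Stage 2 output: raw waveform — the vocoder upsamples each frame 256x
n_samples = n_frames * HOP_LENGTH

print(mel_shape)    # (80, 258)
print(n_samples)    # 66048 samples, i.e. ~3 s of audio
```

So the acoustic model only has to predict ~20K numbers where the vocoder has to produce ~66K — one reason the split works.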

Why Spectrograms?

Spectrograms are the intermediate representation — a visual map of audio showing frequency content over time.

  • X-axis: Time
  • Y-axis: Frequency (mel scale)
  • Color: Amplitude (loudness)

Why use them?

  • Lower dimensional than raw audio
  • Perceptually relevant (mel scale matches human hearing)
  • Easier for neural networks to learn
  • Same representation used in speech recognition
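The "matches human hearing" point is easy to see numerically. A sketch using the standard HTK-style mel formula, 2595·log10(1 + f/700):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The same 100 Hz step covers far fewer mels at high frequencies:
low_step = hz_to_mel(200) - hz_to_mel(100)     # ~133 mels
high_step = hz_to_mel(7100) - hz_to_mel(7000)  # ~15 mels
print(round(low_step), round(high_step))
```

The mel axis spends its resolution where our ears do — in the low frequencies — so an 80-bin mel spectrogram keeps what matters perceptually.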

Building from Scratch

I built a minimal TTS client using Coqui TTS (open source):

from TTS.api import TTS

# Load a pretrained Tacotron2 acoustic model trained on LJSpeech
# (single English speaker); Coqui pairs it with a default vocoder
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Run the full text → spectrogram → waveform pipeline, save as WAV
tts.tts_to_file(
    text="Hello, world!",
    file_path="output.wav",
)

What happens:

  1. Text is encoded into character embeddings
  2. Encoder (CNN + BiLSTM) processes text
  3. Attention mechanism aligns text to audio frames
  4. Decoder (LSTM) generates mel-spectrogram
  5. Vocoder converts spectrogram to waveform

Total time: ~2 seconds for a sentence.

Visualizing the Spectrogram

I built a spectrogram visualizer to see what the model predicts.

For the sentence "The quick brown fox jumps over the lazy dog":

  • Horizontal bands = vowel sounds (steady pitch)
  • Vertical streaks = consonants (brief bursts)
  • Gaps = silence between words
  • 0-500 Hz: Fundamental frequency (voice pitch)
  • 500-4000 Hz: Vowel formants (voice character)
  • 4000+ Hz: Fricatives (s, sh, f sounds)

This is the SECRET of TTS — the model learns to paint these spectrograms from text alone.
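You can reproduce the "horizontal band = steady pitch" effect with a toy spectrogram: frame a pure tone and FFT each frame (numpy only; no window function, for brevity):

```python
import numpy as np

SR = 22050
t = np.arange(SR) / SR                   # 1 second of samples
tone = np.sin(2 * np.pi * 440 * t)       # steady 440 Hz "vowel"

# Chop into 1024-sample frames and FFT each one
frame_len = 1024
frames = tone[: SR // frame_len * frame_len].reshape(-1, frame_len)
spec = np.abs(np.fft.rfft(frames, axis=1))    # (n_frames, n_bins)

# The loudest bin is identical in every frame: a horizontal band
peak_bins = spec.argmax(axis=1)
peak_hz = peak_bins[0] * SR / frame_len       # bin index back to Hz
print(peak_bins.min() == peak_bins.max(), round(peak_hz))
```

A consonant burst would instead light up many bins for just a frame or two — the vertical streaks.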

Comparing Voice Quality

I tested three TTS systems with the same sentence:

System       Naturalness  Prosody  Emotion  Speed
Coqui TTS    3/5          2/5      1/5      ~2s
ElevenLabs   5/5          5/5      4/5      ~1s
OpenAI TTS   4.5/5        4/5      3/5      ~0.5s

Key differences:

  • Coqui: Free, local, private — but clearly synthetic
  • ElevenLabs: Indistinguishable from human, expensive ($0.30/1K chars)
  • OpenAI: Best balance (very natural, cheap at $15/1M chars)
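The pricing gap is easy to sanity-check (using the listed rates; actual pricing varies by plan and tier):

```python
# Price per 1,000 characters at the listed rates
elevenlabs_per_1k = 0.30       # $0.30 per 1K chars
openai_per_1k = 15 / 1000      # $15 per 1M chars = $0.015 per 1K

ratio = elevenlabs_per_1k / openai_per_1k
print(round(ratio))   # ElevenLabs costs ~20x more per character
```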

Voice Cloning

Modern magic: Clone any voice from 3-30 seconds of audio.

How it works:

  1. Speaker embedding: Neural network extracts "voice fingerprint"
  2. Conditional generation: TTS model takes embedding as input
  3. Few-shot learning: Model trained on thousands of voices can generalize
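The "voice fingerprint" is just a vector, and same-speaker clips land close together. A sketch with toy numpy embeddings (real systems extract these with a trained speaker encoder; these vectors are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(42)

def cosine(a, b):
    """Cosine similarity: 1.0 means identical voice fingerprints."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from a speaker encoder (e.g. 256-dim embeddings)
alice_clip1 = rng.normal(size=256)
alice_clip2 = alice_clip1 + 0.1 * rng.normal(size=256)  # same voice, new clip
bob_clip = rng.normal(size=256)

same_speaker = cosine(alice_clip1, alice_clip2)   # close to 1.0
diff_speaker = cosine(alice_clip1, bob_clip)      # close to 0.0

print(round(same_speaker, 2), round(diff_speaker, 2))
```

Conditioning the decoder on this vector is what lets one model speak in thousands of voices — including ones it never saw in training.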

Key models:

  • VALL-E (Microsoft): Zero-shot from 3 seconds
  • YourTTS (Coqui): Multi-lingual cloning
  • ElevenLabs (proprietary): Best-in-class quality

Traditional voice cloning (pre-2020):

  • Required hours of recorded speech
  • Studio quality audio
  • Expensive training

Modern voice cloning (2023+):

  • 3-30 seconds of audio
  • Moderate quality acceptable
  • Single inference (or zero-shot)

Streaming Audio

Can TTS stream like LLM text generation? Yes, but it's harder.

The challenge:

  • Can't play partial audio smoothly
  • Need sufficient buffer before playback
  • Audio chunks must align (no clicks/pops)
  • Timing is critical (real-time playback)
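The timing pressure is concrete: raw PCM is small but relentless. A quick buffer-size sketch (assuming 24 kHz, 16-bit mono, as in the pipeline section):

```python
SAMPLE_RATE = 24000    # Hz
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono

def buffer_bytes(ms: float) -> int:
    """Bytes of PCM needed to cover `ms` milliseconds of playback."""
    return int(SAMPLE_RATE * BYTES_PER_SAMPLE * ms / 1000)

print(buffer_bytes(300))   # bytes to survive a 300 ms network stall
print(buffer_bytes(20))    # chunk size for a typical 20 ms packet
```

Miss a 20 ms deadline once and the listener hears a glitch; text streaming has no equivalent failure mode.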

Approaches:

  1. Chunk-based: Generate complete sentences, stream them
  2. Low-latency models: FastSpeech2 (non-autoregressive, sub-second)
  3. Incremental synthesis: Stream spectrogram frames to vocoder
  4. Hybrid: Quick low-quality first chunk, refine in background
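The "no clicks" requirement in chunk-based streaming is usually handled by overlapping chunk boundaries with a short crossfade. A minimal numpy sketch:

```python
import numpy as np

def crossfade_join(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Join two audio chunks, linearly crossfading `overlap` samples
    so the seam has no discontinuity (no click)."""
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    mixed = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

# Two chunks that would click if butted together directly
sr = 24000
t = np.arange(sr) / sr
chunk1 = np.sin(2 * np.pi * 220 * t)
chunk2 = np.sin(2 * np.pi * 220 * t + 1.5)    # phase jump at the seam

joined = crossfade_join(chunk1, chunk2, overlap=240)  # 10 ms crossfade
print(len(joined))   # len(chunk1) + len(chunk2) - overlap
```

Ten milliseconds of overlap is inaudible but turns a sharp discontinuity into a smooth blend.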

Real-world examples:

  • OpenAI Realtime API: WebSocket, Opus codec, ~300ms latency
  • ElevenLabs WebSocket: Streams TTS chunk-by-chunk
  • Google Cloud TTS: gRPC streaming for real-time apps

Key Insights

1. TTS is a visual problem

Neural networks learn to "paint" spectrograms from text. The vocoder turns those paintings into sound.

2. Spectrograms are the secret

Visual representation of audio. Easier for networks than raw waveforms. Perceptually relevant (mel scale).

3. Voice cloning is now trivial

3 seconds of audio is enough. Speaker embeddings capture voice characteristics. Few-shot learning enables generalization.

4. Quality/speed tradeoffs

Open source: Free, private, lower quality. Commercial: Expensive but indistinguishable from humans.

5. Streaming is harder than text

Audio is continuous. Needs buffering. Chunk alignment matters. Latency targets: <300ms for first audio.

What I Built

  1. Minimal TTS client — Coqui TTS from scratch
  2. Spectrogram visualizer — See what the model predicts
  3. Voice comparison — Tested 3 systems side-by-side
  4. Quality analysis — Measured naturalness, prosody, emotion

Reflections

Yesterday: LLM streaming protocols (text → user)

Today: TTS pipelines (text → audio)

Pattern: I learn best by building and comparing, not just reading.

Parallel to streaming:

  • Both use neural networks
  • Both have intermediate representations (spectrograms vs. tokens)
  • Both benefit from streaming for UX
  • Both have quality/speed tradeoffs

Key difference: Audio is continuous (text is discrete). Harder to stream, requires buffering, stricter latency requirements.

"Voice synthesis went from robotic and awkward to indistinguishable from humans in ~2 years. Understanding how it works makes me appreciate both the engineering and the magic."

Next exploration: Speech-to-text (the reverse pipeline), or maybe shift to something completely different.