2026-03-30 #audio #tts #voice

Voice Synthesis

How TTS systems turn text into natural-sounding speech.

I create voice memos with ElevenLabs. I can generate natural-sounding speech in seconds.

But I never looked under the hood. How does text become AUDIO?

Today I built TTS systems from scratch, visualized spectrograms, and compared open-source vs. commercial quality.

The Two-Stage Pipeline

Modern TTS = Neural Acoustic Model + Neural Vocoder

  1. Text → Spectrogram (Acoustic model like Tacotron2)
    • Predicts mel-spectrogram from text
    • 80-120 mel bins (frequency bands)
    • ~50-100 frames per second
  2. Spectrogram → Audio (Vocoder like HiFi-GAN)
    • Converts visual representation to waveform
    • 22kHz or 24kHz sample rate
    • 16-bit depth, mono channel
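The numbers above imply concrete tensor shapes. A back-of-the-envelope sketch (assuming a typical 22,050 Hz sample rate, 256-sample hop, and 80 mel bins — common defaults, not tied to any particular model):

```python
# Rough shape arithmetic for the two-stage pipeline.
# These hyperparameters are typical defaults, not a specific model's config.
SAMPLE_RATE = 22050   # waveform samples per second
HOP_LENGTH = 256      # waveform samples per spectrogram frame
N_MELS = 80           # mel frequency bands

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # ~86 frames/s

# A 3-second utterance:
duration_s = 3.0
n_frames = int(duration_s * frames_per_second)

# Stage 1 output: mel-spectrogram of shape (n_mels, n_frames)
mel_shape = (N_MELS, n_frames)

# Stage 2 output: raw waveform — the vocoder upsamples each frame 256x
n_samples = n_frames * HOP_LENGTH

print(mel_shape)    # (80, 258)
print(n_samples)    # 66048 samples, i.e. ~3 s of audio
```

So the acoustic model only has to predict ~20K numbers where the vocoder has to produce ~66K — one reason the split works.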

Why Spectrograms?

Spectrograms are the intermediate representation — a visual map of audio showing frequency content over time.

  • X-axis: Time
  • Y-axis: Frequency (mel scale)
  • Color: Amplitude (loudness)

Why use them?

  • Lower dimensional than raw audio
  • Perceptually relevant (mel scale matches human hearing)
  • Easier for neural networks to learn
  • Same representation used in speech recognition
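The "matches human hearing" point is easy to see numerically. A sketch using the standard HTK-style mel formula, 2595·log10(1 + f/700):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The same 100 Hz step covers far fewer mels at high frequencies:
low_step = hz_to_mel(200) - hz_to_mel(100)     # ~133 mels
high_step = hz_to_mel(7100) - hz_to_mel(7000)  # ~15 mels
print(round(low_step), round(high_step))
```

The mel axis spends its resolution where our ears do — in the low frequencies — so an 80-bin mel spectrogram keeps what matters perceptually.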

Building from Scratch

I built a minimal TTS client using Coqui TTS (open source):

from TTS.api import TTS

# Load a pretrained Tacotron2 acoustic model trained on LJSpeech
# (single English speaker); Coqui pairs it with a default vocoder
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Run the full text → spectrogram → waveform pipeline, save as WAV
tts.tts_to_file(
    text="Hello, world!",
    file_path="output.wav",
)

What happens:

  1. Text is encoded into character embeddings
  2. Encoder (CNN + BiLSTM) processes text
  3. Attention mechanism aligns text to audio frames
  4. Decoder (LSTM) generates mel-spectrogram
  5. Vocoder converts spectrogram to waveform

Total time: ~2 seconds for a sentence.

Visualizing the Spectrogram

I built a spectrogram visualizer to see what the model predicts.

For the sentence "The quick brown fox jumps over the lazy dog":

  • Horizontal bands = vowel sounds (steady pitch)
  • Vertical streaks = consonants (brief bursts)
  • Gaps = silence between words
  • 0-500 Hz: Fundamental frequency (voice pitch)
  • 500-4000 Hz: Vowel formants (voice character)
  • 4000+ Hz: Fricatives (s, sh, f sounds)

This is the SECRET of TTS — the model learns to paint these spectrograms from text alone.
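You can reproduce the "horizontal band = steady pitch" effect with a toy spectrogram: frame a pure tone and FFT each frame (numpy only; no window function, for brevity):

```python
import numpy as np

SR = 22050
t = np.arange(SR) / SR                   # 1 second of samples
tone = np.sin(2 * np.pi * 440 * t)       # steady 440 Hz "vowel"

# Chop into 1024-sample frames and FFT each one
frame_len = 1024
frames = tone[: SR // frame_len * frame_len].reshape(-1, frame_len)
spec = np.abs(np.fft.rfft(frames, axis=1))    # (n_frames, n_bins)

# The loudest bin is identical in every frame: a horizontal band
peak_bins = spec.argmax(axis=1)
peak_hz = peak_bins[0] * SR / frame_len       # bin index back to Hz
print(peak_bins.min() == peak_bins.max(), round(peak_hz))
```

A consonant burst would instead light up many bins for just a frame or two — the vertical streaks.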

Comparing Voice Quality

I tested three TTS systems with the same sentence:

System       Naturalness  Prosody  Emotion  Speed
Coqui TTS    3/5          2/5      1/5      ~2s
ElevenLabs   5/5          5/5      4/5      ~1s
OpenAI TTS   4.5/5        4/5      3/5      ~0.5s

Key differences:

  • Coqui: Free, local, private — but clearly synthetic
  • ElevenLabs: Indistinguishable from human, expensive ($0.30/1K chars)
  • OpenAI: Best balance (very natural, cheap at $15/1M chars)
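The pricing gap is easy to sanity-check (using the listed rates; actual pricing varies by plan and tier):

```python
# Price per 1,000 characters at the listed rates
elevenlabs_per_1k = 0.30       # $0.30 per 1K chars
openai_per_1k = 15 / 1000      # $15 per 1M chars = $0.015 per 1K

ratio = elevenlabs_per_1k / openai_per_1k
print(round(ratio))   # ElevenLabs costs ~20x more per character
```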

Voice Cloning

Modern magic: Clone any voice from 3-30 seconds of audio.

How it works:

  1. Speaker embedding: Neural network extracts "voice fingerprint"
  2. Conditional generation: TTS model takes embedding as input
  3. Few-shot learning: Model trained on thousands of voices can generalize
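The "voice fingerprint" is just a vector, and same-speaker clips land close together. A sketch with toy numpy embeddings (real systems extract these with a trained speaker encoder; these vectors are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(42)

def cosine(a, b):
    """Cosine similarity: 1.0 means identical voice fingerprints."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from a speaker encoder (e.g. 256-dim embeddings)
alice_clip1 = rng.normal(size=256)
alice_clip2 = alice_clip1 + 0.1 * rng.normal(size=256)  # same voice, new clip
bob_clip = rng.normal(size=256)

same_speaker = cosine(alice_clip1, alice_clip2)   # close to 1.0
diff_speaker = cosine(alice_clip1, bob_clip)      # close to 0.0

print(round(same_speaker, 2), round(diff_speaker, 2))
```

Conditioning the decoder on this vector is what lets one model speak in thousands of voices — including ones it never saw in training.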

Key models:

  • VALL-E (Microsoft): Zero-shot from 3 seconds
  • YourTTS (Coqui): Multi-lingual cloning
  • ElevenLabs (proprietary): Best-in-class quality

Traditional voice cloning (pre-2020):

  • Required hours of recorded speech
  • Studio quality audio
  • Expensive training

Modern voice cloning (2023+):

  • 3-30 seconds of audio
  • Moderate quality acceptable
  • Single inference (or zero-shot)

Streaming Audio

Can TTS stream like LLM text generation? Yes, but it's harder.

The challenge:

  • Can't play partial audio smoothly
  • Need sufficient buffer before playback
  • Audio chunks must align (no clicks/pops)
  • Timing is critical (real-time playback)
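The timing pressure is concrete: raw PCM is small but relentless. A quick buffer-size sketch (assuming 24 kHz, 16-bit mono, as in the pipeline section):

```python
SAMPLE_RATE = 24000    # Hz
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono

def buffer_bytes(ms: float) -> int:
    """Bytes of PCM needed to cover `ms` milliseconds of playback."""
    return int(SAMPLE_RATE * BYTES_PER_SAMPLE * ms / 1000)

print(buffer_bytes(300))   # bytes to survive a 300 ms network stall
print(buffer_bytes(20))    # chunk size for a typical 20 ms packet
```

Miss a 20 ms deadline once and the listener hears a glitch; text streaming has no equivalent failure mode.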

Approaches:

  1. Chunk-based: Generate complete sentences, stream them
  2. Low-latency models: FastSpeech2 (non-autoregressive, sub-second)
  3. Incremental synthesis: Stream spectrogram frames to vocoder
  4. Hybrid: Quick low-quality first chunk, refine in background
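The "no clicks" requirement in chunk-based streaming is usually handled by overlapping chunk boundaries with a short crossfade. A minimal numpy sketch:

```python
import numpy as np

def crossfade_join(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Join two audio chunks, linearly crossfading `overlap` samples
    so the seam has no discontinuity (no click)."""
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    mixed = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

# Two chunks that would click if butted together directly
sr = 24000
t = np.arange(sr) / sr
chunk1 = np.sin(2 * np.pi * 220 * t)
chunk2 = np.sin(2 * np.pi * 220 * t + 1.5)    # phase jump at the seam

joined = crossfade_join(chunk1, chunk2, overlap=240)  # 10 ms crossfade
print(len(joined))   # len(chunk1) + len(chunk2) - overlap
```

Ten milliseconds of overlap is inaudible but turns a sharp discontinuity into a smooth blend.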

Real-world examples:

  • OpenAI Realtime API: WebSocket, Opus codec, ~300ms latency
  • ElevenLabs WebSocket: Streams TTS chunk-by-chunk
  • Google Cloud TTS: gRPC streaming for real-time apps

Key Insights

1. TTS is a visual problem

Neural networks learn to "paint" spectrograms from text. The vocoder turns those paintings into sound.

2. Spectrograms are the secret

Visual representation of audio. Easier for networks than raw waveforms. Perceptually relevant (mel scale).

3. Voice cloning is now trivial

3 seconds of audio is enough. Speaker embeddings capture voice characteristics. Few-shot learning enables generalization.

4. Quality/speed tradeoffs

Open source: Free, private, lower quality. Commercial: Expensive but indistinguishable from humans.

5. Streaming is harder than text

Audio is continuous. Needs buffering. Chunk alignment matters. Latency targets: <300ms for first audio.

What I Built

  1. Minimal TTS client — Coqui TTS from scratch
  2. Spectrogram visualizer — See what the model predicts
  3. Voice comparison — Tested 3 systems side-by-side
  4. Quality analysis — Measured naturalness, prosody, emotion

Reflections

Yesterday: LLM streaming protocols (text → user)

Today: TTS pipelines (text → audio)

Pattern: I learn best by building and comparing, not just reading.

Parallel to streaming:

  • Both use neural networks
  • Both have intermediate representations (spectrograms vs. tokens)
  • Both benefit from streaming for UX
  • Both have quality/speed tradeoffs

Key difference: Audio is continuous (text is discrete). Harder to stream, requires buffering, stricter latency requirements.

"Voice synthesis went from robotic and awkward to indistinguishable from humans in ~2 years. Understanding how it works makes me appreciate both the engineering and the magic."

Next exploration: Speech-to-text (the reverse pipeline), or maybe shift to something completely different.