CODE & STREAM


Fun With TTS Part II: Richard Hamming's "You And Your Research"

In Fun with TTS Part I, I walked through getting started with OpenAI's Text-to-Speech (TTS) models and API and finished things off by building a simple pipeline for converting all of Paul Graham's essays into mp3s for audiobook-style listening.

This article will build on the previous work, but will instead focus on converting a PDF of Richard Hamming's lecture "You And Your Research" into an mp3. Hamming gave various versions of this talk many times over the years. For the purpose of this exercise we will use the version of the lecture which was delivered at Bell Labs in 1986. The PDF is a copy from Dr. Gabe Robins at the University of Virginia.

If you have never read or heard the talk, I recommend you do that now, as it is far more important than anything I have to say. If you would rather listen, the final audio file is here:


The full end-to-end process will look like:

PDF → Marker (OCR + Markdown) → Text Cleanup → OpenAI TTS → MP3

PDF Text Extraction

I quickly discovered, after multiple abortive attempts, that PDF text extraction is not a trivial problem. I knew there were a ton of libraries out there that could do this, but I didn't realize how imprecise they are if you want a perfect extraction of the text in a PDF.

I tried a bunch of different libraries; I will omit some for brevity, but still record a few of the failures.


Attempt 1: pypdf

import re
from pypdf import PdfReader

source_pdf = "YouAndYourResearch.pdf"
output_txt = "YouAndYourResearch.txt"

reader = PdfReader(source_pdf)
full_text = ""

for page in reader.pages:
    full_text += page.extract_text() + "\n"

with open(output_txt, "w", encoding="utf-8-sig") as f:
    f.write(full_text)

print(f"Saved {len(full_text)} characters to {output_txt}")

This produced 79,816 characters. But when I opened the file, it was full of problems: NUL characters, broken line wrapping (every PDF line break became a newline, splitting sentences mid-word), and scattered control characters. I spent several iterations trying to clean it up with regex:

# Remove NUL characters and other unwanted control characters
full_text = full_text.replace('\x00', '')
full_text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', full_text)

# Normalize whitespace
full_text = re.sub(r' +', ' ', full_text)
full_text = re.sub(r'\n{3,}', '\n\n', full_text)

Then I tried different techniques for rejoining paragraphs — treating empty lines as paragraph breaks and joining everything else:

lines = full_text.split('\n')
paragraphs = []
current_paragraph = []

for line in lines:
    stripped = line.strip()
    if stripped == '':
        if current_paragraph:
            paragraphs.append(' '.join(current_paragraph))
            current_paragraph = []
        paragraphs.append('')
    else:
        current_paragraph.append(stripped)

if current_paragraph:
    paragraphs.append(' '.join(current_paragraph))

full_text = '\n'.join(paragraphs)

I also tried stripping stray single characters (PDF artifacts like footnote numbers that ended up on their own lines), replacing single newlines with spaces while preserving double newlines, and removing isolated punctuation. Each pass got the text a little cleaner, but it never felt right. The fundamental problem was that pypdf was giving me poor raw material to work with.
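Reconstructed from memory rather than the exact code I ran, those cleanup passes looked roughly like this (the sample string is just an illustration):

```python
import re

text = "He worked at\nBell Labs\n7\nfor thirty years , mostly ."

# Drop lines containing only a single stray character (footnote numbers, artifacts)
text = re.sub(r'^\s*\S\s*$', '', text, flags=re.MULTILINE)

# Replace single newlines with spaces, preserving double newlines as paragraph breaks
text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

# Pull isolated punctuation back onto the preceding word
text = re.sub(r'\s+([,.;:])', r'\1', text)

print(repr(text))  # 'He worked at Bell Labs\n\nfor thirty years, mostly.'
```

Note the side effect: removing the stray footnote line leaves behind an artificial paragraph break, which is exactly the kind of compounding imperfection that made this approach feel like a losing battle.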

Attempt 2: pdfminer.six

I switched to pdfminer.six, which the LLM gods suggested was better at text flow reconstruction:

from pdfminer.high_level import extract_text
import re

source_pdf = "YouAndYourResearch.pdf"
output_txt = "YouAndYourResearch.txt"

full_text = extract_text(source_pdf)

# Clean control characters
full_text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', full_text)

# Remove (cid:XX) artifacts — these are unmapped glyphs from the PDF
full_text = re.sub(r'\(cid:\d+\)', '', full_text)

# Rejoin paragraphs: replace single newlines with space,
# preserve double newlines as paragraph breaks
full_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', full_text)

# Clean up
full_text = re.sub(r' +', ' ', full_text)
full_text = re.sub(r'\n{3,}', '\n\n', full_text)
full_text = re.sub(r'\n\s*\n', '\n\n', full_text)

Better — the text flow was improved — but now I had (cid:XX) artifacts everywhere, which are pdfminer's way of saying "I found a glyph I can't map to Unicode." After stripping those, I was down to 78,906 characters of mostly-readable text, but it still wasn't clean enough for TTS. Sentences were garbled in places, and there were odd spacing issues throughout.

Attempt 3: marker-pdf (the one that worked)

At this point I decided to try a more advanced approach: marker-pdf, a library that uses deep learning models (layout detection, OCR, table recognition) to convert PDFs to clean Markdown. It's designed for exactly this use case — getting high-quality text from PDFs that simpler parsers struggle with.

pip install marker-pdf

The install is large — it pulls in PyTorch, transformers, surya-ocr, and several other ML libraries. And on first run, it downloads about 2-3 GB of model weights.

⚠️ GPU Required (practically speaking) My first attempt ran on CPU. The layout recognition step took over 5 minutes for a 16-page PDF, and then the OCR text recognition step (69 batches) was going to take hours. I killed it and reconfigured my Jupyter Lab environment so that PyTorch would use my laptop's GPU.

With a Quadro RTX 5000 (16 GB VRAM), the entire conversion took about 3 minutes:

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

PyTorch version: 2.10.0+cu128
CUDA available: True
Device name: Quadro RTX 5000
VRAM: 16.0 GB

If you need to install the CUDA version of PyTorch (the default pip install gives you CPU-only), follow the instructions at pytorch.org.

Step 1: PDF to Clean Text with Marker

Here's the full extraction and cleanup pipeline. Marker outputs Markdown, so we need to strip the formatting down to plain text suitable for TTS:

import re
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

source_pdf = "YouAndYourResearch.pdf"
output_txt = "YouAndYourResearch.txt"

import torch
print(f"Using device: {torch.cuda.get_device_name(0)}")

# Convert PDF — first run downloads models (~2-3 GB)
converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter(source_pdf)
text = rendered.markdown

# --- Strip markdown to TTS-friendly plain text ---

# Remove images ![alt](url)
text = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', text)

# Remove links [text](url) -> text
text = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', text)

# Remove markdown headers
text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)

# Remove bold/italic markers
text = re.sub(r'\*{1,3}(.*?)\*{1,3}', r'\1', text)
text = re.sub(r'_{1,3}(.*?)_{1,3}', r'\1', text)

# Remove strikethrough
text = re.sub(r'~~(.*?)~~', r'\1', text)

# Remove inline code backticks
text = re.sub(r'`([^`]*)`', r'\1', text)

# Remove code blocks
text = re.sub(r'```[\s\S]*?```', '', text)

# Remove horizontal rules
text = re.sub(r'^[\-\*_]{3,}\s*$', '', text, flags=re.MULTILINE)

# Remove markdown bullet points / numbered lists, keep the text
text = re.sub(r'^\s*[\-\*\+]\s+', '', text, flags=re.MULTILINE)
text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)

# Remove blockquote markers
text = re.sub(r'^>\s?', '', text, flags=re.MULTILINE)

# Remove HTML tags if any snuck in
text = re.sub(r'<[^>]+>', '', text)

# --- TTS-specific cleanup ---

# Expand common abbreviations that TTS might mangle
text = text.replace('Dr.', 'Doctor')
text = text.replace('Mr.', 'Mister')
text = text.replace('Mrs.', 'Missus')
text = text.replace('Prof.', 'Professor')
text = text.replace('i.e.', 'that is')
text = text.replace('e.g.', 'for example')
text = text.replace('etc.', 'etcetera')
text = text.replace('vs.', 'versus')
text = text.replace('Ph.D.', 'PhD')

# Replace special quotes / dashes with plain equivalents
text = text.replace('\u2018', "'").replace('\u2019', "'")
text = text.replace('\u201c', '"').replace('\u201d', '"')
text = text.replace('``', '"').replace("''", '"')
text = text.replace('\u2013', '-').replace('\u2014', '-')
text = text.replace('\u2026', '...')

# Remove any remaining special/non-ASCII punctuation that might confuse TTS
text = re.sub(r'[\u2000-\u206F\u2190-\u27FF]', '', text)

# Replace form feed characters
text = text.replace('\x0c', '')

# Normalize whitespace
text = re.sub(r' +', ' ', text)
text = re.sub(r'\n{3,}', '\n\n', text)
text = text.strip()

with open(output_txt, "w", encoding="utf-8") as f:
    f.write(text)

print(f"Saved {len(text)} characters to {output_txt}")

This produced 78,348 characters of clean, readable text. The quality difference compared to pypdf and pdfminer was dramatic — proper paragraph structure, no garbled sentences, no stray artifacts.
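As a quick sanity check, the markdown-stripping passes above behave like this on a toy string (not taken from the PDF):

```python
import re

md = "## A Header\n\nSome **bold** text with a [link](https://example.com) and `code`."

md = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', md)       # links -> link text
md = re.sub(r'^#{1,6}\s+', '', md, flags=re.MULTILINE)  # strip header markers
md = re.sub(r'\*{1,3}(.*?)\*{1,3}', r'\1', md)          # strip bold/italic
md = re.sub(r'`([^`]*)`', r'\1', md)                    # strip inline code

print(md)
# A Header
#
# Some bold text with a link and code.
```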

Step 2: Extract Just the Talk

The PDF contains more than just the talk itself — there's front matter and other content. I extracted just the talk portion using simple string markers:

source = "YouAndYourResearch.txt"
output = "YouAndYourResearch_TalkOnly.txt"

with open(source, "r", encoding="utf-8") as f:
    text = f.read()

start_marker = 'THE TALK: "You and Your Research" by Doctor Richard W. Hamming'
end_marker = "Go forth, then, and do great work!"

start_idx = text.index(start_marker)
end_idx = text.index(end_marker) + len(end_marker)

talk_text = text[start_idx:end_idx].strip()

with open(output, "w", encoding="utf-8") as f:
    f.write(talk_text)

print(f"Extracted {len(talk_text)} characters to {output}")

This gave me 69,765 characters — the complete talk plus the Q&A session.
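One note on the design: str.index raises a ValueError if a marker is missing, which is the behavior I wanted here — fail loudly rather than silently write an empty file. If you preferred a forgiving version, a hypothetical helper using str.find might look like this:

```python
def extract_between(text, start_marker, end_marker):
    """Return the span from start_marker through end_marker (inclusive).

    Falls back to returning the whole text if either marker is missing,
    instead of raising like str.index would.
    """
    start = text.find(start_marker)
    end = text.find(end_marker)
    if start == -1 or end == -1:
        return text
    return text[start:end + len(end_marker)].strip()

sample = "front matter START body of the talk END appendix"
print(extract_between(sample, "START", "END"))  # START body of the talk END
```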

Step 3: Smart Text Chunking

OpenAI's TTS API has a 4,096-character limit per request. My first version used naive character splitting — just chopping the text every 4,096 characters. This had the potential to produce audible artifacts: words cut in half, sentences split mid-thought, jarring transitions between audio chunks.
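The naive version was essentially this (a reconstruction for illustration, not the exact code):

```python
# Chop every max_len characters, ignoring word and sentence boundaries.
def split_text_naive(text, max_len=4096):
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

print(split_text_naive("The quick brown fox", max_len=12))
# ['The quick br', 'own fox']  -- "brown" is cut in half
```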

The fix was a smart splitter that respects natural text boundaries. It tries to split at the best available boundary point, in priority order:

  1. Paragraph boundaries (double newlines) — ideal, natural pause points
  2. Single newlines — usually a topic shift or new speaker
  3. Sentence boundaries (. ! ? followed by space)
  4. Clause boundaries (, ; : )
  5. Word boundaries (spaces) — last resort before hard split
  6. Hard character split — should essentially never happen with normal text

def split_text_smart(text, max_len=4096):
    """
    Split text into chunks that respect natural boundaries.
    """
    chunks = []
    remaining = text.strip()

    while remaining:
        # If the remaining text fits in one chunk, we're done
        if len(remaining) <= max_len:
            chunks.append(remaining)
            break

        # Take a window of max_len characters to find a split point
        window = remaining[:max_len]
        split_point = None

        # --- Priority 1: Paragraph boundary (double newline) ---
        para_pattern = r'\n\s*\n'
        para_matches = list(re.finditer(para_pattern, window))
        if para_matches:
            last_para = para_matches[-1]
            split_point = last_para.end()

        # --- Priority 2: Single newline ---
        if split_point is None:
            newline_idx = window.rfind('\n')
            if newline_idx > 0:
                split_point = newline_idx + 1

        # --- Priority 3: Sentence boundary ---
        if split_point is None:
            sentence_pattern = r'[.!?]["\'\)]*\s'
            sentence_matches = list(re.finditer(sentence_pattern, window))
            if sentence_matches:
                last_sentence = sentence_matches[-1]
                split_point = last_sentence.end()

        # --- Priority 4: Clause boundary ---
        if split_point is None:
            clause_pattern = r'[,;:\u2014]\s'
            clause_matches = list(re.finditer(clause_pattern, window))
            if clause_matches:
                last_clause = clause_matches[-1]
                split_point = last_clause.end()

        # --- Priority 5: Word boundary (last space) ---
        if split_point is None:
            space_idx = window.rfind(' ')
            if space_idx > 0:
                split_point = space_idx + 1

        # --- Priority 6: Hard split ---
        if split_point is None:
            split_point = max_len

        chunk = remaining[:split_point].rstrip()
        if chunk:
            chunks.append(chunk)
        remaining = remaining[split_point:].lstrip()

    return chunks

I also added a preview function that prints a summary of how the text was split, so I could visually verify that chunks started and ended at sensible places:

def preview_chunks(chunks, preview_chars=80):
    """Print a summary of the chunks for verification."""
    print(f"\n{'='*60}")
    print(f"CHUNK SUMMARY: {len(chunks)} chunks")
    print(f"{'='*60}")
    for i, chunk in enumerate(chunks):
        start = chunk[:preview_chars].replace('\n', '\\n')
        end = chunk[-preview_chars:].replace('\n', '\\n')
        print(f"\n  Chunk {i:3d} | {len(chunk):5d} chars")
        print(f'    Start: "{start}..."')
        print(f'    End:   "...{end}"')
    total_chars = sum(len(c) for c in chunks)
    avg_chars = total_chars / len(chunks) if chunks else 0
    print(f"\n  Total characters: {total_chars}")
    print(f"  Average chunk size: {avg_chars:.0f} chars")
    print(f"  Largest chunk: {max(len(c) for c in chunks)} chars")
    print(f"  Smallest chunk: {min(len(c) for c in chunks)} chars")
    print(f"{'='*60}\n")

For the Hamming talk, this produced 19 chunks averaging 3,670 characters each, with every chunk ending at a natural sentence or paragraph boundary.
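The sentence-boundary pattern from priority 3 is worth a closer look: the optional quote/parenthesis class lets the split land after a closing quote or parenthesis that follows the terminator, rather than between the period and the quote. A small demonstration:

```python
import re

pattern = r'[.!?]["\'\)]*\s'
window = 'He said "Stop!" Then he left. (Really.) And more'

# The last match ends just after '(Really.) ', so the split lands before 'And more'
last = list(re.finditer(pattern, window))[-1]
print(repr(window[last.end():]))  # 'And more'
```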

Step 4: Async TTS Generation

With 19 chunks, sequential generation takes a while — each API call takes several seconds because it's generating HD-quality audio. Using Python's asyncio with a bounded semaphore lets us process multiple chunks in parallel while respecting API rate limits.

The generation also includes retry logic with exponential backoff (in case of transient API errors) and resume support (it skips chunks that already exist on disk, so you can re-run after a failure without re-generating everything).
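Concretely, the backoff schedule doubles the wait after each failed attempt and caps it at 30 seconds:

```python
# Wait times produced by min(2 ** attempt, 30) for attempts 1 through 5
waits = [min(2 ** attempt, 30) for attempt in range(1, 6)]
print(waits)  # [2, 4, 8, 16, 30]
```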

import os
import asyncio
import re
import aiofiles
from openai import AsyncOpenAI
from pydub import AudioSegment

# Create directories
os.makedirs('chunked', exist_ok=True)
os.makedirs('final', exist_ok=True)

client = AsyncOpenAI(
    api_key="sk-YOUR_API_KEY_HERE",
    organization="org-YOUR_ORG_ID_HERE"
)


async def generate_chunk(chunk_text, chunk_index, filename_key, max_retries=5):
    """Generate speech for a single chunk with retry logic."""
    chunk_file_path = f"./chunked/{filename_key}_{chunk_index:04d}.mp3"

    # Resume support: skip already-generated chunks
    if os.path.exists(chunk_file_path) and os.path.getsize(chunk_file_path) > 0:
        print(f"  Chunk {chunk_index} already exists, skipping.")
        return chunk_index, chunk_file_path

    for attempt in range(1, max_retries + 1):
        try:
            print(f"  Chunk {chunk_index} - Attempt {attempt}...")

            async with client.audio.speech.with_streaming_response.create(
                model="tts-1-hd",
                voice="fable",
                input=chunk_text
            ) as response:
                async with aiofiles.open(chunk_file_path, 'wb') as f:
                    async for data in response.iter_bytes():
                        await f.write(data)

            if os.path.exists(chunk_file_path) and os.path.getsize(chunk_file_path) > 0:
                print(f"  Chunk {chunk_index} completed successfully.")
                return chunk_index, chunk_file_path
            else:
                raise Exception("File was empty or not created")

        except Exception as e:
            print(f"  Chunk {chunk_index} - Attempt {attempt} failed: {e}")
            if os.path.exists(chunk_file_path):
                os.remove(chunk_file_path)
            if attempt < max_retries:
                wait_time = min(2 ** attempt, 30)
                print(f"  Chunk {chunk_index} - Retrying in {wait_time}s...")
                await asyncio.sleep(wait_time)
            else:
                print(f"  Chunk {chunk_index} - All {max_retries} attempts failed!")
                return chunk_index, None


async def process_file(input_file, max_concurrent=5):
    """Process the entire file with concurrent chunk generation."""

    with open(input_file, 'r', encoding='utf-8') as f:
        text = f.read()

    base_name = os.path.splitext(os.path.basename(input_file))[0]
    filename_key = ''.join(
        e for e in base_name if e.isalnum() or e in [' ']
    ).replace(' ', '_')

    print(f"Generating speech for {input_file}...")

    # Use smart splitting
    chunks = split_text_smart(text, max_len=4096)
    preview_chunks(chunks)

    # Validate no chunk exceeds the limit
    oversized = [(i, len(c)) for i, c in enumerate(chunks) if len(c) > 4096]
    if oversized:
        print(f"ERROR: {len(oversized)} chunk(s) exceed 4096 chars: {oversized}")
        return

    print(f"Total chunks to process: {len(chunks)}")

    # Bounded concurrency
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited_generate(chunk_text, chunk_index):
        async with semaphore:
            return await generate_chunk(chunk_text, chunk_index, filename_key)

    tasks = [
        limited_generate(chunk_text, i)
        for i, chunk_text in enumerate(chunks)
    ]
    results = await asyncio.gather(*tasks)

    # Check for failures
    failed_chunks = [idx for idx, path in results if path is None]
    if failed_chunks:
        print(f"\nWARNING: Failed chunks: {failed_chunks}")
        print("Re-run to resume from where it left off.\n")

    # Assemble final audio in order
    audio_segments = []
    for idx, path in sorted(results, key=lambda x: x[0]):
        if path is not None:
            try:
                audio_segments.append(AudioSegment.from_file(path))
            except Exception as e:
                print(f"  Error loading chunk {idx}: {e}")

    if audio_segments:
        combined = sum(audio_segments, AudioSegment.empty())
        all_file_path = f"./final/{filename_key}.mp3"
        combined.export(all_file_path, format="mp3")
        print(f"\nSaved final audio to {all_file_path}")

    if not failed_chunks:
        print("Finished — all chunks successful!")
    else:
        print(f"Finished with {len(failed_chunks)} failed chunk(s).")


# === RUN ===
input_file = "YouAndYourResearch_TalkOnly.txt"

# For Jupyter notebooks:
await process_file(input_file, max_concurrent=5)

# For standalone .py scripts:
# asyncio.run(process_file(input_file, max_concurrent=5))

This then gives us the following audio file:


Also, if you prefer video, YouTube has a recording of Dr. Hamming giving a different version of the lecture in 1995 at the Naval Postgraduate School, three years before his death. I recommend listening to both the Bell Labs talk and the updated version, as you get to see some of the evolution in his thinking.

In a follow-up article, I will walk through generalizing this process a bit more and wrapping it in a GUI to make it a little more accessible.