9 min read · Maya Chen

Writing scripts your AI voice will actually nail

Great synthetic narration starts long before you press generate. A practical, example-driven guide to writing scripts for the ear — structure, rhythm, pronunciation, and the SSML that ties it all together.

Most people who are disappointed by AI voiceover blame the model. They paste in a paragraph that reads beautifully on screen, press generate, and get back something that is technically clear but somehow lifeless — and they conclude the technology is not there yet. Almost always, the problem is upstream. The voice is only as good as the script you hand it, and the script that wins an award when read silently is rarely the one that sounds natural out loud. Writing for the ear is a distinct craft, and once you learn its handful of rules, the quality of your renders jumps overnight without changing a single model setting.

This guide walks through how we coach teams to write scripts that synthetic voices perform well: how to structure sentences for breath, how to build rhythm, how to handle the tricky bits like numbers and names, and how to use a light touch of SSML to direct the delivery. The examples are deliberately mundane — product explainers, ads, course modules — because that is the work most people are actually shipping.

Write for the ear, not the eye

The single most useful habit you can build is to read every line out loud before you consider it finished. Your eye forgives long, clause-heavy sentences because it can backtrack; your ear cannot. When a listener loses the thread halfway through a sentence, there is no rewinding in the moment — the meaning is simply gone. So the prose that works for the ear is shorter, more direct, and more repetitive than what you would write for a page. You will find yourself breaking one elegant 40-word sentence into three plain ones, and the result will sound dramatically more human when synthesized.

Spoken language also leans on signposting that written language can drop. Phrases like "here is the thing," "but first," and "the short version is" feel redundant on the page, yet they are exactly how real speakers tell a listener what to do with the next sentence. A synthetic voice delivers these cues with the same easy confidence a person would, and they give the listener room to follow along. Do not strip them out in the name of concision; in audio, they are the concision.

One idea per breath

A good rule of thumb: each sentence should contain one idea that a person could say comfortably in a single breath. If you run out of air reading it aloud, the model will run out of natural places to pause, and the line will come out as a flat, hurried wall. When you genuinely need to connect two ideas, use a short connective sentence rather than a semicolon-stitched monster. The voice will thank you, and so will the listener.
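One way to make the breath test mechanical is to flag any sentence that runs past a word budget. The sketch below is only a rough proxy for reading aloud: the 20-word limit is an assumption, not a figure from this guide, and the sentence splitter is deliberately naive (it will trip on abbreviations like "e.g.").

```python
import re

# Assumed budget: roughly 20 words per breath. Tune it by reading aloud.
MAX_WORDS_PER_BREATH = 20


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def long_sentences(text: str, max_words: int = MAX_WORDS_PER_BREATH) -> list[tuple[int, str]]:
    """Return (word_count, sentence) for every sentence over the budget."""
    flagged = []
    for sentence in split_sentences(text):
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged
```

Anything this flags is a candidate for splitting into two plain sentences, not an automatic failure; your own voice is still the final judge.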

Watch your sentence openings, too. Starting three sentences in a row with the same word or structure creates an unintentional sing-song pattern that the ear picks up immediately, even when the eye would not. Vary your openings: a question, then a statement, then a short fragment. That variety is what keeps a two-minute narration from feeling like a metronome.
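Repeated openings are easy to miss on the page, so a small check can help. This is a minimal sketch that looks only at the first word of each sentence; the run length of three is taken from the rule above, and the function name is our own invention.

```python
def repeated_openings(sentences: list[str], run: int = 3) -> list[int]:
    """Return the start index of every run of `run` consecutive
    sentences that begin with the same word (case-insensitive)."""
    firsts = []
    for sentence in sentences:
        words = sentence.split()
        # Strip trailing punctuation so "First," and "First" compare equal.
        firsts.append(words[0].lower().strip(",;:") if words else "")

    hits = []
    for i in range(len(firsts) - run + 1):
        window = firsts[i:i + run]
        if window[0] and all(word == window[0] for word in window):
            hits.append(i)
    return hits
```

It will not catch repeated structures that start with different words, which is one more reason to keep reading the script aloud.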

Build rhythm with sentence length

Rhythm is the difference between a script that informs and one that holds attention. The mechanism is simple: vary your sentence lengths deliberately. A long, flowing sentence that carries the listener through a connected chain of ideas can be followed by something short. Something punchy. That contrast creates a sense of momentum and emphasis without any markup at all. If you read your script and every sentence is roughly the same length, the delivery will feel monotonous no matter how good the voice is.
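If you want a quick number to go with your ear, you can measure how much sentence lengths vary. The sketch below reports word counts per sentence along with their mean and standard deviation; treating a low spread as a warning sign is our assumption, and no particular threshold is implied.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word count of each sentence, using a naive punctuation split."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]


def rhythm_report(text: str) -> dict:
    """Summarize sentence-length variety; a spread near zero suggests a flat rhythm."""
    lengths = sentence_lengths(text)
    if not lengths:
        return {"lengths": [], "mean": 0.0, "spread": 0.0}
    return {
        "lengths": lengths,
        "mean": statistics.mean(lengths),
        "spread": statistics.pstdev(lengths),
    }
```

A script whose lengths look like 9, 1, 1 has the long-then-punchy contrast described above; one that reads 12, 12, 13, 12 probably needs a short sentence dropped in.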

Paragraph breaks matter as well, even though the listener cannot see them. Treat each paragraph as a single thought that deserves a beat of silence before the next begins. When you generate, that structure gives you natural seams to insert slightly longer pauses, which is how you signal "we are moving to a new point" in audio. A script that is one undifferentiated block of text invites the voice to plough straight through with no breathing room.

Numbers, names, and acronyms

This is where most scripts quietly break. Synthetic voices are good but not psychic, and ambiguous text is where they guess wrong. Spell out how you want numbers spoken: the year 2026 can be read as "twenty twenty-six" or "two thousand twenty-six," and "$1.5M" could be voiced half a dozen ways. If a figure matters, write it the way you want to hear it. The same goes for dates, times, phone numbers, and ranges: decide the spoken form and put it directly in the script rather than hoping the model matches your intent.
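To make the choice of spoken form explicit, you can generate it yourself instead of leaving the digits in the script. The sketch below handles whole numbers from 0 to 9999 only and offers two readings: a plain quantity and a year read as two pairs. The function names are ours, and the "oh" reading for years like 2005 is one convention among several.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def cardinal(n: int) -> str:
    """Spell out 0-9999 as a quantity: 2026 -> 'two thousand twenty-six'."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0-9999")
    if n < 100:
        return two_digits(n)
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)


def year_pairs(n: int) -> str:
    """Read a four-digit year as two pairs: 2026 -> 'twenty twenty-six'."""
    if not 1000 <= n <= 9999:
        raise ValueError("this sketch only covers four-digit years")
    if n % 1000 == 0:
        return cardinal(n)  # 2000 -> 'two thousand'
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"  # 1900 -> 'nineteen hundred'
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"  # 2005 -> 'twenty oh five'
    return f"{two_digits(high)} {two_digits(low)}"
```

Whichever reading you pick, paste the resulting words into the script so the voice never has to guess.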

Acronyms need an explicit decision: do you want them spelled out letter by letter ("A. P. I.") or pronounced as a word ("NASA")? Brand names and unusual proper nouns are the other landmine. If your company is named in a way that does not follow English spelling rules, write a phonetic respelling the first time it appears and confirm the render before you commit to a long script. Five minutes of pronunciation testing up front saves you from regenerating a finished ninety-second piece because the brand name landed wrong.
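Once you have settled on a spoken form for each acronym and brand name, a small lookup table keeps those decisions consistent across a long script. This is a sketch using plain text substitution; the entries are hypothetical ("Xyloq" is a made-up brand), and your voice engine may also offer its own pronunciation features that work better.

```python
import re

# Hypothetical entries: written form -> the respelling you want voiced.
SPOKEN_FORMS = {
    "API": "A P I",
    "Xyloq": "ZY-lock",
}


def apply_spoken_forms(script: str, lexicon: dict[str, str] = SPOKEN_FORMS) -> str:
    """Replace each whole-word lexicon key with its spoken form."""
    if not lexicon:
        return script
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in lexicon) + r")\b"
    )
    return pattern.sub(lambda match: lexicon[match.group(1)], script)
```

Matching is case-sensitive and whole-word only, so "APIs" is left alone; add plural forms as separate entries if you need them, and still confirm each respelling by ear before committing.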

Direct the delivery with SSML

Once the words are right, SSML (Speech Synthesis Markup Language) lets you direct how they are performed. Think of it as stage directions. The most valuable tags are also the simplest. Use the break element to insert a deliberate pause before a key phrase to set it up, or after a list item so each point can land. Use the emphasis element to mark the one or two words in a sentence that carry the argument, and only those. The most common mistake we see is over-marking: a script with emphasis on every adjective sounds like a hard-sell infomercial, not a trusted narrator.
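As a concrete illustration, here is one pause and one emphasized word in a two-sentence line. The sketch builds the markup with Python's standard XML library so the result is always well-formed. The break and emphasis elements come from the W3C SSML specification, but the 400ms duration and the sample sentence are our own choices, and engines differ in which tags and attributes they honor, so check your provider's documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build a short SSML line with one pause and one emphasized word."""
    speak = ET.Element("speak")
    speak.text = "Most teams blame the model."

    # A deliberate pause that sets up the key phrase.
    pause = ET.SubElement(speak, "break", {"time": "400ms"})
    pause.tail = " The real problem is the "

    # Emphasis on the single word that carries the argument.
    strong = ET.SubElement(speak, "emphasis", {"level": "strong"})
    strong.text = "script"
    strong.tail = "."

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml())
```

Note that only one word is marked. If you find yourself adding a second or third emphasis to the same sentence, that is usually a cue to rewrite the sentence instead.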

Pacing is your third lever. Slow the rate down for numbers, names, and anything the listener might need to remember, and let it move a little faster through transitional phrases that connect ideas but carry little new information. Even a slight variation in rate across a paragraph is what separates a flat read from a performance. The guiding principle for all of it: SSML should encode how you would naturally say the line, not impose an artificial cadence on top of it. If you cannot perform a line convincingly yourself, no amount of markup will rescue it — rewrite the sentence first, then direct it.
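Rate changes can be sketched the same way. The helper below wraps a phrase in SSML's prosody element with a rate attribute; "slow" and "fast" are among the values the W3C specification defines, though how strongly a given engine applies them varies. The helper names and the sample phone number are illustrative assumptions.

```python
from xml.sax.saxutils import escape, quoteattr


def with_rate(text: str, rate: str) -> str:
    """Wrap text in a prosody element, escaping XML special characters."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def speak(*parts: str) -> str:
    """Join already-marked-up parts into a single speak document."""
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    line = speak(
        with_rate("If you have questions, just", "fast"),
        with_rate("call five five five, zero one zero zero.", "slow"),
    )
    print(line)
```

The transitional phrase moves quickly and the number slows down, which mirrors how most people would say the line unprompted.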

Punctuation is your cheapest tool

Before you reach for markup at all, remember that punctuation already shapes delivery. Commas, periods, dashes, and ellipses each produce a different length and character of pause, and a well-punctuated sentence often needs no SSML whatsoever. A dash creates a sharper break than a comma; an ellipsis trails off where a period stops cleanly. Use these intentionally. Often the fastest way to fix an awkward render is not to add a tag but to move a comma, split a sentence, or turn a clause into its own line.

Test small, then scale

Do not write a thousand words, generate the whole thing, and only then discover that a recurring term is mispronounced or the pacing drags. Work in passes. Generate the first paragraph, listen critically, and fix the script. Once the opening sounds right, the voice and the conventions you have established usually carry through the rest with far fewer surprises. This iterative loop is faster than it sounds, and it is how professional audio teams work whether the performer is a person or a model.
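The loop can be sketched as a function that renders one paragraph at a time and stops at the first one you reject. Both callables here are placeholders you would supply: `synthesize` stands in for whatever call your voice provider exposes, and `approve` stands in for you listening and deciding. Neither is a real API.

```python
from typing import Callable, Optional


def split_paragraphs(script: str) -> list[str]:
    """Split a script on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in script.split("\n\n") if p.strip()]


def render_in_passes(
    script: str,
    synthesize: Callable[[str], bytes],     # hypothetical: text -> audio
    approve: Callable[[str, bytes], bool],  # hypothetical: your listening check
) -> tuple[list[bytes], Optional[int]]:
    """Render paragraph by paragraph, stopping at the first rejected one.

    Returns the approved audio so far and the index of the paragraph
    that needs a script fix (None if every paragraph was approved).
    """
    rendered = []
    for index, paragraph in enumerate(split_paragraphs(script)):
        audio = synthesize(paragraph)
        if not approve(paragraph, audio):
            return rendered, index
        rendered.append(audio)
    return rendered, None
```

Stopping early means you only pay for renders up to the first problem, fix the script there, and rerun.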

When you listen back, listen for specific failure modes rather than a vague sense of "off." Is the emphasis landing on the right word? Are the pauses doing real work, or are they just silence? Does the energy match the moment — calm where it should be calm, bright where it should be bright? Naming the problem tells you exactly which lever to pull, and keeps revision from turning into aimless re-rolling.

A pre-generation checklist

Before you press generate on anything longer than a few lines, run through five questions. Have you read every sentence out loud? Is each one a single comfortable breath? Have you spelled out numbers, dates, and acronyms the way you want them spoken? Have you tested any unusual names or brand terms in isolation? And have you reserved emphasis and pauses for the few places that genuinely earn them? If you can answer yes to all five, the model has everything it needs to give you a take that sounds intentional rather than generated.
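The third question is the easiest to automate. As a rough pre-flight, the sketch below lists any token that still contains a digit and any all-caps word of two or more letters, since those are the places where the spoken form has not been decided yet. The patterns are simple heuristics of our own and will produce some false positives.

```python
import re


def preflight(script: str) -> dict[str, list[str]]:
    """List tokens whose spoken form is still ambiguous."""
    return {
        # Anything containing a digit: years, prices, times, phone numbers.
        "digits": re.findall(r"\S*\d\S*", script),
        # All-caps words of two or more letters: likely acronyms.
        "acronyms": re.findall(r"\b[A-Z]{2,}\b", script),
    }


if __name__ == "__main__":
    print(preflight("We raised $1.5M in 2026 for our API."))
```

An empty report does not mean the script is ready; it only means the remaining four questions are the ones left to answer by ear.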

The encouraging part is that none of this requires audio engineering or a background in voice acting. It is writing — a particular kind of writing aimed at the ear instead of the eye. Teams that internalize these habits stop fighting the model and start directing it, and the gap between "obviously synthetic" and "wait, was that AI?" closes faster than almost anyone expects. Treat the script as the performance, and the render will follow.