Voice drift: what AI marketing tools forget to measure

Most AI marketing tools claim "brand voice." Almost none of them measure it after the model finishes writing.

This post is about a metric we use to score how far a generated draft has drifted from a brand's actual voice: what it is, what assumptions it rests on, and why we built it the way we did. We are not yet publishing aggregate numbers across our user base, because that takes a sample size we have not reached. But the model itself should not be proprietary, and we have not seen anyone else in the AI marketing tool space talking about it.

What voice drift actually is

Voice drift is the gap between what an AI tool generated and what your brand actually sounds like.

Every AI writing tool you have ever used either trains on examples of your voice or takes voice instructions in the prompt. Either way, the question that almost nobody asks is: did the model hold the voice all the way to the end of the draft, or did it slip back to its default?

The honest answer, most of the time, is that it slipped.

You see it in the cliches that appear in the third paragraph but not the first. You see it in carousels where slide 1 sounds like the brand and slide 7 sounds like the model. You see it in long-form posts that open in a distinctive register and end in a generic one. None of this gets caught by prompts that say "use my brand voice." Prompts decay across context. Voice training decays across paragraphs. What you need is a measurement that runs after generation finishes — a check, not an instruction.

The metric

We define voice drift as the cosine distance between a brand's voice fingerprint and a newly generated draft, measured in the same embedding space.

A drift score of 0.00 means the draft sits exactly inside the brand's voice cluster — indistinguishable from things the brand has actually published. A score of 1.00 means it sits in a completely unrelated stylistic neighborhood — recognizably not you.
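
To make the definition concrete, here is a minimal sketch of the calculation in Python. It assumes the fingerprint is the mean of the embeddings of a brand's past posts, which is one common choice and not necessarily how Marqeting builds it. The function names are hypothetical and the vectors are tiny made-up examples, not real embeddings.

```python
import math

def mean_vector(vectors):
    """Average equal-length vectors (assumed fingerprint: the centroid of past posts)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for the same direction, 1.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def voice_drift(past_post_embeddings, draft_embedding):
    """Drift of one draft from the brand fingerprint, in the same embedding space."""
    fingerprint = mean_vector(past_post_embeddings)
    return cosine_distance(fingerprint, draft_embedding)

# Toy 3-dimensional "embeddings", made up for illustration only.
past_posts = [[1.0, 0.0, 0.0], [1.0, 0.2, 0.0], [1.0, -0.2, 0.0]]
same_voice = [2.0, 0.0, 0.0]   # points the same way as the fingerprint
other_voice = [0.0, 0.0, 1.0]  # orthogonal to the fingerprint

print(voice_drift(past_posts, same_voice))
print(voice_drift(past_posts, other_voice))
```

Cosine distance can in principle reach 2.0 for vectors that point in opposite directions. The 0.00 to 1.00 scale described above covers the range from identical direction to orthogonal.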

Our published thresholds, used in the Marqeting drift indicator:

Band	Score	What to do
In voice	0.00 – 0.18	Ship it.
Borderline	0.18 – 0.32	Rewrite the worst sentence, then ship.
Drift	0.32+	Do not ship without rewriting.

The thresholds were calibrated against human reviewers reading the same drafts. They mark the boundary where reviewers stopped describing a draft as "sounds like us" and started describing it as "sounds like AI." If you build something on top of this metric, calibrate against your own reviewers — the cutoffs will shift slightly by brand.
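
If you want to reproduce the banding in your own review tooling, a sketch might look like the following. The table lists 0.18 and 0.32 in two bands each, so this version assumes a score exactly on a boundary falls into the stricter band. That tie-breaking rule and the function name are assumptions, not something the thresholds above specify.

```python
# Cutoffs taken from the table above. Recalibrate against your own reviewers.
IN_VOICE_MAX = 0.18
BORDERLINE_MAX = 0.32

def drift_band(score):
    """Map a drift score to a band. Boundary scores go to the stricter band (assumption)."""
    if score < 0:
        raise ValueError("drift scores are expected to be non-negative")
    if score < IN_VOICE_MAX:
        return "in voice"
    if score < BORDERLINE_MAX:
        return "borderline"
    return "drift"

for example in (0.10, 0.25, 0.40):
    print(example, drift_band(example))
```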

The three assumptions baked into the metric

The interesting part of any metric is the assumptions it rests on. Ours rests on three.

1. Drift is categorical, not gradual

We do not believe drift is a smooth distribution where every draft is a little bit off. We believe most drafts either hold the voice or lose it — that the borderline band between in voice and drift is narrower than the in-voice and drift bands themselves.

This is why the Marqeting drift indicator is a green/yellow/red badge, not a continuous score. Reviewers do not need to triage a 0.21 vs a 0.23. They need to know "this one passed, this one needs attention, this one is ship-blocking." Treating drift as a continuous gradient implies a precision the metric does not have, and it pushes reviewers toward optimizing a number instead of fixing the writing.

We are logging the full distribution on every generation and will publish the shape once the sample size justifies it.
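
One simple way to check this assumption against logged scores is to compare the share of drafts that land in each band. The sketch below does that with the published cutoffs. The scores are made up for illustration and say nothing about real Marqeting data, and the function name is hypothetical.

```python
from bisect import bisect_right
from collections import Counter

BAND_EDGES = [0.18, 0.32]  # cutoffs from the thresholds table
BAND_NAMES = ["in voice", "borderline", "drift"]

def band_shares(scores):
    """Fraction of scores in each band. A score on a cutoff counts in the higher band."""
    if not scores:
        raise ValueError("need at least one score")
    counts = Counter(BAND_NAMES[bisect_right(BAND_EDGES, s)] for s in scores)
    return {name: counts[name] / len(scores) for name in BAND_NAMES}

# Made-up scores, for illustration only.
toy_scores = [0.05, 0.09, 0.12, 0.15, 0.21, 0.40, 0.45, 0.52]
print(band_shares(toy_scores))
```

If the categorical assumption holds, the borderline share should stay small relative to the other two bands as the sample grows.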

2. Drift compounds with length

We expect long-form drafts to drift more than captions, holding everything else constant.

The intuition behind this is widely shared: the first sentence of a generation can imitate almost any voice, because the prompt makes up nearly all of the context the model is attending to. By the third paragraph the model tends to revert to its defaults, because the prompt has been diluted by everything generated since. This is not Marqeting-specific. It is consistent with how attention is spread across a growing context in transformer-based models.

The implication for content tools is straightforward: long-form needs voice scoring more than captions do, not less. It also means a brand voice that holds across a 100-word LinkedIn post can completely collapse by paragraph four of a blog post. Same voice training, different drift outcomes.
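
A way to see this effect in a single draft is to score each paragraph separately and find where the voice first breaks. The sketch below assumes you already have one embedding per paragraph and a fingerprint vector in the same space. The helper names and the toy 2-dimensional vectors are hypothetical, chosen only so the example runs without an embedding model.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def paragraph_drift(fingerprint, paragraph_embeddings):
    """Drift score for each paragraph, in document order."""
    return [cosine_distance(fingerprint, p) for p in paragraph_embeddings]

def first_drifting_paragraph(drifts, threshold=0.32):
    """1-based index of the first paragraph at or above the drift cutoff, else None."""
    for index, score in enumerate(drifts, start=1):
        if score >= threshold:
            return index
    return None

# Toy vectors, made up so that each paragraph sits further from the fingerprint.
fingerprint = [1.0, 0.0]
paragraphs = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.5], [0.5, 1.0]]

drifts = paragraph_drift(fingerprint, paragraphs)
print([round(d, 3) for d in drifts])
print(first_drifting_paragraph(drifts))
```

Scoring per paragraph rather than per draft is a design choice: a single whole-draft score can average a strong opening against a generic ending and hide the slip.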

3. Drift is brand-correlated, not user-correlated

Our model assumes the variance in drift scores will be dominated by which brand the draft was generated for, not which user hit the generate button or how careful their prompt was.

A brand with a distinctive voice trained on 15+ past posts gives the model a fingerprint specific enough to hold across paragraphs. A brand trained on three short samples gives the model an under-specified target, and the model fills the gap with its defaults. The careful prompter on the second brand still gets drift. The sloppy prompter on the first brand mostly does not.

This is the most actionable assumption in the set: if drift scores come back high, the right intervention is more training samples, not stricter prompt discipline.
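
Once scores are logged with a brand and a user attached, this assumption can be tested by asking how much of the variance each grouping explains. The sketch below computes that share as between-group sum of squares over total sum of squares (often called eta squared). The record layout, the function name, and the four toy records are all assumptions made for illustration, not real data.

```python
from collections import defaultdict
from statistics import mean

def variance_share(records, key):
    """Share of total score variance explained by grouping records on `key`."""
    scores = [r["score"] for r in records]
    grand_mean = mean(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    if total_ss == 0:
        return 0.0  # every score is identical, so nothing to explain
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["score"])
    between_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    return between_ss / total_ss

# Made-up records: two users who each generate for two brands.
records = [
    {"brand": "A", "user": "u1", "score": 0.10},
    {"brand": "A", "user": "u2", "score": 0.12},
    {"brand": "B", "user": "u1", "score": 0.40},
    {"brand": "B", "user": "u2", "score": 0.42},
]

print("brand share:", round(variance_share(records, "brand"), 3))
print("user share:", round(variance_share(records, "user"), 3))
```

In this toy data the brand grouping explains almost all of the variance and the user grouping almost none, which is the pattern the assumption predicts. Real logs would need enough brands and users for the comparison to mean anything.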

What voice scoring catches that prompts do not

Prompt instructions like "avoid AI cliches" are a known-failed approach. They work for one draft, on a good day. They reliably degrade across long-form. They are easy to write, easy to ship, and they do almost nothing in practice.

Voice scoring catches what prompts miss because it runs against the brand's own past posts — not against a list of banned phrases. If your brand has never used the phrase "in today's fast-paced world," a drift score against your fingerprint will flag a draft that opens that way. The model does not need to know the phrase is bad. It only needs to know your brand has never written it.

Some of the categories voice scoring catches consistently:

Generic openers — "In today's world," "We live in a world where," "Picture this," "Let's face it"
Excitement boilerplate — "thrilled to announce," "incredibly excited," "huge milestone"
Hedge stacking — "It is worth noting that, generally speaking, in many cases…"
Buzzword density — "leverage," "synergy," "game-changer," "next-level"
Em-dash overuse — patterns of three or more em-dashes in adjacent sentences

Prompt-based filters and banned-phrase lists can be defeated by paraphrase. Voice scoring against the brand's actual past output is not defeated that way, because it does not work by string matching.
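
The weakness of string matching is easy to demonstrate. The sketch below is a deliberately naive banned-phrase filter with a hypothetical phrase list and function name. It catches the exact cliche and misses a light paraphrase of it. The embedding-based comparison is not shown here because it needs an embedding model.

```python
# Hypothetical banned-phrase list, for illustration only.
BANNED_PHRASES = ["in today's fast-paced world", "thrilled to announce"]

def banned_phrase_hits(text, banned=BANNED_PHRASES):
    """Return the banned phrases that appear verbatim (case-insensitive) in text."""
    lowered = text.lower()
    return [phrase for phrase in banned if phrase in lowered]

original = "In today's fast-paced world, brands need to move quickly."
paraphrase = "In our fast-moving modern world, brands need to move quickly."

print(banned_phrase_hits(original))    # the exact phrase is caught
print(banned_phrase_hits(paraphrase))  # the paraphrase slips through
```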

Three things this implies in practice

Train your voice on more than three samples. Three is enough to start. Fifteen is what holds across long-form. If your drift scores come back high, this is almost always the lever.

Score every draft. Do not rely on prompts. Voice instructions in the prompt fade by paragraph two. A drift score evaluated on the output catches the cases where the prompt did not stick. Build the score into your review surface, not your prompt engineering.

Treat the borderline band as a rewrite signal, not a publish signal. Drafts in the 0.18 – 0.32 range look shippable, and they pass casual review. They are also where slow voice erosion happens: nobody is going to notice one borderline draft, but twenty borderline drafts in a row reset the audience's sense of what your brand sounds like.

What we are publishing next

We are logging drift scores on every generation across the Marqeting user base. When we cross 1,000 drafts across 10+ distinct brand voices, we will publish the full distribution: how often drafts cluster in voice vs drift, how length correlates with drift in practice, and which cliches voice scoring catches that prompts do not.

If you want the update when it lands, start a free Marqeting account and we will send it.
