One Prompt Can’t Measure Your AI Visibility

Six hundred people sat down to ask AI tools the same kind of question — recommend headphones for a family member who travels. They wrote it 142 different ways. When researchers measured how similar those prompts were to each other, the average semantic similarity came back at 0.081, on a scale where 1.0 means identical and 0 means unrelated.

In other words, people asking for the exact same thing barely phrase it alike. And the AI tools answered each version differently.

That finding comes from a study by Rand Fishkin of SparkToro and Patrick O’Donnell of Gumshoe, who recruited 600 volunteers to run 12 prompts across ChatGPT, Claude, and Google AI, collecting 2,961 responses in late 2025. Their headline number is the one that should worry anyone tracking AI visibility: there’s less than a 1-in-100 chance an AI tool returns an identical list of brands across repeated queries, and roughly a 1-in-1,000 chance it returns the same list in the same order (SparkToro).

So here’s the uncomfortable part. If your AI visibility report is built on one prompt, checked once, you aren’t measuring your visibility. You’re measuring a single roll of the dice.

These are probability engines, not search engines

The instinct most teams carry over from SEO is to treat an AI answer like a ranking. You ask ChatGPT “best CRM for startups,” you see your brand listed third, and you log it: position 3. Next week you’re position 5 and someone calls an emergency meeting.

That whole exercise is broken, and not because the tools are buggy. They’re working as designed. A language model samples from a distribution of likely tokens every time it generates text. Variation isn’t a glitch. It’s the mechanism. Fishkin and O’Donnell landed on the right word for it: these are “probability engines,” and chasing a precise ranking position inside one is, in their phrase, “a fool’s errand.”
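To see why variation is the mechanism rather than a bug, here is a toy simulation. It is not a real language model: the brand names and weights are invented, and `sample_brand_list` is a hypothetical helper. It only illustrates what weighted random sampling does to a short list when you repeat the same request many times.

```python
import random
from collections import Counter

# Hypothetical brands and weights, invented for illustration only.
BRAND_WEIGHTS = {"A": 5.0, "B": 3.0, "C": 2.0, "D": 1.5, "E": 1.0, "F": 0.5}

def sample_brand_list(rng, k=3):
    """Draw k distinct brands; each pick is weighted by the remaining weights."""
    remaining = dict(BRAND_WEIGHTS)
    picked = []
    for _ in range(k):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        choice = rng.choices(names, weights=weights, k=1)[0]
        picked.append(choice)
        del remaining[choice]  # sample without replacement
    return tuple(picked)

rng = random.Random(0)  # fixed seed so the run is repeatable
runs = [sample_brand_list(rng) for _ in range(1000)]

ordered = Counter(runs)                           # same brands, same order
unordered = Counter(frozenset(r) for r in runs)   # same brands, any order
appearance = Counter(b for r in runs for b in r)  # runs containing each brand

print("distinct ordered lists:", len(ordered))
print("distinct unordered lists:", len(unordered))
print("share of runs containing A:", appearance["A"] / len(runs))
```

With these made-up weights, the heaviest brand shows up in most runs while the exact list keeps changing. That is the same shape the studies describe: a stable rough hierarchy with volatile specifics.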

This is different from the problem of AI answers drifting over time, where a model updates and last month’s answer goes stale. This is variation at a single moment, from a single model, driven by nothing more than how the question was worded and the randomness baked into generation. Both problems are real. This one is sneakier, because it hides inside data that looks clean.

Mid-tier brands get whipsawed the hardest

Here’s where it gets practical. The variation isn’t evenly distributed — it lands almost entirely on the brands fighting for the middle of the list.

A separate study from Rankshift tested seven near-identical prompts about CRM software for small and medium businesses, swapping only words like “best” versus “top” versus “leading,” and “tools” versus “platforms” versus “software.” They ran each phrasing 1,176 times and watched what moved (Rankshift).

The market leaders didn’t budge. HubSpot held a 98% visibility score across every variation. But the challengers swung hard:

Brand	Visibility range	Swing
HubSpot	98% (flat)	~0 points
Salesforce	78–91%	13 points
Monday	59–78%	19 points

Change one adjective and Monday’s measured presence moves 19 percentage points. Nothing about Monday changed. Nothing about its content, its reviews, or its market position changed. The only thing that moved was the word in the prompt.
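The swing column is simply the high end of each range minus the low end. A minimal sketch of that arithmetic, using only the three ranges reported in the table (the `swing` helper is a hypothetical name):

```python
# Visibility ranges (low, high) in percent, copied from the table above.
ranges = {"HubSpot": (98, 98), "Salesforce": (78, 91), "Monday": (59, 78)}

def swing(low, high):
    """Percentage-point gap between the best and worst phrasing."""
    return high - low

swings = {brand: swing(low, high) for brand, (low, high) in ranges.items()}

# Print the most volatile brands first.
for brand, points in sorted(swings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{brand}: {points} points")
```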

Rankshift found that 8 of the top 10 brands stayed positionally locked, with the rough pecking order holding, even as the visibility percentages underneath shifted by double digits for challengers such as Salesforce and Monday. That combination is the trap. The order looks stable enough to trust, so teams trust the numbers attached to it. But for any brand that isn’t the runaway category leader, those numbers are the noisiest part of the report.

If you’re a market leader, you can almost ignore this. If you’re a challenger — which describes most companies paying attention to AI visibility — it’s the whole game. The same dynamic explains why smaller brands can punch above their weight in AI search: the middle of the list is genuinely contestable, and it’s also genuinely volatile.

The sources move even more than the brands

It gets worse one layer down. Rankshift catalogued 22 unique domains that AI cited as sources across those seven prompts. Only three of them showed up in the top ten for every single phrasing. Ten domains appeared for just one prompt and vanished for the rest.

So the citations feeding the answer are even less stable than the brand list the answer produces. If you’re tracking which sources AI pulls from — a reasonable thing to want to know — a single prompt tells you almost nothing reliable. You’re sampling a population by interviewing one person.
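If you do log cited sources per phrasing, the two questions worth asking are which domains appear for every phrasing and which appear for only one. Here is a minimal sketch with placeholder data: the prompts, the `.example` domains, and both function names are assumptions for illustration, not Rankshift’s data or method.

```python
from collections import Counter

# Hypothetical cited-domain sets per prompt phrasing (placeholder data).
citations_by_prompt = {
    "best crm software": {"a.example", "b.example", "c.example"},
    "top crm tools": {"a.example", "c.example", "d.example"},
    "leading crm platforms": {"a.example", "e.example"},
}

def stable_domains(by_prompt):
    """Domains cited for every phrasing (needs at least one prompt)."""
    return set.intersection(*by_prompt.values())

def one_off_domains(by_prompt):
    """Domains cited for exactly one phrasing."""
    counts = Counter(d for domains in by_prompt.values() for d in domains)
    return {d for d, n in counts.items() if n == 1}

print("stable:", sorted(stable_domains(citations_by_prompt)))
print("one-off:", sorted(one_off_domains(citations_by_prompt)))
```

In this placeholder set, only one domain survives every phrasing and three appear once. A single prompt would have shown you three domains with no hint of which kind each one was.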

What this breaks in your current reporting

Run through your AI visibility dashboard and flag anything in this list:

Ranking position as a KPI. “We’re #3 in ChatGPT for [query]” is a vanity metric. The position is a sample, not a standing. Both studies agree position tracking lacks validity.
Screenshot-driven panic. Someone pastes a ChatGPT answer into Slack where your brand is missing. Before anyone reacts, ask how many times that prompt was run and how many phrasings were tested. Usually the answer is one and one.
Single-phrasing tracking. If you monitor “best [category] software” but your buyers ask “top [category] tools,” “[category] platforms for small teams,” and “what [category] should I use,” you’re blind to three-quarters of the question space.
Week-over-week comparisons on thin samples. A 6-point drop means nothing if your sample size is small enough that 6 points is inside the noise band.
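One way to put a number on the noise band is a confidence interval around the share you measured. The sketch below uses the Wilson score interval, which is a standard choice for proportions but is my assumption here, not something either study prescribes. It also treats every run as an independent yes/no draw, which is a simplification.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion hits/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# The same 65% share, measured on a thin sample and on a large one.
for hits, n in [(13, 20), (1300, 2000)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n}: {low:.1%} to {high:.1%}")
```

At 13 appearances in 20 runs, the interval spans roughly 43% to 82%, so a 6-point move is well inside the noise. At 1,300 in 2,000 it narrows to about 63% to 67%, and a 6-point move starts to mean something.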

How to actually measure it

The fix isn’t complicated, but it requires giving up the comfort of a single tidy number in favor of a more honest, messier one. Fishkin and O’Donnell’s own conclusion, after they revised their starting hypothesis, was that “visibility % across dozens to hundreds of prompts run multiple times is a reasonable metric.” That’s the standard to hold yourself to.

Three principles:

1. Build a prompt set, not a prompt. For every topic you care about, write a spread of phrasings that real buyers use — different adjectives, different framings, branded and unbranded, broad and specific. Aim for breadth that mirrors how messy real queries are. One question phrased one way is not a measurement; it’s an anecdote.

2. Sample repeatedly, then track the share. Run each prompt multiple times and record how often your brand appears across the whole set. Your metric is visibility share — the percentage of responses you show up in — not a position number. A brand that appears in 70% of responses is in genuinely better shape than one at 40%, and that gap survives the noise. A brand at “position 3 today” tells you nothing that lasts.

3. Watch the trend on the share, not the spikes on the position. A single answer that drops you is noise. Your visibility share sliding from 65% to 45% over a month across a stable prompt set is a signal worth acting on. Hold your fire for the second one.
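The three principles can be sketched in a few lines. Everything here is illustrative: the brands (Acme, Globex, Initech), the responses, and the helper names are made up, and the whole-word match is a deliberately naive way to detect a mention. The second function is a textbook two-proportion z statistic, offered as one assumed way to check whether a change in share is bigger than sampling noise.

```python
import math
import re
from collections import defaultdict

def mentions(brand, text):
    """True if the brand appears as a whole word, case-insensitively."""
    return re.search(rf"\b{re.escape(brand)}\b", text, re.IGNORECASE) is not None

def visibility_share(responses, brand):
    """responses: iterable of (prompt, response_text) pairs.
    Returns (overall_share, {prompt: share})."""
    hits, totals = defaultdict(int), defaultdict(int)
    for prompt, text in responses:
        totals[prompt] += 1
        if mentions(brand, text):
            hits[prompt] += 1
    total = sum(totals.values())
    if total == 0:
        raise ValueError("no responses to score")
    per_prompt = {p: hits[p] / totals[p] for p in totals}
    return sum(hits.values()) / total, per_prompt

def share_change_z(hits_a, n_a, hits_b, n_b):
    """Two-proportion z statistic for the change from period A to period B."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # identical all-or-nothing shares: no measurable change
    return (hits_b / n_b - hits_a / n_a) / se

# Hypothetical responses: two phrasings, two runs each.
responses = [
    ("best crm software", "Consider Globex, Acme, and Initech."),
    ("best crm software", "Globex and Acme are common picks."),
    ("top crm tools", "Top options: Globex and Initech."),
    ("top crm tools", "Try Globex or acme for small teams."),
]
overall, per_prompt = visibility_share(responses, "Acme")
print(overall, per_prompt)

# A slide from 65% to 45%, measured on 200 runs and on 20 runs per period.
print(round(share_change_z(130, 200, 90, 200), 2))
print(round(share_change_z(13, 20, 9, 20), 2))
```

In the toy data, Acme appears in 3 of 4 responses overall but in only half of the “top crm tools” runs, which a single phrasing would have hidden. For the trend check, the 65% to 45% slide gives a z statistic of about -4 at 200 runs per period and about -1.3 at 20 runs, so the same drop is a clear signal on the larger sample and inconclusive on the smaller one.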

Here’s the contrast in plain terms:

Don’t track	Track instead
Position in one answer	Visibility share across a prompt set
One phrasing of a query	A distribution of real-world phrasings
A single run	Repeated runs, aggregated
This week’s spike	The trend line over weeks

This reframes the discipline. AI visibility isn’t a leaderboard you climb one rung at a time. It’s a probability you raise: the odds that, however a buyer happens to phrase their question, your brand is in the answer. That’s also the cleanest way to think about which GEO (generative engine optimization) metrics deserve a place on the dashboard and which are just SEO habits in new clothing.

The takeaway

The research is consistent across both studies: AI recommendations are stable in their rough hierarchy and volatile in their specifics, and the volatility lands hardest on exactly the brands that most need to know where they stand. Measure with one prompt and you’ll either panic over noise or feel safe when you shouldn’t.

So build the prompt set. Run it often. Track the share, not the rank. The number will be less satisfying than “we’re #2” — and far more likely to be true.
