AI almost never gives the same brand list twice

Ask ChatGPT the same product question twice and you’ll get two different answers. Ask it a third time, a fourth time, all the way out to a hundred — the answers keep changing. The list of brands shifts. The order shifts. The number of items shifts.

For anyone selling “AI ranking” as a metric, that’s a problem.

In late 2025, Rand Fishkin at SparkToro teamed up with Patrick O’Donnell at Gumshoe to run the kind of stress test the AI visibility industry has mostly avoided. They asked 600 volunteers to run 12 prompts through ChatGPT, Claude, and Google’s AI a combined 2,961 times. Same prompts, fresh sessions, default settings.

The result they published in January 2026: any two runs of the same prompt had less than a 1-in-100 chance of producing the same list of brands, and less than a 1-in-1,000 chance of producing the same list in the same order.

If your AI visibility report ranks your brand at position three this morning and position seven tomorrow, that isn’t movement. That’s noise.

The study, in detail

Fishkin and O’Donnell didn’t test trivia. They tested the kind of questions marketers actually care about: “best CRM for small business,” “top creative agencies,” “best science fiction novels of the decade.” Twelve prompts in total, B2C and B2B, narrow and broad. Each prompt ran 60 to 100 times across three of the most-used AI platforms.

They didn’t change temperature settings. They didn’t manipulate context. The runs differed only in who typed them and when.

The numbers across all three platforms:

Measurement	Result
Probability of identical brand list across two runs	< 1%
Probability of identical list in identical order	< 0.1%
Variability in number of items returned	2 to 10+ per response
Semantic similarity of “equivalent” prompts written by different volunteers	0.081

That last row is the easy one to skip past. It’s the most important one in the study.

When the researchers asked volunteers to each write their own prompt for the same question, the 142 variants came back so different that the semantic similarity score averaged 0.081 out of 1.0. Fishkin’s team described it as comparing “Kung Pao Chicken and Peanut Butter.”

Real users don’t all type the same prompt. The “what’s your AI ranking” question assumes a canonical query that doesn’t exist in the wild.
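If you want to check the first two rows of that table against your own logs, the calculation is small. The sketch below is not the study's code. It assumes you have already extracted a list of brand names from each response (the `runs` argument is a hypothetical input format), and it compares every pair of runs for the same prompt.

```python
from itertools import combinations


def pairwise_agreement(runs: list[list[str]]) -> tuple[float, float]:
    """Return (same_set_rate, same_order_rate) across all pairs of runs.

    `runs` is an assumed input format: one list of brand names per
    response, in the order the response listed them.
    """
    # Normalize casing and whitespace so "Acme" and "acme " match.
    cleaned = [[b.strip().lower() for b in run] for run in runs]
    pairs = list(combinations(cleaned, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    # Same brands, any order.
    same_set = sum(set(a) == set(b) for a, b in pairs)
    # Same brands in exactly the same order.
    same_order = sum(a == b for a, b in pairs)
    return same_set / len(pairs), same_order / len(pairs)
```

Run it on the responses collected for one prompt on one platform. If both rates are close to zero, any single response is a poor stand-in for the rest.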

Why this breaks position tracking

The AI visibility tool category is full of dashboards that show “ranking position” — your brand was #3 yesterday, #5 today, here’s a sparkline. Most of those dashboards run a single query per day per prompt and report the position they see.

After SparkToro’s data, that approach is dead.

A single-run rank is a coin flip with extra steps. Run the same prompt twice and the brand that ranked #3 may not appear at all. Run it a hundred times and you get a real signal: how often does the brand show up?

That distinction matters because the buyer experience is the aggregate, not the individual run. If your brand appears in 40 of 100 runs of a category prompt, roughly 40% of buyers asking that question see your name. The order they see it in varies. The presence is stable. Frequency is what travels.
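The arithmetic behind "40 of 100 runs" is worth writing down, because it makes clear that position never enters the calculation. This is a minimal sketch, assuming each run has already been reduced to a list of brand names; the function name and input shape are illustrative, not taken from the study.

```python
from collections import Counter


def visibility_percentage(runs: list[list[str]]) -> dict[str, float]:
    """Share of runs (0 to 100) in which each brand appears at least once."""
    if not runs:
        return {}
    counts: Counter[str] = Counter()
    for brands in runs:
        # A set counts each brand once per run, whatever its position
        # and however many times the response repeats it.
        counts.update({b.strip().lower() for b in brands})
    return {brand: 100.0 * n / len(runs) for brand, n in counts.items()}
```

A brand that shows up first in one run and last in another contributes the same amount both times. That is the point: presence is counted, order is ignored.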

This is what we mean by citation velocity over legacy authority — the metric that matters is the rate of appearance, not the position on any one day.

What’s actually stable

The interesting part of the SparkToro write-up isn’t that AI is noisy. Everyone knew that. The interesting part is what stayed stable across the chaos.

Three things held up across thousands of runs:

Visibility percentage. The share of runs in which a brand appeared. This number changed slowly across the study window and tracked roughly with the brand’s prominence in the category.
Category-level patterns. Narrower categories — Los Angeles Volvo dealerships, SaaS cloud providers — clustered around the same handful of brands across most runs. Broader categories like creative agencies or science fiction novels fragmented across hundreds of names.
Top-of-mind brands. In every category tested, two or three brands appeared in the majority of runs. The next 10 fluctuated. The tail was effectively random.

The pattern fits our earlier piece on how AI answers change over time, which found similar drift even on factual questions with verifiable answers. The variability is structural, not a bug to be solved.

The narrow vs broad category trap

There’s a second finding from the study that has direct implications for how teams budget AI visibility work. The smaller the category, the more stable the recommendations. The bigger the category, the more random the results.

For a regional service business with 20 real competitors, AI answers feel deterministic. The same five or six names show up most of the time. A team in that category can run 30 queries and have high confidence in the picture.

For a national B2B SaaS company in a 500-vendor space, the same approach is useless. Thirty queries are nowhere near enough to separate signal from noise. You need hundreds of runs across personas and platforms to get a stable read.

The implication for budgeting:

Category type	Sampling needed for stable read	What “stable” means
Narrow / regional	30–50 runs per prompt	Top 5 brands clearly visible
Mid-size / vertical SaaS	100–200 runs per prompt	Top 10–15 brands measurable
Broad / consumer goods	300+ runs per prompt	Reliable share-of-voice estimates

Most AI visibility tools haven’t internalized this. They sell the same sampling cadence to a five-competitor home services brand and a 500-competitor consumer brand. The first one gets a clean signal. The second one gets noise dressed up as a dashboard.
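One way to see why run counts matter is to put a confidence interval around a visibility percentage. The sketch below uses the Wilson score interval for a proportion. It is a standard statistical tool, not something the SparkToro study prescribes, and it assumes each run is an independent draw, which is a simplification. The run counts in the table above are our own guidance and are not derived from this formula.

```python
import math


def wilson_interval(appearances: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the true appearance rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0 or not 0 <= appearances <= runs:
        raise ValueError("need runs > 0 and 0 <= appearances <= runs")
    p = appearances / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same observed 40% visibility, two different sample sizes.
small = wilson_interval(12, 30)
large = wilson_interval(120, 300)
```

With 12 appearances in 30 runs the interval spans roughly 25% to 58%. With 120 in 300 it narrows to roughly 35% to 46%. The observed rate is the same; only the larger sample supports a claim about it.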

What to actually track

Translate the study into a measurement plan and you end up with a different set of KPIs than the industry has been pushing.

Drop:

Single-run “ranking position” reports
“What does ChatGPT say about your brand?” screenshots used as a KPI
One-shot competitor comparisons

Keep:

Visibility percentage — share of runs in which your brand appears, by prompt and by platform
Share of voice — your visibility percentage relative to the top competitors in the same prompt set
Frequency over time — the trend line of your visibility percentage across weeks
Platform splits — the same metrics computed separately for ChatGPT, Claude, Gemini, Perplexity, and AI Overviews, because the answers diverge

This is the framework we laid out in GEO metrics that matter, and the SparkToro data is the cleanest external validation of it I’ve seen this year.
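Two of the "keep" metrics can be sketched in a few lines. The share-of-voice definition here is one possible reading (your appearances divided by the combined appearances of you and a named competitor set), so treat it as an assumption and not an industry standard. The record format is also hypothetical.

```python
from collections import defaultdict


def platform_visibility(records: list[tuple[str, list[str]]], brand: str) -> dict[str, float]:
    """Per-platform share of runs (0 to 1) in which `brand` appears.

    `records` is an assumed format: (platform, brands_in_response) per run.
    """
    runs: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    target = brand.strip().lower()
    for platform, brands in records:
        runs[platform] += 1
        if target in {b.strip().lower() for b in brands}:
            hits[platform] += 1
    return {p: hits[p] / runs[p] for p in runs}


def share_of_voice(runs: list[list[str]], brand: str, competitors: list[str]) -> float:
    """Brand appearances divided by appearances of the whole tracked set."""
    tracked = [brand.strip().lower()] + [c.strip().lower() for c in competitors]
    appearances = {name: 0 for name in tracked}
    for brands in runs:
        present = {b.strip().lower() for b in brands}
        for name in tracked:
            if name in present:
                appearances[name] += 1
    total = sum(appearances.values())
    return appearances[tracked[0]] / total if total else 0.0
```

Keeping the platform split as a separate function is deliberate. Pooling every platform into one number hides exactly the divergence the study found.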

The other metric I’d watch: prompt diversity coverage. If volunteers write 142 variants of the same intent at 0.081 average semantic similarity, no single phrasing represents how buyers actually search. A monitoring program that only tracks one phrasing of “best CRM for small business” is missing how most of those buyers are asking the question.

The fix is to seed each tracked intent with a dozen real-world phrasings, not a single textbook one. The variants should come from sales call transcripts, support tickets, and Reddit threads — places where users phrase things their own way — rather than from whatever wording the marketing team thought was clean.
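A quick sanity check on a seed list is to measure how much the phrasings overlap. The study's 0.081 figure is a semantic similarity score; the sketch below swaps in plain word-overlap (Jaccard similarity), which is a much cruder stand-in and will not reproduce that number. It only flags seed lists whose phrasings are near-duplicates of each other.

```python
import re
from itertools import combinations


def token_set(text: str) -> set[str]:
    """Lowercased alphanumeric tokens in a phrasing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def mean_pairwise_jaccard(phrasings: list[str]) -> float:
    """Average word-overlap between every pair of phrasings (0 to 1).

    Lexical stand-in only: it cannot tell that "cheap" and
    "affordable" mean the same thing.
    """
    sets = [token_set(p) for p in phrasings]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two phrasings")
    scores = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return sum(scores) / len(scores)
```

A high score suggests the dozen phrasings are really one phrasing with minor edits. A low score is closer to how the volunteers in the study actually wrote.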

How to operate against the noise

If you’re rebuilding your AI visibility program around the SparkToro findings, three operating decisions matter more than the rest.

First, sample on a recurring cadence, not on demand. The instinct to run a query when a stakeholder asks “where do we rank in ChatGPT” is the instinct that produces a screenshot. Set up a daily or weekly sampling job that executes N runs per prompt across platforms, and report aggregates, not individual outputs.
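The shape of such a job is simple. In this sketch, `ask` is a hypothetical callable you would supply to wrap each platform's client; no real API is assumed. Brand detection is a plain substring match, which is a deliberate simplification and will miss aliases and misspellings.

```python
from typing import Callable


def run_sampling_job(
    prompts: list[str],
    platforms: list[str],
    tracked_brands: list[str],
    ask: Callable[[str, str], str],  # hypothetical: (platform, prompt) -> response text
    runs_per_prompt: int = 50,
) -> dict[tuple[str, str], dict[str, float]]:
    """Return appearance rate (0 to 1) per brand for each (platform, prompt)."""
    if runs_per_prompt <= 0:
        raise ValueError("runs_per_prompt must be positive")
    results: dict[tuple[str, str], dict[str, float]] = {}
    for platform in platforms:
        for prompt in prompts:
            hits = {brand: 0 for brand in tracked_brands}
            for _ in range(runs_per_prompt):
                # Each call should be a fresh session so runs do not share context.
                text = ask(platform, prompt).lower()
                for brand in tracked_brands:
                    if brand.lower() in text:
                        hits[brand] += 1
            # Only the aggregate is kept; individual responses are not reported.
            results[(platform, prompt)] = {
                brand: count / runs_per_prompt for brand, count in hits.items()
            }
    return results
```

Schedule this with whatever you already use for recurring jobs and store the output by date. The report is built from the stored rates, never from a single response.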

Second, run the same prompt set across every monitored platform. The SparkToro study compared ChatGPT, Claude, and Google AI side by side and found that the answers diverged — sometimes drastically. If your tool only reports ChatGPT data, you’re seeing roughly a third of the picture. Most B2B buyers touch two or three AI platforms in a single research session.

Third, report deltas and trends, not absolute positions. “Our visibility on Discovery prompts went from 22% to 31% this quarter” is a number a team can act on. “We ranked #4 yesterday” is a number a team will fight over for an hour and then ignore. We covered the broader framework in our piece on tracking brand visibility in AI search.
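Before reporting a delta, it helps to check whether it is larger than the sampling noise. One common way is a two-proportion z-test, sketched below. This is a standard statistical check that we are suggesting, not part of the study, and it assumes runs are independent. The function name is illustrative.

```python
import math


def visibility_change(
    hits_before: int, runs_before: int, hits_after: int, runs_after: int
) -> tuple[float, float, float]:
    """Return (delta, z, two_sided_p) for a change in appearance rate."""
    if runs_before <= 0 or runs_after <= 0:
        raise ValueError("both periods need at least one run")
    p1 = hits_before / runs_before
    p2 = hits_after / runs_after
    # Pooled rate under the assumption that nothing really changed.
    pooled = (hits_before + hits_after) / (runs_before + runs_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_before + 1 / runs_after))
    z = (p2 - p1) / se if se else 0.0
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p2 - p1, z, p_value


# 22% -> 31% measured on 100 runs per quarter vs 1,000 runs per quarter.
small_sample = visibility_change(22, 100, 31, 100)
large_sample = visibility_change(220, 1000, 310, 1000)
```

With 100 runs per quarter, a move from 22% to 31% does not clear a conventional 0.05 threshold. With 1,000 runs per quarter, the same move does. The delta is only as trustworthy as the sample behind it.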

Why this matters for tool selection

The deeper point in the SparkToro study, and the one most tool vendors are quiet about, is that the entire “rank tracker for AI” pitch is a category mistake. Search engines return largely deterministic ranks for a given query. LLMs return samples from a probability distribution. You measure distributions by sampling them, not by checking the rank once.

The right architecture for AI visibility monitoring isn’t a daily ranking screenshot. It’s a sampling system that runs many variants of each prompt across many platforms over time, then reports frequency and trends, not position.

That’s how we built RivalHound. We run query categories — Discovery, Direct, and Comparison — across the major AI platforms on a recurring basis, then report visibility percentage per category, per platform, with trend lines you can act on. Single-run snapshots are a fast way to lose credibility with the team you’re reporting to. Repeated sampling is the only thing the data supports.

The takeaway

The temptation when AI looks chaotic is to throw up your hands and stop measuring. That’s the wrong move. The data is noisy but not random. There’s a real signal underneath, and brands that learn to extract it — frequency over position, sampling over snapshots, distributions over ranks — will be ahead of the competitors still arguing about a single screenshot.

Ranking position tracking for AI was always going to die. SparkToro and Gumshoe just gave it a death certificate.

If the tool you’re paying for shows you a daily rank and not a visibility percentage, you’re not measuring AI visibility. You’re measuring one coin flip per day.
