Research · May 18, 2026 · 11 min read

Don't Measure Once: GEO Visibility Is a Distribution, Not a Score

AI search is probabilistic — answers vary across runs, engines, and weeks. A single GEO visibility measurement is statistically unreliable. Here's why you have to sample, and how to do it right.

Primarily based on Schulte's "Don't Measure Once: Measuring Visibility in AI Search (GEO)" (April 2026), and its companion paper "Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement" (March 2026).

Papers: arXiv:2604.07585 | arXiv:2603.08924 | arXiv:2604.25707

The Problem: Your GEO Measurement Is Lying, and You Don't Know It

You run a query against ChatGPT. Your brand gets cited — top 1, paragraph 2. You log it in a spreadsheet, run the next one. You compile 30 queries. You draw a conclusion about your "AI visibility."

Here's the catch: if you re-run the exact same 30 queries 24 hours later, you'll get a different result. Not marginally different — significantly different. Some brands disappear. Others appear. Cited sources change. Order changes. Tone changes.

That's the central observation from a wave of academic research published in spring 2026: AI search is not deterministic, and measuring it as if it were produces wrong numbers.

This collides head-on with traditional SEO practice. For 25 years, measuring visibility meant opening a ranking, reading a position, archiving it. Position 3 is position 3. It doesn't become position 7 between two scrolls. Generative search breaks that property — and most GEO tools on the market haven't yet folded it into their measurement methodology.

The Thesis in One Line

GEO performance must be characterized as a distribution, not as a single point.

That's the formulation in the April 2026 reference paper, and it's the idea every other recent piece of work — academic and industrial — corroborates in its own way.

Practically, if you want to answer "what's my citation share on ChatGPT for this query?", the right answer isn't a number. It's a range: "between 18% and 34% (95% confidence interval), measured over N runs across K days." Anything less than that is, statistically, noise dressed up as signal.
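
One way such a range could be computed is sketched below. It assumes each run is logged as a binary outcome (brand cited or not) and uses a Wilson score interval; the function name and the counts are illustrative and are not taken from the papers. It also treats runs as independent, which runs taken on the same day may not be.

```python
import math


def wilson_interval(cited_runs: int, total_runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    p_hat = cited_runs / total_runs
    denom = 1 + z**2 / total_runs
    center = (p_hat + z**2 / (2 * total_runs)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total_runs + z**2 / (4 * total_runs**2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: the brand was cited in 13 of 50 runs.
low, high = wilson_interval(13, 50)
print(f"observed rate {13 / 50:.0%}, 95% interval {low:.0%} to {high:.0%}")

# A single run where the brand was cited: the interval is far too wide to act on.
single_low, single_high = wilson_interval(1, 1)
print(f"one run: {single_low:.0%} to {single_high:.0%}")
```

The second call is the point of the exercise: one observation leaves most of the 0% to 100% range open.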

Three Sources of Variance, and Why They Compound

1. Per-run variance (intra-day)

Same query, same engine, two runs 30 minutes apart: responses differ. The LLMs powering ChatGPT, Gemini, Perplexity or Claude use stochastic sampling (temperature > 0) at generation time. The retrieval pipeline (which pages get fetched before the model answers) introduces a second layer of randomness.

The companion statistical paper (March 2026) quantifies this via bootstrap analysis: across domain-vs-domain comparisons, many apparent differences fall inside the noise floor. In other words: if you observe 22% for brand A and 19% for brand B on a single run, you have not statistically shown that A is ahead of B. You may simply have rolled different dice.
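
The comparison can be sketched with a percentile bootstrap. This is a minimal illustration under assumed data, not the paper's exact procedure: the per-run shares are invented, and `bootstrap_diff_interval` is a hypothetical helper name.

```python
import random


def bootstrap_diff_interval(runs_a, runs_b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(runs_a) - mean(runs_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each brand's runs with replacement, then compare the means.
        sample_a = rng.choices(runs_a, k=len(runs_a))
        sample_b = rng.choices(runs_b, k=len(runs_b))
        diffs.append(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical per-run citation shares for two brands (10 runs each).
brand_a = [0.22, 0.31, 0.15, 0.27, 0.18, 0.25, 0.12, 0.29, 0.20, 0.21]
brand_b = [0.19, 0.24, 0.28, 0.14, 0.22, 0.17, 0.26, 0.13, 0.23, 0.16]

low, high = bootstrap_diff_interval(brand_a, brand_b)
inconclusive = low <= 0.0 <= high  # zero inside the interval: A and B are not separated
print(f"A - B interval: [{low:+.3f}, {high:+.3f}], inconclusive={inconclusive}")
```

When the interval contains zero, the observed gap between the two brands is inside the noise floor for that sample size.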

2. Cross-engine variance

The industry has nailed this down with very clear numbers:

~11% domain overlap between ChatGPT and Perplexity citations on the same queries (Digital Bloom Report, surfaced via Geneo)
~14% URL overlap between Google AI Mode citations and the historical top 10 organic (SE Ranking, surfaced via Geneo)

Translation: ranking well on Perplexity tells you almost nothing about ChatGPT. And ranking well in traditional SEO does not guarantee that you'll appear in AI Mode. These are not three views of the same ranking — they are three different rankings that must be tracked and measured separately.

3. Temporal variance (drift)

Models change. Corpora change. Retrieval pipelines change. A brand that dominates in March may have slipped by May because OpenAI updated its model routing, because a competitor domain published content that got picked up everywhere, or simply because the internal ranker was retrained.

These three variances do not cancel out; they compound. A measurement taken Tuesday morning on Perplexity is not the same as one taken Thursday evening, nor as one taken in the same time slot on ChatGPT, nor as the following week's reading.

The Hidden Structure: Everything Follows a Power Law

The companion paper (arXiv:2603.08924) adds an important finding about the shape of citation distributions: they follow a power law. A small number of domains capture the vast majority of citations; the long tail captures almost nothing.

The industry has confirmed this at scale. An analysis by Trakkr.ai of 1.3 million AI citations spanning 60,209 domains finds exactly the same pattern: citation frequency follows a power law.

Two practical consequences for measurement:

Mean metrics lie. The mean of a power-law distribution is dragged upward by a few dominant domains. If you report "the average domain is cited 4.2 times on this query," you are describing a domain that doesn't exist. Prefer the median and percentiles.
Gaps between brands are huge at the head of the distribution and tiny in the long tail. Moving from rank 80 to rank 60 barely changes your citations. Falling out of the top 20, or entering it, changes everything. The useful unit of measurement is not linear.
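
Both consequences can be checked on a toy distribution. The sketch below assumes a Zipf-like curve in which the domain at rank r receives about 1000 / r citations; the constant and the 200-domain cutoff are arbitrary choices for illustration, not figures from the paper or from Trakkr.ai.

```python
import statistics

# Hypothetical Zipf-like counts: the domain at rank r gets about 1000 / r citations.
counts = [round(1000 / rank) for rank in range(1, 201)]

mean_count = statistics.mean(counts)
median_count = statistics.median(counts)
p90 = statistics.quantiles(counts, n=10)[-1]  # 90th percentile cut point
top20_share = sum(counts[:20]) / sum(counts)

print(f"mean={mean_count:.1f}, median={median_count}, p90={p90:.1f}")
print(f"share of citations held by the top 20 domains: {top20_share:.0%}")

# Non-linearity: the same 20-rank or 10-rank move is worth very different amounts.
tail_gain = counts[59] - counts[79]  # moving from rank 80 to rank 60
head_gain = counts[9] - counts[19]   # moving from rank 20 to rank 10
print(f"rank 80 to 60: +{tail_gain} citations, rank 20 to 10: +{head_gain} citations")
```

On this toy curve the mean sits well above the median, which is why percentiles describe a typical domain better than an average does.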

Citation ≠ Absorption

A third paper, "From Citation Selection to Citation Absorption" (arXiv:2604.25707, April 2026), adds a layer that naive measurement misses: being cited doesn't mean being used.

AI engines distinguish two stages:

Selection — your URL appears in the list of sources the engine consulted
Absorption — your content actually fed the generated answer (the produced sentences lean on what you wrote)

A page can be selected and cited at the bottom of an answer without any of the produced sentences reflecting its content. Conversely, a page can be absorbed and reformulated without an explicit citation. Measuring only URL presence in the citation list misses half the picture.

That's why serious GEO measurement needs at least two layers:

Selection frequency — rate at which your domain appears in citations across N runs
Absorption depth — how much of the generated text reflects your content (lexical overlap, semantic similarity, or explicit in-line citation marking)
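
The two layers can be sketched as follows. Selection frequency is computed as described above; for absorption depth the sketch picks the simplest of the listed options, lexical overlap, measured here as shared word bigrams. That choice, the function names, and the sample data are assumptions for illustration, not the paper's metric.

```python
import re


def selection_frequency(runs: list[list[str]], domain: str) -> float:
    """Share of runs whose citation list contains the domain."""
    if not runs:
        return 0.0
    return sum(domain in cited for cited in runs) / len(runs)


def _bigrams(text: str) -> set[tuple[str, str]]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return set(zip(tokens, tokens[1:]))


def absorption_depth(answer: str, source_text: str) -> float:
    """Fraction of the answer's word bigrams that also occur in the source text."""
    answer_bigrams = _bigrams(answer)
    if not answer_bigrams:
        return 0.0
    return len(answer_bigrams & _bigrams(source_text)) / len(answer_bigrams)


# Hypothetical citation lists from four runs of the same query.
runs = [
    ["example.com", "other.org"],
    ["other.org"],
    ["example.com"],
    ["example.com", "third.net"],
]
freq = selection_frequency(runs, "example.com")

answer = "Sampling each query several times reduces noise."
source = "We found that sampling each query several times gives a stable estimate."
depth = absorption_depth(answer, source)
print(f"selection frequency={freq:.2f}, absorption depth={depth:.2f}")
```

Lexical overlap misses paraphrase, so a semantic-similarity measure would be the natural next step where reformulation is common.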

Operationalization: How Many Samples?

On the practitioner side, Geneo's measurement guide turns the academic thesis into actionable rules. The standard that's emerging:

3 to 5 same-day runs per query per engine, to bound intra-day variance
Longitudinal tracking over weeks, to detect drift
Explicit multi-engine coverage (at minimum ChatGPT, Perplexity, Gemini), because cross-engine overlap is too low to extrapolate from any single one

And a metric set adapted to the distributional nature of the signal:

Metric	What it measures
Jaccard overlap	Stability of cited sources between two runs (set similarity)
Source Survival Rate	Percentage of sources that survive into the next run
Domain Rotation Index	Speed at which the cited-domain set rotates
Drift Rate	Variation in your citation share across periods
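
A minimal sketch of these four metrics follows. Jaccard overlap is the standard set measure; the other three formulas are one plausible reading of the descriptions in the table, not Geneo's published definitions, and the domain names are placeholders.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def survival_rate(previous: set[str], current: set[str]) -> float:
    """Assumed definition: share of the previous run's sources still present."""
    return len(previous & current) / len(previous) if previous else 0.0


def rotation_index(runs: list[set[str]]) -> float:
    """Assumed definition: mean of (1 - Jaccard) over consecutive runs."""
    pairs = list(zip(runs, runs[1:]))
    if not pairs:
        return 0.0
    return sum(1 - jaccard(a, b) for a, b in pairs) / len(pairs)


def drift_rate(previous_share: float, current_share: float) -> float:
    """Assumed definition: relative change in citation share between two periods."""
    if previous_share == 0:
        raise ValueError("previous_share must be non-zero")
    return abs(current_share - previous_share) / previous_share


# Hypothetical cited-domain sets from three consecutive runs.
run1 = {"a.com", "b.com", "c.com", "d.com"}
run2 = {"a.com", "b.com", "e.com"}
run3 = {"a.com", "e.com", "f.com"}

print(jaccard(run1, run2))                   # 2 shared out of 5 distinct domains
print(survival_rate(run1, run2))             # 2 of run1's 4 domains survive
print(rotation_index([run1, run2, run3]))
print(drift_rate(0.20, 0.26))
```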

Geneo even proposes concrete alerting thresholds:

Jaccard overlap < 0.35 for two consecutive days → high instability, do not draw business conclusions
Drift Rate > 40% week-over-week → something has shifted on the engine or ecosystem side, trigger an investigation
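
These two thresholds could be wired into a check like the one below. It is a sketch: the function name, the alert labels, and the input shape (one Jaccard value per day, one week-over-week drift figure) are assumptions, and the defaults simply restate the thresholds quoted above.

```python
def stability_alerts(
    daily_jaccard: list[float],
    weekly_drift: float,
    jaccard_floor: float = 0.35,
    drift_ceiling: float = 0.40,
) -> list[str]:
    """Return alert labels for the two threshold rules."""
    alerts = []
    # Rule 1: Jaccard overlap below the floor on two consecutive days.
    consecutive_pairs = zip(daily_jaccard, daily_jaccard[1:])
    if any(a < jaccard_floor and b < jaccard_floor for a, b in consecutive_pairs):
        alerts.append("high_instability")
    # Rule 2: week-over-week drift above the ceiling.
    if weekly_drift > drift_ceiling:
        alerts.append("investigate_drift")
    return alerts


# Hypothetical week: two low-overlap days in a row, modest drift.
print(stability_alerts([0.50, 0.30, 0.32, 0.60], weekly_drift=0.10))
# Low-overlap days that are not consecutive, but large drift.
print(stability_alerts([0.50, 0.30, 0.50, 0.30], weekly_drift=0.45))
```

Keeping the thresholds as parameters makes it easy to recalibrate them against your own history.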

This is the managerial translation of the paper: don't just measure — measure with an alerting policy that distinguishes noise from signal.

iPullRank's Three-Tier Framework

The agency iPullRank has published, in parallel, a conceptual frame that summarizes the new discipline well. Their formulation:

"Share-of-voice is no longer a static percentage of positions held, but a statistical distribution of presence over many trials. Measuring it requires repeated sampling, probabilistic modeling, and acceptance that visibility is not a single snapshot but rather a range of likely outcomes."

They recommend a three-tier measurement stack:

Input metrics — what content you produce, which external sources mention you, what your topical coverage looks like (you control everything)
Channel metrics — how AI engines treat your content: citation rate, absorption depth, propagation latency after publication (you measure)
Performance metrics — business impact: referred traffic, lead quality attributed to AI channels, conversion lifts on audiences exposed to your brand in AI answers (you attribute)

The classic first-generation GEO trap is to measure only tier 2 without connecting it to the others. You end up with a score that moves, with no way to explain its cause (tier 1) or measure its consequences (tier 3).

Google's Counterpoint: "GEO and AEO Are a Myth"

In May 2026, Google publicly pushed back, arguing that AI Overviews and AI Mode use the same ranking systems as regular search. Their position: if you rank well in traditional SEO, you'll appear in AI answers. GEO would be SEO rebranded.

That claim is factually true within Google's narrow perimeter. It's still doubly insufficient.

First, the ~14% URL overlap measured between AI Mode and the top 10 organic says the opposite: even at Google, the gap between "well-ranked in SERP" and "cited by AI" is massive. The two systems share infrastructure; they don't do the same thing at answer time.

Second, and this is the central point, Google is talking only about Google. The argument doesn't touch ChatGPT, Perplexity, Claude, or Gemini outside AI Mode. Yet that's precisely where the measurement-variance problem is most acute — because those engines don't use Google's ranking, because their retrieval is different, and because their generation stochasticity is exposed much more directly to the user.

Conclusion: Google's stance is internally consistent for its own ecosystem but doesn't excuse anyone from rigorously measuring the other engines. If anything, it confirms the thesis — each engine has its own statistical regime.

Practical Synthesis: What to Do Tomorrow

If you manage a brand's AI visibility, here's the operational translation of the research above:

Stop reasoning in "my position on ChatGPT for this query." That sentence is malformed. The right phrasing is "my average citation share on ChatGPT for this query across 5 same-day runs, with its confidence interval."
Run the same query 3 to 5 times within the same time window, on every engine you track. Never draw conclusions from a single measurement.
Multi-engine by default. At minimum ChatGPT, Perplexity, Gemini. Cross-engine overlap is too low for any one to be representative.
Track stability, not just score. Jaccard overlap, drift rate, survival rate. A brand stable at 18% is in better strategic shape than a brand oscillating between 10% and 30% with an average of 20% — even if the latter has a better "score" on an isolated run.
Measure two layers: selection AND absorption. Citation is necessary but insufficient. Without absorption, your brand appears in a footnote nobody reads.
Define your alert thresholds before you measure. Jaccard < 0.35 two days in a row, drift > 40% week-over-week: not universal constants, but reasonable starting points to calibrate against your own history.
Document your protocols. A GEO benchmark with no measurement protocol description (number of runs, time window, engines covered, model version when capturable) is not comparable to another. Methodology is half the result.
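
Two of these points, tracking stability and documenting the protocol, can be combined in one record. The sketch below is illustrative only: `MeasurementProtocol`, its fields, and the per-run shares are hypothetical, chosen to mirror the stable-at-18% versus oscillating-around-20% example above.

```python
import statistics
from dataclasses import asdict, dataclass


@dataclass
class MeasurementProtocol:
    """Hypothetical record of how a set of runs was collected."""
    engines: tuple[str, ...]
    runs_per_query: int
    window: str
    model_versions: dict[str, str]  # leave empty when versions are not capturable


def summarize_runs(shares: list[float], protocol: MeasurementProtocol) -> dict:
    """Report spread alongside the mean, with the protocol attached."""
    return {
        "mean": statistics.mean(shares),
        "stdev": statistics.stdev(shares),
        "min": min(shares),
        "max": max(shares),
        "protocol": asdict(protocol),
    }


protocol = MeasurementProtocol(
    engines=("ChatGPT", "Perplexity", "Gemini"),
    runs_per_query=5,
    window="same day",
    model_versions={},
)

# Hypothetical per-run citation shares for two brands.
stable = summarize_runs([0.18, 0.17, 0.19, 0.18, 0.18], protocol)
oscillating = summarize_runs([0.10, 0.30, 0.12, 0.28, 0.20], protocol)
print(f"stable: mean={stable['mean']:.2f}, stdev={stable['stdev']:.3f}")
print(f"oscillating: mean={oscillating['mean']:.2f}, stdev={oscillating['stdev']:.3f}")
```

The oscillating brand has the higher mean, but its spread is what a single isolated run would hide.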

The Bigger Picture

GEO has moved in two years from a nascent discipline (the foundational Princeton/IIT Delhi paper at KDD 2024) to a measurable one. The current phase is where measurement catches up with practice — and where we discover that many of the numbers shared so far are statistical fiction: averages reported without variance, single points treated as trends, cross-engine comparisons made as if it were the same ranking.

The shift is healthy. It forces tools, agencies, and in-house teams to publish protocols, expose confidence intervals, and stop promising certainties that the very nature of generative engines doesn't allow. That's exactly the bet we're making at Traaker: repeated sampling, multi-engine tracking by default, and an explicit separation between selection and absorption.

The message holds in one line: in GEO, a single measurement is not a measurement.

Primary sources: Schulte, "Don't Measure Once: Measuring Visibility in AI Search (GEO)," arXiv:2604.07585, 8 April 2026 — "Quantifying Uncertainty in AI Visibility," arXiv:2603.08924, March 2026 — "From Citation Selection to Citation Absorption," arXiv:2604.25707, April 2026.

Industry sources: iPullRank, "The Measurement Chasm: Tracking GEO Performance" — Geneo, "Ultimate Guide to AI Search Volatility Tracking" — Trakkr.ai, AI Citation Tracking (analysis of 1.3M citations across 60,209 domains) — Google, public stance on AI Overviews / AI Mode (May 2026).
