Case Study
Why a Single AI's Word Isn't Evidence: The Case for Multi-Model Verification
Most AI citation trackers ask one model whether a brand was cited and trust the answer. We don't, because we have seen what one model is prepared to claim. This is the case for verifying every citation across multiple, independent systems.
There is a way to build an AI citation tracker that takes about a weekend. Send a prompt to a single model, ask that same model whether a brand was cited in its own answer, store the result, and put a number on a dashboard. It is fast, it is cheap, and at first glance it works.
It also produces, depending on the model and the category, somewhere between fifteen and forty percent false positives. Which is to say it produces a dashboard that quietly lies to its users.
This article is the long version of why we do not build that way, and why we built Outercite around a different design.
The problem in one paragraph
A generative model asked to evaluate its own answer is, in the most literal sense, marking its own homework. The same statistical processes that produced the answer are now being asked to audit it. The model will frequently agree with itself. It will sometimes invent context. It will occasionally describe a citation that was implied but not actually made, and it will, with disquieting regularity, confidently say a brand was cited when the prompt response named only a competitor.
None of this is the model misbehaving in any deep sense. It is the predictable result of asking one system to do two jobs at once. The fix is structural.
What false positives actually cost
Before describing the fix, it is worth being honest about the cost of getting this wrong. Citation data is increasingly used for three things.
The first is reporting. A marketing leader presents a citation share number to a board, a CMO, or a budget holder. If the number is inflated, the credibility of the entire AEO programme is on the line the first time someone sense-checks it by hand.
The second is action. Recommendations are derived from citation data. The brand decides what to write, what to fix, what to chase. If the underlying data is wrong, the action plan is wrong in the same direction. This compounds.
The third is comparison. Citation share gets compared across competitors, across time, across categories. False positives that are distributed unevenly across those slices distort every comparison in ways that are very hard to debug after the fact.
A dashboard with a fifteen percent false-positive rate is not slightly wrong. It is the kind of wrong that destroys trust the first time a senior person notices, and quietly poisons every decision until then.
Why verification has to be independent
The honest fix is to use more than one model, and to be deliberate about how they are combined.
A useful working definition of verification: a citation is recorded only when two independent systems, looking at the same response, agree that a citation was made. By independent we mean models with different training corpora, different reasoning approaches and, ideally, different vendors.
That word "independent" is doing a lot of work. Two instances of the same model are not independent in any useful sense. Two models trained on near-identical data are weakly independent. Two models from different vendors, with different reasoning styles, looking at the same evidence, are about as independent as the field currently allows.
When they agree, the agreement carries information. When they disagree, the disagreement is itself useful. It tells the system to look more carefully, to ask for human review, or to discard the data point rather than score it.
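A minimal sketch of that agreement rule, in illustrative Python, may make the shape of it concrete. The names here (CitationJudgement, verify_citation, the two judge callables) are placeholders for the idea of two independent judging systems, not Outercite's actual interfaces:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CitationJudgement:
    brand_cited: bool   # did this judge see a citation of the tracked brand?
    confidence: float   # the judge's own 0-1 confidence estimate

# A judge is any callable that reads the raw AI response and returns a judgement.
# In practice the two judges would be models from different vendors.
Judge = Callable[[str], CitationJudgement]

def verify_citation(response_text: str, judge_a: Judge, judge_b: Judge) -> Optional[bool]:
    """Record a citation only when two independent judges agree.

    Returns True or False when the judges agree, and None when they disagree,
    so the caller can route the item to review instead of silently counting it.
    """
    a = judge_a(response_text)
    b = judge_b(response_text)
    if a.brand_cited == b.brand_cited:
        return a.brand_cited
    return None  # disagreement is information: flag it, do not score it
```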
The three-phase pipeline we actually run
We do not pretend that two votes are enough for every case. The pipeline that sits behind every Outercite citation runs in three deliberate phases.
The first phase classifies intent. Before any judgement about citation, the system identifies what kind of question was being asked. Was the user looking for a local recommendation, comparing options, asking a branded question, doing background research, or making a buying decision? Citation looks different in each of these contexts, and a tracker that ignores the context will misread it.
The second phase is the primary judgement. A high-performance reasoning system reads the response in full, considers the intent context from phase one, and produces a structured analysis: who was named, in what position, with what surrounding sentiment, against which sources, and with what confidence.
The third phase is verification. A separate, independent system audits the primary judgement. Where the two agree, a citation is recorded. Where they disagree, the data point is flagged for low-confidence triage and never enters the headline numbers without scrutiny.
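A rough sketch of how the three phases hand off to each other, again in illustrative Python. The intent classes, field names, and injected stage functions are assumptions made for the example, not a description of Outercite's internal code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Intent(Enum):
    # The kinds of question phase one distinguishes between.
    LOCAL_RECOMMENDATION = "local recommendation"
    COMPARISON = "comparing options"
    BRANDED = "branded question"
    RESEARCH = "background research"
    PURCHASE = "buying decision"

@dataclass
class PrimaryAnalysis:
    # The structured output of the phase-two judgement.
    brands_named: list[str]        # who was named in the response
    brand_position: Optional[int]  # where the tracked brand appeared, if at all
    sentiment: str                 # surrounding sentiment around the mention
    sources: list[str]             # sources the response leaned on
    confidence: float              # the primary judge's own 0-1 confidence

def track_citation(
    prompt: str,
    response: str,
    brand: str,
    classify_intent: Callable[[str], Intent],
    primary_judge: Callable[[str, str, Intent], PrimaryAnalysis],
    verifier_agrees: Callable[[str, str, PrimaryAnalysis], bool],
) -> Optional[PrimaryAnalysis]:
    """Run the three phases and return an analysis only when the verifier agrees.

    Returning None signals that the item belongs in low-confidence triage
    rather than in the headline numbers.
    """
    intent = classify_intent(prompt)                   # phase 1: what was asked
    analysis = primary_judge(response, brand, intent)  # phase 2: structured judgement
    if verifier_agrees(response, brand, analysis):     # phase 3: independent audit
        return analysis
    return None
```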
The combined effect is a pipeline that errs on the side of under-reporting a real citation rather than over-reporting a phantom one. It is slower than a single-pass system. It is more expensive to run. And it is, as far as we know, the only design currently shipping in this category that is honest about the question it is answering.
Why we will not name the models in the pipeline
A reasonable question at this point is which models do which jobs in our pipeline. We do not publish that, and we are deliberately straightforward about why.
The choice of models for the primary judgement and for the verifier role is the part of the system we have spent the most time tuning. It changes when better models arrive. It changes when a model's behaviour shifts after a vendor update. The combination, and the structured prompts that drive it, are the part of the system we treat as proprietary. Naming the components would invite both imitation and a kind of gaming we would rather not encourage.
What we will say, and what is useful to know, is that the system does not depend on any single vendor. The components are chosen for fitness to task, not for partnership convenience, and the design assumes that the best model for each job will change over the platform's lifetime.
What this design buys for the brand using it
Three things, mostly.
The first is a citation share number that survives spot-checks. When a marketing leader picks ten data points at random and reads the underlying responses by hand, the numbers match what the dashboard claims. This is a low bar that surprisingly few products in the category clear.
The second is recommendations that are anchored to reality. Because the data underneath is verified, the action plan that comes out the other end has a much higher hit rate. Brands act on real citation losses, not on noise.
The third is the slow but meaningful trust dividend. A measurement layer that has been right for twelve months in a row earns the right to be quoted in board updates, to drive budget reallocation, and to settle internal arguments. A measurement layer that was wrong twice in the first quarter does not. We built the verification design partly because of what brands and agencies told us when we asked what had broken their trust in earlier tools.
What this design does not do
It is worth saying what verification does not solve.
It does not eliminate uncertainty. The engines themselves are noisy, the same prompt produces different answers on different days, and the long-term trend is the right unit of analysis, not the day-to-day swing. Verification reduces false positives. It does not produce certainty where none exists.
It does not replace human judgement on the edges. Some citations are ambiguous in ways that no automated pipeline resolves cleanly. A brand mentioned in passing is different from a brand recommended directly, and the distinction sometimes requires a human read. We design for that case rather than around it.
It does not, finally, make AEO work easier. It makes AEO work answerable. The difference is the whole point.
A short, low-pressure closing
If you are running an AEO programme already, the diagnostic question is small and useful. Take ten of your tool's citation reports from last month. Open the underlying AI answers. Count the false positives. The number you get is the amount of credibility your current measurement layer is betting.
We would be glad to run a small unpaid sample against our pipeline if you would like to compare. The point is not the sales conversation. The point is the comparison. AEO work is hard enough without the numbers being slightly wrong in the dashboard.
Common questions
Why not just use a single very strong model and trust it? Stronger models reduce the false-positive rate. They do not eliminate it, and the residual errors tend to be the most plausible and therefore the hardest to spot. Two independent systems agreeing carries information no single system can replicate.
How often do the two systems disagree in practice? In our production data, the two systems disagree on roughly five to twelve percent of items, depending on the category. Those items are exactly the ones a single-pass tracker would silently miscount.
Does verification slow the dashboard down? Marginally. A verified citation takes a few extra seconds to record. For the use cases the dashboard actually supports, the trade is well worth it.
Is this approach unique to Outercite? The structural idea is not. Multi-model verification is a known technique in academic AI evaluation work. As a product design in the AEO category, it is, to our knowledge, currently rare.