The model wasn't hallucinating. It was misbinding.

We asked an AI to summarize the financials of 50 S&P companies from their most recent 10-K filings. Then we checked every number against the SEC's own XBRL data with a deterministic interpreter. No neural network in the loop. 500 claims. Seven failure categories. The receipt is the proof.

No AI grading AI. The interpreter does the math.

500 claims checked across 50 companies
0.7% passed exact text citation
33 silent accounting swaps detected
3 claims reversed profit vs. loss
Three-layer verification · The spread is the signal
cite: 0.7%
measure: 34.7%
check: 5.1%
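Where 500 claims landed

| Category | Claims | Share |
| --- | --- | --- |
| True Fab. | 190 | 38.0% |
| Wrong Year | 148 | 29.6% |
| Stale Truth | 58 | 11.6% |
| Concept Sub | 33 | 6.6% |
| Drift | 12 | 2.4% |
| Precision | 8 | 1.6% |
| No Source | 51 | 10.2% |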
Verb key · what each check does
cite: Did the AI use words that actually appear in the filing?
measure: Is the number close enough to the XBRL value, or did it drift?
check: Did the AI cite the right fiscal year, or a different one?

What the receipts show
5 claims from the matrix · 1 per failure type

AI claimed: Ford basic EPS: $1.50
10-K filing says: FY2025 EarningsPerShareBasic: −$2.06
Layers: cite ✗ meas ✗ chk ✗
sign misbinding · directionally wrong · claimed profit, filing shows loss

AI claimed: UnitedHealth net income: $14.4 billion
10-K filing says: FY2025: $22.4B · but FY2021 was $13.8B
Layers: cite ✗ meas ✗ chk ✗
stale truth · real number from FY2021 presented as current · 4 years stale

AI claimed: Goldman Sachs cash equivalents: $241.0 billion
10-K filing says: Unrestricted cash: $182.1B · Restricted-inclusive: $241.8B
Layers: cite ✗ meas ✗ chk ✗ concept ↔
concept substitution · silently included restricted cash · $59B definitional swing
Restricted cash is money the firm can't freely spend. Including it in "cash equivalents" overstates available liquidity by $59 billion.

AI claimed: Alphabet revenue: $350.0 billion
10-K filing says: Revenues: different tag · RevenueFromContract: $350.0B exact
Layers: cite ✗ meas ✗ chk ✗ concept ↔
concept substitution · right to the dollar, wrong XBRL concept · no numeric check would flag this
ASC 606 contract revenue and top-line revenue are different line items. The number is identical; only the accounting definition differs.

AI claimed: Microsoft: 5 of 10 financial metrics
10-K filing says: All 5 match a prior fiscal year exactly
Layers: cite ✗ meas ✗ chk ✗ stale ×5
temporal misbinding · 5 stale truths, 0 fabrications · the model remembered last year's Microsoft, not this year's

33 silent accounting swaps, caught deterministically.

The value was right. The concept was wrong. No numeric check would flag these — the number lands within tolerance of a neighboring XBRL concept, not the one that was asked for.

| Company | Claimed Metric | Actual Metric | Returned Value |
| --- | --- | --- | --- |
| GS | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $241.8B |
| WFC | net_income | NetIncomeLossAvailableToCommonStockholdersBasic | $20.3B |
| C | net_income | NetIncomeLossAvailableToCommonStockholdersBasic | $13.0B |
| BLK | eps_basic | EarningsPerShareDiluted | $42.01 |
| JNJ | long_term_debt | LongTermDebtNoncurrent | $30.7B |
| RTX | revenue | Revenues | $80.7B |
| RTX | stockholders_equity | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | $74.2B |
| RTX | operating_income | NetIncomeLoss | $6.7B |
| MSFT | cash_equivalents | CashCashEquivalentsAndShortTermInvestments | $75.5B |
| GOOGL | revenue | RevenueFromContractWithCustomerExcludingAssessedTax | $350.0B |
| GOOGL | long_term_debt | DebtInstrumentCarryingAmount | $12.0B |
| GOOGL | cash_equivalents | CashCashEquivalentsAndShortTermInvestments | $95.7B |
| META | eps_diluted | EarningsPerShareBasic | $24.61 |
| SPGI | operating_income | IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest | $6.2B |
| ABBV | long_term_debt | LongTermDebtAndCapitalLeaseObligations | $60.3B |
| ABBV | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $12.8B |
| ELV | eps_diluted | EarningsPerShareBasic | $25.81 |
| ELV | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $7.4B |
| NOC | long_term_debt | LongTermDebtNoncurrent | $14.7B |
| INTC | long_term_debt | LongTermDebtNoncurrent | $46.3B |
| AVGO | long_term_debt | DebtInstrumentCarryingAmount | $67.1B |
| DIS | long_term_debt | DebtInstrumentCarryingAmount | $41.3B |
| DIS | shares_outstanding | WeightedAverageNumberOfDilutedSharesOutstanding | 1.83B shares |
| SBUX | long_term_debt | LongTermDebtNoncurrent | $14.3B |
| SBUX | operating_income | IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest | $5.4B |
| KO | operating_income | NetIncomeLoss | $13.1B |
| GE | shares_outstanding | WeightedAverageNumberOfDilutedSharesOutstanding | 1.10B shares |
| PG | net_income | NetIncomeLossAvailableToCommonStockholdersBasic | $15.7B |
| XOM | stockholders_equity | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | $212.5B |
| IBM | net_income | ProfitLoss | $5.7B |
| IBM | stockholders_equity | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | $22.6B |
| IBM | long_term_debt | DebtInstrumentCarryingAmount | $56.1B |
| IBM | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $13.1B |
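A swap like these could be detected roughly as follows. This is a minimal sketch, not the interpreter's actual logic: the `NEIGHBORS` map, the 1% tolerance, the function names, and the choice of `CashAndCashEquivalentsAtCarryingValue` as the tag for unrestricted cash are all assumptions made for illustration. The Goldman Sachs values are the ones quoted earlier on this page.

```python
from typing import Dict, List, Optional

# Hypothetical neighbor map: requested concept -> concepts it is easily confused with.
NEIGHBORS: Dict[str, List[str]] = {
    "CashAndCashEquivalentsAtCarryingValue": [
        "CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents",
        "CashCashEquivalentsAndShortTermInvestments",
    ],
}


def rel_distance(claimed: float, actual: float) -> float:
    """Relative distance between a claimed value and a filed value."""
    if actual == 0:
        return 0.0 if claimed == 0 else float("inf")
    return abs(claimed - actual) / abs(actual)


def find_concept_swap(
    claimed: float,
    requested: str,
    facts: Dict[str, float],
    tol: float = 0.01,  # assumed tolerance, not the interpreter's setting
) -> Optional[str]:
    """Return the neighboring concept the claim matches, or None if no swap is found."""
    actual = facts.get(requested)
    if actual is not None and rel_distance(claimed, actual) <= tol:
        return None  # the claim matches the concept that was asked for
    for concept in NEIGHBORS.get(requested, []):
        value = facts.get(concept)
        if value is not None and rel_distance(claimed, value) <= tol:
            return concept
    return None


# Goldman Sachs, in billions of dollars, as quoted on this page.
gs_facts = {
    "CashAndCashEquivalentsAtCarryingValue": 182.1,
    "CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents": 241.8,
}
swap = find_concept_swap(241.0, "CashAndCashEquivalentsAtCarryingValue", gs_facts)
print(swap)
```

In this sketch the $241.0B claim misses the requested concept by a wide margin but sits about 0.3% from the restricted-inclusive value, so the function returns the neighboring tag rather than a plain "wrong number" verdict.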

Where the failures hit hardest.

Severe means a numeric error over 20%. Directionally wrong means the sign flipped: a loss reported as a profit, or the reverse. The pattern holds across every industry tier.
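The two definitions can be written down as a small function. This is a sketch under stated assumptions: the label strings, the function name, and the choice to test the sign before the 20% threshold are illustrative, not taken from the interpreter.

```python
def severity(claimed: float, actual: float, severe_threshold: float = 0.20) -> str:
    """Label a numeric claim using the two definitions above.

    Assumption: a sign flip is reported as "directionally_wrong" and is
    checked before the 20% threshold.
    """
    if claimed * actual < 0:
        return "directionally_wrong"
    if actual == 0:
        return "not_severe" if claimed == 0 else "severe"
    error = abs(claimed - actual) / abs(actual)
    return "severe" if error > severe_threshold else "not_severe"


print(severity(1.50, -2.06))   # Ford basic EPS: claimed vs. filed
print(severity(14.4, 22.4))    # UnitedHealth net income, $B, against FY2025
print(severity(241.0, 241.8))  # Goldman Sachs against the restricted-inclusive value
```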

By industry tier · 50 companies
Counts of high-severity classifications per tier.
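| Tier | Companies | Severe | Dir. Wrong | Stale Truth | Concept Sub | True Fab | Drift | Precision |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T1 - Banks & Finance | 15 | 24 | 0 | 17 | 8 | 49 | 0 | 1 |
| T2 - Big Tech | 10 | 29 | 0 | 16 | 11 | 44 | 0 | 0 |
| T3 - High Coverage | 10 | 28 | 0 | 6 | 2 | 43 | 8 | 2 |
| T4 - Consumer / Industrial | 15 | 25 | 3 | 19 | 12 | 54 | 4 | 5 |

Highest-severity companies

| Company | Severe | Dir. Wrong | Stale Truth | Concept Sub | True Fab |
| --- | --- | --- | --- | --- | --- |
| NVDA | 7 | 0 | 1 | 0 | 6 |
| AMD | 7 | 0 | 0 | 0 | 7 |
| AVGO | 6 | 0 | 0 | 1 | 5 |
| GOOGL | 5 | 0 | 1 | 3 | 4 |
| AMZN | 5 | 0 | 1 | 0 | 6 |
| NFLX | 5 | 0 | 2 | 0 | 5 |
| GE | 5 | 0 | 1 | 1 | 5 |
| DIS | 5 | 0 | 0 | 2 | 5 |
| F | 1 | 3 | 1 | 0 | 3 |
| UNH | 4 | 0 | 1 | 0 | 5 |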
Ten companies shown, ranked by severe + directionally-wrong claims. The full 50-company matrix lives in the receipts — click any ticker to open its receipt.
Findings

Not hallucination. Misbinding.

The word "hallucination" implies the model invented something from nothing. The data says otherwise. Of 500 financial claims checked against SEC EDGAR XBRL filings, 205 were real values attached to the wrong fiscal year. Another 33 were real values pulled from the wrong accounting concept. Three reversed the direction of a financial result — claiming profit where the filing shows loss.

Only the receipt system can tell you the difference. A single accuracy score would flatten all of this into one number. The three-layer receipt classifies every failure:

| Layer | What it checks | Result |
| --- | --- | --- |
| cite | Is the text verbatim from the filing? | 0.7% pass |
| measure | Is the number within tolerance? | 34.7% pass |
| check | Is it the right fiscal year? | 5.1% pass |

The spread between the layers is the signal. cite at 0.7% means the model almost never reproduces the filing's exact text; it paraphrases and rounds. measure at 34.7% means about a third of the numbers land within tolerance of the real value. check at 5.1% means roughly 95% of claims failed the fiscal-year check, the most common failure mode across all 50 companies.
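To make the three layers concrete, here is a minimal sketch of how one receipt might be assembled. Everything in it is an assumption for illustration: the `Claim`, `Fact`, and `Receipt` types, the substring test for cite, the 5% tolerance, and the example filing sentence are invented, and the real interpreter may define each layer differently.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str         # the phrase the AI wrote
    value: float      # the number it claimed
    fiscal_year: int  # the year it attributed the number to


@dataclass
class Fact:
    value: float      # the value filed in XBRL
    fiscal_year: int  # the fiscal year the filing reports


@dataclass
class Receipt:
    cite: bool     # does the claim's text appear verbatim in the filing?
    measure: bool  # is the number within tolerance of the filed value?
    check: bool    # is the claim bound to the right fiscal year?


def build_receipt(claim: Claim, fact: Fact, filing_text: str, tol: float = 0.05) -> Receipt:
    """Evaluate the three layers independently so the spread stays visible."""
    cite = claim.text in filing_text
    measure = fact.value != 0 and abs(claim.value - fact.value) / abs(fact.value) <= tol
    check = claim.fiscal_year == fact.fiscal_year
    return Receipt(cite=cite, measure=measure, check=check)


# Invented example data: a paraphrased, rounded claim about the correct year.
filing_text = "Net income was $120 million in fiscal 2024."
claim = Claim(text="net income of roughly $118 million", value=118.0, fiscal_year=2024)
fact = Fact(value=120.0, fiscal_year=2024)

receipt = build_receipt(claim, fact, filing_text)
print(receipt)
```

Keeping the three results as separate fields, rather than folding them into one score, is what lets a paraphrased but numerically close claim show up as a cite failure and a measure pass at the same time.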

But the taxonomy goes deeper. What the old "fabrication" label hid was three distinct failure types: 58 stale truths (real numbers from a prior year, averaging 1.4 years old), 33 concept substitutions (numbers from a neighboring XBRL concept — like including restricted cash in "cash equivalents"), and 152 of 190 true fabrications that were within 20% of a real value (average distance: 18.9%). The model wasn't generating random numbers. It was retrieving real values and misbinding them.
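The temporal branch of that taxonomy can be sketched as a nearest-match search over the same metric in other fiscal years. The label names, the 5% match tolerance, and the function name are assumptions for illustration; only the 20% "near" band comes from the text above. The sketch covers fiscal-year misbinding only and does not model drift or concept substitution.

```python
from typing import Dict, Optional, Tuple


def rel_distance(claimed: float, actual: float) -> float:
    """Relative distance between a claimed value and a nonzero filed value."""
    return abs(claimed - actual) / abs(actual)


def classify_against_years(
    claimed: float,
    target_year: int,
    by_year: Dict[int, float],
    tol: float = 0.05,   # assumed match tolerance
    near: float = 0.20,  # the 20% band mentioned above
) -> Tuple[str, Optional[int]]:
    """Return (label, matched_year) for one metric's values keyed by fiscal year."""
    target = by_year.get(target_year)
    if target and rel_distance(claimed, target) <= tol:
        return ("match", target_year)

    # Look for the closest real value among the other fiscal years.
    others = {y: v for y, v in by_year.items() if y != target_year and v != 0}
    if not others:
        return ("fabrication", None)
    year = min(others, key=lambda y: rel_distance(claimed, others[y]))
    distance = rel_distance(claimed, others[year])

    if distance <= tol:
        return ("stale_truth", year)
    if distance <= near:
        return ("near_miss_fabrication", year)
    return ("fabrication", None)


# UnitedHealth net income in billions of dollars, as quoted on this page.
unh = {2025: 22.4, 2021: 13.8}
label, year = classify_against_years(14.4, 2025, unh)
print(label, year, 2025 - year)
```

Under these assumed tolerances the $14.4B claim binds to FY2021 rather than FY2025, which is the four-year staleness reported for UnitedHealth.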

The most dangerous failures were the most plausible ones. Goldman Sachs' $241 billion "cash equivalents" was actually restricted-cash-inclusive — off by $59 billion in economic meaning, but only 0.3% from the wrong concept's actual value. Alphabet's $350 billion revenue matched a different revenue XBRL tag to the dollar. No numeric check, no tolerance window, no human review would catch these. Only a receipt that checks the concept, not just the number.

This is what a receipt system does that nothing else can: it doesn't just tell you a claim is wrong. It tells you how it's wrong — and that classification is the difference between a rounding error and a $59 billion misstatement of liquidity.

Every failure in this matrix is the model's mistake, verified against the SEC's own XBRL data. The receipt protects the analyst who uses AI, not the vendor who sells it.


Receipt-only findings
These findings could only be produced by a deterministic receipt system with access to structured source data.
58 claims reclassified from "fabrication" to stale truths: real numbers from the wrong year. Average staleness: 1.4 years. Oldest: UnitedHealth's FY2021 net income presented as FY2025, four years stale.
33 silent accounting swaps detected: values from neighboring XBRL concepts presented as the requested metric. Most common: cash equivalents silently including restricted cash (4 times).
51 claims excluded from failure rates because the source data didn't exist in XBRL, an adjustment no accuracy benchmark makes.
3 claims reversed profit/loss direction: the model said gain where the filing says loss. All three were Ford. EPS claimed $1.50; actual: −$2.06.
152 of 190 "true fabrication" claims had a closest source match within 20%, suggesting retrieval from plausible-but-wrong sources rather than random generation. Average distance: 18.9%.
0 scale confusions (millions for billions) detected across all 500 claims. The model always got the order of magnitude right, even when everything else was wrong.
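A scale-confusion test of the kind behind that last finding could look like the sketch below. It is an assumption about how such a check might work, not the interpreter's method: it asks whether the claimed and filed values differ by almost exactly a factor of a thousand or a million. The function name and the 5% tolerance are illustrative.

```python
from typing import Optional


def scale_confusion(claimed: float, actual: float, tol: float = 0.05) -> Optional[int]:
    """Return the power of 1000 separating claimed from actual, or None.

    -1 means the claim is about 1000x too small (millions stated for billions);
    +1 means about 1000x too large. Both values must be in the same unit.
    """
    if claimed == 0 or actual == 0 or (claimed > 0) != (actual > 0):
        return None
    ratio = abs(claimed) / abs(actual)
    for exponent in (-2, -1, 1, 2):
        factor = 1000.0 ** exponent
        if abs(ratio / factor - 1.0) <= tol:
            return exponent
    return None


# Hypothetical slip: $350.0 million claimed where the filing says $350.0 billion.
print(scale_confusion(350.0e6, 350.0e9))
# Goldman Sachs claim vs. the restricted-inclusive value: same order of magnitude.
print(scale_confusion(241.0e9, 241.8e9))
```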

Pick a receipt. Click through to the interpreter's results.

Goldman Sachs (GS): 4.8% overall error rate, and strict cite still catches every claim.
Ford (F): Claimed $1.50 EPS. Filing says −$2.06. The sign is the story.
Apple (AAPL): 10% numeric drift on total assets, close enough to fool a reviewer.
IBM (IBM): Four silent concept swaps in one filing, the densest substitution cluster in the set.

The receipt is the proof point. Run your own.
