The model wasn't hallucinating. It was misbinding.
We asked an AI to summarize the financials of 50 S&P companies from their most recent 10-K filings. Then we checked every number against the SEC's own XBRL data with a deterministic interpreter. No neural network in the loop. 500 claims. Seven failure categories. The receipt is the proof.
No AI grading AI. The interpreter does the math.
33 silent accounting swaps, caught deterministically.
The value was right. The concept was wrong. No numeric check would flag these — the number lands within tolerance of a neighboring XBRL concept, not the one that was asked for.
| Company | Claimed Metric | Actual Metric Returned | Value | Receipt |
|---|---|---|---|---|
| GS | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $241.8B | receipt → |
| WFC | net_income | NetIncomeLossAvailableToCommonStockholdersBasic | $20.3B | receipt → |
| C | net_income | NetIncomeLossAvailableToCommonStockholdersBasic | $13.0B | receipt → |
| BLK | eps_basic | EarningsPerShareDiluted | $42.01 | receipt → |
| JNJ | long_term_debt | LongTermDebtNoncurrent | $30.7B | receipt → |
| RTX | revenue | Revenues | $80.7B | receipt → |
| RTX | stockholders_equity | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | $74.2B | receipt → |
| RTX | operating_income | NetIncomeLoss | $6.7B | receipt → |
| MSFT | cash_equivalents | CashCashEquivalentsAndShortTermInvestments | $75.5B | receipt → |
| GOOGL | revenue | RevenueFromContractWithCustomerExcludingAssessedTax | $350.0B | receipt → |
| GOOGL | long_term_debt | DebtInstrumentCarryingAmount | $12.0B | receipt → |
| GOOGL | cash_equivalents | CashCashEquivalentsAndShortTermInvestments | $95.7B | receipt → |
| META | eps_diluted | EarningsPerShareBasic | $24.61 | receipt → |
| SPGI | operating_income | IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest | $6.2B | receipt → |
| ABBV | long_term_debt | LongTermDebtAndCapitalLeaseObligations | $60.3B | receipt → |
| ABBV | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $12.8B | receipt → |
| ELV | eps_diluted | EarningsPerShareBasic | $25.81 | receipt → |
| ELV | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $7.4B | receipt → |
| NOC | long_term_debt | LongTermDebtNoncurrent | $14.7B | receipt → |
| INTC | long_term_debt | LongTermDebtNoncurrent | $46.3B | receipt → |
| AVGO | long_term_debt | DebtInstrumentCarryingAmount | $67.1B | receipt → |
| DIS | long_term_debt | DebtInstrumentCarryingAmount | $41.3B | receipt → |
| DIS | shares_outstanding | WeightedAverageNumberOfDilutedSharesOutstanding | 1.83B shares | receipt → |
| SBUX | long_term_debt | LongTermDebtNoncurrent | $14.3B | receipt → |
| SBUX | operating_income | IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest | $5.4B | receipt → |
| KO | operating_income | NetIncomeLoss | $13.1B | receipt → |
| GE | shares_outstanding | WeightedAverageNumberOfDilutedSharesOutstanding | 1.10B shares | receipt → |
| PG | net_income | NetIncomeLossAvailableToCommonStockholdersBasic | $15.7B | receipt → |
| XOM | stockholders_equity | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | $212.5B | receipt → |
| IBM | net_income | ProfitLoss | $5.7B | receipt → |
| IBM | stockholders_equity | StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest | $22.6B | receipt → |
| IBM | long_term_debt | DebtInstrumentCarryingAmount | $56.1B | receipt → |
| IBM | cash_equivalents | CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents | $13.1B | receipt → |
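A concept check of the kind described above can be sketched as a nearest-concept lookup: take the claimed number, find which filed concept it lands within tolerance of, and compare that concept to the one requested. This is a minimal sketch, not the interpreter itself. The 1% tolerance, the function names, and the sample values are assumptions for illustration; the source does not state the tolerance it uses.

```python
from typing import Optional

# Hypothetical tolerance; the source does not say what the interpreter uses.
TOLERANCE = 0.01


def relative_error(claimed: float, actual: float) -> float:
    """Relative distance between a claimed value and a filed value."""
    return abs(claimed - actual) / abs(actual)


def bound_concept(claimed: float, facts: dict[str, float],
                  tolerance: float = TOLERANCE) -> Optional[str]:
    """Return the filed concept closest to the claimed value, if any is within tolerance."""
    in_range = {
        concept: relative_error(claimed, value)
        for concept, value in facts.items()
        if value != 0 and relative_error(claimed, value) <= tolerance
    }
    if not in_range:
        return None
    return min(in_range, key=in_range.get)


def is_concept_swap(claimed: float, requested: str, facts: dict[str, float]) -> bool:
    """True when the value matches a filed concept other than the requested one."""
    match = bound_concept(claimed, facts)
    return match is not None and match != requested


# Illustrative numbers only, not taken from any filing.
facts = {
    "CashAndCashEquivalentsAtCarryingValue": 180.0e9,
    "CashCashEquivalentsRestrictedCashAndRestrictedCashEquivalents": 240.0e9,
}
print(is_concept_swap(239.5e9, "CashAndCashEquivalentsAtCarryingValue", facts))
```

A plain numeric check against the restricted-cash-inclusive total would pass this claim. The swap only shows up because the lookup returns the concept's name and compares it to the request.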
Where the failures hit hardest.
Severe means a numeric error over 20%. Directionally wrong means the sign flipped: profit reported as loss. Severe errors appear in every industry tier, between 24 and 29 per tier, while all three directionally wrong claims fall in T4.
| Tier | Companies | Severe | Directionally Wrong | Stale Truth | Concept Substitution | True Fabrication | Drift | Precision |
|---|---|---|---|---|---|---|---|---|
| T1 - Banks & Finance | 15 | 24 | 0 | 17 | 8 | 49 | 0 | 1 |
| T2 - Big Tech | 10 | 29 | 0 | 16 | 11 | 44 | 0 | 0 |
| T3 - High Coverage | 10 | 28 | 0 | 6 | 2 | 43 | 8 | 2 |
| T4 - Consumer / Industrial | 15 | 25 | 3 | 19 | 12 | 54 | 4 | 5 |
| Highest-severity companies | Severe | Directionally Wrong | Stale Truth | Concept Substitution | True Fabrication |
|---|---|---|---|---|---|
| NVDA | 7 | 0 | 1 | 0 | 6 |
| AMD | 7 | 0 | 0 | 0 | 7 |
| AVGO | 6 | 0 | 0 | 1 | 5 |
| GOOGL | 5 | 0 | 1 | 3 | 4 |
| AMZN | 5 | 0 | 1 | 0 | 6 |
| NFLX | 5 | 0 | 2 | 0 | 5 |
| GE | 5 | 0 | 1 | 1 | 5 |
| DIS | 5 | 0 | 0 | 2 | 5 |
| F | 1 | 3 | 1 | 0 | 3 |
| UNH | 4 | 0 | 1 | 0 | 5 |
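The two severity definitions used in these tables reduce to simple deterministic tests. The sketch below assumes the 20% threshold is measured relative to the filed value and treats the two flags as independent; neither detail is spelled out in the source, and the function names are hypothetical.

```python
def is_severe(claimed: float, actual: float, threshold: float = 0.20) -> bool:
    """Numeric error over 20% of the filed value (threshold taken from the text)."""
    return abs(claimed - actual) > threshold * abs(actual)


def is_directionally_wrong(claimed: float, actual: float) -> bool:
    """Sign flipped: profit claimed where the filing shows loss, or the reverse."""
    return claimed * actual < 0


# Illustrative values, not from any filing.
print(is_severe(125.0, 100.0))             # claim is 25% above the filed value
print(is_directionally_wrong(4.3, -2.0))   # claimed profit, filed loss
```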
Not hallucination. Misbinding.
The word "hallucination" implies the model invented something from nothing. The data says otherwise. Of 500 financial claims checked against SEC EDGAR XBRL filings, 205 were real values attached to the wrong fiscal year. Another 33 were real values pulled from the wrong accounting concept. Three reversed the direction of a financial result — claiming profit where the filing shows loss.
Only the receipt system can tell you the difference. A single accuracy score would flatten all of this into one number. The three-layer receipt classifies every failure:
| Layer | What it checks | Result |
|---|---|---|
| cite | Is the text verbatim from the filing? | 0.7% pass |
| measure | Is the number within tolerance? | 34.7% pass |
| check | Is it the right fiscal year? | 5.1% pass |
The spread between the layers is the signal. cite at 0.7% means the model almost never reproduces the filing's exact text; it paraphrases and rounds. measure at 34.7% means about a third of the numbers land within tolerance of the real value. check at 5.1% means nearly 95% of claims cited the wrong fiscal year, the most common failure mode across all 50 companies.
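The three layers can be pictured as three independent booleans per claim, with each pass rate computed separately instead of being folded into one score. This is a sketch under stated assumptions: the `Claim`, `Fact`, and `Receipt` types, the substring test for cite, and the 1% tolerance are all hypothetical, not the interpreter's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    quoted_text: str   # text the model attributes to the filing
    value: float       # number the model reported
    fiscal_year: int   # fiscal year the model cited


@dataclass
class Fact:
    filing_text: str   # text of the filing
    value: float       # filed value for the requested concept
    fiscal_year: int   # fiscal year of the most recent filing


@dataclass
class Receipt:
    cite: bool
    measure: bool
    check: bool


def make_receipt(claim: Claim, fact: Fact, tolerance: float = 0.01) -> Receipt:
    """Evaluate each layer on its own; no layer depends on another."""
    cite = claim.quoted_text in fact.filing_text                               # verbatim text?
    measure = abs(claim.value - fact.value) <= tolerance * abs(fact.value)     # within tolerance?
    check = claim.fiscal_year == fact.fiscal_year                              # right fiscal year?
    return Receipt(cite, measure, check)


def pass_rates(receipts: list[Receipt]) -> dict[str, float]:
    """Per-layer pass rate, kept separate so the spread stays visible."""
    n = len(receipts)
    return {layer: sum(getattr(r, layer) for r in receipts) / n
            for layer in ("cite", "measure", "check")}


# Illustrative claim: paraphrased text, close number, prior fiscal year.
claim = Claim("revenue of about $100 billion", 100.4, 2023)
fact = Fact("Total revenues were $100,412 million", 100.412, 2024)
print(make_receipt(claim, fact))
```

A claim like this one passes measure while failing cite and check, which is the kind of spread a single accuracy score would hide.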
But the taxonomy goes deeper. The old "fabrication" label hid three distinct failure types: 58 stale truths (real numbers from a prior year, averaging 1.4 years old), 33 concept substitutions (numbers from a neighboring XBRL concept, such as including restricted cash in "cash equivalents"), and 190 true fabrications, 152 of which were within 20% of a real value (average distance: 18.9%). The model wasn't generating random numbers. It was retrieving real values and misbinding them.
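One way to separate these three failure types is to search the filed facts in a fixed order: the requested concept and year first, then earlier years of the same concept, then other concepts in the same year. The sketch below covers only these three of the seven categories. The search order, the 1% tolerance, and the data layout are assumptions for illustration.

```python
def within(claimed: float, actual: float, tol: float) -> bool:
    """True when the claimed value is within a relative tolerance of the filed value."""
    return actual != 0 and abs(claimed - actual) <= tol * abs(actual)


def classify(claimed: float, concept: str, year: int,
             facts: dict[tuple[str, int], float], tol: float = 0.01) -> str:
    """facts maps (concept, fiscal_year) to the filed value."""
    target = facts.get((concept, year))
    if target is not None and within(claimed, target, tol):
        return "match"
    # Stale truth: right concept, but a value from an earlier fiscal year.
    for (c, y), v in facts.items():
        if c == concept and y < year and within(claimed, v, tol):
            return "stale_truth"
    # Concept substitution: right year, but a neighboring concept's value.
    for (c, y), v in facts.items():
        if c != concept and y == year and within(claimed, v, tol):
            return "concept_substitution"
    # Nothing filed is within tolerance of the claimed number.
    return "true_fabrication"


# Illustrative facts, not from any filing.
facts = {
    ("Revenues", 2024): 100.0,
    ("Revenues", 2023): 90.0,
    ("NetIncomeLoss", 2024): 12.0,
}
print(classify(90.2, "Revenues", 2024, facts))
print(classify(12.0, "OperatingIncomeLoss", 2024, facts))
print(classify(55.0, "Revenues", 2024, facts))
```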
The most dangerous failures were the most plausible ones. Goldman Sachs' $241 billion "cash equivalents" was actually restricted-cash-inclusive — off by $59 billion in economic meaning, but only 0.3% from the wrong concept's actual value. Alphabet's $350 billion revenue matched a different revenue XBRL tag to the dollar. No numeric check, no tolerance window, no human review would catch these. Only a receipt that checks the concept, not just the number.
This is what a receipt system does that nothing else can: it doesn't just tell you a claim is wrong. It tells you how it's wrong — and that classification is the difference between a rounding error and a $59 billion misstatement of liquidity.
Every failure in this matrix is the model's mistake, verified against the SEC's own XBRL data. The receipt protects the analyst who uses AI, not the vendor who sells it.
The receipt is the proof point. Run your own.