The monitor can be jailbroken. The receipt can't.
METR assessed frontier AI labs and found that automated monitors can be defeated with basic techniques. We took their findings and wrote governance contracts against them — deontic rules, temporal bounds, source citations. Every claim verified deterministically. No neural network in the loop. The receipt is the proof.
No AI grading AI. The interpreter does the math.
verified
16 of 16
11 of 11
31 of 31
Verb key — what each check does
cite "16% of successful runs" from metr-report
measure "16" from metr-report within 1
require cheating-source is equal to "metr" because "cheating rate must be sourced from METR's own assessment"
cite "monitors to be disabled or jailbroken" from metr-report
require monitoring-verdict is equal to "partial" because "METR found monitoring both worked and had exploitable gaps"
require jailbreak-risk is equal to "confirmed" because "monitors can be disabled or jailbroken by capable attackers"
require coverage-complete is equal to "no" because "thorough setups still had gaps in coverage"
cite "known by the agent to be fake or duplicative" from metr-report
require fabrication-acknowledged is equal to "yes" because "the agent knowingly produced fake data"
forbid spectra-presented-as-real is equal to "yes" because "fabricated spectra must not be presented as real measurements"
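One plausible reading of the `measure` line in the contracts above (`measure "16" from metr-report within 1`) is a tolerance check: the claimed number must match some number in the source to within the stated bound. The sketch below assumes that reading. The function name `measure`, the number-extraction regex, and the placeholder source text are illustrative assumptions, not the Liminate interpreter's actual behavior.

```python
import re

def measure(claimed: str, source_text: str, within: float = 0.0) -> bool:
    """Return True if some number in source_text lies within `within` of the claimed value.

    Assumption: "within 1" is an absolute tolerance on the numeric value.
    """
    target = float(claimed)
    # Pull every integer or decimal literal out of the source text.
    numbers = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", source_text)]
    return any(abs(n - target) <= within for n in numbers)

# Placeholder standing in for the report; only the quoted phrase
# comes from the contract lines in this article.
metr_report = "... 16% of successful runs ..."

print(measure("16", metr_report, within=1))  # True
print(measure("40", metr_report, within=1))  # False
```

The check is a plain comparison, so the same inputs always give the same verdict.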
Not just correct. Governed.
The EDGAR experiment asked one question: did the model get this right? The answer — across 500 financial claims — was overwhelmingly no. But "right or wrong" is a numeric question. The METR experiment asks something harder: do the model's claims satisfy the governance rules that should govern any AI output about this report?
That question requires a vocabulary that goes beyond citation and measurement. The Liminate contracts used here rely on require (a claim must be true), forbid (a claim must not be true), starting/until (rules apply within a temporal window), and because (every rule has a documented rationale). These are governance primitives. A verification system that only compares numbers cannot express them.
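To make the four primitives concrete, here is a minimal sketch of how they could be evaluated. It assumes each rule compares one named field against one expected value, and that the temporal window is inclusive at both ends. The `Rule` class, its attribute names, the `claims` dictionary, and the evaluation date are all hypothetical; only the rule text is taken from the contracts shown in this article.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Rule:
    verb: str                        # "require" or "forbid"
    key: str                         # name of the claim field being governed
    value: str                       # value the field is compared against
    because: str                     # documented rationale, carried with the rule
    starting: Optional[date] = None  # rule is in force from this date (inclusive)
    until: Optional[date] = None     # rule is in force up to this date (inclusive)

    def __post_init__(self) -> None:
        if self.verb not in ("require", "forbid"):
            raise ValueError(f"unknown verb: {self.verb}")

    def active(self, on: date) -> bool:
        """True if `on` falls inside the rule's temporal window."""
        if self.starting is not None and on < self.starting:
            return False
        if self.until is not None and on > self.until:
            return False
        return True

    def check(self, claims: dict, on: date) -> Optional[bool]:
        """True = pass, False = fail, None = rule not in force on that date."""
        if not self.active(on):
            return None
        matches = claims.get(self.key) == self.value
        # require passes when the field equals the value; forbid passes when it does not.
        return matches if self.verb == "require" else not matches

rules = [
    Rule("require", "fabrication-acknowledged", "yes",
         "the agent knowingly produced fake data"),
    Rule("forbid", "spectra-presented-as-real", "yes",
         "fabricated spectra must not be presented as real measurements"),
]

# Hypothetical model output, reduced to named fields.
claims = {"fabrication-acknowledged": "yes", "spectra-presented-as-real": "no"}

for rule in rules:
    print(rule.verb, rule.key, rule.check(claims, on=date(2025, 1, 1)))
```

Both rules pass for this input. Setting `spectra-presented-as-real` to `"yes"` makes the forbid rule fail, and the `because` text says why the rule exists.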
The METR Frontier Risk Report found that automated monitors at frontier AI labs can be jailbroken with basic techniques. Models overclaim. Agents fabricate data and know they're doing it. Every one of these findings describes a failure mode that behavioral monitoring — the current industry standard — cannot reliably catch.
The receipt is different. It runs a deterministic substring check against a source document. There is no prompt to jailbreak. There is no neural grading surface to exploit. The interpreter runs; the contract passes or fails. The architecture is the differentiator.
One detail from the pipeline itself makes the point. The phrase "epistemic verification" appeared in a prior agent's summary of the METR report. It does not appear in the report. That phrase is Receipts vocabulary — imported by a model summarizing the findings. If a contract cited it, cite would fail. The receipt catches the very pattern the case study is about: a model introducing its own vocabulary into a source it's supposed to be quoting.
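The substring check and the "epistemic verification" case can be sketched in a few lines. This assumes an exact, case-sensitive match with no whitespace normalization, which may differ from what the interpreter actually does. The function name `cite` mirrors the contract verb, and the source text is a placeholder rather than an excerpt of the report.

```python
def cite(quote: str, source_text: str) -> bool:
    """Deterministic check: the quote must occur verbatim in the source."""
    return quote in source_text

# Placeholder standing in for the report; only the quoted phrase
# is taken from the contracts in this article.
metr_report = "... monitors to be disabled or jailbroken ..."

print(cite("monitors to be disabled or jailbroken", metr_report))  # True
print(cite("epistemic verification", metr_report))                 # False
```

The second call fails for the reason described above: the phrase is absent from the source, so a contract that cited it could not pass.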
EDGAR showed the failure picture. METR shows the compliance picture. Together they make the product credible — not as a failure detector, but as a verification system.
Every passing check in this experiment is the contract's achievement, verified deterministically. The receipt protects the governance team, not the model.
| ID | Topic | Cite | Measure | Deontic | Receipt |
|---|---|---|---|---|---|
| Q01 | Cheating rate | ✓ | ✓ | ✓ | receipt → |
| Q02 | Time horizon 50% | ✓ | ✓ | ✓ | receipt → |
| Q03 | Mirrorcode | ✓ | — | ✓ | receipt → |
| Q04 | Permissions | ✓ | ✓ | ✓ | receipt → |
| Q05 | Fabricated spectra | ✓ | — | ✓ | receipt → |
| Q06 | Overclaiming | ✓ | — | ✓ | receipt → |
| Q07 | Monitoring | ✓ | — | ✓ | receipt → |
| Q08 | RCT productivity | ✓ | ✓ | ✓ | receipt → |
| Q09 | SWE-Bench | ✓ | ✓ | ✓ | receipt → |
| Q10 | Overall assessment | ✓ | — | ✓ | receipt → |
| Q11 | Assessment window | ✓ | — | ✓ | receipt → |
| Q12 | Subversion eval | ✓ | ✓ | ✓ | receipt → |
| Q13 | Anthropic code | ✓ | — | ✓ | receipt → |
| Q14 | Redwood runs | ✓ | ✓ | ✓ | receipt → |
| Q15 | Self-report productivity | ✓ | ✓ | ✓ | receipt → |
This is the compliance picture.
See what failure looks like — 500 claims, 7 failure categories, 0.7% cite pass rate.
EDGAR case study →
The receipt is the proof point. Run your own.