Three experiments. Three questions. One receipt system.
The EDGAR experiment asked whether AI gets financial data right. The METR experiment asked whether AI claims satisfy the governance rules that apply to them. The DailyMed experiment asked whether AI knows the safety rules for FDA-approved drugs: not just the numbers, but the warnings that protect patients. Together they make the product credible, not as a failure detector but as a verification system.
No AI grading AI. The interpreter does the math.
The failure picture
The model wasn't hallucinating. It was misbinding.
Read the EDGAR case study →
The compliance picture
The monitor can be jailbroken. The receipt can't.
Read the METR case study →
The safety picture
The model knows the numbers. It doesn't know the rules.
Read the DailyMed case study →
EDGAR is the failure picture. METR is the compliance picture. DailyMed is the safety picture. A system that only catches failure is a failure detector. A system that also confirms compliance, and exposes where the safety rules go unmet, is a verification system. The receipt tells you which one you're dealing with. Every receipt is the same thing: the interpreter checking the AI's work by hand, against the source, with no shortcuts. The receipt protects the person doing the work, not the system that produced it.
The receipt is the proof point. Run your own.