The model knows the numbers. It doesn't know the rules.
We asked an AI about the safety profiles of 15 FDA-approved drugs — including 5 with documented disparities for Black patients. Then we verified every claim against the actual DailyMed label with a deterministic interpreter. 283 verification lines. No neural network in the loop. The receipt is the proof.
No AI grading AI. The interpreter does the math.
15 drugs × 3 contract types.
Each cell is a live receipt, shown as checks passed out of checks run. Green means at least 80% pass, amber 50–79%, red below 50%. The interpreter, not the model, checks every claim against the DailyMed label.
| Drug | A: Safety | B: Adverse events | C: Interactions |
|---|---|---|---|
| warfarin ● | 6/9 | 3/5 | 3/5 |
| metformin ● | 4/7 | 5/8 | 2/4 |
| lisinopril ● | 2/6 | 5/8 | 2/4 |
| sertraline ● | 3/7 | 5/7 | 2/7 |
| amoxicillin | 3/6 | 5/7 | 2/5 |
| atorvastatin | 3/6 | 5/8 | 2/5 |
| levothyroxine ● | 3/6 | 2/5 | 2/4 |
| omeprazole | 3/6 | 5/7 | 4/6 |
| amlodipine | 2/4 | 5/8 | 2/4 |
| gabapentin | 3/6 | 5/8 | 2/6 |
| bidil ★ | 2/6 | 5/8 | 2/7 |
| clopidogrel ● ★ | 3/7 | 4/6 | 3/5 |
| carvedilol ★ | 4/7 | 3/6 | 2/7 |
| nifedipine ★ | 3/6 | 5/7 | 2/5 |
| labetalol ★ | 4/7 | 5/8 | 2/7 |
Each cell links to a live receipt at Receipts. The interpreter does the checking, not the model.
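The colour bands can be reproduced from the table cells alone. The sketch below assumes each cell reads as checks passed over checks run, and that any rate from 50% up to (but not including) 80% counts as amber; the function name `cell_status` is hypothetical and is not part of the receipt system.

```python
from fractions import Fraction


def cell_status(cell: str) -> str:
    """Map a 'passed/total' cell such as '6/9' to green, amber, or red."""
    passed_text, total_text = cell.split("/")
    passed, total = int(passed_text), int(total_text)
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError(f"not a valid passed/total cell: {cell!r}")
    # Exact fractions avoid float rounding right at the 80% and 50% cut-offs.
    rate = Fraction(passed, total)
    if rate >= Fraction(4, 5):
        return "green"
    if rate >= Fraction(1, 2):
        return "amber"
    return "red"


# The warfarin row from the table above.
warfarin = {"A: Safety": "6/9", "B: Adverse events": "3/5", "C: Interactions": "3/5"}
warfarin_status = {column: cell_status(cell) for column, cell in warfarin.items()}
```

Under these assumptions all three warfarin cells land in amber, while a cell such as lisinopril's 2/6 safety contract lands in red.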
What the safety picture shows
The model gets the numbers right. When asked about adverse event rates from clinical trials, the percentages match the FDA label within tolerance. 100% measure pass. The model's training data contains the facts.
The model partially recalls source language. It knows key terms like "hypersensitivity" and "bleeding risk" but doesn't reproduce exact label phrasing. 56.1% cite pass. Partial recall, not fabrication.
The model fails the safety rules. The deontic layer checks obligations: whether a boxed warning that must be included is there, and whether a contraindication that must not be omitted is there. It shows 6.5% pass. The model can recite an adverse event rate but cannot reliably include the warnings that protect patients. In healthcare, that gap is the one that matters.
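To make the three layers concrete, here is a minimal sketch of what deterministic checks of this kind could look like. It is not the interpreter's actual code: the function names, the 0.5-point tolerance, and the example answer and label values are all invented for illustration and do not come from any real drug label.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def check_measure(claimed: float, label_value: float, tolerance: float = 0.5) -> bool:
    """Measure: the claimed number is within a tolerance of the label's number."""
    return abs(claimed - label_value) <= tolerance


def check_cite(answer: str, label_phrase: str) -> bool:
    """Cite: the label's wording appears in the answer."""
    return normalize(label_phrase) in normalize(answer)


def check_deontic(answer: str, required_terms: list[str]) -> bool:
    """Deontic: every term the answer must include is present."""
    text = normalize(answer)
    return all(normalize(term) in text for term in required_terms)


# Invented example: the rate is right, the wording is paraphrased,
# and the required warning is missing.
answer = "Nausea occurred in 26% of patients. Bleeding risk is elevated."
measure_ok = check_measure(claimed=26.0, label_value=25.8)
cite_ok = check_cite(answer, "increased risk of bleeding")
deontic_ok = check_deontic(answer, ["boxed warning"])
```

In this made-up case the measure check passes while the cite and deontic checks fail, which is the shape of the DailyMed result: the number is there, the wording drifts, and the required warning is absent. None of the checks calls a model.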
| Case study | Domain | Cite | Measure | Deontic |
|---|---|---|---|---|
| EDGAR | Finance | 0.7% | 34.7% | — |
| METR | AI Safety | 100% | 100% | 100% |
| DailyMed | Healthcare | 56.1% | 100% | 6.5% |
The spread is the whole point. A single accuracy score would average these into one number and hide the failure that matters. The receipt keeps the layers apart: it tells you the model knew the rate, half-remembered the wording, and missed the rule. Only the deontic layer asks the question a patient cares about — was the warning there?
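The arithmetic behind that claim is short. The sketch below takes the three DailyMed layer rates from the table and blends them with a plain unweighted mean; the unweighted mean is an assumption made for illustration, since a real blended score would depend on how many checks sit in each layer.

```python
from statistics import mean

# DailyMed pass rates per layer, in percent, from the case-study table.
layers = {"cite": 56.1, "measure": 100.0, "deontic": 6.5}

# One blended number: roughly 54%, which reads as a mediocre but passable score.
blended = mean(layers.values())

# Keeping the layers apart shows where the failure actually is.
weakest_layer = min(layers, key=layers.get)
gap = layers["measure"] - layers["deontic"]

print(f"blended: {blended:.1f}%")
print(f"weakest layer: {weakest_layer} at {layers[weakest_layer]}%")
print(f"measure minus deontic: {gap:.1f} points")
```

The blended figure sits in the middle of the range and gives no hint that one layer is near zero, which is the failure the layered receipt is meant to expose.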
Every missing warning in this experiment is the model's omission, verified against FDA-approved labeling. The receipt protects the patient, not the platform.
Five drugs, same pattern.
No advantage, no special failure mode: the model treats the five drugs with documented disparities for Black patients the same as everything else.
Three experiments. Three verification pictures. One receipt system.
EDGAR is the failure picture. METR is the compliance picture. DailyMed is the safety picture.
The receipt is the proof point. Run your own.