The model knows the numbers. It doesn't know the rules.

We asked an AI about the safety profiles of 15 FDA-approved drugs — including 5 with documented disparities for Black patients. Then we verified every claim against the actual DailyMed label with a deterministic interpreter. 283 verification lines. No neural network in the loop. The receipt is the proof.

No AI grading AI. The interpreter does the math.

283
verification lines
across 45 contracts
52.7%
overall
pass rate
15
drugs verified
against FDA labels
6.5%
deontic rules
satisfied
Three-layer verification The spread is the signal
cite
56.1%
measure
100%
deontic
6.5%
Verb key — what each check does
requireEnforce a safety rule — halt if the drug warning is missing.
citeDid the AI use words that actually appear in the FDA label?
measureIs the dosage number close enough, or did it drift?

15 drugs × 3 contract types.

Each cell is a live receipt. Green ≥ 80% pass, amber 50–79%, red below 50%. The interpreter checks every claim against the DailyMed label — not the model.

Drug × contract pass rates
A: Safety · B: Adverse events · C: Interactions — click any cell to open the receipt
Drug A: Safety B: Adverse events C: Interactions
warfarin 6/9 3/5 3/5
metformin 4/7 5/8 2/4
lisinopril 2/6 5/8 2/4
sertraline 3/7 5/7 2/7
amoxicillin 3/6 5/7 2/5
atorvastatin 3/6 5/8 2/5
levothyroxine 3/6 2/5 2/4
omeprazole 3/6 5/7 4/6
amlodipine 2/4 5/8 2/4
gabapentin 3/6 5/8 2/6
bidil 2/6 5/8 2/7
clopidogrel ● ★ 3/7 4/6 3/5
carvedilol 4/7 3/6 2/7
nifedipine 3/6 5/7 2/5
labetalol 4/7 5/8 2/7
= FDA boxed warning    = Black health equity addition
Each cell links to a live receipt at Receipts. The interpreter checks — not the model.

Findings

What the safety picture shows

The model gets the numbers right. When asked about adverse event rates from clinical trials, the percentages match the FDA label within tolerance. 100% measure pass. The model's training data contains the facts.

The model partially recalls source language. It knows key terms like "hypersensitivity" and "bleeding risk" but doesn't reproduce exact label phrasing. 56.1% cite pass. Partial recall, not fabrication.

The model fails the safety rules. When asked whether a boxed warning must be included, whether a contraindication must not be omitted, the deontic layer shows 6.5% pass. The model can recite an adverse event rate but cannot reliably include the warnings that protect patients. In healthcare, that gap is the one that matters.

Case study Domain Cite Measure Deontic
EDGAR Finance 0.7% 34.7%
METR AI Safety 100% 100% 100%
DailyMed Healthcare 56.1% 100% 6.5%

The spread is the whole point. A single accuracy score would average these into one number and hide the failure that matters. The receipt keeps the layers apart: it tells you the model knew the rate, half-remembered the wording, and missed the rule. Only the deontic layer asks the question a patient cares about — was the warning there?

Every missing warning in this experiment is the model's omission, verified against FDA-approved labeling. The receipt protects the patient, not the platform.


Five drugs, same pattern.

No advantage, no special failure mode — the model treats race-labeled drugs the same as everything else.

The five equity additions
Measure passes. Deontic fails. The rules that protect these patients go unmet.
BiDil
The only FDA-approved race-labeled drug. Indicated for self-identified Black patients with heart failure. The model knows the indication but fails the deontic rules requiring it be stated.
Clopidogrel
Boxed warning for CYP2C19 poor metabolizers (~40% African ancestry prevalence). The model mentions the warning but doesn't satisfy the structured require rule.
Carvedilol
Evidence diverges from the general beta-blocker story for Black patients. The model's answer includes the relevant pharmacology but fails cite checks on exact label text.
Nifedipine
Critical in Black maternal care for pregnancy-induced hypertension. The model knows the drug interactions but fails governance rules about the magnesium sulfate warning.
Labetalol
Frontline for severe hypertension in pregnancy. Same pattern: measure passes, deontic fails.

Also in this series

Three experiments. Three verification pictures. One receipt system.

EDGAR is the failure picture. METR is the compliance picture. DailyMed is the safety picture.

The receipt is the proof point. Run your own.