DailyMed FDA Drug Labels × 15 Drugs × 45 Contracts × 3 Verification Layers

The model knows the numbers. It doesn't know the rules.

We asked an AI about the safety profiles of 15 FDA-approved drugs — including 5 with documented disparities for Black patients. Then we verified every claim against the actual DailyMed label with a deterministic interpreter. 283 verification lines. No neural network in the loop. The receipt is the proof.

No AI grading AI. The interpreter does the math.

283 verification lines across 45 contracts

52.7% overall pass rate

15 drugs verified against FDA labels

6.5% of deontic rules satisfied

Three-layer verification

The spread is the signal.

Layer	Pass rate
cite	56.1%
measure	100%
deontic	6.5%

Verb key: what each check does

require: Enforce a safety rule; halt if the drug warning is missing.

cite: Did the AI use words that actually appear in the FDA label?

measure: Is the dosage number close enough, or did it drift?

3 of 25 verbs shown. Full vocabulary at liminate.dev.
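To make the three verbs concrete, here is a minimal sketch of what checks like these could look like. It is not the liminate.dev interpreter: the function signatures, the `RequireFailed` exception, the substring matching, and the absolute tolerance are all assumptions made for illustration, and the label text is an invented placeholder rather than a DailyMed quote.

```python
import re


class RequireFailed(Exception):
    """Raised when a required warning is absent, halting verification."""


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so label line breaks do not matter.
    return re.sub(r"\s+", " ", text).strip().lower()


def cite(phrase: str, label_text: str) -> bool:
    """Pass only if the phrase literally appears in the label text."""
    return _normalize(phrase) in _normalize(label_text)


def measure(claimed: float, label_value: float, tolerance: float) -> bool:
    """Pass if the claimed number is within an absolute tolerance of the label's."""
    return abs(claimed - label_value) <= tolerance


def require(warning: str, answer: str) -> None:
    """Halt (raise) if the answer omits the required warning phrase."""
    if _normalize(warning) not in _normalize(answer):
        raise RequireFailed(f"missing required warning: {warning!r}")


# Invented placeholder text, NOT quoted from any DailyMed label.
LABEL = "WARNING: BLEEDING RISK\nDrug X can cause major bleeding."
ANSWER = "Drug X carries a bleeding risk; 12.4% of patients reported nausea."

print(cite("bleeding risk", LABEL))         # True
print(cite("liver injury", LABEL))          # False
print(measure(12.4, 12.0, tolerance=0.5))   # True
require("bleeding risk", ANSWER)            # passes silently
```

All three checks are plain string and arithmetic operations, which is what makes them deterministic: the same answer and the same label always produce the same receipt.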

The verification matrix

15 drugs × 3 contract types.

Each cell is a live receipt, shown as checks passed over checks run. Green ≥ 80% pass, amber 50–79%, red below 50%. The interpreter, not the model, checks every claim against the DailyMed label.
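The colour bands can be sketched as a small function. The name `band` is hypothetical, and treating everything from 50% up to (but not including) 80% as amber is an assumption about how the boundaries are handled.

```python
def band(passed: int, total: int) -> str:
    """Map a passed/total cell to the colour bands described above."""
    if total <= 0:
        raise ValueError("a cell needs at least one verification line")
    # Integer comparison avoids floating-point edge cases at the boundaries.
    if passed * 100 >= 80 * total:
        return "green"
    if passed * 100 >= 50 * total:
        return "amber"
    return "red"


print(band(6, 9))  # amber (66.7%)
print(band(3, 6))  # amber (exactly 50%)
print(band(2, 6))  # red (33.3%)
print(band(4, 5))  # green (exactly 80%)
```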

Drug × contract pass rates

A: Safety · B: Adverse events · C: Interactions

Drug	A: Safety	B: Adverse events	C: Interactions
warfarin ●	6/9	3/5	3/5
metformin ●	4/7	5/8	2/4
lisinopril ●	2/6	5/8	2/4
sertraline ●	3/7	5/7	2/7
amoxicillin	3/6	5/7	2/5
atorvastatin	3/6	5/8	2/5
levothyroxine ●	3/6	2/5	2/4
omeprazole	3/6	5/7	4/6
amlodipine	2/4	5/8	2/4
gabapentin	3/6	5/8	2/6

BiDil ★	2/6	5/8	2/7
clopidogrel ● ★	3/7	4/6	3/5
carvedilol ★	4/7	3/6	2/7
nifedipine ★	3/6	5/7	2/5
labetalol ★	4/7	5/8	2/7

● = FDA boxed warning · ★ = Black health equity addition
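The headline figures can be reproduced from the matrix itself. The sketch below transcribes the cells from the table and sums them; the variable and function names are illustrative only.

```python
# Cells transcribed from the table above: "passed/total" for contracts A, B, C.
MATRIX = {
    "warfarin": "6/9 3/5 3/5",
    "metformin": "4/7 5/8 2/4",
    "lisinopril": "2/6 5/8 2/4",
    "sertraline": "3/7 5/7 2/7",
    "amoxicillin": "3/6 5/7 2/5",
    "atorvastatin": "3/6 5/8 2/5",
    "levothyroxine": "3/6 2/5 2/4",
    "omeprazole": "3/6 5/7 4/6",
    "amlodipine": "2/4 5/8 2/4",
    "gabapentin": "3/6 5/8 2/6",
    "bidil": "2/6 5/8 2/7",
    "clopidogrel": "3/7 4/6 3/5",
    "carvedilol": "4/7 3/6 2/7",
    "nifedipine": "3/6 5/7 2/5",
    "labetalol": "4/7 5/8 2/7",
}
CONTRACTS = ("A: Safety", "B: Adverse events", "C: Interactions")


def contract_totals(matrix):
    """Sum passed and total verification lines per contract column."""
    totals = [[0, 0] for _ in CONTRACTS]
    for cells in matrix.values():
        for i, cell in enumerate(cells.split()):
            passed, total = (int(n) for n in cell.split("/"))
            totals[i][0] += passed
            totals[i][1] += total
    return totals


per_contract = contract_totals(MATRIX)
passed = sum(p for p, _ in per_contract)
lines = sum(t for _, t in per_contract)

print(lines, passed, round(100 * passed / lines, 1))  # 283 149 52.7
for name, (p, t) in zip(CONTRACTS, per_contract):
    print(name, f"{p}/{t}", round(100 * p / t, 1))
# A: Safety 48/96 50.0
# B: Adverse events 67/106 63.2
# C: Interactions 34/81 42.0
```

The 283 lines and the 52.7% overall pass rate both fall out of the cell counts, and no individual cell reaches the 80% green band.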

Findings

What the safety picture shows

The model gets the numbers right. When asked about adverse event rates from clinical trials, the percentages match the FDA label within tolerance. 100% measure pass. The model's training data contains the facts.

The model partially recalls source language. It knows key terms like "hypersensitivity" and "bleeding risk" but doesn't reproduce exact label phrasing. 56.1% cite pass. Partial recall, not fabrication.

The model fails the safety rules. The deontic layer checks whether a required boxed warning is included and whether a contraindication has been omitted; only 6.5% of those rules pass. The model can recite an adverse event rate but cannot reliably include the warnings that protect patients. In healthcare, that gap is the one that matters.

Case study	Domain	Cite	Measure	Deontic
EDGAR	Finance	0.7%	34.7%	—
METR	AI Safety	100%	100%	100%
DailyMed	Healthcare	56.1%	100%	6.5%

The spread is the whole point. A single accuracy score would average these into one number and hide the failure that matters. The receipt keeps the layers apart: it tells you the model knew the rate, half-remembered the wording, and missed the rule. Only the deontic layer asks the question a patient cares about — was the warning there?
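A short sketch shows what averaging would hide. The `summarize` helper is hypothetical, and the unweighted mean used here is only one way to collapse the layers (the 52.7% headline figure is weighted by verification lines instead); the percentages are the ones in the table above.

```python
# Layer pass rates from the case-study table; None marks a layer not reported.
CASE_STUDIES = {
    "EDGAR": {"cite": 0.7, "measure": 34.7, "deontic": None},
    "METR": {"cite": 100.0, "measure": 100.0, "deontic": 100.0},
    "DailyMed": {"cite": 56.1, "measure": 100.0, "deontic": 6.5},
}


def summarize(layers):
    """Return (unweighted mean, spread, weakest layer) over reported layers."""
    reported = {name: rate for name, rate in layers.items() if rate is not None}
    mean = round(sum(reported.values()) / len(reported), 1)
    spread = round(max(reported.values()) - min(reported.values()), 1)
    weakest = min(reported, key=reported.get)
    return mean, spread, weakest


for study, layers in CASE_STUDIES.items():
    print(study, summarize(layers))
# EDGAR (17.7, 34.0, 'cite')
# METR (100.0, 0.0, 'cite')
# DailyMed (54.2, 93.5, 'deontic')
```

For DailyMed, a single score of about 54% looks like a mediocre but usable result; the 93.5-point spread and the weakest layer are what reveal that the failure is concentrated in the safety rules.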

Every missing warning in this experiment is the model's omission, verified against FDA-approved labeling. The receipt protects the patient, not the platform.

Black health equity

Five drugs, same pattern.

No advantage, no special failure mode — the model treats race-labeled drugs the same as everything else.
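One way to check this against the matrix is to pool the verification lines for the five ★ drugs and compare them with the other ten. This is a sketch using the published cell counts only; it does not break results down by layer and it is not a significance test.

```python
# "passed/total" cells for contracts A, B, C, transcribed from the matrix.
MATRIX = {
    "warfarin": "6/9 3/5 3/5", "metformin": "4/7 5/8 2/4",
    "lisinopril": "2/6 5/8 2/4", "sertraline": "3/7 5/7 2/7",
    "amoxicillin": "3/6 5/7 2/5", "atorvastatin": "3/6 5/8 2/5",
    "levothyroxine": "3/6 2/5 2/4", "omeprazole": "3/6 5/7 4/6",
    "amlodipine": "2/4 5/8 2/4", "gabapentin": "3/6 5/8 2/6",
    "bidil": "2/6 5/8 2/7", "clopidogrel": "3/7 4/6 3/5",
    "carvedilol": "4/7 3/6 2/7", "nifedipine": "3/6 5/7 2/5",
    "labetalol": "4/7 5/8 2/7",
}
EQUITY = {"bidil", "clopidogrel", "carvedilol", "nifedipine", "labetalol"}


def pooled(drugs):
    """Pool passed and total verification lines across the given drugs."""
    passed = total = 0
    for drug in drugs:
        for cell in MATRIX[drug].split():
            p, t = (int(n) for n in cell.split("/"))
            passed += p
            total += t
    return passed, total


eq_passed, eq_total = pooled(EQUITY)
ot_passed, ot_total = pooled(set(MATRIX) - EQUITY)

print(eq_passed, eq_total, round(100 * eq_passed / eq_total, 1))  # 49 99 49.5
print(ot_passed, ot_total, round(100 * ot_passed / ot_total, 1))  # 100 184 54.3
```

The two groups land within about five points of each other on fewer than 100 lines for the equity drugs, so the matrix shows no obvious separate failure mode for them.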

The five equity additions

Measure passes. Deontic fails. The rules that protect these patients go unmet.

BiDil

The only FDA-approved race-labeled drug. Indicated for self-identified Black patients with heart failure. The model knows the indication but fails the deontic rules requiring it be stated.

Clopidogrel

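Boxed warning for CYP2C19 poor metabolizers. Roughly 40% of people of African ancestry are reported to carry at least one reduced-function CYP2C19 allele, though far fewer are poor metabolizers. The model mentions the warning but doesn't satisfy the structured require rule.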

Carvedilol

Evidence diverges from the general beta-blocker story for Black patients. The model's answer includes the relevant pharmacology but fails cite checks on exact label text.

Nifedipine

Critical in Black maternal care for pregnancy-induced hypertension. The model knows the drug interactions but fails governance rules about the magnesium sulfate warning.

Labetalol

Frontline for severe hypertension in pregnancy. Same pattern: measure passes, deontic fails.

Also in this series

Three experiments. Three verification pictures. One receipt system.

EDGAR is the failure picture. METR is the compliance picture. DailyMed is the safety picture.


The receipt is the proof point. Run your own.
