DiffDx — Benchmarking Claude on Medical Differential Diagnosis

Results

Headline metrics

36 cases evaluated · claude-sonnet-4-6 · DDXPlus test split · May 2026

Top-1 Accuracy

52.8%

The model's first guess matched the ground-truth diagnosis in just over half of cases.

Top-5 Accuracy

97.2%

The correct diagnosis appeared somewhere in the model's top-5 differential on almost every case.

Mean Reciprocal Rank

0.67

MRR averages 1/rank of the correct diagnosis over all cases; a value of 0.67 means the correct answer is typically ranked 1st or 2nd when it appears.

Cases Evaluated

36/50

14 cases were lost to API session rate-limiting — an infrastructure issue, not a model failure.

All evaluation metrics — claude-sonnet-4-6 on DDXPlus test split

Precision and Recall measure overlap between the model's full differential list and the ground-truth differential set. They are lower here because the model returns a list of up to 10 pathologies while GT sets vary widely in size (1–23 conditions).
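As a rough illustration of how the five per-case metrics relate, the sketch below scores one ranked differential against a ground-truth label and a ground-truth set. The function names and the dictionary keys are assumptions made for this example, not the project's actual code; averaging reciprocal_rank over all cases gives the MRR.

```python
from typing import Optional, Sequence, Set


def gt_rank(predicted: Sequence[str], truth: str) -> Optional[int]:
    """1-based position of the ground-truth pathology in the ranked list, or None."""
    for position, name in enumerate(predicted, start=1):
        if name == truth:
            return position
    return None


def score_case(predicted: Sequence[str], truth: str, gt_set: Set[str]) -> dict:
    """Score one case: top-1, top-5, reciprocal rank, precision, recall."""
    rank = gt_rank(predicted, truth)
    unique_predictions = set(predicted)
    overlap = len(unique_predictions & gt_set)
    return {
        "top1": rank == 1,
        "top5": rank is not None and rank <= 5,
        "reciprocal_rank": 0.0 if rank is None else 1.0 / rank,
        # Fraction of predicted pathologies that are in the GT set.
        "precision": overlap / len(unique_predictions) if unique_predictions else 0.0,
        # Fraction of GT pathologies that were predicted.
        "recall": overlap / len(gt_set) if gt_set else 0.0,
    }


# Shaped like a single-entry GT case with the label ranked 3rd of 10 predictions.
# Only "Acute laryngitis" and "Larygospasm" are real names; the rest are placeholders.
predicted = ["Acute laryngitis", "placeholder_2", "Larygospasm"]
predicted += [f"placeholder_{i}" for i in range(4, 11)]
result = score_case(predicted, "Larygospasm", {"Larygospasm"})
```

With a ten-item prediction list and a one-item GT set, precision cannot exceed 0.1 even when recall is 1.0, which is the size mismatch described above.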

Analysis

Key findings

Three patterns stand out from the 36 evaluated cases.

Specificity bias: 44% of soft misses are too precise, not wrong

In 7 of 16 soft misses (cases where the correct diagnosis was in the top 5 but not ranked first), the model ranked a more specific diagnosis first, choosing Influenza over URTI, or Rhinosinusitis over Bronchitis. The correct answer appeared at rank 2–3, just below the top spot. This is a systematic calibration issue, not random error: the model applies strong priors toward mechanistically specific conditions that the benchmark's umbrella labels don't reward.

Only 1 hard miss — and it may be a benchmark problem

Across 36 cases, the correct diagnosis fell outside the top-5 just once: a 2-year-old where the model chose Bronchiolitis (the textbook infant diagnosis) but the benchmark labelled it Bronchitis. DDXPlus appears to lack age-stratified pathology labels, meaning the model's clinically sound age-appropriate reasoning was penalised by an incomplete ground-truth differential.

Top-5 accuracy is near-perfect: the knowledge is there

A top-5 accuracy of 97.2% (35 of 36 cases) means the model's differential included the correct diagnosis in all but one case. In a clinical-decision-support setting, where a clinician reviews the full list rather than only the top pick, that coverage is the property that matters most. The remaining gap is in ranking confidence, not knowledge coverage.
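The hit, soft-miss, and hard-miss buckets used in these findings can be expressed as a function of the ground-truth rank. The sketch below assumes a soft miss means rank 2 to 5 and a hard miss means outside the top 5. The 19 hits are inferred from 52.8% of 36 rather than stated in the write-up, and the soft-miss ranks are placeholders, since only the bucket counts matter here.

```python
from collections import Counter
from typing import Optional


def classify(rank: Optional[int], k: int = 5) -> str:
    """Map the ground-truth rank to an outcome bucket."""
    if rank == 1:
        return "hit"
    if rank is not None and rank <= k:
        return "soft_miss"
    return "hard_miss"


# 19 hits (inferred), 16 soft misses (placeholder rank 2), 1 hard miss at rank 9.
ranks = [1] * 19 + [2] * 16 + [9]
counts = Counter(classify(r) for r in ranks)

top1 = counts["hit"] / len(ranks)
top5 = (counts["hit"] + counts["soft_miss"]) / len(ranks)
```

Under these assumptions the bucket counts reproduce the headline figures: 19/36 rounds to 52.8% and 35/36 rounds to 97.2%.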

Methodology

How the benchmark works

A transparent description of the dataset, prompting strategy, and scoring.

Dataset: DDXPlus

DDXPlus is a synthetic differential diagnosis dataset published by researchers at Université de Montréal. It contains 304,000 simulated patient cases across 49 pathologies, with each case specifying age, sex, an initial complaint, and a set of binary/multi-value symptom evidences. Ground-truth labels include both a single confirmed pathology and a probability-weighted differential diagnosis list.

This evaluation used the test split
50 cases loaded; 36 successfully evaluated
All 49 canonical pathology names were included in every prompt

Prompting strategy

Each case was converted to a plain-English clinical vignette and sent to Claude via the Claude Agent SDK (OAuth / Pro subscription — no paid API credits). The system prompt instructed the model to:

Return a ranked differential of up to 10 pathologies
Use only exact strings from the 49-item canonical list
Assign a 0–1 probability and one-sentence rationale per entry
Output strict JSON — no markdown, no commentary
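To make the output contract above concrete, here is a minimal sketch of parsing and shape-checking one response. The field names "pathology", "probability", and "rationale" are assumptions for illustration, as the write-up does not give the exact schema, and the function name is hypothetical. Matching names against the canonical list is left to the separate normalization step.

```python
import json
from typing import List


def parse_differential(raw: str, max_items: int = 10) -> List[dict]:
    """Parse a strict-JSON differential and keep only well-formed entries."""
    # json.loads raises json.JSONDecodeError (a ValueError) if the reply
    # contains markdown fences or commentary around the JSON.
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of differential entries")

    cleaned: List[dict] = []
    for entry in entries:
        if len(cleaned) == max_items:
            break
        if not isinstance(entry, dict):
            continue
        name = entry.get("pathology")      # assumed field name
        prob = entry.get("probability")    # assumed field name
        if not isinstance(name, str) or not name.strip():
            continue
        if isinstance(prob, bool) or not isinstance(prob, (int, float)):
            continue
        if not 0.0 <= prob <= 1.0:
            continue
        cleaned.append({
            "pathology": name.strip(),
            "probability": float(prob),
            "rationale": str(entry.get("rationale", "")),
        })
    return cleaned


raw = json.dumps([
    {"pathology": "Influenza", "probability": 0.6, "rationale": "Cough with myalgia."},
    {"pathology": "URTI", "probability": 0.3, "rationale": "Non-specific upper airway symptoms."},
    {"pathology": "Pneumonia", "probability": 1.5, "rationale": "Out-of-range probability."},
])
differential = parse_differential(raw)
```

Dropping malformed entries rather than failing the whole case is one possible design choice; a stricter pipeline might reject the response and retry instead.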

Evaluation pipeline

Load case → Format vignette → Claude Sonnet 4.6 → Parse JSON differential → Normalize names → Score 5 metrics → Write JSONL

Pathology names are fuzzy-matched against the canonical list using exact → substring → difflib (cutoff 0.7) fallback, so minor casing or punctuation variations don't count as misses. Metrics: Top-1 accuracy, Top-5 accuracy, Mean Reciprocal Rank (MRR), Differential Precision (fraction of predicted pathologies in GT set), and Differential Recall (fraction of GT pathologies found in predictions).
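The exact → substring → difflib fallback described above might look roughly like the following. This is a sketch under assumptions: the function name, the case-insensitive comparison, and the choice of difflib.get_close_matches for the last step are illustrative, and only the 0.7 cutoff comes from the description.

```python
import difflib
from typing import Optional, Sequence


def normalize_name(raw: str, canonical: Sequence[str], cutoff: float = 0.7) -> Optional[str]:
    """Map a model-produced name onto the canonical list, or None if nothing matches."""
    lowered = {name.lower(): name for name in canonical}
    key = raw.strip().lower()
    if not key:
        return None

    # 1. Exact match, ignoring case and surrounding whitespace.
    if key in lowered:
        return lowered[key]

    # 2. Substring match in either direction (first canonical hit wins).
    for low, name in lowered.items():
        if key in low or low in key:
            return name

    # 3. Closest canonical name by difflib similarity, if it clears the cutoff.
    close = difflib.get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None


# Four of the 49 canonical names, for brevity.
canonical = ["URTI", "Influenza", "Pneumonia", "Bronchitis"]
exact = normalize_name("urti", canonical)
substring = normalize_name("Pneumonia.", canonical)
fuzzy = normalize_name("Influensa", canonical)
unmatched = normalize_name("xyz", canonical)
```

Two caveats with this kind of matcher: the substring step returns the first hit when several canonical names contain the same fragment, and a fuzzy cutoff can merge distinct but similarly spelled names, so logging every non-exact match is probably worthwhile.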

Case Studies

Failure browser

Four specific cases (three soft misses and the only hard miss), annotated with the model's reasoning and why it got it wrong, or why the benchmark might be at fault.

Soft Miss Case 4 — URTI vs. Influenza (specificity bias)

70-year-old female · GT rank: 3 · MRR: 0.33

Patient: 70-year-old female
Initial complaint: Do you have a cough?
Ground truth: URTI
Model's top-1: Influenza
GT rank in differential: 3rd (MRR = 0.33)
Top-5 correct: Yes ✓
GT differential set: URTI, Influenza, Bronchitis, HIV (initial infection), Pneumonia, Tuberculosis, Chronic rhinosinusitis, Acute rhinosinusitis, Chagas

Why this happened: The symptom profile — cough, diffuse muscle pain, possible fever — is a near-perfect Influenza presentation. The model applied a strong and clinically defensible prior toward the more specific diagnosis. The DDXPlus label is the broader umbrella term URTI, which the benchmark treats as "more correct" despite Influenza also appearing in the GT differential. This is the prototypical specificity-bias failure: the model knows the right answer (it's rank 3), but ranks a more mechanistically precise alternative above it.

Soft Miss Case 21 — Spontaneous pneumothorax vs. NSTEMI/STEMI

67-year-old female · GT rank: 4 · MRR: 0.25

Patient: 67-year-old female
Initial complaint: Chest pain even at rest
Ground truth: Spontaneous pneumothorax
Model's top-1: Possible NSTEMI / STEMI
GT rank in differential: 4th (MRR = 0.25)
Top-5 correct: Yes ✓
GT differential set (ordered by benchmark priority): Unstable angina, Stable angina, Possible NSTEMI / STEMI, GERD, Pericarditis, Atrial fibrillation, Spontaneous pneumothorax

Why this happened: Notice that the GT differential itself lists three cardiac diagnoses (Unstable angina, Stable angina, NSTEMI/STEMI) above pneumothorax. The model chose one of those cardiac diagnoses as its top answer — a choice that is internally consistent with the benchmark's own ordering. This is arguably a benchmark design issue: the single "ground truth" label (pneumothorax) sits 7th in the GT differential, yet the model is penalised for ranking a higher-priority differential member first.

Soft Miss Case 14 — Larygospasm vs. Acute laryngitis

56-year-old female · GT rank: 3 · MRR: 0.33

Patient: 56-year-old female
Initial complaint: High-pitched sound when breathing in
Ground truth: Larygospasm
Model's top-1: Acute laryngitis
GT rank in differential: 3rd (MRR = 0.33)
Top-5 correct: Yes ✓
GT differential set: Larygospasm (only entry)

Why this happened: The GT differential contains exactly one condition (Larygospasm), making this a pass/fail case with no partial credit. Inspiratory stridor — a high-pitched breathing sound — is clinically associated with both laryngospasm and laryngitis; they share overlapping presentations. The model ranked Larygospasm third. With a single-entry GT differential, any non-exact top-1 is automatically a soft miss regardless of how reasonable the alternatives are. This case illustrates a structural limitation in how single-condition GT differentials penalise otherwise well-calibrated models.

Hard Miss Case 1 — Bronchitis vs. Bronchiolitis (the only hard miss)

2-year-old male · GT rank: 9 · MRR: 0.11 · Arguably a benchmark error

Patient: 2-year-old male
Initial complaint: Pain related to reason for consulting
Ground truth: Bronchitis
Model's top-1: Bronchiolitis
GT rank in differential: 9th (MRR = 0.11)
Top-5 correct: No ✗
GT differential (23 conditions; Bronchitis is the ground-truth label): Bronchospasm / acute asthma, Influenza, Viral pharyngitis, Allergic sinusitis, Pneumonia, Bronchitis, Spontaneous pneumothorax, Tuberculosis, URTI, Myocarditis, Anaphylaxis, Acute laryngitis, Guillain-Barré syndrome, Croup, Atrial fibrillation, +8 more

Why this is the most interesting failure — and why it may not be one:

The model chose Bronchiolitis, which does not appear in the DDXPlus pathology list at all. Bronchiolitis is the most common lower respiratory tract infection in children under 2 years old — it is literally the textbook diagnosis for this age and presentation. The model applied a well-established clinical prior. The benchmark labelled it Bronchitis, a large-airway inflammation more typical in older children and adults.

The core problem: DDXPlus's 49-condition vocabulary does not include Bronchiolitis, so the benchmark cannot reward age-stratified reasoning. The model's top pick was both clinically sound and outside the allowable label set — an invisible ceiling imposed by the dataset's design, not a failure of clinical reasoning.

Limitations & Future Work

What this benchmark doesn't tell you

Honest accounting of the constraints on these results.

Synthetic data only

DDXPlus cases are algorithmically generated from probabilistic disease models — not real patient records. Symptom co-occurrence reflects the simulator's priors, not the full complexity of real presentations. Performance on synthetic data does not imply equivalent performance on real clinical notes.

Small, incomplete sample

36 evaluated cases is a statistically thin sample. 14 additional cases were lost to API session rate-limiting, and the error pattern was non-random (all cases 36–49), which may introduce selection bias. Confidence intervals on these metrics would be wide.
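To put a number on "wide", the sketch below computes a 95% Wilson score interval for the two accuracy figures. The Wilson interval is one reasonable choice for small samples, not something the write-up itself reports, and the count of 19 top-1 hits is inferred from 52.8% of 36.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


top1_low, top1_high = wilson_interval(19, 36)  # top-1: 19 of 36 (inferred count)
top5_low, top5_high = wilson_interval(35, 36)  # top-5: 35 of 36
```

Under these assumptions the top-1 interval is roughly 37% to 68% and the top-5 interval roughly 86% to 99.5%. The interval also assumes the 36 cases are a random sample, which the non-random loss of the later cases calls into question.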

Single model, single prompt

Only Claude Sonnet 4.6 with one prompting strategy was tested. No comparison to other models (GPT-4, Gemini, Med-PaLM), no ablation on prompt variants (chain-of-thought, few-shot examples, structured reasoning), and no specialist-physician baseline.

Benchmark label quality

As shown in Case 1 (Bronchiolitis) and Case 21 (pneumothorax), DDXPlus's ground-truth labels have known limitations: the 49-condition vocabulary excludes some clinically important diagnoses, and single confirmed-pathology labels don't capture the ambiguity of differential diagnosis in practice.

Backoff + checkpointing. Add exponential retry on rate-limit errors and mid-run checkpointing so that large runs (200–500 cases) can resume without losing progress.

Multi-model comparison. Run identical cases through Claude Haiku, Claude Opus, and an open-weight medical model (e.g. BioMistral) to produce a head-to-head accuracy table.

Prompt ablation. Test chain-of-thought vs. zero-shot vs. few-shot prompting to measure whether structured reasoning improves top-1 accuracy without harming top-5.

Stratified analysis. Break results down by age group, sex, and pathology class to expose where the specificity-bias failure mode is most pronounced.
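For the backoff and checkpointing item above, a minimal sketch follows. RateLimitError is a hypothetical stand-in for whatever exception the client actually raises, and the "case_id" field, the function names, and the JSONL layout are likewise assumptions made for illustration.

```python
import json
import random
import time
from pathlib import Path
from typing import Callable, Iterable


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


def with_backoff(call: Callable[[], dict], max_retries: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry `call` on RateLimitError, doubling the delay each attempt, plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")


def run_with_checkpoint(cases: Iterable[dict], evaluate: Callable[[dict], dict],
                        out_path: Path) -> int:
    """Append one JSONL record per case, skipping case ids already on disk."""
    done = set()
    if out_path.exists():
        with out_path.open() as fh:
            done = {json.loads(line)["case_id"] for line in fh if line.strip()}

    written = 0
    with out_path.open("a") as fh:
        for case in cases:
            if case["case_id"] in done:
                continue
            record = with_backoff(lambda: evaluate(case))
            fh.write(json.dumps({"case_id": case["case_id"], **record}) + "\n")
            fh.flush()  # persist each case so an interrupted run can resume
            written += 1
    return written
```

Because each result is flushed as soon as it is scored, rerunning the same command after an interruption only evaluates the cases that are not yet in the output file.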
