An independent evaluation of Claude Sonnet 4.6 on 36 patient cases drawn from the DDXPlus synthetic dataset. We measure how well the model ranks the correct diagnosis within a differential of up to ten candidates.
This benchmark uses entirely synthetic, artificial patient data. Results describe model behaviour on a controlled dataset and carry no implications for real-world clinical performance. Do not use these findings to inform any medical decision. Consult a qualified healthcare professional for any medical concern.
36 cases evaluated · claude-sonnet-4-6 · DDXPlus test split · May 2026
The model's first guess matched the ground-truth diagnosis in just over half of cases (19 of 36, which is what remains after the 16 soft misses and one hard miss).
The correct diagnosis appeared somewhere in the model's top-5 differential in 35 of 36 cases (97.2%).
MRR of 0.67 means the correct answer is typically ranked 1st–2nd when it appears.
14 cases were lost to API session rate-limiting — an infrastructure issue, not a model failure.
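Precision and Recall measure overlap between the model's full differential list and the ground-truth (GT) differential set. Both are lower here because the model generates a fixed-length list of 10 while GT sets vary widely in size (1–23 conditions): against a single-condition GT set, a 10-item list cannot exceed a precision of 0.1, and against a 23-condition set it cannot exceed a recall of 10/23.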
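The per-case metrics described above can be sketched in a few lines of Python. This is an illustrative sketch, not the scoring code used for this evaluation: the function names are hypothetical, and it assumes pathology names have already been normalised so that plain string equality is enough.

```python
from typing import Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], truth: str) -> float:
    """Return 1/rank of the true diagnosis, or 0.0 if it is not in the list."""
    for position, name in enumerate(ranked, start=1):
        if name == truth:
            return 1.0 / position
    return 0.0


def top_k_hit(ranked: Sequence[str], truth: str, k: int) -> bool:
    """True if the true diagnosis appears among the first k predictions."""
    return truth in ranked[:k]


def precision_recall(ranked: Sequence[str], gt_set: Set[str]) -> Tuple[float, float]:
    """Overlap between the predicted differential and the GT differential set."""
    predicted = set(ranked)
    overlap = len(predicted & gt_set)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gt_set) if gt_set else 0.0
    return precision, recall


# A soft miss: the correct label sits at rank 2, below a more specific guess.
ranked = ["Influenza", "URTI", "Bronchitis"]
rr = reciprocal_rank(ranked, "URTI")
precision, recall = precision_recall(ranked, {"URTI"})
```

Averaging `reciprocal_rank` over all cases gives MRR. The example also shows why precision drops when the GT set is small: every extra candidate in the list counts against it, even when recall is perfect.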
Three patterns stand out from the 36 evaluated cases.
In 7 of 16 soft misses, the model ranked a more specific diagnosis first — choosing Influenza over URTI, or Rhinosinusitis over Bronchitis. The correct answer appeared at rank 2–3, just below the cut. This is a systematic calibration issue, not random error; the model applies strong priors toward mechanistically specific conditions that the benchmark's umbrella labels don't reward.
Across 36 cases, the correct diagnosis fell outside the top-5 just once: a 2-year-old where the model chose Bronchiolitis (the textbook infant diagnosis) but the benchmark labelled it Bronchitis. DDXPlus appears to lack age-stratified pathology labels, meaning the model's clinically sound age-appropriate reasoning was penalised by an incomplete ground-truth differential.
97.2% top-5 accuracy means the model almost never truly missed a diagnosis — it consistently generated a differential that included the right answer. In a real clinical-decision-support context (where a clinician reviews the full list, not just the top pick), this would be extremely useful. The remaining challenge is purely about ranking confidence, not knowledge coverage.
A transparent description of the dataset, prompting strategy, and scoring.
DDXPlus is a synthetic differential diagnosis dataset published by researchers at Université de Montréal. It contains roughly 1.3 million simulated patient cases across 49 pathologies, with each case specifying age, sex, an initial complaint, and a set of binary/multi-value symptom evidences. Ground-truth labels include both a single confirmed pathology and a probability-weighted differential diagnosis list.
Each case was converted to a plain-English clinical vignette and sent to Claude via the Claude Agent SDK (OAuth with a Pro subscription; no paid API credits). The system prompt instructed the model to:
Pathology names are fuzzy-matched against the canonical list using exact → substring → difflib (cutoff 0.7) fallback, so minor casing or punctuation variations don't count as misses. Metrics: Top-1 accuracy, Top-5 accuracy, Mean Reciprocal Rank (MRR), Differential Precision (fraction of predicted pathologies in GT set), and Differential Recall (fraction of GT pathologies found in predictions).
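The exact → substring → difflib cascade might look roughly like the sketch below. The function name `match_pathology` is hypothetical, and several details are assumptions rather than facts about the evaluation code: comparison is case-insensitive, the substring test runs in both directions, and the difflib step uses `difflib.get_close_matches` with the stated cutoff of 0.7.

```python
import difflib
from typing import Optional, Sequence


def match_pathology(
    name: str, canonical: Sequence[str], cutoff: float = 0.7
) -> Optional[str]:
    """Map a model-produced name onto the canonical list, or return None."""
    cleaned = name.strip().lower()
    if not cleaned:
        return None
    # Lower-cased lookup table so casing differences do not count as misses.
    lowered = {entry.lower(): entry for entry in canonical}

    # Step 1: exact match.
    if cleaned in lowered:
        return lowered[cleaned]

    # Step 2: substring match in either direction (assumed behaviour).
    for low, original in lowered.items():
        if cleaned in low or low in cleaned:
            return original

    # Step 3: closest match by difflib similarity ratio, if above the cutoff.
    close = difflib.get_close_matches(cleaned, list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None


canonical = ["URTI", "Bronchitis", "Acute rhinosinusitis"]
exact = match_pathology("urti", canonical)
partial = match_pathology("Rhinosinusitis", canonical)
fuzzy = match_pathology("Bronchitus", canonical)
```

Running the strict steps first keeps the fuzzy fallback as a last resort, which matters when the canonical list contains names that differ by only a few characters.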
Four specific cases (three soft misses and the only hard miss), annotated with the model's reasoning and why it got it wrong, or why the benchmark might be at fault.
Honest accounting of the constraints on these results.
DDXPlus cases are algorithmically generated from probabilistic disease models — not real patient records. Symptom co-occurrence reflects the simulator's priors, not the full complexity of real presentations. Performance on synthetic data does not imply equivalent performance on real clinical notes.
36 evaluated cases is a statistically thin sample. 14 additional cases were lost to API session rate-limiting, and the losses were not random (all were cases 36–49, one contiguous block), which may introduce selection bias. Confidence intervals on these metrics would be wide.
Only Claude Sonnet 4.6 with one prompting strategy was tested. No comparison to other models (GPT-4, Gemini, Med-PaLM), no ablation on prompt variants (chain-of-thought, few-shot examples, structured reasoning), and no specialist-physician baseline.
As shown in Case 1 (Bronchiolitis) and Case 21 (pneumothorax), DDXPlus's ground-truth labels have known limitations: the 49-condition vocabulary excludes some clinically important diagnoses, and single confirmed-pathology labels don't capture the ambiguity of differential diagnosis in practice.
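The sample-size limitation can be made concrete with a confidence interval on a reported proportion. The sketch below uses the Wilson score interval, which is one common choice for small samples and is not something the evaluation itself reports. The function name is hypothetical, and the 35-of-36 input is the top-5 result stated earlier.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Top-5 accuracy: 35 hits out of 36 evaluated cases.
low, high = wilson_interval(35, 36)
```

With only 36 cases the interval spans well over ten percentage points, so a headline figure such as 97.2% should be read as a rough estimate rather than a precise one.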
Backoff + checkpointing. Add exponential retry on rate-limit errors and mid-run checkpointing so that large runs (200–500 cases) can resume without losing progress.
Multi-model comparison. Run identical cases through Claude Haiku, Claude Opus, and an open-weight medical model (e.g. BioMistral) to produce a head-to-head accuracy table.
Prompt ablation. Test chain-of-thought vs. zero-shot vs. few-shot prompting to measure whether structured reasoning improves top-1 accuracy without harming top-5.
Stratified analysis. Break results down by age group, sex, and pathology class to expose where the specificity-bias failure mode is most pronounced.
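The backoff and checkpointing item could be sketched as follows. Everything here is illustrative: `RateLimitError` is a stand-in exception (the real SDK's error type is not specified in this report), and `call_with_backoff`, `run_with_checkpoint`, and the JSON checkpoint layout are hypothetical names and choices.

```python
import json
import os
import random
import time
from typing import Any, Callable, Dict


class RateLimitError(Exception):
    """Stand-in for whatever rate-limit error the client library raises."""


def call_with_backoff(fn: Callable[..., Any], *args: Any, max_retries: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call fn, retrying on RateLimitError with exponential delay plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


def run_with_checkpoint(cases: Dict[Any, Any], evaluate: Callable[[Any], Any],
                        path: str,
                        sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Evaluate each case once, saving results after every case so a run can resume."""
    results: Dict[str, Any] = {}
    if os.path.exists(path):
        with open(path) as handle:
            results = json.load(handle)
    for case_id, case in cases.items():
        key = str(case_id)  # JSON object keys are always strings
        if key in results:
            continue  # already done in an earlier run
        results[key] = call_with_backoff(evaluate, case, sleep=sleep)
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(results, handle)
        os.replace(tmp_path, path)  # swap in the new checkpoint in one step
    return results
```

Writing to a temporary file and then renaming it means an interrupted run leaves the previous checkpoint intact. Results must be JSON-serialisable for this layout to work.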