If the title committee had permitted full honesty, McCoy and Wu's paper might have been called: "Our Medical AI Passed the Exam, Met an Actual Hospital Note, and Immediately Needed a Juice Box."
The humans appear to enjoy testing their thinking machines by giving them medical exam questions. This is understandable. Exams are tidy. They have answer keys. They sit still. Real clinical records, by contrast, behave like a family group chat written by tired specialists at 2:13 a.m.: abbreviations, copied-forward baggage, missing context, contradictory clues, and the occasional sentence that looks like it was assembled during a fire drill.
The Sacred Exam Scrolls Were Not Enough
McCoy and Wu's News & Views article [1] comments on BRIDGE [2], a benchmark built around the less glamorous question: can large language models understand clinical text as it appears in actual care? An electronic health record is, formally, a digital collection of patient health information. In practice, it is where medications, lab values, billing codes, old diagnoses, new hunches, and "patient doing fine?" all collide. If a language model is a digital brain trained on flashcards, an EHR is the backpack after finals week.
BRIDGE collects 87 tasks from 59 real-world clinical data sources, across nine languages, 14 specialties, and eight task types, including triage, information extraction, diagnosis, prognosis, and billing coding. The researchers evaluated 95 models, including DeepSeek-R1, GPT-4o, Gemini, and Qwen3. The result was not one clean victory trumpet. Performance shifted by model size, language, specialty, and task. Open-source models could sometimes match proprietary systems. Medical fine-tunes built on older base models often lagged behind newer general models. The species has discovered that "medical AI" is not one thing. Remarkable. Next they may discover that "food" includes both soup and hospital cafeteria pudding.
Breadth Is Easy. Depth Has Paperwork.
The title's "breadth to depth" shift matters because many previous evaluations were broad in the way a buffet is broad: lots of small samples, not necessarily a satisfying meal. Medical exam benchmarks ask whether a model can select the right answer from polished clinical trivia. Clinical work asks whether it can survive ambiguity, timeline weirdness, language variation, abbreviations, and the eternal medical mystery known as "see prior note."
This is not a tiny complaint from fussy methodologists polishing their clipboards for sport. Bedi and colleagues reviewed 519 studies of health-care LLM evaluation and found that only 5% used real patient-care data [3]. Most leaned heavily on question answering, especially exam-style medical knowledge. The humans built a robot, asked it to pass a quiz, then wondered if it could safely help in a hospital. A curious ritual.
BRIDGE joins a growing push to make evaluation more like the work itself. MedHELM organized medical AI tasks into a clinician-validated taxonomy covering 121 tasks, from clinical decision support to administrative workflow [4]. HealthBench used 5,000 multi-turn health conversations scored with physician-written rubrics [5]. Rao and colleagues tested 21 LLMs across stepwise clinical reasoning and found a very human-looking failure mode: models could often land on the final diagnosis, yet struggled with differential diagnosis and uncertainty [6]. In detective terms, they sometimes guessed the culprit while forgetting to investigate the crime scene. Stylish, but not how one would prefer medicine to operate.
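For readers wondering what "scored with physician-written rubrics" might look like mechanically, here is a minimal sketch. It is not HealthBench's actual scoring code. It assumes each criterion carries a point value (negative for harmful behavior), that a grader has already decided whether each criterion was met, and that the score is earned points over the maximum achievable positive points, clipped to the range 0 to 1. The `Criterion` class, the `rubric_score` function, and the example criteria are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: int  # assumption: negative points penalize harmful behavior
    met: bool    # assumption: a grader decided this; hard-coded here


def rubric_score(criteria):
    """Earned points over achievable positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for one conversation; none of this comes from the paper.
example = [
    Criterion("Advises urgent care for chest pain", 8, True),
    Criterion("Asks about symptom duration", 4, False),
    Criterion("Recommends a specific prescription dose unprompted", -6, True),
]
score = rubric_score(example)
print(f"rubric score: {score:.3f}")
```

The design point is that a response can say one very right thing and one very wrong thing and end up with a mediocre score, which a multiple-choice answer key cannot express.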
The Useful Part Is Less Sparkly
If BRIDGE-style evaluation holds up, the impact is practical rather than cinematic. Hospitals could compare models by specialty, language, and task instead of waving around a single exam score like a ceremonial fern. Developers could see where systems fail: non-English notes, billing codes, prognosis, extraction from messy records, or whatever fresh paperwork labyrinth the health system has invented this week.
It also pushes the field toward safer deployment. A model that helps summarize discharge instructions may not be the same model you want near diagnostic uncertainty. A system that performs well on English oncology notes may wobble on Spanish cardiology notes. The humans call this "context." From orbit, it looks like basic survival behavior.
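The difference between one headline score and a per-language or per-task breakdown can be sketched in a few lines. This is a toy illustration, not BRIDGE's evaluation code: the model names, the record layout, the `mean_by` helper, and every score below are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task results; every number here is invented for illustration.
results = [
    {"model": "model_a", "language": "en", "specialty": "oncology",
     "task": "extraction", "score": 0.82},
    {"model": "model_a", "language": "es", "specialty": "cardiology",
     "task": "extraction", "score": 0.61},
    {"model": "model_a", "language": "en", "specialty": "oncology",
     "task": "billing_coding", "score": 0.44},
    {"model": "model_b", "language": "en", "specialty": "oncology",
     "task": "extraction", "score": 0.78},
    {"model": "model_b", "language": "es", "specialty": "cardiology",
     "task": "extraction", "score": 0.74},
    {"model": "model_b", "language": "en", "specialty": "oncology",
     "task": "billing_coding", "score": 0.39},
]


def mean_by(rows, *keys):
    """Average the score within each combination of the given keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {group: round(mean(scores), 3) for group, scores in groups.items()}


overall = mean_by(results, "model")                  # one number per model
by_language = mean_by(results, "model", "language")  # split by note language

print(overall)
print(by_language)
```

In this invented data the two models look nearly tied overall, yet one is clearly stronger on the Spanish notes and the other on the English ones. That gap is exactly what a single averaged score hides.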
There is a privacy lesson here too. Clinical text is sensitive, and benchmarks cannot always share the richest data because patient records are not party favors. This is the same instinct behind private, browser-based document tools like pdfb2.io: keep sensitive documents local when possible. Hospitals need that instinct plus audit logs, governance, and far fewer people saying "just upload it somewhere" with the serene confidence of someone not named in the lawsuit.
The Machines Are Not Fired. They Are on Probation.
The best reading of McCoy and Wu is not "medical AI is doomed." It is more precise and more annoying: broad benchmarks are not enough. Clinical AI needs deeper tests that match real workflows, real languages, real failure costs, and real humans who have been awake since 5 a.m.
Benchmarks still have limits. Models update. Leaderboards age like milk in a warm server room. Test sets can leak into training data. Strong benchmark performance does not prove improved patient outcomes. For that, the humans must perform their most elaborate ritual: prospective studies, clinical trials, monitoring, bias audits, and admitting uncertainty in public.
Still, BRIDGE is a useful correction. It moves the conversation from "Can the model sound medical?" to "Can it do this specific clinical job, with this messy data, for these patients, under these constraints?" That is less glamorous than a superhuman exam score. It is also much closer to the point.
References
[1] McCoy, L. G., & Wu, D. From breadth to depth in clinical artificial intelligence evaluation. Nature Biomedical Engineering (2026). DOI: 10.1038/s41551-026-01691-x. PMID: 42343092.
[2] Wu, J. et al. BRIDGE: benchmarking large language models for understanding real-world clinical practice texts. Nature Biomedical Engineering (2026). DOI: 10.1038/s41551-026-01719-2.
[3] Bedi, S. et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 333, 319-328 (2025). DOI: 10.1001/jama.2024.21700.
[4] Bedi, S. et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine 32, 943-951 (2026). DOI: 10.1038/s41591-025-04151-2.
[5] Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. arXiv:2505.08775 (2025). DOI: 10.48550/arXiv.2505.08775.
[6] Rao, A. S. et al. Large language model performance and clinical reasoning tasks. JAMA Network Open 9, e264003 (2026). DOI: 10.1001/jamanetworkopen.2026.4003.
[7] Raji, I. D., Daneshjou, R., & Alsentzer, E. It's time to bench the medical exam benchmark. NEJM AI 2, AIe2401235 (2025). DOI: 10.1056/AIe2401235.
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.