Medical AI Needs to Stop Shipping Demos as Medicine

Mattia Andreoletti, Berkay Senkalfa, Effy Vayena, and Alessandro Blasimme’s Lancet Digital Health article, “Ensuring the clinical impact of medical artificial intelligence,” is basically a code review of the whole medical AI pipeline. The comment is short, but the diff is large: stop treating proxy metrics as patient benefit.

A model can detect tumors, flag sepsis, summarize notes, prioritize scans, and still fail the only review that matters: did anyone get better care because of it?

Nit: “better AUC” is not the same as “fewer deaths,” “less suffering,” or “clinicians got home before their children forgot their names.”

LGTM, But Where Is the Patient?

Medical AI has spent years optimizing the leaderboard stuff: sensitivity, specificity, diagnostic accuracy, detection rate, time-to-result, workflow efficiency. These are useful. They are also suspiciously convenient. They are the metrics you can collect before the messy human part starts.
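
To see why a convenient metric can flatter a tool, here is a minimal sketch of the standard positive-predictive-value arithmetic (Bayes' rule). The sensitivity, specificity, and prevalence values are made up for illustration; they do not come from the article or from any study cited here, and the function names are hypothetical.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Fraction of positive flags that are true cases (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def false_alarms_per_true_case(sensitivity: float, specificity: float,
                               prevalence: float) -> float:
    """Expected number of false flags for each correctly flagged case."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return false_pos / true_pos


# Hypothetical screening tool: 90% sensitive, 90% specific,
# used where 2% of patients actually have the condition.
ppv = positive_predictive_value(0.90, 0.90, 0.02)
ratio = false_alarms_per_true_case(0.90, 0.90, 0.02)
print(f"PPV: {ppv:.1%}, false alarms per true case: {ratio:.1f}")
```

Under those assumed numbers, only about 15.5% of flags point at a real case, which works out to roughly 5.4 false alarms for every true one. Nothing about the model changed; only the setting did. That is the gap between a benchmark number and what a clinician actually experiences.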

The authors’ point is that clinical impact lives downstream. Not in the model card. Not in the vendor demo where the interface glows like a spaceship dashboard. It lives in questions like:

Does the AI change treatment decisions?
Does it reduce missed diagnoses?
Does it shorten delays without increasing false alarms?
Does it help the patients who usually get the worst version of the health system?
Does it make clinicians safer, faster, calmer, or just more responsible for debugging another black box?

That last one matters. If AI becomes the clinical equivalent of “one more Jira ticket,” congratulations, we have automated resentment.

The Benchmark Is Not the Bedside

This critique lands because the evidence base keeps saying the same awkward thing in different fonts. A 2024 scoping review of randomized controlled trials found that trials of AI in clinical practice exist, but they remain limited and uneven across tasks and outcomes (Han et al., 2024). Translation: we have some real tests, but not nearly enough for the amount of confident slideware being produced.

A 2024 systematic review and meta-analysis in npj Digital Medicine looked at real-world imaging workflows and found that many studies reported faster work after AI, but the meta-analyses did not show clear time savings across comparable studies (Wenderott et al., 2024). Blocking comment: if your efficiency claim disappears when the studies become comparable, it needs a refactor.

This is not anti-AI. It is anti-magic-trick. AI can absolutely help in medicine, especially in image-heavy, data-heavy, paperwork-heavy settings. But hospitals are not Kaggle with hand sanitizer. They are sociotechnical systems full of old software, exhausted people, edge cases, billing codes, and printers that have chosen violence.

Clinical Impact Is a Deployment Problem

The paper pushes us toward a more grown-up idea: clinical AI should be evaluated like an intervention inside a care system, not like an algorithm floating in a vacuum wearing a tiny lab coat.

That means measuring patient-relevant outcomes, tracking harms, auditing bias, checking model drift, and watching what happens after launch. Sittig and Singh made a similar safety argument in JAMA: health care organizations need governance, local testing, monitoring, clinician training, system inventories, and kill switches for AI tools. Yes, kill switches. Finally, a feature request everyone understands.
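
For readers who think in code, here is one way the "monitor after launch, keep a kill switch" idea could be sketched. This is an illustration under stated assumptions, not the governance process Sittig and Singh describe: the class name `AlertMonitor`, the rolling-precision rule, and every threshold are hypothetical, and a real deployment would need clinically adjudicated outcomes and human sign-off.

```python
from collections import deque


class AlertMonitor:
    """Tracks recent adjudicated alerts and disables the tool if precision drops."""

    def __init__(self, window_size=200, min_precision=0.20, min_samples=50):
        # All defaults are placeholder values, not recommendations.
        self.min_precision = min_precision
        self.min_samples = min_samples
        self.enabled = True
        self._outcomes = deque(maxlen=window_size)  # rolling window of True/False

    def precision(self):
        """Share of recent alerts that reviewers confirmed; None if no data yet."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def record(self, alert_was_correct: bool) -> None:
        """Log one reviewed alert and trip the kill switch if needed."""
        self._outcomes.append(bool(alert_was_correct))
        enough_data = len(self._outcomes) >= self.min_samples
        if enough_data and self.precision() < self.min_precision:
            # Stays off until a human re-enables it; no automatic recovery.
            self.enabled = False


# Tiny demo with made-up review outcomes.
monitor = AlertMonitor(window_size=10, min_precision=0.30, min_samples=5)
for outcome in [True, True, False, False, False, False, False]:
    monitor.record(outcome)
print(monitor.enabled, round(monitor.precision(), 3))
```

In this toy run the switch trips on the seventh reviewed alert, when rolling precision falls to 2/7. The design choice worth noticing is that the switch does not flip itself back on: turning the tool off is automatic, turning it on again is a human decision.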

This is also where boring documentation becomes heroic. Who uses the model? When? What happens if it disagrees with the clinician? What happens if it stops working after the hospital changes scanners, forms, patient mix, or EHR settings? If the answer is “the vendor said it generalizes,” please request changes.

For teams mapping clinical workflows before deployment, visual tools like mapb2.io are oddly relevant here. Not because mind maps cure sepsis, sadly, but because AI impact often depends on where the tool enters the workflow and who has to act on it.

Approved With Reservations

The best part of Andreoletti and colleagues’ argument is that it raises the bar without throwing out the work. Medical AI does not need fewer models. It needs fewer victory laps after lab-only validation.

A clean solution would look like this: define the clinical problem first, pick outcomes patients and clinicians actually care about, test prospectively when risk warrants it, monitor after deployment, publish failures, and keep checking whether the tool helps across different populations. Grudging compliment: that is not glamorous, but it is maintainable.
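
The "different populations" item is the easiest one to skip, so here is a minimal sketch of what a subgroup check might look like. The record format, the group labels, and the counts are invented for illustration. Sensitivity is used only as a stand-in; a real audit would substitute whichever patient-relevant outcome the team defined up front.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Per-group share of true cases the model flagged.

    `records` is an iterable of (group, has_condition, model_flagged) tuples.
    Groups with no true cases are omitted from the result.
    """
    cases = defaultdict(int)
    hits = defaultdict(int)
    for group, has_condition, model_flagged in records:
        if has_condition:
            cases[group] += 1
            if model_flagged:
                hits[group] += 1
    return {group: hits[group] / cases[group] for group in cases}


# Made-up audit data: two hypothetical sites with ten true cases each.
records = (
    [("site_a", True, True)] * 9
    + [("site_a", True, False)] * 1
    + [("site_b", True, True)] * 6
    + [("site_b", True, False)] * 4
    + [("site_a", False, False)] * 5  # non-cases do not affect sensitivity
)
result = sensitivity_by_group(records)
for group, value in sorted(result.items()):
    print(f"{group}: {value:.0%}")
```

With these invented counts the tool catches 90% of cases at one site and 60% at the other, a gap that a single pooled number (75%) would hide.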

The potential upside remains real. If early results can be reproduced and extended, AI could help clinicians catch disease earlier, reduce delays, triage overloaded queues, summarize sprawling records, and spot risks that humans miss because humans need sleep and occasionally lunch. But the authors are asking for the adult version of that promise: evidence that survives contact with real clinics.

Final review: clever technology, high potential, insufficient tests in production. Needs outcome metrics. Needs monitoring. Needs equity checks. Needs less swagger. Approved only after clinical impact is demonstrated, not merely implied by a very confident ROC curve.

References

Andreoletti M, Senkalfa B, Vayena E, Blasimme A. Ensuring the clinical impact of medical artificial intelligence. The Lancet Digital Health. 2026;101030. DOI: 10.1016/j.landig.2026.101030. PMID: 42303563.

Han R, Acosta JN, Shakeri Z, Ioannidis JPA, Topol EJ, Rajpurkar P. Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review. The Lancet Digital Health. 2024;6:e367-e373. DOI: 10.1016/S2589-7500(24)00047-5.

Wenderott K, Krups J, Zaruchas F, et al. Effects of artificial intelligence implementation on efficiency in medical imaging: a systematic literature review and meta-analysis. npj Digital Medicine. 2024;7:265. DOI: 10.1038/s41746-024-01248-9.

Sittig DF, Singh H. Recommendations to ensure safety of AI in real-world clinical care. JAMA. 2025;333(6):457-458. DOI: 10.1001/jama.2024.24598.

Angus DC, Khera R, Lieu T, et al. AI, health, and health care today and tomorrow: The JAMA Summit Report on Artificial Intelligence. JAMA. 2025;334(18):1650-1664. DOI: 10.1001/jama.2025.18490. PMID: 41082366.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.