Comparison of AI-Generated Radiology Impressions

Remember that Breaking Bad episode where Walter White's scan comes back and the doctors all stare at the same image but somehow walk away with completely different takes? Turns out, that's not just dramatic TV writing - it's basically what happens when you ask radiologists and oncologists to judge AI-written radiology reports.

The Setup: Three Impressions Walk Into a Clinic

Sharang Phadke and colleagues at multiple institutions ran a beautifully blinded experiment (Phadke et al., 2026). They took 200 oncologic CT reports and lined up three versions of each impression:

The originals - written by the actual radiologists who read the scans
A custom AI model - fine-tuned on the institution's own data (think: the AI that grew up in that hospital)
A generic large language model - the kind of off-the-shelf LLM that's read everything on the internet but has never set foot in a radiology reading room

Then they handed all three versions, unlabeled, to ten clinicians - four original radiologists, three independent radiologists, and three oncologists - and asked them to rate each impression on completeness, correctness, conciseness, clarity, clinical utility, and potential patient harm.

The Plot Twist Nobody Expected

The custom AI model basically tied with the humans. Independent radiologists couldn't tell the difference (Cohen's h = -0.03, p = 0.78). Even the original radiologists - the people who wrote the human versions - only slightly preferred their own work, and that preference wasn't statistically significant (h = 0.18, p = 0.0716). That's the AI equivalent of a photo finish.
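If Cohen's h is new to you: it's an effect size for the gap between two proportions, computed after an arcsine transform so that the same gap counts the same whether it sits near 50% or near 95%. Cohen's usual rules of thumb call 0.2 small, 0.5 medium, and 0.8 large, which is why -0.03 reads as "no real difference" and 0.18 as "barely small." Here's a minimal sketch of the formula. The function name and the two preference rates are made up for illustration; they are not the paper's data, and this post doesn't say exactly which proportions the authors compared.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of two proportions after the arcsine transform."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Hypothetical numbers, NOT from Phadke et al.: suppose raters preferred
# the human impression 50% of the time and the AI impression 45% of the time.
h_example = cohens_h(0.50, 0.45)
print(f"h = {h_example:.2f}")  # well under the 0.2 "small effect" threshold
```

The sign only tells you which proportion came first in the subtraction, so a negative h like the -0.03 above just means the second group edged out the first.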

The generic LLM, though? It got roasted. Radiologists could spot it like a typo in a headline. Its impressions ran long (75.1 words on average, more than the human-written versions), and while they were technically more complete, they were dramatically less concise. Imagine asking someone what time it is and they explain the history of horology first. Useful information, sure, but nobody asked.

The Oncologist Wildcard

Here's where it gets spicy. While radiologists clearly penalized the generic model, oncologists didn't care. They showed no significant preference among any of the three impression types (all p > 0.20). Zero. Nada.

This makes a weird kind of sense when you think about it. Oncologists are the consumers of radiology impressions - they want the clinical bottom line. If the AI gives them what they need to make treatment decisions, they're happy. Radiologists, on the other hand, are the craftspeople - they notice when the wording is bloated or the structure is off, the way a chef notices when the plating is sloppy even if the food tastes fine.

Why Fine-Tuning Matters (A Lot)

The gap between the custom model and the generic model tells us something that the AI field keeps rediscovering: context is everything. A model trained on your institution's style, your abbreviation conventions, your reporting templates - it just sounds right. The generic model writes perfectly reasonable English that reads like it was translated from Medical Textbook into Hospital Report by someone who's never actually dictated one.

This tracks with earlier work showing GPT-4 struggling with radiology impressions in zero-shot settings, producing "unsupported statements" and a "certainty illusion" where the model sounds confident about things it shouldn't be (Sun et al., Radiology, 2023). Meanwhile, a 2024 study demonstrated that fine-tuning on just 800 reports could produce impressions rated "professionally and linguistically appropriate" (Radiology, 2024). The lesson: don't bring a Swiss Army knife to a surgery.

The Safety Check

Perhaps the most reassuring finding: patient harm ratings were uniformly low across all three types. By the evaluators' ratings, the AI output isn't dangerous - it's just occasionally verbose. On the harm scale, all three versions scored between 1.01 and 1.21, meaning evaluators essentially said "this would not hurt anyone." That's a low bar, but an essential one to clear before anyone starts deploying these tools in clinical workflows.

A recent scoping review tracking 67 studies from 2022-2024 confirms that LLM adoption in radiology is accelerating rapidly, with nearly two-thirds of papers published in 2024 alone (JMIR Medical Informatics, 2025). RSNA experts now consider auto-generated impressions from findings to be the most "deployment-ready" LLM use case in radiology. But with hallucination rates still hovering around 8-15% in some evaluations, the human radiologist isn't getting replaced - they're getting a surprisingly competent first draft.

The Bottom Line

The era of AI-assisted radiology reporting isn't coming. It's here. But this study shows that how you build the AI matters enormously, and who you ask to evaluate it changes the answer. A custom model fine-tuned on local data can match human quality so closely that independent radiologists literally can't tell the difference.

The real takeaway? AI in radiology isn't a single story. It's a Rashomon situation where every stakeholder sees the same output and reaches a different verdict.

References

Phadke S, Suresh N, Allen Z, et al. Comparison of AI-generated radiology impressions: a multi-stakeholder evaluation. npj Digital Medicine. 2026. DOI: 10.1038/s41746-026-02586-6. PMID: 41935165.
Sun Z, Ong H, Kennedy P, et al. Evaluating GPT-4 on impressions generation in radiology reports. Radiology. 2023;307(5):e231259. DOI: 10.1148/radiol.231259. PMID: 37367439.
Constructing a large language model to generate impressions from findings in radiology reports. Radiology. 2024;312(3). DOI: 10.1148/radiol.240885. PMID: 39287525.
Trends and trajectories in the rise of large language models in radiology: scoping review. JMIR Medical Informatics. 2025;e78041. DOI: 10.2196/78041. PMID: 41364806.
Large language models in radiology reporting - systematic review of performance, limitations, and clinical implications. medRxiv. 2025. DOI: 10.1101/2025.03.18.25324193.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded