April 13, 2026

Traditional ML Has Been Beating LLMs at Clinical Prediction for Years - That Just Changed

For the past two years, the scoreboard was embarrassingly clear: throw an LLM at a clinical prediction task - mortality, readmission, length of stay - and a boring old XGBoost model would eat its lunch. GPT-4 pulling an AUROC (area under the ROC curve, where 0.5 is a coin flip and 1.0 is perfect) of 0.62 while CatBoost casually hits 0.89? That's not a competition, that's a participation trophy situation. Study after study confirmed it (Brown et al., JAMIA 2025; Chen et al., ClinicalBench, arXiv 2411.06469). LLMs: great at writing poems about chest pain, terrible at actually predicting who's going to deteriorate.
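Quick aside on what an AUROC number actually measures, since the whole scoreboard rests on it. It is the chance that a model gives a randomly chosen patient who had the outcome a higher risk score than a randomly chosen patient who didn't. Here is a minimal sketch in plain Python. The labels and scores are made up for illustration and come from none of the papers cited here.

```python
def auroc(labels, scores):
    """Chance that a random positive case outranks a random negative one.

    Ties count as half a win. This is the pairwise definition, fine for
    small examples but slow on large datasets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy, invented data: 1 = patient deteriorated, 0 = did not.
labels = [1, 0, 1, 0, 0, 1]
sharp_model = [0.9, 0.2, 0.8, 0.3, 0.1, 0.7]  # every positive ranked above every negative
fuzzy_model = [0.6, 0.5, 0.4, 0.7, 0.2, 0.8]  # gets three of nine pairs backwards

print(auroc(labels, sharp_model))  # 1.0
print(auroc(labels, fuzzy_model))  # 6 of 9 pairs correct, about 0.667
```

A score of 0.5 means the model ranks patients no better than a coin flip, which is why 0.62 versus 0.89 is such a lopsided result.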

Then ClinicRealm showed up and said: hold on, check again.

The Biggest Benchmark Brawl in Clinical AI

A team of researchers from Peking University, the University of Edinburgh, and other institutions just dropped a monster evaluation. We're talking 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional ML methods, all tested across unstructured clinical notes and structured electronic health records (Zhu et al., NPJ Digital Medicine, 2026). Not a cherry-picked comparison. A proper cage match.

And the results? Honestly, they're more interesting than any "AI DESTROYS traditional methods" headline would suggest.

Clinical Notes: LLMs Finally Got Good

Here's the headline grab. On unstructured clinical notes - those rambling, typo-laden, abbreviation-heavy documents that doctors write at 3 AM - zero-shot LLMs like DeepSeek-V3.1-Think and GPT-5 now decisively outperform fine-tuned BERT models. No fine-tuning. No labeled training data. Just raw comprehension.

Think about what that means. ClinicalBERT, BioBERT, all those carefully fine-tuned encoder models that needed curated datasets and GPU time? A general-purpose LLM that's never seen your hospital's specific notes is beating them straight out of the box.
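To make "zero-shot" concrete: the model gets an instruction and the raw note, nothing else. No examples, no gradient updates. The sketch below shows one way that could look. It is not the prompt ClinicRealm used. The prompt wording, the function names, and the canned reply are all assumptions for illustration, and the actual model call is left out because it depends on whichever API or local runtime you use.

```python
import re


def build_zero_shot_prompt(note: str, outcome: str) -> str:
    """Wrap a raw clinical note in a plain instruction (hypothetical wording)."""
    return (
        "You are assisting with a clinical risk assessment.\n"
        f"Read the note below and estimate the probability of {outcome}.\n"
        "End your answer with a single number between 0 and 1.\n\n"
        f"NOTE:\n{note.strip()}\n"
    )


def parse_probability(reply: str):
    """Return the last number in [0, 1] found in the reply, or None.

    Deliberately crude: a reply like "1 in 5" would be misread, so a real
    pipeline would want a stricter output format.
    """
    numbers = [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?|\.\d+", reply)]
    in_range = [x for x in numbers if 0.0 <= x <= 1.0]
    return in_range[-1] if in_range else None


# Invented note and invented reply; a real run would send `prompt` to a model.
prompt = build_zero_shot_prompt(
    " 68M, septic shock, lactate 4.1, on pressors. ", "in-hospital mortality"
)
canned_reply = "The note describes shock with a high lactate.\nEstimated probability: 0.82"

print(prompt)
print(parse_probability(canned_reply))  # 0.82
```

Once every note has a parsed probability, those scores can be evaluated with AUROC exactly like the output of a fine-tuned BERT classifier.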

This is a genuine inflection point. Two earlier evaluations from late 2024, ClinicalBench (Chen et al.) and the medRxiv preprint "Not the Models You Are Looking For", both concluded that LLMs couldn't keep up with traditional approaches. But those studies used earlier model generations. The gap closed faster than anyone expected.

Structured Data: Not So Fast

Before you go ripping out your XGBoost pipelines - structured EHR data tells a different story. When you've got clean tables of lab values, vitals, and diagnosis codes with plenty of training data, specialized models still win. Gradient-boosted trees remain annoyingly good at tabular prediction. They're fast, interpretable, and they don't hallucinate your patient's potassium level.

But here's the twist. In data-scarce settings - small hospitals, rare diseases, new prediction tasks where you don't have thousands of labeled examples - advanced LLMs showed surprisingly strong zero-shot performance. They often beat conventional models that were starving for training data. This echoes findings from zero-shot EHR prediction work like the ETHOS framework (NPJ Digital Medicine, 2024) and recent generative pre-training approaches (Redekop et al., JAMIA 2025; arXiv 2503.05893).

The Open-Source Plot Twist

One finding that deserves its own moment: leading open-source LLMs matched or exceeded proprietary ones. DeepSeek hanging with GPT-5 on clinical tasks is a big deal for hospitals that can't (or won't) send patient data to external APIs. Running a competitive model locally, behind your own firewall, with no data leaving the building? That's the kind of practical detail that actually moves the needle in healthcare.

So What Should You Actually Use?

The ClinicRealm team isn't saying "throw away everything and use ChatGPT." They're saying the answer depends on what you're working with:

Free-text clinical notes? Modern LLMs are now your best bet, even zero-shot.
Structured tabular EHR data with plenty of examples? Traditional ML still wins.
Limited labeled data? LLMs can bootstrap predictions where conventional models flounder.
Privacy constraints? Open-source LLMs are competitive - no need to phone home.
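If you like your rules of thumb executable, the four points above can be sketched as a tiny helper. Treat it as a conversation starter, not as anything from the paper. The Project fields, the returned labels, and especially the cutoff of 1000 labeled examples are assumptions made for this example; the source only says conventional models struggle when you don't have thousands of labeled examples.

```python
from dataclasses import dataclass


@dataclass
class Project:
    data_type: str                      # "notes" or "tabular"
    labeled_examples: int               # how many labeled cases you can train on
    data_must_stay_local: bool = False  # True if nothing may leave the building


def suggest_starting_point(p: Project, scarce_below: int = 1000) -> str:
    """Map the blog's four rules of thumb to a first model family to try.

    `scarce_below` is a hypothetical cutoff, not a number from the benchmark.
    """
    if p.data_type not in ("notes", "tabular"):
        raise ValueError("data_type must be 'notes' or 'tabular'")

    # Free text favors LLMs; so does tabular data with too few labels.
    use_llm = p.data_type == "notes" or p.labeled_examples < scarce_below
    if not use_llm:
        return "gradient-boosted trees"
    if p.data_must_stay_local:
        return "self-hosted open-source LLM, zero-shot"
    return "zero-shot LLM"


print(suggest_starting_point(Project("notes", 0)))         # zero-shot LLM
print(suggest_starting_point(Project("tabular", 50_000)))  # gradient-boosted trees
print(suggest_starting_point(Project("tabular", 200)))     # zero-shot LLM
print(suggest_starting_point(Project("notes", 0, True)))   # self-hosted open-source LLM, zero-shot
```

Whatever the helper suggests is only a first candidate. The point of the benchmark is that you should still compare it against the alternative on your own data.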

If your team is building clinical decision tools and still defaulting to "BERT for text, XGBoost for tables" without re-evaluating, this paper is your wake-up call. The landscape shifted under everyone's feet, and the right model choice now depends on nuances that didn't exist 18 months ago. For teams trying to make sense of sprawling research notes and unstructured medical documents, tools like pdfb2.io offer a glimpse of how browser-based document processing is catching up to the same trend - doing more with less infrastructure.

The full benchmark is live at yhzhu99.github.io/ehr-llm-benchmark if you want to dig into the numbers yourself.

References

Zhu, Y., Gao, J., Wang, Z., et al. (2026). ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks. NPJ Digital Medicine. DOI: 10.1038/s41746-026-02539-z. PMID: 41951858.
Brown, K.E., Yan, C., Li, Z., et al. (2025). Large language models are less effective at clinical prediction tasks than locally trained machine learning models. JAMIA, 32(5), 811. DOI: 10.1093/jamia/ocaf038.
Chen, C., Yu, J., Chen, S., et al. (2024). ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction? arXiv: 2411.06469.
Redekop, E., Wang, Z., et al. (2025). Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records. JAMIA, 32(12), 1833. arXiv: 2503.05893. PMID: 41060255.
Renc, P., Jia, Y., Samir, A.E., et al. (2024). Zero shot health trajectory prediction using transformer. NPJ Digital Medicine. DOI: 10.1038/s41746-024-01235-0.
"Not the Models You Are Looking For: Traditional ML Outperforms LLMs in Clinical Prediction Tasks." medRxiv, 2024. DOI: 10.1101/2024.12.03.24318400.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded