AI Pathology Gets a Confidence Gauge

There are two types of people: those who know about intrahepatic cholangiocarcinoma, and those about to find out why diagnosing it can feel like listening for a bad fuel injector in a hurricane.

Intrahepatic cholangiocarcinoma, mercifully shortened to ICCA, is a cancer that starts in the bile ducts inside the liver. The trouble is that under a microscope it can look a lot like cancers that started somewhere else and spread to the liver. That is not a cute little paperwork problem. It can send patients through extra scans, endoscopies, biopsies, and waiting rooms where time moves like cold molasses.

Cheng and colleagues asked a very practical question in Annals of Oncology: can an AI pathology model help tell primary liver bile duct cancer from metastatic cancer, and can it also say when it is not sure? That second part matters. A diagnostic AI without a confidence gauge is like a dashboard with no warning lights. Technically sleek, emotionally suspicious.

Pop The Hood

The team studied 544 retrospective cases from five European centers, then tested their final model prospectively on 161 patients from France, India, and Korea. The slides were ordinary hematoxylin and eosin (H&E) pathology slides, the pink-and-purple workhorses of cancer diagnosis. Digital pathology turns those glass slides into huge whole-slide images, basically microscope maps so large your laptop fan starts negotiating hazard pay.

They tried three AI engine setups built around pathology foundation models: CTransPath/HistoBistro, UNI/CLAM, and CONCH/TITAN. Foundation models are the big pre-trained engines of modern AI. Instead of learning one narrow task from scratch, they first study massive collections of tissue images, then get tuned for specific jobs. Think of it as hiring a mechanic who has already rebuilt every engine in town before you ask them to diagnose your weird rattle.

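The best raw performer here was CONCH/TITAN, with an AUROC of 0.840 on the retrospective test set. AUROC, the area under the receiver operating characteristic curve, measures how well a model separates two classes across all decision thresholds; equivalently, it is the chance that a randomly picked positive case gets a higher score than a randomly picked negative one. A score of 0.5 is coin flip territory. A score of 1.0 is the fantasy garage where every bolt comes loose on the first try.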
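For readers who like to see the gauge wired up, here is a minimal sketch of AUROC computed the slow, transparent way: compare every positive case against every negative case and count how often the positive one scores higher. The function name and the toy labels and scores are made up for illustration; none of this is data or code from the paper.

```python
from itertools import product


def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = primary ICCA, 0 = metastasis.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

# 8 of the 9 positive/negative pairs are ordered correctly.
print(auroc(labels, scores))
```

The pairwise version is quadratic in the number of cases, so real evaluation code would normally use a rank-based routine from a statistics library instead. The point here is only to show what the number means.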

The Clever Part: Knowing When To Shut Up

The real trick was confidence thresholding. The researchers used a generalized ODIN-style approach and predictive entropy to estimate uncertainty. Entropy, in plain shop-floor language, is how much the model is wobbling between answers. Low entropy means, "I know what this is." High entropy means, "Something smells off, and I would like an adult pathologist."
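To put a number on the wobble, here is a small sketch of predictive entropy for a single prediction. It assumes the model outputs a probability for each class, and it measures entropy in bits (log base 2), which is a choice made here for readability; the paper may use a different base or additional scaling. The function name and example probabilities are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy, in bits, of a predicted class distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    # Zero-probability classes contribute nothing, so skip them to avoid log(0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical two-class outputs: (P(ICCA), P(metastasis)).
confident = predictive_entropy([0.98, 0.02])
torn = predictive_entropy([0.5, 0.5])

print(f"confident case: {confident:.3f} bits")
print(f"torn case: {torn:.3f} bits")
```

For two classes the entropy ranges from 0 bits (all the probability on one answer) to 1 bit (a dead-even split), which makes it a convenient dial for deciding when to call in a human.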

That is a much healthier way to use AI in medicine. The model does not need to answer every case. It needs to answer the cases where it has enough signal and hand the murky ones back to humans. After thresholding, the model kept 46% of samples as high-confidence predictions, and on those retained cases the AUROC jumped to 0.958 with a false-positive rate of 0. In prospective validation, the final model, named AI2CCA, reached AUROCs of 1.00 in the French cohort and 0.965 in the Asian cohort (India and Korea), with one misclassified case in the Asian series.
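The answer-or-hand-back idea can be sketched in a few lines. This is a simplified illustration that uses an entropy cutoff alone; the study combined a generalized ODIN-style approach with predictive entropy, and its actual thresholds are not reproduced here. The 0.5-bit cutoff, the function names, and the toy probabilities below are all assumptions made for the example.

```python
import math


def entropy_bits(p_icca):
    """Binary entropy, in bits, of the predicted probability of ICCA."""
    return -sum(q * math.log2(q) for q in (p_icca, 1.0 - p_icca) if q > 0)


def triage(p_icca, max_entropy=0.5):
    """Return a label for confident cases, or 'refer' to hand the case back."""
    if entropy_bits(p_icca) > max_entropy:
        return "refer"  # too uncertain: leave it to the pathologist
    return "ICCA" if p_icca >= 0.5 else "metastasis"


# Hypothetical model outputs, one predicted ICCA probability per slide.
cases = [0.97, 0.91, 0.62, 0.48, 0.30, 0.05, 0.02]
decisions = [triage(p) for p in cases]

# Coverage is the share of cases the model was willing to answer.
coverage = sum(d != "refer" for d in decisions) / len(decisions)

print(decisions)
print(f"coverage: {coverage:.0%}")
```

Tightening the cutoff trades coverage for reliability: fewer cases get an automatic answer, but the ones that do are the ones the model is least torn about. Where to set that dial is something to tune on held-out data, not on the cases you plan to report.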

That is the diagnostic equivalent of saying: this tool may not drive the whole truck, but when it takes the wheel, it stays in its lane.

Why This Is More Than A Fancy Microscope Toy

The clinical pain point is very real. If a pathologist cannot confidently tell ICCA from metastasis, doctors may have to hunt for an occult primary tumor elsewhere in the body. That can mean upper and lower gastrointestinal endoscopy and other exclusionary workups. Necessary sometimes, yes. Fun? Only if your hobby is medical billing archaeology.

A confidence-based AI assistant could reduce unnecessary investigations, speed treatment decisions, and help standardize tricky diagnoses across hospitals. That last part is big. Rare cancers do not distribute themselves politely. A smaller center may not see enough cases to build the same pattern library as a major liver cancer center.

This paper also fits a larger movement in computational pathology. UNI showed that broad self-supervised pretraining on more than 100 million tissue patches can transfer across many pathology tasks. CONCH added language-image training from pathology captions, giving models a way to connect visual patterns with diagnostic descriptions. TITAN pushed whole-slide multimodal modeling further, using large-scale slide and report data. In other words, the field has been upgrading from lawnmower engines to turbocharged diagnostic platforms.

Keep The Wrench Handy

Still, nobody should bolt this into the clinic tomorrow and toss the manual. This was a retrospective-plus-prospective validation study, not proof that every hospital scanner, stain variation, population mix, and workflow will behave the same. Pathology AI can overheat on distribution shifts: new labs, new preparation habits, different patient groups, or rare edge cases that were not in the training garage.

The confidence filter helps because it gives the system a clutch. It can disengage when the road gets weird. But confidence is not truth. A model can be confidently wrong, just like a GPS telling you to turn into a lake with the serene voice of a yoga instructor.

The takeaway is practical: AI2CCA is interesting because it is not trying to replace the pathologist. It is trying to flag the cases where the machine sees a strong pattern and admit when the pattern is too messy. In medical AI, that humility may be the most useful part under the hood.

References

Cheng Y, Azouzi N, Laurent-Bellue A, et al. “A confidence-based, artificial intelligence pathology model for diagnosis of intrahepatic cholangiocarcinoma.” Annals of Oncology. 2026. DOI: 10.1016/j.annonc.2026.02.018. PMID: 41791652.
Lu MY, Chen B, Williamson DFK, et al. “A visual-language foundation model for computational pathology.” Nature Medicine. 2024. DOI: 10.1038/s41591-024-02856-4.
Chen RJ, Ding T, Lu MY, et al. “Towards a general-purpose foundation model for computational pathology.” Nature Medicine. 2024. DOI: 10.1038/s41591-024-02857-3. arXiv: 2308.15474.
Ding T, Wagner SJ, Song AH, et al. “A multimodal whole-slide foundation model for pathology.” Nature Medicine. 2025. DOI: 10.1038/s41591-025-03982-3. arXiv: 2411.19666.
Hsu YC, Shen Y, Jin H, Kira Z. “Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data.” CVPR 2020. arXiv: 2002.11297.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.