Ductal carcinoma in situ, or DCIS, is one of those medical phrases that sounds more decisive than it is. The abnormal cells are still “in place,” inside the breast ducts, which is good. But some DCIS can later become invasive cancer, which is bad. The annoying middle part is that doctors often cannot tell with enough confidence which cases are the sleepy ones and which ones are quietly assembling a villain origin story.
So, many women get surgery, radiation, hormone therapy, or some combination of the above. Sometimes that is exactly right. Sometimes it may be more treatment than the biology actually needed. Medicine hates that kind of ambiguity, and frankly, same.
The paper by Doyle, Oerlemans, and colleagues asks a very practical question: can a deep learning system look at ordinary H&E-stained pathology slides and predict the features that help decide whether someone might qualify for active surveillance instead of aggressive treatment?
If I am reading this right, the answer is: surprisingly well, but please do not throw your microscope into the ocean just yet.
The Slide Is Doing More Gossip Than Expected
The researchers focused on three things clinicians already care about in DCIS: grade, estrogen receptor status, and HER2 status. Grade is basically how unruly the cells look under the microscope. ER and HER2 are molecular markers that help classify breast lesions and guide treatment decisions.
Normally, those biomarkers need their own lab tests, typically immunohistochemistry on extra tissue sections. Here, the model tries to infer them from the regular H&E slide, which is the pathology equivalent of guessing someone’s Spotify Wrapped from their handwriting. Weirdly, with enough data, neural networks can sometimes pick up visual patterns humans do not formally name.
The team used pathology foundation models, which are large AI models pretrained on huge collections of tissue images. Think of them as digital pathology interns who have stared at enough pink-and-purple tissue tiles to start noticing things. Not understanding them like a human pathologist does, exactly. More like a very caffeinated pattern-matching appliance with excellent eyesight and no lunch break.
They trained and tested their pipeline on a Dutch multicenter dataset of 887 DCIS cases, then externally validated it on 259 cases from the UK. That external validation bit matters. Models that only work at their home hospital are less “clinical AI” and more “local weather forecast with a lab coat.”
The Numbers, But With Less Spreadsheet Fog
On the Dutch dataset, the models reached mean AUROCs of 0.90 for ER, 0.84 for HER2, and 0.86 for grade. On the UK dataset, performance dropped, as it often does when reality enters the room carrying a different scanner, staining protocol, and patient population: 0.80 for ER, 0.74 for HER2, and 0.75 for grade.
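AUROC is a ranking score: it is the probability that the model gives a randomly chosen positive case a higher score than a randomly chosen negative one. A perfect model gets 1.0, a coin flip gets 0.5, and most real clinical AI lives in the emotionally complicated suburbs between those numbers. These results suggest the model is not magic, but it is learning useful signals.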
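To make the "ranking score" idea concrete, here is a minimal sketch that computes AUROC by brute force over every (positive, negative) pair. The scores and labels are made-up toy values, not data from the paper, and the function name `auroc` is just for illustration.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = marker positive, 0 = marker negative.
labels = [1, 1, 1, 0, 0, 0]
perfect = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # every positive outranks every negative
constant = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # no ranking information at all
mixed = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]     # one positive scored below one negative

print(auroc(perfect, labels))   # 1.0
print(auroc(constant, labels))  # 0.5
print(auroc(mixed, labels))     # 8 of 9 pairs ranked correctly
```

Note that nothing here depends on a decision threshold. AUROC only asks whether the ordering is right, which is why a model can have a decent AUROC and still need calibration before its scores mean anything as probabilities.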
Then the authors combined grade, ER, and HER2 predictions to classify whether patients matched active surveillance criteria used in the LORD trial: screen-detected, ER-positive, HER2-negative, grade 1 or 2 DCIS. The balanced accuracy was 0.81 in the Dutch cohort and 0.64 in the UK cohort, with negative predictive values of 0.86 and 0.76.
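Here is a rough sketch of what "combining predictions" and those two metrics could look like. It is not the authors' pipeline: the 0.5 thresholds, the `SlidePrediction` fields, the choice of "eligible" as the positive class, and all the toy numbers are assumptions for illustration. The screen-detected criterion is left out because it comes from clinical records rather than the slide.

```python
from dataclasses import dataclass


@dataclass
class SlidePrediction:
    # Hypothetical model outputs for one case, each a probability in [0, 1].
    p_er_positive: float
    p_her2_positive: float
    p_grade_3: float


def predicted_eligible(pred, threshold=0.5):
    """Rule-based combination: ER-positive AND HER2-negative AND grade 1 or 2.

    The shared 0.5 threshold is an assumption, not a value from the paper.
    """
    return (
        pred.p_er_positive >= threshold
        and pred.p_her2_positive < threshold
        and pred.p_grade_3 < threshold
    )


def balanced_accuracy_and_npv(y_true, y_pred):
    """Treats 'eligible' as the positive class (an assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp + fn == 0 or tn + fp == 0 or tn + fn == 0:
        raise ValueError("metrics undefined: a required count is zero")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    npv = tn / (tn + fn)  # of cases called "not eligible", share truly not eligible
    return balanced_accuracy, npv


# Seven made-up cases with made-up ground truth.
cases = [
    SlidePrediction(0.90, 0.10, 0.20),
    SlidePrediction(0.80, 0.20, 0.70),
    SlidePrediction(0.30, 0.10, 0.10),
    SlidePrediction(0.70, 0.60, 0.20),
    SlidePrediction(0.95, 0.05, 0.30),
    SlidePrediction(0.60, 0.40, 0.40),
    SlidePrediction(0.20, 0.80, 0.90),
]
truly_eligible = [True, False, False, True, True, False, False]

calls = [predicted_eligible(c) for c in cases]
bal_acc, npv = balanced_accuracy_and_npv(truly_eligible, calls)
print(f"balanced accuracy = {bal_acc:.3f}, NPV = {npv:.3f}")
```

Balanced accuracy averages sensitivity and specificity, so it does not get flattered when one class dominates the cohort. NPV, on the other hand, shifts with how common each class is, which is one more reason numbers can move between the Dutch and UK cohorts.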
That UK drop is the part where my anxious overachiever brain taps the brakes. Helpful? Yes. Ready to single-handedly decide treatment? No. This is decision support, not an oracle wearing a tiny white coat.
Why This Actually Matters
Active surveillance for low-risk DCIS is not science fiction. The COMET randomized clinical trial reported that, at two years, active monitoring for low-risk DCIS did not produce a higher rate of invasive cancer in the same breast than guideline-concordant care. That does not settle every long-term question, but it makes better risk stratification feel less like a nice academic hobby and more like a tool patients could actually need.
This paper fits into a larger wave in computational pathology. Models like Prov-GigaPath, UNI, Virchow, and CONCH show that foundation models can transfer across many pathology tasks when labels are scarce and whole-slide images are absurdly large. A single slide can be gigapixels. That is not an image, that is a continent with staining artifacts.
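For a sense of scale, here is a back-of-the-envelope sketch of why whole-slide images get chopped into tiles before a foundation model ever sees them. The slide dimensions and the 224-pixel tile size are assumed for illustration, not taken from the paper.

```python
import math


def tile_grid(width_px, height_px, tile_px=224):
    """Count non-overlapping square tiles needed to cover a slide.

    Partial tiles at the right and bottom edges are counted as full tiles.
    """
    cols = math.ceil(width_px / tile_px)
    rows = math.ceil(height_px / tile_px)
    return cols, rows, cols * rows


# Hypothetical slide size at full resolution.
width_px, height_px = 100_000, 80_000
cols, rows, n_tiles = tile_grid(width_px, height_px)
gigapixels = width_px * height_px / 1e9

print(f"{gigapixels:.1f} gigapixels -> {cols} x {rows} grid = {n_tiles:,} tiles")
```

In practice, pipelines usually skip the blank glass and only keep tiles that contain tissue, so the real count is lower. Even so, one slide still becomes thousands of small images that have to be summarized into a single case-level answer.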
The practical dream is not “AI replaces pathologists,” which is both lazy and wrong. The dream is more like: pathologists get a second reader that flags likely low-risk cases, highlights uncertainty, and helps standardize scoring across hospitals. Correct me if I am wrong, but that sounds less like robot takeover and more like giving the most overworked person in the room a very specialized calculator.
And because this is computer vision, image quality still matters. In non-clinical settings, tools like combb2.io already show how browser-based enhancement can sharpen noisy images. Medical pathology has much stricter validation needs, obviously, but the shared idea is simple: better pixels can make downstream interpretation less chaotic.
The Catch, Because Of Course There Is One
The model performed worse on external data. That is the classic domain shift problem: new hospital, new scanner, new staining chemistry, new subtle mess. AI loves patterns, including the dumb ones we wish it would ignore.
Also, predicting biomarkers from H&E is not the same as replacing biomarker assays. These outputs need prospective validation, calibration, workflow testing, and clinician-facing explanations that do not read like a cryptic fortune cookie.
Still, the paper is exciting in a grounded way. If these models keep improving, they could help identify DCIS patients who might safely avoid overtreatment, while still catching higher-risk cases that need action. That is the kind of AI result I can get behind: not flashy, not pretending to be a genius, just quietly reducing unnecessary harm. Honestly, very aspirational behavior for a neural network.
References
- Doyle S, Oerlemans MA, Brunekreef J, et al. “Enabling DCIS subtyping: leveraging foundation models for robust grading and molecular biomarker scoring.” npj Breast Cancer. 2026. DOI: 10.1038/s41523-026-00957-6
- Hwang ES, Hyslop T, Lynch T, et al. “Active Monitoring With or Without Endocrine Therapy for Low-Risk Ductal Carcinoma In Situ: The COMET Randomized Clinical Trial.” JAMA. 2025;333(11):972-980. DOI: 10.1001/jama.2024.26698
- Xu H, Usuyama N, Bagga J, et al. “A whole-slide foundation model for digital pathology from real-world data.” Nature. 2024;630:181-188. DOI: 10.1038/s41586-024-07441-w
- Lu MY, Chen RJ, Ding T, et al. “A visual-language foundation model for computational pathology.” Nature Medicine. 2024. DOI: 10.1038/s41591-024-02856-4
- Vorontsov E, Bozkurt A, Casson A, et al. “A clinical benchmark of public self-supervised pathology foundation models.” Nature Communications. 2025. DOI: 10.1038/s41467-025-58796-1
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.