A Most Peculiar Paradox in Modern Therapeutics
We find ourselves, dear reader, in the grip of a delightful pharmacological contradiction. Immune checkpoint inhibitors (ICIs) - the crown jewels of modern cancer therapy - work by unleashing your immune system to hunt down tumors. Splendid idea, truly. Except the immune system, once unchained, sometimes gets enthusiastic and starts attacking perfectly innocent organs like a Roomba that gained sentience and declared war on the furniture.
These friendly-fire incidents are called immune-related adverse events (irAEs), and they can hit nearly any organ: lungs, liver, colon, thyroid, skin, joints - the immune system is nothing if not thorough. Over half of ICI patients develop some form of irAE, ranging from annoying rashes to life-threatening pneumonitis. The trouble? Detecting them in the sprawling chaos of clinical notes is like finding a needle in a haystack, if the haystack were written in doctor handwriting and the needle kept changing shape.
Enter irAE-GPT, a study from Bejan et al. published in eBioMedicine (2026), which had the audacity - the scientific audacity - to ask: can we just point GPT at a mountain of clinical notes and have it flag these events automatically?
The Specimen Under Examination
The researchers unleashed GPT-3.5, GPT-4, and GPT-4o on clinical notes from 442 patients across three institutions: Vanderbilt, UCSF, and seven Roche-sponsored clinical trials. No fine-tuning, no special training - just zero-shot prompting. They essentially handed the models a stack of medical records and said, "find the irAEs," which is roughly equivalent to giving a very well-read parrot a medical license and hoping for the best.
GPT-4o emerged as the top performer, achieving patient-level F1 scores between 56% and 66% depending on the dataset. It particularly excelled at spotting hematological irAEs (a perfect F1 of 1.0 - the model equivalent of a hole-in-one) and gastrointestinal events (F1 of 0.81-0.85). The models showed high sensitivity, meaning they rarely missed an irAE. The problem was the opposite: they were a bit too eager, flagging adverse events that weren't actually caused by ICIs. The overprediction bias is real - GPT-4o is basically that friend who diagnoses themselves with every disease after a WebMD session.
The Causation Conundrum (Or: Correlation Is Not Immunotherapy)
Here's where it gets genuinely interesting. The biggest limitation wasn't vocabulary or medical knowledge - it was causal reasoning. A patient on ICIs might develop pneumonia from a plain old infection, but the model would read "pneumonia" in the notes and think, "Aha! irAE!" Linking an adverse event to its actual cause requires understanding the clinical narrative in a way that current LLMs find remarkably difficult. It's the difference between reading that someone was at the scene of a crime and concluding they committed it.
This mirrors findings from other pharmacovigilance studies. Luo et al. (2024) found similar patterns with their AE-GPT system on vaccine safety reports (PMID: 38536862), and Ge et al. (2024) showed ChatGPT lagging behind fine-tuned models for adverse event extraction (arXiv: 2402.15663). A comprehensive scoping review by Chopard et al. (2025) catalogued the broader landscape of NLP for adverse drug events in EHRs, confirming that causal attribution remains the field's white whale (PMID: 39786481).
Why This Matters Beyond the Laboratory
Currently, identifying irAEs requires physicians to manually sift through thousands of clinical notes - a process roughly as efficient as reading every book in a library to find a single misplaced comma. With ICI use expanding rapidly across cancer types, the volume of data requiring safety monitoring is growing faster than any human team can handle.
If tools like irAE-GPT can be refined to reduce false positives - particularly by improving causal reasoning - they could dramatically accelerate pharmacovigilance. The ability to process notes from clinical trials and real-world EHR systems alike (as this study demonstrated across seven trials and two hospital systems) suggests genuine generalizability. For anyone working with complex document analysis at scale, the parallel challenges are familiar - whether you're extracting safety signals from clinical notes or pulling structured data from messy PDFs using tools like pdfb2.io, the core problem of making sense of unstructured text is universal.
The Verdict, Rendered with Due Scientific Caution
irAE-GPT represents a genuinely promising specimen in the emerging taxonomy of clinical AI applications. It is not, we must note with characteristic restraint, a replacement for physician judgment. It is, however, a remarkably capable first-pass filter - one that catches most events, even if it occasionally cries wolf. The next frontier is teaching these models not just what happened, but why - a challenge that, one suspects, will keep researchers productively occupied for years to come.
References
-
Bejan CA, Wang M, Venkateswaran S, et al. irAE-GPT: leveraging large language models to identify immune-related adverse events in electronic health records and clinical trial datasets. eBioMedicine. 2026. DOI: 10.1016/j.ebiom.2026.106227. PMID: 41951517
-
Luo Y, et al. Using Large Language Models to extract adverse events from surveillance reports. PLOS ONE. 2024. PMID: 38536862
-
Ge Y, et al. Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study. EACL 2024. arXiv: 2402.15663
-
Chopard D, et al. Leveraging Natural Language Processing and Machine Learning Methods for Adverse Drug Event Detection in Electronic Health/Medical Records: A Scoping Review. Drug Safety. 2025. PMID: 39786481
-
Agbavor F, et al. Large Language Models for Adverse Drug Events: A Clinical Perspective. J Clin Med. 2025. DOI: 10.3390/jcm14155490
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.