Blocking Review: Humans Are Accurate, but the Queue Is Brutal

Fix the endpoint-adjudication bottleneck, and you unblock faster trial analysis, which enables cheaper studies, which might let useful heart drugs spend less time rotting in paperwork purgatory. That is the PR this paper opens: can an AI reviewer handle major adverse cardiovascular events, or MACE, without turning a serious clinical trial into autocomplete with a stethoscope?

In cardiovascular trials, MACE usually means three ugly outcomes nobody wants: cardiovascular death, nonfatal heart attack, and nonfatal stroke. Trials bundle them into a single composite endpoint, which is a tidy statistical phrase for "the stuff that really matters happened" (MACE background; clinical endpoint background).

The catch is adjudication. Investigators flag possible events, then a physician clinical events committee (CEC) reads medical records and decides what actually counts. This is sensible. It is also slow, expensive, and about as glamorous as debugging a CSV parser at 2 a.m.

Marti-Castellote and colleagues built an AI system called Auto-MACE to do that job faster. Their setup used an iteratively refined prompt with OpenAI o1-mini to adjudicate events from clinical documents, plus a Clinical Longformer model to estimate confidence on each decision. Longformer matters here because trial records are long, messy, and full of details buried where no sane person wants to look. Standard transformer self-attention compares every token with every other token, so cost grows quadratically with document length. Regular models are great until the input starts acting like an unbounded buffer and everyone regrets their life choices. Longformer replaces most of that with sliding-window attention, which scales linearly (Transformer background; Longformer).
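To put rough numbers on why long records push you toward a Longformer-style model, here is a minimal sketch that counts query-key pairs under full attention versus a sliding window. The 4,096-token length and 512-token window are illustrative assumptions, not settings reported for Auto-MACE, and the sketch ignores the handful of global-attention tokens Longformer also uses.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Full self-attention: every token attends to every token (n^2 pairs)."""
    return n_tokens * n_tokens


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Sliding-window attention: each token attends to itself plus
    up to window // 2 neighbors on each side, clipped at the edges."""
    half = window // 2
    total = 0
    for i in range(n_tokens):
        lo = max(0, i - half)
        hi = min(n_tokens - 1, i + half)
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    n, w = 4096, 512  # assumed values, for illustration only
    full = full_attention_pairs(n)
    local = sliding_window_pairs(n, w)
    print(f"full attention pairs:   {full:,}")
    print(f"sliding-window pairs:   {local:,}")
    print(f"ratio (full / windowed): {full / local:.1f}x")
```

Doubling the document length quadruples the first number but only about doubles the second, which is the whole pitch.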

LGTM, But Only on the Confident Cases

The test bed was the PARADISE-MI trial, with 5,661 patients after myocardial infarction (MI). Auto-MACE gave a confident ruling on 69% of deaths, 46% of possible MIs, and 81% of possible strokes. On those confident cases, it matched the human committee 97% of the time for deaths, 89% for MIs, and 88% for strokes. Across all candidate events, agreement dropped to 86%, 76%, and 84%, respectively (original paper).

That is the key code-review comment on this whole paper: the model is useful when it knows what it knows. When confidence is high, performance looks strong. When confidence is lower, humans still need to step in. Honestly, that is not a flaw so much as basic adult supervision.
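For a back-of-envelope sense of how much worse the low-confidence cases are, the sketch below backs out the implied agreement on them from the three reported figures per event type. It assumes overall agreement is a coverage-weighted average of confident and non-confident agreement over the same set of events. That is an assumption of this sketch, not something the paper states, and the inputs are rounded percentages, so treat the output as rough.

```python
def implied_low_confidence_agreement(
    coverage: float, confident_agreement: float, overall_agreement: float
) -> float:
    """Solve overall = coverage * confident + (1 - coverage) * low for `low`.

    Assumes all three rates share one denominator (all candidate events).
    """
    return (overall_agreement - coverage * confident_agreement) / (1.0 - coverage)


# (coverage, agreement on confident cases, agreement on all cases), as reported
reported = {
    "death": (0.69, 0.97, 0.86),
    "MI": (0.46, 0.89, 0.76),
    "stroke": (0.81, 0.88, 0.84),
}

for event, (cov, conf, overall) in reported.items():
    low = implied_low_confidence_agreement(cov, conf, overall)
    print(f"{event}: implied agreement on non-confident cases ~ {low:.0%}")
```

Under that assumption the non-confident cases land at roughly 62%, 65%, and 67% agreement for deaths, MIs, and strokes. That is nowhere near good enough to run unsupervised, which is exactly why they go back to humans.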

The biggest practical result was not just agreement percentages. When the researchers used AI adjudication to estimate the treatment effect of sacubitril/valsartan versus ramipril on composite MACE, the answer looked almost the same as the human committee's answer: a hazard ratio (HR) of 0.91 with Auto-MACE versus 0.90 with the CEC. Nit: if your trial conclusion survives swapping in a robot reviewer for many events, that is a pretty solid stress test.

Where the Model Face-Planted

The failure modes are very human in a depressing way. The model struggled when important evidence sat inside tables or checkbox forms, especially troponin data for MI. It also confused old events with new events, like prior MI history versus a fresh MI, or prior stroke findings on imaging versus a new stroke. In other words, it got tripped up by timeline logic, which is the same reason half of us still hate legacy code.

That lines up with nearby research. A 2025 AMIA paper on automating cardiovascular death adjudication with LLMs reported decent extraction and adjudication performance, but also flagged temporal reasoning as a weak spot (Sivarajkumar et al.). Broader benchmark studies in biomedical NLP keep finding the same pattern: LLMs can be strong, but performance jumps around by task, dataset, and prompting setup, which is not the kind of personality you want running unsupervised inside a pivotal trial (Computers in Biology and Medicine, 2024; Nature Communications, 2025; systematic review, 2025).

Approved With Reservations

The smart takeaway is not "replace the physicians." That would be a blocking comment. The smart takeaway is hybrid adjudication: let AI clear the obvious cases, send the uncertain ones to the committee, and keep the audit trail intact.
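As a minimal sketch of what that hybrid workflow could look like in code: route each model ruling by confidence and log every routing decision. The class names, the `triage` function, and the 0.9 threshold are all hypothetical, invented here for illustration. The paper's actual confidence cutoff and pipeline are not reproduced.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ModelRuling:
    event_id: str
    label: str         # e.g. "MI" or "not MI"
    confidence: float  # 0.0 to 1.0, from a separate confidence model


@dataclass
class TriageResult:
    auto_accepted: List[ModelRuling] = field(default_factory=list)
    sent_to_committee: List[ModelRuling] = field(default_factory=list)
    # One entry per event: (event_id, label, confidence, route)
    audit_log: List[Tuple[str, str, float, str]] = field(default_factory=list)


def triage(rulings: List[ModelRuling], threshold: float = 0.9) -> TriageResult:
    """Accept confident rulings; send the rest to the human committee.

    The threshold is a hypothetical default; a real one would need to be
    calibrated and validated against committee decisions.
    """
    result = TriageResult()
    for ruling in rulings:
        if ruling.confidence >= threshold:
            route = "auto"
            result.auto_accepted.append(ruling)
        else:
            route = "committee"
            result.sent_to_committee.append(ruling)
        # Log every decision, including the ones the model handled alone.
        result.audit_log.append(
            (ruling.event_id, ruling.label, ruling.confidence, route)
        )
    return result


if __name__ == "__main__":
    demo = [
        ModelRuling("evt-001", "MI", 0.97),
        ModelRuling("evt-002", "not MI", 0.55),
    ]
    outcome = triage(demo)
    for entry in outcome.audit_log:
        print(entry)
```

The design choice worth copying is the boring one: the audit log records the auto-accepted cases too, so a regulator can replay every decision, not just the ones a human touched.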

That matters because clinical trials are full of repetitive document review, and every extra review cycle costs time and money. If AI can safely reduce the pile without smuggling in bias, hidden errors, or regulatory headaches, trials get leaner. Not magical. Just leaner. In medicine, that counts.

One more nit before merge: this was validated in one major trial context, with one event family, against an existing adjudication process that already has its own imperfections. Regulators are not going to rubber-stamp "trust the bot" because somebody got a nice confusion matrix. They will want transparency, traceability, reproducibility, and proof that performance holds up when the records get uglier, the sites get messier, and the edge cases start breeding.

Still, as first serious drafts go, this is clean work. Not flashy. Not reckless. Just a solid refactor of a painfully manual workflow.

References

Marti-Castellote PM, Badrouchi S, Claggett B, et al. Using Artificial Intelligence to Adjudicate Major Adverse Cardiovascular Events in Clinical Trials. Journal of the American College of Cardiology. 2025. DOI: 10.1016/j.jacc.2025.10.055. PMID: 41493293.
Sivarajkumar S, Ameri K, Li C, Wang Y, Jiang M. Automating Adjudication of Cardiovascular Events Using Large Language Models. AMIA Annual Symposium Proceedings. 2025. PMCID: PMC12919421. PMID: 41726539.
Omar M, Nadkarni GN, Klang E, Glicksberg BS. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS Digital Health. 2024. DOI: 10.1371/journal.pdig.0000662. PMCID: PMC11575759.
Hernández DC, Sierra K, Ibrohim M, et al. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks. Computers in Biology and Medicine. 2024. DOI: 10.1016/j.compbiomed.2024.108189.
Chen Q, Hu Y, Peng X, et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nature Communications. 2025. DOI: 10.1038/s41467-025-56989-2.
Vatsalan D, Seneviratne M, et al. Large Language Models for Health Care Text Classification: Systematic Review. Journal of Medical Internet Research. 2025. PMCID: PMC12936667.
Beltagy I, Peters ME, Cohan A. Longformer: The Long-Document Transformer. arXiv:2004.05150.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded
