This Heart-Trial AI Wants a Spotter

ADAPT-CEC probably walked into the cardiovascular trial gym feeling pretty good about its form, then immediately got handed a new workout plan: “Nice myocardial infarction reps, champ. Now adapt to bleeding and cardiovascular death after seeing only 20 examples per endpoint.”

That is the core flex in this Circulation paper by Vemulapalli and colleagues: can one AI system help adjudicate cardiovascular trial events across different definitions, without needing a full retraining montage set to 1980s synth music? The answer is: mostly, and the “with a human spotter” version looks much stronger.

The Endpoint Gym Is Expensive

Clinical endpoint classification, or CEC, is the official judging table for many cardiovascular trials. Did this patient have a myocardial infarction? Was that stroke really a stroke by the trial’s definition? Did cardiovascular death count for the primary endpoint? These calls matter because they can change whether a drug looks helpful, harmful, or about as exciting as a treadmill used as a coat rack.

The problem: human adjudication is slow and expensive. A 2024 JACC review on AI in cardiovascular clinical trials points out that randomized trials can cost tens of thousands of dollars per participant, and endpoint work is one of the heavy lifts in the trial pipeline. AI could help with trial design, recruitment, follow-up, endpoint detection, and analysis, but only if it behaves with enough discipline for medicine’s very low tolerance for “my bad” energy.

Meet ADAPT-CEC, the Model Doing Progressive Overload

The researchers trained ADAPT-CEC using adjudicated events from ODYSSEY OUTCOMES, including myocardial infarction, stroke, and heart failure. Then they tested it externally in EUCLID, a different cardiovascular trial, on myocardial infarction, stroke, bleeding, and cardiovascular death.

Here is the fun part: before the EUCLID test, the model got only 20 suspected events per endpoint for adaptation. That is not a buffet. That is a protein bar and a stern look from the coach.

The study compared three strategies:

ADAPT-CEC alone.
GPT-4o doing direct adjudication.
A hybrid setup where humans reviewed the 30% of suspected events where ADAPT-CEC had the lowest confidence.

That last strategy is the spotter system. Let the model lift the routine sets, but when the bar starts wobbling over its digital sternum, bring in a person.
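The spotter rule itself fits in a few lines. What follows is a minimal sketch, not the paper's implementation: the `SuspectedEvent` record, its `confidence` field, and the `split_for_review` helper are hypothetical names, and this post does not describe how ADAPT-CEC actually computes its confidence. The only detail borrowed from the study is the 30% review fraction.

```python
import math
from dataclasses import dataclass


@dataclass
class SuspectedEvent:
    event_id: str
    model_label: bool   # hypothetical: True if the model calls it an endpoint
    confidence: float   # hypothetical: model confidence in its own label, 0..1


def split_for_review(events, review_fraction=0.30):
    """Return (human_queue, auto_accepted).

    The lowest-confidence share of events goes to human adjudicators;
    everything else keeps the model's label.
    """
    if not 0.0 <= review_fraction <= 1.0:
        raise ValueError("review_fraction must be between 0 and 1")
    ranked = sorted(events, key=lambda e: e.confidence)  # least confident first
    n_review = math.ceil(len(ranked) * review_fraction)  # round up the review count
    return ranked[:n_review], ranked[n_review:]


# Toy confidences, invented for illustration only (not study data).
toy_confidences = [0.99, 0.55, 0.91, 0.62, 0.97, 0.88, 0.51, 0.95, 0.93, 0.90]
events = [
    SuspectedEvent(f"E{i}", model_label=(i % 2 == 0), confidence=c)
    for i, c in enumerate(toy_confidences)
]
human_queue, auto_accepted = split_for_review(events)
print("send to humans:", [e.event_id for e in human_queue])
print("auto-accepted:", len(auto_accepted))
```

Ranking by confidence and cutting at a fixed fraction keeps the human workload predictable. A fixed confidence threshold would be the other obvious design, but then the size of the review pile would swing with every batch.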

The Scores: Good Form, Better With a Spotter

Across 13,885 suspected EUCLID primary endpoint events, ADAPT-CEC correctly classified 86.4% of endpoints and 99.4% of non-endpoints compared with human adjudication. GPT-4o hit 76.3% of endpoints and 99.8% of non-endpoints. The hybrid strategy did best on endpoints at 95.6%, while still correctly identifying 99.6% of non-endpoints.
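Each strategy gets two percentages because each one is graded twice: once on the events humans called endpoints, and once on the events humans rejected. Here is a rough sketch of that bookkeeping, assuming human adjudication is the reference standard. The `agreement_rates` helper and the toy labels are invented for illustration and are not study data.

```python
def agreement_rates(human_labels, model_labels):
    """Compare model calls with human adjudication (treated as the reference).

    Returns (endpoint_rate, non_endpoint_rate): the share of human-confirmed
    endpoints the model also called endpoints, and the share of human-rejected
    events the model also rejected.
    """
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, model_labels))
    endpoints = [m for h, m in pairs if h]          # model calls on true endpoints
    non_endpoints = [m for h, m in pairs if not h]  # model calls on non-endpoints
    if not endpoints or not non_endpoints:
        raise ValueError("need at least one endpoint and one non-endpoint")
    endpoint_rate = sum(endpoints) / len(endpoints)
    non_endpoint_rate = sum(not m for m in non_endpoints) / len(non_endpoints)
    return endpoint_rate, non_endpoint_rate


# Toy labels, invented for illustration only (not study data).
human = [True, True, True, True, False, False, False, False, False, False]
model = [True, True, True, False, False, False, False, False, False, True]
endpoint_rate, non_endpoint_rate = agreement_rates(human, model)
print(f"endpoints: {endpoint_rate:.1%}, non-endpoints: {non_endpoint_rate:.1%}")
```

In diagnostic-test vocabulary these two rates are sensitivity and specificity. When non-endpoints heavily outnumber endpoints, the second number can look excellent while the first still has room to grow, which is why both get reported.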

The paper reports F1 scores, which combine precision and recall into one number (their harmonic mean). Think of F1 as the model’s “don’t skip balance training” metric: it punishes you if you catch the true events but also flag half the parking lot, or if you are so cautious that you miss the actual heart attacks. For the hybrid approach, F1 scores were 0.94 for cardiovascular death, 0.80 for myocardial infarction, 0.82 for stroke, and 0.83 for bleeding.
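For anyone who wants to see the balance penalty in action, here is a small sketch of the standard F1 formula computed from raw counts. The counts are invented for illustration and have nothing to do with the study's data; `f1_score` is just a local helper name.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # no correct positive calls, so F1 is zero by convention
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only (not study data).
flags_everything = f1_score(true_positives=90, false_positives=300, false_negatives=10)
too_cautious = f1_score(true_positives=40, false_positives=2, false_negatives=60)
balanced = f1_score(true_positives=85, false_positives=15, false_negatives=15)

print(f"flags half the parking lot: {flags_everything:.2f}")
print(f"misses real events:         {too_cautious:.2f}")
print(f"balanced:                   {balanced:.2f}")
```

The trigger-happy classifier catches 90 of 100 true events and the timid one is almost never wrong when it speaks up, yet both score well below the balanced one. That is the harmonic mean refusing to let one strong lift cover for a skipped one.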

The treatment-effect estimates also stayed close to the human-adjudicated result. EUCLID’s primary endpoint hazard ratio was 1.02 with human adjudication. The hybrid version gave 1.04, ADAPT-CEC gave 0.98, and GPT-4o gave 1.06, so every estimate landed within 0.04 of the human figure. Not identical, but the trial-level conclusion did not suddenly start doing burpees in the corner.

Why This Is More Than AI Doing Paperwork

This research matters because clinical trial endpoints are where evidence gets real. If AI can reduce the manual burden while preserving trial conclusions, trials could move faster, cost less, and maybe study more patients or more representative populations. That is not flashy robot-doctor stuff. It is better plumbing for medical evidence, which sounds boring until you remember the entire house depends on it.

The adaptive part is especially useful. Trial definitions change. Endpoints differ. A model that can learn a new definition from a small calibration set is doing something closer to progressive overload than memorization. It is not just bench-pressing the dataset it trained on. It is trying a new machine without immediately dropping a dumbbell through the floor.

Still, this is not permission to fire the clinical events committee and replace it with a chatbot wearing a stethoscope screensaver. The hybrid result is the headline: AI plus targeted human review beat AI alone. That is exactly the kind of human-in-the-loop setup medicine should like. Let automation handle volume. Keep experts for ambiguity, rare edge cases, and definition hair-splitting, which is where clinical trials quietly hide the dragons.

Also, privacy and document handling matter. Endpoint adjudication often involves piles of clinical records, PDFs, and source documents. Browser-based tools like pdfb2.io are a useful reminder of the direction this ecosystem is moving: process sensitive documents with as little unnecessary exposure as possible. For clinical trials, that principle is not a nice accessory. It is part of the training plan.

The Cool-Down

ADAPT-CEC shows that adaptive AI can help classify cardiovascular events across trials and definitions, especially when humans review the lowest-confidence cases. The model got solid gains, but the best workout was supervised.

Next reps: prospective trials, broader validation, bias checks, audit trails, privacy-preserving deployment, and clear rules for when the AI must rack the weight and call a human.

References

Vemulapalli S, Peña Guerra K, Wojdyla D, et al. “Adaptive AI for Cardiovascular Event Adjudication: Cardiovascular Event Adjudication Across Different Definitions in the ODYSSEY OUTCOMES and EUCLID Trials.” Circulation. 2026;153(22):1694-1706. DOI: 10.1161/CIRCULATIONAHA.126.080072. PMID: 41911340.
Cunningham JW, Abraham WT, Bhatt AS, et al. “Artificial Intelligence in Cardiovascular Clinical Trials.” Journal of the American College of Cardiology. 2024;84(20):2051-2062. DOI: 10.1016/j.jacc.2024.08.069.
Sivarajkumar S, Ameri K, Li C, Wang Y, Jiang M. “Automating Adjudication of Cardiovascular Events Using Large Language Models.” arXiv:2503.17222. DOI: 10.48550/arXiv.2503.17222.
“F-score.” Wikipedia. Background on F1 score as a combined precision-recall metric. https://en.wikipedia.org/wiki/F-score

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded