GPT-5 Made Better Surgery Checklists Than Humans, and That Should Make You Think

Surgeons live and die by checklists. Not metaphorically - literally. The Enhanced Recovery After Surgery (ERAS) protocol is basically a to-do list that says things like "give the patient this drug at this time" and "get them walking by day two." When hospitals actually follow these checklists, patients recover faster, go home sooner, and generally have a much better time of not dying. The problem? Nobody follows them consistently.

So a team of researchers in Istanbul had a perfectly reasonable idea: what if we just let GPT-5 write the checklists?

The AI Checklist Showdown

The study, published in the World Journal of Surgery in March 2026, pitted 12 AI-generated ERAS checklists against 12 human-curated ones across bariatric and gastrointestinal cancer surgeries. Three blinded raters - two board-certified surgeons and a clinical informatics specialist - scored each checklist on coverage (did it include everything the guidelines say it should?) and clarity (could a normal human being actually understand it?).

The results were almost embarrassingly lopsided. GPT-5's checklists covered 97% of guideline items compared to 89% for the traditional ones, a gap of eight percentage points. Clarity scores hit 4.8 out of 5 versus 4.2 for the human versions. Inter-rater agreement was excellent at 0.92, which in statistics-speak means the three raters scored the checklists very consistently with one another rather than flipping coins.
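To make the coverage number concrete, here is a minimal sketch of how a coverage percentage could be computed, assuming coverage simply means the share of guideline items that show up on a checklist. The `coverage` function and the item names are hypothetical. In the study the scoring was done by human raters, and their rubric is not described here.

```python
def coverage(guideline_items, checklist_items):
    """Return the fraction of guideline items that appear on the checklist."""
    # Normalize case and whitespace so trivial formatting differences don't count as misses.
    guideline = {item.strip().lower() for item in guideline_items}
    checklist = {item.strip().lower() for item in checklist_items}
    if not guideline:
        raise ValueError("guideline_items must not be empty")
    return len(guideline & checklist) / len(guideline)


# Hypothetical items, for illustration only (not the study's data).
guideline = [
    "preoperative counseling",
    "multimodal analgesia",
    "early mobilization",
    "early oral nutrition",
]
draft = [
    "Preoperative counseling",
    "Multimodal analgesia",
    "Early mobilization",
    "Backup phone charger",
]

score = coverage(guideline, draft)
print(f"coverage: {score:.0%}")  # coverage: 75%
```

Under this definition, an extra item on the checklist never lowers the score. Coverage only rewards what is included and says nothing about length.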

The Catch Nobody Wants to Talk About

Here's where it gets interesting, though. The researchers themselves flagged something they called "bundle inflation" - the AI checklists had more items than the traditional ones. And in healthcare, more items on a checklist can actually be worse. Every additional line item is another thing a busy nurse or surgeon might skip. It's the paradox of completeness: the most thorough checklist in the world is useless if everyone ignores it because it's 47 pages long.

Think of it like packing for a trip. Your AI assistant might generate a perfectly comprehensive packing list that includes a sewing kit, a Tide pen, and a backup phone charger. Technically correct. Also technically the reason your carry-on weighs 35 pounds.

What GPT-5 Got Right (and Wrong)

The AI nailed the broad strokes - preoperative counseling, multimodal analgesia, early mobilization, all the hits. Where it stumbled was context-specific tailoring. It didn't always account for the fact that a nutrition pathway for a sleeve gastrectomy patient looks very different from one for a colorectal cancer patient who's been on chemo for six months.

This is the recurring theme with LLMs in medicine: they're excellent at aggregating and organizing existing knowledge, and mediocre at knowing when the standard answer doesn't apply. They read every textbook but never spent a night on call.

Why This Actually Matters

The practical takeaway isn't "let AI write all our medical protocols." It's more like "let AI generate the first draft, then have actual surgeons trim it down." The researchers specifically recommend treating AI outputs as "draft master lists" that require local curation - separating core items from conditional ones before anyone brings them near a patient.
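As a rough illustration of what "separating core items from conditional ones" might look like once surgeons have tagged a draft, here is a small sketch. The `ChecklistItem` type, the `curate` function, and the sample items are all assumptions made up for this example. The paper's actual curation process is a human review step, not code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    text: str
    # Procedures the item is limited to; an empty set marks a core item.
    applies_to: frozenset = frozenset()


def curate(master_list, procedure):
    """Split a draft master list into core items and items conditional on `procedure`."""
    core = [item for item in master_list if not item.applies_to]
    conditional = [item for item in master_list if procedure in item.applies_to]
    return core, conditional


# Hypothetical draft master list, for illustration only (not clinical guidance).
master_list = [
    ChecklistItem("Preoperative counseling"),
    ChecklistItem("Early mobilization"),
    ChecklistItem("Nutrition pathway A", frozenset({"sleeve gastrectomy"})),
    ChecklistItem("Nutrition pathway B", frozenset({"colorectal cancer"})),
]

core, conditional = curate(master_list, "sleeve gastrectomy")
print([item.text for item in core])
print([item.text for item in conditional])
```

The point of the split is that a given patient only ever sees the core items plus the handful that apply to their procedure, which keeps the working checklist short even when the master list is long.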

This is a model that works well beyond surgery. If you've ever had to create a complex document from guidelines or standards - whether it's a surgical protocol, a compliance checklist, or even a detailed project plan - the AI-first-draft-plus-human-editing pipeline is genuinely faster than starting from scratch. Tools like pdfb2.io already make it easy to annotate and mark up PDFs of guidelines when you're doing that kind of curation work.

The Bigger Picture

What strikes me about this study is how honest it is. The authors didn't claim AI will replace surgical protocol committees. They showed that it covers more ground, more clearly, and then immediately said "but that might actually cause problems if you don't edit it." That kind of nuance is rare in AI research papers, where the temptation is always to oversell.

The 97% coverage number is real and impressive. But the real question isn't whether AI can generate a better checklist - it clearly can. The question is whether the humans downstream will actually use it. And that's a problem no language model has solved yet.

References

Caliskan YK, Basak F, Erdem O, Kudas I. From Guidelines to Clicklists: GPT-5-Generated ERAS Checklists Improve Guideline Coverage for Bariatric and Gastrointestinal Cancer Surgery. World Journal of Surgery. 2026. DOI: 10.1002/wjs.70339 | PMID: 41873099