85,000 Doors in the Hospital Dungeon: MIRA Rolls for Clinical Initiative

85,000 clinical options sat inside the sandboxed electronic health record, and MIRA, the AI agent in Ferber et al.'s new Nature paper, had to choose which doors to open without accidentally summoning a medication-error demon.

That number matters because most medical AI demos still behave like tavern oracles: you ask a question, they answer in polished prose, and everyone pretends the hard part is over. Real medicine is not a quiz bowl. It is a campaign. You talk to the patient, order labs, interpret imaging, reconcile home meds, avoid allergies, decide whether someone needs admission, and document the whole thing in the EHR, that enchanted filing cabinet designed by a committee of lawful-neutral wizards.

MIRA, short for Medical Intelligence for Reasoning and Action, tries to play the whole quest.

The Party Enters the EHR

The researchers built MIRA as an autonomous medical AI agent inside a controlled EHR sandbox. It could chat with a simulated patient, order and interpret labs, microbiology, and imaging, generate differential diagnoses, prescribe medications, schedule procedures, and plan admissions. Under the hood, it used health-data standards like FHIR (Fast Healthcare Interoperability Resources), the HL7 standard that lets healthcare systems exchange EHR data without every hospital inventing its own dialect of spreadsheet gobbledygook. Think of it as the shared rulebook for moving clinical information around.
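
To make the "shared rulebook" idea concrete, here is a minimal sketch of what a prescription can look like as a FHIR-style resource. This is not MIRA's code, and it does not reproduce any payload from the paper. The helper names (build_medication_request, missing_fields), the patient ID, and the drug, dose, and route strings are hypothetical placeholders. The field names follow the FHIR R4 MedicationRequest resource, where status, intent, medication, and subject are required.

```python
import json

# Elements that FHIR R4 marks as required on a MedicationRequest.
REQUIRED_FIELDS = ("status", "intent", "medicationCodeableConcept", "subject")


def build_medication_request(patient_id, drug_name, dose_text, route_text):
    """Return a minimal FHIR R4-style MedicationRequest as a plain dict.

    All argument values are illustrative placeholders, not clinical advice.
    """
    return {
        "resourceType": "MedicationRequest",
        "status": "active",
        "intent": "order",
        "medicationCodeableConcept": {"text": drug_name},
        "subject": {"reference": f"Patient/{patient_id}"},
        "dosageInstruction": [
            {
                "text": dose_text,              # free-text dosing instructions
                "route": {"text": route_text},  # route of administration
            }
        ],
    }


def missing_fields(resource):
    """List required MedicationRequest elements absent from the resource."""
    return [name for name in REQUIRED_FIELDS if name not in resource]


# Hypothetical example values.
order = build_medication_request(
    patient_id="example-123",
    drug_name="example-drug",
    dose_text="example dose every 24 hours",
    route_text="intravenous",
)

print(json.dumps(order, indent=2))
print("Missing required fields:", missing_fields(order))
```

Dosing text and route of administration both have a natural home in dosageInstruction, so a structured order like this gives reviewers named fields to check instead of a paragraph of prose to squint at.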

The simulated patients came from real retrospective MIMIC-IV cases, with a patient agent grounded in the documented history of present illness. That matters because an agent can only be judged fairly if the patient does not randomly reveal the answer like a dungeon master who forgot to hide the final boss behind a curtain.

The team checked that patient agent carefully. Across 622 rephrased question-answer pairs, responses stayed content-consistent 99.4% of the time by human assessment. Across 933 conversations, there were zero premature diagnostic information leaks. Then the researchers tried 880 adversarial prompts, including prompt-injection and social-engineering tricks. Still zero premature leaks. The mimic did not break character. Respect the commitment to the bit.

Roll for Diagnosis

MIRA was tested on more than 500 emergency department cases across eight diseases: appendicitis, pancreatitis, pneumonia, pulmonary embolism, urinary tract infection, cholecystitis, diverticulitis, and pancreatic cancer. Against MIMIC-IV discharge diagnoses, it reached 88.9% average diagnostic accuracy across 574 cases. In a head-to-head subset of 311 cases, MIRA averaged 87.8% diagnostic accuracy, compared with 78.1% for four board-certified physicians and 71.1% for a mixed-seniority physician cohort.

The largest gap appeared in pancreatitis: MIRA hit 95.2%, while board-certified physicians reached 78.6%. Nobody should read that as "delete doctors, install chatbot." That is how you get cursed armor. The better reading is narrower and more useful: when the environment is standardized, the data are available through tools, and the task is constrained, an LLM-based agent can perform multi-step clinical work at a level worth taking seriously.

MIRA also mirrored physician-like workflows. It usually started with lower-risk steps, like history and labs, then moved toward imaging, procedures, medications, and admission. It requested physical exams more consistently than board-certified physicians did (97.1% versus 87.8%). It also requested more blood analytes than the physicians, but not in an "I cast Order Everything at level 9" way: its lab use still covered only about half of what appeared in routine MIMIC-IV care, and it did not systematically over-order expensive imaging.

The Medication Trap Room

Medication safety is where many AI demos go to lose hit points. MIRA did unusually well in the subset reviewed for prescribing risks. In 56 patient outputs, reviewers found no high-severity drug-drug interactions, renal dosing incompatibilities, allergy-medication mismatches, QT-risk prescribing, or unsafe opioid prescribing. Across 468 prescriptions, 467 had clinically useful and correct dosing instructions, while route of administration was the weakest field at 97% correct.

That is strong, but not perfect. The paper reports a few therapeutic duplications that were judged clinically reasonable but could have used clearer dosing instructions. In D&D terms: the spell worked, but the scroll handwriting needed adult supervision.
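
A quick sanity check on "no high-severity problems in 56 outputs": zero events in a small sample still leaves room for a nonzero true rate. The sketch below computes the standard one-sided upper confidence bound for zero events in n trials, plus the "rule of three" shortcut (3/n). It assumes the 56 outputs are independent and share the same underlying risk. That is a simplifying assumption made for this post, not an analysis from the paper, and the function names are made up for illustration.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound on the event rate after 0 events in n trials.

    Solves (1 - p) ** n = 1 - confidence for p, assuming independent trials
    with a shared event probability.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n)


def rule_of_three(n):
    """Common approximation to the 95% upper bound for zero events: 3 / n."""
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    return 3 / n


reviewed_outputs = 56  # patient outputs reviewed for prescribing risks

exact = zero_event_upper_bound(reviewed_outputs)
approx = rule_of_three(reviewed_outputs)

print(f"Exact 95% upper bound: {exact:.1%}")
print(f"Rule-of-three estimate: {approx:.1%}")
```

For n = 56, both versions put the upper bound at a little over 5%. Under these assumptions, a clean sheet on 56 outputs is encouraging but cannot by itself rule out a high-severity problem rate of a few percent, which is one reason larger prospective evaluations matter.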

Why This Quest Matters

The big move here is not that MIRA answered medical questions. AMIE, from Google Research, already showed strong performance in diagnostic conversation, and AgentClinic pushed evaluation toward interactive simulated clinical environments. Almanac Copilot explored autonomous EHR navigation. Microsoft’s MAI-DxO showed how orchestrated diagnostic agents can improve case-solving and cost control on difficult diagnostic cases.

MIRA pushes further into action. It does not just say, "Consider appendicitis." It can request tests, interpret results, recommend surgery, prescribe meds, and structure outputs inside an EHR-like system. That is the difference between a bard giving advice and a party member actually carrying the healing potions.

Still, the boss battle is not beaten. This was a simulation, not a live emergency department. The patients were generated from retrospective notes, which may be cleaner than real human storytelling, where "stomach pain" can include three unrelated symptoms, one forgotten medication, and a family member loudly Googling. The disease set was limited. The system needs prospective trials, governance, monitoring, liability frameworks, and clinical integration that does not turn doctors into babysitters for a very confident calculator.

But if future systems reproduce these results safely, the impact could be real: fewer missed steps, better medication reconciliation, more consistent guideline adherence, and EHR workflows that feel less like fighting a gelatinous cube made of dropdown menus.

References

Ferber, D. et al. "Towards autonomous medical artificial intelligence agents." Nature (2026). DOI: 10.1038/s41586-026-10675-5. PMID: 42310457
Tu, T. et al. "Towards conversational diagnostic artificial intelligence." Nature 642, 442-450 (2025). DOI: 10.1038/s41586-025-08866-7
Schmidgall, S. et al. "AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments." arXiv: 2405.07960 (2024).
Zakka, C. et al. "Almanac Copilot: Towards Autonomous Electronic Health Record Navigation." arXiv: 2405.07896 (2024).
Nori, H. et al. "Sequential Diagnosis with Language Models." arXiv: 2506.22405 (2025).

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.