The verdict: AMIE and MIRA deliver a real step forward, but they are still practicing medicine inside a very tidy terrarium.

Medical AI has spent years doing the exam-room equivalent of flashcards.

Question in. Answer out.

Sometimes the answer is good. Sometimes it is your uncle at Thanksgiving with a stethoscope: confident, specific, and in need of supervision.

Karen O'Leary's Nature Medicine research highlight, "Agents AMIE and MIRA advance medical AI capabilities", points to something more interesting.

Two new systems do not just answer medical questions.

They act.

Carefully. In simulations. With guardrails.

Still, that matters.

Not A Chatbot In A Lab Coat

AMIE and MIRA are medical AI agents.

That word, "agent," gets abused more than the office coffee machine. Here it means a model that can plan steps, use tools, gather information, and update its answer as the case changes.

A chatbot says, "This might be appendicitis."

An agent asks for labs. Checks imaging. Reviews history. Suggests treatment. Thinks again.

That is the shift.

Not smarter autocomplete.

More like autocomplete with a clipboard, a pager, and mild anxiety.
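To make the chatbot-versus-agent contrast concrete, here is a minimal Python sketch of a plan, act, update loop. Everything in it is hypothetical: the tool names, the toy case record, and the hand-written rule standing in for the model are illustrations only, not how AMIE or MIRA are built.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy case record; a real system would query an EHR.
CASE = {"labs": "elevated white cell count", "imaging": "inflamed appendix"}

# Hypothetical tools: each call returns one new finding.
TOOLS: Dict[str, Callable[[], str]] = {
    "order_labs": lambda: CASE["labs"],
    "order_imaging": lambda: CASE["imaging"],
}


def choose_next_step(findings: List[str]) -> str:
    """Stand-in for the model: pick the next tool, or stop when done."""
    if not any("white cell" in f for f in findings):
        return "order_labs"
    if not any("appendix" in f for f in findings):
        return "order_imaging"
    return "stop"


def run_agent(max_steps: int = 5) -> Tuple[List[str], List[str]]:
    """Loop: decide, act, record the result, then decide again."""
    findings: List[str] = []
    trace: List[str] = []
    for _ in range(max_steps):
        step = choose_next_step(findings)
        if step == "stop":
            break
        trace.append(step)
        findings.append(TOOLS[step]())
    return trace, findings


trace, findings = run_agent()
print(trace)  # ['order_labs', 'order_imaging']
```

A chatbot would be a single call with no loop. The `max_steps` cap is the kind of guardrail that keeps a loop like this from wandering off.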

MIRA: The Hospital Intern That Lives In The EHR

MIRA, described in the Nature paper "Towards autonomous medical artificial intelligence agents", works inside a sandboxed electronic health record.

That is key.

Hospitals do not run on vibes. They run on forms, codes, lab orders, medication lists, admission decisions, and the ancient ritual of clicking twelve boxes to prescribe one thing.

MIRA had 11 tools and more than 85,000 possible actions at its disposal. It could order labs, imaging, microbiology tests, medications, procedures, and admissions. It also followed standards such as FHIR, the health-data plumbing that lets medical systems talk without immediately setting the room on fire.
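For a sense of what "ordering a lab" looks like as data, here is a hedged sketch of a FHIR-style ServiceRequest built as a Python dictionary. The field names follow the general shape of the FHIR R4 ServiceRequest resource. The helper name, patient ID, code system URL, and code are made-up placeholders, and nothing here reflects how MIRA actually builds its orders.

```python
import json
from typing import Any, Dict


def build_lab_order(patient_id: str, code: str, display: str) -> Dict[str, Any]:
    """Hypothetical helper: assemble a minimal ServiceRequest-shaped dict."""
    return {
        "resourceType": "ServiceRequest",
        "status": "active",   # lifecycle state of the request
        "intent": "order",    # this is an actual order, not a proposal
        "code": {
            "coding": [
                {
                    # Placeholder code system, not a real terminology server.
                    "system": "http://example.org/lab-codes",
                    "code": code,
                    "display": display,
                }
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
    }


# Placeholder identifiers for illustration only.
order = build_lab_order("example-123", "EXAMPLE-CBC", "Complete blood count")
print(json.dumps(order, indent=2))
```

The point is structure: an order is a typed record that other systems can parse, not a sentence in a chat window.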

In 574 emergency-department cases from MIMIC-IV, MIRA reached 88.9% diagnostic accuracy across eight diseases. It did especially well on appendicitis and pancreatitis.

It also showed strong medication safety in the tested cases. No high-severity drug-drug interactions appeared in one safety screen. Its admission decisions had perfect recall in pneumonia and pulmonary embolism tests.

Read that carefully.

Perfect recall in that narrow test.

Not perfect doctor.

Not ready for your emergency room.
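Why "perfect recall" is not "perfect doctor": recall only counts how many of the patients who needed admission were flagged. It says nothing about how many were flagged unnecessarily. The short sketch below uses invented counts (these are not numbers from the MIRA paper) to show recall at 1.0 while precision sits much lower.

```python
def recall(tp: int, fn: int) -> float:
    """Share of true cases that were caught: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0  # 0.0 when undefined, by convention here


def precision(tp: int, fp: int) -> float:
    """Share of flagged cases that were real: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else 0.0


# Invented counts for illustration: every true case caught,
# but just as many patients flagged who did not need admission.
tp, fn, fp = 20, 0, 20

print(recall(tp, fn))     # 1.0
print(precision(tp, fp))  # 0.5
```

A system that admits everyone also has perfect recall. That is why the narrowness of the test matters.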

AMIE: The Follow-Up Visit Whisperer

AMIE, Google's Articulate Medical Intelligence Explorer, tackles a different pain point.

Medicine is not one heroic diagnosis. It is often three visits, five messages, two medication changes, and a patient saying, "I forgot to mention this started after the camping trip."

In "Towards Conversational AI for Disease Management", AMIE handled multi-visit disease-management conversations.

It combined Gemini's long-context abilities with retrieved clinical guidelines and drug formularies. In a blinded virtual OSCE study, AMIE was compared with 21 primary care physicians across 100 multi-visit scenarios.

Specialists judged AMIE non-inferior to the physicians overall.

It also scored better on treatment precision, investigation precision, and guideline alignment.

That sounds dry. It is not.

A vague plan says, "Try antibiotics."

A precise plan names the drug, dose, duration, safety checks, and follow-up.

One is a sticky note. The other is a plan that has eaten its vegetables.

AMIE also did well on RxQA, a medication-reasoning benchmark built from US and UK formularies and validated by pharmacists.

Why This Is More Than Benchmark Theater

Medical AI benchmarks can be weird.

A model aces a licensing exam. Everyone claps. Then someone asks whether it can handle a real patient with missing records, unclear symptoms, three medications, and a spouse correcting the timeline from the corner.

The room gets quiet.

That is why these studies matter.

They move toward workflow.

MIRA touches the electronic record. AMIE reasons across visits. Both deal with management, not just diagnosis.

This is closer to how clinicians actually work.

Still, the evidence base is young. A 2026 Nature Medicine review found 4,609 clinical LLM studies from January 2022 through September 2025, but only 19 were prospective randomized trials using real-world patient data (DOI: 10.1038/s41591-026-04229-5).

That is the whole plot.

Lots of demos. Not enough real-world proof.
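The gap is easier to feel as a ratio. A quick back-of-the-envelope check, using only the two figures quoted above (the variable names are just labels for this sketch):

```python
total_studies = 4609     # clinical LLM studies found by the review
prospective_rcts = 19    # of those, prospective randomized trials

share = prospective_rcts / total_studies
print(f"{share:.2%}")                            # 0.41%
print(round(total_studies / prospective_rcts))   # 243
```

That works out to well under half a percent, or roughly one study in 243.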

The Catch Has A Stethoscope

The biggest limitation is simple.

These were simulations.

MIRA worked in a sandbox. AMIE used virtual cases. The patients were not real people with messy lives, missing histories, fear, bad Wi-Fi, or insurance forms arriving like side quests nobody requested.

Eric Topol made the same point: both systems were text-only, tested on clean cases, and far from the full chaos of clinical medicine.

Experts quoted by the UK Science Media Centre were also blunt. These tools may support doctors. They should not replace them.

That is the sane lane.

Use AI for memory, structure, guideline checking, medication review, and administrative drag.

Keep humans responsible for judgment, uncertainty, communication, and final decisions.

Especially when the model sounds very sure.

That is when you check twice.

The Real Promise

If these results hold up, the best version is not robot doctor theater.

It is quieter.

Fewer missed options.

Cleaner medication plans.

Better follow-up.

Less clerical sludge.

A clinician with an AI co-pilot that reads the whole chart before speaking. Imagine that. An entity in healthcare that reads the whole chart. Science fiction, basically.

But first come prospective studies. Real hospitals. Diverse patients. Bias testing. Governance. Audit logs. Liability. Privacy. Failure modes.

All the boring stuff.

Which, in medicine, is usually the stuff that saves lives.

References

O'Leary, K. "Agents AMIE and MIRA advance medical AI capabilities." Nature Medicine (2026). DOI: 10.1038/d41591-026-00034-2. PMID: 42380502
Ferber, D. et al. "Towards autonomous medical artificial intelligence agents." Nature (2026). DOI: 10.1038/s41586-026-10675-5
Liévin, V. et al. "Towards Conversational AI for Disease Management." Nature (2026). DOI: 10.1038/s41586-026-10764-5. arXiv: 2503.06074
McDuff, D. et al. "Towards accurate differential diagnosis with large language models." Nature 642, 451-457 (2025). DOI: 10.1038/s41586-025-08869-4
Chen, S. F. et al. "LLM-assisted systematic review of large language models in clinical medicine." Nature Medicine 32, 1152-1159 (2026). DOI: 10.1038/s41591-026-04229-5

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded