HemaGuide: The Tumor Board Agent With a Surprisingly Serious Moat

As of June 2026, the best anyone could do was route complex blood cancer cases through overloaded tumor boards, specialist calendars, molecular reports, guidelines, and the occasional heroic spreadsheet. This paper changes that.

Zoller, Kalz, Wu, and colleagues introduce HemaGuide, a locally deployable large language model agent for hematological malignancies. In startup terms, this is not "ChatGPT, but wearing a white coat." It is closer to a vertical SaaS wedge into one of medicine’s most chaotic workflows: taking messy clinical documents, turning them into a structured case, choosing the right reasoning mode, and grounding the recommendation in guidelines plus a memory bank of more than 2,000 real tumor board cases.

That is a pretty good pitch deck. The TAM is, unfortunately, cancer.

The Problem: Tumor Boards Do Not Scale Like Software

Hematology is where medicine goes when it wants to make things computationally spicy. A patient may have years of treatment history, lab trends, relapse patterns, molecular variants, transplant eligibility, drug resistance, and guidelines that evolve faster than your average startup pivots.

Multidisciplinary tumor boards are built for exactly this. Experts gather, weigh the evidence, argue politely, and land on a plan. But access is uneven. Not every clinic has a deep bench of lymphoma, leukemia, myeloma, transplant, molecular pathology, and genomics specialists available on command. Even where that bench exists, the workflow can be slow.

HemaGuide’s core idea is simple: give the model guardrails, memory, and routing instead of asking a naked LLM to freestyle oncology like a confident intern with Wi-Fi.

The Product: Three Modes and a Memory Bank

The system converts unstructured clinical documents into structured case representations, then routes each case into one of three modes:

Guideline mode handles cases where established disease-specific flowcharts matter most.

Advanced mode deals with more complex clinical situations where precedent and nuanced decision-making become useful.

Molecular mode focuses on genomic variants and molecular tumor board-style interpretation.

That routing matters. The authors ran ablations across 11 layers and found that no single component carried the whole company. Very founder-coded lesson: the moat was not one magic prompt. It was the workflow.
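To make the routing idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `StructuredCase` fields, the thresholds, and the hand-written rules are hypothetical, and the paper's actual router is not reproduced here. The point is only the shape of the workflow: structure the case first, then pick a mode.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    GUIDELINE = "guideline"
    ADVANCED = "advanced"
    MOLECULAR = "molecular"


@dataclass
class StructuredCase:
    # Hypothetical fields; not the schema used in the paper.
    diagnosis: str
    prior_lines_of_therapy: int = 0
    relapsed: bool = False
    variants: list[str] = field(default_factory=list)
    question_is_molecular: bool = False


def route(case: StructuredCase) -> Mode:
    """Pick a reasoning mode using simple illustrative rules (an assumption)."""
    # Variant interpretation questions go to the molecular workflow.
    if case.question_is_molecular and case.variants:
        return Mode.MOLECULAR
    # Relapsed or heavily pretreated cases lean on precedent, not flowcharts.
    if case.relapsed or case.prior_lines_of_therapy >= 2:
        return Mode.ADVANCED
    # Everything else starts from the disease-specific guideline.
    return Mode.GUIDELINE


if __name__ == "__main__":
    demo = StructuredCase(diagnosis="example lymphoma", relapsed=True)
    print(route(demo).value)
```

A real router would likely be learned or model-driven rather than three if-statements, but even this toy version shows why routing is its own component: a wrong mode sends the case down the wrong evidence path before any reasoning starts.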

The system also draws on what the paper calls a clinical decision memory: more than 2,000 real-world tumor board cases. Think of it as a very specialized CRM, except instead of tracking "hot enterprise leads," it tracks what experts actually decided when faced with complicated blood cancer cases.
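For a feel of what "case-grounded" means mechanically, here is a small retrieval sketch. It is a stand-in, not the paper's method: it uses bag-of-words cosine similarity from the standard library, and the `MEMORY` entries, field names, and decisions are invented placeholders. A production system would more plausibly use learned embeddings and structured case features.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, memory: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored cases whose summaries are most similar to the query."""
    q = tokenize(query)
    ranked = sorted(memory, key=lambda c: cosine(q, tokenize(c["summary"])), reverse=True)
    return ranked[:k]


# Invented placeholder cases; not real patient data or real decisions.
MEMORY = [
    {"id": "case-001", "summary": "relapsed multiple myeloma after two prior lines", "decision": "placeholder A"},
    {"id": "case-002", "summary": "newly diagnosed follicular lymphoma low tumor burden", "decision": "placeholder B"},
    {"id": "case-003", "summary": "relapsed myeloma with renal impairment", "decision": "placeholder C"},
]

if __name__ == "__main__":
    for hit in retrieve("relapsed myeloma third line", MEMORY):
        print(hit["id"], hit["decision"])
```

The retrieved precedents would then be handed to the model alongside the guideline text, so the recommendation can point at what a board actually decided in similar cases.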

For document-heavy medicine, this is where browser-native tooling starts to feel relevant. A workflow that turns clinical PDFs and reports into structured inputs sits in the same universe as private document tools like pdfb2.io, where the boring miracle is: the document stays local, and nobody has to email a lab report to the cloud like it is 2009.

The Numbers: Surprisingly Fundable

In expert-blinded benchmarking on 45 high-complexity cases across six foundation models, wrapping a model in HemaGuide improved its concordance with tumor board decisions. That is the first real signal: the agent setup beat the base-model experience.

The molecular variant workflow classified 70 clinically relevant missense variants with high concordance against expert standards. Crucially, no oncogenic variant was downgraded to benign. That is the kind of failure mode you really do not want, because "we accidentally called the dangerous mutation harmless" is not a cute postmortem.
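The distinction between "high concordance" and "no dangerous downgrade" is worth spelling out, because they are different metrics. The sketch below uses a hypothetical three-level label set and made-up pairs; the paper's actual classification scheme and results are not reproduced here.

```python
# Hypothetical label set for illustration only.
LABELS = {"benign", "uncertain", "oncogenic"}


def concordance(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (expert, model) pairs where the labels match exactly."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(1 for expert, model in pairs if expert == model) / len(pairs)


def dangerous_downgrades(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pairs where the expert said oncogenic but the model said benign."""
    return [(e, m) for e, m in pairs if e == "oncogenic" and m == "benign"]


if __name__ == "__main__":
    # Made-up (expert_label, model_label) pairs, not data from the study.
    toy = [
        ("oncogenic", "oncogenic"),
        ("oncogenic", "uncertain"),  # a miss, but a cautious one
        ("benign", "benign"),
        ("uncertain", "uncertain"),
    ]
    assert all(e in LABELS and m in LABELS for e, m in toy)
    print(f"concordance: {concordance(toy):.2f}")
    print(f"dangerous downgrades: {len(dangerous_downgrades(toy))}")
```

A system can miss on concordance while staying safe (calling an oncogenic variant "uncertain"), and it can score well on concordance while hiding one catastrophic error. Tracking both is the point.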

It also ran fast: median latency of 39 seconds on commodity hardware, compared with the hours often needed for manual molecular board workflows. Commodity hardware is doing a lot of brand work here. Somewhere, a GPU is asking for equity.

Then the authors tested the human-AI combo. Resident physicians assisted by the agent reached near-senior concordance and partially outperformed senior physicians in the simulated practice study. That does not mean residents should replace experts. It means the right support tool may compress the gap between general clinical training and subspecialty judgment.

External validation matters even more. On 555 independent cases from a second academic center, HemaGuide reached 81.8% concordance across 47 entities. In a prospective one-month silent trial on 64 consecutive unselected cases, it reached 82.8%. Hallucinations appeared in 2 of 664 evaluated cases, or 0.3%.
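A quick arithmetic check helps put those percentages in context. The counts below are back-calculated from the rounded figures in this post, so treat them as estimates rather than numbers from the paper. The Wilson score interval is a standard way to show how much uncertainty a proportion carries at a given sample size; it is added here for illustration and is not something the post reports. As a side note, 664 equals 45 + 555 + 64, which suggests (though the post does not say so) that the hallucination count pools all three evaluations.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Back-calculated counts (assumptions derived from the rounded percentages).
external_n, external_hits = 555, round(0.818 * 555)
prospective_n, prospective_hits = 64, round(0.828 * 64)
halluc_n, halluc_hits = 664, 2

if __name__ == "__main__":
    for name, hits, n in [
        ("external", external_hits, external_n),
        ("prospective", prospective_hits, prospective_n),
        ("hallucination", halluc_hits, halluc_n),
    ]:
        lo, hi = wilson_interval(hits, n)
        print(f"{name}: {hits}/{n} = {100 * hits / n:.1f}% "
              f"(approx. 95% interval {100 * lo:.1f}% to {100 * hi:.1f}%)")
```

The interval on the 64-case silent trial will be much wider than the one on 555 cases, which is the honest way to read a one-month prospective run: encouraging, but small.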

For clinical AI, that hallucination number is not a victory parade. It is a diligence item, but a meaningful one: low enough to take seriously, nonzero enough to keep a human in the loop.

Why This Is Different From Chatbot Medicine

Recent work has already shown that retrieval-augmented and agentic systems can outperform standalone LLMs in medicine. Almanac used curated medical retrieval to improve factuality and safety. MedRAG benchmarked retrieval-augmented generation across medical QA and found big gains, but also messy behavior like "lost in the middle." Oncology-specific reviews keep landing on the same point: LLMs look useful, but evaluation, safety, and human oversight are the actual bottlenecks.

HemaGuide fits that trend. It is not asking the model to know everything. It asks the model to retrieve, structure, route, compare, and justify. That is less sexy than "the AI doctor is here," but much more investable if your investors are allergic to lawsuits.

The Catch: Concordance Is Not Cure

The study uses tumor board concordance as a key outcome. That is sensible, but concordance with expert decisions is not the same as improved survival, fewer complications, better quality of life, or lower cost. Also, the tool was built around specific institutions, workflows, guidelines, and case memories. External validation helps, but broader deployment will need monitoring across hospitals, populations, data formats, and changing standards of care.

Still, this is an important step. The strongest version of clinical AI may not be a giant general model shouting answers from the cloud. It may be local, auditable, case-grounded, and boring in exactly the right ways.

That is the kind of zero-to-one I can get behind: not replacing the tumor board, but making its best reasoning easier to access before the calendar invite becomes a bottleneck with a stethoscope.

References

Zoller J, Kalz M, Wu X, et al. Clinical decision support in hematological malignancies using a case-grounded AI agent. Nature Medicine (2026). DOI: 10.1038/s41591-026-04494-4
Hao Y, Qiu Z, Holmes J, et al. Large language model integrations in cancer decision-making: a systematic review and meta-analysis. npj Digital Medicine 8, 450 (2025). DOI: 10.1038/s41746-025-01824-7
Ferber D, et al. Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology. Nature Cancer 6, 1337-1349 (2025). DOI: 10.1038/s43018-025-00991-6
Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking retrieval-augmented generation for medicine. arXiv:2402.13178 (2024).
Zakka C, Chaurasia A, Shad R, et al. Almanac: Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI 1(2) (2024). DOI: 10.1056/AIoa2300068

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded