AIb2.io - AI Research Decoded

When Your Phone Predicts Your Next Word, It Has One Huge Advantage Medicine Doesn't

Your phone can guess the next word in your text because millions of people have already fed models an all-you-can-eat buffet of language. Precision medicine, meanwhile, often shows up with three patient records, a weird biomarker, and the exhausted hope that machine learning will somehow "figure it out." That, in a very tired-parent nutshell, is the problem Andrew Janowczyk and colleagues tackle in a 2026 Lancet Digital Health viewpoint about the future of AI in medicine [1].

Their argument is simple and annoyingly correct: as medicine gets more personalized, patient groups get smaller. Much smaller. The result is what they call rare-disease-sized cohorts, or RDSCs - tiny, fragmented datasets that look less like the internet-scale feast modern AI grew up on and more like a toddler refusing dinner because the peas are "too green."

Precision Medicine Wants Specificity. AI Wants Snacks.

Precision medicine aims to tailor treatment to narrower and narrower patient subgroups. That sounds great, because it is. If two people both have "lung cancer" but one has a specific mutation and the other doesn't, treating them the same way can be sloppy at best and harmful at worst.

But there's a catch. Every time you slice patients into more meaningful subgroups - by mutation, imaging pattern, immune profile, disease course, or treatment history - you shrink the cohort. Eventually, you stop having "big data" and start having "Steve, Priya, one person in Milan, and a biopsy from 2019."
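
To put rough numbers on that shrinkage, here is a back-of-the-envelope sketch in Python. The starting cohort and the three fractions are invented for illustration (they are not from the paper), and the arithmetic assumes each filter is independent of the others, which real biology rarely guarantees.

```python
def remaining_cohort(start, fractions):
    """Return the cohort size after each successive subgroup filter.

    `fractions` holds the share of patients that survives each filter.
    Treating the filters as independent is a simplifying assumption.
    """
    sizes = [start]
    for fraction in fractions:
        sizes.append(round(sizes[-1] * fraction))
    return sizes


# Hypothetical numbers: 10,000 patients, then a mutation carried by 5%,
# an imaging pattern seen in 30% of those, and a prior therapy in 20%.
sizes = remaining_cohort(10_000, [0.05, 0.30, 0.20])
print(" -> ".join(str(s) for s in sizes))  # 10000 -> 500 -> 150 -> 30
```

Three perfectly reasonable filters, and a five-figure cohort is down to thirty people.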

That is bad news for standard machine learning, especially deep learning. These models usually thrive on giant datasets because they need lots of examples to separate real patterns from statistical glitter. With small cohorts, the risk of overfitting goes way up. The model doesn't learn biology - it learns trivia. Like a kid who memorized one dinosaur fact and now thinks every problem can be solved by yelling "ankylosaurus."

Janowczyk and colleagues say this isn't a niche problem. It's where precision medicine is heading by design [1]. The better we get at dividing disease into biologically meaningful subtypes, the more often we'll end up with tiny datasets.

Tiny Datasets, Giant Headaches

Small datasets are not just small. They are often high-dimensional and heterogeneous. That means each patient might come with a mountain of variables - genomics, pathology slides, clinical notes, imaging, lab values - but there are very few patients overall.

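This is the classic curse of dimensionality: as the number of features grows, the number of examples needed to cover the space grows exponentially, so a few dozen patients with thousands of measurements each leave a model mostly guessing [2]. Imagine trying to learn a child's entire personality from one afternoon at the playground, one half-eaten cracker, and a drawing that may or may not be a horse. You will become overconfident and very wrong.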
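
If you want to watch that overconfidence happen, the sketch below is a small NumPy experiment, not anything taken from the paper. It assumes 20 hypothetical patients with 500 purely random features each and coin-flip labels, so there is nothing real to learn. A least-squares linear model still "explains" every training patient.

```python
import numpy as np


def fit_noise(n_patients=20, n_features=500, n_test=200, seed=0):
    """Fit a linear model to pure noise and report train/test accuracy."""
    rng = np.random.default_rng(seed)

    # Random features and coin-flip labels: no signal exists by construction.
    X_train = rng.normal(size=(n_patients, n_features))
    y_train = rng.choice([-1.0, 1.0], size=n_patients)
    X_test = rng.normal(size=(n_test, n_features))
    y_test = rng.choice([-1.0, 1.0], size=n_test)

    # With far more features than patients, least squares can match
    # every training label exactly (minimum-norm solution).
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    train_acc = np.mean(np.sign(X_train @ w) == y_train)
    test_acc = np.mean(np.sign(X_test @ w) == y_test)
    return train_acc, test_acc


train_acc, test_acc = fit_noise()
print(f"train accuracy: {train_acc:.2f}")  # perfect on the patients it memorized
print(f"test accuracy:  {test_acc:.2f}")   # roughly a coin flip on new ones
```

Perfect training accuracy here means the model memorized noise. On new patients it does about as well as flipping a coin, which is the small-cohort overfitting problem in miniature.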

The paper also points out a more boring but deadly issue: infrastructure. Data live in different hospitals, follow different standards, use different scanners, and get labeled in different ways. In medical AI, half the battle is not building the model. Half the battle is discovering that Hospital A calls it one thing, Hospital B calls it another, and Hospital C stored the answer in a PDF last touched during the Obama administration.

That is why this viewpoint pushes not just for new algorithms, but for better biobanking, data standardization, governance, and cross-institution collaboration [1].

The Fix Is Not "Just Use a Bigger Neural Net"

If your instinct is "couldn't we just fine-tune a big pretrained model?" - fair question. Researchers across medical AI have tried transfer learning, foundation models, self-supervised learning, federated learning, and synthetic data generation to make more from less [3-6].

Some of that helps. Transfer learning can let a model start with broad visual or biological knowledge, then adapt to a smaller medical task [3]. Federated learning can let institutions train shared models without moving raw patient data around, which matters when privacy rules are stricter than a toddler's bedtime routine [4]. Self-supervised learning tries to squeeze signal out of unlabeled data, which hospitals have in abundance [5].
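
For the curious, here is a toy sketch of the federated idea using federated averaging (often called FedAvg) on a plain linear regression. It is a simplification under loud assumptions: three hypothetical hospitals, simulated data, and none of the encryption, secure aggregation, or differential privacy a real deployment would need. The point is only that weights travel to the server while patient rows stay put.

```python
import numpy as np


def local_update(w, X, y, lr=0.1, steps=5):
    """Run a few gradient-descent steps on one hospital's own data."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w


def federated_average(sites, n_features, rounds=50):
    """Average locally trained weights, weighted by each site's cohort size."""
    w = np.zeros(n_features)
    total = sum(len(y) for _, y in sites)
    for _ in range(rounds):
        # Each site trains locally; only the weight vectors are shared.
        local_weights = [local_update(w, X, y) for X, y in sites]
        w = sum(len(y) / total * w_k
                for w_k, (_, y) in zip(local_weights, sites))
    return w


# Simulated data: three hypothetical hospitals with 8, 12 and 30 patients,
# all drawn from the same underlying relationship (an assumption).
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
sites = []
for n in (8, 12, 30):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    sites.append((X, y))

w_hat = federated_average(sites, n_features=3)
print(np.round(w_hat, 2))
```

In this friendly setup the shared model recovers weights close to the true ones even though no hospital ever sees another's records. Real hospitals are rarely this well behaved: when sites differ in scanners, populations, or labeling, plain averaging can struggle.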

But none of these methods are magic beans. They can still inherit bias, fail to generalize, or quietly collapse when the real-world cohort differs from the training set. And synthetic data, while useful in some settings, can become a photocopy of a photocopy if the original dataset was tiny to begin with [6].

This is why the paper's broader message lands: the bottleneck is social and organizational as much as technical. If institutions do not share standards, collect richer longitudinal data, and build trustworthy governance frameworks, the model architecture almost doesn't matter. You're still trying to make soup from one carrot and a lot of optimism.

Why This Actually Matters Outside Research Meetings and Coffee-Stained Slide Decks

If this problem gets handled well, the payoff is real. Better ML for tiny cohorts could help identify which rare patient subgroup will respond to a therapy, who might face toxic side effects, or how diseases split into biologically distinct pathways. That's the kind of thing that moves care from "we usually do this" to "for you, this is the better bet."

And the challenge is not limited to rare diseases. Oncology, neurology, autoimmune disorders, and even common conditions are all getting chopped into more precise subtypes. Precision medicine keeps zooming in. The datasets keep shrinking. The nap schedule gets worse.

You can already see adjacent tools creeping into practice. For example, document-heavy clinical workflows increasingly need private browser-based PDF handling for records and reports - the sort of practical plumbing that tools like pdfb2.io nod toward, even if the glamorous conference talk is about multimodal transformers. Because yes, AI research loves to discuss giant models. Clinics still need clean, standardized documents that don't explode on contact.

The Real Plot Twist

The interesting thing about this paper is that it refuses the usual AI fairy tale. It does not say, "Don't worry, smarter models will save us." It says: precision medicine is creating exactly the kind of data environment current ML handles poorly, so we need better methods and better systems.

That is a more adult answer. Slightly less sparkly. Much more useful.

If "n=1" really becomes normal, then medical AI has to grow up fast. Less flexing about benchmark scores. More patience, better data stewardship, and models that can learn without demanding the computational equivalent of a warehouse full of goldfish crackers.

References

  1. Janowczyk A, Merkler D, Michielin O, Madabhushi A. Precision medicine's inevitable trajectory toward rare-disease-sized cohorts: implications for machine learning and deep learning. Lancet Digit Health. 2026. doi:10.1016/j.landig.2026.101000. PMID: 42297703

  2. Bellman R. Adaptive Control Processes: A Guided Tour. Princeton University Press; 1961. (Background on the curse of dimensionality.)

  3. Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: Understanding transfer learning for medical imaging. Adv Neural Inf Process Syst. 2019;32. arXiv:1902.07208

  4. Kaissis G, Makowski M, Rückert D, Braren R. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell. 2020;2:305-311. doi:10.1038/s42256-020-0186-1

  5. Azizi S, Mustafa B, Ryan F, et al. Big self-supervised models advance medical image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021. arXiv:2101.05224

  6. Dash S, Shakyawar SK, Sharma M, Kaushik S. Big data in healthcare: management, analysis and future prospects. J Big Data. 2019;6:54. doi:10.1186/s40537-019-0217-0

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.