AIb2.io - AI Research Decoded

The AI Models Trained on Millions of Cells Might Not Be Worth the Hype

Researchers threw ten foundation models at single-cell data and discovered something the AI hype cycle doesn't want you to hear: bigger isn't always better.

When ChatGPT Met Your Cells

Foundation models are having their moment. These massive AI systems, trained on incomprehensible amounts of data, have conquered language translation, image generation, and probably your social media feed. Naturally, biologists wondered: what if we built one for cells?

The pitch sounds irresistible. Train a transformer model on tens of millions of individual cells, let it learn the deep patterns of gene expression, and watch it solve every single-cell analysis problem under the sun. Models like scGPT, Geneformer, and the 800-million-parameter behemoth CellFM have emerged, each promising to be the one model to rule them all.

But here's where it gets interesting. A team from Yale decided to actually test whether these expensive, GPU-melting models deliver on their promises. Spoiler alert: the results are... complicated.

The Great Single-Cell Showdown

Liu and colleagues put ten foundation models through their paces across eight different tasks that biologists actually care about: identifying cell types, correcting batch effects, predicting how cells respond to drugs, and more. They even built a framework called scEval to make these comparisons fair and reproducible.

The winners? scGPT, Geneformer, and CellFM took the top spots when you factor in both performance and whether regular scientists can actually use them without a supercomputer. But the real finding hides in the fine print.

These foundation models - trained on 30 to 100 million cells each - don't consistently beat task-specific methods. That's right. In several tasks, simpler tools designed for one job outperformed the Swiss-army-knife AI models that required orders of magnitude more computing power to train.

It's like discovering that your expensive multi-tool is worse at cutting bread than a regular knife. Sure, the multi-tool has 47 functions, but sometimes you just need to cut bread.

Why Your Cells Are Harder to Read Than Shakespeare

The challenge is that biological data breaks all the rules that made transformers successful in language. When GPT-4 reads a sentence, words come in a specific order that matters. "The dog bit the man" means something different from "The man bit the dog."

Genes don't work that way. A cell's expression profile has no built-in order: listing gene A before gene B describes exactly the same cell as listing gene B before gene A. This nonsequential nature of omics data means researchers have to get creative with how they feed information into transformer architectures, and not every approach works equally well.
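To make the ordering problem concrete, here is a minimal sketch of one way to turn an unordered expression profile into a sequence: sort the expressed genes from highest to lowest value. This is an illustration under our own assumptions, not the tokenization scheme of any model discussed here; the function name `rank_tokens` and the gene names are hypothetical.

```python
import numpy as np

def rank_tokens(expression, gene_names, max_len=None):
    """Turn an unordered expression vector into an ordered gene sequence.

    Genes with zero expression are dropped; the rest are sorted from
    highest to lowest expression. Ties keep their input order.
    """
    expression = np.asarray(expression, dtype=float)
    expressed = np.flatnonzero(expression > 0)
    # A stable sort on the negated values gives descending order.
    order = expressed[np.argsort(-expression[expressed], kind="stable")]
    if max_len is not None:
        order = order[:max_len]
    return [gene_names[i] for i in order]

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # hypothetical gene names
cell = np.array([3.0, 0.0, 7.0, 1.0])             # made-up expression values

# Present the same cell with its genes listed in a different order.
perm = np.array([2, 0, 3, 1])
tokens = rank_tokens(cell, genes)
tokens_shuffled = rank_tokens(cell[perm], [genes[i] for i in perm])

print(tokens)
print(tokens == tokens_shuffled)
```

Because the sequence is derived from the values rather than from the column order, shuffling the input genes leaves the result unchanged, as long as no two expressed genes tie exactly. The order is something the preprocessing imposes, not something the biology supplies.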

Recent zero-shot evaluations have found that both scGPT and Geneformer sometimes perform worse than selecting highly variable genes and using established methods like Harmony or scVI. The foundation models, for all their training data, struggle when asked to generalize to completely new scenarios without fine-tuning.
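For a sense of how modest the "select highly variable genes" baseline is, here is a rough sketch in plain NumPy. It assumes a simple recipe (normalize each cell to 10,000 counts, log-transform, rank genes by variance divided by mean); real pipelines use more careful variants, and the function name and synthetic data are ours, not from the evaluations cited.

```python
import numpy as np

def select_highly_variable_genes(counts, n_top=2000):
    """Return sorted column indices of the n_top genes with highest dispersion.

    counts: array of shape (n_cells, n_genes) holding raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    # Normalize each cell to 10,000 total counts, then log-transform.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    logged = np.log1p(counts / totals * 1e4)
    mean = logged.mean(axis=0)
    var = logged.var(axis=0)
    # Dispersion = variance / mean; genes that are never expressed get 0.
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    n_top = min(n_top, counts.shape[1])
    return np.sort(np.argsort(-dispersion, kind="stable")[:n_top])

# Synthetic data for illustration: 200 cells, 100 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(200, 100))
# Genes 0-4 are "on" in the first 100 cells and nearly "off" in the rest.
counts[:100, :5] = rng.poisson(50, size=(100, 5))
counts[100:, :5] = rng.poisson(1, size=(100, 5))

hvg = select_highly_variable_genes(counts, n_top=5)
print(hvg)
```

The selected columns would then be handed to an established integration or embedding method. The point is not that this recipe is optimal, only that the baseline the foundation models are sometimes losing to is a few lines of arithmetic.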

The Silver Lining (Because Science Needs Those)

Before you write off single-cell foundation models entirely, the Yale study found genuine promise in two areas. First, these models show "emergent abilities" - they can do things they weren't explicitly trained to do. Second, they excel at transfer learning across species and data types. Train on human cells, apply to mouse cells. Train on gene expression, transfer to chromatin accessibility data.

That's not nothing. In fact, it's the whole point of foundation models: learn general patterns that transfer everywhere.

The paper also provides practical guidance for researchers who want to use these models. Hyperparameter choices matter enormously. The learning rate during fine-tuning can make or break your results. And stability varies wildly between models - some produce consistent results across runs, others are more temperamental than a sourdough starter.
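One simple way to act on that advice is to repeat each fine-tuning configuration with several random seeds and look at the spread, not just the best run. The sketch below assumes you already have one score per run; the numbers are placeholders for illustration, not results from the paper, and the helper names are hypothetical.

```python
from statistics import mean, stdev

def summarize_runs(scores_by_lr):
    """Map each learning rate to (mean, standard deviation) over repeated runs."""
    return {lr: (mean(s), stdev(s)) for lr, s in scores_by_lr.items()}

def pick_learning_rate(summary, max_std):
    """Return the setting with the best mean among those with acceptable spread."""
    stable = {lr: m for lr, (m, s) in summary.items() if s <= max_std}
    if not stable:
        return None  # no setting was stable enough
    return max(stable, key=stable.get)

# Placeholder scores (e.g. accuracy) from three seeds per learning rate.
scores_by_lr = {
    1e-3: [0.91, 0.62, 0.88],
    1e-4: [0.85, 0.86, 0.84],
    1e-5: [0.78, 0.79, 0.78],
}

summary = summarize_runs(scores_by_lr)
best = pick_learning_rate(summary, max_std=0.05)

for lr, (m, s) in summary.items():
    print(f"lr={lr:g}: mean={m:.3f}, std={s:.3f}")
print("chosen:", best)
```

In this made-up example the largest learning rate produces the single best run but also the widest spread, so the rule passes over it. The `max_std` threshold is an arbitrary choice here; what counts as acceptable depends on the task.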

What This Means for Biology

The honest conclusion here isn't that foundation models are useless - it's that they're tools, not magic. For some tasks, particularly those involving transfer across biological contexts, they offer real advantages. For others, a purpose-built method still wins.

For scientists working with single-cell data, the takeaway is refreshingly practical: don't assume the fanciest model is the best choice. Test it. Compare it to simpler alternatives. And maybe don't feel bad about using established methods that actually work for your specific question.
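As a rough sketch of what "test it" can look like, the snippet below scores any cell embedding by how often a cell's nearest neighbors share its label, using leave-one-out k-nearest-neighbor voting. It is a generic sanity check written for this post, not the scEval metric suite, and the two embeddings are synthetic stand-ins for "a model's output" and "a poor alternative".

```python
import numpy as np

def loo_knn_accuracy(embedding, labels, k=5):
    """Leave-one-out k-NN accuracy: does each cell's neighborhood vote for its label?"""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances between all cells.
    sq = (embedding ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dist, np.inf)  # a cell may not vote for itself
    neighbors = np.argsort(dist, axis=1)[:, :k]
    correct = 0
    for i, idx in enumerate(neighbors):
        values, counts = np.unique(labels[idx], return_counts=True)
        correct += int(values[np.argmax(counts)] == labels[i])
    return correct / len(labels)

rng = np.random.default_rng(0)
labels = np.array(["T cell"] * 50 + ["B cell"] * 50)  # illustrative labels

# Stand-in for a useful embedding: two well-separated clusters.
good = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
# Stand-in for a useless embedding: noise unrelated to the labels.
noise = rng.normal(0, 1, (100, 2))

acc_good = loo_knn_accuracy(good, labels)
acc_noise = loo_knn_accuracy(noise, labels)
print(acc_good, acc_noise)
```

Run the same function on the foundation model's embedding and on a simpler baseline's embedding of the same cells, and you have a like-for-like comparison on your own data.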

The future of single-cell AI isn't about building ever-larger models. It's about building smarter ones that understand what makes biology different from language - and being honest about what they can and can't do.

References:

  1. Liu T, Li K, Wang Y, Li H, Zhao H. Evaluating the Utilities of Foundation Models in Single-Cell Data Analysis. Advanced Science. 2026. DOI: 10.1002/advs.202514490

  2. Cui H, Wang C, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods. 2024;21(8):1470-1480. DOI: 10.1038/s41592-024-02201-0

  3. Park J, et al. Single-cell foundation models: bringing artificial intelligence into cell biology. Experimental & Molecular Medicine. 2025. DOI: 10.1038/s12276-025-01547-5

  4. Zero-shot evaluation reveals limitations of single-cell foundation models. PMC. 2025. Available at: PMC12007350

  5. scEval GitHub Repository. Available at: https://github.com/HelloWorldLTY/scEval

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.