AIb2.io - AI Research Decoded

Blog Post: Generalist Biological AI

A massive squid has roughly the same number of genes as you do. About 20,000. The difference between you and a cephalopod isn't really in the parts list - it's in the instruction manual, the timing, the choreography of which genes turn on, where, and when. Biology isn't a parts catalog. It's a language. And a group of researchers just published what might be the most ambitious roadmap yet for teaching AI to actually speak it.

Published in Nature Biotechnology, a new review by Vishwanatha Rao, Pranav Rajpurkar, Marinka Zitnik, Eric Topol, and colleagues lays out the case for what they call Generalist Biological AI (GBAI): systems that don't just model proteins or DNA or RNA in isolation, but process all of them simultaneously - the way actual cells do (Rao et al., 2026).

The Central Dogma Gets a Chatbot

Here's the biology recap you didn't ask for but secretly needed: DNA gets transcribed into RNA, RNA gets translated into proteins, and proteins do basically everything interesting in your body. Biologists call this the "central dogma." AI researchers looked at these molecular sequences and thought, "That looks suspiciously like text."
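To see why the comparison is so tempting, here is a minimal Python sketch of the central dogma as plain string processing. It is a toy under loud assumptions: the function names transcribe and translate are made up for this post, the input is taken to be the coding strand of the DNA, translation uses the standard genetic code and simply starts at the first AUG, and splicing, regulation, and everything else a real cell does are ignored.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each RNA codon (U instead of T) to a one-letter amino acid code.
CODON_TABLE = {
    "".join(codon).replace("T", "U"): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna_coding_strand: str) -> str:
    """Toy transcription: rewrite the coding strand as mRNA (T becomes U)."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Toy translation: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no protein in this toy model
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # Unknown codons (for example, ones containing N) become "X".
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCATTGTAATGGGCCGCTGA"
mrna = transcribe(dna)
print(mrna)             # AUGGCCAUUGUAAUGGGCCGCUGA
print(translate(mrna))  # MAIVMGR
```

At this level the whole pipeline is a find-and-replace followed by a dictionary lookup, which is exactly the kind of thing that made AI researchers reach for language models.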

They weren't wrong. Over the past few years, researchers have built language models for each layer of this biological stack. ESM3, a 98-billion-parameter protein model from EvolutionaryScale, jointly reasons over protein sequence, structure, and function - and managed to generate a completely novel fluorescent protein that's roughly 500 million years of evolution away from anything found in nature (Hayes et al., Science, 2025). Meanwhile, Arc Institute's Evo 2, a 40-billion-parameter model trained on over 9 trillion nucleotides from 100,000+ species, reads DNA at single-nucleotide resolution across megabase-scale contexts - and it can actually identify disease-causing mutations in human genes (Nguyen et al., Nature, 2026). AlphaFold 3 expanded structure prediction beyond proteins alone to complexes that include DNA, RNA, ions, and small molecules (Abramson et al., Nature, 2024).

Each of these is impressive on its own. But here's the catch: biology doesn't work in isolated layers.

Why "Generalist" Is the Hard Part

A protein doesn't fold in a vacuum. It folds in a cell, surrounded by other molecules, affected by which genes are active, which RNA variants got spliced, what signals the cell received five minutes ago. The whole system is absurdly interconnected - like a Rube Goldberg machine designed by committee over 3.8 billion years.

Current AI models are specialists. They're like having a world-class translator for French, another for Mandarin, and another for Arabic, but nobody who can handle a meeting where all three languages are spoken at once. GBAI is the proposal to build that multilingual translator for biology.

The review outlines key opportunities: merging language-based AI (good at sequences) with structural AI (good at 3D shapes), building modular systems where specialized models collaborate, and - perhaps most ambitiously - developing AI agents that can autonomously design and run experiments. Picture an AI that hypothesizes a drug target, designs a candidate molecule, predicts its behavior in a virtual cell, and suggests which experiment to run next. We're not there yet, but the architecture is taking shape.

Virtual Cells: The Moonshot

The ultimate destination is what researchers call the virtual cell - a computational model that simulates cellular behavior from the ground up (Bunne et al., Cell, 2024). Not just static snapshots, but dynamic simulations of how cells respond to drugs, mutations, or environmental changes. If you want to visualize the sheer complexity of these interconnected biological pathways, tools like mapb2.io can help map out the relationships - though even the best mind map would struggle to capture the full chaos of a living cell.

The Fine Print (Because There's Always Fine Print)

The paper is refreshingly honest about the obstacles. Training data is biased toward well-studied organisms and molecules. Biological complexity doesn't scale the way language does - a cell isn't just a longer sentence. Experimental validation is slow and expensive. And the models still hallucinate, which is mildly amusing when ChatGPT invents a fake restaurant, and considerably less amusing when a biological model invents a fake drug interaction.

But the trajectory is clear. We've gone from "can AI predict a protein's shape?" to "can AI simultaneously model DNA regulation, RNA splicing, protein interactions, and cellular responses?" in roughly five years. The language of life has four billion years of content. AI is just starting to learn to read it. The interesting part is what happens when it starts writing back.

References:

  1. Rao VM, Zhang S, Plosky BS, et al. Generalist biological artificial intelligence in modeling the language of life. Nature Biotechnology (2026). DOI: 10.1038/s41587-026-03064-w
  2. Hayes T, Rao R, Akin H, et al. Simulating 500 million years of evolution with a language model. Science (2025). DOI: 10.1126/science.ads0018
  3. Nguyen E, Poli M, Durrant MG, et al. Genome modeling and design across all domains of life with Evo 2. Nature (2026). bioRxiv preprint DOI: 10.1101/2025.02.18.638918
  4. Abramson J, Adler J, Dunbar J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493-500 (2024). DOI: 10.1038/s41586-024-07487-w
  5. Bunne C, Rosen Y, Hartman A, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell 187(25), 7045-7063 (2024). DOI: 10.1016/j.cell.2024.11.015

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.