AIb2.io - AI Research Decoded

A Survey on Large Language Models in Biology and Chemistry

If you've ever tried to predict how a protein folds, design a new drug molecule, or figure out what a single cell is doing with its life, you already know the frustration: biology is messy, chemistry is unforgiving, and the computational tools we had five years ago felt like trying to solve a Rubik's cube while wearing oven mitts. Researchers have been drowning in sequence data, molecular structures, and genomic datasets for years, desperately wishing for models smart enough to actually understand the language of life. Turns out, the same technology behind your chatbot might be up to the job.

A sweeping new survey from Ashyrmamatov et al. (2025) maps out exactly how large language models are crashing the biology and chemistry party, and the guest list is wilder than you'd expect.

Molecules Have Grammar Now, Apparently

The big insight driving this entire field is deceptively simple: molecules are sequences. Proteins are strings of amino acids. DNA is a four-letter alphabet. Even small drug-like molecules can be written as text using SMILES notation (which, despite the cheerful name, will absolutely not make you smile the first time you try to parse it). If molecules are language, then language models should be able to learn their rules, right?
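To make "molecules are text" concrete, here is a minimal sketch of how a SMILES string might be split into tokens before a language model ever sees it. The regular expression and the `tokenize_smiles` name are illustrative assumptions, not something taken from the survey; real tokenizers cover more of the SMILES grammar.

```python
import re

# Illustrative pattern (an assumption, not the survey's tokenizer).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [C@H] or [NH4+]
    r"|Br|Cl"                # two-letter halogens must match before B and C
    r"|[BCNOPSFI]"           # remaining organic-subset atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}|\d"            # ring-closure labels
    r"|[()=#\-+\\/:.~@*$]"   # branches, bonds, and other punctuation
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def tokenize_protein(sequence):
    """Proteins are simpler: one token per amino-acid letter."""
    return list(sequence)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenize_smiles(aspirin))
print(tokenize_smiles("ClCCBr"))  # -> ['Cl', 'C', 'C', 'Br']
print(tokenize_protein("MKT"))    # -> ['M', 'K', 'T']
```

Note that "Cl" stays one token instead of becoming a carbon followed by a stray "l". That is the kind of detail that makes SMILES less cheerful than it sounds.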


The survey traces how researchers took this idea and ran with it. Protein language models like Meta's ESM-2 (packing up to 15 billion parameters) learned structural and functional constraints just from reading millions of amino acid sequences, no 3D structures required. Chemical language models like ChemBERTa-2 learned molecular syntax through masked token prediction on SMILES strings. And the architectures powering all of this - BERT-style encoders, GPT-style decoders, encoder-decoder hybrids - are basically the same transformer machinery behind the AI writing your emails, just retrained on nature's source code.
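Masked token prediction is easier to picture with a toy example. The sketch below hides a fraction of tokens and keeps the originals as training targets; the model's job would be to fill the blanks back in. The `mask_tokens` helper, the 15% rate, and the always-replace-with-`[MASK]` rule are simplifying assumptions for illustration, not the exact recipe of ChemBERTa-2 or any other model named above.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (inputs, labels) for one masked-prediction training example.

    inputs: the tokens with a random subset replaced by MASK
    labels: the original token at each masked position, None elsewhere
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, labels


# Aspirin's SMILES happens to split into one token per character.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

The same function works unchanged on a list of amino-acid letters, which is part of why one architecture can be reused across proteins and small molecules.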

What Can These Models Actually Do?

Here's where it gets genuinely impressive. The survey catalogs applications across a dizzying range of tasks:

Protein structure and function prediction - Models can now predict how proteins fold, what they bind to, and what they do, sometimes rivaling experimental methods. AlphaFold's Nobel Prize-winning work (Jumper et al., Nature, 2021; Abramson et al., Nature, 2024; DOI: 10.1038/s41586-024-07487-w) set the bar, but language model approaches are catching up with far less computational overhead.

De novo molecular design - Need a molecule that crosses the blood-brain barrier and inhibits a specific enzyme? LLMs can generate candidate structures from scratch. MIT's Llamole system even lets you describe what you want in plain English, and the model designs molecules to match. That's not science fiction - that's a preprint from 2025.

Reaction prediction and retrosynthesis - Given a target molecule, these models can work backward to figure out how to actually make it. Think of it as GPS for organic chemistry, except the roads are reaction pathways and the traffic is thermodynamics.

Single-cell analysis - Some models now treat gene expression profiles as a kind of cellular language, learning to classify cell types and predict responses to perturbations from transcriptomic data.
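How does a gene expression profile become a "sentence"? One encoding used in the literature is to rank a cell's genes by expression and treat the ordered gene names as tokens. The sketch below shows that idea in its simplest form. The `rank_encode` helper and the expression values are made up for illustration, and published models add normalization steps that are skipped here.

```python
def rank_encode(expression, top_k=5):
    """Turn one cell's expression profile into an ordered list of gene tokens.

    expression: dict mapping gene name -> expression value for a single cell
    Genes with zero expression are dropped; ties are broken alphabetically.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))  # highest expression first
    return [gene for gene, _ in expressed[:top_k]]


# Toy values for one cell (invented numbers, real gene symbols).
cell = {"CD3E": 42.0, "GAPDH": 310.0, "MS4A1": 0.0, "CD8A": 17.0, "ACTB": 95.0}
print(rank_encode(cell))  # -> ['GAPDH', 'ACTB', 'CD3E', 'CD8A']
```

Once a cell is a token sequence like this, the transformer machinery described earlier can be applied to it much as it is to proteins or SMILES strings.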

The Fine Print (Read This Part)

Sure, this all sounds incredible - until you realize the caveats could fill their own survey paper. The authors don't shy away from the problems, and neither should we.

First, hallucinations aren't just a chatbot problem. When a language model "hallucinates" a molecular structure, you don't get a funny wrong answer - you get an invalid SMILES string or a chemically impossible compound. Imagine a GPS confidently routing you off a cliff. That's what a hallucinating chemical LLM does.
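The cheapest guardrail is to check generated strings before anyone takes them seriously. Below is a deliberately crude sketch that only tests SMILES bookkeeping: balanced parentheses, closed brackets, and paired ring-closure labels. The `looks_syntactically_valid` name is hypothetical, and passing this check says nothing about valence or chemical plausibility. In practice a cheminformatics toolkit would do the real validation.

```python
def looks_syntactically_valid(smiles):
    """Crude structural check; NOT a substitute for a real SMILES parser."""
    depth = 0            # current branch nesting level
    open_rings = set()   # ring-closure labels opened but not yet closed
    in_bracket = False   # True while inside a [...] bracket atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                return False          # brackets cannot nest
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return False              # close without a matching open
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":               # two-digit ring label such as %12
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {int(label)}
            i += 2
        elif ch in "0123456789":      # single-digit ring label
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not in_bracket and not open_rings


print(looks_syntactically_valid("CC(=O)Oc1ccccc1C(=O)O"))  # -> True
print(looks_syntactically_valid("c1ccccc"))                # -> False
```

The second string opens ring 1 and never closes it, which is exactly the sort of almost-right output a generative model can produce with complete confidence.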

Second, data quality is a persistent headache. Biology's labeled datasets are small, noisy, and biased toward well-studied organisms and pathways. Fine-tuning a model on garbage data produces a very confident garbage-generating machine.

Third, interpretability remains a challenge. These models learn representations that work, but explaining why they work is another story entirely. When your model predicts a protein will misfold, and you can't explain the reasoning to a regulatory agency, that's a problem. The FDA's January 2025 draft guidance on AI in drug development (FDA, 2025) makes clear that "the model said so" isn't going to cut it.

Agents, Assemble

The most forward-looking section of the survey covers agentic AI systems - models that don't just predict but actively plan, search databases, run simulations, and iterate on their own designs. Insilico Medicine's AI-designed drug ISM001-055 went from target discovery to a preclinical candidate in roughly 18 months and has since reached Phase IIa clinical trials, a timeline that would make traditional pharma weep into their quarterly reports. The biomedical NLP market hit $8.97 billion in 2025 and is projected to reach $132 billion by 2034 (Lu et al., Clin. Transl. Sci., 2025; DOI: 10.1111/cts.70205).

But let's keep our eyebrows raised. Autonomous AI agents designing drugs with minimal human oversight raise ethical questions that the field is only beginning to wrestle with. The survey responsibly flags concerns around reproducibility, data privacy, and the potential for these tools to widen the gap between well-resourced labs and everyone else.

If you're the type who likes to visually map out how all these model architectures and molecular representations connect, tools like mapb2.io can help you sketch out the conceptual landscape without losing your mind to the complexity.

The Bottom Line

This survey is a genuinely useful roadmap for anyone trying to understand where LLMs meet molecules. It covers molecular representations, model architectures, pretraining strategies, and applications with enough depth to be practical and enough breadth to show the full picture. The field is moving fast - a related review in Patterns (Cell Press, 2025; DOI: 10.1016/j.patter.2025.101194) and a focused analysis of LLMs in small-molecule drug discovery (Analytical Chemistry, 2025) confirm that this isn't a niche trend but a full-on paradigm... let's call it a paradigm adjustment.

Biology wrote the original language model four billion years ago. We're just now learning to read it.

References

  1. Ashyrmamatov, I., Gwak, S.J., Jin, S.-Y., et al. (2025). A survey on large language models in biology and chemistry. Experimental & Molecular Medicine. DOI: 10.1038/s12276-025-01583-1
  2. Abramson, J., Adler, J., Dunbar, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 493-500. DOI: 10.1038/s41586-024-07487-w
  3. Lu, J., Choi, K., Eremeev, M., et al. (2025). Large Language Models and Their Applications in Drug Discovery and Development: A Primer. Clinical and Translational Science. DOI: 10.1111/cts.70205
  4. Large language models for drug discovery and development. (2025). Patterns (Cell Press). DOI: 10.1016/j.patter.2025.101194
  5. Liu, X.-H., Lu, Z.-H., Wang, T., Liu, F. (2024). Large language models facilitating modern molecular biology and novel drug development. Frontiers in Pharmacology, 15. DOI: 10.3389/fphar.2024.1458739
  6. Application and Prospects of Large Language Models in Small-Molecule Drug Discovery. (2025). Analytical Chemistry, 97(50), 27453-27477.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.