The dream was bigger than this

Generalization (noun): the ability of a model to deal with new cases instead of just regurgitating old ones. In this paper, that noble concept wanders into protein-ligand cofolding and gets shoved into a locker.

Protein-ligand cofolding is the flashy next act after protein structure prediction. The protein is the big molecular machine. The ligand is the smaller molecule that binds to it. The "pose" is the exact way that small molecule sits in the binding pocket, which matters because in drug discovery, "close enough" can mean "completely useless." A millimeter is nothing in daily life. An angstrom is everything when your molecule is trying not to collide with a carbon atom like a shopping cart with no steering.
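To put a number on "close enough," pose accuracy is usually scored as the root-mean-square deviation (RMSD) between predicted and experimental ligand atoms. Here is a minimal sketch of that calculation. The coordinates are made up, the 2 Å success cutoff is a commonly used convention assumed here rather than a figure from the paper, and real benchmarks typically use a symmetry-corrected RMSD that this toy version skips.

```python
import math


def ligand_rmsd(predicted, reference):
    """Root-mean-square deviation (in angstroms) between two ligand poses.

    Both arguments are lists of (x, y, z) tuples for the ligand's heavy atoms,
    in the same atom order and already in the same coordinate frame.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


# Hypothetical three-atom ligand; coordinates are invented for illustration.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 3.0, y, z) for x, y, z in reference]  # every atom moved 3 Å along x

SUCCESS_CUTOFF = 2.0  # Å; assumed convention, not taken from the paper

for name, pose in [("identical", reference), ("shifted", shifted)]:
    rmsd = ligand_rmsd(pose, reference)
    print(f"{name}: RMSD = {rmsd:.2f} Å, success = {rmsd <= SUCCESS_CUTOFF}")
```

A pose that is off by 3 Å everywhere fails the cutoff even though, at human scale, it has moved by less than a billionth of a meter.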

That is why this new paper by Škrinjar and colleagues lands with a thud you can hear from the medicinal chemistry lab. They built a benchmark called Runs N' Poses, with 2,600 high-resolution protein-ligand systems released after the training cutoffs of modern cofolding models, then tested four leading all-atom methods. The verdict is not subtle: today's cofolding systems often look strong because they have effectively memorized ligand poses from training data, which badly limits their value for de novo drug design [1].
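The logic of a benchmark like this can be sketched in a few lines: keep only systems released after the training cutoff, then ask whether success depends on how similar each one is to something in the training data. Everything below is a hypothetical illustration. The records, the cutoff date, the 0-100 similarity scale, and the bin edges are assumptions, not the paper's actual data or pipeline.

```python
from datetime import date

# Hypothetical records, invented for illustration:
# (system id, release date, similarity to the closest training system on a
#  0-100 scale, whether the predicted pose counted as a success).
SYSTEMS = [
    ("sys_a", date(2020, 5, 1), 95.0, True),   # released before the cutoff
    ("sys_b", date(2022, 3, 1), 90.0, True),
    ("sys_c", date(2022, 8, 1), 85.0, True),
    ("sys_d", date(2023, 1, 1), 40.0, False),
    ("sys_e", date(2023, 6, 1), 30.0, True),
    ("sys_f", date(2023, 9, 1), 10.0, False),
]

TRAINING_CUTOFF = date(2021, 9, 30)  # placeholder cutoff, an assumption


def success_by_similarity(systems, cutoff, bin_edges=(0, 50, 100)):
    """Success rate per similarity bin, using only systems released after cutoff."""
    held_out = [s for s in systems if s[1] > cutoff]
    rates = {}
    last_edge = bin_edges[-1]
    for low, high in zip(bin_edges, bin_edges[1:]):
        # Bins are half-open [low, high), except the last, which includes its top edge.
        in_bin = [
            s for s in held_out
            if low <= s[2] < high or (high == last_edge and s[2] == high)
        ]
        if in_bin:
            rates[(low, high)] = sum(s[3] for s in in_bin) / len(in_bin)
    return rates


for (low, high), rate in success_by_similarity(SYSTEMS, TRAINING_CUTOFF).items():
    print(f"similarity {low}-{high}: {rate:.0%} success")
```

If a model has truly generalized, the success rate should stay roughly flat across bins. If it drops sharply in the low-similarity bin, as it does in this invented data, the model is leaning on what it has already seen.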

That is disappointing. It is also exactly the kind of reality check the field needs. Better a bruised ego now than a billion-dollar drug program built on vibes.

When the model has seen the exam answers

The background here matters. Traditional protein-ligand docking usually starts with a protein structure and tries to place a ligand into a likely binding site. Cofolding methods try something more ambitious: predict the protein and ligand arrangement together, letting both adapt. That is appealing because real proteins are not statues. They wiggle, shift, sulk, and occasionally behave like a door hinge designed by chaos.

Recent systems such as AlphaFold 3 and RoseTTAFold All-Atom made this feel suddenly plausible at scale [2,3]. A 2025 review framed cofolding as part of a broader move from rigid docking toward fully flexible, all-atom prediction [4]. There is genuine progress here. The machines are not doing magic, but they are doing something much more valuable: compressing a ridiculous amount of structural biology into a predictive engine.

But there has been a catch for a while. PoseBusters showed in 2024 that many AI docking methods could look good on standard metrics while still producing physically dubious poses or failing to generalize to novel sequences [5]. PoseBench pushed that concern further in 2025, finding that deep learning cofolding methods often beat older baselines yet still struggle with new binding poses, unknown pockets, and the tradeoff between structural accuracy and chemical realism [6]. This new Nature Structural & Molecular Biology paper is basically the moment somebody finally checks whether the class valedictorian learned the material or just stole the answer key.
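What does "physically dubious" look like in practice? One of the simplest sanity checks is a steric clash test: flag any ligand atom that sits implausibly close to a protein atom. The sketch below is a toy version with invented coordinates and a flat 2 Å cutoff chosen for illustration. It is not how PoseBusters implements its checks, and real tools generally scale the allowed distance by the atoms' van der Waals radii.

```python
import math


def find_clashes(ligand_atoms, protein_atoms, min_distance=2.0):
    """Return (ligand_index, protein_index, distance) for each too-close atom pair.

    Atoms are (x, y, z) tuples in angstroms. min_distance is a flat cutoff,
    assumed here for simplicity.
    """
    clashes = []
    for i, lig in enumerate(ligand_atoms):
        for j, prot in enumerate(protein_atoms):
            distance = math.dist(lig, prot)
            if distance < min_distance:
                clashes.append((i, j, distance))
    return clashes


# Hypothetical coordinates, invented for illustration.
protein = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
plausible_pose = [(2.0, 3.5, 0.0)]  # about 4 Å from both protein atoms
clashing_pose = [(0.5, 0.0, 0.0)]   # 0.5 Å from the first protein atom

print(find_clashes(plausible_pose, protein))
print(find_clashes(clashing_pose, protein))
```

A pose can score well on RMSD and still fail a check like this, which is why accuracy and physical validity are reported separately.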

Why this matters beyond benchmark drama

If you care about drug discovery, this is the whole ballgame. The useful case is not "can the model recognize a familiar pocket with a familiar-looking ligand?" The useful case is "can it help with molecules nobody has made yet, for targets nobody has solved cleanly yet, in projects where being wrong costs months and several existential spreadsheets?"

That is why memorization is such a problem. A model that mostly recalls training-set geometry is still helpful for some retrospective or near-neighbor tasks. But it is much less helpful for actual frontier work, where the whole point is to leave the neighborhood. On one hand, these systems can still accelerate hypothesis generation. On the other hand, if they fail exactly when chemistry gets weird, then weird chemistry remains the boss fight.

There is a broader theme here too. In 2024, a cross-industry group argued that machine learning in drug discovery suffers from a widening gap between perceived progress and real-world impact, and called for tougher, more realistic benchmarks [7]. This paper feels like that argument made concrete. Not cynical. Not anti-AI. Just unwilling to clap because a model can do karaoke of the Protein Data Bank.

The weirdly hopeful part

Oddly enough, this is good news.

Not for the models' reputations, obviously. But for the field's honesty. A benchmark like Runs N' Poses gives researchers a harder target and a cleaner scoreboard. That is how methods improve. If future systems learn the actual physics and chemistry of binding, rather than pattern-matching yesterday's poses with the confidence of a man explaining crypto at a wedding, then this paper will look less like a takedown and more like a turning point.

So the mood here is mixed. Wonder, because protein-ligand cofolding is still an astonishing idea. Dread, because "astonishing" and "ready for de novo drug design" are not the same sentence. Possibly they are not even on speaking terms yet.

References

1. Škrinjar P, Eberhardt J, Studer G, Tauriello G, Schwede T, Durairaj J. Evaluating generalization in protein-ligand cofolding methods. Nature Structural & Molecular Biology. 2026. DOI: 10.1038/s41594-026-01797-5
2. Abramson J, Adler J, Dunger J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493-500. DOI: 10.1038/s41586-024-07487-w. PMID: 38718835
3. Krishna R, Wang J, Ahern W, et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science. 2024;384:eadl2528. DOI: 10.1126/science.adl2528
4. Lee N, Myllykoski M. Beyond rigid docking: deep learning approaches for fully flexible protein-ligand interactions. Briefings in Bioinformatics. 2025;26(5):bbaf454. DOI: 10.1093/bib/bbaf454. PMID: 40900115
5. Buttenschoen M, Morris GM, Deane CM. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chemical Science. 2024;15:3130-3139. DOI: 10.1039/D3SC04185A
6. Harris C, Stärk H, Corso G, et al. Assessing the potential of deep learning for protein-ligand docking. Nature Machine Intelligence. 2025. DOI: 10.1038/s42256-025-01160-1
7. Wognum C, Ash JR, Aldeghi M, et al. A call for an industry-led initiative to critically assess machine learning for real-world drug discovery. Nature Machine Intelligence. 2024;6:1120-1121. DOI: 10.1038/s42256-024-00911-w

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded
