Somebody Just Organized 19 Billion Proteins Into a Cosmic Filing Cabinet

The number 19 billion doesn't mean much until you try to sort it. Imagine dumping every book ever written - in every language, including ones nobody speaks anymore - into a single warehouse, then asking an intern to group them by topic. Now replace "books" with "proteins," replace "intern" with "a very clever algorithm," and crank the difficulty up by a factor of about a million. That's roughly what a team led by Benjamin Buchfink just pulled off, and they published the receipts in Nature Methods (Buchfink et al., 2026).

The Problem: Biology's Messiest Junk Drawer

Here's something that rarely makes headlines: we've been sequencing DNA like there's a clearance sale on nucleotides. Metagenomic surveys - basically scooping up environmental samples and reading every scrap of genetic material inside - have flooded public databases with billions of protein sequences. One 2023 study alone pulled 1.17 billion proteins that matched nothing in existing reference databases out of 26,931 metagenomes (Pavlopoulos et al., 2023). The total pile now sits north of 19 billion sequences and climbing.

The trouble is, a pile isn't a library. Without grouping these proteins into families, they're just a staggeringly large list of letters. And existing tools? They were designed for a world that had millions of proteins, not billions. CD-HIT, the workhorse from the 2000s, scales roughly quadratically in the worst case - meaning if you double the input, runtime roughly quadruples (Li & Godzik, 2006). MMseqs2's Linclust module cracked linear scaling and can handle billions of sequences, but its sensitivity drops at low sequence identity (Steinegger & Söding, 2018). Grouping distantly related proteins - the ones that look nothing alike in sequence but fold into similar shapes and do similar jobs - remained a brutal computational headache.
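To make the quadratic-versus-linear gap concrete, here is a minimal Python sketch that simply counts comparisons. It is not how CD-HIT or Linclust are implemented: the all-against-all count stands in for worst-case quadratic behaviour, and `per_sequence=20` is an arbitrary illustrative constant for a linear-time method.

```python
def all_vs_all_comparisons(n: int) -> int:
    """Unordered pairs among n sequences: the worst-case quadratic workload."""
    return n * (n - 1) // 2


def linear_comparisons(n: int, per_sequence: int = 20) -> int:
    """Workload when each sequence is compared to a fixed number of candidates.

    per_sequence is a made-up constant, chosen only for illustration.
    """
    return n * per_sequence


if __name__ == "__main__":
    for n in (1_000_000, 2_000_000, 4_000_000):
        print(f"{n:>9,} sequences: "
              f"{all_vs_all_comparisons(n):>18,} pairwise vs "
              f"{linear_comparisons(n):>12,} linear")
```

Doubling the input roughly quadruples the first column and only doubles the second, which is the whole reason billion-sequence datasets need something better than brute force.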

Enter DeepClust: Clustering Without a Safety Net

DIAMOND DeepClust takes a cascaded approach. Think of it as sorting your messy closet in rounds: first, a quick pass groups the obvious stuff (jeans with jeans, shirts with shirts). Then you go back with a sharper eye and merge the borderline cases. Each round uses a more sensitive alignment method than the last.
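The closet analogy can be sketched as a toy cascade in Python. To be clear, this is not DeepClust's algorithm: it swaps real alignments for a crude k-mer Jaccard similarity, and the names `greedy_round` and `cascaded_clustering`, the sequences, and the thresholds are all invented for illustration. What it does show is the shape of the idea: each round clusters only the representatives left over from the previous round, using a more permissive comparison (shorter k-mers, lower threshold).

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Shared k-mers divided by total distinct k-mers (0.0 if either is empty)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def greedy_round(seqs: list[str], k: int, threshold: float) -> dict[str, str]:
    """One greedy pass: each sequence joins the first representative it matches."""
    reps: list[str] = []
    assignment: dict[str, str] = {}
    # Longest first, ties broken alphabetically so the result is deterministic.
    for s in sorted(seqs, key=lambda x: (-len(x), x)):
        for r in reps:
            if jaccard(s, r, k) >= threshold:
                assignment[s] = r
                break
        else:
            reps.append(s)
            assignment[s] = s
    return assignment


def cascaded_clustering(seqs: list[str],
                        rounds: list[tuple[int, float]]) -> dict[str, str]:
    """Run several rounds; later rounds only compare surviving representatives."""
    final = {s: s for s in seqs}
    current = sorted(final)
    for k, threshold in rounds:
        round_map = greedy_round(current, k, threshold)
        final = {s: round_map[rep] for s, rep in final.items()}
        current = sorted(set(round_map.values()))
    return final


# Toy data: two near-identical sequences, one diverged relative, one outsider.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTVYIGKQR", "GGSPLWWCHH"]
strict_only = cascaded_clustering(toy, [(3, 0.5)])
with_second_pass = cascaded_clustering(toy, [(3, 0.5), (2, 0.25)])
```

With only the strict first round, the diverged relative stays on its own; the more permissive second round merges it into the family while the unrelated sequence remains separate.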

The original DIAMOND aligner was already famous for being 100 to 10,000 times faster than BLAST for protein searches (Buchfink et al., 2015; Buchfink et al., 2021). DeepClust builds on that engine but removes the identity cutoff floor entirely - meaning it can chase down relationships between proteins that share vanishingly little sequence similarity. The default --approx-id setting for deepclust mode is literally 0%. No minimum. It just keeps looking.
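For the curious, here is roughly what an invocation might look like, wrapped in a small Python helper that only builds the command line and does not run anything. Treat it as a sketch: the `-M` and `--approx-id` flags are the ones mentioned in this post, the `-d` (input) and `-o` (output) flags are assumed from the DIAMOND documentation, the helper name `deepclust_command` and the file names are made up, and you should check `diamond help` for your installed version before relying on any of it.

```python
import shlex
from typing import Optional


def deepclust_command(fasta: str, out_tsv: str,
                      memory_limit: str = "64G",
                      approx_id: Optional[float] = None) -> list[str]:
    """Build (but do not execute) an argv list for a DeepClust run.

    Flags are assumptions to verify against your DIAMOND version.
    """
    cmd = ["diamond", "deepclust",
           "-d", fasta,          # input protein sequences (assumed flag)
           "-o", out_tsv,        # clustering output file (assumed flag)
           "-M", memory_limit]   # memory limit, as mentioned in the post
    if approx_id is not None:
        # Only set an identity floor if one is wanted; the post says the
        # deepclust default is 0%.
        cmd += ["--approx-id", str(approx_id)]
    return cmd


if __name__ == "__main__":
    print(shlex.join(deepclust_command("proteins.faa", "clusters.tsv")))
```

Leaving `approx_id` unset keeps the no-minimum behaviour described above; passing a value reintroduces an identity floor.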

The result: 19 billion biosphere proteins collapsed into 544 million non-singleton clusters. That's a roughly 35-fold reduction in complexity while preserving the biological signal researchers actually need.
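The 35-fold figure is just the ratio of the two numbers quoted above, which a few lines of Python confirm. Note the assumption baked in: this divides all 19 billion sequences by the non-singleton cluster count only, because the number of singletons is not given here.

```python
total_sequences = 19_000_000_000       # proteins in the input, as quoted
non_singleton_clusters = 544_000_000   # clusters with more than one member

fold_reduction = total_sequences / non_singleton_clusters
print(f"{fold_reduction:.1f}-fold reduction")  # prints "34.9-fold reduction"
```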

Why AlphaFold Cares (A Lot)

Here's where it gets properly exciting. AlphaFold2, DeepMind's structure prediction marvel, doesn't work in isolation. It relies heavily on multiple sequence alignments (MSAs) - essentially stacking up related sequences to spot which positions co-evolve, hinting at which amino acids sit near each other in 3D space. The richer and more diverse the MSA, the better the prediction (Jumper et al., 2021).
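To get a feel for what "co-evolve" means, here is a toy Python calculation of mutual information between alignment columns, a classic covariation statistic. This is not what AlphaFold2 does internally (it learns from the MSA with a neural network rather than computing this score), and the four-sequence alignment is invented for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(msa: list[str], i: int, j: int) -> float:
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    count_i = Counter(row[i] for row in msa)
    count_j = Counter(row[j] for row in msa)
    count_ij = Counter((row[i], row[j]) for row in msa)
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a, p_b = count_i[a] / n, count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: column 1 (K/R) and column 3 (E/D) always change together,
# column 0 varies independently of column 1, column 2 never changes.
toy_msa = ["AKLE",
           "GKLE",
           "ARLD",
           "GRLD"]

coupled = mutual_information(toy_msa, 1, 3)      # columns that co-vary
independent = mutual_information(toy_msa, 0, 1)  # columns that do not
```

Here `coupled` comes out at 1 bit and `independent` at 0, and with only four sequences that is about all you can say. Real signals need deep, diverse alignments, which is exactly why a bigger pool of homologs matters.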

The DeepClust database acts like a cheat code for MSA construction. By pre-organizing the protein universe into meaningful clusters, it lets AlphaFold2 pull in more relevant homologs, faster. The paper demonstrates that plugging the DeepClust clusters into AlphaFold2's pipeline measurably improves structure predictions - particularly for those tricky orphan proteins that usually leave prediction tools shrugging their digital shoulders.

The AlphaFold database already hosts over 214 million predicted structures (Varadi et al., 2024). Meanwhile, Foldseek's structural clustering of that database identified 2.3 million non-singleton structural clusters, with 31% lacking any annotation whatsoever (Barrio-Hernandez et al., 2023). DeepClust attacks the same organizational problem from the sequence side, and the two approaches complement each other like peanut butter and an unexpectedly competent jelly.

The Scale Is Genuinely Absurd

Let's sit with the numbers for a second. Nineteen billion sequences. If you printed each protein sequence on a single line of a text file, you'd be looking at several trillion characters - terabytes of raw text at a few hundred residues per protein, before any of the working data a clustering run generates. The fact that DeepClust can chew through this on accessible hardware - the tool runs with a configurable memory limit (the -M flag) and leans on temporary disk storage - is a minor engineering miracle. For context, CD-HIT would need approximately the remaining lifespan of our sun to finish the same job at this scale. (Slight exaggeration. Slight.)

The database itself is available for download, which is the bioinformatics equivalent of someone building a highway and then just... giving it away. Researchers studying everything from antibiotic resistance genes in soil microbiomes to viral proteins in ocean samples now have a pre-organized map of protein space to work with.

What This Means for the Rest of Us

Roughly 30-50% of known protein families still have no assigned function. That's not a footnote - that's up to half the playbook missing. DeepClust doesn't solve that mystery directly, but by organizing the chaos, it makes the mystery approachable. When you can see that a thousand uncharacterized proteins from ocean sediment cluster together with a known enzyme from a soil bacterium, you've got a hypothesis. Hypotheses become experiments. Experiments become drugs, industrial enzymes, biofuels, and the occasional Nobel Prize.
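Here is a minimal Python sketch of that "cluster membership becomes a hypothesis" step. It assumes a two-column, tab-separated clustering file (representative, then member), which is an assumption to check against the actual download. The function names, protein IDs, and the annotation table are hypothetical.

```python
from collections import defaultdict


def load_clusters(lines: list[str]) -> dict[str, list[str]]:
    """Group members by representative from 'representative<TAB>member' lines."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        clusters[rep].append(member)
    return dict(clusters)


def annotation_hypotheses(clusters: dict[str, list[str]],
                          annotations: dict[str, str]
                          ) -> dict[str, tuple[list[str], list[str]]]:
    """For clusters mixing annotated and unannotated members, return
    (known functions, unannotated members) keyed by representative."""
    hypotheses = {}
    for rep, members in clusters.items():
        known = sorted({annotations[m] for m in members if m in annotations})
        unknown = [m for m in members if m not in annotations]
        if known and unknown:
            hypotheses[rep] = (known, unknown)
    return hypotheses


# Hypothetical toy input: one mixed cluster, one cluster with no annotation.
toy_lines = [
    "soil_enzyme_1\tsoil_enzyme_1",
    "soil_enzyme_1\tsediment_orf_17",
    "soil_enzyme_1\tsediment_orf_42",
    "mystery_9\tmystery_9",
    "mystery_9\tmystery_10",
]
toy_annotations = {"soil_enzyme_1": "beta-lactamase"}

leads = annotation_hypotheses(load_clusters(toy_lines), toy_annotations)
```

In this toy case `leads` flags the two sediment proteins as candidates sharing a function with the annotated soil enzyme. That is a lead to test in the lab, not a confirmed annotation.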

The age of petascale biology is here. The proteins were always out there, in hot springs and hospital ventilation systems and the guts of deep-sea tube worms. We just didn't have a filing system big enough. Now we do.

References:

Buchfink, B.J., Barbé, É., Ashkenazy, H., Reuter, K., Kennedy, J.A., & Drost, H.G. (2026). Clustering the protein universe of life using DIAMOND DeepClust. Nature Methods. DOI: 10.1038/s41592-026-03030-z
Buchfink, B., Xie, C., & Huson, D.H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12, 59-60. DOI: 10.1038/nmeth.3176
Buchfink, B., Reuter, K., & Drost, H.G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18, 366-368. DOI: 10.1038/s41592-021-01101-x
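Jumper, J., Evans, R., Pritzel, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589. DOI: 10.1038/s41586-021-03819-2
Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659. DOI: 10.1093/bioinformatics/btl158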
Steinegger, M., & Söding, J. (2018). Clustering huge protein sequence sets in linear time. Nature Communications, 9, 2542. DOI: 10.1038/s41467-018-04964-5
Barrio-Hernandez, I., Yeo, J., Jänes, J., et al. (2023). Clustering predicted structures at the scale of the known protein universe. Nature, 622, 637-645. DOI: 10.1038/s41586-023-06510-w
Pavlopoulos, G.A., Baltoumas, F.A., Liu, S., et al. (2023). Unraveling the functional dark matter through global metagenomics. Nature, 622, 594-602. DOI: 10.1038/s41586-023-06583-7
Varadi, M., Bertoni, D., Gupta, P., et al. (2024). AlphaFold Protein Structure Database in 2024. Nucleic Acids Research, 52(D1), D368-D375. DOI: 10.1093/nar/gkad1011

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.