Nineteen Billion Proteins Walk Into a Cluster

Somewhere between "a lot" and "incomprehensibly many" lives the number 19 billion. That's roughly how many protein sequences the biosphere has coughed up so far - scraped from soil microbes, ocean plankton, your gut bacteria, and basically every living thing that has ever bothered to encode an amino acid chain. The problem? Nobody had a filing system big enough to sort them all. Until now.

A team led by Benjamin Buchfink just dropped DIAMOND DeepClust in Nature Methods, and it does exactly what the name suggests: clusters proteins. All of them. The entire known protein universe. In 18 days. On 27 compute nodes. That's roughly 250,000 CPU hours, which sounds like a lot until you realize the alternative was "never finishing at all."
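For a feel of what those figures mean in throughput terms, here is a quick back-of-envelope sketch. It uses only the numbers quoted above (19 billion sequences, 18 days, 250,000 CPU hours); the variable names are made up for illustration, and the results are rough averages, not benchmarks from the paper.

```python
# Back-of-envelope throughput from the figures quoted in the post.
SEQUENCES = 19_000_000_000   # proteins clustered
DAYS = 18                    # wall-clock duration of the run
CPU_HOURS = 250_000          # approximate total compute, as quoted

wall_clock_seconds = DAYS * 24 * 3600

# Average number of sequences placed into a cluster per second of wall-clock time.
seqs_per_second = SEQUENCES / wall_clock_seconds

# Average number of sequences handled per CPU hour spent.
seqs_per_cpu_hour = SEQUENCES / CPU_HOURS

print(f"{seqs_per_second:,.0f} sequences per wall-clock second")
print(f"{seqs_per_cpu_hour:,.0f} sequences per CPU hour")
```

In other words, the run averaged on the order of twelve thousand proteins filed per second for two and a half weeks straight.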

The Protein Sorting Problem (It's Worse Than Your Email Inbox)

Here's why this matters. Every living organism makes proteins, and metagenomic sequencing - the practice of blending up an environmental sample and reading every scrap of DNA - has flooded databases with billions of sequences that nobody has characterized yet. Researchers sometimes call this the "dark matter" of the protein universe, which is the biology equivalent of admitting "we found a warehouse full of mystery boxes and lost the manifest."

Grouping similar proteins into families is how biologists make sense of this chaos. Find a cluster, and you can infer that the members probably share a function, an evolutionary ancestor, or at least a structural resemblance. Tools like CD-HIT, MMseqs2/Linclust, and UCLUST have been the workhorses here for years. But they hit a wall: sensitive, search-based clustering scales super-linearly with the number of input sequences, and the linear-time shortcut (Linclust) buys its speed by giving up sensitivity to distant relatives. Double your dataset, and a sensitive run's time more than doubles. At 19 billion sequences, "more than doubles" starts to mean "see you next geological epoch."
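To see why "more than doubles" hurts so much, here is a small sketch of how runtime reacts to doubling the input under different growth rates. The exponents are illustrative assumptions, not measured scaling behavior of any of the tools named above, and real tools avoid brute-force all-vs-all comparison with prefilters; the pair count is only there to show the worst case they are dodging.

```python
def slowdown_when_doubling(exponent):
    """Runtime ratio T(2n) / T(n) if runtime grows like n ** exponent."""
    return 2 ** exponent


def all_vs_all_pairs(n):
    """Number of distinct sequence pairs in a naive all-vs-all comparison."""
    return n * (n - 1) // 2


# Hypothetical growth rates: linear, mildly super-linear, quadratic.
for exponent in (1.0, 1.5, 2.0):
    factor = slowdown_when_doubling(exponent)
    print(f"n^{exponent}: doubling the input costs {factor:.2f}x the runtime")

pairs = all_vs_all_pairs(19_000_000_000)
print(f"naive pairs at 19 billion sequences: {pairs:.3e}")
```

Even the mild n^1.5 case means every doubling of the database nearly triples the bill, which is why the growth rate matters more than raw speed at this scale.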

How DeepClust Actually Pulls This Off

DIAMOND - originally published in 2015 as a BLAST alternative that clocked in at 20,000 times faster than BLASTX on short reads - has been the bioinformatics community's favorite speed demon for a decade. DeepClust builds on that foundation with a cascaded clustering approach: it runs multiple rounds of increasingly sensitive alignment, using fast initial passes to rough-sort proteins before applying finer-grained comparisons. Think of it as sorting your sock drawer by first separating colors (fast, easy) and then matching exact patterns (slower, but you're working with a much smaller pile).

The key trick is that each cascade step reduces the dataset for the next one, so the expensive sensitive alignments only run on manageable subsets. The result? Scaling that can theoretically handle trillions of sequences while still catching distant homologs that simpler methods miss.
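The cascade idea can be sketched in a few lines. This is a toy, not DeepClust's algorithm: it swaps real sequence alignment for k-mer Jaccard similarity, uses a simple greedy "join the first representative that matches" rule, and every name, sequence, and threshold in it (`greedy_round`, `cascade`, the `(k, threshold)` settings) is hypothetical. What it does show faithfully is the shape of the trick: a strict, cheap round runs on everything, and the more permissive round only ever sees the surviving representatives.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared fraction of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_round(seqs, k, threshold):
    """Map each sequence to a representative, trying longest sequences first."""
    reps = []          # (representative sequence, its k-mer set)
    assignment = {}
    for seq in sorted(set(seqs), key=lambda s: (-len(s), s)):
        profile = kmers(seq, k)
        for rep, rep_profile in reps:
            if jaccard(profile, rep_profile) >= threshold:
                assignment[seq] = rep
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            reps.append((seq, profile))
            assignment[seq] = seq
    return assignment


def cascade(seqs, rounds):
    """Run successive rounds, each one only on the previous round's representatives."""
    final = {s: s for s in seqs}
    current = sorted(set(seqs))
    for k, threshold in rounds:
        step = greedy_round(current, k, threshold)
        # Re-point every original sequence at its representative's new representative.
        final = {s: step[rep] for s, rep in final.items()}
        current = sorted(set(step.values()))
    return final


# Toy input: a sequence, a truncated copy, a two-residue variant, and an unrelated one.
A = "MKTAYIAKQRQISFVKSHFSRQ"
B = A[:-1]
C = A[:10] + "GG" + A[12:]
D = "GGGWWWPPPLLLCCCDDDEEEF"
seqs = [A, B, C, D]

# Round 1: long k-mers, strict threshold. Round 2: short k-mers, looser threshold.
rounds = [(5, 0.8), (3, 0.4)]
clusters = cascade(seqs, rounds)
print(len(set(clusters.values())), "clusters after the cascade")
```

On this toy input the strict first round only merges the truncated copy into its parent, and the looser second round, working on three representatives instead of four sequences, pulls the two-residue variant into the same family. The unrelated sequence stays on its own.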

The Numbers Are Staggering (No, Seriously)

DeepClust organized 19 billion proteins into 544 million nonsingleton clusters - meaning clusters with at least two members. Of those, 335 million clusters had three or more members, representing a 5.5-fold increase in sequence diversity compared to the Big Fantastic Database (BFD), which was previously the heavyweight champion of protein collections at 2.5 billion sequences.
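A little arithmetic on those counts helps put them in proportion. The sketch below uses only the figures quoted above; note that the 5.5-fold diversity figure is the paper's own comparison and cannot be re-derived from these numbers, so it is deliberately left out.

```python
# Figures as quoted in the post.
NONSINGLETON_CLUSTERS = 544_000_000   # clusters with at least two members
THREE_PLUS_CLUSTERS = 335_000_000     # clusters with three or more members
TOTAL_SEQUENCES = 19_000_000_000      # sequences clustered by DeepClust
BFD_SEQUENCES = 2_500_000_000         # size of the BFD

# Clusters that hold exactly two members.
pair_clusters = NONSINGLETON_CLUSTERS - THREE_PLUS_CLUSTERS

# Share of nonsingleton clusters that have three or more members.
share_three_plus = THREE_PLUS_CLUSTERS / NONSINGLETON_CLUSTERS

# How many times more raw sequences went in, compared with the BFD.
raw_size_ratio = TOTAL_SEQUENCES / BFD_SEQUENCES

print(f"{pair_clusters:,} clusters with exactly two members")
print(f"{share_three_plus:.1%} of nonsingleton clusters have three or more members")
print(f"{raw_size_ratio:.1f}x the raw sequence count of the BFD")
```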

The real kicker: an estimated 118 million protein families in the DeepClust database are entirely new - they couldn't be mapped to anything in the BFD. That's 118 million groups of proteins that science had sequenced but never organized. It's like discovering that 118 million books in the Library of Babel actually have coherent stories - somebody just needed to shelve them properly.

Wait, This Helps AlphaFold Too?

Yes, and this might be the most consequential part. AlphaFold2 - DeepMind's structure prediction model that pushed accuracy close to experimental quality for many proteins - relies heavily on multiple sequence alignments (MSAs). The more diverse and deep your alignment, the better AlphaFold can pick up on co-evolutionary signals that reveal 3D structure. Feed it a richer set of homologs, and you get better predictions.
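What does a "co-evolutionary signal" look like? A classic, simple way to show it is mutual information between two alignment columns: if two positions mutate together across homologs, knowing one tells you the other. To be clear, this is an illustration of the idea only. AlphaFold2 learns such patterns inside its network and does not compute this statistic, and the six-row alignment and the function name `mutual_information` are invented for the example.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    col_i = Counter(row[i] for row in msa)
    col_j = Counter(row[j] for row in msa)
    pairs = Counter((row[i], row[j]) for row in msa)
    total = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = col_i[a] / n
        p_b = col_j[b] / n
        total += p_ab * log2(p_ab / (p_a * p_b))
    return total


# Toy alignment: columns 0 and 1 always change together (A with K, G with D),
# column 2 never changes, and column 3 varies independently of column 0.
toy_msa = [
    "AKLE",
    "AKLE",
    "GDLE",
    "GDLQ",
    "AKLQ",
    "GDLE",
]

coupled = mutual_information(toy_msa, 0, 1)
conserved = mutual_information(toy_msa, 0, 2)
independent = mutual_information(toy_msa, 0, 3)
print(f"coupled columns:     {coupled:.2f} bits")
print(f"conserved column:    {conserved:.2f} bits")
print(f"independent columns: {independent:.2f} bits")
```

The coupled pair scores a full bit while the other two pairs score zero. With only a handful of homologs such statistics are noisy, which is the intuition behind wanting deeper, more diverse alignments.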

The DeepClust database, which is publicly available for download, serves exactly this purpose. By providing vastly more organized protein families than previous databases, it gives AlphaFold2 (and similar tools) a much larger evolutionary context to work with. For proteins that previously had sparse alignments - the orphans and oddballs sitting in poorly populated clusters - this could be the difference between a confident structural prediction and a shrug emoji.

It's a similar idea to what Meta AI explored with their ESM Metagenomic Atlas of 617 million predicted structures, but approached from the sequence-clustering side rather than the language-model side. If you're trying to map the full architecture of life's molecular machinery, you want both.

Why You Should Care (Even If You Don't Speak Amino Acid)

The protein universe is expanding faster than our ability to study it. Moon-shot efforts like the Earth BioGenome Project aim to sequence 1.8 million eukaryotic species, and environmental metagenomics keeps pulling new organisms out of every mud puddle and thermal vent. Without tools that scale, we'd be drowning in data while learning nothing from it.

DeepClust is infrastructure - not as flashy as a chatbot, but arguably more important. It's the filing cabinet that makes the rest of bioinformatics possible at planetary scale. And if you've ever tried to organize anything at scale - say, mapping out a research project's architecture in a tool like mapb2.io - you know that the organizational layer is what separates "data" from "knowledge."

The database is available now. The code is open source. The protein universe just got a whole lot more navigable.

References:

Buchfink, B.J., Barbé, É., Ashkenazy, H., Reuter, K., Kennedy, J.A. & Drost, H.-G. Clustering the protein universe of life using DIAMOND DeepClust. Nature Methods (2026). DOI: 10.1038/s41592-026-03030-z
Buchfink, B., Xie, C. & Huson, D.H. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59-60 (2015). DOI: 10.1038/nmeth.3176
Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nature Communications 9, 2542 (2018). DOI: 10.1038/s41467-018-04964-5
Barrio-Hernandez, I. et al. Clustering predicted structures at the scale of the known protein universe. Nature 622, 637-645 (2023). DOI: 10.1038/s41586-023-06510-w
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021). DOI: 10.1038/s41586-021-03819-2
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130 (2023). DOI: 10.1126/science.ade2574
Pavlopoulos, G.A. et al. Unraveling the functional dark matter through global metagenomics. Nature 622, 594-602 (2023). DOI: 10.1038/s41586-023-06583-7

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.