Back in 1998, a group of bioinformaticians got tired of everyone describing the same protein differently depending on which organism they studied, so they created the Gene Ontology - a universal dictionary for what proteins actually do. Brilliant move. Except for one tiny problem: figuring out what each protein does still required painstaking lab experiments. And with over 200 million known protein sequences floating around in databases? Yeah. That backlog wasn't going anywhere fast.
The Labeling Problem Nobody Talks About
Here's a number that should make you uncomfortable: fewer than 0.5% of known proteins have experimentally verified functional annotations. Half a percent. That means for the vast majority of proteins in every database on Earth, we're basically guessing. Or rather, we were basically guessing - until AI showed up with a highlighter and a lot of confidence.
A new review from Wang et al. in Advanced Science (10.1002/advs.202524373) lays out exactly how machine learning is tackling this massive annotation gap. The paper breaks down methods for predicting two complementary labeling systems: Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. Think of GO as answering "what does this protein do, where does it do it, and what biological process is it part of?" while EC numbers answer "what chemical reaction does this enzyme catalyze?" Together, they're basically a protein's resume.
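To make the EC side of that resume concrete: an EC number is four dot-separated levels (class, subclass, sub-subclass, serial number), with "-" standing in for levels that haven't been assigned. Here's a minimal sketch of how you might parse one. The `ECNumber` container and the `parse_ec` and `shared_depth` helpers are hypothetical names invented for this post, not anything from the review, and preliminary numbers with an "n" prefix aren't handled.

```python
from typing import NamedTuple, Optional

# The seven top-level classes of the Enzyme Commission nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


class ECNumber(NamedTuple):
    """Hypothetical container for the four EC levels; None means unassigned ('-')."""
    enzyme_class: int
    subclass: Optional[int]
    sub_subclass: Optional[int]
    serial: Optional[int]


def parse_ec(text: str) -> ECNumber:
    """Parse strings such as '2.7.11.1', 'EC 2.7.11.1', or a partial '2.7.-.-'."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].lstrip(": ")
    parts = cleaned.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    levels = [None if part == "-" else int(part) for part in parts]
    if levels[0] not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class in {text!r}")
    return ECNumber(*levels)


def shared_depth(a: ECNumber, b: ECNumber) -> int:
    """Count how many leading levels two EC numbers agree on (0 to 4)."""
    depth = 0
    for x, y in zip(a, b):
        if x is None or x != y:
            break
        depth += 1
    return depth


if __name__ == "__main__":
    ec = parse_ec("EC 2.7.11.1")
    print(ec, EC_CLASSES[ec.enzyme_class])
    print(shared_depth(ec, parse_ec("2.7.10.2")))
```

Something like `shared_depth` is handy because a prediction that gets the first three levels right is a very different kind of miss than one that gets the class wrong.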
Six Flavors of AI, One Giant Protein Problem
The review organizes the current computational zoo into six modeling paradigms. We're talking everything from old-school sequence homology (find a protein that looks similar, assume it does similar stuff) to protein language models that treat amino acid chains like sentences and try to "read" their meaning. It's essentially NLP for biology.
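The old-school homology idea fits in a few lines. What follows is a toy sketch, not how real pipelines work: they use proper alignment tools, whereas this stand-in scores similarity by the overlap of 3-letter fragments (k-mers). The function names, the 0.5 threshold, the sequences, and the labels are all made up for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """All overlapping length-k fragments of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared fragments divided by total distinct fragments."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_annotation(
    query: str,
    annotated: Dict[str, Tuple[str, Set[str]]],
    min_similarity: float = 0.5,  # arbitrary illustrative cutoff
    k: int = 3,
) -> Tuple[Optional[str], Set[str]]:
    """Copy labels from the most similar annotated protein, if similar enough.

    Returns (best_hit_id, labels), or (None, empty set) when no annotated
    protein reaches the similarity threshold.
    """
    query_kmers = kmers(query, k)
    best_id, best_score = None, 0.0
    for protein_id, (seq, _labels) in annotated.items():
        score = jaccard(query_kmers, kmers(seq, k))
        if score > best_score:
            best_id, best_score = protein_id, score
    if best_id is None or best_score < min_similarity:
        return None, set()
    return best_id, set(annotated[best_id][1])


if __name__ == "__main__":
    # Made-up sequences and placeholder labels, not real annotations.
    annotated = {
        "p1": ("MKTAYIAKQRQISFVKSHFSRQ", {"label_A"}),
        "p2": ("GSSHHHHHHSSGLVPRGSH", {"label_B"}),
    }
    print(transfer_annotation("MKTAYIAKQRQISFVKSHFSRK", annotated))
    print(transfer_annotation("WWWWWWWW", annotated))
```

The second call is the interesting one: when nothing in the database looks similar, this approach has nothing to say, and that is the gap the learned methods are trying to fill.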
Models like ESM and ProtTrans have become the workhorses here - pretrained on tens of millions to billions of protein sequences, they learn patterns that transfer surprisingly well to function prediction. Then there are graph neural networks that encode the hierarchical structure of GO itself, because - and this is the fun part - protein functions aren't independent labels. They're organized in a massive directed acyclic graph where "catalytic activity" sits above "kinase activity," which in turn is a parent of "protein kinase activity." Annotating a protein with a specific term implies all of that term's ancestors too. Models that ignore this structure are basically trying to organize a library without knowing the Dewey Decimal System exists.
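Here's what "functions aren't independent labels" means in practice. This is a minimal sketch that uses a toy hierarchy built from the terms named above: the term names are real GO names, but the edges are simplified (intermediate terms are skipped), and `PARENTS`, `ancestors`, and `propagate` are hypothetical names for this post. The idea it shows is that a specific label drags its more general ancestors along with it.

```python
from typing import Dict, Iterable, Set

# Toy child -> parents map. Edges are simplified for illustration and should
# not be read as the real ontology. One term has two parents, which is what
# makes this a DAG and not a tree.
PARENTS: Dict[str, Set[str]] = {
    "molecular_function": set(),
    "catalytic activity": {"molecular_function"},
    "catalytic activity, acting on a protein": {"catalytic activity"},
    "kinase activity": {"catalytic activity"},
    "protein kinase activity": {
        "kinase activity",
        "catalytic activity, acting on a protein",
    },
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Every term reachable by walking parent edges upward from `term`."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(labels: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Close a label set upward so it stays consistent with the hierarchy."""
    closed = set(labels)
    for term in list(closed):
        closed |= ancestors(term, parents)
    return closed


if __name__ == "__main__":
    for term in sorted(propagate({"protein kinase activity"}, PARENTS)):
        print(term)
```

A model that predicts "protein kinase activity" but not "catalytic activity" has produced a label set that contradicts the ontology. Hierarchy-aware methods either bake this constraint into the architecture or repair predictions afterward with a step like `propagate`.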
Recent methods have gotten creative. DeepSS2GO (PMID: 38701416) throws secondary structure features into the mix alongside sequence and homology data. MEGA-GO (PMID: 39847542) uses multi-scale graph adaptive neural networks to handle proteins of wildly different lengths. ProtGO (PMID: 40632605) goes full multi-modal, combining protein language models with text descriptions, species taxonomy, and GO graph embeddings. It's like giving the model every cheat sheet simultaneously.
The AlphaFold Effect
You can't talk about protein AI without mentioning the elephant - or rather, the Nobel Prize-winning elephant - in the room. AlphaFold cracked protein structure prediction. But structure isn't function. Knowing what a protein looks like doesn't automatically tell you what it does. It helps, though. A lot.
Structure-aware methods like TopEC (10.1101/2024.01.31.578271) use 3D graph neural networks to classify enzymes directly from their shapes. And the AlphaFun pipeline achieved over 98% functional annotation coverage on previously uncharacterized proteins by combining structural alignments with existing databases. The AlphaFold Protein Structure Database now has 200+ million entries. That's a LOT of structural data waiting to be mined for functional clues.
Honestly? The Hard Parts Are Still Hard
Look, the review is refreshingly honest about limitations. Models struggle with proteins that have no close homologs - the truly novel stuff. A critical bioRxiv preprint (10.1101/2024.07.01.601547) showed that current methods basically fall apart when tested on genuinely uncharacterized enzymes rather than held-out sequences with known relatives. Benchmark contamination is real. EC-Bench (10.1101/2025.06.25.661207) was specifically created to standardize evaluation because everyone was testing on slightly different datasets and declaring victory.
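One common guard against that kind of leakage is to split data by sequence cluster instead of by individual protein, so close relatives never straddle the train/test line. A rough sketch, assuming you already have cluster assignments from some external clustering tool: the `cluster_split` function, the IDs, and the 0.3 test fraction are all hypothetical, and this is not the procedure used by EC-Bench or any paper cited here.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_split(
    cluster_of: Dict[str, str],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split protein IDs so that every cluster lands entirely on one side.

    `cluster_of` maps protein ID -> cluster ID. Whole clusters are assigned to
    the test set until it holds roughly `test_fraction` of all proteins; the
    remaining clusters go to the training set.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(protein_id)

    # Sort before shuffling so the result depends only on the seed.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    target = test_fraction * len(cluster_of)
    train: List[str] = []
    test: List[str] = []
    for cluster_id in cluster_ids:
        bucket = test if len(test) < target else train
        bucket.extend(clusters[cluster_id])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Made-up IDs: proteins sharing a letter belong to the same cluster.
    cluster_of = {
        "a1": "A", "a2": "A",
        "b1": "B",
        "c1": "C", "c2": "C",
        "d1": "D",
    }
    train, test = cluster_split(cluster_of, test_fraction=0.3)
    print("train:", train)
    print("test:", test)
```

Even this only controls for relatives at whatever similarity threshold the clustering used. Testing on genuinely uncharacterized enzymes, as the preprint above did, is a harder bar.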
The other unsolved piece? Context. A protein might do different things in different tissues, at different times, or in different organisms. Current models mostly predict function in a vacuum. Getting to context-dependent, high-resolution annotation - that's the next frontier.
Why This Matters Beyond the Lab
If you can predict what a protein does computationally, you accelerate basically everything downstream: drug target identification, understanding disease mechanisms, engineering enzymes for industrial applications, even figuring out what the heck all those proteins in your gut microbiome are up to. It's the difference between reading a parts list and understanding the machine.
And honestly, the pace is wild. Between protein language models, structure prediction, and knowledge graph methods all converging, we might be approaching a point where computational annotation is reliable enough to guide - not just supplement - experimental work. Not replace it. Guide it. Big difference.
References
- Wang, W., Yang, Q., Zeng, M., Zheng, R., & Li, M. (2026). Artificial Intelligence Powers Protein Functional Annotation. Advanced Science. DOI: 10.1002/advs.202524373
- ProtGO: Multi-modal GO knowledge framework (2025). PMID: 40632605
- MEGA-GO: Multi-scale graph adaptive neural network (2025). PMID: 39847542
- DeepSS2GO: Secondary structure features for GO prediction (2024). PMID: 38701416
- TopEC: 3D graph neural networks for EC classification (2024). DOI: 10.1101/2024.01.31.578271
- EC-Bench: Benchmark for EC number prediction (2025). DOI: 10.1101/2025.06.25.661207
- Limitations of current ML models for uncharacterized proteins (2024). DOI: 10.1101/2024.07.01.601547
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.