Monday Morning in the Protein Savannah

By 9:07 on a Monday, the single-cell researcher has coffee in one hand, a fresh scRNA-seq matrix on the screen, and the same old question pacing around the lab like a suspicious heron: if this cell has the mRNA, does it actually have the protein? Here, in the natural habitat of computational biology, we observe a familiar ritual. The scientist clicks on a gene marker, squints at a heatmap, and hopes the transcriptome is telling the truth. Often, with the serene confidence of a creature that has never paid for validation experiments, it is not.

That is the setup for a new Genome Biology paper by Fisher and colleagues, who tested whether machine learning can predict single-cell protein expression better than the matching mRNA alone. Short version: yes, often quite a bit better. Also, nature remains annoying and refuses to be summarized by one molecule at a time [1].

The mRNA left tracks. The protein did not.

In biology, mRNA is the recipe card. Protein is the actual meal. You can probably see the problem. Sometimes the kitchen is slow, sometimes the cook ignores the card, sometimes the dish gets thrown out immediately, and sometimes single-cell sequencing drops the card behind the radiator and records a zero. This is why using one mRNA as a stand-in for its matching protein can be a bit like estimating a restaurant's output by counting how many sticky notes are taped near the stove.

Fisher et al. compared nine machine learning methods for predicting surface proteins from scRNA-seq data and found that these models consistently beat the naive baseline of "just use the cognate mRNA." In some cases, they even recovered proteins whose matching mRNA looked absent, because the rest of the transcriptome still carried the scent trail. The whole flock told the story better than one feather [1].

That idea fits the broader mood of the field. Recent methods range from lean linear models like ScLinear to transformers like scTEL to larger deep-ensemble systems such as SPIDER. Different species, same ecosystem: use many genes together to infer what the cell is really doing on its surface [3-5].

The clever animal is not magic

The charming part here is that the machine learning model is not "reading minds." It is pattern matching across thousands of genes at once. If a certain immune cell tends to express a particular combination of transcripts when a protein is high, the model can learn that pattern. It is less fortune teller, more extremely caffeinated park ranger.
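
For readers who like to watch the park ranger at work, here is a minimal Python sketch of the idea, under loudly stated assumptions. The data are simulated, not taken from the paper: every transcript is a noisy copy of the protein level that drops to zero 60% of the time. The "model" is an average of per-gene least-squares lines, a deliberately simple stand-in and not one of the methods Fisher et al. benchmarked. All function names (simulate, fit_per_gene, predict) are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n_cells, rng, n_helpers=20, dropout=0.6):
    """Toy cells. Returns (X, y); column 0 of X is the cognate mRNA."""
    X, y = [], []
    for _ in range(n_cells):
        protein = rng.uniform(0.0, 4.0)
        row = []
        for _ in range(1 + n_helpers):
            if rng.random() < dropout:
                row.append(0.0)  # transcript missed entirely (assumed dropout)
            else:
                row.append(max(0.0, protein + rng.gauss(0.0, 1.0)))
        X.append(row)
        y.append(protein)
    return X, y

def fit_per_gene(X, y):
    """Fit one least-squares line per gene; returns [(slope, intercept), ...]."""
    n = len(y)
    my = sum(y) / n
    lines = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = sum(col) / n
        vx = sum((x - mx) ** 2 for x in col)
        cov = sum((x - mx) * (t - my) for x, t in zip(col, y))
        slope = cov / vx if vx > 0 else 0.0
        lines.append((slope, my - slope * mx))
    return lines

def predict(lines, X):
    """Average the per-gene predictions for each cell."""
    return [sum(a * x + b for (a, b), x in zip(lines, row)) / len(lines)
            for row in X]

rng = random.Random(0)
X_train, y_train = simulate(300, rng)  # paired RNA + protein cells
X_test, y_test = simulate(200, rng)    # held-out cells

lines = fit_per_gene(X_train, y_train)
pred = predict(lines, X_test)
cognate = [row[0] for row in X_test]

r_cognate = pearson(cognate, y_test)
r_model = pearson(pred, y_test)

# Cells whose cognate mRNA reads zero: the baseline cannot rank these at all.
missed = [i for i, c in enumerate(cognate) if c == 0.0]
r_missed = pearson([pred[i] for i in missed], [y_test[i] for i in missed])

print(f"cognate mRNA alone:        r = {r_cognate:.2f}")
print(f"many transcripts together: r = {r_model:.2f}")
print(f"model, cognate-zero cells: r = {r_missed:.2f} (n = {len(missed)})")
```

On simulated data like this, pooling transcripts should track the protein far more closely than the cognate mRNA does, including in cells where the cognate read is zero. The exact numbers depend entirely on the made-up noise and dropout settings, so treat them as illustration rather than evidence.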

That matters because proteins are the business end of cell behavior. They are the receptors, signals, badges, and doorknobs cells use to interact with the world. In immunology, cancer biology, and disease profiling, missing protein information means missing a lot of the plot. Methods like CITE-seq can measure RNA and proteins together, but they cost more, require antibody panels, and are not available for every old dataset sitting in a public repository like a forgotten treasure chest with terrible metadata [1,5].

So the appeal is obvious: if you can train on paired RNA-protein datasets and then apply that knowledge to RNA-only data, suddenly a huge archive of scRNA-seq experiments becomes more biologically useful.

The predator named Generalization

Now for the part where the documentary narrator lowers their voice.

These models do not roam freely across all habitats. Fisher et al. found that prediction accuracy depends heavily on how much the training data resembles the test data, especially in cell type composition [1]. In plain English, the model performs best when it has seen animals from the same biome before. Train it on one tissue, unleash it on another, and it may start behaving like a tourist trying to identify birds using a fish guide.
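
The fish-guide problem can be sketched with a toy simulation in Python. Everything here is an assumption made for illustration, not the paper's setup: two invented cell types, two invented transcripts (g1 and g2), and a rule that the protein follows g1 in "type A" cells and g2 in "type B" cells. The predictor is again a simple average of per-gene least-squares lines, and all function names are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def make_cells(n, frac_b, rng):
    """Toy cells with two transcripts; roughly frac_b of them are type B."""
    X, y = [], []
    for _ in range(n):
        g1, g2 = rng.uniform(0, 4), rng.uniform(0, 4)
        # Assumed biology: protein follows g2 in type B cells, g1 in type A.
        driver = g2 if rng.random() < frac_b else g1
        X.append([g1, g2])
        y.append(driver + rng.gauss(0, 0.5))
    return X, y

def fit_per_gene(X, y):
    """Fit one least-squares line per gene; returns [(slope, intercept), ...]."""
    n = len(y)
    my = sum(y) / n
    lines = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = sum(col) / n
        vx = sum((x - mx) ** 2 for x in col)
        cov = sum((x - mx) * (t - my) for x, t in zip(col, y))
        slope = cov / vx if vx > 0 else 0.0
        lines.append((slope, my - slope * mx))
    return lines

def predict(lines, X):
    """Average the per-gene predictions for each cell."""
    return [sum(a * x + b for (a, b), x in zip(lines, row)) / len(lines)
            for row in X]

rng = random.Random(1)
X_test, y_test = make_cells(400, 1.0, rng)  # the new habitat: all type B

scores = {}
for frac_b in (0.0, 0.5, 1.0):  # share of type B cells in the training set
    X_tr, y_tr = make_cells(400, frac_b, rng)
    lines = fit_per_gene(X_tr, y_tr)
    scores[frac_b] = pearson(predict(lines, X_test), y_test)
    print(f"training set {frac_b:.0%} type B -> test r = {scores[frac_b]:.2f}")
```

In this toy world, the more the training set looks like the test set, the better the predictions on type B cells. Real tissues are messier than a two-gene rule, so this only illustrates the direction of the effect, not its size.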

That limitation echoes larger benchmark studies. A 2024 Nature Methods benchmark across 47 multi-omics datasets found that methods differ substantially in performance, with totalVI and scArches often leading overall [6]. A 2025 Genome Biology benchmark showed that sample differences, tissues, protocols, and training set size all affect performance [2]. Translation: there is no universal emperor penguin of protein prediction yet.

This is also why the paper is interesting beyond its headline result. It does not just say "ML good." It says ML is useful, conditional, and deeply dependent on ecological context. Which, frankly, is the most biological sentence imaginable.

Why this matters outside the swamp of jargon

If these predictions keep improving and hold up across more settings, they could help researchers re-analyze older RNA-only datasets, sharpen immune cell annotation, and flag candidate biomarkers without rerunning expensive multimodal experiments. That is not a substitute for measuring proteins directly when the stakes are high. It is more like giving your field guide binoculars instead of asking it to guess from rustling noises alone.

The calm lesson from this paper is almost rude in its simplicity: single mRNAs are often bad stand-ins for proteins, but the broader transcriptome contains clues worth exploiting. Cells, like wildlife, make more sense when you watch the whole habitat instead of one footprint.

References

[1] Fisher J, Wood O, Bullers S, Murray L, Li L, Jackson-Wood MA, et al. Machine learning predictions surpass individual mRNAs as a proxy of single-cell protein expression. Genome Biology. 2026. DOI: 10.1186/s13059-026-04083-1. PubMed: 42021407
[2] Li CY, Hong YJ, Li B, Zhang XF, et al. Benchmarking single-cell cross-omics imputation methods for surface protein expression. Genome Biology. 2025;26:46. DOI: 10.1186/s13059-025-03514-9
[3] Hanhart D, Gossi F, Rapsomaniki MA, et al. ScLinear predicts protein abundance at single-cell resolution. Communications Biology. 2024;7:267. DOI: 10.1038/s42003-024-05958-4
[4] Chen R, Zhou J, Chen B. Imputing abundance of over 2500 surface proteins from single-cell transcriptomes with context-agnostic zero-shot deep ensembles. bioRxiv. 2024. DOI: 10.1101/2024.07.31.605432. PMCID: PMC11312525
[5] Chen Y, Fan X, Shi C, et al. A joint analysis of single cell transcriptomics and proteomics using transformer. npj Systems Biology and Applications. 2025;11:1. DOI: 10.1038/s41540-024-00484-9
[6] Hu Y, Wan S, Luo Y, et al. Benchmarking algorithms for single-cell multi-omics prediction and integration. Nature Methods. 2024;21:2182-2194. DOI: 10.1038/s41592-024-02429-w

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.