Star Trek promised a tricorder that could scan you, squint electronically, and report what was wrong before Dr. McCoy finished being annoyed. MLMarker is not that. Nobody is waving it over a patient in sickbay. But in the long campaign to make biology less of a foggy battlefield, it is the kind of reconnaissance tool commanders quietly love: it looks at a proteomics sample and asks, “Which tissue does this most resemble, and which proteins gave away its position?” (Claeys et al., 2026)
The Front Line: Proteins Are Messy Little Informants
Genes are the battle plans. Proteins are the troops actually moving around, changing uniforms, missing roll call, and occasionally setting fire to the supply tent.
Mass spectrometry-based proteomics lets researchers measure thousands of proteins in tissue, blood, cerebrospinal fluid, tumors, and other biological samples. That sounds orderly until you remember biology runs on wet chemistry, not spreadsheets. Some proteins are abundant, some vanish below detection limits, and datasets from different labs often arrive with the consistency of field reports written during artillery fire.
Traditionally, researchers compare groups: tumor versus normal, responder versus non-responder, disease versus control. Useful, yes. But it can miss subtler maneuvers. A sample may not merely be “different.” It may be drifting toward the protein signature of another tissue, like a battalion quietly changing flags overnight.
That is where MLMarker enters the theater.
MLMarker’s Maneuver: Ask “What Do You Resemble?”
MLMarker uses a Random Forest model trained on proteomics data from 34 healthy human tissues. A Random Forest is basically a council of decision trees, which sounds woodland and peaceful until you realize each tree is voting on your biological identity like a jury with caffeine access.
Instead of forcing a single hard label, MLMarker outputs continuous tissue similarity scores. A sample can look partly brain-like, partly pituitary-like, or vaguely like the model has received a suspiciously incomplete dossier. This matters because tissues share biology, tumors adapt to new environments, and biofluids carry protein traces from multiple places.
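For readers who want to see the council in session, here is a minimal sketch of how a forest can produce graded scores instead of one hard label. Everything in it is an assumption for illustration: the four hand-written "trees", the thresholds, and the protein names are not taken from MLMarker, whose trees are learned from data. It also uses plain vote counting, while many Random Forest implementations average per-tree probabilities instead.

```python
from collections import Counter

# Hypothetical toy "trees": each inspects one protein abundance and votes
# for a tissue. Real trees are learned from training data; these are
# hand-written stand-ins with made-up thresholds.
def tree_a(sample):
    return "brain" if sample.get("GFAP", 0.0) > 1.0 else "liver"

def tree_b(sample):
    return "pituitary" if sample.get("POMC", 0.0) > 1.0 else "brain"

def tree_c(sample):
    return "liver" if sample.get("ALB", 0.0) > 5.0 else "brain"

def tree_d(sample):
    return "pituitary" if sample.get("GH1", 0.0) > 1.0 else "liver"

FOREST = [tree_a, tree_b, tree_c, tree_d]

def tissue_similarity(sample, forest=FOREST):
    """Return the fraction of trees voting for each tissue."""
    votes = Counter(tree(sample) for tree in forest)
    return {tissue: count / len(forest) for tissue, count in votes.items()}

# An illustrative sample with made-up abundances.
sample = {"GFAP": 3.2, "POMC": 2.1, "ALB": 0.4, "GH1": 0.0}
scores = tissue_similarity(sample)
print(scores)  # brain 0.5, pituitary 0.25, liver 0.25
```

The point of the toy is the shape of the output: a sample does not have to be "brain" or "pituitary", it can be half one and a quarter the other.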
The model also uses SHAP explanations, a machine-learning method based on Shapley values from game theory. In plain English: SHAP tries to fairly assign credit or blame to each protein for a prediction. If MLMarker says a tumor looks brain-like, SHAP points to the proteins that helped push the model there. That is the difference between “the algorithm says so” and “these specific molecular fingerprints were found near the scene.”
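The credit-assignment idea can be sketched by brute force on a tiny example. This is not how SHAP is computed on a real forest, where much faster tree-specific algorithms are used, and the `brain_score` model, its weights, and the protein names below are all hypothetical. The sketch only shows the underlying Shapley rule: average each protein's marginal contribution over every order in which proteins could be revealed to the model.

```python
from itertools import permutations

def brain_score(x):
    """Hypothetical toy model (not MLMarker's forest) with one interaction."""
    score = 0.1
    score += 0.3 * min(x["GFAP"], 1.0)
    score += 0.2 * min(x["SNAP25"], 1.0)
    score += 0.2 * min(x["GFAP"], 1.0) * min(x["SNAP25"], 1.0)
    score -= 0.1 * min(x["ALB"], 1.0)
    return score

def shapley_values(model, sample, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    Features not yet 'revealed' keep their baseline value. This costs
    n! model calls, so it is only feasible for a handful of features.
    """
    names = list(sample)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = sample[name]
            updated = model(current)
            totals[name] += updated - previous  # marginal contribution
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

# Made-up sample and reference point for illustration.
sample = {"GFAP": 1.0, "SNAP25": 1.0, "ALB": 0.0}
baseline = {"GFAP": 0.0, "SNAP25": 0.0, "ALB": 1.0}
phi = shapley_values(brain_score, sample, baseline)
print({name: round(value, 3) for name, value in phi.items()})
```

In this toy, GFAP earns roughly 0.4, SNAP25 roughly 0.3, and ALB roughly 0.1, and the three together account for the full gap between the sample's score and the baseline's. The interaction term is split evenly between the two proteins that share it, which is the "fair" part of the bargain.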
Missing Proteins: The Supply-Line Problem
Proteomics has a chronic missing-data problem. Sometimes a protein is absent because biology made it absent. Sometimes the instrument missed it. Sometimes the sample was sparse, and everyone pretends not to be stressed.
MLMarker handles this with a penalty factor for missing proteins. Without that correction, low-coverage samples can get weirdly confident tissue assignments, the computational equivalent of declaring victory because half the map is blank. The paper reports that an empty sample could otherwise receive a nontrivial bone marrow similarity score, which is both funny and a little alarming, like a smoke detector praising the kitchen for ambiance.
The penalty factor helps keep sparse samples from marching into false confidence. That is especially relevant for biofluids and lower-input proteomics, where the protein signal can be thin but still biologically useful.
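The exact penalty formula is not given in this summary, so the following is only one plausible reading, stated as an assumption: evidence that comes from a protein being absent is down-weighted before the per-tissue scores are summed. The function name `penalized_scores`, the `absent_weight` parameter, and all numbers are hypothetical and chosen to mimic the empty-sample problem described above.

```python
def penalized_scores(contributions, detected, absent_weight=0.0):
    """Sum per-protein contributions per tissue, down-weighting absent proteins.

    contributions: {tissue: {protein: contribution}} for one sample.
    detected: set of proteins actually observed in that sample.
    absent_weight: 1.0 means no penalty; 0.0 ignores absence-driven evidence.
    This weighting scheme is an assumption, not the paper's stated formula.
    """
    scores = {}
    for tissue, per_protein in contributions.items():
        total = 0.0
        for protein, value in per_protein.items():
            weight = 1.0 if protein in detected else absent_weight
            total += weight * value
        scores[tissue] = total
    return scores

# Made-up contributions for an EMPTY sample: every value here reflects what
# the model infers from a protein being missing, not from anything measured.
contributions = {
    "bone marrow": {"GFAP": 0.08, "ALB": 0.10, "POMC": 0.05},
    "brain": {"GFAP": -0.20, "ALB": 0.02, "POMC": 0.01},
}
empty_sample = set()

unpenalized = penalized_scores(contributions, empty_sample, absent_weight=1.0)
penalized = penalized_scores(contributions, empty_sample, absent_weight=0.0)
print(unpenalized)
print(penalized)
```

Without the penalty, the blank map still adds up to a bone marrow score. With it, a sample that contains nothing is credited with nothing, which is the behavior one would want from a scout reporting on empty terrain.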
Three Battles, Three Lessons
The authors tested MLMarker across three public datasets.
First, in cerebral melanoma metastases, MLMarker found brain-like proteomic signatures that standard expression clustering did not separate cleanly. Poor responders showed higher brain similarity than good responders. Translation from battlefield: the tumors that blended into brain territory may have gained a tactical advantage.
Second, in a large pan-cancer FFPE tissue dataset, MLMarker showed strong tissue prediction performance and highlighted cases where tissue identity was biologically complicated. Fixed, archived cancer tissue is not pristine parade-ground material, so getting useful signal there is valuable.
Third, in biofluids, MLMarker traced protein signals back to brain and pituitary origins. That is the reconnaissance dream: infer where biological signals may be coming from without needing every sample to be a perfect chunk of tissue.
Why This Matters, If the Lines Hold
If MLMarker’s results hold up across more cohorts, labs, instruments, and disease settings, it could help researchers generate sharper hypotheses from messy proteomics data. Not diagnose disease by magic. Not replace pathologists. Not become the tricorder by Q3, despite what an overcaffeinated pitch deck might imply.
Its more realistic value is strategic: flag unexpected tissue resemblance, expose which proteins drove that signal, and help researchers decide where to aim the next experiment. In cancer metastasis, that could reveal how tumors adapt to hostile terrain. In biofluid studies, it could help trace molecular signals back to likely tissue origins. In single-cell or sparse proteomics, it could provide an extra layer of interpretation beyond clustering, which sometimes behaves like a seating chart made by a sleep-deprived intern.
The limits are clear. MLMarker depends on the tissues represented in its healthy reference atlas. Public proteomics data still suffers from uneven metadata, which is like trying to coordinate an army when half the units forgot to label their maps. Missingness remains hard. And tissue similarity is not the same thing as clinical truth.
Still, this is a useful advance: interpretable, practical, and built for the battlefield proteomics actually occupies, not the tidy one we wish existed.
References
- Claeys T, van Puyenbroeck S, Gevaert K, Martens L. “MLMarker: a machine learning framework for tissue inference and biomarker discovery.” Genome Biology 27, 207 (2026). DOI: 10.1186/s13059-026-04125-8
- Claeys T, Menu M, Bouwmeester R, Gevaert K, Martens L. “Machine Learning on Large-Scale Proteomics Data Identifies Tissue and Cell-Type Specific Proteins.” Journal of Proteome Research 22(4), 1181-1192 (2023). DOI: 10.1021/acs.jproteome.2c00644
- Peng H, Wang H, Kong W, Li J, Goh WWB. “Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference.” Nature Communications 15, 3922 (2024). DOI: 10.1038/s41467-024-47899-w
- Tüshaus J et al. “Towards routine proteome profiling of FFPE tissue: insights from a 1,220-case pan-cancer study.” The EMBO Journal 44, 304-329 (2025). DOI: 10.1038/s44318-024-00289-w
- Climente-González H et al. “Interpretable machine learning leverages proteomics to improve cardiovascular disease risk prediction and biomarker identification.” Communications Medicine 5, 170 (2025). DOI: 10.1038/s43856-025-00872-0
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.