Why Machine Learning Keeps Flunking the Molecular Crime Scene

Google, OpenAI, and Meta bet on the big-AI recipe: feed a model absurd amounts of data, let transformers chew through patterns, then wait for competence to emerge. Khoo and Barzilay’s new paper does something less glamorous and more useful. It checks whether the machine actually learned chemistry, or just memorized the lighting in the interrogation room [1].

The case file is small-molecule mass spectrometry. In LC-MS/MS, scientists smash molecules into fragments, measure the resulting peaks, and try to reconstruct the original chemical structure. It is forensic chemistry with very expensive confetti. The dream is obvious: train machine learning models to read these spectra and identify unknown metabolites, pollutants, drug-like compounds, or biological signals hiding in messy samples.

The headline promise has been around for years. Models such as MIST predict molecular fingerprints from tandem mass spectra, while newer benchmarks like MassSpecGym try to standardize the contest so everyone stops comparing trophies from different sports [2,3]. DreaMS goes bigger, pretraining a transformer on hundreds of millions of spectra because apparently "more data" is still AI’s favorite emotional support blanket [4].

But according to Khoo and Barzilay, the numbers tell a different story.

The Baseline Wearing a Fake Mustache

The uncomfortable finding: modern ML models often fail to beat simple nearest-neighbor matching. That baseline does not "understand" molecules. It just asks: have I seen a spectrum that looks like this before, with the same molecular formula? If yes, copy the neighbor’s answer.
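
To make that baseline concrete, here is a minimal sketch of formula-filtered nearest-neighbor lookup. It is an illustration under assumptions, not the paper's implementation: the 1 Da binning, the cosine score, the function names, and the toy library with placeholder structures are all invented for this example.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Collapse (m/z, intensity) peaks into a sparse {bin_index: intensity} vector."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbor_predict(query_peaks, query_formula, library):
    """Copy the structure of the most similar library spectrum with the same formula.

    Returns None when no library entry shares the query's molecular formula.
    """
    query_vec = bin_spectrum(query_peaks)
    best_score, best_structure = -1.0, None
    for entry in library:
        if entry["formula"] != query_formula:
            continue  # the baseline only compares within one molecular formula
        score = cosine(query_vec, bin_spectrum(entry["peaks"]))
        if score > best_score:
            best_score, best_structure = score, entry["structure"]
    return best_structure

# Toy library: peaks and structure labels are made up for illustration.
library = [
    {"formula": "C6H12O6", "peaks": [(85.0, 1.0), (127.0, 0.4)], "structure": "candidate_A"},
    {"formula": "C6H12O6", "peaks": [(73.0, 1.0), (163.0, 0.8)], "structure": "candidate_B"},
    {"formula": "C7H8", "peaks": [(85.0, 1.0), (127.0, 0.4)], "structure": "candidate_C"},
]

prediction = nearest_neighbor_predict([(85.1, 0.9), (127.2, 0.5)], "C6H12O6", library)
print(prediction)
```

Nothing in this lookup models fragmentation. It only remembers, which is exactly why losing to it stings.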

That sounds embarrassingly primitive, like solving a murder by pointing at whoever owns the same coat. Yet in this setting, it can hold its own.

Khoo and Barzilay benchmarked multiple models across datasets including NPLIB1, MassSpecGym, and NIST2023. They measured predictions with Morgan fingerprint similarity, a way of asking whether the predicted molecule shares structural features with the real one. On random splits, MIST looked pretty strong: 0.547 on NPLIB1, 0.674 on MassSpecGym, and 0.622 on NIST2023. But under harder Learning-to-Split tests, performance dropped to 0.231, 0.247, and 0.214 respectively [1].
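
For readers who want to see what a fingerprint similarity score looks like, here is a small sketch. It assumes a Tanimoto-style overlap, a common choice for Morgan fingerprints; this post does not pin down the exact metric the paper used. In practice the bit sets would come from a cheminformatics toolkit, while the ones below are hand-written placeholders.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define the similarity as zero
    return len(a & b) / len(union)

# Hypothetical on-bits for a predicted molecule and the true molecule.
predicted_bits = {3, 17, 42, 101, 256}
true_bits = {3, 17, 42, 300}

score = tanimoto(predicted_bits, true_bits)  # 3 shared bits out of 6 distinct bits
print(score)
```

A score of 1.0 means identical fingerprints and 0.0 means no shared structural features, so the reported values all sit somewhere between "same molecule" and "total stranger."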

That is not a tiny dent. That is the model walking confidently into a glass door.
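
The size of the dent is easy to work out from the numbers quoted above. This short calculation uses only the MIST scores reported in this post; the variable and function names are just for illustration.

```python
# MIST fingerprint similarity scores quoted above, by dataset.
random_split = {"NPLIB1": 0.547, "MassSpecGym": 0.674, "NIST2023": 0.622}
hard_split = {"NPLIB1": 0.231, "MassSpecGym": 0.247, "NIST2023": 0.214}

def relative_drop(before, after):
    """Fraction of the original score lost when moving to the harder split."""
    return (before - after) / before

drops = {name: relative_drop(random_split[name], hard_split[name]) for name in random_split}

for name, drop in drops.items():
    print(f"{name}: {drop:.1%}")
```

That works out to roughly 58% on NPLIB1, 63% on MassSpecGym, and 66% on NIST2023: the model loses well over half its score on every dataset.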

The Lab Conditions Did It

When pressed, the culprit looks less like "AI is bad at chemistry" and more like "AI is bad at chemistry when the lab changes the knobs."

Tandem mass spectrometry does not produce one universal fingerprint from heaven. A spectrum depends on collision energy, instrument type, adducts, ionization behavior, and other experimental details. The same molecule can look different under different settings, while different molecules can look annoyingly similar. Chemistry, as usual, refuses to be tidy.

The authors used a Learning-to-Split method to create hard train-test splits. Those difficult splits exposed distribution shifts in collision energy, detector instrument, and adduct type. In plain English: the model trained in one neighborhood of lab conditions and got tested in another. Suddenly, the shiny neural network looked less like Sherlock Holmes and more like someone who only studied the answer key from last semester.
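
Learning-to-Split searches for hard splits automatically, and reproducing it is beyond a blog snippet. As a much simpler stand-in, the sketch below holds out one experimental condition by hand so that the test set only contains settings the model never trained on. The record fields and values are hypothetical.

```python
def split_by_condition(records, key, held_out_values):
    """Send every record whose `key` value is in `held_out_values` to the test set."""
    held_out = set(held_out_values)
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test

# Made-up spectrum metadata; real datasets carry many more fields.
records = [
    {"id": 1, "adduct": "[M+H]+", "collision_energy": 20},
    {"id": 2, "adduct": "[M+H]+", "collision_energy": 40},
    {"id": 3, "adduct": "[M+Na]+", "collision_energy": 20},
    {"id": 4, "adduct": "[M+Na]+", "collision_energy": 40},
]

# Train on protonated adducts only, then test on sodium adducts.
train, test = split_by_condition(records, "adduct", ["[M+Na]+"])
print([r["id"] for r in train], [r["id"] for r in test])
```

A random split would scatter both adducts across train and test, which is the kind of comfort zone the harder splits take away.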

This matters because real metabolomics is full of shifting conditions. Instruments vary. Protocols vary. Databases are incomplete. If a system only works when the experiment politely resembles the training set, it is not ready to be the detective. It is the intern who alphabetized the evidence.

The Missing Loudness Clue

The second clue is peak intensity. In MS/MS, each spectrum has peak positions, measured as mass-to-charge ratios, and peak intensities, which tell you how abundant each fragment is. The positions say which fragments appeared. The intensities whisper how readily those fragments formed under the experimental conditions.

Khoo and Barzilay found that models, including MIST, often leaned heavily on peak positions while failing to use intensity well. In one example, MIST assigned two spectra from distinct molecules an embedding similarity of 0.966 even though their full spectral similarity was only 0.652. When the researchers ignored intensity and treated all peaks equally, the similarity jumped to 0.945, almost matching MIST’s view [1].
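
The effect of ignoring intensity is easy to reproduce on toy data. The sketch below compares two made-up spectra with identical peak positions but very different intensities, first with a plain cosine similarity and then after flattening every peak to the same height. The spectra and the choice of cosine similarity are assumptions for illustration and do not reproduce the paper's numbers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two spectra stored as {m/z bin: intensity} dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def binarize(spectrum):
    """Keep peak positions but treat every nonzero peak as equally intense."""
    return {mz: 1.0 for mz, intensity in spectrum.items() if intensity > 0}

# Same three peak positions, but a different fragment dominates in each spectrum.
spectrum_a = {100: 1.0, 150: 0.1, 200: 0.1}
spectrum_b = {100: 0.1, 150: 1.0, 200: 0.1}

with_intensity = cosine(spectrum_a, spectrum_b)                        # about 0.21
positions_only = cosine(binarize(spectrum_a), binarize(spectrum_b))    # about 1.0

print(f"with intensity: {with_intensity:.2f}")
print(f"positions only: {positions_only:.2f}")
```

A model that behaves like the positions-only score would call these two spectra near twins, even though the intensities say otherwise.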

That is a red flag in a lab coat. The model was acting like all fragments shouted at the same volume. Imagine identifying a song from the notes but ignoring rhythm and volume. You might get "Happy Birthday" confused with a smoke alarm if both hit enough matching frequencies.

The authors also found that these intensity-blind confusions contributed to poor predictions: 7.75% of MIST’s poor predictions on NPLIB1, and higher fractions for some other architectures [1]. Not the whole story, but enough to make the plot thicken.

Unknown Molecules Are Still the Boss Fight

The third problem is vocabulary. Many models annotate peaks with possible chemical formulas. But what happens when test spectra contain fragment formulas the model did not see during training? The paper calls this out-of-vocabulary behavior, which is AI-speak for "the model opened the menu and the dish was not listed."
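
One simple way to quantify this is an out-of-vocabulary rate: the share of fragment formulas in the test set that never appeared in training. The sketch below counts per annotation, which is an assumption; this post does not say how the paper tallies it. The formulas are placeholders.

```python
def oov_rate(train_formulas, test_formulas):
    """Fraction of test fragment-formula annotations never seen in training."""
    vocabulary = set(train_formulas)
    test_formulas = list(test_formulas)
    if not test_formulas:
        return 0.0
    unseen = sum(1 for formula in test_formulas if formula not in vocabulary)
    return unseen / len(test_formulas)

# Made-up fragment formulas standing in for annotated peaks.
train_formulas = ["C2H4O", "C3H6", "CH3N", "C2H4O"]
test_formulas = ["C2H4O", "C4H8O", "CH3N", "C5H5N"]

rate = oov_rate(train_formulas, test_formulas)  # 2 of the 4 test annotations are unseen
print(rate)
```

Comparing this rate between a random split and a hard split is a quick sanity check on how much new fragment chemistry a test set really contains.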

The hard splits had higher rates of these unseen formulas. That suggests models struggle not only with new molecules, but with new fragment chemistry. This fits a wider concern in small-molecule ML: coverage bias. Kretschmer and colleagues recently argued that chemical datasets often make models look better than they are because training and test molecules share more coverage than real discovery settings would allow [5].

MassSpecGym was built partly to fix this mess by giving the field shared datasets and evaluation protocols [3]. That is good. But Khoo and Barzilay’s audit says benchmarks must also punish shortcut learning, experimental-condition brittleness, and sneaky dataset comfort zones.

So What Should Change?

The paper does not say "throw out machine learning." It says stop grading it on the easy homework.

Better systems may need richer physical modeling of fragmentation, explicit handling of instrument metadata, stronger use of intensity information, and evaluation splits that mimic real deployment. Hybrid methods may win here: chemistry-aware candidate generation, spectral simulation, formula constraints, and ML ranking working together instead of pretending a transformer can absorb the periodic table by osmosis.

If reproduced and expanded, this work could make metabolomics tools more honest. That matters for drug discovery, disease biomarker hunting, environmental monitoring, and any field where unknown small molecules are the suspects. The big impact is not a magic model that names every compound. It is a better lie detector for models that claim they can.

And honestly, that may be the more valuable invention.

References

[1] Khoo, L. M. S. & Barzilay, R. "Why machine learning fails at mass spectrometry for small molecules." Nature Metabolism (2026). DOI: 10.1038/s42255-026-01544-6. PMID: 42277271.
[2] Goldman, S. et al. "Annotating metabolite mass spectra with domain-inspired chemical formula transformers." Nature Machine Intelligence 5, 965-979 (2023). DOI: 10.1038/s42256-023-00708-3.
[3] Bushuiev, R. et al. "MassSpecGym: a benchmark for the discovery and identification of molecules." NeurIPS Datasets and Benchmarks (2024). arXiv: 2410.23326.
[4] Bushuiev, R. et al. "Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS." Nature Biotechnology (2025). DOI: 10.1038/s41587-025-02663-3.
[5] Kretschmer, F. et al. "Coverage bias in small molecule machine learning." Nature Communications 16, 554 (2025). DOI: 10.1038/s41467-024-55462-w. PMCID: PMC11718084.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

AIb2.io - AI Research Decoded