Somewhere in China, scientists pointed a very expensive machine at industrial wastewater and asked it a question it couldn't fully answer: "How much of each weird chemical is actually in here?"
The machine - a gas chromatography system coupled with electron ionization high-resolution mass spectrometry (GC-EI-HRMS, because scientists love acronyms) - is brilliant at identifying chemicals. It can spot 910 different compounds lurking in wastewater from an iron and steel plant. But here's the awkward part: knowing what's there and knowing how much is there are two completely different problems. And for most of those 910 chemicals, nobody has a reference standard sitting in a freezer somewhere to compare against.
This is where a team of researchers decided to get clever with machine learning.
The Reference Standard Problem (Or: Why Lab Work Is Expensive)
Here's how chemical quantification normally works: you get a pure sample of the chemical you're looking for, run it through your instrument, measure the signal, and boom - you've got a reference point. Find that same signal in your mystery sample, do some math, and you know the concentration.
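To make the "do some math" step concrete, here is a minimal sketch of single-point calibration in Python. The function names and every number in it are hypothetical, chosen only for illustration; real workflows usually fit a multi-point calibration curve and correct for things like recovery and matrix effects.

```python
def response_factor(standard_conc, standard_signal):
    """Signal produced per unit concentration of the pure reference standard."""
    if standard_conc <= 0:
        raise ValueError("standard concentration must be positive")
    return standard_signal / standard_conc


def quantify(sample_signal, rf):
    """Convert a sample's signal into a concentration using the response factor."""
    return sample_signal / rf


# Hypothetical numbers: a 50 ng/mL standard gives a peak area of 20000.
rf = response_factor(standard_conc=50.0, standard_signal=20000.0)

# The same compound in the mystery sample gives a peak area of 5000.
estimate = quantify(sample_signal=5000.0, rf=rf)
print(estimate)  # 12.5 (ng/mL, in this made-up example)
```

The whole calculation hinges on `rf`, and that is exactly the number you cannot measure when no reference standard exists.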
The problem? There are tens of thousands of chemicals in environmental samples, and buying reference standards for all of them would bankrupt most labs while also being physically impossible for many compounds that simply aren't commercially available. Scientists have been using workarounds - picking a "surrogate" chemical that seems close enough and hoping the math transfers. Spoiler: it often doesn't.
The research team, led by Yi Liu and colleagues, built what they call EIQuan: a stacked ensemble learning model that predicts chemical concentrations without needing a reference standard for every single compound [1].
Stacking Models Like Pancakes
Ensemble learning is essentially the "ask multiple experts and average their opinions" approach to machine learning. But stacked ensemble learning takes this further - it trains a second model to learn which of your base models to trust in which situations.
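For the curious, here is a toy sketch of the stacking idea in plain Python. It is not the EIQuan implementation: the three base learners (a mean predictor, a straight-line fit, and a nearest-neighbour lookup), the toy data, and the grid-searched blending weights are all assumptions made for illustration. A real meta-learner can be a full model whose trust in each base learner varies by input; this simplified one only learns a single fixed set of weights.

```python
from statistics import mean


def make_mean_model(train):
    y_bar = mean(y for _, y in train)
    return lambda x: y_bar


def make_line_model(train):
    xs = [x for x, _ in train]
    x_bar, y_bar = mean(xs), mean(y for _, y in train)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sum((x - x_bar) * (y - y_bar) for x, y in train) / sxx if sxx else 0.0
    return lambda x: a * (x - x_bar) + y_bar


def make_nearest_model(train):
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


BUILDERS = [make_mean_model, make_line_model, make_nearest_model]


def out_of_fold_predictions(data):
    """Leave-one-out: each point is predicted by models that never saw it."""
    rows = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        rows.append(([build(train)(x) for build in BUILDERS], y))
    return rows


def blend(weights, preds):
    return sum(w * p for w, p in zip(weights, preds))


def fit_meta_weights(rows, step=0.1):
    """Toy meta-learner: grid-search weights (summing to 1) with least squared error."""
    n = round(1 / step)
    best, best_err = None, float("inf")
    for i in range(n + 1):
        for j in range(n + 1 - i):
            w = (i / n, j / n, (n - i - j) / n)
            err = sum((blend(w, preds) - y) ** 2 for preds, y in rows)
            if err < best_err:
                best, best_err = w, err
    return best


# Hypothetical toy data: y is roughly 2x + 1.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1), (5, 11.0)]
weights = fit_meta_weights(out_of_fold_predictions(data))
models = [build(data) for build in BUILDERS]


def stacked_predict(x):
    return blend(weights, [m(x) for m in models])


print(weights, stacked_predict(6))
```

The key design choice is that the weights are fitted on out-of-fold predictions, so the second stage learns how each base learner behaves on data it has not seen rather than rewarding whichever one memorised the training set best.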
The EIQuan system uses three different base learners (think of them as three chemists with different specialties), then combines their predictions using bootstrap aggregation. Bootstrap aggregation, or "bagging," is a technique where you train multiple versions of a model on slightly different random samples of your data, drawn with replacement, then average their outputs to reduce variance [2]. It's like polling a crowd many times over, each time with a randomly drawn group in which some people get asked twice and others not at all, then averaging the polls so that no single odd answer dominates.
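Bagging itself fits in a few lines. The sketch below is a generic illustration in plain Python, not the paper's code: the base model (a one-variable least-squares line), the toy data, and the parameter names `n_models` and `seed` are all assumptions.

```python
import random
from statistics import mean


def fit_line(points):
    """Ordinary least-squares fit y = a*x + b for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    x_bar, y_bar = mean(xs), mean(y for _, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample (all x identical): fall back to a flat line
        return 0.0, y_bar
    a = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return a, y_bar - a * x_bar


def bagged_predict(points, x_new, n_models=200, seed=0):
    """Average the predictions of models trained on bootstrap resamples."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn WITH replacement,
        # so some points appear several times and others are left out.
        sample = rng.choices(points, k=len(points))
        a, b = fit_line(sample)
        preds.append(a * x_new + b)
    return mean(preds)


# Hypothetical toy data: y is roughly 2x + 1.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1), (5, 11.0)]
print(bagged_predict(data, x_new=6))
```

Each resampled model is a little different, and averaging them smooths out the quirks any single fit would pick up from a small dataset.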
The researchers trained their model on 278 reference standards spanning 19 major chemical categories. That's not a huge training set by modern ML standards, but it's remarkably comprehensive for analytical chemistry work.
The Results: Better Than Guessing, Way Better Than Surrogates
Here's where things get interesting. The stacked ensemble model achieved quantification errors within a 3.79-fold range for 95% of predictions. In plain English: if the model says there's 100 units of a chemical, the actual amount is probably somewhere between about 26 units (100 ÷ 3.79) and 379 units (100 × 3.79).
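The fold-error arithmetic is easy to check yourself. This small Python sketch uses hypothetical helper names (`fold_error`, `fold_interval`); it only reproduces the arithmetic above and makes no claim about how the paper computes its error metric.

```python
def fold_error(predicted, actual):
    """Symmetric fold error: always >= 1, the same for over- and under-estimates."""
    ratio = predicted / actual
    return max(ratio, 1.0 / ratio)


def fold_interval(predicted, fold):
    """Range of true values consistent with a prediction and a fold-error bound."""
    return predicted / fold, predicted * fold


low, high = fold_interval(100.0, 3.79)
print(round(low, 1), round(high, 1))  # 26.4 379.0

# Predicting 50 when the truth is 100 is a 2-fold error, and so is predicting 200.
print(fold_error(50.0, 100.0), fold_error(200.0, 100.0))  # 2.0 2.0
```

Fold error is multiplicative, which is why the range around 100 is lopsided: about 74 units of room below and 279 above.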
That might sound imprecise, but compared to the surrogate-based methods everyone's been using? It's a massive improvement. The surrogate approach can easily be off by 10-fold or more, especially for chemicals that behave differently than their chosen stand-in [3].
The team applied their model to wastewater samples from an iron and steel corporation, quantifying total chemical concentrations from 8.86 × 10³ upward across different sampling points (see the original paper [1] for the full range and units). This kind of comprehensive chemical profiling was previously impossible without either a warehouse full of reference standards or a willingness to accept huge uncertainty.
Why This Actually Matters
Industrial wastewater isn't just academically interesting - it's a public health concern. Knowing that "some chemicals" are present versus knowing their approximate concentrations makes the difference between identifying a minor concern and catching a serious contamination event.
Nontargeted analysis has been growing rapidly as a field, with researchers increasingly using it to discover unexpected pollutants that traditional targeted methods would miss entirely [4]. But discovery without quantification is only half the battle. You can't set regulatory limits or assess health risks if you can only say "yep, it's there" without any sense of how much.
The EIQuan approach could be particularly valuable for emerging contaminants - chemicals that are too new or too obscure to have established analytical methods. As new synthetic compounds enter the environment faster than labs can develop standards for them, ML-based quantification might be the only practical path forward [5].
The Bigger Picture
EIQuan is also a reminder that ML doesn't always need millions of training examples. With carefully curated domain knowledge (those 278 reference standards weren't randomly chosen), you can build useful predictive models even in data-scarce fields.
The wastewater flowing out of industrial facilities contains stories about what's happening inside - stories written in trace chemicals at parts-per-billion concentrations. Tools like EIQuan are teaching us to read those stories more completely.
References
[1] Liu Y, Dong Q, Hu J, Yao C, Sun W. EIQuan: A Stacked Ensemble Learning-Based Predictor for Quantification of Nontargeted Chemicals in Gas Chromatography Coupled with Electron Ionization High-Resolution Mass Spectrometry Analysis. Environmental Science & Technology. 2025. DOI: 10.1021/acs.est.6c00394
[2] Breiman L. Bagging predictors. Machine Learning. 1996;24(2):123-140. DOI: 10.1007/BF00058655
[3] Hollender J, Schymanski EL, Singer HP, Ferguson PL. Nontarget Screening with High Resolution Mass Spectrometry in the Environment: Ready to Go? Environmental Science & Technology. 2017;51(20):11505-11512. DOI: 10.1021/acs.est.7b02184
[4] Abrahamsson DP, et al. A Comprehensive Non-targeted Analysis Study of the Prenatal Exposome. Environmental Science & Technology. 2021;55(15):10542-10557. DOI: 10.1021/acs.est.1c01010
[5] Wang Z, et al. Toward a Global Understanding of Chemical Pollution: A First Comprehensive Analysis of National and Regional Chemical Inventories. Environmental Science & Technology. 2020;54(5):2575-2584. DOI: 10.1021/acs.est.9b06379
Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.