AIb2.io - AI Research Decoded

Data Biases in Genomics: When Your DNA Database Plays Favorites

A genetic counselor opens a patient's report on a Tuesday morning. The variant flagged as "uncertain significance" stares back from the screen - not because science can't work out what it does, but because the databases were mostly built by studying people who don't look like this patient. Somewhere, a machine learning model trained on that same lopsided data just confidently scored the variant as benign. The counselor sighs. It's just another Tuesday in genomics.

Lusine Nazaretyan and Martin Kircher's new review in Trends in Genetics (Nazaretyan & Kircher, 2026) pulls back the curtain on a problem that's been quietly compounding for years: the data feeding our genomic ML models is riddled with biases, and those biases aren't just academic annoyances - they're shaping real clinical decisions for real people.

The 94% Problem

Here's a number that should make you uncomfortable: as of late 2024, 94.48% of all genome-wide association study participants are of European ancestry (Cell Genomics, 2024). Let that sink in. We've built the genomic equivalent of a restaurant that only tested its recipes on one table and then opened to the entire city.


The databases everyone relies on - ClinVar for variant classification, gnomAD for population frequencies - reflect this imbalance. ClinVar's biggest submitters sit in majority-European countries, meaning European-common variants get reviewed by multiple labs while variants common in African or South Asian populations languish in "uncertain significance" limbo. One study found that 40% of high-frequency pathogenic/likely pathogenic variants in ClinVar probably deserve a downgrade (Sharo et al., Genome Medicine, 2023). The same study reported that African-ancestry individuals faced a significantly higher chance of being incorrectly flagged as affected when HGMD variants were used.
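To make the "high-frequency pathogenic variant" idea concrete, here is a minimal Python sketch of a sanity check: flag any variant labeled pathogenic or likely pathogenic that is common in at least one population. This is not the method used in the cited study. The record layout, the field names, the population labels, and the toy frequencies are all invented for illustration, and the 5% cutoff is an assumption you would tune per disease.

```python
# Hypothetical record layout: each variant carries a classification and
# a per-population allele frequency table. All values are made up.
PATHOGENIC_LABELS = {"pathogenic", "likely_pathogenic"}

def flag_common_pathogenic(variants, max_af=0.05):
    """Return IDs of P/LP variants whose allele frequency exceeds
    max_af in at least one population."""
    flagged = []
    for variant in variants:
        if variant["classification"] not in PATHOGENIC_LABELS:
            continue
        # Use the highest per-population frequency, not a pooled average.
        highest_af = max(variant["af_by_population"].values(), default=0.0)
        if highest_af > max_af:
            flagged.append(variant["id"])
    return flagged

toy_variants = [
    {"id": "var1", "classification": "pathogenic",
     "af_by_population": {"pop_A": 0.0001, "pop_B": 0.08}},
    {"id": "var2", "classification": "pathogenic",
     "af_by_population": {"pop_A": 0.0002, "pop_B": 0.0001}},
    {"id": "var3", "classification": "benign",
     "af_by_population": {"pop_A": 0.30, "pop_B": 0.25}},
]

suspicious = flag_common_pathogenic(toy_variants)
print(suspicious)
```

Checking the maximum across populations is the point of the sketch: in the toy data, var1 looks rare in pop_A and only stands out once pop_B is examined, which is exactly the kind of signal a database dominated by one ancestry can miss.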

Garbage In, Confidently Wrong Out

Nazaretyan and Kircher organize genomic biases into categories that map neatly onto the ML world: selection bias (who gets sequenced), knowledge bias (what questions researchers bother asking), and annotation bias (whose variants get carefully labeled). It's bias all the way down.

The ML angle is where things get spicy. Train a variant pathogenicity predictor on ClinVar data, and it learns the quirks of ClinVar - including which populations got the most attention. A recent bioRxiv preprint showed that genomic heterogeneity actually inflates the apparent performance of variant pathogenicity predictors (bioRxiv, 2025). Your model looks great on the benchmark. It just happens to work significantly better for some populations than others. The overall accuracy metric is doing the statistical equivalent of averaging a mansion and a studio apartment and calling it "nice housing."
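The mansion-and-studio problem is easy to reproduce. The sketch below is plain Python with toy labels invented for illustration (1 = pathogenic, 0 = benign) and two hypothetical groups, "A" and "B". It computes overall accuracy next to per-group accuracy; a real evaluation would more likely stratify AUROC or calibration, but the arithmetic of the failure is the same.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall_accuracy, {group: accuracy}) for parallel lists."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Eight variants from the well-studied group, two from the other.
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
groups = ["A"] * 8 + ["B"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")
```

In this toy setup the headline number is 80%, yet every prediction for group B is wrong. Reporting the per-group breakdown alongside the aggregate is the cheapest possible guard against that kind of surprise.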

And the data circularity problem? Many benchmarking studies train and test on the same ClinVar data, which is like studying for an exam using the answer key and then bragging about your score (bioRxiv, 2025).
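A first-pass circularity check is just set arithmetic: how many test variants also appear in the training set? The sketch below assumes variants are identified by simple strings; the IDs and the helper names are hypothetical. It only catches exact duplicates, so subtler leakage (for example, a tool whose own scores fed into the labels it is later tested against) needs more than this.

```python
def split_overlap(train_ids, test_ids):
    """Return the variants shared by both splits and the fraction of
    unique test variants that also appear in training."""
    train_set = set(train_ids)
    test_set = set(test_ids)
    shared = train_set & test_set
    fraction = len(shared) / len(test_set) if test_set else 0.0
    return shared, fraction

def drop_leaked(test_ids, train_ids):
    """Keep only the test variants that never appeared in training."""
    train_set = set(train_ids)
    return [variant for variant in test_ids if variant not in train_set]

# Invented variant identifiers, for illustration only.
train = ["chr1:100:A>G", "chr2:200:C>T", "chr3:300:G>A"]
test = ["chr2:200:C>T", "chr4:400:T>C"]

shared, fraction = split_overlap(train, test)
clean_test = drop_leaked(test, train)
print(sorted(shared), fraction)
print(clean_test)
```

Here half of the toy test set was already seen during training, so any score computed on it would partly measure memorization rather than prediction.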

gnomAD Gets a Tune-Up (But We're Not Done)

Fighting Back With Better Tools

The picture isn't all doom. PhyloFrame, a method published in Nature Communications, integrates functional interaction networks with population genomics data to actively correct ancestral bias in cancer prediction models, improving accuracy across all ancestries tested (Graim et al., 2025). Other groups are using population-conditional resampling and gradient boosting to bring prediction accuracy for underrepresented groups up to par with majority populations (bioRxiv/PSB, 2023).
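For a feel of what group-aware resampling means in practice, here is a minimal Python sketch that oversamples each group, with replacement, up to the size of the largest one. This is generic group-balanced oversampling, not the procedure from the cited work and not PhyloFrame; the function name, the record layout, and the toy cohort are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def balance_by_group(samples, group_of, seed=0):
    """Oversample every group (with replacement) to the size of the
    largest group and return the shuffled, balanced list."""
    if not samples:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_group = defaultdict(list)
    for sample in samples:
        by_group[group_of(sample)].append(sample)
    target = max(len(members) for members in by_group.values())
    balanced = []
    for group in sorted(by_group):
        members = by_group[group]
        balanced.extend(members)
        # Draw extra copies at random to fill the gap up to the target.
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced

# Toy cohort: six samples from one group, two and one from the others.
cohort = (
    [{"id": f"eur{i}", "ancestry": "EUR"} for i in range(6)]
    + [{"id": f"afr{i}", "ancestry": "AFR"} for i in range(2)]
    + [{"id": "sas0", "ancestry": "SAS"}]
)

balanced = balance_by_group(cohort, group_of=lambda s: s["ancestry"])
counts = Counter(s["ancestry"] for s in balanced)
print(dict(counts))
```

Oversampling only reweights what is already there. Duplicating a single sample six times does not add new genetic variation, which is why resampling complements, rather than replaces, collecting more diverse data.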

If you're the kind of person who likes to visually map out how these different bias types cascade through a research pipeline, tools like mapb2.io can help you sketch out the connections - because honestly, the web of selection bias feeding into annotation bias feeding into model bias feeding into clinical decisions is the kind of thing that benefits from a diagram.

The Bottom Line

Nazaretyan and Kircher's review is a well-timed reality check. As genomic foundation models grow larger and more capable - one recent model achieved 0.997 AUROC on 839,000 ClinVar variants (bioRxiv, 2026) - the temptation to trust the outputs grows proportionally. But a model is only as fair as its training data. And right now, genomic training data has a very specific accent.

The fix isn't just "collect more diverse data" (though yes, please do that). It's about understanding where biases enter, how they propagate through ML pipelines, and building tools that explicitly account for them. Because a precision medicine that's only precise for some people isn't precision medicine - it's a lottery with better marketing.

References

  1. Nazaretyan L, Kircher M. "Data biases in genomics." Trends in Genetics. 2026. DOI: 10.1016/j.tig.2026.02.007. PMID: 41945017.
  2. Graim K et al. "Equitable machine learning counteracts ancestral bias in precision medicine." Nature Communications 16, 2025. DOI: 10.1038/s41467-025-57216-8. PMID: 40064867.
  3. Sharo AG et al. "ClinVar and HGMD genomic variant classification accuracy has improved over time." Genome Medicine 15, 51 (2023). DOI: 10.1186/s13073-023-01199-y. PMID: 37443081.
  4. "Bridging genomics' greatest challenge: The diversity gap." Cell Genomics, 2024. DOI: 10.1016/j.xgen.2024.100353.
  5. "Improved allele frequencies in gnomAD through local ancestry inference." Nature Communications 16, 8734 (2025). DOI: 10.1038/s41467-025-63340-2. PMID: 40661606.
  6. Schmitz MJ et al. "Leveraging diverse genomic data to guide equitable carrier screening." AJHG 112(1):181-195, 2025. DOI: 10.1016/j.ajhg.2024.11.004. PMID: 39615480.
  7. "Genomic heterogeneity inflates the performance of variant pathogenicity predictions." bioRxiv, 2025. DOI: 10.1101/2025.09.05.674459.
  8. "Benchmarking of variant pathogenicity prediction methods using a population genetics approach." bioRxiv, 2025. DOI: 10.1101/2025.03.16.643565.
  9. "EVEE: Interpretable variant effect prediction from genomic foundation model embeddings." bioRxiv, 2026. DOI: 10.64898/2026.04.10.717844.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.