When AI Learns to Speak "Gene" and Designs Drugs From Scratch

Most drug discovery starts with a target - a protein that misbehaves, a receptor that needs blocking, a lock that needs a very specific molecular key. But what if you skipped all that and just told the computer, "Hey, see how this disease messes up thousands of genes at once? Find me a molecule that un-messes them." That's the pitch behind GPS, a new deep learning platform published in Cell in March 2026, and the results are wild enough to make medicinal chemists do a double take.

The Mixtape Theory of Disease

Here's the core idea, and it's been floating around drug discovery since the Broad Institute launched the Connectivity Map back in 2006: every disease leaves a fingerprint on your cells' gene expression - thousands of genes cranked up or dialed down in a specific pattern. If you could find a drug that produces the opposite pattern, you'd theoretically push those cells back toward healthy. Think of it like a DJ playing a song that's the exact inverse waveform of the noise at a construction site. Noise-canceling headphones for your transcriptome.
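To make the "opposite pattern" idea concrete, here is a minimal sketch of one way to score reversal: take the negative Pearson correlation between a disease signature and a drug signature. The function names and the five-gene toy numbers are made up for illustration, and Connectivity Map style tools use their own scoring schemes, so treat this as a stand-in rather than the method from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("signatures must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant signature has no defined correlation")
    return cov / sqrt(var_x * var_y)


def reversal_score(disease_sig, drug_sig):
    """Hypothetical score: higher means the drug opposes the disease pattern."""
    return -pearson(disease_sig, drug_sig)


# Toy log-fold-change values for five genes (invented numbers).
disease = [2.0, 1.5, -1.0, -2.5, 0.5]
drug_a = [-1.8, -1.2, 0.9, 2.0, -0.4]  # roughly the mirror image
drug_b = [1.0, 2.0, -0.5, -1.0, 0.0]   # pushes the same way as the disease

print(reversal_score(disease, drug_a))  # close to +1: strong reversal
print(reversal_score(disease, drug_b))  # negative: mimics the disease
```

A score near +1 means the compound flips most of what the disease changed, while a negative score means it pushes the cells further in the wrong direction.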

The problem? We only have experimental gene expression data for around 30,000 compounds, each tested in a limited set of cell lines (thanks to the LINCS L1000 project). Meanwhile, commercial chemical libraries contain tens of millions of molecules that have never been profiled. GPS - short for Gene expression Profile predictor on chemical Structures - was built to close that gap (Xing et al., 2026).

Teaching a Neural Network to Read Molecular Handwriting

The team at Michigan State University and Stanford trained GPS to predict what a compound would do to the expression of the 978 landmark genes measured by L1000, using nothing but the compound's chemical structure as input. No wet lab required. Just SMILES strings in, transcriptomic predictions out.

But biological data is famously messy. The L1000 dataset is full of weak signals, batch effects, and measurements that politely disagree with each other. So the team borrowed a trick from education theory: curriculum learning. Just as a teacher wouldn't start a first-grader on calculus, they fed the model the cleanest, most reliable expression signatures first, then gradually introduced noisier data. The model, which they called RCL (Robust Curriculum Learner), effectively learned to separate real signal from experimental static - the computational equivalent of training your ears to pick out a conversation at a loud party.
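The "clean data first, noisy data later" schedule can be sketched in a few lines. This is a generic curriculum, not the RCL training procedure itself: the per-signature reliability score, the stage fractions, and the function name are all assumptions made for the example.

```python
def curriculum_stages(examples, reliability, fractions=(0.25, 0.5, 1.0)):
    """Yield growing training subsets, most reliable examples first.

    `reliability` holds one hypothetical quality score per example
    (higher = cleaner). Each stage keeps the earlier examples and adds
    the next-noisiest slice.
    """
    if len(examples) != len(reliability):
        raise ValueError("need exactly one reliability score per example")
    # Indices of the examples, ordered from cleanest to noisiest.
    order = sorted(range(len(examples)), key=lambda i: reliability[i], reverse=True)
    for frac in fractions:
        cutoff = max(1, round(frac * len(order)))
        yield [examples[i] for i in order[:cutoff]]


# Toy data: four expression signatures with invented quality scores.
signatures = ["sig_a", "sig_b", "sig_c", "sig_d"]
reliability = [0.9, 0.2, 0.6, 0.4]

stages = list(curriculum_stages(signatures, reliability))
for number, subset in enumerate(stages, start=1):
    # A real pipeline would run some training epochs on `subset` here.
    print(f"stage {number}: {subset}")
```

In practice each stage would feed a training loop, so the model sees the trustworthy signatures before the noisy ones get a vote.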

Once trained, GPS could predict transcriptomic profiles for millions of untested compounds, then score each one for how well it "reverses" a disease signature. And for lead optimization, they bolted on MolSearch, a tree-search algorithm that tweaks molecular structures to maximize disease reversal while keeping drug-like properties intact. It's basically AlphaGo, but instead of finding the best move on a Go board, it's finding the best chemical modification to flip a disease transcriptome.
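For a feel of what "tweak the structure, keep it drug-like, maximize reversal" looks like as a search problem, here is a toy beam search. It is a deliberately simplified stand-in and not MolSearch: the modification names, their predicted effects on three genes, and the "cost" budget that plays the role of a drug-likeness filter are all invented for illustration.

```python
def beam_search(start, expand, score, is_allowed, depth=3, beam_width=2):
    """Generic beam search: keep the best `beam_width` candidates per level."""
    best, frontier = start, [start]
    for _ in range(depth):
        # A set removes duplicates reached through different edit orders.
        children = {c for s in frontier for c in expand(s) if is_allowed(c)}
        if not children:
            break
        frontier = sorted(children, key=score, reverse=True)[:beam_width]
        if score(frontier[0]) > score(best):
            best = frontier[0]
    return best


# Hypothetical edits: predicted change on three genes, plus a "cost"
# standing in for the drug-likeness penalty of each edit.
MODS = {
    "add_F": ([-0.5, 0.2, 0.0], 1),
    "add_OH": ([-0.2, 0.6, -0.3], 1),
    "add_ring": ([-0.8, 0.8, -0.4], 3),
}
DISEASE = [1.0, -1.0, 0.5]  # toy disease signature
MAX_COST = 4                # toy drug-likeness budget


def expand(state):
    """Apply every unused edit; a state is a sorted tuple of edit names."""
    return [tuple(sorted(state + (m,))) for m in MODS if m not in state]


def score(state):
    """Reversal = negative dot product of disease and predicted signature."""
    predicted = [sum(MODS[m][0][g] for m in state) for g in range(len(DISEASE))]
    return -sum(d * p for d, p in zip(DISEASE, predicted))


def is_allowed(state):
    return sum(MODS[m][1] for m in state) <= MAX_COST


best = beam_search((), expand, score, is_allowed)
print(best, score(best))
```

The search settles on the ring plus the hydroxyl group: adding all three edits would reverse the signature even more, but it blows the cost budget, which is the trade-off the drug-likeness constraint is there to enforce.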

Two Diseases, Real Results

The team didn't just benchmark on academic datasets and call it a day. They went after two diseases that desperately need new options.

Hepatocellular carcinoma (HCC) - the most common form of liver cancer and the third leading cause of cancer death worldwide - got the first treatment. GPS screened compounds, identified hits, and the team synthesized and tested two entirely new chemical series. Both showed favorable selectivity (killing cancer cells while sparing normal liver cells) and shrank tumors in mice. These weren't repurposed existing drugs. They were de novo discoveries - molecules that didn't exist in any drug database before GPS suggested them.

Idiopathic pulmonary fibrosis (IPF) was the second disease, and here's where things get particularly clever. IPF involves multiple cell types going haywire simultaneously, and recent single-cell RNA sequencing studies show that roughly 60% of IPF's dysregulated transcriptome doesn't respond to current approved treatments like nintedanib or pirfenidone. GPS handled this by incorporating single-cell transcriptomics, reversing gene expression signatures across multiple distinct cell types at once. The result: one repurposing candidate plus a novel anti-fibrotic compound, both validated on human lung tissue from transplant patients at Corewell Health.
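One simple way to picture "reversing several cell types at once" is to score a compound against each cell type's disease signature and then combine the scores. The sketch below uses a weighted mean of negative cosine similarities. That aggregation rule, the cell-type names, and the three-gene toy vectors are assumptions for illustration, and the paper may combine cell types differently.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def multi_cell_reversal(disease_by_cell, drug_sig, weights=None):
    """Weighted mean reversal across cell types (hypothetical aggregation)."""
    if weights is None:
        weights = {cell: 1.0 for cell in disease_by_cell}
    total = sum(weights[cell] for cell in disease_by_cell)
    return sum(
        weights[cell] * -cosine(sig, drug_sig)
        for cell, sig in disease_by_cell.items()
    ) / total


# Toy per-cell-type disease signatures over three genes (invented numbers).
disease_by_cell = {
    "fibroblast": [1.0, -1.0, 0.0],
    "epithelial": [0.0, 1.0, 1.0],
}
drug_x = [-1.0, 0.0, -1.0]  # moderately reverses both cell types
drug_y = [-1.0, 1.0, 0.0]   # perfect in fibroblasts, harmful in epithelium

print(multi_cell_reversal(disease_by_cell, drug_x))  # about 0.5
print(multi_cell_reversal(disease_by_cell, drug_y))  # about 0.25
```

Here drug_y looks perfect if you only check fibroblasts, but drug_x wins once the second cell type counts, which is the kind of trade-off a single bulk signature would hide.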

Why This Matters Beyond Two Diseases

The real headline isn't just "AI finds drugs." It's the concept of structure-gene-activity relationships - extending the classic structure-activity relationship (SAR) that medicinal chemists live and breathe by adding a transcriptomic layer in the middle. Instead of just asking "does this molecule bind the target?", GPS asks "what does this molecule do to the entire cellular program, and is that what we want?"

That matters because many diseases - cancers, fibrotic conditions, neurodegenerative disorders - aren't single-target problems. They're systems-level failures. A tool that thinks in transcriptomes rather than individual targets is playing a different game entirely. The best insights often come from seeing the whole network, not staring at one node.

The Fine Print

GPS isn't magic. It's limited by the quality and diversity of its training data, its predicted transcriptomic profiles still need experimental validation, and the jump from "shrank tumors in mice" to "approved human therapy" remains one of the hardest gaps in all of medicine. But as a discovery engine - a way to search chemical space orders of magnitude faster than any high-throughput screen - it's a genuinely new capability. The fact that it produced de novo compounds with in vivo activity, not just pretty computational predictions, separates it from a lot of AI drug discovery hype.

Drug discovery usually takes a decade and a billion dollars. GPS won't change the clinical trial timeline, but it might dramatically compress the "find something worth testing" phase. And for diseases like IPF, where patients have a median survival of three years and approximately zero curative options, faster matters.

References

Xing, J., Tan, M., Leshchiner, D., et al. (2026). Deep-learning-based de novo discovery and design of therapeutics that reverse disease-associated transcriptional phenotypes. Cell. DOI: 10.1016/j.cell.2026.02.016
Subramanian, A., et al. (2017). A next generation Connectivity Map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6), 1437-1452. PMCID: PMC5990023
Pham, T.H., et al. (2024). Deep representation learning of chemical-induced transcriptional profile for phenotype-based drug discovery. Nature Communications. DOI: 10.1038/s41467-024-49620-3
Bouzigon, E., et al. (2025). Single cell transcriptomics in a treatment-segregated cohort exposes a STAT3-regulated therapeutic gap in idiopathic pulmonary fibrosis. PMID: 40666833
Cheng, F., et al. (2024). A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation. Briefings in Bioinformatics, 25(4). PMCID: PMC11247410

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.