AIb2.io - AI Research Decoded

When the Haystack Is Also Made of Needles

Plants are chemical chaos gremlins in the best possible way. They make all sorts of useful molecules, but they do not store the instructions neatly. In bacteria, biosynthetic genes often sit together like polite neighbors. In plants, they can be scattered across the genome like someone emptied a toolbox down the stairs.

That is a problem if you want to find the exact enzyme that performs one very specific chemical move.

This new Journal of the American Chemical Society paper tackles that mess for jolkinolides, a family of labdane-related diterpenoids found in Euphorbia plants. These molecules matter because they carry an alpha,beta-unsaturated gamma-lactone motif linked to interesting anticancer activity, which makes them attractive targets for bioproduction and further drug-development work ([1], [2]).

The bottleneck is the usual suspect: cytochrome P450 enzymes, or CYPs. Think of them like tiny molecular body shops. They add oxygen in just the right place, reshaping a plain hydrocarbon skeleton into something biologically spicy. Very useful. Also very annoying, because a single plant can contain hundreds of CYP genes, and most of them are not the one you want ([1], [3]).

Teaching the Algorithm to Stop Guessing

The clever bit here is that Lu and colleagues did not ask machine learning to magically solve biology with techno-sorcery and vibes. They built a supervised model to predict which CYPs are compatible with which labdane-related diterpenoid olefins by looking at both sequence space and substrate space ([1]).

Think of it like speed dating, but for enzymes and molecules. Sequence tells you what kind of enzyme you are dealing with. Substrate features tell you what kind of chemical customer is showing up at the counter. The model tries to predict which pairings are worth testing in the lab, so researchers stop spending months biochemically interrogating enzymes that were never going to cooperate in the first place.
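To make the speed-dating idea concrete, here is a minimal sketch of pair-based supervised learning. It is not the authors' model, whose architecture and features are not described in this post. It assumes amino-acid composition as the enzyme feature, a made-up two-number descriptor as the substrate feature, and a small hand-rolled logistic regression. The enzyme names, sequences, substrate names, and labels are all hypothetical toy values.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    total = max(len(seq), 1)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(seq, substrate_descriptors):
    """Concatenate enzyme features and substrate features into one vector."""
    return sequence_features(seq) + list(substrate_descriptors)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Predicted probability that the enzyme accepts the substrate."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            error = score(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical toy data: these are NOT real CYP sequences or real descriptors.
enzymes = {"enzA": "FFFFLLFFAA", "enzB": "KKKKEEKKAA"}
substrates = {"sub1": [1.0, 0.0], "sub2": [0.0, 1.0]}
labels = {
    ("enzA", "sub1"): 1,  # pretend this pairing was active in the lab
    ("enzA", "sub2"): 0,
    ("enzB", "sub1"): 0,
    ("enzB", "sub2"): 0,
}

pairs = list(labels)
X = [pair_features(enzymes[e], substrates[s]) for e, s in pairs]
y = [labels[p] for p in pairs]
weights, bias = train(X, y)

# Rank every enzyme-substrate pairing by predicted compatibility.
ranked = sorted(
    pairs,
    key=lambda p: score(weights, bias, pair_features(enzymes[p[0]], substrates[p[1]])),
    reverse=True,
)
print(ranked[0])
```

A linear model on concatenated features can only add an enzyme effect to a substrate effect, so it is a teaching toy. Anything aimed at real CYPs would need richer representations and far more labeled pairings.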

And the payoff was not trivial. The team characterized ten CYPs involved in labdane-related diterpenoid (LRD) biosynthesis from Euphorbia fischeriana, including enzymes from five subfamilies that had not previously been shown to oxidize LRD scaffolds: CYP82BU, CYP80C, CYP71BF, CYP82J, and CYP71AN. Together, these enzymes enabled oxidation at 12 sites across four tested LRD olefins. That is the sort of result that makes a pathway engineer put down their coffee and say, “Oh. That’s actually useful.” ([1])

The standout trick was a combination of EfCYP71BF25 and EfCYP82BU7, which produced the prized alpha,beta-unsaturated gamma-lactone ring. In plain English: the authors found enzyme parts that can build the business end of the molecule, not just decorate the edges ([1]).

Why This Is Bigger Than One Plant and One Diterpenoid

This paper sits inside a much larger trend. Recent studies have been using deep learning, graph models, and protein language models to predict CYP behavior more accurately, especially in drug metabolism and enzyme-substrate matching. DeepP450 combined pretrained protein and molecular representations to predict CYP substrates across nine human CYPs with strong performance. GTransCYPs used graph transformers for inhibitor prediction. A 2024 review in Computational and Structural Biotechnology Journal sums up how fast this prediction space is moving, and a 2023 ACS Synthetic Biology review makes the broader case that machine learning is becoming a serious tool for natural-product genome mining ([4], [5], [6], [7]).

That broader context matters. If you can identify pathway enzymes faster, you can build microbial production systems faster. And if you can build those systems, you are no longer stuck waiting for a slow-growing plant to make small amounts of a complex molecule because nature felt artistic that day.

Think of it like going from treasure hunting to supply-chain design.

This is also where the paper feels practical instead of flashy. It is not claiming that a model has “understood” plant metabolism in some sci-fi sense. It is doing something more valuable: shrinking the experimental search space. In wet-lab biology, that can save huge amounts of time, money, and graduate-student morale.
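For a feel of what shrinking the search space means in numbers, here is a small sketch under loud assumptions. The count of 300 CYP genes is a stand-in for "hundreds", the scores are random placeholders rather than model output, and the shortlist size of 40 is an arbitrary lab budget. Only the four olefins come from the paper summary above.

```python
import heapq
import random

N_CYPS = 300    # assumption: stand-in for "hundreds" of CYP genes in one plant
N_OLEFINS = 4   # the four tested LRD olefins mentioned above
BUDGET = 40     # assumption: how many pairings the lab can afford to test

# Placeholder scores; in practice these would come from a trained model.
rng = random.Random(0)
scores = {
    (f"cyp_{i}", f"olefin_{j}"): rng.random()
    for i in range(N_CYPS)
    for j in range(N_OLEFINS)
}


def shortlist(pair_scores, k):
    """Return the k highest-scoring (pair, score) items, best first."""
    return heapq.nlargest(k, pair_scores.items(), key=lambda item: item[1])


top = shortlist(scores, BUDGET)
fraction = len(top) / len(scores)
print(f"{len(top)} of {len(scores)} pairs shortlisted ({fraction:.1%})")
# prints: 40 of 1200 pairs shortlisted (3.3%)
```

Whether the shortlist actually contains the right enzymes depends entirely on how good the scores are, which is why the wet-lab validation still matters.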

If you have ever tried sketching one of these branching biosynthetic pathways, by the way, a visual mapping tool like mapb2.io is probably kinder to your brain than turning a notebook page into a detective wall with arrows everywhere.

The Fine Print, Because Biology Likes Humility

There are limits. This is still an experimentally validated workflow, not a universal enzyme oracle in a trench coat. The model’s usefulness depends on the quality and diversity of its training examples. Plant metabolism is full of weird edge cases, and CYPs are famous for being selective, promiscuous, or both, sometimes before lunch. So the real win is not “ML replaces biochemistry.” The win is “ML tells biochemistry where to look first.”

That is enough to matter.

For people interested in synthetic biology, natural products, and greener manufacturing, this paper is a nice example of machine learning behaving like a competent lab assistant instead of a hype balloon. It does not make the chemistry less weird. It just makes the search less ridiculous.

References

  1. Lu K, Zhang R, Gao K, Li N, Cai Z, Zhu J, Zi J. Machine-Learning-Guided Discovery of Cytochrome P450 Enzymes for Bioproduction of Jolkinolides and Other Labdane-Related Diterpenoids. Journal of the American Chemical Society (2026). DOI: https://doi.org/10.1021/jacs.6c05010

  2. Lai J-Z, Zhang M-H, Wu Y-C, Zhang D-Y, Wu X-M, Hua W-Y. ent-Abietane Lactones from Euphorbia. Mini Reviews in Medicinal Chemistry (2017) 17(4):380-397. DOI: https://doi.org/10.2174/1389557516666160923130814

  3. Zi J, Peters RJ. A review: biosynthesis of plant-derived labdane-related diterpenoids. Chinese Journal of Natural Medicines (2021). DOI: https://doi.org/10.1016/S1875-5364(21)60100-0

  4. Chang J, Fan X, Tian B. DeepP450: Predicting Human P450 Activities of Small Molecules by Integrating Pretrained Protein Language Model and Molecular Representation. Journal of Chemical Information and Modeling (2024) 64(8):3149-3160. DOI: https://doi.org/10.1021/acs.jcim.4c00115

  5. GTransCYPs: an improved graph transformer neural network with attention pooling for reliably predicting CYP450 inhibitors. Journal of Cheminformatics (2024). DOI: https://doi.org/10.1186/s13321-024-00915-z

  6. Investigation of in silico studies for cytochrome P450 isoforms specificity. Computational and Structural Biotechnology Journal (2024). DOI: https://doi.org/10.1016/j.csbj.2024.08.002

  7. Machine Learning-Enabled Genome Mining and Bioactivity Prediction of Natural Products. ACS Synthetic Biology (2023). DOI: https://doi.org/10.1021/acssynbio.3c00234

  8. Nature meets machine: the AI renaissance in natural product drug discovery. Natural Products and Bioprospecting (2025). DOI: https://doi.org/10.1007/s13659-025-00589-6

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.