SpliceSelectNet: Teaching AI to Read the Genome Without Losing Its Glasses

A patient can carry one tiny DNA typo, and that typo can make a cell splice a gene the wrong way: a very small mistake with a very rude habit of becoming cancer, a rare disorder, or a diagnosis nobody can explain cleanly.

That is the human mess behind SpliceSelectNet, or SSNet, a new hierarchical Transformer model from Yuna Miyachi and Kenta Nakai published in Nucleic Acids Research in 2026. The model tries to predict splice sites - the spots where cells cut and paste RNA before making proteins - from DNA sequences as long as 100,000 bases. That is not a “read the next sentence” problem. That is a “read the whole apartment lease, the footnotes, and the suspicious clause hiding on page 37” problem.

The Cell Is Editing a Movie Trailer

Your genes are not copied straight into proteins like a neat recipe card. First, cells make precursor RNA, then remove introns and stitch exons together. This is RNA splicing. Done well, it helps one gene produce different useful transcripts. Done badly, it can create broken proteins or weirdly active ones.

The annoying part: splice sites are not always obvious. Many real splice sites use common short motifs (canonically GT at the start of an intron and AG at its end), but the human genome is full of lookalikes. As one recent spliced-alignment paper notes, only a tiny fraction of GT and AG pairs are actual splice sites, which is biologically efficient and computationally deeply impolite.⁵
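To make the lookalike problem concrete, here is a minimal sketch that counts every GT and AG dinucleotide in a short DNA string. The sequence is made up for illustration, and the function name is an assumption, not something from the paper.

```python
def count_candidate_sites(seq: str) -> dict:
    """Count GT (donor-like) and AG (acceptor-like) dinucleotides in a DNA string."""
    seq = seq.upper()
    counts = {"GT": 0, "AG": 0}
    # Slide a two-base window along the sequence, allowing overlaps.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return counts


# Toy sequence, invented for this example only.
toy = "AAGGTAAGTCCAGGTTAGAGTC"
print(count_candidate_sites(toy))
```

Even 22 made-up bases contain several GT and AG candidates. Scale that up to a real gene and the motif alone tells you very little, which is why sequence context has to do the heavy lifting.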

Earlier tools like SpliceAI pushed the field forward by using deep neural networks to predict splice junctions from sequence.⁶ But models often struggle when regulatory clues sit far away from the splice site. Biology, being biology, does not always leave the answer next to the question. It hides the answer 15 kilobases away, then looks at you like this was obvious.

Why Transformers Showed Up Wearing a Lab Coat

Transformers are good at sequences because attention lets one part of the input weigh information from another part. In language, that means connecting a pronoun to the noun it refers to. In genomics, it can mean connecting a splice site to a distant enhancer or silencer.

But vanilla attention gets expensive fast, because every position is compared with every other position, so the cost grows with the square of the sequence length. Make the sequence longer, and the computation starts eating memory like a teenager after sports practice. Recent models have tried different angles: SpliceBERT was pre-trained on millions of RNA sequences from 72 vertebrates,⁴ Spliceformer used Transformer context up to 45,000 nucleotides,³ and SpliceTransformer added tissue-specific splicing prediction, even linking predicted splice changes to disease patterns across ClinVar variants.²
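A back-of-the-envelope calculation shows why long inputs hurt. The sketch below assumes the simplest possible setup: one attention head, one layer, and a score matrix stored naively as 4-byte floats. Real implementations use various tricks to avoid materializing that matrix, so treat the numbers as an illustration of the scaling rather than a measurement of any particular model.

```python
def full_attention_scores(seq_len: int) -> int:
    """Number of pairwise scores when every position attends to every position."""
    return seq_len * seq_len


def gib(n_values: int, bytes_per_value: int = 4) -> float:
    """Memory in GiB for n_values numbers (4 bytes each assumes float32)."""
    return n_values * bytes_per_value / 2**30


for n in (1_000, 10_000, 100_000):
    scores = full_attention_scores(n)
    print(f"{n:>7} bases -> {scores:.1e} scores, ~{gib(scores):.2f} GiB per head per layer")
```

Going from 10,000 to 100,000 bases multiplies the input by 10 and the score matrix by 100. That is the bill any long-context genomic model has to find a way around.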

SSNet’s trick is hierarchy. Instead of asking every base to stare at every other base all at once, it builds from local patches toward broader context. Local attention handles nearby sequence details. Global attention brings in the long-distance gossip. Proud parent moment: the model finally learned to read the whole room. Exasperated parent moment: this was necessary because genome regulation apparently refuses to keep its socks in one drawer.
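The local-then-global idea can be sketched in a few lines of plain Python. To be clear, this is not SSNet's architecture: there are no learned weights, the patch size is arbitrary, and summarizing each patch by mean pooling is an assumption made for illustration. It only shows the general shape of a two-level scheme.

```python
import math

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Plain scaled dot-product attention over small lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def hierarchical_attention(x, patch_size):
    """Toy two-level attention: local within patches, global across patch summaries."""
    patches = [x[i:i + patch_size] for i in range(0, len(x), patch_size)]
    # Level 1: each position attends only to positions in its own patch.
    local = [attend(p, p, p) for p in patches]
    # Summarize each patch by mean-pooling its locally attended vectors (an assumption).
    summaries = [[sum(col) / len(p) for col in zip(*p)] for p in local]
    # Level 2: patch summaries attend to each other, carrying long-range context.
    global_ctx = attend(summaries, summaries, summaries)
    # Add each patch's global context back onto its positions.
    return [[a + b for a, b in zip(vec, g)]
            for p, g in zip(local, global_ctx) for vec in p]


x = [ONE_HOT[b] for b in "ACGTACGTACGT"]
y = hierarchical_attention(x, patch_size=4)
print(len(y), len(y[0]))
```

With 12 positions and patches of 4, the local step computes 3 small 4-by-4 score tables and the global step one 3-by-3 table, instead of a single 12-by-12 table. The saving is trivial here and very much not trivial at 100,000 bases.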

What SSNet Claims It Can Do

Miyachi and Nakai report that SSNet predicts donor and acceptor splice sites with single-nucleotide resolution across 100 kb DNA windows.¹ On GENCODE protein-coding genes, SSNet had higher precision and F1 than SpliceAI while maintaining similar recall. That matters because false positives are not harmless. If a model flags too many fake splice sites, researchers get a confetti cannon of “maybe disease-causing” candidates, and nobody needs that kind of party.
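For readers who want the arithmetic behind "higher precision and F1 at similar recall," here is a small sketch. The counts are invented for illustration and are not results from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: both models find 90 of 100 real sites,
# but model_b raises fewer false alarms.
model_a = precision_recall_f1(tp=90, fp=30, fn=10)
model_b = precision_recall_f1(tp=90, fp=10, fn=10)
print("model_a:", [round(v, 3) for v in model_a])
print("model_b:", [round(v, 3) for v in model_b])
```

Recall is identical in both cases. Only the false-positive count changed, and precision and F1 moved with it, which is exactly the confetti-cannon problem in numbers.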

The paper also tested aberrant splicing prediction using datasets such as SpliceVarDB and BRCA variants. In BRCA1 and BRCA2, where splicing errors can influence breast and ovarian cancer risk, SSNet reportedly outperformed several comparison models on AUROC and AUPRC. That is the part where you nod approvingly, then immediately ask whether it holds up across ancestries, tissues, variant classes, sequencing contexts, and all the other places biology likes to spill juice on the carpet.
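AUROC has a handy plain-language reading: pick one truly splice-disrupting variant and one benign variant at random, and AUROC is the chance the model scores the disruptive one higher. A minimal sketch of that pairwise definition follows, using made-up labels and scores rather than anything from SpliceVarDB or the BRCA sets.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 1 = disrupts splicing, 0 = does not.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(labels, scores))
```

This pairwise version is quadratic in the number of variants, so it is for intuition only. AUPRC asks a related question but focuses on the positives, which makes it the harsher judge when disruptive variants are rare.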

The Interpretability Bit, Also Known as “Show Your Work”

One of SSNet’s nicer features is that its attention maps can point to sequence regions the model treats as meaningful. The authors paired this with in silico mutagenesis, basically changing bases computationally to see what shakes loose. Attention scores aligned with functional sequence importance, suggesting the model was not just waving a laser pointer around the genome for dramatic effect.
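In silico mutagenesis is simple enough to sketch. The loop below swaps each base for every alternative and records how much a score changes. The scoring function here, toy_score, is a hypothetical stand-in invented for this example; in practice it would be a trained model's splice-site prediction.

```python
BASES = "ACGT"


def in_silico_mutagenesis(seq, score_fn):
    """Return {(position, alt_base): score change} for every single-base substitution."""
    ref_score = score_fn(seq)
    effects = {}
    for i, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = score_fn(mutant) - ref_score
    return effects


def toy_score(seq):
    """Hypothetical scorer (not a real model): 1.0 if GT sits at positions 3-4, else 0.0."""
    return 1.0 if seq[3:5] == "GT" else 0.0


effects = in_silico_mutagenesis("CAGGTAAG", toy_score)
# Positions where at least one substitution changes the score.
important = sorted({pos for (pos, _), delta in effects.items() if delta != 0})
print(important)
```

With the toy scorer, only the two positions holding the GT matter, because that is all the scorer looks at. With a real model, comparing a map like this against the attention weights is one way to check whether the attention is pointing at bases that actually change the prediction.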

That said, attention is not proof of mechanism. It is a clue. A pretty good clue, perhaps, but still a clue. Models can ace benchmark tests and then do something baffling in the wild, like a gifted child who explains calculus and then puts a fork in the toaster. I did not raise you like this.

For researchers trying to reason through these architectures, a visual map helps. This is where something like mapb2.io fits naturally: sketching local attention, global attention, datasets, and validation loops can make the model less like a sci-fi radiator and more like a thing with parts.

Why This Could Matter

If SSNet’s results reproduce and generalize, it could help geneticists prioritize variants that disrupt splicing, especially when the damaging signal sits far from the obvious splice boundary. That could improve variant interpretation, guide functional experiments, and eventually support more precise diagnosis for diseases where DNA testing finds “something suspicious” but not enough evidence.

The model may also influence broader genomic AI. A hierarchy that handles 100 kb context without melting the hardware has uses beyond splicing: enhancer prediction, chromatin accessibility, transcription factor binding, and other long-range regulatory tasks.

Still, the hard work remains. Models need external validation, diverse genomic data, tissue-specific context, careful clinical calibration, and transparent failure modes. SSNet is not a doctor. It is more like a very sharp research assistant that read the entire genome neighborhood and may finally stop missing the clues across the street.

References

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.

1. Miyachi Y, Nakai K. “SpliceSelectNet: a hierarchical Transformer-based deep learning model for splice site prediction.” Nucleic Acids Research 54(12), gkag625, 2026. DOI: 10.1093/nar/gkag625
2. You N, et al. “SpliceTransformer predicts tissue-specific splicing linked to human diseases.” Nature Communications 15, 9129, 2024. DOI: 10.1038/s41467-024-53088-6
3. Jónsson BA, et al. “Transformers significantly improve splice site prediction.” Communications Biology 7, 1616, 2024. DOI: 10.1038/s42003-024-07298-9
4. Chen K, Zhou Y, Ding M, Wang Y, Ren Z, Yang Y. “Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction.” Briefings in Bioinformatics 25(3), bbae163, 2024. DOI: 10.1093/bib/bbae163
5. Li H. “Improving spliced alignment by modeling splice sites with deep learning.” BMC Bioinformatics, 2025. DOI: 10.1186/s13015-025-00293-7
6. Jaganathan K, et al. “Predicting splicing from primary sequence with deep learning.” Cell 176(3), 535-548.e24, 2019. DOI: 10.1016/j.cell.2018.12.015
