Project Rosalind

Challenges with Today’s Genomic Foundation Models

Genomics is generating data at an unprecedented scale, but most computational methods still treat DNA as a flat string of four letters, typically interpreted against a single reference genome. This representation is useful, but it fails to reflect how genomes are organized, regulated, and shaped by variation. Today’s genomic foundation models generally follow one of two approaches. Some learn broad patterns from DNA sequences alone. Others achieve stronger predictive performance by training on large collections of matched laboratory experiments that link genetic changes to measured biological effects, but those datasets are expensive, limited in availability, and not uniformly representative.

Real human genomes do not match a single reference: meaningful differences can be sparse, subtle, or large-scale, and many current methods struggle to represent this diversity in a way that is both natural and biologically aligned.

A New Foundation for Genome Interpretation

We are building a more flexible foundation for genome interpretation by using a richer, more intuitive representation of genetic information. This allows our models to learn effectively from DNA data on its own and to incorporate experimental data when it exists, rather than depending on it. The goal is to create a foundational system that delivers clearer, more reliable insight from genomes and enables high-impact applications, such as improving variant interpretation, accelerating target discovery, and supporting the development of next-generation therapeutics.

We are in the process of building something exciting!

Sign up with your email address to get notified.

info@projectrosalind.com