Plant breeding has always been a cornerstone of agriculture, enabling the development of crops that are more productive, resilient, and adaptable to various environments. Traditionally, breeders have relied on statistical methods, such as linear mixed models, to predict how plants will perform based on their genetic makeup. However, in the case of highly complex agronomic traits like yield, these traditional methods are seriously limited in their ability to account for complex gene-by-gene and gene-by-environment interactions. Machine-learning (ML) models, with their capacity to model such intricate interactions, offer superior predictive ability over traditional approaches, especially when data are abundant. A key benefit of ML is that its performance scales with data: the more data a model is trained on, the wider its advantage over statistical approaches. This is why large multi-environment trials can dramatically improve ML-based predictions, helping breeders make more informed decisions. But this also leads to a challenge: acquiring enough data to fully harness the potential of ML models.
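To make the baseline concrete, here is a minimal sketch of the traditional statistical approach: ridge regression on a SNP matrix, the shrinkage idea behind rrBLUP-style genomic prediction. All data, marker positions, and effect sizes below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 200 plants genotyped at 500 SNPs (coded 0/1/2 copies of the
# alternate allele), with yield driven by a handful of causal markers.
n, p = 200, 500
X = rng.integers(0, 3, size=(n, p)).astype(float)
true_effects = np.zeros(p)
true_effects[rng.choice(p, 20, replace=False)] = rng.normal(0, 1, 20)
y = X @ true_effects + rng.normal(0, 1, n)

# Ridge regression: shrink all marker effects toward zero, which makes
# estimation feasible even when markers outnumber plants (p >> n).
lam = 10.0
Xc = X - X.mean(axis=0)          # centre the markers
yc = y - y.mean()
beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

y_hat = Xc @ beta + y.mean()
r = np.corrcoef(y, y_hat)[0, 1]  # in-sample predictive correlation
print(f"predictive correlation: {r:.2f}")
```

Note the key assumption such linear models make: each marker contributes an additive effect, which is exactly where they fall short for traits shaped by gene-by-gene interactions.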
The Challenge of Data Scarcity
While ML holds immense promise for breeders, its predictive ability in practice is limited by the availability of data. Collecting suitable datasets requires extensive field trials across diverse environments and advanced genomic sequencing, which is not only resource-intensive but also costly. Moreover, these datasets are often proprietary, and competition makes it unfeasible to combine information from different sources unless breeding companies agree to share their data and collaborate. This data scarcity has been a major bottleneck in fully harnessing ML for plant breeding, and one may easily set the bar too high by expecting game-changing ML performance from tiny, unbalanced datasets.
A Game-Changer: Transfer Learning
A powerful solution to the problem of data scarcity is offered by transfer learning — a groundbreaking deep-learning concept that has been a main driving force of the AI revolution in fields like computer vision, natural language processing, and even healthcare. At its core, transfer learning allows models trained on one task with abundant data to be adapted for a different but similar task where data are scarce, by leveraging much of their existing knowledge. Just as a professional Italian chef can learn the intricacies of French haute cuisine much more easily than a kitchen rookie, cutting-edge deep-learning models are also able to transfer pertinent knowledge between related tasks. This remarkable capability has led to some extraordinary feats of AI: for instance, computer vision models pre-trained on billions of images to identify common everyday objects have been fine-tuned on a limited number of medical images, achieving (or even exceeding) the performance of radiologists in diagnosing cancer. Transferring foundational knowledge between tasks within a similar context can indeed lead to spectacular results.
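The transfer-learning recipe itself fits in a few lines: freeze a pre-trained backbone and fit only a small task-specific head on the scarce target data. Everything below is a toy illustration — the "backbone" is just a fixed random projection standing in for layers learned on a data-rich source task.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "pre-trained" feature extractor: frozen weights that stand in for
# representations learned on a large source task.
d_in, d_feat = 100, 10
W_pretrained = rng.normal(size=(d_in, d_feat))

def extract_features(X):
    """Frozen backbone: map raw inputs to learned representations."""
    return np.tanh(X @ W_pretrained)

# Scarce target task: only 30 labelled examples, whose labels happen to
# be expressible in the backbone's feature space (the transfer premise).
X_small = rng.normal(size=(30, d_in))
Z = extract_features(X_small)
true_head = rng.normal(size=d_feat)
y_small = Z @ true_head + 0.1 * rng.normal(size=30)

# Fine-tuning reduces to fitting d_feat=10 head parameters instead of
# d_in=100 raw ones -- feasible with just 30 samples.
head, *_ = np.linalg.lstsq(Z, y_small, rcond=None)
r = np.corrcoef(Z @ head, y_small)[0, 1]
print(f"fit on 30 samples, correlation: {r:.2f}")
```

The design point is the parameter count: because the backbone stays frozen, the number of parameters that must be estimated from the scarce target data shrinks by an order of magnitude.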
Genomic Language Models: A New Frontier in Plant Breeding
The most promising application of transfer learning to mitigate the challenge of data scarcity in plant breeding is in the form of genomic language models (GLMs). These models are trained on large collections of genomic sequences, allowing them to understand the language of DNA. Just as human large language models (LLMs) like GPT-3 [1] understand the structure of human language and can be fine-tuned to translate, reason, and even write a poem, GLMs are trained to gain a foundational understanding of the structure of genetic sequences. Because evolution has shaped the genomes of many species similarly, much of the genomic information is transferable not only across individual genotypes, but also across different species.
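The pre-training objective behind many such models is masked language modelling: hide a fraction of nucleotides and train the model to recover them from the surrounding context. A minimal sketch of how such training examples are constructed (the vocabulary, mask rate, and sequence below are illustrative):

```python
import random

random.seed(1)
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

def mask_sequence(seq, mask_rate=0.15):
    """Build a masked-language-modelling example from a DNA sequence:
    hide ~15% of nucleotides and keep them as prediction targets,
    analogous to how text LLMs learn by filling in hidden words."""
    tokens = [VOCAB[base] for base in seq]
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(VOCAB["[MASK]"])
            targets.append(tok)       # the model must recover this base
        else:
            inputs.append(tok)
            targets.append(-100)      # ignored by the training loss
    return inputs, targets

inp, tgt = mask_sequence("ATGGCGTACGTTAGC" * 4)
print(sum(t != -100 for t in tgt), "of", len(inp), "positions to predict")
```

Because the targets come from the sequence itself, no labels are needed — which is exactly why such models can be pre-trained on vast unannotated genome collections.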
Recent advancements in GLMs have been nothing short of groundbreaking. Models like ESMFold [2] have revolutionised biology by accurately predicting protein structures at atomic resolution from amino-acid sequences. Other foundational GLMs pre-trained on transcriptomes (e.g., Geneformer [3]) or DNA sequences (e.g., Nucleotide Transformer [4], Genomic Pre-Trained Network [5]) have provided accurate predictions in a large variety of tasks after fine-tuning, even in low-data settings. Such benchmarking tasks include predicting chromatin profiles, identifying splice sites and transcription-factor binding sites, detecting promoter regions, and even predicting genome-wide variant effects. These tasks were once labor-intensive and error-prone, and their successful automation through transfer learning clearly demonstrates the ability of GLMs to understand the complex language of life. The newest generation of open-source GLMs specifically pre-trained on the DNA of crop species (e.g., AgroNT [6]) opens new frontiers for the agriculture technology sector.
Fig. 1 Genome LLMs can analyze both sequential data (such as DNA sequences, ATAC-seq, Hi-C) and non-sequential data (such as single-cell RNA-seq, bulk transcriptomes, multiome). They find patterns that let them predict features such as functional regions, disease-causing SNPs, and gene expression. They first learn from the data (pre-training), and are then fine-tuned or prompted for specific tasks. (Adapted from Consens et al., 2023) [7]
Fine-Tuning for Trait Prediction: Challenges and Solutions
Fine-tuning these pre-trained genomic models for trait prediction is of course not without its challenges. One major hurdle is the need to capture long-range interactions in the genome. For example, understanding the impact of genetic variants on gene expression is complex because it often involves interactions over long distances within the genome. Some regulatory effects can span millions of base pairs, meaning that the model must be capable of handling very long sequences of genetic data. Even though the best-performing language models (including ChatGPT) are based on the transformer architecture, they fall short when handling extremely long contexts. The latest innovations in language-model architectures, such as Hyena [8] and Mamba [9], have overcome this major obstacle and paved the way for the emergence of foundational GLMs (e.g., HyenaDNA [10], PlantCaduceus [11]) that can quickly process sequences up to a million base pairs long, at single-nucleotide resolution.
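A back-of-the-envelope calculation shows why standard self-attention struggles at genomic scale: its cost grows quadratically with sequence length, while Hyena- and Mamba-style operators scale (near-)linearly.

```python
# Why context length is the bottleneck for transformers on genomes:
# self-attention scores every pair of positions, so cost grows as L^2,
# while a linear-time operator grows proportionally to L.
def attention_pairs(seq_len):
    return seq_len ** 2   # pairwise interactions scored by attention

for L in (1_000, 100_000, 1_000_000):
    print(f"{L:>9} bp: attention ~{attention_pairs(L):.1e} ops, "
          f"linear-time model ~{L:.1e} ops")
```

At one million base pairs, the quadratic term reaches roughly 10^12 pairwise interactions per attention layer — a million times the linear-time cost — which is the gap the newer architectures close.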
Furthermore, for genomic models to be effective in predictive breeding, they must be able to understand the effect of allelic variations on gene expression. This requires data from a wide variety of genotypes for fine-tuning a GLM. Although whole-genome sequences would be ideal for this purpose, pseudo-assemblies — partial representations of genomes that can be generated from more affordable sequencing technologies — could offer a cost-effective interim solution to create sufficiently large datasets for fine-tuning these models. One key factor enabling the ultimate success of GLMs in the long term is the rapid decrease in the cost of genomic sequencing. While the Human Genome Project [12], launched in the 1990s, consumed billions of dollars, and sequencing a whole genome still cost around $10,000 in 2010, the price tag has dropped to just $100 today [13] — an astounding 100-fold reduction over roughly a decade. This dramatic decrease in cost means that whole-genome sequencing of large numbers of genotypes is expected to soon become economically viable.
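One simple way to expose allelic variation to a GLM is to substitute each genotype's variant calls into a shared reference sequence, producing an individualised input sequence per genotype. A minimal sketch, with invented positions and alleles:

```python
# Sketch: turn a reference sequence plus one genotype's SNP calls into
# the individualised sequence a GLM could be fine-tuned on. The
# reference, positions, and alleles here are invented for illustration.
def apply_variants(reference, variants):
    """Substitute a genotype's alleles (0-based position -> base)
    into the reference sequence."""
    seq = list(reference)
    for pos, allele in variants.items():
        seq[pos] = allele
    return "".join(seq)

reference = "ATGGCGTACGTTAGCATTGA"
genotype_a = {3: "A", 12: "C"}   # two SNPs relative to the reference
print(apply_variants(reference, genotype_a))
# prints ATGACGTACGTTCGCATTGA
```

In practice insertions and deletions complicate this beyond simple substitution — which is one reason pseudo-assemblies, as partial genotype-specific sequences, are an attractive interim input format.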
A Call to Action for Plant Breeders — Embracing the Age of AI
As we enter an era where artificial intelligence and genomics converge, plant breeding companies have an unprecedented opportunity to adopt these cutting-edge technologies. By embracing the power of AI, breeders can unlock new levels of precision and efficiency in predicting complex traits. These advancements not only have the potential to accelerate the development of resilient, high-yielding plant varieties but also to reduce the costs and time required for breeding programs. The global food crisis, exacerbated by climate change, demands innovative solutions. Foundational genomic language models offer a transformative opportunity to unlock the full potential of predictive breeding. By adopting AI-driven tools, plant breeders can position themselves at the forefront of a new era in crop development — before the next breakthrough reshapes the industry.
Now is the time to invest in AI and lead the charge toward a more sustainable and food-secure world. The tools are available, the science is proven, and the need is urgent. Let’s grow the future together.
References
[1] Brown, T. B., et al. (2020): Language Models are Few-Shot Learners, arXiv:2005.14165
[https://doi.org/10.48550/arXiv.2005.14165]
[2] Lin, Z., et al. (2023): Evolutionary-scale prediction of atomic-level protein structure with a language model, Science, 379, 6637
[https://doi.org/10.1126/science.ade2574]
[3] Theodoris, C.V., et al. (2023): Transfer learning enables predictions in network biology, Nature, 618, 616
[https://doi.org/10.1038/s41586-023-06139-9]
[4] Dalla-Torre, H., et al. (2025): Nucleotide Transformer: building and evaluating robust foundation models for human genomics, Nature Methods, 22, 287
[https://doi.org/10.1101/2023.01.11.523679]
[5] Benegas, G., et al. (2023): DNA language models are powerful predictors of genome-wide variant effects, PNAS, 120, 44
[https://doi.org/10.1073/pnas.2311219120]
[6] Mendoza-Revilla, J., et al. (2023): A Foundational Large Language Model for Edible Plant Genomes, bioRxiv:2023.10.24.563624
[https://www.biorxiv.org/content/10.1101/2023.10.24.563624v1]
[7] Consens, Micaela E., et al. (2023): To transformers and beyond: large language models for the genome, arXiv preprint arXiv:2311.07621
[https://arxiv.org/abs/2311.07621]
[8] Poli, M., et al. (2023): Hyena Hierarchy: Towards Larger Convolutional Language Models, arXiv:2302.10866
[https://doi.org/10.48550/arXiv.2302.10866]
[9] Gu, A. & Dao, T. (2024): Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv:2312.00752
[https://doi.org/10.48550/arXiv.2312.00752]
[10] Nguyen, E., et al. (2023): HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution, Advances in Neural Information Processing Systems, 36, 43177
[https://doi.org/10.48550/arXiv.2306.15794]
[11] Zhai, J., et al. (2024): Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model, bioRxiv:2024.06.04.596709
[https://doi.org/10.1101/2024.06.04.596709]
[12] International Human Genome Sequencing Consortium (2001): Initial sequencing and analysis of the human genome, Nature, 409, 860
[https://doi.org/10.1038/35057062]
[13] https://www.science.org/content/article/100-genome-new-dna-sequencers-could-be-game-changer-biology-medicine