Pan-Genome Research in Agriculture: Steps, Case Studies, and Breeding Implications

As life science research shifts from single reference genomes to the analysis of population-level genetic diversity, pan-genome research offers a new way to reveal the genetic basis of complex traits by integrating the genetic information of many individuals within a species. Since Tettelin et al. first proposed the pan-genome concept in Streptococcus agalactiae in 2005, the approach has been extended from microorganisms to animals, plants, and humans. By systematically capturing the composition of the core genome and the variable genome, it has proved valuable for studying adaptive evolution, mining key functional genes, and translating findings into applications.

This article reviews progress in pan-genome research, outlines the construction steps, and examines applications in potato, rice, Solanum crops, and peanut.

What is Pan-Genome

A pan-genome represents all the genetic information of a species and consists of the core genome and the accessory genome. Core genes are present in all sampled individuals and control basic biological functions and the main phenotypic characteristics of the organism. Accessory genes are present in only one sample or in a subset of samples; they are generally related to adaptation to specific environments or to unique biological characteristics, and so reflect diversity within the species. The pan-genome is expected to gradually replace the single reference genome as a new standard for studying the evolution, selection, and gene function of animals and plants.
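The core/accessory distinction can be illustrated with a minimal sketch. It assumes a hypothetical gene presence/absence table (the gene and sample names below are invented for illustration) and treats a gene as core only when it is found in every sample; real studies often relax this, for example with a "soft-core" threshold.

```python
# Hypothetical presence/absence data: gene -> set of samples that carry it.
presence = {
    "geneA": {"s1", "s2", "s3"},
    "geneB": {"s1", "s3"},
    "geneC": {"s2"},
    "geneD": {"s1", "s2", "s3"},
}
samples = {"s1", "s2", "s3"}


def classify_genes(presence, samples):
    """Split genes into core (present in all samples) and accessory (the rest)."""
    core, accessory = [], []
    for gene, carriers in sorted(presence.items()):
        if samples <= carriers:  # every sample carries this gene
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


core, accessory = classify_genes(presence, samples)
print("core:", core)
print("accessory:", accessory)
```

With this toy table, geneA and geneD are core, while geneB and geneC are accessory.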

Pan-Genome Construction Steps

Pan-genome construction is the key technical process for capturing the full genetic landscape of a species. It overcomes the limitation of a single reference genome by systematically integrating the genomes of multiple individuals in the population. The process is a coordinated workflow that runs from sample selection to graph construction, and it aims to capture the distribution of core and variable genes comprehensively, laying the data foundation for studying species diversity and functional evolution.

Sample Selection and Experimental Design

The first step of pan-genome construction is careful sample selection. Depending on the research objectives, representative individuals of different ecotypes, geographical populations, or varieties are selected from across the natural distribution range of the species, so that the main types of genetic variation in the species are covered.

High Throughput Sequencing

Select the appropriate sequencing technology according to the characteristics of the species genome.

  • For species with simple genomes, short-read technology (such as Illumina) can be used, which has low cost and large data volume.
  • For complex genomes that are highly repetitive or polyploid, long-read technologies (such as PacBio SMRT and Oxford Nanopore) should be added to span repetitive regions and obtain longer sequence fragments.

Generally, a whole-genome sequencing strategy is adopted and each sample is sequenced at high depth (for example, animals and plants generally require a depth of more than 30×) to ensure complete genome coverage. In addition, Hi-C technology can be used to capture chromosome conformation; the resulting three-dimensional contact information assists genome assembly.
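The depth requirement above can be turned into a quick planning calculation. The sketch below uses the standard relation that average depth equals total sequenced bases divided by genome size; the 400 Mb genome size and 150 bp read length are assumed values chosen only for illustration.

```python
def expected_depth(num_reads, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth, read_length_bp, genome_size_bp):
    """Smallest number of reads whose total bases reach the target depth.

    All arguments are expected to be integers.
    """
    total_bases = target_depth * genome_size_bp
    return -(-total_bases // read_length_bp)  # ceiling division


genome_size = 400_000_000  # assumed 400 Mb genome (illustrative)
read_length = 150          # assumed short-read length in bp (illustrative)

n = reads_needed(30, read_length, genome_size)
print(n, expected_depth(n, read_length, genome_size))
```

This gives only the average depth; it says nothing about how evenly reads are distributed, which is why coverage uniformity is checked separately after assembly.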

Individual Genome Assembly

The genome of each individual is assembled independently from its sequencing data.

  • For short-read data, common assembly tools such as SOAPdenovo and SPAdes assemble short sequences into contigs using de Bruijn graph-based methods.
  • For long-read data, tools such as Flye and Canu take advantage of the read length to assemble longer contigs directly.

After assembly, the quality of each individual genome must be evaluated, including contig N50, genome completeness (for example, BUSCO evaluation), and the uniformity of sequencing coverage, to ensure that the assembly results are reliable. For polyploid species, special attention should be paid to distinguishing homologous chromosomes to avoid assembly errors.
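Contig N50, one of the quality metrics mentioned above, is simple to compute by hand. The following is a minimal sketch using made-up contig lengths: N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly size.

```python
def n50(contig_lengths):
    """Return the contig length at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Illustrative contig lengths (in bp); total assembly size is 300.
contigs = [100, 80, 50, 40, 30]
print(n50(contigs))
```

A higher N50 indicates a more contiguous assembly, but it does not measure correctness, so it is normally read together with completeness metrics such as BUSCO.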

Overview of the process of human genome sequence assembly (Taylor et al., 2024)

Variation Detection and Comparative Analysis

The genome of each individual is aligned to the reference genome (or all individuals are aligned to each other) to detect genetic variants, including single nucleotide polymorphisms (SNPs), insertions and deletions (InDels), structural variations (SVs), and copy number variations (CNVs). Commonly used alignment tools include BWA and Bowtie2, and commonly used variant callers include GATK and FreeBayes.

Comparative analysis identifies the core genome (genes shared by all individuals) and the variable genome (genes present in only some individuals) and clarifies the genetic differences between individuals. Complex variant types such as structural variations should be verified by combining long-read data with visualization tools (such as IGV).
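As a rough illustration of how the variant classes above differ, the sketch below labels a variant from its reference and alternate alleles. It is a simplification under stated assumptions: the 50 bp cutoff between InDels and SVs is a common convention rather than something specified in this article, only sequence-resolved alleles are handled, and CNVs (which need read-depth or copy-count evidence) are out of scope.

```python
SV_MIN_LENGTH = 50  # assumed cutoff in bp; a common convention, not from the article


def classify_variant(ref, alt):
    """Label a sequence-resolved variant by the size of its alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        return "MNP"  # same-length substitution of several bases
    return "SV" if size >= SV_MIN_LENGTH else "InDel"


examples = [
    ("A", "G"),             # single-base substitution
    ("AT", "A"),            # 1 bp deletion
    ("A", "A" + "T" * 60),  # 60 bp insertion
    ("AC", "GT"),           # two-base substitution
]
labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)
```

In practice, SVs such as inversions and translocations are usually reported with symbolic alleles or breakpoints by dedicated callers, so a length rule like this covers only simple insertions and deletions.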

Pan-genome Integration and Map Construction

The genome information of all individuals is integrated to construct a pan-genome map. The traditional linear pan-genome takes one reference genome as the backbone and appends the sequences unique to other individuals as additional sequences. The more advanced graph genome represents the genetic diversity of the species as a graph: nodes represent sequence segments, edges connect segments that are adjacent in at least one genome, and each individual genome corresponds to a path through the graph. Shared sequence appears as nodes traversed by all paths, while variation appears as alternative routes, so the graph reflects the genetic variation of the species more comprehensively.
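A toy example may make the graph representation more concrete. The sketch below assumes one common formulation, in which nodes hold sequence segments, edges record allowed adjacencies, and each genome is a path of node IDs; the sequences and sample names are invented, and real tools use dedicated formats and data structures rather than plain dictionaries.

```python
# Toy graph: a single-base difference between two genomes forms a "bubble".
nodes = {1: "ACGT", 2: "G", 3: "T", 4: "CCA"}   # node id -> sequence segment
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}        # allowed adjacencies
paths = {"ref": [1, 2, 4], "sampleX": [1, 3, 4]}  # genome -> walk through the graph


def spell(path):
    """Reconstruct a genome sequence by walking a path through the graph."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"path uses missing edge {a}->{b}")
    return "".join(nodes[n] for n in path)


def shared_nodes(paths):
    """Nodes traversed by every genome (the shared part of the graph)."""
    return set.intersection(*(set(p) for p in paths.values()))


for name, path in paths.items():
    print(name, spell(path))
print("shared:", sorted(shared_nodes(paths)))
```

Here nodes 1 and 4 are shared by both genomes, while nodes 2 and 3 are the two alleles of the variant site; a linear reference would retain only one of them.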

In the process of construction, it is necessary to annotate and classify the variable genes to clarify their functions and distribution characteristics. At the same time, the functional association analysis of genes in the pan-genome can be carried out by combining gene expression data and phenotypic data.

Example of a pangene page (i.e. maize pan p014093) member of the U2 auxiliary factor small subunit gene family (GP001047) (Valentin et al., 2021)

Functional Annotation and Verification

All genes in the pan-genome are functionally annotated, which includes functional prediction of coding genes, identification of non-coding RNAs, and analysis of regulatory elements. Tools such as BLAST can be used for homology-based annotation and InterProScan for protein domain analysis, and transcriptome data such as RNA-seq can assist annotation.

For variable genes, especially newly discovered genes, it is necessary to verify their functions. Experimental methods such as gene knockout, overexpression, and CRISPR screening can be used, combined with phenotypic analysis, to clarify their biological functions. This step is very important for mining functional genes in the pan-genome.

Data Storage and Sharing

The massive data generated by pan-genome research need to be stored and managed efficiently. A dedicated pan-genome database can be established to store the raw sequencing data, assembly results, variation information, and functional annotations, and to provide convenient retrieval and analysis tools. In line with open science principles, the data should also be submitted to public databases (such as NCBI and EMBL-EBI) to enable data sharing and promote collaborative research. Data storage should take security and long-term availability into account, and standardized data formats and metadata descriptions should be adopted to ensure that the data are reproducible and reusable.

Opportunities and needs for pangenome research (Taylor et al., 2024)

Practical Case Studies of Pan-Genome

Pan-genome research is becoming a key route to understanding the genetic basis of species diversity. In 2025, several pan-genome studies were published in leading journals such as Nature and Nature Genetics. These studies cover potato, rice, Solanum crops, and peanut, and they examine pan-genome structure and function from different angles, providing new perspectives and key data for agricultural breeding and evolutionary research. The main findings of these four articles are introduced in turn below.

The phased pan-genome of tetraploid European potato

Journal: Nature

Impact Factor: 40.137

Publication Date: 2025.04.16

DOI: https://doi.org/10.1038/s41586-025-08843-0

Potatoes were first introduced to Europe in the 16th century. Two hundred years later, the potato had become one of the most important food sources in Europe and, eventually, in the world. However, because its autotetraploid genome is highly heterozygous, improving varieties has remained difficult ever since.

In this study, the authors constructed a potato pan-genome from haplotype-resolved (phased) genome assemblies of ten historical European cultivars, covering about 85% of the haplotypes segregating in Europe. Sequence diversity among haplotypes is extremely high, owing to numerous introgressions from wild potato species. By contrast, haplotype diversity is extremely low, consistent with the population bottlenecks caused by domestication and the transfer to Europe.

Genetic and haplotype diversity in European potato (Sun et al., 2025)

To demonstrate a practical application of the pan-genome, the authors converted it into a haplotype graph and used it to generate phased, megabase-scale pseudo-genome assemblies of commercial potato varieties (including the famous French fries variety "Russet Burbank") from low-cost short-read sequencing data.

Potato pan-genome analysis (Sun et al., 2025)

In summary, the authors produced a nearly complete pan-genome of European autotetraploid potato, revealed extremely high sequence diversity in a domesticated crop, and outlined how this resource can be used to accelerate genomics-assisted breeding and research.

A pangenome reference of wild and cultivated rice

Journal: Nature

Impact Factor: 40.137

Publication Date: 2025.04.16

DOI: https://doi.org/10.1038/s41586-025-08883-6

Oryza rufipogon, the wild ancestor of Oryza sativa, is an important genetic resource for rice breeding. Based on 145 chromosome-level genome assemblies (129 genetically diverse common wild rice accessions and 16 diverse cultivated rice varieties), the authors constructed the first wild-cultivated rice pan-genome. The pan-genome contains 3.87 Gb of new sequence that is absent from the reference genome of the japonica cultivar Nipponbare, and the alternate assemblies capture, for the first time, heterozygous information missing from the primary assemblies.

Pangenome analysis of 149 representative wild and cultivated rice accessions (Guo et al., 2025)

A total of 69,531 pan-genes were identified, including 28,907 core genes and 13,728 wild rice-specific genes. Disease resistance gene analogs were found to be significantly more abundant and more diverse in wild rice than in cultivated rice. Population genetic evidence shows that the intro-indica and basmati subgroups of South Asia were formed through gene flow among cultivated rice, which strongly supports the theory of a single origin of Asian cultivated rice.

Genetic divergence between indica and japonica (Guo et al., 2025)

The study also identified 855,122 differentiated SNPs and 13,853 differentiated presence/absence variations (PAVs) between indica and japonica. These can be traced back to the divergence of their respective ancestors, and japonica rice experienced a more severe genetic bottleneck. This work not only provides a new resource for rice breeding but also deepens the understanding of the origin and domestication of rice.

Solanum pan-genetics reveals paralogues as contingencies in crop engineering

Journal: Nature

Impact Factor: 40.137

Publication Date: 2025.03.05

DOI: https://doi.org/10.1038/s41586-025-08619-6

Pan-genomics and genome editing are revolutionizing crop breeding worldwide. A key opportunity is to reshape the food system by transferring knowledge of genotype-phenotype relationships between major crops (grown worldwide) and indigenous crops (grown regionally). However, species-specific genetic variation and its interactions with natural and engineered mutations hinder phenotypic prediction, even among related crops.

Using a pan-genome of Solanum crops combined with functional genomics and pan-genetics, this study showed that gene duplication and the subsequent diversification of paralogues are major obstacles to genotype-phenotype predictability. Although the chromosome-scale genomes of 22 species (including 13 indigenous crops) are broadly collinear, thousands of genes (especially in domestication-related gene families) show dynamic evolutionary trajectories in sequence, expression, and function.

Functional dissection of lineage-specific paralogue diversification through pan-genetics reveals modified compensatory relationships in a major fruit size regulator (Benoit et al., 2025)

By adding data from African eggplant cultivars and combining quantitative genetics with validation by genome editing, the authors dissected a complex history of paralogue evolution affecting fruit size. The loss of a redundant paralogue of the classical fruit size regulator CLAVATA3 (CLV3) was compensated by a lineage-specific tandem duplication. The derived copy was then pseudogenized, and a large cultivar-specific deletion followed, producing a single fused CLV3 allele that regulates organ number together with an enzyme-coding gene controlling the same trait.

Pan-genome of African eggplant reveals widespread structural variation, wild species introgression and CLV3 paralogue diversification (Benoit et al., 2025)

This study shows that paralogue diversification over short timescales is an under-recognized contingency in trait evolution. Revealing and navigating these contingencies is crucial for translating genotype-phenotype relationships across species.

Pangenome analysis reveals structural variation associated with seed size and weight traits in peanut

Journal: Nature Genetics

Impact Factor: 31.8

Publication Date: 2025.04.28

DOI: https://doi.org/10.1038/s41588-025-02170-w

Peanut (Arachis hypogaea L.) is an important oilseed and edible legume crop, and seed size and weight are key traits in its domestication and breeding. However, how genomic structural variations (SVs) contribute to these traits remains unclear.

The authors carried out a comprehensive pan-genome analysis based on 8 high-quality genomes (2 diploid wild, 2 tetraploid wild, and 4 tetraploid cultivated peanuts) and resequencing data from 269 accessions with diverse seed sizes. A total of 22,222 core or soft-core gene families, 22,232 distributed (dispensable) gene families, and 5,643 private gene families were identified, and the frequency of structural variation was higher in subgenome A than in subgenome B.

Gene-level pangenome in peanuts (Zhao et al., 2025)

The study further identified 1,335 SVs related to domestication and 190 SVs associated with seed size or weight. Among them, a 275-bp deletion in the AhARF2-2 gene causes the loss of its interaction with the AhIAA13 and TOPLESS proteins, which weakens the inhibitory effect on AhGRF5 and promotes seed expansion. This high-quality pan-genome provides an important resource for the genetic improvement of peanut and other legume crops.

Characteristics of SVs in pangenome (Zhao et al., 2025)

Conclusion

In summary, pan-genome research is driven by technological innovation in genomics and offers multiple benefits for crop improvement. The cases above confirm that pan-genomes are effective for dissecting the genetic complexity of crops and provide a methodological reference for mining genetic resources across species. As the cost of long-read sequencing falls and graph genome algorithms improve, pan-genome research is likely to be integrated more deeply into molecular design breeding, supporting the shift from empirical breeding to precision breeding and contributing to global food security and sustainable crop improvement.

References

  1. Sun H, Tusso S, Dent CI, et al. "The phased pan-genome of tetraploid European potato." Nature. 2025 https://doi.org/10.1038/s41586-025-08843-0
  2. Guo D, Li Y, Lu H, et al. "A pangenome reference of wild and cultivated rice." Nature. 2025 https://doi.org/10.1038/s41586-025-08883-6
  3. Benoit M, Jenike KM, Satterlee JW, et al. "Solanum pan-genetics reveals paralogues as contingencies in crop engineering." Nature. 2025 640(8057): 135-145 https://doi.org/10.1038/s41586-025-08619-6
  4. Zhao K, Xue H, Li G, et al. "Pangenome analysis reveals structural variation associated with seed size and weight traits in peanut." Nat Genet. 2025 57(5): 1250-1261 https://doi.org/10.1038/s41588-025-02170-w
  5. Taylor DJ, Eizenga JM, Li Q, et al. "Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References." Annu Rev Genomics Hum Genet. 2024 25(1): 77-104 https://doi.org/10.1146/annurev-genom-021623-081639
  6. Valentin G, Abdel T, et al. "GreenPhylDB v5: a comparative pangenomic database for plant genomes." Nucleic Acids Res. 2021 49(12): 7203 https://doi.org/10.1093/nar/gkaa1068
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
Copyright © CD Genomics. All Rights Reserved.