Pangenome Explained: History, Sequencing Strategies, Databases, and Applications
From Linear References to Pangenomes
For decades, genomics relied on the idea of a single high-quality reference genome. This backbone guided downstream studies in transcriptomics, epigenomics (such as ChIP-seq, CUT&Tag, WGBS, and ATAC-seq), and variant discovery. A reference genome provides a consistent coordinate system, yet it represents only one individual or lineage. Environmental pressures, domestication, and geographical separation all introduce significant genetic variation within a species.
As a result, a single reference genome introduces reference bias. Sequences missing from the reference may map poorly or not at all, leading to under-detection of alleles and structural variants. Over time, researchers realized that this approach masked valuable diversity that shapes adaptation, health, and phenotype.
The pangenome addresses this limitation. It represents the full repertoire of genomic sequences found across many individuals of a species. Within a pangenome, genes are commonly classified into:
- Core genes — shared by nearly all individuals, often housekeeping functions.
- Variable (dispensable) genes — present in some individuals, linked to adaptation or environmental responses.
- Private genes — unique to one or a few lineages.
By integrating multiple genomes, pangenomes provide a much more complete view of natural diversity.
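The three categories can be assigned directly from a gene presence/absence table. The sketch below is a minimal illustration, not a published pipeline: the function name `classify_genes`, the 95% cutoff for "nearly all individuals", and the rule that a private gene occurs in exactly one genome are all assumptions chosen for the example, and real studies set these thresholds differently.

```python
from typing import Dict, Set


def classify_genes(presence: Dict[str, Set[str]],
                   core_fraction: float = 0.95) -> Dict[str, str]:
    """Label each gene family as core, variable, or private.

    presence maps a genome name to the set of gene families found in it.
    core_fraction (illustrative cutoff) is the share of genomes a gene
    must occur in to count as core; genes seen in one genome are private.
    """
    n_genomes = len(presence)
    counts: Dict[str, int] = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1

    labels: Dict[str, str] = {}
    for gene, count in counts.items():
        if count / n_genomes >= core_fraction:
            labels[gene] = "core"
        elif count == 1:
            labels[gene] = "private"
        else:
            labels[gene] = "variable"
    return labels


# Toy input: four hypothetical accessions and four gene families.
presence = {
    "acc1": {"geneA", "geneB", "geneC"},
    "acc2": {"geneA", "geneB"},
    "acc3": {"geneA", "geneD"},
    "acc4": {"geneA", "geneB"},
}
labels = classify_genes(presence)
for gene in sorted(labels):
    print(gene, labels[gene])
```

In this toy input, geneA is present everywhere and is labeled core, geneB is found in three of four accessions and is labeled variable, and geneC and geneD each occur once and are labeled private.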
Pan-genome selection and construction. (Murukarthick Jayakodi, et al., DNA Research, 2021)
Open vs. Closed Pangenomes
One of the most practical distinctions in pangenomics is whether a species has an open or closed pangenome.
- Open pangenomes: The repertoire continues to expand as new individuals are sampled. Microbial species that undergo frequent horizontal gene transfer, as well as diverse plants with wide geographic ranges, often fall into this category. In such cases, each added genome keeps contributing new genes, so the total gene count grows without reaching a plateau.
- Closed pangenomes: The gene repertoire stabilizes after sampling a sufficient number of genomes. Most new individuals add little or no novel sequence. This is more common in species with restricted gene flow or highly conserved lineages.
This distinction helps researchers design efficient sampling strategies. For open pangenomes, diversity may require dozens or hundreds of genomes; for closed pangenomes, fewer samples may be enough to capture the full range.
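One common way to quantify the open/closed distinction is a Heaps' law style fit, in which the number of new genes contributed by the N-th genome is modeled as n(N) = κ·N^(−α); an exponent α ≤ 1 is usually read as open and α > 1 as closed. The sketch below assumes that model and uses a simple log-log least-squares fit on synthetic numbers (not real data). The function names are hypothetical, and in practice the new-gene counts are averaged over many random genome orderings before fitting.

```python
import math
from typing import List, Sequence, Set, Tuple


def new_genes_per_genome(genomes: Sequence[Set[str]]) -> List[int]:
    """Count genes not seen before as each genome is added in the given order."""
    seen: Set[str] = set()
    new_counts: List[int] = []
    for genes in genomes:
        new_counts.append(len(genes - seen))
        seen |= genes
    return new_counts


def fit_power_law(new_counts: Sequence[float]) -> Tuple[float, float]:
    """Fit log(n) = log(kappa) - alpha * log(N) by least squares.

    The first genome (N = 1) is excluded, and points with zero new genes
    are skipped because log(0) is undefined. Returns (kappa, alpha).
    """
    points = [
        (math.log(n_genomes), math.log(count))
        for n_genomes, count in enumerate(new_counts, start=1)
        if n_genomes >= 2 and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-zero points with N >= 2")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope


# Synthetic new-gene counts following n(N) = 800 * N^-0.5 (illustrative only).
synthetic = [800 * n ** -0.5 for n in range(1, 11)]
kappa, alpha = fit_power_law(synthetic)
print(f"alpha = {alpha:.2f} ->", "open" if alpha <= 1 else "closed")
```

A fitted α well above 1 suggests that additional sampling will add little new sequence, which is the practical signal for stopping at a smaller cohort.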
Pan-genome and core genome development plot projections for S. lugdunensis (Panel a and b), S. epidermidis (Panel c and d), and S. aureus (Panel e and f). (Argemi, X., Matelska, D., Ginalski, K. et al., BMC Genomics, 2018)
A Brief History of Pangenomes
The term pangenome was first introduced in 2005 by Tettelin and colleagues in their work on Streptococcus agalactiae. They divided the genome into core and dispensable elements, setting the conceptual foundation. In 2007, Morgante applied the framework to plants, although sequencing limitations restricted its utility.
The field gained momentum after 2010 as short-read sequencing matured. By 2014, plant pangenomes were reported for soybean, rice, and maize, demonstrating that a single cultivar's reference genome overlooked key genes linked to agronomic traits.
The real leap came with the rise of third-generation sequencing technologies (PacBio HiFi and Oxford Nanopore). These long-read methods resolve structural variants, repeats, and haplotypes that short reads struggle with. Around 2020, graph-based assembly methods were introduced, allowing variation to be represented in a unified structure rather than separate lists of variants. Since then, the number of published pangenomes has surged, spanning microbes, crops, animals, and humans.
Human Pangenome
Traditional human genomics was anchored by GRCh38, a composite reference based on a small number of individuals. While useful, it failed to capture global human diversity.
In 2023, the Human Pangenome Reference Consortium (HPRC) published a landmark resource: a draft human pangenome built from 47 individuals across diverse ancestries. Compared with GRCh38, the new reference added 119 million base pairs of sequence and 1,115 gene duplications, with roughly 90 million of the added base pairs derived from structural variation.
The HPRC has since continued to add genomes toward a planned cohort of roughly 350 individuals. In parallel, the Chinese Pangenome Consortium (CPC) released a reference covering 36 Chinese populations, highlighting population-specific structural variants and alleles. Together, these efforts represent a turning point: a move away from a static reference toward a living graph that can grow as new genomes are added.
The human pangenome not only improves variant calling accuracy but also supports medical research by revealing hidden alleles relevant to disease risk, immunity, and pharmacogenomics.
Animal Pangenomes
Animal pangenomes remain less numerous than microbial or plant counterparts but are expanding quickly. Most focus on humans, poultry, and domesticated livestock.
- Livestock: Pangenomes reveal structural variation underlying traits such as growth rate, milk yield, fat distribution, or adaptation to harsh environments. For example, duplications and inversions often mark breed-specific adaptations.
- Poultry: In chickens and ducks, presence–absence variation correlates with immune function and domestication traits.
- Wild animals: Early studies are exploring biodiversity and conservation genomics, where pangenomes can track lineage-specific adaptations.
These references open new opportunities for breeding programs, conservation, and veterinary applications.
Plant Pangenomes
Plants are the natural home for pangenome approaches because of their extraordinary intraspecific diversity.
- From Arabidopsis thaliana in 2000 to maize, rice, soybean, tomato, and wheat, reference genomes laid the groundwork.
- Around 2010, bioinformatics tools began detecting large-scale structural variants in short-read data.
- In 2016, an iterative strategy was proposed in which reads that fail to map to the reference are assembled and the resulting contigs are added to a growing pangenome.
- In 2020, the first graph-based plant pangenome was assembled for soybean, proving superior to linear approaches for capturing complex variants.
Applications range from stress resistance and flowering time to fruit quality and yield stability. For breeders, pangenomes provide unprecedented access to variable genes and structural rearrangements linked to traits of interest.
Sequencing and Analysis
A pan-genome workflow. (Murukarthick Jayakodi, et al., DNA Research, 2021)
Sample Design
Effective sampling is key to success. Using only closely related individuals underestimates diversity. The best approach combines wild relatives, landraces, and modern cultivars to span genetic breadth. For humans or animals, cohort design often reflects geographic and ancestral diversity.
Sequencing Strategies
Most pangenomes combine multiple sequencing platforms:
- Third-generation sequencing (PacBio HiFi, ONT) for long, accurate assemblies.
- Illumina short reads for polishing and variant detection.
- Hi-C for chromosomal scaffolding.
- RNA-seq for gene annotation.
This multimodal strategy balances accuracy, cost, and functional annotation.
Construction Approaches
- De novo assembly: Assemble each genome independently and compare. Highest accuracy but computationally expensive.
- Iterative assembly: Start with one reference, then integrate new sequences stepwise. Efficient but biased toward the initial genome.
- Graph-based pangenome: Represent sequences as a graph, with nodes as sequences and edges as relationships. This approach excels at representing structural variants and allows seamless addition of new individuals.
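The graph representation described in the last bullet can be sketched with a very small data structure. The class below is a toy under stated assumptions: `SequenceGraph` and its methods are hypothetical names, sequences are invented, and strand orientation (which real formats such as GFA record on every link) is ignored. Each individual is stored as a path through shared segments, so adding a new individual only adds a path and any edges it introduces.

```python
from typing import Dict, List, Sequence, Set, Tuple


class SequenceGraph:
    """Toy sequence graph: nodes are segments, edges join consecutive segments."""

    def __init__(self) -> None:
        self.segments: Dict[str, str] = {}
        self.edges: Set[Tuple[str, str]] = set()
        self.paths: Dict[str, List[str]] = {}

    def add_segment(self, seg_id: str, sequence: str) -> None:
        self.segments[seg_id] = sequence

    def add_path(self, name: str, seg_ids: Sequence[str]) -> None:
        """Register one individual's walk and the edges it traverses."""
        for seg_id in seg_ids:
            if seg_id not in self.segments:
                raise KeyError(f"unknown segment: {seg_id}")
        for left, right in zip(seg_ids, seg_ids[1:]):
            self.edges.add((left, right))
        self.paths[name] = list(seg_ids)

    def spell(self, name: str) -> str:
        """Reconstruct the linear sequence of one path."""
        return "".join(self.segments[s] for s in self.paths[name])

    def shared_segments(self) -> Set[str]:
        """Segments visited by every path (a rough analogue of core sequence)."""
        walks = [set(p) for p in self.paths.values()]
        return set.intersection(*walks) if walks else set()


graph = SequenceGraph()
for seg_id, seq in [("s1", "ACGT"), ("s2", "TTG"), ("s3", "GGCCA"), ("s4", "CAT")]:
    graph.add_segment(seg_id, seq)

graph.add_path("ref", ["s1", "s2", "s4"])
graph.add_path("sampleA", ["s1", "s2", "s3", "s4"])  # carries an insertion (s3)
graph.add_path("sampleB", ["s1", "s4"])              # carries a deletion (no s2)

print(graph.spell("sampleA"))
print(sorted(graph.shared_segments()))
```

Here the insertion and the deletion appear as alternative routes between s1 and s4 rather than as separate variant records, which is the property that makes graphs convenient for structural variation.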
Pangenome Construction Methods — Side-by-Side Comparison
| | De novo assembly | Iterative augmentation | Graph-based assembly |
| Main idea | Assemble each individual genome independently, then compare to define core vs. variable content. | Start from one reference genome; add new sequences step-by-step as additional individuals are incorporated. | Build a sequence graph from multiple assemblies; nodes are sequence segments, edges connect alternative paths. |
| Advantages | Most comprehensive view of variation, including complex SVs and novel sequences; minimal reference bias. | Efficient use of an existing reference; lower compute and turnaround per added sample. | Encodes complex variation (PAV, CNV, inversions, translocations) as alternate paths; scalable for cohorts; reduces reference bias and supports re-analysis. |
| Limitations | High compute/storage and stringent assembly QC; cost rises with cohort size. | Biased toward the seed reference; may miss complex or lineage-specific sequence introduced late. | Building and maintaining large graphs requires mature tooling, standards, and substantial compute. |
| Best-fit use cases | Small–medium cohorts needing high-contiguity assemblies; repeat-rich or highly heterozygous genomes. | Incremental projects that must extend an established reference under tight budgets/timelines. | Population-scale studies (humans, crops, livestock) where many samples and continuous updates are expected. |
| Typical inputs | Long reads (PacBio/ONT) ± Hi-C for scaffolding; RNA-seq for annotation. | Existing reference + long/short reads for each new individual. | Multiple high-quality de novo assemblies merged into a graph; long reads + Hi-C; path-aware annotation. |
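The iterative augmentation column can be illustrated with a toy that also shows where its bias comes from. This is a deliberately simplified stand-in: real pipelines decide novelty by aligning reads or contigs to the current reference, whereas the sketch below uses shared k-mers as a proxy. The function names, the k-mer size, the 50% cutoff, and all sequences are assumptions made for the example.

```python
from typing import List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def augment(pangenome: List[str], contigs: List[str],
            k: int = 5, max_shared: float = 0.5) -> List[str]:
    """Append contigs whose k-mers are mostly absent from the pangenome.

    A contig is treated as novel when at most max_shared of its k-mers
    are already known. Sequences are processed in the order given, so
    the result depends on the seed reference and on sample order.
    """
    known: Set[str] = set()
    for sequence in pangenome:
        known |= kmers(sequence, k)

    result = list(pangenome)
    for contig in contigs:
        contig_kmers = kmers(contig, k)
        if not contig_kmers:
            continue  # contig shorter than k
        shared = len(contig_kmers & known) / len(contig_kmers)
        if shared <= max_shared:
            result.append(contig)
            known |= contig_kmers
    return result


reference = ["ACGTACGTTAGCCGATAGCT"]
sample1 = ["ACGTACGTTAGC", "GGGTTTCCCAAAGGGTTT"]  # first contig matches the reference
sample2 = ["GGGTTTCCCAAAGGGTTT"]                  # already captured from sample1

step1 = augment(reference, sample1)
step2 = augment(step1, sample2)
print(len(reference), len(step1), len(step2))
```

Only the second contig of the first sample is added, and the second sample contributes nothing new because its sequence was already incorporated.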
Analysis Outputs
- Core vs dispensable gene sets and frequency distributions.
- Genome size dynamics and U-shaped gene frequency plots.
- Structural variant discovery, including deletions, insertions, duplications, inversions, and translocations.
- GWAS and SV-GWAS, linking structural variants as well as SNPs to phenotypic traits and capturing associations that SNP-only analyses can miss.
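The gene frequency distribution listed above is straightforward to compute from a presence/absence table. The sketch below is illustrative only: `frequency_spectrum` is a hypothetical helper and the accessions and gene names are invented. It counts how many gene families occur in exactly k genomes; a U-shaped result has many genes at both ends (rare and near-universal) and few in between.

```python
from collections import Counter
from typing import Dict, List, Set


def frequency_spectrum(presence: Dict[str, Set[str]]) -> List[int]:
    """Return a list where entry k-1 is the number of genes found in exactly k genomes."""
    n_genomes = len(presence)
    genomes_per_gene = Counter(
        gene for genes in presence.values() for gene in genes
    )
    genes_per_count = Counter(genomes_per_gene.values())
    return [genes_per_count.get(k, 0) for k in range(1, n_genomes + 1)]


# Toy table: three genes shared by all, one shared by two, three singletons.
core = {"c1", "c2", "c3"}
presence = {
    "acc1": core | {"r1", "m1"},
    "acc2": core | {"r2", "m1"},
    "acc3": core | {"r3"},
    "acc4": set(core),
}

spectrum = frequency_spectrum(presence)
for k, n_genes in enumerate(spectrum, start=1):
    print(f"{k} genome(s): {'#' * n_genes} ({n_genes})")
```

The text histogram stands in for the plot: the bars at one genome and at four genomes are the tallest, which is the U shape in miniature.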
Pangenome graph of 3STs N. meningitidis genomes. (Yang Z. et al. (2023) Frontiers in Genetics)
Pangenome Databases
Several open-access resources make published pangenomes widely usable:
- Rice Pangenome Database (RGI) — https://riceome.hzau.edu.cn/
- Populus Super-pangenome — http://www.populus-superpangenome.com/
- SilkMeta (Silkworm) — http://silkmeta.org.cn/
- BnPIR (Brassica napus) — http://cbi.hzau.edu.cn/bnapus/
These databases allow users to search homology, explore gene annotations, visualize structural variation, and download genomic datasets.
Gene Family Analysis in the Pangenome Era
Gene family analysis traditionally relied on single references, missing lineage-specific or rare variants. Pangenome-informed studies now enable more complete and unbiased characterizations.
For example, a study in barley used a pangenome plus pan-transcriptome to analyze the bHLH transcription factor family, revealing expansion and regulatory divergence not detectable in a single reference. Such approaches will likely become the standard for functional genomics.
Many plant species now have published pangenomes or super-pangenomes. The table below lists illustrative examples across major crop categories. It is not exhaustive and should be updated as new studies appear.
| Category | Representative species with published pangenome resources* |
| Model | Arabidopsis thaliana |
| Cereals | Rice (Oryza sativa), Maize (Zea mays), Wheat (Triticum aestivum), Barley (Hordeum vulgare), Sorghum (Sorghum bicolor) |
| Oilseed & Economic crops | Soybean (Glycine max), Rapeseed/Canola (Brassica napus), Sesame (Sesamum indicum), Sunflower (Helianthus annuus), Cotton (Gossypium spp.) |
| Vegetables | Potato (Solanum tuberosum), Tomato (Solanum lycopersicum), Pepper (Capsicum annuum), Cucumber (Cucumis sativus), Brassica vegetables (B. rapa, B. oleracea) |
| Forestry | Poplar (Populus spp.) super-pangenome |
| Fruits | Watermelon (Citrullus lanatus), Melon (Cucumis melo), Citrus (Citrus spp.), Apple (Malus domestica), Grapevine (Vitis vinifera) |
| Forage/Grasses | Alfalfa (Medicago sativa), Brachypodium (Brachypodium distachyon) |
*Examples only; definitions vary (species-level pangenome, super-pangenome, graph pangenome). Always cite the primary paper.
Beyond Genomes: Multi-omics Integration
The pangenome concept is now extending beyond DNA sequences.
- Pan-transcriptomics integrates expression data from multiple individuals and tissues to uncover regulatory diversity.
- Pan-3D genomes combine 3D chromatin architecture with sequence variation, enabling insights into gene regulation and adaptation.
Recent work in poplars integrated methylation, chromatin accessibility, resequencing, and functional assays into a species-level pangenome. This holistic approach uncovered how genetic and epigenetic variation together shape morphology and environmental responses.
Conclusion
The pangenome has transformed from a niche concept into a central paradigm for modern genomics. By combining multiple genomes, it captures genetic diversity, reduces reference bias, and enables discovery of structural and presence–absence variants overlooked by linear references.
From human health to animal breeding and crop improvement, pangenomes provide the framework for understanding how genetic variation drives traits and adaptation. Advances in sequencing, graph references, and computational methods now make pangenomes more practical than ever. As databases expand and integration with multi-omics accelerates, pangenomes are poised to become the new reference standard for genomics research.
References
- Golicz, A.A., Bayer, P.E., Bhalla, P.L., Batley, J. & Edwards, D. Pangenomics comes of age: from bacteria to plant and animal applications. Trends in Genetics 36(2), 132–145 (2020).
- Argemi, X., Matelska, D., Ginalski, K. et al. Comparative genomic analysis of Staphylococcus lugdunensis shows a closed pan-genome and multiple barriers to horizontal gene transfer. BMC Genomics 19, 621 (2018).
- Jayakodi, M., Schreiber, M., Stein, N. & Mascher, M. Building pan-genome infrastructures for crop plants and their use in association genetics. DNA Research 28(1), dsaa030 (2021).
- Agarwal, G., Choudhary, D., Stice, S.P., Myers, B.K., Gitaitis, R.D., Venter, S.N., Kvitko, B.H. & Dutta, B. Pan-genome-wide analysis of Pantoea ananatis identified genes linked to pathogenicity in onion. Frontiers in Microbiology 12, 684756 (2021).
- Gautreau, G., Bazin, A., Gachet, M. et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3), e1007732 (2020).
- Yang, Z., Guarracino, A., Biggs, P.J. et al. Pangenome graphs in infectious disease: a comprehensive genetic variation analysis of Neisseria meningitidis leveraging Oxford Nanopore long reads. Frontiers in Genetics 14, 1225248 (2023).
- Gao, L., Gonda, I., Sun, H. et al. The tomato pan-genome uncovers new genes for disease resistance and flavour. Nature Genetics 51, 1044–1051 (2019).
- Tettelin, H., Masignani, V., Cieslewicz, M.J. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial "pan-genome". Proceedings of the National Academy of Sciences 102(39), 13950–13955 (2005).