For decades, genomics relied on the idea of a single high-quality reference genome. This backbone guided downstream studies in transcriptomics, epigenomics (such as ChIP-seq, CUT&Tag, WGBS, and ATAC-seq), and variant discovery. A reference genome provides a consistent coordinate system, yet it represents only one individual or lineage. Environmental pressures, domestication, and geographical separation all introduce significant genetic variation within a species.
As a result, a single reference genome introduces reference bias. Sequences missing from the reference may map poorly or not at all, leading to under-detection of alleles and structural variants. Over time, researchers realized that this approach masked valuable diversity that shapes adaptation, health, and phenotype.
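The pangenome addresses this limitation. It represents the full repertoire of genomic sequences across many individuals of a species. Within a pangenome, sequences are classified into a core genome, shared by all individuals, and a dispensable (variable) genome, present in only some of them.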
By integrating multiple genomes, pangenomes provide a much more complete view of natural diversity.
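To make the core versus dispensable split concrete, here is a minimal sketch in Python. It assumes gene presence/absence has already been called for each accession; the accession names, gene names, and the `classify_genes` helper are hypothetical and used only for illustration. The "private" category (genes found in a single accession) is a commonly used refinement and is added here as an assumption.

```python
from typing import Dict, Set, Tuple


def classify_genes(presence: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split a pangenome into core, dispensable, and private gene sets.

    presence maps each accession name to the set of gene families found in it.
    """
    if not presence:
        raise ValueError("need at least one accession")
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)        # every gene seen in any accession
    core = set.intersection(*gene_sets)  # genes present in all accessions
    dispensable = pan - core             # genes missing from at least one accession
    # Private genes: the subset of dispensable genes found in exactly one accession.
    private = {g for g in dispensable if sum(g in s for s in gene_sets) == 1}
    return core, dispensable, private


# Hypothetical toy data: three accessions, five gene families.
toy = {
    "acc1": {"geneA", "geneB", "geneC"},
    "acc2": {"geneA", "geneB", "geneD"},
    "acc3": {"geneA", "geneC", "geneE"},
}
core, dispensable, private = classify_genes(toy)
```

In real projects the presence/absence calls come from assembly comparison or read mapping, and thresholds such as "present in at least 95% of accessions" are often used in place of strict intersection.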
Pan-genome selection and construction. (Murukarthick Jayakodi, et al., DNA Research, 2021)
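One of the most practical distinctions in pangenomics is whether a species has an open or closed pangenome. In an open pangenome, each newly sequenced genome keeps adding genes that were not seen before; in a closed pangenome, the gene count plateaus after a modest number of genomes.
This distinction helps researchers design efficient sampling strategies. For open pangenomes, capturing the diversity may require dozens or hundreds of genomes; for closed pangenomes, fewer samples may be enough to capture the full range.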
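One common way to judge openness is to track how many previously unseen genes each added genome contributes, then fit a power-law decay to those counts. The sketch below assumes that approach: it fits n(N) ≈ k·N^(-α) by least squares on log-transformed values and applies the widely used rule of thumb that α ≤ 1 suggests an open pangenome and α > 1 a closed one. The function names and the toy numbers are hypothetical. Real analyses usually average over many random orderings of the genomes, which this sketch does not do.

```python
import math
from typing import List, Set


def new_genes_per_genome(genomes: List[Set[str]]) -> List[int]:
    """Count genes not seen before as each genome is added in the given order."""
    seen: Set[str] = set()
    counts = []
    for genes in genomes:
        counts.append(len(genes - seen))
        seen |= genes
    return counts


def fit_decay_exponent(new_counts: List[int]) -> float:
    """Least-squares fit of log(n) = log(k) - alpha * log(N); returns alpha.

    N is the 1-based position of a genome and n the number of new genes it added.
    The first genome is skipped (all of its genes are new by definition), as are
    genomes that add zero new genes, because log(0) is undefined.
    """
    points = [(math.log(i), math.log(n))
              for i, n in enumerate(new_counts, start=1) if i >= 2 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two later genomes that contribute new genes")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


def is_open(alpha: float) -> bool:
    """Rule of thumb: alpha <= 1 suggests open, alpha > 1 suggests closed."""
    return alpha <= 1.0


# Hypothetical gene sets for three genomes, added in this order.
toy_counts = new_genes_per_genome([{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}])

# Hypothetical new-gene counts that fall off steeply (closed) or not at all (open).
closed_alpha = fit_decay_exponent([3600, 900, 400, 225, 144, 100])
open_alpha = fit_decay_exponent([500, 50, 50, 50])
```

A steep drop in new genes, as in the first series, means additional sampling yields little; a flat series means each genome still contributes novel content.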
Pan-genome and core genome development plot projections for S. lugdunensis (Panel a and b), S. epidermidis (Panel c and d), and S. aureus (Panel e and f). (Argemi, X., Matelska, D., Ginalski, K. et al., BMC Genomics, 2018)
The term pangenome was first introduced in 2005 by Tettelin and colleagues in their work on Streptococcus agalactiae. They divided the genome into core and dispensable elements, setting the conceptual foundation. In 2007, Morgante and colleagues applied the framework to plants, although the sequencing technology of the time limited its practical use.
The field gained momentum after 2010 as short-read sequencing matured. By 2014, plant pangenomes were reported for soybean, rice, and maize, demonstrating that a single cultivar's reference genome overlooked key genes linked to agronomic traits.
The real leap came with the rise of third-generation sequencing technologies (PacBio HiFi and Oxford Nanopore). These long-read methods resolve structural variants, repeats, and haplotypes that short reads struggle with. Around 2020, graph-based assembly methods were introduced, allowing variation to be represented in a unified structure rather than separate lists of variants. Since then, the number of published pangenomes has surged, spanning microbes, crops, animals, and humans.
Traditional human genomics was anchored by GRCh38, a composite reference based on a small number of individuals. While useful, it failed to capture global human diversity.
In 2023, the Human Pangenome Reference Consortium (HPRC) published a landmark resource: a draft human pangenome built from 47 individuals across diverse ancestries. Compared with GRCh38, the new reference added 119 million base pairs of euchromatic polymorphic sequence and 1,115 gene duplications, with roughly 90 million of the added base pairs derived from structural variation.
The HPRC has continued to expand the resource toward its stated target of 350 individuals. In parallel, the Chinese Pangenome Consortium (CPC) released a reference covering 36 Chinese populations, highlighting population-specific structural variants and alleles. Together, these efforts represent a turning point: a move away from a static reference toward a living graph that can grow as new genomes are added.
The human pangenome not only improves variant calling accuracy but also supports medical research by revealing hidden alleles relevant to disease risk, immunity, and pharmacogenomics.
Animal pangenomes remain less numerous than microbial or plant counterparts but are expanding quickly. Most focus on humans, poultry, and domesticated livestock.
These references open new opportunities for breeding programs, conservation, and veterinary applications.
Plants are the natural home for pangenome approaches because of their extraordinary intraspecific diversity.
Applications range from stress resistance and flowering time to fruit quality and yield stability. For breeders, pangenomes provide unprecedented access to variable genes and structural rearrangements linked to traits of interest.
A pan-genome workflow. (Murukarthick Jayakodi, et al., DNA Research, 2021)
Effective sampling is key to success. Using only closely related individuals underestimates diversity. The best approach combines wild relatives, landraces, and modern cultivars to span genetic breadth. For humans or animals, cohort design often reflects geographic and ancestral diversity.
Sequencing Strategies
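Most pangenomes combine multiple sequencing platforms: long reads (PacBio HiFi or Oxford Nanopore) for contiguous assembly, Hi-C for scaffolding, short reads for resequencing additional individuals, and RNA-seq for annotation.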
This multimodal strategy balances accuracy, cost, and functional annotation.
Pangenome Construction Methods — Side-by-Side Comparison
| Aspect | De novo assembly | Iterative augmentation | Graph-based assembly |
| --- | --- | --- | --- |
| Main idea | Assemble each individual genome independently, then compare to define core vs. variable content. | Start from one reference genome; add new sequences step-by-step as additional individuals are incorporated. | Build a sequence graph from multiple assemblies; nodes are sequence segments, edges connect alternative paths. |
| Advantages | Most comprehensive view of variation, including complex SVs and novel sequences; minimal reference bias. | Efficient use of an existing reference; lower compute and turnaround per added sample. | Encodes complex variation (PAV, CNV, inversions, translocations) as alternate paths; scalable for cohorts; reduces reference bias and supports re-analysis. |
| Limitations | High compute/storage and stringent assembly QC; cost rises with cohort size. | Biased toward the seed reference; may miss complex or lineage-specific sequence introduced late. | Building and maintaining large graphs requires mature tooling, standards, and substantial compute. |
| Best-fit use cases | Small–medium cohorts needing high-contiguity assemblies; repeat-rich or highly heterozygous genomes. | Incremental projects that must extend an established reference under tight budgets/timelines. | Population-scale studies (humans, crops, livestock) where many samples and continuous updates are expected. |
| Typical inputs | Long reads (PacBio/ONT) ± Hi-C for scaffolding; RNA-seq for annotation. | Existing reference + long/short reads for each new individual. | Multiple high-quality de novo assemblies merged into a graph; long reads + Hi-C; path-aware annotation. |
Pangenome graph of 3STs N. meningitidis genomes. (Yang Z. et al., Frontiers in Genetics, 2023)
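The graph-based idea, with sequence segments as nodes, links as edges, and each genome as a path through them, can be illustrated with a small data structure. This is a toy sketch under simplifying assumptions: forward strand only, no overlaps between segments, and segment boundaries supplied by hand, whereas real graph builders derive them from alignments. The `SequenceGraph` class and all segment and path names are hypothetical. The `to_gfa` method writes the graph in the GFA 1 text format (`S`, `L`, and `P` records), which many pangenome tools read.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SequenceGraph:
    """Toy forward-strand sequence graph: segments, links, and named paths."""
    segments: Dict[str, str] = field(default_factory=dict)
    links: Set[Tuple[str, str]] = field(default_factory=set)
    paths: Dict[str, List[str]] = field(default_factory=dict)

    def add_path(self, name: str, walk: List[Tuple[str, str]]) -> None:
        """Register a genome as a walk of (segment_id, sequence) steps."""
        ids = []
        for seg_id, seq in walk:
            # A segment shared by several genomes must carry the same sequence.
            if self.segments.setdefault(seg_id, seq) != seq:
                raise ValueError(f"segment {seg_id} already holds a different sequence")
            ids.append(seg_id)
        self.links.update(zip(ids, ids[1:]))  # consecutive segments become edges
        self.paths[name] = ids

    def spell(self, name: str) -> str:
        """Reconstruct the linear sequence of one genome from its path."""
        return "".join(self.segments[s] for s in self.paths[name])

    def shared_segments(self) -> Set[str]:
        """Segments traversed by every path (a graph-level analogue of 'core')."""
        walks = [set(p) for p in self.paths.values()]
        return set.intersection(*walks) if walks else set()

    def to_gfa(self) -> str:
        """Serialize as GFA 1 text (forward orientation only, no overlaps)."""
        lines = ["H\tVN:Z:1.0"]
        lines += [f"S\t{i}\t{seq}" for i, seq in sorted(self.segments.items())]
        lines += [f"L\t{a}\t+\t{b}\t+\t0M" for a, b in sorted(self.links)]
        lines += [f"P\t{n}\t{','.join(s + '+' for s in p)}\t*"
                  for n, p in sorted(self.paths.items())]
        return "\n".join(lines) + "\n"


# Three hypothetical haplotypes: a reference, a SNP allele, and a deletion allele.
graph = SequenceGraph()
graph.add_path("ref", [("s1", "ACGT"), ("s2", "G"), ("s4", "TTCA")])
graph.add_path("alt_snp", [("s1", "ACGT"), ("s3", "A"), ("s4", "TTCA")])
graph.add_path("alt_del", [("s1", "ACGT"), ("s4", "TTCA")])
```

Here the SNP and the deletion appear as alternative routes between the shared segments `s1` and `s4`, so no single haplotype is privileged as the coordinate system.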
Several open-access resources make published pangenomes widely usable:
These databases allow users to search homology, explore gene annotations, visualize structural variation, and download genomic datasets.
Gene Family Analysis in the Pangenome Era
Gene family analysis traditionally relied on single references, missing lineage-specific or rare variants. Pangenome-informed studies now enable more complete and unbiased characterizations.
For example, a study in barley used a pangenome plus pan-transcriptome to analyze the bHLH transcription factor family, revealing expansion and regulatory divergence not detectable in a single reference. Such approaches will likely become the standard for functional genomics.
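The gain from a pangenome-informed family analysis can be expressed as a simple set difference: the family members found anywhere in the pangenome minus those annotated in the single reference. The sketch below assumes family members have already been grouped into orthogroups per accession; the accession names, orthogroup IDs, and the helper function are hypothetical and do not reproduce the barley study's data or method.

```python
from typing import Dict, Set


def family_members_missed_by_reference(
    members: Dict[str, Set[str]], reference: str
) -> Set[str]:
    """Return family members (orthogroup IDs) absent from the reference accession.

    members maps each accession to the set of orthogroups of one gene family
    annotated in that accession.
    """
    if reference not in members:
        raise KeyError(f"unknown reference accession: {reference}")
    pan_family = set().union(*members.values())  # family members seen anywhere
    return pan_family - members[reference]


# Hypothetical orthogroup IDs for one transcription factor family.
bhlh = {
    "reference_cultivar": {"OG1", "OG2", "OG3"},
    "landrace": {"OG1", "OG2", "OG4"},
    "wild_relative": {"OG1", "OG3", "OG4", "OG5"},
}
missed = family_members_missed_by_reference(bhlh, "reference_cultivar")
```

Any orthogroup returned here would be invisible to an analysis anchored on the reference cultivar alone.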
Many plant species now have published pangenomes or super-pangenomes. The table below lists illustrative examples across major crop categories. It is not exhaustive and should be updated as new studies appear.
| Category | Representative species with published pangenome resources* |
| --- | --- |
| Model | Arabidopsis thaliana |
| Cereals | Rice (Oryza sativa), Maize (Zea mays), Wheat (Triticum aestivum), Barley (Hordeum vulgare), Sorghum (Sorghum bicolor) |
| Oilseed & Economic crops | Soybean (Glycine max), Rapeseed/Canola (Brassica napus), Sesame (Sesamum indicum), Sunflower (Helianthus annuus), Cotton (Gossypium spp.) |
| Vegetables | Potato (Solanum tuberosum), Tomato (Solanum lycopersicum), Pepper (Capsicum annuum), Cucumber (Cucumis sativus), Brassica vegetables (B. rapa, B. oleracea) |
| Forestry | Poplar (Populus spp.) super-pangenome |
| Fruits | Watermelon (Citrullus lanatus), Melon (Cucumis melo), Citrus (Citrus spp.), Apple (Malus domestica), Grapevine (Vitis vinifera) |
| Forage/Grasses | Alfalfa (Medicago sativa), Brachypodium (Brachypodium distachyon) |
*Examples only; definitions vary (species-level pangenome, super-pangenome, graph pangenome). Always cite the primary paper.
The pangenome concept is now extending beyond DNA sequences.
Recent work in poplars integrated methylation, chromatin accessibility, resequencing, and functional assays into a species-level pangenome. This holistic approach uncovered how genetic and epigenetic variation together shape morphology and environmental responses.
The pangenome has transformed from a niche concept into a central paradigm for modern genomics. By combining multiple genomes, it captures genetic diversity, reduces reference bias, and enables discovery of structural and presence–absence variants overlooked by linear references.
From human health to animal breeding and crop improvement, pangenomes provide the framework for understanding how genetic variation drives traits and adaptation. Advances in sequencing, graph references, and computational methods now make pangenomes more practical than ever. As databases expand and integration with multi-omics accelerates, pangenomes are poised to become the new reference standard for genomics research.