Pan-genome Sequencing & Assembly: Short-reads vs HiFi/ONT and Hybrid
Pan-genome sequencing brings population diversity into a unified framework. Platform and assembly choices decide what you can actually see. This guide compares long-read vs short-read data for pan-genome sequencing, explains when a hybrid assembly wins, and outlines realistic paths toward T2T-like results. We focus on how each route affects repeat resolution and structural variant discovery. You will find clear build recipes, practical QC, and planning tips for human, animal, and crop cohorts. All recommendations target non-clinical, research-use projects.
Illustrated overview of the approach recommended by Wick et al. for perfect bacterial whole-genome assembly. (Wick R.R. et al. (2023) PLOS Computational Biology)
Why platform choice changes downstream results
Choosing a sequencing platform is not a brand decision. It is a decision about what biology you can resolve at a given cost.
Short reads are inexpensive and accurate per base. They scale to large cohorts and support precise polishing. Yet they collapse long repeats and under-detect complex structural variants. Long reads provide the span to connect repeats, phase haplotypes, and resolve breakpoints. Long-read data cost more at comparable coverage and per-base accuracy, but they reveal regions that short reads cannot disambiguate.
Illumina short reads in pan-genomes
Short-read sequencing remains the workhorse for many projects. It delivers consistent coverage and high per-base accuracy. Use it to polish assemblies, estimate allele frequencies across large cohorts, and support variant counting at population scale. Expect limitations in highly repetitive regions, and expect to miss inversions, complex insertions, and translocations that require long-range information. Short reads still earn a place in most hybrid designs as a cost-effective accuracy booster.
PacBio HiFi (CCS)
HiFi reads typically run 15–25 kb with per-read accuracy of at least 99% (Q20). They strike a strong balance for eukaryotic pan-genomes. HiFi assemblies handle segmental duplications better than short reads and support reliable structural variant (SV) calls. They also enable robust haplotype phasing without extreme read lengths. For many teams, HiFi is the default long-read option when accuracy and contiguity must both be high.
Oxford Nanopore (standard, duplex, ultra-long)
ONT offers the longest read lengths available. Ultra-long reads can span centromeres, rDNA arrays, and complex tandem repeats. Duplex sequencing improves accuracy and consensus QV, shrinking the historical gap with HiFi. ONT shines when your question depends on spanning distance rather than marginal base accuracy. It is a powerful path toward near T2T assemblies, especially when combined with polishing and scaffolding data.
Assembly strategies that actually work
Match the assembly approach to your data and your goals. Do not force a tool to solve a different problem.
A four-stage framework for NGS assembly: preprocessing, graph construction, graph simplification, and postprocessing/scaffolding. (El-Metwally S. et al. (2013) PLOS Computational Biology)
De novo with long reads (HiFi/ONT)
If budget allows, build the assembly primarily from long reads. Use assemblers tuned to your platform, such as hifiasm for HiFi or Flye for ONT. Long-read-first assemblies produce higher contiguity and fewer collapsed repeats. Follow with platform-matched polishing to correct small errors. Consider a round of short-read polishing only when you have high-quality short-read coverage and a clear error profile to fix.
Hybrid assembly (long + short)
Hybrid assembly combines long reads for structure and short reads for accuracy. This route often delivers the best cost-to-quality ratio. Long reads establish contiguity across repeats. Short reads correct indels and mismatches efficiently. Hybrids are ideal when you need better SV resolution than short reads alone can provide, but cannot fund long-read-only coverage for every sample. Many pan-genome teams adopt a "spine and ribs" model: long-read assemblies for a representative subset, short-read data for the wider cohort.
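The "spine and ribs" model can be turned into a rough sequencing budget, because required yield is simply depth × genome size × sample count. The sketch below is a minimal illustration under stated assumptions: the names `CohortPlan` and `required_yield_gb` are hypothetical, and the depths in the usage example are placeholders rather than recommendations.

```python
from dataclasses import dataclass


@dataclass
class CohortPlan:
    """Hypothetical container for a 'spine and ribs' design."""
    genome_size_bp: int   # haploid genome size in base pairs
    n_samples: int        # whole cohort, sequenced with short reads ("ribs")
    n_spine: int          # subset also sequenced with long reads ("spine")
    long_depth: float     # target long-read depth per spine sample
    short_depth: float    # target short-read depth per cohort sample


def required_yield_gb(plan: CohortPlan) -> dict:
    """Return the raw yield (in Gb) needed per data type.

    Uses yield = samples x depth x genome size and ignores losses from
    filtering, duplicates, or failed runs, which a real plan should pad for.
    """
    if not 0 <= plan.n_spine <= plan.n_samples:
        raise ValueError("n_spine must be between 0 and n_samples")
    long_bp = plan.n_spine * plan.long_depth * plan.genome_size_bp
    short_bp = plan.n_samples * plan.short_depth * plan.genome_size_bp
    return {"long_read_gb": long_bp / 1e9, "short_read_gb": short_bp / 1e9}


# Illustrative numbers only: a 1 Gb genome, 100 samples, 10 of them in the spine.
plan = CohortPlan(genome_size_bp=1_000_000_000, n_samples=100, n_spine=10,
                  long_depth=30.0, short_depth=15.0)
print(required_yield_gb(plan))
```

Varying `n_spine` in a sketch like this shows quickly how much long-read yield each additional representative sample adds relative to the cohort-wide short-read cost.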
Phasing and trio-binning options
Phasing clarifies allelic structure and reduces graph complexity downstream. Trio-binning uses parental reads to separate haplotypes before assembly. Strand-seq and Hi-C add orthogonal signals for phasing and scaffolding. These methods improve structural accuracy and reduce false merges. The payoff is cleaner comparison across samples and fewer artefactual nodes in pan-genome graphs.
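The idea behind trio-binning can be shown with a toy example: find k-mers seen in only one parent, then assign each offspring read to the parent whose private k-mers it carries. This is a simplified sketch, not the algorithm of any specific tool. The function names are hypothetical, and it skips canonical k-mers, abundance filtering, and sequencing-error handling that a production workflow would need.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of k-mers in a sequence (no canonicalisation)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read: str, mat_only: set, pat_only: set, k: int) -> str:
    """Assign a read to the parent with more private k-mer hits."""
    read_kmers = kmers(read, k)
    mat_hits = len(read_kmers & mat_only)
    pat_hits = len(read_kmers & pat_only)
    if mat_hits > pat_hits:
        return "maternal"
    if pat_hits > mat_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy data: the parents differ only near the 3' end of the sequence.
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], k=5)
for read in ["TACGTAA", "TACGTCC", "ACGTACG"]:
    print(read, bin_read(read, mat_only, pat_only, k=5))
```

Reads that land in homozygous sequence stay unassigned, which is why binned assemblies still need those reads supplied to both haplotypes.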
Repeats and structural variants: what changes with each choice
Pan-genome analyses depend on accurate representation of repeats and SVs. Platform and assembly decisions determine how much of this landscape you capture.
Schematic of bridged vs. unbridged repeats across assembly graph models; HINGE separates bridged repeats while collapsing unbridged ones. (Kamath G.M. et al. (2017) Genome Research)
Repeat-rich regions
Centromeres, telomeres, rDNA clusters, and other tandem arrays challenge short reads. Long reads, especially ONT ultra-long, span these regions and prevent collapse. HiFi reads, while shorter than ultra-long ONT, still cross many segmental duplications and large transposon insertions. If your organism has high repeat content or recent duplications, bias the design toward longer reads and higher coverage.
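Why read length and coverage both matter for repeats can be made quantitative. A read bridges a repeat only if it covers the whole repeat plus some unique anchor sequence on each side. The sketch below estimates how many reads are expected to bridge one repeat, assuming reads start uniformly at random along the genome; the function names and the default anchor length are illustrative assumptions.

```python
def bridging_fraction(read_lengths, repeat_len, anchor=1000):
    """Fraction of reads long enough to bridge the repeat with both anchors."""
    if not read_lengths:
        return 0.0
    needed = repeat_len + 2 * anchor
    return sum(1 for length in read_lengths if length >= needed) / len(read_lengths)


def expected_bridging_reads(read_lengths, repeat_len, genome_size, anchor=1000):
    """Expected number of reads that fully bridge one specific repeat.

    A read of length L bridges the repeat from (L - needed + 1) start
    positions. With uniformly random starts, each start position has
    probability of about 1 / genome_size (edge effects are ignored).
    """
    needed = repeat_len + 2 * anchor
    placements = sum(max(0, length - needed + 1) for length in read_lengths)
    return placements / genome_size


# Toy read set: only the two longest reads can bridge a 15 kb repeat.
reads = [10_000, 20_000, 50_000]
print(bridging_fraction(reads, repeat_len=15_000))
print(expected_bridging_reads(reads, repeat_len=15_000, genome_size=1_000_000))
```

Applied to a real read-length distribution, the same calculation shows that a small fraction of very long reads can contribute most of the bridging power for large arrays.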
SV detection fidelity
Structural variants drive many functional differences between accessions. HiFi offers a strong balance of read length and accuracy for calling insertions, deletions, and inversions. ONT ultra-long reads span the largest and most complex rearrangements, which helps place breakpoints inside long repeats. Short reads contribute SNPs and small indels but under-detect complex or repeat-mediated events. For SV-centric studies, a long-read-first strategy or a hybrid design is the safest path.
QC that predicts downstream usability (beyond N50)
Headline contiguity metrics can mislead. Choose QC that correlates with graph quality, annotation stability, and comparative analyses.
Must-watch metrics
- Consensus QV: Captures base-level correctness better than read-level quality.
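- BUSCO completeness and duplication: Completeness signals gene-level representation; missing genes can point to collapsed or absent sequence, and elevated duplication can point to retained haplotype copies.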
- Phasing metrics: Phase block N50 indicates haplotype continuity for downstream graphs.
- Misassembly profiles: Evaluate structural breaks, chimeras, and indel patterns after polishing.
Track these metrics per sample and per batch. Consistent QC reduces downstream harmonisation costs and cuts rework.
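Two of the metrics above reduce to short formulas. N50 is the length at which the sorted, cumulative sum of lengths reaches half the total, and the same calculation applied to phase block lengths gives phase block N50. Consensus QV follows the Phred relation QV = -10·log10(error rate). The sketch below assumes you already have an error count from an external estimate; k-mer-based QV tools derive that estimate differently, and the function names here are illustrative.

```python
import math


def n50(lengths):
    """Length L such that pieces of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def consensus_qv(errors: int, bases: int) -> float:
    """Phred-scaled consensus quality from an estimated error count."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    if errors == 0:
        return math.inf  # no observed errors; the QV is unbounded
    return -10 * math.log10(errors / bases)


contig_lengths = [100, 80, 50, 30, 20]
phase_blocks = [40, 40, 10, 10]
print("contig N50:", n50(contig_lengths))
print("phase block N50:", n50(phase_blocks))
print("QV at 1 error per Mb:", round(consensus_qv(1, 1_000_000), 1))
```

Recording these values per sample in one table makes batch effects visible before they reach graph construction.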
Whole-genome phylogeny of Listeria monocytogenes shows topology consistency between reference and multiple hybrid assembly results. (Chen Z. et al. (2020) BMC Genomics)
When is T2T realistic?
True telomere-to-telomere assemblies remain demanding. You need very long reads, high coverage, and supportive scaffolding like Hi-C or Strand-seq. Even then, repeat biology can defeat a fully continuous build. For many species, "T2T-minus" is the realistic goal: telomere-complete for most chromosomes, with a handful of difficult regions called out and annotated. Set expectations early and document known gaps.
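A simple way to document "T2T-minus" status is to check each contig end for telomeric repeats. The sketch below counts a telomere motif and its reverse complement in a window at each end. It assumes the vertebrate motif TTAGGG (many plants use TTTAGGG instead), and the window size and repeat threshold are arbitrary illustrative defaults rather than validated cut-offs.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def telomere_ends(seq: str, motif: str = "TTAGGG",
                  window: int = 1000, min_repeats: int = 10):
    """Return (start_has_telomere, end_has_telomere) for one contig.

    Counts non-overlapping copies of the motif or its reverse complement
    inside a window at each end; either strand orientation is accepted.
    """
    seq = seq.upper()
    rc_motif = reverse_complement(motif)

    def has_telomere(chunk: str) -> bool:
        return max(chunk.count(motif), chunk.count(rc_motif)) >= min_repeats

    return has_telomere(seq[:window]), has_telomere(seq[-window:])


# Toy contigs: one capped at both ends, one capped only at the 3' end.
both = "CCCTAA" * 20 + "ACGT" * 100 + "TTAGGG" * 20
one_sided = "ACGT" * 100 + "TTAGGG" * 20
print(telomere_ends(both, window=200))
print(telomere_ends(one_sided, window=200))
```

Reporting this pair of flags per chromosome gives a compact, honest summary of which ends are complete and which remain open.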
Budget-to-design playbook
Every program faces trade-offs between depth, breadth, and timeline. Use these tiered recipes to map design to goals.
Entry tier (surveys and polishing)
Start with short-read sequencing for broad coverage across the cohort. Add targeted long-read sequencing for a subset of representative or phenotypically extreme samples. Use the long-read assemblies to anchor repeats and calibrate SV expectations. Use short reads to polish and to scale allele frequency estimates. This tier is ideal for method exploration, pilot studies, and early pan-genome scoping.
What you get: Reliable SNPs and small indels across many samples. Preliminary SV maps from the long-read subset. Solid polishing for draft assemblies.
Balanced tier (most eukaryotes)
Use PacBio HiFi for primary de novo assemblies. Add Illumina short reads for final polishing where needed. Integrate Hi-C to scaffold and phase, especially for larger genomes. This tier balances accuracy, contiguity, and cost for many plant and animal species.
What you get: High-quality assemblies with strong BUSCO scores and stable annotations. Robust SV detection with clear breakpoints. Phasing support sufficient for graph construction and GWAS inputs.
Premium tier (repeat-rich or T2T-aspiring)
Combine ONT ultra-long reads with HiFi for consensus accuracy. Add Hi-C and, if possible, Strand-seq for scaffolding and chromosome-scale phasing. Apply iterative polishing with technology-matched tools. This tier is for programs that must interrogate centromeres, long arrays, and complex rearrangements, or that aim for near T2T results.
What you get: Maximum repeat resolution and the best chance at telomere-complete chromosomes. High-fidelity consensus after polishing. Graph-ready genomes with minimal artefacts.
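The three tiers can be kept as a small lookup so that design decisions are recorded alongside the project. The sketch below encodes the data types named in each tier and a deliberately simplified selection rule. The two yes/no inputs and the function name `suggest_tier` are assumptions made for illustration; real planning also weighs budget, genome size, and timeline.

```python
# Data types per tier, taken from the tier descriptions above.
TIERS = {
    "entry": ("Illumina short reads (full cohort)",
              "targeted long reads (representative subset)"),
    "balanced": ("PacBio HiFi (primary assembly)",
                 "Illumina short reads (polishing where needed)",
                 "Hi-C (scaffolding and phasing)"),
    "premium": ("ONT ultra-long reads",
                "PacBio HiFi (consensus accuracy)",
                "Hi-C",
                "Strand-seq (if possible)"),
}


def suggest_tier(repeat_rich_or_t2t: bool, per_sample_assembly: bool) -> str:
    """Pick a tier from two simplified yes/no questions (illustrative rule)."""
    if repeat_rich_or_t2t:
        return "premium"
    if per_sample_assembly:
        return "balanced"
    return "entry"


tier = suggest_tier(repeat_rich_or_t2t=False, per_sample_assembly=True)
print(tier, "->", ", ".join(TIERS[tier]))
```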
From assemblies to pan-genome graphs
Assemblies are inputs, not the final product. Consistency across samples determines graph clarity and downstream statistics.
Pan genomes of the hybrid assemblies of Salmonella Typhimurium LT2 with simulated Illumina short reads and mediocre- or low-quality Oxford Nanopore long reads. (Chen Z. et al. (2020) BMC Genomics)
Harmonisation and versioning
Use a common annotation workflow across all assemblies. Keep gene models consistent and track versions rigorously. Standardise file formats and reference naming. These simple steps prevent false differences from creeping into the graph. They also make downstream orthology and presence–absence calls repeatable.
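Standardised sequence naming is easy to automate. The sketch below rewrites FASTA headers into a `sample#haplotype#contig` pattern, which follows the PanSN-style convention used by some pan-genome graph tools. Treat the delimiter and the decision to drop header descriptions as assumptions to adapt to your own pipeline; the function name and sample label are hypothetical.

```python
def rename_fasta_headers(lines, sample: str, haplotype, delim: str = "#"):
    """Yield FASTA lines with headers rewritten as sample#haplotype#contig.

    Only the first whitespace-separated token of each header is kept as the
    contig name; any free-text description is discarded.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("empty FASTA header")
            yield f">{sample}{delim}{haplotype}{delim}{parts[0]}"
        else:
            yield line


fasta = [">chr1 primary assembly", "ACGT", ">chr2", "GGCC"]
for out_line in rename_fasta_headers(fasta, sample="sampleA", haplotype=1):
    print(out_line)
```

Because the function works on any iterable of lines, it can wrap an open file handle and stream large assemblies without loading them into memory.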
Graph-aware genotyping at scale
Build graph references that preserve true SV paths without inflating spurious nodes. Plan pipelines that output presence/absence variation, copy number, and balanced SV genotypes. These feed association studies, comparative analyses, and trait discovery. Graph-aware genotyping also stabilises coordinate systems when new assemblies arrive, cutting reprocessing overhead.
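Presence/absence output is usually summarised before it feeds association or comparative analyses. The sketch below classifies each feature from a presence/absence matrix by how many samples carry it. The category labels and the strict "present in every sample" definition of core are illustrative assumptions; many studies use a softer frequency threshold instead.

```python
def classify_pav(pav: dict) -> dict:
    """Classify features from a presence/absence matrix.

    `pav` maps feature -> {sample: 0 or 1}. Categories (illustrative):
    core = present in all samples, private = present in exactly one,
    accessory = anything in between, absent = present in none.
    """
    classes = {}
    for feature, calls in pav.items():
        n_present = sum(1 for present in calls.values() if present)
        if n_present == 0:
            classes[feature] = "absent"
        elif n_present == len(calls):
            classes[feature] = "core"
        elif n_present == 1:
            classes[feature] = "private"
        else:
            classes[feature] = "accessory"
    return classes


pav = {
    "geneA": {"s1": 1, "s2": 1, "s3": 1},
    "geneB": {"s1": 1, "s2": 0, "s3": 1},
    "geneC": {"s1": 0, "s2": 0, "s3": 1},
}
print(classify_pav(pav))
```

Because false absences from collapsed or fragmented assemblies inflate the accessory and private classes, these counts are also a useful sanity check on assembly consistency.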
Practical recommendations and next steps
- Define the biological question first. Then choose short reads, long reads, or a hybrid to match it.
- For SV-centric work, favour long-read-first or hybrid designs.
- Track QC beyond N50. Consensus QV, BUSCO, and phasing metrics predict success.
- Treat T2T as aspirational. Plan for T2T-minus with honest gap annotation.
- Harmonise annotations and formats early to protect downstream graph quality.
- Use a "spine and ribs" model for scale: long-read assemblies for the core set, short reads for the broader cohort.
How we can help (research use only)
CD Genomics supports pan-genome sequencing, hybrid assembly, and graph-aware analysis for research institutions and companies. We help design sampling plans, select platforms, build high-quality assemblies, and standardise annotations for graph construction. Services are for non-clinical research only and are not provided to individuals. If you'd like a method review or a tailored build recipe, contact our team with your organism, cohort size, and target analyses.
Related reading:
- Pan-genome: Definition, Classification, and Why It Matters (2025 Guide)
- Pangenome Explained: History, Sequencing Strategies, Databases, and Applications
- Graph-based Pan-genome & Structural Variants 101
- Pan-genome Sampling Strategy: Humans, Animals, and Plants
- Pan-genome Pipeline Deep Dive: From Annotation Harmonization to Orthology
- Pan-genome Tools at a Glance: Panaroo, Roary, PPanGGOLiN, PanX
References
- Wick, R.R., Judd, L.M., Holt, K.E. Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing. PLOS Computational Biology 19, e1010905 (2023).
- El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M. Next-Generation Sequence Assembly: Four Stages of Data Processing and Computational Challenges. PLOS Computational Biology 9, e1003345 (2013).
- Chen, Z., Erickson, D.L., Meng, J. Benchmarking hybrid assembly approaches for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing. BMC Genomics 21, 631 (2020).
- Kamath, G.M., Shomorony, I., Xia, F., Courtade, T.A., Tse, D.N. HINGE: Long-read assembly achieves optimal repeat resolution. Genome Research 27, 747–756 (2017).
- Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021).
- Nurk, S., Koren, S., Rhie, A. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
- Kolmogorov, M., Yuan, J., Lin, Y., Pevzner, P.A. Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology 37, 540–546 (2019).