Pan-genome sequencing brings population diversity into a unified framework. Platform and assembly choices decide what you can actually see. This guide compares long-read vs short-read data for pan-genome sequencing, explains when a hybrid assembly wins, and outlines realistic paths toward T2T-like results. We focus on how each route affects repeat resolution and structural variant discovery. You will find clear build recipes, practical QC, and planning tips for human, animal, and crop cohorts. All recommendations target non-clinical, research-use projects.
Illustrated overview of our recommended approach to perfect bacterial whole-genome assembly. (Wick R.R. et al. (2023) PLOS Computational Biology)
Choosing a sequencing platform is not a brand decision. It is a decision about what biology you can resolve at a given cost.
Short reads are inexpensive and accurate per base. They scale to large cohorts and support precise polishing. Yet they collapse long repeats and under-detect complex structural variants. Long reads provide the span to connect repeats, phase haplotypes, and resolve breakpoints. Reaching the same coverage and per-base accuracy with long reads costs more, but the data reveal regions that short reads cannot disambiguate.
Short-read sequencing remains the workhorse for many projects. It delivers consistent coverage and sharp base accuracy. Use it to polish assemblies, estimate allele frequencies across large cohorts, and power variant counting at population scale. Expect limitations in highly repetitive regions. Expect missed inversions, complex insertions, and translocations that require long-range information. Short reads still earn a place in most hybrid designs as a cost-effective accuracy booster.
PacBio HiFi reads typically run 15–25 kb with high single-molecule accuracy. They strike a strong balance for eukaryotic pan-genomes. HiFi assemblies handle segmental duplications better than short reads and support reliable structural variant (SV) calls. They also enable robust haplotype phasing without extreme read lengths. For many teams, HiFi is the default long-read option when accuracy and contiguity must both be high.
Oxford Nanopore Technologies (ONT) offers the longest read lengths available. Ultra-long reads can span centromeres, rDNA arrays, and complex tandem repeats. Duplex sequencing improves accuracy and consensus QV, shrinking the historical gap with HiFi. ONT shines when your question depends on spanning distance rather than marginal base accuracy. It is a powerful path toward near-T2T assemblies, especially when combined with polishing and scaffolding data.
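The cost difference between platforms is easier to reason about once depth is turned into read counts. The sketch below applies the standard relationship (total bases = genome size × depth) to estimate how many reads a target depth requires. The 1 Gb genome, the 30x target, and the read lengths are illustrative assumptions, not recommendations from this guide.

```python
def reads_needed(genome_size_bp: int, target_depth: int, mean_read_length_bp: int) -> int:
    """Reads required so that total sequenced bases = genome size x depth."""
    total_bases = genome_size_bp * target_depth
    # Ceiling division keeps the estimate from undershooting the target.
    return -(-total_bases // mean_read_length_bp)


genome = 1_000_000_000  # hypothetical 1 Gb genome
print(reads_needed(genome, 30, 150))     # short reads, 150 bp each
print(reads_needed(genome, 30, 20_000))  # long reads, 20 kb each
```

Mean depth says nothing about evenness of coverage or usable yield after filtering, so treat the result as a lower bound when budgeting.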
Match the assembly approach to your data and your goals. Do not force a tool to solve a different problem.
A four-stage framework for NGS assembly: preprocessing, graph construction, graph simplification, and postprocessing/scaffolding. (El-Metwally S. et al. (2013) PLOS Computational Biology)
If budget allows, build the assembly primarily from long reads. Use assemblers tuned to your platform, such as hifiasm for HiFi or Flye for ONT. Long-read-first assemblies produce higher contiguity and fewer collapsed repeats. Follow with platform-matched polishing to correct small errors. Consider a round of short-read polishing only when you have high-quality short-read coverage and a clear error profile to fix.
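As a minimal sketch of matching the assembler to the platform, the helper below builds a command line for hifiasm or Flye without running it. The file names, thread count, and the function name `assembler_command` are hypothetical. The flags shown are basic ones from each tool's usage text, and the choice of Flye's `--nano-hq` mode assumes recent high-accuracy ONT basecalls; check the documentation for your installed versions before relying on them.

```python
import shlex


def assembler_command(platform, reads, out, threads=16):
    """Return an argv list for a long-read assembler matched to the platform."""
    if platform == "hifi":
        # hifiasm names its output files using the prefix given to -o.
        return ["hifiasm", "-o", out, "-t", str(threads), reads]
    if platform == "ont":
        # Flye writes into a directory; --nano-hq targets higher-accuracy ONT reads.
        return ["flye", "--nano-hq", reads, "--out-dir", out, "--threads", str(threads)]
    raise ValueError(f"no assembler configured for platform: {platform!r}")


# Print the command for review; pass the list to subprocess.run to execute it.
print(shlex.join(assembler_command("hifi", "sample1.hifi.fq.gz", "sample1.asm")))
```

Building the argument list in one place makes it easier to log the exact command per sample, which helps keep builds comparable across a cohort.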
Hybrid assembly combines long reads for structure and short reads for accuracy. This route often delivers the best cost-to-quality ratio. Long reads establish contiguity across repeats. Short reads correct indels and mismatches efficiently. Hybrids are ideal when you need better SV resolution than short reads alone can provide, but cannot fund long-read-only coverage for every sample. Many pan-genome teams adopt a "spine and ribs" model: long-read assemblies for a representative subset, short-read data for the wider cohort.
Phasing clarifies allelic structure and reduces graph complexity downstream. Trio-binning uses parental reads to separate haplotypes before assembly. Strand-seq and Hi-C add orthogonal signals for phasing and scaffolding. These methods improve structural accuracy and reduce false merges. The payoff is cleaner comparison across samples and fewer artefactual nodes in pan-genome graphs.
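The idea behind trio-binning can be sketched in a few lines: find k-mers seen in only one parent, then assign each child read to the parent whose private k-mers it carries. This is a toy illustration under simplifying assumptions (tiny k, exact matching, no reverse-complement canonicalisation, no error filtering). Production tools work on k-mer counts from full parental read sets, and the function names here are hypothetical.

```python
def kmers(seq, k):
    """All k-mers of a sequence as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """K-mers found in exactly one parent's reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a child read by the majority of parent-specific k-mer hits."""
    read_kmers = kmers(read, k)
    mat_hits = len(read_kmers & mat_only)
    pat_hits = len(read_kmers & pat_only)
    if mat_hits > pat_hits:
        return "maternal"
    if pat_hits > mat_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], k=5)
for child_read in ["TACGTAA", "TACGTCC", "GTACGT"]:
    print(child_read, bin_read(child_read, mat_only, pat_only, k=5))
```

Reads that fall in regions where the parents are identical stay unassigned, which is why binning works best when parental haplotypes are reasonably divergent.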
Pan-genome analyses depend on accurate representation of repeats and SVs. Platform and assembly decisions determine how much of this landscape you capture.
Schematic of bridged vs. unbridged repeats across assembly graph models; HINGE separates bridged repeats while collapsing unbridged ones. (Kamath G.M. et al. (2017) Genome Research)
Centromeres, telomeres, rDNA clusters, and other tandem arrays challenge short reads. Long reads, especially ONT ultra-long, span these regions and prevent collapse. HiFi reads, while shorter than ultra-long ONT, still cross many segmental duplications and large transposon insertions. If your organism has high repeat content or recent duplications, bias the design toward longer reads and higher coverage.
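A quick way to gauge whether a read set can bridge a given repeat is to ask what share of reads is long enough to cover the repeat plus unique flanking sequence on both sides. The sketch below does only that. The 1 kb anchor and the example read lengths are assumptions, and the result is an upper bound because a sufficiently long read must also happen to land across the repeat.

```python
def spanning_fraction(read_lengths, repeat_len, anchor=1000):
    """Fraction of reads long enough to cover a repeat plus both flanking anchors."""
    if not read_lengths:
        return 0.0
    needed = repeat_len + 2 * anchor
    long_enough = sum(1 for length in read_lengths if length >= needed)
    return long_enough / len(read_lengths)


# Hypothetical read lengths (bp) and a 10 kb repeat.
lengths = [5_000, 12_000, 20_000, 150]
print(spanning_fraction(lengths, repeat_len=10_000))
```

Running this against the read-length distribution from a pilot run, for the repeat sizes you expect in your organism, gives an early signal on whether to push for longer reads.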
Structural variants drive many functional differences between accessions. HiFi offers a strong balance of read length and accuracy for calling insertions, deletions, and inversions. ONT ultra-long maximises breakpoint resolution and spans complex rearrangements. Short reads contribute for small indels and SNPs but under-detect complex or repeat-mediated events. For SV-centric studies, a long-read-first strategy or a hybrid design is the safest path.
Headline contiguity metrics can mislead. Choose QC that correlates with graph quality, annotation stability, and comparative analyses.
Track these metrics per sample and per batch. Consistent QC reduces downstream harmonisation costs and cuts rework.
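To illustrate why a headline contiguity number can mislead, the sketch below computes N50 for a set of contig lengths and then for the same set after two contigs are wrongly joined. The contig lengths are invented for the example.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


correct = [100, 80, 50, 30, 20, 20]
misjoined = [180, 50, 30, 20, 20]  # the 100 and 80 contigs falsely merged

print(n50(correct))
print(n50(misjoined))
```

The misjoined assembly scores higher on N50 despite being less correct, which is the reason to pair contiguity with measures of completeness and accuracy rather than reporting it alone.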
Whole-genome phylogeny of Listeria monocytogenes shows topology consistency between reference and multiple hybrid assembly results. (Chen Z. et al. (2020) BMC Genomics)
True telomere-to-telomere assemblies remain demanding. You need very long reads, high coverage, and supporting scaffolding data such as Hi-C or Strand-seq. Even then, repeat biology can defeat a fully continuous build. For many species, "T2T-minus" is the realistic goal: telomere-complete for most chromosomes, with a handful of difficult regions called out and annotated. Set expectations early and document known gaps.
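One simple way to document which chromosomes are telomere-complete is to look for telomeric repeat arrays at both ends of each sequence. The sketch below counts motif copies in a window at each end. The default motif TTAGGG is the canonical vertebrate repeat (many plants use TTTAGGG instead), and the window size and repeat threshold are arbitrary assumptions to tune for your organism.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_telomere(window, motif, min_repeats):
    """True if the window holds enough copies of the motif on either strand."""
    window = window.upper()
    hits = max(window.count(motif), window.count(revcomp(motif)))
    return hits >= min_repeats


def telomere_status(seq, motif="TTAGGG", window=600, min_repeats=20):
    """Return (start_has_telomere, end_has_telomere) for one sequence."""
    return (
        has_telomere(seq[:window], motif, min_repeats),
        has_telomere(seq[-window:], motif, min_repeats),
    )


# Synthetic examples: telomere repeats at both ends, then at one end only.
both_ends = "CCCTAA" * 30 + "ACGT" * 500 + "TTAGGG" * 30
one_end = "ACGT" * 500 + "TTAGGG" * 30
print(telomere_status(both_ends))
print(telomere_status(one_end))
```

Recording this pair of flags per chromosome gives a compact, reviewable list of known gaps for a "T2T-minus" release.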
Every program faces trade-offs between depth, breadth, and timeline. Use these tiered recipes to map design to goals.
Start with short-read sequencing for broad coverage across the cohort. Add targeted long-read sequencing for a subset of representative or phenotypically extreme samples. Use the long-read assemblies to anchor repeats and calibrate SV expectations. Use short reads to polish and to scale allele frequency estimates. This tier is ideal for method exploration, pilot studies, and early pan-genome scoping.
What you get: Reliable SNPs and small indels across many samples. Preliminary SV maps from the long-read subset. Solid polishing for draft assemblies.
Use PacBio HiFi for primary de novo assemblies. Add Illumina short reads for final polishing where needed. Integrate Hi-C to scaffold and phase, especially for larger genomes. This tier balances accuracy, contiguity, and cost for many plant and animal species.
What you get: High-quality assemblies with strong BUSCO scores and stable annotations. Robust SV detection with clear breakpoints. Phasing support sufficient for graph construction and GWAS inputs.
Combine ONT ultra-long reads with HiFi for consensus accuracy. Add Hi-C and, if possible, Strand-seq for scaffolding and chromosome-scale phasing. Apply iterative polishing with technology-matched tools. This tier is for programs that must interrogate centromeres, long arrays, and complex rearrangements, or that aim for near T2T results.
What you get: Maximum repeat resolution and the best chance at telomere-complete chromosomes. High-fidelity consensus after polishing. Graph-ready genomes with minimal artefacts.
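The three tiers can be summarised as a small decision helper. This is only a restatement of the recipes above in code form. The function name, the two yes/no inputs, and the returned labels are hypothetical simplifications, and a real design would also weigh budget, genome size, and cohort size.

```python
def recommend_tier(needs_near_t2t: bool, needs_robust_sv: bool) -> str:
    """Map two project requirements onto the tiered recipes described above."""
    if needs_near_t2t:
        # Centromeres, long arrays, or near-T2T goals call for the longest reads.
        return "ONT ultra-long + HiFi, with Hi-C and optional Strand-seq"
    if needs_robust_sv:
        # Accurate long-read assemblies give clear SV breakpoints.
        return "HiFi primary assembly, Illumina polishing where needed, Hi-C scaffolding"
    # Pilot and scoping work: breadth first, long reads for a subset.
    return "Short reads across the cohort, long reads for a representative subset"


print(recommend_tier(needs_near_t2t=False, needs_robust_sv=True))
```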
Assemblies are inputs, not the final product. Consistency across samples determines graph clarity and downstream statistics.
Pan genomes of the hybrid assemblies of Salmonella Typhimurium LT2 with simulated Illumina short reads and mediocre- or low-quality Oxford Nanopore long reads. (Chen Z. et al. (2020) BMC Genomics)
Use a common annotation workflow across all assemblies. Keep gene models consistent and track versions rigorously. Standardise file formats and reference naming. These simple steps prevent false differences from creeping into the graph. They also make downstream orthology and presence–absence calls repeatable.
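Standardised sequence naming is one of the easier steps to automate. The sketch below renames FASTA headers to a `sample#haplotype#contig` pattern, following the PanSN naming convention used by several pan-genome graph tools. The choice of PanSN is an assumption (this guide does not prescribe a scheme), the sample and contig names are invented, and every header is assumed to carry a sequence ID.

```python
def pansn_name(sample, haplotype, contig, delim="#"):
    """Build a sample#haplotype#contig identifier, rejecting embedded delimiters."""
    parts = (str(sample), str(haplotype), str(contig))
    for part in parts:
        if delim in part:
            raise ValueError(f"{part!r} already contains the delimiter {delim!r}")
    return delim.join(parts)


def rename_fasta_headers(lines, sample, haplotype):
    """Rewrite header lines; keep the first word as the contig ID, drop the rest."""
    renamed = []
    for line in lines:
        if line.startswith(">"):
            contig = line[1:].split()[0]
            renamed.append(">" + pansn_name(sample, haplotype, contig))
        else:
            renamed.append(line)
    return renamed


fasta = [">contig_1 len=8", "ACGTACGT", ">contig_2 len=4", "GGCC"]
for line in rename_fasta_headers(fasta, sample="accessionA", haplotype=1):
    print(line)
```

Applying the same renaming step to every assembly before graph construction means sample and haplotype can always be recovered from a path name.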
Build graph references that preserve true SV paths without inflating spurious nodes. Plan pipelines that output presence/absence variation, copy number, and balanced SV genotypes. These feed association studies, comparative analyses, and trait discovery. Graph-aware genotyping also stabilises coordinate systems when new assemblies arrive, cutting reprocessing overhead.
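Presence/absence variation is usually summarised by sorting genes into frequency classes. The sketch below takes a gene-by-sample matrix of 0/1 calls and labels each gene as core, accessory, private, or absent. The class names and strict thresholds are one common convention rather than a fixed standard (some projects use a softer core cutoff), and the gene names and calls are invented.

```python
def classify_gene(calls):
    """Label one gene from its per-sample presence calls (1 = present, 0 = absent)."""
    present = sum(calls)
    if present == 0:
        return "absent"
    if present == len(calls):
        return "core"
    if present == 1:
        return "private"
    return "accessory"


def classify_pav(matrix):
    """Apply classify_gene to every row of a gene -> calls mapping."""
    return {gene: classify_gene(calls) for gene, calls in matrix.items()}


pav = {
    "geneA": [1, 1, 1],
    "geneB": [1, 0, 1],
    "geneC": [0, 0, 1],
}
print(classify_pav(pav))
```

Because the classes depend on which samples are included, rerun the classification whenever new assemblies join the graph.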
CD Genomics supports pan-genome sequencing, hybrid assembly, and graph-aware analysis for research institutions and companies. We help design sampling plans, select platforms, build high-quality assemblies, and standardise annotations for graph construction. Services are for non-clinical research only and are not provided to individuals. If you'd like a method review or a tailored build recipe, contact our team with your organism, cohort size, and target analyses.
Related reading:
References