Genome Assembly Strategy Solution

Genome assembly strategy decision framework

Explore how assembly level, genome complexity, sample quality, sequencing strategy, QC, annotation, and downstream analysis connect.

Start with the Assembly Level Your Research Actually Needs

Many genome assembly pages begin by comparing sequencing platforms. In practice, the better starting point is the assembly level. A microbial genome, a first draft for a non-model species, a chromosome-level plant genome, and a haplotype-resolved animal genome do not need the same plan.

Before recommending a workflow, we help you define what the final assembly must support. A project for gene discovery may need a different assembly and annotation strategy from a project focused on trait mapping, structural variation, pan-genome construction, or population genomics.

Draft or contig-level assembly for early genome resources

A draft or contig-level assembly may be suitable when your goal is an early reference resource, broad gene discovery, microbial genome reconstruction, or preliminary comparative analysis. It can be useful when the genome is compact, the research question does not require chromosome-scale ordering, or the project is designed as a first step before deeper analysis.

This level can provide a practical starting point, but it may not fully support linkage analysis, chromosome-scale structural interpretation, or complex repeat resolution.

Chromosome-level assembly for linkage, trait, and comparative studies

Chromosome-level assembly is often needed when genomic position matters. This includes trait mapping, breeding research, chromosome evolution, synteny analysis, comparative genomics, and many plant or animal genome projects.

Hi-C scaffolding can help order and orient assembled contigs into chromosome-scale scaffolds. For projects where the final assembly will support downstream mapping or comparative work, chromosome-level structure can be more valuable than a fragmented assembly with high local accuracy but limited long-range organization.

Haplotype-resolved assembly for heterozygous or polyploid genomes

For highly heterozygous, outbred, hybrid, or polyploid organisms, a single collapsed representation may hide important allele- or haplotype-specific structure. In these cases, haplotype-resolved or phased assembly may be useful.

This strategy can be important for plant and animal breeding, allele-specific gene discovery, structural variation analysis, and projects where subgenomes or homologous chromosomes need careful interpretation.

T2T-like assembly when repeats, centromeres, and telomeres matter

A T2T-like strategy may be considered when unresolved gaps, long repeats, centromeric regions, telomeric regions, or complex structural regions are central to the study. This is usually a higher-demand project type because it depends heavily on sample quality, read length, assembly strategy, and manual or custom review.

Not every project needs a T2T-like assembly. We help you decide whether this level of resolution is necessary for your research question or whether a chromosome-level or phased assembly would be more practical.
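As a rough illustration only, the four assembly levels described above can be read as a simple decision order, checked from the most demanding level down. The sketch below is a simplification for discussion, not our actual review procedure: the `ProjectProfile` fields and the `suggest_assembly_level` function are hypothetical names, and real projects often combine levels (for example, a chromosome-level assembly that is also haplotype-resolved).

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    """Hypothetical intake summary; field names are illustrative assumptions."""
    needs_chromosome_position: bool          # trait mapping, synteny, linkage
    highly_heterozygous_or_polyploid: bool   # outbred, hybrid, polyploid
    repeats_or_centromeres_central: bool     # centromeres, telomeres, long repeats


def suggest_assembly_level(profile: ProjectProfile) -> str:
    """Return a starting point for discussion, most demanding level first."""
    if profile.repeats_or_centromeres_central:
        return "T2T-like"
    if profile.highly_heterozygous_or_polyploid:
        return "haplotype-resolved"
    if profile.needs_chromosome_position:
        return "chromosome-level"
    return "draft or contig-level"


# Example: a trait-mapping project in an inbred line.
profile = ProjectProfile(
    needs_chromosome_position=True,
    highly_heterozygous_or_polyploid=False,
    repeats_or_centromeres_central=False,
)
print(suggest_assembly_level(profile))
```

The output is only a first suggestion. Sample quality and downstream goals still need to be reviewed before a level is fixed.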

Match the Sequencing Strategy to Genome Complexity

Once the target assembly level is clear, the sequencing strategy becomes easier to design. Different genomes require different evidence layers. Genome size, repeat content, ploidy, heterozygosity, contamination risk, and sample quality all affect the final assembly plan.
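Genome size feeds directly into data planning through the usual depth relationship: average depth equals total sequenced bases divided by genome size. The sketch below shows that arithmetic. The function names are hypothetical, and the 30x figure in the example is a placeholder for illustration, not a recommended depth for any platform or genome.

```python
def required_yield_gb(genome_size_gb: float, target_depth: float) -> float:
    """Sequencing yield (in Gb) needed to reach target_depth across the genome."""
    if genome_size_gb <= 0 or target_depth <= 0:
        raise ValueError("genome size and target depth must be positive")
    return genome_size_gb * target_depth


def achieved_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Average depth obtained from a given yield (in Gb)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Placeholder values: a 0.5 Gb genome at an illustrative 30x target.
print(required_yield_gb(0.5, 30))   # 15.0 Gb of reads
print(achieved_depth(15.0, 0.5))    # 30.0x average depth
```

If the genome size estimate is uncertain, running the same calculation with a low and a high estimate gives a range for planning.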

When PacBio HiFi is the accuracy anchor

PacBio SMRT Sequencing can support genome assembly projects that require long-read evidence with high consensus accuracy. PacBio HiFi reads are often valuable for de novo genome assembly because they combine long-read structure with high per-read accuracy.

PacBio HiFi can be especially useful when the project needs reliable consensus quality, strong gene-space recovery, and a clean foundation for annotation.

When Nanopore ultra-long reads help bridge repeats

Nanopore sequencing can be useful when the genome contains long repeats, large structural regions, or gaps that require longer spanning evidence. For some complex genomes, ultra-long reads can help bridge regions that shorter reads cannot resolve.

Nanopore data may also be considered in T2T-like strategies or projects where read length is a major advantage.

When Hi-C is needed for chromosome-scale scaffolding

Hi-C Sequencing Service provides long-range contact information that can help order and orient contigs into chromosome-scale scaffolds. It is especially relevant when the final assembly needs chromosome-level structure.

Hi-C is not an optional add-on. When the research goal depends on chromosome-scale organization, Hi-C or another long-range scaffolding approach may be a key part of the strategy.

When short-read polishing or hybrid evidence still helps

Short-read sequencing can still have value in assembly projects. It may support polishing, local correction, contamination review, variant evaluation, or complementary analysis depending on the project design.

A hybrid strategy can be useful when one data type does not answer every question. The point is not to include every technology, but to combine the evidence layers that fit the genome and downstream goal.

What We Review Before Recommending an Assembly Plan

A good assembly plan starts with risk review. We do not want to recommend a high-end workflow that the sample cannot support, or a minimal workflow that cannot answer the downstream research question.

Species, genome size, ploidy, and heterozygosity

We first review the organism and any available genome information. Useful details include estimated genome size, ploidy, known repeat content, expected heterozygosity, related reference genomes, and whether the species is domesticated, wild, hybrid, inbred, outbred, or polyploid.

These details help determine whether the project may need a standard de novo assembly, chromosome-level scaffolding, haplotype-resolved assembly, or a more advanced strategy.

HMW DNA quality and sample risk

High molecular weight DNA quality is one of the most important factors in long-read genome assembly. Tissue type, preservation method, extraction difficulty, DNA fragment size, contaminants, polysaccharides, polyphenols, microbial contamination, and sample age can all affect library construction and assembly continuity.

For difficult samples, we review feasibility before locking the assembly strategy.

Existing sequencing data or draft assemblies

Some projects begin with existing data. You may already have short reads, PacBio reads, Nanopore reads, Hi-C data, or a fragmented draft assembly.

In these cases, we can help evaluate whether the data can be reused, improved, scaffolded, polished, annotated, or integrated into a revised assembly workflow.

Downstream goals that affect assembly design

Downstream goals should shape the assembly plan. A genome intended for gene annotation may need different priorities than one intended for structural variation, pan-genome analysis, genome-wide association, population genomics, or breeding marker development.

We review these goals early so the assembly is designed as a usable genome resource, not just a FASTA file.

Genome Assembly Strategy Options Compared

The best genome assembly strategy depends on both the genome and the research goal. The table below summarizes common options and how we help position them.

Strategy	Best-fit use case	Sample requirement sensitivity	Genome complexity fit	QC considerations	Downstream readiness
Short-read draft assembly	Compact genomes, early screening, simple microbial projects, or preliminary resources	Moderate; shorter DNA fragments may be acceptable depending on project	Limited for high repeats, large genomes, and complex structure	Needs coverage review, contamination check, and assembly completeness review	May support basic gene discovery or microbial analysis, but limited for complex downstream structure
PacBio HiFi assembly	Accurate de novo assembly, gene-space recovery, reference genome construction, phasing-ready projects	Requires suitable high-quality DNA	Strong for many plant, animal, fungal, and non-model genomes	Evaluate contig N50, BUSCO, QV, completeness, contamination, and phasing if applicable	Strong foundation for annotation, comparative genomics, and many reference genome projects
Nanopore long-read or ultra-long assembly	Repeat-rich regions, long structural regions, gap closure, T2T-like strategies	Highly sensitive to HMW DNA quality and fragment length	Strong when long-read span is critical	Evaluate read length, coverage, consensus quality, repeat resolution, and polishing strategy	Useful for complex structure, gap resolution, and long-range genome architecture
Hi-C scaffolding	Chromosome-level assembly, linkage, synteny, breeding, comparative genomics	Requires suitable material for Hi-C library preparation	Strong for ordering and orienting contigs into chromosome-scale scaffolds	Evaluate contact map quality, scaffold accuracy, misjoins, and chromosome assignment	Important for chromosome-level downstream work
Hybrid assembly	Projects needing complementary accuracy, continuity, polishing, or long-range evidence	Depends on all data types included	Flexible for complex or high-value projects	Requires careful integration and cross-platform QC	Strong when the assembly must support multiple downstream uses
Haplotype-resolved assembly	Heterozygous, hybrid, outbred, or polyploid organisms	Requires strong data quality and sufficient coverage	Strong when allele-specific or subgenome-aware interpretation matters	Evaluate phasing accuracy, haplotype separation, duplication, and completeness	Useful for breeding, allele-specific analysis, SV, and complex genome interpretation
T2T-like assembly	Centromeres, telomeres, long repeats, unresolved gaps, premium reference resources	Very sensitive to sample quality, read length, and data design	Strong for difficult repetitive regions when supported by data	Evaluate gap closure, repeat resolution, QV, manual review, and structural consistency	Useful for high-end reference projects and repeat-centric research
Microbial, fungal, or compact genome assembly	Bacterial, fungal, viral, plasmid, or engineered strain genomes	Often less demanding than large eukaryotic genomes, but contamination control is important	Suitable for compact genomes; strategy depends on plasmids, repeats, and genome structure	Evaluate circularization, contamination, plasmids, completeness, and annotation quality	Useful for strain characterization, comparative genomics, and synthetic biology research

End-to-End Workflow from Sample Review to Usable Genome Resource

From sample feasibility review to sequencing design, assembly, QC, annotation, and downstream-ready files

Genome assembly workflow with sequencing strategy and QC checkpoints

A genome assembly project moves through several technical and decision checkpoints. We build the workflow around the final assembly level and the downstream research goal.

First, we review the organism, sample type, preservation method, expected DNA quality, and risk factors. For long-read assembly, high molecular weight DNA is often critical. When sample risk is high, we discuss options before sequencing begins.

Next, based on the target assembly level, we recommend the data types needed. This may include PacBio HiFi, Oxford Nanopore, Hi-C, short-read polishing, or a hybrid design. The sequencing plan should match the genome complexity rather than follow a fixed template.

The assembly workflow itself may include contig assembly, polishing, haplotype separation, scaffold construction, Hi-C-based ordering and orientation, gap review, and T2T-like refinement when appropriate.

Finally, when included in the project, we support repeat annotation, gene prediction, functional annotation, and downstream bioinformatics. The final output can include assembly files, annotation files, QC reports, visual summaries, and project documentation.

Sample Requirements and Project Intake Information

Sample quality has a direct effect on genome assembly strategy. Long-read assembly, chromosome-level scaffolding, haplotype-aware projects, and T2T-like workflows may require different sample and data planning.

Final requirements depend on species, genome size, ploidy, tissue type, preservation method, platform choice, and target assembly level. Before project confirmation, our team reviews the information below and recommends the most suitable workflow.

Sample or input type	What we review	Quality focus	Required project information	Typical QC checkpoints	Notes
Fresh or frozen tissue for HMW DNA	Tissue type, preservation, expected DNA yield, contamination risk	Long fragment DNA suitable for long-read libraries	Species, genome size estimate, ploidy, downstream goal	DNA integrity, purity, concentration, fragment size, contamination review	Final requirements depend on species, genome size, assembly level, and platform strategy
Plant, animal, fungal, or non-model organism samples	Sample source, tissue difficulty, inhibitors, related references, expected repeat content	Feasibility for de novo, chromosome-level, or phased assembly	Species, sample source, estimated genome size, ploidy, target assembly level	Sample suitability review, DNA quality review, contamination risk review	Complex or inhibitor-rich tissues may require special review before workflow selection
Existing sequencing data	FASTQ/BAM files, platform, coverage, read length, sample labels, prior assembly	Compatibility with reassembly, polishing, scaffolding, or annotation	Sequencing platform, genome target, prior assembly files, analysis goal	File integrity, read QC, coverage review, assembly feasibility review	Can support rescue, improvement, reanalysis, or downstream annotation when data quality is suitable
Draft assembly files	Assembly FASTA, statistics, annotation status, contamination concerns, scaffolding needs	Improvement potential and downstream suitability	Assembly FASTA, existing QC, species information, desired improvement level	Contiguity review, BUSCO review, contamination check, scaffold feasibility review	May be improved through polishing, scaffolding, annotation, or custom bioinformatics depending on data

How to Read Genome Assembly QC Without Over-Relying on N50

N50 is widely used, but it should not be the only metric used to judge a genome assembly. A high N50 can reflect long contigs or scaffolds, but it does not automatically mean the assembly is complete, accurate, correctly scaffolded, or useful for every downstream analysis.

QC metric	What it helps evaluate	What it does not fully answer
Contig N50	Assembly continuity before scaffolding	Completeness, correctness, contamination, or gene recovery
Scaffold N50	Long-range scaffold continuity	Whether scaffolds are correctly ordered and oriented
BUSCO	Gene-space completeness using conserved genes	Repeat resolution, structural correctness, or whole-genome accuracy
QV	Consensus accuracy estimate	Long-range structure, phasing quality, or annotation usefulness
Genome size comparison	Whether assembly size matches expectation	Whether the sequence is complete or correctly assembled
Contamination review	Non-target sequence or mixed-sample risk	Biological interpretation or annotation accuracy by itself
Hi-C contact map review	Chromosome-level scaffolding consistency	Base-level accuracy or gene completeness
Annotation summary	Gene prediction and functional interpretation readiness	Whether assembly structure is fully correct

N50 describes continuity only. A high-N50 assembly can still contain contamination, misjoins, missing genes, or collapsed repeats, and it may still be poorly suited to annotation.
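To make the limitation concrete, here is a minimal sketch of the standard N50 calculation: sort sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The two toy assemblies are invented for illustration (lengths in arbitrary units), and the function name is our own.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Two invented assemblies of the same genome.
assembly_a = [40, 40, 30, 20, 10]   # total 140
assembly_b = [40, 40, 10]           # total 90

print(n50(assembly_a), sum(assembly_a))   # 40 140
print(n50(assembly_b), sum(assembly_b))   # 40 90
```

Both toy assemblies report the same N50, yet the second contains far less sequence. This is why N50 is read alongside genome size comparison, BUSCO, and the other checks in the table above.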

BUSCO helps evaluate conserved gene completeness, while QV can provide a consensus accuracy estimate when applicable. These metrics help complement N50, especially when the assembly will support gene discovery, comparative genomics, or publication-oriented research.
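QV is conventionally expressed on the Phred scale, where QV = -10 × log10(estimated error rate). The short sketch below converts in both directions so that a reported QV can be read as an expected error density. The function names are illustrative, and how the error rate itself is estimated depends on the QC tool used, which is outside this example.

```python
import math


def qv_from_error_rate(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate between 0 and 1."""
    if not 0 < error_rate < 1:
        raise ValueError("error rate must be between 0 and 1 (exclusive)")
    return -10 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def expected_errors_per_mb(qv: float) -> float:
    """Expected number of consensus errors per million bases at a given QV."""
    return error_rate_from_qv(qv) * 1_000_000


for qv in (30, 40, 50):
    print(qv, round(expected_errors_per_mb(qv), 1))
```

On this scale, QV 40 corresponds to roughly one error per 10,000 bases, and every additional 10 points means a tenfold lower error rate. QV says nothing about long-range structure, so it complements rather than replaces scaffold review.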

The best QC framework depends on what the assembly must support. A genome used for gene annotation, pan-genome analysis, structural variation, or trait mapping may need different checks. We help interpret QC in the context of the research goal.

Annotation and Downstream Analysis Make the Assembly Usable

A genome assembly becomes more valuable when it is connected to annotation and downstream analysis. For many research teams, the final goal is not only a FASTA file. It is a usable genome resource.

Genome annotation and gene prediction

Our Genome Annotation and Gene Prediction Service supports projects that require gene models, coding sequences, protein sequences, functional annotation, and annotation summaries.

This is especially important for non-model organisms, species with limited annotation resources, and projects focused on gene discovery.

Repeat annotation and functional annotation

Repeat annotation helps characterize transposable elements, repetitive regions, and repeat content that may influence assembly strategy and downstream interpretation. Functional annotation can help connect predicted genes with known databases, pathways, gene families, or biological functions.

Comparative genomics, pan-genome, SV, and population support

When the assembly will support downstream studies, we can help plan additional analyses through Genomic Data Analysis, Pan Genome, Variant Calling, and Population Genetics services.

These modules can support comparative genomics, gene family expansion, pan-genome construction, structural variation analysis, population genomics, and breeding-related research.

Files your team can reuse for future studies

Assembly FASTA
GFF or GTF annotation files
Repeat annotation files
Protein FASTA and CDS FASTA
BUSCO reports and QV summaries
Hi-C scaffolding outputs
Comparative genomics tables
Pan-genome or SV-ready files
Project report

Choose a Strategy Based on the Research Question, Not the Technology Name

A good assembly strategy starts with the research question. We help you decide what genome resource is needed and which data types can support it.

If your goal is a first reference genome

A de novo reference strategy may be suitable when no close reference exists or when you need a new genome resource for a non-model species. In many cases, our De Novo Whole Genome Sequencing Service or Plant/Animal Whole Genome de novo Sequencing service can support this goal.

If your goal is trait mapping or breeding support

Chromosome-level assembly may be more useful when genomic position matters. Hi-C scaffolding can support trait mapping, linkage analysis, comparative genomics, and breeding-related research.

If your goal is polyploid or haplotype-aware interpretation

Haplotype-resolved assembly may be needed when the organism is highly heterozygous, outbred, hybrid, or polyploid. This strategy can help preserve allele- or subgenome-specific structure when supported by data.

If your goal is pan-genome, SV, or population genomics

If the assembly will support pan-genome construction, structural variation analysis, or population genomics, we help plan the assembly and downstream outputs together. The goal is to avoid building an assembly that looks acceptable on paper but is not suitable for the next analysis step.

Compliance / Disclaimer

CD Genomics provides this service for Research Use Only (RUO). This service is not intended for clinical diagnosis, direct medical interpretation, patient management, treatment guidance, direct-to-consumer testing, or guaranteed discovery claims.

Demo Results

Demo results help your team understand what an assembly project may deliver. These examples show output types, not fixed biological conclusions.

Assembly continuity and chromosome scaffolding summary

This output may show contig statistics, scaffold statistics, chromosome-scale scaffold views, and a Hi-C contact map summary when Hi-C scaffolding is included.

BUSCO, QV, and contamination review dashboard

This output summarizes assembly completeness, consensus quality, and contamination review in a compact format.

Annotation and downstream-ready output view

This output may show gene annotation summaries, repeat annotation tracks, gene family outputs, and files prepared for comparative or population-level analysis.

FAQ

1. What is a Genome Assembly Strategy Solution?

It is a research-focused service approach that helps you choose and execute the right genome assembly plan based on your species, sample quality, genome complexity, assembly level, and downstream goals.

2. How do I know what assembly level my project needs?

The right level depends on the research question. Early gene discovery may need a draft or contig-level assembly, while trait mapping, synteny, and breeding research often benefit from chromosome-level assembly. Haplotype-resolved or T2T-like strategies may be considered for complex genomes or repeat-rich regions.

3. When is draft assembly enough?

Draft assembly may be enough for compact genomes, early reference development, preliminary gene discovery, or projects where chromosome-scale position is not central. It may not be enough for linkage, structural variation, chromosome evolution, or pan-genome work.

4. When should I choose chromosome-level assembly?

Chromosome-level assembly is useful when genomic position, long-range structure, trait mapping, synteny, or comparative genomics matters. Hi-C or related scaffolding methods may be used to support this level.

5. When is haplotype-resolved assembly important?

Haplotype-resolved assembly can be important for heterozygous, outbred, hybrid, or polyploid organisms. It helps preserve allele- or haplotype-specific information when the data and project design support it.

6. When is T2T-like assembly worth considering?

A T2T-like strategy may be worth considering when centromeres, telomeres, large repeats, unresolved gaps, or high-end reference genome quality are central to the research question. It is more demanding and should be planned carefully.

7. How do PacBio and Nanopore differ for genome assembly?

PacBio HiFi reads are often valued for high-accuracy long-read assembly. Nanopore long or ultra-long reads can be useful for spanning long repeats and complex regions. Depending on the genome and research goal, a project may use one technology or combine both.

8. Why is Hi-C useful for chromosome-level assembly?

Hi-C provides long-range contact information that can help order and orient contigs into chromosome-scale scaffolds. It is especially useful when downstream analysis depends on chromosome-level structure.

9. Why should I not rely on N50 alone?

N50 describes continuity, but it does not fully measure completeness, accuracy, contamination, misassembly risk, or annotation readiness. A strong QC review should combine multiple metrics.

10. What sample information is needed before recommending a strategy?

Useful information includes species, genome size estimate, ploidy, sample type, preservation method, DNA quality, expected heterozygosity, related reference genomes, existing sequencing data, and downstream research goals.

11. Can existing sequencing data or draft assemblies be improved?

Yes. Existing data or draft assemblies may support polishing, scaffolding, reassembly, annotation, contamination review, or downstream analysis when data quality is suitable.

12. What deliverables can I expect from a genome assembly project?

Deliverables may include assembly FASTA files, QC summaries, N50 statistics, BUSCO reports, QV estimates, contamination review, annotation files, repeat annotation, Hi-C scaffolding outputs, genome browser-ready files, and project reports.

13. Can genome assembly results support annotation, pan-genome, SV, or population genomics?

Yes. When planned correctly, genome assembly can support annotation, comparative genomics, pan-genome analysis, structural variation analysis, and population genomics. These downstream needs should be considered before the assembly plan is finalized.

14. Is this service intended for clinical or diagnostic use?

No. This service is designed for research-focused genome assembly and bioinformatics projects only.

Literature Case: Hi-C Scaffolding Changes How Genome Assemblies Are Evaluated

Published Research Highlight

Benchmarking of Hi-C tools for scaffolding plant genomes obtained from PacBio HiFi and ONT reads

Journal: Frontiers in Bioinformatics
Published: 2024

Background

Chromosome-level genome assembly often requires more than contig generation. Hi-C reads can help order and orient large genomic regions into scaffolds, making them useful for projects that need chromosome-scale structure.

Methods

The study generated two de novo Arabidopsis thaliana assemblies from the same PacBio HiFi and Oxford Nanopore data. It then scaffolded the assemblies using 3D-DNA, SALSA2, and YaHS.

The scaffolded assemblies were evaluated using contiguity, completeness, accuracy, and structural correctness. This design is relevant because it compares not only sequencing data types, but also downstream scaffolding and quality interpretation.

Results

The study reported that Hi-C scaffolding tools showed different performance characteristics across the evaluated assemblies.
YaHS performed best in that analysis.
The broader lesson is important for project planning: chromosome-level assembly quality depends not only on sequencing platform, but also on scaffolding method, QC review, and structural correctness.

Hi-C scaffolding benchmark for plant genome assemblies generated from PacBio HiFi and Oxford Nanopore reads

A Hi-C scaffolding benchmark illustrates why chromosome-level genome assembly should be evaluated by contiguity, completeness, accuracy, and structural correctness rather than one metric alone.

Conclusion

This case supports the central idea behind our Genome Assembly Strategy Solution. Genome assembly planning should not stop at choosing PacBio, Nanopore, or Hi-C. A strong strategy also considers assembly level, scaffolding method, QC metrics, annotation, and downstream usability.