Mastering Genome Assembly: From de Bruijn Graphs to Telomere-to-Telomere Reconstruction

Genome assembly is often described as a sequencing workflow. In practice, it is an inference problem. Reads do not reveal the genome directly. They sample it in fragments, with finite span, platform-specific error, and uneven power to cross repeats. The assembler must reconstruct a hidden sequence from partial observations, while deciding which graph paths are real, which are ambiguous, and which should be rejected. That is why assemblies fail in patterned ways rather than random ways. They break at repeat boundaries, collapse copy-rich regions, and sometimes appear highly contiguous even when the structure is still wrong.

A 2026-ready discussion of genome assembly should therefore move past generic overview language. The important questions are sharper. Which graph model fits the read type? When does a graph branch because of sequencing error, and when does it branch because the biology is genuinely ambiguous? When does scaffolding increase chromosome-scale truth, and when does it merely hide an unresolved error inside a larger scaffold? Why can a higher N50 still coexist with repeat collapse, haplotype confusion, or structural misjoins? These are now the questions that matter for both scientific rigor and project design.

For technical teams planning de novo projects, this shift also changes how services should be evaluated. A basic whole genome sequencing workflow may be sufficient for broad discovery-stage needs, but genomes with high repeat burden, strong heterozygosity, or chromosome-scale ambitions often require a more explicit evidence architecture. In those settings, plant/animal whole genome de novo sequencing becomes less about generating reads and more about matching data type, graph logic, and validation framework to the genome's actual failure modes.

Why assembly fails even when the data look good

Most failed assemblies do not fail because the data are obviously bad. They fail because the information content of the data does not match the structure of the genome. A read set can be deep, clean, and still be unable to resolve a region if that region is longer, more repetitive, or more duplicated than the available span can disambiguate. In other words, assembly breaks are often evidence-limited, not software-limited.

This point is easy to miss because raw coverage is seductive. If a genome has high depth, it feels intuitive that the assembly should be complete. But assembly does not depend on coverage alone. It depends on whether the reads carry enough unique context to bridge one ambiguous region into the next. Repeats, tandem arrays, ribosomal DNA clusters, segmental duplications, and transposable elements all attack that requirement. So do mixed haplotypes, copy-number differences, and polyploid structure. The result is a graph that may be richly populated with data and still be locally undecidable.

This is why the hardest genomes are not simply the largest ones. They are the ones whose sequence architecture contains too many places where local evidence becomes non-unique. A bacterial genome with limited repeat complexity can often be reconstructed with straightforward long-read design. A large plant genome with recent transposon expansion, residual heterozygosity, and long repeat tracts can punish almost every naive assumption. In such cases, the goal is not to "run assembly harder." The goal is to redesign the evidence model.

That is also why different project types naturally converge on different service architectures. For lower-repeat microbial projects, a focused long-read strategy such as bacterial whole genome de novo sequencing may already provide sufficient continuity. For larger and more ambiguous genomes, the assembly plan must anticipate repeat traversal, long-range ordering, and orthogonal validation from the beginning.

How to tell what kind of failure you are seeing

One of the most useful habits in assembly work is to stop treating "fragmentation" as a single diagnosis. Different failure signatures point to different underlying causes.

If you see sharp drops in contiguity at known repeat-rich regions, while unique regions remain well assembled, the problem is often repeat collapse or repeat-driven ambiguity rather than global data shortage. If you increase coverage and the same regions still fail, that is another sign the bottleneck is span or uniqueness, not depth.

If the assembly shows broad fragmentation across many unrelated loci, especially with noisy data or variable read quality, the issue may be coverage insufficiency or unstable read support. In that case, more data or cleaner data may help directly.

If the graph contains persistent parallel paths, duplicated local sequence, or unstable phasing across variant-dense regions, the assembly may be struggling with heterozygous branching rather than ordinary repeat content. This is especially common in outbred diploid genomes and many plant genomes.

If a scaffold looks impressively long but later shows discordant long-range evidence, conflicting map alignment, or implausible joins across distant sequence contexts, the problem may be a chimeric misjoin. That kind of failure is especially dangerous because it increases apparent continuity while reducing structural truth.

These distinctions matter because each failure type suggests a different intervention. Repeat collapse calls for longer or more informative span. Coverage insufficiency calls for more usable data. Heterozygous branching calls for phasing-aware assembly logic. Chimeric misjoin calls for independent structural validation rather than more aggressive scaffolding.

The mathematics of assembly: graph theory in action

Assemblers do not work by intuition. They convert reads into graph structures, simplify those structures, and infer sequence paths that best explain the observed data. The reason different assemblers behave so differently is not just implementation quality. It is that they encode evidence in different mathematical forms.

The two core traditions are familiar: de Bruijn graph assembly and overlap-layout-consensus logic. But in modern practice, the real contrast is broader. It is a contrast between local k-mer compression and context-preserving overlap structure. That contrast explains why the same genome can look tractable under one data model and nearly impossible under another.

de Bruijn graphs and the logic of short-read assembly

de Bruijn graphs became dominant in the short-read era because they solved a brutal scaling problem. Instead of comparing every read against every other read, the assembler breaks reads into overlapping words of length k. These k-mers are then used to build a graph in which adjacency reflects observed sequence continuity. The approach is elegant and efficient. It compresses enormous read collections into a form that can be traversed computationally.
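To make that construction concrete, here is a minimal sketch that builds a toy de Bruijn graph from a few short reads, using (k-1)-mer nodes and k-mer edges. The read strings and the build_dbg helper are invented for illustration; a real assembler would add error handling, reverse complements, and coverage tracking.

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers."""
    edges = defaultdict(list)                      # prefix (k-1)-mer -> suffix (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])      # adjacency reflects observed continuity
    return edges

# Hypothetical toy reads; real input would be quality-filtered FASTQ records.
reads = ["ATGGCGTGCAAT", "GCGTGCAATGGC", "TGCAATGGCACT"]
for node, successors in build_dbg(reads, k=5).items():
    print(node, "->", sorted(set(successors)))
```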

That compression is the source of both its power and its limitation.

When reads are reduced to k-mers, some global read context disappears. Local adjacency remains, but long-range identity becomes harder to preserve. If the genome contains many repeated sequences longer than the unique context available around them, the graph tangles. Different genomic regions may collapse into the same local graph structure. The assembler then no longer faces a simple path-finding task. It faces a symmetry problem. More than one reconstruction becomes compatible with the observed k-mer set.

Three artifacts define much of practical de Bruijn graph assembly.

Tips are short dead-end branches. They often arise from sequencing errors, weakly supported sequence ends, or rare artifacts. Pruning them can improve graph clarity, but over-pruning can also remove true low-coverage sequence.

Bubbles are parallel paths that diverge and rejoin. Some are error-derived. Others reflect real biology, such as heterozygous variants, small structural alternatives, or duplicated sequence with slight divergence. A bubble is therefore not a nuisance by definition. It is an ambiguity signal that must be interpreted.

False traversals become possible when repeats create branch structures that appear locally valid but do not correspond to the true genome path. This is where short-read assembly often looks strongest until it suddenly breaks. Local support is abundant, but the unique context needed for correct global traversal is missing.
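The first two artifacts above can be recognized with very simple graph tests: a short dead-end branch hanging off a fork is a tip candidate, and a fork whose two branches reconverge at the same successor forms a simple bubble. The adjacency dictionary and both helper functions below are illustrative assumptions; real assemblers apply coverage-based support thresholds before pruning or popping anything.

```python
# Hypothetical adjacency map: node -> list of successor nodes (illustration only).
graph = {
    "A": ["B"],
    "B": ["C", "X"],   # fork: one branch is a short dead end (tip)
    "X": [],
    "C": ["D", "E"],   # bubble: two parallel branches diverge here...
    "D": ["F"],
    "E": ["F"],        # ...and rejoin at F
    "F": [],
}

def find_tips(g):
    """Dead-end nodes hanging off a fork are tip candidates."""
    tips = []
    for node, successors in g.items():
        for s in successors:
            if not g.get(s) and len(successors) > 1:
                tips.append(s)
    return tips

def find_simple_bubbles(g):
    """Simple bubble: a fork whose two single-step branches rejoin at the same node."""
    bubbles = []
    for node, successors in g.items():
        if len(successors) == 2:
            a, b = successors
            if g.get(a, []) == g.get(b, []) and len(g.get(a, [])) == 1:
                bubbles.append((node, a, b, g[a][0]))
    return bubbles

print("tip candidates:", find_tips(graph))             # ['X']
print("simple bubbles:", find_simple_bubbles(graph))   # [('C', 'D', 'E', 'F')]
```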

K-mer choice sits at the center of this trade-off. A smaller k tends to improve connectivity, but it also increases the chance that unrelated repeat copies will collapse into the same graph structure. A larger k increases specificity, but it can fragment low-coverage regions or penalize noisy data. There is no universal best setting, because the right answer depends on read length, data quality, repeat density, and expected heterozygosity.
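The following sketch shows that trade-off on a toy sequence containing one exact 10 bp repeat: with a small k the repeat copies collapse onto shared nodes and create branching, while a k longer than the repeat keeps the copies separate. The sequence, the chosen k values, and the branching_nodes helper are assumptions made up for the example.

```python
from collections import defaultdict

def branching_nodes(seq, k):
    """Count (k-1)-mer nodes with more than one distinct outgoing edge."""
    out = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        out[kmer[:-1]].add(kmer[1:])
    return sum(1 for successors in out.values() if len(successors) > 1)

# Hypothetical fragment with the same 10 bp repeat embedded in two different contexts.
repeat = "GGCATGCAGG"
genome = "ATTACC" + repeat + "TTGACA" + repeat + "CCATTA"

for k in (5, 15):
    print(f"k={k:>2}: branching nodes = {branching_nodes(genome, k)}")
```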

The deeper lesson is that de Bruijn graphs are not merely a fast implementation detail. They encode a specific view of sequence evidence. They perform best when local k-mer relationships retain enough uniqueness to represent the genome faithfully. When the genome stops cooperating, the graph does not become "bad." It becomes honest about ambiguity.

OLC and string-graph logic for long reads

Long reads change the problem because they restore context. Instead of observing only small local fragments, the assembler can often see through larger repeat units, across structural variation, or from one unique anchor into the next. That does not eliminate complexity, but it changes where the uncertainty lives.

Overlap-layout-consensus, or OLC, captures this shift clearly. In classical form, the assembler first detects overlaps among reads, then arranges those reads into a layout, and finally computes a consensus sequence. Modern long-read assemblers often use variants such as string graphs or repeat graphs rather than a literal textbook OLC pipeline, but the underlying logic remains similar: preserve read-level context for as long as possible and use real overlap evidence to infer structure.
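A deliberately naive version of the overlap step makes the logic visible: check every ordered read pair for an exact suffix-prefix match above a minimum length, then keep the strongest overlaps for layout. The toy reads, the exact-match test, and the exact_overlap helper are illustrative assumptions; real overlappers use k-mer indexing and error-tolerant alignment rather than exact string comparison.

```python
from itertools import permutations

def exact_overlap(a, b, min_len=5):
    """Length of the longest exact suffix of `a` matching a prefix of `b` (0 if shorter than min_len)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)   # anchor candidate suffix-prefix matches
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

# Hypothetical error-free reads sampled from one underlying sequence.
reads = {"r1": "ATGGCGTGCAAT", "r2": "GTGCAATGGCAC", "r3": "ATGGCACTTGAG"}

overlaps = {}
for x, y in permutations(reads, 2):          # overlap step: all ordered pairs
    length = exact_overlap(reads[x], reads[y])
    if length:
        overlaps[(x, y)] = length

print(overlaps)   # the layout step would then chain reads along the strongest overlaps
```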

This is why long-read assembly often handles repetitive sequence more gracefully than short-read assembly. A repeat that defeats a k-mer graph may become tractable if long reads span from unique sequence into and across that repeat. The key is not simply read length in the abstract. It is whether the read span is longer than the ambiguity that must be resolved.

That said, long reads do not magically remove assembly uncertainty. They move it. If raw read error is high, overlap detection becomes noisier. If repeat copies are longer than the read span, the ambiguity persists. If the genome is strongly heterozygous or polyploid, even long overlaps may contain multiple valid paths that need phasing-aware interpretation. Modern assemblers such as Flye, Canu, and hifiasm differ precisely in how they manage these trade-offs.

For many high-complexity projects, the difference between a useful long-read dataset and an insufficient one comes down to whether the reads only enter difficult regions or actually cross them. This is why teams evaluating human whole genome PacBio SMRT sequencing or other long-read designs should think in terms of repeat-spanning power, not just platform labels.

Why repeats still dominate de novo assembly failure

Repeat complexity remains the master variable in assembly difficulty. Most serious assembly errors can be traced back to one of a small set of repeat-driven problems: collapse, fragmentation, false join, or unresolved duplication. Even when the underlying mechanism differs, the trigger is often the same. The evidence does not uniquely distinguish one genomic copy from another.

Transposable elements are a classic example. If a genome contains many recent elements with high sequence identity, short-range evidence quickly becomes ambiguous. Ribosomal DNA clusters create a different but equally stubborn version of the same problem. Tandem organization, high copy number, and local sequence similarity all compress the solution space. Segmental duplications create perhaps the most dangerous case because they may be long, highly similar, and embedded in otherwise unique sequence, which tempts the assembler into a confident but incorrect join.

This is why highly contiguous assemblies can still carry biologically important distortions. A repeat collapse may make the graph easier to traverse and the contig longer. It may also erase copy number, flatten structural heterogeneity, or distort dosage-sensitive regions. From a purely cosmetic standpoint, the assembly improved. From a biological standpoint, it may have degraded.

The practical implication is simple but often ignored: repeat handling should be evaluated as a first-order design criterion, not a downstream refinement. If a project is expected to encounter long tandem repeats, extensive satellite sequence, or high-copy transposon content, the assembly strategy should anticipate that reality at the sequencing stage. For some genomes, this means a standard long-read workflow is enough. For others, it means the difference between a scaffold-level result and a sequence-resolved result lies in whether the design includes enough ultra-long molecules to bridge the hardest regions.

Figure 1. Graph choice changes the dominant failure mode: short-read de Bruijn graphs tend to fragment or branch in repeat-rich regions, while overlap-based long-read logic can rescue ambiguity only when read context is long enough to span it.

Scaffolding and contiguity enhancement: making larger structures without hiding smaller errors

A contig is a local sequence claim. A scaffold is a larger structural claim about how contigs relate across unsequenced or unresolved space. That difference is crucial. Scaffolding does not automatically create missing sequence. It uses long-range evidence to estimate order, orientation, and distance relationships among existing contigs. When done well, that produces chromosome-scale organization. When done carelessly, it can produce a longer but less trustworthy assembly.

This is why contiguity enhancement should never be reduced to a formatting exercise. The goal is not merely to make the assembly longer. The goal is to increase span without inflating unsupported structure.

Hi-C and proximity ligation: using chromosome physics as evidence

Hi-C scaffolding works because chromosomes are physical objects, not abstract strings. Inside the nucleus, loci that are nearby on the same chromosome tend to contact one another more often than loci that are distant or on different chromosomes. Hi-C converts that physical organization into interaction counts. Scaffolding algorithms then use those patterns to cluster contigs into chromosomes and infer likely order and orientation.

That logic is powerful because it introduces information that sequence alone may not provide. A contig set that cannot be extended further through local graph reasoning may still be organized at chromosome scale if the contact map shows coherent long-range structure. This is why Hi-C sequencing has become a central layer in chromosome-scale assembly design.
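A stripped-down sketch of that clustering logic: normalize inter-contig link counts by contig length, then greedily merge contigs whose contact density exceeds a threshold. The link counts, contig lengths, density threshold, and cluster_contigs helper are hypothetical; dedicated Hi-C scaffolders model contact decay, orientation, and mappability far more carefully.

```python
from collections import defaultdict

# Hypothetical Hi-C link counts (read pairs bridging two contigs) and contig lengths in bp.
links = {("c1", "c2"): 480, ("c2", "c3"): 350, ("c4", "c5"): 290,
         ("c1", "c4"): 12, ("c3", "c5"): 9}
lengths = {"c1": 2.1e6, "c2": 1.4e6, "c3": 0.9e6, "c4": 1.8e6, "c5": 1.2e6}

def cluster_contigs(links, lengths, min_density=1e-10):
    """Greedy single-link clustering on length-normalized contact density."""
    parent = {c: c for c in lengths}
    def find(c):                                   # union-find root lookup
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    for (a, b), n in sorted(links.items(), key=lambda kv: -kv[1]):
        if n / (lengths[a] * lengths[b]) >= min_density:
            parent[find(a)] = find(b)              # merge the two groups
    groups = defaultdict(list)
    for c in lengths:
        groups[find(c)].append(c)
    return sorted(groups.values())

print(cluster_contigs(links, lengths))   # expected grouping: [c1, c2, c3] and [c4, c5]
```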

But Hi-C is not magic. It is an indirect signal. Contact frequency reflects genomic distance only probabilistically, and that relationship is modulated by chromatin state, local mappability, restriction bias, repeat density, and assembly quality itself. If the underlying contigs are already chimeric, repeat-collapsed, or haplotype-mixed, the Hi-C signal is being mapped onto a flawed substrate. In that scenario, scaffolding can amplify the mistake. It does not invent the error, but it can stabilize it inside a larger structure that now looks more convincing.

This is the key diagnostic insight many overview pages miss: Hi-C is strongest when used to organize already credible contigs, not to rescue fundamentally unresolved local ambiguity. If the contig layer is weak, the contact map may still produce a plausible chromosome picture, but the plausibility is structural, not necessarily sequence-true.

When Hi-C is helping and when it is hiding the problem

A healthy Hi-C scaffolding result usually shows several consistent features. Contigs cluster into chromosome-scale groups with clear interaction enrichment. Ordering along the scaffold produces a contact pattern that decays in a coherent way with genomic distance. Orientation decisions are supported by reproducible asymmetry in local contact structure rather than by weak signals scattered across the matrix.

A problematic result looks different. You may see long scaffolds that require many low-confidence joins, blocks whose contact patterns do not agree with neighboring structure, or contigs that repeatedly swap placement depending on parameter choice. These are warning signs that Hi-C is being asked to solve a problem that belongs earlier in the assembly workflow.

Another common red flag appears in highly heterozygous material. If haplotypes are partially collapsed or inconsistently separated, Hi-C links may connect homologous regions in misleading ways. The scaffold still looks chromosome-like, but the internal logic is unstable because the contig substrate does not correspond cleanly to a single genomic representation.

In practical terms, this means Hi-C should be interpreted as long-range structural evidence, not as proof that the sequence path between two linked blocks is itself correct. Chromosome-scale scaffolding is valuable, but it is not equivalent to sequence-complete reconstruction.

Optical mapping and large-scale structural correction

Where Hi-C gives contact-based evidence, optical mapping gives long-molecule structural evidence. Long DNA molecules are labeled at specific motifs, imaged, and converted into barcode-like maps. These molecule maps can then be aligned against an assembly to test whether the large-scale structure is consistent with the observed labeling pattern.

This makes optical mapping especially useful for detecting errors that sequence-centric metrics may miss. A scaffold can look excellent by N50 and still contain an inversion, a collapsed expansion, or a false join that becomes obvious when long-molecule label spacing is examined. Optical mapping therefore plays a different role from Hi-C. Hi-C is often most useful for chromosome assignment and large-scale organization. Optical mapping is especially effective for identifying structural discordance.
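Conceptually, the check reduces to comparing label-to-label distances: spacings predicted from motif positions in the assembly versus spacings observed on long molecules. The coordinates, tolerance, and interval_discordance helper below are invented for illustration, and real optical-map alignment additionally models missing labels, sizing noise, and molecule stretch.

```python
def interval_discordance(assembly_labels, molecule_labels, tol=0.10):
    """Flag consecutive label intervals whose relative size disagreement exceeds `tol`."""
    asm_gaps = [b - a for a, b in zip(assembly_labels, assembly_labels[1:])]
    mol_gaps = [b - a for a, b in zip(molecule_labels, molecule_labels[1:])]
    flagged = []
    for i, (ga, gm) in enumerate(zip(asm_gaps, mol_gaps)):
        relative = abs(ga - gm) / max(gm, 1)
        if relative > tol:
            flagged.append((i, ga, gm, round(relative, 2)))
    return flagged

# Hypothetical label coordinates in kb: in-silico map from the assembly vs an averaged molecule map.
assembly_labels = [0, 42, 95, 150, 175, 260]   # the assembly claims a 25 kb interval at index 3
molecule_labels = [0, 41, 96, 151, 231, 318]   # molecules support roughly 80 kb there instead

print(interval_discordance(assembly_labels, molecule_labels))   # -> [(3, 25, 80, 0.69)]
```

An output like this points at a collapsed expansion or false join: every other interval agrees within tolerance, but one interval in the assembly is far shorter than the molecules support.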

That distinction matters because many teams treat all long-range evidence as interchangeable. It is not. Hi-C asks which segments are likely near one another in chromosomal space. Optical mapping asks whether the physical pattern along a long molecule agrees with the claimed structure. Those are related questions, but they are not the same question.

Figure 2. Long-range evidence can either validate or amplify assembly structure: Hi-C is strongest for chromosome-scale clustering, ordering, and orientation, while optical mapping is especially valuable for exposing large-scale discordance that inflated scaffolds may hide.

Gap filling is not just gap closing

A gap is not a generic absence. Different gaps arise from different mechanisms, and each mechanism implies a different solution.

Some gaps are simple span problems. No read, or no reliable overlap, crosses the missing interval. In those cases, longer molecules may directly solve the issue.

Some gaps are repeat problems. Reads enter the region but do not do so uniquely enough to distinguish one copy from another. More depth may add confidence to the same ambiguity rather than resolve it. Here the limiting factor is not quantity, but informative span.

Some gaps are haplotype problems. The assembly is not merely missing sequence. It is undecided about whether nearby alternatives represent allelic difference, paralogous duplication, or graph noise. Filling such gaps without phasing-aware logic can produce superficially cleaner output while reducing biological truth.

Some gaps are scaffolding artifacts. The scaffold claims continuity because long-range evidence links two blocks, but the actual sequence across the interval remains unresolved. This is not the same as sequence completion, even if the scaffold is reported as chromosome-scale.

A strong assembly workflow asks a more precise question: what kind of gap is this? If the answer is "insufficient span," then longer-read architectures may help. If the answer is "repeat symmetry," then only reads that bridge unique anchors may fix the problem. If the answer is "haplotype confusion," then the project may need a phased graph model. If the answer is "over-scaffolding," then the correct move may be to reduce claimed continuity rather than defend it.

This is where platform choice becomes strategic. When local consensus accuracy is the limiting issue, high-fidelity long-read options such as human whole genome PacBio SMRT sequencing may be the better fit. When repeat bridging across very long tracts is the limiting issue, the relevant question becomes whether and when to use Nanopore ultra-long sequencing to cross ambiguity that shorter molecules cannot resolve.

Telomere-to-telomere reconstruction starts before the telomeres

A telomere-to-telomere assembly is not just a longer scaffold set. It is a sequence-resolved claim that the chromosome has been reconstructed across the regions that usually defeat standard assembly: telomeric repeats, centromeric arrays, large satellites, segmental duplications, and often ribosomal DNA-rich regions. That is a much higher bar than chromosome-scale scaffolding. A scaffold can connect two arms across a difficult interval by long-range evidence. A true T2T assembly must reconstruct the difficult sequence itself.

This difference matters because many assemblies now look chromosome-scale long before they become sequence-complete. Hi-C can place contigs into convincing chromosome groups. Optical mapping can support large-scale structure. But neither of those alone proves that the repeat-dense interior has been correctly reconstructed at sequence level. A centromere linked across is not the same as a centromere assembled through.

That is why T2T projects depend so strongly on span plus orthogonality. Ultra-long reads are valuable not because they are fashionable, but because they can bridge from one unique anchor across a long repeat system into the next unique anchor. In practice, the question is simple: can the data actually cross the ambiguity, or can they only point at its boundaries?

This is also why T2T-oriented projects should be designed from the start as repeat-completion projects rather than ordinary contig-improvement projects. If the endpoint is true sequence continuity through centromeres, telomeres, and other repeat-dense intervals, then the evidence stack must be selected for that endpoint. For many teams, that means combining chromosome-scale planning with telomere-to-telomere sequencing and, where repeat span is the governing bottleneck, Nanopore ultra-long sequencing.

Why ultra-long reads matter most where ordinary long reads still fail

Not all long reads solve the same problem. Some improve local consensus accuracy. Some improve ordinary repeat traversal. Ultra-long reads become decisive when the unresolved structure itself is longer than the effective span of standard long-read evidence.

Centromeric satellites are the classic example. These regions often contain long stretches of highly homogeneous repeat sequence with sparse unique anchors. Standard long reads may reach into the array but still fail to connect one unique flank to the other. The same logic applies to large telomeric tracts, rDNA-associated complexity, and some segmental duplications. In these settings, the assembly does not fail because it lacks sequence in general. It fails because it lacks reads that remain informative for long enough.

This is where teams often over-interpret polished contigs. A beautifully polished assembly can still be incomplete in the most biologically difficult regions if no data type actually spans them. Sequence quality inside the easy regions and sequence completeness inside the hard regions are related, but they are not interchangeable.

Scaffold continuity is not sequence-resolved truth

A useful discipline in T2T work is to separate three different claims that are often blurred together:

  1. Contig continuity: the sequence is locally assembled without gaps.
  2. Scaffold continuity: those contigs are ordered and oriented into larger chromosome-scale structures.
  3. Sequence-resolved chromosome continuity: the difficult sequence between major blocks has itself been assembled and validated.

Only the third claim deserves T2T language. This distinction is not merely semantic. It changes how a genome should be interpreted downstream. Structural analyses, repeat biology, copy-number-sensitive inference, and pangenome comparisons can all be distorted if a scaffold-level representation is mistaken for a repeat-complete one.

Figure 3. Scaffold span is not equivalent to T2T truth: ultra-long reads can bridge repeat-dense regions that ordinary assemblies leave unresolved, but true chromosome completion still requires sequence-level reconstruction and validation beyond simple continuity.

Metrics of truth: why N50 is not enough

N50 remains common because it is easy to explain and easy to market. It reports the sequence length at which half of the total assembled bases are contained in contigs or scaffolds of that size or longer. That makes it useful as a continuity descriptor. It does not make it a truth metric.

A longer scaffold can still be wrong. It may contain a false join, a collapsed repeat, or a mis-ordered segment supported only weakly by long-range evidence. In all of those cases, N50 improves while biological faithfulness declines. This is why mature assembly evaluation now separates continuity, completeness, consensus truth, and structural validity rather than forcing all quality judgment into one headline number.

NG50 is often better than N50 when an expected genome size is known, because it anchors continuity to the target genome length rather than the assembled length. Even so, NG50 still answers only a continuity question. It does not tell you whether the assembly is complete in gene space, correct in repeat structure, or accurate in sequence consensus.
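Both statistics take only a few lines to compute, and seeing them side by side makes the distinction concrete: the same contig set yields different answers when continuity is anchored to assembled length versus an expected genome size. The contig lengths and genome size below are hypothetical, and the nx50 helper is a minimal sketch.

```python
def nx50(lengths, total):
    """Smallest length L such that contigs of length >= L cover at least half of `total`."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0   # assembled sequence never reaches half of `total` (possible for NG50)

contigs = [7_000_000, 4_000_000, 2_500_000, 900_000, 600_000, 300_000]   # hypothetical
assembled = sum(contigs)                     # 15.3 Mb of assembled sequence
genome_size = 24_000_000                     # assumed expected genome size

print("N50 :", nx50(contigs, assembled))     # 4,000,000: anchored to assembled length
print("NG50:", nx50(contigs, genome_size))   # 2,500,000: anchored to expected genome length
```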

BUSCO helps solve a different problem. It asks whether expected conserved single-copy orthologs are present and complete for the lineage under study. That makes it highly useful for gene-space completeness. But BUSCO can be excellent in an assembly that still contains important repeat collapse, structural misjoins, or unresolved copy-rich regions. In other words, BUSCO is strong evidence for biological completeness in one layer of the genome, not a global certificate of assembly truth.

K-mer-based evaluation adds a different kind of rigor. Tools such as Merqury compare trusted k-mer content from the read data with the k-mer content in the assembly, allowing evaluators to estimate consensus quality, completeness, and in some settings phasing-related properties without relying entirely on an external reference. This is especially valuable in de novo settings where the closest available reference may itself be incomplete or structurally different from the genome being assembled.
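The core arithmetic behind a Merqury-style quality value can be sketched briefly: count assembly k-mers that never occur in the reads, convert that unsupported fraction into an implied per-base error rate, and express it on the Phred scale. The toy sequences, set-based counting, and the kmer_qv helper are simplified assumptions for illustration, not the Merqury implementation.

```python
import math

def kmerize(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_qv(assembly_kmers, read_kmers, k):
    """Phred-scaled consensus QV from the fraction of assembly k-mers unsupported by reads."""
    unsupported = len(assembly_kmers - read_kmers)          # likely consensus errors
    p_kmer_ok = 1 - unsupported / len(assembly_kmers)
    p_base_ok = p_kmer_ok ** (1 / k)                        # one base error breaks up to k k-mers
    error_rate = max(1 - p_base_ok, 1e-12)                  # guard against log(0)
    return -10 * math.log10(error_rate)

# Hypothetical example: the assembly differs from the read consensus by one substitution.
truth    = "ACGTTGCAGGTACCGATGCAATTCGGATCTAGGCATTCGACCTGAATCGGTTACAGCATG"
assembly = truth[:36] + "G" + truth[37:]                    # single-base difference
k = 11
print(round(kmer_qv(kmerize(assembly, k), kmerize(truth, k), k), 1))
```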

For heterozygous or complex diploid projects, k-mer spectra can be particularly revealing. They can show whether heterozygous content has been collapsed, duplicated, over-purged, or retained in a way that matches the intended assembly model. That is often more informative than mapping-based metrics alone.
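A small k-mer spectrum sketch shows the kind of signal this gives. Reads drawn from two haplotypes produce a half-coverage peak of heterozygous k-mers, and checking how many of those k-mers survive into the assembly reveals whether heterozygous content was retained on one haplotype or lost entirely. The toy haplotypes, coverage, and counting approach are invented assumptions; real evaluations use purpose-built k-mer databases over full read sets.

```python
from collections import Counter

def kmer_counts(seqs, k):
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

# Hypothetical inputs: reads at 4x per haplotype, and an assembly collapsed to one haplotype.
hap1 = "ACGTTGCAGGTACCGATGCAATTCGGAT"
hap2 = hap1[:12] + "T" + hap1[13:]              # one heterozygous site
reads = [hap1] * 4 + [hap2] * 4                 # toy stand-in for sequencing coverage
assembly = [hap1]

k = 7
read_counts = kmer_counts(reads, k)
print("multiplicity histogram:", dict(Counter(read_counts.values())))   # homozygous vs het peaks

het_kmers = [m for m, c in read_counts.items() if c == 4]               # half-coverage peak
asm_kmers = set(kmer_counts(assembly, k))
kept = sum(m in asm_kmers for m in het_kmers)
print(f"heterozygous-peak k-mers retained in assembly: {kept} of {len(het_kmers)}")
```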

A practical evaluation framework

The fastest way to interpret assembly quality is to stop asking for one score and instead ask four separate questions.

Evaluation layer | Common metrics or evidence | What it can answer | What it cannot answer
Contiguity | N50, NG50, contig/scaffold length distribution | How large the assembled pieces are | Whether those pieces are structurally correct or biologically complete
Gene-space completeness | BUSCO | Whether expected conserved genes are represented | Whether repeats, copy number, or chromosome-scale structure are correct
Consensus accuracy and completeness | k-mer spectra, Merqury QV, k-mer completeness | Whether the assembly agrees with trusted sequence content in the reads | Whether large-scale order and orientation are correct by themselves
Structural validity | Hi-C consistency, optical mapping, long-range concordance | Whether chromosome-scale structure is supported by independent evidence | Whether local base-level consensus is accurate in every region

This framework matters because these layers are complementary, not interchangeable. A high N50 cannot substitute for weak BUSCO. Strong BUSCO cannot erase evidence of repeat collapse. Good k-mer agreement cannot by itself prove chromosome ordering. Long-range concordance cannot rescue poor local consensus. Once these questions are kept separate, assembly evaluation becomes much harder to game and much more useful for project decisions.

When high BUSCO still does not mean high-confidence assembly

This is a common trap in complex genomes. BUSCO may report excellent completeness because gene-rich regions are relatively well assembled, while repeat-rich and copy-variable regions remain collapsed or misrepresented. In such cases, the assembly may look strong for gene-centric tasks yet still be weak for structural biology, dosage analysis, centromere biology, or pangenome-grade comparison.

The lesson is not that BUSCO is weak. The lesson is that it measures one slice of the truth stack. In many B2B scientific contexts, that distinction is critical because the right sequencing architecture depends on what the downstream biology actually needs.

How to compare Flye, Canu, and hifiasm without reducing the answer to a scoreboard

Assembler choice is often presented as if one tool simply wins. That framing is usually misleading. Flye, Canu, and hifiasm were shaped by different data realities and optimize for different evidence models. A better question is not "Which is best?" but "Which is most aligned with the geometry of this project?"

A practical selection heuristic usually begins with four variables:

  • Read accuracy: are the long reads noisy or highly accurate?
  • Read span: do the reads only enter difficult regions, or do they cross them?
  • Genome complexity: how much repeat burden, heterozygosity, or duplication is present?
  • Endpoint: is the goal draft contigs, phased diploid assembly, chromosome-scale scaffolds, or T2T-oriented reconstruction?

Once those questions are answered, tool choice becomes more rational.

Flye

Flye is widely valued for repeat-aware long-read assembly and for strong practical performance on many long-read datasets, including noisier long-read contexts. Its logic is well suited to projects where robust long-read contig construction matters more than maximal phasing sophistication. For microbial genomes and many moderate-complexity eukaryotic assemblies, Flye often offers a useful balance between contiguity and operational practicality.

That makes it attractive in workflows where the main problem is assembling through ordinary repeat content rather than disentangling deeply heterozygous diploid structure. In projects centered on microbial whole genome sequencing, or in exploratory de novo builds where robust long-read assembly is the first priority, Flye is often a reasonable fit.

Canu

Canu reflects a more correction-heavy philosophy. It remains important because difficult long-read assembly often benefits from aggressive attention to noisy data, repeat separation, and adaptive weighting before final contig inference is made. Canu may be more computationally demanding than some newer workflows, but that cost is tied to a serious design principle: conservative handling of uncertainty can be more valuable than headline efficiency when the data are difficult.

This is why Canu still deserves consideration in projects where raw long-read noise, uneven support, or repeat ambiguity punish more streamlined assumptions. It is not merely a legacy choice. It is still a useful model of what robust preprocessing and repeat-aware long-read assembly can look like when caution matters.

hifiasm

hifiasm became central because high-accuracy long reads changed the assembly landscape. Its phased assembly graph logic is especially powerful for HiFi-centered workflows, where read accuracy is high enough to support strong contiguity while still preserving the information needed for haplotype-aware reconstruction. For large diploid genomes, this can be transformative.

hifiasm is often the most natural choice when the project goal includes strong contiguity plus phasing-aware structure, particularly in animal or human-like genomes where diploid representation matters. It is also increasingly relevant in near-T2T and T2T-oriented designs when paired with additional long-range or ultra-long evidence. In that setting, data quality and endpoint clarity matter greatly. The tool performs best when the project architecture is designed around what phased graphs can exploit well.

A decision-oriented comparison

Assembler | Strongest input profile | Best-fit genome context | Main strength | Main caution
Flye | Long reads, including noisier long-read sets | Microbial genomes and moderate-complexity eukaryotic de novo projects | Practical repeat-aware long-read assembly with good robustness | Less naturally aligned with top-priority phased diploid reconstruction on HiFi-centered projects
Canu | Noisy long-read datasets that benefit from correction-heavy treatment | Difficult assemblies where conservative processing is valuable | Strong correction logic and careful handling of repeat ambiguity | Higher computational burden and slower workflows on some datasets
hifiasm | High-accuracy HiFi reads, often with complementary long-range support | Large diploid or polyploid genomes, phased assembly, near-T2T design | Excellent contiguity and phased graph logic for accurate long reads | Depends strongly on data quality and project design; not a universal answer for every noisy-read case

This table should be read as a fit map, not a winner list. The right assembler is the one whose evidence assumptions match the genome and the endpoint.

How to optimize contiguity without inflating assembly error

"Contiguity optimization" sounds like a software-tuning problem. In reality, it is a three-step systems problem.

  1. Define the endpoint first.
    Decide whether the goal is draft contigs, chromosome-scale scaffolds, phased diploid assembly, or T2T-oriented reconstruction. Different endpoints require different evidence layers.
  2. Match the evidence layers to the failure modes.
    If the genome is repeat-rich, longer or more informative span matters more than depth alone. If haplotype structure is central, phased assembly logic matters more than raw scaffold size. If chromosome-scale order matters, long-range evidence such as Hi-C sequencing becomes part of the core architecture rather than an optional add-on.
  3. Validate against likely failure modes, not just summary metrics.
    Ask where repeat collapse, chimeric misjoin, over-scaffolding, or haplotype distortion are most likely to occur. Then choose validation methods that can actually expose those problems.

This framework explains why ambitious projects increasingly converge on integrated designs rather than sequential rescue strategies. A team planning whole genome sequencing for an ordinary discovery-stage study may not need a heavily layered assembly architecture. A team targeting chromosome-scale or repeat-complete output from a large eukaryotic genome often does. In those cases, plant/animal whole genome de novo sequencing is best understood not as a generic service label, but as a project architecture that should be matched to genome size, repeat burden, ploidy, and endpoint.

Closing perspective

Genome assembly has moved well beyond the era of generic "overview" content. The central questions are now about graph choice, repeat logic, long-range physical evidence, and the difference between continuity and truth. A strong assembly is not the one that simply looks long. It is the one that remains defensible when repeat structure, haplotype representation, and chromosome-scale validation are all examined together.

That shift changes how scientific buyers and technical teams should plan de novo projects. The right question is no longer "Which pipeline gives the largest N50?" It is "Which evidence model and algorithmic logic best preserve truth for this genome and this endpoint?" Once that question leads the design, scaffold span, phased structure, and even T2T-grade reconstruction become consequences of sound inference rather than cosmetic output.

Teams planning a de novo genome project should define the endpoint first (draft contigs, chromosome-scale scaffolds, phased diploid assembly, or T2T-oriented reconstruction), because the right sequencing and scaffolding architecture depends on genome size, repeat burden, ploidy, and the failure modes most likely to distort the result. In practice, that is why service architecture matters: the strongest design is the one that matches evidence layers to the biological problem, not the one that simply adds more data.

FAQ

What is the main difference between de Bruijn graph assembly and OLC assembly?

de Bruijn graph assembly compresses reads into k-mer relationships and is especially efficient for short-read data. OLC-style assembly preserves longer read context by using overlaps directly, which is often more suitable for long-read data where span helps resolve repeats.

Why do repeats break genome assemblies so often?

Repeats create non-unique sequence structure. If the available evidence does not bridge uniquely from one side of the repeat to the other, the assembler cannot tell which genomic copy should connect to which path. The result is collapse, fragmentation, or false joining.

Can Hi-C alone produce a true telomere-to-telomere assembly?

No. Hi-C is excellent for chromosome-scale clustering, ordering, and orientation, but it does not replace sequence-level reconstruction across centromeres, telomeres, or other difficult repeat-rich regions.

Why is N50 not enough to evaluate assembly quality?

Because N50 measures continuity, not correctness. It does not reveal whether joins are valid, whether gene space is complete, whether repeats are collapsed, or whether the consensus sequence agrees with trusted read evidence.

When is BUSCO most useful?

BUSCO is most useful for evaluating lineage-appropriate gene-space completeness. It is strong evidence that expected conserved genes are represented, but it does not by itself prove correct repeat resolution or chromosome-scale structure.

What does k-mer spectra analysis add that mapping-based evaluation may miss?

K-mer analysis can estimate completeness and consensus accuracy in a largely reference-free way. That is especially valuable when the available reference is incomplete, structurally different, or too distant to serve as a clean benchmark.

Which assembler is best: Flye, Canu, or hifiasm?

There is no universal winner. Flye is often practical for robust long-read assembly, Canu remains valuable for correction-heavy noisy-read workflows, and hifiasm is especially strong for accurate long-read phased assembly. The best choice depends on read accuracy, span, genome complexity, and endpoint.

What data combination is most effective for a high-complexity eukaryotic genome?

In many cases, the strongest design combines accurate long reads for contig construction, long-range evidence such as Hi-C for chromosome-scale ordering, and ultra-long reads when extreme repeats must be bridged directly.

References

  1. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. DOI: 10.1038/nbt.2023
  2. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. DOI: 10.1038/s41587-019-0072-8
  3. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. DOI: 10.1038/s41592-020-01056-5
  4. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. DOI: 10.1101/gr.215087.116
  5. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. DOI: 10.1093/bioinformatics/btv351
  6. Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. DOI: 10.1186/s13059-020-02134-9
  7. Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. DOI: 10.1126/science.abj6987
  8. Rautiainen M, Nurk S, Walenz BP, et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. DOI: 10.1038/s41587-023-01662-6
  9. Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. DOI: 10.1038/nbt.2727
  10. Bankevich A, Tang Y, Pevzner PA. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. DOI: 10.1038/s41587-022-01220-6
  11. Rhie A, Walenz BP, Koren S, Phillippy AM. Genome assembly in the telomere-to-telomere era. DOI: 10.1038/s41576-024-00718-w
  12. Cheng H, Jarvis ED, Fedrigo O, et al. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with hifiasm-UL. DOI: 10.1038/s41592-024-02269-8

Disclaimer: This resource is intended for research-use project planning and technical evaluation only, not for clinical, diagnostic, or patient-use applications.
