Choosing the Right T2T Deliverables: Assembly Outputs, Polishing, Phasing, and Data Formats (RUO)

Quick Overview

01 Introduction
02 Core deliverable definitions and file-format expectations
03 Interpreting QV in context and common caveats
04 Polishing strategy and practical tool choices
05 Evaluation and caution
06 Recommendations by Use Case
07 Conclusion: Make your decision once, use it everywhere

Introduction

"The wrong deliverable format can add weeks of reformatting — choose wisely from the start." For small-genome projects (<500 Mb) pursuing telomere-to-telomere (T2T) assemblies, the deliverable contract you define on day one will determine whether downstream analysis starts tomorrow—or in three weeks after ad‑hoc conversions. This practical guide explains how to choose the right mix of assembly outputs, polishing, phasing, and data formats so your deliverables are immediately usable, auditable, and archive‑ready.

If you need a concise primer on what T2T entails (telomeres, centromeres, and tricky repeats) and why completeness changes downstream biology, start with the series background: Telomere-to-telomere sequencing explained. Here, we focus on the decisions that matter for T2T deliverables in RUO contexts: contigs versus chromosome-level (and T2T) assembly outputs; base‑level polishing targets that meet research and publication expectations; phasing strategies (trio, Hi‑C, Strand‑seq) for small genomes; and the data formats used for the handoff.

What you'll get from this article:

Clear, actionable acceptance gates for T2T deliverables tailored to small genomes (<500 Mb), including Merqury QV thresholds, BUSCO completeness targets, and QUAST checks, with guidance on interpretation and common caveats.
Practical pipeline guidance and command skeletons for hifiasm and Verkko in trio, Hi‑C, and Strand‑seq scenarios, plus recommended polishing sequences and Merqury/BUSCO/QUAST evaluation steps.
A recommended deliverable bundle and directory/manifest layout that minimizes downstream reformatting: per‑haplotype FASTA (+.fai), compressed GFA/GFA.gz, AGP, soft‑masked FASTA/GFF3 where relevant, indices, checksums, and a provenance README.
Two concrete example workflows (a 50 Mb microbe and a 400 Mb model organism) with expected inputs, outputs, and QC outcomes to help you scope compute, coverage, and gating decisions.
Practical guidance on when to deliver graph vs. linear outputs, how to validate phasing completeness, and a short vendor micro‑example showing a D‑tier (full T2T) handoff pattern for scoping.

Figure 1. Haplotype-resolved diploid assembly (self‑created): a phased diploid assembly with maternal and paternal chromosomes, telomeres, centromeres, and phase blocks. Legend: blue = haplotype A, orange = haplotype B; telomere caps at ends; centromere domains; phased blocks. This image illustrates why a T2T handoff often includes per‑haplotype FASTA plus a GFA graph retaining alternative paths.

Core deliverable definitions and file-format expectations

In practice, deliverables fall on a continuity spectrum: a contig is an uninterrupted sequence produced from overlapping reads; a scaffold orders and orients contigs using long-range links and may contain gap runs (Ns); a chromosome-level assembly has scaffolds anchored to expected chromosomes but can still contain gaps; and a telomere-to-telomere (T2T) assembly is a gap-free, end-to-end chromosome-level sequence that includes canonical telomeric repeats and resolves centromeres and other large repeats (see the Genome.gov T2T overview, the gapless-assembly literature (Koren et al., 2024), and examples in recent mouse and plant T2T reports). These distinctions matter for deliverables because some downstream analyses require linear, gapless chromosomes (publication/benchmarking), while others benefit from the richer branching information preserved in an assembly graph.
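Because a T2T sequence is gap-free by definition, one quick sanity check on a candidate deliverable is to count N bases per FASTA record. The following is a minimal sketch, assuming plain (uncompressed) FASTA on standard input; the function name `count_gap_bases` is hypothetical and not part of any tool named in this article.

```shell
#!/usr/bin/env bash
# Count N/n bases per FASTA record; a gap-free (T2T-style) sequence should report 0.
# Reads plain FASTA from stdin; for gzipped input, pipe through `gzip -dc` first.
# Output columns (tab-separated): record name, sequence length, number of N bases.
count_gap_bases() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len "\t" gaps
      name = substr($1, 2); len = 0; gaps = 0
      next
    }
    {
      len += length($0)
      gaps += gsub(/[Nn]/, "")   # gsub returns the number of substitutions made
    }
    END { if (name != "") print name "\t" len "\t" gaps }
  '
}
```

A typical call would be `gzip -dc hap1.fa.gz | count_gap_bases`. Any record with a non-zero third column still contains gap bases and is a scaffold rather than a T2T sequence. The check says nothing about telomere repeats or centromere resolution, which need separate validation.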

For small-genome T2T projects we recommend packaging both linear and graph representations so consumers can choose the view that fits their workflow: provide per-haplotype FASTA files (hap1.fa, hap2.fa, gzipped and indexed) as the canonical sequences; include the assembly graph in GFA or GFA.gz to preserve alternative paths and unresolved repeats; add an AGP file to document scaffold→contig relationships and gap sizes for archival submissions; and deliver annotation files in GFF3 (or GTF) when structural annotation is included. These formats align with community practices and archive requirements (see the NCBI AGP specification and the community-maintained GFA specification).

Figure 2: End‑to‑end T2T pipeline for small genomes (<500 Mb). Inputs at left feed assembly graph construction and phasing modules; polishing, QC, and packaging steps create the deliverable bundle (per‑haplotype FASTA, GFA, AGP, QC reports). Expected coverage and target QV ranges are shown as callouts for each data type.

Finally, treat the file bundle as a single handoff: include index files (.fai, .gzi where applicable), checksums, a README describing tool versions and parameters, and a QC bundle (Merqury k-mer QV and spectra plots, BUSCO summary, and a QUAST report). The following section details practical acceptance gates (Merqury QV thresholds, BUSCO completeness, and QUAST metrics) and how to interpret them for small genomes.
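A single-handoff bundle is easier to enforce when the expected file list is checked automatically. The sketch below is one way to do that, assuming the example file names used in this article (hap1.fa.gz, assembly.gfa.gz, scaffolds.agp, merqury_summary.txt); `README.md`, `manifest.json`, and the function name `check_bundle` are hypothetical placeholders to adapt to your own contract.

```shell
#!/usr/bin/env bash
# Report which expected deliverables are missing (or empty) in a bundle directory.
# The .fai/.gzi names assume the FASTA was compressed with bgzip and indexed
# with `samtools faidx`; plain gzip output cannot be indexed that way.
check_bundle() {
  local dir="$1" missing=0 f
  for f in hap1.fa.gz hap1.fa.gz.fai hap1.fa.gz.gzi \
           assembly.gfa.gz scaffolds.agp \
           merqury_summary.txt README.md manifest.json; do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING: $f"
      missing=$((missing + 1))
    fi
  done
  echo "missing_files=$missing"
  [ "$missing" -eq 0 ]   # exit status 0 only when the bundle is complete
}
```

Running `check_bundle ./delivery` before shipping gives a non-zero exit status when anything is absent, so it can gate a packaging script. For diploid deliveries, add the hap2 files to the list.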

Merqury QV and how to use it for acceptance gates

Merqury reports a reference‑free consensus quality value (QV) by comparing k‑mers derived from your raw reads to k‑mers observed in the assembly and converting the inferred error rate to a Phred‑scale score (QV = −10·log10(error rate)). For a practical reference and method details see Formenti et al., Merfin/Merqury (2022), which describes the k‑mer workflow (meryl → merqury) and output interpretation, including spectra plots that reveal collapses, duplications, and missing content.
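To make the conversion concrete, the sketch below reproduces the usual k-mer-based estimate: the fraction of assembly k-mers that are absent from the reads is turned into a per-base error rate and then into a Phred score. It is an illustration of the arithmetic, not a replacement for Merqury; the function name `kmer_qv` and the example counts are hypothetical.

```shell
#!/usr/bin/env bash
# QV from k-mer counts, following the Merqury-style estimate:
#   P  = (1 - asm_only / total)^(1/k)   probability that a base is correct
#   E  = 1 - P                          per-base error rate
#   QV = -10 * log10(E)
# Arguments: k-mers found only in the assembly, total assembly k-mers, k.
kmer_qv() {
  awk -v a="$1" -v t="$2" -v k="$3" 'BEGIN {
    if (a == 0) { print "inf"; exit }        # no error k-mers observed
    p = exp(log(1 - a / t) / k)
    e = 1 - p
    printf "%.2f\n", -10 * log(e) / log(10)
  }'
}

# Hypothetical example: 1,000 assembly-only 21-mers out of 100,000,000.
kmer_qv 1000 100000000 21   # prints 63.22
```

The division by k reflects that a single base error typically disrupts up to k overlapping k-mers, which is why the QV depends on the k chosen and why that k should be recorded alongside the reported value.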

Practical thresholds for small‑genome T2T projects

Use QV ≥ 40 as a conservative research‑grade acceptance gate (≈1 error per 10,000 bp); aim for QV ≥ 50 when packaging assemblies for publication or method benchmarking (≈1 error per 100,000 bp). Many recent HiFi‑based T2T projects report QV50+ after polishing, but QV targets should be weighed alongside BUSCO and structural metrics rather than treated in isolation. Practical procedural notes and example Merqury command skeletons are summarized in the Galaxy Project's Assembly QC tutorial (2025).
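It can help to translate a QV gate into an expected number of base errors for the genome at hand. The snippet below is a small arithmetic sketch using the Phred relation (error rate = 10^(−QV/10)); the function name `expected_errors` is hypothetical, and the 400 Mb size is simply the larger example genome used later in this article.

```shell
#!/usr/bin/env bash
# Expected number of erroneous bases for a given QV and genome size (bp):
#   errors = size * 10^(-QV/10)
expected_errors() {
  awk -v qv="$1" -v bp="$2" 'BEGIN { printf "%.0f\n", bp * exp(-qv / 10 * log(10)) }'
}

expected_errors 40 400000000   # prints 40000
expected_errors 50 400000000   # prints 4000
expected_errors 55 400000000   # prints 1265
```

Seen this way, moving from QV 40 to QV 50 on a 400 Mb genome removes roughly 36,000 expected errors, while the step from QV 50 to QV 55 removes fewer than 3,000, which is useful context when deciding whether another polishing round is worth the compute.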

Interpreting QV in context and common caveats

QV is most informative when paired with k‑mer completeness and spectra‑cn/asm plots: a high QV with low k‑mer completeness indicates missing sequence; a high QV with abnormal spectra may signal collapsed repeats or false duplications. For small, heterozygous genomes, parental k‑mers (trio) or a high‑accuracy k‑mer set improves QV reliability. Always report Merqury QV alongside BUSCO (lineage choice noted) and QUAST/QUAST‑LG summaries. Detailed QC acceptance criteria, with actionable thresholds and example troubleshooting steps, are covered in the QC metrics article (see /t2t-assembly-qc-metrics.html).

Polishing strategy and practical tool choices

For a small‑genome T2T deliverable, treat polishing as a data‑type–dependent, evidence‑led pipeline rather than a one‑size‑fits‑all step. For PacBio HiFi assemblies, start with assembler‑recommended polishing (Arrow/ccs workflows or hifiasm's internal consensus), then apply a repeat‑aware pass such as NextPolish2 to fix residual homopolymers and repeat‑associated errors; evaluate each round with k‑mer checks and report Merqury QV after every major polishing stage (Formenti et al., Merfin/Merqury, 2022; NextPolish2, 2024). A common HiFi skeleton is: align HiFi reads → run Arrow/consensus → NextPolish2 → optional short‑read polish (Pilon or Polypolish) if high‑coverage Illumina exists. Example commands (conceptual):

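Align: pbmm2 align --preset CCS --sort assembly.fa hifi.bam aligned.bam
NextPolish2: nextPolish2 -t 16 aligned.bam assembly.fa k21.yak k31.yak > polished.fa (k21.yak and k31.yak are yak k‑mer databases built from short reads; confirm the arguments against the help output of your installed version)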

For ONT assemblies, an effective pipeline remains Racon (1–3 iterations) → Medaka (neural network model matched to basecaller) → short‑read polish; recent ONT advances (Dorado/APK) can push hybrid Verkko/Medaka assemblies toward Q50 in practice, but these rely on chemistry/basecaller parity and careful parameterization (Oxford Nanopore announcement, 2024). Typical ONT skeleton:

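Racon: minimap2 -x map-ont -t 32 assembly.fa reads.fq.gz > overlaps.paf && racon -m 8 -x -6 -g -8 -t 32 reads.fq.gz overlaps.paf assembly.fa > racon1.fa
Medaka: medaka_consensus -i reads.fq.gz -d raconN.fa -o medaka_out -t 32 -m r941_min_high_g303 (raconN.fa is the output of the last Racon round; replace the -m value with the model that matches your flow cell and basecaller)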

Evaluation and caution

Use Merqury/Merfin to report QV and k‑mer completeness after each polishing stage and avoid blind iterative polishing: over‑polishing can introduce reference bias or collapse true haplotypes in heterozygous regions. For small genomes, stop when Merqury QV gains plateau and BUSCO/QUAST metrics no longer improve; record tool versions, parameters, and the k used for Merqury in the README so downstream users can reproduce the acceptance gate calculations.
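The "stop when QV gains plateau" rule can be written down explicitly so that it is applied the same way on every project. The sketch below is one possible formulation, assuming you have recorded the Merqury QV after each stage; the function name `polish_stop_round` and the 0.5 QV minimum gain are hypothetical choices, not values from any tool's documentation.

```shell
#!/usr/bin/env bash
# Given a minimum QV gain and the QV after each stage (stage 1 = unpolished
# assembly), print the 1-based index of the stage to keep: the last stage
# that improved on the previously kept stage by at least the minimum gain.
polish_stop_round() {
  local min_gain="$1"; shift
  printf '%s\n' "$@" | awk -v g="$min_gain" '
    NR == 1          { best = $1; keep = 1; next }
    ($1 - best) >= g { best = $1; keep = NR; next }
                     { exit }          # first stage below the gain threshold: stop
    END              { print keep }
  '
}

# Hypothetical QV series: unpolished, then three polishing rounds.
polish_stop_round 0.5 48.2 52.9 53.1 53.0   # prints 2
```

Here the second stage is kept because the third adds only 0.2 QV. The sketch looks at QV alone; in practice the same stage should also be checked against BUSCO and QUAST before it is accepted, as described above.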

Recommendations by Use Case

For small genomes (<500 Mb) that target full T2T deliverables (trio + Hi‑C + Strand‑seq integration), package outputs so downstream teams — annotators, comparative genomics groups, and benchmarkers — can operate without reformatting. Below are prescriptive acceptance gates, a handoff packaging checklist, and two concrete example workflows.

Acceptance gates and QC thresholds

Follow a multi-metric acceptance policy rather than a single-number pass/fail rule. For T2T small-genome deliverables adopt these gates as minimums, with tighter thresholds for publication or method benchmarking:

Base accuracy (Merqury QV): aim QV ≥ 55; minimum QV ≥ 50 for publication; QV ≥ 40 acceptable for lower-tier research. Report QV with k-mer completeness and spectra plots to show missing or duplicated content. See the detailed QC criteria in T2T Assembly QC Metrics: Completeness, Accuracy, and How to Evaluate Results.
Gene completeness (BUSCO): Complete (C) ≥ 98% preferred; Single-copy (S) ≥ 95%; Duplicated (D) < 2%. Choose the most specific lineage dataset for the taxon (e.g., fungi_odb10, arthropoda_odb10) and report the full BUSCO table.
Structural correctness (QUAST/QUAST‑LG): report NG50/NGA50, misassemblies, genome fraction, and reference-aware metrics when a close reference exists. Misassemblies should be investigated and resolved for contigs that span centromeres or telomeres.
Phasing completeness: For trio-based phasing, report per-haplotype assembly sizes, percent of sequence assigned to haplotypes, and switch error rate where parental truth is available. For Hi‑C/Strand‑seq phasing, include contact-map validation and Strand-seq orientation summaries.
Provenance & integrity: every file must have a checksum (SHA256 preferred), a .fai index for FASTA, and a manifest (JSON/YAML) that records tool versions, parameters, and coverage summary.
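The base-accuracy and gene-completeness gates above lend themselves to an automated check. The following sketch assumes you pass the Merqury QV as a number and the one-line BUSCO summary string as text; the thresholds are the publication-tier values from this list, and the function name `check_gates` is hypothetical. Structural and phasing gates are not covered here.

```shell
#!/usr/bin/env bash
# Check a Merqury QV and a one-line BUSCO summary against the gates listed above.
# Prints PASS, or one FAIL line per violated gate (exit status 1 on any failure).
check_gates() {
  local qv="$1" busco="$2" c s d
  # Extract the C, S and D percentages from a string such as
  # "C:98.6%[S:97.1%,D:1.5%],F:0.6%,M:0.8%,n:758".
  c="$(printf '%s\n' "$busco" | sed -E 's/.*C:([0-9.]+)%.*/\1/')"
  s="$(printf '%s\n' "$busco" | sed -E 's/.*S:([0-9.]+)%.*/\1/')"
  d="$(printf '%s\n' "$busco" | sed -E 's/.*D:([0-9.]+)%.*/\1/')"
  awk -v qv="$qv" -v c="$c" -v s="$s" -v d="$d" 'BEGIN {
    fail = 0
    if (qv + 0 < 50) { printf "FAIL QV %s (need >= 50)\n", qv; fail = 1 }
    if (c + 0 < 98)  { printf "FAIL BUSCO C %s%% (need >= 98)\n", c; fail = 1 }
    if (s + 0 < 95)  { printf "FAIL BUSCO S %s%% (need >= 95)\n", s; fail = 1 }
    if (d + 0 >= 2)  { printf "FAIL BUSCO D %s%% (need < 2)\n", d; fail = 1 }
    if (!fail) print "PASS"
    exit fail
  }'
}
```

For example, `check_gates 52.3 'C:98.6%[S:97.1%,D:1.5%],F:0.6%,M:0.8%,n:758'` prints PASS. Keeping the thresholds in one function makes it straightforward to record the exact gates used in the project README.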

When noting sample metadata and extraction constraints, reference sample prep guidance in Sample & DNA Requirements for T2T Sequencing: How to Avoid Project Failure.

Explicit naming conventions and a checksum manifest reduce downstream ambiguity; include example manifest fragments and a minimal README that documents the acceptance gates used for this project.
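As an illustration of a checksum manifest, the sketch below walks a delivery directory and emits a small JSON document with a SHA256 digest and byte count per file. It is a minimal example, assuming file names without quotes or backslashes; the function name `write_manifest` and the JSON field names are hypothetical, and a real manifest would also carry tool versions, parameters, and coverage summaries.

```shell
#!/usr/bin/env bash
# Emit a JSON manifest (name, sha256, size in bytes) for every regular file
# directly inside a delivery directory. Uses sha256sum, falling back to
# `shasum -a 256` on systems that lack it.
write_manifest() {
  local dir="$1" first=1 f sum
  echo '{"files": ['
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    if command -v sha256sum >/dev/null 2>&1; then
      sum="$(sha256sum "$f" | awk '{print $1}')"
    else
      sum="$(shasum -a 256 "$f" | awk '{print $1}')"
    fi
    [ "$first" -eq 1 ] || echo ','     # comma between entries, not after the last
    first=0
    printf '  {"name": "%s", "sha256": "%s", "bytes": %s}' \
      "$(basename "$f")" "$sum" "$(wc -c < "$f" | tr -d ' ')"
  done
  printf '\n]}\n'
}
```

A call such as `write_manifest ./delivery > manifest.json` should be the last packaging step, after compression and indexing, so that the recorded digests match the files the recipient actually receives. Write the output outside the directory being scanned, or the manifest will list a partial copy of itself.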

Example workflow — 50 Mb microbe

Inputs assumed: PacBio HiFi 30–50× per haplotype, 30× ONT ultra-long (optional), parental data not available, Hi‑C optional for large plasmid scaffolding.

Assembly: hifiasm default mode for HiFi-only data (hifiasm -o asm -t 48 reads.hifi.fq.gz), or Hi‑C mode if Hi‑C reads are supplied (add --h1 hic_R1.fq.gz --h2 hic_R2.fq.gz to the same command).
Polishing: internal hifiasm consensus → NextPolish2 with HiFi reads; evaluate k-mer shifts after each round with Merqury.
QC targets: expect QV 50+ after polishing; BUSCO (fungi/archaea/other lineage) > 98%.
Deliverables: single-file haploid FASTA (if organism effectively haploid) or dual haplotype FASTA if heterozygosity resolved; assembly.gfa.gz; merqury spectra and BUSCO reports; manifest and README.

Expected outputs: hap1.fa.gz (50 Mb), assembly.gfa.gz (small graph), merqury_summary.txt (QV ~50), busco short_summary (C >98%).
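hifiasm writes its contigs as GFA, so producing the linear FASTA deliverable needs a conversion step. The sketch below uses the common awk recipe of turning each GFA segment (S) line into a FASTA record; the function name `gfa_to_fasta` is hypothetical, the output file names in the comments are examples, and the exact GFA file name depends on the hifiasm version and mode.

```shell
#!/usr/bin/env bash
# Convert GFA segment (S) lines on stdin into FASTA records on stdout.
# Assumes each S line carries its sequence in column 3 (not "*").
gfa_to_fasta() {
  awk -F'\t' '$1 == "S" { print ">" $2; print $3 }'
}

# Typical follow-up, assuming htslib's bgzip and samtools are installed
# (shown as comments so the sketch has no external dependencies):
#   gfa_to_fasta < asm.bp.p_ctg.gfa > hap1.fa
#   bgzip hap1.fa && samtools faidx hap1.fa.gz
#   gzip -c asm.bp.p_ctg.gfa > assembly.gfa.gz
```

Using bgzip rather than plain gzip for the FASTA keeps the compressed file indexable, which is what produces the .fai and .gzi files expected in the bundle.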

Example workflow — 400 Mb model (trio + Hi‑C + Strand‑seq)

Inputs assumed: PacBio HiFi 30–40× per haplotype, ONT ultra-long 20×, parental Illumina for trio-binning, Hi‑C 100M read pairs, Strand‑seq libraries (10–20 cells).

Preprocessing: build parental k-mer DBs (yak) and run trio-binning hifiasm to partition reads.
Assembly: hifiasm -o asm -t 96 -1 paternal.yak -2 maternal.yak hifi/*.fq.gz (hifiasm takes the parental k-mer databases through -1 and -2), then integrate Hi‑C for scaffolding; use Verkko for hybrid regions where ultra-long ONT provides resolution.
Phasing validation: compute switch error rate against parental truth; use Strand‑seq to resolve orientations across centromeres and validate inversions.
Polishing: NextPolish2 on haplotype assemblies then targeted Medaka passes on ONT-resolved regions; validate with Merqury until QV gains plateau (aim QV ≥ 55).
Deliverables: hap1.fa.gz, hap2.fa.gz, assembly.gfa.gz (graph preserving alternative paths), scaffolds.agp, merqury spectra, BUSCO reports (C ≥ 98%), QUAST NGA50 and misassembly report, full manifest and provenance.

Expected outputs: two haplotype FASTAs (~200 Mb each), assembly.gfa.gz (large graph with phasing paths), merqury_summary (QV 55+), BUSCO (C ≥98%).
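The trio steps above can be laid out as an ordered command plan. The sketch below only prints the commands, so it can be reviewed or pasted into a scheduler script without any tools installed; the function name `trio_plan` and all input file names are hypothetical, each parent's short reads are assumed to be in a single FASTQ file (paired-end inputs need the handling described in the hifiasm documentation), and flags should be confirmed against your installed versions.

```shell
#!/usr/bin/env bash
# Print (not run) a trio-binning command plan: parental k-mer databases with
# yak, trio assembly with hifiasm, then per-haplotype phasing evaluation.
trio_plan() {
  local t="${1:-16}"
  printf 'yak count -b37 -t%s -o paternal.yak paternal.fq.gz\n' "$t"
  printf 'yak count -b37 -t%s -o maternal.yak maternal.fq.gz\n' "$t"
  printf 'hifiasm -o asm -t %s -1 paternal.yak -2 maternal.yak hifi.fq.gz\n' "$t"
  # After converting asm.dip.hap1.p_ctg.gfa / asm.dip.hap2.p_ctg.gfa to FASTA:
  printf 'yak trioeval -t%s paternal.yak maternal.yak hap1.fa > hap1.trioeval.txt\n' "$t"
  printf 'yak trioeval -t%s paternal.yak maternal.yak hap2.fa > hap2.trioeval.txt\n' "$t"
}

trio_plan 96
```

The trioeval outputs are where the switch error figures for the phasing-completeness gate would come from. Hi‑C scaffolding, Strand‑seq validation, and polishing are deliberately left out of this plan, since their commands depend on the tools chosen during scoping.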

Vendor micro-example

CD Genomics can accept a standard D-tier input package (HiFi reads, parental Illumina for trio binning, Hi‑C FASTQs, and Strand‑seq libraries). For a typical small eukaryote they will run a trio-binned hifiasm assembly, integrate Hi‑C for chromosome scaffolding, and use Strand‑seq to confirm orientation and large-scale phasing. Deliverables are produced as per the checklist above: per-haplotype FASTA (gzip + .fai), a compressed assembly graph (GFA/GFA.gz), AGP, and a QC bundle (Merqury spectra, BUSCO, QUAST) accompanied by a manifest and README. Confirm specific file naming and metric gates during the scoping call.


Figure 3. Left: simplified GFA assembly graph showing nodes and branching paths that expose repeats and alternate haplotype routes (useful for structural diagnostics and manual curation). Right: resolved linear haplotype FASTAs (hap1, hap2) shown as chromosome bars for annotation and archival submission. Deliver both: compressed GFA/GFA.gz plus per-haplotype FASTA (+.fai).

Conclusion — Make your decision once, use it everywhere

A robust T2T deliverable policy for small genomes saves weeks of downstream rework. Summary decision checklist:

Choose phasing method early (trio preferred for small genomes); commit to trio/Hi‑C/Strand‑seq if pursuing full T2T.
Define QV and BUSCO acceptance gates before assembly (aim QV ≥ 55; BUSCO C ≥ 98%).
Require both linear FASTA per haplotype and compressed GFA graph in the final bundle, plus AGP and annotation-ready soft-masked FASTA when applicable.
Deliver a machine‑readable manifest with SHA256 checksums, README, and tool provenance.

If you want to scope a D‑tier T2T deliverable for a small genome, contact us to define inputs, gates, timelines, and pricing.

References:

Formenti, G. et al., Merfin/Merqury k‑mer methods (2022). Merqury/Merfin overview.
Cheng, H. et al., hifiasm algorithm and modes (2022). hifiasm Hi‑C/trio paper.
Rautiainen, M. et al., Verkko hybrid assembler (2023). Verkko Genome Research.
Earth BioGenome Project, assembly standards and guidance (2022). EBP standards summary.
BUSCO user guide and benchmarking recommendations. BUSCO documentation.
Bandage / BandageNG and GFA visualization notes. Bandage GitHub.

For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
