Animal & Plant De Novo Genome Sequencing QC & Deliverables


At a glance:

Step 1: Define success before you ship a sample
Step 2: Sample requirements for animal and plant de novo genome sequencing
Step 3: QC checkpoints during library construction and sequencing
Step 4: Choose a sequencing strategy that matches your failure modes
Step 5: Assembly QC checkpoints (validate at each milestone)
Step 6: Deliverables checklist (files + reports you should insist on)
Step 7: Governance items that prevent surprises (cost, timeline, reruns, and IP)
Next steps
FAQ


If you're planning animal and plant de novo genome sequencing for a complex genome (high repeat content, high heterozygosity, or polyploidy), the fastest way to burn budget is to treat the project like "ship DNA → get a FASTA."

Evaluation-stage teams usually fail for mundane reasons rather than algorithmic ones: DNA quality that looked "fine" on a NanoDrop, missing QC reports, unclear acceptance metrics, or deliverables that aren't publication-ready.

This guide is designed as a pre-flight checklist: sample requirements, QC checkpoints, and deliverables you should agree on before you start—with explicit "done when…" gates you can use in a statement of work.

Key takeaways

Define success upfront with an acceptance bundle that includes contiguity + completeness + correctness (not just N50).
For long-read assemblies, HMW DNA integrity is the limiting factor—set explicit intake gates (Qubit mass, NanoDrop ratios, PFGE/Femto Pulse profile, handling notes).
Require intermediate QC artifacts during library prep and sequencing so failures can be diagnosed (not hidden behind a final FASTA).
Choose strategy based on your constraint: HiFi for consensus accuracy, ultra-long reads for repeat resolution, and specify the polishing/validation plan.
Treat assembly as milestone deliverables (draft → polished → scaffolded) with validation at each stage (mapping/k-mer checks, BUSCO, QV/equivalent, contamination screening).
Agree on a deliverables package (raw reads, versioned assemblies, QC reports, pipeline manifests) so the work is reproducible and publication-ready.
Put governance in writing: rerun/top-up policy, timeline by milestone, and IP/data handling expectations.

Step 1: Define success before you ship a sample

Before you talk about platforms or coverage, lock down what "success" means for your specific organism and downstream goals.

1.1 Choose the target assembly level

Pick one and write it into the project scope:

Draft contig assembly (fastest, cheapest): useful for exploratory gene discovery and some comparative genomics.
Polished contig assembly (common baseline): higher consensus accuracy; better for annotation and downstream comparisons.
Chromosome-scale assembly (often needed for complex plant/animal genomes): requires scaffolding evidence (typically Hi-C) and stronger validation.
Near telomere-to-telomere (T2T) assembly (only for some projects): requires aggressive QC, multi-data integration, and realistic expectations.

1.2 Define acceptance metrics (don't let N50 be the only headline)

A high N50 can coexist with misjoins, collapsed haplotypes, or missing genic regions. For evaluation-stage work, set an acceptance bundle that covers contiguity + completeness + correctness.

Minimum acceptance bundle:

Completeness: BUSCO completeness using the correct lineage dataset (e.g., Embryophyta for plants, Metazoa for animals)
Correctness: a consensus accuracy proxy (often reported as QV or an equivalent method) with the method stated
Contiguity: contig N50 and (if scaffolding is in scope) scaffold N50
Sanity checks: read mapping rate, k-mer completeness (when used), contamination screening

Key Takeaway: Treat N50 as a contiguity descriptor, not a quality guarantee. Your acceptance criteria should include at least one completeness metric and one correctness metric.
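To make the bundle concrete, here is a minimal sketch of an acceptance check that refuses to pass on contiguity alone. The class name, function name, and all threshold values are hypothetical placeholders for illustration; set your own gates per organism and downstream goal.

```python
from dataclasses import dataclass


@dataclass
class AssemblyMetrics:
    contig_n50_bp: int          # contiguity
    busco_complete_pct: float   # completeness
    qv: float                   # correctness proxy (state the method separately)


# Hypothetical gates for illustration only, not recommended values.
EXAMPLE_GATES = AssemblyMetrics(
    contig_n50_bp=1_000_000, busco_complete_pct=95.0, qv=40.0
)


def evaluate_bundle(observed: AssemblyMetrics, gates: AssemblyMetrics) -> dict[str, bool]:
    """Return one pass/fail flag per metric; the bundle passes only if all pass."""
    return {
        "contiguity": observed.contig_n50_bp >= gates.contig_n50_bp,
        "completeness": observed.busco_complete_pct >= gates.busco_complete_pct,
        "correctness": observed.qv >= gates.qv,
    }


# An assembly with a large N50 that still misses the completeness gate.
result = evaluate_bundle(AssemblyMetrics(25_000_000, 88.2, 45.1), EXAMPLE_GATES)
bundle_passes = all(result.values())
```

In this made-up example the N50 looks excellent, yet the bundle fails because completeness is below the gate, which is exactly the case a headline N50 would hide.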

Step 2: Sample requirements for animal and plant de novo genome sequencing

Complex animal/plant assemblies are disproportionately sensitive to DNA integrity. If you want long reads, you need high molecular weight DNA (HMW DNA)—and you need to protect it from shearing.

2.1 The minimum QC panel you should run (and record)

Oxford Nanopore's QC guidance is explicit about purity thresholds and handling. It states that DNA should have an OD 260/280 of ~1.8 and an OD 260/230 of 2.0–2.2, and it recommends a Qubit fluorometer for accurate DNA quantification (with NanoDrop mainly for purity checks when DNA is sufficiently concentrated). Source: Oxford Nanopore, "Input DNA/RNA QC" (last updated 2026-04-10).

Use this as a vendor-neutral baseline QC gate.

2.2 HMW DNA acceptance table (practical gates)

QC item	What to measure	"Pass" gate (typical)	Done when…
Quantity	Qubit dsDNA (BR/HS as appropriate)	Enough for planned library strategy + contingency	You can allocate input for library prep and reserve material for reruns
Purity (protein/phenol)	NanoDrop A260/280	~1.8 (roughly 1.75–2.0)	Value is stable across re-measurements and consistent with clean gDNA
Purity (salts/organics)	NanoDrop A260/230	2.0–2.2	No clear inhibitor signal; no need for extra cleanup
Integrity / size	PFGE or Femto Pulse (preferred for >10 kb)	Strong HMW peak; minimal smear	Fragment-size profile supports long-read library construction
Handling	Process notes	No vortexing; gentle mixing	Extraction/handling steps are documented and reproducible
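The numeric gates in the table can be written down as a small intake check so that "stop/go" decisions are recorded consistently. This is a sketch only: the function name and the 1.5× reserve factor are assumptions, and the integrity gate is passed in as an operator judgment because a PFGE or Femto Pulse profile is not reduced to a single number here.

```python
def intake_failures(mass_ug, required_ug, a260_280, a260_230,
                    hmw_profile_ok, reserve_factor=1.5):
    """Return a list of gate failures; an empty list means the sample can proceed.

    reserve_factor is a hypothetical contingency multiplier so that material
    is held back for reruns. Purity ranges follow the table above.
    """
    failures = []
    if mass_ug < required_ug * reserve_factor:
        failures.append("quantity: not enough mass for library prep plus reserve")
    if not 1.75 <= a260_280 <= 2.0:
        failures.append("purity: A260/280 outside 1.75-2.0")
    if not 2.0 <= a260_230 <= 2.2:
        failures.append("purity: A260/230 outside 2.0-2.2")
    if not hmw_profile_ok:
        failures.append("integrity: no strong HMW peak or excessive smear")
    return failures


# Illustrative sample with enough mass but a low A260/230 ratio.
flagged = intake_failures(mass_ug=12.0, required_ug=5.0,
                          a260_280=1.82, a260_230=1.4, hmw_profile_ok=True)
```

Returning the reasons rather than a single boolean keeps the record useful later, when someone needs to know whether the limiting factor was purity, yield, or fragmentation.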

2.3 Plant- and animal-specific failure modes to plan for

This is where many SOWs are too generic.

Plant samples (common failure modes)

Polysaccharides/phenolics → low A260/230 and enzymatic inhibition.
Tough cell walls → harsh extraction → shearing.
Chloroplast/mitochondrial overrepresentation → organellar reads consume yield and can distort assembly totals and coverage estimates.

Animal samples (common failure modes)

Field collection / microbiome exposure → contamination risk.
Tissue degradation in transport → fragmented DNA profiles.
Inconsistent tissue type across samples → variability in yield and inhibitor load.

"Done when…" for this subsection:

You have a written mitigation plan for your top 1–2 risks (extra cleanup step, alternate tissue choice, repeat extraction threshold, etc.).

2.4 Input amount expectations (why genome size belongs in the first email)

Providers often specify DNA input either as an absolute mass (µg) or as a genome-size-scaled guideline. For example, PacBio discusses HiFi library inputs in workflow-specific terms (with standard workflows often described as scaling with genome size, and low/ultra-low workflows available for constrained samples). Source: PacBio, "New Ampli‑Fi ultra‑low‑input protocol".

Practical evaluation question to ask your provider:

"Give me a table that maps genome size × ploidy/heterozygosity assumptions × library strategy × required input × contingency. What happens if sample QC barely misses the gate?"

Step 3: QC checkpoints during library construction and sequencing

If you only receive raw FASTQ and a final assembly, you've lost visibility into where things went wrong. Your evaluation checklist should require intermediate QC artifacts.

3.1 Library QC checkpoints to request

Ask the provider to report (at minimum):

DNA quantification method and kit (and whether RNase treatment was used)
Size distribution method (PFGE/Femto Pulse; settings)
Library size distribution (where applicable)
Any cleanup steps applied and why
A "stop/go" decision note when a sample is borderline (and whether re-extraction or alternate tissue was recommended)

Done when…

You can explain, sample by sample, whether the limiting factor is purity, yield, or fragmentation.

3.2 Sequencing run QC checkpoints to request

For each sample (or each library pool), request:

Yield (Gb) and read count
Read length distribution (including read N50)
Quality distribution (platform-appropriate)
Coverage estimate vs target with the assumed genome size stated

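Pro Tip: Require the provider to state the genome size assumption used for coverage calculations. Otherwise, "30× coverage" can be meaningless: the same yield that gives 30× against a 1 Gb assumption gives only 15× if the genome is actually 2 Gb.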
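The dependence of a coverage figure on the assumed genome size is simple arithmetic, sketched below. The function name and the example yield and genome sizes are illustrative assumptions, not values from any provider.

```python
def estimated_coverage(yield_gb: float, assumed_genome_size_gb: float) -> float:
    """Coverage estimate = sequenced bases / assumed genome size (same units)."""
    if assumed_genome_size_gb <= 0:
        raise ValueError("assumed genome size must be positive")
    return yield_gb / assumed_genome_size_gb


# The same 30 Gb of yield reported against three different size assumptions.
yield_gb = 30.0
coverage_by_assumption = {
    size_gb: estimated_coverage(yield_gb, size_gb) for size_gb in (1.0, 1.5, 2.0)
}
```

A run report that states only the coverage value cannot be checked; one that states yield, assumed genome size, and coverage can be recomputed by anyone.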

Step 4: Choose a sequencing strategy that matches your failure modes

For complex plant and animal genomes, strategy is usually a trade-off between:

contiguity
consensus accuracy
cost and turnaround
how much bioinformatics support you'll need

4.1 PacBio HiFi sequencing vs Oxford Nanopore ultra-long reads (evaluation framing)

Use this simplified decision frame:

If consensus accuracy is your constraint (gene models, annotation confidence, polished variant calls): bias toward PacBio HiFi sequencing.
If long-range repeat resolution is your constraint (extreme repeats, structural complexity): consider Oxford Nanopore ultra-long reads, often paired with an explicit polishing plan.

4.2 Coverage: how to ask for it (and how to avoid paying for the wrong thing)

Coverage targets should be written per haplotype when relevant, and tied to an explicit output.

PacBio guidance commonly frames de novo assembly planning in terms of HiFi read coverage per haplotype (for many projects, a 10–15× per-haplotype recommendation is a common starting point, with higher targets in ultra-low-input contexts). Use that as a planning baseline, then adjust for genome complexity and assembly goals.
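As a planning aid, per-haplotype targets can be turned into a yield requirement and a top-up amount. The sketch below assumes the simplest model (required yield = haploid genome size × number of haplotypes × per-haplotype coverage); real projects may deviate because of heterozygosity, collapsed regions, or organellar reads. Function names and example numbers are hypothetical.

```python
def required_yield_gb(haploid_genome_gb: float, haplotypes: int,
                      per_haplotype_coverage: float) -> float:
    """Yield needed to reach the per-haplotype coverage target on every haplotype."""
    return haploid_genome_gb * haplotypes * per_haplotype_coverage


def top_up_gb(required_gb: float, delivered_gb: float) -> float:
    """Additional yield still needed; zero if the target is already met."""
    return max(0.0, required_gb - delivered_gb)


# Illustrative diploid with a 1 Gb haploid genome at 15x per haplotype.
needed = required_yield_gb(haploid_genome_gb=1.0, haplotypes=2,
                           per_haplotype_coverage=15.0)
shortfall = top_up_gb(needed, delivered_gb=22.5)
```

Writing the shortfall down in gigabases makes the fallback conversation (top-up sequencing versus a new library) more specific than "the run was a bit low".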

For a practical explanation of why "more data" has diminishing returns (and where it still matters), PacBio's coverage explainer is a useful reference for building an internal cost-benefit model for depth planning. Source: PacBio, "Sequencing 101: sequencing coverage" (updated 2026-04-13).

Practical evaluation questions to ask your provider:

What's the proposed coverage target (and genome size assumption)?
Which metric improves if we add more data—BUSCO, correctness (QV), scaffold correctness, or mostly N50?
What's the planned fallback if the first run misses the target (top-up sequencing vs re-library vs re-extraction)?

4.3 When Hi-C is worth it (and what "QC" should look like)

Hi-C scaffolding is usually worth considering when:

you need chromosome-scale assemblies for synteny, breeding-relevant haplotypes, or large-scale rearrangements
your genome is repeat-rich or polyploid and contigs alone won't resolve structure

If you include Hi-C, don't accept it as a black box: require contact-map evidence and a written description of how misjoins were handled.

For example, CD Genomics offers HiFi-C as a long-read + conformation capture approach that can support chromosome-scale objectives in appropriate projects (see CD Genomics HiFi‑C sequencing).

Step 5: Assembly QC checkpoints (validate at each milestone)

Treat assembly as a multi-stage deliverable. For evaluation-stage work, don't accept a single "final FASTA" without the QC trail.

5.1 Draft assembly checkpoint

Request a draft assembly package with:

Assembly stats (total length, number of contigs, contig N50)
Coverage and mapping summary
Initial contamination screen

Done when…

Total length is plausible for the organism (and explained if not)
Mapping/k-mer signals suggest the assembly isn't missing major content
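The basic assembly stats are easy to recompute independently from the delivered FASTA, which is a cheap way to confirm the report matches the file. The following is a minimal standard-library sketch; the function names are illustrative, and the tiny in-memory FASTA stands in for a real draft assembly.

```python
import io


def contig_lengths(fasta_handle):
    """Collect sequence lengths from a FASTA stream, one entry per record."""
    lengths = []
    current = None
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(fasta_handle):
    lengths = contig_lengths(fasta_handle)
    return {
        "total_length": sum(lengths),
        "num_contigs": len(lengths),
        "contig_n50": n50(lengths),
    }


# Toy input: three contigs of 10, 6, and 4 bases.
demo = io.StringIO(">c1\nACGTACGTAC\n>c2\nACGTAC\n>c3\nACGT\n")
stats = assembly_stats(demo)
```

Comparing `total_length` against the expected genome size is the quickest plausibility check; a large gap in either direction should come with an explanation.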

5.2 Polishing checkpoint (correctness and reproducibility)

Polishing is where correctness improves—and where pipelines can become opaque.

Require:

Toolchain and versions (assembler, polisher, parameters)
A correctness evaluation output (QV or equivalent) with the method stated

Done when…

The correctness metric improves or stabilizes between polishing rounds
The pipeline can be re-run from the same inputs with the same versions/parameters
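When QV is reported on the Phred scale (the usual convention, though the provider should confirm it along with the estimation method), it converts directly to an expected error rate. The helper names below are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Phred-scaled quality value to per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Per-base error probability to Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def expected_errors_per_mb(qv: float) -> float:
    """Expected number of consensus errors per million bases at a given QV."""
    return qv_to_error_rate(qv) * 1_000_000


errors_q40 = expected_errors_per_mb(40)   # about 100 errors per Mb
errors_q50 = expected_errors_per_mb(50)   # about 10 errors per Mb
```

Expressing the gate as errors per megabase often makes the impact on gene models easier to discuss than the QV number alone.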

5.3 Completeness checkpoint (BUSCO)

Run BUSCO with the correct lineage and report the full breakdown.

Done when…

BUSCO completeness meets the acceptance gate you set in Step 1
Fragmented BUSCO is low enough for your downstream goals
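BUSCO's one-line short summary can be parsed so the gate is checked the same way at every milestone. This sketch assumes the commonly seen `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; verify it against the BUSCO version your provider uses. The example values and the gate thresholds are made up for illustration.

```python
import re

# Assumed layout of the BUSCO short-summary line; confirm for your BUSCO version.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Extract the full BUSCO breakdown from a short-summary line."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO short summary")
    fields = match.groupdict()
    return {
        "complete": float(fields["C"]),
        "single": float(fields["S"]),
        "duplicated": float(fields["D"]),
        "fragmented": float(fields["F"]),
        "missing": float(fields["M"]),
        "total": int(fields["n"]),
    }


def meets_busco_gate(summary: dict, min_complete: float, max_fragmented: float) -> bool:
    """Hypothetical gate: enough complete BUSCOs and few enough fragmented ones."""
    return (summary["complete"] >= min_complete
            and summary["fragmented"] <= max_fragmented)


# Illustrative summary line, not from a real assembly.
summary = parse_busco_summary("C:96.2%[S:80.1%,D:16.1%],F:1.3%,M:2.5%,n:1614")
passes = meets_busco_gate(summary, min_complete=95.0, max_fragmented=2.0)
```

Keeping the single-copy and duplicated values alongside the headline completeness is what "report the full breakdown" means in practice.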

5.4 Scaffolding checkpoint (if chromosome-scale is in scope)

If chromosome-scale scaffolding is in scope, require:

Hi-C contact map visuals plus summary QC
Misjoin detection/curation notes (even if it's "no manual curation performed")
Evidence that scaffolds represent biological structure, not just aggressive joining

Done when…

Chromosome-scale structure is supported by contact evidence
The provider can explain and document how conflicts were resolved

Step 6: Deliverables checklist (files + reports you should insist on)

A clean deliverables package makes the project reproducible, publishable, and auditable. The checklist below is written to be provider-agnostic.

6.1 Raw data deliverables

Raw reads (FASTQ): per sample, including any "unused" reads (don't discard data silently)
Run-level yield/QC summary report

6.2 Alignment and intermediate deliverables (when applicable)

Mapped reads (BAM/CRAM) to the final assembly (or to a reference, if that's the chosen analysis)
Read mapping stats (overall mapping rate, coverage distribution)

6.3 Assembly deliverables (milestone versions)

Require versioned outputs:

Draft contig FASTA
Polished contig FASTA
Scaffolded FASTA (if scaffolding is included)

Include with each milestone:

Assembly report (length, number of contigs/scaffolds, N50)
Pipeline manifest (tools, versions, parameters)
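One possible shape for a pipeline manifest is a small JSON document that ties each milestone to its tools, parameters, and output checksums. The field names, tool name, version string, and parameters below are placeholders, not a standard and not any specific provider's format.

```python
import hashlib
import json


def sha256_of_bytes(data: bytes) -> str:
    """Checksum used to prove a delivered file is the one the manifest describes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(milestone: str, tools: list, outputs: dict) -> dict:
    """Assemble a manifest; `outputs` maps file name to file content in bytes."""
    return {
        "milestone": milestone,
        "tools": tools,
        "outputs": {name: sha256_of_bytes(content)
                    for name, content in outputs.items()},
    }


# Placeholder tool entry and a toy FASTA payload for illustration.
manifest = build_manifest(
    milestone="draft",
    tools=[{"name": "example-assembler", "version": "0.0.0",
            "parameters": {"threads": 32}}],
    outputs={"draft_contigs.fasta": b">c1\nACGT\n"},
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

For real deliverables the checksum would be computed over the file on disk (streamed in chunks for large FASTA or BAM files) rather than over an in-memory byte string.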

6.4 QC and validation reports

BUSCO report (full breakdown + lineage)
Correctness/accuracy report (QV or equivalent), with methodology stated
Contamination screen output and interpretation

6.5 Optional but common add-ons

Structural variant analysis outputs (if in scope)
Annotation package (gene models, functional annotation) if requested

CD Genomics positions its animal/plant genomics services as end-to-end, including sequencing, assembly, and optional downstream analyses and reporting, so this deliverables checklist maps cleanly to what a full-service provider should be able to package in a reproducible way (see CD Genomics animal/plant whole genome sequencing).

Step 7: Governance items that prevent surprises (cost, timeline, reruns, and IP)

Teams working on milestone-based acceptance and publication timelines should treat governance as a first-class technical requirement.

7.1 Rerun and failure policy (must be explicit)

Write these answers into the SOW:

If sample QC fails at intake, what's the recommended action (cleanup vs re-extraction vs alternate tissue)?
Who pays for resequencing if yield is below target because of library failure vs because the input was out-of-spec?
Is top-up sequencing possible without rebuilding the library?

7.2 Timeline transparency by milestone

Ask for a milestone timeline (not a single date):

raw data delivery
draft assembly delivery
polished assembly delivery
scaffolded assembly delivery (if in scope)
final reproducible package delivery (all files + manifests)

7.3 Data ownership and sovereignty

Evaluation-stage objections often include data security and IP.

Done when…

storage location, retention window, and access controls are agreed
publication constraints and embargo needs are documented

Next steps

If you want a second set of eyes on your plan for animal and plant de novo genome sequencing, share:

organism + estimated genome size
ploidy/heterozygosity expectations
desired assembly level (contig vs chromosome-scale)
whether Hi-C is planned

…and we'll translate that into a concrete QC gate + deliverables checklist you can use to compare providers and reduce re-sampling risk.

If you're exploring provider options that cover both PacBio and ONT strategies plus bioinformatics delivery, CD Genomics' long-read service hub is a useful starting point: CD Genomics LongSeq.

FAQ

What's the single most common cause of failure in plant/animal de novo genome projects?

Not a fancy algorithm—usually HMW DNA that isn't actually HMW anymore (shearing during extraction/handling) or chemical inhibitors that block enzymatic steps.

Can I rely on NanoDrop alone?

Use NanoDrop for purity ratios, but for DNA mass, Qubit-style fluorometry is preferred because contaminants and residual RNA distort absorbance-based readings.

Do I always need Hi-C?

No. If your downstream analysis requires chromosome-scale structure (e.g., large-scale synteny, structural rearrangements, breeding-relevant haplotypes), scaffolding evidence can be worth it. If your goal is an accurate contig set for gene discovery, it may be unnecessary.

What deliverables do I need for publication and reproducibility?

At minimum: raw FASTQ, versioned assemblies, BUSCO report, and a pipeline manifest (tools, versions, parameters). Without those, reviewers (and future you) can't reproduce the result.

Author: Dr. Yang H., Senior Scientist at CD Genomics

LinkedIn: https://www.linkedin.com/in/yang-h-a62181178/

Dr. Yang H. focuses on long-read sequencing (PacBio SMRT and Oxford Nanopore) and de novo genome assembly workflows for complex animal and plant genomes, with an emphasis on sample QC, acceptance metrics, and reproducible deliverables.

For Research Use Only. Not for use in diagnostic procedures.
