De Novo Plant and Animal Genome Sequencing: Strategies for Non-Model Species with Long-Read and Short-Read Integration

The Non-Model Challenge — Why De Novo Assembly Is Hard

Model organisms — human, mouse, Arabidopsis, Drosophila, zebrafish — enjoy decades of curated reference genomes, gene annotation, and community infrastructure. Every other species is "non-model," and assembling their genomes presents a distinct set of challenges that model-organism pipelines were never designed to handle.

No Reference, No Scaffold

Without a reference genome, there is no template to align reads against. The assembler must reconstruct the genome purely from read overlaps — a computationally intensive graph problem where repetitive sequences create ambiguities. A 300 bp Alu element is trivially resolved by a 15 kb HiFi read; a 5 kb LTR retrotransposon that occurs in 10,000 copies across the genome demands a combination of HiFi contiguity and ONT ultralong spanning reads to resolve.

Heterozygosity — The Two-Haplotype Problem

Non-model organisms, particularly wild populations with large effective population sizes, can exhibit extreme heterozygosity. A marine invertebrate with 5% heterozygous sites carries two haplotypes that differ, on average, at one base in every twenty. A standard assembler confronted with such data tends toward one of two outcomes, both bad: it collapses the two haplotypes into a single "consensus" that matches neither, creating a mosaic assembly with artefactual indels; or it separates the haplotypes so aggressively that the assembly approaches twice the expected haploid genome size, with each haplotype assembled as a separate "genome." The solution is haplotype-aware assembly: hifiasm's trio-binning mode (which uses parental short reads to partition long reads by haplotype) or Hi-C-integrated phasing, which uses chromatin contacts to determine which variants co-occur on the same physical chromosome.

Polyploidy — More Than Two Copies

Many plants are polyploid. Bread wheat (Triticum aestivum) is hexaploid (2n = 6x = 42), sugarcane is octoploid to decaploid, and the strawberry genus Fragaria spans diploid to decaploid species. Polyploid genomes present a k-mer counting nightmare: instead of the clean heterozygous/homozygous peak structure that GenomeScope models for diploids, polyploid k-mer spectra contain overlapping peaks from subgenomes with shared ancestry (homoeologs). GenomeScope 2.0 with Smudgeplot can estimate ploidy de novo and separate subgenome contributions, but assembly of polyploids still requires subgenome-phasing strategies — distinguishing which homoeologous copies belong to which ancestral subgenome — that remain an active research frontier.

Repeat Content and Genome Size

Plant genomes are often repeat-rich to a degree that most animal genomes are not. The 16 Gb bread wheat genome is 85% transposable elements. The 22 Gb loblolly pine genome contains massive LTR retrotransposon expansions. A 2024 Frontiers in Bioinformatics benchmarking study (Obinu et al.) demonstrated that even with HiFi reads, the contiguity of a scaffolded plant assembly can vary by an order of magnitude depending on the Hi-C scaffolder used: YaHS achieved a scaffold N50 of 32.66 Mb on HiFi-assembled Arabidopsis contigs, while 3D-DNA produced only 3.41 Mb, underscoring that tool selection matters critically for complex genomes.

Genome Survey — Measure Twice, Sequence Once

Before committing to a full de novo assembly, the single most cost-effective step is a genome survey — shallow Illumina sequencing (30-50×) followed by k-mer frequency analysis with GenomeScope 2.0. This $200-500 investment answers four questions that determine every downstream decision:

1. How big is the genome? GenomeScope estimates haploid genome length from the k-mer coverage distribution. This determines sequencing depth requirements. A 500 Mb genome at 30× HiFi coverage needs ~15 Gb of HiFi data; a 5 Gb genome at the same coverage needs 150 Gb — a 10× difference in sequencing cost.

2. How heterozygous is it? Heterozygosity above 0.5% signals that haplotype-aware assembly is necessary. Above 2%, trio-binning or parental data should be strongly considered. Above 5%, expect to invest significantly more in both sequencing depth and assembly curation.

3. How repetitive is it? The unique sequence proportion estimated by GenomeScope indicates what fraction of the genome is non-repetitive. Below 50% unique sequence, ultralong ONT reads become critical for spanning the largest repeats. Below 30%, consider adding optical mapping (Bionano) to the technology mix.

4. Is it polyploid? Smudgeplot, a companion tool to GenomeScope 2.0, analyzes heterozygous k-mer pairs to estimate ploidy de novo. A diploid produces two major heterozygous k-mer pair distributions; a tetraploid produces four. This information determines whether subgenome-phasing strategies are needed.
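The decision rules in the four questions above can be written down as a small helper. This is a minimal sketch that only encodes the thresholds quoted in this section; the class name, field names, and recommendation strings are illustrative assumptions and do not come from GenomeScope or any other tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SurveyResult:
    """Hypothetical container for values read off a genome survey report."""
    genome_size_gb: float      # estimated haploid genome size, in Gb
    heterozygosity_pct: float  # estimated heterozygosity, in percent
    unique_pct: float          # estimated non-repetitive fraction, in percent


def required_yield_gb(genome_size_gb: float, coverage: float = 30.0) -> float:
    """Sequencing yield (Gb) needed to reach a target coverage."""
    return genome_size_gb * coverage


def survey_recommendations(survey: SurveyResult) -> List[str]:
    """Apply the heterozygosity and repeat thresholds quoted in the text."""
    recs = []
    if survey.heterozygosity_pct > 0.5:
        recs.append("use haplotype-aware assembly")
    if survey.heterozygosity_pct > 2:
        recs.append("consider trio-binning or parental data")
    if survey.heterozygosity_pct > 5:
        recs.append("budget extra sequencing depth and assembly curation")
    if survey.unique_pct < 50:
        recs.append("add ONT ultralong reads")
    if survey.unique_pct < 30:
        recs.append("consider optical mapping")
    return recs


if __name__ == "__main__":
    example = SurveyResult(genome_size_gb=1.5, heterozygosity_pct=1.0, unique_pct=60.0)
    print(required_yield_gb(example.genome_size_gb, coverage=30))
    print(survey_recommendations(example))
```

Ploidy is left out on purpose: Smudgeplot output needs visual interpretation, so a simple threshold would be misleading.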

The practical workflow: extract high-molecular-weight DNA → sequence 30-50× Illumina (NovaSeq, 2×150 bp) → count k-mers with Jellyfish or KMC (k=21) → upload histogram to genomescope.org → interpret the model fit. A model fit above 70% is acceptable; below 50%, increase sequencing depth or try multiple k-mer sizes (k=17, 21, 27) to verify consistency. The report takes 24-48 hours from data receipt and can save thousands of dollars by preventing under-sequenced assemblies that are unfixable downstream.
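To make the k-mer counting step concrete, here is a minimal pure-Python sketch of what Jellyfish or KMC compute at scale: canonical k-mer counts, collapsed into the multiplicity histogram that GenomeScope models. The function names are illustrative, and this toy version holds every k-mer in memory, so it is suitable only for small test inputs, not real survey data.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map multiplicity -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguity codes.
            if set(kmer) <= _VALID_BASES:
                counts[canonical(kmer)] += 1
    return dict(sorted(Counter(counts.values()).items()))


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "ACGTAC"]
    for multiplicity, n_kmers in kmer_histogram(toy_reads, k=3).items():
        # Two-column "multiplicity count" lines, the usual histogram layout.
        print(multiplicity, n_kmers)
```

Counting canonical k-mers matters because reads come from both strands; without it, each genomic k-mer would be split across two entries and the coverage peaks would sit at half their true depth.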

Figure 2: GenomeScope 2.0 Survey Output. An example k-mer spectrum plot (k=21) from a diploid genome with ~1% heterozygosity, annotated to show the error peak, heterozygous peak (1n), and homozygous peak (2n). Below the plot, a summary table of estimated parameters: Genome Size, Heterozygosity, Repeat %, and Model Fit %. A 30-50× Illumina survey costing $200-500 guides all downstream sequencing decisions.

The Technology Stack for De Novo Assembly

Modern de novo genome assembly is a multi-platform, multi-data-type operation. No single sequencing technology produces a complete, accurate, chromosome-scale assembly of a non-model eukaryotic genome. The standard technology stack, as validated by the Vertebrate Genomes Project (VGP) and the Earth BioGenome Project (EBP), combines four data types:

PacBio HiFi — The Contiguity Backbone

PacBio HiFi reads (CCS mode, 15-25 kb modal length, ≥99.9% accuracy) are the foundation of modern de novo assembly. At 30-60× coverage, HiFi reads produce contig assemblies with N50s in the megabase-to-tens-of-megabases range for genomes up to several gigabases. The hifiasm assembler (Cheng et al., 2021) has become the de facto standard: it integrates HiFi reads with Hi-C contacts or optional parental short reads (trio-binning) in a single assembly graph, producing primary/alternate contigs from HiFi alone or haplotype-resolved assemblies when Hi-C or parental data are supplied, all in a single run. For genomes under 3 Gb, HiFi assemblies at 40× coverage with Hi-C scaffolding routinely deliver >95% BUSCO completeness and scaffold N50s at chromosome scale.

Oxford Nanopore Ultralong — The Repeat-Spanner

ONT ultralong reads (50-300+ kb, R10.4.1 chemistry, >99% modal accuracy with Dorado super-accurate basecalling) serve a different purpose: they span large repetitive elements that HiFi reads cannot bridge. Centromeric satellites (Mb-scale arrays of 171 bp alpha-satellite repeats in primates, or the 155 bp CentO repeat in rice), rDNA arrays (45S repeats present in hundreds to thousands of tandem copies), and segmental duplications (>10 kb, >90% identity) all exceed the ~25 kb HiFi read length. A single ultralong ONT read can span many of these outright and, for Mb-scale arrays, anchors far more of the repeat to unique flanking sequence than a HiFi read can. In a typical T2T assembly workflow, ONT ultralong reads at 15-20× coverage are co-assembled with HiFi reads, either integrated into the hifiasm assembly graph (hifiasm --ul) or assembled separately with Flye or NextDenovo and then merged via quickmerge or RagTag. CD Genomics offers Nanopore Ultra-Long Sequencing on the PromethION platform with R10.4.1 chemistry and Dorado super-accurate basecalling, routinely delivering read N50s above 100 kb for gap closure and T2T finishing.

Hi-C — From Contigs to Chromosomes

Hi-C (chromatin conformation capture) provides long-range linkage information that bridges megabase-scale gaps between contigs. In the Hi-C protocol, chromatin is crosslinked with formaldehyde, digested with a restriction enzyme, and re-ligated such that DNA fragments that were physically proximal in the nucleus become ligated together. Sequencing these chimeric molecules reveals which contigs belong to the same chromosome and, critically, their order and orientation. The current recommended scaffolder is YaHS (Zhou et al., 2023), which was benchmarked as the top performer for plant genomes in a 2024 study (Obinu et al., Frontiers in Bioinformatics), achieving a 32.66 Mb scaffold N50 compared to 3.41 Mb for 3D-DNA on the same HiFi assembly input. A minimum of 100× Hi-C coverage is recommended; for large genomes (>3 Gb), 150× provides more robust long-range contacts. CD Genomics provides dedicated Hi-C Sequencing with DpnII and MboI restriction enzymes, integrated with the YaHS scaffolding pipeline to deliver chromosome-scale assemblies from HiFi contig inputs.

Illumina Short Reads — The Accuracy Polish

Even HiFi reads have systematic errors at homopolymer runs and in extreme GC contexts. Illumina short reads (2×150 bp, 30-50× coverage) provide orthogonal error correction — the Illumina error profile is substitution-dominated and independent of the PacBio/ONT indel-dominated error profile. Tools like Pilon, NextPolish, and POLCA use Illumina read alignments to correct residual base errors in the long-read assembly, improving consensus accuracy from ~Q40 (one error per 10,000 bp) to ~Q50-60 (one error per 100,000 to 1,000,000 bp). For publication-quality reference genomes, Illumina polishing is standard.
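The Q-values quoted above follow the standard Phred scale, Q = -10 · log10(p), where p is the per-base error probability. A short sketch of the conversion, with illustrative function names:

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


def bases_per_error(qv: float) -> int:
    """Expected number of bases per single error at a given quality value."""
    return round(10 ** (qv / 10))


if __name__ == "__main__":
    for qv in (40, 50, 60):
        print(f"Q{qv}: one error per {bases_per_error(qv):,} bp")
```

Each 10-point step is a tenfold change in error rate, so moving an assembly from Q40 to Q50 removes roughly nine out of every ten remaining consensus errors.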

A representative outcome from the Vertebrate Genomes Project illustrates what this technology stack delivers in practice. The eastern barred bandicoot (Perameles gunnii), an endangered marsupial with a ~3.6 Gb genome, was assembled to chromosome scale using 46× PacBio HiFi, 20× ONT ultralong, and 110× Hi-C data. The resulting assembly anchored 97.8% of the genome into 14 pseudochromosomes matching the known karyotype, achieving a scaffold N50 of 155 Mb and 95.7% BUSCO completeness (mammalia_odb10). The total sequencing cost was approximately $12,000 — a complete reference genome for a conservation-priority species at roughly the cost of a single Illumina human genome a decade ago.

Putting It Together — A Representative Assembly Recipe

For a diploid, non-model animal genome of ~1.5 Gb with moderate heterozygosity (~1%):

Data Type	Platform	Coverage	Purpose	Approximate Cost
Genome Survey	Illumina NovaSeq 2×150	30-50×	k-mer analysis, genome size/het/repeat estimation	$200-500
HiFi Assembly	PacBio Revio	40×	Contig assembly, haplotype phasing	$4,000-6,000
Ultralong	ONT PromethION R10.4.1	15×	Repeat spanning, gap closure	$2,000-4,000
Hi-C	Illumina NovaSeq	100×	Chromosome-scale scaffolding	$1,500-2,500
Illumina Polish	Illumina NovaSeq 2×150	30×	Base-level error correction	$300-500
Total				$8,000-13,500

For a plant genome of similar size but with polyploidy or >70% repeat content, increase HiFi coverage to 60× and ONT to 20×, and add Bionano optical mapping for independent scaffold verification — total cost ~$15,000-25,000.
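The recipe table for the ~1.5 Gb animal genome can be sanity-checked with a few lines of arithmetic: data yield is genome size times coverage, and the total is the sum of the per-row cost ranges. The dictionary below simply transcribes the table; its layout and the function names are illustrative assumptions.

```python
GENOME_SIZE_GB = 1.5  # the ~1.5 Gb diploid example from the table

# coverage and cost_usd are (low, high) ranges copied from the recipe table.
RECIPE = {
    "Genome survey": {"coverage": (30, 50), "cost_usd": (200, 500)},
    "HiFi assembly": {"coverage": (40, 40), "cost_usd": (4000, 6000)},
    "ONT ultralong": {"coverage": (15, 15), "cost_usd": (2000, 4000)},
    "Hi-C": {"coverage": (100, 100), "cost_usd": (1500, 2500)},
    "Illumina polish": {"coverage": (30, 30), "cost_usd": (300, 500)},
}


def yield_range_gb(genome_size_gb, coverage):
    """Low and high sequencing yield (Gb) for a (low, high) coverage range."""
    low, high = coverage
    return (genome_size_gb * low, genome_size_gb * high)


def total_cost(recipe):
    """Sum the low and high ends of every cost range in the recipe."""
    low = sum(row["cost_usd"][0] for row in recipe.values())
    high = sum(row["cost_usd"][1] for row in recipe.values())
    return (low, high)


if __name__ == "__main__":
    for name, row in RECIPE.items():
        low, high = yield_range_gb(GENOME_SIZE_GB, row["coverage"])
        print(f"{name}: {low:g}-{high:g} Gb")
    print("Total cost (USD):", total_cost(RECIPE))
```

Changing GENOME_SIZE_GB rescales the yields directly; the cost ranges do not rescale automatically and would need fresh quotes for a different genome size.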

CD Genomics offers integrated de novo sequencing packages that combine these data types into a single project workflow. For the most demanding projects — where a complete, gapless reference genome is the explicit goal — the T2T Genome Assembly Service delivers full telomere-to-telomere assemblies with resolved centromeres, validated by telomere repeat identification at chromosome termini and BUSCO completeness >98%. For guidance on selecting the optimal assembly strategy for your specific genome — including technology mix trade-offs, ploidy-aware approaches, and budget optimization — see our Genome Assembly Strategy consultation page.

Figure 1: De Novo Genome Assembly Technology Stack — A layered diagram showing the four data types and their roles. Top layer: PacBio HiFi (30-60×, 15-25 kb reads) labeled "Contig Backbone." Second layer: ONT Ultralong (15-20×, 50-300+ kb reads) labeled "Repeat Spanner." Third layer: Hi-C (100×) labeled "Chromosome Scaffolder." Bottom layer: Illumina (30×, 2×150 bp) labeled "Accuracy Polish." Right side: final assembly visualization showing contigs → scaffolded chromosome → gap-free T2T chromosome.

Genome Annotation — Making the Assembly Interpretable

An assembled genome without annotation is a map without labels. The annotation pipeline transforms a FASTA file of contigs into a functionally annotated gene catalog suitable for comparative genomics, population genetics, and functional studies. For non-model eukaryotes, the annotation pipeline has three phases.

Phase 1: Repeat Masking

Before gene prediction, repetitive elements must be identified and soft-masked (converted to lowercase so they are ignored by gene predictors without being removed). The standard workflow builds a de novo repeat library with RepeatModeler2, which identifies repetitive sequences ab initio by detecting sequences present in multiple copies across the assembly, then classifies them against RepBase (if the organism's repeats are represented) or Dfam. The de novo library is then used by RepeatMasker to annotate and soft-mask repeats genome-wide. For large plant genomes, EDTA (Extensive de-novo TE Annotator; Ou et al., 2019) provides a faster, more comprehensive alternative that specifically handles LTR retrotransposons — the dominant repeat class in most plant genomes.
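Soft-masking itself is a simple transformation, which the following sketch illustrates on a toy sequence. It assumes repeat coordinates are already available as 0-based, half-open (start, end) intervals; the function names are illustrative, and real pipelines let RepeatMasker write the masked FASTA directly.

```python
def soft_mask(sequence: str, repeats) -> str:
    """Lowercase the bases inside each 0-based, half-open (start, end) interval."""
    masked = list(sequence.upper())
    for start, end in repeats:
        # Clamp intervals to the sequence so out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


def masked_fraction(sequence: str) -> float:
    """Fraction of bases that are soft-masked (lowercase)."""
    if not sequence:
        return 0.0
    return sum(base.islower() for base in sequence) / len(sequence)


if __name__ == "__main__":
    masked = soft_mask("ACGTACGTAC", [(2, 5)])
    print(masked, masked_fraction(masked))
```

Because the bases are only lowercased, the coordinates of every downstream feature stay valid, which is the practical advantage over hard-masking with N.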

Phase 2: Gene Prediction

Eukaryotic gene prediction benefits from integrating multiple lines of evidence. BRAKER3 is the current state of the art. It runs GeneMark-ETP, which combines RNA-seq read alignments with protein evidence from related species (typically the OrthoDB protein set for the relevant taxonomic clade) to produce a high-confidence gene set; trains AUGUSTUS on that set and predicts genes using hints from the same evidence; and then uses TSEBRA (Gabriel et al., 2021) to combine the GeneMark-ETP and AUGUSTUS predictions into a single consensus gene set. When RNA-seq data from multiple tissues are available, gene model accuracy improves markedly, particularly at exon-intron boundaries and for alternatively spliced genes. CD Genomics' RNA-Seq service provides this tissue-specific transcript evidence from poly(A)-selected mRNA libraries sequenced on the Illumina NovaSeq platform. For non-model organisms where full-length transcript isoforms provide the strongest evidence for gene structure, CD Genomics' Full-Length Transcripts Sequencing (Iso-Seq) on the PacBio platform captures complete transcript isoforms without the assembly ambiguity of short-read transcriptomes. For deeply non-model organisms where no RNA-seq exists, GALBA (Bruna et al., 2023) uses protein evidence from other species to guide gene prediction through a miniprot-based protein-to-genome alignment pipeline, trading species-specific accuracy for broad phylogenetic applicability.

Phase 3: Functional Annotation

The predicted protein-coding genes are functionally annotated by sequence similarity against curated databases: NR (non-redundant protein database), Swiss-Prot (manually curated), InterProScan (protein domains and families via Pfam, SMART, PROSITE, etc.), GO (Gene Ontology), KEGG (metabolic pathways), and EggNOG (orthologous groups). This is a computationally intensive but well-standardized process; on a 30,000-gene proteome, InterProScan alone can run for 12-24 hours on a 64-core server. Plant and Animal Whole Genome De Novo Sequencing at CD Genomics delivers functional annotation as a standard component of every de novo project, with results organized in GFF3 format for genome browsers and tab-delimited tables for downstream analysis.

From Assembly to Publication — QC, Submission, and Standards

A de novo genome assembly is a scientific product that must meet community-accepted quality standards before publication and submission to public databases. The key QC metrics are:

BUSCO completeness: Benchmarking Universal Single-Copy Orthologs. The percentage of conserved genes from a lineage-specific gene set (e.g., vertebrata_odb10, embryophyta_odb10) recovered as complete in the assembly, reported separately as single-copy and duplicated. >95% is publication-quality; >98% is reference-quality.

Contiguity (N50): The length-weighted median — 50% of the assembly is in contigs/scaffolds of this size or larger. For chromosome-scale assemblies, the scaffold N50 should approach the size of a typical chromosome for the species.

QV (consensus quality value): Estimated by Merqury, which compares k-mer frequencies between the assembly and the raw Illumina reads. QV >40 (one error per 10 kb) is standard; QV >50 is publication-quality.

k-mer completeness: The fraction of k-mers from the Illumina reads present in the assembly — should exceed 95% for a complete assembly.

Assembly-to-reference alignment: If a related species' genome exists, a whole-genome alignment (MUMmer, minimap2, or MashMap) verifies large-scale synteny and identifies potential misassemblies.
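Of the metrics above, N50 is the one most often computed by hand, so a minimal sketch may help. It takes a list of contig or scaffold lengths (in any unit) and returns the length at which the running total, counted from the largest sequence down, first reaches half of the assembly. The function name is illustrative.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the assembly."""
    if not lengths:
        raise ValueError("n50() needs at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length


if __name__ == "__main__":
    # Toy assembly of six contigs, lengths in Mb (total 290, half 145).
    print(n50([80, 70, 50, 40, 30, 20]))
```

In the toy example the two largest contigs (80 + 70 = 150) are the first to cross the halfway point of 145, so the N50 is 70.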

The Earth BioGenome Project (EBP) recommends the following minimum assembly standards for eukaryotic genomes: contig N50 ≥ 1 Mb, scaffold N50 ≥ 10 Mb (chromosome-scale anchoring), BUSCO completeness ≥ 90%, and consensus QV ≥ 30. Reference-quality genomes accepted by NCBI RefSeq are held to a higher bar: contig N50 ≥ 10 Mb (or chromosome-arm scale), BUSCO ≥ 95%, QV ≥ 40, and <5% contamination. At the top tier, T2T assemblies — such as the 2024 gapless Gossypium hirsutum ZM113 genome (26 chromosomes, 0 gaps, contig N50 89.27 Mb, BUSCO 99.6%, QV 42.9) — represent the current gold standard for complete eukaryotic genomes, with every chromosome resolved as a single continuous sequence from telomere to telomere.
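A simple way to use the thresholds quoted above is to check an assembly's metrics against them programmatically. The sketch below transcribes the numbers exactly as stated in this section rather than from the official EBP or NCBI documents, so they should be verified before use; the dictionary keys and function name are illustrative assumptions. The contamination limit is omitted because it is a maximum rather than a minimum.

```python
# Minimum values as quoted in the text above, not taken from official sources.
EBP_MINIMUM = {"contig_n50_mb": 1, "scaffold_n50_mb": 10, "busco_pct": 90, "qv": 30}
REFERENCE_TIER = {"contig_n50_mb": 10, "busco_pct": 95, "qv": 40}


def failed_metrics(metrics, thresholds):
    """Names of metrics that are missing or fall below their minimum threshold."""
    failed = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Values quoted in the text for the gapless G. hirsutum ZM113 assembly.
    zm113 = {"contig_n50_mb": 89.27, "busco_pct": 99.6, "qv": 42.9}
    print(failed_metrics(zm113, REFERENCE_TIER))
```

An empty list means every listed minimum is met; a missing metric is treated as a failure so that an unmeasured value cannot pass silently.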

Submission to public databases is the final step. NCBI GenBank requires that assemblies pass the Foreign Contamination Screen (FCS) — which detects adaptor, vector, and cross-species contamination — before accession numbers are issued. The European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) have similar validation pipelines. At the conclusion of each project, CD Genomics provides submission-ready files through its Whole Genome Sequencing service — including masked assembly FASTA, gene annotation GFF3, and functional annotation tables — formatted to meet NCBI/ENA/DDBJ requirements with pre-validated FCS screening results.

Figure 3: De Novo Genome Assembly Pipeline — A 7-stage vertical workflow from sample collection to public database submission. Stages: (1) Sample Collection & DNA Extraction (Week 1-2) → (2) Genome Survey — GenomeScope 2.0 k-mer analysis (Week 2-3) → (3) HiFi + ONT Ultralong Sequencing (Week 3-12) → (4) Hi-C Library Preparation & Sequencing (Week 8-13) → (5) Assembly & Scaffolding — hifiasm + YaHS (Week 13-16) → (6) Genome Annotation — RepeatModeler2/EDTA + BRAKER3 + InterProScan (Week 16-20) → (7) QC & NCBI/ENA/DDBJ Submission — BUSCO, Merqury, FCS (Week 20-24). Each stage annotated with key tools, estimated duration, and primary deliverables.

Practical Considerations for De Novo Projects

DNA — It All Starts Here

De novo assembly quality is capped by DNA quality. For PacBio HiFi, 5-15 µg of HMW DNA (at least 5 µg) with fragments ≥30 kb is required; a Femto Pulse run or PFGE is used to verify the fragment size distribution before library preparation. For ONT ultralong sequencing, ≥10 µg of DNA with an N50 ≥50 kb (ideally ≥100 kb) is needed; the Circulomics Nanobind kit or a modified phenol-chloroform protocol is recommended for extraction. DNA from a single individual is strongly preferred for de novo assembly; pooling multiple individuals introduces artificial heterozygosity that degrades assembly contiguity and haplotype resolution.

For organisms where tissue quantity is limiting — small invertebrates, embryos, herbarium specimens, museum samples — Whole Genome Sequencing low-input workflows at CD Genomics can generate HiFi libraries from as little as 500 ng of DNA, though assembly contiguity will be reduced compared to high-input protocols.

Project Planning Timeline

A typical de novo genome project for a 1-2 Gb non-model organism follows this timeline:

Week 1-2: Sample collection, DNA extraction, QC, genome survey sequencing

Week 2-3: GenomeScope k-mer analysis, finalize sequencing strategy

Week 3-8: PacBio HiFi library preparation and sequencing (30-60×)

Week 3-12: ONT ultralong library preparation and sequencing (15-20×)

Week 8-13: Hi-C library preparation and sequencing (100×)

Week 13-16: Assembly (hifiasm), scaffolding (YaHS), polishing (NextPolish), QC (BUSCO, Merqury)

Week 16-20: Annotation (RepeatMasker → BRAKER3 → InterProScan)

Week 20-24: Manual curation, figure generation, NCBI submission

Total: 4-6 months from sample to NCBI-submitted reference genome.

For a broader overview of how plant and animal de novo sequencing fits into the wider WGS landscape — from bacterial genomes to large-scale population re-sequencing — see our Whole Genome Sequencing Services Hub. For bacterial-scale de novo assembly, which follows a distinct workflow tailored to small (3-7 Mb) prokaryotic genomes, see our Bacterial Whole Genome Sequencing Guide. For projects requiring exclusively long-read approaches, CD Genomics' Long-Read Sequencing Services provide PacBio and ONT platforms for targeted applications. For a comprehensive guide to long-read sequencing across all applications — from structural variant detection to full-length transcript sequencing and epigenetics — see our Long-Read Sequencing Services for Every Application.

Frequently Asked Questions

What is de novo genome assembly, and when is it needed?

De novo genome assembly reconstructs a complete genome sequence from overlapping sequencing reads without a reference template. It is required when no high-quality reference genome exists for the species — which is the case for the vast majority of plants and animals on Earth. If a closely related reference genome is available, reference-guided assembly or re-sequencing may be faster and cheaper.

How much does de novo genome sequencing cost for a plant or animal genome?

Cost scales with genome size and complexity. A 500 Mb diploid genome at chromosome-scale costs approximately $5,000-10,000. A 1-2 Gb genome at T2T quality costs $10,000-20,000. Large polyploid plant genomes (5-16 Gb) can cost $20,000-35,000. These estimates include sequencing, assembly, scaffolding, and basic annotation.

Why is a genome survey recommended before full de novo sequencing?

A $200-500 genome survey (shallow Illumina + GenomeScope k-mer analysis) estimates genome size, heterozygosity, repeat content, and ploidy. This information determines how much sequencing is needed, which platforms are optimal, and whether specialized phasing or polyploid-aware assembly strategies are required — preventing costly under- or over-sequencing.

What is the difference between chromosome-scale and T2T assembly?

A chromosome-scale assembly has contigs ordered and oriented into chromosomes but may contain gaps at repetitive regions (centromeres, rDNA arrays). A T2T (telomere-to-telomere) assembly is gapless — every chromosome is a single continuous sequence from telomere to telomere, including previously intractable regions like centromeres. T2T requires ONT ultralong reads in addition to HiFi and Hi-C.

Do I need Hi-C data for my de novo genome?

For publication-quality reference genomes, yes. Hi-C provides the long-range linkage information needed to order and orient contigs into complete chromosomes. Without Hi-C, a HiFi assembly of a 1 Gb genome may produce 500-2,000 contigs; with Hi-C scaffolding, >90% of the assembly is typically anchored into chromosome-scale scaffolds matching the expected karyotype.

What DNA input is required for plant and animal de novo sequencing?

For PacBio HiFi: ≥5 µg of HMW DNA, fragments ≥30 kb. For ONT ultralong: ≥10 µg DNA, N50 ≥50 kb (ideally ≥100 kb). For Hi-C: ≥1-2 µg of crosslinked DNA. DNA should be from a single individual for de novo assembly; pooled samples introduce artificial heterozygosity.

How long does a de novo genome project take from sample to completed assembly?

A typical project timeline is 4-6 months: sample prep (1-2 weeks), genome survey (about 1 week, completed by week 3), sequencing (6-12 weeks depending on data types), assembly and scaffolding (3-4 weeks), annotation (4 weeks), and curation/submission (4 weeks). Expedited timelines are available for individual data types.

What bioinformatic deliverables are included in a CD Genomics de novo sequencing project?

Standard deliverables: raw sequencing data (FASTQ), QC report, assembled genome (FASTA), BUSCO/QV/k-mer QC metrics, repeat annotation (GFF), gene prediction (GFF3), and functional annotation (GO, KEGG, InterProScan, Swiss-Prot, NR). Publication-ready files formatted for NCBI/ENA/DDBJ submission are included.

References:

Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods. 2021;18(2):170-175. doi:10.1038/s41592-020-01056-5
Obinu L, Dettori G, Lemay MA, et al. Benchmarking of Hi-C tools for scaffolding plant genomes obtained from PacBio HiFi and ONT reads. Frontiers in Bioinformatics. 2024;4:1462923. doi:10.3389/fbinf.2024.1462923
Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications. 2020;11(1):1432. doi:10.1038/s41467-020-14998-3
Gabriel L, Hoff KJ, Bruna T, et al. TSEBRA: transcript selector for BRAKER. BMC Bioinformatics. 2021;22(1):566. doi:10.1186/s12859-021-04482-0
Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology. 2020;21(1):245. doi:10.1186/s13059-020-02134-9
Zhou C, McCarthy SA, Durbin R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 2023;39(1):btac808. doi:10.1093/bioinformatics/btac808
Manni M, Berkeley MR, Seppey M, Simao FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution. 2021;38(10):4647-4654. doi:10.1093/molbev/msab199
Ou S, Su W, Liao Y, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology. 2019;20(1):275. doi:10.1186/s13059-019-1905-y

For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
