Whole Genome Sequencing Services for Every Genome Size: From Bacterial Genomes to Large Plant and Animal Genomes
A microbiologist has just isolated a novel bacterium from deep-sea sediment and needs its complete genome, all 4.2 megabases, to identify the biosynthetic gene clusters producing a promising antimicrobial compound. A plant breeder needs a chromosome-level assembly of a roughly 11-gigabase hexaploid oat genome to map drought-tolerance QTLs with sub-centimorgan resolution. A population genomics consortium needs variant calls on 3,000 individual salmon genomes at a cost that won't bankrupt the grant. These three projects all involve whole genome sequencing, but the sequencing strategy, platform selection, coverage depth, and budget differ by orders of magnitude.
Whole genome sequencing (WGS) is the most information-complete genomic analysis available: it captures the entire nuclear genome, from single-copy genes to tandem repeats to structural variants, without the biases inherent in targeted enrichment or amplicon approaches. Yet "whole genome sequencing" is not one service — it is a family of strategies spanning bacterial de novo assembly at one end and population-scale resequencing at the other, with coverage ranging from 0.4× to 100× and price tags from under $50 to over $10,000 per sample. Choosing the wrong combination of platform, depth, and analysis pipeline turns a tight budget into an underpowered study, or conversely, over-sequences a routine task with money that could have funded three more experiments.
CD Genomics provides Whole Genome Sequencing services across the full genome-size spectrum, from 5 Mb bacterial genomes through 3 Gb mammalian genomes to 16 Gb plant genomes, on Illumina, MGI DNBSEQ, PacBio SMRT, and Oxford Nanopore platforms. This article is a strategic decision guide: which WGS approach matches your genome size, your biological question, and your budget.
Bacterial WGS — The Rapid Gateway to Genomics
Bacterial whole genome sequencing is the most mature and cost-efficient segment of the WGS landscape. A typical 4-6 Mb bacterial genome can be sequenced, assembled, and annotated for $100-500, depending on the required assembly quality and annotation depth. At these price points, sequencing 100 bacterial isolates costs less than a single mammalian genome, making bacterial WGS the entry point for labs newly adopting genomic approaches.
De Novo Assembly: Closing the Genome
Bacterial WGS divides cleanly into de novo assembly (for novel isolates without a reference) and re-sequencing (for comparing strains against an existing reference). De novo assembly reconstructs the complete genome from overlapping sequencing reads without a template. The quality of the assembly — measured by contig N50, number of contigs, and completeness benchmarking with tools like BUSCO — depends heavily on the sequencing technology mix.
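To make the contiguity metrics concrete, here is a minimal sketch of how contig N50 and L50 are usually computed: sort contigs from longest to shortest and walk down the list until half of the total assembly length is reached. The function name and the five-contig example are illustrative assumptions, not output from any real project.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # N50 is this contig's length; L50 is how many contigs it took.
            return length, count


# Hypothetical draft assembly: five contigs totalling 4.2 Mb.
draft = [1_800_000, 1_200_000, 700_000, 300_000, 200_000]
n50, l50 = n50_l50(draft)
print(f"N50 = {n50:,} bp, L50 = {l50}")
```

A higher N50 together with a lower L50 indicates a more contiguous assembly. Neither number says anything about correctness, which is why completeness benchmarking is reported alongside them.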
Short-read-only assemblies, using Illumina NovaSeq or MGI DNBSEQ at 100-200× coverage, produce highly accurate contigs (Q40+) but fracture at repetitive elements: rRNA operons, insertion sequences, and prophage regions. The resulting draft genome typically consists of 20-100 contigs rather than a single circular chromosome. For many applications — species identification, MLST typing, AMR gene detection — this is sufficient.
When complete closure is required, long-read sequencing bridges the repeats. PacBio HiFi reads (CCS mode, ≥99.9% accuracy at 10-25 kb) or Oxford Nanopore reads (ultra-long, 50-100+ kb, with R10.4.1 chemistry achieving >99% modal accuracy) are assembled into 1-4 contigs, and a short-read polishing step corrects residual indel errors. This hybrid strategy routinely delivers complete, circularized bacterial chromosomes with zero gaps — the gold standard for publication-quality reference genomes. CD Genomics offers bacterial WGS on all three platforms, with coverage recommendations of ≥50× for Illumina, ≥100× for PacBio, and ≥100× for Nanopore, with DNA input requirements as low as 200 ng for short-read libraries and 10-15 µg of high-molecular-weight DNA for long-read platforms. Turnaround time is 30-45 working days depending on assembly complexity.
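Coverage recommendations translate into data volume through a simple product: total bases equal genome size times mean coverage. The sketch below shows that arithmetic for a 5 Mb genome at 100×. The function name and the read lengths (150 bp and 15 kb) are assumptions chosen for illustration.

```python
import math


def required_yield(genome_size_bp, coverage, read_length_bp):
    """Return (total_bases, read_count) needed to reach a mean coverage."""
    if min(genome_size_bp, coverage, read_length_bp) <= 0:
        raise ValueError("all inputs must be positive")
    total_bases = genome_size_bp * coverage
    read_count = math.ceil(total_bases / read_length_bp)
    return total_bases, read_count


# Illustrative inputs: a 5 Mb genome sequenced to 100x mean depth.
short_bases, short_reads = required_yield(5_000_000, 100, 150)     # 150 bp short reads
long_bases, long_reads = required_yield(5_000_000, 100, 15_000)    # ~15 kb long reads

print(f"Short reads: {short_bases / 1e9:.2f} Gb in {short_reads:,} reads")
print(f"Long reads:  {long_bases / 1e9:.2f} Gb in {long_reads:,} reads")
```

Both cases need the same 0.5 Gb of sequence; only the number of reads changes. In practice some extra yield is usually planned to absorb reads lost to quality filtering, which this sketch ignores.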
For more detailed guidance on bacterial genome projects, see our Bacterial Whole Genome Sequencing service page.
Beyond the Genome: Annotation and Functional Profiling
Assembling a genome is half the project. The annotation pipeline adds the biological interpretation layer: gene prediction (coding sequences, tRNAs, rRNAs, ncRNAs), functional annotation against NR, GO, COG, KEGG, SwissProt, Pfam, and CAZy databases, and specialized analyses for biomedically or industrially relevant features — antimicrobial resistance genes (CARD, ResFinder), virulence factors (VFDB), plasmid reconstruction, prophage prediction (PHASTER), and CRISPR array detection. For comparative genomics projects spanning dozens or hundreds of isolates, pan-genome analysis identifies the core genome (genes shared by all strains) and the accessory genome (genes present in subsets), revealing the evolutionary dynamics of gene gain and loss that underpin niche adaptation and pathogenicity.
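The core/accessory split can be illustrated with plain set operations. This is a minimal sketch that assumes gene families have already been clustered into shared identifiers across strains, which is the hard part that dedicated pan-genome tools handle. The isolate names and gene lists are hypothetical.

```python
def partition_pangenome(gene_sets):
    """Split gene families into core (in every strain) and accessory (in some)."""
    if not gene_sets:
        raise ValueError("need at least one strain")
    per_strain = [set(genes) for genes in gene_sets.values()]
    pan = set().union(*per_strain)            # every family seen in any strain
    core = set.intersection(*per_strain)      # families present in all strains
    return core, pan - core


# Hypothetical presence/absence data for three isolates.
strains = {
    "isolate_A": {"dnaA", "gyrB", "recA", "blaTEM"},
    "isolate_B": {"dnaA", "gyrB", "recA", "mecA"},
    "isolate_C": {"dnaA", "gyrB", "recA"},
}
core, accessory = partition_pangenome(strains)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

With real data, a strict "present in all strains" rule is often relaxed to a high threshold such as 95-99% of strains, so that a single fragmented assembly does not shrink the core genome.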
Figure 1: Bacterial WGS Assembly Quality Comparison — Three-column comparison showing the trade-off between cost and completeness at each tier. Column 1 — Draft (Short-Read Only): Illumina 150 bp PE, ~60 contigs, N50 ~200 kb, ~97% BUSCO, $100-200/genome, suitable for species ID and AMR screening. Column 2 — Near-Complete (Hybrid): PacBio HiFi + Illumina polishing, 1-4 contigs, N50 ~4 Mb, ~99.5% BUSCO, $300-500/genome, suitable for publication-quality reference genomes. Column 3 — Complete (Multi-Platform): ONT ultra-long + HiFi, 1 circular chromosome, 100% BUSCO, $500-800/genome, suitable for plasmid-resolved complete references. Color-coded headers: orange (draft), yellow (near-complete), green (complete).
Plant and Animal De Novo — Unlocking Non-Model Organisms
Plant and animal de novo genome sequencing is a fundamentally different challenge from bacterial WGS. Genome sizes span more than two orders of magnitude: the 125 Mb Arabidopsis thaliana genome sits at one end, while the 16 Gb hexaploid wheat genome occupies the other, with repeat contents ranging from 20% to over 85%. Polyploidy, which is common in plants and pervasive in crops (bread wheat is allohexaploid, potato is autotetraploid, sugarcane is octoploid), confounds assemblers that assume diploidy. High heterozygosity in outcrossing species generates divergent haplotypes that, when collapsed into a single consensus, produce fragmented assemblies with missing biological content.
The technology strategy that solved these problems is long-read sequencing plus chromatin conformation capture. PacBio HiFi reads (15-25 kb, Q30+) provide contiguity at the contig level, routinely achieving contig N50 values of 10-50 Mb for plant genomes. Oxford Nanopore ultra-long reads (100+ kb) bridge the largest repeats (ribosomal DNA arrays, centromeric satellites, segmental duplications) that even HiFi reads cannot span. Hi-C (chromatin conformation capture) scaffolds the contigs into chromosome-scale pseudomolecules by exploiting the fact that DNA segments on the same chromosome contact each other far more often than segments on different chromosomes. The result is a chromosome-level assembly that, in favorable cases, approaches telomere-to-telomere (T2T) completeness for individual chromosomes.
CD Genomics provides end-to-end plant and animal de novo WGS through Plant and Animal Whole Genome De Novo Sequencing and De Novo Whole Genome Sequencing Service, with recommended sequencing strategies stratified by genome complexity:
- Simple genomes (≤1 Gb, diploid, moderate repeat content): PacBio HiFi at 30-50× coverage plus Illumina short-read polishing. Contig N50 target: ≥3 Mb.
- Complex genomes (1-5 Gb, polyploid, high repeat): PacBio HiFi at 50-60× or ONT at 100×, plus Hi-C at 100× for chromosome-scale scaffolding. Contig N50 target: ≥10 Mb.
- Very large genomes (>5 Gb, high ploidy): ONT ultra-long reads at 100× plus Hi-C at 100×. Contig N50 target: ≥5 Mb.
DNA input requirements are correspondingly higher than for bacterial WGS: ≥5-15 µg of high-molecular-weight DNA with OD 260/280 of 1.8-2.0 and fragment sizes ≥20 kb for long-read libraries. Samples that fall short on quantity or fragment length may still be sequenced with short-read-only approaches at 50-100×, but the resulting draft assembly will have substantially lower contiguity.
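The three tiers listed above can be read as a small lookup rule. The sketch below encodes them under one loud assumption: a genome of 1 Gb or less that is polyploid or repeat-rich is promoted to the "complex" tier, a case the tier definitions do not spell out. The function and field names are hypothetical, and the output is a starting point for discussion rather than a quote.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    tier: str
    sequencing: str
    contig_n50_target_mb: int


def recommend_strategy(genome_size_gb, polyploid=False, high_repeat=False):
    """Map genome traits onto the three tiers described in the text."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    if genome_size_gb > 5:
        return Strategy("very large", "ONT ultra-long 100x + Hi-C 100x", 5)
    # Assumption: small but polyploid or repeat-rich genomes count as complex.
    if genome_size_gb > 1 or polyploid or high_repeat:
        return Strategy("complex", "HiFi 50-60x or ONT 100x, plus Hi-C 100x", 10)
    return Strategy("simple", "HiFi 30-50x + short-read polishing", 3)


for size_gb, poly in [(0.4, False), (2.3, False), (16, True)]:
    print(size_gb, "Gb ->", recommend_strategy(size_gb, polyploid=poly))
```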
The bioinformatic deliverables for a de novo project extend well beyond the assembly itself: gene structure annotation (ab initio prediction + RNA-seq evidence-based + protein homology-based, integrated with MAKER or BRAKER), functional annotation, repeat annotation (de novo repeat library construction with RepeatModeler + RepeatMasker), non-coding RNA annotation, pseudogene identification, and comparative genomics against related species. For agricultural species, additional analyses include QTL mapping, GWAS population structure correction, and selective sweep detection.
A representative de novo project illustrates the impact of technology choice. A 2023 study assembling the 2.3 Gb maize genome (B73-Ab10 line) using PacBio HiFi at 50× plus Hi-C at 100× produced a chromosome-level assembly with contig N50 of 61.2 Mb and 99.7% BUSCO completeness — capturing the knob repeats, centromeric satellite arrays, and rDNA clusters that defeated short-read assemblies for two decades. The entire project, from DNA extraction to annotated genome, was completed in under six months at a cost of approximately $15,000, demonstrating that reference-quality assemblies of complex plant genomes are now achievable on a timeline and budget accessible to individual research groups.
Scaling Up — Population and Re-Sequencing Projects
Once a reference genome exists, the analytical frame shifts from "what is in this genome?" to "how does this genome differ from the reference — and what do those differences mean?" Population-scale re-sequencing answers questions about genetic diversity, domestication history, local adaptation, and genotype-phenotype associations by comparing hundreds to thousands of individuals against a common reference.
The economics of population WGS have transformed over the past decade. The first human genome cost approximately $3 billion. Today, a 30× human WGS costs roughly $500-800 through large-scale core facilities, and agricultural genomes, which are similar in size to the human genome but need less coverage for variant discovery, can be sequenced at 10-20× for $150-300 per sample in batch sizes of hundreds. At these price points, a $50,000 grant can fund whole-genome re-sequencing of roughly 165-330 individuals rather than the 15-30 it could cover a decade ago.
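The sample counts follow from dividing the budget by the per-sample price at each end of the quoted range. A minimal sketch of that arithmetic, with a hypothetical function name and the $50,000 budget and $150-300 price range taken from the paragraph above:

```python
def affordable_samples(budget_usd, cost_low_usd, cost_high_usd):
    """Return (fewest, most) whole samples a budget covers across a price range."""
    if cost_low_usd <= 0 or cost_high_usd < cost_low_usd:
        raise ValueError("expected 0 < cost_low_usd <= cost_high_usd")
    return budget_usd // cost_high_usd, budget_usd // cost_low_usd


fewest, most = affordable_samples(50_000, 150, 300)
print(f"$50,000 at $150-300 per sample covers {fewest} to {most} samples")
```

The calculation covers sequencing only. DNA extraction, shipping, and any custom analysis would reduce the number of samples the same grant can support.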
CD Genomics supports population-scale re-sequencing through its Whole Genome Resequencing service and Long-Read Whole Genome Resequencing service. The analytical deliverables for re-sequencing differ fundamentally from de novo: variant calling (SNPs, small indels, structural variants, copy-number variants) against the reference, population genetics statistics (nucleotide diversity π, Fst, Tajima's D), linkage disequilibrium decay, population structure analysis (PCA, ADMIXTURE, phylogenetic trees), selective sweep detection (XP-CLR, iHS, Fst outliers), and GWAS or QTL mapping for phenotype-associated loci.
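As an example of what one of these statistics measures, nucleotide diversity π is the average number of pairwise differences per site among sampled sequences. The sketch below computes it directly from a toy alignment. The four 10 bp haplotypes are invented, and production pipelines compute π from VCF files in genomic windows rather than from raw strings.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise differences per site across aligned, equal-length sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    pairs = list(combinations(sequences, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(seq1, seq2)) for seq1, seq2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Invented 10 bp haplotypes with two segregating sites.
haplotypes = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGAACGTTC"]
pi = nucleotide_diversity(haplotypes)
print(f"pi = {pi:.4f}")
```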
For agricultural breeding programs, the integration of WGS with genomic selection (GS) and genomic prediction (GP) models is replacing marker-assisted selection with whole-genome prediction: rather than tracking a handful of QTL-linked markers, breeders predict breeding values from genome-wide SNP profiles, achieving prediction accuracies of 0.5-0.8 for complex traits like yield, drought tolerance, and disease resistance. A 2024 study resequencing 3,008 Atlantic salmon at 12× coverage identified 18.7 million SNPs and used genomic prediction to forecast fillet color and fat content with correlations exceeding 0.7, directly informing aquaculture breeding decisions.
The practical logistics of a population-scale project differ from bench-scale genomics. DNA extraction becomes the bottleneck: 1,000 samples require automated extraction on liquid handlers. Library preparation in 96-well plates with dual-index barcoding keeps every sample traceable and limits read misassignment from index hopping. Sequencing on NovaSeq X Plus or MGI DNBSEQ-T7 instruments, which generate 6-16 Tb per run, processes dozens to hundreds of genomes simultaneously. The bioinformatic analysis shifts from interactive desktop work to high-performance computing pipelines running GATK best-practices workflows or DeepVariant-based calling on compute clusters.
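How many genomes fit on one run follows from dividing run output by the data needed per genome. The sketch below uses the 6-16 Tb range quoted above with a 3 Gb genome. The function name is hypothetical, and the estimate is an upper bound because it ignores duplicate reads, uneven pooling, and bases lost to quality filtering.

```python
def genomes_per_run(run_output_tb, genome_size_gb, coverage):
    """Upper-bound number of genomes one run can hold at a given mean coverage."""
    if min(run_output_tb, genome_size_gb, coverage) <= 0:
        raise ValueError("all inputs must be positive")
    gb_per_genome = genome_size_gb * coverage
    return int(run_output_tb * 1000 // gb_per_genome)


# A 3 Gb genome at two depths, on the low and high end of the quoted run output.
print("16 Tb run, 30x:", genomes_per_run(16, 3, 30), "genomes")
print(" 6 Tb run, 12x:", genomes_per_run(6, 3, 12), "genomes")
```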
Depth Decisions — Low-Pass vs High-Coverage WGS
Not every project needs 30× coverage. The trade-off between sequencing depth and sample throughput is the single most consequential decision in WGS project design, and the optimal answer depends on the biological question rather than a fixed standard.
Low-Pass WGS (0.4-5×)
Low-pass whole genome sequencing, also termed shallow WGS or low-coverage WGS, sequences the genome at 0.4-5× average depth and uses genotype imputation — statistical inference of unobserved genotypes from a reference panel of haplotypes — to fill in the gaps. The approach exploits the fact that adjacent SNPs on the same chromosome are co-inherited in haplotype blocks; observing a fraction of them constrains the identity of the rest with high probability when a suitable reference panel exists.
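A quick way to see why imputation is needed at these depths is the idealized Poisson model of coverage, in which the probability that a base receives no reads at mean depth c is e^(-c). The sketch below applies that model. It assumes reads land uniformly at random, which real libraries only approximate, so actual coverage is somewhat more uneven.

```python
import math


def fraction_covered(mean_depth, min_reads=1):
    """Probability a base is covered by at least min_reads reads (Poisson model)."""
    if mean_depth < 0 or min_reads < 0:
        raise ValueError("mean_depth and min_reads must be non-negative")
    p_below = sum(
        math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_below


for depth in (0.4, 1, 5, 30):
    print(f"{depth:>4}x: {fraction_covered(depth):.1%} of bases seen at least once")
```

Under this model only about a third of bases are observed at all at 0.4×, and most of those by a single read, so genotypes at the remaining sites have to be inferred from haplotype structure.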
The performance numbers are striking. At 0.4-1× coverage, low-pass WGS recovers >99% of common variants (MAF >1%) with imputation accuracy r² >0.9 when using large, population-matched reference panels like the Haplotype Reference Consortium (HRC) or 1000 Genomes for humans, or breed-specific panels for livestock. For GWAS, low-pass WGS at 1× coverage matches or exceeds the statistical power of high-density SNP arrays (600K-900K markers) while detecting novel variants that fixed arrays miss by design. The cost per sample at 1× coverage runs $50-100, compared to $30-80 for a high-density SNP array — but the WGS data are reusable for future analyses as reference panels and imputation algorithms improve, whereas array data are locked to the markers on the chip.
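The r² figure quoted here is conventionally the squared Pearson correlation between imputed allele dosages and genotypes from a trusted source, such as deep sequencing of the same individuals. A minimal sketch for a single variant follows. The genotype and dosage vectors are invented for illustration.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes (0/1/2) and imputed dosages."""
    if len(true_genotypes) != len(imputed_dosages) or len(true_genotypes) < 2:
        raise ValueError("need two equal-length vectors with at least two samples")
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum(
        (t - mean_t) * (d - mean_d)
        for t, d in zip(true_genotypes, imputed_dosages)
    )
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return float("nan")  # undefined when either vector does not vary
    return cov * cov / (var_t * var_d)


# Invented data for one variant across eight individuals.
truth = [0, 1, 2, 0, 1, 2, 0, 0]
dosages = [0.1, 0.9, 1.8, 0.0, 1.2, 2.0, 0.2, 0.1]
print(f"r2 = {imputation_r2(truth, dosages):.3f}")
```

Because r² is undefined for sites with no variation in the sample, accuracy is usually summarized within minor-allele-frequency bins, which is where the drop-off for rare variants becomes visible.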
CD Genomics offers low-pass WGS through its Shallow Whole Genome Sequencing service on Illumina and MGI platforms, with standardized analysis pipelines delivering imputed genotypes, population structure analysis, and GWAS-ready data.
High-Coverage WGS (30×+)
Deep WGS at 30× or higher coverage provides direct observation of variants rather than imputation-dependent inference. This is necessary when: (a) the variants of interest are rare (MAF <0.1%), where imputation accuracy degrades to r² of 0.6-0.8 or lower; (b) structural variants (deletions, duplications, inversions, and translocations) are primary targets, as these are poorly imputed from low-pass data; (c) de novo mutations must be detected, as these are absent from any reference panel by definition; or (d) the population lacks a suitable imputation reference panel, as is common for non-model organisms and underrepresented populations.
The cost of deep WGS has declined but remains substantial for large cohorts. A 30× human genome costs $500-800; a 30× bovine genome (similar size) costs $400-600. For plant genomes exceeding 5 Gb, 30× coverage drives costs to $2,000-5,000 per sample. At these prices, deep WGS is reserved for reference-quality assemblies, discovery cohorts that inform downstream study design, and projects where the analytical question genuinely requires direct variant observation.
Decision Framework: Low-Pass vs High-Coverage
The choice between low-pass and high-coverage WGS depends on four factors:
- Study design: GWAS of common variants in well-characterized populations → low-pass. Rare variant association, SV discovery, or de novo mutation detection → deep.
- Reference panel availability: High-quality, population-matched reference panels → low-pass is viable. No reference panel → deep WGS is required.
- Budget allocation: Fixed budget of $50,000 → approximately 500-1,000 samples at low-pass ($50-100 each) vs roughly 60-100 samples at deep 30× coverage ($500-800 each). Statistical power for common-variant GWAS favors the larger sample size.
- Future utility: Data intended for reuse across multiple analyses over years → deep WGS provides the most flexibility. Single-purpose analysis with archival → low-pass is sufficient.
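The qualitative factors above can be sketched as a small rule-based helper. This is an illustrative encoding of the framework, not a CD Genomics tool: the function and parameter names are hypothetical, any single "deep" trigger is treated as decisive, and the budget factor is deliberately left out because it depends on current quotes.

```python
def recommend_depth(
    rare_variants=False,
    structural_variants=False,
    de_novo_mutations=False,
    reference_panel_available=True,
    long_term_reuse=False,
):
    """Return ("deep" or "low-pass", reasons) following the factors in the text."""
    reasons = []
    if rare_variants:
        reasons.append("rare variants impute poorly")
    if structural_variants:
        reasons.append("structural variants are poorly imputed from low-pass data")
    if de_novo_mutations:
        reasons.append("de novo mutations are absent from every reference panel")
    if not reference_panel_available:
        reasons.append("no population-matched imputation panel")
    if long_term_reuse:
        reasons.append("data are intended for reuse across future analyses")
    if reasons:
        return "deep", reasons
    return "low-pass", ["common-variant study with a suitable reference panel"]


print(recommend_depth())
print(recommend_depth(structural_variants=True, reference_panel_available=False))
```

Returning the reasons alongside the recommendation keeps the decision auditable when a project plan is reviewed.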
For a more detailed comparison of SNP arrays, low-pass WGS, and deep WGS with cost and accuracy benchmarks, see CD Genomics' A Beginner's Guide to Low-Pass Whole Genome Sequencing.
Figure 2: The WGS Depth-Cost-Performance Continuum. A three-zone visualization. Zone A: Low-Pass (0.4-5×, $50-100/sample, >99% common variants via imputation, ideal for GWAS). Zone B: Moderate (10-20×, $150-300/sample, direct variant calling, ideal for population genomics). Zone C: Deep (30-100×, $500-5,000/sample, comprehensive variant detection, ideal for reference genomes and rare variants). X-axis: sequencing depth. Y-axis: cost per sample (log scale). Color gradient from light (low-pass) to dark (deep).
How CD Genomics Delivers WGS
A WGS project at CD Genomics follows a standardized, quality-controlled pipeline from sample submission to publication-ready data, with platform selection, coverage, and bioinformatic analysis tailored to the project's genome size and research objectives.
Sample-to-Data Workflow
Step 1: Sample submission and QC. Clients submit extracted DNA or biological samples for extraction. Incoming QC measures concentration (Qubit fluorometry), purity (Nanodrop 260/280 and 260/230 ratios), and integrity (agarose gel electrophoresis or TapeStation for fragment size distribution). Samples that fail QC are flagged immediately, and a re-extraction or re-submission plan is coordinated.
Step 2: Library construction. Platform-specific libraries are prepared with the appropriate insert size (350-500 bp for short-read WGS, 15-20 kb for PacBio HiFi, no size selection for ONT ultra-long). For population-scale projects, 96-well plate dual-index barcoding ensures sample traceability and minimizes index-hopping artifacts.
Step 3: Sequencing. Sequencing depth is monitored in real time. For Illumina and MGI platforms, a minimum of 80% of bases at ≥Q30 is standard. For PacBio HiFi, CCS reads with ≥Q30 (99.9% accuracy) are generated. For ONT, the latest R10.4.1 flow cells with super-accurate basecalling (dorado) deliver >99% modal accuracy.
Step 4: Bioinformatics. The analysis pipeline is matched to the project type. De novo assembly uses Hifiasm (HiFi), Flye (ONT), or Unicycler (hybrid). Reference-based analysis uses BWA-MEM2 + GATK4 or DeepVariant. Gene prediction and annotation use Prokka (bacteria) or MAKER2/BRAKER3 (eukaryotes). All pipelines include quality metrics: assembly statistics (N50, L50, BUSCO completeness), variant call rates and transition/transversion ratios, and coverage uniformity plots.
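Of the quality metrics named in Step 4, the transition/transversion (Ts/Tv) ratio is simple to illustrate. Transitions swap a purine for a purine (A↔G) or a pyrimidine for a pyrimidine (C↔T); every other single-base change is a transversion. The sketch below counts both from a list of (ref, alt) pairs. The variant list is invented, and a real pipeline would read these from a VCF file.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_ratio(snvs):
    """Ts/Tv ratio from (ref, alt) single-base pairs; other records are skipped."""
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # skip indels and non-variant records
        if ref not in "ACGT" or alt not in "ACGT":
            continue  # skip ambiguous bases such as N
        pair = {ref, alt}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("inf")


# Invented calls: four transitions, two transversions, one indel that is ignored.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("T", "G"), ("AT", "A")]
print(f"Ts/Tv = {ts_tv_ratio(calls):.2f}")
```

A ratio far from the value expected for the species (around 2 for human whole-genome data) is a common warning sign of false-positive calls, since random errors push the ratio toward 0.5.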
CD Genomics' Whole Genome Sequencing services and De Novo Whole Genome Sequencing Service together cover the complete spectrum of genome sizes and project scales, from single bacterial isolates to multi-thousand-sample population cohorts.
Figure 3: WGS Platform Selection Guide. A decision matrix table with five columns. Rows represent project types (Bacterial De Novo, Plant De Novo, Animal De Novo, Population Re-Seq, Low-Pass GWAS). Columns: Recommended Platform(s), Coverage, DNA Input, Approximate Cost/Sample, Turnaround Time. Color-coded cells indicate optimal (green), viable (yellow), and not recommended (red) choices.
FAQ
What is the difference between de novo sequencing and re-sequencing?
De novo sequencing assembles a genome from scratch without a reference template and is required for species without an existing reference genome. Re-sequencing aligns reads to an existing reference genome to identify variants and is suitable when a high-quality reference already exists for the species. De novo typically costs 5-20× more than re-sequencing of the same genome because it needs long-read data, often Hi-C scaffolding, and additional assembly and annotation work.
How much does whole genome sequencing cost?
Costs vary by genome size and coverage. A bacterial genome (5 Mb, 100×): $100-500. A mammalian genome (3 Gb, 30×): $500-800 for re-sequencing, $5,000-15,000 for de novo with annotation. A large plant genome (10 Gb, 30×): $2,000-5,000 for re-sequencing, $10,000-30,000 for de novo. Low-pass WGS at 1× costs $50-100 per sample for human-scale genomes. These figures are for sequencing and standard bioinformatics only, excluding DNA extraction.
What DNA quantity and quality do I need for WGS?
For Illumina short-read WGS: ≥200 ng of DNA at ≥10 ng/µL, OD 260/280 of 1.8-2.0. For PacBio HiFi: ≥5-15 µg of high-molecular-weight DNA with fragment sizes ≥20 kb. For Oxford Nanopore: ≥5-10 µg of HMW DNA with fragments ≥20 kb for standard libraries, or ≥1 µg for ultra-low input protocols. Degraded DNA with fragments <5 kb can still be sequenced on Illumina platforms but is unsuitable for long-read sequencing.
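The short-read thresholds above lend themselves to a simple pre-submission check. The sketch below tests a sample against the three stated Illumina criteria (at least 200 ng total, at least 10 ng/µL, OD 260/280 between 1.8 and 2.0). The function name and sample values are hypothetical, and this is not the facility's actual QC procedure.

```python
def check_short_read_sample(total_ng, conc_ng_per_ul, od_260_280):
    """Return a list of failed criteria; an empty list means the sample passes."""
    failures = []
    if total_ng < 200:
        failures.append(f"total DNA {total_ng} ng is below 200 ng")
    if conc_ng_per_ul < 10:
        failures.append(f"concentration {conc_ng_per_ul} ng/uL is below 10 ng/uL")
    if not 1.8 <= od_260_280 <= 2.0:
        failures.append(f"OD 260/280 of {od_260_280} is outside 1.8-2.0")
    return failures


# Hypothetical samples: one clean, one low-yield with a poor purity ratio.
for name, sample in {"S01": (500, 25, 1.85), "S02": (150, 8, 1.60)}.items():
    problems = check_short_read_sample(*sample)
    print(name, "PASS" if not problems else "FAIL: " + "; ".join(problems))
```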
Why use long-read sequencing for de novo assembly?
Short reads (150-300 bp) cannot span repetitive elements — transposons, segmental duplications, centromeres, rRNA arrays — that are longer than the read length. The assembler hits a repeat, cannot determine how many copies exist or how they are arranged, and breaks the assembly into contigs. Long reads (10-100+ kb) span most repeats, producing 50-500× fewer contigs and resolving genome architecture that short-read assemblies collapse. For polyploid genomes, long reads can phase haplotypes into separate assemblies rather than collapsing them into a single mosaic consensus.
How do I choose between Illumina, PacBio, and Nanopore for my project?
Illumina/MGI: highest raw accuracy (Q30+), lowest cost per Gb, ideal for re-sequencing and variant calling. PacBio HiFi: high accuracy (Q30+) with 15-25 kb reads, ideal for de novo assembly of small to mid-size genomes (up to roughly 5 Gb). Oxford Nanopore: longest reads (100+ kb) with moderate accuracy (Q20+, improving), ideal for resolving ultra-complex repeat structures in very large genomes. Hybrid approaches combine platforms: long reads for assembly continuity + short reads for base-level accuracy polishing.
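The Q values used to compare platforms are Phred scores, defined as Q = -10 · log10(p), where p is the probability that a base call is wrong. The sketch below converts between the two and computes the share of bases at or above a threshold. The function names and the five-base quality list are illustrative.

```python
import math


def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred quality score."""
    return 1 - 10 ** (-q / 10)


def error_to_phred(error_probability):
    """Phred quality score implied by a per-base error probability."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def fraction_at_or_above(qualities, threshold=30):
    """Share of bases whose quality meets the threshold (e.g. the Q30 metric)."""
    if not qualities:
        raise ValueError("need at least one quality value")
    return sum(q >= threshold for q in qualities) / len(qualities)


for q in (20, 30, 40):
    print(f"Q{q}: {phred_to_accuracy(q):.2%} accurate, 1 error in {10 ** (q // 10):,} bases")

print("Q30 fraction:", fraction_at_or_above([35, 30, 12, 38, 29]))
```

Each step of 10 on the Phred scale is a tenfold drop in error rate, so Q20 and Q30 differ by an order of magnitude in expected errors.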
What is the turnaround time for a WGS project?
Standard turnaround is 30-45 working days for bacterial WGS and 45-60 working days for plant/animal de novo projects, depending on genome size, coverage, and analysis complexity. Population-scale re-sequencing projects with hundreds to thousands of samples may extend to 60-90 working days due to library preparation throughput and data processing volume. Expedited timelines are available for time-sensitive projects.
Can CD Genomics handle large-scale population genomics projects?
Yes. CD Genomics supports population-scale re-sequencing projects with automated DNA extraction, 96-well plate library preparation, and sequencing on NovaSeq X Plus or MGI DNBSEQ-T7 platforms. Projects ranging from 100 to 10,000+ samples are accommodated, with tiered pricing that reduces per-sample costs as batch size increases.
What bioinformatic deliverables do I receive?
Standard deliverables include raw sequencing data (FASTQ), quality control reports (FastQC, MultiQC), and analysis-specific outputs: assembled genome (FASTA) with annotation (GFF/GBK) for de novo projects; variant call files (VCF) with annotation for re-sequencing; imputed genotypes for low-pass WGS. All data are delivered via secure download or hard drive for large datasets. Custom bioinformatic analyses are available for specific research requirements.
References:
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-1760. doi:10.1093/bioinformatics/btp324
- Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. 2017;27(5):722-736. doi:10.1101/gr.215087.116
- Vaser R, Sovic I, Nagarajan N, Sikic M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research. 2017;27(5):737-746. doi:10.1101/gr.214270.116
- Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53. doi:10.1126/science.abj6987
- Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Computational Biology. 2017;13(6):e1005595. doi:10.1371/journal.pcbi.1005595
- Li H. Protein-to-genome alignment with miniprot. Bioinformatics. 2023;39(1):btad014. doi:10.1093/bioinformatics/btad014
- De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nature Reviews Genetics. 2021;22(9):572-587. doi:10.1038/s41576-021-00367-3
- Delaneau O, Zagury J-F, Robinson MR, Marchini JL, Dermitzakis ET. Accurate, scalable and integrative haplotype estimation. Nature Communications. 2019;10:5436. doi:10.1038/s41467-019-13225-y
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.