Bacterial Whole Genome Sequencing: De Novo Assembly, Re-Sequencing, and Mutation Detection for Microbial Research
Why Whole Genome, Not Just 16S?
The 16S rRNA gene has been the workhorse of microbial taxonomy for four decades, and for good reason: it is universally present in bacteria, contains both conserved primer-binding sites and hypervariable regions, and benefits from massive curated databases (SILVA, Greengenes, GTDB). A 16S Sanger sequence can place an unknown isolate in the correct genus and often the correct species at a cost of $5-15 per sample.
But 16S has fundamental blind spots that WGS fills. First, 16S provides zero information about functional gene content — a 16S sequence tells you the taxonomy but nothing about whether the organism produces a toxin, degrades a pollutant, or carries an antibiotic resistance gene. Second, 16S resolution plateaus at the species level; strains within a species may share identical 16S sequences while differing by hundreds of genes in their accessory genomes. Third, plasmids, which are the primary vehicles of horizontal gene transfer and antimicrobial resistance dissemination, are entirely invisible to 16S sequencing.
A concrete comparison illustrates the information asymmetry. A 16S rRNA sequence from an Escherichia coli isolate from a hospital outbreak investigation identifies it as E. coli with 99.8% confidence and takes 2-3 days. A bacterial WGS of the same isolate at 100× coverage identifies the serotype (O157:H7), detects 14 antimicrobial resistance genes across 2 plasmids and the chromosome, reconstructs the complete plasmid sequences, identifies 6 prophage regions, and catalogs 47 virulence factors — all from a single sequencing run costing $100-500. For outbreak investigations, WGS provides the SNP-level resolution needed to distinguish outbreak strains from sporadic background cases, enabling transmission chain reconstruction that 16S simply cannot support.
For a broader strategic overview of how bacterial WGS fits into the wider WGS landscape — including plant/animal de novo, population re-sequencing, and low-pass vs high-coverage decisions — see our Whole Genome Sequencing Services Hub.
De Novo Assembly — Building Genomes from Scratch
De novo assembly reconstructs a bacterial genome from overlapping sequencing reads without a reference template. This is the required approach for novel isolates, environmental strains, and any organism lacking a high-quality reference genome. The quality of the resulting assembly — measured by contig N50, number of contigs, largest contig, and BUSCO completeness — depends heavily on the sequencing technology mix.
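As an illustration of the contiguity metrics named above, the following minimal sketch computes N50 and L50 from a list of contig lengths. The function name and the example contig lengths are hypothetical; in practice QUAST reports these values directly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp.

    N50 is the length of the contig at which the running total of
    length-sorted contigs first reaches half of the assembly size;
    L50 is the number of contigs needed to get there.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count


# Hypothetical draft assembly: five contigs totalling 5.0 Mb
draft = [2_000_000, 1_500_000, 800_000, 500_000, 200_000]
n50, l50 = n50_l50(draft)
print(f"N50 = {n50} bp, L50 = {l50}")
```

A higher N50 and a lower L50 both indicate a more contiguous assembly, but neither says anything about correctness, which is why completeness and contamination checks are reported alongside them.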
Short-Read Assembly: High Accuracy, Incomplete Genomes
Illumina short-read sequencing (2×150 bp or 2×250 bp) at 100-200× coverage produces highly accurate raw reads, with per-base error rates of roughly 0.1% and typically more than 90% of bases at Q30 or above. The standard de novo assembly pipeline (SPAdes or MEGAHIT → QUAST quality assessment → Prokka annotation) generates a draft genome consisting of 20-100 contigs, with a contig N50 typically in the 100-500 kb range. For many applications this is sufficient: gene prediction captures >97% of coding sequences, and BUSCO completeness scores routinely exceed 95%. A short-read-only bacterial genome costs $100-200 and can be delivered in 20-30 working days.
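The coverage figures quoted above translate into read counts with simple arithmetic. The sketch below assumes a hypothetical 5 Mb genome and 2×150 bp reads; the function name is illustrative only.

```python
import math


def read_pairs_for_coverage(genome_size_bp, target_coverage, read_length_bp=150):
    """Read pairs needed so that pairs * 2 * read_length / genome_size >= coverage."""
    bases_needed = genome_size_bp * target_coverage
    return math.ceil(bases_needed / (2 * read_length_bp))


# Hypothetical 5 Mb genome at 100x with 2x150 bp reads
pairs = read_pairs_for_coverage(5_000_000, 100)
print(pairs)
```

This is the nominal figure before quality trimming and adapter removal, so a real run is usually planned with some headroom.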
The limitation is structural. Bacterial genomes contain repetitive elements — rRNA operons (5-7 kb), insertion sequences (0.7-2 kb), transposons, and prophage regions — that exceed the 300-500 bp insert size of a paired-end library. When the assembler encounters a repeat longer than the insert size, it cannot determine how many copies exist or how they are arranged, and the assembly fractures. The result is a genome represented as a set of contigs rather than a complete circular chromosome. Plasmids, which share repetitive elements (insertion sequences, transposons) with the chromosome, are particularly difficult to resolve — short-read assemblies often collapse multiple plasmids into a single chimeric contig or fragment a single plasmid across several contigs.
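The repeat-spanning argument can be made concrete with a small sketch. It treats a repeat as resolvable only when a single read or insert covers the whole repeat plus some unique sequence on each side. The 100 bp flank requirement and the 15 kb long-read length are assumptions chosen for illustration; real assemblers use more nuanced graph-based criteria.

```python
def spans_repeat(repeat_bp, span_bp, min_flank_bp=100):
    """True if one read/insert can cover the repeat plus unique flanks on both sides."""
    return span_bp >= repeat_bp + 2 * min_flank_bp


# Upper-end repeat sizes from the text; span lengths are illustrative
repeats = {"insertion sequence": 2_000, "rRNA operon": 7_000}
spans = {"Illumina insert": 500, "long read": 15_000}

resolved = {
    (repeat, span): spans_repeat(repeat_len, span_len)
    for repeat, repeat_len in repeats.items()
    for span, span_len in spans.items()
}

for (repeat, span), ok in resolved.items():
    print(f"{repeat:20s} {span:16s} {'spanned' if ok else 'not spanned'}")
```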
Hybrid Assembly: Complete, Circularized Genomes
Hybrid assembly combines long reads for structural continuity with short reads for base-level accuracy. PacBio HiFi reads (CCS mode, 15-25 kb, ≥99.9% accuracy) or Oxford Nanopore reads (R10.4.1 chemistry, 50-100+ kb, >99% modal accuracy with Dorado super-accurate basecalling) span the repetitive elements that fracture short-read assemblies. The long reads are assembled into 1-4 contigs — typically one per chromosome plus one per large plasmid — and the short reads are used to polish residual indel errors at homopolymer runs.
A widely used tool for bacterial hybrid assembly is Unicycler (Wick et al., 2017), which builds a SPAdes assembly graph from Illumina reads and then uses long reads to bridge repeats, typically producing a complete, circularized genome. An alternative workflow assembles long reads first with Flye (ONT) or Hifiasm (HiFi), then polishes with Medaka (ONT) or gcpp (PacBio), followed by a final Illumina polishing step with Pilon or Polypolish. Wick and Holt (2021) benchmarked long-read assemblers for prokaryotic genomes; more recent reports, which postdate that benchmark, indicate that ONT-only assemblies with R10.4.1 chemistry and Autocycler + Medaka polishing can produce results comparable to hybrid assemblies, with zero median SNPs and zero median indels relative to curated reference genomes. This suggests that, for many bacterial genomes, hybrid assembly may no longer be mandatory when the latest ONT chemistry and basecalling algorithms are used.
CD Genomics performs bacterial hybrid assembly through its Microbial Genomics with Long-Read Sequencing service and Microbial Whole-Genome De Novo Sequencing service. Coverage recommendations: ≥50× for Illumina, ≥100× for PacBio HiFi, and ≥100× for Oxford Nanopore. Turnaround time is 30-45 working days for hybrid assemblies.
Assembly quality is assessed with three standard tools: QUAST for contiguity statistics (N50, L50, largest contig, total assembly size vs expected genome size), BUSCO for gene-level completeness against a lineage-specific set of conserved single-copy orthologs, and CheckM2 for genome completeness and contamination estimation. A publication-quality hybrid assembly should achieve >99% BUSCO completeness, <2% contamination, and ≤4 contigs for a typical single-chromosome bacterium with 1-3 plasmids.
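The publication-quality thresholds above can be expressed as a simple gate. This is a minimal sketch: the class and function names are hypothetical, and the input values would have to be copied from the BUSCO, CheckM2, and QUAST reports by hand or by a parser not shown here.

```python
from dataclasses import dataclass


@dataclass
class AssemblyQC:
    busco_complete_pct: float   # from the BUSCO summary
    contamination_pct: float    # from the CheckM2 report
    n_contigs: int              # from the QUAST report


def failed_checks(qc, min_busco=99.0, max_contamination=2.0, max_contigs=4):
    """Return the names of thresholds the assembly misses (empty list = pass)."""
    failures = []
    if not qc.busco_complete_pct > min_busco:
        failures.append("BUSCO completeness")
    if not qc.contamination_pct < max_contamination:
        failures.append("contamination")
    if not qc.n_contigs <= max_contigs:
        failures.append("contig count")
    return failures


hybrid = AssemblyQC(busco_complete_pct=99.6, contamination_pct=0.4, n_contigs=3)
draft = AssemblyQC(busco_complete_pct=97.1, contamination_pct=0.8, n_contigs=62)
print(failed_checks(hybrid))
print(failed_checks(draft))
```

The contig threshold assumes one chromosome and up to three plasmids, so it should be relaxed for species known to carry more replicons.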
In a comprehensive assessment of 7,280 bacterial genome assemblies submitted to NCBI, Wick and Holt (2021) demonstrated that hybrid assemblies achieved a median N50 of 5.1 Mb (essentially complete chromosomes) with median 2 contigs, compared to a median N50 of 198 kb with median 48 contigs for short-read-only assemblies, a roughly 26-fold improvement in contiguity from adding long reads. For genomes with high GC content (>65%), such as Streptomyces and Mycobacterium species, hybrid assembly also resolved GC-rich repetitive regions that remained fragmented in short-read-only assemblies. A 2024 benchmarking study of 20 bacterial isolates spanning 5 phyla found that hybrid assembly with Unicycler recovered a median of 3 complete plasmids per genome (range 0-8), while Illumina-only SPAdes assembly recovered a median of 0 complete plasmids. The difference underscores the functional impact of assembly strategy on downstream plasmid biology and AMR epidemiology analyses.
Figure 1: Bacterial WGS Assembly Quality Tiers — A three-column comparison of Draft (Illumina-only, ~60 contigs, N50 ~200 kb, BUSCO ~97%, $100-200), Near-Complete (Hybrid, 1-4 contigs, N50 ~4 Mb, BUSCO ~99.5%, $300-500), and Complete (Multi-Platform, 1 circular chromosome + resolved plasmids, N50 = genome size, BUSCO 100%, $500-800) assembly outcomes, with turnaround time and recommended applications for each tier.
While bacterial genomes are compact (typically 3-7 Mb) and amenable to complete assembly with current technologies, eukaryotic genomes present a different order of magnitude of complexity. For de novo sequencing of plant and animal genomes, which span 100 Mb to >10 Gb with complex repeat landscapes and polyploid genomes, see our De Novo Plant and Animal Genome Sequencing.
Re-Sequencing and Variant Calling
When a high-quality reference genome exists for the species, the analytical approach shifts from de novo assembly to reference-guided re-sequencing. The reads are aligned to the reference with BWA-MEM or Minimap2, and variants — single-nucleotide polymorphisms (SNPs), small insertions/deletions (indels), and larger structural variants — are called with bcftools, GATK, or DeepVariant. This workflow is faster, cheaper, and more sensitive to small variants than de novo assembly, making it the method of choice for comparative genomics, outbreak tracking, and mutation identification.
Case Study: Rediscovering Classical Mutations in Neurospora crassa
The power of WGS for mutation detection is elegantly demonstrated by McCluskey et al. (2011), who sequenced classical mutant strains of Neurospora crassa whose phenotypes had been known for decades but whose causative mutations had never been identified at the molecular level. Although N. crassa is a filamentous fungus rather than a bacterium, the approach transfers directly to bacterial mutants. The qa-X mutant, isolated in the 1970s, cannot grow on quinic acid as a sole carbon source, a phenotype mapped decades ago to linkage group VII but never resolved to a specific gene. Whole genome sequencing at 25× coverage identified a single-nucleotide substitution in the qa-1F gene (NCU06028) that introduced a premature stop codon, truncating the transcriptional activator protein required for expression of the quinic acid catabolism cluster. The entire project, from DNA extraction to validated mutation, was completed in under four weeks at a sequencing cost of approximately $1,000 in 2011 dollars; today, the equivalent experiment costs under $300 and can be completed in two weeks.
The analytical workflow for mutation detection follows a subtractive logic. Reads from the mutant strain are aligned to the wild-type reference genome, variants are called and filtered (removing those shared with the parental wild-type strain or present in population-level polymorphism databases), and the remaining candidate variants are annotated for functional impact. A non-synonymous SNP or frameshift indel in a gene functionally linked to the mutant phenotype is the lead candidate. Sanger sequencing of the candidate locus confirms the variant, and complementation — re-introducing the wild-type allele into the mutant background and observing phenotype restoration — provides definitive causal validation.
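The subtractive logic described above can be sketched with plain set operations. Variants are represented here as (contig, position, ref, alt) tuples mapped to a predicted effect; the effect labels follow Sequence Ontology terms of the kind SnpEff reports, and all variant data below are made up for illustration.

```python
# Effects treated as likely to disrupt protein function (an illustrative choice)
DISRUPTIVE = {"stop_gained", "frameshift_variant", "missense_variant"}


def candidate_variants(mutant, parent, known_polymorphisms=frozenset()):
    """Keep mutant-only variants with a disruptive predicted effect.

    mutant: dict mapping (contig, pos, ref, alt) -> effect label
    parent: iterable of variant keys called in the parental wild-type strain
    known_polymorphisms: iterable of variant keys seen in population data
    """
    background = set(parent) | set(known_polymorphisms)
    return sorted(
        key for key, effect in mutant.items()
        if key not in background and effect in DISRUPTIVE
    )


# Hypothetical call sets
mutant_calls = {
    ("chr", 1200, "G", "A"): "synonymous_variant",
    ("chr", 45310, "C", "T"): "stop_gained",
    ("chr", 98002, "A", "G"): "missense_variant",
}
parent_calls = {("chr", 98002, "A", "G")}

print(candidate_variants(mutant_calls, parent_calls))
```

Only the premature stop at position 45310 survives: the synonymous change is filtered on effect, and the missense change is shared with the parent. Candidates that pass still need Sanger confirmation and complementation.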
For mutagenesis screens, CRISPR-Cas9 genome editing validation, and experimental evolution studies, re-sequencing at 50-100× provides the variant detection sensitivity needed to identify single mutations against a background of spontaneous mutations accumulated during strain handling. The bioinformatic pipeline (BWA-MEM alignment → GATK HaplotypeCaller variant calling → SnpEff functional annotation) is mature, well-documented, and routinely delivers >99.9% sensitivity for SNPs at ≥30× coverage. Because bacterial genomes are haploid, HaplotypeCaller should be run with ploidy set to 1 (--sample-ploidy 1) rather than its diploid default. For large-scale re-sequencing projects across multiple isolates, such as outbreak investigations, strain collections, or experimental evolution panels, see our Whole Genome Sequencing services for batch processing options and comparative genomic analysis.
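As a rough sketch of the alignment and calling steps, the helper below only builds the shell command lines and does not run them. File names, the sample name, and the thread count are hypothetical; the ploidy setting reflects the haploid nature of bacterial genomes. Duplicate marking, variant filtering, and SnpEff annotation are left out for brevity, so this is an outline rather than a production pipeline.

```python
def resequencing_commands(ref, r1, r2, sample, threads=4):
    """Return shell command strings for indexing, alignment, and variant calling."""
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    # GATK requires read-group information; "\\t" keeps a literal \t for bwa to expand
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        f"bwa index {ref}",
        f"samtools faidx {ref}",
        f"gatk CreateSequenceDictionary -R {ref}",
        f"bwa mem -t {threads} -R '{read_group}' {ref} {r1} {r2} "
        f"| samtools sort -o {bam} -",
        f"samtools index {bam}",
        f"gatk HaplotypeCaller -R {ref} -I {bam} -O {vcf} --sample-ploidy 1",
    ]


commands = resequencing_commands("ref.fasta", "iso1_R1.fastq.gz", "iso1_R2.fastq.gz", "iso1")
for command in commands:
    print(command)
```

Generating the commands as data makes it easy to review them, log them, or hand them to a workflow manager before anything is executed.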
Figure 2: Variant Detection Pipeline for Microbial Re-Sequencing — A 6-step horizontal workflow diagram: (1) Raw FASTQ Reads → (2) Quality Control (FastQC, MultiQC) → (3) Read Alignment to Reference Genome (BWA-MEM / Minimap2) → (4) Variant Calling (GATK HaplotypeCaller / DeepVariant / bcftools mpileup) → (5) Variant Filtering & Functional Annotation (SnpEff) → (6) Candidate Validation (Sanger sequencing + complementation assay). Each step includes the key tool name and a one-line description of its role in the pipeline.
Plasmid Reconstruction and Mobile Elements
Plasmids are the primary vehicles of horizontal gene transfer in bacteria, shuttling antimicrobial resistance genes, virulence factors, and metabolic capabilities between strains and species. A complete bacterial WGS must reconstruct plasmid sequences separately from the chromosome to assess the mobility potential of the genes they carry — a resistance gene on a conjugative plasmid poses a fundamentally different public health risk than the same gene chromosomally encoded.
Short-read-only assembly struggles with plasmid reconstruction for the same reason it struggles with chromosomal repeats: plasmids share mobile elements (insertion sequences, transposons, integrons) with each other and with the chromosome, creating assembly graph tangles that collapse multiple replicons into chimeric contigs. PlasmidSPAdes, a specialized SPAdes module, improves plasmid recovery from short-read data by using coverage differences between plasmid and chromosome to guide assembly, but complete, unambiguous plasmid sequences typically require long reads.
Hybrid assembly with Unicycler is the current gold standard for plasmid-resolved bacterial genomes. Unicycler explicitly models plasmid copy number — a 5-copy plasmid has 5× the sequencing depth of a single-copy chromosome — and uses this information to separate chromosomal and plasmid contigs. The output is a set of complete, circularized sequences: one per chromosome and one per distinct plasmid species. For laboratories tracking plasmid-mediated resistance spread through conjugation or transduction experiments, complete plasmid sequences enable precise identification of the mobile elements carrying resistance genes and the conjugation machinery genes that enable their transfer.
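The depth relationship mentioned above (a 5-copy plasmid shows about 5× the chromosomal depth) can be sketched as a ratio. This is not Unicycler's internal algorithm, only the arithmetic behind the idea; the contig names and depths are invented.

```python
def relative_copy_number(contig_depths, chromosome_contig):
    """Estimate per-contig copy number as depth relative to the chromosome."""
    chromosome_depth = contig_depths[chromosome_contig]
    if chromosome_depth <= 0:
        raise ValueError("chromosome depth must be positive")
    return {
        name: round(depth / chromosome_depth, 1)
        for name, depth in contig_depths.items()
    }


# Hypothetical mean read depths per circularized contig
depths = {"chromosome": 100.0, "plasmid_A": 500.0, "plasmid_B": 230.0}
copies = relative_copy_number(depths, "chromosome")
print(copies)
```

Depth-based estimates are approximate: library preparation can under-represent small plasmids, and growth phase affects the chromosome-to-plasmid ratio.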
CD Genomics' bacterial WGS services include plasmid reconstruction as a standard component of hybrid assembly. For projects specifically focused on plasmid biology, ARG Antibiotic Resistance Gene Analysis provides dedicated resistance gene annotation using the CARD and ResFinder databases, with plasmid vs chromosomal localization of each detected gene.
The public health and regulatory significance of plasmid reconstruction is growing. The WHO's Tricycle protocol for ESBL-producing E. coli surveillance and the CDC's AR Lab Network both rely on WGS-based plasmid typing to track resistance gene epidemiology. For food safety microbiology, plasmid reconstruction helps distinguish clonal transmission from independent acquisition: two isolates sharing the same chromosomal background but different plasmid profiles suggest independent plasmid acquisition events rather than clonal transmission.
Practical Considerations
DNA Quantity and Quality
Bacterial WGS is relatively forgiving of input DNA compared to eukaryotic WGS, but the requirements differ by platform. For Illumina short-read sequencing: ≥200 ng of genomic DNA at ≥10 ng/µL, OD 260/280 of 1.8-2.0. DNA sheared to <10 kb is acceptable and even expected for short-read library preparation. For PacBio HiFi: ≥5 µg of high-molecular-weight (HMW) DNA with fragment sizes ≥20 kb, OD 260/280 of 1.8-2.0. For Oxford Nanopore: 1-5 µg of HMW DNA with fragments ≥20 kb; the R10.4.1 chemistry tolerates lower input quantities than earlier versions.
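The input requirements above lend themselves to a quick pre-submission check. The sketch below encodes the stated minimums; taking 1 µg as the Nanopore floor and applying the 1.8-2.0 purity window to Nanopore samples are assumptions, since the text gives a 1-5 µg range and states the ratio only for Illumina and PacBio. All names are hypothetical.

```python
REQUIREMENTS = {
    "illumina": {"min_mass_ng": 200, "min_conc_ng_per_ul": 10, "min_fragment_kb": None},
    "pacbio_hifi": {"min_mass_ng": 5000, "min_conc_ng_per_ul": None, "min_fragment_kb": 20},
    "nanopore": {"min_mass_ng": 1000, "min_conc_ng_per_ul": None, "min_fragment_kb": 20},
}


def sample_problems(platform, mass_ng, conc_ng_per_ul, fragment_kb, a260_280):
    """Return a list of unmet requirements for the chosen platform (empty = OK)."""
    req = REQUIREMENTS[platform]
    problems = []
    if mass_ng < req["min_mass_ng"]:
        problems.append("mass")
    if req["min_conc_ng_per_ul"] is not None and conc_ng_per_ul < req["min_conc_ng_per_ul"]:
        problems.append("concentration")
    if req["min_fragment_kb"] is not None and fragment_kb < req["min_fragment_kb"]:
        problems.append("fragment length")
    if not 1.8 <= a260_280 <= 2.0:
        problems.append("purity")
    return problems


print(sample_problems("illumina", 500, 25, 8, 1.9))
print(sample_problems("pacbio_hifi", 2000, 50, 12, 1.7))
```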
DNA extraction method matters. Column-based kits (Qiagen DNeasy, Zymo Research) yield DNA suitable for short-read sequencing but may shear DNA below the 20 kb threshold for long-read libraries. For long-read sequencing, phenol-chloroform extraction or agarose-embedded lysis protocols preserve fragment length. CD Genomics accepts both extracted DNA and bacterial cell pellets, with extraction protocols optimized for each sample type.
Genome complexity, particularly GC content and repeat density, influences assembly success beyond DNA quality alone. High-GC bacteria such as Streptomyces (72% GC), Mycobacterium tuberculosis (65% GC), and Burkholderia (67% GC) present two challenges: GC-biased coverage dropout during Illumina library amplification, and a higher density of GC-rich inverted repeats that confound assemblers. PCR-free library preparation kits mitigate the amplification bias, and the long reads in a hybrid assembly span the GC-rich repeat regions that fragment short-read assemblies. At the other extreme, AT-rich genomes (e.g., Mycoplasma, 24-32% GC) present their own challenges: homopolymer runs of A/T are the primary source of indel errors in both PacBio and ONT reads, making Illumina polishing a critical step for accurate gene prediction in these organisms. Genome size also spans more than an order of magnitude (roughly 28-fold): the smallest genomes of culturable bacteria (Mycoplasma genitalium, 0.58 Mb) can be assembled to completion from a single MinION flow cell, while the largest bacterial genomes (Sorangium cellulosum, 14.8 Mb; Minicystis rosea, 16 Mb) require deeper long-read coverage and may still produce multiple contigs even with hybrid assembly.
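GC content, the first complexity factor discussed above, is straightforward to compute from an assembled sequence. A minimal sketch, assuming the sequence is already loaded as a string; ambiguous bases such as N are excluded from the denominator.

```python
def gc_percent(sequence):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Toy sequences, far shorter than any real contig
print(round(gc_percent("GGCCAT"), 1))
print(round(gc_percent("ATATNNGC"), 1))
```

For whole genomes, computing the same statistic in sliding windows helps locate the GC-extreme regions where coverage dropout is most likely.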
Single Isolate vs Batch Processing
A single bacterial isolate sequenced at 100× coverage costs $100-500 depending on the technology mix. For projects involving multiple isolates — outbreak investigations, strain collections, mutant libraries — batch processing in 96-well plates reduces per-sample library preparation costs through automation. The bioinformatic analysis for batch projects scales linearly: each isolate is assembled or variant-called independently, and comparative analyses (pan-genome construction, phylogenetic tree inference, resistance gene presence/absence profiling) are performed across the full set. For more details on scaling WGS to larger cohorts, see our Whole Genome Sequencing service page.
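One of the cross-isolate analyses mentioned above, comparing isolates at SNP resolution, can be sketched as a pairwise distance calculation over aligned sequences. The toy alignment is invented and far shorter than a real core-genome alignment; positions with gaps or ambiguous bases are skipped rather than counted, which is one common convention and an assumption here.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ at unambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in "ACGT" and b in "ACGT"
    )


def distance_matrix(alignment):
    """Pairwise SNP distances for a dict of isolate name -> aligned sequence."""
    return {
        (i, j): snp_distance(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }


# Hypothetical three-isolate alignment
alignment = {"iso1": "ACGTACGT", "iso2": "ACGTACGA", "iso3": "TCGAACGA"}
matrix = distance_matrix(alignment)
print(matrix)
```

Note that this pairwise step grows with the square of the number of isolates, unlike the per-isolate assembly and calling steps.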
Bioinformatics Deliverables
A standard bacterial WGS project from CD Genomics delivers raw sequencing data (FASTQ), a quality control report (FastQC, MultiQC), and analysis-specific outputs. For de novo assembly: assembled genome in FASTA format, gene annotation in GFF/GBK (via Prokka), functional annotation against NR, GO, KEGG, COG, SwissProt, Pfam, and CAZy databases. For re-sequencing: aligned reads (BAM), variant calls (VCF) with SnpEff functional annotation. Specialized analyses — antimicrobial resistance gene detection via CARD and ResFinder, virulence factor annotation via VFDB, plasmid reconstruction, prophage prediction, CRISPR array detection, and pan-genome analysis — are available as add-ons. For projects requiring custom bioinformatics pipelines tailored to specific research questions, our Bacterial Whole Genome Sequencing service includes consultation on analysis design and deliverables. All data are delivered via secure download, with hard drive shipment for large datasets.
Figure 3: Bacterial WGS Decision Tree — A flowchart starting from "Single Bacterial Isolate" branching into two paths. Path A (De Novo Assembly): No reference genome → Short-Read Only (Illumina, $100-200, 20-100 contigs, ~97% BUSCO) or Hybrid Assembly (Illumina + PacBio/ONT, $300-800, 1-4 contigs, 100% BUSCO with complete plasmids). Path B (Re-Sequencing): Reference genome exists → Variant Calling (BWA-MEM + GATK/DeepVariant, SNPs + indels, 50-100× coverage, $100-300). Output annotations for both paths: Prokka annotation, CARD/ResFinder AMR, VFDB virulence, plasmid reconstruction.
Frequently Asked Questions
Why should I choose bacterial WGS over 16S rRNA sequencing?
16S rRNA sequencing identifies which bacterial species are present in a sample. WGS reveals the complete gene content of a specific isolate: antimicrobial resistance genes, virulence factors, metabolic pathways, plasmids, prophages, and SNPs. If the question is "what species is this?", 16S is appropriate and costs $5-15. If the question is "what can this bacterium do, and how does it differ from related strains?", WGS is required and costs $100-500.
What is the difference between a draft genome and a complete genome?
A draft genome (short-read-only assembly) consists of 20-100 contigs with an N50 of 100-500 kb. Gene content is >97% complete but the genome is fragmented at repeats. A complete genome (hybrid assembly) consists of 1-4 circularized contigs with zero gaps, representing the chromosome and individual plasmids. Complete genomes are required for plasmid analysis, repeat structure characterization, and publication-quality reference genomes.
How much DNA do I need for bacterial WGS?
For Illumina short-read sequencing: ≥200 ng at ≥10 ng/µL. For PacBio HiFi: ≥5 µg of HMW DNA with fragments ≥20 kb. For Oxford Nanopore: 1-5 µg of HMW DNA with fragments ≥20 kb. DNA can be extracted from bacterial cell pellets or liquid culture; both are accepted by CD Genomics. Phenol-chloroform extraction is preferred for long-read sequencing to preserve fragment length.
Can bacterial WGS identify antimicrobial resistance genes?
Yes. WGS detects antimicrobial resistance genes using curated databases — CARD (Comprehensive Antibiotic Resistance Database) and ResFinder — that classify genes by resistance mechanism, drug class, and evidence level. The analysis distinguishes between plasmid-borne and chromosomally encoded resistance genes, which is critical for assessing horizontal transfer risk. CD Genomics offers dedicated ARG Antibiotic Resistance Gene Analysis for comprehensive resistance profiling.
How do I choose between Illumina-only and hybrid assembly for my bacterial genome?
If the goal is gene content analysis, species identification, or AMR screening, Illumina-only assembly at 100-200× ($100-200) is sufficient. If the goal is a complete, publication-quality reference genome with resolved plasmids, or if the genome contains large repeats (most bacteria do), hybrid assembly with long reads ($300-800) is required. For projects involving plasmid biology, conjugation studies, or regulatory submissions, hybrid assembly is strongly recommended.
What is the turnaround time for bacterial WGS?
Standard turnaround is 20-30 working days for short-read-only de novo assembly and 30-45 working days for hybrid assembly. Re-sequencing projects with variant calling are typically 15-25 working days. Batch projects with 10-100 isolates may extend to 45-60 working days depending on scale.
What bioinformatic deliverables do I receive?
Standard deliverables: raw sequencing data (FASTQ), quality control report (FastQC, MultiQC), assembled genome (FASTA), and gene annotation (GFF/GBK via Prokka). For re-sequencing: aligned reads (BAM), variant calls (VCF) with SnpEff annotation. Optional add-ons: AMR gene detection (CARD, ResFinder), virulence factor annotation (VFDB), plasmid reconstruction, prophage prediction, CRISPR array detection, and comparative genomics (pan-genome, phylogeny).
How does bacterial WGS cost compare to 16S sequencing for large isolate collections?
A single 16S Sanger sequence costs $5-15. A single bacterial WGS costs $100-500. For 100 isolates, 16S costs $500-1,500 while WGS costs $10,000-50,000 at single-isolate prices, before any batch discount. The decision depends on the information required: if taxonomy alone is sufficient, 16S is far more economical. If gene content, AMR profiles, and SNP-level resolution are needed, WGS provides information that 16S cannot deliver at any price. Many projects use 16S for initial screening of large collections and reserve WGS for isolates of interest identified by 16S screening.
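The cost comparison is plain multiplication of the per-sample price ranges quoted in this answer, with no batch discount applied. The sketch below also prices the tiered strategy (16S screening of every isolate, WGS for a subset); the choice of 10 follow-up isolates is purely illustrative.

```python
def project_cost_range(n_isolates, per_sample_low, per_sample_high):
    """Return (low, high) total cost in dollars for n isolates."""
    return n_isolates * per_sample_low, n_isolates * per_sample_high


cost_16s = project_cost_range(100, 5, 15)
cost_wgs = project_cost_range(100, 100, 500)

# Tiered strategy: 16S on all 100 isolates, then WGS on 10 isolates of interest
followup = project_cost_range(10, 100, 500)
cost_tiered = (cost_16s[0] + followup[0], cost_16s[1] + followup[1])

print(cost_16s, cost_wgs, cost_tiered)
```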
References:
- Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Computational Biology. 2017;13(6):e1005595. doi:10.1371/journal.pcbi.1005595
- Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2021;8:2138. doi:10.12688/f1000research.21782.4
- McCluskey K, Wiest AE, Grigoriev IV, et al. Rediscovery by Whole Genome Sequencing: Classical Mutations and Genome Polymorphisms in Neurospora crassa. G3: Genes|Genomes|Genetics. 2011;1(4):303-316. doi:10.1534/g3.111.000307
- De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nature Reviews Genetics. 2021;22(9):572-587. doi:10.1038/s41576-021-00367-3
- Danecek P, Bonfield JK, Liddle J, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2):giab008. doi:10.1093/gigascience/giab008
- Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25(7):1043-1055. doi:10.1101/gr.186072.114
- Tatusova T, DiCuccio M, Badretdin A, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Research. 2016;44(14):6614-6624. doi:10.1093/nar/gkw569
- Bush SJ, Foster D, Eyre DW, et al. Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines. GigaScience. 2020;9(2):giaa007. doi:10.1093/gigascience/giaa007
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.