Large-Scale Whole Genome Re-Sequencing Projects: Population Genomics, GWAS, and Cost Optimization for High-Volume Samples

Q: What is large-scale whole genome re-sequencing and how is it different from de novo assembly?

Large-scale re-sequencing maps reads from hundreds to thousands of individuals against an existing reference genome to identify genetic variants (SNPs, indels, structural variants) across a population. It is fundamentally different from de novo assembly, which constructs a genome from scratch without a reference. Re-sequencing is faster and cheaper per sample but requires a high-quality reference genome.

Q: How many samples do I need for a population genomics or GWAS study?

For population structure and demographic history, 10-30 individuals per population at ≥10× coverage is often sufficient. For GWAS, sample sizes of hundreds to thousands are required to detect loci explaining 0.1-1% of phenotypic variance. For genomic selection, 500-2,000 individuals is standard for training population construction in plant and animal breeding.

Q: What sequencing depth should I choose for a large-scale re-sequencing project?

Low-coverage WGS (1-4×) with imputation is the default for cohorts exceeding 300 samples, capturing common variants at a fraction of deep WGS cost. Standard coverage (10-15×) provides reliable rare variant calls for demographic inference and selection scans. Deep coverage (30×) is recommended for reference panel construction and high-confidence variant detection.

Q: How do I control costs for a project with hundreds to thousands of samples?

The three highest-impact strategies are: (1) use low-coverage WGS + imputation for the full cohort, with a custom reference panel built from 10-20% of samples at 30×; (2) negotiate volume pricing and perform pre-pool QC runs to avoid costly requeued sequencing runs; and (3) adopt compressed formats (CRAM, PGEN) to cut storage costs by 30-98%.

Q: What bioinformatic infrastructure do I need for joint analysis of 1,000 genomes?

For alignment and per-sample variant calling, a 500-core HPC cluster or equivalent cloud compute can process 1,000 30× genomes in under a week. For joint genotyping, at least 1 TB of RAM and 50 TB of fast storage are recommended for cohorts exceeding 2,000 samples. Workflow managers (Nextflow, Snakemake) and containerized tools (Docker, Singularity) are strongly recommended for reproducibility.

Q: Can I combine samples sequenced at different depths in the same analysis?

Yes, joint genotyping with GATK handles heterogeneous coverage. This is routine in projects combining a deeply sequenced reference panel with a low-coverage discovery cohort. Variant calling sensitivity differs by depth, so batch effects should be explicitly modeled. Imputation with GLIMPSE2 can harmonize coverage differences by imputing low-coverage samples to reference panel resolution.

Q: What are the data storage requirements for a large re-sequencing project?

A single 30× genome generates 200-300 GB of total data; a 1,000-sample project at 10× requires 100-150 TB of active storage and 50-80 TB for long-term archival. Cloud archival storage costs roughly $100-400 per month for a 100 TB archive. Using CRAM instead of BAM cuts alignment storage by 30-50%; PGEN format cuts genotype storage by 98%.

Q: How does CD Genomics handle the logistics of large-scale re-sequencing projects?

CD Genomics provides a dedicated project manager, LIMS-tracked sample handling in 96-well format, automated liquid handling for library preparation, pre-pool QC runs on every batch, joint variant calling with GATK, and comprehensive population genetics analysis. Raw data (FASTQ), aligned data (BAM/CRAM), variant calls (VCF), and publication-ready analysis outputs are delivered with a detailed methods document.

Moving Beyond Single Genomes — When WGS Scales to Populations

A plant breeder needs to genotype 2,000 doubled-haploid maize lines to train a genomic selection model that predicts hybrid performance before field testing. A conservation geneticist wants to scan 500 Atlantic salmon genomes for signatures of local adaptation to warming rivers. An evolutionary biologist plans to re-sequence 300 individuals across 12 populations of a non-model fish to reconstruct its demographic history since the last glacial maximum. These projects share a common DNA: they all require whole-genome re-sequencing (WGS) at population scale — and the logistics, cost, and bioinformatic challenges of 500 genomes are categorically different from those of 5.

Large-scale WGS re-sequencing — defined here as projects involving hundreds to thousands of individuals sequenced at 1× to 30× coverage — has become the default approach for population genomics, genome-wide association studies (GWAS), genomic selection in plant and animal breeding, and evolutionary biology since roughly 2022. The convergence of plummeting sequencing costs (a 30× human genome now costs under $300 for sequencing consumables alone), mature bioinformatic pipelines capable of joint-calling thousands of samples, and validated low-coverage imputation methods has made population-scale WGS feasible for individual research groups — not just consortia with eight-figure budgets.

CD Genomics provides Whole Genome Sequencing services scaled to population-level projects, from DNA extraction in 96-well format through joint variant calling and population genetic analysis. This article covers the complete workflow for large-scale re-sequencing: project design, sample logistics, cost optimization, bioinformatic strategies for joint analysis of hundreds to thousands of genomes, and data management for publication-ready deliverables.

What Population-Scale Re-Sequencing Answers

A single reference genome tells you what one individual carries. A population of re-sequenced genomes tells you what the species carries — and more importantly, how that variation is distributed across geography, ecology, and time. The core questions that population-scale WGS answers fall into four categories:

Population structure and demographic history. Principal component analysis (PCA), ADMIXTURE-based ancestry estimation, and identity-by-descent (IBD) inference from WGS data resolve population subdivisions, gene flow, and historical bottlenecks at resolutions unreachable by reduced-representation methods. Pairwise sequentially Markovian coalescent (PSMC) and its multi-sample extensions (MSMC, SMC++) reconstruct effective population size trajectories across hundreds of thousands of generations from a single diploid genome or a handful of genomes, providing a window into the demographic history of understudied species.

Selective sweeps and local adaptation. Comparing allele frequency spectra and population differentiation (Fst) across the genome identifies regions where selection has driven alleles to fixation or near-fixation. Methods like XP-CLR (cross-population composite likelihood ratio), iHS (integrated haplotype score), and nucleotide diversity (π) ratio scans pinpoint the specific genomic intervals under selection — from the lactase persistence sweep in human populations to salinity tolerance QTLs in rice landraces. WGS resolution matters here: genotyping arrays capture only common variants present in the design panel, while WGS captures the full allele frequency spectrum, including the low-frequency and population-private variants that are often the most informative for detecting recent selection.

Genome-wide association studies (GWAS). For traits with complex genetic architectures — yield in crops, disease resistance in livestock, body size in fish — GWAS tests millions of SNPs for statistical association with phenotype. Modern mixed-model approaches (GEMMA, GCTA, BOLT-LMM) account for population structure and cryptic relatedness, reducing false positives that plagued early candidate-gene studies. The statistical power of GWAS scales primarily with sample size, not marker density beyond a certain point — but WGS provides two advantages over genotyping arrays: it captures rare causal variants that arrays miss, and it enables direct fine-mapping of GWAS peaks to candidate causal variants without subsequent targeted sequencing.

A concrete example from 2025 illustrates what population-scale re-sequencing delivers for crop GWAS. Zhang et al. (Frontiers in Plant Science) re-sequenced 348 diverse soybean accessions at 10× coverage, detecting 1,882,531 SNPs for a hundred-seed weight GWAS. A significant peak on chromosome 19 co-localized with a biparental QTL (qHSW-19-4) mapped in an independent RIL population, narrowing the candidate interval to 580 kb. Four high-priority genes within this interval were validated by qRT-PCR — a pipeline from population WGS to functional candidates that exemplifies how moderate-coverage re-sequencing of a few hundred individuals provides sufficient resolution for GWAS peak discovery, after which fine-mapping and functional validation take over.

Genomic selection and prediction. In plant and animal breeding, genomic selection uses genome-wide markers to predict breeding values (GEBVs) for selection candidates. The 2025 Big BIT maize experiment — a multi-location, multi-year validation study across thousands of hybrids — confirmed that whole-genome prediction-enabled genomic selection anchored in broad-environment training data is the most effective early-stage genetic evaluation strategy. WGS, or low-coverage WGS with imputation to sequence level, provides the dense marker data that genomic selection models require without the ascertainment bias of SNP arrays.

How Many Samples Do You Really Need?

Sample size requirements depend on the question. For population structure and demographic inference, 10-30 individuals per population with WGS at ≥10× coverage is typically sufficient. For GWAS with realistic effect sizes (explaining 0.1-1% of phenotypic variance), hundreds to thousands of individuals are needed — power calculations should be performed before committing to sequencing. For genomic selection, training population sizes of 500-2,000 individuals are common in plant breeding programs, with prediction accuracy plateauing as training sets exceed several thousand.
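The power calculation recommended above can be sketched with a standard normal approximation. This is a minimal sketch, not a replacement for a dedicated power tool: it assumes a quantitative trait, unrelated individuals, a directly genotyped causal variant, and a 1-df test. The default threshold of 5e-8 is the conventional human genome-wide cutoff and is an assumption here; many plant and animal studies use a less stringent, marker-count-based cutoff, so it is left as a parameter. The function names are illustrative.

```python
from statistics import NormalDist


def gwas_power(n_samples, variance_explained, alpha=5e-8):
    """Approximate power of a 1-df association test for one additive locus.

    Assumes unrelated individuals and a directly genotyped causal variant.
    alpha=5e-8 is an assumed default, not a value from the article.
    """
    if not 0.0 < variance_explained < 1.0:
        raise ValueError("variance_explained must be between 0 and 1")
    std_normal = NormalDist()
    # Non-centrality parameter of the chi-square test statistic.
    ncp = n_samples * variance_explained / (1.0 - variance_explained)
    shift = ncp ** 0.5
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # A 1-df chi-square test is equivalent to a two-sided z-test
    # on a normal variable whose mean is shifted by sqrt(ncp).
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def samples_for_power(variance_explained, target_power=0.8, alpha=5e-8):
    """Smallest N (searched in steps of 10) that reaches the target power."""
    n = 10
    while gwas_power(n, variance_explained, alpha) < target_power:
        n += 10
    return n


if __name__ == "__main__":
    for n in (500, 1000, 2000, 5000):
        print(n, round(gwas_power(n, 0.01), 3))
```

Under these assumptions, a locus explaining 1% of phenotypic variance needs on the order of 4,000 samples for 80% power at the 5e-8 threshold, which is why sample size, rather than marker density, is usually the binding constraint.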

A practical rule of thumb: if you can answer your question with fewer than 100 individuals, WGS at 10-30× is straightforward and cost-effective. If you need 500-5,000 individuals, low-coverage WGS (1-4×) with imputation to a reference panel becomes the dominant cost-optimization strategy. Above 10,000 individuals, consider a staged design — low-coverage WGS for the full cohort, with a subset of 10-20% sequenced at 30× to serve as the imputation reference panel.

Project Design for Scale — Logistics, Not Biology, Is the Bottleneck

A 500-sample WGS project is a logistics problem more than a biology problem. The wet-lab workflow — DNA extraction through library preparation to sequencing — must be designed for parallel throughput, sample tracking integrity, and batch-effect minimization from the outset. Retrospective fixes for sample swaps, uneven coverage, or batch-confounded variant calls are expensive or impossible.

DNA Extraction and Quality Control at Scale

For population-scale projects, DNA extraction moves from individual spin columns to 96-well plate formats. Key requirements:

Input quantity: ≥500 ng of high-quality genomic DNA per sample is recommended for PCR-free library preparation, which avoids the amplification-induced GC bias and PCR duplicate artifacts that disproportionately affect variant calling in population cohorts. For low-input samples (degraded museum specimens, single small invertebrates), PCR-plus workflows are acceptable but should be applied uniformly within a project, because mixing PCR-free and PCR-plus libraries confounds batch with biology.

Quality metrics: Every sample should be quantified by fluorescence-based dsDNA assay (Qubit or PicoGreen) and sized by capillary electrophoresis (TapeStation or Fragment Analyzer). DIN (DNA Integrity Number) scores below 6 indicate degradation that may require protocol adjustments. In large cohorts analyzed by the Tohoku Medical Megabank project, DIN scores ranged from 1.6 to 9.2 across 100,000 samples — the key is documenting, not eliminating, this variation so it can be modeled as a technical covariate.

Normalization and plating: DNA should be normalized to a uniform concentration (typically 10-50 ng/µL) across all samples and aliquoted into 96-well plates. Automated liquid handlers (Agilent Bravo, Biomek NXp) are strongly recommended above ~100 samples to eliminate manual pipetting errors that cause sample swaps. All plates should be barcoded and tracked through a laboratory information management system (LIMS).
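The intake rules above (at least 500 ng of input, DIN of 6 or higher, normalization to a uniform concentration) can be expressed as a small triage and dilution helper. This is a sketch under stated assumptions: the function names and flag labels are hypothetical, the thresholds are the ones quoted in the text, and the dilution uses the usual C1·V1 = C2·V2 relation. A real LIMS or liquid-handler worklist would add plate positions and minimum pipetting volumes.

```python
def triage_sample(total_ng, din, min_input_ng=500.0, min_din=6.0):
    """Return a list of QC flags for one sample (empty list = passes).

    Thresholds default to the values quoted in the text; the flag names
    are illustrative.
    """
    flags = []
    if total_ng < min_input_ng:
        flags.append("low_input")   # may need a PCR-plus workflow
    if din < min_din:
        flags.append("degraded")    # record DIN as a technical covariate
    return flags


def normalization_volumes(stock_ng_per_ul, target_ng_per_ul, final_volume_ul):
    """Return (dna_ul, diluent_ul) that bring a stock to the target concentration."""
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is more dilute than the target concentration")
    # C1 * V1 = C2 * V2, solved for V1.
    dna_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return dna_ul, final_volume_ul - dna_ul


if __name__ == "__main__":
    print(triage_sample(total_ng=300, din=4.2))
    print(normalization_volumes(100.0, 20.0, 50.0))
```

Samples that are flagged are not necessarily excluded; as noted above, the point is to document the variation so it can be modeled later.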

Library Preparation and Multiplexing

For population re-sequencing, the library preparation strategy determines both per-sample cost and data quality:

PCR-free library preparation is preferred whenever DNA input is at least 500 ng. PCR-free libraries eliminate amplification-induced GC bias, reduce duplicate rates, and produce more uniform coverage, all of which improve variant calling sensitivity, particularly in GC-rich and GC-poor regions. The trade-off is higher DNA input requirements and more stringent quality thresholds.

Unique dual indexes (UDIs) are mandatory for population-scale projects. Index hopping — where reads from one sample are misassigned to another during demultiplexing on patterned flow cells — can produce spurious heterozygous calls when a contaminating read carries a different allele than the true sample. UDIs, where both the i7 and i5 indexes are unique to each sample and the combination is validated, eliminate this risk. Single-index strategies should not be used for projects exceeding 96 samples.
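Whether a sample sheet really uses unique dual indexes can be checked before pooling. The sketch below assumes a hypothetical in-memory sample sheet that maps each sample ID to its (i7, i5) index pair; it reports any index sequence shared by more than one sample, which is what distinguishes a combinatorial design from a true UDI design. It does not check sequence similarity between indexes.

```python
from collections import Counter


def find_index_collisions(sample_sheet):
    """Return (reused_i7, reused_i5): index sequences used by more than one sample.

    sample_sheet maps sample ID -> (i7_sequence, i5_sequence).
    For a unique-dual-index design both returned sets are empty.
    """
    i7_counts = Counter(i7 for i7, _ in sample_sheet.values())
    i5_counts = Counter(i5 for _, i5 in sample_sheet.values())
    reused_i7 = {seq for seq, n in i7_counts.items() if n > 1}
    reused_i5 = {seq for seq, n in i5_counts.items() if n > 1}
    return reused_i7, reused_i5


if __name__ == "__main__":
    # Illustrative 8-bp indexes; not taken from any vendor kit.
    combinatorial = {
        "S1": ("ATCACGTT", "AACCGGTT"),
        "S2": ("ATCACGTT", "CCTTAAGG"),  # i7 shared with S1
    }
    print(find_index_collisions(combinatorial))
```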

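Multiplexing density depends on the sequencing platform, genome size, and target depth. A NovaSeq 6000 S4 flow cell yields roughly 2.4-3 Tb, enough for about 24 human genomes at 30× (about 48 per dual-flow-cell run); a DNBSEQ-T7 can process 150+ samples across its four flow cells. For low-coverage designs (1-4×), 384-768 samples can be multiplexed on a single S4 flow cell, depending on depth, genome size, and the number of unique index pairs available.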
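Multiplexing density follows from simple arithmetic: usable flow-cell yield divided by genome size times target depth. The sketch below makes that explicit. The flow-cell yield is an input you should take from your provider's current specification; the 3,000 Gb used in the example and the usable_fraction discount are illustrative assumptions, and the function names are hypothetical.

```python
import math


def samples_per_flow_cell(flow_cell_yield_gb, genome_size_gb, coverage,
                          usable_fraction=0.9):
    """Whole samples that fit on one flow cell at the requested mean depth.

    usable_fraction discounts yield lost to duplicates, low-quality reads,
    and uneven pooling; 0.9 is an illustrative assumption.
    """
    gb_per_sample = genome_size_gb * coverage
    return math.floor(flow_cell_yield_gb * usable_fraction / gb_per_sample)


def flow_cells_needed(n_samples, flow_cell_yield_gb, genome_size_gb, coverage,
                      usable_fraction=0.9):
    """Flow cells required to sequence n_samples at the requested depth."""
    per_cell = samples_per_flow_cell(
        flow_cell_yield_gb, genome_size_gb, coverage, usable_fraction
    )
    if per_cell == 0:
        raise ValueError("one sample does not fit on a single flow cell")
    return math.ceil(n_samples / per_cell)


if __name__ == "__main__":
    # Assumed 3,000 Gb flow cell, 1 Gb genome, 10x target depth.
    print(samples_per_flow_cell(3000, 1.0, 10))
```

Because genome size enters the denominator, a density that works for a 1 Gb fish genome will overload the same flow cell for a 3 Gb mammalian genome at the same depth.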

Pre-pool quality control runs — sequencing a pooled aliquot of 48-96 samples at 1-2× coverage before committing to full-depth sequencing — cost approximately $500-1,000 and catch library balance issues, contamination, and sample swaps before they propagate to the full dataset. The Tohoku Medical Megabank and UK Biobank both use this strategy; the investment pays for itself by preventing a single requeued sequencing run.

Sequencing Depth — A Spectrum of Strategies

Strategy	Coverage	Variants Detected	Cost/Sample (approx.)	Best For
Ultra-low pass	0.5-1×	~1-5M SNPs (with imputation)	$20-40	Very large cohorts (N>5,000); ancestry, polygenic scores
Low-coverage + imputation	2-4×	~10-20M SNPs (with imputation)	$50-100	GWAS in large cohorts; genomic selection in breeding
Standard WGS	10-15×	~30-40M SNPs, reliable rare variant calls	$150-250	Population structure, selection scans, demographic inference
Deep WGS	30×	~40-50M SNPs, high-confidence rare variant calls	$250-400	Reference panels for imputation; high-confidence variant detection
Ultra-deep	60×+	Maximum sensitivity for somatic/single-cell	$500-800+	Specialized applications (single-cell WGS, somatic mosaicism)

The critical insight from the 2023-2025 literature is that low-coverage WGS with imputation (using GLIMPSE2 or QUILT) now achieves common-variant genotyping accuracy comparable to deep WGS or high-density SNP arrays at a fraction of the cost. For population genomics questions where common variants (MAF > 1%) drive the signal (population structure, demographic inference, most GWAS), statistical power per dollar favors larger sample sizes at lower coverage over smaller sample sizes at higher coverage.

Figure 1: Sequencing Depth vs. Sample Size Decision Matrix — A 2D plot with number of samples on X-axis (log scale, 10 to 10,000) and coverage depth on Y-axis (0.5× to 60×). Four application zones color-coded: Blue (Ultra-low pass 0.5-1×) — ancestry/PGS; Green (Low-coverage + imputation 2-4×) — GWAS/genomic selection; Orange (Standard WGS 10-15×) — selection scans/demography; Red (Deep WGS 30×+) — reference panel construction. Diagonal cost contours at $5K, $25K, $100K, $500K total project cost. Key insight: for a fixed budget, more samples at lower coverage yields greater common-variant GWAS power than fewer samples at higher coverage.

Cost Components and Optimization

Where the Money Goes

A population-scale WGS project has five cost components, and their relative contributions shift with scale:

DNA extraction and QC (~5-10% of total): Dominated by consumables and labor. At scale, bulk reagent purchasing and automated liquid handling reduce per-sample costs by 40-60% compared to manual processing.
Library preparation (~15-25% of total): The largest per-sample cost after sequencing. Commercial library prep kits cost $50-150 per sample at list price; negotiated volume discounts and in-house Tn5 transposase production can reduce this to $10-30 per sample. For projects exceeding 500 samples, the investment in in-house library preparation infrastructure typically recovers its cost within the first batch.
Sequencing (~40-60% of total): The dominant cost, driven by coverage × number of samples × genome size. Sequencing costs have declined roughly 2-3× per year since 2021, and this trend is expected to continue. Core facility pricing varies widely; direct negotiation with service providers and flexible scheduling (filling partially loaded flow cells) can reduce costs by 20-30%.
Data storage and transfer (~5-15% of total): A 30× human genome generates approximately 90 GB of FASTQ, 60 GB of BAM, and 1 GB of VCF — plus aligner indices, temporary files, and backups, totaling ~200-300 GB per sample. For 1,000 samples, that is 200-300 TB. Cloud storage costs ($0.02-0.05 per GB per month) become significant at this scale and should be budgeted for the project lifetime (typically 3-5 years). The PGEN compressed format (PLINK 2.0) achieves 98% compression of genotype data, reducing a 2 TB dataset to 39 GB — a practical necessity for large-scale projects.
Bioinformatic analysis (~10-20% of total): Compute costs scale with sample size but can be optimized through workflow parallelization. Cloud-based analysis on AWS or Google Cloud costs roughly $5-15 per 30× genome for alignment and variant calling; on-premise high-performance computing (HPC) amortizes to a lower per-sample cost but requires upfront infrastructure investment.
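The storage figures in the list above compound over the project lifetime, which is easy to underestimate. The following sketch multiplies out the per-sample footprint, the per-GB monthly rate, and the retention period quoted in the text. It is a rough budgeting aid that assumes all data stay on the same storage tier for the whole period; the function name is hypothetical.

```python
def cloud_storage_cost(n_samples, gb_per_sample, usd_per_gb_month, months):
    """Total storage bill in USD if every sample stays on one tier for `months`."""
    total_gb = n_samples * gb_per_sample
    return total_gb * usd_per_gb_month * months


if __name__ == "__main__":
    # Lower bound from the text: 200 GB/sample, $0.02 per GB-month, 3 years.
    low = cloud_storage_cost(1000, 200, 0.02, 36)
    # Upper bound from the text: 300 GB/sample, $0.05 per GB-month, 5 years.
    high = cloud_storage_cost(1000, 300, 0.05, 60)
    print(f"${low:,.0f} to ${high:,.0f}")
```

Even the lower bound, about $144,000 for 1,000 deeply sequenced human genomes kept on active storage for three years, shows why deleting intermediates and moving finished data to compressed formats and archival tiers matters.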

Figure 2: Population WGS Project Cost Breakdown — Side-by-side comparison of two representative designs. Top: 500 Samples × 10× (1 Gb Genome, total ~$150K). Bottom: 2,000 Samples × 2× (1 Gb, lcWGS + Imputation, total ~$230K). Each bar shows proportional breakdown: DNA Extraction & QC (7%), Library Preparation (20%/25%), Sequencing (50%/30%), Data Storage (10%/15%), Bioinformatics (13%/23%). Below charts, pill callouts for four cost-optimization levers: lcWGS + imputation (10-30× savings), in-house Tn5 library prep ($5 vs $50-100/sample), bulk pre-pool QC (prevents 10-20% overrun), compressed formats (CRAM 30-50%, PGEN 98% storage reduction).

Cost-Optimization Strategies That Work

Beyond the obvious strategy of sequencing fewer samples at lower coverage, several specific optimizations have been validated in large-scale projects:

Low-coverage WGS + imputation to a reference panel. This is the single most impactful cost-optimization strategy available in 2025-2026. Sequencing 1,000 individuals at 2× coverage costs roughly the same as 70 individuals at 30× coverage — and for GWAS power, the 1,000 low-coverage genomes almost always win. The imputation reference panel should be ancestry-matched or population-matched to the target cohort; for non-model organisms without existing reference panels, sequencing 10-20% of the cohort at 30× to build a custom reference panel is cost-effective at cohort sizes above ~500.
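The equivalence quoted above comes from comparing total sequenced depth, that is, samples multiplied by mean coverage. The sketch below reproduces that arithmetic and extends it to a staged design with a deeply sequenced reference panel. It counts sequencing output only: library preparation is a per-sample cost that does not shrink with coverage, so this is not a full budget. Function names and defaults are illustrative.

```python
def sequenced_depth_units(n_samples, coverage):
    """Total genome-equivalents of sequence: samples x mean depth."""
    return n_samples * coverage


def staged_design(n_total, panel_fraction, panel_coverage=30, cohort_coverage=2):
    """Split a cohort into a deep reference panel and a low-coverage remainder."""
    n_panel = round(n_total * panel_fraction)
    n_low = n_total - n_panel
    return {
        "panel_samples": n_panel,
        "low_coverage_samples": n_low,
        "depth_units": n_panel * panel_coverage + n_low * cohort_coverage,
    }


if __name__ == "__main__":
    print(sequenced_depth_units(1000, 2))   # 2000 genome-equivalents
    print(sequenced_depth_units(70, 30))    # 2100 genome-equivalents
    print(staged_design(1000, 0.10))
```

For a 1,000-sample cohort with a 10% panel at 30× and the rest at 2×, the staged design needs 4,800 genome-equivalents, compared with 30,000 if every sample were sequenced at 30×.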

Pooled sequencing for specific questions. When individual-level genotypes are not required — for allele frequency estimation, selective sweep scans, or evolve-and-resequence experiments — pooling DNA before library preparation can reduce costs by 5-20×. Pool-seq sacrifices individual genotype information but preserves allele frequency estimates with quantifiable precision that depends on pool size and sequencing depth.

In-house Tn5 transposase production. Commercial transposase-based library preparation kits (Nextera, TrueTag) cost $50-100 per reaction. A 2026 study in Aquaculture demonstrated that in-house purification and optimization of Tn5 transposase reduces library preparation costs to under $5 per sample while maintaining library complexity equivalent to commercial kits. For projects exceeding 200 samples, the 3-4 day investment in protein production is highly worthwhile.

Bulk QC runs and rebalancing. Sequencing a pooled aliquot at low coverage before committing to full-depth sequencing costs ~1-3% of the total project budget and can prevent a 10-20% cost overrun from requeued runs.

Computational optimizations. Using compressed file formats (PGEN for genotypes, CRAM for alignments instead of BAM), sparse representations for GWAS, and cloud spot instances for non-time-critical analyses can reduce compute costs by 40-60%.

CD Genomics offers flexible sequencing depth and multiplexing configurations across its Whole Genome Sequencing platform, enabling projects to balance coverage, sample count, and budget. For projects that combine population-scale re-sequencing with a smaller number of deeply sequenced reference genomes, CD Genomics' Plant and Animal Whole Genome De Novo Sequencing service provides the high-quality reference assemblies against which re-sequencing reads are aligned.

Figure 3: Large-Scale WGS Re-Sequencing Pipeline — A 5-stage horizontal workflow from sample intake to population analysis. Stages: (1) Sample Intake & QC — 96-well plates, fluorescence-based DNA quantification, TapeStation integrity check (Month 1) → (2) Library Preparation & Multiplexing — PCR-free with UDI barcodes, automated liquid handling, pre-pool QC at 1-2× (Month 2) → (3) Sequencing — NovaSeq S4 or DNBSEQ-T7 at 0.5×–30× (Month 2-4) → (4) Joint Genotyping — GVCF per sample, ReblockGVCF compression, GenomicsDBImport, GenotypeGVCFs, VQSR filtering (Month 5-6) → (5) Population Analysis — PCA/ADMIXTURE, GWAS (GEMMA/PLINK), selection scans (XP-CLR/iHS), demographic inference (PSMC/MSMC2) (Month 6-8).

Bioinformatics at Scale — From FASTQ to Population Genetics

The bioinformatic pipeline for a 1,000-sample WGS project is not simply the single-sample pipeline run 1,000 times. Joint analysis — where information is shared across samples — improves variant calling accuracy, enables the detection of rare variants that are invisible in individual samples, and is required for population genetic analyses. The computational architecture must be designed for parallelization from the start.

Read Alignment and Pre-Processing

Alignment of short reads to a reference genome is a per-sample parallel operation — each sample can be processed independently. The standard pipeline: quality control with FastQC and MultiQC → adapter trimming and quality filtering with fastp → alignment with BWA-MEM2 → duplicate marking with Picard or Sambamba → base quality score recalibration (BQSR) with GATK (DePristo et al., 2011).
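The per-sample steps above can be sketched as a function that assembles the shell commands for one sample. The sketch only builds command strings and executes nothing; output file names are hypothetical, and the read-group fields are a minimal assumption. FastQC/MultiQC reporting and BQSR are left out: BQSR needs a known-sites VCF, which the text does not specify and which often does not exist for non-model organisms. In practice a workflow manager would wrap each command in its own task.

```python
import shlex


def alignment_commands(sample, ref_fasta, fq1, fq2, threads=8):
    """Return trimming, alignment, and duplicate-marking commands for one sample."""
    q = shlex.quote
    trimmed1 = f"{sample}.trim_1.fq.gz"
    trimmed2 = f"{sample}.trim_2.fq.gz"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"
    # The literal backslash-t separators are expanded to tabs by the aligner.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # Adapter trimming and quality filtering.
        f"fastp -i {q(fq1)} -I {q(fq2)} -o {q(trimmed1)} -O {q(trimmed2)} "
        f"-w {threads}",
        # Alignment piped straight into coordinate sorting.
        f"bwa-mem2 mem -t {threads} -R {q(read_group)} {q(ref_fasta)} "
        f"{q(trimmed1)} {q(trimmed2)} "
        f"| samtools sort -@ {threads} -o {q(sorted_bam)} -",
        # Duplicate marking with the Picard tool bundled in GATK4.
        f"gatk MarkDuplicates -I {q(sorted_bam)} -O {q(dedup_bam)} -M {q(metrics)}",
    ]


if __name__ == "__main__":
    for cmd in alignment_commands("S1", "ref.fa", "S1_R1.fq.gz", "S1_R2.fq.gz"):
        print(cmd)
```

Because each sample produces an independent command list, the same function can be mapped over a sample manifest to generate thousands of jobs.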

For projects exceeding 100 samples, workflow managers (Nextflow, Snakemake, or Cromwell/WDL) are essential — they handle parallel job submission, resource allocation, and automatic re-submission of failed jobs. A well-configured Nextflow pipeline on a 500-core HPC cluster can process 1,000 30× human genomes from FASTQ to analysis-ready BAMs in 3-5 days.

Long-read re-sequencing — using PacBio HiFi or Oxford Nanopore for structural variant discovery or phasing — is increasingly incorporated into population studies. CD Genomics' Long-Read Sequencing Services provide complementary platforms for SV-aware re-sequencing on a subset of the cohort, with reads aligned by minimap2 and structural variants called by Sniffles2 or SVIM. For a comprehensive overview of long-read platforms across all applications — including structural variant discovery, methylation detection, and full-length isoform sequencing at population scale — see our Long-Read Sequencing Services for Every Application.

Variant Calling at Scale — Joint Genotyping

Per-sample variant calling with GATK HaplotypeCaller in GVCF mode, followed by joint genotyping across all samples, is the gold-standard approach for population-scale WGS. The GATK "Biggest Practices," developed for cohorts exceeding 2,000 samples and validated on gnomAD (150,000 exomes), UK Biobank, and All of Us, add three key optimizations:

ReblockGVCF compresses adjacent reference blocks in per-sample GVCFs and removes low-quality alternate alleles (GQ < 20), reducing file sizes by 70-90% and downstream merge times proportionally.
GnarlyGenotyper approximates QUAL scores from INFO field annotations without iterating over every genotype, eliminating the computational bottleneck that made joint calling of very large cohorts impractical.
VQSR scatter mode parallelizes variant quality score recalibration across genomic intervals, enabling filtering of tens of millions of variants across thousands of samples.

For non-model organisms without established truth sets, VQSR requires a minimum of 50 samples for effective Gaussian mixture model training; for smaller cohorts, hard-filtering based on GATK-recommended thresholds (QD < 2.0, FS > 60.0, MQ < 40.0, etc.) is a practical alternative.
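The hard-filtering rule can be illustrated with a few lines that evaluate the three thresholds named above against a variant's INFO annotations. This is a sketch of the logic only: it assumes the annotations have already been parsed into a dictionary, it covers only the QD, FS, and MQ cutoffs quoted in the text (the full recommended set includes more), and the function name is hypothetical. In a real pipeline the same expressions would be passed to a filtering tool rather than evaluated in Python.

```python
# Each entry returns True when the variant FAILS that filter.
HARD_FILTERS = {
    "QD": lambda value: value < 2.0,    # low quality-by-depth
    "FS": lambda value: value > 60.0,   # strong strand bias
    "MQ": lambda value: value < 40.0,   # poor mapping quality
}


def failed_filters(info):
    """Return the names of hard filters a variant fails.

    `info` maps annotation name -> numeric value. Annotations missing
    from `info` are skipped rather than treated as failures.
    """
    return [
        name
        for name, fails in HARD_FILTERS.items()
        if name in info and fails(info[name])
    ]


if __name__ == "__main__":
    print(failed_filters({"QD": 1.5, "FS": 72.3, "MQ": 58.0}))
```

A variant is kept when the returned list is empty; recording which filter failed, rather than a single pass/fail flag, makes it easier to tune thresholds for a new species.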

For projects analyzing structural variants at population scale, CD Genomics' Variant Calling service includes multi-caller consensus approaches (Manta + Delly + Lumpy) validated for sensitivity and precision across a range of genome sizes and repeat contents.

Imputation — Making Low-Coverage Data Analysis-Ready

GLIMPSE2 (Rubinacci et al., 2023) is the current state-of-the-art for imputing low-coverage WGS data to sequence resolution. It achieves sublinear scaling in both sample count and marker count, processing a 1× genome against a reference panel of 150,000 haplotypes in approximately 11 hours at a computational cost of under $0.10 per genome. The method uses a sparse representation of the reference panel, a positional Burrows-Wheeler transform for fast haplotype matching, and hardware-optimized HMM computations — enabling population-scale imputation that was computationally prohibitive with earlier methods.

For non-model organisms, where large reference panels do not exist, a two-stage design is recommended: sequence 50-100 individuals at ≥25× to build a custom reference panel, then sequence the remaining cohort at 1-4× and impute against the custom panel. A 2025 study in cultivated strawberry demonstrated that ~70 genetically representative individuals at ≥25× were sufficient to build an imputation reference panel achieving 94-98% concordance in an allo-octoploid genome, which suggests the strategy can carry over even to complex polyploid genomes.
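Concordance figures like the one above are obtained by comparing imputed genotypes with high-coverage calls for the same individuals. The sketch below shows one way to compute it, assuming genotypes are coded as alternate-allele dosages with None for missing calls; the function names are hypothetical. It also reports non-reference concordance, because overall concordance is inflated by the many sites where both call sets are homozygous reference.

```python
def _concordance(pairs):
    """Fraction of (imputed, truth) pairs that agree, skipping missing calls."""
    usable = [(i, t) for i, t in pairs if i is not None and t is not None]
    if not usable:
        raise ValueError("no sites with both genotypes present")
    return sum(i == t for i, t in usable) / len(usable)


def genotype_concordance(imputed, truth):
    """Concordance over all sites genotyped in both call sets."""
    return _concordance(zip(imputed, truth))


def non_reference_concordance(imputed, truth):
    """Concordance restricted to sites where the truth carries an alt allele."""
    return _concordance(
        (i, t) for i, t in zip(imputed, truth) if t not in (None, 0)
    )


if __name__ == "__main__":
    imputed = [0, 1, 2, None, 1]
    truth = [0, 1, 1, 2, 1]
    print(genotype_concordance(imputed, truth))
    print(non_reference_concordance(imputed, truth))
```

Holding out a few panel individuals, downsampling their reads to the cohort depth, and scoring the imputed result this way is a practical check on whether a custom panel is large enough.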

Population Genetic Analysis

With a joint-called, filtered VCF in hand, the population genetic analyses that transform variant calls into biological insight include:

Population structure: PCA (PLINK), ADMIXTURE, and phylogenetic reconstruction (IQ-TREE, RAxML-ng). Kinship estimation with KING or PLINK identifies cryptic relatedness that must be accounted for in downstream analyses.

Genetic diversity: Nucleotide diversity (π), observed and expected heterozygosity, and Tajima's D calculated in sliding windows with VCFtools or pixy.

Population differentiation: Weir and Cockerham's Fst, Hudson's Fst, and Patterson's D-statistic (ABBA-BABA) for detecting gene flow and introgression — implemented in Dsuite and ADMIXTOOLS 2.

Selective sweep detection: XP-CLR, iHS/nSL, and composite likelihood ratio approaches implemented in selscan, RAiSD, and SweeD.

Demographic history: PSMC for single diploid genomes, MSMC2 for multiple genomes, and Stairway Plot 2 for site frequency spectrum-based inference.

GWAS: GEMMA for mixed-model association, PLINK 2.0 for large-scale linear/logistic regression, and BOLT-LMM for biobank-scale datasets where kinship matrices for 500,000 individuals are computationally intractable.
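Two of the statistics listed above, nucleotide diversity and Hudson's Fst, are simple enough to write out directly, which helps when sanity-checking tool output. The sketch below assumes biallelic sites with no missing data, and that every site in the window is callable; tools such as pixy exist precisely because those assumptions rarely hold in real data. The Fst function uses the ratio-of-averages form of Hudson's estimator with a sample-size correction; function names are illustrative.

```python
def nucleotide_diversity(alt_counts, n_chromosomes, window_length):
    """Per-site pi for one window.

    alt_counts: alt-allele count at each variant site in the window.
    n_chromosomes: sampled chromosomes (2 x individuals for diploids).
    window_length: callable sites in the window, invariant sites included.
    """
    n = n_chromosomes
    total = 0.0
    for k in alt_counts:
        # Differing pairs k*(n-k) divided by all pairs n*(n-1)/2.
        total += 2.0 * k * (n - k) / (n * (n - 1))
    return total / window_length


def hudson_fst(sites):
    """Hudson's Fst as a ratio of summed numerators to summed denominators.

    sites: iterable of (p1, n1, p2, n2), the alt-allele frequency and
    number of sampled chromosomes in each of two populations.
    """
    numerator = denominator = 0.0
    for p1, n1, p2, n2 in sites:
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        raise ValueError("no polymorphic sites between the two populations")
    return numerator / denominator


if __name__ == "__main__":
    print(nucleotide_diversity([5, 1, 3], n_chromosomes=10, window_length=1000))
    print(hudson_fst([(0.9, 20, 0.1, 20), (0.5, 20, 0.4, 20)]))
```

Summing numerators and denominators across sites before dividing, rather than averaging per-site ratios, keeps rare variants from dominating the window estimate.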

CD Genomics' Population Evolution analysis service provides the full suite of population genetic analyses as part of large-scale re-sequencing projects, delivering publication-ready figures, tables, and methods sections for each analysis module.

Data Management and Sharing

A 1,000-sample, 10× WGS project generates roughly 100 TB of raw data, intermediate files, and analysis outputs. Data management is not an afterthought — it is a first-order project design consideration that affects budget, timeline, and compliance with journal and funder data-sharing mandates.

Storage Architecture

Active analysis data (FASTQ, BAM, VCF) should reside on high-performance parallel storage (Lustre, GPFS, or BeeGFS) during the analysis phase. After project completion, data transitions to lower-cost archival storage: CRAM format for alignments (30-50% smaller than BAM), PGEN format for genotype data (98% smaller than flat-text VCF), and compressed archives for raw FASTQ. Cloud object storage (AWS S3 Glacier, Google Cloud Archive) costs $0.001-0.004 per GB per month — approximately $100-400 per month for a 100 TB archive — but retrieval costs and latency must be factored into archiving decisions.
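The effect of format conversion on the archival bill can be estimated before any data are moved. The sketch below applies the reductions quoted above to alignment and genotype volumes and prices the result at an archival rate. The 40% CRAM saving is an assumed midpoint of the 30-50% range, the default rate is the upper end of the quoted archival range, 1 TB is treated as 1,000 GB, and retrieval fees are ignored. The function name is hypothetical.

```python
def archive_plan(bam_tb, vcf_tb, cram_saving=0.40, pgen_saving=0.98,
                 usd_per_gb_month=0.004):
    """Estimate archived volume and monthly cost after format conversion.

    cram_saving=0.40 is an assumed midpoint of the quoted 30-50% range.
    Retrieval and egress charges are not modeled.
    """
    cram_tb = bam_tb * (1.0 - cram_saving)
    pgen_tb = vcf_tb * (1.0 - pgen_saving)
    total_tb = cram_tb + pgen_tb
    return {
        "cram_tb": cram_tb,
        "pgen_tb": pgen_tb,
        "total_tb": total_tb,
        "monthly_usd": total_tb * 1000 * usd_per_gb_month,
    }


if __name__ == "__main__":
    # 1,000 samples x 60 GB of BAM each, plus 2 TB of flat-text VCF.
    plan = archive_plan(bam_tb=60.0, vcf_tb=2.0)
    for key, value in plan.items():
        print(f"{key}: {value:.2f}")
```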

Public Database Submission

Most journals and funders require deposition of sequencing data in public repositories. The standard submission targets are:

NCBI Sequence Read Archive (SRA): Accepts raw sequencing reads (FASTQ) and aligned reads (BAM). Submission requires a BioProject accession (project-level metadata) and BioSample accessions (sample-level metadata) for each individual. The SRA submission wizard and Aspera-based file transfer handle large datasets.

European Nucleotide Archive (ENA): Equivalent to SRA for European projects; accepts the same data types and provides mirroring between SRA and ENA.

European Variation Archive (EVA): Accepts variant calls (VCF) with associated metadata. For population-scale projects, EVA submission of the joint-called VCF is strongly recommended for reproducibility.

CD Genomics provides submission-ready data packages formatted for SRA/ENA/DDBJ, including validated metadata spreadsheets that satisfy INSDC (International Nucleotide Sequence Database Collaboration) requirements. Our Whole Genome SNP Genotyping and Genotyping by Sequencing (GBS) services offer complementary genotyping approaches when WGS exceeds the project's immediate budget or when focused genotyping of known variants suffices.

Reproducibility and Data Provenance

For population-scale projects, computational reproducibility requires more than sharing scripts. Containerized workflows (Docker or Singularity images with pinned software versions), workflow definition files (Nextflow .nf or WDL scripts deposited alongside the manuscript), and explicit random seeds for stochastic algorithms should be archived. The Workflow Description Language (WDL) scripts for the GATK Best Practices pipeline, for example, are publicly maintained on Dockstore and can be referenced by DOI — a standard that population genomics projects should adopt.
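One lightweight way to capture the items listed above is a provenance manifest written next to each analysis output. The sketch below is an illustration rather than an established standard: the field names, the example image references, and the idea of hashing the manifest to get a short run identifier are all assumptions. It records the pipeline version, pinned container images, the random seed, and input checksums in a deterministic JSON document.

```python
import hashlib
import json


def build_manifest(pipeline_version, container_images, random_seed,
                   input_checksums):
    """Return (manifest_json, manifest_id) for one analysis run.

    container_images: tool name -> pinned image reference (ideally a digest).
    input_checksums: input file name -> checksum string.
    The ID is the SHA-256 of the serialized manifest, so identical
    settings always yield the same ID regardless of dictionary order.
    """
    manifest = {
        "pipeline_version": pipeline_version,
        "containers": container_images,
        "random_seed": random_seed,
        "inputs": input_checksums,
    }
    text = json.dumps(manifest, sort_keys=True, indent=2)
    manifest_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, manifest_id


if __name__ == "__main__":
    text, manifest_id = build_manifest(
        pipeline_version="1.0.0",                       # hypothetical
        container_images={"gatk": "example/gatk@sha256:aaa"},  # hypothetical
        random_seed=42,
        input_checksums={"S1_R1.fq.gz": "abc123"},      # hypothetical
    )
    print(manifest_id[:12])
    print(text)
```

Quoting the manifest ID in the methods section ties each published figure to an exact combination of software, seed, and inputs.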

Practical Considerations for Project Planning

Timeline

A 500-sample, 10× WGS re-sequencing project for a 1 Gb genome follows roughly this timeline:

Month 1: Sample collection, DNA extraction, QC, normalization, plating (parallelized across 96-well plates)

Month 2: Library preparation and multiplexing; pre-pool QC sequencing

Month 2-4: Full-depth sequencing (about 5 Tb of raw data for this design; the number of NovaSeq S4 flow cells depends on per-flow-cell yield and multiplexing density)

Month 3-5: Alignment and per-sample variant calling (parallel; can begin as sequencing data arrive)

Month 5-6: Joint genotyping, variant filtering, imputation (if applicable)

Month 6-8: Population genetic analyses, figure generation, manuscript preparation

Month 8-9: Public database submission, data archival

Total project duration: 8-9 months from sample collection to public data submission. Expedited timelines (4-6 months) are achievable with prioritized sequencing, cloud-based compute, and analysis pipelines run in parallel.

Working with CD Genomics on Large-Scale Projects

For projects involving hundreds to thousands of samples, CD Genomics provides a dedicated project manager who coordinates sample logistics, sequencing scheduling, and data delivery. The standard workflow:

Consultation: Define project goals, sample numbers, coverage strategy, and analysis scope. If complex traits or population structure are the primary focus, CD Genomics' Genome-Wide Association Study (GWAS) service provides integrated phenotype-genotype analysis with mixed-model correction for population structure.
Sample intake: Samples are accessioned into LIMS with 2D-barcoded tubes, cross-referenced against the sample manifest, and subjected to incoming QC (concentration, purity, integrity).
Pilot batch: The first 48-96 samples are processed through the full pipeline — extraction, library prep, sequencing, and preliminary analysis — to validate DNA quality, library complexity, and coverage uniformity. Any protocol adjustments are made before scaling to the full cohort.
Production sequencing: Remaining samples are processed in batches of 96, with each batch tracked through LIMS and subjected to batch-level QC.
Joint analysis: All samples are joint-called, filtered, and analyzed for the agreed-upon population genetics modules. Results are delivered iteratively: preliminary PCA and ADMIXTURE plots, for example, can be reviewed and discussed before final analyses are run.
Final delivery: Raw data (FASTQ), aligned data (BAM/CRAM), variant calls (VCF), population genetics analysis outputs (publication-ready figures and tables), and a comprehensive methods document describing all bioinformatic steps.
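The manifest cross-check and incoming QC in the sample intake step can be pictured with a short sketch. This is an illustration only, not a description of the CD Genomics LIMS: the field names and the cutoff values passed in the example are hypothetical, and integrity checks are left out for brevity.

```python
from collections import Counter


def cross_reference(manifest_ids, scanned_ids):
    """Compare barcodes listed in the manifest with barcodes scanned at intake."""
    counts = Counter(scanned_ids)
    manifest, scanned = set(manifest_ids), set(scanned_ids)
    return {
        "missing": sorted(manifest - scanned),     # listed but not received
        "unexpected": sorted(scanned - manifest),  # received but not listed
        "duplicated": sorted(b for b, n in counts.items() if n > 1),
    }


def failed_qc(samples, min_conc_ng_ul, min_a260_280):
    """Return IDs of samples below the given concentration or purity cutoffs."""
    return sorted(
        s["id"]
        for s in samples
        if s["conc_ng_ul"] < min_conc_ng_ul or s["a260_280"] < min_a260_280
    )


report = cross_reference(["S1", "S2", "S3"], ["S1", "S3", "S3", "S9"])
print(report)

# Illustrative cutoffs only; real thresholds depend on the library protocol.
samples = [
    {"id": "S1", "conc_ng_ul": 55.0, "a260_280": 1.85},
    {"id": "S3", "conc_ng_ul": 8.0, "a260_280": 1.90},
]
print(failed_qc(samples, min_conc_ng_ul=20.0, min_a260_280=1.8))
```

Resolving missing, unexpected, and duplicated barcodes before the pilot batch avoids sample swaps that are much harder to detect after sequencing.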

For a broader view of how large-scale re-sequencing fits into the WGS landscape, from single bacterial genomes to de novo assembly of non-model eukaryotes, see our Whole Genome Sequencing Services Hub. For projects that require assembling a reference genome before re-sequencing, see our De Novo Plant and Animal Genome Sequencing guide. For guidance on choosing between low-pass and high-coverage strategies based on your specific research question, see our guide Low-Pass vs. High-Coverage WGS: Choosing the Right Sequencing Depth for Your Research Goals and Budget.

Frequently Asked Questions

What is large-scale whole genome re-sequencing and how is it different from de novo assembly?

Large-scale re-sequencing maps reads from hundreds to thousands of individuals against an existing reference genome to identify genetic variants (SNPs, indels, structural variants) across a population. It is fundamentally different from de novo assembly, which constructs a genome from scratch without a reference. Re-sequencing is faster and cheaper per sample but requires a high-quality reference genome.

How many samples do I need for a population genomics or GWAS study?

For population structure and demographic history, 10-30 individuals per population at ≥10× coverage is often sufficient. For GWAS, sample sizes of hundreds to thousands are required to detect loci explaining 0.1-1% of phenotypic variance — power calculations should guide this decision. For genomic selection, 500-2,000 individuals is standard for training population construction in plant and animal breeding.
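As a rough illustration of such a power calculation, the sketch below uses a standard approximation for a single-variant, 1-df test on a quantitative trait: the test statistic is treated as a non-central chi-square with non-centrality N·q²/(1−q²), where q² is the fraction of phenotypic variance explained by the locus. The function name is hypothetical, and the default threshold of 5e-8 is an assumption borrowed from human GWAS convention; other species and marker densities call for a different value.

```python
from statistics import NormalDist


def gwas_power(n_samples: int, var_explained: float, alpha: float = 5e-8) -> float:
    """Approximate power of a 1-df association test for a quantitative trait."""
    std_normal = NormalDist()
    # Non-centrality parameter of the chi-square statistic (additive model).
    ncp = n_samples * var_explained / (1.0 - var_explained)
    shift = ncp ** 0.5
    # Two-sided critical value on the z scale.
    z_crit = -std_normal.inv_cdf(alpha / 2.0)
    # A 1-df non-central chi-square is the square of a shifted normal,
    # so the power is the mass in both rejection tails.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power to detect a locus explaining 1% of variance at several cohort sizes.
for n in (500, 1000, 2000, 5000, 10000):
    print(n, round(gwas_power(n, 0.01), 3))
```

The approximation ignores population structure, imperfect linkage disequilibrium between the causal variant and genotyped markers, and imputation error, all of which reduce power in practice.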

What sequencing depth should I choose for a large-scale re-sequencing project?

Low-coverage WGS (1-4×) with imputation is the default for cohorts exceeding ~300 samples, capturing common variants at a fraction of deep WGS cost. Standard coverage (10-15×) provides reliable rare variant calls for demographic inference and selection scans. Deep coverage (30×) is recommended for reference panel construction and high-confidence variant detection.

How do I control costs for a project with hundreds to thousands of samples?

The three highest-impact strategies: (1) use low-coverage WGS + imputation for the full cohort with a custom reference panel from 10-20% of samples at 30×, (2) negotiate volume pricing and perform pre-pool QC runs to avoid costly requeueing, and (3) adopt compressed formats (CRAM, PGEN) to cut storage costs by 30-98%.

What bioinformatic infrastructure do I need for joint analysis of 1,000 genomes?

For alignment and per-sample variant calling, a 500-core HPC cluster or equivalent cloud compute can process 1,000 30× genomes in under a week. For joint genotyping, at least 1 TB of RAM and 50 TB of fast storage are recommended for cohorts exceeding 2,000 samples — at which point the GATK "Biggest Practices" (ReblockGVCF + GnarlyGenotyper) become essential. Workflow managers (Nextflow, Snakemake) and containerized tools (Docker, Singularity) are strongly recommended for reproducibility.
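The compute figure above implies a per-genome budget that can be checked with simple arithmetic. The sketch below is a planning aid under stated assumptions: the function names are hypothetical, and the `efficiency` parameter (fraction of reserved core time doing useful work) is an illustrative knob, not a measured value.

```python
def core_hours_per_sample(cores: int, wall_hours: float, n_samples: int) -> float:
    """Core-hours available per sample for a given cluster and time window."""
    return cores * wall_hours / n_samples


def wall_days(n_samples: int, core_hours_each: float, cores: int,
              efficiency: float = 1.0) -> float:
    """Wall-clock days needed, given a per-sample cost and cluster size."""
    return n_samples * core_hours_each / (cores * efficiency) / 24.0


# 500 cores for one week (168 h) across 1,000 genomes.
budget = core_hours_per_sample(500, 7 * 24, 1000)
print(budget)

# The same workload at 70% cluster efficiency (illustrative value).
print(round(wall_days(1000, budget, 500, efficiency=0.7), 1))
```

If benchmarking on the pilot batch shows a higher per-sample cost than this budget, either the core count or the time window has to grow proportionally.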

Can I combine samples sequenced at different depths in the same analysis?

Yes, joint genotyping with GATK handles heterogeneous coverage — this is routine in projects that combine a deeply sequenced reference panel with a low-coverage discovery cohort. However, variant calling sensitivity differs by depth, so batch effects should be explicitly modeled. Imputation with GLIMPSE2 can harmonize coverage differences by imputing low-coverage samples to the resolution of the reference panel.

What are the data storage requirements for a large re-sequencing project?

A single 30× genome generates ~200-300 GB of total data; a 1,000-sample project at 10× requires 100-150 TB of active storage and 50-80 TB for long-term archival. Cloud archival storage costs roughly $100-400 per month for a 100 TB archive. Using CRAM instead of BAM cuts alignment storage by 30-50%; PGEN format cuts genotype storage by 98%.
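These figures can be turned into a quick budget check. The sketch below only re-uses the ranges quoted in this answer: the per-terabyte price range is derived from the $100-400 per month quoted for 100 TB, the CRAM reduction from the 30-50% range, and the function names are hypothetical.

```python
def monthly_archive_cost(archive_tb: float, usd_per_tb_month: float) -> float:
    """Monthly cost of keeping a given volume in archival storage."""
    return archive_tb * usd_per_tb_month


def size_after_cram(bam_tb: float, reduction: float) -> float:
    """Alignment storage left after converting BAM to CRAM."""
    return bam_tb * (1.0 - reduction)


# $100-400 per month for 100 TB corresponds to $1-4 per TB per month.
low_price, high_price = 100 / 100, 400 / 100

# Long-term archive of 50-80 TB for a 1,000-sample project at 10×.
print(monthly_archive_cost(50, low_price), monthly_archive_cost(80, high_price))

# Effect of a 30-50% CRAM reduction on 100 TB of BAM files.
print(round(size_after_cram(100, 0.5), 1), round(size_after_cram(100, 0.3), 1))
```

Actual cloud pricing varies by provider, region, and retrieval pattern, so the per-terabyte rate should be replaced with a current quote before budgeting.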

How does CD Genomics handle the logistics of large-scale re-sequencing projects?

CD Genomics provides a dedicated project manager, LIMS-tracked sample handling in 96-well format, automated liquid handling for library preparation, pre-pool QC runs on every batch, joint variant calling with GATK, and comprehensive population genetics analysis. Raw data (FASTQ), aligned data (BAM/CRAM), variant calls (VCF), and publication-ready analysis outputs are delivered with a detailed methods document. Expedited timelines are available.

References:

DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 2011;43(5):491-498. doi:10.1038/ng.806
Rubinacci S, Hofmeister RJ, Sousa da Mota B, Delaneau O. Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nature Genetics. 2023;55(7):1088-1090. doi:10.1038/s41588-023-01438-3
Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi:10.1186/s13742-015-0047-8
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint. 2013. arXiv:1303.3997v2 [q-bio.GN]
Danecek P, Bonfield JK, Liddle J, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2):giab008. doi:10.1093/gigascience/giab008
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047-3048. doi:10.1093/bioinformatics/btw354
Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007;81(3):559-575. doi:10.1086/519795
Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics. 2012;44(7):821-824. doi:10.1038/ng.2310
Koorevaar T, van de Weg E, Visser RGF, et al. Genotype imputation from low-coverage WGS using haplotype reference panels in cultivated strawberry. BMC Genomics. 2025;26(1):968. doi:10.1186/s12864-025-12270-w

For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
