Population Genomics Study Design: Sample Size, Sequencing Depth, and Platform Selection Guide
Summary
This guide walks through the key decisions in designing a population genomics study — from defining the research question and estimating sample size to selecting sequencing depth and genotyping platforms. It is written for early-career researchers and project managers who need a practical, grounded overview rather than an exhaustive methods treatise. The focus is on GWAS, population structure analysis, and variant discovery workflows, with an emphasis on getting the design right before samples reach the sequencer.
Key Takeaways
- The research question determines every downstream design choice: GWAS demands large sample sizes, population structure analysis requires representative sampling, and rare variant discovery hinges on sequencing depth.
- For a GWAS with 80% power to detect moderate effect sizes (OR ~1.3), budget for at least 1,000 cases — and more if the trait is highly polygenic.
- Sequencing depth above 30× for WGS delivers diminishing returns for germline SNP calling; low-coverage WGS (roughly 1–5×) with imputation is a cost-effective alternative for common variant studies.
- Platform choice — array, WGS, WES, or low-coverage sequencing — should follow the variant type you need, not the other way around.
- Ancestry and population structure must be addressed at the design stage, not during analysis. Underrepresented populations need dedicated reference panels or direct sequencing.
Figure 1: Four-pillar framework for population genomics study design — research question, sampling strategy, sequencing platform, and analysis plan.
Start With the Question
Study design begins with a single sentence: what exactly are you trying to find? The answer reshapes every budget line, every sample, and every analysis pipeline downstream.
A GWAS asks which common variants associate with a trait. That question rewards large sample sizes, modest sequencing depth, and careful phenotyping. A population structure analysis asks how groups are related and what demographic events shaped them — it rewards broad geographic sampling and genome-wide markers but does not need the statistical power of a GWAS. A rare variant discovery study asks whether low-frequency coding or regulatory mutations contribute to a phenotype; it rewards deep sequencing in carefully ascertained samples, often from families or extreme phenotype tails.
Table 1: How the Research Question Shapes the Study
| Study Goal | Primary Design Driver | Key Constraint |
|---|---|---|
| GWAS of a complex trait | Sample size (N) | Phenotype consistency across cohorts |
| Population structure and demographic history | Geographic and ancestral representativeness | Marker density across the genome |
| Rare variant association | Variant detection sensitivity | Sequencing depth and sample ascertainment |
| Selection scans | Dense markers across selected and neutral regions | Appropriate null model for demographic history |
| PRS construction | GWAS summary statistics from a well-powered discovery cohort | Ancestry match between discovery and target samples |
Articulating the primary goal before ordering reagents prevents the most expensive mistake in genomics: generating data that cannot answer the question.
Researchers designing a GWAS should review the GWAS service page for array-based discovery options with imputation.
How Many Samples You Need
Sample size is the single largest lever on statistical power — and the single largest line item in most budgets. Getting this number right before data generation begins is the difference between a study that finds signals and one that produces noise.
GWAS Sample Size
For a GWAS, the sample size required depends on effect size, allele frequency, and the genome-wide significance threshold (p < 5×10⁻⁸). Under typical complex trait assumptions:
- Detecting a common variant (MAF > 5%) with moderate effect size (OR ~1.3) requires at least 1,000 cases at 80% power, and often several thousand once allele frequency and the case-to-control ratio are taken into account.
- Detecting weaker effects (OR < 1.1) requires tens or hundreds of thousands of samples.
- Biobank-scale cohorts (UK Biobank, FinnGen, All of Us) with n > 100,000 routinely detect thousands of loci, but most individual variants explain less than 0.1% of trait variance.
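A power analysis based on realistic assumptions about the trait's genetic architecture should precede any sample size commitment. Tools such as Quanto or the Genetic Power Calculator cover single-variant association power; the GCTA-GREML power calculator addresses power for SNP-heritability estimates.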
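As a rough illustration of what such a power analysis computes, the sketch below approximates power for a single-variant allelic test using a normal approximation. It assumes equal numbers of cases and controls, a directly genotyped causal variant, and a per-allele odds ratio. The function names are hypothetical, and dedicated tools should be used for real planning.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    return odds_ratio * control_freq / (1 - control_freq + odds_ratio * control_freq)


def allelic_test_power(n_cases: int, n_controls: int, control_freq: float,
                       odds_ratio: float, alpha: float = 5e-8) -> float:
    """Approximate power of a two-sided allelic test (normal approximation)."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    # Each diploid individual contributes two alleles to the comparison.
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p1 - p0) / se
    # The opposite rejection tail is negligible at this threshold and is ignored.
    return 1 - NormalDist().cdf(z_crit - z_effect)


def cases_needed(control_freq: float, odds_ratio: float, power: float = 0.8,
                 alpha: float = 5e-8, step: int = 100) -> int:
    """Smallest multiple of `step` cases (with equal controls) reaching `power`."""
    n = step
    while allelic_test_power(n, n, control_freq, odds_ratio, alpha) < power:
        n += step
    return n


if __name__ == "__main__":
    for maf in (0.05, 0.1, 0.3, 0.5):
        print(f"MAF {maf:.2f}: {cases_needed(maf, 1.3)} cases (1:1 controls)")
```

Calling `cases_needed(control_freq=0.3, odds_ratio=1.3)` returns the smallest multiple of 100 cases that reaches 80% power under these assumptions. The answer shifts considerably with allele frequency and the case-to-control ratio, so any single rule-of-thumb number is best treated as a starting point.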
Population Structure and Demographic Inference
Studies focused on population structure, admixture, or demographic history follow a different sample size logic. The key metric is not statistical power per variant, but the number of individuals needed to capture within-population diversity and between-population differentiation.
- For F_ST-based analyses: 20–30 unrelated individuals per population are typically enough to estimate allele frequency differences.
- For PCA and ADMIXTURE analyses: 50–100 individuals per population provide stable inference of ancestry components.
- For fine-scale structure (e.g., within-country gradients): hundreds to thousands of individuals from geographically referenced sampling locations are required.
Approximate sample size targets by study type:
- GWAS discovery: 1,000+ cases; 10,000+ for polygenic traits with weak effects.
- Rare variant discovery (WGS): Hundreds to low thousands of well-phenotyped samples, depending on variant aggregation strategy.
- Population structure (PCA/ADMIXTURE): 50–100 individuals per population of interest.
- Selection scans: Comparable to population structure requirements; power depends more on marker density than N.
These numbers assume unrelated individuals with low missingness and consistent phenotyping. Family-based designs, case-control imbalances, and phenotype measurement error all reduce effective sample size.
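The effect of case-control imbalance on effective sample size can be sketched with a convention commonly used in GWAS meta-analysis, N_eff = 4 / (1/N_cases + 1/N_controls). This is one approximation among several, and it does not capture relatedness or phenotype measurement error; the function name is hypothetical.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective N of a case-control study relative to a balanced design.

    Uses N_eff = 4 / (1/n_cases + 1/n_controls), which equals the total
    sample size when cases and controls are equal in number.
    """
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("n_cases and n_controls must be positive")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


if __name__ == "__main__":
    # Same total of 10,000 samples, increasingly unbalanced designs.
    for cases, controls in ((5000, 5000), (2000, 8000), (500, 9500)):
        n_eff = effective_sample_size(cases, controls)
        print(f"{cases} cases / {controls} controls -> N_eff ~ {n_eff:.0f}")
```

Holding the total fixed, the effective size falls as the design becomes more unbalanced, which is why adding controls alone cannot compensate for a small number of cases.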
For guidance on statistical power tradeoffs between methods, see GWAS vs Whole Genome Sequencing.
Figure 2: Statistical power as a function of sample size for GWAS, showing the steep sample size requirements for detecting weak-effect common variants.
What Sequencing Depth Delivers
Sequencing depth — the average number of reads covering each base — determines which variants you can call and with what confidence. Higher depth is not always better, and the point of diminishing returns arrives sooner than many researchers expect.
Depth Ranges and What They Enable
- Low coverage (1–5×): Sufficient for imputation-based GWAS when combined with a large reference panel. Cannot reliably call individual genotypes but enables accurate imputation of common variants (MAF > 1%) at a fraction of the cost of deep WGS.
- Moderate coverage (10–20×): Enables direct genotype calling for common variants and most heterozygous sites. Adequate for population structure analysis, selection scans, and most non-clinical applications.
- High coverage (30×): The standard for germline variant discovery. Achieves >99% sensitivity for heterozygous SNP calls and is sufficient for de novo mutation detection in family-based designs.
- Ultra-deep (>50×): Required for somatic variant calling in mixed samples, rare subclonal populations, or single-cell applications. Beyond 30–40×, additional depth adds marginal improvement for germline SNP calling in bulk tissue.
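To see why returns diminish as depth rises, consider an idealized model in which reads from each allele at a heterozygous site arrive independently as Poisson counts with mean depth/2, and a heterozygote is callable only when each allele is seen at least a minimum number of times. The three-read threshold and the function names below are illustrative assumptions; real data are overdispersed and affected by sequencing and mapping errors, so actual sensitivity will be lower.

```python
from math import exp


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def het_detection_prob(mean_depth: float, min_reads_per_allele: int = 3) -> float:
    """Chance that both alleles at a het site reach the minimum read count."""
    lam = mean_depth / 2.0  # expected reads per allele
    p_allele_seen = 1.0 - poisson_cdf(min_reads_per_allele - 1, lam)
    return p_allele_seen ** 2  # the two alleles are treated as independent


if __name__ == "__main__":
    for depth in (2, 5, 10, 20, 30, 50):
        print(f"{depth:>3}x  P(het callable) ~ {het_detection_prob(depth):.4f}")
```

Under this simplified model the probability climbs steeply between low and moderate depth and is already close to 1 by 30×, so further depth has little left to gain for germline heterozygous sites.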
Table 2: Sequencing Depth by Study Goal
| Study Goal | Recommended Depth | Acceptable Range | Notes |
|---|---|---|---|
| GWAS via imputation | Low-coverage (1–5×) or array | 0.5–10× | Requires large, well-matched reference panel |
| Common variant GWAS (direct calling) | 10–20× | 8–30× | Balance between cost and genotype accuracy |
| Rare variant discovery (SNVs) | 30× | 25–40× | Standard for germline SNV/indel calling |
| Structural variant detection | 30× (short-read) or 10–20× (long-read) | Platform-dependent | Long reads outperform short reads for SVs at lower depth |
| De novo mutation detection | 30× + parental sequencing | 30–50× | Requires trio design for confident calling |
| Population structure and admixture | 5–10× | 2–20× | Low depth is adequate; marker count matters more than depth |
Coverage Uniformity Matters
Average depth is only half the story. Coverage uniformity — how evenly reads are distributed across the genome — affects variant calling sensitivity in GC-rich regions, repetitive elements, and regions of extreme copy number. A sample with 30× mean coverage and 20% of bases below 10× will produce more false-negative calls than a sample with 25× mean coverage and uniform distribution. When evaluating sequencing providers, ask for coverage uniformity metrics alongside mean depth.
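A minimal sketch of how uniformity might be summarized alongside mean depth is shown below. It assumes per-base depths are already available as a list of integers (for example, exported by a depth-reporting tool); the function name, the 10× threshold, and the toy data are illustrative only.

```python
from statistics import mean, pstdev


def coverage_summary(depths, min_depth: int = 10) -> dict:
    """Summarize per-base depths: mean, breadth at min_depth, and dispersion."""
    depths = list(depths)
    if not depths:
        raise ValueError("depths must not be empty")
    avg = mean(depths)
    return {
        "mean_depth": avg,
        # Fraction of bases covered at or above the calling threshold.
        "frac_at_or_above_min": sum(d >= min_depth for d in depths) / len(depths),
        # Coefficient of variation: 0 means perfectly even coverage.
        "coeff_of_variation": pstdev(depths) / avg if avg > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Toy data: 30x mean with 20% of bases poorly covered vs. an even 25x.
    uneven = [6] * 20 + [36] * 80
    uniform = [25] * 100
    print("uneven :", coverage_summary(uneven))
    print("uniform:", coverage_summary(uniform))
```

In the toy comparison the uneven sample has the higher mean depth but leaves a fifth of its bases below the threshold, which is the situation the paragraph above warns about.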
For deep sequencing needs, whole genome sequencing services provide 30× coverage with quality metrics for variant discovery studies.
Choosing Your Platform
The platform choice — microarray, WGS, WES, or low-coverage sequencing — determines the variant classes you can detect and the analyses you can run. The decision should be driven by the study question, not by platform availability or familiarity.
Platform Options at a Glance
SNP microarrays genotype 500,000 to 2 million pre-selected SNPs. Imputation against a reference panel expands coverage to tens of millions of common variants. Arrays are cost-effective for large-cohort GWAS of common variants in well-represented populations. They cannot detect novel rare variants, structural variants, or ancestry-specific variants absent from the reference panel.
Whole genome sequencing (WGS) captures all variant classes — common SNPs, rare variants, structural variants, and non-coding regulatory changes. It is the most comprehensive option and the only platform suitable for rare variant discovery and de novo mutation detection. The higher per-sample cost limits cohort sizes compared to arrays.
Whole exome sequencing (WES) captures protein-coding regions (~1–2% of the genome) at a cost between arrays and WGS. It suits studies focused on coding variants but misses non-coding regulatory variants and most structural variants.
Low-coverage WGS with imputation is a middle ground: sequencing at 1–5× depth followed by imputation against a large reference panel. It captures common and low-frequency variants more comprehensively than arrays, at a cost below deep WGS. Imputation accuracy depends heavily on reference panel quality and population match.
Table 3: Platform Selection by Study Priority
| If your priority is... | Best Platform | Why |
|---|---|---|
| Maximizing sample size for common variant GWAS | SNP array + imputation | Lowest per-sample cost; scales to 100,000+ samples |
| Discovering novel rare variants | WGS (30×) | Only platform that captures all rare variants genome-wide |
| Coding variant burden analysis | WES | Cost-effective for coding regions; smaller data footprint |
| Studying non-European populations | WGS or low-coverage WGS + population-specific imputation panel | Avoids reference panel bias of standard arrays |
| Detecting structural variants | WGS (long-read if available) | Short-read WGS detects many deletions and other simple SVs; long reads recover insertions, repeat-rich regions, and complex rearrangements that short reads often miss |
| Balancing cost and completeness | Low-coverage WGS + imputation | More variants than arrays, lower cost than deep WGS |
A common design error is choosing WES because "it's cheaper than WGS," only to discover that the causal variant lies in a non-coding regulatory region that WES cannot see. Match the platform to the expected variant class.
Whole exome sequencing services offer a cost-effective option for studies targeting coding variants, though non-coding regions remain unsequenced.
Ancestry Shapes Everything
Population stratification can produce spurious associations, reduce power, and limit the translatability of findings — and it cannot be fully fixed in analysis if the design phase ignored it.
Reference Panel Representation
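Array-based GWAS and low-coverage WGS both depend on imputation reference panels. Widely used panels differ in composition: HRC is predominantly European, while 1000 Genomes and TOPMed are more diverse but still cover many populations thinly or not at all. For a study in populations that are sparsely represented in the chosen panel, which is often the case for African, Middle Eastern, Indigenous, and many South and East Asian groups: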
- Imputation accuracy drops for ancestry-specific alleles.
- Rare variants enriched in non-European populations are systematically missed.
- PRS derived from European GWAS transfer poorly to other ancestries, reducing clinical and predictive utility.
If your study population is not well-represented in existing reference panels, two design strategies mitigate the problem:
- Sequence a subset of your cohort with WGS to build a population-specific imputation reference panel. Impute the rest against this custom panel.
- Use low-coverage WGS for the entire cohort rather than arrays. Even at 5× depth, direct sequencing captures ancestry-specific variants that arrays and standard imputation panels miss.
Sampling Across Population Structure
For population structure and demographic inference, sampling strategy is everything. Convenience sampling from a single clinic or city will confound population structure with local recruitment bias. Key principles:
- Sample across the geographic or linguistic range of the populations of interest.
- Include population outgroups for PCA and phylogenetic analyses.
- Record fine-grained metadata — self-reported ancestry, language, geographic coordinates — not just broad continental labels.
- Account for relatedness: first- and second-degree relatives should be identified and handled analytically, not treated as independent observations.
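The relatedness step can be sketched as a simple classification of pairwise kinship coefficients. The cutoffs below follow the inference criteria documented for the KING software; the kinship values are assumed to have been estimated already by an external tool, and the sample IDs and function names are hypothetical.

```python
# Lower bounds on the kinship coefficient for each relationship class,
# following the inference criteria documented for KING.
KINSHIP_LOWER_BOUNDS = [
    (0.354, "duplicate_or_mz_twin"),
    (0.177, "first_degree"),
    (0.0884, "second_degree"),
    (0.0442, "third_degree"),
]


def classify_kinship(phi: float) -> str:
    """Map a pairwise kinship coefficient to a relationship class."""
    for lower, label in KINSHIP_LOWER_BOUNDS:
        if phi > lower:
            return label
    return "unrelated"


def related_pairs(pairs, flag=("duplicate_or_mz_twin", "first_degree", "second_degree")):
    """Return (id_a, id_b, label) for pairs whose class is in `flag`."""
    flagged = []
    for id_a, id_b, phi in pairs:
        label = classify_kinship(phi)
        if label in flag:
            flagged.append((id_a, id_b, label))
    return flagged


if __name__ == "__main__":
    # Hypothetical sample IDs and kinship estimates.
    estimates = [("S1", "S2", 0.251), ("S1", "S3", 0.012), ("S4", "S5", 0.118)]
    for pair in related_pairs(estimates):
        print(pair)
```

Flagged pairs can then be handled by whichever strategy the analysis plan specifies, such as dropping one member of each pair or using a model that accounts for relatedness.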
After study design, variant calling and annotation services support the transition from raw sequencing data to analysis-ready variant files.
Figure 3: Three common genotyping strategies — array with imputation, low-coverage WGS with imputation, and deep WGS — compared by cost, variant coverage, and computational requirements.
Budget Without the Blind Spots
Study budgets that account only for per-sample sequencing costs underestimate the true project cost by a wide margin. Storage, computation, personnel time, and quality control each consume a meaningful share of the total.
The Full Cost Picture
- Library preparation and sequencing reagents: The visible per-sample cost. Arrays: $50–100/sample. WGS at 30×: $500–1,000/sample depending on volume and platform. Low-coverage WGS: $100–300/sample.
- Data storage: A single 30× WGS produces ~90–100 GB of raw data. A cohort of 1,000 samples generates ~100 TB. Cloud storage, backup, and archival costs compound annually.
- Computational analysis: Read alignment, variant calling, and quality control for 1,000 WGS samples require substantial compute resources — either on-premises HPC or cloud instances. Budget for both CPU-hours and the personnel to run them.
- Metadata management and phenotyping: The most under-budgeted line item. Phenotype harmonization across cohorts, metadata cleanup, and data entry consume weeks to months of personnel time.
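The storage arithmetic above is easy to parameterize. The sketch below uses the midpoint of the 90–100 GB per-genome figure and decimal terabytes; the number of stored copies and the price per TB-month are placeholders to be replaced with your own figures, and the function names are hypothetical.

```python
def cohort_storage_tb(n_samples: int, gb_per_sample: float = 95.0, copies: int = 1) -> float:
    """Raw-data footprint in decimal terabytes (1 TB = 1,000 GB)."""
    return n_samples * gb_per_sample * copies / 1000.0


def annual_storage_cost(total_tb: float, usd_per_tb_month: float) -> float:
    """Yearly storage bill for a given footprint and monthly unit price."""
    return total_tb * usd_per_tb_month * 12


if __name__ == "__main__":
    raw_tb = cohort_storage_tb(1000)
    with_backup_tb = cohort_storage_tb(1000, copies=2)
    print(f"Raw data, single copy: {raw_tb:.0f} TB")
    print(f"Raw data plus one backup: {with_backup_tb:.0f} TB")
    # The unit price is a placeholder, not a quote from any provider.
    placeholder_price = 10.0
    print(f"Annual cost at placeholder price: ${annual_storage_cost(with_backup_tb, placeholder_price):,.0f}")
```

Aligned reads, variant files, and intermediate outputs add to the raw-data footprint, so the figure from this sketch is a lower bound on what the project will actually store.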
Budget-Driven Design Tradeoffs
- More samples at lower depth vs. fewer samples at higher depth: For common variant GWAS, prefer more samples at lower depth or with arrays, because the number of loci a study can detect grows with N.
- Arrays + WGS follow-up vs. all-WGS: For large cohorts on a fixed budget, array-based GWAS followed by WGS fine-mapping of significant loci extracts more discoveries per dollar than sequencing every sample deeply.
- Low-coverage WGS vs. arrays: Low-coverage WGS costs more than arrays but detects more variants and reduces reference panel dependency. For diverse or admixed cohorts, the additional cost is often justified.
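These tradeoffs can be made concrete by asking how many samples a fixed budget buys on each platform. The sketch below takes the midpoints of the per-sample cost ranges quoted earlier and reserves a share of the budget for storage, compute, and personnel; the 30% overhead share and the $500,000 budget are purely illustrative assumptions.

```python
# Midpoints of the per-sample cost ranges quoted in this guide (USD).
PER_SAMPLE_COST_USD = {
    "array": 75,
    "low_coverage_wgs": 200,
    "wgs_30x": 750,
}


def samples_affordable(budget_usd: int, platform: str, overhead_percent: int = 30) -> int:
    """Samples purchasable after reserving a share of the budget for overhead."""
    if not 0 <= overhead_percent < 100:
        raise ValueError("overhead_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    usable = budget_usd * (100 - overhead_percent) // 100
    return usable // PER_SAMPLE_COST_USD[platform]


if __name__ == "__main__":
    budget = 500_000  # illustrative budget
    for platform in PER_SAMPLE_COST_USD:
        print(f"{platform:>17}: {samples_affordable(budget, platform)} samples")
```

Feeding each resulting sample count into a power calculation shows whether the extra variants captured by a more expensive platform outweigh the loss in N for the question at hand.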
The Pre-Launch Checklist
Before any sample enters a library preparation tube, the following questions should have clear, documented answers. A design flaw caught at this stage costs a conversation; a design flaw caught after sequencing costs the entire study.
Table 4: Pre-Launch Design Checklist
| Design Element | Key Question | Status |
|---|---|---|
| Research question | Is the primary goal a GWAS, population structure analysis, rare variant discovery, or selection scan? | ☐ |
| Sample size | Has a power analysis been performed with realistic effect size and allele frequency assumptions? | ☐ |
| Population sampling | Do the samples represent the geographic, ancestral, and demographic range of the populations under study? | ☐ |
| Sequencing platform | Does the platform match the variant class(es) of interest (common SNPs, rare coding, structural, non-coding)? | ☐ |
| Sequencing depth | Is the target depth appropriate for the study goal and variant type? | ☐ |
| Reference panel | If using imputation, does the reference panel adequately represent the study population? | ☐ |
| Phenotype quality | Are phenotypes consistently measured, harmonized across sites, and recorded with sufficient metadata? | ☐ |
| Relatedness plan | Will related individuals be identified, and how will they be handled in analysis? | ☐ |
| Storage and compute | Is the budget sufficient for data storage, computational analysis, and personnel time beyond sequencing reagents? | ☐ |
| Replication strategy | Is there a plan for replicating key findings in an independent cohort? | ☐ |
If any row in this checklist lacks a clear answer, the study design is not yet complete. Returning to the bench or the budget spreadsheet at this stage is far less expensive than discovering a design flaw after data generation.
FAQ
What is the most common mistake in population genomics study design?
The most common mistake is choosing a sequencing platform or depth before defining the research question. Researchers often default to "WGS at 30× for everyone" because it sounds comprehensive, without considering whether arrays or low-coverage sequencing would answer the same question at a fraction of the cost, or whether 30× is even necessary for the variant classes of interest.
How many samples are needed for population structure analysis?
For standard PCA and ADMIXTURE-based population structure inference, 50 to 100 unrelated individuals per population of interest typically provide stable estimates of ancestry components and allele frequency differentiation. For fine-scale geographic structure, such as detecting clines within a country, hundreds to thousands of individuals with precise sampling locations are required.
When are SNP arrays a better choice than low-coverage WGS?
SNP arrays are more cost-effective when your study population is well represented in existing imputation reference panels. Today that most reliably means European-ancestry populations and, where large genome databases exist, some East Asian and South Asian populations. Low-coverage WGS is the better choice for populations poorly represented in reference panels, for studies targeting low-frequency variants, or when you need ancestry-specific alleles that standard arrays miss.
What sequencing depth is needed for variant discovery?
For germline SNP and small indel discovery, 30× coverage is the widely accepted standard and achieves more than 99% sensitivity for heterozygous calls. Coverage above 30–40× provides diminishing returns for germline SNV calling. For structural variant detection, depth requirements depend on the platform: 30× short-read WGS provides reasonable sensitivity for deletions and other simple SV classes, while long-read sequencing can detect insertions and complex rearrangements at 10–20× coverage.
How should ancestry and population stratification be handled?
Population stratification must be addressed at the design stage, not deferred to analysis. If using arrays or low-coverage WGS with imputation, verify that the reference panel adequately represents all populations in your cohort. Record fine-grained ancestry metadata (self-reported ancestry, language group, and geographic origin) rather than relying on broad continental labels. For admixed populations, consider direct WGS to avoid reference panel bias.
Which budget items are most often overlooked?
Three budget items are consistently underestimated or omitted. First, data storage costs: a cohort of 1,000 WGS samples at 30× generates approximately 100 TB of raw data, with cloud or on-premises storage costs recurring annually. Second, computational analysis costs: read alignment, variant calling, and quality control require substantial CPU-hours. Third, phenotype harmonization and metadata management: cleaning, standardizing, and curating phenotype data across cohorts often takes weeks to months of personnel time and is rarely allocated in grant budgets.
When is whole exome sequencing a reasonable choice?
Whole exome sequencing is a reasonable choice when coding variants are the primary target and the study has a limited budget. However, WES captures only 1 to 2% of the genome and misses the non-coding regulatory regions where the majority of GWAS-significant loci reside. If the study may later pivot to regulatory variant analysis, structural variant detection, or non-coding fine-mapping, the cost savings of WES relative to WGS may prove a false economy.
When is long-read sequencing worth considering?
Long-read sequencing is worth considering when structural variants are a primary study target, when studying regions of the genome inaccessible to short reads (such as segmental duplications, centromeres, and telomeres), or when phasing haplotypes across long genomic distances is analytically important. For standard germline SNV calling and GWAS, short-read WGS at 30× remains more cost-effective and computationally mature.
Research Use Only Statement
The information provided in this article is for research use only and is not intended for use in diagnostic or therapeutic procedures. CD Genomics provides sequencing and bioinformatics services for research purposes. Researchers should consult the appropriate regulatory guidelines for their specific applications.