GWAS vs Whole Genome Sequencing: When to Use Each for Disease Studies
The information provided in this article is for research use only and is not intended for use in diagnostic or therapeutic procedures. CD Genomics provides sequencing and bioinformatics services for research purposes. Researchers should consult the appropriate regulatory guidelines for their specific applications.
This article compares genome-wide association studies (GWAS) and whole genome sequencing (WGS) as tools for disease research. It covers what each method detects, how they differ in statistical power and cost, when population stratification tilts the balance, and how to combine both approaches in a single study design. Researchers designing disease association studies will find a practical decision framework grounded in current genomic literature.
Key Takeaways:
- GWAS with SNP arrays and imputation delivers the highest statistical power per dollar for common variant discovery in large cohorts.
- WGS is the only approach that captures novel rare variants, structural variants, and non-coding regulatory regions at scale.
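- A case-control GWAS typically needs several thousand cases, plus a comparable number of controls, for 80% power at moderate effect sizes (OR ~1.3) under the p < 5×10⁻⁸ genome-wide significance threshold; the exact number depends strongly on allele frequency.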
- WGS power depends on variant aggregation strategies (burden tests, SKAT) rather than sample size alone.
- Two-stage designs, with GWAS discovery followed by WGS fine-mapping, are a widely used way to maximize discovery per dollar.
Figure 1: Decision flowchart for selecting between GWAS and WGS based on study goals, variant type, and budget.
Two Paths to Variant Discovery
GWAS genotypes a pre-selected set of known variants. WGS reads nearly the entire genome. That single distinction shapes every downstream decision in a disease study.
GWAS relies on SNP genotyping arrays that interrogate 500,000 to 2 million pre-selected single-nucleotide polymorphisms (SNPs) across the genome. Statistical imputation then expands this set to tens of millions of variants by referencing large haplotype panels such as the Haplotype Reference Consortium (HRC) or TOPMed. The result is a cost-effective scan of common variants — those with minor allele frequency (MAF) above roughly 1% — across thousands or hundreds of thousands of samples.
WGS takes the opposite approach. It sequences the entire ~3 billion base-pair genome, capturing common SNPs, rare variants (MAF < 1%), structural variants (CNVs, inversions, translocations), and non-coding regulatory sequences that arrays were never designed to detect. No imputation step is required because every base is read directly.
The tradeoff is immediate: GWAS trades completeness for scale, while WGS trades scale for completeness. Which tradeoff makes sense depends on the disease architecture, the available cohort, and the research question.
Researchers planning large-scale association studies can explore GWAS service options for array-based discovery with imputation.
What Arrays Miss
Many widely used SNP arrays were designed primarily around common variation in European-ancestry populations. That design choice creates predictable blind spots, and those blind spots are where WGS proves its value.
Common Variants, Rare Variants, and Structural Changes
GWAS arrays capture common SNPs efficiently, but they are largely blind to variants with MAF below 0.1%. Rare variants — which are more likely to have large effect sizes and direct functional consequences — simply are not on the array. WGS detects them directly, without relying on linkage disequilibrium with nearby common markers.
Table 1: What Each Method Detects
| Variant Type | GWAS (Array + Imputation) | WGS |
|---|---|---|
| Common SNPs (MAF > 1%) | Captured well; imputation quality depends on reference panel | Captured directly |
| Low-frequency variants (MAF 0.1–1%) | Partial capture; imputation accuracy drops | Captured directly |
| Rare variants (MAF < 0.1%) | Not captured | Captured directly |
| Structural variants (CNVs, inversions) | Limited to array-targeted CNV probes | Detected genome-wide |
| Non-coding regulatory variants | Imputation captures variants in reference panels only | All non-coding regions sequenced |
| De novo mutations | Not detectable | Detectable in family-based designs |
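The frequency bins in Table 1 can be made concrete with a short sketch. It assumes a biallelic site with diploid genotype counts; the function names `minor_allele_frequency` and `frequency_class` are illustrative, and the cut-offs (1% and 0.1%) are simply the ones used in the table, not a universal standard.

```python
def minor_allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Minor allele frequency at a biallelic site from diploid genotype counts."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if n_alleles == 0:
        raise ValueError("no genotyped samples")
    alt_freq = (n_het + 2 * n_hom_alt) / n_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def frequency_class(maf: float) -> str:
    """Assign the Table 1 frequency bin (cut-offs taken from the table)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf > 0.01:
        return "common"
    if maf >= 0.001:
        return "low-frequency"
    return "rare"


if __name__ == "__main__":
    # Hypothetical genotype counts (hom-ref, het, hom-alt) for three sites.
    for counts in [(810, 180, 10), (995, 5, 0), (9990, 10, 0)]:
        maf = minor_allele_frequency(*counts)
        print(counts, round(maf, 5), frequency_class(maf))
```

In a real pipeline the counts would come from a VCF or array export; the point here is only that the bin a variant falls into determines which column of Table 1 applies to it.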
Structural variants deserve particular attention. Copy number variations (CNVs), inversions, and translocations play established roles in neurodevelopmental disorders, cancer susceptibility, and rare disease. Standard GWAS arrays include only a limited set of CNV probes — usually a few thousand markers — while WGS can call structural variants genome-wide with specialized tools.
Non-Coding Regions
The vast majority of GWAS-significant loci map to non-coding regions. But SNP arrays, by design, select tag SNPs that correlate with common haplotypes — they do not comprehensively cover regulatory elements. WGS captures every base in enhancers, promoters, insulators, and non-coding RNAs, enabling direct association testing of regulatory variants rather than relying on LD proxies.
- GWAS coverage of regulatory regions: Typically less than 5% of known regulatory elements are directly genotyped. The rest are imputed, and imputation quality falls sharply for variants absent from reference panels.
- WGS coverage of regulatory regions: Near-complete. Essentially every regulatory base that can be aligned is sequenced, making WGS the preferred platform for studies focused on gene regulation, epigenetics, or non-coding disease mechanisms.
For studies targeting rare coding mutations or non-coding regulatory variants, whole genome sequencing services provide complete variant discovery without reference panel dependency.
Power and the P-Value Threshold
Statistical power — the probability of detecting a true association — depends heavily on study design choices. GWAS and WGS approach power from different directions.
GWAS Power Depends on Sample Size
For a GWAS, power is a function of sample size, effect size, and allele frequency. The genome-wide significance threshold of p < 5×10⁻⁸ corrects for roughly one million independent tests across the genome. Under this threshold:
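- Detecting a common variant with moderate effect size (OR ~1.3) at 80% power requires on the order of a few thousand cases and a comparable number of controls when the MAF is in the 20–50% range, and upward of 10,000 cases as the MAF approaches 5%.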
- Detecting weaker effects (OR < 1.1) — which account for the majority of complex trait heritability — requires tens or hundreds of thousands of samples.
- Large biobank-scale GWAS (UK Biobank with n > 400,000, FinnGen with n > 300,000) routinely detect thousands of loci per trait by maximizing sample size.
Power calculators such as Quanto, CaTS, and the Genetic Association Study (GAS) Power Calculator allow researchers to estimate required sample sizes before committing to array processing.
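The relationship between sample size, effect size, and allele frequency described above can be sketched with a normal approximation. This is a simplified illustration, not a replacement for a dedicated power calculator: it assumes an additive model, a two-sided 1-df test, Hardy-Weinberg genotype frequencies, and an effect small enough that the variance of the log odds ratio is well approximated by its value near the null. The function names `gwas_power` and `cases_needed` are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gwas_power(n_cases, n_controls, maf, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided additive test in a case-control GWAS.

    Near the null, the log-odds-ratio estimate has variance
    1 / (N * phi * (1 - phi) * 2 * maf * (1 - maf)),
    where N is the total sample size and phi the fraction of cases.
    """
    n_total = n_cases + n_controls
    phi = n_cases / n_total
    information = n_total * phi * (1 - phi) * 2 * maf * (1 - maf)
    expected_z = abs(math.log(odds_ratio)) * math.sqrt(information)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection tail.
    return (_STD_NORMAL.cdf(expected_z - z_crit)
            + _STD_NORMAL.cdf(-expected_z - z_crit))


def cases_needed(maf, odds_ratio, target_power=0.8, alpha=5e-8,
                 controls_per_case=1.0, step=100, max_cases=1_000_000):
    """Smallest case count (in multiples of `step`) reaching the target power."""
    n_cases = step
    while n_cases <= max_cases:
        n_controls = round(n_cases * controls_per_case)
        if gwas_power(n_cases, n_controls, maf, odds_ratio, alpha) >= target_power:
            return n_cases
        n_cases += step
    raise ValueError("target power not reached within max_cases")


if __name__ == "__main__":
    for maf in (0.05, 0.2, 0.5):
        print(f"MAF {maf}: about {cases_needed(maf, odds_ratio=1.3)} cases")
```

Running the loop for a few allele frequencies shows how quickly the required case count grows as the variant becomes less common, which is why a single headline sample size should always be read together with its assumed MAF and case:control ratio.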
WGS Power Depends on Variant Aggregation
WGS power does not scale with sample size in the same way. Individual rare variants appear in too few individuals for single-variant tests to reach significance. Instead, WGS relies on aggregate rare variant tests — statistical methods that pool variants within a gene, regulatory region, or pathway and test the combined signal.
- Burden tests collapse multiple rare variants into a single genetic score and test for association with the trait.
- SKAT (Sequence Kernel Association Test) allows variants within a region to have different directions and magnitudes of effect, improving power when both protective and risk variants exist in the same gene.
- SKAT-O combines burden and SKAT approaches, adapting to the underlying genetic architecture.
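As a rough illustration of the burden idea, the sketch below collapses all rare variants in one gene into a per-sample carrier flag and compares carrier counts between cases and controls with a one-sided Fisher's exact test. This is a deliberately simplified carrier-collapsing burden test chosen because it fits in a few lines; it is not SKAT or SKAT-O, which rely on variance-component score tests and are best run through established implementations. The function names and the toy genotypes are hypothetical.

```python
from math import comb


def is_carrier(sample_genotypes):
    """True if the sample carries at least one rare allele in the gene."""
    return any(count > 0 for count in sample_genotypes)


def burden_fisher_p(case_genotypes, control_genotypes):
    """One-sided Fisher's exact p-value for carrier enrichment in cases.

    Each argument is a list of samples; each sample is a list of rare-allele
    counts (0, 1, or 2) across the variants being aggregated.
    """
    n_cases, n_controls = len(case_genotypes), len(control_genotypes)
    case_carriers = sum(is_carrier(s) for s in case_genotypes)
    control_carriers = sum(is_carrier(s) for s in control_genotypes)
    total_carriers = case_carriers + control_carriers
    n_total = n_cases + n_controls
    max_k = min(total_carriers, n_cases)
    # Hypergeometric tail: probability of seeing at least this many
    # carriers among cases if carrier status were unrelated to disease.
    tail = sum(
        comb(total_carriers, k) * comb(n_total - total_carriers, n_cases - k)
        for k in range(case_carriers, max_k + 1)
    )
    return tail / comb(n_total, n_cases)


if __name__ == "__main__":
    # Toy gene with three rare variants; four of six cases carry one of them.
    cases = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 2], [0, 0, 0], [0, 0, 0]]
    controls = [[0, 0, 0] for _ in range(6)]
    print(burden_fisher_p(cases, controls))
```

Note that no single variant in the toy data is seen in more than one case, so a per-variant test has almost nothing to work with; the signal only appears after aggregation. A collapsing test like this also loses power when risk and protective variants sit in the same gene, which is the situation SKAT is designed for.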
Table 2: Power Determinants by Method
| Factor | GWAS | WGS |
|---|---|---|
| Primary power driver | Sample size (N) | Variant aggregation strategy + N |
| Minimum cases for 80% power | Several thousand at OR ~1.3 and common MAF; more as MAF approaches 5% | Depends on gene-level rare variant burden; typically hundreds to low thousands for Mendelian genes |
| Significance threshold | p < 5×10⁻⁸ | Gene/region-level thresholds; Bonferroni correction per number of tested units |
| Power for common variants | High at scale | High but cost-inefficient vs GWAS |
| Power for rare variants | Near zero | High with appropriate aggregation |
The practical implication: if your disease of interest is driven primarily by common variants, spending your budget on larger GWAS sample sizes will yield more discoveries than sequencing a smaller cohort with WGS. If rare, penetrant variants drive your disease — particularly in Mendelian or early-onset cases — WGS is the only path to discovery.
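The significance thresholds in Table 2 are Bonferroni corrections, which can be reproduced with one line of arithmetic. The sketch below assumes a family-wise error rate of 0.05; the figure of 20,000 tested genes is an illustrative round number for a gene-based scan, not a value taken from this article.

```python
def bonferroni_threshold(n_tests: int, family_alpha: float = 0.05) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at family_alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return family_alpha / n_tests


if __name__ == "__main__":
    # Roughly one million independent common-variant tests (single-variant GWAS).
    print(bonferroni_threshold(1_000_000))
    # Hypothetical gene-based scan with 20,000 aggregation units.
    print(bonferroni_threshold(20_000))
```

Because aggregate tests evaluate far fewer units than a single-variant scan, their per-test threshold is less stringent, which is part of how they recover power for rare variants.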
Figure 2: Conceptual power curves illustrating the sample size tradeoff between GWAS (high power for common variants at moderate N) and WGS (rare variant discovery enabled through aggregate tests).
Where Cost Meets Coverage
Cost differences between GWAS and WGS are driven by library preparation, sequencing chemistry, data storage, and computational analysis — not just the per-sample reagent price.
A GWAS array costs roughly $50–100 per sample at scale, plus imputation compute. WGS at 30× coverage typically costs 5–10 times more per sample, and the data footprint is substantially larger: a single 30× WGS produces approximately 90–100 GB of raw data per sample versus a few megabytes for array genotypes. Storage, processing, and analysis costs compound with cohort size.
- Array-based GWAS per sample: Lowest cost; fixed overhead per batch; well-suited for cohorts of 1,000–500,000 samples.
- WGS at 30× per sample: Higher per-sample cost but decreasing annually; data storage and compute dominate long-term expenses.
- Low-coverage WGS (5–10×) + imputation: A middle ground that captures more variation than arrays at a cost below deep WGS. Imputation from low-coverage WGS can match or surpass array-based imputation, particularly for low-frequency and ancestry-specific variants.
Table 3: Approximate Cost and Resource Comparison
| Resource Dimension | GWAS (Array + Imputation) | WGS (30×) | Low-Coverage WGS (5–10×) |
|---|---|---|---|
| Per-sample reagent cost | $ (lowest) | $$$$ (highest) | $$ (intermediate) |
| Data per sample | ~5–10 MB | ~90–100 GB | ~15–30 GB |
| Storage and compute needs | Minimal | Substantial | Moderate |
| Sample size for fixed budget | Largest | Smallest | Intermediate |
| Rare variant detection | None | Comprehensive | Partial (MAF > 0.5%) |
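The fixed-budget tradeoff in Table 3 comes down to simple arithmetic, sketched below. All numbers are placeholders chosen from within the ranges quoted in this section (an array price of $75, a WGS price 7.5 times higher, about 10 MB versus 95 GB of data per sample); they are assumptions for illustration, not quotes. The `Platform` class and `plan_cohort` function are hypothetical names, and the sketch ignores storage and compute spending, which the text notes can dominate long-term cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    cost_per_sample_usd: float   # assay cost only (assumed value)
    data_per_sample_gb: float    # raw data footprint (assumed value)


def plan_cohort(platform: Platform, budget_usd: float):
    """Return (samples affordable, raw storage in TB) for an assay-only budget."""
    n_samples = int(budget_usd // platform.cost_per_sample_usd)
    storage_tb = n_samples * platform.data_per_sample_gb / 1000
    return n_samples, storage_tb


# Illustrative placeholders within the ranges given in the text.
ARRAY = Platform("array + imputation", 75.0, 0.01)
WGS_30X = Platform("WGS 30x", 75.0 * 7.5, 95.0)

if __name__ == "__main__":
    for platform in (ARRAY, WGS_30X):
        n, tb = plan_cohort(platform, budget_usd=1_000_000)
        print(f"{platform.name}: {n} samples, about {tb:.1f} TB raw data")
```

Plugging the resulting sample sizes into a power calculation is the natural next step: the platform that affords more samples is not automatically better if the variants of interest are invisible to it.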
For budget-constrained discovery cohorts, whole exome sequencing services offer a compromise — capturing coding variants at a cost below WGS — though non-coding regions remain unsequenced.
Ancestry as a Hidden Variable
Population stratification — systematic allele frequency differences between subpopulations — can produce spurious associations if not properly modeled. Both GWAS and WGS face this challenge, but WGS offers tools that GWAS cannot match.
Why Reference Panels Matter
GWAS imputation accuracy depends on how well the reference panel represents the study population. Widely used panels differ in diversity: HRC is predominantly European, while 1000 Genomes and TOPMed span multiple ancestries but still under-represent many populations. For a GWAS conducted in an East Asian, South Asian, African, or admixed population:
- Imputation quality drops for ancestry-specific variants.
- Rare variants present only in non-European populations are systematically missed.
- Polygenic risk scores built from European GWAS transfer poorly to other ancestries.
WGS avoids the reference panel bottleneck entirely. By sequencing every sample directly, WGS detects all variants regardless of how well they are represented in existing databases.
Stratification Control
Both GWAS and WGS use principal component analysis (PCA) and mixed linear models to control for population structure. However, WGS provides finer-grained ancestry inference because rare variants carry more information about recent demographic history than common variants do. This additional resolution can improve stratification control in admixed populations and in cohorts with fine-scale recent structure.
- GWAS + PCA: Effective for broad-scale population structure when reference panels are well-matched.
- WGS + PCA: Higher-resolution ancestry inference from rare and private variants; better suited to admixed cohorts.
- Multi-ancestry resources: TOPMed and the All of Us dataset broaden the sequenced genomes available for diverse populations, and gnomAD v3 adds ancestry-specific allele frequencies, narrowing the gap between GWAS and WGS.
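A toy example shows how stratification alone can manufacture an association. The sketch below builds two hypothetical subpopulations in which the variant has no effect on disease (allele frequency is identical in cases and controls within each group), but the groups differ in both allele frequency and case fraction. It then compares a naive pooled allelic chi-square with a Cochran-Mantel-Haenszel statistic that conditions on subpopulation. The stratified test is used here only because it is compact; it stands in for, and is not the same as, the PCA and mixed-model corrections described above. All counts are invented, and the allelic test treats the two alleles of each person as independent.

```python
def allelic_chi2(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def cmh_chi2(strata):
    """Cochran-Mantel-Haenszel statistic (no continuity correction).

    `strata` is a list of (case_minor, case_major, control_minor, control_major)
    tuples, one per subpopulation.
    """
    deviation = 0.0
    variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue
        row1, row2, col1, col2 = a + b, c + d, a + c, b + d
        deviation += a - row1 * col1 / n   # observed minus expected
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return deviation ** 2 / variance if variance else 0.0


# Hypothetical allele counts: (case minor, case major, control minor, control major).
POP_A = (160, 1440, 40, 360)    # MAF 0.10 in cases and controls; mostly cases
POP_B = (200, 200, 800, 800)    # MAF 0.50 in cases and controls; mostly controls

if __name__ == "__main__":
    pooled = tuple(x + y for x, y in zip(POP_A, POP_B))
    print("pooled chi-square:", allelic_chi2(*pooled))
    print("stratified (CMH) chi-square:", cmh_chi2([POP_A, POP_B]))
```

The pooled statistic is large even though the variant does nothing within either group, while the stratified statistic is zero. Real cohorts rarely come with clean subpopulation labels, which is why ancestry is usually inferred from the genotype data itself.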
Figure 3: Recommended two-stage study design combining GWAS discovery with WGS follow-up for fine-mapping and rare variant validation.
The Two-Stage Design
The most productive disease studies do not choose between GWAS and WGS — they sequence the approaches. A two-stage design maximizes discovery per dollar by matching each method to the task it performs best.
Stage 1: GWAS Discovery
A large, well-phenotyped cohort is genotyped with arrays and imputed against a diverse reference panel. Genome-wide significant loci are identified. This stage leverages GWAS's core strength — statistical power for common variants at scale.
Stage 2: WGS Follow-Up
A subset of samples — selected from the tails of the phenotype distribution, from carriers of the top GWAS signals, or from population subgroups poorly represented in imputation panels — is sequenced with WGS. This stage addresses GWAS's blind spots:
- Fine-mapping GWAS loci to identify causal variants.
- Testing rare coding and non-coding variants within associated regions.
- Detecting structural variants at GWAS-significant loci.
- Discovering ancestry-specific variants in the discovery cohort.
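One of the Stage 2 selection strategies mentioned above, sampling from the tails of the phenotype distribution, can be sketched as follows. The function name `select_phenotype_tails`, the sample IDs, and the 20% tail fraction are all hypothetical; a real design would also weigh covariates, relatedness, and consent constraints.

```python
from typing import Dict, List, Tuple


def select_phenotype_tails(phenotypes: Dict[str, float],
                           tail_fraction: float) -> Tuple[List[str], List[str]]:
    """Return (low-tail IDs, high-tail IDs) for a quantitative phenotype.

    `tail_fraction` is the share of the cohort taken from each tail.
    """
    if not 0.0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5]")
    if len(phenotypes) < 2:
        raise ValueError("need at least two samples")
    # Sort sample IDs by phenotype value, lowest first.
    ranked = sorted(phenotypes, key=phenotypes.get)
    k = max(1, int(len(ranked) * tail_fraction))
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    # Hypothetical cohort of ten samples with phenotype values 1.0 to 10.0.
    cohort = {f"s{i}": float(i) for i in range(1, 11)}
    low, high = select_phenotype_tails(cohort, tail_fraction=0.2)
    print("sequence from low tail:", low)
    print("sequence from high tail:", high)
```

Extreme-phenotype sampling concentrates sequencing on the individuals most likely to carry large-effect variants, but effect sizes estimated from such a subset are not directly comparable to those from the full cohort.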
Imputation from WGS Reference Panels
A compelling hybrid approach uses WGS data from a subset of the cohort to build a population-specific imputation reference panel, then imputes the full GWAS cohort against this panel. Large public panels such as TOPMed and HRC rest on the same principle: they are built from sequenced genomes and then used to impute array data. For populations poorly represented in existing panels, particularly African, South Asian, and Indigenous cohorts, a population-specific WGS imputation backbone can substantially improve GWAS imputation quality.
- GWAS → WGS: Discovery → fine-mapping. Most common in complex disease research.
- WGS → GWAS: Build a population-specific imputation panel, then genotype the full cohort with arrays and impute against the custom panel. Common in founder populations and underrepresented groups.
- Simultaneous GWAS + WGS: Genotyping the full cohort while sequencing a strategic subset. Used in well-funded consortia.
After identifying associated loci through GWAS, variant calling and annotation services support the transition from association signal to functional interpretation.
Figure 4: Multi-dimensional comparison of GWAS and WGS performance across variant discovery, statistical power, cost, rare variant sensitivity, and non-coding coverage.
Making the Call
The decision between GWAS and WGS is rarely a simple binary. It turns on what you are trying to find, how many samples you have, what your budget allows, and who your study population is. The checklist below distills the analysis into actionable questions.
Table 4: Decision Matrix for Study Design
| If your goal is... | Use this approach | Why |
|---|---|---|
| Discover common SNPs linked to a complex disease | GWAS (array + imputation) | Maximizes power per dollar at common MAF |
| Find causal rare variants in a Mendelian disorder | WGS | Arrays miss novel rare variants entirely |
| Study non-coding or regulatory variation | WGS | Arrays cover less than 5% of known regulatory regions |
| Run a discovery cohort on a limited budget | GWAS first, then WGS follow-up | Two-stage design matches each method to the task it performs best |
| Build a polygenic risk score | GWAS summary statistics → PRS model | PRS is downstream of GWAS results |
| Study a diverse or admixed population | Low-coverage WGS or WGS | Reference panel limitations reduce GWAS imputation quality |
| Return clinically actionable results to participants | WGS with clinical-grade interpretation | GWAS is a research tool, not a clinical assay |
Before committing to a platform, ask:
- Is the trait likely driven by common variants, rare variants, or both?
- What is the available sample size, and can it be expanded?
- What is the ancestral composition of the cohort?
- Does the budget cover both sequencing and downstream storage and analysis?
- Will the study need to detect structural variants or non-coding regulatory changes?
- Is the study grant or publication venue expecting WGS-level data?
- Would a two-stage design meet the study's goals on a realistic timeline?
For most complex disease studies with cohorts exceeding 1,000 samples, starting with GWAS and reserving WGS for fine-mapping and rare variant follow-up remains the most efficient strategy. For Mendelian disorders, family-based studies, or research focused on non-coding regulatory mechanisms, WGS is the appropriate primary platform. The gap between the two approaches narrows each year as sequencing costs fall and reference panels diversify — but the fundamental distinction between scanning known variants and discovering all variants remains.
For guidance on related study design decisions, see our comparison of QTL mapping and GWAS approaches — which covers complementary methods in plant and animal breeding research, with a different set of design considerations.
FAQ
What is the difference between GWAS and whole genome sequencing?
GWAS uses SNP arrays to genotype approximately 500,000 to 2 million known variants and relies on statistical imputation against reference panels to expand coverage to tens of millions of common SNPs. Whole genome sequencing reads the entire ~3 billion base-pair genome directly, capturing known and novel variants across coding and non-coding regions without imputation. The core distinction is that GWAS measures pre-selected common variants, while WGS discovers nearly all variants present in a sample.
When is GWAS the better choice?
GWAS is the more efficient choice when your study investigates a common complex disease, you have access to several thousand well-phenotyped cases and a comparable number of controls, your budget favors maximizing sample size over per-sample completeness, and your primary goal is discovering common SNP associations or building polygenic risk scores. Under these conditions, the statistical power gained by genotyping more samples with arrays outweighs the additional variant information from WGS.
Does WGS deliver more statistical power than GWAS?
No. WGS provides a more complete picture of an individual's genetic variation, but it does not deliver higher statistical power than GWAS for detecting common variant associations at equivalent sample sizes. Because GWAS achieves comparable coverage of common SNPs through imputation at a fraction of the cost, a study with a fixed budget can genotype 5 to 10 times more samples with arrays than with WGS. In association testing, sample size is often the limiting factor for discovery.
How do GWAS and WGS costs compare?
Per-sample reagent costs for GWAS arrays typically range from $50 to $100 at scale, while WGS at 30× coverage costs roughly 5 to 10 times more per sample depending on platform, read length, and service provider. The cost gap extends beyond reagents: a single 30× WGS produces approximately 90 to 100 GB of raw data, requiring substantially more storage and computational resources than array genotypes. Low-coverage WGS with imputation offers a middle ground at an intermediate price point.
Can GWAS and WGS be combined in one study?
Yes, and this two-stage design is widely used in contemporary complex disease genomics. In Stage 1, a large cohort is genotyped with arrays to identify genome-wide significant loci. In Stage 2, a subset of samples is sequenced with WGS to fine-map those loci, identify causal variants, and test for rare variants within the associated regions. An alternative approach uses WGS on a subset of the cohort to build a population-specific imputation reference panel, then imputes the full GWAS cohort against this custom panel.
How many samples does a GWAS need?
To detect a common variant with moderate effect size (odds ratio approximately 1.3) at the genome-wide significance threshold of p < 5×10⁻⁸ with 80% statistical power, a case-control study typically needs several thousand cases and a comparable number of controls, and more as the allele frequency falls toward 5%. For weaker effects, which collectively explain most complex trait heritability, tens of thousands to hundreds of thousands of samples are required. Power calculators such as Quanto, CaTS, and the Genetic Association Study (GAS) Power Calculator can estimate required sample sizes based on assumptions about effect size, allele frequency, and study design before sample processing begins.
Is WGS the right choice for rare disease research?
WGS is almost always the preferred platform for rare disease research. The pathogenic variants underlying most rare diseases are absent from standard SNP arrays and may not be well-represented in imputation reference panels. For Mendelian disorders, family-based studies, or de novo mutation discovery, WGS provides the complete variant ascertainment needed to identify causal mutations. In these settings, the additional cost of WGS is justified by the higher yield of causal findings and the inability of GWAS to capture the relevant variant classes.
What limitations of GWAS does WGS address?
GWAS has four systematic limitations that WGS directly addresses. First, GWAS cannot detect novel rare variants with MAF below roughly 0.1% because these variants are not on arrays and cannot be reliably imputed. Second, GWAS has limited structural variant detection beyond the small set of CNV probes included on most arrays. Third, GWAS coverage of non-coding regulatory regions depends entirely on the reference panel, leaving ancestry-specific regulatory variants undetected. Fourth, GWAS imputation accuracy degrades in populations poorly represented in existing reference panels, widening health disparities in genomic research.
References:
- Buniello A, MacArthur JAL, Cerezo M, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Research. 2019;47(D1):D1005-D1012. doi:10.1093/nar/gky1120
- Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203-209. doi:10.1038/s41586-018-0579-z
- Taliun D, Harris DN, Kessler MD, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590(7845):290-299. doi:10.1038/s41586-021-03205-y
- Karczewski KJ, Francioli LC, Tiao G, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434-443. doi:10.1038/s41586-020-2308-7
- Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. American Journal of Human Genetics. 2014;95(1):5-23. doi:10.1016/j.ajhg.2014.06.009
- Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D. Benefits and limitations of genome-wide association studies. Nature Reviews Genetics. 2019;20(8):467-484. doi:10.1038/s41576-019-0127-1
- Turro E, Astle WJ, Megy K, et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature. 2020;583(7814):96-102. doi:10.1038/s41586-020-2434-2
- Das S, Forer L, Schönherr S, et al. Next-generation genotype imputation service and methods. Nature Genetics. 2016;48(10):1284-1287. doi:10.1038/ng.3656