Practical Guide: Designing an LD Study (MAF, r² Thresholds, Sample Size)
Designing a linkage disequilibrium (LD) study hinges on four controllable levers: the MAF filter, the r² threshold, the LD window size, and the sample size. These parameters interact, so tune them together rather than one at a time. Set them with your cohort, platform, and objective in mind, and you will gain power and reproducibility without wasting compute. This guide explains the trade-offs, gives practical defaults, and provides a decision flow you can apply today. You'll also find a scoping checklist to help your team get fast, high-quality quotes and a clean hand-off to analysis.
Quick Answer: LD Defaults
Use these defaults when you need a defensible starting point for linkage disequilibrium analysis. Adjust based on objectives and cohort features.
- MAF filter: start at 0.05; lower to 0.01 for rare-variant emphasis.
- r² threshold: 0.2 for pruning; 0.8 for tag SNP selection.
- LD window size: 250 kb or 50 variants, whichever limit is reached first.
- Step size: 5–10 variants to control runtime with minimal loss.
- Sample size: at least 300 per ancestry group for exploratory work, and 500–1,000 or more for robust pruning and tagging, depending on ancestry and aim.
- Deviate when: fine-mapping (raise r²), high recombination (shrink windows), admixed cohorts (increase n), ultra-dense WGS (prefer variant-count windows).
Outcome to expect: stable r² estimates, reduced redundancy, faster GWAS and panel design, and cleaner downstream deliverables.
Design Inputs & Parameters
Define objective and platform first
Start with the intended use. Each objective implies different settings.
- GWAS QC and model stability. Emphasize pruning to reduce collinearity. Use r²≈0.2 and standard windows.
- Tag SNP panel design. Emphasize coverage with fewer markers. Use r²≥0.8 and verify transferability.
- Haplotype structure or LD decay. Use consistent windows across regions and cohorts, then compare decay curves.
Figure: LD decay of the human genome depending on recombination rates. (Park L. (2012) PLOS ONE)
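Comparing decay curves usually starts by averaging r² within distance bins. The sketch below shows one simple way to do that. The function name, the 10 kb bin width, and the toy pairs are illustrative assumptions, not values from this guide or from any particular tool.

```python
from collections import defaultdict

def ld_decay_curve(pairs, bin_size_bp=10_000):
    """Average r2 within fixed-width distance bins.

    pairs: iterable of (distance_bp, r2) tuples, one per SNP pair.
    Returns a list of (bin_start_bp, mean_r2, n_pairs) sorted by distance.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance_bp, r2 in pairs:
        bin_start = (distance_bp // bin_size_bp) * bin_size_bp
        sums[bin_start] += r2
        counts[bin_start] += 1
    return [(b, sums[b] / counts[b], counts[b]) for b in sorted(sums)]

# Toy pairs (hypothetical values, not real data).
toy_pairs = [(1_200, 0.90), (8_500, 0.70), (12_000, 0.50),
             (19_000, 0.30), (45_000, 0.10)]
curve = ld_decay_curve(toy_pairs)
for bin_start, mean_r2, n in curve:
    print(f"{bin_start:>6} bp  mean r2 = {mean_r2:.2f}  (n = {n})")
```

Using the same bin width and the same window settings for every cohort keeps the curves comparable.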
Your platform sets practical limits. Arrays have spaced markers and missingness patterns. WGS is dense but computationally heavier. For arrays, a variant-count window (e.g., 50 SNPs) is often robust. For WGS, a hybrid rule (250 kb or 50 SNPs) avoids oversampling high-density regions.
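The hybrid rule can be read as "close the window at whichever limit is reached first". The following is a minimal sketch of that reading, assuming sorted base-pair positions on a single chromosome. The function name and the toy marker spacings are hypothetical.

```python
def window_end(positions, start, max_kb=250, max_variants=50):
    """Return the exclusive end index of the window starting at `start`.

    positions: sorted base-pair coordinates of variants on one chromosome.
    The window closes at whichever limit is reached first:
    `max_variants` variants, or a span of `max_kb` kilobases.
    """
    limit_bp = positions[start] + max_kb * 1000
    end = start
    while (end < len(positions)
           and end - start < max_variants
           and positions[end] <= limit_bp):
        end += 1
    return end

# Dense region (one variant per 1 kb): the variant-count cap applies.
dense = [i * 1_000 for i in range(500)]
# Sparse region (one variant per 20 kb): the distance cap applies.
sparse = [i * 20_000 for i in range(500)]

print(window_end(dense, 0))   # number of variants in the first dense window
print(window_end(sparse, 0))  # number of variants in the first sparse window
```

In the dense toy region the window stops at 50 variants. In the sparse one it stops once the next variant lies beyond 250 kb.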
Tip: Document the objective and platform up front. It shortens review cycles and avoids re-runs.
Cohort and ancestry shape LD
LD reflects demography, recombination, and population history. Ancestry impacts r² decay and the sample size needed for stable estimates.
Figure: Distance-corrected LD (r²) increases with range-expansion distance and with a shift toward selfing; effects are comparable across deleterious and tolerated SNP classes. (Lucek K. & Willi Y. (2021) PLOS Genetics)
- LD decays faster in populations with higher recombination or a larger effective population size, which in practice often means more diverse ancestry.
- Admixed cohorts need larger samples for precise r², especially when tagging across the genome.
- Cross-ancestry projects should plan either ancestry-specific tag sets or a multi-ancestry strategy.
Heuristic: if your study spans multiple ancestries, plan stratum-specific LD calculations and then evaluate transferability rather than assuming one threshold fits all.
Set the MAF filter
MAF influences r² behavior and noise.
- Why it matters. Very rare alleles can inflate apparent LD due to sampling variance.
- Where to start. MAF ≥0.05 is a general-purpose default for pruning and LD summaries.
- When to lower. For rare-variant emphasis or targeted regions, consider 0.01 with tighter QC.
- When to raise. For computational screening or highly admixed samples, 0.1 can stabilize estimates.
Practical pattern: Begin at 0.05 for exploratory LD and pruning. For final tag selection, review the MAF spectrum of your targets and adjust.
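To make the filter concrete, here is a small sketch that computes MAF from genotypes coded as 0/1/2 copies of the alternate allele and applies a threshold. The coding, the function names, and the toy genotypes are assumptions for illustration. The last helper shows why rare alleles are noisy: at MAF 0.01 in 300 samples only about six minor-allele copies are expected.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded 0/1/2 (alternate-allele copies); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def passes_maf(genotypes, threshold=0.05):
    """True when the marker has a defined MAF at or above the threshold."""
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf >= threshold

def expected_minor_copies(n_samples, maf):
    """Expected number of minor-allele copies among 2 * n_samples chromosomes."""
    return 2 * n_samples * maf

# Toy markers (hypothetical): 100 samples each.
common = [0] * 90 + [1] * 9 + [2] * 1   # 11 alternate copies out of 200
rare = [0] * 97 + [1] * 3               # 3 alternate copies out of 200

print(passes_maf(common, 0.05), passes_maf(rare, 0.05), passes_maf(rare, 0.01))
print(expected_minor_copies(300, 0.01))
```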
Choose r² thresholds by use-case
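r² is the squared correlation between alleles at two markers, so it measures how well one marker predicts the other (0 means independent, 1 means perfect proxies). Pick the threshold that fits your goal.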
- Pruning for GWAS: r²≈0.2 reduces collinearity, speeds models, and preserves independent signals.
- Tag selection: r²≥0.8 yields strong coverage of nearby markers. For fine-mapping panels, consider 0.9.
- Regional nuance: In gene-dense regions, a slightly higher threshold can help; in low-recombination regions, lower thresholds avoid over-tagging.
Guardrail: Always validate the chosen r² against ancestry-specific LD patterns. Thresholds that work in one group may not generalize.
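For intuition about what the thresholds act on, the sketch below computes r² as the squared Pearson correlation between two genotype vectors coded 0/1/2. This is an unphased, genotype-based estimate written for illustration, under the assumption of no missing data. It is not a substitute for the estimators in PLINK or other LD tools, and the toy vectors are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # monomorphic marker: r2 is undefined
    return (cov * cov) / (var_x * var_y)

# Toy genotypes (hypothetical).
marker_a = [0, 1, 2, 1, 0, 2]
perfect_proxy = [0, 1, 2, 1, 0, 2]
unrelated_a = [0, 0, 2, 2]
unrelated_b = [0, 2, 0, 2]

print(genotype_r2(marker_a, perfect_proxy))
print(genotype_r2(unrelated_a, unrelated_b))
```

A pair like the first would be collapsed by pruning at 0.2 and would also qualify as a tag pair at 0.8. A pair like the second would survive pruning as two independent markers.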
Tune LD window size and step
Window size determines the genomic span over which you evaluate LD.
Figure: Heat map of linkage disequilibrium across the sunflower genome. (Mandel J.R. et al. (2013) PLOS Genetics)
- Default: 250 kb or 50 SNPs per window, whichever comes first.
- High recombination: shrink to 100–150 kb to avoid mixing signals.
- Sparse arrays: prefer a variant-count window (e.g., 50 SNPs) and step 5–10 SNPs.
- Dense WGS: hybrid rules avoid excessive compute in dense regions.
Runtime control: Step size is your throttle. Increasing the step from 5 to 10 variants roughly halves the number of window positions evaluated, usually with little loss of information for pruning.
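A quick count of window positions shows where the runtime saving comes from. This is a simplified sketch that assumes a pure variant-count window and a hypothetical chromosome of 10,000 variants. Real pruning tools reuse work between overlapping windows, so treat the ratio as intuition rather than a benchmark.

```python
def window_starts(n_variants, window=50, step=5):
    """Start indices of sliding windows of `window` variants moved by `step`."""
    if n_variants <= window:
        return [0]
    last_start = n_variants - window
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # final window flush with the chromosome end
    return starts

n_variants = 10_000  # hypothetical chromosome
for step in (5, 10):
    n_windows = len(window_starts(n_variants, window=50, step=step))
    print(f"step {step:>2}: {n_windows} window positions")
```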
Sample size heuristics
Stable r² estimation needs enough samples in each ancestry stratum.
- Exploratory LD summaries: ≥300 samples per ancestry is a reasonable floor.
- Robust pruning and tagging: aim for 500–1,000 or more when possible.
- Admixed or multi-ancestry: bias upward; instability is more likely.
If your cohort is smaller, constrain windows and favor conservative thresholds. You can also pool cohorts with similar structure after careful QC.
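One way to see why the floors above matter is the sampling noise in r² itself. For two markers that are truly independent, the sample Pearson correlation across n individuals has variance of about 1/(n − 1), so the expected r² from noise alone is roughly that value. The sketch below tabulates it. It assumes a genotype-correlation estimate of r² and independent samples, and the function name is hypothetical.

```python
def null_r2_expectation(n_samples):
    """Approximate expected r2 between two unlinked markers: 1 / (n - 1)."""
    if n_samples < 2:
        raise ValueError("need at least 2 samples")
    return 1.0 / (n_samples - 1)

for n in (100, 300, 500, 1000):
    print(f"n = {n:>4}: expected r2 under no LD is about {null_r2_expectation(n):.4f}")
```

With 100 samples the noise floor is near 0.01, which is not negligible next to a pruning threshold of 0.2 when many pairs are tested. With 500 to 1,000 samples it is several times smaller.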
QC & Pipeline Hand-Off
Quality control determines whether linkage disequilibrium analysis results are believable. Build these checks into your pre-LD gate.
QC gate: thresholds that matter
Set the bar before computing LD.
- Sample call rate: target ≥98% (arrays) or per-base completeness metrics for WGS.
- Marker call rate: target ≥98% (≥99% where feasible) to prevent missingness-induced artifacts.
- Hardy–Weinberg equilibrium: remove extreme deviations (e.g., p < 1e-6) after stratifying by ancestry.
- Relatedness: remove up to second-degree relatives or down-weight them, depending on design.
- Population structure: evaluate principal components; remove outliers or stratify.
- Imputation quality: if using imputed variants, apply a stringent INFO/R2 cutoff before LD.
Why this matters: Poor QC inflates false LD, confounds pruning, and destabilizes r² thresholds across runs.
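The call-rate gates are simple to check directly. Below is a minimal sketch that computes per-sample and per-marker call rates from a small genotype matrix and flags anything under 98%. The matrix layout (rows are samples, `None` marks a missing call) and all names are assumptions for illustration. On real data the equivalent PLINK filters do this at scale.

```python
def call_rates(matrix):
    """Per-sample and per-marker call rates.

    matrix[s][m] is the genotype of sample s at marker m; None = missing.
    """
    n_samples = len(matrix)
    n_markers = len(matrix[0])
    sample_rates = [sum(g is not None for g in row) / n_markers for row in matrix]
    marker_rates = [sum(row[m] is not None for row in matrix) / n_samples
                    for m in range(n_markers)]
    return sample_rates, marker_rates

# Toy matrix (hypothetical): 4 samples x 5 markers.
genotypes = [
    [0, 1, 2, 0, 1],
    [0, None, 2, 0, 1],
    [1, 1, None, None, 0],
    [2, 0, 1, 0, None],
]
sample_rates, marker_rates = call_rates(genotypes)
low_samples = [i for i, r in enumerate(sample_rates) if r < 0.98]
low_markers = [m for m, r in enumerate(marker_rates) if r < 0.98]
print("samples below 98%:", low_samples)
print("markers below 98%:", low_markers)
```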
PLINK essentials: a reproducible backbone
A minimal, reproducible PLINK flow keeps LD analysis predictable. Below is an illustrative example (adapt parameters to your data):
# Pre-filtering: marker and sample missingness ≤2%, MAF ≥0.05, HWE p ≥ 1e-6
plink --bfile INPUT \
  --geno 0.02 --mind 0.02 \
  --maf 0.05 \
  --hwe 1e-6 midp \
  --make-bed --out STEP1_QC
# Pruning (r² = 0.2, 250 kb window, step 5 SNPs)
# Without the kb suffix, PLINK 1.9 reads the first value as a variant count.
plink --bfile STEP1_QC \
  --indep-pairwise 250kb 5 0.2 \
  --out STEP2_PRUNE
# Keep only the variants retained by pruning
plink --bfile STEP1_QC \
  --extract STEP2_PRUNE.prune.in \
  --make-bed --out STEP3_PRUNED
# Pairwise LD table (pairs within 250 kb, r² ≥ 0.2) in selected regions or genome-wide
plink --bfile STEP1_QC \
  --r2 gz yes-really --ld-window-kb 250 --ld-window 99999 --ld-window-r2 0.2 \
  --out STEP4_LD
Notes:
- Use ancestry-specific subsets when calculating LD.
- For WGS, consider --ld-window 50 to cap by variant count.
- For tag selection, adjust r² to 0.8–0.9 and export high-confidence pairs.
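Exporting high-confidence pairs can be done by filtering the pairwise table that PLINK writes. The sketch below assumes the usual whitespace-delimited `--r2` table with a header containing `SNP_A`, `SNP_B`, and `R2`. The rsIDs and values in the example are made up. For the compressed output you would open the file with `gzip.open(path, "rt")` instead of the in-memory text used here.

```python
import io

def read_ld_pairs(handle, min_r2=0.8):
    """Return (snp_a, snp_b, r2) for table rows with r2 >= min_r2."""
    header = handle.readline().split()
    idx = {name: i for i, name in enumerate(header)}
    pairs = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        r2 = float(fields[idx["R2"]])
        if r2 >= min_r2:
            pairs.append((fields[idx["SNP_A"]], fields[idx["SNP_B"]], r2))
    return pairs

# Hypothetical table contents in the PLINK --r2 column layout.
example_text = "\n".join([
    " CHR_A   BP_A SNP_A  CHR_B   BP_B SNP_B     R2",
    "     1   1000   rs1      1   2000   rs2   0.95",
    "     1   1000   rs1      1   9000   rs3   0.35",
    "     1   2000   rs2      1   9000   rs3   0.81",
])
strong_pairs = read_ld_pairs(io.StringIO(example_text), min_r2=0.8)
for snp_a, snp_b, r2 in strong_pairs:
    print(snp_a, snp_b, r2)
```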
Figure: Structure of the LDlinkR package. (Myers T.A. et al. (2020) Frontiers in Genetics)
Transferability and deliverables
LD patterns and tag performance vary across ancestries. Validate transferability before fixing panels.
- If transfer fails: build ancestry-specific panels or a multi-ancestry tag set.
- If partial transfer holds: supplement the base panel with ancestry-specific add-ons.
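A transferability check can be as simple as measuring, per ancestry group, what fraction of target SNPs has at least one tag at or above the chosen r². The sketch below is one way to express that. The data structure (a dictionary keyed by unordered SNP pairs), the SNP names, and the r² values are all hypothetical.

```python
def tag_coverage(targets, tags, r2_table, threshold=0.8):
    """Fraction of target SNPs captured by at least one tag at r2 >= threshold.

    r2_table maps frozenset({snp_a, snp_b}) -> r2; missing pairs count as 0.
    A target that is itself a tag counts as covered.
    """
    tag_set = set(tags)
    covered = 0
    for snp in targets:
        if snp in tag_set:
            covered += 1
            continue
        best = max((r2_table.get(frozenset((snp, t)), 0.0) for t in tag_set),
                   default=0.0)
        if best >= threshold:
            covered += 1
    return covered / len(targets)

targets = ["snp1", "snp2", "snp3", "snp4"]
tags = ["snp1"]
# Toy r2 values for two ancestry groups (hypothetical).
r2_group_a = {frozenset(("snp1", "snp2")): 0.95,
              frozenset(("snp1", "snp3")): 0.85,
              frozenset(("snp1", "snp4")): 0.82}
r2_group_b = {frozenset(("snp1", "snp2")): 0.90,
              frozenset(("snp1", "snp3")): 0.60,
              frozenset(("snp1", "snp4")): 0.40}

print(tag_coverage(targets, tags, r2_group_a))  # full coverage in group A
print(tag_coverage(targets, tags, r2_group_b))  # half the targets in group B
```

In this toy case the single tag transfers only partially to group B, which is the situation where ancestry-specific add-ons are worth considering.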
Figure: Long-range linkage disequilibrium visualization for the region [22 Mb–40 Mb], chromosome 6, surrounding the major histocompatibility complex. (Mourad R. et al. (2011) PLOS ONE)
Standard deliverables we provide:
- LD matrices (compressed), per-chromosome.
- Pruning reports with chosen parameters.
- LD decay curves by ancestry.
- Tag SNP lists with r² coverage stats.
- Parameter manifest for full reproducibility.
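A parameter manifest can be a small JSON file written alongside the results. The sketch below shows one possible shape. Every field name here is a hypothetical choice rather than a fixed deliverable format, and the values simply echo the defaults discussed in this guide.

```python
import json

# Hypothetical manifest layout; adapt field names to your own conventions.
manifest = {
    "objective": "gwas_qc",
    "platform": "array",
    "qc": {"geno": 0.02, "mind": 0.02, "maf": 0.05, "hwe_p": 1e-6},
    "pruning": {"window": "250kb", "step_variants": 5, "r2": 0.2},
    "ld_table": {"window_kb": 250, "min_r2": 0.2},
    "ancestry_strata": ["stratum_1", "stratum_2"],
}

manifest_text = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_text)

# To save next to the outputs:
# with open("ld_manifest.json", "w") as fh:
#     fh.write(manifest_text)
```

Sorting keys keeps the file stable between runs, which makes differences easy to spot in version control.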
For a refresher on core concepts, see LD 101 (Overview). For execution detail, see PLINK LD Workflow.
Decision Framework & Scoping Checklist
One-page decision flow
Use this simple flow to move from objectives to parameters.
- Objective. Pick your primary aim: GWAS QC, tag panel, or haplotypes.
- Platform. Array → prefer variant-count windows; WGS → hybrid windows.
- Ancestry. Single-ancestry → standard defaults; multi-ancestry → larger n and transferability checks.
- MAF. Start at 0.05; move to 0.01 for rare-variant emphasis.
- r². Use 0.2 to prune; 0.8–0.9 to tag; confirm region-specific nuances.
- Window & step. 250 kb or 50 SNPs; step 5–10; shrink in high recombination.
- QC. Enforce call rate, HWE, relatedness, structure, and imputation quality gates.
- Pilot. Run a small pilot on two chromosomes; inspect LD decay and runtime.
- Lock. Finalize parameters; register them in the manifest; proceed genome-wide.
Pro tip: Keep the pilot results as a mini-appendix in your report. It justifies parameter choices to reviewers and collaborators.
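The flow above can be encoded as a small helper that turns an objective, platform, and cohort description into starting parameters. This is a sketch of this guide's defaults only. The function name, argument names, and return keys are hypothetical, and the output is meant as a starting point for the pilot, not a final setting.

```python
def suggest_parameters(objective, platform, multi_ancestry=False, rare_variants=False):
    """Map study design choices to the starting defaults described in this guide."""
    r2_by_objective = {"gwas_qc": 0.2, "tag_panel": 0.8, "haplotype": None}
    if objective not in r2_by_objective:
        raise ValueError(f"unknown objective: {objective}")
    if platform not in {"array", "wgs"}:
        raise ValueError(f"unknown platform: {platform}")

    if multi_ancestry:
        min_samples = 1000   # bias upward for admixed or multi-ancestry cohorts
    elif objective == "haplotype":
        min_samples = 300    # exploratory LD summaries
    else:
        min_samples = 500    # robust pruning and tagging

    return {
        "maf": 0.01 if rare_variants else 0.05,
        "r2": r2_by_objective[objective],  # None: no pruning/tagging threshold implied
        "window": "50 variants" if platform == "array" else "250 kb or 50 variants",
        "step_variants": 5,
        "min_samples_per_ancestry": min_samples,
        "check_transferability": multi_ancestry,
    }

print(suggest_parameters("tag_panel", "array"))
print(suggest_parameters("gwas_qc", "wgs", multi_ancestry=True, rare_variants=True))
```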
Inquiry checklist for fast, qualified quotes
Copy this checklist into your inquiry. It shortens scoping and helps us return a precise plan.
- Aim and endpoints: GWAS QC, tag selection, or haplotype analysis.
- Cohort size and ancestry mix: include expected relatedness or family structure.
- Platform and call-rate targets: array model or WGS/WES; desired call rate.
- Proposed defaults: your starting MAF, r², window, and step (or request our presets).
- QC status: raw genotypes vs imputed; current filters applied.
- Deliverables required: matrices, decay plots, tag lists, manifests, and formats.
- Constraints: timelines, compute restrictions, regulatory needs.
- Contact and data transfer: SFTP or secure bucket; preferred file naming.
To convert LD into panel design, read From LD to Tag SNPs next.
FAQs
What MAF filter should I use for LD pruning?
Start with MAF ≥0.05 for robust pruning and stable r². Lower to 0.01 when your aim includes rare-variant signals, but tighten QC and monitor runtime.
Which r² threshold is best for tag SNP selection?
Use r² ≥0.8 for general tagging. For fine-mapping or high-stakes regions, consider 0.9. Always evaluate coverage across ancestry strata.
How large should my LD windows be on arrays versus WGS?
A safe default is 250 kb or 50 SNPs. On arrays, the variant-count rule helps when density is uneven. In high-recombination regions, shrink to 100–150 kb.
How many samples do I need for stable LD estimates?
Plan for at least 300 per ancestry for exploratory summaries and 500–1,000 or more for robust pruning and tagging. Admixed cohorts benefit from larger n.
Can one tag set fit all ancestries?
Often not. Base performance may drop in groups with faster LD decay. Use multi-ancestry tagging or ancestry-specific supplements.
Do imputed variants change thresholds?
Yes. Apply strict imputation quality filters before LD. Then revisit MAF and r² once the call set is stable.
How do I reduce runtime without losing key signal?
Increase step size modestly (e.g., from 5 to 10 variants), prefer variant-count windows on arrays, and pilot on two chromosomes to tune parameters.
Should I prune before PCA or association testing?
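Yes for PCA, relatedness estimation, and other steps that assume roughly independent markers: pruning at r²≈0.2 reduces collinearity and speeds models. Association tests are usually run on the full QC-passed variant set, with covariates such as principal components derived from the pruned set. Keep a record of parameters and the list of retained markers.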
What reports should I include for reviewers?
Provide the parameter manifest, LD decay plots by ancestry, pruning metrics, and tag coverage tables. These materials establish transparency and rigor.
Where can I learn the basics before choosing thresholds?
Read LD 101 (Overview) for definitions, use-cases, and key metrics.
Conclusion
Well-chosen MAF filters, r² thresholds, LD window sizes, and sample sizes turn linkage disequilibrium analysis from a time sink into a decision engine. Start from clear objectives, respect ancestry and platform realities, and enforce QC at the door. Use a small pilot to lock parameters, then scale confidently. With stable r² and practical pruning, you gain faster GWAS, cleaner models, and efficient tag sets—without sacrificing power.
Ready to move from planning to execution?
Share your scoping checklist and a brief cohort summary. We will return a tailored MAF/r²/window plan, a QC template aligned to your platform, and a quote.
CD Genomics provides research-use services for institutions and companies. We do not offer personal or clinical testing.
Further reading:
- Linkage Disequilibrium 101: What LD Measures and When It Matters
- Running LD the Right Way: PLINK Workflow, Parameters, and LD Pruning
- From LD to Tag SNPs: Building Efficient Panels Without Losing Power
References
- Mandel, J.R., Nambeesan, S., Bowers, J.E. et al. Association mapping and the genomic consequences of selection in sunflower. PLOS Genetics 9, e1003378 (2013).
- Park, L. Linkage disequilibrium decay and past population history in the human genome. PLOS ONE 7, e46603 (2012).
- Mourad, R., Sinoquet, C., Dina, C. et al. Visualization of pairwise and multilocus linkage disequilibrium structure using latent forests. PLOS ONE 6, e27320 (2011).
- Lucek, K., Willi, Y. Drivers of linkage disequilibrium across a species' geographic range. PLOS Genetics 17, e1009477 (2021).
- Myers, T.A., Chanock, S.J., Machiela, M.J. LDlinkR: An R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Frontiers in Genetics 11, 157 (2020).
- Chang, C.C., Chow, C.C., Tellier, L.C.A.M. et al. Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).
- The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).