Designing an LD study hinges on four controllable levers: the MAF filter, the r² threshold, the LD window size, and the sample size. In linkage disequilibrium analysis, these parameters interact: changing one shifts the sensible range for the others. Set them with your cohort, platform, and objective in mind, and you will gain power and reproducibility without wasting compute. This guide explains the trade-offs, gives practical defaults, and provides a decision flow you can apply today. You'll also find a scoping checklist to help your team get fast, high-quality quotes and a clean hand-off to analysis.
Use these defaults when you need a defensible starting point for linkage disequilibrium analysis. Adjust based on objectives and cohort features.
Outcome to expect: stable r² estimates, reduced redundancy, faster GWAS and panel design, and cleaner downstream deliverables.
Start with the intended use. Each objective implies different settings.
LD decay of the human genome depending on recombination rates. (Park L. (2012) PLOS ONE)
Your platform sets practical limits. Arrays have sparser, unevenly spaced markers and their own missingness patterns. WGS is dense but computationally heavier. For arrays, a variant-count window (e.g., 50 SNPs) is often robust. For WGS, a hybrid rule (250 kb or 50 SNPs, whichever limit is reached first) avoids oversampling high-density regions.
Tip: Document the objective and platform up front. It shortens review cycles and avoids re-runs.
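To see why a variant-count cap matters in dense data, the sketch below counts the variant pairs that fall inside a window under a distance-only rule and under the hybrid rule. It assumes sorted base-pair positions on one chromosome, one per line, and reads the hybrid rule as "at most 250 kb and fewer than 50 variants apart". The helper names `count_pairs` and `dense_region` and the toy positions are illustrative assumptions, not part of any tool.

```shell
# count_pairs KB MAXVAR: read sorted bp positions (one per line) from stdin and
# print how many variant pairs are at most KB kilobases AND fewer than MAXVAR
# variants apart. Illustrative helper, not a PLINK feature.
count_pairs() {
  awk -v kb="$1" -v maxvar="$2" '
    { pos[NR] = $1 }
    END {
      n = 0
      for (i = 1; i <= NR; i++)
        for (j = i + 1; j <= NR && j - i < maxvar; j++) {
          if (pos[j] - pos[i] > kb * 1000) break   # positions are sorted
          n++
        }
      print n
    }'
}

# Toy WGS-like region: 200 variants spaced 100 bp apart (about 20 kb in total).
dense_region() { awk 'BEGIN { for (i = 1; i <= 200; i++) print i * 100 }'; }

dense_region | count_pairs 250 1000000   # distance-only rule: 19900 pairs
dense_region | count_pairs 250 50        # hybrid rule:         8575 pairs
```

In this toy region the variant cap removes more than half of the comparisons, because every variant sits well inside 250 kb of every other one.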
LD reflects demography, recombination, and population history. Ancestry impacts r² decay and the sample size needed for stable estimates.
Distance-corrected LD (r²) increases with range-expansion distance and with a shift toward selfing; effects are comparable across deleterious and tolerated SNP classes. (Lucek K. & Willi Y. (2021) PLOS Genetics)
Heuristic: if your study spans multiple ancestries, plan stratum-specific LD calculations and then evaluate transferability rather than assuming one threshold fits all.
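One way to organize stratum-specific LD runs is a small loop that emits one PLINK command per ancestry group. The sketch below only prints the commands so they can be reviewed first. It assumes PLINK 1.9, a QC'd fileset named `STEP1_QC`, and one keep file per stratum (`keep_EUR.txt`, `keep_AFR.txt`, each listing FID and IID); the stratum labels and all file names are placeholders. The MAF filter is included so that frequencies are evaluated within each stratum, and `--ld-window-r2 0` reports every pair in the window, which can produce large files.

```shell
# Print (not run) one PLINK 1.9 command per ancestry stratum.
# Assumptions: STEP1_QC.{bed,bim,fam} exists, and keep_<LABEL>.txt lists the
# FID and IID of the samples in each stratum. All file names are placeholders.
ld_by_stratum() {
  for anc in "$@"; do
    echo plink --bfile STEP1_QC \
      --keep "keep_${anc}.txt" --maf 0.05 \
      --r2 gz --ld-window-kb 250 --ld-window 99999 --ld-window-r2 0 \
      --out "LD_${anc}"
  done
}

ld_by_stratum EUR AFR    # review the output, then pipe it to sh to execute
```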
MAF influences r² behavior and noise.
Practical pattern: Begin at 0.05 for exploratory LD and pruning. For final tag selection, review the MAF spectrum of your targets and adjust.
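Reviewing the MAF spectrum can be as simple as counting variants per frequency class. The sketch below assumes a PLINK 1.9 `.frq` table (as written by `--freq`) on standard input and uses the 0.01 and 0.05 cut-offs discussed in this guide; the function name `maf_spectrum` and the fileset name `TARGETS` are placeholders.

```shell
# maf_spectrum: read a PLINK 1.9 .frq table (header, then CHR SNP A1 A2 MAF
# NCHROBS) from stdin and count variants per MAF class. Rows with MAF "NA"
# are skipped.
maf_spectrum() {
  awk 'NR > 1 && $5 != "NA" {
         if ($5 < 0.01)      { rare++ }
         else if ($5 < 0.05) { low++ }
         else                { common++ }
       }
       END {
         printf "rare\t%d\nlow\t%d\ncommon\t%d\n", rare, low, common
       }'
}

# Typical use (TARGETS is a placeholder fileset name):
#   plink --bfile TARGETS --freq --out TARGETS
#   maf_spectrum < TARGETS.frq
```

If most targets land in the "low" or "rare" class, a 0.05 filter would discard them, which is the signal to revisit the threshold.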
r² encodes how much one marker explains another. Pick the threshold that fits your goal.
Guardrail: Always validate the chosen r² against ancestry-specific LD patterns. Thresholds that work in one group may not generalize.
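For readers who want the quantity behind the threshold, the sketch below computes r² from phased haplotype counts at two biallelic loci using the textbook definition r² = D² / (pA(1-pA) pB(1-pB)), with D = pAB - pA·pB. The function name and the example counts are illustrative. PLINK's `--r2` works from genotype data by default, so its estimates need not match this haplotype-based calculation exactly.

```shell
# r2_from_haps nAB nAb naB nab: r2 from phased haplotype counts for two
# biallelic loci (alleles A/a at locus 1, B/b at locus 2).
r2_from_haps() {
  awk -v AB="$1" -v Ab="$2" -v aB="$3" -v ab="$4" 'BEGIN {
    n = AB + Ab + aB + ab
    if (n == 0) { print "NA"; exit }     # no haplotypes observed
    pA  = (AB + Ab) / n                  # frequency of allele A
    pB  = (AB + aB) / n                  # frequency of allele B
    D   = AB / n - pA * pB               # disequilibrium coefficient
    den = pA * (1 - pA) * pB * (1 - pB)
    if (den == 0) { print "NA"; exit }   # monomorphic locus: r2 undefined
    printf "%.3f\n", D * D / den
  }'
}

r2_from_haps 50 0 0 50     # perfect LD:       1.000
r2_from_haps 40 10 10 40   # partial LD:       0.360
r2_from_haps 25 25 25 25   # no association:   0.000
```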
Window size determines the genomic span over which you evaluate LD.
Heat map of linkage disequilibrium across the sunflower genome. (Mandel J.R. et al. (2013) PLOS Genetics)
Runtime control: Step size is your throttle. Increasing the step from 5 to 10 variants roughly halves the number of window positions evaluated, usually with minimal loss of information for pruning.
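The effect of step size is simple arithmetic. The sketch below counts approximate window placements for a given number of variants, window size, and step; it is a back-of-the-envelope model of sliding-window pruning under assumed inputs (100,000 variants, a 50-variant window), not a measurement of PLINK's actual runtime.

```shell
# window_positions M W S: approximate number of window placements when a
# W-variant window slides over M variants in steps of S variants.
window_positions() {
  echo $(( ($1 - $2 + $3 - 1) / $3 + 1 ))
}

window_positions 100000 50 5     # 19991
window_positions 100000 50 10    # 9996
```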
Stable r² estimation needs enough samples in each ancestry stratum.
If your cohort is smaller, constrain windows and favor conservative thresholds. You can also pool cohorts with similar structure after careful QC.
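A rough way to reason about sample size is the sampling "noise floor" of r². As a commonly used approximation (an assumption here, not a result from this guide), the expected r² between unlinked loci that arises from sampling alone is about 1/N for N sampled individuals. The sketch below tabulates that value for a few cohort sizes; the function name is illustrative.

```shell
# r2_noise_floor N: approximate expected r2 between unlinked loci that is due
# to sampling alone, taken here as 1/N for N sampled individuals.
r2_noise_floor() {
  awk -v n="$1" 'BEGIN { printf "%.4f\n", 1 / n }'
}

for n in 100 300 500 1000; do
  printf '%s\t%s\n' "$n" "$(r2_noise_floor "$n")"
done
```

Under this approximation a stratum of 100 samples carries a floor near 0.01, which is small next to a pruning threshold of 0.2 but adds visible noise when comparing weak LD between groups.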
Quality control determines whether linkage disequilibrium analysis results are believable. Build these checks into your pre-LD gate.
Set the bar before computing LD.
Why this matters: Poor QC inflates false LD, confounds pruning, and destabilizes r² thresholds across runs.
A minimal, reproducible PLINK flow keeps LD analysis predictable. Below is an illustrative example (adapt parameters to your data):
# Pre-filtering
plink --bfile INPUT \
--geno 0.02 --mind 0.02 \
--maf 0.05 \
--hwe 1e-6 midp \
--make-bed --out STEP1_QC
# Pruning (r² = 0.2, 250 kb window, step 5 SNPs)
# The "kb" modifier is needed here; without it PLINK 1.9 reads 250 as a variant count.
plink --bfile STEP1_QC \
--indep-pairwise 250 kb 5 0.2 \
--out STEP2_PRUNE
plink --bfile STEP1_QC \
--extract STEP2_PRUNE.prune.in \
--make-bed --out STEP3_PRUNED
# LD matrix in selected regions or genome-wide
plink --bfile STEP1_QC \
--r2 gz yes-really --ld-window-kb 250 --ld-window 99999 --ld-window-r2 0.2 \
--out STEP4_LD
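The pairwise table written by the `--r2` step can be condensed into a simple LD decay summary. The sketch below assumes the PLINK 1.9 `.ld` column layout (CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2) and reports the pair count and mean r² per distance bin; the function name `ld_decay` and the 50 kb bin width are illustrative choices. Because `--ld-window-r2 0.2` drops pairs below 0.2, the means from that run are biased upward; rerunning with `--ld-window-r2 0` gives an unfiltered curve at the cost of a larger file.

```shell
# ld_decay [BIN_BP]: read a PLINK 1.9 .ld table from stdin and print, per
# distance bin, the bin start (bp), the number of pairs, and the mean r2.
ld_decay() {
  awk -v bin="${1:-50000}" '
    NR == 1 { next }                       # skip the header line
    {
      d = $5 - $2; if (d < 0) d = -d       # distance between the two variants
      b = int(d / bin)
      sum[b] += $7; cnt[b]++
      if (b > maxb) maxb = b
    }
    END {
      for (b = 0; b <= maxb; b++)
        if (cnt[b] > 0)
          printf "%d\t%d\t%.3f\n", b * bin, cnt[b], sum[b] / cnt[b]
    }'
}

# Typical use with the output of the command above:
#   gzip -dc STEP4_LD.ld.gz | ld_decay 50000 > STEP4_LD.decay.tsv
```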
Notes:
Structure of the LDlinkR package. (Myers T.A. et al. (2020) Frontiers in Genetics)
LD patterns and tag performance vary across ancestries. Validate transferability before fixing panels.
Standard deliverables we provide:
Long-range linkage disequilibrium visualization for the region [22 Mb–40 Mb], chromosome 6, surrounding the major histocompatibility complex. (Mourad R. et al. (2011) PLOS ONE)
For a refresher on core concepts, see LD 101 (Overview). For execution detail, see PLINK LD Workflow.
Use this simple flow to move from objectives to parameters.
Pro tip: Keep the pilot results as a mini-appendix in your report. It justifies parameter choices to reviewers and collaborators.
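A pilot can be scripted as a small parameter sweep. The sketch below prints (without running) PLINK 1.9 pruning commands for two step sizes and three r² thresholds on two pilot chromosomes. The choice of chromosomes 21 and 22, the 50-variant window, the parameter grid, and the `PILOT_*` output prefixes are all assumptions to adapt to your data.

```shell
# Print (not run) a small pruning sweep restricted to two pilot chromosomes.
# Assumptions: chromosomes 21 and 22 as the pilot, a 50-variant window, and
# PILOT_* output prefixes; all are placeholders.
pilot_sweep() {
  for step in 5 10; do
    for r2 in 0.1 0.2 0.5; do
      echo plink --bfile STEP1_QC --chr 21 22 \
        --indep-pairwise 50 "$step" "$r2" \
        --out "PILOT_s${step}_r${r2}"
    done
  done
}

pilot_sweep
# After running the commands, compare how many variants each setting keeps:
#   wc -l PILOT_*.prune.in
```

The retained-variant counts from the sweep are the kind of compact table that fits well in the report appendix.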
Copy this checklist into your inquiry. It shortens scoping and helps us return a precise plan.
To convert LD into panel design, read From LD to Tag SNPs next.
What MAF filter should I use for LD pruning?
Start with MAF ≥0.05 for robust pruning and stable r². Lower to 0.01 when your aim includes rare-variant signals, but tighten QC and monitor runtime.
Which r² threshold is best for tag SNP selection?
Use r² ≥0.8 for general tagging. For fine-mapping or high-stakes regions, consider 0.9. Always evaluate coverage across ancestry strata.
How large should my LD windows be on arrays versus WGS?
A safe default is 250 kb or 50 SNPs. On arrays, the variant-count rule helps when density is uneven. In high-recombination regions, shrink to 100–150 kb.
How many samples do I need for stable LD estimates?
Plan for ≥300 per ancestry for exploratory summaries and ≥500–1,000 for robust pruning and tagging. Admixed cohorts benefit from larger n.
Can one tag set fit all ancestries?
Often not. Tagging performance may drop in groups with faster LD decay. Use multi-ancestry tagging or ancestry-specific supplements.
Do imputed variants change thresholds?
Yes. Apply strict imputation quality filters before LD. Then revisit MAF and r² once the call set is stable.
How do I reduce runtime without losing key signal?
Increase step size modestly (e.g., from 5 to 10 variants), prefer variant-count windows on arrays, and pilot on two chromosomes to tune parameters.
Should I prune before PCA or association testing?
Yes, pruning at r²≈0.2 reduces collinearity and speeds models. Keep a record of parameters and the list of retained markers.
What reports should I include for reviewers?
Provide the parameter manifest, LD decay plots by ancestry, pruning metrics, and tag coverage tables. These materials establish transparency and rigor.
Where can I learn the basics before choosing thresholds?
Read LD 101 (Overview) for definitions, use-cases, and key metrics.
Well-chosen MAF filters, r² thresholds, LD window sizes, and sample sizes turn linkage disequilibrium analysis from a time sink into a decision engine. Start from clear objectives, respect ancestry and platform realities, and enforce QC at the door. Use a small pilot to lock parameters, then scale confidently. With stable r² and practical pruning, you gain faster GWAS, cleaner models, and efficient tag sets—without sacrificing power.
Ready to move from planning to execution?
Share your scoping checklist and a brief cohort summary. We will return a tailored MAF/r²/window plan, a QC template aligned to your platform, and a quote.
CD Genomics provides research-use services for institutions and companies. We do not offer personal or clinical testing.
For next reading, visit:
References