GWAS of outbred cohorts often stall under dense, correlated markers. Linkage disequilibrium (LD) explains why: it inflates test counts, memory needs, and the effort of post-hoc interpretation. In this case study, we show how LD pruning delivers meaningful marker-redundancy reduction and runtime savings without sacrificing discovery. Using a reproducible PLINK workflow, we trimmed the feature set, preserved index signals, and stabilized calibration, producing an analysis that project managers can schedule with confidence.
Dense genotyping captures biology—and a lot of correlation. In outbred populations, nearby SNPs move together within LD blocks. That correlation raises the number of effectively similar tests and slows mixed-model engines. It also smears association peaks across regions, which makes downstream fine-mapping harder.
This study asked a practical question. How much speed and stability can we gain by pruning correlated SNPs before association—while keeping true signals? We framed pruning as a pre-GWAS transformation that selects a near-independent subset of variants. PLINK's LD-based pruning removes pairs exceeding a user-set r² threshold within sliding windows. The result is a lean marker set that approximates linkage equilibrium and reduces compute without erasing biology.
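To make the selection logic concrete, here is a minimal sketch of greedy sliding-window pairwise pruning. It assumes a genotype matrix of 0/1/2 allele counts ordered by position; it mimics the logic of PLINK's --indep-pairwise but is not its exact implementation.

```python
import numpy as np

def indep_pairwise(genotypes, window=50, step=5, r2_max=0.2):
    """Greedy sliding-window LD pruning (sketch, not PLINK's exact code).

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts,
    assumed ordered by genomic position. Returns kept SNP indices."""
    n_snps = genotypes.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    start = 0
    while start < n_snps:
        # Consider only SNPs still kept inside the current window
        idx = [i for i in range(start, min(start + window, n_snps)) if keep[i]]
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                i, j = idx[a], idx[b]
                if not (keep[i] and keep[j]):
                    continue
                r = np.corrcoef(genotypes[:, i], genotypes[:, j])[0, 1]
                if r * r > r2_max:
                    keep[j] = False  # drop the later SNP of the offending pair
        start += step
    return np.flatnonzero(keep)
```

The tie-breaking rule (which SNP of a high-r² pair survives) differs between implementations; PLINK uses variance-based criteria, while this sketch simply keeps the earlier SNP.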
Comparison of LD maps from ABG and WGS, and linkage map. (Pengelly R.J. et al. (2015) BMC Genomics).
Quick Summary
LD pruning removes variants above an r² cutoff inside a sliding window (PLINK --indep-pairwise). Typical starting points explore r² ≈ 0.10–0.20 and windows spanning local LD. Prune before the association step to reduce compute, then run a mixed-model engine such as BOLT-LMM (quantitative traits) or SAIGE (imbalanced case–control). This sequence keeps discovery power while making timelines predictable.
Cohort. A representative outbred cohort genotyped on a dense SNP array or imputed to a common reference. Standard QC included sample and marker missingness, Hardy–Weinberg tests, sex checks, relatedness screens, and ancestry control.
Pruning step. We used PLINK 1.9 --indep-pairwise with a small grid of r² and window parameters. We pinned software versions, exact flags, random seeds, and MAF filters. This makes runs auditable, repeatable, and easy to compare across parameter choices.
Association step. For continuous traits we used BOLT-LMM, which scales to biobank-sized data and provides practical runtime guidance. For imbalanced binary traits we used SAIGE, a generalized mixed model designed for case–control imbalance and relatedness. All tools ran in containers with versioned manifests and reference builds.
Calibration check. We evaluated λ<sub>GC</sub> and the LDSC intercept to distinguish confounding from polygenicity. These diagnostics complement Q–Q plots and help confirm that pruning did not introduce bias.
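The λ<sub>GC</sub> part of this check is simple to compute from the association statistics. A minimal sketch, assuming you have the per-SNP z-scores (λ<sub>GC</sub> near 1 indicates well-calibrated tests; the LDSC intercept requires the separate `ldsc` tooling and is not shown here):

```python
import numpy as np

def lambda_gc(z_scores):
    """Genomic inflation factor: the median of the squared association
    z-scores divided by the expected median of a 1-df chi-square (~0.4549)."""
    chisq = np.asarray(z_scores, dtype=float) ** 2
    return float(np.median(chisq) / 0.4549)
```

Under the null, z-scores are standard normal and λ<sub>GC</sub> sits at 1; uniform inflation of the test statistics scales it upward.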
Why pruning fits the biology. LD and haplotype blocks vary across genomes and cohorts. Pruning reduces within-block redundancy before modeling. It also clarifies regional peaks so downstream interpretation and replication are faster.
1) Marker redundancy reduction.
Pruning removed highly correlated markers while retaining index variants. The reduced feature set lowered the effective test count and tightened regional peaks. Lead signals were easier to spot, describe, and hand off to experimental teams.
Expanded comparison of LD maps for a small region. (Pengelly R.J. et al. (2015) BMC Genomics).
2) Runtime and memory.
Compared with the unpruned baseline, the pruned design matrix cut I/O and linear algebra cost. Mixed-model solvers benefited from fewer columns and lower multicollinearity. Wall-clock time and peak RAM dropped, improving queue predictability on shared compute and reducing cloud spend.
3) Calibration and stability.
Top associations overlapped closely between pruned and unpruned runs. Q–Q plots showed similar tails. λ<sub>GC</sub> and the LDSC intercept remained stable, indicating controlled inflation and no systemic bias from pruning.
Calibration holds across marker representations: Q–Q and Manhattan plots agree well among SNPs, haplotype alleles, and haplotype blocks, with genomic control near 1, illustrating stable inflation and overlapping lead signals after dimension reduction (Chen H. et al. (2020) BMC Genomics).
Interpreting MAF and LD.
MAF thresholds interact with LD decay and the effective number of tests. Higher MAF cut-offs can raise mean r² among retained SNPs and extend half-decay distances. For robust pruning, inspect your cohort's LD decay curve and recombination context rather than copying parameters from an unrelated population.
Distribution of allele frequencies between data sources. (Pengelly R.J. et al. (2015) BMC Genomics).
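Inspecting the decay curve is straightforward once pairwise r² and inter-SNP distances are in hand. A minimal sketch, assuming a genotype matrix and base-pair positions (a real cohort would compute this per chromosome and on many more pairs):

```python
import numpy as np

def ld_decay_curve(genotypes, positions_bp, bin_edges):
    """Mean pairwise r^2 as a function of inter-SNP distance.

    genotypes: (n_samples, n_snps) 0/1/2 matrix; positions_bp: bp
    coordinate per SNP; bin_edges: ascending distance bin edges.
    Returns one mean r^2 per distance bin."""
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2   # SNP-by-SNP r^2
    pos = np.asarray(positions_bp)
    iu, ju = np.triu_indices(len(pos), k=1)          # all SNP pairs
    dist = np.abs(pos[ju] - pos[iu])
    bin_idx = np.digitize(dist, bin_edges)           # assign pairs to bins
    pair_r2 = r2[iu, ju]
    return [float(pair_r2[bin_idx == k].mean())
            for k in range(1, len(bin_edges))]
```

Where the curve flattens to background r² is a reasonable anchor for the pruning window size.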
We profiled r² thresholds (for example 0.10, 0.15, 0.20), window sizes spanning local LD, and MAF filters. For every setting we tracked marker counts, wall time, peak RAM, overlap of top hits, λ<sub>GC</sub>, and the LDSC intercept.
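Expanding such a grid is a one-liner; the values below are hypothetical and simply mirror the ranges discussed in the text, with flag strings following PLINK 1.9's `--indep-pairwise <window>kb <step> <r²>` pattern:

```python
from itertools import product

# Hypothetical grid mirroring the text: r^2 cutoffs, window sizes (kb),
# and MAF filters. Step count is fixed at 1 for illustration.
r2_grid = [0.10, 0.15, 0.20]
window_kb_grid = [250, 500, 1000]
maf_grid = [0.01, 0.05]

settings = [
    {"r2": r2, "window_kb": w, "maf": m,
     "flags": f"--maf {m} --indep-pairwise {w}kb 1 {r2}"}
    for r2, w, m in product(r2_grid, window_kb_grid, maf_grid)
]
# Each setting is run once; for each, record marker count, wall time,
# peak RAM, overlap of top hits, lambda_GC, and the LDSC intercept.
```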
Key takeaways.
Moderate pruning captured most compute gains. Aggressive pruning offered diminishing returns and risked losing secondary, near-independent signals within dense loci. Replication of sentinel hits in a hold-out cohort stayed high across sensible settings when QC was strong.
LD-aware significance.
Bonferroni thresholds can be overly strict when tests are correlated. We therefore interpret p-values alongside an effective number of independent tests, which better reflects LD dependence without masking real associations.
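One widely used estimate of the effective test count is the simpleM approach of Gao et al., which counts the leading eigenvalues of the SNP correlation matrix. A minimal sketch (a real analysis would apply this per LD block or chromosome, since the full genome-wide correlation matrix is too large to eigendecompose at once):

```python
import numpy as np

def effective_tests(genotypes, var_cutoff=0.995):
    """simpleM-style estimate of the effective number of independent
    tests: the number of leading eigenvalues of the SNP correlation
    matrix needed to explain `var_cutoff` of the total variance."""
    corr = np.corrcoef(genotypes, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # descending
    cum_frac = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum_frac, var_cutoff) + 1)
```

The LD-aware significance threshold is then 0.05 divided by this estimate rather than by the raw marker count.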
We design deliverables for decision-makers and assay teams.
1) Should we LD-prune only for PCA, or also before GWAS?
Pruning before PCA is standard to avoid components driven by high-LD regions. For GWAS, pruning is a project choice. It reduces compute and simplifies multiple testing. Modern mixed-model engines can run full density, but they need more time and memory. A common pattern is to run a pruned GWAS for speed, then re-evaluate promising regions at full density.
2) What's the difference between LD pruning and clumping?
Pruning selects a near-independent marker subset based on LD only and ignores p-values. Clumping happens after association; it groups SNPs by LD around index hits and keeps the top variant within each clump. Use pruning for dimension reduction, compute control, and PCA. Use clumping to summarise independent loci and for PRS-style reporting after GWAS.
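A toy contrast makes the distinction explicit. The SNP names, groups, and p-values below are hypothetical, and `ld_group` stands in for high pairwise r² among its members:

```python
# (name, ld_group, p-value); ld_group is a stand-in for high pairwise r^2
snps = [
    ("rs1", "A", 1e-8), ("rs2", "A", 5e-7),
    ("rs3", "B", 3e-4), ("rs4", "B", 1e-9),
    ("rs5", "C", 0.20),
]

# Pruning: one representative per LD group, chosen without p-values
# (here the first SNP encountered, as position-ordered pruning would do).
pruned = {}
for name, group, _ in snps:
    pruned.setdefault(group, name)

# Clumping: after association, keep the most significant SNP per group.
clumped = {}
for name, group, p in snps:
    if group not in clumped or p < clumped[group][1]:
        clumped[group] = (name, p)

# In group B, pruning keeps rs3 but clumping keeps rs4, the stronger hit.
```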
3) How should we choose r² thresholds and window sizes?
Anchor choices to your LD decay curve and recombination landscape. Start with r² ≈ 0.10–0.20 and windows that span typical decay to background levels. Validate with a small grid search and track runtime, stability of top hits, and calibration metrics. Populations showing longer LD or higher MAF cut-offs may benefit from wider windows.
4) Does pruning reduce statistical power?
Pruning mostly removes redundant information. Lead associations typically persist. Over-pruning can drop secondary or conditionally independent signals within complex loci. For discovery, use moderate pruning and confirm top regions at full density. Pair pruning with clumping and fine-mapping for transparent reporting.
5) How do we verify that results are well-calibrated after pruning?
Inspect the Q–Q plot, λ<sub>GC</sub>, and the LDSC intercept. The intercept partitions inflation into components attributable to confounding versus polygenicity. Stable intercepts and overlapping top hits across pruned and unpruned runs indicate sound calibration.
Goal: reduce correlated SNPs before GWAS to save compute and clarify peaks.
Command pattern
```shell
plink --bfile [prefix] --indep-pairwise [window_kb] [step_snps] [r2] --out prune
plink --bfile [prefix] --extract prune.prune.in --make-bed --out data_pruned
```
Suggested starting grid: r² ≈ 0.10–0.20, window sizes spanning local LD decay, and MAF filters matched to your cohort.
LD decays with distance: in a high-density cattle panel, mean r² drops from ~0.33–0.40 at <2.5 kb to ~0.05–0.07 at 400–500 kb, with LD-phase persistence decreasing accordingly—motivating window sizes that span typical decay in the target cohort (Mokry F.B. et al. (2014) BMC Genomics).
Then run GWAS with a mixed-model engine: BOLT-LMM for quantitative traits or SAIGE for imbalanced case–control.
Report
Pre-/post-pruning marker counts, wall time, peak RAM, overlap of top hits, λ<sub>GC</sub>, and LDSC intercept.
Baseline (no pruning).
The unpruned run involved millions of correlated tests. Wall time expanded and RAM approached cluster limits. Regional plots showed broad plateaus across LD blocks. Lead signals existed, but their boundaries were fuzzy.
Intervention (pruning).
We ran a small parameter grid and selected the best compute-to-stability trade-off. Marker count dropped markedly. Mixed-model runtimes fell and memory pressure eased. Peaks narrowed and became easier to interpret, with consistent sentinel variants across runs.
Outcome.
Pruned and unpruned analyses agreed on lead loci. Calibration metrics remained stable. The project schedule became more predictable and budget exposure shrank, without losing biological signal.
Share four items to receive a rapid feasibility check.
We will return a recommended preset for array vs WGS projects, expected runtime savings, and a draft deliverables list tailored to your cohort.
Get started.