Case Study: LD-Based Marker Pruning Speeds a GWAS in Outbred Cohorts
Outbred cohort GWAS often stall under dense, correlated markers. Linkage disequilibrium (LD) explains why: correlated SNPs inflate test counts, memory needs, and post-hoc interpretation effort. This case study shows how LD pruning before association reduces marker redundancy and runtime without sacrificing discovery. Using a reproducible PLINK workflow, we trimmed the marker set, preserved index signals, and kept calibration stable, producing an analysis that project managers can schedule with confidence.
The Bottleneck in Outbred GWAS
Dense genotyping captures biology—and a lot of correlation. In outbred populations, nearby SNPs move together within LD blocks. That correlation raises the number of effectively similar tests and slows mixed-model engines. It also smears association peaks across regions, which makes downstream fine-mapping harder.
This study asked a practical question: how much speed and stability can we gain by pruning correlated SNPs before association, while keeping true signals? We framed pruning as a pre-GWAS transformation that selects a near-independent subset of variants. PLINK's LD-based pruning removes one variant from each pair exceeding a user-set r² threshold within sliding windows. The result is a lean marker set that approximates linkage equilibrium and reduces compute without erasing biology.
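The sliding-window selection can be sketched in a few lines of Python. This is a toy greedy pass over a genotype dosage matrix that illustrates the idea only; it is not PLINK's implementation, the function and variable names are ours, and the window here is counted in SNPs rather than kilobases.

```python
import numpy as np

def prune_window(genotypes, window, step, r2_max):
    """Toy sketch of sliding-window LD pruning (--indep-pairwise idea).

    genotypes: (n_samples, n_snps) dosage matrix (0/1/2).
    window/step are in SNP counts; r2_max is the pairwise r^2 cutoff.
    Returns indices of retained (near-independent) SNPs.
    """
    n_snps = genotypes.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    start = 0
    while start < n_snps:
        # SNPs still retained inside the current window
        idx = [i for i in range(start, min(start + window, n_snps)) if keep[i]]
        for pos, i in enumerate(idx):
            for j in idx[pos + 1:]:
                if not (keep[i] and keep[j]):
                    continue
                r = np.corrcoef(genotypes[:, i], genotypes[:, j])[0, 1]
                if r * r > r2_max:
                    keep[j] = False  # drop the later SNP of the correlated pair
        start += step  # slide the window forward
    return np.flatnonzero(keep)
```

A duplicated SNP (perfect LD) is dropped, while an uncorrelated SNP survives, which is exactly the redundancy-removal behavior the case study relies on.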
Comparison of LD maps from ABG and WGS, and linkage map. (Pengelly R.J. et al. (2015) BMC Genomics).
What Pruning Changes for PMs and Scientists
Business impact at a glance
- Panel reduction. Fewer, less redundant SNPs simplify multiple testing and sharpen peaks.
- Runtime and RAM ↓. Smaller design matrices speed mixed models and temper cloud cost.
- Model stability ↑. Lead hits persist; inflation is easier to track with λ<sub>GC</sub> or the LD Score Regression intercept.
Quick Summary
LD pruning removes variants above an r² cutoff inside a sliding window (PLINK --indep-pairwise). Typical starting points explore r² ≈ 0.10–0.20 and windows spanning local LD. Prune before the association step to reduce compute, then run a mixed-model engine such as BOLT-LMM (quantitative traits) or SAIGE (imbalanced case–control). This sequence keeps discovery power while making timelines predictable.
Cohort & Methods Snapshot (Reproducible Setup)
Cohort. A representative outbred cohort genotyped on a dense SNP array or imputed to a common reference. Standard QC included sample and marker missingness, Hardy–Weinberg tests, sex checks, relatedness screens, and ancestry control.
Pruning step. We used PLINK 1.9 --indep-pairwise with a small grid of r² and window parameters. We pinned software versions, exact flags, random seeds, and MAF filters. This makes runs auditable, repeatable, and easy to compare across parameter choices.
Association step. For continuous traits we used BOLT-LMM, which scales to biobank-sized data and provides practical runtime guidance. For imbalanced binary traits we used SAIGE, a generalized mixed model designed for case–control imbalance and relatedness. All tools ran in containers with versioned manifests and reference builds.
Calibration check. We evaluated λ<sub>GC</sub> and the LDSC intercept to distinguish confounding from polygenicity. These diagnostics complement Q–Q plots and help confirm that pruning did not introduce bias.
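λ<sub>GC</sub> itself is cheap to compute from the association chi-square statistics: the median observed statistic divided by the null median of a 1-df chi-square (≈0.455). A minimal sketch, assuming you already have per-variant chi-square values:

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def lambda_gc(chisq_stats):
    """Genomic inflation factor: median observed chi-square over the null median.

    Values near 1.0 suggest well-calibrated tests; values well above 1
    indicate inflation (confounding and/or polygenicity - use the LDSC
    intercept to distinguish the two).
    """
    return float(np.median(chisq_stats) / CHI2_1DF_MEDIAN)
```

Under the null (z-scores drawn from N(0,1), chi-square = z²), this returns a value close to 1.0.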
Why pruning fits the biology. LD and haplotype blocks vary across genomes and cohorts. Pruning reduces within-block redundancy before modeling. It also clarifies regional peaks so downstream interpretation and replication are faster.
Results That Matter (Markers ↓, Runtime ↓, Stability ↑)
1) Marker redundancy reduction.
Pruning removed highly correlated markers while retaining index variants. The reduced feature set lowered the effective test count and tightened regional peaks. Lead signals were easier to spot, describe, and hand off to experimental teams.
Expanded comparison of LD maps for a small region. (Pengelly R.J. et al. (2015) BMC Genomics).
2) Runtime and memory.
Compared with the unpruned baseline, the pruned design matrix cut I/O and linear algebra cost. Mixed-model solvers benefited from fewer columns and lower multicollinearity. Wall-clock time and peak RAM dropped, improving queue predictability on shared compute and reducing cloud spend.
3) Calibration and stability.
Top associations overlapped closely between pruned and unpruned runs. Q–Q plots showed similar tails. λ<sub>GC</sub> and the LDSC intercept remained stable, indicating controlled inflation and no systematic bias from pruning.
Calibration holds across marker representations: Q–Q and Manhattan plots agree well among SNPs, haplotype alleles, and haplotype blocks, with genomic control ~1—illustrating stable inflation and overlap of lead signals after dimension reduction (Chen H. et al. (2020) BMC Genomics).
Interpreting MAF and LD.
MAF thresholds interact with LD decay and the effective number of tests. Higher MAF cut-offs can raise mean r² among retained SNPs and extend half-decay distances. For robust pruning, inspect your cohort's LD decay curve and recombination context rather than copying parameters from an unrelated population.
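The pairwise statistic underlying all of this is the classic haplotype-based r². Given allele frequencies p_A and p_B and the haplotype frequency p_AB, D = p_AB − p_A·p_B and r² = D² / (p_A·q_A·p_B·q_B). A minimal sketch:

```python
def r_squared(p_a, p_b, p_ab):
    """Haplotype-based LD: r^2 = D^2 / (pA*qA*pB*qB), D = p_ab - p_a*p_b.

    Inputs are allele frequencies (p_a, p_b) and the frequency of the
    A-B haplotype (p_ab); all must be strictly between 0 and 1.
    """
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
```

Note that when the two allele frequencies differ, the attainable r² is bounded below 1, which is one reason MAF filters shift the r² distribution among retained SNPs.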
Distribution of allele frequencies between data sources. (Pengelly R.J. et al. (2015) BMC Genomics).
Sensitivity, Replication, and Significance Control
We profiled r² thresholds (for example 0.10, 0.15, 0.20), window sizes spanning local LD, and MAF filters. For every setting we tracked:
- Fraction of SNPs removed and retained
- Wall-clock time and peak RAM
- Overlap of top signals and credible sets
- λ<sub>GC</sub> and LDSC intercept
Key takeaways.
Moderate pruning captured most compute gains. Aggressive pruning offered diminishing returns and risked losing secondary, near-independent signals within dense loci. Replication of sentinel hits in a hold-out cohort stayed high across sensible settings when QC was strong.
LD-aware significance.
Bonferroni thresholds can be overly strict when tests are correlated. We therefore interpret p-values alongside an effective number of independent tests, which better reflects LD dependence without masking real associations.
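One common estimator of the effective test count is eigenvalue-based, in the spirit of Li & Ji (2005): decompose the SNP correlation matrix and score each eigenvalue. The sketch below is illustrative only; production pipelines typically compute this per chromosome or per LD block rather than genome-wide.

```python
import numpy as np

def m_eff(genotypes):
    """Effective number of independent tests from the SNP correlation matrix.

    Each eigenvalue lam contributes I(lam >= 1) + (lam - floor(lam)),
    so M fully independent SNPs give M_eff = M and M perfectly
    correlated SNPs give M_eff = 1.
    """
    corr = np.corrcoef(genotypes, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(corr), 0.0, None)  # guard tiny negatives
    return float(np.sum((eigvals >= 1).astype(float) + (eigvals - np.floor(eigvals))))
```

With two perfectly correlated SNPs plus one independent SNP, this returns approximately 2, matching the intuition that only two independent hypotheses are being tested.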
What You'll Receive (Deliverables & Formats)
We design deliverables for decision-makers and assay teams:
- Pruned marker list with exact PLINK flags and seeds used
- Parameter sheet (window, step, r², MAF) and rationale for the chosen preset
- GWAS outputs with λ<sub>GC</sub>, LDSC intercept, and publication-ready Manhattan and Q–Q plots
- Replication checklist and a short memo that translates compute savings into project days and cloud budget
- Optional: tag-SNP mapping and haplotype summaries to brief assay or panel design teams
FAQ: Practical Questions Teams Ask
1) Should we LD-prune only for PCA, or also before GWAS?
Pruning before PCA is standard to avoid components driven by high-LD regions. For GWAS, pruning is a project choice. It reduces compute and simplifies multiple testing. Modern mixed-model engines can run full density, but they need more time and memory. A common pattern is to run a pruned GWAS for speed, then re-evaluate promising regions at full density.
2) What's the difference between LD pruning and clumping?
Pruning selects a near-independent marker subset based on LD only and ignores p-values. Clumping happens after association; it groups SNPs by LD around index hits and keeps the top variant within each clump. Use pruning for dimension reduction, compute control, and PCA. Use clumping to summarise independent loci and for PRS-style reporting after GWAS.
3) How should we choose r² thresholds and window sizes?
Anchor choices to your LD decay curve and recombination landscape. Start with r² ≈ 0.10–0.20 and windows that span typical decay to background levels. Validate with a small grid search and track runtime, stability of top hits, and calibration metrics. Populations showing longer LD or higher MAF cut-offs may benefit from wider windows.
4) Does pruning reduce statistical power?
Pruning mostly removes redundant information. Lead associations typically persist. Over-pruning can drop secondary or conditionally independent signals within complex loci. For discovery, use moderate pruning and confirm top regions at full density. Pair pruning with clumping and fine-mapping for transparent reporting.
5) How do we verify that results are well-calibrated after pruning?
Inspect the Q–Q plot, λ<sub>GC</sub>, and the LDSC intercept. The intercept partitions inflation into components attributable to confounding versus polygenicity. Stable intercepts and overlapping top hits across pruned and unpruned runs indicate sound calibration.
Quick-Start Commands
Goal: reduce correlated SNPs before GWAS to save compute and clarify peaks.
Command pattern
```
plink --bfile [prefix] --indep-pairwise [window_kb] [step_snps] [r2] --out prune
plink --bfile [prefix] --extract prune.prune.in --make-bed --out data_pruned
```
Suggested starting grid
- r² = 0.10 / 0.15 / 0.20
- window = 50–250 kb
- step = 5–20 SNPs
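To keep grid runs auditable, the command lines can be generated programmatically. A minimal Python sketch, assuming the variant-count form of --indep-pairwise (window and step both in SNPs; kb-unit windows are also supported, but check your PLINK version's syntax before mixing kb windows with SNP steps). The prefix and output names are placeholders for your project.

```python
from itertools import product

def pruning_grid(bfile_prefix, r2_values, window_snps, step_snps):
    """Enumerate plink --indep-pairwise calls for a sensitivity grid.

    Output prefixes encode the parameters so runs never overwrite
    each other; pin these commands in your run manifest.
    """
    commands = []
    for r2, win, step in product(r2_values, window_snps, step_snps):
        out = f"prune_w{win}_s{step}_r{r2:g}"
        commands.append(
            f"plink --bfile {bfile_prefix} "
            f"--indep-pairwise {win} {step} {r2} --out {out}"
        )
    return commands
```

For example, two r² settings with one window/step pair yield two commands, ready to submit to a scheduler or record in a manifest.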
LD decays with distance: in a high-density cattle panel, mean r² drops from ~0.33–0.40 at <2.5 kb to ~0.05–0.07 at 400–500 kb, with LD-phase persistence decreasing accordingly—motivating window sizes that span typical decay in the target cohort (Mokry F.B. et al. (2014) BMC Genomics).
Then run GWAS
- BOLT-LMM for large quantitative traits
- SAIGE for imbalanced case–control
Report
Pre-/post-pruning marker counts, wall time, peak RAM, overlap of top hits, λ<sub>GC</sub>, and LDSC intercept.
Mini-Narrative: Before vs After
Baseline (no pruning).
The unpruned run involved millions of correlated tests. Wall time expanded and RAM approached cluster limits. Regional plots showed broad plateaus across LD blocks. Lead signals existed, but their boundaries were fuzzy.
Intervention (pruning).
We ran a small parameter grid and selected the best compute-to-stability trade-off. Marker count dropped markedly. Mixed-model runtimes fell and memory pressure eased. Peaks narrowed and became easier to interpret, with consistent sentinel variants across runs.
Outcome.
Pruned and unpruned analyses agreed on lead loci. Calibration metrics remained stable. The project schedule became more predictable and budget exposure lower—without losing biological signal.
Internal Learning: When Pruning Helps the Most
- Biobank-scale quantitative traits. BOLT-LMM scales well, yet pruning still reduces runtime spikes and smooths queues on shared compute.
- Imbalanced case–control traits. SAIGE handles imbalance; pruning reduces feature load entering the sparse GRM framework and speeds the single-variant phase.
- Panels with long-range LD. Some populations show extended LD and large blocks. Carefully chosen windows avoid under-pruning and capture real efficiency.
Start Your Project
Share four items to receive a rapid feasibility check:
- Sample size and trait type
- SNP count and genome build
- Desired r²/window starting point
- Any MAF or QC constraints
We will return a recommended preset for array vs WGS projects, expected runtime savings, and a draft deliverables list tailored to your cohort.
Get started.
Recommended Reading Inside This Hub:
- Study design inputs and thresholds: Practical Guide: Designing an LD Study (MAF, r² Thresholds, Sample Size)
- PLINK pruning workflow and parameters: Running LD the Right Way: PLINK Workflow, Parameters, and LD Pruning
- From LD to tag-SNP panels: From LD to Tag SNPs: Building Efficient Panels Without Losing Power
References
- Pengelly, R.J., Tapper, W., Gibson, J. et al. Whole genome sequences are required to fully resolve the linkage disequilibrium structure of human populations. BMC Genomics 16, 666 (2015).
- Mokry, F.B., Buzanskas, M.E., Mudadu, M.A. et al. Linkage disequilibrium and haplotype block structure in a composite beef cattle breed. BMC Genomics 15 (Suppl 7), S6 (2014).
- Chen, H., Hao, Z., Zhao, Y. et al. A fast-linear mixed model for genome-wide haplotype association analysis: application to agronomic traits in maize. BMC Genomics 21, 151 (2020).
- Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A.P. & Price, A.L. Mixed-model association for biobank-scale datasets. Nature Genetics 50, 906–908 (2018).
- Zhou, W., Nielsen, J.B., Fritsche, L.G. et al. Efficiently controlling for case–control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics 50, 1335–1341 (2018).
- Bulik-Sullivan, B.K., Loh, P.-R., Finucane, H.K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics 47, 291–295 (2015).
- Purcell, S., Neale, B., Todd-Brown, K. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559–575 (2007).