Running LD the Right Way: PLINK Workflow, Parameters, and LD Pruning
Need a reliable PLINK LD workflow you can trust across projects? This practical guide covers a production-ready LD pruning pipeline with sensible defaults, pairwise r² checks, and clear sliding-window settings. It keeps theory to a minimum and concentrates on execution at scale: inputs, commands, tuning on dense data, and reviewer-ready outputs.
Quick Start Defaults for PLINK LD
When you need defensible settings fast:
- Pruning goal: r² ≈ 0.2
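- Windowing: 250 kb or 50 SNPs, step 5–10 SNPs (--indep-pairwise takes one window unit per run, and kb needs the kb modifier; under --r2, --ld-window-kb and --ld-window apply together, so whichever cap is hit first wins)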
- MAF for stability: start at 0.05
- Strata: compute LD within ancestry groups
- Imputation: filter on INFO/R² before LD
- Outputs: pruned marker list, small LD matrices for QA, decay plot, parameter manifest
Inputs & QC Before LD
Garbage in, garbage out. LD calculations amplify upstream errors because pairwise r2 is sensitive to missingness, batch effects, and subtle stratification. Lock down a simple QC gate before you touch LD.
Minimal, executable QC (arrays/WGS)
- Sample call rate: --mind 0.02
- Marker call rate: --geno 0.02 (tighten to 0.01 for final runs)
- MAF for LD stability: --maf 0.05
- Hardy–Weinberg by ancestry: --hwe 1e-6 midp
- Relatedness: remove up to 2nd-degree (use your preferred kinship tool prior to PLINK)
- Structure: subset or label by ancestry (PCA/labels), then run LD per stratum
- Imputed data: apply INFO/R² filters upstream; only then compute LD
Distribution of allele frequencies between data sources. (Pengelly R.J. et al. (2015) BMC Genomics)
Executable QC command
plink --bfile INPUT \
--geno 0.02 --mind 0.02 \
--maf 0.05 \
--hwe 1e-6 midp \
--make-bed --out 01_qc
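Because the HWE test and the later LD steps are meant to run within ancestry groups, one way to organise the QC gate is to split samples into per-stratum keep files first. The following is a minimal sketch, assuming a hypothetical whitespace-delimited labels file with columns FID, IID, and POP; the file name, the labels, and the demo rows are invented for illustration. It prints the per-stratum commands instead of running them.

```shell
# Scratch directory with an invented labels file (FID IID POP).
workdir=$(mktemp -d)
labels="$workdir/ancestry_labels.txt"
cat > "$labels" <<'EOF'
F1 S1 EUR
F2 S2 EUR
F3 S3 AFR
F4 S4 EAS
F5 S5 AFR
EOF

# Write one --keep file per stratum: keep_<POP>.txt holding FID and IID.
awk -v dir="$workdir" '{ print $1, $2 > (dir "/keep_" $3 ".txt") }' "$labels"

# Print (not run) one QC command per stratum; drop "echo" to execute.
for keep in "$workdir"/keep_*.txt; do
  pop=$(basename "$keep" .txt)
  pop=${pop#keep_}
  echo plink --bfile INPUT --keep "$keep" \
    --geno 0.02 --mind 0.02 --maf 0.05 --hwe 1e-6 midp \
    --make-bed --out "01_qc_${pop}"
done > "$workdir/qc_commands.txt"
cat "$workdir/qc_commands.txt"
```

Each stratum then gets its own 01_qc_<POP> fileset, which keeps the later pruning and LD inspection steps homogeneous.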
Batch harmonization checklist (before merging cohorts)
- Build/coordinate alignment: confirm all batches use the same reference (GRCh37/38); liftover if needed.
- Allele consistency: recode to reference alleles; resolve strand flips (A/T, C/G sites need extra care).
- Duplicates/conflicts: remove duplicate rsIDs or colliding coordinates so each variant is unique.
- Batch labels: keep a batch field in metadata for stratified QA.
- Ancestry labels: prefer self-reported ancestry plus PCA confirmation; compute LD within strata.
Why these thresholds and checks matter is explained in the QC Gate section of Designing an LD Study.
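The allele-consistency and duplicate checks in the checklist above can be approximated directly on a .bim file before merging. This is a sketch on invented data, not a replacement for a proper harmonisation tool: it flags strand-ambiguous (A/T and C/G) sites, repeated variant IDs, and colliding chromosome:position pairs. All file names are hypothetical.

```shell
workdir=$(mktemp -d)
bim="$workdir/demo.bim"
# Tiny invented .bim (chr, ID, cM, bp, A1, A2) standing in for one batch.
printf '%s\n' \
  '1 rs1 0 1000 A G' \
  '1 rs2 0 2000 A T' \
  '1 rs3 0 3000 C G' \
  '1 rs4 0 4000 T C' \
  '1 rs4 0 5000 G A' \
  '1 rs5 0 4000 T C' > "$bim"

# Strand-ambiguous (palindromic) sites: A/T or C/G allele pairs.
awk '{ p = toupper($5 $6)
       if (p == "AT" || p == "TA" || p == "CG" || p == "GC") print $2 }' \
  "$bim" > "$workdir/ambiguous.snplist"

# Variant IDs that occur more than once.
awk '{ print $2 }' "$bim" | sort | uniq -d > "$workdir/dup_ids.txt"

# chr:bp positions shared by more than one record.
awk '{ print $1 ":" $4 }' "$bim" | sort | uniq -d > "$workdir/dup_pos.txt"

cat "$workdir/ambiguous.snplist" "$workdir/dup_ids.txt" "$workdir/dup_pos.txt"
```

The ambiguous list can be passed to --exclude if you decide to drop those sites instead of resolving them against allele frequencies.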
The PLINK LD Workflow: From BED to Pruned Set
1) Standardise inputs and produce the QC'd dataset
- Harmonise alleles/IDs if merging batches.
- Split or label by ancestry to keep LD homogeneous.
plink --bfile INPUT \
--geno 0.02 --mind 0.02 \
--maf 0.05 --hwe 1e-6 midp \
--make-bed --out 01_qc
2) Prune with a sliding window
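--indep-pairwise <window> <stepSNPs> <r2> removes correlated markers as the window slides. In PLINK 1.9 the window is a variant count unless it carries the kb modifier (e.g., 250kb).
# Default for many GWAS QC tasks (250-variant window; use 250kb for a distance-based window)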
plink --bfile 01_qc \
--indep-pairwise 250 5 0.2 \
--out 02_prune
# Keep only the independent set
plink --bfile 01_qc \
--extract 02_prune.prune.in \
--make-bed --out 03_pruned
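A quick sanity check after pruning is the share of variants that survived. The helper below is a small sketch that counts lines in the .prune.in and .prune.out files PLINK writes; the function name retained_ratio and the demo ID lists are invented for illustration.

```shell
# retained_ratio PREFIX: report kept, removed, and kept fraction for
# PREFIX.prune.in and PREFIX.prune.out (one variant ID per line).
retained_ratio() {
  kept=$(wc -l < "$1.prune.in")
  removed=$(wc -l < "$1.prune.out")
  awk -v k="$kept" -v r="$removed" 'BEGIN {
    total = k + r
    if (total == 0) { print "no variants"; exit 1 }
    printf "kept=%d removed=%d fraction=%.3f\n", k, r, k / total
  }'
}

# Demo on invented ID lists standing in for real PLINK output.
workdir=$(mktemp -d)
printf 'rs1\nrs4\nrs9\n' > "$workdir/02_prune.prune.in"
printf 'rs2\nrs3\nrs5\nrs6\nrs7\nrs8\nrs10\n' > "$workdir/02_prune.prune.out"
summary=$(retained_ratio "$workdir/02_prune")
echo "$summary"
```

Running it per chromosome and per stratum gives the retained-SNP figures that the tuning and acceptance checks later in this guide rely on.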
3) Inspect LD and export matrices (targeted QA)
# Inspect local LD (pairwise r2) by distance-capped window
plink --bfile 01_qc \
--r2 gz yes-really \
--ld-window-kb 250 --ld-window 99999 --ld-window-r2 0.2 \
--out 04_ld_kb
# On dense WGS, cap comparisons by variant count
plink --bfile 01_qc \
--r2 gz yes-really \
--ld-window 50 --ld-window-r2 0.2 \
--out 04_ld_count
Choosing LD inspection modes (kb vs count)
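- Use --ld-window-kb when you want to interpret LD as a function of physical distance (ideal for decay curves and reports). For decay curves, set --ld-window-r2 0 so that low-r² pairs are not filtered out of the report.
- Use --ld-window (count cap) in dense WGS to limit comparisons, control runtime, and reduce file size.
- Add the gz modifier to --r2 to compress outputs for easier sharing and archiving; yes-really is PLINK's confirmation for very large unfiltered reports and does not affect compression.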
- Name files to reflect chromosome/stratum/window strategy (e.g., 04_ld_EUR_chr10_kb250) for traceability.
Schematic illustration for large-scale linkage disequilibrium (LD) analysis as exampled for CONVERGE cohort. (Huang X. et al. (2023) eLife)
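To turn a pairwise report into decay-curve data, the r² values can be averaged within distance bins. The sketch below assumes the standard PLINK 1.9 table layout (CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2, with a header line); the function name ld_decay and the four demo rows are invented. For an unbiased curve the input should come from a run with --ld-window-r2 0, since any higher threshold drops the low-r² pairs.

```shell
# ld_decay BIN_KB < report.ld
# Prints: bin start (kb), number of pairs, mean r2 for each non-empty bin.
ld_decay() {
  awk -v bin_kb="$1" '
    NR == 1 { next }                      # skip the header line
    { d = $5 - $2; if (d < 0) d = -d      # pair distance in bp
      b = int(d / (bin_kb * 1000))        # bin index
      sum[b] += $7; n[b]++
      if (b > max) max = b }
    END { for (b = 0; b <= max; b++)
            if (n[b]) printf "%d\t%d\t%.4f\n", b * bin_kb, n[b], sum[b] / n[b] }'
}

# Demo on an invented report with the PLINK column layout.
workdir=$(mktemp -d)
cat > "$workdir/demo.ld" <<'EOF'
 CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
 1 1000 rs1 1 21000 rs2 0.9
 1 1000 rs1 1 41000 rs3 0.7
 1 1000 rs1 1 61000 rs4 0.4
 1 21000 rs2 1 141000 rs5 0.25
EOF
decay=$(ld_decay 50 < "$workdir/demo.ld")
echo "$decay"
```

With compressed output the same function can read from a pipe, for example `gzip -dc 04_ld_kb.ld.gz | ld_decay 50`.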
4) Make the run reproducible
- Containerise PLINK (Docker/Singularity) and pin versions.
- Log commands, seeds, input checksums, and timestamps.
- Parallelise safely by chromosome to speed runs.
- Name files predictably (e.g., 01_qc / 02_prune / 03_pruned / 04_ld_*).
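The logging items above can be covered with a thin wrapper that records each command and the input checksums before anything runs. This is a sketch under stated assumptions: the helper names sha256 and run_logged, the manifest.tsv layout, and the stand-in input file are all invented, and echo takes the place of plink so the example runs anywhere.

```shell
# SHA-256 of a file, using whichever tool is installed.
sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# run_logged MANIFEST CMD...: append a UTC timestamp and the exact
# command line to MANIFEST, then run the command.
run_logged() {
  manifest_file=$1; shift
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$manifest_file"
  "$@"
}

workdir=$(mktemp -d)
manifest="$workdir/manifest.tsv"
printf 'demo input\n' > "$workdir/INPUT.bed"   # stand-in for a real fileset

# Record the input checksum, and the PLINK version if PLINK is installed.
printf 'checksum\t%s\t%s\n' "$(sha256 "$workdir/INPUT.bed")" INPUT.bed >> "$manifest"
if command -v plink >/dev/null 2>&1; then
  printf 'version\t%s\n' "$(plink --version)" >> "$manifest"
fi

# echo stands in for the real call; remove it to execute PLINK.
run_logged "$manifest" echo plink --bfile 01_qc --indep-pairwise 250 5 0.2 --out 02_prune
```

Keeping the manifest append-only means the command history in the delivery bundle matches what was actually executed.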
Execution Tuning: Windows, Steps, Thresholds, and Dense Data
Performance & scaling (practical guidance)
- Runtime rises with marker density and window length, and falls as step increases. Dense WGS will run slower than arrays under identical settings.
- Pilot on 1–2 chromosomes: compare step=5 vs 10 and 250 kb vs 150 kb; choose the best trade-off in retained SNPs, stability, and time.
- In very dense regions, enable count caps (e.g., --ld-window 50) and per-chromosome parallelism; record thread counts and I/O limits (shared storage can bottleneck).
- Monitor: retained SNP ratio per chromosome, peak RAM, and time per million comparisons; non-monotonic patterns usually flag missingness or batch issues.
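The pilot comparison described above (step 5 vs 10, 250 kb vs 150 kb) is easy to script as a small grid. The sketch below only prints the commands. It assumes PLINK 1.9, where a distance window is written with the kb modifier; the pilot chromosome and the output names are arbitrary choices for illustration.

```shell
workdir=$(mktemp -d)
chr=22   # pilot chromosome, chosen arbitrarily for this sketch

# One command per window/step combination, written to a job list.
for win in 250kb 150kb; do
  for step in 5 10; do
    out="pilot_chr${chr}_${win}_step${step}"
    echo plink --bfile 01_qc --chr "$chr" \
      --indep-pairwise "$win" "$step" 0.2 --out "$out"
  done
done > "$workdir/pilot_commands.txt"
cat "$workdir/pilot_commands.txt"
```

After the four runs finish, compare retained SNP counts and wall-clock time per combination before fixing the genome-wide settings.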
r² thresholds (execution view)
- Start pruning at 0.2. For dense or highly structured regions, test 0.1 and 0.3 on one chromosome and compare retained SNPs and downstream model stability.
- For tagging, use ≥0.8 (fine-mapping can push ~0.9 in target regions). Keep tagging logic in your panel-design pipeline; avoid mixing with pruning defaults.
Window size and step (distance vs density)
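- Default: 250 kb (with the kb modifier) or 50 SNPs for pruning; step 5–10 SNPs. A combined kb-plus-count cap is available for --r2 inspection via --ld-window-kb and --ld-window.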
- High recombination: reduce to 100–150 kb to avoid mixing distinct blocks.
- Sparse arrays: SNP-count windows stabilise comparisons when spacing is uneven.
- Dense WGS: always cap by count for LD inspection; tile by chromosome for parallelism.
Runtime levers and parallelism
- Step size is the main throttle: a larger step means fewer window positions to evaluate, often with little effect on pruning results. Confirm this on a pilot chromosome.
- Use --threads judiciously; avoid I/O contention on shared filesystems.
- Split by chromosome; aggregate reports afterward.
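Splitting by chromosome and aggregating afterwards can be sketched as a job list plus a concatenation step. The thread count, the output prefixes, and the two per-chromosome result files below are invented for illustration; the job list is written but not executed.

```shell
workdir=$(mktemp -d)

# One pruning command per autosome, written to a job list (dry run).
chr=1
while [ "$chr" -le 22 ]; do
  echo "plink --bfile 01_qc --chr $chr --indep-pairwise 250 5 0.2 --threads 2 --out 02_prune_chr$chr"
  chr=$((chr + 1))
done > "$workdir/jobs.txt"

# To execute four jobs at a time, one option is:
#   xargs -P 4 -I{} sh -c '{}' < jobs.txt

# Aggregate afterwards. Two invented per-chromosome lists stand in
# for real PLINK output here.
printf 'rs1\nrs2\n' > "$workdir/02_prune_chr1.prune.in"
printf 'rs3\n'      > "$workdir/02_prune_chr2.prune.in"
cat "$workdir"/02_prune_chr*.prune.in > "$workdir/02_prune_all.prune.in"
```

Keep the number of concurrent jobs times --threads within the cores available, and watch shared storage, since many simultaneous readers of the same .bed file can become the bottleneck.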
Imputed and mixed data
- Filter imputed variants by INFO/R² first; then compute LD on the filtered set.
- If you must mix sources, validate LD patterns on raw versus imputed subsets during QA and only then pool.
Edge regions and practical checks
- Long-range LD areas (e.g., MHC) can dominate summaries. Exclude during genome-wide pruning and report separately with context.
- Chromosome X needs sex-aware handling: confirm sex assignments first, and review PLINK's --ld-xchr option, which controls how male X genotypes are coded in r² calculations.
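Excluding long-range LD regions during genome-wide pruning is usually done with a range file passed to --exclude range. The sketch below writes such a file and previews which variants it would drop from an invented .bim. The MHC interval used (chr6:25–35 Mb) is a commonly used approximate bound and is an assumption here; confirm the coordinates for your genome build and add any other regions you treat as long-range.

```shell
workdir=$(mktemp -d)

# Range file: chr, start bp, end bp, label.
# ASSUMPTION: approximate MHC bounds; verify for GRCh37/GRCh38.
printf '6 25000000 35000000 MHC\n' > "$workdir/long_range_ld.txt"

# Invented .bim rows to preview the effect of the range.
printf '%s\n' \
  '6 rsA 0 24999999 A G' \
  '6 rsB 0 30000000 C T' \
  '6 rsC 0 35000001 G A' \
  '7 rsD 0 30000000 T C' > "$workdir/demo.bim"

# List variants that fall inside any range (bounds treated as inclusive).
awk 'NR == FNR { c[NR] = $1; s[NR] = $2; e[NR] = $3; n = NR; next }
     { for (i = 1; i <= n; i++)
         if ($1 == c[i] && $4 >= s[i] && $4 <= e[i]) { print $2; break } }' \
  "$workdir/long_range_ld.txt" "$workdir/demo.bim" > "$workdir/in_range.txt"
cat "$workdir/in_range.txt"

# The matching pruning call (printed, not run).
echo plink --bfile 01_qc --exclude range "$workdir/long_range_ld.txt" \
  --indep-pairwise 250 5 0.2 --out 02_prune_noLRLD
```

The excluded regions can then be analysed on their own with --chr plus --from-bp and --to-bp and reported separately.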
Regions with atypical LD
- Long-range LD: handle as above—separate analysis and reporting.
- Inversions/selection signals: expect extended LD; review by segment and note plausible causes (structural variation, local selection, demographic history).
- Runs of homozygosity (ROH): large ROH inflate local correlation, especially in inbred/related samples; consider removing those samples or masking ROH for global summaries.
Reporting That Reviewers Accept
Deliver an output bundle that collaborators can reuse and journals can audit.
Include
- Pruned set (.bed/.bim/.fam or variant list) with build/versions noted.
- LD decay curves per stratum (annotate MAF, r², window, and step).
- LD heatmaps for representative regions with consistent r² colour scales.
- Tag coverage tables (if tagging): r² thresholds, proxies covered, per-region summaries.
- Parameter manifest: all QC gates, windows, steps, thresholds; PLINK version; container image/tag; command history; seeds; input checksums.
- Reproducibility bundle: scripts or a Makefile/Snakemake definition.
Folder & file schema (delivery suggestion)
- 00_meta/: manifest.yaml, command logs, container/version info, input summary (samples, variants, build).
- 01_qc/: pre/post QC stats (--freq/--missing) and PCA coordinates (for ancestry verification).
- 02_prune/: *.prune.in/out, retained set bed/bim/fam.
- 03_ld/: representative LD matrices (.gz), decay raw data and plots.
- 04_reporting/: heatmaps, coverage tables (CSV/TSV), and a one-pager methods/parameters sheet.
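The folder schema above can be created up front so every run writes to the same places. A minimal sketch; the delivery root name ld_delivery is hypothetical.

```shell
# Create the delivery skeleton under a scratch root for the demo.
root="$(mktemp -d)/ld_delivery"
for d in 00_meta 01_qc 02_prune 03_ld 04_reporting; do
  mkdir -p "$root/$d"
done

# Empty manifest placeholder, to be filled in by the pipeline.
: > "$root/00_meta/manifest.yaml"
ls "$root"
```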
Acceptance checks (quick sign-off)
- Retained SNP proportion aligns with expectations (too low suggests over-pruning or strict settings).
- Mean/median r² in the retained set drops versus baseline; model VIF improves; decay curves look plausible per ancestry.
- Figures label MAF, r², window, and step; colour scales remain consistent across plots.
Various linkage disequilibrium (LD) components for the 26 1KG cohorts. (Huang X. et al. (2023) eLife)
Avoid
- Over-pruning that erases structure in gene-dense areas.
- Pooled multi-ancestry LD without stratum analysis.
- LD computed on unfiltered imputed calls.
FAQs
What are pragmatic PLINK LD parameters for fast, stable pruning?
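Use --indep-pairwise 250 5 0.2 as a dependable baseline; in PLINK 1.9 this is a 250-variant window with a 5-variant step. For a distance-based window, add the kb modifier (250kb). On dense WGS, trial a smaller count-based window (50 5 0.2) on one chromosome and compare retained markers and runtime before scaling.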
How do I set sliding windows: kb or SNP count—and can I combine them?
--indep-pairwise takes a window (a variant count, or kilobases with the kb modifier), a step (SNPs), and r². For inspection via --r2, you can cap by kb (--ld-window-kb) and by SNP count (--ld-window) at the same time. In practice, use kb for biological distance and the count cap to limit compute on dense data.
How can I limit comparisons on dense WGS and still keep signal?
Increase the step from 5 to 10, cap by SNP count in LD inspection, and parallelise by chromosome. These changes reduce compute cost with little impact on pruning quality.
How do I make runs fully reproducible across machines and analysts?
Containerise PLINK, pin versions, log commands and seeds, store input checksums, and ship a parameter manifest. Keep file naming deterministic and avoid manual edits to intermediate files.
What should I do about the MHC and other long-range LD regions?
Inspect them separately. Consider region-specific thresholds or report them as special cases so they don't dominate genome-wide summaries.
Can I mix raw genotypes and imputed variants in LD calculations?
Yes—but only after strict imputation-quality filtering. Validate LD patterns separately on raw and imputed subsets during QA, then pool if they agree.
Do I need to publish the parameter manifest?
It's recommended. The manifest records the key PLINK LD and pruning parameters (MAF, r², windows, step, versions/containers, ancestry stratification, special-region handling), which helps peer review and keeps internal reruns consistent.
Conclusion
A disciplined LD pruning pipeline makes PLINK LD a dependable foundation for GWAS, structure analysis, and tag SNP panel design. Keep the QC gate executable and light, prune with a sliding window that respects both distance and density, and cap comparisons on dense WGS. Run a small pilot per chromosome to tune step and windows, then lock everything in a manifest so teammates and reviewers can reproduce your work.
From pipeline to downstream
- GWAS/confounding control: use the pruned set for PCA/GRM modelling, where reduced collinearity matters, and run association on the full QC'd set.
- PRS/panel design: for PRS or targeted panels, combine --show-tags with --tag-r2 (e.g., 0.8) and ancestry-specific evaluation to improve transferability.
- Traceability: archive the pruned ID list with the manifest as the "data lineage" anchor for all downstream models.
References
- Huang, X. et al. Efficient estimation for large-scale linkage disequilibrium patterns of the human genome. eLife 12, e90636 (2023).
- Pengelly, R.J., Tapper, W., Gibson, J. et al. Whole genome sequences are required to fully resolve the linkage disequilibrium structure of human populations. BMC Genomics 16, 666 (2015).
- Chang, C.C., Chow, C.C., Tellier, L.C.A.M. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).
- The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
- Purcell, S., Neale, B., Todd-Brown, K. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559–575 (2007).
- Patterson, N., Price, A.L., Reich, D. Population structure and eigenanalysis. PLOS Genetics 2, e190 (2006).
- Wigginton, J.E., Cutler, D.J., Abecasis, G.R. A note on exact tests of Hardy–Weinberg equilibrium. American Journal of Human Genetics 76, 887–893 (2005).