Running LD the Right Way: PLINK Workflow, Parameters, and LD Pruning

Need a PLINK LD workflow you can trust across projects? This practical guide covers a production-ready LD pruning pipeline with sensible defaults, pairwise r² checks, and clear sliding-window settings. It is built for linkage disequilibrium analysis at scale, so theory is kept to a minimum and the focus is on execution: inputs, commands, tuning on dense data, and reviewer-ready outputs.

Quick Start Defaults for PLINK LD

When you need defensible settings fast:

Pruning goal: r² ≈ 0.2
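Windowing: prune with a 250 kb window or a 50-SNP window (one unit per --indep-pairwise run), step 5–10 SNPs; for --r2 inspection a kb cap and a count cap can apply together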
MAF for stability: start at 0.05
Strata: compute LD within ancestry groups
Imputation: filter on INFO/R² before LD
Outputs: pruned marker list, small LD matrices for QA, decay plot, parameter manifest

Inputs & QC Before LD

Garbage in, garbage out. LD calculations amplify upstream errors because pairwise r² is sensitive to missingness, batch effects, and subtle stratification. Lock down a simple QC gate before you touch LD.
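To make the sensitivity to missingness concrete, here is a minimal Python sketch of pairwise r² computed as the squared Pearson correlation of allele counts (0/1/2), using only samples called at both markers. The function name and the toy genotypes are illustrative assumptions, and PLINK's own implementation is far more optimised.

```python
import math

def pairwise_r2(g1, g2):
    """Squared Pearson correlation of allele counts (0/1/2).

    Missing calls are given as None; a sample contributes only if both
    markers are called, so missingness shrinks the effective sample size.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return math.nan
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return math.nan  # monomorphic among the jointly called samples
    return (cov * cov) / (var_a * var_b)

# Toy data (hypothetical): the same two markers with and without missing calls.
full = pairwise_r2([0, 0, 1, 1, 2, 2], [0, 1, 0, 2, 1, 2])
thin = pairwise_r2([0, 0, 1, 1, None, 2], [0, 1, 0, None, 1, 2])
print(f"r2 with all calls: {full:.3f}; with two missing calls: {thin:.3f}")
```

In this toy case two missing calls out of six are enough to move the estimate noticeably, which is why the call-rate gates come before any LD step.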

Minimal, executable QC (arrays/WGS)

Sample call rate: --mind 0.02
Marker call rate: --geno 0.02 (tighten to 0.01 for final runs)
MAF for LD stability: --maf 0.05
Hardy–Weinberg by ancestry: --hwe 1e-6 midp
Relatedness: remove up to 2nd-degree (use your preferred kinship tool prior to PLINK)
Structure: subset or label by ancestry (PCA/labels), then run LD per stratum
Imputed data: apply INFO/R² filters upstream; only then compute LD
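As a rough illustration of what the marker-level gates above test, the following Python sketch checks one variant against a missingness cap and a MAF floor. It is a simplified stand-in, not PLINK's code: the function name is hypothetical, genotypes are assumed to be allele counts with None for missing calls, and sample-level filtering (--mind), HWE, and relatedness are not modelled.

```python
def variant_passes_qc(genotypes, max_missing=0.02, min_maf=0.05):
    """Return True if a variant passes the missingness and MAF gates.

    genotypes: one allele count (0, 1, 2) per sample, None for a missing call.
    max_missing mirrors the --geno threshold, min_maf mirrors --maf.
    """
    n_samples = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_samples == 0 or not called:
        return False
    missing_rate = (n_samples - len(called)) / n_samples
    allele_freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    maf = min(allele_freq, 1 - allele_freq)
    return missing_rate <= max_missing and maf >= min_maf

# Toy variants (hypothetical data), 100 samples each.
common = [None] + [1] * 20 + [0] * 79        # 1% missing, MAF about 0.10
rare = [1] * 2 + [0] * 98                    # MAF 0.01
patchy = [None] * 5 + [1] * 50 + [0] * 45    # 5% missing

for name, variant in [("common", common), ("rare", rare), ("patchy", patchy)]:
    print(name, variant_passes_qc(variant))
```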

Figure: Distribution of allele frequencies between data sources (e.g., array vs WGS). (Pengelly R.J. et al. (2015) BMC Genomics)

Executable QC command

plink --bfile INPUT \
  --geno 0.02 --mind 0.02 \
  --maf 0.05 \
  --hwe 1e-6 midp \
  --make-bed --out 01_qc

Batch harmonization checklist (before merging cohorts)

Build/coordinate alignment: confirm all batches use the same reference (GRCh37/38); liftover if needed.
Allele consistency: recode to reference alleles; resolve strand flips (A/T, C/G sites need extra care).
Duplicates/conflicts: remove duplicate rsIDs or colliding coordinates so each variant is unique.
Batch labels: keep a batch field in metadata for stratified QA.
Ancestry labels: prefer self-reported ancestry plus PCA confirmation; compute LD within strata.
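The allele-consistency item in the checklist can be sketched in Python as a small classifier that compares a batch's allele pair with the reference pair. This is an illustrative helper under stated assumptions (biallelic SNPs only, hypothetical function names), not part of PLINK; in practice A/T and C/G sites need frequency or reference information to be resolved, so the sketch only flags them.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G pairs, where a strand flip looks like an allele swap."""
    return COMPLEMENT.get(a1.upper()) == a2.upper()

def classify_alleles(ref_pair, batch_pair):
    """Classify a batch allele pair against the reference pair (order ignored)."""
    ref = tuple(a.upper() for a in ref_pair)
    batch = tuple(a.upper() for a in batch_pair)
    if any(a not in COMPLEMENT for a in ref + batch):
        return "unsupported"   # indels or non-ACGT codes are out of scope here
    if is_strand_ambiguous(*ref):
        return "ambiguous"     # needs extra care, cannot be resolved from alleles alone
    if set(ref) == set(batch):
        return "match"
    if set(ref) == {COMPLEMENT[a] for a in batch}:
        return "flip"          # same site reported on the opposite strand
    return "mismatch"

print(classify_alleles(("A", "G"), ("T", "C")))  # flip
print(classify_alleles(("A", "T"), ("T", "A")))  # ambiguous
```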

Why these thresholds and checks matter is explained in the QC Gate section of Designing an LD Study.

The PLINK LD Workflow: From BED to Pruned Set

1) Standardise inputs and produce the QC'd dataset

Harmonise alleles/IDs if merging batches.
Split or label by ancestry to keep LD homogeneous.

plink --bfile INPUT \
  --geno 0.02 --mind 0.02 \
  --maf 0.05 --hwe 1e-6 midp \
  --make-bed --out 01_qc

2) Prune with a sliding window

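--indep-pairwise <window> <stepSNPs> <r2> removes correlated markers as the window slides. The window is counted in variants unless you add the kb modifier (250kb), so a bare 250 means 250 SNPs, not 250 kb.

# Default for many GWAS QC tasks (250 kb window, step 5 SNPs, r2 0.2)
# PLINK 1.9 syntax; confirm the window/step rules in the documentation if you run PLINK 2

plink --bfile 01_qc \
  --indep-pairwise 250kb 5 0.2 \
  --out 02_prune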

# Keep only the independent set

plink --bfile 01_qc \
  --extract 02_prune.prune.in \
  --make-bed --out 03_pruned
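For intuition about what sliding-window pruning does, here is a simplified greedy sketch in Python. It is not PLINK's algorithm or code: the function names are hypothetical, the window is counted in variants, genotypes are complete allele counts, and the rule of dropping the lower-MAF member of a correlated pair is a choice made for this sketch.

```python
def r2(g1, g2):
    """Squared Pearson correlation of two complete allele-count vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    return 0.0 if v1 == 0 or v2 == 0 else (cov * cov) / (v1 * v2)

def maf(g):
    freq = sum(g) / (2 * len(g))
    return min(freq, 1 - freq)

def prune_sketch(genotypes, window=50, step=5, threshold=0.2):
    """Return indices of variants kept by a simplified greedy prune.

    genotypes: list of per-variant allele-count vectors, in map order.
    """
    n = len(genotypes)
    keep = [True] * n
    for start in range(0, n, step):
        idx = [i for i in range(start, min(start + window, n)) if keep[i]]
        for pos, i in enumerate(idx):
            if not keep[i]:
                continue
            for j in idx[pos + 1:]:
                if keep[j] and r2(genotypes[i], genotypes[j]) > threshold:
                    # Sketch rule: drop the lower-MAF variant; ties drop the later one.
                    if maf(genotypes[i]) < maf(genotypes[j]):
                        keep[i] = False
                        break
                    keep[j] = False
    return [i for i in range(n) if keep[i]]

# Toy data (hypothetical): variants 0 and 1 are identical, variant 2 is uncorrelated.
v0 = [0, 1, 2, 0, 1, 2]
v1 = [0, 1, 2, 0, 1, 2]
v2 = [0, 0, 0, 2, 2, 2]
print(prune_sketch([v0, v1, v2], window=3, step=1, threshold=0.2))
```

The kept indices play the role of the .prune.in list: one representative per correlated cluster within each window.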

3) Inspect LD and export matrices (targeted QA)

# Inspect local LD (pairwise r2) by distance-capped window

plink --bfile 01_qc \
  --r2 gz yes-really \
  --ld-window-kb 250 --ld-window 99999 --ld-window-r2 0.2 \
  --out 04_ld_kb

# On dense WGS, cap comparisons by variant count

plink --bfile 01_qc \
  --r2 gz yes-really \
  --ld-window 50 --ld-window-r2 0.2 \
  --out 04_ld_count

Choosing LD inspection modes (kb vs count)

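Use --ld-window-kb when you want to interpret LD as a function of physical distance (ideal for decay curves and reports). For decay curves, set --ld-window-r2 0 so that low-r² pairs are not filtered out of the table and the curve is not biased upward.
Use --ld-window (count cap) in dense WGS to limit comparisons, control runtime, and reduce file size.
Add the gz modifier to --r2 to compress outputs for easier sharing and archiving; yes-really only overrides PLINK's size safeguard for unfiltered all-pairs reports and is not required for windowed runs.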
Name files to reflect chromosome/stratum/window strategy (e.g., 04_ld_EUR_chr10_kb250) for traceability.
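The trade-off between the two inspection modes can be illustrated by counting how many variant pairs survive each cap. The Python sketch below is a generic model, not PLINK's exact window arithmetic: the function name is hypothetical, the count cap is expressed as a maximum index gap between the two variants, and positions are assumed sorted on one chromosome.

```python
def count_pairs(positions, max_kb=None, max_variants_apart=None):
    """Count pairs (i < j) that pass every cap that is set.

    positions: sorted base-pair positions on a single chromosome.
    max_kb: maximum physical distance in kilobases (None = no cap).
    max_variants_apart: maximum index gap j - i (None = no cap).
    """
    total = 0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if max_variants_apart is not None and j - i > max_variants_apart:
                break
            if max_kb is not None and positions[j] - positions[i] > max_kb * 1000:
                break
            total += 1
    return total

# Hypothetical marker maps: 1000 variants spaced 100 bp (dense) or 5 kb (sparse).
dense = list(range(0, 100_000, 100))
sparse = list(range(0, 5_000_000, 5_000))

print("dense, 10 kb cap:  ", count_pairs(dense, max_kb=10))
print("dense, count cap 5:", count_pairs(dense, max_variants_apart=5))
print("sparse, 10 kb cap: ", count_pairs(sparse, max_kb=10))
```

With the same kb cap, the dense map produces far more pairs than the sparse one, which is the reason a count cap is the practical lever on WGS.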

Figure: Schematic of the large-scale linkage disequilibrium (LD) analysis workflow, illustrated with the CONVERGE cohort. (Huang X. et al. (2023) eLife)

4) Make the run reproducible

Containerise PLINK (Docker/Singularity) and pin versions.
Log commands, seeds, input checksums, and timestamps.
Parallelise safely by chromosome to speed runs.
Name files predictably (e.g., 01_qc / 02_prune / 03_pruned / 04_ld_*).
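One way to capture the logging items above is a small manifest writer. The Python sketch below is an assumption-laden example rather than a prescribed format: the field names are hypothetical, JSON is used because it ships with the standard library, and the PLINK version string is supplied by the caller (for example, copied from the run log) rather than detected.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large .bed files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(input_paths, params, commands, plink_version):
    """Collect parameters, commands, and input checksums in one dictionary."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "host_platform": platform.platform(),
        "plink_version": plink_version,  # supplied by the caller, not detected
        "parameters": params,
        "commands": commands,
        "inputs": {str(p): sha256_of(p) for p in input_paths},
    }

def write_manifest(manifest, out_path):
    """Write the manifest with sorted keys so reruns diff cleanly."""
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(out_path).write_text(text)
```

A typical call would pass the .bed/.bim/.fam paths, a dictionary of thresholds, and the exact command strings, then store the result next to the outputs.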

Execution Tuning: Windows, Steps, Thresholds, and Dense Data

Performance & scaling (practical guidance)

Runtime rises with marker density and window length, and falls as step increases. Dense WGS will run slower than arrays under identical settings.
Pilot on 1–2 chromosomes: compare step=5 vs 10 and 250 kb vs 150 kb; choose the best trade-off in retained SNPs, stability, and time.
In very dense regions, enable count caps (e.g., --ld-window 50) and per-chromosome parallelism; record thread counts and I/O limits (shared storage can bottleneck).
Monitor: retained SNP ratio per chromosome, peak RAM, and time per million comparisons; non-monotonic patterns usually flag missingness or batch issues.
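The retained-SNP ratio mentioned in the monitoring item can be computed from two files PLINK already writes: the .bim map and the .prune.in list. The Python sketch below assumes the standard whitespace-delimited .bim layout (chromosome in column 1, variant ID in column 2); the function name and toy lines are illustrative.

```python
from collections import Counter

def retained_ratio_by_chrom(bim_lines, kept_ids):
    """Fraction of variants retained per chromosome after pruning.

    bim_lines: lines of a .bim file (chromosome, variant ID, cM, bp, A1, A2).
    kept_ids: variant IDs from the .prune.in file.
    """
    kept = set(kept_ids)
    total, retained = Counter(), Counter()
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        chrom, variant_id = fields[0], fields[1]
        total[chrom] += 1
        if variant_id in kept:
            retained[chrom] += 1
    return {chrom: retained[chrom] / total[chrom] for chrom in total}

# Toy input (hypothetical IDs and positions).
bim = [
    "1 rs1 0 1000 A G",
    "1 rs2 0 2000 C T",
    "2 rs3 0 1500 A C",
    "2 rs4 0 2500 G T",
]
print(retained_ratio_by_chrom(bim, ["rs1", "rs3", "rs4"]))
```

With real files, pass the open .bim handle and the stripped lines of .prune.in; a chromosome whose ratio departs sharply from its neighbours is a candidate for a missingness or batch check.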

r² thresholds (execution view)

Start pruning at 0.2. For dense or highly structured regions, test 0.1 and 0.3 on one chromosome and compare retained SNPs and downstream model stability.
For tagging, use ≥0.8 (fine-mapping can push ~0.9 in target regions). Keep tagging logic in your panel-design pipeline; avoid mixing with pruning defaults.

Window size and step (distance vs density)

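Default: a 250 kb window for pruning, or a 50-SNP window on dense data; step 5–10 SNPs. --indep-pairwise takes one window unit per run, whereas --r2 inspection can apply a kb cap and a count cap together.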
High recombination: reduce to 100–150 kb to avoid mixing distinct blocks.
Sparse arrays: SNP-count windows stabilise comparisons when spacing is uneven.
Dense WGS: always cap by count for LD inspection; tile by chromosome for parallelism.

Runtime levers and parallelism

Step size is the main throttle; doubling often halves comparisons with little effect on pruning results.
Use --threads judiciously; avoid I/O contention on shared filesystems.
Split by chromosome; aggregate reports afterward.
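Splitting by chromosome can be scripted by generating one argument list per chromosome and handing those to whatever scheduler you use. The Python sketch below only builds the commands and does not run them; the helper name and output naming are assumptions, and the window string is passed through unchanged (250kb requests a kilobase window, a bare number a variant-count window).

```python
import shlex

def per_chromosome_prune_commands(bfile, out_prefix, window, step, r2,
                                  chromosomes=range(1, 23), threads=1):
    """Build one PLINK pruning command (as an argument list) per chromosome."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            "plink",
            "--bfile", bfile,
            "--chr", str(chrom),
            "--indep-pairwise", str(window), str(step), str(r2),
            "--threads", str(threads),
            "--out", f"{out_prefix}_chr{chrom}",
        ])
    return commands

commands = per_chromosome_prune_commands(
    "01_qc", "02_prune", window="250kb", step=5, r2=0.2
)
# Print the first two commands for inspection before submitting anything.
for cmd in commands[:2]:
    print(shlex.join(cmd))
```

Keeping the commands as lists makes them easy to log verbatim and to pass to subprocess or a job array; the per-chromosome .prune.in files can be concatenated afterwards.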

Imputed and mixed data

Filter imputed variants by INFO/R² first; then compute LD on the filtered set.
If you must mix sources, validate LD patterns on raw versus imputed subsets during QA and only then pool.

Edge regions and practical checks

Long-range LD areas (e.g., MHC) can dominate summaries. Exclude during genome-wide pruning and report separately with context.
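Chromosome X needs sex-aware handling; verify reported sex first (e.g., --check-sex) and restrict genome-wide pruning to autosomes (--autosome) unless X is analysed separately.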
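Excluding long-range LD regions is usually done with a range file passed to --exclude range, with one line per region: chromosome, start bp, end bp, label. The Python sketch below writes such lines and splits variants into inside and outside sets for separate reporting. The coordinates are a placeholder approximation of the extended MHC and are an assumption of this example; take the real boundaries from a curated list that matches your genome build.

```python
# Placeholder region (assumption): replace with build-matched, curated coordinates.
EXAMPLE_REGIONS = [("6", 25_000_000, 35_000_000, "MHC_approx")]

def range_file_text(regions):
    """Format regions as 'chrom start end label' lines for --exclude range."""
    return "".join(
        f"{chrom} {start} {end} {label}\n" for chrom, start, end, label in regions
    )

def split_by_regions(variants, regions):
    """Split (chrom, bp, variant_id) tuples into IDs inside and outside the regions."""
    inside, outside = [], []
    for chrom, bp, variant_id in variants:
        hit = any(chrom == c and start <= bp <= end for c, start, end, _ in regions)
        (inside if hit else outside).append(variant_id)
    return inside, outside

# Toy variants (hypothetical IDs and positions).
variants = [("6", 30_000_000, "rsA"), ("6", 40_000_000, "rsB"), ("1", 30_000_000, "rsC")]
print(range_file_text(EXAMPLE_REGIONS), end="")
print(split_by_regions(variants, EXAMPLE_REGIONS))
```

The inside list gives the variant set for the separate long-range LD report, while the range file keeps those regions out of the genome-wide prune.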

Regions with atypical LD

Long-range LD: handle as above—separate analysis and reporting.
Inversions/selection signals: expect extended LD; review by segment and note plausible causes (structural variation, local selection, demographic history).
Runs of homozygosity (ROH): large ROH inflate local correlation, especially in inbred/related samples; consider removing those samples or masking ROH for global summaries.

Reporting That Reviewers Accept

Deliver an output bundle that collaborators can reuse and journals can audit.

Include

Pruned set (.bed/.bim/.fam or variant list) with build/versions noted.
LD decay curves per stratum (annotate MAF, r², window, and step).
LD heatmaps for representative regions with consistent r² colour scales.
Tag coverage tables (if tagging): r² thresholds, proxies covered, per-region summaries.
Parameter manifest: all QC gates, windows, steps, thresholds; PLINK version; container image/tag; command history; seeds; input checksums.
Reproducibility bundle: scripts or a Makefile/Snakemake definition.
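The decay curves in the bundle can be derived from the pairwise table that --r2 writes. The Python sketch below bins pairs by physical distance and averages r² per bin, reading the BP_A, BP_B, and R2 columns from the header; the function name and bin width are illustrative choices. It assumes the table was produced without an r² floor (--ld-window-r2 0), since a floor removes weak pairs and inflates the means.

```python
from collections import defaultdict

def decay_curve(ld_lines, bin_kb=10):
    """Mean r2 per distance bin from a PLINK pairwise LD table.

    ld_lines: iterable of text lines, header first (needs BP_A, BP_B, R2).
    Returns a list of (bin_start_kb, mean_r2, n_pairs), sorted by distance.
    """
    header = None
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in ld_lines:
        fields = line.split()
        if not fields:
            continue
        if header is None:
            header = {name: i for i, name in enumerate(fields)}
            continue
        distance = abs(int(fields[header["BP_B"]]) - int(fields[header["BP_A"]]))
        bin_index = distance // (bin_kb * 1000)
        sums[bin_index] += float(fields[header["R2"]])
        counts[bin_index] += 1
    return [(b * bin_kb, sums[b] / counts[b], counts[b]) for b in sorted(sums)]

# Toy table (hypothetical values) in the column layout of a .ld file.
toy_ld = [
    " CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2",
    " 1 1000 rs1 1 6000 rs2 0.75",
    " 1 1000 rs1 1 9000 rs3 0.25",
    " 1 1000 rs1 1 26000 rs4 0.125",
]
print(decay_curve(toy_ld, bin_kb=10))
```

For compressed output, open the file with gzip.open(path, "rt") and pass the handle directly; run the function once per ancestry stratum so the curves stay comparable.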

Folder & file schema (delivery suggestion)

00_meta/: manifest.yaml, command logs, container/version info, input summary (samples, variants, build).
01_qc/: pre/post QC stats (--freq/--missing) and PCA coordinates (for ancestry verification).
02_prune/: *.prune.in/out, retained set bed/bim/fam.
03_ld/: representative LD matrices (.gz), decay raw data and plots.
04_reporting/: heatmaps, coverage tables (CSV/TSV), and a one-pager methods/parameters sheet.

Acceptance checks (quick sign-off)

Retained SNP proportion aligns with expectations (too low suggests over-pruning or strict settings).
Mean/median r² in the retained set drops versus baseline; model VIF improves; decay curves look plausible per ancestry.
Figures label MAF, r², window, and step; colour scales remain consistent across plots.

Figure: LD components for the 26 1000 Genomes (1KG) cohorts. (Huang X. et al. (2023) eLife)

Avoid

Over-pruning that erases structure in gene-dense areas.
Pooled multi-ancestry LD without stratum analysis.
LD computed on unfiltered imputed calls.

FAQs

What are pragmatic PLINK LD parameters for fast, stable pruning?

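Use --indep-pairwise 250kb 5 0.2 (250 kb window, step 5 SNPs, r² 0.2) as a dependable baseline; without the kb modifier the first value is read as a variant count. On dense WGS, trial a count-based variant (50 5 0.2) on one chromosome and compare retained markers and runtime before scaling.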

How do I set sliding windows: kb or SNP count—and can I combine them?

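--indep-pairwise takes a window (a variant count by default, or kilobases with the kb modifier, e.g. 250kb), a step in SNPs, and an r² threshold, so each pruning run uses one window unit. For inspection via --r2, you can cap by kb (--ld-window-kb) and by SNP count (--ld-window) at the same time; a pair is reported only if it passes both. In practice, use kb for biological distance and count to limit compute on dense data.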

How can I limit comparisons on dense WGS and still keep signal?

Increase the step from 5 to 10, cap by SNP count in LD inspection, and parallelise by chromosome. These changes reduce compute cost with little impact on pruning quality.

How do I make runs fully reproducible across machines and analysts?

Containerise PLINK, pin versions, log commands and seeds, store input checksums, and ship a parameter manifest. Keep file naming deterministic and avoid manual edits to intermediate files.

What should I do about the MHC and other long-range LD regions?

Inspect them separately. Consider region-specific thresholds or report them as special cases so they don't dominate genome-wide summaries.

Can I mix raw genotypes and imputed variants in LD calculations?

Yes—but only after strict imputation-quality filtering. Validate LD patterns separately on raw and imputed subsets during QA, then pool if they agree.

Do I need to publish the parameter manifest?

It's recommended. The manifest records the key PLINK LD and pruning parameters (MAF, r², windows, step, versions/containers, ancestry stratification, special-region handling), which helps peer review and keeps internal reruns consistent.

Conclusion

A disciplined LD pruning pipeline makes PLINK LD a dependable foundation for GWAS, structure analysis, and tag SNP panel design. Keep the QC gate executable and light, prune with a sliding window suited to your marker density, and cap comparisons on dense WGS. Run a small pilot on one or two chromosomes to tune step and windows, then lock everything in a manifest so teammates and reviewers can reproduce your work.

From pipeline to downstream

GWAS/confounding control: use the pruned set to compute PCs or a GRM, then run association on the full QC'd marker set with those PCs or the GRM in the model to control confounding.
PRS/panel design: for PRS or targeted panels, combine --show-tags (e.g., r²≥0.8) with ancestry-specific evaluation to improve transferability.
Traceability: archive the pruned ID list with the manifest as the "data lineage" anchor for all downstream models.

References

Huang, X. et al. Efficient estimation for large-scale linkage disequilibrium patterns of the human genome. eLife 12, e90636 (2023).
Pengelly, R.J., Tapper, W., Gibson, J. et al. Whole genome sequences are required to fully resolve the linkage disequilibrium structure of human populations. BMC Genomics 16, 666 (2015).
Chang, C.C., Chow, C.C., Tellier, L.C.A.M. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Purcell, S., Neale, B., Todd-Brown, K. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559–575 (2007).
Patterson, N., Price, A.L., Reich, D. Population structure and eigenanalysis. PLOS Genetics 2, e190 (2006).
Wigginton, J.E., Cutler, D.J., Abecasis, G.R. A note on exact tests of Hardy–Weinberg equilibrium. American Journal of Human Genetics 76, 887–893 (2005).