TL;DR — A Defensible PCA QC Recipe
Run PCA on LD-pruned autosomal SNPs while masking long-range LD regions; remove or adjust for close relatives before fitting; select PCs with the Tracy–Widom test (avoid a fixed "top 10"); flag outliers by distance in PC space; and project new batches onto the fixed axes to keep covariates consistent across waves. This reduces stratification artifacts and stabilizes covariates for downstream GWAS analysis.
Even modest ancestry shifts or plate effects can inflate test statistics, trigger false positives, or hide real associations. Principal components analysis (PCA) is the standard for detecting and correcting stratification in genome-wide studies. Treat it as a quantitative control step, not just a pretty plot.
Risks when PCA QC is skipped or superficial:
The cure is simple: prune, mask, de-relate, test, and project—then carry forward significant PCs as covariates in your Population Structure Analysis and GWAS analysis workflows.
Where this connects to services: most teams take PCA outputs (projection files and PC covariates) from our PCA analysis service and end-to-end Population Structure Analysis, then carry the PCs into GWAS analysis models.
These elements—and the exact parameters you record—become your reviewer-ready Methods and your audit trail.
Robust PCA methods improve sensitivity and specificity for flagging sample outliers versus classical PCA, highlighting how robust estimators sharpen QC decisions. (Liu X. et al. (2020) BMC Bioinformatics)
Tip: If you mix array with WES/WGS, harmonize reference/strand and intersect SNPs first; run a preliminary PCA to check that top PCs are not just platform. If they are, revisit harmonization.
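One way to make the "are the top PCs just platform?" check quantitative is to measure how much of a PC's variance is explained by the platform label. The sketch below computes that fraction (eta squared) in plain Python. The function name, the toy scores, and the labels are illustrative assumptions, not output from any real cohort.

```python
from collections import defaultdict


def variance_explained_by_group(pc_scores, labels):
    """Fraction of a PC's variance explained by a categorical label (eta squared)."""
    if len(pc_scores) != len(labels):
        raise ValueError("pc_scores and labels must have the same length")
    grand_mean = sum(pc_scores) / len(pc_scores)
    total_ss = sum((x - grand_mean) ** 2 for x in pc_scores)
    if total_ss == 0:
        return 0.0
    groups = defaultdict(list)
    for score, label in zip(pc_scores, labels):
        groups[label].append(score)
    # Between-group sum of squares: group size times squared offset of the group mean.
    between_ss = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return between_ss / total_ss


# Toy PC1 scores (made up) that separate cleanly by platform.
pc1 = [-0.9, -1.1, -1.0, 1.0, 0.9, 1.1]
platform = ["array", "array", "array", "wgs", "wgs", "wgs"]
eta2_platform = variance_explained_by_group(pc1, platform)

# The same scores against a label that does not track PC1.
shuffled = ["array", "wgs", "array", "wgs", "array", "wgs"]
eta2_shuffled = variance_explained_by_group(pc1, shuffled)
```

A value near 1 for a top PC suggests the axis is mostly platform and that harmonization deserves another look; what counts as "too high" is a judgment call for each study.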
Detect pairs related at the second degree or closer with KING, which is robust to population structure. Then either drop one individual per related pair for the PCA fit, or use PC-Relate to obtain structure-adjusted kinship. In both cases, fit PCA on an unrelated subset and project the relatives back onto those axes. This prevents family clusters from pulling PCs off true ancestry axes.
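Choosing whom to drop can be sketched as a greedy pass over the kinship pairs: repeatedly remove the sample with the most remaining relatives until no related pair is left. The example below assumes you have already parsed pairwise kinship coefficients into (sample, sample, kinship) tuples. The 0.0884 cutoff is the lower bound KING documents for second-degree pairs; the function name and sample IDs are hypothetical.

```python
from collections import defaultdict

# KING's documented lower kinship bound for second-degree relatives.
SECOND_DEGREE_KINSHIP = 0.0884


def select_unrelated(samples, kinship_pairs, threshold=SECOND_DEGREE_KINSHIP):
    """Greedily drop the most-connected sample until no related pair remains."""
    related = defaultdict(set)
    for a, b, kinship in kinship_pairs:
        if kinship >= threshold:
            related[a].add(b)
            related[b].add(a)

    removed = set()
    while True:
        worst, worst_degree = None, 0
        for sample in sorted(related):  # sorted for a deterministic tie-break
            if sample in removed:
                continue
            degree = len(related[sample] - removed)
            if degree > worst_degree:
                worst, worst_degree = sample, degree
        if worst is None:  # no sample has a remaining relative
            break
        removed.add(worst)

    kept = [s for s in samples if s not in removed]
    return kept, sorted(removed)


# Toy input: A is a first-degree relative of B and C; D and E are below threshold.
samples = ["A", "B", "C", "D", "E"]
pairs = [("A", "B", 0.25), ("A", "C", 0.25), ("D", "E", 0.03)]
kept, removed = select_unrelated(samples, pairs)
```

The greedy rule keeps more samples than dropping one member of every pair independently, but it ignores phenotype and call-rate priorities; a real pipeline may want to weight the choice.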
Use EIGENSOFT/smartpca or FastPCA for large cohorts. Save the SNP loadings (weights), the sample eigenvectors, and the eigenvalues, and project new or held-out samples (validation batches, WES/WGS arms) onto the same axes so covariates remain comparable across time and subcohorts.
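The fit-once, project-later idea can be illustrated with a minimal NumPy sketch. It assumes a complete (no missing calls) samples-by-SNPs dosage matrix coded 0/1/2 and uses the common sqrt(p(1-p)) allele-frequency scaling; it is not a substitute for smartpca, and the function names and random toy genotypes are illustrative only. The key point is that the new batch is centered and scaled with the reference set's statistics, not its own.

```python
import numpy as np


def fit_pca(ref_genotypes, n_components):
    """Fit PCA on a reference dosage matrix and keep what projection needs."""
    mean = ref_genotypes.mean(axis=0)
    freq = mean / 2.0
    scale = np.sqrt(freq * (1.0 - freq))
    scale[scale == 0] = 1.0  # monomorphic SNPs: centered values are zero anyway
    z = (ref_genotypes - mean) / scale
    _, singular_values, vt = np.linalg.svd(z, full_matrices=False)
    return {
        "mean": mean,                      # per-SNP centering from the reference
        "scale": scale,                    # per-SNP scaling from the reference
        "loadings": vt[:n_components].T,   # SNPs x components
        "singular_values": singular_values[:n_components],
    }


def project(model, genotypes):
    """Project samples onto fixed axes using the reference centering and scaling."""
    z = (genotypes - model["mean"]) / model["scale"]
    return z @ model["loadings"]


# Toy data (random, for shape checking only).
rng = np.random.default_rng(0)
reference = rng.integers(0, 3, size=(20, 50)).astype(float)
new_batch = rng.integers(0, 3, size=(5, 50)).astype(float)

model = fit_pca(reference, n_components=2)
ref_scores = project(model, reference)
new_scores = project(model, new_batch)
```

Persisting `mean`, `scale`, and `loadings` alongside the SNP list is what lets a later wave reuse exactly the same axes.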
Scale Choices
PC1–PC2 projections across multiple geographic contexts show that some Southeast Asian diversity is insufficiently represented in 1KG, underscoring why projection choices and reference panels matter for stratification control. (Lu D. et al. (2013) Frontiers in Genetics)
Use the Tracy–Widom test to determine how many PCs represent genuine structure. Combine TW results with model-fit diagnostics (e.g., genomic inflation after including PCs). Do not hardcode a number across projects; cohorts differ in complexity.
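The model-fit side of this check can be sketched with the genomic inflation factor, the median association chi-square statistic divided by the median of a 1-df chi-square distribution (about 0.4549). The snippet below computes it and then picks the smallest PC count whose inflation falls under a tolerance. The 1.05 tolerance, the helper names, and the example values are assumptions for illustration; this complements the Tracy–Widom test and does not replace it.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Genomic inflation factor: observed median chi-square over the expected median."""
    if not chi2_stats:
        raise ValueError("need at least one test statistic")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def smallest_adequate_pc_count(lambda_by_pc_count, tolerance=1.05):
    """Smallest number of PCs whose inflation is at or below the tolerance, else None."""
    for n_pcs in sorted(lambda_by_pc_count):
        if lambda_by_pc_count[n_pcs] <= tolerance:
            return n_pcs
    return None


# Three made-up statistics whose median equals the expected median.
example_lambda = genomic_inflation([0.1, 0.4549364, 2.0])

# Made-up inflation values from re-running the association with 0, 2, 4, 6 PCs.
chosen = smallest_adequate_pc_count({0: 1.30, 2: 1.12, 4: 1.04, 6: 1.03})
```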
Disposition Policy
Package these hand-offs to your association team:
These artifacts reduce back-and-forth during manuscript prep and make pre-registration and SOP reuse simple.
An interactive PCA sample view with linked scree plot streamlines PC selection, outlier review, and the generation of publication-ready figures from the same UI. (Marini F. et al. (2019) BMC Bioinformatics)
True outliers (e.g., contamination, mix-ups) break expected PC patterns and sit far from any cluster; remove and re-run PCA after fixing the root cause. Legitimate structure (e.g., recent admixture or a population isolate) forms coherent tails or clusters consistent with geography or known demography; keep such samples, treat PCs as covariates, and annotate with a complementary method (e.g., global ancestry proportions) only if it aids interpretation. Above all, ensure the PCs you carry into association models are significant (Tracy–Widom) and stable under projection.
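A simple first pass for flagging candidates is a per-axis standard deviation rule: mark any sample that lies more than a chosen number of SDs from the mean on any retained PC. The sketch below uses 6 SD, in the spirit of smartpca's default outlier setting, but it makes a single pass and is not that tool's implementation. Sample IDs and scores are invented. Flagged samples still need the manual review described above before removal.

```python
from statistics import mean, pstdev


def flag_outliers(scores_by_sample, n_sd=6.0):
    """Return {sample_id: [1-based PC indices]} where |z-score| exceeds n_sd."""
    sample_ids = list(scores_by_sample)
    n_pcs = len(next(iter(scores_by_sample.values())))
    flagged = {}
    for k in range(n_pcs):
        column = [scores_by_sample[s][k] for s in sample_ids]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # no spread on this axis, nothing to flag
        for sample_id, value in zip(sample_ids, column):
            if abs(value - mu) / sd > n_sd:
                flagged.setdefault(sample_id, []).append(k + 1)
    return flagged


# Toy scores: 99 tightly clustered samples plus one far out on PC1.
scores = {
    f"S{i:03d}": [(i % 5 - 2) * 0.01, (i % 3 - 1) * 0.01] for i in range(99)
}
scores["S_far"] = [1.0, 0.0]
flagged = flag_outliers(scores)
```

Because an extreme sample inflates the SD it is measured against, small cohorts can never reach a 6 SD threshold; a median and MAD version or a lower cutoff may be more appropriate there.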
If PC1/PC2 are driven by one genomic region, you likely missed a long-range LD mask. Re-mask, re-prune, and re-fit.
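To check whether a PC is driven by one region, you can sum squared SNP loadings in fixed genomic windows and look for a window that holds most of the total. The sketch below assumes you have (chromosome, position, loading) triples for one PC. The 5 Mb window, the 0.5 share cutoff, and the toy loadings are hypothetical choices for illustration.

```python
from collections import defaultdict


def loading_share_by_window(snps, window_bp=5_000_000):
    """Share of total squared loading per (chromosome, window index)."""
    totals = defaultdict(float)
    for chrom, pos, loading in snps:
        totals[(chrom, pos // window_bp)] += loading ** 2
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: value / grand_total for key, value in totals.items()}


def dominant_window(snps, window_bp=5_000_000, max_share=0.5):
    """Return (window, share) if one window exceeds max_share, else None."""
    shares = loading_share_by_window(snps, window_bp)
    if not shares:
        return None
    top = max(shares, key=shares.get)
    return (top, shares[top]) if shares[top] > max_share else None


# Toy loadings (made up): two heavy SNPs close together on chromosome 6.
pc1_loadings = [
    ("6", 31_000_000, 0.9),
    ("6", 32_000_000, 0.3),
    ("1", 1_000_000, 0.1),
    ("2", 5_500_000, 0.1),
]
hit = dominant_window(pc1_loadings)
```

A hit is a prompt to compare the window against your long-range LD mask, not proof of an artifact on its own.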
Methods Boilerplate To Adapt
"We generated an unlinked SNP set using PLINK (--indep-pairwise 50 5 0.1), excluded published GRCh38 long-range LD regions, and removed one individual from each pair of ≥2nd-degree relatives (KING). We computed PCA on the unrelated subset with EIGENSOFT (or FastPCA for scale) and projected remaining samples. We selected significant PCs via Tracy–Widom tests and inspected associations for residual inflation. Software versions, CPU/RAM, and array job settings were recorded."
Figure Checklist
PCs Track Batch, Not Ancestry
Relatives Inflate Apparent Clusters
"How Many PCs?" Remains Unclear
PCA Is Too Slow At Cohort Scale
Runtime and accuracy benchmarks show ProPCA scales efficiently on very large cohorts while maintaining PCA fidelity, enabling rapid analysis of biobank-sized data. (Agrawal A. et al. (2020) PLOS Genetics)
Use the Tracy–Widom test to select significant PCs and complement with model-fit checks (e.g., genomic inflation after adding PCs). Avoid a fixed number across studies; structure varies by cohort, platform, and ancestry mix.
Both LD pruning and long-range LD masking are needed. LD pruning creates an unlinked marker set; long-range LD masks remove extended haplotype regions that can hijack top PCs. Skipping either step raises the chance your PCs reflect technical artifacts rather than ancestry.
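As a sketch of how the two steps fit together on the command line, the helper below assembles PLINK 1.9 argument lists: one that masks regions and prunes with the window settings quoted in the Methods boilerplate (--indep-pairwise 50 5 0.1), and one that extracts the retained SNPs. It only builds the commands and does not run them. The file names are hypothetical, and the mask file is assumed to be in PLINK's range format for the correct genome build.

```python
import shlex


def plink_prune_command(bfile, mask_file, out_prefix, window=50, step=5, r2=0.1):
    """Argument list for masking long-range LD regions and LD pruning (PLINK 1.9)."""
    return [
        "plink",
        "--bfile", bfile,
        "--autosome",                       # autosomal SNPs only
        "--exclude", "range", mask_file,    # drop SNPs inside the masked intervals
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out_prefix,                # writes <out_prefix>.prune.in / .prune.out
    ]


def plink_extract_command(bfile, out_prefix):
    """Argument list for writing the pruned SNP set as a new binary fileset."""
    return [
        "plink",
        "--bfile", bfile,
        "--extract", f"{out_prefix}.prune.in",
        "--make-bed",
        "--out", f"{out_prefix}.pruned",
    ]


# Hypothetical file names, for illustration only.
prune_cmd = plink_prune_command("cohort", "long_range_ld_grch38.txt", "cohort_ld")
extract_cmd = plink_extract_command("cohort", "cohort_ld")
prune_line = shlex.join(prune_cmd)  # a printable string for logs or Methods
```

Logging the joined command line next to the software version gives you the exact parameters to quote in Methods.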
Detect ≥2nd-degree pairs using KING or compute structure-aware kinship with PC-Relate. Fit PCA on an unrelated subset and project related individuals afterwards to avoid kinship-driven PCs.
FastPCA is a practical choice at cohort scale: it approximates the top PCs with far lower time and memory while preserving the core eigenstructure. Pair it with projection for new data waves to keep covariates consistent.
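To show why approximating only the top PCs can be cheap, here is a generic randomized PCA sketch in NumPy: project the data onto a few random directions, sharpen the subspace with power iterations, and run an exact SVD on the resulting small matrix. This illustrates the general randomized approach and is not FastPCA's own code; the function name, defaults, and toy matrix are assumptions.

```python
import numpy as np


def randomized_pca(z, n_components, n_oversamples=10, n_power_iter=4, seed=0):
    """Approximate top PCs of a normalized samples x SNPs matrix via random projection."""
    rng = np.random.default_rng(seed)
    n_samples, n_snps = z.shape
    k = min(n_components + n_oversamples, n_samples, n_snps)

    # Random sketch of the column space, refined by power iterations.
    y = z @ rng.standard_normal((n_snps, k))
    for _ in range(n_power_iter):
        y, _ = np.linalg.qr(y)   # re-orthonormalize for numerical stability
        y = z @ (z.T @ y)
    basis, _ = np.linalg.qr(y)   # orthonormal basis, samples x k

    # Exact SVD on the small k x SNPs matrix.
    u_small, singular_values, vt = np.linalg.svd(basis.T @ z, full_matrices=False)
    scores = (basis @ u_small[:, :n_components]) * singular_values[:n_components]
    return scores, singular_values[:n_components], vt[:n_components]


# Toy matrix: strong rank-3 structure plus noise (random, not genotype data).
rng = np.random.default_rng(1)
signal = 5.0 * rng.standard_normal((60, 3)) @ rng.standard_normal((3, 200))
z_toy = signal + rng.standard_normal((60, 200))

approx_scores, approx_sv, approx_loadings = randomized_pca(z_toy, n_components=3)
exact_sv = np.linalg.svd(z_toy, compute_uv=False)[:3]
```

On data with a clear gap between the top eigenvalues and the rest, the approximate and exact singular values agree closely; when the spectrum decays slowly, more power iterations or oversampling may be needed.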
Fit once on a clean reference set, save loadings, and project all subsequent batches. This maintains the axes and prevents drift in covariates across analysis waves.
A defensible PCA QC pipeline—PLINK LD pruning, GRCh38 long-range LD masks, kinship control (KING/PC-Relate), Tracy–Widom PC selection, and projection—gives you clean, reproducible covariates and fewer surprises in association testing. Document parameters, state your selection rule, and ship a reviewer-ready pack (PC covariates, figures, Methods).
Ready to operationalise this across cohorts? Start with a PCA QC plan or engage our Population Structure Analysis and PCA analysis service to deliver projection files and PC covariates that drop straight into your GWAS analysis workflow—complete with compute planning and SOPs for repeatable results.
Related reading:
References