Population Structure with ddRAD: PCA, ADMIXTURE & STRUCTURE
If you need structure analysis with ddRAD data, this guide walks through a reproducible, reviewer-friendly workflow. You will learn how to build a clean genotype matrix, run PCA and ADMIXTURE/STRUCTURE, choose K with evidence, and present results that stand up to peer review. Because ddRAD is a reduced-representation method with uneven locus recovery, we focus on practical guardrails that keep signal and remove artefacts. Use the workflow below to plan your study and interpret structure results; for end-to-end support, see our Population Genomics Sequencing, ddRAD sequencing service, and Bioinformatics – Population Structure Analysis.
Quick Answer — How many K should I use?
Scan a sensible grid (for example, K = 1–10) and run 5–10 independent replicates per K. In ADMIXTURE, select the K that minimises cross-validation (CV) error; in STRUCTURE, look for the K where the log-likelihood plateaus or ΔK peaks. Confirm that choice against PCA patterns and your sampling design, and report the chosen K together with replicate stability.

1) ddRAD pitfalls that distort structure
ddRAD libraries sample a subset of the genome using two restriction enzymes and size selection. That design is cost-efficient across many individuals, but it also means markers cluster near restriction sites, coverage varies across samples, and SNPs that sit on the same short locus are tightly linked, forming LD-heavy blocks. If uncorrected, these features can inflate clustering, bias ADMIXTURE/STRUCTURE bar plots, and push you toward an overestimated K.
"Chromosome 1" panel showing ddRAD vs sdRAD coverage and SNP tracks alongside gene density. (Ruperao P. et al. (2023) PLoS ONE)
The three pitfalls we see most often:
- Residual LD inflates clusters. Dense, correlated SNPs act like pseudo-replicates. Without LD pruning, PCA and model-based methods may "discover" structure that is really local correlation.
- Lane/plate effects masquerade as biology. ddRAD data can show depth and dropout differences by run. Always colour PCA by batch before trusting K.
- Over-interpreting bar plots. Membership bars are model summaries, not literal ancestry fractions. Treat them as one line of evidence among CV/likelihood curves, PCA geography, and sampling logic.
Design your analysis around these risks—QC → LD pruning → PCA diagnostics → model-based inference—and you will keep biological structure while dropping artefacts.
2) From VCF to a clean genotype matrix
Most downstream issues trace back to inconsistent inputs. Standardise once so PCA/ADMIXTURE/STRUCTURE see a well-behaved matrix.
Format and metadata hygiene
- Convert pipeline outputs (Stacks, ipyrad, dDocent) to VCF/PLINK with consistent chromosome labels, sample IDs, and missing genotype codes.
- Maintain a simple sample sheet with population label, batch (lane/plate), collection date, and coordinates. That sheet powers diagnostics and reproducibility.
Per-sample QC (experience-based tips)
- Plot missingness and heterozygosity. ddRAD dropout can elevate missingness and distort heterozygosity; remove extreme outliers early.
- Track read count per sample and the number of loci passing filters. Sudden drops often signal library or demultiplexing problems.
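The per-sample checks above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming genotypes are already coded as alternate-allele counts (0, 1, 2) with None for missing calls; the function names and the cut-offs (`max_missing`, `het_z`) are hypothetical and should be tuned to your own distributions.

```python
from statistics import mean, stdev

def sample_qc(genotypes):
    """Return (missing_rate, het_rate) for one sample.

    genotypes: alt-allele counts (0, 1, 2), or None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    # Heterozygosity is computed over called genotypes only.
    het_rate = sum(g == 1 for g in called) / len(called) if called else float("nan")
    return missing_rate, het_rate

def flag_outliers(matrix, max_missing=0.5, het_z=3.0):
    """Flag samples with high missingness or extreme heterozygosity.

    matrix: dict mapping sample ID -> genotype list.
    max_missing and het_z are illustrative cut-offs, not recommendations.
    """
    stats = {s: sample_qc(g) for s, g in matrix.items()}
    hets = [h for _, h in stats.values() if h == h]  # h == h drops NaN
    mu, sd = mean(hets), stdev(hets)
    flagged = {}
    for s, (miss, het) in stats.items():
        reasons = []
        if miss > max_missing:
            reasons.append("missingness")
        if sd > 0 and abs(het - mu) / sd > het_z:
            reasons.append("heterozygosity")
        if reasons:
            flagged[s] = reasons
    return stats, flagged

# Toy data: S3 has mostly missing calls.
demo = {
    "S1": [0, 1, 2, 0, 1, 0, 2, 1],
    "S2": [0, 1, 2, 0, None, 0, 2, 1],
    "S3": [None, None, None, None, None, 1, None, 0],
}
stats, flagged = flag_outliers(demo)
print(flagged)
```

In practice you would plot both statistics before choosing cut-offs; a z-score rule like the one here is only meaningful with many more samples than this toy set.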
Per-locus QC
- Use MAF thresholds aligned to cohort size (e.g., remove singletons in small cohorts to avoid spurious structure).
- Filter loci with high missingness or inconsistent mapping.
- Prefer biallelic SNPs for stability in PCA and model-based clustering.
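As a rough sketch of the per-locus filters above, the following assumes biallelic SNPs coded as alternate-allele counts with None for missing calls. The helper names are hypothetical, and the default thresholds are illustrative only.

```python
def locus_stats(calls):
    """Return (missing_rate, maf, minor_allele_count) for one biallelic SNP.

    calls: alt-allele counts (0, 1, 2) across samples, None for missing.
    """
    called = [g for g in calls if g is not None]
    missing = 1 - len(called) / len(calls)
    if not called:
        return missing, 0.0, 0
    alt_count = sum(called)
    total_alleles = 2 * len(called)
    alt_freq = alt_count / total_alleles
    maf = min(alt_freq, 1 - alt_freq)
    minor_count = min(alt_count, total_alleles - alt_count)
    return missing, maf, minor_count

def filter_loci(loci, min_maf=0.05, max_missing=0.2, min_minor_count=2):
    """Return IDs of loci passing all thresholds.

    min_minor_count=2 removes singletons, which matters in small cohorts
    where a MAF threshold alone may still let them through.
    """
    kept = []
    for locus_id, calls in loci.items():
        missing, maf, mac = locus_stats(calls)
        if missing <= max_missing and maf >= min_maf and mac >= min_minor_count:
            kept.append(locus_id)
    return kept

loci = {
    "snp1": [0, 1, 2, 0, 1, 0, 1, 0, 0, 1],            # common variant
    "snp2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],            # singleton
    "snp3": [0, 1, None, None, None, 2, 1, 0, 1, 0],   # 30% missing
    "snp4": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],            # monomorphic
}
kept_loci = filter_loci(loci)
print(kept_loci)
```

Note that in this ten-sample toy set the singleton has MAF exactly 0.05, so it passes the frequency threshold and is removed only by the minor-allele-count rule.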
Why this step matters
Model-based methods assume reasonably independent, informative SNPs sampled consistently across individuals. Clean inputs reduce run-to-run variance and eliminate the temptation to "shop" for K.
3) LD pruning and filters for PCA/ADMIXTURE
LD pruning is not busywork in ddRAD; it is the main guardrail against exaggerated structure.
Start with practical defaults, then sensitivity-test
- Prune with an indep-pairwise approach and keep a few thousand evenly spaced SNPs.
- A robust starting grid: window 50–100 kb (or a fixed window of 50 SNPs), step 10–20 SNPs, r² 0.2–0.5.
- Run a sensitivity check. If K or membership patterns swing when you tighten r² from 0.5 to 0.2, LD was driving the previous result. Record the exact parameters you adopt.
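To make the sensitivity check concrete, here is a simplified greedy pruning sketch. It is not PLINK's exact indep-pairwise algorithm: it walks SNPs in genomic order and keeps a SNP only if its r² with every recently kept SNP is at or below the threshold. The function names and toy genotypes are hypothetical.

```python
def r2(x, y):
    """Squared Pearson correlation of allele counts over jointly called samples."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic in the overlap: no usable correlation
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy * sxy / (sxx * syy)

def greedy_prune(snps, window=50, threshold=0.2):
    """Return IDs of SNPs retained after greedy LD pruning.

    snps: list of (snp_id, genotype list) in genomic order.
    window: how many preceding SNP positions to compare against.
    """
    kept = []  # tuples of (index, snp_id, genotypes)
    for i, (snp_id, g) in enumerate(snps):
        recent = [k for k in kept if i - k[0] < window]
        if all(r2(g, k[2]) <= threshold for k in recent):
            kept.append((i, snp_id, g))
    return [k[1] for k in kept]

snps = [
    ("snpA", [0, 1, 2, 0, 1, 2, 0, 1]),
    ("snpB", [0, 1, 2, 0, 1, 2, 0, 2]),  # near-duplicate of snpA
    ("snpC", [1, 0, 1, 1, 2, 1, 1, 0]),  # essentially independent of snpA
    ("snpD", [0, 1, 1, 1, 0, 2, 0, 1]),  # moderately correlated with snpA
]
loose = greedy_prune(snps, threshold=0.5)
strict = greedy_prune(snps, threshold=0.2)
print(loose, strict)
```

Comparing the two retained sets is the core of the sensitivity test: SNPs that survive at r² = 0.5 but not at 0.2 are the ones most likely to carry local correlation into PCA and clustering.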
Control missingness before imputation
- Uneven missingness destabilises bar plots and PCs. Prefer removing high-missing loci/samples first.
- If you impute, treat it as a separate branch of analysis and show with/without comparisons on the pruned set.
Use PCA as a diagnostic, not an oracle
- Compute PCA on the pruned SNPs. Inspect the scree plot for the number of informative axes.
- Plot PC1–PC2 coloured by population and by batch (lane/plate). If batch separates cleanly, fix the lab or demultiplexing issue before fitting structure models.
- Map PCs against geography or environment if available; consistent gradients help confirm biological signal.
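Colouring by batch is a visual check; a simple number can back it up. The sketch below computes the share of variance in one PC explained by a grouping (one-way ANOVA R², also called eta-squared). It assumes PC scores have already been computed elsewhere; the scores, labels, and function name are hypothetical, and no particular R² cut-off is implied.

```python
def variance_explained_by_group(scores, labels):
    """Share of variance in `scores` explained by group membership (eta-squared)."""
    n = len(scores)
    grand_mean = sum(scores) / n
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    if total_ss == 0:
        return 0.0
    groups = {}
    for s, lab in zip(scores, labels):
        groups.setdefault(lab, []).append(s)
    # Between-group sum of squares: group size times squared offset of group mean.
    between_ss = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    return between_ss / total_ss

# Hypothetical PC1 scores for six samples.
pc1 = [-2.0, -1.5, -1.8, 1.9, 1.6, 1.8]
population = ["A", "A", "A", "B", "B", "B"]
batch = ["plate1", "plate2", "plate1", "plate2", "plate1", "plate2"]

r2_population = variance_explained_by_group(pc1, population)
r2_batch = variance_explained_by_group(pc1, batch)
print(round(r2_population, 3), round(r2_batch, 3))
```

A high value for batch on a leading PC is a prompt to revisit the lab or demultiplexing step. When batch and population are confounded by design, this statistic cannot separate them, so it supplements the plots rather than replacing them.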
This combination—LD-pruned SNPs, controlled missingness, and PCA diagnostics—sets a stable foundation for model-based clustering.
Principal component analysis on ddRAD SNPs separates wild accessions from elite varieties, illustrating how PCA exposes broad gradients and potential batch/collection effects. (Ruperao P. et al. (2023) PLoS ONE)
4) Picking K with confidence
K selection gets more attention from reviewers than any other step. Make the decision transparent, repeatable, and backed by multiple signals.
ADMIXTURE (maximum-likelihood with cross-validation)
- Replicates: For each K, run 5–10 independent seeds. Replicates expose unstable optima and prevent cherry-picking one "pretty" bar plot.
- Criterion: Prefer the K with lowest CV error (use the --cv flag). Inspect the full curve; shallow minima suggest weak population structure.
- Stability: Summarise replicate bar plots and ancestry proportions. Large between-run differences often trace back to inadequate LD pruning, noisy loci, or outlier samples.
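One way to summarise CV error across seeds is to parse the "CV error (K=...)" line that ADMIXTURE prints when run with --cv. The sketch below assumes each run's standard output was saved to a text log; the log snippets and error values are invented for illustration, and the function names are hypothetical.

```python
import re
from statistics import mean, stdev

# Matches lines such as: CV error (K=3): 0.51234
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def collect_cv(log_texts):
    """Map K -> list of CV errors found in an iterable of log contents."""
    by_k = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            by_k.setdefault(int(k), []).append(float(err))
    return by_k

def summarise_cv(by_k):
    """Return rows of (K, n_runs, mean, sd) sorted by K, and the K with lowest mean."""
    rows = []
    for k in sorted(by_k):
        errs = by_k[k]
        sd = stdev(errs) if len(errs) > 1 else 0.0
        rows.append((k, len(errs), mean(errs), sd))
    best_k = min(rows, key=lambda r: r[2])[0]
    return rows, best_k

# Invented log fragments standing in for files read from disk.
logs = [
    "Summary:\nCV error (K=2): 0.5610\n",
    "Summary:\nCV error (K=2): 0.5604\n",
    "Summary:\nCV error (K=3): 0.5312\n",
    "Summary:\nCV error (K=3): 0.5320\n",
    "Summary:\nCV error (K=4): 0.5391\n",
    "Summary:\nCV error (K=4): 0.5480\n",
]
rows, best_k = summarise_cv(collect_cv(logs))
for k, n, m, sd in rows:
    print(f"K={k} runs={n} mean={m:.4f} sd={sd:.4f}")
print("lowest mean CV error at K =", best_k)
```

Reporting the standard deviation next to the mean shows whether the minimum is clearly separated from its neighbours or sits within replicate noise.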
STRUCTURE (Bayesian clustering)
- Replicates: Run multiple independent chains per K with sufficient burn-in and MCMC length.
- Criteria: Inspect log-likelihood trends across K and consider the Evanno ΔK heuristic to identify the strongest hierarchical split. Note its limitations: ΔK is undefined at the ends of the K grid, so it cannot evaluate K = 1, and it tends to pick the uppermost level of hierarchical structure (often K = 2) even when finer subdivision exists. Combine it with biological context and PCA rather than using ΔK alone.
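The ΔK heuristic can be computed directly from replicate log-likelihoods. This sketch uses one common formulation: the absolute second difference of the mean ln P(D) across K, divided by the standard deviation of ln P(D) at K. Some implementations differ in detail, the input values are invented, and the function name is hypothetical.

```python
from statistics import mean, stdev

def evanno_delta_k(lnp):
    """Return dict K -> delta K for every K that has both neighbours in the grid.

    lnp: dict mapping K -> list of replicate ln P(D) values from STRUCTURE.
    """
    means = {k: mean(v) for k, v in lnp.items()}
    delta = {}
    for k in sorted(lnp):
        # Delta K needs K-1 and K+1, so the ends of the grid are skipped.
        if k - 1 not in means or k + 1 not in means or len(lnp[k]) < 2:
            continue
        sd = stdev(lnp[k])
        if sd == 0:
            continue  # undefined when replicates are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta[k] = second_diff / sd
    return delta

# Invented replicate log-likelihoods for K = 1..5.
lnp = {
    1: [-5200, -5202, -5198],
    2: [-4800, -4805, -4795],
    3: [-4500, -4502, -4498],
    4: [-4490, -4510, -4480],
    5: [-4485, -4520, -4470],
}
delta_k = evanno_delta_k(lnp)
peak_k = max(delta_k, key=delta_k.get)
print(delta_k, peak_k)
```

The output has no entry for K = 1 or for the largest K tested, which is the limitation in code form: the statistic cannot support or reject either end of the grid.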
Summarising many runs without bias
- Use tools such as CLUMPAK or equivalent to align clusters across replicates and K values, and to produce consistent colour schemes. This avoids manual run selection and provides a principled consensus view.
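The underlying problem is label switching: cluster 1 in one run may be cluster 3 in another. The sketch below shows the idea with a brute-force search over column permutations. It is not CLUMPAK's algorithm, it only scales to small K because it tries K! permutations, and the function name and membership values are hypothetical.

```python
from itertools import permutations

def align_to_reference(ref, run):
    """Reorder the cluster columns of `run` to best match `ref`.

    ref, run: lists of per-individual membership vectors, each of length K.
    Returns (aligned_run, perm), where perm[j] is the column of `run`
    placed in position j.
    """
    k = len(ref[0])

    def similarity(perm):
        # Sum of products of matched membership coefficients.
        return sum(r[j] * q[perm[j]] for r, q in zip(ref, run) for j in range(k))

    best = max(permutations(range(k)), key=similarity)
    aligned = [[q[best[j]] for j in range(k)] for q in run]
    return aligned, best

# Two replicate runs at K = 3 with the same structure but shuffled labels.
ref = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
run = [[0.0, 0.85, 0.15], [0.1, 0.1, 0.8], [0.75, 0.05, 0.2]]

aligned, perm = align_to_reference(ref, run)
print(perm)
print(aligned)
```

Once columns are aligned, replicates can be averaged or compared individual by individual, and colours stay consistent across bar plots.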
Guardrails against over-interpretation
- Avoid reading bar-plot heights as literal ancestry fractions. Present CV/likelihood curves, replicate stability, and PCA geography together, and narrate uncertainty clearly. That mix is what most reviewers expect.
Three distinct demographic histories (recent admixture, ghost admixture, recent bottleneck) can yield visually similar ADMIXTURE bar plots, underscoring the risk of literal ancestry interpretation without corroborating evidence. (Lawson D.J. et al. (2018) Nature Communications)
5) Reporting, failure modes, and a mini case
Strong studies are reproducible. Build trust by reporting decisions, not just pictures.
Reporting checklist (include in Methods or Supplement)
- Data & filters: counts before/after QC; MAF and missingness thresholds; LD parameters (window/step/r²); pruned SNP count.
- Software & versions: ADMIXTURE/STRUCTURE, PLINK, PCA toolchain, and exact command options or parameter files.
- Random seeds & replicates: how many runs per K and how seeds were set.
- Model selection: CV error (ADMIXTURE) and/or likelihood/ΔK (STRUCTURE) curves, with the chosen K marked and a one-paragraph rationale.
- Diagnostics: PCA plots (population and batch), scree plot, and membership bar plots.
- Limitations: one paragraph on known constraints (e.g., reduced-representation loci, gaps near repeats or centromeres, moderate missingness in one batch mitigated by stricter locus filters).
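One lightweight way to keep the checklist honest is to write the decisions to a machine-readable manifest alongside the figures. The sketch below is an assumption about how such a file might look, not a standard format: the field names are hypothetical, the numbers are placeholders, and the version strings are left unfilled.

```python
import json

def build_manifest(filters, ld_pruning, replicates, software):
    """Bundle analysis decisions into one JSON string for the Supplement."""
    manifest = {
        "filters": filters,
        "ld_pruning": ld_pruning,
        "replicates": replicates,
        "software": software,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)

# Placeholder values; replace with the parameters you actually used.
manifest_text = build_manifest(
    filters={
        "min_maf": 0.05,
        "max_locus_missing": 0.20,
        "snps_before_pruning": 24000,
        "snps_after_pruning": 3100,
    },
    ld_pruning={"window_snps": 50, "step_snps": 10, "r2": 0.2},
    replicates={
        "k_grid": list(range(1, 11)),
        "runs_per_k": 10,
        "seeds": list(range(1, 11)),
    },
    software={"ADMIXTURE": "<version>", "PLINK": "<version>"},
)
print(manifest_text)
```

Keeping the manifest under version control with the analysis scripts means the Methods text can be checked against what was actually run.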
Common failure modes and fixes
- Over-clustering at high K. Tighten LD pruning, raise SNP quality thresholds, re-run replicates, and report the lowest K that remains stable and consistent with PCA.
- Lane/plate shadows in PCA. Colour by batch; if separation appears, re-check demultiplexing, rebalance libraries, or treat batch as a covariate in downstream models.
- Uneven missingness by population. Remove problematic loci/samples first; if imputation is used, document the method and include a sensitivity analysis.
- Bar-plot storytelling. Avoid deterministic narratives. Present a short "what we know vs what is uncertain" box next to the figure.
Mini case blueprint (ddRAD in a non-model plant)
- Context: 240 accessions spanning three eco-regions; ~1.2 M paired-end reads per sample; de novo assembly in ipyrad; initial callset 24 k SNPs after per-locus QC (MAF ≥ 0.05; missingness ≤ 20%).
- LD pruning: indep-pairwise on 50-SNP windows, step 10, r² = 0.2 → 3,100 evenly spaced SNPs retained.
- PCA diagnostic: PC1 tracked rainfall gradient; no separation by lane/plate when coloured by batch.
- ADMIXTURE grid: K = 1–10 with 10 seeds per K; CV error minimum at K = 3, narrow variance across seeds.
- STRUCTURE check: log-likelihood plateau at K = 3–4; ΔK peaked at K = 3.
- Interpretation: three weakly differentiated gene pools matching eco-regions, with 10–20% shared ancestry in contact zones. Reported with caveats on reduced-representation and uneven marker spacing.
Throughout the case, internal links can point readers to Population Genomics Sequencing when planning sampling, to ddRAD sequencing service when validating enzyme/size windows, and to Population Structure Analysis in Bioinformatics when formalising the pipeline and figures.
FAQs: ddRAD population structure
How many SNPs are enough for structure analysis with ddRAD?
There is no universal threshold, but many projects stabilise once they retain a few thousand evenly spaced, LD-pruned SNPs. Simple, deep splits can be recovered with fewer; complex demographic histories benefit from more loci. Emphasise independence and informativeness over raw counts.
What LD settings should I try first for ADMIXTURE/STRUCTURE?
A practical start is a 50–100 kb window (or 50-SNP windows), a step of 10–20 SNPs, and r² between 0.2 and 0.5. Then sensitivity-test: if your preferred K changes when you tighten r², LD was influencing the result. Record parameters in the Methods.
How do I pick K without overfitting?
Use replicates per K, select by CV error (ADMIXTURE) or likelihood/ΔK (STRUCTURE), and confirm with PCA geography. Summarise replicates with a cluster-matching tool so you do not rely on one attractive run.
PCA and ADMIXTURE disagree—what should I trust?
Treat PCA as a diagnostic of broad gradients and batch effects. When disagreement persists, revisit LD pruning and missingness, remove outliers, and confirm replicate stability. Present both views with a short explanation rather than forcing convergence.
Do I need to impute missing genotypes?
Prefer filtering high-missing loci/samples before imputation. If you do impute, use a simple, transparent method on the pruned set and show with/without comparisons. Reviewers look for proof that your biological story does not hinge on an imputation choice.
Next steps
- Get a sanity check. Share your VCF/PLINK set for a 360° review of filters, LD pruning, PCA diagnostics, and K selection via our Population Structure Analysis in Bioinformatics.
- Start a pilot. Launch a ddRAD sequencing service pilot to confirm locus recovery, batch balance, and expected SNP yield before scaling to full cohorts—part of our Population Genomics Sequencing offering.
References
- Ruperao, P., Bajaj, P., Subramani, R. et al. A pilot-scale comparison between single and double-digest RAD markers generated using GBS strategy in sesame (Sesamum indicum L.). PLoS ONE 18(6), e0286599 (2023).
- Lawson, D.J., van Dorp, L., Falush, D. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nature Communications 9, 3258 (2018).
- Sermyagin, A.A., Dotsev, A.V., Gladyr, E.A. et al. Whole-genome SNP analysis elucidates the genetic structure of Russian cattle and its relationship with Eurasian taurine breeds. Genetics Selection Evolution 50, 37 (2018).
- Mussmann, S.M., Douglas, M.R., Chafin, T.K. et al. AdmixPipe: population analyses in Admixture for non-model organisms. BMC Bioinformatics 21, 337 (2020).
- Alexander, D.H., Novembre, J., Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19, 1655–1664 (2009).
- Patterson, N., Price, A.L., Reich, D. Population structure and eigenanalysis. PLoS Genetics 2(12), e190 (2006).
* Designed for biological research and industrial applications, not intended for individual clinical or medical purposes.