Introgression Analysis With D-Statistics: A Practical Workflow Beyond PCA and ADMIXTURE

When people say "this PCA looks admixed" or "that ADMIXTURE plot proves gene flow," they're often answering the wrong question.
Introgression questions aren't mainly about whether samples form clusters. They're about whether one lineage shares too many alleles with another lineage—more than you'd expect from incomplete lineage sorting (ILS) alone—under a defensible population topology.
That's the role of Patterson's D-statistic (the ABBA–BABA test): a formal test of excess allele sharing that complements, rather than replaces, structure analysis. This article lays out a practical D-statistic workflow for introgression analysis, aimed at project leads who need to decide (1) when D is the right next step, (2) how to design P1/P2/P3/outgroup without circular reasoning, and (3) how to report results so reviewers trust the inference.
Why PCA and ADMIXTURE Are Not Enough for Introgression Questions
PCA, ADMIXTURE, and trees are excellent for exploration: they help you see major structure, spot outliers, and sanity-check labels. The common mistake is treating "visual admixture" as formal evidence of introgression.
A PCA gradient can reflect isolation-by-distance, drift, bottlenecks, or uneven sampling. ADMIXTURE bar plots depend on model assumptions, LD pruning, the chosen K, and sample composition—so component shifts are not automatically history. A tree forces a single branching pattern even when the genome carries multiple histories.
If your real question is excess allele sharing, you need a test framed around that question.
What PCA and ADMIXTURE Do Well
Use structure analyses to:
- detect gross QC problems (missingness, batch effects, sample swaps),
- identify obvious substructure that would make groups incoherent,
- generate candidate hypotheses for which population comparisons are biologically meaningful.
If you need an upstream checklist for this stage, CD Genomics' overview of genetic diversity analysis is a useful reminder that diversity, relatedness, and sampling design sit underneath every "structure" figure.
Where Structure Plots Start to Over-Interpret History
Three over-interpretations come up repeatedly.
First, equating "a gradient" with "a single admixture event." Many processes produce gradients.
Second, reading ADMIXTURE components as literal ancestral populations or a historical timeline.
Third, using a tree as proof of no gene flow. A tree is a model fit; it can't represent reticulation.
For practical guidance on model validation—especially K selection and stability—see CD Genomics' guide on how to validate ADMIXTURE/STRUCTURE results.
What D-Statistics Adds That Structure Plots Cannot
D-statistics forces a falsifiable claim: in a specified four-taxon configuration, are ABBA and BABA site patterns imbalanced genome-wide?
That constraint is what makes the result reviewable. The reader can audit your quartet definitions and your outgroup choice rather than guessing what an ADMIXTURE color means.
Questions This Workflow Can Answer
This workflow is designed to answer questions like:
- "Given that P1 and P2 are sisters, does P3 share excess derived alleles with P2 relative to P1?"
- "Is a structure signal consistent with genome-wide excess allele sharing under the tested topology?"
- "Is the signal stable to reasonable changes in outgroup choice and sample grouping?"
It cannot, by itself, localize tracts, determine timing, infer directionality, or uniquely identify the donor when unsampled diversity is plausible.

What Patterson's D-Statistic Actually Tests
Patterson's D-statistic (ABBA–BABA) tests whether two discordant allele-sharing patterns occur at equal frequency across the genome in a four-population setup.
In its classic form, you specify (P1, P2) as sister groups, P3 as a candidate source (or related lineage), and an outgroup (O) used to polarize alleles. Across many loci, ILS produces discordant gene trees—but under the null, the ABBA and BABA patterns should be symmetric, so D is near zero.
The formal framework is described in the original ABBA–BABA test paper, Durand et al., 2011, and later canonical treatments of f-statistics (e.g., Patterson et al., 2012). In practice, teams often refer to this interchangeably as Patterson's D or the ABBA–BABA framework.
The Four-Population Logic Behind ABBA and BABA
At a biallelic site, after determining the ancestral allele using the outgroup, you ask whether P3 shares the derived allele more often with P2 than with P1. ABBA captures the "P3 with P2" sharing pattern; BABA captures the "P3 with P1" pattern.
D is essentially a normalized imbalance between those two counts.
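As a concrete sketch, that normalized imbalance can be computed from per-site derived-allele frequencies (this frequency-weighted form follows the standard definition; the function name and array inputs are illustrative):

```python
import numpy as np

def patterson_d(p1, p2, p3, p4):
    """Patterson's D from per-site derived-allele frequencies.

    p1..p4: derived-allele frequencies in P1, P2, P3, and the outgroup,
    one entry per biallelic site, polarized against the outgroup.
    Illustrative sketch; production tools also handle missing data
    and block resampling.
    """
    p1, p2, p3, p4 = map(np.asarray, (p1, p2, p3, p4))
    abba = (1 - p1) * p2 * p3 * (1 - p4)  # derived allele shared by P2 and P3
    baba = p1 * (1 - p2) * p3 * (1 - p4)  # derived allele shared by P1 and P3
    return (abba - baba).sum() / (abba + baba).sum()
```

A site fixed for the ABBA pattern (p1=0, p2=1, p3=1, p4=0) contributes +1 to the numerator, a BABA site contributes -1, and symmetric counts cancel toward D ≈ 0.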
Why Incomplete Lineage Sorting Matters
ILS isn't just noise; it's the baseline expectation. The test works because, under no excess allele sharing, ILS produces ABBA and BABA symmetrically. A consistent imbalance across the genome suggests something beyond ILS alone under the tested topology.
What a Significant D Value Means
A significant D value means there is evidence of excess allele sharing in your quartet.
In practice, you report D with a standard error estimated via block jackknife to account for linkage, plus a Z score (or p value) derived from that standard error—the standard reporting pattern described by Durand et al. (2011) in Testing for Ancient Admixture Between Closely Related Populations.
For many genome-wide applications, a common convention is to treat results as "strong" when |Z| > 3 (roughly p < 0.001 under a normal approximation), while still emphasizing that effect size, SNP count, and sensitivity checks matter as much as a threshold (see Soraggi et al., 2018, Powerful Inference with the D-Statistic on Low-Coverage Whole-Genome Data).
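The jackknife recipe above can be sketched directly: split the genome into blocks larger than the scale of linkage disequilibrium, sum ABBA/BABA weights per block, and recompute D with each block deleted in turn. This is a simplified, equal-treatment version of what tools like ADMIXTOOLS and Dsuite implement; names and the block structure are illustrative:

```python
import numpy as np

def block_jackknife_d(abba_blocks, baba_blocks):
    """D, block-jackknife SE, and Z from per-block ABBA/BABA sums (sketch).

    Blocks should exceed the scale of linkage disequilibrium (e.g., a few Mb)
    so that deleted blocks are approximately independent.
    """
    a = np.asarray(abba_blocks, dtype=float)
    b = np.asarray(baba_blocks, dtype=float)
    n = a.size
    d_full = (a.sum() - b.sum()) / (a.sum() + b.sum())
    # Delete-one-block ("leave-one-out") estimates of D
    loo = ((a.sum() - a) - (b.sum() - b)) / ((a.sum() - a) + (b.sum() - b))
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return d_full, se, d_full / se
```

The returned Z is what the |Z| > 3 convention refers to; the per-block leave-one-out spread is what protects the standard error from linkage-driven pseudo-replication.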
What a Significant D Value Does Not Mean
A significant D value does not prove a single clean introgression narrative.
It does not, by itself, provide timing, directionality, tract localization, or a robust admixture proportion. It is best treated as formal evidence of non-treeness consistent with gene flow under the tested topology, with alternative explanations explicitly discussed.
When D-Statistics Is the Right Next Step
D-statistics is most valuable when structure analyses have already suggested non-tree-like relationships and you can define a biologically defensible P1/P2/P3/outgroup setup.
If you can't define those populations without circular reasoning, a D result may be statistically significant while answering an incoherent question.
Good Use Cases for D-Statistics
D-statistics is a strong fit when:
- you have a meaningful sister comparison (P1 vs P2),
- P3 is a plausible contact lineage (or proxy for one),
- and the outgroup choice is defensible and stable.
Situations Where PCA or ADMIXTURE Is Still Enough
If your goal is exploratory—characterizing broad structure, selecting representative samples, or diagnosing data issues—structure analyses can be sufficient.
Also, if sampling is thin or groups are not well defined, formal testing can create false confidence: you get a Z score, but you still don't have an interpretable biological contrast.
When to Move Beyond D-Statistics
Move beyond D when your research claim requires more than "excess allele sharing under this topology," for example:
- admixture proportion estimation,
- comparing multiple sources,
- or tract-level localization.
A Decision Matrix for D-Statistics vs Follow-Up Methods
| Research question | Setup ready? | Expected output | D-statistics fit? | What to add next |
|---|---|---|---|---|
| Excess allele sharing under a quartet | Yes | D, SE, Z | Yes | Sensitivity checks |
| Robustness to setup choices | Mostly | Stability across alternatives | Yes | Alternative outgroups/groupings |
| Admixture proportion | Yes | Proportion estimate | Not alone | f4-ratio |
| Multiple sources / model plausibility | Partial | Source comparison | Sometimes | qpWave/qpAdm |
| Localized regions | Yes | Candidate tracts | No | Local methods + careful windows |
Define P1, P2, P3, and the Outgroup Before You Run Anything
Most failures in D-statistics projects are design failures: population definitions that bake in the conclusion, or outgroup choices that make allele polarization unstable.
If you want standardized preprocessing before quartets are even discussed, CD Genomics' overview of population structure analysis tools can help teams align on upstream checks.
How to Choose P1 and P2 Without Circular Reasoning
P1 and P2 should be defined by biology (geography, taxonomy, known strata), not by what makes D "pop."
A defensible approach is to pre-specify P1/P2, verify the sister relationship is plausible across upstream views, and state the contrast you intend to test. This keeps the inference interpretable and reduces post hoc fishing.
What Makes a Good P3 Population
P3 should be a plausible source lineage (or proxy) that makes sense in the ecological and historical context.
Avoid constructing P3 as a mixture of incompatible subgroups unless that is explicitly the hypothesis being tested.
How to Select an Outgroup That Stabilizes Interpretation
The outgroup is not a formality. It anchors allele polarization—this is why outgroup choice in an ABBA–BABA analysis can determine whether your interpretation is stable or fragile.
A good outgroup is outside the ingroup clade, not too diverged, and justified by independent phylogenetic evidence. When outgroup confidence is imperfect, treat it as a sensitivity axis and report whether sign and significance are stable across plausible alternatives.
Why Sample Grouping Errors Can Flip the Story
Hidden substructure, related individuals, or mislabeled samples can turn a coherent quartet into a moving target.
A reviewer-friendly discipline is simple: document group definitions and sample counts up front, state exclusions, and make sure the biological question still makes sense after the exclusions.

Build a Defensible Introgression D-Statistic Workflow Before Formal Testing
A strong D-statistics workflow is ordered by inference credibility, not by software convenience.
Step 1: QC and Variant Filtering
D-statistics is sensitive to systematic asymmetries across populations. Stabilize missingness, depth, and filtering criteria across groups, and remove close relatives where appropriate.
Step 2: Population Structure and Relatedness Review
Use PCA/ADMIXTURE/tree/relatedness to verify that (1) the planned sister relationship is plausible and (2) groups are not mixtures of incompatible strata.
If you need a consolidated upstream blueprint, CD Genomics' population structure analysis workflow is a practical reference for standardizing this stage.
Step 3: Tree-Aware Hypothesis Framing
Write down the tested topology and why it is biologically defensible. This step prevents a common failure mode: running many quartets and interpreting the most significant one as "the" story.
Step 4: Running D-Statistics and Checking Significance
Report the quartet definition, D, a block-jackknife standard error, and the derived Z score (or p value). If you ran many quartets, treat multiple testing and non-independence as interpretation constraints, not footnotes.
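Converting a jackknife Z score into the two-sided p value mentioned above is a one-liner worth standardizing across your quartet table (a sketch under the normal approximation; the function name is illustrative):

```python
import math

def two_sided_p(z):
    """Two-sided normal-approximation p value for a jackknife Z score (sketch)."""
    # 2 * (1 - Phi(|z|)) expressed via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For example, two_sided_p(3.0) is roughly 0.0027, which is where the |Z| > 3 "strong signal" convention sits; remember that this p value says nothing about multiple testing across quartets.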
Step 5: Cross-Checking With Complementary Evidence
Sensitivity checks are not optional when claims are strong. Re-run with alternative plausible outgroups, alternative groupings, and competing P3 candidates.
Step 6: Deciding Whether to Stop or Extend the Analysis
If your claim is "formal evidence of excess allele sharing under a tested topology," stable D results plus sensitivity checks may be enough.
If your claim involves proportions, competing sources, or localized regions, plan an escalation path rather than stretching D beyond what it was designed to do.

Common Pitfalls That Make D-Statistics Look Stronger Than It Is
The safest posture is to treat a significant D as evidence of excess allele sharing consistent with introgression under the tested topology—not as a complete historical narrative. Many D-statistics interpretation pitfalls come from skipping that conditional phrasing and turning a test statistic into a story.
Outgroup Mis-Specification
If the outgroup does not reliably represent the ancestral allele, ABBA/BABA counts can shift in ways that look like introgression. This risk is repeatedly flagged in methodological discussions of ABBA–BABA because allele polarization errors directly bias pattern counts (see Durand et al., 2011, Testing for Ancient Admixture Between Closely Related Populations). Reviewers want the rationale and stability checks, not just the outgroup name.
Rate Variation and Other Non-Admixture Asymmetries
Even with no gene flow, lineage-specific rate variation and related asymmetries can create ABBA/BABA imbalance that mimics introgression. If you suspect heterogeneous rates (or your outgroup is very distant), treat D as a screening result and strengthen sensitivity checks rather than escalating the claim (see Doyle et al., 2024, Towards Reliable Detection of Introgression in the Presence of Rate Variation).
Ancestral Structure and Ghost Lineages
Remember that D is a quartet-based test of non-treeness under your specified topology. Significant results can still be consistent with more complex histories, including ancestral population structure or introgression from unsampled "ghost" lineages; the reviewer-trusted move is to phrase conclusions conditionally and note plausible alternatives rather than naming a unique donor without support.
Treating Genome-Wide and Local D as the Same Question
Genome-wide D asks whether there is overall excess allele sharing. Local-window statistics ask where signals vary across the genome.
Those are different questions, and local scans can be high-variance. Tool-focused practitioner discussions such as the Dsuite paper (Malinsky et al., 2020) emphasize that window- and trio-level tests are correlated and call for cautious interpretation.
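The distinction is easy to make concrete: given per-site ABBA/BABA weights and positions (assumed precomputed; names are illustrative), local D is the same ratio restricted to a window, and sparse windows show immediately why local values are noisy:

```python
import numpy as np

def windowed_d(abba, baba, pos, window=50_000):
    """Per-window D values, for screening only; interpret with caution.

    abba/baba: per-site pattern weights; pos: site positions (same order).
    Windows with few informative sites yield noisy or undefined D.
    """
    abba, baba, pos = map(np.asarray, (abba, baba, pos))
    results = []
    for start in range(0, int(pos.max()) + 1, window):
        in_window = (pos >= start) & (pos < start + window)
        a, b = abba[in_window].sum(), baba[in_window].sum()
        d = (a - b) / (a + b) if (a + b) > 0 else float("nan")
        results.append((start, d))
    return results
```

A window holding a single informative site returns D = ±1 regardless of history, which is the core reason window outliers should not be read as "introgressed loci" without dedicated localization methods.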
Ghost Lineages and Unsampled Diversity
Even when D is significant, identifying the "donor" can be fragile if unsampled diversity is plausible. In that setting, the cautious interpretation is to describe excess allele sharing between sampled lineages and explicitly acknowledge alternative sources.
Minor Rate Variation and False Signals
D relies on assumptions that can be stressed by rate variation or distant outgroups. When signals are weak or depend on a single setup, interpretation should be downgraded rather than embellished.
Complex Demography vs Simple Introgression Narratives
Real histories often include multiple contacts, continuous migration, and population size changes. In these contexts, D is often best treated as a screening statistic that prioritizes which relationships deserve richer modeling. For broader context on how demography and evolutionary history shape inference, CD Genomics' overview of population evolution analysis is a useful reference point.
Key Takeaway: If your conclusion can't survive mild perturbations (outgroup, grouping, missing taxa), phrase it as "consistent with introgression under the tested topology," not as a definitive narrative.
What to Add After D-Statistics if the Question Gets Bigger
D-statistics is often the first formal test. Stronger claims usually need follow-up methods.
When to Add f4-Ratio
If you need an admixture proportion estimate and your topology supports it, f4-ratio approaches may be appropriate.
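Under the standard f4-ratio setup (target X modeled as a mixture of lineages related to B and C, with A on B's side of the tree and O the outgroup), the proportion falls out as a ratio of two f4 values. The frequency-based sketch below is illustrative only, not a replacement for ADMIXTOOLS-style estimation with proper standard errors:

```python
import numpy as np

def f4(pa, pb, pc, pd):
    """f4(A,B; C,D): mean over sites of (pA - pB) * (pC - pD)."""
    pa, pb, pc, pd = map(np.asarray, (pa, pb, pc, pd))
    return np.mean((pa - pb) * (pc - pd))

def f4_ratio(pa, po, px, pb, pc):
    """Admixture proportion alpha = f4(A,O;X,C) / f4(A,O;B,C) (sketch)."""
    return f4(pa, po, px, pc) / f4(pa, po, pb, pc)
```

If X's allele frequencies are exactly alpha * pB + (1 - alpha) * pC site by site, the ratio recovers alpha; in real data it is an expectation that only holds when the assumed topology is right, which is why D-based detection should precede proportion estimation.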
When qpWave or qpAdm Is More Informative
When you need to compare multiple plausible sources or test whether a target can be modeled as a mixture of references, qpWave/qpAdm-style modeling is often more directly aligned with the question.
A common way readers phrase this decision is "qpAdm vs D-statistics": use D to detect and prioritize, then use qpAdm-style approaches when the question becomes model plausibility and source comparison.
When Genome-Wide D Should Be Followed by Local Methods
If the biological question is tract-level introgression, local methods can help—but only with a clear window strategy and cross-validation. Treat genome-wide D as the justification for localization, not the localization tool.
How to Escalate Without Turning the Article Into a Software Tutorial
The escalation principle is simple: promote the question, not the tool. Use D for formal testing of excess allele sharing; use other methods for proportions, sources, or localization.
For a broader menu of downstream approaches, CD Genomics' overview of gene flow analysis provides a roadmap without becoming a command-by-command tutorial.
What Good D-Statistics Reporting Looks Like
Reviewer-trusted reporting explains the tested topology, population definitions, significance framework, and alternative interpretations. A single significant number in isolation is rarely persuasive.
A Minimal Table for Tested Population Quartets
Include a table listing P1, P2, P3, outgroup, D, SE, Z, and the number of SNPs used. If you ran many quartets, provide the full table as supplementary material and highlight the biologically central tests in the main text.
What to Report About Outgroup Choice
Write a short rationale: why this outgroup is outside the ingroup clade, why it is an appropriate distance, and whether the conclusion is stable to plausible alternatives.
Significance, Replication, and Sensitivity Notes
State how you computed standard errors (block jackknife), and summarize the sensitivity checks you ran. The reporting goal is not maximal significance; it is maximal auditability.
How to Phrase Conclusions Conservatively
A reviewer-friendly template sentence is:
These results support excess allele sharing consistent with introgression under the tested topology.
A Short Checklist for Methods and Supplementary Files
Reviewers commonly expect: sample lists and group definitions, filtering criteria, full quartet tables, jackknife settings, and a compact sensitivity summary. This aligns with broader calls in the literature for clearer, more standardized reporting of introgression analyses across studies (see Hibbins & Hahn, 2022, A need for standardized reporting of introgression: Insights from hybrid zone studies).

Published Case Examples That Make the Workflow Concrete
Rather than treating D-statistics as a human-genomics-only tool, it's useful to recognize a few recurring patterns of use.
Case Example 1: A Landmark Human Admixture Application
In human evolutionary genomics, early genome-wide applications used ABBA–BABA logic to test for excess allele sharing between non-African populations and archaic lineages relative to African populations (e.g., the Neanderthal introgression literature). These studies helped establish the test as a formal, scalable screen for non-tree-like history.
Case Example 2: A Non-Human Introgression Study With Formal Testing
In non-model systems, D-statistics is often used to validate gene flow hypotheses suggested by structure plots or discordant gene trees, followed by sensitivity checks and escalation only when the biological claim requires more than detection.
Case Example 3: Why Localizing Introgressed Regions Needs More Than D Alone
Using local-window ABBA–BABA outliers as "introgressed loci" can be misleading when information content and coalescent histories vary across the genome. If tract-level claims matter, plan for dedicated localization methods and cross-validation.
When to Use a Service Instead of Solving Every Step In-House
Projects are often worth outsourcing when the bottleneck is not computation but defensible design and interpretation: population setup, outgroup choice, sensitivity checks, and a reviewer-ready reporting package.
CD Genomics can support research-use-only (RUO) introgression projects when teams want help turning exploratory structure results into a defensible formal testing plan, executing D-statistics with documented assumptions, and assembling reporting outputs reviewers can audit. If your work involves archaic or deeply divergent lineages, their overview of archaic introgression analysis shows the broader scope such projects may include.
If you're requesting support, a strong input package includes variant files (or allele counts), population labels, upstream PCA/ADMIXTURE/tree outputs, and the biological question you want to test. For a broader view of available options, see CD Genomics' hub for bioinformatics analysis for population genomics.
FAQs
Does an ADMIXTURE plot prove introgression?
No. ADMIXTURE can show patterns consistent with mixed ancestry under its model assumptions, but it does not formally test excess allele sharing under a specific topology, and it is sensitive to choices like K and LD pruning.
What does a significant D-statistic mean?
A significant D-statistic means ABBA and BABA patterns are imbalanced genome-wide in your specified quartet, indicating excess allele sharing between P3 and one of the sister populations under the tested topology. It does not, by itself, identify the direction, timing, or genomic locations of introgression.
How important is outgroup choice?
It's critical. The outgroup anchors allele polarization, and mis-specification can shift ABBA/BABA counts in ways that look like introgression. If your conclusion depends on one outgroup choice, test plausible alternatives and report stability.
Can genome-wide D localize introgressed tracts?
Not reliably on its own. Genome-wide D is a detection test; tract-level localization is a different problem that typically requires local methods, careful window definitions, and cross-validation.
When is qpAdm more appropriate than D-statistics?
Use qpAdm-style modeling when your question involves mixture proportions or comparing multiple candidate sources and D-statistics has already identified non-tree-like relationships worth modeling.
What should a D-statistics report include?
Report the quartet definitions, outgroup rationale, D with block-jackknife standard errors and Z scores, and a sensitivity summary showing whether results are stable to reasonable changes in outgroup choice and population grouping. Phrase conclusions conservatively: evidence of excess allele sharing consistent with introgression under the tested topology.
