How Population Genomics + GWAS Uncovered Mulberry Domestication History and Breeding-Relevant Loci
This casebook summarizes a published high-impact study (Advanced Science, 2023) that combined population genomics and GWAS to clarify mulberry (Morus spp.) domestication and expansion, quantify gene flow/introgression, reconstruct demographic history, and identify genomic regions linked to agronomic traits (leaf size/biomass and flowering time). The same end-to-end workflow is widely applicable to crops, forestry species, aquaculture species, and non-model organisms where researchers need both evolutionary context and actionable markers.
Data source: All study findings summarized here are from the original publication (DOI: 10.1002/advs.202300039).
At a glance (study facts):
- Cohort size: 425 mulberry accessions
- Data generation: whole-genome resequencing, ~20× mean depth
- New data + integration: 290 newly resequenced + 135 from previous studies
- Variants discovered: 2,359,117 high-quality SNPs + 934,187 short InDels (<10 bp)
- Core analyses: population structure/phylogeny, gene flow (f3 + ABBA-BABA), demography (PSMC + SMC++), GWAS
- Trait signals: leaf size/biomass (Chr 7), flowering time (Chr 5)
- Candidate genes proposed: MaBXY5 and MaERF110
Why Mulberry Is a Powerful Model for "Evolution + Breeding" Genomics
Mulberry (Morus spp.) is a widely cultivated economic plant across many developing countries in Asia. Its importance goes beyond yield: as the sole natural food source of the domesticated silkworm (Bombyx mori), mulberry sits at the foundation of sericulture and has played an outsized role in historical trade and cultural exchange.
Despite its long cultivation history, the genetic and evolutionary story of mulberry domestication and spread has been less clear than for many staple crops. For breeders and population geneticists, this uncertainty creates practical problems:
- If domestication involved repeated mixing among lineages, population structure and introgression can confound trait mapping.
- If lineages experienced different bottlenecks, allele frequencies and LD patterns can vary by group—affecting marker transferability.
- Without a high-quality variant resource, it is difficult to move from "interesting biology" to usable molecular markers.
This published study addressed those gaps by pairing large-scale whole-genome resequencing with population genetic inference and GWAS, producing a single, coherent narrative from variants → history → trait loci.
Study Design: A Cohort Built for Both Population Genomics and GWAS
Cohort composition and sampling scale
The study analyzed 425 mulberry accessions (germplasm resources). Importantly, the cohort combined:
- 290 newly resequenced individuals, and
- 135 additional individuals integrated from earlier research.
That "new data + published data integration" approach is increasingly common in population genomics because it can expand geographic coverage and genetic diversity without restarting from scratch—provided that harmonization and QC are handled carefully.
Sequencing strategy
All individuals were profiled by whole-genome resequencing, with an average sequencing depth of ~20×. For population genomics, that depth supports robust variant discovery and reduces uncertainty in genotype calls relative to shallow sequencing—especially useful when results will feed into multiple downstream analyses (structure, gene flow, demography, GWAS).
Variant discovery outcome (the foundation for everything else)
From the 425 accessions, the authors reported:
- 2,359,117 high-quality SNPs, and
- 934,187 short InDels (<10 bp).
This dense variant set is the enabling resource that makes the rest of the paper possible: it increases resolution for phylogeny and structure, improves power for detecting introgression signals, and provides the marker density GWAS needs to localize trait associations.
End-to-End Analytical Workflow (Variants → History → Traits)
A strength of this study is that it does not treat population genetics and GWAS as separate projects. Instead, it uses a connected workflow where evolutionary inference informs trait mapping, and trait mapping is interpreted in the context of structure and gene flow.
Workflow overview of the population genomics and GWAS analyses used to study mulberry domestication history and trait loci: 425 mulberry accessions → whole-genome resequencing → variant calling → population structure, diversity, LD, gene flow, and demographic history, with phenotypes and GWAS leading to breeding markers.
Below is the end-to-end workflow used in this study, summarized for clarity.
1) Population Structure and Phylogeny: Defining Genetic Groups First
What the study did
Using the SNP dataset, the authors built a phylogenetic tree and clustered all accessions into five distinct genetic groups. They also examined genetic structure and geographic distribution, tying genetic clusters to where accessions were sampled or cultivated.
Phylogenetic tree, LD decay curve, PCA, and admixture plots showing population structure of 425 mulberry accessions.
Why this step is essential for GWAS and marker development
In plant populations—especially those shaped by domestication, breeding, and human-mediated movement—population structure is not just a "nice plot." It directly impacts inference:
- GWAS confounding: allele frequency differences among groups can produce false positives if not modeled.
- Marker portability: a marker associated with a trait in one subgroup may not transfer to another if LD patterns differ.
- Interpretability: identifying trait loci without understanding structure risks misattributing signals to selection or adaptation.
If your goal includes GWAS, start by characterizing population structure (and relatedness) so association models can be properly controlled.
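One way to make the relatedness part of this step concrete is a genomic relationship matrix (GRM). The sketch below applies the VanRaden method-1 formula to a toy genotype matrix coded as 0/1/2 alternate-allele counts. The function name, the toy data, and the choice of estimator are illustrative assumptions, not the procedure reported in the mulberry study.

```python
def vanraden_grm(genotypes):
    """Genomic relationship matrix from 0/1/2 alternate-allele counts.

    genotypes: list of samples, each a list of counts (one per SNP).
    Returns an n x n nested list. Monomorphic SNPs are skipped.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Alternate-allele frequency of each SNP in this cohort
    freqs = [sum(sample[j] for sample in genotypes) / (2 * n) for j in range(m)]
    keep = [j for j in range(m) if 0 < freqs[j] < 1]
    if not keep:
        raise ValueError("no polymorphic SNPs to compute relatedness from")
    # Scaling term: 2 * sum of p * (1 - p) over retained SNPs
    scale = 2 * sum(freqs[j] * (1 - freqs[j]) for j in keep)
    # Center each genotype by its expected value 2p
    centered = [[sample[j] - 2 * freqs[j] for j in keep] for sample in genotypes]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / scale
         for k in range(n)]
        for i in range(n)
    ]


# Toy cohort (hypothetical): samples 0-1 and samples 2-3 form two tight groups
toy_genotypes = [
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]
toy_grm = vanraden_grm(toy_genotypes)
for row in toy_grm:
    print([round(value, 2) for value in row])
```

Large positive off-diagonal values flag close relatives or shared subgroup membership, and negative values flag pairs from different groups. A matrix of this kind is what kinship-aware association models take as input.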
2) Gene Flow and Introgression: Testing the "Messy Middle" of Domestication
What the study did
The authors used f3 statistics and ABBA–BABA (D-statistics) to test gene flow among mulberry populations. These methods are widely used to detect admixture and introgression signals that are not always obvious from phylogenetic trees alone.
Key finding: different mulberry populations showed extensive gene flow, implying frequent inter- and intra-specific introgression during domestication and cultivation.
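To show what these two statistics measure, here is a minimal sketch that computes them from per-population allele frequencies. It assumes derived-allele frequencies are already available for each population, uses the plain (uncorrected) f3 estimator, and omits the block-jackknife step that real analyses use to attach a Z-score. Function names and all frequency values are hypothetical and are not taken from the study.

```python
def f3_statistic(freq_c, freq_a, freq_b):
    """Uncorrected f3(C; A, B): mean of (c - a) * (c - b) across SNPs.

    A clearly negative value suggests population C is admixed between
    lineages related to A and B.
    """
    products = [(c - a) * (c - b) for c, a, b in zip(freq_c, freq_a, freq_b)]
    return sum(products) / len(products)


def d_statistic(p1, p2, p3, outgroup):
    """ABBA-BABA D for the tree (((P1, P2), P3), O) from derived-allele frequencies.

    D > 0 indicates excess allele sharing between P2 and P3,
    D < 0 indicates excess sharing between P1 and P3.
    """
    abba = baba = 0.0
    for f1, f2, f3, fo in zip(p1, p2, p3, outgroup):
        abba += (1 - f1) * f2 * f3 * (1 - fo)
        baba += f1 * (1 - f2) * f3 * (1 - fo)
    total = abba + baba
    if total == 0:
        return float("nan")  # no informative sites
    return (abba - baba) / total


# Hypothetical frequencies at three SNPs: C sits between A and B
f3_value = f3_statistic(
    freq_c=[0.5, 0.5, 0.5],
    freq_a=[0.1, 0.2, 0.9],
    freq_b=[0.9, 0.8, 0.1],
)

# Hypothetical frequencies at two SNPs: P2 shares derived alleles with P3
d_value = d_statistic(
    p1=[0.1, 0.0],
    p2=[0.6, 0.5],
    p3=[0.9, 1.0],
    outgroup=[0.0, 0.0],
)

print(f"f3 = {f3_value:.4f}, D = {d_value:.4f}")
```

In practice both statistics are computed genome-wide with dedicated software, and significance is judged from a block jackknife rather than from the point estimate alone.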
Why this matters in domestication studies
Domestication is often presented as a simple split between "wild ancestor" and "cultivated descendant." Real plant domestication histories are frequently more complex:
- farmers move plants across regions,
- breeding crosses occur intentionally,
- introgression may be repeated across many generations,
- cultivated and wild stands may coexist and cross.
By explicitly testing gene flow, this study avoided an oversimplified narrative and instead supported a model where domestication and expansion occurred amid substantial genetic exchange.
If your species has a long cultivation history or wide geographic spread, plan for explicit gene flow/admixture testing—because it can change both the evolutionary story and how you interpret GWAS hits.
3) Demographic History: Reconstructing Bottlenecks and Effective Population Size Changes
What the study did
The study inferred demographic trajectories using PSMC and SMC++, methods designed to estimate historical changes in effective population size (Ne) from genome-wide data.
Key demographic signals reported:
- M. atropurpurea and M. alba experienced bottlenecks during the Quaternary glaciation period.
- A lineage/group (MAM, as defined in the original study) showed a continuous decline in Ne in the recent past.
Demographic history and dispersal of mulberry, including geographic sampling, genetic diversity, admixture signals, and effective population size through time (from Figure 3 of the paper).
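PSMC reports time and population size in scaled units, so Ne curves like those described above depend on an assumed mutation rate and generation time. The sketch below shows the conventional rescaling, assuming the default 100 bp bin size. The mutation rate, generation time, and input values are placeholders chosen for illustration, not the parameters used in the study.

```python
def scale_psmc(theta0, times, lambdas, mu, gen_years, bin_size=100):
    """Convert scaled PSMC output into (years ago, Ne) pairs.

    theta0:    per-bin scaled mutation rate estimated by PSMC
    times:     scaled interval boundaries (in units of 2 * N0 generations)
    lambdas:   relative population sizes per interval
    mu:        per-site, per-generation mutation rate (assumed)
    gen_years: generation time in years (assumed)
    bin_size:  bases per bin used when preparing the PSMC input
    """
    n0 = theta0 / (4 * mu * bin_size)  # reference effective population size
    return [
        (2 * n0 * t * gen_years, n0 * lam)
        for t, lam in zip(times, lambdas)
    ]


# Placeholder inputs, for illustration only
trajectory = scale_psmc(
    theta0=0.004,
    times=[0.1, 0.5, 2.0],
    lambdas=[1.5, 2.0, 0.8],
    mu=1e-8,
    gen_years=5,
)
for years_ago, ne in trajectory:
    print(f"{years_ago:>10.0f} years ago: Ne ~ {ne:.0f}")
```

Because both axes scale with the assumed mutation rate, and the time axis also scales with generation time, the timing of an inferred bottleneck should be read together with those assumptions.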
Why demographic context strengthens both evolution and breeding conclusions
Demographic events shape genetic variation and can mimic or mask selection:
- Bottlenecks can reduce diversity and inflate LD—changing GWAS resolution.
- Expansion can create allele frequency gradients that look like adaptation if not controlled.
- Different lineages can carry different levels of standing variation, affecting trait improvement potential.
When demography is analyzed alongside structure and gene flow, you can interpret genetic patterns more confidently and design downstream studies more strategically (e.g., subgroup GWAS, balanced sampling, targeted crosses).
Demographic inference is not only "history"; it informs practical choices like cohort design, expected LD decay, and how transferable trait markers may be across lineages.
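The LD point can be made concrete with the r² statistic, which underlies LD decay curves. This is a minimal sketch assuming phased haplotypes coded as 0/1 at two SNPs; real pipelines usually estimate r² from unphased genotypes with dedicated tools. The function name and toy haplotypes are hypothetical.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a, hap_b: allele (0 or 1) carried at each SNP by the same
    set of phased haplotypes, in the same order.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # at least one SNP is monomorphic
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denom


# Toy haplotypes: the first pair is perfectly linked, the second is independent
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))
```

Averaging r² over SNP pairs binned by physical distance gives an LD decay curve. Slower decay in a bottlenecked group means association peaks span longer regions and contain more candidate genes.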
4) GWAS: Linking Genomes to Leaf Traits and Flowering Time
Genomic loci associated with leaf size during mulberry domestication, integrating phenotype differences, GWAS peaks, selection signals, and candidate gene/LD evidence (from Figure 4 of the paper).
Traits investigated
The study performed GWAS for:
- Leaf size and biomass-related traits (leaf dimensions/weight at maturity), and
- Flowering time, assessed annually during the blooming season.
These are breeding-relevant traits in mulberry: leaf biomass connects directly to sericulture productivity, while flowering time can affect adaptation and management.
Major association signals reported
The GWAS identified:
- Chromosome 7 as a key region associated with leaf size/biomass, and
- Chromosome 5 as a key region associated with flowering time.
The authors further proposed two candidate genes as potential regulators:
- MaBXY5 (linked to leaf traits), and
- MaERF110 (linked to flowering time).
How to interpret this (without over-claiming)
GWAS highlights loci associated with phenotypes, but breeding action requires careful follow-up:
- confirm robustness across subpopulations and years,
- validate candidate genes (expression evidence, functional assays, or independent cohorts),
- translate into markers that are stable across breeding material.
Still, the study demonstrates a powerful principle: when population structure and gene flow are accounted for, GWAS can yield practical, chromosome-level signals and candidate genes even in complex domestication contexts.
The most credible GWAS in domesticated species is anchored by population genomics (structure + gene flow + demography), not performed in isolation.
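Two quick diagnostics help judge whether a GWAS of this size is well controlled: a multiple-testing threshold and the genomic inflation factor (lambda). The sketch below assumes a Bonferroni correction over the reported SNP count, which is one conservative convention; the threshold actually used in the study is not stated in this summary. The p-values are synthetic.

```python
import math
from statistics import NormalDist, median

N_SNPS = 2_359_117  # SNP count reported in the study

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff under a Bonferroni correction."""
    return alpha / n_tests


def genomic_inflation(p_values):
    """Lambda: median observed chi-square over its expected median.

    Values near 1 suggest structure and relatedness are well controlled;
    values well above 1 suggest residual confounding.
    """
    normal = NormalDist()
    # Two-sided p-value -> 1-df chi-square statistic (p must be in (0, 1])
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


threshold = bonferroni_threshold(N_SNPS)
print(f"Bonferroni cutoff: {threshold:.3e} (-log10 = {-math.log10(threshold):.2f})")

# Synthetic, evenly spaced p-values mimic a null scan with no inflation
null_p_values = [(i + 0.5) / 1000 for i in range(1000)]
print(f"lambda = {genomic_inflation(null_p_values):.3f}")
```

A lambda close to 1 together with peaks that clear the threshold is the pattern one hopes to see on the QQ and Manhattan plots. Because SNPs in LD are not independent tests, Bonferroni over all SNPs is stricter than necessary, and many studies use an effective number of tests instead.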
What Makes This Study a "Classic" Template for Population Genomics + GWAS
This case is worth emulating because it combines four components that—together—produce publishable insights and practical outputs:
- Scale and diversity (425 accessions) enabling robust structure and mapping
- High-confidence variant resource (SNPs + InDels) enabling multiple analyses
- Explicit modeling of gene flow and demography to avoid misleading conclusions
- Trait mapping with interpretable outputs (chromosomal regions + candidate genes)
For teams planning similar projects, the value is not only the biological story of mulberry, but the reusable study design logic.
Practical Checklist: How to Reuse This Workflow in Your Own Species
Below is a planning checklist written for researchers designing a population genomics + GWAS study (plants, aquaculture species, forestry trees, or non-model organisms).
A) Cohort and sampling
- Aim for a cohort that captures geography + breeding history + wild/cultivated contrasts where relevant.
- If integrating published genomes, plan early for cross-study harmonization (consistent reference, joint genotyping strategy, and QC thresholds).
B) Sequencing and variant calling strategy
- Choose depth and sample size based on goals: demography and high-quality genotypes generally benefit from higher coverage; discovery-focused screening may prioritize sample count.
- Define a clear pipeline for producing high-confidence SNPs and short InDels suitable for structure + GWAS.
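The depth versus sample-count trade-off in item B reduces to simple arithmetic once a sequencing budget is fixed. The sketch below assumes raw output is the only constraint and uses a placeholder genome size of 350 Mb; that value and the budget are hypothetical and are not taken from the study.

```python
def samples_for_budget(total_gb, genome_size_mb, depth):
    """Number of samples a raw-data budget supports at a target mean depth.

    total_gb:       total sequencing output available, in gigabases
    genome_size_mb: haploid genome size, in megabases (assumed)
    depth:          target mean coverage per sample
    """
    gb_per_sample = genome_size_mb * depth / 1000
    return int(total_gb // gb_per_sample)


BUDGET_GB = 2975      # hypothetical total output
GENOME_SIZE_MB = 350  # placeholder genome size

for depth in (10, 20, 30):
    n = samples_for_budget(BUDGET_GB, GENOME_SIZE_MB, depth)
    print(f"{depth}x mean depth: {n} samples")
```

Halving depth doubles the cohort for the same output, which favors association power, while higher depth favors genotype accuracy and single-genome demographic methods such as PSMC.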
C) Population genomics first, GWAS second
- Build the "genetic map of your cohort": structure, phylogeny, and relatedness.
- Test gene flow/admixture explicitly (f3/D-statistics or comparable frameworks).
- Consider demography to understand diversity, LD expectations, and subgroup differences.
D) GWAS design and interpretation
- Ensure phenotyping is consistent across years/locations when possible; record metadata that supports covariate modeling.
- Use association models that control for structure and kinship.
- Prioritize outputs that are actionable: candidate regions, annotated gene lists, and marker shortlists.
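To illustrate why item D asks for models that control for structure, the sketch below tests one SNP with and without a population-structure covariate. It is a simplified single-covariate linear regression with a normal approximation for the p-value, not the mixed model a production GWAS would use and not the model reported in the study. All data are synthetic: the phenotype differs between two groups, and the SNP merely tracks group membership.

```python
import math
from statistics import NormalDist, mean


def residualize(values, covariate):
    """Remove the linear effect (with intercept) of one covariate."""
    mx, my = mean(covariate), mean(values)
    sxx = sum((x - mx) ** 2 for x in covariate)
    slope = sum((x - mx) * (y - my) for x, y in zip(covariate, values)) / sxx
    return [y - my - slope * (x - mx) for x, y in zip(covariate, values)]


def snp_test(phenotype, genotype, covariate=None):
    """Effect size and approximate two-sided p-value for one SNP."""
    n = len(phenotype)
    if covariate is None:
        y = [v - mean(phenotype) for v in phenotype]
        g = [v - mean(genotype) for v in genotype]
        df = n - 2  # intercept + SNP
    else:
        y = residualize(phenotype, covariate)
        g = residualize(genotype, covariate)
        df = n - 3  # intercept + covariate + SNP
    sgg = sum(v * v for v in g)
    beta = sum(a * b for a, b in zip(g, y)) / sgg
    rss = sum((b - beta * a) ** 2 for a, b in zip(g, y))
    se = math.sqrt(rss / df / sgg)
    # Normal approximation to the t distribution (reasonable for large n)
    p_value = 2 * (1 - NormalDist().cdf(abs(beta / se)))
    return beta, p_value


# Synthetic cohort: six samples from group A, six from group B
group_axis = [-1] * 6 + [1] * 6  # stand-in for a principal component
genotype = [0, 0, 0, 1, 1, 0, 2, 2, 1, 2, 1, 2]
phenotype = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0,
             12.0, 12.1, 11.9, 11.8, 12.2, 12.0]

beta_naive, p_naive = snp_test(phenotype, genotype)
beta_adj, p_adj = snp_test(phenotype, genotype, covariate=group_axis)
print(f"naive:    beta = {beta_naive:.3f}, p = {p_naive:.2e}")
print(f"adjusted: beta = {beta_adj:.3f}, p = {p_adj:.2f}")
```

The naive test reports a strong association that disappears once the group axis is included, which is the false-positive pattern that structure and kinship terms are meant to prevent.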
How CD Genomics Supports Similar Population Genomics + GWAS Projects
If you want to run a mulberry-like workflow in your own organism, the project can be organized as modular deliverables—so you can start with population structure and expand into GWAS and marker development.
Typical service modules aligned to this case:
- Whole-genome resequencing study design (cohort strategy, data integration plan, QC endpoints)
- Variant discovery and filtering for population-scale WGS (SNPs + short InDels)
- Population structure & phylogeny (cluster definition, geographic interpretation support)
- Gene flow / introgression analysis (f3/D-statistics style evidence summaries)
- Demographic history inference (PSMC/SMC++-style modeling and interpretation)
- GWAS pipeline and reporting (association plots, locus summaries, candidate gene prioritization)
- Breeding-oriented outputs (marker candidate shortlist, panel design guidance, validation strategy suggestions)
Tell us your species, target traits, cohort size, and whether you have existing WGS data to integrate. We’ll recommend a sequencing + analysis plan aligned to population structure, gene flow, and GWAS best practices.
Deliverables You Can Expect
For a project modeled after this published study, deliverables are typically packaged as a reproducible report plus machine-readable files:
- Curated variant dataset (SNPs and short InDels) with QC summary
- Population structure outputs (cluster assignments, structure plots, phylogeny figures)
- Gene flow evidence summary (statistics tables + interpretation notes)
- Demographic history figures and narrative interpretation
- GWAS report (Manhattan/QQ plots, top loci table, candidate gene list, locus summaries)
- Publication-ready figures (high resolution) and a methods summary suitable for manuscripts
- Optional: marker shortlist for downstream validation or panel development
FAQ
Whole-genome resequencing across diverse accessions, followed by population structure, gene flow (f3/D-statistics), and demographic inference (PSMC/SMC++), provides a high-resolution view of domestication history.
The study analyzed a total of 425 accessions (290 newly resequenced + 135 integrated from earlier studies).
The study reported an average depth of ~20× for whole-genome resequencing.
Variant calling yielded 2,359,117 high-quality SNPs and 934,187 short InDels (<10 bp).
Gene flow analysis matters because domestication and cultivation often involve mixing among lineages; gene flow tests can reveal admixture that a tree alone may not capture.
To test gene flow, the study used f3 statistics and ABBA–BABA (D-statistics).
PSMC and SMC++ were used to infer historical changes in effective population size (Ne).
The GWAS targeted leaf size/biomass-related traits and flowering time.
Leaf size/biomass signals were highlighted on Chromosome 7, and flowering time signals on Chromosome 5.
The study proposed MaBXY5 and MaERF110 as potential key genes related to leaf traits and flowering time, respectively.
References:
- Dai, Fanwei, et al. "Genomic resequencing unravels the genetic basis of domestication, expansion, and trait improvement in Morus atropurpurea." Advanced Science 10.24 (2023): 2300039.