ddRAD for Plants: A Practical Manual for Non-Model Crops
This manual is written for breeding programs and ag-biotech teams that need reliable SNP panels from non-model crops. It covers enzyme-pair selection, plant-aware sample prep, library and size-selection choices for repetitive genomes, coverage targets for small and large genomes, and an analysis workflow (Stacks or ipyrad) that turns reads into breeding-ready markers.

1) Start Here: Why ddRAD Fits Non-Model Crops
The practical problem. Many crop projects require tens of thousands of markers across hundreds of lines, often without a complete reference genome. Double-digest RAD-seq (ddRAD) solves this by using two restriction enzymes and narrow size selection to sample the same genomic fraction across individuals. Compared with shotgun sequencing, ddRAD delivers stable locus recovery at a lower cost per sample, making it well suited to non-model plants with large or repeat-rich genomes.
Double-digest RAD-seq schematic: two-enzyme digestion plus narrow size selection yields consistent genomic tags and supports high multiplexing with dual indexing. (Peterson B.K. et al., 2012, PLOS ONE).
Why ddRAD over single-digest RAD or GBS? Two-enzyme digestion removes the random-shearing step used in single-digest RAD, improves reproducibility across plates and seasons, and lets you "dial" marker density by choosing enzyme pairs and fragment windows. In breeding contexts, that control translates into cleaner downstream statistics: fewer missing genotypes, better locus sharing across cohorts, and less batch-to-batch drift.
What you'll get from this guide. A decision-first walkthrough: (1) plant-aware sampling and DNA QC, (2) enzyme pair selection by genome size and GC profile, (3) library design for repetitive genomes, (4) sequencing and coverage targets you can budget around, (5) analysis and QC tuned for plants, and (6) how to hand off a panel for population structure, GWAS/QTL, and selection scans—without wasting runs on avoidable pitfalls.
Project goals map → design (quick view).
- Diversity scan / germplasm curation: emphasize breadth and per-sample economy; choose enzyme+window settings that maximize shared polymorphic loci at moderate depth.
- QTL/GWAS in crops: prioritize consistent locus recovery, size windows that avoid repeat-inflated duplicates, and coverage that stabilizes genotype likelihoods.
- Selection scans / relatedness control: ensure sufficient shared loci across lines; plan technical replicates and balanced lanes to reduce confounding.
Common early pitfalls to avoid. Three problems recur: over-tight fragment windows in repeat-rich genomes (causing locus dropout), hidden relatedness that inflates association signals, and index hopping on patterned flow cells when non-unique indexes are used. All three are preventable at the design stage, and together they account for a large share of QC failures in plant ddRAD panels.
2) Study Design & Sample Workflow (Plant-Specific)
Sampling that respects plant chemistry. Young leaves are usually best (lower polysaccharides and phenolics). Seeds and roots often carry more inhibitors; pre-washes (e.g., sorbitol) help. Maintain a cold chain, and avoid repeated freeze–thaw cycles that shear DNA. A CTAB-based extraction remains a robust default for diverse crop tissues; many labs add PVP for phenolic-rich species and an RNase treatment to reduce RNA carryover.
DNA QC thresholds that protect downstream genotyping.
- Integrity: visible high-molecular-weight DNA with minimal smearing on agarose.
- Purity: OD260/280 ≈ 1.8–2.0; OD260/230 > 1.8 when achievable.
- Quantity: ≥ 300 ng total DNA per sample typically suffices for ddRAD library preparation (adjust for your ligation system).
These thresholds are conservative but practical across leaf, seed, and root tissues. In our experience, meeting them reduces allele dropout and uneven digestion far more than any exotic cleanup step.
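The purity and quantity thresholds above are easy to apply automatically to a spectrophotometer export. The following is a minimal sketch, not a validated QC tool: the `DnaSample` fields, the `qc_flags` helper, and the example readings are all hypothetical, and gel integrity still has to be judged by eye. Because the 260/230 target is "when achievable", a flag here is a prompt for review rather than an automatic rejection.

```python
from dataclasses import dataclass


@dataclass
class DnaSample:
    sample_id: str
    a260_280: float  # purity ratio sensitive to protein carryover
    a260_230: float  # purity ratio sensitive to polysaccharides/phenolics
    total_ng: float  # total DNA available for library prep


def qc_flags(sample, min_ng=300.0):
    """Return the reasons a sample misses the thresholds (empty list = pass)."""
    flags = []
    if not 1.8 <= sample.a260_280 <= 2.0:
        flags.append("260/280 outside 1.8-2.0")
    if sample.a260_230 <= 1.8:
        flags.append("260/230 not above 1.8")
    if sample.total_ng < min_ng:
        flags.append(f"less than {min_ng:.0f} ng DNA")
    return flags


# Hypothetical readings for two samples.
samples = [
    DnaSample("leaf_01", 1.85, 2.05, 450.0),
    DnaSample("seed_07", 1.62, 1.40, 280.0),
]
for s in samples:
    flags = qc_flags(s)
    print(s.sample_id, "PASS" if not flags else "REVIEW: " + "; ".join(flags))
```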
Barcoding and batching at plate scale. Randomize samples across plates and columns to decouple phenotype groups from lane effects. Use unique dual indexes (UDI) so that hopped reads (index misassignment) can be identified and removed during demultiplexing. On patterned flow cells (e.g., HiSeq 4000, NovaSeq), a modest level of index cross-talk can occur; UDIs and careful cleanup of free adapters meaningfully reduce it.
Characterization of index swapping on patterned flow cells; UDI enables detection of swapped reads across tiles and index combinations. (Costello M. et al., 2018, BMC Genomics).
Cross-batch controls that pay off later.
- 5–10% technical replicates per 96-well plate.
- A shared reference DNA on every plate to monitor drift.
- A small pilot pool to validate digestion and size distribution before scaling.
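One way to combine the randomization, replicate, and shared-reference advice above is to generate each plate map from a seeded shuffle. This is a sketch under assumptions: the function name `build_plate`, the `_rep` suffix, the `REF_DNA` label, and the choice of 6 replicates (about 6% of a 96-well plate) are illustrative, not a prescribed layout.

```python
import random


def build_plate(sample_ids, n_replicates=6, reference_id="REF_DNA", seed=42):
    """Randomly assign samples, technical replicates, and one reference to 96 wells."""
    rng = random.Random(seed)  # fixed seed so the layout can be regenerated
    n_wells = 96
    if len(sample_ids) + n_replicates + 1 > n_wells:
        raise ValueError("too many samples for one 96-well plate")
    # Choose which samples are duplicated as technical replicates.
    replicates = [f"{s}_rep" for s in rng.sample(sample_ids, n_replicates)]
    contents = list(sample_ids) + replicates + [reference_id]
    rng.shuffle(contents)  # decouples sample order from well position
    wells = [f"{row}{col}" for row in "ABCDEFGH" for col in range(1, 13)]
    return dict(zip(wells, contents))


sample_ids = [f"S{i:03d}" for i in range(1, 90)]  # 89 hypothetical samples
layout = build_plate(sample_ids)
print(len(layout), "wells filled; A1 holds", layout["A1"])
```

Storing the seed alongside the sample sheet keeps the layout reproducible when a plate has to be re-prepared.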
Hands-on prep tips (from plant ddRAD projects).
- Add PVP for phenolic-rich leaves (e.g., tea, some legumes).
- Try a sorbitol wash for mucilage-rich tissues and stubborn seeds.
- Avoid overdrying DNA pellets; rehydrate gently to preserve length.
- Treat with RNase before ligation, then run a bead or column cleanup to remove inhibitors.
Together, these practical steps have a stronger effect on downstream call rates than minor tweaks to PCR cycles or ligation timing.
3) Enzymes & Library Design That Scale
Choosing enzyme pairs by genome and GC content. Enzyme selection determines tag density and locus sharing. In plants, methylation patterns matter: methylation-sensitive rare cutters (e.g., PstI) bias toward gene-rich regions, which can be desirable for trait mapping; methylation-insensitive partners (e.g., MspI) maintain broader coverage. Practical, plant-friendly pairs often combine a rare cutter with a frequent cutter to balance tag count and complexity.
Table — Common ddRAD enzyme pairs for plants (practical guide)
| Rare cutter + frequent cutter | Methylation sensitivity / cut frequency | Typical use case (qualitative) | Notes for plants |
| PstI + MspI | PstI sensitive; MspI largely insensitive | Genic bias, moderate tag density | Good for many crops; enriches gene-proximal SNPs |
| EcoRI + MspI | EcoRI blocked only by some overlapping CpG methylation; MspI largely insensitive | Balanced panels | Broadly used; watch GC bias in some species |
| SbfI + MspI | SbfI very rare; MspI insensitive | Large genomes; fewer but consistent tags | Keeps locus counts tractable in repeat-rich genomes |
| AvaII + MspI | AvaII rare; MspI insensitive | Cross-species, cost-aware plant panels | Reported as robust in plant-optimized ddRAD protocols |
How to use this table. Start with two candidate pairs that bracket your expected tag density (e.g., PstI+MspI for genic bias and AvaII+MspI for cross-species consistency). Run a 24–48 sample pilot with two fragment windows to see which combination maximizes shared polymorphic loci after filtering.
Fragment window selection for repetitive genomes. In large or repeat-rich genomes, a moderately wide window (for example, 400–700 bp) often improves locus sharing while limiting repetitive fragments. Extremely tight windows can increase dropout when small shifts in fragment distributions occur between plates. Evaluate distribution on a Bioanalyzer or equivalent before committing the full cohort.
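Before running a wet-lab pilot, an in silico double digest gives a rough estimate of how many fragments a given enzyme pair and size window would yield. The sketch below assumes the standard recognition sites for PstI (CTGCA^G) and MspI (C^CGG), treats a fragment as usable only when it is flanked by one cut from each enzyme, and ignores overhang length, adapter length, and methylation. The function names and the toy sequence are hypothetical; with a real assembly you would loop over contigs.

```python
# Recognition motif and cut offset (bases from motif start) for each enzyme.
SITES = {"PstI": ("CTGCAG", 5), "MspI": ("CCGG", 1)}


def cut_positions(seq, enzyme):
    """Return every cut coordinate for one enzyme on the forward strand."""
    motif, offset = SITES[enzyme]
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(motif, start + 1)
    return positions


def ddrad_fragments(seq, rare="PstI", frequent="MspI", window=(400, 700)):
    """Keep fragments bounded by one rare and one frequent cut, within the window."""
    cuts = sorted(
        [(p, rare) for p in cut_positions(seq, rare)]
        + [(p, frequent) for p in cut_positions(seq, frequent)]
    )
    kept = []
    for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]):
        size = p2 - p1
        if e1 != e2 and window[0] <= size <= window[1]:
            kept.append((p1, p2))
    return kept


# Toy sequence: one PstI site, then two MspI sites.
seq = "A" * 100 + "CTGCAG" + "A" * 494 + "CCGG" + "A" * 50 + "CCGG" + "A" * 100
print(ddrad_fragments(seq))  # one PstI-MspI fragment; the MspI-MspI piece is dropped
```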
Avoiding star activity and chimera formation.
- Keep glycerol content below 5% in digestion mixes; use high-fidelity enzymes and the manufacturer's buffer/temperature.
- Verify double-digest performance with a pilot gel or fragment analyzer; a heavily smeared pattern suggests suboptimal reaction conditions.
- Adapter dimers point to off-ratio ligation or insufficient cleanup; fix before pooling.
Indexing and read structure. Use UDIs to control index cross-talk, and sequence paired-end (e.g., PE150) so that downstream software can assemble short contigs per locus. The extra information from paired-end reads reduces paralog conflation and improves genotyping in complex plant genomes.
4) Sequencing Strategy & Coverage Targets
How many reads per sample? There is no one-size-fits-all number across crops, but a practical rule is to aim for coverage that stabilizes genotype calls after filtering. Many population-scale ddRAD projects target ≈10× per retained locus. Translating that into reads depends on enzyme+window choices and genome size; typical ranges land around 1–3 million read pairs per sample for moderate-density panels in small-to-medium genomes. Plan a pilot to measure: (i) loci shared across ≥80% of individuals, (ii) missingness after QC, and (iii) effective per-locus depth.
Genome-size-aware scenarios.
- Small/medium genomes (≤ 1 Gb): With a balanced pair (e.g., EcoRI+MspI or PstI+MspI) and a moderate window, tens of thousands of loci are attainable at ~1–2 M read pairs per sample.
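- Large genomes (≥ 5 Gb) or polyploids: Favor rarer cutters (e.g., SbfI+MspI or AvaII+MspI) and slightly narrower windows to keep locus counts tractable, without narrowing so far that small plate-to-plate shifts in fragment distribution cause locus dropout; budget more reads per sample to maintain per-locus depth.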
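The link between locus count, target depth, and read budget is simple arithmetic, sketched below. The usable fraction (the share of read pairs that survive demultiplexing and filtering and land on retained loci) and the lane yield are assumptions chosen for illustration; measure both in your pilot rather than trusting these numbers.

```python
import math


def read_pairs_per_sample(n_loci, target_depth, usable_fraction):
    """Read pairs needed so retained loci average the target depth."""
    return math.ceil(n_loci * target_depth / usable_fraction)


def samples_per_lane(lane_read_pairs, per_sample_pairs):
    """How many samples fit in one lane at that per-sample budget."""
    return lane_read_pairs // per_sample_pairs


# Assumed inputs: 60,000 retained loci, 10x depth, half of read pairs usable.
per_sample = read_pairs_per_sample(60_000, 10, 0.5)
print(per_sample)  # 1200000 read pairs per sample

# Assumed lane yield of 400 million read pairs (hypothetical figure).
print(samples_per_lane(400_000_000, per_sample))  # 333 samples
```

Halving the usable fraction doubles the read requirement, which is why adapter dimers and unbalanced pools are expensive.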
Lane balance and multiplexing. Pool libraries to equalize representation and avoid over-clustering. On patterned flow cells, keep adapter carryover low and use UDIs; any residual index swapping can then be removed during demultiplexing.
When to consider low-pass depth plus imputation. For very large cohorts, lower per-sample depth combined with within-population imputation can reduce costs while preserving analytical power. This is common in plant genomics pipelines where pedigree or population structure supports accurate imputation; document the reference panels and methods used to keep results reproducible.
Pilot design that answers the only question that matters. Compare two enzyme pairs × two size windows on 24–48 representative samples. Choose the recipe that maximizes shared polymorphic loci at acceptable missingness. Do not chase the absolute largest raw SNP count; prioritize robustness and cross-plate reproducibility.
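The selection rule in the pilot design above (maximize shared polymorphic loci, subject to acceptable missingness) can be written down explicitly so the decision is auditable. In this sketch the four pilot results are made-up numbers, and the 20% missingness cap is an assumed threshold, not a recommendation from any cited source.

```python
def pick_recipe(results, max_missingness=0.20):
    """Return the recipe with the most shared loci among those under the cap."""
    eligible = [r for r in results if r["missingness"] <= max_missingness]
    if not eligible:
        return None  # no recipe is acceptable; revisit the design
    return max(eligible, key=lambda r: r["shared_loci"])


# Hypothetical pilot summary: two enzyme pairs x two size windows.
pilot = [
    {"enzymes": "PstI+MspI", "window": "300-500", "shared_loci": 21000, "missingness": 0.24},
    {"enzymes": "PstI+MspI", "window": "400-700", "shared_loci": 18500, "missingness": 0.12},
    {"enzymes": "AvaII+MspI", "window": "300-500", "shared_loci": 16000, "missingness": 0.15},
    {"enzymes": "AvaII+MspI", "window": "400-700", "shared_loci": 14200, "missingness": 0.09},
]
best = pick_recipe(pilot)
print(best["enzymes"], best["window"], best["shared_loci"])
```

In this toy data the recipe with the largest raw locus count is rejected because its missingness exceeds the cap, which mirrors the advice not to chase the biggest SNP count.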
5) Analysis & QC: Plant-Tuned, Decision-Ready
De novo vs. reference-guided analysis. When a high-quality reference genome exists (even from a related cultivar), reference-aligned calling improves contiguity, functional interpretation, and integration with downstream tools (LD, GWAS, selection scans). Where references are incomplete, a de novo approach remains defensible if you use conservative filters. Two widely used, well-documented toolchains:
- Stacks v2: paired-end aware, assembles locus contigs and reduces genotyping errors compared with earlier RAD pipelines.
- ipyrad: scalable RAD assembly and analysis across hundreds of taxa with flexible clustering and data export.
Parameter tuning that actually matters (m/M/n). In Stacks, three parameters control marker yield, genotyping error, and inferred differentiation: the minimum depth to form a stack (m), the mismatches allowed between stacks within an individual (M), and the mismatches allowed between loci across individuals when building the catalog (n). Follow a documented grid-search strategy and track core metrics for each run: number of loci, number of SNPs, heterozygosity, and r80 (polymorphic loci shared by at least 80% of individuals).
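A grid search needs two pieces: the list of parameter sets to run and a metric to compare them. The sketch below enumerates an (m, M, n) grid and counts r80 loci from a genotype table. It does not call Stacks or ipyrad; the grid values, the function names, and the toy call set are hypothetical, and genotypes are encoded as alternate-allele counts (0, 1, 2) with `None` for missing.

```python
from itertools import product


def parameter_grid(m_values=(3, 5), M_values=(1, 2, 3), n_values=(1, 2, 3)):
    """Enumerate every (m, M, n) combination to run through the assembler."""
    return list(product(m_values, M_values, n_values))


def count_r80(calls, min_fraction=0.8):
    """Count polymorphic loci genotyped in at least min_fraction of individuals."""
    kept = 0
    for genos in calls.values():
        present = [g for g in genos if g is not None]
        if not present or len(present) / len(genos) < min_fraction:
            continue
        alt_alleles = sum(present)
        # Polymorphic means both alleles are observed among the called samples.
        if 0 < alt_alleles < 2 * len(present):
            kept += 1
    return kept


# Toy call set: four loci, five individuals.
calls = {
    "locus_A": [0, 1, 2, 0, None],     # 80% called, polymorphic: counted
    "locus_B": [0, 0, 0, 0, 0],        # monomorphic: skipped
    "locus_C": [1, None, None, 2, 0],  # only 60% called: skipped
    "locus_D": [1, 1, 1, 1, 1],        # all heterozygous: counted
}
print(len(parameter_grid()), "parameter sets to run")
print(count_r80(calls), "r80 loci in the toy call set")
```

Running `count_r80` on the output of each parameter set gives one number per run, which makes the grid easy to plot and compare.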
Number of polymorphic loci present in ≥75% of individuals across values of m (with different M/n), before and after PCR-clone removal. (Díaz-Arce N. & Rodríguez-Ezpeleta N., 2019, Frontiers in Genetics)
Pipeline differences change biological conclusions. Stacks, SAMtools-based workflows, and GATK can produce different call sets on reduced-representation data because of genotype models and filters. To protect interpretability: fix versions, publish parameter files, and replicate key steps on a subset when you change chemistry or instruments.
Plant-aware filtering to reduce false positives.
- Paralog filters: drop loci with excess read depth or extreme heterozygosity suggestive of collapsed paralogs.
- Missingness: for structure and kinship analyses, keep loci present in ≥75–80% of individuals.
- Replicate concordance: estimate per-SNP error from technical replicates (include at least one replicate per plate).
- HWE and depth filters: use cautiously in structured breeding panels; depth-normalized filters are often more informative.
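The paralog and missingness filters in the list above can be expressed as a single pass over per-locus summary statistics. This is a sketch with assumed cutoffs: dropping loci above twice the median depth, above 0.6 observed heterozygosity, or below an 80% call rate. The field names and the five example loci are hypothetical; tune the thresholds against your own depth and heterozygosity distributions.

```python
from statistics import median


def filter_loci(loci, max_depth_ratio=2.0, max_het=0.6, min_call_rate=0.8):
    """Split loci into kept IDs and a dict of dropped IDs with the reason."""
    med_depth = median(locus["mean_depth"] for locus in loci)
    kept, dropped = [], {}
    for locus in loci:
        if locus["mean_depth"] > max_depth_ratio * med_depth:
            dropped[locus["id"]] = "excess depth"  # possible collapsed paralog
        elif locus["obs_het"] > max_het:
            dropped[locus["id"]] = "excess heterozygosity"
        elif locus["call_rate"] < min_call_rate:
            dropped[locus["id"]] = "low call rate"
        else:
            kept.append(locus["id"])
    return kept, dropped


# Hypothetical per-locus summaries.
loci = [
    {"id": "loc1", "mean_depth": 10, "obs_het": 0.30, "call_rate": 0.95},
    {"id": "loc2", "mean_depth": 12, "obs_het": 0.25, "call_rate": 0.90},
    {"id": "loc3", "mean_depth": 45, "obs_het": 0.55, "call_rate": 0.98},
    {"id": "loc4", "mean_depth": 11, "obs_het": 0.95, "call_rate": 0.92},
    {"id": "loc5", "mean_depth": 9, "obs_het": 0.20, "call_rate": 0.60},
]
kept, dropped = filter_loci(loci)
print("kept:", kept)
print("dropped:", dropped)
```

Recording the reason for each dropped locus makes it easy to report how many markers each filter removed in the QC summary.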
A mini-playbook for fast triage.
- Star activity (off-target cuts): reduce glycerol, confirm buffers and temperatures, switch to HF enzymes.
- Adapter dimers / duplication spikes: adjust adapter:insert ratios, improve post-ligation cleanup.
- Index hopping: enforce UDIs; remove unexpected index pairs during demultiplexing.
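The index-hopping item above relies on a simple property of unique dual indexes: every valid (i7, i5) pair belongs to exactly one sample, so a read carrying two known indexes in an unexpected combination can be counted and discarded. The sketch below illustrates that logic only. The index sequences, the sample sheet, and the `classify_reads` helper are hypothetical, and it uses exact matching, whereas production demultiplexers typically allow mismatches.

```python
from collections import Counter


def classify_reads(index_pairs, sample_sheet):
    """Count reads per sample, plus hopped (known indexes, wrong pair) and unknown."""
    expected = {pair: sample for sample, pair in sample_sheet.items()}
    known_i7 = {i7 for i7, _ in expected}
    known_i5 = {i5 for _, i5 in expected}
    per_sample = Counter()
    hopped = unknown = 0
    for i7, i5 in index_pairs:
        if (i7, i5) in expected:
            per_sample[expected[(i7, i5)]] += 1
        elif i7 in known_i7 and i5 in known_i5:
            hopped += 1  # both indexes are real but never paired together
        else:
            unknown += 1
    return per_sample, hopped, unknown


# Hypothetical UDI sample sheet: each i7 and each i5 is used exactly once.
sample_sheet = {
    "S001": ("ATCACGAT", "TTAGGCAT"),
    "S002": ("CGATGTCC", "ACAGTGGA"),
}
reads = [
    ("ATCACGAT", "TTAGGCAT"),
    ("ATCACGAT", "TTAGGCAT"),
    ("CGATGTCC", "ACAGTGGA"),
    ("ATCACGAT", "ACAGTGGA"),  # S001 i7 with S002 i5
    ("GGGGGGGG", "TTAGGCAT"),  # unrecognized i7
]
per_sample, hopped, unknown = classify_reads(reads, sample_sheet)
print(dict(per_sample), "hopped:", hopped, "unknown:", unknown)
```

Tracking the hopped fraction per run gives an early warning when free-adapter cleanup has slipped.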
De novo vs. reference-aligned (decision note). If you anticipate GWAS or selection scans, reference-aligned calls simplify integration with gene annotations and LD maps. If your priority is broad discovery in a non-model crop, start de novo with conservative filters and consider lifting over representative loci once a reference becomes available.
6) From Markers to Breeding Decisions
What "breeding-ready" looks like.
- Population structure and relatedness: PCA/PCoA and ADMIXTURE clarify population stratification; KING/IBS check for close relatives that can bias estimates.
- Selection scans: F_ST outliers and cross-population statistics (e.g., XP-CLR) highlight regions under selection across environments or breeding stages.
- GWAS/QTL hand-off: export stable SNPs with consistent minor-allele frequency and—if reference-aligned—genomic positions. Provide VCF + PLINK files, a sample sheet with plate/lane metadata, and a short QC report (missingness, replicate concordance, depth distributions).
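As a quick screen for the close relatives mentioned in the first item above, pairwise identity-by-state (IBS) similarity can be computed directly from a genotype matrix. This sketch is a simple proxy, not the KING kinship estimator: genotypes are alternate-allele counts (0, 1, 2) with `None` for missing, the 0.95 flagging threshold is an assumption, and the three lines are toy data.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared IBS over loci called in both samples."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return None  # nothing to compare
    return 1 - sum(abs(a - b) for a, b in shared) / (2 * len(shared))


def close_pairs(genotypes, threshold=0.95):
    """List sample pairs whose IBS similarity meets the threshold."""
    flagged = []
    for s1, s2 in combinations(sorted(genotypes), 2):
        sim = ibs_similarity(genotypes[s1], genotypes[s2])
        if sim is not None and sim >= threshold:
            flagged.append((s1, s2, sim))
    return flagged


# Toy genotypes at six SNPs.
genotypes = {
    "line_A": [0, 1, 2, 0, 1, 2],
    "line_B": [0, 1, 2, 0, 1, None],
    "line_C": [2, 0, 0, 2, 1, 0],
}
print(close_pairs(genotypes))
```

Flagged pairs are candidates for duplicate accessions or unrecorded sibs and are worth checking before structure or association analyses.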
Panel hand-off checklist (one page).
- Document enzyme pair, size window, and read-per-sample target.
- Freeze pipeline versions (Stacks/ipyrad); bundle parameter files.
- Report replicate concordance and per-SNP error estimates.
- Deliver VCF/PLINK plus a concise data dictionary for analysts.
- Save an imputation-ready matrix if low-pass strategies were used.
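The replicate concordance figure in the checklist above can be computed by comparing genotype calls between a sample and its technical replicate at sites called in both. A minimal sketch follows; the function name, the VCF-style genotype strings, and the example calls are illustrative, and one minus the concordance is only a rough estimate of the per-genotype error rate.

```python
def replicate_concordance(calls_a, calls_b):
    """Return (concordance, sites compared) for two replicate call vectors."""
    compared = matches = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # skip sites missing in either replicate
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return None, 0
    return matches / compared, compared


# Hypothetical calls for one sample and its technical replicate.
rep1 = ["0/0", "0/1", "1/1", "0/1", None, "0/0", "0/1", "1/1", "0/0", "0/1"]
rep2 = ["0/0", "0/1", "1/1", "0/0", "0/1", "0/0", "0/1", "1/1", "0/0", "0/1"]
concordance, n_sites = replicate_concordance(rep1, rep2)
print(f"{concordance:.3f} concordance over {n_sites} shared sites")
```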
Why this matters to breeding. The faster you can turn sequencing into a stable, annotated SNP set, the sooner your program can run genomic selection, verify QTLs, and advance lines with confidence. A robust ddRAD panel is often the cheapest path to that outcome in non-model crops.
Mini-FAQ
Which enzyme pair should I start with for a large or repeat-rich genome?
Start with a rarer cutter paired to a frequent, largely methylation-insensitive cutter (for example, SbfI + MspI or AvaII + MspI) and test a moderately wide window (e.g., 400–700 bp). The goal is a tractable number of tags that still share well across individuals.
How many reads per sample should I budget?
Budget to stabilize genotype likelihoods after filtering. Many panels balance cost and power around ≈10× per retained locus; depending on genome size and enzyme/window choice, that may equate to ~1–3 M read pairs per sample. Confirm with a pilot; do not lock budgets without empirical locus-sharing metrics.
Can ddRAD work in polyploid or highly repetitive plant genomes?
Yes, with tighter size selection, paired-end reads, and paralog-aware filters (depth and heterozygosity caps). Paired-end aware callers (e.g., Stacks v2) help separate paralogs and reduce false heterozygosity in complex plant genomes.
How do I limit index hopping?
Use unique dual indexes, remove free adapters/primers before pooling, and discard unexpected index combinations during demultiplexing. Double indexing is a proven strategy to limit misassignment.
Do I need a reference genome?
No. ddRAD was designed to work without one. When a good reference exists, reference-aligned calling simplifies LD, GWAS, and functional interpretation; otherwise, a de novo pipeline with conservative filters is a defensible starting point.
Your Next Steps
If you're mapping diversity or traits in a non-model crop, our team can help you move from idea to pilot to breeding-ready panel:
- Study Design & Power Modeling for ddRAD in crops (enzyme/window pilots, plate randomization, replicate strategy).
- Population Genomics Sequencing Service with plant-tuned library prep (UDI by default), balanced pooling, and audit-ready QC.
- RAD-seq service / Reference-Guided Variant Calling using Stacks v2 or ipyrad, with frozen parameter files and clean deliverables (VCF/PLINK + data dictionary).
- Genome-wide Association Analysis and Selection Scan Service to translate your ddRAD panel into trait-level evidence.
Ready to scope a pilot? Share species, approximate genome size, desired marker density, and budget envelope. We'll return two enzyme pairs × two size windows, a per-sample coverage plan, and a milestone-based timeline you can take to internal review.
References
- Peterson, B.K., Weber, J.N., Kay, E.H., Fisher, H.S., Hoekstra, H.E. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PLoS ONE 7, e37135 (2012).
- Yang, G.Q., Chen, Y.M., Wang, J.P. et al. Development of a universal and simplified ddRAD library preparation approach for SNP discovery and genotyping in angiosperm plants. Plant Methods 12, 39 (2016).
- Wright, B., Farquharson, K.A., McLennan, E.A. et al. From reference genomes to population genomics: comparing three reference-aligned reduced-representation sequencing pipelines in two wildlife species. BMC Genomics 20, 453 (2019).
- Díaz-Arce, N., Rodríguez-Ezpeleta, N. Selecting RAD-Seq Data Analysis Parameters for Population Genetics: The More the Better? Frontiers in Genetics 10, 533 (2019).
- Costello, M., Fleharty, M., Abreu, J. et al. Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms. BMC Genomics 19, 332 (2018).
- Rochette, N.C., Rivera-Colón, A.G., Catchen, J.M. Stacks 2: Analytical methods for paired-end sequencing improve RADseq-based population genomics. Molecular Ecology 28, 4737–4754 (2019).