ddRAD-Seq 101: Enzyme Choice & Size Selection

If you need genome-wide markers without the cost of whole-genome resequencing, ddRAD-Seq remains a dependable approach. The two levers that determine success are enzyme choice and the size-selection window. Pick the right pair of restriction enzymes, constrain your fragment sizes intelligently, and you'll recover stable loci at the coverage you need—without fighting avoidable missing data or adapter contamination. This guide walks through practical decisions from bench to variants so your next population genomics project starts clean and scales smoothly.

Three-panel schematic: (A) ddRAD-Seq experimental workflow; (B) Restriction enzyme selection criteria; (C) Size-selection window effects.

Why ddRAD-Seq in 2025

ddRAD-Seq (double-digest RAD sequencing) samples a consistent subset of the genome by cutting DNA with two enzymes and sequencing only fragments within a defined size range. That design yields thousands to tens of thousands of SNPs at a fraction of whole-genome cost—ideal for population genetics and population genomics in non-model species. The protocol remains popular in ecology, evolution, breeding, and conservation because it balances budget, throughput, and marker density. Compared with single-digest RAD, ddRAD's two-enzyme design helps reduce repetitive noise and improves cross-sample repeatability when labs keep enzyme and window choices constant across cohorts.

When you compare methods, ddRAD-Seq sits in the middle: more informative than microsatellites or amplicon panels, lighter than WGS, and a strong fit for common study goals like population structure, differentiation, demographic inference, and linkage mapping in diploids and polyploids. Many programs run multi-year monitoring with the same ddRAD settings to keep markers comparable across time and sites.

How ddRAD-Seq Works

Two restriction enzymes cut at their recognition sites to create sticky ends. Indexed adapters are ligated, and you size-select a narrow fragment window by beads or gel. After PCR enrichment, libraries are pooled and sequenced—commonly paired-end (PE150 or PE250) on Illumina instruments. Because you keep only fragments within a defined window, the same genomic neighborhoods are sampled across individuals, which supports consistent locus recovery in population-scale datasets.

Double digest RAD sequencing improves efficiency and robustness while minimizing cost. (Peterson et al., 2012, PLOS ONE)

Paired-end runs improve locus assembly and SNP detection by reading both ends of each insert. They also introduce a design constraint you control: avoid read-through. If your read length exceeds the insert size, sequencers read into adapters, inflating adapter content and degrading alignment. Your size-selection window is the lever that prevents this.

Choosing Enzymes

Goal: generate a stable, project-appropriate number of loci with even genomic coverage and minimal allele dropout.

Start with simulation. Before placing orders, simulate candidate enzyme pairs on a representative genome or a close relative. Modern planners (e.g., ddRAD design calculators) predict locus counts under different size windows and flag overlap/adapter risk for common read lengths. Use simulation to shortlist two or three pairs that hit your marker and coverage targets.
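As a rough illustration of what such a simulation does, the sketch below performs an in silico double digest on a toy sequence and counts fragments inside a size window. It is not the algorithm of any particular planner. The enzyme pair (EcoRI and MspI), the toy sequence, the window bounds, and all function names are assumptions chosen for illustration. Fragment length is approximated as the distance between top-strand cuts and ignores adapter length.

```python
import re

# Recognition site and top-strand cut offset for each enzyme.
# EcoRI (G^AATTC) and MspI (C^CGG) are only an illustrative pair.
ENZYMES = {"EcoRI": ("GAATTC", 1), "MspI": ("CCGG", 1)}


def cut_positions(seq, site, offset):
    """Return top-strand cut coordinates for every occurrence of `site`."""
    # The lookahead makes overlapping occurrences visible to finditer.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, enzyme_a, enzyme_b):
    """Return lengths of fragments bounded by one cut from each enzyme."""
    cuts = []
    for name in (enzyme_a, enzyme_b):
        site, offset = ENZYMES[name]
        cuts += [(pos, name) for pos in cut_positions(seq, site, offset)]
    cuts.sort()
    # Adjacent cuts bound a fragment; keep it only if its two ends differ.
    return [
        right - left
        for (left, a), (right, b) in zip(cuts, cuts[1:])
        if a != b
    ]


def count_in_window(lengths, low, high):
    """Count fragments whose length falls inside [low, high]."""
    return sum(low <= n <= high for n in lengths)


# Toy sequence, not real genomic data.
toy = ("A" * 50 + "GAATTC" + "T" * 300 + "CCGG" + "A" * 20
       + "CCGG" + "T" * 400 + "GAATTC" + "A" * 50)
lengths = double_digest(toy, "EcoRI", "MspI")
print(lengths, count_in_window(lengths, 300, 450))
```

On a real project the same counting would run over a reference assembly for each candidate pair and window, so the shortlist rests on predicted locus counts rather than guesswork.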

Results of enzymatic digestion of maize DNA. (Puchta-Jasińska et al., 2023, Agronomy)

Field-tested heuristics:

Pair frequency: Match a frequent cutter (4–5-base recognition) with a rarer cutter (6–8-base recognition) to tune fragment density without flooding the library with very short inserts.
Genome context: For plant genomes rich in methylation, consider enzymes less sensitive to methylated cytosines to avoid systematic under-sampling of certain regions.
Pilot both pairs: Order two enzyme pairs. Prepare 16–24 libraries per pair and run a low-depth pilot. Compare insert-size distributions, percent adapter after trimming, and unique loci recovered at the same read budget.
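The frequent-versus-rare cutter heuristic follows from simple probability. The sketch below estimates how often a recognition site is expected to occur, under the simplifying assumption that bases are independent and that G/C and A/T are each split evenly. Real genomes violate both assumptions, so treat the numbers as a first-order guide only. The function names and the example sites are illustrative.

```python
def site_probability(site, gc=0.5):
    """Chance that a random position starts `site`, given genome GC fraction."""
    p_gc, p_at = gc / 2, (1 - gc) / 2
    p = 1.0
    for base in site:
        p *= p_gc if base in "GC" else p_at
    return p


def expected_sites(genome_size_bp, site, gc=0.5):
    """Expected number of recognition sites in a genome of the given size."""
    return genome_size_bp * site_probability(site, gc)


# A 4-base site recurs roughly every 4**4 = 256 bp at 50% GC,
# a 6-base site roughly every 4**6 = 4096 bp.
for site in ("CCGG", "GAATTC"):
    print(site, round(1 / site_probability(site)))
```

Lowering `gc` below 0.5 makes a GC-rich site such as CCGG rarer, which is one reason the same enzyme pair can behave differently across species.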

What success looks like in the pilot: sharp insert-size peaks by Bioanalyzer/TapeStation, low adapter content after trimming, and stable locus counts across replicates at relaxed assembly thresholds. That consistency pays off later during population structure and differentiation analyses.

Size Selection

The size-selection window sets the trade-off between locus count and per-locus depth—and shields you from two common failure modes: adapter read-through (window too short) and R2 quality drop from very long inserts (window too long).

Short inserts: If inserts are shorter than your read length, each read runs through the insert and into the adapter on the far side, and R1 and R2 overlap completely. Expect higher adapter content and lower mapping rates unless you trim aggressively. Prevent it by setting a window that keeps inserts comfortably longer than the read length you plan to use (PE150 vs PE250).
Very long inserts: Extremely long fragments can reduce base quality in the second read (R2) and complicate de novo assembly of RAD loci. Wider windows rarely help if they dilute depth across too many loci.
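The geometry behind the short-insert case can be written down directly. In the sketch below, insert length excludes adapters; an insert shorter than the read length causes adapter read-through, and an insert shorter than twice the read length makes R1 and R2 overlap. The function names and the example insert sizes are hypothetical.

```python
def classify_insert(insert_bp, read_length):
    """Classify an adapter-free insert length for a paired-end run."""
    if insert_bp < read_length:
        return "read-through"   # each read continues into the adapter
    if insert_bp < 2 * read_length:
        return "overlap"        # R1 and R2 cover some of the same bases
    return "no overlap"


def read_through_fraction(insert_sizes, read_length):
    """Fraction of inserts short enough to cause adapter read-through."""
    short = sum(classify_insert(n, read_length) == "read-through"
                for n in insert_sizes)
    return short / len(insert_sizes)


for insert in (120, 250, 400):
    print(insert, classify_insert(insert, read_length=150))
```

Running the same check with `read_length=250` shows why a window that is safe for PE150 may not be safe for PE250.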

Increase in low-quality R2 reads as a function of long-fragment content. (Tan et al., 2019, Scientific Reports)

A practical routine for setting the window:

Use a ddRAD planner to estimate the fragment distribution for each enzyme pair and flag overlap risk for your read length.
Choose the narrowest window that still yields your target locus count at your planned coverage per sample.
After the pilot run, adjust the window once—tighten if adapter rates are high; loosen slightly if locus counts are too low for your downstream tests (e.g., ADMIXTURE, F_ST scans).
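The routine above can be sketched as a small search over predicted fragment lengths: find the narrowest window above a minimum insert size that still contains the target number of loci, then check the depth that window implies. This is a minimal sketch under stated assumptions: every fragment in the window counts as one locus, read pairs spread evenly across loci, and the fragment lengths and function names are invented for illustration.

```python
def narrowest_window(lengths, min_insert, target_loci):
    """Smallest (low, high) window with low >= min_insert holding target_loci fragments."""
    usable = sorted(n for n in lengths if n >= min_insert)
    if target_loci <= 0 or len(usable) < target_loci:
        return None
    best = None
    # Every run of target_loci consecutive sorted lengths is a candidate window.
    for i in range(len(usable) - target_loci + 1):
        low, high = usable[i], usable[i + target_loci - 1]
        if best is None or high - low < best[1] - best[0]:
            best = (low, high)
    return best


def mean_depth(read_pairs_per_sample, loci):
    """Average read pairs per locus, assuming perfectly even coverage."""
    return read_pairs_per_sample / loci


# Hypothetical predicted fragment lengths (bp).
predicted = [180, 210, 305, 310, 320, 330, 415, 600]
print(narrowest_window(predicted, min_insert=200, target_loci=4))
print(mean_depth(1_000_000, 20_000))
```

Real coverage is never perfectly even, so the depth figure is an upper bound on what most loci will receive; the pilot run is what tells you how far below it you land.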

Cleanup matters. Very short inserts and adapter dimers cluster efficiently on the flow cell, so they take more than their share of reads and can degrade run metrics. If pre-run QC shows an excess of short fragments, add an extra bead cleanup or adjust bead ratios to tighten the insert distribution before pooling.

Reads to SNPs

Once you have clean FASTQs, assemble loci and call variants. Most teams rely on one (or both) of the following pipelines:

Stacks 2 — Strong performance on paired-end de novo RAD/ddRAD with robust locus assembly and genotyping across population datasets. If you want straightforward command-line tools and reliable genotype calls in de novo mode, Stacks 2 is a solid choice.
ipyrad — A flexible, modular toolkit that streamlines assembly and offers built-in downstream analyses (PCA, clustering). ipyrad encourages running multiple parameter sets and comparing outcomes, which is essential because filters influence biological inferences.

De novo vs reference-guided? If you have a well-annotated reference with modest divergence, reference-guided assembly provides coordinates and can help paralog filtering. In non-model systems with fragmented or distant references, de novo assembly often behaves more predictably.

What to deliver to analysis: demultiplexed FASTQs; VCFs for loci/SNPs; per-sample coverage reports; and a parameter log for assembly and filtering. Include adapter/overlap metrics from trimming to document why the chosen size window is safe at the read length used.

Common Pitfalls

Over-filtering too early. Tight "min-samples-per-locus" filters produce tidy matrices but bias your SNP set toward conserved regions, discard informative loci, and can distort downstream tests. Keep locus retention permissive during assembly; manage missingness at the analysis stage (e.g., imputation for PCA).
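To see how quickly a strict sample-coverage filter shrinks a dataset, the sketch below computes a per-locus call rate from a toy genotype matrix and compares a strict threshold with a permissive one. The matrix, the thresholds, and the function names are hypothetical; neither Stacks 2 nor ipyrad is being imitated here.

```python
def locus_call_rates(genotypes):
    """Per-locus fraction of samples with a call; rows are samples, None is missing."""
    n_samples = len(genotypes)
    n_loci = len(genotypes[0])
    return [
        sum(row[j] is not None for row in genotypes) / n_samples
        for j in range(n_loci)
    ]


def retained_loci(genotypes, min_call_rate):
    """Indices of loci whose call rate meets the threshold."""
    rates = locus_call_rates(genotypes)
    return [j for j, rate in enumerate(rates) if rate >= min_call_rate]


# Toy matrix: 4 samples x 4 loci, genotypes coded 0/1/2.
toy_genotypes = [
    [0, 1, None, 2],
    [0, None, None, 1],
    [1, 1, 0, None],
    [0, 0, None, 2],
]
print(retained_loci(toy_genotypes, min_call_rate=1.0))   # strict
print(retained_loci(toy_genotypes, min_call_rate=0.5))   # permissive
```

Printing both lists side by side for pilot data is a cheap way to see which loci a strict threshold would cost you before committing to it.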

Assuming parameters don't matter. Seemingly small changes in clustering thresholds, minor-allele frequency cut-offs, or per-locus missingness can shift population structure, introgression signals, or outlier scans. Build several datasets under a sensible grid of parameters and report which biological conclusions remain stable.
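One way to make the parameter grid explicit is to enumerate it in code before any assembly runs. The sketch below builds every combination of a few settings and checks whether a conclusion is the same across all of them. The parameter names, the values, and the recorded conclusions are placeholders, not the actual option names of Stacks 2 or ipyrad.

```python
from itertools import product


def parameter_grid(options):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(options)
    return [
        dict(zip(names, values))
        for values in product(*(options[name] for name in names))
    ]


def is_stable(conclusions):
    """True when every parameter set led to the same conclusion."""
    return len(set(conclusions)) == 1


# Hypothetical settings to vary.
grid = parameter_grid({
    "clustering_threshold": [0.85, 0.90, 0.95],
    "min_maf": [0.01, 0.05],
    "max_locus_missing": [0.2, 0.5],
})
print(len(grid), grid[0])

# Placeholder outcomes, e.g. the number of genetic clusters inferred per run.
print(is_stable([3, 3, 3]), is_stable([3, 4, 3]))
```

Keeping the grid as data also gives you the parameter log for free: write the list to disk next to the results of each run.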

Insert size drift. If insert distributions drift across batches, you'll see batch-specific adapter rates or R2 quality. Track insert medians (±IQR) per pool, and re-balance with bead ratios if drift appears. Lock the window for production once your pilot looks stable.
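Tracking drift needs only a median and an IQR per pool plus a tolerance. The sketch below summarizes each pool and flags those whose median has moved away from a reference value. The pool names, the insert sizes, the reference median, and the 20 bp tolerance are all invented for illustration.

```python
from statistics import median, quantiles


def pool_summary(insert_sizes):
    """Median and interquartile range of one pool's insert sizes (bp)."""
    q1, _, q3 = quantiles(insert_sizes, n=4)
    return {"median": median(insert_sizes), "iqr": q3 - q1}


def drifting_pools(pools, reference_median, tolerance_bp):
    """Names of pools whose median differs from the reference by more than the tolerance."""
    return [
        name
        for name, sizes in pools.items()
        if abs(median(sizes) - reference_median) > tolerance_bp
    ]


# Hypothetical insert sizes estimated per pool.
pools = {
    "pool_A": [340, 350, 355, 360, 370],
    "pool_B": [300, 305, 310, 320, 330],
}
for name, sizes in pools.items():
    print(name, pool_summary(sizes))
print(drifting_pools(pools, reference_median=350, tolerance_bp=20))
```

The reference median would normally come from the pilot that fixed the window, so every later batch is judged against the same baseline.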

Neglecting dual-index design. Barcodes separated by small edit distances are easily confused by sequencing errors, so reads land in the wrong sample and create apparent batch effects. Use unique dual indexes with ample edit distance, which also makes index-hopped reads detectable, and validate demultiplexing on a subset before committing to a full run.
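A barcode set can be screened for close pairs before ordering. The sketch below uses Hamming distance (substitutions only) as a simplified stand-in for edit distance, so it will not catch pairs that differ by an insertion or deletion. The barcodes and the minimum distance of 3 are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def close_pairs(barcodes, min_distance):
    """Barcode pairs that sit closer together than min_distance."""
    return [
        (a, b, hamming(a, b))
        for a, b in combinations(barcodes, 2)
        if hamming(a, b) < min_distance
    ]


# Hypothetical 6-base barcodes.
barcodes = ["ACGTAC", "ACGTAA", "TTGACC", "GGATCA"]
print(close_pairs(barcodes, min_distance=3))
```

Any pair the check reports is a candidate for replacement, since a single sequencing error could turn one barcode into the other.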

FAQ — ddRAD-Seq Design

1) What is ddRAD-Seq, and why use two enzymes?

It's a restriction-site method that uses two enzymes to target genomic regions reproducibly. A defined size-selection window keeps only fragments likely to sequence cleanly. The result is a repeatable set of loci at moderate cost—well suited to population genomics and population genetics.

2) How do I choose enzyme pairs for my species?

Model candidates with a ddRAD planner to preview locus density, GC effects, and overlap risk at your read length. Shortlist two or three pairs that hit your SNP and coverage goals, then run a pilot (16–24 samples per pair) and compare insert distributions, percent adapter after trimming, and unique locus counts at the same read budget.

3) What size-selection window should I use for PE150 or PE250?

Pick a window that keeps inserts longer than the read length to avoid read-through, but not so long that R2 quality drops. Confirm by checking adapter rates and insert medians on your pilot; adjust once before scaling.

4) Which pipeline should I use—Stacks 2 or ipyrad?

Both work well. Stacks 2 is a strong default for de novo paired-end genotyping; ipyrad shines when you want a flexible, end-to-end workflow that encourages parameter exploration and includes downstream analyses. Many teams test both on pilot data and select the one that produces the most stable biological inferences.

5) How should I handle missing data in ddRAD-Seq?

Don't try to eliminate all missing sites during assembly. Keep retention permissive, then manage missingness during analysis (e.g., imputation for PCA or genotype likelihood frameworks for low-depth data). Over-filtering early can bias results and reduce power.

Lab Tips That Prevent Re-runs

Plan dual indexes with edit distance. Mis-assigned reads mimic subtle batch effects; robust index sets and careful demultiplexing reduce that risk.
Tighten the insert distribution. If QC shows a shoulder <200 bp, adjust bead ratios or add an extra cleanup before pooling. Short inserts inflate adapter content and waste reads.
Track pool-level metrics. Record insert medians, IQRs, and percent adapter by pool. A single off-spec pool can pull down lane-level quality.
Fix the lab first, then tune filters. Only after enzyme pair and window are stable should you adjust clustering thresholds, MAF cut-offs, or per-locus missingness.
Pilot at production scale. Run the pilot with the same bead ratios, PCR cycles, and cleanup scheme you intend to use at scale so performance estimates transfer.

Next Steps

Design beats rescue. A small, well-instrumented pilot aligns enzyme pairs, size-selection windows, and read length with your project goals (structure, differentiation, demographic inference, or linkage mapping). Here's a straightforward path to production:

Plan with simulations to shortlist enzyme pairs and window candidates.
Pilot 16–24 libraries per pair at low depth to validate insert sizes, adapter rates, locus counts, and per-sample coverage.
Assemble three times using a sensible grid of parameters in Stacks 2 or ipyrad; pick the parameter set whose biological conclusions stay stable across subsamples.
Freeze the recipe and document it in your methods so subsequent cohorts, sites, or years remain comparable.

If you want help scoping or validating a design, our Population Genomics Sequencing (such as ddRAD-seq, 2bRAD-seq) and Bioinformatics Analysis teams can simulate enzyme/window choices, run a pilot, and deliver transparent QC with FASTQs, VCFs, coverage summaries, and parameter logs—for research use only.

References

Peterson, B.K., Weber, J.N., Kay, E.H., Fisher, H.S., Hoekstra, H.E. Double digest RADseq: An inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE 7, e37135 (2012).
Tan, G., Opitz, L., Schlapbach, R., Rehrauer, H. Long fragments achieve lower base quality in Illumina paired-end sequencing. Scientific Reports 9, 2856 (2019).
Díaz-Arce, N., Rodríguez-Ezpeleta, N. Selecting RAD-Seq data analysis parameters for population genetics: The more the better? Frontiers in Genetics 10, 533 (2019).
Puchta-Jasińska, M., Bolc, P., Piechota, U., Boczkowska, M. Optimized in vitro restriction digestion protocol for preparing maize and barley ddRAD-Seq libraries. Agronomy 13, 2956 (2023).
Rochette, N.C., Rivera-Colón, A.G., Catchen, J.M. Stacks 2: Analytical methods for paired-end sequencing improve RADseq-based population genomics. Molecular Ecology 28, 4737–4754 (2019).
Eaton, D.A.R., Overcast, I. ipyrad: Interactive assembly and analysis of RADseq datasets. Bioinformatics 36, 2592–2594 (2020).
Lajmi, A., Glinka, F., Privman, E. Optimizing ddRAD sequencing for population genomic studies with ddgRADer. Molecular Ecology Resources (2023).
