Designing ddRAD Projects: Expected Loci, Coverage Models & Budget

Effective ddRAD project design starts with defensible numbers, not wishful thinking. This guide shows how to estimate expected locus count with ddgRADer, then convert those estimates into coverage plans, reads-per-sample targets, and a durable budget. You'll see how enzyme pairs and size-selection windows change locus yield, and how duplicate and on-target rates shape effective depth. When depth misses targets, tune insert sizes before buying more reads. Each step is transparent and reproducible, so methods and decisions remain review-ready.

The ddRAD project design problem

Most unplanned costs in ddRAD projects originate from two early missteps: over-optimistic locus yield and miscalibrated read depth. The method—two restriction enzymes plus size selection—delivers cost-efficient, population-scale SNP discovery, but fragments cluster near cut sites, insert sizes vary, and size selection can be imprecise. Short inserts increase read overlap and adapter read-through, which lowers the effective bases per read and depresses usable coverage if not modeled upfront.

Empirical ddRAD results reveal small-fragment carryover and amplification biases: recovery varies with fragment length, and per-locus depth correlates with GC content within the selected window. (DaCosta J.M. & Sorenson M.D. (2014) PLOS ONE)

A second, often overlooked, risk is index hopping in large multiplexes. When many samples share a lane, we strongly recommend unique dual indexes (UDI). UDI allows removal of hopped combinations during demultiplexing, reducing cross-sample contamination and protecting effective depth. If resources are limited, prioritising UDI typically offers high risk-reduction for modest cost.
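To make the UDI filter concrete, here is a minimal Python sketch of the idea, not any demultiplexer's actual implementation. The function name `classify_index_pairs`, the index sequences, and the sample names are hypothetical, and the sketch assumes exact index matches with no mismatch tolerance.

```python
from collections import Counter

def classify_index_pairs(observed, sample_sheet):
    """Classify (i7, i5) index pairs as assigned, hopped, or unknown.

    sample_sheet maps each expected (i7, i5) pair to a sample name. With
    unique dual indexes, every i7 and every i5 belongs to exactly one sample,
    so a pair built from two known indexes that are not paired in the sheet
    is treated here as a hopped combination and is not assigned to a sample.
    """
    known_i7 = {i7 for i7, _ in sample_sheet}
    known_i5 = {i5 for _, i5 in sample_sheet}
    counts = Counter()
    for pair in observed:
        i7, i5 = pair
        if pair in sample_sheet:
            counts[sample_sheet[pair]] += 1
        elif i7 in known_i7 and i5 in known_i5:
            counts["hopped"] += 1
        else:
            counts["unknown"] += 1
    return counts

# Hypothetical index sequences and sample names, for illustration only.
sheet = {("AAGGTT", "CCTTAA"): "sample_1", ("GGCCAA", "TTAACC"): "sample_2"}
reads = [
    ("AAGGTT", "CCTTAA"),
    ("GGCCAA", "TTAACC"),
    ("AAGGTT", "TTAACC"),  # sample_1's i7 with sample_2's i5
    ("AAGGTT", "NNNNNN"),  # i5 not in the sheet
]
result = classify_index_pairs(reads, sheet)
print(dict(result))
```

With combinatorial (non-unique) indexing, the third read would look like a legitimate sample; with UDI it falls into the `hopped` bin and never reaches a sample's read count.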

Schematic overview of 2RAD/3RAD library preparation workflows. (Glenn T.C. et al. (2019) PeerJ)

Expected loci with ddgRADer: enzymes & size windows

The first design pass should be in-silico. Start with two or three enzyme pairs and two adjacent size windows per pair; run ddgRADer (or equivalent digestion simulation) to predict:

number of fragments after digestion and size selection,
expected SNP-bearing loci,
feasible samples per lane given read length and insert distribution,
sensitivity to window width and insert-size shifts.

Fragment distributions from simulated digests show how enzyme pair and size-selection window jointly control the number of recoverable ddRAD loci; widening the window increases regions captured but also shifts insert profiles. (Peterson B.K. et al. (2012) PLOS ONE)
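The digestion and size-selection logic behind such predictions can be illustrated with a toy sketch. This is not ddgRADer's algorithm, only a simplified illustration under stated assumptions: the helper names are hypothetical, cuts are placed at the start of each recognition site (real enzymes cut at a fixed offset within the site), the sequence is a short synthetic string, and the size window is scaled down to 30–60 bp to match it. The motifs are the EcoRI (GAATTC) and MspI (CCGG) recognition sequences.

```python
def find_sites(seq, motif):
    """Return 0-based start positions of every occurrence of motif in seq."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions

def double_digest(seq, motif_a, motif_b, min_len, max_len):
    """Return (start, end, length) for fragments that lie between adjacent
    cut sites, are flanked by one site of each enzyme, and fall inside the
    size window. Same-enzyme fragments are dropped because, in this toy
    model, only mixed-end fragments are treated as ddRAD loci."""
    cuts = sorted(
        [(p, "A") for p in find_sites(seq, motif_a)]
        + [(p, "B") for p in find_sites(seq, motif_b)]
    )
    fragments = []
    for (left, enz_left), (right, enz_right) in zip(cuts, cuts[1:]):
        length = right - left
        if enz_left != enz_right and min_len <= length <= max_len:
            fragments.append((left, right, length))
    return fragments

# Synthetic toy sequence; real runs would use a reference genome.
toy = (
    "GAATTC" + "A" * 20 + "CCGG" + "T" * 50
    + "GAATTC" + "A" * 5 + "GAATTC" + "T" * 30 + "CCGG"
)
frags = double_digest(toy, "GAATTC", "CCGG", min_len=30, max_len=60)
print(len(frags), frags)
```

On this toy sequence the 30–60 bp window keeps two mixed-end fragments (54 bp and 36 bp) and drops a 26 bp mixed-end fragment and an 11 bp same-enzyme fragment. Re-running with a shifted or wider window shows how sensitive the retained count is to the window edges.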

Key practical insights:

Genome-aware enzymes. Favour enzyme pairs with stable cut frequencies in your organism (or a close reference). If the genome is repeat-rich or GC-extreme, prefer enzymes whose cut frequency is less sensitive to repeat content or GC skew.
Start narrow, then tune. A conservative window (e.g., 300–450 bp for PE150) helps control duplicates and overlaps. If the pilot under-yields loci, widen the window by 20–30 bp; if duplicates and overlaps increase, narrow it by 15–20 bp or shift it upward.
Make assumptions auditable. Record genome size, GC band, intended insert distribution, and downstream filter thresholds (MAF, missingness, LD pruning). This speeds protocol reviews and manuscript methods.

Why plan two windows? Even with automated selection, collection timing and recovery can drift. Testing two neighbouring windows on a small pilot de-risks the full run by showing where your library actually lands.

Coverage models: reads per sample, depth, multiplexing

Once you have a locus target (for example, 3–5 k loci for population structure or 8–12 k for finer scans), translate it into reads per sample with transparent math:

Effective depth per locus ≈ (raw reads per sample) × (on-target fraction) × (1 − duplicate rate) ÷ (retained loci)

This deliberately simple model forces you to declare the two levers that most affect outcomes: on-target fraction and duplicate rate.
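The model above is easy to script in both directions: depth from a read budget, and the read budget needed for a depth target. The following is a minimal sketch; the function names are hypothetical, and the on-target fraction (0.5) and duplicate rate (0.3) are placeholder planning values, not measurements. Replace them with your pilot statistics.

```python
import math

def effective_depth(raw_reads, on_target, duplicate_rate, retained_loci):
    """Effective depth per locus under the simple planning model."""
    return raw_reads * on_target * (1.0 - duplicate_rate) / retained_loci

def reads_needed(target_depth, on_target, duplicate_rate, retained_loci):
    """Invert the model: raw reads per sample needed to hit target_depth."""
    usable_fraction = on_target * (1.0 - duplicate_rate)
    return math.ceil(target_depth * retained_loci / usable_fraction)

# Placeholder inputs: 1.0 M reads, 12 k loci, 50% on-target, 30% duplicates.
depth = effective_depth(1_000_000, on_target=0.5, duplicate_rate=0.3,
                        retained_loci=12_000)
budget = reads_needed(20, on_target=0.5, duplicate_rate=0.3,
                      retained_loci=12_000)
print(f"effective depth: {depth:.1f}x")
print(f"reads for 20x:   {budget:,}")
```

With these placeholder inputs the model gives roughly 29× at 1.0 M reads and about 686 k reads for a 20× target. Because both levers enter as multipliers, halving the on-target fraction doubles the read budget, which is why the pilot's measured values matter more than the defaults.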

On-target fraction. If inserts are short relative to read length (e.g., PE150 on 250–300 bp inserts), overlaps and adapter read-through reduce usable bases. Use ddgRADer estimates and your lab's trimming statistics as priors for planning.

Duplicate rate. Complexity depends on input DNA, fragment diversity, and PCR cycles. Narrow windows and small genomes tend to raise duplicates. If duplicates threaten effective depth, consider capture-assisted RAD (e.g., RADcap) or molecular tags to identify PCR copies. Both approaches stabilise locus recovery and enable explicit duplicate handling downstream.
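The overlap effect described under on-target fraction can be sketched with simple arithmetic. This is a rough illustration, assuming "insert" means the genomic fragment between the adapters and ignoring quality trimming; the function name and the example insert sizes are hypothetical.

```python
def paired_end_usage(read_length, insert_size):
    """Estimate how many sequenced bases in a read pair are unique.

    A pair sequences 2 * read_length bases, but it cannot cover more than
    the insert itself. Anything beyond the insert is either overlap between
    the two reads or adapter sequence read past the end of the fragment.
    """
    sequenced_bp = 2 * read_length
    unique_bp = min(sequenced_bp, insert_size)
    return {
        "unique_bp": unique_bp,
        "redundant_bp": sequenced_bp - unique_bp,
        "unique_fraction": unique_bp / sequenced_bp,
        # Each read runs into adapter when the insert is shorter than a read.
        "adapter_read_through": insert_size < read_length,
    }

for insert in (120, 250, 300, 450):
    print(insert, paired_end_usage(150, insert))
```

For PE150, a 250 bp insert leaves 50 of the 300 sequenced bases redundant, while a 120 bp insert wastes more than half the pair and triggers adapter read-through. Feeding the expected insert distribution through a function like this gives a first-order prior for usable bases before any sequencing.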

Ballpark anchors (adjust after the pilot):

Population structure (diploids, non-model): plan 3–5 k LD-pruned loci at 8–12× effective depth.
Selection scans or higher resolution: aim for 8–12 k loci at 12–20×, with stricter missingness control.
Polyploids: increase per-locus depth targets (e.g., 20–30×) or accept fewer loci to keep budgets stable.

Multiplexing and lane plans

Use UDI for large pools to mitigate index hopping and preserve sample identity.
Cap read variance. Keep per-sample reads within ±20–25% by conservative molarity normalisation and avoiding theoretical maximums per lane. If historical runs show higher dispersion, plan a 10% overage.
Match read length to inserts. If pre-runs indicate insert peaks <300 bp, PE150 usually suffices; longer windows and inserts may justify PE250 for more unique bases. Re-compute effective depth under both settings before locking the plan.
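A lane plan that leaves headroom can be sketched as follows. The function name is hypothetical, and the lane output of 400 million read pairs is a placeholder, not a specification of any instrument; substitute the figure from your sequencing provider. The overage and dispersion defaults mirror the 10% and ±25% figures above.

```python
import math

def lane_plan(lane_reads, target_reads_per_sample, overage=0.10, dispersion=0.25):
    """Samples per lane when each sample is budgeted target * (1 + overage).

    Also reports the reads a sample would receive at the low end of the
    dispersion band, i.e. planned mean * (1 - dispersion).
    """
    planned_mean = target_reads_per_sample * (1.0 + overage)
    samples = math.floor(lane_reads / planned_mean)
    low_end_reads = planned_mean * (1.0 - dispersion)
    return {
        "samples": samples,
        "planned_mean": planned_mean,
        "low_end_reads": low_end_reads,
    }

# Placeholder lane output; 1.2 M target reads per sample.
plan = lane_plan(lane_reads=400_000_000, target_reads_per_sample=1_200_000)
print(plan)
```

With these inputs the plan pools 303 samples, and a sample at the bottom of the ±25% band still receives about 0.99 million reads. If that low-end figure falls below the depth model's minimum, reduce the sample count or raise the overage rather than loading the lane to its theoretical maximum.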

Budget & throughput scenarios you can defend

You'll need numbers that explain themselves to a PI, finance partner, and reviewer. Build three tiers that map to programme decisions: Pilot, Core, and Scale. Each tier should state the expected locus count, reads per sample, and why the design fits the question.

Pilot (proof-of-design)

Design: 24–48 samples; two enzyme pairs × two size windows; UDI; PE150 to profile inserts.
Goal: select the best pair/window by observed locus yield, duplicates, and missingness; refine your on-target and duplicate priors for budgeting.
Decision gate: advance when ≥80% of samples exceed the locus target with duplicates under your threshold.
Why it pays: A small, well-designed pilot typically costs less than rescuing a full cohort that missed locus or depth targets.
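The pilot decision gate can be written down as an explicit check so that it is applied the same way to every design arm. This is a minimal sketch: the function name, field names, and per-sample numbers are hypothetical, and the thresholds (3,000 loci, 15% duplicates) are examples rather than recommendations.

```python
def pilot_gate(samples, locus_target, max_duplicate_rate, min_pass_fraction=0.80):
    """Advance when enough samples meet the locus target with duplicates
    below the threshold. Each sample is a dict with 'name', 'loci', and
    'duplicate_rate' keys."""
    failing = [
        s["name"]
        for s in samples
        if s["loci"] < locus_target or s["duplicate_rate"] >= max_duplicate_rate
    ]
    pass_fraction = (len(samples) - len(failing)) / len(samples)
    return {
        "pass_fraction": pass_fraction,
        "advance": pass_fraction >= min_pass_fraction,
        "failing": failing,
    }

# Hypothetical pilot results for one enzyme pair and window.
pilot = [
    {"name": "p01", "loci": 3400, "duplicate_rate": 0.10},
    {"name": "p02", "loci": 3150, "duplicate_rate": 0.12},
    {"name": "p03", "loci": 2700, "duplicate_rate": 0.09},  # misses locus target
    {"name": "p04", "loci": 3600, "duplicate_rate": 0.14},
    {"name": "p05", "loci": 3300, "duplicate_rate": 0.11},
]
decision = pilot_gate(pilot, locus_target=3000, max_duplicate_rate=0.15)
print(decision)
```

Running the same function on each enzyme pair and window gives a like-for-like comparison, and the `failing` list points to the samples worth inspecting before the design is locked.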

Core study

Design: fix the chosen pair/window; lock reads per sample (for example, 1.0–1.5 × 10⁶ PE reads per sample for ~3–5 k loci at ≥8–12× in diploids); include 5–10% technical replicates to monitor lane-to-lane variance.
Goal: generate the dataset sized for your primary analysis (e.g., population structure) with headroom for reasonable missingness thresholds.
Decision gate: QC gates met (below) and interim locus/depth stats within ±10–15% of pilot expectations.

Scale-up

Design: lane-balanced pools; maintain UDI; keep per-lane sample counts at levels that avoid over-dispersion.
Goal: expand sample numbers without changing the data-generating process.
Decision gate: cross-lane replicates show comparable locus/depth distributions and missingness.
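The tiers can be turned into read totals with a few lines of arithmetic, which keeps the budget assumptions explicit. In this sketch the `Tier` class is hypothetical, the sample counts are illustrative (36 pilot samples, 200 core samples), and technical replicates are expressed as a whole-number percentage of samples, rounded up.

```python
import math
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    samples: int
    reads_per_sample: int
    replicate_percent: int = 0  # technical replicates as a % of samples

    @property
    def libraries(self) -> int:
        """Samples plus technical replicates, rounding replicates up."""
        return self.samples + math.ceil(self.samples * self.replicate_percent / 100)

    @property
    def total_reads(self) -> int:
        return self.libraries * self.reads_per_sample

# Illustrative tiers; sample counts are placeholders.
tiers = [
    Tier("Pilot", samples=36, reads_per_sample=1_200_000),
    Tier("Core", samples=200, reads_per_sample=1_200_000, replicate_percent=5),
]
for tier in tiers:
    print(f"{tier.name}: {tier.libraries} libraries, "
          f"{tier.total_reads / 1e6:.1f} M read pairs")
```

Here the pilot needs 43.2 million read pairs and the core study 252 million across 210 libraries. Multiplying these totals by your provider's price per million reads yields the budget line, and changing a single field shows exactly which assumption moved the cost.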

Pilot, QC gates, and risk controls

Write acceptance criteria before sequencing so decisions aren't reset mid-project. Effective QC gates for ddRAD include:

Library fragment profiles match targeted windows. For automated selection, adjust collection timing and confirm ranges after re-analysis. If inserts skew short, consider shifting the window upward rather than widening indiscriminately.
On-target rate (post-trim) tracks your plan. If adapter read-through is frequent, shorten reads or adjust the window.
Duplicate rate remains below your threshold. If not, increase input DNA, reduce PCR cycles, or pilot RADcap to stabilise targeted loci and tag duplicates.
Per-sample locus count exceeds target in ≥80% of samples; per-locus depth distribution meets plan (for example, median ≥12× with a narrow IQR).
Batch neutrality. Include a small reference panel spread across lanes. If the panel shows lane or plate effects, diagnose and rebalance before proceeding.
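The depth-distribution gate in the list above can be checked with the Python standard library. This is a sketch under assumptions: the function name is hypothetical, the depth values are made up, and the 8× ceiling on the interquartile range is a placeholder for whatever "narrow" means in your protocol.

```python
import statistics

def depth_gate(per_locus_depths, min_median=12.0, max_iqr=8.0):
    """Check the median and interquartile range of per-locus depth."""
    q1, _, q3 = statistics.quantiles(per_locus_depths, n=4)
    median = statistics.median(per_locus_depths)
    iqr = q3 - q1
    return {
        "median": median,
        "iqr": iqr,
        "passes": median >= min_median and iqr <= max_iqr,
    }

# Hypothetical per-locus depths for one sample.
depths = [10, 12, 13, 14, 15, 16, 18, 20]
summary = depth_gate(depths)
print(summary)
```

Note that `statistics.quantiles` supports more than one interpolation method, so the exact IQR can differ slightly from other tools. Fix the method in the protocol if the gate is close to its threshold.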

If the pilot misses targets

Under-yielding loci? Widen the size window by ~20–30 bp or switch to the second enzyme pair.
High duplicates/overlaps? Narrow or shift the window to a longer insert band; reduce PCR cycles; consider PE250 only if inserts justify it.
Over-dispersion across samples? Improve normalisation, reduce per-lane sample count, or add a reads-per-sample overage.

Quick answer: How many reads per sample?

For diploid ddRAD studies aimed at population structure, a defensible starting point is 1.0–1.5 × 10⁶ paired-end reads per sample, which typically supports ~3–5 k high-quality loci at ≥8–12× effective depth when enzyme pairs and size windows are tuned. Increase reads if the genome is large or repeat-rich, if the window is extremely narrow, or if downstream analysis (e.g., outlier scans) requires tighter missingness control. For polyploids, plan for higher depth per locus or accept fewer loci. Validate these assumptions with your ddgRADer predictions and the pilot results, which capture insert distributions and trimming realities in your workflow.

Practical, experience-based tips

Lock assumptions early. Put genome size, enzyme cut frequencies, and target insert distribution into a one-page design brief. Version it for the pilot and Core stages.
Prioritise UDI for large pools. It provides a straightforward way to mitigate index hopping and protect effective depth in high-throughput settings.
Normalise to reduce variance. If historical ddRAD runs show ±30–40% read dispersion, cap per-lane sample counts or add a planned overage until variance improves.
Pre-register filters. Define locus missingness, MAF, and LD-pruning rules in the protocol. This reduces post-hoc parameter drift and makes results easier to defend.
Consider capture on difficult genomes. On repeat-rich genomes where locus dropout is persistent, RADcap can stabilise targeted loci and improve downstream consistency.
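For the normalisation tip above, read dispersion can be measured from the demultiplexing counts of a previous run. The sketch below assumes dispersion is expressed as each sample's relative deviation from the pool mean, which is one reasonable reading of the ± figures in this guide; the function name, sample names, and read counts are hypothetical.

```python
def flag_dispersion(reads_by_sample, tolerance=0.25):
    """Return each sample's relative deviation from the pool mean and the
    names of samples whose deviation exceeds the tolerance."""
    mean_reads = sum(reads_by_sample.values()) / len(reads_by_sample)
    deviations = {
        name: (reads - mean_reads) / mean_reads
        for name, reads in reads_by_sample.items()
    }
    flagged = [name for name, dev in deviations.items() if abs(dev) > tolerance]
    return deviations, flagged

# Hypothetical per-sample read counts from one pool.
pool = {
    "s1": 1_000_000,
    "s2": 1_200_000,
    "s3": 700_000,
    "s4": 1_500_000,
    "s5": 1_100_000,
}
deviations, flagged = flag_dispersion(pool)
for name, dev in deviations.items():
    print(f"{name}: {dev:+.1%}")
print("outside tolerance:", flagged)
```

In this example the pool mean is 1.1 million reads, and samples `s3` and `s4` sit about 36% below and above it, so both fall outside a ±25% band. Tracking this across runs shows whether normalisation changes are actually reducing variance.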

Budget narratives that resonate with finance and science

Frame requests in outcomes, not only inputs:

Pilot narrative: "We will evaluate two enzyme pairs and two size windows across 36 samples. The goal is to confirm locus yield (≥3 k loci at ≥8×) and acceptable duplicates (<15%). We will then lock the design and reads per sample. This reduces the risk of a high-cost rerun."
Core narrative: "We will generate ~1.2 × 10⁶ paired-end reads per sample with UDI in balanced pools. Expected outcome: ~4 k loci at ≥12× with ≤20% missingness in the initial call set, then LD-pruned for downstream analyses."
Scale narrative: "We will maintain the same library recipe and index strategy, adding a 5% reference panel across lanes to monitor drift and enforce batch neutrality."

These narratives anticipate typical stakeholder questions and connect each budget line to a measurable reduction in technical risk.

FAQs

How do I estimate expected locus count before sequencing?

Run ddgRADer (or similar in-silico digestion) with your candidate enzyme pairs and windows. It predicts fragments after digestion/selection, expected SNP-bearing loci, and feasible multiplexing. Validate with a small pilot, since real size selection can differ from nominal settings.

What affects reads per sample more: enzyme choice or size window?

Both matter, but the size-selection window often has the larger effect on duplicates and overlaps, while enzyme choice shapes where and how evenly fragments come from. Test two neighbouring windows for the leading enzyme pair and choose the one that balances locus yield with duplicate control.

Do I need unique dual indexes (UDI) for ddRAD multiplexing?

For large pools, yes: UDI is strongly recommended. It enables removal of hopped index combinations during demultiplexing, preserving accurate sample assignment and effective depth at modest added cost.

How many reads per sample should I budget?

For many diploid ddRAD population studies, 1.0–1.5 × 10⁶ paired-end reads per sample is a sound starting point. Adjust upward for large/repetitive genomes or strict missingness targets, and validate against pilot on-target and duplicate rates.

Gel vs BluePippin—does automated size selection matter?

Automated systems offer programmable collection ranges and repeatable recoveries, but you still need to evaluate and adjust elution timing to hit the target window. Consistency improves, yet verification and tuning remain essential.

Next steps

Scope a pilot now. Run two enzyme pairs × two size windows across 24–48 samples with UDI, and verify ddgRADer predictions against observed on-target rate, duplicates, and locus yield.
Request a design check. Our team can review your ddRAD project design, pressure-test the coverage model, and return a lane plan aligned to your primary endpoint through Population Genomics Sequencing and our ddRAD sequencing service.
Lock a defendable budget. Translate your chosen window and enzymes into a reads-per-sample line item with explicit assumptions. If scope changes, you will know exactly what moves.

References

Peterson, B.K., Weber, J.N., Kay, E.H., Fisher, H.S. & Hoekstra, H.E. Double digest RADseq: An inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE 7, e37135 (2012).
DaCosta, J.M. & Sorenson, M.D. Amplification biases and consistent recovery of loci in a double-digest RAD-seq protocol. PLoS ONE 9, e106713 (2014).
Díaz-Arce, N. & Rodríguez-Ezpeleta, N. Selecting RAD-Seq data analysis parameters for population genetics: The more the better? Frontiers in Genetics 10, 533 (2019).
Farouni, R., Djambazian, H., Ferri, L.E., Ragoussis, J. & Najafabadi, H.S. Model-based analysis of sample index hopping reveals its widespread artifacts in multiplexed single-cell RNA-sequencing. Nature Communications 11, 2704 (2020).
Glenn, T.C., Nilsen, R., Kieran, T.J. et al. Adapterama I: Universal stubs and primers for 384 unique dual-indexed or 147,456 combinatorially-indexed Illumina libraries (iTru & iNext). PeerJ 7, e7755 (2019).
Lajmi, A., Glinka, F. & Privman, E. Optimizing ddRAD sequencing for population genomic studies with ddgRADer. Molecular Ecology Resources (2023).
Hoffberg, S.L., Kieran, T.J., Catchen, J.M. et al. RADcap: Sequence capture of dual-digest RADseq libraries with identifiable duplicates and reduced missing data. Molecular Ecology Resources 16, 1264–1278 (2016).
