Planning Reduced Representation Sequencing in Non-Model Species: Genome Size, Repeats, and In-Silico Digest

Reduced representation sequencing (RRS) samples a manageable, repeatable subset of the genome. The challenge in designing RAD-seq for non-model species is that genome size is often unknown, repeat content is often high, and GC content and restriction-site polymorphisms skew fragment yield. An in-silico digest makes planning concrete. This guide gives you a decision framework, the inputs you need, and failure-mode troubleshooting so you can commit budget confidently.
What "Non-Model Species Design" Means for Reduced Representation Sequencing
Non-model species RRS design is the process of choosing enzymes, fragment windows, and sequencing strategy based on genome size, repeats, and predicted cut-site behavior.
Designing for non-models means embracing uncertainty and controlling it. Unknown genome features create unstable locus recovery and missingness. Generic "works in my model organism" RAD advice often fails in plants, polyploids, and repeat-heavy lineages.
You're trying to control three outcomes:
- Locus count within a useful range for your analysis goals.
- Locus consistency across samples and batches.
- Missingness patterns low enough to avoid bias.
When to consider alternatives within the RRS family:
- ddRAD for tunable locus counts and better cross-sample consistency.
- GBS when you prioritize simplicity over tight window control.
- 2b-RAD when size selection is unstable or repeats complicate fragment recovery.
If you need a RAD-seq refresher, see the overview in the RAD-seq basics guide.
The Inputs You Need Before You Pick Enzymes
A good RRS plan starts with a small set of species and sample facts that predict locus yield and failure modes.
Start by assembling a species profile:
- Genome size estimate, even a range, from literature, flow cytometry, or k-mers.
- Ploidy and known duplications.
- Repeat level guess and any transposon notes.
- GC content range if known; proxy from relatives if not.
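When only short reads are available, the k-mer route to a genome size range can be sketched as follows. This is a minimal illustration, assuming you already have a k-mer histogram (multiplicity mapped to number of distinct k-mers) from a k-mer counter; the function name and the error cut-off of 5 are hypothetical choices, and in heterozygous or polyploid genomes the tallest peak may not be the homozygous one, so treat the result as one end of a range.

```python
from typing import Dict


def genome_size_from_kmers(histogram: Dict[int, int], min_multiplicity: int = 5) -> float:
    """Estimate genome size as (total k-mers) / (depth of the tallest peak).

    histogram maps k-mer multiplicity -> number of distinct k-mers seen that often.
    Multiplicities below min_multiplicity are treated as sequencing errors and
    ignored; the cut-off is an assumption you should check against your own plot.
    """
    kept = {m: c for m, c in histogram.items() if m >= min_multiplicity}
    if not kept:
        raise ValueError("no k-mers above the error cut-off")
    peak_depth = max(kept, key=kept.get)            # multiplicity with the most k-mers
    total_kmers = sum(m * c for m, c in kept.items())
    return total_kmers / peak_depth


# Toy histogram: an error peak at low multiplicity and a main peak near 20x.
toy_histogram = {1: 1000, 2: 100, 19: 50, 20: 100, 21: 50}
print(genome_size_from_kmers(toy_histogram))
```

Running the same estimate with two or three different error cut-offs gives a quick sense of how wide the genome size range should be.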
Then outline your sample and project profile:
- Tissue type, extraction method, and DNA integrity metrics.
- DNA amount per sample and expected variability.
- Cohort structure and whether cross-batch comparability is required.
Downstream goal matters:
- Structure/selection scans tolerate modest missingness and fewer loci.
- Association-like mapping and panel building need denser, more consistent loci.
Low input or degraded DNA changes feasible options. For constraints and workarounds, see low-input RAD-seq considerations.
Figure 2. Collect these species inputs before choosing enzymes or size windows.
Boxed checklist — copy/paste:
- Genome size estimate (range is fine)
- Ploidy
- Repeat level (rough)
- GC range (if known)
- Expected divergence vs reference or close relative
- Sample count
- DNA amount/quality per sample
- Target analysis (structure/selection vs association/panel)
- Cross-batch comparability requirement
- Reference availability
Genome Size, Repeats, and GC: How They Change Locus Yield
Genome size, repeat density, and GC content shape how many restriction fragments you generate and how reliably loci can be recovered.
Genome size: bigger often means fewer usable loci per sequencing dollar. A larger genome contains more cut sites, so a fixed read budget is spread thinly across a much larger set of fragments and fewer loci reach usable depth. You may need stricter size windows or an RRS flavor that standardizes fragments.
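The coverage dilution is simple arithmetic. The sketch below assumes usable reads are spread evenly across loci, which real libraries never quite achieve, and uses illustrative numbers (2 million reads per sample, 80% usable reads) rather than recommendations; both function names are hypothetical.

```python
def mean_depth_per_locus(reads_per_sample: float, n_loci: int,
                         usable_fraction: float = 0.8) -> float:
    """Mean reads per locus if usable reads were spread evenly across loci."""
    if n_loci <= 0:
        raise ValueError("n_loci must be positive")
    return reads_per_sample * usable_fraction / n_loci


def reads_needed(n_loci: int, target_depth: float,
                 usable_fraction: float = 0.8) -> float:
    """Reads per sample required to reach target_depth at every locus on average."""
    return n_loci * target_depth / usable_fraction


# Same read budget, two hypothetical locus counts (smaller vs larger genome).
for n_loci in (50_000, 400_000):
    depth = mean_depth_per_locus(2_000_000, n_loci)
    print(f"{n_loci} loci -> mean depth {depth:.1f}x")
```

Because coverage is uneven in practice, it is safer to budget for a mean depth well above the minimum depth your genotype caller needs.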
Repeats and duplications inflate apparent loci but reduce trustworthiness. Paralog-like signals creep in when fragments are not unique enough. Keep windows tighter and choose enzyme pairs that minimize repetitive contexts.
GC content matters for digestion and coverage bias. A recognition site occurs more often when its base composition matches the genome's, so a GC-poor frequent cutter in an AT-rich genome can overproduce short fragments, and fragments shorter than the read length cause read-through into the adapter. Mitigate by pairing enzymes carefully and enforcing robust lower size cut-offs.
For plant and non-model, repeat-heavy genomes, see ddRAD for plants: a practical manual.
Figure 3. Genome size and repeats change how many unique loci RRS can recover.
Choosing Enzymes and Fragment Windows Without Guesswork
Enzyme and size-window choices should be driven by predicted fragment distributions and repeat avoidance, not by what worked in another species.
Single-enzyme vs double-digest vs 2b-RAD — practical cues:
- If genome complexity and repeats are moderate, ddRAD gives tunable locus counts.
- If you need simplicity and can accept looser window control, consider GBS.
- If size selection is unstable or repeats complicate recovery, 2b-RAD avoids size selection with fixed-length tags; see 2b-RAD option for population genomics.
Enzyme choice logic:
- Cut-site frequency: recognition-site length and motif GC determine how often cuts occur.
- Methylation sensitivity: CpG methylation can block cuts for some enzymes; consult NEB/REBASE.
- Robustness in non-models: favor pairs with predictable behavior across GC ranges and avoid frequent cutters that overproduce short fragments.
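The cut-site frequency logic above can be turned into a rough first-pass number. This sketch assumes bases occur independently with frequencies set only by genome GC content, which ignores CpG depletion, methylation, and repeats, so treat the output as an order-of-magnitude guide. The recognition sequences are the standard ones for EcoRI, MspI, and MseI; the 1 Gb genome and 35% GC are illustrative values, and the function names are hypothetical.

```python
def site_probability(motif: str, gc: float) -> float:
    """Probability that a random position starts the motif, given genome GC fraction."""
    base_freq = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in motif.upper():
        prob *= base_freq[base]
    return prob


def expected_sites(motif: str, genome_size_bp: float, gc: float) -> float:
    """Expected number of recognition sites under the independent-base assumption."""
    return genome_size_bp * site_probability(motif, gc)


motifs = {"EcoRI": "GAATTC", "MspI": "CCGG", "MseI": "TTAA"}
for name, motif in motifs.items():
    n_sites = expected_sites(motif, genome_size_bp=1e9, gc=0.35)
    print(f"{name} ({motif}): about {n_sites:,.0f} sites")
```

Comparing the same enzymes at the low and high end of your GC range shows quickly which candidates are most sensitive to that uncertainty.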
Size selection strategy stabilizes locus recovery but can fail if windows are loose or off-center. Tie window choice directly to locus consistency and missingness. For practical steps and pitfalls, see ddRAD size selection strategy.
Mini comparison matrix — option → best for → main risk → validate first:
- ddRAD → Tunable loci in moderate/large genomes → window drift and short-fragment leakage → verify fragment histograms vs prediction.
- GBS → Simpler library builds → uneven locus recovery → check missingness and duplicates.
- 2b-RAD → Unstable size selection or complex repeats → short, fixed-length tags limit the sequence available per locus → confirm tag counts and cross-sample repeatability.
Figure 4. Choose an RRS approach based on genome context and validation needs.
In-Silico Digest: The Checklist That De-Risks Non-Model Projects
An in-silico digest is a computational simulation that predicts fragment counts and size distributions so you can select enzymes and windows before committing budget.
What you can simulate with and without a reference:
- With a reference: digest exact sequences, filter by size windows, and predict locus counts and fragment-length modes.
- Without a reference: use proxy genomes, low-pass assemblies, or k-mer-based expectations. Be explicit about uncertainty and use ranges.
The In-Silico Digest Checklist — copy/paste:
- Choose 3–5 candidate enzymes or pairs aligned to genome GC and repeats.
- Simulate fragments from a reference or proxy genome, using a tool such as RADinitio or SimRAD.
- Apply your target size window and record fragment-length mode and spread.
- Estimate unique vs repetitive fraction using mappability or repeat annotations.
- Predict a loci range per sample at your planned read depth.
- Compare options on locus count, consistency risk, and repeat burden.
- Pick the top 1–2 designs and run a small pilot before scaling.
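The core of the checklist, digesting a sequence and applying a size window, can be sketched in a few lines. This is a simplified illustration and not how RADinitio or SimRAD are implemented: it assumes the sequence is already loaded as an uppercase string, that digestion is complete, and that only fragments flanked by one site from each enzyme are kept, as in ddRAD. The EcoRI (G^AATTC) and MspI (C^CGG) cut positions are standard; the function names, toy sequence, and window are hypothetical.

```python
import re
from typing import List, Tuple

Enzyme = Tuple[str, int]  # (recognition motif, cut offset from motif start)


def cut_positions(seq: str, motif: str, cut_offset: int) -> List[int]:
    """Top-strand cut positions; the lookahead also catches overlapping sites."""
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [m.start() + cut_offset for m in pattern.finditer(seq)]


def double_digest(seq: str, enzyme_a: Enzyme, enzyme_b: Enzyme) -> List[Tuple[int, str, str]]:
    """Return (length, left_enzyme, right_enzyme) for every fragment between two cuts."""
    cuts = sorted(
        [(pos, "A") for pos in cut_positions(seq, *enzyme_a)]
        + [(pos, "B") for pos in cut_positions(seq, *enzyme_b)]
    )
    return [(right - left, left_enz, right_enz)
            for (left, left_enz), (right, right_enz) in zip(cuts, cuts[1:])]


def ddrad_fragments(seq: str, enzyme_a: Enzyme, enzyme_b: Enzyme,
                    min_len: int, max_len: int) -> List[int]:
    """Lengths of fragments flanked by both enzymes and inside the size window."""
    return [length for length, left_enz, right_enz in double_digest(seq, enzyme_a, enzyme_b)
            if left_enz != right_enz and min_len <= length <= max_len]


ECORI: Enzyme = ("GAATTC", 1)
MSPI: Enzyme = ("CCGG", 1)

# Toy sequence; a real run would loop over the contigs of a reference or proxy genome.
toy = "AAAA" + "GAATTC" + "T" * 20 + "CCGG" + "T" * 10 + "CCGG" + "AAAA" + "GAATTC" + "AA"
print(ddrad_fragments(toy, ECORI, MSPI, min_len=20, max_len=30))
```

Collecting the returned lengths per candidate enzyme pair and window gives the fragment counts, mode, and spread that the checklist asks you to record.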
How to interpret results and act:
- Too many short fragments: raise the lower cut-off or change the frequent cutter.
- Too few loci: widen the window or pick a more frequent cutter pair; consider 2b-RAD.
- Mode drift vs prediction: re-center the window; check instrument and gel cuts.
Teams often standardize enzyme screens, fragment histograms, and a design manifest so later batches stay comparable.
Figure 5. An in-silico digest reduces guesswork before committing budget.
Common Failure Modes in Non-Model Species (and How to Prevent Them)
Most non-model RRS failures come from unpredictable locus dropout, repeat-driven ambiguity, or inconsistent fragment recovery across samples.
Locus dropout drivers include restriction-site polymorphisms and DNA quality variance. Symptoms are scattered missingness and low shared-loci percentages. Prevention: consistent input, robust enzyme pairs, and window tuning.
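Scattered missingness and low shared-loci percentages are easy to quantify once you know which loci each sample recovered. The sketch below is one possible way to define the two metrics, not a standard from any pipeline: "shared" here means present in at least a chosen fraction of samples (80% by default, an assumption), and the input format and function names are hypothetical.

```python
from typing import Dict, Set


def per_sample_missingness(calls: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of all observed loci that each sample is missing."""
    all_loci = set().union(*calls.values()) if calls else set()
    if not all_loci:
        raise ValueError("no loci recovered in any sample")
    return {sample: 1 - len(loci) / len(all_loci) for sample, loci in calls.items()}


def shared_loci_fraction(calls: Dict[str, Set[str]], min_sample_fraction: float = 0.8) -> float:
    """Fraction of loci present in at least min_sample_fraction of samples."""
    all_loci = set().union(*calls.values()) if calls else set()
    if not all_loci:
        raise ValueError("no loci recovered in any sample")
    n_samples = len(calls)
    shared = [locus for locus in all_loci
              if sum(locus in loci for loci in calls.values()) / n_samples >= min_sample_fraction]
    return len(shared) / len(all_loci)


# Toy pilot: three samples, four loci in total.
pilot = {"s1": {"a", "b", "c", "d"}, "s2": {"a", "b", "c"}, "s3": {"a", "b"}}
print(per_sample_missingness(pilot))
print(shared_loci_fraction(pilot, min_sample_fraction=1.0))
```

Computing both metrics per site, tissue, or batch makes it easier to tell DNA-quality dropout from design-level dropout.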
Repeat-driven ambiguity happens when loci aren't unique enough to trust. Prevention: tighter windows, alternative enzyme pairs, and, in specific contexts, 2b-RAD.
"Looks fine in FASTQ, fails downstream" is common when assembly parameters or filters are mis-set. Monitor reads-per-sample balance, fragment size proxies, and duplication outliers early.
| Symptom (what you see) | Likely cause | Prevention / first fix |
| --- | --- | --- |
| Shared loci drop in one group (site/tissue/batch) | DNA quality varies; digestion efficiency differs | Standardize extraction; re-quantify; pilot balanced subsets; tighten QC gates before scale |
| Total loci far above expectation | Window too wide; repeats inflate apparent loci | Narrow size window; reconsider enzyme pair; validate unique loci fraction in pilot |
| Total loci far below expectation | Enzyme cut frequency too low; window too narrow | Try alternative enzyme pair; broaden window slightly; confirm fragment distribution in pilot |
| High duplicate rates across many samples | Low complexity libraries; over-amplification | Reduce PCR cycles; optimize input DNA; remove duplicates consistently |
| Reads look fine but genotypes are sparse | Locus dropout / restriction-site polymorphism | Pilot across diversity; choose enzymes less sensitive to site polymorphism; adjust design before filters |
| Missingness is strongly batch-shaped | Library prep drift; size-selection drift | Freeze protocol; use bridge samples; batch-aware QC; rerun only the affected step |
| Fragment size profiles shift across runs | Inconsistent size selection | Tighten selection method; track profiles; keep settings consistent |
| Excess heterozygosity at many loci | Paralog-like loci or multi-copy regions | Filter abnormal heterozygosity/depth; tighten rules; exclude suspect loci |
| PCA separates by batch before biology | Structured missingness or coverage imbalance | Batch-colored PCA checks; harmonize QC; remove outliers; rebalance coverage |
| In-silico and observed loci disagree by >3–4× | Proxy genome mismatch; repeats not modeled | Treat in-silico as directional; confirm with pilot; update assumptions (k-mer/genome size) |
| Polyploids show unstable locus sets | High repeats/duplication; polyploid complexity | Conservative design; narrower windows; validate locus stability; consider alternative RRS type |
| Too many repetitive/off-target fragments | Repeat-rich genome; enzyme sites enriched in repeats | Redesign enzyme/window; prioritize unique fraction checks early |
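The "filter abnormal heterozygosity/depth" fix in the table can be prototyped before you settle on pipeline settings. This is a minimal sketch under stated assumptions: each locus is summarized by its observed heterozygosity and mean depth, the heterozygosity cap of 0.5 reflects the maximum expected for a biallelic locus under Hardy-Weinberg proportions, and the depth ratio of 2× the median is an illustrative threshold, not a recommendation. The function name and input layout are hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple


def flag_paralog_like(loci: Dict[str, Tuple[float, float]],
                      max_het: float = 0.5,
                      depth_ratio: float = 2.0) -> List[str]:
    """Return loci whose heterozygosity or depth suggests collapsed multi-copy regions.

    loci maps locus name -> (observed heterozygosity, mean depth).
    """
    if not loci:
        return []
    median_depth = median(depth for _, depth in loci.values())
    return sorted(name for name, (het, depth) in loci.items()
                  if het > max_het or depth > depth_ratio * median_depth)


# Toy summary: L2 has excess heterozygosity, L3 has roughly triple the typical depth.
summary = {"L1": (0.2, 20.0), "L2": (0.9, 22.0), "L3": (0.3, 60.0), "L4": (0.1, 18.0)}
print(flag_paralog_like(summary))
```

In polyploids the heterozygosity cap needs rethinking, so treat flagged loci as candidates for inspection rather than automatic exclusions.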
A Practical Pilot Plan: Validate Before You Scale
A small pilot validates locus yield and consistency so you can lock a design before running a full cohort.
Use the thresholds below as starting ranges to guide actions. Adapt them to genome size, repeats, sample quality, and your downstream goal.
Pilot sample selection should represent genome diversity and sample quality. Choose diverse individuals and tissues; avoid only "best samples." Include variability to expose dropout risks.
Pilot readout — Go/Adjust/Stop starting ranges with action triggers:
- Go: median per-sample loci within ±20% of the in-silico prediction, shared loci at or above roughly 60–70%, and fragment size mode within about ±20–30 bp of the predicted mode.
- Adjust: loci 20–40% below prediction, or shared loci of 40–60% → shift the window by ±50 bp, change the frequent cutter, or increase reads by 25–50%.
- Stop: loci more than 40% below prediction and shared loci below 40% → redesign enzymes or consider 2b-RAD.
- Missingness targets: ≤10% for structure/selection; ≤5% for association-like analyses or panel building. Missingness of 10–20% prompts QC and window re-centering; widespread missingness above 20–30% means redesign.
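Encoding the readout as a small function keeps the decision consistent across pilots. The sketch below uses the starting ranges listed above, taking the lenient end of each range (shared loci ≥60%, mode shift ≤30 bp). Two choices are assumptions not stated in the ranges: loci counts more than 20% above prediction are routed to "adjust", and so is any combination that meets neither the Go nor the Stop criteria. The function name and parameters are hypothetical.

```python
def pilot_decision(observed_loci: float, predicted_loci: float,
                   shared_fraction: float, mode_shift_bp: float) -> str:
    """Classify a pilot as 'go', 'adjust', or 'stop' from starting-range thresholds."""
    if predicted_loci <= 0:
        raise ValueError("predicted_loci must be positive")
    deviation = (observed_loci - predicted_loci) / predicted_loci

    # Stop: loci more than 40% below prediction AND shared loci below 40%.
    if deviation < -0.40 and shared_fraction < 0.40:
        return "stop"
    # Go: loci within +/-20%, shared loci >= 60%, mode within 30 bp of prediction.
    if abs(deviation) <= 0.20 and shared_fraction >= 0.60 and abs(mode_shift_bp) <= 30:
        return "go"
    # Everything else, including loci well above prediction, needs adjustment.
    return "adjust"


print(pilot_decision(95_000, 100_000, shared_fraction=0.72, mode_shift_bp=10))
print(pilot_decision(70_000, 100_000, shared_fraction=0.55, mode_shift_bp=15))
print(pilot_decision(50_000, 100_000, shared_fraction=0.30, mode_shift_bp=40))
```

Adapt the thresholds to your genome and downstream goal before relying on the output, and record the values you used alongside the pilot metrics.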
What to freeze for scale:
- Enzyme(s), size window, adapter set, and library approach.
- A simple design manifest documenting pilot metrics and chosen parameters.
For cohort-scale projects, teams often standardize a design manifest (enzymes, window, pilot metrics) to keep later batches comparable.
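A design manifest can be as simple as a small structured record written to JSON and stored with the project. The field names below are hypothetical and follow the items listed above; the enzyme pair, window, and metric values are placeholders for illustration, not a recommended design.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DesignManifest:
    """Frozen design parameters plus the pilot metrics that justified them."""
    enzymes: List[str]
    size_window_bp: Tuple[int, int]
    adapter_set: str
    library_approach: str
    expected_loci_range: Tuple[int, int]
    pilot_metrics: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        # Tuples are serialized as JSON arrays; keys are sorted for stable diffs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


manifest = DesignManifest(
    enzymes=["EcoRI", "MspI"],             # placeholder pair
    size_window_bp=(300, 450),             # placeholder window
    adapter_set="adapter-set-v1",          # placeholder label
    library_approach="ddRAD",
    expected_loci_range=(40_000, 60_000),  # placeholder range
    pilot_metrics={"shared_loci_fraction": 0.68, "median_missingness": 0.08},
)
print(manifest.to_json())
```

Keeping the manifest under version control alongside the pilot data makes it straightforward to confirm that later batches used the same design.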
Figure 6. Use a pilot to validate locus yield, then freeze the design for scale.
FAQ: RRS in Non-Model Species
If genome size is unknown, start with a range, not a single number. Run in-silico scenarios across that range, then confirm with a small pilot.
High repeats and duplication increase ambiguous loci. Prefer designs that reduce repetitive fragments and validate paralog-like signals early.
An in-silico digest without a reference genome is possible, but results are approximate. Use close relatives, low-pass assemblies, or k-mer-based estimates and treat outputs as guides for pilot decisions.
Consider 2b-RAD when genome complexity or repeats make size selection unstable, or when you need a more standardized fragment profile across samples.
RRS is most cost-efficient when many samples can be multiplexed. The sample number at which it pays off depends on genome size, desired loci, and analysis goals.
When requesting a design, share your genome size estimate, ploidy, repeat expectations, DNA input/quality, target sample count, downstream goal, and any reference availability.
Next steps
- Build a simple design manifest from your pilot: enzymes, window, adapter set, expected loci range, pilot metrics.
- Use the checklists and thresholds above to lock a design before scaling your cohort.
References
- Peterson, B. K., et al. "Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species." PLOS ONE, 2012.
- DaCosta, J. M., and Sorenson, M. D. "Amplification Biases and Consistent Recovery of Loci in a Double-Digest RAD-seq Protocol." PLOS ONE, 2014.
- Christiansen, H., et al. "Facilitating Population Genomics of Non-Model Organisms Through Optimized Experimental Design for Reduced Representation Sequencing." BMC Genomics, 2021.
- Lajmi, A., et al. "Optimizing ddRAD Sequencing for Population Genomic Studies with ddgRADer." Molecular Ecology Resources, 2023.
- Chambers, E. A., et al. "2b or Not 2b? 2bRAD Is an Effective Alternative to ddRAD for Phylogenomics." Ecology and Evolution, 2023.
- Flanagan, S. P., and Jones, A. G. "Substantial Differences in Bias Between Single-Digest and Double-Digest RAD-Seq Libraries: A Case Study." Molecular Ecology Resources, 2018.