Using RAD-seq for Parentage Analysis, Parentage-Based Tagging (PBT), and Population Assignment: Accuracy, SNP Panel Size, and Evaluation Methods
RAD-seq offers a reliable solution for parentage analysis and population assignment studies. By tailoring SNP panel size and filtering parameters to the study design, researchers can achieve high assignment accuracy across fisheries, conservation programs, and breeding projects. Restriction enzyme choice, RAD variant (e.g., single-digest, double-digest), genome size, marker density, and research goals all influence the optimal number of SNPs. RAD-seq scales well to large cohorts, fits tight budgets, and performs especially well in non-model species.
Note: This article discusses research-use-only (RUO) applications. It is not intended for clinical diagnostics, forensic testing, or direct-to-consumer genetic services.
Key Takeaways
- RAD-seq is cost-effective for parentage and population assignment, enabling large sample sizes without prohibitive per-sample costs.
- Study-specific SNP panel design (MAF thresholds, LD pruning, missingness filters) is critical for achieving robust assignment accuracy, especially in non-model species.
- Thousands of candidate markers discovered by RAD-seq improve genotype quality and confidence, which is particularly valuable in rare or endangered species and historical collections.
- Migrating SNPs discovered by RAD-seq to GT-seq supports long-term, high-throughput monitoring with stable, targeted panels.
- Strict quality control (QC) at each step—from DNA extraction to SNP calling—reduces genotyping errors and limits the impact of missing data.
- Likelihood-based parentage tools such as CERVUS and COLONY, combined with carefully tuned RAD pipelines (e.g., dDocent, ipyrad, Stacks), can reach parentage assignment success rates near 100% with low false exclusion rates.
- Consistent missing-data management and marker selection criteria across projects improve reproducibility, comparability, and data integration.
Figure 1. Overview of the RAD-seq parentage and PBT workflow, from SNP panel design to accuracy evaluation.
Why Choose RAD-seq for Parentage and PBT?
Key Researcher Questions
Researchers planning parentage or population assignment projects often ask:
- How many SNPs do we really need for reliable parentage inference?
- Can RAD-seq handle hundreds or thousands of individuals without breaking the budget?
- Is RAD-seq practical for species with no reference genome?
- How difficult is it to migrate from discovery to a fixed production panel (e.g., GT-seq) for long-term monitoring?
Comparing RAD-seq with traditional genotyping methods helps clarify these questions.
RAD-seq vs Traditional Genotyping (Microsatellites, AFLPs)
| Feature | RAD-seq | Traditional Methods (e.g., Microsatellites, AFLPs) |
| --- | --- | --- |
| Marker discovery | Thousands to tens of thousands of loci | Limited number of markers |
| Labour cost | A fraction of traditional methods | Higher labour cost |
| De novo capability | Yes; does not require a reference genome | Often limited to reference-based approaches |
| Marker quality | Many loci, but filtering is needed to separate signal from noise | Fewer loci; alleles easier to inspect |
RAD-seq provides far more loci and reduces labour, which is especially important in large breeding or conservation programs. The ability to work without a reference genome makes RAD-seq attractive for non-model species.
RAD-seq Advantages for Non-Model Species
RAD-seq has transformed research in ecology, evolution, and conservation genetics:
- Generates thousands of SNP loci at moderate cost.
- Achieves higher per-locus coverage compared with many reduced-panel methods, increasing confidence in genotype calls.
- Does not require prior genomic resources, enabling work in poorly characterized species.
- Supports studies of rare or endangered species, museum specimens, and populations with little or no genomic data.
Emerging Practices
Many labs have shifted from microsatellites to RAD-seq for:
- Fisheries parentage analysis and parentage-based tagging (PBT).
- Mixed-stock analysis and population assignment.
- Breeding program tracking in aquaculture and livestock.
RAD-seq is often used as a discovery platform, with informative SNPs subsequently migrated into GT-seq or similar targeted assays for routine genotyping.
RAD-seq vs Microsatellites, WGS, and GT-seq
Figure 2. Schematic comparison of RAD-seq, microsatellites, WGS, and GT-seq in cost, throughput, and typical use cases.
Cost and Throughput
Researchers frequently compare platforms to balance depth, cost, and throughput:
| Method | Cost Implications |
| --- | --- |
| Whole-genome sequencing (WGS) | Most comprehensive but expensive; often unnecessary when only a small fraction of the genome informs parentage/PBT. |
| RAD-seq | No probe development; flexible, but involves multiple experimental steps, so not ideal for ultra-high-throughput routine screening. |
| Microsatellites | Requires locus-specific primer development; cost-effective if the markers are reused across many samples and projects. |
| GT-seq | Highly cost-effective at large sample sizes; excellent for long-term, high-resolution monitoring. |
RAD-seq lowers upfront investment and allows marker discovery without probes or arrays. GT-seq then takes over when the marker set is fixed and sample numbers grow.
SNP Discovery and Panelization
- RAD-seq generates a large number of SNP loci, which improves resolution for population structure, kinship, and PBT.
- Microsatellites have many alleles per locus, but their limited number of loci can reduce power in highly related or recently bottlenecked populations.
- In practice, RAD-seq and GT-seq datasets often show comparable assignment probabilities when analyzed under the same statistical models; RAD-seq shines in the discovery phase, GT-seq in high-throughput deployment.
Microhaplotypes and Scalability
RAD-seq behaves as a semi-open system:
- Each new dataset can reveal additional variation, allowing marker sets to evolve as knowledge grows.
- It is portable across species and sequencing platforms.
- Projects can tune sequencing depth and marker density to match budgets and power requirements.
However, shallow sequencing can increase noise and missing genotypes. Many groups use imputation or genotype likelihood frameworks to stabilize downstream analyses.
Study Design: Marker and SNP Panel Selection
Marker Selection Strategies
Robust marker selection is central to accurate parentage and PBT:
- Minor allele frequency (MAF) thresholds ensure loci are informative across individuals.
- Linkage disequilibrium (LD) pruning removes redundant markers, improving independence of signals.
- Missingness filters exclude loci or individuals with excessive missing data.
- Error-rate filters remove loci with high Mendelian error or inconsistent calls in replicates.
- Many teams maintain blacklists for problematic loci and whitelists for high-performing candidates.
Tip: Applying consistent marker selection criteria across projects improves reproducibility and simplifies meta-analyses.
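For illustration, the sketch below applies MAF and missingness filters to a small genotype matrix in Python; the thresholds, the 0/1/2 genotype coding, and the toy data are assumptions for the example, not recommendations for any particular dataset.

```python
# Minimal sketch: filter a SNP genotype matrix (individuals x loci) by MAF
# and missingness. Thresholds and data are illustrative only.
import numpy as np

def filter_loci(genotypes, maf_min=0.05, max_missing=0.10):
    """genotypes: array of 0/1/2 alternate-allele counts, -1 for missing calls."""
    keep = []
    n_ind, n_loci = genotypes.shape
    for j in range(n_loci):
        calls = genotypes[:, j]
        observed = calls[calls >= 0]
        missing_rate = 1 - len(observed) / n_ind
        if missing_rate > max_missing or len(observed) == 0:
            continue
        p = observed.mean() / 2          # frequency of the alternate allele
        maf = min(p, 1 - p)              # minor allele frequency
        if maf >= maf_min:
            keep.append(j)
    return np.array(keep, dtype=int)

# Toy example: 4 individuals x 3 loci (-1 = missing genotype)
genotypes = np.array([
    [0, 2, -1],
    [1, 2,  0],
    [1, 2,  1],
    [2, 2,  0],
])
print(filter_loci(genotypes))  # -> [0]; locus 1 is monomorphic, locus 2 has too much missing data
```

In practice, the same logic is usually applied to VCF files with established tools (e.g., VCFtools or PLINK) rather than hand-rolled scripts.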
SNP Panel Size Guidelines and the "r80" Rule
The number of SNPs directly influences both assignment accuracy and cost. A widely used heuristic is the "r80 rule": choose parameter values (e.g., minimum stack depth, allowed mismatches) that maximize the number of polymorphic loci present in at least 80% of individuals.
Optimal parameter values are species-specific. For example:
- Mackerel: m = 3, M = 4, n = 4
- Scallop: m = 6, M = 1, n = 1
- Green crab: m = 7, M = 2, n = 2
These values illustrate how tuning read-depth and clustering thresholds yields panels with high coverage and low missingness.
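A minimal sketch of the r80 comparison is shown below, assuming you have already tabulated, for each candidate parameter set, per-locus call rates and which loci are polymorphic; the parameter labels, loci, and numbers are hypothetical.

```python
# Minimal sketch of the "r80" comparison: for each candidate parameter set,
# count polymorphic loci genotyped in at least 80% of individuals and keep
# the parameter set that maximizes that count. Data are hypothetical.
def count_r80_loci(call_rates, polymorphic, threshold=0.80):
    """call_rates: dict locus -> fraction of individuals with a genotype call;
    polymorphic: set of loci that are polymorphic in the assembly."""
    return sum(1 for locus, rate in call_rates.items()
               if rate >= threshold and locus in polymorphic)

candidate_runs = {
    "m3_M4_n4": ({"loc1": 0.95, "loc2": 0.70, "loc3": 0.88}, {"loc1", "loc3"}),
    "m6_M1_n1": ({"loc1": 0.90, "loc2": 0.75, "loc3": 0.60}, {"loc1", "loc2"}),
}

scores = {name: count_r80_loci(rates, poly)
          for name, (rates, poly) in candidate_runs.items()}
best = max(scores, key=scores.get)
print(scores, "-> best parameter set:", best)
```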
Recommended SNP Counts for Different Panel Types
Parentage vs PBT panels have different design targets:
| Panel Type | Recommended SNP Count | Marker Selection Focus |
| --- | --- | --- |
| Parent–offspring / half-sib | 100–300 | High MAF, low missingness, low error |
| Mixed-stock PBT | 300–1,000 | Strong population differentiation |
- For parentage and sibship, 100–300 high-quality SNPs are sufficient in most populations. Panels larger than ~300 SNPs rarely improve accuracy unless genetic similarity is extreme.
- For mixed-stock PBT and complex population mixtures, larger panels (300–1000 SNPs) increase assignment power, especially in recently admixed or weakly structured populations.
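To see why a few hundred SNPs are usually enough, the simulation sketch below estimates how often a random, unrelated adult is excluded by at least one Mendelian incompatibility as panel size grows. The allele frequency, panel sizes, and trial counts are illustrative, and the model ignores genotyping error and the burden of screening many candidate parents, both of which push real designs toward the upper end of the ranges above.

```python
# Minimal simulation: how often is a random, unrelated adult excluded as a
# parent (opposite homozygote at >= 1 locus) for different panel sizes?
# Allele frequencies and panel sizes are illustrative, not recommendations.
import numpy as np

rng = np.random.default_rng(1)

def simulate_exclusion(n_snps, n_trials=2000, maf=0.3):
    p = np.full(n_snps, maf)
    excluded = 0
    for _ in range(n_trials):
        offspring = rng.binomial(1, p) + rng.binomial(1, p)   # 0/1/2 genotype
        candidate = rng.binomial(2, p)                        # unrelated adult
        # Mendelian incompatibility: opposite homozygotes at any locus
        incompatible = ((offspring == 0) & (candidate == 2)) | \
                       ((offspring == 2) & (candidate == 0))
        excluded += incompatible.any()
    return excluded / n_trials

for n in (50, 100, 200, 300):
    print(n, "SNPs -> exclusion rate ~", round(simulate_exclusion(n), 3))
# Exclusion rises rapidly with panel size; real panels add margin for
# genotyping error and large pools of candidate parents.
```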
Sample Size and Statistical Power
Sample size strongly affects:
- Precision of allele frequency estimates.
- Power to detect true parent–offspring and sib relationships.
- Stability of population assignment probabilities.
Challenges such as library construction artefacts, uneven read-depth, and batch effects can bias estimates of population parameters. Careful experimental design and QC (see below) are required to avoid over-interpreting patterns driven by technical noise.
Managing Missing Data and Allele Dropout in RAD-seq
Figure 3. Example missing-data patterns and a QC workflow for handling them in RAD-seq parentage and PBT studies.
Patterns of Missing Data: Lessons from Key Studies
Missing data in RAD-seq can arise from:
- Low coverage or failed libraries.
- Allele dropout due to restriction-site polymorphism (RSP).
- Stringent filtering that removes many loci.
Findings from representative studies:
| Study | Findings | Recommendations |
| --- | --- | --- |
| Gautier et al. (2013) | Allele dropout (ADO) frequency depends on mutation rate and effective population size; ADO inflates estimates of genetic variation. | Remove loci with high ADO frequency (e.g., ADO fraction ≈ 0.5). |
| Huang & Knowles (2014) | Very strict missing-data thresholds reduce the mutational spectrum; over-conservative locus selection leads to less accurate phylogenies. | Understand dataset-specific causes of missing data when designing filters. |
| Leaché et al. (2015) | Increasing missing data leads to discordant topologies, longer branches, and lower bootstrap support. | Balance the desire for more SNPs with the need to minimize missingness. |
In practice, researchers often:
- Remove loci with high missingness or obvious allele dropout.
- Examine missing-data patterns by library, lane, or batch to detect systematic problems.
Tip: Review missing-data heatmaps or summary tables early in the analysis to flag problematic loci or samples before final panel selection.
Genotype Likelihoods vs Imputation
Two common strategies help manage missing genotypes:
Genotype likelihoods
- Use the probability of the observed read data to represent uncertainty in genotypes without forcing hard calls.
- Particularly useful for low-coverage datasets.
- Implemented in tools such as ANGSD and PCAngsd, which propagate uncertainty into downstream analyses (e.g., PCA, association tests).
Imputation
- Fills in missing genotypes based on linkage patterns and relatedness in the dataset.
- Works best when marker density is high and a suitable reference or training panel is available.
- Tools such as BEAGLE or IMPUTE2 can substantially increase the number of usable loci, but imputation can introduce bias if the reference is poorly matched.
| Method | Strengths | Limitations |
| --- | --- | --- |
| Genotype likelihoods | Handles low coverage; reduces bias from hard calls | More computationally intensive; requires specialized tools |
| Imputation | Increases usable data; fills genotype gaps | Requires a good reference; may propagate errors |
Tip: Use genotype likelihood frameworks for low-coverage RAD-seq, and reserve imputation for dense marker sets with an appropriate reference panel.
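As a concrete illustration of the genotype-likelihood idea, the sketch below computes likelihoods for the three genotypes at a biallelic site from reference/alternate read counts under a simple binomial error model. The read counts and error rate are made up, and this mirrors the general approach of likelihood-based tools such as ANGSD without reproducing their actual models.

```python
# Minimal sketch of per-site genotype likelihoods at a biallelic SNP, using
# a binomial model of allele counts with a fixed base-error rate.
from math import comb

def genotype_likelihoods(n_ref, n_alt, err=0.01):
    """Return likelihoods of the read data for genotypes RR, RA (het), AA."""
    n = n_ref + n_alt
    # Probability that a single read shows the ALT allele, per genotype
    p_alt = {"RR": err, "RA": 0.5, "AA": 1 - err}
    return {g: comb(n, n_alt) * p**n_alt * (1 - p)**n_ref
            for g, p in p_alt.items()}

# Low coverage: 3 reference reads, 1 alternate read
likes = genotype_likelihoods(3, 1)
total = sum(likes.values())
for g, L in likes.items():
    print(g, "likelihood:", round(L, 4), " normalized:", round(L / total, 3))
```

At 4× coverage the heterozygote is favored but not certain, which is exactly the uncertainty that hard genotype calls would hide.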
Restriction-Site Polymorphism (RSP) Mitigation
RSP occurs when mutations disrupt restriction-enzyme recognition sites, preventing digestion and causing allele dropout at affected loci. This can:
- Bias allele frequencies.
- Reduce assignment accuracy if dropout is structured among populations or families.
Mitigation strategies include:
- In silico digestion of reference genomes to identify and avoid regions with high RSP risk.
- Choosing alternative enzymes with more robust recognition sites or different motifs.
- Including validation loci to monitor dropout rates in each run.
- Empirical testing of a subset of samples to quantify RSP effects and refine marker selection.
Careful enzyme choice and marker validation help maintain high data quality and reduce systematic biases.
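For the in silico screening step, a minimal sketch using Biopython's Restriction module is shown below; the enzymes (SbfI, EcoRI) and the short toy sequence stand in for whatever enzyme and reference genome a real project would use.

```python
# Minimal sketch: locate restriction sites in a reference sequence with
# Biopython, as a starting point for flagging regions where a mutated
# recognition site could cause allele dropout. The sequence is a toy example.
from Bio.Seq import Seq
from Bio.Restriction import SbfI, EcoRI

reference = Seq("AATTCCTGCAGGTTACGGAATTCGGCCTGCAGGTTAA")

for enzyme in (SbfI, EcoRI):
    cut_positions = enzyme.search(reference)   # 1-based cut positions
    print(enzyme, "recognition site:", enzyme.site,
          "| cuts at:", cut_positions)
# Loci whose flanking restriction sites overlap known polymorphisms can then
# be excluded from the candidate SNP panel.
```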
RAD-seq Accuracy and Evaluation Methods
Parentage Inference Workflows
Parentage analysis typically relies on likelihood-based workflows that evaluate genotypes under different relationship hypotheses.
LOD-based testing (CERVUS, COLONY)
- CERVUS calculates log-of-odds (LOD) scores for candidate parent–offspring pairs or trios, assigning parentage when LOD exceeds a confidence threshold.
- COLONY uses similar likelihood models but can jointly infer more complex pedigree structures, including half-sibs and full-sib families.
Both tools work well with RAD-derived SNP panels.
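The sketch below illustrates the LOD principle for a single-parent test at independent biallelic loci: the likelihood of the offspring genotype given that the candidate is a parent is compared with its likelihood under Hardy–Weinberg expectations. It is a deliberately simplified model (no genotyping error, no trio test) with made-up allele frequencies, not the actual CERVUS or COLONY implementation.

```python
# Minimal sketch of a single-parent LOD score over independent biallelic loci:
# LOD = log10[ P(offspring | candidate is a parent) / P(offspring | unrelated) ].
from math import log10

def transmit_prob(parent_gt):
    """Probability the parent transmits the ALT allele (genotypes coded 0/1/2)."""
    return parent_gt / 2.0

def offspring_prob_given_parent(off_gt, par_gt, q):
    """P(offspring genotype | one known parent, other parent from population);
    q = population ALT allele frequency."""
    t = transmit_prob(par_gt)            # ALT from the candidate parent
    probs = {
        0: (1 - t) * (1 - q),
        1: t * (1 - q) + (1 - t) * q,
        2: t * q,
    }
    return probs[off_gt]

def hwe_prob(gt, q):
    return {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}[gt]

def lod(offspring, candidate, alt_freqs):
    score = 0.0
    for off_gt, par_gt, q in zip(offspring, candidate, alt_freqs):
        score += log10(offspring_prob_given_parent(off_gt, par_gt, q)
                       / hwe_prob(off_gt, q))
    return score

# Three loci with ALT allele frequencies 0.3, 0.5, 0.2
alt_freqs = [0.3, 0.5, 0.2]
offspring = [1, 2, 0]
true_parent = [2, 1, 0]
print("LOD vs plausible parent:", round(lod(offspring, true_parent, alt_freqs), 2))
# A candidate with an opposite-homozygote mismatch has probability 0 under
# this error-free model (LOD = -infinity); real tools model genotyping error.
```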
RAD Processing Pipelines for Parentage
RAD data must be assembled and filtered before feeding SNP calls into parentage software. Common pipelines include:
| Software Tool | Speed/Accuracy | Notes |
| --- | --- | --- |
| dDocent | Fastest and most accurate | High assignment success rates; low false exclusion rates |
| ipyrad | Close to dDocent in accuracy | Can over-split loci if parameters are not carefully tuned |
| Stacks | Good accuracy with filtering | May produce some spurious loci; downstream filtering required |
With well-tuned parameters and robust filtering, parentage assignment success can reach >98%, with very few false assignments and low false exclusion rates.
Accuracy Metrics for Parentage Validation
Common metrics include:
- False Assignment Rate (FAR): Proportion of parentage assignments that are incorrect.
- True Positive Rate (TPR): Proportion of true parent–offspring pairs correctly identified.
- Positive Predictive Value (PPV): Probability that an inferred parent–offspring relationship is actually correct.
Different analysis workflows balance sensitivity and specificity differently. For example:
| Method | True Positive Rate | False Discovery Rate |
| --- | --- | --- |
| DESeq | Low | Low |
| baySeq | Low | Low |
| edgeR | Higher | Higher |
| NBPSeq | Higher | Higher |
| ShrinkSeq | Higher | Higher |
| SAMseq | High | Low |
| EBSeq | Stable across sample sizes | Decreases with sample size |
Although originally developed for other omics analyses, these comparisons illustrate general trade-offs: some methods favor conservative calls, others favor sensitivity.
Regularly checking FAR, TPR, and PPV—often using simulated datasets or pedigrees with known parentage—helps validate that the chosen pipeline is appropriate for the study.
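A minimal sketch of that bookkeeping is shown below, comparing inferred parent–offspring pairs against a known (e.g., simulated) truth set; the identifiers are placeholders.

```python
# Minimal sketch: compare inferred parent-offspring assignments with a known
# truth set (e.g., a simulated pedigree) and report FAR, TPR, and PPV.
def parentage_metrics(true_pairs, inferred_pairs):
    true_pairs, inferred_pairs = set(true_pairs), set(inferred_pairs)
    tp = len(true_pairs & inferred_pairs)          # correct assignments
    fp = len(inferred_pairs - true_pairs)          # false assignments
    return {
        "FAR": fp / len(inferred_pairs) if inferred_pairs else 0.0,
        "TPR": tp / len(true_pairs) if true_pairs else 0.0,
        "PPV": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Toy truth vs. inferred assignments, as (offspring_id, parent_id) pairs
truth = [("o1", "p1"), ("o2", "p2"), ("o3", "p3")]
inferred = [("o1", "p1"), ("o2", "p2"), ("o3", "p9")]   # one wrong parent
print(parentage_metrics(truth, inferred))
# -> FAR 0.33, TPR 0.67, PPV 0.67 on this toy example
```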
Population Assignment Tools
Population assignment uses genotype data to infer the likely origin of individuals. Popular tools for SNP panels derived from RAD-seq include:
- assignPOP: Machine-learning–based classification of individuals to populations.
- SNPweights: Uses allele frequency–based weights to estimate assignment probabilities.
- Naive Bayes / Linear Discriminant Analysis (LDA): Statistical frameworks that classify individuals based on multilocus genotype patterns.
Accurate assignment depends heavily on clean assembly and filtering (e.g., Stacks parameter tuning, MAF thresholds). Even small changes in thresholds can shift patterns of population differentiation.
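As a toy illustration of assignment with a generic classifier, the sketch below trains scikit-learn's LDA on simulated two-population genotypes and assigns a new individual. Real studies would use filtered RAD-derived panels and dedicated tools such as assignPOP, so the frequencies, sample sizes, and labels here are assumptions for the example only.

```python
# Minimal sketch: assign individuals to populations from SNP genotypes with
# Linear Discriminant Analysis on a synthetic two-population dataset.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Simulate two populations with different ALT allele frequencies at 100 SNPs
n_per_pop, n_snps = 30, 100
freq_a = rng.uniform(0.1, 0.4, n_snps)
freq_b = rng.uniform(0.6, 0.9, n_snps)
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))

X = np.vstack([pop_a, pop_b])
y = np.array(["A"] * n_per_pop + ["B"] * n_per_pop)

clf = LinearDiscriminantAnalysis().fit(X, y)
new_ind = rng.binomial(2, freq_b)            # individual drawn from population B
print("Assigned population:", clf.predict(new_ind.reshape(1, -1))[0])
print("Posterior probabilities:", clf.predict_proba(new_ind.reshape(1, -1)).round(3))
```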
Cross-Validation and Holdout Strategies
Cross-validation helps estimate assignment accuracy without overfitting:
- Data are split into training and validation sets, commonly using fivefold cross-validation.
- Models are trained on the majority of individuals and tested on held-out samples.
- Repeating the split several times reduces the influence of random sampling.
Example cross-validation setup:
| Methodology | Description |
| --- | --- |
| Cross-validation type | Fivefold cross-validation |
| Training set size | 972 individuals |
| Validation set size | 242 individuals |
| Repetitions | 10 repeats to reduce random sampling effects |
| Accuracy estimation | Correlation-based metric, r = cor(EBV, y) / h, i.e., the correlation between estimated breeding values and phenotypes divided by the square root of heritability |
Robust cross-validation ensures that reported assignment accuracy reflects performance on genuinely new samples.
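A minimal sketch of that design, using repeated stratified fivefold cross-validation in scikit-learn on synthetic genotypes, is shown below; the data, classifier choice, and fold/repeat counts are illustrative stand-ins for a real genotype matrix and population labels.

```python
# Minimal sketch: estimate assignment accuracy with repeated fivefold
# cross-validation (5 folds x 10 repeats), mirroring the setup above.
# Data are synthetic; substitute a real genotype matrix and labels.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
n_per_pop, n_snps = 50, 200
freqs = [rng.uniform(0.1, 0.4, n_snps), rng.uniform(0.5, 0.8, n_snps)]
X = np.vstack([rng.binomial(2, f, size=(n_per_pop, n_snps)) for f in freqs])
y = np.repeat(["A", "B"], n_per_pop)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv)
print(f"Mean assignment accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```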
Reporting and Traceability
Transparent reporting is essential for reproducibility:
- Archive raw FASTQ files and final genotype tables.
- Record all software tools, version numbers, and key parameters.
- Document filtering criteria and thresholds (e.g., MAF, missingness, LD pruning).
- Provide QC metrics (call rates, allele balance, error rates in replicates).
- Include a clear statement that analyses are for research use only (RUO) and summarize major limitations.
Projects that follow these practices make it easier for others to reproduce and build on their results.
Practical RAD-seq Workflow and Quality Control
Sampling and Preservation
- Collect appropriate tissues (e.g., fin clips, blood, feathers, small tissue biopsies, museum samples).
- Preserve samples promptly—commonly in 95% ethanol or with silica gel—and store at –20°C to –80°C when possible.
- Use unique, unambiguous identifiers on all tubes and plates to maintain traceability.
Tip: Rapid preservation after sampling minimizes DNA degradation and improves downstream success.
Library Preparation and Multiplexing
Typical steps:
- DNA extraction and QC
- Use fluorometric quantification and gel electrophoresis to confirm DNA integrity and purity.
- RAD library construction
- Digest DNA with chosen restriction enzyme(s).
- Ligate adapters containing unique barcodes per sample.
- Perform size selection and PCR amplification.
- Pooling and sequencing
- Pool barcoded libraries and sequence on a suitable platform (e.g., Illumina).
- Demultiplex reads based on barcodes to recover sample-level FASTQ files.
An example study sequenced 96 individuals across four regions, achieving high alignment rates to a reference genome and millions of reads per individual.
A simple process summary:
| Step | Purpose | Outcome |
| --- | --- | --- |
| DNA Extraction | Obtain pure DNA | High-quality DNA |
| Library Prep | Create RAD-seq libraries | Barcoded DNA fragments |
| Multiplexing | Pool samples | Efficient sequencing |
| Sequencing | Generate reads | Millions of sequences per sample |
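Following sequencing, the demultiplexing step listed above splits pooled reads by barcode. The sketch below shows the idea for inline barcodes at the start of each read; the barcodes, sample names, and file name are placeholders, and production pipelines normally rely on dedicated tools (e.g., process_radtags in Stacks) rather than custom scripts.

```python
# Minimal sketch: split gzipped FASTQ reads into per-sample files by an
# inline barcode at the start of each read. Barcodes and names are placeholders.
import gzip

barcodes = {"ACGTAC": "sample_01", "TGCATG": "sample_02"}   # barcode -> sample
bc_len = 6

def demultiplex(fastq_gz):
    handles = {s: open(f"{s}.fastq", "w") for s in barcodes.values()}
    with gzip.open(fastq_gz, "rt") as fh:
        while True:
            record = [fh.readline() for _ in range(4)]       # FASTQ record = 4 lines
            if not record[0]:
                break
            sample = barcodes.get(record[1][:bc_len])
            if sample:                                        # skip unknown barcodes
                # Trim the barcode from the sequence and quality lines
                record[1] = record[1][bc_len:]
                record[3] = record[3][bc_len:]
                handles[sample].writelines(record)
    for h in handles.values():
        h.close()

# demultiplex("pooled_reads.fastq.gz")   # hypothetical pooled input file
```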
QC Gates and Stop Criteria
Quality control checkpoints typically include:
- Pre-library DNA QC: Minimum concentration and purity thresholds.
- Post-sequencing QC: Read counts and alignment rates (e.g., alignment ≥ ~90% indicates strong library performance).
- Per-sample filters: Remove samples with low read depth or poor alignment (e.g., mean depth < 10×).
In one example, 85 out of 96 individuals passed QC and were genotyped at >56,000 SNPs.
Additional QC measures:
- Monitor contamination and barcode misassignment using replicates and negative controls.
- Track allele balance and call rates to detect systemic issues.
- If an entire batch fails QC, halt and troubleshoot rather than pushing flawed data downstream.
A clear QC pipeline builds confidence in parentage and population assignment results.
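A minimal sketch of automating those gates is shown below, flagging samples that fall under an assumed ~90% alignment rate or 10× mean depth; the metric values are hypothetical and would normally be parsed from alignment and coverage reports.

```python
# Minimal sketch: apply per-sample QC gates (alignment rate >= 90%, mean
# depth >= 10x) to a summary table and report which samples pass.
ALIGN_MIN = 0.90
DEPTH_MIN = 10.0

sample_metrics = {
    "fish_001": {"alignment_rate": 0.94, "mean_depth": 22.5},
    "fish_002": {"alignment_rate": 0.97, "mean_depth": 8.4},   # fails depth
    "fish_003": {"alignment_rate": 0.71, "mean_depth": 15.0},  # fails alignment
}

def qc_gate(metrics):
    return (metrics["alignment_rate"] >= ALIGN_MIN
            and metrics["mean_depth"] >= DEPTH_MIN)

passed = [s for s, m in sample_metrics.items() if qc_gate(m)]
failed = sorted(set(sample_metrics) - set(passed))
print("Passed QC:", passed)
print("Failed QC (exclude or troubleshoot):", failed)
```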
Migrating RAD-seq SNPs to GT-seq for High-Throughput Genotyping
Many groups use RAD-seq for initial SNP discovery and then shift to GT-seq for routine, high-throughput genotyping. This approach combines the flexibility of RAD-seq with the scalability of targeted sequencing.
Figure 4. Five-step workflow for migrating SNPs discovered by RAD-seq into a GT-seq production panel.
Stepwise Migration Workflow
Step 1: Discovery and SNP Selection
Use RAD-seq to identify candidate SNPs, then select markers with:
- High MAF
- Low missingness and low error rates
- Strong differentiation among target populations or families
Validate that selected SNPs perform consistently across populations.
Step 2: Primer Design
- Design primers for each SNP, avoiding regions with high RSP risk.
- Check for specificity, GC content, and potential secondary structures using standard primer-design tools.
Step 3: Panel Optimization and Validation
- Test primers on a subset of samples.
- Adjust primer concentrations and multiplex combinations to ensure robust amplification.
- Evaluate call rates, allele balance, and reproducibility across test runs.
Step 4: Concordance and Batch Testing
- Compare GT-seq genotypes with original RAD-seq calls to confirm high concordance.
- Run multiple batches to detect batch-specific artefacts or drift.
Step 5: Production Genotyping
- Deploy the validated panel for large-scale genotyping (hundreds to thousands of samples per run).
- Use the panel for long-term monitoring, breeding program tracking, or routine PBT.
Summary table:
| Step | Purpose | Outcome |
| --- | --- | --- |
| SNP Selection | Identify robust, informative markers | High-quality SNP list |
| Primer Design | Ensure specificity and efficiency | Reliable GT-seq primers |
| Panel Optimization | Improve multiplex performance | Efficient, balanced GT-seq panel |
| Concordance Testing | Confirm accuracy and reproducibility | Validated panel |
| Production Use | Enable high-throughput genotyping | Consistent, scalable results |
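Returning to the concordance check in Step 4, the sketch below computes simple per-genotype agreement between RAD-seq and GT-seq calls at shared loci for the same individuals; the genotype tables are toy data, and real comparisons would also track per-locus discordance to flag problem markers.

```python
# Minimal sketch: genotype concordance between RAD-seq discovery calls and
# GT-seq panel calls for the same individuals (toy data; missing calls coded
# as None are ignored in the comparison).
def concordance(rad_calls, gtseq_calls):
    """Both inputs: dict (individual, locus) -> genotype (0/1/2 or None)."""
    shared = [k for k in rad_calls
              if k in gtseq_calls
              and rad_calls[k] is not None and gtseq_calls[k] is not None]
    agree = sum(rad_calls[k] == gtseq_calls[k] for k in shared)
    return agree / len(shared) if shared else float("nan")

rad = {("ind1", "snp1"): 0, ("ind1", "snp2"): 1, ("ind2", "snp1"): 2,
       ("ind2", "snp2"): None}
gt = {("ind1", "snp1"): 0, ("ind1", "snp2"): 2, ("ind2", "snp1"): 2,
      ("ind2", "snp2"): 1}

print(f"Genotype concordance: {concordance(rad, gt):.2f}")   # 2 of 3 shared calls agree
```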
Advantages of Migrating to GT-seq
- Cost efficiency: Lower per-sample cost once the panel is finalized.
- Consistency: Stable marker set supports temporal comparisons and across-project integration.
- Traceability: Panel designs, validation data, and batch metrics can be archived for long-term use.
Empirical studies show that:
- Panels with ~118–192 SNPs can achieve perfect parentage or population assignments in some systems (e.g., black soldier fly).
- Panels with ~700 SNPs have been used to minimize false positives in cattle and sheep.
Exact panel sizes depend on genetic diversity, structure, and study goals.
Getting Started and Frequently Asked Questions (FAQ)
Practical Getting-Started Checklist
To launch a RAD-seq–based parentage or PBT project:
- Define goals and sampling design
- Parentage vs PBT, number of populations, target power, and acceptable error rates.
- Set up analysis pipeline
- Create project directory, version-control your scripts, and document parameter choices.
- Process sequencing data
- Demultiplex RAD-seq reads.
- Align to a reference genome (if available) or assemble de novo.
- Build the loci catalogue and call SNPs.
- Filter and explore data
- Apply MAF, LD, missingness, and error filters.
- Visualize population structure and missing-data patterns.
- Plan downstream usage
- Determine whether to keep using RAD-seq or migrate to a targeted panel (e.g., GT-seq).
Recommended lab-side basics:
| Step | Description |
| --- | --- |
| DNA Quality Check | Use agarose electrophoresis and spectro/fluorometry to verify integrity and purity. |
| Sample Preparation | Prepare ≥1 µg DNA per sample at ~25–200 ng/µL. |
| Sample Size | Include at least ~100 offspring or individuals per focal group when possible. |
FAQ
Q1. How many SNPs does a typical RAD-seq panel need for parentage analysis?
Most studies use 100–300 high-quality SNPs for parentage and sibship analysis. This range usually provides strong assignment power. Markers should have high MAF, low missingness, and low error rates. In extremely homogeneous populations, larger panels may be needed.
Q2. Can RAD-seq work with low-quality or degraded DNA samples?
Yes. RAD-seq protocols can accommodate low-input or moderately degraded DNA (e.g., fin clips, historical material). However, good preservation, careful extraction, and rigorous QC are essential to avoid bias from differential dropout.
Q3. What is the main advantage of RAD-seq for non-model species?
RAD-seq does not require a reference genome. Researchers can discover and genotype thousands of SNPs in species with little or no genomic information, then later build reference resources or targeted panels based on those discoveries.
Q4. How does RAD-seq compare to microsatellites for population assignment?
RAD-seq typically provides thousands of biallelic SNPs, which improves resolution and assignment accuracy, especially in weakly structured or highly related populations. Microsatellites can still be useful in small-scale projects, but RAD-seq scales better and often reduces manual labour.
Q5. What tools help manage missing data in RAD-seq studies?
Tools such as ANGSD (for genotype likelihoods), PCAngsd, and common imputation packages (e.g., BEAGLE) are widely used. They help handle low coverage, missing genotypes, and uncertainty while preserving as many informative loci as possible.
Q6. How do researchers ensure data quality in RAD-seq projects?
They set QC gates at each step: DNA quality, library metrics, read counts, alignment rates, and per-locus call rates. Replicate samples and batch testing help detect contamination, index hopping, and batch effects before final analyses.
Q7. Can RAD-seq SNPs be used in GT-seq panels for long-term monitoring?
Yes. A common approach is to use RAD-seq once for discovery, then migrate a subset of robust SNPs into a GT-seq panel. This provides cost-effective, high-throughput genotyping for ongoing parentage and PBT studies.
References
- Baird, N.A., Etter, P.D., Atwood, T.S. et al. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE 3, e3376 (2008).
- Peterson, B.K., Weber, J.N., Kay, E.H., Fisher, H.S., Hoekstra, H.E. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE 7, e37135 (2012).
- Gautier, M., Gharbi, K., Cezard, T. et al. The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Molecular Ecology 22, 3165–3178 (2013).
- Anderson, E.C., Garza, J.C. The power of single-nucleotide polymorphisms for large-scale parentage inference. Genetics 172, 2567–2582 (2006).
- Díaz-Arce, N., Rodríguez-Ezpeleta, N. Selecting RAD-Seq data analysis parameters for population genetics: the more the better? Frontiers in Genetics 10, 533 (2019).
- Campbell, N.R., Harmon, S.A., Narum, S.R. Genotyping-in-Thousands by sequencing (GT-seq): a cost effective SNP genotyping method based on custom amplicon sequencing. Molecular Ecology Resources 15, 855–867 (2015).
- Steele, C.A., Hess, M.A., Narum, S.R., Campbell, M.R. Parentage-based tagging: reviewing the implementation of a new tool for an old problem. Fisheries 44, 412–422 (2019).
- Beacham, T.D., Wallace, C., MacConnachie, C. et al. Population and individual identification of Coho Salmon in British Columbia through parentage-based tagging and genetic stock identification: an alternative to coded-wire tags. Canadian Journal of Fisheries and Aquatic Sciences 74, 1391–1410 (2017).