Quality Control for Reduced Representation Sequencing: From Raw Reads to Reviewer-Proof SNP Sets

Reduced representation sequencing QC gets hard when unstable loci, high missingness, and batch-shaped patterns creep in. If you’ve shipped RAD or ddRAD data, you know how fast a promising dataset can unravel under reviewer scrutiny. This guide gives you a practical playbook to stabilize your SNPs: a TL;DR checklist, stage-by-stage QC, a large troubleshooting table, a methods template, and validation tips. Throughout, reduced representation sequencing QC, RAD-seq quality control, and the goal of a reviewer-proof SNP set are treated as one defensible workflow.
What QC Means in Reduced Representation Sequencing
Reduced representation sequencing QC is the process of turning restriction-based reads into a consistent, publishable SNP set using defensible filters and reproducible records.
RRS QC is not WGS QC. In RRS, locus consistency and missingness mechanisms matter more than average depth. Restriction sites define your genomic windows, and any disturbance—enzyme choice, trimming policy, barcode issues—propagates into shared-locus rates and callability. ddRAD replaced random shearing with dual enzyme digestion to stabilize fragment size selection and reduce locus dropout variability, as described in the original ddRAD method paper (Peterson et al., 2012). Your goal is simple: a stable locus catalog across samples and batches, with filtering choices that are transparent, justified, and shown not to manufacture structure.
- Success looks like this: most loci are shared across samples, filters are interpretable, and downstream PCA/ADMIXTURE patterns remain stable under modest threshold shifts.
- Failure shows up as batch-shaped missingness, paralog-rich loci with inflated heterozygosity, and thresholds that can’t be defended or reproduced.
Mini-table — RRS QC focus vs WGS QC focus:
| Aspect | RRS QC focus | WGS QC focus |
| --- | --- | --- |
| Locus definition | Restriction-site windows, clustering | Genome-wide alignment |
| Primary risk | Locus dropout, nonrandom missingness | Mapping bias, coverage variability |
| Key gates | Shared loci fraction, per-sample missingness | Mean/median depth, coverage uniformity |
| Validation | PCA by batch/pop, shared loci metrics | Variant call recall/precision, depth histograms |
Figure 2. RRS QC is a chain: early read issues can become locus dropout and biased SNP sets.
The Reviewer’s Lens: What Makes a SNP Set Reviewer-Proof
A reviewer-proof SNP set is one where missingness, locus consistency, and filters are transparent, justified, and shown not to create artifacts.
Common reviewer questions:
- Are loci comparable across samples or does dropout dominate?
- Is missingness structured by batch, library, or population group?
- Are thresholds justified and reproducible?
- Are validation plots included (PCA, missingness distributions, replicate checks)?
One practical way to preempt concerns is to include batch-aware diagnostics and write up your checks. For examples of cohort-scale metrics, see these batch-aware QC metrics and dashboards in the article on batch-aware QC metrics. To understand how field design and sampling can interact with batch, review this explainer on sampling bias and batch effects in population genomics.
TL;DR: The 12-Step RRS QC Checklist (FASTQ → SNP Set)
This checklist summarizes the minimum QC decisions needed to go from raw reads to a defensible SNP dataset.
Figure 3. A quick QC checklist to move from FASTQ to a defensible SNP dataset.
- Demultiplex integrity
- Check: Barcode/cutsite validity; index pairs; reads per sample distribution.
- Red flag: Barcode leakage, index hopping signatures; extreme low-read outliers (<10–20% of median).
- Action: Re-run demultiplex with stricter barcode rescue; inspect index reads; consider unique dual indices; hold downstream until outliers explained.
- Raw read quality
- Check: MultiQC summary; per-base quality; adapter content; sequence duplication levels.
- Red flag: High adapter content; quality drop at read tails; duplication >50% across many samples.
- Action: Adjust trimming policy; review PCR cycles; if duplication is batch-wide, pause for library review.
- Trimming and read-length policy
- Check: Adapter/quality trimming parameters; final read-length distribution across batches.
- Red flag: Aggressive trimming creating inconsistent lengths across batches; read-through.
- Action: Harmonize length policy; trim minimally to maintain shared loci; document settings.
- Duplicate handling
- Check: PCR duplicate rate per sample; over-represented sequences.
- Red flag: High duplicates in specific plates/batches; over-represented fragments.
- Action: Mark/remove duplicates cautiously; review PCR cycles and cleanups; consider rebalancing libraries.
- Mapping or clustering strategy (Scenario B: reference mapping; Scenario C: de novo clustering)
- Check: Scenario B: alignment rate, MAPQ; Scenario C: clustering threshold, min depth per locus.
- Red flag: Scenario B: strong reference bias, many low-MAPQ reads; Scenario C: oversplitting at high similarity thresholds.
- Action: Scenario B: tune mismatch/MAPQ cutoffs and assess bias; Scenario C: relax clustering slightly (e.g., 0.90–0.94) and re-evaluate shared loci.
- Shared loci fraction
- Check: Fraction of loci present in ≥70–80% of samples before hard filters (a computation sketch follows this checklist).
- Red flag: <50% shared loci or strong batch differences.
- Action: Revisit trimming policy, clustering/alignment parameters; investigate catalog drift; consider pilot consensus parameters.
- Paralog signals
- Check: Excess heterozygosity; allele balance distortions; unusually high per-SNP depth.
- Red flag: Clusters of high-het SNPs with skewed allele ratios.
- Action: Apply paralog filters (heterozygosity excess, depth caps); consider paired-end assembly; remove suspect loci.
- SNP calling sanity
- Check: Call rate; transition/transversion ratios; site-level QUAL/GQ distributions.
- Red flag: Low call rate across many samples; QUAL inflated at shallow depth.
- Action: Adjust caller thresholds; ensure min depth per genotype; re-assess per-sample QC.
- Missingness filters
- Check: Per-sample and per-SNP missingness; PCA colored by batch before hard cuts.
- Red flag: PCA dominated by batch; missingness clustered by library.
- Action: Diagnose demultiplexing and length policy; harmonize parameters; only then set moderate missingness thresholds. Nonrandom missingness can bias PCA, pulling samples toward the origin as shown in a Methods in Ecology and Evolution study (2021).
- MAF policy
- Check: Minor allele count/frequency distributions; presence of singletons.
- Red flag: Many singletons driving clusters; too-strict MAF removing informative variants.
- Action: Exclude singletons for structure; test 0.02–0.05 MAF ranges in sensitivity analyses. MAF choices strongly affect structure inference per Linck & Battey 2019.
- HWE checks
- Check: HWE tests per population; summarize removal schemes.
- Red flag: Global HWE filtering removing many loci; excess removal in structured groups.
- Action: Use per-population tests; choose conservative schemes (e.g., Out Within); control for multiple testing. Global HWE cuts can distort structure as shown in a 2022 assessment of HWE filtering schemes.
- LD policy and final sanity
- Check: LD pruning settings for PCA/ADMIXTURE; replicate concordance if available.
- Red flag: Strong LD blocks within RAD windows biasing PCs; poor replicate concordance.
- Action: Prune LD (e.g., windowed pruning at r^2 ≈ 0.2); confirm PCs reflect biology; quantify replicate error; finalize exports.
Stop/go gates: If steps 1–3 reveal batch-wide issues, stop and resolve before locus building. If shared loci fraction stays <50% after harmonization, pause and pilot parameter grids before proceeding.
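To make the shared-loci and missingness gates concrete, here is a minimal sketch, assuming a genotype matrix already loaded as a NumPy array with -1 coding missing calls; the `geno` array and its simulated values are placeholders, not output from any real pipeline.

```python
import numpy as np

# Hypothetical input: genotype matrix (n_variants x n_samples),
# coded as 0/1/2 alt-allele dosages with -1 for missing calls.
rng = np.random.default_rng(0)
geno = rng.choice([-1, 0, 1, 2], size=(5000, 96), p=[0.15, 0.50, 0.25, 0.10])

present = geno >= 0  # called genotypes

# Gate (step 6): fraction of loci present in >=70% of samples.
locus_call_rate = present.mean(axis=1)
shared_fraction = (locus_call_rate >= 0.70).mean()

# Gate (step 9): per-sample missingness before hard filters.
sample_missingness = 1.0 - present.mean(axis=0)

print(f"shared-loci fraction (called in >=70% of samples): {shared_fraction:.2f}")
print(f"worst per-sample missingness: {sample_missingness.max():.2f}")

# Stop/go: pause locus building if fewer than half the loci pass the gate.
if shared_fraction < 0.50:
    print("STOP: harmonize trimming/clustering before proceeding")
```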
Stage-by-Stage QC: What to Check and Why It Breaks
Stage-by-stage QC links each processing step to the artifacts it can introduce in RRS SNP sets.
Raw reads & demultiplexing (signals that predict dropout)
Focus on signals that matter for RRS. MultiQC gives a quick map: per-base quality trends, adapter content, and duplication. In RAD/ddRAD, uneven reads per sample are predictive. Extreme low-read outliers often translate into high missingness later. Look for barcode/cutsite validation and index hopping signatures in demultiplex logs. Tools like Stacks’ process_radtags validate barcodes and cut sites and can rescue or discard reads as needed (process_radtags documentation).
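A quick way to act on the reads-per-sample signal is to flag outliers against the cohort median. A minimal sketch, assuming per-sample counts have already been parsed from the demultiplex logs into a table; the column names and the 15% cutoff are illustrative assumptions, not tool output.

```python
import pandas as pd

# Hypothetical demultiplex summary: one row per sample with read counts,
# e.g. parsed from process_radtags logs (column names are assumptions).
counts = pd.DataFrame({
    "sample": [f"S{i:02d}" for i in range(1, 7)],
    "batch":  ["A", "A", "A", "B", "B", "B"],
    "reads":  [1_200_000, 950_000, 80_000, 1_100_000, 1_050_000, 60_000],
})

median_reads = counts["reads"].median()
# Flag extreme low-read outliers (<10-20% of the cohort median).
counts["low_outlier"] = counts["reads"] < 0.15 * median_reads

print(counts)
print(counts.groupby("batch")["reads"].median())  # batch-shaped imbalance?
```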
Demultiplexing pitfalls to watch:
- Barcode leakage and index hopping suggest cross-sample contamination.
- Uneven allocation across samples predicts batch-shaped missingness.
- Low index read quality can exacerbate misassignment. Illumina’s index misassignment white paper (2017) explains UDI best practices.
Why these matter: read-level problems propagate into locus dropout, nonrandom missingness, and spurious PCA separation.
Figure 4. Uneven reads per sample and contamination signals often predict downstream missingness.
Trimming, read length policy, and duplicates (avoid over-cleaning)
Adapter trimming is good; over-trimming is not. Overly aggressive policies can shorten reads unevenly across batches, eroding shared-locus overlap. Keep a consistent length policy across runs. Track duplication as a proxy for PCR bias; high duplication inflates apparent depth without increasing information. Mark/remove duplicates selectively and review library steps if duplication clusters by batch.
A note on duplicates: RRS libraries often lack molecular barcodes, so duplicate decisions rely on fragment identity. Treat duplicates as a QC signal first. If duplication is batch-wide, investigate library prep before filtering individuals.
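One low-cost way to apply this advice is to compare duplication rates by batch before touching any individual sample. A sketch assuming a per-sample QC table, perhaps assembled from MultiQC output; the field names and the 20-percentage-point threshold are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-sample QC table (field names are assumptions).
qc = pd.DataFrame({
    "sample":   ["S01", "S02", "S03", "S04", "S05", "S06"],
    "batch":    ["plate1", "plate1", "plate1", "plate2", "plate2", "plate2"],
    "dup_rate": [0.22, 0.25, 0.24, 0.61, 0.58, 0.64],
})

cohort_median = qc["dup_rate"].median()
by_batch = qc.groupby("batch")["dup_rate"].median()

# Batch-wide duplication (a whole plate far above the cohort median)
# points to library prep, not to individual samples.
suspect_batches = by_batch[by_batch > cohort_median + 0.20]
print(suspect_batches)
```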
Locus building (reference vs de novo, without turning into a tutorial)
Define locus in plain terms: in RRS, a locus is the cluster of reads derived from the same restriction-site fragment across samples.
Two conceptual paths:
- Reference mapping (scenario B): Reads align to a heterologous or sketch reference. Key parameters include mismatch tolerance, MAPQ thresholds, and handling multi-mappers. Risks include mapping bias toward the reference allele and loss of divergent loci.
- De novo clustering (scenario C): Reads are clustered without a reference. Critical parameters include clustering similarity (e.g., 0.90–0.94 typical), within- vs across-sample clustering order, and minimum depth per locus. ipyrad’s guidelines emphasize avoiding excessive similarity thresholds to prevent oversplitting orthologs (ipyrad assembly guidelines).
Parameter classes to record:
- Clustering/stringency: Stacks (m, M, n) or ipyrad similarity; across-sample merge settings. Stacks v2’s paired-end assembly improves locus contigs and genotyping resolution (Stacks 2 methods, 2019).
- Minimum depth per locus and per-genotype.
- Mismatch allowances during alignment/assembly.
- Paralog screens: heterozygosity excess, allele balance, max depth per locus.
Paralog signals matter because they inflate heterozygosity and distort allele balance; retain them and your structure inferences will skew.
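The three paralog screens named above translate directly into per-SNP flags. A minimal sketch using simulated per-SNP summaries; the 0.60 heterozygosity cutoff, the 0.25/0.75 allele-balance bounds, and the 2x depth cap are illustrative values, not universal recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-SNP summaries (all names and values are assumptions):
# het_rate: fraction of heterozygous genotypes per SNP
# ab:       mean alt-read fraction in heterozygotes (allele balance)
# depth:    mean per-SNP depth
n_snps = 10_000
het_rate = rng.beta(2, 6, n_snps)
ab = rng.normal(0.5, 0.08, n_snps)
depth = rng.lognormal(3.0, 0.5, n_snps)

# Screen 1: heterozygosity excess (collapsed paralogs look "too het").
het_flag = het_rate > 0.60

# Screen 2: allele balance far from 0.5 in heterozygotes.
ab_flag = (ab < 0.25) | (ab > 0.75)

# Screen 3: depth cap; far above the cohort mean suggests copy collapse.
depth_flag = depth > 2.0 * depth.mean()

paralog_suspect = het_flag | ab_flag | depth_flag
print(f"flagged {paralog_suspect.sum()} / {n_snps} SNPs as paralog-suspect")
```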
Read next for context, not a deep dive: a pipeline overview comparing Stacks 2/ipyrad/dDocent in this ddRAD pipeline overview, and for planning parameters up front, see this ddRAD project design guide.
What to record for scalable, audit-ready RRS projects
For cohort-scale projects, teams standardize inputs (enzyme pair, size window, index scheme), logs (demultiplex settings, trimming policy, clustering/alignment parameters, min depth, paralog screen), and outputs (VCF/PLINK exports, MultiQC bundle, parameter manifest with software versions). This supports re-runs, cross-batch comparisons, and reviewer auditability.
Calling and Filtering SNPs You Can Defend
Defensible filtering reduces error while minimizing bias that can look like real population structure.
Depth and genotype quality protect against stochastic calls and low-quality genotypes. Aim for a per-genotype minimum depth and a reasonable genotype quality threshold. Coverage math belongs in project planning rather than post-sequencing QC; if you need it, use a dedicated coverage and multiplexing planning resource.
Missingness needs context. Before setting hard thresholds, visualize per-sample and per-SNP missingness and run a PCA colored by batch. If batch correlates with major PCs, fix the cause first; otherwise, filters risk baking artifacts into your SNP set.
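Here is a minimal sketch of the "PCA colored by batch" diagnostic, assuming a dosage matrix with -1 for missing calls and a known batch label per sample; mean imputation and plain SVD stand in for whatever PCA implementation your pipeline uses, and the simulated data are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Hypothetical inputs: dosage matrix (n_samples x n_snps, -1 = missing)
# and a per-sample batch label vector.
geno = rng.choice([-1, 0, 1, 2], size=(96, 2000), p=[0.1, 0.5, 0.3, 0.1])
batch = np.repeat(["A", "B"], 48)

# Mean-impute missing calls per SNP, then center (simple PCA prep).
X = geno.astype(float)
X[X < 0] = np.nan
col_means = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_means, X) - col_means

# PCA via SVD; PCs are left singular vectors scaled by singular values.
U, S, _ = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * S[:2]

for b in np.unique(batch):
    m = batch == b
    plt.scatter(pcs[m, 0], pcs[m, 1], label=f"batch {b}", alpha=0.6)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.title("PCA by batch: separation here means fix QC first")
plt.savefig("pca_by_batch.png", dpi=150)
```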
MAF policy depends on your goal. For structure, excluding singletons improves stability; overly tight cutoffs can remove informative variants. Test a small range and report sensitivity.
HWE filtering should be per population in structured datasets. Global HWE cuts can remove biologically meaningful loci; use conservative schemes and control false positives.
LD policies matter because RAD windows often contain tightly linked SNPs. For PCA/ADMIXTURE, prune LD (for example, r^2 around 0.2) to avoid over-weighting local haplotypes.
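For intuition, here is a toy greedy pruner over a SNP-index window; real pipelines typically use PLINK's --indep-pairwise or scikit-allel's locate_unlinked, which work over genomic windows rather than the simple index window assumed here.

```python
import numpy as np

def prune_ld(dosage, window=50, r2_max=0.2):
    """Greedy LD pruning sketch: keep a SNP only if its r^2 with every
    already-kept SNP in the trailing window stays below r2_max.
    `dosage` is (n_snps x n_samples), complete (impute missing first)."""
    keep = []
    for i in range(dosage.shape[0]):
        ok = True
        for j in keep[::-1]:
            if i - j > window:  # only compare within the window
                break
            r = np.corrcoef(dosage[i], dosage[j])[0, 1]
            if r * r >= r2_max:
                ok = False
                break
        if ok:
            keep.append(i)
    return np.array(keep)

# Hypothetical complete dosage matrix (no missing calls).
rng = np.random.default_rng(3)
dosage = rng.integers(0, 3, size=(1000, 96)).astype(float)
kept = prune_ld(dosage, window=50, r2_max=0.2)
print(f"kept {kept.size} of {dosage.shape[0]} SNPs after LD pruning")
```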
Goal-based filtering mini-table:
| Downstream goal | Filtering mindset | Main risk if over-filtered |
| --- | --- | --- |
| Population structure | Exclude singletons; moderate MAF; LD-pruned SNPs; per-pop HWE tests | Bias to common alleles; lost local signal |
| Relatedness/parentage | Higher depth/GQ; stricter per-sample missingness; duplicates reviewed | Marker count drop; ascertainment bias |
| Selection scans | Balanced missingness; minimal LD blocks; conservative HWE context | False positives from batch-shaped missingness |
When you discuss depth or multiplex planning with collaborators, consider pointing them to a dedicated resource on coverage and multiplexing planning so this article stays focused on post-sequencing QC.
Troubleshooting Playbook: Symptom → Likely Cause → Fix (RRS-Specific)
Troubleshooting turns common RRS failure patterns into actionable fixes that reduce rework.
Figure 5. A fast path from symptom to likely cause and the next best fix.
| Symptom | Likely cause | Quick check (5–10 min) | Fix (lowest-cost first) | When to re-run |
| --- | --- | --- | --- | --- |
| Many samples have very low read counts | Uneven pooling, demultiplexing loss | Plot reads per sample; check barcode match rate | Re-check barcode list and mismatch settings; confirm sample sheet; resequence low-count samples if needed | Re-run demultiplex |
| One batch has systematically lower reads | Lane/library imbalance | Compare reads per sample grouped by batch/lane | Confirm pooling assumptions; flag batch in downstream diagnostics; resequence if required | Re-run demultiplex |
| High adapter/primer contamination | Short inserts, incomplete trimming | Check adapter content signal; review overrepresented sequences | Strengthen adapter trimming; verify insert-size distribution | Re-run trimming |
| Quality drops early in reads | Run quality issue or chemistry mismatch | Review per-base quality profile | Trim low-quality tails; confirm read length policy | Re-run trimming |
| Duplicate rate is an outlier (batch or sample) | PCR over-amplification; low library complexity | Compare duplicate metrics to cohort median | Review library complexity; avoid aggressive duplicate removal unless justified; rebuild worst samples if necessary | Rebuild library (subset) |
| Loci per sample varies widely (some collapse) | Locus dropout from digestion/size selection variance | Check loci-per-sample distribution; compare shared loci proportion | Verify size-selection window; revisit locus-building thresholds; drop extreme outlier samples | Rebuild loci |
| Shared loci proportion is very low across the cohort | Over-stringent locus building; divergent samples; inconsistent size selection | Review shared loci summary; check divergence expectations | Relax clustering/mismatch slightly; stratify by group; confirm size selection | Rebuild loci |
| One population/group shows much higher missingness | Mapping bias or group-specific divergence | Stratify missingness by group and batch | Consider de novo or mixed strategy; adjust mapping stringency; mask repeats as appropriate | Rebuild loci |
| PCA separates mainly by batch | Batch-shaped missingness or parameter drift | PCA colored by batch; missingness heatmap by batch | Re-check demultiplex and trimming consistency; harmonize locus-building settings; validate with sensitivity checks | Rebuild loci + Re-call variants |
| Excess heterozygosity across many loci | Paralog inflation or mis-mapping | Heterozygosity outliers; allele balance skew | Apply paralog screens; mask multi-mapping; tune locus definition; remove suspect loci | Re-call variants |
| Allele balance strongly skewed at many loci | Mapping bias, paralogs, or technical artifacts | Allele balance plot; compare by batch/group | Tighten mapping quality filters; remove multi-mapping; add paralog filters | Re-call variants |
| SNP count changes drastically after small parameter tweak | Over-sensitive locus definition | Run a small sensitivity test on a subset | Choose a more stable parameter band; document parameter manifest; avoid extreme settings | Rebuild loci |
| Many SNPs fail sanity checks after filtering (extreme missingness tail) | Over-permissive calling or inconsistent filters | Inspect missingness distribution; check GQ/DP distributions | Tighten genotype quality and depth logic; revise missingness strategy; confirm sample exclusions | Re-call variants |
| Replicates disagree more than expected (if available) | Sample swaps, contamination, batch artifacts | Pairwise concordance; quick identity checks | Verify sample sheet; check contamination signals; reprocess affected subset | Re-run demultiplex or Re-call variants |
| A few outlier samples drive multiple QC failures | Low DNA quality/quantity; library failure | Review per-sample QC summary | Remove outliers early; rebuild key samples if critical | Rebuild library (subset) |
Emphasize lowest-cost checks first: demultiplex logs, MultiQC, and simple PCA of QC metrics often identify the problem quickly.
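The "simple PCA of QC metrics" check can be a few lines. A sketch assuming a small per-sample QC table; the metric names and values are made up to show how one failing sample separates from the rest.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sample QC table (columns are assumptions).
qc = pd.DataFrame({
    "reads":       [1.2e6, 0.9e6, 1.1e6, 0.1e6, 1.0e6, 0.95e6],
    "dup_rate":    [0.22, 0.25, 0.24, 0.70, 0.23, 0.26],
    "missingness": [0.05, 0.07, 0.06, 0.55, 0.05, 0.08],
    "loci":        [41000, 40500, 41200, 9000, 40800, 40100],
}, index=["S01", "S02", "S03", "S04", "S05", "S06"])

# Standardize each metric, then PCA via SVD; outliers separate fast.
Z = (qc - qc.mean()) / qc.std(ddof=0)
U, S, _ = np.linalg.svd(Z.to_numpy(), full_matrices=False)
pc1 = pd.Series(U[:, 0] * S[0], index=qc.index).sort_values()
print(pc1)  # S04 should sit far from the rest
```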
Validation & Reporting: The Minimum Set Reviewers Expect
Validation checks confirm your SNP set reflects biology more than technical artifacts.
Minimum plots to include and why:
- PCA colored by batch and by population group to show that structure is not driven by batch.
- Missingness distributions per sample and per SNP to justify thresholds.
- Loci-per-sample and shared-loci summaries to demonstrate consistency.
- Replicate concordance (if available) to quantify error and tuning success.
- Sensitivity analysis to show that modest threshold changes do not flip conclusions.
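A sensitivity analysis can be as simple as re-running PCA across a threshold grid and correlating PC1 against a baseline. A sketch on simulated dosages, assuming MAF is the parameter being varied; the grid values mirror the 0.02–0.05 range discussed earlier, and the data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical dosage matrix (n_samples x n_snps) with varied frequencies.
geno = rng.binomial(2, rng.uniform(0.01, 0.5, 3000), size=(96, 3000)).astype(float)

def pc1(X):
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * S[0]

freq = geno.mean(axis=0) / 2
maf = np.minimum(freq, 1 - freq)

baseline = pc1(geno[:, maf >= 0.02])
for cut in (0.01, 0.02, 0.05):
    pcs = pc1(geno[:, maf >= cut])
    # PC sign is arbitrary; compare absolute correlation to baseline.
    r = abs(np.corrcoef(baseline, pcs)[0, 1])
    print(f"MAF >= {cut:.2f}: {int((maf >= cut).sum())} SNPs, |r| = {r:.3f}")
```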
For practical ways to compute and present cohort-scale diagnostics, review these batch-aware QC metrics. For interpreting PCA/ADMIXTURE in ddRAD studies, see this population structure workflow.
Figure 6. Reviewers expect you to check whether PCA reflects batch artifacts or biology.
Methods Template + Reproducibility Pack (What to Record and Deliver)
A reproducibility pack is a complete record of parameters, QC outputs, and decision thresholds that allows the analysis to be audited and repeated.
Methods template (copy/paste-ready headings with example sentences):
- Sample & library summary. We prepared ddRAD libraries using the [enzyme pair], size window [X–Y bp], and unique dual indices. Libraries were pooled across [n] batches with matched insert-size policies.
- Read processing. Raw reads were demultiplexed with [tool/version] using barcode and cutsite validation; adapters and low-quality tails were trimmed with [tool/version] to a final read-length policy of [N bp].
- Locus strategy. For scenario B, reads were aligned to [reference build] using [aligner/version] with mismatch and MAPQ thresholds tuned to balance alignment rate and bias. For scenario C, de novo clustering used [tool/version] with similarity [0.90–0.94], minimum depth per locus [m], and cross-sample merge settings recorded.
- SNP calling. Variants were called with [caller/version]; genotype quality and depth thresholds were chosen to minimize stochastic error while preserving informative variation.
- Filters. We applied per-sample and per-SNP missingness filters after verifying that PCA colored by batch was not dominated by technical grouping. We excluded singletons for structure analyses, tested MAF ranges [0.02–0.05], performed per-population HWE tests with scheme [Out Within/Out Some], and pruned LD for PCA/ADMIXTURE.
- Batch diagnostics. We summarized reads per sample, duplication rates, shared loci fraction, and missingness distributions; PCA colored by batch and population was used to detect batch-shaped artifacts.
- Software versions and parameter disclosure. All tools and parameter classes (demultiplexing, trimming, clustering/alignment, min depths, paralog screens) are listed in the parameter manifest.
Minimum reporting set (bullets):
- Exact filter thresholds with rationales and a short sensitivity analysis summary.
- Tool names and versions; parameter classes and final values; alignment/clustering settings.
- QC figures (MultiQC summary, missingness histograms, PCA by batch/population, loci-per-sample/shared loci, replicate concordance).
- Description of any outlier handling or resequencing.
Final export & archive micro-checklist:
- VCF and PLINK exports with filter expressions documented.
- MultiQC report and demultiplex logs.
- Parameter manifest with software versions.
- README noting batch metadata and any deviations from SOP.
For cohort-scale projects, teams often standardize parameter manifests and QC reports to keep runs comparable.
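A parameter manifest can be a plain JSON file written at the end of a run. A minimal sketch; the layout, tool names, and version strings are hypothetical placeholders rather than a standard schema, so adapt the fields to your own pipeline.

```python
import json
import platform
from datetime import date

# Hypothetical manifest; all field names and values are illustrative.
manifest = {
    "date": date.today().isoformat(),
    "python": platform.python_version(),
    "demultiplex": {"tool": "process_radtags", "version": "x.y",
                    "barcode_mismatches": 1, "rescue": True},
    "trimming": {"tool": "fastp", "version": "x.y", "final_length": 140},
    "locus_building": {"mode": "de_novo", "similarity": 0.92,
                       "min_depth_per_locus": 6},
    "filters": {"min_gq": 20, "min_dp": 8, "max_missing_per_snp": 0.2,
                "maf": 0.02, "hwe_scheme": "per-population"},
}

with open("parameter_manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```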
FAQs
How much missingness is too much?
There is no universal rule. Visualize per-sample and per-SNP missingness, color PCA by batch, and run a short sensitivity test before setting thresholds.
Should HWE filtering be global or per population?
Use per-population HWE tests and conservative schemes. Global HWE filtering can remove biologically meaningful loci in structured populations.
What should I check when batches share few loci?
Check demultiplex logs, index design, trimming/length policy, and parameter manifests for catalog drift. Harmonize, then rebuild loci and reassess.
When should I remove outlier samples?
Start with low-cost checks (demultiplexing, trimming, length harmonization). If samples remain extreme outliers after harmonization and reasonable filters, remove them with clear justification.
How do I detect paralogs without a reference?
Use heterozygosity excess and allele balance checks plus per-SNP max depth. Where possible, use paired-end assembly to increase locus length and resolve copies.
Which validation figures belong in the main text?
Put PCA by batch/population, missingness distributions, and shared-loci summaries in the main text. Replicate concordance and sensitivity analyses can be main or supplement depending on journal space.
Can filtering choices change my structure results?
Yes. Strict MAF and missingness cuts can reshape clusters. Show that your conclusions persist across reasonable parameter ranges.
What should I archive for reviewers?
Enzyme pair, size window, demultiplex logs, MultiQC, parameter manifest, batch metadata, and a representative VCF with current filters.