Quality Control for Reduced Representation Sequencing: From Raw Reads to Reviewer-Proof SNP Sets

Reduced representation sequencing QC gets hard when unstable loci, high missingness, and batch-shaped patterns creep in. If you’ve shipped RAD or ddRAD data, you know how fast a promising dataset can unravel under reviewer scrutiny. This guide gives you a practical playbook to stabilize your SNPs: a TL;DR checklist, stage-by-stage QC, a large troubleshooting table, a methods template, and validation tips. Throughout, reduced representation sequencing QC, RAD-seq quality control, and the goal of a reviewer-proof SNP set are treated as one defensible workflow.
What QC Means in Reduced Representation Sequencing
Reduced representation sequencing QC is the process of turning restriction-based reads into a consistent, publishable SNP set using defensible filters and reproducible records.
RRS QC is not WGS QC. In RRS, locus consistency and missingness mechanisms matter more than average depth. Restriction sites define your genomic windows, and any disturbance—enzyme choice, trimming policy, barcode issues—propagates into shared-locus rates and callability. ddRAD replaced random shearing with dual enzyme digestion to stabilize fragment size selection and reduce locus dropout variability, as described in the original ddRAD method paper (Peterson et al., 2012). Your goal is simple: a stable locus catalog across samples and batches, with filtering choices that are transparent, justified, and shown not to manufacture structure.
- Success looks like this: most loci are shared across samples, filters are interpretable, and downstream PCA/ADMIXTURE patterns remain stable under modest threshold shifts.
- Failure shows up as batch-shaped missingness, paralog-rich loci with inflated heterozygosity, and thresholds that can’t be defended or reproduced.
Mini-table — RRS QC focus vs WGS QC focus:
| Aspect | RRS QC focus | WGS QC focus |
| --- | --- | --- |
| Locus definition | Restriction-site windows, clustering | Genome-wide alignment |
| Primary risk | Locus dropout, nonrandom missingness | Mapping bias, coverage variability |
| Key gates | Shared loci fraction, per-sample missingness | Mean/median depth, coverage uniformity |
| Validation | PCA by batch/pop, shared loci metrics | Variant call recall/precision, depth histograms |
Figure 2. RRS QC is a chain: early read issues can become locus dropout and biased SNP sets.
The Reviewer’s Lens: What Makes a SNP Set Reviewer-Proof
A reviewer-proof SNP set is one where missingness, locus consistency, and filters are transparent, justified, and shown not to create artifacts.
Common reviewer questions:
- Are loci comparable across samples or does dropout dominate?
- Is missingness structured by batch, library, or population group?
- Are thresholds justified and reproducible?
- Are validation plots included (PCA, missingness distributions, replicate checks)?
One practical way to preempt concerns is to include batch-aware diagnostics and write up your checks. For examples of cohort-scale metrics, see these batch-aware QC metrics and dashboards in the article on batch-aware QC metrics. To understand how field design and sampling can interact with batch, review this explainer on sampling bias and batch effects in population genomics.
TL;DR: The 12-Step RRS QC Checklist (FASTQ → SNP Set)
This checklist summarizes the minimum QC decisions needed to go from raw reads to a defensible SNP dataset.
Figure 3. A quick QC checklist to move from FASTQ to a defensible SNP dataset.
- Demultiplex integrity
- Check: Barcode/cutsite validity; index pairs; reads per sample distribution.
- Red flag: Barcode leakage, index hopping signatures; extreme low-read outliers (<10–20% of median).
- Action: Re-run demultiplex with stricter barcode rescue; inspect index reads; consider unique dual indices; hold downstream until outliers explained.
- Raw read quality
- Check: MultiQC summary; per-base quality; adapter content; sequence duplication levels.
- Red flag: High adapter content; quality drop at read tails; duplication >50% across many samples.
- Action: Adjust trimming policy; review PCR cycles; if duplication is batch-wide, pause for library review.
- Trimming and read-length policy
- Check: Adapter/quality trimming parameters; final read-length distribution across batches.
- Red flag: Aggressive trimming creating inconsistent lengths across batches; read-through.
- Action: Harmonize length policy; trim minimally to maintain shared loci; document settings.
- Duplicate handling
- Check: PCR duplicate rate per sample; over-represented sequences.
- Red flag: High duplicates in specific plates/batches; over-represented fragments.
- Action: Mark/remove duplicates cautiously; review PCR cycles and cleanups; consider rebalancing libraries.
- Mapping or clustering strategy (Scenario B: reference mapping; Scenario C: de novo clustering)
- Check: Scenario B: alignment rate, MAPQ; Scenario C: clustering threshold, min depth per locus.
- Red flag: Scenario B: strong reference bias, many low-MAPQ reads; Scenario C: oversplitting at high similarity thresholds.
- Action: Scenario B: tune mismatch/MAPQ cutoffs and assess bias; Scenario C: relax clustering slightly (e.g., 0.90–0.94) and re-evaluate shared loci.
- Shared loci fraction
- Check: Fraction of loci present in ≥70–80% of samples before hard filters (a computation sketch follows this checklist).
- Red flag: <50% shared loci or strong batch differences.
- Action: Revisit trimming policy, clustering/alignment parameters; investigate catalog drift; consider pilot consensus parameters.
- Paralog signals
- Check: Excess heterozygosity; allele balance distortions; unusually high per-SNP depth.
- Red flag: Clusters of high-het SNPs with skewed allele ratios.
- Action: Apply paralog filters (heterozygosity excess, depth caps); consider paired-end assembly; remove suspect loci.
- SNP calling sanity
- Check: Call rate; transition/transversion ratios; site-level QUAL/GQ distributions.
- Red flag: Low call rate across many samples; QUAL inflated at shallow depth.
- Action: Adjust caller thresholds; ensure min depth per genotype; re-assess per-sample QC.
- Missingness filters
- Check: Per-sample and per-SNP missingness; PCA colored by batch before hard cuts.
- Red flag: PCA dominated by batch; missingness clustered by library.
- Action: Diagnose demultiplexing and length policy; harmonize parameters; only then set moderate missingness thresholds. Nonrandom missingness can bias PCA, pulling samples toward the origin as shown in a Methods in Ecology and Evolution study (2021).
- MAF policy
- Check: Minor allele count/frequency distributions; presence of singletons.
- Red flag: Many singletons driving clusters; too-strict MAF removing informative variants.
- Action: Exclude singletons for structure; test 0.02–0.05 MAF ranges in sensitivity analyses. MAF choices strongly affect structure inference per Linck & Battey 2019.
- HWE checks
- Check: HWE tests per population; summarize removal schemes.
- Red flag: Global HWE filtering removing many loci; excess removal in structured groups.
- Action: Use per-population tests; choose conservative schemes (e.g., Out Within); control for multiple testing. Global HWE cuts can distort structure as shown in a 2022 assessment of HWE filtering schemes.
- LD policy and final sanity
- Check: LD pruning settings for PCA/ADMIXTURE; replicate concordance if available.
- Red flag: Strong LD blocks within RAD windows biasing PCs; poor replicate concordance.
- Action: Prune LD (e.g., windowed pruning at r^2 ≈ 0.2); confirm PCs reflect biology; quantify replicate error; finalize exports.
Stop/go gates: If steps 1–3 reveal batch-wide issues, stop and resolve before locus building. If shared loci fraction stays <50% after harmonization, pause and pilot parameter grids before proceeding.
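To make the shared-loci and missingness gates concrete, here is a minimal sketch, assuming a genotype matrix already loaded as a NumPy array with -1 coding missing calls; the `geno` array and its simulated values are placeholders, not output from any real pipeline.

```python
import numpy as np

# Hypothetical input: genotype matrix (n_variants x n_samples),
# coded as 0/1/2 alt-allele dosages with -1 for missing calls.
rng = np.random.default_rng(0)
geno = rng.choice([-1, 0, 1, 2], size=(5000, 96), p=[0.15, 0.50, 0.25, 0.10])

present = geno >= 0  # called genotypes

# Gate (step 6): fraction of loci present in >=70% of samples.
locus_call_rate = present.mean(axis=1)
shared_fraction = (locus_call_rate >= 0.70).mean()

# Gate (step 9): per-sample missingness before hard filters.
sample_missingness = 1.0 - present.mean(axis=0)

print(f"shared-loci fraction (called in >=70% of samples): {shared_fraction:.2f}")
print(f"worst per-sample missingness: {sample_missingness.max():.2f}")

# Stop/go: pause locus building if fewer than half the loci pass the gate.
if shared_fraction < 0.50:
    print("STOP: harmonize trimming/clustering before proceeding")
```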
Stage-by-Stage QC: What to Check and Why It Breaks
Stage-by-stage QC links each processing step to the artifacts it can introduce in RRS SNP sets.
Raw reads & demultiplexing (signals that predict dropout)
Focus on signals that matter for RRS. MultiQC gives a quick map: per-base quality trends, adapter content, and duplication. In RAD/ddRAD, uneven reads per sample are predictive. Extreme low-read outliers often translate into high missingness later. Look for barcode/cutsite validation and index hopping signatures in demultiplex logs. Tools like Stacks’ process_radtags validate barcodes and cut sites and can rescue or discard reads as needed (process_radtags documentation).
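A quick way to act on the reads-per-sample signal is to flag outliers against the cohort median. A minimal sketch, assuming per-sample counts have already been parsed from the demultiplex logs into a table; the column names and the 15% cutoff are illustrative assumptions, not tool output.

```python
import pandas as pd

# Hypothetical demultiplex summary: one row per sample with read counts,
# e.g. parsed from process_radtags logs (column names are assumptions).
counts = pd.DataFrame({
    "sample": [f"S{i:02d}" for i in range(1, 7)],
    "batch":  ["A", "A", "A", "B", "B", "B"],
    "reads":  [1_200_000, 950_000, 80_000, 1_100_000, 1_050_000, 60_000],
})

median_reads = counts["reads"].median()
# Flag extreme low-read outliers (<10-20% of the cohort median).
counts["low_outlier"] = counts["reads"] < 0.15 * median_reads

print(counts)
print(counts.groupby("batch")["reads"].median())  # batch-shaped imbalance?
```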
Demultiplexing pitfalls to watch:
- Barcode leakage and index hopping suggest cross-sample contamination.
- Uneven allocation across samples predicts batch-shaped missingness.
- Low index read quality can exacerbate misassignment. Illumina’s index misassignment white paper (2017) explains UDI best practices.
Why these matter: read-level problems propagate into locus dropout, nonrandom missingness, and spurious PCA separation.
Figure 4. Uneven reads per sample and contamination signals often predict downstream missingness.
Trimming, read length policy, and duplicates (avoid over-cleaning)
Adapter trimming is good; over-trimming is not. Overly aggressive policies can shorten reads unevenly across batches, eroding shared-locus overlap. Keep a consistent length policy across runs. Track duplication as a proxy for PCR bias; high duplication inflates apparent depth without increasing information. Mark/remove duplicates selectively and review library steps if duplication clusters by batch.
A note on duplicates: RRS libraries often lack molecular barcodes, so duplicate decisions rely on fragment identity. Treat duplicates as a QC signal first. If duplication is batch-wide, investigate library prep before filtering individuals.
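One low-cost way to apply this advice is to compare duplication rates by batch before touching any individual sample. A sketch assuming a per-sample QC table, perhaps assembled from MultiQC output; the field names and the 20-percentage-point threshold are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-sample QC table (field names are assumptions).
qc = pd.DataFrame({
    "sample":   ["S01", "S02", "S03", "S04", "S05", "S06"],
    "batch":    ["plate1", "plate1", "plate1", "plate2", "plate2", "plate2"],
    "dup_rate": [0.22, 0.25, 0.24, 0.61, 0.58, 0.64],
})

cohort_median = qc["dup_rate"].median()
by_batch = qc.groupby("batch")["dup_rate"].median()

# Batch-wide duplication (a whole plate far above the cohort median)
# points to library prep, not to individual samples.
suspect_batches = by_batch[by_batch > cohort_median + 0.20]
print(suspect_batches)
```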
Locus building (reference vs de novo, without turning into a tutorial)
Define locus in plain terms: in RRS, a locus is the cluster of reads derived from the same restriction-site fragment across samples.
Two conceptual paths:
- Reference mapping (scenario B): Reads align to a heterologous or sketch reference. Key parameters include mismatch tolerance, MAPQ thresholds, and handling multi-mappers. Risks include mapping bias toward the reference allele and loss of divergent loci.
- De novo clustering (scenario C): Reads are clustered without a reference. Critical parameters include clustering similarity (e.g., 0.90–0.94 typical), within- vs across-sample clustering order, and minimum depth per locus. ipyrad’s guidelines emphasize avoiding excessive similarity thresholds to prevent oversplitting orthologs (ipyrad assembly guidelines).
Parameter classes to record:
- Clustering/stringency: Stacks (m, M, n) or ipyrad similarity; across-sample merge settings. Stacks v2’s paired-end assembly improves locus contigs and genotyping resolution (Stacks 2 methods, 2019).
- Minimum depth per locus and per-genotype.
- Mismatch allowances during alignment/assembly.
- Paralog screens: heterozygosity excess, allele balance, max depth per locus.
Paralog signals matter because they inflate heterozygosity and distort allele balance; retain them and your structure inferences will skew.
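The three paralog screens named above translate directly into per-SNP flags. A minimal sketch using simulated per-SNP summaries; the 0.60 heterozygosity cutoff, the 0.25/0.75 allele-balance bounds, and the 2x depth cap are illustrative values, not universal recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-SNP summaries (all names and values are assumptions):
# het_rate: fraction of heterozygous genotypes per SNP
# ab:       mean alt-read fraction in heterozygotes (allele balance)
# depth:    mean per-SNP depth
n_snps = 10_000
het_rate = rng.beta(2, 6, n_snps)
ab = rng.normal(0.5, 0.08, n_snps)
depth = rng.lognormal(3.0, 0.5, n_snps)

# Screen 1: heterozygosity excess (collapsed paralogs look "too het").
het_flag = het_rate > 0.60

# Screen 2: allele balance far from 0.5 in heterozygotes.
ab_flag = (ab < 0.25) | (ab > 0.75)

# Screen 3: depth cap; far above the cohort mean suggests copy collapse.
depth_flag = depth > 2.0 * depth.mean()

paralog_suspect = het_flag | ab_flag | depth_flag
print(f"flagged {paralog_suspect.sum()} / {n_snps} SNPs as paralog-suspect")
```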
Read next for context, not a deep dive: a pipeline overview comparing Stacks 2/ipyrad/dDocent in this ddRAD pipeline overview, and for planning parameters up front, see this ddRAD project design guide.
What to record for scalable, audit-ready RRS projects
For cohort-scale projects, teams standardize inputs (enzyme pair, size window, index scheme), logs (demultiplex settings, trimming policy, clustering/alignment parameters, min depth, paralog screen), and outputs (VCF/PLINK exports, MultiQC bundle, parameter manifest with software versions). This supports re-runs, cross-batch comparisons, and reviewer auditability.
Calling and Filtering SNPs You Can Defend
Defensible filtering reduces error while minimizing bias that can look like real population structure.
Depth and genotype quality protect against stochastic calls and low-quality genotypes. Aim for a per-genotype minimum depth and a reasonable genotype quality threshold. Coverage math belongs in project planning rather than post-sequencing QC; if you need it, use a dedicated coverage and multiplexing planning resource.
Missingness needs context. Before setting hard thresholds, visualize per-sample and per-SNP missingness and run a PCA colored by batch. If batch correlates with major PCs, fix the cause first; otherwise, filters risk baking artifacts into your SNP set.
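Here is a minimal sketch of the "PCA colored by batch" diagnostic, assuming a dosage matrix with -1 for missing calls and a known batch label per sample; mean imputation and plain SVD stand in for whatever PCA implementation your pipeline uses, and the simulated data are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Hypothetical inputs: dosage matrix (n_samples x n_snps, -1 = missing)
# and a per-sample batch label vector.
geno = rng.choice([-1, 0, 1, 2], size=(96, 2000), p=[0.1, 0.5, 0.3, 0.1])
batch = np.repeat(["A", "B"], 48)

# Mean-impute missing calls per SNP, then center (simple PCA prep).
X = geno.astype(float)
X[X < 0] = np.nan
col_means = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_means, X) - col_means

# PCA via SVD; PCs are left singular vectors scaled by singular values.
U, S, _ = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * S[:2]

for b in np.unique(batch):
    m = batch == b
    plt.scatter(pcs[m, 0], pcs[m, 1], label=f"batch {b}", alpha=0.6)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.title("PCA by batch: separation here means fix QC first")
plt.savefig("pca_by_batch.png", dpi=150)
```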
MAF policy depends on your goal. For structure, excluding singletons improves stability; overly tight cutoffs can remove informative variants. Test a small range and report sensitivity.
HWE filtering should be per population in structured datasets. Global HWE cuts can remove biologically meaningful loci; use conservative schemes and control false positives.
LD policies matter because RAD windows often contain tightly linked SNPs. For PCA/ADMIXTURE, prune LD (for example, r^2 around 0.2) to avoid over-weighting local haplotypes.
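For intuition, here is a toy greedy pruner over a SNP-index window; real pipelines typically use PLINK's --indep-pairwise or scikit-allel's locate_unlinked, which work over genomic windows rather than the simple index window assumed here.

```python
import numpy as np

def prune_ld(dosage, window=50, r2_max=0.2):
    """Greedy LD pruning sketch: keep a SNP only if its r^2 with every
    already-kept SNP in the trailing window stays below r2_max.
    `dosage` is (n_snps x n_samples), complete (impute missing first)."""
    keep = []
    for i in range(dosage.shape[0]):
        ok = True
        for j in keep[::-1]:
            if i - j > window:  # only compare within the window
                break
            r = np.corrcoef(dosage[i], dosage[j])[0, 1]
            if r * r >= r2_max:
                ok = False
                break
        if ok:
            keep.append(i)
    return np.array(keep)

# Hypothetical complete dosage matrix (no missing calls).
rng = np.random.default_rng(3)
dosage = rng.integers(0, 3, size=(1000, 96)).astype(float)
kept = prune_ld(dosage, window=50, r2_max=0.2)
print(f"kept {kept.size} of {dosage.shape[0]} SNPs after LD pruning")
```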
Goal-based filtering mini-table:
| Downstream goal | Filtering mindset | Main risk if over-filtered |
| --- | --- | --- |
| Population structure | Exclude singletons; moderate MAF; LD-pruned SNPs; per-pop HWE tests | Bias to common alleles; lost local signal |
| Relatedness/parentage | Higher depth/GQ; stricter per-sample missingness; duplicates reviewed | Marker count drop; ascertainment bias |
| Selection scans | Balanced missingness; minimal LD blocks; conservative HWE context | False positives from batch-shaped missingness |
When you discuss depth or multiplex planning with collaborators, consider pointing them to a dedicated resource on coverage and multiplexing planning so this article stays focused on post-sequencing QC.
Troubleshooting Playbook: Symptom → Likely Cause → Fix (RRS-Specific)
Troubleshooting turns common RRS failure patterns into actionable fixes that reduce rework.
Figure 5. A fast path from symptom to likely cause and the next best fix.
| Symptom | Likely cause | Quick check (5–10 min) | Fix (lowest-cost first) | When to re-run |
| --- | --- | --- | --- | --- |
| Many samples have very low read counts | Uneven pooling, demultiplexing loss | Plot reads per sample; check barcode match rate | Re-check barcode list and mismatch settings; confirm sample sheet; resequence low-count samples if needed | Re-run demultiplex |
| One batch has systematically lower reads | Lane/library imbalance | Compare reads per sample grouped by batch/lane | Confirm pooling assumptions; flag batch in downstream diagnostics; resequence if required | Re-run demultiplex |
| High adapter/primer contamination | Short inserts, incomplete trimming | Check adapter content signal; review overrepresented sequences | Strengthen adapter trimming; verify insert-size distribution | Re-run trimming |
| Quality drops early in reads | Run quality issue or chemistry mismatch | Review per-base quality profile | Trim low-quality tails; confirm read length policy | Re-run trimming |
| Duplicate rate is an outlier (batch or sample) | PCR over-amplification; low library complexity | Compare duplicate metrics to cohort median | Review library complexity; avoid aggressive duplicate removal unless justified; rebuild worst samples if necessary | Rebuild library (subset) |
| Loci per sample varies widely (some collapse) | Locus dropout from digestion/size selection variance | Check loci-per-sample distribution; compare shared loci proportion | Verify size-selection window; revisit locus-building thresholds; drop extreme outlier samples | Rebuild loci |
| Shared loci proportion is very low across the cohort | Over-stringent locus building; divergent samples; inconsistent size selection | Review shared loci summary; check divergence expectations | Relax clustering/mismatch slightly; stratify by group; confirm size selection | Rebuild loci |
| One population/group shows much higher missingness | Mapping bias or group-specific divergence | Stratify missingness by group and batch | Consider de novo or mixed strategy; adjust mapping stringency; mask repeats as appropriate | Rebuild loci |
| PCA separates mainly by batch | Batch-shaped missingness or parameter drift | PCA colored by batch; missingness heatmap by batch | Re-check demultiplex and trimming consistency; harmonize locus-building settings; validate with sensitivity checks | Rebuild loci + Re-call variants |
| Excess heterozygosity across many loci | Paralog inflation or mis-mapping | Heterozygosity outliers; allele balance skew | Apply paralog screens; mask multi-mapping; tune locus definition; remove suspect loci | Re-call variants |
| Allele balance strongly skewed at many loci | Mapping bias, paralogs, or technical artifacts | Allele balance plot; compare by batch/group | Tighten mapping quality filters; remove multi-mapping; add paralog filters | Re-call variants |
| SNP count changes drastically after small parameter tweak | Over-sensitive locus definition | Run a small sensitivity test on a subset | Choose a more stable parameter band; document parameter manifest; avoid extreme settings | Rebuild loci |
| Many SNPs fail sanity checks after filtering (extreme missingness tail) | Over-permissive calling or inconsistent filters | Inspect missingness distribution; check GQ/DP distributions | Tighten genotype quality and depth logic; revise missingness strategy; confirm sample exclusions | Re-call variants |
| Replicates disagree more than expected (if available) | Sample swaps, contamination, batch artifacts | Pairwise concordance; quick identity checks | Verify sample sheet; check contamination signals; reprocess affected subset | Re-run demultiplex or Re-call variants |
| A few outlier samples drive multiple QC failures | Low DNA quality/quantity; library failure | Review per-sample QC summary | Remove outliers early; rebuild key samples if critical | Rebuild library (subset) |
Emphasize lowest-cost checks first: demultiplex logs, MultiQC, and simple PCA of QC metrics often identify the problem quickly.
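The "simple PCA of QC metrics" check can be a few lines. A sketch assuming a small per-sample QC table; the metric names and values are made up to show how one failing sample separates from the rest.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sample QC table (columns are assumptions).
qc = pd.DataFrame({
    "reads":       [1.2e6, 0.9e6, 1.1e6, 0.1e6, 1.0e6, 0.95e6],
    "dup_rate":    [0.22, 0.25, 0.24, 0.70, 0.23, 0.26],
    "missingness": [0.05, 0.07, 0.06, 0.55, 0.05, 0.08],
    "loci":        [41000, 40500, 41200, 9000, 40800, 40100],
}, index=["S01", "S02", "S03", "S04", "S05", "S06"])

# Standardize each metric, then PCA via SVD; outliers separate fast.
Z = (qc - qc.mean()) / qc.std(ddof=0)
U, S, _ = np.linalg.svd(Z.to_numpy(), full_matrices=False)
pc1 = pd.Series(U[:, 0] * S[0], index=qc.index).sort_values()
print(pc1)  # S04 should sit far from the rest
```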
Validation & Reporting: The Minimum Set Reviewers Expect
Validation checks confirm your SNP set reflects biology more than technical artifacts.
Minimum plots to include and why:
- PCA colored by batch and by population group to show that structure is not driven by batch.
- Missingness distributions per sample and per SNP to justify thresholds.
- Loci-per-sample and shared-loci summaries to demonstrate consistency.
- Replicate concordance (if available) to quantify error and tuning success.
- Sensitivity analysis to show that modest threshold changes do not flip conclusions.
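A sensitivity analysis can be as simple as re-running PCA across a threshold grid and correlating PC1 against a baseline. A sketch on simulated dosages, assuming MAF is the parameter being varied; the grid values mirror the 0.02–0.05 range discussed earlier, and the data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical dosage matrix (n_samples x n_snps) with varied frequencies.
geno = rng.binomial(2, rng.uniform(0.01, 0.5, 3000), size=(96, 3000)).astype(float)

def pc1(X):
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * S[0]

freq = geno.mean(axis=0) / 2
maf = np.minimum(freq, 1 - freq)

baseline = pc1(geno[:, maf >= 0.02])
for cut in (0.01, 0.02, 0.05):
    pcs = pc1(geno[:, maf >= cut])
    # PC sign is arbitrary; compare absolute correlation to baseline.
    r = abs(np.corrcoef(baseline, pcs)[0, 1])
    print(f"MAF >= {cut:.2f}: {int((maf >= cut).sum())} SNPs, |r| = {r:.3f}")
```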
For practical ways to compute and present cohort-scale diagnostics, review these batch-aware QC metrics. For interpreting PCA/ADMIXTURE in ddRAD studies, see this population structure workflow.
Figure 6. Reviewers expect you to check whether PCA reflects batch artifacts or biology.
Methods Template + Reproducibility Pack (What to Record and Deliver)
A reproducibility pack is a complete record of parameters, QC outputs, and decision thresholds that allows the analysis to be audited and repeated.
Methods template (copy/paste-ready headings with example sentences):
- Sample & library summary. We prepared ddRAD libraries using the [enzyme pair], size window [X–Y bp], and unique dual indices. Libraries were pooled across [n] batches with matched insert-size policies.
- Read processing. Raw reads were demultiplexed with [tool/version] using barcode and cutsite validation; adapters and low-quality tails were trimmed with [tool/version] to a final read-length policy of [N bp].
- Locus strategy. For scenario B, reads were aligned to [reference build] using [aligner/version] with mismatch and MAPQ thresholds tuned to balance alignment rate and bias. For scenario C, de novo clustering used [tool/version] with similarity [0.90–0.94], minimum depth per locus [m], and cross-sample merge settings recorded.
- SNP calling. Variants were called with [caller/version]; genotype quality and depth thresholds were chosen to minimize stochastic error while preserving informative variation.
- Filters. We applied per-sample and per-SNP missingness filters after verifying that PCA colored by batch was not dominated by technical grouping. We excluded singletons for structure analyses, tested MAF ranges [0.02–0.05], performed per-population HWE tests with scheme [Out Within/Out Some], and pruned LD for PCA/ADMIXTURE.
- Batch diagnostics. We summarized reads per sample, duplication rates, shared loci fraction, and missingness distributions; PCA colored by batch and population was used to detect batch-shaped artifacts.
- Software versions and parameter disclosure. All tools and parameter classes (demultiplexing, trimming, clustering/alignment, min depths, paralog screens) are listed in the parameter manifest.
Minimum reporting set (bullets):
- Exact filter thresholds with rationales and a short sensitivity analysis summary.
- Tool names and versions; parameter classes and final values; alignment/clustering settings.
- QC figures (MultiQC summary, missingness histograms, PCA by batch/population, loci-per-sample/shared loci, replicate concordance).
- Description of any outlier handling or resequencing.
Final export & archive micro-checklist:
- VCF and PLINK exports with filter expressions documented.
- MultiQC report and demultiplex logs.
- Parameter manifest with software versions.
- README noting batch metadata and any deviations from SOP.
For cohort-scale projects, teams often standardize parameter manifests and QC reports to keep runs comparable.
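A parameter manifest can be a plain JSON file written at the end of a run. A minimal sketch; the layout, tool names, and version strings are hypothetical placeholders rather than a standard schema, so adapt the fields to your own pipeline.

```python
import json
import platform
from datetime import date

# Hypothetical manifest; all field names and values are illustrative.
manifest = {
    "date": date.today().isoformat(),
    "python": platform.python_version(),
    "demultiplex": {"tool": "process_radtags", "version": "x.y",
                    "barcode_mismatches": 1, "rescue": True},
    "trimming": {"tool": "fastp", "version": "x.y", "final_length": 140},
    "locus_building": {"mode": "de_novo", "similarity": 0.92,
                       "min_depth_per_locus": 6},
    "filters": {"min_gq": 20, "min_dp": 8, "max_missing_per_snp": 0.2,
                "maf": 0.02, "hwe_scheme": "per-population"},
}

with open("parameter_manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```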
FAQs
How much missingness is too much?
There is no universal rule. Visualize per-sample and per-SNP missingness, color PCA by batch, and run a short sensitivity test before setting thresholds.
Should HWE filtering be global or per population?
Use per-population HWE tests and conservative schemes. Global HWE filtering can remove biologically meaningful loci in structured populations.
What should I check when batches share few loci?
Check demultiplex logs, index design, trimming/length policy, and parameter manifests for catalog drift. Harmonize, then rebuild loci and reassess.
When should I remove outlier samples?
Start with low-cost checks (demultiplexing, trimming, length harmonization). If samples remain extreme outliers after harmonization and reasonable filters, remove them with clear justification.
How do I detect paralogs without a reference?
Use heterozygosity excess and allele balance checks plus per-SNP max depth. Where possible, use paired-end assembly to increase locus length and resolve copies.
Which validation figures belong in the main text?
Put PCA by batch/population, missingness distributions, and shared-loci summaries in the main text. Replicate concordance and sensitivity analyses can be main or supplement depending on journal space.
Can filtering choices change my structure results?
Yes. Strict MAF and missingness cuts can reshape clusters. Show that your conclusions persist across reasonable parameter ranges.
What should I archive for reviewers?
Enzyme pair, size window, demultiplex logs, MultiQC, parameter manifest, batch metadata, and a representative VCF with current filters.