Bioinformatics for Low-Pass WGS: Implementing cn.mops & Pipelines

Low-pass whole-genome sequencing (low-pass WGS) is attractive for copy-number profiling because it trades depth for breadth. But for a bioinformatics pipeline architect, "low-pass CNV" is not a single method—it's a stack of decisions about binning, bias correction, segmentation/calling, and deliverable standardization.

This resource is written for RUO projects where your goals are typically:

  • Stable read-depth signal at low coverage
  • Controlled false positives (especially "over-segmentation")
  • Pipeline compatibility with existing internal tooling (inputs/outputs, reference builds, reproducibility)

Throughout, cn.mops is used as a concrete example, but most guidance is caller-agnostic and applies to read-depth CNV pipelines broadly.

Pipeline Blueprint (RUO) — what you'll build and what you'll receive
A robust low-pass CNV pipeline can be implemented as a deterministic blueprint: FASTQ → aligned BAM/CRAM → binned counts → normalized signal → segmentation/calls → standardized deliverables. The "definition of done" is not just a segment list; it's a reproducible package: (1) a segments table (BED/TSV) plus optional gene-annotation table, (2) an auditable QC report (human-readable + machine-readable), and (3) a run manifest capturing reference build, callable masks/blacklists, binning parameters, caller versions, and parameter hashes. This blueprint makes internal re-analysis and pipeline integration predictable—even when coverage is low and variance is high.

1. Why Low-Pass CNV Calling Is Hard (and How Pipelines Fix It)

1.1 Low depth → high variance: what noise looks like in read-depth space

Read-depth CNV relies on a simple idea: if a genomic region has fewer reads than expected, it may be a deletion; if it has more, it may be a duplication. Low-pass WGS breaks the "expected" part.

At low coverage, your signal is dominated by count noise and sampling variance:

  • Sparse bins: many bins land near zero reads, which inflates variance and destabilizes segmentation.
  • Outlier frequency: extreme low/high bins become common enough to create spurious breakpoints unless you explicitly filter them.
  • Tail risk in segmentation: algorithms can "explain noise" by creating many small segments (over-segmentation), which looks detailed but mostly adds false-positive burden.

Operational takeaway: in low-pass regimes, segmentation is not a final step—it's a consumer of a stabilized, bias-corrected track.
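The variance inflation at low depth can be sketched with a toy Poisson sampling model (illustrative only; real low-pass counts are overdispersed relative to Poisson, so the picture below is an optimistic lower bound on noise):

```python
import math
import random
import statistics

def simulate_bin_counts(mean_reads_per_bin, n_bins, seed=0):
    """Draw per-bin read counts from a Poisson model (Knuth's method)."""
    rng = random.Random(seed)
    def poisson(lam):
        limit, k, p = math.exp(-lam), 0, 1.0
        while p > limit:
            p *= rng.random()
            k += 1
        return k - 1
    return [poisson(mean_reads_per_bin) for _ in range(n_bins)]

def cv(counts):
    """Coefficient of variation of the raw count track (stdev / mean)."""
    return statistics.stdev(counts) / statistics.mean(counts)

# Same genome-wide coverage, two bin sizes: ~5 vs ~50 reads per bin.
small_bins = simulate_bin_counts(5, 2000)
large_bins = simulate_bin_counts(50, 2000)
# Poisson theory predicts CV ~ 1/sqrt(mean), roughly 0.45 vs 0.14 here,
# so shrinking bins hands segmentation a much noisier input track.
print(round(cv(small_bins), 2), round(cv(large_bins), 2))
```

The 1/sqrt(mean) relationship is why "just use smaller bins" fails at low pass: resolution gains are paid for directly in segmentation-input variance.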

1.2 Bias sources: GC, mappability, repeats, batch effects

Even with perfect sampling, systematic effects often dominate low-pass CNV:

  • GC bias: coverage depends on GC content (library chemistry, amplification, sequencing). Residual GC bias often appears as whole-genome "waviness."
  • Mappability: ambiguous alignments in low-complexity regions create inconsistent counts and false signal.
  • Repeats/segmental duplications: repeat-rich bins have high variance and can generate artifactual breakpoints.
  • Batch effects: changes in library method, flowcell/lanes, read length, aligner version, or reference build can shift coverage profiles.

Figure 1. Noise Sources Map (Library → Alignment → Binning → Segmentation).
What to look for: GC-driven curvature, low-mappability stripes, repeat-associated spikes, and cross-sample shifts consistent with batch.
Where to fix: apply GC correction, enforce a mappability mask, exclude known-problem regions via a blacklist/callable mask, and keep batch-homogeneous modeling for multi-sample callers (e.g., cn.mops).
How to use: inspect these signals before segmentation—most "mysterious CNVs" at low depth are preventable upstream.

1.3 Pipeline goals: stabilize signal, control false positives, standardize deliverables

A robust low-pass CNV pipeline should be designed around three goals:

  1. Signal stabilization
    Make per-bin coverage comparable across the genome (GC correction, mappability filtering, outlier handling).
  2. False-positive control
    Prevent over-segmentation by choosing bin sizes and segmentation constraints that reflect realistic resolution.
  3. Standardized deliverables
    Ensure downstream teams can re-run or integrate results: file formats, reference metadata, parameters, and QC must be explicit.

Assay context note (RUO): low-pass WGS is one option in a broader RUO toolkit. Depending on project constraints, teams may also evaluate alternatives such as Whole Exome Sequencing for exome-constrained questions or targeted approaches like Targeted Region Sequencing when the goal is focused interrogation rather than genome-wide profiling.

2. Core Pipeline Blocks (Implementation-Oriented)

2.1 Input requirements: FASTQ → aligned BAM/CRAM (what QC is mandatory)

Minimum inputs

  • Paired-end FASTQ (recommended) or single-end FASTQ
  • Sample sheet/metadata (library method, read length, platform, lane/batch identifiers)
  • Target reference build and the callable-region resources you standardize on

Alignment outputs (compatibility baseline)

  • BAM or CRAM plus index (BAI/CSI for BAM; CRAI for CRAM)
  • Alignment QC summary (per-sample + batch rollups)

Mandatory QC checks (engineering gates, not "nice-to-haves")

  • Mapped reads / usable reads: ensure bins won't be dominated by zeros after filtering
  • Duplicate rate: duplicates inflate variance without adding information at low-pass
  • Mapping rate: low mapping often correlates with repeat-driven artifacts and spurious segments
  • Insert size distribution: unexpected multimodality can track with GC bias and uneven coverage
  • Adapter/quality trimming: improves mapping consistency and reduces bin-level dispersion
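These gates can be encoded directly as an accept/fail step in the pipeline. A minimal sketch; the metric names and limit values below are illustrative placeholders, not a standard schema or validated thresholds:

```python
def alignment_qc_gate(metrics, limits):
    """Return pass/fail plus the list of failed gates for one sample.
    Metric and limit keys are project placeholders chosen for this sketch."""
    checks = {
        "usable_reads": metrics["usable_reads"] >= limits["min_usable_reads"],
        "duplicate_rate": metrics["duplicate_rate"] <= limits["max_duplicate_rate"],
        "mapping_rate": metrics["mapping_rate"] >= limits["min_mapping_rate"],
    }
    failed = [name for name, ok in checks.items() if not ok]
    return ("pass" if not failed else "fail"), failed

# Example: a sample with an elevated duplicate rate fails exactly one gate.
status, failed = alignment_qc_gate(
    {"usable_reads": 8_000_000, "duplicate_rate": 0.22, "mapping_rate": 0.97},
    {"min_usable_reads": 5_000_000, "max_duplicate_rate": 0.15, "min_mapping_rate": 0.90},
)
print(status, failed)  # fail ['duplicate_rate']
```

Emitting the failed-gate list (not just a boolean) is what makes the later "accept / re-run / quarantine" decision auditable.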

If you want standardized upstream alignment artifacts (BAM/CRAM + QC) for RUO workflows, CD Genomics Whole Genome Sequencing and Next Generation Sequencing services can be used as consistent inputs.

2.2 Binning strategy: bin size tradeoffs (resolution vs stability)

Binning converts aligned reads into a count vector across the genome. Your bin size defines:

  • the smallest event you can reliably detect (practical resolution)
  • the variance that segmentation must tolerate (stability)
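Conceptually, binning is a fixed-width histogram over read positions. A minimal sketch counting read start positions only (a real implementation would read the BAM/CRAM, apply mapping-quality and duplicate filters, and handle CIGAR-aware placement):

```python
def bin_read_starts(read_starts, bin_size, chrom_length):
    """Convert aligned read start positions into a fixed-width bin count vector."""
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:          # ignore off-chromosome positions
            counts[pos // bin_size] += 1
    return counts

# Toy example: 6 reads on a 1 Mb chromosome with 250 kb bins -> 4 bins.
reads = [10_000, 240_000, 260_000, 500_000, 510_000, 990_000]
print(bin_read_starts(reads, 250_000, 1_000_000))  # [2, 1, 2, 1]
```

Everything downstream (normalization, segmentation) consumes this count vector, which is why bin size is the single most consequential parameter in the stack.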

Figure 2. Bin Size Tradeoff Chart (Resolution vs Stability).
This figure illustrates three practical bin regimes and their intended goals: (i) larger bins for chromosomal/broad events (stability-first), (ii) mid-size bins for multi-megabase events (balanced), and (iii) smaller bins for focal annotation (often feasible only as annotation of segments, not true gene-level resolution, at low-pass).

Decision framework: picking a starting bin size (practical checklist)
Validate bin size with measurable properties rather than intuition:

  • Median reads per bin (post-filter): avoid regimes where many bins are near-zero
  • Bin-level dispersion: robust CV/MAD of normalized track should decrease as bins increase
  • Segment burden: too many segments usually means bins too small or normalization under-corrected
  • Callable fraction: aggressive masking can reduce effective coverage and force larger bins
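The checklist properties above are cheap to compute on a per-chromosome count vector. A sketch using median/MAD statistics (the metric names are ours, chosen for this example, not a standard):

```python
import statistics

def bin_qc(counts, near_zero_max=1):
    """Summarize the bin-level QC signals used to validate a candidate bin size."""
    med = statistics.median(counts)
    # Robust dispersion: median absolute deviation relative to the median.
    mad = statistics.median(abs(c - med) for c in counts)
    return {
        "median_reads_per_bin": med,
        "robust_cv": mad / med if med > 0 else float("inf"),
        "near_zero_fraction": sum(c <= near_zero_max for c in counts) / len(counts),
    }

qc = bin_qc([0, 1, 9, 10, 11, 12, 10, 9, 0, 10])
print(qc["median_reads_per_bin"], qc["near_zero_fraction"])  # 9.5 0.3
```

Running this for each candidate bin size turns "bins feel too small" into a numeric comparison you can gate on.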

Bin Size Quick Matrix (starter, tune per project)

  • Goal: chromosomal profiling. Typical event scale: broad. What can break: waviness / batch shifts. QC signal to watch: GC residual, segment burden.
  • Goal: multi-Mb events. Typical event scale: sub-chromosomal. What can break: over-segmentation. QC signal to watch: segment burden, callable fraction.
  • Goal: focal annotation. Typical event scale: gene-proximal. What can break: near-zero bins / sparse counts. QC signal to watch: dispersion, callable fraction.

How to use this matrix: choose a starting bin regime, then run the tuning loop in 3.2 and adjust bin size and segmentation constraints until QC gates stabilize.

For a deeper explanation of gene-level vs chromosomal CNV resolution limits, read this resolution guide.

2.3 Normalization: GC correction, mappability filtering, outlier handling

Normalization is where most low-pass pipelines succeed or fail.

GC correction

  • Goal: remove coverage dependence on GC without overfitting
  • Validation: plot normalized signal vs GC; residual trend should be minimal and stable across batches
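One common, LOESS-free approximation to GC correction, used in spirit by several read-depth tools (though not the exact cn.mops procedure), is to rescale each bin by the median count of its GC stratum:

```python
import statistics
from collections import defaultdict

def gc_correct(counts, gc_fractions, n_strata=10):
    """Rescale each bin by the median count of bins in the same GC stratum,
    flattening the track with respect to GC. Simplified sketch: production
    pipelines typically fit LOESS and guard strata with too few bins."""
    strata = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        strata[int(gc * n_strata)].append(count)
    overall = statistics.median(counts)
    stratum_median = {s: statistics.median(v) for s, v in strata.items()}
    return [
        count * overall / stratum_median[int(gc * n_strata)]
        if stratum_median[int(gc * n_strata)] > 0 else 0.0
        for count, gc in zip(counts, gc_fractions)
    ]

raw = [10, 10, 10, 20, 20, 20]               # high-GC bins systematically inflated
gc = [0.31, 0.33, 0.35, 0.61, 0.63, 0.65]
print(gc_correct(raw, gc))                   # every bin rescaled to the overall median
```

The validation step stays the same: plot corrected signal vs GC and confirm the residual trend is flat within and across batches.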

Mappability filtering

  • Enforce a consistent callable mask and report callable fraction
  • Low-mappability bins are a repeatable source of false positives across tools

Outlier handling (operator-focused)
Outliers come from repeats, mapping ambiguity, or assembly quirks. Treat them as first-class objects:

  • fixed blacklists (known problematic regions)
  • adaptive outlier bins (extreme bins across a batch)
  • conservative smoothing (only if validated; oversmoothing hides breakpoints)

Batch strategy
For multi-sample methods, batch homogeneity is a "hard requirement," not a preference:

  • avoid mixing library methods, read lengths, or reference builds in one modeling batch
  • if batches must be combined, combine after normalization with clear metadata separation

Note: standardizing upstream sequencing parameters across projects reduces batch variance.

2.4 Calling/segmentation: cn.mops concept and outputs

cn.mops models read counts using a mixture of Poisson components representing discrete copy-number states, and it estimates noise to reduce false positives. It tends to behave well when:

  • you have multiple technically comparable samples
  • batch heterogeneity is controlled (or segmented into homogeneous modeling groups)
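The mixture-of-Poissons idea can be illustrated for a single bin: the observed count is scored against Poisson rates proportional to each copy-number state. This is a didactic sketch only; the actual cn.mops algorithm models counts jointly across samples, estimates rates from the data, and controls the false discovery rate:

```python
import math

def cn_posterior(count, expected_diploid_rate, states=(0, 1, 2, 3, 4)):
    """Toy posterior over copy-number states for one bin, with a uniform prior
    and Poisson rates proportional to copy number. Not the cn.mops algorithm."""
    def log_pois(k, lam):
        return k * math.log(lam) - lam - math.lgamma(k + 1)
    # CN=0 gets a small floor rate so the log stays defined.
    log_weights = [
        log_pois(count, max(expected_diploid_rate * s / 2, 1e-3)) for s in states
    ]
    top = max(log_weights)                       # log-sum-exp normalization
    unnorm = [math.exp(w - top) for w in log_weights]
    total = sum(unnorm)
    return {s: w / total for s, w in zip(states, unnorm)}

# Expected ~20 reads/bin at CN=2; a bin with 10 reads should favor CN=1.
post = cn_posterior(10, expected_diploid_rate=20)
print(max(post, key=post.get))  # 1
```

The payoff of the mixture formulation is that calls come with state probabilities, giving you a principled filtering knob instead of a bare log2 threshold.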

Outputs you should standardize regardless of caller

  • segments table (BED/TSV) with reproducibility fields (see Section 4)
  • per-bin normalized signal (at least for QC/traceability)
  • QC plots (coverage distribution, GC residual, segment burden)

Reference: Klambauer et al., cn.MOPS (NAR 2012). DOI: https://doi.org/10.1093/nar/gks003

3. cn.mops Practical Notes (What Architects Care About)

3.1 Why cn.mops works well for multiple samples (mixture of Poissons idea—high level)

Pipeline architects usually care about one question: does the model reduce false positives without hiding real signal?

cn.mops is useful in multi-sample settings because it:

  • models per-bin counts across samples, separating consistent technical patterns from sample-specific deviations
  • provides noise-aware outputs that support principled filtering beyond "this looks too segmented"

At low depth, this matters because pure segmentation on noisy log2 tracks can explode into a high-burden call set.

3.2 Key parameters to tune (window/bin, minimum segment, sample batch design)

Treat tuning as an engineering loop, not a one-time decision.

A practical tuning loop (recommended)

  1. Pick 2–3 candidate bin regimes aligned to the Bin Size Quick Matrix (Section 2.2).
  2. For each regime, run normalization + cn.mops and produce the same QC report.
  3. Gate using objective metrics:
    • bin-level dispersion
    • GC residual / waviness proxy
    • callable fraction
    • segment burden distribution
  4. Lock parameters and version them with a manifest (Section 4.3).

Knobs that matter most

  • Bin size/window (stability vs resolution)
  • Minimum segment length / minimum bins per segment (primary lever against over-segmentation)
  • Modeling batch design (only mix technically comparable samples)
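The tuning loop reduces to gating each candidate run on objective metrics and keeping the smallest bin size that passes (best resolution that is still stable). A sketch; the gate values and run fields are illustrative placeholders:

```python
def pick_bin_regime(runs, max_robust_cv, max_segment_burden, min_callable_fraction):
    """From candidate runs (one per bin size), keep those passing all QC gates
    and return the smallest passing bin size; None if nothing passes."""
    passing = [
        r for r in runs
        if r["robust_cv"] <= max_robust_cv
        and r["segment_count"] <= max_segment_burden
        and r["callable_fraction"] >= min_callable_fraction
    ]
    return min(passing, key=lambda r: r["bin_size"]) if passing else None

runs = [
    {"bin_size": 50_000, "robust_cv": 0.45, "segment_count": 900, "callable_fraction": 0.80},
    {"bin_size": 200_000, "robust_cv": 0.18, "segment_count": 120, "callable_fraction": 0.85},
    {"bin_size": 1_000_000, "robust_cv": 0.08, "segment_count": 40, "callable_fraction": 0.90},
]
best = pick_bin_regime(runs, max_robust_cv=0.25, max_segment_burden=300,
                       min_callable_fraction=0.80)
print(best["bin_size"])  # 200000
```

Returning None when nothing passes is deliberate: it forces an explicit decision (relax gates, add reads, or enlarge bins) rather than silently shipping the least-bad run.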

If your internal architecture prefers plug-and-play handoff (BAM/CRAM in → standardized segments/QC out) while keeping outputs re-runnable, a single, well-defined Bioinformatics Services workflow boundary can reduce integration friction.

3.3 QC metrics to report (variance, coverage uniformity, segment confidence)

A low-pass CNV pipeline should emit QC that supports "accept / re-run / quarantine" decisions.

Recommended QC metrics (per sample + batch summaries)

  • mapped reads / usable reads (post-filter)
  • duplicate rate (and whether duplicates were marked/removed)
  • callable fraction (post mask/blacklist/outlier filtering)
  • bin-level dispersion (robust CV/MAD on normalized signal)
  • GC residual (correlation/slope of normalized signal vs GC)
  • waviness proxy (low-frequency trend amplitude / autocorrelation)
  • segment burden (count + length distribution)
  • event sanity checks (e.g., fraction of genome in altered states; extreme fractions often indicate artifact)
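The waviness proxy can be as simple as lag-1 autocorrelation of the normalized track; this is one of several reasonable definitions, not a standard metric:

```python
import math
import random
import statistics

def waviness_proxy(track, lag=1):
    """Lag-k autocorrelation of the per-bin normalized signal: near 0 for
    bin-to-bin noise, near 1 when a low-frequency trend (waviness) dominates."""
    mean = statistics.mean(track)
    dev = [x - mean for x in track]
    num = sum(dev[i] * dev[i + lag] for i in range(len(dev) - lag))
    den = sum(d * d for d in dev)
    return num / den if den else 0.0

rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(200)]    # independent bin-to-bin noise
wavy = [math.sin(i / 20) for i in range(200)]    # slow GC-like oscillation
print(round(waviness_proxy(noise), 2), round(waviness_proxy(wavy), 2))
```

Tracked per batch, a jump in this proxy flags residual GC bias or a batch shift before segmentation ever runs.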

QC Starter Thresholds Table (platform-specific; use placeholders until project QA defines limits)
Starter note: thresholds depend on library method, read length, reference build, and masking strategy.

  • Gate: usable mapped reads. Why it matters: prevents near-zero bins from dominating. If it fails: sparse counts → unstable segments. Typical fix: increase reads or increase bin size.
  • Gate: duplicate rate. Why it matters: duplicates inflate variance. If it fails: false breakpoints / noisy track. Typical fix: review library prep; mark/remove duplicates; adjust gating.
  • Gate: callable fraction. Why it matters: measures effective coverage after masks. If it fails: loss of signal; forced large bins. Typical fix: refine mask/blacklist; re-check reference/mappability resources.
  • Gate: bin-level dispersion (robust CV/MAD). Why it matters: direct indicator of signal stability. If it fails: over-segmentation. Typical fix: increase bin size; strengthen outlier filtering; split batches.
  • Gate: GC residual (slope/correlation). Why it matters: predicts waviness artifacts. If it fails: broad false gains/losses. Typical fix: revisit GC correction; revise batch normalization strategy.
  • Gate: segment burden (count / genome fraction). Why it matters: proxy for false positives. If it fails: noisy call set. Typical fix: increase minimum segment length; tune caller; increase bin size.

Figure 3. QC Dashboard Mock (Coverage, GC bias, Segment count, Log2 ratio).
This QC dashboard aligns directly to the gates above: coverage (depth distribution and outlier bins), GC bias (residual trend and waviness), segment burden (count/shape of distribution), and genome-wide log2 ratio (highlighted segments for spot-check review). Use it as a pre-release QA snapshot: if GC residual or segment burden is unstable, tune bin regime and segmentation constraints before exporting deliverables.

4. Deliverables and Compatibility (For Internal Re-analysis)

4.1 Standard outputs: segments (BED/TSV), gene-level table, QC report

Segments (analysis + visualization)

  • TSV/CSV for analysis, BED for browsers
  • Recommended columns:
    • sample_id
    • chr, start, end
    • num_bins, length_bp
    • log2_ratio (or equivalent normalized measure)
    • discrete_call (loss/neutral/gain)
    • confidence_or_noise_metric (if available)
    • pipeline_version and parameter_hash

Gene-level summary table (annotation, not "true gene resolution")

  • derived by intersecting segments with gene annotations
  • must explicitly state it is annotation of segment-level signal
  • include overlap fraction and segment IDs for traceability
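The overlap fraction comes from a plain interval intersection. A sketch with a hypothetical gene and segment (field names follow the Section 4.1 column recommendations; a production version would use an interval tree or bedtools-style sweep for scale):

```python
def annotate_segments(segments, genes):
    """Intersect CNV segments with gene intervals, reporting the fraction of
    each gene covered by each segment. This annotates segment-level signal;
    it is not gene-level calling."""
    rows = []
    for seg in segments:
        for gene in genes:
            if seg["chr"] != gene["chr"]:
                continue
            overlap = min(seg["end"], gene["end"]) - max(seg["start"], gene["start"])
            if overlap > 0:
                rows.append({
                    "gene": gene["name"],
                    "segment_id": seg["id"],
                    "call": seg["call"],
                    "overlap_fraction": overlap / (gene["end"] - gene["start"]),
                })
    return rows

segments = [{"id": "seg1", "chr": "chr1", "start": 0, "end": 1_000_000, "call": "loss"}]
genes = [{"name": "GENE_A", "chr": "chr1", "start": 900_000, "end": 1_100_000}]
print(annotate_segments(segments, genes))  # one row: GENE_A, seg1, fraction 0.5
```

Carrying segment_id and overlap_fraction in every row is what keeps the gene table traceable back to the segment-level evidence.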

QC report

  • human-readable (PDF/HTML) + machine-readable (JSON)
  • include per-metric pass/warn/fail flags and the gating thresholds used

4.2 Required raw deliverables: aligned BAM/CRAM + index, reference metadata

Minimum required for deterministic internal reanalysis:

  • BAM/CRAM + index
  • reference build identifier + fasta checksums when possible
  • aligner name/version + command/config
  • callable mask / blacklist version
  • binning parameters (bin size, bin boundaries definition, filters)
  • cn.mops version + key parameters
  • segments table + QC report

For a packaging checklist of inputs and metadata to support deterministic RUO reanalysis, see the Sample Submission Guidelines.

4.3 Reproducibility: versioning (reference build, caller versions, parameters)

Low-pass CNV is reproducibility-sensitive because small normalization changes can alter segmentation.

Recommended practice:

  • emit a manifest.json / run.yaml per batch containing:
    • references + checksums
    • tool versions
    • parameters
    • parameter hashes
  • store intermediate artifacts:
    • bin count matrix (pre/post normalization)
    • filtered-bin list / callable mask
    • segmentation input tracks
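A deterministic parameter hash falls out of canonical JSON serialization. A sketch of the manifest fields (the version strings and checksum placeholder are illustrative, not real values):

```python
import hashlib
import json

def build_manifest(reference, tool_versions, params):
    """Emit a per-batch manifest with a deterministic parameter hash:
    identical settings always produce an identical hash, regardless of
    the order the parameters were assembled in."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return {
        "reference": reference,
        "tool_versions": tool_versions,
        "parameters": params,
        "parameter_hash": hashlib.sha256(canonical.encode()).hexdigest(),
    }

m1 = build_manifest(
    {"build": "GRCh38", "fasta_sha256": "<checksum>"},   # checksum from your store
    {"aligner": "bwa-mem 0.7.17", "cn.mops": "1.x"},     # illustrative versions
    {"bin_size": 200_000, "min_bins_per_segment": 5},
)
m2 = build_manifest(
    {"build": "GRCh38", "fasta_sha256": "<checksum>"},
    {"aligner": "bwa-mem 0.7.17", "cn.mops": "1.x"},
    {"min_bins_per_segment": 5, "bin_size": 200_000},    # same params, other order
)
print(m1["parameter_hash"] == m2["parameter_hash"])  # True
```

Write the manifest next to the segments table and QC JSON so any deliverable can be matched to the exact settings that produced it.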

5. Troubleshooting Guide

5.1 Too many segments (over-segmentation)

Symptoms

  • extremely high segment counts
  • many tiny segments with small log2 shifts
  • inconsistent calls across similar samples

Likely causes

  • bins too small for depth regime
  • insufficient outlier filtering
  • residual GC bias / waviness
  • batch heterogeneity (mixed library/platform/reference)

Checks

  • segment-count distribution across samples (batch-specific?)
  • fraction of bins near zero reads post-filter
  • GC residual plot stability
  • bin-level dispersion across samples

Fixes

  • increase bin size and/or minimum segment length
  • tighten outlier-bin filtering and callable masks
  • split heterogeneous batches and re-run
  • re-check reference and mappability resources

5.2 Whole-genome waviness (GC bias / batch)

Symptoms

  • low-frequency oscillation across chromosomes
  • broad false gains/losses tracking GC rather than stable signal
  • shared waviness signature within a batch

Checks

  • normalized signal vs GC (residual trend should be minimal)
  • waviness proxy by batch
  • reference build and mask/blacklist consistency

Fixes

  • revise GC correction strategy (avoid under/overfitting)
  • enforce batch-homogeneous processing/modeling
  • avoid mixing read lengths and library chemistries inside one cn.mops modeling batch

For project planning and assay selection in RUO settings (e.g., throughput, cost, expected resolution), see this scalable CNV assay comparison.

5.3 Poor callable regions (repeat-rich genomes)

Symptoms

  • large fractions of bins filtered
  • calls cluster in low-mappability regions
  • results vary widely across tools

Checks

  • callable fraction per chromosome
  • overlap of called segments with low-mappability tracks
  • compare calls before/after masking

Fixes

  • tune callable masks/blacklists to the reference genome
  • shift goals to larger event sizes if effective coverage is too low
  • validate that reference resources (mappability tracks, blacklists) match the build

RUO assay context: when a project's constraints favor array-based readouts over low-pass WGS, teams may evaluate SNP Microarray or broader Microarray Services as alternative inputs for CNV-focused research pipelines.

FAQ

1) Do I need matched controls for low-pass CNV?

Not always. Many read-depth workflows can run without matched controls, but you must compensate with stronger bias correction, conservative segmentation constraints, and stricter QC gates.

2) What deliverables should I require so my team can re-run everything deterministically?

At minimum: BAM/CRAM+index, reference build metadata, alignment version/config, callable masks/blacklists, binning parameters, caller version/parameters, segments table, QC report, and a manifest capturing parameter hashes.

3) How do I choose bin size without guessing?

Use the Bin Size Quick Matrix (Section 2.2) to select a starting regime, then run the tuning loop in 3.2 and gate on dispersion, GC residual, callable fraction, and segment burden.

4) Why does segment count explode even after GC correction?

GC correction doesn't fix mappability/repeat artifacts or batch heterogeneity. Over-segmentation is usually a system problem: bins too small + residual bias + outlier bins + heterogeneous batches.

5) Can low-pass WGS support gene-level CNV calls?

Often not reliably. Treat gene-level tables as annotation of segment-level calls. See the resolution guide linked above.

6) Should I output VCF for CNVs?

VCF can be useful for certain ecosystems, but many CNV workflows are more naturally represented as BED/TSV segments plus a manifest and QC JSON. Pick formats that best match downstream tooling and reproducibility requirements.

7) What's the most common reason a low-pass CNV pipeline fails review by a bioinformatics lead?

Underspecified QC gates and incomplete deliverables. If the pipeline can't be re-run deterministically—or if QC can't justify stability—integration risk is high even if calls look plausible.

8) Where can I standardize sample metadata and packaging to avoid handoff friction?

Use a single packaging checklist and require the manifest fields described in Sections 4.2–4.3. If you need additional upstream consistency, RUO pipelines often pair low-pass WGS outputs with a complementary genotyping layer such as Genotyping for specific study designs.

References

  1. Klambauer G, Schwarzbauer K, Mayr A, et al. cn.MOPS: Mixture of Poissons for Discovering Copy Number Variations in Next Generation Sequencing Data with a Low False Discovery Rate. Nucleic Acids Research (2012). DOI: 10.1093/nar/gks003 — https://doi.org/10.1093/nar/gks003
  2. Scheinin I, Sie D, Bengtsson H, et al. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Research (2014). DOI: 10.1101/gr.175141.114 — https://doi.org/10.1101/gr.175141.114
  3. Boeva V, Popova T, Bleakley K, et al. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics (2012). DOI: 10.1093/bioinformatics/btr670 — https://doi.org/10.1093/bioinformatics/btr670
  4. Smolander J, Khan S, Singaravelu K, et al. Evaluation of tools for identifying large copy number variations from ultra-low-coverage whole-genome sequencing data. BMC Genomics (2021). DOI: 10.1186/s12864-021-07686-z — https://doi.org/10.1186/s12864-021-07686-z
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.