Bioinformatics for Low-Pass WGS: Implementing cn.mops & Pipelines

Low-pass whole-genome sequencing (low-pass WGS) is attractive for copy-number profiling because it trades depth for breadth. But for a bioinformatics pipeline architect, "low-pass CNV" is not a single method—it's a stack of decisions about binning, bias correction, segmentation/calling, and deliverable standardization.

This resource is written for RUO projects where your goals are typically:

  • Stable read-depth signal at low coverage
  • Controlled false positives (especially "over-segmentation")
  • Pipeline compatibility with existing internal tooling (inputs/outputs, reference builds, reproducibility)

Throughout, cn.mops is used as a concrete example, but most guidance is caller-agnostic and applies to read-depth CNV pipelines broadly.

Pipeline Blueprint (RUO) — what you'll build and what you'll receive
A robust low-pass CNV pipeline can be implemented as a deterministic blueprint: FASTQ → aligned BAM/CRAM → binned counts → normalized signal → segmentation/calls → standardized deliverables. The "definition of done" is not just a segment list; it's a reproducible package: (1) a segments table (BED/TSV) plus optional gene-annotation table, (2) an auditable QC report (human-readable + machine-readable), and (3) a run manifest capturing reference build, callable masks/blacklists, binning parameters, caller versions, and parameter hashes. This blueprint makes internal re-analysis and pipeline integration predictable—even when coverage is low and variance is high.

1. Why Low-Pass CNV Calling Is Hard (and How Pipelines Fix It)

1.1 Low depth → high variance: what noise looks like in read-depth space

Read-depth CNV relies on a simple idea: if a genomic region has fewer reads than expected, it may be a deletion; if it has more, it may be a duplication. Low-pass WGS breaks the "expected" part.

At low coverage, your signal is dominated by count noise and sampling variance:

  • Sparse bins: many bins land near zero reads, which inflates variance and destabilizes segmentation.
  • Outlier frequency: extreme low/high bins become common enough to create spurious breakpoints unless you explicitly filter them.
  • Tail risk in segmentation: algorithms can "explain noise" by creating many small segments (over-segmentation), which looks detailed but mostly adds false-positive burden.

Operational takeaway: in low-pass regimes, segmentation is not a final step—it's a consumer of a stabilized, bias-corrected track.
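The variance inflation at low depth can be sketched with a toy Poisson sampling model (illustrative only; real low-pass counts are overdispersed relative to Poisson, so the picture below is an optimistic lower bound on noise):

```python
import math
import random
import statistics

def simulate_bin_counts(mean_reads_per_bin, n_bins, seed=0):
    """Draw per-bin read counts from a Poisson model (Knuth's method)."""
    rng = random.Random(seed)
    def poisson(lam):
        limit, k, p = math.exp(-lam), 0, 1.0
        while p > limit:
            p *= rng.random()
            k += 1
        return k - 1
    return [poisson(mean_reads_per_bin) for _ in range(n_bins)]

def cv(counts):
    """Coefficient of variation of the raw count track (stdev / mean)."""
    return statistics.stdev(counts) / statistics.mean(counts)

# Same genome-wide coverage, two bin sizes: ~5 vs ~50 reads per bin.
small_bins = simulate_bin_counts(5, 2000)
large_bins = simulate_bin_counts(50, 2000)
# Poisson theory predicts CV ~ 1/sqrt(mean), roughly 0.45 vs 0.14 here,
# so shrinking bins hands segmentation a much noisier input track.
print(round(cv(small_bins), 2), round(cv(large_bins), 2))
```

The 1/sqrt(mean) relationship is why "just use smaller bins" fails at low pass: resolution gains are paid for directly in segmentation-input variance.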

1.2 Bias sources: GC, mappability, repeats, batch effects

Even with perfect sampling, systematic effects often dominate low-pass CNV:

  • GC bias: coverage depends on GC content (library chemistry, amplification, sequencing). Residual GC bias often appears as whole-genome "waviness."
  • Mappability: ambiguous alignments in low-complexity regions create inconsistent counts and false signal.
  • Repeats/segmental duplications: repeat-rich bins have high variance and can generate artifactual breakpoints.
  • Batch effects: changes in library method, flowcell/lanes, read length, aligner version, or reference build can shift coverage profiles.

Figure 1. Noise Sources Map (Library → Alignment → Binning → Segmentation).
What to look for: GC-driven curvature, low-mappability stripes, repeat-associated spikes, and cross-sample shifts consistent with batch.
Where to fix: apply GC correction, enforce a mappability mask, exclude known-problem regions via a blacklist/callable mask, and keep batch-homogeneous modeling for multi-sample callers (e.g., cn.mops).
How to use: inspect these signals before segmentation—most "mysterious CNVs" at low depth are preventable upstream.

1.3 Pipeline goals: stabilize signal, control false positives, standardize deliverables

A robust low-pass CNV pipeline should be designed around three goals:

  1. Signal stabilization
    Make per-bin coverage comparable across the genome (GC correction, mappability filtering, outlier handling).
  2. False-positive control
    Prevent over-segmentation by choosing bin sizes and segmentation constraints that reflect realistic resolution.
  3. Standardized deliverables
    Ensure downstream teams can re-run or integrate results: file formats, reference metadata, parameters, and QC must be explicit.

Assay context note (RUO): low-pass WGS is one option in a broader RUO toolkit. Depending on project constraints, teams may also evaluate alternatives such as Whole Exome Sequencing for exome-constrained questions or targeted approaches like Targeted Region Sequencing when the goal is focused interrogation rather than genome-wide profiling.

2. Core Pipeline Blocks (Implementation-Oriented)

2.1 Input requirements: FASTQ → aligned BAM/CRAM (what QC is mandatory)

Minimum inputs

  • Paired-end FASTQ (recommended) or single-end FASTQ
  • Sample sheet/metadata (library method, read length, platform, lane/batch identifiers)
  • Target reference build and the callable-region resources you standardize on

Alignment outputs (compatibility baseline)

  • BAM or CRAM plus index (BAI/CSI for BAM; CRAI for CRAM)
  • Alignment QC summary (per-sample + batch rollups)

Mandatory QC checks (engineering gates, not "nice-to-haves")

  • Mapped reads / usable reads: ensure bins won't be dominated by zeros after filtering
  • Duplicate rate: duplicates inflate variance without adding information at low-pass
  • Mapping rate: low mapping often correlates with repeat-driven artifacts and spurious segments
  • Insert size distribution: unexpected multimodality can track with GC bias and uneven coverage
  • Adapter/quality trimming: improves mapping consistency and reduces bin-level dispersion
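These gates can be encoded directly as an accept/fail step in the pipeline. A minimal sketch; the metric names and limit values below are illustrative placeholders, not a standard schema or validated thresholds:

```python
def alignment_qc_gate(metrics, limits):
    """Return pass/fail plus the list of failed gates for one sample.
    Metric and limit keys are project placeholders chosen for this sketch."""
    checks = {
        "usable_reads": metrics["usable_reads"] >= limits["min_usable_reads"],
        "duplicate_rate": metrics["duplicate_rate"] <= limits["max_duplicate_rate"],
        "mapping_rate": metrics["mapping_rate"] >= limits["min_mapping_rate"],
    }
    failed = [name for name, ok in checks.items() if not ok]
    return ("pass" if not failed else "fail"), failed

# Example: a sample with an elevated duplicate rate fails exactly one gate.
status, failed = alignment_qc_gate(
    {"usable_reads": 8_000_000, "duplicate_rate": 0.22, "mapping_rate": 0.97},
    {"min_usable_reads": 5_000_000, "max_duplicate_rate": 0.15, "min_mapping_rate": 0.90},
)
print(status, failed)  # fail ['duplicate_rate']
```

Emitting the failed-gate list (not just a boolean) is what makes the later "accept / re-run / quarantine" decision auditable.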

If you want standardized upstream alignment artifacts (BAM/CRAM + QC) for RUO workflows, CD Genomics Whole Genome Sequencing and Next Generation Sequencing services can be used as consistent inputs.

2.2 Binning strategy: bin size tradeoffs (resolution vs stability)

Binning converts aligned reads into a count vector across the genome. Your bin size defines:

  • the smallest event you can reliably detect (practical resolution)
  • the variance that segmentation must tolerate (stability)
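Conceptually, binning is a fixed-width histogram over read positions. A minimal sketch counting read start positions only (a real implementation would read the BAM/CRAM, apply mapping-quality and duplicate filters, and handle CIGAR-aware placement):

```python
def bin_read_starts(read_starts, bin_size, chrom_length):
    """Convert aligned read start positions into a fixed-width bin count vector."""
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:          # ignore off-chromosome positions
            counts[pos // bin_size] += 1
    return counts

# Toy example: 6 reads on a 1 Mb chromosome with 250 kb bins -> 4 bins.
reads = [10_000, 240_000, 260_000, 500_000, 510_000, 990_000]
print(bin_read_starts(reads, 250_000, 1_000_000))  # [2, 1, 2, 1]
```

Everything downstream (normalization, segmentation) consumes this count vector, which is why bin size is the single most consequential parameter in the stack.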

Figure 2. Bin Size Tradeoff Chart (Resolution vs Stability).
This figure illustrates three practical bin regimes and their intended goals: (i) larger bins for chromosomal/broad events (stability-first), (ii) mid-size bins for multi-megabase events (balanced), and (iii) smaller bins for focal annotation (often feasible only as annotation of segments, not true gene-level resolution, at low-pass).

Decision framework: picking a starting bin size (practical checklist)
Validate bin size with measurable properties rather than intuition:

  • Median reads per bin (post-filter): avoid regimes where many bins are near-zero
  • Bin-level dispersion: robust CV/MAD of normalized track should decrease as bins increase
  • Segment burden: too many segments usually means bins too small or normalization under-corrected
  • Callable fraction: aggressive masking can reduce effective coverage and force larger bins
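The checklist properties above are cheap to compute on a per-chromosome count vector. A sketch using median/MAD statistics (the metric names are ours, chosen for this example, not a standard):

```python
import statistics

def bin_qc(counts, near_zero_max=1):
    """Summarize the bin-level QC signals used to validate a candidate bin size."""
    med = statistics.median(counts)
    # Robust dispersion: median absolute deviation relative to the median.
    mad = statistics.median(abs(c - med) for c in counts)
    return {
        "median_reads_per_bin": med,
        "robust_cv": mad / med if med > 0 else float("inf"),
        "near_zero_fraction": sum(c <= near_zero_max for c in counts) / len(counts),
    }

qc = bin_qc([0, 1, 9, 10, 11, 12, 10, 9, 0, 10])
print(qc["median_reads_per_bin"], qc["near_zero_fraction"])  # 9.5 0.3
```

Running this for each candidate bin size turns "bins feel too small" into a numeric comparison you can gate on.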

Bin Size Quick Matrix (starter, tune per project)

  • Goal: chromosomal profiling. Typical event scale: broad. What can break: waviness / batch shifts. QC signal to watch: GC residual, segment burden.
  • Goal: multi-Mb events. Typical event scale: sub-chromosomal. What can break: over-segmentation. QC signal to watch: segment burden, callable fraction.
  • Goal: focal annotation. Typical event scale: gene-proximal. What can break: near-zero bins / sparse counts. QC signal to watch: dispersion, callable fraction.

How to use this matrix: choose a starting bin regime, then run the tuning loop in 3.2 and adjust bin size and segmentation constraints until QC gates stabilize.

For a deeper explanation of gene-level vs chromosomal CNV resolution limits, read this resolution guide.

2.3 Normalization: GC correction, mappability filtering, outlier handling

Normalization is where most low-pass pipelines succeed or fail.

GC correction

  • Goal: remove coverage dependence on GC without overfitting
  • Validation: plot normalized signal vs GC; residual trend should be minimal and stable across batches
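One common, LOESS-free approximation to GC correction, used in spirit by several read-depth tools (though not the exact cn.mops procedure), is to rescale each bin by the median count of its GC stratum:

```python
import statistics
from collections import defaultdict

def gc_correct(counts, gc_fractions, n_strata=10):
    """Rescale each bin by the median count of bins in the same GC stratum,
    flattening the track with respect to GC. Simplified sketch: production
    pipelines typically fit LOESS and guard strata with too few bins."""
    strata = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        strata[int(gc * n_strata)].append(count)
    overall = statistics.median(counts)
    stratum_median = {s: statistics.median(v) for s, v in strata.items()}
    return [
        count * overall / stratum_median[int(gc * n_strata)]
        if stratum_median[int(gc * n_strata)] > 0 else 0.0
        for count, gc in zip(counts, gc_fractions)
    ]

raw = [10, 10, 10, 20, 20, 20]               # high-GC bins systematically inflated
gc = [0.31, 0.33, 0.35, 0.61, 0.63, 0.65]
print(gc_correct(raw, gc))                   # every bin rescaled to the overall median
```

The validation step stays the same: plot corrected signal vs GC and confirm the residual trend is flat within and across batches.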

Mappability filtering

  • Enforce a consistent callable mask and report callable fraction
  • Low-mappability bins are a repeatable source of false positives across tools

Outlier handling (operator-focused)
Outliers come from repeats, mapping ambiguity, or assembly quirks. Treat them as first-class objects:

  • fixed blacklists (known problematic regions)
  • adaptive outlier bins (extreme bins across a batch)
  • conservative smoothing (only if validated; oversmoothing hides breakpoints)

Batch strategy
For multi-sample methods, batch homogeneity is a "hard requirement," not a preference:

  • avoid mixing library methods, read lengths, or reference builds in one modeling batch
  • if batches must be combined, combine after normalization with clear metadata separation

Note: standardizing upstream sequencing parameters across projects reduces batch variance.

2.4 Calling/segmentation: cn.mops concept and outputs

cn.mops models read counts using a mixture of Poisson components representing discrete copy-number states, and it estimates noise to reduce false positives. It tends to behave well when:

  • you have multiple technically comparable samples
  • batch heterogeneity is controlled (or segmented into homogeneous modeling groups)
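The mixture-of-Poissons idea can be illustrated for a single bin: the observed count is scored against Poisson rates proportional to each copy-number state. This is a didactic sketch only; the actual cn.mops algorithm models counts jointly across samples, estimates rates from the data, and controls the false discovery rate:

```python
import math

def cn_posterior(count, expected_diploid_rate, states=(0, 1, 2, 3, 4)):
    """Toy posterior over copy-number states for one bin, with a uniform prior
    and Poisson rates proportional to copy number. Not the cn.mops algorithm."""
    def log_pois(k, lam):
        return k * math.log(lam) - lam - math.lgamma(k + 1)
    # CN=0 gets a small floor rate so the log stays defined.
    log_weights = [
        log_pois(count, max(expected_diploid_rate * s / 2, 1e-3)) for s in states
    ]
    top = max(log_weights)                       # log-sum-exp normalization
    unnorm = [math.exp(w - top) for w in log_weights]
    total = sum(unnorm)
    return {s: w / total for s, w in zip(states, unnorm)}

# Expected ~20 reads/bin at CN=2; a bin with 10 reads should favor CN=1.
post = cn_posterior(10, expected_diploid_rate=20)
print(max(post, key=post.get))  # 1
```

The payoff of the mixture formulation is that calls come with state probabilities, giving you a principled filtering knob instead of a bare log2 threshold.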

Outputs you should standardize regardless of caller

  • segments table (BED/TSV) with reproducibility fields (see Section 4)
  • per-bin normalized signal (at least for QC/traceability)
  • QC plots (coverage distribution, GC residual, segment burden)

Reference: Klambauer et al., cn.MOPS (NAR 2012). DOI: https://doi.org/10.1093/nar/gks003

3. cn.mops Practical Notes (What Architects Care About)

3.1 Why cn.mops works well for multiple samples (mixture of Poissons idea—high level)

Pipeline architects usually care about one question: does the model reduce false positives without hiding real signal?

cn.mops is useful in multi-sample settings because it:

  • models per-bin counts across samples, separating consistent technical patterns from sample-specific deviations
  • provides noise-aware outputs that support principled filtering beyond "this looks too segmented"

At low depth, this matters because pure segmentation on noisy log2 tracks can explode into a high-burden call set.

3.2 Key parameters to tune (window/bin, minimum segment, sample batch design)

Treat tuning as an engineering loop, not a one-time decision.

A practical tuning loop (recommended)

  1. Pick 2–3 candidate bin regimes aligned to the Bin Size Quick Matrix (Section 2.2).
  2. For each regime, run normalization + cn.mops and produce the same QC report.
  3. Gate using objective metrics:
    • bin-level dispersion
    • GC residual / waviness proxy
    • callable fraction
    • segment burden distribution
  4. Lock parameters and version them with a manifest (Section 4.3).

Knobs that matter most

  • Bin size/window (stability vs resolution)
  • Minimum segment length / minimum bins per segment (primary lever against over-segmentation)
  • Modeling batch design (only mix technically comparable samples)
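The tuning loop reduces to gating each candidate run on objective metrics and keeping the smallest bin size that passes (best resolution that is still stable). A sketch; the gate values and run fields are illustrative placeholders:

```python
def pick_bin_regime(runs, max_robust_cv, max_segment_burden, min_callable_fraction):
    """From candidate runs (one per bin size), keep those passing all QC gates
    and return the smallest passing bin size; None if nothing passes."""
    passing = [
        r for r in runs
        if r["robust_cv"] <= max_robust_cv
        and r["segment_count"] <= max_segment_burden
        and r["callable_fraction"] >= min_callable_fraction
    ]
    return min(passing, key=lambda r: r["bin_size"]) if passing else None

runs = [
    {"bin_size": 50_000, "robust_cv": 0.45, "segment_count": 900, "callable_fraction": 0.80},
    {"bin_size": 200_000, "robust_cv": 0.18, "segment_count": 120, "callable_fraction": 0.85},
    {"bin_size": 1_000_000, "robust_cv": 0.08, "segment_count": 40, "callable_fraction": 0.90},
]
best = pick_bin_regime(runs, max_robust_cv=0.25, max_segment_burden=300,
                       min_callable_fraction=0.80)
print(best["bin_size"])  # 200000
```

Returning None when nothing passes is deliberate: it forces an explicit decision (relax gates, add reads, or enlarge bins) rather than silently shipping the least-bad run.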

If your internal architecture prefers plug-and-play handoff (BAM/CRAM in → standardized segments/QC out) while keeping outputs re-runnable, a single, well-defined Bioinformatics Services workflow boundary can reduce integration friction.

3.3 QC metrics to report (variance, coverage uniformity, segment confidence)

A low-pass CNV pipeline should emit QC that supports "accept / re-run / quarantine" decisions.

Recommended QC metrics (per sample + batch summaries)

  • mapped reads / usable reads (post-filter)
  • duplicate rate (and whether duplicates were marked/removed)
  • callable fraction (post mask/blacklist/outlier filtering)
  • bin-level dispersion (robust CV/MAD on normalized signal)
  • GC residual (correlation/slope of normalized signal vs GC)
  • waviness proxy (low-frequency trend amplitude / autocorrelation)
  • segment burden (count + length distribution)
  • event sanity checks (e.g., fraction of genome in altered states; extreme fractions often indicate artifact)
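The waviness proxy can be as simple as lag-1 autocorrelation of the normalized track; this is one of several reasonable definitions, not a standard metric:

```python
import math
import random
import statistics

def waviness_proxy(track, lag=1):
    """Lag-k autocorrelation of the per-bin normalized signal: near 0 for
    bin-to-bin noise, near 1 when a low-frequency trend (waviness) dominates."""
    mean = statistics.mean(track)
    dev = [x - mean for x in track]
    num = sum(dev[i] * dev[i + lag] for i in range(len(dev) - lag))
    den = sum(d * d for d in dev)
    return num / den if den else 0.0

rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(200)]    # independent bin-to-bin noise
wavy = [math.sin(i / 20) for i in range(200)]    # slow GC-like oscillation
print(round(waviness_proxy(noise), 2), round(waviness_proxy(wavy), 2))
```

Tracked per batch, a jump in this proxy flags residual GC bias or a batch shift before segmentation ever runs.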

QC Starter Thresholds Table (platform-specific; use placeholders until project QA defines limits)
Starter note: thresholds depend on library method, read length, reference build, and masking strategy.

  • Gate: usable mapped reads. Why it matters: prevents near-zero bins from dominating. If it fails: sparse counts → unstable segments. Typical fix: increase reads or increase bin size.
  • Gate: duplicate rate. Why it matters: duplicates inflate variance. If it fails: false breakpoints / noisy track. Typical fix: review library prep; mark/remove duplicates; adjust gating.
  • Gate: callable fraction. Why it matters: measures effective coverage after masks. If it fails: loss of signal; forced large bins. Typical fix: refine mask/blacklist; re-check reference/mappability resources.
  • Gate: bin-level dispersion (robust CV/MAD). Why it matters: direct indicator of signal stability. If it fails: over-segmentation. Typical fix: increase bin size; strengthen outlier filtering; split batches.
  • Gate: GC residual (slope/correlation). Why it matters: predicts waviness artifacts. If it fails: broad false gains/losses. Typical fix: revisit GC correction; revise batch normalization strategy.
  • Gate: segment burden (count / genome fraction). Why it matters: proxy for false positives. If it fails: noisy call set. Typical fix: increase minimum segment length; tune caller; increase bin size.

Figure 3. QC Dashboard Mock (Coverage, GC bias, Segment count, Log2 ratio).
This QC dashboard aligns directly to the gates above: coverage (depth distribution and outlier bins), GC bias (residual trend and waviness), segment burden (count/shape of distribution), and genome-wide log2 ratio (highlighted segments for spot-check review). Use it as a pre-release QA snapshot: if GC residual or segment burden is unstable, tune bin regime and segmentation constraints before exporting deliverables.

4. Deliverables and Compatibility (For Internal Re-analysis)

4.1 Standard outputs: segments (BED/TSV), gene-level table, QC report

Segments (analysis + visualization)

  • TSV/CSV for analysis, BED for browsers
  • Recommended columns:
    • sample_id
    • chr, start, end
    • num_bins, length_bp
    • log2_ratio (or equivalent normalized measure)
    • discrete_call (loss/neutral/gain)
    • confidence_or_noise_metric (if available)
    • pipeline_version and parameter_hash

Gene-level summary table (annotation, not "true gene resolution")

  • derived by intersecting segments with gene annotations
  • must explicitly state it is annotation of segment-level signal
  • include overlap fraction and segment IDs for traceability
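The overlap fraction comes from a plain interval intersection. A sketch with a hypothetical gene and segment (field names follow the Section 4.1 column recommendations; a production version would use an interval tree or bedtools-style sweep for scale):

```python
def annotate_segments(segments, genes):
    """Intersect CNV segments with gene intervals, reporting the fraction of
    each gene covered by each segment. This annotates segment-level signal;
    it is not gene-level calling."""
    rows = []
    for seg in segments:
        for gene in genes:
            if seg["chr"] != gene["chr"]:
                continue
            overlap = min(seg["end"], gene["end"]) - max(seg["start"], gene["start"])
            if overlap > 0:
                rows.append({
                    "gene": gene["name"],
                    "segment_id": seg["id"],
                    "call": seg["call"],
                    "overlap_fraction": overlap / (gene["end"] - gene["start"]),
                })
    return rows

segments = [{"id": "seg1", "chr": "chr1", "start": 0, "end": 1_000_000, "call": "loss"}]
genes = [{"name": "GENE_A", "chr": "chr1", "start": 900_000, "end": 1_100_000}]
print(annotate_segments(segments, genes))  # one row: GENE_A, seg1, fraction 0.5
```

Carrying segment_id and overlap_fraction in every row is what keeps the gene table traceable back to the segment-level evidence.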

QC report

  • human-readable (PDF/HTML) + machine-readable (JSON)
  • include per-metric pass/warn/fail flags and the gating thresholds used

4.2 Required raw deliverables: aligned BAM/CRAM + index, reference metadata

Minimum required for deterministic internal reanalysis:

  • BAM/CRAM + index
  • reference build identifier + fasta checksums when possible
  • aligner name/version + command/config
  • callable mask / blacklist version
  • binning parameters (bin size, bin boundaries definition, filters)
  • cn.mops version + key parameters
  • segments table + QC report

For a packaging checklist of inputs and metadata to support deterministic RUO reanalysis, see the Sample Submission Guidelines.

4.3 Reproducibility: versioning (reference build, caller versions, parameters)

Low-pass CNV is reproducibility-sensitive because small normalization changes can alter segmentation.

Recommended practice:

  • emit a manifest.json / run.yaml per batch containing:
    • references + checksums
    • tool versions
    • parameters
    • parameter hashes
  • store intermediate artifacts:
    • bin count matrix (pre/post normalization)
    • filtered-bin list / callable mask
    • segmentation input tracks
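A deterministic parameter hash falls out of canonical JSON serialization. A sketch of the manifest fields (the version strings and checksum placeholder are illustrative, not real values):

```python
import hashlib
import json

def build_manifest(reference, tool_versions, params):
    """Emit a per-batch manifest with a deterministic parameter hash:
    identical settings always produce an identical hash, regardless of
    the order the parameters were assembled in."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return {
        "reference": reference,
        "tool_versions": tool_versions,
        "parameters": params,
        "parameter_hash": hashlib.sha256(canonical.encode()).hexdigest(),
    }

m1 = build_manifest(
    {"build": "GRCh38", "fasta_sha256": "<checksum>"},   # checksum from your store
    {"aligner": "bwa-mem 0.7.17", "cn.mops": "1.x"},     # illustrative versions
    {"bin_size": 200_000, "min_bins_per_segment": 5},
)
m2 = build_manifest(
    {"build": "GRCh38", "fasta_sha256": "<checksum>"},
    {"aligner": "bwa-mem 0.7.17", "cn.mops": "1.x"},
    {"min_bins_per_segment": 5, "bin_size": 200_000},    # same params, other order
)
print(m1["parameter_hash"] == m2["parameter_hash"])  # True
```

Write the manifest next to the segments table and QC JSON so any deliverable can be matched to the exact settings that produced it.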

5. Troubleshooting Guide

5.1 Too many segments (over-segmentation)

Symptoms

  • extremely high segment counts
  • many tiny segments with small log2 shifts
  • inconsistent calls across similar samples

Likely causes

  • bins too small for depth regime
  • insufficient outlier filtering
  • residual GC bias / waviness
  • batch heterogeneity (mixed library/platform/reference)

Checks

  • segment-count distribution across samples (batch-specific?)
  • fraction of bins near zero reads post-filter
  • GC residual plot stability
  • bin-level dispersion across samples

Fixes

  • increase bin size and/or minimum segment length
  • tighten outlier-bin filtering and callable masks
  • split heterogeneous batches and re-run
  • re-check reference and mappability resources

5.2 Whole-genome waviness (GC bias / batch)

Symptoms

  • low-frequency oscillation across chromosomes
  • broad false gains/losses tracking GC rather than stable signal
  • shared waviness signature within a batch

Checks

  • normalized signal vs GC (residual trend should be minimal)
  • waviness proxy by batch
  • reference build and mask/blacklist consistency

Fixes

  • revise GC correction strategy (avoid under/overfitting)
  • enforce batch-homogeneous processing/modeling
  • avoid mixing read lengths and library chemistries inside one cn.mops modeling batch

For project planning and assay selection in RUO settings (e.g., throughput, cost, expected resolution), see this scalable CNV assay comparison.

5.3 Poor callable regions (repeat-rich genomes)

Symptoms

  • large fractions of bins filtered
  • calls cluster in low-mappability regions
  • results vary widely across tools

Checks

  • callable fraction per chromosome
  • overlap of called segments with low-mappability tracks
  • compare calls before/after masking

Fixes

  • tune callable masks/blacklists to the reference genome
  • shift goals to larger event sizes if effective coverage is too low
  • validate that reference resources (mappability tracks, blacklists) match the build

RUO assay context: when a project's constraints favor array-based readouts over low-pass WGS, teams may evaluate SNP Microarray or broader Microarray Services as alternative inputs for CNV-focused research pipelines.

FAQ

1) Do I need matched controls for low-pass CNV?

Not always. Many read-depth workflows can run without matched controls, but you must compensate with stronger bias correction, conservative segmentation constraints, and stricter QC gates.

2) What deliverables should I require so my team can re-run everything deterministically?

At minimum: BAM/CRAM+index, reference build metadata, alignment version/config, callable masks/blacklists, binning parameters, caller version/parameters, segments table, QC report, and a manifest capturing parameter hashes.

3) How do I choose bin size without guessing?

Use the Bin Size Quick Matrix (Section 2.2) to select a starting regime, then run the tuning loop in 3.2 and gate on dispersion, GC residual, callable fraction, and segment burden.

4) Why does segment count explode even after GC correction?

GC correction doesn't fix mappability/repeat artifacts or batch heterogeneity. Over-segmentation is usually a system problem: bins too small + residual bias + outlier bins + heterogeneous batches.

5) Can low-pass WGS support gene-level CNV calls?

Often not reliably. Treat gene-level tables as annotation of segment-level calls. See the resolution guide linked above.

6) Should I output VCF for CNVs?

VCF can be useful for certain ecosystems, but many CNV workflows are more naturally represented as BED/TSV segments plus a manifest and QC JSON. Pick formats that best match downstream tooling and reproducibility requirements.

7) What's the most common reason a low-pass CNV pipeline fails review by a bioinformatics lead?

Underspecified QC gates and incomplete deliverables. If the pipeline can't be re-run deterministically—or if QC can't justify stability—integration risk is high even if calls look plausible.

8) Where can I standardize sample metadata and packaging to avoid handoff friction?

Use a single packaging checklist and require the manifest fields described in Sections 4.2–4.3. If you need additional upstream consistency, RUO pipelines often pair low-pass WGS outputs with a complementary genotyping layer such as Genotyping for specific study designs.

References

  1. Klambauer G, Schwarzbauer K, Mayr A, et al. cn.MOPS: Mixture of Poissons for Discovering Copy Number Variations in Next Generation Sequencing Data with a Low False Discovery Rate. Nucleic Acids Research (2012). DOI: 10.1093/nar/gks003 — https://doi.org/10.1093/nar/gks003
  2. Scheinin I, Sie D, Bengtsson H, et al. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Research (2014). DOI: 10.1101/gr.175141.114 — https://doi.org/10.1101/gr.175141.114
  3. Boeva V, Popova T, Bleakley K, et al. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics (2012). DOI: 10.1093/bioinformatics/btr670 — https://doi.org/10.1093/bioinformatics/btr670
  4. Smolander J, Khan S, Singaravelu K, et al. Evaluation of tools for identifying large copy number variations from ultra-low-coverage whole-genome sequencing data. BMC Genomics (2021). DOI: 10.1186/s12864-021-07686-z — https://doi.org/10.1186/s12864-021-07686-z
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.