Copy Number Variation (CNV) Analysis: Detection Methods, Depth Strategies, and Bioinformatics Tools

Copy number variation (CNV) refers to the duplication or deletion of DNA segments larger than 1 kb. Across the human genome, CNVs account for more total base-pair differences between individuals than single-nucleotide variants, yet they remain more challenging to detect because the primary signal—read depth—is a continuous variable affected by numerous technical confounders. Unlike SNV calling, which depends on base identity at a single position, CNV detection requires integrating depth information across genomic windows, correcting for systematic biases, and segmenting the genome into regions of consistent copy number.

This guide provides a practical framework for researchers who have experience with NGS data and need to design, execute, and interpret CNV analysis projects. It covers the algorithmic basis of read-depth CNV detection, the quantitative relationship between sequencing depth and detection sensitivity, a detailed comparison of bioinformatics tools with benchmark performance data, the key technical challenges that compromise accuracy, and practical strategies for project design—from low-pass screening to high-resolution genome-wide profiling. The focus throughout is on how analytical decisions affect the types and sizes of CNVs that can be reliably detected, and on avoiding the common pitfalls that lead to false-positive or false-negative results. Whether you are analyzing germline CNVs in a population study of 1,000+ samples or somatic CNVs in a small cancer cohort, the principles of GC normalization, mappability filtering, and cohort-aware segmentation apply across all scales of CNV analysis.

Whole genome sequencing services support CNV detection across all depth configurations—from high-coverage 30× germline analysis to low-pass 0.5-2× somatic screening—with matched bioinformatics pipelines optimized for each approach.

What Is Copy Number Variation and Why Does It Matter?

Copy number variants are structural alterations in which a DNA segment is present in more or fewer copies than the reference genome. They range from approximately 1 kb to several megabases, and their formation is driven primarily by non-allelic homologous recombination (NAHR) between flanking segmental duplications during meiosis, and by non-homologous end-joining (NHEJ) in both meiotic and mitotic contexts. CNVs are classified into deletions (loss of a genomic segment, reducing copy number to 1 or 0) and duplications (gain of an additional copy, increasing copy number to 3 or more). This classification is detectable from sequencing data by the magnitude and direction of the read depth change—a heterozygous deletion reduces expected depth by 50%, while a homozygous deletion reduces it to zero.

The biological impact of CNVs is substantial. In the germline, CNVs are a well-established cause of genetic disorders: the 22q11.2 microdeletion syndrome (1 in 4,000 live births), Charcot-Marie-Tooth disease (PMP22 duplication), and Smith-Magenis syndrome (RAI1 deletion) are classic examples. Population-scale studies estimate that large CNVs (>50 kb) affect approximately 15% of the genome in copy number and account for more inter-individual genetic variation than all SNVs combined. In cancer, focal amplifications of oncogenes such as MYC, EGFR, KRAS, and ERBB2 directly drive tumor progression, while homozygous or heterozygous deletions of tumor suppressor genes including TP53, CDKN2A, PTEN, and RB1 eliminate critical regulatory pathways. Variant calling services include CNV detection as a standard component of comprehensive genomic analysis.

How CNV Detection from Sequencing Data Works — The Algorithmic Basis

All sequencing-based CNV detection methods share a common algorithmic framework built on read depth analysis, though the specific implementations differ substantially between tools. Understanding this framework is essential for interpreting CNV results and troubleshooting failed analyses.

Read depth counting and windowing: Sequenced reads are aligned to the reference genome, and the number of reads mapping to each genomic window is counted. Window size is a critical parameter—smaller windows (100 bp to 1 kb) provide higher breakpoint resolution but lower statistical power per window, while larger windows (10-100 kb) increase signal-to-noise ratio at the cost of blurred breakpoints. For WGS at 30×, 1 kb windows provide sufficient power for CNV detection. For LP-WGS at 1×, 100-500 kb windows are required. The read depth in each window follows a Poisson distribution with mean equal to the expected coverage, and CNVs are identified as regions where the observed depth deviates significantly from this expectation after normalization.
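The counting step can be sketched in a few lines. This is a minimal illustration, assuming reads are represented only by their 0-based start coordinate on a single contig; the function name and the toy coordinates are hypothetical, and a real pipeline would read positions from a BAM/CRAM file instead.

```python
from collections import Counter

def count_reads_per_window(read_starts, window_size, chrom_length):
    """Count reads per fixed-size window using each read's 0-based start."""
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = Counter(
        pos // window_size for pos in read_starts if 0 <= pos < chrom_length
    )
    # Windows with no reads are reported explicitly as 0.
    return [counts.get(i, 0) for i in range(n_windows)]

# Toy example: a 10 kb contig in 1 kb windows (hypothetical start positions).
starts = [5, 120, 999, 1000, 1500, 4200, 4300, 4999, 9100, 9999]
depth = count_reads_per_window(starts, window_size=1000, chrom_length=10_000)
print(depth)
```

Assigning each read to one window by its start position keeps every read counted exactly once; doubling `window_size` roughly doubles the count per window, which is the power-versus-resolution trade-off described above.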

Normalization (removing technical variation): Raw read depth is dominated by technical factors unrelated to copy number. GC content alone produces a 2-5-fold range in coverage depth: in most PCR-based library preparation protocols, both GC-rich and AT-rich regions are under-represented relative to regions of intermediate GC content, creating systematic peaks and valleys that are consistent across samples from the same sequencing run. Normalization corrects this by computing the ratio of observed to expected depth for windows of similar GC content, then smoothing the GC-bias curve to remove fine-scale noise. Mappability correction excludes windows in which a substantial fraction of the sequence cannot be uniquely mapped by short reads, typically centromeres, telomeres, and segmental duplications. Without these corrections, the GC-bias signal would produce false CNV calls in the most GC-rich and GC-poor regions of the genome.
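As a rough sketch of the observed-to-expected idea, the snippet below divides each window's count by the median count of windows in the same GC bin. The function names, the 5% bin width, and the toy counts are assumptions for illustration; the smoothing of the GC-bias curve and the mappability filter mentioned above are deliberately left out.

```python
from collections import defaultdict
from statistics import median

def gc_bin(gc_fraction, bin_pct=5):
    """Map a GC fraction (0-1) to an integer bin of width bin_pct percent."""
    return int(round(gc_fraction * 100)) // bin_pct

def gc_normalize(counts, gc_fractions, bin_pct=5):
    """Return observed/expected ratios, where expected is the median count
    of all windows falling in the same GC bin."""
    by_bin = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        by_bin[gc_bin(gc, bin_pct)].append(count)
    expected = {b: median(v) for b, v in by_bin.items()}
    ratios = []
    for count, gc in zip(counts, gc_fractions):
        exp = expected[gc_bin(gc, bin_pct)]
        ratios.append(count / exp if exp > 0 else float("nan"))
    return ratios

# Toy data: 40%-GC windows average 100 reads, 60%-GC windows average 60.
# The last window in each group carries a heterozygous deletion.
counts = [100, 100, 100, 100, 50, 60, 60, 60, 60, 30]
gc = [0.40] * 5 + [0.60] * 5
ratios = gc_normalize(counts, gc)
print(ratios)
```

After normalization the two GC groups sit on the same baseline, so the 50% drop stands out in both even though the raw counts (50 and 30) differ.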

Segmentation: After normalization, the depth profile is divided into segments of consistent copy number using segmentation algorithms. Circular binary segmentation (CBS), implemented in the DNAcopy R package, recursively splits the genome into segments by testing adjacent windows for significant differences in mean depth. The PELT (Pruned Exact Linear Time) algorithm, used by GATK gCNV, is faster and scales linearly with the number of windows. Hidden Markov models (HMMs), used by XHMM for exome data, treat copy number state as a hidden variable inferred from the observed depth sequence. The choice of segmentation algorithm affects the balance between sensitivity (detecting small CNVs) and specificity (avoiding over-splitting the genome into many small segments that reflect noise rather than true CNVs). In practice, CBS produces more conservative calls with fewer false positives, while HMMs are more sensitive at the cost of increased false-positive rates for single-window events.
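To make the segmentation step concrete, here is a deliberately simplified sketch: plain recursive binary segmentation with a t-like split statistic. It is not the circular binary segmentation implemented in DNAcopy (which tests circular arcs and uses permutation-based significance), nor PELT or an HMM. The function names, the threshold of 4.0, the minimum segment size, and the toy log2 ratios are all assumptions.

```python
from math import sqrt
from statistics import mean, median

def noise_sd(values):
    """Robust noise estimate from differences between adjacent windows."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return median(diffs) / (0.6745 * sqrt(2)) or 1e-9

def segment(values, sigma, threshold=4.0, min_size=2, offset=0):
    """Recursively split where left and right means differ most, measured
    in noise units; returns a list of (start, end, mean) tuples."""
    n = len(values)
    best_i, best_score = None, 0.0
    for i in range(min_size, n - min_size + 1):
        left, right = values[:i], values[i:]
        se = sigma * sqrt(1 / len(left) + 1 / len(right))
        score = abs(mean(left) - mean(right)) / se
        if score > best_score:
            best_i, best_score = i, score
    if best_i is None or best_score < threshold:
        return [(offset, offset + n, mean(values))]
    return (segment(values[:best_i], sigma, threshold, min_size, offset)
            + segment(values[best_i:], sigma, threshold, min_size,
                      offset + best_i))

# Toy log2 ratios: diploid, a four-window single-copy loss, diploid again.
log2_ratios = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01,
               -0.98, -1.03, -1.00, -0.97,
               0.01, -0.02, 0.02, 0.00, -0.01, 0.03]
segments = segment(log2_ratios, noise_sd(log2_ratios))
for start, end, seg_mean in segments:
    print(start, end, round(seg_mean, 3))
```

Raising `threshold` or `min_size` trades sensitivity for specificity in the way described above: fewer, longer segments and fewer single-window calls.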

Genotype assignment from B-allele frequency: In addition to read depth, heterozygous SNP positions in the aligned reads provide B-allele frequency (BAF) information. In diploid regions, BAF clusters at 0.5 for heterozygous SNPs. In regions with copy number change, BAF deviates from 0.5—loss of heterozygosity (LOH) shifts BAF toward 0 or 1, while amplification shifts it in a pattern that depends on the allele copy ratio. Tools such as Control-FREEC and Canvas incorporate BAF alongside depth to distinguish CNV types (copy-neutral LOH vs. true deletion) and to detect CNVs in samples with normal cell contamination, where the depth signal alone may be ambiguous.
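The way BAF separates events that look alike in depth can be illustrated with a small allele-counting model. This is a sketch, assuming a germline-heterozygous SNP, normal cells carrying one copy of each allele, and a single tumor clone; the function and parameter names are hypothetical and do not correspond to any tool's API.

```python
def expected_signal(n_a, n_b, purity=1.0):
    """Return (expected BAF, expected depth ratio) at a heterozygous SNP.

    n_a, n_b: tumor copies of the A and B alleles.
    purity:   fraction of cells in the sample that are tumor cells.
    Normal cells are assumed to carry one A and one B copy.
    """
    b_copies = purity * n_b + (1 - purity) * 1
    total = purity * (n_a + n_b) + (1 - purity) * 2
    if total == 0:  # homozygous deletion in a pure sample: no reads at all
        return float("nan"), 0.0
    return b_copies / total, total / 2

# Compare a diploid region, a single-copy deletion, and copy-neutral LOH
# in a 50% pure sample (illustrative values).
for label, n_a, n_b in [("diploid", 1, 1), ("deletion", 0, 1), ("cn-LOH", 0, 2)]:
    baf, ratio = expected_signal(n_a, n_b, purity=0.5)
    print(label, round(baf, 3), round(ratio, 3))
```

In this model copy-neutral LOH leaves the depth ratio at 1.0 while shifting BAF away from 0.5, which is why depth-only callers cannot see it.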

The quantitative relationship between depth and CNV signal strength: The confidence of a CNV call depends on the signal-to-noise ratio of the observed depth deviation. For a heterozygous deletion, the expected depth reduction is 50%. The standard deviation of the read count in a window is approximately sqrt(mean_count) for Poisson-distributed counts, where mean_count is the expected number of reads in the window. At 30× WGS with 1 kb windows containing approximately 30 reads, the expected standard deviation is ~5.5 reads (18%), so a 50% reduction (15 reads) lies only about 2.7 standard deviations from the mean in a single window. Confidence comes from consecutive windows: a deletion spanning five such windows pools ~150 expected reads and reaches about 6 standard deviations, which is readily detectable. At 1× LP-WGS with 200 kb windows containing ~200 reads, the expected standard deviation is ~14 reads (7%), and the same 50% reduction sits at ~7 standard deviations. However, even after GC correction and normalization remove the systematic bias, the residual noise in LP-WGS is approximately 2-3× higher than the Poisson expectation due to fragmentation variability and alignment artifacts. This additional noise is the reason LP-WGS requires larger windows than WGS for equivalent detection sensitivity. Understanding this quantitative relationship helps researchers set realistic expectations for CNV detection: there is a direct trade-off between CNV size, sequencing depth, and detection confidence that cannot be overcome by improved bioinformatic normalization alone.
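The Poisson arithmetic can be checked directly. The sketch below assumes only the sqrt(count) noise model stated above; `noise_inflation` is a hypothetical parameter standing in for residual noise beyond the Poisson expectation, and the read counts passed in are illustrative.

```python
from math import sqrt

def poisson_z(expected_reads, fractional_change=0.5, noise_inflation=1.0):
    """Number of standard deviations separating an altered region from the
    diploid expectation, for a given expected read count."""
    sd = sqrt(expected_reads) * noise_inflation
    return fractional_change * expected_reads / sd

# Single window versus several pooled windows, then the effect of extra noise.
for reads in (30, 150, 200):
    print(reads, round(poisson_z(reads), 2))
print("200 reads, 2.5x noise:", round(poisson_z(200, noise_inflation=2.5), 2))
```

Because the z-score grows with the square root of the read count, quadrupling either depth or event size only doubles the confidence, which is the trade-off between CNV size, depth, and detection confidence in numeric form.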

Figure 1: Four CNV detection approaches — genome coverage, resolution, and optimal depth

Sequencing-Based CNV Detection — Four Approaches Compared

CNV detection from sequencing data can be performed using four approaches that differ fundamentally in genome coverage, depth, and cost. The choice between them determines what types and sizes of CNVs can be detected.

WGS at 30× coverage: Whole-genome sequencing at standard depth provides the most comprehensive CNV detection. Read depth is measured across the entire genome in 100 bp to 1 kb windows, corrected for GC content and mappability, and segmented to identify regions with significantly shifted coverage. WGS at 30× detects heterozygous deletions as small as 1-5 kb and amplifications as small as 5-10 kb, covering both coding and non-coding regions. The trade-off is sequencing cost—approximately 90-100 Gb per genome—which limits sample throughput for large cohort studies. Within the coding fraction of the genome, the resolution is typically 1-2 kb, sufficient to detect single-exon CNVs.

WES at 100-200×: Whole-exome sequencing captures only the coding fraction of the genome (~1-2%, approximately 35 Mb), but the higher read depth provides better statistical power for CNV detection within captured regions at comparable cost to WGS. The fundamental challenge of WES-based CNV detection is the non-uniform coverage inherent to target capture—the hybridization efficiency varies between probes and between regions within the same probe set, introducing systematic noise that is sample-specific and cannot be fully corrected by generic GC normalization. ECOLE (2023, Nature Communications), a deep-learning-based CNV caller for WES data, addresses this by training a convolutional neural network on simulated data that incorporates the capture-specific noise profile of each kit, achieving 20-30% fewer false positives than conventional WES CNV callers. For researchers using WES for CNV analysis, a minimum of 100× mean target coverage is recommended, with at least 30 normal samples included in the project for reference construction. Whole exome sequencing services offer coverage to 150-200× for CNV-optimized WES study design.

LP-WGS at 0.5-5×: Low-pass whole-genome sequencing sequences the entire genome at a fraction of standard depth, making it the most cost-effective CNV screening method. At 1× coverage, approximately 3 Gb per sample, LP-WGS detects CNVs larger than 50-100 kb with sensitivity comparable to chromosomal microarray—making it a viable alternative for clinical CNV screening where resolution requirements are moderate. A 2025 benchmark in Briefings in Bioinformatics demonstrated that at 1× with 200 kb windows, LP-WGS achieves >90% sensitivity for deletions >100 kb and >85% for duplications >150 kb. The window size parameter is the key lever—larger windows improve sensitivity at the cost of breakpoint resolution, and the optimal setting scales inversely with depth (200 kb at 1×, 50 kb at 5×). For projects that need to balance CNV detection with sample throughput, LP-WGS provides the best per-sample economics.
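For planning purposes, the yield and window-size figures above follow from simple scaling. The sketch below assumes a 3.1 Gb genome and a strictly inverse relationship between window size and depth, anchored at 200 kb for 1×; both function names and defaults are illustrative assumptions, not settings from any particular pipeline.

```python
def sequencing_yield_gb(depth, genome_size_gb=3.1):
    """Approximate raw yield (Gb) needed per sample for a target depth."""
    return depth * genome_size_gb

def scaled_window_kb(depth, reference_window_kb=200.0, reference_depth=1.0):
    """Window size that keeps the expected reads per window constant,
    anchored at reference_window_kb for reference_depth."""
    return reference_window_kb * reference_depth / depth

for depth in (1, 5, 30):
    print(depth, round(sequencing_yield_gb(depth), 1),
          round(scaled_window_kb(depth), 1))
```

Strict inverse scaling gives 40 kb at 5×, the same order as the 50 kb quoted above; in practice the window is usually rounded to a convenient size and tuned against the noise observed in the data.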

Long-read sequencing (PacBio HiFi / Nanopore): Long reads spanning 10-20 kb (HiFi) or exceeding 100 kb (Nanopore) approach CNV detection from a fundamentally different principle: instead of inferring copy number from depth, they can physically span CNV breakpoints, providing base-pair resolution of the breakpoint junction. This is particularly valuable for CNVs in repetitive regions—segmental duplications, the MHC region, tandem gene arrays—where short-read depth signals are unreliable. PacBio's HiFiCNV caller (2024) is the first tool optimized for long-read CNV detection. A 2024 benchmark found that HiFi reads at 15× detected ~30% more CNVs than short-read WGS at 30× in the same samples, with additional calls concentrated in segmental duplications. The trade-off is cost—long-read sequencing at equivalent genome coverage is 3-5× more expensive than short-read methods.

Figure 2: CNV detection sensitivity as a function of sequencing depth and CNV size

Bioinformatics Tools for CNV Detection — Algorithmic Basis and Benchmark Performance

A 2024 benchmark study in Genome Biology evaluated six CNV calling tools on a hyper-diploid cancer cell line (HCC1395) with matched WGS and WES data, generating actionable performance data for tool selection.

CNVkit: Designed for WES data with paired tumor-normal samples. CNVkit constructs a pooled reference from normal samples, corrects for GC bias, and segments the coverage signal using circular binary segmentation. It is the most widely validated WES CNV tool in cancer genomics. Benchmark result: >90% precision for WES somatic CNV detection with matched normal controls. Best for: project-level cancer WES analysis with >10 normal samples.

GATK gCNV: Developed for population-scale WGS germline CNV detection. Uses a Bayesian model with PELT segmentation that learns the coverage profile from a cohort (not requiring matched normal controls for each sample). Standard for large WGS cohorts (100+ samples). Benchmark result: highest recall (>85%) for rare germline CNVs due to cohort-aware denoising. Best for: large-scale WGS germline CNV studies where no normal controls are available.

Control-FREEC: Identifies CNVs using both read depth and B-allele frequency, enabling detection in the absence of matched normal controls. Its allele-specific capability is useful for samples with normal cell contamination—BAF can reveal CNVs where depth alone is ambiguous. Benchmark result: moderate precision (~80%) but best flexibility across data types. Best for: projects with limited control samples or where allelic information is needed.

Canvas: Illumina's recommended CNV caller for WGS and WES. Integrates depth with GC/mappability correction and BAF from SNP reads. Optimized for Illumina DRAGEN pipeline. Benchmark result: >90% precision for WGS somatic CNVs, comparable to CNVkit for WES. Best for: Illumina-only workflows and automated pipeline integration.

ECOLE: Deep-learning-based WES CNV caller (convolutional neural network). Trained on simulated data with capture-specific noise, achieving lower false-positive rates than CNVkit for single-exon CNVs. Benchmark result: 20-30% fewer false positives than CNVkit for WES. Best for: WES projects requiring high specificity, particularly where validation capacity is limited.

HiFiCNV: PacBio's long-read CNV caller for HiFi reads. Segments long-read coverage after GC correction, benefiting from the high mappability of long reads. Early-stage tool. Benchmark result: detects ~30% more CNVs in repetitive regions than short-read methods. Best for: projects using PacBio HiFi data requiring CNV detection in complex genomic regions.

CNV detection services support each of these tools with validated workflows, enabling researchers to select the appropriate tool based on their data type and project goals.

Figure 3: CNV bioinformatics pipeline — key processing steps from raw reads to copy number calls

Key Challenges That Compromise CNV Detection Accuracy

CNV detection is more sensitive to technical artifacts than SNV detection because read depth is affected by multiple factors independent of biological copy number. Understanding these confounders and applying appropriate corrections is essential for obtaining reliable calls.

GC bias: PCR amplification efficiency varies with GC content over a 2-5-fold range, creating systematic depth variation that mimics CNV signals. GC correction computes observed-to-expected depth ratios within GC-matched windows, but this correction is imperfect for fragmented or low-input DNA samples. Residual bias after correction accounts for a substantial fraction of false-positive CNV calls in both WGS and WES data, particularly in GC-rich promoter regions and GC-poor intergenic regions.

Mappability: Approximately 10-15% of the human genome—centromeres, telomeres, ribosomal DNA arrays, and segmental duplications—cannot be uniquely mapped by short reads and must be excluded from CNV analysis. CNVs in these regions are systematically missed. For WES, the inaccessible fraction depends on the capture kit design and can reach 15-20% of targeted regions.

Matched normal requirement: Single-sample CNV calling, which identifies CNVs from one sample's depth without reference comparison, has limited accuracy because technical depth variation cannot be distinguished from biological CNV signal in isolation. Standard practice uses either a matched normal control (somatic cancer) or a pooled reference from ≥10 normal samples (germline WGS) or ≥30 samples (germline WES). Projects that cannot meet these minimums can use tools with built-in model-based normalization (GATK gCNV) as an alternative.

FFPE artifacts: FFPE DNA has average fragment sizes <300 bp and deaminated bases from formalin cross-linking. These properties increase depth variance and reduce CNV signal-to-noise. A 2024 benchmark found CNV detection precision decreases by 15-25% for FFPE compared to fresh-frozen tissue. Mitigation strategies include matched FFPE normal controls, increased sequencing depth, and specialized normalization methods for fragmented DNA.

Tumor purity and heterogeneity: In cancer samples, the effective CNV signal is the product of CNV state, the fraction of tumor cells carrying that CNV, and the sample purity. A single-copy gain or loss present in 40% of tumor cells in a 60% pure sample affects only 24% of all cells and produces a depth change of only 12% from diploid, which is indistinguishable from noise in most pipelines. BAF-based tools (Control-FREEC) maintain detection at purity down to ~20%, while depth-only methods require >30% for somatic CNV calling.
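The dilution of the depth signal by purity and subclonality is a short calculation. The sketch below assumes a diploid background and a single altered subclone; the function and parameter names are hypothetical.

```python
def expected_depth_ratio(tumor_cn, purity, clonal_fraction=1.0, normal_cn=2):
    """Expected depth relative to diploid when only a fraction of cells
    carries the altered copy number.

    purity:          fraction of cells in the sample that are tumor cells.
    clonal_fraction: fraction of tumor cells carrying the CNV.
    """
    altered = purity * clonal_fraction  # fraction of all cells with the CNV
    mean_cn = altered * tumor_cn + (1 - altered) * normal_cn
    return mean_cn / normal_cn

# Single-copy loss and gain in 40% of tumor cells of a 60% pure sample.
loss = expected_depth_ratio(1, purity=0.6, clonal_fraction=0.4)
gain = expected_depth_ratio(3, purity=0.6, clonal_fraction=0.4)
print(round(loss, 2), round(gain, 2))
```

Running the same function across a grid of purity values is a quick way to decide, before sequencing, whether the expected shift will clear the noise floor of the chosen depth and window size.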

Low-coverage limits for specific CNV types: Different CNV types have different detection limits at the same depth. Amplifications (gains) produce weaker signals than deletions on the log2 ratio scale used by most callers: a single-copy gain (copy number 3) produces a 50% increase in expected depth, a log2 ratio of about +0.58, compared with the 50% decrease, a log2 ratio of -1.0, of a single-copy deletion. At 30× WGS, both are detectable, but at 1× LP-WGS, amplifications require 2-3× larger events to reach comparable confidence. Homozygous deletions produce the strongest signal (depth approaching zero at the target) and are detectable even at the lowest depths. Understanding these type-specific detection limits is important for planning projects that aim to detect specific classes of CNVs.
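The asymmetry between gains and losses is easiest to see on the log2 scale. A minimal sketch, assuming a diploid baseline and a pure, clonal sample; the function name is illustrative.

```python
from math import log2

def log2_ratio(copy_number, ploidy=2):
    """Expected log2 depth ratio for an integer copy number."""
    if copy_number == 0:
        return float("-inf")  # homozygous deletion: no reads expected
    return log2(copy_number / ploidy)

for cn in (0, 1, 2, 3, 4):
    print(cn, log2_ratio(cn))
```

Each additional gained copy adds less to the log2 ratio than the previous one, whereas each lost copy moves it further, which is one reason gains need larger events or more depth to reach the same confidence.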

Figure 4: Tumor CNV analysis challenges — the combined effect of purity, sample quality, and sequencing depth

CNV Analysis in Cancer — Somatic vs. Germline Considerations

Somatic CNV detection in cancer differs from germline CNV detection in three critical respects: the requirement for a matched normal sample, the presence of subclonal CNVs, and the confounding effect of tumor ploidy changes.

A matched normal sample (blood or adjacent normal tissue) is essential for distinguishing somatic CNVs from inherited germline CNVs and from technical artifacts. The matched normal provides sample-specific correction for GC bias, capture efficiency, and sequencing run effects. In its absence, pooled normal references can partially compensate but with reduced sensitivity. For cancer studies, LP-WGS at 1-2× is being rapidly adopted for large-cohort somatic CNV profiling: a 2025 study of 2,000 cancer samples found that LP-WGS at 1× detected focal amplifications and homozygous deletions with >85% agreement with 30× WGS for events larger than 100 kb.

Copy number signatures—genome-wide patterns of amplifications and deletions linked to specific mutational processes—provide information beyond individual gene alterations. The HRD (homologous recombination deficiency) score derived from genome-wide CNV patterns is established as a predictive biomarker for PARP inhibitor therapy response. Clinical WGS services support both standard-depth and low-pass configurations for cancer CNV analysis, with matched tumor-normal paired protocols available.

Single-cell CNV detection: An emerging frontier in CNV analysis is single-cell CNV profiling, which resolves the intra-tumor heterogeneity that is averaged out in bulk sequencing. Single-cell WGS at low-pass depth (0.1-1× per cell) combined with copy number inference algorithms can reconstruct the clonal architecture of a tumor by detecting CNV differences between individual cells. This approach has been used to trace metastatic dissemination patterns and to identify rare subclones carrying resistance-conferring CNVs that are invisible in bulk analysis. The trade-off is that single-cell data has higher noise per cell than bulk data, requiring specialized normalization and calling algorithms and larger cell numbers for confident CNV assignment. Projects considering single-cell CNV analysis should budget for at least 100-500 cells per sample to achieve sufficient statistical power for subclone detection.

Figure 5: CNV analysis tools selection guide — matching the tool to data type and research goal

Long-Read Sequencing for CNV Detection

Long-read sequencing addresses two fundamental limitations of short-read CNV detection: the inability to map reads uniquely in repetitive regions and the inability to span breakpoints. PacBio HiFi reads at 10-20 kb with >99.9% accuracy can directly span CNV breakpoints, while Nanopore ultra-long reads exceeding 100 kb can span entire complex rearrangements.

A 2024 benchmark of long-read CNV detection found that HiFi reads at 15× detected approximately 30% more CNVs than short-read WGS at 30× in the same samples, with additional calls concentrated in segmental duplications and other repetitive regions. Breakpoint resolution improved from 1-10 kb (short-read) to within 100 bp (HiFi).

Long-read CNV detection is most valuable when the research focus involves complex genomic regions—segmental duplications, the MHC region, tandemly duplicated gene families, or known CNV hotspots. For projects focused on the >90% of the genome accessible to short reads, short-read methods remain the practical choice due to lower cost and more mature tooling.

Hybrid strategies for comprehensive CNV analysis: For projects requiring both cost-effective genome-wide screening and high-resolution breakpoint analysis, a hybrid approach combining short-read LP-WGS at 1-2× for initial CNV identification with targeted long-read sequencing of CNV breakpoint regions provides the most efficient use of resources. The LP-WGS screen identifies candidate CNVs and estimates their approximate boundaries, while long-read sequencing resolves the precise breakpoint and identifies the underlying sequence architecture (e.g., NAHR between specific repeat elements). This staged approach has been used successfully in clinical CNV validation pipelines and in population studies where comprehensive CNV characterization is needed but cost constraints limit the use of genome-wide long-read sequencing.

CNV Validation — The Role of Orthogonal Methods

Computational CNV calls should be validated by orthogonal methods before drawing strong biological conclusions. Droplet digital PCR (ddPCR) provides absolute copy number quantification at specific loci and is practical for validating 5-20 candidates per project, with detection sensitivity sufficient to confirm single-copy gains or losses in samples with 50% or greater tumor purity. Chromosomal microarray (CMA), including array CGH (aCGH), remains the genome-wide gold standard with >95% sensitivity and specificity for CNVs >50 kb, and it serves as the reference platform for most clinical CNV validation pipelines. For projects reporting novel disease-associated CNVs or clinical-grade results, validation by at least one orthogonal method is standard practice before publication or clinical reporting.
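For ddPCR, copy number at a candidate locus is usually derived from the ratio of target to reference concentrations. The sketch below assumes a reference assay located in a region known to be diploid; the function name and the concentrations are hypothetical example values, not measurements.

```python
def ddpcr_copy_number(target_conc, reference_conc, reference_copies=2):
    """Copy number of the target locus from ddPCR concentrations
    (copies per microliter), relative to a reference of known copy number."""
    if reference_conc <= 0:
        raise ValueError("reference concentration must be positive")
    return reference_copies * target_conc / reference_conc

# Hypothetical measurements against a reference at 900 copies/uL.
print(ddpcr_copy_number(450, 900))    # consistent with a single-copy loss
print(ddpcr_copy_number(1350, 900))   # consistent with a single-copy gain
print(ddpcr_copy_number(675, 900))    # intermediate value, e.g. a loss diluted by normal cells
```

Intermediate values are expected in impure tumor samples: a clonal single-copy loss at 50% purity averages 1.5 copies per cell, so purity should be taken into account when setting confirmation thresholds.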

CNV Databases for Annotation and Interpretation

UCSC Genome Browser: Primary visualization platform for CNV calls in genomic context, with tracks for repeat elements, known genes, segmental duplications, and population CNV frequency from 1000 Genomes and gnomAD.

DECIPHER Database: Curates CNVs associated with genetic disorders, linking each variant to clinical phenotypes and gene content for pathogenicity assessment.

ClinGen Database: Gene-level dosage sensitivity scores—systematic assessments of haploinsufficiency and triplosensitivity for each gene—guide interpretation of whether a gene-containing CNV is likely pathogenic. Genomic data analysis services integrate these databases into automated annotation workflows.

Computational Resources for CNV Analysis

CNV analysis is computationally modest compared to WGS alignment or de novo assembly. A 30× WGS sample requires approximately 1-2 hours for CNV calling (after alignment) on a standard compute node. WES CNV calling with CNVkit requires 30-60 minutes per sample. LP-WGS at 1× processes in 15-30 minutes. GATK gCNV requires cohort-level processing — after the model is built from the cohort (1-2 hours), individual sample calling is rapid (10-20 minutes per sample). Storage requirements are dominated by the aligned BAM/CRAM files needed for depth extraction. Projects using cloud computing should budget compute resources based on sample number and data volume.
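A rough compute budget can be derived from the per-sample ranges quoted above. This is a sketch only: the workflow keys and function names are hypothetical, the ranges are simply the figures from the preceding paragraph, and alignment time, storage, and queueing overhead are not included.

```python
# Per-sample CNV calling time in hours (low, high), after alignment.
PER_SAMPLE_HOURS = {
    "wgs_30x": (1.0, 2.0),
    "wes_cnvkit": (0.5, 1.0),
    "lpwgs_1x": (0.25, 0.5),
}

def node_hours(workflow, n_samples):
    """Return (low, high) node-hours for per-sample CNV calling."""
    low, high = PER_SAMPLE_HOURS[workflow]
    return n_samples * low, n_samples * high

def gcnv_node_hours(n_samples, model_hours=(1.0, 2.0), per_sample_min=(10, 20)):
    """Return (low, high) node-hours for a cohort model plus per-sample calling."""
    low = model_hours[0] + n_samples * per_sample_min[0] / 60
    high = model_hours[1] + n_samples * per_sample_min[1] / 60
    return low, high

print(node_hours("wes_cnvkit", 40))
print(gcnv_node_hours(120))
```

Because per-sample calling is independent across samples, these totals can be divided across nodes; the cohort model step is the only part that has to finish before calling begins.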

Figure 6: CNV analysis project roadmap — from research question to CNV interpretation

FAQ

What sequencing depth is needed for CNV detection from WGS?
For detection of CNVs larger than 5-10 kb across the genome, 15-30× coverage is standard. LP-WGS at 1-2× detects CNVs larger than 50-100 kb for clinical screening where cost efficiency is prioritized.

How does FFPE sample quality affect CNV detection?
FFPE samples have fragmented DNA and base damage that increase read depth noise, reducing CNV detection sensitivity by 15-25% compared to fresh-frozen tissue. Using matched FFPE normal controls and increasing sequencing depth partially compensates.

What is the minimum tumor purity for somatic CNV detection?
Most tools require purity above 20-30% for reliable detection. BAF-based tools (Control-FREEC) perform better at low purity than depth-only methods.

How do I choose between CNVkit and GATK gCNV?
GATK gCNV is recommended for large-scale WGS germline CNV detection where a population model can be built. CNVkit is recommended for project-level WES cancer CNV with matched normal controls.

Can I detect CNVs from RNA-seq data?
RNA-seq CNV detection is possible but less reliable than DNA-based methods due to expression variation. Validation by DNA-based methods is recommended for any RNA-seq CNV findings.

What is the minimum reference size for pooled-normal CNV analysis?
Minimum 10 normal samples for WGS and 30 for WES, reflecting the higher noise level of target capture. Below these thresholds, model-based methods (GATK gCNV) are preferred.

For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
