Coverage Requirements for Whole Exome Sequencing Projects

Whole exome sequencing (WES) supports research on genetic diseases, tumors, and complex diseases by capturing and sequencing the protein-coding regions of genes (exons). Coverage requirements should be set according to research objectives, sample type, and clinical needs. The key parameters and technical specifications follow.

I. Technical Principles and Core Parameters

Exon Capture Technology

Probe Design: RNA or DNA probes (such as Agilent SureSelect's 120-mer RNA probes) primarily cover the coding sequences (CDS) of the genome (approximately 30 Mb, 1%-2% of the genome), which form the core of the whole exome. Some extended or "Plus" versions of commercial capture kits (e.g., Agilent SureSelect Human All Exon V8) may additionally include untranslated regions (5'UTR/3'UTR) and disease-associated intronic regions (e.g., splice sites), but these are not part of the standard WES definition.
Capture Efficiency: On-target bases should account for ≥60% of all aligned bases. Low-input samples (e.g., 50 ng DNA) can be prepared with transposase-based library construction.

Sequencing Depth and Coverage

Coverage Depth:
- Germline Variations: ≥50x (100x recommended), Q30 value ≥90%.
- Somatic Mutations (Tumors): ≥200x (tumor tissue), matched with normal samples for filtering germline variations.
Coverage:
- Germline Variations: It is recommended to achieve an average sequencing depth of ≥100x, with ≥95% of the target regions reaching a coverage depth of ≥20x, to ensure reliable detection of both homozygous and heterozygous variants.
- Low-Frequency Mutations (e.g., Somatic Variants): An average sequencing depth of ≥200x is recommended to improve detection sensitivity for variants with an allele frequency below 5%.
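The link between depth and sensitivity for low-frequency variants can be illustrated with a simple binomial model. This is a minimal sketch, not the behavior of any particular variant caller: the function name, the requirement of at least 5 supporting reads, and the assumption of independent, error-free reads are all illustrative assumptions.

```python
from math import comb

def detection_probability(depth: int, vaf: float, min_alt_reads: int = 5) -> float:
    """Probability of seeing at least `min_alt_reads` variant reads at one site.

    Assumes each read independently carries the variant with probability
    `vaf` (binomial sampling) and ignores sequencing error.
    """
    # Probability of observing fewer than `min_alt_reads` variant reads.
    p_below = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below

# Compare how the chance of sampling enough variant reads changes with depth.
for depth in (50, 100, 200, 500):
    p = detection_probability(depth, vaf=0.05)
    print(f"{depth}x at 5% allele frequency: {p:.2f}")
```

Under this model the probability rises steadily with depth, which is the rationale for requesting deeper coverage when the expected allele frequency is low.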

Data Quality Indicators

Q30 Score: The proportion of bases with a Q30 quality score in the raw sequencing data should be ≥85% (a commonly accepted threshold on the Illumina platform). This metric indirectly reflects the base-calling reliability of the primary data.
Fold-80 Base Penalty: ≤1.4 (ideally ≤1.2). This metric reflects the uniformity of coverage depth; values closer to 1 indicate more even coverage.
PCR Duplication Rate: This should ideally be kept ≤10%. A high duplication rate may result from insufficient starting DNA material or excessive PCR amplification cycles, which reduces the effective utilization of sequencing data.
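The indicators above can be computed from per-base target depths and read counts. The sketch below is a simplified illustration: the function names are hypothetical, the percentile uses a basic nearest-rank rule, and the fold-80 calculation is an approximation of what tools such as Picard report (Picard restricts the calculation to targets with non-zero coverage).

```python
from statistics import mean

def fraction_at_least(depths, threshold):
    """Fraction of target bases covered at `threshold` depth or more."""
    return sum(d >= threshold for d in depths) / len(depths)

def fold80_penalty(depths):
    """Mean depth divided by the depth that 80% of bases meet or exceed.

    The depth met by 80% of bases is the 20th percentile of per-base depth.
    """
    ordered = sorted(depths)
    p20 = ordered[int(0.2 * (len(ordered) - 1))]  # simplified nearest-rank
    if p20 == 0:
        return float("inf")
    return mean(ordered) / p20

def duplication_rate(duplicate_reads, total_reads):
    """Share of reads flagged as PCR/optical duplicates."""
    return duplicate_reads / total_reads

# Toy example: 20% of bases at 70x, 80% at 100x; 8 of 100 reads duplicated.
depths = [70] * 20 + [100] * 80
print("mean depth:", mean(depths))
print("fraction >= 20x:", fraction_at_least(depths, 20))
print("fold-80:", round(fold80_penalty(depths), 3))
print("duplication rate:", duplication_rate(8, 100))
```

In practice these values would be read from tool output (for example Picard or mosdepth reports) rather than recomputed by hand; the sketch only makes the definitions concrete.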

Coverage of target regions across WES and WGS samples (Barbitoff YA et al., 2020)

II. Coverage Requirements for Different Research Scenarios

Germline Variation Detection

Objective: Detection of single nucleotide variants (SNVs), insertions/deletions (InDels), and other mutations associated with genetic disease.
Technical Requirements:
- The capture design should cover 93% of genes in the OMIM database and 96% of sites in ClinVar. Detecting CNVs in the DMD gene requires a denser probe design.
- Data volume ≥ 10 Gb. Valid data must cover the CDS region and known pathogenic non-coding regions (e.g., introns, UTRs).
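A data-volume requirement can be related to expected mean depth with simple arithmetic. The following is a rough sketch under stated assumptions: the function name and parameters are hypothetical, all sequenced bases are assumed to align, and losses from overlapping read pairs or low-quality bases are ignored, so the result is an optimistic estimate.

```python
def expected_mean_depth(data_gb, target_mb, on_target=0.60, dup_rate=0.10):
    """Rough mean depth over the target from raw data volume.

    data_gb   : sequenced bases in gigabases
    target_mb : capture target size in megabases
    on_target : fraction of bases falling on target (assumed)
    dup_rate  : fraction of reads removed as duplicates (assumed)
    """
    usable_bases = data_gb * 1e9 * on_target * (1 - dup_rate)
    return usable_bases / (target_mb * 1e6)

# 10 Gb of data over a ~30 Mb CDS target, 60% on-target, 10% duplicates.
print(f"estimated mean depth: {expected_mean_depth(10, 30):.0f}x")
```

Larger capture designs that add UTRs or intronic regions spread the same data over more bases, so the target size should be taken from the kit actually used.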

Tumor Somatic Mutation Analysis

Objective: Assessment of SNVs, copy number variations (CNVs), and tumor mutational burden (TMB).
Technical Requirements:
- Tumor Samples: An average sequencing depth of ≥200x is recommended. Furthermore, it is essential to include a paired normal tissue sample (e.g., adjacent normal tissue or peripheral blood), with a recommended depth of ≥100x, for filtering out germline background variants. During analysis, tumor purity must be taken into account to calculate the effective sequencing depth.
- Use a WES-based CNV algorithm to detect large CNV segments, and validate them with MLPA or long-range PCR.
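The effect of tumor purity on effective depth can be sketched as follows. This is an illustrative model, not a prescribed method: the function names are hypothetical, and the defaults assume a clonal mutation on one copy in a diploid, copy-neutral region.

```python
def expected_vaf(purity, tumor_copies=2, mutant_copies=1, normal_copies=2):
    """Expected variant allele fraction of a clonal somatic mutation.

    purity is the fraction of cells in the sample that are tumor cells.
    """
    tumor_alleles = purity * tumor_copies
    normal_alleles = (1 - purity) * normal_copies
    return purity * mutant_copies / (tumor_alleles + normal_alleles)

def tumor_derived_depth(mean_depth, purity, tumor_copies=2, normal_copies=2):
    """Portion of the mean depth contributed by tumor-cell DNA."""
    tumor_alleles = purity * tumor_copies
    normal_alleles = (1 - purity) * normal_copies
    return mean_depth * tumor_alleles / (tumor_alleles + normal_alleles)

# A 200x tumor sample at 40% purity.
print(f"expected VAF: {expected_vaf(0.4):.2f}")
print(f"tumor-derived depth: {tumor_derived_depth(200, 0.4):.0f}x")
```

At 40% purity only a fraction of the nominal depth comes from tumor DNA, which is why low-purity samples may need deeper sequencing than the nominal recommendation.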

Complex Diseases and Multigene Association Analysis

Objective: Identification of multigene interactions and low-frequency pathogenic mutations.
Key technical points:
- A normalized coverage of ≥0.3 is required across ≥80% of the target region.
- WGS data can be integrated to supplement variant information in non-coding regions.
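The normalized-coverage criterion can be checked with a few lines of code. This sketch assumes that normalized coverage means per-base depth divided by the sample's mean depth over the target; that definition and the function names are assumptions made for illustration.

```python
def normalized_coverage(depths):
    """Per-base depth divided by the mean depth (assumed definition)."""
    mean_depth = sum(depths) / len(depths)
    if mean_depth == 0:
        return [0.0 for _ in depths]
    return [d / mean_depth for d in depths]

def passes_evenness(depths, min_norm=0.3, min_fraction=0.8):
    """Return (passed, fraction of bases with normalized coverage >= min_norm)."""
    norm = normalized_coverage(depths)
    fraction = sum(n >= min_norm for n in norm) / len(norm)
    return fraction >= min_fraction, fraction

# Toy target: nine well-covered bases and one poorly covered base.
ok, fraction = passes_evenness([100] * 9 + [10])
print(ok, round(fraction, 2))
```

Because the threshold is relative to the sample mean, this check measures evenness rather than absolute depth, so it complements a fixed cutoff such as ≥20x.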

III. Experimental Procedure and Quality Control

Sample Processing and Library Construction

DNA Requirements: Concentration ≥ 50 ng/μL, purity OD260/OD280 ≈ 1.8; FFPE samples require assessment of degradation degree.
Library Construction Method: Low-cycle PCR library construction ensures homogeneity; a 1:1 hybridization system is used in the capture stage.

Sequencing and Data Analysis

Platform Selection: Illumina NovaSeq platform, PE150 sequencing strategy, at least 8-10 Gb of data per sample.
Analysis Flow:
- Quality Control: A two-stage quality control process is recommended to ensure data reliability: (1) Raw Data QC: Use FastQC to assess base quality distribution, adapter contamination, GC content, and overrepresented sequences. (2) Post-Alignment QC: Use samtools flagstat to evaluate the alignment rate; use Picard CollectInsertSizeMetrics and CollectGcBiasMetrics to assess insert size distribution and GC bias; use Picard MarkDuplicates to calculate the PCR duplication rate; use Qualimap or mosdepth for a comprehensive evaluation of the coverage depth and uniformity across the target regions.
- Variation Detection: GATK HaplotypeCaller identifies SNVs/InDels; CNVkit or Control-FREEC detects CNVs.
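The post-alignment part of this flow can be sketched as a short command plan. The example below only assembles and prints commands by default; the file names, the sample label, and the helper functions are hypothetical, and it assumes samtools, mosdepth, and GATK4 are installed and that the input BAM is coordinate-sorted with read groups. Real pipelines typically add further steps (for example base quality recalibration) that are omitted here.

```python
import shlex
import subprocess

def build_commands(sample, bam, reference, targets_bed):
    """Assemble post-alignment QC and variant-calling commands as argument lists."""
    dedup_bam = f"{sample}.dedup.bam"
    return [
        # Alignment-rate summary.
        ["samtools", "flagstat", bam],
        # Mark PCR duplicates and write a duplication metrics file.
        ["gatk", "MarkDuplicates", "-I", bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt"],
        # Index the deduplicated BAM for downstream tools.
        ["samtools", "index", dedup_bam],
        # Depth over the capture targets; `sample` is the output prefix.
        ["mosdepth", "--by", targets_bed, sample, dedup_bam],
        # SNV/InDel calling restricted to the capture targets.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", dedup_bam,
         "-L", targets_bed, "-O", f"{sample}.vcf.gz"],
    ]

def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

# Dry run with placeholder file names.
run_pipeline(build_commands("sample1", "sample1.bam", "ref.fasta", "targets.bed"))
```

Keeping each command as an argument list avoids shell quoting problems, and the dry-run default makes it easy to review the plan before anything is executed.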

Variant Annotation and Filtering

Database Integration: ClinVar, OMIM, gnomAD, etc., combined with ACMG guidelines for pathogenicity grading (P/LP/VUS/LB/B).
False Positive Filtering: Sanger sequencing to verify positive results, qPCR or MLPA to confirm CNVs.

Modeling of CDS coverage identifies key determinants of coverage evenness (Barbitoff YA et al., 2020)

IV. Special Samples and Technical Challenges

FFPE Sample Processing

DNA Quality Assessment: Fragment size should be assessed with an Agilent Bioanalyzer. If degradation is ≥30%, the number of amplification cycles needs to be increased.
Library Construction Optimization: Use a low starting amount protocol (50 ng DNA) and optimize library amplification conditions.

Micro Sample Analysis

Neonatal Dried Blood Spots: Use transposase-based library construction (e.g., Illumina Nextera), which works with DNA inputs of ≤50 ng.
Tumor Heterogeneity Studies: Requires multi-region sampling, combined with UMI molecular tagging technology to reduce amplification bias.
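The idea behind UMI tagging can be shown with a small grouping example: reads that share both mapping position and UMI are treated as copies of one original molecule. This is a simplified sketch with hypothetical names and toy data; it uses exact UMI matching only, whereas dedicated tools also tolerate sequencing errors within the UMI.

```python
from collections import Counter

def count_umi_families(reads):
    """Group reads by (chrom, start, strand, umi) and count reads per group.

    Each group is assumed to derive from one original DNA molecule.
    """
    return Counter(
        (r["chrom"], r["start"], r["strand"], r["umi"]) for r in reads
    )

# Toy reads: the first two are PCR copies of the same molecule.
reads = [
    {"chrom": "chr1", "start": 1000, "strand": "+", "umi": "ACGTAC"},
    {"chrom": "chr1", "start": 1000, "strand": "+", "umi": "ACGTAC"},
    {"chrom": "chr1", "start": 1000, "strand": "+", "umi": "TTGCAA"},
]
families = count_umi_families(reads)
print("reads:", len(reads), "unique molecules:", len(families))
```

Counting molecules rather than reads keeps heavily amplified fragments from being over-represented when allele frequencies are estimated.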

Complex Region Capture

High GC Regions: Use paired-end probe design or increase probe density, combined with PCR-free library construction to reduce GC bias.
Pseudogene Interference: Increase probe density over sequences that differ between the gene and its pseudogenes, combined with long-read sequencing (e.g., PacBio) to verify structural variations.

V. Data Analysis Depth and Clinical Application

Variation Annotation and Filtering Strategies

Functional Annotation: Integrating databases such as ClinVar, OMIM, and gnomAD, and combining with ACMG guidelines for pathogenicity grading (P/LP/VUS/LB/B).
False Positive Filtering: Verifying positive results through Sanger sequencing or qPCR to reduce the false diagnosis rate.

CNV Detection Technical Details

Algorithm Selection: XHMM or CNVkit is recommended for exon-level CNVs. Large CNV fragments (>1 Mb) are validated using WGS data or MLPA.
Tumor-Specific Analysis: Calculating the tumor/normal tissue copy number ratio, filtering germline polymorphisms (e.g., >5% frequency), and identifying LOH (loss of heterozygosity) regions.
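The tumor/normal ratio and the germline frequency filter can be sketched as below. The function names, the dictionary layout with a `pop_af` key, and the toy numbers are hypothetical; dedicated tools such as CNVkit apply additional bias corrections and segmentation that this sketch does not attempt.

```python
from math import log2

def log2_copy_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """Log2 tumor/normal depth ratio for one region, scaled by library size.

    Depths must be non-zero; zero-coverage regions need separate handling.
    """
    tumor_scaled = tumor_depth / tumor_total
    normal_scaled = normal_depth / normal_total
    return log2(tumor_scaled / normal_scaled)

def drop_common_germline(variants, max_pop_af=0.05):
    """Keep variants whose population allele frequency is at most max_pop_af.

    Variants without a recorded frequency are kept.
    """
    return [v for v in variants if v.get("pop_af", 0.0) <= max_pop_af]

# A region with half the normalized depth in tumor gives a ratio of about -1.
print(round(log2_copy_ratio(50, 100, 1000, 1000), 2))

variants = [
    {"id": "var1", "pop_af": 0.12},   # common polymorphism, removed
    {"id": "var2", "pop_af": 0.001},  # rare, kept
    {"id": "var3"},                   # no frequency recorded, kept
]
print([v["id"] for v in drop_common_germline(variants)])
```

A log2 ratio near 0 suggests no copy-number change relative to the matched normal, while clearly negative or positive values suggest loss or gain, subject to tumor purity.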

Data Visualization and Reporting

Interactive IGV Analysis: Generating coverage depth maps and variant site distribution maps, supporting multi-sample comparisons (e.g., family co-segregation analysis).
Report Template: Classify variants according to ACMG guidelines (pathogenic/likely pathogenic/uncertain significance), and include validation methods (e.g., Sanger sequencing) and genetic counseling recommendations.

VI. Ethics and Compliance Management

Management of Human Genetic Resources

Sample Collection: Informed consent is required, specifying the intended use (e.g., research/clinical). Samples sent externally must be registered with the Ministry of Science and Technology.
Data Storage: Raw data (FASTQ) must be encrypted and stored for ≥2 years. Cross-border transfer is prohibited, and compliance with the "Regulations on the Management of Human Genetic Resources" is mandatory.

Quality Control Certification

Laboratory Qualification: CAP/CLIA accreditation and regular participation in interlaboratory quality assessments are required.
Third-Party Validation: Key results (e.g., pathogenic mutations) must be retested by independent institutions to ensure accuracy.

VIII. Case Analysis