Quality Control in Whole Exome Sequencing: From Sample to Data

Whole exome sequencing (WES) targets the protein-coding regions (exons), which make up approximately 1-2% of the genome yet harbor about 85% of known pathogenic variants. However, the reliability of its data depends heavily on rigorous quality control procedures. The following are key quality control points based on experimental procedures and the literature.

I. Sample and DNA Quality Control

Sample Collection and Preservation

  • Sample Types: Blood (EDTA anticoagulant tubes recommended), tissue (requires rapid freezing or RNAlater preservation).
  • Preservation Conditions: Short-term (4℃, ≤7 days); Long-term (-80℃, avoid repeated freeze-thaw cycles).
  • Precautions: Avoid nuclease contamination; tissue samples require mechanical homogenization or enzymatic digestion (e.g., Proteinase K digestion).

DNA Extraction and Quality Control

  • Extraction Methods: Phenol-chloroform method (high purity), magnetic bead method (automated), Qiagen kit (clinical grade).
  • Quality Control Indicators:
    • Concentration: ≥50 ng/μL (minimum requirement for library construction), Qubit 4.0 fluorometric quantification recommended.
    • Purity: A260/A280 = 1.8-2.0 (no protein contamination), A260/A230 > 2.0 (no salt residue).
    • Integrity: Agarose gel electrophoresis shows a main band ≥10 kb (no significant degradation). RIN (RNA Integrity Number) ≥8.0 is an RNA metric and applies only when RNA from the same sample requires additional testing.
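
The DNA thresholds listed above can be checked programmatically. The following is a minimal sketch, not a validated laboratory tool; the `DnaSample` class and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DnaSample:
    """Hypothetical container for the DNA QC measurements listed above."""
    concentration_ng_per_ul: float
    a260_a280: float
    a260_a230: float
    main_band_kb: float


def dna_qc_failures(sample: DnaSample) -> List[str]:
    """Return a list of failed checks; an empty list means the sample passes."""
    failures = []
    if sample.concentration_ng_per_ul < 50:
        failures.append("concentration below 50 ng/uL")
    if not (1.8 <= sample.a260_a280 <= 2.0):
        failures.append("A260/A280 outside 1.8-2.0")
    if sample.a260_a230 <= 2.0:
        failures.append("A260/A230 not above 2.0")
    if sample.main_band_kb < 10:
        failures.append("main band below 10 kb")
    return failures


if __name__ == "__main__":
    sample = DnaSample(62.0, 1.85, 2.1, 15.0)
    print(dna_qc_failures(sample) or "sample passes DNA QC")
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easier to decide whether a sample needs re-extraction or only clean-up.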

DNA Fragmentation

  • Physical Fragmentation: Covaris S220 sonication (fragment size 50-200 bp, CV <5%).
  • Enzymatic Digestion: NEBNext Fragmentase (suitable for FFPE samples), reaction time optimization is required to avoid over-fragmentation.

II. Library Construction and Capture Efficiency Optimization

Adapter Ligation and Amplification

  • Adapter Design: Illumina TruSeq (with molecular barcodes) or Agilent SureSelect (with blockers to prevent adapter dimer formation).
  • Amplification Conditions: KAPA HiFi HotStart ReadyMix (low GC bias), ≤12 cycles (to avoid PCR bias).

Exon Capture

  • Probe Design: Agilent SureSelect XT (covers exons ±50 bp), IDT xGen (customized probes).
  • Capture Conditions:
    • Hybridization Temperature: 65℃ (high stringency), time ≥16 hours.
    • Elution Conditions: Magnetic bead washing (low-salt buffer to remove non-specifically bound fragments).
  • Post-Capture Quality Control:
    • Target Coverage: ≥70% (clinical standard), validated using SeqCap EZ Assay (Roche).
    • Duplication Rate: ≤5% (measured with Picard MarkDuplicates).
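
The 5% limit reported via Picard MarkDuplicates can be illustrated with a small calculation. This sketch assumes the usual definition of the duplication fraction, in which each read pair counts as two reads; the function and parameter names are hypothetical and do not correspond to Picard code.

```python
def duplication_rate(unpaired_examined: int, pairs_examined: int,
                     unpaired_duplicates: int, pair_duplicates: int) -> float:
    """Fraction of examined reads flagged as duplicates.

    Each read pair contributes two reads to both numerator and denominator.
    """
    total_reads = unpaired_examined + 2 * pairs_examined
    if total_reads == 0:
        raise ValueError("no reads were examined")
    duplicate_reads = unpaired_duplicates + 2 * pair_duplicates
    return duplicate_reads / total_reads


if __name__ == "__main__":
    rate = duplication_rate(0, 1000, 0, 40)
    print(f"duplication rate: {rate:.2%}, within 5% limit: {rate <= 0.05}")
```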

III. Sequencing and Raw Data Quality Control

Sequencing Platform Selection

  • Illumina NovaSeq 6000: Recommended paired-end length of 150 bp (PE150), single sample data volume ≥50 Gb (coverage depth ≥100×).
  • HiSeq 4000: Lower cost, but shorter read length (PE125), suitable for projects with limited budgets.

Raw Data Filtering

  • FastQC Analysis:
    • GC Content: The normal range for the human genome is 40-60%. Abnormal fluctuations indicate contamination or library bias.
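    • Low-Quality Bases: Low-quality tail bases (Q <20) should be trimmed (commonly used Trimmomatic parameters: LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15; note that these settings cut at lower quality values, so raise the thresholds, e.g., TRAILING:20, to enforce a strict Q20 cutoff).
    • Adapter Contamination: Automatically identify and trim using fastp (parameter: --adapter_sequence AGATCGGAAGAGC).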
  • PhiX contamination: If the proportion is >0.1%, resequencing is required (Kraken2 detection, k-mer length = 31).
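
To make the sliding-window idea behind SLIDINGWINDOW:4:15 concrete, here is a simplified sketch. It is not Trimmomatic's implementation: it only cuts the read at the start of the first window whose mean quality falls below the threshold. Phred+33 quality encoding is assumed, and the function names are hypothetical.

```python
from typing import List, Tuple


def phred33_to_scores(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(seq: str, quals: List[int],
                        window: int = 4, min_avg: float = 15
                        ) -> Tuple[str, List[int]]:
    """Cut the read where the mean quality of a window first drops below min_avg.

    Reads shorter than the window are returned unchanged (a simplification).
    """
    for start in range(0, len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals


if __name__ == "__main__":
    read = "ACGTACGTAC"
    scores = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
    trimmed_seq, trimmed_quals = sliding_window_trim(read, scores)
    print(trimmed_seq, trimmed_quals)
```

In practice, a dedicated trimmer should be used on real data; the sketch only shows why a run of low-quality bases at the 3' end causes the read to be shortened.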

Figure 1. Workflow for data analysis (Yin Y et al., 2019)

IV. Alignment and Variant Detection Quality Control

Sequence Alignment

  • Tool Selection: BWA-MEM (default parameters, suitable for long inserts), Bowtie2 (low memory consumption).
  • Reference Genome: hg38 (recommended) or hg19 (pay attention to version compatibility).
  • Post-Alignment Quality Control:
    • Mapping Rate: ≥95% (outliers require checking for sample contamination or probe design flaws).
    • Insert Size Distribution: Median 200-400 bp (Illumina platform); deviations indicate library structure abnormalities.
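
The two post-alignment checks above reduce to simple arithmetic once the read counts and fragment lengths are available. The sketch below assumes these values have already been extracted from the alignment file by another tool; the function name and dictionary keys are hypothetical.

```python
from statistics import median
from typing import Dict, List, Union


def alignment_qc(mapped_reads: int, total_reads: int,
                 insert_sizes: List[int]) -> Dict[str, Union[float, bool]]:
    """Evaluate mapping rate (>= 95%) and median insert size (200-400 bp)."""
    if total_reads <= 0 or not insert_sizes:
        raise ValueError("need at least one read and one insert size")
    mapping_rate = mapped_reads / total_reads
    median_insert = median(insert_sizes)
    return {
        "mapping_rate": mapping_rate,
        "median_insert": median_insert,
        "mapping_ok": mapping_rate >= 0.95,
        "insert_ok": 200 <= median_insert <= 400,
    }


if __name__ == "__main__":
    print(alignment_qc(970, 1000, [250, 300, 310, 280, 900]))
```

Using the median rather than the mean keeps a few chimeric or mis-mapped pairs with very large apparent insert sizes from distorting the check.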

Variant Detection Workflow

  • GATK Best Practices:
    • Duplicate Marking: Picard MarkDuplicates (parameter: REMOVE_DUPLICATES=true).
    • Base Quality Score Recalibration: BaseRecalibrator (trained using 1000G and Mills datasets).
    • Variant Calling: HaplotypeCaller (-ERC GVCF mode, multi-sample joint analysis).
  • Filtering Criteria:
    • SNV: QD≥2.0, FS≤60.0, MQRankSum≥-12.5.
    • Indel: QD≥2.0, ReadPosRankSum≥-20.0.
  • Functional Annotation: ANNOVAR (Databases: RefSeq, ClinVar, COSMIC).
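
The hard-filtering thresholds above can be expressed as a small rule table. This is a sketch under two assumptions: a variant passes when FisherStrand (FS) is at most 60.0, following the usual GATK hard-filtering convention, and annotations missing from a record are skipped rather than treated as failures. The names `SNV_RULES`, `INDEL_RULES`, and `failed_filters` are hypothetical.

```python
from typing import Callable, Dict, List

Rule = Callable[[float], bool]

# Pass conditions for each annotation (True means the value is acceptable).
SNV_RULES: Dict[str, Rule] = {
    "QD": lambda v: v >= 2.0,
    "FS": lambda v: v <= 60.0,          # assumption: high strand bias fails
    "MQRankSum": lambda v: v >= -12.5,
}
INDEL_RULES: Dict[str, Rule] = {
    "QD": lambda v: v >= 2.0,
    "ReadPosRankSum": lambda v: v >= -20.0,
}


def failed_filters(annotations: Dict[str, float], is_indel: bool) -> List[str]:
    """Return the names of annotations that violate their pass condition."""
    rules = INDEL_RULES if is_indel else SNV_RULES
    return [name for name, passes in rules.items()
            if name in annotations and not passes(annotations[name])]


if __name__ == "__main__":
    snv = {"QD": 12.0, "FS": 75.0, "MQRankSum": -1.2}
    print(failed_filters(snv, is_indel=False))
```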

V. Advanced Quality Control and Visualization

Coverage Depth Analysis

  • Tools: GATK DepthOfCoverage, IGV (visualized coverage heatmap).
  • Standards:
    • Clinical Diagnosis: ≥95% of the target region covered at ≥20×; non-target regions ≤5×.
    • Tumor Research: Somatic mutations must have ≥5% AF (allele frequency) and germline variations must be excluded.
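
The clinical coverage standard is a breadth-of-coverage statement: the fraction of target bases reaching a minimum depth. A minimal sketch follows, assuming per-base depths for the target region are already available as a list (for example, exported from a depth tool); the function names are hypothetical.

```python
from typing import Sequence


def breadth_at_depth(depths: Sequence[int], min_depth: int = 20) -> float:
    """Fraction of target bases covered at or above min_depth."""
    if not depths:
        raise ValueError("no target bases supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def meets_clinical_standard(depths: Sequence[int]) -> bool:
    """True when at least 95% of target bases reach 20x."""
    return breadth_at_depth(depths, min_depth=20) >= 0.95


if __name__ == "__main__":
    per_base_depths = [30] * 96 + [5] * 4
    print(breadth_at_depth(per_base_depths), meets_clinical_standard(per_base_depths))
```

Breadth at a fixed depth is more informative than mean depth alone, because a high mean can hide poorly covered exons.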

Contamination Control

  • Inter-sample Contamination: VerifyBamID (threshold ≤0.1%), ContEst (based on population frequency).
  • Reagent Contamination: PhiX control, no-template control (NTC) detection.

Visualization Tools

  • IGV: Examines the sequence context surrounding variant sites (such as repetitive elements and splice sites).
  • MultiQC: Integrates FastQC, BWA, and GATK reports to generate a quality control overview chart.

VI. Common Issues and Solutions

Issue | Cause | Solution
Low Target Region Coverage | Low probe hybridization efficiency | Optimize hybridization conditions (extend to 24 hours) or increase DNA input to 100 ng
High Strand Bias | PCR amplification bias | Use molecular barcode labeling or adjust BWA parameters (e.g., -X 500)
False Positive Variants | Low-quality reads or sequencing errors | Apply stricter filtering criteria (e.g., SAV ≥ 0.2) and validate with Sanger sequencing
Batch Effects | Inconsistent experimental conditions | Analyze batches together or apply batch correction (e.g., ComBat in R)

VII. Clinical Application and Compliance

Report Interpretation

  • ACMG Guidelines: Pathogenicity Classification (Pathogenic, Likely Pathogenic, Uncertain Significance, etc.).
  • Family Validation: Sanger sequencing confirms proband variants; parental samples are tested for inheritance patterns.

Ethics and Privacy

  • Informed Consent: Clearly inform the individual of the testing scope and the handling strategy for unexpected findings (e.g., BRCA1 mutations).
  • Data Security: Raw data is encrypted and stored (compliant with HIPAA/GDPR standards).

VIII. References and Tools

Core References

  • T/CHIA 21.2-2021 (China Exome Sequencing Quality Control Standard)
  • GATK Best Practices (Broad Institute)

Recommended Tools

  • Quality Control: FastQC, MultiQC, Picard
  • Alignment: BWA-MEM, Bowtie2
  • Annotations: ANNOVAR, VEP, SnpEff

IX. Quality Control in Practical Cases

WES Quality Control in VITT

Data Processing and Analysis

  • Alignment: BWA alignment of reads to the hg19 genome; IGV visualization for quality checks.
  • Variant Calling: GATK HaplotypeCaller for variant identification; R script for calculating coverage and depth.
  • Screening: Removing non-functional variants (including subtypes and missense); retaining rare SNVs/Indels with MAF <0.01; focusing on target pathway genes (coagulation, platelet activation, etc.).
  • Pathogenicity Assessment: 7 tools (PROVEAN, etc.) + ACMG guideline classification (VUS/LP/P).

Data Consistency

The total number of variants (140,000+), the number of rare variants (1,619–1,774), and the type distribution (including subtypes 38%–42%, etc.) were similar across the 6 patients. Coverage was calculated using a unified script to ensure reproducibility.

Ethics and Independence

With ethics committee approval and in accordance with the Declaration of Helsinki, three blinded experts independently adjudicated the cases according to the Pavord criteria (Giusti B et al., 2024).

Quality Control of WES for Chinese Concurrent Cancer Family

Sample and DNA Quality Control

  • Tumor Tissue: >200 mg, frozen in liquid nitrogen/-80°C or processed as FFPE (4% formaldehyde fixation, paraffin embedding, 4 μm sectioning); examined independently by two pathologists (confirming malignancy and excluding metastasis).
  • Peripheral Blood: 5 mL, DNA extracted using the QIAamp DNA Mini kit.
  • DNA Fragmentation and Sizing: Sonicated to ~350 bp, purified with AMPure XP, fragment size distribution analyzed using an Agilent 2100.

Library Preparation and Sequencing Quality Control

  • Library Construction: Exon capture using Agilent SureSelect Human All Exon V5 (0.5 μg DNA input), end polishing/A-tailing/adapter ligation, followed by KAPA HiFi HotStart PCR amplification.
  • Library Quantification: KAPA kit PCR method (standard curve), 3 nM working concentration.
  • Sequencing: Ion flow platform, 100bp end sequencing.

Data Processing and Analysis Quality Control

  • Data Filtering: Remove low-quality reads (adapter-containing, N>10%, etc.), map to hg19 with BWA, and use Picard + GATK v3.2 for deduplication, realignment, and base quality recalibration.
  • Variation Annotation: Annotate SNVs with GATK v3.0 (QD>2.0 is "good"); ANNOVAR references 1000 Genomes/dbSNP/CGC databases, annotating function/exon type/amino acid changes.
  • Filtering Criteria: Remove reads with quality <20, variants with MAF >0.005, and synonymous variants; retain missense variants in exons and at splice sites.
  • Coverage: Average depth 58-fold, ≥82.08% of exons >10-fold coverage, transition/transversion ratio 2.2–2.4 (normal).
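
The transition/transversion (Ti/Tv) ratio quoted above is a simple count-based metric. The sketch below shows how it can be computed from a list of (ref, alt) SNV pairs; the function names are hypothetical and the input format is an assumption.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    bases = {ref, alt}
    if ref == alt or not bases <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid SNV: {ref}>{alt}")
    return bases <= PURINES or bases <= PYRIMIDINES


def titv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions across a set of SNVs."""
    transitions = transversions = 0
    for ref, alt in snvs:
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
    print(titv_ratio(calls))
```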

Data Consistency Validation

  • Variation Distribution: VarScan2 v2.3.9 identified somatic mutations and cross-analyzed common genes (e.g., NDUFS7); germline mutation screening identified variants shared by patients but absent from unaffected individuals (16 genes, 17 SNVs).
  • Reproducibility: Sample quality score >20, standardized procedures, consistent variant type distribution.

Ethics and Independence

  • Ethics: Approved by the ethics committee, following the Declaration of Helsinki; written informed consent was obtained from patients.
  • Independence: Two pathologists independently examined tumor tissue to avoid diagnostic bias (Yin Y et al., 2019).

Figure 2. Workflow for the identification of germline mutations (Yin Y et al., 2019)

WES Quality Control for Mitochondrial Variant Detection in Hundreds of Thousands of Individuals

Data Preprocessing and Variant Filtering

Exome sequencing data (415,000 samples) and array genotyping data (784,000 SNPs) were merged. Low-quality variants and samples were excluded: at the variant level, those with a missingness rate >10%, singleton variants, and variants with a minor allele count <6; at the sample level, samples with a missingness rate >10%. In total, 6,767,000 variants were retained (autosomal + X chromosome, MAF ≥ 0.001).
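
The variant-level exclusions described above can be sketched as a single predicate. This is an illustration only, not the pipeline used in the cited study. It assumes the 10% cutoff refers to the fraction of missing genotype calls and that the minimum minor allele count is 6 (which also removes singletons); the `VariantStats` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VariantStats:
    """Hypothetical per-variant summary used for quality filtering."""
    variant_id: str
    missing_rate: float       # fraction of samples without a genotype call
    minor_allele_count: int


def keep_variant(v: VariantStats, max_missing: float = 0.10,
                 min_mac: int = 6) -> bool:
    """Keep variants with acceptable call rate and enough minor alleles.

    A singleton has a minor allele count of 1, so min_mac=6 excludes it too.
    """
    return v.missing_rate <= max_missing and v.minor_allele_count >= min_mac


def filter_variants(variants: List[VariantStats]) -> List[str]:
    """Return the IDs of variants that pass the filter."""
    return [v.variant_id for v in variants if keep_variant(v)]


if __name__ == "__main__":
    demo = [
        VariantStats("var1", 0.02, 40),
        VariantStats("var2", 0.15, 40),   # too many missing calls
        VariantStats("var3", 0.01, 1),    # singleton
    ]
    print(filter_variants(demo))
```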

Covariates and Confounding Controls

Age, sex, 40 principal components (PCs), and WES batch effects were adjusted. A genetic relationship matrix (GRM) was constructed using BOLT-LMM. For sparse GRMs, kinship coefficients <0.0442 were set to 0.

Variant Analysis and Statistical Rigor

  • Single variant analysis: Associations were tested using BOLT-LMM, chrX was analyzed with stratification, and results were combined with METAL. Power analysis was performed using the genpwr package.
  • Aggregation of rare variants: The GENESIS package tested 9 combinations (deleteriousness: all non-synonymous/CADD≥18/pLoF; frequency: MAF≤1%/0.1%/0.01%), with a cumulative allele frequency ≥0.01%, including variants with MAC<6. Correlation among p-values yielded 4 clusters, giving an effective number of tests of 18,557 genes × 4 clusters and a significance threshold of 0.05/(18,557 × 4).
  • Robustness and independence verification:
    • At the gene level: Leave-one-out and conditional analyses were used to assess the signal; gene set enrichment was tested across 33,750 gene sets (MitoCarta and others) using a t-test, with exclusion of extreme values and Bonferroni correction (FWER 1.5 × 10⁻⁶).
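
The multiple-testing thresholds quoted in this list follow directly from the Bonferroni rule, which divides the significance level by the number of tests. A short sketch of the arithmetic, assuming a significance level of 0.05 in both cases (the function name is hypothetical):

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


if __name__ == "__main__":
    # Gene-based tests: 18,557 genes x 4 effective clusters.
    gene_level = bonferroni_threshold(0.05, 18_557 * 4)
    # Gene set enrichment: 33,750 tests.
    gene_set_level = bonferroni_threshold(0.05, 33_750)
    print(f"gene-level threshold:    {gene_level:.3e}")
    print(f"gene-set-level threshold: {gene_set_level:.3e}")
```

The second value rounds to the 1.5 × 10⁻⁶ reported for the gene set analysis.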

Phenotypic Association and Multiple Test Correction

PheWAS: ICD-10 phenotypes were pooled, and the analysis was restricted to unrelated individuals of White British ancestry, using logistic regression adjusted for covariates. simpleM estimated the effective number of tests at 1,530, with Bonferroni correction (p≤3.0×10⁻⁶). Mendelian randomization was used to assess causality (e.g., SAMHD1-mtDNA-CN and breast cancer) (Pillalamarri V et al., 2022).

Figure 3. A single variant significantly associated with mitochondrial DNA-CN was discovered (Pillalamarri V et al., 2022)

Summary

Quality control of whole-exome sequencing (WES) is a multi-dimensional, dynamically optimized systematic project that needs to be implemented throughout the entire lifecycle of experimental design, execution, and data analysis. Standardized operations, technological innovation, and inter-institutional collaboration can significantly improve detection sensitivity and specificity, providing a solid foundation for genetic disease diagnosis, precision oncology treatment, and drug development. Laboratories should continuously monitor updates to international guidelines and promote the translation of WES technology from research to clinical applications.

People Also Ask

What is the QV value in sequencing?

During the sequencing process, a quality value (QV), also known as quality score in the literature, is assigned to each nucleotide in a read. These quality values express the confidence that the corresponding nucleotide has been read out correctly.
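
As an illustration, quality values are conventionally reported on the Phred scale, where Q = -10 × log10(P) and P is the estimated probability that the base call is wrong. A minimal sketch of the conversion in both directions (function names are hypothetical):

```python
import math


def error_probability(q: float) -> float:
    """Probability of an incorrect base call for Phred quality q."""
    return 10 ** (-q / 10)


def phred_quality(p_error: float) -> float:
    """Phred quality for a given error probability (0 < p_error <= 1)."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in (0, 1]")
    return -10 * math.log10(p_error)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {error_probability(q):.4f}")
```

On this scale, Q20 corresponds to a 1% chance of error and Q30 to a 0.1% chance.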

How to analyze whole exome sequencing data?

A typical workflow of WES analysis includes these steps: raw data quality control, preprocessing, sequence alignment, post-alignment processing, variant calling, variant annotation, and variant filtration and prioritization.

What is the data output of whole exome sequencing?

The data output of whole exome sequencing typically consists of high-throughput sequencing reads (in FASTQ format) and a processed variant call file (VCF) containing identified genetic variants within the protein-coding regions of the genome.

What is another name for whole exome sequencing?

Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome (known as the exome).

What cannot be detected by whole exome sequencing?

There may be functional variants in non-coding regions that regulate gene expression, such as enhancers and long-noncoding RNAs. However, these non-coding variants (NCVs), even if genetically identifiable, are not covered by WES and thus cannot be detected.

What are secondary findings in whole exome sequencing?

A secondary finding is a variation that may contribute to disease but is not the cause of the patient's current condition. Secondary findings are found in up to 5 out of 100 (5%) patients who choose to have WES.

What is trio analysis in whole exome sequencing?

Whole Exome Sequencing (WES), Trio Analysis is a molecular test that captures data from the entire exome, with additional coverage for genes with known Mendelian disease associations, to help identify the underlying genetic cause of a patient's unexplained medical condition.

References:

  1. Sealock JM, Ivankovic F, Liao C, Chen S, Churchhouse C, Karczewski KJ, Howrigan DP, Neale BM. Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses. Nat Protoc. 2025 Sep;20(9):2372-2382.
  2. Belova V, Pavlova A, Afasizhev R, Moskalenko V, Korzhanova M, Krivoy A, Cheranev V, Nikashin B, Bulusheva I, Rebrikov D, Korostin D. System analysis of the sequencing quality of human whole exome samples on BGI NGS platform. Sci Rep. 2022 Jan 12;12(1):609.
  3. Giusti B, Sticchi E, Capezzuoli T, Orsi R, Squillantini L, Giannini M, Suraci S, Rogolino AA, Cesari F, Berteotti M, Gori AM, Lotti E, Marcucci R. Whole Exome Sequencing in Vaccine-Induced Thrombotic Thrombocytopenia (VITT). Biomed Res Int. 2024 Jul 14;2024:2860547.
  4. Yin Y, Wu S, Zhao X, Zou L, Luo A, Deng F, Min M, Jiang L, Liu H, Wu X. Whole exome sequencing study of a Chinese concurrent cancer family. Oncol Lett. 2019 Sep;18(3):2619-2627.
  5. Pillalamarri V, Shi W, Say C, Yang S, Lane J, Guallar E, Pankratz N, Arking DE. Whole-exome sequencing in 415,422 individuals identifies rare variants associated with mitochondrial DNA copy number. HGG Adv. 2022 Sep 26;4(1):100147.
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.