Common Challenges and Limitations of Whole Exome Sequencing

Whole exome sequencing (WES), as one of the core technologies in genomics research, provides an important tool for genetic disease diagnosis, tumor research, and drug development by targeting and capturing DNA sequences in protein-coding regions (exons). However, its application still faces many technical, analytical, and ethical challenges. The following section systematically reviews the main limitations of WES, combining the latest research and clinical practice.

Figure: WES's list of unresolved challenges (Bertier G et al., 2016)

I. Technical Limitations

Incomplete Coverage and Capture Bias

WES covers only approximately 1%-2% of the genome (i.e., exon regions), and some exons cannot be captured effectively because of technical limitations:
High GC Content Regions: The GC content of exon regions in the BRCA1 gene can be as high as 70%, which lowers probe hybridization efficiency and leads to a false negative rate of 5%-10%.
Repetitive Sequence Interference: The pseudogene (CYP21A1P) of the CYP21A2 gene has a sequence similarity of up to 98% with the functional gene, so conventional capture probes readily bind to the wrong locus, resulting in false negative results.
Platform Differences: Coverage uniformity can vary between capture enrichment kits (e.g., Agilent vs. IDT) and sequencing platforms, although DNBSEQ platforms (such as DNBSEQ-T7/G400) have achieved performance comparable to Illumina NovaSeq in whole exome sequencing.
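As a rough illustration of the GC content issue above, the following Python sketch computes the GC fraction of capture targets and flags those above a cutoff. The target names, the sequences, and the 0.65 cutoff are hypothetical values chosen for the example, not values taken from any capture kit.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G/C bases among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def flag_high_gc(targets: dict[str, str], threshold: float = 0.65) -> list[str]:
    """List target names whose GC fraction meets or exceeds the threshold.

    The default threshold is an assumption made for this sketch.
    """
    return [name for name, seq in targets.items() if gc_fraction(seq) >= threshold]


# Hypothetical target sequences, invented for illustration only.
targets = {
    "target_A": "GCGCGGCCGCGGGCGCCGCA",
    "target_B": "ATGCATTTAGCATTAACGAT",
}
high_gc_targets = flag_high_gc(targets)
print(high_gc_targets)
```

In practice such a check would run over the intervals in a capture design file, and flagged targets would be candidates for extra probes or orthogonal testing.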

Limitations in the Detection of Structural Variations and Complex Mutations

Copy Number Variations (CNVs): WES has a sensitivity of less than 60% for CNVs >50 bp, while long-read sequencing (such as PacBio HiFi) can improve sensitivity to over 90%. For example, 22q11.2 microdeletion syndrome may be missed in WES due to insufficient probe coverage.
Insufficient Sensitivity for Low-Frequency Variants: Standard clinical diagnostic WES (typically at an average depth of ~100×) has limited capability in detecting low-frequency variants, such as somatic mutations or mosaicism with an allele frequency below 5%. While sensitivity can be improved by increasing the sequencing depth (>300×), this approach significantly raises both costs and data volume, thereby making it less cost-effective compared to more targeted sequencing panels or whole-genome sequencing (WGS).
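The effect of depth on low-frequency variant detection can be sketched with a simple binomial model. This is a simplified illustration: it assumes reads are sampled independently, ignores sequencing error and mapping artifacts, and uses a hypothetical calling rule of at least 8 supporting reads.

```python
from math import comb


def detection_probability(depth: int, vaf: float, min_alt_reads: int) -> float:
    """P(at least min_alt_reads variant reads) under Binomial(depth, vaf)."""
    p_below = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


# min_alt_reads=8 is an assumed calling threshold, not a published standard.
for depth in (100, 300, 1000):
    p = detection_probability(depth, vaf=0.05, min_alt_reads=8)
    print(f"depth {depth}x: P(>= 8 variant reads at 5% VAF) = {p:.3f}")
```

Under this model, a 5% variant at 100× yields only about five supporting reads on average, so most sites fall below the assumed threshold, whereas 300× and above usually clear it.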

Pseudogene Interference and Sequence Homology

Pseudogene Interference: For example, the pseudogene SBDSP of the SBDS gene has a sequence similarity of up to 97% with the functional gene. Sanger sequencing may misidentify it as a heterozygous mutation, requiring verification by qPCR or long-read sequencing.
Difficulty in Distinguishing Homologous Genes: For example, the sequence similarity between PTEN and PTENP1 can easily lead to cross-capture and read misalignment during targeted capture, requiring gene-specific probe design or MLPA verification.
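A back-of-the-envelope calculation shows why high sequence similarity is a problem for short reads. The sketch below assumes that gene/pseudogene differences are spread uniformly and independently along the sequence, which is a simplification (real differences tend to cluster), and the read lengths are illustrative.

```python
def expected_distinguishing_sites(read_length: int, identity: float) -> float:
    """Expected number of gene-vs-pseudogene differences covered by one read."""
    return read_length * (1.0 - identity)


def prob_read_uninformative(read_length: int, identity: float) -> float:
    """Probability that a read covers no distinguishing site.

    Assumes independent, uniformly spread differences (a simplification).
    """
    return identity**read_length


for label, length in (("short read", 150), ("long read", 10_000)):
    sites = expected_distinguishing_sites(length, identity=0.98)
    p_none = prob_read_uninformative(length, identity=0.98)
    print(f"{label} ({length} bp): ~{sites:.0f} distinguishing sites, "
          f"P(no distinguishing site) = {p_none:.2e}")
```

At 98% identity a 150 bp read covers only a handful of distinguishing positions, and a small fraction of reads cover none at all, while a read of several kilobases almost always spans many of them.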

II. Challenges in Data Analysis and Interpretation

Complexity of Variant Annotation

Timeliness and Limitations of Database Annotation: Whole Exome Sequencing (WES) generates a vast number of rare variants, the interpretation of which heavily relies on public databases (e.g., ClinVar, gnomAD). However, two major challenges exist: (1) Annotation Lag: Newly discovered private familial mutations are often classified as Variants of Uncertain Significance (VUS) and may require years of accumulated evidence for reclassification; (2) Population Bias: Existing databases (e.g., gnomAD) are predominantly composed of data from European populations. This bias can lead to the erroneous classification of rare variants found in other populations as pathogenic. Consequently, WES faces significant interpretation challenges both in discovering novel disease-causing genes and when serving underrepresented populations.
Conflicting Annotation Sources: The same variant may be labeled as "pathogenic" or "benign" in tools such as ANNOVAR and SnpEff, requiring manual verification. For example, the c.743G>A mutation in the TP53 gene is listed as pathogenic in some databases, but recent studies show it is not phenotypically relevant.
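The population bias described above can be illustrated with a small allele frequency check. In this sketch, a variant looks rare when all samples are pooled but is comparatively common in one undersampled group. The population labels, counts, and the 0.001 rarity cutoff are all hypothetical and are not taken from gnomAD.

```python
# Maps a population label to (allele_count, allele_number).
Counts = dict[str, tuple[int, int]]


def pooled_af(counts: Counts) -> float:
    """Allele frequency across all populations combined."""
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    return total_ac / total_an if total_an else 0.0


def max_population_af(counts: Counts) -> float:
    """Highest allele frequency observed in any single population."""
    return max((ac / an for ac, an in counts.values() if an > 0), default=0.0)


# Invented counts for illustration only.
example = {"well_sampled_pop": (0, 100_000), "undersampled_pop": (40, 8_000)}
RARE_CUTOFF = 0.001  # assumed cutoff for "rare enough to be disease-causing"

looks_rare_pooled = pooled_af(example) < RARE_CUTOFF
common_somewhere = max_population_af(example) >= RARE_CUTOFF
print(looks_rare_pooled, common_somewhere)
```

Filtering on the highest per-population frequency rather than the pooled frequency is one way to reduce such misclassification, provided the relevant population is represented in the database at all.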

Dual Risk of False Positives and False Negatives

Technical Errors: PCR amplification bias may lead to allele dropout, such as the c.1521_1523delCTT mutation in the CFTR gene, which may be missed during amplification.
Missed Low-Frequency Mutations: When the variant allele frequency is below 1%, the detection rate of WES may be less than 30%, requiring verification by ddPCR or deep targeted NGS.

Lack of Multidisciplinary Collaboration and Standardization

Interpretation Discrepancies: Pathogenicity classifications of the same VUS can differ between laboratories, with discordance of up to 40%. For example, the c.3985C>T mutation in the SCN1A gene may be classified as "pathogenic" or "of uncertain significance" in epilepsy diagnosis.
Insufficient Phenotype-Genotype Association: Approximately 30% of WES results cannot be effectively associated with specific genes because the phenotypic descriptions are too vague (e.g., "developmental delay").

Figure: Issues that were encountered in data analysis (Corominas J et al., 2022)

III. Ethical and Practical Issues in Clinical Applications

Ethical Dilemmas of Incidental Discoveries

Non-Target Disease Risk: WES may detect pathogenic mutations unrelated to the current phenotype (e.g., BRCA1 c.68_69delAG), requiring prior patient notification and management plans. Studies show that 15% of participants experience anxiety symptoms due to incidental discoveries.
Controversies Regarding Adult-Onset Disease Reporting: For example, APC gene mutations may indicate a risk of colorectal cancer while the patient is still asymptomatic, so the necessity of disclosure requires careful consideration.

Sample Quality and Phenotype Matching

DNA degradation impact: DNA fragmentation in FFPE samples may lead to decreased coverage depth and a 2-3 fold increase in sequencing error rates in exon edge regions (such as start/stop codons).
Dynamic phenotype changes: For example, the phenotype of neonatal epileptic encephalopathy may evolve over several months, requiring regular updates of clinical information for data reanalysis.

IV. Technological Improvements and Future Directions

Technological Innovation

Long-Read Sequencing Integration: PacBio HiFi and Oxford Nanopore technologies can resolve complex structural variations (such as balanced translocations) and raise CNV detection sensitivity to over 90%.
CRISPR Targeted Enrichment: CRISPR-mediated enrichment combined with long-read sequencing enables highly specific capture of long genomic regions (e.g., >10 kb), improving the detection sensitivity for low-frequency mutations (<1%) by 40%-60% compared to non-targeted long-read sequencing. This approach is particularly useful for resolving complex genomic regions (e.g., repetitive sequences, structural variants) that are challenging for traditional short-read WES.

Algorithm and Database Optimization

Deep learning models: Models such as ECOLE, utilizing the Transformer architecture, improve CNV detection precision from 50.1% to 68.7% and recall from 49.6% to 78.4%.
Dynamic Database Updates: The gnomAD database (v4.0) integrates data from over 800,000 individuals (730,947 exomes and 76,215 genomes), supporting real-time variant frequency queries and reducing the number of rare variants that are misclassified or left as VUS.
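To make the precision and recall figures quoted for ECOLE easier to compare, the sketch below shows how both metrics are defined from call counts and how they combine into an F1 score. The F1 values it prints are derived here from the quoted percentages and are not figures reported in the source.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Precision and recall as quoted in the text above (before and after).
f1_before = f1_score(0.501, 0.496)
f1_after = f1_score(0.687, 0.784)
print(f"F1 before: {f1_before:.3f}, F1 after: {f1_after:.3f}")
```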

Standardized Clinical Workflows

Multidisciplinary team (MDT): Teams composed of geneticists, clinicians, and bioinformaticians can improve diagnostic accuracy by 20%.
Quality Control System: CAP-accredited laboratories are required to achieve a sequencing depth of ≥100× and coverage of ≥95%, and to participate regularly in interlaboratory quality assessments (such as UK NEQAS).
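The depth and coverage thresholds above can be checked with a few lines of Python. This is a minimal sketch: the text does not state the depth at which "coverage" is measured, so the 20× breadth threshold used here is an assumption, and the per-base depths are made-up numbers.

```python
from statistics import mean


def qc_summary(
    depths: list[int],
    min_mean_depth: float = 100.0,   # from the text: depth >= 100x
    min_breadth: float = 0.95,       # from the text: coverage >= 95%
    breadth_depth: int = 20,         # assumption: a base counts as covered at >= 20x
) -> dict:
    """Summarize per-base target depths against the QC thresholds."""
    mean_depth = mean(depths)
    breadth = sum(d >= breadth_depth for d in depths) / len(depths)
    return {
        "mean_depth": mean_depth,
        "breadth": breadth,
        "passes": mean_depth >= min_mean_depth and breadth >= min_breadth,
    }


# Hypothetical per-base depths over 100 target positions.
good_sample = [120] * 96 + [5] * 4
uneven_sample = [200] * 90 + [0] * 10

print(qc_summary(good_sample))
print(qc_summary(uneven_sample))
```

The second sample has a higher mean depth but fails on breadth, which is why both criteria are usually evaluated together.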

V. Typical Case Analysis

Case 1: SBDS Gene and Pseudogene Interference

Background: Shwachman-Diamond syndrome (SDS) is often caused by biallelic mutations in the SBDS gene. This gene has a highly homologous pseudogene, SBDSP (sequence similarity >97%).
WES Limitations and Consequences: Short-read WES struggles to distinguish sequences between SBDS and SBDSP. This can lead to: 1) False positives: Misinterpreting sequence variations in the pseudogene as mutations in the functional gene; 2) False negatives/Misinterpretation: When a large deletion (e.g., an exon deletion) exists in the functional gene concurrent with a single nucleotide variant in the pseudogene, WES data might be erroneously interpreted as a "homozygous point mutation" instead of the correct "compound heterozygous state with a deletion." Clarification requires long-read sequencing or quantitative PCR (qPCR).
Core Argument: This case reveals the fundamental limitation of WES in resolving highly homologous genomic regions.

Case 2: Missed Detection of a Large Fragment Deletion in the CYP21A2 Gene

Background: The most common form of congenital adrenal hyperplasia (CAH) is caused by mutations in the CYP21A2 gene. This gene is also highly homologous to the pseudogene CYP21A1P (similarity >98%).
WES Limitations and Consequences: For a patient suspected of having CAH, WES only identified a Variant of Uncertain Significance (VUS) in the CFTR gene, which could not explain the clinical presentation. Subsequent Genome Sequencing (GS) or MLPA testing revealed a large heterozygous deletion (e.g., exons 1-7) in the CYP21A2 gene, a variant with clear pathogenicity. The reasons for the WES miss are: 1) Large deletions exceed detection scope: Standard WES bioinformatics pipelines are insensitive to deletions >50 bp; 2) Alignment difficulty in homologous regions: Even if fragments are captured, short reads cannot be accurately aligned between CYP21A2 and CYP21A1P.
Core Argument: This case illustrates the high miss rate of WES in detecting large structural variants (SVs). When clinical suspicion remains high despite a negative WES result, supplemental GS or targeted testing is necessary.
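Exon-level deletions of the kind described in both cases are often sought in WES data through read-depth comparison. The following is a simplified sketch of that idea, not the method used in these cases: the exon names, depths, pseudocount, and the -0.6 log2 cutoff are hypothetical, and in highly homologous regions such as CYP21A2 or SBDS the depths themselves may be unreliable because of misaligned reads.

```python
from math import log2
from statistics import median


def log2_ratios(sample: dict[str, float], reference: dict[str, float],
                pseudocount: float = 0.5) -> dict[str, float]:
    """Per-exon log2(sample / reference) after median normalization.

    A pseudocount avoids log2(0) for exons without reads.
    """
    s_med = median(sample.values())
    r_med = median(reference.values())
    return {
        exon: log2(((sample[exon] + pseudocount) / s_med)
                   / ((reference[exon] + pseudocount) / r_med))
        for exon in sample
        if exon in reference
    }


def candidate_deletions(ratios: dict[str, float], cutoff: float = -0.6) -> list[str]:
    """Exons whose log2 ratio falls below the (assumed) cutoff.

    A heterozygous deletion is expected near log2(0.5) = -1.
    """
    return [exon for exon, r in ratios.items() if r <= cutoff]


# Hypothetical mean depths for ten exons; exons 4-7 have roughly half the depth.
sample = {f"exon{i}": (50.0 if 4 <= i <= 7 else 100.0) for i in range(1, 11)}
reference = {f"exon{i}": 100.0 for i in range(1, 11)}

ratios = log2_ratios(sample, reference)
print(candidate_deletions(ratios))
```

Real pipelines normalize against many exons genome-wide and a panel of reference samples, so a deletion spanning most of one gene does not distort the baseline.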

Summary

The limitations of whole-exome sequencing (WES) involve multiple dimensions, including technology, analysis, and ethics. These limitations require gradual breakthroughs through technological innovation (such as long-read sequencing), algorithm optimization (such as deep learning models), and standardized processes (such as multidisciplinary team (MDT) collaboration). In the future, with the improvement of dynamic informed consent frameworks and global data sharing (such as the GA4GH project), WES is expected to play a more central role in precision medicine; however, its clinical application still needs to balance technological potential with ethical risks.