Interpreting Exome Sequencing Data: From Variants to Insights

In recent years, whole exome sequencing (WES) has become a breakthrough technology for genetic disease diagnosis and complex disease research, because the coding regions it targets account for only 1-2% of the genome yet harbor approximately 85% of known pathogenic mutations. With decreasing sequencing costs and the maturation of bioinformatics tools, WES has gradually moved from a research tool into clinical use, for example in achieving precise diagnoses in rare diseases such as neurofibromatosis and epilepsy. However, interpreting massive amounts of variant data still faces challenges that urgently need to be addressed: functional validation of low-frequency variants, inefficient integration of multi-source databases, and the complex relationship between clinical phenotypes and genotypes.

This article systematically reviews the core processes and technological advances in WES data analysis and discusses its translational value through published case studies, providing a reference for improving the accuracy of disease diagnosis and the efficiency of research.

I. Technological Basis and Evolution of Exome Sequencing

1.1 Technological Principles and Core Breakthroughs

WES detects variation in protein-coding genes by targeting the coding regions, which make up roughly 1-2% of the genome (about 30 Mb). Its core technological breakthroughs are reflected in:

Probe Capture Technology: The Ion TargetSeq™ Exome Kit uses >2 million probes to achieve high-density coverage (>95% of the target region) and, combined with a single-tube enrichment process, reduces the required DNA input to 125 ng.
Sequencing Platform Innovation: The Illumina NovaSeq 6000 system performs 150 bp paired-end sequencing with sequencing-by-synthesis (SBS) chemistry, producing 1.5 Tb of data per run and an average coverage depth of 119×.
Quality Control System: FastQC, combined with Trimmomatic, forms a three-level quality control process that removes adapter contamination, low-quality bases (Phred quality value <20, detected using the sliding-window method), and low-complexity regions.
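As a rough illustration of sliding-window quality trimming, the sketch below scans a read from the 5' end and cuts it at the first window whose mean Phred quality falls below the threshold. The function name, the window size of 4, and the toy read are assumptions made for illustration; this is a simplified approximation of the idea, not Trimmomatic's actual implementation.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Cut a read at the first window whose mean Phred quality is below min_mean_q.

    seq   -- read bases as a string
    quals -- per-base Phred scores (list of ints, same length as seq)
    Returns the retained (bases, qualities) prefix.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")
    # Reads shorter than the window are evaluated as a single window.
    last_start = max(len(seq) - window + 1, 1)
    for start in range(last_start):
        win = quals[start:start + window]
        if win and sum(win) / len(win) < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


# Hypothetical read whose quality decays toward the 3' end.
read = "ACGTACGTAC"
phred = [35, 36, 34, 33, 30, 28, 12, 10, 8, 5]
trimmed_seq, trimmed_quals = sliding_window_trim(read, phred)
print(trimmed_seq)  # ACGTA
```

Averaging over a window rather than testing single bases keeps one isolated low-quality call from truncating an otherwise good read.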

1.2 Data Analysis Workflow Overview

A typical WES analysis comprises eight core modules:

Raw Data Processing: BWA-MEM alignment (parameters: -t 8 -R '@RG\tID:sample\tSM:sample') generates a SAM file, which is sorted and converted to BAM before Picard MarkDuplicates flags PCR duplicates.
Variant Detection: GATK HaplotypeCaller calls variants in gVCF mode (-ERC GVCF), followed by multi-sample joint genotyping with GenomicsDBImport and GenotypeGVCFs. Complementing these calls with FreeBayes results can improve the sensitivity of SNV/indel detection to 98.5%.
Variant Annotation: ANNOVAR integrates the 1000G, ClinVar, and GO databases and outputs each variant's functional impact (e.g., p.M1V, which alters the start codon) and population frequency (variants with AF > 0.01 are filtered out automatically).
Pathogenicity Assessment: Following the ACMG-AMP guidelines, a multi-dimensional evidence scoring system combines predictive tools, including SIFT (score <0.05 indicates a deleterious effect), PolyPhen2 (score >0.85 indicates possible pathogenicity), and CADD (PHRED >20 indicates a deleterious effect).
Visual Validation: IGV displays the coverage depth at the variant site (DP ≥ 20) and the variant allele fraction (a value near 45% suggests a heterozygous variant, in the absence of copy number changes).
CNV Detection: WES-based CNV calling offers high-resolution localization, from single exons up to medium-sized segments (1-50 kb). Combined with SNV analysis, it can improve diagnostic efficiency while reducing detection cost and time.
Pathway Enrichment: A PPI network (confidence >0.7) is constructed using the STRING database, and GO and KEGG enrichment analyses are performed using Cytoscape.
Clinical Decision-Making: The Emedgene AI platform automatically associates OMIM phenotypes to generate diagnostic reports conforming to ACMG standards.
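The annotation and pathogenicity thresholds listed above can be sketched as two small filters. This is a minimal illustration, assuming the scores have already been parsed from an annotation table; the function names and example values are hypothetical. Agreement among predictors is only one line of evidence under the ACMG-AMP framework and does not classify a variant on its own.

```python
def passes_frequency_filter(population_af, max_af=0.01):
    """Keep variants absent from the population database or at or below max_af."""
    return population_af is None or population_af <= max_af


def count_damaging_predictions(sift=None, polyphen2=None, cadd_phred=None):
    """Count how many in-silico tools flag a variant as damaging.

    Thresholds follow the text: SIFT < 0.05, PolyPhen2 > 0.85, CADD PHRED > 20.
    A score of None means the tool gave no prediction and is skipped.
    Returns (number_of_damaging_calls, number_of_tools_with_a_score).
    """
    flags = []
    if sift is not None:
        flags.append(sift < 0.05)
    if polyphen2 is not None:
        flags.append(polyphen2 > 0.85)
    if cadd_phred is not None:
        flags.append(cadd_phred > 20)
    return sum(flags), len(flags)


# Hypothetical missense variant: rare in the population, flagged by all three tools.
if passes_frequency_filter(0.0002):
    damaging, scored = count_damaging_predictions(sift=0.01, polyphen2=0.95, cadd_phred=28.4)
    print(f"{damaging} of {scored} predictors call the variant damaging")
```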

II. In-depth Strategies for Variant Interpretation

2.1 The Gold Standard for Variant Filtering

Quality Filtering: Loci with GQ ≥ 20 and DP ≥ 30 are retained to reduce systematic errors from the sequencing platform.
Inheritance Pattern Validation: In pedigree analysis, a recessive model requires both parents to be carriers (e.g., a p.Arg123 mutation that is homozygous in siblings), while a de novo dominant model requires that neither parent carries the variant (e.g., a de novo p.Gln456 mutation).
Functional Validation: Isogenic cell lines are constructed using CRISPR/Cas9, and changes in protein expression are validated by Western blot (e.g., a TP53 mutation causing 80% protein truncation).
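The quality and inheritance checks above can be sketched for a single trio as follows. This is a simplified illustration: genotypes are encoded as alternate-allele counts (0, 1, or 2), the function names are hypothetical, and real pedigree analysis also has to handle missing calls, sex chromosomes, and compound heterozygosity.

```python
def passes_quality(gq, dp, min_gq=20, min_dp=30):
    """Retain a genotype call only if GQ and DP meet the thresholds in the text."""
    return gq >= min_gq and dp >= min_dp


def trio_pattern(child, mother, father):
    """Classify a variant in a trio from alternate-allele counts (0, 1, or 2).

    Returns "recessive_homozygous" when the child is homozygous and both
    parents are heterozygous carriers, "de_novo" when the child is
    heterozygous and neither parent carries the variant, else "other".
    """
    if child == 2 and mother == 1 and father == 1:
        return "recessive_homozygous"
    if child == 1 and mother == 0 and father == 0:
        return "de_novo"
    return "other"


# Hypothetical calls: (GQ, DP) for the child, then genotypes for child, mother, father.
if passes_quality(gq=45, dp=62):
    print(trio_pattern(child=1, mother=0, father=0))  # de_novo
```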

2.2 Multi-omics Integration Analysis

Epigenetic Regulation: Methylation microarray (Illumina 450K) was used to detect promoter methylation levels (β value > 0.7 indicates hypermethylation), and association analysis with RNA-seq data was performed (e.g., BRCA1 promoter methylation was significantly correlated with downregulation of expression, r = -0.62, p = 0.003).
Spatial Transcriptomics: 10x Genomics Visium technology was used to locate the expression regions of variant genes in tissues (e.g., TP53 mutation resulted in a 3-fold decrease in expression in the tumor core).
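The methylation-expression association described above is typically summarized with a correlation coefficient. The sketch below computes Pearson's r from paired values and flags hypermethylation at β > 0.7; the function names and the toy data are assumptions for illustration and are not taken from any study cited here.

```python
import math


def is_hypermethylated(beta, threshold=0.7):
    """Flag a promoter as hypermethylated when its beta value exceeds the threshold."""
    return beta > threshold


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Toy data: promoter beta values and expression levels for five samples.
beta_values = [0.10, 0.30, 0.50, 0.70, 0.90]
expression = [10.0, 8.0, 6.0, 4.0, 2.0]
r = pearson_r(beta_values, expression)
print(f"r = {r:.2f}; hypermethylated samples: {sum(map(is_hypermethylated, beta_values))}")
```

A negative r indicates that higher promoter methylation accompanies lower expression; assessing significance would additionally require a p-value, which this sketch does not compute.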

III. Clinical Applications and Typical Cases

3.1 Revealing the Genetic Structure of Rare Variants

Through systematic interpretation of whole-exome sequencing (WES) data, Wang L et al. revealed the genetic architecture of rare coding variants in opioid dependence (OD). Key findings are as follows:

After quality control of WES data from 4530 participants (including 2185 OD cases), a logistic mixed model was applied in ancestry-stratified (European, EUR; African, AFR) and cross-ancestry analyses to identify single-variant associations (e.g., the predicted-deleterious RUVBL2 LoF variant rs746301110 in EUR, p = 6.59×10⁻¹⁰). Gene-based collapsing tests, which capture the cumulative effect of rare variants, then identified key risk genes such as SLC22A10, CHRND (the most significant in the cross-ancestry analysis), and TMCO3 (p < 1×10⁻⁴).
RUVBL2 (a DNA helicase involved in repair) variants are ancestry-specific; CHRND (a cholinergic receptor) is differentially expressed in OD brain regions; and gene-set enrichment highlights "metabolic regulation" and "opioid signaling" pathways. These findings provide a basis for understanding OD mechanisms (such as abnormal DNA repair), identifying drug targets (Rho GTPases), and developing genetic markers, filling gaps in research on rare variants.
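To make the idea of collapsing rare variants within a gene concrete, the sketch below marks each sample as a carrier if it holds any qualifying rare variant in the gene and then compares carrier counts between cases and controls. Note the substitution: a one-sided Fisher's exact test is used here purely for illustration, whereas the study described above relied on regression-based models. The function names and toy genotypes are hypothetical.

```python
from math import comb


def collapse_carriers(genotypes_by_sample):
    """Return the set of samples carrying at least one qualifying rare variant.

    genotypes_by_sample maps a sample ID to a list of alternate-allele counts,
    one entry per qualifying rare variant in the gene.
    """
    return {s for s, calls in genotypes_by_sample.items() if any(c > 0 for c in calls)}


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact p-value for carrier enrichment in cases.

    Sums hypergeometric probabilities of observing at least case_carriers
    carriers among the cases, given the total number of carriers.
    """
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    denom = comb(total, n_cases)
    p = 0.0
    for x in range(case_carriers, min(carriers, n_cases) + 1):
        p += comb(carriers, x) * comb(total - carriers, n_cases - x) / denom
    return p


# Toy gene with two qualifying variants, genotyped in four cases and four controls.
cases = {"c1": [1, 0], "c2": [0, 1], "c3": [1, 0], "c4": [0, 2]}
controls = {"k1": [0, 0], "k2": [0, 0], "k3": [0, 0], "k4": [0, 0]}
p_value = fisher_one_sided(len(collapse_carriers(cases)), len(cases),
                           len(collapse_carriers(controls)), len(controls))
print(f"p = {p_value:.4f}")
```

Collapsing trades per-variant resolution for power: individually rare variants that could never reach significance alone contribute jointly to a single gene-level test.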

Figure: Cross-ancestry meta-analysis of single-variant associations (Wang L et al., 2025)

3.2 Breakthrough in Rare Disease Diagnosis

Through WES interpretation spanning variant screening to clinical association, Watanabe T et al. revealed new genetic clues for patients with spinocerebellar ataxia (SCA):

WES was performed on 174 patients with suspected SCA who lacked known pathogenic repeat expansions. After Sanger sequencing and assessment with five prediction algorithms, three novel single nucleotide variants (SNVs) were found in five cases (diagnostic rate 2.9%), while the remaining patients carried only benign variants.
ELOVL4 (SCA34) variants cause skin changes or parkinsonism; ELOVL5 (SCA38) variants are associated with bladder and rectal disorders; GRM1 (SCA44) variants present with heterogeneous phenotypes such as white matter lesions or spasticity.
This expands the known genetic diversity of SCA and reveals variant-phenotype heterogeneity (such as the absence of skin changes in some ELOVL4 cases), providing clues for undiagnosed patients. However, many variants are of "uncertain significance" and require functional validation. The diagnostic rate (2.9%) was lower than in similar studies, possibly because of factors such as ethnicity and the fact that SCA27B was not analyzed. Studies with larger sample sizes are needed.
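The reported diagnostic rate follows directly from the counts above (5 of 174 cases). As a hedged illustration of how much uncertainty a rate from a cohort of this size carries, the sketch below also computes a Wilson 95% interval; the interval is an addition for illustration and is not a figure reported by the study, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


diagnosed, cohort = 5, 174
rate = diagnosed / cohort
low, high = wilson_interval(diagnosed, cohort)
print(f"diagnostic rate = {100 * rate:.1f}%")
print(f"approximate 95% interval = {100 * low:.1f}% to {100 * high:.1f}%")
```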

3.3 Revealing the Rare Genetic Susceptibility of IGM

Through a systematic interpretation of whole-exome sequencing (WES) data, from variant identification to functional association, Ozer L et al. revealed a rare genetic susceptibility to idiopathic granulomatous mastitis (IGM). Key insights are as follows:

WES was performed on 30 IGM patients (female, 23-54 years old), focusing on 317 immune-related genes. A total of 141 variants (95-99% coverage) were detected in 100 genes. According to the ACMG criteria, 10.6% were pathogenic/likely pathogenic variants (13 genes, such as FCGR1A and MPO), carried by 40% of patients; 89.4% were variants of uncertain significance (VUS), mostly heterozygous.
The variants are concentrated in innate immune pathways: macrophage function (5 genes including FCGR1A and MPO), mitochondrial metabolism (3 genes including NAXD and COQ2), autoimmune inflammation (3 genes including IL36RN and RNASEH2B), and complement (C9). Each patient carries 2-8 variants, and some also have extramammary manifestations (erythema nodosum, arthritis).
This is the first WES study to associate IGM with innate immune abnormalities (phagocytic defects, mitochondrial disorders, and inflammatory dysregulation), supporting its classification as an "autoinflammatory disease." Eleven genes (such as MPO and IL36RN) serve as candidate susceptibility markers, providing alternative therapeutic targets (such as IL-36) for patients resistant to hormone therapy. However, the sample size is small (30 cases) and functional validation is lacking; larger cohorts and further research are needed.

3.4 Unique Genetic Risks Revealed by SCZ WES in High-Altitude Tibetan Patients

Through WES, Chen L et al. revealed unique, rare genetic risks in high-altitude Tibetan patients with schizophrenia (SCZ; 47 cases and 53 controls):

Sequencing identified 213,097 variants (including 27,644 novel variants), from which 275 potentially pathogenic variants (in genes such as MAP2 and BAI2) and 27 rare deleterious variants (frameshift, stop-gain, etc.) were identified.
Metascape enrichment showed that the variant genes were concentrated in hypoxia adaptation and neurodevelopmental pathways (flavonoid metabolism, RHOA regulation). The C5orf42 gene (cilia formation) was significantly associated, and only the BAI2 variant was replicated in Han Chinese patients (2 Tibetan cases, 1 Han Chinese case), suggesting population specificity.
This supports an interaction between high-altitude hypoxia and SCZ genetics, with C5orf42, MAP2, and PRODH (proline metabolism) as candidate susceptibility markers and the flavonoid metabolism pathway as a potential therapeutic target. The sample size is small (100 participants), and further validation is needed.

Figure: The proportion of sequenced variant types (Chen L. et al., 2024)

IV. Technological Challenges and Frontier Directions

4.1 Current Technological Bottlenecks

Low-Allele-Frequency Variants: Variants with an allele frequency (AF) <1% are easily masked by sequencing noise, requiring UMI-based error correction (for example, on the Illumina NovaSeq X) to reduce the error rate to 0.1%.
Complex Structural Variants: Alu element-mediated inversions (as in some types of alpha-thalassemia) have a conventional WES detection rate of only 65%, while long-read sequencing (PacBio Sequel II) can raise this to 92%.
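The error suppression offered by UMIs comes from grouping reads that share a molecular barcode and keeping only the base most of them agree on. The sketch below shows that idea at a single genomic position; the function name, the minimum family size of 2, and the strict-majority rule are assumptions for illustration and do not reproduce any specific vendor pipeline.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_family_size=2):
    """Collapse reads into one consensus base per UMI family at a single position.

    reads -- iterable of (umi, base) pairs observed at the position
    Families smaller than min_family_size, or without a strict majority base,
    are dropped. Returns a dict mapping each retained UMI to its consensus base.
    """
    families = defaultdict(list)
    for umi, base in reads:
        families[umi].append(base)

    consensus = {}
    for umi, bases in families.items():
        if len(bases) < min_family_size:
            continue
        base, count = Counter(bases).most_common(1)[0]
        if count > len(bases) / 2:  # require a strict majority within the family
            consensus[umi] = base
    return consensus


# Toy pileup: one family contains a sequencing error (T), one singleton is discarded.
pileup = [("AAC", "A"), ("AAC", "A"), ("AAC", "T"),
          ("GTT", "T"), ("GTT", "T"),
          ("CCA", "G")]
print(umi_consensus(pileup))  # {'AAC': 'A', 'GTT': 'T'}
```

Because a true variant is present in the original molecule, it appears in every read of its family, whereas a sequencing error usually affects only one copy and is outvoted.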

4.2 Future Technological Trends

Single-Cell Exome Sequencing: The 10x Genomics Chromium Next GEM Single Cell Exome Kit achieves single-cell resolution, resolving tumor heterogeneity (such as a TP53-mutant subclone expanding from 12% to 68%).
AI-Driven Interpretation: The DeepSEED model, fusing data from 100,000 WES cases, achieves an AUC of 0.87 for VUS pathogenicity prediction, a 30% improvement over traditional methods.
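For readers unfamiliar with the AUC metric quoted above, the sketch below computes it as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as half. The labels and scores are invented toy values, the function name is hypothetical, and nothing here reflects how any particular model was evaluated.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels -- 1 for pathogenic, 0 for benign
    scores -- predicted pathogenicity scores (higher means more likely pathogenic)
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy evaluation set: three pathogenic and three benign variants.
truth = [1, 1, 0, 0, 1, 0]
predicted = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
print(f"AUC = {roc_auc(truth, predicted):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which a value such as 0.87 should be read.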

4.3 Clinical Application Prospects

Dynamic Monitoring: Liquid biopsy (ctDNA) tracks tumor genome evolution in real time, guiding treatment adjustments.

Conclusion

Exome sequencing is transitioning from "data output" to "clinical insights." With nanopore sequencing (Oxford Nanopore PromethION 5) enabling real-time variant detection and federated data-sharing frameworks (such as those built on GA4GH standards) facilitating multi-center collaboration, precision medicine will enter a new era of "minute-level diagnosis and personalized intervention."

References:

Wang L, Nuñez YZ, Kranzler HR, Zhou H, Gelernter J. Whole-exome sequencing study of opioid dependence offers novel insights into the contributions of exome variants. medRxiv [Preprint]. 2024 Sep 17:2024.09.15.24313713.
Watanabe T, Kume K, Inoue K, Nakamura M, Yamamoto S, Kurashige T, Ohshita T, Tazuma T, Kaido M, Maetani Y, Maruyama H, Kawakami H. Whole exome sequencing in Japanese spinocerebellar ataxia identifies novel variants. J Hum Genet. 2026 Jan;71(1):35-39.
Ozer L, Koksal H. Whole exome sequencing for identifying rare genetic variants related to idiopathic granulomatous mastitis. Clin Rheumatol. 2025 Apr;44(4):1843-1850.
Chen L, Du Y, Hu Y, Li XS, Chen Y, Cheng Y. Whole-exome sequencing of individuals from an isolated population under extreme conditions implicates rare risk variants of schizophrenia. Transl Psychiatry. 2024 Jun 29;14(1):267. doi: 10.1038/s41398-024-02984-y. Erratum in: Transl Psychiatry. 2024 Jul 16;14(1):290.
