HLA Typing Data Analysis: Quality Control, Tools, and Data Visualization

As a key component of the immune system, the human leukocyte antigen (HLA) system plays a central role in organ transplantation matching, disease association research, and population genetic analysis. With the rapid development of high-throughput sequencing, the main challenge in HLA typing data analysis is extracting accurate allele information from large volumes of raw data.

This article walks through the complete HLA typing data analysis workflow, from raw read processing to result interpretation, as a technical reference for related research.

Processing of Raw Sequencing Data

In HLA typing data analysis, processing and filtering the raw sequencing data is the first step toward accurate typing. Raw high-throughput sequencing data often contain adapter sequences, low-quality reads, and noise in polymorphic regions. Without strict preprocessing, these lead to biased allele alignment and typing errors. A sound and efficient filtering strategy is therefore the basis for extracting accurate HLA genotypes from complex sequencing data.

Quality Evaluation

The raw data generated by high-throughput sequencing are usually stored in FASTQ format, which contains the read sequences and their corresponding quality scores. Before HLA typing, the first task is to evaluate data quality comprehensively. Commonly used quality assessment tools such as FastQC generate detailed reports covering key indicators such as per-base quality distribution, sequence length distribution, and adapter contamination.

Examining the quality score distribution at each base position identifies low-quality regions. The proportion of N (uncalled) bases gives a further check on the reliability of the data. When more than 10% of the bases in a region have a quality score below 20, the data in that region may need special treatment.
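The per-position check described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for FastQC: it assumes Phred+33 encoded, four-line FASTQ records, and the function names and the tiny in-memory example are hypothetical.

```python
from collections import defaultdict

def low_quality_fraction_by_position(fastq_lines, threshold=20, offset=33):
    """Return {position: fraction of bases with a Phred score below threshold}."""
    low = defaultdict(int)
    total = defaultdict(int)
    # Assumes four-line FASTQ records; every fourth line is the quality string.
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:
            continue
        for pos, char in enumerate(line.strip()):
            total[pos] += 1
            if ord(char) - offset < threshold:
                low[pos] += 1
    return {pos: low[pos] / total[pos] for pos in sorted(total)}

def flag_positions(fractions, max_fraction=0.10):
    """Positions where more than max_fraction of bases fall below the threshold."""
    return [pos for pos, frac in fractions.items() if frac > max_fraction]

# Hypothetical two-read example: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
records = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "III#",
]
fractions = low_quality_fraction_by_position(records)
flagged = flag_positions(fractions)
print(fractions, flagged)
```

In real use the lines would come from an opened FASTQ file rather than a list, and the 20 and 10% cut-offs are simply the values quoted in the text.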

Data Filtering

Raw sequencing data often contain adapter sequences, low-quality reads, and noise introduced by PCR amplification, so data quality must be improved through strict filtering. Tools such as Trim Galore can automatically detect and remove adapters and trim reads by quality score. Commonly used filtering parameters are as follows:

  • Bases with a quality score below 20 at the 3' end are trimmed
  • Reads with a mean quality score below 25 or a length below 50 bp are discarded
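As a rough illustration of these two rules, the sketch below trims low-quality bases from the 3' end and then applies the discard criteria. It is a simplification under stated assumptions: qualities are already decoded to integers, the two discard criteria are treated as alternatives (either one rejects the read), and production trimmers use more robust trimming algorithms than this plain trailing-base loop. The function names are hypothetical.

```python
def trim_3prime(seq, quals, min_q=20):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def keep_read(seq, quals, min_mean_q=25, min_len=50):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if not seq or len(seq) < min_len:
        return False
    return sum(quals) / len(quals) >= min_mean_q

# Hypothetical 60-base read whose last five bases are low quality.
seq = "ACGT" * 15
quals = [35] * 55 + [12] * 5
trimmed_seq, trimmed_quals = trim_3prime(seq, quals)
kept = keep_read(trimmed_seq, trimmed_quals)
print(len(trimmed_seq), kept)
```

Trimming before filtering matters here: the length and mean-quality checks are applied to what remains after the 3' end has been cleaned.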

For paired-end sequencing data, the two reads of each pair must be kept matched, and overlapping pairs can be merged with tools such as PEAR to improve the continuity of downstream analysis. Because HLA loci are highly polymorphic and mismatches arise easily during sequencing, the filtering threshold should be raised appropriately for these immune-related gene regions to reduce false positive results.

Overview of the HLA-VBSeq pipeline (Nariai et al., 2015)

Special Data Processing

Data from different sequencing platforms call for different processing strategies.

  • For Illumina data, pay particular attention to residual adapter sequence
  • For PacBio single-molecule long-read data, focus on correcting sequencing errors.

For targeted capture sequencing data, tools such as BEDTools can extract reads from the HLA gene region and eliminate interference from off-target regions. Clinical samples often have limited input material or degraded DNA. In such cases, a specialized library construction method is needed, the length threshold in the filtering stage should be relaxed appropriately, and the sequencing depth should be increased to compensate for the loss of data.
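The idea behind restricting reads to target regions is a simple interval-overlap test. The sketch below shows that logic only; it does not call BEDTools, and the read tuples and the target coordinates are hypothetical placeholders rather than real HLA gene positions. Intervals follow the BED convention of 0-based, half-open coordinates.

```python
def overlaps(read_start, read_end, region_start, region_end):
    """True if two half-open intervals share at least one base."""
    return read_start < region_end and region_start < read_end

def reads_in_targets(reads, targets):
    """reads: (name, chrom, start, end) tuples; targets: {chrom: [(start, end), ...]}."""
    kept = []
    for name, chrom, start, end in reads:
        if any(overlaps(start, end, s, e) for s, e in targets.get(chrom, [])):
            kept.append(name)
    return kept

# Placeholder coordinates for illustration only.
targets = {"chr6": [(100, 200)]}
reads = [
    ("r1", "chr6", 150, 250),  # overlaps the target
    ("r2", "chr6", 200, 300),  # starts exactly where the target ends
    ("r3", "chr1", 150, 180),  # different chromosome
]
on_target = reads_in_targets(reads, targets)
print(on_target)
```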

HLA Allele Alignment and Typing Algorithm

Allele alignment and the typing algorithm are the core of sequencing-based HLA typing. Because HLA genes are highly polymorphic, sequencing reads must be matched accurately against known alleles. The algorithm must also cope with sequencing errors and heterozygous genotypes to provide reliable typing results for clinical and research use.

Typing Method Based on Sequence Alignment

Sequence alignment is the core step of HLA typing. Its basic principle is to match the preprocessed sequencing reads against a reference database of known HLA alleles. Commonly used aligners such as Bowtie2 and BWA achieve efficient short-read alignment by building an index of the reference. Because HLA loci are highly polymorphic, typing requires special alignment strategies.

For classical class I genes such as HLA-A, HLA-B, and HLA-C, a reference database containing all known allele sequences can be constructed, and local alignment mode can be used to improve the mapping rate. During alignment, special attention should be paid to hypervariable regions, such as the sites encoding amino acid residues of the antigen-binding groove. Accurate typing of these regions directly affects the success of transplantation matching.

Mathematical Models of Typing Algorithms

Modern HLA typing algorithms are usually based on probabilistic models, combinatorial optimization, or machine learning methods, and they genotype by jointly considering the alignment results across all candidate alleles.

  • OptiType formulates typing as an integer linear program: reads are aligned against a reference of known class I alleles, and the solver selects the allele combination that explains the largest number of reads.
  • HLA*LA aligns reads to a population reference graph built from known HLA allele sequences and then evaluates, for each locus, how well each candidate allele pair explains the observed reads.

The central problem for these algorithms is handling sequencing errors and heterozygous samples. In a heterozygote, the algorithm must distinguish two different alleles and evaluate their relative read support to avoid typing errors.
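One way to picture the genotype selection step is as a search for the pair of alleles that explains the most reads. The toy sketch below does this by brute force over all pairs. It is an illustration of the principle only, not the implementation of any tool named above, and the input format (one set of compatible alleles per read) and the example data are assumptions.

```python
from itertools import combinations_with_replacement

def best_genotype(read_hits, alleles):
    """Pick the allele pair that explains the most reads.

    read_hits: list of sets, each holding the alleles one read is compatible with.
    Homozygous pairs are included; ties go to the first pair in sorted order.
    """
    best_pair, best_score = None, -1
    for pair in combinations_with_replacement(sorted(alleles), 2):
        pair_set = set(pair)
        explained = sum(1 for hits in read_hits if hits & pair_set)
        if explained > best_score:
            best_pair, best_score = pair, explained
    return best_pair, best_score

# Hypothetical read-to-allele compatibility for one locus.
alleles = {"A*01:01", "A*02:01", "A*03:01"}
read_hits = [
    {"A*01:01"},
    {"A*01:01", "A*03:01"},
    {"A*02:01"},
    {"A*02:01"},
    {"A*02:01", "A*03:01"},
]
genotype, n_explained = best_genotype(read_hits, alleles)
print(genotype, n_explained)
```

Exhaustive search is only workable for a handful of candidates. With thousands of known alleles per locus, real tools rely on optimization solvers or probabilistic inference instead.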

The accuracy of each tool at one-field and two-field resolution, with Class I accuracy plotted against Class II accuracy (Yu et al., 2024)

Algorithm Optimization for High-Resolution Typing

Achieving high-resolution HLA typing (two-field, formerly called 4-digit, or higher) places greater demands on the algorithm. Because high-resolution alleles may differ by only one or a few nucleotides, traditional alignment methods are prone to ambiguity. Researchers have developed a variety of optimization strategies to address this.

  • Kourami uses graph-guided assembly: reads are threaded through a graph built from known allele sequences to reconstruct both haplotypes at a locus, and the assembled sequences are then matched to the closest alleles in the database.
  • Another approach exploits single-molecule long-read sequencing: reads that span a complete HLA allele avoid the assembly ambiguity of short-read data.

During algorithm optimization, the completeness of the allele database must also be considered. The reference database should be updated regularly to include newly discovered alleles so that typing results remain accurate and current.
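The resolution levels discussed in this section correspond to the colon-separated fields of an HLA allele name. The helper below, a hypothetical utility rather than part of any typing tool, truncates a name to a given number of fields so that calls can be compared at one-field or two-field resolution. It assumes names of the form GENE*field:field and ignores expression suffixes such as N or L.

```python
def truncate_allele(name, fields):
    """Truncate an allele name to the first `fields` colon-separated fields."""
    gene, sep, rest = name.partition("*")
    if not sep:
        return name  # not in GENE*fields form, so leave it unchanged
    return gene + "*" + ":".join(rest.split(":")[:fields])

def same_at_resolution(allele_a, allele_b, fields=2):
    """True if two allele names agree once both are cut to `fields` fields."""
    return truncate_allele(allele_a, fields) == truncate_allele(allele_b, fields)

full_name = "HLA-A*02:01:01:01"
one_field = truncate_allele(full_name, 1)
two_field = truncate_allele(full_name, 2)
print(one_field, two_field)
```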

Comparison of Common Analytical Tools

In HLA typing analysis, tools such as OptiType, HLA*LA, and Kourami play a key role. Built on different algorithmic principles, each has its own strengths and weaknesses in typing accuracy and resolution. The following sections compare these commonly used tools along several dimensions.

Comparison of Core Algorithms

OptiType, HLA*LA, and Kourami, the mainstream tools for HLA typing, each have distinct characteristics in architecture and algorithmic principle.

OptiType aligns reads against a reference of known class I alleles and uses integer linear programming to find the allele combination that explains the most reads. Its main advantage is high computational efficiency, which makes it suitable for large-scale data analysis. However, it covers only class I loci, and its resolution may be insufficient when dealing with highly similar alleles.

HLA*LA is built on a population reference graph. Reads are aligned to a graph that encodes known HLA allele variation, the alignments are projected onto linear per-gene coordinates, and the most likely allele pair at each locus is then inferred. Because the graph represents known variation directly, alignment in highly polymorphic regions is less biased toward a single reference sequence.

Kourami, a relatively new typing tool, takes a graph-guided assembly approach.

  • First, a graph is built for each locus from the multiple sequence alignment of known alleles
  • Aligned reads are then mapped onto this graph, and read support is recorded as weights along its paths
  • Finally, the best-supported pair of paths is assembled into two allele sequences, which are matched to the closest database entries.

Because it assembles allele sequences rather than only selecting among known ones, Kourami is better placed to handle ambiguous results and samples carrying alleles that are missing from the database.

The core difference between the three tools is also reflected in their adaptability to the sequencing platform:

  • OptiType and Kourami are mainly aimed at short-read sequencing data (such as Illumina).
  • HLA*LA additionally supports long-read sequencing data (such as PacBio and Oxford Nanopore).

Computational resource consumption of the 13 selected tools (Claeys et al., 2023)

Comparative Analysis of Performance Indexes

Given these differences in algorithm design, platform support, and resource use, different tools suit different research scenarios.

  • For clinical organ transplantation matching, where high-resolution HLA typing results are needed quickly, Kourami and OptiType are good choices, and Kourami's coverage of class II genes can improve the matching success rate.
  • For large-scale population genetics research, HLA*LA's ability to type both class I and class II loci from whole-genome data helps researchers analyze population genetic structure and disease association.
  • Whichever tool is chosen, check which release of the allele database (such as the IPD-IMGT/HLA database) it supports, because update frequency differs between tools.

In practice, a multi-tool cross-validation strategy is recommended, for example running OptiType and Kourami on the same sample and comparing the results. This is particularly valuable in critical applications (such as clinical diagnosis), where it can effectively reduce the typing error rate.
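A cross-validation step of this kind can be as simple as comparing the two call sets locus by locus at a common resolution. The sketch below assumes each tool's output has already been parsed into a dictionary of allele pairs; that input format, the function names, and the example calls are hypothetical and do not reflect the native output of any tool.

```python
def truncate_allele(name, fields=2):
    """Cut an allele name such as A*02:01:01 down to the first `fields` fields."""
    gene, sep, rest = name.partition("*")
    if not sep:
        return name
    return gene + "*" + ":".join(rest.split(":")[:fields])

def concordance(calls_a, calls_b, fields=2):
    """Compare unordered allele pairs per locus; returns {locus: True/False}."""
    report = {}
    for locus in sorted(set(calls_a) & set(calls_b)):
        pair_a = sorted(truncate_allele(x, fields) for x in calls_a[locus])
        pair_b = sorted(truncate_allele(x, fields) for x in calls_b[locus])
        report[locus] = pair_a == pair_b
    return report

# Hypothetical calls from two tools for the same sample.
tool_a_calls = {"A": ("A*02:01", "A*01:01"), "B": ("B*07:02", "B*08:01")}
tool_b_calls = {"A": ("A*01:01:01", "A*02:01:01"), "B": ("B*07:02:01", "B*44:02:01")}
report = concordance(tool_a_calls, tool_b_calls)
discordant = [locus for locus, agree in report.items() if not agree]
print(report, discordant)
```

Discordant loci are the ones worth re-examining, for example by inspecting read alignments or running a third tool.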

Performance comparison of three exome-based HLA typing algorithms (Kiyotani et al., 2016)

Data Visualization and Interpretation of Results

In HLA typing data analysis, data visualization and result interpretation are the key steps that turn complex genotype information into practical value. They bridge raw sequencing data and clinical or research decisions. Through intuitive presentation and in-depth analysis, HLA typing results can serve transplant matching, disease research, and other applications.

Visualization Method of HLA Typing Results

The ultimate goal of HLA typing data analysis is to turn complex genotype information into intuitive, easy-to-read visual results that support subsequent research and clinical decision-making. Commonly used visualizations include genotype heat maps, allele frequency distribution plots, haplotype network diagrams, and sequencing coverage plots.

  • A genotype heat map shows the typing results of multiple samples across HLA loci, with color coding to distinguish alleles. It makes compatibility or differences between samples quick to spot and is especially suited to multi-sample comparison in organ transplantation matching.
  • An allele frequency distribution plot shows how alleles are distributed in the target population, typically as a bar chart or pie chart. Combined with a population genetics database (such as AFND), it helps researchers analyze the genetic structure and evolutionary relationships of the population.
  • A haplotype network diagram shows the diversity and distribution pattern of HLA haplotypes by building a network of evolutionary relationships between them, which is valuable for studying the evolution of the HLA system and disease association.
  • A sequencing coverage plot helps evaluate the reliability of typing results by visualizing the sequencing depth and quality distribution at each locus.
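Most of these plots start from a small summary table. As one example, the sketch below computes the allele frequencies that an allele frequency distribution plot would display. The genotype list is made up for illustration, and the resulting dictionary could be handed to any plotting library.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """genotypes: list of (allele1, allele2) pairs for one locus.

    Returns {allele: frequency}, ordered from most to least common.
    """
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.most_common()}

# Hypothetical HLA-A genotypes for three samples.
genotypes = [
    ("A*02:01", "A*01:01"),
    ("A*02:01", "A*02:01"),
    ("A*24:02", "A*01:01"),
]
freqs = allele_frequencies(genotypes)
for allele, freq in freqs.items():
    # A crude text bar chart, one '#' per 5% of chromosomes.
    print(f"{allele:10s} {freq:.2f} {'#' * round(freq * 20)}")
```

Each sample contributes two chromosomes, so a homozygous sample counts its allele twice.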

Validation of SpecHLA for two-field and full-resolution HLA typing (Wang et al., 2023)

Key Points of Interpretation of Different Clinical Applications

In clinical organ transplantation matching, the core of result interpretation is evaluating HLA compatibility between donor and recipient. Attention should focus on the high-resolution typing results for classical class I (A, B, C) and class II (DRB1, DQB1) loci, and the number and type of mismatches (such as amino acid mismatches versus silent substitutions) should be determined. For each mismatched site, the amino acid sequence of the antigen-binding groove should be considered to evaluate the potential impact on transplant rejection.
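Counting mismatches is straightforward once both genotypes are available at the same resolution. The sketch below counts donor alleles that the recipient does not carry, which is the direction relevant to rejection of the graft. It is a simplified illustration: the donor and recipient genotypes are invented, a homozygous donor allele is counted once, and it says nothing about amino acid level differences.

```python
def count_mismatches(donor, recipient):
    """Per locus, count distinct donor alleles absent from the recipient."""
    result = {}
    for locus, donor_alleles in donor.items():
        recipient_alleles = set(recipient.get(locus, ()))
        result[locus] = sum(
            1 for allele in set(donor_alleles) if allele not in recipient_alleles
        )
    return result

# Hypothetical two-field genotypes.
donor = {
    "A": ("A*02:01", "A*03:01"),
    "B": ("B*07:02", "B*07:02"),
    "DRB1": ("DRB1*15:01", "DRB1*04:01"),
}
recipient = {
    "A": ("A*02:01", "A*01:01"),
    "B": ("B*08:01", "B*07:02"),
    "DRB1": ("DRB1*03:01", "DRB1*03:01"),
}
mismatches = count_mismatches(donor, recipient)
total_mismatches = sum(mismatches.values())
print(mismatches, total_mismatches)
```

Swapping the two arguments gives the count in the opposite direction, which can differ when either individual is homozygous.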

In disease association studies, result interpretation must draw on population genetic data and statistical analysis. First, verify whether the difference in HLA allele or haplotype frequency between the disease group and the control group is statistically significant. Common methods include the chi-square test, Fisher's exact test, and logistic regression. Second, apply multiple-testing correction to avoid false positives caused by testing many alleles. For any positive association, its biological significance should be verified through functional experiments (such as antigen presentation assays).

Frequencies of HLA-A, B, and DRB1 alleles that showed differences between lung cancer patients and healthy controls (Yang et al., 2010)
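For a single allele, the association test reduces to a 2x2 table of carriers and non-carriers in patients and controls. The sketch below implements a two-sided Fisher's exact test with the standard library and applies a Bonferroni correction, one simple form of multiple-testing correction. The counts and the number of alleles tested are made up for illustration, and an established statistics package would normally be used in real analyses.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical counts: carriers / non-carriers among patients, then controls.
p = fisher_exact_two_sided(3, 1, 1, 3)
p_adjusted = bonferroni(p, n_tests=20)
print(round(p, 4), p_adjusted)
```

Bonferroni is conservative when many alleles are tested. Other corrections exist, and the choice should be stated with the results.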

In tumor immunotherapy research, interpretation of HLA typing results should consider the HLA expression status of tumor cells and loss of heterozygosity (LOH). Comparing the HLA typing results of tumor and matched normal samples can identify HLA alleles lost in tumor cells, which may be related to tumor immune escape. HLA typing results can also be used to predict the presentation of tumor neoantigens. Combined with allele-specific peptide binding prediction algorithms (such as NetMHC), they allow screening for tumor antigens that may be recognized by T cells, which provides a basis for designing personalized immunotherapy.
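The tumor versus normal comparison can be sketched as a per-locus set difference. This is a deliberately simple illustration of the comparison described above, with invented genotypes and a hypothetical function name. A typing-level difference is only a hint of LOH, and allele-specific coverage or copy-number evidence would be needed to confirm it.

```python
def lost_alleles(normal, tumor):
    """Return {locus: alleles typed in the normal sample but not in the tumor}."""
    lost = {}
    for locus, normal_alleles in normal.items():
        missing = sorted(set(normal_alleles) - set(tumor.get(locus, ())))
        if missing:
            lost[locus] = missing
    return lost

# Hypothetical matched normal and tumor genotypes.
normal = {"A": ("A*02:01", "A*24:02"), "B": ("B*07:02", "B*15:01")}
tumor = {"A": ("A*02:01", "A*02:01"), "B": ("B*07:02", "B*15:01")}
candidates = lost_alleles(normal, tumor)
print(candidates)
```

Alleles flagged this way would be excluded when predicting which peptides the tumor can still present.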

Conclusion

In summary, HLA typing data analysis has formed a complete technical workflow, from raw data filtering through high-resolution typing to result visualization. With the spread of long-read sequencing and deeper integration of AI algorithms, future HLA typing is expected to advance toward accurate haplotype resolution, single-cell analysis of heterogeneity, and real-time clinical diagnosis. Continued development of this field both advances basic immunogenetics research and provides key technical support for clinical applications such as organ transplantation and tumor immunotherapy, ultimately supporting precision medicine that links genotype to phenotype.

References

  1. Nariai N, Kojima K., et al. "HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data." BMC Genomics. 2015 16 Suppl 2(Suppl 2): S7 https://doi.org/10.1186/1471-2164-16-s2-s7
  2. Yu D, Ayyala R., et al. "A rigorous benchmarking of alignment-based HLA typing algorithms for RNA-seq data." bioRxiv [Preprint]. 2024 16: 2023.05.22.541750 https://doi.org/10.1101/2023.05.22.541750
  3. Claeys A, Merseburger P., et al. "Benchmark of tools for in silico prediction of MHC class I and class II genotypes from NGS data." BMC Genomics. 2023 24(1): 247 https://doi.org/10.1186/s12864-023-09351-z
  4. Kiyotani K, Mai TH, Nakamura Y. "Comparison of exome-based HLA class I genotyping tools: identification of platform-specific genotyping errors." J Hum Genet. 2017 62(3): 397-405 https://doi.org/10.1038/jhg.2016.141
  5. Wang S, Wang M., et al. "SpecHLA enables full-resolution HLA typing from sequencing data." Cell Rep Methods. 2023 3(9): 100589 https://doi.org/10.1016/j.crmeth.2023.100589
  6. Yang L, Wang LJ., et al. "Analysis of HLA-A, HLA-B and HLA-DRB1 alleles in Chinese patients with lung cancer." Genet Mol Res. 2010 9(2): 750-5 https://doi.org/10.4238/vol9-2gmr735
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.


Copyright © CD Genomics. All Rights Reserved.