HLA Typing Sequencing Troubleshooting: Common Questions and Solutions
The high polymorphism of the human leukocyte antigen (HLA) system makes the interpretation of typing results challenging, especially when rare alleles, data bias, population differences, and other complicating factors are involved. With the number of alleles in the IMGT/HLA database exceeding 40,000 (as of 2025) and multi-ethnic research expanding worldwide, research and clinical laboratories urgently need a standardized framework for analyzing typing problems.
This article systematically describes troubleshooting methods for HLA typing by sequencing and explains how to use public databases to assist analysis, providing comprehensive guidance for resolving complex HLA typing problems.
Identification and Verification of Rare HLA Alleles
Because the HLA system is highly polymorphic and the number of alleles in the database is growing rapidly, accurate identification and verification of rare alleles is important for both research and clinical work. This section outlines a system for identifying and verifying them.
Definition and Frequency Threshold of Rare Alleles
The definition of a rare HLA allele should take into account the genetic background of the population under study. The following frequency criteria are usually adopted:
- Population-specific rare: alleles with a frequency below 0.1% in a specific population. For example, HLA-A*68:02 has a frequency of 0.05% in the Han population but can reach 1.2% in African populations.
- Globally rare: alleles with a frequency below 0.01% in the global population, usually newly discovered or extremely low-frequency variants. For example, HLA-B*14:56 is represented by only 3 samples in the IMGT/HLA database.
When identifying rare alleles, pay attention to how current the frequency database is. The Allele Frequency Net Database (AFND) integrates frequency data from more than 200 populations worldwide, but data for some ethnic groups are still missing. For example, the frequency of HLA-B*40:02 in the Tibetan population in China is as high as 18%, but AFND previously contained only Han data, so the allele could easily be misjudged as rare in Tibetan samples.
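The two frequency thresholds above can be turned into a simple classification rule. The following is a minimal sketch, not an established tool: the function name and return labels are hypothetical, frequencies are given as fractions rather than percentages, and a missing population frequency is reported as unknown instead of rare, reflecting the database coverage caveat above.

```python
from typing import Optional

# Thresholds taken from the text, expressed as fractions
POPULATION_RARE = 0.001   # 0.1%
GLOBAL_RARE = 0.0001      # 0.01%


def classify_rarity(pop_freq: Optional[float], global_freq: Optional[float]) -> str:
    """Classify an allele from its population and global frequencies.

    Pass None when the frequency database has no data for that scope.
    """
    if global_freq is not None and global_freq < GLOBAL_RARE:
        return "globally rare"
    if pop_freq is None:
        # No data for this population: do not conclude that the allele is rare
        return "unknown"
    if pop_freq < POPULATION_RARE:
        return "population-specific rare"
    return "not rare"
```

The global frequencies passed to this function in practice would come from a frequency resource such as AFND; the sketch deliberately leaves the lookup out.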
A graphical representation of the variation seen in the HLA-A sequences (Robinson et al., 2020)
The Three-level Verification Process of Rare Alleles
For samples suspected of carrying rare alleles, a strict three-level verification process should be implemented to avoid misjudgment:
- A. First-level verification: bioinformatics optimization
- a) Multi-database comparison: Compare the sequence against IMGT/HLA, dbMHC, and other databases at the same time to rule out misjudgment caused by database version differences. For example, HLA-DRB1*16:54, newly added in IMGT/HLA version 3.47.0, may be wrongly typed as *16:02 with an older database version.
- b) Construction of a custom reference database: Add the suspected rare allele sequence to the reference database and re-type the sample. If the typing result becomes more stable (probability > 0.95), the allele may be a real variant.
- B. Second-level verification: experimental validation
- a) Sanger sequencing verification: Design specific primers for the suspected region, perform bidirectional sequencing, and read the DNA sequence directly. The accuracy of this method for verifying rare alleles is 99.9%, but the read length is limited (about 1 kb), so longer regions must be verified in segments.
- b) Molecular cloning: Cloning PCR products into vectors and sequencing single clones separates the two alleles of a heterozygote, so a rare allele is no longer masked by a high-frequency allele.
- C. Third-level verification: family genetic analysis
- a) Mendelian inheritance check: A rare allele present in a family should follow Mendelian inheritance. For example, when the parents' HLA-A types are *02:01 and *24:02, respectively, and the offspring carries an allele other than *02:01 or *24:02, a novel allele or sample contamination should be considered.
- b) Haplotype linkage analysis: Rare alleles are often linked to a specific haplotype. For example, HLA-A*66:01 is often linked to HLA-B*40:02-DRB1*13:02 in Japanese individuals. If the expected haplotype is incomplete, typing errors should be suspected.
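The Mendelian inheritance check in the third verification level lends itself to a short illustration. The sketch below assumes each genotype at a locus is given as a pair of allele names and tests whether the child's pair can be formed by taking one allele from each parent. The function names are hypothetical, and real pipelines also need to handle typing resolution and ambiguity, which this sketch ignores.

```python
from itertools import product
from typing import List, Tuple

Genotype = Tuple[str, str]


def mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child's genotype equals one maternal plus one paternal allele."""
    target = sorted(child)
    return any(sorted((m, f)) == target for m, f in product(mother, father))


def unexplained_alleles(child: Genotype, mother: Genotype, father: Genotype) -> List[str]:
    """Child alleles found in neither parent (candidate novel alleles or contamination)."""
    parental = set(mother) | set(father)
    return [allele for allele in child if allele not in parental]


# Example from the text: parents typed as A*02:01 and A*24:02 (assumed homozygous here)
mother = ("A*02:01", "A*02:01")
father = ("A*24:02", "A*24:02")
print(mendelian_consistent(("A*02:01", "A*24:02"), mother, father))
print(unexplained_alleles(("A*02:01", "A*68:02"), mother, father))
```

A failed check does not by itself distinguish a novel allele from contamination or a sample swap; it only indicates that the first two verification levels should be revisited.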
NeoOncoHLA VS POLYSOLVER performance benchmark in somatic HLA variant simulation experiments (Anzar et al., 2022)
Discovery and Submission Specification of New Alleles
When an allele is confirmed as novel, the submission requirements of the WHO HLA Nomenclature Committee should be followed:
- Sequence data requirements: Provide bidirectional sequence data from at least two independent experiments (such as Sanger sequencing and NGS), covering all exons and adjacent intron regions of the allele.
- Function prediction analysis: Use tools such as NetMHC to predict the peptide binding ability of the new allele. If the difference between the new allele and known alleles lies in the antigen-binding groove, experimental antigen presentation data should also be provided.
- Naming application process: Submit the sequence data and analysis report online through the IMGT/HLA database; the naming committee will complete the review within 8 weeks. The new allele is first released with a temporary designation (such as HLA-A*02:XX), and the official name is assigned after confirmation.
Overview of the main features in AFND (Gonzalez-Galarza et al., 2020)
Common Causes and Identification of False Positive Results
False positive results in HLA typing may lead to serious clinical consequences. Their causes fall into three categories.
- A. False positives caused by experimental operation
- a) Cross-contamination of samples: During the complex workflow of high-throughput sequencing library construction, incorrect sample index tags and contamination of the PCR system are common causes of false positive alleles. To avoid this, the laboratory should strictly separate work areas for sample handling, use filter pipette tips for liquid transfer, and regularly decontaminate benches and equipment with a nucleic acid decontamination reagent. In addition, a dual-index labeling system significantly reduces the risk of index mix-ups, because both indexes must match.
- b) Nonspecific amplification: During HLA typing, capture probes can bind homologous sequences elsewhere in the genome, leading to enrichment and amplification of off-target regions. To address this, modern assays mostly use optimized probe designs and enhance binding specificity by introducing locked nucleic acid (LNA) modifications. In addition, bioinformatics algorithms identify and filter nonspecific products by sequence alignment during data analysis, improving the reliability of typing results.
- B. False positives caused by bioinformatics analysis
- a) Database bias: The currency of the HLA database directly affects typing accuracy. When an outdated HLA database (one that lacks the latest alleles) is used for typing, a variant of a high-frequency allele may be misjudged as a rare allele. Note also that allele naming rules differ between databases, such as the IPD-IMGT/HLA database and the NCBI RefSeq database. Without cross-database verification, naming confusion may occur and introduce systematic bias into typing results.
- b) Limitations of the algorithm: Typing algorithms based on short-read sequencing (such as OptiType and HLA-VBSeq) have technical bottlenecks when dealing with highly similar alleles. Although emerging long-read sequencing can alleviate this problem, existing algorithms still struggle to balance computational efficiency and accuracy when processing long-read data.
- C. False positives caused by sample characteristics
- a) Somatic mutation in tumor samples: During tumor development, HLA genes are prone to somatic mutation under the influence of the tumor microenvironment and the genetic instability of the tumor cells themselves. If such mutations are not accurately identified, they are often misjudged as normal (germline) alleles, which biases HLA typing results.
- b) Mixed signals in chimeric samples: After hematopoietic stem cell transplantation (HSCT), donor and recipient hematopoietic cells coexist, so the HLA signal of a chimeric sample is a complex mixture. In traditional PCR-SSP (sequence-specific primer polymerase chain reaction) or SBT (sequence-based typing) assays, if the donor and recipient HLA alleles contain similar sequences, their amplification products interfere with each other, producing allele combinations that look plausible but are wrong.
Identifying false positives requires combining multiple lines of evidence: when the number of reads supporting an allele is less than 10, the coverage is below 50×, and the allele's frequency in the AFND database is 0, a false positive should be strongly suspected. In addition, orthogonal verification with long-read sequencing can reduce the false positive rate from 3.2% to less than 0.1%.
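The three evidence criteria above can be combined into an automated flag. This is a minimal sketch under stated assumptions: the `AlleleCall` structure and its field names are hypothetical and would need to be mapped to whatever the typing pipeline reports, and the thresholds are the ones quoted in the text.

```python
from dataclasses import dataclass
from typing import List

MIN_SUPPORTING_READS = 10   # threshold from the text
MIN_COVERAGE = 50.0         # 50x, threshold from the text


@dataclass
class AlleleCall:
    allele: str
    supporting_reads: int
    coverage: float          # mean depth over the locus
    afnd_frequency: float    # 0 means no recorded frequency in AFND


def risk_flags(call: AlleleCall) -> List[str]:
    """List which of the three false positive indicators apply to a call."""
    flags = []
    if call.supporting_reads < MIN_SUPPORTING_READS:
        flags.append("low_supporting_reads")
    if call.coverage < MIN_COVERAGE:
        flags.append("low_coverage")
    if call.afnd_frequency == 0:
        flags.append("absent_from_afnd")
    return flags


def highly_suspect(call: AlleleCall) -> bool:
    """True only when all three indicators are present, as described in the text."""
    return len(risk_flags(call)) == 3
```

Returning the individual flags as well as the combined verdict keeps borderline calls visible, so that a call meeting only one or two criteria can still be routed to manual review.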
Reactivity patterns of class II HLA antibody assay (Park et al., 2020)
Mechanism of False Negative Results and Corrective Strategies
False-negative results (missed real alleles) are more harmful in HLA typing. The main mechanisms include:
- A. Limitations of sequencing technology
- a) Blind spots in capture probe design: Traditional HLA capture probe designs have obvious limitations. They mainly target exons 2-4, which encode the antigen-binding sites, and ignore functional variation in the 5'UTR and intron regions.
- b) Difficulty sequencing high-GC regions: The high-GC region at the 5' end of the HLA-DRB1 gene (especially exon 1) is another technical bottleneck. The GC content in this region often exceeds 75%, significantly higher than the genome average. This increases the stability of the DNA duplex and readily causes problems such as reduced primer annealing efficiency and read dropout during PCR amplification and sequencing.
- B. Algorithm model bias
- a) Insufficient detection of heterozygote imbalance: In next-generation sequencing (NGS) data processing, unbalanced read distribution between the two alleles of a heterozygote is a common cause of typing errors. When the fraction of reads supporting an allele is less than 20%, traditional typing algorithms (such as HLA*LA) often fail to identify it because of preset thresholds.
- b) New alleles not included in the model: The polymorphism of HLA loci continues to expand and a large number of alleles are added every year, but updates to algorithm models lag behind.
- C. Sample processing problems
- a) DNA degradation: DNA from formalin-fixed paraffin-embedded (FFPE) samples is severely fragmented, which causes long-fragment amplification of HLA genes to fail and intron variation to be missed.
- b) Low DNA input: PCR amplification bias is significant in low-input samples (such as circulating tumor DNA). Short fragments or high-frequency alleles are preferentially amplified, so rare alleles go undetected.
- D. Correction of false negative results
- a) Experimental optimization: Read complete HLA alleles directly with long-read sequencing (such as PacBio HiFi), reduce amplification bias with single-molecule PCR, and design specific primers for nested PCR of high-GC regions.
- b) Algorithm improvement: Use Kourami and other new-generation typing tools, whose allele network model can improve detection of heterozygote imbalance, and regularly update the algorithm training set to include newly discovered allele data.
- c) Quality control: Establish a false negative risk assessment model. When the sample DNA concentration is < 10 ng/μL and the degradation index (DI) is > 2.5, long-read sequencing verification is triggered automatically.
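Two of the numeric rules in this section, the 20% read-support threshold for heterozygote imbalance and the quality control trigger for long-read verification, can be expressed directly in code. The sketch below is illustrative only: the function and parameter names are hypothetical, and the read counts are assumed to be the numbers of reads assigned to each of the two candidate alleles at a locus.

```python
def minor_allele_fraction(reads_allele_1: int, reads_allele_2: int) -> float:
    """Fraction of reads supporting the less-covered allele of a heterozygote."""
    total = reads_allele_1 + reads_allele_2
    if total == 0:
        raise ValueError("no reads assigned to either allele")
    return min(reads_allele_1, reads_allele_2) / total


def is_imbalanced(reads_allele_1: int, reads_allele_2: int, threshold: float = 0.20) -> bool:
    """True when the minor allele falls below the 20% support level cited in the text."""
    return minor_allele_fraction(reads_allele_1, reads_allele_2) < threshold


def needs_long_read_verification(dna_conc_ng_per_ul: float, degradation_index: float) -> bool:
    """QC trigger from the text: concentration < 10 ng/uL and DI > 2.5."""
    return dna_conc_ng_per_ul < 10 and degradation_index > 2.5


# A degraded, low-concentration sample triggers orthogonal verification
print(needs_long_read_verification(8.0, 3.1))
# A 90:10 read split is flagged as imbalanced rather than silently called homozygous
print(is_imbalanced(90, 10))
```

Flagging imbalance explicitly, instead of letting a preset threshold drop the minor allele, gives the laboratory a chance to re-examine the locus before a homozygous call is reported.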
Block diagram of false-positive and false-negative results in molecular diagnostic (Sajal et al., 2024)
How to Use Public Database to Assist Analysis
Public databases are a key resource when analyzing difficult HLA typing results. As the number of HLA alleles grows rapidly, polymorphism analysis becomes increasingly complex. IPD-IMGT/HLA and other databases integrate large volumes of data, providing a basis for rare allele identification and population difference analysis, and play an important role in resolving typing problems.
Allele Sequence Alignment and Variation Analysis
Multi-sequence alignment tool: Using the Align Database tool built into the database, a suspected rare allele can be compared in parallel with more than 29,000 known allele sequences in the IMGT/HLA database, with the e-value threshold set to 1e-5 and the Needleman-Wunsch global alignment algorithm.
Prediction of mutation function: An integrated analysis platform combining eight prediction tools, including SIFT, PolyPhen-2, and MutationTaster, can be used to evaluate the functional impact of a newly discovered HLA allele mutation from multiple angles.
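To make the global alignment step mentioned in this section concrete, the following sketch implements Needleman-Wunsch scoring and uses it to find the closest reference sequence for a query. It is not the database tool's implementation: the scoring parameters (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, the e-value filter is not modeled, and the toy sequences and allele names are hypothetical.

```python
from typing import Dict, Tuple


def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def closest_allele(query: str, references: Dict[str, str]) -> Tuple[str, int]:
    """Return the (name, score) of the reference with the highest alignment score."""
    scores = {name: needleman_wunsch_score(query, seq) for name, seq in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy reference set (hypothetical names and sequences, not real HLA alleles)
toy_references = {"toy*01": "GATTACA", "toy*02": "GATCACA"}
print(closest_allele("GATTAGA", toy_references))
```

Only two rows of the dynamic programming matrix are kept because the sketch returns the score alone. Recovering the aligned positions, which is what reveals where a suspected rare allele differs from its nearest known neighbor, would require storing the full matrix and tracing back.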
Allele Frequency and Population Distribution Query
Interactive frequency map: Using the visualization functions of the Allele Frequency Browser, researchers can quickly obtain the distribution of a target allele across more than 200 populations worldwide. The map can also be linked to disease susceptibility databases, which helps clinicians quickly assess the risk associated with specific alleles in different populations.
Population-specific frequency table: For samples with complex genetic backgrounds, applying a population-specific frequency table significantly improves HLA typing accuracy. This strategy is especially suitable for populations with unique genetic structures. Integrating local frequency data reduces the risk of misjudgment caused by generic reference data.
The IPD-IMGT/HLA Database has received over 53,000 submissions since its launch in 1998 (Robinson et al., 2020)
Haplotype and Linkage Disequilibrium Data Query
Haplotype query: The haplotypes module can be used to query the common linkage partners of a target allele. The module integrates haplotype data from populations around the world into a large allele linkage map. This population-specific linkage information provides key clues for haplotype inference in complex typing results and reduces typing errors caused by unclear linkage relationships.
Calculation of LD coefficients: The LD Calculator tool quantifies linkage disequilibrium between loci by calculating two core parameters, D' and r².
- D' reflects how tightly alleles are linked in the population and ranges from 0 to 1. The closer the value is to 1, the stronger the linkage.
- r² additionally accounts for allele frequency and is often used to assess the statistical strength of linkage disequilibrium in association studies, such as genome-wide association studies (GWAS).
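The two linkage disequilibrium coefficients can be computed from three frequencies using the standard population genetics definitions. The sketch below is not the LD Calculator's implementation. It assumes a biallelic simplification in which one allele at each locus is compared against all other alleles at that locus, and it reports the correlation-based coefficient as r². The function and parameter names are hypothetical.

```python
from typing import Tuple


def ld_coefficients(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float]:
    """Return (D', r^2) for allele A at locus 1 and allele B at locus 2.

    p_ab: frequency of the haplotype carrying both A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d_prime, r_squared


# A rare allele that always travels with a common partner:
# D' is close to 1 while r^2 stays low because the allele frequencies differ
print(ld_coefficients(p_ab=0.1, p_a=0.1, p_b=0.5))
```

The example illustrates why both values are reported: D' measures whether the rare allele is ever seen apart from its partner, while r² also reflects how well one allele predicts the other given their frequencies.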
Conclusion
Analysis of difficult HLA typing results is the key link between gene polymorphism and clinical practice. As allele databases continue to expand and cross-ethnic research deepens, a dynamically updated analytical system is needed that tightly integrates bioinformatics algorithms, experimental verification techniques, and population genetics data. In the future, the spread of long-read sequencing and the development of AI-assisted interpretation will make difficult cases easier to resolve, but standardized verification processes and multidisciplinary collaboration remain the cornerstones of quality, ensuring that HLA typing data deliver the greatest value in transplantation medicine and disease research.
References
- Robinson J, Barker DJ, et al. "IPD-IMGT/HLA Database." Nucleic Acids Res. 2020 48(D1): D948-D955 https://doi.org/10.1093/nar/gkz950
- Gonzalez-Galarza FF, McCabe A, et al. "Allele frequency net database (AFND) 2020 update: gold-standard data classification, open access genotype data and new query tools." Nucleic Acids Res. 2020 48(D1): D783-D788 https://doi.org/10.1093/nar/gkz1029
- Anzar I, Sverchkova A, et al. "Personalized HLA typing leads to the discovery of novel HLA alleles and tumor-specific HLA variants." HLA. 2022 99(4): 313-327 https://doi.org/10.1111/tan.14562
- Park BG, Park Y, Kim BS, Kim YS, Kim HS. "False Positive Class II HLA Antibody Reaction Due to Antibodies Against Denatured HLA Might Differ Between Assays: One Lambda vs. Immucor." Ann Lab Med. 2020 40(5): 424-427 https://doi.org/10.3343/alm.2020.40.5.424
- Sajal SSA, Islam DZ, et al. "Strategies to Overcome Erroneous Outcomes in Reverse Transcription-Polymerase Chain Reaction (RT-PCR) Testing: Insights From the COVID-19 Pandemic." Cureus. 2024 16(11): e72954 https://doi.org/10.7759/cureus.72954