Interpretation of Sanger Sequencing Results: How to Analyze and Apply Sequencing Data

Sanger sequencing is a mature and widely used DNA sequencing technology and an important source of gene information in molecular biology research, clinical diagnosis, and other fields. Its results are usually presented in two forms: an electropherogram (the sequencing peak map, also called a chromatogram) and the corresponding base sequence. The peak map shows how the different bases were separated during sequencing, while the base sequence is the base-by-base reading of that peak map.

This form of presentation offers single-base resolution and high accuracy, and it reflects the base at each position of a DNA fragment. However, interpreting Sanger sequencing results is not easy for many researchers and clinicians. In practice they may encounter messy or abnormal peak patterns and ambiguous base calls, especially around substitutions, insertions, deletions, and other variants, and judging this information accurately is a real challenge. Relating the sequencing results to the research objective, and applying them sensibly to experimental design and to drawing conclusions, also requires experience and professional knowledge.

This article elaborates on the presentation forms, quality evaluation indicators, data analysis methods, and result applications of Sanger sequencing results, aiming to help researchers accurately interpret and apply such results.

Presentation of Sanger Sequencing Results

Sanger sequencing results are mainly presented in two forms: the electrophoresis peak map and the base sequence. The differently colored peaks in the peak map correspond to the bases A, T, C, and G, and the sharpness and height of the peaks reflect signal quality. The base sequence is derived from the peak map, with a quality value attached to each base. This presentation offers single-base resolution and shows sequence details intuitively, but it is limited by read length and by signal decay toward the end of the read.

Interpretation of the Electropherogram

The electropherogram of a Sanger sequencing run is generated by separating DNA fragments of different lengths by capillary electrophoresis. The horizontal axis represents the base position (that is, the position along the read) and the vertical axis represents fluorescence signal intensity. During electrophoresis, fragments terminated by dideoxynucleotides (ddNTPs) carrying different fluorescent labels migrate past the detector and are recorded. Each base corresponds to a display color: adenine (A) is green, cytosine (C) is blue, guanine (G) is black or yellow, and thymine (T) is red.

Understanding what a peak means is the core of reading the electropherogram. Each clear, sharp peak represents a specific base at that position, and the peak height reflects the signal intensity for that base. A strong signal with a sharp peak indicates that the sequencing reaction extended efficiently and specifically at that position. The continuous series of peaks makes up the complete DNA sequence; software identifies the color and position of each peak and converts them directly into the corresponding base sequence.
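The idea of turning peak color and height into a base call can be sketched in a few lines. This is a minimal illustration, not how production base callers work: it assumes the four channel intensities at each peak position are already available as a dictionary, and the `min_ratio` threshold and the example intensities are hypothetical values chosen for the demonstration.

```python
# Display colors as listed in the text (G may also be drawn in yellow).
DYE_COLORS = {"A": "green", "C": "blue", "G": "black", "T": "red"}

def call_base(intensities, min_ratio=2.0):
    """Return the base with the strongest signal at one peak position.

    Returns 'N' when the strongest signal is not at least `min_ratio`
    times the runner-up, i.e. when the peak is not clearly a single base.
    `min_ratio` is an illustrative assumption, not a standard value.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top), (_, second) = ranked[0], ranked[1]
    if top <= 0 or top < min_ratio * second:
        return "N"
    return top_base

# Three hypothetical peak positions: two clean peaks and one mixed peak.
peaks = [
    {"A": 950, "C": 20, "G": 15, "T": 30},
    {"A": 10, "C": 40, "G": 870, "T": 25},
    {"A": 400, "C": 15, "G": 380, "T": 20},
]
sequence = "".join(call_base(p) for p in peaks)
print(sequence)  # AGN
```

The third position is reported as N because the A and G signals are nearly equal, which is the situation that calls for a closer look at the trace.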

Sanger sequencing map (Li et al., 2022)

Quality Evaluation Index

To evaluate the reliability of Sanger sequencing results objectively, researchers use a set of quality indicators, the most common of which are the Phred quality score and sequencing depth.

The Phred quality score (Q value) measures the accuracy of a single base call. It is calculated as Q = -10 × log10(P), where P is the probability that the base call is wrong. For example, Q20 means an error probability of 1%, and Q30 means an error probability of 0.1%. In practice, it is usually required that more than 90% of the bases in a read reach Q20 or higher and more than 80% reach Q30 or higher. Sequencing analysis software (such as Sequencher or BioEdit) can display the Phred quality score of each base, which helps researchers judge how credible different regions of a read are.

In Sanger sequencing, sequencing depth usually refers to the number of times the same DNA fragment is sequenced. Unlike high-throughput sequencing, Sanger sequencing is generally run at low depth (usually 1-2 times), but because of its high accuracy a single read meets most experimental requirements. Where high confidence is required (for example, confirming a mutation in clinical diagnosis), the same template is usually sequenced in both directions (forward and reverse) or sequenced repeatedly. Bidirectional reads verify each other and reduce the errors that a single direction may introduce, which is especially useful when detecting variation in long DNA fragments.

Read length is another important indicator. The average read length of Sanger sequencing is usually 500-800 bases, and high-quality reads can exceed 1000 bases. Read length determines how much of a long DNA fragment a single read can cover, so the sequencing strategy should be planned according to the length of the target fragment to ensure that the whole target region is covered.
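The relationship between Q and P, and the Q20/Q30 proportions mentioned above, can be checked with a short calculation. The sketch below assumes the per-base quality values have already been exported from the analysis software as a list of integers; the example values are made up.

```python
import math

def phred_to_error(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)

def fraction_at_least(quals, threshold):
    """Fraction of bases whose quality is at or above `threshold`."""
    return sum(q >= threshold for q in quals) / len(quals)

# Hypothetical per-base quality values for a very short read.
quals = [12, 25, 38, 40, 41, 45, 50, 52, 33, 18]
q20 = fraction_at_least(quals, 20)
q30 = fraction_at_least(quals, 30)
print(f"P at Q20: {phred_to_error(20):.3f}")           # P at Q20: 0.010
print(f">=Q20: {q20:.0%}, >=Q30: {q30:.0%}")           # >=Q20: 80%, >=Q30: 70%
```

With these example values the read would fall short of both the 90% (Q20) and 80% (Q30) guidelines quoted in the text.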
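How forward and reverse reads can verify each other is easiest to see in a small example. The sketch below is a simplification under stated assumptions: both reads are assumed to be already trimmed to exactly the same region (so no alignment is needed), the reverse read is given as it comes off the sequencer, and disagreements are resolved by taking the base with the higher quality value. The function names and example data are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]

def consensus(fwd, fwd_q, rev, rev_q):
    """Merge a forward read and a reverse read covering the same region.

    Returns the consensus sequence and the 0-based positions where the
    two reads disagree. At a disagreement the higher-quality base wins.
    """
    rev_rc = reverse_complement(rev)   # bring the reverse read into forward orientation
    rev_q_rc = rev_q[::-1]             # qualities must be reversed along with it
    merged, discordant = [], []
    for i, (f, fq, r, rq) in enumerate(zip(fwd, fwd_q, rev_rc, rev_q_rc)):
        if f != r:
            discordant.append(i)
        merged.append(f if f == r or fq >= rq else r)
    return "".join(merged), discordant

# Hypothetical data: the forward read has a low-quality miscall at index 3.
fwd, fwd_q = "ACGTTGCA", [40, 40, 40, 12, 40, 40, 40, 40]
rev, rev_q = "TGCATCGT", [40, 40, 40, 40, 38, 40, 40, 40]
seq, discordant = consensus(fwd, fwd_q, rev, rev_q)
print(seq, discordant)  # ACGATGCA [3]
```

Positions listed in `discordant` are the ones worth inspecting in the electropherogram before accepting the merged base.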
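Planning coverage for a long target is simple arithmetic. The following sketch estimates how many reads are needed and where each read would start when reads are tiled along one strand; the default usable read length of 700 bases sits inside the 500-800 range given above, while the 100-base overlap between neighboring reads is an assumption made for illustration.

```python
import math

def reads_needed(target_len, read_len=700, overlap=100):
    """Number of tiled reads needed to cover `target_len` bases."""
    if overlap >= read_len:
        raise ValueError("overlap must be smaller than read_len")
    if target_len <= read_len:
        return 1
    step = read_len - overlap  # new bases contributed by each additional read
    return 1 + math.ceil((target_len - read_len) / step)

def read_starts(target_len, read_len=700, overlap=100):
    """0-based start position of each read along the target."""
    step = read_len - overlap
    return [i * step for i in range(reads_needed(target_len, read_len, overlap))]

print(reads_needed(3000))  # 5
print(read_starts(3000))   # [0, 600, 1200, 1800, 2400]
```

Each start position marks roughly where a sequencing primer would need to prime; in practice the first bases after the primer are usually unreadable, so primers are placed somewhat upstream of these positions.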

Sanger sequencing vs next generation sequencing (NGS) (Botella et al., 2015)

Sanger Sequencing Data Analysis Method

Sanger sequencing is a high-accuracy sequencing technology, and data analysis is the key step in extracting gene information from it. The analysis relies on professional tools to read the electrophoresis peaks, call the base sequence, judge reliability from the quality indicators, detect abnormalities such as double peaks and mutations, and reveal gene variation by comparison with reference sequences. Together these steps provide the basis for research and clinical applications.

Use of Professional Software Tools

The analysis of Sanger sequencing results needs the help of professional software tools, which can help researchers quickly identify base sequences, evaluate sequence quality, compare reference sequences, and detect variations. The following introduces several commonly used software tools and their main functions.

FinchTV is a free, easy-to-use viewer for sequencing results that supports several sequencing file formats (such as .ab1 and .scf). Its main functions are displaying the electropherogram, the base sequence, and the corresponding Phred quality scores. Users can zoom in on the electropherogram to inspect peak shapes directly and manually correct bases that the software called incorrectly. FinchTV also provides a sequence comparison function for a simple comparison of sequencing results against a reference sequence and a preliminary judgment on whether variation is present.

Chromas is another widely used sequencing analysis program with a broader feature set. Besides the basic functions that FinchTV offers, it supports sequence editing, reverse-complement generation, restriction site analysis, and so on.

  • In peak pattern analysis, Chromas can automatically identify abnormal peak patterns such as double peaks and mixed peaks and mark low-quality base regions, which helps users quickly locate problem regions in the sequencing results.
  • The software can also export sequencing results in various formats (such as FASTA and GenBank), which is convenient for subsequent data analysis and storage.

Researchers who need large-scale sequence analysis or complex mutation detection can choose more advanced software such as Sequencher and BioEdit. These programs support multiple sequence alignment, automatic mutation detection, assembly of overlapping reads, and other functions, and suit scenarios such as gene cloning verification and mutation screening.
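Marking low-quality regions, as described above, amounts to finding runs of bases whose quality falls below a cutoff. The sketch below shows one simple way to do this on an exported list of quality values; it is not the algorithm used by Chromas or any other program, and the Q20 cutoff and minimum run length of 3 are illustrative assumptions.

```python
def low_quality_regions(quals, min_q=20, min_len=3):
    """Return (start, end) index pairs of runs where quality is below `min_q`.

    Indices are 0-based and `end` is exclusive. Runs shorter than
    `min_len` bases are ignored so that isolated dips are not reported.
    """
    regions, start = [], None
    for i, q in enumerate(quals):
        if q < min_q:
            if start is None:
                start = i              # a low-quality run begins here
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    if start is not None and len(quals) - start >= min_len:
        regions.append((start, len(quals)))  # run extends to the end of the read
    return regions

# Hypothetical read: poor quality at both ends, one isolated dip in the middle.
quals = [8, 10, 15, 35, 40, 42, 12, 41, 40, 18, 14, 9, 7]
print(low_quality_regions(quals))  # [(0, 3), (9, 13)]
```

The pattern in the example, with poor quality at the start and end of the read, matches the terminal signal decay that is typical of Sanger reads.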

SeqTrace's user interface, including the project window (A) and the trace-view window (B) (Stucky et al., 2012)

Identification and Treatment of Common Problems

Sanger sequencing results often contain abnormal peaks, such as double peaks, missing peaks, and noise peaks. These problems interfere with accurate base calling and need to be correctly identified and handled.

Double peaks are two peaks of similar height at the same base position. They are usually caused by template contamination, heterozygous samples, or nonspecific amplification.

  • Double peaks caused by template contamination usually persist throughout the read, and the signal intensities of the two peaks are relatively stable. In this case, the sample needs to be prepared again and re-sequenced.
  • In heterozygous samples (such as a heterozygous mutation in the human genome), double peaks usually appear at a specific position and the peak pattern returns to normal afterwards. This is a normal biological phenomenon, and the heterozygous bases at that position should be recorded.
  • A missing peak means that there is no obvious peak signal at a position. It may be caused by interruption of the sequencing reaction, deletion of a large fragment in the template, or an abnormal primer binding site. If the missing peaks appear at the beginning of the read, poor primer binding is a likely cause, and changing the primer and re-sequencing is worth trying.
  • Noise peaks (also known as background peaks) are low-intensity stray peaks between normal peaks. They are usually caused by nonspecific extension, fluorescence interference, or instrument detection error during the sequencing reaction. Slight noise generally does not affect base calling, but strong noise peaks may lead to miscalled bases. Ways to deal with noise include optimizing the sequencing reaction conditions (such as raising the annealing temperature and reducing the amount of primer), using high-quality sequencing reagents, or manually correcting low-quality base regions in software.
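The distinction drawn above between persistent double peaks (suggesting contamination) and an isolated double peak (suggesting a heterozygous site) can be expressed as a rough screening rule. The sketch below assumes the four channel intensities at each called position are available; the secondary-to-primary ratio cutoff of 0.3 and the "widespread" fraction of 0.5 are hypothetical thresholds, and the output is only a hint for manual review of the trace.

```python
def secondary_ratio(intensities):
    """Ratio of the second-highest to the highest channel at one position."""
    top, second = sorted(intensities.values(), reverse=True)[:2]
    return second / top if top > 0 else 0.0

def mixed_positions(trace, min_ratio=0.3):
    """0-based positions where a second peak is at least `min_ratio` of the main peak."""
    return [i for i, p in enumerate(trace) if secondary_ratio(p) >= min_ratio]

def classify(trace, min_ratio=0.3, widespread=0.5):
    """Coarse label for a trace based on how many positions show double peaks."""
    mixed = mixed_positions(trace, min_ratio)
    if not mixed:
        return "clean"
    if len(mixed) / len(trace) >= widespread:
        return "widespread double peaks: possible contamination"
    return "isolated double peaks: possible heterozygous site"

# Hypothetical peak positions.
clean = {"A": 900, "C": 20, "G": 30, "T": 10}
mixed = {"A": 450, "C": 15, "G": 420, "T": 20}

het_trace = [clean, clean, mixed, clean, clean, clean]
contaminated_trace = [mixed, mixed, mixed, mixed, mixed, clean]
print(mixed_positions(het_trace))    # [2]
print(classify(het_trace))           # isolated double peaks: possible heterozygous site
print(classify(contaminated_trace))  # widespread double peaks: possible contamination
```

A low cutoff will also pick up strong noise peaks, so positions flagged this way still need to be checked against the peak shapes and quality values.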

A-N The main issues encountered in the reading of DNA chromatograms of PCR products based on the Sanger sequencing method (Al-Shuhaib et al., 2023)

Comparison with the Reference Sequence

Comparing the sequencing results with a reference sequence is the key step in analyzing sequencing data. The comparison detects variant types such as point mutations, insertions, and deletions, which provides a basis for subsequent research.

First, obtain the reference sequence of the target gene or fragment, which can be downloaded from public databases such as GenBank. Then use sequence comparison software (such as BLAST, ClustalW, or MegAlign) to compare the sequenced read with the reference. The result is usually displayed as a sequence alignment in which identical bases are shown with the same character, differing bases are marked with different characters, and inserted or missing bases are represented by dashes or other symbols.

In mutation detection, the point mutation is the most common type: a base in the sequenced read differs from the reference. For example, if the base in the reference sequence is "A" and the corresponding position in the sequenced read is "G", there is an A>G point mutation at that position. Checking the peak shape and Phred quality score at that position confirms whether the mutation is reliable and avoids false positives caused by sequencing errors.
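To make the alignment display concrete, here is a small global alignment (Needleman-Wunsch) written from scratch. It is only a teaching sketch: the match, mismatch, and gap scores are arbitrary assumptions, the gap penalty is linear, and tools such as BLAST or ClustalW use different and more refined methods. Gaps are written as "-".

```python
def global_align(ref, read, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(ref), len(read)

    def sub(i, j):
        # Score for aligning ref[i-1] with read[j-1].
        return match if ref[i - 1] == read[j - 1] else mismatch

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a.append(ref[i - 1]); b.append(read[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(ref[i - 1]); b.append("-"); i -= 1   # base missing from the read
        else:
            a.append("-"); b.append(read[j - 1]); j -= 1  # extra base in the read
    return "".join(reversed(a)), "".join(reversed(b))

def match_line(a, b):
    """'|' for identical bases, '.' for mismatches, ' ' opposite a gap."""
    return "".join("|" if x == y else " " if "-" in (x, y) else "."
                   for x, y in zip(a, b))

ref_aln, read_aln = global_align("ACGTACGT", "ACGACGT")
print(ref_aln)                        # ACGTACGT
print(match_line(ref_aln, read_aln))  # ||| ||||
print(read_aln)                       # ACG-ACGT
```

The dash in the read row marks the base that is present in the reference but missing from the sequenced read.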
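Point mutation calling with a quality check can be sketched as follows. The example assumes the read has already been aligned to the reference without gaps, so that position i in one corresponds to position i in the other, and that per-base Phred scores are available. Changes are written as reference>observed, and the Q20 cutoff used to flag low-confidence calls is an assumption.

```python
def call_substitutions(ref, read, quals, min_q=20):
    """List single-base differences between an ungapped, aligned read and its reference.

    Positions are 1-based. A call is marked confident only when the
    base quality at that position is at least `min_q`.
    """
    calls = []
    for pos, (r, s, q) in enumerate(zip(ref, read, quals), start=1):
        if r != s and s != "N":
            calls.append({
                "pos": pos,
                "change": f"{r}>{s}",   # reference base > observed base
                "quality": q,
                "confident": q >= min_q,
            })
    return calls

# Hypothetical sequences: a well-supported A>G at position 5 and a
# low-quality difference at the end of the read.
ref = "ATGCATGC"
read = "ATGCGTGA"
quals = [45, 45, 44, 43, 42, 40, 38, 11]
calls = call_substitutions(ref, read, quals)
for c in calls:
    print(c["pos"], c["change"], c["quality"], c["confident"])
# 5 A>G 42 True
# 8 C>A 11 False
```

The second call sits in a low-quality region and is exactly the kind of difference that should be checked in the electropherogram, or confirmed with a read from the opposite direction, before being reported.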

Comparing the different processes of the Sanger method and NGS in detecting different pathogens (Nafea et al., 2023)

Applications of Sanger Sequencing Results

Sanger sequencing has become the cornerstone of molecular biology research with high accuracy, and its results have irreplaceable applications in many fields. From the verification of gene cloning to ensure the correct insertion of fragments, to the diagnosis and treatment of diseases by mutation detection, to the study of gene function to reveal the mechanism of gene action, accurate interpretation of sequencing results is the key to promoting scientific research and clinical progress.

Gene Cloning Verification

In genetic engineering research, the inserted fragment must be verified by Sanger sequencing after a recombinant plasmid is constructed. In one example, a research team inserted a target gene into the pET-28a vector to construct a recombinant expression plasmid. After sequencing the recombinant plasmid, they compared the sequencing results with the reference sequence of the target gene and with the vector sequence. The inserted fragment was fully consistent with the target gene and was inserted in the correct orientation, with no base mutations or deletions, which indicated that the recombinant plasmid had been constructed successfully and could be used for subsequent protein expression experiments.
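The two checks described here, that the insert is present without changes and that it lies in the correct orientation, can be illustrated with exact string matching. This is a deliberately strict sketch: real reads are normally aligned to the expected construct rather than matched exactly, and the flanking and insert sequences below are short made-up strings, not actual pET-28a or gene sequences.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def check_insert(read, expected_insert):
    """Report whether the expected insert is found in the read, and in which orientation.

    Uses exact matching, so a single substitution or indel in the
    insert gives 'not found'.
    """
    if expected_insert in read:
        return "forward"
    if reverse_complement(expected_insert) in read:
        return "reverse"
    return "not found"

# Hypothetical sequences for illustration only.
left_flank, right_flank = "GGATCC", "AAGCTT"
insert = "ATGGCTAGCAAA"

read_ok = left_flank + insert + right_flank
read_flipped = left_flank + reverse_complement(insert) + right_flank
read_mutated = left_flank + "ATGGCTTGCAAA" + right_flank  # one base changed

print(check_insert(read_ok, insert))       # forward
print(check_insert(read_flipped, insert))  # reverse
print(check_insert(read_mutated, insert))  # not found
```

A "not found" result does not say what went wrong; aligning the read to the expected construct is the next step for locating the mutation or deletion.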

Sequence chromatogram (A) and sequence quality evaluation (B) from clinical Staphylococcus aureus strain 1 (Chen et al., 2014)

Mutation Detection

In clinical diagnosis, Sanger sequencing is often used to detect disease-related gene mutations. For example, when tumor tissue from a patient with suspected lung cancer is tested for EGFR mutations, sequencing the mutation hotspot regions of the EGFR gene may reveal a deletion mutation in exon 19. Combined with the patient's clinical symptoms and other examination results, this finding can indicate that the patient is suitable for treatment with EGFR tyrosine kinase inhibitors.

Gene Function Studies

In gene function studies, Sanger sequencing can be used to verify the outcome of gene knockout or knock-in experiments. For example, researchers use CRISPR-Cas9 to knock out a gene in mice, amplify the target gene region by PCR, and sequence it. If the sequencing results show the expected deletion or insertion in the target region, and the mutation shifts the reading frame of the gene, the knockout is considered successful. The biological function of the gene can then be studied by observing phenotypic changes in the knockout mice.
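Whether an insertion or deletion shifts the reading frame depends on whether its length is a multiple of three. The sketch below takes a gapped pairwise alignment (reference row and read row of equal length, gaps written as "-") and reports each indel with a frameshift flag. It judges each indel only by its size and assumes it lies within coding sequence; the example sequences are made up.

```python
import re

def find_indels(ref_aln, read_aln):
    """List insertions and deletions in a gapped pairwise alignment.

    A run of '-' in the read row is a deletion; a run of '-' in the
    reference row is an insertion. `start` is the 0-based alignment column.
    """
    events = []
    for kind, gapped in (("deletion", read_aln), ("insertion", ref_aln)):
        for m in re.finditer(r"-+", gapped):
            size = m.end() - m.start()
            events.append({
                "type": kind,
                "start": m.start(),
                "size": size,
                "frameshift": size % 3 != 0,  # not a multiple of 3 shifts the frame
            })
    return sorted(events, key=lambda e: e["start"])

# Hypothetical alignments of an edited allele against the reference.
ref_aln     = "ATGGCCGATTACAAGTGA"
two_bp_del  = "ATGGCC--TTACAAGTGA"
three_bp_del = "ATGGCC---TACAAGTGA"

print(find_indels(ref_aln, two_bp_del))
# [{'type': 'deletion', 'start': 6, 'size': 2, 'frameshift': True}]
print(find_indels(ref_aln, three_bp_del))
# [{'type': 'deletion', 'start': 6, 'size': 3, 'frameshift': False}]
```

The 3-base deletion removes one codon without shifting the frame, so by the criterion in the text it would not on its own confirm a successful knockout.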

Amplification curves (A) and melting curves (B) of partial experimental strains (Chen et al., 2014)

Conclusion

Correct interpretation and analysis of Sanger sequencing results are the key to giving full play to the advantages of this technology, which not only relates to the reliability of experimental results but also affects the formulation of subsequent research directions and the accuracy of scientific research conclusions. By mastering the presentation form, quality evaluation index, and data analysis method of sequencing results, researchers can accurately identify base sequences and detect gene variation, and effectively apply sequencing data to gene cloning verification, mutation detection, gene function research, and other fields.

References:

  1. Botella LM, Albiñana V, Ojeda-Fernandez L, Recio-Poveda L, Bernabéu C. "Research on potential biomarkers in hereditary hemorrhagic telangiectasia." Front Genet. 2015 6: 115 https://doi.org/10.3389/fgene.2015.00115
  2. Li Z, Lou J, Li W, et al. "A newly detected c.180 + 1G > A variant causes a decrease of FGA transcription in patients with congenital hypo-dysfibrinogenemia." J Hematopathol. 2022 15: 259-263 https://doi.org/10.1007/s12308-022-00518-3
  3. Stucky BJ. "SeqTrace: a graphical tool for rapidly processing DNA sequencing chromatograms." J Biomol Tech. 2012 23(3): 90-93 https://doi.org/10.7171/jbt.12-2303-004
  4. Dunitz MI, Lang JM, et al. "Swabs to genomes: a comprehensive workflow." PeerJ. 2015 3: e960 https://doi.org/10.7717/peerj.960
  5. Nafea AM, Wang Y, Wang D, et al. "Application of next-generation sequencing to identify different pathogens." Front Microbiol. 2024 14: 1329330 https://doi.org/10.3389/fmicb.2023.1329330
  6. Al-Shuhaib MBS, Hashim HO. "Mastering DNA chromatogram analysis in Sanger sequencing for reliable clinical analysis." J Genet Eng Biotechnol. 2023 21(1): 115 https://doi.org/10.1186/s43141-023-00587-6
  7. Chen L, Cai Y, Zhou G, et al. "Rapid Sanger sequencing of the 16S rRNA gene for identification of some common pathogens." PLoS One. 2014 9(2): e88886 https://doi.org/10.1371/journal.pone.0088886
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.