As biological research has moved from sequencing the genome of a single species to analyzing genetic diversity across populations, the inherent limitations of a single reference genome have become increasingly obvious. The pan-genome concept offers a new perspective for genomics research. By integrating the genome sequences of different individuals of the same species, a pan-genome forms a complete genetic map that includes both core genes and variable genes, and it reveals, at the system level, the network of genetic variation that a species has accumulated over its long evolutionary history.
Pan-genome research depends on the advance of sequencing technology from short reads to long reads, and it has in turn driven upgrades in genome assembly strategies, data analysis tools, and quality control systems. In practice, from analyzing how microbial drug resistance evolves to decoding the genetic basis of stress resistance in crops, pan-genome sequencing plays an important role across many research areas because of its systematic approach.
Focusing on pan-genome sequencing, this article introduces short-read and long-read sequencing technologies, assembly strategies, the construction process, and data analysis tools and platforms such as Roary and panX.
In pan-genome research, choosing an appropriate sequencing technology is a key factor in the success of a study. Each type of sequencing technology has its own advantages, and their applications differ significantly.
Comparing R2C2 and Illumina based assemblies of a small genome (Zee et al., 2022)
Illumina is the representative short-read sequencing technology, with the advantages of high throughput and low cost. The Illumina platform combines bridge PCR amplification with sequencing by synthesis (SBS). DNA fragments are fixed in a flow cell, where millions of DNA molecules can be amplified and sequenced in parallel. This cluster-based strategy can produce terabase-scale data in a single run.
In pan-genome research, the technology can process hundreds of samples at once. Multiple samples are loaded rapidly and sequenced in parallel on microfluidic chips, yielding a large number of 100-300 bp reads in a short time and providing the basic data for constructing pan-genome variation maps.
However, the limitations of short-read technology have become a main problem for genome assembly. Because the reads are so short, they often cannot span tandem repeats (such as centromeric regions), transposable elements, or highly homologous gene families in complex genomes, which leads assembly algorithms to join sequences incorrectly.
In addition, structural variations (such as inversions and translocations) are difficult to identify from short-read data because the reads do not provide enough sequence context. Structural variations are therefore easily miscalled or missed, which affects the accurate analysis of structural diversity in the pan-genome.
Long-read sequencing technologies, mainly PacBio and Nanopore, overcome the read-length limits of traditional short-read technologies and have become key tools for analyzing complex genome structures.
PacBio's SMRT sequencing is based on single-molecule real-time sequencing, in which a single DNA polymerase molecule is observed inside a zero-mode waveguide (ZMW). The technology offers a circular consensus sequencing (CCS) mode: by sequencing the same DNA molecule many times, from dozens to hundreds of passes, the raw error rate of about 15% can be reduced to less than 0.1%. It is especially good at capturing high-GC regions, tandem repeats, and complex gene structures in the genome.
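The effect of repeated passes on accuracy can be illustrated with a small simulation. The sketch below is a deliberately simplified model, not PacBio's actual CCS algorithm: it assumes independent substitution errors only (real raw reads also contain insertions and deletions and must be aligned first) and takes a per-position majority vote. All function names are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def noisy_pass(template, error_rate, rng):
    """Simulate one pass over the template with random substitution errors."""
    read = []
    for base in template:
        if rng.random() < error_rate:
            # Replace the true base with one of the three other bases.
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)
    return "".join(read)

def consensus(passes):
    """Majority vote at each position across equal-length passes."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

def error_fraction(sequence, template):
    """Fraction of positions where the sequence differs from the template."""
    return sum(a != b for a, b in zip(sequence, template)) / len(template)

def ccs_error(template, n_passes, raw_error, rng):
    """Error rate of the consensus built from n_passes simulated passes."""
    passes = [noisy_pass(template, raw_error, rng) for _ in range(n_passes)]
    return error_fraction(consensus(passes), template)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    template = "".join(rng.choice(BASES) for _ in range(2000))
    for n_passes in (1, 5, 15, 31):
        print(n_passes, "passes:", ccs_error(template, n_passes, 0.15, rng))
```

Under these assumptions the consensus error falls quickly as the number of passes grows, because a wrong base must outvote the correct one at the same position in most passes.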
Nanopore sequencing detects electrical signals at a nanopore. As a DNA or RNA molecule passes through the pore, the current changes caused by different bases are captured in real time and converted into sequence information. Its main advantage is that there is no theoretical upper limit on read length, and the longest reported read exceeds 2 million base pairs. This makes it well suited to resolving repetitive regions in the human genome that span up to millions of bases, such as telomeric regions, and the transposon-rich regions that are widespread in plant genomes. In rice pan-genome studies, Nanopore sequencing has both fully assembled centromeric heterochromatin regions that traditional methods could not resolve and revealed many structural variations related to disease resistance.
Although long-read technology has greatly advanced genomics research, it still faces technical bottlenecks, one being the relatively high raw error rate of individual reads.
In addition, the massive storage requirements and heavy computing demands of long-read data place higher requirements on bioinformatics analysis pipelines. With better error-correction algorithms and improved hardware, long-read sequencing is gradually moving toward higher accuracy, higher throughput, and lower cost.
Advantages and limitations of short and long reads (Deshpande et al., 2023)
In pan-genome sequencing, the choice of multi-genome assembly strategy is very important. Reference-guided assembly and de novo assembly are the two main methods, and they differ significantly in principle, process, and application. Choosing appropriately between them is essential for obtaining accurate and complete genome information.
Reference-guided assembly is a strategy based on an existing reference genome. The sequenced reads are aligned to the known reference genome and then assembled according to the alignment results. Because this method uses the structural information of the reference genome, it effectively reduces the complexity of assembly and greatly improves assembly efficiency.
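The align-then-assemble idea can be sketched on a toy scale. The example below is an illustration under strong assumptions, not a production aligner: reads are placed at the gap-free offset with the fewest mismatches, and the sample sequence is called by majority vote in each column. The function names and the sequences are made up for the example.

```python
from collections import Counter

def best_position(read, reference):
    """Return the offset where the read matches the reference with fewest mismatches."""
    best, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best, best_mismatches = pos, mismatches
    return best

def guided_consensus(reads, reference):
    """Pile reads onto the reference and take the most common base per position."""
    piles = [Counter() for _ in reference]
    for read in reads:
        pos = best_position(read, reference)
        for i, base in enumerate(read):
            piles[pos + i][base] += 1
    # Where no read covers a position, fall back to the reference base.
    return "".join(
        pile.most_common(1)[0][0] if pile else ref_base
        for pile, ref_base in zip(piles, reference)
    )

reference = "GATTACAGGCTTCAAGTCCA"
# Reads drawn from a sample that differs from the reference at one position (C -> T).
reads = ["GATTACAGGT", "TACAGGTTTC", "CAGGTTTCAA", "TTCAAGTCCA"]
print(guided_consensus(reads, reference))  # -> GATTACAGGTTTCAAGTCCA
```

The reference supplies the layout, so the only work left is placing reads and voting. The same property is why sequence that is absent from the reference cannot be recovered this way.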
De novo assembly is a method that does not depend on any reference genome. It relies entirely on the sequencing data: short or long reads are joined into a complete genome by an algorithm. In practice, the algorithmic framework is usually based on a de Bruijn graph or on Overlap-Layout-Consensus (OLC).
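To make the de Bruijn graph approach concrete, here is a minimal sketch. It assumes error-free reads and a sequence in which every (k-1)-mer is unique; real assemblers add error correction, graph simplification, and repeat handling. The function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph):
    """Spell a contig by following edges while each node has exactly one successor."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node in this toy example")
    node = starts[0]
    contig = node
    seen = {node}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop instead of looping forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig

# Three overlapping reads from the toy sequence ATGGCGTGCAATC.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
print(walk_unambiguous(de_bruijn(reads, k=4)))  # -> ATGGCGTGCAATC
```

If the sequence contained a repeat longer than k-1 bases, some node would have more than one successor and the walk would stop there. That is the graph-level view of why short reads struggle with repetitive regions.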
The advantage of de novo assembly is that it can recover sequence that is absent from the reference genome, which matters for studying species evolution, discovering new genes, and characterizing species-specific regions. In crop breeding, de novo assembly can help uncover potential disease-resistance and stress-resistance genes in wild relatives and support the breeding of new varieties with better traits.
However, de novo assembly also faces many challenges. First, it places high demands on the quality and quantity of sequencing data: complex genomes, or genomes containing many repetitive sequences, require higher sequencing depth and longer reads to ensure accurate assembly. Second, de novo assembly is computationally intensive and requires a great deal of time and computing resources, which is a major challenge for research teams with limited resources.
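Sequencing depth is commonly estimated as the number of reads times the read length, divided by the genome size. The short sketch below applies that relation. The genome size, read length, and target depth are illustrative assumptions, not recommendations from this article.

```python
def mean_depth(n_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size

def reads_needed(target_depth, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    total_bases = target_depth * genome_size
    return -(-total_bases // read_length)  # integer ceiling division

# Illustrative numbers: a 400 Mb genome, 150 bp reads, 30x target depth.
genome_size = 400_000_000
print(reads_needed(30, 150, genome_size))          # -> 80000000
print(mean_depth(80_000_000, 150, genome_size))    # -> 30.0
```

Because the requirement scales linearly with genome size and target depth, sequencing every accession in a pan-genome panel at high depth multiplies both the data volume and the computing cost described above.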
Flowchart of genome assembly: de novo and based on the reference genome (Diniz et al., 2017)
In constructing a pan-genome, gene family clustering and variant detection are key steps. Core and variable gene families can be identified by clustering sequences from multiple genomes with tools such as OrthoFinder. Combining SNP and CNV detection reveals the genetic diversity of a species and lays a foundation for analyzing evolution and phenotypic differences.
Gene family clustering is one of the key steps in pan-genome construction, and OrthoFinder is a commonly used tool. The basic process is to compare the protein or nucleic acid sequences of multiple genomes and then divide them into gene families according to sequence similarity. Through gene family clustering, researchers can understand the expansion, contraction, and evolutionary relationships of genes within a species.
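The idea of grouping sequences by similarity can be sketched as follows. This is not OrthoFinder's algorithm: as a stand-in, it scores similarity with the Jaccard index of shared 3-mers and links sequences by single-linkage clustering. The threshold, function names, and sequences are hypothetical.

```python
def kmers(seq, k=3):
    """Set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Shared fraction of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_families(proteins, threshold=0.5, k=3):
    """Single-linkage clustering: link two sequences if their similarity >= threshold."""
    names = list(proteins)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    profiles = {name: kmers(seq, k) for name, seq in proteins.items()}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    families = {}
    for name in names:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())

# Made-up protein fragments from three genomes (cv1, cv2, cv3).
proteins = {
    "cv1_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "cv2_A": "MKTAYIAKQRQISFVKSHFSRL",
    "cv3_A": "MKTAYIAKQRQLSFVKSHFSRQ",
    "cv1_B": "MSDNGPQNQRNAPRITFGGPSD",
    "cv2_B": "MSDNGPQNQRNAPRITFGGPTD",
}
print(cluster_families(proteins))
```

The threshold plays the role of the comparison parameters in a real tool: set too low, unrelated genes merge into one family; set too high, diverged members of one family are split apart.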
When studying the pan-genome of several related species, OrthoFinder clustering can identify core gene families (present in all species) and variable gene families (present in only some species), providing clues for studying adaptive evolution and the formation of species-specific traits. Appropriate comparison parameters and clustering algorithms must be selected to ensure accurate clustering results.
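Once families are defined, separating core from variable families is a presence/absence check. The sketch below assumes a simple input format of our own choosing, a mapping from family name to the set of genomes that carry a member. It does not parse the output of any specific tool.

```python
def classify_families(families, genomes):
    """Split families into core (present in every genome) and variable (the rest)."""
    core, variable = [], []
    for family, present_in in sorted(families.items()):
        if genomes <= present_in:  # every genome carries this family
            core.append(family)
        else:
            variable.append(family)
    return core, variable

genomes = {"sp1", "sp2", "sp3"}
# Hypothetical presence/absence data for three gene families.
families = {
    "famA": {"sp1", "sp2", "sp3"},
    "famB": {"sp1", "sp2"},
    "famC": {"sp3"},
}
core, variable = classify_families(families, genomes)
print("core:", core)          # -> core: ['famA']
print("variable:", variable)  # -> variable: ['famB', 'famC']
```

A strict "present in all genomes" rule is sensitive to assembly and annotation gaps, which is one reason the clustering parameters and input quality matter for the final core/variable split.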
Pipeline and example result from the Pandagma software package (Cannon et al., 2024)
Variant detection covers single nucleotide polymorphisms (SNPs) and copy number variations (CNVs), which are of great significance for understanding the genetic diversity and phenotypic differences of species. In pan-genome research, variant detection is usually carried out after the multiple genomes have been assembled.
In human pan-genome studies, variant detection can help find disease-related genetic variation and provide a basis for disease diagnosis and treatment. During variant detection, attention should be paid to the quality of the sequencing data, the choice of alignment algorithm, and the variant filtering criteria in order to reduce false positives.
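The role of filtering criteria in limiting false positives can be shown with a toy SNP caller. This sketch assumes reads are already aligned without gaps and uses two hypothetical thresholds, a minimum depth and a minimum alternate-allele fraction. Real callers use statistical models and base qualities.

```python
from collections import Counter

def call_snps(reference, aligned_reads, min_depth=4, min_alt_fraction=0.8):
    """Call SNPs from gap-free alignments given as (start, sequence) pairs."""
    piles = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for i, base in enumerate(seq):
            piles[start + i][base] += 1

    snps = []
    for pos, (ref_base, pile) in enumerate(zip(reference, piles)):
        depth = sum(pile.values())
        if depth < min_depth:
            continue  # too few reads to trust this position
        # Most frequent non-reference base at this position, if any.
        alt, alt_count = max(
            ((b, c) for b, c in pile.items() if b != ref_base),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if alt is not None and alt_count / depth >= min_alt_fraction:
            snps.append((pos, ref_base, alt))
    return snps

reference = "ACGTACGTAC"
aligned_reads = [
    (0, "ACGTTCGT"),
    (0, "ACGTTCGT"),
    (1, "CGTTCGTA"),
    (2, "GTTCGAAC"),  # carries a likely sequencing error at position 7
    (2, "GTTCGTAG"),  # mismatch at position 9, where depth is only 2
]
print(call_snps(reference, aligned_reads))  # -> [(4, 'A', 'T')]
```

Only the variant at position 4 passes both filters. The isolated mismatch at position 7 fails the allele-fraction filter and the one at position 9 fails the depth filter, which is the behavior that stricter filtering is meant to achieve.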
Overview of PanSVR SV calling process (Li et al., 2021)
In pan-genome research, choosing appropriate data analysis tools and platforms is very important. Tools differ in design principles and functionality, so each suits different kinds of studies.
Roary is a tool for bacterial pan-genome analysis that quickly clusters multiple bacterial genomes and identifies core and variable genes. Its main advantage is speed, which makes it suitable for large-scale bacterial genome data. In studies of the evolution of bacterial drug resistance, Roary can be used to analyze the genomes of many drug-resistant bacteria and identify gene families and variations related to drug resistance, providing a reference for developing new antibacterial drugs and antibacterial strategies.
panX is a tool for pan-genome analysis and exploration, developed primarily for microbial genomes (Ding et al., 2018). It processes sequence data from multiple genomes, clusters genes into families, detects mutations, and constructs a pan-genome. It is designed to handle large data sets and provides rich interactive visualization, which makes the analysis results easier for researchers to interpret. In plant pan-genome studies, the same type of analysis across several crop varieties can identify genes and variations related to agronomic traits, providing a theoretical basis for the genetic improvement of crops.
Interconnected components of the panX web application (Ding et al., 2018)
PGAP (Pan-Genome Analysis Pipeline) is a comprehensive pan-genome analysis platform that integrates many steps, from sequencing data processing to pan-genome construction and analysis. Its advantage is a one-stop workflow: users can complete data preprocessing, assembly, annotation, variant detection, and gene family clustering on the same platform. PGAP also accepts input data from various sequencing technologies, so it can flexibly meet different research needs.
Innovation in sequencing technology and optimization of analytical methods continue to drive pan-genome research forward. At present, the combined use of long-read and short-read technologies and the integrated analysis of multi-omics data are gradually overcoming the bottlenecks in complex genome assembly and genetic variation analysis.
In the future, it will be necessary to further standardize the collection of cross-species samples and to develop intelligent data analysis algorithms, in order to construct more complete species pan-genome maps and provide a more solid theoretical basis and technical support for evolutionary biology, functional genomics, and precision medicine research.
References