Pan-Genome Sequencing: Technologies, Assembly Strategies, and Computational Tools

As biological research has moved from sequencing the genome of a single individual per species to analyzing genetic diversity across populations, the limitations of relying on a single reference genome have become increasingly apparent. The pan-genome concept offers a new perspective for genomics research. By integrating the genome sequences of different individuals of the same species, a pan-genome captures the complete gene repertoire of the species, including core genes (shared by all individuals) and variable genes (present in only some). It thereby reveals, at the system level, the network of genetic variation that a species has accumulated during its evolution.

Pan-genome research has depended on the rapid progress of sequencing technology from short reads to long reads, and it has in turn driven upgrades in genome assembly strategies, data analysis tools, and quality control. In practice, pan-genome sequencing plays an important role in studies ranging from the evolution of microbial drug resistance to the genetic basis of stress resistance in crops.

This article introduces short-read and long-read sequencing technologies, multi-genome assembly strategies, the pan-genome construction process, and data analysis tools and platforms such as Roary, panX, and PGAP.

Selection and Comparison of Sequencing Technologies

In pan-genome research, the choice of sequencing technology is a key factor in the success of a project. Each type of sequencing technology has its own strengths, and their suitable applications differ significantly.

Comparison of R2C2 and Illumina-based assemblies of a small genome (Zee et al., 2022)

Short-Read Sequencing

Illumina is the representative short-read sequencing technology, offering high throughput and low cost. The Illumina platform combines bridge PCR amplification with sequencing by synthesis (SBS). DNA fragments are immobilized in a flow cell, where millions of DNA molecules are amplified into clusters and sequenced in parallel. This cluster-based strategy can produce terabase-scale data in a single run.

In pan-genome research, this technology can process hundreds of samples at the same time: multiple samples are loaded onto a flow cell and sequenced in parallel, yielding large numbers of 100-300 bp reads in a short time and providing the basic data for constructing pan-genome variation maps.

However, the limitations of short reads have become the main obstacle in genome assembly. Because the reads are so short, they often cannot span tandem repeats (such as those in centromeric regions), transposable elements, or highly homologous gene families in complex genomes, which leads assembly algorithms to join sequences incorrectly.

In addition, structural variations such as inversions and translocations are difficult to identify from short-read data, because short reads do not provide enough sequence context. This can easily lead to false or missed structural variant calls and thus compromises accurate analysis of the structural diversity of the pan-genome.

Long-Read Sequencing

Long-read sequencing technologies, mainly PacBio and Nanopore, overcome key limitations of traditional short-read technologies and have become essential tools for resolving complex genome structures.

PacBio's SMRT technology is based on single-molecule real-time sequencing: each DNA polymerase molecule is observed individually inside a zero-mode waveguide (ZMW). Its circular consensus sequencing (CCS) mode reads the same circularized DNA molecule repeatedly and builds a consensus from the passes, so that a raw error rate of about 15% can be reduced to less than 0.1%. The resulting reads are well suited to capturing high-GC regions, tandem repeats, and complex gene structures in the genome.
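The effect of repeated passes on accuracy can be illustrated with a simple binomial model. The sketch below assumes that each pass miscalls a given base independently with probability `p` and that the consensus is a plain majority vote in which every miscall counts against the true base. Real CCS consensus calling uses a more sophisticated statistical model, and real errors are not fully independent, so the numbers are only indicative. The function name `majority_error` is hypothetical.

```python
from math import comb

def majority_error(p: float, passes: int) -> float:
    """Probability that a majority vote over `passes` independent reads
    miscalls a base, when each read is wrong with probability `p`.

    Pessimistic assumption: all wrong reads agree on the same wrong base,
    and a tie is counted as an error."""
    need = passes // 2 + 1  # correct votes required for a strict majority
    p_correct = sum(
        comb(passes, k) * (1 - p) ** k * p ** (passes - k)
        for k in range(need, passes + 1)
    )
    return 1.0 - p_correct

if __name__ == "__main__":
    # 15% is the raw per-pass error rate quoted in the text.
    for n in (1, 5, 11, 21):
        print(f"{n:>2} passes: consensus error ~ {majority_error(0.15, n):.2e}")
```

Under these assumptions the consensus error falls quickly as passes are added, which is the intuition behind CCS.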

Nanopore sequencing detects electrical signals: as a DNA or RNA molecule passes through a nanopore, the changes in current caused by different bases are captured in real time and converted into sequence information. Its main advantage is that there is no theoretical upper limit on read length, and reads longer than 2 million base pairs have been reported. This makes it powerful for resolving megabase-scale repetitive regions of the human genome, such as centromeres, and the transposon-rich regions that are widespread in plant genomes. In rice pan-genome studies, Nanopore sequencing has been used to assemble centromeric heterochromatin regions that traditional methods could not resolve and to identify many structural variations related to disease resistance.

Although long-read sequencing has greatly advanced genomics research, it still faces technical bottlenecks.

PacBio sequencing is limited by expensive reagents and by throughput, since only a few samples can be processed in a single run, which makes it costly for large-scale population sequencing projects.
Nanopore sequencing offers portability and low cost, but its raw error rate (mainly insertion/deletion errors) has been reported at about 10%-15%. These errors need to be corrected with error-aware assembly tools (such as Canu and Flye) or with hybrid sequencing strategies that combine short-read and long-read data.

In addition, the large storage requirements and heavy computational cost of long-read data place higher demands on bioinformatics pipelines. As error-correction algorithms and hardware improve, long-read sequencing is steadily moving toward higher accuracy, higher throughput, and lower cost.

Advantages and limitations of short and long reads (Deshpande et al., 2023)

Multi-genome Assembly Strategy in Pan-Genome

In pan-genome sequencing, the choice of multi-genome assembly strategy is very important. Reference genome-guided assembly and de novo assembly are the two main approaches, and they differ significantly in principle, workflow, and application. Choosing between them appropriately is essential for obtaining accurate and complete genome information.

Reference Genome-Guided Assembly

Reference genome-guided assembly builds on an existing reference genome. Sequencing reads are first aligned to the known reference, and the new genome is then assembled from the alignment results. Because this approach uses the structural information of the reference genome, it reduces the complexity of assembly and greatly improves efficiency.
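The align-then-assemble idea can be sketched in a few lines. This toy example assumes that read placement is done by brute-force mismatch counting (real pipelines use indexed aligners), that reads differ from the reference by substitutions only, and that the consensus is a per-position majority vote. All function names and sequences are hypothetical.

```python
from collections import Counter

def best_position(read: str, reference: str) -> int:
    """Return the reference offset where the read has the fewest mismatches."""
    def mismatches(pos: int) -> int:
        window = reference[pos:pos + len(read)]
        return sum(a != b for a, b in zip(read, window))
    return min(range(len(reference) - len(read) + 1), key=mismatches)

def guided_consensus(reads: list[str], reference: str) -> str:
    """Place each read on the reference, then call the majority base per position.
    Positions without read support fall back to the reference base."""
    piles = [Counter() for _ in reference]
    for read in reads:
        start = best_position(read, reference)
        for offset, base in enumerate(read):
            piles[start + offset][base] += 1
    return "".join(
        pile.most_common(1)[0][0] if pile else ref_base
        for pile, ref_base in zip(piles, reference)
    )

if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGC"
    # Toy reads from a sample carrying a single G->T substitution.
    reads = ["ACGTACGTTATC", "GTTATCCGATAG", "TATCCGATAGGC", "CGTTATCCGATA"]
    print(guided_consensus(reads, reference))
```

Because every read is forced onto reference coordinates, a sketch like this recovers substitutions but cannot represent sequence that has no counterpart in the reference.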

A. Applications of reference genome-guided assembly

a) In practice, the advantages of this method are considerable. In rice pan-genome research, for example, the genome of the japonica cultivar Nipponbare is commonly used as the reference. When assembling other local rice varieties, researchers can anchor the sequencing data to the Nipponbare genome framework and complete the assembly of multiple varieties in a short time.
b) Moreover, existing annotation of the reference genome, such as gene locations and functional classifications, can be transferred directly to the newly assembled genome, which saves considerable annotation time and labor. This strategy is especially suitable for closely related species or varieties and can quickly yield high-quality, contiguous genome sequences.

B. Limitations of reference genome-guided assembly

a) However, reference genome-guided assembly also has clear limitations. If the target and the reference differ substantially, for example through large insertions, deletions, or rearrangements, an assembly based on the reference may miss these divergent regions.
b) In addition, the method is affected by the quality of the reference genome itself. Errors or gaps in the reference are propagated to the newly assembled genome and affect downstream analyses.

de novo Assembly

De novo assembly is a method that does not depend on any reference genome. It relies entirely on the sequencing data, using algorithms to join short or long reads into a complete genome. In practice, assemblers are usually built on one of two algorithmic frameworks: the de Bruijn graph or Overlap-Layout-Consensus (OLC).

The de Bruijn graph approach splits short reads into fixed-length k-mers, builds a graph of nodes and edges from the overlaps between them, and then infers the genome sequence from paths through the graph.
The OLC approach, by contrast, is better suited to long-read data. It finds overlapping regions between reads and extends them step by step into contiguous sequences (contigs).
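Both ideas can be shown on a toy example. The sketch below assumes error-free reads from a short, repeat-free sequence; production assemblers additionally handle sequencing errors, reverse complements, and branching graphs. The function names are hypothetical. `de_bruijn` builds the graph from k-mers, `walk_unbranched` spells a contig along nodes with a single outgoing edge, and `overlap` performs the suffix-prefix comparison at the heart of the OLC overlap step.

```python
from collections import defaultdict

def de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each distinct k-mer
    contributes one edge from its prefix to its suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

def walk_unbranched(graph: dict[str, list[str]], start: str) -> str:
    """Spell a contig by following edges while the current node has exactly
    one outgoing edge; stop at branches, dead ends, or revisited nodes."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:  # a cycle, typically caused by a repeat
            break
        seen.add(node)
        contig += node[-1]
    return contig

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no overlap of at least `min_len` exists."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
    graph = de_bruijn(reads, k=4)
    print(walk_unbranched(graph, "ATG"))
    print(overlap(reads[0], reads[1]))
```

The choice of k is a trade-off: larger k-mers resolve more repeats but require higher coverage, which is one reason the k-mer approach fits accurate short reads while overlap-based methods fit long reads.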

The advantage of de novo assembly is that it can recover sequences that are absent from the reference genome, which is valuable for studying species evolution, discovering new genes, and characterizing species-specific regions. In crop breeding, de novo assembly can help uncover potential disease-resistance and stress-resistance genes in wild relatives and support the breeding of improved varieties.

However, de novo assembly also faces many challenges. First, it places high demands on the quality and quantity of sequencing data; complex genomes, or genomes rich in repetitive sequences, require higher sequencing depth and longer reads to ensure an accurate assembly. Second, de novo assembly is computationally intensive and requires substantial time and computing resources, which is a major challenge for research teams with limited resources.
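Sequencing depth can be estimated before a project starts with the standard relation depth = (number of reads × read length) / genome size. The sketch below applies that formula; the 400 Mb genome size, 30x target, and read lengths in the example are illustrative assumptions rather than recommendations, and the function names are hypothetical.

```python
import math

def expected_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size

def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Minimum number of reads needed to reach the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)

if __name__ == "__main__":
    genome_size = 400_000_000  # assumed genome size in bp, for illustration only
    print(reads_needed(30, 150, genome_size))      # short reads of 150 bp
    print(reads_needed(30, 15_000, genome_size))   # long reads of 15 kb
```

Average depth is only a planning figure: coverage is uneven in practice, and repeats longer than the reads remain unresolved regardless of depth.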

Flowchart of genome assembly: de novo and reference genome-based (Diniz et al., 2017)

Pan-Genome Construction Process

In pan-genome construction, gene family clustering and variant detection are the key steps. Core and variable gene families can be identified by clustering sequences from multiple genomes with tools such as OrthoFinder. Combined with SNP and CNV detection, this reveals the genetic diversity of a species and lays a foundation for analyzing evolutionary and phenotypic differences.

Gene Family Clustering

Gene family clustering is one of the key steps in pan-genome construction, and OrthoFinder is a commonly used tool. The basic procedure is to compare the protein or nucleotide sequences of multiple genomes and then group them into gene families according to sequence similarity. Through gene family clustering, researchers can study the expansion, contraction, and evolutionary relationships of gene families.

When studying the pan-genome of several related species, OrthoFinder clustering can identify core gene families (present in all species) and variable gene families (present in only some species), providing clues to adaptive evolution and the origin of species-specific traits. Appropriate alignment parameters and clustering algorithms must be chosen to ensure accurate clustering results.
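Once genes have been assigned to families, separating core from variable families is a set comparison. The sketch below assumes the clustering step has already produced (genome, family) pairs; it does not parse any particular tool's output format, and the genome names, family identifiers, and function names are hypothetical.

```python
from collections import defaultdict

def family_presence(assignments: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each gene family to the set of genomes that contain at least one member."""
    presence = defaultdict(set)
    for genome, family in assignments:
        presence[family].add(genome)
    return dict(presence)

def split_core_variable(
    presence: dict[str, set[str]], genomes: set[str]
) -> tuple[list[str], list[str]]:
    """Core families occur in every genome; all other families are variable."""
    core = sorted(f for f, found in presence.items() if genomes <= found)
    variable = sorted(f for f, found in presence.items() if not genomes <= found)
    return core, variable

if __name__ == "__main__":
    # Hypothetical assignments: (genome, gene family) for each clustered gene.
    assignments = [
        ("A", "OG1"), ("B", "OG1"), ("C", "OG1"), ("A", "OG1"),
        ("A", "OG2"), ("B", "OG2"),
        ("C", "OG3"),
    ]
    core, variable = split_core_variable(family_presence(assignments), {"A", "B", "C"})
    print("core:", core)
    print("variable:", variable)
```

Multiple copies of a family in one genome do not affect the result, because only presence or absence per genome is recorded.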

Pipeline and example result from the Pandagma software package (Cannon et al., 2024)

Variation Detection

Variation detection covers single nucleotide polymorphisms (SNPs) and copy number variations (CNVs), both of which are important for understanding the genetic diversity and phenotypic differences of a species. In pan-genome research, variant detection is usually carried out after the genomes have been assembled.

For SNP detection, single-nucleotide differences are identified by aligning the genome sequence of each sample to the reference genome or to the genomes of other samples.
CNV detection is more complicated: changes in copy number are identified by calculating read coverage across genome regions or by using dedicated algorithms.
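Both detection principles can be sketched in simplified form. The example assumes that the two sequences are already aligned, equal in length, and free of gaps, and that read depth per genomic window has already been computed. Real variant callers also model base quality, mapping quality, GC bias, and mappability. Function names and data are hypothetical.

```python
from statistics import median

def call_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Report single-base differences as (1-based position, ref base, sample base).
    Assumes the sequences are pre-aligned with no insertions or deletions."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]

def copy_number(window_depths: list[float], ploidy: int = 2) -> list[int]:
    """Estimate copy number per window from depth relative to the median depth,
    assuming the median window is at the normal (ploidy) copy number."""
    baseline = median(window_depths)
    return [round(ploidy * depth / baseline) for depth in window_depths]

if __name__ == "__main__":
    print(call_snps("ACGTACGT", "ACGAACGT"))
    # Hypothetical mean read depth in eight consecutive windows.
    print(copy_number([30, 31, 29, 60, 62, 30, 15, 30]))
```

In the depth example, windows with about twice the median depth are called as duplications and the window with half the median depth as a single-copy deletion.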

In human pan-genome studies, variant detection can help identify disease-associated genetic variation and provide a basis for diagnosis and treatment. During variant detection, attention should be paid to sequencing data quality, the choice of alignment algorithm, and the variant filtering criteria in order to reduce false positives.

Overview of the PanSVR structural variant calling process (Li et al., 2021)

Data Analysis Tools and Platforms

In pan-genome research, choosing appropriate data analysis tools and platforms is very important. Tools differ in design principles and functionality, and are therefore suited to different kinds of studies.

Roary

Roary is a tool for bacterial pan-genome analysis that rapidly clusters genes from many bacterial genomes and identifies core and variable genes. Its main advantage is speed, which makes it suitable for large-scale bacterial genome datasets. In studies of the evolution of bacterial drug resistance, Roary can be used to analyze the genomes of large numbers of resistant strains and to identify resistance-related gene families and variation, providing a reference for developing new antibacterial drugs and strategies.
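Bacterial pan-genome summaries usually go beyond a strict core/variable split and bin genes by how many isolates carry them. The sketch below uses the four frequency classes commonly reported in Roary's summary statistics (core, soft core, shell, cloud), with the thresholds exposed as adjustable parameters. It works on plain counts rather than reading Roary's output files, and the gene names and counts are hypothetical.

```python
from collections import Counter

def frequency_category(
    n_present: int,
    n_genomes: int,
    core: float = 0.99,
    soft_core: float = 0.95,
    shell: float = 0.15,
) -> str:
    """Bin a gene by the fraction of genomes in which it is present."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    fraction = n_present / n_genomes
    if fraction >= core:
        return "core"
    if fraction >= soft_core:
        return "soft core"
    if fraction >= shell:
        return "shell"
    return "cloud"

def summarize(presence_counts: dict[str, int], n_genomes: int) -> Counter:
    """Count how many genes fall into each frequency category."""
    return Counter(frequency_category(n, n_genomes) for n in presence_counts.values())

if __name__ == "__main__":
    # Hypothetical number of isolates (out of 100) carrying each gene.
    counts = {"gyrA": 100, "recA": 97, "blaTEM": 40, "hypothetical_1": 2}
    print(summarize(counts, n_genomes=100))
```

A resistance gene that lands in the shell or cloud class is present in only part of the population, which is typically where acquired resistance determinants are found.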

panX

panX is a pan-genome analysis pipeline with a web-based visualization platform, developed primarily for microbial genomes (Ding et al., 2018). It takes a set of annotated genomes, clusters their genes into orthologous groups, and builds phylogenies for the core genome and for individual gene families. Its distinguishing feature is rich interactive visualization, which makes it convenient for researchers to explore and interpret large datasets. Applying the same kind of analysis to plant pan-genomes, for example to identify genes and variation related to agronomic traits across crop varieties, generally requires tools designed for large eukaryotic genomes.

Interconnected components of the panX web application (Ding et al., 2018)

PGAP

PGAP (Pan-Genome Analysis Pipeline) is an integrated pan-genome analysis platform that combines multiple steps of pan-genome construction and analysis in one workflow. Its advantage is a one-stop process: users can run steps such as gene family clustering, pan-genome profiling, variation detection, and evolutionary analysis on the same platform. Because it works from annotated genome sequences, it can be combined with assemblies produced by different sequencing technologies and can flexibly meet different research needs.

Conclusion

Innovations in sequencing technology and improvements in analytical methods continue to drive pan-genome research. The combined use of long-read and short-read technologies, together with the integrated analysis of multi-omics data, is gradually overcoming the bottlenecks of complex genome assembly and genetic variation analysis.

In the future, it will be necessary to standardize sample collection across species and to develop more intelligent data analysis algorithms, in order to build more complete pan-genome maps and provide a firmer theoretical basis and technical support for evolutionary biology, functional genomics, and precision medicine research.

References

Zee A, Deng DZQ, et al. "Sequencing Illumina libraries at high accuracy on the ONT MinION using R2C2." Genome Res. 2022 32(11-12): 2092-2106 https://doi.org/10.1101/gr.277031.122
Deshpande D, Chhugani K, et al. "RNA-seq data science: From raw data to effective interpretation." Front Genet. 2023 14: 997383 https://doi.org/10.3389/fgene.2023.997383
Diniz WJ, Canduri F. "Bioinformatics: an overview and its applications." Genet Mol Res. 2017 16(1) https://doi.org/10.4238/gmr16019645
Cannon SB, Lee HO, Weeks NT, Berendzen J. "Pandagma: a tool for identifying pan-gene sets and gene families at desired evolutionary depths and accommodating whole-genome duplications." Bioinformatics. 2024 40(9): btae526 https://doi.org/10.1093/bioinformatics/btae526
Li G, Jiang T, Li J, Wang Y. "PanSVR: Pan-Genome Augmented Short Read Realignment for Sensitive Detection of Structural Variations." Front Genet. 2021 12: 731515 https://doi.org/10.3389/fgene.2021.731515
Ding W, Baumdicker F, Neher RA. "panX: pan-genome analysis and exploration." Nucleic Acids Res. 2018 46(1): e5 https://doi.org/10.1093/nar/gkx977

For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.