Comprehensive Workflows, Core Tools, and Analytical Strategies for GBS Data Processing

Genotyping-by-sequencing (GBS) is an efficient, low-cost reduced-representation sequencing technology that has become a core method for analyzing plant genetic diversity, accelerating crop breeding, and dissecting the genetic basis of complex traits. By combining genome complexity reduction with high-throughput sequencing and accurate genotyping, GBS can rapidly deliver massive numbers of genome-wide genetic markers, such as single-nucleotide polymorphisms (SNPs), providing key data support for research ranging from population genetics to molecular breeding.

However, the large-scale datasets generated by GBS (often tens of thousands to millions of variant sites across hundreds of samples) place stringent demands on the standardization of the analysis workflow, the suitability of the tools, and the reliability of the results. From quality control, sequence alignment, and genotype calling of the raw sequencing data through downstream population structure analysis, gene mapping, and visualization, every step requires rigorous method selection and parameter optimization. Meanwhile, interference from complex genomes (such as polyploids), missing data caused by uneven sequencing depth, and the computational burden of large sample sets further underline the importance of an efficient analysis system.

This article details GBS data analysis workflows; core tools such as TASSEL, Stacks, GATK, and PLINK; common challenges (uneven depth, complex genomes, etc.) and their solutions; downstream analysis and visualization tools; and concludes with a summary of the field's significance.

Workflow and Key Stages in GBS Data Analysis

GBS has been widely applied in plant genetics, crop breeding, and population evolution research because of its high throughput and low cost. The data analysis workflow is the core link between raw sequencing data and biological conclusions and must be strictly standardized to ensure reliable results. GBS data analysis can be divided into four key stages that connect tightly into a complete chain from data generation to information extraction.

Quality control and preprocessing of the raw sequencing data are the first step of GBS analysis and directly affect the accuracy of all subsequent results. The raw data (usually in FASTQ format) contain the sequencing reads and their per-base quality scores, so low-quality bases, adapter sequences, and duplicate reads must be removed with quality control tools. Specifically, quality control includes: filtering on the Phred quality score (Q value), usually retaining bases with Q ≥ 20 (error rate ≤ 1%); removing reads in which the proportion of N (unknown) bases exceeds 5%; trimming sequencing adapters and primer sequences (with tools such as Cutadapt); and removing PCR duplicates (especially at high sequencing depth, where duplicates can bias variant calling).
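As a rough illustration of the Q ≥ 20 and N ≤ 5% criteria above, the sketch below applies them in plain Python to a read represented as a sequence string plus a Phred+33 quality string. The function names are hypothetical; real pipelines would use dedicated tools such as Cutadapt for this step.

```python
def phred_scores(qual):
    """Decode a Phred+33 quality string into numeric Q scores."""
    return [ord(c) - 33 for c in qual]

def passes_qc(seq, qual, min_q=20, max_n_frac=0.05):
    """Keep a read only if its mean base quality is at least Q20
    (error rate <= 1%) and at most 5% of its bases are 'N'."""
    scores = phred_scores(qual)
    mean_q = sum(scores) / len(scores)
    n_frac = seq.upper().count("N") / len(seq)
    return mean_q >= min_q and n_frac <= max_n_frac

print(passes_qc("ACGTAACGTA", "IIIIIIIIII"))  # 'I' = Q40, no Ns -> True
print(passes_qc("NNNNNACGTA", "IIIIIIIIII"))  # 50% Ns -> False
```

A Q score of 20 corresponds to an error probability of 10^(-20/10) = 1%, which is where the "Q ≥ 20, error rate ≤ 1%" rule of thumb comes from.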

Sequence alignment to a reference genome anchors the preprocessed reads to genomic positions, and its accuracy determines the reliability of variant detection. For species with a reference genome (such as rice and Arabidopsis thaliana), short-read aligners such as BWA and Bowtie2 achieve efficient alignment with appropriate parameters (e.g., allowing ≤ 2 mismatches). Alignment results are usually stored in SAM/BAM format and record each read's genomic position, mismatch information, and so on. For non-model organisms without a reference genome (such as many wild plants), a de novo strategy is needed: reads are clustered into contigs with tools such as Stacks and UNEAK before further analysis.
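The "≤ 2 mismatches" criterion can be enforced after alignment by inspecting each SAM record. The minimal sketch below assumes the aligner emits the standard optional `NM:i:<n>` edit-distance tag (BWA and Bowtie2 both do); the helper names are hypothetical.

```python
def edit_distance(fields):
    """Pull the NM:i:<n> edit-distance tag from a SAM record's optional fields
    (columns 12 onward)."""
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None

def keep_alignment(sam_line, max_mismatches=2):
    """Retain an alignment only if its edit distance is <= max_mismatches."""
    nm = edit_distance(sam_line.rstrip("\n").split("\t"))
    return nm is not None and nm <= max_mismatches

rec = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1"
print(keep_alignment(rec))                              # True
print(keep_alignment(rec.replace("NM:i:1", "NM:i:5")))  # False
```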

Schematic of the four stages of the GBS-SNP-CROP workflow (Melo et al., 2016)

Genotyping and variant detection are the core objectives of GBS analysis, aiming to identify genetic variants such as SNPs and InDels from the alignment data and determine the genotype of each sample. This stage relies on variant callers (such as GATK and TASSEL-GBS), whose core methods include genotype likelihood calculation based on Bayesian models, population-level variant filtering (e.g., minor allele frequency ≥ 5% and missing rate ≤ 20%), and detection of abnormal heterozygosity (to exclude possible sample contamination).
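The population-level thresholds mentioned above (MAF ≥ 5%, missing rate ≤ 20%) can be sketched directly. The toy functions below (hypothetical names, genotypes given as two-character strings or `None` for missing) show what such a per-site filter computes; real pipelines apply equivalent filters inside the variant caller or with VCFtools/PLINK.

```python
def site_stats(genotypes):
    """Minor allele frequency and missing rate for one biallelic site.
    genotypes: diploid calls per sample, e.g. 'AA', 'AG', or None for missing."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    alleles = [a for g in called for a in g]
    counts = {a: alleles.count(a) for a in set(alleles)}
    maf = min(counts.values()) / len(alleles) if len(counts) > 1 else 0.0
    return maf, missing_rate

def keep_site(genotypes, min_maf=0.05, max_missing=0.20):
    maf, missing = site_stats(genotypes)
    return maf >= min_maf and missing <= max_missing

print(keep_site(["AA"] * 6 + ["AG"] * 2 + ["GG"] + [None]))  # common variant -> True
print(keep_site(["AA"] * 19 + ["AG"]))                       # MAF 2.5% -> False
```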

Data standardization and format conversion bridge upstream processing and downstream analysis; data must be converted into formats suited to different tools according to the research objectives. Population genetic analysis tools (such as Structure and ADMIXTURE) usually require PLINK-format input (.ped/.map), so VCF files need conversion with PLINK; linkage map construction tools (such as JoinMap) require linkage format (.loc), which can be produced by TASSEL or the R/qtl package; genome-wide association study (GWAS) tools (such as GAPIT) can read VCF files directly, but low-quality variants (e.g., sites with MAF < 0.05) should be pre-filtered.
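To make the VCF → PLINK mapping concrete, the toy converter below translates one biallelic VCF data line into a .map entry and the per-sample allele pairs that populate a .ped row. It is for illustration only (hypothetical function name, minimal parsing); in real work the conversion is done with `plink --vcf input.vcf --recode --out output`.

```python
def vcf_to_ped_fields(vcf_line):
    """Convert one biallelic VCF data line into a PLINK .map entry plus the
    per-sample allele pairs for a .ped file. '0 0' is PLINK's
    missing-genotype code."""
    f = vcf_line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt = f[0], f[1], f[2], f[3], f[4]
    map_entry = (chrom, vid, "0", pos)   # chromosome, ID, cM, bp position
    allele_of = {"0": ref, "1": alt}
    pairs = []
    for sample in f[9:]:
        gt = sample.split(":")[0].replace("|", "/")  # GT is the first subfield
        if gt in (".", "./."):
            pairs.append(("0", "0"))
        else:
            a, b = gt.split("/")
            pairs.append((allele_of[a], allele_of[b]))
    return map_entry, pairs

rec = "1\t12345\trs1\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t./."
print(vcf_to_ped_fields(rec))
# (('1', 'rs1', '0', '12345'), [('A', 'A'), ('A', 'G'), ('0', '0')])
```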

Bar plot showing the extent of marker overlap among the five evaluated pipelines (Melo et al., 2016)

GBS Data Analysis Tools and Their Functionalities

The complexity of GBS data analysis has given rise to a variety of specialized tools designed for different steps (alignment, variant detection, downstream analysis, etc.), each with its own strengths and applicable scenarios. Choosing the right combination of tools is key to improving analysis efficiency and the reliability of the results. Several core tools and their functional characteristics are introduced below.

TASSEL

TASSEL-GBS is a high-throughput GBS analysis pipeline developed at Cornell University. It is designed specifically for plant populations and supports automated analysis of the whole workflow from raw data to genotype calling. Its core functions include tag identification based on restriction-site information, tag clustering and alignment, SNP calling, and export of genotype data. The particular advantage of TASSEL-GBS is its efficiency on large-scale sample sets (such as tens of thousands of plant materials); parallelization can shorten computation time by more than 50%.

Schematic representation of the TASSEL-GBS Discovery Pipeline (Glaubitz et al., 2014)

Stacks

Stacks is a tool for de novo assembly and genotyping of non-model organisms; it can detect variants without a reference genome and is widely used for wild plants, fish, and other species lacking genomic resources. Its core algorithm clusters similar reads into a "stack", builds reduced-representation tags, and then identifies SNPs through population polymorphism analysis. The strength of Stacks is its high tolerance of low-coverage data (a minimum sequencing depth as low as 3×) and its support for directly calculating population genetic parameters (such as Fst and π).
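As a minimal sketch of one such parameter, nucleotide diversity (π) at a single site is the mean pairwise difference among the sampled alleles. The brute-force version below (hypothetical function name, pure Python) is only meant to show what the statistic measures, not how Stacks computes it internally.

```python
from itertools import combinations

def nucleotide_diversity(alleles):
    """Per-site nucleotide diversity (pi): the mean pairwise difference
    among the sampled alleles at one position."""
    pairs = list(combinations(alleles, 2))
    return sum(1 for a, b in pairs if a != b) / len(pairs)

# 4 'A' and 4 'G' alleles among 8 sampled chromosomes: 16 of the 28 pairs differ.
print(round(nucleotide_diversity(["A"] * 4 + ["G"] * 4), 3))  # 0.571
```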

GATK

GATK (Genome Analysis Toolkit) is a general-purpose variant calling toolkit developed by the Broad Institute, used in GBS data analysis mainly for high-accuracy SNP calling and variant filtering. Its core modules (such as HaplotypeCaller and VariantFiltration) use statistical models that effectively distinguish real variants from sequencing errors, and it is especially well established on data from model organisms such as humans and mice. GATK's strengths are its high sensitivity for complex variants (such as multi-allelic SNPs and InDels) and its compatibility with functional annotation of variants (for example, integrating gene annotation through ANNOVAR).

PLINK

PLINK is a classic tool for population genetics and association analysis, mainly used for downstream processing and statistical analysis of GBS data. Its functions include data format conversion (e.g., VCF → PLINK), quality control (e.g., filtering loci with high missing rates), population structure analysis (e.g., PCA and LD calculation), and association analysis (e.g., chi-square tests and logistic regression). PLINK's advantage is its speed: it can handle millions of SNPs across tens of thousands of samples, and its output formats are compatible with most downstream tools (such as Structure and GCTA).
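Among the statistics listed, LD between a pair of biallelic loci is commonly summarized as r². The sketch below (hypothetical function name; alleles coded 0/1 and haplotypes assumed phased) shows the textbook definition, D = p11 − pA·pB and r² = D² / (pA(1−pA)·pB(1−pB)); PLINK estimates the same quantity from genotype data at scale.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic loci, with alleles coded 0/1 at each locus
    and haplotypes given as (x, y) pairs."""
    n = len(haplotypes)
    p_a = sum(x for x, _ in haplotypes) / n  # freq of allele 1 at locus A
    p_b = sum(y for _, y in haplotypes) / n  # freq of allele 1 at locus B
    p11 = sum(1 for x, y in haplotypes if x == y == 1) / n
    d = p11 - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom else 0.0

print(ld_r_squared([(0, 0), (0, 0), (1, 1), (1, 1)]))  # perfect LD -> 1.0
print(ld_r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))  # independent -> 0.0
```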

The Stacks workflow (Catchen et al., 2013)The Stacks pipeline (Catchen et al., 2013)

Challenges and Solutions in GBS Data Analysis

Although GBS data analysis has a well-established standard workflow, it still faces many challenges in practice, arising mainly from the technology's characteristics, differences among species, and data scale. To address these problems, researchers have developed a series of solutions that substantially improve analysis quality.

Uneven sequencing depth and missing data are the most common challenges in GBS analysis. Because GBS depends on the distribution of restriction sites, sequencing depth differs markedly across genomic regions (typically between 1× and 50×). Low-depth regions are prone to genotype calling errors, while a high missing rate (> 30%) reduces data utilization. For example, in wheat GBS data, about 20%-30% of SNP loci are discarded because of high missing rates, which weakens the statistical power of subsequent analyses. The main solutions include:

  • Optimizing the sequencing strategy: improve coverage uniformity by increasing sequencing depth (a mean depth ≥ 10× is recommended) or adopting a double-digest protocol
  • Imputing missing genotypes with tools such as BEAGLE and IMPUTE, which exploit linkage disequilibrium (LD) information. BEAGLE performs well in plant populations, reducing the missing rate from 30% to below 5% with imputation accuracy above 90%
  • Adopting robust statistical methods, such as mixed linear models that accommodate missing data in GWAS (e.g., the EM algorithm in GAPIT), to reduce the information loss caused by discarding data.
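To show what imputation means at the simplest level, the sketch below fills missing calls at one site with the most common observed genotype. This naive per-site baseline is labeled as such: BEAGLE and IMPUTE instead exploit LD between neighboring sites via haplotype models, which is why they reach much higher accuracy.

```python
from collections import Counter

def impute_major_genotype(genotypes):
    """Fill missing calls (None) at one site with the most common observed
    genotype. A naive per-site baseline, not the LD-based approach that
    BEAGLE and IMPUTE actually use."""
    observed = [g for g in genotypes if g is not None]
    major = Counter(observed).most_common(1)[0][0]
    return [major if g is None else g for g in genotypes]

print(impute_major_genotype(["AA", "AA", "AG", None, None]))
# ['AA', 'AA', 'AG', 'AA', 'AA']
```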

Concordance before and after applying the GBS SNP filter (Cooke et al., 2016)

Analysis of complex genomes is particularly difficult in polyploid and highly repetitive species (such as wheat, potato, and sugarcane). Homoeologous chromosomes in polyploids cause ambiguity in sequence alignment, while highly repetitive sequences increase the proportion of false-positive variants. For example, repetitive sequences account for more than 80% of the hexaploid wheat genome, and the GBS alignment error rate can reach 15%-20%. Solutions include:

  • Using aligners optimized for polyploids, such as PolyCat, which improve accuracy by distinguishing homoeolog-specific sequences and can reduce the error rate in wheat to below 5%
  • Using haplotype-based analysis strategies, such as those used in HapMap3, which construct haplotype blocks to reduce interference from homoeologous sequences and improve the specificity of variant detection
  • Anchoring variant sites to specific chromosomes with physical maps or fluorescence in situ hybridization (FISH) data to avoid confusing homoeologous regions.

Consistency checks between samples and technical replicates are key to ensuring data reliability, but they are often neglected. Sample contamination, barcode hopping, or sequencing batch effects can lower the genotype concordance of replicate samples. For example, the index hopping rate of Illumina sequencing platforms is about 0.5%-2%, which can introduce false-positive variants in large-scale sample analyses. Solutions include:

  • Including technical replicates in the experimental design (replicating more than 5% of samples per population is suggested) and evaluating data quality by calculating the genotype concordance rate between replicates, which should usually exceed 95%
  • Detecting anomalous samples with bioinformatics tools, for example identifying samples with an aberrant genetic background via identity-by-state (IBS) analysis in PLINK, or excluding outliers by PCA clustering
  • Correcting batch effects with tools such as SVA and ComBat to remove the influence of sequencing batches, especially when integrating GBS data generated at different times.
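The concordance rate between replicates is simple to compute: the fraction of sites, among those called in both replicates, where the two genotypes agree. A minimal sketch (hypothetical function name, `None` marking a missing call):

```python
def concordance_rate(rep1, rep2):
    """Genotype concordance between two technical replicates, computed over
    sites where both calls are non-missing (None = missing). Values well
    below ~95% flag possible contamination or index hopping."""
    shared = [(a, b) for a, b in zip(rep1, rep2)
              if a is not None and b is not None]
    return sum(1 for a, b in shared if a == b) / len(shared)

r1 = ["AA", "AG", "GG", None, "AA"]
r2 = ["AA", "AG", "GG", "AA", "AG"]
print(concordance_rate(r1, r2))  # 3 of 4 comparable sites agree -> 0.75
```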

Detecting population structure and growth with GBS data (Cooke et al., 2016)

Analysis and Visualization Tools for GBS Data

After quality control, alignment, and variant detection, GBS data require downstream analysis to extract biological meaning, such as population genetic structure analysis, gene mapping, and linkage map construction. Visualization tools then turn complex data into intuitive charts that help interpret and present the results. Several core downstream analysis and visualization tools and their application scenarios are introduced below.

Population genetic structure analysis tools resolve the genetic relationships among samples and population stratification, which underpin evolutionary research and association analysis. Structure is a Bayesian population structure inference tool that reveals latent genetic structure by assigning samples to K hypothetical subpopulations. In plant GBS studies, Structure is often used to delineate the ecotypes of cultivated species.

Genome-wide association study (GWAS) tools mine phenotype-associated genetic variants in natural populations and are widely used to study complex traits in crops. GAPIT (Genome Association and Prediction Integrated Tool) is an R-based GWAS tool that supports mixed linear models (MLM) and can effectively control the confounding effects of population structure and kinship on association results.
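At its core, a single-marker test regresses phenotype on allele dosage (0/1/2 copies of the alternate allele). The sketch below (hypothetical function name, plain least squares) shows only this fixed marker effect; an MLM as in GAPIT additionally fits population structure and kinship as covariates, which is precisely what this naive version omits.

```python
def marker_effect(dosages, phenotypes):
    """Least-squares slope of phenotype on allele dosage (0/1/2) for one
    marker: a naive single-marker regression with no structure or
    kinship correction."""
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(phenotypes) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, phenotypes))
    sxx = sum((x - mx) ** 2 for x in dosages)
    return sxy / sxx  # additive effect per alternate allele

print(round(marker_effect([0, 1, 2, 0, 2, 1],
                          [10.0, 12.1, 13.9, 10.2, 14.1, 11.8]), 2))  # 1.95
```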

Data visualization tools turn GBS analysis results into intuitive charts that aid interpretation. Circos draws circular chromosome maps that display multi-dimensional information such as SNP density, gene positions, and QTL intervals; in wheat, for example, Circos can clearly show the relationship between the chromosomal distribution of disease-resistance genes and GBS markers. ggplot2 is an R plotting package that supports PCA scatter plots, LD decay curves, population phylogenetic trees, and more; its highly customizable parameters meet diverse visualization needs.

Multi-omics data integration tools link GBS genotype data with other omics data such as transcriptomes and metabolomes to reveal the molecular regulatory mechanisms of traits. WGCNA (weighted gene co-expression network analysis) can associate GBS markers with gene expression data and identify co-expression modules related to target traits.

Overview of the R/Bioconductor package SWATH2stats (Blattmann et al., 2016)

Conclusion

The rapid development of GBS technology has driven innovation in plant genetics and breeding research. Data analysis is the core link between the technology and scientific discovery, and advances in its methods and tools directly determine how efficiently GBS data can be exploited. This article has summarized the basic workflow, core tools, challenges, and applications of GBS data analysis, providing a systematic reference for researchers.

References:

  1. Melo AT, Bartaula R, Hale I. "GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data." BMC Bioinformatics. 2016 17: 29 https://doi.org/10.1186/s12859-016-0879-y
  2. Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA. "Stacks: an analysis tool set for population genomics." Mol Ecol. 2013 22(11): 3124-3140 https://doi.org/10.1111/mec.12354
  3. Glaubitz JC, Casstevens TM, Lu F, et al. "TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline." PLoS One. 2014 9(2): e90346 https://doi.org/10.1371/journal.pone.0090346
  4. Cooke TF, Yee MC, Muzzio M, et al. "GBStools: A Statistical Method for Estimating Allelic Dropout in Reduced Representation Sequencing Data." PLoS Genet. 2016 12(2): e1005631 https://doi.org/10.1371/journal.pgen.1005631
  5. Blattmann P, Heusel M, Aebersold R. "SWATH2stats: An R/Bioconductor Package to Process and Convert Quantitative SWATH-MS Proteomics Data for Downstream Analysis Tools." PLoS One. 2016 11(4): e0153160 https://doi.org/10.1371/journal.pone.0153160