Foundations of Genotyping-by-Sequencing in Plant Genomics


With the rapid development of high-throughput sequencing, genotyping-by-sequencing (GBS) has become an efficient and economical genotyping technology that has reshaped plant genetics research and breeding practice. GBS is a multiplexed, reduced-representation sequencing method based on restriction endonuclease digestion. It discovers and genotypes single-nucleotide polymorphism (SNP) markers across the genome at large scale in a single assay, and it does not require a reference genome.

This article describes the core methodology and technical workflow of GBS, analyzes its advantages over other genotyping platforms (SNP arrays and whole-genome sequencing), and discusses its suitability and future prospects for breeding projects of different scales and crops.

What is Genotyping-by-Sequencing: Defining the Core Methodology

To understand GBS accurately, two core characteristics matter: multiplexing and reduced representation.

  • Multiplexing: DNA from dozens or even hundreds of samples is pooled in one sequencing run after a sample-specific index (barcode) is added to the DNA fragments of each sample. Pooling spreads the sequencing cost across samples, which makes high-density marker analysis of large breeding populations feasible and is the key to the high throughput and low cost of GBS.
  • Reduced representation: This is the essence of GBS design. Instead of sequencing the whole genome, GBS sequences a reproducible subset of it defined by restriction sites. Genomic DNA is digested with one or two restriction enzymes (in two-enzyme protocols, usually a rare cutter combined with a common cutter) to produce a large number of fragments, and only the ends of these fragments (the regions flanking the restriction sites) are sequenced. Because restriction sites are largely conserved among individuals in a population, the same flanking regions are sampled in every individual, so a SNP within one of them can be detected and compared across samples.
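
The reduced-representation idea (digest, then read only the fragment ends) can be illustrated with a small in silico digest. The sketch below is an illustration only and not part of any GBS pipeline: the function names (`cut_positions`, `digest`, `sequenced_fraction`) and the toy sequence are hypothetical, and it assumes the ApeKI recognition site G^CWGC (W = A or T). Because that site is palindromic, scanning one strand is sufficient here.

```python
import re

# Minimal IUPAC table: only the codes needed for the site used below.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]"}


def cut_positions(seq, site, offset):
    """Return the cut coordinate for every occurrence of `site` in `seq`.

    `offset` is the number of site bases left of the cut (G^CWGC -> 1).
    A lookahead is used so that overlapping sites are all reported.
    """
    pattern = "(?=" + "".join(IUPAC[base] for base in site) + ")"
    return [m.start() + offset for m in re.finditer(pattern, seq.upper())]


def digest(seq, site, offset):
    """Split `seq` at every cut position and return the fragments in order."""
    bounds = [0] + cut_positions(seq, site, offset) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def sequenced_fraction(fragments, read_len):
    """Fraction of bases covered when `read_len` bases are read inward
    from both ends of every fragment (toy model: no size selection)."""
    covered = sum(min(len(f), 2 * read_len) for f in fragments)
    return covered / sum(len(f) for f in fragments)


# Hypothetical toy sequence with two ApeKI sites (GCAGC and GCTGC).
toy = "AAAA" + "GCAGC" + "T" * 20 + "GCTGC" + "CCCC"
fragments = digest(toy, "GCWGC", offset=1)
print([len(f) for f in fragments])        # fragment lengths
print(sequenced_fraction(fragments, 5))   # share of bases that would be read
```

On a real genome the fragments are far longer than the reads, so the sequenced share is small, which is where the cost saving comes from.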

GBS can therefore be defined as a low-cost genome analysis technology based on restriction endonuclease digestion that discovers and genotypes thousands of SNP markers simultaneously by building a multiplexed, reduced-representation sequencing library.

Its core output is a genotype matrix covering all samples at thousands of SNP sites (usually stored in VCF format), which can be used directly in breeding applications such as genetic diversity analysis, population structure analysis, linkage map construction, genome-wide association studies (GWAS), and genomic selection (GS).
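
To make the shape of this output concrete, the sketch below parses a tiny, made-up VCF fragment into a sites-by-samples matrix of alternate-allele counts (0, 1, 2, or None for missing) and computes the call rate and minor allele frequency for each site. It assumes biallelic sites with the genotype (GT) as the first FORMAT field; the sample names, coordinates, and helper names are hypothetical. A production workflow would normally use a dedicated VCF library rather than hand parsing.

```python
# Build a small VCF text in memory (all values are invented for illustration).
FIXED = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
records = [
    FIXED + ["S1", "S2", "S3", "S4"],
    ["chr1", "101", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/1", "1/1", "./."],
    ["chr1", "250", ".", "C", "T", ".", "PASS", ".", "GT", "0/0", "0/0", "0/1", "0/0"],
]
VCF_TEXT = "##fileformat=VCFv4.2\n" + "\n".join("\t".join(r) for r in records) + "\n"


def parse_gt(field):
    """Convert a sample field such as '0/1' to an ALT-allele count, or None."""
    alleles = field.split(":")[0].replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(int(a) > 0 for a in alleles)


def read_matrix(vcf_text):
    """Return (sample names, site coordinates, genotype rows per site)."""
    samples, sites, rows = [], [], []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta lines and blank lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]
            continue
        sites.append((cols[0], int(cols[1])))
        rows.append([parse_gt(f) for f in cols[9:]])
    return samples, sites, rows


def site_stats(row):
    """Return (call rate, minor allele frequency) for one site."""
    called = [g for g in row if g is not None]
    call_rate = len(called) / len(row)
    if not called:
        return call_rate, None
    alt_freq = sum(called) / (2 * len(called))
    return call_rate, min(alt_freq, 1 - alt_freq)


samples, sites, matrix = read_matrix(VCF_TEXT)
for site, row in zip(sites, matrix):
    print(site, row, site_stats(row))
```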

GBS adapters, PCR and sequencing primers (Elshire et al., 2011)

The Technical Workflow: From DNA Extraction to SNP Calling

The complete GBS process can be divided into two stages: wet-lab library construction and dry-lab bioinformatic analysis.

Stage 1: Wet-Lab Procedure

  • A. Step 1: Tissue sampling and DNA extraction
    • a) The process begins with high-quality plant tissue samples (such as leaves or seed endosperm). To keep digestion efficiency uniform across samples and to meet the requirements for DNA quality and concentration, DNA is extracted with a standardized method (such as the CTAB method or a commercial kit), then quantified and quality-checked by agarose gel electrophoresis or with a fluorometer. High-quality DNA is a prerequisite for high-coverage, reproducible data.
  • B. Step 2: Restriction endonuclease digestion
    • a) This is one of the most critical steps in the GBS process. The choice of restriction endonuclease directly determines the complexity of the reduced genome and which genomic regions are captured.
    • b) Enzyme selection: commonly used enzymes such as ApeKI are methylation-sensitive, so they tend to cut in low-copy, gene-rich regions and to avoid highly repetitive, heavily methylated heterochromatin. This lets GBS target regions that are more likely to carry functional variation.
    • c) Two-enzyme combinations such as PstI/MspI are also common (double-digest GBS, or ddGBS). Combining a rare cutter (PstI) with a common cutter (MspI) reduces genome complexity further and gives more control over marker density.
  • C. Step 3: Library preparation and adapter ligation
    • a) The ends of the DNA fragments produced by digestion are ligated to sequencing adapters. These adapters are carefully designed and usually include the following functional elements:
      • Illumina sequencing primer binding site: used for bridge amplification and for sequencing primer binding.
      • Sample-specific barcode: a short, unique oligonucleotide sequence used to assign reads to their sample of origin after pooled sequencing.
      • Y-shaped adapter structure: Y-shaped adapters are often used in current GBS protocols. They help suppress adapter self-ligation products, ensure that sequencing starts from genuine restriction fragments, and improve data quality.
    • b) After ligation, a PCR pre-amplification is usually carried out to enrich fragments that carry adapters and to provide enough DNA for the subsequent fragment size selection.
  • D. Step 4: Fragment size selection, purification, and pooling
    • a) The amplified product is size-selected with magnetic beads (such as SPRI beads) or by agarose gel electrophoresis. Fragments that are too short or too long are removed, and fragments suited to the Illumina sequencing platform (for example, 200-300 bp) are retained.
    • b) This step improves cluster generation efficiency and data yield. The sample libraries, each carrying a unique barcode, are then pooled in equimolar amounts to form the multiplexed sequencing library.
  • E. Step 5: High-throughput sequencing
    • a) The pooled library is sequenced on an Illumina HiSeq, NovaSeq, or similar high-throughput instrument. Single-end or paired-end sequencing (such as PE150) is typical, and the required sequencing depth depends on the number of samples and the desired marker density. The average sequencing depth per sample is usually between 1x and 20x.
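
How depth relates to pooling level and marker density can be estimated with simple arithmetic. The sketch below assumes reads are spread evenly over samples and loci, which real libraries only approximate; every number in it (run output, number of pooled samples, locus count, usable fraction) is a hypothetical placeholder, not a recommendation. It also shows why low depth matters: at a heterozygous site covered by n reads, each drawn from either allele with equal probability, both alleles are observed with probability 1 - 2(1/2)^n.

```python
def mean_depth(total_reads, n_samples, n_loci, usable_fraction=1.0):
    """Expected reads per locus per sample under perfectly even sampling."""
    return total_reads * usable_fraction / (n_samples * n_loci)


def prob_both_alleles_seen(n_reads):
    """P(both alleles of a heterozygote appear among n_reads reads),
    assuming each read comes from either allele with probability 1/2."""
    if n_reads < 2:
        return 0.0
    return 1.0 - 2.0 * 0.5 ** n_reads


# Hypothetical run: 400 million reads, 384 pooled samples,
# 100,000 tag loci, 80% of reads usable after QC and demultiplexing.
depth = mean_depth(400e6, 384, 100_000, usable_fraction=0.8)
print(f"mean depth per locus: {depth:.1f}x")
for n in (2, 4, 8):
    print(n, prob_both_alleles_seen(n))
```

Under this simple model, pooling more samples or capturing more loci lowers the depth per locus, and at very low depth a large share of heterozygotes will look homozygous.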

Stage 2: Bioinformatic Analysis Process

The raw sequencing data (FASTQ format) must go through several bioinformatic processing steps before it yields reliable genotype data.

  • A. Step 1: Data quality control and read trimming
    • a) Use software such as FastQC to assess the quality of the raw sequencing data, checking the sequencing error rate, GC content, adapter contamination, and so on. Then use Trimmomatic or Cutadapt to remove low-quality bases, adapter sequences, and primer sequences.
  • B. Step 2: Sample demultiplexing
    • a) Using the barcode sequence embedded in each read, the pooled sequencing data is split back into one file per sample. Only reads whose barcode matches exactly and that pass quality checks are assigned to a sample.
  • C. Step 3: Sequence alignment
    • a) The cleaned reads of each sample are aligned to the reference genome, if one is available. Commonly used aligners include BWA-MEM and Bowtie2. For species without a reference genome, a pseudo-reference strategy is used: high-quality reads from all samples are assembled de novo into a set of consensus tag sequences, and the reads of each sample are then aligned back to this tag set.
  • D. Step 4: SNP calling (variant detection)
    • a) This is the core of the analysis. Variant-calling software (such as SAMtools/bcftools, GATK, or the ref_map.pl and denovo_map.pl programs of the Stacks pipeline) identifies polymorphic sites in the population. The software tallies the sequencing depth, alleles, and allele frequencies of each sample at each locus, then assigns each sample a genotype (such as AA, AG, or GG) at that locus using a genotype likelihood model.
    • b) Strict filters, such as minor allele frequency (MAF), minimum sequencing depth, maximum missing rate, and a Hardy-Weinberg equilibrium test, are then applied to remove low-quality and unreliable SNP sites.
  • E. Step 5: Genotype matrix generation
    • a) Finally, the filtered high-quality SNP data set is written to a standard VCF file or allele matrix, which is the foundation for all downstream genetic analyses and breeding applications.
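
Two of the steps above, demultiplexing and likelihood-based genotype calling, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the algorithm of any named tool: barcodes are matched exactly at the start of the read and no barcode is a prefix of another, and genotypes are called for a biallelic site from reference and alternate read counts with a binomial model and a hypothetical per-base error rate of 1%. The function names, barcodes, and reads are invented for the example.

```python
from math import comb


def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by exact barcode match at the read start.

    Returns (sample -> list of reads with the barcode trimmed off,
    number of reads that matched no barcode).
    """
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = 0
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:  # no barcode matched this read
            unassigned += 1
    return by_sample, unassigned


def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """Binomial likelihood of the read counts under each diploid genotype."""
    n = n_ref + n_alt
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    return {g: comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
            for g, p in p_alt.items()}


def call_genotype(n_ref, n_alt, min_depth=2, error=0.01):
    """Return the maximum-likelihood genotype, or './.' below min_depth."""
    if n_ref + n_alt < min_depth:
        return "./."
    likelihoods = genotype_likelihoods(n_ref, n_alt, error)
    return max(likelihoods, key=likelihoods.get)


barcodes = {"ACGT": "S1", "TGCA": "S2"}  # hypothetical barcodes
reads = ["ACGTCAGCAAT", "TGCACAGCAGT", "GGGGCAGCAAT", "ACGTCAGCAGT"]
by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample, unassigned)
print(call_genotype(5, 0), call_genotype(3, 3), call_genotype(0, 4))
```

Calls made this way would still pass through the site-level filters (depth, MAF, missing rate) before entering the final matrix.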

The proportion of variance for analyzed traits from the 7 genetic models fitted (Dong et al., 2024)

Key Advantages of GBS for Plant Breeding Programs

GBS has quickly become a tool of choice for plant breeders because of the following advantages:

Excellent Cost-effectiveness

This is the most striking advantage of GBS. Multiplexed sequencing brings the cost per sample down to a very low level. Compared with arrays or whole-genome resequencing, which can cost hundreds of dollars per sample, GBS can typically be run for tens of dollars per sample or less. This makes genotyping thousands of breeding lines economically feasible and greatly expands the scale of breeding projects.

No Reference Genome Required

For many non-model organisms, orphan crops, and forest tree species, a high-quality reference genome is not available. Because GBS sequences short tags from a reduced portion of the genome, markers can be developed and genotyped through a de novo assembly strategy. This independence from a reference genome greatly broadens its range of application and opens the door to genetic improvement of species with limited genomic resources.

High Throughput and High Scalability

Hundreds to thousands of samples can be processed in one sequencing run, which enables genotyping of truly large populations. This high-throughput capacity meets the need in modern breeding for rapid, large-scale screening of early-generation segregating populations, core germplasm collections, and the many new lines produced every year.

Simultaneous Marker Discovery and Genotyping

Unlike array technology, which relies on known SNPs, GBS scans the genome afresh in every analysis. It not only genotypes known SNPs but also discovers new, rare, or population-specific SNPs at the same time. This is especially important for species with rich genetic diversity or when mining new alleles, because markers are discovered and put to use in the same experiment.

Enrichment of Gene-Rich Regions

As mentioned above, an appropriate restriction enzyme (such as ApeKI) lets GBS preferentially sample the low-copy, gene-coding regions of the genome. The resulting SNP markers are therefore more likely to lie in or be linked to functional gene regions, which raises the chance of locating candidate genes in GWAS or linkage analysis and makes the markers more useful.

Distribution of GBS SNP markers in the Oregon Wolfe Barley (OWB) bin map (Poland et al., 2012)

Comparing GBS with Other Genotyping Platforms

To evaluate the suitability of GBS comprehensively, it is useful to compare it with two other mainstream technologies: SNP arrays (chips) and whole-genome resequencing (WGS).

Comparison between GBS, SNP array and WGS

| Feature Dimension | GBS | SNP Array | WGS |
|---|---|---|---|
| Technical Principle | Reduced representation sequencing | Hybridization and fluorescence detection of known SNPs | Random fragment sequencing of the entire genome |
| Information Content | Combines unknown and known SNPs; moderate marker density; covers regions around specific restriction sites | Limited to pre-designed known SNPs on the array; fixed markers; no new variant discovery | Nearly all genome-wide variation (SNP, InDel, SV, etc.); most comprehensive information |
| Throughput and Cost | High throughput; extremely low per-sample cost; suitable for ultra-large populations | High throughput; moderate per-sample cost (rises with array density) | Low throughput; high per-sample cost (higher for deep sequencing) |
| Reference Genome Dependence | Not mandatory; de novo analysis applicable | Strongly dependent; array design requires known genome and SNP information | Strongly dependent; data analysis relies heavily on a high-quality reference genome |
| Data Complexity | Moderate; requires some bioinformatics support | Low; mature and standardized data analysis workflow | High; massive data volume; high demands on storage and computing resources |
| Marker Uniformity and Reproducibility | Moderate; affected by digestion efficiency and sequencing depth; some missing data | High; high genotype call rate; excellent reproducibility | High; optimal uniformity and reproducibility with sufficient sequencing depth |
| Main Application Scenarios | Large-scale breeding population screening; species without reference genomes; dynamic marker discovery; genetic map construction; GWAS/GS with moderate budget | Major crops with mature arrays (e.g., maize, soybean, wheat); targeted genotyping of fixed loci; commercial breeding requiring high-reproducibility, low-missing-rate data; germplasm resource fingerprint library construction | Basic research (e.g., evolution, population genetics); rare variant detection; structural variation research; genome assembly assistance; in-depth analysis of core germplasm with sufficient budget |

Comprehensive Comparison and Selection Strategy

GBS vs. SNP chip

  • 1) Choose GBS when:
    • The budget is limited but a large number of samples must be analyzed
    • The species has no commercial chip
    • New SNP markers need to be discovered continuously (for example, in wild germplasm or outcrossing populations)
    • The project is exploratory and the marker requirements are not fixed
  • 2) Choose an SNP chip when:
    • A well-designed chip of appropriate density is available for the crop
    • Highly standardized, reproducible data are needed to integrate results over the long term and across sites and seasons (for example, in the breeding programs of multinational companies)
    • The missing data rate must meet strict requirements
    • Bioinformatics capacity is limited and plug-and-play genotype data are preferred

GBS vs. WGS

  • 1) Choose GBS when:
    • The research goal is high-density SNP genotyping rather than comprehensive variant detection
    • The largest possible sample size must be covered at the lowest cost (for example, when building a training population for genomic selection)
    • Computing and storage resources are limited
  • 2) Choose WGS when:
    • The most complete genome information is required, including structural variation and other variants beyond SNPs
    • The sample size is small, but each sample needs in-depth analysis
    • Funding and computing resources are sufficient
In summary, GBS strikes a strong balance among cost, throughput, flexibility, and suitability for non-model species. It does not fully replace SNP chips or WGS, but it offers a very competitive middle path, especially for plant breeding projects that are still developing, have relatively limited resources, and want to adopt genomics technology.

Response to selection with GBS data in the expanding prediction set based on accuracies of genomic predictions (Gorjanc et al., 2015)

Conclusion

Since GBS was introduced, its data stability, reproducibility, and marker density have improved significantly through continued technical optimization (such as two-enzyme digestion systems for better control of genome complexity and improved adapter designs for higher efficiency) and through better bioinformatics tools. It has been applied successfully to many species, from annual field crops to perennial fruit trees, and has produced useful results in germplasm identification, high-density genetic map construction, QTL mapping of important traits, genome-wide association studies, and genomic selection.

In short, GBS has become an indispensable tool in modern plant breeding because of its core methodological advantages: high cost-effectiveness, no need for a reference genome, high throughput, and simultaneous marker discovery and genotyping. It bridges the gap between traditional molecular markers and expensive whole-genome sequencing, broadens access to genomics in plant breeding, and provides a key technical driver for accelerating crop genetic gain and meeting global food security challenges.

FAQ

1. What core characteristics define Genotyping-by-Sequencing (GBS) in plant genomics?

It has two key traits: multiplexing (pooling dozens to hundreds of samples with barcodes to cut costs) and reduced representation (using restriction enzymes to sequence specific genomic regions instead of the whole genome).

2. What are the two main stages of the GBS technical workflow?

The wet-lab procedure (DNA extraction → restriction digestion → library prep → size selection and pooling → sequencing) and the bioinformatic analysis process (data QC → demultiplexing → alignment → SNP calling → genotype matrix generation).

3. Does GBS require a reference genome for plant genotyping?

No. For species without a reference genome, GBS uses a de novo assembly strategy (building a pseudo-reference from high-quality reads) to develop and genotype markers.

4. How does GBS compare to SNP chips in cost and marker discovery?

GBS has lower per-sample costs (tens of dollars) and discovers new/rare SNPs; SNP chips rely on pre-designed known SNPs and have moderate costs, with higher data reproducibility.

5. What makes GBS suitable for large-scale plant breeding projects?

Its high throughput (processing hundreds/thousands of samples per run), cost-effectiveness, and ability to simultaneously perform marker discovery and genotyping meet large-population screening needs.

References

  1. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. "A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species." PLoS One. 2011 6(5): e19379.
  2. Dong L, Xie Y, Zhang Y, Wang R, Sun X. "Genomic dissection of additive and non-additive genetic effects and genomic prediction in an open-pollinated family test of Japanese larch." BMC Genomics. 2024 25(1): 11.
  3. Poland JA, Brown PJ, Sorrells ME, Jannink JL. "Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach." PLoS One. 2012 7(2): e32253.
  4. Gorjanc G, Cleveland MA, Houston RD, Hickey JM. "Potential of genotyping-by-sequencing for genomic selection in livestock populations." Genet Sel Evol. 2015 47(1): 12.
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
Copyright © CD Genomics. All Rights Reserved.