Germplasm resource banks around the world preserve the huge genetic variation necessary for crop improvement. However, how to efficiently and accurately analyze the genetic background of these massive resources and turn them into practical breeding advantages has always been a major challenge for plant breeders. The emergence of genotyping-by-sequencing (GBS) technology provides a revolutionary solution to this challenge. In this paper, we will discuss how GBS can quantify genetic diversity parameters, reveal the relationship between population structure and phylogeny, and finally transform this knowledge into strategic breeding decisions to guide parental selection and germplasm management, so as to maximize the potential value of genetic resources.
More than 1,750 germplasm banks around the world have preserved millions of plant genetic resources, which are strategic reserves for coping with future climate change and food security challenges. However, the genetic background of many accessions is ambiguous, and homonyms (different accessions sharing one name), synonyms (one accession held under several names), and genetic redundancy are widespread. GBS technology, like a high-precision genetic ruler, can systematically inventory and identify these ex-situ germplasm collections.
Traditionally, the de-duplication of germplasm resources depends on morphological markers, which take a long time, are easily affected by the environment, and have low resolution. GBS can accurately calculate the genetic similarity between individuals by obtaining tens of thousands of genome-wide SNP markers at one time.
Using principal component analysis (PCA) or a genetic distance matrix, all germplasm can be visualized in two or three dimensions. Accessions that cluster closely in genetic space, even when they carry different collection site names or storage numbers, are likely to be genetic duplicates. This enables resource managers to identify and merge redundant germplasm, saving storage capacity, manpower, and management costs and focusing resources on preserving truly unique genetic variation.
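The duplicate check described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for missing calls, and the accession IDs, function names, and the 0.01 threshold are hypothetical choices made for the example.

```python
from itertools import combinations

def ibs_distance(g1, g2):
    """Mean allele-sharing distance between two accessions.

    Genotypes are alt-allele counts (0, 1, 2); None marks a missing call.
    Returns a value in [0, 1], or None if no site is called in both.
    """
    diffs = [abs(a - b) / 2 for a, b in zip(g1, g2)
             if a is not None and b is not None]
    return sum(diffs) / len(diffs) if diffs else None

def find_putative_duplicates(genotypes, threshold=0.01):
    """Return (id1, id2, distance) for pairs at or below the threshold."""
    pairs = []
    for (id1, g1), (id2, g2) in combinations(genotypes.items(), 2):
        d = ibs_distance(g1, g2)
        if d is not None and d <= threshold:
            pairs.append((id1, id2, d))
    return pairs

# Toy data: hypothetical accession IDs and genotypes.
genotypes = {
    "ACC_001": [0, 2, 1, 0, 2, None, 0, 2],
    "ACC_002": [0, 2, 1, 0, 2, 1, 0, 2],   # same genotype, different label
    "ACC_003": [2, 0, 1, 2, 0, 1, 2, 0],
}
duplicates = find_putative_duplicates(genotypes)
print(duplicates)
```

In real data the threshold has to allow for genotyping error and missing calls, so it is usually calibrated against known replicate samples rather than fixed in advance.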
The reduced-representation sequencing strategy of GBS tends to cover gene-coding regions, so it can be used not only to confirm the identity of germplasm but also to evaluate its functional value. By analyzing the allele frequency at each SNP locus, researchers can identify rare alleles that occur at very low frequency in the population but may have important functions.
For example, a local variety may carry a rare resistance allele against an emerging disease, which has been lost in mainstream breeding materials. Through GBS scanning, this previously neglected ugly duckling germplasm can be quickly located, its value can be re-evaluated, and it can be introduced into the breeding plan as a key parent to broaden the narrow genetic basis of modern varieties.
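A rare-allele scan of this kind could look roughly like the sketch below. It assumes diploid genotypes coded as alternate-allele counts (0, 1, 2) with `None` for missing data; the SNP names, accession IDs, and the minor allele frequency cutoff are hypothetical.

```python
def alt_allele_frequency(calls):
    """Alt-allele frequency at one SNP from diploid alt-allele counts."""
    called = [c for c in calls if c is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))

def rare_allele_carriers(snp_calls, accession_ids, max_maf=0.05):
    """Map each SNP with a rare minor allele to the accessions carrying it."""
    result = {}
    for snp_id, calls in snp_calls.items():
        freq = alt_allele_frequency(calls)
        if freq is None:
            continue
        maf = min(freq, 1 - freq)
        if 0 < maf <= max_maf:
            minor_is_alt = freq <= 0.5
            # A carrier has at least one copy of the minor allele.
            result[snp_id] = [
                acc for acc, c in zip(accession_ids, calls)
                if c is not None and (c > 0 if minor_is_alt else c < 2)
            ]
    return result

# Toy data: ten hypothetical accessions, three hypothetical SNPs.
accession_ids = ["ACC_%03d" % i for i in range(1, 11)]
snp_calls = {
    "SNP_A": [0] * 9 + [1],                    # rare alternate allele
    "SNP_B": [0, 1, 2, 1, 0, 1, 2, 1, 0, 1],   # common variant
    "SNP_C": [2] * 9 + [1],                    # rare reference allele
}
carriers = rare_allele_carriers(snp_calls, accession_ids, max_maf=0.1)
print(carriers)
```

Very low-frequency calls are also where genotyping errors concentrate, so candidates found this way would normally be confirmed before an accession is promoted to a breeding parent.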
With tens of thousands of accessions, comprehensive phenotyping is almost impossible. GBS data is an ideal tool for constructing a core collection. A core collection is the smallest subset selected from the whole germplasm pool by genetic methods that still represents the genetic diversity of the original population to the greatest possible extent.
Based on the genetic distance matrix obtained from GBS, hundreds of the most representative core germplasm can be scientifically screened by cluster sampling or a simulated annealing algorithm. Breeders can give priority to in-depth phenotypic identification of this part of the core germplasm, so as to capture the genetic breadth of the whole resource pool at the lowest cost, which greatly improves the efficiency of research and utilization of germplasm resources.
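As a rough illustration of selecting a core set from a distance matrix, the sketch below uses a greedy maximin (farthest-point) rule. This is a simpler stand-in chosen for brevity, not the cluster sampling or simulated annealing mentioned above; the function name, accession labels, and distances are hypothetical.

```python
def select_core(dist, ids, core_size):
    """Greedy maximin selection of a core subset.

    dist[i][j] is the genetic distance between ids[i] and ids[j].
    Each step adds the accession farthest from its nearest core member.
    """
    n = len(ids)
    if not 0 < core_size <= n:
        raise ValueError("core_size must be between 1 and len(ids)")
    # Seed with the accession that is, in total, farthest from all others.
    first = max(range(n), key=lambda i: sum(dist[i]))
    chosen = [first]
    # min_to_core[i]: distance from accession i to its nearest core member.
    min_to_core = list(dist[first])
    while len(chosen) < core_size:
        nxt = max((i for i in range(n) if i not in chosen),
                  key=lambda i: min_to_core[i])
        chosen.append(nxt)
        min_to_core = [min(m, d) for m, d in zip(min_to_core, dist[nxt])]
    return [ids[i] for i in chosen]

# Toy matrix: A/B and C/D are near-duplicates, E is an outlier.
ids = ["A", "B", "C", "D", "E"]
dist = [
    [0.00, 0.02, 0.50, 0.48, 0.82],
    [0.02, 0.00, 0.50, 0.52, 0.80],
    [0.50, 0.50, 0.00, 0.03, 0.70],
    [0.48, 0.52, 0.03, 0.00, 0.70],
    [0.82, 0.80, 0.70, 0.70, 0.00],
]
core = select_core(dist, ids, core_size=3)
print(core)
```

On this toy input the rule picks one accession from each tight pair plus the outlier. A greedy rule is fast but can get stuck on a suboptimal set, which is one reason stochastic searches such as simulated annealing are used on large collections.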
Population structure analysis of 374 NPGS Ethiopian sorghum accessions (Cuevas et al., 2017)
The high-density SNP data generated by GBS provides an unprecedented fine scale for quantifying the core parameters of population genetics, which makes it possible to accurately measure the genetic variation within and between populations.
Polymorphism information content (PIC) measures how polymorphic a molecular marker is, and therefore how well the marker can distinguish individuals in genetic analysis. For a SNP marker, the PIC value depends on the number and frequency of its alleles. The many SNP markers produced by GBS span a wide range of PIC values. In subsequent analyses (such as constructing a genetic map or association analysis), researchers usually retain markers with high PIC values (such as > 0.3), because these markers are more informative and provide stronger resolution, improving the statistical power and accuracy of the analysis. Note that a biallelic SNP cannot exceed a PIC of 0.375, which is reached when both alleles have a frequency of 0.5, so a cutoff of 0.4 is only meaningful for multi-allelic markers.
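For reference, PIC can be computed directly from allele frequencies. The sketch below uses the standard definition, PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2; the function name and the example frequencies are illustrative only.

```python
from itertools import combinations

def pic(allele_freqs):
    """Polymorphism information content from a marker's allele frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in allele_freqs)
    # Correction for matings in which the parental origin is ambiguous.
    correction = sum(2 * p * p * q * q
                     for p, q in combinations(allele_freqs, 2))
    return 1 - homozygosity - correction

balanced_snp = pic([0.5, 0.5])   # two equally common alleles
skewed_snp = pic([0.9, 0.1])     # one allele dominates
print(balanced_snp, skewed_snp)
```

The skewed marker scores far lower than the balanced one, which is why a PIC filter effectively removes SNPs with low minor allele frequency.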
By comparing observed heterozygosity (Ho) with expected heterozygosity (He), we can infer the reproductive mode and historical events of the population. For example:
If Ho is significantly and consistently lower than He (a heterozygote deficit), it may indicate inbreeding or that the population has passed through a bottleneck.
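The comparison between Ho and He can be made concrete with a small calculation. The sketch below assumes diploid genotypes coded as alternate-allele counts (0, 1, 2) with `None` for missing calls, computes He as 2p(1 - p) without a small-sample correction, and summarizes the deficit as F = 1 - Ho/He; function names and data are hypothetical.

```python
def locus_heterozygosity(calls):
    """Return (Ho, He) for one biallelic SNP, or None if nothing is called."""
    called = [c for c in calls if c is not None]
    n = len(called)
    if n == 0:
        return None
    ho = sum(1 for c in called if c == 1) / n   # fraction of heterozygotes
    p = sum(called) / (2 * n)                   # alt-allele frequency
    he = 2 * p * (1 - p)                        # Hardy-Weinberg expectation
    return ho, he

def inbreeding_coefficient(loci):
    """F = 1 - mean(Ho) / mean(He) across polymorphic loci."""
    pairs = [locus_heterozygosity(calls) for calls in loci]
    pairs = [hh for hh in pairs if hh is not None and hh[1] > 0]
    if not pairs:
        return None
    mean_ho = sum(ho for ho, _ in pairs) / len(pairs)
    mean_he = sum(he for _, he in pairs) / len(pairs)
    return 1 - mean_ho / mean_he

# Toy data: a mostly homozygous (selfing-like) sample at two SNPs.
selfing_like = [
    [0, 0, 2, 2, 0, 2, 1, 0],
    [2, 2, 0, 0, 2, 0, 0, 2],
]
f_selfing = inbreeding_coefficient(selfing_like)
f_random = inbreeding_coefficient([[0, 1, 1, 2]])   # Hardy-Weinberg proportions
print(f_selfing, f_random)
```

An F close to 1 is expected for self-pollinating crops and inbred lines, so the value is best interpreted against the known mating system of the species.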
By calculating the average heterozygosity of different subgroups or geographical sources, we can quickly evaluate and compare their levels of genetic diversity, which provides a scientific basis for giving protection priority to the groups with the highest diversity.
FST is a classical index to measure the degree of genetic differentiation among populations, and its value is between 0 and 1.
Using GBS data, the FST value between each pair of populations can be calculated, thus constructing a population differentiation matrix. For example, when analyzing the local varieties from different provinces, the high FST value indicates that the varieties in these areas have significant genetic isolation and may adapt to different local environments. Furthermore, by scanning the FST values in the whole genome, we can identify those genomic regions with abnormal differentiation among populations. These regions are likely to be affected by local adaptation or artificial selection, and they are hot spots for locating candidate genes of important adaptive traits.
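To show what goes into an FST value, the sketch below computes a simple heterozygosity-based estimate, FST = (HT - HS) / HT, from per-population allele frequencies, using a ratio of sums across SNPs. It weights populations equally and applies no sample-size correction; corrected estimators such as Weir and Cockerham's are generally preferred in practice. The function name and frequencies are hypothetical.

```python
def fst_from_frequencies(freqs_by_pop):
    """Heterozygosity-based FST across SNPs.

    freqs_by_pop: one list per population, holding the alt-allele
    frequency at each SNP (same SNP order in every population).
    """
    n_pops = len(freqs_by_pop)
    n_snps = len(freqs_by_pop[0])
    hs_sum = 0.0   # mean within-population expected heterozygosity
    ht_sum = 0.0   # expected heterozygosity of the pooled population
    for s in range(n_snps):
        ps = [pop[s] for pop in freqs_by_pop]
        p_bar = sum(ps) / n_pops
        hs_sum += sum(2 * p * (1 - p) for p in ps) / n_pops
        ht_sum += 2 * p_bar * (1 - p_bar)
    return (ht_sum - hs_sum) / ht_sum if ht_sum > 0 else None

# Toy frequencies for two hypothetical provinces.
fst_diverged = fst_from_frequencies([[0.8], [0.2]])
fst_identical = fst_from_frequencies([[0.5, 0.3], [0.5, 0.3]])
fst_fixed = fst_from_frequencies([[1.0], [0.0]])
print(fst_diverged, fst_identical, fst_fixed)
```

Calling the same function on SNPs grouped into sliding windows along each chromosome gives the genome-wide FST scan described above, with outlier windows as candidate regions.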
Muscle amino acid composition and content of three groups of M. ensis (Li et al., 2024)
Understanding the population structure and phylogenetic relationships of germplasm resources is the basis for using them effectively. GBS provides genome-wide data for this purpose, which makes the reconstruction of population history clearer and more reliable.
Model-based clustering assumes that the population descends from K ancestral groups and then infers the proportion of each individual's genome derived from each group (that is, its ancestry components).
Application process: the analysis usually starts at K=2 and increases K step by step, with several replicate runs at each value to obtain robust results. The optimal number of ancestral groups (optimal K) is then chosen from the likelihood values or with the ΔK method.
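The ΔK criterion (Evanno et al., 2005) can be sketched as follows. The version below works on the mean log-likelihood per K, ΔK = |L(K+1) - 2L(K) + L(K-1)| / sd(L(K)), which is a common way to compute it; the log-likelihood values are invented for illustration, and the function name is hypothetical.

```python
from statistics import mean, stdev

def delta_k(loglik_by_k):
    """Delta-K for each K that has both neighbours K-1 and K+1.

    loglik_by_k maps K to the log-likelihoods of its replicate runs.
    """
    means = {k: mean(v) for k, v in loglik_by_k.items()}
    result = {}
    for k in sorted(loglik_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue   # second-order difference is undefined at the edges
        sd = stdev(loglik_by_k[k])
        if sd == 0:
            continue   # identical replicates: ratio is undefined
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result

# Invented replicate log-likelihoods for K = 1..4.
runs = {
    1: [-5000.0, -5010.0, -4990.0],
    2: [-4200.0, -4210.0, -4190.0],
    3: [-4150.0, -4160.0, -4140.0],
    4: [-4140.0, -4150.0, -4130.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Because ΔK needs both neighbours, it cannot be evaluated at the smallest or largest K tested, so the tested range should extend beyond the number of groups that is expected.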
Interpretation and application:
Based on the genetic distances calculated from GBS SNP data, a neighbor-joining tree or other types of phylogenetic trees can be constructed.
PCA is a model-free dimensionality reduction analysis method, which can project high-dimensional SNP data onto several principal components that can explain the maximum variance. The scatter plot composed of PC1 and PC2 is a classic way to show the genetic structure of a population.
Distribution of LTR-RT lineages in S. italica chromosomes (Suguiyama et al., 2019)
The ultimate value of genetic diversity analysis is that it can be transformed into practical breeding actions. Genetic information provided by GBS is transforming parental selection from an art to an accurate science.
Traditional parental selection relies heavily on phenotypic data and the experience of breeders. GBS provides a new decision-making dimension based on genome information:
Dynamic management of core collection: the core collection based on GBS is not static. When new germplasm resources are introduced, GBS analysis can be conducted again to evaluate the uniqueness of the new germplasm relative to the existing core germplasm, so as to dynamically update the core collection and ensure that it always represents the most comprehensive diversity.
Examples of genome-assisted prediction performed using sommer (Covarrubias et al., 2016)
GBS technology has transformed the germplasm resource bank from a passive seed warehouse into an active and data-driven genetic information center. Through systematic GBS genotyping of germplasm resources, we can:
With the continuing decline in sequencing costs and the increasing automation of data analysis platforms, giving every accession a GBS-based genetic ID card will become routine. In the future, integrating GBS genotype data with high-throughput phenomics, metabolomics, and gene editing technology will form a complete intelligent decision-making system for breeding, further accelerate crop genetic improvement, and lay a solid scientific and technological foundation for global food security and the sustainable development of agriculture.
1. How does GBS help identify redundant germplasm in resource banks?
GBS generates tens of thousands of genome-wide SNPs to calculate genetic similarity between individuals; PCA or genetic distance matrices visualize germplasm, and closely clustered ones (even with different labels) are identified as redundant.
2. What's the role of GBS in calculating population genetics parameters like FST?
GBS provides high-density SNPs to compute FST (0–1 scale): FST≈0 means no population differentiation, and FST>0.25 indicates high differentiation. The same data also quantify PIC and observed/expected heterozygosity for diversity assessment.
3. How does GBS support phylogenetic and population structure analysis?
It enables model-based cluster analysis (inferring ancestry components and finding the optimal K), construction of phylogenetic trees (showing genetic relationships), and PCA (dimensionality reduction to visualize population structure).
4. Can GBS guide parental selection in breeding?
Yes. It analyzes genetic distance to select parents with complementary alleles (e.g., disease resistance + high yield) and to define heterotic groups; crossing parents from different groups raises the chance of strong heterosis.
5. How does GBS assist core germplasm management?
Based on GBS-derived genetic distance matrices, core germplasm (representing maximum original diversity) is screened. New germplasm is GBS-analyzed to update the core collection dynamically.
References