GBS for Genetic Diversity Analysis and Germplasm Characterization


Germplasm banks around the world preserve the vast genetic variation needed for crop improvement. Analyzing the genetic background of these large collections efficiently and accurately, and turning that knowledge into practical breeding advantages, has long been a major challenge for plant breeders. Genotyping-by-sequencing (GBS) offers a practical solution to this challenge. This article discusses how GBS can quantify genetic diversity parameters, reveal population structure and phylogenetic relationships, and translate this knowledge into strategic breeding decisions that guide parental selection and germplasm management, so as to maximize the value of genetic resources.

Unlocking Genetic Variation in Germplasm Banks and Core Collections

More than 1,750 germplasm banks around the world preserve millions of plant genetic resources, which are strategic reserves for coping with future climate change and food security challenges. However, the genetic background of many accessions is unclear, and homonyms (different materials sharing one name), synonyms (the same material held under different names), and genetic redundancy are common. GBS, acting like a high-precision genetic ruler, makes it possible to systematically inventory and identify these ex situ germplasm collections.

Identify Redundant Germplasm and Optimize Management

Traditionally, de-duplication of germplasm resources has depended on morphological markers, which are slow to score, easily affected by the environment, and low in resolution. GBS yields tens of thousands of genome-wide SNP markers in a single run, from which the genetic similarity between individuals can be calculated accurately.

Using principal component analysis (PCA) or a genetic distance matrix, all accessions can be visualized in two or three dimensions. Accessions that cluster tightly in genetic space, even if they carry different collection site names or storage numbers, are likely to be genetic duplicates. This enables resource managers to identify and merge redundant germplasm, saving storage capacity, labor, and management costs and focusing resources on preserving truly unique genetic variation.
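The similarity screen described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: it assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele with `None` for missing calls, and the accession names, the toy data, and the `threshold` value are all hypothetical. In practice the threshold would need to be calibrated against the genotyping error rate, for example by using replicated samples.

```python
from itertools import combinations


def ibs_distance(g1, g2):
    """Mean allele-sharing distance between two accessions.

    0.0 means identical calls at every shared locus; 1.0 means opposite
    homozygotes at every shared locus. Missing calls (None) are skipped.
    """
    diffs = [abs(a - b) / 2 for a, b in zip(g1, g2)
             if a is not None and b is not None]
    if not diffs:
        return float("nan")  # no locus was called in both accessions
    return sum(diffs) / len(diffs)


def find_putative_duplicates(genotypes, threshold=0.01):
    """Return (name1, name2, distance) for pairs at or below the threshold."""
    pairs = []
    for (n1, g1), (n2, g2) in combinations(genotypes.items(), 2):
        d = ibs_distance(g1, g2)
        if d <= threshold:  # NaN never passes this comparison
            pairs.append((n1, n2, d))
    return pairs


# Toy data: three accessions scored at eight SNPs (hypothetical values).
genotypes = {
    "PI_0001": [0, 2, 1, 0, 2, 2, 0, 1],
    "PI_0002": [0, 2, 1, 0, 2, 2, 0, None],
    "PI_0003": [2, 0, 1, 2, 0, 0, 2, 1],
}
duplicates = find_putative_duplicates(genotypes)
print(duplicates)
```

Pairs flagged this way are candidates only; passport data and, where possible, a grow-out comparison would normally be checked before accessions are merged.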

Exploring Rare and Endemic Alleles

The reduced-representation sequencing strategy of GBS tends to cover gene-rich regions, particularly when methylation-sensitive restriction enzymes are used, so it can not only confirm the identity of an accession but also help assess its potential functional value. By analyzing the allele frequency at each SNP locus, researchers can identify rare alleles that occur at very low frequency in the population but may have important functions.

For example, a local variety may carry a rare resistance allele against an emerging disease, which has been lost in mainstream breeding materials. Through GBS scanning, this previously neglected ugly duckling germplasm can be quickly located, its value can be re-evaluated, and it can be introduced into the breeding plan as a key parent to broaden the narrow genetic basis of modern varieties.
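As a rough sketch of the allele-frequency scan, the following Python code computes the minor allele frequency (MAF) at each locus and lists the accessions carrying the minor allele when it is rare. The 0/1/2 genotype coding, the accession names, the toy data, and the `max_maf` cutoff are assumptions made for illustration only.

```python
def alt_allele_frequency(calls):
    """Frequency of the alternate allele among non-missing diploid calls."""
    called = [c for c in calls if c is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def rare_allele_carriers(genotypes, max_maf=0.05):
    """Map locus index -> (MAF, carriers) for loci with a rare minor allele."""
    names = list(genotypes)
    n_loci = len(genotypes[names[0]])
    report = {}
    for locus in range(n_loci):
        calls = [genotypes[name][locus] for name in names]
        freq = alt_allele_frequency(calls)
        if freq is None:
            continue
        maf = min(freq, 1 - freq)
        if 0 < maf <= max_maf:
            minor_is_alt = freq < 0.5
            carriers = [
                name for name, c in zip(names, calls)
                if c is not None and (c > 0 if minor_is_alt else c < 2)
            ]
            report[locus] = (maf, carriers)
    return report


# Toy data: ten accessions, two SNPs (hypothetical values).
# Locus 0 carries a single rare allele copy; locus 1 is common variation.
genotypes = {f"acc{i:02d}": [0, i % 3] for i in range(10)}
genotypes["acc07"][0] = 1

report = rare_allele_carriers(genotypes, max_maf=0.1)
print(report)
```

With low-coverage GBS data, a rare allele seen in a single accession can also be a sequencing error, so candidates found this way would usually be confirmed by deeper sequencing or a targeted assay.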

Constructing Core Germplasm and Improving Utilization Efficiency

With tens of thousands of accessions in a collection, comprehensive phenotyping of every one is almost impossible. GBS data is an ideal tool for constructing a core collection. A core collection is the smallest subset selected from the whole germplasm pool by genetic methods that still represents the genetic diversity of the original population as fully as possible.

Based on the genetic distance matrix obtained from GBS, hundreds of the most representative core germplasm can be scientifically screened by cluster sampling or a simulated annealing algorithm. Breeders can give priority to in-depth phenotypic identification of this part of the core germplasm, so as to capture the genetic breadth of the whole resource pool at the lowest cost, which greatly improves the efficiency of research and utilization of germplasm resources.
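To make the idea concrete, here is a small Python sketch of distance-based core selection. It does not implement cluster sampling or simulated annealing; as a simpler stand-in it uses a greedy max-min heuristic, repeatedly adding the accession that is farthest from everything already in the core. The genotype coding (0/1/2, `None` for missing), the accession names, and the data are hypothetical, and the distance function assumes every pair shares at least one called locus.

```python
def distance(g1, g2):
    """Mean allele-sharing distance over loci called in both accessions."""
    shared = [(a, b) for a, b in zip(g1, g2)
              if a is not None and b is not None]
    return sum(abs(a - b) for a, b in shared) / (2 * len(shared))


def select_core(genotypes, core_size):
    """Greedy max-min selection of a core subset from a distance matrix."""
    names = list(genotypes)
    dist = {(a, b): distance(genotypes[a], genotypes[b])
            for a in names for b in names}
    # Seed with the accession that is, on average, farthest from the rest.
    first = max(names, key=lambda a: sum(dist[a, b] for b in names))
    core = [first]
    while len(core) < min(core_size, len(names)):
        remaining = [n for n in names if n not in core]
        # Add the accession whose nearest core member is farthest away.
        nxt = max(remaining, key=lambda n: min(dist[n, c] for c in core))
        core.append(nxt)
    return core


# Toy data: two tight pairs (A*, B*) and one intermediate accession.
genotypes = {
    "A1": [0, 0, 0, 0],
    "A2": [0, 0, 0, 1],
    "B1": [2, 2, 2, 2],
    "B2": [2, 2, 2, 1],
    "C1": [0, 2, 0, 2],
}
core = select_core(genotypes, core_size=3)
print(core)
```

The greedy rule picks one representative from each tight pair plus the intermediate accession. A stochastic search such as simulated annealing could then be used to refine the subset against an explicit diversity criterion.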

Population structure analysis of 374 NPGS Ethiopian sorghum accessions (Cuevas et al., 2017)

Calculating Population Genetics Parameters: FST, Heterozygosity, and PIC

The high-density SNP data generated by GBS provides an unprecedented fine scale for quantifying the core parameters of population genetics, which makes it possible to accurately measure the genetic variation within and between populations.

Polymorphism Information Content (PIC): Assessing Marker Informativeness

PIC measures how polymorphic a molecular marker is, reflecting the marker's ability to distinguish individuals in genetic analysis. For a SNP marker, the PIC value depends on the number and frequencies of its alleles; for a biallelic SNP it cannot exceed 0.375, which is reached when both alleles have a frequency of 0.5. The many SNP markers produced by GBS span a wide range of PIC values. In subsequent analyses (such as genetic map construction or association analysis), researchers usually retain markers with high PIC values (for example, > 0.3 for biallelic SNPs), because these markers carry more information and provide greater resolution, improving the statistical power and accuracy of the analysis.
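For reference, PIC is commonly computed from allele frequencies as one minus the sum of squared allele frequencies, minus a correction term summed over all pairs of alleles. The short Python function below is a sketch of that standard formula; the function name and the example frequencies are chosen here for illustration.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content from a list of allele frequencies.

    PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in allele_freqs)
    correction = sum(2 * (p * p) * (q * q)
                     for p, q in combinations(allele_freqs, 2))
    return 1 - homozygosity - correction


balanced = pic([0.5, 0.5])   # biallelic SNP with equal allele frequencies
skewed = pic([0.9, 0.1])     # biallelic SNP with a low-frequency allele
print(balanced, round(skewed, 4))
```

The balanced SNP reaches the biallelic maximum of 0.375, while the skewed SNP scores about 0.164, which is why PIC filters tend to remove low-frequency markers.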

Observed Heterozygosity and Expected Heterozygosity

  • Observed heterozygosity (Ho): directly calculates the actual proportion of heterozygotes in the population.
  • Expected heterozygosity (He): The expected proportion of heterozygotes calculated according to allele frequency under the Hardy-Weinberg equilibrium hypothesis.

By comparing Ho and He, we can infer the reproductive mode and historical events of the population. For example:

  • For self-pollinated crops, Ho is usually far lower than He, because continuous selfing drives loci toward homozygosity.
  • For cross-pollinated crops, Ho and He are usually close.

If Ho is significantly and consistently lower than He in a population that is expected to outcross (a heterozygote deficit), the population may have experienced inbreeding or a genetic bottleneck.

By averaging Ho and He within subgroups or geographic origins, we can quickly evaluate and compare their levels of genetic diversity and provide a scientific basis for prioritizing the conservation of highly diverse groups.
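The comparison of Ho and He can be illustrated with a short Python sketch. It assumes biallelic SNPs coded as 0/1/2 with `None` for missing calls, uses the simple expectation He = 2p(1 − p) without a small-sample correction, and summarizes the deficit as F = 1 − Ho/He. The two toy populations are invented for the example.

```python
def locus_heterozygosity(calls):
    """Return (Ho, He) for one biallelic locus, ignoring missing calls."""
    called = [c for c in calls if c is not None]
    n = len(called)
    ho = sum(1 for c in called if c == 1) / n   # observed heterozygotes
    p = sum(called) / (2 * n)                   # alternate allele frequency
    he = 2 * p * (1 - p)                        # Hardy-Weinberg expectation
    return ho, he


def population_summary(genotype_rows):
    """Mean Ho, mean He, and F = 1 - Ho/He across loci for one population."""
    per_locus = [locus_heterozygosity(calls) for calls in zip(*genotype_rows)]
    mean_ho = sum(ho for ho, _ in per_locus) / len(per_locus)
    mean_he = sum(he for _, he in per_locus) / len(per_locus)
    f = 1 - mean_ho / mean_he if mean_he > 0 else float("nan")
    return mean_ho, mean_he, f


# Toy populations: rows are individuals, columns are SNPs.
selfing = [[0, 2, 0], [2, 2, 0], [0, 0, 2], [2, 0, 2]]      # all homozygous
outcrossing = [[1, 0, 1], [1, 2, 1], [0, 1, 2], [2, 1, 0]]

selfing_stats = population_summary(selfing)
outcrossing_stats = population_summary(outcrossing)
print(selfing_stats, outcrossing_stats)
```

In the selfing example Ho is 0 while He is 0.5, giving F = 1; in the outcrossing example Ho equals He and F = 0. Note that undercalling of heterozygotes at low read depth, which is common in GBS, can also depress Ho, so depth filters matter when interpreting F.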

Genetic Differentiation Index (FST): Quantifying the Degree of Differentiation

FST is a classical index to measure the degree of genetic differentiation among populations, and its value is between 0 and 1.

  • FST≈0: It means that there is no genetic differentiation between the two populations, and the gene flow is frequent.
  • FST>0.25: It means that there is great differentiation among populations.

Using GBS data, the FST value between each pair of populations can be calculated to build a population differentiation matrix. For example, when landraces from different provinces are analyzed, a high FST value indicates that the varieties in those areas are genetically isolated and may have adapted to different local environments. Furthermore, by scanning FST values across the whole genome, we can identify genomic regions that are unusually strongly differentiated among populations. These regions are likely to have been shaped by local adaptation or artificial selection, making them hot spots for locating candidate genes for important adaptive traits.
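A minimal sketch of a pairwise FST calculation is shown below. It uses the simple heterozygosity-based definition FST = (HT − HS) / HT, combined across loci as a ratio of sums, and it assumes two populations weighted equally with no sample-size correction. Published analyses typically use a corrected estimator such as that of Weir and Cockerham; the genotype coding and the toy data here are assumptions for illustration.

```python
def allele_freq(calls):
    """Alternate allele frequency among non-missing 0/1/2 calls."""
    called = [c for c in calls if c is not None]
    return sum(called) / (2 * len(called))


def pairwise_fst(pop_a, pop_b):
    """FST = (HT - HS) / HT summed over loci; populations weighted equally."""
    numerator = 0.0
    denominator = 0.0
    for calls_a, calls_b in zip(zip(*pop_a), zip(*pop_b)):
        p1, p2 = allele_freq(calls_a), allele_freq(calls_b)
        p_bar = (p1 + p2) / 2
        hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # within populations
        ht = 2 * p_bar * (1 - p_bar)                      # pooled population
        numerator += ht - hs
        denominator += ht
    return numerator / denominator if denominator > 0 else float("nan")


# Toy populations: rows are individuals, columns are SNPs.
pop_a = [[0, 1], [1, 2], [0, 1], [1, 2]]   # allele frequencies 0.25, 0.75
pop_b = [[2, 1], [1, 0], [2, 1], [1, 0]]   # allele frequencies 0.75, 0.25

fst_ab = pairwise_fst(pop_a, pop_b)
fst_aa = pairwise_fst(pop_a, pop_a)
print(fst_ab, fst_aa)
```

Calling the same function on sliding windows of SNPs along a chromosome, rather than on all loci at once, gives a rough version of the genome-wide FST scan described above.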

Muscle amino acid composition and content of three groups of M. ensis (Li et al., 2024)

Phylogenetics and Population Structure: Tracing Evolutionary Relationships

Understanding the population structure and phylogenetic relationships of germplasm resources is the basis for using them effectively. GBS provides genome-wide data for this purpose, making the reconstructed history clearer and more reliable.

Model-Based Cluster Analysis: Revealing Genetic Ancestry and Admixture

Model-based clustering algorithms assume that the population derives from K ancestral groups, and then infer the proportion of each individual's genome that comes from each ancestral group (that is, its ancestry components).

Application process: The analysis usually starts from K=2 and increases K step by step, with multiple runs at each value to obtain robust results. The most likely number of ancestral groups (the optimal K) is then determined from the likelihood values or with the ΔK method.
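The ΔK criterion can be sketched as follows. This version takes the absolute second-order difference of the mean log-likelihood across successive K values and divides it by the standard deviation of the replicate runs at that K, which is how the statistic (often called the Evanno method) is commonly computed. The log-likelihood values below are invented purely to show the arithmetic, and ΔK is undefined for the smallest and largest K tested, so evaluating K=2 requires runs at K=1 as well.

```python
from statistics import mean, stdev


def delta_k(log_likelihoods):
    """Compute delta-K for each K that has results at K-1 and K+1.

    log_likelihoods maps K -> list of log-likelihoods from replicate runs.
    """
    result = {}
    for k in sorted(log_likelihoods):
        if k - 1 not in log_likelihoods or k + 1 not in log_likelihoods:
            continue  # the second-order difference needs both neighbours
        second_diff = abs(mean(log_likelihoods[k + 1])
                          - 2 * mean(log_likelihoods[k])
                          + mean(log_likelihoods[k - 1]))
        sd = stdev(log_likelihoods[k])
        result[k] = second_diff / sd if sd > 0 else float("inf")
    return result


# Hypothetical replicate log-likelihoods for K = 1..4 (three runs each).
runs = {
    1: [-5000, -5002, -4998],
    2: [-4200, -4198, -4202],
    3: [-4150, -4160, -4140],
    4: [-4140, -4170, -4110],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

In this made-up example the likelihood improves sharply from K=1 to K=2 and then flattens, so ΔK peaks at K=2. ΔK is a heuristic and is usually interpreted alongside the biology of the material rather than on its own.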

Interpretation and application:

  • Analyzing domestication history: For example, a joint analysis of wild ancestors, landraces, and modern bred varieties can show how ancestry components change from the wild state through domestication to modern breeding, revealing genetic bottlenecks and introgression events.
  • Identifying subgroups: A mixed germplasm collection can be divided into subgroups with different genetic backgrounds, which are often related to geographical origin, ecotype, or agronomic traits.
  • Guiding association analysis: Population structure is an important confounding factor in genome-wide association studies (GWAS). The population structure matrix (Q matrix) obtained from GBS analysis can be included in the GWAS model as a covariate to reduce false-positive associations.

Construction of a Phylogenetic Tree: Depicting Genetic Relationship

Based on the genetic distances calculated from GBS SNP data, a neighbor-joining tree or other types of phylogenetic trees can be constructed.

  • Visualizing relationships: A phylogenetic tree shows the genetic similarity and divergence history between individuals or groups as a tree diagram. More closely related individuals cluster on the same branch of the tree.
  • Verifying taxonomy: A tree can test whether a morphology-based classification of species or varieties is consistent with molecular evidence, helping to resolve difficult taxonomic questions.
  • Tracking gene flow: Unexpected branching patterns (such as individuals that do not cluster with their geographical neighbors) may suggest that seeds or pollen moved between regions in the past.
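To show what tree construction from a distance matrix involves, here is a compact Python sketch of the neighbor-joining procedure. It records the joins and branch lengths rather than drawing a tree, and ties between equally good pairs are broken by iteration order. The five-taxon distance matrix is a small textbook-style example, not GBS data; in practice the distances would come from the SNP genotypes.

```python
def neighbor_joining(names, dist):
    """Neighbor-joining on a symmetric distance table.

    dist[a][b] is the distance between taxa a and b. Returns a list of joins
    (child1, child2, length1, length2, new_node) and the final edge
    (node1, node2, length) that connects the last two nodes.
    """
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"node{len(joins)}"
        joins.append((a, b, len_a, len_b, new))
        # Distance from the new internal node to every remaining node.
        d[new] = {}
        for c in nodes:
            if c in (a, b):
                continue
            d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    x, y = nodes
    return joins, (x, y, d[x][y])


taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {s: {t: matrix[i][j] for j, t in enumerate(taxa)}
        for i, s in enumerate(taxa)}

joins, final_edge = neighbor_joining(taxa, dist)
for step in joins:
    print(step)
print(final_edge)
```

For real datasets with hundreds of accessions, an established phylogenetics package with bootstrap support would normally be used instead of a hand-written routine.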

PCA: A Fast and Intuitive Genetic Projection

PCA is a model-free dimensionality reduction analysis method, which can project high-dimensional SNP data onto several principal components that can explain the maximum variance. The scatter plot composed of PC1 and PC2 is a classic way to show the genetic structure of a population.

  • Quick preview: PCA can quickly provide an overview of data quality and a preliminary impression of group structure.
  • Complementing cluster analysis: PCA results are usually cross-checked against the results of model-based clustering (for example, STRUCTURE) and the phylogenetic tree; together they give a fuller picture of population genetic history.
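The projection onto PC1 can be sketched without any numerical library by applying power iteration to the sample-by-sample Gram matrix of the centered genotypes. This is a teaching sketch under simplifying assumptions: no missing data, SNPs centered but not scaled, and only the first component extracted. Real analyses would use a dedicated PCA implementation; the toy genotypes below are invented.

```python
import math


def pc1_scores(genotype_rows, iterations=200):
    """Return (PC1 scores, fraction of variance explained by PC1)."""
    n = len(genotype_rows)
    m = len(genotype_rows[0])
    # Center each SNP column on its mean.
    means = [sum(row[j] for row in genotype_rows) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotype_rows]
    # n x n Gram matrix of the centered data (X times X transposed).
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    total = sum(gram[i][i] for i in range(n))
    if total == 0:
        raise ValueError("genotype matrix has no variation")
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(val * val for val in w))
        v = [val / norm for val in w]
    eigenvalue = sum(v[i] * sum(gram[i][k] * v[k] for k in range(n))
                     for i in range(n))
    # Sample scores on PC1 are the eigenvector scaled by sqrt(eigenvalue).
    scores = [math.sqrt(eigenvalue) * val for val in v]
    return scores, eigenvalue / total


# Toy data: two genetically distinct groups of three accessions each.
rows = [
    [0, 0, 0, 0, 1], [0, 0, 0, 1, 0], [0, 0, 0, 0, 0],
    [2, 2, 2, 2, 1], [2, 2, 2, 1, 2], [2, 2, 2, 2, 2],
]
scores, explained = pc1_scores(rows)
print([round(s, 2) for s in scores], round(explained, 2))
```

The two groups receive PC1 scores of opposite sign, which is the separation a PC1 versus PC2 scatter plot would display. The sign of a principal component is arbitrary, so only the relative positions are meaningful.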

Distribution of LTR-RT lineages in S. italica chromosomes (Suguiyama et al., 2019)

Informing Breeding Decisions through Parental Selection and Germplasm Curation

The ultimate value of genetic diversity analysis is that it can be transformed into practical breeding actions. Genetic information provided by GBS is transforming parental selection from an art to an accurate science.

Strategic Parental Selection: Maximizing Genetic Gain and Heterosis

Traditional parental selection relies heavily on phenotypic data and the experience of breeders. GBS provides a new decision-making dimension based on genome information:

  • Complementary selection: By analyzing the genetic distance between potential parents, breeders can deliberately choose parents that differ widely and carry complementary alleles. For example, a line with excellent disease resistance but weak yield components can be paired with a line that has a strong genetic basis for yield, compensating for its weaknesses. This molecular-data-based strategy of combining complementary strengths aims to create a segregating population with a broader genetic base and to raise the chance of selecting offspring that combine the best traits of both parents.
  • Defining heterotic groups and predicting heterosis: GBS is a powerful tool for assigning lines to heterotic groups. Through population structure analysis, a large number of inbred lines can be divided into distinct genetic groups. Many studies have shown that crosses between inbred lines from different heterotic groups usually produce stronger heterosis. Therefore, when planning hybrid combinations, breeders can give priority to parents that GBS has placed in different genetic groups, improving the probability of obtaining hybrids with strong heterosis.
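As an illustration of how these two ideas might be combined, the sketch below ranks candidate crosses by genetic distance while considering only pairs drawn from different genetic groups. The line names, group labels, genotypes, and the `top_n` parameter are hypothetical, and genetic distance is only a proxy here: actual heterosis would still need to be confirmed with testcross data.

```python
from itertools import combinations


def rank_crosses(genotypes, groups, top_n=3):
    """Rank between-group parent pairs by mean allele-sharing distance."""
    candidates = []
    for a, b in combinations(genotypes, 2):
        if groups[a] == groups[b]:
            continue  # skip crosses within the same genetic group
        shared = [(x, y) for x, y in zip(genotypes[a], genotypes[b])
                  if x is not None and y is not None]
        d = sum(abs(x - y) for x, y in shared) / (2 * len(shared))
        candidates.append((d, a, b))
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(a, b, d) for d, a, b in candidates[:top_n]]


# Toy inbred lines (homozygous calls) assigned to two genetic groups.
genotypes = {
    "L1": [0, 0, 0, 0],
    "L2": [0, 0, 0, 2],
    "L3": [2, 2, 2, 2],
    "L4": [2, 2, 0, 2],
}
groups = {"L1": "G1", "L2": "G1", "L3": "G2", "L4": "G2"}

best_crosses = rank_crosses(genotypes, groups, top_n=1)
print(best_crosses)
```

A breeder would typically use such a ranking as a shortlist and then weigh it against phenotypic data, trait complementarity, and practical constraints such as flowering time.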

Precise Management and Innovation of Germplasm Resources

  • Dynamic management of the core collection: A GBS-based core collection is not static. When new germplasm is introduced, GBS analysis can be repeated to evaluate how unique the new material is relative to the existing core, so the core collection can be updated dynamically and continues to represent the broadest possible diversity.
  • Construction of synthetic populations: To create new genetic variation, breeders sometimes build multi-parent advanced generation inter-cross (MAGIC) or synthetic populations. At the outset, GBS can be used to select founder parents that are both representative and genetically distant from one another, so that the new population starts with maximum genetic diversity.
  • Protection of endangered genetic resources: GBS can quickly assess the degree of inbreeding and genetic erosion in endangered resources such as landraces or wild relatives, providing a scientific basis for ranking conservation urgency and formulating priority protection strategies.

Examples of genome-assisted prediction performed using sommer (Covarrubias et al., 2016)

Conclusion

GBS technology has transformed the germplasm resource bank from a passive seed warehouse into an active and data-driven genetic information center. Through systematic GBS genotyping of germplasm resources, we can:

  • Digitize the inventory, remove redundancy, and improve management efficiency.
  • Accurately quantify genetic diversity parameters and understand the history and current status of populations.
  • Reveal the relationship between population structure and evolution, providing historical context for the use of resources.
  • Base strategic parental selection and germplasm management on genome information, efficiently converting valuable genetic variation into breeding progress.

As sequencing costs continue to fall and data analysis platforms become increasingly automated, GBS-based genetic fingerprinting of all germplasm, in effect a genetic ID card for every accession, is likely to become routine. In the future, integrating GBS genotype data with high-throughput phenomics, metabolomics, and gene editing technology will form a complete intelligent decision-making system for breeding, further accelerate crop genetic improvement, and lay a solid scientific and technological foundation for global food security and sustainable agriculture.

FAQ

1. How does GBS help identify redundant germplasm in resource banks?

GBS generates tens of thousands of genome-wide SNPs to calculate genetic similarity between individuals; PCA or genetic distance matrices visualize germplasm, and closely clustered ones (even with different labels) are identified as redundant.

2. What's the role of GBS in calculating population genetics parameters like FST?

GBS provides high-density SNPs to compute FST (0–1 scale): FST≈0 means no population differentiation, FST>0.25 indicates high differentiation. It also quantifies PIC, observed/expected heterozygosity for diversity assessment.

3. How does GBS support phylogenetic and population structure analysis?

It enables model-based cluster analysis (inferring ancestry components and the optimal K), construction of phylogenetic trees (showing genetic relationships), and PCA (dimensionality reduction to visualize population structure).

4. Can GBS guide parental selection in breeding?

Yes. It analyzes genetic distance to select parents with complementary alleles (e.g., disease resistance + high yield) and assigns lines to heterotic groups; crossing parents from different groups raises the chance of strong heterosis.

5. How does GBS assist core germplasm management?

Based on GBS-derived genetic distance matrices, core germplasm (representing maximum original diversity) is screened. New germplasm is GBS-analyzed to update the core collection dynamically.

References

  1. Cuevas HE, Rosa-Valentin G, Hayes CM, Rooney WL, Hoffmann L. "Genomic characterization of a core set of the USDA-NPGS Ethiopian sorghum germplasm collection: implications for germplasm conservation, evaluation, and utilization in crop improvement." BMC Genomics. 2017 18(1): 108.
  2. Li Y, Chen J, Jiang S, et al. "A Comprehensive Assessment of Nutritional Value, Antioxidant Potential, and Genetic Diversity in Metapenaeus ensis from Three Different Populations." Biology (Basel). 2024 13(10): 838.
  3. Suguiyama VF, Vasconcelos LAB, Rossi MM, Biondo C, de Setta N. "The population genetic structure approach adds new insights into the evolution of plant LTR retrotransposon lineages." PLoS One. 2019 14(5): e0214542.
  4. Covarrubias-Pazaran G. "Genome-Assisted Prediction of Quantitative Traits Using the R Package sommer." PLoS One. 2016 11(6): e0156744.
For research purposes only, not intended for clinical diagnosis, treatment, or individual health assessments.
Copyright © CD Genomics. All Rights Reserved.