This article provides a brief introduction to good practices for the bioinformatics analysis of 16S rRNA sequencing by NGS (next-generation sequencing). The bioinformatics pipeline involves two main stages: the preprocessing of data (quality control) and quantification (including taxonomic profiling and predictive metagenomics profiling).
Figure 1. Bioinformatics pipeline for NGS-based 16S rRNA amplicon sequencing (Mataragas et al. 2018).
Table 1. Software and statistical tests used in each stage of the pipeline (Mataragas et al. 2018).
|Pipeline step||Statistical test and software used||Alternative software|
|Processing||Qiime v.1.9.0||SILVAngs pipeline
|Taxonomic profiling (OTUs)||SILVAngs pipeline using the SILVA database||BMPOS pipeline using the Greengenes databases
One Codex pipeline
|Statistical comparison of the metagenomic samples||ANOSIM using the Past software||Stamp
|Microbial community overview||Community-Analyzer
Stacked bars chart using GraphPad Prism software
|Statistical significance of the identified OTUs||METAGENassist||MicrobiomeAnalyst
|Symbiotic and antagonistic relationships within the microbial community||Heatmap using the METAGENassist software||MicrobiomeAnalyst
|Predictive Metagenomics profiling (PMP)||Tax4Fun||Picrust
|Statistical analysis of the PMP results||Kruskal-Wallis H-test with Tuckey-Kramer corrected for multiple tests according to Benjamini-Hockberg
False Discovery Rate using the Stamp software
|Orientation of the metagenomic samples of the most abundant KEGG pathways||Principal Component Analysis (PCA) using the Past software||MicrobiomeAnalyst
|Metabolic interactions within the microbial community||MMinte||-|
Removal of adapters, PCR primers, and low quality bases is a necessary step for quality control of sequences. And there are a variety of integrated tools have been developed for this purpose. ‘Q’ is the output quality score for Illumina platforms (Q10 represents 1 error is expected for every 10 bases; Q20 represents 1 error is expected for every 100 bases...). Elimination of sequences with low quality scores can improve the accuracy of bioinformatics analyses. Compared with shotgun sequencing, this is more significant for 16S rRNA amplicon sequencing. For 16S rRNA gene sequencing, it is supposed to set a quality threshold as high as possible and to trim sequences along the entire length.
Prior to taxonomic classification, bacterial 16S rRNA genes are clustered by two main approaches. One is to cluster these sequences into phylotypes based on their similarity to reference database, the other is to cluster sequences into operational taxonomic units (OTUs) using a 97% similarity threshold, only according to their similarity. The available reference databases for annotation of 16S rRNA gene include Greengenes database, Ribosomal database project (RDP), SILVA, and Human microbiome project (HMP).
Beta (β) diversity measures the difference in bacterial community composition for different samples. Before quantifying β diversity, the read counts (reads mapped to each taxon) must be normalized to minimize the technical variability between samples. There are two common normalization procedures: the total sum and upper quartile normalization.
There are two main methods for quantifying β diversity: phylogenetic β diversity that considers the evolutionary differences between communities (such as UniFrac), and non-phylogenetic or taxon-based methods (such as Bray-Curtis dissimilarity). Once distances or dissimilarities between samples have been determined, they can be ordinated in a low-dimensional space to better illustrate how closely related they are to each other. The two most commonly used ordination tools are principal coordinate analyses (PCoA) and non-metric multidimensional scaling (NMDS).
Figure 2. NMDS and PCoA for quantification of beta diversity (Jovel et al. 2016).
OTU abundance table can be further used to presume for metabolic functions. It is a process to understand the role of the microbiome on host metabolism and disease. There are currently three powerful tools for predictive metagenomics profiling (PMP): PICRUSt, Tax4Fun, and Piphillin.
16S rRNA amplicon sequencing is popular due to its cost-efficient, time-effective, and informative features. But it is also limited by several disadvantages. First, 16S is well suited for multiple patients, longitudinal studies, but provides limited taxonomic and functional information. Second, the PCR amplification of different regions of 16S rRNA gene may generate discordant results owing not only to the distinct binding affinities for the corresponding flanking conserved regions, but also owing to the resolution of each variable region across taxa. Therefore, full-length 16S rRNA sequencing or shotgun metagenomics may sometimes be more favorable, especially the latter.