
## Principal Component Analysis (PCA) and Principal Coordinate Analysis (PCoA) for Microbial Sequencing: Introduction and Procedures

Principal Component Analysis (PCA) and Principal Coordinate Analysis (PCoA) are two of the main ordination techniques used in multivariate analysis. Unlike classification, which assigns names or labels, ordination arranges samples or data along gradients. Both approaches trade a small amount of accuracy for a simplified visualization of large data sets, such as microbiome gene expression data.

PCA reduces the dimensionality of a large data set to a smaller one that still preserves most of the information. Correlated variables are combined, so fewer variables are needed to describe the data. A data set with 1 to 3 variables is easy to visualize in one to three dimensions; however, analysis becomes much harder with 4 or more variables, and directly visualizing a data set with 200 dimensions is impossible. Researchers therefore use multivariate procedures such as PCA to make data easier to visualize and analyze without extraneous variables. The result is a low-dimensional plot in which the distances between points approximate the original dissimilarities between samples.

The procedure begins with standardization, which removes bias by rescaling the continuous primary variables to a common scale so that each one contributes equally to the analysis. Next, the covariance matrix is computed to examine how the variables vary from their means with respect to each other. A positive covariance indicates that two variables are positively correlated (they increase or decrease together); a negative covariance indicates that they are inversely correlated. Eigenvectors and eigenvalues are then computed from the covariance matrix to identify the principal components, which are new variables constructed as linear combinations of the primary variables. The principal components are uncorrelated with one another, and most of the information in the primary variables is compressed into the first few of them. For example, 50-dimensional data yields 50 principal components, but much of the information can be squeezed into the first principal component (PC1) and the second principal component (PC2). PC1 lies along the direction of largest possible variance, where the values are scattered the most; PC2 captures the next highest variance. A feature vector is then formed by keeping the components with the highest significance, that is, those with the largest eigenvalues. Lastly, the data are reoriented from the original axes to the axes represented by the principal components by projecting them onto the selected eigenvectors.
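The steps in this procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pipeline: the function name `pca`, the toy matrix `X`, and the choice of two components are assumptions made for the example, and it assumes no variable is constant (a constant column would give a zero standard deviation).

```python
import numpy as np

def pca(X, n_components=2):
    """Sketch of PCA on a samples-by-variables matrix X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)

    # 1. Standardize: zero mean and unit variance for every variable.
    #    Assumes no column is constant.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # 2. Covariance matrix of the standardized variables.
    C = np.cov(Z, rowvar=False)

    # 3. Eigen-decomposition; eigh is suited to symmetric matrices and
    #    returns eigenvalues in ascending order, so sort them descending.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 4. Feature vector: keep the eigenvectors with the largest eigenvalues.
    W = eigvecs[:, :n_components]

    # 5. Reorient the data onto the principal-component axes.
    scores = Z @ W

    # Fraction of total variance captured by each retained component.
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained

# Toy data: 6 samples, 4 variables, with variable 1 strongly tied to variable 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=6)

scores, explained = pca(X)
print(scores.shape)   # (6, 2): one PC1/PC2 coordinate pair per sample
print(explained)      # share of variance carried by PC1 and PC2
```

Plotting the two columns of `scores` against each other gives the familiar PC1 versus PC2 ordination plot.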

PCA can be useful for integrating microbiome sequencing data because it visualizes correlations between samples and relates features within and across multiple tables. However, some tables may have more variables than others and can therefore dominate the resulting ordination. Another drawback is that PCA relates only pairs of variables, not the sets of variables that define each table. Canonical Correlation Analysis (CCA) and Multiple Factor Analysis (MFA) can address these drawbacks.

PCoA, on the other hand, represents the distances between samples in a low-dimensional space. Specifically, it maximizes the linear correlation between the distances in the distance matrix and the distances in the low-dimensional space. The first step is to construct a (dis)similarity matrix, which can be calculated from quantitative, semi-quantitative, qualitative, or mixed variables. Although it is based on a (dis)similarity matrix, PCoA can be interpreted in much the same way as PCA because it is also derived through eigenanalysis. PCoA is preferable to PCA when there is a lot of missing data and when there are more characters than individuals.
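A classical PCoA can be sketched as follows, assuming a symmetric dissimilarity matrix with a zero diagonal. The helper names `bray_curtis` and `pcoa`, the toy count table, and the choice of Bray-Curtis dissimilarity are illustrative assumptions; any other (dis)similarity measure could be substituted in the first step.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = (np.abs(counts[i] - counts[j]).sum()
                       / (counts[i] + counts[j]).sum())
    return D

def pcoa(D, n_components=2):
    """Sketch of classical PCoA on a dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]

    # Double-center the squared dissimilarities (Gower centering).
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J

    # Eigenanalysis of the centered matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Non-Euclidean dissimilarities can give negative eigenvalues;
    # this sketch simply clips them to zero.
    top = np.clip(eigvals[:n_components], 0, None)

    # Sample coordinates: eigenvectors scaled by sqrt(eigenvalue).
    coords = eigvecs[:, :n_components] * np.sqrt(top)
    return coords, eigvals

# Toy abundance table: 4 samples (rows) by 3 taxa (columns).
counts = np.array([
    [10, 0, 5],
    [8, 2, 6],
    [0, 12, 1],
    [1, 10, 3],
])
D = bray_curtis(counts)
coords, eigvals = pcoa(D)
print(coords.shape)  # (4, 2): one pair of principal coordinates per sample
```

When the input distances are Euclidean, the coordinates returned by this procedure reproduce those distances exactly once enough axes are kept, which is why the resulting plot can be read much like a PCA plot.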
