Principal Component Analysis (PCA)
Click here for a very good, interactive explanation of the idea behind PCA. Now let’s try a hands-on example:
rm(list=ls()) # its always good practice to clear R’s memory
# Open the “iris” dataset:
#?iris # gives info about the dataset
By plotting the variables against each other it becomes obvious that some are strongly correlated: in other words, there is an overlap in the power of some variables at explaining/accounting for the data variability. A PCA will help disentangling these correlations.
# functions prcomp() performs PCA:
Use scale=T to standardize the variables to the same relative scale, avoiding some variables to become dominant just because of their large measurement units.
The summary indicates that four PCs where created: the number of possible PCs always equals the number of original variables. The difference is that Principal Components are all orthogonal to each other, therefore independent (their collinearity is zero). PC1 and PC2 explain respectively ~73% and ~23% of the data’s total variability, summing up to a more-than-respectable 96% of the total variability. There is no fixed rule about this, but this already tells us that all the other PCs can be ignored as they explain only crumbs of the total data variability.
A “scree plot” allows a graphical assessment of the relative contribution of the PCs in explaining the variability of the data.
fit is the the “Rotation” matrix containing the “loadings” that each of the original variables have on each of the newly created PCs. The concept of eigenvalue would require to be introduced for understanding how the loadings are estimated, and in general for a quantitative understanding of how the principal components are calculated: the interested reader might look it up in the references here and here.
The arrows provide a graphical rendition of the loadings of each of the original variables over the used PCs.
Package ggplot2 and its derivative ggbiplot have a somewhat esoteric syntax, but they produce much fancier plots:
library(ggplot2) # this might need to be installed
# Variances of principal components:
variances <- data.frame(variances=fit$sdev**2, pcomp=1:length(fit$sdev))
# **2 means ^2
Plot of variances with ggplot:
varPlot <- ggplot(variances, aes(pcomp, variances)) +
geom_bar(stat="identity", fill="gray") + geom_line()
To plot the first two Principal Components of a PCA using ggbiplot:
# run these two lines only if ggbiplot is not installed yet:
# install_github("ggbiplot", "vqv")
# load ggbiplot
iris_pca <- ggbiplot(fit,obs.scale = 1,
var.scale=1,groups=Species,ellipse=F, circle=F, varname.size=3)
# iris_pca <- iris_pca + theme_bw() # try this for a less
# ink-demanding background
Non-linear Multidimensional Scaling (nMDS)
Non-linear Multidimensional Scaling (nMDS) is another type of ordination useful for reducing the number of dimensions that describe the data. Regardless of the initial number of dimensions, nMDS disposes the data points in a two- or three-dimensional space until it finds the configuration that produces the minimal amount of distortion. Here is how to produce and represent nMDS in R:
i <- iris[,c(1:4)]
i.dist <- vegdist(i, method="euclidean")
i.mds <- metaMDS(i.dist, try=50)plot(i.mds, display="sites",type="n",
xlab="Axis 1", ylab="Axis 2", main="Iris species")
# add color
PerMANOVA is the multidimensional version of an Analysis of Variance. In R it can be performed with function adonis from package vegan, a widely used tool for multivariate statistics.
i.obs <- i
i.spp <- data.frame(species=iris[,5])
d.manova <- adonis(i.obs ~ species, method = "euclidean", data= i.spp)
Post-hoc pairwise comparisons:
iris.adonis.pw <- pairwise.adonis(x = i.obs,
iris.adonis.pw <- as.data.frame(iris.adonis.pw)
iris.adonis.pw$F.Model <- round(iris.adonis.pw$F.Model, 2)
iris.adonis.pw$R2 <- round(iris.adonis.pw$R2, 2)