rm(list=ls()) # a rather compulsive habit of making sure that R's memory is a clean slate. 99.9% of the time it isn't necessary but it doesn't hurt. # Some data: DF<-data.frame( y=c(-1.22, -1.73, -2.64, -2.44, -1.11, 2.24, 3.42, 0.67, 0.59, -0.61, -10.77, 0.93, -8.6, -6.99, -0.12, -2.29, -5.16, -3.35, -3.35, -2.51, 2.21, -1.18, -5.21, -7.74, -1.34), x1=c(39.5, 41, 34, 30.5, 31.5, 30, 41.5, 24, 43, 39, 25.5, 38.5, 33.5, 30, 41, 31, 25, 37, 37.5, 24.5, 38, 37, 41, 37, 36), x2=c(61, 53, 53, 44, 49, 44, 57, 47, 54, 48, 46, 59, 46, 61, 55, 57, 59, 59, 55, 50, 62, 55, 55, 52, 55))
If I hadn’t provided the data as above but in a separate file, you could load them by using this:
DF <- read.csv(file.choose())
Explore the data graphically:
plot(DF) # install.packages("scatterplot3d") and load it all in once: if (!require(scatterplot3d)) install.packages('scatterplot3d'); library(scatterplot3d) # does what the name says s3d <-scatterplot3d(DF$x1,DF$x2,DF$y, xlab="x1", ylab="x2", zlab="y", pch=16, highlight.3d=T, type="p", main="3D Scatterplot") # it is possible to fit flat surfaces (the product of additive models)* with the code below: fit <- lm(y ~ x1+x2,data=DF) s3d$plane3d(fit,lty.box = "solid") # try to interpret the plot qualitatively.
Some more detail: plane3d can only plot flat surfaces, which is why I am using “fit” which is an additive model.
Adding an interaction between x1 and x2 in the model would mean allowing for x1 to affect the effect of x2 on y, and similarly for x2 on x1. This would result in a non-flat surface and plane3d would not handle it. There are tools for plotting non-flat surfaces in R… You can look them up online.
# the usual drill for selecting the best model:
linmod.full <- lm(y~x1*x2, data=DF)
linmod.add <- lm(y~x1+x2, data=DF)
linmod.x1 <- lm(y~x1, data=DF)
linmod.x2 <- lm(y~x2, data=DF)
linmod.null <- lm(y~1, data=DF)
We retain the null model. Variables x1 and x2, by theirselves or together, have no statistically significant explanatory power on y.
Let’s look at a more complicated situation:
dd <- longley # ?longley multilinmod <- lm(Employed ~ GNP + Unemployed + Armed.Forces + Population, data=dd)
This model has more than two explanatory variables, so we cannot represent it graphically.
IMPORTANT: all models have assumptions.
One can just assume that the assumptions are respected and move on (which is what we’ve been doing until now for simplicity), but it’s usually wiser to test those assumptions. If the data don’t respect the assumptions, one can transform the data or tweak the model to deal with it. ONE CRUCIAL ASSUMPTION OF MULTIPLE REGRESSION IS THAT THERE SHOULD NOT BE MULTI-COLLINEARITY BETWEEN EXPLANATORY VARIABLES.
What does that mean?
For multiple linear regression to be used reliably, the explanatory variables should not be correlated to each other. Correlated explanatory variables are difficult to interpret in multiple regression analyses because they share explanatory power. Consequently, the perceived importance of one explanatory variable depends on other explanatory variables are included in the model.
# Evaluate Collinearity graphically explanatory <- dd[,c("GNP","Unemployed","Armed.Forces","Population")] plot(explanatory) #install.packages("GGally") library(GGally) # a little gimmick ggpairs(dd[,c("GNP","Unemployed","Armed.Forces","Population")]) # Evaluate Collinearity quantitatively: install.packages("mctest") library(mctest) # The ‘mctest’ package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity. Functions ‘omcdiag’ and ‘imcdiag’ provide the overall and individual diagnostic checking for multicollinearity respectively. # More here: https://www.r-bloggers.com/multicollinearity-in-r/ response <- dd[,c("Employed")] omcdiag(explanatory, response)
There is multicollinearity. How to fix this? To start off, one sould (should?) prune those explanatory variables that have correlation coeffs > 0.9 with other explanatory variables. In our case, “GNP” and “population”. Here I am overwriting object “explanatory” with a new one in which I arbitrarily dropped “population” (based on my limited knowledge of the dataset, retaining “population” and dropping “GNP” would have been equally valid):
explanatory <- dd[,c("GNP","Unemployed","Armed.Forces")]
For all other cases we can use Principal Component Analysis (PCA) to obtain new variables. Principal Components are as many as the original ones but they have two main features:
- They are ranked according to how much variability they explain. PC1 explains the most variability observed in the data
- Principal Components are BY DEFINITION ORTHOGONAL, I.E. UNCORRELATED
More on PCA in general here (it includes a link to an excellent interactive explanation of how PCA works).
#Let’s run a PCA:
pca1 <- prcomp(explanatory, scale=TRUE) # scale=T deals with variables very different value ranges. See ?prcomp
pca1 # the "Rotation" matrix # it contains the "loadings" of each of the original variables on the newly created PCs.
Loadings are proportional to the correlation between each of the original variables and each of the PC. More details here.
pca.scores <- pca1$x # the PCA scores, namely the new coordinates for each data point.# create a new dataset to run our analyses on: newvariables <- as.data.frame(cbind(response, pca.scores)) # test for multicollinearity omcdiag(pca.scores, response) # SOLVED! # how to select the best model? # one way is to compare models with all possible combinations of PC1,PC2 and PC3 - the usual drill: # multilinmod2 <- lm(response ~ PC1+PC2+PC3, data=newvariables) # multilinmod3 <- lm(response ~ PC1+PC2, data=newvariables) # multilinmod4 <- lm(response ~ PC1, data=newvariables) # multilinmod5 <- lm(response ~ PC2, data=newvariables) # et cetera... # anova(multilinmod2, multilinmod3) # et cetera... # there are ways to make this process automatic. One is called stepwise model selection and I think the function for doing it is step()
Alternatively we can compare the amount of total variance explained by each PC and drop those that explain less than a certain threshold (5%? 10%?)
vars <- data.frame(variances= pca1$sdev^2, pcomp=1:length(pca1$sdev)) vars$percent.explained <- 100 * vars$variances/sum(vars$variances) vars # In this case we can make our life easier and keep only PC1 and PC2, and do model selection from there: multilinmod3 <- lm(response ~ PC1+PC2, data=newvariables) multilinmod4 <- lm(response ~ PC1, data=newvariables) multilinmod5 <- lm(response ~ PC2, data=newvariables) multilinmod6 <- lm(response ~ 1, data=newvariables) anova(multilinmod3, multilinmod4) # we can drop PC2 anova(multilinmod3, multilinmod5) # but keep PC1 summary(multilinmod4) # this seems the best model
To understand what the model actually says, look into the “Rotation” matrix (see above) and see how much each of the original variables contributes to PC1.