`rm(list=ls())`

The above is just to make sure that R’s memory is squeaky-clean.

`set.seed(100) # this is just a way to keep "randomly" generated numbers from changing every time we run the script`

g1<-rpois(n=100, lambda=2)

g2<-rpois(n=100, lambda=7)

trcol <- adjustcolor(c("red", "blue"), alpha=0.4)

hist(g1, freq=T, xlim=c(0,25), col=trcol[1], breaks=5, main="count data")

hist(g2, freq=T, add=T, col=trcol[2], breaks=15)

The frequency distributions of the two groups are asymmetrical, a clue that they are not normally distributed.
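A quick numeric check can back up the visual impression (a minimal sketch, not part of the original script): for Poisson-like counts the mean and variance of each group should be of similar magnitude, and a Shapiro-Wilk test should flag the departure from normality.

mean(g1); var(g1) # for Poisson data, mean and variance are roughly equal

mean(g2); var(g2)

shapiro.test(g1) # a small p-value is evidence against normality

shapiro.test(g2)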

`# Let's run some analyses`

# organise data in a dataset:

dd <- data.frame(
  group = c(rep("group1", 100), rep("group2", 100)),
  values = c(g1, g2)
)

head(dd)

Let’s play dumb and use a general linear model approach:

`lm1 <- lm(values~group, data=dd)`

summary(lm1)

# check the model assumptions (graphically):

par(mfrow=c(3,2))

plot(lm1)

Diagnostic plots:

- Residuals vs Fitted. This plot shows whether the residuals have non-linear patterns (in our case it looks all right-ish).
- Q-Q plot. This plot shows whether the residuals are normally distributed (in our case they aren't). For interpreting Q-Q plots see here and here – there are many other examples online and in books.
- Scale-Location: to check the assumption of equal variance (homoscedasticity). For the data at hand, Scale-Location shows that the larger the fitted values, the more the variability around them. In other words, mean and variance are correlated.
- Residuals vs Leverage. Are there any outliers? (No)
- Extra aid for evaluating the distribution of the residuals (a quick numeric check is sketched after the histogram below):

`hist(lm1$resid)`
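As a complement to the Q-Q plot and the histogram, a formal normality test can be run on the residuals. This is just a minimal sketch using the Shapiro-Wilk test from base R; with n = 200, even mild departures from normality give tiny p-values, so read it alongside the plots rather than instead of them.

shapiro.test(residuals(lm1)) # small p-value = evidence against normally distributed residuals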

How to deal with non-normality? One way is to normalise the data by transforming them:

`dd$logvals <- log(dd$values)`

lm2 <- lm(logvals ~ group, data=dd)

Here is one problem with log-transformations: our counts contain zeros, and log(0) = -Inf, so `lm2` above fails with an error. Values equal to zero need to be made arbitrarily different from zero, hence an extra transformation for our data:

`dd$logvals2 <- log(dd$values+1)`
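Before moving on, a quick look at what forced this extra step (a sketch, not part of the original post):

sum(dd$values == 0) # number of zero counts, mostly from group1 (lambda = 2)

range(dd$logvals) # note the -Inf produced by log(0)

range(dd$logvals2) # finite after adding 1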

par(mfrow=c(1,2))

hist(dd$logvals[dd$group=="group1"], freq=T, xlim=c(-4,4), ylim=c(0,60), col=trcol[1], breaks=5, main="log(count data)")

hist(dd$logvals[dd$group=="group2"], freq=T, add=T, col=trcol[2], breaks=4)

hist(dd$logvals2[dd$group=="group1"], freq=T, xlim=c(-4,4), ylim=c(0,60), col=trcol[1], breaks=5, main="log(count data+1)")

hist(dd$logvals2[dd$group=="group2"], freq=T, add=T, col=trcol[2], breaks=4)

`# Now lm() runs.`

lm3 <- lm(logvals2 ~ group, data=dd)

summary(lm3)

**# Note that the estimates are on the log scale**

**# original-scale estimate = exp(log-scale estimate)**
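As a concrete illustration (a sketch; because lm3 was fitted to log(values+1), exp() of these coefficients gives geometric-mean-type quantities on the values+1 scale, not the arithmetic group means):

exp(coef(lm3)[1]) # back-transformed intercept (group1), on the values+1 scale

exp(coef(lm3)[1] + coef(lm3)[2]) # back-transformed value for group2, on the values+1 scale

tapply(dd$values, dd$group, mean) # compare with the raw group means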

Because of the issues above, some have suggested not log-transforming count data at all: http://www.imachordata.com/do-not-log-transform-count-data-bitches/

**Use GENERALISED linear models instead!**

`glm1 <- glm(values ~ group, data=dd, family=poisson(link="log"))`

summary(glm1)

The model estimates both the parameters AND their associated Std. Errors on the link scale (in our example, link = log), and the statistical tests are run assuming that the response follows not a Normal, but a Poisson distribution.

To back-transform the parameters:

`mean1 <- exp(summary(glm1)$coefficients[1,1])`

mean2 <- exp(summary(glm1)$coefficients[1,1] + summary(glm1)$coefficients[2,1]) # exp(intercept + group2 coefficient): the coefficients are additive on the log scale, so sum them before exponentiating
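A quick sanity check (a sketch, not in the original post): for a Poisson GLM with a single factor and a log link, the back-transformed estimates should reproduce the observed group means.

tapply(dd$values, dd$group, mean) # observed group means

c(mean1, mean2) # back-transformed model estimates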

And how do we obtain the Std. Errors on the original scale? Simply doing this:

`exp(summary(glm1)$coefficients[,2]) # WRONG!`

is not correct (to find out why, this is a good starting point).

One way is to use the function `predict()`:

`preddata <- data.frame(group=unique(dd$group))`

preds <- predict(glm1, newdata=preddata, type="response", se.fit=T)

preds$se.fit
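As a cross-check, a sketch based on a first-order delta-method approximation (which, as far as I know, is also what `predict()` uses internally for `type="response"`): with a log link, the response-scale standard error is approximately the fitted mean times the link-scale standard error.

se.link <- predict(glm1, newdata=preddata, type="link", se.fit=T)$se.fit # SEs on the log (link) scale

preds$fit * se.link # delta-method SEs on the response scale; should closely match preds$se.fit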

For a detailed comparison of four different ways to calculate standard errors and plot 95% confidence intervals, click here. Spoiler: the results of the different methods show some discrepancies that I can't explain. If you know more about it than I do, let me know.
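One commonly recommended option (a sketch, not necessarily one of the four methods in the linked comparison): compute the 95% confidence interval on the link scale and back-transform its endpoints, which keeps the interval within the admissible (non-negative) range of the response.

ci.link <- predict(glm1, newdata=preddata, type="link", se.fit=T) # fit and SE on the log scale

lower <- exp(ci.link$fit - 1.96 * ci.link$se.fit)

upper <- exp(ci.link$fit + 1.96 * ci.link$se.fit)

cbind(preddata, fit=exp(ci.link$fit), lower=lower, upper=upper)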