Compare two groups (count data)

rm(list=ls())

The above is just to make sure that R’s memory is squeaky-clean.

set.seed(100) # this is just a way to keep "randomly" generated numbers from changing every time we run the script
g1<-rpois(n=100, lambda=2)
g2<-rpois(n=100, lambda=7)
trcol <- adjustcolor(c("red", "blue"), alpha=0.4)
hist(g1, freq=T, xlim=c(0,25),
col= trcol[1], breaks=5,
main="count data")
hist(g2, freq=T, add=T, col= trcol[2], breaks=15)

The frequency distributions of the two groups are asymmetrical, a clue that they are not normal.
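
As a quick numeric check (this bit is an addition to the original script): for Poisson-like counts, the mean and the variance of each group should be roughly equal, which is not something a Normal model assumes:

c(mean(g1), var(g1)) # for Poisson-like counts, mean and variance are roughly equal
c(mean(g2), var(g2))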

# Let's run some analyses
# organise data in a dataset:
dd <- data.frame(
group = c(rep("group1", 100),rep("group2", 100)),
values = c(g1, g2)
)
head(dd)

Let’s play dumb and use a general linear model approach:

lm1 <- lm(values~group, data=dd)
summary(lm1)
# check the model assumptions (graphically):
par(mfrow=c(3,2))

plot(lm1)

Diagnostics plots:

  1. Residuals vs Fitted. This plot shows whether the residuals have non-linear patterns (in our case it looks all right-ish).
  2. Q-Q plot. This plot shows whether the residuals are normally distributed (in our case they aren’t). For interpreting Q-Q plots see here and here – many other examples online and in books.
  3. Scale-Location: to check the assumption of equal variance (homoscedasticity). For the data at hand, the Scale-Location plot shows that the larger the fitted values, the larger the variability around them. In other words, mean and variance are correlated.
  4. Residuals vs Leverage. Are there any outliers? (No)
  5. Extra aid for evaluating the distribution of the residuals:

hist(lm1$resid)
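
If you prefer a number to eyeballing the histogram and the Q-Q plot, one option (not used in the original) is a Shapiro-Wilk test on the residuals:

# Shapiro-Wilk normality test on the residuals (a small p-value is evidence against normality)
shapiro.test(lm1$resid)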

How to deal with non-normality? One way is to normalise the data by transforming them:

dd$logvals <- log(dd$values)
lm2 <- lm(logvals ~ group, data=dd)
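# (The call above fails: dd$values contains zeros, so logvals contains -Inf and lm() cannot fit an infinite response.)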

Here is one problem with log-transformations: values=0 need to be made arbitrarily different from zero, hence adding an extra transformation for our data:

dd$logvals2 <- log(dd$values+1)
par(mfrow=c(1,2))
# left panel: log(values). Zero counts give -Inf, which hist() cannot plot, so they are dropped here
hist(dd$logvals[dd$group=="group1" & is.finite(dd$logvals)], freq=T,
xlim=c(-4,4), ylim=c(0,60),
col= trcol[1], breaks=5,
main="log(count data)"
)
hist(dd$logvals[dd$group=="group2" & is.finite(dd$logvals)], freq=T, add=T, col= trcol[2], breaks=4)
# right panel: log(values+1), defined for all counts
hist(dd$logvals2[dd$group=="group1"], freq=T,
xlim=c(-4,4), ylim=c(0,60),
col= trcol[1], breaks=5,
main="log(count data+1)"
)
hist(dd$logvals2[dd$group=="group2"], freq=T, add=T, col= trcol[2], breaks=4)

# Now lm() runs.
lm3 <- lm(logvals2 ~ group, data=dd)
summary(lm3)
# Note that the estimates are on the log scale
# back-transform with exp() to return to the original scale: estimates = exp(log(estimates))

Because of the issues above, some have suggested not log-transforming count data at all: http://www.imachordata.com/do-not-log-transform-count-data-bitches/

Use GENERALISED linear models instead!

glm1 <- glm(values ~ group, data=dd, family=poisson(link = "log"))
summary(glm1)

The model estimates both the parameters AND their associated Std. Errors on the link scale (in our example, link = log), and the statistical tests are run assuming that the response follows a Poisson distribution, not that the residuals are Normal.

To back-transform the parameters:

mean1 <- exp(summary(glm1)$coefficients[1,1]) # group1 mean: exp(intercept)
mean2 <- exp(summary(glm1)$coefficients[1,1] + summary(glm1)$coefficients[2,1]) # group2 mean: exp(intercept + group effect)
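
As a sanity check (an addition to the original code), with a single grouping factor the back-transformed Poisson-GLM coefficients should simply reproduce the raw group means:

c(mean1, mean2)
tapply(dd$values, dd$group, mean) # should match the back-transformed coefficients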

And how do we obtain the Std. Errors on the original scale? One might be tempted to just exponentiate them:

exp(summary(glm1)$coefficients[,2]) # WRONG!

but that is not correct (to find out why, this is a good starting point).
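
One option, sketched here as an aside and not the route taken below, is the delta method: on the response scale the standard error of exp(eta) is approximately exp(eta) times the link-scale standard error of eta:

V <- vcov(glm1)                                 # covariance matrix of the link-scale coefficients
se.link <- c(sqrt(V[1,1]),                      # group1: linear predictor is the intercept b0
             sqrt(V[1,1] + V[2,2] + 2*V[1,2]))  # group2: linear predictor is b0 + b1
se.resp <- c(mean1, mean2) * se.link            # delta method: SE[exp(eta)] ~ exp(eta) * SE[eta]
se.resp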

Another way is to use the function predict():

preddata <- data.frame(group=unique(dd$group))
preds <- predict(glm1, newdata=preddata, type="response", se.fit=T)
preds$se.fit
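
To tie things together (this check is an addition), the predicted means should match the back-transformed coefficients from above, and preds$se.fit gives the corresponding response-scale standard errors:

cbind(preddata, fit = preds$fit, se = preds$se.fit) # predicted group means and their standard errors
round(c(mean1, mean2) - preds$fit, 10)              # should be (essentially) zero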

For a detailed comparison of four different ways to calculate standard errors and plot 95% confidence intervals, click here. Spoiler: the results of the different methods show some discrepancies that I can’t explain. If you know better than I do about it, let me know.