`rm(list=ls())`

The above is just to make sure that R’s memory is squeaky-clean.

`set.seed(100) # this is just a way to keep "randomly" generated numbers from changing every time we run the script`

g1<-rpois(n=100, lambda=2)

g2<-rpois(n=100, lambda=7)

trcol <- adjustcolor(c("red", "blue"), alpha=0.4)

hist(g1, freq=T, xlim=c(0,25), col=trcol[1], breaks=5, main="count data")

hist(g2, freq=T, add=T, col=trcol[2], breaks=15)

The frequency distributions of the two groups are asymmetrical, a clue that they are not normally distributed.
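A quick numeric check can back up the visual impression (a minimal sketch, not part of the original script): for Poisson-like counts the mean and variance of each group should be of similar magnitude, and a Shapiro-Wilk test should flag the departure from normality.

mean(g1); var(g1) # for Poisson data, mean and variance are roughly equal

mean(g2); var(g2)

shapiro.test(g1) # a small p-value is evidence against normality

shapiro.test(g2)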

`# Let's run some analyses`

# organise data in a dataset:

dd <- data.frame(
  group = c(rep("group1", 100), rep("group2", 100)),
  values = c(g1, g2)
)

head(dd)

Let’s play dumb and use a general linear model approach:

`lm1 <- lm(values~group, data=dd)`

summary(lm1)

# check the model assumptions (graphically):

par(mfrow=c(3,2))

plot(lm1)

Diagnostic plots:

- Residuals vs Fitted. This plot shows whether the residuals have non-linear patterns (in our case it looks all right-ish).
- Q-Q plot. This plot shows whether the residuals are normally distributed (in our case they aren't). For interpreting Q-Q plots see here and here – there are many other examples online and in books.
- Scale-Location: to check the assumption of equal variance (homoscedasticity). For the data at hand, Scale-Location shows that the larger the fitted values, the more the variability around them. In other words, mean and variance are correlated.
- Residuals vs Leverage. Are there any outliers? (No)
- Extra aid for evaluating the distribution of the residuals (a quick numeric check is sketched after the histogram below):

`hist(lm1$resid)`
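As a complement to the Q-Q plot and the histogram, a formal normality test can be run on the residuals. This is just a minimal sketch using the Shapiro-Wilk test from base R; with n = 200, even mild departures from normality give tiny p-values, so read it alongside the plots rather than instead of them.

shapiro.test(residuals(lm1)) # small p-value = evidence against normally distributed residuals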

How to deal with non-normality? One way is to normalise the data by transforming them:

`dd$logvals <- log(dd$values)`

lm2 <- lm(logvals ~ group, data=dd)

Here is one problem with log-transformations: our counts contain zeros, and log(0) = -Inf, so `lm2` above fails with an error. Values equal to zero need to be made arbitrarily different from zero, hence an extra transformation for our data:

`dd$logvals2 <- log(dd$values+1)`
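Before moving on, a quick look at what forced this extra step (a sketch, not part of the original post):

sum(dd$values == 0) # number of zero counts, mostly from group1 (lambda = 2)

range(dd$logvals) # note the -Inf produced by log(0)

range(dd$logvals2) # finite after adding 1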

par(mfrow=c(1,2))

hist(dd$logvals[dd$group=="group1"], freq=T, xlim=c(-4,4), ylim=c(0,60), col=trcol[1], breaks=5, main="log(count data)")

hist(dd$logvals[dd$group=="group2"], freq=T, add=T, col=trcol[2], breaks=4)

hist(dd$logvals2[dd$group=="group1"], freq=T, xlim=c(-4,4), ylim=c(0,60), col=trcol[1], breaks=5, main="log(count data+1)")

hist(dd$logvals2[dd$group=="group2"], freq=T, add=T, col=trcol[2], breaks=4)

`# Now lm() runs.`

lm3 <- lm(logvals2 ~ group, data=dd)

summary(lm3)

**# Note that the estimates are on the log scale**

**# original-scale estimate = exp(log-scale estimate)**
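As a concrete illustration (a sketch; because lm3 was fitted to log(values+1), exp() of these coefficients gives geometric-mean-type quantities on the values+1 scale, not the arithmetic group means):

exp(coef(lm3)[1]) # back-transformed intercept (group1), on the values+1 scale

exp(coef(lm3)[1] + coef(lm3)[2]) # back-transformed value for group2, on the values+1 scale

tapply(dd$values, dd$group, mean) # compare with the raw group means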

Because of the issues above, some have suggested not log-transforming count data at all: http://www.imachordata.com/do-not-log-transform-count-data-bitches/

**Use GENERALISED linear models instead!**

`glm1 <- glm(values ~ group, data=dd, family=poisson(link="log"))`

summary(glm1)

The model estimates both the parameters AND their associated Std. Errors on the link scale (in our example, link = log), and the statistical tests are run assuming that the response follows not a Normal, but a Poisson distribution.

To back-transform the parameters:

`mean1 <- exp(summary(glm1)$coefficients[1,1])`

mean2 <- exp(summary(glm1)$coefficients[1,1] + summary(glm1)$coefficients[2,1]) # exp(intercept + group2 coefficient): the coefficients are additive on the log scale, so sum them before exponentiating
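A quick sanity check (a sketch, not in the original post): for a Poisson GLM with a single factor and a log link, the back-transformed estimates should reproduce the observed group means.

tapply(dd$values, dd$group, mean) # observed group means

c(mean1, mean2) # back-transformed model estimates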

And how do we obtain the Std. Errors on the original scale? Simply doing this:

`exp(summary(glm1)$coefficients[,2]) # WRONG!`

is not correct (to find out why, this is a good starting point).

One way is to use the function `predict()`:

`preddata <- data.frame(group=unique(dd$group))`

preds <- predict(glm1, newdata=preddata, type="response", se.fit=T)

preds$se.fit
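As a cross-check, a sketch based on a first-order delta-method approximation (which, as far as I know, is also what `predict()` uses internally for `type="response"`): with a log link, the response-scale standard error is approximately the fitted mean times the link-scale standard error.

se.link <- predict(glm1, newdata=preddata, type="link", se.fit=T)$se.fit # SEs on the log (link) scale

preds$fit * se.link # delta-method SEs on the response scale; should closely match preds$se.fit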

For a detailed comparison of four different ways to calculate standard errors and plot 95% confidence intervals, click here. Spoiler: the results of the different methods show some discrepancies that I can't explain. If you know more about it than I do, let me know.
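One commonly recommended option (a sketch, not necessarily one of the four methods in the linked comparison): compute the 95% confidence interval on the link scale and back-transform its endpoints, which keeps the interval within the admissible (non-negative) range of the response.

ci.link <- predict(glm1, newdata=preddata, type="link", se.fit=T) # fit and SE on the log scale

lower <- exp(ci.link$fit - 1.96 * ci.link$se.fit)

upper <- exp(ci.link$fit + 1.96 * ci.link$se.fit)

cbind(preddata, fit=exp(ci.link$fit), lower=lower, upper=upper)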