All models are wrong, but some are useful. – George Box
Most links below take to R scripts that I wrote to illustrate how to perform various tasks (chiefly statistics) in R; other links lead to some of the many existing pages that illustrate topics and tasks better than I would.
You are free to use my original material on this website for non-lucrative purposes. If you are planning to make money out of it, get in touch with me first so that we can arrange terms and conditions. I am also available as an R-gun for hire for statistical consultancy and for teaching courses.
Univariate statistics refer to the set of tests and analyses used to analyse response variables in which every observation is expressed by a single value.
UNIVARIATE STATISTICS FOR NORMALLY DISTRIBUTED DATA
- Comparing two groups: t-test and some data-plotting (click here to see how to perform a t-test manually, with a little help from R)
- Comparing more than two groups: 1-way ANOVA – also covered: box-and-whiskers plots (click here to see how to perform ANOVA manually, with a little help from R)
- Comparing groups treated with combinations of two or more categorical factors: 2-way ANOVA – also covered: histograms with error bars
- Linear regression
- Studying the effect of one continuous variable and (at least) one categorical variable on a response variable: Analysis of Covariance – also covered: making fancy plots with Confidence intervals.
- Multiple linear regression – also covered: 3D graphs and PCA.
- Power analysis allows “to determine the sample size required to detect an effect of a given size with a given degree of confidence” (Clay Ford). Clay Ford’s vignette for Stephane Champely’s package pwr gives some clear examples and explanations (here). For power analysis in Mixed Effect models, see this and this article. Here is my example of a power analysis for a linear modela and mixed-effect model.
- On estimating marginal means in R.
- In the previous scripts we’ve seen how to compare models using anova(), which performs an F-test on the residual variability of the compared models, or using AIC. Here are some considerations on AIC by Brian McGill.
GENERALISED LINEAR MODELS (GLMs)
GLMs allow to perform univariate statistics for NON-NORMALLY distributed data and they represent an alternative to performing general linear models on transformed data. such as in the following examples:
- Comparing two groups (count data) – the code in this example can easily be adapted to more complicated situations, such as comparing more than two groups or testing the effect of multiple categorical variables on a response variable in “count” format.
- Another example with count data – here we look at a regression-like situation with non-Normal data.
- Binary data such as survival data or presence/absence: see this worked example by Gavin Simpson, where he also shows how to compute confidence intervals correctly.
Further useful resources:
- On the difference between using GLMs or using LMs on transformed data, see here and here.
- On how to compute Confidence Intervals, see here and here.
- On how to calculate the SE of a quantity on the original scale knowing its SE on the transformed scale, see here (delta method).
MIXED EFFECT MODELS (MEMs)
MEMs are useful to deal with unbalanced study designs and autocorrelated (non-independent) data. They estimate a statistical model’s fixed effects while accounting for the model’s random effects. As a practical and brutal definition, a model’s fixed effects are the explanatory variables the effect of which we are interested to quantify explicitly; a model’s random effects are variables that are not the main focus of our study but that we want to account for because we expect them to explain some of the residual variability (more on fixed and random effects: here, here).
- An example by Ignasi Bartolomeus (the issue with his example is that the random effect has only three levels. While this is helpful for teaching purposes, they recommend the random effect to have at least 5-6 levels for it to give reliable estimates).
- An excellent book for understanding MEMs and implementing them in R is “A Beginner’s Guide to GLM and GLMM with R: A Frequentist and Bayesian Perspective for Ecologists” by Alain F. Zuur, Joseph M. Hilbe, and Elena N. Leno.
PHYLOGENETIC GENERALIZED LEAST SQUARES (PGLS)
Mixed effect models can deal with data non-independence. A special case of data non-independence is represented by the phylogenetic relationship among species. Let’s say that we want to study the correlation between dietary habits and mean body weight in mammals. The body weight of different species may be affected by their dietary habits, but their phylogenetic relationship may also matter – closely related species may have similar mean body weight just because of their common evolutionary history. To study the correlation between dietary habits and mean body weight in mammals, one has to account for phylogenetic signal. One possibility is to use MEMs with nested random effects, such as:
MEM.model <- lmer(data = mammals, body.weigth ~ dietary.pref
+ (1 + dietary.pref |family/genus/species))
A more specific alternative is to use phylogenetic generalized least squares (PGLS), that allow to inform the model about the autocorrelation between taxonomic units using existing phylogenetic trees. Here is an example where they look at the relationship between wing length and tarsus length among Geospiza finch species.
GENERAL ADDITIVE MODELS (GAM)
I used to be dismissive about GAMs, seeing them as a glorified smoothing function. In a way they are, but they can accomplish a lot: they can model and compare strongly non-linear and asymmetric patterns, they can decompose the relative contribution of continuous and categorical predictors to the observed trends, etc. There are even ways to include random effects and to account for autocorrelation in GAMs. Extremely useful resources for getting started with GAMs are here, here, and in Gavin Simpson’s answer to this question I asked on CrossValidated. For further detail, see this paper.
Multivariate statistics is used to analyse response variables in which every observation consists in a set of values.
- Jari Oknasen’s introduction to multivariate statistics in R (more by him on ordinations here)
- Pedro Martinez Arbizu’s package and function for post-hoc comparisons of perMANOVA models in R
- A script for PCA, nMDS, and perMANOVA.