Has this happened to you before?

You are (ostensibly) a data professional - perhaps even one with a graduate degree in statistics - and you are given an experimental dataset and asked to evaluate some basic questions, e.g. comparing different levels of an experimental condition of interest in the presence of covariates.

Your life momentarily flashes before you as you find yourself standing before an abyss filled with nothing but Bonferroni-corrected p-values. Maybe you even think to yourself, “Didn’t I go to school for this? How is it possible that I’ve spent the last few months making plots and running univariate logistic regressions and let my statistics skills decay?”.

You are in good company, friend. In such situations, there’s nothing inherently wrong imo with starting off with some nice visualizations or basic hypothesis tests to get a feel for the data (I know some take issue with the latter; let’s assume here that we’re not writing a formal manuscript). But surely, there is a more robust toolkit for analyzing this type of data, perhaps even one in a well-maintained R package.

There is indeed, and the legend Russ Lenth has us covered.

Enter the {emmeans} package

The package name stands for “estimated marginal means”. What does that mean?

Basically, you provide a model (from lm, glm, lmer, glmer, etc.), and this package will calculate the estimated mean values across the levels of your factor variables while holding every continuous variable at its mean. Lenth refers to this grid of predictor combinations as a “reference grid” for computing your marginal means.

Computing the mean values is very helpful as a starter, but the {emmeans} package also includes a lot of features for hypothesis testing, p-value adjustments, contrasts, backtransformations, etc. I will emphasize that there are a ton of features in this package, and for many common tasks there is more than one way to do the same thing. While there is very thorough documentation in the form of several package vignettes, these two issues together can make the amount of information overwhelming.

This post is an attempt for me to condense the parts of {emmeans} that I found most commonly applicable to my work. Most of this is taken from the documentation, but I will cite other sources as needed.

The basics

From the documentation, Lenth really emphasizes the following points:

  • Making sure your model is appropriate for your research question. {emmeans} depends on the model, so a bad or inappropriate model will lead to bad or inappropriate conclusions. Check your diagnostics (a quick sketch follows the first model fit below).
  • The fundamental function of this package is emmeans::emmeans, and it can do a lot by itself (e.g. tests, CIs, contrasts). However, it is recommended not to do everything in one call to emmeans::emmeans but rather to break the work into several separate function calls, since these should be discrete steps in practice.

Estimated Marginal Means

Let’s make a model first. Here, we are modeling plasma leucine concentration in pigs fed different sources of protein at different percent concentrations (the pigs dataset ships with {emmeans}).

library(emmeans)

pigs.lm1 <- lm(log(conc) ~ source + factor(percent), data = pigs)
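
Per the first point above, it’s worth glancing at the diagnostics before trusting any marginal means. A minimal sketch using base R’s standard diagnostic plots:

par(mfrow = c(2, 2))  # arrange the four standard diagnostic panels
plot(pigs.lm1)        # residuals vs. fitted, Q-Q, scale-location, leverage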

{emmeans} relies on a reference grid, as stated earlier. There is an emmeans::ref_grid function, but we don’t really need to call it by itself. These two approaches are equivalent:

pigs.emm.s1 <- emmeans(pigs.lm1, specs = "source")

pigs.rg <- ref_grid(pigs.lm1)
pigs.emm.s2 <- emmeans(pigs.rg, specs = "source")

Either pigs.emm.s1 or pigs.emm.s2 will return:

 source emmean     SE df lower.CL upper.CL
 fish     3.39 0.0367 23     3.32     3.47
 soy      3.67 0.0374 23     3.59     3.74
 skim     3.80 0.0394 23     3.72     3.88

Results are averaged over the levels of: percent 
Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

The specs input here refers to the variable we are interested in calculating the means for. We see in the output that results are averaged over the levels of percent, which is treated as a factor variable and whose levels are therefore kept distinct in the reference grid rather than collapsed into a single value. That’s visible when we look at pigs.rg:

> pigs.rg
'emmGrid' object with variables:
    source = fish, soy, skim
    percent =  9, 12, 15, 18
Transformation: "log"

This is in contrast to a model that does not declare percent as a factor.
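
For instance, fitting such a model (pigs.lm2 is just an illustrative name here) and building its reference grid collapses percent to its mean:

pigs.lm2 <- lm(log(conc) ~ source + percent, data = pigs)
ref_grid(pigs.lm2)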

'emmGrid' object with variables:
    source = fish, soy, skim
    percent = 12.931
Transformation: "log"

Hypothesis Tests

There is an emmeans::test() function for executing hypothesis tests on an emmeans object.

We can specify a null hypothesis value here, as well as a direction for the hypothesis tests using “>”, “<”, or “!=”.

The type input can be set to “response”, indicating that values should be backtransformed. Note that the backtransformation is done as the last step, so all tests are conducted on the transformed scale.
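
(If you want the tests themselves performed on the response scale, my understanding is that emmeans::regrid can re-express the grid first; a rough sketch, using the pigs.emm.s1 object from earlier:)

pigs.emm.resp <- regrid(pigs.emm.s1)        # re-express the grid on the response scale
test(pigs.emm.resp, null = 40, side = ">")  # null is now on the response scale too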

Adjustments can be specified as well, with possible values including “tukey”, “scheffe”, “sidak”, “mvt” (multivariate t with a Monte Carlo method), and of course “bonferroni”.

> test(pigs.emm.s1, null = log(40), side = ">", type = "response")
 source response   SE df null t.ratio p.value
 fish       29.8 1.09 23   40  -8.026  1.0000
 soy        39.1 1.47 23   40  -0.577  0.7153
 skim       44.6 1.75 23   40   2.740  0.0058

Results are averaged over the levels of: percent 
P values are right-tailed 
Tests are performed on the log scale 

> test(pigs.emm.s1, null = log(40), side = ">", adjust = "bonferroni")
 source emmean     SE df null t.ratio p.value
 fish     3.39 0.0367 23 3.69  -8.026  1.0000
 soy      3.67 0.0374 23 3.69  -0.577  1.0000
 skim     3.80 0.0394 23 3.69   2.740  0.0175

Results are averaged over the levels of: percent 
Results are given on the log (not the response) scale. 
P value adjustment: bonferroni method for 3 tests 
P values are right-tailed 

Confidence Intervals

emmeans::emmeans output includes confidence intervals by default, but we can also use confint, since {emmeans} includes an S3 method for confint that handles emmGrid objects.

> confint(pigs.emm.s1)
 source emmean     SE df lower.CL upper.CL
 fish     3.39 0.0367 23     3.32     3.47
 soy      3.67 0.0374 23     3.59     3.74
 skim     3.80 0.0394 23     3.72     3.88

Results are averaged over the levels of: percent 
Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

You can add the sample sizes for each group using the calc input, which appends extra columns to the output. If we look at pigs.emm.s1@grid, we see a column .wgt. which contains the sample size for each source group.
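
(A quick way to peek at that slot, with the output omitted here:)

pigs.emm.s1@grid  # should show the source levels plus the .wgt. column of group sizes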

> confint(pigs.emm.s1, calc = c(n = ~.wgt.))
 source emmean     SE df  n lower.CL upper.CL
 fish     3.39 0.0367 23 10     3.32     3.47
 soy      3.67 0.0374 23 10     3.59     3.74
 skim     3.80 0.0394 23  9     3.72     3.88

Results are averaged over the levels of: percent 
Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

The confint method has a lot of options as well, such as one-sided intervals, the confidence level, and backtransforming.

> confint(pigs.emm.s1, side = ">", level = .90, type = "response")
 source response   SE df lower.CL upper.CL
 fish       29.8 1.09 23     28.4      Inf
 soy        39.1 1.47 23     37.3      Inf
 skim       44.6 1.75 23     42.3      Inf

Results are averaged over the levels of: percent 
Confidence level used: 0.9 
Intervals are back-transformed from the log scale 

Nested estimated marginal means and confidence intervals

If we want marginal means across two factor variables, we can use the pipe operator | in the specs input. The variable to the right of the pipe acts as a grouping variable, defining the groups within which the means are reported.

> emmeans(pigs.lm1, ~ percent | source)
source = fish:
 percent emmean     SE df lower.CL upper.CL
       9   3.22 0.0536 23     3.11     3.33
      12   3.40 0.0493 23     3.30     3.50
      15   3.44 0.0548 23     3.32     3.55
      18   3.52 0.0547 23     3.41     3.63

source = soy:
 percent emmean     SE df lower.CL upper.CL
       9   3.49 0.0498 23     3.39     3.60
      12   3.67 0.0489 23     3.57     3.77
      15   3.71 0.0507 23     3.61     3.82
      18   3.79 0.0640 23     3.66     3.93

source = skim:
 percent emmean     SE df lower.CL upper.CL
       9   3.62 0.0501 23     3.52     3.73
      12   3.80 0.0494 23     3.70     3.90
      15   3.84 0.0549 23     3.73     3.95
      18   3.92 0.0646 23     3.79     4.06

Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

Interestingly, this is equivalent to feeding the reference grid into confint:

> confint(pigs.rg, by = "source")
source = fish:
 percent prediction     SE df lower.CL upper.CL
       9       3.22 0.0536 23     3.11     3.33
      12       3.40 0.0493 23     3.30     3.50
      15       3.44 0.0548 23     3.32     3.55
      18       3.52 0.0547 23     3.41     3.63

source = soy:
 percent prediction     SE df lower.CL upper.CL
       9       3.49 0.0498 23     3.39     3.60
      12       3.67 0.0489 23     3.57     3.77
      15       3.71 0.0507 23     3.61     3.82
      18       3.79 0.0640 23     3.66     3.93

source = skim:
 percent prediction     SE df lower.CL upper.CL
       9       3.62 0.0501 23     3.52     3.73
      12       3.80 0.0494 23     3.70     3.90
      15       3.84 0.0549 23     3.73     3.95
      18       3.92 0.0646 23     3.79     4.06

Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

Doing pairwise comparisons

Pairwise comparisons?? Now we’re talking about the good stuff!

Here, Lenth switches over to the warpbreaks dataset, modeling the number of breaks in yarn as a function of the type of wool (A or B) and the tension (L, M, or H).

We can do pairwise comparisons using the emmeans::pairs function, which takes an emmGrid object.

> warp.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
> warp.pw <- pairs(emmeans(warp.lm, ~ tension | wool))
> warp.pw
wool = A:
 contrast estimate   SE df t.ratio p.value
 L - M      20.556 5.16 48   3.986  0.0007
 L - H      20.000 5.16 48   3.878  0.0009
 M - H      -0.556 5.16 48  -0.108  0.9936

wool = B:
 contrast estimate   SE df t.ratio p.value
 L - M      -0.556 5.16 48  -0.108  0.9936
 L - H       9.444 5.16 48   1.831  0.1704
 M - H      10.000 5.16 48   1.939  0.1389

P value adjustment: tukey method for comparing a family of 3 estimates

The default p-value adjustment uses the studentized range distribution a la Tukey’s Honestly Significant Differences. Each wool group is corrected separately from the other, explaining why the output says a “family of 3 estimates”.

We can correct across all pairwise comparisons at once by calling emmeans::test and specifying by = NULL, e.g. something like:

> test(warp.pw, by = NULL, adjust = "mvt")
 contrast wool estimate   SE df t.ratio p.value
 L - M    A      20.556 5.16 48   3.986  0.0013
 L - H    A      20.000 5.16 48   3.878  0.0018
 M - H    A      -0.556 5.16 48  -0.108  1.0000
 L - M    B      -0.556 5.16 48  -0.108  1.0000
 L - H    B       9.444 5.16 48   1.831  0.3080
 M - H    B      10.000 5.16 48   1.939  0.2553

P value adjustment: mvt method for 6 tests 

We can also use the cross.adjust input, which applies an additional (potentially different) p-value adjustment across the groups, on top of the adjustment applied within each group.

> test(warp.pw, adjust = "tukey", cross.adjust = "bonferroni")
wool = A:
 contrast estimate   SE df t.ratio p.value
 L - M      20.556 5.16 48   3.986  0.0013
 L - H      20.000 5.16 48   3.878  0.0018
 M - H      -0.556 5.16 48  -0.108  1.0000

wool = B:
 contrast estimate   SE df t.ratio p.value
 L - M      -0.556 5.16 48  -0.108  1.0000
 L - H       9.444 5.16 48   1.831  0.3407
 M - H      10.000 5.16 48   1.939  0.2777

P value adjustment: tukey method for comparing a family of 3 estimates 
Cross-group P-value adjustment: bonferroni 

Contrasts

{emmeans} comes with an emmeans::contrast function for looking at more specific comparisons. The simple input refers to the variable that should be contrasted; everything else basically becomes a by variable. The method input can define the contrast of interest, for example:

  • “consec” compares consecutive levels of the variable
  • “dunnett” refers to Dunnett’s test, which compares each level against a reference level; this can also be entered as “trt.vs.ctrl” (a sketch follows the consec example below)
> contrast(pigs.rg,
+          method = "consec", # compare consecutive levels of percent
+          simple = "percent")
source = fish:
 contrast              estimate     SE df t.ratio p.value
 percent12 - percent9    0.1796 0.0561 23   3.202  0.0112
 percent15 - percent12   0.0378 0.0582 23   0.650  0.8614
 percent18 - percent15   0.0825 0.0691 23   1.194  0.5199

source = soy:
 contrast              estimate     SE df t.ratio p.value
 percent12 - percent9    0.1796 0.0561 23   3.202  0.0111
 percent15 - percent12   0.0378 0.0582 23   0.650  0.8613
 percent18 - percent15   0.0825 0.0691 23   1.194  0.5201

source = skim:
 contrast              estimate     SE df t.ratio p.value
 percent12 - percent9    0.1796 0.0561 23   3.202  0.0110
 percent15 - percent12   0.0378 0.0582 23   0.650  0.8612
 percent18 - percent15   0.0825 0.0691 23   1.194  0.5200

Results are given on the log (not the response) scale. 
P value adjustment: mvt method for 3 tests
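
The “dunnett” method works similarly; a minimal sketch, mirroring the call above, that compares each percent level back to the first level (9) within each source:

contrast(pigs.rg,
         method = "dunnett", # compare each level of percent to the reference level
         simple = "percent")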

Note that we can also do pairwise contrasts. These two calls produce the same output:

> contrast(pigs.rg,
+          method = "pairwise", 
+          simple = "percent")
source = fish:
 contrast              estimate     SE df t.ratio p.value
 percent9 - percent12   -0.1796 0.0561 23  -3.202  0.0193
 percent9 - percent15   -0.2174 0.0597 23  -3.640  0.0070
 percent9 - percent18   -0.2998 0.0676 23  -4.434  0.0010
 percent12 - percent15  -0.0378 0.0582 23  -0.650  0.9143
 percent12 - percent18  -0.1203 0.0654 23  -1.839  0.2814
 percent15 - percent18  -0.0825 0.0691 23  -1.194  0.6367

source = soy:
 contrast              estimate     SE df t.ratio p.value
 percent9 - percent12   -0.1796 0.0561 23  -3.202  0.0193
 percent9 - percent15   -0.2174 0.0597 23  -3.640  0.0070
 percent9 - percent18   -0.2998 0.0676 23  -4.434  0.0010
 percent12 - percent15  -0.0378 0.0582 23  -0.650  0.9143
 percent12 - percent18  -0.1203 0.0654 23  -1.839  0.2814
 percent15 - percent18  -0.0825 0.0691 23  -1.194  0.6367

source = skim:
 contrast              estimate     SE df t.ratio p.value
 percent9 - percent12   -0.1796 0.0561 23  -3.202  0.0193
 percent9 - percent15   -0.2174 0.0597 23  -3.640  0.0070
 percent9 - percent18   -0.2998 0.0676 23  -4.434  0.0010
 percent12 - percent15  -0.0378 0.0582 23  -0.650  0.9143
 percent12 - percent18  -0.1203 0.0654 23  -1.839  0.2814
 percent15 - percent18  -0.0825 0.0691 23  -1.194  0.6367

Results are given on the log (not the response) scale. 
P value adjustment: tukey method for comparing a family of 4 estimates 

> pairs(emmeans(pigs.lm1, ~ percent | source))
source = fish:
 contrast              estimate     SE df t.ratio p.value
 percent9 - percent12   -0.1796 0.0561 23  -3.202  0.0193
 percent9 - percent15   -0.2174 0.0597 23  -3.640  0.0070
 percent9 - percent18   -0.2998 0.0676 23  -4.434  0.0010
 percent12 - percent15  -0.0378 0.0582 23  -0.650  0.9143
 percent12 - percent18  -0.1203 0.0654 23  -1.839  0.2814
 percent15 - percent18  -0.0825 0.0691 23  -1.194  0.6367

source = soy:
 contrast              estimate     SE df t.ratio p.value
 percent9 - percent12   -0.1796 0.0561 23  -3.202  0.0193
 percent9 - percent15   -0.2174 0.0597 23  -3.640  0.0070
 percent9 - percent18   -0.2998 0.0676 23  -4.434  0.0010
 percent12 - percent15  -0.0378 0.0582 23  -0.650  0.9143
 percent12 - percent18  -0.1203 0.0654 23  -1.839  0.2814
 percent15 - percent18  -0.0825 0.0691 23  -1.194  0.6367

source = skim:
 contrast              estimate     SE df t.ratio p.value
 percent9 - percent12   -0.1796 0.0561 23  -3.202  0.0193
 percent9 - percent15   -0.2174 0.0597 23  -3.640  0.0070
 percent9 - percent18   -0.2998 0.0676 23  -4.434  0.0010
 percent12 - percent15  -0.0378 0.0582 23  -0.650  0.9143
 percent12 - percent18  -0.1203 0.0654 23  -1.839  0.2814
 percent15 - percent18  -0.0825 0.0691 23  -1.194  0.6367

Results are given on the log (not the response) scale. 
P value adjustment: tukey method for comparing a family of 4 estimates

As for why the Tukey method corrects for a family of 4 estimates: the family is the 4 percent means from which the 6 pairwise contrasts are drawn, and the studentized range correction is sized by the number of means, not the number of tests. And the pairwise comparisons are identical across sources because pigs.lm1 is additive: with no source:percent interaction in the model, the estimated percent effects are the same within every source.
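
To see those comparisons actually diverge, one illustrative check (pigs.lm.int is a name I’m introducing here) is to refit the model with the interaction:

pigs.lm.int <- lm(log(conc) ~ source * factor(percent), data = pigs)
pairs(emmeans(pigs.lm.int, ~ percent | source))  # within-source comparisons now differ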