I wouldn’t be the first student to come across some topic in a statistics theory lecture and think “ok, but why do I need to know this in order to throw stuff into lm()?” Ahh, younger me. Such was the case when I first learned about Jensen’s inequality. “What is that again?” I hear you ask, a statistics graduate maybe 2.5 years out of school. Jensen’s inequality relates to functions of expected values: if we have a function $g$ that is convex (e.g. sits like a tarp full of rainwater) and a random variable $X$, then

$$E[g(X)] \geq g(E[X])$$

The opposite is true if $g$ is concave (e.g. it droops like the snoot of a Concorde jet). Another way of saying Jensen’s inequality is that equality of expectations is only preserved when functions are linear.
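A quick numeric check of the convex case makes the inequality concrete (this sketch is mine, not from a textbook example; I use $g(x) = x^2$ and an exponential random variable):

```r
# Numeric check of Jensen's inequality with the convex function g(x) = x^2.
set.seed(42)
x <- rexp(1e5, rate = 1)  # a positive random variable with E[X] = 1
mean_of_g <- mean(x^2)    # approximates E[g(X)]
g_of_mean <- mean(x)^2    # approximates g(E[X])
mean_of_g >= g_of_mean    # TRUE: E[g(X)] >= g(E[X])
```

For $g(x) = x^2$ the gap between the two sides is exactly the variance of $X$, which is why it can never be negative.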

Jensen’s inequality appears a lot in MS exams, courses, and sometimes job interviews. For example, it explains why the typical sample standard deviation estimate $s$ is not an unbiased estimator for the population standard deviation (Bessel’s correction, the $n-1$ denominator, only removes the bias from the population variance estimate; taking the square root, a concave function, reintroduces bias). But after I graduated from my MS, I never thought I would have to deal with it again.
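You can see that residual bias in a small simulation (my own sketch: repeatedly estimate the standard deviation from small normal samples and average the estimates):

```r
# Even with Bessel's correction, sd() is biased low for sigma:
# E[sqrt(V)] < sqrt(E[V]) because sqrt is concave (Jensen's inequality).
set.seed(1)
sigma <- 2
s_hat <- replicate(10000, sd(rnorm(5, mean = 0, sd = sigma)))
mean(s_hat)  # averages noticeably below sigma = 2 for samples of size 5
```

The variance estimates themselves average out to $\sigma^2$; it’s only the square root that drags the expectation down.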

Nick has to deal with it again

A common situation in my field is detecting evidence of change in performance for a particular assay. Let’s say that we have an assay that detects some analyte. It’s already validated and is known to do its job quite well, but now we want to make some minor changes to the workflow or equipment. How can we measure the effect of these changes so that we can compare it to our tolerable limits?

Well, this depends in part on how we define our tolerable limits. If it’s something straightforward like “the difference cannot be greater than X units”, maybe we can just throw the new equipment into a linear model as a factor variable and check the fixed-effect coefficients on the measurements. But what if it’s slightly more complicated and defined like “cannot be more than X% different”? Ugh. In the context of linear models, now we’re likely dealing with log transformations.

Fortunately, log transformations are the “nicest” transformation to play around with while still making reasonable interpretations on the original scale of your data. I think everyone who has taken an applied statistics class has learned about applying the natural log to each value of the dependent variable and then rerunning the regression. Let’s say we have one independent variable, so what comes out of your statistical software is now this:

$$\log(y) = \beta_0 + \beta_1 x$$

To get back to our original scale (and figure out what effect $\beta_1$ has on our dependent variable $y$), we can back-transform this by exponentiating both sides:

$$y = \exp(\beta_0 + \beta_1 x) = \exp(\beta_0)\exp(\beta_1 x)$$

Changing $x$ by 1 unit leads to a multiplicative change of $\exp(\beta_1)$ in our dependent variable. Log transformations are useful in a few different scenarios (heteroskedasticity being the big one covered in school), but they are also useful when we’re trying to estimate percent change of a continuous measurement.
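Here is a small worked example of that interpretation (the data are simulated by me with a known multiplicative effect, so we can check that $\exp(\beta_1)$ recovers it):

```r
# Simulate y with a true multiplicative effect of exp(0.2) ~ 1.22 per unit of x,
# then recover it from a log-transformed regression.
set.seed(7)
x <- rep(0:1, each = 50)
y <- exp(1 + 0.2 * x + rnorm(100, sd = 0.05))
fit <- lm(log(y) ~ x)
exp(coef(fit)[["x"]])  # close to 1.22: each unit of x multiplies y by about 22%
```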

Thanks for reviewing STAT102. How does this relate to Jensen’s Inequality?

I’ve run into the problem above a few times, but I’ve been thinking about a different approach to it. Where it starts to get a little more interesting is in comparing log transformations to log link functions.

In more applied statistical terms: what is the difference between log transforming your original data and then running a regression, versus running a generalized linear model with a log link function? As it turns out, the difference has everything to do with Jensen’s inequality.

Remember that a linear model is basically a way of calculating conditional means, e.g. “if our independent variable $X$ equals some value, what is the average response $Y$?” When we log transform a linear model, what we’re really asking for is this:

$$E[\log(Y) \mid X] = \beta_0 + \beta_1 X$$

where $E$ is the expected value operator. Meanwhile, what we’re asking for in a generalized linear model with a log link is this:

$$\log(E[Y \mid X]) = \beta_0 + \beta_1 X$$

According to Jensen’s inequality, these two things are not equal! More specifically, since the natural log is a concave function, $E[\log(Y)] \leq \log(E[Y])$.

Why would we prefer the generalized linear model approach?

Good question. In fact, a fundamental question. I think the central issue is really asking ourselves what we are trying to model. If we log transform, we are modelling the response variable in log units, whereas with a log link, the response variable stays on the original scale. In the case of the log transform we can get back to the original scale by back-transforming, but I think it is interesting that we need to incorporate an additional conceptual step.

What does Jensen’s Inequality look like?

The following simulation code loops over 100 Gaussian distributions with increasing mean $m$ (and standard deviation $m/5$). For each distribution it draws 100 samples of 20 observations each; for each sample it computes the mean of the log-transformed data points and the log of the mean of the data points, then averages each quantity over the 100 samples.

xlbars <- vector(mode = "numeric", length = 100)
lxbars <- vector(mode = "numeric", length = 100)
for (m in seq(1, 100, by = 1)) {
  xlbar <- vector(mode = "numeric", length = 100)
  lxbar <- vector(mode = "numeric", length = 100)
  for (i in 1:100) {
    x <- rnorm(20, mean = m, sd = m / 5)
    lx <- log(x)
    xlbar[i] <- mean(lx)      # mean of the logs
    lxbar[i] <- log(mean(x))  # log of the mean
  }
  xlbars[m] <- mean(xlbar)
  lxbars[m] <- mean(lxbar)
}
plot(lxbars, xlbars)
abline(a = 0, b = 1)
d <- data.frame(lxbars, xlbars)
d$diff <- d$lxbars - d$xlbars


While it’s hard to see in the plot, the points almost all sit below the identity line, meaning that the log of the mean is greater than the mean of the log. This is easier to see with a numeric summary: the column diff is explicitly the difference between the log of the mean and the mean of the logs. The fact that diff is positive throughout tells us the log of the mean is greater.

> summary(d)
     lxbars             xlbars              diff        
 Min.   :0.001304   Min.   :-0.01802   Min.   :0.01874  
 1st Qu.:3.246754   1st Qu.: 3.22594   1st Qu.:0.01993  
 Median :3.925583   Median : 3.90574   Median :0.02027  
 Mean   :3.637099   Mean   : 3.61673   Mean   :0.02037  
 3rd Qu.:4.317022   3rd Qu.: 4.29727   3rd Qu.:0.02090  
 Max.   :4.607214   Max.   : 4.58695   Max.   :0.02186 

However, from looking at this summary, the difference is relatively small. This is also apparent when modelling in R. Here is the output on a simulated dataset with a 5% difference between two machines.
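The post doesn’t show how the tmp dataset was built, so here is one way such data could be simulated. The variable names output and machines match the models below, but the distributional details (sample size, baseline level, noise) are my assumptions:

```r
# Hypothetical recreation of a dataset like `tmp`: two machines whose outputs
# differ by ~5% on the original (multiplicative) scale.
set.seed(123)
n <- 200
machines <- factor(rep(c("A", "B"), each = n))
output <- exp(-1.9 + log(0.95) * (machines == "B") + rnorm(2 * n, sd = 0.05))
tmp <- data.frame(output, machines)
```

With data generated this way, the machinesB coefficient in either model should land near $\log(0.95) \approx -0.051$, matching the scale of the estimates shown below.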

boop <- glm(output ~ machines, data = tmp, family = gaussian(link = "log"))
boop2 <- lm(log(output) ~ machines, data = tmp)

Call:
glm(formula = output ~ machines, family = gaussian(link = "log"), 
    data = tmp)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-0.0245552  -0.0061303   0.0007913   0.0058882   0.0233464  

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -1.900738   0.006614 -287.389  < 2e-16 ***
machinesB   -0.050026   0.009596   -5.213 4.65e-07 ***

Call:
lm(formula = log(output) ~ machines, data = tmp)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.186988 -0.040725  0.007296  0.041641  0.154679 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -1.902754   0.006877 -276.684  < 2e-16 ***
machinesB   -0.050637   0.009726   -5.207  4.8e-07 ***

The estimated effect on the original scale, after exponentiating, is:

> exp(-.050026) # glm with log link
[1] 0.9512047
> exp(-.050637) # lm with log transform
[1] 0.9506237

They don’t seem that different

Yeah, but I think it’s still important for your own knowledge that there is a difference between the two approaches. Practically speaking, it looks like there isn’t a compelling reason to use a GLM with a log link over a log-transformed LM here, but I do think that keeping Jensen’s inequality in mind when doing this type of modelling makes it clearer what we are trying to measure.