---
title: "Box-cox transformation"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: show
    fig_align: center
    fig_height: 4.5
    fig_width: 4.5
    number_sections: yes
    theme: lumen
    toc: yes
    toc_float: yes
  word_document:
    toc: yes
---
# Motivation
Linear models may not meet the assumptions of Ordinary Least Squares (OLS), which are:

1. Independent residuals
2. No relation between residuals and fitted values
3. Normally-distributed residuals
4. Homoscedastic (equal-variance) residuals, at each fitted value
5. No high-leverage points with large residuals
6. Linearity in all predictors, independently (perhaps also with interaction)
(1) can be dealt with by Generalized Least Squares (GLS), if the nature of the dependence is known (e.g., time or space dependence, repeated measurements on the same object).
(2) usually indicates a non-linear relation; (3) and (4) may be due to that, but also to unusual values, and especially to skewed predictor or response variables.
For example, in the Meuse soil pollution dataset, we model one of the metal concentrations in soil (zinc) from the elevation above sea level and the distance from the Maas river:
```{r fig.height=8, fig.width=8, out.width='80%'}
library(sp); library(gstat)
data(meuse) # an example dataset supplied with the sp package
names(meuse)
summary(meuse$zinc)
summary(m.ols <- lm(zinc ~ elev + dist, data=meuse))
par(mfrow=c(2,2))
plot(m.ols)
par(mfrow=c(1,1))
```
We see all of the above-mentioned problems.
# The Box-Cox transform
Box and Cox (1964) developed a family of transformations designed to reduce non-normality of the errors in a linear model, i.e., our problem (3) above. Applying this transform often reduces non-linearity, i.e., our problem (2), and heteroscedasticity, i.e., our problem (4), as well.
The linear model is:
$$
Y_i=\beta_0+\mathbf{\beta X}_i+\varepsilon_i, \;\;\varepsilon_i\sim\mathcal{N}(0,\sigma^2)
$$
The idea is to transform the response variable $Y$ to a replacement response variable $Y_i^{(\lambda)}$, leaving the right-hand side of the regression model unchanged, so that the regression residuals become normally-distributed. Note that the regression coefficients will also change, because the response variable has changed; therefore the regression coefficients must be interpreted with respect to the transformed variable. Also, any predictions made with the model have to be back-transformed, to be interpreted in the original units.
The standard (simple) Box-Cox transform is:
$$
Y_i^{(\lambda)}=\begin{cases}
\dfrac{Y_i^\lambda-1}{\lambda}&(\lambda\neq0)\\
\log{(Y_i)} & (\lambda=0)
\end{cases}
$$
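The $\lambda = 0$ case is not arbitrary: it is the limit of the general formula as $\lambda \to 0$. A quick numerical check (illustrative values only):

```r
# as lambda -> 0, (y^lambda - 1)/lambda approaches log(y)
y <- c(2, 10, 100)
for (lambda in c(0.1, 0.01, 0.001)) {
    cat("lambda =", lambda, ": max |difference from log(y)| =",
        max(abs((y^lambda - 1)/lambda - log(y))), "\n")
}
```

The maximum difference shrinks towards zero as $\lambda$ approaches 0.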
The inverse transform is:
$$
Y_i=\begin{cases}
\exp \left( \dfrac{\log(1 + \lambda Y_i^{(\lambda)})}{\lambda} \right) & (\lambda\neq0)\\
\exp \left( Y_i^{(\lambda)} \right) & (\lambda=0)\\
\end{cases}
$$
Obviously, this depends on a parameter $\lambda$. Some familiar transformations are associated with certain values of $\lambda$:

* $\lambda$ = 1.00: no transformation needed; produces the original data, shifted by 1
* $\lambda$ = 0.50: square root transformation
* $\lambda$ = 0.33: cube root transformation
* $\lambda$ = 0.25: fourth root transformation
* $\lambda$ = 0.00: natural log transformation
* $\lambda$ = -0.50: reciprocal square root transformation
* $\lambda$ = -1.00: reciprocal (inverse) transformation
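These correspondences hold because, for $\lambda \neq 0$, the Box-Cox transform is just the plain power $Y^\lambda$ shifted and rescaled, so it is perfectly linearly related to the familiar transformation. A quick check for $\lambda = 0.5$, with arbitrary illustrative values:

```r
y <- c(1, 4, 9, 16, 25)
bc.half <- (y^0.5 - 1) / 0.5   # Box-Cox with lambda = 0.5, i.e. 2*sqrt(y) - 2
cor(bc.half, sqrt(y))          # exactly linearly related to the square root
```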
The optimal value of $\lambda$ is found by _maximum likelihood_, i.e., which value of $\lambda$ is most likely to have given rise to the observed distribution of the model residuals. ^[https://www.r-bloggers.com/on-box-cox-transform-in-regression-models/]
The log-likelihood of the linear model, assuming independent observations (i.e., solution by OLS), is:
$$
\begin{aligned}
\log\mathcal{L} = &-\frac{n}{2}\log(2\pi)-n\log\sigma\\&-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)\right]^2\\&+(\lambda-1)\sum_{i=1}^n\log Y_i
\end{aligned}
$$
This can also be written:
$$
\log\mathcal{L} \propto -\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)}{\dot{Y}^\lambda}\right)^2\right]
$$
(the $\propto$ is because some constants have been removed) where $\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]$, i.e., the exponential of the mean of the logarithms of the variable.
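This profile log-likelihood is easy to compute by hand over a grid of $\lambda$. The following sketch uses the first (unscaled) form above, dropping constants, with the `meuse` data from the example; `bc.loglik` is an ad-hoc helper name:

```r
library(sp)
data(meuse)
# profile log-likelihood of lambda, up to an additive constant:
# -n/2 * log(RSS_lambda) + (lambda - 1) * sum(log y)   (Jacobian term)
bc.loglik <- function(lambda, y, X) {
    yt <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1)/lambda
    rss <- sum(residuals(lm(yt ~ X))^2)
    -length(y)/2 * log(rss) + (lambda - 1) * sum(log(y))
}
lams <- seq(-2, 2, by = 0.1)
ll <- sapply(lams, bc.loglik, y = meuse$zinc, X = cbind(meuse$elev, meuse$dist))
lams[which.max(ll)]  # grid optimum
```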
Fortunately, the function `boxcox` has been implemented in the `MASS` package; this computes the likelihood at user-supplied values of $\lambda$, or by default over $\lambda \in [-2, 2]$ in steps of $0.1$; we can then find the optimum, i.e., the $\lambda$ with maximum log-likelihood.
For the Meuse linear model example:
```{r}
library(MASS)
bc <- boxcox(m.ols)
str(bc) # numbers that make up the graph
(bc.power <- bc$x[which.max(bc$y)])
```
The optimal $\lambda$ is `r round(bc.power, 3)`, i.e., towards an inverse power. Note that the confidence interval does not include 1 (no transformation), so a transformation is indicated. A logarithmic transformation ($\lambda = 0$) would also not be optimal.
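The approximate 95% confidence interval that `boxcox` draws on its plot can be extracted from the returned grid: it is the set of $\lambda$ whose log-likelihood lies within $\chi^2_{1,\,0.95}/2$ of the maximum. A self-contained sketch, recomputing the model from the example above:

```r
library(MASS); library(sp)
data(meuse)
m.ols <- lm(zinc ~ elev + dist, data = meuse)
bc <- boxcox(m.ols, plotit = FALSE)
# lambdas not rejected by a likelihood-ratio test at the 5% level
(ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1)/2]))
```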
# Transformation and back-transformation
Now we need to transform the response variable accordingly, and re-compute the linear model, with this transformed variable.
We can write a function corresponding to the formula:
```{r}
BCTransform <- function(y, lambda=0) {
  # forward Box-Cox transform; natural log in the limiting case lambda = 0
  if (lambda == 0) { log(y) }
  else { (y^lambda - 1) / lambda }
}
```
And of course we need a reverse transformation:
```{r}
BCTransformInverse <- function(yt, lambda=0) {
  # inverse Box-Cox transform: recover y on the original scale
  if (lambda == 0) { exp(yt) }
  else { exp(log(1 + lambda * yt)/lambda) }
}
```
```{r}
# testing
yt <- BCTransform(meuse$zinc, 0)
yo <- BCTransformInverse(yt, 0)
unique(round(yo - meuse$zinc, 8))
yt <- BCTransform(meuse$zinc, .5)
yo <- BCTransformInverse(yt, .5)
unique(round(yo - meuse$zinc, 8))
```
# Applying the transformation
Use this function with the optimal value of $\lambda$ to transform the response variable:
```{r fig.width=6, fig.height=4}
hist(meuse$zinc, breaks=12); rug(meuse$zinc)
meuse$zinc.bc <- BCTransform(meuse$zinc, bc.power)
hist(meuse$zinc.bc, breaks=12); rug(meuse$zinc.bc)
```
The skew has been removed; the target variable is clearly bi-modal.
Compare to the log-transform:
```{r fig.width=6, fig.height=4}
hist(log(meuse$zinc), breaks=12); rug(log(meuse$zinc))
```
But the test of the transformation is the distribution of the linear model residuals, not of the target variable itself.
Re-run the linear model with transformation:
```{r fig.height=8, fig.width=8, out.width='80%'}
summary(m.ols.bc <- lm(zinc.bc ~ elev + dist, data=meuse))
par(mfrow=c(2,2))
plot(m.ols.bc)
par(mfrow=c(1,1))
```
The residuals are now close to normally-distributed and less related to the fitted values; the apparent heteroscedasticity is mostly due to the uneven numbers of observations at different fitted values.
Compare this to the linear model with a log-transformation:
```{r fig.height=8, fig.width=8, out.width='80%'}
summary(m.ols.log <- lm(log(zinc) ~ elev + dist, data=meuse))
par(mfrow=c(2,2))
plot(m.ols.log)
par(mfrow=c(1,1))
```
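As noted above, predictions from the transformed-response model must be back-transformed to be interpreted in the original units. A self-contained sketch (the simple inversion gives a median-type prediction; a mean-unbiased back-transformation needs a correction not shown here):

```r
library(MASS); library(sp)
data(meuse)
m.ols <- lm(zinc ~ elev + dist, data = meuse)
bc <- boxcox(m.ols, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
meuse$zinc.bc <- (meuse$zinc^lambda - 1)/lambda
m.ols.bc <- lm(zinc.bc ~ elev + dist, data = meuse)
# predict on the transformed scale, then invert the Box-Cox transform
pred.bc <- predict(m.ols.bc, newdata = meuse[1:5, ])
pred.orig <- (lambda * pred.bc + 1)^(1/lambda)
cbind(transformed = pred.bc, back.transformed = pred.orig)
```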
# Another transformation function
The `powerTransform` function of the `car` package also uses the maximum-likelihood-like approach of Box and Cox (1964) to select a transformation of a univariate or multivariate response for normality, linearity and/or constant variance. Its summary method automatically computes two or three likelihood-ratio tests concerning the transformation powers.
We use this on the Meuse Zn linear model:
```{r}
require(car)
(bc.car <- powerTransform(m.ols))
str(bc.car, max.level=1)
print(bc.car$lambda)
summary(bc.car)
```
We see that this estimate from `car`, `r round(bc.car$lambda, 3)`, is not quite the same as the estimate from the `MASS` function, `r round(bc.power, 3)`; `boxcox` only evaluates the likelihood on a grid of $\lambda$ values, while `powerTransform` optimizes numerically.
The LR tests here show that a log-transform would not be optimal, and that some transformation is surely needed.
The `car` package also has a function `bcPower` to carry out the transformation:
```{r}
meuse$zinc.bc.car <- bcPower(meuse$zinc, lambda=bc.car$lambda)
hist(meuse$zinc.bc.car, breaks=12); rug(meuse$zinc.bc.car)
```
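For $\lambda \neq 0$, `bcPower` computes the same simple formula used above; a quick check with arbitrary illustrative values:

```r
library(car)
y <- c(100, 200, 500, 1000)
lambda <- -0.25  # arbitrary illustrative power
# car::bcPower implements (y^lambda - 1)/lambda for lambda != 0
all.equal(bcPower(y, lambda), (y^lambda - 1)/lambda)
```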
Re-run the linear model with this transformation:
```{r fig.height=8, fig.width=8, out.width='80%'}
summary(m.ols.bc.car <- lm(zinc.bc.car ~ elev + dist, data=meuse))
par(mfrow=c(2,2))
plot(m.ols.bc.car)
par(mfrow=c(1,1))
```
This is also quite good.
# Analyzing the linear model of the transformed variable
Note that this model is conditioned on the data in two ways: (1) the usual estimation from noisy data points, and (2) the value of $\lambda$ selected for the transformation, which was itself estimated from the same data. Condition (2) does not apply to user-selected, _a priori_, transforms.
# Transforming a predictor
In this case we want the predictor (to be transformed) to be represented as the response $Y$; this is accomplished with the null model, i.e., a variable predicted only by its mean: $Y_i = \beta_0 + \varepsilon_i$. The distribution of the residuals is the same as that of the variable, just centred at its mean.
For example, supposing that Zn would be used as a predictor in a model, and we would like to transform it to near-normality:
```{r}
m.null <- lm(zinc ~ 1, data=meuse)
summary(m.null)
plot(m.null, which=2)
qqnorm(meuse$zinc); qqline(meuse$zinc)
hist(meuse$zinc, breaks=12); rug(meuse$zinc)
```
We see that the model residuals follow the same distribution as the original data values.
There is serious skew: far too many small values and far too few large ones for a normal distribution.
Now compute the transformation parameter:
```{r}
bc.null <- boxcox(m.null)
(bc.power.null <- bc.null$x[which.max(bc.null$y)])
summary(bc.power.null.car <- powerTransform(m.null))
```
The optimal transformation from `MASS` is `r round(bc.power.null, 3)` and from `car` is `r round(bc.power.null.car$lambda, 3)`. As with the residuals of the predictive model for Zn, these are negative, but closer to a log transformation ($\lambda = 0$).
Transform the variable and re-compute the distribution:
```{r fig.width=6, fig.height=4}
meuse$zinc.bc.null <- BCTransform(meuse$zinc, bc.power.null)
qqnorm(meuse$zinc.bc.null); qqline(meuse$zinc.bc.null)
hist(meuse$zinc.bc.null, breaks=12); rug(meuse$zinc.bc.null)
```
This is somewhat better -- at least it is symmetric, although the tails have become too "light". This is because this variable has a bimodal distribution, which is difficult to see in the original histogram.
# Caution
The Box-Cox transform should not be applied if another transformation is indicated by theory. An example from Box and Cox (1964) is a chemical reaction, where physical theory suggests that the reaction rate $r$ should be modelled as $r \propto c^{\alpha} \exp(\beta t)$, where $c$ is the concentration of the reactants and $t$ is the temperature. Taking logarithms linearizes this as $\log r = \mathrm{const} + \alpha \log c + \beta t$, which can then be fit for $\alpha$ and $\beta$. This transformation is indicated by theory, not chosen empirically as with Box-Cox.
Also, if the suggested Box-Cox power is close to 0, many analysts use the log-transform even though it is not optimal, for ease of interpretation.
# References
Box, G. E. P., & Cox, D. R. (1964). An Analysis of Transformations (with discussion). Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211–252.