---
title: "Model evaluation"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    number_sections: TRUE
    theme: "lumen"
    code_folding: show
    fig_height: 5.5
    fig_width: 5.5
    fig_align: 'center'
    toc: TRUE
    toc_float: TRUE
---
```{r setup, echo=FALSE}
options(show.signif.stars=FALSE)
```
With any predictive method, we would like to know how good it is. This is model *evaluation*, often called model *validation*[^1].
This is in contrast with model *calibration*, when we are building (fitting) the model.
[^1]: We prefer the term "evaluation" because "validation" implies that the model is correct ("valid"); that of course is never the case. We want to evaluate how close it comes to reality.
See Oreskes, N. (1998). Evaluation (not validation) of quantitative models. Environmental Health Perspectives, 106(Suppl 6), 1453–1460, and Oreskes, N., et al. (1994). Verification, validation, and confirmation of numerical models in the earth sciences. Science, 263, 641–646.
# Introduction
To evaluate the *predictive accuracy* of a model, we compare the *model predictions* against *actual values* of the same items. These items ideally come from an independent dataset representative of the target population. An alternative is *cross-validation*, where some items from a single dataset are held out of model building and then predicted by a model fitted to the remaining items.
Notation: actual values $y_i$, predicted values $\hat{y}_i$, for a set of $n$ paired values (actual and predicted). Their means are $\bar{y}$ and $\bar{\hat{y}}$.
The evaluation dataset should be representative of the *population* for which we want to make a statement.
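For example, here is a minimal sketch of a $k$-fold cross-validation loop. The data frame `dataset`, its `response` column, and the linear model are hypothetical placeholders (so this chunk is not evaluated); any predictive model could be substituted.
```{r cv.sketch, eval=FALSE}
# sketch only: 'dataset' and 'response' are hypothetical placeholders
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dataset)))  # random fold labels
predicted <- numeric(nrow(dataset))
for (i in 1:k) {
  calib <- dataset[folds != i, ]        # calibration (model-building) part
  holdout <- dataset[folds == i, ]      # left-out part, to be predicted
  m <- lm(response ~ ., data = calib)   # placeholder model; substitute any predictive method
  predicted[folds == i] <- predict(m, newdata = holdout)
}
# 'predicted' can then be compared to dataset$response with the measures below
```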
# Evaluation measures
## RMSE
The *root mean squared error* (RMSE) of the residuals (actual – predicted) measures how close, on average, the predictions are to reality.
$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^n (\widehat{y}_i - y_i)^2 \right]^{1/2}$$
The advantage of this measure is that it is expressed in the same units as the modelled variable, so it can be interpreted in terms of the model application.
To put this in application context, it may be compared to the actual range, inter-quartile range, or standard deviation of the evaluation set. These comparisons show how large the prediction error is relative to the overall variability of the dataset.
## Variance explained against a 1:1 line
This is similar to an $R^2$ "coefficient of determination", but is computed against the 1:1 line of actual vs. predicted. This measures how much of the variance in the actual values is explained by the prediction. It is the complement of the ratio of the *residual* sum-of-squares, i.e., left over after matching 1:1, to the *total* sum-of-squares, i.e., the total variability in the dataset:
$$ R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i(y_i- \hat{y}_i)^2}{\sum_i( y_i - \bar{y})^2}$$
This measure is normalized for the variability of the evaluation set, and shows how much of this variability is explained by the 1:1 line based on the model predictions. Ideally this is 1, i.e., _RSS_ = 0. Note that a poor model can actually give _RSS_ $>$ _TSS_, so that $R^2 < 0$.
## Bias
*Bias*, also known as the *mean prediction error* (MPE), compares the predicted vs. actual means of the evaluation dataset:
$$ \mathrm{MPE} = \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y}_i ) $$
This shows whether the model systematically over- or under-predicts. Ideally, this is zero.
To put this in application context, it may be compared to the actual mean or median.
## Gain
This is the slope $\beta_1$ of the *linear* relation between actual and predicted, fitted by ordinary least squares (OLS), assuming independent and identically-distributed residuals:
$$y = \beta_0 + \beta_1 \hat{y} + \epsilon; \;\;\epsilon \sim \mathcal{N}(0,\sigma^2)$$
Ideally this should be 1: a unit increase in the predicted value should correspond to a unit increase in the actual value.
If $\beta_1 > 1$ the higher actual values are under-predicted and the lower actual values are over-predicted. If $\beta_1 < 1$ the higher actual values are over-predicted and the lower actual values are under-predicted. It is fairly common for $\beta_1 > 1$ with machine-learning models, which often have difficulty reproducing the highest and lowest values, e.g., because of the averaging during bagging.
The $R^2$ (coefficient of determination) of this fitted linear model is _not_ an evaluation measure of the predictive model! It shows how much of the variability in the evaluation set can be explained by a linear model (_not_ by the 1:1 relation between actual and predicted). Thus it tells us how well an adjustment equation is able to match the two sets; such an equation could be used to adjust measurements made by one method to match another.
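As a sketch of such an adjustment, assuming vectors `actual` and `predicted` of evaluation values (hypothetical names, so this chunk is not evaluated):
```{r adjust.sketch, eval=FALSE}
# sketch only: 'actual' and 'predicted' are hypothetical evaluation vectors
m.adj <- lm(actual ~ predicted)   # adjustment (re-calibration) equation
coef(m.adj)                       # intercept (offset) and slope (gain)
adjusted <- fitted(m.adj)         # predictions adjusted towards the actual values
# new predictions could be adjusted with predict(m.adj, newdata = data.frame(predicted = ...))
# summary(m.adj)$r.squared is the R^2 of this adjustment, not of the 1:1 relation
```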
## Lin's Concordance
This is a measure of the deviation from the 1:1 line which includes all sources of deviation. It was developed to evaluate reproducibility of test procedures that are supposed to give the same result, but it applies equally well to compare actual vs. predicted by any model. It is a correlation coefficient corrected for the different means (bias) and standard deviations (slope) [^2].
$$\rho_c = \frac{2 \rho_{1,2} \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}$$
If the evaluation points are independent, the *sample* estimates $r_{1,2}, s_1, s_2, \overline{Y}_1, \overline{Y}_2$ can be used in place of the *population* parameters in the formula.
[^2]: Lin, L. I.-K. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1), 255–268.
# Examples
To illustrate evaluation statistics, we simulate some actual measurements and model outcomes.
Here are the presumed actual values:
```{r sim.actual}
n <- 256 # number of evaluation items
y <- 1:n     # the actual values: a simple increasing sequence
```
In order to have the same amount of "noise" in each example, we define it here:
```{r noise}
set.seed(316) # so the script will be reproducible, omit for repeated runs
noise <- rnorm(n=n, mean=0, sd=12)
```
## Noise only
First, we simulate a model that predicts with no *bias*, i.e., $\bar{y} = \bar{\hat{y}}$, and with unit *gain*, i.e., the slope of the linear relation of actual vs. predicted is 1.
```{r sim1to1}
yhat.1 <- y + noise
plot(y ~ yhat.1, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise only")
abline(0,1, lty=2); grid()
```
## Bias and noise
Second, a model with bias, but the same noise:
```{r sim.bias}
bias <- 20
yhat.2 <- y + bias + noise
plot(y ~ yhat.2, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise and bias")
abline(0,1, lty=2); grid()
```
Notice the same linear relation and the same scatter, but the model now systematically over-predicts.
## Gain and noise
Third, a model with gain but no bias, and the same noise:
```{r sim.gain}
gain <- 0.8
yhat.3 <- y*gain + noise
yhat.3 <- yhat.3 - (mean(yhat.3) - mean(y)) # remove bias
plot(y ~ yhat.3, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise and gain")
abline(0,1, lty=2); grid()
```
Notice how the higher actual values are systematically under-predicted, and the lower actual values are systematically over-predicted. The values at the centroid are not changed.
## Gain, bias and noise
```{r sim.gain.bias}
yhat.4 <- yhat.3 + bias
plot(y ~ yhat.4, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise, bias and gain")
abline(0,1, lty=2); grid()
```
# Evaluation of examples
Now let's see how the evaluation statistics compare:
## RMSE
```{r fun.rmse}
f.rmse <- function(x.a, x.p) {   # root mean squared error: actual x.a vs. predicted x.p
  sqrt(mean((x.a - x.p)^2))
}
```
```{r rmse}
f.rmse(y, yhat.1)
f.rmse(y, yhat.2)
f.rmse(y, yhat.3)
f.rmse(y, yhat.4)
```
Both gain and bias in the predictive model increase the RMSE.
These can be compared to various measures of spread in the actual dataset: (1) the range, (2) the inter-quartile range; (3) the standard deviation:
```{r rmse.compare}
range(y); f.rmse(y, yhat.1)/diff(range(y))
(iqr <- as.numeric(quantile(y, .75) - quantile(y, .25))); f.rmse(y, yhat.1)/iqr
sd(y); f.rmse(y, yhat.1)/sd(y)
```
The RMSE of this model is about `r round(f.rmse(y, yhat.1)/diff(range(y))*100)`% of the total range, about `r round(f.rmse(y, yhat.1)/iqr*100)`% of the IQR, and about `r round(f.rmse(y, yhat.1)/sd(y)*100)`% of the standard deviation of the test set.
## Variance explained against a 1:1 line
```{r fun.r2.1.1}
f.R2.1.1 <- function(x.a, x.p) {   # R^2 relative to the 1:1 line: 1 - RSS/TSS
  1 - sum((x.a - x.p)^2) / sum((x.a - mean(x.a))^2)
}
```
```{r r2.1.1}
f.R2.1.1(y, yhat.1)
f.R2.1.1(y, yhat.2)
f.R2.1.1(y, yhat.3)
f.R2.1.1(y, yhat.4)
```
This $R^2$ decreases with gain and bias. The first model just quantifies the signal vs. noise of the 1:1 relation; the others also include bias and/or gain as deviations from the 1:1 line.
Compare these to the linear model $R^2$:
```{r r2.model}
summary(lm(y ~ yhat.1))$adj.r.squared
summary(lm(y ~ yhat.2))$adj.r.squared
summary(lm(y ~ yhat.3))$adj.r.squared
summary(lm(y ~ yhat.4))$adj.r.squared
```
Adding a bias does not change the proportion of variance explained by the linear model, because this $R^2$ is not computed relative to the 1:1 line. The $R^2$ for the models with gain is only slightly lower: the same noise is relatively larger compared to the narrower spread of the predictions.
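Recall from above that a sufficiently poor model can give _RSS_ $>$ _TSS_ and hence a negative $R^2$ against the 1:1 line. Here is a small illustration, using the reversed sequence as a deliberately bad "prediction" (an arbitrary choice, for illustration only):
```{r r2.negative}
yhat.bad <- rev(y) + noise   # "predictions" strongly anti-correlated with the actual values
f.R2.1.1(y, yhat.bad)
```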
## Bias
```{r fun.bias}
f.bias <- function(x.a, x.p) {   # mean prediction error (MPE): mean of actual - predicted
  mean(x.a - x.p)
}
```
```{r bias}
f.bias(y, yhat.1)
f.bias(y, yhat.2)
f.bias(y, yhat.3)
f.bias(y, yhat.4)
```
This measures the systematic over- or under-prediction. In the models with added bias we see a clear over-prediction, i.e., the actual is systematically less than the predicted, leading to a negative MPE.
## Gain
```{r fun.gain}
f.gain <- function(x.a, x.p) {   # gain: OLS slope of actual on predicted
  m <- lm(x.a ~ x.p)
  return(as.numeric(coef(m)[2]))
}
```
```{r gain}
f.gain(y, yhat.1)
f.gain(y, yhat.2)
f.gain(y, yhat.3)
```
Notice that the bias does not change this measure -- there is no gain, i.e., the slope of actual vs. predicted is still $\approx 1$ (here, with some error due to the random noise).
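For completeness, the model with both gain and bias should show essentially the same slope as the gain-only model, since the bias only shifts the intercept:
```{r gain.4}
f.gain(y, yhat.4)
```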
## Concordance
```{r fun.lin}
lin <- function(x.a, x.p) {   # Lin's concordance correlation coefficient (sample version)
  return((2 * cov(x.a, x.p)) /
           (var(x.a) + var(x.p) + (mean(x.a) - mean(x.p))^2))
}
```
```{r lin}
lin(y, yhat.1)
lin(y, yhat.2)
lin(y, yhat.3)
```
This is a good overall measure because it accounts for all kinds of lack of fit to the 1:1 line.
# Taylor diagrams
Taylor [^T] developed a diagram that shows three measures of modelled vs. observed pattern: (1) correlation $\rho$ between the two patterns, (2) centred RMSE $E'$, (3) standard deviations $\sigma_r$ (reference) and $\sigma_t$ (test). These are related by the formula:
$$E'^2 = \sigma^2_t + \sigma^2_r - 2 \sigma_t \sigma_r \rho$$
It does _not_ measure bias, since the RMSE is centred by subtracting the respective means. So it shows how well the two _patterns_ match, allowing for a systematic shift, which can be measured by the bias (see above).
[^T]: Taylor, K. E. (2001). Summarizing multiple aspects of model performance in a single diagram. Journal of Geophysical Research: Atmospheres, 106(D7), 7183–7192. https://doi.org/10.1029/2000JD900719
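This relation can be checked numerically with the first simulated model (a quick verification; since `var`, `sd` and `cor` all use the same $n-1$ denominator, the identity holds exactly):
```{r taylor.check}
# centred (mean-removed) mean squared difference between the two patterns
e.prime.sq <- sum(((yhat.1 - mean(yhat.1)) - (y - mean(y)))^2) / (length(y) - 1)
# right-hand side of Taylor's relation
rhs <- var(yhat.1) + var(y) - 2 * sd(yhat.1) * sd(y) * cor(yhat.1, y)
c(e.prime.sq, rhs)
```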
A Taylor diagram can be created by the `taylor.diagram` function of the `plotrix` package. Typically several models are compared on one diagram. Here we compare the first three simulated models:
```{r show.taylor}
require(plotrix)
taylor.diagram(y, yhat.1, col = "red",
pch = 1, pcex = 2,
ngamma = 8,
gamma.col = "green",
sd.arcs = TRUE,
ref.sd = TRUE)
taylor.diagram(y, yhat.2, col = "green", add = TRUE)
taylor.diagram(y, yhat.3, col = "blue", add = TRUE)
```
The correlation coefficient is shown by the angle (azimuth), from perfect ($0^\circ$) to none ($90^\circ$) -- ideally this would be 1. Here all three models have good correlation, the third model a bit less than the other two. Note that the models with and without bias (but no gain) are identical in this diagram.
Note that only non-negative correlations can be shown, which makes sense for two patterns that are supposed to represent the same thing.
The centred RMS error in the predicted pattern is proportional to the distance from the point on the x-axis (the standard deviation of the actual pattern) marked with an open circle. Various values of the RMS error are shown with circles centred at this point, coloured green in this example -- ideally this would be zero, on the x-axis. Here the first two models have identical centred RMSE, i.e., the same noise; the third model (with gain) is somewhat further from this point, because the compressed range of its predictions adds to the centred error.
The standard deviation of the predicted pattern is proportional to the radial distance from the origin (black dotted contours) -- ideally this would be on the arc through the standard deviation of the reference set. Here we see a bit less variability in the third model, since its predictions have a narrower spread than the actual values.
## Example -- increasing noise
We can show some features of the Taylor diagram with some other models. First, increasing noise, leading to poorer correlation:
```{r taylor.noise}
yhat.t1 <- y + noise
yhat.t2 <- y + noise*2
yhat.t3 <- y + noise*3
yhat.t4 <- y + noise*4
taylor.diagram(y, yhat.t1, col = "black",
ngamma = 8,
gamma.col = "green",
sd.arcs = TRUE,
ref.sd = TRUE)
taylor.diagram(y, yhat.t2, col = "red", add = TRUE)
taylor.diagram(y, yhat.t3, col = "blue", add = TRUE)
taylor.diagram(y, yhat.t4, col = "magenta", add = TRUE)
round(c(cor(y, yhat.t1), cor(y, yhat.t2), cor(y, yhat.t3), cor(y, yhat.t4)),2)
round(c(sd(y), sd(yhat.t1), sd(yhat.t2), sd(yhat.t3), sd(yhat.t4)), 1)
```
The centred RMSE increases with noise, and the correlation decreases (each observation may be perturbed more from its actual value).
The standard deviation of the predicted set increases with noise (more extreme values) and is increasingly further from the actual standard deviation (the radial distance from the origin) -- these models predict too much variability.
## Example -- increasing gain
Here are four models, the first with no gain, the other three with increasing gain.
Note that for the Taylor diagram there is no need to remove bias, since it is not taken into account.
```{r models.gain}
# yhat.t1 has no gain, same noise
yhat.g2 <- y*0.8 + noise
yhat.g2 <- yhat.g2 - (mean(yhat.g2) - mean(y)) # remove bias
yhat.g3 <- y*0.7 + noise
yhat.g3 <- yhat.g3 - (mean(yhat.g3) - mean(y)) # remove bias
yhat.g4 <- y*0.6 + noise
yhat.g4 <- yhat.g4 - (mean(yhat.g4) - mean(y)) # remove bias
plot(y ~ yhat.t1, asp=1, pch=20,
xlab = "predicted", ylab="actual",
main = "Increasing gain")
abline(0,1, lty=2); grid()
points(y ~ yhat.g2, col="red", pch=20)
points(y ~ yhat.g3, col="blue", pch=20)
points(y ~ yhat.g4, col="magenta", pch=20)
```
Now show these four on a Taylor diagram:
```{r taylor.gain}
taylor.diagram(y, yhat.1, col = "black",
ngamma = 8,
gamma.col = "green",
sd.arcs = TRUE,
ref.sd = TRUE)
taylor.diagram(y, yhat.g2, col = "red", add = TRUE)
taylor.diagram(y, yhat.g3, col = "blue", add = TRUE)
taylor.diagram(y, yhat.g4, col = "magenta", add = TRUE)
round(c(cor(y, yhat.1), cor(y, yhat.g2), cor(y, yhat.g3), cor(y, yhat.g4)), 3)
round(c(sd(y), sd(yhat.1), sd(yhat.g2), sd(yhat.g3), sd(yhat.g4)),1)
```
The correlation between actual and predicted patterns decreases with gain, but only by a small amount.
The standard deviation of the predictions decreases as the gain increases, because the range of the predicted values becomes narrower, i.e., the slope of the actual vs. predicted relation is steeper than the 1:1 line. The models with increasing gain are not variable enough.
# Gauch decomposition
A systematic approach to model quality was presented by Gauch _et al._ [^G].
This is based on a comparison of the model-based predictions and the measured values. This should be a 1:1 relation: each model prediction should equal the true value. Of course this will rarely be the case.
These authors present a decomposition to show the source of the disagreement:
[^G]: Gauch, H. G., Hwang, J. T. G., & Fick, G. W. (2003). Model evaluation by comparison of model-based predictions and measured values. Agronomy Journal, 95(6), 1442–1446. https://doi.org/10.2134/agronj2003.1442
* *MSD* Mean Squared Deviation. This shows how close, on average, the
prediction is to reality. Its square root is called the Root Mean Squared
Error of Prediction (RMSEP), expressed in the same units as the
modelled variable.
* *SB* Squared bias. This shows if the predictions are
systematically higher or lower than reality.
* *NU* Non-unity slope. This shows whether the relation between
predicted and actual is 1:1 throughout the range of
values; if not, there is either an over-prediction at low values
and corresponding under-prediction at high values (slope $>1$), or
vice-versa (slope $<1$).
* *LC* Lack of correlation. This shows how scattered are the predictions about the 1:1 line.
$$
\begin{align}
\mathrm{MSD} &= \frac{1}{n} \sum_{i=1}^n \big(y_i - \widehat{y}_i\big)^2\\
\mathrm{SB} &= \Big(\overline{\widehat{y}} - \overline{y}\Big)^2\\
\mathrm{NU} &= (1 - b)^2 \, \frac{1}{n} \sum_{i=1}^n \Big(\widehat{y}_i - \overline{\widehat{y}}\Big)^2\\
\mathrm{LC} &= (1 - r^2) \, \frac{1}{n} \sum_{i=1}^n \Big(y_i - \overline{y}\Big)^2
\end{align}
$$
* $n$ is the number of observations, $y_i, \hat{y}_i$ are the actual and predicted values of observation $i$, the overline indicates the arithmetic mean.
* $b$ is the slope of the least-squares regression of the _actual_ values on the _predicted_ values, i.e., $\sum{(y_i - \overline{y})(\widehat{y}_i - \overline{\widehat{y}})} / \sum{(\widehat{y}_i - \overline{\widehat{y}})^2}$; this is the _gain_ (see above).
* $r^2$ is the square of the correlation coefficient between actual and predicted, i.e., $\big[\sum{(y_i - \overline{y})(\widehat{y}_i - \overline{\widehat{y}})}\big]^2 / \big[\sum{(y_i - \overline{y})^2} \sum{(\widehat{y}_i - \overline{\widehat{y}})^2}\big]$.
Thus NU is the non-unity slope in the regression of actual on predicted, scaled by the mean squared deviation of the predicted values of the evaluation observations, and LC is the lack of correlation in that regression, scaled by the mean squared deviation of the actual values of the evaluation observations.
These have the relation:
$$
\mathrm{MSD} = \mathrm{SB} + \mathrm{NU} + \mathrm{LC}
$$
That is, the total evaluation error consists of bias, gain, and lack of correlation. These are distinct aspects of model quality and can be interpreted as *translation* (SB), *rotation* (NU), and *scatter* (LC), respectively:
* _Translation_ The model systematically over- or under-predicts;
* _Rotation_ The slope of the relation between actual and predicted is
not 1, i.e., the relation is not 1:1 throughout the range of values;
* _Scatter_ The model is not precise.
If there is significant translation or rotation, this indicates the model _form_ is not correct.
If there is significant scatter, this indicates that the model does not describe the system well; perhaps there are missing factors (predictors), or perhaps the system has a lot of noise that cannot be modelled.
Thus by examining the model evaluation decomposition the analyst can decide on how to improve the model.
## Function
```{r fn.decompose}
decompose <- function(actual, predicted) {
  msd <- mean((actual - predicted)^2)                 # mean squared deviation
  sb <- (mean(predicted) - mean(actual))^2            # squared bias
  regr.actual.pred <- lm(actual ~ predicted)
  b1 <- coef(regr.actual.pred)[2]; names(b1) <- NULL  # gain (slope of actual on predicted)
  nu <- (1 - b1)^2 * mean((predicted - mean(predicted))^2)  # non-unity slope
  r2 <- summary(regr.actual.pred)$r.squared
  lc <- (1 - r2) * mean((actual - mean(actual))^2)    # lack of correlation
  v <- c(msd, sb, nu, lc)
  names(v) <- c("msd", "sb", "nu", "lc")
  return(v)
}
```
## Example calculation
```{r gauch}
decompose(y, yhat.1) # noise only
decompose(y, yhat.2) # bias and noise
decompose(y, yhat.3) # gain and noise
decompose(y, yhat.4) # bias, gain, and noise
```
In the noise-only model, the MSD is almost completely from the lack of correlation LC. There is minimal bias and non-unity slope; these are just due to randomness in the noise.
In the bias + noise model, the MSD is larger, the lack of correlation LC is the same, and the majority of the MSD is from the squared bias SB. Again the non-unity slope is minimal and due to randomness.
In the gain + noise model, the MSD is larger, the lack of correlation LC is larger, there is effectively no bias SB, and now we see a large contribution to the MSD by the non-unity slope NU.
In the model with bias + gain + noise, the MSD is the largest. The lack of correlation LC is the same as with only gain, but now in addition there is squared bias SB as the main contributor to the MSD.
This decomposition shows where each model has the most problems.
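As a final check, the three components should sum to the MSD, shown here for the last model:
```{r decomp.check}
d <- decompose(y, yhat.4)
unname(d["msd"])                      # total mean squared deviation
unname(d["sb"] + d["nu"] + d["lc"])   # sum of the components -- should equal the MSD
```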