This R Markdown file shows how linear models are computed by ordinary least squares (OLS) and by a robust alternative to OLS, least trimmed squares (LTS).

1 Example dataset

We use a subset of the meuse soil pollution dataset found in the sp (spatial objects) package. Note that each time the source R Markdown file is executed, a different random set of observations is selected from the full dataset, so the results will be different.

library(sp)
data(meuse)
# if you want to see its metadata
# help(meuse)
dim(meuse)
## [1] 155  14

Add computed fields with the log-transformed values of the heavy metals:

meuse$logZn <- log10(meuse$zinc)    # zinc
meuse$logCu <- log10(meuse$copper)  # copper
meuse$logPb <- log10(meuse$lead)    # lead
meuse$logCd <- log10(meuse$cadmium) # cadmium
dim(meuse)
## [1] 155  18
names(meuse)
##  [1] "x"       "y"       "cadmium" "copper"  "lead"    "zinc"    "elev"   
##  [8] "dist"    "om"      "ffreq"   "soil"    "lime"    "landuse" "dist.m" 
## [15] "logZn"   "logCu"   "logPb"   "logCd"
table(meuse$ffreq) # observations in each flood frequency class
## 
##  1  2  3 
## 84 48 23

To be able to show the matrices, we limit the data to only a few rows and three selected columns: a dependent variable logZn, a continuous predictor dist (normalized distance to the river Meuse), and a categorical predictor ffreq (flooding frequency).

To ensure we have some observations from each flood frequency class, we take a stratified sample based on this category. This uses the strata function from the sampling package. We choose an equal number of observations from each of the three classes.

The n.s parameter is the sample size from each stratum; you can change this to see the effect of taking larger or smaller samples.

n.s <- 5  # sample size from each stratum
(n <- n.s * length(levels(meuse$ffreq))) # total sample size
## [1] 15
library(sampling)
# indices of the selected rows in the data frame
(ix <- strata(meuse, stratanames="ffreq", size=rep(n.s,3), method="srswor")$ID_unit)
##  [1]  16  17  38  61  74 109 125 129 131 132 143 144 148 150 151

These are the selected rows; they will differ with each random sample.

# the reduced data frame
(ds <- meuse[ix, c("logZn","dist", "ffreq")])
##        logZn      dist ffreq
## 16  3.013680 0.0000000     1
## 17  2.782473 0.0122243     1
## 39  2.920645 0.0484965     1
## 62  2.957607 0.0054321     1
## 83  2.773055 0.2009760     1
## 114 2.283301 0.5757520     2
## 131 2.418301 0.1683310     2
## 135 2.618048 0.0812194     2
## 161 2.100371 0.7716980     2
## 162 2.322219 0.3368290     2
## 148 2.227887 0.1030290     3
## 149 2.605305 0.0703550     3
## 153 2.893762 0.0921435     3
## 155 2.330414 0.3749400     3
## 156 2.220108 0.4238370     3

Visualize the relation with possible predictors of log(Zn):

plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()

plot(logZn ~ ffreq, data=ds, xlab="Flood frequency class", boxwex=0.4)

Depending on the random sample, there should be some relation: further distances and less-frequent flooding are associated with lower metal concentrations.

2 Regression on a continuous predictor

Model form: \(y = \beta_0 + \beta_1 x + \varepsilon; \;\; \mathbf{\varepsilon} \sim \mathcal{N}(0, \sigma^2)\)

Each observation \(i\): \(y_i = \beta_0 + \beta_1 x_i + r_i\) where \(r_i\) is the residual – i.e., what the model does not explain.

There is only one set of coefficients \(\beta_p\) which must be adjusted to all observations.

Once the model is fit, i.e., we compute the \(\beta_p\), we can compute the fitted values that would be expected at each observation point: \(\widehat{y}_i = \beta_0 + \beta_1 x_i\), and from that the residuals: \(y_i - (\beta_0 + \beta_1 x_i)\), i.e., \((y_i - \widehat{y}_i)\), actual less fitted.

So the question is: what are the ‘best’ values of \(\beta_p\)?

2.1 Ordinary Least Squares (OLS)

One criterion is to select \(\beta_0\), \(\beta_1\) to minimize sum of squared residuals: \(\sum_i (y_i - (\beta_0 + \beta_1 x_i))^2\).

Note:

  • All observations are equally weighted
  • This implies that the residuals are independent – they do not co-vary
    • Not true for most time series and spatial data
  • This implies that all residuals are from the same distribution
    • Not true if some observations have more or less variance than others
    • For example, different measurement precisions
  • The residuals are squared so that under- and over-fits are equally penalized.

Given this criterion, the optimal values of the coefficients \(\beta_0\), \(\beta_1\) can be found by simple minimization:

  1. Minimize \(\sum_i \varepsilon_i^2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2\)
  2. Take partial derivatives with respect to the two parameters; solve system of two simultaneous equations

Solution:

\[ \begin{aligned} \widehat{\beta}_1 &= \frac{ \sum_i (x_i - \overline{x})(y_i - \overline{y}) } { \sum_i (x_i - \overline{x})^2 } \\ \widehat{\beta}_0 &= \overline{y} - \widehat{\beta}_1 \overline{x} \end{aligned} \] Where:

  • \(\overline{x}, \overline{y}\) are the means
  • \(\widehat{\beta}_0\) centres the regression on \((\overline{x}, \overline{y})\)

This can also be expressed in terms of variances and covariances:

\[ \widehat{\beta}_1 = \frac{s_{xy}}{s_x^2} \] Where:

  • \(s_{xy}\) is the sample covariance between predictor (independent) and predictand (dependent).
  • \(s_x^2\) is the sample variance of the independent variable – all the error is assumed to be in the dependent variable.

These are unbiased estimates of the population variance/covariance:

\[ \widehat{\beta}_1 = \frac{\mathrm{Covar}(x,y)}{\mathrm{Var}(x)} \]
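
As a quick check of these formulas (a minimal sketch using the ds sample defined above), the slope and intercept can be computed directly from the sample statistics, and nearly the same values are recovered by numerically minimizing the sum of squared residuals with optim:

b1 <- cov(ds$dist, ds$logZn) / var(ds$dist)   # slope: s_xy / s_x^2
b0 <- mean(ds$logZn) - b1 * mean(ds$dist)     # intercept centres the line on the means
c(b0, b1)
# direct numerical minimization of the OLS criterion should give (nearly) the same values
ssr <- function(b) sum((ds$logZn - (b[1] + b[2] * ds$dist))^2)
optim(c(0, 0), ssr)$par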

2.2 Matrix notation

This can be more efficiently written in matrix notation: \(\mathbf{y} = \mathbf{X} \mathbf{\beta} + \varepsilon\)

Where:

  • \(\mathbf{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})\), i.e., residuals are independently drawn from a normal distribution with \(0\) mean and variance \(\sigma^2\)
  • \(\mathbf{X}\) is the design matrix
  • \(\mathbf{\beta}\) is the coefficient vector
  • \(\mathbf{I}\) is the identity matrix: diagonals all 1, off-diagonals all 0

This allows extension to multiple variables and to categorical variables, as we will see below.

Notice that \(\mathbf{I}\) means there is no correlation among the residuals. If this is not true, generalized least squares (GLS) must be used.

Solution by minimizing the sum of squares of the residuals:

\[ \begin{aligned} S &= \varepsilon^T \varepsilon \\ S &= (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{y} - \mathbf{X}\mathbf{\beta}) \end{aligned} \]

Note that \((\mathbf{X}^T \mathbf{X})\) is the matrix equivalent of \(s_x^2\), i.e., the variance of the predictor \(x\).

This expands to: \[ \begin{aligned} S &= \mathbf{y}^T\mathbf{y} - \beta^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{X}\mathbf{\beta} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \nonumber \\ S &= \mathbf{y}^T\mathbf{y} - 2 \mathbf{\beta}^T \mathbf{X}^T \mathbf{y} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \end{aligned} \]

Minimize by finding the partial derivative with respect to the unknown coefficients \(\mathbf{\beta}\), setting this equal to \(\mathbf{0}\), and solving:

\[ \begin{aligned} \frac{\partial}{\partial \mathbf{\beta}^T} S &= -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ \mathbf{0} &= -\mathbf{X}^T \mathbf{y} + \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ (\mathbf{X}^T \mathbf{X} ) \mathbf{\beta} &= \mathbf{X}^T \mathbf{y} \\ (\mathbf{X}^T \mathbf{X})^{-1} (\mathbf{X}^T \mathbf{X}) \mathbf{\beta} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \\ \mathbf{\widehat{\beta}}_{\mathrm{OLS}} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \end{aligned} \]

2.3 Computation

The model matrix \(\mathbf{X}\) is computed from the model formula, with the model.matrix function:

(X <- model.matrix(logZn ~ dist, data=ds))
##     (Intercept)      dist
## 16            1 0.0000000
## 17            1 0.0122243
## 39            1 0.0484965
## 62            1 0.0054321
## 83            1 0.2009760
## 114           1 0.5757520
## 131           1 0.1683310
## 135           1 0.0812194
## 161           1 0.7716980
## 162           1 0.3368290
## 148           1 0.1030290
## 149           1 0.0703550
## 153           1 0.0921435
## 155           1 0.3749400
## 156           1 0.4238370
## attr(,"assign")
## [1] 0 1
dim(X)
## [1] 15  2

Note the column of \(1\)s to represent the mean, and the column dist which are the observed normalized distances at each of the selected observations.

Its crossproduct \(\mathbf{X}^T \mathbf{X}\), computed with the t() transpose function and the %*% matrix multiplication operator:

t(X)
##             16        17        39        62       83      114      131
## (Intercept)  1 1.0000000 1.0000000 1.0000000 1.000000 1.000000 1.000000
## dist         0 0.0122243 0.0484965 0.0054321 0.200976 0.575752 0.168331
##                   135      161      162      148      149       153
## (Intercept) 1.0000000 1.000000 1.000000 1.000000 1.000000 1.0000000
## dist        0.0812194 0.771698 0.336829 0.103029 0.070355 0.0921435
##                 155      156
## (Intercept) 1.00000 1.000000
## dist        0.37494 0.423837
## attr(,"assign")
## [1] 0 1
(XTX <- t(X)%*%X)
##             (Intercept)     dist
## (Intercept)   15.000000 3.265263
## dist           3.265263 1.462589
dim(XTX)
## [1] 2 2

This is the product-crossproduct matrix of the two predictor variables:

sum(rep(1,n)^2); sum((ds$dist)^2); sum(rep(1,n)*ds$dist)
## [1] 15
## [1] 1.462589
## [1] 3.265263

Its inverse, using the solve matrix solution function:

(XTXinv <- solve(XTX))
##             (Intercept)       dist
## (Intercept)   0.1296979 -0.2895533
## dist         -0.2895533  1.3301533
dim(XTXinv)
## [1] 2 2

Compute \(\mathbf{X}^T \mathbf{y}\), the matrix equivalent of \(s_{xy}\), i.e., the covariance between two predictors (intercept and normalized distance) and predictand:

y <- ds$logZn
(XTy <- t(X)%*%y)
##                  [,1]
## (Intercept) 38.467175
## dist         7.580611
dim(XTy)
## [1] 2 1

And finally the solution:

(beta <- XTXinv%*%XTy)
##                  [,1]
## (Intercept)  2.794119
## dist        -1.054924
dim(beta)
## [1] 2 1

This is the same result as using the lm function:

coef(m1 <- lm(logZn ~ dist, data=ds))
## (Intercept)        dist 
##    2.794119   -1.054924

2.4 Model diagnostics

Recall, in OLS the residuals are assumed to be identically and independently distributed, from a normal distribution.

First see if they appear to be normally distributed. We use the qqPlot function from the car package, which shows the quantiles of the normal distribution against the residuals, with confidence intervals.

library(car)
qqPlot(residuals(m1))

Second, see if there is any relation between the residuals and fitted values, and also if the variance of the residuals is the same across the range of fitted values:

plot(residuals(m1) ~ fitted(m1))
grid()
abline(h=0, col="blue")
lines(stats::lowess(residuals(m1) ~ fitted(m1)), col="red")

There are not so many points here, so it’s hard to draw a firm conclusion. The empirical smoother, computed with lowess (locally-weighted scatterplot smoothing), helps visualize any relation.

Third, look for observations with high influence on the coefficients; these may be from a different population (process). The leverage of each observation is given by its hat value, which summarizes its overall influence on the predictions.

(sort(h <- hatvalues(m1)))
##         83        131        148        162        153        135 
## 0.06703800 0.06990657 0.08415261 0.08554884 0.08763050 0.09143763 
##        149        155         39         17        156         62 
## 0.09553883 0.09956055 0.10474161 0.12281745 0.12319683 0.12659134 
##         16        114        161 
## 0.12969785 0.23720903 0.47493237
plot(ds$dist, h, xlab="Normalized distance", ylab="Hat value", col=ds$ffreq, pch=20,
     xlim=c(0,1), ylim=c(0,0.8))
text(ds$dist, h, labels=names(residuals(m1)), pos=3)

The highest hat values are at the extremes of the predictor.

The influence of each observation on the fitted coefficients is measured by its Cook’s distance, which shows how much \(\mathbf{\widehat{\beta}}\) would change if the observation were omitted. It is defined as:

\[ \begin{aligned} D_i &= \frac{r^2_i}{\mathrm{tr}(\mathbf{H})} \frac{h_i}{1-h_i} \end{aligned} \] where \(\mathbf{H}\) is the hat matrix, \(h_i\) is its \(i\)th diagonal element, and \(r_i\) is the standardized residual \((y_i - \widehat{y}_i) / (s \sqrt{1 - h_i})\); here \(s^2\) is the residual variance after the model fit.
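
Before calling the built-in cooks.distance function below, here is a minimal sketch that computes \(D_i\) directly from this definition, using the hat values and residual variance from m1; it should agree with the built-in result:

h.i <- hatvalues(m1)                                  # leverages (diagonal of H)
s2 <- summary(m1)$sigma^2                             # residual variance after the fit
r.std <- residuals(m1) / sqrt(s2 * (1 - h.i))         # standardized residuals
d.manual <- (r.std^2 / sum(h.i)) * h.i / (1 - h.i)    # tr(H) = number of coefficients
all.equal(d.manual, cooks.distance(m1))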

sort(d <- cooks.distance(m1))
##           17          155          135          162          149 
## 3.166605e-06 7.248844e-03 1.149586e-02 1.765843e-02 1.948093e-02 
##          156           83          131          114          153 
## 3.277671e-02 3.567119e-02 4.033986e-02 4.828057e-02 5.181515e-02 
##           39           62           16          148          161 
## 5.241002e-02 6.036023e-02 1.048543e-01 2.667905e-01 3.168586e-01
plot(fitted(m1), d, xlab="Fitted value", ylab="Cook's distance", col=ds$ffreq, pch=20)
text(fitted(m1), d, labels=names(residuals(m1)), pos=4)

2.5 The hat matrix

The hat matrix is so-named because it ‘puts the hats on’ the fitted values: from observed \(\mathbf{y}\) to best-fit \(\mathbf{\widehat{y}}\).

It is defined as:

\[ \begin{aligned} \mathbf{H} &= \mathbf{X} {(\mathbf{X}' \mathbf{X})}^{-1} \mathbf{X}' \end{aligned} \]

The hat matrix gives the weights by which each observation is multiplied when computing the fitted values \(\mathbf{\widehat{y}}\). This means that if a high-leverage observation were changed, the fit would change substantially. In terms of vector spaces, the \(\mathbf{H}\) matrix gives the orthogonal projection of the observation vector \(\mathbf{y}\) into the column space of the design (model) matrix \(\mathbf{X}\).

The overall leverage of each observation \(y_i\) is given by its hat value, which is the sum of the squared entries of the hat matrix associated with the observation. This is also the diagonal of the hat matrix, because \(\mathbf{H} = \mathbf{H'} \mathbf{H} = \mathbf{H^2}\):

\[ \begin{aligned} h_{ii} &= \sum_{j=1}^n h_{ij}^2 \end{aligned} \]
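
A minimal sketch verifying these properties numerically, using the X, y, and m1 objects from the continuous-predictor computation above:

H <- X %*% solve(t(X) %*% X) %*% t(X)                     # the hat matrix
all.equal(unname(diag(H)), unname(hatvalues(m1)))         # hat values = diagonal of H
all.equal(as.vector(H %*% y), as.vector(fitted(m1)))      # H "puts the hats on" y
all.equal(unname(diag(H)), unname(rowSums(H^2)))          # h_ii = sum_j h_ij^2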

3 Regression on a categorical predictor

There are various ways to set up the model matrix to separate categories. These are viewed and set with the contrasts function:

contrasts(ds$ffreq)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1

This is the default “treatment contrasts” matrix; where “treatments” comes from experimental design. Here the “treatment” was by nature, not humans. Treatment contrasts are sometimes called “dummy variables”.
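
Treatment contrasts are not the only coding; for comparison only, sum (‘deviation’) contrasts code each class relative to the grand mean. The rest of this section uses the default treatment contrasts shown above.

contr.sum(3)   # sum-to-zero contrasts for a three-level factor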

The model matrix uses these contrasts:

(X <- model.matrix(logZn ~ ffreq, data=ds))
##     (Intercept) ffreq2 ffreq3
## 16            1      0      0
## 17            1      0      0
## 39            1      0      0
## 62            1      0      0
## 83            1      0      0
## 114           1      1      0
## 131           1      1      0
## 135           1      1      0
## 161           1      1      0
## 162           1      1      0
## 148           1      0      1
## 149           1      0      1
## 153           1      0      1
## 155           1      0      1
## 156           1      0      1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"

Notice how the first category is included in the intercept – this is the base case; the other two classes are contrasted with it. The matrix has created two variables, beyond the intercept, to represent the two additional classes.

Solution is exactly as before, by OLS:

(XTX <- t(X)%*%X)
##             (Intercept) ffreq2 ffreq3
## (Intercept)          15      5      5
## ffreq2                5      5      0
## ffreq3                5      0      5
dim(XTX)
## [1] 3 3

This is the product-crossproduct matrix among the predictors. Note the relation to the number of observations in each class.

XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                 [,1]
## (Intercept) 38.46718
## ffreq2      11.74224
## ffreq3      12.27748
(beta <- XTXinv%*%XTy)
##                   [,1]
## (Intercept)  2.8894919
## ffreq2      -0.5410438
## ffreq3      -0.4339968

And again, these are the coefficients found by lm:

coef(m2 <- lm(logZn ~ ffreq, data=ds))
## (Intercept)      ffreq2      ffreq3 
##   2.8894919  -0.5410438  -0.4339968

The first coefficient is the mean of the dependent variable for observations in the first class:

beta[1]
## [1] 2.889492
mean(ds$logZn[ds$ffreq==1])
## [1] 2.889492

The others are the differences between their class means and this first-class mean:

beta[2]
## [1] -0.5410438
mean(ds$logZn[ds$ffreq==2])
## [1] 2.348448
mean(ds$logZn[ds$ffreq==2]) - beta[1]
## [1] -0.5410438

We can make the model without an intercept by adding -1 to the model formula:

(X <- model.matrix(logZn ~ ffreq - 1, data=ds))
##     ffreq1 ffreq2 ffreq3
## 16       1      0      0
## 17       1      0      0
## 39       1      0      0
## 62       1      0      0
## 83       1      0      0
## 114      0      1      0
## 131      0      1      0
## 135      0      1      0
## 161      0      1      0
## 162      0      1      0
## 148      0      0      1
## 149      0      0      1
## 153      0      0      1
## 155      0      0      1
## 156      0      0      1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"

Now each observation has a 1 only in its own ‘dummy’ variable.

(XTX <- t(X)%*%X)
##        ffreq1 ffreq2 ffreq3
## ffreq1      5      0      0
## ffreq2      0      5      0
## ffreq3      0      0      5
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##            [,1]
## ffreq1 14.44746
## ffreq2 11.74224
## ffreq3 12.27748
(beta <- XTXinv%*%XTy)
##            [,1]
## ffreq1 2.889492
## ffreq2 2.348448
## ffreq3 2.455495
coef(m3 <- lm(logZn ~ ffreq - 1, data=ds))
##   ffreq1   ffreq2   ffreq3 
## 2.889492 2.348448 2.455495

Now the coefficients are the means of each class.
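
A one-line check (using the ds sample above) that these coefficients are indeed the per-class means of logZn:

tapply(ds$logZn, ds$ffreq, mean)   # compare with coef(m3)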

4 Mixed models

The same formulation works with any combination of predictors. For example, both the flood frequency and distance:

(X <- model.matrix(logZn ~ ffreq + dist, data=ds))
##     (Intercept) ffreq2 ffreq3      dist
## 16            1      0      0 0.0000000
## 17            1      0      0 0.0122243
## 39            1      0      0 0.0484965
## 62            1      0      0 0.0054321
## 83            1      0      0 0.2009760
## 114           1      1      0 0.5757520
## 131           1      1      0 0.1683310
## 135           1      1      0 0.0812194
## 161           1      1      0 0.7716980
## 162           1      1      0 0.3368290
## 148           1      0      1 0.1030290
## 149           1      0      1 0.0703550
## 153           1      0      1 0.0921435
## 155           1      0      1 0.3749400
## 156           1      0      1 0.4238370
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
##             (Intercept)   ffreq2   ffreq3     dist
## (Intercept)   15.000000 5.000000 5.000000 3.265263
## ffreq2         5.000000 5.000000 0.000000 1.933829
## ffreq3         5.000000 0.000000 5.000000 1.064304
## dist           3.265263 1.933829 1.064304 1.462589
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                  [,1]
## (Intercept) 38.467175
## ffreq2      11.742240
## ffreq3      12.277475
## dist         7.580611
(beta <- XTXinv%*%XTy)
##                   [,1]
## (Intercept)  2.9285211
## ffreq2      -0.2975279
## ffreq3      -0.3175242
## dist        -0.7305327
coef(m4 <- lm(logZn ~ ffreq + dist, data=ds))
## (Intercept)      ffreq2      ffreq3        dist 
##   2.9285211  -0.2975279  -0.3175242  -0.7305327

Here the intercept is the value at distance 0 and in flood frequency class 1.
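
To see this, a small check using predict with a hypothetical point at distance 0 in flood frequency class 1; the prediction should equal the intercept of m4:

nd <- data.frame(dist = 0, ffreq = factor("1", levels = levels(ds$ffreq)))
predict(m4, newdata = nd)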

Their interaction:

(X <- model.matrix(logZn ~ ffreq * dist, data=ds))
##     (Intercept) ffreq2 ffreq3      dist ffreq2:dist ffreq3:dist
## 16            1      0      0 0.0000000   0.0000000   0.0000000
## 17            1      0      0 0.0122243   0.0000000   0.0000000
## 39            1      0      0 0.0484965   0.0000000   0.0000000
## 62            1      0      0 0.0054321   0.0000000   0.0000000
## 83            1      0      0 0.2009760   0.0000000   0.0000000
## 114           1      1      0 0.5757520   0.5757520   0.0000000
## 131           1      1      0 0.1683310   0.1683310   0.0000000
## 135           1      1      0 0.0812194   0.0812194   0.0000000
## 161           1      1      0 0.7716980   0.7716980   0.0000000
## 162           1      1      0 0.3368290   0.3368290   0.0000000
## 148           1      0      1 0.1030290   0.0000000   0.1030290
## 149           1      0      1 0.0703550   0.0000000   0.0703550
## 153           1      0      1 0.0921435   0.0000000   0.0921435
## 155           1      0      1 0.3749400   0.0000000   0.3749400
## 156           1      0      1 0.4238370   0.0000000   0.4238370
## attr(,"assign")
## [1] 0 1 1 2 3 3
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
##             (Intercept)   ffreq2   ffreq3     dist ffreq2:dist ffreq3:dist
## (Intercept)   15.000000 5.000000 5.000000 3.265263    1.933829    1.064304
## ffreq2         5.000000 5.000000 0.000000 1.933829    1.933829    0.000000
## ffreq3         5.000000 0.000000 5.000000 1.064304    0.000000    1.064304
## dist           3.265263 1.933829 1.064304 1.462589    1.075394    0.344273
## ffreq2:dist    1.933829 1.933829 0.000000 1.075394    1.075394    0.000000
## ffreq3:dist    1.064304 0.000000 1.064304 0.344273    0.000000    0.344273
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                  [,1]
## (Intercept) 38.467175
## ffreq2      11.742240
## ffreq3      12.277475
## dist         7.580611
## ffreq2:dist  4.337369
## ffreq3:dist  2.494204
(beta <- XTXinv%*%XTy)
##                   [,1]
## (Intercept)  2.9320609
## ffreq2      -0.3425105
## ffreq3      -0.2610534
## dist        -0.7967881
## ffreq2:dist  0.1734077
## ffreq3:dist -0.2156685
coef(m5 <- lm(logZn ~ ffreq * dist, data=ds))
## (Intercept)      ffreq2      ffreq3        dist ffreq2:dist ffreq3:dist 
##   2.9320609  -0.3425105  -0.2610534  -0.7967881   0.1734077  -0.2156685

A nested model without intercept:

(X <- model.matrix(logZn ~ ffreq/dist - 1, data=ds))
##     ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## 16       1      0      0   0.0000000   0.0000000   0.0000000
## 17       1      0      0   0.0122243   0.0000000   0.0000000
## 39       1      0      0   0.0484965   0.0000000   0.0000000
## 62       1      0      0   0.0054321   0.0000000   0.0000000
## 83       1      0      0   0.2009760   0.0000000   0.0000000
## 114      0      1      0   0.0000000   0.5757520   0.0000000
## 131      0      1      0   0.0000000   0.1683310   0.0000000
## 135      0      1      0   0.0000000   0.0812194   0.0000000
## 161      0      1      0   0.0000000   0.7716980   0.0000000
## 162      0      1      0   0.0000000   0.3368290   0.0000000
## 148      0      0      1   0.0000000   0.0000000   0.1030290
## 149      0      0      1   0.0000000   0.0000000   0.0703550
## 153      0      0      1   0.0000000   0.0000000   0.0921435
## 155      0      0      1   0.0000000   0.0000000   0.3749400
## 156      0      0      1   0.0000000   0.0000000   0.4238370
## attr(,"assign")
## [1] 1 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
##                ffreq1   ffreq2   ffreq3 ffreq1:dist ffreq2:dist
## ffreq1      5.0000000 0.000000 0.000000   0.2671289    0.000000
## ffreq2      0.0000000 5.000000 0.000000   0.0000000    1.933829
## ffreq3      0.0000000 0.000000 5.000000   0.0000000    0.000000
## ffreq1:dist 0.2671289 0.000000 0.000000   0.0429222    0.000000
## ffreq2:dist 0.0000000 1.933829 0.000000   0.0000000    1.075394
## ffreq3:dist 0.0000000 0.000000 1.064304   0.0000000    0.000000
##             ffreq3:dist
## ffreq1         0.000000
## ffreq2         0.000000
## ffreq3         1.064304
## ffreq1:dist    0.000000
## ffreq2:dist    0.000000
## ffreq3:dist    0.344273
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
##                   [,1]
## ffreq1      14.4474593
## ffreq2      11.7422405
## ffreq3      12.2774754
## ffreq1:dist  0.7490383
## ffreq2:dist  4.3373692
## ffreq3:dist  2.4942038
(beta <- XTXinv%*%XTy)
##                   [,1]
## ffreq1       2.9320609
## ffreq2       2.5895504
## ffreq3       2.6710075
## ffreq1:dist -0.7967881
## ffreq2:dist -0.6233804
## ffreq3:dist -1.0124565
coef(m6 <- lm(logZn ~ ffreq/dist - 1, data=ds))
##      ffreq1      ffreq2      ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist 
##   2.9320609   2.5895504   2.6710075  -0.7967881  -0.6233804  -1.0124565

This gives directly the mean of each class at distance 0 and, for each class, a coefficient showing how much the metal concentration decreases per unit of normalized distance away from the river.
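
The nested no-intercept model is a reparameterization of the interaction model m5: the fitted values are identical, and each per-class slope in m6 is the base-class slope of m5 plus the corresponding interaction term. A quick check:

all.equal(fitted(m5), fitted(m6))
coef(m5)["dist"] + coef(m5)["ffreq2:dist"]   # should equal coef(m6)["ffreq2:dist"]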

5 Robust regression

A robust inference is one that is not greatly disturbed by:

  • one or a few unusual observations (‘outliers’); or
  • moderate violations of the model assumptions.

For linear modelling, the underlying model form should still be linear, but there may be unusual observations that obscure the general relation.

An example of the first case is a contaminated dataset, where some observations result from a different process than that which produced the others. The issue here is that the observations do not all come from the same population, i.e., they are not all the result of a single process.

Many of the problems with parametric regression can be avoided by fitting a so-called robust linear regression, which attempts to fit the majority of the observations and ignore the unusual cases. Since we do not know in advance if there is contamination or problems with the model assumptions, there is no way to select an optimum method. See the discussion in J Fox and S Weisberg. Robust regression in R: An appendix to “An R companion to applied regression, second edition”. 15-Dec-2010. URL http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Robust-Regression.pdf

One method is called least trimmed squares (LTS). It estimates the coefficient vector \(\mathbf{\beta}\) by the criterion of minimizing

\[ \sum_{i=1}^q \big| \;y_i - \mathbf{X}_i \mathbf{\beta} \; \big|_{(i)}^2 \]

That is, the sum of the \(q\) smallest squared residuals, where the subscript \((i)\) indicates the residuals sorted in increasing order of absolute value. By default \(q = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor\), where \(p\) is the number of predictors and \(n\) the number of observations. This is about half the observations.

To determine the observations to use, sort the squared residuals from smallest \((1)\) to largest \((n)\):

\[ (e^2)_{(1)}, (e^2)_{(2)}, \ldots (e^2)_{(n)} \] and select the observations with the smallest \(q\) of these to re-fit the model.

p <- 2 # number of predictors
(q <- floor((n/2) + floor((p+1)/2))) # number of `good' observations
## [1] 8

In our example, 8 ‘good’ observations will be used; the other 7 observations with the largest residuals will be discarded. This is the ‘trimming’.

However, it is not so simple: after re-fitting on the selected observations, the residuals of all the observations must be re-computed with the new coefficients, and these re-sorted residuals may not select the same \(q\) observations. So the procedure is iterated until the set of retained observations no longer changes.

This is solved by the lqs function found in the MASS “Modern Applied Statistics with S” package, explained in the textbook: Venables, W., & Ripley, B. (2002). Modern Applied Statistics with S. Fourth Edition. Springer. This function has several methods, of which least trimmed squares is the default and most advised.

We show how to compute one iteration of the robust regression line, starting from the OLS line:

Find the 7 largest absolute residuals:

m1 <- lm(logZn ~ dist, data=ds)
(r0.s <- sort(r0.a <- abs(residuals(m1))))
##          17         155         135         114         149         162 
## 0.001249728 0.068171646 0.090390217 0.096557197 0.114594383 0.116570331 
##         161         156          62          39          83         153 
## 0.120334640 0.126894715 0.169219136 0.177686517 0.190950485 0.196847543 
##         131          16         148 
## 0.198240909 0.219561094 0.457544139
(hi.res <- r0.s[(q+1):n])
##        62        39        83       153       131        16       148 
## 0.1692191 0.1776865 0.1909505 0.1968475 0.1982409 0.2195611 0.4575441
(hi.res.ix <- which(abs(residuals(m1)) %in% hi.res))
## [1]  1  3  4  5  7 11 13
ds.trim <- ds[-hi.res.ix,]
dim(ds); dim(ds.trim)
## [1] 15  3
## [1] 8 3
m1.r <- lm(logZn ~ dist, data=ds.trim)

Compare the regression coefficients of the OLS and robust models, and the percent change:

coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept)        dist 
##   2.6813878  -0.8269679
## (Intercept)        dist 
##    2.794119   -1.054924
## (Intercept)        dist 
##        -4.0       -21.6

Now re-compute the residuals using the predict method, the first-iteration robust model, and the full dataset:

(r1.s <- sort(r1.a <- abs(ds$logZn - predict(m1.r, newdata=ds))))
##        135        149        155        161        114        162 
## 0.00382612 0.01790144 0.04091070 0.05715218 0.07804182 0.08062176 
##        156         17        131         83         39         62 
## 0.11078014 0.11119392 0.12388219 0.25786758 0.27936224 0.28071165 
##        153         16        148 
## 0.28857367 0.33229189 0.36829943
r0.s
##          17         155         135         114         149         162 
## 0.001249728 0.068171646 0.090390217 0.096557197 0.114594383 0.116570331 
##         161         156          62          39          83         153 
## 0.120334640 0.126894715 0.169219136 0.177686517 0.190950485 0.196847543 
##         131          16         148 
## 0.198240909 0.219561094 0.457544139

In general the residuals after the re-fit will not be the same, nor in the same sorted order, as from the OLS fit. Check if the same observations are discarded:

hi.res.1 <- r1.s[(q+1):n]
(hi.res.ix.1 <- which(r1.a %in% hi.res.1))
## [1]  1  3  4  5  7 11 13
hi.res.ix
## [1]  1  3  4  5  7 11 13
intersect(hi.res.ix, hi.res.ix.1)
## [1]  1  3  4  5  7 11 13

In this case 7 of the 7 largest absolute residuals are from the same observations in both the OLS regression and the first iteration of the LTS regression.

If these sets are the same, this is the robust regression. If not, the process is repeated with the new set of ‘good’ observations, until the sets are the same.
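
A sketch of that iteration, under the simplifying assumption that we simply alternate between refitting on the \(q\) best-fitted observations and re-ranking all residuals until the retained set stops changing (this is only the loop described above, not the full search that lqs performs, and it is not guaranteed to find the global LTS optimum):

fit <- lm(logZn ~ dist, data = ds)
keep <- order((ds$logZn - predict(fit, newdata = ds))^2)[1:q]   # q smallest squared residuals
repeat {
  fit <- lm(logZn ~ dist, data = ds[keep, ])                    # refit on the retained rows
  keep.new <- order((ds$logZn - predict(fit, newdata = ds))^2)[1:q]
  if (setequal(keep, keep.new)) break                           # retained set unchanged: stop
  keep <- keep.new
}
coef(fit)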

The robust line can be computed directly by the lqs function, with the optional method argument set to "lts" for least-trimmed squares. This performs the iteration as needed.

require(MASS)
## Loading required package: MASS
m1.r <- lqs(logZn ~ dist, data=ds, method = "lts")
coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept)        dist 
##    3.025942   -1.872351
## (Intercept)        dist 
##    2.794119   -1.054924
## (Intercept)        dist 
##         8.3        77.5

Depending on the sample, the regression coefficients can change quite a lot.

Visualize the two regression lines:

plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
abline(m1)
abline(m1.r, lty=2)
legend("topright", c("OLS", "robust"), lty=1:2)