---
title: "Explaining Ordinary Least Squares (OLS) regression with R"
author: "D G Rossiter"
date: "09 February 2017"
output:
  html_document:
    toc: TRUE
    theme: "spacelab"
    number_sections: TRUE
---

This R Markdown file shows how linear models are computed by ordinary least squares (OLS) and by a robust regression variant of OLS.

# Example dataset

We use a subset of the `meuse` soil pollution dataset found in the `sp` (spatial objects) package. Note that each time the source R Markdown file is executed, a different random set of observations is selected from the full dataset, so the results will be different.

```{r}
library(sp)
data(meuse)
# if you want to see its metadata:
# help(meuse)
dim(meuse)
```

Add computed fields with the log-transformed values of the heavy metals:

```{r}
meuse$logZn <- log10(meuse$zinc)     # zinc
meuse$logCu <- log10(meuse$copper)   # copper
meuse$logPb <- log10(meuse$lead)     # lead
meuse$logCd <- log10(meuse$cadmium)  # cadmium
dim(meuse)
names(meuse)
table(meuse$ffreq)  # observations in each flood frequency class
```

To be able to show the matrices, we limit the data to only a few rows and three selected columns: a dependent variable `logZn`, a continuous predictor `dist` (normalized distance to the river Meuse), and a categorical predictor `ffreq` (flooding frequency).

To ensure we have some observations from each flood frequency class, we take a stratified sample based on this category. This uses the `strata` function from the `sampling` package. We choose an equal number of observations from each of the three classes. The `n.s` parameter is the sample size from each stratum; you can change this to see the effect of taking larger or smaller samples.

```{r}
n.s <- 5                                  # sample size from each stratum
(n <- n.s * length(levels(meuse$ffreq)))  # total sample size
library(sampling)
# indices of the selected rows in the data frame
(ix <- strata(meuse, stratanames="ffreq", size=rep(n.s, 3), method="srswor")$ID_unit)
```

These are the selected rows; they will differ with each random sample.

```{r}
# the reduced data frame
(ds <- meuse[ix, c("logZn", "dist", "ffreq")])
```

Visualize the relation of log(Zn) with the possible predictors:

```{r out.width='500px', fig.align='center'}
plot(logZn ~ dist, data=ds, col=ffreq, pch=20,
     xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
plot(logZn ~ ffreq, data=ds, xlab="Flood frequency class", boxwex=0.4)
```

Depending on the random sample, there should be some relation: greater distances and less-frequent flooding are associated with lower metal concentrations.

# Regression on a continuous predictor

Model form: $y = \beta_0 + \beta_1 x + \varepsilon; \;\; \varepsilon \sim \mathcal{N}(0, \sigma^2)$

* $y$: *dependent* variable, to be modelled/predicted
* $x$: *independent* variable, *predictor*
* $\varepsilon$: *error*, lack of fit, noise; **independently and identically distributed** (IID) from a 0-mean normal distribution with some **error variance** $\sigma^2$
* $\beta_1$: coefficient for $x$, the "slope" for simple regression
* $\beta_0$: centering coefficient, the "intercept" for simple regression

Each observation $i$: $y_i = \beta_0 + \beta_1 x_i + r_i$, where $r_i$ is the **residual** -- i.e., what the model does not explain. There is only one set of coefficients $\beta_p$, which must be adjusted to all observations.
Once the model is fit, i.e., we compute the $\beta_p$, we can compute the **fitted** values that would be expected at each observation point: $\widehat{y}_i = \beta_0 + \beta_1 x_i$, and from that the residuals: $y_i - (\beta_0 + \beta_1 x_i)$, i.e., $(y_i - \widehat{y}_i)$, actual less fitted.

So the question is: what are the "best" values of $\beta_p$?

## Ordinary Least Squares (OLS)

One criterion is to select $\beta_0$, $\beta_1$ to **minimize the sum of squared residuals**: $\sum_i (y_i - (\beta_0 + \beta_1 x_i))^2$.

Note:

* All observations are equally weighted.
* This implies that the residuals are independent -- they do not co-vary.
    + Not true for most time series and spatial data.
* This implies that all residuals are from the same distribution.
    + Not true if some observations have more or less variance than others.
    + For example, different measurement precisions.
* The residuals are squared so that under- and over-fits are equally penalized.

Given this criterion, the optimal values of the coefficients $\beta_0$, $\beta_1$ can be found by simple minimization:

1. Minimize $\sum_i \varepsilon_i^2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2$
2. Take partial derivatives with respect to the two parameters; solve the resulting system of two simultaneous equations.

Solution:

$$
\begin{aligned}
\widehat{\beta}_1 &= \frac{ \sum_i (x_i - \overline{x})(y_i - \overline{y}) } { \sum_i (x_i - \overline{x})^2 } \\
\widehat{\beta}_0 &= \overline{y} - \widehat{\beta}_1 \overline{x}
\end{aligned}
$$

Where:

* $\overline{x}, \overline{y}$ are the means;
* $\widehat{\beta}_0$ centres the regression on $(\overline{x}, \overline{y})$.

This can also be expressed in terms of variances and covariances:

$$
\widehat{\beta}_1 = \frac{s_{xy}}{s_x^2}
$$

Where:

* $s_{xy}$ is the sample covariance between the predictor (independent) and the predictand (dependent) variable;
* $s_x^2$ is the sample variance of the *independent* variable -- all the error is assumed to be in the *dependent* variable.

These are unbiased estimates of the population covariance and variance:

$$
\widehat{\beta}_1 = \frac{\mathrm{Covar}(x,y)}{\mathrm{Var}(x)}
$$
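We can verify this formula numerically with the sample selected above. The following chunk, added here as an illustration, computes the slope from the sample covariance and variance and the intercept from the means, then compares them with the coefficients found by `lm` (which we will derive from the matrix form below); `b0` and `b1` are just temporary names for this check.

```{r}
# slope from the sample covariance and variance
(b1 <- cov(ds$dist, ds$logZn) / var(ds$dist))
# intercept from the means
(b0 <- mean(ds$logZn) - b1 * mean(ds$dist))
# compare with the coefficients computed by lm()
coef(lm(logZn ~ dist, data=ds))
```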
## Matrix notation

This can be more efficiently written in matrix notation:

$\mathbf{y} = \mathbf{X} \mathbf{\beta} + \varepsilon$

Where:

* $\mathbf{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$, i.e., the residuals are independently drawn from a normal distribution with $0$ mean and variance $\sigma^2$
* $\mathbf{X}$ is the *design matrix*
* $\mathbf{\beta}$ is the coefficient vector
* $\mathbf{I}$ is the identity matrix: diagonals all 1, off-diagonals all 0

This allows extension to multiple variables and to categorical variables, as we will see below. Notice that $\mathbf{I}$ means there is *no correlation* among the residuals. If this is not true, *generalized least squares* (GLS) must be used.

**Solution** by *minimizing* the sum of squares of the residuals:

$$
\begin{aligned}
S &= \varepsilon^T \varepsilon \\
  &= (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{y} - \mathbf{X}\mathbf{\beta})
\end{aligned}
$$

Note that $(\mathbf{X}^T \mathbf{X})$ is the matrix equivalent of $s_x^2$, i.e., the variance of the predictor $x$.

This expands to:

$$
\begin{aligned}
S &= \mathbf{y}^T\mathbf{y} - \mathbf{\beta}^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{X}\mathbf{\beta} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \\
  &= \mathbf{y}^T\mathbf{y} - 2 \mathbf{\beta}^T \mathbf{X}^T \mathbf{y} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta}
\end{aligned}
$$

Minimize by taking the partial derivative with respect to the unknown coefficients $\mathbf{\beta}$, setting this equal to $\mathbf{0}$, and solving:

$$
\begin{aligned}
\frac{\partial}{\partial \mathbf{\beta}^T} S &= -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\
\mathbf{0} &= -\mathbf{X}^T \mathbf{y} + \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\
(\mathbf{X}^T \mathbf{X} ) \mathbf{\beta} &= \mathbf{X}^T \mathbf{y} \\
(\mathbf{X}^T \mathbf{X})^{-1} (\mathbf{X}^T \mathbf{X}) \mathbf{\beta} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \\
\mathbf{\widehat{\beta}}_{\mathrm{OLS}} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{aligned}
$$

## Computation

The model matrix $\mathbf{X}$ is computed from the model formula, with the `model.matrix` function:

```{r}
(X <- model.matrix(logZn ~ dist, data=ds))
dim(X)
```

Note the column of $1$s for the intercept, and the column `dist`, which contains the observed normalized distances at each of the selected observations.

Its crossproduct $\mathbf{X}^T \mathbf{X}$, using the `t` "transpose" function and the `%*%` matrix multiplication operator:

```{r}
t(X)
(XTX <- t(X)%*%X)
dim(XTX)
```

This is the product-crossproduct matrix of the two predictor variables:

```{r}
sum(rep(1,n)^2); sum((ds$dist)^2); sum(rep(1,n)*ds$dist)
```

Its inverse, using the `solve` matrix solution function:

```{r}
(XTXinv <- solve(XTX))
dim(XTXinv)
```

Compute $\mathbf{X}^T \mathbf{y}$, the matrix equivalent of $s_{xy}$, i.e., the covariance between the two predictors (intercept and normalized distance) and the predictand:

```{r}
y <- ds$logZn
(XTy <- t(X)%*%y)
dim(XTy)
```

And finally the solution:

```{r}
(beta <- XTXinv%*%XTy)
dim(beta)
```

This is the same result as using the `lm` function:

```{r}
coef(m1 <- lm(logZn ~ dist, data=ds))
```

## Model diagnostics

Recall that in OLS the residuals are assumed to be identically and independently distributed, from a normal distribution.

First, see if they appear to be normally distributed. We use the `qqPlot` function from the `car` package, which shows the quantiles of the normal distribution against the residuals, with confidence intervals.

```{r fig.align='center', out.width='400px'}
library(car)
qqPlot(residuals(m1))
```

Second, see if there is any relation between the residuals and the fitted values, and also if the variance of the residuals is the same across the range of fitted values:

```{r fig.align='center', out.width='400px'}
plot(residuals(m1) ~ fitted(m1))
grid()
abline(h=0, col="blue")
lines(stats::lowess(residuals(m1) ~ fitted(m1)), col="red")
```

There are not so many points here, so it's hard to draw a firm conclusion. The empirical smoother, using `lowess` ("locally weighted scatterplot smoothing"), helps visualize any relation.

Third, look for observations with high influence on the coefficients; these may be from a different population (process). The leverage is measured by the *hat value*, which measures the overall influence of a single observation on the predictions.
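As a check (added here, anticipating the hat matrix defined in the next section), the hat values can be computed directly from the objects `X`, `XTXinv` and `y` created above: the diagonal of $\mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T$ gives the hat values, and $\mathbf{H}\mathbf{y}$ reproduces the fitted values.

```{r}
# hat matrix from the design matrix and the inverse crossproduct computed above
H <- X %*% XTXinv %*% t(X)
dim(H)
# its diagonal should match the hat values reported by hatvalues()
all.equal(unname(diag(H)), unname(hatvalues(m1)))
# and H %*% y should reproduce the fitted values
all.equal(as.numeric(H %*% y), unname(fitted(m1)))
```

The hat values for this model, sorted, and plotted against the predictor: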
```{r fig.align='center', out.width='400px'}
(sort(h <- hatvalues(m1)))
plot(ds$dist, h, xlab="Normalized distance", ylab="Hat value",
     col=ds$ffreq, pch=20, xlim=c(0,1), ylim=c(0,0.8))
text(ds$dist, h, labels=names(residuals(m1)), pos=3)
```

The highest hat values are at the extremes of the predictor.

The influence of each observation can be measured by its *Cook's distance*, which measures how much $\mathbf{\beta}$ would change if the observation were omitted. It is defined as:

$$
D_i = \frac{r^2_i}{\mathrm{tr}(\mathbf{H})} \cdot \frac{h_i}{1-h_i}
$$

where $\mathbf{H}$ is the hat matrix (defined in the next section), $h_i$ is its $i$th diagonal element, and $r_i$ is the standardized residual, i.e., $(y_i - \widehat{y}_i) / (s \sqrt{1 - h_i})$. This is based on $s^2$, the residual variance after the model fit.

```{r fig.align='center', out.width='400px'}
sort(d <- cooks.distance(m1))
plot(fitted(m1), d, xlab="Fitted value", ylab="Cook's distance",
     col=ds$ffreq, pch=20)
text(fitted(m1), d, labels=names(residuals(m1)), pos=4)
```

## The hat matrix

The hat matrix is so-named because it "puts the hats on" the fitted values: from observed $\mathbf{y}$ to best-fit $\mathbf{\widehat{y}}$. It is defined as:

$$
\mathbf{H} = \mathbf{X} {(\mathbf{X}^T \mathbf{X})}^{-1} \mathbf{X}^T
$$

The hat matrix gives the weights by which each original observation is multiplied when computing the fitted values, $\mathbf{\widehat{y}} = \mathbf{H}\mathbf{y}$. This means that if a high-leverage observation were changed, the fit would change substantially. In terms of vector spaces, the $\mathbf{H}$ matrix gives the *orthogonal projection* of the observation vector $\mathbf{y}$ into the *column space* of the design (model) matrix $\mathbf{X}$.

The overall leverage of each observation $y_i$ is given by its *hat value*, which is the sum of the squared entries of the row of the hat matrix associated with the observation. This is also the diagonal of the hat matrix, because $\mathbf{H} = \mathbf{H}^T \mathbf{H} = \mathbf{H}^2$:

$$
h_{ii} = \sum_{j=1}^n h_{ij}^2
$$

# Regression on a categorical predictor

There are various ways to set up the model matrix to separate the categories. These are viewed and set with the `contrasts` function:

```{r}
contrasts(ds$ffreq)
```

This is the default "treatment contrasts" matrix; the term "treatments" comes from experimental design. Here the "treatment" was by nature, not humans. Treatment contrasts are sometimes called "dummy variables".

The model matrix uses these contrasts:

```{r}
(X <- model.matrix(logZn ~ ffreq, data=ds))
```

Notice how the first category is included in the intercept -- this is the base case; the other two classes are contrasted to it. The model matrix has two variables, beyond the intercept, to represent the two additional classes.

The solution is exactly as before, by OLS:

```{r}
(XTX <- t(X)%*%X)
dim(XTX)
```

This is the product-crossproduct matrix among the predictors. Note the relation to the number of observations in each class.
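To make that relation explicit (a small check added here, using the `XTX`, `n`, and `ds` objects already computed), compare the diagonal of $\mathbf{X}^T \mathbf{X}$ with the total sample size and the number of observations in each class:

```{r}
diag(XTX)   # total n, then the counts in classes 2 and 3
n; table(ds$ffreq)
```

Continuing the OLS solution: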
```{r}
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
(beta <- XTXinv%*%XTy)
```

And again, these are the coefficients found by `lm`:

```{r}
coef(m2 <- lm(logZn ~ ffreq, data=ds))
```

The first coefficient is the mean of the dependent variable for the observations in the first class:

```{r}
beta[1]
mean(ds$logZn[ds$ffreq==1])
```

The other coefficients are the differences between their class means and this first-class mean:

```{r}
beta[2]
mean(ds$logZn[ds$ffreq==2])
mean(ds$logZn[ds$ffreq==2]) - beta[1]
```

We can make the model **without an intercept**, by adding a `-1` to the model formula:

```{r}
(X <- model.matrix(logZn ~ ffreq - 1, data=ds))
```

Now each observation has a `1` only in its own "dummy" variable.

```{r}
(XTX <- t(X)%*%X)
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
(beta <- XTXinv%*%XTy)
coef(m3 <- lm(logZn ~ ffreq - 1, data=ds))
```

Now the coefficients are the means of each class.

# Mixed models

The same formulation works with any combination of predictors, for example both the flood frequency (categorical) and the distance (continuous):

```{r}
(X <- model.matrix(logZn ~ ffreq + dist, data=ds))
(XTX <- t(X)%*%X)
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
(beta <- XTXinv%*%XTy)
coef(m4 <- lm(logZn ~ ffreq + dist, data=ds))
```

Here the intercept is the predicted value at distance 0 in flood frequency class 1.

Their interaction:

```{r}
(X <- model.matrix(logZn ~ ffreq * dist, data=ds))
(XTX <- t(X)%*%X)
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
(beta <- XTXinv%*%XTy)
coef(m5 <- lm(logZn ~ ffreq * dist, data=ds))
```

A nested model without intercept:

```{r}
(X <- model.matrix(logZn ~ ffreq/dist - 1, data=ds))
(XTX <- t(X)%*%X)
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
(beta <- XTXinv%*%XTy)
coef(m6 <- lm(logZn ~ ffreq/dist - 1, data=ds))
```

This gives directly the means of each class at distance 0, and, for each class, a coefficient showing how much the (log) metal concentration changes with normalized distance away from the river.

# Robust regression

A **robust** inference is one that is not greatly disturbed by:

* a few unusual observations in the dataset, so-called *outliers*; or
* a small violation of model assumptions.

For linear modelling, the underlying model form should still be linear, but there may be unusual observations that obscure the general relation. An example of the first case is a *contaminated* dataset, where some observations result from a different process than that which produced the others. The issue here is that the observations do *not* all come from the same population, i.e., they are not the result of a single process.

Many of the problems with parametric regression can be avoided by fitting a so-called **robust linear regression**, which attempts to fit the majority of the observations and ignore the unusual cases. Since we do not know in advance if there is contamination or problems with the model assumptions, there is no way to select an optimum method. See the discussion in:

Fox, J., & Weisberg, S. (2010). Robust regression in R: An appendix to "An R Companion to Applied Regression, second edition". 15-Dec-2010. URL http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Robust-Regression.pdf

One method is called *least trimmed squares* (LTS). It estimates the coefficient vector $\mathbf{\beta}$ by the criterion of minimizing:

$$
\sum_{i=1}^q \big| \; y_i - \mathbf{X}_i \mathbf{\beta} \; \big|_{(i)}^2
$$

That is, the sum of the $q$ smallest squared residuals; the subscript $(i)$ indicates that the squared residuals are sorted in ascending order.
The smallest $q$ residuals are chosen as some proportion of the observations; by default $q = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor$, where $p$ is the number of predictors and $n$ the number of observations. This is about half the observations.

To determine the observations to use, sort the squared residuals from smallest $(1)$ to largest $(n)$:

$$
(e^2)_{(1)}, (e^2)_{(2)}, \ldots, (e^2)_{(n)}
$$

and select the observations with the smallest $q$ of these to re-fit the model.

```{r}
p <- 2  # number of predictors (intercept and slope)
(q <- floor(n/2) + floor((p+1)/2))  # number of "good" observations
```

In our example, `r q` 'good' observations will be used; the other `r (n-q)` observations with the largest residuals will be discarded. This is the 'trimming'.

However, it is not so simple, because after re-fitting on the selected observations the residuals from *all* the observations must be re-computed with the new coefficients, and these sorted residuals may not identify the same $q$ observations as having the smallest residuals. So the procedure is *iterative*, until there is no change in the set of selected 'good' observations.

This is solved by the `lqs` function found in the `MASS` ("Modern Applied Statistics with S") package, explained in the textbook:

Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth Edition. Springer.

This function has several methods, of which least trimmed squares is the default and the most advised.

We show how to compute one iteration of the robust regression line, starting from the OLS line. Find the `r (n-q)` largest absolute residuals, remove those observations, and re-fit:

```{r}
m1 <- lm(logZn ~ dist, data=ds)
# sorted absolute residuals from the OLS fit
(r0.s <- sort(r0.a <- abs(residuals(m1))))
# the largest (n - q) of these
(hi.res <- r0.s[(q+1):n])
# indices of the corresponding observations
(hi.res.ix <- which(abs(residuals(m1)) %in% hi.res))
# remove them and re-fit
ds.trim <- ds[-hi.res.ix,]
dim(ds); dim(ds.trim)
m1.r <- lm(logZn ~ dist, data=ds.trim)
```

Compare the regression coefficients of the full OLS fit and the trimmed re-fit, and the percent change:

```{r}
coef(m1.r); coef(m1)
round(100*(coef(m1.r) - coef(m1))/coef(m1), 1)
```

Now re-compute the residuals using the `predict` method, the first-iteration trimmed model, and the full dataset:

```{r}
(r1.s <- sort(r1.a <- abs(ds$logZn - predict(m1.r, newdata=ds))))
r0.s
```

In general the residuals after the re-fit will not be the same, nor in the same sorted order, as those from the OLS fit. Check if the same observations are discarded:

```{r}
hi.res.1 <- r1.s[(q+1):n]
(hi.res.ix.1 <- which(r1.a %in% hi.res.1))
hi.res.ix
intersect(hi.res.ix, hi.res.ix.1)
```

In this case `r length(intersect(hi.res.ix, hi.res.ix.1))` of the `r (n-q)` largest absolute residuals are from the same observations in both the OLS regression and the first iteration of the LTS regression. If these sets are the same, this is the robust regression. If not, the process is repeated with the new set of 'good' observations, until the sets are the same.

The robust line can be computed directly with the `lqs` function, with the optional `method` argument set to `"lts"` for least trimmed squares. This performs the iterations as needed.

```{r}
require(MASS)
m1.r <- lqs(logZn ~ dist, data=ds, method="lts")
coef(m1.r); coef(m1)
round(100*(coef(m1.r) - coef(m1))/coef(m1), 1)
```

Depending on the sample, the regression coefficients can change quite a lot. Visualize the two regression lines:

```{r out.width='500px', fig.align='center'}
plot(logZn ~ dist, data=ds, col=ffreq, pch=20,
     xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
abline(m1)
abline(m1.r, lty=2)
legend("topright", c("OLS", "robust"), lty=1:2)
```
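As a final illustration (a sketch added here; it uses only the `ds`, `m1`, `m1.r`, `n`, and `q` objects from above, with `fit.rob` and `r.rob` as temporary names), we can list the observations that lie farthest from the robust line -- the ones the trimming effectively ignores -- and compare them with the observations that have the largest absolute residuals from the OLS fit:

```{r}
# fitted values on the robust line, computed from its coefficients
fit.rob <- coef(m1.r)[1] + coef(m1.r)[2] * ds$dist
# absolute residuals from the robust line, named by observation
r.rob <- setNames(abs(ds$logZn - fit.rob), rownames(ds))
# the (n - q) observations farthest from each line
sort(r.rob, decreasing=TRUE)[1:(n-q)]
sort(abs(residuals(m1)), decreasing=TRUE)[1:(n-q)]
```

Depending on the sample, these two sets may or may not coincide.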