This R Markdown file shows how linear models are computed by ordinary least squares (OLS) and by a robust regression variant of OLS.

1 Example dataset

We use a subset of the meuse soil pollution dataset found in the sp (spatial objects) package. Note that each time the source R Markdown file is executed, a different random set of observation is selected from the full dataset, so the results will be different.

library(sp)
data(meuse)
# if you want to see its metadata
# help(meuse)
dim(meuse)
## [1] 155  14

Add computed fields with the log-transformed values of the heavy metals:

meuse$logZn <- log10(meuse$zinc)    # 锌
meuse$logCu <- log10(meuse$copper)  # 铜
meuse$logPb <- log10(meuse$lead)    # 铅
meuse$logCd <- log10(meuse$cadmium) # 镉
dim(meuse)
## [1] 155  18
names(meuse)
##  [1] "x"       "y"       "cadmium" "copper"  "lead"    "zinc"    "elev"   
##  [8] "dist"    "om"      "ffreq"   "soil"    "lime"    "landuse" "dist.m" 
## [15] "logZn"   "logCu"   "logPb"   "logCd"
table(meuse$ffreq) # observations in each flood frequency class
## 
##  1  2  3 
## 84 48 23

To be able to show the matrices, we limit the data to only a few rows and 4 selected columns: a dependent variable logZn, a continuous predictor dist (normalized distance to the river Meuse), and a categorical predictor ffreq(flooding frequency).

To ensure we have some observations from each flood frequency class, we take a stratified sample based on this category. This uses the strata method from the sampling package. We choose an equal number of observations from each of the three classes.

The n.s parameter is the sample size from each stratum; you can change this to see the effect of taking larger or smaller samples.

n.s <- 5  # sample size from each stratum
(n <- n.s * length(levels(meuse$ffreq))) # total sample size
## [1] 15
library(sampling)
# indices of the selected rows in the data frame
(ix <- strata(meuse, stratanames="ffreq", size=rep(n.s,3), method="srswor")$ID_unit)
##  [1]  16  17  38  61  74 109 125 129 131 132 143 144 148 150 151

These are the selected rows, they will differ with each random sample.

# the reduced data frame
(ds <- meuse[ix, c("logZn","dist", "ffreq")])
##        logZn      dist ffreq
## 16  3.013680 0.0000000     1
## 17  2.782473 0.0122243     1
## 39  2.920645 0.0484965     1
## 62  2.957607 0.0054321     1
## 83  2.773055 0.2009760     1
## 114 2.283301 0.5757520     2
## 131 2.418301 0.1683310     2
## 135 2.618048 0.0812194     2
## 161 2.100371 0.7716980     2
## 162 2.322219 0.3368290     2
## 148 2.227887 0.1030290     3
## 149 2.605305 0.0703550     3
## 153 2.893762 0.0921435     3
## 155 2.330414 0.3749400     3
## 156 2.220108 0.4238370     3

Visualize the relation with possible predictors of log(Zn):

plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()

plot(logZn ~ ffreq, data=ds, xlab="Flood frequency class", boxwex=0.4)