In this exercise we return to the Meuse heavy metals dataset.
Packages used in this section: sp for the sample dataset.
library(sp)
The sample point dataset comes with sp; we make a log10-transformed copy of the metal concentrations.
We also load the prediction grid: the locations where the target variable is unknown, at which we want to predict it using the model built (trained, calibrated) from the known points, where we know both the target variable and the predictors.
data(meuse)
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
meuse$logZn <- log10(meuse$zinc)    # zinc
meuse$logCu <- log10(meuse$copper)  # copper
meuse$logPb <- log10(meuse$lead)    # lead
meuse$logCd <- log10(meuse$cadmium) # cadmium
data(meuse.grid)
str(meuse.grid)
## 'data.frame': 3103 obs. of 7 variables:
## $ x : num 181180 181140 181180 181220 181100 ...
## $ y : num 333740 333700 333700 333700 333660 ...
## $ part.a: num 1 1 1 1 1 1 1 1 1 1 ...
## $ part.b: num 0 0 0 0 0 0 0 0 0 0 ...
## $ dist : num 0 0 0.0122 0.0435 0 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
The objective is to model the log10(Zn) concentration from covariables, not including the other metals. One approach is the regression tree, which recursively splits the observations at thresholds of the predictor variables and predicts a constant value (the group mean) within each resulting group.
Reference: Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. Boca Raton, FL: Chapman & Hall/CRC Press.
Packages used in this section: sp for spatial objects, rpart for regression trees, rpart.plot for displaying them.
library(sp)
library(rpart)
library(rpart.plot)
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [15] "logZn" "logCu" "logPb" "logCd"
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What covariables are available in the point dataset? Are these all available in the prediction grid?
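A quick programmatic check of which fields the point dataset and the prediction grid share (a small sketch using base R):
intersect(names(meuse), names(meuse.grid))  # covariables available in both
setdiff(names(meuse), names(meuse.grid))    # available only at the sample points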
We will try to model the log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether lime has been added to the site, and elevation above sea level. We also use the coordinates to capture any geographic trend.
Q: Why could these be good predictors of the metal concentration in this site?
Q: Do you see any problem in using so many predictors? Would you do this in a linear model?
We will use the rpart function to build the regression tree. This uses a helper function rpart.control to control the behaviour of rpart.
Examine the help for rpart.control.
help(rpart.control)
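For reference, the extra arguments we pass to rpart below (minsplit, cp) are forwarded to rpart.control; an equivalent, more explicit form would be:
ctrl <- rpart.control(minsplit=2, cp=0.005)
# then: rpart(formula, data=meuse, control=ctrl)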
Q: What is the meaning of the minsplit and cp parameters?
Build a regression tree for this, print its results, and display as a tree plot. Allow splits all the way to single observations, if the \(R^2\) increases by at least 0.5% at the split.
m.lzn.rp <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse,
minsplit=2,
cp=0.005)
print(m.lzn.rp)
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342
## 64) y>=330942 21 0.14958940 2.167449 *
## 65) y< 330942 19 0.07701018 2.285117 *
## 33) ffreq=1 31 0.47328450 2.373281
## 66) x< 179913.5 1 0.00000000 2.075547 *
## 67) x>=179913.5 30 0.38168440 2.383205
## 134) x>=180346.5 25 0.20839560 2.355868
## 268) y< 333191 19 0.09154361 2.320174 *
## 269) y>=333191 6 0.01598697 2.468900 *
## 135) x< 180346.5 5 0.06119400 2.519889 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774
## 18) x< 180563 10 0.23688250 2.494602
## 36) x>=179405.5 7 0.10694750 2.420923 *
## 37) x< 179405.5 3 0.00326675 2.666521 *
## 19) x>=180563 5 0.06771894 2.657116 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391
## 30) x< 179565 9 0.12124970 2.914541 *
## 31) x>=179565 19 0.34486030 3.036636
## 62) x>=180194 12 0.12169000 2.966064 *
## 63) x< 180194 7 0.06095067 3.157617 *
rpart.plot(m.lzn.rp, digits=3, type=4, extra=1)
Q: Which variables were actually used in the tree?
Q: Which variable was used at the first split?
Q: Does that agree with your theory of the origin of the metals in these soils?
Q: What was the value of the splitting variable at that split?
Q: What was the average log10(Zn) before the split, and in the two leaves after the split?
Q: How many leaves (i.e., different predictions)? What proportion of the observations is this?
Q: What is the minimum number of observations among the leaves? The maximum?
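If you want to check these answers programmatically, the fitted object's frame component has one row per node, with var == "<leaf>" marking the terminal nodes; a quick sketch:
leaves <- m.lzn.rp$frame[m.lzn.rp$frame$var == "<leaf>", ]
nrow(leaves)     # number of leaves, i.e., distinct predictions
range(leaves$n)  # smallest and largest number of observations in a leaf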
Now we examine the variable importance:
print(tmp <- m.lzn.rp$variable.importance)
## dist.m elev lime x y soil ffreq
## 9.410837 6.134458 4.633959 2.097574 1.911179 1.169126 1.143282
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 35.512035
## elev 23.148534
## lime 17.486364
## x 7.915251
## y 7.211884
## soil 4.411728
## ffreq 4.314205
Q: Which variable explained most of the metal concentration? What proportion? Were all predictor variables used somewhere in the tree?
Have we built a tree that is too complex, i.e., one that fits noise rather than signal? In other words, does it fit our particular dataset well but not the underlying process?
Examine the cross-validation relative error vs. the complexity parameter.
Q: What complexity parameter minimizes the cross-validation relative error?
We can also compute this, rather than just looking at the graph.
plotcp(m.lzn.rp)
cp.table <- m.lzn.rp[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.009124468 8.000000000 0.153385306 0.292993740 0.032477480
cp.min <- cp.table[cp.ix,"CP"]
(Note that some other cross-validation errors are quite close to this minimum, so we might choose a larger value of the complexity parameter to obtain a more parsimonious model, which might extrapolate better.)
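One common way to formalize this choice is the so-called 1-SE rule: take the largest cp whose cross-validation error is within one standard error of the minimum. A sketch, using the cptable columns shown above:
xerr.thresh <- cp.table[cp.ix, "xerror"] + cp.table[cp.ix, "xstd"]  # minimum + 1 SE
(cp.1se <- max(cp.table[cp.table[, "xerror"] <= xerr.thresh, "CP"]))
# prune(m.lzn.rp, cp=cp.1se) would then give the more parsimonious tree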
Prune the tree with this complexity parameter, and examine in the same way as above.
(Note we would get the same result if we re-fit with this parameter.)
(m.lzn.rpp <- prune(m.lzn.rp, cp=cp.min))
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342 *
## 33) ffreq=1 31 0.47328450 2.373281 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391 *
rpart.plot(m.lzn.rpp, digits=3, type=4, extra=1)
Q: How does this tree differ from the full (un-pruned) tree? Look at the predictors actually used, the number of splits, the number of leaves, and the number of observations in each leaf.
Q: Which leaf predicts the highest concentration of log10(Zn)? Explain that by reading the predictors and their values down the tree from the root to this leaf. Does this fit your theory of the origin of these metals?
Now, compare the model predictions to the known values:
p.rpp <- predict(m.lzn.rpp, newdata=meuse)
length(unique(p.rpp))
## [1] 9
summary(r.rpp <- meuse$logZn - p.rpp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.320887 -0.090478 -0.005171 0.000000 0.081155 0.293237
sqrt(sum(r.rpp^2)/length(r.rpp))
## [1] 0.1223873
plot(meuse$logZn ~ p.rpp, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Regression Tree")
grid()
abline(0,1)
Q: What is the RMSE?
Q: Why are there vertical lines at all the fitted values?
Another use of CART is to predict a category. Here we try to predict whether a site is in flood frequency class 1, 2, or 3. If this is a successful model, we could use it to map flood frequency, if we didn’t already have this map.
Q: What covariables might help predict flooding frequency?
Flooding is influenced by the distance from the river and the elevation above the river. Build a model with these, along with the coordinates, using a complexity parameter of 0.02, i.e., each split must improve the overall fit by at least 2%.
m.ff.rp <- rpart(ffreq ~ x + y + dist.m + elev,
data=meuse,
minsplit=2,
cp=0.02)
print(m.ff.rp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265)
## 56) y>=330644 35 5 2 (0.14285714 0.85714286 0.00000000)
## 112) dist.m>=470 17 5 2 (0.29411765 0.70588235 0.00000000)
## 224) elev< 8.526 3 0 1 (1.00000000 0.00000000 0.00000000) *
## 225) elev>=8.526 14 2 2 (0.14285714 0.85714286 0.00000000) *
## 113) dist.m< 470 18 0 2 (0.00000000 1.00000000 0.00000000) *
## 57) y< 330644 14 4 2 (0.00000000 0.71428571 0.28571429)
## 114) elev< 8.7225 11 1 2 (0.00000000 0.90909091 0.09090909) *
## 115) elev>=8.7225 3 0 3 (0.00000000 0.00000000 1.00000000) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rp)
Q: Were both predictors used?
Q: What was the first split?
Find the value of the complexity parameter that minimizes the cross-validation error, and prune the tree with that value.
printcp(m.ff.rp)
##
## Classification tree:
## rpart(formula = ffreq ~ x + y + dist.m + elev, data = meuse,
## minsplit = 2, cp = 0.02)
##
## Variables actually used in tree construction:
## [1] dist.m elev y
##
## Root node error: 71/155 = 0.45806
##
## n= 155
##
## CP nsplit rel error xerror xstd
## 1 0.267606 0 1.00000 1.00000 0.087366
## 2 0.140845 2 0.46479 0.49296 0.073316
## 3 0.042254 3 0.32394 0.36620 0.065518
## 4 0.028169 5 0.23944 0.39437 0.067462
## 5 0.021127 6 0.21127 0.33803 0.063433
## 6 0.020000 10 0.12676 0.38028 0.066506
plotcp(m.ff.rp)
cp.table.class <- m.ff.rp[["cptable"]]
cp.ix <- which.min(cp.table.class[,"xerror"])
print(cp.table.class[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.02112676 6.00000000 0.21126761 0.33802817 0.06343327
(cp.min <- cp.table.class[cp.ix,"CP"])
## [1] 0.02112676
m.ff.rpp <- prune(m.ff.rp, cp=cp.min)
print(m.ff.rpp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rpp, type=2, extra=1)
Q: Now how many splits? Which variables were used? Are all classes predicted? How pure are the leaves?
Q: How useful is this model? In other words, how well could flooding frequency class be mapped from knowing the distance to river and elevation?
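A quick way to judge this (optimistic, since it uses the fitted rather than cross-validated classes) is a confusion matrix of predicted vs. actual flood-frequency classes:
table(predicted=predict(m.ff.rpp, type="class"), actual=meuse$ffreq)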
Variable importance:
m.ff.rp$variable.importance
## elev y x dist.m
## 38.26320 38.24051 16.58514 11.40992
m.ff.rpp$variable.importance
## y elev x dist.m
## 36.74909 29.85313 14.99264 9.89731
In both the unpruned and pruned trees the y-coordinate and elevation are most important.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What variables are available for mapping over the grid?
The grid only has coordinates, relative distance, soil type, and flood frequency. So if we want a model to map log(Zn) over the grid, we can only use these variables.
Build and prune a regression tree for log(Zn) with these predictors; display the pruned tree and the variable importance.
m.lzn.g <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse,
minsplit=2,
cp=0.005)
plotcp(m.lzn.g)
rpart.plot(m.lzn.g, digits=3, type=4, extra=1)
cp.table <- m.lzn.g[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.008159035 11.000000000 0.132179501 0.414892945 0.074165183
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.008159035
Now prune the tree to the indicated complexity:
m.lzn.gp <- prune(m.lzn.g, cp=cp.min)
rpart.plot(m.lzn.gp)
print(tmp <- m.lzn.gp$variable.importance)
## dist x soil y ffreq
## 10.316275 4.015682 3.417669 2.582310 2.407865
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 45.36660
## x 17.65927
## soil 15.02946
## y 11.35591
## ffreq 10.58877
Q: Which predictors were used? What was their relative importance?
Q: Why is the x-coordinate important in this dataset? Do you think this would be important in other parts of the flood plain?
Now we can map with this.
First, make an sp version of the grid, with the coordinates having special status, i.e., not treated as feature-space attributes:
meuse.grid.sp <- meuse.grid
coordinates(meuse.grid.sp) <- ~ x + y; gridded(meuse.grid.sp) <- TRUE
meuse.grid.sp$predictedLog10Zn <- predict(m.lzn.gp, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn")
Q: How realistic is this map?
Single regression trees can vary quite a bit depending on small differences in the dataset. Convince yourself of this.
Remove a random 15 of the 155 points, then fit and prune the first tree we attempted, i.e., log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether lime has been added to the site, and elevation above sea level.
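(Because sample() is random, your subset, and therefore your tree, will differ from the one shown here. If you want a reproducible subset you could set a seed first; the value is arbitrary.)
# set.seed(314)  # optional: makes the random 15-point removal reproducible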
meuse.140.ix <- sample(1:dim(meuse)[1], size=140, replace=FALSE)
meuse.140 <- meuse[meuse.140.ix,]
m.lzn.rp.140 <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse.140,
minsplit=2,
cp=0.005)
rpart.plot(m.lzn.rp.140, digits=3, type=4, extra=1)
print(tmp <- m.lzn.rp.140$variable.importance)
## dist.m elev lime x y soil ffreq
## 8.9071397 5.9338954 4.4152575 2.1333175 1.9975137 1.4132827 0.8036173
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 34.788046
## elev 23.175636
## lime 17.244389
## x 8.331962
## y 7.801562
## soil 5.519768
## ffreq 3.138637
Prune the tree:
plotcp(m.lzn.rp.140)
cp.table <- m.lzn.rp.140[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.009101917 9.000000000 0.135288284 0.273448718 0.037955567
cp.min <- cp.table[cp.ix,"CP"]
m.lzn.rpp.140 <- prune(m.lzn.rp.140, cp=cp.min)
rpart.plot(m.lzn.rpp.140, digits=3, type=4, extra=1)
Q: What are the similarities and differences between your tree and the one we fit at the beginning of the exercise? Consider the “optimum” complexity parameter, the number of leaves and splits, where the predictors are used, and which predictors are used.
We can do this also for the map:
m.lzn.g.140 <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse.140,
minsplit=2,
cp=0.005)
Prune the tree to the indicated complexity:
plotcp(m.lzn.g.140)
rpart.plot(m.lzn.g.140, digits=3, type=4, extra=1)
cp.table <- m.lzn.g.140[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.00957948 8.00000000 0.14802401 0.26309089 0.04321030
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.00957948
m.lzn.gp.140 <- prune(m.lzn.g.140, cp=cp.min)
rpart.plot(m.lzn.gp.140)
print(tmp <- m.lzn.gp.140$variable.importance)
## dist soil x ffreq y
## 9.557269 4.683515 3.714189 3.399352 3.251236
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 38.84191
## soil 19.03438
## x 15.09492
## ffreq 13.81538
## y 13.21342
Compute and display the map using the pruned tree:
meuse.grid.sp$predictedLog10Zn.140 <- predict(m.lzn.gp.140, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.140")
Compute and display the differences in predictions:
meuse.grid.sp$predictedLog10Zn.diff <-
meuse.grid.sp$predictedLog10Zn - meuse.grid.sp$predictedLog10Zn.140
summary(meuse.grid.sp$predictedLog10Zn.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.537763 0.007214 0.033264 -0.003123 0.042397 0.592105
spplot(meuse.grid.sp, zcol="predictedLog10Zn.diff", col.regions=topo.colors(36))
Q: How different are the maps made with the regression tree based on all 155 observations, and the one made with 140 randomly-selected observations? Can you explain the spatial pattern of the differences? Hint: plot the locations of the omitted points.
meuse.omitted <- meuse[-meuse.140.ix,]
plot(y ~ x, data=meuse.omitted, asp=1,
cex=2*meuse.omitted$logZn/max(meuse.omitted$logZn))
data(meuse.riv)
lines(meuse.riv)
grid()
The idea here is simple: build many regression trees and average them.
Reference: Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Packages used in this section: randomForest
for random forests, randomForestExplainer
for some diagnostics.
library(randomForest)
library(randomForestExplainer)
First build the forest, using the randomForest
function:
m.lzn.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=5)
print(m.lzn.rf)
##
## Call:
## randomForest(formula = logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 5, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.01887242
## % Var explained: 80.67
The mtry optional argument gives the number of variables randomly sampled as candidates at each split. For regression the default is \(\lfloor p/3 \rfloor\), where \(p\) is the number of predictors; here \(p = 7\), so the default would be \(\lfloor 7/3 \rfloor = 2\). We set mtry=5 so that most of the variables are candidates at each split, for more accurate splitting.
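A quick check of that arithmetic:
p <- 7              # predictors in the formula above
max(floor(p/3), 1)  # randomForest's default mtry for regression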
Q: How much of the variance is explained by the forest?
Here we cannot display a single tree structure, since each tree in the forest is different.
Display the cross-validation error rate against the number of trees:
plot(m.lzn.rf)
Q: How many trees were needed to make a reliable forest?
Each RF fit is random, so repeated fits will not be identical. How consistent are they?
Compute the RF several times and collect statistics:
n <- 24
rf.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
for (i in 1:n) {
model.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data=meuse, importance=T, na.action=na.omit, mtry=5)
rf.stats[i, "mse"] <- median(model.rf$mse)
rf.stats[i, "rsq"] <- median(model.rf$rsq)
}
summary(rf.stats[,2:3])
## rsq mse
## Min. :0.7917 Min. :0.01872
## 1st Qu.:0.8002 1st Qu.:0.01921
## Median :0.8013 Median :0.01940
## Mean :0.8013 Mean :0.01940
## 3rd Qu.:0.8033 3rd Qu.:0.01951
## Max. :0.8083 Max. :0.02034
hist(rf.stats[,"rsq"], xlab="RandomForest R^2")
rug(rf.stats[,"rsq"])
hist(rf.stats[,"mse"], xlab="RandomForest RMSE")
rug(rf.stats[,"mse"])
Q: How consistent are the 24 runs of the random forest?
Examine the variable importance. Type 1 is the increase in MSE when the values of that predictor are randomly permuted among the points, compared with using their actual values. It is computed as follows (from the Help): “For each tree, the prediction error on the OOB portion of the data is recorded (…MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences”.
Type 2 is the increase in node purity. It is computed as follows: “The total decrease in node impurities from splitting on the [given] variable, averaged over all trees… measured by residual sum of squares.”
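To see the idea behind the Type 1 measure, here is an illustrative sketch: permute one predictor, re-predict, and look at the increase in MSE. (The real computation is done per tree on its out-of-bag cases and then scaled, so these numbers will not match the importance() output.)
meuse.perm <- meuse
meuse.perm$dist.m <- sample(meuse.perm$dist.m)   # destroy the dist.m/logZn relation
mse.orig <- mean((meuse$logZn - predict(m.lzn.rf, newdata=meuse))^2)
mse.perm <- mean((meuse$logZn - predict(m.lzn.rf, newdata=meuse.perm))^2)
mse.perm - mse.orig   # increase in MSE attributable to permuting dist.m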
importance(m.lzn.rf, type=1)
## %IncMSE
## ffreq 13.511952
## x 26.031058
## y 18.648671
## dist.m 57.516695
## elev 37.554231
## soil 6.934790
## lime 9.604052
importance(m.lzn.rf, type=2)
## IncNodePurity
## ffreq 0.4501877
## x 1.1210360
## y 0.7244526
## dist.m 8.0261675
## elev 3.5421854
## soil 0.2220359
## lime 0.7047061
varImpPlot(m.lzn.rf, type=1)
Q: Does this agree with the variable importance from the single trees we built above?
Q: Why are some variables given importance, although they were not used in the regression tree models?
randomForestExplainer
Another way to look at this is by the depth in the tree at which each predictor is first used. This is found with the min_depth_distribution function of the randomForestExplainer package.
min_depth_frame <- min_depth_distribution(m.lzn.rf)
str(min_depth_frame) # has results for all the trees
## 'data.frame': 3222 obs. of 3 variables:
## $ tree : int 1 1 1 1 1 1 2 2 2 2 ...
## $ variable : Factor w/ 7 levels "dist.m","elev",..: 1 2 3 4 5 6 1 2 4 5 ...
## $ minimal_depth: num 0 1 1 4 2 3 1 0 1 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
plot_min_depth_distribution(min_depth_frame)
Q: Which predictor is used most as the root?
Q: Which predictor is used most at the second level?
Q: Which predictors are not used in any trees?
Q: Do these results agree with the variable importance plot?
This package gives a wider choice of importance measures:
importance_frame <- measure_importance(m.lzn.rf)
print(importance_frame)
## variable mean_min_depth no_of_nodes mse_increase node_purity_increase
## 1 dist.m 0.342000 5437 0.081638354 8.0261675
## 2 elev 1.042000 6520 0.033787030 3.5421854
## 3 ffreq 3.625152 994 0.007941017 0.4501877
## 4 lime 4.404208 484 0.003762594 0.7047061
## 5 soil 5.189408 686 0.001803041 0.2220359
## 6 x 2.142000 5722 0.011567810 1.1210360
## 7 y 2.640000 5199 0.007384188 0.7244526
## no_of_trees times_a_root p_value
## 1 500 351 2.311138e-219
## 2 500 100 0.000000e+00
## 3 458 1 1.000000e+00
## 4 357 40 1.000000e+00
## 5 407 8 1.000000e+00
## 6 500 0 1.575661e-286
## 7 500 0 1.324765e-169
Q: Which predictor was used in the most nodes, over all trees?
Q: Which predictors were never used as the root of the tree?
Q: Which predictors were used in every tree?
The p-value measures whether the observed number of successes (the number of nodes in which \(X_j\) was used for splitting) exceeds the theoretical number of successes if they were random, following a binomial distribution. See https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html for details.
Another interesting plot is the “multiway importance plot”, which by default shows the number of times as root vs. the mean minimum depth for each predictor. There is typically a negative relation between times_a_root and mean_min_depth.
plot_multi_way_importance(m.lzn.rf, size_measure = "no_of_nodes")
Q: Which predictor is most important by both measures?
Q: Is this a consistent negative relation? If not, where are the exceptions?
Another interesting plot is the pairwise comparison of different importance measures:
plot_importance_ggpairs(importance_frame)
Q: Which are the most and least related measures?
Examine the interactions between predictors, in terms of their use within the forest:
interactions_frame <- min_depth_interactions(m.lzn.rf)
interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE)[1:12], ]
## variable root_variable mean_min_depth occurrences interaction
## 43 y dist.m 1.5595960 495 dist.m:y
## 8 elev dist.m 0.3920364 494 dist.m:elev
## 1 dist.m dist.m 1.0286343 492 dist.m:dist.m
## 36 x dist.m 1.0994586 491 dist.m:x
## 9 elev elev 1.1333818 483 elev:elev
## 44 y elev 1.7066505 474 elev:y
## 37 x elev 1.5122101 473 elev:x
## 2 dist.m elev 1.2774303 456 elev:dist.m
## 15 ffreq dist.m 3.9632323 395 dist.m:ffreq
## 48 y x 1.9688081 389 x:y
## 41 x x 2.2824646 388 x:x
## 13 elev x 2.3496162 364 x:elev
## uncond_mean_min_depth
## 43 2.640000
## 8 1.042000
## 1 0.342000
## 36 2.142000
## 9 1.042000
## 44 2.640000
## 37 2.142000
## 2 0.342000
## 15 3.625152
## 48 2.640000
## 41 2.142000
## 13 1.042000
plot_min_depth_interactions(interactions_frame)
Q: Which pair of predictors are most used near the root of the tree, and with the most co-occurrences?
“To further investigate the most frequent interaction we use the function plot_predict_interaction to plot the prediction of our forest on a grid of values for the components of each interaction. The function requires the forest, training data, variable to use on x and y-axis, respectively. In addition, one can also decrease the number of points in both dimensions of the grid from the default of 100 in case of insufficient memory using the parameter grid.”
Here we see the predicted log(Zn) content for all combinations of distance from river and elevation:
plot_predict_interaction(m.lzn.rf, meuse, "dist.m", "elev")
These partial dependence plots show the marginal effect of one predictor on the predictand, averaging the forest's predictions over the observed values of the other predictors. We show these in order of variable importance, as determined just above.
partialPlot(m.lzn.rf, pred.data=meuse, x.var="x")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="y")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="dist.m")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="elev")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="ffreq", which.class=1)
Q: What is the effect of just the distance to the river on the RF prediction of log10(Zn), once the effects of the other variables (the continuous elev and the classified ffreq, soil and lime) have been averaged out? Is the effect monotonic (i.e., consistently increasing or decreasing)? Why or why not?
Predict back onto the known points, and evaluate the goodness-of-fit:
p.rf <- predict(m.lzn.rf, newdata=meuse)
length(unique(p.rf))
## [1] 155
summary(r.rpp <- meuse$logZn - p.rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1825439 -0.0364426 -0.0025826 -0.0004145 0.0329439 0.1577282
(rmse.rf <- sqrt(sum(r.rpp^2)/length(r.rpp)))
## [1] 0.06099567
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the RMSE?
Q: How many different predictions were made? Why not just a few, as in regression trees?
A more realistic view of the predictive power of the model is the out-of-bag validation.
p.rf.oob <- predict(m.lzn.rf)
summary(r.rpp.oob <- meuse$logZn - p.rf.oob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.426973 -0.082194 -0.010906 -0.001954 0.077214 0.377211
(rmse.oob <- sqrt(sum(r.rpp.oob^2)/length(r.rpp.oob)))
## [1] 0.1373769
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
xlab="Out-of-bag cross-validation estimates",
ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3),
main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the OOB RMSE? How does this compare with the fitted RMSE? Which is more realistic for predictions?
Q: How does this compare to the model fits?
Q: How does this compare with the RMSE from the Regression Tree?
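One way to line the three up (re-computing the pruned regression-tree RMSE, which we did not save above):
rmse.rt <- sqrt(mean((meuse$logZn - predict(m.lzn.rpp, newdata=meuse))^2))
data.frame(model=c("regression tree (fit)", "random forest (fit)", "random forest (OOB)"),
RMSE=c(rmse.rt, rmse.rf, rmse.oob))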
We can try to build an RF model to map the grid, but again we can only use coordinates, flood frequency, soil type and relative distance to the river, because these are the only variables we have over the whole area, i.e., in the spatial object meuse.grid.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
m.lzn.rf.g <- randomForest(logZn ~ x + y + ffreq + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.lzn.rf.g)
(tmp <- importance(m.lzn.rf.g, type=1))
## %IncMSE
## x 30.38927
## y 20.81755
## ffreq 37.19663
## dist 52.99835
## soil 11.68151
(100*tmp/sum(tmp))
## %IncMSE
## x 19.851457
## y 13.598840
## ffreq 24.298292
## dist 34.620595
## soil 7.630817
varImpPlot(m.lzn.rf.g, type=1)
meuse.grid.sp$predictedLog10Zn.rf <- predict(m.lzn.rf.g, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.rf")
Q: Does this map look realistic? How does it compare with the map made with the single regression tree?
This method is derived from Quinlan's C4.5/M5 family of tree and rule learners.
References:
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, Calif: Morgan Kaufmann Publishers.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. https://doi.org/10.1613/jair.279
A recent article comparing it to RF is: Houborg, R., & McCabe, M. F. (2018). A hybrid training approach for leaf area index estimation via Cubist and random forests machine-learning. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 173–188. https://doi.org/10.1016/j.isprsjprs.2017.10.004
It is similar to RF, but instead of a single value at each leaf it fits a multivariate linear regression to the cases in that leaf, using the predictors that appear in the splits leading to it.
The Cubist package is an R port of the Cubist GPL C code released by RuleQuest at http://rulequest.com/cubist-info.html.
An explanation from the Cubist vignette:
Cubist is a rule–based model that is an extension of Quinlan’s M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
Package used in this section: Cubist for the Cubist rule-based regression algorithm.
library(Cubist)
Split the data into training and test sets, here with 80% in the training set:
inTrain <- sample(1:nrow(meuse), floor(.8*nrow(meuse)))
Build training and test matrices. Note there is no formula method for cubist; the predictors are specified as a matrix or data frame and the outcome is a numeric vector.
preds <- c("x","y","ffreq","dist.m","elev","soil","lime")
train.pred <- meuse[ inTrain, preds]
test.pred <- meuse[-inTrain, preds]
train.resp <- meuse$logZn[ inTrain]
test.resp <- meuse$logZn[-inTrain]
Now fit the model to the training set:
(c.model <- cubist(x = train.pred, y = train.resp))
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
## Number of samples: 124
## Number of predictors: 7
##
## Number of committees: 1
## Number of rules: 2
summary(c.model)
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Sun Aug 4 18:41:04 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 124 cases (8 attributes) from undefined.data
##
## Model:
##
## Rule 1: [80 cases, mean 2.366983, range 2.053078 to 2.89098, est err 0.118638]
##
## if
## dist.m > 140
## then
## outcome = -10.61724 + 0.000166 y - 0.000228 x - 0.111 elev
## - 8e-05 dist.m
##
## Rule 2: [44 cases, mean 2.833035, range 2.281033 to 3.196176, est err 0.143328]
##
## if
## dist.m <= 140
## then
## outcome = 3.586028 - 0.00129 dist.m - 0.088 elev
##
##
## Evaluation on training data (124 cases):
##
## Average |error| 0.119965
## Relative |error| 0.47
## Correlation coefficient 0.87
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist.m
## 100% elev
## 65% x
## 65% y
##
##
## Time: 0.0 secs
In this case there are only two rules, each with its own regression model. The split is on distance to the river.
We predict onto the test set to get an estimate of predictive accuracy:
c.pred <- predict(c.model, test.pred)
## Test set RMSE
sqrt(mean((c.pred - test.resp)^2))
## [1] 0.1770856
## Test set R^2
cor(c.pred, test.resp)^2
## [1] 0.8244534
Now, fit a model on the entire set and then predict over the grid. Again we can only use predictors that are in the grid.
preds <- c("x","y","dist","soil","ffreq")
all.preds <- meuse[, preds]
all.resp <- meuse$logZn
c.model.grid <- cubist(x = all.preds, y = all.resp)
summary(c.model.grid)
##
## Call:
## cubist.default(x = all.preds, y = all.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Sun Aug 4 18:41:04 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model:
##
## Rule 1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.111292
## Relative |error| 0.41
## Correlation coefficient 0.85
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist
## 48% 52% x
## 57% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist <- predict(c.model.grid, newdata=meuse.grid[,preds])
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist", col.regions=bpy.colors(64))
Here we can see the effect of the local regression models, as well as the hard splits on dist and x.
Cubist can improve performance in two ways: (1) with “committees” and (2) adjusting by nearest neighbours:
The Cubist model can also use a boosting–like scheme called committees where iterative model trees are created in sequence. The first tree follows the procedure described in the last section. Subsequent trees are created using adjusted versions to the training set outcome: if the model over–predicted a value, the response is adjusted downward for the next model (and so on). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
Another innovation in Cubist [is] using nearest–neighbors to adjust the predictions from the rule–based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find its nearest neighbors and determine the average of these training set points.
To find the optimum values for these, we turn to the caret package.
The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques.
See also: http://topepo.github.io/caret/model-training-and-tuning.html
The train function in caret can use many models, supplied by other packages, by specifying the method argument. Then the tuneGrid parameter refers to a grid of possible parameters to compare by cross-validation. Note we use all the points, because train does the splitting as it evaluates the alternative models.
require(caret)
test <- train(x = all.preds, y = all.resp, method="cubist",
tuneGrid = expand.grid(.committees = 1:12,
.neighbors = 0:5),
trControl = trainControl(method = 'cv'))
print(test)
## Cubist
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 140, 139, 140, 140, 139, 139, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1883956 0.6644354 0.1458751
## 1 1 0.1588804 0.7350231 0.1166150
## 1 2 0.1460033 0.7801726 0.1091955
## 1 3 0.1465169 0.7847191 0.1106126
## 1 4 0.1493369 0.7795308 0.1132737
## 1 5 0.1520507 0.7743130 0.1169301
## 2 0 0.1683147 0.7411218 0.1315260
## 2 1 0.1608123 0.7397582 0.1172865
## 2 2 0.1462212 0.7862414 0.1088365
## 2 3 0.1448057 0.7933283 0.1096120
## 2 4 0.1457635 0.7923567 0.1110771
## 2 5 0.1458465 0.7932452 0.1128878
## 3 0 0.1676694 0.7274362 0.1300419
## 3 1 0.1594861 0.7355455 0.1159521
## 3 2 0.1463699 0.7792583 0.1092873
## 3 3 0.1450852 0.7868112 0.1097221
## 3 4 0.1463667 0.7846627 0.1116297
## 3 5 0.1471833 0.7842686 0.1135879
## 4 0 0.1722867 0.7230334 0.1350674
## 4 1 0.1595969 0.7374772 0.1157987
## 4 2 0.1465086 0.7796208 0.1087658
## 4 3 0.1457687 0.7860170 0.1102305
## 4 4 0.1471952 0.7840891 0.1124495
## 4 5 0.1479295 0.7841296 0.1146606
## 5 0 0.1692093 0.7264544 0.1308310
## 5 1 0.1573260 0.7432324 0.1134389
## 5 2 0.1451288 0.7830943 0.1085275
## 5 3 0.1445451 0.7892661 0.1097826
## 5 4 0.1458363 0.7875755 0.1119257
## 5 5 0.1465786 0.7876978 0.1137389
## 6 0 0.1681138 0.7348997 0.1306287
## 6 1 0.1577049 0.7448716 0.1139049
## 6 2 0.1453464 0.7845837 0.1083253
## 6 3 0.1450470 0.7898445 0.1096985
## 6 4 0.1465139 0.7878958 0.1119542
## 6 5 0.1472043 0.7882202 0.1139688
## 7 0 0.1689879 0.7319173 0.1298003
## 7 1 0.1590891 0.7424403 0.1141648
## 7 2 0.1478159 0.7790716 0.1101016
## 7 3 0.1472611 0.7850995 0.1114656
## 7 4 0.1486386 0.7831451 0.1129634
## 7 5 0.1494694 0.7834350 0.1150561
## 8 0 0.1692387 0.7303566 0.1308911
## 8 1 0.1583597 0.7451302 0.1140818
## 8 2 0.1469568 0.7816830 0.1096589
## 8 3 0.1467558 0.7864582 0.1110293
## 8 4 0.1482476 0.7842532 0.1130076
## 8 5 0.1490496 0.7844580 0.1148313
## 9 0 0.1680918 0.7334097 0.1293326
## 9 1 0.1580984 0.7445659 0.1135899
## 9 2 0.1460286 0.7835949 0.1082919
## 9 3 0.1456577 0.7891344 0.1096461
## 9 4 0.1470649 0.7872591 0.1117024
## 9 5 0.1473304 0.7886694 0.1135409
## 10 0 0.1667062 0.7361165 0.1293054
## 10 1 0.1583984 0.7437456 0.1141123
## 10 2 0.1463530 0.7827190 0.1089020
## 10 3 0.1461691 0.7873274 0.1105037
## 10 4 0.1475471 0.7854459 0.1122338
## 10 5 0.1484767 0.7850997 0.1142535
## 11 0 0.1667215 0.7397498 0.1287537
## 11 1 0.1583368 0.7437772 0.1138371
## 11 2 0.1460518 0.7836939 0.1080391
## 11 3 0.1456496 0.7891811 0.1092283
## 11 4 0.1468634 0.7879053 0.1108272
## 11 5 0.1471398 0.7893048 0.1128886
## 12 0 0.1645851 0.7456427 0.1277227
## 12 1 0.1583615 0.7432244 0.1137673
## 12 2 0.1456015 0.7842648 0.1080848
## 12 3 0.1451584 0.7898198 0.1092122
## 12 4 0.1462251 0.7892730 0.1110120
## 12 5 0.1469560 0.7895430 0.1129932
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 5 and neighbors = 3.
plot(test, metric="RMSE")
plot(test, metric="Rsquared")
Of course the cross-validation splits are random, so each run will be different. In this run the table shows that adding committees improves performance somewhat (with no neighbour adjustment, from about 0.19 to about 0.165 cross-validation RMSE), and the nearest-neighbour adjustment helps considerably more, down to about 0.145 with two or three neighbours; beyond that the improvement levels off.
So we fit a model with approximately these recommended parameters; here we use four committees. Note the neighbour count is only used at prediction time, not in model building.
c.model.grid.2 <- cubist(x = all.preds, y = all.resp, committees=4)
summary(c.model.grid.2)
##
## Call:
## cubist.default(x = all.preds, y = all.resp, committees = 4)
##
##
## Cubist [Release 2.07 GPL Edition] Sun Aug 4 18:41:13 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 1/2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 1/3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
## Model 2:
##
## Rule 2/1: [45 cases, mean 2.418724, range 2.10721 to 2.893762, est err 0.182228]
##
## if
## x <= 179826
## ffreq in {2, 3}
## then
## outcome = 128.701732 - 0.000705 x
##
## Rule 2/2: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.181513]
##
## if
## dist > 0.0703468
## then
## outcome = 30.512065 - 0.87 dist - 0.000154 x
##
## Rule 2/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.125950]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 37.730889 - 0.000314 x - 0.35 dist + 6.5e-05 y
##
## Rule 2/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.139639]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.982852 - 0.36 dist
##
## Model 3:
##
## Rule 3/1: [75 cases, mean 2.325347, range 2.053078 to 2.89098, est err 0.150262]
##
## if
## dist > 0.211846
## then
## outcome = 2.263175 - 0.11 dist
##
## Rule 3/2: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.156918]
##
## if
## dist <= 0.211846
## then
## outcome = 2.978074 - 2.42 dist
##
## Model 4:
##
## Rule 4/1: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.161020]
##
## if
## dist > 0.0703468
## then
## outcome = 4.601694 - 0.61 dist - 0.000101 x + 4.9e-05 y
##
## Rule 4/2: [51 cases, mean 2.505994, range 2.075547 to 3.055378, est err 0.196863]
##
## if
## x <= 179731
## dist > 0.0703468
## then
## outcome = 175.779227 - 0.000967 x
##
## Rule 4/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.147504]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 53.629078 - 0.000343 x - 0.44 dist + 3.3e-05 y
##
## Rule 4/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.140965]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.975269 - 0.08 dist
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.110135
## Relative |error| 0.40
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 95% 88% dist
## 21% 64% x
## 19% ffreq
## 39% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist.2 <-
predict(c.model.grid.2, newdata=meuse.grid[,preds],
neighbors=5)
summary(meuse.grid.sp$predictedLog10Zn.cubist.2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.954 2.228 2.346 2.445 2.640 3.154
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist.2", col.regions=bpy.colors(64),
main="Optimized Cubist prediction")
A much nicer looking map.
Variable importance. The usage field of the fitted model shows the usage of each variable in either the rule conditions or the (terminal) linear model.
c.model.grid$usage
## Conditions Model Variable
## 1 100 100 dist
## 2 48 52 x
## 3 0 57 y
## 4 0 0 soil
## 5 0 0 ffreq
c.model.grid.2$usage
## Conditions Model Variable
## 1 95 88 dist
## 2 21 64 x
## 3 19 0 ffreq
## 4 0 39 y
## 5 0 0 soil
caret::varImp(c.model.grid)
## Overall
## dist 100.0
## x 50.0
## y 28.5
## soil 0.0
## ffreq 0.0
caret::varImp(c.model.grid.2)
## Overall
## dist 91.5
## x 42.5
## ffreq 9.5
## y 19.5
## soil 0.0
Distance is always used in the single model, but sometimes not in the committee model.
Notice how flood frequency came into the model when a committee was formed, i.e., subsidiary rules to adjust mis-matches from higher-level rules.
It is interesting to see how the different models fit a single predictor, in this case distance to river:
distances <- all.preds[, "dist", drop = FALSE]
rules_only <- cubist(x = distances, y = all.resp)
rules_and_com <- cubist(x = distances, y = all.resp, committees = 4)
predictions <- distances
predictions$true <- all.resp
predictions$rules <- predict(rules_only, distances, neighbors = 0)
predictions$rules_neigh <- predict(rules_only, distances, neighbors = 5)
predictions$committees <- predict(rules_and_com, distances, neighbors = 0)
predictions$committees_neigh <- predict(rules_and_com, distances, neighbors = 5)
summary(predictions)
## dist true rules rules_neigh
## Min. :0.00000 Min. :2.053 Min. :2.069 Min. :2.046
## 1st Qu.:0.07569 1st Qu.:2.297 1st Qu.:2.311 1st Qu.:2.331
## Median :0.21184 Median :2.513 Median :2.482 Median :2.541
## Mean :0.24002 Mean :2.556 Mean :2.545 Mean :2.556
## 3rd Qu.:0.36407 3rd Qu.:2.829 3rd Qu.:2.804 3rd Qu.:2.791
## Max. :0.88039 Max. :3.265 Max. :2.982 Max. :2.992
## committees committees_neigh
## Min. :2.043 Min. :2.035
## 1st Qu.:2.347 1st Qu.:2.330
## Median :2.495 Median :2.524
## Mean :2.545 Mean :2.555
## 3rd Qu.:2.771 3rd Qu.:2.791
## Max. :2.924 Max. :2.990
plot(true ~ dist, data=predictions, pch=20, col="darkgray",
xlab="covariate: normalized distance to river",
ylab="log10(Zn)", main="Cubist models")
grid()
m <- matrix(cbind(predictions$dist, predictions$rules), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=1, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$rules_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=2, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=3, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=4, lwd=2)
legend("topright", c("rules", "rules + neighbour",
"committee", "committee + neighbour"),
pch="----", col=1:4)
Notice how the committee model (green) smooths out the centre of the distance range, compared with the single-rules model (black).
Notice how the neighbours come much closer to some of the points – too close?
A pure “instance-based” approach is to predict at each location from the (possibly weighted) average of the values at a set of similar observations. The caret package includes KNN regression via the knnreg function.
Try with five neighbours:
knn.5 <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq, data=meuse, k=5)
knn.pred <- predict(knn.5, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
However, unless we know a priori how many neighbours to use, we should let train evaluate the optimum number of neighbours.
knn.tune <- train(x = all.preds, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
str(knn.tune)
## List of 20
## $ method : chr "knn"
## $ modelInfo :List of 13
## ..$ label : chr "k-Nearest Neighbors"
## ..$ library : NULL
## ..$ loop : NULL
## ..$ type : chr [1:2] "Classification" "Regression"
## ..$ parameters:'data.frame': 1 obs. of 3 variables:
## .. ..$ parameter: Factor w/ 1 level "k": 1
## .. ..$ class : Factor w/ 1 level "numeric": 1
## .. ..$ label : Factor w/ 1 level "#Neighbors": 1
## ..$ grid :function (x, y, len = NULL, search = "grid")
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 8 26 16 19 26 19 8 16
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ fit :function (x, y, wts, param, lev, last, classProbs, ...)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 17 25 24 19 25 19 17 24
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ predict :function (modelFit, newdata, submodels = NULL)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 25 29 33 19 29 19 25 33
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ predictors:function (x, ...)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 34 32 34 67 32 67 34 34
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ tags : chr "Prototype Models"
## ..$ prob :function (modelFit, newdata, submodels = NULL)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 36 26 37 61 26 61 36 37
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ levels :function (x)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 38 28 38 56 28 56 38 38
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ sort :function (x)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 39 26 39 54 26 54 39 39
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## $ modelType : chr "Regression"
## $ results :'data.frame': 20 obs. of 7 variables:
## ..$ k : int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ RMSE : num [1:20] 0.24 0.208 0.192 0.199 0.2 ...
## ..$ Rsquared : num [1:20] 0.5 0.587 0.635 0.625 0.622 ...
## ..$ MAE : num [1:20] 0.171 0.158 0.147 0.153 0.155 ...
## ..$ RMSESD : num [1:20] 0.0488 0.058 0.0394 0.0333 0.0333 ...
## ..$ RsquaredSD: num [1:20] 0.185 0.206 0.147 0.126 0.141 ...
## ..$ MAESD : num [1:20] 0.0361 0.0394 0.0254 0.0232 0.0216 ...
## $ pred : NULL
## $ bestTune :'data.frame': 1 obs. of 1 variable:
## ..$ k: int 3
## $ call : language train.default(x = all.preds, y = all.resp, method = "knn", trControl = trainControl(method = "cv"), tuneGrid| __truncated__
## $ dots : list()
## $ metric : chr "RMSE"
## $ control :List of 28
## ..$ method : chr "cv"
## ..$ number : num 10
## ..$ repeats : logi NA
## ..$ search : chr "grid"
## ..$ p : num 0.75
## ..$ initialWindow : NULL
## ..$ horizon : num 1
## ..$ fixedWindow : logi TRUE
## ..$ skip : num 0
## ..$ verboseIter : logi FALSE
## ..$ returnData : logi TRUE
## ..$ returnResamp : chr "final"
## ..$ savePredictions : chr "none"
## ..$ classProbs : logi FALSE
## ..$ summaryFunction :function (data, lev = NULL, model = NULL)
## ..$ selectionFunction: chr "best"
## ..$ preProcOptions :List of 6
## .. ..$ thresh : num 0.95
## .. ..$ ICAcomp : num 3
## .. ..$ k : num 5
## .. ..$ freqCut : num 19
## .. ..$ uniqueCut: num 10
## .. ..$ cutoff : num 0.9
## ..$ sampling : NULL
## ..$ index :List of 10
## .. ..$ Fold01: int [1:139] 1 2 3 5 6 7 8 9 10 11 ...
## .. ..$ Fold02: int [1:139] 1 2 3 4 5 6 8 9 10 11 ...
## .. ..$ Fold03: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold04: int [1:139] 1 2 3 4 5 6 7 8 10 11 ...
## .. ..$ Fold05: int [1:140] 1 2 4 5 6 7 8 9 10 11 ...
## .. ..$ Fold06: int [1:141] 2 3 4 6 7 8 9 10 11 13 ...
## .. ..$ Fold07: int [1:140] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold08: int [1:138] 1 3 4 5 7 8 9 10 11 12 ...
## .. ..$ Fold09: int [1:141] 1 2 3 4 5 6 7 8 9 11 ...
## .. ..$ Fold10: int [1:139] 1 2 3 4 5 6 7 9 10 11 ...
## ..$ indexOut :List of 10
## .. ..$ Resample01: int [1:16] 4 47 55 69 76 77 83 84 91 94 ...
## .. ..$ Resample02: int [1:16] 7 17 25 45 46 67 82 90 103 104 ...
## .. ..$ Resample03: int [1:16] 16 26 37 40 42 51 65 73 78 99 ...
## .. ..$ Resample04: int [1:16] 9 13 21 22 24 39 43 56 63 66 ...
## .. ..$ Resample05: int [1:15] 3 44 49 52 70 75 79 87 102 118 ...
## .. ..$ Resample06: int [1:14] 1 5 12 23 28 34 38 60 71 74 ...
## .. ..$ Resample07: int [1:15] 11 14 35 36 53 101 116 120 121 123 ...
## .. ..$ Resample08: int [1:17] 2 6 18 30 33 58 61 64 68 80 ...
## .. ..$ Resample09: int [1:14] 10 27 31 32 48 50 59 62 72 81 ...
## .. ..$ Resample10: int [1:16] 8 15 19 20 29 41 54 57 88 92 ...
## ..$ indexFinal : NULL
## ..$ timingSamps : num 0
## ..$ predictionBounds : logi [1:2] FALSE FALSE
## ..$ seeds :List of 11
## .. ..$ : int [1:20] 813312 236430 890334 170210 987240 86795 771536 792235 734967 739182 ...
## .. ..$ : int [1:20] 992121 766012 71718 751307 208375 932562 722301 125947 260482 982996 ...
## .. ..$ : int [1:20] 140971 37331 700566 643864 410247 323307 643651 345975 730213 713809 ...
## .. ..$ : int [1:20] 374397 137389 688138 832233 18497 308165 140183 90599 949960 577892 ...
## .. ..$ : int [1:20] 433773 214035 844035 451682 989706 24932 415280 26177 187708 536907 ...
## .. ..$ : int [1:20] 758333 843264 504143 796850 388392 218870 291714 865609 690031 550765 ...
## .. ..$ : int [1:20] 177784 777203 120215 562508 699079 657961 101426 797222 768871 597234 ...
## .. ..$ : int [1:20] 741939 49288 20108 977177 156691 918281 320889 197208 837870 249758 ...
## .. ..$ : int [1:20] 976237 308871 76423 599447 668329 948711 579944 505136 792497 19165 ...
## .. ..$ : int [1:20] 82024 348360 99686 64554 888597 583501 904959 636821 109234 793446 ...
## .. ..$ : int 667130
## ..$ adaptive :List of 4
## .. ..$ min : num 5
## .. ..$ alpha : num 0.05
## .. ..$ method : chr "gls"
## .. ..$ complete: logi TRUE
## ..$ trim : logi FALSE
## ..$ allowParallel : logi TRUE
## ..$ yLimits : num [1:2] 1.99 3.33
## $ finalModel :List of 8
## ..$ learn :List of 2
## .. ..$ y: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## .. ..$ X: chr [1:155, 1:5] "181072" "181025" "181165" "181298" ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:155] "X1" "X2" "X3" "X4" ...
## .. .. .. ..$ : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ k : int 3
## ..$ theDots : list()
## ..$ xNames : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ problemType: chr "Regression"
## ..$ tuneValue :'data.frame': 1 obs. of 1 variable:
## .. ..$ k: int 3
## ..$ obsLevels : logi NA
## ..$ param : list()
## ..- attr(*, "class")= chr "knnreg"
## $ preProcess : NULL
## $ trainingData:'data.frame': 155 obs. of 6 variables:
## ..$ x : num [1:155] 181072 181025 181165 181298 181307 ...
## ..$ y : num [1:155] 333611 333558 333537 333484 333330 ...
## ..$ dist : num [1:155] 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ .outcome: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## $ resample :'data.frame': 10 obs. of 4 variables:
## ..$ RMSE : num [1:10] 0.182 0.276 0.179 0.152 0.21 ...
## ..$ Rsquared: num [1:10] 0.675 0.342 0.758 0.762 0.639 ...
## ..$ MAE : num [1:10] 0.134 0.184 0.138 0.118 0.165 ...
## ..$ Resample: chr [1:10] "Fold07" "Fold02" "Fold09" "Fold06" ...
## $ resampledCM : NULL
## $ perfNames : chr [1:3] "RMSE" "Rsquared" "MAE"
## $ maximize : logi FALSE
## $ yLimits : num [1:2] 1.99 3.33
## $ times :List of 3
## ..$ everything: 'proc_time' Named num [1:5] 1.357 0.026 1.403 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ final : 'proc_time' Named num [1:5] 0.001 0 0.002 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ prediction: logi [1:3] NA NA NA
## $ levels : logi NA
## - attr(*, "class")= chr "train"
print(knn.tune)
## k-Nearest Neighbors
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 139, 139, 139, 139, 140, 141, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.2396729 0.5000961 0.1707858
## 2 0.2084718 0.5866904 0.1576243
## 3 0.1924236 0.6348138 0.1472308
## 4 0.1985827 0.6245674 0.1528394
## 5 0.1996725 0.6215416 0.1554437
## 6 0.2041662 0.6067799 0.1589935
## 7 0.2040105 0.6098834 0.1601178
## 8 0.2096095 0.5869486 0.1645314
## 9 0.2117159 0.5824187 0.1679363
## 10 0.2148451 0.5706556 0.1710285
## 11 0.2205898 0.5480597 0.1749506
## 12 0.2264245 0.5328982 0.1816822
## 13 0.2305251 0.5177084 0.1849336
## 14 0.2350731 0.4912753 0.1895975
## 15 0.2350344 0.4951237 0.1906744
## 16 0.2383019 0.4800084 0.1953387
## 17 0.2412291 0.4615528 0.1983915
## 18 0.2459047 0.4408346 0.2036497
## 19 0.2476376 0.4375852 0.2065670
## 20 0.2490945 0.4339538 0.2079632
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
min(knn.tune$results$RMSE)
## [1] 0.1924236
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6348138
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 3
# directly find optimum
knn.tune$finalModel$k
## [1] 3
Here there is a clear winner, k = 3. It is not a very good model: cross-validation RMSE about \(0.19\) and \(R^2\) about \(0.63\).
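The resampling results for the selected k can also be pulled directly from the tuning object; a small sketch using caret's bestTune element:
# row of the resampling results corresponding to the selected k
knn.tune$results[knn.tune$results$k == knn.tune$bestTune$k, ]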
Use this to predict:
knn.final <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
knn.pred <- predict(knn.final, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
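As a quick check of this fit at the training points we can compute the RMSE and squared correlation; these are not cross-validated, so they flatter the model (a base-R sketch; the object names are just illustrative):
# goodness-of-fit at the training points (optimistic; not cross-validated)
(rmse.knn.fit <- sqrt(mean((knn.pred - meuse$logZn)^2)))
(r2.knn.fit <- cor(knn.pred, meuse$logZn)^2)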
We can’t map with this because elevation is not available in the prediction grid. So re-tune without it:
preds.2 <- c("x","y","ffreq","dist.m","soil","lime")
all.preds.2 <- meuse[, preds.2]
knn.tune <- train(x = all.preds.2, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
min(knn.tune$results$RMSE)
## [1] 0.1924077
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6143408
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 4
# directly find optimum
knn.tune$finalModel$k
## [1] 3
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
## the two criteria do not agree
Map with this, refitting the chosen k with knnreg on predictors that are also present in the prediction grid:
knn.final <- caret::knnreg(logZn ~ x + y + dist + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
meuse.grid.sp$knn <- predict(knn.final, newdata=meuse.grid)
summary(meuse.grid.sp$knn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.276 2.429 2.475 2.669 3.217
spplot(meuse.grid.sp, zcol="knn")
An interesting map! Note that here ‘nearest neighbour’ is defined in both geographic space (the coordinates) and feature space.
Random forests can use coordinates and distances to geographic features as predictors; we used the x and y coordinates in the previous section.
RF can also use distances to multiple points as predictors:
Reference: Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6, e5518. https://doi.org/10.7717/peerj.5518
Packages used in this section: randomForest
to build the RF, scales
for visualization, GSIF
for buffer distances.
library(randomForest) # to build the random forests
library(scales)       # for visualization
library(GSIF)         # for buffer distances
Set up quantiles of the log10-transformed Zn concentration and display its distribution:
library(ggplot2)
g <- ggplot(meuse, mapping=aes(x=logZn))
g + geom_histogram(bins=16, fill="lightgreen", color="blue") + geom_rug()
(q.lzn <- quantile(meuse$logZn, seq(0,1,by=0.0625)))
# must include.lowest to have lowest value in some class
classes.lzn.q <- cut(meuse$logZn, breaks=q.lzn, ordered_result=TRUE, include.lowest=TRUE)
levels(classes.lzn.q)
## 0% 6.25% 12.5% 18.75% 25% 31.25% 37.5% 43.75%
## 2.053078 2.139461 2.201372 2.261554 2.296665 2.322219 2.362643 2.414125
## 50% 56.25% 62.5% 68.75% 75% 81.25% 87.5% 93.75%
## 2.513218 2.603414 2.680704 2.754247 2.828981 2.873820 2.920515 3.056094
## 100%
## 3.264582
## [1] "[2.05,2.14]" "(2.14,2.2]" "(2.2,2.26]" "(2.26,2.3]" "(2.3,2.32]"
## [6] "(2.32,2.36]" "(2.36,2.41]" "(2.41,2.51]" "(2.51,2.6]" "(2.6,2.68]"
## [11] "(2.68,2.75]" "(2.75,2.83]" "(2.83,2.87]" "(2.87,2.92]" "(2.92,3.06]"
## [16] "(3.06,3.26]"
Q: Are the 16 quantiles more or less equally distributed along the range of the target variable?
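One way to answer this is to tabulate the class counts and look at the class widths; a quick sketch with the objects created above:
# observations per quantile class (nearly equal by construction)
table(classes.lzn.q)
# class widths in log10(Zn) units (not equal: observations are denser in some parts of the range)
round(diff(q.lzn), 3)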
Compute distance buffers to the points in each quantile. For this, the points object must be converted to a SpatialPointsDataFrame, and the buffer.dist function from the GSIF package is used.
coordinates(meuse) <- ~x + y
grid.dist.lzn <- GSIF::buffer.dist(meuse["logZn"], meuse.grid.sp, classes.lzn.q)
plot(stack(grid.dist.lzn))
Q: Explain the pattern of Layer 16.
Extract the distances to each buffer at each point, and add these as fields to the points:
buffer.dists <- over(meuse, grid.dist.lzn)
str(buffer.dists)
## 'data.frame': 155 obs. of 16 variables:
## $ layer.1 : num 1897 1809 1865 1878 1737 ...
## $ layer.2 : num 1784 1695 1755 1772 1633 ...
## $ layer.3 : num 735 680 621 561 402 ...
## $ layer.4 : num 468 412 362 330 179 ...
## $ layer.5 : num 804 730 721 699 544 ...
## $ layer.6 : num 685 612 601 582 431 ...
## $ layer.7 : num 268 283 126 0 160 ...
## $ layer.8 : num 369 330 233 160 0 ...
## $ layer.9 : num 268 226 160 170 126 ...
## $ layer.10: num 243 160 226 305 283 ...
## $ layer.11: num 369 283 344 400 330 ...
## $ layer.12: num 144 160 0 126 233 ...
## $ layer.13: num 734 645 738 794 700 ...
## $ layer.14: num 1531 1442 1537 1585 1471 ...
## $ layer.15: num 0 89.4 144.2 268.3 368.8 ...
## $ layer.16: num 89.4 0 160 282.8 344.1 ...
meuse@data <- cbind(meuse@data, buffer.dists)
Show how far point 1 is from the nearest point of each quantile:
buffer.dists[1,]
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 1897.367 1783.928 735.3911 468.188 803.99 684.6897 268.3282 368.7818
## layer.9 layer.10 layer.11 layer.12 layer.13 layer.14 layer.15 layer.16
## 1 268.3282 243.3105 368.7818 144.2221 734.3024 1531.013 0 89.44272
Of course it is zero distance to its own quantile, in this case 15.
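A quick check that the zero-distance layer indeed corresponds to the point's own quantile class (a sketch):
# quantile class of point 1, and the layer at zero distance from it
classes.lzn.q[1]
which(buffer.dists[1, ] == 0)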
First make a formula with all the quantiles as predictors:
dn <- paste(names(grid.dist.lzn), collapse="+")
(fm <- as.formula(paste("logZn ~", dn)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16
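An equivalent, slightly more compact way to build this formula is with base R's reformulate; fm.alt here is just an illustrative name:
# same formula, built without string pasting
fm.alt <- reformulate(names(grid.dist.lzn), response = "logZn")
fm.alt   # same formula as fm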
Now compute the random forest model and use it to predict at the observation points.
(rf <- randomForest(fm , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.02514371
## % Var explained: 74.25
pred.rf <- predict(rf, newdata=meuse@data)
plot(rf)
Q: How many trees were needed to make a reliable forest?
Variable importance:
importance(rf, type=1)
## %IncMSE
## layer.1 15.125917
## layer.2 14.959521
## layer.3 13.808773
## layer.4 11.403248
## layer.5 11.637966
## layer.6 17.236908
## layer.7 7.091440
## layer.8 10.915838
## layer.9 9.779422
## layer.10 10.792435
## layer.11 6.995092
## layer.12 17.543799
## layer.13 11.360035
## layer.14 13.343180
## layer.15 18.142337
## layer.16 23.833595
varImpPlot(rf, type=1)
Q: Which distance buffer was most important? Explain why.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf <- sqrt(sum((pred.rf-meuse$logZn)^2)/length(pred.rf)))
## [1] 0.06472746
#
oob.rf <- predict(rf)
plot(meuse$logZn ~ oob.rf, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob <- sqrt(sum((oob.rf-meuse$logZn)^2)/length(oob.rf)))
## [1] 0.1585677
Q: What are the goodness-of-fit and Out-of-bag RMSE?
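The OOB RMSE can also be converted to a pseudo-\(R^2\), which should closely match the ‘% Var explained’ reported by the forest (a sketch):
# 1 - (OOB sum of squared errors / total sum of squares)
1 - sum((oob.rf - meuse$logZn)^2) / sum((meuse$logZn - mean(meuse$logZn))^2)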
pred.grid <- predict(rf, newdata=grid.dist.lzn@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.111 2.316 2.437 2.489 2.650 3.169
meuse.grid.sp$logZn.rf <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p1 <- spplot(meuse.grid.sp, zcol="logZn.rf", main="Zn, log(mg kg-1)", sub="RF, 16 distance buffers", at=breaks)
print(p1)
Compare this with OK cross-validation. First compute the ordinary variogram and fit a model:
require(gstat)
v <- variogram(logZn ~ 1, meuse)
(vmf <- fit.variogram(v, model=vgm(0.1, "Sph", 1000, 0.02)))
## model psill range
## 1 Nug 0.009554751 0.0000
## 2 Sph 0.111394769 896.9909
plot(v, model=vmf, plot.numbers=T)
Now compute the cross-validation statistics:
k.cv <- krige.cv(logZn ~ 1, meuse, model=vmf, verbose=FALSE)
plot(meuse$logZn ~ k.cv$var1.pred, pch=20, xlab="Cross-validation fit", ylab="Actual value", main="Zn, log(mg kg-1)", asp=1)
abline(0,1); grid()
(rmse.kcv <- sqrt(sum((k.cv$var1.pred-meuse$logZn)^2)/length(k.cv$var1.pred)))
## [1] 0.170157
Q: Which approach gives the lower cross-validation error in this case?
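A side-by-side view of the two error estimates (both estimate prediction error at unvisited locations, one from out-of-bag samples, one from leave-one-out cross-validation):
# RF out-of-bag RMSE vs. OK leave-one-out cross-validation RMSE
data.frame(method = c("RF, 16 distance buffers (OOB)", "Ordinary kriging (LOOCV)"),
           RMSE   = c(rmse.oob, rmse.kcv))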
Compare the kriging map with the spatial random forest map:
k.ok <- krige(logZn ~ 1, meuse, newdata=meuse.grid.sp, model=vmf)
## [using ordinary kriging]
summary(k.ok$var1.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.275 2.420 2.479 2.680 3.231
p2 <- spplot(k.ok, zcol="var1.pred", main="Zn, log(mg kg-1)", sub="Ordinary Kriging", at=breaks)
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
k.ok$diff.srf.ok <- meuse.grid.sp$logZn.rf - k.ok$var1.pred
summary(k.ok$diff.srf.ok)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.237562 -0.040157 0.004313 0.010131 0.048444 0.403515
p3 <- spplot(k.ok, zcol="diff.srf.ok", main="Spatial RF - OK predictions, log(Zn)", col.regions=topo.colors(36))
print(p3)
Q: Comment on the differences/similarities between the Spatial Random Forest map and the Ordinary Kriging map. Note that for Kriging we needed to specify a model, for RF no model was needed.
The concept of quantile buffers is similar to that of the empirical variogram: both assume stationarity of the spatial correlation structure.
Another approach is to use the distance from each known point as a predictor. Obviously this increases the size of the problem. This is the approach taken in the Hengl paper cited above.
Set up the buffer distances and visualize a few of them:
grid.dist.0 <- GSIF::buffer.dist(meuse["logZn"], meuse.grid.sp, as.factor(1:nrow(meuse)))
dim(grid.dist.0)
## [1] 3103 155
summary(grid.dist.0$layer.1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1677 2685 2556 3506 4569
plot(stack(grid.dist.0[1:9]))
Extract the distances to each buffer at each point, and add these as fields to the points. Note that we first have to remove the previous buffer distances (the 16 quantiles).
buffer.dists <- over(meuse, grid.dist.0)
dim(buffer.dists)
## [1] 155 155
# distances from point 1 to all the other points
(buffer.dists[1,1:10])
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 0 89.44272 144.222 268.3282 368.7818 481.6638 268.3282 243.3105
## layer.9 layer.10
## 1 400 468.188
names(meuse@data)
## [1] "cadmium" "copper" "lead" "zinc" "elev" "dist"
## [7] "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [13] "logZn" "logCu" "logPb" "logCd" "layer.1" "layer.2"
## [19] "layer.3" "layer.4" "layer.5" "layer.6" "layer.7" "layer.8"
## [25] "layer.9" "layer.10" "layer.11" "layer.12" "layer.13" "layer.14"
## [31] "layer.15" "layer.16"
meuse@data <- meuse@data[,1:16]
meuse@data <- cbind(meuse@data, buffer.dists)
So now we have 155 buffer distances. Set up a formula for the RF and build the RF. Note that the default mtry for regression would be 51, i.e., the distances to about one-third of the points; we increase this to about two-thirds (mtry = 102).
dn0 <- paste(names(grid.dist.0), collapse="+")
fm0 <- as.formula(paste("logZn ~ ", dn0))
(rf <- randomForest(fm0 , meuse@data, importance=TRUE,
mtry=102, min.split=6, ntree=640))
##
## Call:
## randomForest(formula = fm0, data = meuse@data, importance = TRUE, mtry = 102, min.split = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 16
##
## Mean of squared residuals: 0.09413569
## % Var explained: 3.6
This does not perform as well as using the quantile buffers: the \(R^2\) is noticeably lower.
Mapping:
pred.grid <- predict(rf, newdata=grid.dist.0@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.299 2.477 2.596 2.590 2.697 2.897
meuse.grid.sp$logZn.rf0 <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p2 <- spplot(meuse.grid.sp, zcol="logZn.rf0", main="Zn, log(mg kg-1)",
sub="RF distance all points", at=breaks)
# compare with 16-quantile buffers
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
A clearly inferior map. So the distances to individual points are not what matters; the useful information is in the distances to quantiles of the target variable.
Random forests can also be used to predict classes. As with RF for continuous variables, the forest is made up of many trees. The prediction is the class predicted by the majority of trees.
m.ffreq.rf <- randomForest(ffreq ~ x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.ffreq.rf)
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.77%
## Confusion matrix:
## 1 2 3 class.error
## 1 77 7 0 0.08333333
## 2 4 39 5 0.18750000
## 3 1 9 13 0.43478261
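To see the majority vote explicitly, we can ask for the prediction of every individual tree and tally the votes ourselves; a sketch using the predict.all argument of predict.randomForest:
# per-tree votes and the majority-vote class for the first observation
votes <- predict(m.ffreq.rf, meuse, predict.all = TRUE)
table(votes$individual[1, ])   # how the 500 trees voted for observation 1
votes$aggregate[1]             # the majority-vote prediction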
importance(m.ffreq.rf)
## 1 2 3 MeanDecreaseAccuracy MeanDecreaseGini
## x 20.09590 9.1360302 8.463421 22.248976 14.404681
## y 35.13394 16.9772733 40.939480 45.641554 31.592005
## dist.m 10.73260 1.9609074 6.635458 10.737786 7.549293
## elev 51.32741 35.3973475 10.994921 55.201272 32.358615
## soil 10.28211 0.2486713 5.214684 9.536508 2.352493
## lime 11.49724 -3.5100701 7.608437 11.107870 2.511667
varImpPlot(m.ffreq.rf)
Here we see the class error rate, i.e., the training samples not predicted into the correct class. In this example class 3 is poorly-predicted, whereas class 1 is well-predicted.
Elevation is by far the most important predictor of accuracy for all classes. The coordinates are next in importance, because of the geographic configuration of the study area – they substitute for distance to the river when that is not included in the model.
However, for the mean decrease in the Gini index (how often a randomly-chosen training observation would be incorrectly assigned if it were randomly labeled according to the frequency distribution of labels in the subset), the y-coordinate becomes almost as important as elevation – perhaps because of the one high value far to the east, at the river bend.
See where the predictors are used:
library(randomForestExplainer) # for minimal-depth distributions
min_depth_frame <- min_depth_distribution(m.ffreq.rf)
plot_min_depth_distribution(min_depth_frame)
Soil type and lime application are not used in more than half the trees. The y-coordinate is used first in most trees, followed by elevation and x-coordinate.
We can see the probabilities of each class by using the model to predict back at the known points:
summary(prob_rf <- predict(m.ffreq.rf, meuse, type = "prob"))
## 1 2 3
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0380 1st Qu.:0.0050 1st Qu.:0.0020
## Median :0.8000 Median :0.0480 Median :0.0060
## Mean :0.5449 Mean :0.3112 Mean :0.1439
## 3rd Qu.:0.9920 3rd Qu.:0.7970 3rd Qu.:0.0600
## Max. :1.0000 Max. :0.9860 Max. :0.9940
# probability of each observation being predicted to class 1
prob_rf[,1]
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.994 1.000 0.996 1.000 1.000 1.000 1.000 0.992 0.994 0.996 1.000 0.922
## 13 14 15 16 17 18 19 20 21 22 23 24
## 0.996 0.974 0.992 0.986 0.998 0.992 0.992 1.000 0.948 0.980 0.978 0.998
## 25 26 27 28 29 30 31 32 33 34 35 37
## 0.998 1.000 1.000 1.000 0.978 0.992 1.000 0.994 0.990 0.998 0.996 0.998
## 38 39 40 41 42 43 44 45 46 47 48 49
## 0.996 0.992 1.000 1.000 0.988 0.986 0.932 0.748 0.936 0.998 0.966 0.780
## 50 51 52 53 54 55 56 57 58 59 60 61
## 0.700 0.780 0.800 0.956 0.942 0.966 0.968 0.982 0.808 0.968 0.982 0.992
## 62 63 64 65 66 67 69 75 76 79 80 81
## 0.988 0.994 0.996 0.988 0.982 0.996 0.960 0.710 0.946 0.998 0.998 0.970
## 82 83 84 85 86 87 88 89 90 123 160 163
## 0.974 0.994 0.978 0.964 0.896 0.794 0.996 0.994 0.998 0.852 0.992 0.996
## 70 71 91 92 93 94 95 96 97 98 99 100
## 0.122 0.082 0.012 0.008 0.012 0.004 0.116 0.024 0.056 0.110 0.010 0.074
## 101 102 103 104 105 106 108 109 110 111 112 113
## 0.022 0.024 0.018 0.002 0.048 0.052 0.044 0.012 0.172 0.092 0.008 0.038
## 114 115 116 117 118 119 120 121 122 124 125 126
## 0.012 0.002 0.038 0.012 0.002 0.014 0.010 0.022 0.038 0.074 0.104 0.046
## 127 128 129 130 131 132 133 134 135 136 161 162
## 0.016 0.106 0.208 0.294 0.076 0.096 0.066 0.146 0.130 0.068 0.082 0.046
## 137 138 140 141 142 143 144 145 146 147 148 149
## 0.288 0.068 0.002 0.000 0.010 0.006 0.016 0.000 0.004 0.010 0.004 0.004
## 150 151 152 153 154 155 156 157 158 159 164
## 0.000 0.152 0.090 0.004 0.000 0.002 0.010 0.002 0.002 0.000 0.130
# an observation in order of probability
sort(prob_rf[1,], decreasing=TRUE)
## 1 2 3
## 0.994 0.004 0.002
order(prob_rf[1,], decreasing=TRUE)
## [1] 1 2 3
# its actual class -- in this case it agrees
meuse$ffreq[1]
## [1] 1
## Levels: 1 2 3
These can be summarized into lists of the most probable, second-most probable, etc. classes:
# list of most likely classes
probs <- apply(prob_rf, 1, order, decreasing=TRUE)
probs[,1]
## [1] 1 2 3
probs[,121]
## [1] 2 1 3
# most likely, 2nd most, 3rd most, for each observation
table(probs.1 <- probs[1,])
##
## 1 2 3
## 84 48 23
# cross-classification of most and 2nd-most probable classes
table(probs.1, probs.2 <- probs[2,])
##
## probs.1 1 2 3
## 1 0 71 13
## 2 25 0 23
## 3 3 20 0
Again we need to restrict the RF model to variables also available in the prediction grid.
(m.ffreq.rf.2 <- randomForest(ffreq ~ x + y + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3))
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist + soil, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 28.39%
## Confusion matrix:
## 1 2 3 class.error
## 1 71 11 2 0.1547619
## 2 14 29 5 0.3958333
## 3 3 9 11 0.5217391
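Before mapping, a quick numeric comparison of the two classification models, using the running out-of-bag error stored in each fitted forest (a sketch; the labels are illustrative):
# final OOB error rate of each model
c(with.elevation  = m.ffreq.rf$err.rate[m.ffreq.rf$ntree, "OOB"],
  grid.predictors = m.ffreq.rf.2$err.rate[m.ffreq.rf.2$ntree, "OOB"])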
This is less accurate than the model with elevation. Nevertheless, we map the predictions:
meuse.grid.sp$predictedffreq.rf <- predict(m.ffreq.rf.2, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedffreq.rf", main="Most probable flood frequency class")
Not a nice-looking map. It disagrees strongly with the actual flooding-frequency map.
summary(meuse.grid.sp$ffreq.tf <-
as.factor((meuse.grid.sp$predictedffreq.rf==
meuse.grid.sp$ffreq)))
## FALSE TRUE
## 1452 1651
spplot(meuse.grid.sp, zcol="ffreq.tf",
main="Accuracy of RF model")
Probability of each flood-frequency class at each grid cell:
meuse.grid.sp$probffreq <-
predict(m.ffreq.rf.2, newdata=meuse.grid, type="prob")
meuse.grid.sp$probffreq.rf.1 <- meuse.grid.sp$probffreq[,1]
meuse.grid.sp$probffreq.rf.2 <- meuse.grid.sp$probffreq[,2]
meuse.grid.sp$probffreq.rf.3 <- meuse.grid.sp$probffreq[,3]
p1 <- spplot(meuse.grid.sp, zcol="probffreq.rf.1",
main="Prob. class 1")
p2 <- spplot(meuse.grid.sp, zcol="probffreq.rf.2",
main="Prob. class 2")
p3 <- spplot(meuse.grid.sp, zcol="probffreq.rf.3",
main="Prob. class 3")
print(p1, split=c(1,1,3,1), more=T)
print(p2, split=c(2,1,3,1), more=T)
print(p3, split=c(3,1,3,1), more=F)
Does adding covariates improve the prediction?
(covar.names <- paste(intersect(names(meuse), names(meuse.grid)), collapse="+"))
## [1] "dist+ffreq+soil"
(fm.c <- as.formula(paste("logZn ~", dn, "+", covar.names)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16 + dist + ffreq +
## soil
Now compute the random forest model including these covariates, and use it to predict at the observation points. First restore the 16 quantile-buffer distances in the point data frame (they were replaced above by the 155 point distances):
# restore the quantile-buffer distances as predictors
meuse@data <- meuse@data[, 1:16]
meuse@data <- cbind(meuse@data, over(meuse, grid.dist.lzn))
(rf.c <- randomForest(fm.c , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm.c, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.0281278
## % Var explained: 71.2
pred.rf.c <- predict(rf.c, newdata=meuse@data)
plot(rf.c)
Display the variable importance:
importance(rf.c, type=1)
## %IncMSE
## layer.1 10.048767
## layer.2 8.720124
## layer.3 8.373037
## layer.4 9.261439
## layer.5 9.984490
## layer.6 11.892754
## layer.7 9.195146
## layer.8 9.635517
## layer.9 9.876280
## layer.10 9.392341
## layer.11 11.452915
## layer.12 11.004939
## layer.13 10.047642
## layer.14 8.762264
## layer.15 8.820727
## layer.16 9.572263
## dist 43.197300
## ffreq 32.726246
## soil 17.135675
varImpPlot(rf.c, type=1)
Q: Are any of the covariables important? Are any more important than the distance buffers? Explain.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf.c, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf.c <- sqrt(sum((pred.rf.c-meuse$logZn)^2)/length(pred.rf.c)))
## [1] 0.07823778
#
oob.rf.c <- predict(rf.c)
plot(meuse$logZn ~ oob.rf.c, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob.c <- sqrt(sum((oob.rf.c-meuse$logZn)^2)/length(oob.rf.c)))
## [1] 0.1677134
Q: What are the goodness-of-fit and Out-of-bag RMSE? How do these compare to the spatial RF without covariates?
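A compact summary of the four RMSE values computed above for the two spatial RF models (a sketch; the row and column labels are illustrative):
# fitted vs. out-of-bag RMSE, with and without covariates
data.frame(model    = c("buffers only", "buffers + covariates"),
           fit.RMSE = c(rmse.rf, rmse.rf.c),
           OOB.RMSE = c(rmse.oob, rmse.oob.c))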
We first have to add the covariates to the prediction grid.
grid.dist.lzn.covar <- grid.dist.lzn
grid.dist.lzn.covar$dist <- meuse.grid$dist
grid.dist.lzn.covar$ffreq <- meuse.grid$ffreq
grid.dist.lzn.covar$soil <- meuse.grid$soil
Now predict over the grid and display:
pred.grid.c <- predict(rf.c, newdata=grid.dist.lzn.covar@data)
summary(pred.grid.c)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.280 2.345 2.422 2.491 2.615 2.919
meuse.grid.sp$logZn.rf.c <- pred.grid.c
breaks <- seq(2, 3.3, by=.05)
p1.c <- spplot(meuse.grid.sp, zcol="logZn.rf.c", main="Zn, log(mg kg-1)", sub="Random forest on distance buffers & covariates", at=breaks)
print(p1.c, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
meuse.grid.sp$logZn.rf.diff <- meuse.grid.sp$logZn.rf.c - meuse.grid.sp$logZn.rf
summary(meuse.grid.sp$logZn.rf.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4434319 -0.0854825 0.0005576 0.0022438 0.0907812 0.4409751
p3.c <- spplot(meuse.grid.sp, zcol="logZn.rf.diff", main="Spatial RF with - without covariates, log(Zn)", col.regions=topo.colors(36))
print(p3.c)
Q: Compare this to the maps made by RF/Spatial Distances without covariates. Where does using the covariates result in higher and lower predicted values? Try to explain why.
Challenge: Map logZn over the grid by Kriging with External Drift (KED), using the same three covariates (dist, ffreq, soil), and compare with the map made by the spatial random forest with covariates. Reminder: the variogram must be computed for the linear-model residuals using these covariates.
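A possible starting point, not a full solution; it assumes gstat is still loaded and that meuse.grid.sp carries the dist, ffreq and soil fields:
# residual variogram: with covariates in the formula, gstat models the regression residuals
vr <- variogram(logZn ~ dist + ffreq + soil, meuse)
(vrmf <- fit.variogram(vr, model = vgm(0.1, "Sph", 1000, 0.02)))
plot(vr, model = vrmf)
# KED: the same covariates appear in the prediction formula
k.ked <- krige(logZn ~ dist + ffreq + soil, meuse, newdata = meuse.grid.sp, model = vrmf)
spplot(k.ked, zcol = "var1.pred", main = "Zn, log(mg kg-1)", sub = "KED, 3 covariates")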
Q: Describe the differences between the residual and ordinary variogram.
Q: Describe the differences between the OK and KED maps.
Q: Compare the KED map with the spatial random forest + covariates map.
Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning data mining, inference, and prediction (2nd ed). New York: Springer. Retrieved from http://link.springer.com/book/10.1007%2F978-0-387-84858-7
Kuhn, M. (2013). Applied Predictive Modeling. New York, NY: Springer New York. Retrieved from https://link.springer.com/openurl?genre=book&isbn=978-1-4614-6848-6
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. New York: Springer. Retrieved from http://link.springer.com/book/10.1007%2F978-1-4614-7138-7
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., … Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37. https://doi.org/10.1007/s10115-007-0114-2