In this exercise we return to the Meuse heavy metals dataset.
Packages used in this section: sp for the sample dataset.
library(sp)
The sample point dataset comes with sp; we make a log10-transformed copy of the metal concentrations.
We also load the prediction grid: the locations where the target variable is unknown, at which we want to predict it using the model built (trained, calibrated) from the known points, where we know both the target variable and the predictors.
data(meuse)
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
meuse$logZn <- log10(meuse$zinc)    # zinc
meuse$logCu <- log10(meuse$copper)  # copper
meuse$logPb <- log10(meuse$lead)    # lead
meuse$logCd <- log10(meuse$cadmium) # cadmium
data(meuse.grid)
str(meuse.grid)
## 'data.frame': 3103 obs. of 7 variables:
## $ x : num 181180 181140 181180 181220 181100 ...
## $ y : num 333740 333700 333700 333700 333660 ...
## $ part.a: num 1 1 1 1 1 1 1 1 1 1 ...
## $ part.b: num 0 0 0 0 0 0 0 0 0 0 ...
## $ dist : num 0 0 0.0122 0.0435 0 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
The objective is to model the log10(Zn) concentration from covariables, not including the other metals. One approach is the regression tree, which recursively splits the observations at thresholds of the predictor variables and predicts a constant value (the group mean) within each resulting group.
Reference: Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. Boca Raton, FL: Chapman & Hall/CRC Press.
Packages used in this section: sp for spatial objects, rpart for regression trees, rpart.plot for displaying them.
library(sp)
library(rpart)
library(rpart.plot)
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [15] "logZn" "logCu" "logPb" "logCd"
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What covariables are available in the point dataset? Are these all available in the prediction grid?
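A quick programmatic check of which fields the point dataset and the prediction grid share (a small sketch using base R):
intersect(names(meuse), names(meuse.grid))  # covariables available in both
setdiff(names(meuse), names(meuse.grid))    # available only at the sample points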
We will try to model the log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether lime has been added to the site, and elevation above sea level. We also use the coordinates to capture any geographic trend.
Q: Why could these be good predictors of the metal concentration in this site?
Q: Do you see any problem in using so many predictors? Would you do this in a linear model?
We will use the rpart function to build the regression tree. This uses a helper function rpart.control to control the behaviour of rpart.
Examine the help for rpart.control.
help(rpart.control)
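For reference, the extra arguments we pass to rpart below (minsplit, cp) are forwarded to rpart.control; an equivalent, more explicit form would be:
ctrl <- rpart.control(minsplit=2, cp=0.005)
# then: rpart(formula, data=meuse, control=ctrl)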
Q: What is the meaning of the minsplit and cp parameters?
Build a regression tree for this, print its results, and display as a tree plot. Allow splits all the way to single observations, if the \(R^2\) increases by at least 0.5% at the split.
m.lzn.rp <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse,
minsplit=2,
cp=0.005)
print(m.lzn.rp)
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342
## 64) y>=330942 21 0.14958940 2.167449 *
## 65) y< 330942 19 0.07701018 2.285117 *
## 33) ffreq=1 31 0.47328450 2.373281
## 66) x< 179913.5 1 0.00000000 2.075547 *
## 67) x>=179913.5 30 0.38168440 2.383205
## 134) x>=180346.5 25 0.20839560 2.355868
## 268) y< 333191 19 0.09154361 2.320174 *
## 269) y>=333191 6 0.01598697 2.468900 *
## 135) x< 180346.5 5 0.06119400 2.519889 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774
## 18) x< 180563 10 0.23688250 2.494602
## 36) x>=179405.5 7 0.10694750 2.420923 *
## 37) x< 179405.5 3 0.00326675 2.666521 *
## 19) x>=180563 5 0.06771894 2.657116 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391
## 30) x< 179565 9 0.12124970 2.914541 *
## 31) x>=179565 19 0.34486030 3.036636
## 62) x>=180194 12 0.12169000 2.966064 *
## 63) x< 180194 7 0.06095067 3.157617 *
rpart.plot(m.lzn.rp, digits=3, type=4, extra=1)
Q: Which variables were actually used in the tree?
Q: Which variable was used at the first split?
Q: Does that agree with your theory of the origin of the metals in these soils?
Q: What was the value of the splitting variable at that split?
Q: What was the average log10(Zn) before the split, and in the two leaves after the split?
Q: How many leaves (i.e., different predictions)? What proportion of the observations is this?
Q: What is the minimum number of observations among the leaves? The maximum?
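If you want to check these answers programmatically, the fitted object's frame component has one row per node, with var == "<leaf>" marking the terminal nodes; a quick sketch:
leaves <- m.lzn.rp$frame[m.lzn.rp$frame$var == "<leaf>", ]
nrow(leaves)     # number of leaves, i.e., distinct predictions
range(leaves$n)  # smallest and largest number of observations in a leaf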
Now we examine the variable importance:
print(tmp <- m.lzn.rp$variable.importance)
## dist.m elev lime x y soil ffreq
## 9.410837 6.134458 4.633959 2.097574 1.911179 1.169126 1.143282
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 35.512035
## elev 23.148534
## lime 17.486364
## x 7.915251
## y 7.211884
## soil 4.411728
## ffreq 4.314205
Q: Which variable explained most of the metal concentration? What proportion? Were all predictor variables used somewhere in the tree?
Have we built a tree that is too complex, i.e., one that fits noise rather than signal? In other words, does it fit our particular dataset well but not the underlying process?
Examine the cross-validation relative error vs. the complexity parameter.
Q: What complexity parameter minimizes the cross-validation relative error?
We can also compute this, rather than just looking at the graph.
plotcp(m.lzn.rp)
cp.table <- m.lzn.rp[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.009124468 8.000000000 0.153385306 0.292993740 0.032477480
cp.min <- cp.table[cp.ix,"CP"]
(Note that some other cross-validation errors are quite close to this minimum, so we might choose a larger value of the complexity parameter to obtain a more parsimonious model, which might extrapolate better.)
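One common way to formalize this choice is the so-called 1-SE rule: take the largest cp whose cross-validation error is within one standard error of the minimum. A sketch, using the cptable columns shown above:
xerr.thresh <- cp.table[cp.ix, "xerror"] + cp.table[cp.ix, "xstd"]  # minimum + 1 SE
(cp.1se <- max(cp.table[cp.table[, "xerror"] <= xerr.thresh, "CP"]))
# prune(m.lzn.rp, cp=cp.1se) would then give the more parsimonious tree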
Prune the tree with this complexity parameter, and examine in the same way as above.
(Note we would get the same result if we re-fit with this parameter.)
(m.lzn.rpp <- prune(m.lzn.rp, cp=cp.min))
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342 *
## 33) ffreq=1 31 0.47328450 2.373281 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391 *
rpart.plot(m.lzn.rpp, digits=3, type=4, extra=1)
Q: How does this tree differ from the full (un-pruned) tree? Look at the predictors actually used, the number of splits, the number of leaves, and the number of observations in each leaf.
Q: Which leaf predicts the highest concentration of log10(Zn)? Explain that by reading the predictors and their values down the tree from the root to this leaf. Does this fit your theory of the origin of these metals?
Now, compare the model predictions to the known values:
p.rpp <- predict(m.lzn.rpp, newdata=meuse)
length(unique(p.rpp))
## [1] 9
summary(r.rpp <- meuse$logZn - p.rpp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.320887 -0.090478 -0.005171 0.000000 0.081155 0.293237
sqrt(sum(r.rpp^2)/length(r.rpp))
## [1] 0.1223873
plot(meuse$logZn ~ p.rpp, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Regression Tree")
grid()
abline(0,1)
Q: What is the RMSE?
Q: Why are there vertical lines at all the fitted values?
Another use of CART is to predict a category. Here we try to predict whether a site is in flood frequency class 1, 2, or 3. If this is a successful model, we could use it to map flood frequency, if we didn’t already have this map.
Q: What covariables might help predict flooding frequency?
Flooding is influenced by the distance from the river and the elevation above the river. Build a model with these, along with the coordinates, using a complexity parameter of 0.02, i.e., each split must improve the overall fit by at least 2%.
m.ff.rp <- rpart(ffreq ~ x + y + dist.m + elev,
data=meuse,
minsplit=2,
cp=0.02)
print(m.ff.rp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265)
## 56) y>=330644 35 5 2 (0.14285714 0.85714286 0.00000000)
## 112) dist.m>=470 17 5 2 (0.29411765 0.70588235 0.00000000)
## 224) elev< 8.526 3 0 1 (1.00000000 0.00000000 0.00000000) *
## 225) elev>=8.526 14 2 2 (0.14285714 0.85714286 0.00000000) *
## 113) dist.m< 470 18 0 2 (0.00000000 1.00000000 0.00000000) *
## 57) y< 330644 14 4 2 (0.00000000 0.71428571 0.28571429)
## 114) elev< 8.7225 11 1 2 (0.00000000 0.90909091 0.09090909) *
## 115) elev>=8.7225 3 0 3 (0.00000000 0.00000000 1.00000000) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rp)
Q: Were both predictors used?
Q: What was the first split?
Find the value of the complexity parameter that minimizes the cross-validation error, and prune the tree with that value.
printcp(m.ff.rp)
##
## Classification tree:
## rpart(formula = ffreq ~ x + y + dist.m + elev, data = meuse,
## minsplit = 2, cp = 0.02)
##
## Variables actually used in tree construction:
## [1] dist.m elev y
##
## Root node error: 71/155 = 0.45806
##
## n= 155
##
## CP nsplit rel error xerror xstd
## 1 0.267606 0 1.00000 1.00000 0.087366
## 2 0.140845 2 0.46479 0.49296 0.073316
## 3 0.042254 3 0.32394 0.36620 0.065518
## 4 0.028169 5 0.23944 0.39437 0.067462
## 5 0.021127 6 0.21127 0.33803 0.063433
## 6 0.020000 10 0.12676 0.38028 0.066506
plotcp(m.ff.rp)
cp.table.class <- m.ff.rp[["cptable"]]
cp.ix <- which.min(cp.table.class[,"xerror"])
print(cp.table.class[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.02112676 6.00000000 0.21126761 0.33802817 0.06343327
(cp.min <- cp.table.class[cp.ix,"CP"])
## [1] 0.02112676
m.ff.rpp <- prune(m.ff.rp, cp=cp.min)
print(m.ff.rpp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rpp, type=2, extra=1)
Q: Now how many splits? Which variables were used? Are all classes predicted? How pure are the leaves?
Q: How useful is this model? In other words, how well could flooding frequency class be mapped from knowing the distance to river and elevation?
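A quick way to judge this (optimistic, since it uses the fitted rather than cross-validated classes) is a confusion matrix of predicted vs. actual flood-frequency classes:
table(predicted=predict(m.ff.rpp, type="class"), actual=meuse$ffreq)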
Variable importance:
m.ff.rp$variable.importance
## elev y x dist.m
## 38.26320 38.24051 16.58514 11.40992
m.ff.rpp$variable.importance
## y elev x dist.m
## 36.74909 29.85313 14.99264 9.89731
In both the unpruned and pruned trees the y-coordinate and elevation are most important.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What variables are available for mapping over the grid?
The grid only has coordinates, relative distance, soil type, and flood frequency. So if we want a model to map log(Zn) over the grid, we can only use these variables.
Build and prune a regression tree for log(Zn) with these predictors; display the pruned tree and the variable importance.
m.lzn.g <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse,
minsplit=2,
cp=0.005)
plotcp(m.lzn.g)
rpart.plot(m.lzn.g, digits=3, type=4, extra=1)
cp.table <- m.lzn.g[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.008159035 11.000000000 0.132179501 0.414892945 0.074165183
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.008159035
Now prune the tree to the indicated complexity:
m.lzn.gp <- prune(m.lzn.g, cp=cp.min)
rpart.plot(m.lzn.gp)
print(tmp <- m.lzn.gp$variable.importance)
## dist x soil y ffreq
## 10.316275 4.015682 3.417669 2.582310 2.407865
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 45.36660
## x 17.65927
## soil 15.02946
## y 11.35591
## ffreq 10.58877
Q: Which predictors were used? What was their relative importance?
Q: Why is the x-coordinate important in this dataset? Do you think this would be important in other parts of the flood plain?
Now we can map with this.
First, make an sp version of the grid, with the coordinates having special status, i.e., not treated as feature-space attributes:
meuse.grid.sp <- meuse.grid
coordinates(meuse.grid.sp) <- ~ x + y; gridded(meuse.grid.sp) <- TRUE
meuse.grid.sp$predictedLog10Zn <- predict(m.lzn.gp, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn")
Q: How realistic is this map?
Single regression trees can vary quite a bit depending on small differences in the dataset. Convince yourself of this.
Remove a random 15 of the 155 points, then fit and prune the first tree we attempted, i.e., log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether lime has been added to the site, and elevation above sea level.
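(Because sample() is random, your subset, and therefore your tree, will differ from the one shown here. If you want a reproducible subset you could set a seed first; the value is arbitrary.)
# set.seed(314)  # optional: makes the random 15-point removal reproducible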
meuse.140.ix <- sample(1:dim(meuse)[1], size=140, replace=FALSE)
meuse.140 <- meuse[meuse.140.ix,]
m.lzn.rp.140 <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse.140,
minsplit=2,
cp=0.005)
rpart.plot(m.lzn.rp.140, digits=3, type=4, extra=1)
print(tmp <- m.lzn.rp.140$variable.importance)
## dist.m elev lime x y soil ffreq
## 8.9071397 5.9338954 4.4152575 2.1333175 1.9975137 1.4132827 0.8036173
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 34.788046
## elev 23.175636
## lime 17.244389
## x 8.331962
## y 7.801562
## soil 5.519768
## ffreq 3.138637
Prune the tree:
plotcp(m.lzn.rp.140)
cp.table <- m.lzn.rp.140[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.009101917 9.000000000 0.135288284 0.273448718 0.037955567
cp.min <- cp.table[cp.ix,"CP"]
m.lzn.rpp.140 <- prune(m.lzn.rp.140, cp=cp.min)
rpart.plot(m.lzn.rpp.140, digits=3, type=4, extra=1)
Q: What are the similarities and differences between your tree and the one we fit at the beginning of the exercise? Consider the “optimum” complexity parameter, the number of leaves and splits, where the predictors are used, and which predictors are used.
We can do this also for the map:
m.lzn.g.140 <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse.140,
minsplit=2,
cp=0.005)
Prune the tree to the indicated complexity:
plotcp(m.lzn.g.140)
rpart.plot(m.lzn.g.140, digits=3, type=4, extra=1)
cp.table <- m.lzn.g.140[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.00957948 8.00000000 0.14802401 0.26309089 0.04321030
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.00957948
m.lzn.gp.140 <- prune(m.lzn.g.140, cp=cp.min)
rpart.plot(m.lzn.gp.140)
print(tmp <- m.lzn.gp.140$variable.importance)
## dist soil x ffreq y
## 9.557269 4.683515 3.714189 3.399352 3.251236
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 38.84191
## soil 19.03438
## x 15.09492
## ffreq 13.81538
## y 13.21342
Compute and display the map using the pruned tree:
meuse.grid.sp$predictedLog10Zn.140 <- predict(m.lzn.gp.140, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.140")
Compute and display the differences in predictions:
meuse.grid.sp$predictedLog10Zn.diff <-
meuse.grid.sp$predictedLog10Zn - meuse.grid.sp$predictedLog10Zn.140
summary(meuse.grid.sp$predictedLog10Zn.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.537763 0.007214 0.033264 -0.003123 0.042397 0.592105
spplot(meuse.grid.sp, zcol="predictedLog10Zn.diff", col.regions=topo.colors(36))
Q: How different are the maps made with the regression tree based on all 155 observations, and the one made with 140 randomly-selected observations? Can you explain the spatial pattern of the differences? Hint: plot the locations of the omitted points.
meuse.omitted <- meuse[-meuse.140.ix,]
plot(y ~ x, data=meuse.omitted, asp=1,
cex=2*meuse.omitted$logZn/max(meuse.omitted$logZn))
data(meuse.riv)
lines(meuse.riv)
grid()
The idea here is simple: build many regression trees and average them.
Reference: Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Packages used in this section: randomForest
for random forests, randomForestExplainer
for some diagnostics.
library(randomForest)
library(randomForestExplainer)
First build the forest, using the randomForest
function:
m.lzn.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=5)
print(m.lzn.rf)
##
## Call:
## randomForest(formula = logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 5, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.01887242
## % Var explained: 80.67
The mtry optional argument gives the number of variables randomly sampled as candidates at each split. For regression the default is \(\lfloor p/3 \rfloor\), where \(p\) is the number of predictors; here \(p = 7\), so the default would be \(\lfloor 7/3 \rfloor = 2\). We set mtry=5 so that most of the variables are candidates at each split, for more accurate splitting.
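A quick check of that arithmetic:
p <- 7              # predictors in the formula above
max(floor(p/3), 1)  # randomForest's default mtry for regression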
Q: How much of the variance is explained by the forest?
Here we cannot display a single tree structure, since each tree in the forest is different.
Display the cross-validation error rate against the number of trees:
plot(m.lzn.rf)
Q: How many trees were needed to make a reliable forest?
Each RF fit is random, so repeated fits will not be identical. How consistent are they?
Compute the RF several times and collect statistics:
n <- 24
rf.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
for (i in 1:n) {
model.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data=meuse, importance=T, na.action=na.omit, mtry=5)
rf.stats[i, "mse"] <- median(model.rf$mse)
rf.stats[i, "rsq"] <- median(model.rf$rsq)
}
summary(rf.stats[,2:3])
## rsq mse
## Min. :0.7917 Min. :0.01872
## 1st Qu.:0.8002 1st Qu.:0.01921
## Median :0.8013 Median :0.01940
## Mean :0.8013 Mean :0.01940
## 3rd Qu.:0.8033 3rd Qu.:0.01951
## Max. :0.8083 Max. :0.02034
hist(rf.stats[,"rsq"], xlab="RandomForest R^2")
rug(rf.stats[,"rsq"])
hist(rf.stats[,"mse"], xlab="RandomForest RMSE")
rug(rf.stats[,"mse"])
Q: How consistent are the 24 runs of the random forest?
Examine the variable importance. Type 1 is the increase in MSE when the values of that predictor are randomly permuted among the points, compared with using their actual values. It is computed as follows (from the Help): “For each tree, the prediction error on the OOB portion of the data is recorded (…MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences”.
Type 2 is the increase in node purity. It is computed as follows: “The total decrease in node impurities from splitting on the [given] variable, averaged over all trees… measured by residual sum of squares.”
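To see the idea behind the Type 1 measure, here is an illustrative sketch: permute one predictor, re-predict, and look at the increase in MSE. (The real computation is done per tree on its out-of-bag cases and then scaled, so these numbers will not match the importance() output.)
meuse.perm <- meuse
meuse.perm$dist.m <- sample(meuse.perm$dist.m)   # destroy the dist.m/logZn relation
mse.orig <- mean((meuse$logZn - predict(m.lzn.rf, newdata=meuse))^2)
mse.perm <- mean((meuse$logZn - predict(m.lzn.rf, newdata=meuse.perm))^2)
mse.perm - mse.orig   # increase in MSE attributable to permuting dist.m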
importance(m.lzn.rf, type=1)
## %IncMSE
## ffreq 13.511952
## x 26.031058
## y 18.648671
## dist.m 57.516695
## elev 37.554231
## soil 6.934790
## lime 9.604052
importance(m.lzn.rf, type=2)
## IncNodePurity
## ffreq 0.4501877
## x 1.1210360
## y 0.7244526
## dist.m 8.0261675
## elev 3.5421854
## soil 0.2220359
## lime 0.7047061
varImpPlot(m.lzn.rf, type=1)
Q: Does this agree with the variable importance from the single trees we built above?
Q: Why are some variables given importance, although they were not used in the regression tree models?
randomForestExplainer
Another way to look at this is by the depth in the tree at which each predictor is first used. This is found with the min_depth_distribution function of the randomForestExplainer package.
min_depth_frame <- min_depth_distribution(m.lzn.rf)
str(min_depth_frame) # has results for all the trees
## 'data.frame': 3222 obs. of 3 variables:
## $ tree : int 1 1 1 1 1 1 2 2 2 2 ...
## $ variable : Factor w/ 7 levels "dist.m","elev",..: 1 2 3 4 5 6 1 2 4 5 ...
## $ minimal_depth: num 0 1 1 4 2 3 1 0 1 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
plot_min_depth_distribution(min_depth_frame)
Q: Which predictor is used most as the root?
Q: Which predictor is used most at the second level?
Q: Which predictors are not used in any trees?
Q: Do these results agree with the variable importance plot?
This package gives a wider choice of importance measures:
importance_frame <- measure_importance(m.lzn.rf)
print(importance_frame)
## variable mean_min_depth no_of_nodes mse_increase node_purity_increase
## 1 dist.m 0.342000 5437 0.081638354 8.0261675
## 2 elev 1.042000 6520 0.033787030 3.5421854
## 3 ffreq 3.625152 994 0.007941017 0.4501877
## 4 lime 4.404208 484 0.003762594 0.7047061
## 5 soil 5.189408 686 0.001803041 0.2220359
## 6 x 2.142000 5722 0.011567810 1.1210360
## 7 y 2.640000 5199 0.007384188 0.7244526
## no_of_trees times_a_root p_value
## 1 500 351 2.311138e-219
## 2 500 100 0.000000e+00
## 3 458 1 1.000000e+00
## 4 357 40 1.000000e+00
## 5 407 8 1.000000e+00
## 6 500 0 1.575661e-286
## 7 500 0 1.324765e-169
Q: Which predictor was used in the most nodes, over all trees?
Q: Which predictors were never used as the root of the tree?
Q: Which predictors were used in every tree?
The p-value measures whether the observed number of successes (the number of nodes in which \(X_j\) was used for splitting) exceeds the theoretical number of successes if they were random, following a binomial distribution. See https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html for details.
Another interesting plot is the “multiway importance plot”, which by default shows the number of times as root vs. the mean minimum depth for each predictor. There is typically a negative relation between times_a_root and mean_min_depth.
plot_multi_way_importance(m.lzn.rf, size_measure = "no_of_nodes")
Q: Which predictor is most important by both measures?
Q: Is this a consistent negative relation? If not, where are the exceptions?
Another interesting plot is the pairwise comparison of different importance measures:
plot_importance_ggpairs(importance_frame)
Q: Which are the most and least related measures?
Examine the interactions between predictors, in terms of their use within the forest:
interactions_frame <- min_depth_interactions(m.lzn.rf)
interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE)[1:12], ]
## variable root_variable mean_min_depth occurrences interaction
## 43 y dist.m 1.5595960 495 dist.m:y
## 8 elev dist.m 0.3920364 494 dist.m:elev
## 1 dist.m dist.m 1.0286343 492 dist.m:dist.m
## 36 x dist.m 1.0994586 491 dist.m:x
## 9 elev elev 1.1333818 483 elev:elev
## 44 y elev 1.7066505 474 elev:y
## 37 x elev 1.5122101 473 elev:x
## 2 dist.m elev 1.2774303 456 elev:dist.m
## 15 ffreq dist.m 3.9632323 395 dist.m:ffreq
## 48 y x 1.9688081 389 x:y
## 41 x x 2.2824646 388 x:x
## 13 elev x 2.3496162 364 x:elev
## uncond_mean_min_depth
## 43 2.640000
## 8 1.042000
## 1 0.342000
## 36 2.142000
## 9 1.042000
## 44 2.640000
## 37 2.142000
## 2 0.342000
## 15 3.625152
## 48 2.640000
## 41 2.142000
## 13 1.042000
plot_min_depth_interactions(interactions_frame)
Q: Which pair of predictors are most used near the root of the tree, and with the most co-occurrences?
“To further investigate the most frequent interaction we use the function plot_predict_interaction to plot the prediction of our forest on a grid of values for the components of each interaction. The function requires the forest, training data, variable to use on x and y-axis, respectively. In addition, one can also decrease the number of points in both dimensions of the grid from the default of 100 in case of insufficient memory using the parameter grid.”
Here we see the predicted log(Zn) content for all combinations of distance from river and elevation:
plot_predict_interaction(m.lzn.rf, meuse, "dist.m", "elev")
These partial dependence plots show the marginal effect of one predictor on the predictand, averaging the forest's predictions over the observed values of the other predictors. We show these in order of variable importance, as determined just above.
partialPlot(m.lzn.rf, pred.data=meuse, x.var="x")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="y")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="dist.m")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="elev")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="ffreq", which.class=1)
Q: What is the effect of just the distance to the river on the RF prediction of log10(Zn), once the effects of the other variables (the continuous elev and the classified ffreq, soil and lime) have been averaged out? Is the effect monotonic (i.e., consistently increasing or decreasing)? Why or why not?
Predict back onto the known points, and evaluate the goodness-of-fit:
p.rf <- predict(m.lzn.rf, newdata=meuse)
length(unique(p.rf))
## [1] 155
summary(r.rpp <- meuse$logZn - p.rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1825439 -0.0364426 -0.0025826 -0.0004145 0.0329439 0.1577282
(rmse.rf <- sqrt(sum(r.rpp^2)/length(r.rpp)))
## [1] 0.06099567
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the RMSE?
Q: How many different predictions were made? Why not just a few, as in regression trees?
A more realistic view of the predictive power of the model is the out-of-bag validation.
p.rf.oob <- predict(m.lzn.rf)
summary(r.rpp.oob <- meuse$logZn - p.rf.oob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.426973 -0.082194 -0.010906 -0.001954 0.077214 0.377211
(rmse.oob <- sqrt(sum(r.rpp.oob^2)/length(r.rpp.oob)))
## [1] 0.1373769
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
xlab="Out-of-bag cross-validation estimates",
ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3),
main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the OOB RMSE? How does this compare with the fitted RMSE? Which is more realistic for predictions?
Q: How does this compare to the model fits?
Q: How does this compare with the RMSE from the Regression Tree?
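One way to line the three up (re-computing the pruned regression-tree RMSE, which we did not save above):
rmse.rt <- sqrt(mean((meuse$logZn - predict(m.lzn.rpp, newdata=meuse))^2))
data.frame(model=c("regression tree (fit)", "random forest (fit)", "random forest (OOB)"),
RMSE=c(rmse.rt, rmse.rf, rmse.oob))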
We can try to build an RF model to map the grid, but again we can only use coordinates, flood frequency, soil type and relative distance to the river, because these are the only variables we have over the whole area, i.e., in the spatial object meuse.grid.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
m.lzn.rf.g <- randomForest(logZn ~ x + y + ffreq + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.lzn.rf.g)
(tmp <- importance(m.lzn.rf.g, type=1))
## %IncMSE
## x 30.38927
## y 20.81755
## ffreq 37.19663
## dist 52.99835
## soil 11.68151
(100*tmp/sum(tmp))
## %IncMSE
## x 19.851457
## y 13.598840
## ffreq 24.298292
## dist 34.620595
## soil 7.630817
varImpPlot(m.lzn.rf.g, type=1)
meuse.grid.sp$predictedLog10Zn.rf <- predict(m.lzn.rf.g, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.rf")
Q: Does this map look realistic? How does it compare with the map made with the single regression tree?
This method is derived from Quinlan's C4.5/M5 family of tree and rule learners.
References:
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, Calif: Morgan Kaufmann Publishers.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. https://doi.org/10.1613/jair.279
A recent article comparing it to RF is: Houborg, R., & McCabe, M. F. (2018). A hybrid training approach for leaf area index estimation via Cubist and random forests machine-learning. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 173–188. https://doi.org/10.1016/j.isprsjprs.2017.10.004
It is similar to RF, but instead of a single value at each leaf it fits a multivariate linear regression to the cases in that leaf, using the predictors that appear in the splits leading to it.
The Cubist package is an R port of the Cubist GPL C code released by RuleQuest at http://rulequest.com/cubist-info.html.
An explanation from the Cubist vignette:
Cubist is a rule–based model that is an extension of Quinlan’s M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
Package used in this section: Cubist for the Cubist rule-based regression algorithm.
library(Cubist)
Split the data into training and test sets, here with 80% in the training set:
inTrain <- sample(1:nrow(meuse), floor(.8*nrow(meuse)))
Build training and test matrices. Note there is no formula method for cubist; the predictors are specified as a matrix or data frame and the outcome is a numeric vector.
preds <- c("x","y","ffreq","dist.m","elev","soil","lime")
train.pred <- meuse[ inTrain, preds]
test.pred <- meuse[-inTrain, preds]
train.resp <- meuse$logZn[ inTrain]
test.resp <- meuse$logZn[-inTrain]
Now fit the model to the training set:
(c.model <- cubist(x = train.pred, y = train.resp))
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
## Number of samples: 124
## Number of predictors: 7
##
## Number of committees: 1
## Number of rules: 2
summary(c.model)
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Sun Aug 4 18:41:04 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 124 cases (8 attributes) from undefined.data
##
## Model:
##
## Rule 1: [80 cases, mean 2.366983, range 2.053078 to 2.89098, est err 0.118638]
##
## if
## dist.m > 140
## then
## outcome = -10.61724 + 0.000166 y - 0.000228 x - 0.111 elev
## - 8e-05 dist.m
##
## Rule 2: [44 cases, mean 2.833035, range 2.281033 to 3.196176, est err 0.143328]
##
## if
## dist.m <= 140
## then
## outcome = 3.586028 - 0.00129 dist.m - 0.088 elev
##
##
## Evaluation on training data (124 cases):
##
## Average |error| 0.119965
## Relative |error| 0.47
## Correlation coefficient 0.87
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist.m
## 100% elev
## 65% x
## 65% y
##
##
## Time: 0.0 secs
In this case there are only two rules, each with its own regression model. The split is on distance to the river.
We predict onto the test set to get an estimate of predictive accuracy:
c.pred <- predict(c.model, test.pred)
## Test set RMSE
sqrt(mean((c.pred - test.resp)^2))
## [1] 0.1770856
## Test set R^2
cor(c.pred, test.resp)^2
## [1] 0.8244534
Now, fit a model on the entire set and then predict over the grid. Again we can only use predictors that are in the grid.
preds <- c("x","y","dist","soil","ffreq")
all.preds <- meuse[, preds]
all.resp <- meuse$logZn
c.model.grid <- cubist(x = all.preds, y = all.resp)
summary(c.model.grid)
##
## Call:
## cubist.default(x = all.preds, y = all.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Sun Aug 4 18:41:04 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model:
##
## Rule 1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.111292
## Relative |error| 0.41
## Correlation coefficient 0.85
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist
## 48% 52% x
## 57% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist <- predict(c.model.grid, newdata=meuse.grid[,preds])
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist", col.regions=bpy.colors(64))
Here we can see the effect of the local regression models, as well as the hard splits on dist and x.
Cubist can improve performance in two ways: (1) with “committees” and (2) adjusting by nearest neighbours:
The Cubist model can also use a boosting–like scheme called committees where iterative model trees are created in sequence. The first tree follows the procedure described in the last section. Subsequent trees are created using adjusted versions to the training set outcome: if the model over–predicted a value, the response is adjusted downward for the next model (and so on). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
Another innovation in Cubist [is] using nearest–neighbors to adjust the predictions from the rule–based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find its nearest neighbors and determine the average of these training set points.
To find the optimum values for these, we turn to the caret package.
The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques.
See also: http://topepo.github.io/caret/model-training-and-tuning.html
The train function in caret can use many models, supplied by other packages, by specifying the method argument. Then the tuneGrid parameter refers to a grid of possible parameters to compare by cross-validation. Note we use all the points, because train does the splitting as it evaluates the alternative models.
require(caret)
test <- train(x = all.preds, y = all.resp, method="cubist",
tuneGrid = expand.grid(.committees = 1:12,
.neighbors = 0:5),
trControl = trainControl(method = 'cv'))
print(test)
## Cubist
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 140, 139, 140, 140, 139, 139, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1883956 0.6644354 0.1458751
## 1 1 0.1588804 0.7350231 0.1166150
## 1 2 0.1460033 0.7801726 0.1091955
## 1 3 0.1465169 0.7847191 0.1106126
## 1 4 0.1493369 0.7795308 0.1132737
## 1 5 0.1520507 0.7743130 0.1169301
## 2 0 0.1683147 0.7411218 0.1315260
## 2 1 0.1608123 0.7397582 0.1172865
## 2 2 0.1462212 0.7862414 0.1088365
## 2 3 0.1448057 0.7933283 0.1096120
## 2 4 0.1457635 0.7923567 0.1110771
## 2 5 0.1458465 0.7932452 0.1128878
## 3 0 0.1676694 0.7274362 0.1300419
## 3 1 0.1594861 0.7355455 0.1159521
## 3 2 0.1463699 0.7792583 0.1092873
## 3 3 0.1450852 0.7868112 0.1097221
## 3 4 0.1463667 0.7846627 0.1116297
## 3 5 0.1471833 0.7842686 0.1135879
## 4 0 0.1722867 0.7230334 0.1350674
## 4 1 0.1595969 0.7374772 0.1157987
## 4 2 0.1465086 0.7796208 0.1087658
## 4 3 0.1457687 0.7860170 0.1102305
## 4 4 0.1471952 0.7840891 0.1124495
## 4 5 0.1479295 0.7841296 0.1146606
## 5 0 0.1692093 0.7264544 0.1308310
## 5 1 0.1573260 0.7432324 0.1134389
## 5 2 0.1451288 0.7830943 0.1085275
## 5 3 0.1445451 0.7892661 0.1097826
## 5 4 0.1458363 0.7875755 0.1119257
## 5 5 0.1465786 0.7876978 0.1137389
## 6 0 0.1681138 0.7348997 0.1306287
## 6 1 0.1577049 0.7448716 0.1139049
## 6 2 0.1453464 0.7845837 0.1083253
## 6 3 0.1450470 0.7898445 0.1096985
## 6 4 0.1465139 0.7878958 0.1119542
## 6 5 0.1472043 0.7882202 0.1139688
## 7 0 0.1689879 0.7319173 0.1298003
## 7 1 0.1590891 0.7424403 0.1141648
## 7 2 0.1478159 0.7790716 0.1101016
## 7 3 0.1472611 0.7850995 0.1114656
## 7 4 0.1486386 0.7831451 0.1129634
## 7 5 0.1494694 0.7834350 0.1150561
## 8 0 0.1692387 0.7303566 0.1308911
## 8 1 0.1583597 0.7451302 0.1140818
## 8 2 0.1469568 0.7816830 0.1096589
## 8 3 0.1467558 0.7864582 0.1110293
## 8 4 0.1482476 0.7842532 0.1130076
## 8 5 0.1490496 0.7844580 0.1148313
## 9 0 0.1680918 0.7334097 0.1293326
## 9 1 0.1580984 0.7445659 0.1135899
## 9 2 0.1460286 0.7835949 0.1082919
## 9 3 0.1456577 0.7891344 0.1096461
## 9 4 0.1470649 0.7872591 0.1117024
## 9 5 0.1473304 0.7886694 0.1135409
## 10 0 0.1667062 0.7361165 0.1293054
## 10 1 0.1583984 0.7437456 0.1141123
## 10 2 0.1463530 0.7827190 0.1089020
## 10 3 0.1461691 0.7873274 0.1105037
## 10 4 0.1475471 0.7854459 0.1122338
## 10 5 0.1484767 0.7850997 0.1142535
## 11 0 0.1667215 0.7397498 0.1287537
## 11 1 0.1583368 0.7437772 0.1138371
## 11 2 0.1460518 0.7836939 0.1080391
## 11 3 0.1456496 0.7891811 0.1092283
## 11 4 0.1468634 0.7879053 0.1108272
## 11 5 0.1471398 0.7893048 0.1128886
## 12 0 0.1645851 0.7456427 0.1277227
## 12 1 0.1583615 0.7432244 0.1137673
## 12 2 0.1456015 0.7842648 0.1080848
## 12 3 0.1451584 0.7898198 0.1092122
## 12 4 0.1462251 0.7892730 0.1110120
## 12 5 0.1469560 0.7895430 0.1129932
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 5 and neighbors = 3.
plot(test, metric="RMSE")
plot(test, metric="Rsquared")
Of course the cross-validation splits are random, so each run will be different. In this run the table shows that adding committees improves performance somewhat (with no neighbour adjustment, from about 0.19 to about 0.165 cross-validation RMSE), and the nearest-neighbour adjustment helps considerably more, down to about 0.145 with two or three neighbours; beyond that the improvement levels off.
So we fit a model with approximately these recommended parameters; here we use four committees. Note the neighbour count is only used at prediction time, not in model building.
c.model.grid.2 <- cubist(x = all.preds, y = all.resp, committees=4)
summary(c.model.grid.2)
##
## Call:
## cubist.default(x = all.preds, y = all.resp, committees = 4)
##
##
## Cubist [Release 2.07 GPL Edition] Sun Aug 4 18:41:13 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 1/2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 1/3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
## Model 2:
##
## Rule 2/1: [45 cases, mean 2.418724, range 2.10721 to 2.893762, est err 0.182228]
##
## if
## x <= 179826
## ffreq in {2, 3}
## then
## outcome = 128.701732 - 0.000705 x
##
## Rule 2/2: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.181513]
##
## if
## dist > 0.0703468
## then
## outcome = 30.512065 - 0.87 dist - 0.000154 x
##
## Rule 2/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.125950]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 37.730889 - 0.000314 x - 0.35 dist + 6.5e-05 y
##
## Rule 2/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.139639]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.982852 - 0.36 dist
##
## Model 3:
##
## Rule 3/1: [75 cases, mean 2.325347, range 2.053078 to 2.89098, est err 0.150262]
##
## if
## dist > 0.211846
## then
## outcome = 2.263175 - 0.11 dist
##
## Rule 3/2: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.156918]
##
## if
## dist <= 0.211846
## then
## outcome = 2.978074 - 2.42 dist
##
## Model 4:
##
## Rule 4/1: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.161020]
##
## if
## dist > 0.0703468
## then
## outcome = 4.601694 - 0.61 dist - 0.000101 x + 4.9e-05 y
##
## Rule 4/2: [51 cases, mean 2.505994, range 2.075547 to 3.055378, est err 0.196863]
##
## if
## x <= 179731
## dist > 0.0703468
## then
## outcome = 175.779227 - 0.000967 x
##
## Rule 4/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.147504]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 53.629078 - 0.000343 x - 0.44 dist + 3.3e-05 y
##
## Rule 4/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.140965]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.975269 - 0.08 dist
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.110135
## Relative |error| 0.40
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 95% 88% dist
## 21% 64% x
## 19% ffreq
## 39% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist.2 <-
predict(c.model.grid.2, newdata=meuse.grid[,preds],
neighbors=5)
summary(meuse.grid.sp$predictedLog10Zn.cubist.2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.954 2.228 2.346 2.445 2.640 3.154
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist.2", col.regions=bpy.colors(64),
main="Optimized Cubist prediction")
A much nicer looking map.
Variable importance. The usage field of the fitted model shows the usage of each variable in either the rule conditions or the (terminal) linear model.
c.model.grid$usage
## Conditions Model Variable
## 1 100 100 dist
## 2 48 52 x
## 3 0 57 y
## 4 0 0 soil
## 5 0 0 ffreq
c.model.grid.2$usage
## Conditions Model Variable
## 1 95 88 dist
## 2 21 64 x
## 3 19 0 ffreq
## 4 0 39 y
## 5 0 0 soil
caret::varImp(c.model.grid)
## Overall
## dist 100.0
## x 50.0
## y 28.5
## soil 0.0
## ffreq 0.0
caret::varImp(c.model.grid.2)
## Overall
## dist 91.5
## x 42.5
## ffreq 9.5
## y 19.5
## soil 0.0
Distance is always used in the single model, but sometimes not in the committee model.
Notice how flood frequency came into the model when a committee was formed, i.e., subsidiary rules to adjust mis-matches from higher-level rules.
It is interesting to see how the different models fit a single predictor, in this case distance to river:
distances <- all.preds[, "dist", drop = FALSE]
rules_only <- cubist(x = distances, y = all.resp)
rules_and_com <- cubist(x = distances, y = all.resp, committees = 4)
predictions <- distances
predictions$true <- all.resp
predictions$rules <- predict(rules_only, distances, neighbors = 0)
predictions$rules_neigh <- predict(rules_only, distances, neighbors = 5)
predictions$committees <- predict(rules_and_com, distances, neighbors = 0)
predictions$committees_neigh <- predict(rules_and_com, distances, neighbors = 5)
summary(predictions)
## dist true rules rules_neigh
## Min. :0.00000 Min. :2.053 Min. :2.069 Min. :2.046
## 1st Qu.:0.07569 1st Qu.:2.297 1st Qu.:2.311 1st Qu.:2.331
## Median :0.21184 Median :2.513 Median :2.482 Median :2.541
## Mean :0.24002 Mean :2.556 Mean :2.545 Mean :2.556
## 3rd Qu.:0.36407 3rd Qu.:2.829 3rd Qu.:2.804 3rd Qu.:2.791
## Max. :0.88039 Max. :3.265 Max. :2.982 Max. :2.992
## committees committees_neigh
## Min. :2.043 Min. :2.035
## 1st Qu.:2.347 1st Qu.:2.330
## Median :2.495 Median :2.524
## Mean :2.545 Mean :2.555
## 3rd Qu.:2.771 3rd Qu.:2.791
## Max. :2.924 Max. :2.990
plot(true ~ dist, data=predictions, pch=20, col="darkgray",
xlab="covariate: normalized distance to river",
ylab="log10(Zn)", main="Cubist models")
grid()
m <- matrix(cbind(predictions$dist, predictions$rules), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=1, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$rules_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=2, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=3, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=4, lwd=2)
legend("topright", c("rules", "rules + neighbour",
"committee", "committee + neighbour"),
pch="----", col=1:4)
Notice how the committee model (green) smooths out the centre of the distance range, compared with the single-rules model (black).
Notice how the neighbours come much closer to some of the points – too close?
A pure “instance-based” approach is to predict at each location from the (possibly weighted) average of the values at a set of similar observations. The caret package includes KNN regression via the knnreg function.
Try with five neighbours:
knn.5 <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq, data=meuse, k=5)
knn.pred <- predict(knn.5, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
However, unless we know a priori how many neighbours to use, we should let train evaluate the optimum number of neighbours.
knn.tune <- train(x = all.preds, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
str(knn.tune)
## List of 20
## $ method : chr "knn"
## $ modelInfo :List of 13
## ..$ label : chr "k-Nearest Neighbors"
## ..$ library : NULL
## ..$ loop : NULL
## ..$ type : chr [1:2] "Classification" "Regression"
## ..$ parameters:'data.frame': 1 obs. of 3 variables:
## .. ..$ parameter: Factor w/ 1 level "k": 1
## .. ..$ class : Factor w/ 1 level "numeric": 1
## .. ..$ label : Factor w/ 1 level "#Neighbors": 1
## ..$ grid :function (x, y, len = NULL, search = "grid")
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 8 26 16 19 26 19 8 16
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ fit :function (x, y, wts, param, lev, last, classProbs, ...)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 17 25 24 19 25 19 17 24
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ predict :function (modelFit, newdata, submodels = NULL)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 25 29 33 19 29 19 25 33
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ predictors:function (x, ...)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 34 32 34 67 32 67 34 34
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ tags : chr "Prototype Models"
## ..$ prob :function (modelFit, newdata, submodels = NULL)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 36 26 37 61 26 61 36 37
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ levels :function (x)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 38 28 38 56 28 56 38 38
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## ..$ sort :function (x)
## .. ..- attr(*, "srcref")= 'srcref' int [1:8] 39 26 39 54 26 54 39 39
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f99404588b8>
## $ modelType : chr "Regression"
## $ results :'data.frame': 20 obs. of 7 variables:
## ..$ k : int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ RMSE : num [1:20] 0.24 0.208 0.192 0.199 0.2 ...
## ..$ Rsquared : num [1:20] 0.5 0.587 0.635 0.625 0.622 ...
## ..$ MAE : num [1:20] 0.171 0.158 0.147 0.153 0.155 ...
## ..$ RMSESD : num [1:20] 0.0488 0.058 0.0394 0.0333 0.0333 ...
## ..$ RsquaredSD: num [1:20] 0.185 0.206 0.147 0.126 0.141 ...
## ..$ MAESD : num [1:20] 0.0361 0.0394 0.0254 0.0232 0.0216 ...
## $ pred : NULL
## $ bestTune :'data.frame': 1 obs. of 1 variable:
## ..$ k: int 3
## $ call : language train.default(x = all.preds, y = all.resp, method = "knn", trControl = trainControl(method = "cv"), tuneGrid| __truncated__
## $ dots : list()
## $ metric : chr "RMSE"
## $ control :List of 28
## ..$ method : chr "cv"
## ..$ number : num 10
## ..$ repeats : logi NA
## ..$ search : chr "grid"
## ..$ p : num 0.75
## ..$ initialWindow : NULL
## ..$ horizon : num 1
## ..$ fixedWindow : logi TRUE
## ..$ skip : num 0
## ..$ verboseIter : logi FALSE
## ..$ returnData : logi TRUE
## ..$ returnResamp : chr "final"
## ..$ savePredictions : chr "none"
## ..$ classProbs : logi FALSE
## ..$ summaryFunction :function (data, lev = NULL, model = NULL)
## ..$ selectionFunction: chr "best"
## ..$ preProcOptions :List of 6
## .. ..$ thresh : num 0.95
## .. ..$ ICAcomp : num 3
## .. ..$ k : num 5
## .. ..$ freqCut : num 19
## .. ..$ uniqueCut: num 10
## .. ..$ cutoff : num 0.9
## ..$ sampling : NULL
## ..$ index :List of 10
## .. ..$ Fold01: int [1:139] 1 2 3 5 6 7 8 9 10 11 ...
## .. ..$ Fold02: int [1:139] 1 2 3 4 5 6 8 9 10 11 ...
## .. ..$ Fold03: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold04: int [1:139] 1 2 3 4 5 6 7 8 10 11 ...
## .. ..$ Fold05: int [1:140] 1 2 4 5 6 7 8 9 10 11 ...
## .. ..$ Fold06: int [1:141] 2 3 4 6 7 8 9 10 11 13 ...
## .. ..$ Fold07: int [1:140] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold08: int [1:138] 1 3 4 5 7 8 9 10 11 12 ...
## .. ..$ Fold09: int [1:141] 1 2 3 4 5 6 7 8 9 11 ...
## .. ..$ Fold10: int [1:139] 1 2 3 4 5 6 7 9 10 11 ...
## ..$ indexOut :List of 10
## .. ..$ Resample01: int [1:16] 4 47 55 69 76 77 83 84 91 94 ...
## .. ..$ Resample02: int [1:16] 7 17 25 45 46 67 82 90 103 104 ...
## .. ..$ Resample03: int [1:16] 16 26 37 40 42 51 65 73 78 99 ...
## .. ..$ Resample04: int [1:16] 9 13 21 22 24 39 43 56 63 66 ...
## .. ..$ Resample05: int [1:15] 3 44 49 52 70 75 79 87 102 118 ...
## .. ..$ Resample06: int [1:14] 1 5 12 23 28 34 38 60 71 74 ...
## .. ..$ Resample07: int [1:15] 11 14 35 36 53 101 116 120 121 123 ...
## .. ..$ Resample08: int [1:17] 2 6 18 30 33 58 61 64 68 80 ...
## .. ..$ Resample09: int [1:14] 10 27 31 32 48 50 59 62 72 81 ...
## .. ..$ Resample10: int [1:16] 8 15 19 20 29 41 54 57 88 92 ...
## ..$ indexFinal : NULL
## ..$ timingSamps : num 0
## ..$ predictionBounds : logi [1:2] FALSE FALSE
## ..$ seeds :List of 11
## .. ..$ : int [1:20] 813312 236430 890334 170210 987240 86795 771536 792235 734967 739182 ...
## .. ..$ : int [1:20] 992121 766012 71718 751307 208375 932562 722301 125947 260482 982996 ...
## .. ..$ : int [1:20] 140971 37331 700566 643864 410247 323307 643651 345975 730213 713809 ...
## .. ..$ : int [1:20] 374397 137389 688138 832233 18497 308165 140183 90599 949960 577892 ...
## .. ..$ : int [1:20] 433773 214035 844035 451682 989706 24932 415280 26177 187708 536907 ...
## .. ..$ : int [1:20] 758333 843264 504143 796850 388392 218870 291714 865609 690031 550765 ...
## .. ..$ : int [1:20] 177784 777203 120215 562508 699079 657961 101426 797222 768871 597234 ...
## .. ..$ : int [1:20] 741939 49288 20108 977177 156691 918281 320889 197208 837870 249758 ...
## .. ..$ : int [1:20] 976237 308871 76423 599447 668329 948711 579944 505136 792497 19165 ...
## .. ..$ : int [1:20] 82024 348360 99686 64554 888597 583501 904959 636821 109234 793446 ...
## .. ..$ : int 667130
## ..$ adaptive :List of 4
## .. ..$ min : num 5
## .. ..$ alpha : num 0.05
## .. ..$ method : chr "gls"
## .. ..$ complete: logi TRUE
## ..$ trim : logi FALSE
## ..$ allowParallel : logi TRUE
## ..$ yLimits : num [1:2] 1.99 3.33
## $ finalModel :List of 8
## ..$ learn :List of 2
## .. ..$ y: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## .. ..$ X: chr [1:155, 1:5] "181072" "181025" "181165" "181298" ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:155] "X1" "X2" "X3" "X4" ...
## .. .. .. ..$ : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ k : int 3
## ..$ theDots : list()
## ..$ xNames : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ problemType: chr "Regression"
## ..$ tuneValue :'data.frame': 1 obs. of 1 variable:
## .. ..$ k: int 3
## ..$ obsLevels : logi NA
## ..$ param : list()
## ..- attr(*, "class")= chr "knnreg"
## $ preProcess : NULL
## $ trainingData:'data.frame': 155 obs. of 6 variables:
## ..$ x : num [1:155] 181072 181025 181165 181298 181307 ...
## ..$ y : num [1:155] 333611 333558 333537 333484 333330 ...
## ..$ dist : num [1:155] 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ .outcome: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## $ resample :'data.frame': 10 obs. of 4 variables:
## ..$ RMSE : num [1:10] 0.182 0.276 0.179 0.152 0.21 ...
## ..$ Rsquared: num [1:10] 0.675 0.342 0.758 0.762 0.639 ...
## ..$ MAE : num [1:10] 0.134 0.184 0.138 0.118 0.165 ...
## ..$ Resample: chr [1:10] "Fold07" "Fold02" "Fold09" "Fold06" ...
## $ resampledCM : NULL
## $ perfNames : chr [1:3] "RMSE" "Rsquared" "MAE"
## $ maximize : logi FALSE
## $ yLimits : num [1:2] 1.99 3.33
## $ times :List of 3
## ..$ everything: 'proc_time' Named num [1:5] 1.357 0.026 1.403 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ final : 'proc_time' Named num [1:5] 0.001 0 0.002 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ prediction: logi [1:3] NA NA NA
## $ levels : logi NA
## - attr(*, "class")= chr "train"
print(knn.tune)
## k-Nearest Neighbors
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 139, 139, 139, 139, 140, 141, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.2396729 0.5000961 0.1707858
## 2 0.2084718 0.5866904 0.1576243
## 3 0.1924236 0.6348138 0.1472308
## 4 0.1985827 0.6245674 0.1528394
## 5 0.1996725 0.6215416 0.1554437
## 6 0.2041662 0.6067799 0.1589935
## 7 0.2040105 0.6098834 0.1601178
## 8 0.2096095 0.5869486 0.1645314
## 9 0.2117159 0.5824187 0.1679363
## 10 0.2148451 0.5706556 0.1710285
## 11 0.2205898 0.5480597 0.1749506
## 12 0.2264245 0.5328982 0.1816822
## 13 0.2305251 0.5177084 0.1849336
## 14 0.2350731 0.4912753 0.1895975
## 15 0.2350344 0.4951237 0.1906744
## 16 0.2383019 0.4800084 0.1953387
## 17 0.2412291 0.4615528 0.1983915
## 18 0.2459047 0.4408346 0.2036497
## 19 0.2476376 0.4375852 0.2065670
## 20 0.2490945 0.4339538 0.2079632
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
min(knn.tune$results$RMSE)
## [1] 0.1924236
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6348138
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 3
# directly find optimum
knn.tune$finalModel$k
## [1] 3
Here there is a clear winner, k = 3. It is not a very good model: cross-validation RMSE about \(0.19\) and \(R^2\) about \(0.63\).
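The resampling results for the selected k can also be pulled directly from the tuning object; a small sketch using caret's bestTune element:
# row of the resampling results corresponding to the selected k
knn.tune$results[knn.tune$results$k == knn.tune$bestTune$k, ]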
Use this to predict:
knn.final <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
knn.pred <- predict(knn.final, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
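As a quick check of this fit at the training points we can compute the RMSE and squared correlation; these are not cross-validated, so they flatter the model (a base-R sketch; the object names are just illustrative):
# goodness-of-fit at the training points (optimistic; not cross-validated)
(rmse.knn.fit <- sqrt(mean((knn.pred - meuse$logZn)^2)))
(r2.knn.fit <- cor(knn.pred, meuse$logZn)^2)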
We can’t map with this because elevation is not available in the prediction grid. So re-tune without it:
preds.2 <- c("x","y","ffreq","dist.m","soil","lime")
all.preds.2 <- meuse[, preds.2]
knn.tune <- train(x = all.preds.2, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
min(knn.tune$results$RMSE)
## [1] 0.1924077
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6143408
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 4
# directly find optimum
knn.tune$finalModel$k
## [1] 3
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
## the two criteria do not agree
Map with this, refitting the chosen k with knnreg on predictors that are also present in the prediction grid:
knn.final <- caret::knnreg(logZn ~ x + y + dist + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
meuse.grid.sp$knn <- predict(knn.final, newdata=meuse.grid)
summary(meuse.grid.sp$knn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.276 2.429 2.475 2.669 3.217
spplot(meuse.grid.sp, zcol="knn")
An interesting map! Note that here ‘nearest neighbour’ is defined in both geographic space (the coordinates) and feature space.
Random forests can use coordinates and distances to geographic features as predictors; we used the x and y coordinates in the previous section.
RF can also use distances to multiple points as predictors:
Reference: Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6, e5518. https://doi.org/10.7717/peerj.5518
Packages used in this section: randomForest
to build the RF, scales
for visualization, GSIF
for buffer distances.
library(randomForest) # to build the random forests
library(scales)       # for visualization
library(GSIF)         # for buffer distances
Set up quantiles of the log10-transformed Zn concentration and display its distribution:
library(ggplot2)
g <- ggplot(meuse, mapping=aes(x=logZn))
g + geom_histogram(bins=16, fill="lightgreen", color="blue") + geom_rug()
(q.lzn <- quantile(meuse$logZn, seq(0,1,by=0.0625)))
# must include.lowest to have lowest value in some class
classes.lzn.q <- cut(meuse$logZn, breaks=q.lzn, ordered_result=TRUE, include.lowest=TRUE)
levels(classes.lzn.q)
## 0% 6.25% 12.5% 18.75% 25% 31.25% 37.5% 43.75%
## 2.053078 2.139461 2.201372 2.261554 2.296665 2.322219 2.362643 2.414125
## 50% 56.25% 62.5% 68.75% 75% 81.25% 87.5% 93.75%
## 2.513218 2.603414 2.680704 2.754247 2.828981 2.873820 2.920515 3.056094
## 100%
## 3.264582
## [1] "[2.05,2.14]" "(2.14,2.2]" "(2.2,2.26]" "(2.26,2.3]" "(2.3,2.32]"
## [6] "(2.32,2.36]" "(2.36,2.41]" "(2.41,2.51]" "(2.51,2.6]" "(2.6,2.68]"
## [11] "(2.68,2.75]" "(2.75,2.83]" "(2.83,2.87]" "(2.87,2.92]" "(2.92,3.06]"
## [16] "(3.06,3.26]"
Q: Are the 16 quantiles more or less equally distributed along the range of the target variable?
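One way to answer this is to tabulate the class counts and look at the class widths; a quick sketch with the objects created above:
# observations per quantile class (nearly equal by construction)
table(classes.lzn.q)
# class widths in log10(Zn) units (not equal: observations are denser in some parts of the range)
round(diff(q.lzn), 3)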
Compute distance buffers to the points in each quantile. For this, the points object must be converted to a SpatialPointsDataFrame, and the buffer.dist function from the GSIF package is used.
coordinates(meuse) <- ~x + y
grid.dist.lzn <- GSIF::buffer.dist(meuse["logZn"], meuse.grid.sp, classes.lzn.q)
plot(stack(grid.dist.lzn))
Q: Explain the pattern of Layer 16.
Extract the distances to each buffer at each point, and add these as fields to the points:
buffer.dists <- over(meuse, grid.dist.lzn)
str(buffer.dists)
## 'data.frame': 155 obs. of 16 variables:
## $ layer.1 : num 1897 1809 1865 1878 1737 ...
## $ layer.2 : num 1784 1695 1755 1772 1633 ...
## $ layer.3 : num 735 680 621 561 402 ...
## $ layer.4 : num 468 412 362 330 179 ...
## $ layer.5 : num 804 730 721 699 544 ...
## $ layer.6 : num 685 612 601 582 431 ...
## $ layer.7 : num 268 283 126 0 160 ...
## $ layer.8 : num 369 330 233 160 0 ...
## $ layer.9 : num 268 226 160 170 126 ...
## $ layer.10: num 243 160 226 305 283 ...
## $ layer.11: num 369 283 344 400 330 ...
## $ layer.12: num 144 160 0 126 233 ...
## $ layer.13: num 734 645 738 794 700 ...
## $ layer.14: num 1531 1442 1537 1585 1471 ...
## $ layer.15: num 0 89.4 144.2 268.3 368.8 ...
## $ layer.16: num 89.4 0 160 282.8 344.1 ...
meuse@data <- cbind(meuse@data, buffer.dists)
Show how far point 1 is from the nearest point of each quantile:
buffer.dists[1,]
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 1897.367 1783.928 735.3911 468.188 803.99 684.6897 268.3282 368.7818
## layer.9 layer.10 layer.11 layer.12 layer.13 layer.14 layer.15 layer.16
## 1 268.3282 243.3105 368.7818 144.2221 734.3024 1531.013 0 89.44272
Of course it is zero distance to its own quantile, in this case 15.
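A quick check that the zero-distance layer indeed corresponds to the point's own quantile class (a sketch):
# quantile class of point 1, and the layer at zero distance from it
classes.lzn.q[1]
which(buffer.dists[1, ] == 0)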
First make a formula with all the quantiles as predictors:
dn <- paste(names(grid.dist.lzn), collapse="+")
(fm <- as.formula(paste("logZn ~", dn)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16
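An equivalent, slightly more compact way to build this formula is with base R's reformulate; fm.alt here is just an illustrative name:
# same formula, built without string pasting
fm.alt <- reformulate(names(grid.dist.lzn), response = "logZn")
fm.alt   # same formula as fm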
Now compute the random forest model and use it to predict at the observation points.
(rf <- randomForest(fm , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.02514371
## % Var explained: 74.25
pred.rf <- predict(rf, newdata=meuse@data)
plot(rf)
Q: How many trees were needed to make a reliable forest?
Variable importance:
importance(rf, type=1)
## %IncMSE
## layer.1 15.125917
## layer.2 14.959521
## layer.3 13.808773
## layer.4 11.403248
## layer.5 11.637966
## layer.6 17.236908
## layer.7 7.091440
## layer.8 10.915838
## layer.9 9.779422
## layer.10 10.792435
## layer.11 6.995092
## layer.12 17.543799
## layer.13 11.360035
## layer.14 13.343180
## layer.15 18.142337
## layer.16 23.833595
varImpPlot(rf, type=1)
Q: Which distance buffer was most important? Explain why.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf <- sqrt(sum((pred.rf-meuse$logZn)^2)/length(pred.rf)))
## [1] 0.06472746
#
oob.rf <- predict(rf)
plot(meuse$logZn ~ oob.rf, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob <- sqrt(sum((oob.rf-meuse$logZn)^2)/length(oob.rf)))
## [1] 0.1585677
Q: What are the goodness-of-fit and Out-of-bag RMSE?
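The OOB RMSE can also be converted to a pseudo-\(R^2\), which should closely match the ‘% Var explained’ reported by the forest (a sketch):
# 1 - (OOB sum of squared errors / total sum of squares)
1 - sum((oob.rf - meuse$logZn)^2) / sum((meuse$logZn - mean(meuse$logZn))^2)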
pred.grid <- predict(rf, newdata=grid.dist.lzn@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.111 2.316 2.437 2.489 2.650 3.169
meuse.grid.sp$logZn.rf <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p1 <- spplot(meuse.grid.sp, zcol="logZn.rf", main="Zn, log(mg kg-1)", sub="RF, 16 distance buffers", at=breaks)
print(p1)
Compare this with OK cross-validation. First compute the ordinary variogram and fit a model:
require(gstat)
v <- variogram(logZn ~ 1, meuse)
(vmf <- fit.variogram(v, model=vgm(0.1, "Sph", 1000, 0.02)))
## model psill range
## 1 Nug 0.009554751 0.0000
## 2 Sph 0.111394769 896.9909
plot(v, model=vmf, plot.numbers=T)
Now compute the cross-validation statistics:
k.cv <- krige.cv(logZn ~ 1, meuse, model=vmf, verbose=FALSE)
plot(meuse$logZn ~ k.cv$var1.pred, pch=20, xlab="Cross-validation fit", ylab="Actual value", main="Zn, log(mg kg-1)", asp=1)
abline(0,1); grid()
(rmse.kcv <- sqrt(sum((k.cv$var1.pred-meuse$logZn)^2)/length(k.cv$var1.pred)))
## [1] 0.170157
Q: Which approach gives the lower cross-validation error in this case?
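A side-by-side view of the two error estimates (both estimate prediction error at unvisited locations, one from out-of-bag samples, one from leave-one-out cross-validation):
# RF out-of-bag RMSE vs. OK leave-one-out cross-validation RMSE
data.frame(method = c("RF, 16 distance buffers (OOB)", "Ordinary kriging (LOOCV)"),
           RMSE   = c(rmse.oob, rmse.kcv))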
Compare the kriging map with the spatial random forest map:
k.ok <- krige(logZn ~ 1, meuse, newdata=meuse.grid.sp, model=vmf)
## [using ordinary kriging]
summary(k.ok$var1.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.275 2.420 2.479 2.680 3.231
p2 <- spplot(k.ok, zcol="var1.pred", main="Zn, log(mg kg-1)", sub="Ordinary Kriging", at=breaks)
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
k.ok$diff.srf.ok <- meuse.grid.sp$logZn.rf - k.ok$var1.pred
summary(k.ok$diff.srf.ok)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.237562 -0.040157 0.004313 0.010131 0.048444 0.403515
p3 <- spplot(k.ok, zcol="diff.srf.ok", main="Spatial RF - OK predictions, log(Zn)", col.regions=topo.colors(36))
print(p3)
Q: Comment on the differences/similarities between the Spatial Random Forest map and the Ordinary Kriging map. Note that for Kriging we needed to specify a model, for RF no model was needed.
The concept of quantile buffers is similar to that of the empirical variogram: both assume stationarity of the spatial correlation structure.
Another approach is to use the distance from each known point as a predictor. Obviously this increases the size of the problem. This is the approach taken in the Hengl paper cited above.
Set up the buffer distances and visualize a few of them:
grid.dist.0 <- GSIF::buffer.dist(meuse["logZn"], meuse.grid.sp, as.factor(1:nrow(meuse)))
dim(grid.dist.0)
## [1] 3103 155
summary(grid.dist.0$layer.1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1677 2685 2556 3506 4569
plot(stack(grid.dist.0[1:9]))
Extract the distances to each buffer at each point, and add these as fields to the points. Note that we first have to remove the previous buffer distances (the 16 quantiles).
buffer.dists <- over(meuse, grid.dist.0)
dim(buffer.dists)
## [1] 155 155
# distances from point 1 to all the other points
(buffer.dists[1,1:10])
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 0 89.44272 144.222 268.3282 368.7818 481.6638 268.3282 243.3105
## layer.9 layer.10
## 1 400 468.188
names(meuse@data)
## [1] "cadmium" "copper" "lead" "zinc" "elev" "dist"
## [7] "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [13] "logZn" "logCu" "logPb" "logCd" "layer.1" "layer.2"
## [19] "layer.3" "layer.4" "layer.5" "layer.6" "layer.7" "layer.8"
## [25] "layer.9" "layer.10" "layer.11" "layer.12" "layer.13" "layer.14"
## [31] "layer.15" "layer.16"
meuse@data <- meuse@data[,1:16]
meuse@data <- cbind(meuse@data, buffer.dists)
So now we have 155 buffer distances. Set up a formula for the RF and build the RF. Note that the default mtry for regression would be 51, i.e., the distances to about one-third of the points; we increase this to about two-thirds (mtry = 102).
dn0 <- paste(names(grid.dist.0), collapse="+")
fm0 <- as.formula(paste("logZn ~ ", dn0))
(rf <- randomForest(fm0 , meuse@data, importance=TRUE,
mtry=102, min.split=6, ntree=640))
##
## Call:
## randomForest(formula = fm0, data = meuse@data, importance = TRUE, mtry = 102, min.split = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 16
##
## Mean of squared residuals: 0.09413569
## % Var explained: 3.6
This does not perform as well as using the quantile buffers: the \(R^2\) is noticeably lower.
Mapping:
pred.grid <- predict(rf, newdata=grid.dist.0@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.299 2.477 2.596 2.590 2.697 2.897
meuse.grid.sp$logZn.rf0 <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p2 <- spplot(meuse.grid.sp, zcol="logZn.rf0", main="Zn, log(mg kg-1)",
sub="RF distance all points", at=breaks)
# compare with 16-quantile buffers
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
A clearly inferior map. So the distances to individual points are not what matters; the useful information is in the distances to quantiles of the target variable.
Random forests can also be used to predict classes. As with RF for continuous variables, the forest is made up of many trees. The prediction is the class predicted by the majority of trees.
m.ffreq.rf <- randomForest(ffreq ~ x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.ffreq.rf)
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.77%
## Confusion matrix:
## 1 2 3 class.error
## 1 77 7 0 0.08333333
## 2 4 39 5 0.18750000
## 3 1 9 13 0.43478261
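To see the majority vote explicitly, we can ask for the prediction of every individual tree and tally the votes ourselves; a sketch using the predict.all argument of predict.randomForest:
# per-tree votes and the majority-vote class for the first observation
votes <- predict(m.ffreq.rf, meuse, predict.all = TRUE)
table(votes$individual[1, ])   # how the 500 trees voted for observation 1
votes$aggregate[1]             # the majority-vote prediction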
importance(m.ffreq.rf)
## 1 2 3 MeanDecreaseAccuracy MeanDecreaseGini
## x 20.09590 9.1360302 8.463421 22.248976 14.404681
## y 35.13394 16.9772733 40.939480 45.641554 31.592005
## dist.m 10.73260 1.9609074 6.635458 10.737786 7.549293
## elev 51.32741 35.3973475 10.994921 55.201272 32.358615
## soil 10.28211 0.2486713 5.214684 9.536508 2.352493
## lime 11.49724 -3.5100701 7.608437 11.107870 2.511667
varImpPlot(m.ffreq.rf)
Here we see the class error rate, i.e., the training samples not predicted into the correct class. In this example class 3 is poorly-predicted, whereas class 1 is well-predicted.
Elevation is by far the most important predictor of accuracy for all classes. The coordinates are next in importance, because of the geographic configuration of the study area – they substitute for distance to the river when that is not included in the model.
However, for the mean decrease in the Gini index (how often a randomly-chosen training observation would be incorrectly assigned if it were randomly labeled according to the frequency distribution of labels in the subset), the y-coordinate becomes almost as important as elevation – perhaps because of the one high value far to the east, at the river bend.
See where the predictors are used:
library(randomForestExplainer) # for minimal-depth distributions
min_depth_frame <- min_depth_distribution(m.ffreq.rf)
plot_min_depth_distribution(min_depth_frame)
Soil type and lime application are not used in more than half the trees. The y-coordinate is used first in most trees, followed by elevation and x-coordinate.
We can see the probabilities of each class by using the model to predict back at the known points:
summary(prob_rf <- predict(m.ffreq.rf, meuse, type = "prob"))
## 1 2 3
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0380 1st Qu.:0.0050 1st Qu.:0.0020
## Median :0.8000 Median :0.0480 Median :0.0060
## Mean :0.5449 Mean :0.3112 Mean :0.1439
## 3rd Qu.:0.9920 3rd Qu.:0.7970 3rd Qu.:0.0600
## Max. :1.0000 Max. :0.9860 Max. :0.9940
# probability of each observation being predicted to class 1
prob_rf[,1]
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.994 1.000 0.996 1.000 1.000 1.000 1.000 0.992 0.994 0.996 1.000 0.922
## 13 14 15 16 17 18 19 20 21 22 23 24
## 0.996 0.974 0.992 0.986 0.998 0.992 0.992 1.000 0.948 0.980 0.978 0.998
## 25 26 27 28 29 30 31 32 33 34 35 37
## 0.998 1.000 1.000 1.000 0.978 0.992 1.000 0.994 0.990 0.998 0.996 0.998
## 38 39 40 41 42 43 44 45 46 47 48 49
## 0.996 0.992 1.000 1.000 0.988 0.986 0.932 0.748 0.936 0.998 0.966 0.780
## 50 51 52 53 54 55 56 57 58 59 60 61
## 0.700 0.780 0.800 0.956 0.942 0.966 0.968 0.982 0.808 0.968 0.982 0.992
## 62 63 64 65 66 67 69 75 76 79 80 81
## 0.988 0.994 0.996 0.988 0.982 0.996 0.960 0.710 0.946 0.998 0.998 0.970
## 82 83 84 85 86 87 88 89 90 123 160 163
## 0.974 0.994 0.978 0.964 0.896 0.794 0.996 0.994 0.998 0.852 0.992 0.996
## 70 71 91 92 93 94 95 96 97 98 99 100
## 0.122 0.082 0.012 0.008 0.012 0.004 0.116 0.024 0.056 0.110 0.010 0.074
## 101 102 103 104 105 106 108 109 110 111 112 113
## 0.022 0.024 0.018 0.002 0.048 0.052 0.044 0.012 0.172 0.092 0.008 0.038
## 114 115 116 117 118 119 120 121 122 124 125 126
## 0.012 0.002 0.038 0.012 0.002 0.014 0.010 0.022 0.038 0.074 0.104 0.046
## 127 128 129 130 131 132 133 134 135 136 161 162
## 0.016 0.106 0.208 0.294 0.076 0.096 0.066 0.146 0.130 0.068 0.082 0.046
## 137 138 140 141 142 143 144 145 146 147 148 149
## 0.288 0.068 0.002 0.000 0.010 0.006 0.016 0.000 0.004 0.010 0.004 0.004
## 150 151 152 153 154 155 156 157 158 159 164
## 0.000 0.152 0.090 0.004 0.000 0.002 0.010 0.002 0.002 0.000 0.130
# an observation in order of probability
sort(prob_rf[1,], decreasing=TRUE)
## 1 2 3
## 0.994 0.004 0.002
order(prob_rf[1,], decreasing=TRUE)
## [1] 1 2 3
# its actual class -- in this case it agrees
meuse$ffreq[1]
## [1] 1
## Levels: 1 2 3
These can be summarized into lists of the most probable, second-most probable, etc. classes:
# list of most likely classes
probs <- apply(prob_rf, 1, order, decreasing=TRUE)
probs[,1]
## [1] 1 2 3
probs[,121]
## [1] 2 1 3
# most likely, 2nd most, 3rd most, for each observation
table(probs.1 <- probs[1,])
##
## 1 2 3
## 84 48 23
# cross-classification of most and 2nd-most probable classes
table(probs.1, probs.2 <- probs[2,])
##
## probs.1 1 2 3
## 1 0 71 13
## 2 25 0 23
## 3 3 20 0
Again we need to restrict the RF model to variables also available in the prediction grid.
(m.ffreq.rf.2 <- randomForest(ffreq ~ x + y + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3))
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist + soil, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 28.39%
## Confusion matrix:
## 1 2 3 class.error
## 1 71 11 2 0.1547619
## 2 14 29 5 0.3958333
## 3 3 9 11 0.5217391
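Before mapping, a quick numeric comparison of the two classification models, using the running out-of-bag error stored in each fitted forest (a sketch; the labels are illustrative):
# final OOB error rate of each model
c(with.elevation  = m.ffreq.rf$err.rate[m.ffreq.rf$ntree, "OOB"],
  grid.predictors = m.ffreq.rf.2$err.rate[m.ffreq.rf.2$ntree, "OOB"])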
This is less accurate than the model with elevation. Nevertheless, we map the predictions:
meuse.grid.sp$predictedffreq.rf <- predict(m.ffreq.rf.2, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedffreq.rf", main="Most probable flood frequency class")
Not a nice-looking map. It disagrees strongly with the actual flooding-frequency map.
summary(meuse.grid.sp$ffreq.tf <-
as.factor((meuse.grid.sp$predictedffreq.rf==
meuse.grid.sp$ffreq)))
## FALSE TRUE
## 1452 1651
spplot(meuse.grid.sp, zcol="ffreq.tf",
main="Accuracy of RF model")
Probability of each flood-frequency class at each grid cell:
meuse.grid.sp$probffreq <-
predict(m.ffreq.rf.2, newdata=meuse.grid, type="prob")
meuse.grid.sp$probffreq.rf.1 <- meuse.grid.sp$probffreq[,1]
meuse.grid.sp$probffreq.rf.2 <- meuse.grid.sp$probffreq[,2]
meuse.grid.sp$probffreq.rf.3 <- meuse.grid.sp$probffreq[,3]
p1 <- spplot(meuse.grid.sp, zcol="probffreq.rf.1",
main="Prob. class 1")
p2 <- spplot(meuse.grid.sp, zcol="probffreq.rf.2",
main="Prob. class 2")
p3 <- spplot(meuse.grid.sp, zcol="probffreq.rf.3",
main="Prob. class 3")
print(p1, split=c(1,1,3,1), more=T)
print(p2, split=c(2,1,3,1), more=T)
print(p3, split=c(3,1,3,1), more=F)
Does adding covariates improve the prediction?
(covar.names <- paste(intersect(names(meuse), names(meuse.grid)), collapse="+"))
## [1] "dist+ffreq+soil"
(fm.c <- as.formula(paste("logZn ~", dn, "+", covar.names)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16 + dist + ffreq +
## soil
Now compute the random forest model including these covariates, and use it to predict at the observation points. First restore the 16 quantile-buffer distances in the point data frame (they were replaced above by the 155 point distances):
# restore the quantile-buffer distances as predictors
meuse@data <- meuse@data[, 1:16]
meuse@data <- cbind(meuse@data, over(meuse, grid.dist.lzn))
(rf.c <- randomForest(fm.c , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm.c, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.0281278
## % Var explained: 71.2
pred.rf.c <- predict(rf.c, newdata=meuse@data)
plot(rf.c)
Display the variable importance:
importance(rf.c, type=1)
## %IncMSE
## layer.1 10.048767
## layer.2 8.720124
## layer.3 8.373037
## layer.4 9.261439
## layer.5 9.984490
## layer.6 11.892754
## layer.7 9.195146
## layer.8 9.635517
## layer.9 9.876280
## layer.10 9.392341
## layer.11 11.452915
## layer.12 11.004939
## layer.13 10.047642
## layer.14 8.762264
## layer.15 8.820727
## layer.16 9.572263
## dist 43.197300
## ffreq 32.726246
## soil 17.135675
varImpPlot(rf.c, type=1)
Q: Are any of the covariables important? Are any more important than the distance buffers? Explain.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf.c, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf.c <- sqrt(sum((pred.rf.c-meuse$logZn)^2)/length(pred.rf.c)))
## [1] 0.07823778
#
oob.rf.c <- predict(rf.c)
plot(meuse$logZn ~ oob.rf.c, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob.c <- sqrt(sum((oob.rf.c-meuse$logZn)^2)/length(oob.rf.c)))
## [1] 0.1677134
Q: What are the goodness-of-fit and Out-of-bag RMSE? How do these compare to the spatial RF without covariates?
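A compact summary of the four RMSE values computed above for the two spatial RF models (a sketch; the row and column labels are illustrative):
# fitted vs. out-of-bag RMSE, with and without covariates
data.frame(model    = c("buffers only", "buffers + covariates"),
           fit.RMSE = c(rmse.rf, rmse.rf.c),
           OOB.RMSE = c(rmse.oob, rmse.oob.c))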
We first have to add the covariates to the prediction grid.
grid.dist.lzn.covar <- grid.dist.lzn
grid.dist.lzn.covar$dist <- meuse.grid$dist
grid.dist.lzn.covar$ffreq <- meuse.grid$ffreq
grid.dist.lzn.covar$soil <- meuse.grid$soil
Now predict over the grid and display:
pred.grid.c <- predict(rf.c, newdata=grid.dist.lzn.covar@data)
summary(pred.grid.c)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.280 2.345 2.422 2.491 2.615 2.919
meuse.grid.sp$logZn.rf.c <- pred.grid.c
breaks <- seq(2, 3.3, by=.05)
p1.c <- spplot(meuse.grid.sp, zcol="logZn.rf.c", main="Zn, log(mg kg-1)", sub="Random forest on distance buffers & covariates", at=breaks)
print(p1.c, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
meuse.grid.sp$logZn.rf.diff <- meuse.grid.sp$logZn.rf.c - meuse.grid.sp$logZn.rf
summary(meuse.grid.sp$logZn.rf.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4434319 -0.0854825 0.0005576 0.0022438 0.0907812 0.4409751
p3.c <- spplot(meuse.grid.sp, zcol="logZn.rf.diff", main="Spatial RF with - without covariates, log(Zn)", col.regions=topo.colors(36))
print(p3.c)
Q: Compare this to the maps made by RF/Spatial Distances without covariates. Where does using the covariates result in higher and lower predicted values? Try to explain why.
Challenge: Map logZn over the grid by Kriging with External Drift (KED), using the same three covariates (dist, ffreq, soil), and compare with the map made by the spatial random forest with covariates. Reminder: the variogram must be computed for the linear-model residuals using these covariates.
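A possible starting point, not a full solution; it assumes gstat is still loaded and that meuse.grid.sp carries the dist, ffreq and soil fields:
# residual variogram: with covariates in the formula, gstat models the regression residuals
vr <- variogram(logZn ~ dist + ffreq + soil, meuse)
(vrmf <- fit.variogram(vr, model = vgm(0.1, "Sph", 1000, 0.02)))
plot(vr, model = vrmf)
# KED: the same covariates appear in the prediction formula
k.ked <- krige(logZn ~ dist + ffreq + soil, meuse, newdata = meuse.grid.sp, model = vrmf)
spplot(k.ked, zcol = "var1.pred", main = "Zn, log(mg kg-1)", sub = "KED, 3 covariates")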
Q: Describe the differences between the residual and ordinary variogram.
Q: Describe the differences between the OK and KED maps.
Q: Compare the KED map with the spatial random forest + covariates map.
Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning data mining, inference, and prediction (2nd ed). New York: Springer. Retrieved from http://link.springer.com/book/10.1007%2F978-0-387-84858-7
Kuhn, M. (2013). Applied Predictive Modeling. New York, NY: Springer New York. Retrieved from https://link.springer.com/openurl?genre=book&isbn=978-1-4614-6848-6
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. New York: Springer. Retrieved from http://link.springer.com/book/10.1007%2F978-1-4614-7138-7
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., … Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37. https://doi.org/10.1007/s10115-007-0114-2