We compare several R packages that build random forests: the original randomForest package; a much faster implementation, ranger; and the spmodel package, which also models the spatial structure of the random forest residuals and interpolates these to improve predictions. We hope the results are similar. Note that due to randomization there will always be differences, although not between runs of this script, because of the use of set.seed.
This document also explores the Quantile Regression Forests option in ranger to assess the uncertainty of predictions, and local measures of variable importance, including Shapley values. We also have some fun with interpretable machine learning… just because we can!
We use as an example the small “Meuse (Maas) River soil pollution” dataset; see help(meuse, package = "sp") for details and references. This dataset is provided with the sp “classes and methods for spatial data: points, lines, polygons and grids” package. Although sp has been superseded by sf, it is still available and includes this famous test dataset. You do not need to load sp, just have it installed, in order to locate the sp::meuse dataset.
data(meuse, package = "sp")
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
We try to model one of the heavy metal concentrations (zinc, Zn) from all possibly relevant predictors:

- ffreq: flooding frequency (3 classes)
- dist.m: distance from the Meuse river (m)
- elev: elevation above mean sea level (m)
- soil: soil type (3 classes)
- lime: whether agricultural lime was applied to the field (yes/no)
- x, y: coordinates in the RDH (Dutch) grid (m), EPSG code 28992

We also include the locations, because this may account for local effects, by building “boxes” around sets of locations.

Make a log10-transformed copy of the Zn metal concentration to obtain a somewhat balanced distribution of the target variable:
meuse$logZn <- log10(meuse$zinc) # zinc (Zn)
hist(meuse$logZn, main = ""); rug(meuse$logZn)
There is a good range of values, in a bimodal distribution: low and high values, with fewer medium values. There is still some right skew towards the highest values.
randomForest
This was (one of?) the first R packages to implement Breiman’s random forest algorithm.1.
Packages used in this section: randomForest for random forests, randomForestExplainer for some diagnostics, and ggplot2 for graphics.
library(ggplot2, warn.conflicts = FALSE, verbose = FALSE)
library(randomForest, warn.conflicts = FALSE, verbose = FALSE)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(randomForestExplainer, warn.conflicts = FALSE, verbose = FALSE)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
First build the forest, using the randomForest function. Use set.seed for reproducibility, so that your results (if you render this script or run this chunk by itself) will be the same as these.

The mtry optional argument gives the number of variables randomly sampled as candidates at each split. By default this is \(\lfloor{p/3}\rfloor\) where \(p\) is the number of predictors, here \(7\), so \(\lfloor{7/3}\rfloor = 2\). We increase this to 3 to get better trees but still include weak predictors; this is matched for ranger (below).
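As a quick check (a sketch, not part of the original analysis), the two package defaults for this dataset’s seven predictors both work out to 2:

```r
p <- 7              # number of predictors in the formula below
floor(p / 3)        # randomForest default mtry for regression
floor(sqrt(p))      # ranger default mtry (discussed later in this document)
```

Both expressions evaluate to 2, so the mtry = 3 choice below is a deliberate increase over either default.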
set.seed(314)
m.lzn.rf <- randomForest(logZn ~ ffreq + dist.m + elev + soil + lime + x + y,
data=meuse,
# eight permutations per tree to estimate importance
importance = TRUE, nperm = 8,
na.action = na.omit, mtry = 3)
print(m.lzn.rf)
##
## Call:
## randomForest(formula = logZn ~ ffreq + dist.m + elev + soil + lime + x + y, data = meuse, importance = TRUE, nperm = 8, mtry = 3, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.01887348
## % Var explained: 80.67
The structure of this object:
str(m.lzn.rf)
## List of 18
## $ call : language randomForest(formula = logZn ~ ffreq + dist.m + elev + soil + lime + x + y, data = meuse, importance = TRUE,| __truncated__
## $ type : chr "regression"
## $ predicted : Named num [1:155] 2.93 2.97 2.65 2.54 2.46 ...
## ..- attr(*, "names")= chr [1:155] "1" "2" "3" "4" ...
## $ mse : num [1:500] 0.0148 0.0158 0.0231 0.0259 0.0227 ...
## $ rsq : num [1:500] 0.849 0.839 0.763 0.735 0.767 ...
## $ oob.times : int [1:155] 180 173 183 185 182 192 181 168 174 168 ...
## $ importance : num [1:7, 1:2] 0.01295 0.07057 0.03233 0.00488 0.00635 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
## .. ..$ : chr [1:2] "%IncMSE" "IncNodePurity"
## $ importanceSD : Named num [1:7] 0.000741 0.001703 0.001127 0.000486 0.000573 ...
## ..- attr(*, "names")= chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
## $ localImportance: NULL
## $ proximity : NULL
## $ ntree : num 500
## $ mtry : num 3
## $ forest :List of 11
## ..$ ndbigtree : int [1:500] 93 101 95 103 97 91 105 103 105 105 ...
## ..$ nodestatus : int [1:119, 1:500] -3 -3 -3 -3 -3 -3 -3 -1 -1 -1 ...
## ..$ leftDaughter : int [1:119, 1:500] 2 4 6 8 10 12 14 0 0 0 ...
## ..$ rightDaughter: int [1:119, 1:500] 3 5 7 9 11 13 15 0 0 0 ...
## ..$ nodepred : num [1:119, 1:500] 2.54 2.92 2.43 2.81 2.96 ...
## ..$ bestvar : int [1:119, 1:500] 3 5 2 6 6 3 2 0 0 0 ...
## ..$ xbestsplit : num [1:119, 1:500] 7.46 1.00 1.45e+02 1.80e+05 1.79e+05 ...
## ..$ ncat : Named int [1:7] 3 1 1 3 2 1 1
## .. ..- attr(*, "names")= chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
## ..$ nrnodes : int 119
## ..$ ntree : num 500
## ..$ xlevels :List of 7
## .. ..$ ffreq : chr [1:3] "1" "2" "3"
## .. ..$ dist.m: num 0
## .. ..$ elev : num 0
## .. ..$ soil : chr [1:3] "1" "2" "3"
## .. ..$ lime : chr [1:2] "0" "1"
## .. ..$ x : num 0
## .. ..$ y : num 0
## $ coefs : NULL
## $ y : Named num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## ..- attr(*, "names")= chr [1:155] "1" "2" "3" "4" ...
## $ test : NULL
## $ inbag : NULL
## $ terms :Classes 'terms', 'formula' language logZn ~ ffreq + dist.m + elev + soil + lime + x + y
## .. ..- attr(*, "variables")= language list(logZn, ffreq, dist.m, elev, soil, lime, x, y)
## .. ..- attr(*, "factors")= int [1:8, 1:7] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:8] "logZn" "ffreq" "dist.m" "elev" ...
## .. .. .. ..$ : chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
## .. ..- attr(*, "term.labels")= chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
## .. ..- attr(*, "order")= int [1:7] 1 1 1 1 1 1 1
## .. ..- attr(*, "intercept")= num 0
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(logZn, ffreq, dist.m, elev, soil, lime, x, y)
## .. ..- attr(*, "dataClasses")= Named chr [1:8] "numeric" "factor" "numeric" "numeric" ...
## .. .. ..- attr(*, "names")= chr [1:8] "logZn" "ffreq" "dist.m" "elev" ...
## - attr(*, "class")= chr [1:2] "randomForest.formula" "randomForest"
How successful was the forest? Compute the RMSE and \(R^2\) of the fits, based on the out-of-bag (OOB) cross-validation. These are estimates of the predictive power of the fitted model.
m.lzn.rf.resid <- (meuse$logZn - m.lzn.rf$predicted)
print(sqrt(mse <- mean(m.lzn.rf.resid^2))) # RMSE
## [1] 0.1373808
print(rsq <- (1 - mse/var(meuse$logZn))) # pseudo-R^2
## [1] 0.8079775
The model is successful overall, with a low root mean squared error (RMSE) of 0.1374 and a high \(R^2=\) 0.808, both computed from the OOB predictions, not the internal fit. Compare the RMSE to the mean of the target variable, 2.5562.
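To put the error in context, a one-line sketch using the values reported above (RMSE 0.1374, target mean 2.5562):

```r
# relative RMSE: OOB error as a fraction of the mean of the target variable
rmse.oob <- 0.1374
mean.logZn <- 2.5562
round(rmse.oob / mean.logZn, 3)
```

So the OOB error is only about 5% of the mean of log10(Zn).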
Display the cross-validation error rate against the number of trees, to see how many trees were needed for a stable result:
plot(m.lzn.rf)
Only about 300 trees were needed in this case.
Examine the variable importance.
There are two kinds. From the help text:
“The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences.”
“The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.”
Here we have a regression, so the Type 2 measure is the total decrease in the residual sum of squares from splitting on the variable, averaged over all trees.
We compare both:
print(data.frame(randomForest::importance(m.lzn.rf, type=1),
randomForest::importance(m.lzn.rf, type=2)))
## X.IncMSE IncNodePurity
## ffreq 17.47817 0.8474457
## dist.m 41.43391 6.1204559
## elev 28.67894 3.4642117
## soil 10.05626 0.7216287
## lime 11.07493 1.2812095
## x 26.36551 1.1783446
## y 17.19509 0.9118874
par(mfrow = c(1, 2))
varImpPlot(m.lzn.rf, type=1, main = "Importance: permutation")
varImpPlot(m.lzn.rf, type=2, main = "Importance: node impurity")
par(mfrow = c(1, 1))
The ranks are similar but not identical. After the first two, dist.m and elev (obviously the most important, since the heavy metal is mostly from flooding of the Meuse/Maas River), the others are not in the same order. There is likely substitution of partially-correlated predictors.
Predict back onto the known points, and evaluate the goodness-of-fit:
p.rf <- predict(m.lzn.rf, newdata=meuse)
summary(r.rpp <- meuse$logZn - p.rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2313655 -0.0402785 -0.0063284 -0.0003102 0.0387612 0.1579461
print(paste("RMSE:", round(rmse.rf <- sqrt(sum(r.rpp^2)/length(r.rpp)), 4)))
## [1] "RMSE: 0.0655"
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Random Forest")
grid(); abline(0,1)
Quite a good internal fit, but this is because we use all the training points to make each prediction.
A more realistic view of the predictive power of the model is the out-of-bag cross-validation. This summarizes the predictions at training points not used in a set of trees – recall that in the random forest each tree uses sampling with replacement, and some training points are not included.
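As a side note (a small simulation, not part of the original analysis), the expected share of training points that are out-of-bag for any one tree is \((1 - 1/n)^n \approx e^{-1} \approx 0.37\); we can confirm this for \(n = 155\):

```r
# average fraction of the 155 training points left out of one bootstrap sample
set.seed(42)
n <- 155
oob.frac <- replicate(1000, {
  inbag <- sample(n, n, replace = TRUE)      # one bootstrap sample
  length(setdiff(seq_len(n), inbag)) / n     # share never drawn
})
round(mean(oob.frac), 2)   # close to exp(-1)
```

So each tree has roughly a third of the training set available for honest evaluation.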
We get the OOB cross-validation predictions with predict and no dataset specified, i.e., the dataset used to build the model is used.
p.rf.oob <- predict(m.lzn.rf)
summary(r.rpp.oob <- meuse$logZn - p.rf.oob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5353648 -0.0853604 -0.0121787 -0.0009662 0.0786027 0.3075081
(rmse.oob <- sqrt(sum(r.rpp.oob^2)/length(r.rpp.oob)))
## [1] 0.1373808
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
xlab="Out-of-bag cross-validation estimates",
ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3),
main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
There is quite a bit more spread than in the internal fit, as expected; the summary statistics are a bit more than doubled.
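Using the two RMSE values reported above (internal fit 0.0655, OOB 0.1374), the ratio is easy to check:

```r
# how much worse is the OOB error than the internal fit?
rmse.fit <- 0.0655
rmse.oob <- 0.1374
round(rmse.oob / rmse.fit, 2)   # roughly double
```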
Because of randomization, each run of the random forest will be different. To evaluate this, compute the RF several times and collect statistics. This also allows a speed comparison with other RF packages.
set.seed(314)
n <- 48 # number of simulations
# data frame to collect results
rf.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
t.rf <- system.time(
for (i in 1:n) {
model.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data=meuse, importance=T, na.action=na.omit, mtry=5)
rf.stats[i, "rsq"] <- median(summary(model.rf$rsq))
rf.stats[i, "mse"] <- median(summary(model.rf$mse))
}
)
summary(rf.stats[,2:3])
## rsq mse
## Min. :0.7922 Min. :0.01839
## 1st Qu.:0.7998 1st Qu.:0.01908
## Median :0.8021 Median :0.01933
## Mean :0.8020 Mean :0.01934
## 3rd Qu.:0.8046 3rd Qu.:0.01955
## Max. :0.8117 Max. :0.02029
hist(rf.stats[,"rsq"], xlab="randomForest R^2", breaks = 16, main = "Frequency of fits (R^2)")
rug(rf.stats[,"rsq"])
hist(rf.stats[,"mse"], xlab="randomForest MSE", breaks = 16, main = "Frequency of OOB accuracy (MSE)")
rug(rf.stats[,"mse"])
This is fairly stable. It would be much less so with subsets (e.g., randomly remove 10% of training points).
ranger
Packages used in this section: ranger for random forests and vip for variable importance. (Note that vip can be used for many model types; see https://koalaverse.github.io/vip/articles/vip.html.)
The mtry optional argument gives the number of variables randomly sampled as candidates at each split. By default for ranger (unlike randomForest) this is \(\lfloor \sqrt{p} \rfloor\) where \(p\) is the number of predictors, here \(7\), so \(\lfloor \sqrt{7} \rfloor = 2\). We increase this to 3 to get better trees but still include weak predictors; this is matched for randomForest (above).

First build the forest, using the ranger function. Ask for the permutation measure of variable importance; with scale.permutation.importance = TRUE, this should match the randomForest concept.

We compute the model two ways, for the two kinds of variable importance. Because we use the same random seed, the forests will be identical, but the importance measures will differ.
Permutation:
library(ranger, warn.conflicts = FALSE, verbose = FALSE)
set.seed(314)
m.lzn.ra <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data = meuse,
importance = 'permutation',
scale.permutation.importance = TRUE,
mtry = 3)
print(m.lzn.ra)
## Ranger result
##
## Call:
## ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse, importance = "permutation", scale.permutation.importance = TRUE, mtry = 3)
##
## Type: Regression
## Number of trees: 500
## Sample size: 155
## Number of independent variables: 7
## Mtry: 3
## Target node size: 5
## Variable importance mode: permutation
## Splitrule: variance
## OOB prediction error (MSE): 0.01861645
## R squared (OOB): 0.8105926
Impurity:
set.seed(314)
m.lzn.ra.i <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data = meuse,
importance = 'impurity',
mtry = 3)
print(m.lzn.ra.i)
## Ranger result
##
## Call:
## ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse, importance = "impurity", mtry = 3)
##
## Type: Regression
## Number of trees: 500
## Sample size: 155
## Number of independent variables: 7
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.01861645
## R squared (OOB): 0.8105926
Predict with the fitted model at all observations. For this we need to specify a data= argument.
p.ra <- predict(m.lzn.ra, data=meuse)
str(p.ra)
## List of 5
## $ predictions : num [1:155] 2.99 3.02 2.75 2.47 2.44 ...
## $ num.trees : num 500
## $ num.independent.variables: num 7
## $ num.samples : int 155
## $ treetype : chr "Regression"
## - attr(*, "class")= chr "ranger.prediction"
summary(r.rap <- meuse$logZn - p.ra$predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2419633 -0.0382925 -0.0055786 -0.0006077 0.0380941 0.1616413
(rmse.ra <- sqrt(sum(r.rap^2)/length(r.rap)))
## [1] 0.06501253
Compare the fitted RMSE with that computed by randomForest:
c(rmse.ra, rmse.rf)
## [1] 0.06501253 0.06554428
There is a slight difference.
Plot the fits:
par(mfrow=c(1,2))
plot(meuse$logZn ~ p.ra$predictions, asp=1, pch=20, xlab="fitted", ylab="log10(Zn)",
xlim=c(2,3.3), ylim=c(2,3.3), main="ranger")
grid(); abline(0,1)
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="log10(Zn)",
xlim=c(2,3.3), ylim=c(2,3.3), main="randomForest")
grid(); abline(0,1)
par(mfrow=c(1,1))
Almost identical, including which points are poorly-predicted.
The default ranger model object already has the OOB predictions stored in it. Compare to the randomForest results.
summary(m.lzn.ra$predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.131 2.319 2.523 2.558 2.749 3.095
summary(p.rf.oob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.132 2.321 2.534 2.557 2.758 3.063
summary(m.lzn.ra$predictions - p.rf.oob) # difference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0389200 -0.0075100 0.0002348 0.0004606 0.0083507 0.0394288
Overall, ranger has a slightly higher median OOB prediction than randomForest.
Plot the OOB cross-validation predictions:
par(mfrow=c(1,2))
plot(meuse$logZn ~ m.lzn.ra$predictions, asp=1, pch=20,
ylab="log10(Zn)", xlab="OOB X-validation estimates",
xlim=c(2,3.3), ylim=c(2,3.3),
main="ranger")
abline(0,1); grid()
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
xlab="OOB X-validation estimates",
ylab="log10(Zn)", xlim=c(2,3.3), ylim=c(2,3.3),
main="randomForest")
grid(); abline(0,1)
par(mfrow=c(1,1))
These are very similar. As with the fits, the same points are poorly-predicted by both models.
First, for permutation, the sums and the individual importances:
sum(ranger::importance(m.lzn.ra))
## [1] 153.2905
sum(randomForest::importance(m.lzn.rf)[,1])
## [1] 152.2828
cbind(ranger = ranger::importance(m.lzn.ra),
rF = randomForest::importance(m.lzn.rf)[,1])
## ranger rF
## ffreq 16.79039 17.47817
## x 25.29951 41.43391
## y 15.34952 28.67894
## dist.m 40.71367 10.05626
## elev 29.94610 11.07493
## soil 10.32160 26.36551
## lime 14.86976 17.19509
Second, for impurity, the sums and the individual importances:
sum(ranger::importance(m.lzn.ra.i))
## [1] 14.59591
sum(randomForest::importance(m.lzn.rf)[,2])
## [1] 14.52518
cbind(ranger =ranger::importance(m.lzn.ra.i),
rF = randomForest::importance(m.lzn.rf)[,2])
## ranger rF
## ffreq 0.7748065 0.8474457
## x 1.2215072 6.1204559
## y 0.9253974 3.4642117
## dist.m 5.7685564 0.7216287
## elev 3.6611503 1.2812095
## soil 0.6664583 1.1783446
## lime 1.5780326 0.9118874
At first sight there seems to be quite some difference between the packages. However, note that cbind combines the two importance vectors by position, not by name: the randomForest values appear in that model’s own variable order (ffreq, dist.m, elev, soil, lime, x, y), while the row names come from ranger’s order. Matched by name, the two packages in fact agree closely: both give by far the most importance to dist.m and elev. The sums of variable importance are only slightly different.
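A toy example (hypothetical vectors a and b, not from the analysis) of how cbind combines named vectors by position, and how indexing by name re-aligns them:

```r
a <- c(ffreq = 1, x = 2, y = 3)
b <- c(ffreq = 10, y = 30, x = 20)   # same names, different order
cbind(a, b)                  # positional: row labels only match column 'a'
cbind(a, b = b[names(a)])    # index by name to align the rows correctly
```

The same trick, e.g. indexing the randomForest importances by the ranger names, makes the two columns directly comparable.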
Compute the ranger
RF several times and collect
statistics, also the timing:
n <- 48
ra.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
t.ra <- system.time(
for (i in 1:n) {
model.ra <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data=meuse, importance="none", mtry=5,
write.forest = FALSE)
ra.stats[i, "mse"] <- model.ra$prediction.error
ra.stats[i, "rsq"] <- model.ra$r.squared
}
)
summary(ra.stats[,2:3])
## rsq mse
## Min. :0.8008 Min. :0.01829
## 1st Qu.:0.8049 1st Qu.:0.01874
## Median :0.8071 Median :0.01896
## Mean :0.8070 Mean :0.01897
## 3rd Qu.:0.8093 3rd Qu.:0.01918
## Max. :0.8139 Max. :0.01958
hist(ra.stats[,"rsq"], xlab="ranger R^2", breaks = 16, main = "Frequency of fits (R^2)")
rug(ra.stats[,"rsq"])
hist(ra.stats[,"mse"], xlab="ranger MSE", breaks = 16, main = "Frequency of OOB accuracy (MSE)")
rug(ra.stats[,"mse"])
Again, not a large variation.
Compare the timings:
print(t.rf)
## user system elapsed
## 6.127 0.081 6.278
print(t.ra)
## user system elapsed
## 2.064 0.107 1.123
print(round(t.rf/t.ra, 1))
## user system elapsed
## 3.0 0.8 5.6
randomForest is almost 6x slower than ranger in this test.
Compare the sensitivity statistics to those from randomForest:
summary(ra.stats[,2:3])
## rsq mse
## Min. :0.8008 Min. :0.01829
## 1st Qu.:0.8049 1st Qu.:0.01874
## Median :0.8071 Median :0.01896
## Mean :0.8070 Mean :0.01897
## 3rd Qu.:0.8093 3rd Qu.:0.01918
## Max. :0.8139 Max. :0.01958
summary(rf.stats[,2:3])
## rsq mse
## Min. :0.7922 Min. :0.01839
## 1st Qu.:0.7998 1st Qu.:0.01908
## Median :0.8021 Median :0.01933
## Mean :0.8020 Mean :0.01934
## 3rd Qu.:0.8046 3rd Qu.:0.01955
## Max. :0.8117 Max. :0.02029
tmp <- round(data.frame(sd(ra.stats[,2]), sd(rf.stats[,2]), sd(ra.stats[,3]), sd(rf.stats[,3])),6)
names(tmp) <- c("s.d. R^2 ranger", "s.d. R^2 randomForest","s.d. MSE ranger","s.d. MSE randomForest")
print(tmp)
## s.d. R^2 ranger s.d. R^2 randomForest s.d. MSE ranger s.d. MSE randomForest
## 1 0.002899 0.004206 0.000285 0.000411
In this run ranger is slightly less variable on both measures, but the difference is small.
splmRF

Thanks to Michael McManus (EPA) for this. Here we try a spatially-explicit random forest plus local model using the experimental (as of mid-2025) spmodel package, also from the EPA.2

First, make the object spatially-explicit in the sf “Simple Features” data structure. Note that we specify remove = FALSE to keep x and y in the data frame as well, for use in the random forest model.
library(sf, warn.conflicts = FALSE, verbose = FALSE)
## Linking to GEOS 3.13.1, GDAL 3.10.2, PROJ 9.6.0; sf_use_s2() is TRUE
meuse_sf <- st_as_sf(meuse, coords = c("x", "y"), remove = FALSE)
summary(meuse_sf)
## x y cadmium copper
## Min. :178605 Min. :329714 Min. : 0.200 Min. : 14.00
## 1st Qu.:179371 1st Qu.:330762 1st Qu.: 0.800 1st Qu.: 23.00
## Median :179991 Median :331633 Median : 2.100 Median : 31.00
## Mean :180005 Mean :331635 Mean : 3.246 Mean : 40.32
## 3rd Qu.:180630 3rd Qu.:332463 3rd Qu.: 3.850 3rd Qu.: 49.50
## Max. :181390 Max. :333611 Max. :18.100 Max. :128.00
##
## lead zinc elev dist
## Min. : 37.0 Min. : 113.0 Min. : 5.180 Min. :0.00000
## 1st Qu.: 72.5 1st Qu.: 198.0 1st Qu.: 7.546 1st Qu.:0.07569
## Median :123.0 Median : 326.0 Median : 8.180 Median :0.21184
## Mean :153.4 Mean : 469.7 Mean : 8.165 Mean :0.24002
## 3rd Qu.:207.0 3rd Qu.: 674.5 3rd Qu.: 8.955 3rd Qu.:0.36407
## Max. :654.0 Max. :1839.0 Max. :10.520 Max. :0.88039
##
## om ffreq soil lime landuse dist.m
## Min. : 1.000 1:84 1:97 0:111 W :50 Min. : 10.0
## 1st Qu.: 5.300 2:48 2:46 1: 44 Ah :39 1st Qu.: 80.0
## Median : 6.900 3:23 3:12 Am :22 Median : 270.0
## Mean : 7.478 Fw :10 Mean : 290.3
## 3rd Qu.: 9.000 Ab : 8 3rd Qu.: 450.0
## Max. :17.000 (Other):25 Max. :1000.0
## NA's :2 NA's : 1
## logZn geometry
## Min. :2.053 POINT :155
## 1st Qu.:2.297 epsg:NA: 0
## Median :2.513
## Mean :2.556
## 3rd Qu.:2.829
## Max. :3.265
##
Plot the target variable.
ggplot(data = meuse_sf) +
geom_sf(aes(size = logZn, col = logZn), pch=1)
The spmodel package includes splmRF for random forest spatial residual models, using a random forest to fit the mean and a spatial linear model (e.g., kriging) to fit the residuals.

“A random forest model is fit to the mean portion of the model specified by formula using ranger::ranger(). Residuals are computed and used as the response variable in an intercept-only spatial linear model fit using splm().”
So in this case we can use the coordinates twice: (1) in the random forest model as with the other methods, to make “boxes”, and (2) for spatial interpolation of the residuals from the random forest fit.
To compare with the other RF methods, we set mtry = 3. The spatial covariance is fit with an exponential function, a usual conservative choice.
set.seed(314)
library(spmodel, warn.conflicts = FALSE, verbose = FALSE)
m.lzn.splmRF <- splmRF(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data = meuse_sf,
spcov_type = "exponential",
importance = 'permutation',
scale.permutation.importance = TRUE,
mtry = 3)
summary(m.lzn.splmRF)
## Length Class Mode
## call 7 -none- call
## ranger 17 ranger list
## splm 36 splm list
## newdata 0 -none- NULL
The fitted model object has the call, the results from calling ranger to fit the random forest, and the parameters of the fitted residual model, i.e., the spatial covariance structure.
print(m.lzn.splmRF$call)
## splmRF(formula = logZn ~ ffreq + x + y + dist.m + elev + soil +
## lime, data = meuse_sf, spcov_type = "exponential", importance = "permutation",
## scale.permutation.importance = TRUE, mtry = 3)
print(m.lzn.splmRF$ranger)
## Ranger result
##
## Call:
## NA
##
## Type: Regression
## Number of trees: 500
## Sample size: 155
## Number of independent variables: 7
## Mtry: 3
## Target node size: 5
## Variable importance mode: permutation
## Splitrule: variance
## OOB prediction error (MSE): 0.01861645
## R squared (OOB): 0.8105926
print(m.lzn.ra)
## Ranger result
##
## Call:
## ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse, importance = "permutation", scale.permutation.importance = TRUE, mtry = 3)
##
## Type: Regression
## Number of trees: 500
## Sample size: 155
## Number of independent variables: 7
## Mtry: 3
## Target node size: 5
## Variable importance mode: permutation
## Splitrule: variance
## OOB prediction error (MSE): 0.01861645
## R squared (OOB): 0.8105926
print(m.lzn.splmRF$splm)
##
## Call:
## NA
##
##
## Coefficients (fixed):
## (Intercept)
## -0.004817
##
##
## Coefficients (exponential spatial covariance):
## de ie range
## 0.011188 0.007789 68.068288
The ranger portion of the model is identical to calling ranger directly, because of the same model form, parameters, and seed.

There is residual spatial correlation, with an effective range of about 204 meters. This is the range coefficient of the exponential model multiplied by 3 to reach 95% of the total sill, as is conventional.
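The arithmetic, with the range coefficient read from the splm output above:

```r
# effective range of an exponential covariance: about 3x the range parameter
range.par <- 68.068   # fitted range coefficient, from the output above
round(3 * range.par)  # effective range in metres
```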
The coefficient de is the dependent error variance, and ie the independent error variance. These correspond to the partial sill and nugget variance in a variogram model. So the nugget-to-total-sill ratio is ie/(ie+de):
round(m.lzn.splmRF$splm$coefficients$spcov["ie"]/(m.lzn.splmRF$splm$coefficients$spcov["ie"]+m.lzn.splmRF$splm$coefficients$spcov["de"]),2)
## ie
## 0.41
This is a high nugget, around 40% of the residual variance after the random forest model.
Compare the variable importance to ranger
:
sort(m.lzn.splmRF$ranger$variable.importance, decreasing = TRUE)
## dist.m elev x ffreq y lime soil
## 40.71367 29.94610 25.29951 16.79039 15.34952 14.86976 10.32160
sort(m.lzn.ra$variable.importance, decreasing = TRUE)
## dist.m elev x ffreq y lime soil
## 40.71367 29.94610 25.29951 16.79039 15.34952 14.86976 10.32160
These are identical, because of the same random seed.
Out-of-bag predictions from the random forest, compared to ranger:
summary(m.lzn.splmRF$ranger$predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.131 2.319 2.523 2.558 2.749 3.095
summary(m.lzn.ra$predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.131 2.319 2.523 2.558 2.749 3.095
Again, identical.
Compare model performance:
print(m.lzn.splmRF$ranger$r.squared)
## [1] 0.8105926
print(m.lzn.ra$r.squared)
## [1] 0.8105926
Again, identical.
Compare the predictions at the training points:
summary(p.splmRF <- predict(m.lzn.splmRF, newdata = meuse_sf))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.017 2.287 2.513 2.555 2.823 3.264
summary(p.ra.p <- predict(m.lzn.ra, data = meuse)$predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.090 2.307 2.523 2.557 2.795 3.153
summary(p.rf.p <- predict(m.lzn.rf, newdata = meuse))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.094 2.306 2.529 2.556 2.798 3.140
The predictions from randomForest and ranger are very close but not identical. Including the interpolation of residuals with splmRF has widened the range of predictions and slightly lowered many of them. We can see the extremes by adding a rug plot to the histograms:
hist(p.splmRF, main = "splmRF", xlab = "log10(Zn)")
rug(p.splmRF)
hist(p.ra.p, main = "ranger", xlab = "log10(Zn)")
rug(p.ra.p)
hist(p.rf.p, main = "randomForest", xlab = "log10(Zn)")
rug(p.rf.p)
ranger random forest residuals

We can try to duplicate splmRF by modelling the spatial structure of the ranger residuals with a variogram model and then interpolating these to modify the ranger predictions. We use the gstat package for this.
library(gstat, warn.conflicts = FALSE, verbose = FALSE)
Compute the ranger residuals and their empirical variogram, using the default cutoff of 1/3 of the diagonal of the bounding box, and 15 bins:
summary(meuse_sf$r.ra <- (meuse_sf$logZn - p.ra.p))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2419633 -0.0382925 -0.0055786 -0.0006077 0.0380941 0.1616413
v <- variogram(r.ra ~ 1, data = meuse_sf)
plot(v, plot.numbers = TRUE)
This is a somewhat erratic variogram, due to the small number of points, but shows some structure. The cutoff seems a bit too far, with large fluctuations around an apparent sill. So re-compute with a shorter cutoff. Again accept the default 15 bins.
v <- variogram(r.ra ~ 1, data = meuse_sf, cutoff = 1200)
plot(v, plot.numbers = TRUE)
Model this with an exponential model (as used by splmRF in our example) and fit the model with gstat::fit.variogram using its default ad-hoc weighted least-squares optimization:
(vm <- fit.variogram(v, vgm(psill = 0.002, "Exp", range = 800/3, nugget = 0.003)))
## model psill range
## 1 Nug 0.002985382 0.0000
## 2 Exp 0.002356873 550.5001
plot(v, plot.numbers = TRUE, model = vm)
vm[1,"psill"]/(sum(vm[,"psill"]))
## [1] 0.5588243
This has a nugget-to-total-sill ratio of 0.56, somewhat higher than estimated by splmRF. The effective range is 1652 m, quite a bit longer than that estimated by splmRF.

There are other ways to fit an exponential covariance to the residuals; clearly splmRF is using one of these.
Krige the residuals to the observation points by LOOCV (otherwise the predictions would be identical to the actual values, since kriging is an exact interpolator):
kcv <- krige.cv(r.ra ~ 1, nfold = nrow(meuse_sf),
locations = meuse_sf, model = vm)
head(kcv)
## Simple feature collection with 6 features and 6 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 181025 ymin: 333260 xmax: 181390 ymax: 333611
## CRS: NA
## var1.pred var1.var observed residual zscore fold
## 1 0.0171924582 0.004055268 0.01954183 0.002349371 0.03689281 1
## 2 0.0116623374 0.003984250 0.04078331 0.029120971 0.46135216 2
## 3 0.0022227688 0.003935970 0.05955231 0.057329540 0.91380296 3
## 4 0.0149304649 0.004080820 -0.05534256 -0.070273027 -1.10005641 4
## 5 0.0002684245 0.003896392 -0.01362113 -0.013889559 -0.22251388 5
## 6 -0.0098630929 0.004109839 0.02410763 0.033970722 0.52989811 6
## geometry
## 1 POINT (181072 333611)
## 2 POINT (181025 333558)
## 3 POINT (181165 333537)
## 4 POINT (181298 333484)
## 5 POINT (181307 333330)
## 6 POINT (181390 333260)
summary(kcv$var1.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0463394 -0.0108523 -0.0002145 -0.0005191 0.0126427 0.0488344
summary(meuse_sf$r.ra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2419633 -0.0382925 -0.0055786 -0.0006077 0.0380941 0.1616413
Kriging has greatly smoothed out the residuals.
Add these to the ranger predictions, and compare to splmRF:
summary(p.ra$predictions + kcv$var1.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.073 2.301 2.523 2.556 2.811 3.181
summary(p.splmRF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.017 2.287 2.513 2.555 2.823 3.264
These are not the same.
Try with an exponential variogram model identical to that fit by splmRF:
(coef <- m.lzn.splmRF$splm$coefficients$spcov[1:3])
## de ie range
## 0.011187869 0.007789315 68.068287715
(vm.splmRF <- vgm(psill = coef["de"],
model = "Exp",
range = coef["range"],
nugget = coef["ie"]))
## model psill range
## 1 Nug 0.007789315 0.00000
## 2 Exp 0.011187869 68.06829
kcv <- krige.cv(r.ra ~ 1, nfold = nrow(meuse_sf),
locations = meuse_sf, model = vm.splmRF)
head(kcv)
## Simple feature collection with 6 features and 6 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 181025 ymin: 333260 xmax: 181390 ymax: 333611
## CRS: NA
## var1.pred var1.var observed residual zscore fold
## 1 0.0117124744 0.01809020 0.01954183 0.007829355 0.05821090 1
## 2 0.0051476734 0.01816430 0.04078331 0.035635635 0.26440832 2
## 3 -0.0010341731 0.01867348 0.05955231 0.060586482 0.44336668 3
## 4 0.0032282031 0.01889692 -0.05534256 -0.058570765 -0.42607439 4
## 5 -0.0007943765 0.01861821 -0.01362113 -0.012826758 -0.09400433 5
## 6 -0.0048906983 0.01879490 0.02410763 0.028998327 0.21152075 6
## geometry
## 1 POINT (181072 333611)
## 2 POINT (181025 333558)
## 3 POINT (181165 333537)
## 4 POINT (181298 333484)
## 5 POINT (181307 333330)
## 6 POINT (181390 333260)
summary(kcv$var1.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0328518 -0.0056037 -0.0016030 -0.0004472 0.0049743 0.0326940
summary(meuse_sf$r.ra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2419633 -0.0382925 -0.0055786 -0.0006077 0.0380941 0.1616413
These values of the ranger residuals are not identical. Why? Maybe splmRF kriges to the points and so gets the exact ranger residual?
summary(p.ra$predictions + meuse_sf$r.ra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.053 2.297 2.513 2.556 2.829 3.265
summary(p.splmRF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.017 2.287 2.513 2.555 2.823 3.264
No, that’s not the reason, although the predictions are close:
summary((p.ra$predictions + meuse_sf$r.ra)-p.splmRF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0502105 -0.0107659 0.0022561 0.0008191 0.0117497 0.0584474
Maybe there is a difference in the radius used for the kriging step?
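One way to check would be to restrict the kriging neighbourhood: gstat’s krige.cv accepts nmax and maxdist arguments. A sketch, using the objects computed above (the 500 m search radius is an arbitrary choice for illustration):

```r
# LOOCV kriging of the ranger residuals, using only neighbours
# within 500 m of each prediction point (maxdist chosen arbitrarily)
kcv.local <- krige.cv(r.ra ~ 1, nfold = nrow(meuse_sf),
                      locations = meuse_sf, model = vm.splmRF,
                      maxdist = 500)
summary(kcv.local$var1.pred)
```

If the neighbourhood matters, this summary should differ from the global-neighbourhood results above.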
This is an extension of ranger which produces estimates of each quantile of the predictive distribution for each prediction. The method is explained by Meinshausen.3
To compute these, use ranger and specify quantreg = TRUE. Also keep the in-bag fits to use in the quantile predictions, with keep.inbag = TRUE.
set.seed(314)
m.lzn.qrf <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data = meuse,
quantreg = TRUE,
importance = 'permutation',
keep.inbag=TRUE, # needed for QRF
scale.permutation.importance = TRUE,
mtry = 3)
pred.qrf <- predict(m.lzn.qrf, type = "quantiles",
# default is c(0.1, 0.5, 0.9)
quantiles = c(0.1, 0.25, 0.5, 0.75, 0.9))
summary(pred.qrf$predictions)
## quantile= 0.1 quantile= 0.25 quantile= 0.5 quantile= 0.75
## Min. :2.053 Min. :2.076 Min. :2.111 Min. :2.134
## 1st Qu.:2.190 1st Qu.:2.262 1st Qu.:2.304 1st Qu.:2.352
## Median :2.276 Median :2.350 Median :2.513 Median :2.676
## Mean :2.371 Mean :2.453 Mean :2.551 Mean :2.656
## 3rd Qu.:2.513 3rd Qu.:2.702 3rd Qu.:2.783 3rd Qu.:2.887
## Max. :2.910 Max. :2.958 Max. :3.184 Max. :3.190
## quantile= 0.9
## Min. :2.182
## 1st Qu.:2.486
## Median :2.755
## Mean :2.749
## 3rd Qu.:2.964
## Max. :3.265
This shows the range of predictions from the set of trees. There is quite some variability between quantiles. For example, here is the Q9-Q1 range:
summary(pred.qrf$predictions[, "quantile= 0.9"] - pred.qrf$predictions[, "quantile= 0.1"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07311 0.27053 0.35717 0.37846 0.48179 0.73920
Most are on the order of 0.3 - 0.5 log10(Zn).
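Because the target is log10(Zn), an interval width w in log units is a multiplicative factor of 10^w on the original Zn concentration scale: the 0.9 quantile is 10^w times the 0.1 quantile. For the quartiles of the widths just shown (rounded for illustration):

```r
# ratio of the 0.9 to the 0.1 quantile on the original Zn scale,
# for Q1-Q9 widths of 0.27, 0.36 and 0.48 log10 units
round(10^c(0.27, 0.36, 0.48), 1)
# 1.9 2.3 3.0
```

So for a typical point the 0.9 quantile of predicted Zn is roughly two to three times the 0.1 quantile.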
The 0.5 (median) quantile can be compared to the single prediction from ranger, although the QRF prediction is a median, not a mean:
summary(pred.qrf$predictions[, "quantile= 0.5"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.111 2.304 2.513 2.551 2.783 3.184
summary(p.ra.p)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.090 2.307 2.523 2.557 2.795 3.153
The means and quartiles of the two sets of predictions are close but not identical.
Both randomForest and ranger provide local measures of importance, i.e., how much each variable influences an individual prediction.
randomForest
“The ‘local’ (or casewise) variable importance for a random regression forest is the average increase in squared OOB residuals when the variable is permuted, for each observation separately.”
This is computed if localImp = TRUE in the call to randomForest.
set.seed(314)
m.lzn.rf.l <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data=meuse,
localImp = TRUE, nperm = 3,
na.action = na.omit, mtry = 3)
dim(m.lzn.rf.l$localImportance)
## [1] 7 155
row.names(m.lzn.rf.l$localImportance)
## [1] "ffreq" "x" "y" "dist.m" "elev" "soil" "lime"
There is one importance measure for each variable for each observation.
View the local importance for the first few observation points:
m.lzn.rf.l$localImportance[, 1:6]
## 1 2 3 4 5
## ffreq 0.025223725 0.0311664580 0.007239915 0.001172955 0.007573898
## x 0.003638953 -0.0005331196 0.000884300 0.023488953 0.020065479
## y 0.025311191 0.0087814013 0.014651997 0.010936406 0.010148829
## dist.m 0.152338528 0.1684531194 0.042116343 0.031325391 0.035634740
## elev 0.023182941 0.0330063161 0.003759453 0.011066830 0.003895781
## soil 0.012211841 0.0155922172 0.009372301 0.005682348 0.015348364
## lime 0.035701605 0.0443260269 -0.006084308 0.013544121 0.016970023
## 6
## ffreq 0.006609679
## x 0.015141041
## y 0.012441277
## dist.m 0.063966507
## elev 0.016467674
## soil 0.008685162
## lime 0.010301643
We see a large difference in the importance of dist.m: very important for points 1 and 2, and much less so for points 3–6.
meuse[1:6, "dist.m"]
## [1] 50 30 150 270 380 470
These two points are much closer to the river.
ranger
For this package we use the local.importance option: “Calculate and return local importance values as in (Breiman 2001)4. Only applicable if importance is set to ‘permutation’”.
This is the same permutation measure as for the overall variable importance, but computed for each observation.
Refit the model with this option, using the same random seed so that the overall model fit and importance match the first section above. The model fit will be identical.
set.seed(314)
m.lzn.ra.l <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
data = meuse,
importance = 'permutation',
local.importance = TRUE,
scale.permutation.importance = TRUE,
mtry = 3)
names(m.lzn.ra.l)
## [1] "predictions" "num.trees"
## [3] "num.independent.variables" "mtry"
## [5] "min.node.size" "variable.importance"
## [7] "variable.importance.local" "prediction.error"
## [9] "forest" "splitrule"
## [11] "treetype" "r.squared"
## [13] "call" "importance.mode"
## [15] "num.samples" "replace"
## [17] "dependent.variable.name" "max.depth"
dim(m.lzn.ra.l$variable.importance.local)
## [1] 155 7
The variable.importance.local field in the fitted model results has the importance of each variable for each of the predictions.
Show the local importances for the first few observation points, compared to the global importance. Both measure the increase in OOB RMSE when the predictor is permuted.
m.lzn.ra.l$variable.importance.local[1:6,]
## ffreq x y dist.m elev soil
## 1 0.011068076 0.005452759 0.008899231 0.05850423 0.0099005893 0.003129961
## 2 0.015304762 0.001705302 0.004220736 0.05917265 0.0142170662 0.008075754
## 3 0.004277926 0.001818228 0.015349669 0.01411915 0.0017249558 0.001603407
## 4 0.001182663 0.004961601 0.003628138 0.01120759 -0.0006275804 0.005303960
## 5 0.002316050 0.005549051 0.003685398 0.01970707 0.0028770418 0.005360852
## 6 0.001043971 0.003847687 0.002211735 0.01152574 0.0035831139 0.001049993
## lime
## 1 0.013524257
## 2 0.020399560
## 3 -0.002937614
## 4 0.001612304
## 5 0.008106699
## 6 0.005264946
ranger::importance(m.lzn.ra.l)
## ffreq x y dist.m elev soil
## 0.012904514 0.013116775 0.009516604 0.065130013 0.031742157 0.005236516
## lime
## 0.007461143
We can see quite a range of importances for the predictors, depending on the point. As with randomForest, the predictor dist.m is much more important for points 1 and 2 than for 3–6.
Show the range of local variable importance for each predictor:
summary(m.lzn.ra.l$variable.importance.local[,1:7])
## ffreq x y
## Min. :-0.0351568 Min. :-0.014961 Min. :-0.015054
## 1st Qu.: 0.0009273 1st Qu.: 0.001846 1st Qu.: 0.001168
## Median : 0.0033483 Median : 0.004529 Median : 0.003061
## Mean : 0.0047760 Mean : 0.004835 Mean : 0.003528
## 3rd Qu.: 0.0076180 3rd Qu.: 0.007396 3rd Qu.: 0.005084
## Max. : 0.0311604 Max. : 0.029015 Max. : 0.022957
## dist.m elev soil
## Min. :-0.029625 Min. :-0.015450 Min. :-0.0096089
## 1st Qu.: 0.008341 1st Qu.: 0.004190 1st Qu.: 0.0001734
## Median : 0.020808 Median : 0.009619 Median : 0.0014919
## Mean : 0.023971 Mean : 0.011747 Mean : 0.0019462
## 3rd Qu.: 0.034208 3rd Qu.: 0.017547 3rd Qu.: 0.0031633
## Max. : 0.163139 Max. : 0.090050 Max. : 0.0098893
## lime
## Min. :-0.0438236
## 1st Qu.: 0.0005832
## Median : 0.0019866
## Mean : 0.0028013
## 3rd Qu.: 0.0044497
## Max. : 0.0478122
There is a large range in the importance, even for dist.m and elev. Notice that the overall importance of these is between the 3rd quartile and the maximum, i.e., they are quite influential in fewer than 1/4 of cases.
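To quantify this, we can compute for each predictor the proportion of observations whose local importance exceeds the overall (global) importance; a quick check on the fitted objects above:

```r
glob <- ranger::importance(m.lzn.ra.l)
loc <- m.lzn.ra.l$variable.importance.local
# proportion of points whose local importance exceeds the global value
round(colMeans(sweep(loc[, names(glob)], 2, glob, ">")), 2)
```

For dist.m and elev we expect this proportion to be under about 1/4, consistent with the summaries above.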
What is the relation of the importance with the predictor itself? Show a scatterplot of the importance for dist.m, with a horizontal line at the overall importance.
plot(meuse$dist.m,
     m.lzn.ra.l$variable.importance.local[,"dist.m"],
     xlab = "distance to river (m)", ylab = "importance of distance to river",
     pch = 20)
abline(h = ranger::importance(m.lzn.ra.l)["dist.m"], col = "red")
Quite important for the points near the river, but not for those further away.
What about for elev?
plot(meuse$elev,
     m.lzn.ra.l$variable.importance.local[,"elev"],
     xlab = "elevation (m.a.s.l.)", ylab = "importance of elevation",
     pch = 20)
abline(h = ranger::importance(m.lzn.ra.l)["elev"], col = "red")
A weak relation, or none at all.
What about flood frequency class?
plot(meuse$ffreq,
m.lzn.ra.l$variable.importance.local[,"ffreq"],
xlab = "ffreq", ylab = "importance of `ffreq`",
pch = 20)
abline(h = ranger::importance(m.lzn.ra.l)["ffreq"], col = "red")
Again, important for a few points near the river.
We would like to look inside the “black box” of the random forest models to see if this gives insights into their predictions. Christoph Molnar has written an online text5 explaining this.
One set of interpretations is provided by the iml package.
library(iml, warn.conflicts = FALSE, verbose = FALSE)
We use this to interpret one of the random forests built above.
iml::Predictor object
“The interpretation methods in the iml package need the machine learning model to be wrapped in a Predictor object.” This “… holds any machine learning model (mlr, caret, randomForest, …) and the data to be used for analyzing the model.”
Create a wrapper function for predict.ranger that matches the format iml expects. This is needed because predict.ranger does not follow the typical predict argument naming convention.
# Define a prediction function compatible with iml
predict_iml <- function(model, newdata, type = "response", ...) {
preds <- predict(model, data = newdata, type = type)$predictions
if (type == "response") {
return(preds)
} else {
return(preds[, 2]) # For probability, not used here
}
}
(The ranger package defines a method called predict.ranger for objects of class ranger, but it does not export a function called predict. Instead, predict.ranger is dispatched from the generic predict automatically when the object is of class ranger. So don’t use ranger::predict, just predict.)
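A quick sanity check that the wrapper returns a plain numeric vector of predictions, which is what iml expects:

```r
# predictions for the first six observations, via the wrapper
p.check <- predict_iml(m.lzn.ra, meuse[1:6, ])
stopifnot(is.numeric(p.check), length(p.check) == 6)
round(p.check, 3)
```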
Now use it for the model fitted by ranger, i.e., make a Predictor object.
Note that the predict.function argument is needed if the model is not from either the mlr or caret packages.
vars <- c("ffreq","x","y","dist.m","elev","soil","lime")
X <- meuse[, vars]
predictor.ra <- iml::Predictor$new(model = m.lzn.ra,
data = X,
y = meuse$logZn,
predict.function = predict_iml)
str(predictor.ra)
## Classes 'Predictor', 'R6' <Predictor>
## Public:
## batch.size: 1000
## class: NULL
## clone: function (deep = FALSE)
## data: Data, R6
## initialize: function (model = NULL, data = NULL, predict.function = NULL,
## model: ranger
## predict: function (newdata)
## prediction.colnames: NULL
## prediction.function: function (newdata)
## print: function ()
## task: unknown
## Private:
## predictionChecked: FALSE
Also make a Predictor object for the ranger QRF model:
predictor.qrf <- iml::Predictor$new(model = m.lzn.qrf,
data = X,
y = meuse$logZn,
predict.function = predict_iml)
Compute the feature importance, based on either mae (mean absolute error) or mse (mean squared error) as the loss. Plot the median and the distribution (5–95% quantiles over all the cases). This uses iml::plot.FeatureImp.
imp.mae <- iml::FeatureImp$new(predictor.ra, loss = "mae")
imp.mse <- iml::FeatureImp$new(predictor.ra, loss = "mse")
plot(imp.mae)
plot(imp.mse)
print(imp.mae)
## Interpretation method: FeatureImp
## error function: mae
##
## Analysed predictor:
## Prediction task: unknown
##
##
## Analysed data:
## Sampling from data.frame with 155 rows and 7 columns.
##
##
## Head of results:
## feature importance.05 importance importance.95 permutation.error
## 1 dist.m 3.545983 3.694893 3.783031 0.18562503
## 2 elev 2.310812 2.614338 2.670165 0.13133980
## 3 x 1.716875 1.735522 1.818265 0.08718962
## 4 ffreq 1.499996 1.519180 1.610322 0.07632099
## 5 y 1.423054 1.505106 1.523621 0.07561394
## 6 lime 1.360322 1.375500 1.441221 0.06910274
print(imp.mse)
## Interpretation method: FeatureImp
## error function: mse
##
## Analysed predictor:
## Prediction task: unknown
##
##
## Analysed data:
## Sampling from data.frame with 155 rows and 7 columns.
##
##
## Head of results:
## feature importance.05 importance importance.95 permutation.error
## 1 dist.m 13.921234 14.877915 15.801162 0.062883429
## 2 elev 5.563535 6.367359 7.097176 0.026912465
## 3 x 2.862070 3.211992 3.338609 0.013575897
## 4 ffreq 2.263704 2.536549 2.857294 0.010721054
## 5 lime 1.903443 2.151855 2.274674 0.009095095
## 6 y 2.062338 2.148588 2.313753 0.009081283
The plots show the range of local effects for each predictor, for the given indicator. Predictor dist.m has quite variable local effects, whereas x and y have similar (and smaller) effects at all points.
Another way to see local effects is with the FeatureEffect object. Three methods are available:
Accumulated Local Effects (ALE): “ALE plots are a faster and unbiased alternative to partial dependence plots (PDPs).”6
Partial Dependence Plots (PDP): the average prediction as a function of the feature, averaged over the values of the other predictors.7
Individual Conditional Expectation (ICE) curves, one per observation: “one line per instance that shows how the instance’s prediction changes when a feature changes”.8 This is a PDP-style curve per observation.
See these with plot.FeatureEffect. Create the object with $new, specifying a method, then plot.
ale <- FeatureEffect$new(predictor.ra, feature = "elev")
ale$plot()
pdp <- FeatureEffect$new(predictor.ra, feature = "elev", method = "pdp")
pdp$plot()
ice <- FeatureEffect$new(predictor.ra, feature = "elev", method = "ice")
ice$plot()
The ICE graph shows some interesting behaviour in the 7–8.5 m elevation range. But in general the individual curves follow the average PDP curve; we don’t have any really unusual observations in terms of their dependence on elevation.
See this for distance:
ice.dist.m <- FeatureEffect$new(predictor.ra, feature = "dist.m", method = "ice")
ice.dist.m$plot()
There is an interesting contrasting behaviour from 0 to 50 m: some with high values at 0 increase substantially at 50, while most decrease, as expected.
“The interaction measure regards how much of the variance of f(x) is explained by the interaction. The measure is between 0 (no interaction) and 1 (= 100% of variance of f(x) due to interactions). For each feature, we measure how much they interact with any other feature.”9
“When features interact with each other in a prediction model, the prediction cannot be expressed as the sum of the feature effects, because the effect of one feature depends on the value of the other feature.”
interact <- Interaction$new(predictor.ra)
plot(interact)
This shows that the predictors interact a fair amount – none are independent. The obvious example is distance and elevation, which we can see in a 2-way interaction plot:
Look at the 2-way interactions with a predictor, e.g., elevation:
interact.elev <- Interaction$new(predictor.ra, feature = "elev")
plot(interact.elev)
This shows strong interactions between elevation and the other predictors, especially dist.m, in predicting \(\log_{10}\mathrm{Zn}\).
This uses the glmnet “Lasso and Elastic-Net Regularized Generalized Linear Models” package, and also the gower “Gower’s Distance” package, based on Gower (1971).10
“The … model fitted by LocalModel is a linear regression model and the data points are weighted by how close they are to the data point for w[h]ich we want to explain the prediction.” Here ‘close’ means in multivariate feature space, hence the use of Gower’s distance.
From ?LocalModel: “A weighted glm is fitted with the machine learning model prediction as target. Data points are weighted by their proximity to the instance to be explained, using the Gower proximity measure. L1-regularization is used to make the results sparse.” (hence the use of glmnet).
So this is a linear regression model, and therefore we can interpret its fitted coefficients to see local effects. However, the model only uses “close-by” points in feature space.
Look at this for the first point, which happens to be close to the river and to have a high level of Zn.
meuse[1,]
## x y cadmium copper lead zinc elev dist om ffreq soil lime
## 1 181072 333611 11.7 85 299 1022 7.909 0.00135803 13.6 1 1 1
## landuse dist.m logZn
## 1 Ah 50 3.009451
lime.explain <- iml::LocalModel$new(predictor.ra, x.interest = X[1, ])
## Loading required package: glmnet
## Loaded glmnet 4.1-8
## Loading required package: gower
lime.explain$results
## beta x.recoded effect x.original feature feature.value
## dist.m -0.0005252513 50.000 -0.02626257 50 dist.m dist.m=50
## elev -0.0469173108 7.909 -0.37106901 7.909 elev elev=7.909
## lime=1 0.0790714563 1.000 0.07907146 1 lime=1 lime=1
plot(lime.explain)
The local model at this point predicts a lower value than the actual prediction based on the full model. The main reason is the elevation of the surrounding points: these differ from the target point but affect the prediction.
Look at this for one of the furthest points from the river:
ix.maxdist <- which.max(meuse$dist.m)
meuse[ix.maxdist, ]
## x y cadmium copper lead zinc elev dist om ffreq soil lime
## 111 180451 331473 0.2 18 50 117 9.095 0.809742 5.3 2 3 0
## landuse dist.m logZn
## 111 Fw 1000 2.068186
lime.explain <- LocalModel$new(predictor.ra, x.interest = X[ix.maxdist, ])
lime.explain$results
## beta x.recoded effect x.original feature feature.value
## dist.m -0.0004749539 1000.000 -0.47495392 1000 dist.m dist.m=1000
## elev -0.0646175574 9.095 -0.58769668 9.095 elev elev=9.095
## lime=0 -0.0772825118 1.000 -0.07728251 0 lime=0 lime=0
plot(lime.explain)
Here both distance and elevation have a strong effect, but the local and full-model predictions are the same.
The local importance metrics explored in the previous sections do not account for the full set of interactions at a data point. The Shapley method corrects this. It is available in several packages, including iml. An excellent intuitive explanation is given by Molnar in his “Interpretable Machine Learning” on-line text.11
“[A] method from coalitional game theory named Shapley value. Assume that for one data point, the feature values play a game together, in which they get the prediction as a payout. The Shapley value tells us how to fairly distribute the payout among the feature values.
“The ‘game’ is the prediction task for a single instance of the dataset. The ‘gain’ is the actual prediction for this instance minus the average prediction for all instances. The ‘players’ are the feature values of the instance that collaborate to receive the gain (= predict a certain value).
“The Shapley value is the average marginal contribution of a feature value across all possible coalitions.
“The Shapley value is the only explanation method with a solid theory. The axioms – efficiency, symmetry, dummy, additivity – give the explanation a reasonable foundation.”12
The Shapley value is a solution for computing feature contributions for single predictions for any machine learning model. It is defined via a value function \({val}\) of players in \(S\). The Shapley value of a feature value is its contribution to the payout, weighted and summed over all possible feature value combinations:
\[\phi_j(val)=\sum_{S\subseteq\{1,\ldots,p\} \backslash \{j\}}\frac{|S|!\left(p-|S|-1\right)!}{p!}\left(val\left(S\cup\{j\}\right)-val(S)\right)\]
where \(S\) is a subset of the features used in the model, \(x\) is the vector of feature values of the instance to be explained and \(p\) the number of features. \(val_x(S)\) is the prediction for feature values in set \(S\) that are marginalized over features that are not included in set \(S\): \[val_{x}(S)=\int\hat{f}(x_{1},\ldots,x_{p})d\mathbb{P}_{x\notin{}S}-E_X(\hat{f}(X))\]
Note that the Shapley values are dependent on the reference dataset which is used to compute the marginal effects. This can be the entire dataset, but also some meaningful subset, in which case we get the importance of the predictors within only that subset.
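To make the formula concrete, here is a toy check with p = 2 features and a known linear model f(x) = 2 x1 + 3 x2; the data and all names here are invented for illustration. For p = 2 the sum over coalitions has only two terms per feature, and for a linear model with independent features the Shapley value of feature j should come out as beta_j (x_j - mean(X_j)):

```r
set.seed(42)
Xtoy <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
f <- function(d) 2 * d$x1 + 3 * d$x2
x0 <- c(x1 = 1.5, x2 = -0.5)   # the instance to explain

# val(S): average prediction with the features in S fixed at x0 and
# the rest marginalized over the data, minus the mean prediction
val <- function(S) {
  d <- Xtoy
  for (s in S) d[[s]] <- x0[[s]]
  mean(f(d)) - mean(f(Xtoy))
}

# exact Shapley values for p = 2: two coalitions per feature,
# each weighted 1/2 by the |S|!(p - |S| - 1)!/p! factor
phi1 <- 0.5 * ((val("x1") - val(character(0))) +
               (val(c("x1", "x2")) - val("x2")))
phi2 <- 0.5 * ((val("x2") - val(character(0))) +
               (val(c("x1", "x2")) - val("x1")))
c(phi1 = phi1, phi2 = phi2)
```

Here phi1 equals 2 * (1.5 - mean(Xtoy$x1)), and by the efficiency axiom phi1 + phi2 equals the prediction at x0 minus the mean prediction.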
Compute and display the Shapley values for the predictive model in predictor.ra, i.e., the fitted ranger random forest model, for the first-listed observation and for the maximum-distance observation.
set.seed(314)
shapley <- iml::Shapley$new(predictor.ra, x.interest = X[1, ])
shapley$plot()
shapley.maxdist <- Shapley$new(predictor.ra, x.interest = X[ix.maxdist, ])
shapley.maxdist$plot()
This figure shows the actual values of each predictor at the observation point, and the \(\phi\) value, i.e., numerical contribution to the difference between actual and average prediction. These sum to the difference.
List the results as a table:
(results <- shapley$results)
## feature phi phi.var feature.value
## 1 ffreq 0.032337223 0.0024784950 ffreq=1
## 2 x 0.005993525 0.0017357917 x=181072
## 3 y 0.023750034 0.0011312287 y=333611
## 4 dist.m 0.204097527 0.0332688137 dist.m=50
## 5 elev 0.025422614 0.0079663934 elev=7.909
## 6 soil 0.015721168 0.0006200223 soil=1
## 7 lime 0.050894354 0.0039251067 lime=1
sum(shapley$results$phi)
## [1] 0.3582164
(results.maxdist <- shapley.maxdist$results)
## feature phi phi.var feature.value
## 1 ffreq -0.049214283 0.0025348477 ffreq=2
## 2 x -0.023238327 0.0050728046 x=180451
## 3 y -0.035464012 0.0011016137 y=331473
## 4 dist.m -0.198119934 0.0320157016 dist.m=1000
## 5 elev -0.094321942 0.0100878821 elev=9.095
## 6 soil -0.044662497 0.0008986534 soil=3
## 7 lime -0.004580032 0.0005639863 lime=0
sum(shapley.maxdist$results$phi)
## [1] -0.449601
For the point near the river the actual value is higher than expected; the difference is the sum of the individual \(\phi\) values. The main positive contribution is from distance: this point is closer to the river than most. For the distant point the actual value is lower than the average prediction, mainly because of distance and elevation: the point is farther from the river and at a higher elevation, and these two predictors have the most influence on the lower actual value.
“Be careful to interpret the Shapley value correctly: The Shapley value is the average contribution of a feature value to the prediction in different coalitions. The Shapley value is NOT the difference in prediction when we would remove the feature from the model.”
Another approach to Shapley values is SHapley Additive exPlanations (SHAP) by Lundberg and Lee (2017)13. Here the Shapley value explanation is represented as an additive feature attribution method, i.e., as a linear model.
The original method for SHAP is KernelSHAP (see next section), but Lundberg et al.14 proposed TreeSHAP, a variant of SHAP for tree-based machine learning models, including random forests. It is a fast, model-specific alternative to the more general KernelSHAP.
One option to compute TreeSHAP is the fastshap package by Brandon Greenwell.15 This is based on the work of Štrumbelj & Kononenko.16
This is an approximation using Monte-Carlo sampling:
\[ \hat{\phi}_{j} = \frac{1}{M}\sum_{m=1}^M\left(\hat{f}(\mathbf{x}^{(m)}_{+j}) - \hat{f}(\mathbf{x}^{(m)}_{-j})\right) \]
where \(\hat{f}(\mathbf{x}^{(m)}_{+j})\) is the prediction for the point \(\mathbf{x}\) of interest, but with a random number of its feature values replaced by the corresponding values from another, randomly drawn data point; the exception is feature \(j\), i.e., the feature for which we want the SHAP value, which keeps its original value. From this is subtracted the same estimate, but with the value of feature \(j\) also taken from the random data point. The difference isolates the effect of feature \(j\) within the original data point.
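A minimal sketch of this Monte-Carlo estimator, applied to a toy linear model so the answer can be checked analytically; the data and names here are invented, and this is not the fastshap implementation itself:

```r
set.seed(42)
Xtoy <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
f <- function(d) 2 * d$x1 + 3 * d$x2
x0 <- c(x1 = 1.5, x2 = -0.5)   # the instance to explain

# Monte-Carlo Shapley estimate for feature j at point x0
mc_shap <- function(f, X, x0, j, M = 2000) {
  p <- length(x0)
  contrib <- numeric(M)
  for (m in 1:M) {
    z <- X[sample(nrow(X), 1), ]           # a random data point
    perm <- sample(names(x0))              # a random feature ordering
    after <- perm[seq_len(p) > match(j, perm)]  # features 'behind' j
    x.plus <- x.minus <- as.data.frame(as.list(x0))
    if (length(after) > 0)
      x.plus[after] <- z[after]            # +j: j keeps its own value
    x.minus[c(j, after)] <- z[c(j, after)] # -j: j also taken from z
    contrib[m] <- f(x.plus) - f(x.minus)
  }
  mean(contrib)
}

mc_shap(f, Xtoy, x0, "x1")
```

The estimate should be close to the exact value 2 * (1.5 - mean(Xtoy$x1)), with the Monte-Carlo error shrinking as M grows.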
fastshap::explain
The key data structure for fastshap is an object of class explain. This is computed by the fastshap::explain function, which requires a prediction function. In this case it is ranger::predict.ranger, which is automatically called from the generic predict when the object is a ranger model.
Note the use of nsim: “To obtain the most accurate results, nsim should be set as large as feasibly possible.”
library(fastshap, warn.conflicts = FALSE, verbose = FALSE)
pfun <- function(object, newdata) {
predict(object, data = newdata)$predictions
}
set.seed(314)
fshap <- fastshap::explain(object = m.lzn.ra,
X = X, pred_wrapper = pfun,
nsim = 48)
class(fshap)
## [1] "explain" "matrix" "array"
Each observation has a set of Shapley values. These are the contributions to the difference between the observed value and the average value of all observations. For the first (closest) point, there is a large positive contribution to the difference from dist.m and elev.
How much do these differ from those computed by iml? Examine the first observation:
set.seed(314)
shapley <- iml::Shapley$new(predictor.ra, x.interest = X[1, ])
print(data.frame(fastshap = fshap[1,], iml = shapley$results$phi))
## fastshap iml
## ffreq 0.02893541 0.032337223
## x -0.01155836 0.005993525
## y 0.02644784 0.023750034
## dist.m 0.24578811 0.204097527
## elev 0.05502968 0.025422614
## soil 0.01383911 0.015721168
## lime 0.05974382 0.050894354
Similar but not identical. The SHAP method using fastshap finds more importance for most predictors.
The shapviz package can be used to visualize the Shapley values. Load the package and create a shapviz object from the fastshap explanation result.
library(shapviz, warn.conflicts = FALSE, verbose = FALSE)
shv <- shapviz(fshap, X = X)
class(shv)
## [1] "shapviz"
Now we can use the various functions in shapviz.
First, plot the mean Shapley values over all points, i.e., global variable importance but calculated from all individuals.
sv_importance(shv, kind = "bar", show_numbers = TRUE)
We can see all individual importances with the “beeswarm” version of this plot:
sv_importance(shv, kind = "beeswarm")
This shows clearly that at close distances of dist.m the prediction is positively influenced by the predictor, and the reverse is true (but less strikingly) at far distances.
We can see the effect of each predictor, compared to its value, with a “dependence” plot. Plot the dependence on elevation for all observations, using the sv_dependence function. We can get some idea of the interactions by colouring the points by the value of another predictor. Here we select the flood frequency.
sv_dependence(shv, v = "elev", color_var = "ffreq")
sv_dependence(shv, v = "elev", color_var = "dist.m")
A consistent story: influence of elevation on the prediction is significant for both low and high points and the relation is almost linear. The value is positive for low elevations; the reverse is true for high elevations. Flood frequency is not well-related to Shapley values for all but the lowest points.
And on distance to river for all observations, with both flood frequency and elevation shown as interactions:
sv_dependence(shv, v = "dist.m", color_var = "ffreq")
sv_dependence(shv, v = "dist.m", color_var = "elev")
We see that for nearby points there is a strong positive influence of distance, i.e., the distance is very important to that prediction. For points \(>250\) m there is also an influence but negative.
The contribution of each predictor for the closest and furthest points can be visualized using the sv_waterfall function, or, for a more compact display, the sv_force function:
sv_waterfall(shv, row_id = 1)
sv_waterfall(shv, row_id = ix.maxdist)
sv_force(shv, row_id = 1)
sv_force(shv, row_id = ix.maxdist)
For this simple model these do not differ much from the Shapley values computed by iml.
Another approach to computing Shapley values is the kernelshap package, an efficient implementation of the Kernel SHAP algorithm. See ?kernelshap for a detailed explanation.
This requires (1) a fitted model, (2) a set of features for which to display the values, with observation values at each point, and (3) background data to integrate out “switched off” features. By default this is a random sample of 200 rows from the observations. These are the observations that will be used to randomize and estimate the effects within a coalition.
The X argument gives the points for which to compute the values. In a small data set this can be all the points.
The bg_X argument is the “background” data; if this is NULL, a bg_n “size of background data” argument is required instead. Here again the dataset is small, so we could use all the points; but more background points means more computation time, so we restrict it to about 2/3 of the dataset.
library(kernelshap, warn.conflicts = FALSE, verbose = FALSE)
# use existing ranger fits
dim(X)
## [1] 155 7
system.time(s <- kernelshap(m.lzn.ra.l,
X = X,
bg_X = NULL,
bg_n = 100,
verbose = FALSE)
)
## user system elapsed
## 72.896 2.362 38.321
str(s)
## List of 15
## $ S : num [1:155, 1:7] 0.0415 0.0396 0.0331 0.0263 0.027 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:7] "ffreq" "x" "y" "dist.m" ...
## $ X :'data.frame': 155 obs. of 7 variables:
## ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ x : num [1:155] 181072 181025 181165 181298 181307 ...
## ..$ y : num [1:155] 333611 333558 333537 333484 333330 ...
## ..$ dist.m: num [1:155] 50 30 150 270 380 470 240 120 240 420 ...
## ..$ elev : num [1:155] 7.91 6.98 7.8 7.66 7.48 ...
## ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## ..$ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ baseline : num 2.53
## $ bg_X :'data.frame': 100 obs. of 7 variables:
## ..$ ffreq : Factor w/ 3 levels "1","2","3": 2 2 1 2 1 1 2 1 1 1 ...
## ..$ x : num [1:100] 180114 179881 180283 178803 180530 ...
## ..$ y : num [1:100] 330803 330912 332014 330349 332538 ...
## ..$ dist.m: num [1:100] 500 650 320 280 60 50 340 220 10 80 ...
## ..$ elev : num [1:100] 9.4 10.52 7.78 8.66 7.78 ...
## ..$ soil : Factor w/ 3 levels "1","2","3": 3 3 2 1 1 1 1 1 1 1 ...
## ..$ lime : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 2 2 ...
## $ bg_w : NULL
## $ SE : num [1:155, 1:7] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:7] "ffreq" "x" "y" "dist.m" ...
## $ n_iter : int [1:155] 1 1 1 1 1 1 1 1 1 1 ...
## $ converged : logi [1:155] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ m : int 14
## $ m_exact : int 126
## $ prop_exact : num 1
## $ exact : logi TRUE
## $ txt : chr "Exact Kernel SHAP values"
## $ predictions: num [1:155, 1] 2.99 3.02 2.75 2.47 2.44 ...
## $ algorithm : chr "kernelshap"
## - attr(*, "class")= chr "kernelshap"
Compute the visualization object and show some diagnostic plots. Note that shapviz understands kernelshap objects, so there is no need to re-format the Shapley values matrix.
First, the importances of each predictor at each point:
sv <- shapviz(s)
sv_importance(sv, kind = "bee")
Second, selected points:
sv_force(sv, row_id = 1)
sv_force(sv, row_id = ix.maxdist)
Third, some dependence plots:
# auto-select most important interacting feature
sv_dependence(sv, v = "dist.m", color_var = "elev")
sv_dependence(sv, v = "dist.m", color_var = "ffreq")
All of these are slightly different from the plots based on the fastshap computation. This is because kernelshap uses a different approach to fast computation of the Shapley values.
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32↩︎
Dumelle, M., Higham, M., & Hoef, J. M. V. (2023). spmodel: Spatial statistical modeling and prediction in R. PLOS ONE, 18(3), e0282524. https://doi.org/10.1371/journal.pone.0282524↩︎
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983-999, DOI:10.5555/1248547.1248582↩︎
Breiman, L. (2001). Random forests. Mach Learn, 45:5-32. doi: 10.1023/A:1010933404324↩︎
Molnar, C. (2022). Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/↩︎
https://christophm.github.io/interpretable-ml-book/ale.html↩︎
https://christophm.github.io/interpretable-ml-book/pdp.html↩︎
see https://christophm.github.io/interpretable-ml-book/ice.html↩︎
https://christophm.github.io/interpretable-ml-book/interaction.html↩︎
Gower, John C. “A general coefficient of similarity and some of its properties.” Biometrics (1971): 857-871↩︎
https://christophm.github.io/interpretable-ml-book/shapley.html↩︎
https://christophm.github.io/interpretable-ml-book/shapley.html↩︎
Lundberg, Scott M., and Su-In Lee. “A unified approach to interpreting model predictions.” Advances in Neural Information Processing Systems (2017)↩︎
Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. “Consistent individualized feature attribution for tree ensembles.” arXiv preprint arXiv:1802.03888 (2018)↩︎
Štrumbelj, E., & Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3), 647-665. https://doi.org/10.1007/s10115-013-0679-x↩︎