1 Purpose

We compare several R packages that build random forests: the older randomForest package; a much faster implementation, ranger; and the spmodel package, which additionally models the spatial structure of the random forest residuals and interpolates these to improve predictions.

We hope the results are similar. Note that due to randomization there will always be differences, although not between runs of this script because of the use of set.seed.

This document also explores the Quantile Regression Forests option in ranger to assess the uncertainty of predictions, and local measures of variable importance, including Shapley values.

We also have some fun with interpretable machine learning… just because we can!

2 Dataset

We use as an example the small “Meuse (Maas) River soil pollution” dataset. See help(meuse, package = "sp") for details and references.

This dataset is provided with the sp “classes and methods for spatial data: points, lines, polygons and grids” package. Although sp has been superseded by sf, it is still available and includes this famous test dataset. You do not need to load sp, just have it installed, in order to locate the sp::meuse dataset.

data(meuse, package = "sp")
str(meuse)
## 'data.frame':    155 obs. of  14 variables:
##  $ x      : num  181072 181025 181165 181298 181307 ...
##  $ y      : num  333611 333558 333537 333484 333330 ...
##  $ cadmium: num  11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
##  $ copper : num  85 81 68 81 48 61 31 29 37 24 ...
##  $ lead   : num  299 277 199 116 117 137 132 150 133 80 ...
##  $ zinc   : num  1022 1141 640 257 269 ...
##  $ elev   : num  7.91 6.98 7.8 7.66 7.48 ...
##  $ dist   : num  0.00136 0.01222 0.10303 0.19009 0.27709 ...
##  $ om     : num  13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
##  $ ffreq  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ soil   : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
##  $ lime   : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
##  $ dist.m : num  50 30 150 270 380 470 240 120 240 420 ...

We try to model one of the heavy metal concentrations (zinc, Zn) from all possibly relevant predictors:

  • ffreq flooding frequency (3 classes)
  • dist.m distance from the Meuse river (m)
  • elev elevation above mean sea level (m)
  • soil soil type (3 classes)
  • lime whether agricultural lime was applied to the field (yes/no)

We also include the locations, because trees can split on the coordinates to build “boxes” around sets of nearby locations and so account for local effects.

  • x, y coordinates in the RDH (Dutch) grid (m), EPSG code 28992

Make a log10-transformed copy of the Zn metal concentration to obtain a somewhat balanced distribution of the target variable:

meuse$logZn <- log10(meuse$zinc)    # Zn (zinc)
hist(meuse$logZn, main = ""); rug(meuse$logZn)

There is a good range of values, in a bimodal distribution: low and high values, with fewer medium values. There is still some right skew towards the highest values.

3 Random forest with randomForest

This was (one of?) the first R packages to implement Breiman’s random forest algorithm.1

Packages used in this section: randomForest for random forests, randomForestExplainer for some diagnostics, and ggplot2 for graphics.

library(ggplot2, warn.conflicts = FALSE, verbose = FALSE)
library(randomForest, warn.conflicts = FALSE, verbose = FALSE)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(randomForestExplainer,  warn.conflicts = FALSE, verbose = FALSE)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

3.1 Build the forest

First build the forest, using the randomForest function. Use set.seed for reproducibility, so that your results (if you render this script or run this chunk by itself) will be the same as these.

The mtry optional argument gives the number of variables randomly sampled as candidates at each split. By default this is \(\lfloor{p/3}\rfloor\) where \(p\) is the number of predictors, here \(7\), so \(\lfloor{7/3}\rfloor = 2\). We increase this to 3 to get better trees but still include weak predictors; this is matched for ranger (below).

set.seed(314)
m.lzn.rf <- randomForest(logZn ~ ffreq + dist.m + elev + soil + lime + x + y,
                         data=meuse, 
                         # eight permutations per tree to estimate importance
                         importance = TRUE, nperm = 8, 
                         na.action = na.omit, mtry = 3)
print(m.lzn.rf)
## 
## Call:
##  randomForest(formula = logZn ~ ffreq + dist.m + elev + soil +      lime + x + y, data = meuse, importance = TRUE, nperm = 8,      mtry = 3, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.01887348
##                     % Var explained: 80.67

The structure of this object:

str(m.lzn.rf)
## List of 18
##  $ call           : language randomForest(formula = logZn ~ ffreq + dist.m + elev + soil + lime + x +      y, data = meuse, importance = TRUE,| __truncated__
##  $ type           : chr "regression"
##  $ predicted      : Named num [1:155] 2.93 2.97 2.65 2.54 2.46 ...
##   ..- attr(*, "names")= chr [1:155] "1" "2" "3" "4" ...
##  $ mse            : num [1:500] 0.0148 0.0158 0.0231 0.0259 0.0227 ...
##  $ rsq            : num [1:500] 0.849 0.839 0.763 0.735 0.767 ...
##  $ oob.times      : int [1:155] 180 173 183 185 182 192 181 168 174 168 ...
##  $ importance     : num [1:7, 1:2] 0.01295 0.07057 0.03233 0.00488 0.00635 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
##   .. ..$ : chr [1:2] "%IncMSE" "IncNodePurity"
##  $ importanceSD   : Named num [1:7] 0.000741 0.001703 0.001127 0.000486 0.000573 ...
##   ..- attr(*, "names")= chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
##  $ localImportance: NULL
##  $ proximity      : NULL
##  $ ntree          : num 500
##  $ mtry           : num 3
##  $ forest         :List of 11
##   ..$ ndbigtree    : int [1:500] 93 101 95 103 97 91 105 103 105 105 ...
##   ..$ nodestatus   : int [1:119, 1:500] -3 -3 -3 -3 -3 -3 -3 -1 -1 -1 ...
##   ..$ leftDaughter : int [1:119, 1:500] 2 4 6 8 10 12 14 0 0 0 ...
##   ..$ rightDaughter: int [1:119, 1:500] 3 5 7 9 11 13 15 0 0 0 ...
##   ..$ nodepred     : num [1:119, 1:500] 2.54 2.92 2.43 2.81 2.96 ...
##   ..$ bestvar      : int [1:119, 1:500] 3 5 2 6 6 3 2 0 0 0 ...
##   ..$ xbestsplit   : num [1:119, 1:500] 7.46 1.00 1.45e+02 1.80e+05 1.79e+05 ...
##   ..$ ncat         : Named int [1:7] 3 1 1 3 2 1 1
##   .. ..- attr(*, "names")= chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
##   ..$ nrnodes      : int 119
##   ..$ ntree        : num 500
##   ..$ xlevels      :List of 7
##   .. ..$ ffreq : chr [1:3] "1" "2" "3"
##   .. ..$ dist.m: num 0
##   .. ..$ elev  : num 0
##   .. ..$ soil  : chr [1:3] "1" "2" "3"
##   .. ..$ lime  : chr [1:2] "0" "1"
##   .. ..$ x     : num 0
##   .. ..$ y     : num 0
##  $ coefs          : NULL
##  $ y              : Named num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
##   ..- attr(*, "names")= chr [1:155] "1" "2" "3" "4" ...
##  $ test           : NULL
##  $ inbag          : NULL
##  $ terms          :Classes 'terms', 'formula'  language logZn ~ ffreq + dist.m + elev + soil + lime + x + y
##   .. ..- attr(*, "variables")= language list(logZn, ffreq, dist.m, elev, soil, lime, x, y)
##   .. ..- attr(*, "factors")= int [1:8, 1:7] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:8] "logZn" "ffreq" "dist.m" "elev" ...
##   .. .. .. ..$ : chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
##   .. ..- attr(*, "term.labels")= chr [1:7] "ffreq" "dist.m" "elev" "soil" ...
##   .. ..- attr(*, "order")= int [1:7] 1 1 1 1 1 1 1
##   .. ..- attr(*, "intercept")= num 0
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(logZn, ffreq, dist.m, elev, soil, lime, x, y)
##   .. ..- attr(*, "dataClasses")= Named chr [1:8] "numeric" "factor" "numeric" "numeric" ...
##   .. .. ..- attr(*, "names")= chr [1:8] "logZn" "ffreq" "dist.m" "elev" ...
##  - attr(*, "class")= chr [1:2] "randomForest.formula" "randomForest"

How successful was the forest? Compute the RMSE and \(R^2\) of the fits, based on the out-of-bag (OOB) cross-validation. These are estimates of the predictive power of the fitted model.

m.lzn.rf.resid <- (meuse$logZn - m.lzn.rf$predicted)
print(sqrt(mse <- mean(m.lzn.rf.resid^2))) # RMSE
## [1] 0.1373808
print(rsq <- (1 - mse/var(meuse$logZn)))  # pseudo-R^2
## [1] 0.8079775

The model is successful overall, with a low root mean squared error (RMSE) of 0.1374 and a high \(R^2=\) 0.808; both of these are computed from the out-of-bag predictions, i.e., a cross-validation, not from the fit to the training points. Compare the RMSE to the mean of the target variable, 2.5562.

Display the cross-validation error rate against the number of trees, to see how many trees were needed for a stable result:

plot(m.lzn.rf)

Only about 300 trees were needed in this case.
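
As a check, we could refit with only 300 trees and compare the final out-of-bag MSE to the 500-tree forest (a minimal sketch; ntree is the standard randomForest argument, and the $mse element holds the running OOB MSE):

set.seed(314)
m.lzn.rf.300 <- randomForest(logZn ~ ffreq + dist.m + elev + soil + lime + x + y,
                             data = meuse, ntree = 300, mtry = 3, na.action = na.omit)
# final OOB MSE of each forest; these should be very close if 300 trees suffice
c(ntree300 = tail(m.lzn.rf.300$mse, 1), ntree500 = tail(m.lzn.rf$mse, 1))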

3.2 Variable importance

Examine the variable importance.

There are two kinds. From the help text:

  1. “The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences.”

  2. “The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.”

Here we have a regression, so the first (permutation) measure is the increase in MSE when the predictor is permuted, and the second is the total decrease in the residual sum of squares from splitting on that variable.

We compare both:

print(data.frame(randomForest::importance(m.lzn.rf, type=1),
           randomForest::importance(m.lzn.rf, type=2)))
##        X.IncMSE IncNodePurity
## ffreq  17.47817     0.8474457
## dist.m 41.43391     6.1204559
## elev   28.67894     3.4642117
## soil   10.05626     0.7216287
## lime   11.07493     1.2812095
## x      26.36551     1.1783446
## y      17.19509     0.9118874
par(mfrow = c(1, 2))
varImpPlot(m.lzn.rf, type=1, main = "Importance: permutation")
varImpPlot(m.lzn.rf, type=2, main = "Importance: node impurity")

par(mfrow = c(1, 1))

The ranks are similar but not identical. After the first two, dist.m and elev (obviously the most important, since the heavy metal comes mostly from flooding by the Meuse/Maas River), the others are not in the same order. There is likely substitution among partially correlated predictors.
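
One quick check of that explanation (a minimal sketch) is the correlation among the numeric predictors; substitution is plausible when these are strongly related, e.g., dist.m and elev, since the floodplain is both lower and closer to the river:

# correlations among the numeric predictors; strong correlations make substitution plausible
round(cor(meuse[, c("dist.m", "elev", "x", "y")]), 2)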

3.3 Goodness-of-fit

Predict back onto the known points, and evaluate the goodness-of-fit:

p.rf <- predict(m.lzn.rf, newdata=meuse)
summary(r.rpp <- meuse$logZn - p.rf)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.2313655 -0.0402785 -0.0063284 -0.0003102  0.0387612  0.1579461
print(paste("RMSE:", round(rmse.rf <- sqrt(sum(r.rpp^2)/length(r.rpp)), 4)))
## [1] "RMSE: 0.0655"
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3),          ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Random Forest")
grid(); abline(0,1)

Quite a good internal fit, but this is because we use all the training points to make each prediction.

3.4 Out-of-bag cross-validation

A more realistic view of the predictive power of the model is the out-of-bag cross-validation. This summarizes the predictions at training points not used in a set of trees – recall that in the random forest each tree uses sampling with replacement, and some training points are not included.

We get the OOB cross-validation predictions by calling predict with no dataset specified; the dataset used to build the model is then used.

p.rf.oob <- predict(m.lzn.rf)
summary(r.rpp.oob <- meuse$logZn - p.rf.oob)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.5353648 -0.0853604 -0.0121787 -0.0009662  0.0786027  0.3075081
(rmse.oob <- sqrt(sum(r.rpp.oob^2)/length(r.rpp.oob)))
## [1] 0.1373808
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
     xlab="Out-of-bag cross-validation estimates",
     ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3),
     main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)

There is quite a bit more spread than the internal fit, as expected. The summary statistics are a bit more than doubled.

3.5 Sensitivity

Because of randomization, each run of the random forest will be different. To evaluate this, compute the RF several times and collect statistics. This also allows a speed comparison with other RF packages.

set.seed(314)
n <- 48  # number of simulations
# data frame to collect results
rf.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
t.rf <- system.time(
  for (i in 1:n) {
    model.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                             data=meuse, importance=T, na.action=na.omit, mtry=5)
    rf.stats[i, "rsq"] <- median(summary(model.rf$rsq))
    rf.stats[i, "mse"] <- median(summary(model.rf$mse))
  }
)
summary(rf.stats[,2:3])
##       rsq              mse         
##  Min.   :0.7922   Min.   :0.01839  
##  1st Qu.:0.7998   1st Qu.:0.01908  
##  Median :0.8021   Median :0.01933  
##  Mean   :0.8020   Mean   :0.01934  
##  3rd Qu.:0.8046   3rd Qu.:0.01955  
##  Max.   :0.8117   Max.   :0.02029
hist(rf.stats[,"rsq"], xlab="RandomForest R^2", breaks = 16, main = "Frequency of fits (R^2)")
rug(rf.stats[,"rsq"])

hist(rf.stats[,"mse"], xlab="RandomForest RMSE", breaks = 16, main = "Frequency of OOB acuracy (RMSE)")
rug(rf.stats[,"mse"])

This is fairly stable. It would be much less so with subsets (e.g., randomly remove 10% of training points).
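
A minimal sketch of that idea, simply dropping a random 10% of the rows in each run and collecting the final OOB R^2:

set.seed(314)
rsq.sub <- sapply(1:48, function(i) {
  keep <- sample(nrow(meuse), size = round(0.9 * nrow(meuse)))  # keep a random 90%
  m <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                    data = meuse[keep, ], mtry = 3, na.action = na.omit)
  tail(m$rsq, 1)   # final OOB R^2 of this run
})
summary(rsq.sub); sd(rsq.sub)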

4 Random forest with ranger

Packages used in this section: ranger for random forests and vip for variable importance.

(Note that vip can be used for many model types, see https://koalaverse.github.io/vip/articles/vip.html).

The mtry optional argument gives the number of variables randomly sampled as candidates at each split. By default for ranger (unlike randomForest) this is \(\lfloor \sqrt{p} \rfloor\) where \(p\) is the number of predictors, here \(7\), so \(\lfloor \sqrt{7} \rfloor = 2\). We increase this to 3 to get better trees but still include weak predictors; this matches the setting used for randomForest (above).

4.1 Build the forest

First build the forest, using the ranger function. Ask for the permutation measure of variable importance, to match randomForest.

The variable importance measures are of several types:

  1. “impurity”: variance of the responses for regression (as here)
  2. “impurity_corrected”: “The ‘impurity_corrected’ importance measure is unbiased in terms of the number of categories and category frequencies” – not relevant for regression forests
  3. “permutation”. For this one can specify scale.permutation.importance = TRUE; this should match the randomForest concept.

Compute two ways, for the two kinds of importance. Use the same random seed; the forests will then be identical, but the importance measures will differ.

Permutation:

library(ranger, warn.conflicts = FALSE, verbose = FALSE)
set.seed(314)
m.lzn.ra <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, 
                   data = meuse, 
                   importance = 'permutation',
                   scale.permutation.importance = TRUE,
                   mtry = 3)
print(m.lzn.ra)
## Ranger result
## 
## Call:
##  ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse,      importance = "permutation", scale.permutation.importance = TRUE,      mtry = 3) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      155 
## Number of independent variables:  7 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.01861645 
## R squared (OOB):                  0.8105926

Impurity:

set.seed(314)
m.lzn.ra.i <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, 
                   data = meuse, 
                   importance = 'impurity',
                   mtry = 3)
print(m.lzn.ra.i)
## Ranger result
## 
## Call:
##  ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse,      importance = "impurity", mtry = 3) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      155 
## Number of independent variables:  7 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.01861645 
## R squared (OOB):                  0.8105926

4.2 Goodness-of-fit

Predict with the fitted model at all observations. For this we need to specify a data= argument.

p.ra <- predict(m.lzn.ra, data=meuse)
str(p.ra)
## List of 5
##  $ predictions              : num [1:155] 2.99 3.02 2.75 2.47 2.44 ...
##  $ num.trees                : num 500
##  $ num.independent.variables: num 7
##  $ num.samples              : int 155
##  $ treetype                 : chr "Regression"
##  - attr(*, "class")= chr "ranger.prediction"
summary(r.rap <- meuse$logZn - p.ra$predictions)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.2419633 -0.0382925 -0.0055786 -0.0006077  0.0380941  0.1616413
(rmse.ra <- sqrt(sum(r.rap^2)/length(r.rap)))
## [1] 0.06501253

Compare the fit RMSE with that computed by randomForest:

c(rmse.ra, rmse.rf)
## [1] 0.06501253 0.06554428

There is a slight difference.

Plot the fits:

par(mfrow=c(1,2))
plot(meuse$logZn ~ p.ra$predictions, asp=1, pch=20, xlab="fitted", ylab="log10(Zn)",
     xlim=c(2,3.3),   ylim=c(2,3.3),  main="ranger")
grid(); abline(0,1)
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="log10(Zn)",
     xlim=c(2,3.3), ylim=c(2,3.3), main="randomForest")
grid(); abline(0,1)

par(mfrow=c(1,1))

Almost identical, including which points are poorly-predicted.

4.3 Out-of-bag cross-validation

The default model already has the OOB predictions stored in it. Compare to the randomForest results.

summary(m.lzn.ra$predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.131   2.319   2.523   2.558   2.749   3.095
summary(p.rf.oob)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.132   2.321   2.534   2.557   2.758   3.063
summary(m.lzn.ra$predictions - p.rf.oob)  # difference
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0389200 -0.0075100  0.0002348  0.0004606  0.0083507  0.0394288

Overall the OOB predictions are nearly identical; on average ranger predicts very slightly higher than randomForest.

Plot the OOB cross-validation predictions:

par(mfrow=c(1,2))
plot(meuse$logZn ~ m.lzn.ra$predictions, asp=1, pch=20,
     ylab="log10(Zn)", xlab="OOB X-validation estimates",
    xlim=c(2,3.3), ylim=c(2,3.3),
    main="ranger")
abline(0,1); grid()
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
     xlab="OOB X-validation estimates",
     ylab="log10(Zn)", xlim=c(2,3.3), ylim=c(2,3.3),
     main="randomForest")
grid(); abline(0,1)

par(mfrow=c(1,1))

These are very similar. As with the fits, the same points are poorly-predicted by both models.

4.4 Variable importance

First, for permutation, the sums and the individual importances:

sum(ranger::importance(m.lzn.ra))
## [1] 153.2905
sum(randomForest::importance(m.lzn.rf)[,1])
## [1] 152.2828
cbind(ranger = ranger::importance(m.lzn.ra),
      rF = randomForest::importance(m.lzn.rf)[,1])
##          ranger       rF
## ffreq  16.79039 17.47817
## x      25.29951 41.43391
## y      15.34952 28.67894
## dist.m 40.71367 10.05626
## elev   29.94610 11.07493
## soil   10.32160 26.36551
## lime   14.86976 17.19509

Second, for impurity, the sums and the individual importances:

sum(ranger::importance(m.lzn.ra.i))
## [1] 14.59591
sum(randomForest::importance(m.lzn.rf)[,2])
## [1] 14.52518
cbind(ranger = ranger::importance(m.lzn.ra.i),
      rF = randomForest::importance(m.lzn.rf)[,2])
##           ranger        rF
## ffreq  0.7748065 0.8474457
## x      1.2215072 6.1204559
## y      0.9253974 3.4642117
## dist.m 5.7685564 0.7216287
## elev   3.6611503 1.2812095
## soil   0.6664583 1.1783446
## lime   1.5780326 0.9118874

At first sight there seems to be quite some difference between the packages: randomForest appears to give much more importance to the geographic coordinates and much less to the distance to river and elevation. However, this is an artifact of the comparison: cbind does not match rows by name, and the two importance vectors are in different orders, each following its own model formula (ranger: ffreq, x, y, dist.m, elev, soil, lime; randomForest: ffreq, dist.m, elev, soil, lime, x, y). When the vectors are aligned by predictor name the two packages agree closely, with dist.m and elev clearly the most important. The sums of variable importance are slightly different.
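
A minimal sketch of the corrected comparison, matching the randomForest importances to the ranger predictor names before binding:

# align the two permutation-importance vectors by predictor name before comparing
imp.ra <- ranger::importance(m.lzn.ra)
imp.rf <- randomForest::importance(m.lzn.rf)[, 1]
cbind(ranger = imp.ra, rF = imp.rf[names(imp.ra)])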

4.5 Sensitivity

Compute the ranger RF several times and collect statistics, also the timing:

n <- 48
ra.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
t.ra <- system.time(
  for (i in 1:n) {
    model.ra <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                       data=meuse, importance="none", mtry=5,
                       write.forest = FALSE)
    ra.stats[i, "mse"] <- model.ra$prediction.error
    ra.stats[i, "rsq"] <- model.ra$r.squared
  }
)
summary(ra.stats[,2:3])
##       rsq              mse         
##  Min.   :0.8008   Min.   :0.01829  
##  1st Qu.:0.8049   1st Qu.:0.01874  
##  Median :0.8071   Median :0.01896  
##  Mean   :0.8070   Mean   :0.01897  
##  3rd Qu.:0.8093   3rd Qu.:0.01918  
##  Max.   :0.8139   Max.   :0.01958
hist(ra.stats[,"rsq"], xlab="ranger R^2", breaks = 16, main = "Frequency of fits (R^2)")
rug(ra.stats[,"rsq"])

hist(ra.stats[,"mse"], xlab="ranger RMSE", breaks = 16, main = "Frequency of OOB accuracy (RMSE)")
rug(ra.stats[,"mse"])

Again, not a large variation.

Compare the timings:

print(t.rf)
##    user  system elapsed 
##   6.127   0.081   6.278
print(t.ra)
##    user  system elapsed 
##   2.064   0.107   1.123
print(round(t.rf/t.ra, 1))
##    user  system elapsed 
##     3.0     0.8     5.6

randomForest is almost 6x slower than ranger in this test.

Compare the sensitivity statistics to those from randomForest:

summary(ra.stats[,2:3])
##       rsq              mse         
##  Min.   :0.8008   Min.   :0.01829  
##  1st Qu.:0.8049   1st Qu.:0.01874  
##  Median :0.8071   Median :0.01896  
##  Mean   :0.8070   Mean   :0.01897  
##  3rd Qu.:0.8093   3rd Qu.:0.01918  
##  Max.   :0.8139   Max.   :0.01958
summary(rf.stats[,2:3])
##       rsq              mse         
##  Min.   :0.7922   Min.   :0.01839  
##  1st Qu.:0.7998   1st Qu.:0.01908  
##  Median :0.8021   Median :0.01933  
##  Mean   :0.8020   Mean   :0.01934  
##  3rd Qu.:0.8046   3rd Qu.:0.01955  
##  Max.   :0.8117   Max.   :0.02029
tmp <- round(data.frame(sd(ra.stats[,2]), sd(rf.stats[,2]), sd(ra.stats[,3]), sd(rf.stats[,3])),6)
names(tmp) <- c("s.d. R^2 ranger", "s.d. R^2 randomForest","s.d. MSE ranger","s.d. MSE randomForest")
print(tmp)
##   s.d. R^2 ranger s.d. R^2 randomForest s.d. MSE ranger s.d. MSE randomForest
## 1        0.002899              0.004206        0.000285              0.000411

In this test ranger has somewhat smaller standard deviations for both R^2 and MSE, i.e., it is marginally less sensitive to the randomization, but the difference is small.

5 Spatial random forests with splmRF

Thanks to Michael McManus (EPA) for this. Here we try a spatially-explicit random forest plus local model using the experimental (as of mid 2025) spmodel package, also from the EPA.2

First, make the object spatially-explicit in the sf “Simple Features” data structure. Note that we specify remove = FALSE to keep x and y also in the data frame, to use in the random forest model.

library(sf, warn.conflicts = FALSE, verbose = FALSE)
## Linking to GEOS 3.13.1, GDAL 3.10.2, PROJ 9.6.0; sf_use_s2() is TRUE
meuse_sf <- st_as_sf(meuse, coords = c("x", "y"), remove = FALSE)
summary(meuse_sf)
##        x                y             cadmium           copper      
##  Min.   :178605   Min.   :329714   Min.   : 0.200   Min.   : 14.00  
##  1st Qu.:179371   1st Qu.:330762   1st Qu.: 0.800   1st Qu.: 23.00  
##  Median :179991   Median :331633   Median : 2.100   Median : 31.00  
##  Mean   :180005   Mean   :331635   Mean   : 3.246   Mean   : 40.32  
##  3rd Qu.:180630   3rd Qu.:332463   3rd Qu.: 3.850   3rd Qu.: 49.50  
##  Max.   :181390   Max.   :333611   Max.   :18.100   Max.   :128.00  
##                                                                     
##       lead            zinc             elev             dist        
##  Min.   : 37.0   Min.   : 113.0   Min.   : 5.180   Min.   :0.00000  
##  1st Qu.: 72.5   1st Qu.: 198.0   1st Qu.: 7.546   1st Qu.:0.07569  
##  Median :123.0   Median : 326.0   Median : 8.180   Median :0.21184  
##  Mean   :153.4   Mean   : 469.7   Mean   : 8.165   Mean   :0.24002  
##  3rd Qu.:207.0   3rd Qu.: 674.5   3rd Qu.: 8.955   3rd Qu.:0.36407  
##  Max.   :654.0   Max.   :1839.0   Max.   :10.520   Max.   :0.88039  
##                                                                     
##        om         ffreq  soil   lime       landuse       dist.m      
##  Min.   : 1.000   1:84   1:97   0:111   W      :50   Min.   :  10.0  
##  1st Qu.: 5.300   2:48   2:46   1: 44   Ah     :39   1st Qu.:  80.0  
##  Median : 6.900   3:23   3:12           Am     :22   Median : 270.0  
##  Mean   : 7.478                         Fw     :10   Mean   : 290.3  
##  3rd Qu.: 9.000                         Ab     : 8   3rd Qu.: 450.0  
##  Max.   :17.000                         (Other):25   Max.   :1000.0  
##  NA's   :2                              NA's   : 1                   
##      logZn          geometry  
##  Min.   :2.053   POINT  :155  
##  1st Qu.:2.297   epsg:NA:  0  
##  Median :2.513                
##  Mean   :2.556                
##  3rd Qu.:2.829                
##  Max.   :3.265                
## 

Plot the target variable.

ggplot(data = meuse_sf) +
  geom_sf(aes(size = logZn, col = logZn), pch=1)

The spmodel package includes splmRF for random forest spatial residual models, using random forest to fit the mean and a spatial linear model (e.g., kriging) to fit the residuals.

“A random forest model is fit to the mean portion of the model specified by formula using ranger::ranger(). Residuals are computed and used as the response variable in an intercept-only spatial linear model fit using splm().”

So in this case we can use the coordinates twice: (1) in the random forest model as with the other methods, to make “boxes”, and (2) for spatial interpolation of the residuals from the random forest fit.

To compare with the other RF methods, we set mtry = 3. The spatial covariance is fit with an exponential function, a usual conservative choice.

set.seed(314)
library(spmodel, warn.conflicts = FALSE, verbose = FALSE)
m.lzn.splmRF <- splmRF(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                  data = meuse_sf,
                  spcov_type = "exponential",
                  importance = 'permutation',
                  scale.permutation.importance = TRUE,
                  mtry = 3)
summary(m.lzn.splmRF)
##         Length Class  Mode
## call     7     -none- call
## ranger  17     ranger list
## splm    36     splm   list
## newdata  0     -none- NULL

The fitted model object has the call, the results from calling ranger to fit the random forest, and the parameters of the fitted residual model, i.e., spatial covariance structure.

print(m.lzn.splmRF$call)
## splmRF(formula = logZn ~ ffreq + x + y + dist.m + elev + soil + 
##     lime, data = meuse_sf, spcov_type = "exponential", importance = "permutation", 
##     scale.permutation.importance = TRUE, mtry = 3)
print(m.lzn.splmRF$ranger)
## Ranger result
## 
## Call:
##  NA 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      155 
## Number of independent variables:  7 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.01861645 
## R squared (OOB):                  0.8105926
print(m.lzn.ra)
## Ranger result
## 
## Call:
##  ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse,      importance = "permutation", scale.permutation.importance = TRUE,      mtry = 3) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      155 
## Number of independent variables:  7 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.01861645 
## R squared (OOB):                  0.8105926
print(m.lzn.splmRF$splm)
## 
## Call:
## NA
## 
## 
## Coefficients (fixed):
## (Intercept)  
##   -0.004817  
## 
## 
## Coefficients (exponential spatial covariance):
##        de         ie      range  
##  0.011188   0.007789  68.068288

The ranger portion of the model is identical to calling ranger directly, due to the same model form, parameters and seed.

There is residual spatial correlation, with an effective range of about 204 meters. This is the range coefficient of the exponential model, multiplied by 3 to reach 95% of the total sill, as is conventional.
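
A one-line check of that arithmetic:

# effective range of the exponential covariance: 3 x the fitted range parameter
3 * m.lzn.splmRF$splm$coefficients$spcov["range"]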

The coefficient de is the dependent error variance, and ie the independent error variance. These correspond to the partial sill and nugget variance in a variogram model. So the nugget-to-total sill ratio is ie/(ie+de):

round(m.lzn.splmRF$splm$coefficients$spcov["ie"]/(m.lzn.splmRF$splm$coefficients$spcov["ie"]+m.lzn.splmRF$splm$coefficients$spcov["de"]),2)
##   ie 
## 0.41

This is a high nugget, around 40% of the residual variance after the random forest model.

Compare the variable importance to ranger:

sort(m.lzn.splmRF$ranger$variable.importance, decreasing = TRUE)
##   dist.m     elev        x    ffreq        y     lime     soil 
## 40.71367 29.94610 25.29951 16.79039 15.34952 14.86976 10.32160
sort(m.lzn.ra$variable.importance, decreasing = TRUE)
##   dist.m     elev        x    ffreq        y     lime     soil 
## 40.71367 29.94610 25.29951 16.79039 15.34952 14.86976 10.32160

These are identical, because of the same random seed.

Out-of-bag predictions from the random forest, comparing to ranger:

summary(m.lzn.splmRF$ranger$predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.131   2.319   2.523   2.558   2.749   3.095
summary(m.lzn.ra$predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.131   2.319   2.523   2.558   2.749   3.095

Again, identical.

Compare model performance:

print(m.lzn.splmRF$ranger$r.squared)
## [1] 0.8105926
print(m.lzn.ra$r.squared)
## [1] 0.8105926

Again, identical.

Compare the predictions at the training points:

summary(p.splmRF <- predict(m.lzn.splmRF, newdata = meuse_sf))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.017   2.287   2.513   2.555   2.823   3.264
summary(p.ra.p <- predict(m.lzn.ra, data = meuse)$predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.090   2.307   2.523   2.557   2.795   3.153
summary(p.rf.p <- predict(m.lzn.rf, newdata = meuse))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.094   2.306   2.529   2.556   2.798   3.140

The predictions from randomForest and ranger are very close but not identical. Including the interpolation of residuals with splmRF has widened the predictions and slightly lowered many of them. We can see the extremes by adding a rug plot to the histograms:

hist(p.splmRF, main = "splmRF", xlab = "log10(Zn)")
rug(p.splmRF)

hist(p.ra.p, main = "ranger", xlab = "log10(Zn)")
rug(p.ra.p)

hist(p.rf.p, main = "randomForest", xlab = "log10(Zn)")
rug(p.rf.p)

6 Spatial structure of ranger random forest residuals

We can try to duplicate splmRF by modelling the spatial structure of the ranger residuals with a variogram model and then interpolating these to modify the ranger predictions. We use the gstat package for this.

library(gstat, warn.conflicts = FALSE, verbose = FALSE)

Compute the ranger residuals and their empirical variogram, using the default cutoff of 1/3 of the diagonal of the bounding box and 15 bins:

summary(meuse_sf$r.ra <- (meuse_sf$logZn - p.ra.p))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.2419633 -0.0382925 -0.0055786 -0.0006077  0.0380941  0.1616413
v <- variogram(r.ra ~ 1, data = meuse_sf)
plot(v, plot.numbers = TRUE)

This is a somewhat erratic variogram, due to the small number of points, but shows some structure. The cutoff seems a bit too far, with large fluctuations around an apparent sill. So re-compute with a shorter cutoff. Again accept the default 15 bins.

v <- variogram(r.ra ~ 1, data = meuse_sf, cutoff = 1200)
plot(v, plot.numbers = TRUE)

Model this with an exponential model (as used by splmRF in our example) and fit it with gstat::fit.variogram, using its default ad-hoc weighted least-squares optimization:

(vm <- fit.variogram(v, vgm(psill = 0.002, "Exp", range = 800/3, nugget = 0.003)))
##   model       psill    range
## 1   Nug 0.002985382   0.0000
## 2   Exp 0.002356873 550.5001
plot(v, plot.numbers = TRUE, model = vm)

vm[1,"psill"]/(sum(vm[,"psill"]))
## [1] 0.5588243

This has a nugget/total sill ratio of 0.56, somewhat higher than estimated by splmRF. The effective range is 1652 m, quite a bit longer than that estimated by splmRF.

There are other ways to fit an exponential covariance to the residuals; clearly splmRF uses one of these.
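
Since the splmRF documentation (quoted above) says the residuals are used as the response in an intercept-only spatial linear model fit with splm(), one way to check is to fit that model to the ranger residuals directly. This is a sketch; it assumes splm() estimates the covariance parameters by restricted maximum likelihood rather than by weighted least squares on an empirical variogram:

# intercept-only spatial linear model on the ranger residuals, exponential covariance
m.resid.splm <- splm(r.ra ~ 1, data = meuse_sf, spcov_type = "exponential")
m.resid.splm$coefficients$spcov   # compare de, ie and range with the splmRF values above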

6.1 Kriging the residuals

Krige the residuals to the observation points by LOOCV (otherwise the predictions would be identical to the actual values, since kriging is an exact interpolator):

kcv <- krige.cv(r.ra ~ 1, nfold = nrow(meuse_sf),
         locations = meuse_sf, model = vm)
head(kcv)
## Simple feature collection with 6 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 181025 ymin: 333260 xmax: 181390 ymax: 333611
## CRS:           NA
##       var1.pred    var1.var    observed     residual      zscore fold
## 1  0.0171924582 0.004055268  0.01954183  0.002349371  0.03689281    1
## 2  0.0116623374 0.003984250  0.04078331  0.029120971  0.46135216    2
## 3  0.0022227688 0.003935970  0.05955231  0.057329540  0.91380296    3
## 4  0.0149304649 0.004080820 -0.05534256 -0.070273027 -1.10005641    4
## 5  0.0002684245 0.003896392 -0.01362113 -0.013889559 -0.22251388    5
## 6 -0.0098630929 0.004109839  0.02410763  0.033970722  0.52989811    6
##                geometry
## 1 POINT (181072 333611)
## 2 POINT (181025 333558)
## 3 POINT (181165 333537)
## 4 POINT (181298 333484)
## 5 POINT (181307 333330)
## 6 POINT (181390 333260)
summary(kcv$var1.pred)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0463394 -0.0108523 -0.0002145 -0.0005191  0.0126427  0.0488344
summary(meuse_sf$r.ra)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.2419633 -0.0382925 -0.0055786 -0.0006077  0.0380941  0.1616413

Kriging has greatly smoothed out the residuals.
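
A quick way to quantify that smoothing (a minimal sketch) is to compare the standard deviations of the kriged and raw residuals:

# spread of the LOOCV-kriged residuals vs. the raw ranger residuals
c(sd.kriged = sd(kcv$var1.pred), sd.residuals = sd(meuse_sf$r.ra))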

6.2 Combined prediction: RF + kriging

Add these to the ranger predictions, and compare to splmRF:

summary(p.ra$predictions + kcv$var1.pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.073   2.301   2.523   2.556   2.811   3.181
summary(p.splmRF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.017   2.287   2.513   2.555   2.823   3.264

These are not the same.

Try with an exponential variogram model identical to that fit by splmRF:

(coef <- m.lzn.splmRF$splm$coefficients$spcov[1:3])
##           de           ie        range 
##  0.011187869  0.007789315 68.068287715
(vm.splmRF <- vgm(psill = coef["de"], 
                  model = "Exp",
                 range = coef["range"],
                 nugget = coef["ie"]))
##   model       psill    range
## 1   Nug 0.007789315  0.00000
## 2   Exp 0.011187869 68.06829
kcv <- krige.cv(r.ra ~ 1, nfold = nrow(meuse_sf),
         locations = meuse_sf, model = vm.splmRF)
head(kcv)
## Simple feature collection with 6 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 181025 ymin: 333260 xmax: 181390 ymax: 333611
## CRS:           NA
##       var1.pred   var1.var    observed     residual      zscore fold
## 1  0.0117124744 0.01809020  0.01954183  0.007829355  0.05821090    1
## 2  0.0051476734 0.01816430  0.04078331  0.035635635  0.26440832    2
## 3 -0.0010341731 0.01867348  0.05955231  0.060586482  0.44336668    3
## 4  0.0032282031 0.01889692 -0.05534256 -0.058570765 -0.42607439    4
## 5 -0.0007943765 0.01861821 -0.01362113 -0.012826758 -0.09400433    5
## 6 -0.0048906983 0.01879490  0.02410763  0.028998327  0.21152075    6
##                geometry
## 1 POINT (181072 333611)
## 2 POINT (181025 333558)
## 3 POINT (181165 333537)
## 4 POINT (181298 333484)
## 5 POINT (181307 333330)
## 6 POINT (181390 333260)
summary(kcv$var1.pred)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0328518 -0.0056037 -0.0016030 -0.0004472  0.0049743  0.0326940
summary(meuse_sf$r.ra)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.2419633 -0.0382925 -0.0055786 -0.0006077  0.0380941  0.1616413

Even with this variogram model, the kriged (LOOCV) residuals do not reproduce the actual ranger residuals, so the combined predictions still will not match splmRF. Why? Maybe splmRF kriges the residuals at the training points themselves (not leave-one-out) and so recovers the exact ranger residual?

summary(p.ra$predictions + meuse_sf$r.ra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.053   2.297   2.513   2.556   2.829   3.265
summary(p.splmRF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.017   2.287   2.513   2.555   2.823   3.264

No, that’s not the reason. Although the predictions are close:

summary((p.ra$predictions + meuse_sf$r.ra)-p.splmRF)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0502105 -0.0107659  0.0022561  0.0008191  0.0117497  0.0584474

Maybe there is a difference in the radius used for the kriging step?
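
One thing we could check (a sketch; the nmax argument of krige.cv limits the number of nearest neighbours used for each prediction, and whether splmRF restricts its neighbourhood is only an assumption) is whether a limited kriging neighbourhood brings the combined predictions closer to splmRF:

# LOOCV kriging of the residuals with a limited search neighbourhood
kcv.nb <- krige.cv(r.ra ~ 1, locations = meuse_sf, model = vm.splmRF,
                   nfold = nrow(meuse_sf), nmax = 32)
summary((p.ra$predictions + kcv.nb$var1.pred) - p.splmRF)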

7 Quantile Regression Forest (QRF)

This is an extension of ranger which produces estimates of each quantile of the predictive distribution for each prediction. The method is explained by Meinshausen.3

To compute these, use ranger and specify quantreg = TRUE. Also keep the in-bag fits to use in the quantile predictions, with keep.inbag = TRUE.

set.seed(314)
m.lzn.qrf <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, 
                   data = meuse, 
                   quantreg = TRUE,
                   importance = 'permutation',
                   keep.inbag=TRUE,  # needed for QRF
                   scale.permutation.importance = TRUE,
                   mtry = 3)
pred.qrf <- predict(m.lzn.qrf, type = "quantiles",
                    # default is c(0.1, 0.5, 0.9)
                    quantiles = c(0.1, 0.25, 0.5, 0.75, 0.9))
summary(pred.qrf$predictions)
##  quantile= 0.1   quantile= 0.25  quantile= 0.5   quantile= 0.75 
##  Min.   :2.053   Min.   :2.076   Min.   :2.111   Min.   :2.134  
##  1st Qu.:2.190   1st Qu.:2.262   1st Qu.:2.304   1st Qu.:2.352  
##  Median :2.276   Median :2.350   Median :2.513   Median :2.676  
##  Mean   :2.371   Mean   :2.453   Mean   :2.551   Mean   :2.656  
##  3rd Qu.:2.513   3rd Qu.:2.702   3rd Qu.:2.783   3rd Qu.:2.887  
##  Max.   :2.910   Max.   :2.958   Max.   :3.184   Max.   :3.190  
##  quantile= 0.9  
##  Min.   :2.182  
##  1st Qu.:2.486  
##  Median :2.755  
##  Mean   :2.749  
##  3rd Qu.:2.964  
##  Max.   :3.265

This shows the range of predictions from the set of trees. There is quite some variability between quantiles. For example, here is the range between the 0.9 and 0.1 quantiles:

summary(pred.qrf$predictions[, "quantile= 0.9"] - pred.qrf$predictions[, "quantile= 0.1"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07311 0.27053 0.35717 0.37846 0.48179 0.73920

Most are on the order of 0.3 - 0.5 log10(Zn).
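
A minimal sketch of one use of these limits: the empirical coverage of the 0.1–0.9 interval at the training points, which should be roughly 80% if the quantiles are well calibrated:

# proportion of observed values falling inside the 0.1-0.9 quantile interval (nominal 80%)
in.interval <- (meuse$logZn >= pred.qrf$predictions[, "quantile= 0.1"]) &
               (meuse$logZn <= pred.qrf$predictions[, "quantile= 0.9"])
mean(in.interval)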

The 0.5 (median) quantile can be compared to the single prediction from ranger, although the QRF prediction is a median, not a mean:

summary(pred.qrf$predictions[, "quantile= 0.5"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.111   2.304   2.513   2.551   2.783   3.184
summary(p.ra.p)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.090   2.307   2.523   2.557   2.795   3.153

The summary statistics (means, quartiles) of the QRF median predictions and of the ranger mean predictions are close but not identical.

8 Local measures of variable importance

Both randomForest and ranger provide local measures of importance, i.e., how much each variable influences an individual prediction.

8.1 randomForest

“The ‘local’ (or casewise) variable importance for a random regression forest is the average increase in squared OOB residuals when the variable is permuted, for each observation separately.”

This is computed if localImp = TRUE in the call to randomForest.

set.seed(314)
m.lzn.rf.l <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data=meuse, 
                         localImp = TRUE, nperm = 3, 
                         na.action = na.omit, mtry = 3)
dim(m.lzn.rf.l$localImportance)
## [1]   7 155
row.names(m.lzn.rf.l$localImportance)
## [1] "ffreq"  "x"      "y"      "dist.m" "elev"   "soil"   "lime"

There is one importance measure for each variable for each observation.

View the local importance for the first few observation points:

m.lzn.rf.l$localImportance[, 1:6]
##                  1             2            3           4           5
## ffreq  0.025223725  0.0311664580  0.007239915 0.001172955 0.007573898
## x      0.003638953 -0.0005331196  0.000884300 0.023488953 0.020065479
## y      0.025311191  0.0087814013  0.014651997 0.010936406 0.010148829
## dist.m 0.152338528  0.1684531194  0.042116343 0.031325391 0.035634740
## elev   0.023182941  0.0330063161  0.003759453 0.011066830 0.003895781
## soil   0.012211841  0.0155922172  0.009372301 0.005682348 0.015348364
## lime   0.035701605  0.0443260269 -0.006084308 0.013544121 0.016970023
##                  6
## ffreq  0.006609679
## x      0.015141041
## y      0.012441277
## dist.m 0.063966507
## elev   0.016467674
## soil   0.008685162
## lime   0.010301643

We see a large difference in importance of dist.m, very important for points 1 and 2, and much less so for points 3–6.

meuse[1:6, "dist.m"]
## [1]  50  30 150 270 380 470

These two points are much closer to the river.

8.2 ranger

For this package we use the local.importance option: “Calculate and return local importance values as in (Breiman 2001)4. Only applicable if importance is set to permutation”.

This is the same permutation measure as for the overall variable importance, but computed for each observation.

Refit the model with this option, using the same random seed so that the overall model fit and importance match the first section above. The model fit will be identical.

set.seed(314)
m.lzn.ra.l <- ranger(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, 
                   data = meuse, 
                   importance = 'permutation',
                   local.importance = TRUE,
                   scale.permutation.importance = TRUE,
                   mtry = 3)
names(m.lzn.ra.l)
##  [1] "predictions"               "num.trees"                
##  [3] "num.independent.variables" "mtry"                     
##  [5] "min.node.size"             "variable.importance"      
##  [7] "variable.importance.local" "prediction.error"         
##  [9] "forest"                    "splitrule"                
## [11] "treetype"                  "r.squared"                
## [13] "call"                      "importance.mode"          
## [15] "num.samples"               "replace"                  
## [17] "dependent.variable.name"   "max.depth"
dim(m.lzn.ra.l$variable.importance.local)
## [1] 155   7

The variable.importance.local field in the fitted model results has the importance of each variable for each of the predictions.

Show the local importances for the first few observation points, compared to the global importance. Both measure the increase in OOB MSE when the predictor is permuted.

m.lzn.ra.l$variable.importance.local[1:6,]
##         ffreq           x           y     dist.m          elev        soil
## 1 0.011068076 0.005452759 0.008899231 0.05850423  0.0099005893 0.003129961
## 2 0.015304762 0.001705302 0.004220736 0.05917265  0.0142170662 0.008075754
## 3 0.004277926 0.001818228 0.015349669 0.01411915  0.0017249558 0.001603407
## 4 0.001182663 0.004961601 0.003628138 0.01120759 -0.0006275804 0.005303960
## 5 0.002316050 0.005549051 0.003685398 0.01970707  0.0028770418 0.005360852
## 6 0.001043971 0.003847687 0.002211735 0.01152574  0.0035831139 0.001049993
##           lime
## 1  0.013524257
## 2  0.020399560
## 3 -0.002937614
## 4  0.001612304
## 5  0.008106699
## 6  0.005264946
ranger::importance(m.lzn.ra.l)
##       ffreq           x           y      dist.m        elev        soil 
## 0.012904514 0.013116775 0.009516604 0.065130013 0.031742157 0.005236516 
##        lime 
## 0.007461143

We can see quite a range of importances for the predictors, depending on the point. As with randomForest, the predictor dist.m is much more important for points 1 and 2 than for 3–6.

Show the range of local variable importance for each predictor:

summary(m.lzn.ra.l$variable.importance.local[,1:7])
##      ffreq                  x                   y            
##  Min.   :-0.0351568   Min.   :-0.014961   Min.   :-0.015054  
##  1st Qu.: 0.0009273   1st Qu.: 0.001846   1st Qu.: 0.001168  
##  Median : 0.0033483   Median : 0.004529   Median : 0.003061  
##  Mean   : 0.0047760   Mean   : 0.004835   Mean   : 0.003528  
##  3rd Qu.: 0.0076180   3rd Qu.: 0.007396   3rd Qu.: 0.005084  
##  Max.   : 0.0311604   Max.   : 0.029015   Max.   : 0.022957  
##      dist.m               elev                soil           
##  Min.   :-0.029625   Min.   :-0.015450   Min.   :-0.0096089  
##  1st Qu.: 0.008341   1st Qu.: 0.004190   1st Qu.: 0.0001734  
##  Median : 0.020808   Median : 0.009619   Median : 0.0014919  
##  Mean   : 0.023971   Mean   : 0.011747   Mean   : 0.0019462  
##  3rd Qu.: 0.034208   3rd Qu.: 0.017547   3rd Qu.: 0.0031633  
##  Max.   : 0.163139   Max.   : 0.090050   Max.   : 0.0098893  
##       lime           
##  Min.   :-0.0438236  
##  1st Qu.: 0.0005832  
##  Median : 0.0019866  
##  Mean   : 0.0028013  
##  3rd Qu.: 0.0044497  
##  Max.   : 0.0478122

There is a large range in the importance, even for dist.m and elev. Notice that the overall importance of these is between the 3rd quartile and maximum, i.e., they are quite influential in less than 1/4 of cases.
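
A small sketch to quantify that: for each predictor, the fraction of observations whose local importance exceeds the overall (global) importance:

global.imp <- ranger::importance(m.lzn.ra.l)
local.imp  <- m.lzn.ra.l$variable.importance.local
# fraction of points where the local importance is above the global importance
round(sapply(colnames(local.imp),
             function(v) mean(local.imp[, v] > global.imp[v])), 2)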

What is the relation of the importance to the predictor itself? Show a scatterplot of the importance for dist.m, with a horizontal line at the overall importance.

plot(meuse$dist.m,
     m.lzn.ra.l$variable.importance.local[,"dist.m"],
     xlab = "elevation", ylab = "importance of distance to river",
     pch = 20)
abline(h = ranger::importance(m.lzn.ra.l)["dist.m"], col = "red")

Quite important for the points near the river, but not for those further away.

What about for elev?

plot(meuse$elev,
     m.lzn.ra.l$variable.importance.local[,"elev"],
     xlab = "dist.m", ylab = "importance of elevation m.a.s.l",
     pch = 20)
abline(h = ranger::importance(m.lzn.ra.l)["elev"], col = "red")

Weak to no relation.

What about flood frequency class?

plot(meuse$ffreq,
     m.lzn.ra.l$variable.importance.local[,"ffreq"],
     xlab = "ffreq", ylab = "importance of `ffreq`",
     pch = 20)
abline(h = ranger::importance(m.lzn.ra.l)["ffreq"], col = "red")

Again, important for a few points near the river.

9 Interpretable Machine Learning

We would like to look inside the “black box” of the random forest models to see if it can give insights into its predictions. Christoph Molnar has written an online text5 explaining this.

One set of interpretations is provided by the iml package.

library(iml, warn.conflicts = FALSE, verbose = FALSE)

We use this to interpret one of the random forests built above.

9.1 Create an iml::Predictor object

“The interpretation methods in the iml package need the machine learning model to be wrapped in a Predictor object.” This object “… holds any machine learning model (mlr, caret, randomForest, …) and the data to be used for analyzing the model.”

Create a wrapper function for predict.ranger that matches the format iml expects. This is needed because predict.ranger does not follow the typical predict argument naming convention: it takes data= rather than newdata=.

# Define a prediction function compatible with iml
predict_iml <- function(model, newdata, type = "response", ...) {
  preds <- predict(model, data = newdata, type = type)$predictions
  if (type == "response") {
    return(preds)
  } else {
    return(preds[, 2]) # For probability, not used here
  }
}

(The ranger package defines a method predict.ranger for objects of class ranger, but it does not export a function called predict. Instead, predict.ranger is dispatched automatically from the generic predict when the object is of class ranger. So don’t use ranger::predict, just predict.)

Now use it for the model fitted by ranger, i.e., make a Predictor object.

Note that the predict.function argument is needed if the model is not from either the mlr or caret packages.

vars <- c("ffreq","x","y","dist.m","elev","soil","lime")
X <-  meuse[, vars]
predictor.ra <- iml::Predictor$new(model = m.lzn.ra, 
                                   data = X, 
                                   y = meuse$logZn,
                                   predict.function = predict_iml)
str(predictor.ra)
## Classes 'Predictor', 'R6' <Predictor>
##   Public:
##     batch.size: 1000
##     class: NULL
##     clone: function (deep = FALSE) 
##     data: Data, R6
##     initialize: function (model = NULL, data = NULL, predict.function = NULL, 
##     model: ranger
##     predict: function (newdata) 
##     prediction.colnames: NULL
##     prediction.function: function (newdata) 
##     print: function () 
##     task: unknown
##   Private:
##     predictionChecked: FALSE

Also make a Predictor object for the ranger QRF model:

predictor.qrf <- iml::Predictor$new(model = m.lzn.qrf,
                                    data = X, 
                                    y = meuse$logZn,
                                    predict.function = predict_iml)

9.2 Feature importance

Feature importance, based on mae (mean absolute error) or mse (mean squared error). Plot the median and the distribution (5–95% quantiles over all the cases).

This uses iml::plot.FeatureImp.

imp.mae <- iml::FeatureImp$new(predictor.ra, loss = "mae")
imp.mse <- iml::FeatureImp$new(predictor.ra, loss = "mse")
plot(imp.mae)

plot(imp.mse)

print(imp.mae)
## Interpretation method:  FeatureImp 
## error function: mae
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 155 rows and 7 columns.
## 
## 
## Head of results:
##   feature importance.05 importance importance.95 permutation.error
## 1  dist.m      3.545983   3.694893      3.783031        0.18562503
## 2    elev      2.310812   2.614338      2.670165        0.13133980
## 3       x      1.716875   1.735522      1.818265        0.08718962
## 4   ffreq      1.499996   1.519180      1.610322        0.07632099
## 5       y      1.423054   1.505106      1.523621        0.07561394
## 6    lime      1.360322   1.375500      1.441221        0.06910274
print(imp.mse)
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 155 rows and 7 columns.
## 
## 
## Head of results:
##   feature importance.05 importance importance.95 permutation.error
## 1  dist.m     13.921234  14.877915     15.801162       0.062883429
## 2    elev      5.563535   6.367359      7.097176       0.026912465
## 3       x      2.862070   3.211992      3.338609       0.013575897
## 4   ffreq      2.263704   2.536549      2.857294       0.010721054
## 5    lime      1.903443   2.151855      2.274674       0.009095095
## 6       y      2.062338   2.148588      2.313753       0.009081283

The plots show the spread of the importance estimates for each predictor, under the given loss function. Predictor dist.m has by far the largest (and most variable) importance, whereas x and y have smaller and more similar importances.

Another way to examine feature effects is with the FeatureEffect object:

  1. accumulated local effect (ALE); “ALE plots are a faster and unbiased alternative to partial dependence plots (PDPs).”6

  2. Partial Dependence Plots (PDP): this shows the average prediction as a function of the feature, marginalizing (averaging) over the other predictors.7

  3. Individual Conditional Expectation (ICE) curves – one per observation: “one line per instance that shows how the instance’s prediction changes when a feature changes”.8 This is in effect a PDP curve per observation; the PDP is their average.

See with plot.FeatureEffect. Create the object with $new, specifying a method, then plot.

ale <- FeatureEffect$new(predictor.ra, feature = "elev")
ale$plot()

pdp <- FeatureEffect$new(predictor.ra, feature = "elev", method = "pdp")
pdp$plot()

ice <- FeatureEffect$new(predictor.ra, feature = "elev", method = "ice")
ice$plot()

The ICE graph shows some interesting behaviour in the 7–8.5 m elevation range. But in general the curves follow the average (PDP) curve; we don’t have any really unusual observations in terms of their dependence on elevation.

See this for distance:

ice.dist.m <- FeatureEffect$new(predictor.ra, feature = "dist.m", method = "ice")
ice.dist.m$plot()

There is an interesting contrasting behaviour from 0 to 50 m: some with high values at 0 increase substantially at 50, while most decrease, as expected.

9.2.1 Interactions

“The interaction measure regards how much of the variance of f(x) is explained by the interaction. The measure is between 0 (no interaction) and 1 (= 100% of variance of f(x) due to interactions). For each feature, we measure how much they interact with any other feature.”9

“When features interact with each other in a prediction model, the prediction cannot be expressed as the sum of the feature effects, because the effect of one feature depends on the value of the other feature.”

interact <- Interaction$new(predictor.ra)
plot(interact)

This shows that the predictors interact a fair amount – none are independent. The obvious example is distance and elevation, which we can see in a 2-way interaction plot:

Look at the 2-way interactions with a predictor, e.g., elevation:

interact.elev <- Interaction$new(predictor.ra, feature = "elev")
plot(interact.elev)

This shows strong interactions between elevation and the others, especially dist.m, in prediction \(\log_{10}\mathrm{Zn}\).

9.3 Explain single predictions with a local model

This uses the glmnet “Lasso and Elastic-Net Regularized Generalized Linear Models” package, also the gower “Gower’s Distance” package, based on Gower (1971).10

“The … model fitted by LocalModel is a linear regression model and the data points are weighted by how close they are to the data point for w[h]ich we want to explain the prediction.” Here ‘close’ means in multivariate feature space, hence the use of Gower’s distance.

From ?LocalModel: “A weighted glm is fitted with the machine learning model prediction as target. Data points are weighted by their proximity to the instance to be explained, using the Gower proximity measure. L1-regularization is used to make the results sparse.” (hence the use of glmnet).

So this is a linear regression model; therefore we can interpret its fitted coefficients to see local effects. However, the model only uses “close-by” points in feature space.

Look at this for the first point, which happens to be close to the river and with a high level of Zn.

meuse[1,]
##        x      y cadmium copper lead zinc  elev       dist   om ffreq soil lime
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1    1
##   landuse dist.m    logZn
## 1      Ah     50 3.009451
lime.explain <- iml::LocalModel$new(predictor.ra, x.interest = X[1, ])
## Loading required package: glmnet
## Loaded glmnet 4.1-8
## Loading required package: gower
lime.explain$results
##                 beta x.recoded      effect x.original feature feature.value
## dist.m -0.0005252513    50.000 -0.02626257         50  dist.m     dist.m=50
## elev   -0.0469173108     7.909 -0.37106901      7.909    elev    elev=7.909
## lime=1  0.0790714563     1.000  0.07907146          1  lime=1        lime=1
plot(lime.explain)

The local model at this point predicts a lower value than the actual prediction from the full model. The main reason is the elevation of the surrounding points: these differ from the target point but affect the local prediction.

Look at this for one of the furthest points from the river:

ix.maxdist <- which.max(meuse$dist.m)
meuse[ix.maxdist, ]
##          x      y cadmium copper lead zinc  elev     dist  om ffreq soil lime
## 111 180451 331473     0.2     18   50  117 9.095 0.809742 5.3     2    3    0
##     landuse dist.m    logZn
## 111      Fw   1000 2.068186
lime.explain <- LocalModel$new(predictor.ra, x.interest = X[ix.maxdist, ])
lime.explain$results
##                 beta x.recoded      effect x.original feature feature.value
## dist.m -0.0004749539  1000.000 -0.47495392       1000  dist.m   dist.m=1000
## elev   -0.0646175574     9.095 -0.58769668      9.095    elev    elev=9.095
## lime=0 -0.0772825118     1.000 -0.07728251          0  lime=0        lime=0
plot(lime.explain)

Here both distance and elevation have a strong effect, but the local and full-model predictions are the same.

9.4 Shapley values

The local importance metrics explored in the previous sections do not account for the full set of interactions at a data point. The Shapley method corrects this. It is available in several packages, including iml. An excellent intuitive explanation is given by Molnar in his “Interpretable Machine Learning” on-line text.11

“[A] method from coalitional game theory named Shapley value. Assume that for one data point, the feature values play a game together, in which they get the prediction as a payout. The Shapley value tells us how to fairly distribute the payout among the feature values.

“The ‘game’ is the prediction task for a single instance of the dataset. The ‘gain’ is the actual prediction for this instance minus the average prediction for all instances. The ‘players’ are the feature values of the instance that collaborate to receive the gain (= predict a certain value).

“The Shapley value is the average marginal contribution of a feature value across all possible coalitions.

“The Shapley value is the only explanation method with a solid theory. The axioms – efficiency, symmetry, dummy, additivity – give the explanation a reasonable foundation.”12

The Shapley value is a solution for computing feature contributions for single predictions for any machine learning model. It is defined via a value function \({val}\) of players in \(S\). The Shapley value of a feature value is its contribution to the payout, weighted and summed over all possible feature value combinations:

\[\phi_j(val)=\sum_{S\subseteq\{1,\ldots,p\} \backslash \{j\}}\frac{|S|!\left(p-|S|-1\right)!}{p!}\left(val\left(S\cup\{j\}\right)-val(S)\right)\]

where \(S\) is a subset of the features used in the model, \(x\) is the vector of feature values of the instance to be explained and \(p\) the number of features. \(val_x(S)\) is the prediction for feature values in set \(S\) that are marginalized over features that are not included in set \(S\): \[val_{x}(S)=\int\hat{f}(x_{1},\ldots,x_{p})d\mathbb{P}_{x\notin{}S}-E_X(\hat{f}(X))\]
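For example, with \(p = 3\) features and a fixed \(j\), the sum runs over the coalitions \(S = \emptyset\), the two single-feature sets, and the one two-feature set, with weights \(0!\,2!/3! = 1/3\), \(1!\,1!/3! = 1/6\) (each), and \(2!\,0!/3! = 1/3\) respectively; the weights sum to 1.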

Note that the Shapley values are dependent on the reference dataset which is used to compute the marginal effects. This can be the entire dataset, but also some meaningful subset, in which case we get the importance of the predictors within only that subset.
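As an illustration of using a subset as the reference, one could build a predictor whose reference data are only the points in flood-frequency class 1 and compute the Shapley values of the first observation against that subset. This is only a sketch: pred_fun, in.f1, predictor.f1 and shapley.f1 are illustrative names, and the prediction function is supplied explicitly in case predictor.ra was constructed differently above.

pred_fun <- function(model, newdata) predict(model, data = newdata)$predictions
in.f1 <- meuse$ffreq == "1"
predictor.f1 <- iml::Predictor$new(m.lzn.ra, data = X[in.f1, ],
                                   y = meuse$logZn[in.f1],
                                   predict.function = pred_fun)
shapley.f1 <- iml::Shapley$new(predictor.f1, x.interest = X[1, ])
shapley.f1$plot()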

Compute and display the Shapley values for the predictive model in predictor.ra, i.e. the fitted ranger random forest model, for the first-listed observation, and for the maximum-distance observation.

set.seed(314)
shapley <- iml::Shapley$new(predictor.ra, x.interest = X[1, ])
shapley$plot()

shapley.maxdist <- Shapley$new(predictor.ra, x.interest = X[ix.maxdist, ])
shapley.maxdist$plot()

This figure shows the actual value of each predictor at the observation point, and its \(\phi\) value, i.e., the numerical contribution of that predictor to the difference between this point’s prediction and the average prediction. These contributions sum to the difference.

List the results as a table:

(results <- shapley$results)
##   feature         phi      phi.var feature.value
## 1   ffreq 0.032337223 0.0024784950       ffreq=1
## 2       x 0.005993525 0.0017357917      x=181072
## 3       y 0.023750034 0.0011312287      y=333611
## 4  dist.m 0.204097527 0.0332688137     dist.m=50
## 5    elev 0.025422614 0.0079663934    elev=7.909
## 6    soil 0.015721168 0.0006200223        soil=1
## 7    lime 0.050894354 0.0039251067        lime=1
sum(shapley$results$phi)
## [1] 0.3582164
(results.maxdist <- shapley.maxdist$results)
##   feature          phi      phi.var feature.value
## 1   ffreq -0.049214283 0.0025348477       ffreq=2
## 2       x -0.023238327 0.0050728046      x=180451
## 3       y -0.035464012 0.0011016137      y=331473
## 4  dist.m -0.198119934 0.0320157016   dist.m=1000
## 5    elev -0.094321942 0.0100878821    elev=9.095
## 6    soil -0.044662497 0.0008986534        soil=3
## 7    lime -0.004580032 0.0005639863        lime=0
sum(shapley.maxdist$results$phi)
## [1] -0.449601

For the point near the river the prediction is higher than the average prediction; the difference is the sum of the individual \(\phi\) values. The main positive contribution is from distance: this point is closer to the river than most. For the distant point the prediction is lower than the average; this is mainly because of distance (the point is much further from the river) and elevation (it is higher than most), and these contribute most of the lower value.

“Be careful to interpret the Shapley value correctly: The Shapley value is the average contribution of a feature value to the prediction in different coalitions. The Shapley value is NOT the difference in prediction when we would remove the feature from the model.”

10 SHAP – (SHapley Additive exPlanations)

Another approach to Shapley values is SHapley Additive exPlanations (SHAP) by Lundberg and Lee (2017)13. Here the Shapley value explanation is represented as an additive feature attribution method, i.e., as a linear model.

The original method for SHAP is KernelSHAP (see next section), but Lundberg et al.14 proposed TreeSHAP, a variant of SHAP for tree-based machine learning models including random forests. It is a fast, model-specific alternative to the more general KernelSHAP.

One R package for computing SHAP values is fastshap by Brandon Greenwell15. This is based on the work of Štrumbelj & Kononenko.16

This is an approximation using Monte-Carlo sampling:

\[ \hat{\phi}_{j} = \frac{1}{M}\sum_{m=1}^M\left(\hat{f}(\mathbf{x}^{(m)}_{+j}) - \hat{f}(\mathbf{x}^{(m)}_{-j})\right) \]

where \(\hat{f}(\mathbf{x}^{(m)}_{+j})\) is the prediction for the point \(\mathbf{x}\) to be explained, but with a random number of its feature values replaced by the corresponding values from a randomly drawn data point, except for feature \(j\) itself, i.e., the feature for which we want the SHAP value, which keeps its original value. From this is subtracted the same prediction but with the value of feature \(j\) also taken from the random data point. This isolates the effect of that feature within the original data point.
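To make the formula concrete, here is a minimal, deliberately naive implementation of this Monte-Carlo estimate for a single feature of a single instance; fastshap does this far more efficiently, and shap_mc and pred1 are purely illustrative names. It assumes the ranger fit m.lzn.ra and the feature data frame X from earlier.

pred1 <- function(object, newdata) predict(object, data = newdata)$predictions

shap_mc <- function(object, X, x0, j, M = 50) {
  p <- ncol(X)
  phi <- numeric(M)
  for (m in seq_len(M)) {
    z <- X[sample(nrow(X), 1), ]          # a random data point
    o <- sample(p)                        # a random ordering of the features
    from.x0 <- o[seq_len(which(o == j))]  # features taken from x0, up to and including j
    x.plus <- z
    x.plus[from.x0] <- x0[from.x0]        # feature j (and those before it) from x0
    x.minus <- x.plus
    x.minus[j] <- z[j]                    # same, but feature j from the random point
    phi[m] <- pred1(object, x.plus) - pred1(object, x.minus)
  }
  mean(phi)
}

# e.g., the estimated contribution of dist.m to the first prediction
set.seed(314)
shap_mc(m.lzn.ra, X, X[1, ], j = which(names(X) == "dist.m"))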

10.1 fastshap.explain

The key data structure for fastshap is an object of class explain. This is computed by the fastshap::explain function, which requires a prediction function. In this case it wraps ranger::predict.ranger (called via the generic predict, since the object is a ranger model) and extracts the $predictions element.

Note the use of nsim: “To obtain the most accurate results, nsim should be set as large as feasibly possible.”

library(fastshap, warn.conflicts = FALSE, verbose = FALSE)
pfun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}
set.seed(314)
fshap <- fastshap::explain(object = m.lzn.ra, 
                  X = X, pred_wrapper = pfun,
                  nsim = 48)
class(fshap) 
## [1] "explain" "matrix"  "array"

Each observation has a set of Shapley values, one per predictor. These are the contributions to the difference between that observation's prediction and the average prediction over all observations. For the first (closest) point, there is a large positive contribution to the difference from dist.m and elev.
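Since fshap is essentially a numeric matrix (one row per observation, one column per predictor), a quick global summary is the mean absolute Shapley value per column; a one-line sketch (output not shown):

# mean absolute Shapley value per predictor, a simple global importance
sort(colMeans(abs(fshap)), decreasing = TRUE)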

How much do these differ from those computed by iml? Examine for the first observation

set.seed(314)
shapley <- iml::Shapley$new(predictor.ra, x.interest = X[1, ])
print(data.frame(fastshap = fshap[1,], iml = shapley$results$phi))
##           fastshap         iml
## ffreq   0.02893541 0.032337223
## x      -0.01155836 0.005993525
## y       0.02644784 0.023750034
## dist.m  0.24578811 0.204097527
## elev    0.05502968 0.025422614
## soil    0.01383911 0.015721168
## lime    0.05974382 0.050894354

Similar but not identical. The SHAP method using fastshap finds more importance for most predictors.

10.2 Visualizing Shapley values

The shapviz package can be used to visualize the Shapley values. Load the package and create a shapviz object from the fastshap explanation result.

library(shapviz, warn.conflicts = FALSE, verbose = FALSE)
shv <- shapviz(fshap, X = X)
class(shv)
## [1] "shapviz"

Now we can use the several functions in shapviz.

First, plot the mean absolute Shapley values over all points, i.e., a global variable importance calculated from the individual contributions.

sv_importance(shv, kind = "bar", show_numbers = TRUE)

We can see all individual importances with the “beeswarm” version of this plot:

sv_importance(shv, kind = "beeswarm")

This shows clearly that at small values of dist.m the prediction is positively influenced by this predictor, and the reverse is true (but less strikingly) at large distances.

We can see the effect of each predictor, compared to its value, with a “dependence” plot. Plot the dependence on elevation for all observations, using the sv_dependence function. We can get some idea of the interactions by colouring the points by the value of another predictor. Here we select the flood frequency.

sv_dependence(shv, v = "elev", color_var = "ffreq")

sv_dependence(shv, v = "elev", color_var = "dist.m")

A consistent story: the influence of elevation on the prediction is substantial for both low and high points, and the relation is almost linear. The Shapley value is positive for low elevations and negative for high elevations. Flood frequency is not well-related to the Shapley values except for the lowest points.

And on distance to river for all observations, with both flood frequency and elevation shown as interactions:

sv_dependence(shv, v = "dist.m", color_var = "ffreq")

sv_dependence(shv, v = "dist.m", color_var = "elev")

We see that for nearby points there is a strong positive influence of distance, i.e., the distance is very important to those predictions. For points \(>250\) m from the river there is also an influence, but negative.

The contribution of each predictor for the closest and furthest points can be visualized using the sv_waterfall function, or for more compact display, the sv_force function:

sv_waterfall(shv, row_id = 1)

sv_waterfall(shv, row_id = ix.maxdist)

sv_force(shv, row_id = 1)

sv_force(shv, row_id = ix.maxdist)

For this simple model these do not differ much from the Shapley values computed by iml.

11 Kernel SHAP

Another approach to compute Shapley values is the kernelshap package, which is an efficient implementation of the Kernel SHAP algorithm. See ?kernelshap for a detailed explanation.

This requires (1) a fitted model, (2) a set of features for which to display the values, with observation values at each point, and (3) background data to integrate out “switched off” features. By default this is a random sample of 200 rows from the observations. These are the observations that will be used to randomize and estimate the effects within a coalition.

The X argument gives the points for which to compute Shapley values. In a small dataset this can be all the points.

The bg_X argument is the “background” data; if NULL, the bg_n “size of background data” argument is used to sample the background from X. Here again we have a small dataset, so we could use all the points, but more background points mean more computation time, so we restrict it to about two-thirds of the dataset. (Passing the background sample explicitly is sketched below.)

library(kernelshap, warn.conflicts = FALSE, verbose = FALSE)
# use existing ranger fits
dim(X)
## [1] 155   7
system.time(s <- kernelshap(m.lzn.ra.l, 
                X = X,
                bg_X = NULL,
                bg_n = 100,
                verbose = FALSE)
)
##    user  system elapsed 
##  72.896   2.362  38.321
str(s)
## List of 15
##  $ S          : num [1:155, 1:7] 0.0415 0.0396 0.0331 0.0263 0.027 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:7] "ffreq" "x" "y" "dist.m" ...
##  $ X          :'data.frame': 155 obs. of  7 variables:
##   ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ x     : num [1:155] 181072 181025 181165 181298 181307 ...
##   ..$ y     : num [1:155] 333611 333558 333537 333484 333330 ...
##   ..$ dist.m: num [1:155] 50 30 150 270 380 470 240 120 240 420 ...
##   ..$ elev  : num [1:155] 7.91 6.98 7.8 7.66 7.48 ...
##   ..$ soil  : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
##   ..$ lime  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ baseline   : num 2.53
##  $ bg_X       :'data.frame': 100 obs. of  7 variables:
##   ..$ ffreq : Factor w/ 3 levels "1","2","3": 2 2 1 2 1 1 2 1 1 1 ...
##   ..$ x     : num [1:100] 180114 179881 180283 178803 180530 ...
##   ..$ y     : num [1:100] 330803 330912 332014 330349 332538 ...
##   ..$ dist.m: num [1:100] 500 650 320 280 60 50 340 220 10 80 ...
##   ..$ elev  : num [1:100] 9.4 10.52 7.78 8.66 7.78 ...
##   ..$ soil  : Factor w/ 3 levels "1","2","3": 3 3 2 1 1 1 1 1 1 1 ...
##   ..$ lime  : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 2 2 ...
##  $ bg_w       : NULL
##  $ SE         : num [1:155, 1:7] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:7] "ffreq" "x" "y" "dist.m" ...
##  $ n_iter     : int [1:155] 1 1 1 1 1 1 1 1 1 1 ...
##  $ converged  : logi [1:155] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ m          : int 14
##  $ m_exact    : int 126
##  $ prop_exact : num 1
##  $ exact      : logi TRUE
##  $ txt        : chr "Exact Kernel SHAP values"
##  $ predictions: num [1:155, 1] 2.99 3.02 2.75 2.47 2.44 ...
##  $ algorithm  : chr "kernelshap"
##  - attr(*, "class")= chr "kernelshap"
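As mentioned above, the background data can also be supplied explicitly rather than sampled internally via bg_n; a minimal sketch (bg and s.bg are illustrative names):

set.seed(314)
bg <- X[sample(nrow(X), 100), ]   # explicit background sample, about 2/3 of the data
s.bg <- kernelshap(m.lzn.ra.l, X = X, bg_X = bg, verbose = FALSE)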

Compute the visualization object and show two diagnostic plots. Note that shapviz understands kernelshap objects, so there is no need to re-format the Shapley values matrix.

First, the importances of each predictor at each point:

sv <- shapviz(s)
sv_importance(sv, kind = "bee")

Second, selected points:

sv_force(sv, row_id = 1)

sv_force(sv, row_id = ix.maxdist)

Third, some dependence plots:

# colour by elevation, an important interacting feature
sv_dependence(sv, v = "dist.m", color_var = "elev")

sv_dependence(sv, v = "dist.m", color_var = "ffreq")

All of these are slightly different from the plots based on the fastshap computation. This is because kernelshap uses a different algorithm (here, exact Kernel SHAP values) to compute the Shapley values.


  1. Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32↩︎

  2. Dumelle, M., Higham, M., & Hoef, J. M. V. (2023). spmodel: Spatial statistical modeling and prediction in R. PLOS ONE, 18(3), e0282524. https://doi.org/10.1371/journal.pone.0282524↩︎

  3. Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983-999, DOI:10.5555/1248547.1248582↩︎

  4. Breiman, L. (2001). Random forests. Mach Learn, 45:5-32. doi: 10.1023/A:1010933404324↩︎

  5. Molnar, C. (2022). Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/↩︎

  6. https://christophm.github.io/interpretable-ml-book/ale.html↩︎

  7. https://christophm.github.io/interpretable-ml-book/pdp.html↩︎

  8. see https://christophm.github.io/interpretable-ml-book/ice.html↩︎

  9. https://christophm.github.io/interpretable-ml-book/interaction.html↩︎

  10. Gower, John C. “A general coefficient of similarity and some of its properties.” Biometrics (1971): 857-871↩︎

  11. https://christophm.github.io/interpretable-ml-book/shapley.html↩︎

  12. https://christophm.github.io/interpretable-ml-book/shapley.html↩︎

  13. Lundberg, Scott M., and Su-In Lee. “A unified approach to interpreting model predictions.” Advances in Neural Information Processing Systems (2017)↩︎

  14. Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. “Consistent individualized feature attribution for tree ensembles.” arXiv preprint arXiv:1802.03888 (2018)↩︎

  15. https://github.com/bgreenwell/fastshap↩︎

  16. Štrumbelj, E., & Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3), 647-665. https://doi.org/10.1007/s10115-013-0679-x↩︎