---
title: "Meuse heavy metals exercise — CART and random forests"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    theme: "lumen"
    code_folding: show
    number_sections: TRUE
    fig_height: 4
    fig_width: 6
    fig_align: 'center'
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=4, fig.height=6, fig.align='center',
fig.path='rmdfigs/exMeuseTreesForests_',
warning=FALSE, message=FALSE)
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```
In this exercise we return to the Meuse heavy metals dataset.
# Dataset
Packages used in this section: `sp` for the sample dataset.
```{r}
library(sp)
```
Sample point dataset, comes with `sp`; make a log10-transformed copy of the metal concentrations.
Also load the prediction grid: locations where the target variable is unknown, and at which we want to predict it using a model built (trained, calibrated) on the known points, where both the target variable and the predictors are known.
```{r}
data(meuse)
str(meuse)
meuse$logZn <- log10(meuse$zinc)    # zinc
meuse$logCu <- log10(meuse$copper)  # copper
meuse$logPb <- log10(meuse$lead)    # lead
meuse$logCd <- log10(meuse$cadmium) # cadmium
data(meuse.grid)
str(meuse.grid)
```
# Regression trees
The objective is to model the log10(Zn) concentration from covariables, not including the other metals. One approach is the _regression tree_, which splits the target according to thresholds of predictor variables.
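To build intuition for how a regression tree chooses a split, here is a minimal base-R sketch on toy data (not the Meuse set): for each candidate threshold of a single predictor, compute the residual sum of squares of the two resulting groups; the best split minimizes this.

```r
# Toy data: a single predictor with an obvious break between 3 and 10
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 19)
# candidate thresholds: midpoints between consecutive sorted x values
xs <- sort(x)
thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
# residual sum of squares of the two groups, for each threshold
rss <- sapply(thresholds, function(t) {
  left <- y[x < t]; right <- y[x >= t]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
})
data.frame(threshold = thresholds, rss = rss)
thresholds[which.min(rss)]   # the best split falls between 3 and 10
```

`rpart` does the same search over every predictor at every node, choosing the predictor/threshold pair with the largest reduction in sum of squares.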
Reference: Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. Boca Raton, FL: Chapman & Hall, CRC Press.
Packages used in this section: `sp` for spatial objects, `rpart` for regression trees, `rpart.plot` for tree plots.
```{r}
library(sp)
library(rpart)
library(rpart.plot)
```
```{r}
names(meuse)
names(meuse.grid)
```
Q: What covariables are available in the point dataset? Are these all available in the prediction grid?
We will try to model the log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether limestone has been added to the site, and elevation above sea level. We also use the coordinates to capture any geographic trend.
Q: Why could these be good predictors of the metal concentration in this site?
Q: Do you see any problem in using so many predictors? Would you do this in a linear model?
## Building a regression tree
We will use the `rpart` function to build the regression tree. This uses a helper function `rpart.control` to control the behaviour of `rpart`.
Examine the help for `rpart.control`.
```{r}
help(rpart.control)
```
Q: What is the meaning of the `minsplit` and `cp` parameters?
Build a regression tree for this, print its results, and display as a tree plot. Allow splits all the way to single observations, if the $R^2$ increases by at least 0.5% at the split.
```{r fig.width=9, fig.height=5}
m.lzn.rp <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse,
minsplit=2,
cp=0.005)
print(m.lzn.rp)
rpart.plot(m.lzn.rp, digits=3, type=4, extra=1)
```
Q: Which variables were actually used in the tree?
Q: Which variable was used at the first split?
Q: Does that agree with your theory of the origin of the metals in these soils?
Q: What was the value of the splitting variable at that split?
Q: What was the average log10(Zn) before the split, and in the two leaves after the split?
Q: How many leaves (i.e., different predictions)? What proportion of the observations is this?
Q: What is the minimum number of observations among the leaves? The maximum?
## Variable importance
Now we examine the variable importance:
```{r}
print(tmp <- m.lzn.rp$variable.importance)
data.frame(variableImportance = 100 * tmp / sum(tmp))
```
Q: Which variable explained most of the metal concentration? What proportion? Were all predictor variables used somewhere in the tree?
## Pruning the regression tree
Have we built a tree that is too complex, i.e., fitting noise not signal? In other words, fitting our dataset well but not the process?
Examine the cross-validation relative error vs. the complexity parameter.
Q: What complexity parameter minimizes the cross-validation relative error?
We can also compute this, rather than just looking at the graph.
```{r fig.width=6, fig.height=3}
plotcp(m.lzn.rp)
cp.table <- m.lzn.rp[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
cp.min <- cp.table[cp.ix,"CP"]
```
(Note that some other cross-validation errors are quite close to this minimum, so we might choose a higher value of the complexity parameter to obtain a more parsimonious model, which might extrapolate better.)
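One conventional way to formalize that choice is the "1-SE rule": select the most parsimonious tree whose cross-validation error is within one standard error of the minimum. A sketch, using the `cptable` components shown above:

```r
# 1-SE rule: largest CP whose xerror is within one xstd of the minimum
cp.table <- m.lzn.rp[["cptable"]]
ix.min <- which.min(cp.table[, "xerror"])
threshold.1se <- cp.table[ix.min, "xerror"] + cp.table[ix.min, "xstd"]
ix.1se <- min(which(cp.table[, "xerror"] <= threshold.1se))
(cp.1se <- cp.table[ix.1se, "CP"])
# prune(m.lzn.rp, cp = cp.1se) would then give the more parsimonious tree
```

We continue below with the minimum-xerror choice; the 1-SE tree is an alternative worth comparing.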
Prune the tree with this complexity parameter, and examine in the same way as above.
(Note we would get the same result if we re-fit with this parameter.)
```{r}
(m.lzn.rpp <- prune(m.lzn.rp, cp=cp.min))
rpart.plot(m.lzn.rpp, digits=3, type=4, extra=1)
```
Q: How does this tree differ from the full (un-pruned tree)? Look at the predictors actually used, number of splits, number of leaves, number of observations in each leaf.
Q: Which leaf predicts the highest concentration of log10(Zn)? Explain that by reading the predictors and their values down the tree from the root to this leaf. Does this fit your theory of the origin of these metals?
## Goodness-of-fit
Now, compare the model predictions to the known values:
```{r fig.width=5, fig.height=5}
p.rpp <- predict(m.lzn.rpp, newdata=meuse)
length(unique(p.rpp))
summary(r.rpp <- meuse$logZn - p.rpp)
sqrt(sum(r.rpp^2)/length(r.rpp))
plot(meuse$logZn ~ p.rpp, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Regression Tree")
grid()
abline(0,1)
```
Q: What is the RMSE?
Q: Why are there vertical lines at all the fitted values?
# Classification tree
Another use of CART is to predict a category. Here we try to predict whether a site is in flood frequency class 1, 2, or 3. If this is a successful model, we could use it to map flood frequency, if we didn't already have this map.
Q: What covariables might help predict flooding frequency?
## Building a classification tree
Flooding is influenced by the distance from the river and the elevation above the river. Build a model with these, using a complexity parameter of 2% increase in $R^2$.
```{r fig.width=9, fig.height=5}
m.ff.rp <- rpart(ffreq ~ x + y + dist.m + elev,
data=meuse,
minsplit=2,
cp=0.02)
print(m.ff.rp)
rpart.plot(m.ff.rp)
```
Q: Were both predictors used?
Q: What was the first split?
## Pruning the classification tree
Find the value of the complexity parameter that minimizes the cross-validation error, and prune the tree with that value.
```{r fig.width=5, fig.height=3}
printcp(m.ff.rp)
plotcp(m.ff.rp)
cp.table.class <- m.ff.rp[["cptable"]]
cp.ix <- which.min(cp.table.class[,"xerror"])
print(cp.table.class[cp.ix,])
(cp.min <- cp.table.class[cp.ix,"CP"])
m.ff.rpp <- prune(m.ff.rp, cp=cp.min)
print(m.ff.rpp)
```
```{r}
rpart.plot(m.ff.rpp, type=2, extra=1)
```
Q: Now how many splits? Which variables were used? Are all classes predicted? How pure are the leaves?
Q: How useful is this model? In other words, how well could flooding frequency class be mapped from knowing the distance to river and elevation?
Variable importance:
```{r}
m.ff.rp$variable.importance
m.ff.rpp$variable.importance
```
In both the unpruned and pruned trees the y-coordinate and elevation are most important.
# Mapping over the grid
```{r}
names(meuse.grid)
```
Q: What variables are available for mapping over the grid?
The grid only has coordinates, relative distance, soil type, and flood frequency. So if we want a model to map log(Zn) over the grid, we can only use these variables.
Build and prune a regression tree for log(Zn) with these predictors; display the pruned tree and the variable importance.
```{r fig.width=7, fig.height=4}
m.lzn.g <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse,
minsplit=2,
cp=0.005)
plotcp(m.lzn.g)
```
```{r fig.width=9, fig.height=5}
rpart.plot(m.lzn.g, digits=3, type=4, extra=1)
cp.table <- m.lzn.g[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
(cp.min <- cp.table[cp.ix,"CP"])
```
Now prune the tree to the indicated complexity:
```{r}
m.lzn.gp <- prune(m.lzn.g, cp=cp.min)
rpart.plot(m.lzn.gp)
print(tmp <- m.lzn.gp$variable.importance)
data.frame(variableImportance = 100 * tmp / sum(tmp))
```
Q: Which predictors were used? What was their relative importance?
Q: Why is the x-coordinate important in this dataset? Do you think this would be important in other parts of the flood plain?
Now we can map with this.
First, make an `sp` version of the grid, with the coordinates having special status, i.e., not as feature-space attributes:
```{r}
meuse.grid.sp <- meuse.grid
coordinates(meuse.grid.sp) <- ~ x + y; gridded(meuse.grid.sp) <- TRUE
```
```{r}
meuse.grid.sp$predictedLog10Zn <- predict(m.lzn.gp, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn")
```
Q: How realistic is this map?
# The variability of regression trees
Single regression trees can vary quite a bit depending on small differences in the dataset. Convince yourself of this.
Remove a random 15 of the 155 points, then fit and prune the first tree we attempted, i.e., log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether limestone has been added, and elevation above sea level.
```{r fig.width=9, fig.height=5}
meuse.140.ix <- sample(1:dim(meuse)[1], size=140, replace=FALSE)
meuse.140 <- meuse[meuse.140.ix,]
m.lzn.rp.140 <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse.140,
minsplit=2,
cp=0.005)
rpart.plot(m.lzn.rp.140, digits=3, type=4, extra=1)
print(tmp <- m.lzn.rp.140$variable.importance)
data.frame(variableImportance = 100 * tmp / sum(tmp))
```
Prune the tree:
```{r fig.width=7, fig.height=4}
plotcp(m.lzn.rp.140)
cp.table <- m.lzn.rp.140[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
cp.min <- cp.table[cp.ix,"CP"]
m.lzn.rpp.140 <- prune(m.lzn.rp.140, cp=cp.min)
```
```{r}
rpart.plot(m.lzn.rpp.140, digits=3, type=4, extra=1)
```
Q: What are the similarities and differences between your tree and the one we did at the beginning of the exercise? Consider the "optimum" complexity parameter, number of leaves and split, where the predictors are used, which predictors are used.
We can do this also for the map:
```{r}
m.lzn.g.140 <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse.140,
minsplit=2,
cp=0.005)
```
Prune the tree to the indicated complexity:
```{r fig.width=7, fig.height=4}
plotcp(m.lzn.g.140)
rpart.plot(m.lzn.g.140, digits=3, type=4, extra=1)
cp.table <- m.lzn.g.140[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
(cp.min <- cp.table[cp.ix,"CP"])
m.lzn.gp.140 <- prune(m.lzn.g.140, cp=cp.min)
```
```{r fig.width=9, fig.height=5}
rpart.plot(m.lzn.gp.140)
print(tmp <- m.lzn.gp.140$variable.importance)
data.frame(variableImportance = 100 * tmp / sum(tmp))
```
Compute and display the map using the pruned tree:
```{r}
meuse.grid.sp$predictedLog10Zn.140 <- predict(m.lzn.gp.140, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.140")
```
Compute and display the differences in predictions:
```{r}
meuse.grid.sp$predictedLog10Zn.diff <-
meuse.grid.sp$predictedLog10Zn - meuse.grid.sp$predictedLog10Zn.140
summary(meuse.grid.sp$predictedLog10Zn.diff)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.diff", col.regions=topo.colors(36))
```
Q: How different are the maps made with the regression tree based on all 155 observations, and the one made with 140 randomly-selected observations?
Can you explain the spatial pattern of the differences? Hint: plot the locations of the omitted points.
```{r fig.width=4, fig.height=5}
meuse.omitted <- meuse[-meuse.140.ix,]
plot(y ~ x, data=meuse.omitted, asp=1,
cex=2*meuse.omitted$logZn/max(meuse.omitted$logZn))
data(meuse.riv)
lines(meuse.riv)
grid()
```
# Random Forests
The idea here is simple: build many regression trees, each on a bootstrap sample of the observations and considering a random subset of predictors at each split, and average their predictions.
Reference: Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Packages used in this section: `randomForest` for random forests, `randomForestExplainer` for some diagnostics.
```{r}
library(randomForest)
library(randomForestExplainer)
```
First build the forest, using the `randomForest` function:
```{r}
m.lzn.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=5)
print(m.lzn.rf)
```
The `mtry` optional argument gives the number of variables randomly sampled as candidates at each split. By default this is $\lfloor{p/3}\rfloor$ where $p$ is the number of predictors, here $7$, so $\lfloor{7/3}\rfloor = 2$. We set `mtry=5` to consider more variables at each split, for more accurate splitting.
Q: How much of the variance is explained by the forest?
Here we cannot display a single tree structure, since each tree in the forest is different.
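We can, however, inspect any one tree in the forest with `randomForest::getTree`; for example, the first rows of the node table of the first tree:

```r
# One row per node: splitting variable and split point for internal nodes,
# the prediction for terminal nodes (status == -1)
head(randomForest::getTree(m.lzn.rf, k = 1, labelVar = TRUE), 10)
```

This is only one of many trees; another `k` will generally show different splits.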
Display the out-of-bag error against the number of trees:
```{r fig.width=8, fig.height=4}
plot(m.lzn.rf)
```
Q: How many trees were needed to make a reliable forest?
## Sensitivity
Each RF is random, so several attempts will not be the same. How consistent are they?
Compute the RF several times and collect statistics:
```{r fig.width=6, fig.height=4}
n <- 24
rf.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
for (i in 1:n) {
  model.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                           data=meuse, importance=T, na.action=na.omit, mtry=5)
  # rsq and mse are per-tree (cumulative) vectors; summarize each run by its median
  rf.stats[i, "mse"] <- median(model.rf$mse)
  rf.stats[i, "rsq"] <- median(model.rf$rsq)
}
summary(rf.stats[,2:3])
hist(rf.stats[,"rsq"], xlab="RandomForest R^2")
rug(rf.stats[,"rsq"])
hist(rf.stats[,"mse"], xlab="RandomForest MSE")
rug(rf.stats[,"mse"])
```
Q: How consistent are the 24 runs of the random forest?
## Variable importance
Examine the variable importance. Type 1 is the mean decrease in accuracy: the increase in MSE when the values of that predictor are randomly permuted, compared to using its actual values. It is computed as follows (from the Help): "For each tree, the prediction error on the OOB portion of the data is recorded (...MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences".
Type 2 is the increase in node purity. It is computed as follows: "The total decrease in node impurities from splitting on the [given] variable, averaged over all trees... measured by residual sum of squares."
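The Type 1 (permutation) idea can be sketched by hand with a toy example; here a simple `lm` is used and the error is evaluated in-sample rather than out-of-bag, so this is only illustrative of the idea, not the exact computation:

```r
# Toy data: only `a` carries signal
set.seed(7)
d <- data.frame(a = runif(100), b = runif(100))
d$z <- 3 * d$a + rnorm(100, sd = 0.2)
fit <- lm(z ~ a + b, data = d)
mse <- function(m, dd) mean((dd$z - predict(m, dd))^2)
base.mse <- mse(fit, d)
perm.importance <- sapply(c("a", "b"), function(v) {
  dp <- d
  dp[[v]] <- sample(dp[[v]])   # permute one predictor
  mse(fit, dp) - base.mse      # increase in MSE from permutation
})
perm.importance                # `a` dominates, `b` is near zero
```

Permuting an informative predictor destroys its relation to the response, so the error rises sharply; permuting an uninformative one barely changes it.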
```{r fig.width=5, fig.height=5}
importance(m.lzn.rf, type=1)
importance(m.lzn.rf, type=2)
varImpPlot(m.lzn.rf, type=1)
```
Q: Does this agree with the variable importance from the single trees we built above?
Q: Why are some variables given importance, although they were not used in the regression tree models?
## Examine variable importance and interaction with `randomForestExplainer`
Another way to look at this is by the depth in the tree at which each predictor is used. This is found with the `min_depth_distribution` function of the `randomForestExplainer` package.
```{r fig.width=8, fig.height=5}
min_depth_frame <- min_depth_distribution(m.lzn.rf)
str(min_depth_frame) # has results for all the trees
plot_min_depth_distribution(min_depth_frame)
```
Q: Which predictor is used most as the root?
Q: Which predictor is used most at the second level?
Q: Which predictors are not used in any trees?
Q: Do these results agree with the variable importance plot?
This package gives a wider choice of importance measures:
```{r}
importance_frame <- measure_importance(m.lzn.rf)
print(importance_frame)
```
Q: Which predictor was used in the most nodes, over all trees?
Q: Which predictors were never used as the root of the tree?
Q: Which predictors were used in every tree?
The p-value measures whether the observed number of successes (number of nodes in which $X_j$ was used for splitting) exceeds the theoretical number of successes if they were random, following a binomial distribution. See https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html for details.
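For example (hypothetical counts, not taken from our forest): if a predictor was used in 120 of 600 nodes, while random choice among 7 candidates would use it with probability 1/7 per node, a one-sided binomial test gives the p-value:

```r
# Hypothetical counts: 120 of 600 nodes split on this predictor;
# under random selection among 7 candidates, p = 1/7 per node
binom.test(x = 120, n = 600, p = 1/7, alternative = "greater")$p.value
```

A small p-value indicates the predictor was chosen far more often than chance alone would explain.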
Another interesting plot is the "multiway importance plot", which by default shows the number of times as root vs. the mean minimum depth for each predictor. There is typically a negative relation between `times_a_root` and `mean_min_depth`.
```{r fig.width=4, fig.height=4}
plot_multi_way_importance(m.lzn.rf, size_measure = "no_of_nodes")
```
Q: Which predictor is most important by both measures?
Q: Is this a consistent negative relation? If not, where are the exceptions?
Another interesting plot is the pairwise comparison of different importance measures:
```{r fig.width=7, fig.height=7}
plot_importance_ggpairs(importance_frame)
```
Q: Which are the most and least related measures?
Examine the interactions between predictors, in terms of their use within the forest:
```{r fig.width=6, fig.height=6}
interactions_frame <- min_depth_interactions(m.lzn.rf)
interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE)[1:12], ]
plot_min_depth_interactions(interactions_frame)
```
Q: Which pair of predictors are most used near the root of the tree, and with the most co-occurrences?
"To further investigate the most frequent interaction we use the function plot_predict_interaction to plot the prediction of our forest on a grid of values for the components of each interaction. The function requires the forest, training data, variable to use on x and y-axis, respectively. In addition, one can also decrease the number of points in both dimensions of the grid from the default of 100 in case of insufficient memory using the parameter grid."
Here we see the predicted log(Zn) content for all combinations of distance from river and elevation:
```{r fig.width=8, fig.height=7}
plot_predict_interaction(m.lzn.rf, meuse, "dist.m", "elev")
```
## Partial dependence plots
These plots show the marginal effect of one predictor on the predictand, averaging the predictions over the observed values of the other predictors. We show these in order of variable importance, as determined just above.
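The computation behind these plots can be sketched by hand (assuming `m.lzn.rf` and `meuse` from above): for each value of `dist.m` on a grid, set every observation's `dist.m` to that value, predict, and average the predictions.

```r
# Hand-rolled partial-dependence curve for dist.m
pd.grid <- seq(min(meuse$dist.m), max(meuse$dist.m), length.out = 25)
pd <- sapply(pd.grid, function(v) {
  d <- meuse
  d$dist.m <- v                          # fix dist.m at this grid value
  mean(predict(m.lzn.rf, newdata = d))   # average over all observations
})
plot(pd.grid, pd, type = "l",
     xlab = "dist.m", ylab = "mean predicted log10(Zn)")
```

`partialPlot` below does essentially this computation for each predictor in turn.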
```{r fig.width=6, fig.height=4}
partialPlot(m.lzn.rf, pred.data=meuse, x.var="x")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="y")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="dist.m")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="elev")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="ffreq", which.class=1)
```
Q: What is the effect of just the distance to river on the RF prediction of log10(Zn), averaging over the other predictors (continuous variables such as `elev`, and classified variables such as `ffreq`, `soil` and `lime`)? Is the effect monotonic (i.e., consistently increasing or decreasing)? Why or why not?
## Goodness-of-fit and out-of-bag errors
Predict back onto the known points, and evaluate the goodness-of-fit:
```{r fig.width=6, fig.height=4}
p.rf <- predict(m.lzn.rf, newdata=meuse)
length(unique(p.rf))
summary(r.rpp <- meuse$logZn - p.rf)
(rmse.rf <- sqrt(sum(r.rpp^2)/length(r.rpp)))
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
```
Q: What is the RMSE?
Q: How many different predictions were made? Why not just a few, as in regression trees?
A more realistic view of the predictive power of the model is the out-of-bag validation.
```{r fig.width=5, fig.height=5}
p.rf.oob <- predict(m.lzn.rf)
summary(r.rpp.oob <- meuse$logZn - p.rf.oob)
(rmse.oob <- sqrt(sum(r.rpp.oob^2)/length(r.rpp.oob)))
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
xlab="Out-of-bag cross-validation estimates",
ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3),
main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
```
Q: What is the OOB RMSE? How does this compare with the fitted RMSE? Which is more realistic for predictions?
Q: How does this compare to the model fits?
Q: How does this compare with the RMSE from the Regression Tree?
## Map over the grid with an RF model
We can try to build an RF model to map the grid, but again we can only use coordinates, flood frequency, soil type and relative distance to the river, because these are the only variables we have over the whole area, i.e., in the spatial object `meuse.grid`.
```{r fig.width=5, fig.height=5}
names(meuse.grid)
m.lzn.rf.g <- randomForest(logZn ~ x + y + ffreq + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.lzn.rf.g)
(tmp <- importance(m.lzn.rf.g, type=1))
(100*tmp/sum(tmp))
varImpPlot(m.lzn.rf.g, type=1)
```
```{r}
meuse.grid.sp$predictedLog10Zn.rf <- predict(m.lzn.rf.g, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.rf")
```
Q: Does this map look realistic? How does it compare with the map made with the single regression tree?
# Cubist
This method is derived from the C4.5 approach of Quinlan.
References:
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, Calif: Morgan Kaufmann Publishers.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. https://doi.org/10.1613/jair.279
A recent article comparing it to RF is: Houborg, R., & McCabe, M. F. (2018). A hybrid training approach for leaf area index estimation via Cubist and random forests machine-learning. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 173–188. https://doi.org/10.1016/j.isprsjprs.2017.10.004
Unlike a regression tree, which predicts a single value at each leaf, it fits a multivariate linear regression for the cases in each leaf, using the predictors from the splits leading to it.
The `Cubist` package is an `R` port of the Cubist GPL `C` code released by RuleQuest at [`http://rulequest.com/cubist-info.html`](http://rulequest.com/cubist-info.html).
An explanation from the `Cubist` vignette:
> Cubist is a rule--based model that is an extension of Quinlan's M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is "smoothed" by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
Package used in this section: `Cubist`, implementing Quinlan's rule-based model trees.
```{r}
library(Cubist)
```
Split into training sets, here 80% in the training set:
```{r}
inTrain <- sample(1:nrow(meuse), floor(.8*nrow(meuse)))
```
Build training and test matrices. Note there is no formula method for `cubist`; the predictors are specified as matrix or data frame and the outcome is a numeric vector.
```{r}
preds <- c("x","y","ffreq","dist.m","elev","soil","lime")
train.pred <- meuse[ inTrain, preds]
test.pred <- meuse[-inTrain, preds]
train.resp <- meuse$logZn[ inTrain]
test.resp <- meuse$logZn[-inTrain]
```
## Unoptimized Cubist model
Now fit the model to the training set:
```{r}
(c.model <- cubist(x = train.pred, y = train.resp))
summary(c.model)
```
In this case there are only two leaves, each with a regression model. The split is on distance to river.
We predict onto the test set to get an estimate of predictive accuracy:
```{r bh3}
c.pred <- predict(c.model, test.pred)
## Test set RMSE
sqrt(mean((c.pred - test.resp)^2))
## Test set R^2
cor(c.pred, test.resp)^2
```
Now, fit a model on the entire set and then predict over the grid. Again we can only use predictors that are in the grid.
```{r}
preds <- c("x","y","dist","soil","ffreq")
all.preds <- meuse[, preds]
all.resp <- meuse$logZn
c.model.grid <- cubist(x = all.preds, y = all.resp)
summary(c.model.grid)
meuse.grid.sp$predictedLog10Zn.cubist <- predict(c.model.grid, newdata=meuse.grid[,preds])
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist", col.regions=bpy.colors(64))
```
Here we can see the effect of the local regression models, as well as the hard split on `x`, `y`, and `dist`.
## Optimizing cubist
Cubist can improve performance in two ways: (1) with "committees" and (2) adjusting by nearest neighbours:
> The Cubist model can also use a boosting–like scheme called committees where iterative model trees are created in sequence. The first tree follows the procedure described in the last section. Subsequent trees are created using adjusted versions to the training set outcome: if the model over–predicted a value, the response is adjusted downward for the next model (and so on). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
> Another innovation in Cubist is using nearest–neighbors to adjust the predictions from the rule–based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find its nearest neighbors and determine the average of these training set points.
To find the optimum values for these, we turn to the `caret` package.
> The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques.
See also: http://topepo.github.io/caret/model-training-and-tuning.html
The `train` function in `caret` can use many models, supplied by other packages, by specifying the `method` argument. Then the `tuneGrid` parameter refers to a grid of possible parameters to compare by cross-validation. Note we use all the points, because `train` does the splitting as it evaluates the alternative models.
```{r fig.width=8, fig.height=4}
require(caret)
test <- train(x = all.preds, y = all.resp, method="cubist",
tuneGrid = expand.grid(.committees = 1:12,
.neighbors = 0:5),
trControl = trainControl(method = 'cv'))
print(test)
plot(test, metric="RMSE")
plot(test, metric="Rsquared")
```
Of course this is random in the splits, so each run will be different.
In this case the graph shows clearly that more committees boost the performance considerably, from around 0.19 to 0.15 cross-validation RMSE. The nearest-neighbours also help considerably, up to about four, and then the improvement is less.
So we fit a model with these recommended parameters. Note the neighbour count is only used in prediction, not model building.
```{r}
c.model.grid.2 <- cubist(x = all.preds, y = all.resp, committees=4)
summary(c.model.grid.2)
meuse.grid.sp$predictedLog10Zn.cubist.2 <-
predict(c.model.grid.2, newdata=meuse.grid[,preds],
neighbors=5)
summary(meuse.grid.sp$predictedLog10Zn.cubist.2)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist.2", col.regions=bpy.colors(64),
main="Optimized Cubist prediction")
```
A much nicer looking map.
Variable importance. The `usage` field of the fitted model shows the usage of each variable in either the rule conditions or the (terminal) linear model.
```{r}
c.model.grid$usage
c.model.grid.2$usage
caret::varImp(c.model.grid)
caret::varImp(c.model.grid.2)
```
Distance is always used in the single model, but sometimes not in the committee model.
Notice how flood frequency came into the model when a committee was formed, i.e., subsidiary rules to adjust mis-matches from higher-level rules.
It is interesting to see how the different models fit a single predictor, in this case distance to river:
```{r}
distances <- all.preds[, "dist", drop = FALSE]
rules_only <- cubist(x = distances, y = all.resp)
rules_and_com <- cubist(x = distances, y = all.resp, committees = 4)
predictions <- distances
predictions$true <- all.resp
predictions$rules <- predict(rules_only, distances, neighbors = 0)
predictions$rules_neigh <- predict(rules_only, distances, neighbors = 5)
predictions$committees <- predict(rules_and_com, distances, neighbors = 0)
predictions$committees_neigh <- predict(rules_and_com, distances, neighbors = 5)
summary(predictions)
```
```{r fig.width=8, fig.height=5}
plot(true ~ dist, data=predictions, pch=20, col="darkgray",
xlab="covariate: normalized distance to river",
ylab="log10(Zn)", main="Cubist models")
grid()
model.names <- c("rules", "rules_neigh", "committees", "committees_neigh")
ix <- order(predictions$dist)
for (i in seq_along(model.names)) {
  lines(predictions$dist[ix], predictions[[model.names[i]]][ix], col=i, lwd=2)
}
legend("topright", c("rules", "rules + neighbour",
                     "committee", "committee + neighbour"),
       lty=1, lwd=2, col=1:4)
```
Notice how the committee (green) smooths out the centre of the distance range, compared to the single-rules model (black).
Notice how the neighbours come much closer to some of the points -- too close?
# K nearest neighbours
A pure "instance based" approach is to predict at each location from the average (possibly weighted) of some set of similar values. The `caret` package includes a KNN regression with the `knnreg` function.
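The idea can be sketched by hand on toy data: predict at a new location as the unweighted mean of the `k` nearest training responses in feature space.

```r
# Minimal KNN regression by hand
knn_predict <- function(X, y, x0, k = 3) {
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))   # Euclidean distance to x0
  mean(y[order(d)[1:k]])                  # average the k nearest responses
}
X <- cbind(p1 = c(0, 1, 2, 10, 11), p2 = c(0, 1, 2, 10, 11))
y <- c(1, 2, 3, 20, 21)
knn_predict(X, y, x0 = c(1, 1), k = 3)    # -> 2, the mean of the 3 nearby points
```

`knnreg` adds conveniences (formula interface, standardized distances) but follows this same logic.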
Try with five neighbours:
```{r knn, fig.width=5, fig.height=5}
knn.5 <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq, data=meuse, k=5)
knn.pred <- predict(knn.5, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
```
However, unless we know a good value of `k` a priori, we should let `train` evaluate the optimum number of neighbours.
```{r fig.width=6, fig.height=4}
knn.tune <- train(x = all.preds, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
str(knn.tune)
print(knn.tune)
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
min(knn.tune$results$RMSE)
(ix <- which.min(knn.tune$results$RMSE))
max(knn.tune$results$Rsquared)
(ix <- which.max(knn.tune$results$Rsquared))
# directly find optimum
knn.tune$finalModel$k
```
Here there is a clear winner, `k=3`. It is not a very good model: cross-validation RMSE about $0.193$, $R^2 = 0.62$.
Use this to predict:
```{r fig.width=5, fig.height=5}
knn.final <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
knn.pred <- predict(knn.final, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
```
We can't map with this because the grid lacks elevation (it has only the coordinates, `ffreq`, `dist`, and `soil`). So re-tune with the predictors we will use in the final model:
```{r fig.width=6, fig.height=4}
preds.2 <- c("x","y","ffreq","dist")
all.preds.2 <- meuse[, preds.2]
knn.tune <- train(x = all.preds.2, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
min(knn.tune$results$RMSE)
(ix <- which.min(knn.tune$results$RMSE))
max(knn.tune$results$Rsquared)
(ix <- which.max(knn.tune$results$Rsquared))
# directly find optimum
knn.tune$finalModel$k
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
## the two criteria do not agree
```
Map with this:
```{r fig.width=8, fig.height=5}
knn.final <- caret::knnreg(logZn ~ x + y + dist + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
meuse.grid.sp$knn <- predict(knn.final, newdata=meuse.grid)
summary(meuse.grid.sp$knn)
spplot(meuse.grid.sp, zcol="knn")
```
Interesting map! Note here 'nearest neighbour' included both coordinates and feature-space.
# Spatial random forests
Random forests can use coordinates and distances to geographic features as predictors; we used the x and y coordinates in the previous section.
RF can also use distances to _multiple points_ as predictors:
* Distance buffers: distance to the closest point with some range of values.
    + A common approach is to compute _quantiles_ of the response variable and create one buffer for each.
    + Each sample point then has a distance to the closest point in each quantile.
* This uses the separation between point-pairs of different values, but with _no model_.
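As a minimal sketch of the idea (using quartiles rather than the 16 quantiles used below, for brevity), the distance from the first sample point to the nearest point in each quantile class can be computed directly; `qq` and `d1` are illustrative names:

```{r}
# distance from sample point 1 to the nearest point in each logZn quartile;
# its own quartile is of course at distance zero
qq <- cut(meuse$logZn, quantile(meuse$logZn), include.lowest=TRUE)
d1 <- sqrt((meuse$x - meuse$x[1])^2 + (meuse$y - meuse$y[1])^2)
tapply(d1, qq, min)
```

`GSIF::buffer.dist`, used below, computes such minimum distances for every cell of the prediction grid.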
Reference: Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6, e5518. https://doi.org/10.7717/peerj.5518
Packages used in this section:
`randomForest` to build the RF,
`scales` for visualization, `GSIF` for buffer distances.
```{r}
library(randomForest) # to build the RF
library(scales)       # for visualization
library(GSIF)         # for buffer distances
```
## Method 1 -- buffers to quantiles
### Quantiles
Set up quantiles for the transformed and original Zn concentrations:
```{r fig.width=6, fig.height=4, results='hold'}
library(ggplot2)
g <- ggplot(meuse, mapping=aes(x=logZn))
g + geom_histogram(bins=16, fill="lightgreen", color="blue") + geom_rug()
(q.lzn <- quantile(meuse$logZn, seq(0,1,by=0.0625)))
# must include.lowest to have lowest value in some class
classes.lzn.q <- cut(meuse$logZn, breaks=q.lzn, ordered_result=TRUE, include.lowest=TRUE)
levels(classes.lzn.q)
```
Q: Are the 16 quantile classes more or less evenly spread along the range of the target variable?
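One way to examine this, using the objects computed in the chunk above:

```{r}
table(classes.lzn.q)   # counts per class: near-equal by construction of quantiles
round(diff(q.lzn), 3)  # widths of the classes along the logZn range
```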
### Distances to each quantile
Compute distance buffers to points in each quantile.
For this, the points object must be converted to a `SpatialPointsDataFrame`, and the function `buffer.dist` from the `GSIF` package is used.
```{r fig.width=9, fig.height=9}
coordinates(meuse) <- ~x + y
grid.dist.lzn <- GSIF::buffer.dist(meuse["logZn"], meuse.grid.sp, classes.lzn.q)
plot(stack(grid.dist.lzn))
```
Q: Explain the pattern of Layer 16.
Extract the distances to each buffer at each point, and add these as fields to the points:
```{r}
buffer.dists <- over(meuse, grid.dist.lzn)
str(buffer.dists)
meuse@data <- cbind(meuse@data, buffer.dists)
```
Show how far point 1 is from the nearest point of each quantile:
```{r}
buffer.dists[1,]
```
Of course it is zero distance to its own quantile, in this case 15.
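This can be verified by comparing the index of the zero-distance buffer with the point's quantile class:

```{r}
which.min(unlist(buffer.dists[1, ]))  # index of the zero-distance buffer
classes.lzn.q[1]                      # the point's own quantile class
```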
### Spatial random forest
First make a formula with all the quantiles as predictors:
```{r}
dn <- paste(names(grid.dist.lzn), collapse="+")
(fm <- as.formula(paste("logZn ~", dn)))
```
Now compute the random forest model, use it to predict at the observation points.
```{r}
(rf <- randomForest(fm, data=meuse@data, importance=TRUE, nodesize=6, mtry=6, ntree=640))
pred.rf <- predict(rf, newdata=meuse@data)
```
```{r fig.width=7, fig.height=4}
plot(rf)
```
Q: How many trees were needed to make a reliable forest?
Variable importance:
```{r fig.width=5, fig.height=5}
importance(rf, type=1)
varImpPlot(rf, type=1)
```
Q: Which distance buffer was most important? Explain why.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
```{r fig.width=5, fig.height=5}
plot(meuse$logZn ~ pred.rf, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf <- sqrt(sum((pred.rf-meuse$logZn)^2)/length(pred.rf)))
#
oob.rf <- predict(rf)
plot(meuse$logZn ~ oob.rf, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob <- sqrt(sum((oob.rf-meuse$logZn)^2)/length(oob.rf)))
```
Q: What are the goodness-of-fit and Out-of-bag RMSE?
### Mapping
```{r}
pred.grid <- predict(rf, newdata=grid.dist.lzn@data)
summary(pred.grid)
meuse.grid.sp$logZn.rf <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p1 <- spplot(meuse.grid.sp, zcol="logZn.rf", main="Zn, log(mg kg-1)", sub="RF, 16 distance buffers", at=breaks)
print(p1)
```
Compare this with OK cross-validation. First compute the ordinary variogram and fit a model:
```{r fig.width=5, fig.height=5}
require(gstat)
v <- variogram(logZn ~ 1, meuse)
(vmf <- fit.variogram(v, model=vgm(0.1, "Sph", 1000, 0.02)))
plot(v, model=vmf, plot.numbers=T)
```
Now compute the cross-validation statistics:
```{r fig.width=5, fig.height=5}
k.cv <- krige.cv(logZn ~ 1, meuse, model=vmf, verbose=FALSE)
plot(meuse$logZn ~ k.cv$var1.pred, pch=20, xlab="Cross-validation fit", ylab="Actual value", main="Zn, log(mg kg-1)", asp=1)
abline(0,1); grid()
(rmse.kcv <- sqrt(sum((k.cv$var1.pred-meuse$logZn)^2)/length(k.cv$var1.pred)))
```
Q: Which approach gives the lower cross-validation error in this case?
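The two error estimates computed above can be put side by side; note that they are not strictly comparable (RF out-of-bag vs. kriging leave-one-out cross-validation), so treat the comparison as indicative:

```{r}
round(c(rf.oob = rmse.oob, ok.cv = rmse.kcv), 4)
```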
Compare the kriging map with the spatial random forest map:
```{r fig.width=8, fig.height=5}
k.ok <- krige(logZn ~ 1, meuse, newdata=meuse.grid.sp, model=vmf)
summary(k.ok$var1.pred)
p2 <- spplot(k.ok, zcol="var1.pred", main="Zn, log(mg kg-1)", sub="Ordinary Kriging", at=breaks)
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
```
Compute and display the differences:
```{r fig.keep=TRUE}
k.ok$diff.srf.ok <- meuse.grid.sp$logZn.rf - k.ok$var1.pred
summary(k.ok$diff.srf.ok)
p3 <- spplot(k.ok, zcol="diff.srf.ok", main="Spatial RF - OK predictions, log(Zn)", col.regions=topo.colors(36))
print(p3)
```
Q: Comment on the differences/similarities between the Spatial Random Forest map and the Ordinary Kriging map. Note that for Kriging we needed to specify a model, for RF no model was needed.
## Method 2 -- buffers around each point
The concept of quantile buffers is similar to that of the empirical variogram, i.e., it assumes stationarity of the spatial correlation structure.
Another approach is to use the distance from *each* known point as a predictor. Obviously this increases the size of the problem. This is the approach taken in the Hengl paper cited above.
Set up the buffer distances and visualize a few of them:
```{r fig.width=9, fig.height=9}
grid.dist.0 <- GSIF::buffer.dist(meuse["logZn"], meuse.grid.sp, as.factor(1:nrow(meuse)))
dim(grid.dist.0)
summary(grid.dist.0$layer.1)
plot(stack(grid.dist.0[1:9]))
```
Extract the distances to each buffer at each point, and add these as fields to the points. Note that we first have to remove the previous buffer distances (the 16 quantiles).
```{r}
buffer.dists <- over(meuse, grid.dist.0)
dim(buffer.dists)
# distances from point 1 to all the other points
(buffer.dists[1,1:10])
names(meuse@data)
# keep only the original 16 fields, dropping the 16 quantile buffer distances
meuse@data <- meuse@data[, 1:16]
meuse@data <- cbind(meuse@data, buffer.dists)
```
So now we have 155 buffer distances. Set up a formula for the RF and build the RF.
Note that the default `mtry` for regression is $p/3$, here `r floor(155/3)`, i.e., the distance to 1/3 of the points.
We increase this to 2/3.
```{r}
dn0 <- paste(names(grid.dist.0), collapse="+")
fm0 <- as.formula(paste("logZn ~", dn0))
(rf <- randomForest(fm0, data=meuse@data, importance=TRUE,
mtry=102, nodesize=6, ntree=640))
```
This does not perform as well as using the quantiles: the $R^2$ is about 10% lower.
Mapping:
```{r fig.width=10, fig.height=5}
pred.grid <- predict(rf, newdata=grid.dist.0@data)
summary(pred.grid)
meuse.grid.sp$logZn.rf0 <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p2 <- spplot(meuse.grid.sp, zcol="logZn.rf0", main="Zn, log(mg kg-1)",
sub="RF distance all points", at=breaks)
# compare with 16-quantile buffers
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
```
A clearly inferior map. So the distances to individual points matter less than the distances to the quantiles of the target variable.
# Random forests for categorical variables
Random forests can also be used to predict classes. As with RF for continuous variables, the forest is made up of many trees. The prediction is the class predicted by the majority of trees.
```{r fig.width=5, fig.height=5}
m.ffreq.rf <- randomForest(ffreq ~ x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.ffreq.rf)
importance(m.ffreq.rf)
varImpPlot(m.ffreq.rf)
```
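The majority vote can be inspected directly by asking each tree for its prediction; here for the first observation (`votes.1` is an illustrative name):

```{r}
# predictions of every individual tree at the observation points
votes.1 <- predict(m.ffreq.rf, meuse, predict.all=TRUE)$individual[1, ]
table(votes.1)  # votes of the individual trees for observation 1
```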
Here we see the class error rate, i.e., the proportion of training samples *not* predicted into the correct class. In this example class 3 is poorly predicted, whereas class 1 is well predicted.
Elevation is by far the most important predictor of accuracy for all classes. The coordinates are next, because of the geographic configuration of the study area -- these substitute for distance to the river when that is not in the model.
However, for the Gini importance (how often a randomly-chosen training observation would be incorrectly assigned if it were randomly labelled according to the frequency distribution of labels in the subset), the y-coordinate becomes somewhat more important -- perhaps because of the one high value far to the east, at the river bend.
See where the predictors are used:
```{r fig.width=8, fig.height=5}
library(randomForestExplainer) # for minimal-depth plots
min_depth_frame <- min_depth_distribution(m.ffreq.rf)
plot_min_depth_distribution(min_depth_frame)
```
Soil type and lime application are not used in more than half the trees. The y-coordinate is used first in most trees, followed by elevation and x-coordinate.
We can see the probabilities of each class by using the model to predict back at the known points:
```{r}
summary(prob_rf <- predict(m.ffreq.rf, meuse, type = "prob"))
# probability of each observation being predicted to class 1
prob_rf[,1]
# class probabilities for observation 1, in decreasing order
sort(prob_rf[1,], decreasing=TRUE)
order(prob_rf[1,], decreasing=TRUE)
# its actual class -- in this case it agrees
meuse$ffreq[1]
```
These can be summarized into lists of the most-probable, 2nd-most-probable, etc. classes:
```{r}
# list of most likely classes
probs <- apply(prob_rf, 1, order, decreasing=TRUE)
probs[,1]
probs[,121]
# most likely, 2nd most, 3rd most, for each observation
table(probs.1 <- probs[1,])
# cross-classification of most and 2nd-most probable classes
table(probs.1, probs.2 <- probs[2,])
```
## Mapping classes
Again we need to restrict the RF model to variables also available in the prediction grid.
```{r}
(m.ffreq.rf.2 <- randomForest(ffreq ~ x + y + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3))
```
This is less accurate than the model with elevation. Anyway, we show the map:
```{r}
meuse.grid.sp$predictedffreq.rf <- predict(m.ffreq.rf.2, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedffreq.rf", main="Most probable flood frequency class")
```
Not a nice-looking map. It disagrees strongly with the actual flooding-frequency map.
```{r}
summary(meuse.grid.sp$ffreq.tf <-
as.factor((meuse.grid.sp$predictedffreq.rf==
meuse.grid.sp$ffreq)))
spplot(meuse.grid.sp, zcol="ffreq.tf",
main="Accuracy of RF model")
```
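A cross-tabulation over the grid cells quantifies this disagreement:

```{r}
table(predicted = meuse.grid.sp$predictedffreq.rf,
      mapped    = meuse.grid.sp$ffreq)
```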
Probability of class 1 at each grid cell:
```{r fig.width=9, fig.height=4}
probs.grid <- predict(m.ffreq.rf.2, newdata=meuse.grid, type="prob")
meuse.grid.sp$probffreq.rf.1 <- probs.grid[,1]
meuse.grid.sp$probffreq.rf.2 <- probs.grid[,2]
meuse.grid.sp$probffreq.rf.3 <- probs.grid[,3]
p1 <- spplot(meuse.grid.sp, zcol="probffreq.rf.1",
main="Prob. class 1")
p2 <- spplot(meuse.grid.sp, zcol="probffreq.rf.2",
main="Prob. class 2")
p3 <- spplot(meuse.grid.sp, zcol="probffreq.rf.3",
main="Prob. class 3")
print(p1, split=c(1,1,3,1), more=T)
print(p2, split=c(2,1,3,1), more=T)
print(p3, split=c(3,1,3,1), more=F)
```
# Spatial random forest, including covariates
Does adding covariates improve the prediction?
```{r}
(covar.names <- paste(intersect(names(meuse), names(meuse.grid)), collapse="+"))
(fm.c <- as.formula(paste("logZn ~", dn, "+", covar.names)))
```
Now compute the random forest model including these covariates, and use it to predict at the observation points. Note that `meuse@data` must again contain the 16 quantile buffer distances, which were replaced by the 155 per-point distances in Method 2.
```{r}
# restore the 16-quantile buffer distances (overwritten in Method 2)
meuse@data <- meuse@data[, 1:16]
meuse@data <- cbind(meuse@data, over(meuse, grid.dist.lzn))
(rf.c <- randomForest(fm.c, data=meuse@data, importance=TRUE,
nodesize=6, mtry=6, ntree=640))
pred.rf.c <- predict(rf.c, newdata=meuse@data)
```
```{r fig.width=7, fig.height=4}
plot(rf.c)
```
Display the variable importance:
```{r fig.width=5, fig.height=5}
importance(rf.c, type=1)
varImpPlot(rf.c, type=1)
```
Q: Are any of the covariables important? Are any more important than the distance buffers? Explain.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
```{r fig.width=5, fig.height=5}
plot(meuse$logZn ~ pred.rf.c, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf.c <- sqrt(sum((pred.rf.c-meuse$logZn)^2)/length(pred.rf.c)))
#
oob.rf.c <- predict(rf.c)
plot(meuse$logZn ~ oob.rf.c, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob.c <- sqrt(sum((oob.rf.c-meuse$logZn)^2)/length(oob.rf.c)))
```
Q: What are the goodness-of-fit and Out-of-bag RMSE? How do these compare to the spatial RF without covariates?
## Mapping
We first have to add the covariates to the prediction grid.
```{r}
grid.dist.lzn.covar <- grid.dist.lzn
grid.dist.lzn.covar$dist <- meuse.grid$dist
grid.dist.lzn.covar$ffreq <- meuse.grid$ffreq
grid.dist.lzn.covar$soil <- meuse.grid$soil
```
Now predict over the grid and display:
```{r fig.width=8, fig.height=5}
pred.grid.c <- predict(rf.c, newdata=grid.dist.lzn.covar@data)
summary(pred.grid.c)
meuse.grid.sp$logZn.rf.c <- pred.grid.c
breaks <- seq(2, 3.3, by=.05)
p1.c <- spplot(meuse.grid.sp, zcol="logZn.rf.c", main="Zn, log(mg kg-1)", sub="Random forest on distance buffers & covariates", at=breaks)
print(p1.c, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
```
Compute and display the differences:
```{r fig.keep=TRUE}
meuse.grid.sp$logZn.rf.diff <- meuse.grid.sp$logZn.rf.c - meuse.grid.sp$logZn.rf
summary(meuse.grid.sp$logZn.rf.diff)
p3.c <- spplot(meuse.grid.sp, zcol="logZn.rf.diff", main="Spatial RF with - without covariates, log(Zn)", col.regions=topo.colors(36))
print(p3.c)
```
Q: Compare this to the maps made by RF/Spatial Distances without covariates.
Where does using the covariates result in higher and lower predicted values?
Try to explain why.
Challenge: Map logZn over the grid by Kriging with External Drift (KED), using the same three covariates (`dist`, `ffreq`, `soil`), and compare with the map made by spatial random forest with covariates. Reminder: the variogram must be computed for the linear model residuals using these covariates.
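As a starting point (not the full answer), the residual variogram is obtained by putting the covariates on the right-hand side of the variogram formula; `v.r` is an illustrative name:

```{r fig.width=5, fig.height=5}
# variogram of the residuals of the linear model logZn ~ dist + ffreq + soil
v.r <- variogram(logZn ~ dist + ffreq + soil, meuse)
plot(v.r, plot.numbers=TRUE)
```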
Q: Describe the differences between the residual and ordinary variogram.
Q: Describe the differences between the OK and KED maps.
Q: Compare the KED map with the spatial random forest + covariates map.
# General references
Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York: Springer. Retrieved from http://link.springer.com/book/10.1007%2F978-0-387-84858-7
Kuhn, M. (2013). Applied Predictive Modeling. New York, NY: Springer New York. Retrieved from https://link.springer.com/openurl?genre=book&isbn=978-1-4614-6848-6
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. New York: Springer. Retrieved from http://link.springer.com/book/10.1007%2F978-1-4614-7138-7
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., … Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37. https://doi.org/10.1007/s10115-007-0114-2