---
title: "Finding the proper complexity parameter for a Regression Tree"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    theme: "lumen"
    code_folding: show
    number_sections: TRUE
    fig_keep: TRUE
    fig_height: 4
    fig_width: 6
    fig_align: 'center'
---
This document shows how cross-validation is used to choose the proper complexity parameter for a regression tree. We build the cross-validation by hand, and then compare it with the graph shown by `plotcp`, which is based on the cross-validation computations that `rpart` performs when fitting; both functions are in the `rpart` package.
# Setup
Required packages.
```{r}
require(rpart); require(rpart.plot); require(sp)
```
The sample dataset is the Meuse River study of heavy metals in soil samples; it comes with `sp`:
```{r}
data(meuse)
names(meuse)
```
We select Zn (`zinc`) as the target variable, and try to predict it from flooding frequency (`ffreq`), distance to the river (`dist.m`, in meters), and elevation (`elev`, in meters above sea level).
# Model
Build a complex tree, with maximum possible splitting and `cp=0.001`:
```{r out.width="100%"}
m.lzn.rp <- rpart(zinc ~ ffreq + dist.m + elev,
                  data = meuse, minsplit = 2, cp = 0.001)
rpart.plot(m.lzn.rp)
```
This tree is obviously too site-specific, i.e., overfit: just 155 observations cannot support so many splits.
# Cross-validation
The cross-validation procedure is:

1. Randomly split the dataset into ten parts. Do this by a random permutation of the record numbers, followed by a split.
2. For each of the ten parts:
    1. Hold the subset out for validation.
    2. Build a tree with the others.
    3. Predict at the held-out points.
    4. Compute validation statistics.
3. Summarize the validation statistics.
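The index arithmetic behind step 1 can be sketched in base R. This is a self-contained illustration, not part of the analysis; `n = 155` and `k = 10` match the Meuse setup below, and note that `floor` leaves `n - k*size` records (here five) out of all folds:

```r
# sketch: split a random permutation of n record numbers into k equal folds
n <- 155; k <- 10
size <- floor(n / k)               # 15 records per fold; 5 records left over
perm <- sample(1:n)                # random permutation of the record numbers
folds <- lapply(0:(k-1), function(i) perm[(i*size + 1):(i*size + size)])
# each fold has `size` records, and no record appears in two folds
lengths(folds)
length(unique(unlist(folds)))      # k * size = 150 distinct records
```

This is exactly the indexing used in the cross-validation loop below, where fold `i` is `records.permute[(i*size+1):(i*size+size)]`.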
First, decide on the number of cross-validation folds, and from that determine the size of each held-out evaluation sample. Because `floor` discards the remainder, with 155 observations and ten folds the last five records of the permutation are never used.
```{r}
# number of splits, we decide this
k <- 10; (size <- floor(dim(meuse)[1]/k))
```
Set up a data frame to keep the results of each cross-validation run.
```{r}
max.split <- 10 # the maximum number of levels we will test
split.vs.xval <- as.data.frame(matrix(0, nrow=max.split, ncol=3))
names(split.vs.xval) <- c("nsplit", "xerr", "xerr.sd")
```
To avoid any sequential effects in the dataset (we do not know why the records are presented in this order), make a random permutation of the record numbers, out of which we will take the folds.
```{r}
records.permute <- sample(1:dim(meuse)[1])
```
Run the cross-validation for each possible depth, i.e., number of splits.
```{r}
for (i.split in 1:max.split) {
  rmse <- rep(0, k)
  for (i in 0:(k-1)) {
    # records held out for this fold
    ix <- records.permute[(i*size+1):(i*size+size)]
    # calibration set to build the tree, validation set to evaluate it
    meuse.cal <- meuse[-ix, ]; meuse.val <- meuse[ix, ]
    # build the tree
    m.cal <- rpart(zinc ~ ffreq + dist.m + elev,
                   data = meuse.cal, maxdepth = i.split, cp = 0)
    val <- (meuse.val$zinc - predict(m.cal, meuse.val)) # vector of residuals
    rmse[i+1] <- sqrt(sum(val^2)/length(val)) ## RMSE
  } # for each fold
  # Now we have `k` RMSE values:
  rmse.m <- mean(rmse)        # mean cross-validation RMSE for this depth
  rmse.sd <- sd(rmse)/sqrt(k) # standard error of the mean RMSE
  # save them
  split.vs.xval[i.split, ] <- c(i.split, rmse.m, rmse.sd)
} # for each maximum depth
```
From this we can find an "optimum" as the minimum cross-validation RMSE:
```{r}
split.vs.xval
(ix.min <- which.min(split.vs.xval$xerr))
```
Final result: the cross-validation RMSE vs. the number of splits; the dashed line is one standard error above the minimum RMSE.
```{r}
plot(split.vs.xval[,2] ~ split.vs.xval[,1], type="b", ylab="cross-validation RMSE", xlab="Number of splits")
grid()
abline(h=(split.vs.xval[ix.min, "xerr"] + split.vs.xval[ix.min, "xerr.sd"]), lty=2)
```
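The dashed horizontal line corresponds to the "one standard error" rule: rather than taking the exact minimum, choose the simplest tree whose cross-validation RMSE is within one standard error of the minimum. A small self-contained sketch of that selection (the `xerr` and `xerr.sd` values below are invented for illustration, not taken from the run above):

```r
# "one standard error" rule: simplest model under min(xerr) + its standard error
one.se.choice <- function(nsplit, xerr, xerr.sd) {
  ix.min <- which.min(xerr)
  threshold <- xerr[ix.min] + xerr.sd[ix.min]
  min(nsplit[xerr <= threshold])   # smallest number of splits under the line
}
# toy cross-validation results, purely for illustration
toy <- data.frame(nsplit  = 1:6,
                  xerr    = c(300, 250, 235, 232, 238, 245),
                  xerr.sd = c(10, 9, 8, 8, 9, 10))
one.se.choice(toy$nsplit, toy$xerr, toy$xerr.sd)  # -> 3, not the minimum at 4
```

Applied to `split.vs.xval`, this picks the smallest `nsplit` whose point falls below the dashed line in the plot.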
# Compare with `rpart::printcp`
What does the built-in procedure say about this model? Each run gives a different cross-validation curve, although the fitted tree is the same, because `rpart` re-randomizes the fold assignment at each fit:
```{r}
for (i in 1:4) {
  m.lzn.rp.005 <- rpart(zinc ~ ffreq + dist.m + elev,
                        data = meuse, minsplit = 2, cp = 0.005)
  plotcp(m.lzn.rp.005)
}
```
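`plotcp` only displays the curve; to act on it, the cross-validation results can be read from the fitted tree's `cptable` (columns `CP`, `nsplit`, `rel error`, `xerror`, `xstd`) and the tree cut back with `rpart::prune`. A sketch of that workflow, using the built-in `mtcars` dataset in place of `meuse` so the chunk is self-contained, and `set.seed` (an addition, for reproducible fold assignment):

```r
library(rpart)
set.seed(42)                                  # fix the internal fold assignment
m <- rpart(mpg ~ ., data = mtcars, minsplit = 2, cp = 0.001)
cpt <- m$cptable                              # CP, nsplit, rel error, xerror, xstd
# one-standard-error rule on xerror, as in the hand-made plot above
ix.min <- which.min(cpt[, "xerror"])
threshold <- cpt[ix.min, "xerror"] + cpt[ix.min, "xstd"]
cp.best <- cpt[min(which(cpt[, "xerror"] <= threshold)), "CP"]
m.pruned <- prune(m, cp = cp.best)
nrow(m.pruned$frame) <= nrow(m$frame)         # the pruned tree is no larger
```

Because the `xerror` column is itself a cross-validation estimate, `cp.best` can change from run to run, which is exactly the variability the four `plotcp` panels above display.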