--- title: "Finding the proper complexity parameter for a Regression Tree" author: "D G Rossiter" date: "`r Sys.Date()`" output: html_document: toc: TRUE toc_float: TRUE theme: "lumen" code_folding: show number_sections: TRUE fig_keep: TRUE fig_height: 4 fig_width: 6 fig_align: 'center' --- This shows how cross-validation is used to assess the proper complexity parameter for a regression tree. This is the graph shown by `printcp`, based on cross-validation computations in `rpart`, both in the `rpart` package. # Setup Required packages. ```{r} require(rpart); require(rpart.plot); require(sp) ``` Sample dataset is Meuse heavy metals in soil samples, comes with `sp`: ```{r} data(meuse) names(meuse) # meuse$logZn <- log10(meuse$zinc) ``` We select Zn (`zinc`) as the target, and try to predict from flooding frequency, distance to river in meters, and elevation m.a.sl. # Model Build a complex tree, with maximum possible splitting and `cp=0.001`: ```{r out.width="100%"} m.lzn.rp <- rpart(zinc ~ ffreq + dist.m + elev, data = meuse, minsplit = 2, cp = 0.001) rpart.plot(m.lzn.rp) ``` This is obviously too site-specific, i.e., overfit, from just 155 observations. # Cross-validation The cross-validation procedure is: 1. Randomly split the dataset into ten parts. Do this by a random permutation of the record numbers, followed by a split. 2. For each of the ten parts: 2.1 Hold the subset out for validation. 2.2 Build a tree with the others. 2.3 Predict at the held-out points. 2.4 Compute validation statistics. 3. Summarize the validation statistics First, decide on the number of cross-validations, and from that determine the size of each hold-out evaluation sample. ```{r} # number of splits, we decide this k <- 10; (size <- floor(dim(meuse)[1]/k)) ``` Set up a data frame to keep the results of each cross-validation run. ```{r} max.split <- 10 # the maximum number of levels we will test split.vs.xval <- as.data.frame(matrix(0, nrow=max.split, ncol=3)) names(split.vs.xval) <- c("nsplit", "xerr", "xerr.sd") ``` To avoid any sequential effects in the database (we don't know why they are presented in this order) make a random permutation, out of which we will sample. ```{r} records.permute <- sample(1:dim(meuse)[1]) ``` Run the cross-validation for each possible depth, i.e., number of splits. ```{r} for (i.split in 1:max.split) { rmse <- rep(0,k) for (i in 0:(k-1)) { ix <- records.permute[(i*size+1):(i*size+size)] # training set to build the tree, test set to evaluate meuse.cal <- meuse[-ix,]; meuse.val <- meuse[ix,] # build the tree m.cal <- rpart(zinc ~ ffreq + dist.m + elev, data = meuse.cal, maxdepth=i.split, cp=0) val <- (meuse.val$zinc - predict(m.cal, meuse.val)) # vector of residuals rmse[i+1] <- sqrt(sum(val^2)/length(val)) ## RMSE } # for k-fold # Now we have `k` RMSE: rmse.m <- mean(rmse) # this is the value we use to match with split rmse.sd <- sd(rmse)/length(ix) # and it has a standard deviation of the mean # save it split.vs.xval[i.split,] = c(i.split, rmse.m, rmse.sd) } # for minsplit ``` From this we can find an "optimim" as the minimum cross-validation RMSE: ```{r} split.vs.xval (ix.min <- which.min(split.vs.xval$xerr)) ``` Final result: the x-validation RMSE vs. number of splits, also showing one standard error. 
Final result: the cross-validation RMSE vs. number of splits, with a dashed line at one standard error above the minimum.

```{r}
plot(split.vs.xval[, 2] ~ split.vs.xval[, 1], type = "b",
     ylab = "cross-validation RMSE", xlab = "Number of splits")
grid()
abline(h = (split.vs.xval[ix.min, "xerr"] + split.vs.xval[ix.min, "xerr.sd"]),
       lty = 2)
```

# Compare with `rpart::plotcp`

What does the built-in procedure say about this model? Each run gives different cross-validation results, although the fitted tree is the same.

```{r}
for (i in 1:4) {
  m.lzn.rp.005 <- rpart(zinc ~ ffreq + dist.m + elev,
                        data = meuse,
                        minsplit = 2,
                        cp = 0.005)
  plotcp(m.lzn.rp.005)
}
```
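The built-in cross-validation results can also be used to choose a `cp` value and prune the over-grown tree. A minimal sketch, using the `cptable` component of the fitted model and `rpart::prune`, here selecting the `cp` with the lowest cross-validation error (the object names are illustrative):

```{r}
# cross-validation results computed by rpart itself
printcp(m.lzn.rp.005)
cpt <- m.lzn.rp.005$cptable
# cp of the subtree with the lowest cross-validation error
(cp.min <- cpt[which.min(cpt[, "xerror"]), "CP"])
# prune back to that complexity and display the result
m.pruned <- prune(m.lzn.rp.005, cp = cp.min)
rpart.plot(m.pruned)
```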