--- title: "Finding the proper complexity parameter for a Regression Tree" author: "D G Rossiter" date: "`r Sys.Date()`" output: html_document: toc: TRUE toc_float: TRUE theme: "lumen" code_folding: show number_sections: TRUE fig_keep: TRUE fig_height: 4 fig_width: 6 fig_align: 'center' --- This shows how cross-validation is used to assess the proper complexity parameter for a regression tree. This is the graph shown by `printcp`, based on cross-validation computations in `rpart`, both in the `rpart` package. # Setup Required packages. ```{r} require(rpart); require(rpart.plot); require(sp) ``` Sample dataset is Meuse heavy metals in soil samples, comes with `sp`: ```{r} data(meuse) names(meuse) # meuse$logZn <- log10(meuse$zinc) ``` We select Zn (`zinc`) as the target, and try to predict from flooding frequency, distance to river in meters, and elevation m.a.sl. # Model Build a complex tree, with maximum possible splitting and `cp=0.001`: ```{r out.width="100%"} m.lzn.rp <- rpart(zinc ~ ffreq + dist.m + elev, data = meuse, minsplit = 2, cp = 0.001) rpart.plot(m.lzn.rp) ``` This is obviously too site-specific, i.e., overfit, from just 155 observations. # Cross-validation The cross-validation procedure is: 1. Randomly split the dataset into ten parts. Do this by a random permutation of the record numbers, followed by a split. 2. For each of the ten parts: 2.1 Hold the subset out for validation. 2.2 Build a tree with the others. 2.3 Predict at the held-out points. 2.4 Compute validation statistics. 3. Summarize the validation statistics First, decide on the number of cross-validations, and from that determine the size of each hold-out evaluation sample. ```{r} # number of splits, we decide this k <- 10; (size <- floor(dim(meuse)[1]/k)) ``` Set up a data frame to keep the results of each cross-validation run. ```{r} max.split <- 10 # the maximum number of levels we will test split.vs.xval <- as.data.frame(matrix(0, nrow=max.split, ncol=3)) names(split.vs.xval) <- c("nsplit", "xerr", "xerr.sd") ``` To avoid any sequential effects in the database (we don't know why they are presented in this order) make a random permutation, out of which we will sample. ```{r} records.permute <- sample(1:dim(meuse)[1]) ``` Run the cross-validation for each possible depth, i.e., number of splits. ```{r} for (i.split in 1:max.split) { rmse <- rep(0,k) for (i in 0:(k-1)) { ix <- records.permute[(i*size+1):(i*size+size)] # training set to build the tree, test set to evaluate meuse.cal <- meuse[-ix,]; meuse.val <- meuse[ix,] # build the tree m.cal <- rpart(zinc ~ ffreq + dist.m + elev, data = meuse.cal, maxdepth=i.split, cp=0) val <- (meuse.val$zinc - predict(m.cal, meuse.val)) # vector of residuals rmse[i+1] <- sqrt(sum(val^2)/length(val)) ## RMSE } # for k-fold # Now we have `k` RMSE: rmse.m <- mean(rmse) # this is the value we use to match with split rmse.sd <- sd(rmse)/length(ix) # and it has a standard deviation of the mean # save it split.vs.xval[i.split,] = c(i.split, rmse.m, rmse.sd) } # for minsplit ``` From this we can find an "optimim" as the minimum cross-validation RMSE: ```{r} split.vs.xval (ix.min <- which.min(split.vs.xval$xerr)) ``` Final result: the x-validation RMSE vs. number of splits, also showing one standard error. 
Final result: the cross-validation RMSE vs. number of splits, with a dashed line at one standard error above the minimum.

```{r}
plot(split.vs.xval[, 2] ~ split.vs.xval[, 1], type = "b",
     ylab = "cross-validation RMSE", xlab = "Number of splits")
grid()
abline(h = (split.vs.xval[ix.min, "xerr"] + split.vs.xval[ix.min, "xerr.sd"]),
       lty = 2)
```

# Compare with `rpart::plotcp`

What does the built-in procedure say about this model? Each run gives different cross-validation results, although the fitted tree is the same.

```{r}
for (i in 1:4) {
  m.lzn.rp.005 <- rpart(zinc ~ ffreq + dist.m + elev,
                        data = meuse,
                        minsplit = 2,
                        cp = 0.005)
  plotcp(m.lzn.rp.005)
}
```
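The built-in cross-validation results can also be used to choose a `cp` value and prune the over-grown tree. A minimal sketch, using the `cptable` component of the fitted model and `rpart::prune`, here selecting the `cp` with the lowest cross-validation error (the object names are illustrative):

```{r}
# cross-validation results computed by rpart itself
printcp(m.lzn.rp.005)
cpt <- m.lzn.rp.005$cptable
# cp of the subtree with the lowest cross-validation error
(cp.min <- cpt[which.min(cpt[, "xerror"]), "CP"])
# prune back to that complexity and display the result
m.pruned <- prune(m.lzn.rp.005, cp = cp.min)
rpart.plot(m.pruned)
```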