--- title: "Variability of regression trees" author: "D G Rossiter" date: "`r Sys.Date()`" output: html_document: toc: TRUE toc_float: TRUE theme: "lumen" code_folding: show number_sections: TRUE fig_keep: TRUE fig_height: 4 fig_width: 6 fig_align: 'center' --- Here we demonstrate the sensivity of regression trees to small changes in the training dataset, using the Meuse heavy metals dataset. # Setup Libraries used in this exercise: `sp` for spatial objects, `rpart` for regression trees and random forests, `rpart.plot` for nicer plots of trees. ```{r} library(sp) library(rpart) library(rpart.plot) ``` Sample dataset, comes with `sp`; make a log10-transformed copy of the metal concentrations. ```{r} data(meuse) str(meuse) meuse$logZn <- log10(meuse$zinc) # 锌 meuse$logCu <- log10(meuse$copper) # 铜 meuse$logPb <- log10(meuse$lead) # 铅 meuse$logCd <- log10(meuse$cadmium) # 镉 ``` Prediction grid: ```{r} data(meuse.grid) coordinates(meuse.grid) <- ~ x + y gridded(meuse.grid) <- TRUE ``` # The single best tree Use all 155 to build and prune: ```{r} m.lzn <- rpart(logZn ~ ffreq + dist + soil + x + y, data=meuse, minsplit=2, cp=0.005) cp.table <- m.lzn[["cptable"]] cp.min <- cp.table[which.min(cp.table[,"xerror"]),"CP"] m.lzn <- prune(m.lzn, cp=cp.min) rpart.plot(m.lzn) grid.map <- predict(m.lzn, newdata=meuse.grid) meuse.grid$pred <- grid.map print(spplot(meuse.grid, zcol="pred")) x <- m.lzn$variable.importance data.frame(variableImportance = 100 * x / sum(x)) ``` # Repeated random sampling and tree construction Randomly sample a specific number of the 155 points and use it to build and prune a regression tree; use the remaining points as out-of-bag evaluation. The model uses all the variable in the prediction grid: coordnates, relative distance from river, flooding frequency clqss, and soil type. We write this as a function, so we can call it any number of times. It has one argument: the number of points to include in the calibration sample. It returns information about the tree: the minimum complexity parameter, the out-of-bag error, and the variable importance, and the prediction over the grid. It also displays the tree and the prediction over the grid. ```{r} build.rpart <- function(n.cal) { meuse.cal.ix <- sample(1:dim(meuse)[1], size=n.cal, replace=FALSE) meuse.cal <- meuse[meuse.cal.ix,] meuse.val <- meuse[-meuse.cal.ix,] m.lzn.cal <- rpart(logZn ~ ffreq + dist + soil + x + y, data=meuse.cal, minsplit=2, cp=0.005) cp.table <- m.lzn.cal[["cptable"]] cp.min <- cp.table[which.min(cp.table[,"xerror"]),"CP"] m.lzn.cal <- prune(m.lzn.cal, cp=cp.min) rpart.plot(m.lzn.cal) tmp <- m.lzn.cal$variable.importance v.imp <- data.frame(variableImportance = 100 * tmp / sum(tmp)) grid.map <- predict(m.lzn.cal, newdata=meuse.grid) meuse.grid$pred <- grid.map val <- predict(m.lzn.cal, newdata=meuse.val) rmse <- sqrt((sum(meuse.val$logZn - val)^2)/length(meuse.val$logZn)) print(spplot(meuse.grid, zcol="pred")) # return a list return(list(cp = cp.min, rmse = rmse, v.imp = v.imp, pred = grid.map)) } ``` Set up an object to collect the fitted paramter, and a field in the prediction grid to show the average prediction: ```{r} runs <- 64 fits.rf <- list(cp=vector(mode="numeric",length=runs), rmse=vector(mode="numeric",length=runs), v.imp=as.data.frame(matrix(data=0, nrow=runs, ncol=5)) ) names(fits.rf[["v.imp"]]) <- c("ffreq", "dist", "soil","x","y") meuse.grid$logZn.avg <- 0 # will accumulate predictions and at the end divide by # of runs ``` Call the function and collect the fitted parameters. 
Call the function `runs` times and collect the fitted parameters. The chunk options hold all the graphics and display them side-by-side at reduced size.

```{r fig.show='hold', out.width='23%'}
for (run in 1:runs) {
  tmp <- build.rpart(140)
  fits.rf[["cp"]][run] <- tmp[[1]]
  fits.rf[["rmse"]][run] <- tmp[[2]]
  fits.rf[["v.imp"]][run, ] <- t(tmp[[3]])  # a one-column data frame; transpose to a row
  # add to the running sum of predictions
  meuse.grid$logZn.avg <- meuse.grid$logZn.avg + tmp[[4]]
}
```

Show the distribution of the complexity parameters, RMSE, and variable importance:

```{r fig.show='hold', out.width='40%'}
summary(fits.rf[["cp"]])
hist(fits.rf[["cp"]], xlab="Complexity parameter", main="")
rug(fits.rf[["cp"]])
summary(fits.rf[["rmse"]])
hist(fits.rf[["rmse"]], xlab="Out-of-bag RMSE", main="")
rug(fits.rf[["rmse"]])
summary(fits.rf[["v.imp"]])
```

Show the average prediction:

```{r}
meuse.grid$logZn.avg <- meuse.grid$logZn.avg/runs
summary(meuse.grid$logZn.avg)
print(spplot(meuse.grid, zcol="logZn.avg"))
```

Still not very smooth, but better than any single tree.

# Linear model variability

Compare with the variability of linear models.

Function to build a linear model with a subset. Because some predictors are correlated the model may be over-fit, so apply backwards stepwise elimination before summarizing the model and computing the RMSE. The distance predictor is always retained in the reduced model; return its coefficient to see how much it varies between models. Other predictors may be eliminated.

```{r}
build.lm <- function(n.cal) {
  # split into calibration and out-of-bag validation sets
  meuse.cal.ix <- sample(1:dim(meuse)[1], size=n.cal, replace=FALSE)
  meuse.cal <- meuse[meuse.cal.ix, ]
  meuse.val <- meuse[-meuse.cal.ix, ]
  m.lzn.cal <- lm(logZn ~ ffreq + dist + soil + x + y, data=meuse.cal)
  # backwards stepwise elimination to reduce over-fitting
  m.lzn.cal <- step(m.lzn.cal, trace=0)
  meuse.grid$predictedLog10Zn.cal <- predict(m.lzn.cal, newdata=meuse.grid)
  # out-of-bag RMSE: square the residuals before summing
  val <- predict(m.lzn.cal, newdata=meuse.val)
  rmse <- sqrt(sum((meuse.val$logZn - val)^2)/length(meuse.val$logZn))
  print(spplot(meuse.grid, zcol="predictedLog10Zn.cal"))
  return(list(rmse = rmse, coef.dist = coefficients(m.lzn.cal)["dist"]))
}
```

Set up an object to collect the fitted parameters:

```{r}
# the number of runs must match the regression-tree experiment
# runs <- 64
fits.lm <- list(rmse=vector(mode="numeric", length=runs),
                coef.dist=vector(mode="numeric", length=runs))
```

Call the function `runs` times and collect the fitted parameters; again the graphics are displayed side-by-side at reduced size.

```{r fig.show='hold', out.width='23%'}
for (run in 1:runs) {
  tmp <- build.lm(140)
  fits.lm[["rmse"]][run] <- tmp[[1]]
  fits.lm[["coef.dist"]][run] <- tmp[[2]]
}
```

These maps are much more similar to each other than the single regression-tree maps.

Show the distribution of the RMSE and of the linear model distance coefficient; compare the RMSE to the regression-tree RMSE:

```{r fig.show='hold', out.width='40%'}
summary(fits.rf[["rmse"]])
summary(fits.lm[["rmse"]])
hist(fits.lm[["rmse"]], xlab="Out-of-bag RMSE", main="")
rug(fits.lm[["rmse"]])
summary(fits.lm[["coef.dist"]])
hist(fits.lm[["coef.dist"]], xlab="Linear model coefficient: relative distance", main="")
rug(fits.lm[["coef.dist"]])
```

The reduced linear model gives a lower RMSE on average than the single trees. The distance coefficient is fairly stable.
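
To make this comparison explicit, here is a minimal sketch that puts the two out-of-bag RMSE distributions side-by-side; it assumes the chunks above have been run, so that `fits.rf` and `fits.lm` are in the workspace:

```{r}
# compare the out-of-bag RMSE distributions of the two model types
boxplot(list(`regression tree` = fits.rf[["rmse"]],
             `linear model`    = fits.lm[["rmse"]]),
        ylab = "Out-of-bag RMSE")
```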