---
title: 'Exercise: Regional mapping of climate variables (2) Data-driven methods'
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    theme: "lumen"
    number_sections: FALSE
    fig_height: 4
    fig_width: 6
    fig_align: 'center'
---

```{r global_options1, include=FALSE}
knitr::opts_knit$set(root.dir = '/Users/rossiter/data/edu/dgeostats/ex/ds/中国天气')
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

```{r global_options2, include=FALSE}
knitr::opts_chunk$set(fig.width=6, fig.height=4, fig.align='center',
                      fig.path='kgraph/', warning=FALSE, message=FALSE)
# rprojroot::find_rstudio_root_file()
```

This exercise should be loaded into RStudio, in the same directory as the dataset `zhne_climate.RData`. Alternatively, you can load it from another directory and edit the path to the dataset.

This exercise follows `(1) Data exploration`.

# Task 0 -- Packages

Load the `rgdal`, `sp`, `gstat` and `ggplot2` packages. During the following tasks, load additional packages as needed.

```{r}
library(sp); library(rgdal); library(gstat); library(ggplot2)
library(ranger)
library(caret)
library(rpart)
library(rpart.plot)
```

# Task 1 -- Dataset

This dataset was examined in Part (1).

```{r load.points}
load("./zhne_climate.RData", verbose=TRUE)
```

## Fixing the encoding

This `.RData` file was saved in the UTF-8 encoding. If your system uses another encoding (typical for Chinese Windows operating systems), you have to inform R of the original encoding. It will then re-encode character strings to your system's encoding.

Here is a function that loops through the fields of a data frame and fixes them, using the `Encoding` function. The default original encoding is `UTF-8`, but this can be changed. For fields that are factors, it first converts them to characters, re-encodes the names, and then converts back to factors.

Arguments:

1. `df` : the data frame to be re-encoded
2. `inputEncoding` : the original encoding, default `"UTF-8"`

```{r fix.encoding}
fix.encoding <- function(df, inputEncoding = "UTF-8") {
  for (col in 1:ncol(df)) {
    if (is.character(df[, col])) {
      Encoding(df[, col]) <- inputEncoding
    } else if (is.factor(df[, col])) {
      # factors: re-encode the labels, then rebuild the factor
      tmp <- as.character(df[, col])
      Encoding(tmp) <- inputEncoding
      df[, col] <- as.factor(tmp)
    }
  }
  return(df)
}
```

Apply this to the data frames.

```{r}
zhne.df <- fix.encoding(zhne.df)
# only the data frame part of the Spatial* objects
zhne.m@data <- fix.encoding(zhne.m@data)
zhne.adm1.m@data <- fix.encoding(zhne.adm1.m@data)
```

Check that the re-encoding worked:

```{r}
head(zhne.df$station.name)
levels(zhne.df$province)
levels(zhne.adm1.m$NAME_HZ)
```

# Task 2 -- Target and predictor variables

List the field names of the points (climate stations).

```{r}
names(zhne.df)
```

Q: Which fields are the variables to be mapped over the region?

Q: Which fields could be used to predict these?

Select a target variable. Here we use annual average temperature.

```{r}
target.name <- "t.avg"
```

Set up a formula with the predictor fields.

```{r}
length(pred.fields <- c(9, 16:20))
(pred.names <- names(zhne.df)[pred.fields])
(model.formula <- as.formula(paste(target.name, "~",
                                   paste0(pred.names, collapse=" + "))))
```

Now we can use this model formula in any function that recognizes the S model formula structure.

# Task 3 -- Map display functions

Execute the following two code chunks. They are functions to display prediction maps and their differences. There is some interesting R code in here, but you don't need to understand it to use these functions.
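Both functions plot fields of the data frame `zhne.grid.m.df`, which has the grid's coordinates as ordinary fields `E` and `N`. If `zhne.grid.m.df` was not included in the `.RData` file, it could be derived from the `sp` grid object; a minimal sketch, assuming `zhne.grid.m` is a `SpatialPixelsDataFrame` whose coordinate columns are named `E` and `N`:

```{r grid.df, eval=FALSE}
# as.data.frame() on a SpatialPixelsDataFrame appends the coordinates
# (here assumed to be named E and N) as ordinary columns
zhne.grid.m.df <- as.data.frame(zhne.grid.m)
str(zhne.grid.m.df[, c("E", "N")])
```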
A function for map display of a field stored in the `zhne.grid.m.df` object.

Arguments:

1. `.prediction.map.name`: the name of the field in the `zhne.grid.m.df` object to map
2. `.plot.title`: a title for the map (e.g., the target variable and method)
3. `.legend.title`: a title for the legend (e.g., the units of measure)
4. `.plot.limits`: limits of the z-variable, default is the range of the prediction
5. `.palette` : colour palette from the `RColorBrewer` package, default is `"YlGnBu"`

```{r}
display.prediction.map <- function(.prediction.map.name, .plot.title,
                                   .legend.title,
                                   .plot.limits=NULL, .palette="YlGnBu") {
  ggplot(data=zhne.grid.m.df) +
    geom_point(aes_(x=quote(E), y=quote(N),
                    color=as.name(.prediction.map.name))) +
    xlab("E") + ylab("N") + coord_fixed() +
    # geom_point(aes(x=E, y=N, color=t.avg),
    #            data=zhne.df, shape=I(20)) +
    ggtitle(.plot.title) +
    scale_colour_distiller(name=.legend.title, space="Lab",
                           palette=.palette, limits=.plot.limits)
}
```

A function to display the difference between two maps, this difference stored as a field in the `zhne.grid.m.df` object.

Arguments:

1. `.diff.map.name`: the name of the field in the `zhne.grid.m.df` object to map
2. `.diff.map.title`: a title for the map (e.g., the two maps being compared)
3. `.legend.title`: a title for the legend (e.g., the units of measure)
4. `.palette` : colour palette from the `RColorBrewer` package, default is `"Spectral"`

```{r}
display.difference.map <- function(.diff.map.name, .diff.map.title,
                                   .legend.title, .palette="Spectral") {
  ggplot(data=zhne.grid.m.df) +
    geom_point(aes_(x=quote(E), y=quote(N),
                    color=as.name(.diff.map.name))) +
    xlab("E") + ylab("N") + coord_fixed() +
    ggtitle(.diff.map.title) +
    scale_colour_distiller(name=.legend.title, space="Lab",
                           palette=.palette)
}
```

If you want to experiment with other colour schemes, pick one of the "sequential" themes from the `RColorBrewer` package. You can see them with this code:

```{r brewer.colors, eval=FALSE}
library(RColorBrewer)
display.brewer.all(type="seq")   # sequential palettes
display.brewer.all(type="div")   # diverging palettes
```

# Regression Trees

## Task 4 -- Model

Build a regression tree to predict the target from the predictors. Use a small value of the complexity parameter, and allow nodes with as few as two stations to be split (so leaves can contain a single station), to build a complex tree which can then be pruned back.

```{r rt1}
m.rt <- rpart(model.formula, data=zhne.df, minsplit=2, cp=0.0005)
```

Plot the cross-validation error against the complexity parameter:

```{r rt1.c, fig.width=7, fig.height=5}
plotcp(m.rt)
```

Q: At what complexity is there almost no improvement in cross-validation error?

Prune the tree to this complexity:

```{r rt2}
m.rt.p <- prune(m.rt, cp=0.002)
```

Display the variable importance:

```{r rt2.vi}
x <- m.rt.p$variable.importance
data.frame(variableImportance = 100 * x / sum(x))
```

Q: What is the sequence of predictor importances? Is there one most important predictor? Why is this one so important for this target variable?

Display the pruned regression tree:

```{r rt2.plot, fig.width=12, fig.height=8}
rpart.plot(m.rt.p, digits=3, type=4, extra=1)
```

Q: What is the first split? How many stations are in the left (lower values) split, how many in the right (higher values)? What is the mean of the whole dataset, and the mean of the two first-level branches? Is this a large difference?

Q: How many terminal nodes (leaves) are there? Which has the most and which the fewest number of stations?

Q: Why is there more than one station in each leaf? After all, with N and E in the predictor set, each station could be "boxed" into its own leaf.
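One way to answer the last question is to count the stations in each leaf; a minimal sketch, using the fitted `rpart` object's `where` element (the frame row, i.e., the leaf, into which each station falls):

```{r rt.leaf.counts, eval=FALSE}
# stations per leaf; row names of the frame are the node numbers
# shown in the tree plot
table(rownames(m.rt.p$frame)[m.rt.p$where])
```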
Display the actual vs. fitted values on a 1:1 plot:

```{r plot.rt, fig.width=7, fig.height=7}
summary(p.rt <- predict(m.rt.p, newdata=zhne.df))
plot(p.rt, zhne.df$t.avg, asp=1, col=zhne.df$province.en, pch=20,
     main="T annual avg.", ylab="Measured", xlab="Regression Tree fit")
legend("topleft", levels(zhne.m$province.en), pch=20, col=1:9)
abline(0,1); grid()
```

Q: Is there any bias? Is the spread at each prediction similar for each value of the predicted target variable?

## Task 5 -- Map

```{r rt.map, fig.width=10, fig.height=10}
zhne.grid.m.df$pred.rt <- predict(m.rt.p, newdata=zhne.grid.m.df)
summary(zhne.grid.m.df$pred.rt)
display.prediction.map("pred.rt", "regression tree prediction", "t.avg")
```

Q: Describe the spatial pattern of the map. Does it look realistic? What geographic features used in the regression tree are obvious in the map?

## (Optional) Residuals

Display a bubble plot of the regression tree residuals. These are the actual values, minus the predicted values at the known points.

```{r rt.resid, fig.width=7, fig.height=7}
summary(fit.rt <- predict(m.rt.p, newdata=zhne.df))
summary(residuals.rt <- (zhne.df$t.avg - fit.rt))
# add them to a SpatialPointsDataFrame; the bubble() method requires this class
zhne.m$residuals.rt <- residuals.rt
bubble(zhne.m, zcol="residuals.rt", pch=1, main="Regression Tree residuals")
```

Q: Does there appear to be spatial correlation?

Compute an empirical variogram of the regression tree residuals and model it.

```{r rt.resid.vgm, fig.width=6, fig.height=4}
v.rtresid <- variogram(residuals.rt ~ 1, loc=zhne.m, cutoff=400000, width=24000)
plot(v.rtresid, plot.numbers=TRUE)
(vmf.rtresid <- fit.variogram(v.rtresid, model=vgm(0.45, "Exp", 80000, 0.05)))
plot(v.rtresid, plot.numbers=TRUE, model=vmf.rtresid)
```

This is a satisfactory fit.

Krige the residuals onto the prediction grid. Note that the prediction locations must also be an `sp` object.

```{r rt.resid.krige, fig.width=7, fig.height=4}
class(zhne.grid.m)
system.time(
  k.rt.resid <- krige(residuals.rt ~ 1, loc=zhne.m,
                      model=vmf.rtresid, newdata=zhne.grid.m)
)
summary(k.rt.resid)
hist(k.rt.resid$var1.pred, main="Adjustments to RT prediction",
     xlab="+/- deg C, average annual T")
rug(k.rt.resid$var1.pred)
```

Q: What is the magnitude of the adjustments? How does this compare with the range of values of the target in the study area? Do you consider (some, all, any) of these adjustments significant?

Display the map of the kriged residuals.

```{r rt.resid.map, fig.width=10, fig.height=10}
spplot(k.rt.resid, zcol="var1.pred")
```

Q: Where are the largest positive and negative adjustments? Where is there no adjustment? Why is this?

Q*: (a difficult question) Why were these differences not detected by the regression tree?

Add the predicted residuals to the regression tree prediction and compare its summary with that of the regression tree alone.

```{r rt.rk}
zhne.grid.m.df$pred.rt.rk <- (zhne.grid.m.df$pred.rt + k.rt.resid$var1.pred)
summary(zhne.grid.m.df$pred.rt.rk)   # predictions including kriged residuals
summary(zhne.grid.m.df$pred.rt)      # predictions only from the regression tree
```

Q: How has adding the kriged residuals affected the statistical summary of the predictions?
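To help judge whether the adjustments are significant, one simple summary is the proportion of grid cells whose prediction changes by more than some threshold; a minimal sketch, with an arbitrary 0.5 deg C threshold chosen only for illustration:

```{r rt.rk.size, eval=FALSE}
# proportion of grid cells adjusted by more than 0.5 deg C (arbitrary threshold)
mean(abs(k.rt.resid$var1.pred) > 0.5)
```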
Display the map of the regression tree predictions adjusted by the kriged residuals.

```{r rt.rk.map, fig.width=10, fig.height=10}
display.prediction.map("pred.rt.rk",
                       "regression tree prediction + kriged residuals",
                       "t.avg")
```

Q: Where has the map changed the most from the prediction without the kriged residuals?

# Random Forests

Instead of a single tree, we build a random forest of many trees, using the efficient `ranger` package. In addition, we tune its parameters: the number of predictors to try at each split (`mtry`) and the minimum number of stations in a terminal node (`min.node.size`). We will then use the optimal values of these.

## Task 6 -- Optimization

Determine the optimal parameter values for a `ranger` random forest model, using the `caret` package's `train` function with cross-validation:

```{r ranger.tune}
dim(preds <- zhne.df[, pred.fields])
length(response <- zhne.df[, "t.avg"])
system.time(
  ranger.tune <- train(x = preds, y = response, method="ranger",
                       tuneGrid = expand.grid(.mtry = 2:4,
                                              .splitrule = "variance",
                                              .min.node.size = 1:10),
                       trControl = trainControl(method = 'cv'))
)
```

View the training results:

```{r ranger.tune.results, fig.width=6, fig.height=4}
ix <- which.min(ranger.tune$results$RMSE)
ranger.tune$results[ix, c(1,3,4)]
ix <- which.max(ranger.tune$results$Rsquared)
ranger.tune$results[ix, c(1,3,5)]
ix <- which.min(ranger.tune$results$MAE)
ranger.tune$results[ix, c(1,3,6)]
plot(ranger.tune, metric="RMSE")
plot(ranger.tune, metric="Rsquared")
```

Q: What is the optimum value of `mtry`, as evidenced by the RMSE, MAE and $R^2$?

Q: What is the optimum minimum node size?

Q: Are these optima clearly better than the alternatives?

## Task 7 -- Model

Build a model with the optimal parameters. We also request a measure of the variable importance, by the permutation method -- this randomly permutes each predictor variable in turn and measures how much the out-of-bag error increases. (Note that this differs from the importance measure of the single regression tree, which is based on the improvement in fit attributable to each split.)

```{r rf.full}
rf <- ranger(model.formula, data=zhne.df,
             importance="permutation", scale.permutation.importance=FALSE,
             mtry=4, min.node.size=4, oob.error=TRUE, num.trees=1024)
print(rf)
```

Q: What is the out-of-bag (OOB) goodness-of-fit ($R^2$)? Is this a "good" predictive model, in your opinion?

Q: What is the OOB prediction error (MSE)? Is this a large discrepancy in mean annual temperature, in your opinion?

### OOB fit

Plot the actual vs. OOB-predicted values against the 1:1 line:

```{r rf.full.plot, fig.width=7, fig.height=7}
plot(rf$predictions, zhne.df$t.avg, asp=1, col=zhne.df$province.en, pch=20,
     main="Annual T avg", ylab="Measured", xlab="Ranger RF fit")
legend("bottomright", levels(zhne.m$province.en), pch=20, col=1:9)
abline(0,1); grid()
```

Q: How good is the 1:1 fit? Are there any serious OOB errors?

Find the most serious OOB errors. Set a threshold at 2.5 deg C.

```{r}
(ix <- which((abs(rf$predictions - zhne.df$t.avg)) > 2.5))
zhne.df[ix, c("province", "station.id", "station.name", "elev.m", "t.avg")]
rf$predictions[ix]
```

Q: Did the model over- or under-predict these? Explain the reason for this large out-of-bag error.

### Variable importance

Display the variable importance, as a proportion of the total importance measures:

```{r}
sort(round(ranger::importance(rf)/sum(ranger::importance(rf)), 3),
     decreasing=TRUE)
```

Q: What is the sequence of predictor importances? Is there one most important predictor? Why is this one so important for this target variable? Are the sequence and the relative importances the same as for the single regression tree?
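To compare with the single regression tree, the two importance measures can be placed side by side, each rescaled to a proportion of its total. This is a sketch only: the two measures are computed differently (split improvement for `rpart`, OOB permutation for `ranger`), and any predictor never used by the pruned tree will show as `NA`:

```{r imp.compare, eval=FALSE}
imp.rt <- m.rt.p$variable.importance   # split-improvement importance
imp.rf <- ranger::importance(rf)       # permutation importance
round(data.frame(row.names = pred.names,
                 rpart  = imp.rt[pred.names] / sum(imp.rt),
                 ranger = imp.rf[pred.names] / sum(imp.rf)),
      3)
```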
## Task 8 -- Map

Display the RF map:

```{r rf.map, fig.width=10, fig.height=10}
zhne.grid.m.df$pred.rf <- predict(rf, data=zhne.grid.m.df)$predictions
summary(zhne.grid.m.df$pred.rf)
display.prediction.map("pred.rf", "Annual average T, ranger prediction", "deg C")
```

Q: Describe the spatial pattern.

Q: Is there any evidence of the predictors?

## (Optional) Residuals

Display a bubble plot of the RF residuals. These are the actual values, minus the predicted values at the known points.

```{r rf.resid, fig.width=7, fig.height=7}
summary(fit.rf <- predict(rf, data=zhne.df)$predictions)
summary(residuals.rf <- (zhne.df$t.avg - fit.rf))
# add them to a SpatialPointsDataFrame; the bubble() method requires this class
zhne.m$residuals.rf <- residuals.rf
bubble(zhne.m, zcol="residuals.rf", pch=1, main="Random Forest residuals")
```

Q: Does there appear to be spatial correlation?

Examine this with an empirical variogram.

```{r rf.resid.vgm.1, fig.width=6, fig.height=4}
v.rfresid <- variogram(residuals.rf ~ 1, loc=zhne.m, cutoff=400000, width=24000)
plot(v.rfresid, plot.numbers=TRUE)
```

Q: Do these residuals show more or less spatial correlation than those from the regression tree? (Consider the total sill, range, and nugget.) Why?

Model the variogram.

```{r rf.resid.vgm.2, fig.width=6, fig.height=4}
(vmf.rfresid <- fit.variogram(v.rfresid, model=vgm(0.05, "Exp", 80000, 0.02)))
plot(v.rfresid, plot.numbers=TRUE, model=vmf.rfresid)
```

This is a satisfactory fit.

Krige the residuals onto the prediction grid.

```{r rf.resid.krige, fig.width=7, fig.height=4}
system.time(
  k.rf.resid <- krige(residuals.rf ~ 1, loc=zhne.m,
                      model=vmf.rfresid, newdata=zhne.grid.m)
)
summary(k.rf.resid)
hist(k.rf.resid$var1.pred, main="Adjustments to RF prediction",
     xlab="+/- deg C, average annual T")
rug(k.rf.resid$var1.pred)
```

Q: What is the magnitude of the adjustments? How does this compare with the range of values of the target in the study area? How does this compare with the residuals from the regression tree? Do you consider (some, all, any) of these adjustments significant?

Display the map of the kriged residuals.

```{r rf.resid.map, fig.width=10, fig.height=10}
spplot(k.rf.resid, zcol="var1.pred")
```

Q: Where are the largest positive and negative adjustments? Where is there no adjustment? Why is this?

Add the predicted residuals to the random forest prediction and compare its summary with that of the random forest alone.

```{r rf.rk}
zhne.grid.m.df$pred.rf.rk <- (zhne.grid.m.df$pred.rf + k.rf.resid$var1.pred)
summary(zhne.grid.m.df$pred.rf.rk)   # predictions including kriged residuals
summary(zhne.grid.m.df$pred.rf)      # predictions only from the random forest
```

Q: How has adding the kriged residuals affected the statistical summary of the predictions?

Display the map of the random forest predictions adjusted by the kriged residuals.

```{r rf.rk.map, fig.width=10, fig.height=10}
display.prediction.map("pred.rf.rk",
                       "random forest prediction + kriged residuals",
                       "t.avg")
```

Q: Where has the map changed the most from the prediction without the kriged residuals?
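As a final (optional) comparison, the two residual-adjusted predictions can be differenced and displayed with the difference-map function from Task 3; a sketch:

```{r rt.rf.compare, eval=FALSE}
# difference between the two residual-adjusted predictions over the grid
zhne.grid.m.df$diff.rf.rt <- zhne.grid.m.df$pred.rf.rk - zhne.grid.m.df$pred.rt.rk
summary(zhne.grid.m.df$diff.rf.rt)
display.difference.map("diff.rf.rt", "RF+RK minus RT+RK predictions", "deg C")
```

Q: Where do the two data-driven methods disagree the most?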