---
title: "Tutorial: Regional mapping of climate variables from point samples -- using covariates"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    theme: "lumen"
    code_folding: show
    number_sections: TRUE
    fig_height: 4
    fig_width: 4
    fig_align: 'center'
bibliography: ex.bib
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=4, fig.height=4, fig.align='center',
                      fig.path='kgraph/', warning=FALSE, message=FALSE)
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

# Introduction

In exercise `exRKGLS.pdf` a large number of modelling and mapping methods for regional climate are compared, but using at most three regional predictors: coordinates (Northing, Easting) and elevation, as well as the local neighbourhood. See that document for explanations of each method. Here we use some of the same methods, but extended to use a larger set of predictors.

In tutorial `exRKGLS_SetupAdditionalCovariates.html` we developed more predictors (covariables) that might have a relation to regional or local climate: distance to the Great Lakes shoreline, distance to the Atlantic Ocean coast, two terrain indices (Multi-resolution Valley-Bottom Flatness MRVBF and Terrain Ruggedness Index TRI), and population density within two radii around stations or prediction points (nominally 2.5' and 15'). In that tutorial we extracted the values of these at the stations (objects `ne.m` and `ne.df`, as a `SpatialPointsDataFrame` and `data.frame`, respectively) and over the grid (objects `dem.ne.m.sp` and `dem.ne.m.df`, as a `SpatialPixelsDataFrame` and `data.frame`, respectively). These were saved in the R Data file `StationsDEM_covariates.RData`.

In this tutorial we investigate whether any of these covariables, separately or in combination, improve modelling and prediction of the regional climate. As in `exRKGLS.pdf` the target variable is the 30-year 1971-2000 average annual growing degree days, base $50^\circ$F.
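One common definition of growing degree days accumulates the amount by which each day's mean temperature exceeds the base temperature, floored at zero. A minimal base-R sketch (the helper `gdd50` and the toy temperatures are illustrative, not taken from the station data):

```r
# Growing degree days, base 50 F: sum of daily mean temperature
# excess over the base, with days below the base contributing zero
gdd50 <- function(tmean, base = 50) {
  sum(pmax(tmean - base, 0))
}

# Toy example, three daily means (degrees F):
gdd50(c(55, 48, 62))   # (55-50) + 0 + (62-50) = 17
```

The station values used below are 30-year averages of this annual accumulation; definitions elsewhere may also cap extreme temperatures before accumulating.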
Packages used in this tutorial:

```{r}
library(ggplot2)
library(sp)
library(ranger)
library(caret)
library(rpart)
library(rpart.plot)
library(Cubist)
# library(mgcv)
# library(spgwr)
```

## Display functions

A function to display prediction maps (see `exRKGLS.pdf`):

```{r}
display.prediction.map <- function(.prediction.map.name, .plot.title,
                                   .legend.title, .plot.limits=NULL,
                                   .palette="YlGnBu") {
  ggplot(data=dem.ne.m.df) +
    geom_point(aes_(x=quote(E), y=quote(N),
                    color=as.name(.prediction.map.name))) +
    xlab("E") + ylab("N") + coord_fixed() +
    geom_point(aes(x=E, y=N, color=ANN_GDD50), data=ne.df, shape=I(20)) +
    ggtitle(.plot.title) +
    scale_colour_distiller(name=.legend.title, space="Lab",
                           palette=.palette, limits=.plot.limits)
}
```

A function to display the difference between two maps:

```{r}
display.difference.map <- function(.diff.map.name, .diff.map.title,
                                   .legend.name, .palette="Spectral") {
  ggplot(data=dem.ne.m.df) +
    geom_point(aes_(x=quote(E), y=quote(N),
                    color=as.name(.diff.map.name))) +
    xlab("E") + ylab("N") + coord_fixed() +
    ggtitle(.diff.map.title) +
    scale_colour_distiller(name=.legend.name, space="Lab", palette=.palette)
}
```

# Dataset

```{r}
load("./ds/NEweather/StationsDEM_covariates.RData", verbose=TRUE)
names(ne.df)
length(pred.fields <- c(6,27:34))
(pred.names <- names(ne.df)[pred.fields])
(model.formula <- paste("ANN_GDD50 ~", paste0(pred.names, collapse=" + ")))
length(pred.fields.base <- c(6,27,28))
(pred.names.base <- names(ne.df)[pred.fields.base])
(model.formula.base <- paste("ANN_GDD50 ~", paste0(pred.names.base, collapse=" + ")))
names(dem.ne.m.df)
summary(dem.ne.m.df)
```

We see the list of predictors that could be used for modelling and mapping. In the `Spatial*` versions the coordinates are in the `@coordinates` slot, not in the `@data` slot.
# Linear modelling

## Relation among predictors

First, investigate the relation among the predictors:

```{r pairs, fig.height=10, fig.width=10}
cor(ne.df[, pred.fields])
pairs(ne.df[, pred.fields], pch=20, cex=0.4)
```

The distances to lakes and coast are inversely related, and these are related to the coordinates, because of the geography. Of course the two population densities are positively correlated. So there is definite collinearity among the predictors.

PCA:

```{r pca, fig.height=7, fig.width=7}
pc <- prcomp(ne.df[, pred.fields], scale. = TRUE, retx=TRUE)
summary(pc)
pc$rotation
biplot(pc)
biplot(pc, choices=3:4)
```

The two population densities are closely positively related in PC1 and PC2. Lakes and coasts are almost perfectly inversely related in PC1 and PC2. Population is correlated with distance to lakes (and so inversely with distance from coast); this is a geographic accident, because population is denser towards the SE (NYC, NJ). The terrain indices and geography also have a fortuitous relation. We cannot easily reduce these PCs to meaningful factors.

## Model

Build an OLS linear model:

```{r ols}
summary(m <- lm(model.formula, data=ne.df))
car::vif(m)
summary(m.step <- step(m, direction="backward", trace=0))
car::vif(m.step)
```

The stepwise regression removes the two terrain indices and one of the distances, but six predictors remain. Serious collinearity and variance inflation! The Northing, which we know physically is the most important, shows one of the highest VIF; the other is distance to lakes. Easting is also fairly high.

```{r ols.final, fig.width=12, fig.height=4}
summary(m.final <- update(m.step, . ~ . - dist.lakes))
car::vif(m.final)
par(mfrow=c(1,3))
plot(m.final, which=c(1,2,6))
par(mfrow=c(1,1))
ix <- which(abs(residuals(m.final)) > 500)
ne.df[ix, 2:3]
```

Interestingly, both log10-population densities remain in the "final" model.
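The variance inflation factor reported by `car::vif` for predictor $x_j$ is $1/(1 - R^2_j)$, where $R^2_j$ comes from regressing $x_j$ on the other predictors. A minimal base-R check of this definition on synthetic collinear data (the data and the helper `vif1` are illustrative, not part of the tutorial's workflow):

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # deliberately collinear with x1
x3 <- rnorm(n)                        # independent of the others
d  <- data.frame(x1, x2, x3)

# VIF of one predictor: regress it on the remaining predictors
vif1 <- function(j, data) {
  others <- setdiff(names(data), j)
  r2 <- summary(lm(reformulate(others, response = j), data = data))$r.squared
  1 / (1 - r2)
}
(v <- sapply(names(d), vif1, data = d))
# x1 and x2 inflate far above 1; x3 stays near 1
```

This mirrors why Northing and distance to lakes, which are strongly related through the geography, both show high VIF in the stepwise model.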
```{r}
summary(p.lm <- predict(m.final, newdata=ne.df))
plot(p.lm, ne.df$ANN_GDD50, asp=1, col=ne.df$STATE, pch=20,
     main="ANN_GDD50", ylab="Measured", xlab="Reduced OLS fit")
abline(0,1); grid()
```

Linear models do not perform well here.

# Regression Trees

## Model

```{r rt}
m.rt <- rpart(model.formula, data=ne.df, minsplit=2, cp=0.001)
plotcp(m.rt)
```

Prune:

```{r rt2, fig.width=10, fig.height=6}
m.rt.p <- prune(m.rt, cp=0.004)
x <- m.rt.p$variable.importance
data.frame(variableImportance = 100 * x / sum(x))
rpart.plot(m.rt.p, digits=3, type=4, extra=1)
```

The distance to lakes is the most important!

```{r}
summary(p.rt <- predict(m.rt.p, newdata=ne.df))
plot(p.rt, ne.df$ANN_GDD50, asp=1, col=ne.df$STATE, pch=20,
     main="ANN_GDD50", ylab="Measured", xlab="Regression Tree fit")
abline(0,1); grid()
```

## Mapping

```{r rt.map, eval=TRUE, fig.width=6, fig.height=6}
dem.ne.m.df$pred.rt <- predict(m.rt.p, newdata=dem.ne.m.df)
summary(dem.ne.m.df$pred.rt)
display.prediction.map("pred.rt",
    "Annual GDD, base 50F, regression tree prediction, 9 predictors",
    "GDD50")
```

Clear artefacts of the distance to coast and population densities.
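The pruning threshold read off `plotcp()` by eye can also be chosen programmatically from the tree's `cptable`, for example with the one-SE rule: take the smallest tree whose cross-validated error is within one standard error of the minimum. A sketch on synthetic data (the data and the one-SE selection are illustrative; above we instead chose `cp=0.004` from the plot):

```r
library(rpart)

set.seed(7)
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- ifelse(d$x1 > 0.5, 10, 0) + ifelse(d$x2 > 0.3, 5, 0) + rnorm(300)

fit <- rpart(y ~ x1 + x2, data = d, minsplit = 2, cp = 0.001)
ct  <- fit$cptable   # columns CP, nsplit, rel error, xerror, xstd

# one-SE rule: first (largest-CP) row with xerror within one SE of the minimum
thresh <- min(ct[, "xerror"]) + ct[which.min(ct[, "xerror"]), "xstd"]
cp.1se <- ct[which(ct[, "xerror"] <= thresh)[1], "CP"]
pruned <- prune(fit, cp = cp.1se)
```

Because the cross-validation in `rpart` is randomized, `cp.1se` varies slightly between runs, which is why inspecting `plotcp()` before fixing a value is still worthwhile.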
# Random Forests

## Optimization

Determine the optimum parameters for a random forest model using `ranger`:

```{r ranger.tune}
dim(preds <- ne.df[, pred.fields])
length(response <- ne.df[, "ANN_GDD50"])
system.time(
  ranger.tune <- train(x = preds, y = response, method="ranger",
                       tuneGrid = expand.grid(.mtry = 2:7,
                                              .splitrule = "variance",
                                              .min.node.size = 1:10),
                       trControl = trainControl(method = 'cv'))
)
```

View the training results:

```{r ranger.tune.results, fig.width=6, fig.height=4}
ix <- which.min(ranger.tune$result$RMSE)
ranger.tune$result[ix, c(1,3,4)]
ix <- which.max(ranger.tune$result$Rsquared)
ranger.tune$result[ix, c(1,3,5)]
ix <- which.min(ranger.tune$result$MAE)
ranger.tune$result[ix, c(1,3,6)]
plot.train(ranger.tune, metric="RMSE")
plot.train(ranger.tune, metric="Rsquared")
```

All three metrics agree on five predictors to try at each split, and a node size of three. There is not much improvement past `mtry=3`, and `min.node.size` from 1 to 3 gives similar results.

## Best model

Build a model with the optimal parameters:

```{r rf.full}
rf <- ranger(model.formula, data=ne.df, importance="impurity",
             mtry=5, min.node.size=3, oob.error=TRUE, num.trees=1024)
print(rf)
str(rf, max.level=1)
round(rf$prediction.error/rf$num.samples)
round(rf$r.squared,3)
plot(rf$predictions, ne.df$ANN_GDD50, asp=1, col=ne.df$STATE, pch=20,
     main="ANN_GDD50", ylab="Measured", xlab="Ranger RF fit")
abline(0,1); grid()
round(ranger::importance(rf)/sum(ranger::importance(rf)),3)
```

Surprisingly, distance to lakes is the most important; this may be partially substituting for Northing. Population at 15' is somewhat important. The mean out-of-bag error is `r round(rf$prediction.error/rf$num.samples)`.
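The resampling that `trainControl(method='cv')` performs can be written out by hand, which makes clear what the RMSE in the tuning tables measures. A base-R sketch of 10-fold cross-validated RMSE for a simple model on synthetic data (the data and the helper `cv.rmse` are illustrative):

```r
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- 3 * d$x + rnorm(200, sd = 0.5)

# k-fold cross-validated RMSE for the model y ~ x:
# hold out each fold in turn, fit on the rest, predict the fold
cv.rmse <- function(data, k = 10) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq.err <- sapply(seq_len(k), function(i) {
    fit <- lm(y ~ x, data = data[fold != i, ])
    p   <- predict(fit, newdata = data[fold == i, ])
    (data$y[fold == i] - p)^2
  })
  sqrt(mean(unlist(sq.err)))
}
cv.rmse(d)   # close to the simulated noise sd of 0.5
```

`caret` repeats exactly this scheme for every row of `tuneGrid` and reports the tuning combination with the lowest cross-validated RMSE.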
## Mapping

```{r rf.map, eval=TRUE, fig.width=6, fig.height=6}
str(dem.ne.m.df)
dem.ne.m.df$pred.rf <- predict(rf, data=dem.ne.m.df)$predictions
summary(dem.ne.m.df$pred.rf)
display.prediction.map("pred.rf",
    "Annual GDD, base 50F, ranger prediction, 9 predictors",
    "GDD50")
```

The effect of the population grid is clear.

# Cubist

```{r cubist.tune, fig.width=6, fig.height=4}
system.time(
  cubist.tune <- train(x = preds, y = response, method="cubist",
                       tuneGrid = expand.grid(.committees = 1:12,
                                              .neighbors = 0:8),
                       trControl = trainControl(method = 'cv'))
)
plot(cubist.tune, metric="RMSE")
plot(cubist.tune, metric="Rsquared")
cubist.tune$bestTune$neighbors
```

Committees have little effect; two is somewhat better than one. Adjustment by one and two neighbours is very bad; three is similar to none; 4 to 8 gives steady improvement, but not much past 5.

Best model:

```{r cubist.model}
c.model <- cubist(x = preds, y = response, committees=2)
summary(c.model)
```

The only split is on elevation. The second model has no split.

```{r cubist1to1}
cubist.fits <- predict(c.model, newdata=preds,
                       neighbors=cubist.tune$bestTune$neighbors)
(rmse.cubist <- sqrt(mean((cubist.fits - ne.df$ANN_GDD50)^2)))
plot(cubist.fits, ne.df$ANN_GDD50, asp=1, col=ne.df$STATE, pch=20,
     main="ANN_GDD50", ylab="Measured", xlab="Cubist fit")
abline(0,1); grid()
```

Predict over the grid:

```{r cubist.pred, fig.width=6, fig.height=6}
summary(dem.ne.m.df$pred.cubist <- predict(c.model, newdata=dem.ne.m.df,
                                           neighbors=cubist.tune$bestTune$neighbors))
display.prediction.map("pred.cubist",
    "Annual GDD, base 50F, Cubist prediction",
    "GDD50")
```

# Difference with simple models

Do the additional predictors improve or degrade the model?
Compare with the base model, which uses only the coordinates and elevation. First, a matrix of the values of the base predictors at each station:

```{r preds.base}
pred.fields.base
dim(preds.base <- ne.df[, pred.fields.base])
```

## Random forest

Tune and build a random forest model using just three predictors:

```{r}
system.time(
  ranger.tune.base <- train(x = preds.base, y = response, method="ranger",
                            tuneGrid = expand.grid(.mtry = 1:3,
                                                   .splitrule = "variance",
                                                   .min.node.size = 1:10),
                            trControl = trainControl(method = 'cv'))
)
```

```{r ranger.tune.results.base, fig.width=6, fig.height=4}
ix <- which.min(ranger.tune.base$result$RMSE)
ranger.tune.base$result[ix, c(1,3,4)]
ix <- which.max(ranger.tune.base$result$Rsquared)
ranger.tune.base$result[ix, c(1,3,5)]
ix <- which.min(ranger.tune.base$result$MAE)
ranger.tune.base$result[ix, c(1,3,6)]
plot.train(ranger.tune.base, metric="RMSE")
plot.train(ranger.tune.base, metric="Rsquared")
```

Here the results vary quite a bit with each run; one or two predictors give similar results, so we select two. Node size makes little difference.

Build a model with the optimal parameters:

```{r rf.base}
rf.base <- ranger(model.formula.base, data=ne.df, importance="impurity",
                  mtry=2, min.node.size=2, oob.error=TRUE, num.trees=1024)
print(rf.base)
round(ranger::importance(rf.base)/sum(ranger::importance(rf.base)),3)
round(rf.base$prediction.error/rf.base$num.samples)
round(rf$prediction.error/rf$num.samples)
round(rf.base$r.squared,3)
round(rf$r.squared,3)
```

This simpler model is somewhat poorer than the complex model, showing that some of the added predictors handle local situations. Northing and elevation are by far the most important.
Compute and display the difference with the 9-predictor map:

```{r}
dem.ne.m.df$pred.rf.base <- predict(rf.base, data=dem.ne.m.df)$predictions
summary(dem.ne.m.df$diff.rf.9.3 <- dem.ne.m.df$pred.rf - dem.ne.m.df$pred.rf.base)
```

```{r display.rf.diff, fig.width=6, fig.height=6}
display.difference.map("diff.rf.9.3",
    "Difference RF 9 and RF 3 predictors",
    "+/- GDD50")
```

The largest differences are outside the study area, mostly (?) due to distance from the Great Lakes. Within it, there are small negative adjustments on Long Island, but also near Lake Ontario (not correct).

## Cubist

Tune and build a Cubist model using just three predictors:

```{r cubist.tune.base, fig.width=6, fig.height=4}
system.time(
  cubist.tune.base <- train(x = preds.base, y = response, method="cubist",
                            tuneGrid = expand.grid(.committees = 1:12,
                                                   .neighbors = 0:8),
                            trControl = trainControl(method = 'cv'))
)
plot(cubist.tune.base, metric="RMSE")
plot(cubist.tune.base, metric="Rsquared")
cubist.tune.base$bestTune$neighbors
```

Committees improve the fit considerably, up to 8. Adjustment by one to four neighbours worsens the OOB fit; five to eight improves it, although eight gives little advantage over seven.

Best model:

```{r cubist.model.base}
c.model.base <- cubist(x = preds.base, y = response, committees=8)
summary(c.model.base)
```

In this model there are many splits in the eight committees, on Northing and elevation.

```{r cubist1to1.base}
cubist.fits.base <- predict(c.model.base, newdata=preds.base,
                            neighbors=cubist.tune.base$bestTune$neighbors)
(rmse.cubist.base <- sqrt(mean((cubist.fits.base - ne.df$ANN_GDD50)^2)))
print(rmse.cubist)
```

This is a much worse model than the full model.
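The RMSE comparison computed inline above (and for the full Cubist model earlier) can be wrapped in a small base-R helper, which also makes it easy to express the difference as a relative improvement (the helper names and toy values are illustrative):

```r
# Root mean squared error between observed and fitted values
rmse <- function(obs, fit) sqrt(mean((obs - fit)^2))

# Percent reduction in RMSE going from a base model to a fuller model
rmse.gain <- function(rmse.base, rmse.full) {
  100 * (rmse.base - rmse.full) / rmse.base
}

obs <- c(2400, 2600, 2800)        # toy GDD50 values
rmse(obs, c(2380, 2650, 2760))    # sqrt(mean(c(20, -50, 40)^2))
rmse.gain(200, 150)               # 25% lower RMSE
```

Reporting the relative gain makes the base-versus-full comparisons in this section directly comparable across methods.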
Predict over the grid:

```{r cubist.pred.base, fig.width=6, fig.height=6}
summary(dem.ne.m.df$pred.cubist.base <- predict(c.model.base, newdata=dem.ne.m.df,
                                                neighbors=cubist.tune.base$bestTune$neighbors))
display.prediction.map("pred.cubist.base",
    "Annual GDD, base 50F, Cubist prediction (3 predictors)",
    "GDD50")
```

Compute and display the difference with the 9-predictor map:

```{r}
summary(dem.ne.m.df$diff.cubist.9.3 <- dem.ne.m.df$pred.cubist - dem.ne.m.df$pred.cubist.base)
```

```{r display.cubist.diff, fig.width=6, fig.height=6}
display.difference.map("diff.cubist.9.3",
    "Difference Cubist 9 and Cubist 3 predictors",
    "+/- GDD50")
```

We see the effect of the population density grid used in the 9-predictor model. This raises the prediction in the NYC/NJ area, and lowers it in some rural areas.