---
title: 'Exercise: Regional mapping of climate variables (3) Model-based methods'
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    theme: "lumen"
    number_sections: FALSE
    fig_height: 4
    fig_width: 6
    fig_align: 'center'
---
```{r global_options1, include=FALSE}
knitr::opts_knit$set(root.dir = '/Users/rossiter/data/edu/dgeostats/ex/ds/中国天气')
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```
```{r global_options2, include=FALSE}
knitr::opts_chunk$set(fig.width=6, fig.height=4, fig.align='center',
fig.path='kgraph/', warning=FALSE, message=FALSE)
# rprojroot::find_rstudio_root_file()
options("show.signif.stars"=FALSE)
```
This exercise should be loaded into RStudio, in the same directory as the dataset `zhne_climate.RData`. Alternatively, you can load it from another directory and edit the path to the dataset.
This exercise follows `(1) Data exploration` and `(2) Data-driven methods`.
We repeat Tasks (0) -- (1) from the second exercise, to set up for other modelling approaches.
# Task 0 -- Packages
Load the `rgdal`, `sp`, `gstat` and `ggplot2` packages.
During the following tasks, load additional packages as needed.
```{r}
library(sp); library(rgdal); library(gstat); library(ggplot2)
library(caret)
```
## Map display functions
To display some of the maps we need the map display functions from the previous exercises.
```{r}
display.prediction.map <- function(.prediction.map.name,
                                   .plot.title,
                                   .legend.title,
                                   .plot.limits=NULL,
                                   .palette="YlGnBu") {
  ggplot(data=zhne.grid.m.df) +
    geom_point(aes_(x=quote(E), y=quote(N),
                    color=as.name(.prediction.map.name))) +
    xlab("E") + ylab("N") + coord_fixed() +
    # geom_point(aes(x=E, y=N, color=t.avg),
    #            data=zhne.df, shape=I(20)) +
    ggtitle(.plot.title) +
    scale_colour_distiller(name=.legend.title,
                           space="Lab",
                           palette=.palette,
                           limits=.plot.limits)
}
```
```{r}
display.difference.map <- function(.diff.map.name,
                                   .diff.map.title,
                                   .legend.title,
                                   .palette="Spectral") {
  ggplot(data=zhne.grid.m.df) +
    geom_point(aes_(x=quote(E), y=quote(N),
                    color=as.name(.diff.map.name))) +
    xlab("E") + ylab("N") + coord_fixed() +
    ggtitle(.diff.map.title) +
    scale_colour_distiller(name=.legend.title,
                           space="Lab",
                           palette=.palette)
}
```
# Task 1 -- Dataset
This dataset was examined in Part (1).
```{r load.points}
load("./zhne_climate.RData", verbose=TRUE)
```
## Fixing the encoding
This `.RData` file was saved in the UTF-8 encoding. If your system uses another encoding (typical for Chinese-language Windows operating systems), you must tell R the original encoding. R will then re-encode character strings to your system's encoding.
Here is a function that loops through the fields of a data frame and fixes them, using the `Encoding` function. The default original encoding is `"UTF-8"`, but this can be changed.
For fields that are factors, it first converts them to character, re-encodes the values, and then converts back to factor.
Arguments:
1. `df` : the data frame to be re-encoded
2. `inputEncoding` : the original encoding, default `"UTF-8"`
```{r fix.encoding}
fix.encoding <- function(df, inputEncoding = "UTF-8") {
  for (col in 1:ncol(df)) {
    if (is.character(df[, col])) {
      Encoding(df[, col]) <- inputEncoding
    } else if (is.factor(df[, col])) {
      tmp <- as.character(df[, col])
      Encoding(tmp) <- inputEncoding
      df[, col] <- as.factor(tmp)
    }
  }
  return(df)
}
```
Apply this to the data frames.
```{r}
zhne.df <- fix.encoding(zhne.df)
# only the dataframe part of the Spatial* objects
zhne.m@data <- fix.encoding(zhne.m@data)
zhne.adm1.m@data <- fix.encoding(zhne.adm1.m@data)
```
Check that the re-encoding worked:
```{r}
head(zhne.df$station.name)
levels(zhne.df$province)
levels(zhne.adm1.m$NAME_HZ)
```
## Target and predictor variables
List the field names of the points (climate stations).
```{r}
names(zhne.df)
```
Q: Which fields are the variables to be mapped over the region?
Q: Which fields could be used to predict these?
Select a target variable. Here we use annual average temperature.
```{r}
target.name <- "t.avg"
```
## Model formula
Set up a formula with the predictor fields.
```{r}
length(pred.fields <- c(9,16:20))
(pred.names <- names(zhne.df)[pred.fields])
(model.formula <- paste(target.name, "~", paste0(pred.names, collapse=" + ")))
```
Now we can use this model formula in any function that recognizes the S model formula structure.
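Note that `model.formula` was built as a character string, which most modelling functions coerce automatically. As a small sketch (the object name `f` is just for illustration), converting explicitly with `as.formula` makes the structure visible and lets us inspect its terms:
```{r}
# Convert the character string to a formula object and inspect it
f <- as.formula(model.formula)
class(f)      # "formula"
all.vars(f)   # the target and predictor names used in the formula
```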
# Task 2 -- Linear modelling
## Build a linear model
### Adjust the model formula
Looking at the single-factor scatterplots (Exercise 1), we see that the relation between elevation and the target does not look linear. Would square root transformation (somewhat) linearize this?
```{r scatterplots,fig.width=8, fig.height=6}
p1 <- ggplot() +
geom_point(aes(x=elev.m, y=t.avg, colour=province.en),
data=zhne.df) +
xlab("elevation")
p2 <- ggplot() +
geom_point(aes(x=sqrt(elev.m), y=t.avg, colour=province.en),
data=zhne.df ) +
xlab("sqrt(elevation)")
print(p1)
print(p2)
# gridExtra::grid.arrange(p1, p2, nrow=1)
```
Modify the formula so that the square-root of the elevation is the predictor:
```{r adjust.formula}
tmp <- strsplit(model.formula, "elev.m")[[1]]
(model.formula <- paste(tmp[1], "sqrt(elev.m)", tmp[2]))
```
### Fit a complete model
Fit an ordinary least-squares model of the target from all the predictors.
```{r build.lm}
m <- lm(model.formula, data=zhne.df)
summary(m)
```
Q: How much of the variance in the target is explained by this model?
Q: Are all the predictors considered "significant"?
### Can predictors be removed?
Some predictors may be highly-correlated, which leads to instability in the model.
Is there collinearity? Look at the linear correlations between the predictors:
```{r}
round(cor(zhne.df[,pred.fields]),2)
```
Q: Which two predictors are highly-correlated?
Q: Of these two, which should be removed from the model, and why? (Hint: look at the model summary.)
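The correlation matrix can be supplemented by a numeric collinearity check on the fitted model. A sketch, assuming the `car` package is installed (it is not loaded elsewhere in this exercise):
```{r}
# Variance inflation factor of each term in the fitted model;
# values above roughly 5--10 are commonly taken to indicate
# problematic collinearity
car::vif(m)
```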
### Updating the model
Refit the model without the population density on the coarser grid (`pop15`). The convenient `update` function provides a shorthand way to specify this.
```{r}
m <- update(m, . ~ . - pop15)
model.formula <- formula(m)
summary(m)
```
Q: Is this model a better, worse, or the same fit (adjusted $R^2$) as the model with all predictors?
Compare the linear model fit with the actual values:
```{r ols.1to1, fig.height=7, fig.width=7}
(rmse.ols <- sqrt(sum(residuals(m)^2)/length(residuals(m))))
plot(zhne.m$t.avg~ predict(m, newdata=zhne.df), col=zhne.m$province,
pch=20, asp=1,
xlab="Fitted by OLS", ylab="Actual", main="Annual average T, deg C")
legend("topleft", levels(zhne.df$province.en), pch=20, col=1:4)
grid(); abline(0,1)
```
Q: How well does this model fit the known stations? Where are there the most problems?
# Mapping
Predict over the grid and display the map.
```{r ols.map, fig.width=10, fig.height=10}
zhne.grid.m.df$pred.ols <- predict(m, newdata=zhne.grid.m.df)
summary(zhne.grid.m.df$pred.ols)
display.prediction.map("pred.ols",
"Annual average T, OLS prediction",
"deg C")
```
Q: Describe the spatial pattern. Can you see where the different predictors were most used?
# Linear model diagnostics
## Task 3 -- non-spatial diagnostics
Examine the diagnostic plots:
```{r m.dx, fig.width=10, fig.height=5}
par(mfrow=c(1,2))
plot(m, which=c(1,2))
par(mfrow=c(1,1))
```
Q: Are the residuals normally-distributed? If not, where is the problem? (Hint: look at the residuals-vs.-fits plot.)
(Recall: the residual is [actual - predicted], so a negative residual is an over-prediction.)
Find the most poorly-fit observations, say those with more than a 2.5 °C absolute error:
```{r}
(ix <- which(abs(residuals(m)) > 2.5))
zhne.df[ix,"t.avg"] # actual
predict(m)[ix] # predicted
residuals(m)[ix] # residual
```
Q: Are these over- or under-predictions?
Display the station records for these:
```{r}
zhne.df[ix,c(7:12,19:20)]
```
Q: Where are these located? What could be the reason for the severe over-prediction?
## Task 4 -- spatial diagnostics
Add the model residuals to the `SpatialPointsDataFrame` of the observations and display a bubble plot of the residuals and their location.
```{r m.resid.bubble, fig.width=7, fig.height=7}
zhne.m$m.resid <- residuals(m)
bubble(zhne.m, zcol="m.resid", pch=1)
```
Q: Is there a spatial pattern to the residuals? Describe it.
# Task 5 -- Spatial structure of the linear model residuals
Compute and display the default variogram of the OLS model residuals.
```{r m.resid.variogram}
proj4string(zhne.m)
vr <- variogram(m.resid ~ 1, loc=zhne.m)
plot(vr, pl=T)
```
Q: What is the approximate range?
Since we have so many point-pairs, we can (1) reduce the cutoff, (2) decrease the bin width:
```{r m.resid.variogram.2}
vr <- variogram(m.resid ~ 1, loc=zhne.m, cutoff=500000, width=24000)
plot(vr, pl=T)
```
Fit an exponential variogram (typical for climate variables) to this empirical variogram.
```{r fit.residual.vgm}
vrm <- vgm(psill=0.6, model="Exp", range=120000, nugget=0.1)
(vrmf <- fit.variogram(vr, model=vrm))
plot(vr, pl=T, model=vrmf)
```
Q: How well does this model fit this empirical variogram?
## Ordinary kriging of the regression residuals
We can krige the residuals onto the prediction grid, to see where KED (next section) will most adjust the linear model predictions, i.e., where the linear model did not fit well.
```{r krige.resid, fig.width=7, fig.height=7}
kr <- krige(m.resid ~ 1,
loc=zhne.m, newdata=zhne.grid.m, model=vrmf)
summary(kr)
spplot(kr, zcol="var1.pred", col.regions=topo.colors(64))
```
Q: Where are the largest adjustments to the OLS predictions?
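As an aside, the kriged residuals also let us build a "regression kriging" prediction by hand: add them to the OLS trend surface computed earlier. This only approximates KED, which re-estimates the trend coefficients, so small differences are expected. A sketch, assuming the rows of `zhne.grid.m.df` are in the same order as the cells of `zhne.grid.m`:
```{r}
# Regression kriging "by hand": OLS trend + kriged OLS residuals
zhne.grid.m.df$pred.rk <- zhne.grid.m.df$pred.ols + kr$var1.pred
summary(zhne.grid.m.df$pred.rk)
```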
# Task 6 -- Kriging with External Drift (KED, Regression Kriging)
We've shown that the OLS model is not optimal, because its residuals are spatially-correlated. Strictly, we should re-fit it by Generalized Least Squares (GLS), but a reasonable approximation, especially with many well-distributed points, is to use KED with the variogram model estimated from the OLS residuals.
The formula has to be the same as that used in the linear model, which produced the residuals, which produced the variogram.
```{r ked}
model.formula
ked <- krige(t.avg ~ sqrt(elev.m) + E + N + pop2pt5 + mrvbf,
loc=zhne.m, newdata=zhne.grid.m, model=vrmf)
summary(ked)
```
Display the prediction map:
```{r ked.pred, fig.width=8, fig.height=8}
spplot(ked, zcol="var1.pred")
```
Q: Describe the spatial pattern. Does it look reasonable? Are there details that look (in)correct?
Display the prediction variances:
```{r ked.var, fig.width=8, fig.height=8}
spplot(ked, zcol="var1.var", col.regions=cm.colors(64))
```
Q: Where are the maximum prediction variances? Why? What is their magnitude, compared to the target variable?
# Task 7 -- Cross-validation of the KED model
The kriging prediction can be evaluated by cross-validation.
```{r xval, fig.width=7, fig.height=7}
ked.cv <- krige.cv(t.avg ~ sqrt(elev.m) + E + N + pop2pt5 + mrvbf,
loc=zhne.m, model=vrmf, verbose=FALSE)
summary(ked.cv)
bubble(ked.cv, zcol="residual", pch=1,
main="X-validation residuals, KED")
```
Q: Where are the highest over- and under-predictions?
Q: Is there any spatial pattern to these residuals?
Compute the RMSE, the mean error (bias), and the MAE. Compare the RMSE to the spread of the target variable.
```{r}
(rmse.ked <- sqrt(mean(ked.cv$residual^2)))
(me.ked <- mean(ked.cv$residual))        # mean error: measures bias
(mae.ked <- mean(abs(ked.cv$residual)))  # mean absolute error
summary(zhne.m$t.avg)
```
Q: Is the KED method biased?
Q: Do you consider the RMSE to be acceptable error, given the target variable and how it is used in decision-making?
# Task 8 -- Generalized Additive Model
Generalized Additive Models (GAM) are similar to multiple linear regression, except that each term in the linear sum of predictors need not be the predictor variable itself, but can be an *empirical smooth function* of it.
In the OLS formula the coefficients $\beta$ are one per predictor, and do not vary over the range of the predictor.
$$
y_i = \beta_0 + \sum_k \beta_k x_{k,i} + \varepsilon_i
$$
In the GAM formula we replace the fixed $\beta$ with smooth functions of the predictors:
$$
y_i = \beta_0 + \sum_k f_k(x_{k,i}) + \varepsilon_i
$$
The advantage is that non-linear relations in nature can be fit, without any need to try transformations or to fit piecewise regressions.
The disadvantage is that it is a purely empirical fit and cannot be extrapolated beyond the range of calibration.
A further disadvantage is that the choice of function is arbitrary; it is generally some smooth function of the predictor, with the degree of smoothness determined by cross-validation.
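In the `mgcv` package (used below), the basis dimension `k` of a smooth term sets an upper bound on its wiggliness, while the actual degree of smoothing is chosen automatically by penalized cross-validation. A sketch with a single smooth term (the object name `m.k` and the value `k = 20` are illustrative, not prescribed):
```{r}
# Fit a single penalized regression spline of elevation;
# k bounds the basis dimension, the smoothing penalty is
# chosen automatically (GCV by default)
m.k <- mgcv::gam(t.avg ~ s(elev.m, k = 20), data = zhne.df)
mgcv::gam.check(m.k)  # diagnostics, incl. whether k is large enough
```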
## Re-examine relation with predictors
Display scatterplots of the target variable against the predictors, with one-dimensional empirical smooth curves to show the relation through the range of the predictor.
```{r gam.plots,fig.width=12, fig.height=8}
g1 <- ggplot(zhne.df, aes(x=E, y=t.avg)) + geom_point() + geom_smooth(method="loess")
g2 <- ggplot(zhne.df, aes(x=N, y=t.avg)) + geom_point() + geom_smooth(method="loess")
g3 <- ggplot(zhne.df, aes(x=elev.m, y=t.avg)) + geom_point() + geom_smooth(method="loess")
g4 <- ggplot(zhne.df, aes(x=pop2pt5, y=t.avg)) + geom_point() + geom_smooth(method="loess")
g5 <- ggplot(zhne.df, aes(x=pop15, y=t.avg)) + geom_point() + geom_smooth(method="loess")
g6 <- ggplot(zhne.df, aes(x=mrvbf, y=t.avg)) + geom_point() + geom_smooth(method="loess")
require(gridExtra)
grid.arrange(g1, g2, g3, g4, g5, g6, ncol = 3)
```
Q: Describe these relations. Are any linear? Do they vary through the range?
## Fitting a GAM
Fit a GAM to the target variable at the observation stations, with the predictors being a *two-dimensional thin-plate spline* of the coördinates and *one-dimensional penalized regression splines* of the elevation and fine-scale population.
```{r gam.fit}
require(mgcv)
m.gam <- gam(t.avg ~ s(E, N) + s(elev.m) + s(pop2pt5), data=zhne.df)
summary(m.gam)
summary(residuals(m.gam))
```
## GAM fit
Compare the GAM model fit with the actual values, and compare the fit to those from the OLS regression and KED.
```{r gam.1to1, fig.height=7, fig.width=7}
(rmse.gam <- sqrt(sum(residuals(m.gam)^2)/length(residuals(m.gam))))
print(rmse.ols)
print(rmse.ked)
plot(zhne.m$t.avg~ predict(m.gam, newdata=zhne.df), col=zhne.m$province,
pch=20, asp=1,
xlab="Fitted by GAM", ylab="Actual", main="Annual average T, deg C")
legend("topleft", levels(zhne.df$province.en), pch=20, col=1:4)
grid(); abline(0,1)
```
Q: How does this RMSE of fit compare with the linear model? with the LOOCV of the KED model?
Does this indicate the predictive accuracy (1) inside the study area, (2) outside it?
## GAM residuals
Examine the spatial structure of the residuals:
```{r gam.resid,fig.width=7, fig.height=7}
zhne.m@data$resid.m.gam <- residuals(m.gam)
bubble(zhne.m, zcol="resid.m.gam", pch=1, main="Residuals from GAM")
```
Q: How does this spatial structure compare with the OLS residuals? What is the effect of using smooth functions instead of a linear function?
Examine the spatial correlation of the GAM residuals:
```{r gam.vgm}
vr.gam <- variogram(resid.m.gam ~ 1, loc=zhne.m, cutoff=200000, width=24000)
plot(vr.gam, pl=T)
(vrmf.gam <- fit.variogram(vr.gam, vgm(0.1, "Exp", 50000, 0.05)))
print(vrmf)  # for comparison: the variogram model fitted to the OLS residuals
```
Q: Do these appear to be spatially-correlated? How does this spatial correlation compare with that of the OLS residuals? (Consider the total sill, range, and nugget). Explain the difference.
## GAM prediction
Predict over the grid and summarize the predictions.
```{r gam.pred}
zhne.grid.m.df$pred.gam <- predict(m.gam, newdata=zhne.grid.m.df)
summary(zhne.grid.m.df$pred.gam)
```
## GAM map
Predict over the grid and display the map.
```{r gam.map, fig.width=10, fig.height=10}
display.prediction.map("pred.gam",
"Annual average T, GAM prediction",
"deg C")
```
Q: Describe the spatial pattern. Can you see where using the smooth functions had the most effect, compared to the OLS map?
# Task 9 -- Comparing maps
We can see the effect of the different mapping methods by computing and displaying difference maps.
First, GAM vs. OLS -- this shows the effect of using smooth functions instead of a linear model.
```{r diff.ols.gam, fig.width=10, fig.height=10}
zhne.grid.m.df$diff.ols.gam <- (zhne.grid.m.df$pred.ols - zhne.grid.m.df$pred.gam)
summary(zhne.grid.m.df$diff.ols.gam)
display.difference.map("diff.ols.gam",
"OLS - GAM",
"deg C")
```
Q: Describe the differences, both in the statistical summary and on the map. Can you explain them?
Second, GAM vs. KED -- two "best" mapping methods compared.
```{r diff.ked.gam, fig.width=10, fig.height=10}
zhne.grid.m.df$diff.ked.gam <- (ked$var1.pred - zhne.grid.m.df$pred.gam)
summary(zhne.grid.m.df$diff.ked.gam)
display.difference.map("diff.ked.gam",
"KED - GAM",
"deg C")
```
Q: Describe the differences, both in the statistical summary and on the map. Can you explain them?