---
title: "Assigment: Trend Surfaces"
author: "(Your Name and NetID here please)"
date: "`r Sys.Date()`"
output:
  html_document:
     toc: TRUE
     toc_float: TRUE
     theme: "lumen"
     code_folding: show
     number_section: TRUE
     fig_height: 5
     fig_width: 7
     fig_align: 'center'
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.align = 'center', fig.path = './figs/ts_assignment/')
knitr::opts_chunk$set(cache.extra = R.version.string)
```

# Objective


In this assignment you will model trend surfaces based on geographic coördinates and then predict over a four-state grid the annual growing-degree days base 50F (10C), a measure of the heat available for the growth of C4 crops such as maize.

Models include ordinary least squares (OLS) trend surfaces of [first](#ts1) and [second](#ts2) order, [generalized additive models (GAM)](#gam) and [thin-plate splines](#tps).

An [*optional* section](#plus_elev) extends this to include a third dimension, the elevation of the weather station.

For each *Task* write a code chunk to carry out the task.

For each *Q* write your *A*, which may include more code or in-line R expressions.

# Dataset

Task: Load the `sf` "simple (spatial) features" package, which we will use for vector spatial data.


Task: Load the "Northeast Climate" R Dataset `ne_stations_sf.RData`, with the `load` function. Use the `verbose=TRUE` optional argument to display the names of the objects that are loaded.

Q1 What are the data types of these objects? For those that are spatial, what are their coördinate reference systems (CRS)?

A1:

Q2: For how many climate stations are there records? How many states of the USA are represented?

A2:

Task: Display the names of the variables in this dataset.

Explanation of the names:

* STATION_ID: a code for the station
* STATE, STATE_1: US State responsible for the weather station
* STATION_NA, STN_NAME: station name
* LATITUDE_D, LONGITUDE_: geographic coordinates
* ELEVATION_, ELEV_FT: elevation in feet above sea level (duplicate)
* LAT_DD, LONG_DD: same, as even decimal degrees (low precision)
* OID, COOP_ID: unknown (to me)
* E, N: metric coordinates
* *_GDD50: cumulative growing degree days per-month and annual (ANN_GDD50), 30-year averages 1971-2000.

There are some duplicate fields, but this is how it came from the data provider.

Q3: Which of the attributes in the `ne.df` data frame are response variables, i.e., attributes than can be modelled by a trend surface of the coordinates and possibly elevation? 

A3:

# First-order trend surface {#ols1}

Task: Compute a first-order trend surface model of the annual growing-degree days base 50F (field `ANN_GDD50`) using the two coordinates (East and North meters) as the predictors, and the growing degree days as the dependent (response) variable, using Ordinary Least Squares linear regression. 

Display the model summary and the linear model diagnostic plots "residuals vs. fitted" and "Normal Q-Q"


Q4: What proportion of the variation in GDD50 is explained by the first-order trend?

A4:

Q5: Which of the two coördinates most affects the GDD50?

A5:

Q6: Is there a relation between the fitted values and residuals? If so, what?

A6:

Q7: Are the residuals normally-distributed? If not, how so? Are there any especially large residuals?

A7:

Optional Task: Identify the most poorly-modelled station.

Optional Q: Where is this? Why might it be so poorly modelled?

Task: Display a post-plot of the residuals, with the positive residuals (under-predictions) shown in green and the negative (over-predictions) in red.

Optional A:

Q8: What is the spatial pattern of the trend-surface residuals? Does it suggest a higher-order trend?

A8: 

# Second-order trend surface {#ols2}

Task: Repeat all the steps of the previous section, and answer the same questions, for a full 2nd-order trend surface.


Q9: What proportion of the variation in GDD50 is explained by the second-order trend?

A9:

Q10: Which of the two coördinates (taking into account their squares also) most affects the GDD50?

A10:

Q11: Is there a relation between the fitted values and residuals? If so, what?

A11: 

Q12: Are the residuals normally-distributed? If not, how so? Are there any especially large residuals?

A12: Close to normal, only a few extreme values deviate from this.

Optional Task: Identify the most poorly-modelled station.

Optional Q: Where is this? Why might it be so poorly modelled?

Optional A: Mount Mansfield (VT). This is the highest elevation, and in the extreme NE of the area.

Task: Display a post-plot of the residuals, with the positive residuals (under-predictions) shown in green and the negative (over-predictions) in red.


Q13: What is the spatial pattern of the trend-surface residuals? Does it suggest a higher-order trend?

A13: 

# Comparing trend orders

Task: Compare the two orders of trend surfaces with an analysis of variance.

Also compare with a null model.


Q14: Is the 2nd-order trend clearly superior to the 1st-order trend?

A14: 

# Mapping

Task: Load the `terra` package which handles gridded data structures.


Task: Load a gridded DEM covering the study area from file `dem_nj_ny_pa_vt_4km.tif`, using the `rast` method. Name the object `dem.ne.m` (standing for "DEM of the Northeast, elevation in metres").


Q15: What is this object's class? What is its CRS? Which of the point datasets does this match?


A15: 

Q16: What is the bounding box and resolution of the grid? (You can see these with the `st_bbox` and `res` functions.)


A16: 

Task: Display the grid with the `plot` function. 


Task: Create a `data.frame` version of the grid, with the coördinates as fields. Use the optional `xy = TRUE` argument for this.  
Also use the optional `na.rm = FALSE` to include the entire bounding box, including `NA` values outside the boundaries of the four states.

Name these fields to match coördinate names of the point dataset.


Task: Predict over this DEM grid with the 2nd-order trend surface. Add the predictions and the upper and lower prediction limits as fields in the study area data frame. 


Task: Summarize the predictions.


Q17: How do the predictions compare with the summary of the weather stations observations?

A17:


Task: Convert the predictions (fitted values) to a `SpatRaster` raster map and display them. The easiest way to do this is to copy the `SpatRaster` for the DEM and then replace its values with the `values()` function, selecting the `"fit"` field in the DEM object.

Use the `mask` function to remove the predictions outside of the study area.


Q18: Is this trend surface a realistic representation of the growing-degree days?  Explain.


# Generalized Additive Model (GAM) {#gam}

Task: Load the `mgcv` package, used for GAM.


Task: Display a scatterplot of the GDD50 against the two coördinates.


Q19: Do these relations look linear or quadratic throughout their ranges?

A19: 

Task: Compute a Generalized Additive Model (GAM) of GDD50 modelled by the two coördinates, using the `gam` function of the `mgcv` package.


Q20: What proportion of the variation in GDD50 is explained by the GAM? How does this compare to the 2nd-order trend surface?


A20: 

Task: Predict over this study area with the fitted GAM. Add this as  a field in the study area data frame.

Note that the result of `predict` on a GAM is an R `array`, whereas here we want a vector to add as a field. So this must be converted with the `as.vector` function.


Task: Summarize the predictions.

Q21: How do the predictions compare with the summary of the weather stations observations, and to the summary of the 2nd-order trend surface predictions?

A21:


Task: Display a post-plot of the residuals.


Q22: What is the spatial pattern of the GAM trend-surface residuals? How does this differ from the pattern of the 2nd-order trend residuals?

A22: 

Task: Display a map of the GAM predictions.


Q23: Is this smooth surface a realistic representation of the growing-degree days? Explain. How does it improve on the 2nd-order trend?

A23:

# Thin-plate splines {#tps}

Task: Load the `fields` package, to use for thin-plate spline calculation.


Task: Add a field to the stations data frame, containing a matrix of the two coördinates, as required by the `fields` package.

Note: This is an example of how different R packages handle spatial data. The same information (here, coördinates) structured differently.


Task: compute the thin-plate spline for the growing degree days, as a function of coördinates.


Task: Predict with the thin-plate spline over the study area. Note you will have to make a 2D matrix with the grid coördinates (these are two of the fields in the `dem.ne.m.df` data frame).
Then predict the surface on that, and then convert the prediction to a field in the `data.frame` version of the grid.


Task: Display the map.


Q24: Is this smooth  surface a realistic representation of the growing-degree days? Explain. Does it improve on the GAM surface?

A24: 

# Challenge (Optional!) -- including elevation {#plus_elev}

Of course, elevation also affects temperature and thus the growing-degree days. 

Repeat the exercise, but including elevation in all the models. Comment on how and why the trend surfaces (now 3D) are different/improved from the 2D trends.

In these models, consider elevation as an independent dimension, with no interaction with the geographic coördinates (although, that may not be a valid assumption... you could test this... ).

Also, see if a transformation of elevation might make its relation with the target variable more linear.


# Saving the workspace 

Task: Save the workspace for a later exercise using the `save.image` function:

The DEM is of class `SpatRaster` and so must be saved in a separate file as a GeoTiff, using the `terra::writeRaster` function with `filetype = "GTiff"`). This can later be read into R with the `terra::rast` function.