---
title: "PLSCS/NTRES 6200 Lab exercise -- spatial sampling"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    theme: "spacelab"
    number_sections: TRUE
    fig_height: 4
    fig_width: 4
    fig_align: "center"
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=4, fig.height=4, fig.align = "center",
warning=FALSE, message=FALSE)
```
This lab exercise is a compilation and modification of tutorials developed by Dick Brus for a course at Nanjing Normal University School of Geographic Sciences (南京师范大学地理学学院), and is well explained (with code) in his paper:
Brus, D. J. (2019). Sampling for digital soil mapping: A tutorial supported by R scripts. Geoderma, 338, 464–480. https://doi.org/10.1016/j.geoderma.2018.07.036
I have added some explanation, some more visualizations, and a set of **Question**s. To answer the **Question**s, additional R code may be required.
# Design-based sampling for population statistics
In this section we investigate how to place *samples* to estimate some statistics of the *total* population, for example a total or mean.
Load the *simulated* data for study area Voorst (NL) and examine its structure.
Note that this is the complete population, which of course in practice we do not know. Here we simulate it, to compare how closely different sampling schemes can come to the known (in this case) true value.
```{r}
load(file="Voorst.RData")
class(grdVoorst)
summary(grdVoorst)
head(grdVoorst)
```
The variable of interest (to be summarized) is in the first column (`z`).
Make a spatial version of this, using the coordinates `c(s1, s2)` and display a map of the target variable:
```{r fig.width=10, fig.height=4}
require(sp)
grdVoorst.sp <- grdVoorst
coordinates(grdVoorst.sp) <- c("s1","s2")
class(grdVoorst.sp)
# specify the CRS, these are Dutch national grid (RD), meters
proj4string(grdVoorst.sp) <- CRS("+init=epsg:28992")
proj4string(grdVoorst.sp)
bbox(grdVoorst.sp)
spplot(grdVoorst.sp, zcol="z",
cuts=seq(0,50,by=5), key.space="right")
```
Programming notes:
* `spplot` requires an object in one of the `Spatial*DataFrame` classes, defined by the `sp` package.
* The `zcol` argument specifies which field in the attribute table (dataframe) should be displayed.
**Question**: Suppose `z` is some soil property, perhaps SOM (Soil Organic Matter) weight % in the fine soil. Why are there white areas (`NA` values) on the map?
**Answer**:
We know the full population in this case, but now suppose we did not, and want to estimate its statistics by sampling.
## Simple random sampling
In the first case we treat the whole area as one unit.
Select a *simple random sample without replacement* of 40 units from the entire population.
```{r}
# set the sample size
n <- 40
# compute the total number of units, N, in the population
N<-nrow(grdVoorst)
# optionally, set a seed so that results can be reproduced
set.seed(314)
# select the sample
units <- sample.int(N, size = n, replace = FALSE)
mysample <- grdVoorst[units,]
summary(mysample)
```
Programming notes:
* `set.seed` would not be used in practice; here it is so your results will match mine. This is commonly used when developing methods, for reproducibility. In practice of course we want true randomness.
* `sample.int` is a simplified version of `sample` that just needs a single number `N`, the size of the population from which to select. The result is a vector of the selected integers between 1 and `N`.
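A minimal sketch of this behaviour (the population size 10 and sample size 4 here are arbitrary):

```r
set.seed(42)
u <- sample.int(10, size = 4, replace = FALSE)
u                       # four distinct integers drawn from 1..10
length(u)               # 4
all(u >= 1 & u <= 10)   # TRUE
anyDuplicated(u)        # 0: without replacement, no repeats
```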
**Question**: How do the statistics of the target variable (range, mean, median, quantiles) compare with that of the full population? Explain any differences.
**Answer**:
Now we estimate the population mean and its standard error from this sample:
```{r}
(estimatedmean <- mean(mysample$z))
varmean <- var(mysample$z)/n
(semean <- sqrt(varmean))
```
We saw above that this does not exactly match the known population mean -- that is not surprising. But does the confidence interval of the mean contain the true mean?
Compute lower and upper bound of 95% confidence interval using the `t` distribution.
(Programming note: quantiles of the Student `t` distribution can be computed with function `qt`.)
Since we are willing to take a 5% risk of missing the true mean, we remove 2.5% from each tail of the symmetric `t`-distribution.
```{r}
(lower <- estimatedmean-qt(0.975, n-1, lower.tail=TRUE)*semean)
# another way to get the same value
# (lower <- estimatedmean-qt(0.025, n-1, lower.tail=FALSE)*semean)
(upper <- estimatedmean+qt(0.975, n-1, lower.tail=TRUE)*semean)
```
Check whether the true population mean is inside the 95% confidence interval:
```{r}
(populationmean <- mean(grdVoorst$z))
(is.in <- ((populationmean >lower) & (populationmean < upper)))
```
**Question**: Is it? Suppose we re-do this sampling plan a large number of times. In what proportion of these would the true mean be outside of the confidence interval?
**Answer**:
## Stratified random sampling
If the target variable is expected to have different values according to some classifying factor, stratified sampling based on that factor is usually more efficient.
In stratified sampling, the total population is divided into several *strata* based on some classifying factor. Each item in the population is in exactly one stratum. We then sample from each stratum separately.
We continue with the Voorst study area. The stratifying factor is `stratum`. These represent different land uses or managements that might affect SOM concentration. `XF` is forest.
```{r fig.width=10, fig.height=4}
spplot(grdVoorst.sp, zcol="stratum", key.space="right")
class(grdVoorst$stratum)
table(grdVoorst$stratum) # count the grid cells in each stratum
```
**Question**: Are the strata of equal size?
**Answer**:
Compute the total number of units per stratum and the stratum weights (relative sizes).
A common method to divide the total number of samples among the strata is *proportional* allocation, i.e., proportional to their size within the population; we use that here. An alternative would take into account the variance within each stratum -- but in this case we assume that we don't have an estimate of these variances.
```{r}
Nh <- tapply(X=grdVoorst$stratum, INDEX = grdVoorst$stratum, FUN =length)
print(Nh)
(wh <- Nh/sum(Nh))
```
Programming notes:
* the `tapply()` function evaluates a function `FUN` (in this example the `length()` function) on subsets of an array `X` (in this example the stratum ID `grdVoorst$stratum`), defined by a classifying factor `INDEX` (here again the stratum ID `grdVoorst$stratum`). The result is a vector with one element per level of the factor: the value of the function evaluated on that level's subset of `X`.
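As a small illustration with made-up data (three values in level `a`, two in level `b`):

```r
x <- c(2, 4, 6, 10, 20)
f <- factor(c("a", "a", "a", "b", "b"))
# one result per factor level: the mean of the `a` values and of the `b` values
tapply(X = x, INDEX = f, FUN = mean)
#  a  b
#  4 15
```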
Compute the per-stratum subsample sizes:
```{r}
# set total sample size
n <- 40
# compute stratum sample sizes for proportional allocation
nh <- round(n * wh)
# compute sum of stratum sample sizes
(sumnh <- sum(nh))
# sum of stratum sample sizes is 41, we want 40, so we reduce largest stratum sample size by 1
# first see which stratum has most sampling points
ix <- which.max(nh)
nh[ix] <- nh[ix] - 1
print(nh)
```
Use function `strata` of package `sampling`^[from the Université de Neuchâtel (CH)] to select a stratified random sample with replacement.
Note: to use function `strata`, the order of stratum sample sizes must be equal to the order in which the strata are encountered in the object.
```{r}
require(sampling)
set.seed(314) # reproducible
nh.order <- nh[unique(grdVoorst$stratum)]
units<-strata(grdVoorst,stratanames="stratum",
size=nh.order,method="srswr")
# use function getdata to get the data for the selected units
mysample<-getdata(grdVoorst,units)
```
Estimate the population mean and its standard error. This has to be per-stratum and then summed according to the respective weights.
```{r}
# per-stratum mean, then weighted sum
(meanh <- tapply(mysample$z,INDEX=mysample$stratum,FUN=mean))
(estimatedmean<-sum(wh*meanh))
# per-stratum variance, then weighted sum
S2h <- tapply(mysample$z,INDEX=mysample$stratum,FUN=var)
(varmeanh <- S2h/nh)
varmean <- sum(wh^2*varmeanh)
(semean<-sqrt(varmean))
```
**Question**: Do the different strata have different estimated means and variances? Which stratum is most variable? Does the order of variances match that of means? Should it? Why or why not?
**Answer**:
Compute lower and upper bound of 95% confidence interval using the `t` distribution.
Approximate the degrees of freedom of the Student `t` distribution by the total sample size minus number of strata.
```{r}
(df <- n - length(Nh))
(lower<-estimatedmean-qt(0.975,df,lower.tail=TRUE)*semean)
# another way to get the same value
# (lower<-estimatedmean-qt(0.025,df,lower.tail=FALSE)*semean)
(upper<-estimatedmean+qt(0.975,df,lower.tail=TRUE)*semean)
```
**Question**: Is this confidence interval wider or narrower than that from SRS? Is this expected? Why or why not?
**Answer**:
Check whether the true population mean is inside the 95% confidence interval:
```{r}
(populationmean <- mean(grdVoorst$z))
(ind<-(populationmean >lower & populationmean < upper))
```
**Question**: Is it? Suppose we re-do this sampling plan a large number of times. In what proportion of these would the true mean be outside of the confidence interval?
**Answer**:
# Sampling to fit a linear regression
The aim here is to select a sample that will give the best linear regression model. "Best" can be measured internally ($R^2$, RMSE) or externally (evaluation sample or cross-validation).
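The external measure via leave-one-out cross-validation can be sketched as follows (the data frame `dat` and its values are made up for illustration):

```r
set.seed(1)
dat <- data.frame(x = 1:20)
dat$z <- 2 + 0.5 * dat$x + rnorm(20)   # invented linear relation plus noise
# leave-one-out: refit without each point, predict it, collect squared errors
sq.err <- sapply(1:nrow(dat), function(i) {
  m <- lm(z ~ x, data = dat[-i, ])
  (dat$z[i] - predict(m, newdata = dat[i, ]))^2
})
sqrt(mean(sq.err))   # cross-validation RMSE
```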
We suppose there is a study area where we want to model one variable (the dependent) as a function of another (the independent). We know the covariate value at all locations -- for example, a remote-sensing product such as NDVI. We want to go to the field and measure the dependent variable at a few locations, build the model, and then use it to predict over the whole grid.
We suppose the relation is *linear*, i.e., we will fit a linear regression model. We compare two sampling schemes: one completely random, and one based on the spread of the covariate.
Read the data: the covariate $x$, the variable of interest $z$, and the spatial coordinates. Although the dataset includes the target variable $z$, we suppose we do not know it and would obtain these values by field sampling. We *do* know the covariate everywhere.
```{r}
grid <- read.csv(file="Data4ExerciseModelBasedSampling_SimpleLinearRegression.csv")
str(grid)
```
Map the target variable, using `ggplot`:
```{r}
library(ggplot2)
ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = z)) +
scale_fill_gradient(name="z",low = "lightgreen", high = "darkgreen") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
coord_equal()
```
Programming notes:
* `ggplot2` implements the graphics system "grammar of graphics" approach of Wilkinson, L. (2005). The grammar of graphics (2nd ed). Springer.
* A good introduction is from the Tidyverse https://ggplot2.tidyverse.org/
* `ggplot()` builds up a graphic piece-by-piece, with many functions to add elements and specify the graphic; these are separated by `+` ("add element").
* In this simple example, the first `ggplot` call specifies the dataframe where the variables named later can be found.
* Then the `geom_tile` function specifies a geometry, here a tiled grid; the `aes()` (aesthetics) function specifies which fields in the dataframe should be assigned to the two axes and to the fill attribute.
## Fully random sample (without replacement)
Select a fully random sample of 12 units without replacement.
```{r}
set.seed(314)
idsrandom <- sample.int(400,12)
(randomsample <- grid[idsrandom,])
```
Fit a simple linear regression model, give a summary of the model, and save the estimated regression coefficients.
```{r}
summary(model.1 <- lm(z~x, data=randomsample))
(betas.1 <- coef(model.1))
```
**Question**: How well does this model fit the 12 known points?
Consider both the $R^2$ and the residual standard error.
What is the slope coefficient, i.e., how many units of $z$ change for each one unit of $x$?
**Answer**:
Plot the selected units on two grids: the covariate and the target:
```{r fig.width=8, fig.height=4}
require(gridExtra)
g1 <- ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = x)) +
geom_tile(data=randomsample,mapping = aes(x = s1, y = s2),
width=1,height=1,colour="red",fill = NA) +
scale_fill_gradient(name="x",low = "skyblue", high = "darkblue") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
labs(title="Covariate, with sample points") +
coord_equal()
g2 <- ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = z)) +
geom_tile(data=randomsample,mapping = aes(x = s1, y = s2),
width=1,height=1, colour="red", fill = NA) +
scale_fill_gradient(name="z",low = "lightgreen", high = "darkgreen") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
labs(title="Target, with sample points") +
coord_equal()
grid.arrange(g1, g2, nrow=1)
```
**Question**: Does there appear to be any pattern to the selection of sample points (1) spatially, (2) with respect to the target, (3) with respect to the covariate? Is there any reason why there should be?
**Answer**:
Plot the regression line:
```{r}
ggplot(randomsample) +
geom_point(mapping=aes(x=x,y=z)) +
geom_abline(intercept=betas.1[1],slope=betas.1[2]) +
scale_y_continuous(name = "target: z",limits=c(0,20)) +
scale_x_continuous(name = "covariate: x",limits=c(0,32))
```
There is a simpler way to plot this, using base graphics:
```{r}
plot(z ~ x, data=randomsample, pch=20, xlim=c(0,32), ylim=c(0,20))
abline(model.1); grid()
```
## Non-random sample: extremes of the covariate range
Now select the 6 units with smallest value for covariate $x$ and 6 units with largest values for $x$ -- i.e., still 12 units, but now distributed at the extremes of the covariate.
(Remember, we know the covariate values before we do the field sampling).
```{r}
ord <- order(grid$x)
grid <- grid[ord,]
n <- dim(grid)[1]
(nonrandomsample <- grid[c(1:6, (n-5):n),])
```
Again fit a simple linear regression model, give a summary of the model, and save the regression coefficients.
```{r}
model.2 <- lm(z~x, data=nonrandomsample)
summary(model.2)
(betas.2 <- coef(model.2))
```
**Question**: How do the regression coefficients differ between the models built using random and non-random samples? Which shows a stronger relation, i.e., more change in the target per unit of the covariate? Is this expected? Explain why.
**Answer**:
Plot the regression line:
```{r}
ggplot(nonrandomsample) +
geom_point(mapping=aes(x=x,y=z)) +
geom_abline(intercept=betas.2[1],slope=betas.2[2]) +
scale_y_continuous(name = "target: z",limits=c(0,20)) +
scale_x_continuous(name = "covariate: x",limits=c(0,32))
```
Using base graphics:
```{r}
plot(z ~ x, data=nonrandomsample, pch=20, xlim=c(0,32), ylim=c(0,20))
abline(model.2); grid()
```
**Question**: How well does this model fit the 12 known points selected non-randomly, based on covariate values?
Consider both the $R^2$ and the residual standard error.
**Answer**:
**Question**: How do these compare to the model based on 12 random points?
**Answer**:
Plot the selected units on two grids: the covariate and the target:
```{r fig.width=8, fig.height=4}
g1 <- ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = x)) +
geom_tile(data=nonrandomsample,mapping = aes(x = s1, y = s2),
width=1,height=1,colour="red",fill = NA) +
scale_fill_gradient(name="x",low = "skyblue", high = "darkblue") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
labs(title="Covariate, with sample points") +
coord_equal()
g2 <- ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = z)) +
geom_tile(data=nonrandomsample,mapping = aes(x = s1, y = s2),
width=1,height=1, colour="red", fill = NA) +
scale_fill_gradient(name="z",low = "lightgreen", high = "darkgreen") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
labs(title="Target, with sample points") +
coord_equal()
grid.arrange(g1, g2, nrow=1)
```
**Question**: Does there appear to be any pattern to the selection of sample points (1) spatially, (2) with respect to the target, (3) with respect to the covariate? Is there any reason why there should be?
**Answer**:
Use the fitted models to make maps of predictions of $z$, and compare them side-by-side:
```{r fig.width=8, fig.height=4}
grid$zpred.1 <- betas.1[1] + betas.1[2]*grid$x
grid$zpred.2 <- betas.2[1] + betas.2[2]*grid$x
scale <- seq(0,20,by=2) # use a common scale so the colours match
g1 <- ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = zpred.1)) +
scale_fill_gradient(name="Zpred.1",low = "skyblue", high = "darkblue", breaks=scale) +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
labs(title="Prediction: random") +
coord_equal()
g2 <- ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = zpred.2)) +
scale_fill_gradient(name="Zpred.2",low = "skyblue", high = "darkblue", breaks=scale) +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
labs(title="Prediction: non-random") +
coord_equal()
grid.arrange(g1, g2, nrow=1)
```
Compute and display their differences:
```{r}
summary(grid$diff.r.nr <- grid$zpred.1-grid$zpred.2)
ggplot(data = grid) +
geom_tile(mapping = aes(x = s1, y = s2, fill = diff.r.nr)) +
scale_fill_gradient(name="diff.r.nr", low = "red", high = "green") +
scale_y_continuous(name = "Northing") +
scale_x_continuous(name = "Easting") +
labs(title="Random - Non-random") +
coord_equal()
```
**Question**: Where are the largest differences? What are they? Explain why. (Hint: look at the map of the known target variable.)
**Answer**:
**Question**: In practice, would you select the non-random sample? If not, why not? Can you think of drawbacks to that scheme? If so, how might you sample to avoid them?
**Answer**:
## Compare to the known relation
In this case we know the full population. Compute the regression model using the full population:
```{r}
model.all <- lm(z~x, data=grid)
summary(model.all)
(betas.all <- coef(model.all))
rbind(betas.all, betas.1, betas.2)
df <- data.frame(adj.r.squared=rbind(
summary(model.1)$adj.r.squared,
summary(model.2)$adj.r.squared,
summary(model.all)$adj.r.squared
),
sigma=rbind(
summary(model.1)$sigma,
summary(model.2)$sigma,
summary(model.all)$sigma
))
row.names(df) <- c("random", "nonrandom", "all")
print(df)
```
**Question**: Which sampling scheme resulted in a model closest to the best linear model using all the data? Is this expected?
**Answer**:
**Question**: Which sampling scheme resulted in a model with the most realistic adjusted $R^2$ and residual standard error? Is this expected?
**Answer**:
## Challenge
Repeat the above analysis for the _random_ sampling scheme a number of times (in a loop), saving the coefficients and the model fit. Characterize the variability of the random sampling schemes. Why can't we do this for the non-random scheme?
# Sampling for mapping: grid sampling
The aim here is to *randomly* place a square grid on a study area, so that (1) all points are within the area; (2) the number is as planned (e.g., according to the budget). If the area is a rectangle this is no problem at all, but in most field sampling campaigns the area is irregularly-shaped (e.g., a county, a farm) and may have "holes" that are not of interest.
## Example dataset
Load the data of Ethiopia. Change the class of grdEthiopia into a SpatialPointsDataFrame using function ```coordinates```, and then change it into a SpatialPixelsDataFrame using function ```gridded```.
```{r}
load(file="EthiopiaSimulations.RData")
#change class into SpatialPointsDataFrame
class(grdEthiopia)
coordinates(grdEthiopia) <- ~s1 + s2
#change class into SpatialPixelsDataFrame
gridded(grdEthiopia)<-TRUE
str(grdEthiopia)
spplot(grdEthiopia, zcol="rfl_MIR", col.regions=topo.colors(64),
main="Mid infra-red reflectance")
```
**Question**: Why could the study area be so irregular?
**Answer**:
The total area to be covered by the sample, assuming the cell dimensions are meters:
```{r}
(cell.area <- prod(grdEthiopia@grid@cellsize))
(n.data <- length(grdEthiopia@data$woreda))
# total area in km^2
(total.area <- (cell.area * n.data)/(100^2))
```
## Select a square grid
We suppose the budget allows for 100 points.
We use `spsample` to place the grid, with points at the centre of each grid cell, using the `offset` argument. Note that this is not a random sample because we select the starting point.
```{r}
n <- 100
mysample<-spsample(x=grdEthiopia, n=n, type="regular", offset=c(0.5,0.5))
length(mysample)
mysample <- as(mysample,"data.frame")
head(mysample)
```
Compute the resulting grid spacing:
```{r}
(spacing <- mysample$x1[2]-mysample$x1[1])
```
Note: it is also possible to specify a cell size, rather than a number of samples, as an argument to `spsample`.
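A minimal sketch of this on a toy square study area (the polygon and the 20 m spacing below are invented for illustration); with `cellsize` given, the spacing, rather than the sample size, is fixed:

```r
library(sp)
# a toy 100 m x 100 m square study area
sq <- SpatialPolygons(list(Polygons(list(Polygon(
  cbind(c(0, 100, 100, 0, 0), c(0, 0, 100, 100, 0)))), ID = "sq")))
# regular grid with a fixed 20 m spacing; `n` is then only indicative
pts <- spsample(x = sq, n = 25, type = "regular",
                cellsize = 20, offset = c(0.5, 0.5))
length(pts)   # the spacing, not n, determines how many points fit
head(coordinates(pts))
```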
## Select square grid with required sample size
Note that, because of the irregularly-shaped area, the resulting number of points is not exactly 100.
To correct this, we repeatedly select a random square grid and then keep one with the required sample size.
We write a `for`-loop to select 200 square grids with a random start, each of approximately 100 points. Set a seed so that the results can always be reproduced. Determine the number of selected grid points and save it in a numeric vector. Then tabulate the sample sizes.
This is a random sample: by not specifying the `offset` argument, we let `spsample` choose a random origin for the whole grid. Because of this random offset, some of the 200 grids will contain exactly 100 points.
```{r}
set.seed(314)
samplesize <- numeric(length=200)
for (i in 1:200) {
mysample <- spsample(x=grdEthiopia, n=n, type="regular")
samplesize[i] <- length(mysample)
}
table(samplesize)
```
Now select a square grid of exactly 100 points. Determine the indices of the samples with exactly 100 points using function ```which()```. Set the seed again to the same number, and run the `for` loop of the previous chunk ```ids[1]``` times.
```{r fig.width=8, fig.height=6}
(ids <- which(samplesize==100)) # find which ones have 100 points
set.seed(314) # the same starting conditions
# do it the required number of times to get that sample back
for (i in 1:ids[1]) {
mysample <- spsample(x=grdEthiopia, cellsize=spacing, type="regular")
}
length(mysample)
plot(coordinates(grdEthiopia), asp=1, pch=20, cex=0.1)
points(coordinates(mysample), col="red")
grid()
```
**Question**: Can you see any problem with this sample plan?
**Answer**:
# Sampling for mapping: K-means sampling using covariates
Now we use information from *covariates* that we think are correlated with the target variable. Our aim is to *create* strata based on the values of the covariates. Presumably these combinations of covariates are better linked to the target than using just one covariate. This is a form of *cluster analysis*. A popular method is k-means. This is explained in many references, e.g. Ch. 14 of:
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning data mining, inference, and prediction (2nd ed). New York: Springer. Retrieved from http://link.springer.com.proxy.library.cornell.edu/book/10.1007%2F978-0-387-84858-7
The concept is to find clusters that minimize the within-cluster variance and maximize the between-cluster variance. I.e., minimize:
$$W(C) = \frac{1}{2}\sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} \|x_i - x_{i'}\|^2 = \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2$$
This is minimized by assigning the $N$ observations to the $K$ clusters so that the _average dissimilarity_ of the observations within each cluster from that cluster's mean is minimized, over all clusters. Although this could be solved by exhaustive search, that is not feasible for any reasonably-sized dataset, so there are various "greedy" algorithms, e.g., iterative descent from a starting allocation.
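A single assign-then-update iteration of this descent can be sketched on made-up one-dimensional data (the values and the starting centres below are invented for illustration):

```r
x <- c(1, 2, 3, 10, 11, 12)   # two obvious groups
centers <- c(0, 8)            # arbitrary starting centres
# assignment step: each observation goes to its nearest centre
k <- apply(abs(outer(x, centers, "-")), MARGIN = 1, FUN = which.min)
# update step: each centre moves to the mean of its assigned observations
(centers <- tapply(x, k, mean))   # 2 and 11: already the group means
```

`kmeans()` repeats these two steps until the assignments no longer change.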
## Sample data
We use a study area from the Hunter Valley of New South Wales (AU), famous for its wines. The soil geography of this area has been extensively studied by soil scientists from the University of Sydney. See for example:
Huang, J., McBratney, A. B., Malone, B. P., & Field, D. J. (2018). Mapping the transition from pre-European settlement to contemporary soil conditions in the Lower Hunter Valley, Australia. Geoderma, 329, 27–42. https://doi.org/10.1016/j.geoderma.2018.05.016
Load the file with discretisation of the study area and summarize it:
```{r}
load("HunterValley4Practicals.RData")
# loaded object is `grdHunterValley`
summary(grdHunterValley)
```
We see this has two coordinates in an unknown CRS, five continuous covariates, and two categorical covariates.
Examine the inter-correlation of the continuous covariates:
```{r}
cor(grdHunterValley[,3:7])
```
**Question**: Which covariates are closely related?
**Answer**:
## Construct clusters
Now use k-means clustering to find the required number of clusters. We can then find one or more points in each cluster.
The `scale` function centres (to mean 0) and scales (to standard deviation 1) each column of its argument, so the covariates all have equal weight. This is a choice we make; it is not required by the `kmeans` function.
An important parameter is `nstart`, which is the number of initial random allocations to try. The larger this is, the more chance of avoiding poor clusterings due to chance.
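The effect of `scale` can be checked directly on a made-up matrix (the two columns below are invented, with deliberately different magnitudes): after scaling, each column has mean 0 and standard deviation 1, so no covariate dominates the Euclidean distances used by k-means.

```r
set.seed(1)
m <- cbind(a = rnorm(100, mean = 50, sd = 10),    # large-magnitude covariate
           b = rnorm(100, mean = 0.5, sd = 0.1))  # small-magnitude covariate
ms <- scale(m)                   # centre each column to 0, scale to sd 1
round(colMeans(ms), 10)          # both 0 (to numerical precision)
apply(ms, MARGIN = 2, FUN = sd)  # both 1
```

With `nstart` greater than 1, `kmeans` keeps the allocation with the smallest total within-cluster sum of squares among the random starts.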
```{r}
n <- 10 # this number of clusters
set.seed(314)
myClusters <- kmeans(scale(grdHunterValley[,3:7]), centers=n, iter.max=100,nstart=24)
str(myClusters)
table(myClusters$cluster)
```
Notice the number of iterations that were needed to converge: `r myClusters$iter`. The $R^2$ is the proportion of the total variance that is between the clusters:
```{r cluster.r2}
round(myClusters$betweenss/myClusters$totss,3)
```
**Question**: Are the clusters balanced (approximately equal number of cells per cluster)? Why is this? Do you expect this?
**Answer**:
Display the "coordinates" of the cluster centres in feature space, i.e., their values on the (scaled) covariates:
```{r}
(myClusters$centers)
```
**Question**: Which cluster has the highest average elevation? Which other covariate is maximum in this cluster? Does this make sense? (Hint: look at the covariate inter-correlations.)
**Answer**:
Add a field to the grid showing which cluster is assigned to each grid cell, and display a map of the clusters:
```{r fig.height=6, fig.width=6}
grdHunterValley$clusters <- myClusters$cluster
ggplot(grdHunterValley) +
geom_tile(mapping = aes(x = 25*round(Easting/25), y = 25*round(Northing/25),
fill = factor(clusters))) +
scale_fill_discrete(name = "cluster") +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
coord_fixed() +
theme(legend.position="right")
```
**Question**: Which landscape feature does Cluster 4 represent? (Hint: look at its `cti` = compound topographic index: a large value indicates a low position.) What about Cluster 3? (Hint: look at its `ndvi` = Normalized Difference Vegetation Index: a low value indicates sparse to no vegetation.)
**Answer**:
## Select sample
Select the locations closest to the centres of the clusters *in feature space*, i.e., the *central concepts* of the clustering of covariates.
First, compute the distances in feature space from all (scaled) grid points to all cluster centres.
```{r}
library(fields) # this provides the `rdist` function
rdist.out <- rdist(x1=myClusters$centers, x2=scale(grdHunterValley[,3:7]))
```
For each cluster, find the grid point closest to its centre in feature space; together these points form the sample.
```{r}
ids.mindist <- apply(rdist.out, MARGIN=1, which.min)
mySample <- grdHunterValley[ids.mindist,]
```
(If we wanted more samples in each cluster, we would remove this point and again apply `which.min`.)
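A sketch of that idea, reusing `rdist.out` from the chunk above: sorting each row of the distance matrix gives the nearest grid cells per cluster centre (the choice of two points per cluster is arbitrary):

```r
# indices of the two grid cells nearest to each cluster centre in feature space
ids.2nearest <- apply(rdist.out, MARGIN = 1,
                      FUN = function(d) order(d)[1:2])
# ids.2nearest is a 2 x n matrix; column k holds the two sample ids for cluster k
mySample2 <- grdHunterValley[as.vector(ids.2nearest), ]
```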
Plot clusters and sampling points:
```{r fig.height=6, fig.width=6}
ggplot(grdHunterValley) +
geom_tile(mapping = aes(x = 25*round(Easting/25,0),
y = 25*round(Northing/25,0),
fill = factor(clusters))) +
scale_fill_discrete(name = "cluster") +
geom_point(data=mySample,
mapping=aes(x=Easting,
y=Northing,
fill = factor(clusters)),
size=2, shape=21) +
scale_x_continuous(name = "Easting") +
scale_y_continuous(name = "Northing") +
coord_fixed() +
theme(legend.position="right")
```
**Question**: Is there any spatial pattern to this point set? Should there be? Why or why not?
**Answer**:
Show the sample point locations *in feature space*. We can only show two covariates at a time; here we select the two that are least correlated:
```{r fig.height=6, fig.width=6}
ggplot(grdHunterValley) +
geom_point(mapping=aes(y = elevation_m,
x = ndvi,
colour = factor(clusters))) +
geom_point(data=mySample,
mapping=aes(y = elevation_m,
x = ndvi,
fill = factor(clusters)),
size=2, shape=21) +
scale_fill_discrete(name = "cluster") +
scale_y_continuous(name = "Elevation") +
scale_x_continuous(name = "NDVI") +
theme(legend.position="right")
```
**Question**: How well are the clusters separated in this space?
**Answer**:
**Question**: Why might a soil scientist use this sampling scheme to locate ten intensively-sampled soil profiles in the Hunter Valley?
**Answer**:
**Challenge**: Show the sample points in other 2D-attribute spaces and comment on their distribution.
**Optional**: Visualize the clusters using three covariates; this produces an interactive plot you can rotate, in which you can locate individual grid cells.
```{r fig.height=8, fig.width=8}
library(plotly)
with(grdHunterValley,
plot_ly(x=elevation_m, y=ndvi, z=cti, type="scatter3d",
mode="markers",
color=factor(clusters),
colors=terrain.colors(10),
size=0.15,
xlab="elevation")
)
```
# Sampling for mapping: Model-based sampling for OK and KED
Now we consider the case where we have, or can construct, a model of local spatial dependence (for OK) or of residual local spatial dependence (for KED). We can then place points so as to *minimize* some objective, e.g., the mean or maximum kriging prediction variance over the grid to be mapped. Recall that the kriging system is solved by minimizing the prediction variance, and this variance depends only on the *positions* of the sample points, not on their values. So we don't need to sample before optimizing! Just add a dummy value and "map" that.
In this section we use `gstat` for geostatistics:
```{r}
library(gstat)
```
Load the data on soil organic carbon of the three woredas in Ethiopia.
```{r}
load(file="DataThreeWoredasEthiopia.RData")
str(grdEthiopia)
```
This loads `grdEthiopia`, an `sp` object with an unknown CRS (the units appear to be km), along with the sample points `d` containing the SOC observations used below.
## Estimation of the variogram
Suppose the aim is to map soil organic carbon (SOC). This seems to have been measured in %.
Use the data to compute an experimental variogram and to fit a spherical model with nugget.
```{r}
summary(d$SOC)
vg <- variogram(SOC~1, data = d)
plot(vg, pch=20, plot.numbers=TRUE)
```
This is a nice variogram with a clear sill, easy to model:
```{r}
(vgmf <- fit.variogram(vg, model = vgm(psill = 0.6, model = "Sph", range = 40, nugget = 0.6)))
plot(vg, model=vgmf, pch=20)
```
**Question**: What is the range of local spatial dependence, in meters?
**Answer**:
**Question**: What proportion of the total variance is unexplained even at a point?
**Answer**:
## Construction of the sampling grid
Construct a sampling grid of $4 \times 4$ points using function ```expand.grid```. The number of grid points must be the square of an even number: if it were the square of an odd number, the centre of the sampling grid (which is used as the prediction point) would coincide with a grid point, and as a consequence the kriging variance would equal 0.
This grid will later be scaled by the actual spacings to be evaluated.
```{r}
ygrid <- xgrid <- 0:3
xygrid <- expand.grid(xgrid, ygrid)
names(xygrid) <- c("x","y")
```
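The even/odd point about the grid centre can be checked directly (unit spacing, for illustration):

```r
g3 <- expand.grid(x = 0:2, y = 0:2)  # 3 x 3 grid: odd number per side
g4 <- expand.grid(x = 0:3, y = 0:3)  # 4 x 4 grid: even number per side
# is the centre of the grid itself one of the grid points?
any(g3$x == mean(g3$x) & g3$y == mean(g3$y))  # TRUE: centre (1, 1) is a point
any(g4$x == mean(g4$x) & g4$y == mean(g4$y))  # FALSE: centre (1.5, 1.5) is not
```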
## Computing the maximum kriging variance
Define the grid spacings to be evaluated, and set up a vector to hold the results for each one:
```{r}
(spacings <- seq(from=5, to=12, by=1))
maxkrigvar <- numeric(length=length(spacings))
```
Predict the value at the centre of the grid using the `krige` function of R package `gstat`. Save the kriging variance for each spacing, then plot the maximum kriging variance against the grid spacing. Estimate from the graph the largest grid spacing for which the maximum kriging variance does not exceed 1.0.
```{r}
for (i in 1:length(spacings)) {
  xsc <- xygrid$x * spacings[i] # scale the unit grid to the proposed spacing
  ysc <- xygrid$y * spacings[i]
  xygridsc <- data.frame(x=xsc, y=ysc, z=1) # z is the dummy variable to "map"
  coordinates(xygridsc) <- ~x+y # make the proposed grid a spatial object
  # prediction point at the centre of the grid; this will have the maximum
  # variance because it is farthest from the sample points
  xypred <- data.frame(x=mean(xsc), y=mean(ysc))
  coordinates(xypred) <- ~x+y # also a spatial object
  prediction <- krige( # predict at this one point
    formula = z ~ 1,
    locations = xygridsc,
    newdata = xypred,
    model = vgmf,
    debug.level = 0
  )
  maxkrigvar[i] <- prediction$var1.var
}
```
Summarize and plot the results:
```{r}
library(ggplot2)
(df <- data.frame(spacings, maxkrigvar))
ggplot() +
geom_point(data=df,mapping=aes(x=spacings,y=maxkrigvar),size=3) +
scale_x_continuous(name = "Grid spacing (km)") +
scale_y_continuous(name = "Maximum kriging variance",limits=c(0.7,0.9))
```
**Question**: How does the maximum kriging variance vary with grid spacing?
**Answer**:
**Question**: What is the maximum spacing that will give a maximum kriging *standard deviation* of 0.8% SOC over the map?
**Answer**:
**Challenge**: What is the minimum possible maximum kriging standard deviation at any spacing?
## Exploring the sensitivity of grid spacing for variogram parameters
In practice we are uncertain about the variogram. For that reason it can be wise to explore how sensitive the determined grid spacing is to the variogram parameters.
### Sensitivity to the nugget parameter
Change the nugget parameter to 0.4, leave the other variogram parameters unchanged, and determine the maximum grid spacing.
```{r}
vgmf_lessnug <- vgm(psill = vgmf$psill[2], model = "Sph", range = vgmf$range[2], nugget = 0.4)
maxkrigvar_lessnug <- numeric(length=length(spacings))
for (i in 1:length(spacings)) {
  xsc <- xygrid$x * spacings[i]
  ysc <- xygrid$y * spacings[i]
  xygridsc <- data.frame(x=xsc, y=ysc, z=1)
  coordinates(xygridsc) <- ~x+y
  xypred <- data.frame(x=mean(xsc), y=mean(ysc))
  coordinates(xypred) <- ~x+y
  prediction <- krige(
    formula = z ~ 1,
    locations = xygridsc,
    newdata = xypred,
    model = vgmf_lessnug,
    debug.level = 0
  )
  maxkrigvar_lessnug[i] <- prediction$var1.var
}
(df.lessnug <- data.frame(spacings, maxkrigvar_lessnug))
ggplot() +
geom_point(data=df.lessnug,mapping=aes(x=spacings, y=maxkrigvar_lessnug),size=3) +
geom_point(data=df, mapping=aes(x=spacings, y=maxkrigvar), size=3, col="red") +
scale_x_continuous(name = "Grid spacing (km)") +
scale_y_continuous(name = "Maximum kriging variance",limits=c(0.5,1))
```
**Question**: Explain the difference in determined grid spacing.
**Answer**:
### Sensitivity to the range parameter
Do the same for a smaller range of local spatial dependence. Change the range to 30 km.
```{r}
vgmf_smallerrange <- vgm(psill = vgmf$psill[2], model = "Sph", range = 30, nugget = vgmf$psill[1])
maxkrigvar_smallerrange <- numeric(length=length(spacings))
for (i in 1:length(spacings)) {
  xsc <- xygrid$x * spacings[i]
  ysc <- xygrid$y * spacings[i]
  xygridsc <- data.frame(x=xsc, y=ysc, z=1)
  coordinates(xygridsc) <- ~x+y
  xypred <- data.frame(x=mean(xsc), y=mean(ysc))
  coordinates(xypred) <- ~x+y
  prediction <- krige(
    formula = z ~ 1,
    locations = xygridsc,
    newdata = xypred,
    model = vgmf_smallerrange,
    debug.level = 0
  )
  maxkrigvar_smallerrange[i] <- prediction$var1.var
}
# use a new name, so as not to overwrite the data frame for the original variogram
(df.smallerrange <- data.frame(spacings, maxkrigvar_smallerrange))
ggplot() +
  geom_point(data=df.smallerrange, mapping=aes(x=spacings, y=maxkrigvar_smallerrange), size=3) +
  geom_point(data=df, mapping=aes(x=spacings, y=maxkrigvar), size=3, col="red") +
  scale_x_continuous(name = "Grid spacing (km)") +
  scale_y_continuous(name = "Maximum kriging variance", limits=c(0.5,1))
```
**Question**: Explain the difference in determined grid spacing.
**Answer**:
**Question**: What is the maximum spacing that will give a maximum kriging *standard deviation* of 0.8% SOC over the map? How does this compare with the longer-range variogram?
**Answer**:
# Sampling for mapping: Conditional Latin Hypercube sampling
An alternative to k-means clustering in feature space is to locate samples in feature space according to a *Latin hypercube*, so that the marginal distribution of each covariate is represented as evenly as possible with the desired number of sample points. This is especially useful for machine-learning methods such as random forests, where we want to be sure to cover the feature space.
The *conditioned* LHS (cLHS) additionally tries to match the inter-covariate correlations of the sample set to those of the full population of covariates.
Reference and algorithm: Minasny, B., & McBratney, A. B. (2006). A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, 32(9), 1378–1388. https://doi.org/10.1016/j.cageo.2005.12.009
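As a minimal sketch of the marginal Latin-hypercube constraint (with a made-up covariate, not the Hunter Valley data): divide a covariate's distribution into $n$ equal-probability strata and draw one value from each, so that every quantile interval is represented exactly once.

```r
# Marginal Latin-hypercube constraint for one covariate: one sample per
# equal-probability stratum. The covariate values are made up (standard normal).
set.seed(42)
covar <- rnorm(1000)
n <- 10
breaks <- quantile(covar, probs = seq(0, 1, length.out = n + 1))
stratum <- cut(covar, breaks, include.lowest = TRUE, labels = FALSE)
chosen <- sapply(1:n, function(s) sample(which(stratum == s), 1))
table(stratum[chosen])  # exactly one point in each of the n strata
```

cLHS must do this simultaneously for all covariates on a fixed set of candidate locations, which is why it needs a search algorithm rather than direct construction.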
## Reading the data
Load packages and read grid with covariates
```{r}
library(sp)
library(spsann)
library(reshape)
library(ggplot2)
load(file="HunterValley4Practicals.RData")
grid <- grdHunterValley
rm(grdHunterValley)
```
The grid has too many points (leading to memory problems below), so we select a subgrid.
```{r}
gridded(grid) <- c("Easting","Northing")
subgrid <- spsample(grid,type="regular", cellsize=50, offset=c(0.5,0.5))
subgriddata <- (subgrid %over% grid) # overlay function to get covariate values at these points
grd <- data.frame(coordinates(subgrid), subgriddata)
```
Specify the candidate sampling points and the covariates to be used in conditioned Latin hypercube sampling:
```{r}
candi <- grd[,1:2]
names(candi) <- c("x","y")
covars <- grd[, 3:7]
```
## Optimizing the sample by simulated annealing
cLHS is an example of a combinatorial problem that is far too large to solve by exhaustive search.
An alternative is to start with a random or fixed scheme and repeatedly propose small modifications, accepting a modification if it improves the scheme. A purely greedy search, however, can get trapped in a local optimum. Simulated annealing avoids this by also accepting some worsening modifications, with a probability that decreases as the "temperature" is lowered.
This method was introduced by Kirkpatrick _et al._ (1983) and named by analogy to physical annealing as used in metallurgy:
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science, 220(4598), 671–680. https://doi.org/10.1126/science.220.4598.671
A good explanation in the context of spatial sampling is:
van Groenigen, J. W., & Stein, A. (1998). Constrained optimization of spatial sampling using continuous simulated annealing. Journal of Environmental Quality, 27(5), 1078–1086. https://doi.org/10.2134/jeq1998.00472425002700050013x
Spatial simulated annealing (SSA) has been implemented in package `spsann` by Alessandro Samuel-Rosa as part of his PhD project at Universidade Federal Rural do Rio de Janeiro, Brazil, in cooperation with ISRIC-World Soil Information.
### The simulated annealing schedule
Define the _schedule_ for simulated annealing. This controls how the annealing will proceed, and requires careful tuning.
Note that both the initial acceptance rate and the initial temperature are set, which may seem unnecessary, as the acceptance rate is a function of the initial temperature: $P = e^{-\Delta f / T}$. The initial acceptance rate is used as a threshold value: if the chosen initial temperature leads to an acceptance rate smaller than this threshold, the optimization stops, and a larger initial temperature must be chosen.
The `chain.length` is the number of iterations during which the temperature is kept constant. The value of `chain.length` is a multiplier, see `?scheduleSPSANN`. So when, for instance, 50 locations are optimized and `chain.length` is 2, the chain length equals $50 \times 2 = 100$ iterations.
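A quick numerical illustration of the acceptance probability (the values here are chosen for illustration only): for the same worsening step $\Delta f$, a higher temperature accepts the step far more often.

```r
# Metropolis acceptance probability P = exp(-delta_f / T) for a
# worsening step delta_f = 0.01 at two illustrative temperatures.
delta_f <- 0.01
temps <- c(0.02, 0.002)
exp(-delta_f / temps)  # ~0.61 at T = 0.02, ~0.0067 at T = 0.002
```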
```{r}
schedule <- scheduleSPSANN(initial.acceptance = 0.8,
initial.temperature = 0.02,
temperature.decrease=0.95,
chains=1000,
chain.length=4,
stopping=10,
x.min=50,
y.min=50,
cellsize=50)
```
### Optimization
Now start the simulated annealing algorithm, using the `optimCLHS` function of `spsann`. This should take several minutes (1 -- 10) depending on the system.
Note that there are several optimization criteria, which can be given different weights; these are explained in Minasny and McBratney (2006).
* O1: match the empirical CDF of the covariables (i.e., original LHS)
* O2: the equivalent of O1 for categorical covariates (matching class proportions)
* O3: match correlation of the sampled points to that of the original full dataset
Here we weight O1 and O3 equally:
```{r}
weights <- list(O1 = 0.5, O3 = 0.5)
samplesize <- 50
set.seed(314) # reproduce one solution -- can remove
system.time(
res <- optimCLHS(
points = samplesize,
candi = candi,
covars = covars,
schedule = schedule,
weights = weights,
track=TRUE) # to record the objective function value vs. chain
)
```
## Result
Check whether optimization has converged:
```{r}
ggplot(res$objective$energy) +
geom_line(mapping = aes(x=1:nrow(res$objective$energy), y = obj),
colour="red") +
scale_y_continuous(name = "O") +
scale_x_continuous(name = "Chain")
```
**Question**: Approximately how many chains were needed for convergence? Are we sure we've found a true optimum?
**Answer**:
Make a matrix of the strata divided into a number of quantiles equal to the number of sampling points. We can then check this empirical cumulative probability distribution against that of the selected points.
```{r}
probs <- seq(from=0, to=1, by=1/samplesize) # divide [0,1] into n quantiles
last <- length(probs); n.covars <- 5
lb <- matrix(nrow=last-1, ncol=n.covars) # matrix to hold the quantiles
colnames(lb) <- names(grd)[3:7]
# find the quantiles of each covariate
for (i in 1:n.covars) {
q <- quantile(grd[,i+2], probs=probs)
lb[,i] <- q[-last]
}
head(lb)
```
Check that the O1 criterion (match the empirical CDF of the covariables) is met:
```{r}
mySample <- res$points
# covariate values at these points
mySample <- data.frame(mySample,grd[mySample$id,3:7])
# a matrix of the points and their quantile in each covariable
stratum <- matrix(nrow=nrow(mySample), ncol=n.covars)
colnames(stratum) <- names(grd)[3:7]
for ( i in 1:n.covars) {
  # left.open=TRUE makes the intervals open on the left and closed on the right: (a, b]
stratum[,i] <- findInterval(mySample[,i+4],lb[,i], left.open=TRUE)
}
head(stratum)
# count how many points are in each stratum -- should be 1
counts <- matrix(nrow=nrow(lb), ncol=n.covars)
colnames(counts) <- names(grd)[3:7]
# for each quantile, find the points in each column that are in that quantile
# sum these for a count. Ideally this would be a matrix of 1's, i.e., exactly O1
for (q in 1:nrow(lb)) {
counts[q,] <- apply(stratum, MARGIN=2, function(x,i) sum(x==i), i=q)
}
head(counts, n=12)
```
**Question**: Is there exactly one point in each of the `r nrow(lb)` quantiles for each of the `r n.covars` covariates? Is this an exact Latin Hypercube?
**Answer**:
Compute the O1 criterion:
```{r}
# total absolute deviation of the counts from 1, i.e., departure from an exact LHS
(not.ones <- sum(abs(counts-1)))
# this divided by n*p, i.e., the O1 criterion
(O1 <- (not.ones)/(samplesize*n.covars))
```
**Question**: What is the computed value of the O1 criterion from this SSA optimization? What would be the ideal?
**Answer**:
Plot the sample sizes in the marginal strata. We would like one point in each, but because cLHS only approximates an exact Latin hypercube, some strata contain 0 or 2 points, although the total is always the same.
```{r}
index <- 1:samplesize
countsdf <- as.data.frame(counts)
names(countsdf) <- names(grd[3:7])
countslf <- melt(countsdf)
countslf$index <- rep(index, times=5)
ggplot(countslf) +
geom_point(mapping = aes(x=index,y = value), colour = "black",size=1) +
facet_wrap(~variable) +
scale_x_continuous(name = "Stratum") +
scale_y_continuous(name = "Sample size",breaks=c(0,1,2,3,4))
```
Compute sample and population correlation matrices and see how well they match; this is O3, and should be as small as possible.
```{r}
round(rsample <- cor(mySample[, 5:9]), 4)
round(rpopulation <- cor(grd[, 3:7]), 4)
round(r.diff <- rpopulation - rsample, 4)
# O3 sums the absolute differences of the off-diagonal correlations
(O3 <- sum(abs(r.diff))/(n.covars^2 - n.covars))
```
**Question**: What is the value of O3? What is the ideal?
**Answer**:
Plot the optimized sample in geographical space, with one of the covariates as background. We choose the compound topographic index (CTI), which shows where water sheds and accumulates:
```{r fig.height=6, fig.width=6}
ggplot(data=grd) +
geom_tile(mapping = aes(x = x1, y = x2, fill = cti))+
geom_point(data = mySample, mapping = aes(x = x, y = y), colour = "black") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
scale_fill_gradient(name="cti",low = "darkblue", high = "red")+
coord_fixed()
```
**Question**: Is there any spatial pattern to the placement of the sample points? Why or why not?
**Answer**:
Plot the optimized sample in covariate space by a 2D scatter plot of all the cells, the selected cells, and the cutoffs selected to define the hypercube:
```{r fig.height=6, fig.width=6}
ggplot(data=grd) +
geom_point(mapping = aes(y = elevation_m, x = cti), colour = "black",size=1,alpha=0.5) +
geom_point(data=mySample, mapping = aes(y = elevation_m, x = cti), colour = "red",size=2) +
geom_vline(xintercept=lb[-1,"cti"],colour="grey")+
geom_hline(yintercept=lb[-1,"elevation_m"],colour="grey")+
scale_y_continuous(name = "Elevation") +
scale_x_continuous(name = "CTI")
```
**Question**: Why are the "boxes" of unequal size?
**Answer**:
**Question**: How well do the sample points represent the distribution of elevation and CTI in feature space?
**Answer**:
**Challenge**: Repeat the map and 2D scatterplot for other combinations of covariates.