---
title: "(Geo-)statistical simulation"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    theme: "lumen"
    code_folding: show
    number_sections: TRUE
    fig_keep: TRUE
    fig_height: 4
    fig_width: 6
    fig_align: 'center'
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=4.5, fig.height=4.5, fig.align='center',
                      fig.path='simulfigs/', warning=FALSE, message=FALSE)
```

```{r results='hide'}
rm(list=ls())
```

# Random numbers

Draw samples from various distributions and display their histograms, with the theoretical density superimposed.

A function that can be called repeatedly:

```{r}
test_random <- function(n, binom.trials=32, binom.prob=0.5, pois.lambda=4) {
    # uniform on [0, 1]
    hist(d <- runif(n), breaks=seq(0, 1, by=0.05), freq=FALSE, main="Uniform")
    rug(d); abline(h=1, lty=2, col="red")
    print(summary(d))
    # standard normal, with theoretical density curve
    hist(d <- rnorm(n, mean=0, sd=1), freq=FALSE, main="Normal"); rug(d)
    x <- seq(-3, 3, length=128); y <- dnorm(x, 0, 1)
    lines(x, y, col="red")
    print(summary(d))
    # binomial: successes in `binom.trials` trials with probability `binom.prob`
    hist(d <- rbinom(n, size=binom.trials, prob=binom.prob), main="Binomial",
         freq=FALSE, right=FALSE, breaks=seq(0, binom.trials+1, by=1))
    x <- seq(0, binom.trials, by=1); y <- dbinom(x, binom.trials, prob=binom.prob)
    points(x+0.5, y, col="red", type="h")
    print(table(d))
    # Poisson counts with mean `pois.lambda`
    d <- rpois(n, lambda=pois.lambda)
    hist(d, freq=FALSE, right=FALSE,
         breaks=seq(0, max(3*pois.lambda, d)+1, by=1), main="Poisson")
    x <- seq(0, 3*pois.lambda, by=1); y <- dpois(x, pois.lambda)
    points(x+0.5, y, col="red", type="h")
    print(table(d))
    invisible(NULL)
}
```

Show these for one sample size:

```{r fig.width=7, fig.height=5}
test_random(128)
```

Task: run this with different sample sizes, each sample size a few times.

Q: How does the agreement of the empirical with the theoretical distribution change with increasing sample size?

Q: How does the variability of the empirical distribution change with increasing sample size?

# Simulation

The essential idea is to assume a statistical model for the distribution of one or more variables, and then repeatedly sample from that distribution. Each random sample simulates what might be encountered in reality, with the appropriate probability.

## Binomial distribution

This distribution is used for Bernoulli trials: independent events, each with a given probability of occurring. Each trial is either a "success" (1) or a "failure" (0); we simulate the number of successes. The classic example is repeated coin flips; here we simulate the number of successes in 24 trials, each with success probability 0.2:

```{r fig.width=7, fig.height=5}
n <- 1024
sample <- rbinom(n, size=24, prob=0.2); head(sample, n=20)
(table.k <- table(sample))
plot(table.k/n, xlab="k", ylab="density")
```

A more interesting example is whether a certain group of the population is fairly represented in a sample. For example: in a certain study 30% of the students are male and 70% female. The best 20 students of the class receive a bonus or a scholarship. Do the best 20 represent this 30/70 split? Could a deviation from 6 males and 14 females (exactly matching the population proportions) be due to chance?

```{r fig.width=7, fig.height=5}
n <- 1024
sample <- rbinom(n, size=20, prob=0.3)
(table.k <- table(sample))
plot(table.k/n, xlab="k", ylab="density")
```

Q: What are the maximum and minimum numbers of males selected by chance?

Q: What is the mode (most common value)? Does it match the expected value from the population?

Q: Suppose only 3 males are selected. How strong is the evidence that they are under-represented? (This could be due either to their poorer performance or to discrimination; we have no way to distinguish the cause.)

```{r}
length(which(sample <= 3))/n
```
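As a cross-check on this simulation-based estimate, the same tail probability can be computed directly from the theoretical binomial distribution with `pbinom()`; a minimal sketch (the simulated proportion above should be close to this value):

```{r}
# theoretical probability of selecting 3 or fewer males out of 20,
# when each selected student is male with probability 0.3
pbinom(3, size=20, prob=0.3)
```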
## Normal distribution

Suppose we want to know how large a random sample should be to reliably estimate a total.

```{r fig.width=7, fig.height=5}
nsim <- 256                                    # number of simulation runs
totals <- vector(mode="numeric", length=nsim)  # save run results
n <- 32; mu <- 100; s <- 1                     # sample size, population mean and s.d.
for (run in 1:nsim) {
    totals[run] <- sum(rnorm(n, mean=mu, sd=s))
}
summary(totals)
print(paste("The actual total is", mu*n))
print(paste("Sampling standard deviation of the total:", round(sd(totals), 4)))
hist(totals, breaks=16); rug(totals)
```

## Mix of distributions

A practical example: the risk of overloading an aircraft. What is the risk of an overweight airplane on a full 19-seat flight?

* Passenger weights are assumed to follow a normal distribution.
* Estimate the mean and standard deviation from measurements on the target population.
* Use separate distributions for males and females; this is a hierarchical model: gender (binomial) → gender-specific normal.
* Estimate the mean proportion of female passengers (the parameter of the binomial).
* Simulate the number of females/males among the 19 passengers, from the binomial distribution.
* Simulate each individual's weight; sum all 19.
* Compare to the maximum allowable weight; find the proportion of overweight flights.

```{r}
# parameters of the normal distributions;
# estimate these from the target population
# mean, s.d. of fe/male weights, kg
mu.m <- 100; sd.m <- 14; mu.f <- 50; sd.f <- 12
# mean proportion of female passengers: binomial parameter (theta)
prop.f.mu <- 0.05
```

```{r}
# simulation
nsim <- 2048        # number of simulations
n.females <- vector(mode="integer", length=nsim)  # females per flight
wt.sum <- vector(mode="integer", length=nsim)     # total passenger weight per flight
n <- 19             # aircraft size
for (run in 1:nsim) {
    # number of fe/males
    num.f <- rbinom(n=1, size=n, prob=prop.f.mu)
    num.m <- n - num.f
    n.females[run] <- num.f
    # individual weights
    wts.f <- rnorm(num.f, mean=mu.f, sd=sd.f)
    wts.m <- rnorm(num.m, mean=mu.m, sd=sd.m)
    wt.sum[run] <- ceiling(sum(wts.f) + sum(wts.m))
}
summary(wt.sum)
summary(n.females)
```

```{r}
# Fairchild Metro II: empty 3380 kg, max takeoff 5670 kg, total load 2290 kg
load.wt <- 2290; pilots.wt <- 200; fuel.wt <- 600
max.load <- load.wt - pilots.wt - fuel.wt
(n.overweight <- sum(wt.sum > max.load))
# probability of an overweight flight
(prob.overweight <- round(n.overweight/nsim, 3))
```

```{r fig.width=7, fig.height=5}
# figures
# number of females per flight
plot(table(n.females), xlim=c(0, 19), ylab="frequency",
     xlab="females per 19 passengers", lwd=2)
grid()
# distribution of total passenger weight, with the maximum load marked
hist(wt.sum, main="", xlab="Total passenger weight, kg", breaks=24); rug(wt.sum)
abline(v=max.load, col="red")
```

# Geostatistical simulation

## Sampling

We can simulate the placement of random sample points within a study area, as in the sketch below.
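A minimal sketch, assuming a notional square study area with unit side length (no project data are loaded at this point): simple random sampling just draws independent uniform coordinates.

```{r}
# simulate 64 simple random sample locations in a unit square
set.seed(42)   # illustrative seed, for reproducibility
n.samp <- 64
samp.xy <- data.frame(x = runif(n.samp), y = runif(n.samp))
plot(samp.xy, asp = 1, pch = 20, main = "Simulated simple random sample")
```

Once a gridded study area is available (as below), `sp::spsample` offers random, regular, and stratified designs over an actual geometry.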
## Random fields

There are several obvious problems with kriging predictions over an area, with respect to uncertainty:

1. Kriging is an exact predictor at known points, because all the weight is given to the known point. This is mathematically necessary but not realistic, since even a small distance away from the point the prediction is a weighted average of several observations.
2. Kriging prediction maps are by definition smooth, even if there is a nugget component in the model; the actual spatial field is usually much rougher.

Both of these problems are serious issues for spatially-distributed models. An example is soil hydraulic conductivity in models of groundwater flow: the erratic nature of this attribute leads to much different model outputs than a smoothly-varying field.

So, a map produced by kriging gives an unrealistic view of the fine-scale spatial variability.

We can recover this variability with conditional simulation, which shows one (or many) possible realizations of the spatial field:

* as defined by the covariance structure (variogram); and
* as constrained by the known data values.

What is geostatistical simulation?

* The theory of regionalized variables assumes that the values we observe come from some random process; in the simplest case, one with a single expected value (first-order stationarity) and a spatially-correlated error that is the same over the whole area (second-order stationarity).
* The one reality we observe is the result of this random process.
* There are "alternative realities", that is, spatial patterns that, by this theory, could have occurred in another realization of the same spatial process.
* Simulation is the construction of a gridded surface corresponding to a random function, i.e., a model of spatial correlation.
* The statistical properties of the simulated surface match those of the sample: spatial mean, spatial variance, and semivariogram (model form, partial sill, nugget variance, range parameter).
* Gaussian simulation assumes that the target field is multivariate Gaussian, with a defined stationary spatial mean and covariance structure.
* This generates multiple, equally probable "realities", i.e., possible spatial distributions of the target attribute.

There are various geostatistical simulation algorithms; the one used in gstat is described by Pebesma (2004).

```{r}
library(sp)        # spatial data structures
library(gstat)     # geostatistics
library(lattice)   # trellis graphics
```

Load the Jura data set and set up the prediction points and prediction grid as spatial objects:

```{r}
data(jura)
str(jura.pred)
str(jura.val)
str(jura.grid)
coordinates(jura.pred) <- ~ Xloc + Yloc
class(jura.pred)
coordinates(jura.grid) <- ~ Xloc + Yloc
class(jura.grid)
gridded(jura.grid) <- TRUE
class(jura.grid)
```

Target variable: the Co (cobalt) concentration in topsoils.

```{r fig.width=7, fig.height=7}
print(xyplot(Yloc ~ Xloc, as.data.frame(jura.pred),
             cex=1.2*jura.pred$Co/max(jura.pred$Co), pch=1, asp="iso",
             xlab="East", ylab="North",
             main="Co concentration in topsoils, Jura"))
```

Examine the local spatial correlation with an empirical variogram:

```{r}
v <- variogram(Co ~ 1, loc=jura.pred, cutoff=1.6)
plot(v, plot.numbers=TRUE, xlab="separation distance")
```

Fit a variogram model, i.e., a model of spatial correlation:

```{r}
(vmf <- fit.variogram(v, vgm(12.5, "Sph", 1.2, 1.5)))
plot(v, plot.numbers=TRUE, xlab="separation distance", model=vmf)
```

This is a good model fit. Predict over the study area with this model, using Ordinary Kriging (OK):

```{r}
k.o <- krige(Co ~ 1, loc=jura.pred, newdata=jura.grid, model=vmf)
summary(k.o)
```

View a map of the predictions and of their standard deviations:

```{r}
spplot(k.o, zcol="var1.pred", main="Co, ppm",
       sub="Ordinary Kriging prediction")
spplot(k.o, zcol="var1.var", formula=sqrt(var1.var) ~ Xloc + Yloc,
       main="Co, ppm", sub="Ordinary Kriging prediction standard deviation",
       col.regions=topo.colors(64))
```

This is the best single map, but we see that there is uncertainty.
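To make the smoothing effect discussed under "Random fields" concrete, a quick check (a sketch) is to compare the spread of the OK predictions with the spread of the observations; the predictions typically cover a narrower range and have a smaller standard deviation:

```{r}
# spread of the OK predictions vs. the observations:
# kriging smooths, so the predictions have a narrower range and smaller s.d.
summary(k.o$var1.pred)
summary(jura.pred$Co)
c(sd.pred = sd(k.o$var1.pred), sd.obs = sd(jura.pred$Co))
```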
## Conditional simulation

Many simulated fields can be created, each equally valid, and used as model inputs.

Make four realizations of a conditional simulation of the Co concentration in Jura soils over the prediction grid `jura.grid`, using the OK model.

The `krige` method can also carry out conditional simulation. It requires one additional argument, `nsim`, the number of simulations. For each set of simulations in this section we use `set.seed` so that your results will look the same as ours; in practice you would only do this if you want to be sure of reproducing the same starting point.

```{r}
if (file.exists("./JuraSim4.RData")) {
    load("./JuraSim4.RData")
} else {
    set.seed(621)
    system.time(
        k.sim.4 <- krige(Co ~ 1, loc=jura.pred, newdata=jura.grid,
                         model=vmf, nsim=4, nmax=256)
    )
}
```

Save the simulation results -- they took a long time to compute!

```{r}
if (!file.exists("./JuraSim4.RData"))
    save(k.sim.4, file="./JuraSim4.RData")
```

Compare the means, medians, IQRs and ranges of the four simulations with those of the calibration data set on which the simulation was based:

```{r}
summary(k.sim.4)
summary(jura.pred$Co)
```

Q: How do the ranges of the four simulations compare with the range of the data?

Visualize: display the four conditional simulations in one plot, with the locations of the observation points over-printed.

```{r fig.width=8, fig.height=8}
layout.2 <- list("sp.points", jura.pred, pch=1,
                 cex=1.4*jura.pred$Co/max(jura.pred$Co), col="white")
print(spplot(k.sim.4, zcol=1:4, col.regions=bpy.colors(64),
             main="Conditional simulations, Jura Co (ppm), OK",
             sp.layout=list(layout.2)))
```

Simulation is supposed to respect the spatial structure; that is, the simulated fields should have the same structure as the variogram model. Let's see how well it was reproduced.

Compute the empirical variograms of the four simulations, and plot them, each with the fitted model `vmf`. The simulated grids have to be converted to points rather than pixels.

```{r fig.width=8, fig.height=8}
k.sim.4.pts <- as(k.sim.4, "SpatialPointsDataFrame")
vs.1 <- variogram(sim1 ~ 1, loc=k.sim.4.pts)
vs.2 <- variogram(sim2 ~ 1, loc=k.sim.4.pts)
vs.3 <- variogram(sim3 ~ 1, loc=k.sim.4.pts)
vs.4 <- variogram(sim4 ~ 1, loc=k.sim.4.pts)
gamma.max <- ceiling(max(vs.1$gamma, vs.2$gamma, vs.3$gamma, vs.4$gamma,
                         sum(vmf[, "psill"])))
p.1 <- plot(vs.1, model=vmf, ylim=c(0, gamma.max), main="Simulation 1")
p.2 <- plot(vs.2, model=vmf, ylim=c(0, gamma.max), main="Simulation 2")
p.3 <- plot(vs.3, model=vmf, ylim=c(0, gamma.max), main="Simulation 3")
p.4 <- plot(vs.4, model=vmf, ylim=c(0, gamma.max), main="Simulation 4")
print(p.1, split=c(1,1,2,2), more=TRUE)
print(p.2, split=c(1,2,2,2), more=TRUE)
print(p.3, split=c(2,1,2,2), more=TRUE)
print(p.4, split=c(2,2,2,2), more=FALSE)
```

How well do the simulations reproduce the variogram? Do the empirical variograms of the four simulations match the variogram model? What are the differences? Why?
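One way to quantify the visual comparison (a sketch): refit the same model form to each simulation's empirical variogram with `fit.variogram`, and compare the fitted nugget, partial sill, and range parameter with the original model `vmf`.

```{r}
# refit the model to each simulation's empirical variogram and compare
# the fitted parameters with those of the original model vmf
sapply(list(model = vmf,
            sim1  = fit.variogram(vs.1, vmf),
            sim2  = fit.variogram(vs.2, vmf),
            sim3  = fit.variogram(vs.3, vmf),
            sim4  = fit.variogram(vs.4, vmf)),
       function(m) c(nugget = m$psill[1],
                     psill  = m$psill[2],
                     range  = m$range[2]))
```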
## Unconditional simulation

In some situations we would like to simulate a random field defined by an assumed spatial covariance structure (e.g., as represented by a variogram model), without conditioning on any data points. For example, when designing a sampling scheme, if we have an idea of the field structure we can create many realizations by unconditional simulation and see how well the proposed scheme would capture the known structure.

This uses the same `krige` method as conditional simulation. However, for unconditional simulation we specify:

1. that there are no data locations, by setting the `loc` argument to `NULL`;
2. that there are no data, by setting the `dummy` argument to `TRUE`;
3. the value of the trend surface coefficients, by setting the `beta` argument to the presumed mean; for OK this is the spatial mean, so that OK becomes Simple Kriging (SK).

The `beta` argument gives the expected value of the stationary field; here it is most reasonably set to the non-spatial mean of the Co concentration at the sample points. As with conditional simulation, the `nmax` and/or `maxdist` optional arguments are often specified as well.

```{r}
if (file.exists("./JuraSimU4.RData")) {
    load("./JuraSimU4.RData")
} else {
    set.seed(621)
    system.time(
        k.sim.4.u <- krige(z ~ 1, loc=NULL, newdata=jura.grid, model=vmf,
                           nsim=4, nmax=128, beta=mean(jura.pred$Co), dummy=TRUE)
    )
}
```

```{r}
if (!file.exists("./JuraSimU4.RData"))
    save(k.sim.4.u, file="./JuraSimU4.RData")
```

Compare the means, medians, IQRs and ranges of the four simulations, and display them in one figure:

```{r fig.width=8, fig.height=8}
summary(k.sim.4.u)
print(spplot(k.sim.4.u, zcol=1:4, col.regions=bpy.colors(64),
             main="Unconditional simulations, Co variogram model"))
```

Describe the similarities and differences among the four simulations.

## Simulating different spatial structures

One use of unconditional simulation is to visualize different random fields. We begin by comparing random fields with the same variogram parameters but different model forms.

Use the same variogram parameters for four different models of spatial dependence: simulate random fields on the Jura grid for the spherical, pentaspherical, exponential, and Gaussian models, all with the same effective range, partial sill and nugget as the fitted model for Jura cobalt. Recall that the effective range of the exponential model is $3$ times the range parameter, and for the Gaussian model it is $\sqrt{3}$ times the range parameter.

```{r fig.width=8, fig.height=8}
c0 <- vmf[1, "psill"]; c1 <- vmf[2, "psill"]; a <- vmf[2, "range"]
p.1 <- show.vgms(max=round(a*1.4, 2), sill=round(c1, 2), range=round(a, 2),
                 nugget=round(c0, 2), models="Sph")
p.2 <- show.vgms(max=round(a*1.4, 2), sill=round(c1, 2), range=round(a, 2),
                 nugget=round(c0, 2), models="Pen")
p.3 <- show.vgms(max=round(a*1.4, 2), sill=round(c1, 2), range=round(a/3, 2),
                 nugget=round(c0, 2), models="Exp")
p.4 <- show.vgms(max=round(a*1.4, 2), sill=round(c1, 2), range=round(a/sqrt(3), 2),
                 nugget=round(c0, 2), models="Gau")
print(p.1, split=c(1,1,2,2), more=TRUE)
print(p.2, split=c(2,1,2,2), more=TRUE)
print(p.3, split=c(1,2,2,2), more=TRUE)
print(p.4, split=c(2,2,2,2), more=FALSE)
```

Simulate; we also need to specify the mean value of the field:

```{r}
mu <- mean(jura.pred$Co)
set.seed(621)
k.sim.u.sph <- krige(z ~ 1, loc=NULL, newdata=jura.grid,
                     model=vgm(c1, "Sph", a, c0),
                     nsim=1, nmax=128, beta=mu, dummy=TRUE)
k.sim.u.pen <- krige(z ~ 1, loc=NULL, newdata=jura.grid,
                     model=vgm(c1, "Pen", a, c0),
                     nsim=1, nmax=128, beta=mu, dummy=TRUE)
k.sim.u.exp <- krige(z ~ 1, loc=NULL, newdata=jura.grid,
                     model=vgm(c1, "Exp", a/3, c0),
                     nsim=1, nmax=128, beta=mu, dummy=TRUE)
k.sim.u.gau <- krige(z ~ 1, loc=NULL, newdata=jura.grid,
                     model=vgm(c1, "Gau", a/sqrt(3), c0),
                     nsim=1, nmax=128, beta=mu, dummy=TRUE)
```

Display the simulated fields in one figure:

```{r fig.width=8, fig.height=8}
k.sim.u <- k.sim.u.sph
names(k.sim.u) <- "sim.sph"
k.sim.u$sim.pen <- k.sim.u.pen$sim1
k.sim.u$sim.exp <- k.sim.u.exp$sim1
k.sim.u$sim.gau <- k.sim.u.gau$sim1
spplot(k.sim.u, zcol=c(3,4,1,2), col.regions=bpy.colors(64))
```

Describe the differences between the fields simulated by the four models with different structures of spatial correlation.

# References

E. J. Pebesma. Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30(7):683-691, 2004. DOI: 10.1016/j.cageo.2004.03.012