---
title: "Variance"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_height: 4
    fig_width: 6
    fig_align: 'center'
---
This small example explains the concept of variance and how it is affected by classification.
# Definition of the variance of a sample
The variance $s^2$ of a set of $n$ numbers is (nearly) the average squared separation $(z_i - \bar{z})^2$ from the mean $\bar{z}$ of the set, with a small correction: dividing by $(n-1)$ rather than $n$ (Bessel's correction):
$$s^2 = \frac{1}{(n-1)} \sum_{i=1}^n (z_i - \bar{z})^2$$
It measures the overall variability of the dataset. The sample $s^2$ is an estimate of the (unknown) population variance $\sigma^2$ -- this assumes a simple random sample (SRS) from the population!
For example, here are samples of $n = 12$ from two simulated populations of heights, normally-distributed: population 1 with known $\mu = 165$ and variance $\sigma^2 = 12$, and population 2 with $\mu = 150$ and variance $\sigma^2 = 48$. Population 2 is more variable and has a lower mean, perhaps all the children in grades 1--12, whereas population 1 is only the children in grade 12.
```{r}
n <- 12
sort(sample.1 <- rnorm(n, mean=165, sd=sqrt(12)))
sort(sample.2 <- rnorm(n, mean=150, sd=sqrt(48)))
```
If we only had these samples, we would estimate the variance as:
```{r}
1/(n-1) * sum((sample.1 - mean(sample.1))^2)
(v.1 <- var(sample.1))
1/(n-1) * sum((sample.2 - mean(sample.2))^2)
(v.2 <- var(sample.2))
```
# Confidence intervals
It is quite unlikely that with a small sample we would accurately estimate the true mean and variance.
You can run the above two code chunks several times to see the variability of the estimates of a population variance from a sample variance.
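Instead of re-running the chunks by hand, we can simulate this: a sketch (the sample size and parameters are those of population 1 above) that draws many samples and summarizes their sample variances:

```r
# draw 1000 samples of size 12 from population 1 and compute each sample variance
v.sim <- replicate(1000, var(rnorm(12, mean = 165, sd = sqrt(12))))
summary(v.sim) # the estimates scatter widely around the true sigma^2 = 12
sd(v.sim)      # the variability of the variance estimator itself
```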
Based on the sample variance, a confidence interval within which the true population variance (which we suppose we do not know) lies, with a probability assigned by us, is given by:
$$
\frac{(n-1) s^2}{\chi^2_{\alpha/2, (n-1)}} \le \sigma^2 \le \frac{(n-1) s^2}{\chi^2_{1-\alpha/2, (n-1)}}
$$
where $\alpha$ is the probability that an interval constructed this way fails to cover the true variance; that is, the interval has confidence level $(1-\alpha)$. For these two samples, 90% confidence intervals ($\alpha = 0.10$):
```{r}
c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
c(((n-1)*v.2)/qchisq(0.95, df=(n-1)), ((n-1)*v.2)/qchisq(0.05, df=(n-1)))
```
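We can also check by simulation that these 90% intervals behave as claimed — a sketch, again using population 1; the proportion of simulated intervals that cover the true $\sigma^2 = 12$ should be close to 0.90:

```r
sigma2 <- 12
covered <- replicate(2000, {
  s2 <- var(rnorm(12, mean = 165, sd = sqrt(sigma2)))
  lo <- 11*s2 / qchisq(0.95, df = 11)  # df = n - 1 = 11
  hi <- 11*s2 / qchisq(0.05, df = 11)
  (lo <= sigma2) && (sigma2 <= hi)
})
mean(covered)  # should be close to 0.90
```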
# Effect of sample size
The sample size does not affect the true population variance, but it does affect the width of the confidence interval in which we think the true variance lies.
These should all estimate the same $\sigma^2 = 12$:
```{r}
(v.1 <- var(sample.1 <- rnorm(n, mean=165, sd=sqrt(12))))
(v.2 <- var(sample.2 <- rnorm(2*n, mean=165, sd=sqrt(12))))
(v.3 <- var(sample.3 <- rnorm(3*n, mean=165, sd=sqrt(12))))
(v.4 <- var(sample.4 <- rnorm(4*n, mean=165, sd=sqrt(12))))
```
But the confidence intervals will in general get narrower:
```{r}
c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
c(((2*n-1)*v.2)/qchisq(0.95, df=(2*n-1)), ((2*n-1)*v.2)/qchisq(0.05, df=(2*n-1)))
c(((3*n-1)*v.3)/qchisq(0.95, df=(3*n-1)), ((3*n-1)*v.3)/qchisq(0.05, df=(3*n-1)))
c(((4*n-1)*v.4)/qchisq(0.95, df=(4*n-1)), ((4*n-1)*v.4)/qchisq(0.05, df=(4*n-1)))
```
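The repeated expressions above can be collected into a small helper function; `var.ci` is our own name, not a built-in. Using the correct degrees of freedom for each sample, the widths of the intervals shrink as the sample size grows:

```r
var.ci <- function(x, alpha = 0.10) {
  df <- length(x) - 1
  s2 <- var(x)
  c(lower = df * s2 / qchisq(1 - alpha/2, df),
    upper = df * s2 / qchisq(alpha/2, df))
}
# interval widths for the four samples of increasing size
sapply(list(sample.1, sample.2, sample.3, sample.4),
       function(x) diff(var.ci(x)))
```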
# Reducing variance by subpopulations
If there is some attribute of each sampling unit that affects the value, this may allow a reduction of variance, by using this attribute to divide the population. Some of the total variance is explained by the attribute. We can see this in a typical analysis of variance of an experiment.
This is one of the sample R datasets: a factorial experiment on the effects of nitrogen (N), phosphate (P), and potassium (K) fertilizers on pea yield.
```{r}
data(npk)
summary(npk)
```
The overall sample variance:
```{r}
var(npk$yield)
```
and the sample variance for two subsets: the two levels of Nitrogen:
```{r}
by(npk$yield, npk$N, var)
```
In this case the sub-populations are less variable than the overall population. This is because the means are different:
```{r}
mean(npk$yield)
by(npk$yield, npk$N, mean)
```
and the values in each set are closer to that set's mean than to the overall mean.
We can see this graphically:
```{r}
library(ggplot2)
# histogram of yield, coloured by N level
ggplot(npk, aes(x = yield, fill = N)) + geom_histogram(bins = 8)
```
Notice that the yields for N=0 are lower, N=1 are higher, and both have a narrower range than the full dataset.
The analysis of variance shows how much of the variance is explained by the factor:
```{r}
summary(aov(yield ~ N, data=npk))
```
Notice how a portion of the total sum of squares (TSS) is accounted for by the factor; the remainder is the residual sum of squares (RSS); these can be converted to variances by their respective degrees of freedom.
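This decomposition can be verified by hand from the group means, using only base R; the factor and residual sums of squares add up to the total:

```r
tss  <- sum((npk$yield - mean(npk$yield))^2)  # total sum of squares
g    <- ave(npk$yield, npk$N)                 # each observation's group mean
ss.n <- sum((g - mean(npk$yield))^2)          # sum of squares explained by N
rss  <- sum((npk$yield - g)^2)                # residual sum of squares
c(total = tss, factor = ss.n, residual = rss, factor.plus.residual = ss.n + rss)
ss.n / tss  # proportion of the total explained: the (unadjusted) R^2
```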
The `lm` method shows this in a somewhat different way:
```{r}
summary(lm(yield ~ N, data=npk))
```
The multiple $R^2$ is the proportion of the total sum of squares explained by the factor; the adjusted $R^2$ corrects this for the degrees of freedom used by the model.
# Prediction variance of a linear model
Fit a linear model of tree volume based on its girth (circumference) and height:
```{r}
data(trees)
summary(trees)
(m.v.gh <- lm(Volume ~ Girth*Height, data=trees))
summary(m.v.gh)
```
The model is good but not perfect: each coefficient is well-estimated but has some uncertainty. So when we use this model to predict the volume of a newly-observed tree (from its girth and height), our prediction is not certain:
```{r}
(volume <- predict(m.v.gh, newdata=data.frame(Girth=14, Height=70), se.fit=TRUE, interval="prediction"))
```
The `se.fit` here is the standard error of the fitted (mean) value; the prediction interval is wider, because it also accounts for the residual variance $\sigma^2$ of a single new observation.
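The difference between the two is easy to see by requesting both kinds of interval from `predict.lm`: the confidence interval for the mean response is narrower than the prediction interval for a single new tree:

```r
new.tree <- data.frame(Girth = 14, Height = 70)
predict(m.v.gh, newdata = new.tree, interval = "confidence")  # for the mean response
predict(m.v.gh, newdata = new.tree, interval = "prediction")  # for a new observation
```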
The prediction variance is:
$$
\sigma^2 \left[1 + \mathbf{X^*} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X^*}' \right]
$$
where $\mathbf{X}$ is the design (model) matrix and $\mathbf{X^*}$ is the row vector of predictor values at the point to be predicted; here $[1, 14, 70, 14 \cdot 70]$, including the intercept and the interaction term.
The design matrix here is:
```{r}
(mm <- model.matrix(m.v.gh))
```
The inverse of its crossproduct, $(\mathbf{X}'\mathbf{X})^{-1}$, is:
```{r}
(mm2.i <- solve(t(mm) %*% mm))
```
This is pre- and post-multiplied by the predictor values. Note that a plain vector in R is interpreted as a column vector in matrix operations, so to get the row vector $\mathbf{X^*}$ we take its transpose.
```{r}
# design row: intercept 1, then Girth, Height, and their interaction Girth*Height
(obs <- c(1, 14, 70, 14*70))
(tmp <- c(t(obs) %*% mm2.i %*% obs))
```
Completing the calculation:
```{r}
# estimate of \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh))
c(sig2*(1 + tmp))
```
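As a check, this hand-computed prediction variance reproduces the half-width of the 95% prediction interval returned by `predict.lm` (which uses a $t$-quantile with the residual degrees of freedom):

```r
p.int <- predict(m.v.gh, newdata = data.frame(Girth = 14, Height = 70),
                 interval = "prediction")
c(hand    = qt(0.975, df.residual(m.v.gh)) * sqrt(sig2 * (1 + tmp)),
  predict = unname((p.int[, "upr"] - p.int[, "lwr"]) / 2))
```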
Here is some matrix algebra, using the QR decomposition stored in the fitted model, to compute the prediction standard error directly from the linear model.
```{r}
tm <- delete.response(terms(m.v.gh)) # the right-hand side of the formula
(Xp <- model.matrix(tm, data=data.frame(Girth=14, Height=70))) # make a design matrix from the predictors
(pred <- c(Xp %*% coef(m.v.gh))) # directly compute the prediction
QR <- m.v.gh$qr # the QR decomposition of the model matrix
# solve the lower-triangular system R'B = Xp'; then crossprod(B) = Xp (X'X)^{-1} Xp'
(B <- forwardsolve(t(QR$qr), t(Xp), k=QR$rank))
# estimate \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh)) # squared residuals divided by d.f.
sqrt(sig2) # we saw this in the summary() above, it is our estimate of \sigma
(fit.var <- c(crossprod(B) * sig2)) # variance of the fitted (mean) value
(fit.se <- sqrt(fit.var)) # we saw this as `se.fit` in the results of predict.lm above
(pred.var <- fit.var + sig2) # prediction variance for a new observation: add the residual variance
```
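Finally, we can confirm that the QR-based computation agrees with the explicit inverse of $\mathbf{X}'\mathbf{X}$ computed earlier:

```r
# crossprod(B) should equal Xp (X'X)^{-1} Xp'
all.equal(c(crossprod(B)), c(Xp %*% solve(crossprod(mm)) %*% t(Xp)))
```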