This small example explains the concept of variance and how it is affected by classification, i.e., grouping the observations into subpopulations.

# Definition of the variance of a sample

The variance $$s^2$$ of a set of $$n$$ numbers is the average squared deviation $$(z_i - \bar{z})^2$$ from the mean $$\bar{z}$$ of the set, with a small correction: dividing by $$(n-1)$$ rather than $$n$$:

$s^2 = \frac{1}{(n-1)} \sum_{i=1}^n (z_i - \bar{z})^2$

It measures the overall variability of the dataset. The sample $$s^2$$ is an estimate of the (unknown) population variance $$\sigma^2$$ – this assumes a simple random sample (SRS) from the population!
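As a minimal check of this definition (with a small made-up vector, not data from this example), the formula and R's built-in var() agree:

z <- c(2, 4, 4, 4, 5, 5, 7, 9)              # a small illustrative vector
sum((z - mean(z))^2) / (length(z) - 1)      # the definition, with the (n-1) divisor
##  4.571429
var(z)                                      # var() uses the same divisor
##  4.571429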

For example, here are samples of $$n = 12$$ from two simulated populations of heights, normally-distributed: population 1 with known $$\mu = 165$$ and variance $$\sigma^2 = 12$$, and population 2 with known $$\mu = 150$$ and variance $$\sigma^2 = 48$$. Population 2 is more variable, perhaps all the children in grades 1–12, whereas population 1 is only the children in grade 12.

n <- 12
sort(sample.1 <- rnorm(n, mean=165, sd=sqrt(12)))
##   158.2938 160.2751 161.6946 163.0842 163.5987 163.9856 165.2202
##   165.6517 166.0777 166.6068 166.8974 170.4908
sort(sample.2 <- rnorm(n, mean=150, sd=sqrt(48)))
##   144.1384 144.1777 145.3618 145.9731 146.0455 148.9586 150.9867
##   151.4841 152.5590 153.6113 162.7390 169.2977

If we only had these samples, we would estimate the variance as:

1/(n-1) * sum((sample.1 - mean(sample.1))^2)
##  10.6684
(v.1 <- var(sample.1))
##  10.6684
1/(n-1) * sum((sample.2 - mean(sample.2))^2)
##  60.05138
(v.2 <- var(sample.2))
##  60.05138

# Confidence intervals

It is quite unlikely that with a small sample we would accurately estimate the true mean and variance.

You can run the above two code chunks several times to see the variability of the estimates of a population variance from a sample variance.
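One way to see this variability without repeatedly re-running the chunks by hand is to simulate many samples and summarize their variances; a quick sketch (the numbers will differ on every run):

# 1000 samples of size n from population 1: how variable is the sample variance s^2?
summary(replicate(1000, var(rnorm(n, mean=165, sd=sqrt(12)))))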

The confidence interval within which the true population variance (which we suppose we do not know) lies, with a probability we assign, is computed from the sample variance as:

$\frac{(n-1) s^2}{\chi^2_{\alpha/2, (n-1)}} \le \sigma^2 \le \frac{(n-1) s^2}{\chi^2_{1-\alpha/2, (n-1)}}$

where $$\alpha$$ is the probability of a Type I error; that is, the interval is constructed so that it covers the true population variance with probability $$1 - \alpha$$. For these two samples, with $$\alpha = 0.10$$:

c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
##   5.96450 25.65184
c(((n-1)*v.2)/qchisq(0.95, df=(n-1)), ((n-1)*v.2)/qchisq(0.05, df=(n-1)))
##   33.5736 144.3917
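For convenience, the interval formula can be wrapped in a small helper function (the name var.ci and its default alpha are ours, not part of the original code):

var.ci <- function(s2, n, alpha=0.10) {
  df <- n - 1
  c(lower = df*s2/qchisq(1 - alpha/2, df),
    upper = df*s2/qchisq(alpha/2, df))
}
var.ci(v.1, n)   # reproduces the first interval above
var.ci(v.2, n)   # ... and the second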

# Effect of sample size

The sample size does not affect the true population variance, but it does affect the confidence interval in which we think the true variance lies:

These should all give approximately the same estimate $$s^2$$ of the true $$\sigma^2 = 12$$:

(v.1 <- var(sample.1 <- rnorm(n, mean=165, sd=sqrt(12))))
##  17.22879
(v.2 <- var(sample.2 <- rnorm(2*n, mean=165, sd=sqrt(12))))
##  13.66656
(v.3 <- var(sample.3 <- rnorm(3*n, mean=165, sd=sqrt(12))))
##  10.4533
(v.4 <- var(sample.4 <- rnorm(4*n, mean=165, sd=sqrt(12))))
##  16.5499

But the confidence intervals will in general get narrower, provided each interval uses the degrees of freedom of its own sample:

c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
##   9.63229 41.42609
c(((2*n-1)*v.2)/qchisq(0.95, df=(2*n-1)), ((2*n-1)*v.2)/qchisq(0.05, df=(2*n-1)))
c(((3*n-1)*v.3)/qchisq(0.95, df=(3*n-1)), ((3*n-1)*v.3)/qchisq(0.05, df=(3*n-1)))
c(((4*n-1)*v.4)/qchisq(0.95, df=(4*n-1)), ((4*n-1)*v.4)/qchisq(0.05, df=(4*n-1)))
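The narrowing is easier to see if we hold the sample variance fixed and vary only the sample size; a small sketch (using the population value 12 as the fixed $$s^2$$), which prints the widths of the 90% intervals:

sapply(c(12, 24, 48, 96), function(m)
  (m-1)*12 * (1/qchisq(0.05, m-1) - 1/qchisq(0.95, m-1)))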

# Reducing variance by subpopulations

If there is some attribute of each sampling unit that affects its value, we may be able to reduce the variance by using this attribute to divide the population into subpopulations. Some of the total variance is then explained by the attribute. We can see this in a typical analysis of variance of an experiment.

This is one of the sample R datasets: npk, an N, P, K (nitrogen, phosphate, potassium) factorial experiment on the yield of peas.

data(npk)
summary(npk)
##  block N      P      K          yield
##  1:4   0:12   0:12   0:12   Min.   :44.20
##  2:4   1:12   1:12   1:12   1st Qu.:49.73
##  3:4                        Median :55.65
##  4:4                        Mean   :54.88
##  5:4                        3rd Qu.:58.62
##  6:4                        Max.   :69.50

The overall sample variance:

var(npk$yield)
##  38.10283

and the sample variance for two subsets, the two levels of Nitrogen:

by(npk$yield, npk$N, var)
## npk$N: 0
##  28.92242
## --------------------------------------------------------
## npk$N: 1
##  33.5397

In this case the sub-populations are less variable than the overall population. This is because the means are different:

mean(npk$yield)
##  54.875
by(npk$yield, npk$N, mean)
## npk$N: 0
##  52.06667
## --------------------------------------------------------
## npk$N: 1
##  57.68333

and the values in each subset are closer to that subset's mean than to the overall mean.
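We can quantify this by comparing the sum of squared deviations from the overall mean with the sum of squared deviations from the group means (a quick check; these reappear below as the total and residual sums of squares in the analysis of variance):

sum((npk$yield - mean(npk$yield))^2)   # total sum of squares, around the overall mean
sum(unlist(by(npk$yield, npk$N, function(y) sum((y - mean(y))^2))))   # within the two N groups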

We can see this graphically:

require(ggplot2)
## Loading required package: ggplot2
qplot(yield, data=npk, fill=N)
## stat_bin() using bins = 30. Pick better value with binwidth.

Notice that the yields for N=0 are lower, N=1 are higher, and both have a narrower range than the full dataset.
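To avoid the message about bins, a binwidth can be set explicitly; one possible variant (any reasonable binwidth works):

qplot(yield, data=npk, fill=N, binwidth=2)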

The analysis of variance shows how much of the variance is explained by the factor:

summary(aov(yield ~ N, data=npk))
##             Df Sum Sq Mean Sq F value Pr(>F)
## N            1  189.3  189.28   6.061 0.0221 *
## Residuals   22  687.1   31.23
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice how a portion of the total sum of squares (TSS) is accounted for by the factor; the remainder is the residual sum of squares (RSS); these are converted to variances (mean squares) by dividing by their respective degrees of freedom.
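For example, dividing each sum of squares by its degrees of freedom reproduces the Mean Sq column (a quick check with the rounded values from the table):

189.3 / 1     # mean square for N
687.1 / 22    # residual mean square, the pooled within-group variance (~ 31.23)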

The lm method shows this in a somewhat different way:

summary(lm(yield ~ N, data=npk))
##
## Call:
## lm(formula = yield ~ N, data = npk)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.8833 -3.7667  0.1667  3.5583 11.8167
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   52.067      1.613  32.274   <2e-16 ***
## N1             5.617      2.281   2.462   0.0221 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.588 on 22 degrees of freedom
## Multiple R-squared:  0.216,  Adjusted R-squared:  0.1803
## F-statistic: 6.061 on 1 and 22 DF,  p-value: 0.02213

The adjusted $$R^2$$ is the proportion of variance explained by the factor.
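We can verify this from the quantities already computed: one minus the ratio of the residual variance to the total variance (a quick check):

1 - (687.1/22) / var(npk$yield)   # ~ 0.18, the adjusted R-squared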

# Prediction variance of a linear model

Fit a linear model of tree volume based on its girth (circumference) and height:

data(trees)
summary(trees)
##      Girth           Height       Volume
##  Min.   : 8.30   Min.   :63   Min.   :10.20
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40
##  Median :12.90   Median :76   Median :24.20
##  Mean   :13.25   Mean   :76   Mean   :30.17
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30
##  Max.   :20.60   Max.   :87   Max.   :77.00
(m.v.gh <- lm(Volume ~ Girth*Height, data=trees))
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Coefficients:
##  (Intercept)         Girth        Height  Girth:Height
##      69.3963       -5.8558       -1.2971        0.1347
summary(m.v.gh)
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.5821 -1.0673  0.3026  1.5641  4.6649
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  69.39632   23.83575   2.911  0.00713 **
## Girth        -5.85585    1.92134  -3.048  0.00511 **
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth:Height  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16

The model is good but not perfect: each coefficient is estimated fairly precisely, but still with some uncertainty. So when we use this model to predict the volume of a newly-observed tree (from its girth and height), our prediction is not certain:

(volume <- predict(m.v.gh, newdata=data.frame(Girth=14, Height=70), se.fit=TRUE, interval="prediction"))
## $fit
##        fit      lwr      upr
## 1 28.57992 22.67534 34.48449
##
## $se.fit
##  0.9720972
##
## $df
##  27
##
## $residual.scale
##  2.70855

The se.fit here is the standard error of the fitted value; the prediction interval (lwr, upr) is based on the full prediction variance, which also includes the residual variance $$\sigma^2$$ of a new observation.

The prediction variance is:

$\sigma^2 \left[1 + \mathbf{X^*} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X^*}' \right]$

where $$\mathbf{X}$$ is the design (model) matrix and $$\mathbf{X^*}$$ is the corresponding row vector of the design matrix at the point to be predicted, here $$[1, 14, 70, 980]$$ (intercept, girth, height, and the girth-by-height interaction).

The design matrix here is:

(mm <- model.matrix(m.v.gh))
##    (Intercept) Girth Height Girth:Height
## 1            1   8.3     70        581.0
## 2            1   8.6     65        559.0
## 3            1   8.8     63        554.4
## 4            1  10.5     72        756.0
## 5            1  10.7     81        866.7
## 6            1  10.8     83        896.4
## 7            1  11.0     66        726.0
## 8            1  11.0     75        825.0
## 9            1  11.1     80        888.0
## 10           1  11.2     75        840.0
## 11           1  11.3     79        892.7
## 12           1  11.4     76        866.4
## 13           1  11.4     76        866.4
## 14           1  11.7     69        807.3
## 15           1  12.0     75        900.0
## 16           1  12.9     74        954.6
## 17           1  12.9     85       1096.5
## 18           1  13.3     86       1143.8
## 19           1  13.7     71        972.7
## 20           1  13.8     64        883.2
## 21           1  14.0     78       1092.0
## 22           1  14.2     80       1136.0
## 23           1  14.5     74       1073.0
## 24           1  16.0     72       1152.0
## 25           1  16.3     77       1255.1
## 26           1  17.3     81       1401.3
## 27           1  17.5     82       1435.0
## 28           1  17.9     80       1432.0
## 29           1  18.0     80       1440.0
## 30           1  18.0     80       1440.0
## 31           1  20.6     87       1792.2
## attr(,"assign")
##  0 1 2 3

The inverse of its cross-product, $$(\mathbf{X}'\mathbf{X})^{-1}$$, is:

(mm2.i <- solve(t(mm) %*% mm))
##              (Intercept)        Girth      Height  Girth:Height
## (Intercept)  77.44334157 -5.983063010 -1.00093404  7.662883e-02
## Girth        -5.98306301  0.503191000  0.07603970 -6.354863e-03
## Height       -1.00093404  0.076039705  0.01308607 -9.843500e-04
## Girth:Height  0.07662883 -0.006354863 -0.00098435  8.100241e-05

This is pre- and post-multiplied by the row vector $$\mathbf{X^*}$$ of predictor values. Note that an ordinary vector in R is interpreted as a column vector in matrix operations, so to get the row vector $$\mathbf{X^*}$$ we take its transpose.

# the intercept is set to 1, the others calculated from the values of predictors
(obs <- c(1, 14, 70, 14*70))
##    1  14  70 980
(tmp <- c(t(obs) %*% mm2.i %*% obs))
##  0.1288088

Completing the calculation:

# estimate of \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh))
##  7.336243
c(sig2*(1 + tmp))
##  8.281216
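As a cross-check (ours, not part of the original output), this prediction variance reproduces the 95% prediction interval reported by predict() above:

volume$fit[, "fit"] + c(-1, 1) * qt(0.975, df.residual(m.v.gh)) * sqrt(sig2*(1 + tmp))
# matches the (lwr, upr) limits above: 22.67534, 34.48449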

Here is some matrix algebra to compute the prediction and its standard error (the se.fit above) directly from the fitted linear model object, using its stored QR decomposition.

tm <- delete.response(terms(m.v.gh))  # the right-hand side of the formula
(Xp <- model.matrix(tm, data=data.frame(Girth=14, Height=70)))  # make a design matrix from the predictors
##   (Intercept) Girth Height Girth:Height
## 1           1    14     70          980
## attr(,"assign")
##  0 1 2 3
(pred <- c(Xp %*% coef(m.v.gh)))  # directly compute the prediction
##  28.57992
QR <- m.v.gh$qr   # the QR decomposition of the model matrix
# solve the triangular system R'B = Xp', so that crossprod(B) = Xp (X'X)^{-1} Xp'
(B <- forwardsolve(t(QR$qr), t(Xp), k=QR$rank))
##             [,1]
## [1,] -0.17960530
## [2,]  0.04372819
## [3,] -0.22774013
## [4,] -0.20681650
# estimate of \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh)) # sum of squared residuals divided by the residual d.f.
##  7.336243
sqrt(sig2) # we saw this in the summary() above, it is our estimate of \sigma
##  2.70855
(fit.var <- c(crossprod(B) * sig2))   # the variance of the fitted value
##  0.9449731
(fit.se <- sqrt(fit.var))  # this is the se.fit we saw in the results of predict.lm above
##  0.9720972
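Finally, a cross-check (ours): crossprod(B) reproduces the quadratic form computed earlier with the explicit inverse, and adding $$\sigma^2$$ gives back the full prediction variance:

c(crossprod(B))   # equals tmp = 0.1288088, the quadratic form X* (X'X)^{-1} X*'
fit.var + sig2    # the full prediction variance sig2*(1 + tmp), about 8.28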