This tutorial shows the main features of the S language as implemented in the R Environment for Statistical Computing (https://www.r-project.org/).

R is fully described in: Venables, W N, D M Smith, and R Development Core Team. 2019. An Introduction to R; Notes on R: A Programming Environment for Data Analysis and Graphics. Version 3.6.2 (2019-12-12). Vienna: R Foundation for Statistical Computing. https://cran.r-project.org/manuals.html.

A simple introduction to R for statistical analysis is: Dalgaard, Peter. 2008. Introductory Statistics with R. Second. Springer. http://link.springer.com/book/10.1007%2F978-0-387-79054-1.

You should work through the code examples. At the end of each section is an exercise. Please create a new R Markdown document and complete these exercises; you may want to use some of the code from this document.

1 R as an expression language

S code can be executed at the R console command line, or from a R script, or from an R Markdown code chunk, as a calculator.

The simplest use of S is to evaluate mathematical expressions.

2*pi/360
## [1] 0.01745329

The ratio of a circle’s radius squared to its area \(\pi\) is a built-in constant pi.

S has the usual scalar arithmetic operators +, -, *, /, ^ and some less-common ones like %% (modulus) and %/% (integer division). Expressions are evaluated in accordance with the usual operator precedence; parentheses may be used to change the precedence or make it explicit.

3 / 2^2 + 2 * pi
## [1] 7.033185
((3 / 2)^2 + 2) * pi
## [1] 13.35177

Exercise 1: Write an expression to compute the number of seconds in a 365-day year, and execute the expression.

2 Assignment and workspace objects

Results of S expressions can be saved as objects in the R workspace. There are two (equivalent) assignment operators, the left-hand side is the name of the workspace object, which is assigned the value of the expression on the right-hand side.

Object names are case-sensitive and may contain special characters, but no spaces.

rad.deg <- 2*pi/360
Rad2Deg = 2*pi/360

By default nothing is printed; but all of these give the same output:

(rad.deg <- 2*pi/360)
## [1] 0.01745329
rad.deg
## [1] 0.01745329
print(rad.deg)
## [1] 0.01745329

Workspace objects are listed with the ls “list” function and deleted with the rm “remove” function:

ls()
## [1] "rad.deg" "Rad2Deg"
rm(rad.deg)
ls()
## [1] "Rad2Deg"

The character(0) result means that the list of objects in the workspace is a zero-length character vector, i.e., null.

Exercise 2: Define a workspace object which contains the number of seconds in 365-day year, and display the results.

3 Functions and methods

Most work in S is done with functions or methods. These are parts of expressions containing:

  1. Method or function name; arguments to the function are between parentheses ( ) following the function name
  2. Argument list:
  • Required
  • Optional, with defaults
  • positional and/or named

Functions return the results. These can be directly printed, assigned to workspace objects, or directly used as arguments to other functions or as parts of expressions.

For example, the c “catenate”, “make a chain” function builds a vector; these can be saved as workspace objects. The arguments of this function are the items to be joined into one vector:

(primes.lt.20 <- c(2, 3, 5, 7, 11, 13, 17))
## [1]  2  3  5  7 11 13 17
(colours <- c("indigo", "violet", "red", "orange", "yellow", "green", "blue"))
## [1] "indigo" "violet" "red"    "orange" "yellow" "green"  "blue"

Base R defined many common mathematical functions:

sum(primes.lt.20); min(primes.lt.20); max(primes.lt.20); median(primes.lt.20)
## [1] 58
## [1] 2
## [1] 17
## [1] 7
sin(pi/4); cos(pi/4); tan(pi/4); atan(1)
## [1] 0.7071068
## [1] 0.7071068
## [1] 1
## [1] 0.7853982
log(128); log2(128); exp(4.85203)
## [1] 4.85203
## [1] 7
## [1] 128

Note that multiple expressions can be written on one line, separated by ;.

Also many vector manipulation functions:

rev(colours)  # reverse the order of elements in a vector
## [1] "blue"   "green"  "yellow" "orange" "red"    "violet" "indigo"
sort(colours) # sort in ascending order
## [1] "blue"   "green"  "indigo" "orange" "red"    "violet" "yellow"
sort(colours, decreasing=TRUE) # sort in decending order
## [1] "yellow" "violet" "red"    "orange" "indigo" "green"  "blue"

Exercise 3: Find the function name for base-10 logarithms, and compute the base-10 logarithm of 10, 100, and 1000 (use the ?? function at the console to search).

The results of one function can be used as part of an expression:

sqrt(2)
## [1] 1.414214
sqrt(2)/2
## [1] 0.7071068

The results of one function can be used as an argument to another:

asin(sqrt(2)/2)
## [1] 0.7853982
rev(sort(colours)) # equivalent to `sort(colours, decreasing=TRUE)`
## [1] "yellow" "violet" "red"    "orange" "indigo" "green"  "blue"

An example of a function with required and optional arguments is rnorm “random sample from a normal distribution”. The only required argument is the number of items to be returned.

rnorm(12)
##  [1] -1.53450983  0.42906541  1.27942366 -1.54190547 -0.02263238  2.17587828
##  [7] -1.14696705  0.90661266  1.07239549  1.24777960 -0.11737334  1.63321551

Optional arguments are the parameters of this distribution, i.e., its mean and standard deviation:

rnorm(n=12, mean=180)
##  [1] 180.0752 179.9373 180.6208 179.9303 181.4353 180.0183 181.0947 180.9020
##  [9] 181.3579 178.6629 181.7138 178.5944
rnorm(n=12, mean=180, sd=10)
##  [1] 176.7439 182.9745 168.1655 176.9599 179.6355 166.3144 186.4413 179.3060
##  [9] 162.9089 192.9902 188.9649 166.6695

How do we know all this? From the help system.

help(rnorm)

(You can also access the help for this function by enterung ?rnorm at the console).

This shows:

  1. Title and package where found
  2. Description
  3. Usage: how to call the function
  4. Arguments (what each one means, defaults)
  5. Details of the algorithm
  6. Value returned
  7. Source of code
  8. References to the statistical or numerical methods
  9. See Also (related commands)
  10. Examples of use and output

Exercise 4: What are the arguments of the rbinom (random numbers following the binomial distribution) function? Are any default or must all be specified? What is the value returned?

Exercise 5: Display the vector of the number of successes in 24 trials with probability of success 0.2 (20%), this simulation carried out 128 times.

Since S is an expression language, the results of one function can be passed as an argument to another.

Here, the results of the rnorm function are passed to the sort function.

sort(rnorm(n=12, mean=180, sd=10), decreasing=TRUE)
##  [1] 203.0544 191.6411 182.8871 182.3158 180.4643 179.9218 177.4824 176.4939
##  [9] 173.5188 173.1279 172.4832 162.0565

Note the use here of the optional argument decreasing of the sort function.

4 Including computations in the text

An important feature of R Markdown is the ability to include results of computations in the text, using the syntax `r ` followed by any R expression, which can include any workspace object defined in previous code chunks.

For example, suppose we compute the sample mean of a simulated normal distribution, rounded to two decimal places:

print(my.mean <- round(mean(rnorm(n=12, mean=180, sd=10)), 2))
## [1] 178.12

We then could report it like this: the sample mean is 178.12. Notice how in the compiled document the in-line expression beginning with `r ` is replaced by its value from the computation.

Exercise 6: Summarize the result of rbinom (previous exercise) with the table function. What is the range of results, i.e., the minimum and maximum values? Which is the most likely result? For these, write text which includes the computed results. This is necessary because the results change with each random sampling.

(Hint: convert the result of table to a data.frame; see ?table; you may want to come back here after reading about data.frame in a later section.)

5 Vectorized operations

A major feature of S is that most operations are vectorized, that is, they work element-wise on vectors.

Above we built a vector with the c function:

(primes.lt.20 <- c(2, 3, 5, 7, 11, 13, 17))
## [1]  2  3  5  7 11 13 17

Elements can be selected from a vector with the [] matrix selection operator:

primes.lt.20[1]
## [1] 2
primes.lt.20[1:3]
## [1] 2 3 5
primes.lt.20[c(1,3,5)]
## [1]  2  5 11

For example, create a sequence of integers 1..10 and then add a normally-distributed error to it. These are added in parallel:

(s <- seq(1, 10))
##  [1]  1  2  3  4  5  6  7  8  9 10
(r <- rnorm(10, 0, 0.1))
##  [1]  0.0523664118  0.0008837435 -0.1131221806  0.0720411113  0.1501412454
##  [6] -0.0666157339 -0.0508666871 -0.0980716909  0.0314177778  0.0593004570
s[1] + r[1]
## [1] 1.052366
s + r
##  [1]  1.052366  2.000884  2.886878  4.072041  5.150141  5.933384  6.949133
##  [8]  7.901928  9.031418 10.059300

If one of the arguments is shorter than the other, it is recycled:

primes.lt.20/2
## [1] 1.0 1.5 2.5 3.5 5.5 6.5 8.5

Many functions operate element-wise on vectors:

seq(0, 2*pi, by=pi/6)
##  [1] 0.0000000 0.5235988 1.0471976 1.5707963 2.0943951 2.6179939 3.1415927
##  [8] 3.6651914 4.1887902 4.7123890 5.2359878 5.7595865 6.2831853
round(sin(seq(0, 2*pi, by=pi/6)),4)
##  [1]  0.000  0.500  0.866  1.000  0.866  0.500  0.000 -0.500 -0.866 -1.000
## [11] -0.866 -0.500  0.000

Others summarize a vector:

sum(primes.lt.20)
## [1] 58

Exercise 7: Create and display a vector representing latitudes in degrees from \(0^\circ\) (equator) to \(+90^\circ\) (north pole), in intervals of \(5^\circ\). Compute and display their cosines – recall, the trig functions in R expect arguments in radians. Find and display the maximum cosine.

6 Packages

R is built from a set of packages, several of which are loaded by default in all R installations and required for normal operation. The search function shows the loaded packages and the order in which they are searched for function and object names:

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

There are thousands of contributed packages; these must first be installed on each R system from a CRAN repository1 using the install.packages function, and then loaded into the search space with the library or require function. Installation only needs to be done once per system, but loading one time each session. It is common practice to also load any packages on which the installed package depends.

Although packages are usually installed via the RStudio “Packages” tab, they can also be installed directly with install.packages.

Here we load the MASS package that implements statistical methods from the textbook: Venables, W N, and B D Ripley. 2002. Modern Applied Statistics with S. Fourth edition. New York: Springer-Verlag.

install.packages("MASS", dependencies=TRUE)

Once installed, they must be loaded into the workspace. Then their functions are available for use.

library(MASS)
search()
##  [1] ".GlobalEnv"        "package:MASS"      "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"
help(package=MASS)

Exercise 8: Check if the gstat package is installed on your system. If not, install it. Load it into the workspace. Display its help and find the variogram function. What is its description?

7 Classes

7.1 Fundamental classes

All objects in S have a class. This controls what can be done with the object. For example, it does not make sense to use arithmetic operations on character strings.

Some basic types are logical (TRUE/FALSE), numeric, and character.

The class function returns the data type of an object:

class(TRUE); class(316); class(3.1415926); class("blue")
## [1] "logical"
## [1] "numeric"
## [1] "numeric"
## [1] "character"

Note both integers and real numbers are class numeric.

The type list can combine any objects, each with its own class:

class(l <- list(1,FALSE,"red"))
## [1] "list"

Elements of a list are extracted with the [[]] operator:

class(l[[2]])
## [1] "logical"

Functions are also a class:

class(sort)
## [1] "function"

The classes logical, numeric, and character are all vectors with one or more elements.

class(c(TRUE, FALSE, TRUE))
## [1] "logical"
class(primes.lt.20)
## [1] "numeric"
class(colours)
## [1] "character"

Exercise 9: Display the classes of the built-in constant pi and of the built-in constant letters.

7.2 Derived classes

These are basic classes with some additional attributes.

Examples:

  1. an array is a vector with a dim ``dimensions’’ attribute
  2. a matrix is a 2-D array
primes.lt.20
## [1]  2  3  5  7 11 13 17
dim(primes.lt.20)
## NULL
my.a <- array(primes.lt.20)
class(my.a)
## [1] "array"
dim(my.a)
## [1] 7
(my.matrix <- matrix(data=rep(primes.lt.20,2), nrow=length(primes.lt.20), byrow=FALSE))
##      [,1] [,2]
## [1,]    2    2
## [2,]    3    3
## [3,]    5    5
## [4,]    7    7
## [5,]   11   11
## [6,]   13   13
## [7,]   17   17
class(my.matrix); dim(my.matrix)
## [1] "matrix" "array"
## [1] 7 2
my.matrix[,2] <- my.matrix[,2]^2
my.matrix
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    9
## [3,]    5   25
## [4,]    7   49
## [5,]   11  121
## [6,]   13  169
## [7,]   17  289

Certain operations and functions are restricted to work on certain classes. For example, a matrix may be transposed with the t “transpose” function:

t(my.matrix)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    2    3    5    7   11   13   17
## [2,]    4    9   25   49  121  169  289

Some functions automtically “promote” the class of their arguments, if possible:

class(t(primes.lt.20))
## [1] "matrix" "array"

7.3 Classes defined by functions

Functions can define their own classes. For example, the rlm “robust linear models” function of the MASS package returns an object with several classes defined by both rlm and the function lm “linear models” on which it depends:

data(birthwt, package="MASS")
class(model <- rlm(low ~ ., birthwt))
## [1] "rlm" "lm"

Model formulas and built-in datasets will be explained below.

Exercise 10: What is the class of the object returned by the variogram function? (Hint: see the heading “Value” in the help text.)

8 Example datasets

R has many example datasets in the datasets package, which is loaded by default in all R installations. There are also datasets in most contributed packages. These are intended to show the features of various functions,

(You can also load your own data, of course; see below.)

The data function with the optional package argument lists the datasets available in the named packages:

data(package="datasets")
data(package="ggplot2")

These can be loaded into the workspace with the data function. If the package which provides the dataset is loaded, there is no need to specify it.

data(diamonds, package="ggplot2")
data("birthwt") # in MASS, this was loaded above, so no need to specify

A dataset (actually, any S object) can be summarized with the summary function, which gives output appropriate to the object class. An object’s internal structure is revealed with the str “structure” function:

is(diamonds)
## [1] "tbl_df"
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Exercise 11: List the datasets in the gstat package.

Exercise 12: Load, summarize, and show the structure of the oxford dataset.

9 Data frames

The “data frame” is the most common class for statistical and database work. An object of class data.frame is a matrix with column (field) colnames and (optional) row.names.

Rows are generally observations or individuals (database “cases”) and columns are attributes (database “fields”).

(m.d <- as.data.frame(my.matrix))
##   V1  V2
## 1  2   4
## 2  3   9
## 3  5  25
## 4  7  49
## 5 11 121
## 6 13 169
## 7 17 289
class(m.d)
## [1] "data.frame"
str(m.d)
## 'data.frame':    7 obs. of  2 variables:
##  $ V1: num  2 3 5 7 11 13 17
##  $ V2: num  4 9 25 49 121 169 289
colnames(m.d)
## [1] "V1" "V2"
colnames(m.d) <- c("primes", "primes.sq")
rownames(m.d)
## [1] "1" "2" "3" "4" "5" "6" "7"
rownames(m.d) <- letters[1:length(rownames(m.d))]
rownames(m.d)
## [1] "a" "b" "c" "d" "e" "f" "g"
str(m.d)
## 'data.frame':    7 obs. of  2 variables:
##  $ primes   : num  2 3 5 7 11 13 17
##  $ primes.sq: num  4 9 25 49 121 169 289

There are a variety of ways to extract parts of a dataframe. We illustrate this with the trees sample dataset.

The most common way to extract a field as a vector is with the $ operator:

data(trees)
str(trees)
## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
# fields
trees$Volume 
##  [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0 21.4 21.3 19.1
## [16] 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3 42.6 55.4 55.7 58.3 51.5 51.0
## [31] 77.0

However, since a dataframe is just a matrix with named rows and columns, matrix selection operator [] also can be used:

trees[,1]
##  [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
trees[,"Volume"]
##  [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0 21.4 21.3 19.1
## [16] 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3 42.6 55.4 55.7 58.3 51.5 51.0
## [31] 77.0
# cases (records, tuples)
trees[1,]
##   Girth Height Volume
## 1   8.3     70   10.3
# subsets of cases and selected fields
trees[1,1]
## [1] 8.3
trees[1:3, c(1,3)]
##   Girth Volume
## 1   8.3   10.3
## 2   8.6   10.3
## 3   8.8   10.2

Exercise 13: load the women sample dataset. How many observations (cases) and how many attributes (fields) for each case? What are the column (field) and row names? What is the height of the first-listed woman?

9.1 Factors

R makes a strong distinction between continuous numeric variables and categorical variables. The latter are called R factors and may be ordered (a natural ordering of class levels) or unordered.

is.factor(diamonds$color)
## [1] TRUE
is.ordered(diamonds$color)
## [1] TRUE
levels(diamonds$color)
## [1] "D" "E" "F" "G" "H" "I" "J"
is.factor(diamonds$carat)
## [1] FALSE

The identification as a factor or not has major implications for statistical models (see below).

Exercise 14: List the factors in the oxford dataset.

10 Missing values

S has a special NA value for missing values. For example, in a data frame there may be no value for a certain field for one or more cases (observations). The missing value is handled correctly by functions.

Suppose there was no volume measurement for the first tree in the trees dataset:

trees[1,]
##   Girth Height Volume
## 1   8.3     70   10.3
trees[1,"Volume"] <- NA
trees[1,]
##   Girth Height Volume
## 1   8.3     70     NA
summary(trees$Volume)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.20   19.75   24.55   30.83   37.80   77.00       1

This is different from “not a number”, NaN, which is the result of an illegal mathematical operation, and Inf, which is the result of dividing by zero. If these are used in further mathematical operations, the error wil propagate.

(x <- sqrt(-1))
## [1] NaN
x^2
## [1] NaN
(x <- 100/0)
## [1] Inf
x+1
## [1] Inf

11 Logical expressions

The usual logical S operators <, >, <=, >=, ==, != return the logical values TRUE and FALSE. The which function then selects the indices of the TRUE elements of a vector.

trees$Height > 80
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE
(ix <- which(trees$Height > 80))
## [1]  5  6 17 18 26 27 31
dim(trees)
## [1] 31  3
(tall.trees <- trees[ix, ])
##    Girth Height Volume
## 5   10.7     81   18.8
## 6   10.8     83   19.7
## 17  12.9     85   33.8
## 18  13.3     86   27.4
## 26  17.3     81   55.4
## 27  17.5     82   55.7
## 31  20.6     87   77.0
dim(tall.trees)
## [1] 7 3

Exercise 15: Identify the thin trees, defined as those with height/girth ratio more than 1 s.d. above the mean. You will have to define a new field in the dataframe with this ratio, and then use the mean and sd summary functions, along with a logical expression.

12 Combination: functions, logical expressions, simulation

Here is an example of using logical expressions, along with simulation: ““What are the chances that three people on an elevator with 13 buttons press three consecutive floors?”

mean(replicate(2^12, all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1)))
## [1] 0.02978516

Let’s break this down “inside-out”, but only using a small number of simulations:

sample.int(n=13, size=3, replace=TRUE) # three floors selected, can be repeats
## [1] 2 4 2
sort(sample.int(n=13, size=3, replace=TRUE)) # same, but in order
## [1] 3 7 8
diff(sort(sample.int(n=13, size=3, replace=TRUE))) # delta between each choice
## [1] 3 2
diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1 # are they consecutive 1 by 1?
## [1] FALSE FALSE
all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1) # are they all consecutive?
## [1] FALSE
replicate(12, all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1)) # repeat the test many times
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mean(replicate(2^12, all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1))) # what is the average of the replicated tests?
## [1] 0.02954102

Challenge: run this several times and see how variable is the simulation result.

13 Graphics

R has several graphics systems, beginning with base graphics which are installed by default. The plot method is generic, and changes its behaviour according to its arguments.

The simplest plot is the scatterplot of two variables, which uses the ~ formula operator (see ‘Statistical Models’, below) to symbolize “y-axis” vs. “x-axis”.

plot(Height ~ Girth, data=trees)

# plot(trees$Height ~ trees$Girth)

Base graphics also has functions for histograms, boxplots, dot charts, and conditioning plots. See help(package=graphics).

hist(trees$Volume); rug(trees$Volume)
boxplot(diamonds$carat)
dotchart(trees$Volume)

Another graphics system, ggplot2, is becoming ever more popular. This views graphics as being built in layers, each with functions, starting with ggplot2 to initialize the graphic, then each layer added with a + operator.

For example, here is a histogram of the tree volumes with a “rug” plot to show the actual values:

library(ggplot2)
g.h <- ggplot(data=trees) +
        geom_histogram(mapping = aes(x=Volume), binwidth = 5,
                        colour="blue") +
        geom_rug(mapping = aes(x=Volume))
print(g.h)

See here for more on ggplot2.

Exercise 16: Display a histogram of the diamond prices in the diamonds dataset.

14 Statistical models

S specifies statistical models in symbolic form with model formulae, which are arguments to many statistical methods, for example lm (linear models, in the base packages) and rlm (robust linear models, in the MASS package).

These have a:

  • Left-hand side: (mathematically) dependent variable(s)
  • Formula operator ~, read as “depends on”, “is modelled by”
  • Right-hand side: (mathematically) indepdendent variable(s)

The simplest use is in simple linear regression with the lm function. For example, to model the linear relation of tree Volume from its Height, the formula is Volume ~ Height.

model.vh <- lm(Volume ~ Height, data=trees)
# equivalent to: model <- lm(trees$Volume ~ trees$Height)
summary(model.vh)
## 
## Call:
## lm(formula = Volume ~ Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.249  -9.845  -2.434  11.806  30.100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -82.5245    29.9844  -2.752 0.010267
## Height        1.4876     0.3922   3.793 0.000729
## 
## Residual standard error: 13.48 on 28 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3395, Adjusted R-squared:  0.3159 
## F-statistic: 14.39 on 1 and 28 DF,  p-value: 0.0007292

Exercise 17: Write a model to predict tree height from tree girth. How much of the height can be predicted from the girth?

The right-hand side can have formula operators, which look like arithmetic opertors but are interpreted in terms of the model structure:

  • + additive effect
  • * interaction effect
  • / nested model
  • - remove terms from the model

For arithmetic within model formulas, the I “identity” function must be used.

To illustrate the use of opertors, here is a model of tree volume Volume modelled by the square root of tree height sqrt(Height), interacting with the tree radius I(Girth/(2*pi)), with no intercept (-1):

model <- lm(Volume ~ sqrt(Height)*I(Girth/(2*pi)) - 1, data=trees)
summary(model)
## 
## Call:
## lm(formula = Volume ~ sqrt(Height) * I(Girth/(2 * pi)) - 1, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9522 -1.6022  0.1426  2.0223  4.8618 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## sqrt(Height)                    -3.6253     0.3557 -10.191 9.44e-11
## I(Girth/(2 * pi))              -19.8235     8.0063  -2.476   0.0198
## sqrt(Height):I(Girth/(2 * pi))   5.6003     0.8383   6.681 3.60e-07
## 
## Residual standard error: 3.175 on 27 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9925, Adjusted R-squared:  0.9916 
## F-statistic:  1188 on 3 and 27 DF,  p-value: < 2.2e-16

This model may or may not make sense physically!

Exercise 18: Write a model to predict tree volume as a linear function of tree height and tree girth, with no interaction.

Variables in statistical models may be categorical, i.e., R factors; these are specified the same way in models but interpreted differently. For example in the linear model lm function, a factor is converted into contrasts, typically so-called “dummy” variables:

summary(m.c <- lm(price ~ carat, data=diamonds))
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36      13.06  -172.8   <2e-16
## carat        7756.43      14.07   551.4   <2e-16
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
head(model.matrix(m.c))
##   (Intercept) carat
## 1           1  0.23
## 2           1  0.21
## 3           1  0.23
## 4           1  0.29
## 5           1  0.31
## 6           1  0.24
summary(m.f <- lm(price ~ clarity - 1, data=diamonds))
## 
## Call:
## lm(formula = price ~ clarity - 1, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4737  -2727  -1429   1262  16254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## clarityI1    3924.17     144.56   27.14   <2e-16
## claritySI2   5063.03      41.04  123.37   <2e-16
## claritySI1   3996.00      34.43  116.07   <2e-16
## clarityVS2   3924.99      35.54  110.43   <2e-16
## clarityVS1   3839.46      43.53   88.19   <2e-16
## clarityVVS2  3283.74      55.29   59.39   <2e-16
## clarityVVS1  2523.11      65.09   38.76   <2e-16
## clarityIF    2864.84      93.01   30.80   <2e-16
## 
## Residual standard error: 3935 on 53932 degrees of freedom
## Multiple R-squared:  0.5066, Adjusted R-squared:  0.5066 
## F-statistic:  6923 on 8 and 53932 DF,  p-value: < 2.2e-16
head(model.matrix(m.f))
##   clarityI1 claritySI2 claritySI1 clarityVS2 clarityVS1 clarityVVS2 clarityVVS1
## 1         0          1          0          0          0           0           0
## 2         0          0          1          0          0           0           0
## 3         0          0          0          0          1           0           0
## 4         0          0          0          1          0           0           0
## 5         0          1          0          0          0           0           0
## 6         0          0          0          0          0           1           0
##   clarityIF
## 1         0
## 2         0
## 3         0
## 4         0
## 5         0
## 6         0

15 Control structures

Within an expression the ifelse function can be used to select two outcomes, based on a logical expression.

For example, to select a plotting colour based on whether one or the other of two variables in a scatterplot is greater:

# two vectors of 100 random uniform variates
n <- 128
x <- runif(n); y <- runif(n)
plot(y ~ x, asp=1, col=ifelse(y > x, "red", "green"), pch=20)
abline(0, 1, lty=2); grid()

S has ALGOL-like control structures:

  • ifelse
  • for
  • while, repeat
  • break, next

Note that vectorized functions or the apply family of functions can often be used where other programming languages would use for loops.

Here is an example of the while control structure. For some simulation we want to draw a sample from the normal distribution but make sure that there is an extreme value, so we repeat the sampling until we get what we want:

n <- 1
while (max(abs(sample <- rnorm(80))) < 3) {
  print("No extreme")
  n <- n + 1
}
## [1] "No extreme"
## [1] "No extreme"
print(paste("Sample required",n,"draws"))
## [1] "Sample required 3 draws"

The for loop works as in other languages:

v <- vector(mode="numeric", length=11)
for (power in 0:10) {
  v[power+1] <- 2^power
}
print(v)
##  [1]    1    2    4    8   16   32   64  128  256  512 1024

But of course this is much easier with a vectorized operation:

0:10
##  [1]  0  1  2  3  4  5  6  7  8  9 10
(v <- 2^(0:10))
##  [1]    1    2    4    8   16   32   64  128  256  512 1024

15.1 The apply family of functions

These apply a function across either the rows or columns of a matrix, including data frames.

For example, to find the maximum value of all the variables in a data frame we can use the sapply “apply a function and simplify the result” function:

sapply(trees, FUN=max)
##  Girth Height Volume 
##   20.6   87.0     NA

This applies the max built-in function along the fields of the data frame.

The by function performs a function on subsets defined by a factor (categorical varibable):

by(diamonds[,c(1,3:5)], diamonds$cut, summary)
## diamonds$cut: Fair
##      carat       color      clarity        depth      
##  Min.   :0.220   D:163   SI2    :466   Min.   :43.00  
##  1st Qu.:0.700   E:224   SI1    :408   1st Qu.:64.40  
##  Median :1.000   F:312   VS2    :261   Median :65.00  
##  Mean   :1.046   G:314   I1     :210   Mean   :64.04  
##  3rd Qu.:1.200   H:303   VS1    :170   3rd Qu.:65.90  
##  Max.   :5.010   I:175   VVS2   : 69   Max.   :79.00  
##                  J:119   (Other): 26                  
## ------------------------------------------------------------ 
## diamonds$cut: Good
##      carat        color      clarity         depth      
##  Min.   :0.2300   D:662   SI1    :1560   Min.   :54.30  
##  1st Qu.:0.5000   E:933   SI2    :1081   1st Qu.:61.30  
##  Median :0.8200   F:909   VS2    : 978   Median :63.40  
##  Mean   :0.8492   G:871   VS1    : 648   Mean   :62.37  
##  3rd Qu.:1.0100   H:702   VVS2   : 286   3rd Qu.:63.80  
##  Max.   :3.0100   I:522   VVS1   : 186   Max.   :67.00  
##                   J:307   (Other): 167                  
## ------------------------------------------------------------ 
## diamonds$cut: Very Good
##      carat        color       clarity         depth      
##  Min.   :0.2000   D:1513   SI1    :3240   Min.   :56.80  
##  1st Qu.:0.4100   E:2400   VS2    :2591   1st Qu.:60.90  
##  Median :0.7100   F:2164   SI2    :2100   Median :62.10  
##  Mean   :0.8064   G:2299   VS1    :1775   Mean   :61.82  
##  3rd Qu.:1.0200   H:1824   VVS2   :1235   3rd Qu.:62.90  
##  Max.   :4.0000   I:1204   VVS1   : 789   Max.   :64.90  
##                   J: 678   (Other): 352                  
## ------------------------------------------------------------ 
## diamonds$cut: Premium
##      carat       color       clarity         depth      
##  Min.   :0.200   D:1603   SI1    :3575   Min.   :58.00  
##  1st Qu.:0.410   E:2337   VS2    :3357   1st Qu.:60.50  
##  Median :0.860   F:2331   SI2    :2949   Median :61.40  
##  Mean   :0.892   G:2924   VS1    :1989   Mean   :61.26  
##  3rd Qu.:1.200   H:2360   VVS2   : 870   3rd Qu.:62.20  
##  Max.   :4.010   I:1428   VVS1   : 616   Max.   :63.00  
##                  J: 808   (Other): 435                  
## ------------------------------------------------------------ 
## diamonds$cut: Ideal
##      carat        color       clarity         depth      
##  Min.   :0.2000   D:2834   VS2    :5071   Min.   :43.00  
##  1st Qu.:0.3500   E:3903   SI1    :4282   1st Qu.:61.30  
##  Median :0.5400   F:3826   VS1    :3589   Median :61.80  
##  Mean   :0.7028   G:4884   VVS2   :2606   Mean   :61.71  
##  3rd Qu.:1.0100   H:3115   SI2    :2598   3rd Qu.:62.20  
##  Max.   :3.5000   I:2093   VVS1   :2047   Max.   :66.70  
##                   J: 896   (Other):1358

16 User-defined functions

A function is defined with the function function (!), specifying the arguments, body of the function, and the return values:

For example, here is a function to compute the geometric mean of a vector \(\overline{v_h} = \left[ \prod_{i=1\ldots n} \; v_i \right]^{1/n}\):

hm <- function(v) {
  return(exp(sum(log(v))/length(v)))
}
class(hm)
## [1] "function"
hm(1:99); mean(1:99)
## [1] 37.6231
## [1] 50

Of course, a function should check for valid inputs. This example shows the use of the if, else if, else control structure, as well as the functions is.numeric, any, and length, and the return function to return a value from a function.

g.mean <- function(v) { 
    if (!is.numeric(v)) { 
      print("Argument must be numeric"); return(NULL) 
      } 
    else if (any(v <= 0)) { 
      print("All elements must be positive"); return(NULL) 
      } 
    else return(exp(sum(log(v))/length(v))) 
} 
g.mean(letters) 
## [1] "Argument must be numeric"
## NULL
g.mean(c(-1, -2, 1, 2)) 
## [1] "All elements must be positive"
## NULL
g.mean(1:99) 
## [1] 37.6231

Exercise 19: Write a function to restrict the values of a vector to the range \(0 \ldots 1\). Any values \(< 0\) should be replaced with \(0\), and any values \(>1\) should be replaced with \(1\). Test the function on a vector with elements from \(-1.2\) to \(+1.2\) in increments of \(0.1\) – see the seq “sequence” function.

17 Import/Export

R can exchange data in any conceivable format.

Reference: “R Data Import/Export”, an R manual installed with R; available under the Help menu as R Help.`

The most common interchange format for “flat” files is the Comma-Separated Values (CSV). Here we first download a sample CSV file and then import to R:

download.file(
  url="https://data.ny.gov/api/views/vpv5-zd4k/rows.csv?accessType=DOWNLOAD",
  destfile="enplanements.csv")
file.show("enplanements.csv")  # examine in another window
airports <- read.csv("enplanements.csv")
str(airports)
## 'data.frame':    378 obs. of  4 variables:
##  $ Airport.Group: chr  "UPSTATE NON-PRIMARY" "UPSTATE HUB" "OTHER UPSTATE PRIMARY" "UPSTATE HUB" ...
##  $ Airport.Name : chr  "ADIRONDACK REGIONAL" "ALBANY INTERNATIONAL" "BINGHAMTON REGIONAL AIRPORT" "BUFFALO-NIAGARA INTERNATIONAL" ...
##  $ Year         : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Enplanements : int  5770 1216626 108172 2582597 3483 139698 152582 1190967 121733 23664832 ...

This file is described here

The more general read.table function can read tables in many formats.

18 The Tidyverse

In the past few years “an opinionated collection of R packages designed for data science” called the tidyverse has been developed by Hadley Wickham and colleagues. The ggplot2 package is part of this; other important packages are dplyr for data manipulation,readr for data import, and stringr for string manipulation.

The complete tidyverse can be installed with:

install.packages("tidyverse")

The dplyr package provides a flexible grammar of data manipulation, including the “pipe” operator %>%, which works from left to right, unlike the conventional <- assignment operator.

library(dplyr)

The typical use is to do some data manipulation by filtering, as well as to summarize. For example, to subset the airports to just the upstate hubs and summarize the enplanements by year:

levels(airports$Airport.Group)
## NULL
airports %>% filter(Airport.Group=="UPSTATE HUB") %>% 
  group_by(Year) %>% 
  summarize(Enplanements=sum(Enplanements))
## # A tibble: 21 × 2
##     Year Enplanements
##    <int>        <int>
##  1  1997      4890591
##  2  1998      5059778
##  3  1999      5283594
##  4  2000      5826243
##  5  2001      5736766
##  6  2002      5580027
##  7  2003      5632693
##  8  2004      6237753
##  9  2005      6643091
## 10  2006      6511005
## # … with 11 more rows

Another typical use is to define new variables, possibly as combinations of others. For example, load the Meuse River geochemical dataset, select only the most frequently-flooded areas, rename the columns for the heavy metals, compute new columns with their base-10 logarithms, compute a ratio, and finally save as a new object:

library(sp)
data(meuse)
names(meuse)
##  [1] "x"       "y"       "cadmium" "copper"  "lead"    "zinc"    "elev"   
##  [8] "dist"    "om"      "ffreq"   "soil"    "lime"    "landuse" "dist.m"
meuse %>% 
  filter(ffreq==1) %>%
  select(-ffreq) %>%  # column is no longer  needed, it is constant
  rename(Cd=cadmium, Cu=copper, Pb=lead, Zn=zinc) %>%
  mutate(lCd=log10(Cd), lCu=log10(Cu), lPb=log10(Pb), lZn=log10(Zn)) %>%
  mutate(ratio.lCu.lPb=lCu/lPb) ->
  meuse.new
meuse.new[1:6,]
##        x      y   Cd Cu  Pb   Zn  elev       dist   om soil lime landuse dist.m
## 1 181072 333611 11.7 85 299 1022 7.909 0.00135803 13.6    1    1      Ah     50
## 2 181025 333558  8.6 81 277 1141 6.983 0.01222430 14.0    1    1      Ah     30
## 3 181165 333537  6.5 68 199  640 7.800 0.10302900 13.0    1    1      Ah    150
## 4 181298 333484  2.6 81 116  257 7.655 0.19009400  8.0    2    0      Ga    270
## 5 181307 333330  2.8 48 117  269 7.480 0.27709000  8.7    2    0      Ah    380
## 6 181390 333260  3.0 61 137  281 7.791 0.36406700  7.8    2    0      Ga    470
##         lCd      lCu      lPb      lZn ratio.lCu.lPb
## 1 1.0681859 1.929419 2.475671 3.009451     0.7793519
## 2 0.9344985 1.908485 2.442480 3.057286     0.7813719
## 3 0.8129134 1.832509 2.298853 2.806180     0.7971405
## 4 0.4149733 1.908485 2.064458 2.409933     0.9244485
## 5 0.4471580 1.681241 2.068186 2.429752     0.8129063
## 6 0.4771213 1.785330 2.136721 2.448706     0.8355467

Note the use here of the right-hand assignment operator -> to save the result into a new object (we could also overwrite the old object meuse if we had no use for its original form).

Bonus Exercise : Use tidyverse functions and pipes on the trees dataset, to select the trees (use the filter function) with a volume greater than the median volume (use the median function), compute the ratio of girth to height as a new variable (use the mutate function), and sort by this (use the arrange function) from thin to thick trees.

names(trees)
## [1] "Girth"  "Height" "Volume"
mv <- median(trees$Volume)
trees %>% 
  filter(Volume > median(Volume)) %>%
  mutate(thickness=round(Girth/Height,3)) %>%
  arrange(thickness)
## [1] Girth     Height    Volume    thickness
## <0 rows> (or 0-length row.names)