3 Simple Statistics in R

Author: Nathaniel E. Helwig

3.1 Chapter Outline and Goals

In this chapter, we will cover how to…

  • Load, explore, and summarize data
  • Calculate descriptive statistics
  • Create reproducible plots and tables
  • Perform one and two sample t-tests
  • Fit one-way analysis of variance models
  • Conduct simple correlation tests

R has many helpful functions for simple descriptive and inferential statistics, which make reproducible research easy!

3.2 Minnesota Beer Data

3.2.1 Overview

The Minnesota beer data has 44 beers measured on 7 variables:

  1. Brewery: Name of the brewery (factor with 8 levels)
  2. Beer: Name of the beer (factor with 44 levels)
  3. Description: Description of the beer (factor with 37 levels)
  4. Style: Style of the beer (factor with 3 levels)
  5. ABV: Alcohol by volume (numeric)
  6. IBU: International bitterness units (integer)
  7. Rating: Beer Advocate rating (integer)

Data obtained by NEH from Beer Advocate and the websites of the eight breweries.

3.2.2 Load the Data

Use the read.csv function to load the beer data into R

beer <- read.csv("http://users.stat.umn.edu/~helwig/notes/MNbeer.csv")

The dim function returns the number of rows and columns of the data frame

dim(beer)
## [1] 44  7

The beer data frame has 44 beers (rows) measured on 7 variables (columns).

The names function returns the names of the variables in a data frame

names(beer)
## [1] "Brewery"     "Beer"        "Description" "Style"       "ABV"        
## [6] "IBU"         "Rating"

3.2.3 Look at the Data

The head function returns the first six lines of a data frame

head(beer)
##       Brewery         Beer              Description Style ABV IBU Rating
## 1     Bauhaus  Wonderstuff     New Bohemian Pilsner Lager 5.4  48     88
## 2     Bauhaus    Stargazer German Style Schwarzbier Lager 5.0  28     87
## 3     Bauhaus  Wagon Party    West Cost Style Lager Lager 5.4  55     86
## 4     Bauhaus    Sky-Five!        Midwest Coast IPA   IPA 6.7  70     86
## 5 Bent Paddle         Kanu         Session Pale Ale   Ale 4.8  48     85
## 6 Bent Paddle Venture Pils            Pilsner Lager Lager 5.0  38     87

The summary function provides a summary of each variable in a data frame

summary(beer)
##           Brewery                Beer                    Description
##  Indeed       :7   14* ESB         : 1   India Pale Ale        : 5  
##  Summit       :7   B-Side Pils     : 1   English IPA           : 2  
##  Surly        :7   Batch 300       : 1   Pilsner Lager         : 2  
##  Bent Paddle  :5   Bender          : 1   Porter                : 2  
##  Fulton       :5   Bent Hop        : 1   American Blonde Ale   : 1  
##  Urban Growler:5   Big Boot Rye IPA: 1   Belgian Style Pale Ale: 1  
##  (Other)      :8   (Other)         :38   (Other)               :31  
##    Style         ABV             IBU            Rating     
##  Ale  :18   Min.   :4.200   Min.   :15.00   Min.   :79.00  
##  IPA  :17   1st Qu.:5.200   1st Qu.:33.00   1st Qu.:85.00  
##  Lager: 9   Median :5.600   Median :48.50   Median :87.00  
##             Mean   :5.818   Mean   :51.07   Mean   :87.18  
##             3rd Qu.:6.500   3rd Qu.:68.25   3rd Qu.:90.00  
##             Max.   :7.500   Max.   :99.00   Max.   :98.00  
## 

3.2.4 Plot the Data

We can check out boxplots of the relationship between ABV and style of beer:

ggplot(beer, aes(x = Style, y = ABV)) + 
  geom_boxplot()

There is, of course, more than one way to achieve this task! For example, we can also make boxplots using the boxplot() function in “base R”. The brewer.pal function in the RColorBrewer package creates ColorBrewer palettes for plotting

library(RColorBrewer)
MyColors <- brewer.pal(nlevels(beer$Style), "Pastel1")

The boxplot function creates simple boxplots

boxplot(ABV ~ Style, data = beer, ylab = "Alcohol By Volume (ABV)",
        main = "Alcohol by Style of Beer", col = MyColors)

Further, we can plot the relationship between ABV and IBU by Style:

ggplot(beer, aes(x = IBU, y = ABV, color = Style)) + 
  geom_point()

Alternatively, we can use the plot() function:

StyleInt <- as.integer(beer$Style)
plot(beer$IBU, beer$ABV, xlab = "International Bitterness Units (IBU)",
     ylab = "Alcohol By Volume (ABV)", pch = StyleInt + 14,
     main = "Bitterness vs Alcohol", col = MyColors[StyleInt])
legend("bottomright", legend = levels(beer$Style), pch = 15:17,
       col = MyColors, bty = "n")

R has functions for saving plots to various figure formats:

  • bmp function for bitmap graphics
  • jpeg function for JPEG graphics
  • png function for Portable Network Graphics
  • tiff function for Tag Image File Format graphics
  • pdf or dev.copy2pdf functions for Portable Document Format graphics
  • postscript or dev.copy2eps functions for PostScript graphics

3.3 Descriptive Statistics in R

3.3.1 Overview

We often need to calculate simple descriptive statistics of variables in a data frame, e.g., to make summary tables. As we have already seen, R is a function based and object oriented programming language. To obtain descriptive statistics, we input an object (e.g., column of a data frame) into the corresponding function. Thankfully, functions in R often have intuitive names—you can typically guess the name of the function you need!

3.3.2 Minimum and Maximum

To calculate the minimum or maximum of a variable, we could use the min or max functions

min(beer$ABV)
## [1] 4.2
max(beer$ABV)
## [1] 7.5

or the range function to return both the minimum and maximum

range(beer$ABV)
## [1] 4.2 7.5

The minimum ABV in the sample is 4.2% and the maximum is 7.5%. To determine which beers have the min/max ABV values, we can use the which.min and which.max functions

minmaxID <- c(which.min(beer$ABV), which.max(beer$ABV))
beer[minmaxID,]
##    Brewery              Beer      Description Style ABV IBU Rating
## 12  Indeed Lucy Session Sour Session Sour Ale   Ale 4.2  27     86
## 38   Surly         Overrated   West Coast IPA   IPA 7.5  69     91

3.3.3 Mean, Standard Deviation, and Variance

The mean function calculates the sample mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)

mean(beer$ABV)
## [1] 5.818182

The sd function calculates the sample standard deviation \(s = \{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\}^{1/2}\)

sd(beer$ABV)
## [1] 0.8176178

The var function calculates the sample variance \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\)

var(beer$ABV)
## [1] 0.6684989

The mean ABV is about 5.82% with a standard deviation of about 0.82% (variance of about 0.67%).

3.3.4 Medians and Quantiles

The median function calculates the sample median of a vector

median(beer$ABV)
## [1] 5.6

and the quantile function can be used for other quantiles

quantile(beer$ABV)
##   0%  25%  50%  75% 100% 
##  4.2  5.2  5.6  6.5  7.5
quantile(beer$ABV, probs = seq(0, 1, length=11))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
## 4.20 4.86 5.16 5.30 5.42 5.60 6.20 6.40 6.50 6.87 7.50

The median is 5.6% ABV, which implies that half of the beers have at least 5.6% ABV.

3.3.5 Factor Level Information

The levels function extracts the levels (i.e., unique values) of a factor variable

levels(beer$Style)
## [1] "Ale"   "IPA"   "Lager"

and the nlevels function returns the number of levels of a factor

nlevels(beer$Style)
## [1] 3

The 44 beers are classified into one of three Styles: Ale, IPA, or Lager.

3.3.6 Covariances and Correlations

The cov function calculates the covariance \(c = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\) between two variables

cov(beer$ABV, beer$IBU)
## [1] 13.22664

or a covariance matrix between the columns of an input data frame

cov(beer[, c("ABV","IBU","Rating")])
##               ABV       IBU    Rating
## ABV     0.6684989  13.22664  1.510571
## IBU    13.2266385 461.18129 30.871036
## Rating  1.5105708  30.87104 13.919662

The cor function calculates the correlation \(r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\{\sum_{i=1}^n (x_i - \bar{x})^2 \}^{1/2} \{\sum_{i=1}^n (y_i - \bar{y})^2 \}^{1/2} }\) between two variables

cor(beer$ABV, beer$IBU)
## [1] 0.7532919

or a correlation matrix between the columns of an input data frame

cor(beer[, c("ABV","IBU","Rating")])
##              ABV       IBU    Rating
## ABV    1.0000000 0.7532919 0.4951952
## IBU    0.7532919 1.0000000 0.3853018
## Rating 0.4951952 0.3853018 1.0000000

3.3.7 Applying Functions to Multiple Variables

To apply a function to several columns, we can use the apply function

apply(beer[, c("ABV","IBU","Rating")], 2, range)
##      ABV IBU Rating
## [1,] 4.2  15     79
## [2,] 7.5  99     98
apply(beer[, c("ABV","IBU","Rating")], 2, mean)
##       ABV       IBU    Rating 
##  5.818182 51.068182 87.181818

3.3.8 Applying Functions at Levels of Factors

Use the tapply (ragged apply) function to apply some function to a numeric variable separately at each level of a factor variable. For example, we could apply the range function to the ABV variable separately for each Style of beer

tapply(beer$ABV, beer$Style, range)
## $Ale
## [1] 4.2 7.0
## 
## $IPA
## [1] 5.8 7.5
## 
## $Lager
## [1] 4.5 5.4

In the given sample, Ales range from 4.2% to 7% ABV, India Pale Ales range from 5.8% to 7.5% ABV, and Lagers range from 4.5% to 5.4% ABV.

3.3.9 Making a Table

Use the cbind (column combine) function in combination with the tapply function to create tables

tab1 <- cbind(tapply(beer$Rating, beer$Style, length),
              tapply(beer$ABV, beer$Style, mean),
              tapply(beer$ABV, beer$Style, sd),
              tapply(beer$IBU, beer$Style, mean),
              tapply(beer$IBU, beer$Style, sd),
              tapply(beer$Rating, beer$Style, mean),
              tapply(beer$Rating, beer$Style, sd))
colnames(tab1) <- c("n", "ABV.Mean", "ABV.SD", "IBU.Mean", 
                    "IBU.SD", "Rating.Mean", "Rating.SD")
rtab1 <- round(tab1, 2)
rtab1
##        n ABV.Mean ABV.SD IBU.Mean IBU.SD Rating.Mean Rating.SD
## Ale   18     5.49   0.67    39.00  12.28       86.83      3.50
## IPA   17     6.56   0.46    73.53  10.59       88.18      4.54
## Lager  9     5.06   0.35    32.78  12.58       86.00      1.87

The write.csv function can be used to save the table

write.csv(rtab1, file = "~/Desktop/table1.csv", row.names = TRUE)

After some minor stylistic edits, the table is ready for publication—without having to manually type or copy-paste numbers!

The kable function (in the knitr package) includes nicely formatted tables in R Markdown documents

library(knitr)
kable(rtab1, caption = "Table 1: Sample size (n) and variable means and standard deviations (SD) for each style of beer.")
Table 3.1: Table 1: Sample size (n) and variable means and standard deviations (SD) for each style of beer.
n ABV.Mean ABV.SD IBU.Mean IBU.SD Rating.Mean Rating.SD
Ale 18 5.49 0.67 39.00 12.28 86.83 3.50
IPA 17 6.56 0.46 73.53 10.59 88.18 4.54
Lager 9 5.06 0.35 32.78 12.58 86.00 1.87

3.4 Student’s t-Test in R

3.4.1 One Sample t-Test

Mass produced beers (e.g., Bud Light, Miller Lite, etc.) have 4.2% ABV. Suppose we want to test if Minnesota beers have the same mean ABV as mass produced beers \[ H_0: \mu = 4.2 \quad \mbox{vs.} \quad H_1: \mu \neq 4.2 \] where \(\mu\) is the mean ABV, and \(H_0\) and \(H_1\) denote the null and alternative hypotheses. Assuming that the ABV scores are normally distributed, the t.test function can be used to test the null hypothesis

t.test(beer$ABV, mu = 4.2)
## 
##  One Sample t-test
## 
## data:  beer$ABV
## t = 13.128, df = 43, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4.2
## 95 percent confidence interval:
##  5.569603 6.066760
## sample estimates:
## mean of x 
##  5.818182

The observed t statistic is \(t = 13.13\) with 43 degrees of freedom, resulting in a p-value of essentially zero—we reject \(H_0\) using any standard \(\alpha\) level. The sample mean is \(\bar{x} = 5.8\)% ABV and the 95% confidence interval for the \(\mu\) (population mean ABV of Minnesota beers) is 5.6% to 6.1% ABV.

If we expect that the Minnesota beers have higher ABV than mass produced beers, i.e., \[ H_0: \mu = 4.2 \quad \mbox{vs.} \quad H_1: \mu > 4.2 \] we need to adjust the alternative input

t.test(beer$ABV, mu = 4.2, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  beer$ABV
## t = 13.128, df = 43, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 4.2
## 95 percent confidence interval:
##  5.610972      Inf
## sample estimates:
## mean of x 
##  5.818182

The only noteworthy difference is that the confidence interval is now a 95% lower bound for the mean ABV of Minnesota beers, which we expect to be at least 5.6% ABV. Note that changing the alternative also changes the p-value, but for this example we do not notice (because the p-value is so small).

3.4.2 Two Sample t-Test

Suppose that we want to test if IPA beers have higher ABV than non-IPAs (Ales and Lagers) \[ H_0: \mu_1 = \mu_2 \quad \mbox{vs.} \quad H_1: \mu_1 > \mu_2 \] where \(\mu_1\) and \(\mu_2\) denote the mean ABV of IPA and non-IPA beers, respectively.

To use the t.test function for a two sample t-test, we need to input two vectors

beer$IPA <- (beer$Style == "IPA")
t.test(beer$ABV[beer$IPA], beer$ABV[!beer$IPA], alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  beer$ABV[beer$IPA] and beer$ABV[!beer$IPA]
## t = 7.4454, df = 40.527, p-value = 2.093e-09
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.9415048       Inf
## sample estimates:
## mean of x mean of y 
##  6.564706  5.348148

The observed t statistic is \(t = 7.45\) with 40.53 degrees of freedom, resulting in a p-value of essentially zero—we reject \(H_0\) using any standard \(\alpha\) level. The sample mean difference is \(\bar{x}_1 - \bar{x}_2 = 1.22\)% ABV and the 95% lower-bound confidence interval reveals that we expect IPAs to have at least 0.94% more ABV than non-IPAs.

The default uses the Welch version, which does not assume equal variance for the two groups. The var.equal input can be used to change this assumption, which produces the classic two sample t-test

t.test(beer$ABV[beer$IPA], beer$ABV[!beer$IPA], alternative = "greater",
       var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  beer$ABV[beer$IPA] and beer$ABV[!beer$IPA]
## t = 6.9809, df = 42, p-value = 7.74e-09
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.9234437       Inf
## sample estimates:
## mean of x mean of y 
##  6.564706  5.348148

Note that the observed t-test statistic, p-value, and 95% lower-bound are slightly different, but our conclusion does not change: we expect IPAs to have at least 0.9% more ABV than non-IPAs.

3.5 One-Way ANOVA in R

3.5.1 Omnibus F-Test

Extending the previous example, suppose that we want to test if the mean ABV differs for the three Styles of beer \[ H_0: \mu_j = \mu \mbox{ for all } j \quad \mbox{vs.} \quad H_1: \mu_j \neq \mu \mbox{ for some } j \] where \(\mu_j\) denotes the mean ABV of the three Styles of beer: Ales, IPAs, and Lagers. Assuming that the ABV scores a normally distributed, we can use the aov (analysis of variance) function

amod <- aov(ABV ~ Style, data = beer)
summary(amod)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Style        2  16.59   8.297      28 2.16e-08 ***
## Residuals   41  12.15   0.296                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The observed F statistic is \(F = 28\) with 2 numerator and 41 denominator degrees of freedom, resulting in a p-value of essentially zero—we reject \(H_0\) using any standard \(\alpha\) level. We conclude that the mean ABV of Minnesota beers depends on the Style of beer, but the results do not directly reveal which Styles significantly differ from one another.

3.5.2 Pairwise Comparisons (Tukey’s HSD)

To determine which Styles significantly differ in their mean ABV, we can use Tukey’s Honest Significant Differences (HSD) procedure via the TukeyHSD function

TukeyHSD(amod)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ABV ~ Style, data = beer)
## 
## $Style
##                 diff        lwr        upr     p adj
## IPA-Ale    1.0702614  0.6225693  1.5179536 0.0000024
## Lager-Ale -0.4388889 -0.9793078  0.1015301 0.1313124
## Lager-IPA -1.5091503 -2.0548418 -0.9634589 0.0000001

The pairwise comparisons reveal that

  • IPAs have significantly higher mean ABV than Ales. The estimated mean difference is \(\hat{\delta}_1 = \hat{\mu}_{2} - \hat{\mu}_1 = 1.07\) with 95% confidence interval \(\delta_1 \in [0.62, 1.52]\).
  • Lagers and Ales do not significantly differ in mean ABV. The estimated mean difference is \(\hat{\delta}_2 = \hat{\mu}_{3} - \hat{\mu}_1 = -0.44\) with 95% confidence interval \(\delta_2 \in [-0.98, 0.10]\).
  • Lagers have significantly lower mean ABV than IPAs. The estimated mean difference is \(\hat{\delta}_3 = \hat{\mu}_{3} - \hat{\mu}_2 = -1.51\) with 95% confidence interval \(\delta_3 \in [-2.05, -0.96]\).

3.6 Correlation Tests in R

Suppose that we want to test if a beer’s ABV and Rating are postively correlated \[ H_0: \rho = 0 \quad \mbox{vs.} \quad H_1: \rho > 0 \] where \(\rho\) is the population correlation between the ABV and Rating. Assuming that the ABV and Rating variables follow a bivariate normal distribution, we can use the cor.test function

cor.test(beer$ABV, beer$Rating, alternative = "greater")
## 
##  Pearson's product-moment correlation
## 
## data:  beer$ABV and beer$Rating
## t = 3.6939, df = 42, p-value = 0.000316
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.2784833 1.0000000
## sample estimates:
##       cor 
## 0.4951952

The estimated sample correlation is \(r = 0.495\) and the 95% lower-bound confidence interval reveals that we expect the ABV and Ratings to have a positive correlation of at least \(\rho = 0.28\).

3.7 Exercises

  1. Load the Minnesota Beer Data into R.
  2. Make a boxplot of the IBUs by Style of beer.
  3. Make a scatterplot of the ABV (x-axis) by Rating (y-axis).
  4. Calculate some descriptive statistics for the IBU variable.
  5. Create a table showing the sample size and variable means and standard deviations for each Brewery.
  6. Repeat the t-tests using the IBU variable as the response.
  7. Repeat the one-way ANOVA using the IBU variable as the response.
  8. Repeat the correlation test using the IBU and Rating variables.



SOLUTIONS

#1 Load the Minnesota Beer Data into R.
beer <- read.csv("http://users.stat.umn.edu/~helwig/notes/MNbeer.csv")

#2) Make a boxplot of the IBUs by Style of beer.
# Notice that IPAs tend to have higher IBU (bitterness)
ggplot(beer, aes(x = Style, y = IBU)) + 
  geom_boxplot()

#3) Make a scatterplot of the ABV (x-axis) by Rating (y-axis).
# Notice that Rating tends to increase with ABV (alcohol level)
ggplot(beer, aes(x = ABV, y = Rating)) + 
  geom_point()

#4) Calculate some descriptive statistics for the IBU variable.
summary(beer$IBU)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   33.00   48.50   51.07   68.25   99.00
#5) Create a table showing the sample size and variable means and standard deviations for each Brewery.
tab1 <- cbind(tapply(beer$Rating, beer$Brewery, length),
              tapply(beer$Rating, beer$Brewery, mean),
              tapply(beer$Rating, beer$Brewery, sd))
colnames(tab1) <- c("n", "Rating.Mean", "Rating.SD")
rtab1 <- round(tab1, 2)
rtab1
##               n Rating.Mean Rating.SD
## Bauhaus       4       86.75      0.96
## Bent Paddle   5       87.60      1.67
## Fulton        5       84.40      5.03
## Indeed        7       87.86      2.12
## Steel Toe     4       88.00      2.45
## Summit        7       85.43      3.10
## Surly         7       92.14      3.53
## Urban Growler 5       83.80      1.48
#6) Repeat the *t*-tests using the IBU variable as the response.
t.test(beer$IBU)
## 
##  One Sample t-test
## 
## data:  beer$IBU
## t = 15.774, df = 43, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  44.53914 57.59722
## sample estimates:
## mean of x 
##  51.06818
#7) Repeat the one-way ANOVA using the IBU variable as the response.
# The p-value here indicates that IBU significantly differs by Style
amod <- aov(IBU ~ Style, data = beer)
summary(amod)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Style        2  14209    7105   51.81 5.98e-12 ***
## Residuals   41   5622     137                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#8) Repeat the correlation test using the IBU and Rating variables.
# The p-value here indicates that there's a significant, positive
# correlation between beers' IBU & Rating.
cor.test(beer$IBU, beer$Rating, alternative = "greater")
## 
##  Pearson's product-moment correlation
## 
## data:  beer$IBU and beer$Rating
## t = 2.706, df = 42, p-value = 0.0049
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.1482865 1.0000000
## sample estimates:
##       cor 
## 0.3853018