4 Linear Regression in R
Author: Nathaniel E. Helwig
4.1 Chapter Outline and Goals
In this chapter, we will cover how to…
- Fit simple and multiple linear regression models
- Test the significance of regression coefficients
- Plot and interpret the regression results
- Make predictions from fit regression models
R’s lm (linear model) function will be the primary tool used in the chapter.
4.2 Minnesota Beer Data (Reminder)
4.2.1 Overview
The Minnesota beer data has 44 beers measured on 7 variables:
- Brewery: Name of the brewery (factor with 8 levels)
- Beer: Name of the beer (factor with 44 levels)
- Description: Description of the beer (factor with 37 levels)
- Style: Style of the beer (factor with 3 levels)
- ABV: Alcohol by volume (numeric)
- IBU: International bitterness units (integer)
- Rating: Beer Advocate rating (integer)
Data obtained by NEH from Beer Advocate and the websites of the eight breweries.
4.2.2 Load and Look at the Data
Use the read.csv function to load the beer data into R
beer <- read.csv("http://users.stat.umn.edu/~helwig/notes/MNbeer.csv")
The head function returns the first six lines of a data frame
head(beer)
## Brewery Beer Description Style ABV IBU
## 1 Bauhaus Wonderstuff New Bohemian Pilsner Lager 5.4 48
## 2 Bauhaus Stargazer German Style Schwarzbier Lager 5.0 28
## 3 Bauhaus Wagon Party West Cost Style Lager Lager 5.4 55
## 4 Bauhaus Sky-Five! Midwest Coast IPA IPA 6.7 70
## 5 Bent Paddle Kanu Session Pale Ale Ale 4.8 48
## 6 Bent Paddle Venture Pils Pilsner Lager Lager 5.0 38
## Rating
## 1 88
## 2 87
## 3 86
## 4 86
## 5 85
## 6 87
4.3 Simple Linear Regression
4.3.1 Fit the Model
Consider a simple linear regression model of the form \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \] where \(y_i\) is the Rating of the i-th beer (response), \(x_i\) is the ABV of the i-th beer (predictor), \(\beta_0\) is the unknown regression intercept, \(\beta_1\) is the unknown regression slope, and \(\epsilon_i \sim \mathrm{N}(0, \sigma^2)\) is a latent Gaussian error term. To fit the model, we can use the lm function
mod <- lm(Rating ~ ABV, data = beer)
The first input is the regression formula (Response ~ Predictor), and the second input is the data frame containing the variables in the regression formula. Note that mod is an object of class lm, which is a list containing information about the fit model.
class(mod)
## [1] "lm"
names(mod)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
For example, the $coefficients element contains the estimated regression coefficients
mod$coefficients
## (Intercept) ABV
## 74.03 2.26
which reveal that the expected Rating increases by about 2.26 points for every 1 unit (i.e., 1%) increaese in ABV.
4.3.2 Inference Information
To obtain a more detailed summary of the fit model, use the summary function
modsum <- summary(mod)
names(modsum)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
modsum
##
## Call:
## lm(formula = Rating ~ ABV, data = beer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.497 -2.156 0.359 1.667 7.018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.035 3.593 20.60 < 2e-16 ***
## ABV 2.260 0.612 3.69 0.00063 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.28 on 42 degrees of freedom
## Multiple R-squared: 0.245, Adjusted R-squared: 0.227
## F-statistic: 13.6 on 1 and 42 DF, p-value: 0.000632
Note that summarizing an lm object returns the estimated error standard deviation sigma (\(\hat{\sigma} = 3.28\)), the coefficient of determination r.squared (\(R^2 = 0.2452\)), and a coefficient inference table for testing \(H_0: \beta_j = 0\) versus \(H_1: \beta_j \neq 0\). The observed t statistic for testing the slope parameter is \(t = 3.69\) with 42 degrees of freedom, resulting in a p-value less than 0.001—we reject \(H_0\) using any standard \(\alpha\) level.
Use the confint function to obtain confidence intervals for regression coefficients
confint(mod, "ABV")
## 2.5 % 97.5 %
## ABV 1.025 3.494
The 95% confidence interval for \(\beta_1\) reveals that we expect the average Rating to increase by 1.03 to 3.49 points for each additional 1% ABV.
4.3.3 Plot the Regression Line
The abline function makes it easy to include the least-squares regression line on a scatterplot
plot(beer$ABV, beer$Rating, xlab = "Alcohol By Volume",
ylab = "Beer Advocate Rating", main = "Alcohol by Rating")
abline(mod)
4.3.4 Diagnostic and Influence Plots
R makes it really easy to create simple diagnostic and influence plots for a fit regression model:
plot(mod)
4.3.5 Prediction for New Data
We often want to use a fit regression model to create predictions for new data. In R, this involves first creating the data frame of new predictor scores
newdata <- data.frame(ABV = seq(4.2, 7.5, by = 0.1))
which we input to the predict function along with the fit model
newfit <- predict(mod, newdata)
newfit
## 1 2 3 4 5 6 7 8 9 10 11 12
## 83.53 83.75 83.98 84.20 84.43 84.66 84.88 85.11 85.33 85.56 85.78 86.01
## 13 14 15 16 17 18 19 20 21 22 23 24
## 86.24 86.46 86.69 86.91 87.14 87.37 87.59 87.82 88.04 88.27 88.50 88.72
## 25 26 27 28 29 30 31 32 33 34
## 88.95 89.17 89.40 89.63 89.85 90.08 90.30 90.53 90.76 90.98
By default, the predict function returns a vector of predictions \(\hat{y}_{i(\mbox{new})} = \hat{\beta}_0 + \hat{\beta}_1 x_{i(\mbox{new})}\). To obtain the corresponding standard errors of the predictions, we can use the se.fit input
newfitse <- predict(mod, newdata, se.fit = TRUE)
newfitse
## $fit
## 1 2 3 4 5 6 7 8 9 10 11 12
## 83.53 83.75 83.98 84.20 84.43 84.66 84.88 85.11 85.33 85.56 85.78 86.01
## 13 14 15 16 17 18 19 20 21 22 23 24
## 86.24 86.46 86.69 86.91 87.14 87.37 87.59 87.82 88.04 88.27 88.50 88.72
## 25 26 27 28 29 30 31 32 33 34
## 88.95 89.17 89.40 89.63 89.85 90.08 90.30 90.53 90.76 90.98
##
## $se.fit
## 1 2 3 4 5 6 7 8 9 10
## 1.1065 1.0521 0.9985 0.9459 0.8943 0.8440 0.7952 0.7483 0.7035 0.6614
## 11 12 13 14 15 16 17 18 19 20
## 0.6225 0.5873 0.5567 0.5314 0.5121 0.4997 0.4946 0.4970 0.5068 0.5236
## 21 22 23 24 25 26 27 28 29 30
## 0.5468 0.5756 0.6092 0.6469 0.6879 0.7317 0.7779 0.8261 0.8758 0.9270
## 31 32 33 34
## 0.9793 1.0325 1.0866 1.1414
##
## $df
## [1] 42
##
## $residual.scale
## [1] 3.28
The interval input can be used to create confidence and prediction intervals
newfitCI <- predict(mod, newdata, interval = "confidence")
newfitPI <- predict(mod, newdata, interval = "prediction")
head(newfitCI)
## fit lwr upr
## 1 83.53 81.29 85.76
## 2 83.75 81.63 85.87
## 3 83.98 81.96 85.99
## 4 84.20 82.29 86.11
## 5 84.43 82.62 86.23
## 6 84.66 82.95 86.36
head(newfitPI)
## fit lwr upr
## 1 83.53 76.54 90.51
## 2 83.75 76.80 90.70
## 3 83.98 77.06 90.90
## 4 84.20 77.31 91.09
## 5 84.43 77.57 91.29
## 6 84.66 77.82 91.49
The confidence and prediction intervals can be plotted using
plot(beer$ABV, beer$Rating, xlab = "Alcohol By Volume",
ylab = "Beer Advocate Rating", main = "Alcohol by Rating",
ylim = c(75, 100))
lines(newdata$ABV, newfitCI[,1])
lines(newdata$ABV, newfitCI[,2], lty = 2, col = "blue")
lines(newdata$ABV, newfitCI[,3], lty = 2, col = "blue")
lines(newdata$ABV, newfitPI[,2], lty = 3, col = "red")
lines(newdata$ABV, newfitPI[,3], lty = 3, col = "red")
legend("bottomright", lty = 1:3, legend = c("fit", "95% CI", "95% PI"),
col = c("black", "blue", "red"), bty = "n")
4.4 Multiple Linear Regression
4.4.1 Overview
A multiple linear regression model has the form \[ y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \epsilon_i \] where \(y_i\) is the response for the \(i\)-th observation, \(x_{ij}\) is the j-th predictor for the i-th observation, \(\beta_0\) is the unknown regression intercept, \(\beta_j\) is the unknown regression slope for the j-th predictor, and \(\epsilon_i \sim \mathrm{N}(0, \sigma^2)\) is a latent Gaussian error term. Note that \(\beta_j\) gives the expected change in the response variable for a 1-unit change in the j-th predictor variable conditioned on the other predictors, i.e., holding all other predictors constant.
4.4.2 Additive Effects
We will start by considering a model predicting the Rating from the additive effects of ABV and Brewery
amod <- lm(Rating ~ ABV + Brewery, data = beer)
Note that this model allows each Brewery to have a unique regression intercept (Bauhaus is the baseline), but assumes that the slope between ABV and Rating is the same for each Brewery. We can summarize the model using the same approach as before:
amodsum <- summary(amod)
amodsum
##
## Call:
## lm(formula = Rating ~ ABV + Brewery, data = beer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.702 -1.477 0.217 1.207 5.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.8216 3.2784 23.74 <2e-16 ***
## ABV 1.5873 0.5342 2.97 0.0053 **
## BreweryBent Paddle 1.0167 1.7600 0.58 0.5672
## BreweryFulton -2.2786 1.7592 -1.30 0.2037
## BreweryIndeed 0.8747 1.6455 0.53 0.5984
## BrewerySteel Toe 0.0596 1.8970 0.03 0.9751
## BrewerySummit -1.1684 1.6444 -0.71 0.4821
## BrewerySurly 4.2534 1.6877 2.52 0.0164 *
## BreweryUrban Growler -3.2278 1.7615 -1.83 0.0754 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.62 on 35 degrees of freedom
## Multiple R-squared: 0.598, Adjusted R-squared: 0.506
## F-statistic: 6.51 on 8 and 35 DF, p-value: 3.5e-05
Compared to the simple linear regression model containing only the ABV predictor, we have noticeably reduced the residual standard deviation estimate sigma (\(\hat{\sigma} = 2.622\)) and increased the coefficient of (multiple) determination r.squared (\(R^2 = 0.5979\)).
The anova and Anova functions can be used to test the significance of terms
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
anova(amod) # Type I (sequential) SS test
## Analysis of Variance Table
##
## Response: Rating
## Df Sum Sq Mean Sq F value Pr(>F)
## ABV 1 147 146.8 21.35 5e-05 ***
## Brewery 7 211 30.2 4.39 0.0014 **
## Residuals 35 241 6.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(amod) # Type II SS test
## Anova Table (Type II tests)
##
## Response: Rating
## Sum Sq Df F value Pr(>F)
## ABV 60.7 1 8.83 0.0053 **
## Brewery 211.1 7 4.39 0.0014 **
## Residuals 240.7 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that anova tests the effects sequentially (ABV alone, then Brewery given ABV), whereas the Anova function (in the car package) tests the effects conditioned on the other effect (ABV given Brewery, Brewery given ABV). Using the Type II tests from the Anova function, we see that both ABV (\(F_{1,35} = 8.83, p = 0.005\)) and Brewery (\(F_{7,35} = 4.39, p = 0.001\)) significantly add to the prediction of the beer’s Rating.
4.4.3 Interaction Effects
Next we consider a model predicting the Rating from the interaction effects of ABV and Brewery
imod <- lm(Rating ~ ABV * Brewery, data = beer)
Note that formula notation is shorthand for Rating ~ ABV + Brewery + ABV:Brewery, so this model allows each Brewery to have a unique regression intercept and slope relating ABV and Rating. We can summarize the model using the same approach as before:
imodsum <- summary(imod)
imodsum
##
## Call:
## lm(formula = Rating ~ ABV * Brewery, data = beer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.291 -1.304 -0.047 1.477 4.943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.420 12.410 7.29 6.2e-08 ***
## ABV -0.653 2.192 -0.30 0.77
## BreweryBent Paddle -17.065 17.824 -0.96 0.35
## BreweryFulton -18.885 14.606 -1.29 0.21
## BreweryIndeed -8.170 14.412 -0.57 0.58
## BrewerySteel Toe -16.653 16.604 -1.00 0.32
## BrewerySummit -11.251 15.783 -0.71 0.48
## BrewerySurly -12.048 14.679 -0.82 0.42
## BreweryUrban Growler -8.655 19.727 -0.44 0.66
## ABV:BreweryBent Paddle 3.233 3.182 1.02 0.32
## ABV:BreweryFulton 2.958 2.581 1.15 0.26
## ABV:BreweryIndeed 1.624 2.527 0.64 0.53
## ABV:BrewerySteel Toe 2.885 2.784 1.04 0.31
## ABV:BrewerySummit 1.785 2.807 0.64 0.53
## ABV:BrewerySurly 2.824 2.511 1.12 0.27
## ABV:BreweryUrban Growler 1.003 3.428 0.29 0.77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.81 on 28 degrees of freedom
## Multiple R-squared: 0.63, Adjusted R-squared: 0.431
## F-statistic: 3.17 on 15 and 28 DF, p-value: 0.00403
Compared to the additive model, we have slightly increased the residual standard deviation estimate sigma (\(\hat{\sigma} = 2.813\)) and increased the coefficient of (multiple) determination r.squared (\(R^2 = 0.6297\)).
Use the Anova function to test the signifiance of the effects
library(car)
Anova(imod) # Type II SS test
## Anova Table (Type II tests)
##
## Response: Rating
## Sum Sq Df F value Pr(>F)
## ABV 60.7 1 7.67 0.0099 **
## Brewery 211.1 7 3.81 0.0050 **
## ABV:Brewery 19.0 7 0.34 0.9267
## Residuals 221.6 28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results reveal that the interaction effect is not significant (\(F_{7,28} = 0.34, p = 0.927\)), but the main effects of ABV (\(F_{1,28} = 7.67, p = 0.01\)) and Brewery (\(F_{7,28} = 3.81, p = 0.005\)) are significant at the classic \(\alpha = 0.05\) significance level.
4.4.4 Comparing Fit Models
To compare the fit models, we can use the anova function for F-tests
anova(mod, amod, imod)
## Analysis of Variance Table
##
## Model 1: Rating ~ ABV
## Model 2: Rating ~ ABV + Brewery
## Model 3: Rating ~ ABV * Brewery
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 452
## 2 35 241 7 211 3.81 0.005 **
## 3 28 222 7 19 0.34 0.927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
or the AIC function to extract Akaike’s information criterion
AIC(mod, amod, imod)
## df AIC
## mod 3 233.3
## amod 10 219.6
## imod 17 230.0
In this case, the F-tests and AIC values suggest that the additive model should be preferred. We conclude that each Brewery has a unique baseline Rating, and increasing the ABV by 1% corresponds to an expected 1.59 point increase in the Rating.
4.5 Exercises
- Load the Minnesota Beer Data into R.
- Make a scatterplot of the IBU (x-axis) by Rating (y-axis)
- Fit a simple linear regression model predicting Rating from IBU.
- Is there a significant linear relationship between IBU and Rating?
- Plot the linear relationship, along with 95% confidence and prediction intervals.
- Fit a multiple linear regression model predicting Rating from the additive effects of IBU and Brewery.
- Fit a multiple linear regression model predicting Rating from the additive and interaction effects of IBU and Brewery.
- Considering the models you fit in Ex 3, 7, 8, which do you prefer and why?