2 Data Visualization

Author: Alicia Johnson

The following data set on the 2016 election is stored as a csv file at https://www.macalester.edu/~ajohns24/data/IMAdata1.csv.

This data set combines the county level election results provided by Tony McGovern (shared on GitHub), county level demographic data from the df_county_demographics data set within the choroplethr R package, and historical information about red/blue/purple states. Let’s take a quick look:

#use read.csv() to import the csv file
election <- read.csv("https://www.macalester.edu/~ajohns24/data/IMAdata1.csv")

dim(election)       #dimensions 
## [1] 3143   34
head(election, 2)   #first 2 rows
##   region total_population percent_white percent_black percent_asian
## 1   1001            54907            76            18             1
## 2   1003           187114            83             9             1
##   percent_hispanic per_capita_income median_rent median_age fips_code
## 1                2             24571         668       37.5      1001
## 2                4             26766         693       41.5      1003
##           county total_2008 dem_2008 gop_2008 oth_2008 total_2012
## 1 Autauga County      23641     6093    17403      145      23909
## 2 Baldwin County      81413    19386    61271      756      84988
##   dem_2012 gop_2012 oth_2012 total_2016 dem_2016 gop_2016 oth_2016
## 1     6354    17366      189      24661     5908    18110      643
## 2    18329    65772      887      94090    18409    72780     2901
##   perdem_2016 perrep_2016 winrep_2016 perdem_2012 perrep_2012
## 1      0.2396      0.7344        TRUE      0.2658      0.7263
## 2      0.1957      0.7735        TRUE      0.2157      0.7739
##   winrep_2012 polyname abb StateColor value IncomeBracket
## 1        TRUE  alabama  AL        red  TRUE           low
## 2        TRUE  alabama  AL        red  TRUE          high
names(election)     #variable names
##  [1] "region"            "total_population"  "percent_white"    
##  [4] "percent_black"     "percent_asian"     "percent_hispanic" 
##  [7] "per_capita_income" "median_rent"       "median_age"       
## [10] "fips_code"         "county"            "total_2008"       
## [13] "dem_2008"          "gop_2008"          "oth_2008"         
## [16] "total_2012"        "dem_2012"          "gop_2012"         
## [19] "oth_2012"          "total_2016"        "dem_2016"         
## [22] "gop_2016"          "oth_2016"          "perdem_2016"      
## [25] "perrep_2016"       "winrep_2016"       "perdem_2012"      
## [28] "perrep_2012"       "winrep_2012"       "polyname"         
## [31] "abb"               "StateColor"        "value"            
## [34] "IncomeBracket"

Now that we understand the structure of this data set, we can start to ask some questions:

To what degree did Trump support vary from county to county?
In how many counties did Trump win?
What’s the relationship between Trump’s 2016 support and Romney’s 2012 support?
What’s the relationship between Trump’s support and the “color” of the state in which the county exists?

Visualizing the data is the first natural step in answering these questions. Why?

Visualizations help us understand what we’re working with: What are the scales of our variables? Are there any outliers, i.e. unusual cases? What are the patterns among our variables?
This understanding will inform our next steps: What statistical tool / model is appropriate?
Once our analysis is complete, visualizations are a powerful way to communicate our findings and tell a story.

2.1 ggplot

We’ll construct visualizations using the ggplot function in R. Though the ggplot learning curve can be steep, its “grammar” is intuitive and generalizable once mastered. The ggplot plotting function is stored in the ggplot2 package, which must be loaded before use:

library(ggplot2)

The best way to learn about ggplot is to just play around. Don’t worry about memorizing the syntax. Rather, focus on the patterns in the code and on what they make possible. There’s a helpful cheat sheet for future reference:

GGPLOT CHEAT SHEET

2.2 Univariate visualizations

We’ll start with univariate visualizations.

Categorical Variables

Consider the categorical winrep_2016 variable which indicates whether Trump won the county:

levels(factor(election$winrep_2016))
## [1] "FALSE" "TRUE"

A table provides a simple summary of the number of counties that fall into these 2 categories:

table(election$winrep_2016)
## 
## FALSE  TRUE 
##   487  2625
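
Note that these two counts sum to 3112, which is fewer than the 3143 rows reported by dim(). One plausible explanation (an assumption, not verified here) is that winrep_2016 is missing for the remaining counties, since table() drops NA values by default. A minimal sketch on a made-up logical vector shows how to make missing values visible and how to turn counts into proportions:

```r
# toy stand-in for election$winrep_2016 (values invented for illustration)
wins <- c(TRUE, TRUE, FALSE, TRUE, NA, FALSE, TRUE, NA)

# default behavior: NA values are left out of the counts
counts <- table(wins)

# useNA = "ifany" adds a separate NA category when missing values exist
counts_with_na <- table(wins, useNA = "ifany")

# prop.table() rescales the counts to proportions that sum to 1
props <- prop.table(counts)

counts
counts_with_na
props
```

Applied to the election data, the same calls would use election$winrep_2016 in place of the toy vector.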

A bar chart provides a visualization of this table. Try out the code below that builds up from a simple to a customized bar chart. At each step determine how each piece of code contributes to the plot.

#set up a plotting frame
ggplot(election, aes(x=winrep_2016))

#add a layer with the bars
ggplot(election, aes(x=winrep_2016)) + 
    geom_bar()

#add axis labels
ggplot(election, aes(x=winrep_2016)) + 
    geom_bar() +
    labs(x="Trump win", y="Number of counties")
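
Each step above repeats the same first line. It may help to know that a ggplot can be stored in an object and extended with +. The sketch below uses a small made-up data frame (the names toy, base_plot and bars are assumptions for illustration) and the ggplot2 helper layer_data() to check that the bar heights are the same counts that table() reports:

```r
library(ggplot2)

# toy data: 3 counties won, 1 lost (invented for illustration)
toy <- data.frame(winrep_2016 = c(TRUE, TRUE, FALSE, TRUE))

# store the plotting frame once
base_plot <- ggplot(toy, aes(x = winrep_2016))

# add layers to the stored frame
bars <- base_plot +
    geom_bar() +
    labs(x = "Trump win", y = "Number of counties")

# the counts that geom_bar() computed for the bars (FALSE first, then TRUE)
bar_counts <- layer_data(bars)$count
bar_counts

# the same counts from a table
table(toy$winrep_2016)
```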


Quantitative Variables

The quantitative perrep_2016 variable records Trump’s share of the vote in each county, stored as a proportion between 0 and 1 (e.g. 0.7344, or roughly 73%, for Autauga County). Quantitative variables require different summary tools than categorical variables. We’ll explore 2 methods for graphing quantitative variables: histograms & density plots.

Histograms are constructed by (1) dividing up the observed range of the variable into ‘bins’ of equal width; and (2) counting up the number of cases that fall into each bin. Try out the code below.

#set up a plotting frame
ggplot(election, aes(x=perrep_2016))

#add a histogram layer
ggplot(election, aes(x=perrep_2016)) +
    geom_histogram()

#add axis labels
ggplot(election, aes(x=perrep_2016)) +
    geom_histogram() +
    labs(x="Trump vote (%)", y="Number of counties")

#change the border colors
ggplot(election, aes(x=perrep_2016)) +
    geom_histogram(color="white") +
    labs(x="Trump vote (%)", y="Number of counties")

#change the bin width
ggplot(election, aes(x=perrep_2016)) +
    geom_histogram(color="white", binwidth=0.10) +
    labs(x="Trump vote (%)", y="Number of counties")
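
The two-step construction of a histogram described above can also be carried out by hand. This is only a sketch on made-up vote shares, using base R’s cut() and table() rather than whatever ggplot does internally:

```r
# toy vote shares (invented for illustration, not from the election data)
x <- c(0.12, 0.22, 0.31, 0.38, 0.44, 0.47, 0.52, 0.58, 0.63, 0.71, 0.77, 0.91)

# step 1: divide the range from 0 to 1 into bins of equal width 0.25
breaks <- seq(0, 1, by = 0.25)
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# step 2: count the number of cases that fall into each bin
bin_counts <- table(bins)
bin_counts
```

The heights of the histogram bars are these counts. Changing the bin width in seq() changes the counts, and with them the shape of the histogram.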

NOTE: The histograms above that do not set binwidth fall back on the default of 30 bins, and ggplot prints the following message:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Density plots are essentially smooth versions of the histogram. Instead of sorting cases into discrete bins, the “density” of cases is calculated across the entire range of values. The more cases there are near a given value, the greater the density there! The density is then scaled so that the area under the density curve always equals 1 and the area under any section of the curve represents the fraction of cases that lie in that range.

#set up the plotting frame
ggplot(election, aes(x=perrep_2016))

#add a density curve
ggplot(election, aes(x=perrep_2016)) +
    geom_density()

#add axis labels
ggplot(election, aes(x=perrep_2016)) +
    geom_density() +
    labs(x="Trump vote (%)")

#add a fill color
ggplot(election, aes(x=perrep_2016)) +
    geom_density(fill="red") +
    labs(x="Trump vote (%)")
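
The two area properties of a density curve can be checked numerically. The sketch below is an illustration on simulated proportions (not the election data). It uses base R’s density() and a small hypothetical helper, trapezoid_area(), to approximate areas under the curve:

```r
# simulated proportions between 0 and 1 (for illustration only)
set.seed(2016)
x <- rbeta(1000, shape1 = 5, shape2 = 3)

# estimate the density curve on a grid of points
d <- density(x)

# hypothetical helper: trapezoid-rule area under a curve given by points (gx, gy)
trapezoid_area <- function(gx, gy) {
    n <- length(gx)
    sum(diff(gx) * (gy[-1] + gy[-n]) / 2)
}

# the area under the whole curve should be very close to 1
total_area <- trapezoid_area(d$x, d$y)

# the area above 0.5 should be close to the fraction of cases above 0.5
above <- d$x >= 0.5
area_above <- trapezoid_area(d$x[above], d$y[above])
fraction_above <- mean(x >= 0.5)

total_area
area_above
fraction_above
```

The match between area_above and fraction_above is only approximate, since the density curve is a smoothed summary of the data.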


2.3 Visualizing Relationships

Consider the data on just 6 of the counties:

Before constructing graphics of the relationships among these variables, we need to understand what features these graphics should have. Without peeking at the exercises, challenge yourself to think about how we might graph the relationships among the following sets of variables:

perrep_2016 vs perrep_2012
perrep_2016 vs StateColor
perrep_2016 vs perrep_2012 and StateColor (in 1 plot)
perrep_2016 vs perrep_2012 and median_rent (in 1 plot)

Run through the following exercises which introduce different approaches to visualizing relationships.

Scatterplots of 2 quantitative variables

Each quantitative variable has an axis. Each case is represented by a dot.

#just a graphics frame
ggplot(election, aes(y=perrep_2016, x=perrep_2012))

#add a scatterplot layer
ggplot(election, aes(y=perrep_2016, x=perrep_2012)) +
    geom_point()

#another predictor
ggplot(election, aes(y=perrep_2016, x=median_rent)) +
    geom_point()


Side-by-side plots of 1 quantitative variable vs 1 categorical variable

#density plots by group
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
    geom_density()

#to see better: add transparency
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
    geom_density(alpha=0.5)

#fix the color scale!
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
    geom_density(alpha=0.5) +
    scale_fill_manual(values=c("blue","purple","red"))

#to see better: split groups into separate plots
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
    geom_density(alpha=0.5) +
    facet_wrap( ~ StateColor) +
    scale_fill_manual(values=c("blue","purple","red"))
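
The manual scale above relies on the StateColor categories being ordered alphabetically (blue, purple, red), which happens to match the order of the colors supplied. A slightly more defensive sketch names each color by the category it belongs to, so the mapping no longer depends on ordering. The data frame and the state_colors vector below are made up for illustration:

```r
library(ggplot2)

# toy data: 3 counties per state color (values invented for illustration)
toy <- data.frame(
    perrep_2016 = c(0.30, 0.35, 0.42, 0.48, 0.52, 0.55, 0.66, 0.70, 0.74),
    StateColor  = rep(c("blue", "purple", "red"), each = 3)
)

# named vector: the names are categories, the values are colors
# (deliberately listed out of alphabetical order)
state_colors <- c(red = "red", blue = "blue", purple = "purple")

p <- ggplot(toy, aes(x = perrep_2016, fill = StateColor)) +
    geom_density(alpha = 0.5) +
    scale_fill_manual(values = state_colors)

# the fill color actually used for each group (groups are numbered alphabetically)
used <- unique(layer_data(p)[, c("fill", "group")])
used[order(used$group), ]
```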


Scatterplots of 1 quantitative variable vs 1 categorical & 1 quantitative variable

If median_rent and StateColor both explain some of the variability in perrep_2016, why not include both in our analysis?! Let’s.

#scatterplot: id groups using color
ggplot(election, aes(y=perrep_2016, x=median_rent, color=StateColor)) +
    geom_point(alpha=0.5) 

#fix the color scale!
ggplot(election, aes(y=perrep_2016, x=median_rent, color=StateColor)) +
    geom_point(alpha=0.5) + 
    scale_color_manual(values=c("blue","purple","red"))

#scatterplot: id groups using shape
#(no color scale here: color is not mapped to a variable in this plot)
ggplot(election, aes(y=perrep_2016, x=median_rent, shape=StateColor)) +
    geom_point(alpha=0.5)

#scatterplot: split/facet by group
ggplot(election, aes(y=perrep_2016, x=median_rent, color=StateColor)) +
    geom_point(alpha=0.5) +
    facet_wrap( ~ StateColor) + 
    scale_color_manual(values=c("blue","purple","red"))


Plots of 3 quantitative variables

Let’s try visualizing 3 quantitative variables in a single plot.

#scatterplot: represent third variable using size
ggplot(election, aes(y=perrep_2016, x=median_rent, size=perrep_2012)) +
    geom_point(alpha=0.1)

#scatterplot: represent third variable using color
ggplot(election, aes(y=perrep_2016, x=median_rent, color=perrep_2012)) +
    geom_point(alpha=0.5)

#scatterplot: discretize the third variable into 2 groups & represent with color
ggplot(election, aes(y=perrep_2016, x=median_rent, color=cut(perrep_2012,2))) +
    geom_point(alpha=0.5)
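
To see what cut() does in the last plot, it can be applied to a small made-up vector. This is a sketch using base R only, and the explicit breaks and labels in the second call are one possible choice rather than anything prescribed here:

```r
# toy 2012 vote shares (invented for illustration)
perrep_2012 <- c(0.10, 0.35, 0.45, 0.55, 0.80, 0.90)

# cut(x, 2) splits the observed range (0.10 to 0.90) into 2 equal-width intervals
two_groups <- cut(perrep_2012, 2)
table(two_groups)

# explicit breaks and labels give more readable legend entries
named_groups <- cut(perrep_2012,
                    breaks = c(0, 0.5, 1),
                    labels = c("50% or less", "more than 50%"))
table(named_groups)
```

Note that cut(x, 2) splits at the midpoint of the observed range, which need not be 0.5. Explicit breaks are the safer choice when a specific threshold matters.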


2.4 Exercises

Recall the US_births_2000_2014 data in the fivethirtyeight package:

library(fivethirtyeight)
data("US_births_2000_2014")

In the previous activity, we investigated the basic features of this data set:

dim(US_births_2000_2014)
## [1] 5479    6
head(US_births_2000_2014)
## # A tibble: 6 x 6
##    year month date_of_month       date day_of_week births
##   <int> <int>         <int>     <date>       <ord>  <int>
## 1  2000     1             1 2000-01-01         Sat   9083
## 2  2000     1             2 2000-01-02         Sun   8006
## 3  2000     1             3 2000-01-03         Mon  11363
## 4  2000     1             4 2000-01-04        Tues  13032
## 5  2000     1             5 2000-01-05         Wed  12558
## 6  2000     1             6 2000-01-06       Thurs  12466
names(US_births_2000_2014)
## [1] "year"          "month"         "date_of_month" "date"         
## [5] "day_of_week"   "births"
levels(factor(US_births_2000_2014$day_of_week))
## [1] "Sun"   "Mon"   "Tues"  "Wed"   "Thurs" "Fri"   "Sat"

Let’s graphically explore these variables and the relationships among them! NOTE: This set of exercises is inspired by the work of Randy Pruim for the MAA statPREP program.

First, let’s focus on 2014:
```
Only2014 <- subset(US_births_2000_2014, year==2014)
```
1. Construct a univariate visualization of births. Describe the variability in births from day to day in 2014.
2. The time of year might explain some of this variability. Construct a plot that illustrates the relationship between births and date in 2014. NOTE: Make sure that births, our variable of interest, is on the y-axis and treat date as quantitative.
3. One goofy thing that stands out is the 2-3 distinct groups of points. Add a layer to this plot that explains the distinction between these groups.
4. There are some exceptions to the rule in exercise 3, i.e. some cases that should belong to group 1 but behave like the cases in group 2. Explain why these cases are exceptions: what makes them special?
5. Next, consider all births from 2000-2014. Construct 1 graphic that illustrates birth trends across all of these years.
6. Finally, consider only those births that occur on Fridays:
```
OnlyFridays <- subset(US_births_2000_2014, day_of_week=="Fri")
```
Define a new variable fri13 that indicates whether the case falls on Friday the 13th, i.e. whether this Friday is the 13th day of the month:
```
OnlyFridays$fri13 <- (OnlyFridays$date_of_month==13)
```
Construct and comment on a plot that illustrates the distribution of births among Fridays that fall on & off the 13th. Do you see any evidence of superstition?

SOLUTIONS

#1
Only2014 <- subset(US_births_2000_2014, year==2014)
ggplot(Only2014, aes(x=births)) + 
    geom_histogram(color="white")

#2
ggplot(Only2014, aes(x=date, y=births)) + 
    geom_point()

#3
ggplot(Only2014, aes(x=date, y=births, color=day_of_week)) + 
    geom_point()

#4
#holidays!

#5
ggplot(US_births_2000_2014, aes(x=date, y=births, color=day_of_week)) + 
    geom_point()

#6
OnlyFridays <- subset(US_births_2000_2014, day_of_week=="Fri")
OnlyFridays$fri13 <- (OnlyFridays$date_of_month==13)
ggplot(OnlyFridays, aes(x=births, fill=fri13)) + 
    geom_density(alpha=0.5)

2.5 Extra

We’ve covered some basic graphics. However, different types of relationships require different visualization strategies. For example, there’s a geographical component to the election data. If you have time, try to construct some maps of the election-related variables. To this end, you’ll need to install the choroplethr and choroplethrMaps packages:

install.packages("choroplethr", dependencies=TRUE)
install.packages("choroplethrMaps", dependencies=TRUE)

library(choroplethr)
library(choroplethrMaps)

#to make a map of Trump support store `perrep_2016` as value
election$value <- election$perrep_2016
county_choropleth(election)

#a map of Trump wins
election$value <- election$winrep_2016
county_choropleth(election)

#a map of state color
election$value <- election$StateColor
county_choropleth(election)

#a map of percent white
election$value <- election$percent_white
county_choropleth(election)
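
The maps above work by overwriting the value column, because county_choropleth() expects a data frame with a region column (county FIPS code) and a value column (the quantity to map). An alternative sketch builds a small two-column data frame for each map instead of modifying election in place. The helper name as_choropleth_data is hypothetical, and the toy rows reuse the first 2 counties shown by head(election, 2):

```r
# hypothetical helper: keep region and copy the chosen column into value
as_choropleth_data <- function(data, column) {
    data.frame(region = data$region, value = data[[column]])
}

# toy stand-in for the election data (first 2 counties from the output above)
toy <- data.frame(
    region        = c(1001, 1003),
    perrep_2016   = c(0.7344, 0.7735),
    percent_white = c(76, 83)
)

map_data <- as_choropleth_data(toy, "percent_white")
map_data

# with the full data and choroplethr loaded, the map call would then be:
# county_choropleth(as_choropleth_data(election, "percent_white"))
```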