2 Data Visualization
Author: Alicia Johnson
The following data set on the 2016 election is stored as a csv file at https://www.macalester.edu/~ajohns24/data/IMAdata1.csv:
This data set combines the county-level election results provided by Tony McGovern (shared on GitHub), county-level demographic data from the df_county_demographics data set within the choroplethr R package, and historical information about red/blue/purple states. Let’s take a quick look:
#use read.csv() to import the csv file
election <- read.csv("https://www.macalester.edu/~ajohns24/data/IMAdata1.csv")
dim(election) #dimensions
## [1] 3143 34
head(election, 2) #first 2 rows
## region total_population percent_white percent_black percent_asian
## 1 1001 54907 76 18 1
## 2 1003 187114 83 9 1
## percent_hispanic per_capita_income median_rent median_age fips_code
## 1 2 24571 668 37.5 1001
## 2 4 26766 693 41.5 1003
## county total_2008 dem_2008 gop_2008 oth_2008 total_2012
## 1 Autauga County 23641 6093 17403 145 23909
## 2 Baldwin County 81413 19386 61271 756 84988
## dem_2012 gop_2012 oth_2012 total_2016 dem_2016 gop_2016 oth_2016
## 1 6354 17366 189 24661 5908 18110 643
## 2 18329 65772 887 94090 18409 72780 2901
## perdem_2016 perrep_2016 winrep_2016 perdem_2012 perrep_2012
## 1 0.2396 0.7344 TRUE 0.2658 0.7263
## 2 0.1957 0.7735 TRUE 0.2157 0.7739
## winrep_2012 polyname abb StateColor value IncomeBracket
## 1 TRUE alabama AL red TRUE low
## 2 TRUE alabama AL red TRUE high
names(election) #variable names
## [1] "region" "total_population" "percent_white"
## [4] "percent_black" "percent_asian" "percent_hispanic"
## [7] "per_capita_income" "median_rent" "median_age"
## [10] "fips_code" "county" "total_2008"
## [13] "dem_2008" "gop_2008" "oth_2008"
## [16] "total_2012" "dem_2012" "gop_2012"
## [19] "oth_2012" "total_2016" "dem_2016"
## [22] "gop_2016" "oth_2016" "perdem_2016"
## [25] "perrep_2016" "winrep_2016" "perdem_2012"
## [28] "perrep_2012" "winrep_2012" "polyname"
## [31] "abb" "StateColor" "value"
## [34] "IncomeBracket"
Now that we understand the structure of this data set, we can start to ask some questions:
- To what degree did Trump support vary from county to county?
- In how many counties did Trump win?
- What’s the relationship between Trump’s 2016 support and Romney’s 2012 support?
- What’s the relationship between Trump’s support and the “color” of the state in which the county exists?
Visualizing the data is the first natural step in answering these questions. Why?
- Visualizations help us understand what we’re working with: What are the scales of our variables? Are there any outliers, i.e. unusual cases? What are the patterns among our variables?
- This understanding will inform our next steps: What statistical tool / model is appropriate?
- Once our analysis is complete, visualizations are a powerful way to communicate our findings and tell a story.
2.1 ggplot
We’ll construct visualizations using the ggplot function in RStudio. Though the ggplot learning curve can be steep, its “grammar” is intuitive and generalizable once mastered. The ggplot plotting function is stored in the ggplot2 package:
library(ggplot2)
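As a rough illustration of the layered “grammar”, the sketch below stores a plot as an object and extends it one `+` at a time. The `toy` data frame and its columns `a` and `b` are made up for this example and are not part of the election data.

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2.1, 3.9, 6.2, 8.1, 9.8))

# Step 1: a plotting frame (data + aesthetic mapping, no layers yet)
frame <- ggplot(toy, aes(x = a, y = b))

# Step 2: add a layer of points to the stored frame
pts <- frame + geom_point()

# Step 3: add axis labels to the stored plot
labelled <- pts + labs(x = "a (made-up units)", y = "b (made-up units)")
```

Typing `labelled` at the console draws the plot. Storing intermediate objects is optional; the examples in these notes retype the full call at each step, which works equally well.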
The best way to learn about ggplot is to just play around. Don’t worry about memorizing the syntax. Rather, focus on the patterns and potential of their application. There’s a helpful cheat sheet for future reference:
2.2 Univariate visualizations
We’ll start with univariate visualizations.
Categorical Variables
Consider the categorical winrep_2016 variable, which indicates whether Trump won the county:
levels(factor(election$winrep_2016))
## [1] "FALSE" "TRUE"
A table provides a simple summary of the number of counties that fall into these 2 categories:
table(election$winrep_2016)
##
## FALSE TRUE
## 487 2625
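The two counts above sum to 3112, while `dim()` reported 3143 rows. One possible explanation (an assumption, not something these notes confirm) is that `winrep_2016` is missing for some counties: `table()` drops missing values unless asked otherwise. A small sketch on a made-up vector:

```r
# Hypothetical logical vector with one missing value (not the election data)
won <- c(TRUE, TRUE, FALSE, NA, TRUE)

# Default behaviour: NA is left out of the counts
counts_default <- table(won)

# useNA = "ifany" adds a count for missing values when there are any
counts_with_na <- table(won, useNA = "ifany")

# Direct count of the missing values
n_missing <- sum(is.na(won))
```

On the election data, `sum(is.na(election$winrep_2016))` would show whether missing values account for the gap.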
A bar chart provides a visualization of this table. Try out the code below that builds up from a simple to a customized bar chart. At each step determine how each piece of code contributes to the plot.
#set up a plotting frame
ggplot(election, aes(x=winrep_2016))
#add a layer with the bars
ggplot(election, aes(x=winrep_2016)) +
geom_bar()
#add axis labels
ggplot(election, aes(x=winrep_2016)) +
geom_bar() +
labs(x="Trump win", y="Number of counties")
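The bar chart is a picture of the table: `geom_bar()` tallies the cases in each category itself. If the counts have already been tabulated, `geom_col()` draws bars from them directly. The sketch below uses made-up data; `toy` and `tallied` are hypothetical names.

```r
library(ggplot2)

# Hypothetical outcomes for 7 made-up counties
toy <- data.frame(won = c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "TRUE"))

# geom_bar() counts the raw cases in each category
p_raw <- ggplot(toy, aes(x = won)) +
  geom_bar() +
  labs(x = "Win", y = "Number of counties")

# The same chart built from a pre-computed table of counts
tallied <- as.data.frame(table(won = toy$won))   # columns: won, Freq
p_tab <- ggplot(tallied, aes(x = won, y = Freq)) +
  geom_col() +
  labs(x = "Win", y = "Number of counties")
```

Both plots show the same bar heights; which form is more convenient depends on whether the data arrive as raw cases or as a summary table.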
Quantitative Variables
The quantitative perrep_2016 variable records Trump’s share of the vote in each county as a proportion between 0 and 1. Quantitative variables require different summary tools than categorical variables. We’ll explore 2 methods for graphing quantitative variables: histograms & density plots.
Histograms are constructed by (1) dividing up the observed range of the variable into ‘bins’ of equal width; and (2) counting up the number of cases that fall into each bin. Try out the code below.
#set up a plotting frame
ggplot(election, aes(x=perrep_2016))
#add a histogram layer
ggplot(election, aes(x=perrep_2016)) +
geom_histogram()
#add axis labels (perrep_2016 is a proportion, not a percent)
ggplot(election, aes(x=perrep_2016)) +
geom_histogram() +
labs(x="Trump vote share (proportion)", y="Number of counties")
#change the border colors
ggplot(election, aes(x=perrep_2016)) +
geom_histogram(color="white") +
labs(x="Trump vote share (proportion)", y="Number of counties")
#change the bin width
ggplot(election, aes(x=perrep_2016)) +
geom_histogram(color="white", binwidth=0.10) +
labs(x="Trump vote share (proportion)", y="Number of counties")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
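The two-step histogram recipe (divide the range into equal-width bins, then count the cases in each bin) can be sketched by hand in base R. The `share` values below are made up for illustration, and the bin width of 0.25 is an arbitrary choice.

```r
# Hypothetical vote shares (proportions), made up for illustration
share <- c(0.12, 0.18, 0.35, 0.41, 0.44, 0.58,
           0.61, 0.63, 0.67, 0.72, 0.74, 0.88)

# Step 1: divide the range [0, 1] into bins of equal width 0.25
breaks <- seq(0, 1, by = 0.25)
bins <- cut(share, breaks = breaks, include.lowest = TRUE)

# Step 2: count the number of cases that fall into each bin
bin_counts <- table(bins)

# hist() does the same bookkeeping; plot = FALSE returns the counts only
hist_counts <- hist(share, breaks = breaks, plot = FALSE)$counts
```

The heights of the histogram bars are exactly these counts, which is why changing `binwidth` can change the apparent shape of the distribution.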
Density plots are essentially smooth versions of the histogram. Instead of sorting cases into discrete bins, the “density” of cases is calculated across the entire range of values. The greater the number of cases, the greater the density! The density is then scaled so that the area under the density curve always equals 1 and the area under any fraction of the curve represents the fraction of cases that lie in that range.
#set up the plotting frame
ggplot(election, aes(x=perrep_2016))
#add a density curve
ggplot(election, aes(x=perrep_2016)) +
geom_density()
#add axis labels (perrep_2016 is a proportion, not a percent)
ggplot(election, aes(x=perrep_2016)) +
geom_density() +
labs(x="Trump vote share (proportion)")
#add a fill color
ggplot(election, aes(x=perrep_2016)) +
geom_density(fill="red") +
labs(x="Trump vote share (proportion)")
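The claim that the area under a density curve equals 1 can be checked numerically. The sketch below uses base R's `density()` on simulated values (not the election data) and approximates the area with the trapezoid rule. The simulated `share` values and the cutoff of 0.5 are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical simulated proportions, standing in for vote shares
share <- rbeta(500, 5, 3)

# Kernel density estimate: a grid of x values and the density y at each
dens <- density(share)

# Trapezoid rule: width of each grid step times the average height
widths  <- diff(dens$x)
heights <- (head(dens$y, -1) + tail(dens$y, -1)) / 2
total_area <- sum(widths * heights)

# Area to the right of 0.5 approximates the fraction of cases above 0.5
above <- dens$x[-1] > 0.5
area_above <- sum((widths * heights)[above])
frac_above <- mean(share > 0.5)
```

Here `total_area` comes out very close to 1, and `area_above` is close to (though not identical to) `frac_above`, since the curve is a smoothed version of the data.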
2.3 Visualizing Relationships
Consider the data on just 6 of the counties:
Before constructing graphics of the relationships among these variables, we need to understand what features these graphics should have. Without peeking at the exercises, challenge yourself to think about how we might graph the relationships among the following sets of variables:
- perrep_2016 vs perrep_2012
- perrep_2016 vs StateColor
- perrep_2016 vs perrep_2012 and StateColor (in 1 plot)
- perrep_2016 vs perrep_2012 and median_rent (in 1 plot)
Run through the following exercises which introduce different approaches to visualizing relationships.
Scatterplots of 2 quantitative variables
Each quantitative variable has an axis. Each case is represented by a dot.
#just a graphics frame
ggplot(election, aes(y=perrep_2016, x=perrep_2012))
#add a scatterplot layer
ggplot(election, aes(y=perrep_2016, x=perrep_2012)) +
geom_point()
#another predictor
ggplot(election, aes(y=perrep_2016, x=median_rent)) +
geom_point()
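When the two axes measure the same quantity in different years, one optional addition is a reference line where the two values are equal. This is a suggestion rather than part of the original activity, and the `toy` data below is made up.

```r
library(ggplot2)

# Hypothetical vote shares for 6 made-up counties
toy <- data.frame(share_2012 = c(0.30, 0.45, 0.55, 0.62, 0.71, 0.80),
                  share_2016 = c(0.28, 0.47, 0.60, 0.66, 0.78, 0.85))

# Scatterplot plus a dashed line where 2016 share equals 2012 share
p <- ggplot(toy, aes(x = share_2012, y = share_2016)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "2012 share", y = "2016 share")

# Points above the line are counties where the share grew
n_above <- sum(toy$share_2016 > toy$share_2012)
```

Points above the dashed line are cases where the 2016 value exceeds the 2012 value; points below it are cases where the value fell.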
Side-by-side plots of 1 quantitative variable vs 1 categorical variable
#density plots by group
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
geom_density()
#to see better: add transparency
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
geom_density(alpha=0.5)
#fix the color scale!
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
geom_density(alpha=0.5) +
scale_fill_manual(values=c("blue","purple","red"))
#to see better: split groups into separate plots
ggplot(election, aes(x=perrep_2016, fill=StateColor)) +
geom_density(alpha=0.5) +
facet_wrap( ~ StateColor) +
scale_fill_manual(values=c("blue","purple","red"))
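The unnamed color vector in `scale_fill_manual()` is matched to the groups by position, so it relies on the levels of StateColor being in the order blue, purple, red. A slightly more defensive variant names each color after the level it belongs to. The sketch below uses made-up data, and `state_palette` is a hypothetical name.

```r
library(ggplot2)

# Hypothetical vote shares for made-up counties in three kinds of states
toy <- data.frame(
  share = c(0.62, 0.70, 0.75, 0.81,
            0.25, 0.33, 0.41, 0.48,
            0.44, 0.50, 0.55, 0.61),
  StateColor = rep(c("red", "blue", "purple"), each = 4)
)

# Named palette: each color is tied to a level by name, not by position
state_palette <- c(red = "red", blue = "blue", purple = "purple")

p <- ggplot(toy, aes(x = share, fill = StateColor)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = state_palette)
```

With a named vector the order in which the colors are listed no longer matters, which guards against a mismatch if the levels are ever reordered.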
Scatterplots of 1 quantitative variable vs 1 categorical & 1 quantitative variable
If median_rent and StateColor both explain some of the variability in perrep_2016, why not include both in our analysis?! Let’s.
#scatterplot: id groups using color
ggplot(election, aes(y=perrep_2016, x=median_rent, color=StateColor)) +
geom_point(alpha=0.5)
#fix the color scale!
ggplot(election, aes(y=perrep_2016, x=median_rent, color=StateColor)) +
geom_point(alpha=0.5) +
scale_color_manual(values=c("blue","purple","red"))
#scatterplot: id groups using shape (no color scale needed: color is not mapped here)
ggplot(election, aes(y=perrep_2016, x=median_rent, shape=StateColor)) +
geom_point(alpha=0.5)
#scatterplot: split/facet by group
ggplot(election, aes(y=perrep_2016, x=median_rent, color=StateColor)) +
geom_point(alpha=0.5) +
facet_wrap( ~ StateColor) +
scale_color_manual(values=c("blue","purple","red"))
Plots of 3 quantitative variables
Let’s try visualizing 3 quantitative variables in a single plot.
#scatterplot: represent third variable using size
ggplot(election, aes(y=perrep_2016, x=median_rent, size=perrep_2012)) +
geom_point(alpha=0.1)
#scatterplot: represent third variable using color
ggplot(election, aes(y=perrep_2016, x=median_rent, color=perrep_2012)) +
geom_point(alpha=0.5)
#scatterplot: discretize the third variable into 2 groups & represent with color
ggplot(election, aes(y=perrep_2016, x=median_rent, color=cut(perrep_2012,2))) +
geom_point(alpha=0.5)
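To see what `cut(perrep_2012, 2)` does, it can help to run `cut()` on a few numbers. With a single number as its second argument, `cut()` splits the observed range into that many equal-width intervals; explicit `breaks` give control over where the split falls. The values and labels below are made up for illustration.

```r
# Hypothetical 2012 vote shares for 6 made-up counties
share_2012 <- c(0.20, 0.35, 0.48, 0.52, 0.70, 0.90)

# Two equal-width intervals spanning the observed range (0.20 to 0.90),
# so the split falls at the midpoint of that range, 0.55
two_groups <- cut(share_2012, 2)
table(two_groups)

# Explicit breaks and labels: split at 0.5 instead
majority <- cut(share_2012,
                breaks = c(0, 0.5, 1),
                labels = c("at most 0.5", "above 0.5"),
                include.lowest = TRUE)
table(majority)
```

In this example the automatic split puts 4 cases in the lower group and 2 in the upper, while the split at 0.5 puts 3 in each. Which version is more meaningful depends on the question being asked.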
2.4 Exercises
Recall the US_births_2000_2014 data in the fivethirtyeight package:
library(fivethirtyeight)
data("US_births_2000_2014")
In the previous activity, we investigated the basic features of this data set:
dim(US_births_2000_2014)
## [1] 5479 6
head(US_births_2000_2014)
## # A tibble: 6 x 6
## year month date_of_month date day_of_week births
## <int> <int> <int> <date> <ord> <int>
## 1 2000 1 1 2000-01-01 Sat 9083
## 2 2000 1 2 2000-01-02 Sun 8006
## 3 2000 1 3 2000-01-03 Mon 11363
## 4 2000 1 4 2000-01-04 Tues 13032
## 5 2000 1 5 2000-01-05 Wed 12558
## 6 2000 1 6 2000-01-06 Thurs 12466
names(US_births_2000_2014)
## [1] "year" "month" "date_of_month" "date"
## [5] "day_of_week" "births"
levels(factor(US_births_2000_2014$day_of_week))
## [1] "Sun" "Mon" "Tues" "Wed" "Thurs" "Fri" "Sat"
Let’s graphically explore these variables and the relationships among them! NOTE: This set of exercises is inspired by the work of Randy Pruim for the MAA statPREP program.
First, let’s focus on 2014:
Only2014 <- subset(US_births_2000_2014, year==2014)
1. Construct a univariate visualization of births. Describe the variability in births from day to day in 2014.
2. The time of year might explain some of this variability. Construct a plot that illustrates the relationship between births and date in 2014. NOTE: Make sure that births, our variable of interest, is on the y-axis and treat date as quantitative.
3. One goofy thing that stands out is the 2-3 distinct groups of points. Add a layer to this plot that explains the distinction between these groups.
4. There are some exceptions to the rule in exercise 3, i.e. some cases that should belong to group 1 but behave like the cases in group 2. Explain why these cases are exceptions: what explains the anomalies?
5. Next, consider all births from 2000-2014. Construct 1 graphic that illustrates births trends across all of these years.
6. Finally, consider only those births that occur on Fridays:
OnlyFridays <- subset(US_births_2000_2014, day_of_week=="Fri")
Define a new variable fri13 that indicates whether the case falls on Friday the 13th:
OnlyFridays$fri13 <- (OnlyFridays$date_of_month==13)
Construct and comment on a plot that illustrates the distribution of births among Fridays that fall on & off the 13th. Do you see any evidence of superstition?
SOLUTIONS
#1
Only2014 <- subset(US_births_2000_2014, year==2014)
ggplot(Only2014, aes(x=births)) +
geom_histogram(color="white")
#2
ggplot(Only2014, aes(x=date, y=births)) +
geom_point()
#3
ggplot(Only2014, aes(x=date, y=births, color=day_of_week)) +
geom_point()
#4
#holidays!
#5
ggplot(US_births_2000_2014, aes(x=date, y=births, color=day_of_week)) +
geom_point()
#6
OnlyFridays <- subset(US_births_2000_2014, day_of_week=="Fri")
OnlyFridays$fri13 <- (OnlyFridays$date_of_month==13)
ggplot(OnlyFridays, aes(x=births, fill=fri13)) +
geom_density(alpha=0.5)
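A numerical summary can complement the density plot in solution 6. The sketch below compares the median number of births on and off the 13th using `tapply()`. The `fridays` data frame holds made-up numbers, not the actual birth counts.

```r
# Hypothetical birth counts on eight Fridays (made-up numbers)
fridays <- data.frame(
  date_of_month = c(6, 13, 20, 27, 6, 13, 20, 27),
  births = c(12500, 11800, 12600, 12400, 12700, 11900, 12300, 12500)
)

# Flag the Fridays that fall on the 13th
fridays$fri13 <- fridays$date_of_month == 13

# Median births within each group (FALSE = other Fridays, TRUE = the 13th)
median_by_group <- tapply(fridays$births, fridays$fri13, median)
```

The same call on the real data would be `tapply(OnlyFridays$births, OnlyFridays$fri13, median)`. A difference in medians alone does not establish superstition, but it gives a number to set beside the plot.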
2.5 Extra
We’ve covered some basic graphics. However, different types of relationships require different visualization strategies. For example, there’s a geographical component to the election data. If you have time, try to construct some maps of the election-related variables. To this end, you’ll need to install the choroplethr and choroplethrMaps packages:
install.packages("choroplethr", dependencies=TRUE)
install.packages("choroplethrMaps", dependencies=TRUE)
library(choroplethr)
library(choroplethrMaps)
#to make a map of Trump support store `perrep_2016` as value
election$value <- election$perrep_2016
county_choropleth(election)
#a map of Trump wins
election$value <- election$winrep_2016
county_choropleth(election)
#a map of state color
election$value <- election$StateColor
county_choropleth(election)
#a map of percent white
election$value <- election$percent_white
county_choropleth(election)