Exploratory Data Analysis

Data Science Institute
Vanderbilt University



Week 5: Finishing up Distributions and Correlations

DSI-EDA

Professor: Cassy Dorff

Introduction

In weeks 3 and 4 of this course we learned how to visually inspect the distributions of univariate data using histograms and density plots. We also explored how boxplots can be useful for comparing the distribution of a variable across categories. We then learned about correlations and their corresponding scatterplots, and how to measure the strength of the linear relationship between two variables using Pearson’s correlation coefficient.

Today we will continue along these themes to wrap up this component of our course. First, I will show you a boxplot example that combines information from the geom_point() layer to present both boxplot and scatterplot information on a single graphic. On your own you will explore a graphic also commonly used to represent the shape of a distribution: the violin plot.

After this, we will revisit correlations by learning about correlograms (aka correlation matrices) and under what circumstances they might prove useful for our exploratory data analysis.

Distributions, again!

Today, for illustrative purposes, we will work with data from the gapminder package. The dataset contains information on life expectancy, GDP per capita, and population by country from 1952 to 2007 in increments of 5 years.

library(gapminder)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':	1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
?gapminder

# let's store the data under a shorter name
df <- gapminder

Let’s examine the distribution of variables across groups using boxplots. Recall from last class:

Below, let’s explore the relationship between life expectancy and continent. To do so, we specify x and y in the first layer and then add the geom_boxplot() layer.

# together:
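A minimal sketch of the boxplot described above, assuming continent goes on the x-axis and lifeExp on the y-axis. The object name p_box is only illustrative, and the version built in class may differ.

```r
library(gapminder)
library(ggplot2)

df <- gapminder

# First layer: map continent to x and life expectancy to y.
# Second layer: draw one boxplot per continent.
p_box <- ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_boxplot()

p_box
```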

FYI, we can flip the axes with coord_flip(). This might increase legibility, particularly if some of your distributions exhibit high spread.

# together: 
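The same boxplot with flipped axes might look like the sketch below. It assumes the continent and lifeExp mapping used for the boxplot; p_flip is an illustrative name.

```r
library(gapminder)
library(ggplot2)

df <- gapminder

# coord_flip() swaps the x and y axes after the plot is built,
# so the boxplots are drawn horizontally.
p_flip <- ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  coord_flip()

p_flip
```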

We can overlay the original observations on the boxplot using geom_point(). This illustrates how many observations are included in each group. Make sure to specify outlier.shape = NA within geom_boxplot() so that the outliers are not drawn by the boxplot layer, which would be confusing since the geom_point() layer already plots every observation.

# together:
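A sketch of the overlay, assuming the same continent and lifeExp mapping. Setting outlier.shape = NA hides the boxplot's own outlier markers so that each observation appears only once, through geom_point(). The name p_points is illustrative.

```r
library(gapminder)
library(ggplot2)

df <- gapminder

p_points <- ggplot(df, aes(x = continent, y = lifeExp)) +
  # Hide the boxplot's outlier markers; geom_point() draws every observation.
  geom_boxplot(outlier.shape = NA) +
  geom_point()

p_points
```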

This is confusing because there is a lot of overplotting when we add all of the original observations to the plot. We can add jitter to the points, a method we’ve seen before. Recall that this just adds a small random value (i.e. noise) to each point in order to ‘jitter’ the points away from one another and increase the graph’s interpretability. We can control the horizontal spread with the width argument. We can also decrease the opacity of the points. The plot below shows that while Oceania has the highest median life expectancy, this value is based on far fewer observations than for the other continents.

# together:
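One way to sketch the jittered version is shown below. The width and alpha values are arbitrary choices for illustration, not the ones used in class, and the dplyr count is added only to check how many observations each continent contributes.

```r
library(gapminder)
library(ggplot2)
library(dplyr)

df <- gapminder

p_jitter <- ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.shape = NA) +
  # width controls the horizontal spread of the noise;
  # alpha makes the points semi-transparent.
  geom_jitter(width = 0.2, alpha = 0.3)

p_jitter

# Number of observations behind each boxplot
n_by_continent <- count(df, continent)
n_by_continent
```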

Though that looks quite nice, we might not think this is useful enough for our purposes. Perhaps we have more questions about the distributions and want other ways to inspect their shape. One option is to turn to violin plots. Violin plots show the range of the distribution (and often a marker for the median) together with a kernel density estimate mirrored on each side. As this excellent post describes, sometimes you want to know ‘where’ the data is: are most values clustered around the median? Wider sections of the violin represent values where the estimated density is higher, that is, values that members of the population are more likely to take.

Practice 1 Please try to re-create the plot shown on the projector in class.

Hints:

# students:
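As a possible starting point, here is a basic violin plot of the same data. This is only a sketch and is not necessarily the figure shown in class. The narrow boxplot inside each violin is an optional addition that marks the median, and the name p_violin is illustrative.

```r
library(gapminder)
library(ggplot2)

df <- gapminder

p_violin <- ggplot(df, aes(x = continent, y = lifeExp)) +
  # Mirrored kernel density estimate for each continent
  geom_violin() +
  # Narrow boxplot inside each violin to mark the median and quartiles
  geom_boxplot(width = 0.1, outlier.shape = NA)

p_violin
```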

Correlation matrix

As mentioned in the introduction, there is another way in which we might want to visualize correlations. Specifically, it can be useful to look for patterns in our data across multiple variables, rather than relying on the more narrow specification of a bivariate scatterplot.

A correlogram (or correlation matrix) allows us to analyse the relationship between each pair of numeric variables in a dataset. It gives a quick overview of many relationships in the data.

# together:
library(corrgram)
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
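A sketch of a correlogram for the gapminder data, assuming we pass the whole data frame to corrgram(). The panel functions chosen here (panel.shade, panel.pie, panel.txt) are just one combination offered by the package, and the call used in class may differ. The cor() call is included only to show the numbers behind the picture.

```r
library(gapminder)
library(corrgram)

df <- gapminder

# Correlations among the numeric columns only
num_vars <- df[sapply(df, is.numeric)]
cor_mat <- cor(num_vars)
round(cor_mat, 2)

# Correlogram: shaded cells below the diagonal, pie glyphs above it,
# variable names on the diagonal.
corrgram(df,
         order = TRUE,
         lower.panel = panel.shade,
         upper.panel = panel.pie,
         text.panel = panel.txt)
```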

Practice 2

Look up the help file for corrgram, and change the arguments to produce different results. For example, change ‘order’ to NULL and upper.panel to NULL. What data does the plot automatically drop? What do the lines and colors mean?

# students:

The above example is a bit awkward because the dataset has so few numeric variables. Note that there are many ways of creating correlograms in R. Another option is the ggcorrplot package, installed below from GitHub, which we will use to explore a familiar dataset.

devtools::install_github("kassambara/ggcorrplot")
## Skipping install of 'ggcorrplot' from a github remote, the SHA1 (58cfb3a8) has not changed since last install.
##   Use `force = TRUE` to force installation
library(ggcorrplot)

First, for the sake of illustration, let’s look at the correlations in an easy, familiar dataset, mtcars. Below we simply create a correlation matrix of all the variables in mtcars using the cor() function, which we went over last time, and round it to one decimal place.

# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)

Below, we can make a basic correlogram using this same numeric information.

# Plot
ggcorrplot(corr)

Just by reading a bit more about this function, we can make this much more visually useful. First, we can pick colors that are a bit easier on the eyes and supply them through the colors argument. We can also choose type = "lower" so that the graph only prints the lower triangle of correlations. Then we can change the plot to represent each correlation as a circle whose size depends on the magnitude of the correlation, using the method = "circle" argument. Try to do all of this below to reproduce the figure shown in class.

Practice 3 Reproduce the graphic shown on the projector in class.

# students
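One possible approach combines the three arguments described above. The hex colors are an arbitrary choice for illustration, so the result will not necessarily match the figure shown in class; the name p_corr is illustrative as well.

```r
library(ggplot2)
library(ggcorrplot)

data(mtcars)
corr <- round(cor(mtcars), 1)

p_corr <- ggcorrplot(corr,
                     method = "circle",  # circle size reflects the magnitude
                     type = "lower",     # show only the lower triangle
                     colors = c("#6D9EC1", "white", "#E46726"))  # low, mid, high

p_corr
```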
