Exploratory Data Analysis

Data Science Institute
Vanderbilt University


R Notes

These notes accompany the .Rmd lecture from week 11 day 1 on applying k-means clustering to political data.

These are a few notes to round out our lecture. You can complete the rest of the ‘practice’ examples on your own. The important things here are that you (a) practice ‘reading’ code and learning from ‘working examples’ and (b) understand how you can de-clutter a graphic by manipulating the vector of labels.

Practice

PRACTICE 1

The plot above is super bonkers! Can you improve those labels?

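# HINT:

library(ggplot2)

# Toy data: 26 two-letter labels ("aa", "bb", ..., "zz") with random coordinates
data = data.frame(
  cntries=paste0(letters,letters),
  x = rnorm(length(paste0(letters,letters))),
  y = rnorm(length(paste0(letters,letters))),
  stringsAsFactors=FALSE
)

# Pick 10 labels at random to keep (for illustration only)
keep = paste0(letters,letters)[sample(1:nrow(data), 10)]

# Blank out every label that is not in keep
data$cntries[which(!data$cntries %in% keep)] = ""

# Points with a blank label draw no text
ggplot(data,
       aes(x=x, y=y)) +
  geom_text(aes(label=cntries))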

Note: the code above first creates a simple data.frame, purely to provide a working example. You are familiar with the rnorm function, and the paste0 function is fairly easy to interpret once you look it up. Run this code, and examine the data.
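To make "examine the data" concrete, here is a minimal sketch that rebuilds the toy data frame and looks at it. The set.seed() call and the name cntry_labels are assumptions added only so the block is repeatable and easy to read; they are not part of the lecture code.

```r
# Assumption: a fixed seed, added only so the random draws are repeatable.
set.seed(1)

# paste0() glues its arguments together element by element with no separator,
# so paste0(letters, letters) gives "aa", "bb", ..., "zz".
cntry_labels <- paste0(letters, letters)

data <- data.frame(
  cntries = cntry_labels,
  x = rnorm(length(cntry_labels)),  # one random x per label
  y = rnorm(length(cntry_labels)),  # one random y per label
  stringsAsFactors = FALSE
)

head(cntry_labels, 3)  # first three labels
str(data)              # column types and dimensions
head(data)             # first six rows
```

Checking str() before plotting is a quick way to confirm that cntries is a character vector rather than a factor, which matters when you later overwrite some entries with "".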

Next, we create a new vector called keep. The key here is that we are grabbing 10 random labels to keep. This is for ILLUSTRATION only; in your actual k-means analysis, you’d want to justify the labels you choose to show on your plot.
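One way to make the choice of labels defensible is to pick them by a rule instead of at random. The sketch below keeps the five points farthest from the centre of the point cloud. The rule, the cutoff of five, and the name dist_from_centre are all assumptions chosen for illustration; your own analysis may call for a different criterion.

```r
# Assumption: a fixed seed, added only so the random draws are repeatable.
set.seed(1)

data <- data.frame(
  cntries = paste0(letters, letters),
  x = rnorm(26),
  y = rnorm(26),
  stringsAsFactors = FALSE
)

# Euclidean distance of each point from the mean of x and the mean of y.
dist_from_centre <- sqrt((data$x - mean(data$x))^2 +
                         (data$y - mean(data$y))^2)

# Labels of the five most outlying points (hypothetical rule and cutoff).
keep <- data$cntries[order(dist_from_centre, decreasing = TRUE)[1:5]]

# Blank out every label that is not in keep.
data$cntries[!data$cntries %in% keep] <- ""
```

A rule like this gives you a sentence you can put in a caption, for example "labels shown for the five most extreme observations."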

Finally, we subset the vector in our data that holds the labels, in this case cntries, and set every entry that is not in keep to an empty string. This is the heart of the working example: the goal is to show how you can keep some labels while setting the others to ‘blank.’
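A variation on the same idea, sketched here under the assumption that you would rather not overwrite the original column: build a separate label column with ifelse(). The column name label is hypothetical.

```r
# Assumption: a fixed seed, added only so the random draws are repeatable.
set.seed(1)

data <- data.frame(
  cntries = paste0(letters, letters),
  x = rnorm(26),
  y = rnorm(26),
  stringsAsFactors = FALSE
)

# Ten labels chosen at random, as in the hint.
keep <- sample(data$cntries, 10)

# New column: the name where it is in keep, an empty string everywhere else.
# The original cntries column is left untouched.
data$label <- ifelse(data$cntries %in% keep, data$cntries, "")

table(data$label != "")  # counts of blank versus kept labels
```

You could then map the new column in the plot with aes(label = label), and the full set of names is still available in cntries if you change your mind about which ones to show.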

# practice
library(ggplot2)
library(ggrepel)  # provides geom_text_repel()

# polData74 and its clusters3 column are assumed to come from the lecture .Rmd
polData74_labels<-polData74

# Countries whose labels we want to keep
vec = c("Mexico", "Sweden", "Ecuador", "Burkina Faso", "Iran", "Syria", "Pakistan", "United States")

# Convert to character so that entries can be overwritten with ""
polData74_labels$X<-as.character(polData74_labels$X)
polData74_labels$X[which(!polData74_labels$X %in% vec)] = ""
polData74_labels$X

ggplot(polData74_labels, aes(x=TYPE3, y=XROPEN, label = X, color = clusters3)) +
    geom_jitter() +
    geom_text_repel(hjust = 0, nudge_y = -.6, segment.alpha=.2, show.legend=FALSE, size=3,
                    segment.size =.3) +
    labs(x="Internal Conflict", y="Closed v. Open Leadership Change") +
    theme_bw()

The code above could use further tweaking for an even more compelling graphic, but it demonstrates how you can minimize the amount of noise in your plot by choosing only a few labels to print.