Exploratory Data Analysis

Data Science Institute, Vanderbilt University
Week 9 Day 2: K-means Clustering

Cassy Dorff

10/17/2019

Introduction

K-means clustering takes your unlabeled data and a constant, k, and partitions the observations into k clusters, assigning each observation to the cluster whose mean (centroid) is nearest. It automatically groups together things that look alike so you don't have to!

Let's begin with the USArrests data. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percentage of the population living in urban areas in each state.

data("USArrests")
head(USArrests, n = 3)

Variables are often scaled (i.e., standardized) before measuring the inter-observation dissimilarities. Scaling is recommended when variables are measured on different scales (e.g., kilograms, kilometers, centimeters).

Should I scale? Ask yourself: do I have two features, one where the differences between cases are large and one where they are small? Am I willing to let the former be almost the only driver of distance? If the answer to the first question is yes and the answer to the second is no, you will likely need to scale your data.

See more on scaling and k-means here!

# all values are numeric, so we can standardize every column
df <- scale(USArrests)

# drop any rows with missing values (USArrests has none, but it is good practice)
df <- na.omit(df)

Can you tell what actions the scale() function performed with default inputs? scale() calculates the mean and standard deviation of each column, then, for each observation, subtracts the column mean and divides by the column standard deviation. It's especially useful when your dataset contains variables with widely varying scales (as highlighted above).
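
As a quick sanity check (a small sketch, not part of the original lesson), you can reproduce what scale() does by hand for one column and confirm that the results match:

# manually standardize the Murder column and compare it to scale()'s output
manual_murder <- (USArrests$Murder - mean(USArrests$Murder)) / sd(USArrests$Murder)
all.equal(as.numeric(df[, "Murder"]), manual_murder)  # TRUE if the two approaches agree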

Within R it is simple to compute and visualize the distance matrix using the functions get_dist and fviz_dist from the factoextra R package.

library(stats)
library(tidyverse)
#install.packages("factoextra")
library(factoextra)

distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

Questions

Running the algorithm

Below we will first set the seed before running the algorithm. Then, take note of the arguments for kmeans.

We will use kmeans() on our df object and specify k as equal to 2. We can also change nstart to something like 25, since k-means is sensitive to its random starting positions. This means that R will try 25 different random starting assignments and then select the best result, i.e., the one with the lowest within-cluster variation.

# compute k-means with k = 2 (nstart = 25 tries different random starts)
set.seed(1234)
k2 <- kmeans(df, centers = 2, nstart = 25)

#Print the results
print(k2)

The primary printed output tells us our analysis resulted in two groups with cluster sizes of 30 and 20. It also gives us a matrix of cluster means, where the rows are the cluster numbers (in this case 1 and 2) and the columns are the variables in the data.

The output also gives us a clustering vector that tells us which cluster each data point is assigned to.

The "within cluster sum of squares by cluster" output is useful as a rough measure of goodness of fit for k-means. SS stands for Sum of Squares. Ideally you want a clustering with internal cohesion and external separation, so the between sum of squares / total sum of squares ratio should approach 1 if the model fits well. In other words, in this case we can interpret 47.5% as the amount of variance explained by our analysis (so we might want to try more than 2 clusters, but remember you want them to remain 'interpretable'!). Recall: sum of squares = the squared distance of each point from its centroid, summed over all points.
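
If you would rather pull these quantities out of the fitted object than read them off the printed summary, kmeans() stores them as named components:

# cluster sizes, assignments, and the between-SS / total-SS ratio from the fitted object
k2$size                  # number of observations in each cluster
head(k2$cluster)         # cluster assignment for the first few states
k2$betweenss / k2$totss  # about 0.475, matching the 47.5% discussed above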

We can also view our results by using fviz_cluster. This provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components, which explain the majority of the variance. (We will learn more about PCA later on.) Why is this needed? Because visualizing anything above 3D (and often even above 2D) is very hard, and it is usually also difficult for human beings to interpret.

fviz_cluster(k2, data = df)

Stop and discuss: What might be driving these clusters, or the classes produced by the algorithm? Brainstorm ideas.

Wonderfully, you can also use pairwise scatter plots to illustrate the clusters against the original variables, which can aid in interpretation.

Practice

df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster, state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) + geom_text()



Stop here and put up a blue sticky note if you have completed the graphic above.

On your own:

Practice

Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results. We can execute the same process for 3, 4, and 5 clusters, and then plot the results in one figure, as in the sketch below.
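
One way to do this is sketched here; gridExtra is assumed purely as a convenient way to arrange the four plots, and any layout package would work.

# fit k-means for k = 3, 4, and 5 on the same scaled data
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# one plot per fit, arranged in a 2 x 2 grid
p2 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p3 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p4 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p5 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")

library(gridExtra)  # assumed installed: install.packages("gridExtra")
grid.arrange(p2, p3, p4, p5, nrow = 2)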





We could also look at the output for each of these to assess how well the clusters are mapping onto the data.
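
If you fit k3, k4, and k5 as in the sketch above (those objects come from that sketch, not from output shown in the original lesson), one quick way to compare the fits is the explained-variance ratio of each:

# between-SS / total-SS for each fit (k3, k4, k5 come from the sketch above)
sapply(list(k2 = k2, k3 = k3, k4 = k4, k5 = k5), function(fit) fit$betweenss / fit$totss)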

Determining Optimal Clusters

While the basic introduction to the methods above will help you explore your data in a preliminary way, there are three common methods for determining the number of clusters (k) to use in the analysis.

Recall that in general, our goal is to (a) define clusters such that the total intra-cluster variation (known as total within-cluster variation) is minimized and (b) be able to interpret our results. We will focus only on the Elbow method here, though you may explore the others on your own time.

Fortunately, the process to compute the "Elbow method" has been wrapped up in a single function (fviz_nbclust). The plot below shows the within-cluster sum of squares on the Y axis (recall this output is easily readable above from our first run of the algorithm) and the number of clusters on the X axis.

The 'elbow point' is where the within-cluster sum of squares (WCSS) stops decreasing significantly with each additional cluster. We can generally interpret this number as an ideal number of clusters (as moderated by your interpretation of the data as well).

set.seed(123)

fviz_nbclust(df, kmeans, method = "wss")
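
To demystify what fviz_nbclust wraps, here is a rough manual version of the same elbow computation (a sketch only; fviz_nbclust's own defaults may differ slightly):

# compute the total within-cluster sum of squares by hand for k = 1 through 10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")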

Discuss

What is the optimal number of clusters?

Using the code above, re-run the algorithm with 4 clusters and plot the results (nothing new here, just to come full circle)! One possible answer is sketched below.
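
A minimal sketch of one possible answer, reusing the earlier workflow (the object name final is just illustrative):

# final model with k = 4, plotted the same way as before
set.seed(123)
final <- kmeans(df, centers = 4, nstart = 25)
fviz_cluster(final, data = df)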





Discuss

What kind of analysis might you do next? Why? What kind of questions does this exploratory exercise raise?