Week 6 Day 1: Time Series Data

Introduction

Today we will focus on time series data. Time series data consists of any metric that is measured over regular time intervals. Examples include weather data, local crime statistics, stock prices, industry forecasts, and the number of protest events.

Note: you could take an entire course on time series data and modeling! This is only an introduction, but it will plant the seeds for future learning on the topic and help you understand why ‘accounting for time’, even if only visually, is important. To begin, let’s explore the ever-intuitive line plot using familiar data.

Line plots

Line plots are particularly useful for exploring time series data. We will continue working with the gapminder data.

library(gapminder)
library(ggplot2)
str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

df <- gapminder

To begin, let’s say we wanted to know a bit more about GDP development in China. Below, we will graph the GDP per capita development of China from 1952 to 2007. We select the data for China by using the subset() function on the original data frame.

ggplot(subset(df, country == "China"), aes(x = year, y = gdpPercap)) +
  geom_line()
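
To see what subset() hands to ggplot(), it can help to inspect the filtered data frame directly. The following is a minimal sketch that assumes the gapminder package is installed; the object names china and china_idx are only illustrative.

```r
library(gapminder)

# Keep only the rows where country is "China"
china <- subset(gapminder, country == "China")

# One row per observed year: 1952 to 2007 in five-year steps
nrow(china)
range(china$year)

# Logical indexing selects the same rows
china_idx <- gapminder[gapminder$country == "China", ]
identical(china$gdpPercap, china_idx$gdpPercap)
```

Because each country has only a handful of observations, adding geom_point() to a line plot is a quick way to show where the line is actually supported by data.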

We can add points to the line to highlight which observations are available in the underlying data.

ggplot(subset(df, country == "China"), aes(x = year, y = gdpPercap)) +
  geom_line() +
  geom_point()

China and India both have very large populations that have grown rapidly over the last few decades. What if we wanted to also check out whether a similar trend exists for India?

ggplot(subset(df, country == "India"), aes(x = year, y = gdpPercap)) +
  geom_line() +
  geom_point()

Now that we are exploring comparisons, let’s compare the data across multiple countries over time.

Go ahead and make a plot to compare the GDP per capita development of the BRICS countries (Brazil, Russia, India, China, and South Africa, a grouping of five major emerging national economies). Unfortunately, Russia is not part of the gapminder data, so we cannot display it in the plot.

Practice 1

Please create a publication-ready graph that can be printed using grayscale. Keep in mind you should distinguish the different lines easily from one another.

# students:

In cases like this, we might want to facet the plot by a single variable, using facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete. Below, we use facet_wrap(~country) to inspect the trend in each country separately.

ggplot(subset(df, country %in% c("Brazil", "Russia", "China", "India", "South Africa")),
       aes(x = year,
           y = gdpPercap)) +
  geom_line(alpha = 0.5) +
  geom_point(alpha = 0.8,
             size = 0.4) +
  theme_light() +
  labs(title = "GDP per capita in BRICS countries",
       x = "Year",
       y = "GDP per capita") +
  facet_wrap(~country, nrow = 1) +
  scale_x_continuous(breaks = seq(1950,2000, 25))
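
The point that a formula is a data structure can be checked directly in base R. This small sketch (the name f is just for illustration) builds the same one-sided formula passed to facet_wrap() above and inspects it.

```r
# A one-sided formula: a tilde followed by a variable name
f <- ~country

class(f)     # the object is a formula, not a character string
all.vars(f)  # the variable name stored inside the formula
```

Nothing is evaluated when the formula is created; facet_wrap() later looks up the named variable in the data supplied to ggplot().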

Which plot do you like better? Facet wrap, or everything on the same graph? What are the pros and cons?

Practice 2

We may also want to look at how the population of the BRICS countries changed over time. Try to produce a similar graph for population, but use facet_grid() instead of facet_wrap(). You might want to check the help file for facet_grid(), because its arguments differ slightly from those of facet_wrap().

Hint: consider adding scale_y_continuous(breaks = seq(0, 1500000000, 300000000), labels = function(x){paste0(x/1000000, ' million')}) to control the breaks and labels of your y-axis.
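
To see what the labels argument in the hint does, you can run the labelling function on the breaks outside of any plot. This is only a sketch: the name to_millions is hypothetical, and the function body is the one from the hint.

```r
# Breaks from the hint: 0 to 1.5 billion in steps of 300 million
breaks <- seq(0, 1500000000, 300000000)

# The labelling function from the hint, stored under a name
to_millions <- function(x) {
  paste0(x / 1000000, " million")
}

# ggplot2 calls the function on the break values and prints the result
to_millions(breaks)
```

The scale passes the numeric break values to the function and uses the returned character vector as axis labels, so any function that maps numbers to strings will work here.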

# students:

Spaghetti plots

Spaghetti plots are line plots with many lines displayed in a single plot. Typically, the lines are very thin and/or highly transparent so that overall trends in the data show through. Below, we graph the evolution of life expectancy for all countries in the data set. We give each country its own line by setting the group aesthetic inside aes(). We will highlight one country, South Africa, in red.

brics <- c("Brazil", "Russia", "China", "India", "South Africa")
ggplot(df, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.2, 
            size = 0.1) +
  labs(title = "Life expectancy over time",
       x = "Year",
       y = "Life expectancy") +
  theme_light() +
  geom_line(data = subset(df, country %in% "South Africa"),
            color = "red")
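
The role of the group aesthetic is easier to see on a tiny data set. The sketch below uses a hypothetical toy data frame (not gapminder) and assumes ggplot2 is installed; layer_data() is used only to count how many separate lines ggplot2 will draw.

```r
library(ggplot2)

# Hypothetical toy data: two units, each observed in three years
toy <- data.frame(
  unit  = rep(c("A", "B"), each = 3),
  year  = rep(c(2000, 2005, 2010), times = 2),
  value = c(1, 2, 3, 6, 5, 4)
)

# Without group, every point is joined into a single line
p_single <- ggplot(toy, aes(x = year, y = value)) +
  geom_line()

# With group = unit, each unit gets its own line
p_grouped <- ggplot(toy, aes(x = year, y = value, group = unit)) +
  geom_line()

# layer_data() exposes the group id ggplot2 assigned to each row
length(unique(layer_data(p_single)$group))
length(unique(layer_data(p_grouped)$group))
```

Printing p_single shows the zigzag that results when observations from different units are connected as if they were one series.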

Practice 3

We can also display the temporal trend for select countries using color. Below, make a spaghetti plot similar to the one above, but this time highlight the four BRICS countries available in the data (Brazil, India, China, and South Africa). This is a good example of when and why you might use multiple layers of the same geom!

Note: How does scale_color_brewer work? How does it ‘know’ where to apply color?

# students:

In the plot below, we set trans = "log" inside the scale_color_gradient() function. This maps the color of the lines to the natural log of GDP per capita, while the legend still displays GDP per capita values on their original scale. Why is it useful to apply a log transformation here? Discuss and interpret this plot.

ggplot(df, aes(x = year, y = lifeExp, group = country, color = gdpPercap)) +
  # Setting up spaghetti plot
  geom_line(alpha = 1, size = 0.1) +
  
  # Each continent in a separate panel
  facet_wrap(~continent, nrow = 1) +
  
  # Choosing default theme
  theme_light() +
  
  # Adjusting default theme
  theme(panel.background = element_rect(fill = "black"),
        panel.grid = element_line(size = 0.1),
        strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white")) +
  
  # Making the colors pop
  scale_color_gradient(low = "#f7ff00",
                      high = "#00f7ff",
                      name = "GDP per capita",
                      trans = "log") +
  
  # Additional appearance adjustments
  labs(x = "Year",
       y = "Life expectancy",
       title = "Global life expectancy") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1.5, "cm")) +
  scale_x_continuous(breaks = seq(1950, 2010, 25))
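
One way to start thinking about the log-transformation question is to compare gaps on the raw and log scales. The values below are illustrative only (they are not taken from gapminder); each one is ten times the previous one.

```r
# Illustrative values (not from gapminder), each ten times the last
gdp <- c(100, 1000, 10000, 100000)

# On the raw scale the gaps grow tenfold at each step
diff(gdp)

# On the natural-log scale every gap is the same size, log(10)
diff(log(gdp))
```

On a linear color scale, the few very large values would use up most of the gradient; on the log scale, equal ratios get equal shares of it.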

Why time matters

Understanding time often tells us a lot about changes in trajectories across different cases in our data. Later in your education as a data scientist, you will also learn about the ways in which dimensions of the data correlate over time.

Let’s consider the eu_stocks dataset (available in R by default as EuStockMarkets). This dataset contains daily closing prices of major European stock indices from 1991 to 1998, specifically for Germany (DAX), Switzerland (SMI), France (CAC), and the UK (FTSE). The data were observed when the markets were open, so there are no observations on weekends and holidays.

# EuStockMarkets ships with base R's datasets package, so nothing needs to be installed
data("EuStockMarkets")
eu_stocks <- EuStockMarkets

Below, explore the ts() and is.ts() functions from the R stats package, as well as the help files that describe what a time series object is.

Explore: What is the ‘frequency’ argument of a time series object? Write your answer as a comment below.

is.ts(eu_stocks)

## [1] TRUE

start(eu_stocks)

## [1] 1991  130

end(eu_stocks)

## [1] 1998  169

frequency(eu_stocks)

## [1] 260

The is.ts() call above returns TRUE, so these data are already stored as a time series object. The ts() function converts a numeric vector into an R time series object. The format is ts(vector, start=, end=, frequency=), where start and end are the times of the first and last observations and frequency is the number of observations per unit of time (1 = annual, 4 = quarterly, 12 = monthly, etc.).
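
The ts() format described above can be tried on a small made-up series. In this sketch, monthly is a hypothetical object holding 24 placeholder values; the last two lines show how the start value reported for EuStockMarkets (observation 130 of 260 in 1991) translates into a decimal year.

```r
# Hypothetical monthly series: 24 values starting in January 2020
monthly <- ts(1:24, start = c(2020, 1), frequency = 12)

start(monthly)      # year and month of the first observation
end(monthly)        # year and month of the last observation
frequency(monthly)  # observations per year

# EuStockMarkets: observation 130 of 260 in 1991 as a decimal year
time(EuStockMarkets)[1]
1991 + (130 - 1) / 260
```

Note that end is not supplied in the call to ts(); R works it out from start, frequency, and the length of the vector.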

# R recognizes when the x argument is a ts object and actually calls the plot.ts() function under the hood
plot(eu_stocks)

ts.plot(eu_stocks, col = 1:4, xlab = "Year", ylab = "Index Value", main = "Major European Stock Indices, 1991-1998")
legend("topleft", colnames(eu_stocks), lty = 1, col = 1:4, bty = "n")

To create a similar plot in ggplot2, we can use the handy ggfortify package! ggfortify lets ggplot2 know how to interpret ts objects. After loading ggfortify, you can use the ggplot2::autoplot() function on ts objects.

library(ggfortify)
autoplot(eu_stocks) + theme_bw()

# Note: what kind of data object does autoplot() use?
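
For comparison, here is a rough sketch of building a similar plot by hand, without ggfortify. It is not how ggfortify is implemented; it only illustrates that ggplot2 needs a data frame, so the ts matrix is reshaped into long format first. The names eu, n, and eu_long are illustrative, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

eu <- EuStockMarkets
n  <- nrow(eu)

# Reshape the ts matrix into long format: one row per index per day
eu_long <- data.frame(
  time  = rep(as.numeric(time(eu)), times = ncol(eu)),
  index = rep(colnames(eu), each = n),
  value = as.vector(eu)  # column-major order matches the rep() calls
)

ggplot(eu_long, aes(x = time, y = value, color = index)) +
  geom_line() +
  theme_bw()
```

The time column holds decimal years taken from time(), which is why the x-axis can be read as calendar time.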

For more illustration:

set.seed(123)
# a sequence from 1 to 100, shifted up by 10 to help avoid negatives, plus some random normal noise
t <- seq(from = 1, to = 100, by = 1) + 10 + rnorm(100, sd = 7)
plot(t)

This could potentially represent some time series, with time represented along the x-axis. However, it’s hard to tell. The x-axis is simply an index from 1 to 100 in this case.

A vector object such as t above can easily be converted to a time series object using the ts() function. Let’s assume our generated data is quarterly data that starts in the first quarter of 2000.

tseries <- ts(t, start = c(2000, 1), frequency = 4)
print(tseries)

##            Qtr1       Qtr2       Qtr3       Qtr4
## 2000   7.076670  10.388758  23.910958  14.493559
## 2001  15.905014  28.005455  20.226413   9.144571
## 2002  14.192030  16.880366  29.568573  24.518697
## 2003  25.805400  24.774779  21.109112  38.508392
## 2004  30.484953  14.233680  33.909491  26.690460
## 2005  23.525234  30.474176  25.817969  28.897761
## 2006  30.624725  24.193147  42.864509  39.073612
## 2007  31.033041  48.776704  43.985250  39.934500
## 2008  49.265880  50.146934  50.751068  50.820482
## 2009  50.877424  47.566618  46.858261  47.336703
## 2010  46.137051  50.544579  44.142226  69.182692
## 2011  63.455734  48.138240  54.179806  54.733413
## 2012  64.459756  59.416417  62.773230  61.800173
## 2013  62.699907  73.580216  63.419603  76.615294
## 2014  56.158730  72.092296  69.866980  71.511591
## 2015  73.657476  68.483736  70.667548  66.869972
## 2016  67.497461  78.124700  80.137468  78.371030
## 2017  85.455872  94.350593  77.562782  65.835818
## 2018  90.040170  79.035595  80.183940  93.179000
## 2019  85.006589  79.454976  90.269124  89.027760
## 2020  91.040349  94.696963  90.405380  98.510636
## 2021  93.456594  98.322474 104.677873 101.046270
## 2022  96.718479 108.041653 107.954527 105.838779
## 2023 104.671122  99.604657 114.524567 101.798183
## 2024 122.311331 118.728274 107.350097 102.815054

plot(tseries)
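
Once the vector is a ts object, observations can be addressed by date rather than by position. The sketch below rebuilds tseries exactly as above so that it runs on its own, then uses window() and cycle() from the stats package; the particular year chosen is arbitrary.

```r
set.seed(123)
t <- seq(from = 1, to = 100, by = 1) + 10 + rnorm(100, sd = 7)
tseries <- ts(t, start = c(2000, 1), frequency = 4)

# The last observation: 100 quarters after 2000 Q1
end(tseries)

# Select the four quarters of 2010 by date rather than by position
window(tseries, start = c(2010, 1), end = c(2010, 4))

# cycle() reports which quarter each observation belongs to
head(cycle(tseries), 8)
```

With a plain numeric vector, the same selection would require working out by hand that 2010 corresponds to positions 41 through 44.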

DSI-Explore

Cassy Dorff
