Today we will focus on time series data. Time series data consist of a metric measured repeatedly over time, typically at regular intervals. Examples include weather data, local crime statistics, stock prices, industry forecasts, and counts of protest events.
Note: you could take an entire course on time series data and modeling! This is only an introduction, but it will plant the seeds for future learning on the topic and help you understand why ‘accounting for time’, even if only visually, is important. To begin, let’s explore the ever-intuitive line plot using familiar data.
Line plots are particularly useful for exploring time series data. We will continue working with the gapminder data.
library(gapminder)
library(ggplot2)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
df <- gapminder
To begin, let’s say we wanted to know a bit more about GDP development in China. Below, we graph China’s GDP per capita from 1952 to 2007. We select the data for China by using the subset() function on the original data frame.
ggplot(subset(df, country == "China"),aes(x = year,y = gdpPercap)) +
geom_line()
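Before plotting, it can help to look at what subset() actually returns. The following is a minimal sketch; the object name china is just an illustrative choice and is not used elsewhere in this lesson.

```r
library(gapminder)

df <- gapminder

# Keep only the rows where country equals "China"
china <- subset(df, country == "China")

# One row per observed year: gapminder records a value every five years
nrow(china)
range(china$year)
```

The result is an ordinary data frame, which is why it can be passed straight to ggplot() as the data argument.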
We can add points to the line to highlight which observations are available in the underlying data.
ggplot(subset(df, country == "China"),aes(x = year, y = gdpPercap)) +
geom_line() +
geom_point()
China and India both have very large populations that have grown rapidly over the last few decades. What if we wanted to also check out whether a similar trend exists for India?
ggplot(subset(df, country == "India"),aes(x = year, y = gdpPercap)) +
geom_line() +
geom_point()
Now that we are exploring comparisons, let’s compare the data across multiple countries over time.
Go ahead and make a plot to compare the GDP per capita development of the BRICS countries (Brazil, Russia, India, China, and South Africa, a grouping of five major emerging national economies). Unfortunately, Russia is not part of the gapminder data, so we cannot display it in the plot.
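A quick way to confirm which of these countries are available is to compare the names against the levels of the country factor. This is only a sketch; in_data is a name chosen for illustration.

```r
library(gapminder)

brics <- c("Brazil", "Russia", "China", "India", "South Africa")

# TRUE where the country name is one of the levels of the country factor
in_data <- brics %in% levels(gapminder$country)
names(in_data) <- brics
in_data

# Countries we asked for that gapminder does not contain
setdiff(brics, levels(gapminder$country))
```

Because %in% simply returns FALSE for a name that is absent, a subset() call that lists "Russia" still runs without error; it just returns no rows for that country.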
Practice 1
Please create a publication-ready graph that can be printed using grayscale. Keep in mind you should distinguish the different lines easily from one another.
# students:
In cases like this, we might want to facet the plot by a single variable, using facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete. Below, we use facet_wrap(~country) to inspect trends across groupings.
ggplot(subset(df, country %in% c("Brazil", "Russia", "China", "India", "South Africa")),
aes(x = year,
y = gdpPercap)) +
geom_line(alpha = 0.5) +
geom_point(alpha = 0.8,
size = 0.4) +
theme_light() +
labs(title = "GDP per capita in BRICS countries",
x = "Year",
y = "GDP per capita") +
facet_wrap(~country, nrow = 1) +
scale_x_continuous(breaks = seq(1950,2000, 25))
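To make the point about formulas concrete, the small sketch below inspects the object that ~country creates. The name f is arbitrary.

```r
# A one-sided formula: a tilde followed by a variable name
f <- ~country

class(f)      # the class of the object
all.vars(f)   # the variable names the formula refers to
```

Note that creating the formula does not look up country anywhere. The name is only resolved later, when facet_wrap() evaluates it against the plot's data.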
Which plot do you like better? Facet wrap, or everything on the same graph? What are the pros and cons?
Practice 2
We may also want to look at how population in the BRICS countries changed over time. Try to produce a similar graph for population, but use facet_grid() instead of facet_wrap(). You might want to check the help file for facet_grid(), because there is a small difference between the two.
Hint: consider adding scale_y_continuous(breaks = seq(0, 1500000000, 300000000), labels = function(x){paste0(x/1000000, ' million')}) to control the breaks and labels of your y-axis.
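The labels argument accepts a function that receives the numeric break values and returns the text to print. As a sketch, here is the labelling function from the hint tried on its own; label_millions is a hypothetical name used only for this illustration.

```r
# Hypothetical helper: turn raw counts into "<n> million" strings
label_millions <- function(x) {
  paste0(x / 1000000, " million")
}

label_millions(c(0, 300000000, 1500000000))
```

Testing the function outside the plot like this makes it easier to adjust the wording before passing it to scale_y_continuous().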
# students:
Spaghetti plots are line plots with many lines displayed in a single plot. Typically, the lines are very thin and/or highly transparent so that overall trends in the data show through. Below, we graph the evolution of life expectancy for all countries in the data set. We give each country its own line by setting the group aesthetic inside aes(). We will highlight one country, South Africa, in red.
brics <- c("Brazil", "Russia", "China", "India", "South Africa")
ggplot(df, aes(x = year, y = lifeExp, group = country)) +
geom_line(alpha = 0.2,
size = 0.1) +
labs(title = "Life expectancy over time",
x = "Year",
y = "Life expectancy") +
theme_light() +
geom_line(data = subset(df, country %in% "South Africa"),
color = "red")
Practice 3
We can also display the temporal trend for select countries using color. Below, make a spaghetti plot similar to the one above, but this time highlight all four of the BRICS countries available in the data. This is a good example of when and why you might use multiple layers of the same geom!
Note: How does scale_color_brewer work? How does it ‘know’ where to apply color?
# students:
In the plot below, we set trans = "log" in scale_color_gradient() so that the color of each line is mapped to the natural log of GDP per capita, while the legend still displays GDP per capita values on their original scale. Why is it useful to apply a log transformation here? Discuss and interpret this plot.
ggplot(df,aes(x = year,y = lifeExp, group = country, color = gdpPercap)) +
# Setting up spaghetti plot
geom_line(alpha = 1, size = 0.1) +
# Each continent in a separate panel
facet_wrap(~continent, nrow = 1) +
# Choosing default theme
theme_light() +
# Adjusting default theme
theme(panel.background = element_rect(fill = "black"),
panel.grid = element_line(size = 0.1),
strip.background = element_rect(fill = "black"),
strip.text = element_text(color = "white")) +
# Making the colors pop
scale_color_gradient(low = "#f7ff00",
high = "#00f7ff",
name = "GDP per capita",
trans = "log") +
# Additional appearance adjustments
labs(x = "Year",
y = "Life expectancy",
title = "Global life expectancy") +
theme(legend.position = "bottom",
legend.key.width = unit(1.5, "cm")) +
scale_x_continuous(breaks = seq(1950, 2010, 25))
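One way to approach the log-transformation question is to check how unevenly GDP per capita is spread across its range. The following is a rough sketch, not a formal test; share_in_lowest_tenth() is a hypothetical helper written for this illustration.

```r
library(gapminder)

gdp <- gapminder$gdpPercap

# Share of observations that land in the lowest tenth of a numeric range
share_in_lowest_tenth <- function(x) {
  cutoff <- min(x) + 0.1 * diff(range(x))
  mean(x < cutoff)
}

share_raw <- share_in_lowest_tenth(gdp)       # color mapped to raw values
share_log <- share_in_lowest_tenth(log(gdp))  # color mapped to log values

c(raw = share_raw, log = share_log)
```

If a large share of observations sits in the lowest tenth of the raw range, most lines would receive nearly the same color on a linear gradient. Comparing the two shares gives a sense of how much the log scale spreads the values out.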
Understanding time often tells us a lot about changes in trajectories across different cases in our data. Later in your education as a data scientist, you will also learn about the ways in which dimensions of the data correlate over time.
Let’s consider the eu_stocks dataset (available in R by default as EuStockMarkets). This dataset contains daily closing prices of major European stock indices from 1991 to 1998: Germany (DAX), Switzerland (SMI), France (CAC), and the UK (FTSE). The data were observed only when the markets were open, so there are no observations on weekends and holidays.
#install.packages("datasets.load") #some might not have to do this
data("EuStockMarkets")
eu_stocks <- EuStockMarkets
Below, explore the ts() and is.ts() functions from the R stats package, as well as the help files that describe what a time series object is.
Explore: What is the ‘frequency’ argument of a time series object? Write your answer as a comment below.
is.ts(eu_stocks)
## [1] TRUE
start(eu_stocks)
## [1] 1991 130
end(eu_stocks)
## [1] 1998 169
frequency(eu_stocks)
## [1] 260
The is.ts() check above tells us that this data is already a time series object. The ts() function converts a numeric vector into an R time series object. The format is ts(vector, start=, end=, frequency=), where start and end are the times of the first and last observation and frequency is the number of observations per unit of time (1 = annual, 4 = quarterly, 12 = monthly, etc.).
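As a small sketch of these arguments in action, the example below builds a monthly series from made-up values (simply the numbers 1 to 24). The object name monthly is an illustrative choice.

```r
# 24 made-up monthly values (simply the numbers 1 to 24)
monthly <- ts(1:24, start = c(2020, 1), frequency = 12)

start(monthly)      # year and month of the first observation
end(monthly)        # year and month of the last observation
frequency(monthly)  # observations per year

# Printing a monthly ts object lays the values out by year and month
monthly
```

Only start and frequency are supplied here; R works out the end of the series from the number of values.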
# R recognizes when the x argument is a ts object and actually calls the plot.ts() function under the hood
plot(eu_stocks)
ts.plot(eu_stocks, col = 1:4, xlab = "Year", ylab = "Index Value", main = "Major European Stock Indices, 1991-1998")
legend("topleft", colnames(eu_stocks), lty = 1, col = 1:4, bty = "n")
To create a similar plot in ggplot2, we can use the handy ggfortify package! ggfortify lets ggplot2 know how to interpret ts objects. After loading ggfortify, you can use the ggplot2::autoplot() function for ts objects.
library(ggfortify)
autoplot(eu_stocks) + theme_bw()
#note, what kind of data object does autoplot use?
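For comparison, here is a sketch of building a similar plot by hand, without ggfortify. It is not a description of what ggfortify does internally; it only illustrates that ggplot2 ultimately needs a data frame. The names stocks_long, time, index, and value are chosen for this example.

```r
library(ggplot2)

data("EuStockMarkets")
eu_stocks <- EuStockMarkets

# Stack the four index columns into one long data frame:
# one row per (time point, index) pair
stocks_long <- data.frame(
  time  = rep(as.numeric(time(eu_stocks)), times = ncol(eu_stocks)),
  index = rep(colnames(eu_stocks), each = nrow(eu_stocks)),
  value = as.numeric(eu_stocks)
)

p <- ggplot(stocks_long, aes(x = time, y = value, color = index)) +
  geom_line() +
  labs(x = "Year", y = "Index Value") +
  theme_bw()
p
```

The time column holds decimal years taken from time(eu_stocks), which is how a ts object stores its time stamps.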
As a further illustration, consider some simulated data:
set.seed(123)
# a sequence from 1 to 100, shifted up by 10 to reduce the chance of negative values, plus some random normal noise
t <- seq(from = 1, to = 100, by = 1) + 10 + rnorm(100, sd = 7)
plot(t)
This could potentially represent some time series, with time represented along the x-axis. However, it’s hard to tell: the x-axis is simply an index from 1 to 100 in this case.
A vector object such as t above can easily be converted to a time series object using the ts() function. Let’s assume our generated data is quarterly data that starts in the first quarter of 2000.
tseries <- ts(t, start = c(2000, 1), frequency = 4)
print(tseries)
## Qtr1 Qtr2 Qtr3 Qtr4
## 2000 7.076670 10.388758 23.910958 14.493559
## 2001 15.905014 28.005455 20.226413 9.144571
## 2002 14.192030 16.880366 29.568573 24.518697
## 2003 25.805400 24.774779 21.109112 38.508392
## 2004 30.484953 14.233680 33.909491 26.690460
## 2005 23.525234 30.474176 25.817969 28.897761
## 2006 30.624725 24.193147 42.864509 39.073612
## 2007 31.033041 48.776704 43.985250 39.934500
## 2008 49.265880 50.146934 50.751068 50.820482
## 2009 50.877424 47.566618 46.858261 47.336703
## 2010 46.137051 50.544579 44.142226 69.182692
## 2011 63.455734 48.138240 54.179806 54.733413
## 2012 64.459756 59.416417 62.773230 61.800173
## 2013 62.699907 73.580216 63.419603 76.615294
## 2014 56.158730 72.092296 69.866980 71.511591
## 2015 73.657476 68.483736 70.667548 66.869972
## 2016 67.497461 78.124700 80.137468 78.371030
## 2017 85.455872 94.350593 77.562782 65.835818
## 2018 90.040170 79.035595 80.183940 93.179000
## 2019 85.006589 79.454976 90.269124 89.027760
## 2020 91.040349 94.696963 90.405380 98.510636
## 2021 93.456594 98.322474 104.677873 101.046270
## 2022 96.718479 108.041653 107.954527 105.838779
## 2023 104.671122 99.604657 114.524567 101.798183
## 2024 122.311331 118.728274 107.350097 102.815054
plot(tseries)
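Once the vector is a ts object, time-aware helpers from the stats package become available. The sketch below rebuilds the same simulated series so that it runs on its own, then inspects its time stamps and extracts a single year with window().

```r
set.seed(123)
t <- seq(from = 1, to = 100, by = 1) + 10 + rnorm(100, sd = 7)
tseries <- ts(t, start = c(2000, 1), frequency = 4)

start(tseries)        # year and quarter of the first observation
end(tseries)          # year and quarter of the last observation
time(tseries)[1:4]    # time stamps stored as decimal years
cycle(tseries)[1:4]   # position within the year (1 to 4)

# Extract the four quarters of 2010
window(tseries, start = c(2010, 1), end = c(2010, 4))
```

The window() call should return the same four values shown in the 2010 row of the printed series above.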