gapminder-model-drilldown.Rmd
library(virgo)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(gapminder)
This example is inspired by the Many Models example in R for Data Science by Grolemund and Wickham.
To begin we create a drop down menu with each continent using input_select()
and then we create a line plot for each country’s life expectancy over time. Here the selection modifies the opacity of each line, so we can compare countries within a continent to each other:
selection <- select_bind(
continent = input_select(
name = "Select Continent:",
choices = c(NA, levels(gapminder$continent)))
)
lex <- gapminder %>%
vega(enc(x = year, y = lifeExp, group = country)) %>%
mark_line(
enc(opacity = encode_if(selection, 0.5, 0.1))
)
lex
We see that for most countries life expectancy is generally increasing, however several countries from Africa and Asia have dips in their life expectancies.
Next for each country we can estimate a simple linear model to summarise the relationship of how life expectancy has changed over time. We will use \(R^2\) to assess the quality of the fit and link that back to our raw data
# we could do this with broom but this seems
# ok for now
gapminder_augmented <- gapminder %>%
group_nest(continent, country) %>%
mutate(
model = lapply(data, function(x) lm(lifeExp ~ year, data = x)),
r2 = vapply(model, function(x) summary(x)$r.squared, numeric(1)),
) %>%
unnest(data) %>%
select(-model)
gapminder_augmented
#> # A tibble: 1,704 x 7
#> continent country year lifeExp pop gdpPercap r2
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Africa Algeria 1952 43.1 9279525 2449. 0.985
#> 2 Africa Algeria 1957 45.7 10270856 3014. 0.985
#> 3 Africa Algeria 1962 48.3 11000948 2551. 0.985
#> 4 Africa Algeria 1967 51.4 12760499 3247. 0.985
#> 5 Africa Algeria 1972 54.5 14760787 4183. 0.985
#> 6 Africa Algeria 1977 58.0 17152804 4910. 0.985
#> 7 Africa Algeria 1982 61.4 20033753 5745. 0.985
#> 8 Africa Algeria 1987 65.8 23254956 5681. 0.985
#> 9 Africa Algeria 1992 67.7 26298373 5023. 0.985
#> 10 Africa Algeria 1997 69.2 29072015 4797. 0.985
#> # … with 1,694 more rows
Now we can create a plot driven selection, that will display the \(R^2\) values within each continent:
select_r2 <- select_interval("y")
tick_plot <- gapminder_augmented %>%
vega(enc(x = continent, y = r2)) %>%
mark_tick(
enc(opacity = encode_if(select_r2, 1, 0.1))
)
tick_plot
New we can overlay the real data alongside the model \(R^2\) values to identify which countries have poor fits. First, we modify the color scale to match the provided gapminder::country_colors
and then produce a line plot.
palette <- country_colors[sort(names(country_colors))]
country_fits <- gapminder_augmented %>%
vega(enc(x = year, y = lifeExp, color = country)) %>%
mark_line(selection = select_r2) %>%
scale_color(range = palette ,guide = FALSE)
country_fits
Next we combine them together to see which countries have poor fits.
hconcat(tick_plot, country_fits)
The countries with \(R^2\) below 0.2 have strong non-linear trends in life expectancies hovering over the lines identifies them to be countries affected by genocide (Rwanda) or the HIV/AIDS epidemic (Botswana, Lesotho, Swaziland, Zambia, Zimbabwe).