Overview

The purpose of this article is to demonstrate how to use data provided by the Vega datasets Python package.

Here’s the short version:

  • We can access the Vega datasets using the import_vega_data() function. This package’s convention is to assign its result to an object named vega_data.

  • To access the data for a specific dataset, call the method for that dataset: e.g. vega_data$cars(). Where you see a hyphen in a dataset name in the Altair documentation, use an underscore instead, e.g. vega_data$sf_temps().

  • These datasets have metadata that include a description, references, and a URL: e.g. vega_data$cars$references.

  • When you create an alt$Chart(), the data argument need not be a data frame; it can be a reference to a dataset, such as a URL.

    • If your chart contains a reference to an external resource, such as vega_data$cars$url, it will not render in the RStudio IDE due to RStudio’s (well-founded) security policy. If you view such a chart using an external browser, it should work as long as your computer can access the remote data.

Importing

In the Altair documentation, you will see this code used often:

from vega_datasets import data

cars = data.cars()

The Altair convention is to use the name data to refer to the data object in the vega_datasets package. This package offers a similar convention:

library("altair")

vega_data <- import_vega_data()

cars <- vega_data$cars()

Where the Python code uses the name data, our convention is to use an object named vega_data.

Accessing

Our vega_data object has a method to list all its datasets:

vega_data$list_datasets() %>% head()
#> [1] "7zip"          "airports"      "annual-precip" "anscombe"     
#> [5] "barley"        "birdstrikes"

Each dataset can be accessed using a method whose name is an element returned from vega_data$list_datasets(), with any hyphen replaced by an underscore.

library("tibble")

vega_data$anscombe() %>% as_tibble()
#> # A tibble: 44 × 3
#>    Series     X     Y
#>    <chr>  <dbl> <dbl>
#>  1 I         10  8.04
#>  2 I          8  6.95
#>  3 I         13  7.58
#>  4 I          9  8.81
#>  5 I         11  8.33
#>  6 I         14  9.96
#>  7 I          6  7.24
#>  8 I          4  4.26
#>  9 I         12 10.8 
#> 10 I          7  4.81
#> # ℹ 34 more rows

Keep in mind that the names returned by list_datasets() can contain a hyphen, but the method names cannot: a - is not valid in a Python attribute name, so the method name uses a _ in its place, both in Python and in the reticulated object in R. For example, the dataset listed as "sf-temps" is accessed in Python using data.sf_temps(); in R:

vega_data$sf_temps() %>% as_tibble()
#> # A tibble: 8,759 × 2
#>     temp date               
#>    <dbl> <dttm>             
#>  1  47.8 2010-01-01 00:00:00
#>  2  47.4 2010-01-01 01:00:00
#>  3  46.9 2010-01-01 02:00:00
#>  4  46.5 2010-01-01 03:00:00
#>  5  46   2010-01-01 04:00:00
#>  6  45.8 2010-01-01 05:00:00
#>  7  45.9 2010-01-01 06:00:00
#>  8  45.9 2010-01-01 07:00:00
#>  9  46.4 2010-01-01 08:00:00
#> 10  48   2010-01-01 09:00:00
#> # ℹ 8,749 more rows
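
The hyphen-to-underscore rule can also be applied programmatically, for example when going from a name returned by list_datasets() to the matching method name. The following is a minimal sketch in base R; dataset_method_name() is a hypothetical helper written for this illustration, not part of the altair package.

```r
# Hypothetical helper (not part of the altair package): convert a dataset
# name, as listed by vega_data$list_datasets(), into its method name by
# replacing each hyphen with an underscore.
dataset_method_name <- function(name) {
  gsub("-", "_", name, fixed = TRUE)
}

dataset_method_name("sf-temps")
#> [1] "sf_temps"
```

Names without a hyphen, such as "cars", are returned unchanged.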

Metadata

Each dataset has some metadata, such as a description and references.

wrapcat <- function(x) {
  x %>% strwrap() %>% cat(sep = "\n")
}

vega_data$anscombe$description %>% wrapcat()
#> Anscombe's Quartet is a famous dataset constructed by Francis Anscombe
#> [1]_. Common summary statistics are identical for each subset of the
#> data, despite the subsets having vastly different characteristics.
vega_data$anscombe$references %>% wrapcat()
#> Anscombe, F. J. (1973). 'Graphs in Statistical Analysis'. American
#> Statistician. 27 (1): 17-21. JSTOR 2682899.

Some of the datasets are stored locally as a part of the Vega datasets Python package; others are not. The method that returns the data, e.g. vega_data$anscombe(), handles either case: it reads the local copy where there is one, and otherwise retrieves the data from its remote URL. You can use the is_local property to find out which case applies to a dataset.

vega_data$anscombe$is_local
#> [1] TRUE

Each dataset has a remote URL, which you can use instead of a data frame in any Altair data argument.

vega_data$anscombe$url
#> [1] "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/anscombe.json"

Using a URL

You can specify data using a URL that points to a dataset, rather than using a data frame explicitly.

cars_url <- vega_data$cars$url

chart_cars <- 
  alt$Chart(cars_url)$
  encode(
    x = "Weight_in_lbs:Q",
    y = "Miles_per_Gallon:Q",
    color = "Cylinders:N"
  )$
  mark_point()

chart_cars

This works in your browser, but might not work in the RStudio IDE. For security reasons, the RStudio IDE may not let you refer to external URLs that are not on its allow-list (which includes sites such as YouTube and Vimeo). If you open the chart in a browser, it works just fine, as long as you have access to the internet.
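
If a chart has to render inside the RStudio IDE, one option is to pass the data frame itself rather than the URL, so that the chart contains no reference to an external resource. The sketch below repeats the chart above using vega_data$cars(). It assumes the same setup as the rest of this article (the altair package and its Python dependencies are installed), and the name chart_cars_local is used only for illustration.

```r
library("altair")

vega_data <- import_vega_data()

# cars is a data frame, so the data are embedded in the chart
# specification rather than referenced by URL
cars <- vega_data$cars()

chart_cars_local <-
  alt$Chart(cars)$
  encode(
    x = "Weight_in_lbs:Q",
    y = "Miles_per_Gallon:Q",
    color = "Cylinders:N"
  )$
  mark_point()

chart_cars_local
```

The trade-off is that the data are copied into the chart specification, which makes it larger than a specification that holds only a URL.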