The purpose of this article is to demonstrate how to use data provided by the Vega datasets Python package.
Here’s the short version:
We can access the Vega datasets using the import_vega_data() function. This package’s convention is to assign this to an object named vega_data.
To access the data for a specific dataset, call the method for that dataset: e.g. vega_data$cars(). Where you see a hyphen in a dataset name in the Altair documentation, use an underscore instead, e.g. vega_data$sf_temps().
These datasets have metadata that include a description, references, and a URL: e.g. vega_data$cars$references.
When you create an alt$Chart(), the data argument need not be a data frame; it can be a reference to a data frame, such as a URL. If you use a URL, such as vega_data$cars$url, the chart might not render in the RStudio IDE due to RStudio’s (well-founded) security policy. If you view such a chart using an external browser, it should work OK if your computer can access the remote data.
In the Altair documentation, you will see this code used often:
from vega_datasets import data
cars = data.cars()
The Altair convention is to use the name data to refer to the data object in the vega_datasets package. This package offers a similar convention:
library("altair")
vega_data <- import_vega_data()
cars <- vega_data$cars()
Instead of data, our convention is to name this object vega_data.
Our vega_data object has a method to list all its datasets:
vega_data$list_datasets() %>% head()
#> [1] "7zip" "airports" "annual-precip" "anscombe"
#> [5] "barley" "birdstrikes"
Each dataset can be accessed using a method whose name is an element returned from vega_data$list_datasets().
library("tibble")
vega_data$anscombe() %>% as_tibble()
#> # A tibble: 44 × 3
#> Series X Y
#> <chr> <dbl> <dbl>
#> 1 I 10 8.04
#> 2 I 8 6.95
#> 3 I 13 7.58
#> 4 I 9 8.81
#> 5 I 11 8.33
#> 6 I 14 9.96
#> 7 I 6 7.24
#> 8 I 4 4.26
#> 9 I 12 10.8
#> 10 I 7 4.81
#> # ℹ 34 more rows
Keep in mind that some dataset names contain a hyphen, which is not valid in a Python attribute name. Where you see a - in a dataset name, such as "annual-precip" in the list above, use a _ in the name of the method, both in Python and in the reticulated object in R. For example, in Python: data.sf_temps(); in R:
vega_data$sf_temps() %>% as_tibble()
#> # A tibble: 8,759 × 2
#> temp date
#> <dbl> <dttm>
#> 1 47.8 2010-01-01 00:00:00
#> 2 47.4 2010-01-01 01:00:00
#> 3 46.9 2010-01-01 02:00:00
#> 4 46.5 2010-01-01 03:00:00
#> 5 46 2010-01-01 04:00:00
#> 6 45.8 2010-01-01 05:00:00
#> 7 45.9 2010-01-01 06:00:00
#> 8 45.9 2010-01-01 07:00:00
#> 9 46.4 2010-01-01 08:00:00
#> 10 48 2010-01-01 09:00:00
#> # ℹ 8,749 more rows
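The naming rule described above can be sketched in a few lines of Python. This is only an illustration: method_name() is a hypothetical helper written for this article, not a function provided by vega_datasets, reticulate, or this package, and the names are taken from the examples shown earlier.

```python
# Hypothetical helper for illustration only; it is not part of
# vega_datasets, reticulate, or the altair R package.
def method_name(dataset_name: str) -> str:
    """Turn a listed dataset name into the name used to call its method."""
    return dataset_name.replace("-", "_")


# Dataset names as they appear in listings, hyphens included.
listed = ["airports", "annual-precip", "anscombe", "sf-temps"]

for name in listed:
    print(f"{name} -> {method_name(name)}")
```

Names without a hyphen pass through unchanged, so the same rule can be applied to every entry in the listing.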
Each dataset has some metadata, such as a description and references.
wrapcat <- function(x) {
  x %>% strwrap() %>% cat(sep = "\n")
}
vega_data$anscombe$description %>% wrapcat()
#> Anscombe's Quartet is a famous dataset constructed by Francis Anscombe
#> [1]_. Common summary statistics are identical for each subset of the
#> data, despite the subsets having vastly different characteristics.
vega_data$anscombe$references %>% wrapcat()
#> Anscombe, F. J. (1973). 'Graphs in Statistical Analysis'. American
#> Statistician. 27 (1): 17-21. JSTOR 2682899.
Some of the datasets are stored locally as a part of the Vega datasets Python package; others are not. The method that returns the data, e.g. vega_data$anscombe(), will do the right thing in either case. You can use the is_local property to find out whether a dataset is stored locally.
vega_data$anscombe$is_local
#> [1] TRUE
Each dataset has a remote URL, which you can use instead of a data frame in any Altair data argument.
vega_data$anscombe$url
#> [1] "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/anscombe.json"
You can specify data using a URL that points to a dataset, rather than using a data frame explicitly.
cars_url <- vega_data$cars$url
chart_cars <-
  alt$Chart(cars_url)$
  encode(
    x = "Weight_in_lbs:Q",
    y = "Miles_per_Gallon:Q",
    color = "Cylinders:N"
  )$
  mark_point()
chart_cars
This works in your browser, but might not work in the RStudio IDE. This is because, for security reasons, the RStudio IDE may not let you refer to external URLs unless they are on its allow-list (which includes sites such as YouTube and Vimeo). If you open the chart in a browser, it works just fine (as long as you have access to the internet).
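To see why a URL-based chart depends on network access at viewing time, it can help to look at the kind of specification involved. Altair charts compile to Vega-Lite JSON; the Python sketch below builds a small Vega-Lite-style specification by hand, using only the standard library. The helper scatter_spec() is hypothetical and is not Altair's API, and the result is a simplified approximation rather than Altair's exact output. The URL and the field names (X, Y, Series) are the ones shown for the anscombe dataset earlier.

```python
import json

# URL copied from the vega_data$anscombe$url output shown earlier.
ANSCOMBE_URL = (
    "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/anscombe.json"
)


def scatter_spec(url: str, x: str, y: str, color: str) -> dict:
    """Build a minimal Vega-Lite-style scatter plot that loads data by URL.

    Hypothetical helper for illustration; not part of Altair.
    """
    return {
        # Only the location of the data is stored, not the rows themselves.
        "data": {"url": url},
        "mark": "point",
        "encoding": {
            "x": {"field": x, "type": "quantitative"},
            "y": {"field": y, "type": "quantitative"},
            "color": {"field": color, "type": "nominal"},
        },
    }


spec = scatter_spec(ANSCOMBE_URL, x="X", y="Y", color="Series")
print(json.dumps(spec, indent=2))
```

Because the specification carries only the URL, whatever renders the chart has to fetch the data itself, which is the step a restrictive viewer can block.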