About virgo

The virgo package extends the grammar of interactive graphics based on the Vega-Lite library in Javascript, using the design principles of the tidyverse suite of packages. It is built on top of the vegawidget package, which handles the actual drawing of a virgo graphic.

Melbourne’s microclimate

Throughout this vignette we will be exploring three months of microclimate sensor data collected from December 2019 until February 2020 from Melbourne City Council. This data is built into to virgo and consists of measurements for things like temperature and relative humidity at 5 different locations across the city.

library(virgo)
library(lubridate)
library(dplyr)
melbweather

Data, Encodings and Marks

To construct a visualisation, we begin with tidy data and map columns to visual elements using encodings:

melbweather %>%
  filter(date == "2019-12-01") %>% 
  vega(encoding = enc(x = ambient_temperature))

The data is piped into the vega() function and aesthetic mappings are specified with the enc(). Valid encodings depend on the mark or graphical element being used. As there are no marks for this virgo chart, blank canvas is printed. The vega() function is required to be called in order to produce a valid visualisation, however as well we see later both data and marks can be specified elsewhere.

To produce interesting charts, we can add marks which are visual layers that are added to the chart.

Let’s go back to our data:

melbweather

To begin, we will collapse our quarter hourly measurements for each site to hourly averages using dplyr:

hourly_weather <- melbweather %>% 
  group_by(site, date, hour_of_day = hour(date_time) ) %>% 
  summarise(
    across(
      c(ambient_temperature:pm10), 
      mean, 
      na.rm = TRUE
    )
  ) %>% 
  ungroup()

hourly_weather

Suppose we are interested in the relationship between ambient temperature and relative humidity. It is theorised that as ambient temperature increases, relative humidity decreases.

We can check whether this is the case with our data by creating a scatter plot.

hourly_weather %>% 
  vega() %>% 
  mark_point(enc(x = ambient_temperature, y = relative_humidity))

For users, of ggplot2 the above incantation should look somewhat familiar. We specify our data at the top level using vega() and then add layers using the pipe %>% in combination with marks. Inside the mark_point() function we specify the aesthetic mapping or encoding that says the x axis should correspond to ambient temperature and the y axis should correspond to relative humidity.

The virgo library will automatically add legends depending on the type of encoding and input variable. If we color by a continuous variable a scale is automatically determined and placed on the right hand side of the plot.

hourly_weather %>% 
  vega() %>% 
  mark_point(
    enc(
      x = ambient_temperature, 
      y = relative_humidity, 
      color = wind_speed
    )
  )

Likewise, we can evaluate expressions inside of the encoding function, for example, evaluating the month of the date column with lubridate:

hourly_weather %>% 
  vega() %>% 
  mark_point(
    enc(
      x = ambient_temperature, 
      y = relative_humidity, 
      color = month(date, label = TRUE)
    )
  )

It appears that there is a slight non-linear and negative relationship between ambient temperature and relative humidity, however, there is a lot over-plotting in the first example, so perhaps it would be best to add some opacity to the previous scatter plot. Graphical elements that don’t depend on columns of the data can be added via extra arguments to the mark function:

hourly_weather %>% 
  vega() %>% 
  mark_point(
    enc(x = ambient_temperature, y = relative_humidity),
    opacity = 0.1
  )

Multiple marks can be added to a visualisation by piping a sequence of marks together. We can take our previous scatter plot and draw a loess regression line on top.

p <- hourly_weather %>% 
  vega(enc(x = ambient_temperature, y = relative_humidity)) %>% 
  mark_point(opacity = 0.1) %>% 
  mark_smooth(method = "loess", color = "blue")
p

Here we’ve specified the encoding at the top level with the vega() function, so they are available to all downstream layers. We can have layer specific encodings, by passing the enc() to any mark in the sequence. In addition to a single regression line, we could generate separate lines for each measurement within each month of the year:

p %>% 
  mark_smooth(
    enc(color = month(date, label = TRUE)),
    method = "loess"
  )

Facets

Sometimes we want to visualise multiple categorical variables at the same time, by creating small multiples (facets) views. Say we were interested in the distribution of wind speed, we could create a histogram:

ws <- hourly_weather %>% 
  vega() %>% 
  mark_histogram(enc(x = wind_speed), bin = list(step = 1))
ws

How similar are the wind speeds across each measurement site? We can facet by site to find out:

ws %>% 
  facet_views(row = site)

Faceting requires specifying a row and/or column encoding, that determines how the subplots are drawn on the canvas. There appears to be a suspiciously large number of records in site “arc1045” that have recorded an average wind speed between 4-5km.

Multiple Views

Often it is helpful to place views side by side that shows all of the data rather than subplots like those produced by faceting.

virgo provides functions to align plots horizontally or vertically with the hconcat() or vconcat() functions.

left <- hourly_weather %>% 
  vega() %>% 
  mark_point(enc(x = ambient_temperature, y = relative_humidity),
             opacity = 0.1)
right <- hourly_weather %>%
  vega() %>% 
  mark_point(enc(x = wind_speed, y = ambient_temperature),
             opacity = 0.1)
hconcat(left, right)

Client side data transformations

The virgo package exports special functions that are computed directly by Vega-Lite rather than R. These functions are prefixed with vg, and are always be called inside of enc().

By using the vg functions, we can get Vega-Lite to perform data transformations or aggregations without the use of tidyverse. Suppose we were interested in finding out the hour of the day, where the average temperature across all sites reaches its max:

hourly_temp <- melbweather %>% 
  vega(
    enc(
      x = vg_hours(date_time), 
      y = vg_mean(ambient_temperature)
    )
  ) %>% 
  mark_point() %>% 
  mark_line()
hourly_temp

In the above code, the vg_hours() function extract the hour from the date time variable and groups the data within each our block. The y-axis is then average ambient temperature within each hour pooled across all days and sites. We could get a more granular view by using an alternative mark like a boxplot to view the distribution over each hour instead of just the average:

hourly_dist_temp <- melbweather %>% 
  vega(
    enc(
      x = vg_hours(date_time), 
      y = ambient_temperature
    ),
    width = 600
  ) %>% 
  mark_boxplot()
hourly_dist_temp

Generally, we believe it is best to perform data transformations outside of the visualisation environment as the transformations are explicit (i.e. when using the tidyverse). However, as we will see later these client side transformations become especially useful when combined with interactivity.

Interactivity via selections

Selections are how a virgo graphic defines interactions, and are strongly influenced by the Vega-Lite javascript API. There are several types of selection, but to begin we will create an interval selection:

selection <- select_interval()

By default, an interval selection will specify a rectangular region, that is generated by dragging the mouse over the view. Each mark can accept a selection object, which will filter the data when the region is active:

p <- hourly_weather %>% 
  vega(enc(x = ambient_temperature, y = relative_humidity)) %>% 
  mark_point(opacity = 0.1, selection = selection) 
p

If you just want to draw the selection without performing the filter, you can specify it as an identity selection

p <- hourly_weather %>% 
  vega(enc(x = ambient_temperature, y = relative_humidity)) %>% 
  mark_point(opacity = 0.1, selection = I(selection)) 
p

Rather than filtering the data, you may want to conditionally encode a visual element based on whether data falls into a selection. The encode_if() function allows an encoding to depend on one (or more) selections. We could rewrite the above plot so the points that fall inside of the selection are given an opacity of 0.5, while those outside are given an opacity of 0.1.

p <- hourly_weather %>% 
  vega(enc(x = ambient_temperature, y = relative_humidity)) %>% 
  mark_point(
    enc(opacity = encode_if(selection, 0.5, 0.1))
  ) 
p

Instead of using values, the conditional encoding can depend on a variable instead:

p <- hourly_weather %>% 
  vega(enc(x = ambient_temperature, y = relative_humidity)) %>% 
  mark_point(
    enc(
      color = encode_if(selection, wind_speed > 10, "grey"),
      opacity = encode_if(selection, 0.5, 0.01)
      )
  ) 
p

Interval selections can be restricted to a single axis, so a brush only moves in one direction. The following produces a line plot of the maximum daily temperature, with a brush over the x-axis

x_selection <- select_interval(encodings = "x")

overview <- melbweather %>% 
  vega(
    enc(
      x = date,
      y = vg_max(ambient_temperature)
    ),
    width = 600
  ) %>% 
  mark_line() %>% 
  mark_point(
    selection = I(x_selection)
  )
overview

Interval selections can also be bound to a scale and then composed with another view to produce a zooming effect. After concatenating the two views, dragging the x-axis on the bottom graphic will zoom in on the top view:

context <- melbweather %>% 
  vega(
    enc(
      x = date,
      y = vg_max(ambient_temperature)
    ),
    width = 600,
    height = 200
  ) %>% 
  mark_line() %>% 
  mark_point() %>% 
  scale_x(name = NULL, domain = x_selection)

vconcat(context, overview)

Selections can also be used to filter data in multi-view plots. For example we could view the trail of temperature measurements that each site recorded over the highlighted period in the overview plot.

active_sensor <- melbweather %>% 
  vega(enc(x = ambient_temperature, y = site), height = 200) %>% 
  mark_tick(selection = x_selection) 
vconcat(overview, active_sensor)

This is just the beginning of what is possible to achieve with selections, for more detail see the selections vignette.

Sizing and saving

The sizing of a plot is controlled inside the vega() function with the height and width arguments. These arguments refer to the pixel dimensions of the view.

virgo graphics can be saved with the vegawidget package using the vg_write_png() or vg_write_svg() functions.

Introducing virgo for interactive exploratory graphics

Stuart Lee and Earo Wang