R thing of the day: Speed up base R using the collapse package
Last updated: Mar 9, 2022
#tldr: just add these 2 lines
options(collapse_mask = "all")
library(dplyr)
library(collapse)
- Begin your script by setting the collapse_mask option. See help('collapse-options') for more options, e.g. "manip".
- library(collapse) needs to come after library(dplyr) for collapse_mask to work.
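One way to confirm that the masking took effect is to ask R which attached package it finds first for a given verb. This is a minimal sketch, assuming both dplyr and collapse are installed and the session is fresh (collapse not yet loaded); it uses the narrower "manip" mask, and find() comes from the utils package.

```r
# Set the option BEFORE collapse is loaded; it is read at load time
options(collapse_mask = "manip")

library(dplyr)
library(collapse)  # attached after dplyr, so it sits earlier on the search path

# find() lists every attached package that defines the name, in lookup order
hits <- find("group_by")
print(hits)
```

If "package:collapse" is listed ahead of "package:dplyr", an unqualified group_by() call resolves to the collapse version.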
I’m a huge fan of the tidyverse, but I also adore data.table (especially when I need/want an extra speedup 🏎️). My go-to is fread to load CSV files, and I often consider trying out vroom, as it seems to be competitive. For data wrangling, packages like dtplyr and tidyfast are also compelling alternatives as they bring the efficiency of data.table with the familiar tidyverse syntax.
I often mean to give the collapse package 📦 by Sebastian Krantz a go, but I tend to slip into familiar ways before I ever remember to do so. This is unfortunate because it seems that it can seamlessly augment and speed up my current practice, be it alongside dplyr, data.table, or even when making maps using sf.
Grant McDermott recently tweeted how one can simply add a couple lines of code to get the benefits of collapse without really any additional code changes:
TL;DR many R users will get a major speedup simply by adding
options(collapse_mask = "manip")
library(collapse)
to the top of their scripts. (First line could even go into your ~/.Rprofile file for a permanent change.) Everything else remains unchanged. https://t.co/Mt3FboE8uC
— Grant McDermott (@grant_mcdermott) February 15, 2022
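The ~/.Rprofile suggestion from the tweet would look roughly like the fragment below. This is a sketch, not a recommendation: it assumes you want the "manip" mask in every R session, and it only sets the option, so each script still needs its own library(collapse) call.

```r
# ~/.Rprofile (runs at the start of every R session)
# Only sets the option; collapse itself is still loaded per script.
options(collapse_mask = "manip")
```

Keep in mind that a global setting like this changes what unqualified dplyr verbs resolve to in any script that loads collapse, which may surprise collaborators running the same code without it.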
He also shared a gist comparing dplyr, data.table and the power of collapse using the same ol' dplyr code. I attempted to run the same benchmark to see if I too got the same performance bump using collapse. I had trouble installing the microbenchmark 📦, so I used rbenchmark instead.
## Context: https://twitter.com/grant_mcdermott/status/1493400952878952448
options(collapse_mask = "all")
library(dplyr)
library(data.table)
library(collapse) # Needs to come after library(dplyr) for collapse_mask to work
flights = fread('https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv')
vars = c('dep_delay', 'arr_delay', 'air_time', 'distance', 'hour')
## Note we explicitly call dplyr::<function> for the 1st line in this benchmark,
## since we've masked the regular dplyr operations with their collapse
## equivalents (i.e. 2nd line).
library(rbenchmark)
benchmark(
  dplyr = flights %>% dplyr::group_by(month, day, origin, dest) %>% dplyr::summarise(across(vars, sum)),
  collapse = flights %>% group_by(month, day, origin, dest) %>% summarise(across(vars, sum)),
  data.table = flights[, lapply(.SD, sum), by=.(month, day, origin, dest), .SDcols=vars],
  replications = 20
)
#         test replications elapsed relative user.self sys.self user.child sys.child
# 2   collapse           20   0.275    1.062     0.263    0.012          0         0
# 3 data.table           20   0.259    1.000     5.502    0.019          0         0
# 1      dplyr           20  32.587  125.819    31.736    1.251          0         0
🚀🚀🚀
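If masking dplyr's verbs feels too invasive, collapse also exposes the same operations under f-prefixed names (fgroup_by, fsummarise, fsum), which can be called without setting collapse_mask at all. The following is a small sketch on the built-in mtcars data rather than the flights benchmark above; the grouping and summary columns are chosen only for illustration.

```r
library(collapse)

# Group by two columns, then sum two others with collapse's fast sum.
# No collapse_mask is needed because the f-prefixed names never clash with dplyr.
g   <- fgroup_by(mtcars, cyl, gear)
res <- fsummarise(g, mpg = fsum(mpg), hp = fsum(hp))

print(res)
```

This keeps dplyr's own group_by and summarise untouched, at the cost of rewriting the calls you want to speed up.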