Wicked Good Data Website and blog of Daniel P. Martin

Speeding up data cleaning with dplyr

About a year and a half ago, I wanted to get some experience creating an R package and putting it on CRAN. I spend some time working with Brian Nosek (see our Crowdsourcing Data Analysis project here), who has helped develop a scoring algorithm for the Implicit Association Test (IAT). While this cleaning algorithm was already implemented in SPSS and SAS, it did not have an R counterpart. Sure enough, this is what I decided to work on.

Now, data from ProjectImplicit can be fairly large, with the number of participants easily in the hundreds of thousands, if not millions. Because the IAT measures reaction times, these datasets are time series in long format. So, when I started thinking of ways to do quick aggregation in R in order to stay competetive with the SAS and SPSS alternatives, my mind immediately went to data.table. At this time, data.table was blowing everything out of the water. It took some time, and sure enough it was put on CRAN, and was had decent speed considering it was R (it was a little bit slower than SAS).

Just as a quick aside, rdocumentation.org is a great website to use to keep track of CRAN downloads and identify popular R packages that you might not be using.

Anyway, back to business. Ever since I put the IAT package on CRAN, I have been keeping an eye out for packages that I could use to speed things up even more. Coming from using plyr and other tools made by Hadley Wickham, I naturally heard about dplyr a few months back, and started using it. Spoiler alert: it’s great. I find dplyr to be much more intuitive for cleaning data than something like data.table. Now, data.table is still a great package (I use the fread() function all the time to quickly read in data), but you only see substantial improvements in speed if you use it the “data.table way.” For me, it’s much easier to get dplyr right the first time. So, even though data.table might be faster if used correctly, I decided to see if I could get speed improvements in the IAT package by making the switch (there are a ton of other benchmarks for this, see here and here for a few).

The great thing is that it only took me about an hour or so to convert the code over to dplyr. Let’s benchmark this updated function using dplyr to see if it’s an improvement over my old code. For this benchmark, we will be using sample IAT data included in the IAT package (N = 49, number of rows = 6566), and I replicated each function 100 times

# install if necessary
# install.packages("IAT")

require(dplyr)
require(IAT)
require(rbenchmark)

myData <- IATData[IATData$isCongruentFirst == 1, ]

benchmark(replications = 100,
          
  old_way = cleanIAT(
    myData = myData,
    blockName = "BLOCK_NAME_S",
    trialBlocks = c("BLOCK2", "BLOCK3", "BLOCK5", "BLOCK6"),
    sessionID = "SESSION_ID",
    trialLatency = "TRIAL_LATENCY",
    trialError = "TRIAL_ERROR",
    vError = 1, vExtreme = 2, vStd = 1),
  
  new_way = cleanIAT_NEW(
    myData = myData,
    blockName = "BLOCK_NAME_S",
    trialBlocks = c("BLOCK2", "BLOCK3", "BLOCK5", "BLOCK6"),
    sessionID = "SESSION_ID",
    trialLatency = "TRIAL_LATENCY",
    trialError = "TRIAL_ERROR",
    vError = 1, vExtreme = 2, vStd = 1)
  
          )
##      test replications elapsed relative user.self sys.self user.child
## 2 new_way          100   16.15    1.000     13.57    0.136          0
## 1 old_way          100   33.18    2.055     29.56    0.301          0
##   sys.child
## 2         0
## 1         0

Woah, the time was cut in half! I’m sure this could be sped up even more with a better implementation of data.table, but I’m happy with the result for now.

Now, the one issue I had with dplyr is passing variables as strings to dplyr functions. While there are ways to do this (see here for more info), it seemed like too much of a pain to implement given I only really have three variables to pass as strings, but about dozen lines that need these variables. So as a kludgy workaround, I renamed these three variables in the beginning. Voila! Problem solved. Making this easier is on Hadley’s long term to-do list, so I imagine I’ll make the change when such a solution becomes available.

You can get access to this quicker function in the development version of the IAT package on my GitHub using devtools. See the code below for installation instructions:

# install devtools if necessary
# install.packages("devtools")

require(devtools)

install_github(repo = "IAT", username = "dpmartin42")