Reading Data Fast via Feather

Kenny Darrell

April 17, 2016

Here are a few interesting packages that are newish to me.

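The first is called feather. It stems from the Apache Arrow project and makes reading data files in R very fast.

It is fairly intuitive why this works better: feather stores the data frame on disk in a binary columnar format, so reading it back skips the text parsing and type guessing that a CSV requires.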

# devtools::install_github("dgrtwo/fuzzyjoin")
# devtools::install_github("hadley/tibble")
# devtools::install_github("wesm/feather/R")
# devtools::install_github("hadley/readr")

library(feather)
library(readr)
library(data.table)
library(tibble)
library(dplyr)

This data can be found here.

system.time(x <- read.csv('2008.csv'))
#   user  system elapsed
# 96.721   2.850  99.914

write_feather(x, '2008.feather')

rm(x);gc();
system.time(x <- read_feather('2008.feather'))
#   user  system elapsed
# 0.765   0.396   1.162

rm(x);gc();
system.time(x <- read_csv('2008.csv'))
#  user  system elapsed 
# 15.345   1.413  18.642

rm(x);gc();
system.time(x <- fread('2008.csv'))
#  user  system elapsed 
# 6.923   0.374   7.302
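
To put the timings above side by side, here is a small sketch that divides each elapsed time by the feather one. The numbers are copied from the single runs above; the names `elapsed` and `speedup` are just for illustration.

```r
# Elapsed seconds copied from the system.time() output above
elapsed <- c(read.csv = 99.914, read_csv = 18.642,
             fread = 7.302, read_feather = 1.162)

# How many times longer each reader took than read_feather
speedup <- round(elapsed / elapsed[["read_feather"]], 1)
print(speedup)
```

On these runs read_feather comes out roughly 86 times faster than read.csv, 16 times faster than read_csv and 6 times faster than fread. Each timing is a single run on one machine, and the feather file still had to be written once from the CSV, so treat the ratios as rough.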

Another useful package is tibble. Normally I start all of my code by setting the stringsAsFactors option to FALSE, because if I don’t, 15 minutes later I have to figure out why something that should be easy is not working. I modify the option, re-run everything, and I am good to go, except that my train of thought was derailed pretty hard.

I have also never been a fan of the odd process of creating a throwaway data.frame.

throw_away <- data.frame(a = c(1, 2, 3), b = c('a', 'b', 'c'))
str(throw_away)
## 'data.frame':    3 obs. of  2 variables:
##  $ a: num  1 2 3
##  $ b: Factor w/ 3 levels "a","b","c": 1 2 3
rbind(throw_away, c(1, 'd'))
## Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 2L, 3L,
## NA), .Label = c("a", : invalid factor level, NA generated
##   a    b
## 1 1    a
## 2 2    b
## 3 3    c
## 4 1 <NA>

That was in no way what you would expect to happen.
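
For contrast, here is a minimal sketch of the same steps with strings kept as characters. It passes `stringsAsFactors = FALSE` per call rather than changing the global option, and it appends the new row as a one-row data.frame so each column keeps its type. The names `keep_chars`, `new_row` and `combined` are made up for this example.

```r
# Assumption: strings stay as characters, either globally via
# options(stringsAsFactors = FALSE) or per call as done here.
keep_chars <- data.frame(a = c(1, 2, 3), b = c("a", "b", "c"),
                         stringsAsFactors = FALSE)

# Append the new row as a one-row data.frame so column types are preserved
new_row <- data.frame(a = 1, b = "d", stringsAsFactors = FALSE)
combined <- rbind(keep_chars, new_row)

str(combined)
```

With character columns the new value "d" is simply appended, so there are no factor levels to trip over.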

The tibble way of constructing a dataset is very similar to SAS or Matlab, which is clean. Most of all, it does not turn anything into a factor for me.

a <- tibble::tribble(  # frame_data() is the older, now-deprecated name for tribble()
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5,
  "c", 1,  8.5,
  "D", 1,  8.5)