A few interesting packages that are newish to me.
The first of which is called feather. It stems from the Apache Arrow project and makes it super fast to read datafiles in R.
The process which this uses seems rather intutive as to why it works better.
# devtools::install_github("dgrtwo/fuzzyjoin")
# devtools::install_github("hadley/tibble")
# devtools::install_github("wesm/feather/R")
# devtools::install_github("hadley/readr")
library(feather)
library(readr)
library(data.table)
library(tibble)
library(dplyr)
This data can be found here.
system.time(x <- read.csv('2008.csv'))
# user system elapsed
# 96.721 2.850 99.914
write_feather(x, '2008.feather')
rm(x);gc();
system.time(x <- read_feather('2008.feather'))
# user system elapsed
# 0.765 0.396 1.162
rm(x);gc();
system.time(x <- read_csv('2008.csv'))
# user system elapsed
# 15.345 1.413 18.642
rm(x);gc();
system.time(x <- fread('2008.csv'))
# user system elapsed
# 6.923 0.374 7.302
Another useful package is tibble. Normally I start all of my code with by turning strings to factors as FALSE, becuase if I don’t 15 minutes later I have to figure out why something that should be easy is not working. I modify this option then re-reun everything I am good to go, except that my train of thought was derailed pretty hard.
I have also never been a fan of the odd process of creating a throw away data.frame.
throw_away <- data.frame(a = c(1, 2, 3), b = c('a', 'b', 'c'))
str(throw_away)
## 'data.frame': 3 obs. of 2 variables:
## $ a: num 1 2 3
## $ b: Factor w/ 3 levels "a","b","c": 1 2 3
rbind(throw_away, c(1, 'd'))
## Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 2L, 3L,
## NA), .Label = c("a", : invalid factor level, NA generated
## a b
## 1 1 a
## 2 2 b
## 3 3 c
## 4 1 <NA>
That was in no way what you would expect to happen.
The tibble method to construct a dataset is very similar to SAS or Matlab, which is clean. And most of all it does not make anything into a factor for me.
a <- tibble::frame_data(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5,
"c", 1, 8.5,
"D", 1, 8.5)