There are now tons of packages on CRAN, and many more on GitHub. After attending the useR! Conference I noticed that a lot of this growth is in breadth rather than depth; packages are doing more than just statistical analysis. There seems to be a growing number of computer scientists and software engineers adding packages and even building new implementations of R. A similar thing happened to JavaScript about ten years ago. Google and Mozilla had a battle over which browser was faster, which really meant who had the faster JavaScript implementation. In the end JavaScript was the winner, as it became really fast. R seems to be in a similar position. Microsoft purchased Revolution Analytics so that it can put R inside things like SQL Server, Azure and Excel. Oracle is pushing one version into the Oracle database so you can run analytics from SQL code, while also creating another implementation that compiles to JVM bytecode. There is also a whole slew of other implementations and work happening. This could raise the bar for all of R as they each start to compete. It may also give rise to a formal specification, which R lacks now since the GNU implementation is the spec.
I think in general that when a bunch of people complain that something is slow or not high-quality enough, it is likely to become so. This rings even more true when you consider quotes like Bjarne Stroustrup's: "There are only two kinds of languages: the ones people complain about and the ones nobody uses." A long time ago people complained that assembly was too high level and that you should use machine code. Then the debate moved to C and Fortran being toys, with real programmers using assembly to get work done. Now most people see those two languages as basically a cleaner assembly. Later many people poked fun at Java for being too slow to do anything real. That has changed quite a bit: the JVM and the HotSpot compiler are probably two of the most well-tuned pieces of software in existence, and they are lightning fast. JavaScript was supposedly once only used to add random stuff to your Myspace page, but it got its drastic makeover. So maybe it's good that so many people say that R can't be used in production, only works on small data, is slow, or is single core.
As data science expands its reach, more will be required from the tools, and the R community seems interested in expanding the tools as well. There are plenty of new packages that have nothing to do with statistical models or visualizations, but instead help verify assumptions about data or keep the workspace clean. These are the things R needs in order to become more generally capable, and they show that the community is at least aware of its gaps and that people are trying to fill them.
I wanted to experiment with some of the tools that I think are really cool and will help R become more capable. The first is the assertr package, which is useful for making assertions on data: checking that your data is what you think it is.
library(lubridate)
library(dplyr)
library(assertr)
We can start by using the verify function. For the mtcars dataset, I know that the mpg field should be positive, which it is.
mtcars %>% verify(mpg >= 0) %>% head
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
What if this were not true? We can modify the data to demonstrate what would happen.
mtcars_bad <- mtcars
mtcars_bad$mpg[1] <- -1
mtcars_bad %>% verify(mpg >= 0)
## Error in verify(., mpg >= 0) : verification failed! (1 failure)
There is another function called assert. Instead of an expression, we give it a predicate function and the columns to check.
mtcars %>% assert(within_bounds(0, Inf), mpg) %>% head
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars_bad %>% assert(within_bounds(0, Inf), mpg) %>% head
## Error: Vector 'mpg' violates assertion 'within_bounds' 1 time (value [-1] at index 1)
One cool thing to note here is that assert reports the actual value that caused the issue, whereas verify just said that there was an issue. We can also look at things at a macro level: the whole data set instead of each specific value.
mtcars %>% verify(nrow(.) > 10) %>% head
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Another cool feature is the ability to create custom predicate functions, so we can write checks for every type of assumption we might make about the data. Here we add a character field and then make sure it is not empty.
mtcars$string <- sample(LETTERS, nrow(mtcars), replace = T)
# Create predicate function.
not.empty.p <- function(x) if (x == "") FALSE else TRUE
# Check it
mtcars %>% assert(not.empty.p, string) %>% head
## mpg cyl disp hp drat wt qsec vs am gear carb string
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 P
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 D
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 V
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 F
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 U
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 W
# Make one of them empty
mtcars$string[1] <- ''
mtcars %>% assert(not.empty.p, string) %>% head
## Error: Vector 'string' violates assertion 'not.empty.p' 1 time (value [] at index 1)
The last thing I want to note here is the insist function. When you push an analysis out of exploratory mode, you often need to verify that new data looks similar to the original, and one thing to check for is outliers. The insist function can help with this.
mtcars %>%
insist(within_n_sds(3), mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## (dbl) (dbl)
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
mtcars %>%
insist(within_n_sds(2), mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
## Error: Vector 'mpg' violates assertion 'within_n_sds' 2 times (e.g. [32.4] at index 18)
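These verbs also chain naturally, so a pipeline can state all of its assumptions up front before any analysis runs. A minimal sketch that reuses only the checks shown above (output omitted):
mtcars %>%
  verify(nrow(.) > 10) %>%
  assert(within_bounds(0, Inf), mpg) %>%
  insist(within_n_sds(3), mpg) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))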
There are many other cool things built into this package, which can be seen at the link above. This is also not the only package that attempts to provide this type of functionality; there is some discussion of the options here.
There are other packages that provide similar functionality through different means, and some that have different functionality altogether. The assertive package has a large collection of assert_all_are_? and is_in_? types of functions that can be used in the manner seen below.
library(assertive)
is_in_future(x = today() + days(10))
## 2015-09-08 20:00:00
## TRUE
is_in_future(x = today() - days(10))
## There was 1 failure:
## Position Value Cause
## 1 1 2015-08-19 20:00:00 in past
The biggest difference to note here is that this does not raise an error; it reports that there was one failure instead.
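If you do want a hard stop, assertive also ships assert_* counterparts for many of its checks. A small sketch, assuming assert_all_are_in_past() exists alongside the is_in_future() used above (the exact name is my assumption):
# This should raise an error, since the date is in the future.
assert_all_are_in_past(today() + days(10))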
There is also the assertthat package, which seems to be aimed more at functions than at data. Really it is for making checks on the values that go into and come out of functions.
library(assertthat)
is_odd <- function(x) {
assert_that(is.numeric(x), length(x) == 1)
x %% 2 == 1
}
assert_that(is_odd(2))
## Error: is_odd(x = 2) is not TRUE
on_failure(is_odd) <- function(call, env) {
paste0(deparse(call$x), " is even")
}
assert_that(is_odd(1))
## [1] TRUE
assert_that(is_odd(2))
## Error: 2 is even
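For cases where you would rather inspect a result than stop, assertthat also exports see_if(), which returns a logical instead of raising an error. A small sketch using the is_odd() function from above:
# Returns FALSE instead of throwing; the failure message is attached as an attribute.
see_if(is_odd(2))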
Another package that provides assertions on function arguments is argufy, which does this in a pretty interesting way: we wrap the function definition in a set of checks.
library(argufy)
prefix <- argufy(function(
str = ? is.character,
len = ? is.numeric(len) && len > 0) {
substring(str, 1, len)
})
substring('This works', 1, 5)
## [1] "This "
prefix('This works', 5)
## [1] "This "
substring('This works', 1, -1)
## [1] ""
prefix('This does not work', -1)
## Error in is.numeric && len > 0 : invalid 'x' type in 'x && y'
The ensurer package comes at this from a different perspective. It does assertions, but it views them as ensuring properties via contracts. This starts to push toward having type safety in R.
library(ensurer)
matrix(runif(16), 4, 4) %>%
ensure_that(ncol(.) == nrow(.), all(. <= 1))
## [,1] [,2] [,3] [,4]
## [1,] 0.7939310 0.101298372 0.6966813 0.4942526
## [2,] 0.5289590 0.935170193 0.6730220 0.4995462
## [3,] 0.3428521 0.003302559 0.4415779 0.3875924
## [4,] 0.8808607 0.054348952 0.4190798 0.3552488
matrix(runif(20), 5, 4) %>%
ensure_that(ncol(.) == nrow(.), all(. <= 1))
## Error: conditions failed for call 'matrix(runif(20), 5 .. nrow(.),
## all(. <= ': * ncol(.) == nrow(.)
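The contract idea becomes clearer once a check is given a name and reused. A small sketch, assuming ensurer also exports an ensures_that() constructor for building reusable contracts:
# Name the contract once, then apply it like a lightweight type check.
ensure_square_unit <- ensures_that(ncol(.) == nrow(.), all(. <= 1))
matrix(runif(16), 4, 4) %>% ensure_square_unit()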
While looking into the ensurer package I noticed a few other packages on smbache's GitHub page. They provide some interesting capabilities and even some interesting restrictions.
The immutequality package seems to work just like var and val in Scala.
library(immutequality)
x = 10
print(x)
## [1] 10
# Will raise an error!
x <- x*2
## Error: Cannot reuse the symbol x!
# This also fails
assign("x", 20)
## Error: Cannot reuse the symbol x!
# .. and this too
x = 20
## Error: Cannot reuse the symbol x!
# But this works.
y <- 5
y <- y + 1
The same author also has the import package. Say I need to use the mdy function from the lubridate package. In order to have access to it I have to load everything from that package, which pulls a lot of other stuff into the workspace that I don't need. If I need one function from each of a bunch of packages, I can very quickly add a lot of bloat. It can also hide things that I want when collisions exist; there are some issues with plyr and dplyr doing this. Sometimes I need something like the ddply function, but I almost always have dplyr loaded. One solution is to know the correct order in which to load them, which may not always have a valid answer. Another option is to use something like lubridate::mdy. An even more precise solution is to use the import package.
head(objects("package:lubridate"))
## [1] "%--%" "%m-%" "%m+%" "%within%" "am"
## [6] "as.difftime"
length(objects("package:lubridate"))
## [1] 157
# Notice how the package wants you to follow its own advice.
library(import)
## The import package should not be attached.
## Use "colon syntax" instead, e.g. import::from, or import:::from.
##
## Attaching package: 'import'
##
## The following object is masked from 'package:lubridate':
##
## here
Instead we should use it like this.
import::from(magrittr, "%>%", "%$%", .into = "operators")
import::from(lubridate, mdy, .into = "datatools")
import::into("operators", "%>%", "%$%", .from = magrittr)
import::into("datatools", arrange, .from = dplyr)
This package also provides something similar to Python-style modules.
import::from(some_module.R, a, b, p, plot_it)
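Here some_module.R is just an ordinary script whose top-level objects become importable. A hypothetical file matching the call above might look like this (all of these names are made up for illustration):
# some_module.R (hypothetical contents)
a <- 1
b <- letters[1:5]
p <- function(x) paste0("<", x, ">")
plot_it <- function(x, y) plot(x, y, main = "from some_module.R")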
Instead of digging into this implementation, though, I wanted to mention a package the documentation referred to. The modules package replicates the Python module methodology. Instead of showing code that showcases how it works, I will point to the great documentation about the reasoning behind the package here and a comparison to Python here. These packages give us some powerful features that I like in Scala and Python.
Another useful tool is the wakefield package, which lets you create random or fake data very easily. This is super useful when you need to try something out, see how something scales, or set up demos.
#if (!require("pacman")) install.packages("pacman"); library(pacman)
#p_install_gh("trinker/wakefield")
#p_load(dplyr, wakefield)
library(wakefield)
#set.seed(10)
r_data_frame(n = 30,
id,
race,
age(x = 8:14),
Gender = sex,
Time = hour,
iq,
grade,
height(mean=50, sd = 10),
died,
Scoring = rnorm,
Smoker = valid
)
## Source: local data frame [30 x 11]
##
## ID Race Age Gender Time IQ Grade Height Died
## (chr) (fctr) (int) (fctr) (tims) (dbl) (dbl) (dbl) (lgl)
## 1 01 White 14 Female 00:30:00 98 88.6 46 FALSE
## 2 02 Bi-Racial 12 Male 00:30:00 101 79.1 63 FALSE
## 3 03 White 13 Male 02:30:00 107 91.4 55 FALSE
## 4 04 White 12 Male 03:30:00 112 88.5 56 TRUE
## 5 05 White 12 Female 04:00:00 82 88.8 57 TRUE
## 6 06 White 11 Male 04:00:00 91 85.2 45 FALSE
## 7 07 White 10 Female 04:00:00 100 86.3 56 FALSE
## 8 08 White 10 Female 04:30:00 106 90.9 55 FALSE
## 9 09 Hispanic 13 Female 04:30:00 101 90.9 46 TRUE
## 10 10 White 11 Male 05:00:00 84 88.1 55 FALSE
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: Scoring (dbl), Smoker (lgl)
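Because the recipe is just a function call, scaling it up to test performance is straightforward. A quick sketch reusing column types from the example above (the value of n is arbitrary):
# Generate a larger fake data set with a few of the same column types.
big <- r_data_frame(n = 1e5, id, age, sex, iq)
dim(big)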
A very hard problem is the versioning of data. It would be great if tools like git could handle data; while they can handle smaller data sets, they start to fall apart as the size increases. The daff package, which is a wrapper around the daff JavaScript library, provides some functionality in this area.
library(daff)
y <- iris[1:3,]
z <- y
z <- head(z, 2) # remove a row
z[1,1] <- 10 # change a value
z$hello <- "world" # add a column
z$Species <- NULL # remove a column
patch <- diff_data(y, z)
render_diff(patch)
You can see how one data.frame differs from another, which lets you do some cool things. You can patch one source with the difference so that it is up to date. It also lets you resolve cases where two sources share a common data set but have deviated in different ways: you can merge both copies with the parent to get all of the updates. There are some examples of this on the GitHub site.
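A rough sketch of that patch-and-merge workflow, assuming daff exposes patch_data() and merge_data() helpers as described on the GitHub site (the exact signatures here are my assumption):
# Apply the recorded differences to bring the original copy up to date.
y_patched <- patch_data(y, patch)
# A second copy of y with a different edit.
w <- y
w[2, 2] <- 99
# Merge both sets of edits back against the common parent.
merged <- merge_data(y, z, w)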
There were a few other packages that I wanted to mention here, but maybe they are better suited for another post.