Kenny Darrell @ darrell@datamininglab.com
Lead Data Scientist @ Elder Research
Data and analytics can provide great power
“With great power comes great responsibility” - Uncle Ben (Spider-Man)
AKA - It’s easier to shoot yourself in the foot
Data has helped elect a president!
Can it inform us of who will be the next president?
If the Redskins win their final home game -> the incumbent party wins
Since 1936, but not 2004 and 2012 (18-2)
If the Lakers reach the championship -> the Republican wins
Since their move to LA in 1960, but not 2008 and 2012 (12-2)
Alabama beats LSU -> Democrat wins; LSU beats Alabama -> Republican wins
Worked since 1984 (8-0)
The candidate with the winning cookie recipe wins the election
Worked since 1992, except 2008 (5-1)
But these were jokes, right?
Real research occurred
Data was analyzed
Results were published
We work very hard to prove ourselves correct!
EXAMPLE
load("~/Desktop/car.RData")
head(train)
  Age IsBadBuy VehOdo  Color
1   3        0  86788   BLUE
2   5        0  80824 MAROON
3   2        0  59269 SILVER
4   9        0  78875 MAROON
5   4        0  45341 ORANGE
6   4        0  80785   BLUE
We need a goal by which to measure success
ex: Buy good used cars that don't need repairs, i.e., avoid the “kicked” cars
summary(train)
      Age            VehOdo          IsBadBuy          Color      
 Min.   :0.000   Min.   :  4825   Min.   :0.0000   SILVER :10275  
 1st Qu.:3.000   1st Qu.: 61930   1st Qu.:0.0000   WHITE  : 8321  
 Median :4.000   Median : 73399   Median :0.0000   BLUE   : 7054  
 Mean   :4.171   Mean   : 71528   Mean   :0.1223   GREY   : 5391  
 3rd Qu.:5.000   3rd Qu.: 82392   3rd Qu.:0.0000   BLACK  : 5196  
 Max.   :9.000   Max.   :115026   Max.   :1.0000   RED    : 4247  
                                                   (Other): 9351  
We want to identify the lemons, then avoid them
table(train$IsBadBuy)
    0     1 
43739  6096 
This is a rare-event scenario (only about 12% of cars are lemons)
What traits can help us predict the lemon-ality?
Comparing bad-buy rates across colors, I don't see a huge difference
But ORANGE stands out as more interesting!
We have statistics for this purpose!
How likely could this occur at random?
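The pval() helper used below is not defined in the slides; a minimal sketch, assuming it runs a two-sample proportion test of the bad-buy rate for one color against all other colors:

pval <- function(df, color) {
  # bad-buy counts and totals for this color vs. every other color
  is_color <- df$Color == color
  prop.test(c(sum(df$IsBadBuy[is_color]), sum(df$IsBadBuy[!is_color])),
            c(sum(is_color), sum(!is_color)))$p.value
}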
pval(train, 'ORANGE')
[1] 0.003342797
Very significant
We can publish!
Should we believe this result?
What happens when we use this info?
mod1 <- glm(IsBadBuy ~ Color, data = train, family = "binomial")
summary(mod1)
Call:
glm(formula = IsBadBuy ~ Color, family = "binomial", data = train)
Deviance Residuals: 
    Min      1Q  Median      3Q     Max  
-0.5961 -0.5109 -0.5103 -0.4922  2.2838  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.899913   0.089660 -21.190  < 2e-16 ***
ColorBLACK  -0.168533   0.099855  -1.688  0.09145 .  
ColorBLUE   -0.141035   0.097119  -1.452  0.14645    
ColorORANGE -0.631511   0.243607  -2.592  0.00953 ** 
ColorPURPLE  0.262304   0.190993   1.373  0.16964    
ColorBROWN   0.072222   0.190659   0.379  0.70484    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 37030 on 49834 degrees of freedom
Residual deviance: 36987 on 49821 degrees of freedom
AIC: 37015
Number of Fisher Scoring iterations: 4
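For the usual reading of glm output, exponentiating the headline coefficient gives an odds ratio:

# exp() of a logistic coefficient is an odds ratio:
# exp(-0.631511) is about 0.53, i.e., orange cars appear to have
# roughly half the odds of being a lemon
exp(coef(mod1)["ColorORANGE"])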
Currently 1000 cars for sale
buy <- test[sample(nrow(test), 1000), ]
sum(buy$IsBadBuy)
[1] 136
I run my model
res1 <- predict(mod1, newdata = buy, type = 'response')
I can only buy 80
res1 <- data.frame(p = res1, act = buy$IsBadBuy)
# keep the 80 cars with the lowest predicted probability of being a lemon
res1 <- res1[order(res1$p), ][1:80, ]
And I'm stuck with
sum(res1$act)
[1] 10
Except it was very close to random! (The test below asks whether bad cars were picked at a lower rate, 10 of 136, than cars overall, 80 of 1000.)
prop.test(c(sum(res1$act), nrow(res1)),
c(sum(buy$IsBadBuy), 1000),
alternative = 'less')$p.value
[1] 0.4629769
Target shuffling to the rescue
Wait, What?
1. Analyze the data and note the strength of the result
   (e.g., R-squared, lift, correlation, explanatory power)
2. Randomly shuffle the target to “break the relationship” between the output and its inputs
3. Search for a new best model – or “most interesting result” – and save its strength
4. Repeat steps 2 and 3 to create a distribution of the strengths of the “Best Apparent Discoveries” (BADs)
5. Evaluate where your true result (from step 1) falls on this BAD distribution. This is the “significance”: the chance that a result this strong could occur at random.
m <- c(pval(train, 'ORANGE'))  # start with the real result
copy_train <- train
for (i in 1:1000) {
  # break the relationship between target and inputs
  copy_train$IsBadBuy <- sample(copy_train$IsBadBuy)
  # best apparent discovery: the most significant color on shuffled data
  ps <- sapply(levels(train$Color), function(x) pval(copy_train, x))
  m <- c(m, min(ps))
}
# rank of the real result among the shuffled ones = empirical p-value
min(which(sort(m) == m[1])) / length(m)
[1] 0.05494505
Target shuffling tells us our model is junk! When target shuffling does endorse a model, that is a very good sign we have something real!
t.test(train[train$IsBadBuy == 0, ]$VehOdo,
train[train$IsBadBuy == 1, ]$VehOdo,
alternative = 'less')$p.value
[1] 3.291555e-85
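The pvals() helper used next is also not defined in the slides; a minimal sketch, assuming it simply wraps the one-sided t-test above:

pvals <- function(df) {
  # one-sided t-test: do good buys have lower odometer readings?
  t.test(df[df$IsBadBuy == 0, ]$VehOdo,
         df[df$IsBadBuy == 1, ]$VehOdo,
         alternative = 'less')$p.value
}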
m <- c(pvals(train))  # start with the real result
copy_train <- train
for (i in 1:1000) {
  copy_train$IsBadBuy <- sample(copy_train$IsBadBuy)
  ps <- pvals(copy_train)
  m <- c(m, min(ps))
}
# rank of the real result among the shuffled ones = empirical p-value
min(which(sort(m) == m[1])) / length(m)
[1] 0.000999001
This statistic also got worse, but it is still much better than random! The real p-value beat all 1,000 shuffles (rank 1 of 1001, about 0.001).
mod2 <- glm(IsBadBuy ~ VehOdo, data = train, family = "binomial")
summary(mod2)
Call:
glm(formula = IsBadBuy ~ VehOdo, family = "binomial", data = train)
Deviance Residuals: 
    Min      1Q  Median      3Q     Max  
-0.7072 -0.5468 -0.4986 -0.4285  2.5753  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.372e+00  7.592e-02  -44.41   <2e-16 ***
VehOdo       1.919e-05  1.003e-06   19.13   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 37030 on 49834 degrees of freedom
Residual deviance: 36647 on 49833 degrees of freedom
AIC: 36651
Number of Fisher Scoring iterations: 5
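Comparing the two fits with the AIC values reported above (lower is better) already favors the odometer model:

# AIC comparison: 36651 (mod2) vs. 37015 (mod1)
AIC(mod1, mod2)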
Currently 1000 cars for sale
sum(buy$IsBadBuy)
[1] 136
I run my model
res2 <- predict(mod2, newdata = buy, type = 'response')
I can only buy 80
res2 <- data.frame(p = res2, act = buy$IsBadBuy)
res2 <- res2[order(res2$p), ][1:80, ]
And I'm stuck with
sum(res2$act)
[1] 4
prop.test(c(sum(res2$act), nrow(res2)),
c(sum(buy$IsBadBuy), 1000),
alternative = 'less')$p.value
[1] 0.02615255
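This time the result holds up: the bad cars were selected at a significantly lower rate (4 of 136) than cars overall (80 of 1000).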
We need the mindset of proving something false, not true.
Nobody goes to work with the goal of breaking everything they have done, but the real world has no problem doing it for you.
Amgen could only replicate 6 of 53 landmark studies
Bayer Healthcare replicated only 25% of 67 studies
BMJ: 92% of 1,500 referees missed serious errors
157 of 304 journals accepted the fake Bohannon paper
Stan Young: Examined controlled experiments trying to replicate 12 data “discoveries”:
0 replicated; 7 neutral; 5 reversed
We can always find something “interesting” through randomness alone
We love stories and will believe anything, so interpretability is no protection against error
Science requires replication and transparency
65-95% of “health discovery” papers are false, due to the vast search effect (multiple comparisons)
Questions?