Blogs

Empirical Bayesian Estimation

Kenny Darrell

June 30, 2016

How would you determine who is the best three-point shooter in the NBA? Is it the same as the best two point-shooter or the best free throw shooter. There is a three-point competition but that is a little different from what this question is asking. One of two next steps are commonly pursued to answer a question of this sort. Each has its own pitfalls. By trying to answer this question we can see how both of them fall apart.

We will use some NBA player statistics. It is basically the season totals for each player. Here is roughly what it looks like.

head(details)

## Source: local data frame [6 x 14]
## 
##             name         birth  season three_made three_attempt two_made
##            (chr)         (chr)   (chr)      (dbl)         (dbl)    (dbl)
## 1 Alaa Abdelnaby June 24, 1968 1990-91          0             0       55
## 2 Alaa Abdelnaby June 24, 1968 1991-92          0             0      178
## 3 Alaa Abdelnaby June 24, 1968 1992-93          0             1      245
## 4 Alaa Abdelnaby June 24, 1968 1992-93          0             1       26
## 5 Alaa Abdelnaby June 24, 1968 1992-93          0             0      219
## 6 Alaa Abdelnaby June 24, 1968 1993-94          0             0       24
## Variables not shown: two_attempt (dbl), ft_made (dbl), ft_attempt (dbl),
##   rebounds (dbl), assist (dbl), steal (dbl), block (dbl), turnover (dbl)

For the first method, we are going to calculate the average, that is shots made divided by attempts. First we will focus on just three-point shooting. We first group by player, which is uniquely defined by the name and birthday, then we sum over the number of shots made and attempted. We then divide the total number of shots made by the total number shots attempted. If we then sort on the average we should be able to see who is the best three-point shooter?

details %>% 
  group_by(name, birth) %>%
  summarise(three_made = sum(three_made), 
            three_attempt = sum(three_attempt)) %>%
  select(-birth) %>% 
  ungroup %>%
  mutate(three_avg = three_made / three_attempt) %>% 
  arrange(desc(three_avg)) -> threes

head(threes)

## Source: local data frame [6 x 4]
## 
##             name three_made three_attempt three_avg
##            (chr)      (dbl)         (dbl)     (dbl)
## 1   Al Salvadori          1             1         1
## 2     Alvin Sims          1             1         1
## 3     Bill Allen          2             2         1
## 4 Carroll Hooser          1             1         1
## 5    Coty Clarke          2             2         1
## 6   David Gaines          1             1         1

Clearly this did not work. This method is obviously flawed since we know the best three-point shooters should have taken more than only a couple of shots. These are really just people that only made a few attempts, they may be good but they may also just be lucky.

threes %>% filter(three_avg == 1) %>%
  arrange(desc(three_made))

## Source: local data frame [21 x 4]
## 
##              name three_made three_attempt three_avg
##             (chr)      (dbl)         (dbl)     (dbl)
## 1   Gundars Vetra          3             3         1
## 2      Bill Allen          2             2         1
## 3     Coty Clarke          2             2         1
## 4      Eddy Curry          2             2         1
## 5   Eric Anderson          2             2         1
## 6    Al Salvadori          1             1         1
## 7      Alvin Sims          1             1         1
## 8  Carroll Hooser          1             1         1
## 9    David Gaines          1             1         1
## 10     Eric White          1             1         1
## ..            ...        ...           ...       ...

We can also see what would be the result if we were to ask the question who is the worst three-point shooter. This method produces just as bad of an error.

threes %>% filter(three_avg == 0) %>%
  arrange(desc(three_made))

## Source: local data frame [480 x 4]
## 
##               name three_made three_attempt three_avg
##              (chr)      (dbl)         (dbl)     (dbl)
## 1      A.J. Wynder          0             1         0
## 2       Aaron Gray          0             4         0
## 3    Abdul Jeelani          0             7         0
## 4        Acie Earl          0            14         0
## 5     Adonal Foyle          0             2         0
## 6  Adrian Caldwell          0             5         0
## 7   Alaa Abdelnaby          0             6         0
## 8       Alan Hardy          0             5         0
## 9         Alan Ogg          0             2         0
## 10  Alex Blackwell          0             3         0
## ..             ...        ...           ...       ...

So this method does not work. There has been a lot of discussion of this problem. It is talked about in the book Thinking Fast and Slow. There is a very interesting example of rural counties having both the highest and lowest cancer rates, one dicsussion of that can be found here. What is happening is that in the cases with fewer observations have more room for variance, thus just a few flips in either direction will really skew things. As a player takes more shots the average will start to approach the true average. We can see that effect here in this plot.

threes %>%
  ggplot(aes(three_attempt, three_avg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10()

What if we just looked at it by who made the most?

threes %>% arrange(desc(three_made)) %>% head

## Source: local data frame [6 x 4]
## 
##               name three_made three_attempt three_avg
##              (chr)      (dbl)         (dbl)     (dbl)
## 1        Ray Allen       3174          7962 0.3986436
## 2    Reggie Miller       2560          6486 0.3946963
## 3 Chauncey Billups       2245          5810 0.3864028
## 4     Vince Carter       2180          5823 0.3743775
## 5      Jason Terry       2169          5724 0.3789308
## 6       Jason Kidd       2168          6178 0.3509226

This may look more valid but Jason Terry made one more than Jason Kidd but it took 454 less tries also. Vince Carter made eleven more shots than Jason Terry but he also took 99 more shots. Since Jason Terry has an average of around 37% if he was to take 99 more shots he would most likely have made at least eleven of them.

To get over this issue we can use what is known as Empirical Bayes Estimation. You can see the inspirational example here What this method does is that it creates a baseline for the distribution and then adjusts it based on the number of shots made and attempted.

We need to fit a beta distribution to our data. But first we need to remove some of the crazy observations so as to not taint the empirical prior.

threes %>%
  filter(three_attempt >= 500) -> three_filt

m <- suppressWarnings(MASS::fitdistr(three_filt$three_avg, dbeta,
                    start = list(shape1 = 1, shape2 = 10)))

(alpha0 <- m$estimate[1])

##   shape1 
## 64.37282

(beta0 <- m$estimate[2])

##   shape2 
## 119.3305

ggplot(three_filt) +
  geom_histogram(aes(three_avg, y = ..density..), binwidth = .005) +
  stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red", size = 1) +
  xlab("Three Point Percentage")

Now to construct the new estimate using the players' data and this beta distribution we apply a couple of additions.

threes %>% 
  filter(three_attempt > 0) %>%
  mutate(est = (three_made + alpha0) / (three_attempt + alpha0 + beta0)) -> threes

And now we can see more practical results.

threes %>% arrange(est) %>% head

## Source: local data frame [6 x 5]
## 
##             name three_made three_attempt  three_avg       est
##            (chr)      (dbl)         (dbl)      (dbl)     (dbl)
## 1   Andre Miller        218           986 0.22109533 0.2414055
## 2  Avery Johnson         32           215 0.14883721 0.2417156
## 3 Darrell Walker          6           103 0.05825243 0.2454552
## 4  Kelvin Ransey         20           152 0.13157895 0.2513315
## 5    Darwin Cook         63           321 0.19626168 0.2523717
## 6 Speedy Claxton         53           279 0.18996416 0.2536676

threes %>% arrange(est) %>% tail

## Source: local data frame [6 x 5]
## 
##            name three_made three_attempt three_avg       est
##           (chr)      (dbl)         (dbl)     (dbl)     (dbl)
## 1   Kyle Korver       1998          4691 0.4259220 0.4230766
## 2    Steve Nash       1685          3939 0.4277735 0.4243266
## 3   Steve Novak        626          1440 0.4347222 0.4251841
## 4  Hubert Davis        806          1823 0.4421284 0.4337327
## 5 Stephen Curry       1593          3590 0.4437326 0.4391900
## 6    Steve Kerr        732          1625 0.4504615 0.4403004

This plot is also borrowed from the above mentioned blog. It does a great job at showing the shrinkage to baseline. When you have very shots attempted you are pulled further to the mean. Points are pulled down if they are above and pulled up if they are below. The amount of the pull depends on the number of shots, so the lighter blue do not get shrunken nearly as much.

ggplot(threes, aes(three_avg, est, color = three_attempt)) +
  geom_hline(yintercept = alpha0 / (alpha0 + beta0), color = "red", lty = 2) +
  geom_point() +
  geom_abline(color = "red") +
  scale_colour_gradient(trans = "log", breaks = 10 ^ (1:5)) +
  xlab("Field Goal Percentage") +
  ylab("Empirical Bayes Field Goal Percentage")

Some issues to think about here though. If you were pretty bad in first few games, you may not play for long. We may also see some impact from players who stayed in the NBA beyond there prime. Thus at one point in there career they may have been the best ever, but earlier and later seasons may weaken this. We could try to get around this by asking what is the best season average.

details %>% 
  select(name, birth,
         three_made, three_attempt, season) -> threes_season
head(threes_season)

## Source: local data frame [6 x 5]
## 
##             name         birth three_made three_attempt  season
##            (chr)         (chr)      (dbl)         (dbl)   (chr)
## 1 Alaa Abdelnaby June 24, 1968          0             0 1990-91
## 2 Alaa Abdelnaby June 24, 1968          0             0 1991-92
## 3 Alaa Abdelnaby June 24, 1968          0             1 1992-93
## 4 Alaa Abdelnaby June 24, 1968          0             1 1992-93
## 5 Alaa Abdelnaby June 24, 1968          0             0 1992-93
## 6 Alaa Abdelnaby June 24, 1968          0             0 1993-94

threes_season %>% 
  group_by(name, birth, season) %>%
  summarise(three_made = sum(three_made), three_attempt = sum(three_attempt)) %>% 
  ungroup %>% 
  filter(three_attempt > 0) %>% 
  mutate(three_avg = three_made / three_attempt) %>% 
  arrange(desc(three_avg)) -> threes_season

threes_season %>%
  ggplot(aes(three_attempt, three_avg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10()

This is very similar

threes_season %>%
  filter(three_attempt >= 200) -> threes_season_filt

m <- suppressWarnings(MASS::fitdistr(threes_season_filt$three_avg, dbeta,
                                     start = list(shape1 = 1, shape2 = 10)))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

ggplot(threes_season_filt) +
  geom_histogram(aes(three_avg, y = ..density..), binwidth = .005) +
  stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red", size = 1) +
  xlab("Seasonal Three Point Percentage")

threes_season %>%
   filter(three_attempt > 0) %>%
  mutate(est = (three_made + alpha0) / (three_attempt + alpha0 + beta0)) -> threes_season

threes_season %>% select(-birth) -> threes_season


threes_season %>% arrange(desc(est)) %>% select(-three_made, -three_attempt) %>% head(20)

## Source: local data frame [20 x 4]
## 
##               name    season three_avg       est
##              (chr)     (chr)     (dbl)     (dbl)
## 1       Tim Legler   1995-96 0.5224490 0.4662383
## 2      Kyle Korver 2014-15 ★ 0.4922049 0.4626337
## 3       Steve Kerr   1995-96 0.5147679 0.4601699
## 4     Craig Hodges   1987-88 0.4914286 0.4560655
## 5     Jason Kapono   2006-07 0.5142857 0.4556309
## 6       Steve Kerr   1994-95 0.5235294 0.4531079
## 7  Detlef Schrempf 1994-95 ★ 0.5138122 0.4500138
## 8      J.J. Redick   2015-16 0.4750594 0.4482160
## 9      Joe Johnson   2004-05 0.4783784 0.4479380
## 10     Steve Novak   2010-11 0.5652174 0.4458721
## 11      Dale Ellis 1988-89 ★ 0.4778761 0.4455982
## 12       Glen Rice 1996-97 ★ 0.4704545 0.4455888
## 13     Kyle Korver   2013-14 0.4719388 0.4444326
## 14   Jose Calderon   2012-13 0.4609929 0.4423752
## 15      Steve Nash 2007-08 ★ 0.4698163 0.4422877
## 16   Stephen Curry 2015-16 ★ 0.4537246 0.4419404
## 17     Kyle Korver   2009-10 0.5363636 0.4417209
## 18     Dana Barros 1994-95 ★ 0.4635294 0.4396890
## 19    Hubert Davis   1995-96 0.4756554 0.4383795
## 20      Steve Kerr   1989-90 0.5069444 0.4380973

It is a little reassuring to see the stars next to the season which indicates that this player was an all star that year.

ggplot(threes_season, aes(three_avg, est, color = three_attempt)) +
  geom_hline(yintercept = alpha0 / (alpha0 + beta0), color = "red", lty = 2) +
  geom_point() +
  geom_abline(color = "red") +
  scale_colour_gradient(trans = "log", breaks = 10 ^ (1:5)) +
  xlab("Seasonal Three Point Percentage") +
  ylab("Empirical Bayes Seasonal Three Point Percentage")

Another interesting thing to do is apply this to free throws and two pointers.

details %>% 
  group_by(name, birth) %>%
  summarise(two_made = sum(two_made), 
            two_attempt = sum(two_attempt),
            ft_made = sum(ft_made), 
            ft_attempt = sum(ft_attempt)) %>%
  select(-birth) %>% ungroup -> totals

head(totals)

## Source: local data frame [6 x 5]
## 
##            name two_made two_attempt ft_made ft_attempt
##           (chr)    (dbl)       (dbl)   (dbl)      (dbl)
## 1    A.C. Green     4653        9177    3247       4447
## 2 A.J. Bramlett        4          21       0          0
## 3  A.J. English      608        1353     259        333
## 4   A.J. Guyton       93         247      37         45
## 5    A.J. Price      367         834     223        303
## 6   A.J. Wynder        3          11       6          8

totals %>% 
  mutate(two_avg = two_made / two_attempt,
         ft_avg = ft_made / ft_attempt) -> totals

It is very interesting to note that we see the same effect but it is just shifted around this shot types true mean.

totals %>%
  ggplot(aes(ft_attempt, ft_avg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10()

totals %>%
  ggplot(aes(two_attempt, two_avg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10()

totals %>% filter(ft_attempt >= 1000) -> filt1
totals %>% filter(two_attempt >= 1000) -> filt2

m1 <- suppressWarnings(MASS::fitdistr(filt1$ft_avg, dbeta,
                    start = list(shape1 = 1, shape2 = 10)))
m2 <- suppressWarnings(MASS::fitdistr(filt2$two_avg, dbeta,
                    start = list(shape1 = 1, shape2 = 10)))


totals %>% 
  mutate(est_1 = (ft_made + m1$estimate[1]) / 
           (ft_attempt + m1$estimate[1] + m1$estimate[2]),
         est_2 = (two_made + m2$estimate[1]) / 
           (two_attempt + m2$estimate[1] + m2$estimate[2])) %>% na.omit %>% ungroup -> totals

totals %>% select(name, ft_made, ft_attempt, est_1) %>% arrange(desc(est_1)) %>% head

## Source: local data frame [6 x 4]
## 
##                 name ft_made ft_attempt     est_1
##                (chr)   (dbl)      (dbl)     (dbl)
## 1         Steve Nash    3060       3384 0.9027327
## 2         Mark Price    2135       2362 0.9017280
## 3 Mahmoud Abdul-Rauf    1051       1161 0.9008704
## 4      Brian Roberts     345        378 0.8993686
## 5      Stephen Curry    1668       1850 0.8989065
## 6    Peja Stojakovic    2498       2785 0.8951895

totals %>% select(name, two_made, two_attempt, est_2) %>% arrange(desc(est_2)) %>% head

## Source: local data frame [6 x 4]
## 
##               name two_made two_attempt     est_2
##              (chr)    (dbl)       (dbl)     (dbl)
## 1   DeAndre Jordan     2074        3088 0.6662031
## 2   Brandan Wright     1422        2324 0.6066962
## 3 Hassan Whiteside      668        1096 0.5990589
## 4       Greg Smith      243         393 0.5908849
## 5  Andris Biedrins     1432        2407 0.5904658
## 6   Tyson Chandler     3209        5422 0.5898742

It is interesting that we see Stephen Curry in a couple of all nut one of these tables. This brings up the question of who is the best all overall shooter? Or even more interesting who is the best overall player. I want to cover that next, but that is a task for another day.