15 August, 2016

wOBA Shrinkage Estimation by the Multinomial-Dirichlet Model

In a previous article, I showed how to calculate a very basic confidence interval for wOBA using the multinomial model. Since then, I've shown how to perform shrinkage estimation (regression towards the mean) for basic counting stats such as BA and OBP and rate stats such as WHIP. In this article, I'm going to show how to use the multinomial model with a Dirichlet prior to find a regressed estimate of wOBA (and other functions that are linear transformations of counting stats).

As in the previous post, I will use the weights of wOBA from the definition on fangraphs.com. I am aware that in The Book wOBA also includes the outcome of reaching base on error, but the method here easily expands to include that outcome, and the results should not change drastically from its exclusion.

As usual, all the code and data I use can be found on my github.


The Multinomial and Dirichlet Models



Suppose you observe $n$ independent, identical trials of an event (say, a plate appearance or an at-bat) with $k$ possible outcomes (single, double, triple, home run, walk, sacrifice fly, etc. - all the way up to an out). The distribution of counts of each of the possible outcomes is multinomial with probability mass function:

$p(x_1, x_2,...,x_k | \theta_1, \theta_2, ..., \theta_{k}, n) = \dfrac{n!}{x_1!x_2!...x_k!}\theta_1^{x_1}\theta_2^{x_2}...\theta_k^{x_k}$

Where $x_1$, $x_2$,..., $x_k$ represent counts of each outcome (with $n = x_1 + x_2 + ... + x_k$ fixed) and $\theta_1, \theta_2, ..., \theta_k$ represent the probability of each outcome in a single event - note that all the probabilities $\theta_j$ must sum to 1, so you will sometimes see the last term written as $\theta_k = (1-\theta_1-\theta_2-...-\theta_{k-1})$.

To give an example, suppose that in each plate appearance (the event) a certain player has a 0.275 chance of getting a hit, a 0.050 chance of getting a walk, and a 0.675 chance of an "other" outcome happening (meaning anything other than a hit or walk - an out, hit by pitch, reach base on error, etc.). Then in $n = 200$ plate appearances, the probability of exactly $x_H = 55$ hits, $x_{BB} = 10$ walks, and $x_{OTH} = 135$ other outcomes is given by



$\dfrac{200!}{55!10!135!} 0.275^{55} 0.05^{10} 0.675^{135} = 0.008177562$



(This probability is necessarily small because there are a little over 20,000 ways to have the three outcomes sum up to 200 plate appearances. In fact, 55 hits, 10 walks, and 135 other is the most probable set of outcomes.)
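As a quick check, the same probability can be computed in R with the built-in dmultinom function:

> dmultinom(x = c(55, 10, 135), size = 200, prob = c(0.275, 0.050, 0.675))
[1] 0.008177562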

The multinomial is, as its name implies, a multivariate extension of the classic binomial distribution (a binomial is just a multinomial with $k = 2$). Similarly, there is a multivariate extension of the beta distribution called the Dirichlet distribution. The Dirichlet is used to represent the joint distribution of the $\theta_j$ themselves - that is, the joint distribution (over the entire league) of sets of talent levels for each of the $k$ possible outcomes. The probability density function of the Dirichlet is

$p(\theta_1, \theta_2, ..., \theta_{k} | \alpha_1, \alpha_2, ..., \alpha_k) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1}...\theta_k^{\alpha_k - 1}$


The Dirichlet distribution would be used to answer the question "What is the probability that a player has a hit probability of between 0.250 and 0.300 and simultaneously a walk probability of between 0.050 and 0.100?" - the advantage of doing this is being able to model the covariance between the talent levels.

The expected values of each of the $\theta_j$ are given by

$E[\theta_j] = \dfrac{\alpha_j}{\alpha_0}$

where

$\alpha_0 = \displaystyle \sum_{j = 1}^k \alpha_j$

These represent the league average talent levels in each of the outcomes. So for example, using hits (H), walks (BB), and other (OTH), the quantity given by

$E[\theta_{BB}] = \dfrac{\alpha_{BB}}{\alpha_H + \alpha_{BB} + \alpha_{OTH}}$

would be the average walk proportion (per PA) over all MLB players.

The reason the Dirichlet distribution is useful is that it is conjugate to the multinomial distribution above. Given raw counts $x_1, x_2, ..., x_k$ for each outcome in the multinomial model and parameters $\alpha_1$, $\alpha_2$, ..., $\alpha_k$ for the Dirichlet model, the posterior distribution for the $\theta_j$ is also Dirichlet with parameters $\alpha_1 + x_1, \alpha_2 + x_2, ..., \alpha_k + x_k$:


$p(\theta_1, \theta_2, ..., \theta_{k} | x_1, x_2, ..., x_k) = $

$\displaystyle \dfrac{\Gamma(\sum_{j = 1}^k (\alpha_j + x_j))}{\prod_{j = 1}^k \Gamma(\alpha_j + x_j)} \theta_1^{\alpha_1 + x_1 - 1} \theta_2^{\alpha_2 + x_2 - 1}...\theta_{k-1}^{\alpha_{k-1} + x_{k-1} - 1}\theta_k^{\alpha_k + x_k - 1}$



For the posterior, the expected value for each outcome is given by

$E[\theta_j] = \dfrac{\alpha'_j}{\alpha'_0}$

where

$\alpha'_j = x_j + \alpha_j$

$\alpha'_0 = \displaystyle \sum_{j = 1}^k \alpha'_j = \sum_{j = 1}^k (x_j + \alpha_j)$



These posterior means $E[\theta_j]$ represent regressed estimates for each of the outcome talent levels towards the league means. These shrunk estimates can then be plugged in to the formula for any statistic to get a regressed version of that statistic.
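As a tiny numerical illustration in R, continuing the hit/walk/other example from above (the prior parameters here are made up purely for demonstration, not estimated from data):

x     <- c(H = 55, BB = 10, OTH = 135)   # observed counts in n = 200 plate appearances
alpha <- c(H = 50, BB = 9,  OTH = 120)   # hypothetical Dirichlet prior parameters

alpha.prime <- x + alpha                 # posterior Dirichlet parameters
alpha.prime / sum(alpha.prime)           # posterior means E[theta_j] - the regressed estimates

The posterior means come out to roughly (0.277, 0.050, 0.673), landing between the raw proportions (0.275, 0.050, 0.675) and the prior means (roughly 0.279, 0.050, 0.670) - which is exactly the shrinkage.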


Linearly Weighted Statistics


Most basic counting statistics (such as batting average, on-base percentage, etc.) simply estimate the probability of one particular outcome using the raw proportion of events ending in that outcome:


$\hat{\theta_j} \approx \dfrac{x_j}{n}$


More advanced statistics instead attempt to estimate linear functions of true talent levels $\theta_j$ with weights $w_j$ for each outcome:

$w_1 \theta_1 + w_2\theta_2 + ... + w_k \theta_k$

The standard version of these statistics that you can find on any number of baseball sites uses the raw proportion $x_j/n$ as an estimate $\hat{\theta}_j$ as above. To get the regressed version of the statistic, use $\hat{\theta_j} = E[\theta_j]$ from the posterior distribution for the Dirichlet-multinomial model - the formula for the regressed statistic is then

$w_1 \hat{\theta}_1 + w_2 \hat{\theta}_2 + ... + w_k \hat{\theta}_k = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \alpha_j}{\sum_{m = 1}^k (x_m + \alpha_m)}\right) =\sum_{j = 1}^k w_j \left(\dfrac{\alpha_j'}{\alpha_0'}\right)$


(The full posterior distribution can also be used to get interval estimates for the statistic, which will be the focus of the next article.)
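In code, the regressed version of any linearly weighted statistic is only a couple of lines. A minimal R sketch (the function name is mine, not from any package):

# Regressed estimate of a linearly weighted statistic:
#   x     - vector of observed counts for each outcome
#   alpha - vector of Dirichlet parameters (one per outcome, same order as x)
#   w     - vector of weights for each outcome
regressed.stat <- function(x, alpha, w) {
  sum(w * (x + alpha) / sum(x + alpha))
}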


Estimation



This raises the obvious question of what values to use for the $\alpha_j$ in the Dirichlet distribution - ideally, the $\alpha_j$ should be picked so that the Dirichlet distribution accurately models the joint distribution of talent levels for MLB hitters. There are different ways to do this, but I'm going to use MLB data itself and a marginal maximum likelihood technique to find estimates $\hat{\alpha}_j$ and plug those into the equations above - this method was chosen because there are existing R packages to do the estimation, it is relatively numerically stable, and its results should be very close to those of other methods for large sample sizes. Using the data to find estimates of the $\alpha_j$ and then plugging those back in makes this an empirical Bayesian technique.

First, it's necessary to get rid of the talent levels $\theta_j$ and find the probability of the observed data based not on a particular player's talent level(s), but on the Dirichlet distribution of population talent levels itself. This is done by integrating out each talent level:


$p(x_1, x_2, ..., x_k | \alpha_1, \alpha_2, ..., \alpha_k, n) = $
$\displaystyle \int_{\tilde{\theta}} p(x_1, x_2, ..., x_k | \theta_1, \theta_2, ..., \theta_k, n) p(\theta_1, \theta_2, ..., \theta_k | \alpha_1, \alpha_2, ..., \alpha_k, n) d\tilde{\theta}$ 

where $\tilde{\theta}$ indicates the set of all $\theta_j$ values - so integrating out the success probability for each outcome, from 0 to 1.

The calculus here is a bit tedious, so skipping straight to the solution, this gives the probability mass function:

$ = \displaystyle \dfrac{n! \Gamma(\sum_{j = 1}^k \alpha_j)}{\Gamma(n + \sum_{j = 1}^k \alpha_j)} \prod_{j = 1}^k \dfrac{\Gamma(x_j + \alpha_j)}{x_j! \Gamma(\alpha_j)}$


This distribution, known as the Dirichlet-Multinomial distribution, represents the probability of getting $x_1, x_2, ..., x_k$ outcomes in fixed $n = x_1 + x_2 + ... + x_k$ events, given only information about the population. Essentially, this distribution would be used to answer the question "What's the probability that, if I select a player from the population of all MLB players completely at random - so not knowing the player's talent levels at all - the player gets $x_1$ singles, $x_2$ doubles, etc., in $n$ plate appearances?"

Using $x_{i,j}$ to represent the raw count for outcome $j$ for player $i = 1, 2, ..., N$, and $\tilde{x}_i$ as shorthand for the complete set of counting statistics for player $i$, the counting stats $\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_N$ of multiple players from some population (such as all MLB players that meet some event threshold) can be fed into standard statistical estimation procedures to acquire estimates $\hat{\alpha}_j$ of the true population parameters $\alpha_j$.

For the maximum likelihood approach, the log-likelihood of a set of estimates $\tilde{\alpha}$ is given by

$\ell(\tilde{\alpha} | \tilde{x}_1, ..., \tilde{x}_N) = \displaystyle \sum_{i = 1}^N [\log(n_i!) + \log(\Gamma(\sum_{j = 1}^k \alpha_j)) - \log(\Gamma(n_i + \sum_{j = 1}^k \alpha_j))$ 
$+ \displaystyle \sum_{j = 1}^k \log(\Gamma(x_{i,j} + \alpha_j)) - \sum_{j = 1}^k \log(x_{i,j}!) - \sum_{j = 1}^k \log(\Gamma(\alpha_j))]$
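Written out in R, this log-likelihood is just the Dirichlet-multinomial log mass function summed over players. A minimal (and unoptimized) sketch, with function names of my own choosing:

# Log of the Dirichlet-multinomial mass function for one player's count
# vector x given Dirichlet parameters alpha (lgamma avoids overflow)
ddirmult.log <- function(x, alpha) {
  n <- sum(x)
  lgamma(n + 1) + lgamma(sum(alpha)) - lgamma(n + sum(alpha)) +
    sum(lgamma(x + alpha) - lgamma(x + 1) - lgamma(alpha))
}

# Marginal log-likelihood of alpha given a matrix X with one row of
# counts per player - this is the function that gets maximized over alpha
loglik <- function(alpha, X) {
  sum(apply(X, 1, ddirmult.log, alpha = alpha))
}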

The maximum likelihood method works by finding the values of the $\alpha_j$ that maximize  $\ell(\tilde{\alpha})$  above - these are the maximum likelihood estimates $\hat{\alpha}_j$. From a numerical perspective, doing this is not simple, and papers have been written on fast, easy ways to perform the computations. For simplicity, I'm going to use the dirmult package in R, which only requires the set of counts for each outcome as a matrix where each row corresponds to exactly one player. The dirmult package can be installed with the command

> install.packages('dirmult')

Once the data is entered and estimation is performed, you will have estimates $\hat{\alpha}_j$. These can then be plugged into the posterior equations above to get the regressed statistic estimate

$w_1 \hat{\theta}_1 + w_2 \hat{\theta}_2 + ... + w_k \hat{\theta}_k = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \hat{\alpha}_j}{\sum_{m = 1}^k (x_m + \hat{\alpha}_m)}\right)$


I'll give two examples of offensive stats that can be regressed in this way. 


wOBA Shrinkage



The first is weighted on-base average (which I'll call wOBA for short), introduced in Tom Tango's The Book, though as previously mentioned, I am using the fangraphs.com definition. For events, at-bats plus unintentional walks, sacrifice flies, and times hit by a pitch will be used (n = AB + BB - IBB + SF + HBP), and seven outcomes are defined - singles (1B), doubles (2B), triples (3B), home runs (HR), unintentional walks (BB), hit by pitch (HBP), and everything else lumped into an "other" (OTH) outcome.

For notation, again let $x_{i, 1B}$, $x_{i, 2B}$, ..., $x_{i, OTH}$ represent the number of singles, doubles, etc. for player $i$  (abbreviating the entire set as $\tilde{x}_i$) and let  $\theta_{i, 1B}$, $\theta_{i, 2B}$, ..., $\theta_{i, OTH}$ represent the true probability of getting a single, double, etc. for player $i$ (abbreviating the entire set as $\tilde{\theta}_i$). The total number of events for player $i$ is given by $n_i$.

Data was collected from fangraphs.com on all MLB non-pitchers from 2010 - 2015. A cutoff of 300 events was used - so only players with at least 300 total AB + BB - IBB + SF + HBP in a given season were used. The code and data I used may be found in my github.

A player's wOBA can be written as a linear transformation of the $\theta_j$ for each of these outcomes with weights $w_{1B} = 0.89$, $w_{2B} = 1.27$, $w_{3B} = 1.62$, $w_{HR} = 2.10$, $w_{BB} = 0.69$, $w_{HBP} = 0.72$, and $w_{OTH} = 0$ as

$wOBA_i = 0.89*\theta_{i,1B} + 1.27*\theta_{i,2B} + 1.62*\theta_{i,3B} + 2.10*\theta_{i,HR} + 0.69*\theta_{i,BB}+0.72*\theta_{i,HBP}$

For player $i$, the distribution of the counts $\tilde{x}_i$ in $n_i$ events is multinomial with mass function

$p(\tilde{x}_i | \tilde{\theta}_i, n_i) =\dfrac{n_i!}{x_{i,1B}! x_{i,2B}! x_{i,3B}! x_{i,HR}! x_{i,BB}! x_{i,HBP}! x_{i,OTH}!}\theta_{i,1B}^{x_{i,1B}} \theta_{i,2B}^{x_{i,2B}} \theta_{i,3B}^{x_{i,3B}} \theta_{i,HR}^{x_{i,HR}} \theta_{i,BB}^{x_{i,BB}} \theta_{i,HBP}^{x_{i,HBP}} \theta_{i,OTH}^{x_{i,OTH}}$


The joint distribution of possible talent levels $\tilde{\theta}_i$ is assumed to be Dirichlet:

$p(\tilde{\theta}_i | \tilde{\alpha}) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_{i,1B}^{\alpha_{1B} - 1} \theta_{i,2B}^{\alpha_{2B} - 1}\theta_{i,3B}^{\alpha_{3B} - 1} \theta_{i,HR}^{\alpha_{HR} - 1}\theta_{i,BB}^{\alpha_{BB} - 1}\theta_{i,HBP}^{\alpha_{HBP} - 1}\theta_{i,OTH}^{\alpha_{OTH} - 1}$


To find the maximum likelihood estimates $\hat{\alpha}_j$ for this model using the dirmult package in R, the data needs to be loaded into a matrix, where row $i$ represents the raw counts for each outcome for player $i$. There are any number of ways to do this, but the first 10 rows of the matrix (out of 1,598 player-seasons in the data set used for this example) should look something like:

> x[1:10,]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
 [1,]   91   38    1   42  109    5  353
 [2,]  122   26    1   44   71    5  364
 [3,]   58   25    1   18   43    0  196
 [4,]  111   40    3   32   38    5  336
 [5,]   63   25    0   30   56    3  253
 [6,]   67   18    1   21   46    5  213
 [7,]   86   24    2   43  108    6  362
 [8,]   58   25    2   20   24    3  201
 [9,]   35   16    2   25   56    2  200
[10,]   68   44    0   14   76    5  250
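One possible way to build this matrix from a FanGraphs-style CSV export is sketched below - the file name and column names here are placeholders (read.csv, for instance, typically renames columns like 2B and 3B to X2B and X3B), so they will need to be adjusted to match the actual export:

d <- read.csv("fangraphs_batting_2010_2015.csv")     # hypothetical file name

# Keep only player-seasons with at least 300 events
d <- d[d$AB + d$BB - d$IBB + d$SF + d$HBP >= 300, ]

singles <- d$H - d$X2B - d$X3B - d$HR                # singles = hits minus extra-base hits
uBB     <- d$BB - d$IBB                              # unintentional walks
n       <- d$AB + uBB + d$SF + d$HBP                 # total events per player
other   <- n - (singles + d$X2B + d$X3B + d$HR + uBB + d$HBP)

x <- cbind(singles, d$X2B, d$X3B, d$HR, uBB, d$HBP, other)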


Once the data is in this form,  finding the maximum likelihood estimates can be done with the commands

> library(dirmult)
> dirmult.fit <- dirmult(x)
Iteration 1: Log-likelihood value: -863245.946200229
Iteration 2: Log-likelihood value: -863214.860463357
Iteration 3: Log-likelihood value: -863210.928976511
Iteration 4: Log-likelihood value: -863210.901250554
Iteration 5: Log-likelihood value: -863210.901248778


> dirmult.fit


$loglik
[1] -863210.9


$ite
[1] 5


$gamma
[1]  34.30376  10.44264   1.15606   5.73569  16.28635   1.96183 144.51164


$pi
[1] 0.160000389 0.048706790 0.005392124 0.026752540 0.075963185 0.009150412 0.674034560


$theta
[1] 0.004642569

What I called the $\hat{\alpha}_j$ are given as $gamma in the output. The quantity $\alpha_0$ can be calculated by summing these up:

> alpha <- dirmult.fit$gamma
> sum(alpha)
[1] 214.398


So the joint distribution of talent levels over the population of MLB players with at least 300 events is approximated by a Dirichlet distribution with parameters:

$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR},  \theta_{BB}, \theta_{HBP}, \theta_{OTH}) \sim Dirichlet(34.30, 10.44,  1.16,  5.74, 16.29, 1.96, 144.51)$

In 2013, Mike Trout had $x_{1B} = 115$ singles, $x_{2B} = 39$ doubles, $x_{3B} = 9$ triples, $x_{HR} = 27$ home runs, $x_{BB} = 100$ unintentional walks, $x_{HBP} = 9$ times hit by a pitch, and $x_{OTH} = 407$ other outcomes in $n = 706$ total events for a raw (non-regressed) wOBA of

$0.89 \left(\dfrac{115}{706}\right) + 1.27 \left(\dfrac{39}{706}\right) + 1.62 \left(\dfrac{9}{706}\right) + 2.10 \left(\dfrac{27}{706}\right) + 0.69 \left(\dfrac{100}{706}\right) +  0.72 \left(\dfrac{9}{706}\right) \approx 0.423$

In order to calculate the regressed weighted on-base average, first calculate the $\alpha_j'$ for Mike Trout's posterior distribution of batting ability by

$\alpha_{1B}' = 115 + 34.30 = 149.30$
$\alpha_{2B}' = 39 + 10.44 = 49.44$
$\alpha_{3B}' = 9+ 1.16 = 10.16$
$\alpha_{HR}' = 27 + 5.74 = 32.74$
$\alpha_{BB}' = 100 + 16.29 = 116.29$
$\alpha_{HBP}' = 9 + 1.96 = 10.96$
$\alpha_{OTH}' = 407 + 144.51 = 551.51$

This gives $\alpha_0' = 920.40$. The regressed version of Mike Trout's 2013 weighted on-base average is then given by a linear transformation of the expected proportion in each outcome:

$0.89 \left(\dfrac{149.30}{920.40}\right) + 1.27 \left(\dfrac{49.44}{920.40}\right) + 1.62 \left(\dfrac{10.16}{920.40}\right) + 2.10 \left(\dfrac{32.74}{920.40}\right) + 0.69 \left(\dfrac{116.29}{920.40}\right) + 0.72 \left(\dfrac{10.96}{920.40}\right)$

 $\approx 0.401$

So based solely on his 2013 stats and the population information, it's estimated that he is a "true" 0.401 wOBA hitter. Of course, we know from many more years of watching him that this is a bit unfair, and his "true" wOBA is closer to 0.423.
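The whole calculation takes only a few lines in R, assuming the fitted parameter vector alpha from the dirmult output above is still in the workspace:

w     <- c(0.89, 1.27, 1.62, 2.10, 0.69, 0.72, 0)   # weights: 1B, 2B, 3B, HR, BB, HBP, OTH
trout <- c(115, 39, 9, 27, 100, 9, 407)             # Trout's 2013 counts (n = 706 events)

sum(w * trout / sum(trout))                     # raw wOBA, approximately 0.423
sum(w * (trout + alpha) / sum(trout + alpha))   # regressed wOBA, approximately 0.401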

Stabilization



As a side note - and I have verified this by simulation, though I have not worked out the details mathematically yet - the so-called "stabilization point" (defined as a split-half correlation of $r = 0.5$) for wOBA is given by $\alpha_0$ - so if a split-half correlation were calculated among the players from 2010-2015 with at least 300 AB + BB - IBB + SF + HBP, there should be a correlation of 0.5 after approximately 214 events. I'm not sure if this works for just the wOBA weights or for any arbitrary set of weights, though I suspect the fact that the weight for the "other" outcome is 0 and all the rest are nonzero plays a big role in this.
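For anyone who wants to check this themselves, the simulation can be set up along the following lines - a rough sketch only, with rdirichlet hand-rolled from gamma draws rather than taken from a package, and alpha being the fitted wOBA parameter vector from above:

set.seed(1)

rdirichlet <- function(n.draws, alpha) {
  g <- matrix(rgamma(n.draws * length(alpha), shape = alpha),
              nrow = n.draws, byrow = TRUE)
  g / rowSums(g)
}

# Split-half correlation of wOBA for two independent samples of n.events each
w <- c(0.89, 1.27, 1.62, 2.10, 0.69, 0.72, 0)
split.half.cor <- function(n.events, alpha, w, n.players = 5000) {
  theta <- rdirichlet(n.players, alpha)
  woba1 <- apply(theta, 1, function(p) sum(w * rmultinom(1, n.events, p)) / n.events)
  woba2 <- apply(theta, 1, function(p) sum(w * rmultinom(1, n.events, p)) / n.events)
  cor(woba1, woba2)
}

split.half.cor(214, alpha, w)   # should come out near 0.5 if the claim holds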


SLG Shrinkage



Another statistic that can be regressed in this same way is slugging percentage (which I'll call SLG for short). Using at-bats (AB) as events, and defining five outcomes - singles (1B), doubles (2B), triples (3B), home runs (HR), and everything else lumped into the "other" (OTH) outcome - a player's slugging percentage can be written as a linear transformation of the $\theta_j$ for each of these outcomes with weights $w_{1B} = 1$, $w_{2B} = 2$, $w_{3B} = 3$, $w_{HR} = 4$, and $w_{OTH} = 0$ as

$SLG_i = 1*\theta_{i,1B} + 2*\theta_{i,2B} + 3*\theta_{i,3B} + 4*\theta_{i,HR}$

For player $i$, the multinomial distribution of the counts of $x_{i,1B}$ singles, $x_{i,2B}$ doubles, $x_{i,3B}$ triples, $x_{i,HR}$ home runs, and $x_{i,OTH}$ other outcomes in $n_i$ at-bats is

$p(\tilde{x}_i | \tilde{\theta}_i, n_i) =\dfrac{n_i!}{x_{i,1B}! x_{i,2B}! x_{i,3B}! x_{i,HR}! x_{i,OTH}!}\theta_{i,1B}^{x_{i,1B}} \theta_{i,2B}^{x_{i,2B}} \theta_{i,3B}^{x_{i,3B}} \theta_{i,HR}^{x_{i,HR}} \theta_{i,OTH}^{x_{i,OTH}}$


The Dirichlet distribution of all possible $\tilde{\theta}$ values is

$p(\tilde{\theta}_i | \tilde{\alpha}) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_{i,1B}^{\alpha_{1B} - 1} \theta_{i,2B}^{\alpha_{2B} - 1}\theta_{i,3B}^{\alpha_{3B} - 1} \theta_{i,HR}^{\alpha_{HR} - 1}\theta_{i,OTH}^{\alpha_{OTH} - 1}$

Once again, data from fangraphs.com was used, and all MLB non-pitchers from 2010-2015 who had at least 300 AB in a given season were included in the sample. To find maximum likelihood estimates in R the data needs to be loaded into a matrix where row $i$ represents the raw counts for each outcome $\tilde{x}_i$ for player $i$.  The first 10 rows of the matrix  (out of 1477) should look something like:

> x[1:10,]
      [,1] [,2] [,3] [,4] [,5]
 [1,]   91   38    1   42  349
 [2,]  122   26    1   44  362
 [3,]  111   40    3   32  332
 [4,]   63   25    0   30  251
 [5,]   67   18    1   21  208
 [6,]   86   24    2   43  358
 [7,]   58   25    2   20  199
 [8,]   68   44    0   14  248
 [9,]  102   36    2   37  370
[10,]  119   48    0   30  375


The dirmult package can then be used to find maximum likelihood estimates for the $\alpha_j$ of the underlying joint Dirichlet distribution of talent levels

> dirmult.fit <- dirmult(x)
Iteration 1: Log-likelihood value: -575845.635311559
Iteration 2: Log-likelihood value: -575999.779702559
Iteration 3: Log-likelihood value: -575829.132259007
Iteration 4: Log-likelihood value: -575784.936726078
Iteration 5: Log-likelihood value: -575780.270877135
Iteration 6: Log-likelihood value: -575780.190649985
Iteration 7: Log-likelihood value: -575780.1906191


> dirmult.fit


$loglik
[1] -575780.2
 


$ite
[1] 7
 


$gamma
[1]  42.443604  12.855782   1.381905   7.073672 176.120837
 

$pi
[1] 0.176939918 0.053593494 0.005760921 0.029488894 0.734216774 

$theta
[1] 0.004151517

And $\alpha_0$ is

> alpha <- dirmult.fit$gamma
> sum(alpha)
[1] 239.8758


So the joint distribution of talent levels over the population of MLB players with at least 300 AB is approximated by a Dirichlet distribution with parameters:

$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{OTH}) \sim Dirichlet(42.44, 12.86, 1.38, 7.07, 176.12)$
 
(This also implies the "stabilization point" for slugging percentage should be at around $\alpha_0 \approx 240$ AB - this is different than for wOBA because the definition of an "event" is different between the two statistics.)

In 2013, Mike Trout had $x_{1B} = 115$ singles, $x_{2B} = 39$ doubles, $x_{3B} = 9$ triples, $x_{HR} = 27$ home runs, and $x_{OTH} = 399$ other outcomes in $n = 589$ at-bats, for a raw (non-regressed) slugging percentage of

$1 \left(\dfrac{115}{589}\right) + 2 \left(\dfrac{39}{589}\right) + 3 \left(\dfrac{9}{589}\right) + 4 \left(\dfrac{27}{589}\right) + 0 \left(\dfrac{399}{589}\right) \approx 0.557$

In order to calculate the regressed slugging percentage, first calculate the $\alpha_j'$ for Mike Trout's posterior distribution of batting ability by

$\alpha_{1B}' = 115 + 42.44 = 157.44$
$\alpha_{2B}' = 39 + 12.86 = 51.86$
$\alpha_{3B}' = 9+ 1.38 = 10.38$
$\alpha_{HR}' = 27 + 7.07 = 34.07$
$\alpha_{OTH}' = 399 + 176.12 = 575.12$

This gives $\alpha_0' = 828.87$. The regressed version of Mike Trout's 2013 slugging percentage is then given by

$1 \left(\dfrac{157.44}{828.87}\right) + 2 \left(\dfrac{51.86}{828.87}\right) + 3 \left(\dfrac{10.38}{828.87}\right) + 4 \left(\dfrac{34.07}{828.87}\right) \approx 0.517$



Model Criticisms



From a statistical perspective, this is the most convenient way to perform shrinkage of wOBA, but that doesn't necessarily mean it is correct - all of this depends on how well the Dirichlet models the joint distribution of talent levels in the league. The fact that the beta distribution works well for the population distribution of each talent level individually is no guarantee that its multivariate extension will work well for the joint distribution.

In order to do a simple test of the fit of the model, data was simulated from the fitted model used to perform the wOBA shrinkage (the posterior predictive means and variances are actually available exactly for this model, but it's good practice to simulate). A set of $\tilde{\theta}_i$ simulated from the Dirichlet distribution was used to simulate a corresponding set of $\tilde{x}_i$, with the same $n_i$ as in the original data set.
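A sketch of how the simulation can be run (rdirichlet is again hand-rolled from gamma draws; here x is the wOBA count matrix, n the vector of each player's event totals, and alpha the seven-component parameter vector from the wOBA fit):

set.seed(42)

rdirichlet <- function(n.draws, alpha) {
  g <- matrix(rgamma(n.draws * length(alpha), shape = alpha),
              nrow = n.draws, byrow = TRUE)
  g / rowSums(g)
}

# One simulated talent vector per player, then simulated counts using
# each player's actual number of events n[i]
theta.sim <- rdirichlet(nrow(x), alpha)
x.sim <- t(sapply(seq_len(nrow(x)),
                  function(i) rmultinom(1, size = n[i], prob = theta.sim[i, ])))

# Observed vs. simulated per-event rates for each outcome
round(rbind(observed = colMeans(x / n), simulated = colMeans(x.sim / n)), 4)
round(rbind(observed = apply(x / n, 2, sd), simulated = apply(x.sim / n, 2, sd)), 4)

Comparing the means and standard deviations of the real and simulated data sets, the means are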

\begin{array}{c c c}
\textrm{Outcome} & \textrm{Observed Mean} & \textrm{Simulated Mean} \\ \hline
\textrm{1B}  &  0.1598  &  0.1598 \\
\textrm{2B}   & 0.0473  &  0.0480 \\
\textrm{3B}   & 0.0049  &  0.0053 \\
\textrm{HR}  &  0.0275  & 0.0262 \\
\textrm{BB}  &  0.0772  & 0.0756 \\
\textrm{HBP}  &  0.0088  &  0.0095 \\
\textrm{OTH}   & 0.6745  &  0.6756 \\
\end{array}

which look relatively good - the simulated and real means are fairly close. For the standard deviations, real and simulated values are

\begin{array}{c c c}
\textrm{Outcome} & \textrm{Observed SD} & \textrm{Simulated SD} \\ \hline
\textrm{1B}  &  0.0296  &  0.0303 \\
\textrm{2B}   & 0.0114  &  0.0176 \\
\textrm{3B}   &  0.0049  &  0.0061 \\
\textrm{HR}  & 0.0153  &  0.0131 \\
\textrm{BB}  &   0.0278  &  0.0217 \\
\textrm{HBP}  &  0.0072  &  0.0080 \\
\textrm{OTH}   & 0.0324  &  0.0381\\
\end{array}

which isn't nearly as good. The Multinomial-Dirichlet model is clearly overestimating the amount of variance in double rates and "other" outcome rates while underestimating the variance in home run and walk rates. It's not to an extreme extent - and comparing histograms, the shapes of the real and simulated distributions match - but it does present a potential source of problems. More ad hoc methods may give superior results.
