Probabilaball: 2016

03 September, 2016

2016 Win Total Predictions (Through August 31)

These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway). These predictions include all games played before through August 31 break.

As a side note, I noticed that my projections are very similar to the Fangraphs projections on the same day. I'm sure we're both calculating the projections from completely different methods, but it's reassuring that others have arrived at basically the same conclusions. Theirs have also have playoff projections, though mine have intervals attached to them.

I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season the actual coverage is around 94%, with intervals usually being one game off if and when they are off.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins/Games}\\ \hline
ARI & 63 & 68.82 & 74 & 71.61 & 56 / 133 \\
ATL & 57 & 62.25 & 68 & 68.41 & 50 / 133 \\
BAL & 81 & 86.57 & 92 & 81.42 & 72 / 133 \\
BOS & 85 & 90.41 & 96 & 91.7 & 74 / 133 \\
CHC & 98 & 103.63 & 109 & 100.59 & 85 / 132 \\
CHW & 71 & 76.85 & 83 & 77.61 & 62 / 131 \\
CIN & 62 & 67.9 & 74 & 69.67 & 55 / 132 \\
CLE & 87 & 92.68 & 98 & 90.03 & 76 / 132 \\
COL & 73 & 78.63 & 84 & 81.72 & 64 / 133 \\
DET & 81 & 86.95 & 92 & 83.51 & 72 / 133 \\
HOU & 81 & 86.24 & 92 & 85.14 & 71 / 133 \\
KCR & 78 & 83.09 & 89 & 78.69 & 69 / 133 \\
LAA & 67 & 72.93 & 78 & 77.8 & 59 / 133 \\
LAD & 84 & 89.44 & 95 & 86.26 & 74 / 133 \\
MIA & 76 & 81.66 & 87 & 81.91 & 67 / 133 \\
MIL & 65 & 70.13 & 76 & 73.34 & 57 / 133 \\
MIN & 56 & 61.98 & 68 & 70.1 & 49 / 132 \\
NYM & 78 & 83.61 & 89 & 81.58 & 69 / 133 \\
NYY & 78 & 83.86 & 90 & 80.28 & 69 / 132 \\
OAK & 64 & 69.88 & 75 & 71.92 & 57 / 133 \\
PHI & 67 & 72.42 & 78 & 69.38 & 60 / 133 \\
PIT & 77 & 82.37 & 88 & 80.33 & 67 / 131 \\
SDP & 63 & 68.73 & 74 & 74.14 & 55 / 132 \\
SEA & 77 & 82.55 & 88 & 81.3 & 68 / 133 \\
SFG & 82 & 88 & 94 & 86.39 & 72 / 132 \\
STL & 81 & 86.25 & 92 & 87.69 & 70 / 132 \\
TBR & 65 & 70.72 & 76 & 79.47 & 56 / 132 \\
TEX & 89 & 94.45 & 100 & 83.61 & 80 / 134 \\
TOR & 86 & 91.93 & 97 & 89 & 76 / 133 \\
WSN & 89 & 94.82 & 100 & 94.01 & 78 / 133 \\
\hline\end{array}

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins that would be expected based on their talent level.

It's still difficult to predict final win totals even at the beginning of September - intervals have a width of approximately 11-12 games. The Texas Ranges have been lucky this season, with a projected win total over 10 games larger than their estimated true talent level! Conversely, the Tampa Bay Rays have been unlucky, with a projected win total 10 games lower than their true talent level.

The Chicago Cubs have a good chance at winning 105+ games. My system believes they are a "true" 101 win team. Conversely, the system believes that the worst team is the Atlanta Braves, which are a "true" 68 win team (though the Minnesota Twins are projected to have the worst record at 62 wins).

Terminology

To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season until the all-star break).

15 August, 2016

wOBA Shrinkage Estimation by the Multinomial-Dirichilet Model

In a previous article, I showed how to calculate a very basic confidence interval for wOBA using the multinomial model. Since then, I've shown how to perform shrinkage estimation (regression towards the mean) for basic counting stats such as BA and OBP and rate stats such as WHIP. In this article, I'm going to show how to use the multinomial model with a Dirichlet prior to find a regressed estimate of wOBA (and other functions that are linear transformations of counting stats).

As in the previous post, I will use the weights of wOBA from the definition on fangraphs.com. I am aware that in The Book wOBA also includes the outcome of reaching a base on error, but the method here will easily expand to include that factor and the results should not change drastically for its exclusion.

As usual, all the code and data I use can be found on my github.

The Multinomial and Dirichlet Models

Suppose you observe $n$ independent, identical trials of an event (say, a plate appearance or an at-bat) with $k$ possible outcomes (single, double, triple, home run, walk, sacrifice fly, etc. - all the way up to an out). The distribution of counts of each of the possible outcomes is multinomial with probability mass function:

$p(x_1, x_2,...,x_k | \theta_1, \theta_2, ..., \theta_{k}, n) = \dfrac{n!}{x_1!x_2!...x_k!}\theta_1^{x_1}\theta_2^{x_2}...\theta_k^{x_k}$

Where $x_1$, $x_2$,..., $x_k$ represent counts of each outcome (with $n = x_1 + x_2 + ... + x_k$ fixed) and $\theta_1, \theta_2, ..., \theta_k$ represent the probability of each outcome in a single event - note that all the probabilities $\theta_j$ must sum to 1, so you will sometimes see the last term written as $\theta_k = (1-\theta_1-\theta_2-...-\theta_{k-1})$.

To give an example, suppose that in each plate appearance (the event) a certain player has a 0.275 chance of getting a hit, a 0.050 chance of getting a walk, and a 0.675 chance of an "other" outcome happening (meaning anything other than a hit or walk - an out, hit by pitch, reach base on error, etc.). Then in $n = 200$ plate appearances, the probability of exactly $x_H = 55$ hits, $x_{BB} = 10$ walks, and $x_{OTH} = 135$ other outcomes is given by

$\dfrac{200!}{55!10!135!} 0.275^{55} 0.05^{10} 0.675^{135} = 0.008177562$

(This probability is necessarily small because there are a little over 20,000 ways to have the three outcomes sum up to 200 plate appearances. In fact, 55 hits, 10 walks, and 135 other is the most probable set of outcomes.)

The multinomial is, as its name implies, a multivariate extension of the classic binomial distribution (a binomial is just a multinomial with $k = 2$). Similarly, there is a multivariate extension of the beta distribution called the Dirichlet distribution. The Dirichlet is used to represent the joint distribution of the $\theta_j$ themselves - that is, the joint distribution (over the entire league) of sets of talent levels for each of the $k$ possible outcomes. The probability density function of the Dirichlet is

$p(\theta_1, \theta_2, ..., \theta_{k} | \alpha_1, \alpha_2, ..., \alpha_k) = \displaystyle \dfrac{\prod_{j = 1}^k \Gamma(\alpha_j)}{\Gamma(\sum_{j = 1}^k \alpha_j)} \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1}...\theta_k^{\alpha_k - 1}$

The Dirichlet distribution would be used to answer the question "What is the probability that a player has a hit probability of between 0.250 and 0.300 and simultaneously a walk probability of between 0.50 and 0.100?"- the advantage of doing this is being able to model the covariance between the talent levels.

The expected values of each of the $\theta_j$ are given by

$E[\theta_j] = \dfrac{\alpha_j}{\alpha_0}$

where

$\alpha_0 = \displaystyle \sum_{j = 1}^k \alpha_j$

These represent the league average talent levels in each of the outcomes. So for example, using hits (H), walks (BB), and other (OTH), the quantity given by

$E[\theta_{BB}] = \dfrac{\alpha_{BB}}{\alpha_H + \alpha_{BB} + \alpha_{OTH}}$

would be the average walk proportion (per PA) over all MLB players.

The reason the Dirichlet distribution is useful is that it is conjugate to the multinomial density above. Given raw counts $x_1, x_2, ..., x_k$ for each outcome in the multinomial model and parameters $\alpha_1$, $\alpha_2$, ..., $\alpha_k$ for the Dirichlet model, the posterior distribution for the $\theta_i$ is also Dirichlet with parameters $ \alpha_1 + x_1, \alpha_2 + x_2, ..., \alpha_k + x_k$:

$p(\theta_1, \theta_2, ..., \theta_{k} | x_1, x_2, ..., x_k) = $

$\displaystyle \dfrac{\prod_{j = 1}^k \Gamma(\alpha_j + x_i)}{\Gamma(\sum_{j = 1}^k \alpha_j + x_j)} \theta_1^{\alpha_1 + x_1- 1} \theta_2^{\alpha_2 + x_2- 1}...\theta_{k-1}^{\alpha_{k-1} + x_{k-1} - 1}\theta_k^{\alpha_k + x_k - 1}$

For the posterior, the expected value for each outcome is given by

$E[\theta_j] = \dfrac{\alpha'_j}{\alpha'_0}$

where

$\alpha'_j = x_j + \alpha_j$

$\alpha'_0 = \displaystyle \sum_{j = 1}^k \alpha'_j = \sum_{j = 1}^k (x_j + \alpha_j)$

These posterior means $E[\theta_j]$ represent regressed estimates for each of the outcome talent levels towards the league means. These shrunk estimates can then be plugged in to the formula for any statistic to get a regressed version of that statistic.

Linearly Weighted Statistics

Most basic counting statistics (such as batting average, on-base percentage, etc.) simply try to estimating one particular outcome using the raw proportion of events ending in that outcome:

$\hat{\theta_j} \approx \dfrac{x_j}{n}$

More advanced statistics instead attempt to estimate linear functions of true talent levels $\theta_j$ with weights $w_j$ for each outcome:

$w_1 \theta_1 + w_2\theta_2 + ... + w_k \theta_k$

The standard version of these statistics that you can find on any number of baseball sites uses the raw proportion $x_j/n$ as an estimate $\hat{\theta}_j$ as above. To get the regressed version of the statistic, use $\hat{\theta_j} = E[\theta_j]$ from the posterior distribution for the Dirichlet-multinomial model - the formula for the regressed statistic is then

$w_1 \hat{\theta}_1 + w_2 \hat{\theta_2} + ... + w_k \hat{\theta_k} = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \alpha_j}{\sum_{j = 1}^k x_j + \alpha_j}\right) =\sum_{j = 1}^k w_j \left(\dfrac{\alpha_j'}{\alpha_0'}\right)$

(The full posterior distribution can also be used to get interval estimates for the statistic, which will be the focus of the next article)

Estimation

This raises the obvious question of what values to use for the $\alpha_j$ in the Dirichlet distribution - ideally, $\alpha_j$ should be picked so that the Dirichlet distribution accurately modes the joint distribution of talent levels for MLB hitters. There are different ways to do this, but I'm going to use use MLB data itself and a marginal maximum likelihood technique to find estimates $\hat{\alpha_j}$ and plug those into the equations above - this method was chosen because there are existing R packages to do the estimation and it is relatively numerically stable and should be very close to the results of other methods for large sample sizes. Using the data to find estimates of the $\alpha_j$ and then plugging those back in makes this an empirical Bayesian technique.

First, it's necessary to get rid of the talent levels $\theta_j$ and find the probability of the observed data based not on a particular player's talent level(s), but on the Dirichlet distribution of population talent levels itself. This is done by integrating out each talent level:

$p(x_1, x_2, ..., x_k | \alpha_1, \alpha_2, ..., \alpha_k, n) = $

$\displaystyle \int_{\tilde{\theta}} p(x_1, x_2, ..., x_k | \theta_1, \theta_2, ..., \theta_k, n) p(\theta_1, \theta_2, ..., \theta_k | \alpha_1, \alpha_2, ..., \alpha_k, n) d\tilde{\theta}$

where $\tilde{\theta}$ indicates the set of all $\theta_j$ values - so integrating out the success probability for each outcome, from 0 to 1.

The calculus here is a bit tedious, so skipping straight to the solution, this gives probability mass function:

$ = \displaystyle \dfrac{n! \Gamma(\sum_{j = 1}^k \alpha_j)}{\Gamma(n + \sum_{j = 1}^k \alpha_j)} \prod_{i = j}^k \dfrac{\Gamma(x_j + \alpha_j)}{x_j! \Gamma(\alpha_j)}$

This distribution, known as the Dirichlet-Multinomial distribution, represents the probability of getting $x_1, x_2, ..., x_k$ outcomes in fixed $n = x_1 + x_2 + ... + x_k$ events, given only information about the population. Essentially, this distribution would be used to answer the question "What's the probability that, if I select a player from the population of all MLB players completely at random - so not knowing the player's talent levels at all - the player gets $x_1$ singles, $x_2$ doubles, etc., in $n$ plate appearances?"

Using $x_{i,j}$ to represent the raw count for outcome $j$ for player $i = 1, 2, ..., N$ and $\tilde{x}_i$ as shorthand to represent the complete set of counting statistics for player $i$, with the counting stats of multiple players $\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_N$ from some population (such as all MLB players that meet some event threshold), statistical estimation procedures can be used to acquire estimates $\hat{\alpha}_j$ of the true population parameters $\alpha_j$.

For the maximum likelihood approach, the log-likelihood of a set of estimates $\tilde{\alpha}$ is given by

$\ell(\tilde{\alpha} | \tilde{x}_1, ..., \tilde{x}_N) = \displaystyle \sum_{i = 1}^N [\log(n_i!) + \log(\Gamma(\sum_{j = 1}^k \alpha_j)) - \log(\Gamma(n + \sum_{j = 1}^k \alpha_j))$

$+ \displaystyle \sum_{j = 1}^k \log(\Gamma(x_{i,j} + \alpha_j)) - \sum_{j = 1}^k \log(x_{i,j}!) - \sum_{j = 1}^k \log(\Gamma(\alpha_j))]$

The maximum likelihood method works by finding the values of the $\alpha_j$ that maximize $\ell(\tilde{\alpha})$ above - these are the maximum likelihood estimates $\hat{\alpha}_j$. From a numerical perspective, doing this is not simple, and papers have been written on fast, easy ways to perform the computations. For simplicity, I'm going to use the dirmult package in R, which only requires the set of counts for each outcome as a matrix where each row corresponds to exactly one player. The dirmult package can be installed with the command

> install.packages('dirmult')

Once the data is entered and estimation is performed, you will have estimates $\hat{\alpha}_j$. These can then be plugged into the posterior equations above to get regressed statistic estimate

$w_1 \hat{\theta}_1 + w_2 \hat{\theta_2} + ... + w_k \hat{\theta_k} = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \hat{\alpha}_j}{\sum_{j = 1}^k x_j + \hat{\alpha}_j}\right)$

I'll give two examples of offensive stats that can be regressed in this way.

wOBA Shrinkage

The first is weighted on-base average (which I'll call wOBA for short), introduced in Tom Tango's The Book, though as previously mentioned, I am using the fangraphs.com definition. For events, at-bats plus unintentional walks, sacrifice flies, and times hit by a pitch will be used (n = AB + BB - IBB + SF + HBP) , and seven outcomes are defined - singles (1B), doubles (2B), triples (3B), home runs (HR), unintentional walks (BB), and hit by pitch (HBP), with everything else being lumped into an "other" (OTH) outcome.

For notation, again let $x_{i, 1B}$, $x_{i, 2B}$, ..., $x_{i, OTH}$ represent the number of singles, doubles, etc. for player $i$ (abbreviating the entire set as $\tilde{x}_i$) and let $\theta_{i, 1B}$, $\theta_{i, 2B}$, ..., $\theta_{i, OTH}$ represent the true probability of getting a single, double, etc. for player $i$ (abbreviating the entire set as $\tilde{\theta}_i$). The total number of events for player $i$ is given by $n_i$.

Data was collected from fangraphs.com on all MLB non-pitchers from 2010 - 2015. A cutoff of 300 events was used - so only players with at least 300 total AB + BB - IBB + SF + HBP in a given season were used. The code and data I used may be found in my github.

A player's wOBA can be written as a linear transformation of the $\theta_j$ for each of these outcomes with weights $w_{1B} = 0.89$, $w_{2B} = 1.27$, $w_{3B} = 1.62$, $W_{HR} = 2.10$, $w_{BB} = 0.69$, $w_{HBP} = 0.72$, and $w_{OTH} = 0$ as

$wOBA_i = 0.89*\theta_{i,1B} + 1.27*\theta_{i,2B} + 1.62*\theta_{i,3B} + 2.10*\theta_{i,HR} + 0.69*\theta_{i,BB}+0.72*\theta_{i,HBP}$

For player $i$, the distribution of the counts $\tilde{x}_i$ in $n_i$ events is multinomial with mass function

$p(\tilde{x}_i | \tilde{\theta}_i, n_i) =\dfrac{n_i!}{x_{i,1B} x_{i,2B}! x_{i,}! x_{i,HR}! x_{i,OTH}! }\theta_{1B,i}^{x_{i,1B}} \theta_{i,2B}^{x_{i,2B}} \theta_{i,3B}^{x_{i,3B}} \theta_{i,HR}^{x_{i,HR}} \theta_{i,OTH}^{x_{i,OTH}}$

The joint distribution of possible talent levels $\tilde{\theta}$ is assumed to be Dirichlet.

$p(\tilde{\theta}_i | \tilde{\alpha}) = \displaystyle \dfrac{\prod_{j = 1}^k \Gamma(\alpha_j)}{\Gamma(\sum_{j = 1}^k \alpha_j)} \theta_{i,1B}^{\alpha_{1B} - 1} \theta_{i,2B}^{\alpha_{2B} - 1}\theta_{i,3B}^{\alpha_{3B} - 1} \theta_{i,HR}^{\alpha_{HR} - 1}\theta_{i,BB}^{\alpha_{BB} - 1}\theta_{i,HBP}^{\alpha_{HBP} - 1}\theta_{i,OTH}^{\alpha_{OTH} - 1}$

To find the maximum likelihood estimates $\hat{\alpha}_j$ for this model using the dirmult package in R, the data needs to be loaded into a matrix, where row $i$ represents the raw counts for each outcome for player $i$. There are any number of way to do this, but the first 10 rows of the matrix (out of 1598 in the sample total in the data set used for this example) should look something like:

> x[1:10,]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   91   38    1   42 109    5 353
[2,] 122   26    1   44   71    5 364
[3,]   58   25    1   18   43    0 196
[4,] 111   40    3   32   38    5 336
[5,]   63   25    0   30   56    3 253
[6,]   67   18    1   21   46    5 213
[7,]   86   24    2   43 108    6 362
[8,]   58   25    2   20   24    3 201
[9,]   35   16    2   25   56    2 200
[10,]   68   44    0   14   76    5 250

Once the data is in this form, finding the maximum likelihood estimates can be done with the commands

> dirmult.fit <- dirmult(x)
Iteration 1: Log-likelihood value: -863245.946200229
Iteration 2: Log-likelihood value: -863214.860463357
Iteration 3: Log-likelihood value: -863210.928976511
Iteration 4: Log-likelihood value: -863210.901250554
Iteration 5: Log-likelihood value: -863210.901248778

> dirmult.fit

$loglik

[1] -863210.9

$ite

[1] 5

$gamma

[1] 34.30376 10.44264 1.15606 5.73569 16.28635 1.96183 144.51164

$pi

[1] 0.160000389 0.048706790 0.005392124 0.026752540 0.075963185 0.009150412 0.674034560

$theta

[1] 0.004642569

What I called the $\hat{\alpha}_j$ are given as the $gamma in the output. The quantity$\alpha_0$ can be calculated by summing these up

> alpha <- dirmult.fit$gamma
> sum(alpha)
[1] 214.398

So the joint distribution of talent levels over the population of MLB players with at least 300 events is approximated by a Dirichlet distribution with parameters:

$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{BB}, \theta_{HBP}, \theta_{OTH}) \sim Dirichlet(34.30, 10.44, 1.16, 5.74, 16.29, 1.96, 144.51)$

In 2013, Mike Trout had $x_{1B} = 115$ singles, $x_{2B} = 39$ doubles, $x_{3B} = 9$ triples, $x_{HR} = 27$ home runs, $x_{BB} = 100$ unintentional walks, $x_{HBP} = 9$ times hit by a pitch, and $x_{OTH} = 397$ other outcomes in $n = 706$ total events for a raw (non-regressed) wOBA of

$0.89 \left(\dfrac{115}{706}\right) + 1.27 \left(\dfrac{39}{706}\right) + 1.62 \left(\dfrac{9}{706}\right) + 2.10 \left(\dfrac{27}{706}\right) + 0.69 \left(\dfrac{100}{706}\right) + 0.72 \left(\dfrac{9}{706}\right) \approx 0.423$

In order to calculate the regressed weighted on-base average, first calculate the $\alpha_j'$ for Mike Trout's posterior distribution of batting ability by

$\alpha_{1B}' = 115 + 34.30 = 149.30$

$\alpha_{2B}' = 39 + 10.44 = 49.44$

$\alpha_{3B}' = 9+ 1.16 = 10.16$

$\alpha_{HR}' = 27 + 5.74 = 32.74$
$\alpha_{BB}' = 100 + 16.29 = 116.29$
$\alpha_{HBP}' = 9 + 1.96 = 10.96$

$\alpha_{OTH}' = 407 + 144.51 = 551.51$

With $\alpha_0' = 920.40$. The regressed version of Mike Trout's 2013 slugging percentage is then given by a linear transformation of the expected proportion in each outcome:

$0.89 \left(\dfrac{149.30}{920.40}\right) + 1.27 \left(\dfrac{49.44}{920.40}\right) + 1.62 \left(\dfrac{10.16}{920.40}\right) + 2.10 \left(\dfrac{32.74}{920.40}\right) + 0.69 \left(\dfrac{116.29}{920.40}\right) + 0.72 \left(\dfrac{10.96}{920.40}\right)$

$\approx 0.401$

So based solely on his 2013 stats and the population information, it's estimated that he is a "true" 0.401 wOBA hitter. Of course, we know from many more years of watching him that this is a bit unfair, and his "true" wOBA is closer to 0.423.

Stabilization

As a side note - and I have verified this by simulation, though I have not worked out the details yet mathetmatically - the so-called "stabilization point" (defined as a split-half correlation of $r = 0.5$) for wOBA is given by $\alpha_0$ - so if split-half correlation was conducted among the players from 2010-2015 with at least 300 AB + BB - IBB + SF + HBP, there should be a correlation of 0.5 after approximately 214 PA. I'm not sure if this works for just the wOBA weights or any arbitrary set of weights, though I suspect the fact that the weight for the "other" outcome is 0 and all the rest are nonzero has a big role to play in this.

SLG Shrinkage

Another statistic that can be regressed in this same way is slugging percentage (which I'll call SLG for short). Using at-bats (AB) as events, and defining five outcomes - singles (1B), doubles (2B), triples (3B), and home runs (HR), with everything else being lumped into the "other" (OTH) outcome, a player's slugging percentage can be written as a linear transformation of the $\theta_i$ for each of these outcomes with weights $w_{1B} = 1$, $w_{2B} = 2$, $w_{3B} = 3$, $W_{HR} = 4$, and $w_{OTH} = 0$

$SLG_i = 1*\theta_{i,1B} + 2*\theta_{i,2B} + 3*\theta_{i,3B} + 4*\theta_{i,HR}$

For player $i$, the multinomial distribution of the counts of $x_{i,1B}$ singles, $x_{i,2B}$ doubles, $x_{i,3B}$ triples, $x_{i,HR}$ home runs, and $x_{i,OTH}$ other outcomes in $n_i$ at-bats is

The Dirichlet distribution of all possible $\tilde{\theta}$ values is

Once again, data from fangraphs.com was used, and all MLB non-pitchers from 2010-2015 who had at least 300 AB in a given season were included in the sample. To find maximum likelihood estimates in R the data needs to be loaded into a matrix where row $i$ represents the raw counts for each outcome $\tilde{x}_i$ for player $i$. The first 10 rows of the matrix (out of 1477) should look something like:

> x[1:10,]
      [,1] [,2] [,3] [,4] [,5]
[1,]   91   38    1   42 349
[2,] 122   26    1   44 362
[3,] 111   40    3   32 332
[4,]   63   25    0   30 251
[5,]   67   18    1   21 208
[6,]   86   24    2   43 358
[7,]   58   25    2   20 199
[8,]   68   44    0   14 248
[9,] 102   36    2   37 370
[10,] 119   48    0   30 375

The dirmult package can then be used to find maximum likelihood estimates for the $\alpha_j$ of the underlying joint Dirichlet distribution of talent levels

> dirmult.fit <- dirmult(x)
Iteration 1: Log-likelihood value: -575845.635311559
Iteration 2: Log-likelihood value: -575999.779702559
Iteration 3: Log-likelihood value: -575829.132259007
Iteration 4: Log-likelihood value: -575784.936726078
Iteration 5: Log-likelihood value: -575780.270877135
Iteration 6: Log-likelihood value: -575780.190649985
Iteration 7: Log-likelihood value: -575780.1906191

> dirmult.fit

$loglik
[1] -575780.2

$ite
[1] 7

$gamma
[1] 42.443604 12.855782   1.381905   7.073672 176.120837

$pi
[1] 0.176939918 0.053593494 0.005760921 0.029488894 0.734216774

$theta
[1] 0.004151517

And $\alpha_0$ is

> alpha <- dirmult.fit$gamma
> sum(alpha)
[1] 239.8758

So the joint distribution of talent levels over the population of MLB players with at least 300 AB is given by a Dirichlet distribution.

$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{OTH}) \sim Dirichlet(42.44, 12.86, 1.38, 7.07, 176.12)$

(This also implies the "stabilization point" for slugging percentage should be at around $\alpha_0 \approx 240$ AB - this is different than for wOBA because the definition of "events" are different between the two statistics)

In 2013, Mike Trout had $x_{1B} = 115$ singles, $X_{2B} = 39$ doubles, $X_{3B} = 9$ triples, $X_{HR} = 27$ home runs, and $X_{OTH} = 399$ other outcomes in $n = 589$ at-bats.

$1 \left(\dfrac{115}{589}\right) + 2 \left(\dfrac{39}{589}\right) + 3 \left(\dfrac{9}{589}\right) + 4 \left(\dfrac{27}{589}\right) + 0 \left(\dfrac{399}{589}\right) \approx 0.557$

In order to calculate the regressed slugging percentage, calculate Mike Trout's posterior distribution for batting ability by

$\alpha_{1B}' = 115 + 42.44 = 157.44$

$\alpha_{2B}' = 39 + 12.86 = 51.86$

$\alpha_{3B}' = 9+ 1.38 = 10.38$

$\alpha_{HR}' = 27 + 7.07 = 34.07$

$\alpha_{OTH}' = 399 + 176.12 = 575.12$

With $\alpha_0' = 828.79$. The regressed version of Mike Trout's 2013 slugging percentage is then given by

$1 \left(\dfrac{157.44}{828.79}\right) + 2 \left(\dfrac{51.86}{828.79}\right) + 3 \left(\dfrac{10.38}{828.79}\right) + 4 \left(\dfrac{34.07}{828.79}\right) \approx 0.517$

Model Criticisms

From a statistical perspective, this is the most convenient way to perform shrinkage of wOBA, but it doesn't necessarily mean that this is correct - all of this research is dependent on how well the Dirichlet models the joint distribution of talent levels in the league. The fact that the beta works well for the population distributions of each of the talent levels when looked at individually is no guarantee that the multivariate extension should work well for the joint.

In order to do a simple test of the fit of the model, data was simulated from the fit model used to perform the wOBA shrinkage (the posterior predictive means and variances are actually available exactly for this model, but it's good practice to simulate). A set of $\tilde{\theta}_i$ simulated from the Dirichlet distribution was used to simulate a corresponding set of $\tilde{x}_i$, with the same $n_i$ as in the original data set. Comparing the means and standard deviations of the real and simulated data set, the means are

\begin{array}{c c c}
\textrm{Outcome} & \textrm{Observed Mean} & \textrm{Simulated Mean} \\ \hline
\textrm{1B} & 0.1598 & 0.1598 \\
\textrm{2B}   & 0.0473 & 0.0480 \\
\textrm{3B}   & 0.0049 & 0.0053 \\
\textrm{HR} & 0.0275 & 0.0262 \\
\textrm{BB} & 0.0772 & 0.0756 \\
\textrm{HBP} & 0.0088 & 0.0095 \\
\textrm{OTH}   & 0.6745 & 0.6756 \\
\end{array}

which look relatively good - the simulated and real means are fairly close. For the standard deviations, real and simulated values are

\begin{array}{c c c}
\textrm{Outcome} & \textrm{Observed SD} & \textrm{Simulated SD} \\ \hline
\textrm{1B} & 0.0296 & 0.0303 \\
\textrm{2B}   & 0.0114 & 0.0176 \\
\textrm{3B}   & 0.0049 & 0.0061 \\
\textrm{HR} & 0.0153 & 0.0131 \\
\textrm{BB} &   0.0278 & 0.0217 \\
\textrm{HBP} & 0.0072 & 0.0080 \\
\textrm{OTH}   & 0.0324 & 0.0381\\
\end{array}

which isn't nearly as good. The Multinomial-Dirichlet model is clearly underestimating the amount of variance in double rates and "other" outcome rates while overestimating the variance in home run and walk rates. It's not to an extreme extent - and comparing histograms, the shapes of the real and simulated data sets match - but it does present a source of problems. More ad-hoc methods may give superior results.

16 July, 2016

2016 Win Total Predictions (Through All-Star Break)

These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway). These predictions include all games played before the all-star break.

I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season the actual coverage is just under 93%, with intervals usually being one game off if and when they are off.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins/Games}\\ \hline

ARI & 62 & 72.11 & 82 & 76.75 & 38 / 90 \\
ATL & 52 & 61.82 & 72 & 68.4 & 31 / 89 \\
BAL & 81 & 90.93 & 101 & 86.25 & 51 / 87 \\
BOS & 80 & 90.3 & 100 & 89.19 & 49 / 87 \\
CHC & 87 & 96.9 & 106 & 96.11 & 53 / 88 \\
CHW & 71 & 81.04 & 91 & 80 & 44 / 87 \\
CIN & 51 & 60.62 & 70 & 63.51 & 32 / 89 \\
CLE & 84 & 93.41 & 103 & 90.67 & 52 / 88 \\
COL & 66 & 76.18 & 86 & 79.22 & 40 / 88 \\
DET & 73 & 82.55 & 92 & 81.11 & 46 / 89 \\
HOU & 76 & 85.81 & 96 & 83.9 & 48 / 89 \\
KCR & 70 & 80.3 & 90 & 77.29 & 45 / 88 \\
LAA & 62 & 71.88 & 82 & 77.4 & 37 / 89 \\
LAD & 80 & 89.43 & 99 & 87.7 & 51 / 91 \\
MIA & 75 & 84.5 & 94 & 82.1 & 47 / 88 \\
MIL & 61 & 71.33 & 81 & 71.98 & 38 / 87 \\
MIN & 56 & 65.83 & 76 & 73.06 & 32 / 87 \\
NYM & 75 & 84.9 & 95 & 82.97 & 47 / 88 \\
NYY & 69 & 78.88 & 89 & 76.38 & 44 / 88 \\
OAK & 61 & 70.93 & 81 & 73.08 & 38 / 89 \\
PHI & 64 & 74 & 84 & 72 & 42 / 90 \\
PIT & 73 & 82.71 & 93 & 81.45 & 46 / 89 \\
SDP & 62 & 72.04 & 82 & 75.53 & 38 / 89 \\
SEA & 74 & 83.44 & 93 & 85.31 & 45 / 89 \\
SFG & 87 & 96.8 & 106 & 89.55 & 57 / 90 \\
STL & 77 & 87.13 & 97 & 90.03 & 46 / 88 \\
TBR & 58 & 67.3 & 77 & 72.91 & 34 / 88 \\
TEX & 81 & 91.22 & 101 & 83.75 & 54 / 90 \\
TOR & 80 & 89.42 & 99 & 87.66 & 51 / 91 \\
WSN & 86 & 95.42 & 105 & 93.21 & 54 / 90 \\ \hline\end{array}
It's still fairly difficult to predict final win totals even a little over halfway through the season - intervals have a width of approximately 20 games. A few stand-out points - the teams that are predicted to definitely finish below 0.500 are the Atlanta Braves, the Cincinnati Reds, the Minnesota Twins, and the Tampa Bay Rays, with the Reds being the worst of those teams (they are an estimated as a "true" 63.51 win team). On the other side, the teams predicted to definitely finish above 0.500 are the Chicago Cubs, the Cleveland Indians, the San Francisco Giants, and the Washington Nationals, with the Cubs being the best of these teams (they are estimated as a "true" 96.11 win team). The Texas Rangers and San Francisco Giants in particular have been an exceptionally lucky team - they are predicted to win approximately 7 more games than their "true" win total. Likewise, the Atlanta Braves and Minnesota Twins have been unlucky, both predicted to win approximately 7 fewer games than their "true" win total.

To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season until the all-star break).

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins that would be expected based on their talent level.

24 May, 2016

Let's Code an MCMC for a Hierarchical Model for Batting Averages

In previous articles, I've discussed empirical Bayesian estimation for the beta-binomial model. Empirical Bayesian analysis is useful, but it's only an approximation to the full hierarchical Bayesian analysis. In this post, I'm going to work through the entire process of doing an equivalent full hierarchical Bayesian analysis with MCMC, from looking at the data and picking a model to creating the MCMC to checking the results. There are, of course, great packages and programs out there such as PyMC and Stan that will fit the MCMC for you, but I want to give a basic and complete "under the hood" example.

Before I get started, I want to be clear that coding a Bayesian analysis with MCMC from scratch involves many choices and multiple checks at almost all levels. I'm going to hand wave some choices based on what I know will work well (though I'll try to be clear where and why I'm doing so) and I'm not going to attempt to show every possible way of checking an MCMC procedure in one post - so statistics such as $\hat{R}$ and effective sample size will not be discussed. For a fuller treatment of Bayesian estimation using MCMC, I recommend Gelman et. al's Bayesian Data Analysis and/or Carlin and Louis's Bayesian Methods in Data Analysis.

As usual, all my code and data can be found on my github.

The Data and Notation

The goal is to fit a hierarchical model to batting averages in the 2015 season. I'm going to limit my data set to only the batting averages of all MLB hitters (excluding pitchers) who had at least 300 AB, as those who do not meet these qualifications can arguably be said to come from a different "population" of players. This data was collected from fangraphs.com and can be seen in the histogram below.

For notation, I'm going let $i$ index MLB players in the sample and define $\theta_i$ as a player's "true" batting average in 2015. The goal is to use the observed number of hits $x_i$ in $n_i$ at-bats (AB) to estimate $\theta_i$ for player $i$. I'll assume that I have $N$ total players - in 2015, there were $N = 254$ non-pitchers with at least 300 AB.

I'm also going to use a $\sim$ over a variable to represent the collection of statistics over all players in the sample. For example, $\tilde{x} = \{ x_1, x_2, ..., x_N\}$ and $\tilde{\theta} = \{\theta_1, \theta_2, ..., \theta_N\}$.

Lastly, when we get to the MCMC part, we're going to take samples from the posterior distributions rather than calculating them directly. I'm going to use $\mu^*_j$ to represent the set of samples from the posterior distribution for $\mu$, where $j$ indexes 1 to however many samples the computer is programmed to obtain (usually a very large number, since computation is relatively cheap these days), and similarly $\phi^*_j$ and $\theta^*_{i,j}$ for samples from the posterior distribution of $\phi$ and $\theta_i$, respectively.

The Model

First, the model must be specified. I'll assume that for each each at-bat, a given player has identical probability $\theta_i$ of getting a hit, independent of other at-bats. The distribution of the total number of hits in $n_i$ at-bats is then binomial.

$x_i \sim Bin(n_i, \theta_i)$

For the distribution of the batting averages $\theta_i$ themselves, I'm going to use a beta distribution. Looking at the histogram of the data, it looks relatively unimodal and bell-shaped, and batting averages by definition must be between 0 and 1. Keep in mind that the distribution of observed batting averages $x_i/n_i$ is not the same as the distribution of actual batting averages $\theta_i$, but even after taking into account the binomial variation around the true batting averages, the distribution of the $\theta_i$ should also be unimodal, roughly bell-shaped, and bounded by 0 and 1. The beta distribution - bounded by 0 and 1 by definition - will be able to take that shape (though others have plausibly argued that a beta is not entirely correct).

Most people are familiar with the beta distribution in terms of $\alpha$ and $\beta$:

$\theta_i \sim Beta(\alpha, \beta)$

There isn't anything wrong with coding an MCMC in this form (and would almost certainly work well in this scenario), but I know from experience that a different parametrization works better - I'm going to use the beta distribution with parameters $\mu$ and $\phi$:

$\theta_i \sim Beta(\mu, \phi)$

where $\mu$ and $\phi$ are given in terms of $\alpha$ and $\beta$ as

$\mu = \dfrac{\alpha}{\alpha + \beta}$

$\phi = \dfrac{1}{\alpha + \beta + 1}$

In this parametrization, $\mu$ represents the expected value $E[\theta_i]$ of the beta distribution - the true league mean batting average - and $\phi$, known formally as the "dispersion parameter," is the correlation between two individual at-bats from the same randomly chosen player - in sabermetric speak, it's how much a hitter's batting average has "stabilized" after a single at-bat. The value of $\phi$ controls how spread out the $\theta_i$ are around $\mu$.

The advantage of using this parametrization instead of the traditional one is that both $\mu$ and $\phi$ are bounded between 0 and 1 (whereas $\alpha$ and $\beta$ can take any value from 0 to $\infty$) and a closed parameter space makes the process of specifying priors easier and will improve the convergence of the MCMC algorithm later on.

Finally, priors must be chosen for the parameters $\mu$ and $\phi$. I'm going to lazily choose diffuse beta priors for both.

$\mu \sim Beta(0.5,0.5)$

$\phi \sim Beta(0.5,0.5)$

The advantage of choosing beta distributions for both (possible with the parametrization I used!) is that both priors are proper (in the sense of being valid probability density functions), and proper priors always yield proper posteriors - so that eliminates one potential problem to worry about. These prior distributions are definitely arguable - they put a fair amount of probability at the ends of the distributions, and I know for a fact that the true league mean batting average isn't actually 0.983 or 0.017, but I wanted to use something that worked well in the MCMC procedure and wasn't simply a flat uniform prior between 0 and 1.

The Math

Before jumping into the code, we need to do some math. Mass functions and densities of the binomial distribution for the $x_i$, beta distributions for $\theta_i$ (in terms of$\mu$ and $\phi$), and beta priors for $\mu$ and $\phi$ are given by

$p(x_i | n_i, \theta_i) = \displaystyle {n_i \choose x_i} \theta_i^{x_i} (1-\theta_i)^{n_i - x_i}$

$p(\theta_i | \mu, \phi) = \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$

$\pi(\mu) = \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)}$

$\pi(\phi) = \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$

From Bayes' theorem, the joint posterior density of $\mu$, $\phi$, and all $N = 254$ of the $\theta_i$ is given by

$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) = \dfrac{p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi)}{\int \int ... \int \int p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi) d\tilde{\theta} d\mu d\phi}$

The $...$ in the integrals means that every single one of the $\theta_i$ must be integrated out as well as $\mu$ and $\phi$, so the numerical integration here involves 256 dimensions. This is not numerically tractable, hence Markov chain Monte Carlo will be used instead.

The goal of Markov chain Monte Carlo is to draw a "chain" of samples $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ from the posterior distribution $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$. This is going to be accomplished in iterations, where at each iteration $j$ the distribution of the samples depends only on the values at the previous iteration $j-1$ (this is the "Markov" property of the chain). There are two basic "building block" techniques that are commonly used to do this.

The first technique is called the Gibbs sampler. The full joint posterior $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$ may not be known, but suppose that given values of the other parameters, the conditional posterior distribution $p(\tilde{\theta} | \mu, \phi, \tilde{x}, \tilde{n})$ is known - if so, it can be used to simulate $\tilde{\theta}$ values from $p(\tilde{\theta} | \mu^*_j, \phi^*_j, \tilde{x}, \tilde{n})$.

Looking at the joint posterior described above, the denominator of the posterior density (after performing all integrations) is just a normalizing constant, so we can focus on the numerator:

$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) \propto p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi) $

$= \displaystyle \prod_{i = 1}^N \left( {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)} \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$

From here, we can ignore any of the terms above that do not have a $\phi$, a $\mu$, or a $\theta_i$ in them, since those will either cancel out or remain constants in the full posterior as well:

$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$

Now we're going to check and see if there are any terms that, when looked at as variables with everything else treated as a constant, take the form of a recognizable distribution. It turns out that the function:

$\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}$

is the kernel of an un-normalized beta distribution for $\theta_i$ with parameters

$\alpha_i = x_i + \mu \left(\dfrac{1-\phi}{\phi}\right)$

$\beta_i = n_i - x_i + (1-\mu) \left(\dfrac{1-\phi}{\phi}\right) $

since we are assuming $\mu$ and $\phi$ are known in the conditional distribution. Hence, we can say that the conditional distribution of the $\theta_i$ given $\mu$, $\phi$, and the data is beta.

This fact be used in the MCMC to draw an observation $\theta^*_{i,j}$ from the posterior distribution for each $\theta_i$ given draws $\mu^*_j$ and $\phi^*_j$ from the posterior distributions for $\mu$ and $\phi$:

$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$

Note that this formulation uses the traditional $\alpha, \beta$ parametrization. This is a "Gibbs step" for the $\theta_i$.

Unfortunately, looking at $\mu$ and $\phi$ in isolation doesn't yield a similar outcome - observing just the terms involving $\mu$ and treating everything else as constant, for example, gives the function

$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}$

which is not recognizable as the kernel of any common density. Doing the same thing for $\phi$ gives a nearly identical function. Hence, the Gibbs technique won't be used for $\mu$ and $\phi$.

One advantage, however, of recognizing that the conditional distribution of the $\theta_i$ given all other parameters is beta is that we can integrate the $\theta_i$ out in the likelihood in order to get at the distributions of $\mu$ and $\phi$ more directly:

$\displaystyle p(x_i, n_i | \mu, \phi) = \int_0^1 p(x_i, n_i | \theta_i) p(\theta_i | \mu, \phi) d\theta_i = \int_0^1 {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1)} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} d\theta_i $

$ = \displaystyle {n_i \choose x_i} \dfrac{\beta(x_i + \mu (1-\phi)/\phi), n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$

In fact, we can do this for every single one of the $\theta_i$ in the formula above and rewrite the posterior function just in terms of $\mu$ and $\phi$:

$p(\mu, \phi | \tilde{x}, \tilde{n}) \propto \displaystyle \prod_{i = 1}^N \left( \dfrac{\beta(x_i + \mu (1-\phi)/\phi), n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$

This leads directly into the second (and more general) technique for obtaining draws from the posterior distribution: the Metropolis-Hastings algorithm. Suppose that instead of the full posterior $p(\mu, \phi | \tilde{x}, \tilde{n})$, you have a function that is proportional to the full posterior (like the numerator above)

$h(\mu, \phi | \tilde{x}, \tilde{n}) \propto p(\mu, \phi | \tilde{x}, \tilde{n})$

It's possible to construct a Markov chain of $\mu^*_j$ samples using the following steps:

Simulate a candidate value $\mu^*_c$ from some distribution $G(\mu^*_c | \mu^*_{j-1})$
Simulate $u$ from a uniform distribution between 0 and 1.
Calculate the ratio

$\dfrac{h(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}{h(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}$

If this ratio is larger than $u$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$.
If this ratio is smaller than $u$, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$

A nearly identical step may be used to draw a sample $\phi^*_j$, only using $h(\mu^*_{j-1}, \phi^*_{c} | \tilde{x}, \tilde{n})$ instead. Note that at each Metropolis-Hastings step the value from the previous iteration is used, even if a new value for another parameter was accepted in another step.

In practice, there are two things that are very commonly (but not always) done for Metropolis-Hastings steps: first, calculations are generally performed on the log scale, as the computations become much, much more numerically stable. To do this, we simply need to take the log of the function $h(\mu, \phi | \tilde{x}, \tilde{n})$ above:

$m(\mu, \phi | \tilde{x}, \tilde{n}) = \log[h(\mu, \phi | \tilde{x}, \tilde{n})] = \displaystyle \sum_{i = 1}^N \left[ \log(\beta(x_i + \mu (1-\phi)/\phi), n_i - x_i + (1-\mu) (1-\phi)/\phi))\right]$

$- N \log(\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)) - 0.5\log(\mu) - 0.5\log(1-\mu) - 0.5\log(\phi) - 0.5\log(1-\phi)$

This $m$ function is called repeatedly throughout the code. Secondly, for the candidate distribution, a normal distribution is used centered at the previous value of the chain, with some pre-chosen variance $\sigma^2$, which I will explain how to determine in the next section. Using $\mu$ as an example, the candidate distribution would be

$G(\mu^*_c | \mu^*_{j-1}) \sim N(\mu^*_{j -1}, \sigma^2_{\mu})$

Using these two adjustments, the Metropolis-Hastings step for $\mu$ then becomes

Simulate a candidate value from a $N(\mu^*_{j-1}, \sigma^2_{\mu})$ distribution
Simulate $u$ from a uniform distribution between 0 and 1.
If $m(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) - m(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) > \log(u)$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$. Otherwise, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$

With Metropolis-Hastings steps and Gibbs steps, we can create a Markov chain that converges to the posterior distribution.

Choosing Starting Values and Checking Output

Now that we have either the conditional posteriors we need for the Gibbs sampler or a function proportional to them for the Metropolis-Hastings steps, it's time to write code to sample from them. Each iteration of the MCMC code will perform the following steps:

Draw a candidate value $\mu^*_c$ from $N(\mu^*_{j-1}, \sigma^2_{\mu})$
Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\mu^*_c$. If accepted, set $\mu^*_j = \mu^*_c$. If rejected, set $\mu^*_j = \mu^*_{j - 1}$
Draw a candidate value $\phi^*_c$ from $N(\phi^*_{j-1}, \sigma^2_{\phi})$
Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\phi_c$. If accepted, set $\phi^*_j = \phi^*_c$. If rejected, set $\phi^*_j = \phi^*_{j - 1}$
For each of the $\theta^*_i$, draw a new $\theta^*_{i,j}$ from the conditional beta distribution:

$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$

Again, note that this formulation of the beta distribution uses the traditional $\alpha, \beta$ parametrization.

A problem emerges - we need starting values $\mu^*_1$ and $\phi^*_1$ before we can use the algorithm (starting values for the $\theta^*_{i,1}$ aren't needed - the Gibbs sampler in step 5 above can be used to simulate them given starting values for the other two parameters). Ideally, you would pick starting values in a high-probability area of the posterior distribution, but if you knew the posterior distribution you wouldn't be performing MCMC!

You could just pick arbitrary starting points - statistical theory says that no matter what starting values you choose, the distribution of samples from the Markov chain will eventually converge to the distribution of the posterior you want (assuming certain regularity conditions which I will not go into), but there's no hard and fast rule on how long it will take. If you pick values extremely far away from the posterior, it could take quite a while for your chain to converge. There's a chance you could have run for your code for 10,000 iterations and still not have reached the posterior distribution, and there's no way of knowing since you don't know the posterior to begin with!

Statisticians generally do two things to check that this hasn't occurred:

Use multiple starting points to create multiple chains of $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ that can be compared (visually or otherwise) to see if they all appear to have converged to the same area in the parameter space.
Use a fixed number of "burn-in" iterations to give the chain a chance to converge to the posterior distribution before taking the "real" draws from the chain.

There is no definite answer on exactly how to pick the different starting points - you could randomly choose points in the parameter space (which is handily confined to between 0 and 1 for the parametrization I used!), or you could obtain estimates from some frequentist statistical procedure (such as method of moments or marginal maximum likelihood) and use those, or you could pick values based on your own knowledge of the problem - for example, choosing $\mu^*_1 = 0.265$ based on what knowing the league mean batting average is probably close that value. No matter how you do it, starting points should be spread out over the parameter space to make sure the chains aren't all going to the same place just because they started off close to each other.

Two more questions must be answered to perform the Metropolis-Hastings step - how do you choose $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ in the normal candidate distributions? And how often should you accept the candidate values?

The answers to these questions are closely tied to each other. For mathematical reasons that I will not go into in this article (and a bit of old habit), I usually aim for an acceptance rate of roughly around 40%, though the specific value depends on the dimensionality of the problem (see this paper by Gelman, Roberts, and Wilks for more information). In practice, I'm usually not worried if it's 30% or 50% as long as everything else looks okay.

If the acceptance rate is good, then a plot of the value of the chain versus the iteration number (called a "trace plot") should look something like

I've used two chains for $\mu$ here, starting at different points. The "spiky blob" shape is exactly what we're looking for - the values of the chains jump around at a good pace, but still making large enough jumps to effectively cover the parameter space.

If the acceptance rate is too small or too large, it can be adjusted by changing $\sigma^2$ in the normal candidate distribution. An acceptance rate that is too low means that the chains will not move around the parameter space effectively. If this is the case, a plot of the chain value versus the iteration number looks like

The plot looks nicer visually, but that's not a good thing - sometimes the chains stay at the same value for hundreds of iterations! The solution to this problem is to lower $\sigma^2$ so that the candidate values are closer to the previous value, and more likely to be accepted.

Conversely, if the acceptance rate is too high then the chains will still explore the parameter space, but much too slowly. A plot of the chain value versus the iteration looks like

In this plot, it looks like the two the chains don't quite converge to the posterior distribution until hundreds of iterations after the initial draws. Furthermore, the chains are jumping to new values at nearly every iteration, but the jumps are so small that it takes an incredibly large number of iterations to explore the parameter space. If this is the case, the solution is to increase $\sigma^2$ so that the candidates are further from the current value, and less likely to be accepted.

The value of $\sigma^2$, then, is often chosen by trial-and-error after the code has been written by manually adjusting the value in multiple runs of the MCMC so that the trace plots have the "spiky blob" shape and the acceptance rate is reasonable. Through this method, I found that the following candidate distributions for $\mu$ and $\phi$ worked well.

$\mu^*_c \sim N(\mu^*_{j-1}, 0.005^2)$

$\phi^*_c \sim N(\phi^*_{j-1}, 0.001^2)$

The Code

Now that we know the steps the codes will take and what inputs are necessary, coding can begin. I typically code in R, and find it useful to write a function that has inputs of data vectors, starting values for any parameters, and any MCMC tuning parameters I might want to change (such as the number of draws, length of the burn-in period, or the variance of the candidate distributions). In the code below, I set the burn-in period and number of iterations to default to 1000 and 5000, respectively, and after running the code several times without defaults for candidate variances, I determined values of $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ that produced reasonable trace plots and acceptance rates and set those as defaults as well.

For output, I used the list() structure in R to return a vector chain of $\mu^*_j$, a vector chain of $\phi^*_j$, a matrix of chains $\theta^*_{i,j}$, and a vector of acceptance rates for the Metropolis-Hastings steps for $\mu$ and $\phi$.

The raw code for the MCMC function is shown below, and annotated code may be found on my Github.

.
betaBin.mcmc <- function(x, n, mu.start, phi.start, burn.in = 1000, n.draws = 5000, sigma.mu = 0.005, sigma.phi = 0.001) {

m = function(mu, phi, x, n) {
       N = length(x)
       l = sum(lbeta(mu*(1-phi)/phi + x, (1-mu)*(1-phi)/phi+n-x)) - N*lbeta(mu*(1-phi)/phi, (1-mu)*(1-phi)/phi)
       p = -0.5*log(mu) - 0.5*log(1-mu) - 0.5*log(phi) - 0.5*log(1-phi)
       return(l + p)
}

phi = rep(0, burn.in + n.draws)
mu = rep(0, burn.in + n.draws)
theta = matrix(rep(0, length(n)*(burn.in + n.draws)), length(n), (burn.in + n.draws))

acceptance.mu = 0
acceptance.phi = 0

mu[1] = mu.start
phi[1] = phi.start

for(i in 1:length(x)) {
    theta[i, 1] = rbeta(1, mu[1]*(1-phi)[1]/phi[1] + x[i], (1-phi)[1]/phi[1]*(1-mu[1]) + n[i] - x[i])
}

for(j in 2:(burn.in + n.draws)) {

   phi[j] = phi[j-1]
   mu[j] = mu[j-1]

   cand = rnorm(1, mu[j-1], sigma.mu)

   if((cand > 0) & (cand < 1)) {

    m.old = m(mu[j-1],phi[j-1],x,n)
    m.new = m(cand,phi[j-1],x,n)

    u = runif(1)

      if((m.new - m.old) > log(u)) {
       mu[j] = cand
       acceptance.mu = acceptance.mu+1
      }
}

cand = rnorm(1,phi[j-1],sigma.phi)

if( (cand > 0) & (cand < 1)) {

    m.old = m(mu[j-1],phi[j-1],x,n)
   m.new = m(mu[j-1],cand,x,n)

    u = runif(1)

    if((m.new - m.old) > log(u)) {
       phi[j] = cand
       acceptance.phi = acceptance.phi + 1
   }
}

for(i in 1:length(n)) {
    theta[i, j] = rbeta(1, (1-phi[j])/phi[j]*mu[j] + x[i], (1-phi[j])/phi[j]*(1-mu[j]) + n[i] - x[i])
}

}

mu <- mu[(burn.in + 1):(burn.in + n.draws)]
phi <- phi[(burn.in + 1):(burn.in + n.draws)]
theta <- theta[,(burn.in + 1):(burn.in + n.draws)]

return(list(mu = mu, phi = phi, theta = theta, acceptance = c(acceptance.mu/(burn.in + n.draws), acceptance.phi/(burn.in + n.draws))))

}

This, of course, is not the only way it may be coded, and I'm sure that others with more practical programming experience could easily improve upon this code. Note that I add an additional wrinkle to the formulation given in the previous sections to address a practical concern - I immediately reject a candidate value if it is less than 0 or larger than 1. This is not the only possible way to deal with this potential problem, but works well in my experience, and the acceptance rate and/or starting points can be adjusted if the issue becomes serious.

There is a bit of redundancy in the code - the quantity m.old is calculated twice, when it is used identically in both Metropolis-Hastings steps - and I'm inflating the acceptance rate slightly by including the burn-in iterations, but the chains should converge quickly so the effect will be minimal, and more draws can always be taken to minimize the effect.

Though coded in R, the principles should apply no matter which language you use - hopefully you could take this setup and write code in C or python if you wanted to.

The Results

Using the function defined above, I ran three separate chains of 5000 iterations each after a burn-in of 1000 draws. For starting points, I picked values near where I thought the posterior means would end up, plus values both above and below, to check that all chains converged to the same distributions.

> chain.1 <- betaBin.mcmc(x,n, 0.265, 0.002)
> chain.2 <- betaBin.mcmc(x,n, 0.5, 0.1)
> chain.3 <- betaBin.mcmc(x,n, 0.100, 0.0001)

Checking the acceptance rates for $\mu$ and $\phi$ from each of the three chains, all are reasonable:

> chain.1$\$$acceptance
[1] 0.3780000 0.3613333
> chain.2$\$$acceptance
[1] 0.4043333 0.3845000
> chain.3$\$$acceptance
[1] 0.3698333 0.3768333

(Since the $\theta_i$ were obtained by a Gibbs sampler, they do not have an associated acceptance rate)

Next, plots of the chain value versus iteration for $\mu$, $\phi$, and $\theta_1$ show all three chains appear to have converged to the same distribution, and the trace plots appear to have the "spiky blob" shape that indicates good mixing:

Hence, we can use our MCMC draws to estimate properties of the posterior. To do this, combine the results of all three chains into one big set of draws for each variable:

mu <- c(chain.1$\$$mu, chain.2$\$$mu, chain.3$\$$mu)
phi <- c(chain.1$\$$phi, chain.2$\$$phi, chain.3$\$$phi)
theta <- cbind(chain.1$\$$theta, chain.2$\$$theta, chain.3$\$$theta)

Statistical theory says that posterior distributions should converge to a normal distribution as the sample size increases. With a sample size of $N = 254$ batting averages, posteriors should be close to normal in the parametrization I used - though normality of the posteriors is in general not a guarantee that everything has worked well, nor is non-normality evidence that something has gone wrong.

First, the posterior distribution for league batting average can be seen just by taking a histogram:

> hist(mu)

The histogram looks almost perfectly normally distributed - about as close to the ideal as is reasonable.

Next, we want to get an estimator for the league mean batting average. There are a different few ways to turn the posterior sample $\mu^*_j$ into an estimator $\hat{\mu}$, but I'll give the simplest here (and since the posterior distribution looks normal, other methods should give very similar results) - taking the sample average of the $\mu^*_j$ values:

> mean(mu)
[1] 0.2660155

Similarly, we can get an estimate of the standard error for $\hat{\mu}$ and a 95% credible interval for $\mu$ by taking the standard deviation and quantiles from $\mu^*_j$:

> sd(mu)
[1] 0.001679727
> quantile(mu,c(.025,.975))
2.5% 97.5%
0.2626874 0.2693175

For $\phi$, do the same thing - first look at the histogram:

There is one outlier on the high side - which can happen in an MCMC chain simply by chance - and a slight skew to the right, but otherwise, the posterior looks close to normal. The mean, standard deviation, and a 95% credible interval are given by

> mean(phi)
[1] 0.001567886
> sd(phi)
[1] 0.000332519
> quantile(phi,c(.025,.975))
2.5% 97.5%
0.0009612687 0.0022647623

Furthermore, let's say that instead of $\phi$, I had a particular function of one of the parameters in mind instead - for example, I mentioned at the beginning that $\phi$ is, in sabermetric speak, the proportion of stabilization after a single at-bat. This can be turned into the general so-called "stabilization point" $M$ by

$M = \dfrac{1-\phi}{\phi}$

and so to get a posterior distribution for $M$, all we need to do is apply this transformation to each draw from $\phi^*_j$. A histogram of $M$ is given by

> hist((1-phi)/phi)

The histogram is skewed clearly to the right, but that's okay since $M$ is not one of the parameters in the model.

An estimate and 95% credible for the stabilization point is given by taking the average and quantiles of the transformed values

> mean((1-phi)/phi)
[1] 667.8924
> quantile((1-phi)/phi, c(0.025,0.975))
2.5% 97.5%
440.5474 1039.2918

This estimate is different than the value I gave in my article 2016 Stabilization Points because the calculations in that article used the past six years of data - this calculation only uses one. This is also why the uncertainty is so much larger.

Lastly, we can get at what we really want - estimates of the "true" batting averages $\theta_i$ for each player. I'm going to look at $i = 1$ (the first player in the sample), who happens to be Bryce Harper, the National League MVP in 2015. His batting average was 0.330 (from $x_1 = 172$ hits in $n_1 = 521$ AB), but the effect of fitting the hierarchical Bayesian analysis is to shrink the estimate of his "true" batting average $\theta_i$ towards the league mean $\mu$ - and by quite a bit in this case, since Bryce had nearly largest batting average in the sample. A histogram of the $\theta^*_{1,j}$ shows, again, a roughly normal distribution.

> hist(theta[1,])

and an estimate of his true batting average, standard error of the estimate, and 95% credible interval for the estimate are given by

> mean(theta[1,])
[1] 0.2947706
> sd(theta[1,])
[1] 0.01366782
> quantile(theta[1,], c(0.025,0.975))
2.5% 97.5%
0.2687120 0.3222552

Other functions of the batting averages, functions of the league mean and variance, or posterior predictive calculations can be performed using the posterior samples $\mu^*$, $\phi^*$, and $\theta^*_i$.

Conclusion and Connections

MCMC techniques similar to the ones shown here have become fairly standard in Bayesian estimation, though there are more advanced techniques in use today that build upon these "building block" steps by, to give one example, adaptively changing the acceptance rate as the code runs rather than guessing-and-checking to find a reasonable value.

The empirical Bayesian techniques from my article Beta-binomial empirical Bayes represent an approximation to this full hierarchical method. In fact, using the empirical Bayesian estimator from that article on the baseball set described in this article gives $\hat{\alpha} = 172.5478$ and $\hat{\beta} = 476.0831$ (equivalent to $\hat{\mu} = 0.266$ and $\hat{\phi} = 0.001539$), and gives Bryce Harper an estimated true batting average of $\theta_1 = 0.2946$, with a 95% credible interval of $(0.2688, 0.3210)$ - only slightly shorter than the interval from the full hierarchical model.

Lastly, the "regression toward the mean" technique common in sabermetrics also approximates this analysis. Supposing you had a "stabilization point" of around 650 AB for batting averages (650 is actually way too large, but I'm pulling this number from my calculations above to illustrate a point), then the amount shrunk towards league mean of $\mu \approx 0.266$ is

$\left(\dfrac{521}{521 + 650}\right) \approx 0.4449$

So that the estimate of Harper's batting average is

$0.266 + 0.4449\left(\dfrac{172}{521} - 0.266\right) \approx 0.2945$

Three methods all going to the same place - all closely related in theory and execution.

Hopefully this helps with understanding MCMC coding. The article ended up much longer than I originally intended, but there were many parts I've gotten used to doing quickly that I realized required a not-so-quick explanation to justify why I'm doing them. As usual, comments and suggestions are appreciated!