24 May, 2016

Let's Code an MCMC for a Hierarchical Model for Batting Averages

In previous articles, I've discussed empirical Bayesian estimation for the beta-binomial model. Empirical Bayesian analysis is useful, but it's only an approximation to the full hierarchical Bayesian analysis. In this post, I'm going to work through the entire process of doing an equivalent full hierarchical Bayesian analysis with MCMC, from looking at the data and picking a model to creating the MCMC to checking the results. There are, of course, great packages and programs out there such as PyMC and Stan that will fit the MCMC for you, but I want to give a basic and complete "under the hood" example.

Before I get started, I want to be clear that coding a Bayesian analysis with MCMC from scratch involves many choices and multiple checks at almost all levels. I'm going to hand-wave some choices based on what I know will work well (though I'll try to be clear where and why I'm doing so), and I'm not going to attempt to show every possible way of checking an MCMC procedure in one post - so statistics such as $\hat{R}$ and effective sample size will not be discussed. For a fuller treatment of Bayesian estimation using MCMC, I recommend Gelman et al.'s Bayesian Data Analysis and/or Carlin and Louis's Bayesian Methods for Data Analysis.

As usual, all my code and data can be found on my github.

The Data and Notation

 

The goal is to fit a hierarchical model to batting averages in the 2015 season. I'm going to limit my data set to only the batting averages of all MLB hitters (excluding pitchers) who had at least 300 AB, as those who do not meet these qualifications can arguably be said to come from a different "population" of players. This data was collected from fangraphs.com and can be seen in the histogram below.

 


For notation, I'm going to let $i$ index MLB players in the sample and define $\theta_i$ as a player's "true" batting average in 2015. The goal is to use the observed number of hits $x_i$ in $n_i$ at-bats (AB) to estimate $\theta_i$ for player $i$. I'll assume that I have $N$ total players - in 2015, there were $N = 254$ non-pitchers with at least 300 AB.

I'm also going to use a $\sim$ over a variable to represent the collection of statistics over all players in the sample. For example, $\tilde{x} = \{ x_1, x_2, ..., x_N\}$ and $\tilde{\theta} = \{\theta_1, \theta_2, ..., \theta_N\}$.

Lastly, when we get to the MCMC part, we're going to take samples from the posterior distributions rather than calculating them directly. I'm going to use $\mu^*_j$ to represent the set of samples from the posterior distribution for $\mu$, where $j$ indexes 1 to however many samples the computer is programmed to obtain (usually a very large number, since computation is relatively cheap these days), and similarly $\phi^*_j$ and $\theta^*_{i,j}$ for samples from the posterior distribution of $\phi$ and $\theta_i$, respectively.

 

The Model


First, the model must be specified. I'll assume that for each at-bat, a given player has identical probability $\theta_i$ of getting a hit, independent of other at-bats. The distribution of the total number of hits in $n_i$ at-bats is then binomial.


$x_i \sim Bin(n_i, \theta_i)$


For the distribution of the batting averages $\theta_i$ themselves,  I'm going to use a beta distribution. Looking at the histogram of the data, it looks relatively unimodal and bell-shaped, and batting averages by definition must be between 0 and 1. Keep in mind that the distribution of observed batting averages $x_i/n_i$ is not the same as the distribution of actual batting averages $\theta_i$, but even after taking into account the binomial variation around the true batting averages, the distribution of the $\theta_i$ should also be unimodal, roughly bell-shaped, and bounded by 0 and 1. The beta distribution - bounded by 0 and 1 by definition - will be able to take that shape (though others have plausibly argued that a beta is not entirely correct).

Most people are familiar with the beta distribution in terms of $\alpha$ and $\beta$:

$\theta_i \sim Beta(\alpha, \beta)$

There isn't anything wrong with coding an MCMC in this form (and it would almost certainly work well in this scenario), but I know from experience that a different parametrization works better - I'm going to use the beta distribution with parameters $\mu$ and $\phi$:

 $\theta_i \sim Beta(\mu, \phi)$

where $\mu$ and $\phi$ are given in terms of $\alpha$ and $\beta$ as

$\mu = \dfrac{\alpha}{\alpha + \beta}$
$\phi = \dfrac{1}{\alpha + \beta + 1}$

In this parametrization, $\mu$ represents the expected value $E[\theta_i]$ of the beta distribution - the true league mean batting average - and $\phi$, known formally as the "dispersion parameter," is the correlation between two individual at-bats from the same randomly chosen player - in sabermetric speak, it's how much a hitter's batting average has "stabilized" after a single at-bat. The value of $\phi$ controls how spread out the $\theta_i$ are around $\mu$.

The advantage of using this parametrization instead of the traditional one is that both $\mu$ and $\phi$ are bounded between 0 and 1 (whereas $\alpha$ and $\beta$ can take any value from 0 to $\infty$), and a bounded parameter space makes the process of specifying priors easier and will improve the convergence of the MCMC algorithm later on.
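For reference, converting between the two parametrizations takes only a line in each direction. Here's a quick R sketch (the helper names are mine, purely for illustration):

ab.to.muphi <- function(alpha, beta) {
 c(mu = alpha/(alpha + beta), phi = 1/(alpha + beta + 1))
}

muphi.to.ab <- function(mu, phi) {
 c(alpha = mu*(1 - phi)/phi, beta = (1 - mu)*(1 - phi)/phi)
}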

Finally, priors must be chosen for the parameters $\mu$ and $\phi$. I'm going to lazily choose diffuse beta priors for both.

$\mu \sim Beta(0.5,0.5)$
$\phi \sim Beta(0.5,0.5)$

The advantage of choosing beta distributions for both (possible with the parametrization I used!) is that both priors are proper (in the sense of being valid probability density functions), and proper priors always yield proper posteriors - so that eliminates one potential problem to worry about. These prior distributions are definitely arguable - they put a fair amount of probability at the ends of the distributions, and I know for a fact that the true league mean batting average isn't actually 0.983 or 0.017, but I wanted to use something that worked well in the MCMC procedure and wasn't simply a flat uniform prior between 0 and 1. 

The Math


Before jumping into the code, we need to do some math. The mass function of the binomial distribution for the $x_i$, the density of the beta distribution for the $\theta_i$ (in terms of $\mu$ and $\phi$), and the beta prior densities for $\mu$ and $\phi$ are given by

$p(x_i | n_i, \theta_i) = \displaystyle {n_i \choose x_i} \theta_i^{x_i} (1-\theta_i)^{n_i - x_i}$

$p(\theta_i | \mu, \phi) = \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$

$\pi(\mu) = \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)}$

 $\pi(\phi) = \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$


From Bayes' theorem, the joint posterior density of $\mu$, $\phi$, and all $N = 254$ of the $\theta_i$ is given by

$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) = \dfrac{p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi)}{\int \int ... \int \int p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi)  \pi(\mu) \pi(\phi) d\tilde{\theta} d\mu d\phi}$

The $...$ in the integrals means that every single one of the $\theta_i$ must be integrated out as well as $\mu$ and $\phi$, so the numerical integration here involves 256 dimensions. This is not numerically tractable, hence Markov chain Monte Carlo will be used instead.

The goal of Markov chain Monte Carlo is to draw a "chain" of samples $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ from the posterior distribution $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$. This is going to be accomplished in iterations, where at each iteration $j$ the distribution of the samples depends only on the values at the previous iteration $j-1$ (this is the "Markov" property of the chain). There are two basic "building block" techniques that are commonly used to do this.

The first technique is called the Gibbs sampler. The full joint posterior $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$ may not be known, but suppose that given values of the other parameters, the conditional posterior distribution $p(\tilde{\theta} | \mu, \phi, \tilde{x}, \tilde{n})$ is known - if so, it can be used to simulate $\tilde{\theta}$ values from $p(\tilde{\theta} | \mu^*_j, \phi^*_j, \tilde{x}, \tilde{n})$.

Looking at the joint posterior described above, the denominator of the posterior density (after performing all integrations) is just a normalizing constant, so we can focus on the numerator:

$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) \propto p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi) $

$= \displaystyle \prod_{i = 1}^N \left( {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)} \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$

From here, we can ignore any of the terms above that do not have a $\phi$, a $\mu$, or a $\theta_i$ in them, since those will either cancel out or remain constants in the full posterior as well:

$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$

Now we're going to check and see if there are any terms that, when looked at as variables with everything else treated as a constant, take the form of a recognizable distribution. It turns out that the function:

$\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}$ 

is the kernel of an un-normalized beta distribution for $\theta_i$ with parameters

 $\alpha_i = x_i + \mu \left(\dfrac{1-\phi}{\phi}\right)$

$\beta_i =  n_i - x_i + (1-\mu) \left(\dfrac{1-\phi}{\phi}\right) $

since we are assuming $\mu$ and $\phi$ are known in the conditional distribution. Hence, we can say that the conditional distribution of the $\theta_i$ given $\mu$, $\phi$, and the data is beta.

This fact can be used in the MCMC to draw an observation $\theta^*_{i,j}$ from the posterior distribution for each $\theta_i$ given draws $\mu^*_j$ and $\phi^*_j$ from the posterior distributions for $\mu$ and $\phi$:

$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$

Note that this formulation uses the traditional $\alpha, \beta$ parametrization. This is a "Gibbs step" for the $\theta_i$.
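In R, this Gibbs step for all $N$ players at once is a single vectorized rbeta() call - a sketch, where x and n are the data vectors and mu.star and phi.star hold the current draws:

# one Gibbs draw of every theta_i, given current draws mu.star and phi.star
theta.star <- rbeta(length(x), x + mu.star*(1 - phi.star)/phi.star,
                    n - x + (1 - mu.star)*(1 - phi.star)/phi.star)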

Unfortunately, looking at $\mu$ and $\phi$ in isolation doesn't yield a similar outcome - observing just the terms involving $\mu$ and treating everything else as constant, for example, gives the function

$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}$

which is not recognizable as the kernel of any common density.  Doing the same thing for $\phi$ gives a nearly identical function. Hence, the Gibbs technique won't be used for $\mu$ and $\phi$.

One advantage, however, of recognizing that the conditional distribution of the $\theta_i$ given all other parameters is beta is that we can integrate the $\theta_i$ out in the likelihood in order to get at the distributions of $\mu$ and $\phi$ more directly:

$\displaystyle p(x_i, n_i | \mu, \phi) = \int_0^1 p(x_i, n_i | \theta_i) p(\theta_i | \mu, \phi) d\theta_i = \int_0^1 {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} d\theta_i $

$ = \displaystyle {n_i \choose x_i} \dfrac{\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$
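As a quick sanity check on this closed form, we can compare it against brute-force numerical integration for a single hypothetical player - the parameter and data values below are made up purely for illustration:

# check the closed form against numerical integration (illustrative values only)
mu <- 0.3; phi <- 0.2; x1 <- 5; n1 <- 20
a <- mu*(1 - phi)/phi; b <- (1 - mu)*(1 - phi)/phi
integrate(function(t) dbinom(x1, n1, t)*dbeta(t, a, b), 0, 1)$value
exp(lchoose(n1, x1) + lbeta(x1 + a, n1 - x1 + b) - lbeta(a, b))  # should match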

In fact, we can do this for every single one of the $\theta_i$ in the formula above and rewrite the posterior function just in terms of $\mu$ and $\phi$:

$p(\mu, \phi | \tilde{x}, \tilde{n}) \propto \displaystyle \prod_{i = 1}^N \left( \dfrac{\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$

This leads directly into the second (and more general) technique for obtaining draws from the posterior distribution: the Metropolis-Hastings algorithm. Suppose that instead of the full posterior $p(\mu, \phi | \tilde{x}, \tilde{n})$, you have a function that is proportional to the full posterior (like the numerator above)

$h(\mu, \phi | \tilde{x}, \tilde{n}) \propto p(\mu, \phi | \tilde{x}, \tilde{n})$

It's possible to construct a Markov chain of $\mu^*_j$ samples using the following steps:
  1. Simulate a candidate value $\mu^*_c$ from some distribution $G(\mu^*_c | \mu^*_{j-1})$
  2. Simulate $u$ from a uniform distribution between 0 and 1.
  3. Calculate the ratio 
$\dfrac{h(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}{h(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}$

          If this ratio is larger than $u$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$.
          If this ratio is smaller than $u$, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$

A nearly identical step may be used to draw a sample $\phi^*_j$, only using $h(\mu^*_{j-1}, \phi^*_{c} | \tilde{x}, \tilde{n})$ instead. Note that at each Metropolis-Hastings step the value from the previous iteration is used, even if a new value for another parameter was accepted in another step.

In practice, there are two things that are very commonly (but not always) done for Metropolis-Hastings steps: first, calculations are generally performed on the log scale, as the computations become much, much more numerically stable.  To do this, we simply need to take the log of the function $h(\mu, \phi | \tilde{x}, \tilde{n})$  above:

$m(\mu, \phi | \tilde{x}, \tilde{n}) = \log[h(\mu, \phi | \tilde{x}, \tilde{n})] =  \displaystyle \sum_{i = 1}^N \left[ \log\left(\beta\left(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi\right)\right)\right]$
$- N \log(\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)) - 0.5\log(\mu) - 0.5\log(1-\mu) - 0.5\log(\phi) - 0.5\log(1-\phi)$
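To see why the log scale matters, note that the beta function values involved here are far too small to represent directly in floating point - R's lbeta() computes the logarithm via log-gamma functions without ever forming the underflowing value:

beta(1000, 2000)    # underflows to 0, so log(beta(1000, 2000)) is -Inf
lbeta(1000, 2000)   # about -1912, computed stably on the log scale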

This $m$ function is called repeatedly throughout the code. Secondly, for the candidate distribution, a normal distribution is used centered at the previous value of the chain, with some pre-chosen variance $\sigma^2$, which I will explain how to determine in the next section. Using $\mu$ as an example, the candidate distribution would be

$G(\mu^*_c | \mu^*_{j-1}) = N(\mu^*_{j -1}, \sigma^2_{\mu})$

Using these two adjustments, the Metropolis-Hastings step for $\mu$ then becomes

  1. Simulate a candidate value from a $N(\mu^*_{j-1}, \sigma^2_{\mu})$ distribution
  2. Simulate $u$ from a uniform distribution between 0 and 1.
  3. If $m(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) - m(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) > \log(u)$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$. Otherwise, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$
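Put together, a single log-scale Metropolis-Hastings step for $\mu$ is only a few lines of R. A sketch, assuming the function m() from above, the data vectors x and n, a candidate standard deviation sigma.mu, and current values mu.cur and phi.cur (the check that the candidate lies between 0 and 1 anticipates a practical wrinkle discussed alongside the full code below):

cand <- rnorm(1, mu.cur, sigma.mu)  # step 1: draw a candidate
u <- runif(1)                       # step 2: draw a uniform
if(cand > 0 && cand < 1 &&
   m(cand, phi.cur, x, n) - m(mu.cur, phi.cur, x, n) > log(u)) {
  mu.cur <- cand                    # step 3: accept...
}                                   # ...otherwise mu.cur stays put (reject)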


With Metropolis-Hastings steps and Gibbs steps, we can create a Markov chain that converges to the posterior distribution.


Choosing Starting Values and Checking Output



Now that we have the conditional posteriors we need for the Gibbs steps and a function proportional to the posterior for the Metropolis-Hastings steps, it's time to write code to sample from them. Each iteration of the MCMC code will perform the following steps:

  1. Draw a candidate value $\mu^*_c$ from $N(\mu^*_{j-1}, \sigma^2_{\mu})$ 
  2. Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\mu^*_c$. If accepted, set $\mu^*_j = \mu^*_c$. If rejected, set $\mu^*_j = \mu^*_{j - 1}$
  3. Draw a candidate value $\phi^*_c$ from $N(\phi^*_{j-1}, \sigma^2_{\phi})$ 
  4. Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\phi^*_c$. If accepted, set $\phi^*_j = \phi^*_c$. If rejected, set $\phi^*_j = \phi^*_{j - 1}$
  5. For each of the $\theta^*_i$, draw a new $\theta^*_{i,j}$ from the conditional beta distribution:
$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$

Again, note that this formulation of the beta distribution uses the traditional $\alpha, \beta$ parametrization.

A problem emerges - we need starting values $\mu^*_1$ and $\phi^*_1$ before we can use the algorithm (starting values for the $\theta^*_{i,1}$ aren't needed - the Gibbs sampler in step 5 above can be used to simulate them given starting values for the other two parameters). Ideally, you would pick starting values in a high-probability area of the posterior distribution, but if you  knew the posterior distribution you wouldn't be performing MCMC!

You could just pick arbitrary starting points - statistical theory says that no matter what starting values you choose, the distribution of samples from the Markov chain will eventually converge to the distribution of the posterior you want (assuming certain regularity conditions which I will not go into), but there's no hard and fast rule on how long it will take. If you pick values extremely far away from the posterior, it could take quite a while for your chain to converge. You could run your code for 10,000 iterations and still not have reached the posterior distribution, and there's no way of knowing for sure, since you don't know the posterior to begin with!

Statisticians generally do two things to check that this hasn't occurred:
  1. Use multiple starting points to create multiple chains of $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ that can be compared (visually or otherwise) to see if they all appear to have converged to the same area in the parameter space.
  2. Use a fixed number of "burn-in" iterations to give the chain a chance to converge to the posterior distribution before taking the "real" draws from the chain.

There is no definite answer on exactly how to pick the different starting points - you could randomly choose points in the parameter space (which is handily confined to between 0 and 1 for the parametrization I used!), you could obtain estimates from some frequentist statistical procedure (such as method of moments or marginal maximum likelihood) and use those, or you could pick values based on your own knowledge of the problem - for example, choosing $\mu^*_1 = 0.265$ based on knowing that the league mean batting average is probably close to that value. No matter how you do it, starting points should be spread out over the parameter space to make sure the chains aren't all going to the same place just because they started off close to each other.
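For example, a rough method-of-moments calculation gives one reasonable starting point. This sketch uses the marginal variance of the observed averages, $Var(x_i/n_i) = \mu(1-\mu)\left[\phi + (1-\phi)/n_i\right]$, with $n_i$ replaced by its average - crude, but fine for a starting value:

y <- x/n
mu.start <- mean(y)
phi.start <- (var(y)/(mu.start*(1 - mu.start)) - 1/mean(n))/(1 - 1/mean(n))

The other chains can then be started at points spread out around these values.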

Two more questions must be answered to perform the Metropolis-Hastings step - how do you choose $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ in the normal candidate distributions? And how often should you accept the candidate values?

The answers to these questions are closely tied to each other. For mathematical reasons that I will not go into in this article (and a bit of old habit), I usually aim for an acceptance rate of roughly 40%, though the ideal value depends on the dimensionality of the problem (see this paper by Gelman, Roberts, and Gilks for more information). In practice, I'm usually not worried if it's 30% or 50% as long as everything else looks okay.

If the acceptance rate is good, then a plot of the value of the chain versus the iteration number (called a "trace plot") should look something like


I've used two chains for $\mu$ here, starting at different points. The "spiky blob" shape is exactly what we're looking for - the values of the chains jump around at a good pace, while still making jumps large enough to cover the parameter space effectively.

If the acceptance rate is too small or too large,  it can be adjusted by changing $\sigma^2$ in the normal candidate distribution. An acceptance rate that is too low means that the chains will not move around the parameter space effectively. If this is the case, a plot of the chain value versus the iteration number looks like



The plot looks nicer visually, but that's not a good thing - sometimes the chains stay at the same value for hundreds of iterations! The solution to this problem is to lower $\sigma^2$ so that the candidate values are closer to the previous value, and more likely to be accepted.

Conversely, if the acceptance rate is too high then the chains will still explore the parameter space, but much too slowly. A plot of the chain value versus the iteration looks like


In this plot, it looks like the two chains don't quite converge to the posterior distribution until hundreds of iterations after the initial draws. Furthermore, the chains are jumping to new values at nearly every iteration, but the jumps are so small that it takes an incredibly large number of iterations to explore the parameter space. If this is the case, the solution is to increase $\sigma^2$ so that the candidates are further from the current value, and less likely to be accepted.

The value of $\sigma^2$, then, is often chosen by trial-and-error after the code has been written by manually adjusting the value in multiple runs of the MCMC so that the trace plots have the "spiky blob" shape and the acceptance rate is reasonable. Through this method, I found that the following candidate distributions for $\mu$ and $\phi$ worked well.

$\mu^*_c \sim N(\mu^*_{j-1}, 0.005^2)$

$\phi^*_c \sim N(\phi^*_{j-1}, 0.001^2)$


The Code


Now that we know the steps the code will take and what inputs are necessary, coding can begin. I typically code in R, and find it useful to write a function that has inputs of data vectors, starting values for any parameters, and any MCMC tuning parameters I might want to change (such as the number of draws, the length of the burn-in period, or the variances of the candidate distributions). In the code below, I set the burn-in period and number of iterations to default to 1000 and 5000, respectively, and after running the code several times without defaults for candidate variances, I determined values of $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ that produced reasonable trace plots and acceptance rates and set those as defaults as well.

For output, I used the list() structure in R to return a vector chain of $\mu^*_j$, a vector chain of $\phi^*_j$, a matrix of chains $\theta^*_{i,j}$, and a vector of acceptance rates for the Metropolis-Hastings steps for $\mu$ and $\phi$.

The raw code for the MCMC function is shown below, and annotated code may be found on my Github.

betaBin.mcmc <- function(x, n, mu.start, phi.start, burn.in = 1000, n.draws = 5000, sigma.mu = 0.005, sigma.phi = 0.001) {

 # log of the function proportional to the joint posterior of (mu, phi),
 # with the theta_i integrated out - the m function derived above
 m  = function(mu, phi, x, n) {
       N = length(x)
       l = sum(lbeta(mu*(1-phi)/phi + x, (1-mu)*(1-phi)/phi+n-x)) - N*lbeta(mu*(1-phi)/phi, (1-mu)*(1-phi)/phi)
       p = -0.5*log(mu) - 0.5*log(1-mu) - 0.5*log(phi) - 0.5*log(1-phi)
       return(l + p)
 }

 # storage for the chains, one slot per iteration (burn-in included)
 phi = rep(0, burn.in + n.draws)
 mu = rep(0, burn.in + n.draws)
 theta = matrix(0, length(n), burn.in + n.draws)

 # counters for the number of accepted candidates
 acceptance.mu = 0
 acceptance.phi = 0

 # starting values for mu and phi...
 mu[1] = mu.start
 phi[1] = phi.start

 # ...and initial Gibbs draws for the theta_i given those starting values
 for(i in 1:length(x)) {
    theta[i, 1] = rbeta(1, mu[1]*(1-phi[1])/phi[1] + x[i], (1-mu[1])*(1-phi[1])/phi[1] + n[i] - x[i])
 }


 for(j in 2:(burn.in + n.draws)) {

   # default to the previous values - overwritten only if a candidate is accepted
   phi[j] = phi[j-1]
   mu[j] = mu[j-1]

   # Metropolis-Hastings step for mu; candidates outside (0, 1) are rejected immediately
   cand = rnorm(1, mu[j-1], sigma.mu)

   if((cand > 0) & (cand < 1)) {

      m.old = m(mu[j-1],phi[j-1],x,n)
      m.new = m(cand,phi[j-1],x,n)

      u = runif(1)

      if((m.new - m.old) > log(u)) {
        mu[j] = cand
        acceptance.mu = acceptance.mu+1
      }
  }

  # Metropolis-Hastings step for phi, using mu from the previous iteration
  cand = rnorm(1,phi[j-1],sigma.phi)
 
  if( (cand > 0) & (cand < 1)) {

    m.old = m(mu[j-1],phi[j-1],x,n)
    m.new = m(mu[j-1],cand,x,n)

    u = runif(1)

    if((m.new - m.old) > log(u)) {
        phi[j] = cand
        acceptance.phi = acceptance.phi + 1
    }
  }
 
  # Gibbs step: draw each theta_i from its conditional beta distribution
  for(i in 1:length(n)) {
    theta[i, j] = rbeta(1, (1-phi[j])/phi[j]*mu[j] + x[i], (1-phi[j])/phi[j]*(1-mu[j]) + n[i] - x[i])
  }

 }

 # discard the burn-in iterations before returning the chains
 mu <- mu[(burn.in + 1):(burn.in + n.draws)]
 phi <- phi[(burn.in + 1):(burn.in + n.draws)]
 theta <- theta[,(burn.in + 1):(burn.in + n.draws)]

 return(list(mu = mu, phi = phi, theta = theta, acceptance =  c(acceptance.mu/(burn.in + n.draws), acceptance.phi/(burn.in + n.draws))))

}


This, of course, is not the only way it may be coded, and I'm sure that others with more practical programming experience could easily improve upon this code. Note that I add an additional wrinkle to the formulation given in the previous sections to address a practical concern - I immediately reject a candidate value if it is less than 0 or larger than 1. This is not the only possible way to deal with this potential problem, but works well in my experience, and the acceptance rate and/or starting points can be adjusted if the issue becomes serious.

There is a bit of redundancy in the code - the quantity m.old is calculated twice per iteration even though it takes the same value in both Metropolis-Hastings steps - and I'm slightly inflating the acceptance rate by including the burn-in iterations, but the chains should converge quickly, so the effect will be minimal, and more draws can always be taken to minimize it.

Though coded in R, the principles apply no matter which language you use - hopefully you could take this setup and write equivalent code in C or Python if you wanted to.

The Results


Using the function defined above, I ran three separate chains of 5000 iterations each after a burn-in of 1000 draws. For starting points, I picked values near where I thought the posterior means would end up, plus values both above and below, to check that all chains converged to the same distributions.

> chain.1 <- betaBin.mcmc(x,n, 0.265, 0.002)
> chain.2 <- betaBin.mcmc(x,n, 0.5, 0.1)
> chain.3 <- betaBin.mcmc(x,n, 0.100, 0.0001)

Checking the acceptance rates for $\mu$ and $\phi$ from each of the three chains, all are reasonable:

> chain.1$acceptance
[1] 0.3780000 0.3613333
> chain.2$acceptance
[1] 0.4043333 0.3845000
> chain.3$acceptance
[1] 0.3698333 0.3768333


(Since the $\theta_i$ were obtained by a Gibbs sampler, they do not have an associated acceptance rate.)

Next, plots of the chain value versus iteration for $\mu$, $\phi$, and $\theta_1$ show that all three chains appear to have converged to the same distribution, and the trace plots have the "spiky blob" shape that indicates good mixing.
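These trace plots can be drawn directly from the returned chains - a sketch for $\mu$ in base R:

plot(chain.1$mu, type = "l", xlab = "Iteration", ylab = "mu")  # first chain
lines(chain.2$mu, col = "red")                                 # overlay the second...
lines(chain.3$mu, col = "blue")                                # ...and the third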



Hence, we can use our MCMC draws to estimate properties of the posterior.  To do this, combine the results of all three chains into one big set of draws for each variable:

mu <- c(chain.1$mu, chain.2$mu, chain.3$mu)
phi <- c(chain.1$phi, chain.2$phi, chain.3$phi)
theta <- cbind(chain.1$theta, chain.2$theta, chain.3$theta)

Statistical theory says that posterior distributions should converge to a normal distribution as the sample size increases. With a sample size of $N = 254$ batting averages, posteriors should be close to normal in the parametrization I used - though normality of the posteriors is in general not a guarantee that everything has worked well, nor is non-normality evidence that something has gone wrong.

First, the posterior distribution for the league mean batting average can be seen just by taking a histogram of the $\mu^*_j$ draws:

> hist(mu)


The histogram looks almost perfectly normally distributed - about as close to the ideal as is reasonable.

Next, we want to get an estimator for the league mean batting average. There are a few different ways to turn the posterior sample $\mu^*_j$ into an estimator $\hat{\mu}$, but I'll give the simplest here (and since the posterior distribution looks normal, other methods should give very similar results) - taking the sample average of the $\mu^*_j$ values:

> mean(mu)
[1] 0.2660155
 

Similarly, we can get an estimate of the standard error for $\hat{\mu}$ and a 95% credible interval for $\mu$ by taking the standard deviation and quantiles from $\mu^*_j$:

> sd(mu)
[1] 0.001679727
> quantile(mu,c(.025,.975))
     2.5%     97.5%
0.2626874 0.2693175 


For $\phi$, do the same thing - first look at the histogram:


There is one outlier on the high side - which can happen in an MCMC chain simply by chance - and a slight skew to the right, but otherwise, the posterior looks close to normal. The mean, standard deviation, and a 95% credible interval are given by

> mean(phi)
[1] 0.001567886
> sd(phi)
[1] 0.000332519
> quantile(phi,c(.025,.975))
        2.5%        97.5%
0.0009612687 0.0022647623



Furthermore, let's say that instead of $\phi$ itself, I had a particular function of one of the parameters in mind - for example, I mentioned at the beginning that $\phi$ is, in sabermetric speak, the proportion of stabilization after a single at-bat. This can be turned into the general so-called "stabilization point" $M$ by

$M = \dfrac{1-\phi}{\phi}$

and so to get a posterior distribution for $M$, all we need to do is apply this transformation to each draw from $\phi^*_j$. A histogram of $M$ is given by

> hist((1-phi)/phi)


The histogram is clearly skewed to the right, but that's not a concern - $M$ is a nonlinear transformation of $\phi$, not one of the parameters in the model, so its posterior need not look normal even when the posterior of $\phi$ does.

An estimate and 95% credible interval for the stabilization point are given by taking the average and quantiles of the transformed values:

> mean((1-phi)/phi)
[1] 667.8924

> quantile((1-phi)/phi, c(0.025,0.975))
     2.5%     97.5%
 440.5474 1039.2918 


This estimate is different from the value I gave in my article 2016 Stabilization Points because the calculations in that article used the past six years of data - this calculation uses only one. This is also why the uncertainty here is so much larger.

Lastly, we can get at what we really want - estimates of the "true" batting averages $\theta_i$ for each player. I'm going to look at $i = 1$ (the first player in the sample), who happens to be Bryce Harper, the National League MVP in 2015. His observed batting average was 0.330 (from $x_1 = 172$ hits in $n_1 = 521$ AB), but the effect of fitting the hierarchical Bayesian model is to shrink the estimate of his "true" batting average $\theta_1$ towards the league mean $\mu$ - and by quite a bit in this case, since Harper had nearly the largest batting average in the sample. A histogram of the $\theta^*_{1,j}$ shows, again, a roughly normal distribution.

> hist(theta[1,])


and an estimate of his true batting average, standard error of the estimate, and 95% credible interval for the estimate are given by

> mean(theta[1,])
[1] 0.2947706
> sd(theta[1,])
[1] 0.01366782
> quantile(theta[1,], c(0.025,0.975))
     2.5%     97.5%
0.2687120 0.3222552


Other functions of the batting averages, functions of the league mean and variance, or posterior predictive calculations can be performed using the posterior samples $\mu^*$, $\phi^*$, and $\theta^*_i$.


Conclusion and Connections



MCMC techniques similar to the ones shown here have become fairly standard in Bayesian estimation, though there are more advanced techniques in use today that build upon these "building block" steps by, to give one example, adaptively changing the acceptance rate as the code runs rather than guessing-and-checking to find a reasonable value.

The empirical Bayesian techniques from my article Beta-binomial empirical Bayes represent an approximation to this full hierarchical method. In fact, using the empirical Bayesian estimator from that article on the baseball data set described in this article gives $\hat{\alpha} = 172.5478$ and $\hat{\beta} = 476.0831$ (equivalent to $\hat{\mu} = 0.266$ and $\hat{\phi} = 0.001539$), and gives Bryce Harper an estimated true batting average of $\theta_1 = 0.2946$, with a 95% credible interval of $(0.2688, 0.3210)$ - only slightly shorter than the interval from the full hierarchical model.

Lastly, the "regression toward the mean" technique common in sabermetrics also approximates this analysis. Supposing you had a "stabilization point" of around 650 AB for batting averages (650 is actually way too large, but I'm pulling this number from my calculations above to illustrate a point), then the amount shrunk towards league mean of $\mu \approx 0.266$ is

$\left(\dfrac{521}{521 + 650}\right) \approx 0.4449$

So that the estimate of Harper's batting average is

$0.266 + 0.4449\left(\dfrac{172}{521} - 0.266\right) \approx 0.2945$
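The same arithmetic can be checked in a few lines of R (values taken from the text above):

mu.league <- 0.266
M <- 650                              # assumed stabilization point from above
w <- 521/(521 + M)                    # weight on Harper's observed average, ~0.4449
mu.league + w*(172/521 - mu.league)   # ~0.2945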

Three methods all going to the same place - all closely related in theory and execution.

Hopefully this helps with understanding MCMC coding. The article ended up much longer than I originally intended, but there were many parts I've gotten used to doing quickly that I realized required a not-so-quick explanation to justify why I'm doing them. As usual, comments and suggestions are appreciated!

2016 Win Prediction Totals (Through May 22)

These predictions are based on my own silly estimator, which I know can be improved with some effort on my part.  There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway).

I set the nominal coverage at 95% (meaning that, the way I calculated it, the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season, the actual coverage is slightly under 94%, with intervals being one game off if and when they are off.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins}\\ \hline

ARI & 65 & 79.58 & 94 & 81.81 & 21 \\
ATL & 48 & 61.91 & 77 & 67.95 & 12 \\
BAL & 74 & 89.11 & 104 & 85.19 & 26 \\
BOS & 80 & 94.48 & 109 & 92.65 & 27 \\
CHC & 88 & 102.56 & 117 & 99.31 & 29 \\
CHW & 75 & 89.64 & 104 & 87.36 & 26 \\
CIN & 49 & 63.47 & 78 & 66.52 & 15 \\
CLE & 71 & 85.5 & 100 & 85.03 & 22 \\
COL & 66 & 80.94 & 96 & 80.92 & 21 \\
DET & 66 & 80.55 & 95 & 81.05 & 21 \\
HOU & 57 & 71.08 & 86 & 74.86 & 17 \\
KCR & 64 & 79.12 & 94 & 77.75 & 22 \\
LAA & 62 & 76.87 & 91 & 78.08 & 20 \\
LAD & 68 & 82.33 & 97 & 83.53 & 22 \\
MIA & 67 & 81.46 & 96 & 80.95 & 22 \\
MIL & 57 & 71.51 & 86 & 73.47 & 18 \\
MIN & 47 & 61.06 & 76 & 68.16 & 11 \\
NYM & 72 & 87.09 & 102 & 84.52 & 25 \\
NYY & 63 & 77.99 & 93 & 77.6 & 21 \\
OAK & 58 & 71.82 & 86 & 73.12 & 19 \\
PHI & 67 & 81.6 & 96 & 77.71 & 25 \\
PIT & 69 & 83.7 & 98 & 81.94 & 23 \\
SDP & 59 & 72.83 & 87 & 74.53 & 19 \\
SEA & 76 & 90.55 & 105 & 87.86 & 26 \\
SFG & 72 & 85.92 & 100 & 82.28 & 27 \\
STL & 73 & 87.24 & 102 & 88.18 & 23 \\
TBR & 68 & 82.68 & 98 & 83.92 & 20 \\
TEX & 70 & 84.81 & 99 & 82.1 & 25 \\
TOR & 65 & 79.6 & 94 & 80.45 & 22 \\
WSN & 78 & 92.88 & 107 & 90.45 & 27 \\  \hline\end{array}

As you would expect, it's really, really difficult to predict how many games a team is going to win only a quarter of the way through the season, and intervals are necessarily going to be very wide. A couple of things stand out, though - at this point we can be confident that the Chicago Cubs will finish above 0.500 and the Minnesota Twins, Cincinnati Reds, and Atlanta Braves will finish below 0.500. For every other team, we just don't have enough information yet.

To explain the difference between "Mean" and "True Win Total"  - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season until May 22).

These quantiles are based on a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - with more wins than would be expected based on its talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - with fewer wins than would be expected based on its talent level.