## 21 May, 2015

### Beta-Binomial Empirical Bayes

When you have a set of random variables - say, batting averages of a group of players - building a model that allows for "borrowing" of information from can greatly improve estimation. In the case of batting averages, say a player has a batting average of 0.400 over 45 at-bats. Any baseball fan knows that's unsustainably high - but the raw prediction doesn't "know." All the raw prediction knows is that a player went to bat 45 times and got 18 hits - so the best prediction of that player's future is 0.400. But if it also can somehow know that 0.400 is the highest batting average in a group of players, then that information can be used - possibly, to predict that some amount of regression towards the mean is due.

A basic statistical model that involves a group of similar probability distributions can often be improved by adding something that allows the model from one player to access the information from another player. One technique common in statistical modeling is to add a distribution "underneath" (or on top of, depending on how you visualize it) the individual probability distributions for each player. This is known as a "mixing" distribution, and it represents the probability distribution of the "true" parameters for each player.

In the batting averages example, it represents the distribution of batting averages. For example, you could use the mixing distribution to answer to questions such as "If I randomly select a batter, what is the probability that he has a true batting average greater than 0.250?" or "If I randomly select a batter, what is the probability that I see 5 hits in 30 at bats?" - essentially, sampling from the pool of all batters rather than looking at an individual one.

## The Beta-Binomial Model

A very common statistical model of this form is called the beta-binomial model (I'm going to assume you're familiar with both the beta and binomial distributions individually). The basic formulation is

$y_i \sim Bin(\theta_i, n_i)$
$\theta_i \sim Beta(\alpha, \beta)$

This model assumes player $i$ has a "true" batting average represented by constant proportion $\theta_i$, where $0 \le \theta_i \le 1$. Of course, we don't actually know that each $\theta_i$ exists, any more than we know that it follows a binomial distribution or know that there's a beta distribution underneath it. But what this model does allow is the sharing of information - in this case, the beta distribution is the mixing distribution - which is very, very useful to us.

If we knew the values of the parameters of the beta distribution $\alpha$ and $\beta$, we could use Bayes' rule to get an estimate of the $\theta_i$

$\hat{\theta_i} =\dfrac{y_i + \alpha}{n_i + \alpha + \beta}$

If we don't know and don't want to be full Bayesians, however, another option is to use empirical Bayes - that is, to replace $\alpha$ and $\beta$ with estimates $\hat{\alpha}$ and $\hat{\beta}$ (these can be moment estimates, maximum likelihood estimates - any kind of estimates). Then our estimates of the $\theta_i$ become

$\hat{\theta_i} = \dfrac{y_i + \hat{\alpha}}{n_i + \hat{\alpha } + \hat{\beta}}$

The effect of this is that the raw estimates $y_i/n_i$ are pulled towards the group mean in an amount that is determined by two things - the variance between the raw batting averages (if the raw batting averages are very close together, that will increase the shrinkage) and the number of at-bats. Raw predictions that are further away from the mean are pulled harder - a player batting 0.400 or 0.100 is going to have his predicted batting average reduced or improved drastically, but a player batting 0.250 is not going to see much of a change. Furthermore, no matter what $\hat{\alpha}$ and $\hat{\beta}$ are, their effect will go to zero (and hence, shrinkage will lessen) as the number of at-bats $n_i$ gets larger.

## Estimating the Parameters

The two most common procedures used for estimation of the parameters in the beta-binomial model are the method of moments and the method of maximum likelihood. I'm going to briefly describe an iterative method of moments estimators, which is useful if all you need are a point estimate and don't necessarily care about the uncertainty of estimation for the parameters. The disadvantage is that moment estimators tend to be less efficient than likelihood estimators, and for small sample sizes can produce negative estimates of $\alpha$ and $\beta$.

Code in R for this procedure can be found on my github.

As previously stated, the two-stage model for beta-binomial empirical Bayes is given by

$y_i \sim Binomial(n_i, \theta_i)$
$\theta_i \sim Beta(\alpha, \beta)$

For the purpose of estimation, I'm going to rewrite the parameters of the beta distribution as

$\mu = \dfrac{\alpha}{\alpha + \beta}$
$\phi = \dfrac{1}{\alpha + \beta + 1}$

This makes $\mu$ equal to the expected value of the beta distribution with $\phi$ as the dispersion parameter.

Begin by defining

$\tilde{\theta_i} = \dfrac{y_i}{n_i}$

These are the raw, un-shrunk estimates of the $\theta_i$. Estimate the mean as

$\hat{\mu} = \dfrac\sum w_i \tilde{\theta_i}}{w$

where $w_i$ are weights and $w = \sum w_i$. Initial values of weights must be chosen - using $w_i = n_i$ is a good choice.

Then define

$S = \displaystyle \sum w_i (\hat{\theta_i} - \hat{\mu})^2$

The dispersion parameter $\phi$ of the beta distribution is estimated as

$\hat{\phi} = \dfrac{\hat{\mu}(1-\hat{\mu})\left[\displaystyle \sum \dfrac{w_i}{n_i} \left(1 - \dfrac{w_i}{w}\right)\right]}{\hat{\mu}(1-\hat{\mu})\left[\displaystyle \sum w_i \left(1 - \dfrac{w_i}{w}\right) - \displaystyle \sum \dfrac{w_i}{n_i} \left(1 - \dfrac{w_i}{w}\right)\right]}$

The weights $w_i$ are then re-estimated as

$w_i = \dfrac{n_i}{1 + \hat{\phi}(n_i - 1)}$

And the process is repeated (starting at calculating $\hat{\mu}$) until the sum of squared differences between the old and new weights is less than a certain tolerance. The iterated moments estimator produces estimates tend to be closer in value to the estimates produced by maximum likelihood than the naive estimator found on Wikipedia, but both moments estimators will give the same estimates when all sample sizes are equal.

To convert these estimates back to the traditional $\alpha, \beta$ format, apply the fomulas:

$\hat{\alpha} = \hat{\mu} \left(\dfrac{1-\hat{\phi}}{\hat{\phi}}\right)$

$\hat{\beta} = (1- \hat{\mu}) \left(\dfrac{1-\hat{\phi}}{\hat{\phi}}\right)$

And empirical Bayesian estimates of the $\hat{\theta_i}$ are given by

$\hat{\theta_i} =\dfrac{y_i + \hat{\alpha}}{n_i + \hat{\alpha} + \hat{\beta}}$

Caution: It's important here not to try to use the raw $y_i / n_i$ as observations from the beta distribution in order to estimate $\alpha$ and $\beta$ directly - they're estimates of the $\theta_i$, and you'd be ignoring the uncertainty in those estimates, which could lead to poor estimation of the parameters of the underlying beta distribution when the $n_i$ are small or the noise around the $\theta_i$ is large, such as for batting average on balls in play.

## Batting Averages

Recall the batting-averages data used in the James-Stein estimator post. That data required a messy transformation to make it fit the equal variance, normality assumption . The beta-binomial method does not require either. If we say that $y_i$ is the number of hits player $i$ gets in $n_i$ at-bats, then moment estimates of the beta parameters are given by $\hat{\mu} = 0.265$ and $\hat{M} = 367.988$. Also define $\theta_i$ as the batting average over the rest of the season

\begin{array}{l c c c c c c c} \hline
\textrm{Player} & y_i & y_i/n_i & \textrm{JS Estimate} & \textrm{BB Estimate} & \theta_i\\ \hline
Clemente & 18 & .400 & .290  & .280 & .346 \\
F. Robinson & 17 & .378  & .286  & .278 & .298 \\
F. Howard & 16 & .356 &  .282 &  .275 & .276 \\
Johnstone & 15 & .333  &  .277 &  .273 & .222\\
Barry & 14 & .311 &  .273 & .270 & .273\\
Spencer & 14 & .311 &  .273 & .270 & .270\\
Kessinger & 13 & .289 & .268  & .268 & .263\\
L. Alvarado & 12 & .267 &  .264 & .266 & .210\\
Santo & 11 & .244 &  .259 &  .263 & .269\\
Swoboda & 11 & .244 & .259  &  .263 & .230\\
Unser & 10 &.222  &  .254 & .261 & .264\\
Williams & 10 & .222 & .254  & .261 & .256 \\
Scott & 10 & .222 &  .254 & .261 &  .303 \\
Petrocelli & 10 & .222 & .254  & .261 & .264\\
E. Rodriguez & 10 & .222 &  .254 & .261 & .226\\
Campaneris & 9 & .200 &  .249 & .258 & .285\\
Munson & 8 & .178 &  .244 & .256 & .316\\
Alvis & 7 & .156 &  .239  & .253 & .200\\ \hline
\end{array}

The beta-binomial empirical Bayesian estimates tend to shrink a bit harder towards the mean - sometimes for better, sometimes for worse. Both the James-Stein and empirical Bayesian estimates improve dramatically on the naive estimates for prediction - compare

•  $\sum (\hat{\theta}^{JS}_i - \theta_i)^2 = 0.0215$
•  $\sum (\hat{\theta}^{BB}_i - \theta_i)^2 = 0.0218$
•  $\sum (y_i/n_i - \theta_i)^2 = 0.0753$.

This type of estimation that pulls observations towards a group mean is known as shrinkage estimation. Whether by means of the James-Stein estimator or a beta-binomial empirical Bayesian method, it offers a clear improvement on prediction, when the measurement criterion is squared error.