24 July, 2015

Normal-Normal Shrinkage Estimation by Empirical Bayes

Shrinkage estimation is a very common technique in baseball statistics. So is the normal model. It turns out that one of the one way to shrink is to assume that both the data you see and the distribution of means of the data you see are normal - and then estimate the distribution of means prior by use of the data itself.

The (non-annotated) code I used to generate these results may be found on my github.

The Normal-Normal Model

The basic normal-normal model assumes two "stages" of data - the first is the observed data, which we will call $Y_i$ - this is assumed to be normally distributed with mean $\theta_i$ and variance $\sigma^2$.

$Y_i \sim N(\theta_i, \sigma^2)$
$\theta_i \sim N(\mu, \tau^2)$

Bayes' rule says that if you assume the "second" level - the distribution of the $\theta_i$ - is also normal, and treat it as a prior, the posterior distribution of $\theta_i$ (that is, the distribution in belief of values of $\theta_i$ after taking into account the data $y_i$ - see my previous post on Bayesian inference) is also normal with distribution

$\theta_i | y_i, \mu, \sigma^2 \sim N(B\mu + (1-B)y_i, (1-B)\sigma^2)$

where
$B = \left(\dfrac{\sigma^2}{\sigma^2 + \tau^2}\right)$

The quantity $B$ gives the amount that the mean of the data shrinks towards the mean of the prior. If $\sigma^2$ is large compared to $\tau^2$ (and so the data has a large amount of variance relative to the prior) then the mean of the data gets shrunk towards the prior mean by quite a bit. If $\sigma^2$ is small compared to $\tau^2$ (and so the data has a small amount of variance relative to the prior) then the mean of the data doesn't get shrunk much at all - it tends to stay near the observed $y_i$.

No matter what the prior is, this Bayesian estimator can be thought of as a shrinkage estimator. It's just a question of what you're shrinking to, and by how much. A Bayesian, if he or she wishes to be "noninformative" (and I'm going to completely ignore all the controversy of naming and choosing priors) might pick something like $\theta_i \sim N(0,1000)$ as a prior, so $B$ is very close to zero and the shrinkage is very small.

Empirical Bayes

What we're going to do, however, is focus on using the data to choose the prior distribution, by assuming the normal-normal model as described above and estimating the parameters of the prior from the data itself - this is known as empirical Bayes. The effect  of using the data to choose the prior is that the data is shrunk towards the mean of the data, in an amount determined by the variance(s) of the data.

How do we estimate $\mu$ and $\sigma^2$? One nice property of the normal-normal model is that the marginal distribution of $y_i$ is also normal:

$y_i | \sigma^2, \tau^2, \mu \sim N( \mu, \sigma^2 + \tau^2 )$

There are three quantities that need to be estimated to perform this - $\mu$, $\sigma^2$, and $\tau^2$. The formula above gives us two - the first, the population mean, can be estimated by the sample mean $\hat{\mu} = \bar{y}$. The variance is a bit trickier. If I just take the standard variance estimator of the $y_i$

$Var(Y) = \dfrac{\sum (y_i - \bar{y})^2}{N-1}$

then that gives an estimate $\hat{\sigma^2 + \tau^2}$ (the hat is over the entire thing - it's not estimating two individual variances and summing them, it's estimating the sum of two individual variances), assuming $\sigma^2$ is the same for every observation. So we're going to need to get some information from somewhere about what $\sigma^2$  is. If the $y_i$ are sums or averages of observations, we can use that. If not, we have to get creative.

Baseball Example

Let's say $Y_i$ is the distribution of a player's batting average in $n_i$ at-bats (the $Y_i$ are already divided by $n_i$, so they are averaged), and the player has true batting average $\theta_i$. We know that a true 0.300 hitter isn't going to always hit 0.300 - he will hit sometimes above, sometimes below. The model says that the player's observed batting average follows a normal distribution with true batting average $\theta_i$ and variance $\sigma^2/n_i$ (since it is an average) - this is the "first" normal distribution as described above.

The second stage is the underlying distribution of the $\theta_i$ - that is, the distribution of all players' true batting averages. It is normal with mean $\mu$ (the true mean league average) and variance $\tau^2$. So using this two-stage model is equivalent to saying that if I selected a random, unknown batter from the pool of all major league baseball players and observed his batting average $y_i$, it will follow a normal distribution with mean $\mu$ (the true league mean batting average) and variance $\sigma^2/n_i + \tau^2$ (the sum of the variation due naturally to a player's luck and the variation in batting averages between all players). I understand that this is a bit weird to think about, but imagine trying to describe the distribution of a player's observed batting average when we don't even know his name - you have to figure his true average is somewhere between 0.200 and 0.330, with an mean of around 0.265, and then on top of that add in the average amount of natural variation around his true average (sometimes above, sometimes below) at $n_i$ at-bats. That's what's going on here.

In baseball terms, $\sigma^2/n_i$ can be thought of as the "within-player" variance or "luck" and $\tau^2$ can be thought of as the "between-player" variance or "talent." If a batter is very consistent in hitting near his true ability and the distribution of all batting averages is very spread out, then not much shrinkage will occur. Conversely, if a batter has a lot of variation in his observed batting average but there's not much apparent variation in the distribution of all batting averages, then the player will be shrunk heavily towards the league mean.

We need three estimates in order to do the procedure - an estimate of each of $\mu$, $\sigma^2/n_i$, and $\tau^2$. The first, $\mu$, is the mean batting average of the league - and since $y_i$ is the batting average for player i, the estimate of the league mean batting average is the obvious one - $\hat{\mu} = \bar{y}$.

In the case that $n_i$ is the same for all of your observations, taking $Var(y_i)$ gives us $\hat{\sigma^2/n_i + \tau^2}$. If there are differing $n_i$, the estimation method gets more complex - it's worth its own post at some point to work through it. I'm going to assume that all the players have the same number of at-bats to keep things simple for this example, though it admittedly does make it feel rather artificial.

Now we need an estimate of $\sigma^2/n_i$ - the "within-player" variance. But the normal distribution doesn't tell us what $\sigma^2/n_i$ should be. It's pretty typical to model a player's hits in $n_i$ at-bats as following a binomial distribution with true batting average $\theta_i$. Then the sample batting average (hits over at-bats) has variance

$Var(\textrm{Sample Batting Average}) = \dfrac{\theta_i(1-\theta_i)}{n_i}$

The binomial distribution is just a sum of independent, identical Bernoulli trials (at-bats, in this case) each with probability of getting a hit $\theta$. So the central limit theorem says that for large $n_i$, we can approximate the distribution of the sample batting average with a normal!

$\textrm{Sample Batting Average} \sim N\left(\theta_i, \dfrac{\theta_i(1-\theta_i)}{n_i}\right)$

A value for $\theta_i$ is needed. It feels natural to use $y_i$ - the player's batting average - in the estimation. This is wrong, however - remember, we don't know how the player's true talent level! We need to shrink by the average variance amount. The average variance amount is estimated by the variance at the league batting average - which is a quantity we also have an estimate for. The estimate of the within-player variance is then

$\dfrac{\hat{\sigma^2}}{n_i} \approx \dfrac{\bar{y}(1-\bar{y})}{n_i}$

Then the empirical Bayes estimator is given by

$\hat{\theta_i} = \hat{B} \bar{y} + (1-\hat{B})y_i$

where

$\hat{B} = \dfrac{\bar{y}(1-\bar{y})/n_i}{\sum (y_i - \bar{y})^2/(N-1)}$

Comparison

Using the famous Morris data set to compare these estimators (I'll call them $\hat{\theta}^{NN}$ for normal-normal) to the shrinkage estimators from the Beta-Binomial (see post here - I'll call these $\hat{\theta}^{BB}$) and James-Stein (see post here - and note that the James-Stein estimator is just a specific version of the normal-normal estimator - I'll call them $\hat{\theta}^{JS}$), we see that it performs well.

\begin{array}{l c c c c c} \hline
\textrm{Player} & y_i & \hat{\theta}^{NN} & \hat{\theta}^{BB} & \hat{\theta}^{JS} & \theta \\ \hline
Clemente        &0.400             &0.280             &0.280            &0.290 &0.346 \\
F. Robinson        &0.378             &0.277             &0.278            &0.286 &0.298 \\
F. Howard        &0.356             &0.275             &0.275            &0.282 &0.276 \\
Johnstone        &0.333             &0.273            &0.273            &0.277 &0.222\\
Barry        &0.311             &0.270             &0.270            &0.273 &0.273\\
Spencer        &0.311             &0.270             &0.270            &0.273 &0.270\\
Kessinger        &0.289             &0.268             &0.268            &0.268 &0.263\\
L.Alvarado        &0.267             &0.266             &0.266            &0.264 &0.210\\
Santo        &0.244             &0.263             &0.263            &0.259 &0.269\\
Swoboda       &0.244             &0.263             &0.263            &0.259 &0.230\\
Unser       &0.222             &0.261             &0.261            &0.254 &0.264\\
Williams       &0.222             &0.261             &0.261            &0.254 &0.256\\
Scott      &0.222             &0.261             &0.261            &0.254 &0.303\\
Petrocelli &0.222             &0.261             &0.261            &0.254 &0.264\\
E. Rodriguez &0.222             &0.261             &0.261            &0.254 &0.226\\
Campaneris       &0.200             &0.258             &0.258            &0.249 &0.285\\
Munson &0.178             &0.256             &0.256            &0.244 &0.316\\
Alvis       &0.156             &0.254             &0.253            &0.239 &0.200\\ \hline
\end{array}

For this data set, the normal-normal and beta-binomial estimates are almost identical. This shouldn't be a surprise - both the distribution of batting average talent and variation around batting averages is roughly bell-shaped and symmetric, so the normal-normal model and the beta-binomial models are both flexible enough to take that shape.

The normal-normal and beta-binomial estimators shrinks the most while the James-Stein shrinks a moderate amount. For this specific data set, the James-Stein estimator seems to hold a slight-advantage - not by much, though.

• $\sum (\hat{\theta}^{NN}_i - \theta_i)^2 = 0.0218$
• $\sum (\hat{\theta}^{BB}_i - \theta_i)^2 = 0.0218$
• $\sum (\hat{\theta}^{JS}_i - \theta_i)^2 = 0.0215$
• $\sum (y_i - \theta_i)^2 = 0.0753$.

Whatever the method you use to shrink, estimates are produced that, when judged using the squared error loss function, are far superior to using the raw batting averages.

This estimator relies on the assumption of normality for both the data and underlying distribution of means - this means it will work well for batting statistics (which tend to be constructed as sums, weighted or otherwise, of assumed independent, identical events) but not as well for other statistics which don't naturally look "normal." Furthermore, if I have to estimate $\sigma^2$ with a binomial variance -why don't I just use a beta-binomial model? That doesn't depend on a large number of at-bats for normality of the distribution of the sample batting average. Overall, I think it will give nice results when used appropriately, but in many situations a different model will fit more naturally to the data.