30 July, 2015

Shrinkage Estimators for Counting Statistics

Edit 19 March 2020: This post has been adapted into a paper in the Journal of Mathematical Psychology. In the process of writing the paper, a number of  mistakes, omissions, or misstatements were found in this post. It is being left up as it was originally written, just in case anybody is interested. For a more correct version, please refer to the journal article.

Warning: this post is going to be incredibly technical, even by the standards of this blog.  If what I normally post is gory math, this is the running of the bulls. I'm making it so I can refer back to it when I need to.

The goal is to set up the theoretical framework for shrinkage estimation of normalized counting statistics to some common mean. I will fully admit this is a very, very limited framework, but some of the most basic baseball statistics fit into it. In the future I hope I can possibly expand this to include more advanced statistics.

I will give (not show) a few purely theoretical results - for proofs, see Natural Exponential Families with Quadratic Variance Functions by Carl Morris in The Annals of Statistics, Vol. 11, No. 2 (1983), 515-529, or the more updated version of that paper.

Theoretical Framework


Let's say I have some metric $X_i$ for player, team, or object $i$. In this framework, $X_i$ represents a count or a sum of some kind - the raw number of hits, or the raw number of earned runs, etc. I know that $X_i$ is the result of a random process that is controlled by a probability distribution with parameter $\theta_i$, which is unique to each player, team, or object - in baseball, for example, $\theta_i$ represents the player's true "talent" level with respect to metric $X_i$.

$X_i \sim p(x_i | \theta_i)$

I have to assume that the talent levels $\theta_i$ are exchangeable, though the definition is a bit too much to go into here.

I'm going to assume that $p(x_i | \theta_i)$ is a member of the natural exponential family with a quadratic variance function (NEFQVF) - this includes very common distributions such as the normal, binomial, Poisson, gamma, and negative binomial.

Each of these can be written as the convolution (sum) of $n_i$ other independent, identical distributions, each of which is also NEFQVF with mean $\theta_i$ - the normal is the sum of normals, the binomial is the sum of Bernoullis, the Poisson is the sum of Poissons, the negative binomial is the sum of geometrics, etc.. I will assume that is the case here - that

$X_i = \displaystyle \sum_{j = 1}^{n_i} Y_{ij}$

Translating this to baseball terms, this means that $Y_{ij}$ is the outcome of inning, plate appearance, etc.,  $j$ for player $i$ ($j$ ranges from 1 to $n_i$). The metric $X_i$ is then sum of $n_i$ of these outcomes. Each outcome is assumed independent and identical. Once again, $X_i$ is not normalized by dividing by $n_i$.

Conditional on having mean $\theta_i$, the expectations of the $Y_{ij}$ are

$E[Y_{ij} | \theta_i] = \theta_i$
 
And so conditional on having mean $\theta_i$, the expected value of the $X_i$ are

$E[X_i | \theta_i] = E\left[\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right] = \displaystyle \sum_{j = 1}^{n_i} E\left[ Y_{ij} \biggr | \theta_i \right] = n_i E[Y_{ij}| \theta_i] = n_i \theta_i$

Baseball terms: if a player has, for example, on-base percentage $\theta_i$, then the number of on-base events I expect in $n_i$ plate appearances is $n_i \theta_i$. This does not have to be a whole number.

Similarly, and again conditional on mean $\theta_i$, the independence assumption allows us to write the variance of the $X_i$ as

$Var(X_i | \theta_i) = Var\left(\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right) = \displaystyle \sum_{j = 1}^{n_i} Var\left( Y_{ij} \biggr | \theta_i \right) = n_i Var(Y_{ij}| \theta_i) = n_i V(\theta_i)$

I'm going to repeat that last bit of notation again, because it's important:

 $Var(Y_{ij}| \theta_i) =V(\theta_i)$

$V(\theta_i)$ is the variance of the outcome at the most basic level - plate appearance, inning, batter faced, etc.  - conditional on having mean $\theta_i$. For NEFQVF distributions, this has a very particular form -  the variance can be written as a polynomial function of the mean $\theta_i$ up to degree 2 (this is the "Quadratic Variance Function" part of NEFQVF):

$Var(Y_{ij} | \theta_i) = V(\theta_i) = c_0 + c_1 \theta_i + c_2 \theta_i^2$

For example, the normal distribution has $V(\theta_i) = \sigma^2$, so it fits the QVF model with $c_0 = \sigma^2$ and $c_1 = c_2 = 0$. For the Binomial distribution, $V(\theta_i) = \theta_i (1-\theta_i) =   \theta_i - \theta_i^2$, so it fits the QVF model with $c_0 = 0, c_1 = 1$, and $c_2 = -1$. The Poisson distribution has $V(\theta_i) = \theta_i$, so it fits the QVF model with $c_0 = c_2 = 0$ and $c_1 = 1$.

I'm now going to assume that the talent levels $\theta_i$ themselves follow some distribution $G(\theta_i | \mu, \eta)$. The parameter $\mu$ is the expected value of  the $\theta_i$ ($E[\theta_i] = \mu$), and it represents the league average talent level. The parameter $\eta$ controls, but is not necessarily equal to, the variance of $\theta_i$ (how spread out the talent levels are). Both are assumed to be known. The two-stage model is then

 $X_i \sim p(x_i | \theta_i)$
 $\theta_i \sim G(\theta_i | \mu, \eta)$

The unconditional expectation of the $X_i$ is

$E[X_i] = E[E[X_i | \theta_i]] = E[n_i \theta_i] = n_i \mu$

And the unconditional variance of $X_i$ is

$Var(X_i) = E[Var(X_i | \theta_i)] + Var(E[X_i | \theta_i]) = n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i) $

In the above formula, the quantity $E[V(\theta_i)]$ is the average variance of the outcome at the most basic level (plate appearance, inning, etc.), averaging over all possible talent levels $\theta_i$. The quantity $Var(\theta_i)$ is the variance of the talent levels themselves - how spread out talent is in the league.

To this point I haven't normalized the $X_i$ by dividing by each by $n_i$ - let's do that. If I define $\bar{X_i} = X_i/n_i,$ then based on the formulas above

$E[\bar{X_i}] =  E\left[\dfrac{X_i}{n_i}\right] = \dfrac{1}{n_i} E[X_i] = \dfrac{n_i \theta_i}{n_i} = \theta_i$

And variance

$Var(\bar{X_i}) = Var\left(\dfrac{X_i}{n_i}\right) = \dfrac{1}{n_i^2} Var(X_i) = \dfrac{n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i)}{n_i^2} = \dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)$ 

As members of the exponential family, members of the NEFQVF family are guaranteed to have a conjugate prior distribution, so I'll assume that $G(\theta_i | \mu, \eta)$ is conjugate to $p(x_i | \theta_i)$. For example, if $X_i$ follows a normal distribution, $G(\theta_i | \mu, \eta)$ is a normal as well. If $X_i$ follows a Binomial distribution, then $G(\theta_i | \mu, \eta)$ is a beta distribution. If $X_i$ follows a Poisson distribution, then $G(\theta_i | \mu, \eta)$ is a gamma distribution. The priors themselves do not have to be NEFQVF.

Since $\eta$ and $\mu$ are assumed known, we can use the Bayes' rule with conjugate prior $G(\theta_i | \mu, \eta)$ to calculate the posterior distribution for $\theta_i$

$\theta_i | x_i, \mu, \eta \sim \dfrac{p(x_i | \theta_i)G(\theta_i | \mu, \eta)}{\int p(x_i | \theta_i)G(\theta_i | \mu, \eta) d\theta_i}$

NEFQVF families have closed-form posterior densities.

I'm then going to take my as my estimator the expected value of the posterior, $\hat{\theta_i} = E[\theta_i | x_i]$. Specifically for NEFQVF distributions with conjugate priors, the estimator is then given by

$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B) \bar{x_i} + B \mu$

Where $B$ is known as the shrinkage coefficient. For NEFQVF distributions, the form of $B$ is

$B = \dfrac{E[\bar{X_i} | \theta_i]}{Var(X_i)} = \dfrac{\dfrac{1}{n_i}E[ V(\theta_i)]}{\dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)} = \dfrac{E[V(\theta_i)]}{E[V(\theta_i)] + n_i Var(\theta_i)}$

Note: The above two formulas, and several of the rules I used to derive them, are guaranteed for NEF distributions and not just NEFQVF distributions; however, the conjugate prior for a NEF may not have a normalizing constant that exists in closed form, and in practical application the distributions that are actually used tend to be NEFQFV. For NEFQFV distributions, a few more algebraic results can be shown about the exact form of the shrinkage estimator by writing the conjugate prior in the general form for exponential densities - for more information, see section 5 of Morris (1983), mentioned in the introduction.

The shrinkage estimator $B$ for NEFQVF distributions is the ratio of the within-metric variance to the total variance - which is a function of how noisy the data are compared and how spread out the talent levels are. If at a certain $n_i$ the normalized metric tends to be very noisy around its mean but the means tend to be clustered together, shrinkage will be large. If the normalized metric tends to stay close to its mean value but the means tend to be very spread out, shrinkage will be small. And as the number of observations $n_i$ grows bigger, the effect of the noise gets smaller, decreasing the shrinkage amount.
 
$B$ itself can be thought of as a shrinkage proportion - if $B = 0$ then there is no shrinkage, and the estimator is just the raw observation.  This would occur if the average variance around the mean is zero - if there's no noise. If $B = 1$ then complete shrinkage takes place and the estimate of the player's true talent level is just the league average talent level. This occurs if the variance in league talent levels is equal to zero - every player has the exact same talent level.

Note that $B$ has no units, since both the top and bottom are variances, so rescaling the data will not change the shrinkage proportion.

I'm going to show a few examples, working through gory mathematical details.

WARNING: the above results are guaranteed only for NEFQVF distributions - the normal, binomial, negative binomial, Poisson, and gamma, NEF-GHS. Some results also apply to NEF distributions - see Morris (1983) for details. If the data model is not one of those distributions, I can't say whether or not the formulas I've given above will be correct.


Normal-Normal Example


Let's start with one familiar form - the normal model. This model says that $X_i$, the metric for player $i$, is normally distributed, and is constructed as a sum of $Y_{ij}$ random variables, which are also normally distributed with mean $\theta_i$ and known variance $\sigma^2$. The distribution of talent levels also follows a normal distribution with league mean $\mu$ and variance $\tau^2$.

This can be written as

$Y_{ij} \sim N(\theta_i, \sigma^2)$
$X_i \sim N(n_i \theta_i, n_i \sigma^2)$
$\theta_i \sim N(\mu, \tau^2)$

The average variance is simple. As stated before, $V(\theta_i) = \sigma^2$ is constant for the normal distribution, no matter what the actual $\theta_i$ is. Hence,

$E[V(\theta_i)] = E[\sigma^2] = \sigma^2$

The variance of the averages is simple, too - the model assumes it's constant as well.

$Var(\theta_i) = \tau^2$

This gives a shrinkage coefficient of

$B = \dfrac{\sigma^2}{\sigma^2 + n_i \tau^2}$

Which, if I divide both the top and bottom by $n_i$, might look more familiar as

$B = \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}$

 The shrinkage estimator is then

$\hat{\theta_i} = \mu + \left(1 - \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}\right)(\bar{x_i} - \mu)$

Alternatively, I can write $B$ as

$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i}$

And then it follows the familiar pattern from other estimators of $B = m/(m  + n)$ for some parameter $m$.

It may seem like the normal-normal is not of use - how many counting statistics are there that are normally distributed at the level of inning, plate appearance, or batter faced? The very idea that they are counting statistics says that that's impossible.

However, the central limit theorem guarantees that sums of independent, identical random variables converge to a normal - hence the distribution of $X_i$ should be unimodal and bell-shaped for large enough $n_i$ (and I'll intentionally leave the discussion of what constitutes "large enough" aside). Thus, as long as the distribution of the $\theta_i$ (the distribution of talent levels) is bell-shaped and symmetric, using a normal-normal with the normal as an approximation at the $X_i$ level should work.

Beta-Binomial Example


Suppose we're measuring the sum of binary events of some kind - a hit, an on-base event, a strikeout, etc. - in $n_i$ observations - plate appearances, innings pitched, batters faced, etc. Each event can be thought of as a sample from a Bernoulli distribution (these are the $Y_{ij}$) with variance function $V(\theta_i) = \theta_i(1-\theta_i)$. The observed metric $X_i$ binomial, and it is constructed as the sum of these Bernoulli random variables


$Y_{ij} \sim Bernoulli(\theta_i)$
$X_i \sim Binomial (n_i, \theta_i)$

The prior distribution for the binomial distribution is the beta.
$\theta_i \sim Beta(\mu, M)$

Fitting with the framework given above, I'm using $\mu = \alpha/(\alpha+\beta)$ and $M = \alpha + \beta$ instead of the traditional $\alpha, \beta$ parametrization, so that $\mu$ represents the league mean and $M$ controls the variation.

The average variance is fairly complicated here. We need to find

$E[V(\theta_i)] = E[\theta_i(1-\theta_i)] = \displaystyle \int_0^1 \dfrac{\theta_i(1-\theta_i) * \theta_i^{\mu M-1}(1-\theta_i)^{(1-\mu) M-1}}{\beta(\mu M, (1-\mu) M)} d\theta_i = \dfrac{\displaystyle \int_0^1 \theta_i^{\mu M}(1-\theta_i)^{(1-\mu) M} d\theta_i}{\beta(\mu M, (1-\mu) M)}$

 The top part is a $\beta(\mu M + 1, (1-\mu)M + 1)$ function. Utilizing the properties of the beta function, we have

$E[\theta_i(1-\theta_i)] = \dfrac{\beta(\mu M+1, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)} = \dfrac{\beta(\mu M, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) = $

$\dfrac{\beta(\mu M, (1-\mu) M )}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) \left(\dfrac{(1-\mu) M}{\mu M + (1-\mu) M}\right) = \dfrac{\mu(1-\mu)M^2}{(M+1)M} = \dfrac{\mu(1-\mu) M}{M+1}$

 The variance of the $\theta_i$ doesn't require nearly as much calculus, since it can be taken directly as the variance of a beta distribution

$Var(\theta_i) = \dfrac{\mu(1-\mu)}{M+1}$

The shrinkage estimator $B$ is then

$B =  \dfrac{\dfrac{\mu(1-\mu)M}{(M+1)}}{\dfrac{\mu(1-\mu)M}{(M+1)} +\dfrac{n_i \mu(1-\mu)}{(M+1)}} = \dfrac{M}{M + n_i}$

Since $\mu(1-\mu)/(M+1)$ is in every term on the top and bottom, so it will cancel out. Using this model, then the shrinkage estimator is given by

$\hat{\theta_i} = \mu + \left(1 - \dfrac{M}{M + n_i}\right)\left(\bar{x_i} - \mu\right)$


Poisson-Gamma Example


Now suppose that instead of a binary event, the outcome can be a count - zero, one, two, three, etc. Each count can be thought of as a sample from a Poisson distribution with parameter $\theta_i$ (these are the $Y_{ij}$, with $V(\theta_i) = \theta_i$) with $X_i$ as the sum total of counts, which also has a Poisson distribution with parameter $n_i \theta_i$.

$Y_{ij} \sim Poisson(\theta_i)$
$X_i \sim Poisson(n_i \theta_i)$

The prior distribution of $\theta_i$ for a Poisson is a gamma.
$\theta_i \sim Gamma(\mu, K)$

In this parametrization, I'm using $\mu = \alpha/\beta$ and $K = \beta$ as compared to the traditional $\alpha, \beta$ parametrization.

The average variance is

$E[V(\theta_i)] = E[\theta_i] = \mu$

And the variance of the averages is

$Var(\theta_i) = \dfrac{\mu}{K}$

So the shrinkage coefficient $B$ is

$B = \dfrac{\mu}{\mu + \dfrac{n_i \mu}{K}} = \dfrac{1}{1 + \dfrac{n_i}{K}} = \dfrac{K}{K + n_i}$

Which gives a shrinkage estimator of

$\hat{\theta_i} = \mu + \left(1 - \dfrac{K}{K + n_i}\right)(\bar{x_i} - \mu)$

What Statistics Fit Into this Framework?


Any counting statistic that is constructed as a sum of the same basic events falls under framework. It's possible to combine multiple basic events into one "super" event, as long as they are considered to be equal. Examples of this include batting average, on-base percentage, earned run average, batting average on balls in play, fielding percentage, stolen base percentage, team win percentage, etc. It's possible to weight the sum, as long as you're just adding the same type of event to itself over and over.

Any statistic that is a sum, weighted or unweighted, of different events does not fall into this framework - examples include weighted on-base average, slugging percentage, on-base plus slugging percentage, fielding independent pitching, isolated power, etc. Also, any statistics that are ratios of counts -strikeout to walk ratio, for example - do not fall under this framework.

Statistics like wins above replacement are right out.

I want to make clear that this is simply a discussion of what statistics fit nominally into a very specific theoretical framework.  A statistic falling under the framework does not imply that a statistic is good, nor does not falling under it imply that a statistic is bad. Furthermore, even if a statistic does not fall under this framework, shrinkage estimation using these formulas may still work as a very good approximation - the best statistics in sabermetrics today are often weighted sums of counting events, and people have been using these shrinkage estimators on them successfully for years, so clearly they must be doing something right.. This is simply what I can justify using statistical theory.

Performing the Analysis


The values of  $\eta$ and $\mu$ must be chosen or estimated. If prior data exists - like, for example, historical baseball data - values can be chosen based upon a careful analysis of that information. If no prior data exists, one option is to estimate the parameters through either moment-based or marginal likelihood-based estimation, and then plug in those values - this method is known as parametric empirical Bayes. Another option is to place a hyperprior or hyperpriors on $\eta$ and $\mu$ and perform a full hierarchical Bayesian analysis, which will almost certainly involve MCMC. Depending on the form of your prior, your shrunk results will likely be similar to, but not equal to, the shrinkage estimators given here.

What if none of the NEFQVF models appear to fit your data? You have a few options, such as nonparametric or hierarchical Bayesian modeling, but any method is to get more difficult and more computational.

24 July, 2015

Normal-Normal Shrinkage Estimation by Empirical Bayes

Shrinkage estimation is a very common technique in baseball statistics. So is the normal model. It turns out that one of the one way to shrink is to assume that both the data you see and the distribution of means of the data you see are normal - and then estimate the distribution of means prior by use of the data itself.

 The (non-annotated) code I used to generate these results may be found on my github.

The Normal-Normal Model


The basic normal-normal model assumes two "stages" of data - the first is the observed data, which we will call $Y_i$ - this is assumed to be normally distributed with mean $\theta_i$ and variance $\sigma^2$.

$Y_i \sim N(\theta_i, \sigma^2)$
$\theta_i \sim N(\mu, \tau^2)$

Bayes' rule says that if you assume the "second" level - the distribution of the $\theta_i$ - is also normal, and treat it as a prior, the posterior distribution of $\theta_i$ (that is, the distribution in belief of values of $\theta_i$ after taking into account the data $y_i$ - see my previous post on Bayesian inference) is also normal with distribution

$\theta_i | y_i, \mu, \sigma^2 \sim N(B\mu + (1-B)y_i, (1-B)\sigma^2)$

where
$B = \left(\dfrac{\sigma^2}{\sigma^2 + \tau^2}\right)$

The quantity $B$ gives the amount that the mean of the data shrinks towards the mean of the prior. If $\sigma^2$ is large compared to $\tau^2$ (and so the data has a large amount of variance relative to the prior) then the mean of the data gets shrunk towards the prior mean by quite a bit. If $\sigma^2$ is small compared to $\tau^2$ (and so the data has a small amount of variance relative to the prior) then the mean of the data doesn't get shrunk much at all - it tends to stay near the observed $y_i$.

No matter what the prior is, this Bayesian estimator can be thought of as a shrinkage estimator. It's just a question of what you're shrinking to, and by how much. A Bayesian, if he or she wishes to be "noninformative" (and I'm going to completely ignore all the controversy of naming and choosing priors) might pick something like $\theta_i \sim N(0,1000)$ as a prior, so $B$ is very close to zero and the shrinkage is very small.

Empirical Bayes


What we're going to do, however, is focus on using the data to choose the prior distribution, by assuming the normal-normal model as described above and estimating the parameters of the prior from the data itself - this is known as empirical Bayes. The effect  of using the data to choose the prior is that the data is shrunk towards the mean of the data, in an amount determined by the variance(s) of the data.

How do we estimate $\mu$ and $\sigma^2$? One nice property of the normal-normal model is that the marginal distribution of $y_i$ is also normal:

$y_i | \sigma^2, \tau^2, \mu \sim N( \mu, \sigma^2 + \tau^2 )$

There are three quantities that need to be estimated to perform this - $\mu$, $\sigma^2$, and $\tau^2$. The formula above gives us two - the first, the population mean, can be estimated by the sample mean $\hat{\mu} = \bar{y}$. The variance is a bit trickier. If I just take the standard variance estimator of the $y_i$

$Var(Y) = \dfrac{\sum (y_i - \bar{y})^2}{N-1}$

then that gives an estimate $\hat{\sigma^2 + \tau^2}$ (the hat is over the entire thing - it's not estimating two individual variances and summing them, it's estimating the sum of two individual variances), assuming $\sigma^2$ is the same for every observation. So we're going to need to get some information from somewhere about what $\sigma^2$  is. If the $y_i$ are sums or averages of observations, we can use that. If not, we have to get creative.


Baseball Example


Let's say $Y_i$ is the distribution of a player's batting average in $n_i$ at-bats (the $Y_i$ are already divided by $n_i$, so they are averaged), and the player has true batting average $\theta_i$. We know that a true 0.300 hitter isn't going to always hit 0.300 - he will hit sometimes above, sometimes below. The model says that the player's observed batting average follows a normal distribution with true batting average $\theta_i$ and variance $\sigma^2/n_i$ (since it is an average) - this is the "first" normal distribution as described above.

The second stage is the underlying distribution of the $\theta_i$ - that is, the distribution of all players' true batting averages. It is normal with mean $\mu$ (the true mean league average) and variance $\tau^2$. So using this two-stage model is equivalent to saying that if I selected a random, unknown batter from the pool of all major league baseball players and observed his batting average $y_i$, it will follow a normal distribution with mean $\mu$ (the true league mean batting average) and variance $\sigma^2/n_i + \tau^2$ (the sum of the variation due naturally to a player's luck and the variation in batting averages between all players). I understand that this is a bit weird to think about, but imagine trying to describe the distribution of a player's observed batting average when we don't even know his name - you have to figure his true average is somewhere between 0.200 and 0.330, with an mean of around 0.265, and then on top of that add in the average amount of natural variation around his true average (sometimes above, sometimes below) at $n_i$ at-bats. That's what's going on here.

In baseball terms, $\sigma^2/n_i$ can be thought of as the "within-player" variance or "luck" and $\tau^2$ can be thought of as the "between-player" variance or "talent." If a batter is very consistent in hitting near his true ability and the distribution of all batting averages is very spread out, then not much shrinkage will occur. Conversely, if a batter has a lot of variation in his observed batting average but there's not much apparent variation in the distribution of all batting averages, then the player will be shrunk heavily towards the league mean.

We need three estimates in order to do the procedure - an estimate of each of $\mu$, $\sigma^2/n_i$, and $\tau^2$. The first, $\mu$, is the mean batting average of the league - and since $y_i$ is the batting average for player i, the estimate of the league mean batting average is the obvious one - $\hat{\mu} = \bar{y}$.

In the case that $n_i$ is the same for all of your observations, taking $Var(y_i)$ gives us $\hat{\sigma^2/n_i + \tau^2}$. If there are differing $n_i$, the estimation method gets more complex - it's worth its own post at some point to work through it. I'm going to assume that all the players have the same number of at-bats to keep things simple for this example, though it admittedly does make it feel rather artificial.

Now we need an estimate of $\sigma^2/n_i$ - the "within-player" variance. But the normal distribution doesn't tell us what $\sigma^2/n_i$ should be. It's pretty typical to model a player's hits in $n_i$ at-bats as following a binomial distribution with true batting average $\theta_i$. Then the sample batting average (hits over at-bats) has variance

$Var(\textrm{Sample Batting Average}) = \dfrac{\theta_i(1-\theta_i)}{n_i}$

The binomial distribution is just a sum of independent, identical Bernoulli trials (at-bats, in this case) each with probability of getting a hit $\theta$. So the central limit theorem says that for large $n_i$, we can approximate the distribution of the sample batting average with a normal!


$\textrm{Sample Batting Average} \sim N\left(\theta_i, \dfrac{\theta_i(1-\theta_i)}{n_i}\right)$


A value for $\theta_i$ is needed. It feels natural to use $y_i$ - the player's batting average - in the estimation. This is wrong, however - remember, we don't know how the player's true talent level! We need to shrink by the average variance amount. The average variance amount is estimated by the variance at the league batting average - which is a quantity we also have an estimate for. The estimate of the within-player variance is then

$\dfrac{\hat{\sigma^2}}{n_i} \approx \dfrac{\bar{y}(1-\bar{y})}{n_i}$

 Then the empirical Bayes estimator is given by

$\hat{\theta_i} = \hat{B} \bar{y} + (1-\hat{B})y_i$

where

 $\hat{B} = \dfrac{\bar{y}(1-\bar{y})/n_i}{\sum (y_i - \bar{y})^2/(N-1)}$

Comparison


Using the famous Morris data set to compare these estimators (I'll call them $\hat{\theta}^{NN}$ for normal-normal) to the shrinkage estimators from the Beta-Binomial (see post here - I'll call these $\hat{\theta}^{BB}$) and James-Stein (see post here - and note that the James-Stein estimator is just a specific version of the normal-normal estimator - I'll call them $\hat{\theta}^{JS}$), we see that it performs well.

\begin{array}{l c c c c c} \hline
\textrm{Player} & y_i & \hat{\theta}^{NN} & \hat{\theta}^{BB} & \hat{\theta}^{JS} & \theta \\ \hline
Clemente        &0.400             &0.280             &0.280            &0.290 &0.346 \\
F. Robinson        &0.378             &0.277             &0.278            &0.286 &0.298 \\
F. Howard        &0.356             &0.275             &0.275            &0.282 &0.276 \\
Johnstone        &0.333             &0.273            &0.273            &0.277 &0.222\\
Barry        &0.311             &0.270             &0.270            &0.273 &0.273\\
Spencer        &0.311             &0.270             &0.270            &0.273 &0.270\\
Kessinger        &0.289             &0.268             &0.268            &0.268 &0.263\\
L.Alvarado        &0.267             &0.266             &0.266            &0.264 &0.210\\
Santo        &0.244             &0.263             &0.263            &0.259 &0.269\\
Swoboda       &0.244             &0.263             &0.263            &0.259 &0.230\\
Unser       &0.222             &0.261             &0.261            &0.254 &0.264\\
Williams       &0.222             &0.261             &0.261            &0.254 &0.256\\
Scott      &0.222             &0.261             &0.261            &0.254 &0.303\\
Petrocelli &0.222             &0.261             &0.261            &0.254 &0.264\\
E. Rodriguez &0.222             &0.261             &0.261            &0.254 &0.226\\
Campaneris       &0.200             &0.258             &0.258            &0.249 &0.285\\
Munson &0.178             &0.256             &0.256            &0.244 &0.316\\
Alvis       &0.156             &0.254             &0.253            &0.239 &0.200\\ \hline
\end{array}

For this data set, the normal-normal and beta-binomial estimates are almost identical. This shouldn't be a surprise - both the distribution of batting average talent and variation around batting averages is roughly bell-shaped and symmetric, so the normal-normal model and the beta-binomial models are both flexible enough to take that shape.

The normal-normal and beta-binomial estimators shrinks the most while the James-Stein shrinks a moderate amount. For this specific data set, the James-Stein estimator seems to hold a slight-advantage - not by much, though.

  • $\sum (\hat{\theta}^{NN}_i - \theta_i)^2 = 0.0218$
  • $\sum (\hat{\theta}^{BB}_i - \theta_i)^2 = 0.0218$
  • $\sum (\hat{\theta}^{JS}_i - \theta_i)^2 = 0.0215$
  • $\sum (y_i - \theta_i)^2 = 0.0753$.

Whatever the method you use to shrink, estimates are produced that, when judged using the squared error loss function, are far superior to using the raw batting averages.

Advantages/Disadvantages

This estimator relies on the assumption of normality for both the data and underlying distribution of means - this means it will work well for batting statistics (which tend to be constructed as sums, weighted or otherwise, of assumed independent, identical events) but not as well for other statistics which don't naturally look "normal." Furthermore, if I have to estimate $\sigma^2$ with a binomial variance -why don't I just use a beta-binomial model? That doesn't depend on a large number of at-bats for normality of the distribution of the sample batting average. Overall, I think it will give nice results when used appropriately, but in many situations a different model will fit more naturally to the data.

16 July, 2015

Bayes' Rule

(This post acts as sort of a prequel to my post on Bayesian inference)

I've talked enough about it, so I figured I would make an actual post on a probability rule that I've been using quite a bit - Bayes' rule.

Purpose


I want to jump straight into a probability example first. Much like the infield fly rule or the offside rule in soccer, I think Bayes' rule makes much more sense when you understand what it's trying to do before learning the technical details.


Suppose a manager has exactly two pinch hitters available to him - let's call them Adam and José.  Adam has a $0.350$ OBP and José has a $0.300$ OBP. The manager calls Adam 70% of the time and calls José 30% of the time.

So without knowing who the manager will call, how do we calculate what the probability is that the pinch hitter will get on-base? Like so:

$P(\textrm{ On-Base }) = P(\textrm{ On-Base } | \textrm{ Adam })P(\textrm{ Adam })+P(\textrm{ On-Base } | \textrm{ José })P(\textrm{ José })$

The notation $P(\textrm{ On-Base }|\textrm{ Adam })$ means the probability of getting on-base "given" that Adam was chosen to pinch hit - that is, if we knew that the manager selected Adam, there would be a $0.350$ probability of getting on-base. As stated above, $P(\textrm{ Adam })$ - the probability the manager selects player Adam - is $0.7$. Similarly, $P(\textrm{ On-Base }|\textrm{ José }) = 0.300$ and $P(\textrm{ José }) =0.3$. Plugging numbers into the formula above,

$P(\textrm{ On-Base }) = (0.350)(0.7) + (0.300)(0.3) = 0.335$ 

The OBP is $0.350$ with probability $0.7$, and $0.300$ with probability $0.3$, so overall there is a $0.335$ probability that the pinch hitter will get on-base.

Now, let's flip what you know around - suppose you know that the pinch hitter got on base, but not which pinch hitter it was. Which player do you think was picked, Adam or José? Logically it's more likely to be Adam - but how much more likely? Can you give probabilities?

Bayes' Rule


This is the basic idea of Bayes' rule - it flips conditional probabilities around. Instead of $P(\textrm{ On-Base } | \textrm{ Adam })$, it allows you to find find $P(\textrm{ Adam } | \textrm{ On-Base })$. For two events $A$ and $B$, the basic formulation is

$P(B | A) = \dfrac{P(A|B)P(B)}{P(A)}$

$P(A)$ on the bottom can be calculated as

$P(A) = P(A | B)P(B) + P(A | \textrm{ Not B }) P( \textrm{ Not B })$ 

Note that this is why above I specified that there were exactly two pinch hitters, so saying that "Not Adam" is the same thing as saying "José".  If there are more than two pinch hitters available, the formula above can be expanded.

Applying this to the batting averages, we have

$P(\textrm{ Adam }| \textrm{ On-Base }) = \dfrac{P(\textrm{ On-Base }|\textrm{ Adam })P(\textrm{ Adam })}{P(\textrm{ On-Base })}$

$ =\dfrac{P(\textrm{ On-Base }|\textrm{ Adam })P(\textrm{ Adam })}{ P(\textrm{ On-Base } | \textrm{ Adam })P(\textrm{ Adam })+P(\textrm{ On-Base } | \textrm{ José })P(\textrm{ José })} $

Plugging in numbers, we get

$P(\textrm{ Adam }| \textrm{ On-Base }) = \dfrac{(0.350)(0.7)}{0.335} \approx 0.731$

And similarly,

$P(\textrm{ José }| \textrm{ On-Base }) = \dfrac{(0.300)(0.3)}{0.335} \approx 0.269$

So, given that the pinch hitter got on base, there was approximately  a 73.1% chance it was Adam and approximately a 26.9% chance it was Jose.

One More Example


Adam tests positive for PED use. Let's suppose 10% of all MLB players are using PEDs. The particular test Adam took has 95% specificity and sensitivity - that is, if the player is using PEDs it will correctly identify so 95% of the time, and if the player is not using PEDs it will correctly identify so 95% of the time. Given a positive test, what is the probability that Adam is actually using PEDs? It's not 95%! We have to use Bayes' rule to figure it out.

I'm going to use "+" to indicate a positive test (indicating the test says that the player is using drugs) and a "-" to indicate a negative test.

$P(\textrm{ PEDs } | +) = \dfrac{P( + | \textrm{ PEDs })P(\textrm{ PEDs })}{P( + | \textrm{ PEDs })P(\textrm{ PEDs })+P( + | \textrm{ Not PEDs })P(\textrm{ Not PEDs }) }$

Let's figure these out one by one. As stated in the problem description, if a person is using PEDs, the test will identify so 95% of the time. Hence, $P( + | \textrm{ PEDs }) = 0.95$. Furthermore, 10% of all players are using PEDs, so $P(\textrm{ PEDs }) = 0.1$ and $P(\textrm{ Not PEDs }) = 0.9$. Lastly, since $P( - | \textrm{ Not PEDs }) = 0.95$ as stated in the problem description, it must be that $P( + | \textrm{ Not PEDs }) = 0.05$.

Plugging all these numbers in, we get

 $P(\textrm{ PEDs } | +) = \dfrac{(0.95)(0.1)}{(0.95)(0.1) + (0.05)(0.9)} \approx 0.678$

So given that Adam tests positive for PEDs, there's actually only about a 2/3 chance that  he's using. It seems counter-intuitive given that the tests were pretty good - 95% sensitivity and specificity - but since most players aren't using (90%), there's bound to be a lot of false positives, making it so that Adam has a very, very good argument if Adam gets suspended over this particular test (vindicated).

Put another way - suppose you have 200 MLB players. 180 (90% of the total) are clean, and 20 (10% of the total) are using PEDs. Of the 180 that are clean, 171 (95% of the clean) test negative and 9 (5% of the clean) test positive. Of those that are using, 19 (95% of the PED users) test positive and 1 (5% of the PED users) tests negative.

This gives 19 PED users testing positive and 9 clean players testing positive, so the probability of being a PED user given testing positive is 19/(9+19) = 0.678.

(Note that I made all these numbers up. I'm sure that the tests MLB actually uses have higher specificity and sensitivity than 95%, and I have no idea what proportion of all MLB players are using PEDs)

From Bayes' Rule to Inference


So how do we go from the rule to inference? Given some sort of model with parameter $\theta$, we can calculate $p(x | \theta)$ -  the probability of seeing the data that you saw given a particular value of the parameter. You may recognize this as the likelihood from earlier posts. Bayesians use Bayes' rule to flip around what's inside the probability statement and calculate $p(\theta | x)$  - the probability of a particular value of the parameter given the data that you saw - by

$p(\theta | x) = \dfrac{p(x | \theta)p(\theta)}{\int p(x | \theta)p(\theta) d\theta}$

where $p(\theta)$ is the prior distribution chosen by the Bayesian and $p(\theta | x)$ is the posterior distribution that is calculated. Inference about $\theta$ is then performed using the posterior.

That's Bayesian inference in a nutshell - start with a model, calculate the probability of seeing the data $x$ given a parameter $\theta$, and then use Bayes' rule to flip that around to the probability of  the parameter $\theta$ given that you saw the data $x$.

(Oh, and then do checking to make sure your model fits - but that's another post)

15 July, 2015

2015 Win Total Predictions (At All-Star Game)

These predictions are based on my own "secret sauce" method, which I definitely think can be improved. I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), but based on testing the actual coverage might be at around 93%.

Intervals are inclusive.

\begin{array} {c c c} 
Team  & Lower  & Upper \\ \hline

        ATL  &  66  &   86 \\
        ARI  & 69   &  90 \\
        BAL  &  71  &   92 \\
        BOS    &66   &  86 \\
        CHC   & 75   &  96 \\
        CHW &   64  &   84 \\
        CIN  &  64   &  84 \\
        CLE   & 68   &  89 \\
        COL  &  62   &  82 \\
        DET  &  71   &  91 \\
        HOU  &  78   &  97 \\
        KCR   & 84  &  105 \\
        LAA   & 76  &   97 \\
        LAD   & 81  &  101 \\
        MIA &   63   &  83 \\
        MIL  &  61   &  82 \\
        MIN  &  76   &  96 \\
        NYM  &  73  &   94 \\
        NYY   & 76  &   97 \\
        OAK &   69  &   88 \\
        PHI  &  45   &  64 \\
        PIT  &  84  &  104 \\
        SDP  &  64 &    83 \\
        SEA  &  65 &    85 \\
        SFG  &  74  &   94 \\
        STL  &  89  &  109 \\
        TBR  &  73  &   93 \\
        TEX  &  67  &   87 \\
        TOR  &  75 &    94 \\
        WSN  &  77  &   98 \\ \hline\end{array}

Interesting features: my model thinks that the St. Louis Cardinals are the best team in baseball, and that the Phillies are the worse. I will add that the model gives a Phillies a true winning percentage of around 36%, (roughly 58 games out of 162), but they've been both bad and unlucky, and so are likely to finish with less than that.

Also note that even at this point in the season, it still can't predict whether all but four teams (the Cardinals, Pirates, Royals, and Phillies) will finish above 0.500 or not.

10 July, 2015

A Beta-Binomial Derivation of the 37-37 Shrinkage Rule

A rule of thumb is that to estimate a team's "true" ability $\theta_i$, you should add 74 games of 0.500 ball - that is,

$\hat{\theta_i} = \dfrac{w_i + 37}{n_i + 74}$

where $\hat{\theta_i}$ is the estimate of team $i$'s true winning proportion, $w_i$ is the number of wins of team $i$, and $n_i$ is the number of games team $i$ has played so far. Notice that the number of games stays the same no matter what $n_i$ is - if the team has played a full season ($n_i = 162$), shrink by 74 games of 0.500 ball. If the team has only played 10 games ($n_i = 10$), shrink by 74 games of 0.500 ball.

In this post I'm going to derive a very similar result as the posterior expectation of a binomial model with a beta prior, and try to give a less mathematical explanation of why the rule works no matter what $n_i$ is.

The code I used to generate the images in this post may be found on my github.

The Beta-Binomial Model


First off, let's assume that the number of wins $w_i$ follows a binomial distribution with number of games played $n_i$ and true winning proportion $\theta_i$.

$w_i \sim Bin(n_i, \theta_i)$

Furthermore, let's assume that the winning percentages themselves follow a beta distribution. Traditionally the beta distribution has parameters $\alpha$ and $\beta$, but I'm going to use the parametrization $\mu = \alpha/(\alpha + \beta)$ and $M = \alpha + \beta$. This makes $\mu$ the mean of the $\theta_i$ and $M$ control the variation - how spread out the $\theta_i$ are.

The reason I'm doing this is that we know what $\mu$ is - mathematically, we must have $\mu = 0.5$. Why? Because in a system like baseball, every win by one team represents a loss by another team. The wins and losses cancel out, the scales remain balanced, and the average $\theta_i$ must be equal to 0.5.

$\theta_i \sim Beta(0.5, M)$

If we knew $M$, we could just apply Bayes' rule to obtain the posterior distribution of the $\theta_i$ - but there's no intuitive value for it like there is for $\mu$. Thankfully, we have a way to get $M$.

Estimating M


Often, rather than working with the observed win totals $w_i$, people work with the observed win proportion $w_i/n_i$ (and in fact, we're going to use some data from that form in a bit). In a two-level model like this, we can calculate the variance of the observed win proportions as


$Var\left(\dfrac{w_i}{n_i}\right)  = E\left[Var\left(\dfrac{w_i}{n_i} \biggr |  \theta_i\right)\right] + Var\left(E\left[\dfrac{w_i}{n_i} \biggr | \theta_i\right]\right)$

The first part - $Var\left(w_i/n_i\right)$ - is the variance of the observed win proportions. I'm going to call this the total variance.

The second part -  $E\left[Var\left(w_i/n_i| \theta_i\right)\right]$ - is the average amount of variance of a team's observed winning proportion around its true winning proportion $\theta_i$. I'm going to call this the within-team variance (this is what others have referred to this as "luck").

This can be calculated as

$E\left[Var\left(\dfrac{w_i}{n_i} \biggr | \theta_i\right)\right] = E\left[\dfrac{\theta_i(1-\theta_i)}{n_i}\right]  = \dfrac{1}{n_i}(E[\theta_i(1-\theta_i)])$

$ = \left(\dfrac{1}{n_i}\right)\left(0.5(1-0.5) \right)\left(\dfrac{M}{M+1}\right) = \dfrac{0.25M}{n_i(M+1)}$

I'm skipping a bit of gory math here - $E[\theta_i(1-\theta_i)]$ can be found by noting that multiplying the $Beta(0.5, M)$ density by $\theta_i(1-\theta_i)$ produces the kernel of another beta density.

The last part - $Var(E[w_i/n_i | \theta_i])$ - is the variance of the $\theta_i$ themselves. It represent the natural variance of true winning proportions among all teams. I'm going to call this the between-team variance (this is what others have referred to as "talent").

This can be calculated as

$Var\left(E\left[\dfrac{w_i }{n_i} \biggr | \theta_i\right]\right) = Var(\theta_i) = \dfrac{0.5(1-0.5)}{M+1} = \dfrac{0.25}{M+1}$

Hence, the total variation in observed winning proportion is

$Var\left(\dfrac{w_i}{n_i}\right) = \dfrac{0.25M}{n_i(M+1)} +\dfrac{0.25}{M+1}$

Based on historical data, sports analyst Tom Tango suggests that the correct value of $Var(w_i/n_i)$ for teams that have played at least 160 games is $Var(w_i/n_i) =  0.07^2$. I'll trust him that this is accurate.
 
Notice from the formula above that the $Var(w_i/n_i)$ value is linked to the number of games used to estimate it - it's within-team variance plus between-team variance, and within-team variance is shrinking while the between-team variance is constant. This is why it's important to use a point estimate based off observations with the same number of games - the $n_i$ is constant in the formula above, and it becomes a function solely of $M$.

What we're going to do is assume that $n_i = 162$ for all the teams in the $Var(w_i/n_i) = 0.07^2$ value. This isn't technically true, but it's true for most of them, and the difference in within-team variation between a 160 win team and a 163 win team is very small, so we can safely ignore it. Then using Tom Tango's value, this sets up the equation

$0.07^2 = \dfrac{0.25M}{162 (M+1)} +\dfrac{0.25}{M+1}$ 

Doing a bit of algebra yields the value of M

$M = \dfrac{0.25*162-0.07^2*162}{0.07^2*162-0.25} = 73.01618$

Which is close enough that $M = 73$ can be used as the variance parameter for the distribution of the $\theta_i$. This is only one game smaller than the value of $74$ used in the rule of thumb, and the difference probably derives from different distributional choices used when calculating the between-team variance.

As a side note, the variance of observed win proportions in $n_i$ games is given by 

$Var\left(\dfrac{w_i}{n_i}\right) = \dfrac{0.25(73)}{n_i(74)} +\dfrac{0.25}{74}$

which implies that at $n_i = M = 73$ games, the within-team variance (luck) is equal to the between-team variance (talent).

 

Bayesian Estimator


Now that we have $M$, we can use treat the $Beta(0.5, 73)$ distribution as a the prior distribution for  $\theta_i$ as a prior and use Bayes' rule to get the posterior distribution for the $\theta_i$

$\theta_i  | w_i \sim Beta(w_i + 0.5*73, n_i - w_i + 0.5*73)$

(Here I'm using the traditional $\alpha$ and $\beta$ parametrization to define the beta distribution above)

And if we use $\hat{\theta_i} = E[\theta_i | w_i]$, that gives us the famous

$\hat{\theta} = \dfrac{w_i + 0.5*73}{w_i + 0.5*73 + n_i - w_i + 0.5*73} = \dfrac{w_i + 36.5}{n_i + 73}$

I want to add that this is not the only possible estimator that can be derived from the posterior - you could take the posterior mode rather than the mean, and calculate

$\hat{\theta_i} = \dfrac{y_i + 36.5 -1}{n_i + 73 - 2} = \dfrac{y_i + 35.5}{n_i + 71}$

That is, add 71 games of 0.500 ball to the team in order to shrink it, and it would still be a statistically justified estimator (since the beta posterior starts off bell-shaped and symmetric, however, the mean and the mode will remain very close - so these two estimates should closely coincide)

You could also use the posterior to calculate a credible interval for $\theta_i$ by taking quantiles from the posterior distribution - see my previous post on credible intervals for doing this in the beta-binomial situation. For example, if you have a team that has won $w_i = 6$ out of $n_i = 10$ games (for an observed 0.60 winning proportion), a 95% interval estimate for $\theta_i$ is given as $(0.405, 0.618)$. Similarly, if you have a team that has won $w_i = 96$ out of $n_i = 160$ games (again, for an observed 0.60 winning proportion), a 95% interval estimate for $\theta_i$ is given as $(0.505, 0.632)$.



Above is the posterior distribution for $\theta_i$ for the 6-4 team. The solid vertical lines represent the boundaries of the 95% credible interval and the dashed line represents $\hat{\theta_i}$.

Why Does this Happen?


And by that, I mean - why does do you add the same approximately 74 "shrinkage" games, no matter what the actual number of games played is?

As I write this post, the major league baseball season is currently underway. I'm going to ask you to estimate $\theta_i,$ the true winning proportion of team $i$. That doesn't sound tough, right? First, you'll want to know what team I'm thinking of.

Here's the thing, though: I'm not going to tell you which team I'm thinking of.

So how in the world can you guess how good a team is without knowing anything about it? Use what you know about baseball! How good is the average team? The average is 0.500, right? So let's start there. Your guess for $\theta_i$ is 0.500.

Okay, so now I want you to think of how much variation there is for $\theta_i$. What range do you think $\theta_i$ could possible be in? I'm guessing most people would agree with me that most teams are 60 to 100 win teams over the course of the season.

That sounds like a pretty reasonable estimate - and I'll add that there's a better chance of being near 81 than 60 or 100. This corresponds to a winning proportion of between approximately 0.37 and 0.617.

I'm not going to show the math, but this range can and should be adjusted a little bit based on historical data. When we calculate the correct range, it's the same as the range of observed win proportions you would expect a 0.500 team that has played 74 (or 73, by my calculation) games to be in. Check it- the within-team standard deviation (luck) is $\sqrt{0.5(1-0.5)/74} = 0.058$, and so two standard deviations below and above 0.500 gives a range of (0.384, 0.616), or roughly between 62 and 100 wins over a full season.

Okay, so now you've used your baseball knowledge to determine that without knowing anything about the team, you can guess how good it is by assuming it has gone 37-37. Now, let's get some information about the identity of team $i$. I'll start by telling you that the team went 0-2 in its first two games. Now try to estimate $\theta_i$.

The raw point estimate is $\hat{\theta_i} = 0/2 = 0$. But you don't really believe that the team has a true winning proportion of 0, do you? That would mean it never wins a single game the entire season. No team has ever won zero games before, or even come close. And you just told me that the most teams finish with between 60 and 100 wins!

But neither should you throw away the 0-2 information either -that's information as it is now, rather than about the hypothetical average team. What you want to do is mix your guess with your current information. Before I told you the name of the team, your guess represented a team that was 37-37. Now that you know the team is 0-2 - just add those to the what you thought before.

Your new estimate for the team's true winning percentage is $\hat{\theta_i} = (0+37)/(2+74) = 0.486$. You've shrunk a little bit towards zero to represent the new information, but not by much - two games is very little additional knowledge. So little, in fact, that you're better off leaning heavily on the 37-37 you figured before you even knew what the team was.

Now how about if I told you the team played 100 games and went 60-40? Great. We do the same thing - mix our guess about the average team with the new information we had before, by taking $\hat{\theta_i} = (60+37)/(100+74) = 0.557$.

As the number of games the team has played increases, the information you had about the hypothetical average team stays constant - but the information you have about the team current team grows. So adding 37-37 to whatever the current record is now means that the current information is slowly overtaking your guess - which is exactly what you want. If the number of games your hypothetical average team played was increasing, then the amount of information being mixed together from what you guessed and from what you observed would be increasing at the same rate - and so your current information would never overtake your guess! That's not what you want - so keeping 37-37 constant is the correct thing to do.

06 July, 2015

Bayesian Credible Intervals for a Batting Average

(This post is a followup to my previous post on bayesian inference)

I've posted a fair amount about confidence intervals for various quantities. All of the ones I've posted so far - central limit based theorem intervals, Wald theory intervals, and likelihood intervals - are based on a frequentist understanding of probability - that is to say, probability is defined as the limiting result of the proportion of times an event (say, a hit) happens as the number of trials (say, at-bats) goes to infinity.

The term "95% confidence" refers to the construction of the interval itself - that is, if we were to calculate a 95% confidence interval for the true batting average $\theta$ for each of our millions and millions of trials, then 95% of them will contain the true batting average $\theta$.

Statisticians tend to avoid making probability statements about confidence intervals. The statement that "There is a 95% chance that $\theta$ is in the interval." is incorrect because $\theta$ is conceptualized as a fixed quantity. Furthermore, the statement "There is a 95% chance that the interval contains $\theta$." is awkward because once an interval has been calculated, there's no more randomness anymore - it either contains $\theta$, or it doesn't. This is why statisticians prefer to use confidence rather than probability to describe intervals.

But what if you are working in a Bayesian framework? The end result of a Bayesian analysis is a distribution $p(\theta | x)$ that represents the distribution of believe in $\theta$ after the data has been accounted for - so it makes perfect sense to write, for example, $P(0.250 \le \theta \le 0.300)$.

All the code used to generate the images in this article may be found on my github.

Credible Intervals


Instead of confidence intervals, Bayesian statisticians instead calculate  "Credible" intervals - these are intervals from the distribution of $p(\theta | x)$ that contain the desired amount of probability. Say, for example, that $P(0.250 \le \theta \le 0.300) = 0.95$ - then the interval $(0.250, 0.300)$ would be a 95% credible interval for $\theta$.

The main issue with this method is that there is more than one way to get a 95% credible interval given $p(\theta | x)$ - technically, any interval $(L, U)$ with $P(L \le \theta \le U) = 0.95$ is a valid 95% credible interval. Statisticians have several ways to determine which one to use, but I'm going to show you one that can be easily done with most computer software.

Baseball Example


In the previous post, I explained how to use the beta-binomial model to get a posterior distribution for a batting average using a beta-binomial model. Let's take observer A, who for the batter with $15$ hits in $n = 50$ at-bats had a prior distribution of belief given by a beta distribution with parameters $\alpha = 1$ and $\beta = 1$



And a posterior distribution of belief given by a beta distribution with parameters $\alpha' = 16$ and $\beta' = 36$


To get a credible interval, we can take quantiles from the beta distribution. A quantile is a value $Q_{p}$ for a distribution so that $P(X \le Q_{p}) = p$. To get a 95% credible interval, we can take $Q_{0.025}$ as the lower boundary of the interval and $Q_{0.975}$ as the upper boundary of the interval (since 97.5% - 2.5% = 95%) so that the interval contains the middle 95% of the probability.

Since these do not have nice formulas to calculate by hand, it's easiest to use computer software to get the - any good statistical software should be able to give quantiles for common distributions. In the program $R$, the command to do this is

                    > qbeta(c(.025,.975),16,36)
                    [1] 0.1911040 0.4382887


So for observer A, a 95% credible interval for $\theta$ is given by $(0.191, 0.438)$. With quantiles, the posterior belief for observer A looks like

The area under the curve between the two vertical lines is 0.95 - and so the values of the vertical lines give the 95% credible interval.

What about observer B? Observer B used a beta distribution as their prior with $\alpha = 53$ and $\beta = 147$, for a posterior distribution that is beta with $\alpha' = 68$ and $\beta' = 182$. Quantiles from observer B's posterior distribution are

                    > qbeta(c(.025,.975),68,182)
                    [1] 0.2187438 0.3287111


So observer B's 95% credible interval for $\theta$ is $(0.219, 0.329)$. Note that observer B's interval is much more realistic - as baseball fans, we know that a $\theta = 0.400$ batting average is very, very unlikely - so good prior information can lead to improved inference.