## 30 July, 2015

### Shrinkage Estimators for Counting Statistics

Edit 19 March 2020: This post has been adapted into a paper in the Journal of Mathematical Psychology. In the process of writing the paper, a number of mistakes, omissions, or misstatements were found in this post. It is being left up as it was originally written, just in case anybody is interested. For a more correct version, please refer to the journal article.

Warning: this post is going to be incredibly technical, even by the standards of this blog.  If what I normally post is gory math, this is the running of the bulls. I'm making it so I can refer back to it when I need to.

The goal is to set up the theoretical framework for shrinkage estimation of normalized counting statistics to some common mean. I will fully admit this is a very, very limited framework, but some of the most basic baseball statistics fit into it. In the future I hope I can possibly expand this to include more advanced statistics.

I will give (not show) a few purely theoretical results - for proofs, see Natural Exponential Families with Quadratic Variance Functions by Carl Morris in The Annals of Statistics, Vol. 11, No. 2 (1983), 515-529, or the more updated version of that paper.

## Theoretical Framework

Let's say I have some metric $X_i$ for player, team, or object $i$. In this framework, $X_i$ represents a count or a sum of some kind - the raw number of hits, or the raw number of earned runs, etc. I know that $X_i$ is the result of a random process that is controlled by a probability distribution with parameter $\theta_i$, which is unique to each player, team, or object - in baseball, for example, $\theta_i$ represents the player's true "talent" level with respect to metric $X_i$.

$X_i \sim p(x_i | \theta_i)$

I have to assume that the talent levels $\theta_i$ are exchangeable - roughly, that their joint distribution is unchanged by reordering the players - though the full definition is a bit too much to go into here.

I'm going to assume that $p(x_i | \theta_i)$ is a member of the natural exponential family with a quadratic variance function (NEFQVF) - this includes very common distributions such as the normal, binomial, Poisson, gamma, and negative binomial.

Each of these can be written as the convolution (sum) of $n_i$ other independent, identical distributions, each of which is also NEFQVF with mean $\theta_i$ - the normal is the sum of normals, the binomial is the sum of Bernoullis, the Poisson is the sum of Poissons, the negative binomial is the sum of geometrics, etc. I will assume that is the case here - that

$X_i = \displaystyle \sum_{j = 1}^{n_i} Y_{ij}$

Translating this to baseball terms, this means that $Y_{ij}$ is the outcome of inning, plate appearance, etc., $j$ for player $i$ ($j$ ranges from 1 to $n_i$). The metric $X_i$ is then the sum of $n_i$ of these outcomes. Each outcome is assumed independent and identical. Once again, $X_i$ is not normalized by dividing by $n_i$.

Conditional on having mean $\theta_i$, the expectations of the $Y_{ij}$ are

$E[Y_{ij} | \theta_i] = \theta_i$

And so conditional on having mean $\theta_i$, the expected value of the $X_i$ are

$E[X_i | \theta_i] = E\left[\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right] = \displaystyle \sum_{j = 1}^{n_i} E\left[ Y_{ij} \biggr | \theta_i \right] = n_i E[Y_{ij}| \theta_i] = n_i \theta_i$

Baseball terms: if a player has, for example, on-base percentage $\theta_i$, then the number of on-base events I expect in $n_i$ plate appearances is $n_i \theta_i$. This does not have to be a whole number.

Similarly, and again conditional on mean $\theta_i$, the independence assumption allows us to write the variance of the $X_i$ as

$Var(X_i | \theta_i) = Var\left(\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right) = \displaystyle \sum_{j = 1}^{n_i} Var\left( Y_{ij} \biggr | \theta_i \right) = n_i Var(Y_{ij}| \theta_i) = n_i V(\theta_i)$

I'm going to repeat that last bit of notation again, because it's important:

$Var(Y_{ij}| \theta_i) =V(\theta_i)$

$V(\theta_i)$ is the variance of the outcome at the most basic level - plate appearance, inning, batter faced, etc.  - conditional on having mean $\theta_i$. For NEFQVF distributions, this has a very particular form -  the variance can be written as a polynomial function of the mean $\theta_i$ up to degree 2 (this is the "Quadratic Variance Function" part of NEFQVF):

$Var(Y_{ij} | \theta_i) = V(\theta_i) = c_0 + c_1 \theta_i + c_2 \theta_i^2$

For example, the normal distribution has $V(\theta_i) = \sigma^2$, so it fits the QVF model with $c_0 = \sigma^2$ and $c_1 = c_2 = 0$. For the Binomial distribution, $V(\theta_i) = \theta_i (1-\theta_i) = \theta_i - \theta_i^2$, so it fits the QVF model with $c_0 = 0, c_1 = 1$, and $c_2 = -1$. The Poisson distribution has $V(\theta_i) = \theta_i$, so it fits the QVF model with $c_0 = c_2 = 0$ and $c_1 = 1$.
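These variance functions are easy to check by simulation. The sketch below verifies the Bernoulli (the binomial's building block) and Poisson cases, with an arbitrary $\theta_i = 0.3$ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3  # an arbitrary true mean, chosen for illustration

# Bernoulli: V(theta) = theta - theta^2 = 0.3 - 0.09 = 0.21
y_bern = rng.binomial(1, theta, size=1_000_000)
print(y_bern.var())   # should be close to 0.21

# Poisson: V(theta) = theta = 0.3
y_pois = rng.poisson(theta, size=1_000_000)
print(y_pois.var())   # should be close to 0.3
```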

I'm now going to assume that the talent levels $\theta_i$ themselves follow some distribution $G(\theta_i | \mu, \eta)$. The parameter $\mu$ is the expected value of  the $\theta_i$ ($E[\theta_i] = \mu$), and it represents the league average talent level. The parameter $\eta$ controls, but is not necessarily equal to, the variance of $\theta_i$ (how spread out the talent levels are). Both are assumed to be known. The two-stage model is then

$X_i \sim p(x_i | \theta_i)$
$\theta_i \sim G(\theta_i | \mu, \eta)$

The unconditional expectation of the $X_i$ is

$E[X_i] = E[E[X_i | \theta_i]] = E[n_i \theta_i] = n_i \mu$

And the unconditional variance of $X_i$ is

$Var(X_i) = E[Var(X_i | \theta_i)] + Var(E[X_i | \theta_i]) = n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i)$

In the above formula, the quantity $E[V(\theta_i)]$ is the average variance of the outcome at the most basic level (plate appearance, inning, etc.), averaging over all possible talent levels $\theta_i$. The quantity $Var(\theta_i)$ is the variance of the talent levels themselves - how spread out talent is in the league.

To this point I haven't normalized the $X_i$ by dividing each by $n_i$ - let's do that. If I define $\bar{X_i} = X_i/n_i,$ then based on the formulas above

$E[\bar{X_i} | \theta_i] = E\left[\dfrac{X_i}{n_i} \biggr | \theta_i \right] = \dfrac{1}{n_i} E[X_i | \theta_i] = \dfrac{n_i \theta_i}{n_i} = \theta_i$

And the unconditional variance is

$Var(\bar{X_i}) = Var\left(\dfrac{X_i}{n_i}\right) = \dfrac{1}{n_i^2} Var(X_i) = \dfrac{n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i)}{n_i^2} = \dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)$
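This variance decomposition can be checked numerically. The sketch below simulates from a beta-binomial setup (the same model worked through later in the post), with made-up values of $\mu$, $M$, and $n_i$, and uses the prior moments $E[\theta_i(1-\theta_i)] = \mu(1-\mu)M/(M+1)$ and $Var(\theta_i) = \mu(1-\mu)/(M+1)$ derived in that section:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, M, n = 0.3, 100.0, 50      # made-up league mean, spread parameter, sample size

# Draw talent levels from a Beta(mu*M, (1-mu)*M) prior, then normalized outcomes
theta = rng.beta(mu * M, (1 - mu) * M, size=500_000)
xbar = rng.binomial(n, theta) / n

# Theoretical pieces of Var(xbar) = E[V(theta)]/n + Var(theta)
ev_var = mu * (1 - mu) * M / (M + 1)     # E[theta * (1 - theta)]
var_theta = mu * (1 - mu) / (M + 1)      # prior variance of theta

print(xbar.var())                 # simulated unconditional variance
print(ev_var / n + var_theta)     # theoretical unconditional variance
```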

As members of the exponential family, members of the NEFQVF family are guaranteed to have a conjugate prior distribution, so I'll assume that $G(\theta_i | \mu, \eta)$ is conjugate to $p(x_i | \theta_i)$. For example, if $X_i$ follows a normal distribution, $G(\theta_i | \mu, \eta)$ is a normal as well. If $X_i$ follows a Binomial distribution, then $G(\theta_i | \mu, \eta)$ is a beta distribution. If $X_i$ follows a Poisson distribution, then $G(\theta_i | \mu, \eta)$ is a gamma distribution. The priors themselves do not have to be NEFQVF.

Since $\eta$ and $\mu$ are assumed known, we can use the Bayes' rule with conjugate prior $G(\theta_i | \mu, \eta)$ to calculate the posterior distribution for $\theta_i$

$\theta_i | x_i, \mu, \eta \sim \dfrac{p(x_i | \theta_i)G(\theta_i | \mu, \eta)}{\int p(x_i | \theta_i)G(\theta_i | \mu, \eta) d\theta_i}$

NEFQVF families have closed-form posterior densities.

I'm then going to take as my estimator the expected value of the posterior, $\hat{\theta_i} = E[\theta_i | x_i]$. Specifically for NEFQVF distributions with conjugate priors, the estimator is then given by

$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B) \bar{x_i} + B \mu$

Where $B$ is known as the shrinkage coefficient. For NEFQVF distributions, the form of $B$ is

$B = \dfrac{E[Var(\bar{X_i} | \theta_i)]}{Var(\bar{X_i})} = \dfrac{\dfrac{1}{n_i}E[ V(\theta_i)]}{\dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)} = \dfrac{E[V(\theta_i)]}{E[V(\theta_i)] + n_i Var(\theta_i)}$

Note: The above two formulas, and several of the rules I used to derive them, are guaranteed for NEF distributions and not just NEFQVF distributions; however, the conjugate prior for a NEF may not have a normalizing constant that exists in closed form, and in practical application the distributions that are actually used tend to be NEFQVF. For NEFQVF distributions, a few more algebraic results can be shown about the exact form of the shrinkage estimator by writing the conjugate prior in the general form for exponential densities - for more information, see section 5 of Morris (1983), mentioned in the introduction.

The shrinkage coefficient $B$ for NEFQVF distributions is the ratio of the within-metric variance to the total variance - a function of how noisy the data are compared to how spread out the talent levels are. If at a certain $n_i$ the normalized metric tends to be very noisy around its mean but the means tend to be clustered together, shrinkage will be large. If the normalized metric tends to stay close to its mean value but the means tend to be very spread out, shrinkage will be small. And as the number of observations $n_i$ grows bigger, the effect of the noise gets smaller, decreasing the shrinkage amount.

$B$ itself can be thought of as a shrinkage proportion - if $B = 0$ then there is no shrinkage, and the estimator is just the raw observation.  This would occur if the average variance around the mean is zero - if there's no noise. If $B = 1$ then complete shrinkage takes place and the estimate of the player's true talent level is just the league average talent level. This occurs if the variance in league talent levels is equal to zero - every player has the exact same talent level.

Note that $B$ has no units, since both the top and bottom are variances, so rescaling the data will not change the shrinkage proportion.
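Putting the pieces together, the general estimator is simple to code up. Here's a minimal sketch (the function name and argument names are my own):

```python
def shrink(xbar, n, mu, ev_var, var_theta):
    """Shrink a normalized metric xbar toward the league mean mu.

    ev_var    -- E[V(theta)], the average single-event variance
    var_theta -- Var(theta), the spread of talent levels
    """
    B = ev_var / (ev_var + n * var_theta)
    return (1 - B) * xbar + B * mu

# A player at 0.400 over 100 events, with made-up variance components;
# here n * var_theta = 0.21 = ev_var, so B = 0.5 and the estimate is 0.350
print(shrink(0.400, 100, 0.300, 0.21, 0.0021))
```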

I'm going to show a few examples, working through gory mathematical details.

WARNING: the above results are guaranteed only for NEFQVF distributions - the normal, binomial, negative binomial, Poisson, gamma, and NEF-GHS. Some results also apply to NEF distributions - see Morris (1983) for details. If the data model is not one of those distributions, I can't say whether or not the formulas I've given above will be correct.

## Normal-Normal Example

Let's start with one familiar form - the normal model. This model says that $X_i$, the metric for player $i$, is normally distributed, and is constructed as a sum of $Y_{ij}$ random variables, which are also normally distributed with mean $\theta_i$ and known variance $\sigma^2$. The distribution of talent levels also follows a normal distribution with league mean $\mu$ and variance $\tau^2$.

This can be written as

$Y_{ij} \sim N(\theta_i, \sigma^2)$
$X_i \sim N(n_i \theta_i, n_i \sigma^2)$
$\theta_i \sim N(\mu, \tau^2)$

The average variance is simple. As stated before, $V(\theta_i) = \sigma^2$ is constant for the normal distribution, no matter what the actual $\theta_i$ is. Hence,

$E[V(\theta_i)] = E[\sigma^2] = \sigma^2$

The variance of the averages is simple, too - the model assumes it's constant as well.

$Var(\theta_i) = \tau^2$

This gives a shrinkage coefficient of

$B = \dfrac{\sigma^2}{\sigma^2 + n_i \tau^2}$

Which, if I divide both the top and bottom by $n_i$, might look more familiar as

$B = \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}$

The shrinkage estimator is then

$\hat{\theta_i} = \mu + \left(1 - \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}\right)(\bar{x_i} - \mu)$

Alternatively, I can write $B$ as

$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i}$

And then it follows the familiar pattern from other estimators of $B = m/(m + n)$ for some parameter $m$.
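As a numerical illustration, with made-up values for $\sigma^2$, $\tau^2$, $\mu$, and an observed $\bar{x_i}$:

```python
sigma2, tau2 = 9.0, 0.25    # made-up within-player and between-player variances
mu = 4.2                    # made-up league mean
n, xbar = 30, 3.5           # sample size and observed average

B = (sigma2 / n) / (sigma2 / n + tau2)   # 0.3 / 0.55 = 6/11
theta_hat = mu + (1 - B) * (xbar - mu)
print(B, theta_hat)
```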

It may seem like the normal-normal is of no use - how many counting statistics are there that are normally distributed at the level of inning, plate appearance, or batter faced? By their very nature as counting statistics, that's impossible.

However, the central limit theorem guarantees that sums of independent, identical random variables converge to a normal - hence the distribution of $X_i$ should be unimodal and bell-shaped for large enough $n_i$ (and I'll intentionally leave the discussion of what constitutes "large enough" aside). Thus, as long as the distribution of the $\theta_i$ (the distribution of talent levels) is bell-shaped and symmetric, using a normal-normal with the normal as an approximation at the $X_i$ level should work.

## Beta-Binomial Example

Suppose we're measuring the sum of binary events of some kind - a hit, an on-base event, a strikeout, etc. - in $n_i$ observations - plate appearances, innings pitched, batters faced, etc. Each event can be thought of as a sample from a Bernoulli distribution (these are the $Y_{ij}$) with variance function $V(\theta_i) = \theta_i(1-\theta_i)$. The observed metric $X_i$ is binomial, constructed as the sum of these Bernoulli random variables

$Y_{ij} \sim Bernoulli(\theta_i)$
$X_i \sim Binomial (n_i, \theta_i)$

The conjugate prior for the binomial distribution is the beta.
$\theta_i \sim Beta(\mu, M)$

Fitting with the framework given above, I'm using $\mu = \alpha/(\alpha+\beta)$ and $M = \alpha + \beta$ instead of the traditional $\alpha, \beta$ parametrization, so that $\mu$ represents the league mean and $M$ controls the variation.

The average variance is fairly complicated here. We need to find

$E[V(\theta_i)] = E[\theta_i(1-\theta_i)] = \displaystyle \int_0^1 \dfrac{\theta_i(1-\theta_i) \cdot \theta_i^{\mu M-1}(1-\theta_i)^{(1-\mu) M-1}}{\beta(\mu M, (1-\mu) M)} d\theta_i = \dfrac{\displaystyle \int_0^1 \theta_i^{\mu M}(1-\theta_i)^{(1-\mu) M} d\theta_i}{\beta(\mu M, (1-\mu) M)}$

The top part is a $\beta(\mu M + 1, (1-\mu)M + 1)$ function. Utilizing the properties of the beta function, we have

$E[\theta_i(1-\theta_i)] = \dfrac{\beta(\mu M+1, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)} = \dfrac{\beta(\mu M, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) =$

$\dfrac{\beta(\mu M, (1-\mu) M )}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) \left(\dfrac{(1-\mu) M}{\mu M + (1-\mu) M}\right) = \dfrac{\mu(1-\mu)M^2}{(M+1)M} = \dfrac{\mu(1-\mu) M}{M+1}$

The variance of the $\theta_i$ doesn't require nearly as much calculus, since it can be taken directly as the variance of a beta distribution

$Var(\theta_i) = \dfrac{\mu(1-\mu)}{M+1}$

The shrinkage estimator $B$ is then

$B = \dfrac{\dfrac{\mu(1-\mu)M}{(M+1)}}{\dfrac{\mu(1-\mu)M}{(M+1)} +\dfrac{n_i \mu(1-\mu)}{(M+1)}} = \dfrac{M}{M + n_i}$

Since $\mu(1-\mu)/(M+1)$ appears in every term on the top and bottom, it cancels out. Under this model, the shrinkage estimator is given by

$\hat{\theta_i} = \mu + \left(1 - \dfrac{M}{M + n_i}\right)\left(\bar{x_i} - \mu\right)$
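In code, with made-up prior values (a league mean of .265 and $M = 300$, chosen purely for illustration) applied to a hot start:

```python
mu, M = 0.265, 300.0   # made-up league mean and prior spread parameter
hits, n = 30, 75       # a .400 start over 75 trials

B = M / (M + n)        # 300/375 = 0.8: heavy shrinkage at this sample size
theta_hat = mu + (1 - B) * (hits / n - mu)
print(theta_hat)       # roughly 0.292: the .400 start is pulled most of the way back
```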

## Poisson-Gamma Example

Now suppose that instead of a binary event, the outcome can be a count - zero, one, two, three, etc. Each count can be thought of as a sample from a Poisson distribution with parameter $\theta_i$ (these are the $Y_{ij}$, with $V(\theta_i) = \theta_i$) with $X_i$ as the sum total of counts, which also has a Poisson distribution with parameter $n_i \theta_i$.

$Y_{ij} \sim Poisson(\theta_i)$
$X_i \sim Poisson(n_i \theta_i)$

The conjugate prior of $\theta_i$ for a Poisson is a gamma.
$\theta_i \sim Gamma(\mu, K)$

In this parametrization, I'm using $\mu = \alpha/\beta$ and $K = \beta$ as compared to the traditional $\alpha, \beta$ parametrization.

The average variance is

$E[V(\theta_i)] = E[\theta_i] = \mu$

And the variance of the averages is

$Var(\theta_i) = \dfrac{\mu}{K}$

So the shrinkage coefficient $B$ is

$B = \dfrac{\mu}{\mu + \dfrac{n_i \mu}{K}} = \dfrac{1}{1 + \dfrac{n_i}{K}} = \dfrac{K}{K + n_i}$

Which gives a shrinkage estimator of

$\hat{\theta_i} = \mu + \left(1 - \dfrac{K}{K + n_i}\right)(\bar{x_i} - \mu)$
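In code, with made-up values for a count statistic (say, strikeouts per inning, with $\mu$ and $K$ invented for illustration):

```python
mu, K = 1.0, 40.0      # made-up league mean rate and prior parameter
counts, n = 90, 70     # observed total count over n innings

B = K / (K + n)        # 40/110: shrinkage proportion
theta_hat = mu + (1 - B) * (counts / n - mu)
print(theta_hat)       # the observed rate 9/7 is pulled back toward 1.0
```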

## What Statistics Fit Into this Framework?

Any counting statistic that is constructed as a sum of the same basic events falls under this framework. It's possible to combine multiple basic events into one "super" event, as long as they are considered to be equal. Examples of this include batting average, on-base percentage, earned run average, batting average on balls in play, fielding percentage, stolen base percentage, team win percentage, etc. It's possible to weight the sum, as long as you're just adding the same type of event to itself over and over.

Any statistic that is a sum, weighted or unweighted, of different events does not fall into this framework - examples include weighted on-base average, slugging percentage, on-base plus slugging percentage, fielding independent pitching, isolated power, etc. Also, any statistics that are ratios of counts - strikeout to walk ratio, for example - do not fall under this framework.

Statistics like wins above replacement are right out.

I want to make clear that this is simply a discussion of what statistics fit nominally into a very specific theoretical framework. A statistic falling under the framework does not imply that the statistic is good, nor does falling outside it imply that a statistic is bad. Furthermore, even if a statistic does not fall under this framework, shrinkage estimation using these formulas may still work as a very good approximation - the best statistics in sabermetrics today are often weighted sums of counting events, and people have been using these shrinkage estimators on them successfully for years, so clearly they must be doing something right. This is simply what I can justify using statistical theory.

## Performing the Analysis

The values of $\eta$ and $\mu$ must be chosen or estimated. If prior data exists - like, for example, historical baseball data - values can be chosen based upon a careful analysis of that information. If no prior data exists, one option is to estimate the parameters through either moment-based or marginal likelihood-based estimation, and then plug in those values - this method is known as parametric empirical Bayes. Another option is to place a hyperprior or hyperpriors on $\eta$ and $\mu$ and perform a full hierarchical Bayesian analysis, which will almost certainly involve MCMC. Depending on the form of your prior, your shrunk results will likely be similar to, but not equal to, the shrinkage estimators given here.
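For the beta-binomial case, a crude parametric empirical Bayes sketch might look like the following. I'm assuming, for simplicity, that every player has the same number of trials $n$; the function name and the moment-matching shortcut (solving $Var(\bar{X_i}) = \mu(1-\mu)(M + n)/(n(M+1))$ for $M$) are my own, and a real analysis would need to handle unequal $n_i$:

```python
import numpy as np

def moment_estimates(successes, trials):
    """Moment-matching estimates of (mu, M) for the beta-binomial model.

    Assumes every entry of `trials` is the same n. Solves
    Var(xbar) = mu*(1-mu)*(M + n) / (n*(M + 1)) for M.
    """
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)
    n = trials[0]
    mu = successes.sum() / trials.sum()           # pooled league mean
    s2 = (successes / trials).var(ddof=1)         # sample variance of the rates
    M = n * (mu * (1 - mu) - s2) / (n * s2 - mu * (1 - mu))
    return mu, M

# Check on simulated data with known mu = 0.3 and M = 100
rng = np.random.default_rng(2)
theta = rng.beta(0.3 * 100, 0.7 * 100, size=20_000)
x = rng.binomial(500, theta)
print(moment_estimates(x, np.full(20_000, 500)))   # should be near (0.3, 100)
```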

What if none of the NEFQVF models appear to fit your data? You have a few options, such as nonparametric or hierarchical Bayesian modeling, but any such method is going to be more difficult and more computationally intensive.