23 October, 2015

From Stabilization to Interval Estimation



In this post, I'm going to show how to use league means and stabilization points to construct mean interval and prediction interval estimates for some basic counting statistics. I'll focus on two specific models: the normal-normal and the beta-binomial model.

At a few points during this post I'm going to mention some empirical results. You can find the data I used and the code I ran on my github.

Distributional Assumptions


I'm assuming the statistic in question is a binomial outcome - this covers many basic counting statistics (batting average, on-base percentage, batting average on balls in play etc.) but not rate statistics, or more complicated statistics such as wOBA.

Assume that in $n_i$ trials, a player accrues $x_i$ events (hits, on-base events, etc.). I'm going to assume that trials are independent and identical with parameter of success $\theta_i$.

For the distribution of the $x_i$, I'm going to work out the math for two specific distributions - the normal and the binomial. I'm also going to assume that the distribution of the $\theta_i$ follows the respective conjugate distribution - the normal for the normal model, and the beta for the binomial model. This prior has mean talent level $\mu$ and stabilization point $M$.

$x_i \sim p(x_i | \theta_i, n_i)$
$\theta_i \sim G(\theta_i | \mu, M)$

For the stabilization point $M$, I'm going assume this is the number of events at which $r = 0.5$. If you choose the point at which $r = 0.7$, then these formulas won't work

For several of the mathematical results here, I'm going to refer back to my article shrinkage estimators for counting statistics - particularly, the examples section at the end - without offering proofs or algebraic derivations.

The posterior distribution for $\theta_i$ is given by

$\displaystyle p(\theta_i | x_i, n_i, \mu, M) = \dfrac{p(x_i | \theta_i, n_i)G(\theta_i | \mu, M)}{\int p(x_i | \theta_i, n_i)G(\theta_i | \mu, M) d_{\theta_i}}$

Intervals will be constructed by taking quantiles from this distribution.

(For rate statistics instead of count statistics, the gamma-Poisson model can be used - though that will take more math to figure out the correct forms of the intervals. I got about three-quarters of the way there in my article WHIP stabilization by the gamma-Poisson model if somebody else wants to work through the rest. For more complicated statistics such as wOBA, I'm going to have to work through some hard math.)

Mean Intervals

 

Normal-Normal Model


For the normal-normal model, suppose that both the counts and the distribution of talent levels follow a normal distribution. Then the observed proportion $x_i / n_i$ also follows a normal distribution.

$\dfrac{x_i}{n_i} \sim N\left(\theta_i, \dfrac{\sigma^2}{n_i}\right)$

$\theta_i \sim N(\mu, \tau^2)$

Furthermore, a normal approximation to the binomial is used to estimate $\sigma^2$ as $\sigma^2 = \mu (1-\mu)$. The usual normal approximation to the binomial takes $\theta_i (1-\theta_i)$ as the variance; however, the normal-normal model assumes that variance $\sigma^2$ is constant around every single $\theta_i$ - so an estimate for that is the average amount of variance over all of them, $\mu(1-\mu)$.

As a side note, the relationship between $M$ and $\tau^2$ is given by

$M = \dfrac{\sigma^2}{\tau^2} \approx \dfrac{\mu(1-\mu)}{\tau^2}$

As I showed in my article shrinkage estimators for counting statistics, for the normal-normal model the shrinkage coefficient is given as

$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i} = \dfrac{ M }{ M + n_i }$

The resulting posterior is then

$\theta_i | x_i, n_i, \mu, M \sim N\left(  (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, (1-B) \left(\dfrac{\sigma^2}{n_i}\right)\right)$

And substituting in the values of $B$, the variance of the posterior is given as

$\left(1 - \dfrac{M}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \left(\dfrac{n_i}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \dfrac{\mu(1-\mu)}{M+n_i}$

And a 95% interval estimate for $\theta_i$ is

$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{M+n_i}}$


Beta-Binomial Model



For the beta-binomial model, suppose that the counts of events $x_i$ in $n_i$ events follows a binomial distribution and the distribution of the $\theta_i$ themselves is beta.

$x_i \sim Binomial(n_i, \theta_i )$

$\theta_i \sim Beta(\alpha, \beta)$

For the beta distribution of talent levels, the parameters can be constructed from the league mean and stabilization point as

$\alpha = \mu M$

$\beta = (1-\mu) M$

Using the beta as a prior distribution, the posterior for $\theta_i$ is then

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

A 95% credible interval can then be taken as quantiles from this distribution - I show how to do this in in R in my article on Bayesian credible intervals. Most statistical software should be able to take quantiles from the beta distribution easily.

Alternatively, a normal approximation may be used - the posterior should be approximately normal with mean and variance

$\theta_i | x_i, n_i, \mu, M \sim N\left( \dfrac{x_i + \mu M}{n_i  + M},  \dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}\right)$

So a 95% credible interval based on the normal approximation to the beta posterior is given by

$\left(\dfrac{x_i + \mu M}{n_i  + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}}$

This should be very close to the interval given by taking quantiles directly from the beta distribution.

Practical Application


I downloaded first and second half data hitting data from all qualified non-pitchers from 2010 to 2015 from fangraphs.com. I used the above formulas on the first half of on-base percentage data to create intervals, and then calculated the proportion of those intervals that contained the on-base percentage for the second half. For the league mean and stabilization point, I used values of $M$ and $\mu$ (even though I didn't show $\mu$) from my article "More Offensive Stabilization Points."

But wait...isn't there uncertainty in those estimates of $M$ and $\mu$? Yes, but it actually doesn't play a huge role unless the uncertainty is large, such as for the BABIP. You can try it out yourself by running the code and changing the values slightly, or just trust me.

I rounded off to the nearest whole number for $M$ and to three nonzero digits  for $\mu$. The intervals compared were the normal-normal as NN, beta-binomial as BB, and the normal approximation to the beta-binomial as BB (N). The resulting coverages were

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.66 & 0.659 & 0.659 \\
BA & 0.268 & 466 & 0.604 & 0.601 & 0.603 \\
1B & 0.158 & 222 & 0.685 & 0.675 & 0.679 \\
2B & 0.0475 & 1025 & 0.532 & 0.532 & 0.531 \\
3B & 0.00492 & 373 & 0.762 & 0.436 & 0.76 \\
XBH & 0.0524 & 1006 & 0.545 & 0.542 & 0.551 \\
HR & 0.0274 & 125 & 0.754 & 0.707 & 0.738 \\
BB & 0.085 & 106 & 0.688 & 0.661 & 0.673 \\
SO & 0.181 & 50 & 0.74 & 0.728 & 0.729 \\
HBP & 0.00866 & 297 & 0.725 & 0.591 & 0.721 \\  \hline\end{array}

So what happened? Shouldn't the 95% intervals have 95% coverage? Well, they should. The problem is, I used the wrong type of interval - the intervals calculated here are for the mean $\theta_i$. But we don't have $\theta_i$. What we have is the second half on-base percentage, which is $\theta_i$ plus the random noise that naturally surrounds $\theta_i$ in however many additional plate appearances. What's appropriate here is a prediction-type interval that attempts to cover not the mean, but a new observation - this interval will have to account for both the uncertainty of estimation and the natural randomness in a new set of observations.


Prediction Intervals


The interval needed is predictive - since the previous intervals were constructed as Bayesian credible intervals, a posterior predictive interval can be used.

Suppose that $\tilde{x_i}$ is the new count of events for player $i$ in $\tilde{n_i}$ new trials. I'm going to assume that $\tilde{n_i}$ is known. I'll also assume that $\tilde{x_i}$ is generated from the same process that generated $x_i$.

$\tilde{x_i} \sim p(\tilde{x_i} | \theta_i, \tilde{n_i})$

The posterior predictive is then

$p(\tilde{x_i}| \tilde{n_i}, x_i, n_i, \mu, M) = \displaystyle \int p(\tilde{x_i} | \theta_i, \tilde{n_i})p(\theta_i | x_i, n_i, \mu, M) d\theta_i$

For a bit more explanation, check out my article on posterior predictive distributions.

Normal-Normal Model


As stated above, the posterior distribution for $\theta_i$ is normal

$\theta_i | x_i, n_i, \mu, M \sim N\left(  B \left(\dfrac{x_i}{n_i}\right) + (1-B) \mu, \dfrac{\mu (1-\mu)}{n_i + M}\right)$

$B = \dfrac{M}{n_i + M}$

Using a normal approximation to the binomial, the distribution of the new on-base percentage in the second half (call this $\tilde{x_i}/\tilde{n_i}$) is also normal

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \theta_i, \mu  \sim N\left(\theta_i, \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

The posterior predictive is the marginal distribution, integrating out over $\theta_i$ - it is given as

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \tilde{n_i}, x_i, n_i, \mu, M \sim N\left(B \left(\dfrac{x_i}{n_i}\right) + (1-B) \mu,  \dfrac{\mu (1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

And so a 95% posterior predictive interval for the on-base percentage in the second half is given by


$ \left[ \left(\dfrac{M}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{n_i}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}}$

 

Beta-Binomial Model

 

The posterior distribution for $\theta_i$ is beta

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

 The distribution for the number of on-base events $\tilde{x_i}$ in $\tilde{n_i}$ follows a binomial distribution

$\tilde{x_i} \sim Binomial(\theta_i, \tilde{n_i})$

The posterior predictive for the number of on-base events in the new number of trials is the marginal distribution, which has density

$p(\tilde{x_i}| x_i, n_i, \mu, M, \tilde{n_i}) = \displaystyle  {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$


This is the beta-binomial distribution. It's is a discrete distribution that gives the probability of the number of on-base events in $\tilde{n_i}$ new PA, not the actual on-base percentage.

Since it is discrete, it's easy to solve for quantiles

$Q(\alpha) = \displaystyle \min_{k} \{ k : F(k) \le \alpha \}$
Where

$F(k) = \displaystyle \sum_{\tilde{x_i} \le k} p(\tilde{x_i} | x_i, n_i, \mu, M, \tilde{n_i}) =  \displaystyle \sum_{\tilde{x_i} \le k} \displaystyle  {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$

Since $Q(\alpha)$ is the quantile for the count of events, a 95% interval for the actual on-base proportion is given by

$\left(\dfrac{Q(.025)}{\tilde{n_i}} ,\dfrac{Q(.975)}{\tilde{n_i}}\right)$.

Alternatively, since the distribution is likely to be unimodal and bell-shaped, a normal approximation to the 95% posterior predictive interval is given by

$\left(\dfrac{x_i + \mu M}{n_i  + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu)M)(n_i + M + \tilde{n_i})}{\tilde{n_i} (n_i + M)(n_i + M + 1)}}$

This isn't as good of an approximation as the normal approximation to the beta-binomial interval for the mean, but the difference between intervals is still only around 1% of the length and should work well.

Practical Application



I repeated the analysis using the predictive formulas given above, using the first half on-base percentage to try to capture the second half on-base percentage, using the same $\mu$ and $M$ values as before.

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.944 & 0.944 & 0.94 \\
BA & 0.268 & 466 & 0.943 & 0.943 & 0.944 \\
1B & 0.158 & 222 & 0.941 & 0.941 & 0.942 \\
2B & 0.0475 & 1025 & 0.956 & 0.956 & 0.955 \\
3B & 0.00492 & 373 & 0.955 & 0.955 & 0.956 \\
XBH & 0.0524 & 1006 & 0.957 & 0.957 & 0.959 \\
HR & 0.0274 & 125 & 0.951 & 0.951 & 0.952 \\
BB & 0.085 & 106 & 0.925 & 0.925 & 0.921 \\
SO & 0.181 & 50 & 0.918 & 0.918 & 0.92 \\
HBP & 0.00866 & 297 & 0.95 & 0.95 & 0.947 \\ \hline\end{array}

Cautions and Conclusion



Despite the positive results, I think that 95% actual coverage from these intervals is overoptimistic. For one, I selected a very "nice" group of individuals to test it on - nonpitchers with more than 300 PA. Being in this category implies a high talent level and a lack of anything that could drastically change that talent level over the course of the season, such as injury. I also treated the second half sample size $\tilde{n_i}$ as known - obviously, that must be estimated as well, and should add additional uncertainty.

Furthermore, there are clearly other factors at work than just random variation - players can get traded to different environments (a player being traded to or from Coors park, for example), talent levels may very well change over the course of the season, and events are clearly not independent and identical.

Applying these formulas to the population of players at large should see the empirical coverage drop - my guess (though I haven't tested it) is that 95% intervals should empirically get around 85%-90% actual coverage. Also keep in mind that $M$ and $\mu$ need to be kept updated - using means and stabilization points from too far in the past will lead to shrinkage towards the wrong point.

You can and should be able to do better than these intervals, in terms of length - these are incredibly simplistic, using only information about the player and information about the population. Adding covariates to the model to account for other sources of variation should allow you to decrease the length without sacrificing accuracy.

Alternatively, you could use these formulas with projections, with $\mu$ as the preseason projection and $M$ representing how many events that projection is "worth." This is more in keeping with the traditional Bayesian sense of the interval, and won't guarantee any sort of coverage.

However, I still think these intervals are useful in that they represent a sort of baseline - any more advanced model that generates predictive intervals should be able to do better than these.

Edit 16 Mar. 2017: I found that the data file I used for this analysis with the split first and second half  statistics was not what I thought it was - it repeated the same player multiple times, giving an inaccurate estimate of the confidence level. I have corrected the data file and re-run the analysis and presented the corrected confidence levels.

No comments:

Post a Comment