04 December, 2015

Correcting Parametric Empirical Bayesian Intervals using a Bootstrap

In a previous post I discussed empirical Bayes for the beta-binomial model. Empirical Bayesian estimates are just expected values of a posterior distribution - suppose instead that you want interval estimates. The empirical Bayesian method can be used, but the intervals potentially have to be adjusted. In this post I want to show how to construct empirical Bayesian intervals for the beta-binomial model, correct them using a parametric bootstrap, and give a baseball example.

Annotated code for the procedure I will describe in this post can be found on my github.

As a side note, I just want to say how much I love this procedure. It uses parametric bootstrapping to correct Bayesian intervals to achieve frequentist coverage. God bless America.


Empirical Bayesian Intervals



In my previous post on beta-binomial empirical Bayes analysis, I used the model

$y_i \sim Bin(n_i, \theta_i)$
$\theta_i \sim Beta(\alpha, \beta)$

The empirical Bayes method says to get estimates $\hat{\alpha}$ and $\hat{\beta}$ of the prior parameters - I give a method of moments estimator in the post, but marginal maximum likelihood may also be used - and then calculate posterior distributions for each of the $\theta_i$ using $Beta(\hat{\alpha}, \hat{\beta})$ as the prior, essentially using the data itself to estimate the prior.

The empirical Bayesian estimate is the mean of this posterior distribution. If an interval estimate is desired, a credible interval can be calculated by taking quantiles directly from this posterior - see my post on credible intervals. This is what I will call a "naive" empirical Bayesian interval.


The Empirical Bayes Problem



The fundamental problem with naive empirical Bayesian intervals is that they often end up too short, inappropriately centered, or both. This is because the uncertainty of the prior parameters themselves has not been accounted for. From the law of total variance, the posterior variance is given by

$Var(\theta_i | y_i) = E_{\alpha, \beta|y_i}[Var(\theta_i|y_i, \alpha, \beta)] + Var_{\alpha, \beta|y_i}[E(\theta_i|y_i, \alpha, \beta)]$

Taking quantiles from the empirical Bayesian posterior estimates the first term, but not the second. For small samples this second term can be significant, and naive empirical Bayesian intervals generally won't achieve nominal coverage (for more information, see Carlin and Louis's book Bayesian Methods for Data Analysis).

One way to correct for the uncertainty is to perform a hierarchical Bayesian analysis, but it's not clear what the "correct" hyperpriors should be - and just using noninformative priors doesn't guarantee that you'll get nominal frequentist coverage.

An alternative is to use the bootstrap. Since I'm working in the parametric empirical Bayes case, a parametric bootstrap will be used, though this doesn't necessarily have to be the case. For more information on the technique I will use (and a discussion of how it applies to the normal-normal case), see Laird, N., and Louis, T. (1987), "Empirical Bayes Confidence Intervals Based on Bootstrap Samples," Journal of the American Statistical Association, 82(399), 739-750.

I want to emphasize that this is a technique for small samples. For even moderate samples, the uncertainty in the parameters will be small enough that naive empirical Bayesian intervals will have good frequentist properties, and this can be checked by simulation if you desire.


Parametric Bootstrap



The idea of the parametric bootstrap is that we can account for $Var_{\alpha, \beta|y_i}[E(\theta_i|y_i, \alpha, \beta)]$ by resampling. In a traditional bootstrap, data is sampled with replacement from the original data set. Since this is a parametric bootstrap instead, we will resample by generating observations from the model assuming that our estimates $\hat{\alpha}$ and $\hat{\beta}$ are correct.

  1. Generate $\theta^*_1, \theta^*_2, ..., \theta^*_k$ from $Beta(\hat{\alpha}, \hat{\beta})$
  2. Generate $y^*_1, y^*_2,..., y^*_k$ from $Bin(n_i, \theta^*_i)$
  3. Estimate $\alpha$ and $\beta$ from the bootstrapped $(y^*_i, n_i)$ observations using the same method as you initially used. Call these estimates $\alpha^*_j, \beta^*_j$
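As a rough sketch of these steps in R - assuming a hypothetical function fit.beta(y, n) standing in for whatever estimator was used on the original data (method of moments, marginal maximum likelihood, etc.), returning the vector c(alpha, beta) - the bootstrap loop might look like:

     # Minimal sketch of the bootstrap loop. fit.beta(y, n) is a hypothetical
     # stand-in for the estimator used on the original data.
     boot.beta <- function(y, n, N = 5000) {
       k <- length(y)
       fit <- fit.beta(y, n)                     # alpha-hat and beta-hat
       replicate(N, {
         theta.star <- rbeta(k, fit[1], fit[2])  # step 1: new talent levels
         y.star <- rbinom(k, n, theta.star)      # step 2: new observed counts
         fit.beta(y.star, n)                     # step 3: re-estimate
       })
     }

The result is a 2 x N matrix whose columns are the bootstrapped pairs $(\alpha^*_j, \beta^*_j)$.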


Repeating these three steps $N$ times gives a set of bootstrap estimates $(\alpha^*_j, \beta^*_j)$, $j = 1, \ldots, N$, of the parameters of the underlying beta distribution. The posterior density that accounts for the uncertainty of $\hat{\alpha}$ and $\hat{\beta}$ can then be estimated as

$p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) \approx \dfrac{ \sum_{j = 1}^N p(\theta_i | y_i, \alpha^*_j, \beta^*_j)}{N}$

Essentially, this is just the raw average of the posterior density at each point, averaged over all the bootstrapped parameter values. The corrected 95% empirical Bayesian interval is given by solving

$\displaystyle \int_{-\infty}^{l} p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) \, d\theta_i = 0.025$

$\displaystyle \int_{-\infty}^{u} p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) \, d\theta_i = 0.975$


for lower and upper bounds $l$ and $u$, using numerical techniques.


Baseball Example



In my previous post, I analyzed the famous Morris baseball data set with respect to loss functions to show why empirical Bayes works. Analyzing it with respect to interval estimation also provides an interesting example.

Using a beta-binomial model with the method of moments estimator I described in the previous post, this data set has parameters $\hat{\alpha} = 97.676$ and $\hat{\beta} = 270.312$.  Each player had $n_i = 45$ at-bats, so the posterior distribution for the batting average $\theta_i$ of player $i$ is

$\theta_i | y_i, \hat{\alpha}, \hat{\beta} \sim Beta(y_i + 97.676, 45 - y_i + 270.312)$

Naive intervals can be taken directly as central 95% quantiles from the posterior distribution - again, see my article on Bayesian credible intervals for more explanation on this.

\begin{array}{l c c c c c c} \hline
\textrm{Player} & y_i & y_i/n_i &  \textrm{EB Estimate} & \textrm{Naive Lower} &  \theta_i & \textrm{Naive Upper}\\ \hline
Clemente & 18 & .400  & .280 & 0.238 & .346 &  0.324 \\
F. Robinson & 17 & .378   & .278 & 0.236   & .298 &  0.322 \\
F. Howard & 16 & .356 &  .275 & 0.233   & .276 &  0.319 \\
Johnstone & 15 & .333  &  .273 & 0.231   & .222 & 0.317  \\
Barry & 14 & .311 & .270 &  0.229    & .273 & 0.314  \\
Spencer & 14 & .311  & .270 & 0.229  & .270 & 0.314 \\
Kessinger & 13 & .289  & .268 & 0.226    & .263 &  0.312 \\
L. Alvarado & 12 & .267 & .266 & 0.224    & .210 &  0.309 \\
Santo & 11 & .244 &  .263 &  0.222  & .269 &  0.307\\
Swoboda & 11 & .244 &  .263 & 0.222   & .230 & 0.307  \\
Unser & 10 &.222 & .261 &  0.220  & .264 & 0.304  \\
Williams & 10 & .222 & .261 & 0.220  & .256 & 0.304 \\
Scott & 10 & .222 & .261 & 0.220    & .303 & 0.304 \\
Petrocelli & 10 & .222  & .261 & 0.220   & .264 & 0.304  \\
E. Rodriguez & 10 & .222  & .261 &  0.220   & .226 &  0.304\\
Campaneris & 9 & .200 & .258 &  0.217   & .285 & 0.302 \\
Munson & 8 & .178 & .256 & 0.215   & .316 & 0.299  \\
Alvis & 7 & .156 & .253 &  0.213    & .200 & 0.296 \\ \hline
\end{array}

Thirteen out of the eighteen intervals captured the hitter's true average for the rest of the year, for an observed coverage of 72.22%. Roberto Clemente and Thurman Munson managed to overperform with respect to their intervals, while Jay Johnstone, Luis Alvarado, and Max Alvis underperformed.

The parametric bootstrap procedure fixes this as follows:

  1. Simulate a set of 18 new $\theta^*_i$ from a $Beta(97.676, 270.312)$ distribution.
  2. Simulate a set of 18 new $y^*_i$ from a $Bin(45, \theta^*_i)$ distribution
  3. Estimate $\alpha^*$ and $\beta^*$ using the same method used on the original data set.
I repeated this for 5000 bootstrap samples. The bootstrapped posterior is then


$p^*(\theta_i | y_i, 97.676, 270.312) \approx \dfrac{ \sum_{j = 1}^{5000} p(\theta_i | y_i, \alpha^*_j, \beta^*_j)}{5000}$

The effect of this is to create a posterior distribution that's centered around the same empirical Bayesian estimate $\hat{\theta_i}$, but more spread out. This is shown in the naive (solid line) and bootstrapped (dashed line) posterior distributions for $y_i = 10$.


Quantiles taken from this bootstrapped distribution will give wider intervals than the naive empirical Bayesian intervals (though it is possible to come up with "odd" data sets where the bootstrap interval is shorter). The bootstrap interval is given by solving


$\displaystyle \int_{-\infty}^{l} p^*(\theta_i | y_i, 97.676, 270.312) \, d\theta_i = 0.025$

$\displaystyle \int_{-\infty}^{u} p^*(\theta_i | y_i, 97.676, 270.312) \, d\theta_i = 0.975$


for lower and upper bounds $l$ and $u$ - and in full disclosure, I didn't actually perform the full numerical integration. Instead, I averaged over the pbeta(x, alpha, beta) function in R and solved for the values where the averaged CDF equals 0.025 or 0.975.
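A minimal sketch of that calculation for a single hitter (here $y_i = 10$, $n_i = 45$), assuming alpha.star and beta.star are vectors holding the 5000 bootstrapped parameter estimates (e.g., the two rows of the matrix from the earlier sketch):

     # Averaged posterior CDF at x for a player with y on-base events in n AB
     avg.cdf <- function(x, y, n) {
       mean(pbeta(x, y + alpha.star, n - y + beta.star))
     }

     # Solve for the bounds where the averaged CDF equals 0.025 and 0.975
     lower <- uniroot(function(x) avg.cdf(x, 10, 45) - 0.025, c(0, 1))$root
     upper <- uniroot(function(x) avg.cdf(x, 10, 45) - 0.975, c(0, 1))$root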

Doing this, the 95% bootstrapped intervals are given by

\begin{array}{l c c c c c c} \hline
\textrm{Player} & y_i & y_i/n_i &  \textrm{EB Estimate} & \textrm{Bootstrap Lower} &  \theta_i & \textrm{Bootstrap Upper}\\ \hline
Clemente & 18 & .400  & .280 &  0.231  & .346 &  0.391 \\
F. Robinson & 17 & .378   & .278 & 0.227   & .298 &  0.382 \\
F. Howard & 16 & .356 &  .275 & 0.222   & .276 &  0.372 \\
Johnstone & 15 & .333  &  .273 & 0.217   & .222 & 0.363  \\
Barry & 14 & .311 & .270 &  0.211    & .273 & 0.355  \\
Spencer & 14 & .311  & .270 & 0.211  & .270 & 0.355 \\
Kessinger & 13 & .289  & .268 & 0.205    & .263 &  0.346 \\
L. Alvarado & 12 & .267 & .266 & 0.199    & .210 &  0.338 \\
Santo & 11 & .244 &  .263 &  0.192  & .269 &  0.330\\
Swoboda & 11 & .244 &  .263 & 0.192   & .230 & 0.330  \\
Unser & 10 &.222 & .261 &  0.184  & .264 & 0.323 \\
Williams & 10 & .222 & .261 & 0.184  & .256 & 0.323 \\
Scott & 10 & .222 & .261 & 0.184  & .303 & 0.323 \\
Petrocelli & 10 & .222  & .261 & 0.184   & .264 & 0.323  \\
E. Rodriguez & 10 & .222  & .261 &  0.184   & .226 &  0.323\\
Campaneris & 9 & .200 & .258 &  0.177   & .285 & 0.317 \\
Munson & 8 & .178 & .256 & 0.169  & .316 & 0.310  \\
Alvis & 7 & .156 & .253 &  0.161  & .200 & 0.305 \\ \hline
\end{array}

Only one player out of eighteen is not captured by his interval - Thurman Munson, who managed to hit 0.178 in his first 45 at-bats and 0.316 the rest of the season - for an observed coverage of 94.44%.

(As a reminder, annotated code to perform estimation in the beta-binomial model and calculate bootstrapped empirical Bayesian intervals for the beta-binomial model is available on my github)

23 October, 2015

From Stabilization to Interval Estimation



In this post, I'm going to show how to use league means and stabilization points to construct mean interval and prediction interval estimates for some basic counting statistics. I'll focus on two specific models: the normal-normal and the beta-binomial model.

At a few points during this post I'm going to mention some empirical results. You can find the data I used and the code I ran on my github.

Distributional Assumptions


I'm assuming the statistic in question is a binomial outcome - this covers many basic counting statistics (batting average, on-base percentage, batting average on balls in play, etc.) but not rate statistics or more complicated statistics such as wOBA.

Assume that in $n_i$ trials, a player accrues $x_i$ events (hits, on-base events, etc.). I'm going to assume that trials are independent and identically distributed with probability of success $\theta_i$.

For the distribution of the $x_i$, I'm going to work out the math for two specific distributions - the normal and the binomial. I'm also going to assume that the distribution of the $\theta_i$ follows the respective conjugate distribution - the normal for the normal model, and the beta for the binomial model. This prior has mean talent level $\mu$ and stabilization point $M$.

$x_i \sim p(x_i | \theta_i, n_i)$
$\theta_i \sim G(\theta_i | \mu, M)$

For the stabilization point $M$, I'm going to assume this is the number of events at which $r = 0.5$. If you instead choose the point at which $r = 0.7$, these formulas won't work.

For several of the mathematical results here, I'm going to refer back to my article shrinkage estimators for counting statistics - particularly, the examples section at the end - without offering proofs or algebraic derivations.

The posterior distribution for $\theta_i$ is given by

$\displaystyle p(\theta_i | x_i, n_i, \mu, M) = \dfrac{p(x_i | \theta_i, n_i)G(\theta_i | \mu, M)}{\int p(x_i | \theta_i, n_i)G(\theta_i | \mu, M) \, d\theta_i}$

Intervals will be constructed by taking quantiles from this distribution.

(For rate statistics instead of count statistics, the gamma-Poisson model can be used - though that will take more math to figure out the correct forms of the intervals. I got about three-quarters of the way there in my article WHIP stabilization by the gamma-Poisson model if somebody else wants to work through the rest. For more complicated statistics such as wOBA, I'm going to have to work through some hard math.)

Mean Intervals

 

Normal-Normal Model


For the normal-normal model, suppose that both the counts and the distribution of talent levels follow a normal distribution. Then the observed proportion $x_i / n_i$ also follows a normal distribution.

$\dfrac{x_i}{n_i} \sim N\left(\theta_i, \dfrac{\sigma^2}{n_i}\right)$

$\theta_i \sim N(\mu, \tau^2)$

Furthermore, a normal approximation to the binomial is used to estimate $\sigma^2$ as $\sigma^2 = \mu (1-\mu)$. The usual normal approximation to the binomial takes $\theta_i (1-\theta_i)$ as the variance; however, the normal-normal model assumes that variance $\sigma^2$ is constant around every single $\theta_i$ - so an estimate for that is the average amount of variance over all of them, $\mu(1-\mu)$.

As a side note, the relationship between $M$ and $\tau^2$ is given by

$M = \dfrac{\sigma^2}{\tau^2} \approx \dfrac{\mu(1-\mu)}{\tau^2}$

As I showed in my article shrinkage estimators for counting statistics, for the normal-normal model the shrinkage coefficient is given as

$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i} = \dfrac{ M }{ M + n_i }$

The resulting posterior is then

$\theta_i | x_i, n_i, \mu, M \sim N\left(  (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, (1-B) \left(\dfrac{\sigma^2}{n_i}\right)\right)$

And substituting in the values of $B$, the variance of the posterior is given as

$\left(1 - \dfrac{M}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \left(\dfrac{n_i}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \dfrac{\mu(1-\mu)}{M+n_i}$

And a 95% interval estimate for $\theta_i$ is

$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{M+n_i}}$
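As a small sketch in R (the values mu = 0.33 and M = 296 in the example call are the OBP values used later in this post; the player is hypothetical):

     # Sketch: 95% normal-normal interval for a true talent level theta_i
     nn.interval <- function(x, n, mu, M) {
       center <- (n / (n + M)) * (x / n) + (M / (n + M)) * mu
       se <- sqrt(mu * (1 - mu) / (M + n))
       c(center - 1.96 * se, center + 1.96 * se)
     }

     # e.g. a hypothetical 0.350 OBP over 300 PA
     nn.interval(x = 105, n = 300, mu = 0.33, M = 296)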


Beta-Binomial Model



For the beta-binomial model, suppose that the counts of events $x_i$ in $n_i$ events follows a binomial distribution and the distribution of the $\theta_i$ themselves is beta.

$x_i \sim Binomial(n_i, \theta_i )$

$\theta_i \sim Beta(\alpha, \beta)$

For the beta distribution of talent levels, the parameters can be constructed from the league mean and stabilization point as

$\alpha = \mu M$

$\beta = (1-\mu) M$

Using the beta as a prior distribution, the posterior for $\theta_i$ is then

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

A 95% credible interval can then be taken as quantiles from this distribution - I show how to do this in R in my article on Bayesian credible intervals. Most statistical software should be able to take quantiles from the beta distribution easily.

Alternatively, a normal approximation may be used - the posterior should be approximately normal with mean and variance

$\theta_i | x_i, n_i, \mu, M \sim N\left( \dfrac{x_i + \mu M}{n_i  + M},  \dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}\right)$

So a 95% credible interval based on the normal approximation to the beta posterior is given by

$\left(\dfrac{x_i + \mu M}{n_i  + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}}$

This should be very close to the interval given by taking quantiles directly from the beta distribution.
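In R, the quantile version is a one-liner (same hypothetical player as in the normal-normal sketch above):

     # Sketch: 95% beta-binomial credible interval via beta quantiles
     bb.interval <- function(x, n, mu, M) {
       qbeta(c(0.025, 0.975), x + mu * M, n - x + (1 - mu) * M)
     }

     bb.interval(x = 105, n = 300, mu = 0.33, M = 296)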

Practical Application


I downloaded first half and second half hitting data for all qualified non-pitchers from 2010 to 2015 from fangraphs.com. I used the formulas above on the first-half on-base percentage data to create intervals, and then calculated the proportion of those intervals that contained the second-half on-base percentage. For the league mean and stabilization point, I used values of $M$ and $\mu$ (even though I didn't show $\mu$) from my article "More Offensive Stabilization Points."

But wait...isn't there uncertainty in those estimates of $M$ and $\mu$? Yes, but it actually doesn't play a huge role unless the uncertainty is large, such as for the BABIP. You can try it out yourself by running the code and changing the values slightly, or just trust me.

I rounded $M$ off to the nearest whole number and $\mu$ to three nonzero digits. The intervals compared were the normal-normal (NN), the beta-binomial (BB), and the normal approximation to the beta-binomial (BB (N)). The resulting coverages were

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.66 & 0.659 & 0.659 \\
BA & 0.268 & 466 & 0.604 & 0.601 & 0.603 \\
1B & 0.158 & 222 & 0.685 & 0.675 & 0.679 \\
2B & 0.0475 & 1025 & 0.532 & 0.532 & 0.531 \\
3B & 0.00492 & 373 & 0.762 & 0.436 & 0.76 \\
XBH & 0.0524 & 1006 & 0.545 & 0.542 & 0.551 \\
HR & 0.0274 & 125 & 0.754 & 0.707 & 0.738 \\
BB & 0.085 & 106 & 0.688 & 0.661 & 0.673 \\
SO & 0.181 & 50 & 0.74 & 0.728 & 0.729 \\
HBP & 0.00866 & 297 & 0.725 & 0.591 & 0.721 \\  \hline\end{array}

So what happened? Shouldn't the 95% intervals have 95% coverage? Well, they should. The problem is, I used the wrong type of interval - the intervals calculated here are for the mean $\theta_i$. But we don't have $\theta_i$. What we have is the second half on-base percentage, which is $\theta_i$ plus the random noise that naturally surrounds $\theta_i$ in however many additional plate appearances. What's appropriate here is a prediction-type interval that attempts to cover not the mean, but a new observation - this interval will have to account for both the uncertainty of estimation and the natural randomness in a new set of observations.


Prediction Intervals


The interval needed is predictive - since the previous intervals were constructed as Bayesian credible intervals, a posterior predictive interval can be used.

Suppose that $\tilde{x_i}$ is the new count of events for player $i$ in $\tilde{n_i}$ new trials. I'm going to assume that $\tilde{n_i}$ is known. I'll also assume that $\tilde{x_i}$ is generated from the same process that generated $x_i$.

$\tilde{x_i} \sim p(\tilde{x_i} | \theta_i, \tilde{n_i})$

The posterior predictive is then

$p(\tilde{x_i}| \tilde{n_i}, x_i, n_i, \mu, M) = \displaystyle \int p(\tilde{x_i} | \theta_i, \tilde{n_i})p(\theta_i | x_i, n_i, \mu, M) d\theta_i$

For a bit more explanation, check out my article on posterior predictive distributions.

Normal-Normal Model


As stated above, the posterior distribution for $\theta_i$ is normal

$\theta_i | x_i, n_i, \mu, M \sim N\left(  (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, \dfrac{\mu (1-\mu)}{n_i + M}\right)$

$B = \dfrac{M}{n_i + M}$

Using a normal approximation to the binomial, the distribution of the new on-base percentage in the second half (call this $\tilde{x_i}/\tilde{n_i}$) is also normal

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \theta_i, \mu  \sim N\left(\theta_i, \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

The posterior predictive is the marginal distribution, integrating out over $\theta_i$ - it is given as

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \tilde{n_i}, x_i, n_i, \mu, M \sim N\left((1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu,  \dfrac{\mu (1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

And so a 95% posterior predictive interval for the on-base percentage in the second half is given by


$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}}$
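A sketch of this interval in R, treating the new sample size $\tilde{n_i}$ as known (same hypothetical player and OBP values as before):

     # Sketch: 95% normal-normal posterior predictive interval for the
     # new proportion in n.new trials
     nn.pred.interval <- function(x, n, n.new, mu, M) {
       center <- (n / (n + M)) * (x / n) + (M / (n + M)) * mu
       se <- sqrt(mu * (1 - mu) / (n + M) + mu * (1 - mu) / n.new)
       c(center - 1.96 * se, center + 1.96 * se)
     }

     nn.pred.interval(x = 105, n = 300, n.new = 300, mu = 0.33, M = 296)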

 

Beta-Binomial Model

 

The posterior distribution for $\theta_i$ is beta

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

The distribution for the number of on-base events $\tilde{x_i}$ in $\tilde{n_i}$ new trials follows a binomial distribution

$\tilde{x_i} \sim Binomial(\tilde{n_i}, \theta_i)$

The posterior predictive for the number of on-base events in the new number of trials is the marginal distribution, which has density

$p(\tilde{x_i}| x_i, n_i, \mu, M, \tilde{n_i}) = \displaystyle  {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$


This is the beta-binomial distribution. It is a discrete distribution that gives the probability of the number of on-base events in $\tilde{n_i}$ new PA, not the actual on-base percentage.

Since it is discrete, it's easy to solve for quantiles

$Q(\alpha) = \displaystyle \min_{k} \{ k : F(k) \ge \alpha \}$
Where

$F(k) = \displaystyle \sum_{\tilde{x_i} \le k} p(\tilde{x_i} | x_i, n_i, \mu, M, \tilde{n_i}) =  \displaystyle \sum_{\tilde{x_i} \le k} \displaystyle  {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$

Since $Q(\alpha)$ is the quantile for the count of events, a 95% interval for the actual on-base proportion is given by

$\left(\dfrac{Q(.025)}{\tilde{n_i}} ,\dfrac{Q(.975)}{\tilde{n_i}}\right)$.
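Base R has no beta-binomial quantile function, but since the support is just $0, 1, \ldots, \tilde{n_i}$, the CDF can be built directly from the density - a sketch (computing the pmf on the log scale so the binomial coefficient and beta functions don't overflow):

     # Sketch: 95% beta-binomial posterior predictive interval for the
     # new proportion, by direct summation of the pmf
     bb.pred.interval <- function(x, n, n.new, mu, M) {
       a <- x + mu * M
       b <- n - x + (1 - mu) * M
       k <- 0:n.new
       pmf <- exp(lchoose(n.new, k) + lbeta(k + a, n.new - k + b) - lbeta(a, b))
       cdf <- cumsum(pmf)
       c(min(k[cdf >= 0.025]), min(k[cdf >= 0.975])) / n.new
     }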

Alternatively, since the distribution is likely to be unimodal and bell-shaped, a normal approximation to the 95% posterior predictive interval is given by

$\left(\dfrac{x_i + \mu M}{n_i  + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu)M)(n_i + M + \tilde{n_i})}{\tilde{n_i} (n_i + M)^2 (n_i + M + 1)}}$

This isn't as good of an approximation as the normal approximation to the beta interval for the mean, but the difference between the intervals is still only around 1% of the length, and it should work well.

Practical Application



I repeated the analysis using the predictive formulas given above, using the first half on-base percentage to try to capture the second half on-base percentage, using the same $\mu$ and $M$ values as before.

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.944 & 0.944 & 0.94 \\
BA & 0.268 & 466 & 0.943 & 0.943 & 0.944 \\
1B & 0.158 & 222 & 0.941 & 0.941 & 0.942 \\
2B & 0.0475 & 1025 & 0.956 & 0.956 & 0.955 \\
3B & 0.00492 & 373 & 0.955 & 0.955 & 0.956 \\
XBH & 0.0524 & 1006 & 0.957 & 0.957 & 0.959 \\
HR & 0.0274 & 125 & 0.951 & 0.951 & 0.952 \\
BB & 0.085 & 106 & 0.925 & 0.925 & 0.921 \\
SO & 0.181 & 50 & 0.918 & 0.918 & 0.92 \\
HBP & 0.00866 & 297 & 0.95 & 0.95 & 0.947 \\ \hline\end{array}

Cautions and Conclusion



Despite the positive results, I think that 95% actual coverage from these intervals is overoptimistic. For one, I selected a very "nice" group of individuals to test it on - nonpitchers with more than 300 PA. Being in this category implies a high talent level and a lack of anything that could drastically change that talent level over the course of the season, such as injury. I also treated the second half sample size $\tilde{n_i}$ as known - obviously, that must be estimated as well, and should add additional uncertainty.

Furthermore, there are clearly other factors at work than just random variation - players can get traded to different environments (a player being traded to or from Coors Field, for example), talent levels may very well change over the course of the season, and events are clearly not independent and identical.

Applying these formulas to the population of players at large should see the empirical coverage drop - my guess (though I haven't tested it) is that 95% intervals should empirically get around 85%-90% actual coverage. Also keep in mind that $M$ and $\mu$ need to be kept updated - using means and stabilization points from too far in the past will lead to shrinkage towards the wrong point.

You can, and should, be able to do better than these intervals in terms of length - these are incredibly simplistic, using only information about the player and information about the population. Adding covariates to the model to account for other sources of variation should allow you to decrease the length without sacrificing accuracy.

Alternatively, you could use these formulas with projections, with $\mu$ as the preseason projection and $M$ representing how many events that projection is "worth." This is more in keeping with the traditional Bayesian sense of the interval, and won't guarantee any sort of coverage.

However, I still think these intervals are useful in that they represent a sort of baseline - any more advanced model that generates predictive intervals should be able to do better than these.

Edit 16 Mar. 2017: I found that the data file I used for this analysis with the split first and second half  statistics was not what I thought it was - it repeated the same player multiple times, giving an inaccurate estimate of the confidence level. I have corrected the data file and re-run the analysis and presented the corrected confidence levels.

16 October, 2015

Stabilization, Regression, Shrinkage, and Bayes

This post is somewhat of a brain dump of all my thoughts on the concept of a "stabilization point" (a term I dislike), how it's being used, assumptions that are (usually) knowingly and (sometimes) unknowingly being made in the process, and when and how it can be used correctly.

Stabilization In Practice


The most well-known stabilization point calculation was performed by Russell Carleton, who took samples of size $n$ of a statistic from multiple MLB players and declared the stabilization point to be the $n$ such that the correlation coefficient $r = 0.7$ (the logic being that this gives $R^2 \approx 0.5$ - however, I don't like this). This approach is nonparametric in the sense that it's not making any assumptions about the underlying structure of the data (for example, that the events are binomially distributed), only about the structure of the residuals, and these have been shown to be fairly reasonable and robust assumptions - in fact, the biggest problems with the original study came down to issues of sampling.

The split-and-correlate method is the most common method used to find stabilization points. This method will work, though it's not especially efficient, and should give good results assuming the sampling is being performed well - more recent studies randomly split the data into two halves and then correlate. In fact, it will work for essentially any statistic, especially ones that are difficult to fit into a parametric model.

In his original study (and subsequent studies), Carleton finds the point at which the split-half correlation is equal to $r = 0.7$, since then $R^2 \approx 0.5$. Others have disagreed with this. Commenter Kincaid on Tom Tango's blog writes

$r=.7$ between two separate observed samples implies that half of the variance in one observed sample is explained by the other observed sample.  But the other observed sample is not a pure measure of skill; it also has random variance.  So you can’t extrapolate that as half of the variance is explained by skill.

I agree with this statement. In traditional regression analysis, the explanatory variable $x$ is conceived as fixed. In this correlation analysis, both the explanatory and response variables are random. Hence, it makes no sense to say that the linear regression with $x$ explains 50% of the variation in $y$ when $x$ is random and, given the same player, in fact independent of $y$. Other arguments have also been made regarding the units of $r$ and $R^2$.

There's a more practical reason, however - a commonly used form of regression towards the mean is given by

$\dfrac{M}{n + M}$

where $M$ is the regression amount towards some mean. Tom Tango notes that, if the stabilization point is estimated as the point at which $r = 0.5$, then this value $M$ can be plugged directly into the regression equation given above. As Kincaid has noted, this is the form of statistical shrinkage for the binomial distribution with a beta prior. More generally, this is the form of the shrinkage coefficient $B$ that is obtained by modeling the outcome of a random event with a natural exponential family and performing a Bayesian analysis using a conjugate prior (see section five of Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions). Fundamentally, this is why the $M/(n + M)$ formula seems to work so well - not because it's beta-binomial Bayes, but because it's also normal-normal Bayes, and gamma-Poisson Bayes, and more - any member of the incredibly flexible natural exponential family.

So simply taking the correlation doesn't make any assumptions about the parametric structure of the observed data - but taking that stabilization point and turning it into a shrinkage (regression) estimator towards the population mean does assume that the observed data come from a natural exponential family, with the corresponding conjugate prior for the distribution of true talent levels.


Mathematical Considerations

 

In practical usage, only certain members of the natural exponential family are considered - the beta-binomial, gamma-Poisson, and normal-normal models, for example, with the normal-normal largely dominating these choices. These form a specific subset of the natural exponential family - the natural exponential family with quadratic variance functions. The advantage these have over general NEF distributions is that, aside from being the most commonly used distributions, they are closed under convolution - that is, the sum of NEFQVF random variables is also NEFQVF - and this makes them ideal for modeling counting statistics, as the forms of all calculations stay the same as new information arrives, requiring only that new estimates and sample sizes be plugged into the formulas.

In a previous post I used Morris's work on the natural exponential family with quadratic variance functions to describe a two-stage model with some raw counting statistic $x_i$ as the sum of $n_i$ trials

 $X_i \sim p(x_i | \theta_i)$
 $\theta_i \sim G(\theta_i | \mu, \eta)$

where $p(x_i | \theta_i)$ is NEFQVF with mean $\theta_i$. If $G(.)$ is treated as a prior distribution for $\theta_i$, then the form of the shrinkage estimator for $\theta_i$ is given by

$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B)\bar{x_i} + B \mu$

where $\bar{x_i} = x_i/n_i$ and $B$, as mentioned before, is the shrinkage coefficient. The shrinkage coefficient is controlled by the average amount of variance at the event level and the variance of $G(.)$, weighted by the sample size $n_i$.

$B = \dfrac{ E[V(\theta_i)]}{ E[V(\theta_i)] + n_i Var(\theta_i)}$

And for NEF models, this simplifies down to

$B = \dfrac{M}{M + n_i}$

Implying that the form of the stabilization point $M$ is given as

$M = \dfrac{E[V(\theta_i)]}{Var(\theta_i)}$

Where $V(\theta_i)$ is the variance around the mean $\theta_i$ at the most basic level of the event (plate appearance, inning pitched, etc.). So under the NEF family, the stabilization point is the ratio of the average variance around the true talent level (the variance being a function of the true talent level itself) to the variance of the true talent levels themselves.

In another post, I showed briefly that for this model, the split-half correlation is theoretically equal to one minus the shrinkage coefficient $B$.

$\rho = 1 - B = \dfrac{n_i}{M+n_i}$

This is another result that has been commonly used. Therefore, to achieve any desired level of correlation $p$ between split samples, the formula

$n = \left(\dfrac{p}{1-p}\right) M$ 

can be used to estimate the sample size required. This formula derives not from any sort of correlation prophecy formula, but just from some algebra involving the forms of the shrinkage coefficient and split-half correlation $\rho$.

It's for this reason that I dislike the name "stabilization point" - in its natural form it is the number of events required for a split-half correlation of $r = 0.5$ (and correspondingly a shrinkage coefficient of $0.5$), but really, you can estimate the split-half correlation and/or shrinkage amount for any sample size just by plugging the corresponding values of $M$ and $n$ into the formulas. In general, there's not going to be much difference between samples of size $n$, $n-1$, and $n+1$ - there's no magical threshold that the sample size can cross such that a statistic suddenly becomes perfectly reliable - and in fact the formula implies a given statistic can never reach 100% stabilization.
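As a quick illustration in R, using the OBP stabilization estimate $M = 296$ quoted elsewhere on this blog, the expected split-half correlation $\rho = n/(M + n)$ at a few sample sizes is:

     # Expected split-half correlation rho = n/(M + n), with M = 296 (OBP)
     n <- c(100, 296, 600, 1000)
     round(n / (296 + n), 2)
     # 0.25 0.50 0.67 0.77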

If I had my choice I'd call it the stabilization parameter, but alas, the name seems to already be set.

Practical Considerations


Note that at no point in the previous description was the shrinkage (regression) explicitly required to be towards the league mean talent level. The league mean is a popular choice to shrink towards; however, if the sabermetrician can construct a different prior distribution from other data (for example, the player's own past results), then all of the above formulas and results can be applied using that prior instead.

When calculating stabilization points, studies have typically used a large sample of players from the league - theoretically, this implies that the league distribution of talent levels is being used as the prior distribution (the $G(.)$ in the two-stage model above), with the resulting so-called stabilization point to be used for any player. In actuality, choices made during the sampling process imply priors which consist of only certain portions of the league distribution of talent levels (mainly, those players that have reached a certain threshold for the number of trials).

In my article that calculated offensive stabilization points, I specified a hard minimum of 300 PA/AB in order to be included in the population. I estimated the stabilization point directly, but the issue also affects the correlation method used by Carleton, Carty, Pavlidis, etc. - in order to be included in the sample, a player must have had enough events (however they are defined), and this limits the sample to a nonrepresentative subset of the population of all MLB players. The effect of this is that the specific point calculated is really only valid for those individuals that meet those PA/AB requirements - even though those are the players who we know the most about! Furthermore, players that accrue more events do so specifically because they have higher talent levels - stabilization points calculated for players who we know will receive, for example, at least 300 PA can't then be turned around and applied to players who we know will accrue fewer than 300 PA. This also explains how two individuals using the same method in the same manner with the same data can arrive at different conclusions, depending entirely on how they chose the inclusion/exclusion rules for their sample.

As a final issue, I used six years' worth of data - in doing this, I made the assumption that the basic shape of true talent levels for the subset of the population I chose had changed negligibly or not at all over those six years. I didn't simply use all available data, however, because I recognize that offensive environments change - the late 90s and early 2000s, for example, are drastically different from the early 2010s. This brings up another point - stabilization points, as they are defined, are a function of the mean (coming into play in the average variance around a statistic) and, primarily, the variance of population talent levels - and both of these change over time. This means there is not necessarily such a thing as "the" stabilization point - as the population of talent levels changes over time, so will the mean and variance (I wrote a couple of articles looking at how offensive and pitching stabilization points have changed over time), so stabilization points in articles that were published just a few years ago may or may not be valid any longer.


Conclusion

Even after all this math, I still think the split-and-correlate method should be thought of as the primary method for calculating stabilization points, since it works on almost any kind of statistic, even more advanced ones that don't fit clearly into a NEF or NEFQVF framework. Turning around and using the results of that analysis to perform shrinkage (regression towards the mean), however, does make very specific assumptions about the form of both the observed data and underlying distribution of talent levels. Furthermore, sampling choices made at the beginning can strongly affect the final outcome, and limit the applicability of your analysis to the larger population. And if you remember nothing else from this - there is no such thing as "the" stabilization point, either in terms of when a statistic is reliable (it's always somewhat unreliable, the question is by how much) or one value that applies to all players at all times (since it's a function of the underlying distribution of talent levels, which is always changing).

This has largely been just a summary of techniques, studies, and research others have done - I know others have expressed similar opinions as well - but I found the topic interesting and I wanted to explain it in a way that made sense to me. Hopefully I've made a little more clear the connections between statistical theory and things people were doing just because they seemed to work.

Various Links

 

These are just some of the various links I read to attempt to understand what people were doing in practice and attempt to connect it to statistical theory: 

Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions: Statistical Theory: http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/NEF-QVF_1983_2240566.pdf

Russell Carleton's original reliability study: http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/

Carleton's updated calculations: http://www.baseballprospectus.com/article.php?articleid=20516

Tom Tango comments on Carleton's article:
http://tangotiger.com/index.php/site/comments/point-at-which-pitching-metrics-are-half-signal-half-noise


Derek Carty's stabilization point calculations: http://www.baseballprospectus.com/a/14215#88434

Tom Tango discusses Carty's article, the $r = 0.7$ versus $r = 0.5$ threshold, and regression towards the mean: http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/

Steve Staude discusses $r = 0.5$ versus $r = 0.7$: http://www.fangraphs.com/blogs/randomness-stabilization-regression/

Tom Tango comments on Steve's work: http://tangotiger.com/index.php/site/comments/randomness-stabilization-regression 

Tom Tango links to Phil Birnbaum's proof of the regression towards the mean formula: http://tangotiger.com/index.php/site/comments/blase-from-the-past-proof-of-the-regression-toward-the-mean

Kincaid shows that the beta-binomial model produces the regression towards the mean formula: http://www.3-dbaseball.net/2011/08/regression-to-mean-and-beta.html

Harry Pavlidis looks at stabilization for some pitching events: http://www.hardballtimes.com/it-makes-sense-to-me-i-must-regress/

Tom Tango discusses Harry's article, and gives the connection between regression and stabilization: http://www.insidethebook.com/ee/index.php/site/article/regression_equations_for_pitcher_events/

Great summary of various regression and population variance estimation techniques - heavy on the math: http://www.countthebasket.com/blog/2008/05/19/regression-to-the-mean/

The original discussion on regression and shrinkage from Tom Tango's archives: http://www.tangotiger.net/archives/stud0098.shtml





21 September, 2015

The Posterior Predictive

Let's say you do a Bayesian analysis and end up with a posterior distribution $p(\theta | y)$. What does that tell you about a new observation $\tilde{y}$ from some data-generating process that involves $\theta$? The answer can be found using the posterior predictive distribution.

The code that I used to generate the images in this article can be found on my github.


Posterior Predictive


 Given a posterior distribution $p(\theta | y)$, the posterior predictive distribution is defined as

$p(\tilde{y} | y) = \int p(\tilde{y} | \theta) p(\theta | y) d\theta$

and it represents the distribution of a new observation $\tilde{y}$ given your updated information about a parameter $\theta$ and natural variation around the observation that arises from the data-generating process.

Since applied Bayesian techniques have tended towards fully computational MCMC procedures, the posterior predictive is usually obtained through simulation - let's say you have a sample $\theta_j^*$ ($j = 1,2,...,k$) from the posterior distribution of $\theta$ and you want to know about a new observation from some process that uses the parameter $\theta$.

$\tilde{y} \sim p(\tilde{y} | \theta)$

To obtain the posterior predictive, you would simulate a set of observations $\tilde{y}^*_j$ from $p(\tilde{y} | \theta_j^*)$ (in other words, simulate a new observation from the data model for each draw from the MCMC sample for your parameter). The distribution of these $\tilde{y}^*_j$ approximates the posterior predictive distribution for $\tilde{y}$.

In general, this process is very specific to the problem at hand. It is possible in a few common scenarios, however, to calculate the posterior predictive distribution analytically. One example that is useful in baseball analysis is the beta-binomial model.


Beta-Binomial Example


Let's say a batter obtains a 0.300 OBP in 250 PA - that corresponds to 75 on-base events and 175 not on-base events. What can you say about the distribution of on-base events in a new set of 250 PA?

Suppose that the distribution of on-base events is given by a binomial distribution with $n = 250$ and chance of getting on base $\theta$, which is the same in both sets of PA.

$y | \theta \sim Binomial(250, \theta)$

For the prior distribution, let's suppose that a $Beta(1,1)$ distribution was used - this is a uniform distribution between zero and one - so any possible value for $\theta$ is equally likely. Since the beta and binomial are conjugate distributions, the posterior distribution of  $\theta$ (the batter's chance of getting on base) is also a beta distribution:

$p(\theta| y = 75) = \dfrac{\theta^{75+1-1}(1-\theta)^{175+1-1}}{\beta(75+1,175+1)} \sim Beta(76,176)$

Now, suppose we are planning to observe another 250 PA for the same batter, and we want to know the distribution of on-base events $\tilde{y}$ in the new 250 PA. This distribution is also binomial

$p(\tilde{y} | \theta) = \displaystyle {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}}$

 The posterior predictive distribution for the number of on-base events in another 250 PA is then obtained by multiplying the two densities and integrating out $\theta$.

$p(\tilde{y} | y = 75)  = \displaystyle \int_0^1 {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}} *  \dfrac{\theta^{75}(1-\theta)^{175}}{\beta(76,176)} d\theta$

The resulting distribution is known as the beta-binomial distribution, which has density

$p(\tilde{y} | y = 75)  =\displaystyle  {250 \choose \tilde{y}} \dfrac{\beta(76 + \tilde{y}, 426-\tilde{y})}{\beta(76,176)}$

(The beta-binomial distribution is obtained from the beta-binomial model - it does get a bit confusing, but they are different things - the beta-binomial model can be thought of as a binomial with extra variance)

Now I can use the posterior predictive for inference. If, for example, I wanted to know the probability that a player will have a 0.300 OBP in another 250 PA (corresponding, again, to $\tilde{y}$ = 75 on-base events) then I can calculate that as

$p(\tilde{y} = 75 | y = 75) = \displaystyle  {250 \choose 75} \dfrac{\beta(76 + 75, 426-75)}{\beta(76,176)} \approx 0.0389$

That is, our updated information says there's a 3.89% chance of getting exactly a 0.300 OBP in a new 250 PA by the same player.
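As a quick check, the same number can be computed in R on the log scale (a small sketch; lchoose and lbeta keep the large binomial coefficient and beta function values from overflowing):

     exp(lchoose(250, 75) + lbeta(76 + 75, 426 - 75) - lbeta(76, 176))

which returns the 0.0389 value given above.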

The actual distribution of OBP in a new 250 PA is given by



This can also be accomplished by simulation - first, by simulating a large number $k$ of $\theta^*$ values from the posterior

$\theta^*_j \sim Beta(76,176)$

And then using those $\theta^*$ values to simulate from the data model $p(\tilde{y} | \theta^*)$

$y^*_j \sim Binomial(250, \theta^*_j)$

The estimated probability of a 0.300 OBP (equivalent to 75 on-base events) is then

$P(0.300 | y = 75) \approx \displaystyle \dfrac{\textrm{# } y^*_j \textrm{ that equal 75}}{k} $

This is much easier to do in R - with $k = 1000000$, the code to quickly perform this is

     > theta <- rbeta(1000000, 76,176) #Simulate from Posterior
     > y <- rbinom(1000000, 250, theta) #Simulate from data model
     > mean(y == 75) #Estimate P(y = 75)
     [1] 0.039077


Notice that the result is very close to the analytic answer of 3.89%. The simulated posterior predictive distribution for OBP is



Which visually looks very similar to the analytic version.

What's the Use?


Aside from doing (as the name implies) prediction, the posterior predictive is very useful for model checking - if data simulated from the posterior predictive (new data) is similar to the data you used to fit the model (old data), that is evidence that the model is a good fit. Conversely, if your posterior predictive data looks nothing like your original data set, you may have misspecified your model (and I'm criminally oversimplifying here - I recommend Bayesian Data Analysis by Gelman et al. for all the detail on model-checking using the posterior predictive distribution).

14 September, 2015

Pitching Stabilization Points through Time

In the previous post, I calculated some stabilization points for pitching statistics for the past few years. In this post, I want to look at how some of those stabilization points have changed over time.

(I have previously done this for offensive statistics)

Each stabilization point is a six-year calculation, including the current and five previous years (so, for example, 2014 includes 2009-2014 data, 1965 includes 1960-1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error, it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only starting pitchers from each year, and for cutoff values (the minimum number of TBF, BIP, etc. to be included in the dataset) I used the same values as in my previous post. Years were split for the same player. Raw counts are used, not adjusted in any form. Relief pitchers are excluded.

All of the code that I used to create these can be found in my github, though I make no claims to efficiency or ease of operation. Because I added this code several months after the article was originally posted, I did not clean and annotate it as I normally would have - I just posted the raw code. The code is a modified form of the code used to calculate offensive stabilization points over time.

All of the plots shown below, and more, can be found in my imgur account.

Historical Plots


For some statistics, I will show plots of both the mean of the statistic over time and the stabilization point. The stabilization point is driven largely by the underlying population variance of talent levels, which tends to be more difficult to estimate than the mean - hence the reason that, even with six years of moving data, the 'stabilization point' will appear to fluctuate quite a bit. I recommend not reading too much into the fluctuations, but rather looking for more general patterns.

Firstly, the ground ball, fly ball, and line drive rates (per BIP) only have recent data available. In that time, neither the fly ball nor the ground ball stabilization point has changed much.


Line drive rate appears to have increased in recent years, however.

Though keep in mind the standard error is approximately 100 balls in play.

More interesting is batting average on balls in play, for which we have more data. The standard error for BABIP is approximately 500 balls in play, so it's not wise to trust small fluctuations in this plot as representing real shifts - however, it does appear that there is a positive trend in the stabilization point, indicative of the spread in BABIP values getting smaller. (A plot with 95% error bounds at each point can be found here, though I don't necessarily care for it)

The mean can be estimated with more accuracy - and it shows that batting average on balls in play is at its highest point in history.

 An animated plot shows how the mean and variance of the observed (histogram) and estimated true talent (dashed line) distributions have changed over time.
As I've previously mentioned, the primary driving force behind stabilization points is the underlying population variance. For example, take the strikeout rate (per batter faced): since the dead ball era, its stabilization point has followed a pattern of fairly consistent decrease (with a recent upsurge that still places it within previously observed ranges).


Over time, however, the mean strikeout rate (per batter faced) has been on the increase.

What does coincide with the decrease in stabilization point is the increase in population variance over time, as seen in this animated plot with the observed strikeout rates (histogram) and estimated true talent distribution (dashed line) - the spread in both is constantly increasing over time.


Also interesting is the earned run rate (per inning pitched, min 80 IP).

Beginning in the early 2000s, it dropped to a very low point relative to its history, and has remained there more consistently than in the past. Meanwhile, the stabilization point for walk rate (min 400 BF) has increased in recent years, after reaching a maximum in the 1980s and then declining.



On-base percentage and hit-by-pitch rate have both fluctuated within a relatively stable range over time.





Though interestingly, both (per batter faced) took a dip in the 1960s that corresponds to an increase in the mean hit-by-pitch rate and a decrease in the mean on-base percentage.




For some statistics, such as WHIP and home run rate, it is difficult to discern a pattern other than fluctuations within a certain range.


An interesting look at how certain things have changed over time - though, as I mentioned before, I would encourage not reading too much into these plots.

03 September, 2015

More Pitching Stabilization Points


Using the beta-binomial model (notated BB) or the gamma-Poisson model (notated GP, and in this post what I call M is what in the previous post I called K - the variance parameter of the population talent distribution), I calculated the stabilization point for some more pitching statistics. I don't think the model(s) fit perfectly to the data, but they provide a good approximation that generally matches up with results I've seen elsewhere on the web.

Data was acquired from fangraphs.com. I only considered starting pitchers from 2009 - 2014, splitting the same pitcher between years, and did not adjust the data in any way.

All the data and code I used here may be found on my github. I make no claims to efficiency or ease of use.

The "cutoff value" is the minimum number of the denominator (IP, TBF, BIP, etc.) in a year in order to be included in the data set. These numbers were chosen somewhat arbitrarily, and for some of my statistics, changing the cutoff value will change the stabilization point. I'm not sure which statistics this will happen to - I know WHIP for sure, and I suspect ER as well, whereas I think BABIP doesn't exhibit this tendency. It's a function of the change (or lack thereof) in population variance of talent levels as the cutoff value increases - if somebody wants to take a look at it, it would be neat.

I wanted to have a little fun and apply the model to stats where it is clearly silly to do so, such as win rate (which I defined as wins per game started) and extra batters faced per inning (the total number of additional batters a pitcher faced beyond what is required by their IP). The model still produces estimates, but of course, bad data fed into a good model doesn't magically produce good analysis.

\begin{array}{| l | l | c | c | c | c  | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\textrm{95% CI}&\textrm{Cutoff}&\textrm{Model}\\ \hline
\textrm{BABIP}&\textrm{(H-HR)/n*}&2006.71&484.94&(1056.22,2957.20)&300&BB\\
\textrm{GB Rate}&\textrm{GB/BIP}&65.52&3.63&(58.39,72.64)&300&BB\\
\textrm{FB Rate}&\textrm{FB/BIP}&61.96&3.42&(55.25,68.66)&300&BB\\
\textrm{LD Rate}&\textrm{LD/BIP}&768.42&94.10&(583.99,952.86)&300&BB\\
\textrm{HR/FB Rate}&\textrm{HR/FB}&505.11&93.95&(320.96,689.26)&100&BB\\
\textrm{SO Rate}&\textrm{SO/TBF}&90.94&5.04&(81.06,100.82)&400&BB\\
\textrm{HR Rate}&\textrm{HR/TBF}&931.59&107.80&(720.30,1142.88)&400&BB\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}&221.25&14.43&(192.97,249.53)&400&BB\\
\textrm{HBP Rate}&\textrm{HBP/TBF}&989.30&119.95&(754.21,1224.41)&400&BB\\
\textrm{Hit rate}&\textrm{H/TBF}&623.35&57.57&(510.51,736.18)&400&BB\\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}&524.73&44.96&(436.62,612.84)&400&BB\\
\textrm{Win Rate}&\textrm{W/GS}&57.23&8.68&(40.21,74.24)&15&BB\\
\textrm{WHIP}&\textrm{(H + BB)/IP**}&77.20&5.46&(66.50,87.90)&80&GP\\
\textrm{ER Rate}&\textrm{ER/IP**}&59.55&3.94&(51.82,67.25)&80&GP\\
\textrm{Extra BF}&\textrm{(TBF - 3IP**)/IP**}&73.00&5.08&(63.05,82.95)&80&GP\\ \hline
\end{array}

* I'm not exactly sure what combination of statistics fangraphs is using for the denominator of their BABIP - it's not BIP = GB + FB + LD. I know the numerator of H - HR is correct, but the denominator was usually smaller, though sometimes larger, than BIP. I solved for what fangraphs was using and used that in my calculations - if somebody wants to let me know exactly what they're using for n, please do.

** When dividing by IP, I corrected the 0.1 and 0.2 decimal representations to 0.33 and 0.67. 

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of talent levels, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of events at which there is 50% shrinkage.

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula.

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

It represents a 95% confidence interval for the number of events at which there is 50% shrinkage.

For an arbitrary stabilization level $p$, the required number of events can be estimated as

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$

And a 95% confidence interval for the required number of events is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$
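As a small sketch in R, using the strikeout rate row of the table above ($\hat{M} = 90.94$, $SE(\hat{M}) = 5.04$) as the example:

     # Sketch: events needed for stabilization level p, with a 95% CI
     stab.n <- function(p, M.hat, se.M) {
       mult <- p / (1 - p)
       mult * c(M.hat, M.hat - 1.96 * se.M, M.hat + 1.96 * se.M)
     }

     # e.g. 70% stabilization for strikeout rate
     stab.n(0.7, 90.94, 5.04)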

Since the denominators are so different (as opposed to offensive statistics, where PA was used for almost everything except batting average, and AB is fairly close to PA), I don't feel as comfortable putting everything on the same plot. That being said, the stats that use TBF look like



And the stats that use BIP for their denominator look like

As always, comments are appreciated.

02 September, 2015

2015 Win Prediction Totals (Through August)

These predictions are based on my own method (which can be improved). I set the nominal coverage at 95% (meaning that, the way I calculated them, the intervals should capture the true value 95% of the time), and I think by this point in the season the actual coverage should be close to that.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c}
\textrm{Team}  & \textrm{Lower}  & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline

ATL & 61 & 66.56 & 72 & 65.57 \\
ARI & 73 & 78.98 & 85 & 83.48 \\
BAL & 72 & 77.94 & 84 & 83.3 \\
BOS & 70 & 75.88 & 82 & 77.71 \\
CHC & 85 & 90.58 & 96 & 83.91 \\
CHW & 70 & 76.24 & 82 & 74.78 \\
CIN & 63 & 68.64 & 75 & 74.14 \\
CLE & 74 & 80.20 & 86 & 82.03 \\
COL & 61 & 67.29 & 73 & 70.14 \\
DET & 69 & 74.75 & 81 & 74.66 \\
HOU & 84 & 89.99 & 96 & 91.79 \\
KCR & 92 & 97.84 & 104 & 90.34 \\
LAA & 74 & 79.95 & 86 & 78.16 \\
LAD & 84 & 90.31 & 96 & 87.66 \\
MIA & 61 & 66.79 & 73 & 74.49 \\
MIL & 64 & 69.71 & 76 & 74.48 \\
MIN & 77 & 82.69 & 89 & 79.41 \\
NYM & 84 & 89.67 & 95 & 87.19 \\
NYY & 83 & 89.34 & 95 & 87.76 \\
OAK & 68 & 73.33 & 79 & 82.77 \\
PHI & 58 & 63.87 & 70 & 64.11 \\
PIT & 91 & 97.21 & 103 & 89.43 \\
SDP & 73 & 78.54 & 84 & 75.97 \\
SEA & 69 & 74.32 & 80 & 71.95 \\
SFG & 80 & 85.61 & 91 & 86.79 \\
STL & 98 & 103.63 & 109 & 97.33 \\
TBR & 76 & 81.45 & 87 & 80.79 \\
TEX & 78 & 83.60 & 90 & 79.00 \\
TOR & 87 & 92.88 & 98 & 98.67 \\
WSN & 77 & 82.61 & 89 & 84.04 \\  \hline\end{array}

To explain the difference between "Mean" and "True Win Total": imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that the first 5 flips gave 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean," representing the expected number of wins from true ability over the remaining schedule added to the current number of wins (at the end of August).
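As a minimal sketch of that bookkeeping in R (with entirely made-up numbers, not output from my model):

# "Mean" = current wins + expected wins from estimated true talent
# over the remaining schedule. All numbers here are hypothetical.
current.wins  <- 70
games.played  <- 130
true.win.rate <- 0.55    # estimated true talent, in wins per game
current.wins + true.win.rate * (162 - games.played)   # 70 + 0.55 * 32 = 87.6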

These quantiles are based on a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky.

27 August, 2015

WHIP Stabilization by the Gamma-Poisson Model

I've previously covered shrinkage estimation for offensive statistics - or at least, those that can be written as binomial events. In a previous post, I showed that for models that follow the natural exponential family with quadratic variance function, the split-half correlation is equal to one minus the shrinkage coefficient $B$.

The techniques I used can also be applied when the outcome at the most basic level (at-bat, inning pitched, etc.) is not binary. In particular, the Poisson distribution fits within the framework I derived, as it is also a member of the natural exponential family with quadratic variance function, so events that can be modeled as Poisson at the base level follow the same basic principles as the binomial outcomes. I chose WHIP (walks + hits per inning pitched) to illustrate the method: the number of walks + hits in a given inning can be 0, 1, 2, ..., so it matches the support of the Poisson.

Model and Estimation


I will assume that in each inning, pitcher $i$ gives up a number of walks + hits that follows a Poisson distribution with mean $\theta_i$, which is unique to each pitcher. The total number of walks + hits given up in $n_i$ innings is $x_i$, and I have $N$ pitchers in total. I considered only starting pitchers from 2009-2014, treating different years from the same pitcher as separate observations. My code and data are on github for anybody who wants to check my calculations.

Since the sum of independent Poissons is also Poisson, the total number of walks + hits $x_i$ given up in $n_i$ innings follows a Poisson distribution with mean $n_i \theta_i$ and mass function

$p(x_i | \theta_i, n_i) =  \dfrac{e^{-n_i \theta_i} (n_i \theta_i)^{x_i}}{x_i !}$

I will also assume that the distribution of means $\theta_i$ follows a gamma distribution with mean $\mu$ and variance parameter $K$ (in this parametrization, $\mu = \alpha/\beta$ and $K = \beta$ as opposed to the traditional $\alpha, \beta$ parametrization). This distribution has density

$f(\theta_i | \mu, K) = \dfrac{K^{\mu K}}{\Gamma(\mu K)} \theta_i^{\mu K - 1} e^{-K \theta_i}$
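As a quick sanity check on this parametrization (with illustrative values, not my estimates), simulating in R should give back mean $\mu$ and variance $\mu/K$:

# theta_i ~ Gamma(shape = mu * K, rate = K) has mean mu and variance mu / K.
set.seed(42)
mu <- 1.3; K <- 77
theta <- rgamma(100000, shape = mu * K, rate = K)
mean(theta)   # close to mu = 1.3
var(theta)    # close to mu / K = 0.0169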

As shown in a previous post, the split-half correlation is then one minus the shrinkage coefficient $B$, or

$\rho = 1 - B = \left(\dfrac{n_i}{n_i + K}\right)$
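A short simulation (same illustrative values as above) confirms this: draw true talents from the gamma, generate two independent halves of $n$ innings each, and correlate the observed rates.

# Split-half correlation under the gamma-Poisson model should be n / (n + K).
set.seed(42)
mu <- 1.3; K <- 77; n <- 80; N <- 100000
theta <- rgamma(N, shape = mu * K, rate = K)
r1 <- rpois(N, n * theta) / n   # first-half walk + hit rates
r2 <- rpois(N, n * theta) / n   # second-half walk + hit rates
cor(r1, r2)                     # close to n / (n + K) = 80 / 157 = 0.509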

So once I have an estimate $\hat{K}$ and a desired stabilization level $p$, solving for $n$ gives

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{K}$

Once again, the population variance parameter $K$ is equivalent to the 0.5 stabilization point - the point where the split-half correlation should be exactly 0.5, and also the point where individual pitcher estimates are shrunk 50% of the way towards the league mean.
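To make the shrinkage explicit: under the model, the posterior mean for pitcher $i$ is $(x_i + \mu K)/(n_i + K)$, a weighted average of the raw rate and the league mean. A sketch in R, with a made-up pitcher and the league estimates reported later in this post:

# Posterior mean (x + mu * K) / (n + K) puts weight n / (n + K) on the raw
# rate x / n and weight K / (n + K) on the league mean mu.
shrink.whip <- function(x, n, mu, K) (x + mu * K) / (n + K)

# Hypothetical pitcher: 110 walks + hits in 80 IP (raw WHIP 1.375), with
# the league estimates from later in this post (mu = 1.304, K = 77.203).
shrink.whip(x = 110, n = 80, mu = 1.304, K = 77.203)   # about 1.34

Since $n = 80$ is nearly equal to $\hat{K} \approx 77$, the raw WHIP is pulled almost exactly halfway back towards the league mean.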

For estimation of $\mu$ and $K$, I used marginal maximum likelihood - a one-dimensional introduction to maximum likelihood is given here. The marginal mass function of $x_i$ given $\mu$ and $K$ is

$p(x_i | n_i, \mu, K) = \displaystyle \int_0^{\infty} \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} e^{-\theta_i (n_i + K)}  \theta_i^{x_i + \mu K - 1} d\theta_i = \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} \dfrac{\Gamma(x_i + \mu K)}{(n_i + K)^{x_i + \mu K}}$

And the log-likelihood (dropping terms that do not involve either $\mu$ or $K$) is given by

$\ell(\mu, K) =  N \mu K \log(K) - N \log(\Gamma(\mu K)) + \displaystyle \sum_{i = 1}^N \left[\log(\Gamma(x_i + \mu K)) - (x_i + \mu K) \log(n_i + K)\right]$

Once again, I wrote code to maximize this function in $R$ using a Newton-Raphson algorithm. For the estimation itself I reparametrized $K$ as $\phi = 1/(1 + K)$ in the equation above, then converted back via $K = (1-\phi)/\phi$ once estimation was complete - this makes the procedure much more numerically stable.
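My Newton-Raphson code is on github; purely as a sketch, the same fit can also be done with R's built-in optim(), assuming x and n are the vectors of walk + hit totals and innings pitched (one entry per pitcher-season):

# Negative marginal log-likelihood from above, with mu on the log scale and
# phi = 1 / (1 + K) on the logit scale for numerical stability.
negloglik <- function(par, x, n) {
  mu  <- exp(par[1])
  phi <- plogis(par[2])
  K   <- (1 - phi) / phi
  N   <- length(x)
  -(N * mu * K * log(K) - N * lgamma(mu * K) +
      sum(lgamma(x + mu * K) - (x + mu * K) * log(n + K)))
}

# Illustrative starting values: mu near 1.3, K near 75.
fit <- optim(c(log(1.3), qlogis(1 / 76)), negloglik, x = x, n = n,
             hessian = TRUE)
phi.hat <- plogis(fit$par[2])
mu.hat  <- exp(fit$par[1])
K.hat   <- (1 - phi.hat) / phi.hat

# Delta method for Var(K.hat): dK / d logit(phi) = -(1 - phi) / phi.
V    <- solve(fit$hessian)   # asymptotic covariance on the working scale
se.K <- sqrt(((1 - phi.hat) / phi.hat)^2 * V[2, 2])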

In performing this estimation, I had to choose the minimum number of innings pitched (IP) required for a pitcher to be included in the dataset. When performing a similar analysis for on-base percentage, I found that at around 300 PA the population variance (and hence the stabilization point) became roughly constant. Unfortunately, this is not true for starting pitchers.



The population variance in talent levels decreases steadily as a function of the minimum number of IP considered, and so the stabilization point $K$ increases. This means that, unlike for OBP, the stabilization point is always determined by what percentage of pitchers (by IP) you look at - if you look at only the top 50%, the stabilization point will be larger than for the top 70%.

This is reflected in the plot below - as with OBP and PA, the mean WHIP is associated with the number of IP, but unlike with OBP, the variance around the mean changes along with the mean.



For my calculations, I chose 80 innings pitched as the cutoff - corresponding to approximately 15 games started, and capturing slightly more than 50% of pitchers (by IP). This cutoff was completely arbitrary, though, and other choices will be equally valid depending on the question at hand.

Performing the estimation, the estimated league mean WHIP was $\hat{\mu} = 1.304$ with variance parameter $\hat{K} = 77.203$.



Once again, 95% confidence intervals for a specific stabilization level $p$ are given by

$\left(\dfrac{p}{1-p}\right) \hat{K} \pm 1.96 \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{K})}$

From (delta-method transformed) maximum likelihood output, $Var(\hat{K}) = 29.791$ (for a standard error of $5.459$ IP). The stabilization curve, with confidence bounds, is then


Aside from the model criticisms I've already mentioned, the standard ones apply - innings pitched are not identical and independent (and treating them as such is clearly much worse than treating plate appearances as identical and independent), pitchers are not machines, etc. I don't think the model is great, but it is useful: it gives confidence bounds for the stabilization point, something other methods don't do. As always, comments are appreciated.