Probabilaball

For hitters, cold streaks run colder than hot streaks run hot

2023-10-25T02:12:00.003-05:00

This blog post is just the product of a thought exercise: how much information do you get from a certain number of plate appearances? Suppose we observe $n = 25$ plate appearances for a batter. If the batter gets on base $x = 5$ times, is that the same amount of information as if the batter gets on base $x = 10$ times?

The answer is no. As it turns out, a batter reaching base fewer times is more informative about the batter being "bad" than a batter reaching base more often is about the batter being "good." How is this possible? Consider the very simple case of forming a standard 95% confidence interval for a binomial proportion. From any statistics textbook, this is just

$\hat{p} \pm 1.96 \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

where $\hat{p}$ is the proportion of on-base events for $n$ plate appearances. Consider the second part, which I will refer to as the "margin of error" and which controls the width of the confidence interval. For $n = 25$ plate appearances, $x = 5$ gives $\hat{p} = 5/25 = 0.2$ and gives

$1.96 \sqrt{\dfrac{0.2(1-0.2)}{25}} = 0.1568$

For $x = 10$ on-base events, this gives $\hat{p} = 10/25 = 0.4$ and

$1.96 \sqrt{\dfrac{0.4(1-0.4)}{25}} = 0.19204$

The margin of error of the confidence interval is about 22% larger for $\hat{p} = 0.4$ than for $\hat{p} = 0.2$! There is more uncertainty with a better result.
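The two margins of error are easy to reproduce in R. This is only a minimal sketch: the helper name `wald_margin` is made up for illustration and is not part of any package.

```r
# Margin of error for a Wald (normal-approximation) interval
# for a binomial proportion. `wald_margin` is a hypothetical helper name.
wald_margin <- function(x, n, conf = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - conf) / 2)  # about 1.96 for a 95% interval
  z * sqrt(p_hat * (1 - p_hat) / n)
}

m5  <- wald_margin(5, 25)   # x = 5 on-base events
m10 <- wald_margin(10, 25)  # x = 10 on-base events

round(c(m5, m10), 4)
m10 / m5  # ratio of the two margins, equal to sqrt(0.24 / 0.16)
```

Because $n$ is the same in both cases, the ratio depends only on $\hat{p}(1-\hat{p})$, which grows as $\hat{p}$ moves from 0.2 toward 0.5.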

Going to a Bayesian framework does not fix this issue, with the possible exception of heavily weighted priors, which would not be justifiable in practice. Suppose that the number of on-base events $x$ in $n$ plate appearances once again follows the (overly simplistic) binomial distribution with parameter $p$, and $p$ is assumed to have a $Beta(1,1)$ distribution, which is the simple uniform case.

$x \sim Bin(n, p)$

$p \sim Beta(1,1)$

For the case of $x = 5$ on-base events in $n = 25$ plate appearances, the posterior distribution, its standard deviation, and the 95% central credible interval are

$p | x = 5, n = 25 \sim Beta(6, 21)$

$SD(p | x = 5, n = 25) = \sqrt{\dfrac{(6)(21)}{(6 + 21)^2 (6 + 21 + 1)}} = 0.0786$

95% Central CI: $(0.0897,0.3935)$

For the case of $x = 10$ on-base events in $n = 25$ plate appearances, the posterior distribution, its standard deviation, and the 95% central credible interval are

$p | x = 10, n = 25 \sim Beta(11, 16)$

$SD(p | x = 10, n = 25) = \sqrt{\dfrac{(11)(16)}{(11 + 16)^2 (11 + 16 + 1)}} = 0.0929$

95% Central CI: $(0.2335,0.5942)$

Once again, the scenario with worse performance ($x = 5$ on-base events) has a smaller standard deviation, implying there is less posterior uncertainty about $p$. In addition, the width of the 95% central credible interval is smaller for $x = 5$ ($0.3038$) than for $x = 10$ ($0.3608$).
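The posterior summaries can be checked with base R. This is a rough sketch under the uniform $Beta(1,1)$ prior used here; `beta_posterior` is a hypothetical helper name, not a library function.

```r
# Posterior summaries for a binomial count under a Beta(a, b) prior.
# `beta_posterior` is a hypothetical helper name.
beta_posterior <- function(x, n, a = 1, b = 1, conf = 0.95) {
  a_post <- a + x          # prior successes plus observed successes
  b_post <- b + n - x      # prior failures plus observed failures
  half_alpha <- (1 - conf) / 2
  total <- a_post + b_post
  list(
    shape = c(a_post, b_post),
    sd = sqrt(a_post * b_post / (total^2 * (total + 1))),
    interval = qbeta(c(half_alpha, 1 - half_alpha), a_post, b_post)
  )
}

low  <- beta_posterior(5, 25)   # posterior is Beta(6, 21)
high <- beta_posterior(10, 25)  # posterior is Beta(11, 16)

c(low$sd, high$sd)                             # posterior standard deviations
c(diff(low$interval), diff(high$interval))     # credible interval widths
```

Changing `a` and `b` shows how strong a prior has to be before the two cases carry similar posterior uncertainty.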

So how much information is in the $n$ trials? One way to define information, in a probabilistic sense, is with a concept called the observed information. Observed information is a statistic that estimates the Fisher information, which measures the amount of information that a sample carries about an unknown parameter $\theta$. Unfortunately, calculating the Fisher information requires knowing the parameter in question, and so it is usually estimated. The log-likelihood of a set of observations $\tilde{x} = \{x_1, x_2, \dots, x_n\}$ is defined as

$\displaystyle \ell (\theta | \tilde{x}) = \sum_{i = 1}^n \log[ f(x _i | \theta)]$

And the observed information is defined as the negative second derivative of the log-likelihood, taken with respect to $\theta$.

$I(\tilde{x}) = -\dfrac{d^2}{d \theta^2} \ell (\theta | \tilde{x})$

Note that Bayesians may replace the log-likelihood with the log-posterior distribution of $\theta$.

In general the observed information must be calculated for every model, but is known for certain models. For the binomial distribution, the observed information is

$I(\hat{p}) = \dfrac{n}{\hat{p}(1-\hat{p})}$

where $\hat{p} = x/n$. Hence, for the case of $x = 5$ on-base events in $n = 25$ plate appearances, $I(0.2) = 156.25$. For the case of $x = 10$ on-base events in $n = 25$ plate appearances, $I(0.4) = 104.1667$. There is quite literally more information in the case where the batter performed worse.
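The binomial formula follows in a few lines from the definition of observed information. As a sketch, write the log-likelihood for $x$ on-base events in $n$ plate appearances, dropping the constant term that does not involve $p$:

$\ell (p | x) = x \log(p) + (n - x) \log(1 - p)$

$-\dfrac{d^2}{d p^2} \ell (p | x) = \dfrac{x}{p^2} + \dfrac{n - x}{(1-p)^2}$

Substituting $p = \hat{p} = x/n$, so that $x = n\hat{p}$ and $n - x = n(1 - \hat{p})$, gives

$I(\hat{p}) = \dfrac{n}{\hat{p}} + \dfrac{n}{1 - \hat{p}} = \dfrac{n}{\hat{p}(1-\hat{p})}$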

For pitchers, the opposite is the case. If we model the number of events in a fixed number of trials (such as runs allowed per 9 innings or walks plus hits per inning pitched), the most appropriate simple distribution is the Poisson distribution with parameter $\lambda$. For $n$ trials, the observed information is

$I(\hat{\lambda}) = \dfrac{n}{\hat{\lambda}}$

where $\hat{\lambda} = \bar{x}$, the sample mean of observations.
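The Poisson result comes from the same definition. As a sketch, let $x_1, \dots, x_n$ be the event counts in each of the $n$ trials and drop the terms that do not involve $\lambda$:

$\displaystyle \ell (\lambda | \tilde{x}) = \log(\lambda) \sum_{i = 1}^n x_i - n \lambda$

$\displaystyle -\dfrac{d^2}{d \lambda^2} \ell (\lambda | \tilde{x}) = \dfrac{1}{\lambda^2} \sum_{i = 1}^n x_i$

Substituting $\lambda = \hat{\lambda} = \bar{x}$, so that $\sum x_i = n\hat{\lambda}$, gives

$I(\hat{\lambda}) = \dfrac{n \hat{\lambda}}{\hat{\lambda}^2} = \dfrac{n}{\hat{\lambda}}$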

Imagine two pitchers: one has allowed $x = 30$ walks plus hits in $n = 25$ innings pitched, while the other has allowed $x = 20$ walks plus hits in $n = 25$ innings pitched. For which pitcher do we have more information about their abilities?

For the first pitcher, their sample WHIP is $\hat{\lambda} = 30/25 = 1.2$ and their observed information is $I(1.2) = 25/1.2 = 20.8333$. For the second pitcher, their sample WHIP is $\hat{\lambda} = 20/25 = 0.8$ and their observed information is $I(0.8) = 25/0.8 = 31.25$. Hence, we have more information about the pitcher who has performed better.
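Both observed information formulas are one-liners in R, which makes the hitter and pitcher cases easy to compare side by side. This is a minimal sketch; both helper names are hypothetical.

```r
# Observed information evaluated at the maximum likelihood estimate.
# Both helper names are hypothetical, chosen for this illustration.
obs_info_binomial <- function(x, n) {
  p_hat <- x / n
  n / (p_hat * (1 - p_hat))
}

obs_info_poisson <- function(total, n) {
  lambda_hat <- total / n  # sample mean of events per trial
  n / lambda_hat
}

# Hitters: 5 vs 10 times on base in 25 plate appearances
hitters <- obs_info_binomial(c(5, 10), 25)

# Pitchers: 30 vs 20 walks plus hits in 25 innings
pitchers <- obs_info_poisson(c(30, 20), 25)

hitters
pitchers
```

In both models, the information shrinks as the estimated rate rises (for the binomial, as long as $\hat{p}$ stays below 0.5). A low rate is bad for a hitter and good for a pitcher, which is why the conclusion flips.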

So the situation is reversed for batters and pitchers. For batters, we tend to have more information when they perform poorly. For pitchers, we tend to have more information when they perform well. This suggests certain managerial strategies in small samples: it is justifiable to pull a poorly performing batter, but also perhaps justifiable to allow a poorly performing pitcher to have more innings. We just have more information about the bad hitter than we do about the bad pitcher, thanks to information theory.

2022 and 2023 Stabilization Points

2023-04-06T15:26:00.002-05:00

Hello everyone! It's been another couple of years, and I'm ready to update stabilization points again. These are my estimated stabilization points for the 2022 and 2023 MLB seasons, once again using the maximum likelihood method on the totals that I used for previous years. This method is explained in my articles Estimating Theoretical Stabilization Points and WHIP Stabilization by the Gamma-Poisson Model. As usual, all data and code I used for this post can be found on my github. I make no claims about the stability, efficiency, or optimality of my code.

I've included standard error estimates for 2022 and 2023, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found here for batting statistics and here for pitching statistics. The calculations for 2016 can be found here. The 2017 calculations can be found here. The 2018 calculations can be found here. The 2019 calculations can be found here. I didn't do calculations in 2020 because of the pandemic in general. The 2021 calculations can be found here.

The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years' worth of data (2016-2021 for the 2022 stabilization points and 2017-2022 for the 2023 stabilization points), were picked fairly arbitrarily. This is consistent with my previous work, though I do have concerns about including rates from, for example, the covid year and juiced ball era. However, including fewer years means less accurate estimates. A tradeoff must be made, and I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.

Offensive Statistics

2022 Statistics

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&2022\textrm{ }\hat{M}&2022\textrm{ }SE(\hat{M})&2022\textrm{ }\hat{\mu} & \textrm{Cutoff} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 316.48 & 19.72 & 0.331 & 300 \\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 459.44 & 50.10 & 0.304 & 300 \\
\textrm{BA}&\textrm{H/AB} & 473.86 & 38.69 & 0.263 & 300\\
\textrm{SO Rate}&\textrm{SO/PA} & 51.89 & 2.20 & 0.210 & 300 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 105.36 & 4.96 & 0.083 & 300 \\
\textrm{1B Rate}&\textrm{1B/PA} & 195.22 & 10.55 & 0.147 & 300 \\
\textrm{2B Rate}&\textrm{2B/PA} & 1197.83& 153.68 & 0.047 & 300 \\
\textrm{3B Rate}&\textrm{3B/PA} & 561.01 & 51.43 & 0.004 & 300 \\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1002.35 & 114.03 & 0.051 & 300 \\
\textrm{HR Rate} & \textrm{HR/PA} & 155.39 & 8.29 & 0.035 & 300 \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 248.10 & 15.53 & 0.011 & 300 \\ \hline
\end{array}

2023 Statistics

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&2023\textrm{ }\hat{M}&2023\textrm{ }SE(\hat{M})&2023\textrm{ }\hat{\mu} & \textrm{Cutoff} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 301.11 & 18.46 & 0.329 & 300 \\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 426.80 & 45.29 & 0.302 & 300 \\
\textrm{BA}&\textrm{H/AB} & 434.51 & 34.08 & 0.259 & 300\\
\textrm{SO Rate}&\textrm{SO/PA} & 53.16 & 2.25 & 0.213 & 300 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 107.36 & 5.06 & 0.083 & 300 \\
\textrm{1B Rate}&\textrm{1B/PA} & 196.97 & 10.65 & 0.145 & 300 \\
\textrm{2B Rate}&\textrm{2B/PA} & 1189.22 & 151.90 & 0.047 & 300 \\
\textrm{3B Rate}&\textrm{3B/PA} & 634.14 & 62.08 & 0.005 & 300 \\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1035.82 & 120.93 & 0.051 & 300 \\
\textrm{HR Rate} & \textrm{HR/PA} & 156.77 & 8.34 & 0.035 & 300 \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 256.24 & 16.04 & 0.011 & 300 \\ \hline
\end{array}

In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.

This is a good opportunity to compare the stabilization points I calculated for the 2016 season to the stabilization points for the 2023 season. The 2023 stabilization points use data from 2017-2022, so there is no crossover of information between them.

\begin{array}{| l | l | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&2023\textrm{ }\hat{M}&2016\textrm{ }\hat{M} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 301.11 & 301.32 \\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 426.80 & 433.04 \\
\textrm{BA}&\textrm{H/AB} & 434.51 & 491.20\\
\textrm{SO Rate}&\textrm{SO/PA} & 53.16 & 49.23 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 107.36 & 112.44 \\
\textrm{1B Rate}&\textrm{1B/PA} & 196.97 & 223.86 \\
\textrm{2B Rate}&\textrm{2B/PA} & 1189.22 & 1169.75 \\
\textrm{3B Rate}&\textrm{3B/PA} & 634.14 & 365.06 \\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1035.82 & 1075.41 \\
\textrm{HR Rate} & \textrm{HR/PA} & 156.77 & 126.35 \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 256.24 & 300.97\\ \hline
\end{array}

What is most apparent is the stability of most statistics. The stabilization points for OBP, BABIP, SO rate, BB rate, 2B rate, and XBH rate are nearly identical, indicating that the spread of abilities within these distributions is roughly the same now as it was in 2016. Stabilization points for BA, 1B rate, HR rate, and HBP rate are fairly close, indicating not much change. The big outlier is 3B rate, or the rate of triples. Though the estimated probability of a triple per PA is approximately 0.005 in both seasons, the stabilization point has nearly doubled from 2016 to 2023. This indicates that the spread in the ability to hit triples has decreased: though the league average rate of triples has remained the same, there are fewer batters that have a "true" triples-hitting ability which is much higher or lower than the league average.
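To put a rough number on the change in spread, the stabilization point can be converted into an implied standard deviation of true talent. The sketch below assumes a beta-binomial model in which talent follows a beta distribution with mean $\mu$ and $M$ equal to the sum of its two parameters, so that the talent variance is $\mu(1-\mu)/(M+1)$. That parameterization is an assumption of this example, and `talent_sd` is a hypothetical helper name.

```r
# Standard deviation of true talent implied by a beta-binomial model,
# ASSUMING M = alpha + beta and mu = alpha / (alpha + beta).
# `talent_sd` is a hypothetical helper name.
talent_sd <- function(mu, M) {
  sqrt(mu * (1 - mu) / (M + 1))
}

sd_2016 <- talent_sd(0.005, 365.06)  # 3B rate, 2016 stabilization point
sd_2023 <- talent_sd(0.005, 634.14)  # 3B rate, 2023 stabilization point

c(sd_2016, sd_2023)
sd_2023 / sd_2016  # below 1: the implied spread of talent is narrower in 2023
```

With the league mean held fixed, the ratio depends only on the two stabilization points.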

Pitching Statistics

2022 Statistics

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&2022\textrm{ }\hat{M}&2022\textrm{ }SE(\hat{M})&2022\textrm{ }\hat{\mu} & \textrm{Cutoff} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 929.31 & 165.62 & 0.284 &300 \\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 66.56 & 4.38 & 0.439 &300\\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 61.79 & 4.03 & 0.351 &300 \\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1692.45 & 467.98 & 0.210 &300 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 715.15 & 226.83 & 0.135 & 100 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 80.13 & 4.00 & 0.220 &400 \\
\textrm{HR Rate}&\textrm{HR/TBF}& 1102.12 & 175.15 & 0.032 &400 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 256.28 & 20.37 & 0.074 & 400 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 931.55 & 131.51 & 0.009 &400 \\
\textrm{Hit rate}&\textrm{H/TBF}& 414.03 & 33.46 & 0.230 &400 \\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 395.83 & 35.75 & 0.312 &400\\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 58.49 & 4.40 & 1.28 &80 \\
\textrm{ER Rate}&\textrm{ER/IP*}& 54.50 & 4.07 & 0.465 &80 \\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 61.96 & 4.75 & 1.23 &80\\ \hline
\end{array}

* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.
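The innings pitched correction can be sketched in R as follows. `ip_to_decimal` is a hypothetical helper, and it uses exact thirds rather than the rounded 0.33 and 0.67 mentioned in the note.

```r
# Convert innings pitched from box-score notation (6.1 = 6 and 1/3 innings,
# 6.2 = 6 and 2/3 innings) to a decimal number of innings.
# `ip_to_decimal` is a hypothetical helper; it uses exact thirds.
ip_to_decimal <- function(ip) {
  whole <- floor(ip)
  outs <- round((ip - whole) * 10)  # outs recorded in the partial inning
  stopifnot(all(outs %in% 0:2))     # only .0, .1, and .2 are valid
  whole + outs / 3
}

ip_to_decimal(c(6.0, 6.1, 6.2))
```

The `round()` call guards against floating point error in values such as `6.1 - 6`.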

2023 Statistics

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&2023\textrm{ }\hat{M}&2023\textrm{ }SE(\hat{M})&2023\textrm{ }\hat{\mu} & \textrm{Cutoff} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 809.54 & 134.29 & 0.282 &300 \\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 66.24 & 4.40 & 0.434 &300\\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 59.40 & 3.90 & 0.357 &300 \\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1596.98 & 429.07 & 0.209 &300 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 386.34 & 77.01 & 0.133 & 100 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 77.46 & 4.85 & 0.223 &400 \\
\textrm{HR Rate}&\textrm{HR/TBF}& 942.58 & 134.48 & 0.032 &400 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 258.78 & 20.84 & 0.073 & 400 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 766.40 & 98.75 & 0.009 &400 \\
\textrm{Hit rate}&\textrm{H/TBF}& 391.55 & 30.96 & 0.227 &400 \\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 358.50 & 31.64 & 0.309 &400\\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 54.96 & 4.07& 1.27 &80 \\
\textrm{ER Rate}&\textrm{ER/IP*}& 50.33 & 3.68 & 0.459 &80 \\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 58.00 & 4.37 & 1.22 &80\\ \hline
\end{array}

* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.

Once again, this is a good opportunity to compare the stabilization points for 2016 to the stabilization points for 2023.

\begin{array}{| l | l | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&2023\textrm{ }\hat{M}&2016\textrm{ }\hat{M} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 809.54 & 1408.72 \\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 66.24 & 65.52 \\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 59.40 & 61.96\\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1596.98 & 768.42 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 386.34 & 505.11 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 77.46 & 90.94 \\
\textrm{HR Rate}&\textrm{HR/TBF}& 942.58 & 931.59 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 258.78 & 221.25 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 766.40 & 989.30\\
\textrm{Hit rate}&\textrm{H/TBF}& 391.55 & 623.35\\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 358.50 & 524.73\\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 54.96 & 77.2\\
\textrm{ER Rate}&\textrm{ER/IP*}& 50.33 & 59.55\\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 58.00 & 75.79\\ \hline
\end{array}

Comparing 2023 to 2016, the outliers are obvious: the stabilization point for pitcher BABIP has nearly halved since then (hit rate shows a similar, smaller drop), while the stabilization point for line drive rate has nearly doubled. Given that the estimated mean pitcher BABIP and line drive rate are similar for the two years (0.282/0.209 for 2023 and 0.289/0.203 for 2016), this indicates a change in the spread of abilities. Simply put, the smaller stabilization point means there is now a much higher spread of pitcher BABIP "true" abilities, and the larger one means a much lower spread of line drive rate abilities. One possible reading is that pitchers have become more alike in how many line drives they allow, while differing more in how often balls in play fall for hits.

Usage

Aside from the obvious use of knowing approximately when results are half due to luck and half skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article From Stabilization to Interval Estimation.

For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is

$\dfrac{114 + 0.329*301.11}{300 + 301.11} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.11 + 300}} = (0.317,0.392)$

That is, we believe the player's true on-base percentage to be between 0.317 and 0.392 with 95% confidence. I used a normal distribution for talent levels with a normal approximation to the binomial for the distribution of observed OBP, but that is not the only possible choice - it just resulted in the simplest formulas for the intervals.

Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. A 95% prediction interval for his OBP over those PA is given by

$\dfrac{114 + 0.329*301.11}{300 + 301.11} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.11+ 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$

That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% "unshrunk" intervals, but when I tested them in my article "From Stabilization to Interval Estimation," they worked well for prediction.
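Both intervals can be reproduced with a short R function. This is a sketch of the two formulas above rather than the code from my github; the name `shrunk_interval` and its arguments are made up for this example.

```r
# Shrunken point estimate with normal-approximation intervals, following the
# two formulas above. `shrunk_interval` is a hypothetical helper name.
shrunk_interval <- function(x, n, mu, M, n_new = NULL, z = 1.96) {
  est <- (x + mu * M) / (n + M)      # shrink the observed rate toward mu
  v <- mu * (1 - mu) / (n + M)       # variance term for the true rate
  if (!is.null(n_new)) {
    v <- v + mu * (1 - mu) / n_new   # extra variance for n_new future PA
  }
  est + c(-1, 1) * z * sqrt(v)
}

ci   <- shrunk_interval(114, 300, mu = 0.329, M = 301.11)
pred <- shrunk_interval(114, 300, mu = 0.329, M = 301.11, n_new = 250)

round(ci, 3)    # 0.317 0.392
round(pred, 3)  # 0.285 0.424
```

Leaving `n_new` as `NULL` gives the interval for the true OBP, while supplying it gives the wider prediction interval.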

As usual, all my data and code can be found on my github. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.

> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1)
$\$$Parameters
[1] 0.3287902 301.1076958

$\$$Standard.Error
[1] 18.45775

$\$$Model
[1] "Beta-Binomial"

The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula

$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$

which is obtained by applying the delta method to the function $p(\hat{M}) = n/(n + \hat{M})$. Note that the mean and prediction intervals I gave do not take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.
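The delta method bounds are simple to compute directly. The following is a minimal sketch, separate from the `stabilize` function; `stabilization_bounds` is a hypothetical helper name.

```r
# Estimated stabilization n / (n + M) with delta-method confidence bounds.
# `stabilization_bounds` is a hypothetical helper, separate from stabilize().
stabilization_bounds <- function(n, M, se_M, z = 1.96) {
  est <- n / (n + M)
  se <- n / (n + M)^2 * se_M  # |d/dM of n / (n + M)| times SE(M)
  data.frame(n = n, estimate = est, lower = est - z * se, upper = est + z * se)
}

# OBP, 2023: M = 301.11 with standard error 18.46
bounds <- stabilization_bounds(c(100, 301.11, 600), M = 301.11, se_M = 18.46)
bounds
```

At $n = \hat{M}$ the estimate is exactly 0.5, the point at which results are half luck and half skill, and the standard error simplifies to $SE(\hat{M})/(4\hat{M})$.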

As always, feel free to post any comments or suggestions.

2021 Stabilization Points

2021-04-02T22:13:00.000-05:00

These are my estimated stabilization points for the 2021 MLB season, once again using the maximum likelihood method on the totals that I used for previous years. This method is explained in my articles Estimating Theoretical Stabilization Points and WHIP Stabilization by the Gamma-Poisson Model.

However, good news! In the past two years, I've had some research on reliability for non-normal data corrected, expanded upon, and published in academic journals. I can definitively say that my maximum likelihood estimator is estimating the same reliability of these statistics as Cronbach's alpha or KR-20, and that it performs as well as or better than Cronbach's alpha, assuming the model is correct, which - while no model is correct - I believe is very accurate. The article can be found here (for the preprint, click here). I also published a paper with some KR-20 and KR-21 reliability estimators specifically for exponential family distributions such as binomial, Poisson, etc. The article can be found here (for the preprint, click here). These estimators are a little more efficient for small sample sizes, but for large sample sizes such as in this case, the estimators should be nearly identical.

As usual, all data and code I used for this post can be found on my github. I make no claims about the stability, efficiency, or optimality of my code.

I've included standard error estimates for 2021, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found here for batting statistics and here for pitching statistics. The calculations for 2016 can be found here. The 2017 calculations can be found here. The 2018 calculations can be found here. The 2019 calculations can be found here. I didn't do calculations in 2020 because of the pandemic in general.

The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years' worth of data (2015-2020), were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.

Offensive Statistics

\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2019\textrm{ }\hat{M} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 302.57 & 18.39 & 0.331 & 300 & 295.20 \\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 451.24 & 47.22 & 0.306 & 300 & 431.49 \\
\textrm{BA}&\textrm{H/AB} & 511.71 & 42.78 & 0.265 & 300 & 488.49 \\
\textrm{SO Rate}&\textrm{SO/PA} & 50.37 & 2.12 & 0.205 & 300 & 49.05 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 100.47 & 4.67 & 0.080 & 300 & 104.08 \\
\textrm{1B Rate}&\textrm{1B/PA} & 191.17 & 10.20 & 0.150 & 300 & 197.43 \\
\textrm{2B Rate}&\textrm{2B/PA} & 1242.67 & 162.27 & 0.047 & 300 & 1200.46 \\
\textrm{3B Rate}&\textrm{3B/PA} & 481.11 & 28.74 & 0.005 & 300 & 421.91 \\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1059.31 & 124.09 & 0.052 & 300 & 1070.09 \\
\textrm{HR Rate} & \textrm{HR/PA} & 146.00 & 7.68 & 0.034 & 300 & 141.80\\
\textrm{HBP Rate} & \textrm{HBP/PA} & 261.13 & 16.56 & 0.010 & 300 & 266.92 \\ \hline
\end{array}

In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.

The stabilization point of the 3B rate increased dramatically, by approximately two standard errors, indicating that the talent level of hitting triples has clustered more closely around its mean. In general, however, most stabilization points are roughly the same as the previous year, taking into account that year-to-year and sample-to-sample variation in estimates is expected even if the true stabilization points are not changing.

Pitching Statistics

\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}& 2019 \textrm{ }\hat{M} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1061.43 & 197.34 & 0.286 &300& 1184.38 \\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 66.20 & 4.25 & 0.443 &300& 64.51\\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 62.33 & 3.97 & 0.346 &300& 60.68 \\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1773.66 & 486.12 & 0.211 &300& 2197.02 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 529.40 & 129.10 & 0.130 & 100 & 351.53 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 80.78 & 4.97 & 0.214 &400& 90.86 \\
\textrm{HR Rate}&\textrm{HR/TBF}& 959.57 & 133.07 & 0.031 &400& 764.48\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 251.22 & 19.47 & 0.072 & 400 & 230.09 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 1035.90 & 153.68 & 0.009 &400& 906.25 \\
\textrm{Hit rate}&\textrm{H/TBF}& 453.30 & 37.52 & 0.232 &400& 496.56 \\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 407.36 & 36.33 & 0.313 &400& 443.60 \\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 63.38 & 4.79 & 1.29 &80& 67.84 \\
\textrm{ER Rate}&\textrm{ER/IP*}& 57.73 & 4.30 & 0.460 &80& 57.97 \\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 64.70 & 4.92 & 1.23 &80& 67.23 \\ \hline
\end{array}

* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.

Most are the same, but the HR/FB stabilization point has shifted up dramatically given its standard error, indicating a likely change in true talent level and not just sample-to-sample and year-to-year variation. This indicates that the distribution of HR/FB talent levels is clustering around its mean, possibly indicating a change in approach by pitchers or batters over the past two years. The mean has also shifted up over the previous calculation. Similarly, the HR rate stabilization point and mean have increased. Conversely, the strikeout rate stabilization point has decreased, indicating less clustering of talent levels around the mean, and the mean has also increased.

Usage

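For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is

$\dfrac{114 + 0.331*302.57}{300 + 302.57} \pm 1.96 \sqrt{\dfrac{0.331(1-0.331)}{302.57 + 300}} = (0.318,0.393)$

Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. A 95% prediction interval for his OBP over those PA is given by

$\dfrac{114 + 0.331*302.57}{300 + 302.57} \pm 1.96 \sqrt{\dfrac{0.331(1-0.331)}{302.57+ 300} + \dfrac{0.331(1-0.331)}{250}} = (0.286,0.425)$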

That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.286 and 0.425. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% "unshrunk" intervals, but when I tested them in my article "From Stabilization to Interval Estimation," they worked well for prediction.

As usual, all my data and code can be found on my github. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.

> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1)
$\$$Parameters
[1] 0.3306363 302.5670532

$\$$Standard.Error
[1] 18.38593

$\$$Model
[1] "Beta-Binomial"

The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula

$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$

2019 Stabilization Points

2019-04-21T16:49:00.002-05:00

These are my estimated stabilization points for the 2019 MLB season, once again using the maximum likelihood method on the totals that I used for previous years. This method is explained in my articles Estimating Theoretical Stabilization Points and WHIP Stabilization by the Gamma-Poisson Model.

(As usual, all data and code I used can be found on my github. I make no claims about the stability, efficiency, or optimality of my code.)

I've included standard error estimates for 2019, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found here for batting statistics and here for pitching statistics. The calculations for 2016 can be found here. The 2017 calculations can be found here. The 2018 calculations can be found here.

The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years worth of data (2013-2018), were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.

Offensive Statistics

\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2018\textrm{ }\hat{M} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 295.20 & 16.26 & 0.329 & 300 & 302.27 \\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 431.49 & 39.76 & 0.306 & 300 & 429.47 \\
\textrm{BA}&\textrm{H/AB} & 488.49 & 36.52 & 0.264 & 300 & 463.19 \\
\textrm{SO Rate}&\textrm{SO/PA} & 49.05 & 1.88 & 0.198 & 300 & 48.74 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 104.08 & 4.45 & 0.078 & 300 & 108.84 \\
\textrm{1B Rate}&\textrm{1B/PA} & 197.43 & 9.72 & 0.154 & 300 & 200.94 \\
\textrm{2B Rate}&\textrm{2B/PA} & 1200.46 & 140.37 & 0.047 & 300 & 1164.82 \\
\textrm{3B Rate}&\textrm{3B/PA} & 421.91 & 31.67 & 0.005 & 300 & 390.75 \\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1070.09 & 115.96 & 0.052 & 300 & 1064.01 \\
\textrm{HR Rate} & \textrm{HR/PA} & 141.80 & 6.78 & 0.030 & 300 & 132.52 \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 266.92 & 15.74 & 0.009 & 300 & 280.00 \\ \hline
\end{array}

In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.

Noticeably, the stabilization point for the HR rate has increased over the past four years, indicating less variance in talent level of hitting home runs. Meanwhile, the stabilization point for HBP rate has decreased over the past four years, suggesting increased variance in """talent""" level of getting hit by pitches.

Pitching Statistics

\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2018 \textrm{ }\hat{M} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1184.38 & 206.63& 0.288 &300&1322.70 \\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 64.51 & 3.66 & 0.446 &300&63.12 \\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}&60.68 &3.41 & 0.344 &300&59.80 \\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 2197.02 & 622.02 & 0.210 &300&2157.15 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 351.53 & 56.05 & 0.117 & 100 & 388.61 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 90.86 &5.07& 0.204&400&93.52 \\
\textrm{HR Rate}&\textrm{HR/TBF}&764.48& 82.78 & 0.028 &400&790.97 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 230.09 & 15.46 & 0.071 &400&238.70 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 906.25 & 109.63 & 0.009 &400&935.61 \\
\textrm{Hit rate}&\textrm{H/TBF}&496.56 & 39.48 & 0.233 &400&536.32 \\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 443.60 & 36.42 & 0.312 &400& 472.09 \\
\textrm{WHIP}&\textrm{(H + BB)/IP*}&67.84 & 4.69 & 1.28 &80& 71.10 \\
\textrm{ER Rate}&\textrm{ER/IP*}& 57.97 & 3.87 & 0.444 &80& 58.59 \\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 67.23 & 4.64 & 1.22 &80& 69.11 \\ \hline
\end{array}

* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.

Most statistics this year shifted not just in stabilization point, but also in mean, possibly indicating a shift in the pitching environment. The stabilization points which did shift tended to shift down, indicating an increased spread of variation around the mean talent levels.

Usage

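For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is

$\dfrac{114 + 0.329*295.20}{300 + 295.20} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{295.20 + 300}} = (0.317,0.392)$

and a 95% prediction interval for the player's on-base percentage over 250 plate appearances in the second half is

$\dfrac{114 + 0.329*295.20}{300 + 295.20} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{295.20 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$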

That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% "unshrunk" intervals, but when I tested them in my article "From Stabilization to Interval Estimation," they worked well for prediction.
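The two intervals can be reproduced with a short sketch in Python. The function name `shrunk_intervals` and its arguments are my own labels for the pieces of the formulas above; this is only the arithmetic, not the estimation code from my github.

```python
import math

def shrunk_intervals(x, n, mu, M, n_future, z=1.96):
    """Return (confidence interval, prediction interval) for a shrunken rate.

    x, n      : observed successes and trials
    mu, M     : league mean and stabilization point
    n_future  : number of future trials for the prediction interval
    """
    center = (x + mu * M) / (n + M)
    var_mean = mu * (1 - mu) / (n + M)       # uncertainty in the shrunken estimate
    var_future = mu * (1 - mu) / n_future    # extra noise in the future sample
    ci_half = z * math.sqrt(var_mean)
    pi_half = z * math.sqrt(var_mean + var_future)
    return ((center - ci_half, center + ci_half),
            (center - pi_half, center + pi_half))

ci, pi = shrunk_intervals(x=114, n=300, mu=0.329, M=295.20, n_future=250)
print([round(v, 3) for v in ci])
print([round(v, 3) for v in pi])
```

The prediction interval is wider only because of the extra variance term for the 250 future plate appearances.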

As usual, all my data and code can be found on my github. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.

> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1)
$\$$Parameters
[1] 0.3285272 295.1970047

$\$$Standard.Error
[1] 16.25874

$\$$Model
[1] "Beta-Binomial"

The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula

$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$
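As a rough illustration of that formula, here is a Python sketch that evaluates the bounds at a few values of $n$. The helper name `stabilization_bounds` is hypothetical, and the inputs are the estimate and standard error from the `stabilize()` output above.

```python
def stabilization_bounds(n, M_hat, se_M, z=1.96):
    """Estimate n / (n + M) with delta-method confidence bounds."""
    p = n / (n + M_hat)
    # |d/dM of n/(n+M)| = n/(n+M)^2, scaled by the standard error of M
    half = z * n / (n + M_hat) ** 2 * se_M
    return p - half, p, p + half

M_hat, se_M = 295.1970047, 16.25874   # from the stabilize() output above
for n in (100, 300, 600):
    lo, p, hi = stabilization_bounds(n, M_hat, se_M)
    print(n, round(lo, 3), round(p, 3), round(hi, 3))
```

At $n = \hat{M}$ the estimated proportion is exactly one half, which is the usual reading of the stabilization point.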

2018 Stabilization Points

2018-09-05T22:39:00.000-05:00

So this post is waaaaay late in the 2018 season. I've been busy! But, I'm doing this again since it's pretty easy to do. But I am copying and pasting the text from the posts from the last two years, because I can.

These are my estimated stabilization points for the 2018 MLB season, once again using the maximum likelihood method on the totals that I used for previous years. This method is explained in my articles Estimating Theoretical Stabilization Points and WHIP Stabilization by the Gamma-Poisson Model.

(As usual, all data and code I used can be found on my github. I make no claims about the stability, efficiency, or optimality of my code.)

I've included standard error estimates for 2018, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found here for batting statistics and here for pitching statistics. The calculations for 2016 can be found here. The 2017 calculations can be found here.

The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years worth of data (2012-2017), were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.

Offensive Statistics

\begin{array}{| l | l | c | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2017\textrm{ }\hat{M} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 302.27 & 16.88 & 0.329 & 300 & 303.77\\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 429.47 & 39.30 & 0.306 & 300 & 442.62 \\
\textrm{BA}&\textrm{H/AB} & 463.19 & 33.94 & 0.266 & 300 & 466.09 \\
\textrm{SO Rate}&\textrm{SO/PA} & 48.74 & 1.88 & 0.194 & 300 & 49.02\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 108.84 & 4.72 & 0.077 & 300 & 113.64 \\
\textrm{1B Rate}&\textrm{1B/PA} & 200.94 & 9.99 & 0.156 & 300 & 215.29\\
\textrm{2B Rate}&\textrm{2B/PA} & 1164.82 & 134.26 & 0.047 & 300 & 1230.96 \\
\textrm{3B Rate}&\textrm{3B/PA} & 390.75 & 28.72 & 0.005 & 300 & 358.92\\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1064.01 & 115.55 & 0.052 & 300 & 1063.76 \\
\textrm{HR Rate} & \textrm{HR/PA} & 132.52 & 6.31 & 0.030 & 300 & 129.02 \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 280.00 & 16.89 & 0.009 & 300 & 299.39 \\ \hline
\end{array}

In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.

Pitching Statistics

\begin{array}{| l | l | c | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2017 \textrm{ }\hat{M} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1322.70 & 244.54 & 0.289 &300&1356.06 \\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 63.12 & 3.55 & 0.450 &300& 63.12 \\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 59.86 &3.34 & 0.341 &300&59.80 \\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 2157.15 & 586.96 & 0.209 &300& 1497.65 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 388.61 & 65.28 & 0.115 &100&464.60 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 93.52 &5.25& 0.199&400&94.62\\
\textrm{HR Rate}&\textrm{HR/TBF}&790.97 & 86.34 & 0.029 &400&942.62 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}&238.70 & 16.10 & 0.070 &400&237.53 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 935.61 & 115.06 & 0.008 &400&954.09 \\
\textrm{Hit rate}&\textrm{H/TBF}& 536.32 & 43.99 & 0.235 &400&550.69 \\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}&472.09 & 39.51 & 0.313 &400& 496.39 \\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 71.10 & 4.96 & 1.29 &80& 74.68 \\
\textrm{ER Rate}&\textrm{ER/IP*}& 58.59 & 3.91 & 0.447 &80& 62.82 \\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 69.11 & 4.79 & 1.22 &80& 73.11\\ \hline
\end{array}

* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.

Most statistics are roughly the same; however, the line drive stabilization point has increased quite a bit, having doubled in 2016 from 2015. This is not a mistake - it corresponds to a decrease in the variance of line drive rates. Noticeably, the HR rate variance increased, and so the HR rate stabilization point decreased. This indicates a shift in the MLB pitching environment in these particular areas, and points to a weakness in the method - if the underlying league distribution of talent level of a statistic is changing rapidly, this method will fail to account for the change and may be inaccurate.

Usage

Aside from the obvious use of knowing approximately when results are half due to luck and half from skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article From Stabilization to Interval Estimation. I believe the confidence intervals from my method should be similar to the intervals from Sean Dolinar's great fangraphs article A New Way to Look at Sample Size, though I have not personally tested this, and am not familiar with the Cronbach's alpha methodology he uses (or with reliability analysis in general).

For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is

$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300}} = (0.317,0.392)$

and a 95% prediction interval for the player's on-base percentage over 250 plate appearances in the second half is

$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$

That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% unshrunk intervals, but when I tested them in my article "From Stabilization to Interval Estimation", they worked well for prediction.

As usual, all my data and code can be found on my github. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.

> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1)
$\$$Parameters
[1] 0.329098 301.317682

$\$$Standard.Error
[1] 16.92138

$\$$Model
[1] "Beta-Binomial"

The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula

$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$

which is obtained from applying the delta method to the function $p(\hat{M}) = n/(n + \hat{M})$. Note that the mean and prediction intervals I gave do not take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.
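For anyone who wants to try that, here is a small Python sketch that moves $M$ up and down by one standard error and recomputes the 95% interval for the 0.380 OBP example. The function name `shrunk_ci` is just a label for the interval formula used earlier.

```python
import math

def shrunk_ci(x, n, mu, M, z=1.96):
    """95% interval for a rate shrunk toward the league mean mu."""
    center = (x + mu * M) / (n + M)
    half = z * math.sqrt(mu * (1 - mu) / (n + M))
    return center - half, center + half

M_hat, se_M = 301.32, 16.92   # estimate and standard error quoted above
for M in (M_hat - se_M, M_hat, M_hat + se_M):
    lo, hi = shrunk_ci(114, 300, 0.329, M)
    print(round(M, 2), round(lo, 3), round(hi, 3))
```

The endpoints should only move in the third decimal place.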

Maybe somebody else out there might find this useful. As always, feel free to post any comments or suggestions!

2017 Stabilization Points

2017-04-24T23:45:00.000-05:00

Once again, I recalculated stabilization points for 2017 MLB data, using the same maximum likelihood method on the totals that I used for 2015 and 2016. This method is explained in my articles Estimating Theoretical Stabilization Points and WHIP Stabilization by the Gamma-Poisson Model.

(As usual, all data and code I used can be found on my github. I make no claims about the stability, efficiency, or optimality of my code.)

I've included standard error estimates for 2017, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found here for batting statistics and here for pitching statistics. The calculations for 2016 can be found here.

The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years worth of data, were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.

Offensive Statistics

\begin{array}{| l | l | c | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2016\textrm{ }\hat{M} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 303.77 & 17.08 & 0.328 & 300 & 301.32 \\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 442.62 & 40.55 & 0.306 & 300 & 433.04\\
\textrm{BA}&\textrm{H/AB} & 466.09 & 34.30 & 0.266 & 300 & 491.20\\
\textrm{SO Rate}&\textrm{SO/PA} & 49.02 & 1.90 & 0.188 & 300 & 49.23\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 113.64 & 5.00 & 0.077 & 300 & 112.44 \\
\textrm{1B Rate}&\textrm{1B/PA} & 215.29 & 10.95 & 0.157 & 300 & 223.86 \\
\textrm{2B Rate}&\textrm{2B/PA} & 1230.96 & 148.48 & 0.047 & 300 & 1169.75 \\
\textrm{3B Rate}&\textrm{3B/PA} & 358.92 & 25.71 & 0.005 & 300 & 365.06 \\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1063.76 & 116.54 & 0.052 & 300 & 1075.41 \\
\textrm{HR Rate} & \textrm{HR/PA} & 129.02 & 6.18 & 0.028 & 300 & 126.35\\
\textrm{HBP Rate} & \textrm{HBP/PA} & 299.39 & 18.60 & 0.009 & 300 & 300.97 \\ \hline
\end{array}

In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.

Pitching Statistics

\begin{array}{| l | l | c | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2016 \textrm{ }\hat{M} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1356.06 & 247.48 & 0.289 &300&1408.72\\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 64.00 & 3.56 & 0.450 &300& 63.53 \\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 61.73 &3.42& 0.342 &300&59.80 \\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1497.65 & 296.21 & 0.208 &300&731.02 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 464.60 & 85.51 & 0.108 &100&488.53 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 94.62&5.29& 0.194&400&93.15 \\
\textrm{HR Rate}&\textrm{HR/TBF}& 942.62 & 110.66 & 0.026 &400&949.02 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 237.53 & 15.84 & 0.069 &400&236.87 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 954.09 & 115.60 & 0.008 &400&939.00 \\
\textrm{Hit rate}&\textrm{H/TBF}& 550.69 & 45.63 & 0.235 &400&559.18 \\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 496.39 & 41.81 & 0.312 &400&526.77 \\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 74.68 & 5.25 & 1.29 &80&78.97 \\
\textrm{ER Rate}&\textrm{ER/IP*}& 62.82 & 4.24 & 0.440 &80&63.08 \\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 73.11& 5.11 & 1.22 &80&75.79 \\ \hline
\end{array}

* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.

Most statistics are roughly the same; however, the line drive stabilization point has roughly doubled. I checked my calculations for both years and this is not a mistake. It corresponds to a decrease in the variance of line drive rates. It should also be noted that the average line drive rate increased from 0.203 to 0.208 - there are perhaps remnants of an odd 2010 that is no longer included in the data set.

Usage

Note: This section is largely unchanged from the previous year's version. The formulas given here work for "counting" offensive stats (OBP, BA, etc.).

Aside from the obvious use of knowing approximately when results are half due to luck and half from skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article From Stabilization to Interval Estimation. I believe the confidence intervals from my method should be similar to the intervals from Sean Dolinar's great fangraphs article A New Way to Look at Sample Size, though I have not personally tested this, and am not familiar with the Cronbach's alpha methodology he uses (or with reliability analysis in general).

For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is

$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300}} = (0.317,0.392)$

and a 95% prediction interval for the player's on-base percentage over 250 plate appearances in the second half is

$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$

That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% unshrunk intervals, but when I tested them in my article "From Stabilization to Interval Estimation", they worked well for prediction.

As usual, all my data and code can be found on my github. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.

> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1)
$\$$Parameters
[1] 0.329098 301.317682

$\$$Standard.Error
[1] 16.92138

$\$$Model
[1] "Beta-Binomial"

The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula

$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$

which is obtained from applying the delta method to the function $p(\hat{M}) = n/(n + \hat{M})$. Note that the mean and prediction intervals I gave do not take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.

Maybe somebody else out there might find this useful. As always, feel free to post any comments or suggestions!

2016 Win Total Predictions (Through August 31)

2016-09-03T13:54:00.001-05:00

These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway). These predictions include all games played through August 31.

As a side note, I noticed that my projections are very similar to the Fangraphs projections on the same day. I'm sure we're both calculating the projections from completely different methods, but it's reassuring that others have arrived at basically the same conclusions. Theirs also have playoff projections, though mine have intervals attached to them.

I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season the actual coverage is around 94%, with intervals usually being one game off if and when they are off.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins/Games}\\ \hline
ARI & 63 & 68.82 & 74 & 71.61 & 56 / 133 \\
ATL & 57 & 62.25 & 68 & 68.41 & 50 / 133 \\
BAL & 81 & 86.57 & 92 & 81.42 & 72 / 133 \\
BOS & 85 & 90.41 & 96 & 91.7 & 74 / 133 \\
CHC & 98 & 103.63 & 109 & 100.59 & 85 / 132 \\
CHW & 71 & 76.85 & 83 & 77.61 & 62 / 131 \\
CIN & 62 & 67.9 & 74 & 69.67 & 55 / 132 \\
CLE & 87 & 92.68 & 98 & 90.03 & 76 / 132 \\
COL & 73 & 78.63 & 84 & 81.72 & 64 / 133 \\
DET & 81 & 86.95 & 92 & 83.51 & 72 / 133 \\
HOU & 81 & 86.24 & 92 & 85.14 & 71 / 133 \\
KCR & 78 & 83.09 & 89 & 78.69 & 69 / 133 \\
LAA & 67 & 72.93 & 78 & 77.8 & 59 / 133 \\
LAD & 84 & 89.44 & 95 & 86.26 & 74 / 133 \\
MIA & 76 & 81.66 & 87 & 81.91 & 67 / 133 \\
MIL & 65 & 70.13 & 76 & 73.34 & 57 / 133 \\
MIN & 56 & 61.98 & 68 & 70.1 & 49 / 132 \\
NYM & 78 & 83.61 & 89 & 81.58 & 69 / 133 \\
NYY & 78 & 83.86 & 90 & 80.28 & 69 / 132 \\
OAK & 64 & 69.88 & 75 & 71.92 & 57 / 133 \\
PHI & 67 & 72.42 & 78 & 69.38 & 60 / 133 \\
PIT & 77 & 82.37 & 88 & 80.33 & 67 / 131 \\
SDP & 63 & 68.73 & 74 & 74.14 & 55 / 132 \\
SEA & 77 & 82.55 & 88 & 81.3 & 68 / 133 \\
SFG & 82 & 88 & 94 & 86.39 & 72 / 132 \\
STL & 81 & 86.25 & 92 & 87.69 & 70 / 132 \\
TBR & 65 & 70.72 & 76 & 79.47 & 56 / 132 \\
TEX & 89 & 94.45 & 100 & 83.61 & 80 / 134 \\
TOR & 86 & 91.93 & 97 & 89 & 76 / 133 \\
WSN & 89 & 94.82 & 100 & 94.01 & 78 / 133 \\
\hline\end{array}

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins than would be expected based on their talent level.

It's still difficult to predict final win totals even at the beginning of September - intervals have a width of approximately 11-12 games. The Texas Rangers have been lucky this season, with a projected win total over 10 games larger than their estimated true talent level! Conversely, the Tampa Bay Rays have been unlucky, with a projected win total nearly 9 games lower than their true talent level.

The Chicago Cubs have a good chance at winning 105+ games. My system believes they are a "true" 101 win team. Conversely, the system believes that the worst team is the Atlanta Braves, which are a "true" 68 win team (though the Minnesota Twins are projected to have the worst record at 62 wins).

Terminology

To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season through August 31).
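The same arithmetic can be written as a short Python sketch. The function name `projected_wins` is hypothetical, and reading the "Mean" column as current wins plus true winning percentage times games remaining is my simplification of the explanation above, not the full estimator - though it does reproduce the Cubs and Rangers rows of the table.

```python
def projected_wins(current_wins, games_played, true_win_total, season_games=162):
    """Current wins plus expected wins over the remaining schedule."""
    win_prob = true_win_total / season_games
    remaining = season_games - games_played
    return current_wins + win_prob * remaining

# Coin example: 4 heads in the first 5 of 10 flips, fair coin (true total 5 of 10)
print(projected_wins(4, 5, 5, season_games=10))

# Cubs row from the table above: 85 wins in 132 games, true win total 100.59
print(round(projected_wins(85, 132, 100.59), 2))
```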

wOBA Shrinkage Estimation by the Multinomial-Dirichlet Model

2016-08-15T02:01:00.000-05:00

In a previous article, I showed how to calculate a very basic confidence interval for wOBA using the multinomial model. Since then, I've shown how to perform shrinkage estimation (regression towards the mean) for basic counting stats such as BA and OBP and rate stats such as WHIP. In this article, I'm going to show how to use the multinomial model with a Dirichlet prior to find a regressed estimate of wOBA (and other functions that are linear transformations of counting stats).

As in the previous post, I will use the weights of wOBA from the definition on fangraphs.com. I am aware that in The Book wOBA also includes the outcome of reaching a base on error, but the method here will easily expand to include that factor and the results should not change drastically for its exclusion.

As usual, all the code and data I use can be found on my github.

The Multinomial and Dirichlet Models

Suppose you observe $n$ independent, identical trials of an event (say, a plate appearance or an at-bat) with $k$ possible outcomes (single, double, triple, home run, walk, sacrifice fly, etc. - all the way up to an out). The distribution of counts of each of the possible outcomes is multinomial with probability mass function:

$p(x_1, x_2,...,x_k | \theta_1, \theta_2, ..., \theta_{k}, n) = \dfrac{n!}{x_1!x_2!...x_k!}\theta_1^{x_1}\theta_2^{x_2}...\theta_k^{x_k}$

Where $x_1$, $x_2$,..., $x_k$ represent counts of each outcome (with $n = x_1 + x_2 + ... + x_k$ fixed) and $\theta_1, \theta_2, ..., \theta_k$ represent the probability of each outcome in a single event - note that all the probabilities $\theta_j$ must sum to 1, so you will sometimes see the last term written as $\theta_k = (1-\theta_1-\theta_2-...-\theta_{k-1})$.

To give an example, suppose that in each plate appearance (the event) a certain player has a 0.275 chance of getting a hit, a 0.050 chance of getting a walk, and a 0.675 chance of an "other" outcome happening (meaning anything other than a hit or walk - an out, hit by pitch, reach base on error, etc.). Then in $n = 200$ plate appearances, the probability of exactly $x_H = 55$ hits, $x_{BB} = 10$ walks, and $x_{OTH} = 135$ other outcomes is given by

$\dfrac{200!}{55!10!135!} 0.275^{55} 0.05^{10} 0.675^{135} = 0.008177562$

(This probability is necessarily small because there are a little over 20,000 ways to have the three outcomes sum up to 200 plate appearances. In fact, 55 hits, 10 walks, and 135 other is the most probable set of outcomes.)
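For anyone who wants to check that number, here is a minimal Python sketch that evaluates the multinomial probability on the log scale (the factorials are far too large to compute directly). The function name `multinomial_pmf` is my own.

```python
import math

def multinomial_pmf(counts, probs):
    """Multinomial probability, computed on the log scale to avoid overflow."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)                     # log(n!)
    for x, theta in zip(counts, probs):
        log_p += x * math.log(theta) - math.lgamma(x + 1)
    return math.exp(log_p)

p = multinomial_pmf([55, 10, 135], [0.275, 0.05, 0.675])
n_outcomes = math.comb(200 + 2, 2)   # ways to split 200 PA into three counts
print(p, n_outcomes)
```

The second number counts the possible (hits, walks, other) triples that sum to 200, which is where "a little over 20,000" comes from.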

The multinomial is, as its name implies, a multivariate extension of the classic binomial distribution (a binomial is just a multinomial with $k = 2$). Similarly, there is a multivariate extension of the beta distribution called the Dirichlet distribution. The Dirichlet is used to represent the joint distribution of the $\theta_j$ themselves - that is, the joint distribution (over the entire league) of sets of talent levels for each of the $k$ possible outcomes. The probability density function of the Dirichlet is

$p(\theta_1, \theta_2, ..., \theta_{k} | \alpha_1, \alpha_2, ..., \alpha_k) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1}...\theta_k^{\alpha_k - 1}$

The Dirichlet distribution would be used to answer the question "What is the probability that a player has a hit probability of between 0.250 and 0.300 and simultaneously a walk probability of between 0.050 and 0.100?" - the advantage of doing this is being able to model the covariance between the talent levels.

The expected values of each of the $\theta_j$ are given by

$E[\theta_j] = \dfrac{\alpha_j}{\alpha_0}$

where

$\alpha_0 = \displaystyle \sum_{j = 1}^k \alpha_j$

These represent the league average talent levels in each of the outcomes. So for example, using hits (H), walks (BB), and other (OTH), the quantity given by

$E[\theta_{BB}] = \dfrac{\alpha_{BB}}{\alpha_H + \alpha_{BB} + \alpha_{OTH}}$

would be the average walk proportion (per PA) over all MLB players.

The reason the Dirichlet distribution is useful is that it is conjugate to the multinomial density above. Given raw counts $x_1, x_2, ..., x_k$ for each outcome in the multinomial model and parameters $\alpha_1$, $\alpha_2$, ..., $\alpha_k$ for the Dirichlet model, the posterior distribution for the $\theta_j$ is also Dirichlet with parameters $ \alpha_1 + x_1, \alpha_2 + x_2, ..., \alpha_k + x_k$:

$p(\theta_1, \theta_2, ..., \theta_{k} | x_1, x_2, ..., x_k) = $

$\displaystyle \dfrac{\Gamma(\sum_{j = 1}^k (\alpha_j + x_j))}{\prod_{j = 1}^k \Gamma(\alpha_j + x_j)} \theta_1^{\alpha_1 + x_1- 1} \theta_2^{\alpha_2 + x_2- 1}...\theta_{k-1}^{\alpha_{k-1} + x_{k-1} - 1}\theta_k^{\alpha_k + x_k - 1}$

For the posterior, the expected value for each outcome is given by

$E[\theta_j] = \dfrac{\alpha'_j}{\alpha'_0}$

where

$\alpha'_j = x_j + \alpha_j$

$\alpha'_0 = \displaystyle \sum_{j = 1}^k \alpha'_j = \sum_{j = 1}^k (x_j + \alpha_j)$

These posterior means $E[\theta_j]$ represent regressed estimates for each of the outcome talent levels towards the league means. These shrunk estimates can then be plugged in to the formula for any statistic to get a regressed version of that statistic.
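To see the shrinkage at work, here is a small Python sketch. The $\alpha_j$ values below are hypothetical, chosen only so the arithmetic is easy to follow (they are not estimates from MLB data), and the function name `posterior_means` is my own. It also shows that each posterior mean is a weighted average of the player's raw proportion and the league mean, with weight $n/(n + \alpha_0)$ on the raw proportion.

```python
def posterior_means(counts, alphas):
    """Posterior mean of each theta_j: (x_j + alpha_j) / (n + alpha_0)."""
    total = sum(counts) + sum(alphas)
    return [(x + a) / total for x, a in zip(counts, alphas)]

# Hypothetical prior for (hit, walk, other); not estimated from real data
alphas = [66.0, 24.0, 210.0]
counts = [55, 10, 135]            # the 200 PA example from earlier

n, alpha_0 = sum(counts), sum(alphas)
weight = n / (n + alpha_0)        # weight on the player's own raw proportions
for x, a, post in zip(counts, alphas, posterior_means(counts, alphas)):
    blended = weight * (x / n) + (1 - weight) * (a / alpha_0)
    print(round(post, 4), round(blended, 4))
```

In this sketch $\alpha_0$ plays the same role that the stabilization point plays for a single counting stat.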

Linearly Weighted Statistics

Most basic counting statistics (such as batting average, on-base percentage, etc.) simply try to estimate the talent level for one particular outcome using the raw proportion of events ending in that outcome:

$\hat{\theta_j} \approx \dfrac{x_j}{n}$

More advanced statistics instead attempt to estimate linear functions of true talent levels $\theta_j$ with weights $w_j$ for each outcome:

$w_1 \theta_1 + w_2\theta_2 + ... + w_k \theta_k$

The standard version of these statistics that you can find on any number of baseball sites uses the raw proportion $x_j/n$ as an estimate $\hat{\theta}_j$ as above. To get the regressed version of the statistic, use $\hat{\theta_j} = E[\theta_j]$ from the posterior distribution for the Dirichlet-multinomial model - the formula for the regressed statistic is then

$w_1 \hat{\theta}_1 + w_2 \hat{\theta}_2 + ... + w_k \hat{\theta}_k = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \alpha_j}{\sum_{l = 1}^k (x_l + \alpha_l)}\right) =\sum_{j = 1}^k w_j \left(\dfrac{\alpha_j'}{\alpha_0'}\right)$

(The full posterior distribution can also be used to get interval estimates for the statistic, which will be the focus of the next article)
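As a rough sketch of the regressed statistic in Python: the weights and $\alpha_j$ values below are hypothetical placeholders (they are not the fangraphs wOBA weights or fitted parameters), and the function names are my own.

```python
def regressed_stat(weights, counts, alphas):
    """Linear weights applied to the posterior means (x_j + alpha_j) / (n + alpha_0)."""
    total = sum(counts) + sum(alphas)
    return sum(w * (x + a) / total for w, x, a in zip(weights, counts, alphas))

def raw_stat(weights, counts):
    """Linear weights applied to the raw proportions x_j / n."""
    n = sum(counts)
    return sum(w * x / n for w, x in zip(weights, counts))

# Hypothetical weights and prior for (walk, single, extra-base hit, out)
weights = [0.7, 0.9, 1.5, 0.0]
alphas = [24.0, 46.0, 24.0, 206.0]
counts = [30, 40, 30, 100]

print(round(raw_stat(weights, counts), 3),
      round(regressed_stat(weights, counts, alphas), 3))
```

The regressed value lands between the player's raw value and the league-average value implied by the $\alpha_j$.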

Estimation

This raises the obvious question of what values to use for the $\alpha_j$ in the Dirichlet distribution - ideally, $\alpha_j$ should be picked so that the Dirichlet distribution accurately models the joint distribution of talent levels for MLB hitters. There are different ways to do this, but I'm going to use MLB data itself and a marginal maximum likelihood technique to find estimates $\hat{\alpha_j}$ and plug those into the equations above - this method was chosen because there are existing R packages to do the estimation and it is relatively numerically stable and should be very close to the results of other methods for large sample sizes. Using the data to find estimates of the $\alpha_j$ and then plugging those back in makes this an empirical Bayesian technique.

First, it's necessary to get rid of the talent levels $\theta_j$ and find the probability of the observed data based not on a particular player's talent level(s), but on the Dirichlet distribution of population talent levels itself. This is done by integrating out each talent level:

$p(x_1, x_2, ..., x_k | \alpha_1, \alpha_2, ..., \alpha_k, n) = $

$\displaystyle \int_{\tilde{\theta}} p(x_1, x_2, ..., x_k | \theta_1, \theta_2, ..., \theta_k, n) p(\theta_1, \theta_2, ..., \theta_k | \alpha_1, \alpha_2, ..., \alpha_k, n) d\tilde{\theta}$

where $\tilde{\theta}$ indicates the set of all $\theta_j$ values - so integrating out the success probability for each outcome, from 0 to 1.

The calculus here is a bit tedious, so skipping straight to the solution, this gives probability mass function:

$ = \displaystyle \dfrac{n! \Gamma(\sum_{j = 1}^k \alpha_j)}{\Gamma(n + \sum_{j = 1}^k \alpha_j)} \prod_{j = 1}^k \dfrac{\Gamma(x_j + \alpha_j)}{x_j! \Gamma(\alpha_j)}$

This distribution, known as the Dirichlet-Multinomial distribution, represents the probability of getting $x_1, x_2, ..., x_k$ outcomes in fixed $n = x_1 + x_2 + ... + x_k$ events, given only information about the population. Essentially, this distribution would be used to answer the question "What's the probability that, if I select a player from the population of all MLB players completely at random - so not knowing the player's talent levels at all - the player gets $x_1$ singles, $x_2$ doubles, etc., in $n$ plate appearances?"
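As a quick numerical illustration of the mass function above, here is a minimal base-R sketch. The function name ddirmult_log is my own (it is not from any package), and the two-outcome check at the end is just a toy case chosen because the answer is known: with $\alpha = (1, 1)$ every split of $n$ events should be equally likely.

```r
# Log of the Dirichlet-multinomial mass function for one player's counts x,
# given population parameters alpha (same length as x).
# The name ddirmult_log is hypothetical, chosen for this sketch.
ddirmult_log <- function(x, alpha) {
  n <- sum(x)
  a0 <- sum(alpha)
  lgamma(n + 1) + lgamma(a0) - lgamma(n + a0) +
    sum(lgamma(x + alpha) - lgamma(x + 1) - lgamma(alpha))
}

# Toy check: k = 2 outcomes with alpha = (1, 1) is a beta-binomial with a
# uniform prior, so each of the n + 1 possible splits has probability 1 / (n + 1).
n <- 4
probs <- sapply(0:n, function(x1) exp(ddirmult_log(c(x1, n - x1), c(1, 1))))
print(probs)
print(sum(probs))
```

Working on the log scale with lgamma avoids overflow in the gamma functions once counts reach the hundreds, which is the scale of a full season.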

Using $x_{i,j}$ to represent the raw count for outcome $j$ for player $i = 1, 2, ..., N$ and $\tilde{x}_i$ as shorthand to represent the complete set of counting statistics for player $i$, with the counting stats of multiple players $\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_N$ from some population (such as all MLB players that meet some event threshold), statistical estimation procedures can be used to acquire estimates $\hat{\alpha}_j$ of the true population parameters $\alpha_j$.

For the maximum likelihood approach, the log-likelihood of a set of estimates $\tilde{\alpha}$ is given by

$\ell(\tilde{\alpha} | \tilde{x}_1, ..., \tilde{x}_N) = \displaystyle \sum_{i = 1}^N [\log(n_i!) + \log(\Gamma(\sum_{j = 1}^k \alpha_j)) - \log(\Gamma(n_i + \sum_{j = 1}^k \alpha_j))$

$+ \displaystyle \sum_{j = 1}^k \log(\Gamma(x_{i,j} + \alpha_j)) - \sum_{j = 1}^k \log(x_{i,j}!) - \sum_{j = 1}^k \log(\Gamma(\alpha_j))]$

The maximum likelihood method works by finding the values of the $\alpha_j$ that maximize $\ell(\tilde{\alpha})$ above - these are the maximum likelihood estimates $\hat{\alpha}_j$. From a numerical perspective, doing this is not simple, and papers have been written on fast, easy ways to perform the computations. For simplicity, I'm going to use the dirmult package in R, which only requires the set of counts for each outcome as a matrix where each row corresponds to exactly one player. The dirmult package can be installed with the command

> install.packages('dirmult')
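To see what is being maximized without relying on the package, the log-likelihood above can also be coded directly. This is only a sketch under my own assumptions: the function name dirmult_loglik, the simulated three-outcome data, the crude starting guess, and the use of base R's optim on the log scale are all illustrative, and this is not a description of how dirmult works internally.

```r
set.seed(1)

# Log-likelihood of alpha for a count matrix x (one row per player),
# summing the Dirichlet-multinomial log-mass over players.
# The name dirmult_loglik is hypothetical.
dirmult_loglik <- function(alpha, x) {
  n <- rowSums(x)
  a0 <- sum(alpha)
  sum(lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)) +
    sum(lgamma(sweep(x, 2, alpha, "+"))) -
    sum(lgamma(x + 1)) -
    nrow(x) * sum(lgamma(alpha))
}

# Simulated data (not real MLB data): 200 players, 400 events each.
true_alpha <- c(30, 10, 160)
N <- 200
theta <- matrix(rgamma(N * 3, shape = rep(true_alpha, each = N)), nrow = N)
theta <- theta / rowSums(theta)   # normalized gammas give Dirichlet draws
x <- t(apply(theta, 1, function(p) rmultinom(1, size = 400, prob = p)))

# Crude starting guess: observed mean proportions scaled to sum to 100.
start <- 100 * colMeans(x / rowSums(x))

# Maximize over log(alpha) so that alpha stays positive.
fit <- optim(log(start), function(la) -dirmult_loglik(exp(la), x))
alpha_hat <- exp(fit$par)
print(round(alpha_hat, 2))
```

With enough players the estimates should land in the neighborhood of true_alpha, though a general-purpose optimizer like this will be slower and less reliable than a purpose-built routine.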

Once the data is entered and estimation is performed, you will have estimates $\hat{\alpha}_j$. These can then be plugged into the posterior equations above to get the regressed statistic estimate

$w_1 \hat{\theta}_1 + w_2 \hat{\theta}_2 + ... + w_k \hat{\theta}_k = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \hat{\alpha}_j}{\sum_{l = 1}^k (x_l + \hat{\alpha}_l)}\right)$

I'll give two examples of offensive stats that can be regressed in this way.

wOBA Shrinkage

The first is weighted on-base average (which I'll call wOBA for short), introduced in Tom Tango's The Book, though as previously mentioned, I am using the fangraphs.com definition. For events, at-bats plus unintentional walks, sacrifice flies, and times hit by a pitch will be used (n = AB + BB - IBB + SF + HBP), and seven outcomes are defined - singles (1B), doubles (2B), triples (3B), home runs (HR), unintentional walks (BB), and hit by pitch (HBP), with everything else being lumped into an "other" (OTH) outcome.

For notation, again let $x_{i, 1B}$, $x_{i, 2B}$, ..., $x_{i, OTH}$ represent the number of singles, doubles, etc. for player $i$ (abbreviating the entire set as $\tilde{x}_i$) and let $\theta_{i, 1B}$, $\theta_{i, 2B}$, ..., $\theta_{i, OTH}$ represent the true probability of getting a single, double, etc. for player $i$ (abbreviating the entire set as $\tilde{\theta}_i$). The total number of events for player $i$ is given by $n_i$.

Data was collected from fangraphs.com on all MLB non-pitchers from 2010 - 2015. A cutoff of 300 events was used - so only players with at least 300 total AB + BB - IBB + SF + HBP in a given season were used. The code and data I used may be found in my github.

A player's wOBA can be written as a linear transformation of the $\theta_j$ for each of these outcomes with weights $w_{1B} = 0.89$, $w_{2B} = 1.27$, $w_{3B} = 1.62$, $w_{HR} = 2.10$, $w_{BB} = 0.69$, $w_{HBP} = 0.72$, and $w_{OTH} = 0$ as

$wOBA_i = 0.89*\theta_{i,1B} + 1.27*\theta_{i,2B} + 1.62*\theta_{i,3B} + 2.10*\theta_{i,HR} + 0.69*\theta_{i,BB}+0.72*\theta_{i,HBP}$

For player $i$, the distribution of the counts $\tilde{x}_i$ in $n_i$ events is multinomial with mass function

$p(\tilde{x}_i | \tilde{\theta}_i, n_i) =\dfrac{n_i!}{x_{i,1B}! x_{i,2B}! x_{i,3B}! x_{i,HR}! x_{i,BB}! x_{i,HBP}! x_{i,OTH}! }\theta_{i,1B}^{x_{i,1B}} \theta_{i,2B}^{x_{i,2B}} \theta_{i,3B}^{x_{i,3B}} \theta_{i,HR}^{x_{i,HR}} \theta_{i,BB}^{x_{i,BB}} \theta_{i,HBP}^{x_{i,HBP}} \theta_{i,OTH}^{x_{i,OTH}}$

The joint distribution of possible talent levels $\tilde{\theta}$ is assumed to be Dirichlet.

$p(\tilde{\theta}_i | \tilde{\alpha}) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_{i,1B}^{\alpha_{1B} - 1} \theta_{i,2B}^{\alpha_{2B} - 1}\theta_{i,3B}^{\alpha_{3B} - 1} \theta_{i,HR}^{\alpha_{HR} - 1}\theta_{i,BB}^{\alpha_{BB} - 1}\theta_{i,HBP}^{\alpha_{HBP} - 1}\theta_{i,OTH}^{\alpha_{OTH} - 1}$

To find the maximum likelihood estimates $\hat{\alpha}_j$ for this model using the dirmult package in R, the data needs to be loaded into a matrix, where row $i$ represents the raw counts for each outcome for player $i$. There are any number of ways to do this, but the first 10 rows of the matrix (out of 1598 total in the data set used for this example) should look something like:

> x[1:10,]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   91   38    1   42 109    5 353
[2,] 122   26    1   44   71    5 364
[3,]   58   25    1   18   43    0 196
[4,] 111   40    3   32   38    5 336
[5,]   63   25    0   30   56    3 253
[6,]   67   18    1   21   46    5 213
[7,]   86   24    2   43 108    6 362
[8,]   58   25    2   20   24    3 201
[9,]   35   16    2   25   56    2 200
[10,]   68   44    0   14   76    5 250

Once the data is in this form, finding the maximum likelihood estimates can be done with the commands

> dirmult.fit <- dirmult(x)
Iteration 1: Log-likelihood value: -863245.946200229
Iteration 2: Log-likelihood value: -863214.860463357
Iteration 3: Log-likelihood value: -863210.928976511
Iteration 4: Log-likelihood value: -863210.901250554
Iteration 5: Log-likelihood value: -863210.901248778

> dirmult.fit

$loglik

[1] -863210.9

$ite

[1] 5

$gamma

[1] 34.30376 10.44264 1.15606 5.73569 16.28635 1.96183 144.51164

$pi

[1] 0.160000389 0.048706790 0.005392124 0.026752540 0.075963185 0.009150412 0.674034560

$theta

[1] 0.004642569

What I called the $\hat{\alpha}_j$ are given as the gamma element of the output. The quantity $\alpha_0$ can be calculated by summing these up:

> alpha <- dirmult.fit$gamma
> sum(alpha)
[1] 214.398

So the joint distribution of talent levels over the population of MLB players with at least 300 events is approximated by a Dirichlet distribution with parameters:

$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{BB}, \theta_{HBP}, \theta_{OTH}) \sim Dirichlet(34.30, 10.44, 1.16, 5.74, 16.29, 1.96, 144.51)$

In 2013, Mike Trout had $x_{1B} = 115$ singles, $x_{2B} = 39$ doubles, $x_{3B} = 9$ triples, $x_{HR} = 27$ home runs, $x_{BB} = 100$ unintentional walks, $x_{HBP} = 9$ times hit by a pitch, and $x_{OTH} = 407$ other outcomes in $n = 706$ total events for a raw (non-regressed) wOBA of

$0.89 \left(\dfrac{115}{706}\right) + 1.27 \left(\dfrac{39}{706}\right) + 1.62 \left(\dfrac{9}{706}\right) + 2.10 \left(\dfrac{27}{706}\right) + 0.69 \left(\dfrac{100}{706}\right) + 0.72 \left(\dfrac{9}{706}\right) \approx 0.423$

In order to calculate the regressed weighted on-base average, first calculate the $\alpha_j'$ for Mike Trout's posterior distribution of batting ability by

$\alpha_{1B}' = 115 + 34.30 = 149.30$

$\alpha_{2B}' = 39 + 10.44 = 49.44$

$\alpha_{3B}' = 9+ 1.16 = 10.16$

$\alpha_{HR}' = 27 + 5.74 = 32.74$

$\alpha_{BB}' = 100 + 16.29 = 116.29$

$\alpha_{HBP}' = 9 + 1.96 = 10.96$

$\alpha_{OTH}' = 407 + 144.51 = 551.51$

These sum to $\alpha_0' = 920.40$. The regressed version of Mike Trout's 2013 weighted on-base average is then given by a linear transformation of the expected proportion in each outcome:

$0.89 \left(\dfrac{149.30}{920.40}\right) + 1.27 \left(\dfrac{49.44}{920.40}\right) + 1.62 \left(\dfrac{10.16}{920.40}\right) + 2.10 \left(\dfrac{32.74}{920.40}\right) + 0.69 \left(\dfrac{116.29}{920.40}\right) + 0.72 \left(\dfrac{10.96}{920.40}\right)$

$\approx 0.401$

So based solely on his 2013 stats and the population information, it's estimated that he is a "true" 0.401 wOBA hitter. Of course, we know from many more years of watching him that this is a bit unfair, and his "true" wOBA is closer to 0.423.
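The arithmetic above is easy to reproduce in R. The sketch below uses the unrounded $gamma values from the dirmult output and the counts quoted for Trout; the variable names are my own, and the outcome order (1B, 2B, 3B, HR, BB, HBP, OTH) is assumed to match the columns of the data matrix.

```r
# wOBA weights in the order 1B, 2B, 3B, HR, BB, HBP, OTH
w <- c(0.89, 1.27, 1.62, 2.10, 0.69, 0.72, 0)

# Mike Trout's 2013 counts in the same order (706 events in total)
x <- c(115, 39, 9, 27, 100, 9, 407)

# Estimated Dirichlet parameters, copied from the $gamma output above
alpha_hat <- c(34.30376, 10.44264, 1.15606, 5.73569, 16.28635, 1.96183, 144.51164)

# Raw wOBA uses the observed proportions x_j / n
raw_woba <- sum(w * x / sum(x))

# Regressed wOBA uses the posterior means alpha_j' / alpha_0'
alpha_post <- x + alpha_hat
regressed_woba <- sum(w * alpha_post / sum(alpha_post))

print(round(c(raw = raw_woba, regressed = regressed_woba), 3))
```

The two values round to 0.423 and 0.401, matching the hand calculation. Swapping in a different weight vector and count vector gives the regressed version of any other linear statistic fit with the same outcome categories.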

Stabilization

As a side note - and I have verified this by simulation, though I have not worked out the details yet mathematically - the so-called "stabilization point" (defined as a split-half correlation of $r = 0.5$) for wOBA is given by $\alpha_0$ - so if split-half correlation was conducted among the players from 2010-2015 with at least 300 AB + BB - IBB + SF + HBP, there should be a correlation of 0.5 after approximately 214 PA. I'm not sure if this works for just the wOBA weights or any arbitrary set of weights, though I suspect the fact that the weight for the "other" outcome is 0 and all the rest are nonzero has a big role to play in this.
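A simulation along these lines might look like the sketch below. It is not the simulation referred to above, just one plausible way to set it up: talent levels are drawn from the fitted Dirichlet, and each simulated player gets two independent "halves" of about $\alpha_0 \approx 214$ events each. The number of simulated players and all variable names are assumptions.

```r
set.seed(42)

alpha_hat <- c(34.30376, 10.44264, 1.15606, 5.73569, 16.28635, 1.96183, 144.51164)
w <- c(0.89, 1.27, 1.62, 2.10, 0.69, 0.72, 0)
k <- length(alpha_hat)
N <- 5000                         # hypothetical number of simulated players
n_half <- round(sum(alpha_hat))   # events in each half (214)

# Dirichlet draws via normalized gamma variables, one row per player
theta <- matrix(rgamma(N * k, shape = rep(alpha_hat, each = N)), nrow = N)
theta <- theta / rowSums(theta)

# Simulate one half-sample of n_half events per player and return raw wOBA
sim_woba <- function() {
  counts <- apply(theta, 1, function(p) rmultinom(1, size = n_half, prob = p))
  as.vector(w %*% counts) / n_half
}

# Correlation between the two halves across players
r <- cor(sim_woba(), sim_woba())
print(r)
```

Under the model, the printed correlation should come out close to 0.5. Changing w to another set of weights is a cheap way to probe whether the result depends on the wOBA weights specifically.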

SLG Shrinkage

Another statistic that can be regressed in this same way is slugging percentage (which I'll call SLG for short). Using at-bats (AB) as events, and defining five outcomes - singles (1B), doubles (2B), triples (3B), and home runs (HR), with everything else being lumped into the "other" (OTH) outcome - a player's slugging percentage can be written as a linear transformation of the $\theta_j$ for each of these outcomes with weights $w_{1B} = 1$, $w_{2B} = 2$, $w_{3B} = 3$, $w_{HR} = 4$, and $w_{OTH} = 0$

$SLG_i = 1*\theta_{i,1B} + 2*\theta_{i,2B} + 3*\theta_{i,3B} + 4*\theta_{i,HR}$

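For player $i$, the multinomial distribution of the counts of $x_{i,1B}$ singles, $x_{i,2B}$ doubles, $x_{i,3B}$ triples, $x_{i,HR}$ home runs, and $x_{i,OTH}$ other outcomes in $n_i$ at-bats is

$p(\tilde{x}_i | \tilde{\theta}_i, n_i) =\dfrac{n_i!}{x_{i,1B}! x_{i,2B}! x_{i,3B}! x_{i,HR}! x_{i,OTH}! }\theta_{i,1B}^{x_{i,1B}} \theta_{i,2B}^{x_{i,2B}} \theta_{i,3B}^{x_{i,3B}} \theta_{i,HR}^{x_{i,HR}} \theta_{i,OTH}^{x_{i,OTH}}$

The Dirichlet distribution of all possible $\tilde{\theta}$ values is

$p(\tilde{\theta}_i | \tilde{\alpha}) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_{i,1B}^{\alpha_{1B} - 1} \theta_{i,2B}^{\alpha_{2B} - 1}\theta_{i,3B}^{\alpha_{3B} - 1} \theta_{i,HR}^{\alpha_{HR} - 1}\theta_{i,OTH}^{\alpha_{OTH} - 1}$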

Once again, data from fangraphs.com was used, and all MLB non-pitchers from 2010-2015 who had at least 300 AB in a given season were included in the sample. To find maximum likelihood estimates in R the data needs to be loaded into a matrix where row $i$ represents the raw counts for each outcome $\tilde{x}_i$ for player $i$. The first 10 rows of the matrix (out of 1477) should look something like:

> x[1:10,]
      [,1] [,2] [,3] [,4] [,5]
[1,]   91   38    1   42 349
[2,] 122   26    1   44 362
[3,] 111   40    3   32 332
[4,]   63   25    0   30 251
[5,]   67   18    1   21 208
[6,]   86   24    2   43 358
[7,]   58   25    2   20 199
[8,]   68   44    0   14 248
[9,] 102   36    2   37 370
[10,] 119   48    0   30 375

The dirmult package can then be used to find maximum likelihood estimates for the $\alpha_j$ of the underlying joint Dirichlet distribution of talent levels

> dirmult.fit <- dirmult(x)
Iteration 1: Log-likelihood value: -575845.635311559
Iteration 2: Log-likelihood value: -575999.779702559
Iteration 3: Log-likelihood value: -575829.132259007
Iteration 4: Log-likelihood value: -575784.936726078
Iteration 5: Log-likelihood value: -575780.270877135
Iteration 6: Log-likelihood value: -575780.190649985
Iteration 7: Log-likelihood value: -575780.1906191

> dirmult.fit

$loglik
[1] -575780.2

$ite
[1] 7

$gamma
[1] 42.443604 12.855782   1.381905   7.073672 176.120837

$pi
[1] 0.176939918 0.053593494 0.005760921 0.029488894 0.734216774

$theta
[1] 0.004151517

And $\alpha_0$ is

> alpha <- dirmult.fit$gamma
> sum(alpha)
[1] 239.8758

So the joint distribution of talent levels over the population of MLB players with at least 300 AB is given by a Dirichlet distribution.

$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{OTH}) \sim Dirichlet(42.44, 12.86, 1.38, 7.07, 176.12)$

(This also implies the "stabilization point" for slugging percentage should be at around $\alpha_0 \approx 240$ AB - this is different than for wOBA because the definition of "events" is different between the two statistics)

In 2013, Mike Trout had $x_{1B} = 115$ singles, $x_{2B} = 39$ doubles, $x_{3B} = 9$ triples, $x_{HR} = 27$ home runs, and $x_{OTH} = 399$ other outcomes in $n = 589$ at-bats, for a raw (non-regressed) slugging percentage of

$1 \left(\dfrac{115}{589}\right) + 2 \left(\dfrac{39}{589}\right) + 3 \left(\dfrac{9}{589}\right) + 4 \left(\dfrac{27}{589}\right) + 0 \left(\dfrac{399}{589}\right) \approx 0.557$

In order to calculate the regressed slugging percentage, calculate Mike Trout's posterior distribution for batting ability by

$\alpha_{1B}' = 115 + 42.44 = 157.44$

$\alpha_{2B}' = 39 + 12.86 = 51.86$

$\alpha_{3B}' = 9+ 1.38 = 10.38$

$\alpha_{HR}' = 27 + 7.07 = 34.07$

$\alpha_{OTH}' = 399 + 176.12 = 575.12$

These sum to $\alpha_0' = 828.87$. The regressed version of Mike Trout's 2013 slugging percentage is then given by

$1 \left(\dfrac{157.44}{828.87}\right) + 2 \left(\dfrac{51.86}{828.87}\right) + 3 \left(\dfrac{10.38}{828.87}\right) + 4 \left(\dfrac{34.07}{828.87}\right) \approx 0.517$

Model Criticisms

From a statistical perspective, this is the most convenient way to perform shrinkage of wOBA, but it doesn't necessarily mean that this is correct - all of this research is dependent on how well the Dirichlet models the joint distribution of talent levels in the league. The fact that the beta works well for the population distributions of each of the talent levels when looked at individually is no guarantee that the multivariate extension should work well for the joint.

In order to do a simple test of the fit of the model, data was simulated from the fit model used to perform the wOBA shrinkage (the posterior predictive means and variances are actually available exactly for this model, but it's good practice to simulate). A set of $\tilde{\theta}_i$ simulated from the Dirichlet distribution was used to simulate a corresponding set of $\tilde{x}_i$, with the same $n_i$ as in the original data set. Comparing the means and standard deviations of the real and simulated data set, the means are

\begin{array}{c c c}
\textrm{Outcome} & \textrm{Observed Mean} & \textrm{Simulated Mean} \\ \hline
\textrm{1B} & 0.1598 & 0.1598 \\
\textrm{2B}   & 0.0473 & 0.0480 \\
\textrm{3B}   & 0.0049 & 0.0053 \\
\textrm{HR} & 0.0275 & 0.0262 \\
\textrm{BB} & 0.0772 & 0.0756 \\
\textrm{HBP} & 0.0088 & 0.0095 \\
\textrm{OTH}   & 0.6745 & 0.6756 \\
\end{array}

which look relatively good - the simulated and real means are fairly close. For the standard deviations, real and simulated values are

\begin{array}{c c c}
\textrm{Outcome} & \textrm{Observed SD} & \textrm{Simulated SD} \\ \hline
\textrm{1B} & 0.0296 & 0.0303 \\
\textrm{2B}   & 0.0114 & 0.0176 \\
\textrm{3B}   & 0.0049 & 0.0061 \\
\textrm{HR} & 0.0153 & 0.0131 \\
\textrm{BB} &   0.0278 & 0.0217 \\
\textrm{HBP} & 0.0072 & 0.0080 \\
\textrm{OTH}   & 0.0324 & 0.0381\\
\end{array}

which isn't nearly as good. Going by the table, the simulated standard deviations are larger than the observed ones for double rates and "other" outcome rates and smaller for home run and walk rates, so the Multinomial-Dirichlet model is overestimating the amount of variance in the former while underestimating it in the latter. It's not to an extreme extent - and comparing histograms, the shapes of the real and simulated data sets match - but it does present a source of problems. More ad-hoc methods may give superior results.
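For anyone who wants to repeat this kind of check, a rough sketch of the simulation step is below. The real $n_i$ are in the data set and not reproduced here, so the sketch substitutes made-up event counts between 300 and 700; that choice, and the variable names, are assumptions, which means the simulated standard deviations will not exactly match the table above.

```r
set.seed(7)

alpha_hat <- c(34.30376, 10.44264, 1.15606, 5.73569, 16.28635, 1.96183, 144.51164)
outcomes <- c("1B", "2B", "3B", "HR", "BB", "HBP", "OTH")
k <- length(alpha_hat)
N <- 1598

# Hypothetical stand-in for the real per-player event counts n_i
n_i <- sample(300:700, N, replace = TRUE)

# Step 1: simulate talent levels from the fitted Dirichlet
theta <- matrix(rgamma(N * k, shape = rep(alpha_hat, each = N)), nrow = N)
theta <- theta / rowSums(theta)

# Step 2: simulate counts for each player and convert to observed rates
rates <- t(sapply(seq_len(N), function(i) {
  rmultinom(1, size = n_i[i], prob = theta[i, ])[, 1] / n_i[i]
}))

# Step 3: summarize the simulated rates by outcome
sim_summary <- data.frame(outcome = outcomes,
                          mean = colMeans(rates),
                          sd = apply(rates, 2, sd))
print(sim_summary, digits = 3)
```

The same summary computed on the real data matrix (row-wise rates, then column means and standard deviations) gives the "observed" columns to compare against.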

2016 Win Total Predictions (Through All-Star Break)

2016-07-16T18:47:00.003-05:00

These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway). These predictions include all games played before the all-star break.

I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season the actual coverage is just under 93%, with intervals usually being one game off if and when they are off.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins/Games}\\ \hline

ARI & 62 & 72.11 & 82 & 76.75 & 38 / 90 \\
ATL & 52 & 61.82 & 72 & 68.4 & 31 / 89 \\
BAL & 81 & 90.93 & 101 & 86.25 & 51 / 87 \\
BOS & 80 & 90.3 & 100 & 89.19 & 49 / 87 \\
CHC & 87 & 96.9 & 106 & 96.11 & 53 / 88 \\
CHW & 71 & 81.04 & 91 & 80 & 44 / 87 \\
CIN & 51 & 60.62 & 70 & 63.51 & 32 / 89 \\
CLE & 84 & 93.41 & 103 & 90.67 & 52 / 88 \\
COL & 66 & 76.18 & 86 & 79.22 & 40 / 88 \\
DET & 73 & 82.55 & 92 & 81.11 & 46 / 89 \\
HOU & 76 & 85.81 & 96 & 83.9 & 48 / 89 \\
KCR & 70 & 80.3 & 90 & 77.29 & 45 / 88 \\
LAA & 62 & 71.88 & 82 & 77.4 & 37 / 89 \\
LAD & 80 & 89.43 & 99 & 87.7 & 51 / 91 \\
MIA & 75 & 84.5 & 94 & 82.1 & 47 / 88 \\
MIL & 61 & 71.33 & 81 & 71.98 & 38 / 87 \\
MIN & 56 & 65.83 & 76 & 73.06 & 32 / 87 \\
NYM & 75 & 84.9 & 95 & 82.97 & 47 / 88 \\
NYY & 69 & 78.88 & 89 & 76.38 & 44 / 88 \\
OAK & 61 & 70.93 & 81 & 73.08 & 38 / 89 \\
PHI & 64 & 74 & 84 & 72 & 42 / 90 \\
PIT & 73 & 82.71 & 93 & 81.45 & 46 / 89 \\
SDP & 62 & 72.04 & 82 & 75.53 & 38 / 89 \\
SEA & 74 & 83.44 & 93 & 85.31 & 45 / 89 \\
SFG & 87 & 96.8 & 106 & 89.55 & 57 / 90 \\
STL & 77 & 87.13 & 97 & 90.03 & 46 / 88 \\
TBR & 58 & 67.3 & 77 & 72.91 & 34 / 88 \\
TEX & 81 & 91.22 & 101 & 83.75 & 54 / 90 \\
TOR & 80 & 89.42 & 99 & 87.66 & 51 / 91 \\
WSN & 86 & 95.42 & 105 & 93.21 & 54 / 90 \\ \hline\end{array}

It's still fairly difficult to predict final win totals even a little over halfway through the season - intervals have a width of approximately 20 games. A few stand-out points - the teams that are predicted to definitely finish below 0.500 are the Atlanta Braves, the Cincinnati Reds, the Minnesota Twins, and the Tampa Bay Rays, with the Reds being the worst of those teams (they are estimated as a "true" 63.51 win team). On the other side, the teams predicted to definitely finish above 0.500 are the Chicago Cubs, the Cleveland Indians, the San Francisco Giants, and the Washington Nationals, with the Cubs being the best of these teams (they are estimated as a "true" 96.11 win team). The Texas Rangers and San Francisco Giants in particular have been exceptionally lucky teams - both are predicted to win approximately 7 more games than their "true" win total. Likewise, the Atlanta Braves and Minnesota Twins have been unlucky, both predicted to win approximately 7 fewer games than their "true" win total.

To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season until the all-star break).
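The coin-flip arithmetic translates directly into a one-line calculation. The function below is only my reading of the explanation above (current wins plus true win rate times remaining games), not the estimator itself, and the function and argument names are made up for the example.

```r
# Projected final total = wins so far + true rate * games remaining.
# projected_wins and its argument names are hypothetical.
projected_wins <- function(current_wins, games_played, true_win_total,
                           season_games = 162) {
  true_rate <- true_win_total / season_games
  current_wins + true_rate * (season_games - games_played)
}

# Coin example: 4 heads in the first 5 of 10 flips, 5 expected heads overall
coin <- projected_wins(4, 5, 5, season_games = 10)
print(coin)

# Arizona row of the table: 38 wins in 90 games, "true" win total of 76.75
ari <- projected_wins(38, 90, 76.75)
print(round(ari, 2))
```

The coin example returns 6.5, and the Arizona calculation reproduces the 72.11 in the "Mean" column, which is at least consistent with the explanation given here.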

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins than would be expected based on their talent level.

Let's Code an MCMC for a Hierarchical Model for Batting Averages

2016-05-24T16:27:00.003-05:00

In previous articles, I've discussed empirical Bayesian estimation for the beta-binomial model. Empirical Bayesian analysis is useful, but it's only an approximation to the full hierarchical Bayesian analysis. In this post, I'm going to work through the entire process of doing an equivalent full hierarchical Bayesian analysis with MCMC, from looking at the data and picking a model to creating the MCMC to checking the results. There are, of course, great packages and programs out there such as PyMC and Stan that will fit the MCMC for you, but I want to give a basic and complete "under the hood" example.

Before I get started, I want to be clear that coding a Bayesian analysis with MCMC from scratch involves many choices and multiple checks at almost all levels. I'm going to hand wave some choices based on what I know will work well (though I'll try to be clear where and why I'm doing so) and I'm not going to attempt to show every possible way of checking an MCMC procedure in one post - so statistics such as $\hat{R}$ and effective sample size will not be discussed. For a fuller treatment of Bayesian estimation using MCMC, I recommend Gelman et al.'s Bayesian Data Analysis and/or Carlin and Louis's Bayesian Methods for Data Analysis.

As usual, all my code and data can be found on my github.

The Data and Notation

The goal is to fit a hierarchical model to batting averages in the 2015 season. I'm going to limit my data set to only the batting averages of all MLB hitters (excluding pitchers) who had at least 300 AB, as those who do not meet these qualifications can arguably be said to come from a different "population" of players. This data was collected from fangraphs.com and can be seen in the histogram below.

For notation, I'm going to let $i$ index MLB players in the sample and define $\theta_i$ as a player's "true" batting average in 2015. The goal is to use the observed number of hits $x_i$ in $n_i$ at-bats (AB) to estimate $\theta_i$ for player $i$. I'll assume that I have $N$ total players - in 2015, there were $N = 254$ non-pitchers with at least 300 AB.

I'm also going to use a $\sim$ over a variable to represent the collection of statistics over all players in the sample. For example, $\tilde{x} = \{ x_1, x_2, ..., x_N\}$ and $\tilde{\theta} = \{\theta_1, \theta_2, ..., \theta_N\}$.

Lastly, when we get to the MCMC part, we're going to take samples from the posterior distributions rather than calculating them directly. I'm going to use $\mu^*_j$ to represent the set of samples from the posterior distribution for $\mu$, where $j$ indexes 1 to however many samples the computer is programmed to obtain (usually a very large number, since computation is relatively cheap these days), and similarly $\phi^*_j$ and $\theta^*_{i,j}$ for samples from the posterior distribution of $\phi$ and $\theta_i$, respectively.

The Model

First, the model must be specified. I'll assume that for each at-bat, a given player has identical probability $\theta_i$ of getting a hit, independent of other at-bats. The distribution of the total number of hits in $n_i$ at-bats is then binomial.

$x_i \sim Bin(n_i, \theta_i)$

For the distribution of the batting averages $\theta_i$ themselves, I'm going to use a beta distribution. Looking at the histogram of the data, it looks relatively unimodal and bell-shaped, and batting averages by definition must be between 0 and 1. Keep in mind that the distribution of observed batting averages $x_i/n_i$ is not the same as the distribution of actual batting averages $\theta_i$, but even after taking into account the binomial variation around the true batting averages, the distribution of the $\theta_i$ should also be unimodal, roughly bell-shaped, and bounded by 0 and 1. The beta distribution - bounded by 0 and 1 by definition - will be able to take that shape (though others have plausibly argued that a beta is not entirely correct).

Most people are familiar with the beta distribution in terms of $\alpha$ and $\beta$:

$\theta_i \sim Beta(\alpha, \beta)$

There isn't anything wrong with coding an MCMC in this form (and it would almost certainly work well in this scenario), but I know from experience that a different parametrization works better - I'm going to use the beta distribution with parameters $\mu$ and $\phi$:

$\theta_i \sim Beta(\mu, \phi)$

where $\mu$ and $\phi$ are given in terms of $\alpha$ and $\beta$ as

$\mu = \dfrac{\alpha}{\alpha + \beta}$

$\phi = \dfrac{1}{\alpha + \beta + 1}$

In this parametrization, $\mu$ represents the expected value $E[\theta_i]$ of the beta distribution - the true league mean batting average - and $\phi$, known formally as the "dispersion parameter," is the correlation between two individual at-bats from the same randomly chosen player - in sabermetric speak, it's how much a hitter's batting average has "stabilized" after a single at-bat. The value of $\phi$ controls how spread out the $\theta_i$ are around $\mu$.

The advantage of using this parametrization instead of the traditional one is that both $\mu$ and $\phi$ are bounded between 0 and 1 (whereas $\alpha$ and $\beta$ can take any value from 0 to $\infty$) and a closed parameter space makes the process of specifying priors easier and will improve the convergence of the MCMC algorithm later on.
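Since R's beta functions (rbeta, dbeta, and so on) expect the traditional $\alpha, \beta$ form, it helps to have the conversion in both directions. A small sketch, with function names of my own choosing and purely illustrative values of $\mu$ and $\phi$:

```r
# Convert (mu, phi) to the traditional (alpha, beta).
# Since phi = 1 / (alpha + beta + 1), alpha + beta = (1 - phi) / phi.
to_alpha_beta <- function(mu, phi) {
  m <- (1 - phi) / phi
  c(alpha = mu * m, beta = (1 - mu) * m)
}

# Convert (alpha, beta) back to (mu, phi)
to_mu_phi <- function(alpha, beta) {
  c(mu = alpha / (alpha + beta), phi = 1 / (alpha + beta + 1))
}

# Illustrative (hypothetical) values, not estimates from the data
ab <- to_alpha_beta(mu = 0.26, phi = 0.003)
print(ab)

back <- to_mu_phi(ab[["alpha"]], ab[["beta"]])
print(back)
```

The round trip should return the original $\mu$ and $\phi$, which is a handy check when moving between the two forms inside the sampler.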

Finally, priors must be chosen for the parameters $\mu$ and $\phi$. I'm going to lazily choose diffuse beta priors for both.

$\mu \sim Beta(0.5,0.5)$

$\phi \sim Beta(0.5,0.5)$

The advantage of choosing beta distributions for both (possible with the parametrization I used!) is that both priors are proper (in the sense of being valid probability density functions), and proper priors always yield proper posteriors - so that eliminates one potential problem to worry about. These prior distributions are definitely arguable - they put a fair amount of probability at the ends of the distributions, and I know for a fact that the true league mean batting average isn't actually 0.983 or 0.017, but I wanted to use something that worked well in the MCMC procedure and wasn't simply a flat uniform prior between 0 and 1.

The Math

Before jumping into the code, we need to do some math. Mass functions and densities of the binomial distribution for the $x_i$, beta distributions for $\theta_i$ (in terms of $\mu$ and $\phi$), and beta priors for $\mu$ and $\phi$ are given by

$p(x_i | n_i, \theta_i) = \displaystyle {n_i \choose x_i} \theta_i^{x_i} (1-\theta_i)^{n_i - x_i}$

$p(\theta_i | \mu, \phi) = \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$

$\pi(\mu) = \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)}$

$\pi(\phi) = \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$

From Bayes' theorem, the joint posterior density of $\mu$, $\phi$, and all $N = 254$ of the $\theta_i$ is given by

$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) = \dfrac{p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi)}{\int \int ... \int \int p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi) d\tilde{\theta} d\mu d\phi}$

The $...$ in the integrals means that every single one of the $\theta_i$ must be integrated out as well as $\mu$ and $\phi$, so the numerical integration here involves 256 dimensions. This is not numerically tractable, hence Markov chain Monte Carlo will be used instead.

The goal of Markov chain Monte Carlo is to draw a "chain" of samples $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ from the posterior distribution $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$. This is going to be accomplished in iterations, where at each iteration $j$ the distribution of the samples depends only on the values at the previous iteration $j-1$ (this is the "Markov" property of the chain). There are two basic "building block" techniques that are commonly used to do this.

The first technique is called the Gibbs sampler. The full joint posterior $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$ may not be known, but suppose that given values of the other parameters, the conditional posterior distribution $p(\tilde{\theta} | \mu, \phi, \tilde{x}, \tilde{n})$ is known - if so, it can be used to simulate $\tilde{\theta}$ values from $p(\tilde{\theta} | \mu^*_j, \phi^*_j, \tilde{x}, \tilde{n})$.

Looking at the joint posterior described above, the denominator of the posterior density (after performing all integrations) is just a normalizing constant, so we can focus on the numerator:

$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) \propto p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi) $

$= \displaystyle \prod_{i = 1}^N \left( {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)} \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$

From here, we can ignore any of the terms above that do not have a $\phi$, a $\mu$, or a $\theta_i$ in them, since those will either cancel out or remain constants in the full posterior as well:

$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$

Now we're going to check and see if there are any terms that, when looked at as variables with everything else treated as a constant, take the form of a recognizable distribution. It turns out that the function:

$\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}$

is the kernel of an un-normalized beta distribution for $\theta_i$ with parameters

$\alpha_i = x_i + \mu \left(\dfrac{1-\phi}{\phi}\right)$

$\beta_i = n_i - x_i + (1-\mu) \left(\dfrac{1-\phi}{\phi}\right) $

since we are assuming $\mu$ and $\phi$ are known in the conditional distribution. Hence, we can say that the conditional distribution of the $\theta_i$ given $\mu$, $\phi$, and the data is beta.

This fact can be used in the MCMC to draw an observation $\theta^*_{i,j}$ from the posterior distribution for each $\theta_i$ given draws $\mu^*_j$ and $\phi^*_j$ from the posterior distributions for $\mu$ and $\phi$:

$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$

Note that this formulation uses the traditional $\alpha, \beta$ parametrization. This is a "Gibbs step" for the $\theta_i$.
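In R, the Gibbs step for all of the $\theta_i$ at once is a single vectorized call to rbeta. The sketch below uses a hypothetical function name and made-up hits, at-bats, and current values of $\mu$ and $\phi$ purely for illustration.

```r
# One Gibbs draw of every theta_i given the current mu and phi.
# x = hits, n = at-bats (vectors of equal length); gibbs_theta is a hypothetical name.
gibbs_theta <- function(x, n, mu, phi) {
  m <- (1 - phi) / phi   # equals alpha + beta in the traditional form
  rbeta(length(x),
        shape1 = x + mu * m,
        shape2 = n - x + (1 - mu) * m)
}

set.seed(3)
x <- c(150, 120, 90)   # made-up hit totals
n <- c(500, 450, 400)  # made-up at-bat totals
theta_draw <- gibbs_theta(x, n, mu = 0.26, phi = 0.003)
print(theta_draw)
```

Because the draw is exact from the conditional distribution, there is no accept or reject decision in this step.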

Unfortunately, looking at $\mu$ and $\phi$ in isolation doesn't yield a similar outcome - observing just the terms involving $\mu$ and treating everything else as constant, for example, gives the function

$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}$

which is not recognizable as the kernel of any common density. Doing the same thing for $\phi$ gives a nearly identical function. Hence, the Gibbs technique won't be used for $\mu$ and $\phi$.

One advantage, however, of recognizing that the conditional distribution of the $\theta_i$ given all other parameters is beta is that we can integrate the $\theta_i$ out in the likelihood in order to get at the distributions of $\mu$ and $\phi$ more directly:

$\displaystyle p(x_i, n_i | \mu, \phi) = \int_0^1 p(x_i, n_i | \theta_i) p(\theta_i | \mu, \phi) d\theta_i = \int_0^1 {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} d\theta_i $

$ = \displaystyle {n_i \choose x_i} \dfrac{\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$
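The closed form can be sanity-checked numerically. The sketch below integrates the binomial likelihood times the beta density with R's integrate() and compares the result to the ratio of beta functions; the values of $x$, $n$, $\mu$, and $\phi$ are made up for illustration.

```r
# Numerical check that integrating theta out of (binomial x beta) gives the
# closed-form ratio of beta functions. All input values are made up.
x <- 30; n <- 100; mu <- 0.265; phi <- 0.02
a <- mu * (1 - phi) / phi
b <- (1 - mu) * (1 - phi) / phi

# Integrand: binomial likelihood times the Beta(a, b) density
integrand <- function(theta) dbinom(x, n, theta) * dbeta(theta, a, b)
numeric.value <- integrate(integrand, lower = 0, upper = 1)$value

# Closed form, computed on the log scale for numerical stability
closed.form <- exp(lchoose(n, x) + lbeta(x + a, n - x + b) - lbeta(a, b))

c(numeric = numeric.value, closed = closed.form)
```

The two numbers should agree up to the tolerance of the numerical integration.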

In fact, we can do this for every single one of the $\theta_i$ in the formula above and rewrite the posterior function just in terms of $\mu$ and $\phi$:

$p(\mu, \phi | \tilde{x}, \tilde{n}) \propto \displaystyle \prod_{i = 1}^N \left( \dfrac{\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$

This leads directly into the second (and more general) technique for obtaining draws from the posterior distribution: the Metropolis-Hastings algorithm. Suppose that instead of the full posterior $p(\mu, \phi | \tilde{x}, \tilde{n})$, you have a function that is proportional to the full posterior (like the numerator above)

$h(\mu, \phi | \tilde{x}, \tilde{n}) \propto p(\mu, \phi | \tilde{x}, \tilde{n})$

It's possible to construct a Markov chain of $\mu^*_j$ samples using the following steps:

Simulate a candidate value $\mu^*_c$ from some distribution $G(\mu^*_c | \mu^*_{j-1})$
Simulate $u$ from a uniform distribution between 0 and 1.
Calculate the ratio

$\dfrac{h(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}{h(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}$

If this ratio is larger than $u$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$.
If this ratio is smaller than $u$, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$

A nearly identical step may be used to draw a sample $\phi^*_j$, only using $h(\mu^*_{j-1}, \phi^*_{c} | \tilde{x}, \tilde{n})$ instead. Note that at each Metropolis-Hastings step the value from the previous iteration is used, even if a new value for another parameter was accepted in another step.

In practice, there are two things that are very commonly (but not always) done for Metropolis-Hastings steps: first, calculations are generally performed on the log scale, as the computations become much, much more numerically stable. To do this, we simply need to take the log of the function $h(\mu, \phi | \tilde{x}, \tilde{n})$ above:

$m(\mu, \phi | \tilde{x}, \tilde{n}) = \log[h(\mu, \phi | \tilde{x}, \tilde{n})] = \displaystyle \sum_{i = 1}^N \left[ \log(\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi))\right]$

$- N \log(\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)) - 0.5\log(\mu) - 0.5\log(1-\mu) - 0.5\log(\phi) - 0.5\log(1-\phi)$

This $m$ function is called repeatedly throughout the code. Secondly, for the candidate distribution, a normal distribution is used centered at the previous value of the chain, with some pre-chosen variance $\sigma^2$, which I will explain how to determine in the next section. Using $\mu$ as an example, the candidate distribution would be

$G(\mu^*_c | \mu^*_{j-1}) \sim N(\mu^*_{j -1}, \sigma^2_{\mu})$

Using these two adjustments, the Metropolis-Hastings step for $\mu$ then becomes

Simulate a candidate value $\mu^*_c$ from a $N(\mu^*_{j-1}, \sigma^2_{\mu})$ distribution
Simulate $u$ from a uniform distribution between 0 and 1.
If $m(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) - m(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) > \log(u)$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$. Otherwise, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$

With Metropolis-Hastings steps and Gibbs steps, we can create a Markov chain that converges to the posterior distribution.
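The log-scale step described above can be sketched as a small generic function. This is only an illustration under stated assumptions: the function name mh.step is hypothetical, and the target is a toy $Beta(3,5)$ kernel rather than the posterior from this article, chosen because its mean (3/8) is known.

```r
# One random-walk Metropolis-Hastings update on the log scale.
# log.h:   function returning the log of something proportional to the target
# current: previous value of the chain
# sigma:   standard deviation of the normal candidate distribution
mh.step <- function(current, log.h, sigma) {
  cand <- rnorm(1, current, sigma)   # candidate from N(current, sigma^2)
  u <- runif(1)
  if (log.h(cand) - log.h(current) > log(u)) cand else current
}

# Toy target (an assumption for illustration): the kernel of a Beta(3, 5),
# set to -Inf outside (0, 1) so that such candidates are always rejected
log.h <- function(p) if (p > 0 && p < 1) 2 * log(p) + 4 * log(1 - p) else -Inf

set.seed(42)
chain <- numeric(20000)
chain[1] <- 0.5
for (j in 2:length(chain)) chain[j] <- mh.step(chain[j - 1], log.h, sigma = 0.3)

mean(chain[-(1:1000)])   # should land near the Beta(3, 5) mean of 3/8
```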

Choosing Starting Values and Checking Output

Now that we have either the conditional posteriors we need for the Gibbs sampler or a function proportional to them for the Metropolis-Hastings steps, it's time to write code to sample from them. Each iteration of the MCMC code will perform the following steps:

Draw a candidate value $\mu^*_c$ from $N(\mu^*_{j-1}, \sigma^2_{\mu})$
Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\mu^*_c$. If accepted, set $\mu^*_j = \mu^*_c$. If rejected, set $\mu^*_j = \mu^*_{j - 1}$
Draw a candidate value $\phi^*_c$ from $N(\phi^*_{j-1}, \sigma^2_{\phi})$
Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\phi^*_c$. If accepted, set $\phi^*_j = \phi^*_c$. If rejected, set $\phi^*_j = \phi^*_{j - 1}$
For each of the $\theta^*_i$, draw a new $\theta^*_{i,j}$ from the conditional beta distribution:

$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$

Again, note that this formulation of the beta distribution uses the traditional $\alpha, \beta$ parametrization.

A problem emerges - we need starting values $\mu^*_1$ and $\phi^*_1$ before we can use the algorithm (starting values for the $\theta^*_{i,1}$ aren't needed - the Gibbs sampler in step 5 above can be used to simulate them given starting values for the other two parameters). Ideally, you would pick starting values in a high-probability area of the posterior distribution, but if you knew the posterior distribution you wouldn't be performing MCMC!

You could just pick arbitrary starting points - statistical theory says that no matter what starting values you choose, the distribution of samples from the Markov chain will eventually converge to the distribution of the posterior you want (assuming certain regularity conditions which I will not go into), but there's no hard and fast rule on how long it will take. If you pick values extremely far away from the posterior, it could take quite a while for your chain to converge. There's a chance you could have run your code for 10,000 iterations and still not have reached the posterior distribution, and there's no way of knowing since you don't know the posterior to begin with!

Statisticians generally do two things to check that this hasn't occurred:

Use multiple starting points to create multiple chains of $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ that can be compared (visually or otherwise) to see if they all appear to have converged to the same area in the parameter space.
Use a fixed number of "burn-in" iterations to give the chain a chance to converge to the posterior distribution before taking the "real" draws from the chain.

There is no definite answer on exactly how to pick the different starting points - you could randomly choose points in the parameter space (which is handily confined to between 0 and 1 for the parametrization I used!), or you could obtain estimates from some frequentist statistical procedure (such as method of moments or marginal maximum likelihood) and use those, or you could pick values based on your own knowledge of the problem - for example, choosing $\mu^*_1 = 0.265$ based on knowing that the league mean batting average is probably close to that value. No matter how you do it, starting points should be spread out over the parameter space to make sure the chains aren't all going to the same place just because they started off close to each other.

Two more questions must be answered to perform the Metropolis-Hastings step - how do you choose $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ in the normal candidate distributions? And how often should you accept the candidate values?

The answers to these questions are closely tied to each other. For mathematical reasons that I will not go into in this article (and a bit of old habit), I usually aim for an acceptance rate of around 40%, though the specific value depends on the dimensionality of the problem (see this paper by Gelman, Roberts, and Gilks for more information). In practice, I'm usually not worried if it's 30% or 50% as long as everything else looks okay.

If the acceptance rate is good, then a plot of the value of the chain versus the iteration number (called a "trace plot") should look something like

I've used two chains for $\mu$ here, starting at different points. The "spiky blob" shape is exactly what we're looking for - the values of the chains jump around at a good pace, while still making large enough jumps to effectively cover the parameter space.

If the acceptance rate is too small or too large, it can be adjusted by changing $\sigma^2$ in the normal candidate distribution. An acceptance rate that is too low means that the chains will not move around the parameter space effectively. If this is the case, a plot of the chain value versus the iteration number looks like

The plot looks nicer visually, but that's not a good thing - sometimes the chains stay at the same value for hundreds of iterations! The solution to this problem is to lower $\sigma^2$ so that the candidate values are closer to the previous value, and more likely to be accepted.

Conversely, if the acceptance rate is too high then the chains will still explore the parameter space, but much too slowly. A plot of the chain value versus the iteration looks like

In this plot, it looks like the two chains don't quite converge to the posterior distribution until hundreds of iterations after the initial draws. Furthermore, the chains are jumping to new values at nearly every iteration, but the jumps are so small that it takes an incredibly large number of iterations to explore the parameter space. If this is the case, the solution is to increase $\sigma^2$ so that the candidates are further from the current value, and less likely to be accepted.

The value of $\sigma^2$, then, is often chosen by trial-and-error after the code has been written by manually adjusting the value in multiple runs of the MCMC so that the trace plots have the "spiky blob" shape and the acceptance rate is reasonable. Through this method, I found that the following candidate distributions for $\mu$ and $\phi$ worked well.

$\mu^*_c \sim N(\mu^*_{j-1}, 0.005^2)$

$\phi^*_c \sim N(\phi^*_{j-1}, 0.001^2)$
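To see how the candidate standard deviation drives the acceptance rate, here is a small sketch of the trial-and-error process. It uses a toy $Beta(3,5)$ target instead of the posterior from this article, and the function name acceptance.rate and the grid of $\sigma$ values are assumptions for illustration.

```r
# Acceptance rate of a random-walk Metropolis chain as a function of the
# candidate standard deviation. The Beta(3, 5) kernel is a toy stand-in.
log.h <- function(p) if (p > 0 && p < 1) 2 * log(p) + 4 * log(1 - p) else -Inf

acceptance.rate <- function(sigma, n.iter = 5000, start = 0.5) {
  current <- start
  accepted <- 0
  for (j in 1:n.iter) {
    cand <- rnorm(1, current, sigma)
    if (log.h(cand) - log.h(current) > log(runif(1))) {
      current <- cand
      accepted <- accepted + 1
    }
  }
  accepted / n.iter
}

set.seed(7)
sigmas <- c(0.01, 0.1, 0.4, 2)
rates <- sapply(sigmas, acceptance.rate)
data.frame(sigma = sigmas, acceptance = rates)
```

Small values of $\sigma$ give acceptance rates near 1 (tiny steps that are almost always accepted), and large values give rates near 0, so a reasonable value can be found by scanning a grid like this before inspecting the trace plots.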

The Code

Now that we know the steps the code will take and what inputs are necessary, coding can begin. I typically code in R, and find it useful to write a function that has inputs of data vectors, starting values for any parameters, and any MCMC tuning parameters I might want to change (such as the number of draws, length of the burn-in period, or the variance of the candidate distributions). In the code below, I set the burn-in period and number of iterations to default to 1000 and 5000, respectively, and after running the code several times without defaults for candidate variances, I determined values of $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ that produced reasonable trace plots and acceptance rates and set those as defaults as well.

For output, I used the list() structure in R to return a vector chain of $\mu^*_j$, a vector chain of $\phi^*_j$, a matrix of chains $\theta^*_{i,j}$, and a vector of acceptance rates for the Metropolis-Hastings steps for $\mu$ and $\phi$.

The raw code for the MCMC function is shown below, and annotated code may be found on my Github.

betaBin.mcmc <- function(x, n, mu.start, phi.start, burn.in = 1000, n.draws = 5000, sigma.mu = 0.005, sigma.phi = 0.001) {

  # Log of the function proportional to the posterior of (mu, phi)
  m = function(mu, phi, x, n) {
    N = length(x)
    l = sum(lbeta(mu*(1-phi)/phi + x, (1-mu)*(1-phi)/phi + n - x)) - N*lbeta(mu*(1-phi)/phi, (1-mu)*(1-phi)/phi)
    p = -0.5*log(mu) - 0.5*log(1-mu) - 0.5*log(phi) - 0.5*log(1-phi)
    return(l + p)
  }

  # Storage for the chains, including the burn-in iterations
  phi = rep(0, burn.in + n.draws)
  mu = rep(0, burn.in + n.draws)
  theta = matrix(rep(0, length(n)*(burn.in + n.draws)), length(n), (burn.in + n.draws))

  acceptance.mu = 0
  acceptance.phi = 0

  # Starting values; the theta are initialized with a Gibbs draw
  mu[1] = mu.start
  phi[1] = phi.start

  for(i in 1:length(x)) {
    theta[i, 1] = rbeta(1, mu[1]*(1-phi[1])/phi[1] + x[i], (1-phi[1])/phi[1]*(1-mu[1]) + n[i] - x[i])
  }

  for(j in 2:(burn.in + n.draws)) {

    # Default to the previous values; overwritten if a candidate is accepted
    phi[j] = phi[j-1]
    mu[j] = mu[j-1]

    # Metropolis-Hastings step for mu
    cand = rnorm(1, mu[j-1], sigma.mu)

    if((cand > 0) & (cand < 1)) {

      m.old = m(mu[j-1], phi[j-1], x, n)
      m.new = m(cand, phi[j-1], x, n)

      u = runif(1)

      if((m.new - m.old) > log(u)) {
        mu[j] = cand
        acceptance.mu = acceptance.mu + 1
      }
    }

    # Metropolis-Hastings step for phi
    cand = rnorm(1, phi[j-1], sigma.phi)

    if((cand > 0) & (cand < 1)) {

      m.old = m(mu[j-1], phi[j-1], x, n)
      m.new = m(mu[j-1], cand, x, n)

      u = runif(1)

      if((m.new - m.old) > log(u)) {
        phi[j] = cand
        acceptance.phi = acceptance.phi + 1
      }
    }

    # Gibbs step for each theta_i
    for(i in 1:length(n)) {
      theta[i, j] = rbeta(1, (1-phi[j])/phi[j]*mu[j] + x[i], (1-phi[j])/phi[j]*(1-mu[j]) + n[i] - x[i])
    }

  }

  # Drop the burn-in iterations
  mu <- mu[(burn.in + 1):(burn.in + n.draws)]
  phi <- phi[(burn.in + 1):(burn.in + n.draws)]
  theta <- theta[,(burn.in + 1):(burn.in + n.draws)]

  return(list(mu = mu, phi = phi, theta = theta, acceptance = c(acceptance.mu/(burn.in + n.draws), acceptance.phi/(burn.in + n.draws))))

}

This, of course, is not the only way it may be coded, and I'm sure that others with more practical programming experience could easily improve upon this code. Note that I add an additional wrinkle to the formulation given in the previous sections to address a practical concern - I immediately reject a candidate value if it is less than 0 or larger than 1. This is not the only possible way to deal with this potential problem, but works well in my experience, and the acceptance rate and/or starting points can be adjusted if the issue becomes serious.

There is a bit of redundancy in the code - the quantity m.old is calculated twice, when it is used identically in both Metropolis-Hastings steps - and I'm inflating the acceptance rate slightly by including the burn-in iterations, but the chains should converge quickly so the effect will be minimal, and more draws can always be taken to minimize the effect.

Though coded in R, the principles should apply no matter which language you use - hopefully you could take this setup and write code in C or Python if you wanted to.

The Results

Using the function defined above, I ran three separate chains of 5000 iterations each after a burn-in of 1000 draws. For starting points, I picked values near where I thought the posterior means would end up, plus values both above and below, to check that all chains converged to the same distributions.

> chain.1 <- betaBin.mcmc(x,n, 0.265, 0.002)
> chain.2 <- betaBin.mcmc(x,n, 0.5, 0.1)
> chain.3 <- betaBin.mcmc(x,n, 0.100, 0.0001)

Checking the acceptance rates for $\mu$ and $\phi$ from each of the three chains, all are reasonable:

> chain.1$\$$acceptance
[1] 0.3780000 0.3613333
> chain.2$\$$acceptance
[1] 0.4043333 0.3845000
> chain.3$\$$acceptance
[1] 0.3698333 0.3768333

(Since the $\theta_i$ were obtained by a Gibbs sampler, they do not have an associated acceptance rate)

Next, plots of the chain value versus iteration for $\mu$, $\phi$, and $\theta_1$ show all three chains appear to have converged to the same distribution, and the trace plots appear to have the "spiky blob" shape that indicates good mixing:

Hence, we can use our MCMC draws to estimate properties of the posterior. To do this, combine the results of all three chains into one big set of draws for each variable:

mu <- c(chain.1$\$$mu, chain.2$\$$mu, chain.3$\$$mu)
phi <- c(chain.1$\$$phi, chain.2$\$$phi, chain.3$\$$phi)
theta <- cbind(chain.1$\$$theta, chain.2$\$$theta, chain.3$\$$theta)

Statistical theory says that posterior distributions should converge to a normal distribution as the sample size increases. With a sample size of $N = 254$ batting averages, posteriors should be close to normal in the parametrization I used - though normality of the posteriors is in general not a guarantee that everything has worked well, nor is non-normality evidence that something has gone wrong.

First, the posterior distribution for league batting average can be seen just by taking a histogram:

> hist(mu)

The histogram looks almost perfectly normally distributed - about as close to the ideal as is reasonable.

Next, we want to get an estimator for the league mean batting average. There are a few different ways to turn the posterior sample $\mu^*_j$ into an estimator $\hat{\mu}$, but I'll give the simplest here (and since the posterior distribution looks normal, other methods should give very similar results) - taking the sample average of the $\mu^*_j$ values:

> mean(mu)
[1] 0.2660155

Similarly, we can get an estimate of the standard error for $\hat{\mu}$ and a 95% credible interval for $\mu$ by taking the standard deviation and quantiles from $\mu^*_j$:

> sd(mu)
[1] 0.001679727
> quantile(mu,c(.025,.975))
2.5% 97.5%
0.2626874 0.2693175

For $\phi$, do the same thing - first look at the histogram:

There is one outlier on the high side - which can happen in an MCMC chain simply by chance - and a slight skew to the right, but otherwise, the posterior looks close to normal. The mean, standard deviation, and a 95% credible interval are given by

> mean(phi)
[1] 0.001567886
> sd(phi)
[1] 0.000332519
> quantile(phi,c(.025,.975))
2.5% 97.5%
0.0009612687 0.0022647623

Furthermore, let's say that instead of $\phi$, I had a particular function of one of the parameters in mind - for example, I mentioned at the beginning that $\phi$ is, in sabermetric speak, the proportion of stabilization after a single at-bat. This can be turned into the general so-called "stabilization point" $M$ by

$M = \dfrac{1-\phi}{\phi}$

and so to get a posterior distribution for $M$, all we need to do is apply this transformation to each draw from $\phi^*_j$. A histogram of $M$ is given by

> hist((1-phi)/phi)

The histogram is clearly skewed to the right, but that's okay since $M$ is not one of the parameters in the model.

An estimate and 95% credible interval for the stabilization point are given by taking the average and quantiles of the transformed values

> mean((1-phi)/phi)
[1] 667.8924
> quantile((1-phi)/phi, c(0.025,0.975))
2.5% 97.5%
440.5474 1039.2918

This estimate is different than the value I gave in my article 2016 Stabilization Points because the calculations in that article used the past six years of data - this calculation only uses one. This is also why the uncertainty is so much larger.

Lastly, we can get at what we really want - estimates of the "true" batting averages $\theta_i$ for each player. I'm going to look at $i = 1$ (the first player in the sample), who happens to be Bryce Harper, the National League MVP in 2015. His batting average was 0.330 (from $x_1 = 172$ hits in $n_1 = 521$ AB), but the effect of fitting the hierarchical Bayesian model is to shrink the estimate of his "true" batting average $\theta_1$ towards the league mean $\mu$ - and by quite a bit in this case, since Bryce had nearly the largest batting average in the sample. A histogram of the $\theta^*_{1,j}$ shows, again, a roughly normal distribution.

> hist(theta[1,])

and an estimate of his true batting average, standard error of the estimate, and 95% credible interval for the estimate are given by

> mean(theta[1,])
[1] 0.2947706
> sd(theta[1,])
[1] 0.01366782
> quantile(theta[1,], c(0.025,0.975))
2.5% 97.5%
0.2687120 0.3222552

Other functions of the batting averages, functions of the league mean and variance, or posterior predictive calculations can be performed using the posterior samples $\mu^*$, $\phi^*$, and $\theta^*_i$.

Conclusion and Connections

MCMC techniques similar to the ones shown here have become fairly standard in Bayesian estimation, though there are more advanced techniques in use today that build upon these "building block" steps by, to give one example, adaptively changing the acceptance rate as the code runs rather than guessing-and-checking to find a reasonable value.

The empirical Bayesian techniques from my article Beta-binomial empirical Bayes represent an approximation to this full hierarchical method. In fact, using the empirical Bayesian estimator from that article on the baseball data set described in this article gives $\hat{\alpha} = 172.5478$ and $\hat{\beta} = 476.0831$ (equivalent to $\hat{\mu} = 0.266$ and $\hat{\phi} = 0.001539$), and gives Bryce Harper an estimated true batting average of $\theta_1 = 0.2946$, with a 95% credible interval of $(0.2688, 0.3210)$ - only slightly shorter than the interval from the full hierarchical model.

Lastly, the "regression toward the mean" technique common in sabermetrics also approximates this analysis. Supposing you had a "stabilization point" of around 650 AB for batting averages (650 is actually way too large, but I'm pulling this number from my calculations above to illustrate a point), then the weight given to Harper's observed batting average, with the rest of the weight going to the league mean of $\mu \approx 0.266$, is

$\left(\dfrac{521}{521 + 650}\right) \approx 0.4449$

So that the estimate of Harper's batting average is

$0.266 + 0.4449\left(\dfrac{172}{521} - 0.266\right) \approx 0.2945$

Three methods all going to the same place - all closely related in theory and execution.
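The two approximations can be reproduced in a few lines of R using only the numbers quoted above (the variable names are mine). The last lines illustrate why they agree: regression toward the mean with $M = \hat{\alpha} + \hat{\beta}$ and $\mu = \hat{\alpha}/(\hat{\alpha} + \hat{\beta})$ is algebraically the same as the empirical Bayes posterior mean.

```r
# Harper's 2015 line, as given in the text
x1 <- 172; n1 <- 521

# 1. Empirical Bayes: posterior mean under a Beta(alpha.hat, beta.hat) prior
alpha.hat <- 172.5478; beta.hat <- 476.0831
eb.estimate <- (x1 + alpha.hat) / (n1 + alpha.hat + beta.hat)

# 2. Regression toward the mean with stabilization point M and league mean mu
M <- 650; mu <- 0.266
weight <- n1 / (n1 + M)                    # weight on the observed average
rtm.estimate <- mu + weight * (x1 / n1 - mu)

round(c(empirical.bayes = eb.estimate, regression = rtm.estimate), 4)

# Same regression formula, but using M and mu implied by the estimated prior
M.eb <- alpha.hat + beta.hat
mu.eb <- alpha.hat / M.eb
rtm.from.eb <- mu.eb + n1 / (n1 + M.eb) * (x1 / n1 - mu.eb)
all.equal(rtm.from.eb, eb.estimate)
```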

Hopefully this helps with understanding MCMC coding. The article ended up much longer than I originally intended, but there were many parts I've gotten used to doing quickly that I realized required a not-so-quick explanation to justify why I'm doing them. As usual, comments and suggestions are appreciated!

2016 Win Prediction Totals (Through May 22)

2016-05-24T14:04:00.000-05:00

These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway).

I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season the actual coverage is slightly under 94%, with intervals being one game off if and when they are off.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins}\\ \hline

ARI & 65 & 79.58 & 94 & 81.81 & 21 \\
ATL & 48 & 61.91 & 77 & 67.95 & 12 \\
BAL & 74 & 89.11 & 104 & 85.19 & 26 \\
BOS & 80 & 94.48 & 109 & 92.65 & 27 \\
CHC & 88 & 102.56 & 117 & 99.31 & 29 \\
CHW & 75 & 89.64 & 104 & 87.36 & 26 \\
CIN & 49 & 63.47 & 78 & 66.52 & 15 \\
CLE & 71 & 85.5 & 100 & 85.03 & 22 \\
COL & 66 & 80.94 & 96 & 80.92 & 21 \\
DET & 66 & 80.55 & 95 & 81.05 & 21 \\
HOU & 57 & 71.08 & 86 & 74.86 & 17 \\
KCR & 64 & 79.12 & 94 & 77.75 & 22 \\
LAA & 62 & 76.87 & 91 & 78.08 & 20 \\
LAD & 68 & 82.33 & 97 & 83.53 & 22 \\
MIA & 67 & 81.46 & 96 & 80.95 & 22 \\
MIL & 57 & 71.51 & 86 & 73.47 & 18 \\
MIN & 47 & 61.06 & 76 & 68.16 & 11 \\
NYM & 72 & 87.09 & 102 & 84.52 & 25 \\
NYY & 63 & 77.99 & 93 & 77.6 & 21 \\
OAK & 58 & 71.82 & 86 & 73.12 & 19 \\
PHI & 67 & 81.6 & 96 & 77.71 & 25 \\
PIT & 69 & 83.7 & 98 & 81.94 & 23 \\
SDP & 59 & 72.83 & 87 & 74.53 & 19 \\
SEA & 76 & 90.55 & 105 & 87.86 & 26 \\
SFG & 72 & 85.92 & 100 & 82.28 & 27 \\
STL & 73 & 87.24 & 102 & 88.18 & 23 \\
TBR & 68 & 82.68 & 98 & 83.92 & 20 \\
TEX & 70 & 84.81 & 99 & 82.1 & 25 \\
TOR & 65 & 79.6 & 94 & 80.45 & 22 \\
WSN & 78 & 92.88 & 107 & 90.45 & 27 \\ \hline\end{array}

As you would expect, it's really, really difficult to predict how many games a team is going to win only a quarter of the way through the season, and intervals are necessarily going to be very wide. A couple of things stand out, though - at this point we can be confident that the Chicago Cubs will finish above 0.500 and the Minnesota Twins, Cincinnati Reds, and Atlanta Braves will finish below 0.500. For every other team, we just don't have enough information yet.

To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season until May 22).
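The coin example can be written as a one-line function. This is only a sketch of the arithmetic in the paragraph above, not the estimator used for the table; the function name projected.total is made up for illustration.

```r
# "Mean" as described above: wins so far plus expected wins over the rest
# of the schedule at the team's true rate. The function name is illustrative.
projected.total <- function(current, played, total, true.rate) {
  current + true.rate * (total - played)
}

# Coin example from the text: 4 heads in the first 5 of 10 fair flips
true.total <- 0.5 * 10                                                  # 5
projected.total(current = 4, played = 5, total = 10, true.rate = 0.5)  # 6.5
```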

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins than would be expected based on their talent level.

2016 Stabilization Points

2016-03-18T11:11:00.001-05:00

I recalculated my stabilization points for 2016, using the same maximum likelihood technique I used for my 2015 calculations in the articles Estimating Theoretical Stabilization Points and WHIP Stabilization by the Gamma-Poisson Model.

(All data and code I used can be found on my github. I make no claims about the stability and/or efficiency of my code - there are a few places where I know it could use some work.)

I've included standard error estimates for 2016, but these should not be used to perform any kinds of tests or intervals to compare to the 2015 data - the 2015 values are estimates themselves with their own standard errors, and since I'm using the past 6 years worth of baseball data, approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found here for batting statistics and here for pitching statistics.

The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years worth of data, were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.

Offensive Statistics

\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2015\textrm{ }\hat{M} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 301.32 & 16.92 & 0.329 & 300 & 295.79\\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 433.04 & 38.91 & 0.305 & 300 & NA^*\\
\textrm{BA}&\textrm{H/AB} & 491.20 & 37.10 & 0.266 & 300 & 465.92\\
\textrm{SO Rate}&\textrm{SO/PA} & 49.23 & 1.91 & 0.184 & 300 & 49.73\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 112.44 & 4.93 & 0.077 & 300 & 110.91\\
\textrm{1B Rate}&\textrm{1B/PA} & 223.86 & 11.48 & 0.159 & 300 & 226.16\\
\textrm{2B Rate}&\textrm{2B/PA} & 1169.75 & 135.60 & 0.047 & 300 & 1025.31\\
\textrm{3B Rate}&\textrm{3B/PA} & 365.06 & 4.93 & 0.005 & 300 & 372.50\\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1075.41 & 118.22 & 0.052 & 300 & 1006.30\\
\textrm{HR Rate} & \textrm{HR/PA} & 126.35 & 6.03 & 0.027 & 300 & 124.52\\
\textrm{HBP Rate} & \textrm{HBP/PA} & 300.97 & 18.60 & 0.009 & 300 & 297.41 \\ \hline
\end{array}

* For whatever reason, I did not calculate the stabilization point for hitting BABIP in 2015.

In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.

Most stabilization points are very similar to their 2015 counterparts, though there is a general increasing trend (seven out of ten statistics), with increases tending to be larger than decreases.

Pitching Statistics

\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2015 \textrm{ }\hat{M} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}&1408.72& 258.33 & 0.289 &300&2006.71\\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 63.53 &3.51 & 0.449 &300& 65.52\\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 59.80 &3.28& 0.347 &300&61.96\\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 731.02 & 87.48 & 0.203 &300&768.42\\
\textrm{HR/FB Rate}&\textrm{HR/FB}&488.53 & 90.14 & 0.103 &100&505.11\\
\textrm{SO Rate}&\textrm{SO/TBF}& 93.15 &5.18& 0.189&400&90.94\\
\textrm{HR Rate}&\textrm{HR/TBF}& 949.02 & 110.87 & 0.025 &400&931.59\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 236.87 & 15.70 & 0.069 &400&221.25\\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 939.00 & 111.88 & 0.008 &400&989.30\\
\textrm{Hit rate}&\textrm{H/TBF}&559.18 & 46.08 & 0.235 &400&623.35\\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 526.77 & 45.07 & 0.312 &400&524.73\\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 78.97 & 5.60 & 1.29 &80&77.20\\
\textrm{ER Rate}&\textrm{ER/IP*}& 63.08 & 4.23 & 0.439&80&59.55\\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 75.79 & 5.31 & 1.22 &80&73.00\\ \hline
\end{array}

* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.

Stabilization points are evenly split between increases and decreases (seven of each), and are generally close to last year's values, indicating not much of a change in the distributions of talent levels. Interestingly, the stabilization point for BABIP dropped by nearly 600 BIP - whether this is due to the (notoriously large) variance of BABIP or a mistake in either this or last year's calculation on my part, I'm not sure.

Usage

$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300}} = (0.317,0.392)$

$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$

That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% unshrunk intervals, but when I tested them in my article "From Stabilization to Interval Estimation", they worked well for prediction.
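The arithmetic in the two formulas above can be checked with a short R sketch. I am reading the inputs off the formulas themselves: 114 on-base events in 300 PA, the league mean OBP of 0.329, the stabilization point of 301.32 PA, and 250 future PA for the prediction interval. The variable names are assumptions for illustration.

```r
x <- 114; n <- 300          # on-base events and PA, as read off the formulas
mu <- 0.329; M <- 301.32    # league mean OBP and stabilization point
n.new <- 250                # future PA for the prediction interval

shrunk <- (x + mu * M) / (n + M)            # shrunken OBP estimate

# Standard errors for the first formula and for the prediction over n.new PA
se.mean <- sqrt(mu * (1 - mu) / (M + n))
se.pred <- sqrt(mu * (1 - mu) / (M + n) + mu * (1 - mu) / n.new)

mean.interval <- shrunk + c(-1, 1) * 1.96 * se.mean
pred.interval <- shrunk + c(-1, 1) * 1.96 * se.pred
round(rbind(mean.interval, pred.interval), 3)
```

Rounded to three decimals, this reproduces the intervals $(0.317, 0.392)$ and $(0.285, 0.424)$ shown above.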

As usual, all my data and code can be found on my github. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.

> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1)
$\$$Parameters
[1] 0.329098 301.317682

$\$$Standard.Error
[1] 16.92138

$\$$Model
[1] "Beta-Binomial"

The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula

$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$

which is obtained from applying the delta method to the function $p(\hat{M}) = n/(n + \hat{M})$. Note that the mean and prediction intervals I gave do not take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.
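A minimal sketch of the confidence-bound formula, using the OBP estimates from the output above. The function name stabilization.bounds is hypothetical and is not the stabilize function from my github.

```r
# Estimated stabilization n/(n + M) with delta-method 95% bounds.
# M.hat and se.M below are the OBP values from the output above.
stabilization.bounds <- function(n, M.hat, se.M) {
  est <- n / (n + M.hat)
  se <- n / (n + M.hat)^2 * se.M    # |d/dM of n/(n + M)| times SE(M.hat)
  data.frame(n = n, lower = est - 1.96 * se, estimate = est, upper = est + 1.96 * se)
}

stabilization.bounds(n = c(100, 301.317682, 600), M.hat = 301.317682, se.M = 16.92138)
```

At $n = \hat{M}$ the estimated stabilization is exactly 0.5, which is the sense in which $M$ is the "stabilization point."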

Maybe somebody else out there might find this useful. As always, feel free to post any comments or suggestions!

Correcting Parametric Empirical Bayesian Intervals using a Bootstrap

2015-12-04T12:09:00.000-06:00

In a previous post I discussed empirical Bayes for the beta-binomial model. Empirical Bayesian estimates are just expected values of a posterior distribution - suppose instead that you want interval estimates. The empirical Bayesian method can be used, but the intervals potentially have to be adjusted. In this post I want to show how to construct empirical Bayesian intervals for the beta-binomial model, correct them using a parametric bootstrap, and give a baseball example.

Annotated code for the procedure I will describe in this post can be found on my github.

As a side note, I just want to say how much I love this procedure. It uses parametric bootstrapping to correct Bayesian intervals to achieve frequentist coverage. God bless America.

Empirical Bayesian Intervals

In my previous post on beta-binomial empirical Bayes analysis, I used the model

$y_i \sim Bin(n_i, \theta_i)$

$\theta_i \sim Beta(\alpha, \beta)$

The empirical Bayes method says to get estimates $\hat{\alpha}$ and $\hat{\beta}$ of the prior parameters - I give a method of moments estimator in the post, but marginal maximum likelihood may also be used - and then calculate posterior distributions for each of the $\theta_i$ using $Beta(\hat{\alpha}, \hat{\beta})$ as the prior, essentially using the data itself to estimate the prior.

The empirical Bayesian estimate is the mean of this posterior distribution. If an interval estimate is desired, a credible interval can be calculated by taking quantiles directly from this posterior - see my post on credible intervals. This is what I will call a "naive" empirical Bayesian interval.

The Empirical Bayes Problem

The fundamental problem with naive empirical Bayesian intervals is that they often end up too short, inappropriately centered, or both. This is because the uncertainty of the prior parameters themselves has not been accounted for. From the law of total variance, the posterior variance is given by

$Var(\theta_i | y_i) = E_{\alpha, \beta|y_i}[Var(\theta_i|y_i, \alpha, \beta)] + Var_{\alpha, \beta|y_i}[E(\theta_i|y_i, \alpha, \beta)]$

Taking quantiles from the empirical Bayesian posterior estimates the first term, but not the second. For small samples this second term can be significant, and naive empirical Bayesian intervals generally won't achieve nominal coverage (for more information, see Carlin and Louis's book Bayesian Methods for Data Analysis).

One way to correct for the uncertainty is to perform a hierarchical Bayesian analysis, but it's not clear what the "correct" hyperpriors should be - and just using noninformative priors doesn't guarantee that you'll get nominal frequentist coverage.

An alternative is to use the bootstrap. Since I'm working in the parametric empirical Bayes case, a parametric bootstrap will be used, though this doesn't necessarily have to be the case. For more information on the technique I will use (and a discussion on how it applies to the normal-normal case), see Laird, N., and Louis, T. (1987), "Empirical Bayes Confidence Intervals Based on Bootstrap Samples," Journal of the American Statistical Association, 82(399), 739-750.

I want to emphasize that this is a technique for small samples. For even moderate samples, the uncertainty in the parameters will be small enough that naive empirical Bayesian intervals will have good frequentist properties, and this can be checked by simulation if you desire.

Parametric Bootstrap

The idea of the parametric bootstrap is that we can account for $Var_{\alpha, \beta|y_i}[E(\theta_i|y_i, \alpha, \beta)]$ by resampling. In a traditional bootstrap, data is sampled with replacement from the original data set. Since this is a parametric bootstrap instead, we will resample by generating observations from the model assuming that our estimates $\hat{\alpha}$ and $\hat{\beta}$ are correct.

Generate $\theta^*_1, \theta^*_2, ..., \theta^*_k$ from $Beta(\hat{\alpha}, \hat{\beta})$
Generate $y^*_1, y^*_2,..., y^*_k$ from $Bin(n_i, \theta^*_i)$
Estimate $\alpha$ and $\beta$ from the bootstrapped $(y^*_i, n_i)$ observations using the same method as you initially used. Call these estimates $\alpha^*_j, \beta^*_j$

Repeating these steps $N$ times gives a set of $N$ bootstrap estimates $(\alpha^*_j, \beta^*_j)$, $j = 1, ..., N$, of the parameters of the underlying beta distribution. The posterior density that accounts for uncertainty of $\hat{\alpha}$ and $\hat{\beta}$ can then be estimated as

$p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) \approx \dfrac{ \sum_{j = 1}^N p(\theta_i | y_i, \alpha^*_j, \beta^*_j)}{N}$

Essentially, this is just the raw average density at each point, averaging over all the bootstrapped parameter values. The corrected 95% empirical Bayesian interval is given by solving

$\displaystyle \int_{-\infty}^{l} p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) d\theta_i = 0.025$

$\displaystyle \int_{u}^{\infty} p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) d\theta_i = 0.025$

for lower and upper bounds $l$ and $u$ using numerical techniques, so that 2.5% of the posterior probability lies below $l$ and 2.5% lies above $u$.
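The resampling loop described above can be sketched in R as follows. Two things here are assumptions made only so the sketch runs on its own: the function names are hypothetical, and the estimator is a bounded marginal maximum likelihood fit rather than the method of moments estimator from my earlier post (which is not reproduced here). The prior values and the 18 players with 45 at-bats each are those of the baseball example later in this post.

```r
# Beta-binomial marginal negative log-likelihood, parameterized by log(alpha), log(beta).
bb_negloglik <- function(par, y, n) {
  a <- exp(par[1])
  b <- exp(par[2])
  -sum(lchoose(n, y) + lbeta(y + a, n - y + b) - lbeta(a, b))
}

# Stand-in estimator (assumption): marginal maximum likelihood with box
# constraints so the fit cannot run off to infinity on small samples.
estimate_prior <- function(y, n) {
  fit <- optim(c(0, 0), bb_negloglik, y = y, n = n, method = "L-BFGS-B",
               lower = rep(log(0.01), 2), upper = rep(log(1e5), 2))
  exp(fit[["par"]])
}

# Parametric bootstrap: simulate new data from the fitted model and re-estimate
# the prior parameters, N times.
bootstrap_prior <- function(alpha_hat, beta_hat, n, N = 200, estimator = estimate_prior) {
  k <- length(n)
  out <- matrix(NA_real_, nrow = N, ncol = 2,
                dimnames = list(NULL, c("alpha", "beta")))
  for (j in seq_len(N)) {
    theta_star <- rbeta(k, alpha_hat, beta_hat)        # step 1: new talent levels
    y_star <- rbinom(k, size = n, prob = theta_star)   # step 2: new observed counts
    out[j, ] <- estimator(y_star, n)                   # step 3: re-estimate the prior
  }
  out
}

set.seed(1)
boot_est <- bootstrap_prior(97.676, 270.312, n = rep(45, 18), N = 50)
print(head(boot_est))
```

Passing the estimator in as an argument keeps the loop independent of how $\alpha$ and $\beta$ are estimated, so the same method used on the original data can be swapped in.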

Baseball Example

In my previous post, I analyzed the famous Morris baseball data set with respect to loss functions to show why empirical Bayes works. Analyzing it with respect to interval estimation also provides an interesting example.

Using a beta-binomial model with the method of moments estimator I described in the previous post, this data set has parameters $\hat{\alpha} = 97.676$ and $\hat{\beta} = 270.312$. Each player had $n_i = 45$ at-bats, so the posterior distribution for the batting average $\theta_i$ of player $i$ is

$\theta_i | y_i, \hat{\alpha}, \hat{\beta} \sim Beta(y_i + 97.676, 45 - y_i + 270.312)$

Naive intervals can be taken directly as central 95% quantiles from the posterior distribution - again, see my article on Bayesian credible intervals for more explanation on this.

\begin{array}{l c c c c c c c c} \hline
\textrm{Player} & y_i & y_i/n_i & \textrm{EB Estimate} & \textrm{Naive Lower} & \theta_i & \textrm{Naive Upper}\\ \hline
Clemente & 18 & .400 & .280 & 0.238 & .346 & 0.324 \\
F. Robinson & 17 & .378   & .278 & 0.236   & .298 & 0.322 \\
F. Howard & 16 & .356 & .275 & 0.233   & .276 & 0.319 \\
Johnstone & 15 & .333 & .273 & 0.231   & .222 & 0.317 \\
Barry & 14 & .311 & .270 & 0.229    & .273 & 0.314 \\
Spencer & 14 & .311 & .270 & 0.229 & .270 & 0.314 \\
Kessinger & 13 & .289 & .268 & 0.226    & .263 & 0.312 \\
L. Alvarado & 12 & .267 & .266 & 0.224    & .210 & 0.309 \\
Santo & 11 & .244 & .263 & 0.222 & .269 & 0.307\\
Swoboda & 11 & .244 & .263 & 0.222   & .230 & 0.307 \\
Unser & 10 &.222 & .261 & 0.220 & .264 & 0.304 \\
Williams & 10 & .222 & .261 & 0.220 & .256 & 0.304 \\
Scott & 10 & .222 & .261 & 0.220    & .303 & 0.304 \\
Petrocelli & 10 & .222 & .261 & 0.220   & .264 & 0.304 \\
E. Rodriguez & 10 & .222 & .261 & 0.220 & .226 & 0.304\\
Campaneris & 9 & .200 & .258 & 0.217 & .285 & 0.302 \\
Munson & 8 & .178 & .256 & 0.215   & .316 & 0.299 \\
Alvis & 7 & .156 & .253 & 0.213 & .200 & 0.296 \\ \hline
\end{array}

Thirteen out of the eighteen intervals captured the hitter's true average for the rest of the year, for an observed coverage of 72.22%. Roberto Clemente and Thurman Munson managed to overperform with respect to their intervals, while Jay Johnstone, Luis Alvarado, and Max Alvis underperformed.
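The naive intervals and the coverage count can be recomputed directly from the table. This is a small sketch that assumes the data exactly as listed above (hits in the first 45 at-bats and rest-of-season average); the variable names are my own.

```r
# Morris data as listed in the table: hits in the first 45 at-bats (y)
# and rest-of-season batting average (theta).
y <- c(18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7)
theta <- c(.346, .298, .276, .222, .273, .270, .263, .210, .269, .230,
           .264, .256, .303, .264, .226, .285, .316, .200)
n <- 45
alpha_hat <- 97.676
beta_hat <- 270.312

# Posterior mean and central 95% quantiles of Beta(y + alpha, n - y + beta)
eb_estimate <- (y + alpha_hat) / (n + alpha_hat + beta_hat)
naive_lower <- qbeta(0.025, y + alpha_hat, n - y + beta_hat)
naive_upper <- qbeta(0.975, y + alpha_hat, n - y + beta_hat)
covered <- theta >= naive_lower & theta <= naive_upper

print(round(cbind(y, eb_estimate, naive_lower, naive_upper), 3))
print(sum(covered))
```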

The parametric bootstrap procedure fixes this as follows:

Simulate a set of 18 new $\theta^*_i$ from a $Beta(97.676, 270.312)$ distribution.
Simulate a set of 18 new $y^*_i$ from a $Bin(45, \theta^*_i)$ distribution
Estimate $\alpha^*$ and $\beta^*$ using the same method used on the original data set.

I repeated this for 5000 bootstrap samples. The bootstrapped posterior is then

$p^*(\theta_i | y_i, 97.676, 270.312) \approx \dfrac{ \sum_{j = 1}^{5000} p(\theta_i | y_i, \alpha^*_j, \beta^*_j)}{5000}$

The effect of this is to create a posterior distribution that's centered around the same empirical Bayesian estimate $\hat{\theta_i}$, but more spread out. This is shown in the naive (solid line) and bootstrapped (dashed line) posterior distributions for $y_i = 10$.

Quantiles taken from this bootstrapped distribution will give wider intervals than the naive empirical Bayesian intervals (though it is possible to come up with "odd" data sets where the bootstrap interval is shorter). The bootstrap interval is given by solving

$\displaystyle \int_{-\infty}^{l} p^*(\theta_i | y_i, 97.676, 270.312) d\theta_i = 0.025$

$\displaystyle \int_{u}^{\infty} p^*(\theta_i | y_i, 97.676, 270.312) d\theta_i = 0.025$

for lower and upper bounds $l$ and $u$ - and in full disclosure, I didn't actually perform the full numerical integration. Instead, I averaged over the pbeta(x, alpha, beta) function in R and solved for the value where the averaged CDF is equal to 0.025 or 0.975.
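A minimal sketch of that averaged-CDF shortcut is below. It assumes the bootstrap estimates are already available as two vectors; the function names are hypothetical, and the three "bootstrap" values used to exercise it are made up purely for illustration, not actual bootstrap output.

```r
# Average the beta posterior CDFs over the bootstrap estimates.
boot_cdf <- function(q, y, n, alpha_star, beta_star) {
  mean(pbeta(q, y + alpha_star, n - y + beta_star))
}

# Solve boot_cdf(q) = p for q on (0, 1) by root finding.
boot_quantile <- function(p, y, n, alpha_star, beta_star) {
  uniroot(function(q) boot_cdf(q, y, n, alpha_star, beta_star) - p,
          interval = c(0, 1), tol = 1e-10)[["root"]]
}

# Made-up bootstrap estimates, only to exercise the functions.
alpha_star <- 97.676 * c(0.5, 1, 2)
beta_star <- 270.312 * c(0.5, 1, 2)

boot_int <- c(boot_quantile(0.025, 10, 45, alpha_star, beta_star),
              boot_quantile(0.975, 10, 45, alpha_star, beta_star))
naive_int <- qbeta(c(0.025, 0.975), 10 + 97.676, 35 + 270.312)
print(round(rbind(naive = naive_int, averaged = boot_int), 3))
```

Because the averaged CDF is a mixture of beta CDFs, it is continuous and increasing on (0, 1), so a one-dimensional root finder is enough and no explicit integration of the density is needed.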

Doing this, 95% bootstrapped intervals are given by

\begin{array}{l c c c c c c c c} \hline
\textrm{Player} & y_i & y_i/n_i & \textrm{EB Estimate} & \textrm{Bootstrap Lower} & \theta_i & \textrm{Bootstrap Upper}\\ \hline
Clemente & 18 & .400 & .280 & 0.231 & .346 & 0.391 \\
F. Robinson & 17 & .378   & .278 & 0.227 & .298 & 0.382 \\
F. Howard & 16 & .356 & .275 & 0.222 & .276 & 0.372 \\
Johnstone & 15 & .333 & .273 & 0.217 & .222 & 0.363 \\
Barry & 14 & .311 & .270 & 0.211    & .273 & 0.355 \\
Spencer & 14 & .311 & .270 & 0.211 & .270 & 0.355 \\
Kessinger & 13 & .289 & .268 & 0.205    & .263 & 0.346 \\
L. Alvarado & 12 & .267 & .266 & 0.199    & .210 & 0.338 \\
Santo & 11 & .244 & .263 & 0.192 & .269 & 0.330\\
Swoboda & 11 & .244 & .263 & 0.192 & .230 & 0.330 \\
Unser & 10 &.222 & .261 & 0.184 & .264 & 0.323 \\
Williams & 10 & .222 & .261 & 0.184 & .256 & 0.323 \\
Scott & 10 & .222 & .261 & 0.184 & .303 & 0.323 \\
Petrocelli & 10 & .222 & .261 & 0.184   & .264 & 0.323 \\
E. Rodriguez & 10 & .222 & .261 & 0.184 & .226 & 0.323\\
Campaneris & 9 & .200 & .258 & 0.177 & .285 & 0.317 \\
Munson & 8 & .178 & .256 & 0.169 & .316 & 0.310 \\
Alvis & 7 & .156 & .253 & 0.161 & .200 & 0.305 \\ \hline
\end{array}

Only one player out of eighteen is not captured by the interval - Thurman Munson, who managed to hit 0.178 in his first 45 at-bats and 0.316 the rest of the season - for an observed coverage of 94.44%.

(As a reminder, annotated code to perform estimation in the beta-binomial model and calculate bootstrapped empirical Bayesian intervals for the beta-binomial model is available on my github)

From Stabilization to Interval Estimation

2015-10-23T12:17:00.000-05:00

In this post, I'm going to show how to use league means and stabilization points to construct mean interval and prediction interval estimates for some basic counting statistics. I'll focus on two specific models: the normal-normal and the beta-binomial model.

At a few points during this post I'm going to mention some empirical results. You can find the data I used and the code I ran on my github.

Distributional Assumptions

I'm assuming the statistic in question is a binomial outcome - this covers many basic counting statistics (batting average, on-base percentage, batting average on balls in play etc.) but not rate statistics, or more complicated statistics such as wOBA.

Assume that in $n_i$ trials, a player accrues $x_i$ events (hits, on-base events, etc.). I'm going to assume that trials are independent and identical with parameter of success $\theta_i$.

For the distribution of the $x_i$, I'm going to work out the math for two specific distributions - the normal and the binomial. I'm also going to assume that the distribution of the $\theta_i$ follows the respective conjugate distribution - the normal for the normal model, and the beta for the binomial model. This prior has mean talent level $\mu$ and stabilization point $M$.

$x_i \sim p(x_i | \theta_i, n_i)$

$\theta_i \sim G(\theta_i | \mu, M)$

For the stabilization point $M$, I'm going to assume this is the number of events at which $r = 0.5$. If you choose the point at which $r = 0.7$, then these formulas won't work.

For several of the mathematical results here, I'm going to refer back to my article shrinkage estimators for counting statistics - particularly, the examples section at the end - without offering proofs or algebraic derivations.

The posterior distribution for $\theta_i$ is given by

$\displaystyle p(\theta_i | x_i, n_i, \mu, M) = \dfrac{p(x_i | \theta_i, n_i)G(\theta_i | \mu, M)}{\int p(x_i | \theta_i, n_i)G(\theta_i | \mu, M) d\theta_i}$

Intervals will be constructed by taking quantiles from this distribution.

(For rate statistics instead of count statistics, the gamma-Poisson model can be used - though that will take more math to figure out the correct forms of the intervals. I got about three-quarters of the way there in my article WHIP stabilization by the gamma-Poisson model if somebody else wants to work through the rest. For more complicated statistics such as wOBA, I'm going to have to work through some hard math.)

Mean Intervals

Normal-Normal Model

For the normal-normal model, suppose that both the counts and the distribution of talent levels follow a normal distribution. Then the observed proportion $x_i / n_i$ also follows a normal distribution.

$\dfrac{x_i}{n_i} \sim N\left(\theta_i, \dfrac{\sigma^2}{n_i}\right)$

$\theta_i \sim N(\mu, \tau^2)$

Furthermore, a normal approximation to the binomial is used to estimate $\sigma^2$ as $\sigma^2 = \mu (1-\mu)$. The usual normal approximation to the binomial takes $\theta_i (1-\theta_i)$ as the variance; however, the normal-normal model assumes that variance $\sigma^2$ is constant around every single $\theta_i$ - so an estimate for that is the average amount of variance over all of them, $\mu(1-\mu)$.

As a side note, the relationship between $M$ and $\tau^2$ is given by

$M = \dfrac{\sigma^2}{\tau^2} \approx \dfrac{\mu(1-\mu)}{\tau^2}$

As I showed in my article shrinkage estimators for counting statistics, for the normal-normal model the shrinkage coefficient is given as

$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i} = \dfrac{ M }{ M + n_i }$

The resulting posterior is then

$\theta_i | x_i, n_i, \mu, M \sim N\left( (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, (1-B) \left(\dfrac{\sigma^2}{n_i}\right)\right)$

And substituting in the values of $B$, the variance of the posterior is given as

$\left(1 - \dfrac{M}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \left(\dfrac{n_i}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \dfrac{\mu(1-\mu)}{M+n_i}$

And a 95% interval estimate for $\theta_i$ is

$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{M+n_i}}$

Beta-Binomial Model

For the beta-binomial model, suppose that the count of events $x_i$ in $n_i$ trials follows a binomial distribution and the distribution of the $\theta_i$ themselves is beta.

$x_i \sim Binomial(n_i, \theta_i )$

$\theta_i \sim Beta(\alpha, \beta)$

For the beta distribution of talent levels, the parameters can be constructed from the league mean and stabilization point as

$\alpha = \mu M$

$\beta = (1-\mu) M$

Using the beta as a prior distribution, the posterior for $\theta_i$ is then

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

A 95% credible interval can then be taken as quantiles from this distribution - I show how to do this in R in my article on Bayesian credible intervals. Most statistical software should be able to take quantiles from the beta distribution easily.

Alternatively, a normal approximation may be used - the posterior should be approximately normal with mean and variance

$\theta_i | x_i, n_i, \mu, M \sim N\left( \dfrac{x_i + \mu M}{n_i + M}, \dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}\right)$

So a 95% credible interval based on the normal approximation to the beta posterior is given by

$\left(\dfrac{x_i + \mu M}{n_i + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}}$

This should be very close to the interval given by taking quantiles directly from the beta distribution.
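To see how close the three mean intervals are in practice, here is a short sketch that computes all of them for one player. The inputs (114 on-base events in 300 PA, $\mu = 0.329$, $M = 301.32$) are example values, and the function name `mean_intervals` is a hypothetical label for this illustration.

```r
# 95% intervals for the true talent level theta under the three approaches:
# NN = normal-normal, BB = beta quantiles, BB_N = normal approximation to the beta.
mean_intervals <- function(x, n, mu, M, z = 1.96) {
  a <- x + mu * M              # posterior beta parameters
  b <- n - x + (1 - mu) * M
  post_mean <- a / (a + b)
  nn_se <- sqrt(mu * (1 - mu) / (M + n))
  bb_se <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))
  rbind(
    NN = c(post_mean - z * nn_se, post_mean + z * nn_se),
    BB = qbeta(c(0.025, 0.975), a, b),
    BB_N = c(post_mean - z * bb_se, post_mean + z * bb_se)
  )
}

ints <- mean_intervals(x = 114, n = 300, mu = 0.329, M = 301.32)
colnames(ints) <- c("lower", "upper")
print(round(ints, 3))
```

The normal-normal and beta-binomial intervals share the same center; they differ only in whether the variance uses the league mean $\mu$ or the player's own posterior mean.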

Practical Application

I downloaded first- and second-half hitting data for all qualified non-pitchers from 2010 to 2015 from fangraphs.com. I used the above formulas on the first half of on-base percentage data to create intervals, and then calculated the proportion of those intervals that contained the on-base percentage for the second half. For the league mean and stabilization point, I used values of $M$ and $\mu$ (even though I didn't show $\mu$) from my article "More Offensive Stabilization Points."

But wait...isn't there uncertainty in those estimates of $M$ and $\mu$? Yes, but it actually doesn't play a huge role unless the uncertainty is large, such as for the BABIP. You can try it out yourself by running the code and changing the values slightly, or just trust me.

I rounded off to the nearest whole number for $M$ and to three nonzero digits for $\mu$. The intervals compared were the normal-normal as NN, beta-binomial as BB, and the normal approximation to the beta-binomial as BB (N). The resulting coverages were

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.66 & 0.659 & 0.659 \\
BA & 0.268 & 466 & 0.604 & 0.601 & 0.603 \\
1B & 0.158 & 222 & 0.685 & 0.675 & 0.679 \\
2B & 0.0475 & 1025 & 0.532 & 0.532 & 0.531 \\
3B & 0.00492 & 373 & 0.762 & 0.436 & 0.76 \\
XBH & 0.0524 & 1006 & 0.545 & 0.542 & 0.551 \\
HR & 0.0274 & 125 & 0.754 & 0.707 & 0.738 \\
BB & 0.085 & 106 & 0.688 & 0.661 & 0.673 \\
SO & 0.181 & 50 & 0.74 & 0.728 & 0.729 \\
HBP & 0.00866 & 297 & 0.725 & 0.591 & 0.721 \\ \hline\end{array}

So what happened? Shouldn't the 95% intervals have 95% coverage? Well, they should. The problem is, I used the wrong type of interval - the intervals calculated here are for the mean $\theta_i$. But we don't have $\theta_i$. What we have is the second half on-base percentage, which is $\theta_i$ plus the random noise that naturally surrounds $\theta_i$ in however many additional plate appearances. What's appropriate here is a prediction-type interval that attempts to cover not the mean, but a new observation - this interval will have to account for both the uncertainty of estimation and the natural randomness in a new set of observations.

Prediction Intervals

The interval needed is predictive - since the previous intervals were constructed as Bayesian credible intervals, a posterior predictive interval can be used.

Suppose that $\tilde{x_i}$ is the new count of events for player $i$ in $\tilde{n_i}$ new trials. I'm going to assume that $\tilde{n_i}$ is known. I'll also assume that $\tilde{x_i}$ is generated from the same process that generated $x_i$.

$\tilde{x_i} \sim p(\tilde{x_i} | \theta_i, \tilde{n_i})$

The posterior predictive is then

$p(\tilde{x_i}| \tilde{n_i}, x_i, n_i, \mu, M) = \displaystyle \int p(\tilde{x_i} | \theta_i, \tilde{n_i})p(\theta_i | x_i, n_i, \mu, M) d\theta_i$

For a bit more explanation, check out my article on posterior predictive distributions.

Normal-Normal Model

As stated above, the posterior distribution for $\theta_i$ is normal

$\theta_i | x_i, n_i, \mu, M \sim N\left( (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, \dfrac{\mu (1-\mu)}{n_i + M}\right)$

$B = \dfrac{M}{n_i + M}$

Using a normal approximation to the binomial, the distribution of the new on-base percentage in the second half (call this $\tilde{x_i}/\tilde{n_i}$) is also normal

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \theta_i, \mu \sim N\left(\theta_i, \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

The posterior predictive is the marginal distribution, integrating out over $\theta_i$ - it is given as

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \tilde{n_i}, x_i, n_i, \mu, M \sim N\left((1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, \dfrac{\mu (1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

And so a 95% posterior predictive interval for the on-base percentage in the second half is given by

$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}}$
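To get a feel for how much the extra $\mu(1-\mu)/\tilde{n_i}$ term matters, here is a small simulation sketch. It generates data from the beta-binomial model itself, so it only illustrates the idea under the model's own assumptions and is not a replacement for the empirical check on real data. The sample sizes (300 first-half and 250 second-half PA) and the number of simulated players are assumptions; $\mu$ and $M$ are the OBP values from the coverage table.

```r
set.seed(2015)
mu <- 0.33; M <- 296          # OBP values from the coverage table
n <- 300; n_new <- 250        # assumed first- and second-half plate appearances
players <- 20000              # number of simulated players (assumption)

theta <- rbeta(players, mu * M, (1 - mu) * M)   # true talent levels
x <- rbinom(players, n, theta)                  # first-half on-base events
x_new <- rbinom(players, n_new, theta)          # second-half on-base events

center <- (x + mu * M) / (n + M)
se_mean <- sqrt(mu * (1 - mu) / (n + M))
se_pred <- sqrt(mu * (1 - mu) / (n + M) + mu * (1 - mu) / n_new)

# How often does each interval contain the observed second-half proportion?
obs_new <- x_new / n_new
cover_mean <- mean(abs(obs_new - center) <= 1.96 * se_mean)
cover_pred <- mean(abs(obs_new - center) <= 1.96 * se_pred)
print(c(mean_interval = cover_mean, prediction_interval = cover_pred))
```

In this setup the mean interval should capture the second-half proportion only roughly 70% of the time, while the prediction interval should land close to the nominal 95%.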

Beta-Binomial Model

The posterior distribution for $\theta_i$ is beta

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

The distribution for the number of on-base events $\tilde{x_i}$ in $\tilde{n_i}$ follows a binomial distribution

$\tilde{x_i} \sim Binomial(\tilde{n_i}, \theta_i)$

The posterior predictive for the number of on-base events in the new number of trials is the marginal distribution, which has density

$p(\tilde{x_i}| x_i, n_i, \mu, M, \tilde{n_i}) = \displaystyle {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$

This is the beta-binomial distribution. It is a discrete distribution that gives the probability of the number of on-base events in $\tilde{n_i}$ new PA, not the actual on-base percentage.

Since it is discrete, it's easy to solve for quantiles

$Q(\alpha) = \displaystyle \min_{k} \{ k : F(k) \ge \alpha \}$

Where

$F(k) = \displaystyle \sum_{\tilde{x_i} \le k} p(\tilde{x_i} | x_i, n_i, \mu, M, \tilde{n_i}) = \displaystyle \sum_{\tilde{x_i} \le k} \displaystyle {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$

Since $Q(\alpha)$ is the quantile for the count of events, a 95% interval for the actual on-base proportion is given by

$\left(\dfrac{Q(.025)}{\tilde{n_i}} ,\dfrac{Q(.975)}{\tilde{n_i}}\right)$.

Alternatively, since the distribution is likely to be unimodal and bell-shaped, a normal approximation to the 95% posterior predictive interval is given by

$\left(\dfrac{x_i + \mu M}{n_i + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu)M)(n_i + M + \tilde{n_i})}{\tilde{n_i} (n_i + M)^2(n_i + M + 1)}}$

This isn't as good of an approximation as the normal approximation to the beta-binomial interval for the mean, but the difference between intervals is still only around 1% of the length and should work well.
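Both versions of the beta-binomial predictive interval can be sketched in R. This is a minimal illustration with a hypothetical function name; it sums the predictive pmf directly rather than relying on any package, and the normal approximation uses the beta-binomial variance of a proportion, $ab(a+b+\tilde{n})/[\tilde{n}(a+b)^2(a+b+1)]$ with $a = x_i + \mu M$ and $b = n_i - x_i + (1-\mu)M$. The inputs are example values.

```r
bb_pred_interval <- function(x, n, mu, M, n_new) {
  a <- x + mu * M              # posterior beta parameters
  b <- n - x + (1 - mu) * M
  k <- 0:n_new
  # Beta-binomial pmf, computed on the log scale for numerical stability
  pmf <- exp(lchoose(n_new, k) + lbeta(k + a, n_new - k + b) - lbeta(a, b))
  cdf <- cumsum(pmf)
  q <- function(p) k[which(cdf >= p)[1]]   # smallest k with F(k) >= p
  exact <- c(q(0.025), q(0.975)) / n_new   # convert counts to proportions
  # Normal approximation to the predictive distribution of the proportion
  se <- sqrt(a * b * (a + b + n_new) / (n_new * (a + b)^2 * (a + b + 1)))
  normal <- a / (a + b) + c(-1.96, 1.96) * se
  list(exact = exact, normal = normal, total_prob = sum(pmf))
}

pred <- bb_pred_interval(x = 114, n = 300, mu = 0.329, M = 301.32, n_new = 250)
print(pred)
```

Because the exact interval lives on a grid of multiples of $1/\tilde{n_i}$, small differences from the normal approximation are partly just discreteness.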

Practical Application

I repeated the analysis using the predictive formulas given above, using the first half on-base percentage to try to capture the second half on-base percentage, using the same $\mu$ and $M$ values as before.

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.944 & 0.944 & 0.94 \\
BA & 0.268 & 466 & 0.943 & 0.943 & 0.944 \\
1B & 0.158 & 222 & 0.941 & 0.941 & 0.942 \\
2B & 0.0475 & 1025 & 0.956 & 0.956 & 0.955 \\
3B & 0.00492 & 373 & 0.955 & 0.955 & 0.956 \\
XBH & 0.0524 & 1006 & 0.957 & 0.957 & 0.959 \\
HR & 0.0274 & 125 & 0.951 & 0.951 & 0.952 \\
BB & 0.085 & 106 & 0.925 & 0.925 & 0.921 \\
SO & 0.181 & 50 & 0.918 & 0.918 & 0.92 \\
HBP & 0.00866 & 297 & 0.95 & 0.95 & 0.947 \\ \hline\end{array}

Cautions and Conclusion

Despite the positive results, I think that 95% actual coverage from these intervals is overoptimistic. For one, I selected a very "nice" group of individuals to test it on - nonpitchers with more than 300 PA. Being in this category implies a high talent level and a lack of anything that could drastically change that talent level over the course of the season, such as injury. I also treated the second half sample size $\tilde{n_i}$ as known - obviously, that must be estimated as well, and should add additional uncertainty.

Furthermore, there are clearly other factors at work than just random variation - players can get traded to different environments (a player being traded to or from Coors Field, for example), talent levels may very well change over the course of the season, and events are clearly not independent and identical.

Applying these formulas to the population of players at large should see the empirical coverage drop - my guess (though I haven't tested it) is that 95% intervals should empirically get around 85%-90% actual coverage. Also keep in mind that $M$ and $\mu$ need to be kept updated - using means and stabilization points from too far in the past will lead to shrinkage towards the wrong point.

You can and should be able to do better than these intervals, in terms of length - these are incredibly simplistic, using only information about the player and information about the population. Adding covariates to the model to account for other sources of variation should allow you to decrease the length without sacrificing accuracy.

Alternatively, you could use these formulas with projections, with $\mu$ as the preseason projection and $M$ representing how many events that projection is "worth." This is more in keeping with the traditional Bayesian sense of the interval, and won't guarantee any sort of coverage.

However, I still think these intervals are useful in that they represent a sort of baseline - any more advanced model that generates predictive intervals should be able to do better than these.

Edit 16 Mar. 2017: I found that the data file I used for this analysis with the split first and second half statistics was not what I thought it was - it repeated the same player multiple times, giving an inaccurate estimate of the confidence level. I have corrected the data file and re-run the analysis and presented the corrected confidence levels.

Stabilization, Regression, Shrinkage, and Bayes

2015-10-16T10:36:00.000-05:00

This post is somewhat of a brain dump, for all my thoughts on the concept of a "stabilization point," (which is a term I dislike) how it's being used, assumptions that are (usually) knowingly and (sometimes) unknowingly being made in the process, and when and how it can be used correctly.

Stabilization In Practice

The most well-known stabilization point calculation was performed by Russell Carleton, who took samples of size $n$ of some statistic from multiple MLB players and declared the stabilization point to be the $n$ such that the correlation coefficient $r = 0.7$ (the logic being that this gives $R^2 \approx 0.5$ - however, I don't like this). This approach is nonparametric in the sense that it's not making any assumptions about the underlying structure of the data (for example, that the events are binomial distributed), only about the structure of the residuals, and these have been shown to be fairly reasonable and robust assumptions - in fact, the biggest problems with the original study came down to issues of sampling.

The split-and-correlate method is the most common method used to find stabilization points. This method will work, though it's not especially efficient, and should give good results assuming the sampling is being performed well - more recent studies randomly split the data into two halves and then correlate. In fact, it will work for essentially any statistic, especially ones that are difficult to fit into a parametric model.

In his original study (and subsequent studies), Carleton finds the point at which the split-half correlation is equal to $r = 0.7$, since then $R^2 \approx 0.5$. Others have disagreed with this. Commenter Kincaid on Tom Tango's blog writes

$r=.7$ between two separate observed samples implies that half of the variance in one observed sample is explained by the other observed sample. But the other observed sample is not a pure measure of skill; it also has random variance. So you can’t extrapolate that as half of the variance is explained by skill.

I agree with this statement. In traditional regression analysis, the explanatory variable $x$ is conceived as fixed. In this correlation analysis, both the explanatory and response variables are random. Hence, it makes no sense to say that the linear regression with $x$ explains 50% of the variation in $y$ when $x$ is random and, if given the same player, in fact independent of $y$. Other arguments have also been made regarding the units of $r$ and $R^2$.

There's a more practical reason, however - a commonly used form of regression towards the mean is given by

$\dfrac{M}{n + M}$

where $M$ is the regression amount towards some mean. Tom Tango notes that, if the stabilization point is estimated as the point at which $r = 0.5$, then this value $M$ can then turn around and directly be plugged into the regression equation given above. As Kincaid has noted, this is the form of statistical shrinkage for the binomial distribution with a beta prior. More generally, this is the form of the shrinkage coefficient $B$ that is obtained by modeling the outcome of a random event with a natural exponential family and performing a Bayesian analysis using a conjugate prior (see section five of Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions). Fundamentally, this is why the $M/(n + M)$ formula seems to work so well - not because it's beta-binomial Bayes, but because it's also normal-normal Bayes, and gamma-Poisson Bayes, and more - any member of the incredibly flexible natural exponential family.

So simply taking the correlation doesn't make any assumptions about the parametric structure of the observed data - but taking that stabilization point and turning it into a shrinkage (regression) estimator towards the population mean does assume that the observed data come from a natural exponential family with the corresponding conjugate prior for the distribution of true talent levels.

Mathematical Considerations

In practical usage, only certain members of the natural exponential family are considered - the beta-binomial, gamma-Poisson, and the normal-normal models, for example, with the normal-normal largely dominating these choices. These form a specific subset of the natural exponential family - the natural exponential family with quadratic variance functions (NEFQVF). The advantage these have over general NEF distributions is that, aside from being the most commonly used distributions, they are closed under convolution - that is, the sum of NEFQVF distributions is also NEFQVF - and this makes them ideal for modeling counting statistics, as the forms of all calculations stay the same as new information arrives, requiring only that new estimates and sample sizes be plugged into formulas.

In a previous post I used Morris's work with the natural exponential family with quadratic variance functions to describe a two-stage model with some raw counting statistic $x_i$ as the sum of $n_i$ trials

$X_i \sim p(x_i | \theta_i)$

$\theta_i \sim G(\theta_i | \mu, \eta)$

where $p(x_i | \theta_i)$ is NEFQVF with mean $\theta_i$. If $G(.)$ is treated as a prior distribution for $\theta_i$, then the form of the shrinkage estimator for $\theta_i$ is given by

$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B)\bar{x_i} + B \mu$

where $\bar{x_i} = x_i/n_i$ and $B$, as mentioned before, is the shrinkage coefficient. The shrinkage coefficient is controlled by the average amount of variance at the event level and the variance of $G(.)$, weighted by the sample size $n_i$.

$B = \dfrac{ E[Var(\bar{x_i} | \theta_i)]}{ E[Var(\bar{x_i} | \theta_i)] + n_i Var(\theta_i)}$

And for NEF models, this simplifies down to

$B = \dfrac{M}{M + n_i}$

Implying that the form of the stabilization point $M$ is given as

$M = \dfrac{E[Var(\bar{x_i} | \theta_i)]}{Var(\theta_i)} = \dfrac{E[V(\theta_i)]}{Var(\theta_i)}$

Where $V(\theta_i)$ is the variance around the mean $\theta_i$ at the most basic level of the event (plate appearance, inning pitched, etc.). So under the NEF family, the stabilization point is the ratio of the average variance around the true talent level (the variance being a function of the true talent level itself) to the variance of the true talent levels themselves.
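For the beta-binomial case this ratio can be checked directly: with a single binomial event the variance function is $V(\theta_i) = \theta_i(1-\theta_i)$, and the ratio of its average to the variance of a $Beta(\alpha, \beta)$ talent distribution works out to $\alpha + \beta$. The sketch below verifies this both in closed form and by simulation; the prior values are illustrative (they are the Morris data estimates from my bootstrap post) and the variable names are my own.

```r
a <- 97.676    # illustrative beta prior parameters
b <- 270.312
s <- a + b

# Closed-form moments of theta ~ Beta(a, b)
mean_event_var <- a * b / (s * (s + 1))    # E[theta * (1 - theta)]
talent_var <- a * b / (s^2 * (s + 1))      # Var(theta)
M_exact <- mean_event_var / talent_var

# Monte Carlo version of the same ratio
set.seed(42)
theta <- rbeta(1e5, a, b)
M_sim <- mean(theta * (1 - theta)) / var(theta)

print(c(a_plus_b = s, M_exact = M_exact, M_sim = M_sim))
```

The simulated value will wobble around the exact one, which is a reminder that the variance of talent levels in the denominator is the hard part to estimate.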

In another post, I showed briefly that for this model, the split-half correlation is theoretically equal to one minus the shrinkage coefficient $B$.

$\rho = 1 - B = \dfrac{n_i}{M+n_i}$

This is another result that has been commonly used. Therefore, to achieve any desired level of correlation $p$ between split samples, the formula

$n = \left(\dfrac{p}{1-p}\right) M$

can be used to estimate the sample size required. This formula derives not from any sort of correlation prophecy formula, but just from some algebra involving the forms of the shrinkage coefficient and split-half correlation $\rho$.
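The algebra can be checked numerically with a short R sketch. The value $M = 100$ is a hypothetical stabilization point chosen only for illustration, and the function names are my own:

```r
# Events needed for a split-half correlation of p, given stabilization point M
required_n <- function(p, M) (p / (1 - p)) * M

# Split-half correlation implied by n events
split_half_rho <- function(n, M) n / (M + n)

M <- 100  # hypothetical stabilization point, for illustration only

required_n(0.5, M)                     # equals M itself
required_n(0.7, M)                     # about 2.33 times M
split_half_rho(required_n(0.7, M), M)  # recovers 0.7
```

Plugging the required sample size back into the correlation formula returns the target level, which is all the derivation amounts to.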

It's for this reason that I dislike the name "stabilization point" - in its natural form it is the number of events required for a split-half correlation of $r = 0.5$ (and correspondingly a shrinkage coefficient of $0.5$), but really, you can estimate the split-half correlation and/or shrinkage amount for any sample size just by plugging in the corresponding values of $M$ and $n$. In general, there's not going to be much difference between samples of size $n$, $n-1$, and $n + 1$ - there's no magical threshold that the sample size can cross at which a statistic suddenly becomes perfectly reliable - and in fact the formula implies that a given statistic can never reach 100% stabilization.

If I had my choice I'd call it the stabilization parameter, but alas, the name seems to already be set.

Practical Considerations

Note that at no point in the previous description was the shrinkage (regression) explicitly required to be towards the league mean talent level. The league mean is a popular choice to shrink towards; however, if the sabermetrician can construct a different prior distribution from other data (for example, the player's own past results) then all of the above formulas and results can be applied using that prior instead.

When calculating stabilization points, studies have typically used a large sample of players from the league - theoretically, this implies that the league distribution of talent levels is being used as the prior distribution (the $G(.)$ in the two-stage model above), and that the so-called stabilization point that results can be used for any player. In actuality, choices made during the sampling process imply priors which consist of only certain portions of the league distribution of talent levels (mainly, those players who have reached a certain threshold for the number of trials). In my article that calculated offensive stabilization points, I specified a hard minimum of 300 PA/AB in order to be included in the population. I estimated the stabilization point directly, but the issue also affects the correlation method used by Carleton, Carty, Pavlidis, etc. - in order to be included in the sample, a player must have had enough events (however they are defined), and this limits the sample to a nonrepresentative subset of the population of all MLB players. The effect of this is that the specific point calculated is really only valid for those individuals that meet those PA/AB requirements - even though those are the players who we know the most about! Furthermore, players that accrue more events do so specifically because they have higher talent levels - the stabilization points calculated for players who we know will receive, for example, at least 300 PA can't then turn around and be applied to players who we know will accrue fewer than 300 PA. This also explains how two individuals both using the same method in the same manner with the same data can arrive at different conclusions depending entirely on how they chose inclusion/exclusion rules for their sample.

As a final issue, I used six years' worth of data - in doing this, I made the assumption that the basic shape of true talent levels for the subset of the population I chose had changed negligibly or not at all over six years. I didn't simply use all data, however, because I recognize that offensive environments change - the late 90s and early 2000s, for example, are drastically different from the early 2010s. This brings up another point - stabilization points, as they are defined, are a function of the mean (coming into play in the average variance around a statistic) and, primarily, the variance of population talent levels - however, both of those are changing over time. This means there is not necessarily such a thing as "the" stabilization point, since as the population of talent levels changes over time, so will the mean and variance (I wrote a couple of articles looking at how offensive and pitching stabilization points have changed over time), so stabilization points in articles that were published just a few years ago may or may not be valid any longer.

Conclusion

Even after all this math, I still think the split-and-correlate method should be thought of as the primary method for calculating stabilization points, since it works on almost any kind of statistic, even more advanced ones that don't fit clearly into a NEF or NEFQVF framework. Turning around and using the results of that analysis to perform shrinkage (regression towards the mean), however, does make very specific assumptions about the form of both the observed data and underlying distribution of talent levels. Furthermore, sampling choices made at the beginning can strongly affect the final outcome, and limit the applicability of your analysis to the larger population. And if you remember nothing else from this - there is no such thing as "the" stabilization point, either in terms of when a statistic is reliable (it's always somewhat unreliable, the question is by how much) or one value that applies to all players at all times (since it's a function of the underlying distribution of talent levels, which is always changing).

This has largely been just a summary of techniques, studies, and research others have done - I know others have expressed similar opinions as well - but I found the topic interesting and I wanted to explain it in a way that made sense to me. Hopefully I've made a little more clear the connections between statistical theory and things people were doing just because they seemed to work.

Various Links

These are just some of the various links I read to attempt to understand what people were doing in practice and attempt to connect it to statistical theory:

Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions: Statistical Theory: http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/NEF-QVF_1983_2240566.pdf

Russell Carleton's original reliability study: http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/

Carleton's updated calculations: http://www.baseballprospectus.com/article.php?articleid=20516

Tom Tango comments on Carleton's article: http://tangotiger.com/index.php/site/comments/point-at-which-pitching-metrics-are-half-signal-half-noise

Derek Carty's stabilization point calculations: http://www.baseballprospectus.com/a/14215#88434

Tom Tango discusses Carty's article, the $r = 0.7$ versus $r = 0.5$ threshold, and regression towards the mean: http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/

Steve Staude discusses $r = 0.5$ versus $r = 0.7$: http://www.fangraphs.com/blogs/randomness-stabilization-regression/

Tom Tango comments on Steve's work: http://tangotiger.com/index.php/site/comments/randomness-stabilization-regression

Tom Tango links to Phil Birnbaum's proof of the regression towards the mean formula: http://tangotiger.com/index.php/site/comments/blase-from-the-past-proof-of-the-regression-toward-the-mean

Kincaid shows that the beta-binomial model produces the regression towards the mean formula: http://www.3-dbaseball.net/2011/08/regression-to-mean-and-beta.html

Harry Pavlidis looks at stabilization for some pitching events: http://www.hardballtimes.com/it-makes-sense-to-me-i-must-regress/

Tom Tango discusses Harry's article, and gives the connection between regression and stabilization: http://www.insidethebook.com/ee/index.php/site/article/regression_equations_for_pitcher_events/

Great summary of various regression and population variance estimation techniques - heavy on the math: http://www.countthebasket.com/blog/2008/05/19/regression-to-the-mean/

The original discussion on regression and shrinkage from Tom Tango's archives: http://www.tangotiger.net/archives/stud0098.shtml

The Posterior Predictive

2015-09-21T11:13:00.003-05:00

Let's say you do a Bayesian analysis and end up with a posterior distribution $p(\theta | y)$. What does that tell you about a new observation $\tilde{y}$ from some data-generating process that involves $\theta$? The answer can be found using the posterior predictive distribution.

The code that I used to generate the images in this article can be found on my github.

Posterior Predictive

Given a posterior distribution $p(\theta | y)$, the posterior predictive distribution is defined as

$p(\tilde{y} | y) = \int p(\tilde{y} | \theta) p(\theta | y) d\theta$

and it represents the distribution of a new observation $\tilde{y}$ given your updated information about a parameter $\theta$ and natural variation around the observation that arises from the data-generating process.

Since applied Bayesian techniques have tended towards fully computational MCMC procedures, the posterior predictive is usually obtained through simulation - let's say you have a sample $\theta_j^*$ ($j = 1,2,...,k$) from the posterior distribution of $\theta$ and you want to know about a new observation from some process that uses the parameter $\theta$.

$\tilde{y} \sim p(\tilde{y} | \theta)$

To obtain the posterior predictive, you would simulate a set of observations $\tilde{y}^*_j$ from $p(\tilde{y} | \theta_j^*)$ (in other words, simulate a new observation from the data model for each draw from the MCMC for your parameter). The distribution of these $\tilde{y}^*_j$ approximates the posterior predictive distribution for $\tilde{y}$.

In general, this process is very specific to the problem at hand. It is possible in a few common scenarios, however, to calculate the posterior predictive distribution analytically. One example that is useful in baseball analysis is the beta-binomial model.

Beta-Binomial Example

Let's say a batter obtains a 0.300 OBP in 250 PA - that corresponds to 75 on-base events and 175 not on-base events. What can you say about the distribution of on-base events in a new set of 250 PA?

Suppose that the distribution of on-base events is given by a binomial distribution with $n = 250$ and chance of getting on base $\theta$, which is the same in both sets of PA.

$y | \theta \sim Binomial(250, \theta)$

For the prior distribution, let's suppose that a $Beta(1,1)$ distribution was used - this is a uniform distribution between zero and one - so any possible value for $\theta$ is equally likely. Since the beta and binomial are conjugate distributions, the posterior distribution of $\theta$ (the batter's chance of getting on base) is also a beta distribution:

$p(\theta| y = 75) = \dfrac{\theta^{75+1-1}(1-\theta)^{175+1-1}}{\beta(75+1,175+1)} \sim Beta(76,176)$
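The conjugate update is simple enough to sketch in R. The variable names below are my own, and the summaries (posterior mean, standard deviation, and central 95% credible interval) are just the standard beta distribution formulas applied to the $Beta(76, 176)$ posterior:

```r
# Beta(1, 1) prior updated with 75 on-base events in 250 PA
prior_a <- 1; prior_b <- 1
x <- 75; n <- 250

post_a <- prior_a + x        # 76
post_b <- prior_b + (n - x)  # 176

post_mean <- post_a / (post_a + post_b)  # 76/252
post_sd   <- sqrt(post_a * post_b /
                  ((post_a + post_b)^2 * (post_a + post_b + 1)))
cred_int  <- qbeta(c(0.025, 0.975), post_a, post_b)  # central 95% credible interval

post_mean
post_sd
cred_int
```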

Now, suppose we are planning to observe another 250 PA for the same batter, and we want to know the distribution of on-base events $\tilde{y}$ in the new 250 PA. This distribution is also binomial

$p(\tilde{y} | \theta) = \displaystyle {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}}$

The posterior predictive distribution for the number of on-base events in another 250 PA is then obtained by multiplying the two densities and integrating out $\theta$.

$p(\tilde{y} | y = 75) = \displaystyle \int_0^1 {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}} * \dfrac{\theta^{75}(1-\theta)^{175}}{\beta(76,176)} d\theta$

The resulting distribution is known as the beta-binomial distribution, which has density

$p(\tilde{y} | y = 75) =\displaystyle {250 \choose \tilde{y}} \dfrac{\beta(76 + \tilde{y}, 426-\tilde{y})}{\beta(76,176)}$

(The beta-binomial distribution is obtained from the beta-binomial model - it does get a bit confusing, but they are different things - the beta-binomial model can be thought of as a binomial with extra variance)

Now I can use the posterior predictive for inference. If, for example, I wanted to know the probability that a player will have a 0.300 OBP in another 250 PA (corresponding, again, to $\tilde{y}$ = 75 on-base events) then I can calculate that as

$p(\tilde{y} = 75 | y = 75) = \displaystyle {250 \choose 75} \dfrac{\beta(76 + 75, 426-75)}{\beta(76,176)} \approx 0.0389$

That is, our updated information says there's a 3.89% chance of getting exactly a 0.300 OBP in a new 250 PA by the same player.
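The analytic calculation can be sketched in base R using `lchoose()` and `lbeta()`. The helper name `dbetabinom` is my own choice for this illustration (it is not a base R function), and working on the log scale is simply a precaution against overflow in the binomial coefficient:

```r
# Beta-binomial pmf for y_new successes in n_new trials,
# given a Beta(a, b) posterior; computed on the log scale for stability
dbetabinom <- function(y_new, n_new, a, b) {
  exp(lchoose(n_new, y_new) + lbeta(a + y_new, b + n_new - y_new) - lbeta(a, b))
}

# Probability of exactly 75 on-base events in a new 250 PA
dbetabinom(75, 250, 76, 176)

# The probabilities over all possible outcomes should sum to 1
sum(dbetabinom(0:250, 250, 76, 176))
```

Because the function is vectorized over `y_new`, evaluating it at `0:250` gives the whole posterior predictive distribution at once.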

The actual distribution of OBP in a new 250 PA is given by

This can also be accomplished by simulation - first, by simulating a large number $k$ of $\theta^*$ values from the posterior

$\theta^*_j \sim Beta(76,176)$

And then using those $\theta^*$ values to simulate from the data model $p(\tilde{y} | \theta^*)$

$y^*_j \sim Binomial(250, \theta^*_j)$

The estimated probability of a 0.300 OBP (equivalent to 75 on-base events) is then

$P(0.300 | y = 75) \approx \displaystyle \dfrac{\textrm{# } y^*_j \textrm{ that equal 75}}{k} $

This is much easier to do in $R$ - with $k = 1000000$, the code to quickly perform this is

     > theta <- rbeta(1000000, 76,176) #Simulate from Posterior
     > y <- rbinom(1000000, 250, theta) #Simulate from data model
     > mean(y == 75) #Estimate P(y = 75)
     [1] 0.039077

Notice that the result is very close to the analytic answer of 3.89%. The simulated posterior predictive distribution for OBP is

Which visually looks very similar to the analytic version.

What's the Use?

Aside from doing (as the name implies) prediction, the posterior predictive is very useful for model checking - if data simulated from the posterior predictive (new data) is similar to the data you used to fit the model (old data), that is evidence that the model is a good fit. Conversely, if your posterior predictive data looks nothing like your original data set, you may have misspecified your model (and I'm criminally oversimplifying here - I recommend Bayesian Data Analysis by Gelman et al. for all the detail on model-checking using the posterior predictive distribution).

Pitching Stabilization Points through Time

2015-09-14T11:51:00.001-05:00

In the previous post, I calculated some stabilization points for pitching statistics for the past few years. In this post, I want to look at how some of those stabilization points have changed over time.

(I have previously done this for offensive statistics)

Each stabilization point is a six-year calculation, including the current and five previous years (so for example, 2014 includes 2009 - 2014 data, 1965 includes 1960 - 1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only starting pitchers from each year, and for cutoff values (the minimum number of TBF, BIP, etc. to be included in the dataset) I used the same values as in my previous post. Years were split for the same player. Raw counts are used, not adjusted in any form. Relief pitchers are excluded.

All of the code that I used to create these can be found in my github, though I make no claims to efficiency or ease of operation. Because I added this code several months after the article was originally posted, I did not clean and annotate it as I normally would have - I just posted the raw code. The code is a modified form of the code used to calculate offensive stabilization points over time.

All of the plots shown below, and more, can be found in my imgur account.

Historical Plots

For some statistics, I will show plots for both the mean of a statistic over time and the stabilization point. The stabilization point is driven largely by the underlying population variance of talent levels, which tends to be more difficult to estimate than the mean - hence the reason that, even with six years of moving data, the 'stabilization point' will appear to fluctuate quite a bit. I recommend not reading too much into the fluctuations, but rather looking for more general patterns.

Firstly, the ground ball, fly ball, and line drive rates (per BIP) only have recent data available. In that time, neither the fly ball nor the ground ball stabilization point has changed much

Line drive rate appears to have increased in recent years, however.

Though keep in mind the standard error is approximately 100 balls in play.

More interesting is batting average on balls in play, for which we have more data. The standard error for BABIP is approximately 500 balls in play, so it's not wise to trust small fluctuations in this plot as representing real shifts - however, it does appear that there is a positive trend in the stabilization point, indicative of the spread in BABIP values getting smaller. (A plot with 95% error bounds at each point can be found here, though I don't necessarily care for it)

The mean is easier to estimate with more accuracy - and it shows that batting average on balls in play is at its highest point in history.

An animated plot shows how the mean and variance of the observed (histogram) and estimated true talent (dashed line) distributions have changed over time.

As I've previously mentioned, the primary driving force for stabilization points is the underlying population variance. For example, take strikeout rate (per batter faced): since the dead ball era, its stabilization point has followed a pattern of fairly consistent decrease (with a recent upsurge that still places it within previously observed ranges).

Over time, however, the mean strikeout rate (per batter faced) has been on the increase.

What does coincide with the decrease in stabilization point is the increase in population variance over time, as seen in this animated plot with the observed strikeout rates (histogram) and estimated true talent distribution (dashed line) - the spread in both is constantly increasing over time.

Also interesting is the earned run rate (per inning pitched, min 80 IP).

Beginning in the early 2000s, it dropped to a very low point, relative to its history, and has remained there more consistently than in the past. Meanwhile, the stabilization point for walk rate (min 400 BF) has increased in recent years, after reaching a maximum in the 1980s and decreasing.

On-base percentage and hit-by-pitch rate have both fluctuated within a relatively stable area over time.

Though interestingly, both (per batter faced) took a dip in the 1960s that corresponds to an increase in the mean hit-by-pitch rate and a decrease in the mean on-base percentage.

For some statistics, such as WHIP and home run rate, it is difficult to discern a pattern other than fluctuations within a certain range.

An interesting look at how certain things have changed over time - though, as I mentioned before, I would encourage not reading too much into these plots.

More Pitching Stabilization Points

2015-09-03T09:57:00.001-05:00

Using the beta-binomial model (notated BB) or the gamma-Poisson model (notated GP, and in this post what I call M is what in the previous post I called K - the variance parameter of the population talent distribution), I calculated the stabilization point for some more pitching statistics. I don't think the model(s) fit perfectly to the data, but they provide a good approximation that generally matches up with results I've seen elsewhere on the web.

Data was acquired from fangraphs.com. I only considered starting pitchers from 2009 - 2014, splitting the same pitcher between years, and did not adjust the data in any way.

All the data and code I used here may be found on my github. I make no claims to efficiency or ease of use.

The "cutoff value" is the minimum number of the denominator (IP, TBF, BIP, etc.) in a year in order to be included in the data set. These numbers were chosen somewhat arbitrarily, and for some of my statistics, changing the cutoff value will change the stabilization point. I'm not sure which statistics this will happen to - I know WHIP for sure, and I suspect ER as well, whereas I think BABIP doesn't exhibit this tendency. It's a function of the change (or lack thereof) in population variance of talent levels as the cutoff value increases - if somebody wants to take a look at it, it would be neat.

I wanted to have a little fun and apply the model to stats where it clearly is silly to do so, such as win rate (which I defined as wins per game started) and extra batters faced per inning (the total number of additional batters a pitcher faced beyond what is required by their IP). The model still produces estimates, but of course, bad data fed into a good model doesn't magically produce good analysis.

\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\textrm{95% CI}&\textrm{Cutoff}&\textrm{Model}\\ \hline
\textrm{BABIP}&\textrm{(H-HR)/n*}&2006.71&484.94&(1056.22,2957.20)&300&BB\\
\textrm{GB Rate}&\textrm{GB/BIP}&65.52&3.63&(58.39,72.64)&300&BB\\
\textrm{FB Rate}&\textrm{FB/BIP}&61.96&3.42&(55.25,68.66)&300&BB\\
\textrm{LD Rate}&\textrm{LD/BIP}&768.42&94.10&(583.99,952.86)&300&BB\\
\textrm{HR/FB Rate}&\textrm{HR/FB}&505.11&93.95&(320.96,689.26)&100&BB\\
\textrm{SO Rate}&\textrm{SO/TBF}&90.94&5.04&(81.06,100.82)&400&BB\\
\textrm{HR Rate}&\textrm{HR/TBF}&931.59&107.80&(720.30,1142.88)&400&BB\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}&221.25&14.43&(192.97,249.53)&400&BB\\
\textrm{HBP Rate}&\textrm{HBP/TBF}&989.30&119.95&(754.21,1224.41)&400&BB\\
\textrm{Hit rate}&\textrm{H/TBF}&623.35&57.57&(510.51,736.18)&400&BB\\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}&524.73&44.96&(436.62,612.84)&400&BB\\
\textrm{Win Rate}&\textrm{W/GS}&57.23&8.68&(40.21,74.24)&15&BB\\
\textrm{WHIP}&\textrm{(H + BB)/IP**}&77.20&5.46&(66.50,87.90)&80&GP\\
\textrm{ER Rate}&\textrm{ER/IP**}&59.55&3.94&(51.82,67.25)&80&GP\\
\textrm{Extra BF}&\textrm{(TBF - 3IP**)/IP**}&73.00&5.08&(63.05,82.95)&80&GP\\ \hline
\end{array}

* I'm not exactly sure what combination of statistics fangraphs is using for the denominator of their BABIP - it's not BIP = GB + FB + LD. I know the numerator of H - HR is correct, but the denominator was usually smaller, though sometimes larger, than BIP. I solved for what fangraphs was using and used that in my calculations - if somebody wants to let me know exactly what they're using for n, please do.

** When dividing by IP, I corrected the 0.1 and 0.2 decimal representations to 0.33 and 0.67.
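A minimal sketch of that conversion in R follows. The function name `convert_ip` is hypothetical, and note one deliberate difference: it uses exact thirds of an inning rather than the rounded 0.33 and 0.67 I used, so results will differ slightly from mine in the later decimal places.

```r
# Convert baseball innings-pitched notation (e.g. 6.1 = 6 innings and 1 out)
# to true fractional innings. Uses exact thirds rather than 0.33 / 0.67.
convert_ip <- function(ip) {
  whole <- floor(ip)
  outs  <- round((ip - whole) * 10)  # 0, 1, or 2 outs
  whole + outs / 3
}

convert_ip(c(6.0, 6.1, 6.2))
```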

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of talent levels, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of events (in units of the statistic's denominator - IP, TBF, BIP, etc.) at which there is 50% shrinkage.

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula.

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

It represents a 95% confidence interval for the number of events at which there is 50% shrinkage.

For an arbitrary stabilization level $p$, the number of required events can be estimated as

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$

And a 95% confidence interval for the required number of events is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$
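As a quick sketch, these formulas can be wrapped in a small R function (the name `stabilization_ci` is my own) and applied to the strikeout rate row of the table above, where $\hat{M} = 90.94$ and $SE(\hat{M}) = 5.04$:

```r
# Estimated events needed for stabilization level p, with a 95% interval,
# given an estimate M_hat and its standard error
stabilization_ci <- function(p, M_hat, se) {
  mult <- p / (1 - p)
  c(estimate = mult * M_hat,
    lower    = mult * (M_hat - 1.96 * se),
    upper    = mult * (M_hat + 1.96 * se))
}

# Strikeout rate row of the table: M_hat = 90.94, SE = 5.04
stabilization_ci(0.5, 90.94, 5.04)  # reproduces the table's interval
stabilization_ci(0.7, 90.94, 5.04)
```

At $p = 0.5$ the multiplier is one, so the interval is just the 95% CI column of the table; at $p = 0.7$ both the estimate and the interval width scale up by $0.7/0.3$.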

Since the denominators are so different (as opposed to offensive statistics where PA was used for almost everything except for batting average, and AB are fairly close to PA), I don't feel as comfortable putting everything on the same plot. That being said, the stats that use TBF look like

And the stats that use BIP for their denominator look like

As always, comments are appreciated.

2015 Win Prediction Totals (Through August)

2015-09-02T13:31:00.001-05:00

These predictions are based on my own method (which can be improved). I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), and I think by this point in the season the actual coverage should be close to that.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline

ATL & 61 & 66.56 & 72 & 65.57 \\
ARI & 73 & 78.98 & 85 & 83.48 \\
BAL & 72 & 77.94 & 84 & 83.3 \\
BOS & 70 & 75.88 & 82 & 77.71 \\
CHC & 85 & 90.58 & 96 & 83.91 \\
CHW & 70 & 76.24 & 82 & 74.78 \\
CIN & 63 & 68.64 & 75 & 74.14 \\
CLE & 74 & 80.20 & 86 & 82.03 \\
COL & 61 & 67.29 & 73 & 70.14 \\
DET & 69 & 74.75 & 81 & 74.66 \\
HOU & 84 & 89.99 & 96 & 91.79 \\
KCR & 92 & 97.84 & 104 & 90.34 \\
LAA & 74 & 79.95 & 86 & 78.16 \\
LAD & 84 & 90.31 & 96 & 87.66 \\
MIA & 61 & 66.79 & 73 & 74.49 \\
MIL & 64 & 69.71 & 76 & 74.48 \\
MIN & 77 & 82.69 & 89 & 79.41 \\
NYM & 84 & 89.67 & 95 & 87.19 \\
NYY & 83 & 89.34 & 95 & 87.76 \\
OAK & 68 & 73.33 & 79 & 82.77 \\
PHI & 58 & 63.87 & 70 & 64.11 \\
PIT & 91 & 97.21 & 103 & 89.43 \\
SDP & 73 & 78.54 & 84 & 75.97 \\
SEA & 69 & 74.32 & 80 & 71.95 \\
SFG & 80 & 85.61 & 91 & 86.79 \\
STL & 98 & 103.63 & 109 & 97.33 \\
TBR & 76 & 81.45 & 87 & 80.79 \\
TEX & 78 & 83.60 & 90 & 79.00 \\
TOR & 87 & 92.88 & 98 & 98.67 \\
WSN & 77 & 82.61 & 89 & 84.04 \\ \hline\end{array}

To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (at the end of August).
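The coin example can be written out in a couple of lines of R. The function name `projected_total` is hypothetical and this only mirrors the arithmetic of the analogy, not my actual prediction method:

```r
# Projected final total = wins so far + true win rate * games remaining
projected_total <- function(current, true_rate, remaining) {
  current + true_rate * remaining
}

# Coin example from the text: 4 heads in the first 5 of 10 fair flips
true_total <- 0.5 * 10                    # "True Win Total" analogue: 5
mean_total <- projected_total(4, 0.5, 5)  # "Mean" analogue: 6.5
```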

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky.

WHIP Stabilization by the Gamma-Poisson Model

2015-08-27T09:33:00.000-05:00

I've previously covered shrinkage estimation for offensive statistics - or at least, those that can be written as binomial events. In a previous post, I showed that for models that follow the natural exponential family with quadratic variance function, the split-half correlation is equal to one minus the shrinkage coefficient $B$.

The techniques I used can also be used when the outcome at the most basic level (at-bat, inning pitched, etc.) is not just a binary outcome. In particular, the Poisson distribution also fits within the framework I derived, as it is a member of the natural exponential family with quadratic variance function, and so events that can be modeled as Poisson at the base level will follow the same basic principles I used for the binomial outcomes. I chose the statistic WHIP (walks + hits per inning pitched) to illustrate this method, as it is a counting statistic that is a non-binary event (i.e., you can have 0, 1, 2, ... walks + hits in a given inning), so it fits the support of the Poisson.

Model and Estimation

I will assume that in each inning, pitcher $i$ gives up a number of walks + hits that follows a Poisson model with mean $\theta_i$, which is unique to each pitcher. The sum number of walks + hits given up in $n_i$ innings is $x_i$, and I have $N$ pitchers total. I considered only starting pitchers from 2009-2014, and split years between the same pitcher. My code and data are on github for anybody who wants to check my calculations.

Since the sum of Poissons is Poisson, the sum number of walks + hits $x_i$ given up in $n_i$ innings follows a Poisson distribution with mean $n_i \theta_i$ and mass function

$p(x_i | \theta_i, n_i) = \dfrac{e^{-n_i \theta_i} (n_i \theta_i)^{x_i}}{x_i !}$

I will also assume that the distribution of means $\theta_i$ follows a gamma distribution with mean $\mu$ and variance parameter $K$ (in this parametrization, $\mu = \alpha/\beta$ and $K = \beta$ as opposed to the traditional $\alpha, \beta$ parametrization). This distribution has density

$f(\theta_i | \mu, K) = \dfrac{K^{\mu K}}{\Gamma(\mu K)} \theta_i^{\mu K - 1} e^{-K \theta_i}$

As shown in a previous post, the split-half correlation is then one minus the shrinkage coefficient $B$, or

$\rho = 1 - B = \left(\dfrac{n_i}{n_i + K}\right)$

So once I have an estimate $\hat{K}$ and a desired stabilization level $p$, solving for $n$ gives

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{K}$

Once again, the population variance parameter $K$ is equivalent to the 0.5 stabilization point - the point where the split half correlation should be exactly equal to 0.5, and also the point where the individual pitcher estimates are shrunk 50% of the way towards the mean.

For estimation of $\mu$ and $K$, I used marginal maximum likelihood - a one dimensional introduction to maximum likelihood is given here. The marginal density of $x_i$ given $\mu$ and $K$ is

$p(x_i | n_i, \mu, K) = \displaystyle \int_0^{\infty} \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} e^{-\theta_i (n_i + K)} \theta_i^{x_i + \mu K - 1} d\theta_i = \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} \dfrac{\Gamma(x_i + \mu K)}{(n_i + K)^{x_i + \mu K}}$

And the log-likelihood (dropping terms that do not involve either $\mu$ or $K$) is given by

$\ell(\mu, K) = N \mu K \log(K) - N \log(\Gamma(\mu K)) + \displaystyle \sum_{i = 1}^N \left[\log(\Gamma(x_i + \mu K)) - (x_i + \mu K) \log(n_i + K)\right]$

Once again, I wrote code to maximize this function in $R$ using a Newton-Raphson algorithm. I converted $K$ to $\phi = 1/(1 + K)$ in the equation above for estimation and then converted it back by $K = (1-\phi)/\phi$ after estimation was complete - the reason being that it makes the estimation procedure much more stable.
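For readers who want to experiment, here is a rough sketch of the same idea - it is not my actual code. It swaps in R's general-purpose `optim()` (Nelder-Mead) for my Newton-Raphson routine, optimizes over $\log \mu$ and $\log K$ instead of the $\phi$ transformation, and uses simulated data with arbitrarily chosen true values rather than the real pitcher data. All names are hypothetical.

```r
# Marginal log-likelihood from the text, dropping terms free of mu and K
gp_loglik <- function(mu, K, x, n) {
  N <- length(x)
  N * mu * K * log(K) - N * lgamma(mu * K) +
    sum(lgamma(x + mu * K) - (x + mu * K) * log(n + K))
}

# Simulated data (hypothetical): 500 pitchers with known mu = 1.3 and K = 77
set.seed(1)
N <- 500
n <- sample(80:220, N, replace = TRUE)           # innings pitched
theta <- rgamma(N, shape = 1.3 * 77, rate = 77)  # true talent levels
x <- rpois(N, n * theta)                         # walks + hits allowed

# Maximize over (log mu, log K) so both parameters stay positive
neg_ll <- function(par) -gp_loglik(exp(par[1]), exp(par[2]), x, n)
fit <- optim(c(log(sum(x) / sum(n)), log(50)), neg_ll)

mu_hat <- exp(fit$par[1])
K_hat  <- exp(fit$par[2])
c(mu_hat = mu_hat, K_hat = K_hat)
```

Because the data are simulated from the model itself, the estimates should land near the true values used to generate them, which is a useful sanity check before pointing the function at real data.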

In performing this estimation, I had to make a choice of the minimum number of innings pitched (IP) in order to be included in the dataset. When performing a similar analysis for on-base percentage, I found that at around 300 PA, the population variance (and hence, the stabilization point) became roughly constant. Unfortunately, this is not true for starting pitchers.

The population variance in talent levels decreases consistently as a function of the minimum number of IP that are considered, and so the stabilization point $K$ increases. This means that, unlike OBP, for example, the stabilization point is always determined by what percentage of pitchers you look at (by IP) - if you look at only the top 50%, the stabilization point will be larger than the stabilization point for the top 70%.

This is reflected in the plot below - as with OBP and PA, the mean WHIP is associated with the number of IP, but unlike with OBP, the variance around the mean is constantly changing with the mean.

For my calculation, I chose to use 80 innings pitched as my cutoff point - corresponding to approximately 15 games started and capturing slightly more than 50% of pitchers (by IP). This point was completely arbitrary, though, and other cutoffs will be equally valid depending on the question at hand.

Performing the estimation, the estimated league mean WHIP was $\hat{\mu} = 1.304$ with variance parameter $\hat{K} = 77.203$.

Once again, 95% confidence intervals for a specific stabilization level p are given as

$\left(\dfrac{p}{1-p}\right) \hat{K} \pm 1.96 \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{K})}$

From (delta-method transformed) maximum likelihood output, $Var(\hat{K}) = 29.791$ (for a standard error of $5.459$ IP). The stabilization curve, with confidence bounds, is then

Aside from the model criticisms I've already mentioned, standard ones apply - innings pitched are not identical and independent (and treating them as such is clearly much worse than treating plate appearances as identical and independent), pitchers are not machines, etc. I don't think the model is great, but it is useful. It gives confidence bounds for the stabilization point, something other methods don't do. As always, comments are appreciated.

Offensive Stabilization Points through Time

2015-08-19T14:50:00.001-05:00

Using my maximum likelihood technique for estimating stabilization points, I performed a moving calculation of the stabilization point using data from 1900 - 2014 from fangraphs.com. Each stabilization point is a six-year calculation, including the current and five previous years (so for example, 2014 includes 2009 - 2014 data, 1965 includes 1960 - 1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only batters from each year with at least 300 plate appearances, and splits years for the same player. Raw counts are used, not adjusted in any form. Pitchers are excluded. My data and code are posted on my github if you would like to run it for yourself.
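To make the window definition concrete, here is a small Python sketch (the function name is hypothetical) that lists the seasons pooled into the estimate labelled with a given year, assuming the "current and five previous years" rule.

```python
def window_years(end_year, width=6):
    """Seasons pooled for the estimate labelled end_year."""
    return list(range(end_year - width + 1, end_year + 1))

print(window_years(2014))  # [2009, 2010, 2011, 2012, 2013, 2014]
```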

The "stabilization point" I have defined as the point where split-half correlation is equal to 0.5, which is equivalently where the shrinkage amount is 50%. Both of these are equal to a variance parameter $M$ in the beta-binomial model I fit, where the distribution of events given a mean $\theta_i$ is binomial for player $i$ and the underlying distribution of the $\theta_i$ follows a beta distribution with mean $\mu$ and variance parameter $M$.

Historical Plots

Trends can be seen clearly in plots. For example, here is a plot of the stabilization point for home run rate from 1900 - 2014.

The effect of the dead ball era is clearly evident. A large stabilization point indicates a small variance - and during the dead ball era, there was a small variance, because most players weren't hitting home runs! More recently, the stabilization point has risen to the highest level it's been since that era.

Note that the stabilization point should not be confused with the mean. In fact, here's a plot of the estimated league mean home run rate over the same period.

While going through peaks and valleys, the home run rate has risen fairly continuously over time - and the recent rise in home run stabilization point actually corresponds to a decrease in the mean home run rate (though interestingly, the decrease in league mean home run rate since the end of the steroid era still puts the current mean home run rate above any other preceding era).

To give another example, the stabilization point for triple rate is the lowest it's been since the dead ball era - even though the league mean triple rate has decreased fairly continuously over time.

Interestingly, the stabilization points for walk rate and on-base percentage are the highest they've ever been, with walk rate having a noticeably sharp increase in recent years - one theory is that this is due to a "moneyball" effect of teams focusing much more strongly on walk rate as opposed to other statistics - indeed, the stabilization point for batting average (shown later in the article) has dropped during the same period - perhaps indicative of being more tolerant of variation in batting average but less tolerant in variation of on-base percentage (of course, pitching has grown more dominant since the end of the steroid era, which is likely adding to the effect as well).

Meanwhile, the stabilization points for double rate and extra base hit rate (2B + 3B) have increased over time.

But the extra base hit rate stabilization point has decreased from the mid-2000s, while the double rate stabilization point has remained roughly the same.

The hit-by-pitch rate stabilization point follows the same pattern as the triple rate stabilization point - it increased after the dead ball era, peaking in the 1930s and 1950s - but has decreased since then, and despite a small recent increase, is at its lowest level since that era.

Meanwhile, the strikeout rate stabilization point decreased fairly consistently over time, before stabilizing approximately in the 1970s, with peaks in the 1980s and early 2000s.

What Drives the Stabilization Point?

As I've shown, the mean of the underlying distribution of talents does not seem to be strongly associated with the stabilization point - the variance of the underlying distribution of talent levels is the primary factor. There is an inverse relationship - a small stabilization point indicates that there is a large variance in talent levels for that particular statistic, and a large stabilization point indicates that there is a small variance in talent levels for that statistic.
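The inverse relationship can be made concrete with the standard variance of a beta distribution: with mean $\mu$ and $M = \alpha + \beta$, the variance is $\mu(1-\mu)/(M+1)$. The Python sketch below is only an illustration; the function name is made up and the mean of 0.332 is just an example value.

```python
import math

def talent_sd(mu, M):
    """SD of a Beta talent distribution with mean mu and M = alpha + beta."""
    return math.sqrt(mu * (1 - mu) / (M + 1))

# Larger stabilization point M -> narrower spread of true talent
for M in (50, 300, 1000):
    print(M, round(talent_sd(0.332, M), 4))
```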

To get a clearer view of the factors that are affecting the stabilization point, here's a plot of the stabilization point for batting average (using at-bats as the denominator) versus time.

Below is an animation showing the empirical distribution of batting average with the estimated underlying distribution of talent levels in dashed lines (since I'm estimating the distribution of true batting averages and not the distribution of observed batting averages, it's okay that the dashed line is narrower than the histogram). Notice that as time goes on, the distribution gets narrower (the variance is decreasing) - this is what's driving the increase in stabilization point over time.

The opposite effect can be seen in the single rate stabilization point - it has decreased (with peaks and valleys) over time

As the distribution of single rates has become more spread out.

Graphics of all the stabilization points, league mean talent levels, and animated estimated talent distributions can be found here.

Individual Years

I also selected a few years to compare individually - some for specific reasons, some just as a representative of a certain era.

1910, in the middle of the dead ball era, and a year of particularly low offensive output.
1928, to represent the 1920s and the age of Babe Ruth.
1937, to represent the 1930s.
1945, the end of the second world war.
1959, to represent the 1950s.
1968, the year of the pitcher.
1975, six years after they lowered the mound.
1987, before the steroid era and six years after the 1981 labor stoppage.
2001, the year Barry Bonds hit 73 home runs, in the middle of the steroid era.
2014, the modern era.

\begin{array}{| c | c | c | c | c | c | c | c | c | c | c |}\hline
\textrm{Year} & \textrm{1B} & \textrm{2B}& \textrm{3B}& \textrm{XBH}& \textrm{HR} & \textrm{SO} & \textrm{BB} & \textrm{BA} & \textrm{OBP} & \textrm{HBP} \\ \hline
1910 & 523.02 & 348.68 & 469.42 & 274.24 & 537.41 & 130.77 & 87.33 & 285.65 & 175.27 & 259.08 \\
1928 & 442.01 & 475.05 & 583.77 & 385.43 & 102.77 & 83.26 & 82.10 & 286.25 & 173.49 & 414.86 \\
1937 & 436.17 & 597.72 & 709.13 & 463.40 & 90.61 & 73.79 & 76.94 & 344.93 & 174.84 & 723.59\\
1945 & 456.71 & 710.05 & 543.09 & 474.47 & 98.81 & 67.99 & 79.83 & 424.54 & 180.14 & 699.53 \\
1959 & 351.40 & 1059.30 & 721.82 & 794.95 & 90.18 & 58.28 & 81.20 & 430.60 & 200.31 & 414.12\\
1968 & 333.52 & 867.61 & 691.18 & 700.24 & 94.46 & 55.18 & 93.24 & 476.94 & 265.04 & 448.36\\
1975 & 246.27 & 970.40 & 646.09 & 773.63 & 85.50 & 53.19 & 73.61 & 407.65 & 204.92 & 410.99 \\
1987 & 269.39 & 949.73 & 537.13 & 801.21 & 90.23 & 52.85 & 86.22 & 541.67 & 262.28 & 430.67 \\
2001 & 255.57 & 838.16 & 482.32 & 971.61 & 95.11 & 57.37 & 76.16 & 465.84 & 196.51 & 251.47\\
2014 & 222.16 & 1025.31 & 372.50 & 1006.30 & 124.52 & 49.73 & 105.59 & 465.92 & 295.79 & 297.41 \\ \hline
\end{array}

While generally following the fuller patterns shown in the plots, the effect of major baseball events such as the dead ball era, the second world war, the lowering of the mound, and the steroid era is evident.

Remember that a smaller stabilization point indicates a larger variance among talent levels - so looking at 1968 and 1975 to see the effect of lowering the mound, for example, the spread of single, triple, and home run rates increased while the spread of double and extra-base hit rates decreased (the extra base hit rate being largely driven by the double rate). Interestingly, the spread of strikeout rates remained roughly the same, but the spread of walk rates, hit by pitch rates, batting average, and on-base percentage all increased.

Overall, a fun way to look at how offensive statistics have changed over time. Let me know what you think in comments.

More Offensive Stabilization Points

2015-08-13T09:48:00.002-05:00

Using the same method from my previous post, I calculated some stabilization points for more offensive statistics.

All of the code and data I used may be found on my github.

For each, I used a binomial distribution for the result of the at-bat/plate appearance and a beta distribution for the distribution of talent levels. I think a different model might work better for some of these statistics (and I'll have to work through the Poisson-gamma anyway when I look at pitching statistics), but this represents a decent approximation.

As I stated in a previous post, statistics that cannot be constructed as a binomial event (such as wOBA) do not fall under the framework I am using, and so I have not included them in estimation. I could treat them as binomial events, fit a model, and perform the estimation procedure, but I would have no idea whether the results are correct.

All estimated stabilization points were calculated using unadjusted data from fangraphs.com for hitters from 2009-2014 with at least 300 PA, excluding pitchers.

\begin{array}{| l | l | c | c | c |} \hline
\textrm{Statistic} & \textrm{Formula} & \hat{M} & SE(\hat{M}) & \textrm{95\% CI} \\ \hline
\textrm{OBP} & \textrm{(H+BB+HBP)/PA} & 295.79 & 16.41 & (263.63, 327.95) \\
\textrm{BA} & \textrm{H/AB} & 465.92 & 34.23 & (398.83, 533.02) \\
\textrm{SO Rate} & \textrm{SO/PA} & 49.73 & 1.92 & (45.96, 53.50) \\
\textrm{BB Rate} & \textrm{(BB-IBB)/(PA-IBB)} & 110.91 & 4.84 & (101.44, 120.38) \\
\textrm{1B Rate} & \textrm{1B/PA} & 222.16 & 11.32 & (199.98, 244.34) \\
\textrm{2B Rate} & \textrm{2B/PA} & 1025.31 & 108.00 & (813.64, 1236.98) \\
\textrm{3B Rate} & \textrm{3B/PA} & 372.50 & 26.56 & (320.44, 424.56) \\
\textrm{XBH Rate} & \textrm{(2B+3B)/PA} & 1006.30 & 105.23 & (800.04, 1212.57) \\
\textrm{HR Rate} & \textrm{HR/PA} & 124.52 & 5.90 & (112.95, 136.09) \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 297.41 & 18.26 & (261.61, 333.20) \\ \hline
\end{array}

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of talent levels, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances at which there is 50% shrinkage.

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula - so PA for all except batting average and walk rate.

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

For a given stabilization level p, the estimated number of plate appearances required is

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$

And a 95% confidence interval for the required number of plate appearances is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$

Without confidence bounds, a plot of the sample size required for various stabilization levels is

And a plot of the stabilization level at various sample sizes is given as

This looks very similar to other plots I have seen.

Comments are appreciated. Also, I'm currently in the process of learning ggplot, so hopefully my graphics won't be as awful in the near future.

2015 Win Prediction Totals (Through July)

2015-08-12T12:41:00.002-05:00

These are a bit late - it's August 12, but these intervals only include games through July 31.

These predictions are based on my own method (which can be improved). I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), and I think by this point in the season the actual coverage should be close to that.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c}
\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline

ATL & 64 & 72.27 & 81 & 72.09 \\
ARI & 73 & 81.38 & 90 & 83.36 \\
BAL & 74 & 82.90 & 92 & 86.17 \\
BOS & 64 & 72.12 & 81 & 72.94 \\
CHC & 76 & 85.07 & 94 & 81.23 \\
CHW & 68 & 76.53 & 85 & 73.10 \\
CIN & 66 & 74.63 & 84 & 76.06 \\
CLE & 68 & 76.90 & 86 & 78.07 \\
COL & 62 & 70.37 & 79 & 72.69 \\
DET & 70 & 78.54 & 87 & 78.37 \\
HOU & 82 & 90.46 & 99 & 90.71 \\
KCR & 85 & 94.02 & 103 & 89.14 \\
LAA & 78 & 87.28 & 96 & 87.11 \\
LAD & 82 & 90.37 & 99 & 88.88 \\
MIA & 61 & 70.09 & 79 & 77.17 \\
MIL & 63 & 71.01 & 80 & 75.43 \\
MIN & 74 & 82.43 & 91 & 79.51 \\
NYM & 74 & 82.33 & 91 & 80.50 \\
NYY & 82 & 90.45 & 99 & 87.62 \\
OAK & 67 & 75.23 & 84 & 84.46 \\
PHI & 54 & 62.61 & 71 & 63.10 \\
PIT & 83 & 92.27 & 101 & 87.16 \\
SDP & 69 & 77.14 & 86 & 74.48 \\
SEA & 65 & 73.45 & 82 & 73.91 \\
SFG & 79 & 88.31 & 97 & 87.21 \\
STL & 92 & 101.20 & 110 & 96.66 \\
TBR & 72 & 80.88 & 90 & 80.62 \\
TEX & 70 & 78.37 & 87 & 76.62 \\
TOR & 77 & 86.00 & 94 & 92.18 \\
WSN & 77 & 85.99 & 95 & 84.89 \\ \hline
\end{array}

To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (at the end of July).
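The coin example amounts to "wins already banked plus expected wins over the remaining games." A tiny Python sketch of that arithmetic follows; the function name is hypothetical and this is not the prediction method itself, only the relationship between the two columns.

```python
def projected_total(wins_so_far, games_played, true_rate, total_games):
    """Banked wins plus expected wins over the remaining games."""
    remaining = total_games - games_played
    return wins_so_far + true_rate * remaining

# 4 heads in the first 5 flips of a fair coin, 10 flips in total
print(projected_total(4, 5, 0.5, 10))  # 6.5
```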

As a bonus, these quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky.

Estimating Theoretical Stabilization Points

2015-08-05T09:35:00.002-05:00

Edit 19 March 2020: This post has been adapted into a paper in the Journal of Mathematical Psychology. In the process of writing the paper, a number of mistakes, omissions, or misstatements were found in this post. It is being left up as it was originally written, just in case anybody is interested. For a more correct version, please refer to the journal article.

The technique commonly used to assess stabilization points of statistics is called split-half correlation. This post will show that within a fairly general modeling framework, the split-half correlation is a function of two things: the sample size and a variance parameter of the distribution of talent levels. It's therefore possible to skip the correlation step entirely and use statistical techniques to estimate the variance parameter of the talent distribution directly, and then use that to estimate the sample size required (with confidence bounds) for a specific stabilization level.

Theoretical Split-Half Correlation

(Note: This first part is very theoretical - it's the part that shows the statistical link between shrinkage and split-half correlation for a certain family of distributions. If you just want to trust me or your own experience that it exists, you can skip this and go straight to the "Estimation" section without missing too much.)

I'm going to work within my theoretical framework where the data follows a natural exponential family with a quadratic variance function (NEFQVF) - so this will work for normal-normal, beta-binomial, Poisson-gamma, and a few other models.

$X_i \sim p(x_i | \theta_i)$

$\theta_i \sim G(\theta_i| \mu, \eta)$

Split-half reliability takes two samples that are presumed to be measuring the same thing (I'll call these samples $X_i$ and $Y_i$) and calculates the correlation between them - if they actually are measuring the same thing, then the correlation should be high.

In baseball, it's commonly used as a function of the sample size $n$ to assess when a stat "stabilizes" - that is, if I take two samples of size $n$ from the same model (be it player at-bats, batters faced, etc.) and calculate the correlation of a statistic between the samples, then once the correlation exceeds a certain value, the statistic is considered to have "stabilized."

Let's say that $\bar{X_i}$ is the normalized statistic of $n$ observations (on-base percentage, for example) from the first "half" of the data (though it does not need to be chronological) and $\bar{Y_i}$ is the normalized statistic of $n$ observations from the second half of the data. In baseball terms, $\bar{X_i}$ might be something like the OBP from the first sample and $\bar{Y_i}$ the OBP from the second sample.

I want to find the correlation coefficient $\rho$, which is defined as

$\rho = \dfrac{Cov(\bar{X_i}, \bar{Y_i})}{\sqrt{Var(\bar{X_i})Var(\bar{Y_i})}}$

First, the numerator. The law of total covariance states that

$Cov(\bar{X_i}, \bar{Y_i}) = E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] + Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i])$

Given the same mean $\theta_i$, $\bar{X_i}$ and $\bar{Y_i}$ are assumed to be independent - hence,

$E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] = E[0] = 0$.

Functionally, this is saying that for a given player, the first set of performance data is independent of the second half of performance data.

Since $E[\bar{X_i} | \theta_i] = E[\bar{Y_i} | \theta_i] = \theta_i$ for the framework I'm working in, the second part becomes

$Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i]) = Cov(\theta_i, \theta_i) = Var(\theta_i)$

Thus, $Cov(\bar{X_i}, \bar{Y_i})$ is equal to the "between player" variance - that is, the variance among league talent levels.

Now, the denominator. As I've used before, the law of total variance states that

$Var(\bar{X_i}) = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i}| \theta_i])$

Since $\bar{X_i}$ and $\bar{Y_i}$ have the same distributional form, they will have the same variance.

$Var(\bar{X_i}) = Var(\bar{Y_i}) = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i]) = \dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)$

Where $E[V(\theta_i)]$ is the average variance of performance at the level of plate appearance, inning pitched, etc. Hence, the split correlation between them will be

$\rho = \dfrac{Var(E[\bar{X_i }| \theta_i])}{E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i])} = \dfrac{Var(\theta_i)}{\dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)} = 1 - B $

Where $B$ is the shrinkage coefficient. The important theoretical result, then, is that for NEFQVF distributions, the split-half correlation is equal to one minus the shrinkage coefficient. Since we know the form of the shrinkage coefficient, we can estimate what the split-half correlation will be for different values of $n$.
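As a sanity check on this result, here is a small simulation sketch in Python for the beta-binomial case, where $1 - B = n/(M + n)$ (the form used in the Estimation section). The function names and the parameter values (mean 0.33, $M = 300$, $n = 300$) are made up for illustration, and the simulated correlation will only match the theoretical value up to sampling noise.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def simulate_split_half(mu, M, n, players, seed=0):
    """Correlation between two independent n-trial samples per player."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(players):
        # True talent drawn from Beta(mu * M, (1 - mu) * M)
        theta = rng.betavariate(mu * M, (1 - mu) * M)
        first.append(sum(rng.random() < theta for _ in range(n)) / n)
        second.append(sum(rng.random() < theta for _ in range(n)) / n)
    return pearson(first, second)

mu, M, n = 0.33, 300.0, 300
print("simulated:  ", simulate_split_half(mu, M, n, players=2000))
print("theoretical:", n / (M + n))
```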

I'm not breaking any new ground here in terms of what people are doing in practice, but there does exist a theoretical justification linking split-half correlation and this particular formula method for shrinkage estimation.

Estimation

Using fangraphs.com, I collected data from all MLB batters (excluding pitchers) who had at least 300 PA (which is a somewhat arbitrary choice on my part - further discussion on this choice in the model criticisms section) from 2009 to 2014. I'm considering players to be different across years - so for example, 2009-2014 Miguel Cabrera is six different players. I'll define $x_i$ as the number of on-base events for player $i$ in $n_i$ plate appearances. I have $N$ of these players.

All my data and code is posted on my github if you would like to independently verify my calculations (and I'm learning to use github while I do this, so apologies if the formatting is completely wrong).

The number of on-base events $x_i$ follows a beta-binomial model - this fits into the NEFQVF family.

$x_i \sim Binomial(n_i, \theta_i)$

$\theta_i \sim Beta(\mu, M)$

Here I am using $\mu = \alpha/(\alpha + \beta)$ and $M = \alpha + \beta$ as opposed to the traditional $\alpha, \beta$ notation for a beta distribution. The true league mean OBP is $\mu$ and $M$ controls the variance of true OBP values among players.

Let's say I want to know at what sample size $n$ the split-half correlation will be at a certain value p. For a beta-binomial model, the split-half correlation is

$\rho = 1 - B = 1 - \dfrac{M}{M + n} = \dfrac{n}{M + n}$

Where $B$ is the shrinkage coefficient. So if we desire a stabilization level p, the required sample size is found by solving

$p = \dfrac{n}{M + n}$

For $n$. The solution is

$n= \left(\dfrac{p}{1-p}\right) M$

Given an estimate of the talent variance parameter $\hat{M}$, the estimated $n$ is

$\hat{n} = \left(\dfrac{p}{1-p}\right)\hat{M}$

As a side note, at $n = M$ the split-half correlation and shrinkage amount are both 0.5.
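The two formulas above are inverses of each other, which a short Python sketch makes easy to check. The function names are hypothetical and $M = 300$ is a round number chosen purely for illustration.

```python
def split_half_correlation(n, M):
    """rho = n / (M + n) under the beta-binomial model."""
    return n / (M + n)

def required_sample_size(p, M):
    """Sample size at which the split-half correlation equals p."""
    return p / (1 - p) * M

M = 300.0  # illustrative value only
for p in (0.5, 0.7, 0.8):
    n = required_sample_size(p, M)
    print(p, f"{n:.0f}", round(split_half_correlation(n, M), 3))
```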

For estimation of $M$, I'm going to use marginal maximum likelihood. For a one-dimensional introduction to maximum likelihood, see my post on maximum likelihood estimation for batting averages.

The marginal distribution of on-base events $x_i$ given $\mu$ and $M$ is a beta-binomial distribution, with mass function given by

$ p(x_i | \mu, M) = \displaystyle \int_0^1 {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu M -1}(1-\theta_i)^{n_i - x_i + (1-\mu)M-1}}{\beta(\mu M, (1-\mu)M)} d\theta_i = {n_i \choose x_i} \dfrac{\beta(x_i + \mu M, n_i - x_i + (1-\mu) M)}{\beta(\mu M, (1-\mu)M)}$

This distribution represents the probability of $x_i$ on-base events in $n_i$ plate appearances given league mean OBP $\mu$ and variance parameter $M$, bypassing the choice of player altogether. The maximum likelihood estimate says to choose the values of $\mu$ and $M$ that maximize the joint probability of the observed OBP values.

For a sample of size $N$ players, each with $x_i$ on-base events in $n_i$ plate appearances, the log-likelihood is given as

$\ell(\mu, M) = \displaystyle \left[ \sum_{i =1}^N \log(\beta(x_i + \mu M, n_i - x_i + (1-\mu) M))\right] - N \log(\beta(\mu M, (1-\mu)M))$
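A minimal Python sketch of this log-likelihood, for illustration only. The names and the toy data are made up; like the formula above, it leaves out the binomial coefficient, which does not involve $\mu$ or $M$.

```python
import math

def log_beta(a, b):
    """log of the beta function, via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_loglik(mu, M, x, n):
    """ell(mu, M) as written above."""
    a, b = mu * M, (1 - mu) * M
    total = -len(x) * log_beta(a, b)
    for xi, ni in zip(x, n):
        total += log_beta(xi + a, ni - xi + b)
    return total

# Hypothetical toy data: on-base events and plate appearances
x = [200, 150, 230]
n = [600, 500, 650]
print(beta_binomial_loglik(0.33, 300.0, x, n))
```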

This must be maximized numerically using computer software - I wrote a program using the Newton-Raphson algorithm to do this in $R$, which is posted on my github. For estimation, I actually converted $M$ to $\phi = 1/(1+M)$ in the above equation, performed the maximization, and then converted my estimate back to the original scale with $\hat{M} = (1-\hat{\phi})/\hat{\phi}$. The technical details of why I did this are a bit too much here, but it makes the estimation procedure much more stable - I'll be happy to discuss this with anybody who wants to know.
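The reparameterization itself is a one-line change of variables that maps $M \in (0, \infty)$ onto $\phi \in (0, 1)$. The Python sketch below (hypothetical function names) shows the round trip, along with the first-order delta-method approximation that follows from $dM/d\phi = -1/\phi^2$, which is the kind of conversion needed to move a variance from the $\phi$ scale back to the $M$ scale. The variance passed in at the end is an arbitrary placeholder, not a value from the fit.

```python
def M_to_phi(M):
    return 1.0 / (1.0 + M)

def phi_to_M(phi):
    return (1.0 - phi) / phi

def var_M_from_var_phi(phi_hat, var_phi):
    """Delta method: Var(M) ~ (dM/dphi)^2 Var(phi) = Var(phi) / phi^4."""
    return var_phi / phi_hat ** 4

phi = M_to_phi(295.7912)
print(phi, phi_to_M(phi))
print(var_M_from_var_phi(phi, 1e-8))  # 1e-8 is a placeholder variance
```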

The maximum likelihood estimates are given by $\hat{\mu} = 0.332$ and $\hat{M} = 295.7912$.

Above is the distribution of observed OBP values with the estimated distribution of true OBP levels overlaid as dashed lines.

This means that for a split-half correlation of $p = 0.7$, the estimate of sample size required is

$\hat{n} = \left(\dfrac{0.7}{1-0.7}\right)295.7912 = 690.1794$

Which we could round to $\hat{n} = 690$ plate appearances. Since $\hat{M}$ is the maximum likelihood estimator, $\hat{n}$ is the maximum likelihood estimate of the point at which split-half correlation is 0.7 by invariance of the maximum likelihood estimator.

Furthermore, if we have a variance $Var(\hat{M})$, the variance of the estimated sample size for p-stabilization is given as

$Var(\hat{n}) = Var\left( \left(\dfrac{p}{1-p}\right)\hat{M}\right) = \left(\dfrac{p}{1-p}\right)^2 Var(\hat{M})$

So a $(1-\alpha)\times 100\%$ confidence interval for $n$ is given as

$\left(\dfrac{p}{1-p}\right)\hat{M} \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{M})}$

The output from the maximum likelihood estimation can be used to estimate $Var(\hat{M})$. Since I estimated $\hat{\phi}$, I had to get $Var(\hat{\phi})$ from output of the computer program I used and then use the delta method to convert it back to the scale of $M$. Doing that, I got $Var(\hat{M}) = 269.1678$. This gives a 95% confidence interval for the 0.7-stabilization point as

$690.1794 \pm 1.96 \left(\dfrac{0.7}{1-0.7}\right) \sqrt{269.1678} = (615.1478, 765.211)$

Or between approximately 615 and 765 plate appearances. A 95% confidence interval for the 0.5-stabilization point (which is just $\hat{M}$) is between approximately 264 and 328 plate appearances.
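The interval arithmetic can be reproduced with a few lines of Python. This is a sketch with a made-up function name; the inputs are the $\hat{M}$ and $Var(\hat{M})$ reported above.

```python
import math

def stabilization_ci(p, M_hat, var_M, z=1.96):
    """Confidence interval for the sample size giving split-half correlation p."""
    scale = p / (1 - p)
    center = scale * M_hat
    half_width = z * scale * math.sqrt(var_M)
    return center - half_width, center + half_width

lo, hi = stabilization_ci(0.7, 295.7912, 269.1678)
print(round(lo, 2), round(hi, 2))  # 615.15 765.21

lo50, hi50 = stabilization_ci(0.5, 295.7912, 269.1678)
print(round(lo50), round(hi50))  # 264 328
```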

For an arbitrary p-stabilization point sample size, the confidence interval formula is

$\left(\dfrac{p}{1-p}\right)295.7912 \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{269.1678}$

Below is a graph of the required sample size for a stabilization level of p between 0.5 and 0.8 - the dashed lines are 95% confidence bounds.

As you can see, there are diminishing returns - to stabilize more, you need an increasingly larger sample size.

Model Criticisms

Basic criticisms about modeling baseball players apply: players are not machines, plate appearances are not independent and identical, etc. These criticisms will apply to just about any model of baseball data, including split-half correlations.

I have not adjusted the data in any way - I simply took the raw number of on-base events and plate appearances. Estimation could likely be improved by adjusting the data for various effects before running it through the model.

One thing should be obvious: this is a parametric estimator, dependent on the model I chose. If the model I chose does not fit well, then the estimate will be bad. I stuck to OBP for this discussion because it seems to fit the beta-binomial model well. No model is correct, of course, but I do believe the beta-binomial model is close enough to be useful. I simulated data from a beta-binomial model using my estimated parameters and the fixed number of plate appearances, and both visually and with some basic summary statistics the simulated data looked close to the actual data. Not identical - the real data appears to skew slightly right in comparison to the simulated data, and being identical isn't a realistic goal anyway - but close. Other statistics could require the use of a different distribution.

As I mentioned, the cutoff of 300 PA or greater was somewhat arbitrary - I will fully admit that it's because I don't have a clearly defined population of players in mind. I know pitchers shouldn't be included, and I know that someone who got 10 PA and then was sent down shouldn't be included in the model, but I'm not sure what the correct cutoff for PA should be to get at this vague idea of "MLB-level" hitters I have. That's a problem with this analysis, but one that is easy to correct with the right information.

There's a bias/variance trade-off at play here - if I set the cutoff too low then I'm going to get too many players included that aren't from the population I want included in the sample, but the more players I feed into the model the smaller my variance of estimation is. Below is a plot of $\hat{M}$ with 95% confidence bounds for cutoff points from 50 PA to 600 PA.

Around 300 PA seems to be the cutoff value that leads to a roughly stable $M$ estimate that doesn't veer off into being erratic from the lack of information, and seems to approximately conform with what I know about how many plate appearances a semi-regular MLB player should get.

Lower cutoff points tend to lead to lower stabilization points, as they include hitters with smaller true OBPs, decreasing the league average OBP (and with it the average amount of variance around OBP values) and increasing the variance among true OBP values - the effect of which is to estimate $M$ smaller.

The bigger problem I have is that one of the assumptions is the number of plate appearances is independent of the observed on-base percentage - that if one player gets 500 PA and another player gets 700 PA, it tells us nothing about either of the players' true OBP values - they just happened to get 500 and 700 PA, respectively. Of course, we know this isn't true - hitters get more plate appearances specifically because they get on base more.

Since the model works by assuming that players with more plate appearances have less variation around their true OBP values, it will make the estimate of the league mean OBP higher - which will affect $M$.

Even with all of its problems, I think this estimation method is useful. I don't think I've done anything that other people aren't already doing, but I just wanted to work through it in a way that makes sense to me. Comments are appreciated.


Shrinkage Estimators for Counting Statistics

2015-07-30T08:59:00.000-05:00

Edit 19 March 2020: This post has been adapted into a paper in the Journal of Mathematical Psychology. In the process of writing the paper, a number of mistakes, omissions, or misstatements were found in this post. It is being left up as it was originally written, just in case anybody is interested. For a more correct version, please refer to the journal article.

Warning: this post is going to be incredibly technical, even by the standards of this blog. If what I normally post is gory math, this is the running of the bulls. I'm making it so I can refer back to it when I need to.

The goal is to set up the theoretical framework for shrinkage estimation of normalized counting statistics to some common mean. I will fully admit this is a very, very limited framework, but some of the most basic baseball statistics fit into it. In the future I hope I can possibly expand this to include more advanced statistics.

I will give (not show) a few purely theoretical results - for proofs, see Natural Exponential Families with Quadratic Variance Functions by Carl Morris in The Annals of Statistics, Vol. 11, No. 2 (1983), 515-529, or the more updated version of that paper.

Theoretical Framework

Let's say I have some metric $X_i$ for player, team, or object $i$. In this framework, $X_i$ represents a count or a sum of some kind - the raw number of hits, or the raw number of earned runs, etc. I know that $X_i$ is the result of a random process that is controlled by a probability distribution with parameter $\theta_i$, which is unique to each player, team, or object - in baseball, for example, $\theta_i$ represents the player's true "talent" level with respect to metric $X_i$.

$X_i \sim p(x_i | \theta_i)$

I have to assume that the talent levels $\theta_i$ are exchangeable, though the definition is a bit too much to go into here.

I'm going to assume that $p(x_i | \theta_i)$ is a member of the natural exponential family with a quadratic variance function (NEFQVF) - this includes very common distributions such as the normal, binomial, Poisson, gamma, and negative binomial.

Each of these can be written as the convolution (sum) of $n_i$ other independent, identical distributions, each of which is also NEFQVF with mean $\theta_i$ - the normal is the sum of normals, the binomial is the sum of Bernoullis, the Poisson is the sum of Poissons, the negative binomial is the sum of geometrics, etc. I will assume that is the case here - that

$X_i = \displaystyle \sum_{j = 1}^{n_i} Y_{ij}$

Translating this to baseball terms, this means that $Y_{ij}$ is the outcome of inning, plate appearance, etc., $j$ for player $i$ ($j$ ranges from 1 to $n_i$). The metric $X_i$ is then the sum of $n_i$ of these outcomes. Each outcome is assumed to be independent and identically distributed. Once again, $X_i$ is not normalized by dividing by $n_i$.

Conditional on having mean $\theta_i$, the expectations of the $Y_{ij}$ are

$E[Y_{ij} | \theta_i] = \theta_i$

And so conditional on having mean $\theta_i$, the expected value of $X_i$ is

$E[X_i | \theta_i] = E\left[\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right] = \displaystyle \sum_{j = 1}^{n_i} E\left[ Y_{ij} \biggr | \theta_i \right] = n_i E[Y_{ij}| \theta_i] = n_i \theta_i$

Baseball terms: if a player has, for example, on-base percentage $\theta_i$, then the number of on-base events I expect in $n_i$ plate appearances is $n_i \theta_i$. This does not have to be a whole number.

Similarly, and again conditional on mean $\theta_i$, the independence assumption allows us to write the variance of the $X_i$ as

$Var(X_i | \theta_i) = Var\left(\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right) = \displaystyle \sum_{j = 1}^{n_i} Var\left( Y_{ij} \biggr | \theta_i \right) = n_i Var(Y_{ij}| \theta_i) = n_i V(\theta_i)$

I'm going to repeat that last bit of notation again, because it's important:

$Var(Y_{ij}| \theta_i) =V(\theta_i)$

$V(\theta_i)$ is the variance of the outcome at the most basic level - plate appearance, inning, batter faced, etc. - conditional on having mean $\theta_i$. For NEFQVF distributions, this has a very particular form - the variance can be written as a polynomial function of the mean $\theta_i$ up to degree 2 (this is the "Quadratic Variance Function" part of NEFQVF):

$Var(Y_{ij} | \theta_i) = V(\theta_i) = c_0 + c_1 \theta_i + c_2 \theta_i^2$

For example, the normal distribution has $V(\theta_i) = \sigma^2$, so it fits the QVF model with $c_0 = \sigma^2$ and $c_1 = c_2 = 0$. For the Binomial distribution, $V(\theta_i) = \theta_i (1-\theta_i) = \theta_i - \theta_i^2$, so it fits the QVF model with $c_0 = 0, c_1 = 1$, and $c_2 = -1$. The Poisson distribution has $V(\theta_i) = \theta_i$, so it fits the QVF model with $c_0 = c_2 = 0$ and $c_1 = 1$.
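As a small illustration of the variance function, here is a minimal Python sketch that evaluates $V(\theta_i) = c_0 + c_1 \theta_i + c_2 \theta_i^2$ for the three examples just given. The function name `qvf`, the dictionary `COEFFS`, and the values $\sigma^2 = 1$ and $\theta_i = 0.3$ are my own choices for illustration and not part of the framework.

```python
def qvf(theta, c0, c1, c2):
    """Quadratic variance function V(theta) = c0 + c1*theta + c2*theta^2."""
    return c0 + c1 * theta + c2 * theta ** 2

# (c0, c1, c2) for the three examples in the text.
# sigma2 = 1.0 is an arbitrary illustrative value for the normal case.
sigma2 = 1.0
COEFFS = {
    "normal": (sigma2, 0.0, 0.0),    # V(theta) = sigma^2
    "binomial": (0.0, 1.0, -1.0),    # V(theta) = theta - theta^2
    "poisson": (0.0, 1.0, 0.0),      # V(theta) = theta
}

theta = 0.3  # hypothetical talent level
for name, (c0, c1, c2) in COEFFS.items():
    print(name, qvf(theta, c0, c1, c2))
```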

I'm now going to assume that the talent levels $\theta_i$ themselves follow some distribution $G(\theta_i | \mu, \eta)$. The parameter $\mu$ is the expected value of the $\theta_i$ ($E[\theta_i] = \mu$), and it represents the league average talent level. The parameter $\eta$ controls, but is not necessarily equal to, the variance of $\theta_i$ (how spread out the talent levels are). Both are assumed to be known. The two-stage model is then

$X_i \sim p(x_i | \theta_i)$

$\theta_i \sim G(\theta_i | \mu, \eta)$

The unconditional expectation of the $X_i$ is

$E[X_i] = E[E[X_i | \theta_i]] = E[n_i \theta_i] = n_i \mu$

And the unconditional variance of $X_i$ is

$Var(X_i) = E[Var(X_i | \theta_i)] + Var(E[X_i | \theta_i]) = n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i) $

In the above formula, the quantity $E[V(\theta_i)]$ is the average variance of the outcome at the most basic level (plate appearance, inning, etc.), averaging over all possible talent levels $\theta_i$. The quantity $Var(\theta_i)$ is the variance of the talent levels themselves - how spread out talent is in the league.
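To check the variance decomposition numerically, here is a minimal sketch under assumptions of my own: a binomial data model with $n_i = 25$ and a made-up three-point talent distribution (talent levels 0.2, 0.3, and 0.4, equally likely). It computes the variance of $X_i$ directly from the marginal distribution and compares it to $n_i E[V(\theta_i)] + n_i^2 Var(\theta_i)$.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability of x successes in n trials with success rate p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical discrete talent distribution (not from the post)
thetas = [0.2, 0.3, 0.4]
weights = [1 / 3, 1 / 3, 1 / 3]
n = 25

# Marginal distribution of X, averaging the binomial over the talent levels
marginal = [
    sum(w * binom_pmf(x, n, t) for t, w in zip(thetas, weights))
    for x in range(n + 1)
]
mean_x = sum(x * p for x, p in enumerate(marginal))
var_x = sum((x - mean_x) ** 2 * p for x, p in enumerate(marginal))

# Pieces of the formula: league mean, average variance, talent variance
mu = sum(w * t for t, w in zip(thetas, weights))
avg_v = sum(w * t * (1 - t) for t, w in zip(thetas, weights))
var_theta = sum(w * (t - mu) ** 2 for t, w in zip(thetas, weights))
formula_var = n * avg_v + n ** 2 * var_theta

print(mean_x, n * mu)       # the two means should agree
print(var_x, formula_var)   # the two variances should agree
```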

To this point I haven't normalized the $X_i$ by dividing each by $n_i$ - let's do that. If I define $\bar{X_i} = X_i/n_i,$ then based on the formulas above

$E[\bar{X_i}] = E\left[\dfrac{X_i}{n_i}\right] = \dfrac{1}{n_i} E[X_i] = \dfrac{n_i \mu}{n_i} = \mu$

And variance

$Var(\bar{X_i}) = Var\left(\dfrac{X_i}{n_i}\right) = \dfrac{1}{n_i^2} Var(X_i) = \dfrac{n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i)}{n_i^2} = \dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)$

As members of the exponential family, members of the NEFQVF family are guaranteed to have a conjugate prior distribution, so I'll assume that $G(\theta_i | \mu, \eta)$ is conjugate to $p(x_i | \theta_i)$. For example, if $X_i$ follows a normal distribution, $G(\theta_i | \mu, \eta)$ is a normal as well. If $X_i$ follows a Binomial distribution, then $G(\theta_i | \mu, \eta)$ is a beta distribution. If $X_i$ follows a Poisson distribution, then $G(\theta_i | \mu, \eta)$ is a gamma distribution. The priors themselves do not have to be NEFQVF.

Since $\eta$ and $\mu$ are assumed known, we can use Bayes' rule with conjugate prior $G(\theta_i | \mu, \eta)$ to calculate the posterior distribution for $\theta_i$

$\theta_i | x_i, \mu, \eta \sim \dfrac{p(x_i | \theta_i)G(\theta_i | \mu, \eta)}{\int p(x_i | \theta_i)G(\theta_i | \mu, \eta) d\theta_i}$

NEFQVF families have closed-form posterior densities.

I'm then going to take as my estimator the expected value of the posterior, $\hat{\theta_i} = E[\theta_i | x_i]$. Specifically for NEFQVF distributions with conjugate priors, the estimator is then given by

$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B) \bar{x_i} + B \mu$

Where $B$ is known as the shrinkage coefficient. For NEFQVF distributions, the form of $B$ is

$B = \dfrac{E[Var(\bar{X_i} | \theta_i)]}{Var(\bar{X_i})} = \dfrac{\dfrac{1}{n_i}E[ V(\theta_i)]}{\dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)} = \dfrac{E[V(\theta_i)]}{E[V(\theta_i)] + n_i Var(\theta_i)}$

Note: The above two formulas, and several of the rules I used to derive them, are guaranteed for NEF distributions and not just NEFQVF distributions; however, the conjugate prior for a NEF may not have a normalizing constant that exists in closed form, and in practical application the distributions that are actually used tend to be NEFQVF. For NEFQVF distributions, a few more algebraic results can be shown about the exact form of the shrinkage estimator by writing the conjugate prior in the general form for exponential densities - for more information, see section 5 of Morris (1983), mentioned in the introduction.

The shrinkage coefficient $B$ for NEFQVF distributions is the ratio of the within-metric variance to the total variance - it is a function of how noisy the data are compared to how spread out the talent levels are. If at a certain $n_i$ the normalized metric tends to be very noisy around its mean but the means tend to be clustered together, shrinkage will be large. If the normalized metric tends to stay close to its mean value but the means tend to be very spread out, shrinkage will be small. And as the number of observations $n_i$ grows bigger, the effect of the noise gets smaller, decreasing the shrinkage amount.

$B$ itself can be thought of as a shrinkage proportion - if $B = 0$ then there is no shrinkage, and the estimator is just the raw observation. This would occur if the average variance around the mean is zero - if there's no noise. If $B = 1$ then complete shrinkage takes place and the estimate of the player's true talent level is just the league average talent level. This occurs if the variance in league talent levels is equal to zero - every player has the exact same talent level.

Note that $B$ has no units, since both the top and bottom are variances, so rescaling the data will not change the shrinkage proportion.
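The general recipe can be sketched in a few lines of Python. This is only an illustration of the two formulas above: the function names `shrinkage_coefficient` and `shrink` and the numbers plugged in (an average variance of 0.2, a talent variance of 0.001, and 200 observations) are hypothetical.

```python
def shrinkage_coefficient(avg_var, talent_var, n):
    """B = E[V(theta)] / (E[V(theta)] + n * Var(theta))."""
    return avg_var / (avg_var + n * talent_var)

def shrink(x_bar, mu, B):
    """Shrinkage estimate: mu + (1 - B) * (x_bar - mu)."""
    return mu + (1 - B) * (x_bar - mu)

# Hypothetical inputs, chosen only to illustrate the calculation
avg_var = 0.2       # E[V(theta)], noise at the level of a single outcome
talent_var = 0.001  # Var(theta), spread of talent in the league
n = 200             # number of observations for this player

B = shrinkage_coefficient(avg_var, talent_var, n)
estimate = shrink(x_bar=0.400, mu=0.300, B=B)
print(B, estimate)
```

With these made-up inputs the noise term and the talent term are equal in size, so the estimate lands halfway between the raw observation and the league mean.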

I'm going to show a few examples, working through gory mathematical details.

WARNING: the above results are guaranteed only for NEFQVF distributions - the normal, binomial, negative binomial, Poisson, gamma, and NEF-GHS. Some results also apply to NEF distributions - see Morris (1983) for details. If the data model is not one of those distributions, I can't say whether or not the formulas I've given above will be correct.

Normal-Normal Example

Let's start with one familiar form - the normal model. This model says that $X_i$, the metric for player $i$, is normally distributed, and is constructed as a sum of $Y_{ij}$ random variables, which are also normally distributed with mean $\theta_i$ and known variance $\sigma^2$. The distribution of talent levels also follows a normal distribution with league mean $\mu$ and variance $\tau^2$.

This can be written as

$Y_{ij} \sim N(\theta_i, \sigma^2)$

$X_i \sim N(n_i \theta_i, n_i \sigma^2)$

$\theta_i \sim N(\mu, \tau^2)$

The average variance is simple. As stated before, $V(\theta_i) = \sigma^2$ is constant for the normal distribution, no matter what the actual $\theta_i$ is. Hence,

$E[V(\theta_i)] = E[\sigma^2] = \sigma^2$

The variance of the averages is simple, too - the model assumes it's constant as well.

$Var(\theta_i) = \tau^2$

This gives a shrinkage coefficient of

$B = \dfrac{\sigma^2}{\sigma^2 + n_i \tau^2}$

Which, if I divide both the top and bottom by $n_i$, might look more familiar as

$B = \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}$

The shrinkage estimator is then

$\hat{\theta_i} = \mu + \left(1 - \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}\right)(\bar{x_i} - \mu)$

Alternatively, I can write $B$ as

$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i}$

And then it follows the familiar pattern from other estimators of $B = m/(m + n)$ for some parameter $m$.
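As a quick check that the three ways of writing $B$ agree, here is a minimal sketch. The values of $\sigma^2$, $\tau^2$, and $n_i$ are hypothetical and chosen only for illustration.

```python
# Hypothetical values, not estimated from any real data
sigma2 = 0.25   # variance of a single outcome Y_ij
tau2 = 0.001    # variance of the talent levels theta_i
n = 100         # number of outcomes for player i

# Form 1: variance of a single outcome against n times the talent variance
B1 = sigma2 / (sigma2 + n * tau2)

# Form 2: top and bottom divided by n
B2 = (sigma2 / n) / (sigma2 / n + tau2)

# Form 3: the m / (m + n) pattern with m = sigma^2 / tau^2
m = sigma2 / tau2
B3 = m / (m + n)

print(B1, B2, B3)  # all three should be the same number
```

Here $m = \sigma^2/\tau^2 = 250$, which can be read as the number of observations at which the estimate sits exactly halfway between the raw average and the league mean.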

It may seem like the normal-normal is not of use - how many counting statistics are there that are normally distributed at the level of inning, plate appearance, or batter faced? The very idea that they are counting statistics says that that's impossible.

However, the central limit theorem guarantees that sums of independent, identical random variables converge to a normal - hence the distribution of $X_i$ should be unimodal and bell-shaped for large enough $n_i$ (and I'll intentionally leave the discussion of what constitutes "large enough" aside). Thus, as long as the distribution of the $\theta_i$ (the distribution of talent levels) is bell-shaped and symmetric, using a normal-normal with the normal as an approximation at the $X_i$ level should work.

Beta-Binomial Example

Suppose we're measuring the sum of binary events of some kind - a hit, an on-base event, a strikeout, etc. - in $n_i$ observations - plate appearances, innings pitched, batters faced, etc. Each event can be thought of as a sample from a Bernoulli distribution (these are the $Y_{ij}$) with variance function $V(\theta_i) = \theta_i(1-\theta_i)$. The observed metric $X_i$ is binomial, and it is constructed as the sum of these Bernoulli random variables

$Y_{ij} \sim Bernoulli(\theta_i)$

$X_i \sim Binomial (n_i, \theta_i)$

The conjugate prior distribution for the binomial distribution is the beta.

$\theta_i \sim Beta(\mu, M)$

Fitting with the framework given above, I'm using $\mu = \alpha/(\alpha+\beta)$ and $M = \alpha + \beta$ instead of the traditional $\alpha, \beta$ parametrization, so that $\mu$ represents the league mean and $M$ controls the variation.

The average variance is fairly complicated here. We need to find

$E[V(\theta_i)] = E[\theta_i(1-\theta_i)] = \displaystyle \int_0^1 \dfrac{\theta_i(1-\theta_i) \cdot \theta_i^{\mu M-1}(1-\theta_i)^{(1-\mu) M-1}}{\beta(\mu M, (1-\mu) M)} d\theta_i = \dfrac{\displaystyle \int_0^1 \theta_i^{\mu M}(1-\theta_i)^{(1-\mu) M} d\theta_i}{\beta(\mu M, (1-\mu) M)}$

The top part is a $\beta(\mu M + 1, (1-\mu)M + 1)$ function. Utilizing the properties of the beta function, we have

$E[\theta_i(1-\theta_i)] = \dfrac{\beta(\mu M+1, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)} = \dfrac{\beta(\mu M, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) = $

$\dfrac{\beta(\mu M, (1-\mu) M )}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) \left(\dfrac{(1-\mu) M}{\mu M + (1-\mu) M}\right) = \dfrac{\mu(1-\mu)M^2}{(M+1)M} = \dfrac{\mu(1-\mu) M}{M+1}$
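The beta function algebra is easy to get wrong, so here is a minimal numerical check. It computes the ratio of beta functions directly (through log-gamma functions from Python's standard library) and compares it to the closed form $\mu(1-\mu)M/(M+1)$. The values $\mu = 0.3$ and $M = 50$ are hypothetical.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the beta function, log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Hypothetical league mean and prior "sample size"
mu, M = 0.3, 50.0
a, b = mu * M, (1 - mu) * M   # traditional alpha, beta parameters

# E[theta(1 - theta)] as the ratio B(a + 1, b + 1) / B(a, b)
ratio = exp(log_beta(a + 1, b + 1) - log_beta(a, b))

# Closed form from the derivation above
closed_form = mu * (1 - mu) * M / (M + 1)

print(ratio, closed_form)  # should agree up to floating point error
```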

The variance of the $\theta_i$ doesn't require nearly as much calculus, since it can be taken directly as the variance of a beta distribution

$Var(\theta_i) = \dfrac{\mu(1-\mu)}{M+1}$

The shrinkage coefficient $B$ is then

$B = \dfrac{\dfrac{\mu(1-\mu)M}{(M+1)}}{\dfrac{\mu(1-\mu)M}{(M+1)} +\dfrac{n_i \mu(1-\mu)}{(M+1)}} = \dfrac{M}{M + n_i}$

The quantity $\mu(1-\mu)/(M+1)$ appears in every term on the top and bottom, so it cancels out. Using this model, the shrinkage estimator is given by

$\hat{\theta_i} = \mu + \left(1 - \dfrac{M}{M + n_i}\right)\left(\bar{x_i} - \mu\right)$
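As a sanity check, the shrinkage estimator should match the usual beta-binomial posterior mean $(\alpha + x_i)/(\alpha + \beta + n_i)$. The sketch below compares the two for 5 on-base events in 25 plate appearances under a uniform $Beta(1,1)$ prior, which corresponds to $\mu = 0.5$ and $M = 2$. The function name `beta_binomial_estimate` is my own.

```python
def beta_binomial_estimate(x, n, mu, M):
    """Shrinkage estimate mu + (1 - B)(x/n - mu) with B = M / (M + n)."""
    B = M / (M + n)
    return mu + (1 - B) * (x / n - mu)

# Uniform Beta(1, 1) prior written in the (mu, M) parametrization
mu, M = 0.5, 2.0
alpha, beta = mu * M, (1 - mu) * M   # both equal 1

x, n = 5, 25  # 5 on-base events in 25 plate appearances

shrunk = beta_binomial_estimate(x, n, mu, M)
posterior_mean = (alpha + x) / (alpha + beta + n)  # mean of Beta(alpha + x, beta + n - x)

print(shrunk, posterior_mean)  # both should equal 6/27
```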

Poisson-Gamma Example

Now suppose that instead of a binary event, the outcome can be a count - zero, one, two, three, etc. Each count can be thought of as a sample from a Poisson distribution with parameter $\theta_i$ (these are the $Y_{ij}$, with $V(\theta_i) = \theta_i$) with $X_i$ as the sum total of counts, which also has a Poisson distribution with parameter $n_i \theta_i$.

$Y_{ij} \sim Poisson(\theta_i)$

$X_i \sim Poisson(n_i \theta_i)$

The conjugate prior distribution of $\theta_i$ for a Poisson is a gamma.

$\theta_i \sim Gamma(\mu, K)$

In this parametrization, I'm using $\mu = \alpha/\beta$ and $K = \beta$ as compared to the traditional $\alpha, \beta$ parametrization.

The average variance is

$E[V(\theta_i)] = E[\theta_i] = \mu$

And the variance of the averages is

$Var(\theta_i) = \dfrac{\mu}{K}$

So the shrinkage coefficient $B$ is

$B = \dfrac{\mu}{\mu + \dfrac{n_i \mu}{K}} = \dfrac{1}{1 + \dfrac{n_i}{K}} = \dfrac{K}{K + n_i}$

Which gives a shrinkage estimator of

$\hat{\theta_i} = \mu + \left(1 - \dfrac{K}{K + n_i}\right)(\bar{x_i} - \mu)$
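The same kind of check works here: with a $Gamma(\alpha, \beta)$ prior (rate parametrization) and a Poisson total $x_i$ over $n_i$ observations, the posterior mean is $(\alpha + x_i)/(\beta + n_i)$, and the shrinkage estimator should reproduce it. The numbers below ($\mu = 0.5$, $K = 20$, 12 events in 10 observations) are hypothetical, as is the function name.

```python
def poisson_gamma_estimate(x, n, mu, K):
    """Shrinkage estimate mu + (1 - B)(x/n - mu) with B = K / (K + n)."""
    B = K / (K + n)
    return mu + (1 - B) * (x / n - mu)

# Hypothetical prior in the (mu, K) parametrization
mu, K = 0.5, 20.0
alpha, beta = mu * K, K   # traditional shape and rate: alpha = 10, beta = 20

x, n = 12, 10  # 12 total events in 10 observations (for example, innings)

shrunk = poisson_gamma_estimate(x, n, mu, K)
posterior_mean = (alpha + x) / (beta + n)  # mean of Gamma(alpha + x, beta + n)

print(shrunk, posterior_mean)  # both should equal 22/30
```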

What Statistics Fit Into this Framework?

Any counting statistic that is constructed as a sum of the same basic events falls under this framework. It's possible to combine multiple basic events into one "super" event, as long as they are considered to be equal. Examples of this include batting average, on-base percentage, earned run average, batting average on balls in play, fielding percentage, stolen base percentage, team win percentage, etc. It's possible to weight the sum, as long as you're just adding the same type of event to itself over and over.

Any statistic that is a sum, weighted or unweighted, of different events does not fall into this framework - examples include weighted on-base average, slugging percentage, on-base plus slugging percentage, fielding independent pitching, isolated power, etc. Also, any statistics that are ratios of counts - strikeout-to-walk ratio, for example - do not fall under this framework.

Statistics like wins above replacement are right out.

I want to make clear that this is simply a discussion of what statistics fit nominally into a very specific theoretical framework. A statistic falling under the framework does not imply that a statistic is good, nor does not falling under it imply that a statistic is bad. Furthermore, even if a statistic does not fall under this framework, shrinkage estimation using these formulas may still work as a very good approximation - the best statistics in sabermetrics today are often weighted sums of counting events, and people have been using these shrinkage estimators on them successfully for years, so clearly they must be doing something right. This is simply what I can justify using statistical theory.

Performing the Analysis

The values of $\eta$ and $\mu$ must be chosen or estimated. If prior data exists - like, for example, historical baseball data - values can be chosen based upon a careful analysis of that information. If no prior data exists, one option is to estimate the parameters through either moment-based or marginal likelihood-based estimation, and then plug in those values - this method is known as parametric empirical Bayes. Another option is to place a hyperprior or hyperpriors on $\eta$ and $\mu$ and perform a full hierarchical Bayesian analysis, which will almost certainly involve MCMC. Depending on the form of your prior, your shrunk results will likely be similar to, but not equal to, the shrinkage estimators given here.

What if none of the NEFQVF models appear to fit your data? You have a few options, such as nonparametric or hierarchical Bayesian modeling, but any method is going to get more difficult and more computational.