However, good news! In the past two years, my research on reliability for non-normal data has been corrected, expanded upon, and published in academic journals. I can now say definitively that my maximum likelihood estimator estimates the reliability of these statistics in exactly the same sense as Cronbach's alpha or KR-20, and it performs as well as or better than Cronbach's alpha when the model is correct - and while no model is truly correct, I believe this one is very accurate. The article can be found here (for the preprint, click here). I also published a paper with KR-20- and KR-21-style reliability estimators specifically for exponential family distributions such as the binomial, Poisson, etc. The article can be found here (for the preprint, click here). Those estimators are a little more efficient for small sample sizes, but for large sample sizes such as in this case the two sets of estimators should be nearly identical.
As usual, all data and code I used for this post can be found on my github.
I make no claims about the stability, efficiency, or optimality of my code.
I've included standard error estimates for 2021, but these should not be used to perform any kind of test or interval comparison against the values from previous years: those values are estimates themselves with their own standard errors, and approximately 5/6 of the underlying data is common to the two estimates. The calculations I performed for 2015 can be found here for batting statistics and here for pitching statistics. The calculations for 2016 can be found here. The 2017 calculations can be found here. The 2018 calculations can be found here. The 2019 calculations can be found here. I didn't do calculations in 2020 because of the pandemic.
The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) required to be included for a year. These cutoff values, and the choice of six years' worth of data (2015-2020), were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.
Offensive Statistics
\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu}
& \textrm{Cutoff}&2019\textrm{ }\hat{M} \\ \hline
\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 302.57 & 18.39 & 0.331 & 300 & 295.20 \\
\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 451.24 & 47.22 & 0.306 & 300 & 431.49 \\
\textrm{BA}&\textrm{H/AB} & 511.71 & 42.78 & 0.265 & 300 & 488.49 \\
\textrm{SO Rate}&\textrm{SO/PA} & 50.37 & 2.12 & 0.205 & 300 & 49.05 \\
\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 100.47 & 4.67 & 0.080 & 300 & 104.08 \\
\textrm{1B Rate}&\textrm{1B/PA} & 191.17 & 10.20 & 0.150 & 300 & 197.43 \\
\textrm{2B Rate}&\textrm{2B/PA} & 1242.67 & 162.27 & 0.047 & 300 & 1200.46 \\
\textrm{3B Rate}&\textrm{3B/PA} & 481.11 & 28.74 & 0.005 & 300 & 421.91 \\
\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1059.31 & 124.09 & 0.052 & 300 & 1070.09 \\
\textrm{HR Rate} & \textrm{HR/PA} & 146.00 & 7.68 & 0.034 & 300 & 141.80\\
\textrm{HBP Rate} & \textrm{HBP/PA} & 261.13 & 16.56 & 0.010 & 300 & 266.92 \\ \hline
\end{array}
In general, a larger stabilization point reflects a decreased spread of talent levels: as talent levels move closer together, extreme stats become less and less likely, and observed stats are shrunk harder toward the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, a smaller stabilization point points toward an increase in the spread of talent levels.
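To make the shrinkage interpretation concrete, here is a small Python sketch (the 0.380-in-300-PA player is hypothetical; $\hat{M}$ and $\hat{\mu}$ are the OBP values from the table above):

```python
def shrink(p_obs, n, M, mu):
    """Shrink an observed rate toward the league mean mu.

    The observed rate gets weight n/(n + M), so when n = M the
    estimate sits exactly halfway between the player and the league."""
    w = n / (n + M)
    return w * p_obs + (1 - w) * mu

M_hat, mu_hat = 302.57, 0.331                 # OBP row of the table above
estimate = shrink(0.380, 300, M_hat, mu_hat)  # ~0.355: about half skill, half luck
```

With a larger $\hat{M}$ (tighter talent spread), the same 300 PA would earn a smaller weight and the estimate would sit closer to the league mean.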
The stabilization point of the 3B rate increased dramatically - by approximately two standard errors - indicating that the distribution of talent for hitting triples has clustered more closely around its mean. In general, however, most stabilization points are roughly the same as in the previous calculation, keeping in mind that year-to-year and sample-to-sample variation in the estimates is expected even if the true stabilization points are not changing.
Pitching Statistics
\begin{array}{| l | l | c | c | c | c | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu}
& \textrm{Cutoff}& 2019 \textrm{ }\hat{M} \\ \hline
\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1061.43 & 197.34 & 0.286 &300& 1184.38 \\
\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 66.20 & 4.25 & 0.443 &300& 64.51\\
\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 62.33 & 3.97 & 0.346 &300& 60.68 \\
\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1773.66 & 486.12 & 0.211 &300& 2197.02 \\
\textrm{HR/FB Rate}&\textrm{HR/FB}& 529.40 & 129.10 & 0.130 & 100 & 351.53 \\
\textrm{SO Rate}&\textrm{SO/TBF}& 80.78 & 4.97 & 0.214 &400& 90.86 \\
\textrm{HR Rate}&\textrm{HR/TBF}& 959.57 & 133.07 & 0.031 &400& 764.48\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 251.22 & 19.47 & 0.072 & 400 & 230.09 \\
\textrm{HBP Rate}&\textrm{HBP/TBF}& 1035.90 & 153.68 & 0.009 &400& 906.25 \\
\textrm{Hit rate}&\textrm{H/TBF}& 453.30 & 37.52 & 0.232 &400& 496.56 \\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 407.36 & 36.33 & 0.313 &400& 443.60 \\
\textrm{WHIP}&\textrm{(H + BB)/IP*}& 63.38 & 4.79 & 1.29 &80& 67.84 \\
\textrm{ER Rate}&\textrm{ER/IP*}& 57.73 & 4.30 & 0.460 &80& 57.97 \\
\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 64.70 & 4.92 & 1.23 &80& 67.23 \\ \hline
\end{array}
* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively.
Most are about the same, but the HR/FB stabilization point has shifted up dramatically relative to its standard error, indicating a likely change in the true talent distribution and not just sample-to-sample and year-to-year variation. This suggests that the distribution of HR/FB talent levels is clustering around its mean, possibly reflecting a change in approach by pitchers or batters over the past two years. The mean has also shifted up since the previous calculation. Similarly, the HR rate stabilization point and mean have both increased. Conversely, the strikeout rate stabilization point has decreased, indicating less clustering of talent levels around the mean, while its mean has also increased.
Usage
Aside from the obvious use of knowing approximately when results are half due to luck and half skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article From Stabilization to Interval Estimation.
For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is

$\left(\dfrac{n}{n + \hat{M}}\right)\hat{p} + \left(\dfrac{\hat{M}}{n + \hat{M}}\right)\hat{\mu} \pm 1.96\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n + \hat{M}}}$

with $\hat{p} = 0.380$, $n = 300$, $\hat{M} = 302.57$, and $\hat{\mu} = 0.331$.
That is, we believe the player's true on-base percentage to be between 0.317 and 0.392 with 95% confidence. I used a normal distribution for talent levels with a normal approximation to the binomial for the distribution of observed OBP, but that is not the only possible choice - it just resulted in the simplest formulas for the intervals.
Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. A 95% prediction interval for his OBP over those PA is given by

$\left(\dfrac{n}{n + \hat{M}}\right)\hat{p} + \left(\dfrac{\hat{M}}{n + \hat{M}}\right)\hat{\mu} \pm 1.96\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n + \hat{M}} + \dfrac{1}{\tilde{n}}\right)}$
That is, 95% of the time the player's OBP over those 250 second-half PA should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take into account only the league mean and variance and the player's own statistics - their only advantage is over 95% "unshrunk" intervals - but when I tested them in my article "From Stabilization to Interval Estimation," they worked well for prediction.
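Both interval calculations can be sketched in a few lines of Python. The variance terms below use the simple normal approximation to the binomial, so the endpoints may differ slightly in the last digit from those quoted above:

```python
import math

M, mu = 302.57, 0.331   # OBP stabilization point and league mean from the table
n, p_obs = 300, 0.380   # first-half plate appearances and observed OBP
n_new = 250             # expected second-half plate appearances

# Shrunk estimate of the player's true talent
p_shrunk = (n * p_obs + M * mu) / (n + M)

# 95% confidence interval for the true OBP
var_talent = p_obs * (1 - p_obs) / (n + M)
ci = (p_shrunk - 1.96 * math.sqrt(var_talent),
      p_shrunk + 1.96 * math.sqrt(var_talent))

# 95% prediction interval for OBP over the next n_new PA:
# uncertainty about true talent plus fresh binomial noise
var_pred = var_talent + p_obs * (1 - p_obs) / n_new
pi = (p_shrunk - 1.96 * math.sqrt(var_pred),
      p_shrunk + 1.96 * math.sqrt(var_pred))
```

Note how much wider the prediction interval is than the confidence interval: most of its width comes from the binomial noise in 250 fresh PA, not from uncertainty about the player's talent.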
As usual, all my data and code can be found on my github. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.
> stabilize(h$H + h$BB + h$HBP, h$PA, cutoff = 300, 1)
$Parameters
[1]   0.3306363 302.5670532

$Standard.Error
[1] 18.38593

$Model
[1] "Beta-Binomial"
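For readers curious what the fitting step is doing under the hood, here is a simplified Python sketch of the beta-binomial maximum likelihood fit - an illustration, not the R code on my github (it drops the binomial coefficient, which does not affect the maximizer):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def fit_stabilization(successes, trials):
    """Fit a beta-binomial model by maximum likelihood.

    Returns (mu, M): mu = a/(a+b) is the estimated league mean talent,
    and M = a+b is the stabilization point."""
    x = np.asarray(successes, dtype=float)
    n = np.asarray(trials, dtype=float)

    def neg_log_lik(theta):
        a, b = np.exp(theta)  # log-parameterization keeps a, b > 0
        # Beta-binomial log-likelihood, constant binomial coefficient dropped
        return -np.sum(betaln(a + x, b + n - x) - betaln(a, b))

    fit = minimize(neg_log_lik, x0=[0.0, 1.0], method="Nelder-Mead",
                   options={"maxiter": 2000, "xatol": 1e-8, "fatol": 1e-10})
    a, b = np.exp(fit.x)
    return a / (a + b), a + b
```

On data simulated from a beta-binomial with a known stabilization point, this recovers $\hat{\mu}$ and $\hat{M}$ reasonably well; the standard error of $\hat{M}$ comes from the observed information, which the full R function reports.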
The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula

$\dfrac{n}{n + \hat{M}} \pm 1.96\left(\dfrac{n}{(n + \hat{M})^2}\right)SE(\hat{M})$

which is obtained from applying the delta method to the function $p(\hat{M}) = n/(n + \hat{M})$. Note that the mean and prediction intervals I gave do not take $SE(\hat{M})$ into account (that is, they ignore the uncertainty surrounding the correct amount of shrinkage, which is what the confidence bounds above capture), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.
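The delta-method band at a single number of events $n$ is a one-liner; here is a small sketch using the OBP row of the table (the function name is my own for this example):

```python
def stabilization_fraction(n, M, se_M, z=1.96):
    """Delta-method confidence band for p(M) = n/(n + M).

    |dp/dM| = n/(n + M)^2, so SE(p) is approximately n/(n + M)^2 * SE(M)."""
    p = n / (n + M)
    se = n / (n + M) ** 2 * se_M
    return p - z * se, p + z * se

# OBP at n = 300 PA with M-hat = 302.57, SE(M-hat) = 18.39 from the table:
lo, hi = stabilization_fraction(300, 302.57, 18.39)  # band around p of about 0.498
```

Evaluating this over a grid of $n$ values and plotting the three curves reproduces the stabilization plot with 95% confidence bounds that the R function produces.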
As always, feel free to post any comments or suggestions.