27 August, 2015

WHIP Stabilization by the Gamma-Poisson Model

I've previously covered shrinkage estimation for offensive statistics - or at least, those that can be written as binomial events. In a previous post, I showed that for models that follow the natural exponential family with quadratic variance function, the split-half correlation is equal to one minus the shrinkage coefficient $B$.

The techniques I used can also be used when the outcome at the most basic level (at-bat, inning pitched, etc.) is not just a binary outcome. In particular, the Poisson distribution also fits within the framework I derived, as it is a member of the natural exponential family with quadratic variance function, and so events that can be modeled as Poisson at the base level will follow the same basic principles I used for the binomial outcomes. I chose the statistic WHIP (walks + hits per inning pitched) to illustrate this method, as it is a counting statistic that is a non-binary event (i.e., you can have 0, 1, 2, ... walks + hits in a given inning), so it fits the support of the Poisson.

Model and Estimation


I will assume that in each inning, pitcher $i$ gives up a number of walks + hits that follows a Poisson model with mean $\theta_i$, which is unique to each pitcher. The total number of walks + hits given up in $n_i$ innings is $x_i$, and I have $N$ pitchers total. I considered only starting pitchers from 2009-2014, and split years between the same pitcher. My code and data are on GitHub for anybody who wants to check my calculations.

Since the sum of independent Poissons is Poisson, the total number of walks + hits $x_i$ given up in $n_i$ innings follows a Poisson distribution with mean $n_i \theta_i$ and mass function

$p(x_i | \theta_i, n_i) =  \dfrac{e^{-n_i \theta_i} (n_i \theta_i)^{x_i}}{x_i !}$

I will also assume that the distribution of means $\theta_i$ follows a gamma distribution with mean $\mu$ and variance parameter $K$ (in this parametrization, $\mu = \alpha/\beta$ and $K = \beta$ as opposed to the traditional $\alpha, \beta$ parametrization). This distribution has density

$f(\theta_i | \mu, K) = \dfrac{K^{\mu K}}{\Gamma(\mu K)} \theta_i^{\mu K - 1} e^{-K \theta_i}$

As shown in a previous post, the split-half correlation is then one minus the shrinkage coefficient $B$, or

$\rho = 1 - B = \left(\dfrac{n_i}{n_i + K}\right)$

So once I have an estimate $\hat{K}$ and a desired stabilization level $p$, solving for $n$ gives

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{K}$
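With a hypothetical $\hat{K}$ (the value below is made up for illustration, not an estimate from the data), the conversion is a one-liner:

```python
# Convert an estimated variance parameter K into the sample size needed
# for a desired split-half correlation p, via n = (p / (1 - p)) * K.
def stabilization_n(p, K_hat):
    return p / (1 - p) * K_hat

K_hat = 80.0  # hypothetical estimate, for illustration only
for p in (0.5, 0.7, 0.9):
    print(p, stabilization_n(p, K_hat))
```

Note that at $p = 0.5$ this returns $\hat{K}$ itself, matching the interpretation of $K$ as the 0.5 stabilization point.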

Once again, the population variance parameter $K$ is equivalent to the 0.5 stabilization point - the point where the split-half correlation should be exactly equal to 0.5, and also the point where the individual pitcher estimates are shrunk 50% of the way towards the mean.

For estimation of $\mu$ and $K$, I used marginal maximum likelihood - a one-dimensional introduction to maximum likelihood is given here. The marginal density of $x_i$ given $\mu$ and $K$ is

$p(x_i | n_i, \mu, K) = \displaystyle \int_0^{\infty} \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} e^{-\theta_i (n_i + K)}  \theta_i^{x_i + \mu K - 1} d\theta_i = \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} \dfrac{\Gamma(x_i + \mu K)}{(n_i + K)^{x_i + \mu K}}$

And the log-likelihood (dropping terms that do not involve either $\mu$ or $K$) is given by

$\ell(\mu, K) =  N \mu K \log(K) - N \log(\Gamma(\mu K)) + \displaystyle \sum_{i = 1}^N \left[\log(\Gamma(x_i + \mu K)) - (x_i + \mu K) \log(n_i + K)\right]$

Once again, I wrote code to maximize this function in $R$ using a Newton-Raphson algorithm. For estimation I converted $K$ to $\phi = 1/(1 + K)$ in the equation above, then converted back by $K = (1-\phi)/\phi$ once estimation was complete - this makes the estimation procedure much more stable.
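As a rough illustration (not my R/Newton-Raphson code), the same marginal likelihood can be maximized with an off-the-shelf optimizer. The sketch below fits simulated data; `mu_true`, `K_true`, and the innings-pitched range are all made up for the example:

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

# Simulate pitchers from the gamma-Poisson model (illustrative parameters,
# not the real WHIP data): theta_i ~ Gamma(mean mu, variance parameter K),
# x_i ~ Poisson(n_i * theta_i).
rng = np.random.default_rng(42)
N, mu_true, K_true = 2000, 1.3, 75.0
n = rng.integers(80, 220, size=N)                     # innings pitched
theta = rng.gamma(mu_true * K_true, 1.0 / K_true, N)  # true WH/IP rates
x = rng.poisson(n * theta)                            # walks + hits allowed

def neg_loglik(params):
    """Negative marginal log-likelihood in the phi = 1/(1+K) parametrization."""
    mu, phi = params
    K = (1.0 - phi) / phi
    a = mu * K
    return -(N * a * np.log(K) - N * gammaln(a)
             + np.sum(gammaln(x + a) - (x + a) * np.log(n + K)))

res = minimize(neg_loglik, x0=[1.0, 0.02],
               bounds=[(1e-4, None), (1e-4, 1 - 1e-4)], method="L-BFGS-B")
mu_hat = res.x[0]
K_hat = (1 - res.x[1]) / res.x[1]   # back-transform phi to K
```

With a couple thousand simulated pitchers, the recovered $\hat{\mu}$ and $\hat{K}$ land near the generating values, which is a useful sanity check on the likelihood.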

In performing this estimation, I had to make a choice of the minimum number of innings pitched (IP) in order to be included in the dataset. When performing a similar analysis for on-base percentage, I found that at around 300 PA, the population variance (and hence, the stabilization point) became roughly constant. Unfortunately, this is not true for starting pitchers.



The population variance in talent levels decreases consistently as a function of the minimum number of IP considered, and so the stabilization point $K$ increases. This means that, unlike for OBP, the stabilization point is always determined by what percentage of pitchers you look at (by IP) - if you look at only the top 50%, the stabilization point will be larger than the stabilization point for the top 70%.

This is reflected in the plot below - as with OBP and PA, the mean WHIP is associated with the number of IP, but unlike with OBP, the variance around the mean changes constantly with the mean.



For my calculation, I chose to use 80 innings pitched as my cutoff point - corresponding to approximately 15 games started and capturing slightly more than 50% of pitchers (by IP). This point was completely arbitrary, though, and other cutoffs will be equally valid depending on the question at hand.

Performing the estimation, the estimated league mean WHIP was $\hat{\mu} = 1.304$ with variance parameter $\hat{K} = 77.203$.



Once again, 95% confidence intervals for a specific stabilization level $p$ are given as

$\left(\dfrac{p}{1-p}\right) \hat{K} \pm 1.96 \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{K})}$

From (delta-method transformed) maximum likelihood output, $Var(\hat{K}) = 29.791$ (for a standard error of $5.459$ IP). The stabilization curve, with confidence bounds, is then


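The curve in the plot can be reproduced directly from the reported estimates $\hat{K} = 77.203$ and $Var(\hat{K}) = 29.791$; the helper function is just for illustration:

```python
import math

# Stabilization point (in IP) with 95% bounds, using the estimates
# reported above: K_hat = 77.203, Var(K_hat) = 29.791.
K_hat, var_K = 77.203, 29.791
se_K = math.sqrt(var_K)   # ~5.459 IP

def interval(p):
    center = p / (1 - p) * K_hat
    half = 1.96 * p / (1 - p) * se_K
    return center - half, center, center + half

lo, mid, hi = interval(0.7)   # roughly 155 to 205 innings
```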
Aside from the model criticisms I've already mentioned, the standard ones apply - innings pitched are not identical and independent (and treating them as such is clearly a much worse approximation than treating plate appearances as identical and independent), pitchers are not machines, etc. I don't think the model is great, but it is useful - it gives confidence bounds for the stabilization point, something other methods don't do. As always, comments are appreciated.

19 August, 2015

Offensive Stabilization Points through Time

Using my maximum likelihood technique for estimating stabilization points, I performed a moving calculation of the stabilization point using data from 1900 - 2014 from fangraphs.com. Each stabilization point is a six-year calculation, including the current and five previous years (so, for example, 2014 includes 2009-2014 data, 1965 includes 1960-1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only batters from each year with at least 300 plate appearances, and splits years for the same player. Raw counts are used, not adjusted in any form. Pitchers are excluded. My data and code are posted on my github if you would like to run it for yourself.

The "stabilization point" I have defined as the point where the split-half correlation is equal to 0.5, which is equivalently where the shrinkage amount is 50%. Both of these are equal to the variance parameter $M$ in the beta-binomial model I fit, in which the distribution of events for player $i$ given a mean $\theta_i$ is binomial, and the underlying distribution of the $\theta_i$ follows a beta distribution with mean $\mu$ and variance parameter $M$.

Historical Plots


Trends can be seen clearly in plots. For example, here is a plot of the stabilization point for home run rate from 1900 - 2014.



The effect of the dead ball era is clearly evident. A large stabilization point indicates a small variance - and during the dead ball era, there was a small variance, because most players weren't hitting home runs! More recently, the stabilization point has risen to the highest level it's been since that era.

Note that the stabilization point should not be confused with the mean. In fact, here's a plot of the estimated league mean home run rate over the same period.


While going through peaks and valleys, the home run rate has risen fairly continuously over time - and the recent rise in home run stabilization point actually corresponds to a decrease in the mean home run rate (though interestingly, the decrease in league mean home run rate since the end of the steroid era still puts the current mean home run rate above any other preceding era).

To give another example, the stabilization point for triple rate is the lowest it's been since the dead ball era - even though the league mean triple rate has decreased fairly continuously over time.





Interestingly, the stabilization points for walk rate and on-base percentage are the highest they've ever been, with walk rate showing a noticeably sharp increase in recent years. One theory is that this is due to a "moneyball" effect of teams focusing much more strongly on walk rate than on other statistics - indeed, the stabilization point for batting average (shown later in the article) has dropped during the same period, perhaps indicating more tolerance for variation in batting average but less for variation in on-base percentage. (Of course, pitching has grown more dominant since the end of the steroid era, which is likely adding to the effect as well.)



Meanwhile, the stabilization points for double rate and extra base hit rate (2B + 3B) have increased over time.



The extra base hit rate stabilization point has decreased since the mid-2000s, however, while the double rate stabilization point has remained roughly the same.

The hit-by-pitch rate follows the same pattern as the triple rate - it increased after the dead ball era, peaking in the 1930s and 1950s, but has decreased since then, and despite a small recent increase, it is at its lowest stabilization point since that era.


Meanwhile, the strikeout rate stabilization point decreased fairly consistently over time before leveling off around the 1970s, with peaks in the 1980s and early 2000s.


What Drives the Stabilization Point?


As I've shown, the mean of the underlying distribution of talents does not seem to be strongly associated with the stabilization point - the variance of the underlying distribution of talent levels is the primary factor. There is an inverse relationship - a small stabilization point indicates that there is a large variance in talent levels for that particular statistic, and a large stabilization point indicates that there is a small variance in talent levels for that statistic.

To get a clearer view of the factors that are affecting the stabilization point, here's a plot of the stabilization point for batting average (using at-bats as the denominator) versus time.


Below is an animation showing the empirical distribution of batting average with the estimated underlying distribution of talent levels in dashed lines (since I'm estimating the distribution of true batting averages and not the distribution of observed batting averages, it's okay that the dashed line is narrower than the histogram). Notice that as time goes on, the distribution gets narrower (the variance is decreasing) - this is what's driving the increase in stabilization point over time.


The opposite effect can be seen in the single rate stabilization point - it has decreased (with peaks and valleys) over time


as the distribution of single rates has become more spread out.




Graphics of all the stabilization points, league mean talent levels, and animated estimated talent distributions can be found here.

Individual Years


I also selected a few years to compare individually - some for specific reasons, some just as a representative of a certain era.

  • 1910, in the middle of the dead ball era, and a year of particularly low offensive output. 
  • 1928, to represent the 1920s and the age of Babe Ruth.
  • 1937, to represent the 1930s. 
  • 1945, the end of the second world war. 
  • 1959, to represent the 1950s.
  • 1968, the year of the pitcher.
  • 1975, six years after they lowered the mound. 
  • 1987, before the steroid era and six years after the 1981 labor stoppage. 
  • 2001, the year Barry Bonds hit 73 home runs, in the middle of the steroid era. 
  • 2014, the modern era.

\begin{array}{| c | c | c | c | c | c | c | c | c | c | c |}\hline
\textrm{Year} & \textrm{1B} & \textrm{2B}& \textrm{3B}& \textrm{XBH}& \textrm{HR} &  \textrm{SO} & \textrm{BB} & \textrm{BA} & \textrm{OBP} & \textrm{HBP}  \\ \hline
1910 & 523.02  & 348.68 & 469.42  & 274.24 & 537.41 & 130.77 &   87.33 & 285.65 & 175.27 & 259.08 \\
1928 & 442.01 &  475.05 & 583.77 &  385.43 & 102.77  &  83.26 &  82.10 & 286.25 & 173.49 & 414.86 \\
1937 & 436.17 &  597.72 & 709.13 &  463.40 &  90.61 &  73.79 &  76.94 & 344.93 & 174.84 & 723.59\\
1945 &  456.71 &  710.05 & 543.09 &  474.47 &  98.81 &  67.99 &  79.83 & 424.54 & 180.14 & 699.53 \\
1959 & 351.40 & 1059.30 & 721.82 & 794.95 &  90.18 &  58.28 &  81.20 & 430.60 & 200.31 & 414.12\\
1968 & 333.52 &  867.61 & 691.18 &  700.24  & 94.46 &  55.18 &  93.24 & 476.94 & 265.04 & 448.36\\
1975 & 246.27 &  970.40 & 646.09 &  773.63 &  85.50 &  53.19 &  73.61 & 407.65 & 204.92 & 410.99 \\
1987 & 269.39 &  949.73 & 537.13 &  801.21 &  90.23 &  52.85 &  86.22 & 541.67 & 262.28 & 430.67 \\
2001 & 255.57 & 838.16 & 482.32  & 971.61 &  95.11 &  57.37 &  76.16  & 465.84 & 196.51 & 251.47\\
2014 & 222.16 & 1025.31 & 372.50 & 1006.30 & 124.52 &  49.73 & 105.59 & 465.92 & 295.79 & 297.41 \\  \hline
\end{array}


While generally following the fuller patterns shown in the plots, the effects of major baseball events such as the dead ball era, the second world war, the lowering of the mound, and the steroid era are evident.

Remember that a smaller stabilization point indicates a larger variance among talent levels - so looking at 1968 and 1975 to see the effect of lowering the mound, for example, the spread of single, triple, and home run rates increased while the spread of double and extra-base hit rates decreased (the extra base hit rate being largely driven by the double rate). Interestingly, the spread of strikeout rates remained roughly the same, but the spread of walk rates, hit by pitch rates, batting average, and on-base percentage all increased.

Overall, a fun way to look at how offensive statistics have changed over time. Let me know what you think in comments.

13 August, 2015

More Offensive Stabilization Points

Using the same method from my previous post, I calculated some stabilization points for more offensive statistics.

All of the code and data I used may be found on my github.

For each, I used a binomial distribution for the result of the at-bat/plate appearance and a beta distribution for the distribution of talent levels. I think a different model might work better for some of these statistics (and I'll have to work through the Poisson-gamma anyway when I look at pitching statistics), but this represents a decent approximation.

As I stated in a previous post, statistics that cannot be constructed as binomial events (such as wOBA) do not fall under the framework I am using, and so I have not included them in the estimation. I could treat them as binomial events, fit a model, and perform the estimation procedure, but I would have no idea whether the results were correct.

All estimated stabilization points were calculated using unadjusted data from fangraphs.com for hitters from 2009-2014 with at least 300 PA, excluding pitchers.

\begin{array}{| l | l | c | c | c |} \hline
\textrm{Statistic} & \textrm{Formula} & \hat{M} & SE(\hat{M}) & \textrm{95% CI} \\ \hline
\textrm{OBP} & \textrm{(H+BB+HBP)/PA} & 295.79 & 16.41 & (263.63, 327.95)  \\
\textrm{BA} & \textrm{H/AB} & 465.92 & 34.23 & (398.83, 533.02)  \\
\textrm{SO Rate} & \textrm{SO/PA} & 49.73 & 1.92 & (45.96, 53.50)  \\
\textrm{BB Rate} & \textrm{(BB-IBB)/(PA-IBB)} & 110.91 & 4.84 & (101.44, 120.38)  \\
\textrm{1B Rate} & \textrm{1B/PA} & 222.16 & 11.32 & (199.98, 244.34)  \\
\textrm{2B Rate} & \textrm{2B/PA} & 1025.31 & 108.00 & (813.64, 1236.98)  \\
\textrm{3B Rate} & \textrm{3B/PA} & 372.50 & 26.56 & (320.44, 424.56)  \\
\textrm{XBH Rate} & \textrm{(2B+3B)/PA} & 1006.30 & 105.23 & (800.04, 1212.57)  \\
\textrm{HR Rate} & \textrm{HR/PA} & 124.52 & 5.90 & (112.95, 136.09)  \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 297.41 & 18.26 & (261.61, 333.20)  \\ \hline
\end{array}

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of talent levels, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances at which there is 50% shrinkage.

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula - so PA for all except batting average and walk rate.

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

It represents a 95% confidence interval for the number of plate appearances at which there is 50% shrinkage.

For an arbitrary stabilization level $p$, the number of required plate appearances can be estimated as

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$

And a 95% confidence interval for the required number of plate appearances is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$
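For example, plugging a few rows of the table into these formulas (values copied from the table above; the helper function is just for illustration):

```python
# Convert the table's (M_hat, SE) estimates into the required PA (with a
# 95% CI) at a chosen stabilization level p.
stats = {
    "OBP":     (295.79, 16.41),
    "BA":      (465.92, 34.23),
    "SO Rate": (49.73, 1.92),
}

def required_n(p, M_hat, se):
    scale = p / (1 - p)
    return scale * M_hat, 1.96 * scale * se   # (point estimate, CI half-width)

for name, (M_hat, se) in stats.items():
    n_hat, half = required_n(0.7, M_hat, se)
    print(f"{name}: {n_hat:.0f} PA, 95% CI ({n_hat - half:.0f}, {n_hat + half:.0f})")
```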

Without confidence bounds, a plot of the sample size required for various stabilization levels is
 



And a plot of the stabilization level at various sample sizes is given as



This looks very similar to other plots I have seen.

Comments are appreciated. Also, I'm currently in the process of learning ggplot, so hopefully my graphics won't be as awful in the near future.

12 August, 2015

2015 Win Prediction Totals (Through July)

These are a bit late - it's August 12, but these intervals only include games through July 31.

These predictions are based on my own method (which can be improved). I set the nominal coverage at 95% (meaning that, the way I calculated them, the intervals should capture the true win total 95% of the time), and I think by this point in the season the actual coverage should be close to that.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c c} 
\textrm{Team}  & \textrm{Lower}  & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline

ATL & 64 & 72.27 & 81 & 72.09 \\
ARI & 73 & 81.38 & 90 & 83.36 \\
BAL & 74 & 82.90 & 92 & 86.17 \\
BOS & 64 & 72.12 & 81 & 72.94 \\
CHC & 76 & 85.07 & 94 & 81.23 \\
CHW & 68 & 76.53 & 85 & 73.10 \\
CIN & 66 & 74.63 & 84 & 76.06 \\
CLE & 68 & 76.90 & 86 & 78.07 \\
COL & 62 & 70.37 & 79 & 72.69 \\
DET & 70 & 78.54 & 87 & 78.37 \\
HOU & 82 & 90.46 & 99 & 90.71 \\
KCR & 85 & 94.02 & 103 & 89.14 \\
LAA & 78 & 87.28 & 96 & 87.11 \\
LAD & 82 & 90.37 & 99 & 88.88 \\
MIA & 61 & 70.09 & 79 & 77.17 \\
MIL & 63 & 71.01 & 80 & 75.43 \\
MIN & 74 & 82.43 & 91 & 79.51 \\
NYM & 74 & 82.33 & 91 & 80.50 \\
NYY & 82 & 90.45 & 99 & 87.62 \\
OAK & 67 & 75.23 & 84 & 84.46 \\
PHI & 54 & 62.61 & 71 & 63.10 \\
PIT & 83 & 92.27 & 101 & 87.16 \\
SDP & 69 & 77.14 & 86 & 74.48 \\
SEA & 65 & 73.45 & 82 & 73.91 \\
SFG & 79 & 88.31 & 97 & 87.21 \\
STL & 92 & 101.20 & 110 & 96.66 \\
TBR & 72 & 80.88 & 90 & 80.62 \\
TEX & 70 & 78.37 & 87 & 76.62 \\
TOR & 77 & 86.00 & 94 & 92.18 \\
WSN & 77 & 85.99 & 95 & 84.89 \\ \hline\end{array}
To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that the first 5 flips produced 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean," representing the expected number of wins based on true ability over the remaining schedule, added to the current number of wins (at the end of July).
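In code, the coin-flip example works out as:

```python
# "True Win Total": expected wins over the full schedule from true ability.
def true_total(p_true, n_games):
    return p_true * n_games

# "Mean": wins already banked plus expected wins over the remaining games.
def mean_projection(current_wins, p_true, games_remaining):
    return current_wins + p_true * games_remaining

print(true_total(0.5, 10))         # 5.0  (expected heads in 10 fair flips)
print(mean_projection(4, 0.5, 5))  # 6.5  (4 heads so far + 0.5 * 5 remaining)
```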

As a bonus, these quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky.

05 August, 2015

Estimating Theoretical Stabilization Points

Edit 19 March 2020: This post has been adapted into a paper in the Journal of Mathematical Psychology. In the process of writing the paper, a number of  mistakes, omissions, or misstatements were found in this post. It is being left up as it was originally written, just in case anybody is interested. For a more correct version, please refer to the journal article.

The technique commonly used to assess stabilization points of statistics is called split-half correlation. This post will show that within a fairly general modeling framework, the split-half correlation is a function of two things: the sample size and a variance parameter of the distribution of talent levels.  It's therefore possible to skip the correlation step entirely and use statistical techniques to estimate the variance parameter of the talent distribution directly, and then use that to estimate the sample size required (with confidence bounds) for a specific stabilization level.

Theoretical Split-Half Correlation


(Note: This first part is very theoretical - it's the part that shows the statistical link between shrinkage and split-half correlation for a certain family of distributions. If you just want to trust me or your own experience that it exists, you can skip this and go straight to the "Estimation" section without missing too much.)

I'm going to work within my theoretical framework where the data follows a natural exponential family with a quadratic variance function (NEFQVF) - so this will work for normal-normal, beta-binomial, Poisson-gamma, and a few other models.

$X_i \sim p(x_i | \theta_i)$
$\theta_i \sim G(\theta_i| \mu, \eta)$
 
Split-half reliability takes two samples that are presumed to be measuring the same thing (I'll call these samples $X_i$ and $Y_i$) and calculates the correlation between them - if they actually are measuring the same thing, then the correlation should be high.

In baseball, it's commonly used as a function of the sample size $n$ to assess when a stat "stabilizes" - that is, if I take two samples of size $n$ from the same model (be it player at-bats, batters faced, etc.) and calculate the correlation of a statistic between the samples, then once the correlation exceeds a certain value, the statistic is considered to have "stabilized."

Let's say that $\bar{X_i}$ is the normalized statistic of $n$ observations (on-base percentage, for example) from the first "half" of the data (though it does not need to be chronological) and $\bar{Y_i}$ is the normalized statistic of $n$ observations from the second half. In baseball terms, $\bar{X_i}$ might be the OBP from the first sample and $\bar{Y_i}$ the OBP from the second sample.

I want to find the correlation coefficient $\rho$, which is defined as

$\rho = \dfrac{Cov(\bar{X_i}, \bar{Y_i})}{\sqrt{Var(\bar{X_i})Var(\bar{Y_i})}}$

First, the numerator. The law of total covariance states that

$Cov(\bar{X_i}, \bar{Y_i}) = E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] + Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i])$

Given the same mean $\theta_i$, $\bar{X_i}$ and $\bar{Y_i}$ are assumed to be independent - hence,

 $E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] = E[0] = 0$.

Functionally, this is saying that for a given player, the first half of the performance data is independent of the second half.

Since $E[\bar{X_i} | \theta_i] = E[\bar{Y_i} | \theta_i] = \theta_i$ for the framework I'm working in, the second part becomes

 $Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i]) = Cov(\theta_i, \theta_i) = Var(\theta_i)$

Thus, $Cov(\bar{X_i}, \bar{Y_i})$ is equal to the "between player" variance - that is, the variance among league talent levels.

Now, the denominator. As I've used before, the law of total variance states that

$Var(\bar{X_i}) = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i}| \theta_i])$ 

Since $\bar{X_i}$ and $\bar{Y_i}$ have the same distributional form, they will have the same variance.

$Var(\bar{X_i}) = Var(\bar{Y_i})  = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i]) =  \dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)$

Where $E[V(\theta_i)]$ is the average variance of performance at the level of the plate appearance, inning pitched, etc. Hence, the split-half correlation between them will be

$\rho = \dfrac{Var(E[\bar{X_i }| \theta_i])}{E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i])} = \dfrac{Var(\theta_i)}{\dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)} = 1 - B $

Where $B$ is the shrinkage coefficient. The important theoretical result, then, is that for NEFQVF distributions, the split-half correlation is equal to one minus the shrinkage coefficient. Since we know the form of the shrinkage coefficient, we can estimate what the split-half correlation will be for different values of $n$.
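A quick simulation check of this identity for the beta-binomial case (the parameters below are made up, not estimates from the data): draw true rates $\theta_i$, take two independent "halves" of $n$ trials each, and compare the empirical split-half correlation to $n/(M+n)$.

```python
import numpy as np

# Simulate split halves from a beta-binomial model and verify that the
# correlation between halves is close to the theoretical n / (M + n).
rng = np.random.default_rng(0)
N, n, mu, M = 50_000, 300, 0.33, 300.0

theta = rng.beta(mu * M, (1 - mu) * M, size=N)  # true talent levels
xbar = rng.binomial(n, theta) / n               # first-half rates
ybar = rng.binomial(n, theta) / n               # second-half rates

rho_sim = np.corrcoef(xbar, ybar)[0, 1]
rho_theory = n / (M + n)                        # 0.5 with these parameters
```

With $n = M = 300$ the theoretical correlation is exactly 0.5, and the simulated correlation should land within sampling error of it.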

I'm not breaking any new ground here in terms of what people are doing in practice, but there does exist a theoretical justification linking split-half correlation to this particular form of shrinkage estimation.

Estimation


Using fangraphs.com, I collected data from all MLB batters (excluding pitchers) who had at least 300 PA from 2009 to 2014 (a somewhat arbitrary choice on my part - further discussion in the model criticisms section). I'm considering players to be different across years - so, for example, 2009-2014 Miguel Cabrera is six different players. I'll define $x_i$ as the number of on-base events for player $i$ in $n_i$ plate appearances. I have $N$ of these players.

All my data and code is posted on my github if you would like to independently verify my calculations (and I'm learning to use github while I do this, so apologies if the formatting is completely wrong).

The number of on-base events $x_i$ follows a beta-binomial model - this fits into the NEFQVF family.

$x_i \sim Binomial(\theta_i, n_i)$
$\theta_i \sim Beta(\mu, M)$

Here I am using $\mu = \alpha/(\alpha + \beta)$ and $M = \alpha + \beta$ as opposed to the traditional $\alpha, \beta$ notation for a beta distribution. The true league mean OBP is $\mu$ and $M$ controls the variance of true OBP values among players.

Let's say I want to know at what sample size $n$ the split-half correlation will be at a certain value $p$. For a beta-binomial model, the split-half correlation is

$\rho = 1 - B = 1 - \dfrac{M}{M + n} = \dfrac{n}{M + n}$

Where $B$ is the shrinkage coefficient. So if we desire a stabilization level $p$, it is given by solving

$p = \dfrac{n}{M + n}$

For $n$. The solution is

$n= \left(\dfrac{p}{1-p}\right) M$

Given an estimate of the talent variance parameter $\hat{M}$, the estimated $n$ is

$\hat{n} = \left(\dfrac{p}{1-p}\right)\hat{M}$

As a side note, at $n = M$ the split-half correlation and shrinkage amount are both 0.5.

For estimation of $M$, I'm going to use marginal maximum likelihood. For a one-dimensional introduction to maximum likelihood, see my post on maximum likelihood estimation for batting averages.

The marginal distribution of on-base events $x_i$ given $\mu$ and $M$ is a beta-binomial distribution, with mass function given by

$ p(x_i | \mu, M) = \displaystyle \int_0^1 {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu M -1}(1-\theta_i)^{n_i - x_i + (1-\mu)M-1}}{\beta(\mu M, (1-\mu)M)} d\theta_i = {n_i \choose x_i} \dfrac{\beta(x_i + \mu M, n_i - x_i + (1-\mu) M)}{\beta(\mu M, (1-\mu)M)}$

This distribution represents the probability of $x_i$ on-base events in $n_i$ plate appearances given league mean OBP $\mu$ and variance parameter $M$, bypassing the choice of player altogether. Maximum likelihood says to choose the values of $\mu$ and $M$ that maximize the joint probability of the observed OBP values.

For a sample of size $N$ players, each with $x_i$ on-base events in $n_i$ plate appearances, the log-likelihood is given as

$\ell(\mu, M) = \displaystyle \left[ \sum_{i =1}^N \log(\beta(x_i + \mu M, n_i - x_i + (1-\mu) M))\right]  - N \log(\beta(\mu M, (1-\mu)M))$

This must be maximized numerically using computer software - I wrote a program using the Newton-Raphson algorithm to do this in $R$, which is posted on my github. For estimation, I actually converted $M$ to $\phi = 1/(1+M)$ in the above equation, performed the maximization, and then converted my estimate back to the original scale with $\hat{M} = (1-\hat{\phi})/\hat{\phi}$. The technical details of why I did this are a bit too much to go into here, but it makes the estimation procedure much more stable - I'll be happy to discuss this with anybody who wants to know.
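For readers who prefer a quick sketch to the full R code, the same fit can be written with scipy's generic optimizer (simulated data here, not the real OBP dataset; the generating parameters and starting values are arbitrary):

```python
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

# Simulate hitters from the beta-binomial model (illustrative parameters):
# theta_i ~ Beta(mean mu, variance parameter M), x_i ~ Binomial(n_i, theta_i).
rng = np.random.default_rng(1)
N, mu_true, M_true = 2000, 0.33, 300.0
n = rng.integers(300, 700, size=N)      # plate appearances per player
theta = rng.beta(mu_true * M_true, (1 - mu_true) * M_true, size=N)
x = rng.binomial(n, theta)              # on-base events

def neg_loglik(params):
    """Negative marginal log-likelihood in the phi = 1/(1+M) parametrization."""
    mu, phi = params
    M = (1.0 - phi) / phi
    return -(np.sum(betaln(x + mu * M, n - x + (1 - mu) * M))
             - N * betaln(mu * M, (1 - mu) * M))

res = minimize(neg_loglik, x0=[0.3, 0.005],
               bounds=[(0.01, 0.99), (1e-4, 0.5)], method="L-BFGS-B")
mu_hat = res.x[0]
M_hat = (1 - res.x[1]) / res.x[1]   # back-transform phi to M
```

On simulated data like this, the recovered $\hat{\mu}$ and $\hat{M}$ should land near the generating values, which is a convenient check that the likelihood is coded correctly.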

The maximum likelihood estimates are given by $\hat{\mu} = 0.332$ and $\hat{M} = 295.7912$.




Above is the distribution of observed OBP values with the estimated distribution of true OBP levels overlaid as dashed lines.

This means that for a split-half correlation of $p = 0.7$, the estimate of sample size required is

$\hat{n} = \left(\dfrac{0.7}{1-0.7}\right)295.7912 = 690.1794$

Which we could round to $\hat{n} = 690$ plate appearances. Since $\hat{M}$ is the maximum likelihood estimator, $\hat{n}$ is the maximum likelihood estimate of the point at which split-half correlation is 0.7 by invariance of the maximum likelihood estimator.

Furthermore, if we have an estimate of the variance $Var(\hat{M})$, the variance of the estimated sample size for p-stabilization is given as

$Var(\hat{n}) = Var\left( \left(\dfrac{p}{1-p}\right)\hat{M}\right) = \left(\dfrac{p}{1-p}\right)^2 Var(\hat{M})$

So a $(1-\alpha)\times 100\%$ confidence interval for $n$ is given as

$\left(\dfrac{p}{1-p}\right)\hat{M} \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{M})}$


The output from the maximum likelihood estimation can be used to estimate $Var(\hat{M})$. Since I estimated $\hat{\phi}$, I had to get $Var(\hat{\phi})$ from the output of the computer program I used and then apply the delta method to convert it back to the scale of $M$. Doing that, I got $Var(\hat{M}) = 269.1678$. This gives a 95% confidence interval for the 0.7-stabilization point as

$690.1794 \pm 1.96 \left(\dfrac{0.7}{1-0.7}\right) \sqrt{269.1678} = (615.1478, 765.211)$

Or between approximately 615 and 765 plate appearances. A 95% confidence interval for the 0.5-stabilization point (which is just $\hat{M}$) is between approximately 264 and 328 plate appearances.
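For reference, the delta-method step used here is the standard first-order approximation: with the reparametrization $M = (1-\phi)/\phi$,

```latex
% Delta method: since dM/d\phi = -1/\phi^2,
Var(\hat{M}) \approx \left(\dfrac{dM}{d\phi}\bigg|_{\hat{\phi}}\right)^2 Var(\hat{\phi})
             = \dfrac{Var(\hat{\phi})}{\hat{\phi}^4}
```

so the variance on the $\phi$ scale is scaled by $1/\hat{\phi}^4$ to return to the scale of $M$.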

For an arbitrary p-stabilization point sample size, the confidence interval formula is

$\left(\dfrac{p}{1-p}\right)295.7912 \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{269.1678}$

Below is a graph of the required sample size for a stabilization level of $p$ between 0.5 and 0.8 - the dashed lines are 95% confidence bounds.





As you can see, there are diminishing returns - to stabilize more, you need an increasingly larger sample size.

Model Criticisms


Basic criticisms about modeling baseball players apply: players are not machines, plate appearances are not independent and identical, etc. These criticisms apply to just about any model of baseball data, including split-half correlations.

I have not adjusted the data in any way - I simply took the raw number of on-base events and plate appearances. Estimation could likely be improved by adjusting the data for various effects before running it through the model.

One thing should be obvious: this is a parametric estimator, dependent on the model I chose. If the model does not fit well, then the estimate will be bad. I stuck to OBP for this discussion because it seems to fit the beta-binomial model well. No model is correct, of course, but I do believe the beta-binomial model is close enough to be useful. I simulated data from a beta-binomial model using my estimated parameters and the fixed numbers of plate appearances, and both visually and in some basic summary statistics the simulated data looked close to the actual data - not identical (the real data skews slightly right in comparison), but being identical isn't a realistic goal anyway. Other statistics could require the use of a different distribution.

As I mentioned, the cutoff of 300 PA or greater was somewhat arbitrary - I will fully admit that it's because I don't have a clearly defined population of players in mind. I know pitchers shouldn't be included, and I know that someone who got 10 PA and then was sent down shouldn't be included in the model, but I'm not sure what the correct cutoff for PA should be to get at this vague idea of "MLB-level" hitters I have. That's a problem with this analysis, but one that is easy to correct with the right information.

There's a bias/variance trade-off at play here - if I set the cutoff too low then I'm going to get too many players included that aren't from the population I want included in the sample, but the more players I feed into the model the smaller my variance of estimation is. Below is a plot of $\hat{M}$ with 95% confidence bounds for cutoff points from 50 PA to 600 PA.




Around 300 PA seems to be the cutoff value that leads to a roughly stable $M$ estimate that doesn't veer off into being erratic from lack of information, and it seems to approximately conform with what I know about how many plate appearances a semi-regular MLB player should get.

Lower cutoff points tend to lead to lower stabilization points: they include hitters with smaller true OBPs, which decreases the league average OBP (and the average amount of variance around OBP values) but increases the variance among true OBP values - the effect of which is to estimate $M$ smaller.

The bigger problem I have is that one of the assumptions is the number of plate appearances is independent of the observed on-base percentage - that if one player gets 500 PA and another player gets 700 PA, it tells us nothing about either of the players' true OBP values - they just happened to get 500 and 700 PA, respectively. Of course, we know this isn't true - hitters get more plate appearances specifically because they get on base more.



Since the model works by assuming that players with more plate appearances have less variation around their true OBP values, it will make the estimate of the league mean OBP higher - which will affect $M$.

Even with all of its problems, I think this estimation method is useful. I don't think I've done anything that other people aren't already doing, but I just wanted to work through it in a way that makes sense to me. Comments are appreciated.
