Edit 19 March 2020: This post has been adapted into a paper in the Journal of Mathematical Psychology. In the process of writing the paper, a number of mistakes,
omissions, or misstatements were found in this post. It is being left
up as it was originally written, just in case anybody is interested. For a more correct version,
please refer to the journal article.
The technique commonly used to assess stabilization points of statistics is called split-half correlation. This post will show that within a fairly general modeling framework, the split-half correlation is a function of two things: the sample size and a variance parameter of the distribution of talent levels. It's therefore possible to skip the correlation step entirely and use statistical techniques to estimate the variance parameter of the talent distribution directly, and then use that to estimate the sample size required (with confidence bounds) for a specific stabilization level.
Theoretical Split-Half Correlation
(Note: This first part is very theoretical - it's the part that shows the statistical link between shrinkage and split-half correlation for a certain family of distributions. If you just want to trust me or your own experience that it exists, you can skip this and go straight to the "Estimation" section without missing too much.)
I'm going to work within
my theoretical framework where the data follows a natural exponential family with a quadratic variance function (NEFQVF) - so this will work for normal-normal, beta-binomial, Poisson-gamma, and a few other models.
$X_i \sim p(x_i | \theta_i)$
$\theta_i \sim G(\theta_i| \mu, \eta)$
Split-half reliability takes two samples that are presumed to be measuring the same thing (I'll call these samples $X_i$ and $Y_i$) and calculates the correlation between them -if they actually
are measuring the same thing, then the correlation should be high.
In baseball, it's commonly used as a function of the sample size $n$ to assess when a stat "stabilizes" - that is, if I take two samples of size $n$ from the same model (be it player at-bats, batters faced, etc.) and calculate the correlation of a statistic between the samples, then once the correlation exceeds a certain value, the statistic is considered to have "stabilized."
Let's say that $\bar{X_i}$ is normalized statistic of $n$ observations (on-base percentage, for example) from the first "half" of the data (though it does
not need to be chronological) and $\bar{Y_i}$ is the normalized statistic of $n$ observations from the second half of the data. In baseball terms, $\bar{X_i}$ might be something like the OBP from the first sample and $\bar{Y_i}$ the OBP from the second sample.
I want to find the correlation coefficient $\rho$, which is defined as
$\rho = \dfrac{Cov(\bar{X_i}, \bar{Y_i})}{\sqrt{Var(\bar{X_i})Var(\bar{Y_i})}}$
First, the numerator. The law of total covariance states that
$Cov(\bar{X_i}, \bar{Y_i}) = E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] + Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i])$
Given the same mean $\theta_i$, $\bar{X_i}$ and $\bar{Y_i}$ are assumed to be independent - hence,
$E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] = E[0] = 0$.
Functionally, this is saying that for a given player, the first set of performance data is independent of the second half of performance data.
Since $E[\bar{X_i} | \theta_i] = E[\bar{Y_i} | \theta_i] = \theta_i$ for the framework I'm working in, the second part becomes
$Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i]) = Cov(\theta_i, \theta_i) = Var(\theta_i)$
Thus, $Cov(\bar{X_i}, \bar{Y_i})$ is equal to the "between player" variance - that is, the variance among league talent levels.
Now, the denominator. As I've used before, the law of total variance states that
$Var(\bar{X_i}) = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i}| \theta_i])$
Since $\bar{X_i}$ and $\bar{Y_i}$ have the same distributional form, they will have the same variance.
$Var(\bar{X_i}) = Var(\bar{Y_i}) = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i]) = \dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)$
Where $E[V(\theta_i)]$ is the average variance of performance at the level of plate appearance, inning pitched, etc. Hence, the split correlation between them will be
$\rho = \dfrac{Var(E[\bar{X_i }| \theta_i])}{E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i])} = \dfrac{Var(\theta_i)}{\dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)} = 1 - B $
Where $B$ is the shrinkage coefficient. The important theoretical result, then, is that for NEFQVF distributions, the split-half correlation is equal to one minus the shrinkage coefficient. Since we know form of the shrinkage coefficient, we can estimate what the split-half correlation will be for different values of $n$.
I'm not breaking any new ground here in terms of what people are doing in practice, but there does exist a theoretical justification linking split-half correlation and this particular formula method for shrinkage estimation.
Estimation
Using
fangraphs.com, I collected data from all MLB batters (excluding pitchers) who had at least 300 PA (which is a somewhat arbitrary choice on my part - further discussion on this choice in the model criticisms section) from 2009 to 2014. I'm considering players to be different across years - so for example, 2009-2014 Miguel Cabrera is six different players. I'll define $x_i$ as the number of on-base events for player $i$ in $n_i$ plate appearances. I have $N$ of these players.
All my data and code
is posted on my github if you would like to independently verify my calculations (and I'm learning to use github while I do this, so apologies if the formatting is completely wrong).
The number of on-base events $x_i$ follows a beta-binomial model - this fits into the NEFQVF family.
$x_i \sim Binomial(\theta_i, n)$
$\theta_i \sim Beta(\mu, M)$
Here I am using $\mu = \alpha/(\alpha + \beta)$ and $M = \alpha + \beta$ as opposed to the traditional $\alpha, \beta$ notation for a beta distribution. The true league mean OBP is $\mu$ and $M$ controls the variance of true OBP values among players.
Let's say I want to know at what sample size $n$ the split-half
correlation will be at a certain value p. For a beta-binomial mode, the split-half
correlation is
$\rho = 1 - B = 1 - \dfrac{M}{M + n} = \dfrac{n}{M + n}$
Where $B$ is the shrinkage coefficient. So if we desire a stabilization level p, it is given by solving
$p = \dfrac{n}{M + n}$
For $n$. The solution is
$n= \left(\dfrac{p}{1-p}\right) M$
Given an estimate of the talent variance parameter $\hat{M}$, the estimated $n$ is
$\hat{n} = \left(\dfrac{p}{1-p}\right)\hat{M}$
As a side note, at $n = M$ the split-half correlation and shrinkage amount are both 0.5.
For estimation of $M$, I'm going to use marginal maximum likelihood. For a one-dimensional introduction to maximum likelihood,
see my post on maximum likelihood estimation for batting averages.
The marginal distribution of on-base events $x_i$ given $\mu$ and $M$ is a beta-binomial distribution, with mass function given by
$
p(x_i | \mu, M) = \displaystyle \int_0^1 {n_i \choose x_i}
\dfrac{\theta_i^{x_i + \mu M -1}(1-\theta_i)^{n_i - x_i +
(1-\mu)M-1}}{\beta(\mu M, (1-\mu)M)} d\theta_i = {n_i \choose x_i}
\dfrac{\beta(x_i + \mu M, n_i - x_i + (1-\mu) M)}{\beta(\mu M,
(1-\mu)M)}$
This distribution represents the probability of $x_i$ on-base events in $n_i$ plate appearances given league mean OBP $\mu$ and variance parameter $M$, bypassing the choice of player altogether. The maximum likelihood estimate says to choose the values of $\mu$ and $M$ that maximize the joint probability the observed OBP values.
For a sample of size $N$
players, each with $x_i$ on-base events in $n_i$ plate appearances, the
log-likelihood is given as
$\ell(\mu,
M) = \displaystyle \left[ \sum_{i =1}^N \log(\beta(x_i + \mu M, n_i -
x_i + (1-\mu) M))\right] - N \log(\beta(\mu M, (1-\mu)M))$
This must be maximized numerically using computer software - I wrote a program using the Newton-Raphson algorithm to do this in $R$, which is posted on my github. For estimation, I actually converted $M$ to $\phi = 1/(1+M)$ in the above equation, performed the maximization, and then converted my estimate back to the original scale with $\hat{M} = (1-\hat{\phi})/\hat{\phi}$. The technical details of why I did this are a bit too much here, but it makes the estimation procedure much more stable - I'll be happy to discuss this with anybody who wants to know.
The maximum likelihood estimates are given by $\mu = 0.332$ and $M = 295.7912$.
Above is the distribution of observed OBP values with the estimated distribution of true OBP levels overlaid as dashed lines.
This means that for a split-half correlation of $p = 0.7$, the estimate of sample size required is
$\hat{n} = \left(\dfrac{0.7}{1-0.7}\right)295.7912 = 690.1794$
Which we could round to $\hat{n} = 690$ plate appearances. Since
$\hat{M}$ is the maximum likelihood estimator, $\hat{n}$ is the
maximum likelihood estimate of the point at which split-half correlation is 0.7 by invariance
of the maximum likelihood estimator.
Furthermore, if we have a variance $Var(\hat{M})$, the variance of the estimated sample size for p-stabilization is given as
$Var(\hat{n_i}) = Var\left( \left(\dfrac{p}{1-p}\right)\hat{M}\right) = \left(\dfrac{p}{1-p}\right)^2 Var(\hat{M})$
So a $(1-\alpha)\times 100\%$ confidence interval for $n$ is given as
$\left(\dfrac{p}{1-p}\right)\hat{M} \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{M})}$
The output from the maximum likelihood estimation can be used to estimate $Var(\hat{M})$. Since I estimated $\hat{\phi}$, I had to get $Var(\hat{\phi})$ from output of the computer program I used and then use
the delta method to convert it back to the scale of $M$. Doing that, I got $Var(\hat{M}) = 269.1678$. This gives a 95% confidence interval for the 0.7-stabilization point as
$690.1794 \pm 1.96 \left(\dfrac{0.7}{1-0.7}\right) \sqrt{269.1678} = (615.1478, 765.211)$
Or between approximately 615 and 765 plate appearances. A 95% confidence interval for the 0.5-stabilization point (which is just $\hat{M}$) is between approximately 264 and 328 plate appearances.
For an arbitrary p-stabilization point sample size, the confidence interval formula is
$\left(\dfrac{p}{1-p}\right)295.7912 \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{269.1678}$
Below is a graph of the required sample size for for a stabilization level of p between 0.5 and 0.8 - the dashed lines are 95% confidence bounds.
As you can see, there are diminishing returns - to stabilize more, you need an increasingly larger sample size.
Model Criticisms
Basic criticisms about modeling baseball players apply: players are not machines, plate appearances are not independent and identical, etc. These criticisms will apply to just about model of baseball data, including split-half correlations.
I have not adjusted the data in any way - I simply took the raw number of on-base events and plate appearances. Estimation could likely be improved by adjusting the data for various effects before running it through the model.
One thing should be obvious: this is a parametric estimator, dependent on the model I chose. If the model I chose does not fit well, then the estimate will be bad. I stuck to OBP for this discussion because it seems to fit the beta-binomial model well. No model is correct, of course, but I do believe the beta-binomial model is close enough to be useful. I simulated data from a beta-binomial model using my estimated parameters and the fixed number of plate appearances, and both visually and with some basic summary statistics the simulated data looked close to the actual data. Not identical - and the real data appears to skew slightly right in comparison to the simulated data, and being identical isn't a realistic goal anyway - but close. Other statistics could require a the use of a different distribution.
As I mentioned, the cutoff of 300 PA or greater was somewhat arbitrary - I will fully admit that it's because I don't have a clearly defined population of players in mind. I know pitchers shouldn't be included, and I know that someone who got 10 PA and then was sent down shouldn't be included in the model, but I'm not sure what the correct cutoff for PA should be to get at this vague idea of "MLB-level" hitters I have. That's a problem with this analysis, but one that is easy to correct with the right information.
There's a bias/variance trade-off at play here - if I set the cutoff too low then I'm going to get too many players included that aren't from the population I want included in the sample, but the more players I feed into the model the smaller my variance of estimation is. Below is a plot of $\hat{M}$ with 95% confidence bounds for cutoff points from 50 PA to 600 PA.
Around 300 PA seems to be the cutoff value that leads to a roughly stable $M$ estimate that doesn't veer off into being erratic from the lack of information, and seems to approximately conform with what I know about how many plate appearances semi-regular MLB player should get.
Lower cutoff points tend to lead to lower stabilization points, as it will include hitters with smaller true OBPs, decreasing both the league average OBP (and also the average amount of variance around OBP values) and variance among true OBP values - the effect of which is to estimate $M$ smaller.
The bigger problem I have is that one of the assumptions is the number of plate appearances is independent of the observed on-base percentage - that if one player gets 500 PA and another player gets 700 PA, it tells us nothing about either of the players' true OBP values - they just happened to get 500 and 700 PA, respectively. Of course, we know this isn't true - hitters get more plate appearances specifically
because they get on base more.
Since the model works by assuming that players with more plate appearances have less variation around their true OBP values, it will make the estimate of the league mean OBP higher - which will affect $M$.
Even with all of its problems, I think this estimation method is useful. I don't think I've done anything that other people aren't already doing, but I just wanted to work through it in a way that makes sense to me. Comments are appreciated.
.