23 October, 2015

From Stabilization to Interval Estimation



In this post, I'm going to show how to use league means and stabilization points to construct mean interval and prediction interval estimates for some basic counting statistics. I'll focus on two specific models: the normal-normal and the beta-binomial model.

At a few points during this post I'm going to mention some empirical results. You can find the data I used and the code I ran on my github.

Distributional Assumptions


I'm assuming the statistic in question is a binomial outcome - this covers many basic counting statistics (batting average, on-base percentage, batting average on balls in play etc.) but not rate statistics, or more complicated statistics such as wOBA.

Assume that in $n_i$ trials, a player accrues $x_i$ events (hits, on-base events, etc.). I'm going to assume that trials are independent and identically distributed, each with probability of success $\theta_i$.

For the distribution of the $x_i$, I'm going to work out the math for two specific distributions - the normal and the binomial. I'm also going to assume that the distribution of the $\theta_i$ follows the respective conjugate distribution - the normal for the normal model, and the beta for the binomial model. This prior has mean talent level $\mu$ and stabilization point $M$.

$x_i \sim p(x_i | \theta_i, n_i)$
$\theta_i \sim G(\theta_i | \mu, M)$

For the stabilization point $M$, I'm going to assume this is the number of events at which $r = 0.5$. If you instead choose the point at which $r = 0.7$, then these formulas won't work.

For several of the mathematical results here, I'm going to refer back to my article shrinkage estimators for counting statistics - particularly, the examples section at the end - without offering proofs or algebraic derivations.

The posterior distribution for $\theta_i$ is given by

$\displaystyle p(\theta_i | x_i, n_i, \mu, M) = \dfrac{p(x_i | \theta_i, n_i)G(\theta_i | \mu, M)}{\int p(x_i | \theta_i, n_i)G(\theta_i | \mu, M) \, d\theta_i}$

Intervals will be constructed by taking quantiles from this distribution.

(For rate statistics instead of count statistics, the gamma-Poisson model can be used - though that will take more math to figure out the correct forms of the intervals. I got about three-quarters of the way there in my article WHIP stabilization by the gamma-Poisson model if somebody else wants to work through the rest. For more complicated statistics such as wOBA, I'm going to have to work through some hard math.)

Mean Intervals

 

Normal-Normal Model


For the normal-normal model, suppose that both the counts and the distribution of talent levels follow a normal distribution. Then the observed proportion $x_i / n_i$ also follows a normal distribution.

$\dfrac{x_i}{n_i} \sim N\left(\theta_i, \dfrac{\sigma^2}{n_i}\right)$

$\theta_i \sim N(\mu, \tau^2)$

Furthermore, a normal approximation to the binomial is used to estimate $\sigma^2$ as $\sigma^2 = \mu (1-\mu)$. The usual normal approximation to the binomial takes $\theta_i (1-\theta_i)$ as the variance; however, the normal-normal model assumes that variance $\sigma^2$ is constant around every single $\theta_i$ - so an estimate for that is the average amount of variance over all of them, $\mu(1-\mu)$.

As a side note, the relationship between $M$ and $\tau^2$ is given by

$M = \dfrac{\sigma^2}{\tau^2} \approx \dfrac{\mu(1-\mu)}{\tau^2}$
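For a quick sense of scale, plugging in the OBP values used later in this post ($\mu = 0.33$, $M = 296$) gives $\tau^2 \approx 0.33(0.67)/296 \approx 0.00075$, so the spread of OBP talent levels is roughly $\tau \approx 0.027$.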

As I showed in my article shrinkage estimators for counting statistics, for the normal-normal model the shrinkage coefficient is given as

$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i} = \dfrac{ M }{ M + n_i }$

The resulting posterior is then

$\theta_i | x_i, n_i, \mu, M \sim N\left(  (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, (1-B) \left(\dfrac{\sigma^2}{n_i}\right)\right)$

And substituting in the values of $B$, the variance of the posterior is given as

$\left(1 - \dfrac{M}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \left(\dfrac{n_i}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \dfrac{\mu(1-\mu)}{M+n_i}$

And a 95% interval estimate for $\theta_i$ is

$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{M+n_i}}$
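As a concrete illustration, here's a minimal R sketch of this interval (this isn't the exact code from my github; the example player is made up, and the OBP values of $\mu = 0.33$ and $M = 296$ are just the ones from the table further down):

nn_mean_interval <- function(x, n, mu, M, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)        # 1.96 for a 95% interval
  B <- M / (M + n)                       # shrinkage coefficient
  center <- (1 - B) * (x / n) + B * mu   # shrunk estimate of theta_i
  se <- sqrt(mu * (1 - mu) / (M + n))    # posterior standard deviation
  c(lower = center - z * se, upper = center + z * se)
}

# Example: 50 on-base events in 130 PA, using the OBP values from below
nn_mean_interval(x = 50, n = 130, mu = 0.33, M = 296)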


Beta-Binomial Model



For the beta-binomial model, suppose that the count of events $x_i$ in $n_i$ trials follows a binomial distribution and that the distribution of the $\theta_i$ themselves is beta.

$x_i \sim Binomial(n_i, \theta_i )$

$\theta_i \sim Beta(\alpha, \beta)$

For the beta distribution of talent levels, the parameters can be constructed from the league mean and stabilization point as

$\alpha = \mu M$

$\beta = (1-\mu) M$

Using the beta as a prior distribution, the posterior for $\theta_i$ is then

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

A 95% credible interval can then be taken as quantiles from this distribution - I show how to do this in R in my article on Bayesian credible intervals. Most statistical software should be able to take quantiles from the beta distribution easily.

Alternatively, a normal approximation may be used - the posterior should be approximately normal with mean and variance

$\theta_i | x_i, n_i, \mu, M \sim N\left( \dfrac{x_i + \mu M}{n_i  + M},  \dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}\right)$

So a 95% credible interval based on the normal approximation to the beta posterior is given by

$\left(\dfrac{x_i + \mu M}{n_i  + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}}$

This should be very close to the interval given by taking quantiles directly from the beta distribution.
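Here's a quick R sketch of both versions - exact quantiles from the beta posterior with qbeta() and the normal approximation above (illustrative only; the example numbers are made up):

bb_mean_interval <- function(x, n, mu, M, level = 0.95) {
  a <- x + mu * M                        # posterior alpha
  b <- n - x + (1 - mu) * M              # posterior beta
  probs <- c((1 - level) / 2, 1 - (1 - level) / 2)
  exact <- qbeta(probs, a, b)            # quantiles straight from the beta posterior

  z <- qnorm(1 - (1 - level) / 2)
  se <- sqrt(a * b / ((n + M)^2 * (n + M + 1)))
  approx <- a / (n + M) + c(-1, 1) * z * se

  rbind(exact = exact, normal.approx = approx)
}

bb_mean_interval(x = 50, n = 130, mu = 0.33, M = 296)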

Practical Application


I downloaded first- and second-half hitting data for all qualified non-pitchers from 2010 to 2015 from fangraphs.com. I used the above formulas on the first-half on-base percentage data to create intervals, and then calculated the proportion of those intervals that contained the on-base percentage for the second half. For the league mean and stabilization point, I used the values of $M$ and $\mu$ (even though I didn't show $\mu$ there) from my article "More Offensive Stabilization Points."

But wait...isn't there uncertainty in those estimates of $M$ and $\mu$? Yes, but it doesn't play a huge role unless that uncertainty is large, as it is for BABIP. You can try it out yourself by running the code and changing the values slightly, or just trust me.
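The coverage check itself is simple - here's a rough R sketch of the normal-normal version, with placeholder file and column names (the actual data and code are on my github):

# Placeholder file and column names - the real data and code are on my github.
splits <- read.csv("first_second_half_obp.csv")
mu <- 0.33; M <- 296                     # OBP values from the table below

B <- M / (M + splits$first.pa)
center <- (1 - B) * (splits$first.onbase / splits$first.pa) + B * mu
se <- sqrt(mu * (1 - mu) / (M + splits$first.pa))

second.obp <- splits$second.onbase / splits$second.pa
covered <- (second.obp >= center - 1.96 * se) & (second.obp <= center + 1.96 * se)
mean(covered)                            # empirical coverage of the NN mean intervals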

I rounded off to the nearest whole number for $M$ and to three nonzero digits for $\mu$. The intervals compared were the normal-normal (NN), the beta-binomial (BB), and the normal approximation to the beta-binomial (BB (N)). The resulting coverages were

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.66 & 0.659 & 0.659 \\
BA & 0.268 & 466 & 0.604 & 0.601 & 0.603 \\
1B & 0.158 & 222 & 0.685 & 0.675 & 0.679 \\
2B & 0.0475 & 1025 & 0.532 & 0.532 & 0.531 \\
3B & 0.00492 & 373 & 0.762 & 0.436 & 0.76 \\
XBH & 0.0524 & 1006 & 0.545 & 0.542 & 0.551 \\
HR & 0.0274 & 125 & 0.754 & 0.707 & 0.738 \\
BB & 0.085 & 106 & 0.688 & 0.661 & 0.673 \\
SO & 0.181 & 50 & 0.74 & 0.728 & 0.729 \\
HBP & 0.00866 & 297 & 0.725 & 0.591 & 0.721 \\  \hline\end{array}

So what happened? Shouldn't the 95% intervals have 95% coverage? Well, they should. The problem is, I used the wrong type of interval - the intervals calculated here are for the mean $\theta_i$. But we don't have $\theta_i$. What we have is the second half on-base percentage, which is $\theta_i$ plus the random noise that naturally surrounds $\theta_i$ in however many additional plate appearances. What's appropriate here is a prediction-type interval that attempts to cover not the mean, but a new observation - this interval will have to account for both the uncertainty of estimation and the natural randomness in a new set of observations.


Prediction Intervals


The interval needed is predictive - since the previous intervals were constructed as Bayesian credible intervals, a posterior predictive interval can be used.

Suppose that $\tilde{x_i}$ is the new count of events for player $i$ in $\tilde{n_i}$ new trials. I'm going to assume that $\tilde{n_i}$ is known. I'll also assume that $\tilde{x_i}$ is generated from the same process that generated $x_i$.

$\tilde{x_i} \sim p(\tilde{x_i} | \theta_i, \tilde{n_i})$

The posterior predictive is then

$p(\tilde{x_i}| \tilde{n_i}, x_i, n_i, \mu, M) = \displaystyle \int p(\tilde{x_i} | \theta_i, \tilde{n_i})p(\theta_i | x_i, n_i, \mu, M) d\theta_i$

For a bit more explanation, check out my article on posterior predictive distributions.

Normal-Normal Model


As stated above, the posterior distribution for $\theta_i$ is normal

$\theta_i | x_i, n_i, \mu, M \sim N\left(  (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, \dfrac{\mu (1-\mu)}{n_i + M}\right)$

$B = \dfrac{M}{n_i + M}$

Using a normal approximation to the binomial, the distribution of the new on-base percentage in the second half (call this $\tilde{x_i}/\tilde{n_i}$) is also normal

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \theta_i, \mu  \sim N\left(\theta_i, \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

The posterior predictive is the marginal distribution, integrating out over $\theta_i$ - it is given as

$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \tilde{n_i}, x_i, n_i, \mu, M \sim N\left((1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu,  \dfrac{\mu (1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$

And so a 95% posterior predictive interval for the on-base percentage in the second half is given by


$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}}$
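In R, this is only a small change from the mean interval - the extra $\mu(1-\mu)/\tilde{n_i}$ term in the variance. A sketch, with $\tilde{n_i}$ treated as known and the example numbers made up:

nn_pred_interval <- function(x, n, n.tilde, mu, M, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  B <- M / (M + n)
  center <- (1 - B) * (x / n) + B * mu
  se <- sqrt(mu * (1 - mu) / (M + n) + mu * (1 - mu) / n.tilde)  # posterior + new-sample variance
  c(lower = center - z * se, upper = center + z * se)
}

# Example: predicting OBP over 250 second-half PA from 50-for-130 in the first half
nn_pred_interval(x = 50, n = 130, n.tilde = 250, mu = 0.33, M = 296)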

 

Beta-Binomial Model

 

The posterior distribution for $\theta_i$ is beta

$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$

The distribution of the number of on-base events $\tilde{x_i}$ in $\tilde{n_i}$ new trials follows a binomial distribution

$\tilde{x_i} \sim Binomial(\tilde{n_i}, \theta_i)$

The posterior predictive for the number of on-base events in the new number of trials is the marginal distribution, which has density

$p(\tilde{x_i}| x_i, n_i, \mu, M, \tilde{n_i}) = \displaystyle  {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$


This is the beta-binomial distribution. It is a discrete distribution that gives the probability of the number of on-base events in $\tilde{n_i}$ new PA, not the actual on-base percentage itself.

Since it is discrete, it's easy to solve for quantiles

$Q(\alpha) = \displaystyle \min_{k} \{ k : F(k) \ge \alpha \}$
Where

$F(k) = \displaystyle \sum_{\tilde{x_i} \le k} p(\tilde{x_i} | x_i, n_i, \mu, M, \tilde{n_i}) =  \displaystyle \sum_{\tilde{x_i} \le k} \displaystyle  {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$

Since $Q(\alpha)$ is the quantile for the count of events, a 95% interval for the new observed on-base proportion is given by

$\left(\dfrac{Q(.025)}{\tilde{n_i}} ,\dfrac{Q(.975)}{\tilde{n_i}}\right)$.
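Here's a short R sketch of this quantile inversion, computing the beta-binomial pmf on the log scale with lchoose() and lbeta() and then accumulating the CDF (illustrative only; the example numbers are made up):

bb_pred_interval <- function(x, n, n.tilde, mu, M, level = 0.95) {
  a <- x + mu * M
  b <- n - x + (1 - mu) * M
  k <- 0:n.tilde
  log.pmf <- lchoose(n.tilde, k) + lbeta(k + a, n.tilde - k + b) - lbeta(a, b)
  cdf <- cumsum(exp(log.pmf))                    # F(k) for k = 0, ..., n.tilde

  lo <- k[which(cdf >= (1 - level) / 2)[1]]      # Q(0.025)
  hi <- k[which(cdf >= 1 - (1 - level) / 2)[1]]  # Q(0.975)
  c(lower = lo / n.tilde, upper = hi / n.tilde)  # convert counts back to proportions
}

bb_pred_interval(x = 50, n = 130, n.tilde = 250, mu = 0.33, M = 296)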

Alternatively, since the distribution is likely to be unimodal and bell-shaped, a normal approximation to the 95% posterior predictive interval is given by

$\left(\dfrac{x_i + \mu M}{n_i + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu)M)(n_i + M + \tilde{n_i})}{\tilde{n_i} (n_i + M)^2(n_i + M + 1)}}$

This approximation isn't as good as the normal approximation to the beta posterior used for the mean interval, but the difference between the two intervals is still only around 1% of the interval length, so it should work well.

Practical Application



I repeated the analysis with the predictive formulas given above, using the first-half on-base percentage to try to capture the second-half on-base percentage, with the same $\mu$ and $M$ values as before.

\begin{array}{| l | l | c | c | c | c |} \hline
\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline
OBP & 0.33 & 296 & 0.944 & 0.944 & 0.94 \\
BA & 0.268 & 466 & 0.943 & 0.943 & 0.944 \\
1B & 0.158 & 222 & 0.941 & 0.941 & 0.942 \\
2B & 0.0475 & 1025 & 0.956 & 0.956 & 0.955 \\
3B & 0.00492 & 373 & 0.955 & 0.955 & 0.956 \\
XBH & 0.0524 & 1006 & 0.957 & 0.957 & 0.959 \\
HR & 0.0274 & 125 & 0.951 & 0.951 & 0.952 \\
BB & 0.085 & 106 & 0.925 & 0.925 & 0.921 \\
SO & 0.181 & 50 & 0.918 & 0.918 & 0.92 \\
HBP & 0.00866 & 297 & 0.95 & 0.95 & 0.947 \\ \hline\end{array}

Cautions and Conclusion



Despite the positive results, I think that 95% actual coverage from these intervals is overoptimistic. For one, I selected a very "nice" group of individuals to test it on - nonpitchers with more than 300 PA. Being in this category implies a high talent level and a lack of anything that could drastically change that talent level over the course of the season, such as injury. I also treated the second half sample size $\tilde{n_i}$ as known - obviously, that must be estimated as well, and should add additional uncertainty.

Furthermore, there are clearly other factors at work beyond just random variation - players can be traded to different environments (a player traded to or from Coors Field, for example), talent levels may very well change over the course of the season, and events are clearly not independent and identically distributed.

Applying these formulas to the population of players at large should see the empirical coverage drop - my guess (though I haven't tested it) is that 95% intervals should empirically get around 85%-90% actual coverage. Also keep in mind that $M$ and $\mu$ need to be kept updated - using means and stabilization points from too far in the past will lead to shrinkage towards the wrong point.

You can and should be able to do better than these intervals, in terms of length - these are incredibly simplistic, using only information about the player and information about the population. Adding covariates to the model to account for other sources of variation should allow you to decrease the length without sacrificing accuracy.

Alternatively, you could use these formulas with projections, with $\mu$ as the preseason projection and $M$ representing how many events that projection is "worth." This is more in keeping with the traditional Bayesian sense of the interval, and won't guarantee any sort of coverage.

However, I still think these intervals are useful in that they represent a sort of baseline - any more advanced model that generates predictive intervals should be able to do better than these.

Edit 16 Mar. 2017: I found that the data file I used for this analysis with the split first and second half  statistics was not what I thought it was - it repeated the same player multiple times, giving an inaccurate estimate of the confidence level. I have corrected the data file and re-run the analysis and presented the corrected confidence levels.

16 October, 2015

Stabilization, Regression, Shrinkage, and Bayes

This post is somewhat of a brain dump for all my thoughts on the concept of a "stabilization point" (a term I dislike): how it's being used, assumptions that are (usually) knowingly and (sometimes) unknowingly being made in the process, and when and how it can be used correctly.

Stabilization In Practice


The most well-known stabilization point calculation was performed by Russell Carleton, who took samples of size $n$ of a statistic from multiple MLB players and declared the stabilization point to be the $n$ such that the correlation coefficient $r = 0.7$ (the logic being that this gives $R^2 \approx 0.5$ - however, I don't like this). This approach is nonparametric in the sense that it's not making any assumptions about the underlying structure of the data (for example, that the events are binomially distributed), only about the structure of the residuals, and these have been shown to be fairly reasonable and robust assumptions - in fact, the biggest problems with the original study came down to issues of sampling.

The split-and-correlate method is the most common method used to find stabilization points. This method will work, though it's not especially efficient, and should give good results assuming the sampling is being performed well - more recent studies randomly split the data into two halves and then correlate. In fact, it will work for essentially any statistic, especially ones that are difficult to fit into a parametric model.
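As a quick illustration of the method, here's a small R simulation sketch - beta-distributed talent levels with binomially distributed outcomes, generating two half-season samples for each player and correlating them. The talent distribution below is made up, loosely based on the OBP values from the previous post:

set.seed(1)
n.players <- 500
n.half <- 300                                        # events in each half
theta <- rbeta(n.players, 0.33 * 296, 0.67 * 296)    # OBP-like talent levels

# Two halves drawn independently given each player's talent level,
# which is distributionally equivalent to a random split of the season
half1 <- rbinom(n.players, n.half, theta) / n.half
half2 <- rbinom(n.players, n.half, theta) / n.half

cor(half1, half2)   # lands near n/(M + n) = 300/596, roughly 0.5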

In his original study (and subsequent studies), Carleton finds the point at which the split-half correlation is equal to $r = 0.7$, since then $R^2 \approx 0.5$. Others have disagreed with this. Commenter Kincaid on Tom Tango's blog writes

$r=.7$ between two separate observed samples implies that half of the variance in one observed sample is explained by the other observed sample.  But the other observed sample is not a pure measure of skill; it also has random variance.  So you can’t extrapolate that as half of the variance is explained by skill.

I agree with this statement. In traditional regression analysis, the explanatory variable $x$ is conceived as fixed. In this correlation analysis, both the explanatory and response variables are random. Hence, it makes no sense to say that the linear regression on $x$ explains 50% of the variation in $y$ when $x$ is itself random and, conditional on the same player's talent level, is in fact independent of $y$. Other arguments have also been made regarding the units of $r$ and $R^2$.

There's a more practical reason, however - a commonly used form of regression towards the mean is given by

$\dfrac{M}{n + M}$

where $M$ is the regression amount towards some mean. Tom Tango notes that, if the stabilization point is estimated as the point at which $r = 0.5$, then this value $M$ can be turned around and plugged directly into the regression equation given above. As Kincaid has noted, this is the form of statistical shrinkage for the binomial distribution with a beta prior. More generally, this is the form of the shrinkage coefficient $B$ that is obtained by modeling the outcome of a random event with a natural exponential family and performing a Bayesian analysis using a conjugate prior (see section five of Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions). Fundamentally, this is why the $M/(n + M)$ formula seems to work so well - not because it's beta-binomial Bayes, but because it's also normal-normal Bayes, and gamma-Poisson Bayes, and more - any member of the incredibly flexible natural exponential family.

So simply taking the correlation doesn't make any assumptions about the parametric structure of the observed data - but taking that stabilization point and turning it into a shrinkage (regression) estimator towards the population mean does assume that the observed data come from a natural exponential family with the corresponding conjugate prior for the distribution of true talent levels.


Mathematical Considerations

 

In practical usage, only certain members of the natural exponential family are considered - the beta-binomial, gamma-Poisson, and normal-normal models, for example, with the normal-normal largely dominating these choices. These form a specific subset of the natural exponential family - the natural exponential family with quadratic variance functions (NEFQVF). The advantage these have over general NEF distributions is that, aside from being the most commonly used distributions, they are closed under convolution - that is, the sum of NEFQVF distributions is also NEFQVF - and this makes them ideal for modeling counting statistics, as the forms of all calculations stay the same as new information arrives, requiring only that new estimates and sample sizes be plugged into formulas.

In a previous post I used Morris's work with the natural exponential family with quadratic variance functions to describe a two-stage model with some raw counting statistic $x_i$ as the sum of $n_i$ trials

 $X_i \sim p(x_i | \theta_i)$
 $\theta_i \sim G(\theta_i | \mu, \eta)$

where $p(x_i | \theta_i)$ is NEFQVF with mean $\theta_i$. If $G(.)$ is treated as a prior distribution for $\theta_i$, then the form of the shrinkage estimator for $\theta_i$ is given by

$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B)\bar{x_i} + B \mu$

where $\bar{x_i} = x_i/n_i$ and $B$, as mentioned before, is the shrinkage coefficient. The shrinkage coefficient is controlled by the average amount of variance at the event level and the variance of $G(.)$, weighted by the sample size $n_i$.

$B = \dfrac{ E[Var(\bar{x_i} | \theta_i)]}{ E[Var(\bar{x_i} | \theta_i)] + n_i Var(\theta_i)}$

And for NEF models, this simplifies down to

$B = \dfrac{M}{M + n_i}$

Implying that the form of the stabilization point $M$ is given as

$M = \dfrac{E[Var(\bar{x_i} | \theta_i)]}{Var(\theta_i)} = \dfrac{E[V(\theta_i)]}{Var(\theta_i)}$

Where $V(\theta_i)$ is the variance around the mean $\theta_i$ at the most basic level of the event (plate appearance, inning pitched, etc.). So under the NEF family, the stabilization point is the ratio of the average variance around the true talent level (the variance being a function of the true talent level itself) to the variance of the true talent levels themselves.

In another post, I showed briefly that for this model, the split-half correlation is theoretically equal to one minus the shrinkage coefficient $B$.

$\rho = 1 - B = \dfrac{n_i}{M+n_i}$

This is another result that has been commonly used. Therefore, to achieve any desired level of correlation $\rho$ between split samples, the formula

$n = \left(\dfrac{\rho}{1-\rho}\right) M$

can be used to estimate the sample size required. This formula derives not from any sort of correlation prophecy formula, but just from some algebra involving the forms of the shrinkage coefficient and split-half correlation $\rho$.
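For example, using the OBP stabilization point of $M = 296$ PA from above, the expected split-half correlation at $n = 296$ PA is exactly $0.5$, while reaching $\rho = 0.7$ would require roughly $n = (0.7/0.3)(296) \approx 691$ PA.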

It's for this reason that I dislike the name "stabilization point" - in its natural form it is the number of events required for a split-half correlation of $r = 0.5$ (and correspondingly a shrinkage coefficient of $0.5$), but really, you can estimate the split-half correlation and/or shrinkage amount for any sample size just by plugging in the corresponding values of $M$ and $n$. In general, there's not going to be much difference between samples of size $n$, $n-1$, and $n + 1$ - there's no magical threshold that the sample size can cross so that suddenly a statistic becomes perfectly reliable - and in fact the formula implies a given statistic can never reach 100% stabilization.

If I had my choice I'd call it the stabilization parameter, but alas, the name seems to already be set.

Practical Considerations


Note that at no point in the previous description was the shrinkage (regression)  explicitly required to be towards the league mean talent level. The league mean is a popular choice to shrink towards; however, if the sabermetrician can construct a different prior distribution from other data (for example, the player's own past results) then all of the above formulas and results can be applied using that prior instead.

When calculating stabilization points, studies have typically used a large sample of players from the league - theoretically, this implies that the league distribution of talent levels is being used as the prior distribution (the $G(.)$ in the two-stage model above), with the resulting so-called stabilization point meant to be used for any player. In actuality, choices made during the sampling process imply priors that consist of only certain portions of the league distribution of talent levels (mainly, players that have reached a certain threshold for the number of trials). In my article that calculated offensive stabilization points, I specified a hard minimum of 300 PA/AB in order to be included in the population. I estimated the stabilization point directly, but the issue also affects the correlation method used by Carleton, Carty, Pavlidis, etc. - in order to be included in the sample, a player must have had enough events (however they are defined), and this limits the sample to a nonrepresentative subset of the population of all MLB players.

The effect of this is that the specific point calculated is really only valid for those individuals who meet those PA/AB requirements - even though those are the players we know the most about! Furthermore, players that accrue more events do so specifically because they have higher talent levels - the stabilization points calculated for players who we know will receive, for example, at least 300 PA can't then be turned around and applied to players who we know will accrue fewer than 300 PA. This also explains how two individuals using the same method in the same manner with the same data can arrive at different conclusions, depending entirely on how they chose the inclusion/exclusion rules for their sample.

As a final issue, I used six years' worth of data - in doing this, I made the assumption that the basic shape of true talent levels for the subset of the population I chose had changed negligibly or not at all over those six years. I didn't simply use all available data, however, because I recognize that offensive environments change - the late 90s and early 2000s, for example, are drastically different from the early 2010s. This brings up another point - stabilization points, as they are defined, are a function of the mean (which enters through the average variance around a statistic) and, primarily, the variance of population talent levels - and both of those change over time. This means there is not necessarily such a thing as "the" stabilization point, since as the population of talent levels changes over time, so will the mean and variance (I wrote a couple of articles looking at how offensive and pitching stabilization points have changed over time), so stabilization points in articles that were published just a few years ago may or may not be valid any longer.


Conclusion

Even after all this math, I still think the split-and-correlate method should be thought of as the primary method for calculating stabilization points, since it works on almost any kind of statistic, even more advanced ones that don't fit clearly into a NEF or NEFQVF framework. Turning around and using the results of that analysis to perform shrinkage (regression towards the mean), however, does make very specific assumptions about the form of both the observed data and underlying distribution of talent levels. Furthermore, sampling choices made at the beginning can strongly affect the final outcome, and limit the applicability of your analysis to the larger population. And if you remember nothing else from this - there is no such thing as "the" stabilization point, either in terms of when a statistic is reliable (it's always somewhat unreliable, the question is by how much) or one value that applies to all players at all times (since it's a function of the underlying distribution of talent levels, which is always changing).

This has largely been just a summary of techniques, studies, and research others have done - I know others have expressed similar opinions as well - but I found the topic interesting and I wanted to explain it in a way that made sense to me. Hopefully I've made a little more clear the connections between statistical theory and things people were doing just because they seemed to work.

Various Links

 

These are just some of the various links I read to attempt to understand what people were doing in practice and attempt to connect it to statistical theory: 

Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions: Statistical Theory: http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/NEF-QVF_1983_2240566.pdf

Russell Carleton's original reliability study: http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/

Carleton's updated calculations: http://www.baseballprospectus.com/article.php?articleid=20516

Tom Tango comments on Carleton's article:
http://tangotiger.com/index.php/site/comments/point-at-which-pitching-metrics-are-half-signal-half-noise


Derek Carty's stabilization point calculations: http://www.baseballprospectus.com/a/14215#88434

Tom Tango discusses Carty's article, the $r = 0.7$ versus $r = 0.5$ threshold, and regression towards the mean: http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/

Steve Staude discusses $r = 0.5$ versus $r = 0.7$: http://www.fangraphs.com/blogs/randomness-stabilization-regression/

Tom Tango comments on Steve's work: http://tangotiger.com/index.php/site/comments/randomness-stabilization-regression 

Tom Tango links to Phil Birnbaum's proof of the regression towards the mean formula: http://tangotiger.com/index.php/site/comments/blase-from-the-past-proof-of-the-regression-toward-the-mean

Kincaid shows that the beta-binomial model produces the regression towards the mean formula: http://www.3-dbaseball.net/2011/08/regression-to-mean-and-beta.html

Harry Pavlidis looks at stabilization for some pitching events: http://www.hardballtimes.com/it-makes-sense-to-me-i-must-regress/

Tom Tango discusses Harry's article, and gives the connection between regression and stabilization: http://www.insidethebook.com/ee/index.php/site/article/regression_equations_for_pitcher_events/

Great summary of various regression and population variance estimation techniques - heavy on the math: http://www.countthebasket.com/blog/2008/05/19/regression-to-the-mean/

The original discussion on regression and shrinkage from Tom Tango's archives: http://www.tangotiger.net/archives/stud0098.shtml