## 16 October, 2015

### Stabilization, Regression, Shrinkage, and Bayes

This post is somewhat of a brain dump: all my thoughts on the concept of a "stabilization point" (a term I dislike), how it's being used, the assumptions that are (usually) knowingly and (sometimes) unknowingly being made in the process, and when and how it can be used correctly.

## Stabilization In Practice

The most well-known stabilization point calculation was performed by Russell Carleton, who took samples of size $n$ of some statistic from multiple MLB players and declared the stabilization point to be the $n$ such that the correlation coefficient $r = 0.7$ (the logic being that this gives $R^2 \approx 0.5$ - however, I don't like this). This approach is nonparametric in the sense that it makes no assumptions about the underlying structure of the data (for example, that the events are binomially distributed), only about the structure of the residuals, and these have been shown to be fairly reasonable and robust assumptions - in fact, the biggest problems with the original study came down to issues of sampling.

The split-and-correlate method is the most common method used to find stabilization points. This method will work, though it's not especially efficient, and should give good results assuming the sampling is being performed well - more recent studies randomly split the data into two halves and then correlate. In fact, it will work for essentially any statistic, especially ones that are difficult to fit into a parametric model.
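As a concrete illustration, the split-and-correlate procedure can be sketched on simulated data. Everything below (the Beta prior on talent, the number of players, the number of events per half) is an invented example, not a fit to real data:

```python
import numpy as np

# Sketch of the split-and-correlate method on simulated binomial data.
rng = np.random.default_rng(0)

players = 2000
theta = rng.beta(70, 230, size=players)   # hypothetical true talent levels
half_n = 300                              # events in each half-sample

# Each player's events are split into two independent halves;
# the per-player success rates of the two halves are then correlated.
half1 = rng.binomial(half_n, theta) / half_n
half2 = rng.binomial(half_n, theta) / half_n

r = np.corrcoef(half1, half2)[0, 1]
print(round(r, 3))   # should land near 0.5 for these particular parameters
```

The stabilization point would then be estimated as the half-sample size at which this correlation reaches the chosen threshold.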

In his original study (and subsequent studies), Carleton finds the point at which the split-half correlation is equal to $r = 0.7$, since then $R^2 \approx 0.5$. Others have disagreed with this. Commenter Kincaid on Tom Tango's blog writes:

> $r=.7$ between two separate observed samples implies that half of the variance in one observed sample is explained by the other observed sample.  But the other observed sample is not a pure measure of skill; it also has random variance.  So you can’t extrapolate that as half of the variance is explained by skill.

I agree with this statement. In traditional regression analysis, the explanatory variable $x$ is conceived as fixed. In this correlation analysis, both the explanatory and response variables are random. Hence, it makes no sense to say that the linear regression with $x$ explains 50% of the variation in $y$ when $x$ is itself random and, conditional on the player, independent of $y$. Other arguments have also been made regarding the units of $r$ and $R^2$.

There's a more practical reason, however - a commonly used form of regression towards the mean is given by

$\dfrac{M}{n + M}$

where $n$ is the sample size and $M/(n+M)$ is the proportion of regression towards some mean. Tom Tango notes that, if the stabilization point $M$ is estimated as the point at which $r = 0.5$, then that value can be plugged directly into the regression equation given above. As Kincaid has noted, this is the form of statistical shrinkage for the binomial distribution with a beta prior. More generally, this is the form of the shrinkage coefficient $B$ obtained by modeling the outcome of a random event with a natural exponential family and performing a Bayesian analysis using a conjugate prior (see section five of Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions). Fundamentally, this is why the $M/(n + M)$ formula seems to work so well - not because it's beta-binomial Bayes, but because it's also normal-normal Bayes, and gamma-Poisson Bayes, and more - any member of the incredibly flexible natural exponential family.
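As a quick numeric sketch of how the formula is used in practice (the stabilization point, sample size, observed rate, and league mean below are all invented for illustration):

```python
# Regression towards the mean: shrink an observed rate towards a population mean.
# M, n, the observed rate, and the league mean are all hypothetical numbers.
M = 300            # stabilization point (in events)
n = 100            # observed events for this player
observed = 0.300   # player's observed rate
league = 0.250     # mean to regress towards

B = M / (M + n)                           # proportion of regression: 0.75
estimate = (1 - B) * observed + B * league
print(round(estimate, 4))                 # 0.2625
```

With only 100 observed events against a stabilization point of 300, the estimate sits three-quarters of the way back towards the mean.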

So simply taking the correlation doesn't make any assumptions about the parametric structure of the observed data - but taking that stabilization point and turning it into a shrinkage (regression) estimator towards the population mean does assume that observed data come from natural exponential family with the corresponding conjugate prior for the distribution of true talent levels.

## Mathematical Considerations

In practical usage, only certain members of the natural exponential family are considered - the beta-binomial, gamma-Poisson, and normal-normal models, for example, with the normal-normal largely dominating these choices. These form a specific subset of the natural exponential family - the natural exponential family with quadratic variance functions (NEFQVF). The advantage these have over general NEF distributions is that, aside from being the most commonly used distributions, they are closed under convolution - that is, the sum of NEFQVF random variables is also NEFQVF - and this makes them ideal for modeling counting statistics, as the forms of all calculations stay the same as new information arrives, requiring only that new estimates and sample sizes be plugged into the formulas.
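One practical consequence of this is that conjugate updating can be done incrementally - updating on two batches of data gives the same answer as updating once on their total. A minimal beta-binomial sketch (the prior parameters and the counts are made up):

```python
# Conjugate beta-binomial updating: new information just adds to the counts,
# so the form of every calculation stays the same as data arrives.
def update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing `successes` in `trials` events."""
    return alpha + successes, beta + trials - successes

prior = (70.0, 230.0)   # hypothetical Beta prior on true talent

# Two sequential updates...
sequential = update(*update(*prior, 30, 100), 45, 150)
# ...match a single update on the combined totals.
combined = update(*prior, 75, 250)
print(sequential == combined)   # True
```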

In a previous post I used Morris's work on the natural exponential family with quadratic variance functions to describe a two-stage model with some raw counting statistic $x_i$ as the sum of $n_i$ trials

$X_i \sim p(x_i | \theta_i)$
$\theta_i \sim G(\theta_i | \mu, \eta)$

where $p(x_i | \theta_i)$ is NEFQVF with mean $\theta_i$. If $G(.)$ is treated as a prior distribution for $\theta_i$, then the form of the shrinkage estimator for $\theta_i$ is given by

$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B)\bar{x_i} + B \mu$

where $\bar{x_i} = x_i/n_i$ and $B$, as mentioned before, is the shrinkage coefficient. The shrinkage coefficient is controlled by the average amount of variance at the event level and the variance of $G(.)$, weighted by the sample size $n_i$:

$B = \dfrac{ E[Var(\bar{x_i} | \theta_i)]}{ E[Var(\bar{x_i} | \theta_i)] + Var(\theta_i)}$

And for NEF models, this simplifies down to

$B = \dfrac{M}{M + n_i}$

implying that the stabilization point $M$ takes the form

$M = \dfrac{n_i E[Var(\bar{x_i} | \theta_i)]}{Var(\theta_i)} = \dfrac{E[V(\theta_i)]}{Var(\theta_i)}$

where $V(\theta_i)$ is the variance around the mean $\theta_i$ at the most basic level of the event (plate appearance, inning pitched, etc.). So under the NEF family, the stabilization point is the ratio of the average variance around the true talent level (that variance being a function of the true talent level itself) to the variance of the true talent levels themselves.
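As a sanity check on this formula: for a binomial event with a Beta($\alpha$, $\beta$) prior, $V(\theta) = \theta(1-\theta)$, and the ratio works out to exactly $\alpha + \beta$ - consistent with the familiar beta-binomial shrinkage coefficient $(\alpha+\beta)/(\alpha+\beta+n)$. A quick numeric verification (the prior parameters are arbitrary):

```python
# M = E[V(theta)] / Var(theta) for a Beta(alpha, beta) prior on a binomial event.
alpha, beta = 70.0, 230.0   # arbitrary illustrative prior

mean = alpha / (alpha + beta)
e_theta_sq = alpha * (alpha + 1) / ((alpha + beta) * (alpha + beta + 1))
e_v = mean - e_theta_sq     # E[theta * (1 - theta)], average per-event variance
var_theta = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

M = e_v / var_theta
print(round(M, 6))          # 300.0, i.e. alpha + beta
```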

In another post, I showed briefly that, for this model, the split-half correlation is theoretically equal to one minus the shrinkage coefficient $B$:

$\rho = 1 - B = \dfrac{n_i}{M+n_i}$

This is another result that has been used widely. Therefore, to achieve any desired level of correlation $\rho$ between split samples, the formula

$n = \left(\dfrac{\rho}{1-\rho}\right) M$

can be used to estimate the required sample size. This formula derives not from any sort of correlation prophecy formula, but simply from algebra involving the forms of the shrinkage coefficient and the split-half correlation $\rho$.
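In code, the required-sample-size calculation is a one-liner (using a hypothetical stabilization parameter $M = 300$):

```python
# Sample size needed for a desired split-half correlation rho,
# given stabilization parameter M: n = (rho / (1 - rho)) * M.
def required_n(rho, M):
    return rho / (1 - rho) * M

M = 300   # hypothetical stabilization parameter
print(round(required_n(0.5, M)))   # 300: r = 0.5 needs exactly M events
print(round(required_n(0.7, M)))   # 700: r = 0.7 needs more than twice as many
```

This also makes concrete why the choice between the $r = 0.5$ and $r = 0.7$ thresholds matters so much in practice.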

It's for this reason that I dislike the name "stabilization point" - in its natural form it is the number of events required for a split-half correlation of $r = 0.5$ (and correspondingly a shrinkage coefficient of $0.5$), but really, you can estimate the split-half correlation and/or shrinkage amount for any sample size just by plugging the corresponding values of $M$ and $n$ into the formulas above. In general, there's not going to be much difference between samples of size $n - 1$, $n$, and $n + 1$ - there's no magical threshold the sample size can cross so that a statistic suddenly becomes perfectly reliable - and in fact the formula implies a given statistic can never reach 100% stabilization.

If I had my choice I'd call it the stabilization parameter, but alas, the name seems to already be set.

## Practical Considerations

Note that at no point in the previous description was the shrinkage (regression) explicitly required to be towards the league mean talent level. The league mean is a popular choice to shrink towards; however, if the sabermetrician can construct a different prior distribution from other data (for example, the player's own past results), then all of the above formulas and results can be applied using that prior instead.

When calculating stabilization points, studies have typically used a large sample of players from the league - theoretically, this implies that the league distribution of talent levels is being used as the prior distribution (the $G(.)$ in the two-stage model above), and that the resulting stabilization point can be used for any player. In actuality, choices made during the sampling process imply priors which consist of only certain portions of the league distribution of talent levels (mainly, those that have reached a certain threshold for the number of trials). In my article that calculated offensive stabilization points, I specified a hard minimum of 300 PA/AB in order to be included in the population. I estimated the stabilization point directly, but the issue also affects the correlation method used by Carleton, Carty, Pavlidis, etc. - in order to be included in the sample, a player must have had enough events (however they are defined), and this limits the sample to a nonrepresentative subset of the population of all MLB players.

The effect of this is that the specific point calculated is really only valid for those individuals who meet those PA/AB requirements - even though those are the players we know the most about! Furthermore, players that accrue more events do so specifically because they have higher talent levels - stabilization points calculated for players who we know will receive, for example, at least 300 PA can't then be turned around and applied to players who we know will accrue fewer than 300 PA. This also explains how two individuals using the same method in the same manner with the same data can arrive at different conclusions, depending entirely on how they chose the inclusion/exclusion rules for their samples.

As a final issue, I used six years' worth of data - in doing this, I made the assumption that the basic shape of true talent levels for the subset of the population I chose had changed negligibly or not at all over those six years. I didn't simply use all available data, however, because I recognize that offensive environments change - the late 90s and early 2000s, for example, are drastically different from the early 2010s. This brings up another point - stabilization points, as they are defined, are a function of the mean (which comes into play in the average variance around a statistic) and, primarily, the variance of population talent levels - but both of those change over time. This means there is not necessarily such a thing as "the" stabilization point, since as the population of talent levels changes over time, so will the mean and variance (I wrote a couple of articles looking at how offensive and pitching stabilization points have changed over time), so stabilization points in articles published just a few years ago may or may not be valid any longer.

## Conclusion

Even after all this math, I still think the split-and-correlate method should be thought of as the primary method for calculating stabilization points, since it works on almost any kind of statistic, even more advanced ones that don't fit clearly into a NEF or NEFQVF framework. Turning around and using the results of that analysis to perform shrinkage (regression towards the mean), however, does make very specific assumptions about the form of both the observed data and underlying distribution of talent levels. Furthermore, sampling choices made at the beginning can strongly affect the final outcome, and limit the applicability of your analysis to the larger population. And if you remember nothing else from this - there is no such thing as "the" stabilization point, either in terms of when a statistic is reliable (it's always somewhat unreliable, the question is by how much) or one value that applies to all players at all times (since it's a function of the underlying distribution of talent levels, which is always changing).

This has largely been just a summary of techniques, studies, and research others have done - I know others have expressed similar opinions as well - but I found the topic interesting and I wanted to explain it in a way that made sense to me. Hopefully I've made a little more clear the connections between statistical theory and things people were doing just because they seemed to work.

These are just some of the various links I read to attempt to understand what people were doing in practice and attempt to connect it to statistical theory:

Carl Morris's paper Natural Exponential Families with Quadratic Variance Functions: Statistical Theory: http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/NEF-QVF_1983_2240566.pdf

Russell Carleton's original reliability study: http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/

Carleton's updated calculations: http://www.baseballprospectus.com/article.php?articleid=20516

Tom Tango comments on Carleton's article:

Derek Carty's stabilization point calculations: http://www.baseballprospectus.com/a/14215#88434

Tom Tango discusses Carty's article, the $r = 0.7$ versus $r = 0.5$ threshold, and regression towards the mean: http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/

Steve Staude discusses $r = 0.5$ versus $r = 0.7$: http://www.fangraphs.com/blogs/randomness-stabilization-regression/

Tom Tango links to Phil Birnbaum's proof of the regression towards the mean formula: http://tangotiger.com/index.php/site/comments/blase-from-the-past-proof-of-the-regression-toward-the-mean

Kincaid shows that the beta-binomial model produces the regression towards the mean formula: http://www.3-dbaseball.net/2011/08/regression-to-mean-and-beta.html

Harry Pavlidis looks at stabilization for some pitching events: http://www.hardballtimes.com/it-makes-sense-to-me-i-must-regress/

Tom Tango discusses Harry's article, and gives the connection between regression and stabilization: http://www.insidethebook.com/ee/index.php/site/article/regression_equations_for_pitcher_events/

Great summary of various regression and population variance estimation techniques - heavy on the math: http://www.countthebasket.com/blog/2008/05/19/regression-to-the-mean/

The original discussion on regression and shrinkage from Tom Tango's archives: http://www.tangotiger.net/archives/stud0098.shtml