21 September, 2015

The Posterior Predictive

Let's say you do a Bayesian analysis and end up with a posterior distribution $p(\theta | y)$. What does that tell you about a new observation $\tilde{y}$ from some data-generating process that involves $\theta$? The answer can be found using the posterior predictive distribution.

The code that I used to generate the images in this article can be found on my github.


Posterior Predictive


 Given a posterior distribution $p(\theta | y)$, the posterior predictive distribution is defined as

$p(\tilde{y} | y) = \int p(\tilde{y} | \theta) p(\theta | y) d\theta$

and it represents the distribution of a new observation $\tilde{y}$ given your updated information about a parameter $\theta$ and natural variation around the observation that arises from the data-generating process.

Since applied Bayesian techniques have tended towards fully computational MCMC procedures, the posterior predictive is usually obtained through simulation - let's say you have a sample $\theta_j^*$ ($j = 1,2,...,k$) from the posterior distribution of $\theta$ and you want to know about a new observation from some process that uses the parameter $\theta$.

$\tilde{y} \sim p(\tilde{y} | \theta)$

To obtain the posterior predictive, you would simulate a set of observation $\tilde{y}^*_j$ from $p(\tilde{y} | \theta_j^*)$ (in other words, simulate a new observation from the data model for each draw from the MCMC for your parameter). The distribution of these  $\tilde{y}^*_j$ approximates the posterior predictive distribution for $\tilde{y}$.

In general, this process is very specific to the problem at hand. It is possible in a few common scenarios, however, to calculate the posterior predictive distribution analytically. One example that is useful in baseball analysis the beta-binomial model.


Beta-Binomial Example


Let's say a batter obtains a 0.300 OBP in 250 PA - that corresponds to 75 on-base events and 175 not on-base events. What can you say about the distribution of on-base events in a new set of 250 PA?

Suppose that the distribution of on-base events is given by a binomial distribution with $n = 250$ and chance of getting on base $\theta$, which is the same in both sets of PA.

$p(y | \theta) \sim Binomial(250, \theta)$

For the prior distribution, let's suppose that a $Beta(1,1)$ distribution was used - this is a uniform distribution between zero and one - so any possible value for $\theta$ is equally likely. Since the beta and binomial are conjugate distributions, the posterior distribution of  $\theta$ (the batter's chance of getting on base) is also a beta distribution:

$p(\theta| y = 75) = \dfrac{\theta^{75+1-1}(1-\theta)^{175+1-1}}{\beta(75+1,175+1)} \sim Beta(76,176)$

Now, suppose we are planning to observe another 250 PA for the same batter, and we want to know the distribution of on-base events $\tilde{y}$ in the new 250 PA. This distribution is also binomial

$p(\tilde{y} | \theta) = \displaystyle {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}}$

 The posterior predictive distribution for the number of on-base events in another 250 PA is then obtained by multiplying the two densities and integrating out $\theta$.

$p(\tilde{y} | y = 75)  = \displaystyle \int_0^1 {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}} *  \dfrac{\theta^{75}(1-\theta)^{175}}{\beta(76,176)} d\theta$

The resulting distribution is known as beta-binomial distribution, which has density

$p(\tilde{y} | y = 75)  =\displaystyle  {250 \choose \tilde{y}} \dfrac{\beta(76 + \tilde{y}, 426-\tilde{y})}{\beta(76,176)}$

(The beta-binomial distribution is obtained from the beta-binomial model - it does get a bit confusing, but they are different things - the beta-binomial model can be thought of as a binomial with extra variance)

Now I can use the posterior predictive for inference. If, for example, I wanted to know the probability that a player will have a 0.300 OBP in another 250 PA (corresponding, again, to $\tilde{y}$ = 75 on-base events) then I can calculate that as

$p(\tilde{y} = 75 | y = 75) = \displaystyle  {250 \choose 75} \dfrac{\beta(76 + 75, 426-75)}{\beta(76,176)} \approx 0.0389$

That is, our updated information says there's a 3.89% chance of getting exactly a 0.300 OBP in a new 250 PA by the same player.

The actual distribution of OBP in a new 250 PA is given by



This can also be accomplished by simulation - first, by simulating a large number $k$ of $\theta^*$ values from the posterior

$\theta^*_j \sim Beta(76,176)$

And then using those $\theta^*$ values to simulate from the data model $p(\tilde{y} | \theta^*)$

$y^*_j \sim Binomial(250, \theta^*_j)$

The estimated probability of a 0.300 OBP (equivalent to 75 on-base events) is then

$P(0.300 | y = 75) \approx \displaystyle \dfrac{\textrm{# } y^*_j \textrm{ that equal 75}}{k} $

This is much easier to do in $R$ -with $k = 1000000$, the code to quickly perform this is

     > theta <- rbeta(1000000, 76,176) #Simulate from Posterior
     > y <- rbinom(1000000, 250, theta) #Simulate from data model
     > mean(y == 75) #Estimate P(y = 75)
     [1] 0.039077


Notice that the result is very close to the analytic answer of 3.89%. The simulated posterior predictive distribution for OBP is



Which visually looks very similar to the analytic version.

What's the Use?


Aside from doing (as the name implies) prediction, the posterior predictive is very useful for model checking - if data simulated from the posterior predictive (new data) is similar to the data you used to fit the model (old data), that is evidence that the model is a good fit. Conversely, if your posterior predictive data looks nothing like your original data set, you may have misspecified your model (and I'm criminally oversimplifying here - I recommend Bayesian Data Analysis by Gelman et al. for all the detail on model-checking using the posterior predictive distribution).

14 September, 2015

Pitching Stabilization Points through Time

In the previous post, I calculated some stabilization points for pitching statistics for the past few years. In this post, I want to look at how some of those stabilization points have changed over time.

(I have previously done this for offensive statistics)

Each stabilization point is a six-year calculation, including the current and five previous years (so for example, 2014 incudes 2009-2014 data, 1965 includes 1959 - 1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only starting pitchers from each year, and for cutoff values (the minimum number of TBF, BIP, etc. to be included in the dataset) I used the same values as in my previous post. Years were split for the same player. Raw counts are used, not adjusted in any form. Relief pitchers are excluded.

All of the code that I used to create these can be found in my github, though I make no claims to efficiency or ease of operation. Because I added this code several months after the article was originally posted, I did not clean and annotate it as I normally would have - I just posted the raw code. The code is a modified form of the code used to calculate offensive stabilization points over time.

All of the plots shown below, and more, can be found in my imgur account.

Historical Plots


For some statistics, I will show plots for both the mean of a statistic over time and the stabilization point. The stabilization point driven largely by the underlying population variance of talent levels, which tends to be more difficult to estimate then the mean - hence the reason that, even with six years of moving data, the 'stabilization point' will appear to fluctuate quite a bit. I recommend not reading too much into the fluctuations, but rather looking for more general patterns.

Firstly, the ground ball, fly ball, and line drive rates (per BIP) only have recent data available. In that time, neither the fly ball or ground ball stabilization points have changed much


Line drive rate appears to have increased in recent years, however.

Though keep in mind the standard error is approximately 100 balls in play.

More interesting is batting average on balls in play, for which we have more data. The standard error for BABIP is approximately 500 balls in play, so it's not wise to trust small fluctuations in this plot as representing real shifts - however, it does appear that there is a positive trend in the stabilization point, indicative of the spread in BABIP values getting smaller. (A plot with 95% error bounds at each point can be found here, though I don't necessarily care for it)

The mean is easier to estimate with more accuracy - and  it shows that batting average on balls in play is at its highest point in history.

 An animated plot shows how the mean and variance of the observed (histogram) and estimated true talent (dashed line) distributions have changed over time.
As I've previously mentioned, the primary driving force for stabilization points is the underlying population variance. For example, take strikeout rate (per batter faced): since the dead ball era, it has followed a pattern of fairly consistent decrease (with a recent upsurge that still places it within previously observed ranges).


 Over time, however, the  mean strikeout rate (per batter faced) has been on the increase.

What does coincide with the increase in stabilization point is the decrease in population variance over time, as seen in this animated plot with the observed strikeout rates (histogram) and estimated true talent distribution (dashed line) - the spread in both is constantly increasing over time.


Also interesting is the earned run rate (per inning pitched, min 80 IP).

Beginning in the early 2000s, it dropped to a very low point, relative to its history, and has remained there more consistently than in the past. Meanwhile, the stabilization point for walk rate (min 400 BF) has increased in recent years, after reaching a maximum in the 1980s and decreasing.



On-base percentage and hit by-pitch rate have all fluctuated within a relatively stable area over time.





Though interestingly, both (per batter faced) took a dip in the 1960 that corresponds to an increase in the mean hit-by-pitch rate and a decrease in the mean on-base-percentage.




For some statistics, such as WHIP and home run rate, and it is difficult to discern a pattern other than fluctuations within a certain range.


An interesting look at how certain things have changed over time - though, as I mentioned before, I would encourage not reading too much into these plots.

03 September, 2015

More Pitching Stabilization Points


Using the beta-binomial model (notated BB) or the gamma-Poisson model (notated GP, and in this post what I call M is what in the previous post I called K - the variance parameter of the population talent distribution), I calculated the stabilization point for some more pitching statistics. I don't think the model(s) fit perfectly to the data, but they provide a good approximation that generally matches up with results I've seen elsewhere on the web.

Data was acquired from fangraphs.com. I only considered starting pitchers from 2009 - 2014, splitting the same pitcher between years, and did not adjust the data in any way.

All the data and code I used here may be found on my github. I make no claims to efficiency or ease of use.

The "cutoff value" is the minimum number of the denominator (IP, TBF, BIP, etc.) in a year in order to be included in the data set. These numbers were chosen somewhat arbitrarily, and for some of my statistics, changing the cutoff value will change the stabilization point. I'm not sure which statistics this will happen to - I know WHIP for sure, and I suspect ER as well, whereas I think BABIP doesn't exhibit this tendency. It's a function of the change (or lack thereof) in population variance of talent levels as the cutoff value increases - if somebody wants to take a look at it, it would be neat.

I wanted have a little fun and apply the model to stats where it clearly is silly to do so, such as win rate (I defined as wins per game started) and extra batters faced per inning (the total number of additional batters a pitcher faced beyond what is required by their IP). The model still produces estimates, but of course, but bad data fed into a good model doesn't magically produce good analysis.

\begin{array}{| l | l | c | c | c | c  | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\textrm{95% CI}&\textrm{Cutoff}&\textrm{Model}\\ \hline
\textrm{BABIP}&\textrm{(H-HR)/n*}&2006.71&484.94&(1056.22,2957.20)&300&BB\\
\textrm{GB Rate}&\textrm{GB/BIP}&65.52&3.63&(58.39,72.64)&300&BB\\
\textrm{FB Rate}&\textrm{FB/BIP}&61.96&3.42&(55.25,68.66)&300&BB\\
\textrm{LD Rate}&\textrm{LD/BIP}&768.42&94.10&(583.99,952.86)&300&BB\\
\textrm{HR/FB Rate}&\textrm{HR/FB}&505.11&93.95&(320.96,689.26)&100&BB\\
\textrm{SO Rate}&\textrm{SO/TBF}&90.94&5.04&(81.06,100.82)&400&BB\\
\textrm{HR Rate}&\textrm{HR/TBF}&931.59&107.80&(720.30,1142.88)&400&BB\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}&221.25&14.43&(192.97,249.53)&400&BB\\
\textrm{HBP Rate}&\textrm{HBP/TBF}&989.30&119.95&(754.21,1224.41)&400&BB\\
\textrm{Hit rate}&\textrm{H/TBF}&623.35&57.57&(510.51,736.18)&400&BB\\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}&524.73&44.96&(436.62,612.84)&400&BB\\
\textrm{Win Rate}&\textrm{W/GS}&57.23&8.68&(40.21,74.24)&15&BB\\
\textrm{WHIP}&\textrm{(H + BB)/IP**}&77.20&5.46&(66.50,87.90)&80&GP\\
\textrm{ER Rate}&\textrm{ER/IP**}&59.55&3.94&(51.82,67.25)&80&GP\\
\textrm{Extra BF}&\textrm{(TBF - 3IP**)/IP**}&73.00&5.08&(63.05,82.95)&80&GP\\ \hline
\end{array}

* I'm not exactly sure what combinations of statistics fangraphs is using for the denominator of their BABIP - it's not BIP = GB + FB + LD. I know the numerator of H - HR is correct, but the denominator was usually smaller , though sometimes larger, than BIP. I solved for what fangraphs was using and used that in my calculations - if somebody wants to let me know exactly what they're using for n, please do.

** When dividing by IP, I corrected the 0.1 and 0.2 decimal representations to 0.33 and 0.67. 

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of talent levels, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances at which there is 50% shrinkage.

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula.

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

It represents a 95% confidence interval for the number of plate appearances at which there is 50% shrinkage.

For an arbitrary stabilization level $p$, the number of required plate appearances can be estimated as

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$

And a 95% confidence interval for the required number of plate appearances is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$

Since the denominators are so different (as opposed to offensive statistics where PA was used for almost everything except for batting average, and AB are fairly close to PA), I don't feel as comfortable putting everything on the same plot. That being said, the stats that use TBF look like



And the stats that use BIP for their denominator look like

As always, comments are appreciated.

02 September, 2015

2015 Win Prediction Totals (Through August)

These predictions are based on my own method (which can be improved). I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), and I think by this point in the season the actual coverage should be close to that.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c} 
\textrm{Team}  & \textrm{Lower}  & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline

ATL & 61 & 66.56 & 72 & 65.57 \\
ARI & 73 & 78.98 & 85 & 83.48 \\
BAL & 72 & 77.94 & 84 & 83.3 \\
BOS & 70 & 75.88 & 82 & 77.71 \\
CHC & 85 & 90.58 & 96 & 83.91 \\
CHW & 70 & 76.24 & 82 & 74.78 \\
CIN & 63 & 68.64 & 75 & 74.14 \\
CLE & 74 & 80.20 & 86 & 82.03 \\
COL & 61 & 67.29 & 73 & 70.14 \\
DET & 69 & 74.75 & 81 & 74.66 \\
HOU & 84 & 89.99 & 96 & 91.79 \\
KCR & 92 & 97.84 & 104 & 90.34 \\
LAA & 74 & 79.95 & 86 & 78.16 \\
LAD & 84 & 90.31 & 96 & 87.66 \\
MIA & 61 & 66.79 & 73 & 74.49 \\
MIL & 64 & 69.71 & 76 & 74.48 \\
MIN & 77 & 82.69 & 89 & 79.41 \\
NYM & 84 & 89.67 & 95 & 87.19 \\
NYY & 83 & 89.34 & 95 & 87.76 \\
OAK & 68 & 73.33 & 79 & 82.77 \\
PHI & 58 & 63.87 & 70 & 64.11 \\
PIT & 91 & 97.21 & 103 & 89.43 \\
SDP & 73 & 78.54 & 84 & 75.97 \\
SEA & 69 & 74.32 & 80 & 71.95 \\
SFG & 80 & 85.61 & 91 & 86.79 \\
STL & 98 & 103.63 & 109 & 97.33 \\
TBR & 76 & 81.45 & 87 & 80.79 \\
TEX & 78 & 83.60 & 90 & 79.00 \\
TOR & 87 & 92.88 & 98 & 98.67 \\
WSN & 77 & 82.61 & 89 & 84.04 \\  \hline\end{array}

To explain the difference between "Mean" and "True Win Total"  - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (at the end of August).

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky.

27 August, 2015

WHIP Stabilization by the Gamma-Poisson Model

I've previously covered shrinkage estimation for offensive statistics - or at least, those that can be written as binomial events. In a previous post, I showed that for models that follow the natural exponential family with quadratic variance function, the split-half correlation is equal to one minus the shrinkage coefficient $B$.

The techniques I used can also be used when the outcome at the most basic level (at-bat, inning pitched, etc.) is not just a binary outcome. In particular, the Poisson distribution also fits within the framework I derived, as it is a member of the natural exponential family with quadratic variance function, and so events that can be modeled as Poisson at the base level will follow the same basic principles I used for the binomial outcomes. I chose the statistic WHIP (walks + hits per inning pitched) to illustrate this method, as it is a counting statistic that is a non-binary event (i.e., you can have 0, 1, 2, ... walks + hits in a given inning), so it fits the support of the Poisson.

Model and Estimation


I will assume that in each inning, pitcher $i$ gives up a number of walks + hits that follows a Poisson model with mean $\theta_i$, which is unique to each pitcher. The sum number of walks + hits given up in $n_i$ innings is $x_i$, and I have $N$ pitchers total. I considered only starting pitchers from 2009-2014, and split years between the same pitcher. My code and data are on github for anybody who wants to check my calculations.

Since the sum of Poissons is Poisson, the sum number of walks + hits $x_i$ given up in $n_i$ innings follows a Poisson distribution with mean $n_i \theta_i$ and mass function

$p(x_i | \theta_i, n_i) =  \dfrac{e^{-n_i \theta_i} (n_i \theta_i)^{x_i}}{x_i !}$

I will also assume that the distribution of means $\theta_i$ follows a gamma distribution with mean $\mu$ and variance parameter $K$ (in this parametrization, $\mu = \alpha/\beta$ and $K = \beta$ as opposed to the traditional $\alpha, \beta$ parametrization). This distribution has density

$f(\theta_i | \mu, K) = \dfrac{K^{\mu K}}{\Gamma(\mu K)} \theta_i^{\mu K - 1} e^{-K \theta_i}$

As shown in a previous post, the split-half correlation is then one minus the shrinkage coefficient $B$, or

$\rho = 1 - B = \left(\dfrac{n_i}{n_i + K}\right)$

So once I have an estimate $\hat{K}$ and a desired stabilization level $p$, solving for $n$ gives

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{K}$

Once again, the population variance parameter $K$ is equivalent to the 0.5 stabilization point - the point where the split half correlation should be exactly equal to 0.5, and also the point where the individual pitcher estimates are shrunk 50% of the way towards the mean.

For estimation of $mu$ and $K$, I used marginal maximum likelihood - a one dimensional introduction to maximum likelihood is given here. The marginal density of $\mu$ and $K$ is

$p(x_i | n_i, \mu, K) = \displaystyle \int_0^{\infty} \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} e^{-\theta_i (n_i + K)}  \theta_i^{x_i + \mu K - 1} d\theta_i = \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} \dfrac{\Gamma(x_i + \mu K)}{(n_i + K)^{x_i + \mu K}}$

And the log-likelihood (dropping terms that do not involve either $\mu$ or $K$) is given by

$\ell(\mu, K) =  N \mu K \log(K) - N \log(\Gamma(\mu K)) + \displaystyle \sum_{i = 1}^N \left[\log(\Gamma(x_i + \mu K)) - (x_i + \mu K) \log(n_i + K)\right]$

Once again, I wrote code to maximize this function in $R$ using a Newton-Raphson algorithm. I converted $K$ to $\phi = 1/(1 + K)$ in the equation above for estimation and then converted it back by $K = (1-\phi)/\phi$ after estimation was complete - the reason being that it makes the estimation procedure much more stable.

In performing this estimation, I had to make a choice of the minimum number of innings pitched (IP) in order to be included in the dataset. When performing a similar analysis for on-base percentage, I found that at around 300 PA, the population variance (and hence, the stabilization point) became roughly constant. Unfortunately, this is not true for starting pitchers.



The population variance in talent levels decreases consistently as a function of the minimum number of IP that are considered, and so the stabilization point $K$ increases. This means that, unlike OBP, for example, the stabilization point is always determined by what percentage of pitchers you look at (by IP) - if you look at only the top 50%, the stabilization point will be larger than the stabilization point for the top 70%.

This is reflected in the plot below - as with OBP and PA, the mean WHIP is associated with the number of IP,  but unlike with OBP, the variance around the mean is constantly changing with the mean.



For my calculation, I chose to use 80 innings pitched as my cutoff point - corresponding to approximately 15 games started and capturing slightly more than 50% of pitchers (by IP). This point was completely arbitrary, though, and other cutoffs will be equally valid depending on the question at hand.

Performing the estimation, the estimated league mean WHIP was $\hat{\mu} = 1.304$ with variance parameter $\hat{K} = 77.203$.



Once again,  95% confidence intervals for a specific stabilization level p are given as

$\left(\dfrac{p}{1-p}\right) \hat{K} \pm 1.96 \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{k})}$

From (delta-method transformed) maximum likelihood output, $Var(\hat{K}) = 29.791$ (for a standard error of $5.459$ IP). The stabilization curve, with confidence bounds, is then


Aside from the model criticisms I've already mentioned, standard ones apply - innings pitched are not identical and independent (and treating them as so is clearly much worse than treating plate appearances as identical and independent), pitchers are not machines, etc. I don't think the model is great, but it is useful. It gives confidence bounds for the stabilization point something other methods don't do. As always, comments are appreciated.

19 August, 2015

Offensive Stabilization Points through Time

Using my maximum likelihood technique for estimating stabilization points, I performed a moving calculation of the stabilization point using data from 1900 - 2014 from fangraphs.com. Each stabilization point is a six-year calculation, including the current and five previous years (so for example, 2014 incudes 2009-2014 data, 1965 includes 1959 - 1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only batters from each year with at least 300 plate appearances, and splits years for the same player. Raw counts are used, not adjusted in any form. Pitchers are excluded. My data and code is posted on my github if you would like to run it for yourself.

The "stabilization point" I have defined as the point where split-half correlation is equal to 0.5, which is equivalently where the shrinkage amount is 50% . Both of these are equal to a variance parameter $M$ in the beta-binomial model I fit, where the distribution of events given a mean $\theta_i$ is binomial for player $i$ and the underlying distribution of the $\theta_i$ follows a beta distribution with mean $\mu$ and variance parameter $M$.

Historical Plots


Trends can be see clearly in plots. For example, here is a plot of the stabilization point for home run rate from 1900 - 2014.



The effect of the dead ball era is clearly evident. A large stabilization point indicates a small variance - and during the dead ball era, there was a small variance, because most players weren't hitting home runs! More recently, the stabilization point has risen to the highest level it's been since that era.

Note that the stabilization point is should not be confused with the mean. In fact, here's a plot of the estimated league mean home run rate over the same period.


While going through peaks and valleys, the home run rate has risen fairly continuously over time - and the recent rise in home run stabilization point actually corresponds to a decrease in the mean home run rate (though interestingly, the decrease in league mean home run rate since the end of the steroid era still puts the current mean home run rate above any other preceding era).

To give another example, the stabilization point for triple rate is the lowest its been since the dead ball era - even though the league mean triple rate has decreased fairly continuously over time.





Interestingly, the stabilization points for walk rate and on-base percentage are the highest they've ever been, with walk rate having a noticeably sharp increase in recent years - one theory is that this is due to a "moneyball" effect of teams focusing much more strongly on walk rate as opposed to other statistics - indeed, the stabilization point for batting average (shown later in the article) has dropped during the same period - perhaps indicative of being more tolerant of variation in batting average but less tolerant in variation of on-base percentage (of course, pitching has grown more dominant since the end of the steroid era, which is likely adding to the effect as well).



Meanwhile, the stabilization points for double rate and extra base hit (2B + 3B) have increased over time.



But while the extra base hit rate stabilization point has decreased from the mid-2000s, while the double rate stabilization point has remained roughly the same.

The hit-by-pitch rate follows the same pattern as third base percentage -  it increased after the dead ball era, peaking in the 1930s and 1950s - but has decreased since then, and despite a small recent increase, is at its lowest stabilization point since that era.


Meanwhile, the strikeout rate stabilization point decreased fairly consistently over time, before stabilizing approximately in the 1970s, with peaks in the 1980s and early 2000s.


What Drives the Stabilization Point?


As I've shown, the mean of the underlying distribution of talents does not seem to be strongly associated with the stabilization point - the variance of the underlying distribution of talent levels is the primary factor. There is an inverse relationship - a small stabilization point indicates that there is a large variance in talent levels for that particular statistic, and a large stabilization point indicates that there is a small variance in talent levels for that statistic.

To get a clearer view of the factors that are affecting the stabilization point, here's a plot of the stabilization point for batting average (using at-bats as the denominator) versus time.


Below is an animation showing the empirical distribution of batting average with the estimated underlying distribution of talent levels in dashed lines (since I'm estimating the distribution of true batting averages and not the distribution of observed batting averages, it's okay that the dashed line is narrower than the histogram). Notice that as time goes on, the distribution gets narrower (the variance is decreasing) - this is what's driving the increase in stabilization point over time.


The opposite effect can be seen in the single rate stabilization point - it has decreased (with peaks and valleys) over time


As the distribution of single rates has become more spread out.




Graphics of all the stabilization points, league mean talent levels, and animated estimated talent distributions can be found here.

Individual Years


I also selected a few years to compare individually - some for specific reasons, some just as a representative of a certain era.

  • 1910, in the middle of the dead ball era, and a year of particularly low offensive output. 
  • 1928, to represent the 1920s and the age of Babe Ruth.
  • 1937, to represent the 1930s. 
  • 1945, the end of the second world war. 
  • 1959, to represent the 1950s.
  • 1968, the year of the pitcher.
  • 1975, six years after they lowered the mound. 
  • 1987, before the steroid era and six years after the 1981 labor stoppage. 
  • 2001, the year Barry Bonds hit 73 home runs, in the middle of the steroid era. 
  • 2014, the modern era.

\begin{array}{| c | c | c | c | c | c | c | c | c | c | c |}\hline
\textrm{Year} & \textrm{1B} & \textrm{2B}& \textrm{3B}& \textrm{XBH}& \textrm{HR} &  \textrm{SO} & \textrm{BB} & \textrm{BA} & \textrm{OBP} & \textrm{HBP}  \\ \hline
1910 & 523.02  & 348.68 & 469.42  & 274.24 & 537.41 & 130.77 &   87.33 & 285.65 & 175.27 & 259.08 \\
1928 & 442.01 &  475.05 & 583.77 &  385.43 & 102.77  &  83.26 &  82.10 & 286.25 & 173.49 & 414.86 \\
1937 & 436.17 &  597.72 & 709.13 &  463.40 &  90.61 &  73.79 &  76.94 & 344.93 & 174.84 & 723.59\\
1945 &  456.71 &  710.05 & 543.09 &  474.47 &  98.81 &  67.99 &  79.83 & 424.54 & 180.14 & 699.53 \\
1959 & 351.40 & 1059.30 & 721.82 & 794.95 &  90.18 &  58.28 &  81.20 & 430.60 & 200.31 & 414.12\\
1968 & 333.52 &  867.61 & 691.18 &  700.24  & 94.46 &  55.18 &  93.24 & 476.94 & 265.04 & 448.36\\
1975 & 246.27 &  970.40 & 646.09 &  773.63 &  85.50 &  53.19 &  73.61 & 407.65 & 204.92 & 410.99 \\
1987 & 269.39 &  949.73 & 537.13 &  801.21 &  90.23 &  52.85 &  86.22 & 541.67 & 262.28 & 430.67 \\
2001 & 255.57 & 838.16 & 482.32  & 971.61 &  95.11 &  57.37 &  76.16  & 465.84 & 196.51 & 251.47\\
2014 & 222.16 & 1025.31 & 372.50 & 1006.30 & 124.52 &  49.73 & 105.59 & 465.92 & 295.79 & 297.41 \\  \hline
\end{array}


While generally following the fuller patterns shown in the plots, the effect of major baseball events such as the dead ball era, the second world war, the lowering of the mound, and the steroid era is evident.

Remember that a smaller stabilization point indicates a larger variance among talent levels - so looking at 1968 and 1975 to see the effect of lowering the mound, for example, the spread of single, triple, and home run rates increased while the spread of double and extra-base hit rates decreased (the extra base hit rate being largely driven by the double rate). Interestingly, the spread of strikeout rates remained roughly the same, but the spread of walk rates, hit by pitch rates, batting average, and on-base percentage all increased.

Overall, a fun way to look at how offensive statistics have changed over time. Let me know what you think in comments.

13 August, 2015

More Offensive Stabilization Points

Using the same method from my previous post, I calculated some stabilization points for more offensive statistics.

All of the code and data I used may be found on my github.

For each, I used a binomial distribution for the result of the at-bat/plate appearance and a beta distribution for the distribution of talent levels. I think a different model might work better for some of these statistics (and I'll have to work through the poisson-gamma anyway when I look at pitching statistics), but this represents a decent approximation.

As I stated in a previous post, statistics that can not be constructed as a binomial event (such as wOBA) do not fall under the framework I am using, and so I have not included them in estimation. I could treat them as binomial events, fit a model, and perform estimation procedure, but I would have no idea if the results are correct or not.

All estimated stabilization points were calculated using unadjusted data from fangraphs.com for hitters from 2009-2014 with at least 300 PA, excluding pitchers.

\begin{array}{| l | l | c | c | c |} \hline
\textrm{Statistic} & \textrm{Formula} & \hat{M} & SE(\hat{M}) & \textrm{95% CI} \\ \hline
\textrm{OBP} & \textrm{(H+BB+HBP)/PA} & 295.79 & 16.41 & (263.63, 327.95)  \\
\textrm{BA} & \textrm{H/AB} & 465.92 & 34.23 & (398.83, 533.02)  \\
\textrm{SO Rate} & \textrm{SO/PA} & 49.73 & 1.92 & (45.96, 53.50)  \\
\textrm{BB Rate} & \textrm{(BB-IBB)/(PA-IBB)} & 110.91 & 4.84 & (101.44, 120.38)  \\
\textrm{1B Rate} & \textrm{1B/PA} & 222.16 & 11.32 & (199.98, 244.34)  \\
\textrm{2B Rate} & \textrm{2B/PA} & 1025.31 & 108.00 & (813.64, 1236.98)  \\
\textrm{3B Rate} & \textrm{3B/PA} & 372.5 & 26.56 & (320.44, 424.56)  \\
\textrm{XBH Rate} & \textrm{(2B+3B)/PA} & 1006.30 & 105.23 & (800.04, 1212.57)  \\
\textrm{HR Rate} & \textrm{HR/PA} & 124.52 & 5.90 & (112.95, 136.09)  \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 297.41 & 18.26 & (261.61, 333.20)  \\ \hline
\end{array}

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of talent levels, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances at which there is 50% shrinkage.

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula - so PA for all except batting average and walk rate.

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

It represents a 95% confidence interval for the number of plate appearances at which there is 50% shrinkage.

For an arbitrary stabilization level $p$, the number of required plate appearances can be estimated as

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$

And a 95% confidence interval for the required number of plate appearances is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$

Without confidence bounds, a plot of the sample size required for various stabilization levels is
 



And a plot of the stabilization level at various sample sizes is given as



This looks very similar to other plots I have seen.

Comments are appreciated. Also, I'm currently in the process of learning ggplot, so hopefully my graphics won't be as awful in the near future.