21 September, 2015

The Posterior Predictive

Let's say you do a Bayesian analysis and end up with a posterior distribution $p(\theta | y)$. What does that tell you about a new observation $\tilde{y}$ from some data-generating process that involves $\theta$? The answer can be found using the posterior predictive distribution.

The code that I used to generate the images in this article can be found on my github.

Posterior Predictive

 Given a posterior distribution $p(\theta | y)$, the posterior predictive distribution is defined as

$p(\tilde{y} | y) = \int p(\tilde{y} | \theta) p(\theta | y) d\theta$

and it represents the distribution of a new observation $\tilde{y}$ given your updated information about a parameter $\theta$ and natural variation around the observation that arises from the data-generating process.

Since applied Bayesian techniques have tended towards fully computational MCMC procedures, the posterior predictive is usually obtained through simulation - let's say you have a sample $\theta_j^*$ ($j = 1,2,...,k$) from the posterior distribution of $\theta$ and you want to know about a new observation from some process that uses the parameter $\theta$.

$\tilde{y} \sim p(\tilde{y} | \theta)$

To obtain the posterior predictive, you would simulate a set of observation $\tilde{y}^*_j$ from $p(\tilde{y} | \theta_j^*)$ (in other words, simulate a new observation from the data model for each draw from the MCMC for your parameter). The distribution of these  $\tilde{y}^*_j$ approximates the posterior predictive distribution for $\tilde{y}$.

In general, this process is very specific to the problem at hand. It is possible in a few common scenarios, however, to calculate the posterior predictive distribution analytically. One example that is useful in baseball analysis the beta-binomial model.

Beta-Binomial Example

Let's say a batter obtains a 0.300 OBP in 250 PA - that corresponds to 75 on-base events and 175 not on-base events. What can you say about the distribution of on-base events in a new set of 250 PA?

Suppose that the distribution of on-base events is given by a binomial distribution with $n = 250$ and chance of getting on base $\theta$, which is the same in both sets of PA.

$p(y | \theta) \sim Binomial(250, \theta)$

For the prior distribution, let's suppose that a $Beta(1,1)$ distribution was used - this is a uniform distribution between zero and one - so any possible value for $\theta$ is equally likely. Since the beta and binomial are conjugate distributions, the posterior distribution of  $\theta$ (the batter's chance of getting on base) is also a beta distribution:

$p(\theta| y = 75) = \dfrac{\theta^{75+1-1}(1-\theta)^{175+1-1}}{\beta(75+1,175+1)} \sim Beta(76,176)$

Now, suppose we are planning to observe another 250 PA for the same batter, and we want to know the distribution of on-base events $\tilde{y}$ in the new 250 PA. This distribution is also binomial

$p(\tilde{y} | \theta) = \displaystyle {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}}$

 The posterior predictive distribution for the number of on-base events in another 250 PA is then obtained by multiplying the two densities and integrating out $\theta$.

$p(\tilde{y} | y = 75)  = \displaystyle \int_0^1 {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}} *  \dfrac{\theta^{75}(1-\theta)^{175}}{\beta(76,176)} d\theta$

The resulting distribution is known as beta-binomial distribution, which has density

$p(\tilde{y} | y = 75)  =\displaystyle  {250 \choose \tilde{y}} \dfrac{\beta(76 + \tilde{y}, 426-\tilde{y})}{\beta(76,176)}$

(The beta-binomial distribution is obtained from the beta-binomial model - it does get a bit confusing, but they are different things - the beta-binomial model can be thought of as a binomial with extra variance)

Now I can use the posterior predictive for inference. If, for example, I wanted to know the probability that a player will have a 0.300 OBP in another 250 PA (corresponding, again, to $\tilde{y}$ = 75 on-base events) then I can calculate that as

$p(\tilde{y} = 75 | y = 75) = \displaystyle  {250 \choose 75} \dfrac{\beta(76 + 75, 426-75)}{\beta(76,176)} \approx 0.0389$

That is, our updated information says there's a 3.89% chance of getting exactly a 0.300 OBP in a new 250 PA by the same player.

The actual distribution of OBP in a new 250 PA is given by

This can also be accomplished by simulation - first, by simulating a large number $k$ of $\theta^*$ values from the posterior

$\theta^*_j \sim Beta(76,176)$

And then using those $\theta^*$ values to simulate from the data model $p(\tilde{y} | \theta^*)$

$y^*_j \sim Binomial(250, \theta^*_j)$

The estimated probability of a 0.300 OBP (equivalent to 75 on-base events) is then

$P(0.300 | y = 75) \approx \displaystyle \dfrac{\textrm{# } y^*_j \textrm{ that equal 75}}{k} $

This is much easier to do in $R$ -with $k = 1000000$, the code to quickly perform this is

     > theta <- rbeta(1000000, 76,176) #Simulate from Posterior
     > y <- rbinom(1000000, 250, theta) #Simulate from data model
     > mean(y == 75) #Estimate P(y = 75)
     [1] 0.039077

Notice that the result is very close to the analytic answer of 3.89%. The simulated posterior predictive distribution for OBP is

Which visually looks very similar to the analytic version.

What's the Use?

Aside from doing (as the name implies) prediction, the posterior predictive is very useful for model checking - if data simulated from the posterior predictive (new data) is similar to the data you used to fit the model (old data), that is evidence that the model is a good fit. Conversely, if your posterior predictive data looks nothing like your original data set, you may have misspecified your model (and I'm criminally oversimplifying here - I recommend Bayesian Data Analysis by Gelman et al. for all the detail on model-checking using the posterior predictive distribution).

14 September, 2015

Pitching Stabilization Points through Time

In the previous post, I calculated some stabilization points for pitching statistics for the past few years. In this post, I want to look at how some of those stabilization points have changed over time.

(I have previously done this for offensive statistics)

Each stabilization point is a six-year calculation, including the current and five previous years (so for example, 2014 incudes 2009-2014 data, 1965 includes 1959 - 1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only starting pitchers from each year, and for cutoff values (the minimum number of TBF, BIP, etc. to be included in the dataset) I used the same values as in my previous post. Years were split for the same player. Raw counts are used, not adjusted in any form. Relief pitchers are excluded.

All of the code that I used to create these can be found in my github, though I make no claims to efficiency or ease of operation. Because I added this code several months after the article was originally posted, I did not clean and annotate it as I normally would have - I just posted the raw code. The code is a modified form of the code used to calculate offensive stabilization points over time.

All of the plots shown below, and more, can be found in my imgur account.

Historical Plots

For some statistics, I will show plots for both the mean of a statistic over time and the stabilization point. The stabilization point driven largely by the underlying population variance of talent levels, which tends to be more difficult to estimate then the mean - hence the reason that, even with six years of moving data, the 'stabilization point' will appear to fluctuate quite a bit. I recommend not reading too much into the fluctuations, but rather looking for more general patterns.

Firstly, the ground ball, fly ball, and line drive rates (per BIP) only have recent data available. In that time, neither the fly ball or ground ball stabilization points have changed much

Line drive rate appears to have increased in recent years, however.

Though keep in mind the standard error is approximately 100 balls in play.

More interesting is batting average on balls in play, for which we have more data. The standard error for BABIP is approximately 500 balls in play, so it's not wise to trust small fluctuations in this plot as representing real shifts - however, it does appear that there is a positive trend in the stabilization point, indicative of the spread in BABIP values getting smaller. (A plot with 95% error bounds at each point can be found here, though I don't necessarily care for it)

The mean is easier to estimate with more accuracy - and  it shows that batting average on balls in play is at its highest point in history.

 An animated plot shows how the mean and variance of the observed (histogram) and estimated true talent (dashed line) distributions have changed over time.
As I've previously mentioned, the primary driving force for stabilization points is the underlying population variance. For example, take strikeout rate (per batter faced): since the dead ball era, it has followed a pattern of fairly consistent decrease (with a recent upsurge that still places it within previously observed ranges).

 Over time, however, the  mean strikeout rate (per batter faced) has been on the increase.

What does coincide with the increase in stabilization point is the decrease in population variance over time, as seen in this animated plot with the observed strikeout rates (histogram) and estimated true talent distribution (dashed line) - the spread in both is constantly increasing over time.

Also interesting is the earned run rate (per inning pitched, min 80 IP).

Beginning in the early 2000s, it dropped to a very low point, relative to its history, and has remained there more consistently than in the past. Meanwhile, the stabilization point for walk rate (min 400 BF) has increased in recent years, after reaching a maximum in the 1980s and decreasing.

On-base percentage and hit by-pitch rate have all fluctuated within a relatively stable area over time.

Though interestingly, both (per batter faced) took a dip in the 1960 that corresponds to an increase in the mean hit-by-pitch rate and a decrease in the mean on-base-percentage.

For some statistics, such as WHIP and home run rate, and it is difficult to discern a pattern other than fluctuations within a certain range.

An interesting look at how certain things have changed over time - though, as I mentioned before, I would encourage not reading too much into these plots.

03 September, 2015

More Pitching Stabilization Points

Using the beta-binomial model (notated BB) or the gamma-Poisson model (notated GP, and in this post what I call M is what in the previous post I called K - the variance parameter of the population talent distribution), I calculated the stabilization point for some more pitching statistics. I don't think the model(s) fit perfectly to the data, but they provide a good approximation that generally matches up with results I've seen elsewhere on the web.

Data was acquired from fangraphs.com. I only considered starting pitchers from 2009 - 2014, splitting the same pitcher between years, and did not adjust the data in any way.

All the data and code I used here may be found on my github. I make no claims to efficiency or ease of use.

The "cutoff value" is the minimum number of the denominator (IP, TBF, BIP, etc.) in a year in order to be included in the data set. These numbers were chosen somewhat arbitrarily, and for some of my statistics, changing the cutoff value will change the stabilization point. I'm not sure which statistics this will happen to - I know WHIP for sure, and I suspect ER as well, whereas I think BABIP doesn't exhibit this tendency. It's a function of the change (or lack thereof) in population variance of talent levels as the cutoff value increases - if somebody wants to take a look at it, it would be neat.

I wanted have a little fun and apply the model to stats where it clearly is silly to do so, such as win rate (I defined as wins per game started) and extra batters faced per inning (the total number of additional batters a pitcher faced beyond what is required by their IP). The model still produces estimates, but of course, but bad data fed into a good model doesn't magically produce good analysis.

\begin{array}{| l | l | c | c | c | c  | c |} \hline
\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\textrm{95% CI}&\textrm{Cutoff}&\textrm{Model}\\ \hline
\textrm{GB Rate}&\textrm{GB/BIP}&65.52&3.63&(58.39,72.64)&300&BB\\
\textrm{FB Rate}&\textrm{FB/BIP}&61.96&3.42&(55.25,68.66)&300&BB\\
\textrm{LD Rate}&\textrm{LD/BIP}&768.42&94.10&(583.99,952.86)&300&BB\\
\textrm{HR/FB Rate}&\textrm{HR/FB}&505.11&93.95&(320.96,689.26)&100&BB\\
\textrm{SO Rate}&\textrm{SO/TBF}&90.94&5.04&(81.06,100.82)&400&BB\\
\textrm{HR Rate}&\textrm{HR/TBF}&931.59&107.80&(720.30,1142.88)&400&BB\\
\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}&221.25&14.43&(192.97,249.53)&400&BB\\
\textrm{HBP Rate}&\textrm{HBP/TBF}&989.30&119.95&(754.21,1224.41)&400&BB\\
\textrm{Hit rate}&\textrm{H/TBF}&623.35&57.57&(510.51,736.18)&400&BB\\
\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}&524.73&44.96&(436.62,612.84)&400&BB\\
\textrm{Win Rate}&\textrm{W/GS}&57.23&8.68&(40.21,74.24)&15&BB\\
\textrm{WHIP}&\textrm{(H + BB)/IP**}&77.20&5.46&(66.50,87.90)&80&GP\\
\textrm{ER Rate}&\textrm{ER/IP**}&59.55&3.94&(51.82,67.25)&80&GP\\
\textrm{Extra BF}&\textrm{(TBF - 3IP**)/IP**}&73.00&5.08&(63.05,82.95)&80&GP\\ \hline

* I'm not exactly sure what combinations of statistics fangraphs is using for the denominator of their BABIP - it's not BIP = GB + FB + LD. I know the numerator of H - HR is correct, but the denominator was usually smaller , though sometimes larger, than BIP. I solved for what fangraphs was using and used that in my calculations - if somebody wants to let me know exactly what they're using for n, please do.

** When dividing by IP, I corrected the 0.1 and 0.2 decimal representations to 0.33 and 0.67. 

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of talent levels, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances at which there is 50% shrinkage.

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula.

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

It represents a 95% confidence interval for the number of plate appearances at which there is 50% shrinkage.

For an arbitrary stabilization level $p$, the number of required plate appearances can be estimated as

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$

And a 95% confidence interval for the required number of plate appearances is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$

Since the denominators are so different (as opposed to offensive statistics where PA was used for almost everything except for batting average, and AB are fairly close to PA), I don't feel as comfortable putting everything on the same plot. That being said, the stats that use TBF look like

And the stats that use BIP for their denominator look like

As always, comments are appreciated.

02 September, 2015

2015 Win Prediction Totals (Through August)

These predictions are based on my own method (which can be improved). I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), and I think by this point in the season the actual coverage should be close to that.

Intervals are inclusive. All win totals assume a 162 game schedule.

\begin{array} {c c c c} 
\textrm{Team}  & \textrm{Lower}  & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline

ATL & 61 & 66.56 & 72 & 65.57 \\
ARI & 73 & 78.98 & 85 & 83.48 \\
BAL & 72 & 77.94 & 84 & 83.3 \\
BOS & 70 & 75.88 & 82 & 77.71 \\
CHC & 85 & 90.58 & 96 & 83.91 \\
CHW & 70 & 76.24 & 82 & 74.78 \\
CIN & 63 & 68.64 & 75 & 74.14 \\
CLE & 74 & 80.20 & 86 & 82.03 \\
COL & 61 & 67.29 & 73 & 70.14 \\
DET & 69 & 74.75 & 81 & 74.66 \\
HOU & 84 & 89.99 & 96 & 91.79 \\
KCR & 92 & 97.84 & 104 & 90.34 \\
LAA & 74 & 79.95 & 86 & 78.16 \\
LAD & 84 & 90.31 & 96 & 87.66 \\
MIA & 61 & 66.79 & 73 & 74.49 \\
MIL & 64 & 69.71 & 76 & 74.48 \\
MIN & 77 & 82.69 & 89 & 79.41 \\
NYM & 84 & 89.67 & 95 & 87.19 \\
NYY & 83 & 89.34 & 95 & 87.76 \\
OAK & 68 & 73.33 & 79 & 82.77 \\
PHI & 58 & 63.87 & 70 & 64.11 \\
PIT & 91 & 97.21 & 103 & 89.43 \\
SDP & 73 & 78.54 & 84 & 75.97 \\
SEA & 69 & 74.32 & 80 & 71.95 \\
SFG & 80 & 85.61 & 91 & 86.79 \\
STL & 98 & 103.63 & 109 & 97.33 \\
TBR & 76 & 81.45 & 87 & 80.79 \\
TEX & 78 & 83.60 & 90 & 79.00 \\
TOR & 87 & 92.88 & 98 & 98.67 \\
WSN & 77 & 82.61 & 89 & 84.04 \\  \hline\end{array}

To explain the difference between "Mean" and "True Win Total"  - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (at the end of August).

These quantiles are based off of a distribution - I've uploaded a picture of each team's distribution to imgur. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky.