<b>Probabilaball</b>: A blog for probability and statistics within a baseball framework.<br /><br /><b>2021 Stabilization Points</b><br />Posted by rcfoster, 2021-04-02<br /><br /><script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> These are my estimated stabilization points for the 2021 MLB season, once again using the maximum likelihood method on the totals that I used for previous years. This method is explained in my articles <a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">Estimating Theoretical Stabilization Points</a> and <a href="http://www.probabilaball.com/2015/08/whip-stabilization-by-gamma-poisson.html" target="_blank">WHIP Stabilization by the Gamma-Poisson Model</a>.<br /><p>However, good news! In the past two years, I've had some research on reliability for non-normal data corrected, expanded upon, and published in academic journals. I can definitively say that my maximum likelihood estimator is estimating the reliability of these statistics in exactly the same sense as Cronbach's alpha or KR-20, and that it performs as well as or better than Cronbach's alpha, assuming the model is correct, which - while no model is exactly correct - I believe is very accurate. The article <a href="https://www.sciencedirect.com/science/article/abs/pii/S002224962030016X" target="_blank">can be found here</a> (for the preprint, <a href="https://psyarxiv.com/4j9vt/" target="_blank">click here</a>). 
I also published a paper with some KR-20 and KR-21 reliability estimators specifically for exponential family distributions such as binomial, Poisson, etc. The article <a href="https://journals.sagepub.com/doi/abs/10.1177/0013164421992535" target="_blank">can be found here</a> (for the preprint, <a href="https://psyarxiv.com/63a9k/" target="_blank">click here</a>). These estimators are a little more efficient for small sample sizes, but for large sample sizes such as in this case the estimators should be nearly identical. <br /></p><p>As usual, all data and code I used for this post <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2021-Stabilization-Points" target="_blank">can be found on my github</a>. I make no claims about the stability, efficiency, or optimality of my code.<br /><br />I've included standard error estimates for 2021, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found <a href="http://probabilaball.blogspot.com/2015/08/more-offensive-stabilization-points.html" target="_blank">here for batting statistics</a> and <a href="http://probabilaball.blogspot.com/2015/09/more-pitching-stabilization-points.html" target="_blank">here for pitching statistics</a>. The calculations for 2016 can be found <a href="http://www.probabilaball.com/2016/03/2016-stabilization-points.html">here</a>. The 2017 calculations can be found <a href="http://www.probabilaball.com/2017/04/2017-stabilization-points.html">here</a>. The 2018 calculations can be found <a href="http://www.probabilaball.com/2018/09/2018-stabilization-points.html">here</a>. The 2019 calculations can be found <a href="http://www.probabilaball.com/2019/04/2019-stabilization-points.html">here</a>. 
I didn't do calculations in 2020 because of the pandemic in general.<br /></p><p><br />The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years' worth of data (2015-2020), were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.<br /><br /></p><h2><b>Offensive Statistics</b></h2><p><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2019\textrm{ }\hat{M} \\ \hline<br />\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 302.57 & 18.39 & 0.331 & 300 & 295.20 \\<br />\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 451.24 & 47.22 & 0.306 & 300 & 431.49 \\ <br />\textrm{BA}&\textrm{H/AB} & 511.71 & 42.78 & 0.265 & 300 & 488.49 \\<br />\textrm{SO Rate}&\textrm{SO/PA} & 50.37 & 2.12 & 0.205 & 300 & 49.05 \\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 100.47 & 4.67 & 0.080 & 300 & 104.08 \\<br />\textrm{1B Rate}&\textrm{1B/PA} & 191.17 & 10.20 & 0.150 & 300 & 197.43 \\<br />\textrm{2B Rate}&\textrm{2B/PA} & 1242.67 & 162.27 & 0.047 & 300 & 1200.46 \\<br />\textrm{3B Rate}&\textrm{3B/PA} & 481.11 & 28.74 & 0.005 & 300 & 421.91 \\<br />\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1059.31 & 124.09 & 0.052 & 300 & 1070.09 \\<br />\textrm{HR Rate} & \textrm{HR/PA} & 146.00 & 7.68 & 0.034 & 300 & 141.80\\<br />\textrm{HBP Rate} & \textrm{HBP/PA} & 261.13 & 16.56 & 0.010 & 300 & 266.92 \\ \hline <br />\end{array}<br /><br />In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. 
Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.</p><p>The stabilization point of the 3B rate increased dramatically by approximately two standard deviations, indicating that the talent level of hitting triples has clustered more closely around its mean. In general, however, most stabilization points are roughly the same as the previous year, taking into account that year-to-year and sample-to-sample variation in estimates is expected even if the true stabilization points are not changing.<br /><br /></p><h2><b>Pitching Statistics </b></h2><p><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}& 2019 \textrm{ }\hat{M} \\ \hline<br />\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1061.43 & 197.34 & 0.286 &300& 1184.38 \\<br />\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 66.20 & 4.25 & 0.443 &300& 64.51\\<br />\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 62.33 & 3.97 & 0.346 &300& 60.68 \\<br />\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1773.66 & 486.12 & 0.211 &300& 2197.02 \\<br />\textrm{HR/FB Rate}&\textrm{HR/FB}& 529.40 & 129.10 & 0.130 & 100 & 351.53 \\<br />\textrm{SO Rate}&\textrm{SO/TBF}& 80.78 & 4.97 & 0.214 &400& 90.86 \\<br />\textrm{HR Rate}&\textrm{HR/TBF}& 959.57 & 133.073 & 0.031 &400& 764.48\\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 251.22 & 19.47 & 0.072 & 400 & 230.09 \\<br />\textrm{HBP Rate}&\textrm{HBP/TBF}& 1035.90 & 153.68 & 0.009 &400& 906.25 \\<br />\textrm{Hit rate}&\textrm{H/TBF}& 453.30 & 37.52 & 0.232 &400& 496.56 \\<br />\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 407.36 & 36.33 & 0.313 &400& 443.60 \\<br />\textrm{WHIP}&\textrm{(H + BB)/IP*}& 63.38 & 4.79 & 1.29 &80& 67.84 \\<br />\textrm{ER Rate}&\textrm{ER/IP*}& 
57.73 & 4.30 & 0.460 &80& 57.97 \\<br />\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 64.70 & 4.92 & 1.23 &80& 67.23 \\ \hline<br />\end{array}<br /><br /><i>* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively. </i><br /><br /></p><p>Most are the same, but the HR/FB stabilization point has shifted up dramatically given its standard error, indicating a likely change in true talent level and not just sample-to-sample and year-to-year variation. This indicates that the distribution of HR/FB talent levels is clustering around its mean, possibly indicating a change in approach by pitchers or batters over the past two years. The mean has also shifted up over the previous calculation. Similarly, the HR rate stabilization point and mean have increased. Conversely, the strikeout rate stabilization point has decreased, indicating less clustering of talent levels around the mean, and the mean has also increased.<br /></p><h2>Usage</h2><h2> </h2> Aside from the obvious use of knowing approximately when results are half due to luck and half skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">From Stabilization to Interval Estimation</a>.<br /><br />For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. 
A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.331*302.57}{300 + 302.57} \pm 1.96 \sqrt{\dfrac{0.331(1-0.331)}{302.57 + 300}} = (0.318,0.393)$ </div><br />That is, we believe the player's true on-base percentage to be between 0.318 and 0.393 with 95% confidence. I used a normal distribution for talent levels with a normal approximation to the binomial for the distribution of observed OBP, but that is not the only possible choice - it just resulted in the simplest formulas for the intervals.<br /><br />Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. A 95% prediction interval for his OBP over those PA is given by<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.331*302.57}{300 + 302.57} \pm 1.96 \sqrt{\dfrac{0.331(1-0.331)}{302.57+ 300} + \dfrac{0.331(1-0.331)}{250}} = (0.286,0.425)$ </div><br />That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.286 and 0.425. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% "unshrunk" intervals, <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">but when I tested them in my article "From Stabilization to Interval Estimation,"</a> they worked well for prediction.<br /><br />As usual, all my data and code <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2019-Stabilization-Points" target="_blank">can be found on my github</a>. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). 
The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1) <br />$\$$Parameters<br />[1] 0.3306363 302.5670532<br /><br />$\$$Standard.Error<br />[1] 18.38593<br /><br />$\$$Model<br />[1] "Beta-Binomial"</span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Pbs09suNrQE/YGfRx6btRzI/AAAAAAAAA4I/ctV5By6gfwAn3O8t6qMOTFY4mf6XMYs2gCLcBGAsYHQ/s756/OBP.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="756" data-original-width="756" height="400" src="https://1.bp.blogspot.com/-Pbs09suNrQE/YGfRx6btRzI/AAAAAAAAA4I/ctV5By6gfwAn3O8t6qMOTFY4mf6XMYs2gCLcBGAsYHQ/w400-h400/OBP.jpeg" width="400" /></a></div><br /><br />The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula<br /><br /><div style="text-align: center;">$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$</div><br />which is obtained from applying <a href="http://probabilaball.blogspot.com/2015/06/the-delta-method-for-confidence.html" target="_blank">the delta method</a> to the function $p(\hat{M}) = n/(n + \hat{M})$. 
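<br /><br />As a rough illustration of the formula above, here is a minimal R sketch of the delta-method bounds. This is my own stand-alone example, not the stabilize function from the github code; the helper name stabilization_bounds and its arguments are made up for this illustration, and the inputs are the 2021 OBP estimates from the offensive table.

```r
# Hypothetical helper (not the author's stabilize() function): delta-method
# confidence bounds for the stabilization level p(M) = n / (n + M).
stabilization_bounds <- function(n, M_hat, se_M, z = 1.96) {
  p_hat <- n / (n + M_hat)            # estimated stabilization at n events
  se_p  <- n / (n + M_hat)^2 * se_M   # |dp/dM| evaluated at M_hat, times SE(M_hat)
  data.frame(n = n,
             estimate = p_hat,
             lower = p_hat - z * se_p,
             upper = p_hat + z * se_p)
}

# 2021 OBP estimates from the offensive table: M_hat = 302.57, SE = 18.39
stabilization_bounds(c(100, 300, 600), M_hat = 302.57, se_M = 18.39)
```

When $n = \hat{M}$ the estimate is exactly 0.5 by construction, which is the "half luck, half skill" point that defines the stabilization point.<br /><br />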
Note that the mean and prediction intervals I gave do <i>not</i> take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.<br /><br />As always, feel free to post any comments or suggestions.<br /><br /><br /><b>2019 Stabilization Points</b><br />Posted by rcfoster, 2019-04-21<br /><br /><script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> These are my estimated stabilization points for the 2019 MLB season, once again using the maximum likelihood method on the totals that I used for previous years. This method is explained in my articles <a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">Estimating Theoretical Stabilization Points</a> and <a href="http://www.probabilaball.com/2015/08/whip-stabilization-by-gamma-poisson.html" target="_blank">WHIP Stabilization by the Gamma-Poisson Model</a>.<br /><br />(As usual, all data and code I used <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2019-Stabilization-Points" target="_blank">can be found on my github</a>. I make no claims about the stability, efficiency, or optimality of my code.) 
<br /><br />I've included standard error estimates for 2019, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found <a href="http://probabilaball.blogspot.com/2015/08/more-offensive-stabilization-points.html" target="_blank">here for batting statistics</a> and <a href="http://probabilaball.blogspot.com/2015/09/more-pitching-stabilization-points.html" target="_blank">here for pitching statistics</a>. The calculations for 2016 can be found <a href="http://www.probabilaball.com/2016/03/2016-stabilization-points.html">here</a>. The 2017 calculations can be found <a href="http://www.probabilaball.com/2017/04/2017-stabilization-points.html">here</a>. The 2018 calculations can be found <a href="http://www.probabilaball.com/2018/09/2018-stabilization-points.html">here</a>.<br /><br />The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. 
These cutoff values, and the choice of 6 years worth of data (2013-2018), were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.<br /><br /><h2><b>Offensive Statistics</b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2018\textrm{ }\hat{M} \\ \hline<br />\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 295.20 & 16.26 & 0.329 & 300 & 302.27 \\<br />\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 431.49 & 39.76 & 0.306 & 300 & 429.47 \\ <br />\textrm{BA}&\textrm{H/AB} & 488.49 & 36.52 & 0.264 & 300 & 463.19 \\<br />\textrm{SO Rate}&\textrm{SO/PA} & 49.05 & 1.88 & 0.198 & 300 & 48.74 \\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 104.08 & 4.45 & 0.078 & 300 & 108.84 \\<br />\textrm{1B Rate}&\textrm{1B/PA} & 197.43 & 9.72 & 0.154 & 300 & 200.94 \\<br />\textrm{2B Rate}&\textrm{2B/PA} & 1200.46 & 140.37 & 0.047 & 300 & 1164.82 \\<br />\textrm{3B Rate}&\textrm{3B/PA} & 421.91 & 31.67 & 0.005 & 300 & 390.75 \\<br />\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1070.09 & 115.96 & 0.052 & 300 & 1064.01 \\<br />\textrm{HR Rate} & \textrm{HR/PA} & 141.80 & 6.78 & 0.030 & 300 & 132.52 \\<br />\textrm{HBP Rate} & \textrm{HBP/PA} & 266.92 & 15.74 & 0.009 & 300 & 280.00 \\ \hline <br />\end{array}<br /><br />In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. 
Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.<br /><br />Noticeably, the stabilization point for the HR rate has increased over the past four years, indicating less variance in talent level of hitting home runs. Meanwhile, the stabilization point for HBP rate has decreased over the past four years, suggesting increased variance in """talent""" level of getting hit by pitches.<br /><br /><h2><b>Pitching Statistics </b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2018 \textrm{ }\hat{M} \\ \hline<br />\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1184.38 & 206.63& 0.288 &300&1322.70 \\<br />\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 64.51 & 3.66 & 0.446 &300&63.12 \\<br />\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}&60.68 &3.41 & 0.344 &300&59.80 \\<br />\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 2197.02 & 622.02 & 0.210 &300&2157.15 \\<br />\textrm{HR/FB Rate}&\textrm{HR/FB}& 351.53 & 56.05 & 0.117 & 100 & 388.61 \\<br />\textrm{SO Rate}&\textrm{SO/TBF}& 90.86 &5.07& 0.204&400&93.52 \\<br />\textrm{HR Rate}&\textrm{HR/TBF}&764.48& 82.78 & 0.028 &400&790.97 \\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 230.09 & 15.46 & 0.071 &400&238.70 \\<br />\textrm{HBP Rate}&\textrm{HBP/TBF}& 906.25 & 109.63 & 0.009 &400&935.61 \\<br />\textrm{Hit rate}&\textrm{H/TBF}&496.56 & 39.48 & 0.233 &400&536.32 \\<br />\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 443.60 & 36.42 & 0.312 &400& 472.09 \\<br />\textrm{WHIP}&\textrm{(H + BB)/IP*}&67.84 & 4.69 & 1.28 &80& 71.10 \\<br />\textrm{ER Rate}&\textrm{ER/IP*}& 57.97 & 3.87 & 0.444 &80& 58.59 \\<br />\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 67.23 & 4.64 & 1.22 &80& 69.11 \\ \hline<br />\end{array}<br /><br /><i>* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively. 
</i><br /><br />Most statistics this year shifted not just in stabilization point, but also in mean, possibly indicating a shift in the pitching environment. The stabilization points which did shift tended to shift down, indicating an increased spread of variation around the mean talent levels.<br /><br /><h2>Usage</h2><h2> </h2><br />Aside from the obvious use of knowing approximately when results are half due to luck and half skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">From Stabilization to Interval Estimation</a>.<br /><br />For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*295.20}{300 + 295.20} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{295.20 + 300}} = (0.317,0.392)$ </div><br />That is, we believe the player's true on-base percentage to be between 0.317 and 0.392 with 95% confidence. I used a normal distribution for talent levels with a normal approximation to the binomial for the distribution of observed OBP, but that is not the only possible choice - it just resulted in the simplest formulas for the intervals.<br /><br />Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. 
A 95% prediction interval for his OBP over those PA is given by<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*295.20}{300 + 295.20} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{295.20 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$ </div><br />That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% "unshrunk" intervals, <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">but when I tested them in my article "From Stabilization to Interval Estimation,"</a> they worked well for prediction.<br /><br />As usual, all my data and code <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2019-Stabilization-Points" target="_blank">can be found on my github</a>. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). 
It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1) <br />$\$$Parameters<br />[1] 0.3285272 295.1970047<br /><br />$\$$Standard.Error<br />[1] 16.25874<br /><br />$\$$Model<br />[1] "Beta-Binomial"</span><br /><br /><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-GMZjno0yGro/XLzdQF6krTI/AAAAAAAAAns/GQ-yuG10ajwGhH3-GtlV-Yyf95GA7XwFACK4BGAYYCw/s1600/Stabilization.jpeg"><img border="0" height="398" src="https://3.bp.blogspot.com/-GMZjno0yGro/XLzdQF6krTI/AAAAAAAAAns/GQ-yuG10ajwGhH3-GtlV-Yyf95GA7XwFACK4BGAYYCw/s400/Stabilization.jpeg" width="400" /></a></div><br />The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula<br /><br /><div style="text-align: center;">$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$</div><br />which is obtained from applying <a href="http://probabilaball.blogspot.com/2015/06/the-delta-method-for-confidence.html" target="_blank">the delta method</a> to the function $p(\hat{M}) = n/(n + \hat{M})$. 
Note that the mean and prediction intervals I gave do <i>not</i> take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.<br /><br />As always, feel free to post any comments or suggestions.<br /><br /><b>2018 Stabilization Points</b><br />Posted by rcfoster, 2018-09-05<br /><br /><script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> So this post is waaaaay late in the 2018 season. I've been busy! But, I'm doing this again since it's pretty easy to do. But I am copying and pasting the text from the posts from the last two years, because I can.<br /><br />These are my estimated stabilization points for the 2018 MLB season, once again using the maximum likelihood method on the totals that I used for previous years. This method is explained in my articles <a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">Estimating Theoretical Stabilization Points</a> and <a href="http://www.probabilaball.com/2015/08/whip-stabilization-by-gamma-poisson.html" target="_blank">WHIP Stabilization by the Gamma-Poisson Model</a>.<br /><br />(As usual, all data and code I used <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2017-Stabilization-Points" target="_blank">can be found on my github</a>. I make no claims about the stability, efficiency, or optimality of my code.) 
<br /><br />I've included standard error estimates for 2018, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found <a href="http://probabilaball.blogspot.com/2015/08/more-offensive-stabilization-points.html" target="_blank">here for batting statistics</a> and <a href="http://probabilaball.blogspot.com/2015/09/more-pitching-stabilization-points.html" target="_blank">here for pitching statistics</a>. The calculations for 2016 can be found <a href="http://www.probabilaball.com/2016/03/2016-stabilization-points.html">here</a>. The 2017 calculations can be found <a href="http://www.probabilaball.com/2017/04/2017-stabilization-points.html">here</a>.<br /><br />The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. 
These cutoff values, and the choice of 6 years worth of data (2012-2017), were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.<br /><br /><h2><b>Offensive Statistics</b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2017\textrm{ }\hat{M} \\ \hline<br />\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 302.27 & 16.88 & 0.329 & 300 & 303.77\\<br />\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 429.47 & 39.30 & 0.306 & 300 & 442.62 \\ <br />\textrm{BA}&\textrm{H/AB} & 463.19 & 33.94 & 0.266 & 300 & 466.09 \\<br />\textrm{SO Rate}&\textrm{SO/PA} & 48.74 & 1.88 & 0.194 & 300 & 49.02\\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 108.84 & 4.72 & 0.077 & 300 & 113.64 \\<br />\textrm{1B Rate}&\textrm{1B/PA} & 200.94 & 9.99 & 0.156 & 300 & 215.29\\<br />\textrm{2B Rate}&\textrm{2B/PA} & 1164.82 & 134.26 & 0.047 & 300 & 1230.96 \\<br />\textrm{3B Rate}&\textrm{3B/PA} & 390.75 & 28.72 & 0.005 & 300 & 358.92\\<br />\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1064.01 & 115.55 & 0.052 & 300 & 1063.76 \\<br />\textrm{HR Rate} & \textrm{HR/PA} & 132.52 & 6.31 & 0.030 & 300 & 129.02 \\<br />\textrm{HBP Rate} & \textrm{HBP/PA} & 280.00 & 16.89 & 0.009 & 300 & 299.39 \\ \hline <br />\end{array}<br /><br />In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. 
Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.<br /><br /><h2><b>Pitching Statistics </b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2016 \textrm{ }\hat{M} \\ \hline<br />\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1322.70 & 244.54 & 0.289 &300&1356.06 \\<br />\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 63.12 & 3.55 & 0.450 &300& 63.12 \\<br />\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 59.86 &3.34 & 0.341 &300&59.80 \\<br />\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 2157.15 & 586.96 & 0.209 &300& 1497.65 \\<br />\textrm{HR/FB Rate}&\textrm{HR/FB}& 388.61 & 65.28 & 0.115 &100&464.60 \\<br />\textrm{SO Rate}&\textrm{SO/TBF}& 93.52 &5.25& 0.199&400&94.62\\<br />\textrm{HR Rate}&\textrm{HR/TBF}&790.97 & 86.34 & 0.029 &400&942.62 \\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}&238.70 & 16.10 & 0.070 &400&237.53 \\<br />\textrm{HBP Rate}&\textrm{HBP/TBF}& 935.61 & 115.06 & 0.008 &400&954.09 \\<br />\textrm{Hit rate}&\textrm{H/TBF}& 536.32 & 43.99 & 0.235 &400&550.69 \\<br />\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}&472.09 & 39.51 & 0.313 &400& 496.39 \\<br />\textrm{WHIP}&\textrm{(H + BB)/IP*}& 71.10 & 4.96 & 1.29 &80& 74.68 \\<br />\textrm{ER Rate}&\textrm{ER/IP*}& 58.59 & 3.91 & 0.447 &80& 62.82 \\<br />\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 69.11 & 4.79 & 1.22 &80& 73.11\\ \hline<br />\end{array}<br /><br /><i>* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively. </i><br /><br />Most statistics are roughly the same; however, the line drive stabilization point has increased quite a bit, having doubled in 2016 from 2015. This is not a mistake - it corresponds to a decrease in the variance of line drive rates. Noticeably, the HR rate variance increased, and so the HR rate stabilization point decreased. 
This indicates a shift in the MLB pitching environment in these particular areas, and points to a weakness in the method - if the underlying league distribution of talent level of a statistic is changing rapidly, this method will fail to account for the change and may be inaccurate.<br /><h2> </h2><h2>Usage</h2><h2> </h2><br />Aside from the obvious use of knowing approximately when results are half due to luck and half from skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">From Stabilization to Interval Estimation</a>. I believe the confidence intervals from my method should be similar to the intervals from Sean Dolinar's great fangraphs article <a href="http://www.fangraphs.com/blogs/a-new-way-to-look-at-sample-size/">A New Way to Look at Sample Size</a>, though I have not personally tested this, and am not familiar with the Cronbach's alpha methodology he uses (or with reliability analysis in general).<br /><br />For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300}} = (0.317,0.392)$ </div><br />That is, we believe the player's true on-base percentage to be between 0.317 and 0.392 with 95% confidence. 
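<br /><br />For anyone who wants to reproduce this calculation, here is a minimal R sketch of the shrunken interval. It is only an illustration of the formula above under the same normal-normal assumptions; the function name shrunk_interval and its argument names are hypothetical and do not come from the github code.

```r
# Hypothetical helper: shrunken point estimate and 95% interval for a rate,
# given x successes in n events, league mean mu, and stabilization point M.
shrunk_interval <- function(x, n, mu, M, z = 1.96) {
  center <- (x + mu * M) / (n + M)       # estimate shrunk towards the league mean
  se     <- sqrt(mu * (1 - mu) / (M + n))  # standard error used in the formula above
  c(lower = center - z * se, estimate = center, upper = center + z * se)
}

# The example above: 114 on-base events in 300 PA, mu = 0.329, M = 301.32
round(shrunk_interval(114, 300, mu = 0.329, M = 301.32), 3)
```

The lower and upper entries should round to the same (0.317, 0.392) interval given above.<br /><br />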
I used a normal distribution for talent levels with a normal approximation to the binomial for the distribution of observed OBP, but that is not the only possible choice - it just resulted in the simplest formulas for the intervals.<br /><br />Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. A 95% prediction interval for his OBP over those PA is given by<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$ </div><br />That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% unshrunk intervals, <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">but when I tested them in my article "From Stabilization to Interval Estimation"</a>, they worked well for prediction.<br /><br />As usual, all my data and code <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2018-Stabilization-Points" target="_blank">can be found on my github</a>. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). 
It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.<br /><br /><span style="font-family: 'courier new', 'courier', monospace;">> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1) <br />$\$$Parameters<br />[1] 0.329098 301.317682<br /><br />$\$$Standard.Error<br />[1] 16.92138<br /><br />$\$$Model<br />[1] "Beta-Binomial"</span><br /><br /><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-xgucIosXSK8/Vur_nBEO17I/AAAAAAAAAco/E6CM2RtCGaU3L8DhABUtFAuLYVh7Zt0zQ/s1600/OBP%2BStabilization.jpeg"><img border="0" height="397" src="https://3.bp.blogspot.com/-xgucIosXSK8/Vur_nBEO17I/AAAAAAAAAco/E6CM2RtCGaU3L8DhABUtFAuLYVh7Zt0zQ/s400/OBP%2BStabilization.jpeg" width="400" /></a></div><br />The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula<br /><br /><div style="text-align: center;">$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$</div><br />which is obtained by applying <a href="http://probabilaball.blogspot.com/2015/06/the-delta-method-for-confidence.html" target="_blank">the delta method</a> to the function $p(\hat{M}) = n/(n + \hat{M})$. Note that the mean and prediction intervals I gave do <i>not</i> take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.<br /><br />Maybe somebody else out there might find this useful. 
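<br /><br />The delta-method bounds described above can be sketched in a few lines of R. This is not the <i>stabilize</i> function itself, just the formula applied to the OBP estimates printed in the output; the grid of event counts and the variable names are assumptions made for illustration.

```r
# Delta-method 95% bounds for the stabilization curve p(M) = n / (n + M).
# M_hat and se_M are the OBP estimates from the printed output; n_events is illustrative.
M_hat    <- 301.317682
se_M     <- 16.92138
n_events <- c(100, 300, 600)

p_hat  <- n_events / (n_events + M_hat)                    # estimated stabilization level
half_w <- 1.96 * n_events / (n_events + M_hat)^2 * se_M    # |dp/dM| times SE, times 1.96
bounds <- cbind(n = n_events,
                lower = p_hat - half_w,
                estimate = p_hat,
                upper = p_hat + half_w)

round(bounds, 3)
```

At 300 events the estimate is close to one half, as expected when $n$ is near $\hat{M}$.<br /><br />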
As always, feel free to post any comments or suggestions!<br /><br />Posted by rcfoster (http://www.blogger.com/profile/09317049446493200529)<br /><br /><b>2017 Stabilization Points</b> (published 2017-04-24T23:45:00.000-05:00, updated 2017-04-24T23:49:08.125-05:00)<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <br /><br />Once again, I recalculated stabilization points for 2017 MLB data, using the same maximum likelihood method on the totals that I used for 2015 and 2016. This method is explained in my articles <a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">Estimating Theoretical Stabilization Points</a> and <a href="http://www.probabilaball.com/2015/08/whip-stabilization-by-gamma-poisson.html" target="_blank">WHIP Stabilization by the Gamma-Poisson Model</a>.<br /><br />(As usual, all data and code I used <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2017-Stabilization-Points" target="_blank">can be found on my github</a>. I make no claims about the stability, efficiency, or optimality of my code.) <br /><br />I've included standard error estimates for 2017, but these should not be used to perform any kinds of tests or intervals to compare to the values from previous years, as those values are estimates themselves with their own standard errors, and approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found <a href="http://probabilaball.blogspot.com/2015/08/more-offensive-stabilization-points.html" target="_blank">here for batting statistics</a> and <a href="http://probabilaball.blogspot.com/2015/09/more-pitching-stabilization-points.html" target="_blank">here for pitching statistics</a>. 
The calculations for 2016 can be found <a href="http://www.probabilaball.com/2016/03/2016-stabilization-points.html">here</a>.<br /><br />The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. These cutoff values, and the choice of 6 years worth of data, were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.<br /><br /><h2><b>Offensive Statistics</b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2016\textrm{ }\hat{M} \\ \hline<br />\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 303.77 & 17.08 & 0.328 & 300 & 301.32 \\<br />\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 442.62 & 40.55 & 0.306 & 300 & 433.04\\ <br />\textrm{BA}&\textrm{H/AB} & 466.09 & 34.30 & 0.266 & 300 & 491.20\\<br />\textrm{SO Rate}&\textrm{SO/PA} & 49.02 & 1.90 & 0.188 & 300 & 49.23\\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 113.64 & 5.00 & 0.077 & 300 & 112.44 \\<br />\textrm{1B Rate}&\textrm{1B/PA} & 215.29 & 10.95 & 0.157 & 300 & 223.86 \\<br />\textrm{2B Rate}&\textrm{2B/PA} & 1230.96 & 148.48 & 0.047 & 300 & 1169.75 \\<br />\textrm{3B Rate}&\textrm{3B/PA} & 358.92 & 25.71 & 0.005 & 300 & 365.06 \\<br />\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1063.76 & 116.54 & 0.052 & 300 & 1075.41 \\<br />\textrm{HR Rate} & \textrm{HR/PA} & 129.02 & 6.18 & 0.028 & 300 & 126.35\\<br />\textrm{HBP Rate} & \textrm{HBP/PA} & 299.39 & 18.60 & 0.009 & 300 & 300.97 \\ \hline <br />\end{array}<br /><br />In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. 
Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. Similarly, smaller stabilization points will point towards an increase in the spread of talent levels.<br /><br /><h2><b>Pitching Statistics </b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2016 \textrm{ }\hat{M} \\ \hline<br />\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}& 1356.06 & 247.48 & 0.289 &300&1408.72\\<br />\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 64.00 & 3.56 & 0.450 &300& 63.53 \\<br />\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 61.73 &3.42& 0.342 &300&59.80 \\<br />\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 1497.65 & 296.21 & 0.208 &300&731.02 \\<br />\textrm{HR/FB Rate}&\textrm{HR/FB}& 464.60 & 85.51 & 0.108 &100&488.53 \\<br />\textrm{SO Rate}&\textrm{SO/TBF}& 94.62&5.29& 0.194&400&93.15 \\<br />\textrm{HR Rate}&\textrm{HR/TBF}& 942.62 & 110.66 & 0.026 &400&949.02 \\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 237.53 & 15.84 & 0.069 &400&236.87 \\<br />\textrm{HBP Rate}&\textrm{HBP/TBF}& 954.09 & 115.60 & 0.008 &400&939.00 \\<br />\textrm{Hit rate}&\textrm{H/TBF}& 550.69 & 45.63 & 0.235 &400&559.18 \\<br />\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 496.39 & 41.81 & 0.312 &400&526.77 \\<br />\textrm{WHIP}&\textrm{(H + BB)/IP*}& 74.68 & 5.25 & 1.29 &80&78.97 \\<br />\textrm{ER Rate}&\textrm{ER/IP*}& 62.82 & 4.24 & 0.440 &80&63.08 \\<br />\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 73.11& 5.11 & 1.22 &80&75.79 \\ \hline<br />\end{array}<br /><br /><i>* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively. </i><br /><br />Most statistics are roughly the same; however, the line drive stabilization point has roughly <i>doubled</i>. I checked my calculations for both years and this is not a mistake. 
It corresponds to a decrease in the variance of line drive rates. It should also be noted that the average line drive rate increased from 0.203 to 0.208 - there are perhaps remnants of an odd 2010 that is no longer included in the data set.<br /><h2> </h2><h2>Usage</h2><h2> </h2><i>Note: This section is largely unchanged from the previous year's version. The formulas given here work for "counting" offensive stats (OBP, BA, etc.). </i><br /><br />Aside from the obvious use of knowing approximately when results are half due to luck and half from skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">From Stabilization to Interval Estimation</a>. I believe the confidence intervals from my method should be similar to the intervals from Sean Dolinar's great fangraphs article <a href="http://www.fangraphs.com/blogs/a-new-way-to-look-at-sample-size/">A New Way to Look at Sample Size</a>, though I have not personally tested this, and am not familiar with the Cronbach's alpha methodology he uses (or with reliability analysis in general).<br /><br />For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300}} = (0.317,0.392)$ </div><br />That is, we believe the player's true on-base percentage to be between 0.317 and 0.392 with 95% confidence. 
I used a normal distribution for talent levels with a normal approximation to the binomial for the distribution of observed OBP, but that is not the only possible choice - it just resulted in the simplest formulas for the intervals.<br /><br />Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. A 95% prediction interval for his OBP over those PA is given by<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$ </div><br />That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% unshrunk intervals, <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">but when I tested them in my article "From Stabilization to Interval Estimation"</a>, they worked well for prediction.<br /><br />As usual, all my data and code <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2017-Stabilization-Points" target="_blank">can be found on my github</a>. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA ). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). 
It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.<br /><br /><span style="font-family: 'courier new', 'courier', monospace;">> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1) <br />$\$$Parameters<br />[1] 0.329098 301.317682<br /><br />$\$$Standard.Error<br />[1] 16.92138<br /><br />$\$$Model<br />[1] "Beta-Binomial"</span><br /><br /><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-xgucIosXSK8/Vur_nBEO17I/AAAAAAAAAco/E6CM2RtCGaU3L8DhABUtFAuLYVh7Zt0zQ/s1600/OBP%2BStabilization.jpeg"><img border="0" height="397" src="https://3.bp.blogspot.com/-xgucIosXSK8/Vur_nBEO17I/AAAAAAAAAco/E6CM2RtCGaU3L8DhABUtFAuLYVh7Zt0zQ/s400/OBP%2BStabilization.jpeg" width="400" /></a></div><br />The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula<br /><br /><div style="text-align: center;">$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$</div><br />which is obtained by applying <a href="http://probabilaball.blogspot.com/2015/06/the-delta-method-for-confidence.html" target="_blank">the delta method</a> to the function $p(\hat{M}) = n/(n + \hat{M})$. Note that the mean and prediction intervals I gave do <i>not</i> take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.<br /><br />Maybe somebody else out there might find this useful. 
As always, feel free to post any comments or suggestions!<br /><br />Posted by rcfoster (http://www.blogger.com/profile/09317049446493200529)<br /><br /><b>2016 Win Total Predictions (Through August 31)</b> (published 2016-09-03T13:54:00.001-05:00, updated 2016-09-03T13:54:39.741-05:00)<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <br /><br />These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway). These predictions include all games played through August 31.<br /><br />As a side note, I noticed that my projections are very similar to <a href="http://www.fangraphs.com/coolstandings.aspx?type=2&lg=&date=2016-09-01">the Fangraphs projections on the same day</a>. I'm sure we're both calculating the projections from completely different methods, but it's reassuring that others have arrived at basically the same conclusions. Theirs also have playoff projections, though mine have intervals attached to them.<br /><br />I set the nominal coverage at 95% (meaning that, the way I calculated them, the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season, the actual coverage is around 94%, with intervals usually being one game off if and when they are off.<br /><br />Intervals are inclusive. 
All win totals assume a 162 game schedule.<br /><br />\begin{array} {c c c c c c} <br />\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins/Games}\\ \hline<br />ARI & 63 & 68.82 & 74 & 71.61 & 56 / 133 \\ <br />ATL & 57 & 62.25 & 68 & 68.41 & 50 / 133 \\ <br />BAL & 81 & 86.57 & 92 & 81.42 & 72 / 133 \\ <br />BOS & 85 & 90.41 & 96 & 91.70 & 74 / 133 \\ <br />CHC & 98 & 103.63 & 109 & 100.59 & 85 / 132 \\ <br />CHW & 71 & 76.85 & 83 & 77.61 & 62 / 131 \\ <br />CIN & 62 & 67.90 & 74 & 69.67 & 55 / 132 \\ <br />CLE & 87 & 92.68 & 98 & 90.03 & 76 / 132 \\ <br />COL & 73 & 78.63 & 84 & 81.72 & 64 / 133 \\ <br />DET & 81 & 86.95 & 92 & 83.51 & 72 / 133 \\ <br />HOU & 81 & 86.24 & 92 & 85.14 & 71 / 133 \\ <br />KCR & 78 & 83.09 & 89 & 78.69 & 69 / 133 \\ <br />LAA & 67 & 72.93 & 78 & 77.80 & 59 / 133 \\ <br />LAD & 84 & 89.44 & 95 & 86.26 & 74 / 133 \\ <br />MIA & 76 & 81.66 & 87 & 81.91 & 67 / 133 \\ <br />MIL & 65 & 70.13 & 76 & 73.34 & 57 / 133 \\ <br />MIN & 56 & 61.98 & 68 & 70.10 & 49 / 132 \\ <br />NYM & 78 & 83.61 & 89 & 81.58 & 69 / 133 \\ <br />NYY & 78 & 83.86 & 90 & 80.28 & 69 / 132 \\ <br />OAK & 64 & 69.88 & 75 & 71.92 & 57 / 133 \\ <br />PHI & 67 & 72.42 & 78 & 69.38 & 60 / 133 \\ <br />PIT & 77 & 82.37 & 88 & 80.33 & 67 / 131 \\ <br />SDP & 63 & 68.73 & 74 & 74.14 & 55 / 132 \\ <br />SEA & 77 & 82.55 & 88 & 81.30 & 68 / 133 \\ <br />SFG & 82 & 88.00 & 94 & 86.39 & 72 / 132 \\ <br />STL & 81 & 86.25 & 92 & 87.69 & 70 / 132 \\ <br />TBR & 65 & 70.72 & 76 & 79.47 & 56 / 132 \\ <br />TEX & 89 & 94.45 & 100 & 83.61 & 80 / 134 \\ <br />TOR & 86 & 91.93 & 97 & 89.00 & 76 / 133 \\ <br />WSN & 89 & 94.82 & 100 & 94.01 & 78 / 133 \\<br /> \hline\end{array}<br /><br />These quantiles are based on a distribution - <a href="http://imgur.com/a/rLGT6" target="_blank">I've uploaded a picture of each team's distribution to imgur</a>. The bars in red are the win total values covered by the 95% interval. 
The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins than would be expected based on their talent level. <br /><br />It's still difficult to predict final win totals even at the beginning of September - intervals have a width of approximately 11-12 games. The Texas Rangers have been lucky this season, with a projected win total over 10 games higher than their estimated true talent level! Conversely, the Tampa Bay Rays have been unlucky, with a projected win total nearly 9 games lower than their true talent level.<br /><br />The Chicago Cubs have a good chance at winning 105+ games. My system believes they are a "true" 101 win team. Conversely, the system believes that the worst team is the Atlanta Braves, which are a "true" 68 win team (though the Minnesota Twins are projected to have the worst record at 62 wins).<br /><br /><br /><br /><h2>Terminology</h2><br /><br /><br />To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season through August 31). 
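<br /><br />The relationship between the two columns can be sketched in R. This is only the arithmetic from the coin example applied to rows of the table, not the estimator itself (whose details are not given here); the function name and arguments are made up for illustration.

```r
# "Mean" = current wins + (true win rate) * (games remaining), as in the coin example.
# The function name and its arguments are illustrative, not taken from the model's code.
projected_mean <- function(wins, games_played, true_win_total, season_games = 162) {
  true_rate <- true_win_total / season_games
  wins + true_rate * (season_games - games_played)
}

coin <- projected_mean(4, 5, 5, season_games = 10)   # 4 + 0.5 * 5
tex  <- projected_mean(80, 134, 83.61)               # TEX row of the table
chc  <- projected_mean(85, 132, 100.59)              # CHC row of the table

round(c(coin = coin, TEX = tex, CHC = chc), 2)
```

The TEX and CHC values agree with the "Mean" column of the table to two decimal places, which suggests the column is built the same way.<br /><br />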
<br /><br /><br />Posted by rcfoster (http://www.blogger.com/profile/09317049446493200529)<br /><br /><b>wOBA Shrinkage Estimation by the Multinomial-Dirichlet Model</b> (published 2016-08-15T02:01:00.000-05:00, updated 2016-08-15T13:01:45.312-05:00)<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <br /><br /><a href="http://probabilaball.blogspot.com/2015/06/confidence-interval-for-woba-based-on.html">In a previous article</a>, I showed how to calculate a very basic confidence interval for wOBA using the multinomial model. Since then, I've shown how to perform shrinkage estimation (regression towards the mean) for <a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html">basic counting stats such as BA and OBP</a> and <a href="http://www.probabilaball.com/2015/08/whip-stabilization-by-gamma-poisson.html">rate stats such as WHIP</a>. In this article, I'm going to show how to use the multinomial model with a Dirichlet prior to find a regressed estimate of wOBA (and other functions that are linear transformations of counting stats).<br /><br />As in the previous post, I will use the weights of wOBA <a href="http://www.fangraphs.com/library/offense/woba/">from the definition on fangraphs.com</a>. 
I am aware that in <i>The Book</i> wOBA also includes the outcome of reaching a base on error, but the method here will easily expand to include that factor and the results should not change drastically for its exclusion.<br /><br />As usual, all the code and data I use can be found <a href="https://github.com/Probabilaball/Blog-Code/tree/master/wOBA-Shrinkage-Estimation-by-the-Multinomial-Dirichlet-Model">on my github</a>.<br /><br /><br /><h2><span style="font-size: x-large;">The Multinomial and Dirichlet Models </span></h2><br /><br />Suppose you observe $n$ independent, identical trials of an event (say, a plate appearance or an at-bat) with $k$ possible outcomes (single, double, triple, home run, walk, sacrifice fly, etc. - all the way up to an out). The distribution of counts of each of the possible outcomes is multinomial with probability mass function: <br /><div style="text-align: center;"><br /></div><div style="text-align: center;"><span style="font-size: large;">$p(x_1, x_2,...,x_k | \theta_1, \theta_2, ..., \theta_{k}, n) = \dfrac{n!}{x_1!x_2!...x_k!}\theta_1^{x_1}\theta_2^{x_2}...\theta_k^{x_k}$</span></div><br />Where $x_1$, $x_2$,..., $x_k$ represent counts of each outcome (with $n = x_1 + x_2 + ... + x_k$ fixed) and $\theta_1, \theta_2, ..., \theta_k$ represent the probability of each outcome in a single event - note that all the probabilities $\theta_j$ must sum to 1, so you will sometimes see the last term written as $\theta_k = (1-\theta_1-\theta_2-...-\theta_{k-1})$. <br /><br />To give an example, suppose that in each plate appearance (the event) a certain player has a 0.275 chance of getting a hit, a 0.050 chance of getting a walk, and a 0.675 chance of an "other" outcome happening (meaning anything other than a hit or walk - an out, hit by pitch, reach base on error, etc.). 
Then in $n = 200$ plate appearances, the probability of exactly $x_H = 55$ hits, $x_{BB} = 10$ walks, and $x_{OTH} = 135$ other outcomes is given by<br /><br /><br /><br /><div style="text-align: center;"><span style="font-size: large;">$\dfrac{200!}{55!10!135!} 0.275^{55} 0.05^{10} 0.675^{135} = 0.008177562$</span></div><br /><br /><br />(This probability is necessarily small because there are a little over 20,000 ways to have the three outcomes sum up to 200 plate appearances. In fact, 55 hits, 10 walks, and 135 other is the most probable set of outcomes.)<br /><br />The multinomial is, as its name implies, a multivariate extension of the classic binomial distribution (a binomial is just a multinomial with $k = 2$). Similarly, there is a multivariate extension of the beta distribution called the Dirichlet distribution. The Dirichlet is used to represent the joint distribution of the $\theta_j$ themselves - that is, the joint distribution (over the entire league) of <i>sets</i> of talent levels for each of the $k$ possible outcomes. 
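<br /><br />The multinomial probability in the worked example above can be checked in R. The sketch below uses base R's <i>dmultinom</i> and, as a cross-check, the mass function written out on the log scale; the variable names are illustrative.

```r
# Multinomial probability of 55 hits, 10 walks and 135 other outcomes in 200 PA.
counts <- c(H = 55, BB = 10, OTH = 135)
probs  <- c(H = 0.275, BB = 0.050, OTH = 0.675)

p_direct <- dmultinom(counts, prob = probs)   # built-in multinomial mass function

# Same quantity from the formula, on the log scale to avoid overflow in 200!.
log_p     <- lfactorial(sum(counts)) - sum(lfactorial(counts)) + sum(counts * log(probs))
p_formula <- exp(log_p)

# Number of ways three non-negative counts can sum to 200.
n_outcomes <- choose(200 + 2, 2)

c(direct = p_direct, formula = p_formula, outcomes = n_outcomes)
```

The count of possible outcomes is 20,301, which is the "little over 20,000" mentioned above. Returning to the Dirichlet distribution:<br /><br />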
The probability density function of the Dirichlet is<br /><br /><div style="text-align: center;"><span style="font-size: large;">$p(\theta_1, \theta_2, ..., \theta_{k} | \alpha_1, \alpha_2, ..., \alpha_k) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1}...\theta_k^{\alpha_k - 1}$</span></div><br /><br />The Dirichlet distribution would be used to answer the question "What is the probability that a player has a hit probability of between 0.250 and 0.300 and <i>simultaneously </i>a walk probability of between 0.050 and 0.100?" - the advantage of doing this is being able to model the covariance between the talent levels.<br /><br />The expected values of each of the $\theta_j$ are given by<br /><br /><div style="text-align: center;"><span style="font-size: large;">$E[\theta_j] = \dfrac{\alpha_j}{\alpha_0}$</span></div><br />where<br /><br /><div style="text-align: center;"><span style="font-size: large;">$\alpha_0 = \displaystyle \sum_{j = 1}^k \alpha_j$</span></div><br />These represent the league average talent levels in each of the outcomes. So for example, using hits (H), walks (BB), and other (OTH), the quantity given by<br /><br /><div style="text-align: center;"><span style="font-size: large;">$E[\theta_{BB}] = \dfrac{\alpha_{BB}}{\alpha_H + \alpha_{BB} + \alpha_{OTH}}$</span></div><br />would be the average walk proportion (per PA) over all MLB players. <br /><br />The reason the Dirichlet distribution is useful is that it is conjugate to the multinomial density above. 
Given raw counts $x_1, x_2, ..., x_k$ for each outcome in the multinomial model and parameters $\alpha_1$, $\alpha_2$, ..., $\alpha_k$ for the Dirichlet model, the posterior distribution for the $\theta_j$ is also Dirichlet with parameters $ \alpha_1 + x_1, \alpha_2 + x_2, ..., \alpha_k + x_k$:<br /><br /><br /><div style="text-align: center;"><span style="font-size: large;"> $p(\theta_1, \theta_2, ..., \theta_{k} | x_1, x_2, ..., x_k) = $</span></div><div style="text-align: center;"><br /></div><div style="text-align: center;"><span style="font-size: large;">$\displaystyle \dfrac{\Gamma(\sum_{j = 1}^k (\alpha_j + x_j))}{\prod_{j = 1}^k \Gamma(\alpha_j + x_j)} \theta_1^{\alpha_1 + x_1- 1} \theta_2^{\alpha_2 + x_2- 1}...\theta_{k-1}^{\alpha_{k-1} + x_{k-1} - 1}\theta_k^{\alpha_k + x_k - 1}$</span></div><br /><br /><br />For the posterior, the expected value for each outcome is given by<br /><br /><div style="text-align: center;"><span style="font-size: large;">$E[\theta_j] = \dfrac{\alpha'_j}{\alpha'_0}$</span></div><br />where<br /><br /><div style="text-align: center;"><span style="font-size: large;">$\alpha'_j = x_j + \alpha_j$</span></div><br /><div style="text-align: center;"><span style="font-size: large;">$\alpha'_0 = \displaystyle \sum_{j = 1}^k \alpha'_j = \sum_{j = 1}^k (x_j + \alpha_j)$</span></div><span style="font-size: large;"><br /></span><br /><br />These posterior means $E[\theta_j]$ represent estimates of each of the outcome talent levels, regressed towards the league means. These shrunk estimates can then be plugged in to the formula for any statistic to get a regressed version of that statistic.<br /><br /><br /><h2>Linearly Weighted Statistics </h2><br />Most basic counting statistics (such as batting average, on-base percentage, etc.) 
simply try to estimate the talent level for one particular outcome using the raw proportion of events ending in that outcome:<br /><br /><br /><div style="text-align: center;"><span style="font-size: large;">$\hat{\theta}_j = \dfrac{x_j}{n}$</span></div><br /><br />More advanced statistics instead attempt to estimate linear functions of true talent levels $\theta_j$ with weights $w_j$ for each outcome:<br /><br /><div style="text-align: center;"><span style="font-size: large;">$w_1 \theta_1 + w_2\theta_2 + ... + w_k \theta_k$</span></div><br />The standard version of these statistics that you can find on any number of baseball sites uses the raw proportion $x_j/n$ as an estimate $\hat{\theta}_j$ as above. To get the regressed version of the statistic, use $\hat{\theta}_j = E[\theta_j]$ from the posterior distribution for the Dirichlet-multinomial model - the formula for the regressed statistic is then<br /><div style="text-align: center;"><br /></div><div style="text-align: center;"><span style="font-size: large;">$w_1 \hat{\theta}_1 + w_2 \hat{\theta}_2 + ... + w_k \hat{\theta}_k = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \alpha_j}{\sum_{l = 1}^k (x_l + \alpha_l)}\right) =\sum_{j = 1}^k w_j \left(\dfrac{\alpha'_j}{\alpha'_0}\right)$</span></div><br /><br />(The full posterior distribution can also be used to get interval estimates for the statistic, which will be the focus of the next article.) <br /><span style="font-size: large;"><br /></span><br /><h2><span style="font-size: large;">Estimation</span></h2><br /><br />This raises the obvious question of what values to use for the $\alpha_j$ in the Dirichlet distribution - ideally, $\alpha_j$ should be picked so that the Dirichlet distribution accurately models the joint distribution of talent levels for MLB hitters. 
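<br /><br />Before getting to estimation, it may help to see the regressed-statistic formula in action. The R sketch below uses completely made-up values for the $\alpha_j$, the counts, and the weights (an OBP-like statistic with three outcomes), so the numbers illustrate the mechanics only and say nothing about real MLB data.

```r
# Regressed linear statistic from Dirichlet posterior means.
# alpha, x and w are hypothetical values chosen only to illustrate the formula.
alpha <- c(H = 55, BB = 10, OTH = 135)    # hypothetical Dirichlet parameters
x     <- c(H = 70, BB = 30, OTH = 100)    # hypothetical observed counts
w     <- c(H = 1,  BB = 1,  OTH = 0)      # hypothetical weights

theta_prior <- alpha / sum(alpha)               # league means implied by alpha
theta_raw   <- x / sum(x)                       # raw proportions
theta_post  <- (x + alpha) / sum(x + alpha)     # posterior means

stat_prior     <- sum(w * theta_prior)
stat_raw       <- sum(w * theta_raw)
stat_regressed <- sum(w * theta_post)

c(prior = stat_prior, raw = stat_raw, regressed = stat_regressed)
```

The regressed value lands between the prior mean and the raw value, and moves closer to the raw value as the counts grow relative to $\alpha_0$. As for actually choosing the $\alpha_j$:<br /><br />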
There are different ways to do this, but I'm going to use MLB data itself and a marginal maximum likelihood technique to find estimates $\hat{\alpha}_j$ and plug those into the equations above - this method was chosen because there are existing <span style="font-family: 'courier new', 'courier', monospace;">R</span> packages to do the estimation, it is relatively numerically stable, and it should be very close to the results of other methods for large sample sizes. Using the data to find estimates of the $\alpha_j$ and then plugging those back in makes this an empirical Bayesian technique.<br /><br />First, it's necessary to get rid of the talent levels $\theta_j$ and find the probability of the observed data based not on a particular player's talent level(s), but on the Dirichlet distribution of population talent levels itself. This is done by integrating out each talent level:<br /><div style="text-align: center;"><span style="font-size: large;"><br /></span></div><div style="text-align: center;"><span style="font-size: large;"><br /></span></div><div style="text-align: center;"><span style="font-size: large;">$p(x_1, x_2, ..., x_k | \alpha_1, \alpha_2, ..., \alpha_k, n) = $</span></div><div style="text-align: center;"><span style="font-size: large;">$\displaystyle \int_{\tilde{\theta}} p(x_1, x_2, ..., x_k | \theta_1, \theta_2, ..., \theta_k, n) p(\theta_1, \theta_2, ..., \theta_k | \alpha_1, \alpha_2, ..., \alpha_k) d\tilde{\theta}$ </span></div><div style="text-align: center;"><span style="font-size: large;"><br /></span></div><div style="text-align: left;"><span style="font-size: large;"><span style="font-size: small;">where $\tilde{\theta}$ indicates the set of all $\theta_j$ values - so integrating out the success probability for each outcome, from 0 to 1 (subject to the probabilities summing to 1).</span></span><br /><br /><span style="font-size: large;"><span style="font-size: small;">The calculus here is a bit tedious, so skipping straight to the solution, this gives the probability mass 
function:</span></span></div><div style="text-align: center;"><span style="font-size: large;"><br /></span></div><div style="text-align: center;"><span style="font-size: large;">$ = \displaystyle \dfrac{n! \Gamma(\sum_{j = 1}^k \alpha_j)}{\Gamma(n + \sum_{j = 1}^k \alpha_j)} \prod_{j = 1}^k \dfrac{\Gamma(x_j + \alpha_j)}{x_j! \Gamma(\alpha_j)}$</span></div><br /><br />This distribution, known as the Dirichlet-Multinomial distribution, represents the probability of getting $x_1, x_2, ..., x_k$ outcomes in fixed $n = x_1 + x_2 + ... + x_k$ events, given only information about the population. Essentially, this distribution would be used to answer the question "What's the probability that, if I select a player from the population of all MLB players completely at random - so not knowing the player's talent levels at all - the player gets $x_1$ singles, $x_2$ doubles, etc., in $n$ plate appearances?"<br /><br />Let $x_{i,j}$ represent the raw count for outcome $j$ for player $i = 1, 2, ..., N$, and let $\tilde{x}_i$ be shorthand for the complete set of counting statistics for player $i$. Given the counting stats of multiple players $\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_N$ from some population (such as all MLB players that meet some event threshold), statistical estimation procedures can be used to acquire estimates $\hat{\alpha}_j$ of the true population parameters $\alpha_j$.<br /><br />For the maximum likelihood approach, the log-likelihood of a set of estimates $\tilde{\alpha}$ is given by<br /><br /><div style="text-align: center;"><span style="font-size: large;">$\ell(\tilde{\alpha} | \tilde{x}_1, ..., \tilde{x}_N) = \displaystyle \sum_{i = 1}^N [\log(n_i!) + \log(\Gamma(\sum_{j = 1}^k \alpha_j)) - \log(\Gamma(n_i + \sum_{j = 1}^k \alpha_j))$ </span></div><div style="text-align: center;"><span style="font-size: large;">$+ \displaystyle \sum_{j = 1}^k \log(\Gamma(x_{i,j} + \alpha_j)) - \sum_{j = 1}^k \log(x_{i,j}!) 
- \sum_{j = 1}^k \log(\Gamma(\alpha_j))]$</span></div><br />The maximum likelihood method works by finding the values of the $\alpha_j$ that maximize $\ell(\tilde{\alpha})$ above - these are the maximum likelihood estimates $\hat{\alpha}_j$. From a numerical perspective, doing this is not simple, and papers have been written on fast, easy ways to perform the computations. For simplicity, I'm going to use the <span style="font-family: "courier new" , "courier" , monospace;">dirmult</span> package in <span style="font-family: "courier new" , "courier" , monospace;">R</span>, which only requires the set of counts for each outcome as a matrix where each row corresponds to exactly one player. The <span style="font-family: "courier new" , "courier" , monospace;">dirmult </span>package can be installed with the command<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> install.packages('dirmult')</span><br /><br />Once the data is entered and estimation is performed, you will have estimates $\hat{\alpha}_j$. These can then be plugged into the posterior equations above to get the regressed statistic estimate<br /><br /><div style="text-align: center;"><span style="font-size: large;">$w_1 \hat{\theta}_1 + w_2 \hat{\theta}_2 + ... + w_k \hat{\theta}_k = \displaystyle \sum_{j = 1}^k w_j \left(\dfrac{x_j + \hat{\alpha}_j}{\sum_{l = 1}^k (x_l + \hat{\alpha}_l)}\right)$</span></div><br /><br />I'll give two examples of offensive stats that can be regressed in this way. <br /><br /><br /><h2>wOBA Shrinkage</h2><br /><br />The first is weighted on-base average (which I'll call wOBA for short), introduced in Tom Tango's <i>The Book</i>, though as previously mentioned, I am using the fangraphs.com definition. 
For events, at-bats plus unintentional walks, sacrifice flies, and times hit by a pitch will be used (n = AB + BB - IBB + SF + HBP) , and seven outcomes are defined - singles (1B), doubles (2B), triples (3B), home runs (HR), <u>unintentional</u> walks (BB), and hit by pitch (HBP), with everything else being lumped into an "other" (OTH) outcome.<br /><br />For notation, again let $x_{i, 1B}$, $x_{i, 2B}$, ..., $x_{i, OTH}$ represent the number of singles, doubles, etc. for player $i$ (abbreviating the entire set as $\tilde{x}_i$) and let $\theta_{i, 1B}$, $\theta_{i, 2B}$, ..., $\theta_{i, OTH}$ represent the true probability of getting a single, double, etc. for player $i$ (abbreviating the entire set as $\tilde{\theta}_i$). The total number of events for player $i$ is given by $n_i$.<br /><br />Data was collected from fangraphs.com on all MLB non-pitchers from 2010 - 2015. A cutoff of 300 events was used - so only players with at least 300 total AB + BB - IBB + SF + HBP in a given season were used. The code and data I used <a href="https://github.com/Probabilaball/Blog-Code/tree/master/wOBA-Shrinkage-Estimation-by-the-Multinomial-Dirichlet-Model">may be found in my github</a>.<br /><br />A player's wOBA can be written as a linear transformation of the $\theta_j$ for each of these outcomes with weights $w_{1B} = 0.89$, $w_{2B} = 1.27$, $w_{3B} = 1.62$, $w_{HR} = 2.10$, $w_{BB} = 0.69$, $w_{HBP} = 0.72$, and $w_{OTH} = 0$ as<br /><br /><div style="text-align: center;">$wOBA_i = 0.89*\theta_{i,1B} + 1.27*\theta_{i,2B} + 1.62*\theta_{i,3B} + 2.10*\theta_{i,HR} + 0.69*\theta_{i,BB}+0.72*\theta_{i,HBP}$</div><br />For player $i$, the distribution of the counts $\tilde{x}_i$ in $n_i$ events is multinomial with mass function <br /><br /><div style="text-align: center;"><span style="font-size: large;">$p(\tilde{x}_i | \tilde{\theta}_i, n_i) =\dfrac{n_i!}{x_{i,1B}! x_{i,2B}! x_{i,3B}! x_{i,HR}! x_{i,BB}! x_{i,HBP}! x_{i,OTH}! 
}\theta_{i,1B}^{x_{i,1B}} \theta_{i,2B}^{x_{i,2B}} \theta_{i,3B}^{x_{i,3B}} \theta_{i,HR}^{x_{i,HR}} \theta_{i,BB}^{x_{i,BB}} \theta_{i,HBP}^{x_{i,HBP}} \theta_{i,OTH}^{x_{i,OTH}}$</span></div><div style="text-align: center;"><br /></div><br />The joint distribution of possible talent levels $\tilde{\theta}_i$ is assumed to be Dirichlet.<br /><br /><div style="text-align: center;"><span style="font-size: large;">$p(\tilde{\theta}_i | \tilde{\alpha}) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_{i,1B}^{\alpha_{1B} - 1} \theta_{i,2B}^{\alpha_{2B} - 1}\theta_{i,3B}^{\alpha_{3B} - 1} \theta_{i,HR}^{\alpha_{HR} - 1}\theta_{i,BB}^{\alpha_{BB} - 1}\theta_{i,HBP}^{\alpha_{HBP} - 1}\theta_{i,OTH}^{\alpha_{OTH} - 1}$</span></div><br /><br />To find the maximum likelihood estimates $\hat{\alpha}_j$ for this model using the <span style="font-family: "courier new" , "courier" , monospace;">dirmult</span> package in <span style="font-family: "courier new" , "courier" , monospace;">R</span>, the data needs to be loaded into a matrix, where row $i$ represents the raw counts for each outcome for player $i$. 
There are any number of ways to do this, but the first 10 rows of the matrix (out of 1598 total in the data set used for this example) should look something like:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">> x[1:10,]<br /> [,1] [,2] [,3] [,4] [,5] [,6] [,7]<br /> [1,] 91 38 1 42 109 5 353<br /> [2,] 122 26 1 44 71 5 364<br /> [3,] 58 25 1 18 43 0 196<br /> [4,] 111 40 3 32 38 5 336<br /> [5,] 63 25 0 30 56 3 253<br /> [6,] 67 18 1 21 46 5 213<br /> [7,] 86 24 2 43 108 6 362<br /> [8,] 58 25 2 20 24 3 201<br /> [9,] 35 16 2 25 56 2 200<br />[10,] 68 44 0 14 76 5 250</span></span></span> <br /><br />Once the data is in this form, <span style="font-family: inherit;">finding the maximum likelihood estimates</span> can be done with the commands<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> dirmult.fit <span style="font-family: "courier new" , "courier" , monospace;"></span><- dirmult(x)<br />Iteration 1: Log-likelihood value: -863245.946200229<br />Iteration 2: Log-likelihood value: -863214.860463357<br />Iteration 3: Log-likelihood value: -863210.928976511<br />Iteration 4: Log-likelihood value: -863210.901250554<br />Iteration 5: Log-likelihood value: -863210.901248778</span><span style="font-family: "courier new" , "courier" , monospace;"></span><br /><span style="font-family: "courier new" , "courier" , monospace;"></span><br /><span style="font-family: "courier new" , "courier" , monospace;">> dirmult.fit</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><br /><div style="text-align: justify;"><span style="font-family: "courier new" , "courier" , monospace;">$loglik</span></div><span style="font-family: "courier new" , "courier" , 
monospace;">[1] -863210.9</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><br /><div style="text-align: justify;"><span style="font-family: "courier new" , "courier" , monospace;">$ite</span></div><span style="font-family: "courier new" , "courier" , monospace;">[1] 5</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><br /><div style="text-align: justify;"><span style="font-family: "courier new" , "courier" , monospace;">$gamma</span></div><span style="font-family: "courier new" , "courier" , monospace;">[1] 34.30376 10.44264 1.15606 5.73569 16.28635 1.96183 144.51164</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><br /><div style="text-align: justify;"><span style="font-family: "courier new" , "courier" , monospace;">$pi</span></div><span style="font-family: "courier new" , "courier" , monospace;">[1] 0.160000389 0.048706790 0.005392124 0.026752540 0.075963185 0.009150412 0.674034560</span><br /><span style="font-family: "courier new" , "courier" , monospace;"><br /></span><br /><div style="text-align: justify;"><span style="font-family: "courier new" , "courier" , monospace;">$theta</span></div><span style="font-family: "courier new" , "courier" , monospace;">[1] 0.004642569</span><br /><br />What I called the $\hat{\alpha}_j$ are given as the $<span style="font-family: "courier new" , "courier" , monospace;">gamma</span> in the output. 
The quantity $\alpha_0$ can be calculated by summing these up<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">> alpha <- dirmult.fit$gamma<br />> sum(alpha)<br />[1] 214.398</span></span><br /><br />So the joint distribution of talent levels over the population of MLB players with at least 300 events is approximated by a Dirichlet distribution with parameters:<br /><br /><div style="text-align: center;">$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{BB}, \theta_{HBP}, \theta_{OTH}) \sim Dirichlet(34.30, 10.44, 1.16, 5.74, 16.29, 1.96, 144.51)$</div><br />In 2013, Mike Trout had $x_{1B} = 115$ singles, $x_{2B} = 39$ doubles, $x_{3B} = 9$ triples, $x_{HR} = 27$ home runs, $x_{BB} = 100$ unintentional walks, $x_{HBP} = 9$ times hit by a pitch, and $x_{OTH} = 407$ other outcomes in $n = 706$ total events for a raw (non-regressed) wOBA of<br /><br /><div style="text-align: center;">$0.89 \left(\dfrac{115}{706}\right) + 1.27 \left(\dfrac{39}{706}\right) + 1.62 \left(\dfrac{9}{706}\right) + 2.10 \left(\dfrac{27}{706}\right) + 0.69 \left(\dfrac{100}{706}\right) + 0.72 \left(\dfrac{9}{706}\right) \approx 0.423$</div><br />In order to calculate the <i>regressed</i> weighted on-base average, first calculate the $\alpha_j'$ for Mike Trout's posterior distribution of batting ability by<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$\alpha_{1B}' = 115 + 34.30 = 149.30$</div><div style="text-align: center;">$\alpha_{2B}' = 39 + 10.44 = 49.44$</div><div style="text-align: center;">$\alpha_{3B}' = 9+ 1.16 = 10.16$</div><div style="text-align: center;">$\alpha_{HR}' = 27 + 5.74 = 32.74$<br />$\alpha_{BB}' = 100 + 16.29 = 116.29$<br />$\alpha_{HBP}' = 9 + 1.96 = 10.96$ </div><div style="text-align: center;">$\alpha_{OTH}' = 407 + 144.51 = 551.51$</div><br />With $\alpha_0' = 920.40$. 
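<br /><br />The posterior update above can be reproduced in a few lines of base R. This is only a sketch of the arithmetic, not code from the original analysis: the variable names are made up for illustration, the prior values are the rounded <span style="font-family: "courier new" , "courier" , monospace;">dirmult</span> estimates quoted above, and the "other" count is taken as the 706 events minus the six listed outcomes.

```r
# Prior parameters: the dirmult estimates quoted above, rounded to two decimals.
alpha_prior <- c("1B" = 34.30, "2B" = 10.44, "3B" = 1.16, "HR" = 5.74,
                 "BB" = 16.29, "HBP" = 1.96, "OTH" = 144.51)

# Mike Trout's 2013 counts; OTH is 706 events minus the six listed outcomes.
x <- c("1B" = 115, "2B" = 39, "3B" = 9, "HR" = 27,
       "BB" = 100, "HBP" = 9, "OTH" = 407)

# wOBA weights as given in the text (the "other" outcome has weight 0).
w <- c("1B" = 0.89, "2B" = 1.27, "3B" = 1.62, "HR" = 2.10,
       "BB" = 0.69, "HBP" = 0.72, "OTH" = 0)

# Conjugate update: posterior parameter = observed count + prior parameter.
alpha_post  <- x + alpha_prior
alpha0_post <- sum(alpha_post)

# Raw wOBA uses observed proportions; regressed wOBA uses posterior means.
raw_woba       <- sum(w * x / sum(x))
regressed_woba <- sum(w * alpha_post / alpha0_post)

print(round(c(alpha0_post = alpha0_post, raw = raw_woba,
              regressed = regressed_woba), 3))
```

The same three vectors, with at-bats as events and the slugging weights, would give the SLG version of the calculation.<br /><br />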
The regressed version of Mike Trout's 2013 weighted on-base average is then given by a linear transformation of the expected proportion in each outcome:<br /><br /><div style="text-align: center;">$0.89 \left(\dfrac{149.30}{920.40}\right) + 1.27 \left(\dfrac{49.44}{920.40}\right) + 1.62 \left(\dfrac{10.16}{920.40}\right) + 2.10 \left(\dfrac{32.74}{920.40}\right) + 0.69 \left(\dfrac{116.29}{920.40}\right) + 0.72 \left(\dfrac{10.96}{920.40}\right)$<br /><br /> $\approx 0.401$</div><br />So based solely on his 2013 stats and the population information, it's estimated that he is a "true" 0.401 wOBA hitter. Of course, we know from many more years of watching him that this is a bit unfair, and his "true" wOBA is closer to 0.423.<br /><br /><h2>Stabilization</h2><br /><br />As a side note - and I have verified this by simulation, though I have not worked out the details yet mathematically - the so-called "stabilization point" (defined as a split-half correlation of $r = 0.5$) for wOBA is given by $\alpha_0$ - so if a split-half correlation were computed among the players from 2010-2015 with at least 300 AB + BB - IBB + SF + HBP, there should be a correlation of 0.5 after approximately 214 PA. I'm not sure if this works for just the wOBA weights or any arbitrary set of weights, though I suspect the fact that the weight for the "other" outcome is 0 and all the rest are nonzero has a big role to play in this.<br /><br /><br /><h2>SLG Shrinkage</h2><br /><br />Another statistic that can be regressed in this same way is slugging percentage (which I'll call SLG for short). 
Using at-bats (AB) as events, and defining five outcomes - singles (1B), doubles (2B), triples (3B), and home runs (HR), with everything else being lumped into the "other" (OTH) outcome, a player's slugging percentage can be written as a linear transformation of the $\theta_j$ for each of these outcomes with weights $w_{1B} = 1$, $w_{2B} = 2$, $w_{3B} = 3$, $w_{HR} = 4$, and $w_{OTH} = 0$<br /><br /><div style="text-align: center;">$SLG_i = 1*\theta_{i,1B} + 2*\theta_{i,2B} + 3*\theta_{i,3B} + 4*\theta_{i,HR}$ </div><br />For player $i$, the multinomial distribution of the counts of $x_{i,1B}$ singles, $x_{i,2B}$ doubles, $x_{i,3B}$ triples, $x_{i,HR}$ home runs, and $x_{i,OTH}$ other outcomes in $n_i$ at-bats is <br /><br /><div style="text-align: center;"><span style="font-size: large;">$p(\tilde{x}_i | \tilde{\theta}_i, n_i) =\dfrac{n_i!}{x_{i,1B}! x_{i,2B}! x_{i,3B}! x_{i,HR}! x_{i,OTH}! }\theta_{i,1B}^{x_{i,1B}} \theta_{i,2B}^{x_{i,2B}} \theta_{i,3B}^{x_{i,3B}} \theta_{i,HR}^{x_{i,HR}} \theta_{i,OTH}^{x_{i,OTH}}$</span></div><div style="text-align: center;"><br /></div><br />The Dirichlet distribution of all possible $\tilde{\theta}_i$ values is <br /><br /><div style="text-align: center;"><span style="font-size: large;">$p(\tilde{\theta}_i | \tilde{\alpha}) = \displaystyle \dfrac{\Gamma(\sum_{j = 1}^k \alpha_j)}{\prod_{j = 1}^k \Gamma(\alpha_j)} \theta_{i,1B}^{\alpha_{1B} - 1} \theta_{i,2B}^{\alpha_{2B} - 1}\theta_{i,3B}^{\alpha_{3B} - 1} \theta_{i,HR}^{\alpha_{HR} - 1}\theta_{i,OTH}^{\alpha_{OTH} - 1}$</span></div><br />Once again, data from fangraphs.com was used, and all MLB non-pitchers from 2010-2015 who had at least 300 AB in a given season were included in the sample. To find maximum likelihood estimates in <span style="font-family: "courier new" , "courier" , monospace;">R</span> the data needs to be loaded into a matrix where row $i$ represents the raw counts for each outcome $\tilde{x}_i$ for player $i$. 
The first 10 rows of the matrix (out of 1477) should look something like:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> x[1:10,]<br /> [,1] [,2] [,3] [,4] [,5]<br /> [1,] 91 38 1 42 349<br /> [2,] 122 26 1 44 362<br /> [3,] 111 40 3 32 332<br /> [4,] 63 25 0 30 251<br /> [5,] 67 18 1 21 208<br /> [6,] 86 24 2 43 358<br /> [7,] 58 25 2 20 199<br /> [8,] 68 44 0 14 248<br /> [9,] 102 36 2 37 370<br />[10,] 119 48 0 30 375</span> <br /><br />The <span style="font-family: "courier new" , "courier" , monospace;">dirmult</span> package can then be used to find maximum likelihood estimates for the $\alpha_j$ of the underlying joint Dirichlet distribution of talent levels<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> dirmult.fit <- dirmult(x)<br />Iteration 1: Log-likelihood value: -575845.635311559<br />Iteration 2: Log-likelihood value: -575999.779702559<br />Iteration 3: Log-likelihood value: -575829.132259007<br />Iteration 4: Log-likelihood value: -575784.936726078<br />Iteration 5: Log-likelihood value: -575780.270877135<br /><span style="font-family: "courier new" , "courier" , monospace;">Iteration 6: Log-likelihood value: -575780.190649985<br />Iteration 7: Log-likelihood value: -575780.1906191</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br />> dirmult.fit</span></span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">$</span>loglik<br />[1] -575780.2</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"> </span></span></span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" 
, monospace;"><span style="font-family: "courier new" , "courier" , monospace;">$</span>ite<br />[1] 7</span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"> </span></span></span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">$</span>gamma<br />[1] 42.443604 12.855782 1.381905 7.073672 176.120837<br /> </span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">$pi</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">[1] 0.176939918 0.053593494 0.005760921 0.029488894 0.734216774</span></span><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"> </span></span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">$theta</span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">[1] 0.004151517</span></span><br /><br />And $\alpha_0$ is<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> alpha <- dirmult.fit$gamma<br />> sum(alpha)<br />[1] 239.8758</span><br /><br />So the joint distribution of talent levels over the population of MLB players with at least 300 AB is given by a Dirichlet distribution.<br /><br /><div style="text-align: center;">$(\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{OTH}) \sim Dirichlet(42.44, 12.86, 1.38, 7.07, 176.12)$</div> <br />(This also implies the "stabilization point" for slugging percentage should be at 
around $\alpha_0 \approx 240$ AB - this is different than for wOBA because the definition of "events" is different between the two statistics)<br /><br />In 2013, Mike Trout had $x_{1B} = 115$ singles, $x_{2B} = 39$ doubles, $x_{3B} = 9$ triples, $x_{HR} = 27$ home runs, and $x_{OTH} = 399$ other outcomes in $n = 589$ at-bats, for a raw (non-regressed) slugging percentage of<br /><br /><div style="text-align: center;">$1 \left(\dfrac{115}{589}\right) + 2 \left(\dfrac{39}{589}\right) + 3 \left(\dfrac{9}{589}\right) + 4 \left(\dfrac{27}{589}\right) + 0 \left(\dfrac{399}{589}\right) \approx 0.557$</div><br />In order to calculate the <i>regressed</i> slugging percentage, calculate Mike Trout's posterior distribution for batting ability by<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$\alpha_{1B}' = 115 + 42.44 = 157.44$</div><div style="text-align: center;">$\alpha_{2B}' = 39 + 12.86 = 51.86$</div><div style="text-align: center;">$\alpha_{3B}' = 9+ 1.38 = 10.38$</div><div style="text-align: center;">$\alpha_{HR}' = 27 + 7.07 = 34.07$ </div><div style="text-align: center;">$\alpha_{OTH}' = 399 + 176.12 = 575.12$</div><br />With $\alpha_0' = 828.87$. 
The regressed version of Mike Trout's 2013 slugging percentage is then given by<br /><br /><div style="text-align: center;">$1 \left(\dfrac{157.44}{828.87}\right) + 2 \left(\dfrac{51.86}{828.87}\right) + 3 \left(\dfrac{10.38}{828.87}\right) + 4 \left(\dfrac{34.07}{828.87}\right) \approx 0.517$</div><div style="text-align: center;"><br /></div><div style="text-align: left;"><br /><br /></div><div style="text-align: left;"><h2><span style="font-size: large;">Model Criticisms</span></h2><br /><br />From a statistical perspective, this is the most convenient way to perform shrinkage of wOBA, but it doesn't necessarily mean that this is correct - all of this research is dependent on how well the Dirichlet models the joint distribution of talent levels in the league. 
The fact that the beta works well for the population distributions of each of the talent levels when looked at individually is no guarantee that the multivariate extension should work well for the joint.<br /><br />In order to do a simple test of the fit of the model, data was simulated from the fit model used to perform the wOBA shrinkage (the posterior predictive means and variances are actually available exactly for this model, but it's good practice to simulate). A set of $\tilde{\theta}_i$ simulated from the Dirichlet distribution was used to simulate a corresponding set of $\tilde{x}_i$, with the same $n_i$ as in the original data set. Comparing the means and standard deviations of the real and simulated data sets, the means are <br /><br />\begin{array}{c c c}<br />\textrm{Outcome} & \textrm{Observed Mean} & \textrm{Simulated Mean} \\ \hline<br />\textrm{1B} & 0.1598 & 0.1598 \\<br />\textrm{2B} & 0.0473 & 0.0480 \\<br />\textrm{3B} & 0.0049 & 0.0053 \\<br />\textrm{HR} & 0.0275 & 0.0262 \\<br />\textrm{BB} & 0.0772 & 0.0756 \\<br />\textrm{HBP} & 0.0088 & 0.0095 \\ <br />\textrm{OTH} & 0.6745 & 0.6756 \\<br />\end{array}<br /><br />which look relatively good - the simulated and real means are fairly close. For the standard deviations, real and simulated values are<br /><br />\begin{array}{c c c}<br />\textrm{Outcome} & \textrm{Observed SD} & \textrm{Simulated SD} \\ \hline<br />\textrm{1B} & 0.0296 & 0.0303 \\<br />\textrm{2B} & 0.0114 & 0.0176 \\<br />\textrm{3B} & 0.0049 & 0.0061 \\<br />\textrm{HR} & 0.0153 & 0.0131 \\<br />\textrm{BB} & 0.0278 & 0.0217 \\<br />\textrm{HBP} & 0.0072 & 0.0080 \\ <br />\textrm{OTH} & 0.0324 & 0.0381\\<br />\end{array}</div><br />which isn't nearly as good. 
Going by the table, the Multinomial-Dirichlet model is clearly overestimating the amount of variance in double rates and "other" outcome rates (the simulated standard deviations are larger than the observed ones) while underestimating the variance in home run and walk rates. 
It's not to an extreme extent - and comparing histograms, the shapes of the real and simulated data sets match - but it does present a source of problems. More ad-hoc methods may give superior results.<br /><br />Posted by rcfoster<br /><br />2016 Win Total Predictions (Through All-Star Break) (posted 2016-07-16)<br /><br /><script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <br />These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway). These predictions include all games played before the all-star break.<br /><br />I set the nominal coverage at 95% (meaning the way I calculated it the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season the actual coverage is just under 93%, with intervals usually being one game off if and when they are off.<br /><br />Intervals are inclusive. 
All win totals assume a 162 game schedule.<br /><br />\begin{array} {c c c c c c} <br />\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins/Games}\\ \hline<br /><br />ARI & 62 & 72.11 & 82 & 76.75 & 38 / 90 \\ <br />ATL & 52 & 61.82 & 72 & 68.4 & 31 / 89 \\ <br />BAL & 81 & 90.93 & 101 & 86.25 & 51 / 87 \\ <br />BOS & 80 & 90.3 & 100 & 89.19 & 49 / 87 \\ <br />CHC & 87 & 96.9 & 106 & 96.11 & 53 / 88 \\ <br />CHW & 71 & 81.04 & 91 & 80 & 44 / 87 \\ <br />CIN & 51 & 60.62 & 70 & 63.51 & 32 / 89 \\ <br />CLE & 84 & 93.41 & 103 & 90.67 & 52 / 88 \\ <br />COL & 66 & 76.18 & 86 & 79.22 & 40 / 88 \\ <br />DET & 73 & 82.55 & 92 & 81.11 & 46 / 89 \\ <br />HOU & 76 & 85.81 & 96 & 83.9 & 48 / 89 \\ <br />KCR & 70 & 80.3 & 90 & 77.29 & 45 / 88 \\ <br />LAA & 62 & 71.88 & 82 & 77.4 & 37 / 89 \\ <br />LAD & 80 & 89.43 & 99 & 87.7 & 51 / 91 \\ <br />MIA & 75 & 84.5 & 94 & 82.1 & 47 / 88 \\ <br />MIL & 61 & 71.33 & 81 & 71.98 & 38 / 87 \\ <br />MIN & 56 & 65.83 & 76 & 73.06 & 32 / 87 \\ <br />NYM & 75 & 84.9 & 95 & 82.97 & 47 / 88 \\ <br />NYY & 69 & 78.88 & 89 & 76.38 & 44 / 88 \\ <br />OAK & 61 & 70.93 & 81 & 73.08 & 38 / 89 \\ <br />PHI & 64 & 74 & 84 & 72 & 42 / 90 \\ <br />PIT & 73 & 82.71 & 93 & 81.45 & 46 / 89 \\ <br />SDP & 62 & 72.04 & 82 & 75.53 & 38 / 89 \\ <br />SEA & 74 & 83.44 & 93 & 85.31 & 45 / 89 \\ <br />SFG & 87 & 96.8 & 106 & 89.55 & 57 / 90 \\ <br />STL & 77 & 87.13 & 97 & 90.03 & 46 / 88 \\ <br />TBR & 58 & 67.3 & 77 & 72.91 & 34 / 88 \\ <br />TEX & 81 & 91.22 & 101 & 83.75 & 54 / 90 \\ <br />TOR & 80 & 89.42 & 99 & 87.66 & 51 / 91 \\ <br />WSN & 86 & 95.42 & 105 & 93.21 & 54 / 90 \\ \hline\end{array}<br />It's still fairly difficult to predict final win totals even a little over halfway through the season - intervals have a width of approximately 20 games. 
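<br /><br />As a rough check on how the columns of the table relate, the "Mean" column appears to equal current wins plus the "True Win Total" rate (per 162 games) applied to the games remaining. That relationship is inferred here from the table values only; it is not a description of the estimator itself, and the column names in this R sketch are made up for illustration.

```r
# Three rows copied from the table above.
teams <- data.frame(
  team     = c("ARI", "ATL", "BAL"),
  wins     = c(38, 31, 51),          # current wins
  games    = c(90, 89, 87),          # games played so far
  true_162 = c(76.75, 68.40, 86.25), # "True Win Total" column
  mean_tab = c(72.11, 61.82, 90.93)  # "Mean" column as printed
)

# Assumed relationship: wins so far + true win rate * games remaining.
remaining       <- 162 - teams$games
teams$mean_calc <- teams$wins + (teams$true_162 / 162) * remaining

print(round(teams[, c("team", "mean_tab", "mean_calc")], 2))
```

For these three rows the recomputed values agree with the printed "Mean" column to two decimal places.<br /><br />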
A few stand-out points - the teams that are predicted to definitely finish below 0.500 are the Atlanta Braves, the Cincinnati Reds, the Minnesota Twins, and the Tampa Bay Rays, with the Reds being the worst of those teams (they are estimated as a "true" 63.51 win team). On the other side, the teams predicted to definitely finish above 0.500 are the Chicago Cubs, the Cleveland Indians, the San Francisco Giants, and the Washington Nationals, with the Cubs being the best of these teams (they are estimated as a "true" 96.11 win team). The Texas Rangers and San Francisco Giants in particular have been exceptionally lucky teams - they are predicted to win approximately 7 more games than their "true" win total. Likewise, the Atlanta Braves and Minnesota Twins have been unlucky, both predicted to win approximately 7 fewer games than their "true" win total.<br /><br />To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season until the all-star break). <br /><br />These quantiles are based off of a distribution - <a href="http://imgur.com/a/hXQtZ" target="_blank">I've uploaded a picture of each team's distribution to imgur</a>. The bars in red are the win total values covered by the 95% interval. 
The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins than would be expected based on their talent level.<br /><br />Posted by rcfoster<br /><br />Let's Code an MCMC for a Hierarchical Model for Batting Averages (posted 2016-05-24)<br /><br /><script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> In previous articles, I've discussed <a href="http://probabilaball.blogspot.com/2015/05/beta-binomial-empirical-bayes.html" target="_blank">empirical Bayesian estimation for the beta-binomial model</a>. Empirical Bayesian analysis is useful, but it's only an approximation to the full hierarchical Bayesian analysis. In this post, I'm going to work through the entire process of doing an equivalent full hierarchical Bayesian analysis with MCMC, from looking at the data and picking a model to creating the MCMC to checking the results. There are, of course, great packages and programs out there such as <a href="https://pymc-devs.github.io/pymc/" target="_blank">PyMC</a> and <a href="http://mc-stan.org/" target="_blank">Stan</a> that will fit the MCMC for you, but I want to give a basic and complete "under the hood" example.<br /><br />Before I get started, I want to be clear that coding a Bayesian analysis with MCMC from scratch involves many choices and multiple checks at almost all levels. 
I'm going to hand wave some choices based on what I know will work well (though I'll try to be clear where and why I'm doing so) and I'm not going to attempt to show every possible way of checking an MCMC procedure in one post - so statistics such as $\hat{R}$ and effective sample size will not be discussed. For a fuller treatment of Bayesian estimation using MCMC, I recommend Gelman et al.'s <i>Bayesian Data Analysis</i> and/or Carlin and Louis's <i>Bayesian Methods for Data Analysis</i>.<br /><br />As usual, <a href="https://github.com/Probabilaball/Blog-Code/tree/master/Lets-Code-an-MCMC-For-a-Beta-Binomial-Hierarchical-Model-for-Batting-Averages">all my code and data can be found on my github</a>.<br /><br /><h2>The Data and Notation</h2><br />The goal is to fit a hierarchical model to batting averages in the 2015 season. I'm going to limit my data set to only the batting averages of all MLB hitters (excluding pitchers) who had at least 300 AB, as those who do not meet these qualifications can arguably be said to come from a different "population" of players. This data was collected from fangraphs.com and can be seen in the histogram below.<br /><br /><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-DtPRE5BStIM/VurdyUKEzTI/AAAAAAAAAcY/0JRAzWsCTowMvAIczdXLDDeXW691aMCjQ/s1600/Batting%2BAverages%2BDistribution.jpeg" imageanchor="1"><img border="0" height="398" src="https://2.bp.blogspot.com/-DtPRE5BStIM/VurdyUKEzTI/AAAAAAAAAcY/0JRAzWsCTowMvAIczdXLDDeXW691aMCjQ/s400/Batting%2BAverages%2BDistribution.jpeg" width="400" /></a> </div><br /><br />For notation, I'm going to let $i$ index MLB players in the sample and define $\theta_i$ as a player's "true" batting average in 2015. The goal is to use the observed number of hits $x_i$ in $n_i$ at-bats (AB) to estimate $\theta_i$ for player $i$. 
I'll assume that I have $N$ total players - in 2015, there were $N = 254$ non-pitchers with at least 300 AB.<br /><br />I'm also going to use a $\sim$ over a variable to represent the collection of statistics over all players in the sample. For example, $\tilde{x} = \{ x_1, x_2, ..., x_N\}$ and $\tilde{\theta} = \{\theta_1, \theta_2, ..., \theta_N\}$.<br /><br />Lastly, when we get to the MCMC part, we're going to take samples from the posterior distributions rather than calculating them directly. I'm going to use $\mu^*_j$ to represent the set of samples from the posterior distribution for $\mu$, where $j$ indexes 1 to however many samples the computer is programmed to obtain (usually a very large number, since computation is relatively cheap these days), and similarly $\phi^*_j$ and $\theta^*_{i,j}$ for samples from the posterior distribution of $\phi$ and $\theta_i$, respectively.<br /><br /><h2>The Model</h2><br />First, the model must be specified. I'll assume that for each at-bat, a given player has identical probability $\theta_i$ of getting a hit, independent of other at-bats. The distribution of the total number of hits in $n_i$ at-bats is then binomial.<br /><br /><br /><div style="text-align: center;">$x_i \sim Bin(n_i, \theta_i)$</div><div style="text-align: center;"><br /><br /><div style="text-align: left;">For the distribution of the batting averages $\theta_i$ themselves, I'm going to use a beta distribution. Looking at the histogram of the data, it looks relatively unimodal and bell-shaped, and batting averages by definition must be between 0 and 1. Keep in mind that the distribution of <i>observed </i>batting averages $x_i/n_i$ is not the same as the distribution of <i>actual</i> batting averages $\theta_i$, but even after taking into account the binomial variation around the true batting averages, the distribution of the $\theta_i$ should also be unimodal, roughly bell-shaped, and bounded by 0 and 1. 
The beta distribution - bounded by 0 and 1 by definition - will be able to take that shape (though <a href="https://baseballwithr.wordpress.com/2016/05/16/a-conversation-with-herman-rubin/" target="_blank">others have plausibly argued that a beta is not entirely correct</a>).<br /><br />Most people are familiar with the beta distribution in terms of $\alpha$ and $\beta$: </div><br />$\theta_i \sim Beta(\alpha, \beta)$</div><div style="text-align: center;"><br /></div><div style="text-align: left;">There isn't anything wrong with coding an MCMC in this form (and it would almost certainly work well in this scenario), but I know from experience that a different parametrization works better - I'm going to use the beta distribution with parameters $\mu$ and $\phi$:<br /><br /><div style="text-align: center;"> $\theta_i \sim Beta(\mu, \phi)$</div><br />where $\mu$ and $\phi$ are given in terms of $\alpha$ and $\beta$ as</div><div style="text-align: left;"><br /></div><div style="text-align: center;">$\mu = \dfrac{\alpha}{\alpha + \beta}$</div><div style="text-align: center;">$\phi = \dfrac{1}{\alpha + \beta + 1}$</div><div style="text-align: center;"><br /></div><div style="text-align: left;">In this parametrization, $\mu$ represents the expected value $E[\theta_i]$ of the beta distribution - the true league mean batting average - and $\phi$, known formally as the "dispersion parameter," is the correlation between two individual at-bats from the same randomly chosen player - in sabermetric speak, it's how much a hitter's batting average has "stabilized" after a single at-bat. The value of $\phi$ controls how spread out the $\theta_i$ are around $\mu$. 
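<br /><br />Since the two parametrizations are used side by side in the rest of the post, here is a minimal sketch in R of how to move between them. The function names <i>ab.to.muphi</i> and <i>muphi.to.ab</i> are hypothetical helpers made up for this illustration, and the values $\alpha = 80$ and $\beta = 220$ are arbitrary. Inverting the two definitions above gives $\alpha = \mu (1-\phi)/\phi$ and $\beta = (1-\mu)(1-\phi)/\phi$, which are the quantities that show up in all of the formulas that follow.

```r
# Hypothetical helper names; neither function appears in the original post.
ab.to.muphi <- function(alpha, beta) {
  c(mu = alpha / (alpha + beta), phi = 1 / (alpha + beta + 1))
}

muphi.to.ab <- function(mu, phi) {
  M <- (1 - phi) / phi   # algebraically equal to alpha + beta
  c(alpha = mu * M, beta = (1 - mu) * M)
}

# Round trip with made-up values alpha = 80, beta = 220
p <- ab.to.muphi(80, 220)
q <- muphi.to.ab(p[["mu"]], p[["phi"]])
print(p)
print(q)
```

The round trip should return the original $\alpha$ and $\beta$ up to floating-point error.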
<br /><br />The advantage of using this parametrization instead of the traditional one is that both $\mu$ and $\phi$ are bounded between 0 and 1 (whereas $\alpha$ and $\beta$ can take any value from 0 to $\infty$) and a closed parameter space makes the process of specifying priors easier and will improve the convergence of the MCMC algorithm later on. <br /><br />Finally, priors must be chosen for the parameters $\mu$ and $\phi$. I'm going to lazily choose diffuse beta priors for both.<br /><br /><div style="text-align: center;">$\mu \sim Beta(0.5,0.5)$</div><div style="text-align: center;">$\phi \sim Beta(0.5,0.5)$</div><div style="text-align: center;"><br /></div><div style="text-align: left;">The advantage of choosing beta distributions for both (possible with the parametrization I used!) is that both priors are proper (in the sense of being valid probability density functions), and proper priors always yield proper posteriors - so that eliminates one potential problem to worry about. These prior distributions are definitely arguable - they put a fair amount of probability at the ends of the distributions, and I know for a fact that the true league mean batting average isn't actually 0.983 or 0.017, but I wanted to use something that worked well in the MCMC procedure and wasn't simply a flat uniform prior between 0 and 1. </div><br /></div><h2>The Math</h2><h2></h2><br />Before jumping into the code, we need to do some math. 
The mass function of the binomial distribution for the $x_i$, the density of the beta distribution for the $\theta_i$ (in terms of $\mu$ and $\phi$), and the beta prior densities for $\mu$ and $\phi$ are given by<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$p(x_i | n_i, \theta_i) = \displaystyle {n_i \choose x_i} \theta_i^{x_i} (1-\theta_i)^{n_i - x_i}$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$p(\theta_i | \mu, \phi) = \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$ </div><div style="text-align: center;"><br /></div><div style="text-align: center;">$\pi(\mu) = \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)}$</div><div style="text-align: center;"><br /></div><div style="text-align: center;"> $\pi(\phi) = \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$</div><br /><br /> From Bayes' theorem, the joint posterior density of $\mu$, $\phi$, and all $N = 254$ of the $\theta_i$ is given by<br /><br /><div style="text-align: center;">$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) = \dfrac{p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi)}{\int \int ... \int \int p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi) d\tilde{\theta} d\mu d\phi}$</div><br />The $...$ in the integrals means that every single one of the $\theta_i$ must be integrated out as well as $\mu$ and $\phi$, so the numerical integration here involves 256 dimensions. This is not numerically tractable, hence Markov chain Monte Carlo will be used instead.<br /><br />The goal of Markov chain Monte Carlo is to draw a "chain" of samples $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ from the posterior distribution $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$. 
This is going to be accomplished in iterations, where at each iteration $j$ the distribution of the samples depends only on the values at the previous iteration $j-1$ (this is the "Markov" property of the chain). There are two basic "building block" techniques that are commonly used to do this.<br /><br />The first technique is called the Gibbs sampler. The full joint posterior $p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n})$ may not be known, but suppose that given values of the other parameters, the <i>conditional</i> posterior distribution $p(\tilde{\theta} | \mu, \phi, \tilde{x}, \tilde{n})$ is known - if so, it can be used to simulate $\tilde{\theta}$ values from $p(\tilde{\theta} | \mu^*_j, \phi^*_j, \tilde{x}, \tilde{n})$. <br /><br />Looking at the joint posterior described above, the denominator of the posterior density (after performing all integrations) is just a normalizing constant, so we can focus on the numerator:<br /><br /><div style="text-align: center;">$p(\tilde{\theta}, \mu, \phi | \tilde{x}, \tilde{n}) \propto p( \tilde{x}, \tilde{n}| \tilde{\theta} )p(\tilde{\theta} | \mu, \phi) \pi(\mu) \pi(\phi) $</div><br /><div style="text-align: center;">$= \displaystyle \prod_{i = 1}^N \left( {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \dfrac{\mu^{-0.5}(1-\mu)^{-0.5}}{\beta(0.5,0.5)} \dfrac{\phi^{-0.5}(1-\phi)^{-0.5}}{\beta(0.5,0.5)}$</div><br />From here, we can ignore any of the terms above that do not have a $\phi$, a $\mu$, or a $\theta_i$ in them, since those will either cancel out or remain constants in the full posterior as well:<br /><br /><div style="text-align: center;">$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) 
\mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$</div><br />Now we're going to check and see if there are any terms that, when looked at as variables with <u>everything else</u> treated as a constant, take the form of a recognizable distribution. It turns out that the function<br /><br /><div style="text-align: center;"><span style="font-size: large;">$\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}$ </span></div><br />is the kernel (the un-normalized density) of a beta distribution for $\theta_i$ with parameters<br /><br /><div style="text-align: center;"> $\alpha_i = x_i + \mu \left(\dfrac{1-\phi}{\phi}\right)$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$\beta_i = n_i - x_i + (1-\mu) \left(\dfrac{1-\phi}{\phi}\right) $</div><br />since we are assuming $\mu$ and $\phi$ are <i>known</i> in the conditional distribution. Hence, we can say that the conditional distribution of the $\theta_i$ given $\mu$, $\phi$, and the data is beta.<br /><br />This fact can be used in the MCMC to draw an observation $\theta^*_{i,j}$ from the posterior distribution for each $\theta_i$ given draws $\mu^*_j$ and $\phi^*_j$ from the posterior distributions for $\mu$ and $\phi$:<br /><br /><div style="text-align: center;">$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$</div><br />Note that this formulation uses the traditional $\alpha, \beta$ parametrization. 
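<br /><br />As a rough sketch of what one such draw could look like in R, the snippet below uses made-up data for three hypothetical players and assumed current values of $\mu$ and $\phi$ (none of these numbers come from the real data set). It relies on the fact that R's rbeta() is vectorized over its shape arguments, so every $\theta_i$ can be drawn in a single call.

```r
# Made-up data for three hypothetical players: hits x in n at-bats
x <- c(150, 120, 95)
n <- c(500, 480, 310)

# Assumed current draws of mu and phi (illustrative values only)
mu.cur  <- 0.265
phi.cur <- 0.002

# (1 - phi)/phi appears in both shape parameters, so compute it once
M <- (1 - phi.cur) / phi.cur

# One conditional beta draw per player, in the alpha/beta parametrization
set.seed(1)
theta.new <- rbeta(length(x), x + mu.cur * M, n - x + (1 - mu.cur) * M)
print(theta.new)
```

Each draw is pulled from the observed average $x_i/n_i$ toward $\mu$, with the amount of pull determined by $(1-\phi)/\phi$ relative to $n_i$.<br /><br />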
This is a "Gibbs step" for the $\theta_i$.<br /><br />Unfortunately, looking at $\mu$ and $\phi$ in isolation doesn't yield a similar outcome - observing just the terms involving $\mu$ and treating everything else as constant, for example, gives the function<br /><br /><div style="text-align: center;">$\displaystyle \prod_{i = 1}^N \left( \dfrac{\theta_i^{\mu (1-\phi)/\phi - 1} (1-\theta_i)^{(1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}$</div><br />which is not recognizable as the kernel of any common density. Doing the same thing for $\phi$ gives a nearly identical function. Hence, the Gibbs technique won't be used for $\mu$ and $\phi$.<br /><br />One advantage, however, of recognizing that the conditional distribution of the $\theta_i$ given all other parameters is beta is that we can integrate the $\theta_i$ out in the likelihood in order to get at the distributions of $\mu$ and $\phi$ more directly:<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$\displaystyle p(x_i, n_i | \mu, \phi) = \int_0^1 p(x_i, n_i | \theta_i) p(\theta_i | \mu, \phi) d\theta_i = \int_0^1 {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu (1-\phi)/\phi - 1} (1-\theta_i)^{n_i - x_i + (1-\mu) (1-\phi)/\phi - 1}}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} d\theta_i $</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$ = \displaystyle {n_i \choose x_i} \dfrac{\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)}$</div><br />In fact, we can do this for <i>every single one</i> of the $\theta_i$ in the formula above and rewrite the posterior function just in terms of $\mu$ and $\phi$:<br /><br /><div style="text-align: center;">$p(\mu, \phi | \tilde{x}, \tilde{n}) \propto \displaystyle \prod_{i = 1}^N \left( \dfrac{\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi)}{\beta(\mu 
(1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)} \right) \mu^{-0.5}(1-\mu)^{-0.5}\phi^{-0.5}(1-\phi)^{-0.5}$</div><br />This leads directly into the second (and more general) technique for obtaining draws from the posterior distribution: the Metropolis-Hastings algorithm. Suppose that instead of the full posterior $p(\mu, \phi | \tilde{x}, \tilde{n})$, you have a function that is <i>proportional </i>to the full posterior (like the numerator above)<br /><br /><div style="text-align: center;">$h(\mu, \phi | \tilde{x}, \tilde{n}) \propto p(\mu, \phi | \tilde{x}, \tilde{n})$</div><br />It's possible to construct a Markov chain of $\mu^*_j$ samples using the following steps:<br /><ol><li>Simulate a candidate value $\mu^*_c$ from some distribution $G(\mu^*_c | \mu^*_{j-1})$</li><li>Simulate $u$ from a uniform distribution between 0 and 1.</li><li>Calculate the ratio </li></ol><div style="text-align: center;">$\dfrac{h(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}{h(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n})}$</div><div style="text-align: center;"><br /></div> If this ratio is larger than $u$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$.<br /> If this ratio is smaller than $u$, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$<br /><ol></ol><br />A nearly identical step may be used to draw a sample $\phi^*_j$, only using $h(\mu^*_{j-1}, \phi^*_{c} | \tilde{x}, \tilde{n})$ instead. Note that at each Metropolis-Hastings step the value from the <u>previous</u> iteration is used, even if a new value for another parameter was accepted in another step.<br /><br />In practice, there are two things that are very commonly (but not always) done for Metropolis-Hastings steps: first, calculations are generally performed on the <i>log</i> scale, as the computations become much, much more numerically stable. 
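<br /><br />To see why the log scale matters for a function like this one, here is a small sketch with made-up data: 254 identical hypothetical players and assumed values of $\mu$ and $\phi$, chosen only so the numbers are of a realistic size. It compares the product of beta-function ratios computed directly with the same quantity computed as a sum of lbeta() differences.

```r
# Made-up data: 254 hypothetical players, each with 150 hits in 500 at-bats
x <- rep(150, 254)
n <- rep(500, 254)

# Assumed parameter values, chosen only for illustration
mu  <- 0.265
phi <- 0.002
M   <- (1 - phi) / phi

# Raw scale: each ratio of beta functions is tiny, and the product of
# 254 of them underflows to exactly 0 in double precision
raw.scale <- prod(beta(x + mu * M, n - x + (1 - mu) * M) / beta(mu * M, (1 - mu) * M))

# Log scale: the same quantity as a sum of lbeta() differences stays finite
log.scale <- sum(lbeta(x + mu * M, n - x + (1 - mu) * M) - lbeta(mu * M, (1 - mu) * M))

print(raw.scale)
print(log.scale)
```

If both the candidate and the previous value were evaluated on the raw scale, the ratio would be 0 divided by 0 and the accept/reject decision could not be made, while the difference of the two log values is perfectly well behaved.<br /><br />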
To do this, we simply need to take the log of the function $h(\mu, \phi | \tilde{x}, \tilde{n})$ above: <br /><br /><div style="text-align: center;">$m(\mu, \phi | \tilde{x}, \tilde{n}) = \log[h(\mu, \phi | \tilde{x}, \tilde{n})] = \displaystyle \sum_{i = 1}^N \left[ \log(\beta(x_i + \mu (1-\phi)/\phi, n_i - x_i + (1-\mu) (1-\phi)/\phi))\right]$<br /></div><div style="text-align: center;">$- N \log(\beta(\mu (1-\phi)/\phi, (1-\mu) (1-\phi)/\phi)) - 0.5\log(\mu) - 0.5\log(1-\mu) - 0.5\log(\phi) - 0.5\log(1-\phi)$</div><br />This $m$ function is called repeatedly throughout the code. Secondly, for the candidate distribution, a normal distribution is used centered at the previous value of the chain, with some pre-chosen variance $\sigma^2$, which I will explain how to determine in the next section. Using $\mu$ as an example, the candidate distribution would be<br /><br /><div style="text-align: center;">$G(\mu^*_c | \mu^*_{j-1}) \sim N(\mu^*_{j -1}, \sigma^2_{\mu})$</div><br />Using these two adjustments, the Metropolis-Hastings step for $\mu$ then becomes<br /><br /><ol><li>Simulate a candidate value $\mu^*_c$ from a $N(\mu^*_{j-1}, \sigma^2_{\mu})$ distribution.</li><li>Simulate $u$ from a uniform distribution between 0 and 1.</li><li>If $m(\mu^*_{c}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) - m(\mu^*_{j-1}, \phi^*_{j-1} | \tilde{x}, \tilde{n}) > \log(u)$, accept the candidate value and declare $\mu^*_j = \mu^*_{c}$. Otherwise, reject the candidate value and declare $\mu^*_j = \mu^*_{j-1}$.</li></ol><br /><br />With Metropolis-Hastings steps and Gibbs steps, we can create a Markov chain that converges to the posterior distribution.<br /><br /><br /><h2>Choosing Starting Values and Checking Output</h2><br /><br />Now that we have either the conditional posteriors we need for the Gibbs sampler or a function proportional to them for the Metropolis-Hastings steps, it's time to write code to sample from them. 
Each iteration of the MCMC code will perform the following steps:<br /><br /><ol><li>Draw a candidate value $\mu^*_c$ from $N(\mu^*_{j-1}, \sigma^2_{\mu})$.</li><li>Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\mu^*_c$. If accepted, set $\mu^*_j = \mu^*_c$. If rejected, set $\mu^*_j = \mu^*_{j - 1}$.</li><li>Draw a candidate value $\phi^*_c$ from $N(\phi^*_{j-1}, \sigma^2_{\phi})$.</li><li>Perform a Metropolis-Hastings calculation to determine whether to accept or reject $\phi^*_c$. If accepted, set $\phi^*_j = \phi^*_c$. If rejected, set $\phi^*_j = \phi^*_{j - 1}$.</li><li>For each of the $\theta_i$, draw a new $\theta^*_{i,j}$ from the conditional beta distribution:</li></ol><div style="text-align: center;">$\theta^*_{i,j} \sim Beta\left(x_i + \mu^*_j \left(\dfrac{1-\phi^*_j}{\phi^*_j}\right), n_i - x_i + (1- \mu^*_j )\left(\dfrac{1-\phi^*_j}{\phi^*_j}\right) \right)$</div><br />Again, note that this formulation of the beta distribution uses the traditional $\alpha, \beta$ parametrization.<br /><br />A problem emerges - we need starting values $\mu^*_1$ and $\phi^*_1$ before we can use the algorithm (starting values for the $\theta^*_{i,1}$ aren't needed - the Gibbs sampler in step 5 above can be used to simulate them given starting values for the other two parameters). Ideally, you would pick starting values in a high-probability area of the posterior distribution, but if you knew the posterior distribution you wouldn't be performing MCMC!<br /><br />You could just pick arbitrary starting points - statistical theory says that no matter what starting values you choose, the distribution of samples from the Markov chain will <i>eventually </i>converge to the distribution of the posterior you want (assuming certain regularity conditions which I will not go into), but there's no hard and fast rule on how long it will take. 
If you pick values extremely far away from the posterior, it could take quite a while for your chain to converge. There's a chance you could have run your code for 10,000 iterations and <i>still</i> not have reached the posterior distribution, and there's no way of knowing since you don't know the posterior to begin with!<br /><br />Statisticians generally do two things to check that this hasn't occurred:<br /><ol><li>Use multiple starting points to create multiple chains of $\mu^*_j$, $\phi^*_j$, and $\theta^*_{i,j}$ that can be compared (visually or otherwise) to see if they all appear to have converged to the same area in the parameter space.</li><li>Use a fixed number of "burn-in" iterations to give the chain a chance to converge to the posterior distribution before taking the "real" draws from the chain.</li></ol><br />There is no definite answer on exactly how to pick the different starting points - you could randomly choose points in the parameter space (which is handily confined to between 0 and 1 for the parametrization I used!), or you could obtain estimates from some frequentist statistical procedure (such as method of moments or marginal maximum likelihood) and use those, or you could pick values based on your own knowledge of the problem - for example, choosing $\mu^*_1 = 0.265$ based on knowing that the league mean batting average is probably close to that value. No matter how you do it, starting points should be spread out over the parameter space to make sure the chains aren't all going to the same place just because they started off close to each other.<br /><br />Two more questions must be answered to perform the Metropolis-Hastings step - how do you choose $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ in the normal candidate distributions? And how often should you accept the candidate values?<br /><br />The answers to these questions are closely tied to each other. 
For mathematical reasons that I will not go into in this article (and a bit of old habit), I usually aim for an acceptance rate of roughly 40%, though the specific value depends on the dimensionality of the problem (see <a href="http://www.stat.columbia.edu/~gelman/research/published/baystat5.pdf" target="_blank">this paper</a> by Gelman, Roberts, and Gilks for more information). In practice, I'm usually not worried if it's 30% or 50% as long as everything else looks okay.<br /><br />If the acceptance rate is good, then a plot of the value of the chain versus the iteration number (called a "trace plot") should look something like<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-OX5VnLevSKU/Vz-1WtcPnCI/AAAAAAAAAeM/IH2_vPAWYlQWnmNy0nORdFGsg80IJG2EwCLcB/s1600/Good%2BMixing.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="260" src="https://2.bp.blogspot.com/-OX5VnLevSKU/Vz-1WtcPnCI/AAAAAAAAAeM/IH2_vPAWYlQWnmNy0nORdFGsg80IJG2EwCLcB/s640/Good%2BMixing.jpeg" width="640" /></a></div><br />I've used two chains for $\mu$ here, starting at different points. The "spiky blob" shape is exactly what we're looking for - the values of the chains jump around at a good pace, while still making large enough jumps to effectively cover the parameter space. <br /><br />If the acceptance rate is too small or too large, it can be adjusted by changing $\sigma^2$ in the normal candidate distribution. An acceptance rate that is too <i>low</i> means that the chains will not move around the parameter space effectively. 
If this is the case, a plot of the chain value versus the iteration number looks like<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-qXgwZLcR-Xw/Vz-1Wgg6BVI/AAAAAAAAAeU/-ENt_NtVkIs6s4aFxvOQa4PoKqWDcrYdwCKgB/s1600/Poor%2BMixing.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="260" src="https://1.bp.blogspot.com/-qXgwZLcR-Xw/Vz-1Wgg6BVI/AAAAAAAAAeU/-ENt_NtVkIs6s4aFxvOQa4PoKqWDcrYdwCKgB/s640/Poor%2BMixing.jpeg" width="640" /></a></div><br />The plot looks nicer visually, but that's not a good thing - sometimes the chains stay at the same value for hundreds of iterations! The solution to this problem is to <i>lower</i> $\sigma^2$ so that the candidate values are closer to the previous value, and more likely to be accepted.<br /><br />Conversely, if the acceptance rate is too high then the chains will still explore the parameter space, but much too slowly. A plot of the chain value versus the iteration looks like<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-p4e-abL1wPM/Vz-1WnTNt2I/AAAAAAAAAeU/IKnubx_ilR49GAKYyv8DgCtpOKP0T284QCKgB/s1600/Slow%2BMixing.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="260" src="https://2.bp.blogspot.com/-p4e-abL1wPM/Vz-1WnTNt2I/AAAAAAAAAeU/IKnubx_ilR49GAKYyv8DgCtpOKP0T284QCKgB/s640/Slow%2BMixing.jpeg" width="640" /></a></div><br />In this plot, it looks like the two chains don't <i>quite</i> converge to the posterior distribution until hundreds of iterations after the initial draws. Furthermore, the chains are jumping to new values at nearly every iteration, but the jumps are so small that it takes an incredibly large number of iterations to explore the parameter space. 
If this is the case, the solution is to <i>increase</i> $\sigma^2$ so that the candidates are further from the current value, and less likely to be accepted.<br /><br />The value of $\sigma^2$, then, is often chosen by trial-and-error after the code has been written by manually adjusting the value in multiple runs of the MCMC so that the trace plots have the "spiky blob" shape and the acceptance rate is reasonable. Through this method, I found that the following candidate distributions for $\mu$ and $\phi$ worked well. <br /><br /><div style="text-align: center;">$\mu^*_c \sim N(\mu^*_{j-1}, 0.005^2)$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$\phi^*_c \sim N(\phi^*_{j-1}, 0.001^2)$</div><div style="text-align: center;"><br /></div><br /><h2>The Code</h2><br />Now that we know the steps the code will take and what inputs are necessary, coding can begin. I typically code in R, and find it useful to write a function that has inputs of data vectors, starting values for any parameters, and any MCMC tuning parameters I might want to change (such as the number of draws, length of the burn-in period, or the variance of the candidate distributions). 
In the code below, I set the burn-in period and number of iterations to default to 1000 and 5000, respectively, and after running the code several times without defaults for candidate variances, I determined values of $\sigma^2_{\mu}$ and $\sigma^2_{\phi}$ that produced reasonable trace plots and acceptance rates and set those as defaults as well.<br /><br />For output, I used the <span style="font-family: "courier new" , "courier" , monospace;">list()</span> structure in R to return a vector chain of $\mu^*_j$, a vector chain of $\phi^*_j$, a matrix of chains $\theta^*_{i,j}$, and a vector of acceptance rates for the Metropolis-Hastings steps for $\mu$ and $\phi$.<br /><br />The raw code for the MCMC function is shown below, and annotated code <a href="https://github.com/Probabilaball/Blog-Code/blob/master/Lets-Code-an-MCMC-For-a-Beta-Binomial-Hierarchical-Model-for-Batting-Averages/Beta%20Binomial%20MCMC%20%28Function%20Code%20Only%29.R">may be found on my GitHub</a>.<br /><br /><span style="font-size: small;"><span style="font-family: "courier new" , "courier" , monospace;">betaBin.mcmc <- function(x, n, mu.start, phi.start, burn.in = 1000, n.draws = 5000, sigma.mu = 0.005, sigma.phi = 0.001) {<br /><br /> m = function(mu, phi, x, n) {<br /> N = length(x)<br /> l = sum(lbeta(mu*(1-phi)/phi + x, (1-mu)*(1-phi)/phi+n-x)) - N*lbeta(mu*(1-phi)/phi, (1-mu)*(1-phi)/phi)<br /> p = -0.5*log(mu) - 0.5*log(1-mu) - 0.5*log(phi) - 0.5*log(1-phi)<br /> return(l + p)<br /> }<br /><br /> phi = rep(0, burn.in + n.draws)<br /> mu = rep(0, burn.in + n.draws)<br /> theta = matrix(rep(0, length(n)*(burn.in + n.draws)), length(n), (burn.in + n.draws))<br /><br /> acceptance.mu = 0<br /> acceptance.phi = 0<br /><br /> mu[1] = mu.start<br /> phi[1] = phi.start<br /><br /> for(i in 1:length(x)) {<br /> theta[i, 1] = rbeta(1, mu[1]*(1-phi[1])/phi[1] + x[i], (1-phi[1])/phi[1]*(1-mu[1]) + n[i] - x[i])<br /> }</span></span><br /><span style="font-size: small;"><span 
style="font-family: "courier new" , "courier" , monospace;"><br /> for(j in 2:(burn.in + n.draws)) {</span></span><br /><span style="font-size: small;"><span style="font-family: "courier new" , "courier" , monospace;"><br /> phi[j] = phi[j-1]<br /> mu[j] = mu[j-1]</span></span><br /><span style="font-size: small;"><span style="font-family: "courier new" , "courier" , monospace;"><br /> cand = rnorm(1, mu[j-1], sigma.mu)<br /><br /> if((cand > 0) & (cand < 1)) {</span></span><br /><span style="font-size: small;"><span style="font-family: "courier new" , "courier" , monospace;"><br /> m.old = m(mu[j-1],phi[j-1],x,n)<br /> m.new = m(cand,phi[j-1],x,n)<br /><br /> u = runif(1)<br /><br /> if((m.new - m.old) > log(u)) {<br /> mu[j] = cand<br /> acceptance.mu = acceptance.mu+1<br /> }<br /> }</span></span><br /><span style="font-size: small;"><span style="font-family: "courier new" , "courier" , monospace;"><br /> cand = rnorm(1,phi[j-1],sigma.phi)<br /> <br /> if( (cand > 0) & (cand < 1)) {</span></span><br /><span style="font-size: small;"><span style="font-family: "courier new" , "courier" , monospace;"> <br /> m.old = m(mu[j-1],phi[j-1],x,n)<br /> m.new = m(mu[j-1],cand,x,n)<br /><br /> u = runif(1)<br /><br /> if((m.new - m.old) > log(u)) {<br /> phi[j] = cand<br /> acceptance.phi = acceptance.phi + 1<br /> } <br /> }<br /> </span></span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-size: small;"> for(i in 1:length(n)) {<br /> theta[i, j] = rbeta(1, (1-phi[j])/phi[j]*mu[j] + x[i], (1-phi[j])/phi[j]*(1-mu[j]) + n[i] - x[i])<br /> } <br /><br /> } <br /><br /> mu <- mu[(burn.in + 1):(burn.in + n.draws)]<br /> phi <- phi[(burn.in + 1):(burn.in + n.draws)]<br /> theta <- theta[,(burn.in + 1):(burn.in + n.draws)]<br /><br /> return(list(mu = mu, phi = phi, theta = theta, acceptance = c(acceptance.mu/(burn.in + n.draws), acceptance.phi/(burn.in + n.draws))))<br /><br />}</span></span><br /><br />This, of course, is not the only 
way it may be coded, and I'm sure that others with more practical programming experience could easily improve upon this code. Note that I add an additional wrinkle to the formulation given in the previous sections to address a practical concern - I immediately reject a candidate value if it is less than 0 or larger than 1. This is not the only possible way to deal with this potential problem, but it works well in my experience, and the acceptance rate and/or starting points can be adjusted if the issue becomes serious.<br /><br />There is a bit of redundancy in the code - the quantity <span style="font-family: "courier new" , "courier" , monospace;">m.old</span> is calculated twice, when it is used identically in both Metropolis-Hastings steps - and I'm inflating the acceptance rate slightly by including the burn-in iterations, but the chains should converge quickly so the effect will be minimal, and more draws can always be taken to minimize the effect. <br /><br />Though coded in R, the principles should apply no matter which language you use - hopefully you could take this setup and write code in C or Python if you wanted to. <br /><br /><h2>The Results</h2><br />Using the function defined above, I ran three separate chains of 5000 iterations each after a burn-in of 1000 draws. 
For starting points, I picked values near where I thought the posterior means would end up, plus values both above and below, to check that all chains converged to the same distributions.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> chain.1 <- betaBin.mcmc(x,n, 0.265, 0.002)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">> chain.2 <- betaBin.mcmc(x,n, 0.5, 0.1)</span><br /><span style="font-family: "courier new" , "courier" , monospace;">> chain.3 <- betaBin.mcmc(x,n, 0.100, 0.0001)</span><br /><br />Checking the acceptance rates for $\mu$ and $\phi$ from each of the three chains, all are reasonable:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> chain.1$\$$acceptance<br />[1] 0.3780000 0.3613333<br />> chain.2$\$$acceptance<br />[1] 0.4043333 0.3845000<br />> chain.3$\$$acceptance<br />[1] 0.3698333 0.3768333</span> <br /><br />(Since the $\theta_i$ were obtained by a Gibbs sampler, they do not have an associated acceptance rate) <br /><br />Next, plots of the chain value versus iteration for $\mu$, $\phi$, and $\theta_1$ show all three chains appear to have converged to the same distribution, and the trace plots appear to have the "spiky blob" shape that indicates good mixing:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-f6b2WukhCcM/VywHbRkOUxI/AAAAAAAAAdQ/K8qzuhpXZPIQeaEYk2Zy9ha8s7SCEqy3wCLcB/s1600/MCMC%2BChains.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="https://3.bp.blogspot.com/-f6b2WukhCcM/VywHbRkOUxI/AAAAAAAAAdQ/K8qzuhpXZPIQeaEYk2Zy9ha8s7SCEqy3wCLcB/s640/MCMC%2BChains.jpeg" width="640" /></a></div><br /><br />Hence, we can use our MCMC draws to estimate properties of the posterior. 
To do this, combine the results of all three chains into one big set of draws for each variable:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">mu <- c(chain.1$\$$mu, chain.2$\$$mu, chain.3$\$$mu)<br />phi <- c(chain.1$\$$phi, chain.2$\$$phi, chain.3$\$$phi)<br />theta <- cbind(chain.1$\$$theta, chain.2$\$$theta, chain.3$\$$theta)</span><br /><br />Statistical theory says that posterior distributions should converge to a normal distribution as the sample size increases. 
With a sample size of $N = 254$ batting averages, posteriors should be close to normal in the parametrization I used - though normality of the posteriors is in general not a guarantee that everything has worked well, nor is non-normality evidence that something has gone wrong.<br /><br />First, the posterior distribution for league batting average can be seen just by taking a histogram:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> hist(mu) </span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-JKl2ZiW3htg/VywRXQm7vwI/AAAAAAAAAd0/u3UZd3KiccUaBHO3iiorKVDySfgnoNhfwCKgB/s1600/Mu%2BHistogram.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://4.bp.blogspot.com/-JKl2ZiW3htg/VywRXQm7vwI/AAAAAAAAAd0/u3UZd3KiccUaBHO3iiorKVDySfgnoNhfwCKgB/s400/Mu%2BHistogram.jpeg" width="400" /></a></div><br />The histogram looks almost perfectly normally distributed - about as close to the ideal as is reasonable.<br /><br />Next, we want to get an estimator for the league mean batting average. 
There are a few different ways to turn the posterior sample $\mu^*_j$ into an estimator $\hat{\mu}$, but I'll give the simplest here (and since the posterior distribution looks normal, other methods should give very similar results) - taking the sample average of the $\mu^*_j$ values: <br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> mean(mu)<br />[1] 0.2660155<br /> </span><br /><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "times" , "times new roman" , serif;">Similarly, we can get an estimate of the standard error for $\hat{\mu}$ and a 95% <a href="http://probabilaball.blogspot.com/2015/07/bayesian-credible-intervals-for-batting.html" target="_blank">credible interval</a> for $\mu$ by taking the standard deviation and quantiles from $\mu^*_j$<span style="font-family: "times" , "times new roman" , serif;">:</span></span></span><br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> sd(mu)<br />[1] 0.001679727<br />> quantile(mu,c(.025,.975))<br /> 2.5% 97.5% <br />0.2626874 0.2693175 </span><br /><br />For $\phi$, do the same thing - first look at the histogram:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-B0EMAETV0ro/VywRXvpXHKI/AAAAAAAAAdw/S2dE7kS2qgg7f0HuHKZDhAzWax3a82fpwCKgB/s1600/Phi%2BHistogram.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://1.bp.blogspot.com/-B0EMAETV0ro/VywRXvpXHKI/AAAAAAAAAdw/S2dE7kS2qgg7f0HuHKZDhAzWax3a82fpwCKgB/s400/Phi%2BHistogram.jpeg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><br />There is one outlier on the high side - which can happen in an MCMC chain simply by chance - and a slight skew to the right, but otherwise, the posterior looks close to normal. 
The mean, standard deviation, and a 95% credible interval are given by<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> mean(phi)<br />[1] 0.001567886<br />> sd(phi)<br />[1] 0.000332519<br />> quantile(phi,c(.025,.975))<br /> 2.5% 97.5% <br />0.0009612687 0.0022647623</span><br /><br /><br />Furthermore, let's say that instead of $\phi$, I had a particular function of one of the parameters in mind instead - for example, I mentioned at the beginning that $\phi$ is, in sabermetric speak, the proportion of stabilization after a single at-bat. This can be turned into the general so-called "stabilization point" $M$ by<br /><br /><div style="text-align: center;">$M = \dfrac{1-\phi}{\phi}$</div><br />and so to get a posterior distribution for $M$, all we need to do is apply this transformation to each draw from $\phi^*_j$. A histogram of $M$ is given by<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> hist((1-phi)/phi) </span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-7L1C9ga4IyQ/VywRXZ11TcI/AAAAAAAAAd0/C2M1FN2ALYwoxhKNqpxUtWmroJclILX8wCKgB/s1600/M%2BHistogram.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://1.bp.blogspot.com/-7L1C9ga4IyQ/VywRXZ11TcI/AAAAAAAAAd0/C2M1FN2ALYwoxhKNqpxUtWmroJclILX8wCKgB/s400/M%2BHistogram.jpeg" width="400" /></a></div><br />The histogram is skewed clearly to the right, but that's okay since $M$ is not one of the parameters in the model. 
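<br /><br />One practical point about transformed parameters: summaries of $M$ should be computed from the transformed draws, not by plugging a point estimate of $\phi$ into the formula. Plugging the posterior mean of $\phi$ reported above (0.001567886) into $(1-\phi)/\phi$ gives roughly 636.8, which will generally not equal the mean of the transformed draws because the transformation is nonlinear. The following is a minimal sketch in R. It uses simulated log-normal draws as a hypothetical stand-in for the pooled phi vector, and the names phi_draws, M_draws, and plug_in are illustrative, not taken from the original code.

```r
set.seed(42)

# Hypothetical stand-in for the pooled MCMC draws of phi (NOT the real chains):
# positive, slightly right-skewed values centered near 0.00155.
phi_draws <- rlnorm(3000, meanlog = log(0.00155), sdlog = 0.2)

# Transform every draw, then summarize the transformed draws.
M_draws   <- (1 - phi_draws) / phi_draws
post_mean <- mean(M_draws)
post_ci   <- quantile(M_draws, c(0.025, 0.975))

# Plug-in alternative: transform the posterior mean of phi instead.
plug_in <- (1 - mean(phi_draws)) / mean(phi_draws)

# (1 - phi)/phi = 1/phi - 1 is convex for phi > 0, so by Jensen's inequality
# the mean of the transformed draws is never below the plug-in value.
print(c(posterior_mean = post_mean, plug_in = plug_in))
print(post_ci)
```

Quantiles, unlike means, do carry over directly under a monotone transformation: because $(1-\phi)/\phi$ is decreasing in $\phi$, the 2.5% quantile of $M$ corresponds to the 97.5% quantile of $\phi$ and vice versa.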
<br /><br />An estimate and 95% credible interval for the stabilization point are given by taking the average and quantiles of the transformed values:<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> mean((1-phi)/phi)<br />[1] 667.8924</span><br /><span style="font-family: "courier new" , "courier" , monospace;">> quantile((1-phi)/phi, c(0.025,0.975))<br /> 2.5% 97.5% <br /> 440.5474 1039.2918 </span><br /><br />This estimate is different from the value I gave in my article <a href="http://www.probabilaball.com/2016/03/2016-stabilization-points.html" target="_blank">2016 Stabilization Points</a> because the calculations in that article used the past six years of data - this calculation only uses one. This is also why the uncertainty is so much larger.<br /><br />Lastly, we can get at what we really want - estimates of the "true" batting averages $\theta_i$ for each player. I'm going to look at $i = 1$ (the first player in the sample), who happens to be Bryce Harper, the National League MVP in 2015. His batting average was 0.330 (from $x_1 = 172$ hits in $n_1 = 521$ AB), but the effect of fitting the hierarchical Bayesian model is to shrink the estimate of his "true" batting average $\theta_1$ towards the league mean $\mu$ - and by quite a bit in this case, since Bryce had nearly the largest batting average in the sample. 
A histogram of the $\theta^*_{1,j}$ shows, again, a roughly normal distribution.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> hist(<span style="font-family: "courier new" , "courier" , monospace;">t<span style="font-family: "courier new" , "courier" , monospace;">heta[1,]</span></span></span><span style="font-family: "courier new" , "courier" , monospace;"></span><span style="font-family: "courier new" , "courier" , monospace;">) </span><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-9yFzzEqKd80/VywRXTheG9I/AAAAAAAAAd0/6omLA5trchgUR31qYsYTRG3ktQW-GgNCQCKgB/s1600/Harper%2BAverage%2BHistogram.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://1.bp.blogspot.com/-9yFzzEqKd80/VywRXTheG9I/AAAAAAAAAd0/6omLA5trchgUR31qYsYTRG3ktQW-GgNCQCKgB/s400/Harper%2BAverage%2BHistogram.jpeg" width="400" /></a></div><br />and an estimate of his true batting average, standard error of the estimate, and 95% credible interval for the estimate are given by<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> mean(theta[1,])<br />[1] 0.2947706<br />> sd(theta[1,])<br />[1] 0.01366782<br />> quantile(theta[1,], c(0.025,0.975))<br /> 2.5% 97.5% <br />0.2687120 0.3222552 </span><br /><br />Other functions of the batting averages, functions of the league mean and variance, or <a href="http://www.probabilaball.com/2015/09/the-posterior-predictive.html" target="_blank">posterior predictive calculations</a> can be performed using the posterior samples $\mu^*$, $\phi^*$, and $\theta^*_i$.<br /><br /><br /><h2>Conclusion and Connections</h2><br /><br />MCMC techniques similar to the ones shown here have become fairly standard in Bayesian estimation, though there are more advanced techniques in use today that build upon these "building block" steps by, to give one example, adaptively changing the acceptance rate as the 
code runs rather than guessing-and-checking to find a reasonable value.<br /><br />The empirical Bayesian techniques from my article <a href="http://www.probabilaball.com/2015/05/beta-binomial-empirical-bayes.html" target="_blank">Beta-binomial empirical Bayes </a>represent an approximation to this full hierarchical method. In fact, using the empirical Bayesian estimator from that article on the baseball set described in this article gives $\hat{\alpha} = 172.5478$ and $\hat{\beta} = 476.0831$ (equivalent to $\hat{\mu} = 0.266$ and $\hat{\phi} = 0.001539$), and gives Bryce Harper an estimated true batting average of $\theta_1 = 0.2946$, with a 95% credible interval of $(0.2688, 0.3210)$ - only slightly shorter than the interval from the full hierarchical model.<br /><br />Lastly, the "regression toward the mean" technique common in sabermetrics also approximates this analysis. Supposing you had a "stabilization point" of around 650 AB for batting averages (650 is actually way too large, but I'm pulling this number from my calculations above to illustrate a point), then the amount shrunk towards league mean of $\mu \approx 0.266$ is<br /><br /><div style="text-align: center;">$\left(\dfrac{521}{521 + 650}\right) \approx 0.4449$</div><br />So that the estimate of Harper's batting average is<br /><br /><div style="text-align: center;">$0.266 + 0.4449\left(\dfrac{172}{521} - 0.266\right) \approx 0.2945$</div><br />Three methods all going to the same place - all closely related in theory and execution. <br /><br />Hopefully this helps with understanding MCMC coding. The article ended up much longer than I originally intended, but there were many parts I've gotten used to doing quickly that I realized required a not-so-quick explanation to justify <i>why </i>I'm doing them. 
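<br /><br />To make the comparison between the empirical Bayes and regression-toward-the-mean estimates above concrete, here is a minimal sketch in R that recomputes both from the numbers quoted in this section. The variable names are illustrative, and the empirical Bayes posterior is taken to be the usual conjugate $Beta(x + \hat{\alpha}, n - x + \hat{\beta})$ update.

```r
# Harper's 2015 line and the empirical Bayes prior estimates quoted above
x <- 172
n <- 521
alpha_hat <- 172.5478
beta_hat  <- 476.0831

# Empirical Bayes: conjugate beta posterior, its mean, and central 95% quantiles
post_a <- x + alpha_hat
post_b <- n - x + beta_hat
eb_mean     <- post_a / (post_a + post_b)
eb_interval <- qbeta(c(0.025, 0.975), post_a, post_b)

# Regression toward the mean with a stabilization point of 650 AB
mu <- 0.266
M  <- 650
weight <- n / (n + M)               # share of the estimate coming from the data
rttm   <- mu + weight * (x / n - mu)

print(round(c(eb_mean = eb_mean, rttm = rttm), 4))
print(round(eb_interval, 4))
```

The two point estimates agree closely because $\hat{\alpha} + \hat{\beta} \approx 648.6$ plays the same role in the beta posterior that the stabilization point of 650 plays in the shrinkage formula.<br /><br />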
As usual, comments and suggestions are appreciated!<br /><br />Posted by rcfoster<br /><br />2016 Win Prediction Totals (Through May 22)<br />Published 2016-05-24<br /><br />These predictions are based on my own silly estimator, which I know can be improved with some effort on my part. There's some work related to this estimator that I'm trying to get published academically, so I won't talk about the technical details yet (not that they're particularly mind-blowing anyway).<br /><br />I set the nominal coverage at 95% (meaning that, the way I calculated them, the intervals should get it right 95% of the time), but based on tests of earlier seasons at this point in the season, the actual coverage is slightly under 94%, with intervals being one game off if and when they are off.<br /><br />Intervals are inclusive. 
All win totals assume a 162 game schedule.<br /><br />\begin{array} {c c c c} <br />\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} & \textrm{Current Wins}\\ \hline<br /><br />ARI & 65 & 79.58 & 94 & 81.81 & 21 \\ <br />ATL & 48 & 61.91 & 77 & 67.95 & 12 \\ <br />BAL & 74 & 89.11 & 104 & 85.19 & 26 \\ <br />BOS & 80 & 94.48 & 109 & 92.65 & 27 \\ <br />CHC & 88 & 102.56 & 117 & 99.31 & 29 \\ <br />CHW & 75 & 89.64 & 104 & 87.36 & 26 \\ <br />CIN & 49 & 63.47 & 78 & 66.52 & 15 \\ <br />CLE & 71 & 85.5 & 100 & 85.03 & 22 \\ <br />COL & 66 & 80.94 & 96 & 80.92 & 21 \\ <br />DET & 66 & 80.55 & 95 & 81.05 & 21 \\ <br />HOU & 57 & 71.08 & 86 & 74.86 & 17 \\ <br />KCR & 64 & 79.12 & 94 & 77.75 & 22 \\ <br />LAA & 62 & 76.87 & 91 & 78.08 & 20 \\ <br />LAD & 68 & 82.33 & 97 & 83.53 & 22 \\ <br />MIA & 67 & 81.46 & 96 & 80.95 & 22 \\ <br />MIL & 57 & 71.51 & 86 & 73.47 & 18 \\ <br />MIN & 47 & 61.06 & 76 & 68.16 & 11 \\ <br />NYM & 72 & 87.09 & 102 & 84.52 & 25 \\ <br />NYY & 63 & 77.99 & 93 & 77.6 & 21 \\ <br />OAK & 58 & 71.82 & 86 & 73.12 & 19 \\ <br />PHI & 67 & 81.6 & 96 & 77.71 & 25 \\ <br />PIT & 69 & 83.7 & 98 & 81.94 & 23 \\ <br />SDP & 59 & 72.83 & 87 & 74.53 & 19 \\ <br />SEA & 76 & 90.55 & 105 & 87.86 & 26 \\ <br />SFG & 72 & 85.92 & 100 & 82.28 & 27 \\ <br />STL & 73 & 87.24 & 102 & 88.18 & 23 \\ <br />TBR & 68 & 82.68 & 98 & 83.92 & 20 \\ <br />TEX & 70 & 84.81 & 99 & 82.1 & 25 \\ <br />TOR & 65 & 79.6 & 94 & 80.45 & 22 \\ <br />WSN & 78 & 92.88 & 107 & 90.45 & 27 \\ \hline\end{array}<br /><br />As you would expect, it's really, really difficult to predict how many games a team is going to win only a quarter of the way through the season, and intervals are necessarily going to be very wide. A couple of things stand out, though - at this point we can be confident that the Chicago Cubs will finish above 0.500 and the Minnesota Twins, Cincinnati Reds, and Atlanta Braves will finish below 0.500. 
For every other team, we just don't have enough information yet.<br /><br />To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (from the beginning of the season until May 22). <br /><br />These quantiles are based off of a distribution - <a href="http://imgur.com/a/NTRfu" target="_blank">I've uploaded a picture of each team's distribution to imgur</a>. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is to the left of the peak, the team is predicted to finish "lucky" - more wins than would be expected based on their talent level - and if the blue line is to the right of the peak, the team is predicted to finish "unlucky" - fewer wins than would be expected based on their talent level.<br /><br />Posted by rcfoster<br /><br />2016 Stabilization Points<br />Published 2016-03-18<br /><br />I recalculated my stabilization points for 2016, using the same maximum likelihood technique I used for my 2015 calculations in the articles <a
href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">Estimating Theoretical Stabilization Points</a> and <a href="http://www.probabilaball.com/2015/08/whip-stabilization-by-gamma-poisson.html" target="_blank">WHIP Stabilization by the Gamma-Poisson Model</a>.<br /><br />(All data and code I used <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2016-Stabilization-Points" target="_blank">can be found on my github</a>. I make no claims about the stability and/or efficiency of my code - there are a few places where I know it could use some work.) <br /><br />I've included standard error estimates for 2016, but these should not be used to perform any kinds of tests or intervals to compare to the 2015 data - the 2015 values are estimates themselves with their own standard errors, and since I'm using the past 6 years worth of baseball data, approximately 5/6 of the data is common between the two estimates. The calculations I performed for 2015 can be found <a href="http://probabilaball.blogspot.com/2015/08/more-offensive-stabilization-points.html" target="_blank">here for batting statistics</a> and <a href="http://probabilaball.blogspot.com/2015/09/more-pitching-stabilization-points.html" target="_blank">here for pitching statistics</a>.<br /><br />The cutoff values I picked were the minimum number of events (PA, AB, TBF, BIP, etc. - the denominators in the formulas) in order to be considered for a year. 
These cutoff values, and the choice of 6 years worth of data, were picked fairly arbitrarily - I tried to go with what was reasonable (based on seeing what others were doing and my own knowledge of baseball) and what seemed to work well in practice.<br /><br /><h2><b>Offensive Statistics</b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2015\textrm{ }\hat{M} \\ \hline<br />\textrm{OBP}&\textrm{(H + BB + HBP)/PA} & 301.32 & 16.92 & 0.329 & 300 & 295.79\\<br />\textrm{BABIP}&\textrm{(H - HR)/(AB-SO-HR+SF)} & 433.04 & 38.91 & 0.305 & 300 & NA^*\\ <br />\textrm{BA}&\textrm{H/AB} & 491.20 & 37.10 & 0.266 & 300 & 465.92\\<br />\textrm{SO Rate}&\textrm{SO/PA} & 49.23 & 1.91 & 0.184 & 300 & 49.73\\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(PA-IBB)} & 112.44 & 4.93 & 0.077 & 300 & 110.91\\<br />\textrm{1B Rate}&\textrm{1B/PA} & 223.86 & 11.48 & 0.159 & 300 & 226.16\\<br />\textrm{2B Rate}&\textrm{2B/PA} & 1169.75 & 135.60 & 0.047 & 300 & 1025.31\\<br />\textrm{3B Rate}&\textrm{3B/PA} & 365.06 & 4.93 & 0.005 & 300 & 372.50\\<br />\textrm{XBH Rate} & \textrm{(2B + 3B)/PA} & 1075.41 & 118.22 & 0.052 & 300 & 1006.30\\<br />\textrm{HR Rate} & \textrm{HR/PA} & 126.35 & 6.03 & 0.027 & 300 & 124.52\\<br />\textrm{HBP Rate} & \textrm{HBP/PA} & 300.97 & 18.60 & 0.009 & 300 & 297.41 \\ \hline <br />\end{array}<br /><br /><i><b> </b>* For whatever reason, I did not calculate the stabilization point for hitting BABIP in 2015.</i> <br /><br />In general, a larger stabilization point will be due to a decreased spread of talent levels - as talent levels get closer together, more extreme stats become less and less likely, and will be shrunk harder towards the mean. Consequently, it takes more observations to know that a player's high or low stats (relative to the rest of the league) are real and not just a fluke of randomness. 
Similarly, a smaller stabilization point indicates an increase in the spread of talent levels.<br /><br /> Most stabilization points are very similar to their 2015 counterparts, though there is a general increasing trend (seven out of ten statistics), and the increases tend to be larger than the decreases.<br /><br /><h2><b>Pitching Statistics </b></h2><br />\begin{array}{| l | l | c | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\hat{\mu} & \textrm{Cutoff}&2015 \textrm{ }\hat{M} \\ \hline<br />\textrm{BABIP}&\textrm{(H-HR)/(GB + FB + LD)}&1408.72& 258.33 & 0.289 &300&2006.71\\<br />\textrm{GB Rate}&\textrm{GB/(GB + FB + LD)}& 63.53 &3.51 & 0.449 &300& 65.52\\<br />\textrm{FB Rate}&\textrm{FB/(GB + FB + LD)}& 59.80 &3.28& 0.347 &300&61.96\\<br />\textrm{LD Rate}&\textrm{LD/(GB + FB + LD)}& 731.02 & 87.48 & 0.203 &300&768.42\\<br />\textrm{HR/FB Rate}&\textrm{HR/FB}&488.53 & 90.14 & 0.103 &100&505.11\\<br />\textrm{SO Rate}&\textrm{SO/TBF}& 93.15 &5.18& 0.189&400&90.94\\<br />\textrm{HR Rate}&\textrm{HR/TBF}& 949.02 & 110.87 & 0.025 &400&931.59\\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}& 236.87 & 15.70 & 0.069 &400&221.25\\<br />\textrm{HBP Rate}&\textrm{HBP/TBF}& 939.00 & 111.88 & 0.008 &400&989.30\\<br />\textrm{Hit rate}&\textrm{H/TBF}&559.18 & 46.08 & 0.235 &400&623.35\\<br />\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}& 526.77 & 45.07 & 0.312 &400&524.73\\<br />\textrm{WHIP}&\textrm{(H + BB)/IP*}& 78.97 & 5.60 & 1.29 &80&77.20\\<br />\textrm{ER Rate}&\textrm{ER/IP*}& 63.08 & 4.23 & 0.439&80&59.55\\<br />\textrm{Extra BF}&\textrm{(TBF - 3IP*)/IP*}& 75.79 & 5.31 & 1.22 &80&73.00\\ \hline<br />\end{array}<br /><br /><i>* When dividing by IP, I corrected the 0.1 and 0.2 representations to 0.33 and 0.67, respectively. </i><br /><br />The pitching stabilization points are evenly split between increases and decreases (seven each) and are generally close to last year's values, indicating not much of a change in the distributions of talent levels. 
Interestingly, the stabilization point for BABIP dropped by nearly 600 BIP - whether this is due to the (notoriously large) variance of BABIP or a mistake in either this or last year's calculation on my part, I'm not sure.<br /><br /><h2>Usage</h2><br /><br />Aside from the obvious use of knowing approximately when results are half due to luck and half from skill, these stabilization points (along with league means) can be used to provide very basic confidence intervals and prediction intervals for estimates that have been shrunk towards the population mean, as demonstrated in my article <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">From Stabilization to Interval Estimation</a>. I believe the confidence intervals from my method should be similar to the intervals from Sean Dolinar's great fangraphs article <a href="http://www.fangraphs.com/blogs/a-new-way-to-look-at-sample-size/">A New Way to Look at Sample Size</a>, though I have not personally tested this, and am not familiar with the Cronbach's alpha methodology he uses (or with reliability analysis in general).<br /><br />For example, suppose that in the first half, a player has an on-base percentage of 0.380 in 300 plate appearances, corresponding to 114 on-base events. A 95% confidence interval using my empirical Bayesian techniques (based on a normal-normal model) is<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300}} = (0.317,0.392)$ </div><br />That is, we believe the player's true on-base percentage to be between 0.317 and 0.392 with 95% confidence. 
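<br /><br />The interval above is easy to script. The following is a minimal sketch in R rather than the code from the linked repository: the function name shrunk_interval and its arguments are hypothetical, and it simply evaluates the formula above for a given league mean $\mu$ and stabilization point $M$.

```r
# Shrunken interval for a rate statistic under the normal-normal approximation.
# x = observed successes, n = observed events,
# mu = league mean, M = stabilization point (both treated as known).
shrunk_interval <- function(x, n, mu, M, level = 0.95) {
  z      <- qnorm(1 - (1 - level) / 2)          # about 1.96 for level = 0.95
  center <- (x + mu * M) / (n + M)              # estimate shrunk toward mu
  half   <- z * sqrt(mu * (1 - mu) / (M + n))   # half-width of the interval
  c(lower = center - half, estimate = center, upper = center + half)
}

# The OBP example from the text: 114 on-base events in 300 PA.
ci <- shrunk_interval(x = 114, n = 300, mu = 0.329, M = 301.32)
print(round(ci, 3))   # lower and upper round to 0.317 and 0.392
```

Treating $\mu$ and $M$ as fixed inputs keeps the function to a few lines, at the cost of ignoring the uncertainty in those two estimates.<br /><br />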
I used a normal distribution for talent levels with a normal approximation to the binomial for the distribution of observed OBP, but that is not the only possible choice - it just resulted in the simplest formulas for the intervals.<br /><br />Suppose that the player will get an additional $\tilde{n} = 250$ PA in the second half of the season. A 95% prediction interval for his OBP over those PA is given by<br /><br /><div style="text-align: center;">$\dfrac{114 + 0.329*301.32}{300 + 301.32} \pm 1.96 \sqrt{\dfrac{0.329(1-0.329)}{301.32 + 300} + \dfrac{0.329(1-0.329)}{250}} = (0.285,0.424)$ </div><br />That is, 95% of the time the player's OBP over the 250 PA in the second half of the season should be between 0.285 and 0.424. These intervals are overly optimistic and "dumb" in that they take only the league mean and variance and the player's own statistics into account, representing an advantage only over 95% unshrunk intervals, <a href="http://www.probabilaball.com/2015/10/from-stabilization-to-interval.html" target="_blank">but when I tested them in my article "From Stabilization to Interval Estimation"</a>, they worked well for prediction.<br /><br />As usual, all my data and code <a href="https://github.com/Probabilaball/Blog-Code/tree/master/2016-Stabilization-Points" target="_blank">can be found on my github</a>. I wrote a general function in $R$ to calculate the stabilization point for any basic counting stat, or unweighted sums of counting stats like OBP (I am still working on weighted sums so I can apply this to things like wOBA ). The function returns the estimated league mean of the statistic and estimated stabilization point, a standard error for the stabilization point, and what model was used (I only have two programmed in - 1 for the beta-binomial and 2 for the gamma-Poisson). 
It also gives a plot of the estimated stabilization at different numbers of events, with 95% confidence bounds.<br /><br /><span style="font-family: "courier new" , "courier" , monospace;">> stabilize(h$\$$H + h$\$$BB + h$\$$HBP, h$\$$PA, cutoff = 300, 1) <br />$\$$Parameters<br />[1] 0.329098 301.317682<br /><br />$\$$Standard.Error<br />[1] 16.92138<br /><br />$\$$Model<br />[1] "Beta-Binomial"</span><br /><br /><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-xgucIosXSK8/Vur_nBEO17I/AAAAAAAAAco/E6CM2RtCGaU3L8DhABUtFAuLYVh7Zt0zQ/s1600/OBP%2BStabilization.jpeg" imageanchor="1"><img border="0" height="397" src="https://3.bp.blogspot.com/-xgucIosXSK8/Vur_nBEO17I/AAAAAAAAAco/E6CM2RtCGaU3L8DhABUtFAuLYVh7Zt0zQ/s400/OBP%2BStabilization.jpeg" width="400" /></a></div><br />The confidence bounds are created from the estimates $\hat{M}$ and $SE(\hat{M})$ above and the formula<br /><br /><div style="text-align: center;">$\left(\dfrac{n}{n+\hat{M}}\right) \pm 1.96 \left[\dfrac{n}{(n+\hat{M})^2}\right] SE(\hat{M})$</div><br />which is obtained by applying <a href="http://probabilaball.blogspot.com/2015/06/the-delta-method-for-confidence.html" target="_blank">the delta method</a> to the function $p(\hat{M}) = n/(n + \hat{M})$ (the bracketed term is the absolute value of the derivative of $p$ with respect to $\hat{M}$). Note that the confidence and prediction intervals I gave do <i>not</i> take $SE(\hat{M})$ into account (ignoring the uncertainty surrounding the correct shrinkage amount, which is indicated by the confidence bounds above), but this is not a huge problem - if you don't believe me, plug slightly different values of $M$ into the formulas yourself and see that the resulting intervals do not change much.<br /><br />Maybe somebody else out there might find this useful. 
As always, feel free to post any comments or suggestions!<br /><br />Posted by rcfoster<br /><br />Correcting Parametric Empirical Bayesian Intervals using a Bootstrap<br />Published 2015-12-04, updated 2016-02-16<br /><br />In a previous post I discussed empirical Bayes for the beta-binomial model. Empirical Bayesian estimates are just expected values of a posterior distribution - suppose instead that you want interval estimates. The empirical Bayesian method can be used, but the intervals potentially have to be adjusted. In this post I want to show how to construct empirical Bayesian intervals for the beta-binomial model, correct them using a parametric bootstrap, and give a baseball example.<br /><br />Annotated code for the procedure I will describe in this post <a href="https://github.com/Probabilaball/Blog-Code/tree/master/Correcting-Parametric-Empirical-Bayesian-Intervals-using-a-Bootstrap">can be found on my github</a>. <br /><br />As a side note, I just want to say how much I love this procedure. It uses parametric bootstrapping to correct Bayesian intervals to achieve frequentist coverage. God bless America. 
<br /><br /><br /><h2>Empirical Bayesian Intervals</h2><br /><br /><a href="http://www.probabilaball.com/2015/05/beta-binomial-empirical-bayes.html">In my previous post on beta-binomial empirical Bayes</a> analysis, I used the model<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$y_i \sim Bin(n_i, \theta_i)$</div><div style="text-align: center;">$\theta_i \sim Beta(\alpha, \beta)$</div><br />The empirical Bayes method says to get estimates $\hat{\alpha}$ and $\hat{\beta}$ of the prior parameters - I give a method of moments estimator in the post, but marginal maximum likelihood may also be used - and then calculate posterior distributions for each of the $\theta_i$ using $Beta(\hat{\alpha}, \hat{\beta})$ as the prior, essentially using the data itself to estimate the prior.<br /><br />The empirical Bayesian estimate is the mean of this posterior distribution. If an interval estimate is desired, a credible interval can be calculated by taking quantiles directly from this posterior - <a href="http://www.probabilaball.com/2015/07/bayesian-credible-intervals-for-batting.html">see my post on credible intervals</a>. This is what I will call a "naive" empirical Bayesian interval.<br /><br /><br /><h2>The Empirical Bayes Problem</h2><br /><br />The fundamental problem with naive empirical Bayesian intervals is that they often end up too short, inappropriately centered, or both. This is because the uncertainty of the prior parameters themselves has not been accounted for. From the law of total variance, the posterior variance is given by<br /><div style="text-align: center;"><br />$Var(\theta_i | y_i) = E_{\alpha, \beta|y_i}[Var(\theta_i|y_i, \alpha, \beta)] + Var_{\alpha, \beta|y_i}[E(\theta_i|y_i, \alpha, \beta)]$</div><br />Taking quantiles from the empirical Bayesian posterior estimates the first term, but not the second. 
For small samples this second term can be significant, and naive empirical Bayesian intervals generally won't achieve nominal coverage (for more information, see Carlin and Louis's book <i>Bayesian Methods for Data Analysis</i>).<br /><br />One way to correct for the uncertainty is to perform a hierarchical Bayesian analysis, but it's not clear what the "correct" hyperpriors should be - and just using noninformative priors doesn't guarantee that you'll get nominal frequentist coverage.<br /><br />An alternative is to use the bootstrap. Since I'm working in the parametric empirical Bayes case, a parametric bootstrap will be used, though this doesn't necessarily have to be the case. For more information on the technique I will use (and a discussion on how it applies to the normal-normal case), see Laird, N., and Louis, T. (1987), "<i>Empirical Bayes Confidence Intervals Based on Bootstrap Samples</i>," Journal of the American Statistical Association, 82(399), 739-750.<br /><br />I want to emphasize that this is a technique for small samples. For even moderate samples, the uncertainty in the parameters will be small enough that naive empirical Bayesian intervals will have good frequentist properties, and this can be checked by simulation if you desire.<br /><br /><br /><h2>Parametric Bootstrap</h2><br /><br />The idea of the parametric bootstrap is that we can account for $Var_{\alpha, \beta|y_i}[E(\theta_i|y_i, \alpha, \beta)]$ by resampling. In a traditional bootstrap, data is sampled with replacement from the original data set. 
Since this is a parametric bootstrap, we will instead resample by generating observations from the model, assuming that our estimates $\hat{\alpha}$ and $\hat{\beta}$ are correct.<br /><br /><ol><li>Generate $\theta^*_1, \theta^*_2, ..., \theta^*_k$ from $Beta(\hat{\alpha}, \hat{\beta})$</li><li>Generate $y^*_1, y^*_2,..., y^*_k$ from $Bin(n_i, \theta^*_i)$</li><li>Estimate $\alpha$ and $\beta$ from the bootstrapped $(y^*_i, n_i)$ observations using the same method as you initially used. Call these estimates $\alpha^*_j, \beta^*_j$ </li></ol><br /><br />Repeating these steps for $j = 1, 2, ..., N$ gives a set of $N$ bootstrap estimates $\alpha^*_j, \beta^*_j$ of the parameters of the underlying beta distribution. The posterior density that accounts for uncertainty of $\hat{\alpha}$ and $\hat{\beta}$ can then be estimated as<br /><div style="text-align: center;"><div style="text-align: left;"><br /></div><div style="text-align: center;">$p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) \approx \dfrac{ \sum_{j = 1}^N p(\theta_i | y_i, \alpha^*_j, \beta^*_j)}{N}$</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Essentially, this is just the raw average density at each point, averaging over all the bootstrapped parameter values. 
The corrected 95% empirical Bayesian interval is given by solving</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$\displaystyle \int_{-\infty}^{l} p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) \, d\theta_i = 0.025$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$\displaystyle \int_{u}^{\infty} p^*(\theta_i | y_i, \hat{\alpha}, \hat{\beta}) \, d\theta_i = 0.025$</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br />for lower and upper bounds $l$ and $u$ using numerical techniques, so that 2.5% of the posterior probability lies below $l$ and 2.5% lies above $u$.</div></div><br /><br /><h2>Baseball Example</h2><br /><br /><a href="http://www.probabilaball.com/2015/05/beta-binomial-empirical-bayes.html">In my previous post</a>, I analyzed the famous Morris baseball data set with respect to loss functions to show why empirical Bayes works. Analyzing it with respect to interval estimation also provides an interesting example.<br /><br />Using a beta-binomial model with the method of moments estimator I described in the previous post, this data set has parameter estimates $\hat{\alpha} = 97.676$ and $\hat{\beta} = 270.312$. Each player had $n_i = 45$ at-bats, so the posterior distribution for the batting average $\theta_i$ of player $i$ is<br /><br /><div style="text-align: center;">$\theta_i | y_i, \hat{\alpha}, \hat{\beta} \sim Beta(y_i + 97.676, 45 - y_i + 270.312)$</div><br />Naive intervals can be taken directly as central 95% quantiles from the posterior distribution - again, <a href="http://www.probabilaball.com/2015/07/bayesian-credible-intervals-for-batting.html">see my article on Bayesian credible intervals for more explanation on this</a>. <br /><br />\begin{array}{l c c c c c c c c} \hline<br />\textrm{Player} & y_i & y_i/n_i & \textrm{EB Estimate} & \textrm{Naive Lower} & \theta_i & \textrm{Naive Upper}\\ \hline<br />Clemente & 18 & .400 & .280 & 0.238 & .346 & 0.324 \\<br />F. Robinson & 17 & .378 & .278 & 0.236 & .298 & 0.322 \\<br />F. 
Howard & 16 & .356 & .275 & 0.233 & .276 & 0.319 \\<br />Johnstone & 15 & .333 & .273 & 0.231 & .222 & 0.317 \\<br />Barry & 14 & .311 & .270 & 0.229 & .273 & 0.314 \\<br />Spencer & 14 & .311 & .270 & 0.229 & .270 & 0.314 \\<br />Kessinger & 13 & .289 & .268 & 0.226 & .263 & 0.312 \\<br />L. Alvarado & 12 & .267 & .266 & 0.224 & .210 & 0.309 \\<br />Santo & 11 & .244 & .263 & 0.222 & .269 & 0.307\\<br />Swoboda & 11 & .244 & .263 & 0.222 & .230 & 0.307 \\<br />Unser & 10 &.222 & .261 & 0.220 & .264 & 0.304 \\<br />Williams & 10 & .222 & .261 & 0.220 & .256 & 0.304 \\<br />Scott & 10 & .222 & .261 & 0.220 & .303 & 0.304 \\<br />Petrocelli & 10 & .222 & .261 & 0.220 & .264 & 0.304 \\<br />E. Rodriguez & 10 & .222 & .261 & 0.220 & .226 & 0.304\\<br />Campaneris & 9 & .200 & .258 & 0.217 & .285 & 0.302 \\<br />Munson & 8 & .178 & .256 & 0.215 & .316 & 0.299 \\<br />Alvis & 7 & .156 & .253 & 0.213 & .200 & 0.296 \\ \hline<br />\end{array}<br /><br />Thirteen out of the eighteen intervals captured the hitter's true average for the rest of the year, for an observed coverage of 72.22%. Roberto Clemente and Thurman Munson managed to overperform with respect to their intervals, while Jay Johnstone, Luis Alvarado, and Max Alvis underperformed.<br /><br />The parametric bootstrap procedure fixes this as follows:<br /><br /><ol><li>Simulate a set of 18 new $\theta^*_i$ from a $Beta(97.676, 270.312)$ distribution.</li><li>Simulate a set of 18 new $y^*_i$ from a $Bin(45, \theta^*_i)$ distribution</li><li>Estimate $\alpha^*$ and $\beta^*$ using the same method used on the original data set.</li></ol>I repeated this for 5000 bootstrap samples. 
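<br /><br />The three steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the annotated code linked from this post: the function names <i>moment_fit</i> and <i>parametric_bootstrap</i> are hypothetical, the moment estimator is a simple one for a common $n_i$ that may differ in detail from the estimator in the previous post, and resamples whose moments do not give a valid beta distribution are simply dropped, which is a choice of this sketch.<br /><br />

```python
import random
from statistics import mean, variance


def moment_fit(ys, n):
    """Moment estimates (alpha, beta) for a beta-binomial with common n.

    Assumed estimator: match the mean and sample variance of y_i / n.
    Returns None when the moments do not give a valid beta distribution.
    """
    p = mean(ys) / n
    if not 0.0 < p < 1.0:
        return None
    s2 = variance([y / n for y in ys])
    ratio = s2 / (p * (1.0 - p) / n)  # observed variance / binomial variance
    if ratio <= 1.0:
        return None  # no overdispersion, so alpha + beta is not estimable
    m = (n - 1) / (ratio - 1.0) - 1.0  # m estimates alpha + beta
    if m <= 0.0:
        return None
    return p * m, (1.0 - p) * m


def parametric_bootstrap(alpha_hat, beta_hat, n, k, n_boot, seed=0):
    """Return a list of bootstrap estimates (alpha*_j, beta*_j)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # Step 1: new talent levels from the fitted beta distribution.
        thetas = [rng.betavariate(alpha_hat, beta_hat) for _ in range(k)]
        # Step 2: new binomial counts, each a sum of n Bernoulli trials.
        ys = [sum(rng.random() < t for _ in range(n)) for t in thetas]
        # Step 3: re-estimate with the same method used on the original data.
        fit = moment_fit(ys, n)
        if fit is not None:  # assumption of this sketch: drop invalid fits
            draws.append(fit)
    return draws


# Hits in the first 45 at-bats, from the table above.
morris_hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
alpha_hat, beta_hat = moment_fit(morris_hits, 45)
draws = parametric_bootstrap(alpha_hat, beta_hat, n=45, k=18, n_boot=200)
```

Each pair in <i>draws</i> plays the role of one $(\alpha^*_j, \beta^*_j)$. How invalid resamples are handled can change the resulting intervals, so that choice should be made deliberately.<br /><br />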
The bootstrapped posterior is then<br /><br /><div style="text-align: left;"><br /></div><div style="text-align: center;">$p^*(\theta_i | y_i, 97.676, 270.312) \approx \dfrac{ \sum_{j = 1}^{5000} p(\theta_i | y_i, \alpha^*_j, \beta^*_j)}{5000}$<br /><br /><div style="text-align: left;">The effect of this is to create a posterior distribution that's centered around the same empirical Bayesian estimate $\hat{\theta_i}$, but more spread out. This is shown in the naive (solid line) and bootstrapped (dashed line) posterior distributions for $y_i = 10$.</div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-ohl3LCSUAK4/Vlk8l0_KsCI/AAAAAAAAAbQ/zaOfZHnBVF8/s1600/Boostrapped%2BPosterior.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-ohl3LCSUAK4/Vlk8l0_KsCI/AAAAAAAAAbQ/zaOfZHnBVF8/s1600/Boostrapped%2BPosterior.jpeg" /></a></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Quantiles taken from this bootstrapped distribution will give wider intervals than the naive empirical Bayesian intervals (though it is possible to come up with "odd" data sets where the bootstrap interval is shorter). The bootstrap interval is given by solving<br /><br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$\displaystyle \int_{-\infty}^{l} p^*(\theta_i | y_i, 97.676, 270.312) \, d\theta_i = 0.025$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$\displaystyle \int_{-\infty}^{u} p^*(\theta_i | y_i, 97.676, 270.312) \, d\theta_i = 0.975$</div><div style="text-align: left;"><br /></div></div><div style="text-align: left;"><br />for lower and upper bounds $l$ and $u$ - and in full disclosure, I didn't actually perform the full numerical integration. 
Instead, I averaged over the <span style="font-family: "courier new" , "courier" , monospace;">pbeta(x, alpha, beta)</span> function in <span style="font-family: "courier new" , "courier" , monospace;">R</span> and solved for the value where the averaged CDF is equal to 0.025 or 0.975.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Doing this, 95% bootstrapped intervals are given by</div><div style="text-align: left;"><br /></div><div style="text-align: left;">\begin{array}{l c c c c c c c c} \hline<br />\textrm{Player} & y_i & y_i/n_i & \textrm{EB Estimate} & \textrm{Bootstrap Lower} & \theta_i & \textrm{Bootstrap Upper}\\ \hline<br />Clemente & 18 & .400 & .280 & 0.231 & .346 & 0.391 \\<br />F. Robinson & 17 & .378 & .278 & 0.227 & .298 & 0.382 \\<br />F. Howard & 16 & .356 & .275 & 0.222 & .276 & 0.372 \\<br />Johnstone & 15 & .333 & .273 & 0.217 & .222 & 0.363 \\<br />Barry & 14 & .311 & .270 & 0.211 & .273 & 0.355 \\<br />Spencer & 14 & .311 & .270 & 0.211 & .270 & 0.355 \\<br />Kessinger & 13 & .289 & .268 & 0.205 & .263 & 0.346 \\<br />L. Alvarado & 12 & .267 & .266 & 0.199 & .210 & 0.338 \\<br />Santo & 11 & .244 & .263 & 0.192 & .269 & 0.330\\<br />Swoboda & 11 & .244 & .263 & 0.192 & .230 & 0.330 \\<br />Unser & 10 &.222 & .261 & 0.184 & .264 & 0.323 \\<br />Williams & 10 & .222 & .261 & 0.184 & .256 & 0.323 \\<br />Scott & 10 & .222 & .261 & 0.184 & .303 & 0.323 \\<br />Petrocelli & 10 & .222 & .261 & 0.184 & .264 & 0.323 \\<br />E. Rodriguez & 10 & .222 & .261 & 0.184 & .226 & 0.323\\<br />Campaneris & 9 & .200 & .258 & 0.177 & .285 & 0.317 \\<br />Munson & 8 & .178 & .256 & 0.169 & .316 & 0.310 \\<br />Alvis & 7 & .156 & .253 & 0.161 & .200 & 0.305 \\ \hline<br />\end{array}</div><br /><div style="text-align: left;">Only one player out of eighteen is not captured by the interval - Thurman Munson, who managed to hit 0.178 in his first 45 at-bats and 0.316 the rest of the season - for an observed coverage of 94.44%. 
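<br /><br />The averaged-CDF shortcut described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the code used for the table: the function names are hypothetical, the beta CDF is approximated with Simpson's rule (reasonable only when both beta parameters are greater than 1), the quantile is found by bisection, and the three bootstrap draws are made-up values standing in for the full set of $(\alpha^*_j, \beta^*_j)$.<br /><br />

```python
import math


def beta_cdf(x, a, b, steps=2000):
    """Beta(a, b) CDF at x by Simpson's rule (assumes a > 1 and b > 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


def averaged_cdf(x, y, n, draws):
    """Average the posterior CDF over the bootstrap draws (alpha*_j, beta*_j)."""
    return sum(beta_cdf(x, y + a, n - y + b) for a, b in draws) / len(draws)


def bootstrap_quantile(q, y, n, draws, iters=40):
    """Solve averaged_cdf(x) = q for x by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if averaged_cdf(mid, y, n, draws) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical bootstrap draws; a real run would use every (alpha*_j, beta*_j).
draws = [(60.0, 170.0), (97.676, 270.312), (180.0, 500.0)]
lower = bootstrap_quantile(0.025, y=10, n=45, draws=draws)
upper = bootstrap_quantile(0.975, y=10, n=45, draws=draws)
```

Passing a single draw equal to $(\hat{\alpha}, \hat{\beta})$ gives the naive interval back. In pure Python this is slow for thousands of draws, so a library beta CDF would be the practical choice.<br /><br />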
</div><div style="text-align: left;"><br /></div><div style="text-align: left;">(As a reminder, annotated code to perform estimation in the beta-binomial model and calculate bootstrapped empirical Bayesian intervals for the beta-binomial model <a href="https://github.com/Probabilaball/Blog-Code/tree/master/Correcting-Parametric-Empirical-Bayesian-Intervals-using-a-Bootstrap" target="_blank">is available on my github</a>)</div></div>rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-54434971969225293662015-10-23T12:17:00.000-05:002017-03-16T23:38:16.488-05:00From Stabilization to Interval Estimation<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <br /><br />In this post, I'm going to show how to use league means and stabilization points to construct mean interval and prediction interval estimates for some basic counting statistics. I'll focus on two specific models: the normal-normal and the beta-binomial model.<br /><br />At a few points during this post I'm going to mention some empirical results. You can find the data I used and the code I ran <a href="https://github.com/Probabilaball/Blog-Code/tree/master/From-Stabilization-to-Interval-Estimation">on my github</a>.<br /><br /><h2>Distributional Assumptions</h2><h2></h2><br />I'm assuming the statistic in question is a binomial outcome - this covers many basic counting statistics (batting average, on-base percentage, batting average on balls in play etc.) but not rate statistics, or more complicated statistics such as wOBA.<br /><br />Assume that in $n_i$ trials, a player accrues $x_i$ events (hits, on-base events, etc.). 
I'm going to assume that trials are independent and identical with parameter of success $\theta_i$.<br /><br />For the distribution of the $x_i$, I'm going to work out the math for two specific distributions - the normal and the binomial. I'm also going to assume that the distribution of the $\theta_i$ follows the respective conjugate distribution - the normal for the normal model, and the beta for the binomial model. This prior has mean talent level $\mu$ and stabilization point $M$.<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$x_i \sim p(x_i | \theta_i, n_i)$</div><div style="text-align: center;">$\theta_i \sim G(\theta_i | \mu, M)$</div><br />For the stabilization point $M$, I'm going to assume this is the number of events at which $r = 0.5$. If you choose the point at which $r = 0.7$, then these formulas won't work as written.<br /><br />For several of the mathematical results here, I'm going to refer back to my article <a href="http://www.probabilaball.com/2015/07/shrinkage-estimators-for-counting.html">shrinkage estimators for counting statistics</a> - particularly, the examples section at the end - without offering proofs or algebraic derivations.<br /><br />The posterior distribution for $\theta_i$ is given by <br /><br /><div style="text-align: center;">$\displaystyle p(\theta_i | x_i, n_i, \mu, M) = \dfrac{p(x_i | \theta_i, n_i)G(\theta_i | \mu, M)}{\int p(x_i | \theta_i, n_i)G(\theta_i | \mu, M) \, d\theta_i}$ </div><br />Intervals will be constructed by taking quantiles from this distribution. <br /><br />(For rate statistics instead of count statistics, the gamma-Poisson model can be used - though that will take more math to figure out the correct forms of the intervals. I got about three-quarters of the way there in my article <a href="http://www.probabilaball.com/2015/08/whip-stabilization-by-gamma-poisson.html">WHIP stabilization by the gamma-Poisson model</a> if somebody else wants to work through the rest. 
For more complicated statistics such as wOBA, I'm going to have to work through some hard math.)<br /><br /><h2>Mean Intervals</h2><h2></h2><h3> </h3><h3>Normal-Normal Model</h3><br />For the normal-normal model, suppose that both the counts and the distribution of talent levels follow a normal distribution. Then the observed proportion $x_i / n_i$ also follows a normal distribution. <br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$\dfrac{x_i}{n_i} \sim N\left(\theta_i, \dfrac{\sigma^2}{n_i}\right)$<br /><br /></div><div style="text-align: center;">$\theta_i \sim N(\mu, \tau^2)$</div><br />Furthermore, a normal approximation to the binomial is used to estimate $\sigma^2$ as $\sigma^2 = \mu (1-\mu)$. The usual normal approximation to the binomial takes $\theta_i (1-\theta_i)$ as the variance; however, the normal-normal model assumes that variance $\sigma^2$ is constant around every single $\theta_i$ - so an estimate for that is the average amount of variance over all of them, $\mu(1-\mu)$. 
<br /><br />As a side note, the relationship between $M$ and $\tau^2$ is given by<br /><br /><div style="text-align: center;">$M = \dfrac{\sigma^2}{\tau^2} \approx \dfrac{\mu(1-\mu)}{\tau^2}$</div><br />As I showed in my article <a href="http://www.probabilaball.com/2015/07/shrinkage-estimators-for-counting.html">shrinkage estimators for counting statistics</a>, for the normal-normal model the shrinkage coefficient is given as<br /><br /><div style="text-align: center;">$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i} = \dfrac{ M }{ M + n_i }$</div><br />The resulting posterior is then<br /><br /><div style="text-align: center;">$\theta_i | x_i, n_i, \mu, M \sim N\left( (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, (1-B) \left(\dfrac{\sigma^2}{n_i}\right)\right)$<br /><br /><div style="text-align: left;">And substituting in the values of $B$, the variance of the posterior is given as</div><div style="text-align: left;"><br /></div><div style="text-align: center;">$\left(1 - \dfrac{M}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \left(\dfrac{n_i}{M + n_i}\right)\left(\dfrac{\mu(1-\mu)}{n_i}\right) = \dfrac{\mu(1-\mu)}{M+n_i}$ </div></div><br />And a 95% interval estimate for $\theta_i$ is<br /><br /><div style="text-align: center;">$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{M+n_i}}$</div><br /><br /><h3>Beta-Binomial Model</h3><h3></h3><br /><br />For the beta-binomial model, suppose that the counts of events $x_i$ in $n_i$ events follows a binomial distribution and the distribution of the $\theta_i$ themselves is beta.<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$x_i \sim Binomial(n_i, \theta_i )$<br /><br /></div><div style="text-align: center;">$\theta_i \sim Beta(\alpha, \beta)$</div><br />For the beta distribution of talent levels, the parameters can be constructed from the league mean and stabilization point as<br 
/><br /><div style="text-align: center;">$\alpha = \mu M$<br /><br /></div><div style="text-align: center;">$\beta = (1-\mu) M$</div><br />Using the beta as a prior distribution, the posterior for $\theta_i$ is then<br /><br /><div style="text-align: center;">$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$</div><br />A 95% credible interval can then be taken as quantiles from this distribution - I show how to do this in R in <a href="http://www.probabilaball.com/2015/07/bayesian-credible-intervals-for-batting.html">my article on Bayesian credible intervals</a>. Most statistical software should be able to take quantiles from the beta distribution easily.<br /><br />Alternatively, a normal approximation may be used - the posterior should be approximately normal with mean and variance<br /><br /><div style="text-align: center;">$\theta_i | x_i, n_i, \mu, M \sim N\left( \dfrac{x_i + \mu M}{n_i + M}, \dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}\right)$</div><br />So a 95% credible interval based on the normal approximation to the beta posterior is given by<br /><br /><div style="text-align: center;">$\left(\dfrac{x_i + \mu M}{n_i + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu) M)}{(n_i + M)^2 (1 + n_i + M)}}$</div><br />This should be very close to the interval given by taking quantiles directly from the beta distribution.<br /><br /><h3>Practical Application </h3><br />I downloaded first- and second-half hitting data for all qualified non-pitchers from 2010 to 2015 from <a href="http://fangraphs.com/">fangraphs.com</a>. I used the above formulas on the first half of on-base percentage data to create intervals, and then calculated the proportion of those intervals that contained the on-base percentage for the second half. 
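<br /><br />As a rough illustration of the mean-interval formulas above, here is a small Python sketch of the normal-normal interval and the normal approximation to the beta posterior. The function names and the example player (60 on-base events in 180 PA) are hypothetical; $\mu = 0.33$ and $M = 296$ are the on-base percentage values used in this post.<br /><br />

```python
import math


def normal_normal_interval(x, n, mu, m, z=1.96):
    """Normal-normal interval for theta: shrunken mean +/- z * posterior sd."""
    b = m / (m + n)  # shrinkage coefficient B = M / (M + n)
    center = (1 - b) * (x / n) + b * mu
    half = z * math.sqrt(mu * (1 - mu) / (m + n))
    return center - half, center + half


def beta_normal_approx_interval(x, n, mu, m, z=1.96):
    """Normal approximation to the Beta(x + mu*M, n - x + (1 - mu)*M) posterior."""
    a = x + mu * m
    b = n - x + (1 - mu) * m
    center = a / (a + b)
    half = z * math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return center - half, center + half


# Hypothetical first half: 60 on-base events in 180 PA.
nn = normal_normal_interval(60, 180, 0.33, 296)
bbn = beta_normal_approx_interval(60, 180, 0.33, 296)
```

Both intervals are centered at the same shrunken estimate $(x_i + \mu M)/(n_i + M)$ and differ only in how the variance is computed.<br /><br />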
For the league mean and stabilization point, I used values of $M$ and $\mu$ (even though I didn't show $\mu$) from my article "<a href="http://www.probabilaball.com/2015/08/more-offensive-stabilization-points.html">More Offensive Stabilization Points</a>."<br /><br /><i>But wait...isn't there uncertainty in those estimates of $M$ and $\mu$? Yes, but it actually doesn't play a huge role unless the uncertainty is large, such as for the BABIP. You can try it out yourself by running the code and changing the values slightly, or just trust me.</i><br /><br />I rounded off to the nearest whole number for $M$ and to three nonzero digits for $\mu$. The intervals compared were the normal-normal as NN, beta-binomial as BB, and the normal approximation to the beta-binomial as BB (N). The resulting coverages were<br /><br />\begin{array}{| l | l | c | c | c | c |} \hline<br />\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline<br />OBP & 0.33 & 296 & 0.66 & 0.659 & 0.659 \\ <br />BA & 0.268 & 466 & 0.604 & 0.601 & 0.603 \\ <br />1B & 0.158 & 222 & 0.685 & 0.675 & 0.679 \\ <br />2B & 0.0475 & 1025 & 0.532 & 0.532 & 0.531 \\ <br />3B & 0.00492 & 373 & 0.762 & 0.436 & 0.76 \\ <br />XBH & 0.0524 & 1006 & 0.545 & 0.542 & 0.551 \\ <br />HR & 0.0274 & 125 & 0.754 & 0.707 & 0.738 \\ <br />BB & 0.085 & 106 & 0.688 & 0.661 & 0.673 \\ <br />SO & 0.181 & 50 & 0.74 & 0.728 & 0.729 \\ <br />HBP & 0.00866 & 297 & 0.725 & 0.591 & 0.721 \\ \hline\end{array}<br /><br />So what happened? Shouldn't the 95% intervals have 95% coverage? Well, they should. The problem is, I used the wrong type of interval - the intervals calculated here are for the <i>mean </i>$\theta_i$. But we don't have $\theta_i$. What we have is the second half on-base percentage, which is $\theta_i$ plus the random noise that naturally surrounds $\theta_i$ in however many additional plate appearances. 
What's appropriate here is a <i>prediction</i>-type interval that attempts to cover not the mean, but a new observation - this interval will have to account for both the uncertainty of estimation and the natural randomness in a new set of observations. <br /><br /><br /><h2>Prediction Intervals</h2><br />The interval needed is predictive - since the previous intervals were constructed as Bayesian credible intervals, a posterior predictive interval can be used.<br /><br />Suppose that $\tilde{x_i}$ is the new count of events for player $i$ in $\tilde{n_i}$ new trials. I'm going to assume that $\tilde{n_i}$ is known. I'll also assume that $\tilde{x_i}$ is generated from the same process that generated $x_i$.<br /><br /><div style="text-align: center;">$\tilde{x_i} \sim p(\tilde{x_i} | \theta_i, \tilde{n_i})$</div><br />The posterior predictive is then<br /><br /><div style="text-align: center;">$p(\tilde{x_i}| \tilde{n_i}, x_i, n_i, \mu, M) = \displaystyle \int p(\tilde{x_i} | \theta_i, \tilde{n_i})p(\theta_i | x_i, n_i, \mu, M) d\theta_i$ </div><br />For a bit more explanation, check out <a href="http://www.probabilaball.com/2015/09/the-posterior-predictive.html">my article on posterior predictive distributions.</a><br /><br /><h3>Normal-Normal Model</h3><h3></h3><br />As stated above, the posterior distribution for $\theta_i$ is normal<br /><br /><div style="text-align: center;">$\theta_i | x_i, n_i, \mu, M \sim N\left( (1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, \dfrac{\mu (1-\mu)}{n_i + M}\right)$<br /><br />$B = \dfrac{M}{n_i + M}$ </div><br />Using a normal approximation to the binomial, the distribution of the new on-base percentage in the second half (call this $\tilde{x_i}/\tilde{n_i}$) is also normal<br /><br /><div style="text-align: center;">$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \theta_i, \mu \sim N\left(\theta_i, \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$</div><br />The posterior predictive is the marginal distribution, integrating out over $\theta_i$ - it 
is given as<br /><br /><div style="text-align: center;">$\dfrac{\tilde{x_i}}{\tilde{n_i}} | \tilde{n_i}, x_i, n_i, \mu, M \sim N\left((1-B) \left(\dfrac{x_i}{n_i}\right) + B \mu, \dfrac{\mu (1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}\right)$</div><br />And so a 95% posterior predictive interval for the on-base percentage in the second half is given by<br /><br /><br /><div style="text-align: center;">$ \left[ \left(\dfrac{n_i}{n_i + M}\right) \dfrac{x_i}{n_i}+ \left(\dfrac{M}{n_i+M}\right) \mu \right] \pm 1.96 \sqrt{ \dfrac{\mu(1-\mu)}{n_i + M} + \dfrac{\mu(1-\mu)}{\tilde{n_i}}}$<br /><br /><h3 style="text-align: left;"> </h3><h3 style="text-align: left;">Beta-Binomial Model</h3><h3 style="text-align: left;"> </h3><div style="text-align: left;">The posterior distribution for $\theta_i$ is beta<br /><br /><div style="text-align: center;">$\theta_i | x_i, n_i, \mu, M \sim Beta(x_i + \mu M, n_i - x_i + (1-\mu) M)$</div><br /> The number of on-base events $\tilde{x_i}$ in $\tilde{n_i}$ new plate appearances follows a binomial distribution<br /><br /><div style="text-align: center;">$\tilde{x_i} \sim Binomial(\tilde{n_i}, \theta_i)$</div><br />The posterior predictive for the number of on-base events in the new number of trials is the marginal distribution, which has density<br /><br /><div style="text-align: center;">$p(\tilde{x_i}| x_i, n_i, \mu, M, \tilde{n_i}) = \displaystyle {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$</div><br /><br />This is the beta-binomial distribution. 
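<br /><br />The density above is easy to evaluate with log-gamma functions. The following Python sketch (hypothetical function names, hypothetical player with 60 on-base events in 180 PA and 200 new PA, and $\mu = 0.33$, $M = 296$ as used in this post) computes the predictive probabilities and accumulates them into a central interval for the new proportion, taking each bound as the smallest count whose cumulative probability reaches the target level.<br /><br />

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def predictive_pmf(k, n_new, x, n, mu, m):
    """P(new count = k) under the beta-binomial posterior predictive."""
    a = x + mu * m
    b = n - x + (1 - mu) * m
    log_choose = (math.lgamma(n_new + 1) - math.lgamma(k + 1)
                  - math.lgamma(n_new - k + 1))
    return math.exp(log_choose + log_beta(k + a, n_new - k + b) - log_beta(a, b))


def predictive_interval(n_new, x, n, mu, m, level=0.95):
    """Central interval for the new proportion from cumulative probabilities."""
    lo_q, hi_q = (1 - level) / 2, (1 + level) / 2
    cum, lower, upper = 0.0, None, None
    for k in range(n_new + 1):
        cum += predictive_pmf(k, n_new, x, n, mu, m)
        if lower is None and cum >= lo_q:
            lower = k  # smallest count with cumulative probability >= lo_q
        if upper is None and cum >= hi_q:
            upper = k  # smallest count with cumulative probability >= hi_q
    if upper is None:  # guard against floating-point shortfall in the sum
        upper = n_new
    return lower / n_new, upper / n_new


# Hypothetical player: 60 on-base events in 180 PA, 200 new PA to predict.
interval = predictive_interval(n_new=200, x=60, n=180, mu=0.33, m=296)
```

Working on the log scale avoids overflow in the binomial coefficient and the beta functions when the counts are in the hundreds.<br /><br />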
It is a discrete distribution that gives the probability of the <i>number</i> of on-base events in $\tilde{n_i}$ new PA, not the actual on-base percentage.<br /><br />Since it is discrete, it's easy to solve for quantiles<br /><br /><div style="text-align: center;">$Q(\alpha) = \displaystyle \min_{k} \{ k : F(k) \ge \alpha \}$</div>Where<br /><br /><div style="text-align: center;">$F(k) = \displaystyle \sum_{\tilde{x_i} \le k} p(\tilde{x_i} | x_i, n_i, \mu, M, \tilde{n_i}) = \displaystyle \sum_{\tilde{x_i} \le k} \displaystyle {\tilde{n_i} \choose \tilde{x_i}} \dfrac{\beta(\tilde{x_i} + x_i + \mu M, \tilde{n_i} - \tilde{x_i} + n_i - x_i + (1-\mu) M)}{\beta(x_i + \mu M,n_i - x_i + (1-\mu) M)}$</div><br />Since $Q(\alpha)$ is the quantile for the count of events, a 95% interval for the actual on-base proportion is given by<br /><br /><div style="text-align: center;">$\left(\dfrac{Q(.025)}{\tilde{n_i}} ,\dfrac{Q(.975)}{\tilde{n_i}}\right)$. </div><br />Alternatively, since the distribution is likely to be unimodal and bell-shaped, a normal approximation to the 95% posterior predictive interval is given by<br /><br /><div style="text-align: center;">$\left(\dfrac{x_i + \mu M}{n_i + M}\right) \pm 1.96 \sqrt{\dfrac{(x_i + \mu M)(n_i - x_i + (1-\mu)M)(n_i + M + \tilde{n_i})}{\tilde{n_i} (n_i + M)^2 (n_i + M + 1)}}$</div><br />This isn't as good of an approximation as the normal approximation to the beta-binomial interval for the mean, but the difference between intervals is still only around 1% of the length, so it should work well.<br /><br /><h3>Practical Application</h3><br /><br />I repeated the analysis using the predictive formulas given above, using the first half on-base percentage to try to capture the second half on-base percentage, using the same $\mu$ and $M$ values as before.<br /><br />\begin{array}{| l | l | c | c | c | c |} \hline<br />\textrm{Stat}& \mu & M & \textrm{NN Coverage} & \textrm{BB Coverage} & \textrm{BB (N) Coverage} \\ \hline<br />OBP & 0.33 & 
296 & 0.944 & 0.944 & 0.94 \\ <br />BA & 0.268 & 466 & 0.943 & 0.943 & 0.944 \\ <br />1B & 0.158 & 222 & 0.941 & 0.941 & 0.942 \\ <br />2B & 0.0475 & 1025 & 0.956 & 0.956 & 0.955 \\ <br />3B & 0.00492 & 373 & 0.955 & 0.955 & 0.956 \\ <br />XBH & 0.0524 & 1006 & 0.957 & 0.957 & 0.959 \\ <br />HR & 0.0274 & 125 & 0.951 & 0.951 & 0.952 \\ <br />BB & 0.085 & 106 & 0.925 & 0.925 & 0.921 \\ <br />SO & 0.181 & 50 & 0.918 & 0.918 & 0.92 \\ <br />HBP & 0.00866 & 297 & 0.95 & 0.95 & 0.947 \\ \hline\end{array}<br /><h2>Cautions and Conclusion</h2><br /><br />Despite the positive results, I think that 95% actual coverage from these intervals is overoptimistic. For one, I selected a very "nice" group of individuals to test it on - nonpitchers with more than 300 PA. Being in this category implies a high talent level and a lack of anything that could drastically change that talent level over the course of the season, such as injury. I also treated the second half sample size $\tilde{n_i}$ as known - obviously, that must be estimated as well, and should add additional uncertainty.<br /><br />Furthermore, there are <i>clearly</i> other factors at work than just random variation - players can get traded to different environments (a player being traded to or from Coors Field, for example), talent levels may very well change over the course of the season, and events are clearly not independent and identical.<br /><br />Applying these formulas to the population of players at large should see the empirical coverage drop - my guess (though I haven't tested it) is that 95% intervals should empirically get around 85%-90% actual coverage. 
Also keep in mind that $M$ and $\mu$ need to be kept updated - using means and stabilization points from too far in the past will lead to shrinkage towards the wrong point.</div><div style="text-align: left;"><br />You can and should be able to do better than these intervals, in terms of length - these are incredibly simplistic, using only information about the player and information about the population. Adding covariates to the model to account for other sources of variation should allow you to decrease the length without sacrificing accuracy.<br /><br />Alternatively, you could use these formulas with projections, with $\mu$ as the preseason projection and $M$ representing how many events that projection is "worth." This is more in keeping with the traditional Bayesian sense of the interval, and won't guarantee any sort of coverage.<br /><br />However, I still think these intervals are useful in that they represent a sort of baseline - any more advanced model that generates predictive intervals should be able to do better than these.<br /><br /><i>Edit 16 Mar. 2017: I found that the data file I used for this analysis with the split first and second half statistics was not what I thought it was - it repeated the same player multiple times, giving an inaccurate estimate of the confidence level. 
I have corrected the data file and re-run the analysis and presented the corrected confidence levels.</i> </div></div>rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-75605366599797841782015-10-16T10:36:00.000-05:002015-10-16T10:36:13.024-05:00Stabilization, Regression, Shrinkage, and Bayes<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> This post is somewhat of a brain dump, for all my thoughts on the concept of a "stabilization point," (which is a term I dislike) how it's being used, assumptions that are (usually) knowingly and (sometimes) unknowingly being made in the process, and when and how it can be used correctly.<br /><br /><div style="text-align: left;"><h2>Stabilization In Practice</h2></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The most well-known stabilization point calculation <a href="http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/">was performed by Russell Carleton</a>, who took samples of size $n$ of some statistic from multiple MLB players and declared the stabilization point to be the $n$ such that the correlation coefficient $r = 0.7$ (the logic being that this gives $R^2 \approx 0.5$ - however, I don't like this). 
This approach is nonparametric in the sense that it's not making any assumptions about the underlying structure of the data (for example, that the events are binomially distributed), only about the structure of the residuals, and these have been shown to be fairly reasonable and robust assumptions - in fact, the biggest problems with the original study came down to issues of sampling.<br /><br />The split-and-correlate method is the most common method used to find stabilization points. This method will work, though it's not especially efficient, and should give good results assuming the sampling is being performed well - <a href="http://www.baseballprospectus.com/article.php?articleid=14215">more recent studies</a> randomly split the data into two halves and then correlate. In fact, it will work for essentially <i>any</i> statistic, especially ones that are difficult to fit into a parametric model.</div><br />In his original study (and subsequent studies), Carleton finds the point at which the split-half correlation is equal to $r = 0.7$, since then $R^2 \approx 0.5$. Others have disagreed with this. Commenter Kincaid on Tom Tango's blog <a href="http://tangotiger.com/index.php/site/comments/point-at-which-pitching-metrics-are-half-signal-half-noise#12">writes</a><br /><br /><div style="text-align: left;"><blockquote class="tr_bq"><i>$r=.7$ between two separate observed samples implies that half of the variance in one observed sample is explained by the other observed sample. But the other observed sample is not a pure measure of skill; it also has random variance. So you can’t extrapolate that as half of the variance is explained by skill.</i></blockquote></div><br />I agree with this statement. In traditional regression analysis, the explanatory variable $x$ is conceived as fixed. In this correlation analysis, both the explanatory and response variables are random. 
Hence, it makes no sense to say that the linear regression with $x$ explains 50% of the variation in $y$ when $x$ is random and, if given the same player, in fact independent of $y$. Other arguments have also been made regarding the units of $r$ and $R^2$.<br /><br />There's a more practical reason, however - <a href="http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/">a commonly used form of regression towards the mean is given by</a> <br /><br /><div style="text-align: center;">$\dfrac{M}{n + M}$</div><br />where $M$ is the regression amount towards some mean. <a href="http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/">Tom Tango notes that</a>, if the stabilization point is estimated as the point at which $r = 0.5$, then this value $M$ can be plugged directly into the regression equation given above. As Kincaid has noted, <a href="http://www.3-dbaseball.net/2011/08/regression-to-mean-and-beta.html">this is the form of statistical shrinkage for the binomial distribution with a beta prior</a>. More generally, this is the form of the shrinkage coefficient $B$ that is obtained by modeling the outcome of a random event with a natural exponential family and performing a Bayesian analysis using a conjugate prior<i> </i>(see section five of Carl Morris's paper <a href="http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/NEF-QVF_1983_2240566.pdf"><i>Natural Exponential Families with Quadratic Variance Functions</i></a>). 
Fundamentally, this is why the $M/(n + M)$ formula seems to work so well - not just because it's beta-binomial Bayes, but because it's also normal-normal Bayes, and gamma-Poisson Bayes, and more - any member of the incredibly flexible natural exponential family.<br /><br />So simply taking the correlation doesn't make any assumptions about the parametric structure of the observed data - but taking that stabilization point and turning it into a shrinkage (regression) estimator towards the population mean <i>does</i> assume that the observed data come from a natural exponential family with the corresponding conjugate prior for the distribution of true talent levels.<br /><br /><br /><h2>Mathematical Considerations</h2><h2> </h2>In practical usage, only certain members of the natural exponential family are considered - the beta-binomial, gamma-Poisson, and the normal-normal models, for example, with the normal-normal largely dominating these choices. These form a specific subset of the natural exponential family - the natural exponential family with quadratic variance functions. 
The advantage these have over general NEF distributions is that, aside from being the most commonly used distributions, they are closed under convolution - that is, the sum of NEF-QVF random variables is also NEF-QVF - and this makes them ideal for modeling counting statistics, as the forms of all calculations stay the same as new information arrives, requiring only that new estimates and sample sizes be plugged into formulas.<br /><br /><a href="http://www.probabilaball.com/2015/07/shrinkage-estimators-for-counting.html">In a previous post</a> I used Morris's work with the natural exponential family with quadratic variance functions to describe a two-stage model with some raw counting statistic $x_i$ as the sum of $n_i$ trials<br /><br /><div style="text-align: center;"> $x_i \sim p(x_i | \theta_i)$</div><div style="text-align: center;"> $\theta_i \sim G(\theta_i | \mu, \eta)$</div><br />where $p(x_i | \theta_i)$ is NEF-QVF with mean $\theta_i$. If $G(.)$ is treated as a prior distribution for $\theta_i$, then the form of the shrinkage estimator for $\theta_i$ is given by<br /><br /><div style="text-align: center;">$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B)\bar{x_i} + B \mu$</div><br /><div style="text-align: left;">where $\bar{x_i} = x_i/n_i$ and $B$, as mentioned before, is the shrinkage coefficient. 
The shrinkage coefficient is controlled by the average amount of variance at the event level, $E[V(\theta_i)]$, and the variance of $G(.)$, weighted by the sample size $n_i$.<br /><br /><div style="text-align: center;">$B = \dfrac{ E[V(\theta_i)]}{ E[V(\theta_i)] + n_i Var(\theta_i)}$</div><br />And for NEF models, this simplifies down to <br /><br /><div style="text-align: center;">$B = \dfrac{M}{M + n_i}$ </div></div><div style="text-align: left;"><br />implying that the form of the stabilization point $M$ is given as<br /><br /><div style="text-align: center;">$M = \dfrac{E[V(\theta_i)]}{Var(\theta_i)}$</div><br />where $V(\theta_i)$ is the variance around the mean $\theta_i$ at the most basic level of the event (plate appearance, inning pitched, etc.), so that $Var(\bar{x_i} | \theta_i) = V(\theta_i)/n_i$. So under the NEF family, the stabilization point is the ratio of the average variance around the true talent level (the variance being a function of the true talent level itself) to the variance of the true talent levels themselves.<br /><br /><a href="http://www.probabilaball.com/2015/08/estimating-theoretical-stabilization.html">In another post</a>, I showed briefly that for this model, the split-half correlation is theoretically equal to one minus the shrinkage coefficient $B$.</div><br /><div style="text-align: center;">$\rho = 1 - B = \dfrac{n_i}{M+n_i}$</div><br />This is another result that has been commonly used. Therefore, to achieve any desired level of correlation $p$ between split samples, the formula<br /><br /><div style="text-align: center;">$n = \left(\dfrac{p}{1-p}\right) M$ </div><br />can be used to estimate the sample size required. 
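<br /><br />As a quick numerical check of these formulas, here is a minimal R sketch. Everything in it is an assumption made for illustration rather than something estimated from real data: talent levels are taken to follow a $Beta(30, 70)$ distribution, each event is a Bernoulli trial with $V(\theta_i) = \theta_i(1-\theta_i)$, the player with 40 successes in 100 events is made up, and the function names shrinkage, rho, and n_required are hypothetical.<br /><br />

```r
# Hypothetical talent distribution: theta ~ Beta(a, b); each event is Bernoulli(theta).
a <- 30
b <- 70

mu        <- a / (a + b)                        # population mean talent level
var_theta <- a * b / ((a + b)^2 * (a + b + 1))  # Var(theta), variance of talent levels
exp_V     <- mu * (1 - mu) - var_theta          # E[V(theta)] = E[theta * (1 - theta)]
M         <- exp_V / var_theta                  # stabilization point (a + b for this model)

# Shrinkage coefficient B and split-half correlation rho after n events
shrinkage <- function(n, M) M / (M + n)
rho       <- function(n, M) n / (M + n)

# Shrunken estimate for a made-up player: 40 successes in 100 events
n         <- 100
x_bar     <- 40 / n
B         <- shrinkage(n, M)
theta_hat <- (1 - B) * x_bar + B * mu

# Number of events needed for a target split-half correlation p
n_required <- function(p, M) (p / (1 - p)) * M

n_required(0.5, M)  # the stabilization point itself
n_required(0.7, M)  # 7/3 of the stabilization point
```

Under these assumptions $M$ works out to $a + b$, so with $n = M$ events the raw rate and the population mean get equal weight, and a target correlation of $0.7$ needs $7/3$ as many events as a target of $0.5$.<br /><br />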
This sample size formula derives not from any sort of correlation prophecy formula, but just from some algebra involving the forms of the shrinkage coefficient and split-half correlation $\rho$.<br /><br />It's for this reason that I dislike the name "stabilization point" - in its natural form it is the number of events required for a split-half correlation of $r = 0.5$ (and correspondingly a shrinkage coefficient of $0.5$), but really, you can estimate the split-half correlation and/or shrinkage amount for any sample size just by plugging in the corresponding values of $M$ and $n$. In general, there's not going to be much difference between samples of size $n$, $n-1$, and $n + 1$ - there's no magical threshold the sample size can cross at which a statistic suddenly becomes perfectly reliable - and in fact the formula implies a given statistic can <i>never</i> reach 100% stabilization.<br /><br />If I had my choice I'd call it the stabilization parameter, but alas, the name seems to already be set.<br /><br /><h2>Practical Considerations</h2><br />Note that at no point in the previous description was the shrinkage (regression) explicitly required to be towards the league mean talent level. The league mean is a popular choice to shrink towards; however, if the sabermetrician can construct a different prior distribution from other data (for example, the player's own past results) then all of the above formulas and results can be applied using that prior instead.<br /><br />When calculating stabilization points, studies have typically used a large sample of players from the league - theoretically, this implies that the league distribution of talent levels is being used as the prior distribution (the $G(.)$ in the two-stage model above), and that the resulting so-called stabilization point can be used for any player. 
In actuality, choices made during the sampling process imply certain priors which consist only of certain portions of the league distribution of talent levels (mainly, those that have reached a certain threshold for the number of trials). <a href="http://www.probabilaball.com/2015/08/more-offensive-stabilization-points.html">In my article that calculated offensive stabilization points</a>, I specified a hard minimum of 300 PA/AB in order to be included in the population. I estimated the stabilization point directly, but the issue also affects the correlation method used by Carleton, Carty, Pavlidis, etc. - in order to be included in the sample, a player must have had enough events (however they are defined), and this limits the sample to a nonrepresentative subset of the population of all MLB players. The effect of this is that the specific point calculated is really only valid for those individuals that meet those PA/AB requirements - even though those are the players who we know the most about! Furthermore, players that accrue more events do so specifically because they have higher talent levels - the stabilization points calculated for players who we know will receive, for example, at least 300 PA can't then turn around and be applied to players who we know will accrue fewer than 300 PA. This also explains how two individuals both using the same method in the same manner with the same data can arrive at different conclusions depending entirely on how they chose inclusion/exclusion rules for their sample.<br /><br />As a final issue, I used six years' worth of data - in doing this, I made the assumption that the basic shape of true talent levels for the subset of the population I chose had changed negligibly or not at all over six years. I didn't simply use all data, however, because I recognize that offensive environments change - the late 90s and early 2000s, for example, are drastically different from the early 2010s. 
This brings up another point - stabilization points, as they are defined, are a function of the mean (coming into play in the average variance around a statistic) and, primarily, the variance of population talent levels - however, both of those are changing over time. This means there is not necessarily such a thing as "the" stabilization point, since as the population of talent levels changes over time, so will the mean and variance (I wrote a couple of articles looking at how <a href="http://www.probabilaball.com/2015/08/offensive-stabilization-points-through.html">offensive</a> and <a href="http://www.probabilaball.com/2015/09/pitching-stabilization-points-through.html">pitching</a> stabilization points have changed over time), so stabilization points in articles that were published just a few years ago may or may not be valid any longer.<br /><br /><br /><h2>Conclusion</h2><h2></h2>Even after all this math, I still think the split-and-correlate method should be thought of as the primary method for calculating stabilization points, since it works on almost any kind of statistic, even more advanced ones that don't fit clearly into a NEF or NEFQVF framework. Turning around and using the results of that analysis to perform shrinkage (regression towards the mean), however, <i>does</i> make very specific assumptions about the form of both the observed data and underlying distribution of talent levels. Furthermore, sampling choices made at the beginning can strongly affect the final outcome, and limit the applicability of your analysis to the larger population. 
And if you remember nothing else from this - there is no such thing as "the" stabilization point, either in terms of when a statistic is reliable (it's always somewhat unreliable, the question is by how much) or one value that applies to all players at all times (since it's a function of the underlying distribution of talent levels, which is always changing).<br /><br />This has largely been just a summary of techniques, studies, and research others have done - I know others have expressed similar opinions as well - but I found the topic interesting and I wanted to explain it in a way that made sense to me. Hopefully I've made a little more clear the connections between statistical theory and things people were doing just because they seemed to work.<br /><br /><h2>Various Links</h2><h2> </h2>These are just some of the various links I read to attempt to understand what people were doing in practice and attempt to connect it to statistical theory: <br /><br />Carl Morris's paper <i>Natural Exponential Families with Quadratic Variance Functions: Statistical Theory</i>: <a href="http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/NEF-QVF_1983_2240566.pdf">http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/NEF-QVF_1983_2240566.pdf</a><br /><br />Russell Carleton's original reliability study: <a href="http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/">http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/</a><br /><br />Carleton's updated calculations: <a href="http://www.baseballprospectus.com/article.php?articleid=20516">http://www.baseballprospectus.com/article.php?articleid=20516</a><br /><br />Tom Tango comments on Carleton's article: <br /><a 
href="http://tangotiger.com/index.php/site/comments/point-at-which-pitching-metrics-are-half-signal-half-noise">http://tangotiger.com/index.php/site/comments/point-at-which-pitching-metrics-are-half-signal-half-noise</a><br /><br /><br />Derek Carty's stabilization point calculations: <a href="http://www.baseballprospectus.com/a/14215#88434">http://www.baseballprospectus.com/a/14215#88434</a><br /><br />Tom Tango discusses Carty's article, the $r = 0.7$ versus $r = 0.5$ threshold, and regression towards the mean: <a href="http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/">http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/</a><br /><a href="http://www.insidethebook.com/ee/index.php/site/comments/rates_without_sample_size/"></a><br />Steve Staude discusses $r = 0.5$ versus $r = 0.7$: <a href="http://www.fangraphs.com/blogs/randomness-stabilization-regression/">http://www.fangraphs.com/blogs/randomness-stabilization-regression/</a><br /><br />Tom Tango comments on Steve's work: <a href="http://tangotiger.com/index.php/site/comments/randomness-stabilization-regression">http://tangotiger.com/index.php/site/comments/randomness-stabilization-regression</a> <br /><br />Tom Tango links to Phil Birnbaum's proof of the regression towards the mean formula: <a href="http://tangotiger.com/index.php/site/comments/blase-from-the-past-proof-of-the-regression-toward-the-mean">http://tangotiger.com/index.php/site/comments/blase-from-the-past-proof-of-the-regression-toward-the-mean</a><br /><br />Kincaid shows that the beta-binomial model produces the regression towards the mean formula: <a href="http://www.3-dbaseball.net/2011/08/regression-to-mean-and-beta.html">http://www.3-dbaseball.net/2011/08/regression-to-mean-and-beta.html</a><br /><br />Harry Pavlidis looks at stabilization for some pitching events: <a 
href="http://www.hardballtimes.com/it-makes-sense-to-me-i-must-regress/">http://www.hardballtimes.com/it-makes-sense-to-me-i-must-regress/</a><br /><br />Tom Tango discusses Harry's article, and gives the connection between regression and stabilization: <a href="http://www.insidethebook.com/ee/index.php/site/article/regression_equations_for_pitcher_events/">http://www.insidethebook.com/ee/index.php/site/article/regression_equations_for_pitcher_events/</a><br /><br />Great summary of various regression and population variance estimation techniques - heavy on the math: <a href="http://www.countthebasket.com/blog/2008/05/19/regression-to-the-mean/">http://www.countthebasket.com/blog/2008/05/19/regression-to-the-mean/</a><br /><br />The original discussion on regression and shrinkage from Tom Tango's archives: <a href="http://www.tangotiger.net/archives/stud0098.shtml">http://www.tangotiger.net/archives/stud0098.shtml</a><br /><br /><br /><br /><br /><br />rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com2tag:blogger.com,1999:blog-4128498738742055603.post-59172954613173990112015-09-21T11:13:00.003-05:002016-02-16T15:14:33.963-06:00The Posterior Predictive<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> Let's say you do a Bayesian analysis and end up with a posterior distribution $p(\theta | y)$. What does that tell you about a new observation $\tilde{y}$ from some data-generating process that involves $\theta$? The answer can be found using the posterior predictive distribution.<br /><br />The code that I used to generate the images in this article can be found <a href="https://github.com/Probabilaball/Blog-Code/tree/master/The-Posterior-Predictive" target="_blank">on my github</a>. 
<br /><br /><br /><h2>Posterior Predictive</h2><br /> Given a posterior distribution $p(\theta | y)$, the posterior predictive distribution is defined as<br /><br /><div style="text-align: center;">$p(\tilde{y} | y) = \int p(\tilde{y} | \theta) p(\theta | y) d\theta$</div><br />and it represents the distribution of a new observation $\tilde{y}$ given your updated information about a parameter $\theta$ <i>and </i>natural variation around the observation that arises from the data-generating process.<br /><br />Since applied Bayesian techniques have tended towards fully computational MCMC procedures, the posterior predictive is usually obtained through simulation - let's say you have a sample $\theta_j^*$ ($j = 1,2,...,k$) from the posterior distribution of $\theta$ and you want to know about a new observation from some process that uses the parameter $\theta$.<br /><br /><div style="text-align: center;">$\tilde{y} \sim p(\tilde{y} | \theta)$</div><br />To obtain the posterior predictive, you would simulate a set of observations $\tilde{y}^*_j$ from $p(\tilde{y} | \theta_j^*)$ (in other words, simulate a new observation from the data model for each draw from the MCMC for your parameter). The distribution of these $\tilde{y}^*_j$ approximates the posterior predictive distribution for $\tilde{y}$.<br /><br />In general, this process is very specific to the problem at hand. It is possible in a few common scenarios, however, to calculate the posterior predictive distribution analytically. One example that is useful in baseball analysis is the beta-binomial model. <br /><br /><br /><h2>Beta-Binomial Example</h2><br />Let's say a batter obtains a 0.300 OBP in 250 PA - that corresponds to 75 on-base events and 175 not on-base events. 
What can you say about the distribution of on-base events in a new set of 250 PA?<br /><br />Suppose that the distribution of on-base events is given by a binomial distribution with $n = 250$ and chance of getting on base $\theta$, which is the same in both sets of PA.<br /><br /><div style="text-align: center;">$p(y | \theta) \sim Binomial(250, \theta)$</div><br />For the prior distribution, let's suppose that a $Beta(1,1)$ distribution was used - this is a uniform distribution between zero and one - so any possible value for $\theta$ is equally likely. Since the beta and binomial are conjugate distributions, the posterior distribution of $\theta$ (the batter's chance of getting on base) is also a beta distribution:<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$p(\theta| y = 75) = \dfrac{\theta^{75+1-1}(1-\theta)^{175+1-1}}{\beta(75+1,175+1)} \sim Beta(76,176)$</div><br />Now, suppose we are planning to observe another 250 PA for the same batter, and we want to know the distribution of on-base events $\tilde{y}$ in the new 250 PA. 
Given $\theta$, this distribution is also binomial<br /><br /><div style="text-align: center;">$p(\tilde{y} | \theta) = \displaystyle {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}}$</div><br /> The posterior predictive distribution for the number of on-base events in another 250 PA is then obtained by multiplying the two densities and integrating out $\theta$.<br /><br /><div style="text-align: center;">$p(\tilde{y} | y = 75) = \displaystyle \int_0^1 {250 \choose \tilde{y}} \theta^{\tilde{y}}(1-\theta)^{250-\tilde{y}} \cdot \dfrac{\theta^{75}(1-\theta)^{175}}{\beta(76,176)} d\theta$ </div><br />The resulting distribution is known as the beta-binomial distribution, which has density<br /><br /><div style="text-align: center;">$p(\tilde{y} | y = 75) =\displaystyle {250 \choose \tilde{y}} \dfrac{\beta(76 + \tilde{y}, 426-\tilde{y})}{\beta(76,176)}$</div><br />(The beta-binomial distribution is obtained from the beta-binomial model - it does get a bit confusing, but they are different things - the beta-binomial distribution can be thought of as a binomial with extra variance.) <br /><br />Now I can use the posterior predictive for inference. 
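<br /><br />The density above can be evaluated directly in R. The following is a minimal sketch, not the code used to produce the figures in this post; the function name post_pred and its argument names are hypothetical. It works on the log scale with the base R functions lchoose and lbeta, since the binomial coefficient and beta functions involved are too extreme to multiply together safely on the raw scale.<br /><br />

```r
# Beta-binomial posterior predictive pmf for y_new successes in n_new trials,
# given a Beta(a_post, b_post) posterior. Computed on the log scale for stability.
post_pred <- function(y_new, n_new, a_post, b_post) {
  exp(lchoose(n_new, y_new) +
        lbeta(a_post + y_new, b_post + n_new - y_new) -
        lbeta(a_post, b_post))
}

# Evaluate over every possible number of on-base events in a new 250 PA,
# using the Beta(76, 176) posterior from the example above.
y_grid <- 0:250
pmf    <- post_pred(y_grid, 250, 76, 176)

sum(pmf)                # sanity check: a valid pmf sums to 1
sum(y_grid * pmf)       # predictive mean, which should match 250 * 76 / 252
sum(pmf[y_grid >= 80])  # P(at least 80 on-base events, i.e. OBP >= 0.320)
```

Once the vector pmf is available, any predictive probability (a single count, a tail, or an interval) is just a sum over the relevant entries.<br /><br />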
If, for example, I wanted to know the probability that a player will have a 0.300 OBP in another 250 PA (corresponding, again, to $\tilde{y} = 75$ on-base events) then I can calculate that as<br /><br /><div style="text-align: center;">$p(\tilde{y} = 75 | y = 75) = \displaystyle {250 \choose 75} \dfrac{\beta(76 + 75, 426-75)}{\beta(76,176)} \approx 0.0389$</div><br />That is, our updated information says there's a 3.89% chance of getting exactly a 0.300 OBP in a new 250 PA by the same player.<br /><br />The full posterior predictive distribution of OBP in a new 250 PA is given by<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-ux-E85OTcwI/VaS4wQR0-bI/AAAAAAAAAOA/Yr8_Z1hS2Vk/s1600/Posterior%2BPredictive%2BAnalytic.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="636" src="https://1.bp.blogspot.com/-ux-E85OTcwI/VaS4wQR0-bI/AAAAAAAAAOA/Yr8_Z1hS2Vk/s640/Posterior%2BPredictive%2BAnalytic.jpeg" width="640" /></a></div><br /><br />This can also be accomplished by simulation - first, by simulating a large number $k$ of $\theta^*$ values from the posterior<br /><br /><div style="text-align: center;">$\theta^*_j \sim Beta(76,176)$</div><br />And then using those $\theta^*$ values to simulate from the data model $p(\tilde{y} | \theta^*)$<br /><br /><div style="text-align: center;">$\tilde{y}^*_j \sim Binomial(250, \theta^*_j)$</div><br />The estimated probability of a 0.300 OBP (equivalent to 75 on-base events) is then<br /><br /><div style="text-align: center;">$P(\tilde{y} = 75 | y = 75) \approx \displaystyle \dfrac{\textrm{# } \tilde{y}^*_j \textrm{ that equal 75}}{k} $</div><br />This is straightforward to do in R - with $k = 1000000$, the code to quickly perform this is<br /><br /><span style="font-family: "courier new" , "courier" , monospace;"> > theta <- rbeta(1000000, 76,176) #Simulate from Posterior<br /> > y <- rbinom(1000000, 250, theta) #Simulate from data model<br /> > mean(y == 75) #Estimate P(y = 75)<br 
/> [1] 0.039077</span><br /><br />Notice that the result is very close to the analytic answer of 3.89%. The simulated posterior predictive distribution for OBP is<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-8ZmN_ge4sZg/VaS4we0YSNI/AAAAAAAAAN8/A9Xg0MBGGbQ/s1600/Posterior%2BPredictive%2BSimulated.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="636" src="https://1.bp.blogspot.com/-8ZmN_ge4sZg/VaS4we0YSNI/AAAAAAAAAN8/A9Xg0MBGGbQ/s640/Posterior%2BPredictive%2BSimulated.jpeg" width="640" /></a></div><br /><br />Which visually looks very similar to the analytic version.<br /><br /><h2>What's the Use?</h2><br />Aside from doing (as the name implies) prediction, the posterior predictive is very useful for model checking - if data simulated from the posterior predictive (new data) is similar to the data you used to fit the model (old data), that is evidence that the model is a good fit. Conversely, if your posterior predictive data looks nothing like your original data set, you may have misspecified your model (and I'm criminally oversimplifying here - I recommend <i>Bayesian Data Analysis </i>by Gelman et al. for all the detail on model-checking using the posterior predictive distribution).rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com1tag:blogger.com,1999:blog-4128498738742055603.post-88297626779983204322015-09-14T11:51:00.001-05:002016-02-16T15:19:18.823-06:00Pitching Stabilization Points through Time<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><a href="http://www.probabilaball.com/2015/09/more-pitching-stabilization-points.html">In the previous post</a>, I calculated some stabilization points for pitching statistics for the past few years. 
In this post, I want to look at how some of those stabilization points have changed over time.<br /><br /><a href="http://www.probabilaball.com/2015/08/offensive-stabilization-points-through.html">(I have previously done this for offensive statistics)</a><br /><br />Each stabilization point is a six-year calculation, including the current and five previous years (so for example, 2014 includes 2009-2014 data, 1965 includes 1960-1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only starting pitchers from each year, and for cutoff values (the minimum number of TBF, BIP, etc. to be included in the dataset) I used the same values as in my previous post. Years were split for the same player. Raw counts are used, not adjusted in any form. Relief pitchers are excluded.<br /><br />All of the code that I used to create these <a href="https://github.com/Probabilaball/Blog-Code/tree/master/Pitching-Stabilization-Points-through-Time" target="_blank">can be found in my github</a>, though I make no claims to efficiency or ease of operation. Because I added this code several months after the article was originally posted, I did not clean and annotate it as I normally would have - I just posted the raw code. The code is a modified form of the code used to calculate offensive stabilization points over time.<br /><br /><a href="http://imgur.com/a/ccVKl">All of the plots shown below, and more, can be found in my imgur account.</a><br /><br /><h2>Historical Plots</h2><br />For some statistics, I will show plots for both the mean of a statistic over time and the stabilization point. 
The stabilization point is driven largely by the underlying population variance of talent levels, which tends to be more difficult to estimate than the mean - hence the reason that, even with six years of moving data, the 'stabilization point' will appear to fluctuate quite a bit. I recommend not reading too much into the fluctuations, but rather looking for more general patterns.<br /><br />Firstly, the ground ball, fly ball, and line drive rates (per BIP) only have recent data available. In that time, neither the fly ball nor the ground ball stabilization point has changed much<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-51nf5_bRSjc/VfDiYRQPkhI/AAAAAAAAAXo/Dq1rLVqXOfw/s1600/FB%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://2.bp.blogspot.com/-51nf5_bRSjc/VfDiYRQPkhI/AAAAAAAAAXo/Dq1rLVqXOfw/s640/FB%2BM%2BPlot.jpeg" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-AK_Qc65pE-8/VfDiYkH3P7I/AAAAAAAAAXs/fTxXFtOXIG0/s1600/GB%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://4.bp.blogspot.com/-AK_Qc65pE-8/VfDiYkH3P7I/AAAAAAAAAXs/fTxXFtOXIG0/s640/GB%2BM%2BPlot.jpeg" width="640" /></a></div>The line drive rate stabilization point appears to have increased in recent years, however.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-tvRtE3a74YM/VfDirPaDlpI/AAAAAAAAAX4/TPK_0zbzWt4/s1600/LD%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://3.bp.blogspot.com/-tvRtE3a74YM/VfDirPaDlpI/AAAAAAAAAX4/TPK_0zbzWt4/s640/LD%2BM%2BPlot.jpeg" width="640" /></a></div>Though keep in mind the standard error is approximately 100 balls in play.<br /><br />More interesting is batting average on balls in play, for 
which we have more data. The standard error for BABIP is approximately 500 balls in play, so it's not wise to trust small fluctuations in this plot as representing real shifts - however, it does appear that there is a positive trend in the stabilization point, indicative of the spread in BABIP values getting smaller. (A plot with 95% error bounds at each point <a href="http://i.imgur.com/q6iQg71.jpg">can be found here</a>, though I don't necessarily care for it)<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-GXkKbfQz2OE/VfNlCiNaB5I/AAAAAAAAAYM/KJ1hZp9I-vQ/s1600/BABIP%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://2.bp.blogspot.com/-GXkKbfQz2OE/VfNlCiNaB5I/AAAAAAAAAYM/KJ1hZp9I-vQ/s640/BABIP%2BM%2BPlot.jpeg" width="640" /></a></div>The mean is easier to estimate with more accuracy - and it shows that batting average on balls in play is at its highest point in history.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-rwJ-D_V4SBI/VfNlCdGEiOI/AAAAAAAAAYI/dpyOLuvWkY0/s1600/BABIP%2BMu%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://1.bp.blogspot.com/-rwJ-D_V4SBI/VfNlCdGEiOI/AAAAAAAAAYI/dpyOLuvWkY0/s640/BABIP%2BMu%2BPlot.jpeg" width="640" /></a></div> An animated plot shows how the mean and variance of the observed (histogram) and estimated true talent (dashed line) distributions have changed over time.<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-c3arYrU6S-w/VfNlC6B_cZI/AAAAAAAAAYU/kOmGtHXeuh0/s1600/BABIP%2BThrough%2BTime.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://1.bp.blogspot.com/-c3arYrU6S-w/VfNlC6B_cZI/AAAAAAAAAYU/kOmGtHXeuh0/s400/BABIP%2BThrough%2BTime.gif" width="400" /></a></div>As 
I've previously mentioned, the primary driving force for stabilization points is the underlying population variance. For example, take strikeout rate (per batter faced): since the dead ball era, its stabilization point has followed a pattern of fairly consistent decrease (with a recent upsurge that still places it within previously observed ranges).<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-hKJoapqOEe0/VfOk-C0-lpI/AAAAAAAAAYo/zr6Pz9pya44/s1600/SO%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://1.bp.blogspot.com/-hKJoapqOEe0/VfOk-C0-lpI/AAAAAAAAAYo/zr6Pz9pya44/s640/SO%2BM%2BPlot.jpeg" width="640" /></a></div> Over time, however, the mean strikeout rate (per batter faced) has been on the increase.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-ktmGrQHm4Q0/VfOk-h34DHI/AAAAAAAAAYw/fRcNzdKrPwg/s1600/SO%2BMu%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://3.bp.blogspot.com/-ktmGrQHm4Q0/VfOk-h34DHI/AAAAAAAAAYw/fRcNzdKrPwg/s640/SO%2BMu%2BPlot.jpeg" width="640" /> </a></div><div class="separator" style="clear: both; text-align: left;">What <i>does</i> coincide with the decrease in stabilization point is the increase in population variance over time, as seen in this animated plot with the observed strikeout rates (histogram) and estimated true talent distribution (dashed line) - the spread in both is steadily increasing over time.</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://4.bp.blogspot.com/-SyHtTUnhFLQ/VfOlBLS_fwI/AAAAAAAAAY4/hh04CgsXk60/s1600/SO%2BRate%2Bthrough%2BTime.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://4.bp.blogspot.com/-SyHtTUnhFLQ/VfOlBLS_fwI/AAAAAAAAAY4/hh04CgsXk60/s400/SO%2BRate%2Bthrough%2BTime.gif" width="400" /> </a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Also interesting is the earned run rate (per inning pitched, min 80 IP).</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-16RK4zp-lD4/VfWcj2CsQwI/AAAAAAAAAZI/nUj6SibMk5w/s1600/ER%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://2.bp.blogspot.com/-16RK4zp-lD4/VfWcj2CsQwI/AAAAAAAAAZI/nUj6SibMk5w/s640/ER%2BM%2BPlot.jpeg" width="640" /></a></div><div class="separator" style="clear: both; text-align: left;">Beginning in the early 2000s, it dropped to a very low point, relative to its history, and has remained there more consistently than in the past. 
Meanwhile, the stabilization point for walk rate (min 400 BF) has increased in recent years, after reaching a maximum in the 1980s and then decreasing.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-vg3n9Z2f6To/VfWd0AWJ2VI/AAAAAAAAAZQ/0yYQc-b0wmc/s1600/BB%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://4.bp.blogspot.com/-vg3n9Z2f6To/VfWd0AWJ2VI/AAAAAAAAAZQ/0yYQc-b0wmc/s640/BB%2BM%2BPlot.jpeg" width="640" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The stabilization points for on-base percentage and hit-by-pitch rate have both fluctuated within a relatively stable range over time.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-jPlwYBHx7jw/VfXq8KIUEXI/AAAAAAAAAZg/R2NCbmxi31Q/s1600/OBP%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://3.bp.blogspot.com/-jPlwYBHx7jw/VfXq8KIUEXI/AAAAAAAAAZg/R2NCbmxi31Q/s640/OBP%2BM%2BPlot.jpeg" width="640" /></a></div><a href="http://3.bp.blogspot.com/-Tp_Ga8T-yPI/VfXq8JK99SI/AAAAAAAAAZk/VF6FqOcOPNA/s1600/HBP%2BM%2BPlot.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="310" src="https://3.bp.blogspot.com/-Tp_Ga8T-yPI/VfXq8JK99SI/AAAAAAAAAZk/VF6FqOcOPNA/s640/HBP%2BM%2BPlot.jpeg" width="640" /></a><br /><div class="separator" style="clear: both; text-align: left;"><br /></div>Though interestingly, both (per batter faced) took a dip in the 1960s that 
corresponds to an <i>increase</i> in the mean hit-by-pitch rate and a <i>decrease </i>in the mean on-base percentage.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-YCHJaCWeM1A/VfXrW87KHSI/AAAAAAAAAZ4/FWJ42fYDKjg/s1600/HBP%2BMu%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://2.bp.blogspot.com/-YCHJaCWeM1A/VfXrW87KHSI/AAAAAAAAAZ4/FWJ42fYDKjg/s640/HBP%2BMu%2BPlot.jpeg" width="640" /></a></div><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-bU1aoQZRgFs/VfXsg9j-JuI/AAAAAAAAAaI/aAkIEbv-OnE/s1600/OBP%2BMu%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://2.bp.blogspot.com/-bU1aoQZRgFs/VfXsg9j-JuI/AAAAAAAAAaI/aAkIEbv-OnE/s640/OBP%2BMu%2BPlot.jpeg" width="640" /></a></div><br />For some statistics, such as WHIP and home run rate, it is difficult to discern a pattern other than fluctuations within a certain range.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-WcfrcRwGr_g/VfXw5h6ujZI/AAAAAAAAAaQ/g7Q3_Jz1ehw/s1600/WHIP%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://4.bp.blogspot.com/-WcfrcRwGr_g/VfXw5h6ujZI/AAAAAAAAAaQ/g7Q3_Jz1ehw/s640/WHIP%2BM%2BPlot.jpeg" width="640" /></a><a href="http://4.bp.blogspot.com/-6xUnGm4XGRM/VfXw5s3rUDI/AAAAAAAAAaU/kGgynhCpXNA/s1600/HR%2BM%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="308" src="https://4.bp.blogspot.com/-6xUnGm4XGRM/VfXw5s3rUDI/AAAAAAAAAaU/kGgynhCpXNA/s640/HR%2BM%2BPlot.jpeg" width="640" /></a></div><br />This has been an interesting look at how certain things have changed over time - though, as I mentioned before, I would encourage not reading too much into these 
plots.rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-66201456317295380852015-09-03T09:57:00.001-05:002016-02-16T15:20:58.625-06:00More Pitching Stabilization Points<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <br />Using the <a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html">beta-binomial model</a> (notated BB) or <a href="http://probabilaball.blogspot.com/2015/08/whip-stabilization-by-gamma-poisson.html">the gamma-Poisson model</a> (notated GP, and in this post what I call M is what in the previous post I called K - the variance parameter of the population talent distribution), I calculated the stabilization point for some more pitching statistics. I don't think the model(s) fit perfectly to the data, but they provide a good approximation that generally matches up with results I've seen elsewhere on the web.<br /><br />Data was acquired from fangraphs.com. I only considered starting pitchers from 2009 - 2014, splitting the same pitcher between years, and did not adjust the data in any way.<br /><br />All the data and code I used here <a href="https://github.com/Probabilaball/Blog-Code/tree/master/More-Pitching-Stabilization-Points" target="_blank">may be found on my github</a>. I make no claims to efficiency or ease of use. <br /><br />The "cutoff value" is the minimum number of the denominator (IP, TBF, BIP, etc.) in a year in order to be included in the data set. These numbers were chosen somewhat arbitrarily, and for some of my statistics, changing the cutoff value <i>will </i>change the stabilization point. 
I'm not sure which statistics this will happen to - I know WHIP for sure, and I suspect ER as well, whereas I think BABIP doesn't exhibit this tendency. It's a function of the change (or lack thereof) in population variance of talent levels as the cutoff value increases - if somebody wants to take a look at it, it would be neat.<br /><br />I wanted to have a little fun and apply the model to stats where it is clearly silly to do so, such as win rate (which I defined as wins per game started) and extra batters faced per inning (the number of additional batters a pitcher faced beyond the three per inning required by their IP). The model still produces estimates but, of course, bad data fed into a good model doesn't magically produce good analysis.<br /><br />\begin{array}{| l | l | c | c | c | c | c |} \hline<br />\textrm{Stat}&\textrm{Formula}&\hat{M}&SE(\hat{M})&\textrm{95% CI}&\textrm{Cutoff}&\textrm{Model}\\ \hline<br />\textrm{BABIP}&\textrm{(H-HR)/n*}&2006.71&484.94&(1056.22,2957.20)&300&BB\\<br />\textrm{GB Rate}&\textrm{GB/BIP}&65.52&3.63&(58.39,72.64)&300&BB\\<br />\textrm{FB Rate}&\textrm{FB/BIP}&61.96&3.42&(55.25,68.66)&300&BB\\<br />\textrm{LD Rate}&\textrm{LD/BIP}&768.42&94.10&(583.99,952.86)&300&BB\\<br />\textrm{HR/FB Rate}&\textrm{HR/FB}&505.11&93.95&(320.96,689.26)&100&BB\\<br />\textrm{SO Rate}&\textrm{SO/TBF}&90.94&5.04&(81.06,100.82)&400&BB\\<br />\textrm{HR Rate}&\textrm{HR/TBF}&931.59&107.80&(720.30,1142.88)&400&BB\\<br />\textrm{BB Rate}&\textrm{(BB-IBB)/(TBF-IBB)}&221.25&14.43&(192.97,249.53)&400&BB\\<br />\textrm{HBP Rate}&\textrm{HBP/TBF}&989.30&119.95&(754.21,1224.41)&400&BB\\<br />\textrm{Hit rate}&\textrm{H/TBF}&623.35&57.57&(510.51,736.18)&400&BB\\<br />\textrm{OBP}&\textrm{(H + BB + HBP)/TBF}&524.73&44.96&(436.62,612.84)&400&BB\\<br />\textrm{Win Rate}&\textrm{W/GS}&57.23&8.68&(40.21,74.24)&15&BB\\<br />\textrm{WHIP}&\textrm{(H + BB)/IP**}&77.20&5.46&(66.50,87.90)&80&GP\\<br />\textrm{ER Rate}&\textrm{ER/IP**}&59.55&3.94&(51.82,67.25)&80&GP\\<br 
/>\textrm{Extra BF}&\textrm{(TBF - 3IP**)/IP**}&73.00&5.08&(63.05,82.95)&80&GP\\ \hline<br />\end{array}<br /><br /><i>* I'm not exactly sure what combination of statistics fangraphs is using for the denominator of their BABIP - it's not BIP = GB + FB + LD. I know the numerator of H - HR is correct, but the denominator was usually smaller, though sometimes larger, than BIP. I solved for what fangraphs was using and used that in my calculations - if somebody wants to let me know exactly what they're using for n, please do.</i><br /><br /><i>** When dividing by IP, I corrected the 0.1 and 0.2 decimal representations to 0.33 and 0.67. </i><br /><br />I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found <a href="http://imgur.com/a/kZFoE" target="_blank">in this imgur gallery</a>. Remember that the dashed line represents the distribution of <i>talent levels</i>, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data. <br /><br />$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances (or, for these pitching statistics, units of the denominator in the formula: TBF, BIP, IP, etc.) at which there is 50% shrinkage.<br /><br />$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. 
It is on the same scale as the divisor in the formula.<br /><br />The 95% CI is calculated as<br /><br /><div style="text-align: center;">$\hat{M} \pm 1.96 SE(\hat{M})$</div><br />It represents a 95% confidence interval for the number of plate appearances at which there is 50% shrinkage.<br /><br />For an arbitrary stabilization level $p$, the number of required plate appearances can be estimated as<br /><br /><div style="text-align: center;">$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$</div><br />And a 95% confidence interval for the required number of plate appearances is given as<br /><br /><div style="text-align: center;">$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$</div><br />Since the denominators are so different (as opposed to offensive statistics where PA was used for almost everything except for batting average, and AB are fairly close to PA), I don't feel as comfortable putting everything on the same plot. That being said, the stats that use TBF look like<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-fZXpgXcMASU/VeepewtBXtI/AAAAAAAAAXM/hpzt1xTz0bM/s1600/TBF%2BStabilization%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="638" src="https://1.bp.blogspot.com/-fZXpgXcMASU/VeepewtBXtI/AAAAAAAAAXM/hpzt1xTz0bM/s640/TBF%2BStabilization%2BPlot.jpeg" width="640" /></a></div><br /><br />And the stats that use BIP for their denominator look like<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-1eOZ7Zr-KjI/VeepfyJYglI/AAAAAAAAAXQ/S1Gzp92XCJk/s1600/BIP%2BStabilization%2BPlot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="638" src="https://4.bp.blogspot.com/-1eOZ7Zr-KjI/VeepfyJYglI/AAAAAAAAAXQ/S1Gzp92XCJk/s640/BIP%2BStabilization%2BPlot.jpeg" width="640" /></a></div><br />As always, comments are appreciated.<br 
/><br />rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-89264361966912278492015-09-02T13:31:00.001-05:002015-09-02T16:23:07.577-05:002015 Win Prediction Totals (Through August)<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> These predictions are based on my own method (which can be improved). I set the nominal coverage at 95% (meaning that, the way I calculated them, the intervals should get it right 95% of the time), and I think by this point in the season the actual coverage should be close to that. <br /><br />Intervals are inclusive. All win totals assume a 162-game schedule.<br /><br />\begin{array} {c c c c c} <br />\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline<br /><br />ATL & 61 & 66.56 & 72 & 65.57 \\ <br />ARI & 73 & 78.98 & 85 & 83.48 \\ <br />BAL & 72 & 77.94 & 84 & 83.3 \\ <br />BOS & 70 & 75.88 & 82 & 77.71 \\ <br />CHC & 85 & 90.58 & 96 & 83.91 \\ <br />CHW & 70 & 76.24 & 82 & 74.78 \\ <br />CIN & 63 & 68.64 & 75 & 74.14 \\ <br />CLE & 74 & 80.20 & 86 & 82.03 \\ <br />COL & 61 & 67.29 & 73 & 70.14 \\ <br />DET & 69 & 74.75 & 81 & 74.66 \\ <br />HOU & 84 & 89.99 & 96 & 91.79 \\ <br />KCR & 92 & 97.84 & 104 & 90.34 \\ <br />LAA & 74 & 79.95 & 86 & 78.16 \\ <br />LAD & 84 & 90.31 & 96 & 87.66 \\ <br />MIA & 61 & 66.79 & 73 & 74.49 \\ <br />MIL & 64 & 69.71 & 76 & 74.48 \\ <br />MIN & 77 & 82.69 & 89 & 79.41 \\ <br />NYM & 84 & 89.67 & 95 & 87.19 \\ <br />NYY & 83 & 89.34 & 95 & 87.76 \\ <br />OAK & 68 & 73.33 & 79 & 82.77 \\ <br />PHI & 58 & 63.87 & 70 & 64.11 \\ <br />PIT & 91 & 97.21 & 103 & 89.43 \\ <br />SDP & 73 & 78.54 & 84 & 75.97 \\ <br />SEA & 69 & 74.32 & 80 & 71.95 \\ <br />SFG & 80 & 85.61 & 91 & 86.79 \\ <br />STL & 98 & 
103.63 & 109 & 97.33 \\ <br />TBR & 76 & 81.45 & 87 & 80.79 \\ <br />TEX & 78 & 83.60 & 90 & 79.00 \\ <br />TOR & 87 & 92.88 & 98 & 98.67 \\ <br />WSN & 77 & 82.61 & 89 & 84.04 \\ \hline\end{array}<br /><br />To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (at the end of August).<br /><br />These quantiles are based off of a distribution - <a href="http://imgur.com/a/wywVF" target="_blank">I've uploaded a picture of each team's distribution to imgur</a>. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky. 
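<br /><br />The coin-flip arithmetic above can be sketched in a few lines of Python. This is only an illustration of the difference between the two columns, assuming a fixed success probability for every remaining trial; it is not the method used to produce the table, and the function name and its arguments are hypothetical.<br /><br />

```python
def projected_total(current_successes, trials_played, total_trials, true_rate):
    """Successes so far plus the expected successes over the remaining trials."""
    remaining = total_trials - trials_played
    if remaining < 0:
        raise ValueError("trials_played cannot exceed total_trials")
    return current_successes + true_rate * remaining


# Coin-flip example from the text: a fair coin flipped 10 times,
# with 4 heads observed in the first 5 flips.
true_total = 0.5 * 10                        # analogue of "True Win Total"
mean_total = projected_total(4, 5, 10, 0.5)  # analogue of "Mean"

print(true_total, mean_total)
```

<br /><br />The same function would apply to a team by passing its current wins, games played, 162, and an estimated true winning percentage, though the intervals in the table come from a full predictive distribution rather than this single expectation.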
rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-74729502971684385762015-08-27T09:33:00.000-05:002016-02-16T15:36:40.062-06:00WHIP Stabilization by the Gamma-Poisson Model<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <a href="http://probabilaball.blogspot.com/2015/08/more-offensive-stabilization-points.html" target="_blank">I've previously covered shrinkage estimation for offensive statistics - or at least, those that can be written as binomial events</a>. In a previous post, I showed that for models that follow the natural exponential family with quadratic variance function, <a href="http://probabilaball.blogspot.com/2015/07/shrinkage-estimators-for-counting.html" target="_blank">the split-half correlation is equal to one minus the shrinkage coefficient $B$</a>.<br /><br />The techniques I used can also be used when the outcome at the most basic level (at-bat, inning pitched, etc.) is <i>not</i> just a binary outcome. In particular, the Poisson distribution also fits within the framework I derived, as it is a member of <a href="http://probabilaball.blogspot.com/2015/07/shrinkage-estimators-for-counting.html" target="_blank">the natural exponential family with quadratic variance function</a>, and so events that can be modeled as Poisson at the base level will follow the same basic principles I used for the binomial outcomes. I chose the statistic WHIP (walks + hits per inning pitched) to illustrate this method, as it is a counting statistic that is a non-binary event (i.e., you can have 0, 1, 2, ... 
walks + hits in a given inning), so it fits the support of the Poisson.<br /><br /><h2>Model and Estimation</h2><br />I will assume that in each inning, pitcher $i$ gives up a number of walks + hits that follows a Poisson model with mean $\theta_i$, which is unique to each pitcher. The total number of walks + hits given up in $n_i$ innings is $x_i$, and I have $N$ pitchers total. I considered only starting pitchers from 2009-2014, and split the same pitcher between years. <a href="https://github.com/Probabilaball/Blog-Code/tree/master/WHIP-Stabilization-by-the-Gamma-Poisson-Model" target="_blank">My code and data are on github for anybody who wants to check my calculations.</a><br /><br />Since the sum of Poissons is Poisson, the total number of walks + hits $x_i$ given up in $n_i$ innings follows a Poisson distribution with mean $n_i \theta_i$ and mass function<br /><br /><div style="text-align: center;">$p(x_i | \theta_i, n_i) = \dfrac{e^{-n_i \theta_i} (n_i \theta_i)^{x_i}}{x_i !}$</div><br />I will also assume that the distribution of means $\theta_i$ follows a gamma distribution with mean $\mu$ and variance parameter $K$ (in this parametrization, $\mu = \alpha/\beta$ and $K = \beta$ as opposed to the traditional $\alpha, \beta$ parametrization). 
This distribution has density<br /><br /><div style="text-align: center;">$f(\theta_i | \mu, K) = \dfrac{K^{\mu K}}{\Gamma(\mu K)} \theta_i^{\mu K - 1} e^{-K \theta_i}$</div><br /><a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">As shown in a previous post</a>, the split-half correlation is then one minus the shrinkage coefficient $B$, or <br /><br /><div style="text-align: center;">$\rho = 1 - B = \left(\dfrac{n_i}{n_i + K}\right)$</div><br />So once I have an estimate $\hat{K}$ and a desired stabilization level $p$, solving for $n$ gives<br /><br /><div style="text-align: center;">$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{K}$</div><br />Once again, the population variance parameter $K$ is equivalent to the 0.5 stabilization point - the point where the split-half correlation should be exactly equal to 0.5, and also the point where the individual pitcher estimates are shrunk 50% of the way towards the mean. <br /><br />For estimation of $\mu$ and $K$, I used marginal maximum likelihood -<a href="http://probabilaball.blogspot.com/2015/06/confidence-interval-for-batting-average.html" target="_blank"> a one dimensional introduction to maximum likelihood is given here</a>. 
The marginal density of $x_i$ given $\mu$ and $K$ is <br /><br /><div style="text-align: center;">$p(x_i | n_i, \mu, K) = \displaystyle \int_0^{\infty} \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} e^{-\theta_i (n_i + K)} \theta_i^{x_i + \mu K - 1} d\theta_i = \dfrac{K^{\mu K}n_i^{x_i}}{\Gamma(\mu K) x_i !} \dfrac{\Gamma(x_i + \mu K)}{(n_i + K)^{x_i + \mu K}}$</div><div style="text-align: center;"><br /></div><div style="text-align: left;">And the log-likelihood (dropping terms that do not involve either $\mu$ or $K$) is given by</div><div style="text-align: left;"><br /></div><div style="text-align: center;">$\ell(\mu, K) = N \mu K \log(K) - N \log(\Gamma(\mu K)) + \displaystyle \sum_{i = 1}^N \left[\log(\Gamma(x_i + \mu K)) - (x_i + \mu K) \log(n_i + K)\right]$</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Once again, I wrote code to maximize this function in R using a Newton-Raphson algorithm. I converted $K$ to $\phi = 1/(1 + K)$ in the equation above for estimation and then converted it back by $K = (1-\phi)/\phi$ after estimation was complete, because this reparametrization makes the estimation procedure much more stable.<br /><br />In performing this estimation, I had to choose the minimum number of innings pitched (IP) required for a pitcher to be included in the dataset. When performing a similar analysis for on-base percentage, I found that at around 300 PA, the population variance (and hence, the stabilization point) became roughly constant. 
Unfortunately, this is <i>not</i> true for starting pitchers.<br /><br /><div style="text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-06mSBBY0O0k/Vd4rBqaDBNI/AAAAAAAAAW8/PoJWexpU0hM/s1600/cuttoff%2Bversus%2BK.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://1.bp.blogspot.com/-06mSBBY0O0k/Vd4rBqaDBNI/AAAAAAAAAW8/PoJWexpU0hM/s400/cuttoff%2Bversus%2BK.jpeg" width="400" /></a></div><br /></div><br /></div>The population variance in talent levels decreases consistently as a function of the minimum number of IP that are considered, and so the stabilization point $K$ increases. This means that, unlike OBP, for example, the stabilization point is always determined by what percentage of pitchers you look at (by IP) - if you look at only the top 50%, the stabilization point will be larger than the stabilization point for the top 70%.<br /><br />This is reflected in the plot below - as with OBP and PA, the mean WHIP is associated with the number of IP, but unlike with OBP, the <i>variance</i> around the mean is constantly changing with the mean.<br /><br /><div style="text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-grU-Ut84IDg/Vd4rBBARiLI/AAAAAAAAAWs/FX6j395EVGs/s1600/IP%2Bversus%2BWHIP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://1.bp.blogspot.com/-grU-Ut84IDg/Vd4rBBARiLI/AAAAAAAAAWs/FX6j395EVGs/s400/IP%2Bversus%2BWHIP.jpeg" width="400" /></a></div><br /></div><br />For my calculation, I chose to use 80 innings pitched as my cutoff point - corresponding to approximately 15 games started and capturing slightly more than 50% of pitchers (by IP). 
This point was completely arbitrary, though, and other cutoffs will be equally valid depending on the question at hand.<br /><br />Performing the estimation, the estimated league mean WHIP was $\hat{\mu} = 1.304$ with variance parameter $\hat{K} = 77.203$.<br /><br /><div style="text-align: center;"><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Km6AigRHrV8/Vd4rBMg7Y8I/AAAAAAAAAWg/6SRMedy4Mcg/s1600/WHIP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://2.bp.blogspot.com/-Km6AigRHrV8/Vd4rBMg7Y8I/AAAAAAAAAWg/6SRMedy4Mcg/s400/WHIP.jpeg" width="400" /></a></div></div><br />Once again, 95% confidence intervals for a specific stabilization level $p$ are given as<br /><br /><div style="text-align: center;">$\left(\dfrac{p}{1-p}\right) \hat{K} \pm 1.96 \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{K})}$</div><br />From (<a href="http://probabilaball.blogspot.com/2015/06/the-delta-method-for-confidence.html" target="_blank">delta-method transformed</a>) maximum likelihood output, $Var(\hat{K}) = 29.791$ (for a standard error of $5.459$ IP). 
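<br /><br />As a quick numerical check of the interval formula above, here is a minimal Python sketch that plugs in the reported estimates $\hat{K} = 77.203$ and $Var(\hat{K}) = 29.791$. The function name <code>stabilization_interval</code> and its arguments are illustrative choices, not taken from the code on github, and the interval is the simple Wald interval described above.<br /><br />

```python
import math


def stabilization_interval(p, k_hat, se_k, z=1.96):
    """Estimate and Wald interval for the innings needed to reach stabilization level p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    odds = p / (1 - p)             # the multiplier p / (1 - p) from the formula
    n_hat = odds * k_hat           # point estimate of required innings
    half_width = z * odds * se_k   # the standard error scales by the same multiplier
    return n_hat, n_hat - half_width, n_hat + half_width


k_hat = 77.203              # estimated variance parameter reported above
se_k = math.sqrt(29.791)    # standard error from the reported variance

for p in (0.5, 0.7, 0.9):
    n_hat, lower, upper = stabilization_interval(p, k_hat, se_k)
    print(f"p = {p:.1f}: n = {n_hat:.1f} IP, 95% CI ({lower:.1f}, {upper:.1f})")
```

<br /><br />At $p = 0.5$ the multiplier is 1, so the function simply returns $\hat{K}$ and its own interval; larger values of $p$ widen the interval by the same factor that scales the estimate.<br /><br />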
The stabilization curve, with confidence bounds, is then<br /><br /><div style="text-align: center;"><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-c_kVSt9V7aI/Vd4rBLmfPRI/AAAAAAAAAW4/J1LGtfwiPl8/s1600/Stabilization%2BLevel%2Bversus%2BIP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://1.bp.blogspot.com/-c_kVSt9V7aI/Vd4rBLmfPRI/AAAAAAAAAW4/J1LGtfwiPl8/s400/Stabilization%2BLevel%2Bversus%2BIP.jpeg" width="400" /></a></div><br /></div>Aside from the model criticisms I've already mentioned, standard ones apply - innings pitched are not identical and independent (and treating them as such is clearly much worse than treating plate appearances as identical and independent), pitchers are not machines, etc. I don't think the model is <i>great</i>, but it is useful. It gives confidence bounds for the stabilization point, something other methods don't do. As always, comments are appreciated.rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-41960293250780367572015-08-19T14:50:00.001-05:002016-02-16T15:41:01.764-06:00Offensive Stabilization Points through Time<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> Using <a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">my maximum likelihood technique for estimating stabilization points</a>, I performed a moving calculation of the stabilization point using data from 1900 - 2014 from fangraphs.com. 
Each stabilization point is a six-year calculation, including the current and five previous years (so for example, 2014 includes 2009 - 2014 data, 1965 includes 1960 - 1965 data, etc.). There's not a mathematical or baseball reason for this choice - through trial and error it just seemed to provide enough data for estimation that the overall trend was apparent, with a decent amount of smoothing. Data includes only batters from each year with at least 300 plate appearances, and splits years for the same player. Raw counts are used, not adjusted in any form. Pitchers are excluded. My data and code <a href="https://github.com/Probabilaball/Blog-Code/tree/master/Offensive-Stabilization-Points-Through-Time" target="_blank">are posted on my github if you would like to run it for yourself</a>.<br /><br />I have defined the "stabilization point" as the point where the split-half correlation is equal to 0.5, which is equivalently where the shrinkage amount is 50%. Both occur at a sample size equal to the variance parameter $M$ in the beta-binomial model I fit, where the distribution of events given a mean $\theta_i$ is binomial for player $i$ and the underlying distribution of the $\theta_i$ follows a beta distribution with mean $\mu$ and variance parameter $M$.<br /><br /><h2>Historical Plots</h2><br />Trends can be seen clearly in plots. For example, here is a plot of the stabilization point for home run rate from 1900 - 2014.<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-DcCVg57MbEU/VdJWemSUoTI/AAAAAAAAATo/7kZlLWsTvb4/s1600/HR.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://3.bp.blogspot.com/-DcCVg57MbEU/VdJWemSUoTI/AAAAAAAAATo/7kZlLWsTvb4/s640/HR.jpeg" width="640" /></a></div><br />The effect of the dead ball era is clearly evident. 
A large stabilization point indicates a <i>small </i>variance - and during the dead ball era, there was a small variance, because most players weren't hitting home runs! More recently, the stabilization point has risen to the highest level it's been since that era.<br /><br />Note that the stabilization point should not be confused with the mean. In fact, here's a plot of the estimated league mean home run rate over the same period.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-ysVbvTRujFw/VdJgY_fw9uI/AAAAAAAAAT4/tyrzYLb89TM/s1600/HR%2BMean.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://2.bp.blogspot.com/-ysVbvTRujFw/VdJgY_fw9uI/AAAAAAAAAT4/tyrzYLb89TM/s640/HR%2BMean.jpeg" width="640" /></a></div><br />While going through peaks and valleys, the home run rate has risen fairly continuously over time - and the recent rise in home run stabilization point actually corresponds to a decrease in the mean home run rate (though interestingly, the decrease in league mean home run rate since the end of the steroid era <i>still </i>puts the current mean home run rate above any other preceding era).<br /><br />To give another example, the stabilization point for triple rate is the lowest it's been since the dead ball era - even though the league mean triple rate has decreased fairly continuously over time.<br /><br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-islCd8Nb2Kk/VdJg61OuRkI/AAAAAAAAAUE/Kw4HPRCTGcU/s1600/3B.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="308" src="https://1.bp.blogspot.com/-islCd8Nb2Kk/VdJg61OuRkI/AAAAAAAAAUE/Kw4HPRCTGcU/s640/3B.jpeg" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a 
href="http://4.bp.blogspot.com/-JNhVcazt6FA/VdJg6kLV_cI/AAAAAAAAAUA/uLFnurnpVl0/s1600/3B%2BMean.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://4.bp.blogspot.com/-JNhVcazt6FA/VdJg6kLV_cI/AAAAAAAAAUA/uLFnurnpVl0/s640/3B%2BMean.jpeg" width="640" /></a></div><br />Interestingly, the stabilization points for walk rate and on-base percentage are the highest they've ever been, with walk rate having a noticeably sharp increase in recent years - one theory is that this is due to a "moneyball" effect of teams focusing much more strongly on walk rate as opposed to other statistics - indeed, the stabilization point for batting average (shown later in the article) has dropped during the same period - perhaps indicative of being more tolerant of variation in batting average but less tolerant in variation of on-base percentage (of course, pitching has grown more dominant since the end of the steroid era, which is likely adding to the effect as well).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-0Q_nJtImjSg/VdKzWEKxQWI/AAAAAAAAAVA/rJk0aFJQnT0/s1600/BB.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://4.bp.blogspot.com/-0Q_nJtImjSg/VdKzWEKxQWI/AAAAAAAAAVA/rJk0aFJQnT0/s640/BB.jpeg" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Cs1T7qRezl4/VdKzWP-l2RI/AAAAAAAAAVE/2L6ywASTDRg/s1600/OBP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="311" src="https://2.bp.blogspot.com/-Cs1T7qRezl4/VdKzWP-l2RI/AAAAAAAAAVE/2L6ywASTDRg/s640/OBP.jpeg" width="640" /></a></div><br />Meanwhile, the stabilization points for double rate and extra base hit (2B + 3B) have increased over time.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a 
href="http://1.bp.blogspot.com/-X673u_wVS-E/VdOFS6xLpbI/AAAAAAAAAVY/CtYzUpaax6U/s1600/2B.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="308" src="https://1.bp.blogspot.com/-X673u_wVS-E/VdOFS6xLpbI/AAAAAAAAAVY/CtYzUpaax6U/s640/2B.jpeg" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-0CdVtYKGHZg/VdOFS8IfMLI/AAAAAAAAAVc/63jpWy2knbs/s1600/XBH.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://3.bp.blogspot.com/-0CdVtYKGHZg/VdOFS8IfMLI/AAAAAAAAAVc/63jpWy2knbs/s640/XBH.jpeg" width="640" /></a></div><br />But while the extra base hit rate stabilization point has decreased from the mid-2000s, the double rate stabilization point has remained roughly the same.<br /><br />The hit-by-pitch rate stabilization point follows the same pattern as that of triple rate - it increased after the dead ball era, peaking in the 1930s and 1950s - but has decreased since then, and despite a small recent increase, is at its lowest level since that era.<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-cWwzZrzLXvk/VdOcFl07eyI/AAAAAAAAAWI/v6Jf4PQnyR0/s1600/HBP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="311" src="https://1.bp.blogspot.com/-cWwzZrzLXvk/VdOcFl07eyI/AAAAAAAAAWI/v6Jf4PQnyR0/s640/HBP.jpeg" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: 
center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Meanwhile, the strikeout rate stabilization point decreased fairly consistently over time, before stabilizing approximately in the 1970s, with peaks in the 1980s and early 2000s.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-TKpzs4x12r8/VdOcFo3PN5I/AAAAAAAAAWU/GMLhgvRO_0k/s1600/SO.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://2.bp.blogspot.com/-TKpzs4x12r8/VdOcFo3PN5I/AAAAAAAAAWU/GMLhgvRO_0k/s640/SO.jpeg" width="640" /></a></div><br /><h2>What Drives the Stabilization Point?</h2><br />As I've shown, the mean of the underlying distribution of talents does not seem to be strongly associated with the stabilization point - the variance of the underlying distribution of talent levels is the primary factor. 
There is an inverse relationship - a small stabilization point indicates that there is a large variance in talent levels for that particular statistic, and a large stabilization point indicates that there is a small variance in talent levels for that statistic.<br /><br />To get a clearer view of the factors that are affecting the stabilization point, here's a plot of the stabilization point for batting average (using at-bats as the denominator) versus time.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-5b7ZuagMWCs/VdKtvj4Sv4I/AAAAAAAAAUo/u4LCKVMLSy4/s1600/BA.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://4.bp.blogspot.com/-5b7ZuagMWCs/VdKtvj4Sv4I/AAAAAAAAAUo/u4LCKVMLSy4/s640/BA.jpeg" width="640" /> </a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Below is an animation showing the empirical distribution of batting average with the estimated underlying distribution of talent levels in dashed lines (since I'm estimating the distribution of <i>true </i>batting averages and not the distribution of <i>observed</i> batting averages, it's okay that the dashed line is narrower than the histogram). 
Notice that as time goes on, the distribution gets narrower (the variance is decreasing) - this is what's driving the increase in stabilization point over time.</div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Yz0PjEjqhvo/VdKt0qHhrCI/AAAAAAAAAUw/82GzFOJ-gIQ/s1600/BA%2BOver%2BTime.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-Yz0PjEjqhvo/VdKt0qHhrCI/AAAAAAAAAUw/82GzFOJ-gIQ/s1600/BA%2BOver%2BTime.gif" /></a></div><br />The opposite effect can be seen in the single rate stabilization point - it has decreased (with peaks and valleys) over time<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-wrNM84jnmKg/VdOTrhA9Y_I/AAAAAAAAAVw/EAnqgN3kTRg/s1600/1B.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://1.bp.blogspot.com/-wrNM84jnmKg/VdOTrhA9Y_I/AAAAAAAAAVw/EAnqgN3kTRg/s640/1B.jpeg" width="640" /></a></div><br />As the distribution of single rates has become more spread out.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-nYyKlEPJS40/VdOT90fWDxI/AAAAAAAAAV4/iiZioZPKdhs/s1600/1B%2BRate%2BOver%2BTime.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-nYyKlEPJS40/VdOT90fWDxI/AAAAAAAAAV4/iiZioZPKdhs/s1600/1B%2BRate%2BOver%2BTime.gif" /></a></div><br /><br /><br /><a href="http://imgur.com/a/YuWjH" target="_blank">Graphics of all the stabilization points, league mean talent levels, and animated estimated talent distributions can be found here.</a><br /><br /><h2>Individual Years</h2><br />I also selected a few years to compare individually - some for specific reasons, some just as a representative of a certain era.<br /><br /><ul><li>1910, in the middle of the dead ball era, and a year of particularly low 
offensive output. </li><li>1928, to represent the 1920s and the age of Babe Ruth.</li><li>1937, to represent the 1930s. </li><li>1945, the end of the second world war. </li><li>1959, to represent the 1950s. </li><li>1968, the year of the pitcher.</li><li>1975, six years after they lowered the mound. </li><li>1987, before the steroid era and six years after the 1981 labor stoppage. </li><li>2001, the year Barry Bonds hit 73 home runs, in the middle of the steroid era. </li><li>2014, the modern era.</li></ul><br />\begin{array}{| c | c | c | c | c | c | c | c | c | c | c |}\hline<br />\textrm{Year} & \textrm{1B} & \textrm{2B}& \textrm{3B}& \textrm{XBH}& \textrm{HR} & \textrm{SO} & \textrm{BB} & \textrm{BA} & \textrm{OBP} & \textrm{HBP} \\ \hline<br />1910 & 523.02 & 348.68 & 469.42 & 274.24 & 537.41 & 130.77 & 87.33 & 285.65 & 175.27 & 259.08 \\<br />1928 & 442.01 & 475.05 & 583.77 & 385.43 & 102.77 & 83.26 & 82.10 & 286.25 & 173.49 & 414.86 \\<br />1937 & 436.17 & 597.72 & 709.13 & 463.40 & 90.61 & 73.79 & 76.94 & 344.93 & 174.84 & 723.59\\<br />1945 & 456.71 & 710.05 & 543.09 & 474.47 & 98.81 & 67.99 & 79.83 & 424.54 & 180.14 & 699.53 \\<br />1959 & 351.40 & 1059.30 & 721.82 & 794.95 & 90.18 & 58.28 & 81.20 & 430.60 & 200.31 & 414.12\\<br />1968 & 333.52 & 867.61 & 691.18 & 700.24 & 94.46 & 55.18 & 93.24 & 476.94 & 265.04 & 448.36\\<br />1975 & 246.27 & 970.40 & 646.09 & 773.63 & 85.50 & 53.19 & 73.61 & 407.65 & 204.92 & 410.99 \\<br />1987 & 269.39 & 949.73 & 537.13 & 801.21 & 90.23 & 52.85 & 86.22 & 541.67 & 262.28 & 430.67 \\<br />2001 & 255.57 & 838.16 & 482.32 & 971.61 & 95.11 & 57.37 & 76.16 & 465.84 & 196.51 & 251.47\\ <br />2014 & 222.16 & 1025.31 & 372.50 & 1006.30 & 124.52 & 49.73 & 105.59 & 465.92 & 295.79 & 297.41 \\ \hline<br />\end{array}<br /><br /><br />While generally following the fuller patterns shown in the plots, the effect of major baseball events such as the dead ball era, the second world war, the lowering of the mound, and the steroid era 
is evident.<br /><br />Remember that a smaller stabilization point indicates a larger variance among talent levels - so looking at 1968 and 1975 to see the effect of lowering the mound, for example, the spread of single, triple, and home run rates increased while the spread of double and extra-base hit rates decreased (the extra base hit rate being largely driven by the double rate). Interestingly, the spread of strikeout rates remained roughly the same, but the spread of walk rates, hit by pitch rates, batting average, and on-base percentage all increased.<br /><br />Overall, a fun way to look at how offensive statistics have changed over time. Let me know what you think in comments.rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-29901233227334296072015-08-13T09:48:00.002-05:002016-02-16T15:42:14.610-06:00More Offensive Stabilization Points<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script><a href="http://probabilaball.blogspot.com/2015/08/estimating-theoretical-stabilization.html" target="_blank">Using the same method from my previous post</a>, I calculated some stabilization points for more offensive statistics.<br /><br />All of the code and data I used <a href="https://github.com/Probabilaball/Blog-Code/tree/master/More-Offensive-Stabilization-Points">may be found on my github</a>.<br /><br />For each, I used a binomial distribution for the result of the at-bat/plate appearance and a beta distribution for the distribution of talent levels. 
I think a different model might work better for some of these statistics (and I'll have to work through the Poisson-gamma anyway when I look at pitching statistics), but this represents a decent approximation.<br /><br />As I stated in a previous post, statistics that can <i>not</i> be constructed as a binomial event (such as wOBA) do not fall under the framework I am using, and so I have not included them in estimation. I could treat them as binomial events, fit a model, and perform the estimation procedure, but I would have no idea if the results are correct or not. <br /><br />All estimated stabilization points were calculated using unadjusted data from fangraphs.com for hitters from 2009-2014 with at least 300 PA, excluding pitchers.<br /><br />\begin{array}{| l | l | c | c | c |} \hline<br />\textrm{Statistic} & \textrm{Formula} & \hat{M} & SE(\hat{M}) & \textrm{95\% CI} \\ \hline<br />\textrm{OBP} & \textrm{(H+BB+HBP)/PA} & 295.79 & 16.41 & (263.63, 327.95) \\<br />\textrm{BA} & \textrm{H/AB} & 465.92 & 34.23 & (398.83, 533.02) \\ <br />\textrm{SO Rate} & \textrm{SO/PA} & 49.73 & 1.92 & (45.96, 53.50) \\ <br />\textrm{BB Rate} & \textrm{(BB-IBB)/(PA-IBB)} & 110.91 & 4.84 & (101.44, 120.38) \\ <br />\textrm{1B Rate} & \textrm{1B/PA} & 222.16 & 11.32 & (199.98, 244.34) \\ <br />\textrm{2B Rate} & \textrm{2B/PA} & 1025.31 & 108.00 & (813.64, 1236.98) \\ <br />\textrm{3B Rate} & \textrm{3B/PA} & 372.50 & 26.56 & (320.44, 424.56) \\ <br />\textrm{XBH Rate} & \textrm{(2B+3B)/PA} & 1006.30 & 105.23 & (800.04, 1212.57) \\ <br />\textrm{HR Rate} & \textrm{HR/PA} & 124.52 & 5.90 & (112.95, 136.09) \\ <br />\textrm{HBP Rate} & \textrm{HBP/PA} & 297.41 & 18.26 & (261.61, 333.20) \\ \hline<br />\end{array}<br /><br />I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found <a href="http://imgur.com/a/1IHYR" target="_blank">in this imgur gallery</a>. 
Remember that the dashed line represents the distribution of <i>talent levels</i>, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data. <br /><br />$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances at which there is 50% shrinkage.<br /><br />$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the divisor in the formula - so PA for all except batting average and walk rate.<br /><br />The 95% CI is calculated as<br /><br /><div style="text-align: center;">$\hat{M} \pm 1.96 SE(\hat{M})$</div><br />It represents a 95% confidence interval for the number of plate appearances at which there is 50% shrinkage.<br /><br />For an arbitrary stabilization level $p$, the number of required plate appearances can be estimated as<br /><br /><div style="text-align: center;">$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$</div><br />And a 95% confidence interval for the required number of plate appearances is given as<br /><br /><div style="text-align: center;">$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$</div><br />Without confidence bounds, a plot of the sample size required for various stabilization levels is<br /><div class="separator" style="clear: both; text-align: center;"> <a href="http://1.bp.blogspot.com/-J64S3_nay28/VcD8iNtiQmI/AAAAAAAAAS4/U2RXWYzKvaU/s1600/p%2Bversus%2Bn.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="638" src="https://1.bp.blogspot.com/-J64S3_nay28/VcD8iNtiQmI/AAAAAAAAAS4/U2RXWYzKvaU/s640/p%2Bversus%2Bn.jpeg" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-QieDy4ndjQw/VcBIKzPXdsI/AAAAAAAAASk/h9QOE623JdY/s1600/p%2Bversus%2Bn.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div><br /><br 
/>And a plot of the stabilization level at various sample sizes is given as<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-gi0eccL9h3I/VcD8iCNSClI/AAAAAAAAATE/xw5Vq_LyOf4/s1600/n%2Bversus%2Bp.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="638" src="https://2.bp.blogspot.com/-gi0eccL9h3I/VcD8iCNSClI/AAAAAAAAATE/xw5Vq_LyOf4/s640/n%2Bversus%2Bp.jpeg" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-yYgrSN_zK14/VcBIJxzPwmI/AAAAAAAAASc/r3o6dkHdNbU/s1600/n%2Bversus%2Bp.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div>This looks very similar to <a href="http://www.fangraphs.com/blogs/a-new-way-to-look-at-sample-size/" target="_blank">other plots I have seen</a>.<br /><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Comments are appreciated. Also, I'm currently in the process of learning ggplot, so hopefully my graphics won't be as awful in the near future.</div>rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-53267861871076311402015-08-12T12:41:00.002-05:002015-08-26T20:06:46.887-05:002015 Win Prediction Totals (Through July)<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> These are a bit late - it's August 12, but these intervals only include games through July 31. <br /><br />These predictions are based on my own method (which can be improved). 
I set the nominal coverage at 95% (meaning that, the way I calculated it, the intervals should get it right 95% of the time), and I think by this point in the season the actual coverage should be close to that. <br /><br />Intervals are inclusive. All win totals assume a 162-game schedule.<br /><br />\begin{array} {c c c c c} <br />\textrm{Team} & \textrm{Lower} & \textrm{Mean} & \textrm{Upper} & \textrm{True Win Total} \\ \hline<br /><br />ATL & 64 & 72.27 & 81 & 72.09 \\ <br />ARI & 73 & 81.38 & 90 & 83.36 \\ <br />BAL & 74 & 82.90 & 92 & 86.17 \\ <br />BOS & 64 & 72.12 & 81 & 72.94 \\ <br />CHC & 76 & 85.07 & 94 & 81.23 \\ <br />CHW & 68 & 76.53 & 85 & 73.10 \\ <br />CIN & 66 & 74.63 & 84 & 76.06 \\ <br />CLE & 68 & 76.90 & 86 & 78.07 \\ <br />COL & 62 & 70.37 & 79 & 72.69 \\ <br />DET & 70 & 78.54 & 87 & 78.37 \\ <br />HOU & 82 & 90.46 & 99 & 90.71 \\ <br />KCR & 85 & 94.02 & 103 & 89.14 \\ <br />LAA & 78 & 87.28 & 96 & 87.11 \\ <br />LAD & 82 & 90.37 & 99 & 88.88 \\ <br />MIA & 61 & 70.09 & 79 & 77.17 \\ <br />MIL & 63 & 71.01 & 80 & 75.43 \\ <br />MIN & 74 & 82.43 & 91 & 79.51 \\ <br />NYM & 74 & 82.33 & 91 & 80.50 \\ <br />NYY & 82 & 90.45 & 99 & 87.62 \\ <br />OAK & 67 & 75.23 & 84 & 84.46 \\ <br />PHI & 54 & 62.61 & 71 & 63.10 \\ <br />PIT & 83 & 92.27 & 101 & 87.16 \\ <br />SDP & 69 & 77.14 & 86 & 74.48 \\ <br />SEA & 65 & 73.45 & 82 & 73.91 \\ <br />SFG & 79 & 88.31 & 97 & 87.21 \\ <br />STL & 92 & 101.20 & 110 & 96.66 \\ <br />TBR & 72 & 80.88 & 90 & 80.62 \\ <br />TEX & 70 & 78.37 & 87 & 76.62 \\ <br />TOR & 77 & 86.00 & 94 & 92.18 \\ <br />WSN & 77 & 85.99 & 95 & 84.89 \\ \hline\end{array}<br />To explain the difference between "Mean" and "True Win Total" - imagine flipping a fair coin 10 times. The number of heads you expect is 5 - this is what I have called "True Win Total," representing my best guess at the true ability of the team over 162 games. 
However, if you pause halfway through and note that in the first 5 flips there were 4 heads, the predicted total number of heads becomes $4 + 0.5(5) = 6.5$ - this is what I have called "Mean", representing the expected number of wins based on true ability over the remaining schedule added to the current number of wins (at the end of July). <br /><br />As a bonus, these quantiles are based off of a distribution - <a href="http://imgur.com/a/23MKt" target="_blank">I've uploaded a picture of each team's distribution to imgur</a>. The bars in red are the win total values covered by the 95% interval. The blue line represents my estimate of the team's "True Win Total" based on its performance - so if the blue line is lower than the peak, the team is predicted to finish lucky, and if the blue line is higher than the peak, the team is predicted to finish unlucky.rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-90255425216518042922015-08-05T09:35:00.002-05:002020-03-19T10:01:15.907-05:00Estimating Theoretical Stabilization Points <i>Edit 19 March 2020: This post has been adapted into a paper <a href="https://doi.org/10.1016/j.jmp.2020.102330">in the Journal of Mathematical Psychology.</a> In the process of writing the paper, a number of </i> <i>mistakes, omissions, or misstatements were found in this post. It is being left up as it was originally written, just in case anybody is interested. For a more correct version, please refer to the journal article. </i><br /><br />The technique commonly used to assess stabilization points of statistics is called split-half correlation. This post will show that within a fairly general modeling framework, the split-half correlation is a function of two things: the sample size and a variance parameter of the distribution of talent levels. 
It's therefore possible to skip the correlation step entirely and use statistical techniques to estimate the variance parameter of the talent distribution directly, and then use that to estimate the sample size required (with confidence bounds) for a specific stabilization level. <br /><br /><h2>Theoretical Split-Half Correlation</h2><br />(Note: This first part is very theoretical - it's the part that shows the statistical link between shrinkage and split-half correlation for a certain family of distributions. If you just want to trust me or your own experience that it exists, you can skip this and go straight to the "Estimation" section without missing too much.)<br /><br />I'm going to work within <a href="http://probabilaball.blogspot.com/2015/07/shrinkage-estimators-for-counting.html" target="_blank">my theoretical framework where the data follows a natural exponential family with a quadratic variance function</a> (NEFQVF) - so this will work for normal-normal, beta-binomial, Poisson-gamma, and a few other models.<br /><br /><div style="text-align: center;">$X_i \sim p(x_i | \theta_i)$</div><div style="text-align: center;">$\theta_i \sim G(\theta_i| \mu, \eta)$</div> <br />Split-half reliability takes two samples that are presumed to be measuring the same thing (I'll call these samples $X_i$ and $Y_i$) and calculates the correlation between them -if they actually <i>are</i> measuring the same thing, then the correlation should be high.<br /><br />In baseball, it's commonly used as a function of the sample size $n$ to assess when a stat "stabilizes" - that is, if I take two samples of size $n$ from the same model (be it player at-bats, batters faced, etc.) 
and calculate the correlation of a statistic between the samples, then once the correlation exceeds a certain value, the statistic is considered to have "stabilized."<br /><br />Let's say that $\bar{X_i}$ is the normalized statistic of $n$ observations (on-base percentage, for example) from the first "half" of the data (though it does <i><u>not</u></i> need to be chronological) and $\bar{Y_i}$ is the normalized statistic of $n$ observations from the second half of the data. In baseball terms, $\bar{X_i}$ might be something like the OBP from the first sample and $\bar{Y_i}$ the OBP from the second sample.<br /><br />I want to find the correlation coefficient $\rho$, which is defined as<br /><br /><div style="text-align: center;">$\rho = \dfrac{Cov(\bar{X_i}, \bar{Y_i})}{\sqrt{Var(\bar{X_i})Var(\bar{Y_i})}}$ </div><br />First, the numerator. The law of total covariance states that<br /><br /><div style="text-align: center;">$Cov(\bar{X_i}, \bar{Y_i}) = E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] + Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i])$</div><br />Given the same mean $\theta_i$, $\bar{X_i}$ and $\bar{Y_i}$ are assumed to be independent - hence,<br /><br /><div style="text-align: center;"> $E[Cov(\bar{X_i}, \bar{Y_i} | \theta_i)] = E[0] = 0$.</div><br />Functionally, this is saying that for a given player, the first half of the performance data is independent of the second half of the performance data.<br /><br />Since $E[\bar{X_i} | \theta_i] = E[\bar{Y_i} | \theta_i] = \theta_i$ for the framework I'm working in, the second part becomes<br /><div style="text-align: center;"><br /></div><div style="text-align: center;"> $Cov(E[\bar{X_i} | \theta_i], E[\bar{Y_i} | \theta_i]) = Cov(\theta_i, \theta_i) = Var(\theta_i)$</div><br />Thus, $Cov(\bar{X_i}, \bar{Y_i})$ is equal to the "between player" variance - that is, the variance among league talent levels.<br /><br />Now, the denominator. 
As I've used before, the law of total variance states that<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$Var(\bar{X_i}) = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i}| \theta_i])$ </div><br />Since $\bar{X_i}$ and $\bar{Y_i}$ have the same distributional form, they will have the same variance.<br /><br /><div style="text-align: center;">$Var(\bar{X_i}) = Var(\bar{Y_i}) = E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i]) = \dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)$ </div><br />Where $E[V(\theta_i)]$ is the average variance of performance at the level of plate appearance, inning pitched, etc. Hence, the split-half correlation between them will be<br /><br /><div style="text-align: center;">$\rho = \dfrac{Var(E[\bar{X_i}| \theta_i])}{E[Var(\bar{X_i} | \theta_i)] + Var(E[\bar{X_i} | \theta_i])} = \dfrac{Var(\theta_i)}{\dfrac{1}{n} E[V(\theta_i)] + Var(\theta_i)} = 1 - B $</div><div style="text-align: center;"><br /></div><div style="text-align: left;">Where $B$ is the shrinkage coefficient. The important theoretical result, then, is that for NEFQVF distributions, the split-half correlation is equal to one minus the shrinkage coefficient. Since we know the form of the shrinkage coefficient, we can estimate what the split-half correlation will be for different values of $n$.<br /><br />I'm not breaking any new ground here in terms of what people are doing in practice, but there does exist a theoretical justification linking split-half correlation and this particular formula for shrinkage estimation.</div><div style="text-align: left;"><br /></div><h2 style="text-align: left;">Estimation</h2><div style="text-align: left;"><br />Using <a href="http://fangraphs.com/">fangraphs.com</a>, I collected data from all MLB batters (excluding pitchers) who had at least 300 PA (which is a somewhat arbitrary choice on my part - further discussion on this choice in the model criticisms section) from 2009 to 2014. 
I'm considering players to be different across years - so for example, 2009-2014 Miguel Cabrera is six different players. I'll define $x_i$ as the number of on-base events for player $i$ in $n_i$ plate appearances. I have $N$ of these players.<br /><br />All my data and code <a href="https://github.com/Probabilaball/Blog-Code/tree/master/Estimating-Theoretical-Stabilization-Points" target="_blank">are posted on my github</a> if you would like to independently verify my calculations (and I'm learning to use github while I do this, so apologies if the formatting is completely wrong). <br /><br />The number of on-base events $x_i$ follows a beta-binomial model - this fits into the NEFQVF family.<br /><br /><div style="text-align: center;">$x_i \sim Binomial(n_i, \theta_i)$</div><div style="text-align: center;">$\theta_i \sim Beta(\mu, M)$</div><br />Here I am using $\mu = \alpha/(\alpha + \beta)$ and $M = \alpha + \beta$ as opposed to the traditional $\alpha, \beta$ notation for a beta distribution. The true league mean OBP is $\mu$ and $M$ controls the variance of true OBP values among players.<br /><br />Let's say I want to know at what sample size $n$ the split-half correlation will be at a certain value $p$. For a beta-binomial model, the split-half correlation is<br /><br /><div style="text-align: center;">$\rho = 1 - B = 1 - \dfrac{M}{M + n} = \dfrac{n}{M + n}$</div><br />Where $B$ is the shrinkage coefficient. So if we desire a stabilization level $p$, the required sample size is given by solving<br /><br /><div style="text-align: center;">$p = \dfrac{n}{M + n}$</div><br />For $n$. The solution is<br /><br /><div style="text-align: center;">$n= \left(\dfrac{p}{1-p}\right) M$</div><br />Given an estimate of the talent variance parameter $\hat{M}$, the estimated $n$ is <br /><br /><div style="text-align: center;">$\hat{n} = \left(\dfrac{p}{1-p}\right)\hat{M}$</div><br /><div style="text-align: left;">As a side note, at $n = M$ the split-half correlation and shrinkage amount are both 0.5. 
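<br /><br />As a rough check on the two formulas above, here is a minimal Python sketch. The function names and the value $M = 300$ are hypothetical and chosen only for illustration, and the simulation assumes the beta-binomial model holds exactly, with every player getting the same number of trials in each half.

```python
import math
import random


def split_half_correlation(n, M):
    """Model-implied split-half correlation, rho = n / (M + n)."""
    return n / (M + n)


def required_sample_size(p, M):
    """Sample size at which the split-half correlation equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p) * M


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulated_split_half(mu, M, n, players, seed=0):
    """Simulate two independent halves of n trials per player and correlate them."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(players):
        # True talent level drawn from Beta(mu * M, (1 - mu) * M).
        theta = rng.betavariate(mu * M, (1.0 - mu) * M)
        first.append(sum(rng.random() < theta for _ in range(n)) / n)
        second.append(sum(rng.random() < theta for _ in range(n)) / n)
    return pearson(first, second)


if __name__ == "__main__":
    M = 300.0  # hypothetical value, for illustration only
    print(required_sample_size(0.7, M))    # about 700
    print(split_half_correlation(300, M))  # 0.5
    print(simulated_split_half(0.33, M, n=300, players=2000))
```

With $n = M$ the simulated correlation should land near 0.5, up to simulation noise.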
<br /><br />For estimation of $M$, I'm going to use marginal maximum likelihood. For a one-dimensional introduction to maximum likelihood, <a href="http://probabilaball.blogspot.com/2015/06/confidence-interval-for-batting-average.html" target="_blank">see my post on maximum likelihood estimation for batting averages</a>.<br /><br />The marginal distribution of on-base events $x_i$ given $\mu$ and $M$ is a beta-binomial distribution, with mass function given by<br /><br /><div style="text-align: center;">$ p(x_i | \mu, M) = \displaystyle \int_0^1 {n_i \choose x_i} \dfrac{\theta_i^{x_i + \mu M -1}(1-\theta_i)^{n_i - x_i + (1-\mu)M-1}}{\beta(\mu M, (1-\mu)M)} d\theta_i = {n_i \choose x_i} \dfrac{\beta(x_i + \mu M, n_i - x_i + (1-\mu) M)}{\beta(\mu M, (1-\mu)M)}$</div></div><br />This distribution represents the probability of $x_i$ on-base events in $n_i$ plate appearances given league mean OBP $\mu$ and variance parameter $M$, bypassing the choice of player altogether. The maximum likelihood estimate says to choose the values of $\mu$ and $M$ that maximize the joint probability of the observed OBP values. <br /><br />For a sample of size $N$ players, each with $x_i$ on-base events in $n_i$ plate appearances, the log-likelihood (dropping the binomial coefficients, which do not depend on $\mu$ or $M$) is given as<br /><br /><div style="text-align: center;">$\ell(\mu, M) = \displaystyle \left[ \sum_{i =1}^N \log(\beta(x_i + \mu M, n_i - x_i + (1-\mu) M))\right] - N \log(\beta(\mu M, (1-\mu)M))$</div><br />This must be maximized numerically using computer software - I wrote a program using the Newton-Raphson algorithm to do this in R, which is posted on my github. For estimation, I actually converted $M$ to $\phi = 1/(1+M)$ in the above equation, performed the maximization, and then converted my estimate back to the original scale with $\hat{M} = (1-\hat{\phi})/\hat{\phi}$. 
The technical details of why I did this are a bit too much here, but it makes the estimation procedure much more stable - I'll be happy to discuss this with anybody who wants to know.<br /><br />The maximum likelihood estimates are given by $\mu = 0.332$ and $M = 295.7912$.<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div style="text-align: center;"><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-UgMUpYnI-xY/Vb15ReI5YxI/AAAAAAAAAQs/lxf3qIN3PfE/s1600/OBP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://2.bp.blogspot.com/-UgMUpYnI-xY/Vb15ReI5YxI/AAAAAAAAAQs/lxf3qIN3PfE/s400/OBP.jpeg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-OwNF6uqyfbE/Va8sPlyq9aI/AAAAAAAAAQM/8MbApi8R1MU/s1600/OBP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div><br /></div><br />Above is the distribution of observed OBP values with the estimated distribution of true OBP levels overlaid as dashed lines.<br /><br />This means that for a split-half correlation of $p = 0.7$, the estimate of sample size required is<br /><br /><div style="text-align: center;">$\hat{n} = \left(\dfrac{0.7}{1-0.7}\right)295.7912 = 690.1794$</div><br />Which we could round to $\hat{n} = 690$ plate appearances. 
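<br /><br />To make the fitting step more concrete, here is a small Python sketch of marginal maximum likelihood for the beta-binomial model. It is not the R program on my github: it runs on simulated data, it swaps Newton-Raphson for a simple coordinate-wise golden-section search, and it searches over $\log M$ rather than $\phi$. All function names and the simulation settings are hypothetical.

```python
import math
import random


def log_beta(a, b):
    """Logarithm of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_likelihood(mu, M, data):
    """Beta-binomial marginal log-likelihood without the binomial coefficients
    (they do not depend on mu or M). data is a list of (x_i, n_i) pairs."""
    a, b = mu * M, (1.0 - mu) * M
    return sum(log_beta(x + a, n - x + b) for x, n in data) - len(data) * log_beta(a, b)


def golden_max(f, lo, hi, tol=1e-6):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:  # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = f(d)
        else:        # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = f(c)
    return (lo + hi) / 2.0


def fit_beta_binomial(data, sweeps=10):
    """Alternate one-dimensional searches over log(M) and mu."""
    mu = sum(x for x, _ in data) / sum(n for _, n in data)  # pooled rate as a start
    log_m = math.log(100.0)
    for _ in range(sweeps):
        log_m = golden_max(lambda t: log_likelihood(mu, math.exp(t), data), 0.0, 12.0)
        mu = golden_max(lambda m: log_likelihood(m, math.exp(log_m), data), 0.01, 0.99)
    return mu, math.exp(log_m)


def simulate(mu, M, n, players, seed=1):
    """Simulated (x_i, n_i) pairs from a beta-binomial model."""
    rng = random.Random(seed)
    data = []
    for _ in range(players):
        theta = rng.betavariate(mu * M, (1.0 - mu) * M)
        data.append((sum(rng.random() < theta for _ in range(n)), n))
    return data


if __name__ == "__main__":
    data = simulate(mu=0.33, M=300.0, n=500, players=400)
    mu_hat, m_hat = fit_beta_binomial(data)
    print(mu_hat, m_hat, 0.7 / (1 - 0.7) * m_hat)
```

The search bounds on $\log M$ (0 to 12) and on $\mu$ (0.01 to 0.99) are arbitrary choices that comfortably contain the simulated truth; a real fit might need different ones.<br /><br />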
Since $\hat{M}$ is the maximum likelihood estimator, $\hat{n}$ is the maximum likelihood estimate of the point at which split-half correlation is 0.7 by invariance of the maximum likelihood estimator.<br /><br />Furthermore, if we have a variance $Var(\hat{M})$, the variance of the estimated sample size for p-stabilization is given as<br /><br /><div style="text-align: center;">$Var(\hat{n}) = Var\left( \left(\dfrac{p}{1-p}\right)\hat{M}\right) = \left(\dfrac{p}{1-p}\right)^2 Var(\hat{M})$</div><br />So a $(1-\alpha)\times 100\%$ confidence interval for $n$ is given as<br /><br /><div style="text-align: center;">$\left(\dfrac{p}{1-p}\right)\hat{M} \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{Var(\hat{M})}$</div></div><div style="text-align: left;"><br /><br />The output from the maximum likelihood estimation can be used to estimate $Var(\hat{M})$. Since I estimated $\hat{\phi}$, I had to get $Var(\hat{\phi})$ from the output of the computer program I used and then use <a href="http://probabilaball.blogspot.com/2015/06/the-delta-method-for-confidence.html" target="_blank">the delta method</a> to convert it back to the scale of $M$. Doing that, I got $Var(\hat{M}) = 269.1678$. This gives a 95% confidence interval for the 0.7-stabilization point as<br /><br /><div style="text-align: center;">$690.1794 \pm 1.96 \left(\dfrac{0.7}{1-0.7}\right) \sqrt{269.1678} = (615.1478, 765.211)$</div></div><br />Or between approximately 615 and 765 plate appearances. 
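<br /><br />The interval arithmetic is easy to reproduce. Below is a short Python sketch using the values quoted above ($\hat{M} = 295.7912$, $Var(\hat{M}) = 269.1678$). The function names are hypothetical, and the delta-method helper is my reading of the conversion from $\phi$ to $M$: since $M = (1-\phi)/\phi$, the derivative is $-1/\phi^2$, so $Var(\hat{M}) \approx Var(\hat{\phi})/\hat{\phi}^4$.

```python
import math


def delta_method_var_m(phi_hat, var_phi):
    """Approximate Var(M_hat) from Var(phi_hat), where M = (1 - phi) / phi.
    Because dM/dphi = -1 / phi**2, the delta method gives var_phi / phi_hat**4."""
    return var_phi / phi_hat ** 4


def stabilization_interval(p, m_hat, var_m, z=1.96):
    """Point estimate and Wald interval for the p-stabilization sample size."""
    scale = p / (1.0 - p)
    center = scale * m_hat
    half_width = z * scale * math.sqrt(var_m)
    return center, center - half_width, center + half_width


if __name__ == "__main__":
    # Estimates quoted in the text for OBP.
    print(stabilization_interval(0.7, 295.7912, 269.1678))
    print(stabilization_interval(0.5, 295.7912, 269.1678))
```

The first call should match the 0.7-stabilization interval above to rounding, and the second gives the interval for $\hat{M}$ itself.<br /><br />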
A 95% confidence interval for the 0.5-stabilization point (which is just $\hat{M}$) is between approximately 264 and 328 plate appearances.<br /><br />For an arbitrary p-stabilization point sample size, the confidence interval formula is<br /><br /><div style="text-align: center;">$\left(\dfrac{p}{1-p}\right)295.7912 \pm z^* \left(\dfrac{p}{1-p}\right) \sqrt{269.1678}$</div><br />Below is a graph of the required sample size for a stabilization level of p between 0.5 and 0.8 - the dashed lines are 95% confidence bounds.<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-ohx65n-atYo/Vb15Rciy2GI/AAAAAAAAAQ0/jiAFEQrFpKc/s1600/p-stabilization.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://3.bp.blogspot.com/-ohx65n-atYo/Vb15Rciy2GI/AAAAAAAAAQ0/jiAFEQrFpKc/s400/p-stabilization.jpeg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-P6YPpiSgHxg/Va8sPmQw_fI/AAAAAAAAAQQ/f2GbaOvXNaA/s1600/p-stabilization.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-CTwz8M1zE0s/VatPMhr1G8I/AAAAAAAAAOk/Vv85iafDm7g/s1600/p-stabilization.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div><br />As you can see, there are diminishing returns - to stabilize more, you need an increasingly larger sample size.<br /><br /><h2>Model Criticisms</h2><br />Basic criticisms about modeling baseball players apply: players are not machines, plate appearances are not independent and identical, etc. 
These criticisms will apply to just about any model of baseball data, including split-half correlations.<br /><br />I have not adjusted the data in any way - I simply took the raw number of on-base events and plate appearances. Estimation could likely be improved by adjusting the data for various effects before running it through the model.<br /><br />One thing should be obvious: this is a parametric estimator, dependent on the model I chose. If the model I chose does not fit well, then the estimate will be bad. I stuck to OBP for this discussion because it seems to fit the beta-binomial model well. No model is correct, of course, but I do believe the beta-binomial model is close enough to be useful. I simulated data from a beta-binomial model using my estimated parameters and the fixed number of plate appearances, and both visually and with some basic summary statistics the simulated data looked close to the actual data. Not identical - the real data appears to skew slightly right in comparison to the simulated data, and being identical isn't a realistic goal anyway - but close. Other statistics could require the use of a different distribution. <br /><br />As I mentioned, the cutoff of 300 PA or greater was somewhat arbitrary - I will fully admit that it's because I don't have a clearly defined population of players in mind. I know pitchers shouldn't be included, and I know that someone who got 10 PA and then was sent down shouldn't be included in the model, but I'm not sure what the correct cutoff for PA should be to get at this vague idea of "MLB-level" hitters I have. That's a problem with this analysis, but one that is easy to correct with the right information.<br /><br />There's a bias/variance trade-off at play here - if I set the cutoff too low then I'm going to get too many players included that aren't from the population I want included in the sample, but the more players I feed into the model the smaller my variance of estimation is. 
Below is a plot of $\hat{M}$ with 95% confidence bounds for cutoff points from 50 PA to 600 PA.<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-dSfpuZigUas/Vb15Rc6xvII/AAAAAAAAAQw/fuZkRd21YKo/s1600/Cutoff%2Bversus%2BM.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://3.bp.blogspot.com/-dSfpuZigUas/Vb15Rc6xvII/AAAAAAAAAQw/fuZkRd21YKo/s400/Cutoff%2Bversus%2BM.jpeg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-JDo6I3Fqgzw/Va8sPsE5iCI/AAAAAAAAAQU/s_hO5Lzc67g/s1600/Cutoff%2Bversus%2BM.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div><div style="text-align: center;"><br /></div><br />Around 300 PA seems to be the cutoff value that leads to a roughly stable $M$ estimate that doesn't veer off into being erratic from the lack of information, and seems to approximately conform with what I know about how many plate appearances a semi-regular MLB player should get.<br /><br />Lower cutoff points tend to lead to lower stabilization points, as they will include hitters with smaller true OBPs, decreasing both the league average OBP (and also the average amount of variance around OBP values) and the variance among true OBP values - the effect of which is to estimate $M$ smaller.<br /><br />The bigger problem I have is that one of the assumptions is that the number of plate appearances is independent of the observed on-base percentage - that if one player gets 500 PA and another player gets 700 PA, it tells us nothing about either of the players' true OBP values - they just happened to get 500 and 700 PA, respectively. 
Of course, we know this isn't true - hitters get more plate appearances specifically <i>because</i> they get on base more.<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-zvRWxz7VLb0/VawBaGvLwwI/AAAAAAAAAPE/GLTbABENTb8/s1600/PA%2Bversus%2BOBP.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="https://2.bp.blogspot.com/-zvRWxz7VLb0/VawBaGvLwwI/AAAAAAAAAPE/GLTbABENTb8/s400/PA%2Bversus%2BOBP.jpeg" width="400" /></a></div><br />Since the model works by assuming that players with more plate appearances have less variation around their true OBP values, it will make the estimate of the league mean OBP higher - which will affect $M$.<br /><br />Even with all of its problems, I think this estimation method is useful. I don't think I've done anything that other people aren't already doing, but I just wanted to work through it in a way that makes sense to me. Comments are appreciated.<br /><br />rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-7900828783063462102015-07-30T08:59:00.000-05:002020-03-19T10:01:13.641-05:00Shrinkage Estimators for Counting Statistics <i>Edit 19 March 2020: This post has been adapted into a paper <a href="https://doi.org/10.1016/j.jmp.2020.102330">in the Journal of Mathematical Psychology.</a> In the process of writing the paper, a number of </i> <i>mistakes, omissions, or misstatements were found in this post. It is being left up as it was originally written, just in case anybody is interested. For a more correct version, please refer to the journal article. </i><br /><br />Warning: this post is going to be incredibly technical, even by the standards of this blog. If what I normally post is gory math, this is the running of the bulls. 
I'm making it so I can refer back to it when I need to.<br /><br />The goal is to set up the theoretical framework for shrinkage estimation of normalized counting statistics to some common mean. I will fully admit this is a very, very limited framework, but some of the most basic baseball statistics fit into it. In the future I hope I can possibly expand this to include more advanced statistics.<br /><br />I will give (not show) a few purely theoretical results - for proofs, see <i>Natural Exponential Families with Quadratic Variance Functions </i>by Carl Morris in The Annals of Statistics, Vol. 11, No. 2 (1983), 515-529, or the more updated version of that paper. <br /><br /><h2>Theoretical Framework</h2><br />Let's say I have some metric $X_i$ for player, team, or object $i$. In this framework, $X_i$ represents a count or a sum of some kind - the raw number of hits, or the raw number of earned runs, etc. I know that $X_i$ is the result of a random process that is controlled by a probability distribution with parameter $\theta_i$, which is unique to each player, team, or object - in baseball, for example, $\theta_i$ represents the player's true "talent" level with respect to metric $X_i$. <br /><br /><div style="text-align: center;">$X_i \sim p(x_i | \theta_i)$ </div><br />I have to assume that the talent levels $\theta_i$ are exchangeable, though the definition is a bit too much to go into here.<br /><br />I'm going to assume that $p(x_i | \theta_i)$ is a member of the natural exponential family with a quadratic variance function (NEFQVF) - this includes very common distributions such as the normal, binomial, Poisson, gamma, and negative binomial. 
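<br /><br />To make the "quadratic variance function" part of the name concrete: for these families, the variance of a single basic event is at most a quadratic in its mean, $V(\theta) = c_0 + c_1 \theta + c_2 \theta^2$. Here is a minimal Python sketch of that idea for the normal, Bernoulli/binomial, and Poisson cases. The function names are hypothetical and chosen only for illustration; this is not code from my github.

```python
# Illustrative sketch only: hypothetical helper names, not from the original post.
# Evaluates the per-event variance V(theta) = c0 + c1*theta + c2*theta^2
# for three members of the NEFQVF family.


def quadratic_variance(theta, c0=0.0, c1=0.0, c2=0.0):
    """Return V(theta) = c0 + c1*theta + c2*theta**2."""
    return c0 + c1 * theta + c2 * theta ** 2


def normal_variance(theta, sigma2):
    # Normal: variance is the constant sigma^2, whatever the mean is.
    return quadratic_variance(theta, c0=sigma2)


def bernoulli_variance(theta):
    # Bernoulli (one trial of a binomial): theta * (1 - theta).
    return quadratic_variance(theta, c1=1.0, c2=-1.0)


def poisson_variance(theta):
    # Poisson: variance equals the mean.
    return quadratic_variance(theta, c1=1.0)


if __name__ == "__main__":
    print(normal_variance(10.0, sigma2=4.0))
    print(bernoulli_variance(0.3))
    print(poisson_variance(2.5))
```

The only thing that changes between the normal, binomial, and Poisson distributions here is the choice of coefficients, which is what lets a single shrinkage formula cover all of them.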
<br /><br />Each of these distributions can be written as the convolution (sum) of $n_i$ other independent, identical distributions, each of which is also NEFQVF with mean $\theta_i$ - the normal is the sum of normals, the binomial is the sum of Bernoullis, the Poisson is the sum of Poissons, the negative binomial is the sum of geometrics, etc. I will assume that is the case here - that<br /><br /><div style="text-align: center;">$X_i = \displaystyle \sum_{j = 1}^{n_i} Y_{ij}$</div><br />Translating this to baseball terms, this means that $Y_{ij}$ is the outcome of inning, plate appearance, etc., $j$ for player $i$ ($j$ ranges from 1 to $n_i$). The metric $X_i$ is then the sum of $n_i$ of these outcomes. Each outcome is assumed independent and identically distributed. Once again, $X_i$ is <i>not</i> normalized by dividing by $n_i$.<br /><br />Conditional on having mean $\theta_i$, the expectations of the $Y_{ij}$ are<br /><br /><div style="text-align: center;">$E[Y_{ij} | \theta_i] = \theta_i$ </div> <br />And so conditional on having mean $\theta_i$, the expected value of the $X_i$ is<br /><br /><div style="text-align: center;">$E[X_i | \theta_i] = E\left[\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right] = \displaystyle \sum_{j = 1}^{n_i} E\left[ Y_{ij} \biggr | \theta_i \right] = n_i E[Y_{ij}| \theta_i] = n_i \theta_i$ </div><br />Baseball terms: if a player has, for example, on-base percentage $\theta_i$, then the number of on-base events I expect in $n_i$ plate appearances is $n_i \theta_i$. 
This does not have to be a whole number.<br /><br />Similarly, and again conditional on mean $\theta_i$, the independence assumption allows us to write the variance of the $X_i$ as<br /><br /><div style="text-align: center;">$Var(X_i | \theta_i) = Var\left(\displaystyle \sum_{j = 1}^{n_i} Y_{ij} \biggr | \theta_i \right) = \displaystyle \sum_{j = 1}^{n_i} Var\left( Y_{ij} \biggr | \theta_i \right) = n_i Var(Y_{ij}| \theta_i) = n_i V(\theta_i)$ </div><br />I'm going to repeat that last bit of notation again, because it's important:<br /><br /><div style="text-align: center;"> $Var(Y_{ij}| \theta_i) =V(\theta_i)$</div><br />$V(\theta_i)$ is the variance of the outcome at the most basic level - plate appearance, inning, batter faced, etc. - conditional on having mean $\theta_i$. For NEFQVF distributions, this has a very particular form - the variance can be written as a polynomial function of the mean $\theta_i$ up to degree 2 (this is the "Quadratic Variance Function" part of NEFQVF):<br /><br /><div style="text-align: center;">$Var(Y_{ij} | \theta_i) = V(\theta_i) = c_0 + c_1 \theta_i + c_2 \theta_i^2$</div><br />For example, the normal distribution has $V(\theta_i) = \sigma^2$, so it fits the QVF model with $c_0 = \sigma^2$ and $c_1 = c_2 = 0$. For the Binomial distribution, $V(\theta_i) = \theta_i (1-\theta_i) = \theta_i - \theta_i^2$, so it fits the QVF model with $c_0 = 0, c_1 = 1$, and $c_2 = -1$. The Poisson distribution has $V(\theta_i) = \theta_i$, so it fits the QVF model with $c_0 = c_2 = 0$ and $c_1 = 1$.<br /><br />I'm now going to assume that the talent levels $\theta_i$ themselves follow some distribution $G(\theta_i | \mu, \eta)$. The parameter $\mu$ is the expected value of the $\theta_i$ ($E[\theta_i] = \mu$), and it represents the league average talent level. The parameter $\eta$ controls, but is not necessarily equal to, the variance of $\theta_i$ (how spread out the talent levels are). Both are assumed to be known. 
The two-stage model is then<br /><br /><div style="text-align: center;"> $X_i \sim p(x_i | \theta_i)$</div><div style="text-align: center;"> $\theta_i \sim G(\theta_i | \mu, \eta)$</div><br />The unconditional expectation of the $X_i$ is <br /><br /><div style="text-align: center;">$E[X_i] = E[E[X_i | \theta_i]] = E[n_i \theta_i] = n_i \mu$</div><br />And the unconditional variance of $X_i$ is<br /><br /><div style="text-align: center;">$Var(X_i) = E[Var(X_i | \theta_i)] + Var(E[X_i | \theta_i]) = n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i) $</div><br />In the above formula, the quantity $E[V(\theta_i)]$ is the average variance of the outcome at the most basic level (plate appearance, inning, etc.), averaging over all possible talent levels $\theta_i$. The quantity $Var(\theta_i)$ is the variance of the talent levels themselves - how spread out talent is in the league. <br /><br />To this point I haven't normalized the $X_i$ by dividing each by $n_i$ - let's do that. If I define $\bar{X_i} = X_i/n_i,$ then based on the unconditional formulas above<br /><br /><div style="text-align: center;">$E[\bar{X_i}] = E\left[\dfrac{X_i}{n_i}\right] = \dfrac{1}{n_i} E[X_i] = \dfrac{n_i \mu}{n_i} = \mu$</div><br />And variance<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$Var(\bar{X_i}) = Var\left(\dfrac{X_i}{n_i}\right) = \dfrac{1}{n_i^2} Var(X_i) = \dfrac{n_i E[ V(\theta_i)] + n_i^2 Var(\theta_i)}{n_i^2} = \dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)$ </div><br />As members of the exponential family, members of the NEFQVF family are guaranteed to have a conjugate prior distribution, so I'll assume that $G(\theta_i | \mu, \eta)$ is conjugate to $p(x_i | \theta_i)$. For example, if $X_i$ follows a normal distribution, $G(\theta_i | \mu, \eta)$ is a normal as well. If $X_i$ follows a binomial distribution, then $G(\theta_i | \mu, \eta)$ is a beta distribution. 
If $X_i$ follows a Poisson distribution, then $G(\theta_i | \mu, \eta)$ is a gamma distribution. The priors themselves do not have to be NEFQVF. <br /><br />Since $\eta$ and $\mu$ are assumed known, we can use Bayes' rule with conjugate prior $G(\theta_i | \mu, \eta)$ to calculate the posterior distribution for $\theta_i$<br /><br /><div style="text-align: center;">$\theta_i | x_i, \mu, \eta \sim \dfrac{p(x_i | \theta_i)G(\theta_i | \mu, \eta)}{\int p(x_i | \theta_i)G(\theta_i | \mu, \eta) d\theta_i}$</div><br />NEFQVF families have closed-form posterior densities.<br /><br />I'm then going to take as my estimator the expected value of the posterior, $\hat{\theta_i} = E[\theta_i | x_i]$. Specifically for NEFQVF distributions with conjugate priors, the estimator is then given by<br /><br /><div style="text-align: center;">$\hat{\theta_i} = \mu + (1 - B)(\bar{x_i} - \mu) = (1-B) \bar{x_i} + B \mu$</div><br />Where $B$ is known as the shrinkage coefficient. For NEFQVF distributions, the form of $B$ is<br /><br /><div style="text-align: center;">$B = \dfrac{E[Var(\bar{X_i} | \theta_i)]}{Var(\bar{X_i})} = \dfrac{\dfrac{1}{n_i}E[ V(\theta_i)]}{\dfrac{1}{n_i}E[ V(\theta_i)] + Var(\theta_i)} = \dfrac{E[V(\theta_i)]}{E[V(\theta_i)] + n_i Var(\theta_i)}$ </div><br /><i>Note: The above two formulas, and several of the rules I used to derive them, are guaranteed for NEF distributions and not just NEFQVF distributions; however, the conjugate prior for a NEF may not have a normalizing constant that exists in closed form, and in practical application the distributions that are actually used tend to be NEFQVF. 
For NEFQVF distributions, a few more algebraic results can be shown about the exact form of the shrinkage estimator by writing the conjugate prior in the general form for exponential densities - for more information, see section 5 of Morris (1983), mentioned in the introduction.</i><br /><br />The shrinkage coefficient $B$ for NEFQVF distributions is the ratio of the within-metric variance to the total variance - which is a function of how noisy the data are compared to how spread out the talent levels are. If at a certain $n_i$ the normalized metric tends to be very noisy around its mean but the means tend to be clustered together, shrinkage will be large. If the normalized metric tends to stay close to its mean value but the means tend to be very spread out, shrinkage will be small. And as the number of observations $n_i$ grows bigger, the effect of the noise gets smaller, decreasing the shrinkage amount.<br /> <br />$B$ itself can be thought of as a shrinkage proportion - if $B = 0$ then there is no shrinkage, and the estimator is just the raw observation. This would occur if the average variance around the mean is zero - if there's no noise. If $B = 1$ then complete shrinkage takes place and the estimate of the player's true talent level is just the league average talent level. This occurs if the variance in league talent levels is equal to zero - every player has the exact same talent level.<br /><br />Note that $B$ has no units, since both the top and bottom are variances, so rescaling the data will not change the shrinkage proportion. <br /><br />I'm going to show a few examples, working through gory mathematical details.<br /><br /><span style="color: red;">WARNING: the above results are guaranteed only for NEFQVF distributions - the normal, binomial, negative binomial, Poisson, gamma, and NEF-GHS. Some results also apply to NEF distributions - see Morris (1983) for details. 
If the data model is not one of those distributions, I can't say whether or not the formulas I've given above will be correct.</span><br /><br /><br /><h2>Normal-Normal Example</h2><br />Let's start with one familiar form - the normal model. This model says that $X_i$, the metric for player $i$, is normally distributed, and is constructed as a sum of $Y_{ij}$ random variables, which are also normally distributed with mean $\theta_i$ and known variance $\sigma^2$. The distribution of talent levels also follows a normal distribution with league mean $\mu$ and variance $\tau^2$.<br /><br />This can be written as<br /><br /><div style="text-align: center;">$Y_{ij} \sim N(\theta_i, \sigma^2)$</div><div style="text-align: center;">$X_i \sim N(n_i \theta_i, n_i \sigma^2)$</div><div style="text-align: center;">$\theta_i \sim N(\mu, \tau^2)$ </div><br />The average variance is simple. As stated before, $V(\theta_i) = \sigma^2$ is constant for the normal distribution, no matter what the actual $\theta_i$ is. 
Hence,<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$E[V(\theta_i)] = E[\sigma^2] = \sigma^2$</div><br />The variance of the averages is simple, too - the model assumes it's constant as well.<br /><br /><div style="text-align: center;">$Var(\theta_i) = \tau^2$</div><br />This gives a shrinkage coefficient of<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$B = \dfrac{\sigma^2}{\sigma^2 + n_i \tau^2}$</div><br />Which, if I divide both the top and bottom by $n_i$, might look more familiar as<br /><br /><div style="text-align: center;">$B = \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}$</div><br /> The shrinkage estimator is then<br /><br /><div style="text-align: center;">$\hat{\theta_i} = \mu + \left(1 - \dfrac{\sigma^2/n_i}{\sigma^2/n_i + \tau^2}\right)(\bar{x_i} - \mu)$ </div><br />Alternatively, I can write $B$ as<br /><br /><div style="text-align: center;">$B = \dfrac{\sigma^2/\tau^2}{\sigma^2/\tau^2 + n_i}$</div><br />And then it follows the familiar pattern from other estimators of $B = m/(m + n)$ for some parameter $m$.<br /><br />It may seem like the normal-normal is not of use - how many counting statistics are there that are normally distributed at the level of inning, plate appearance, or batter faced? The very idea that they are <i>counting</i> statistics says that that's impossible.<br /><br />However, the central limit theorem guarantees that sums of independent, identical random variables converge to a normal - hence the distribution of $X_i$ should be unimodal and bell-shaped for large enough $n_i$ (and I'll intentionally leave the discussion of what constitutes "large enough" aside). 
Thus, as long as the distribution of the $\theta_i$ (the distribution of talent levels) is bell-shaped and symmetric, using a normal-normal with the normal as an approximation at the $X_i$ level should work.<br /><br /><h2>Beta-Binomial Example</h2><br />Suppose we're measuring the sum of binary events of some kind - a hit, an on-base event, a strikeout, etc. - in $n_i$ observations - plate appearances, innings pitched, batters faced, etc. Each event can be thought of as a sample from a Bernoulli distribution (these are the $Y_{ij}$) with variance function $V(\theta_i) = \theta_i(1-\theta_i)$. The observed metric $X_i$ is binomial, and it is constructed as the sum of these Bernoulli random variables<br /><br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$Y_{ij} \sim Bernoulli(\theta_i)$</div><div style="text-align: center;">$X_i \sim Binomial (n_i, \theta_i)$<br /><br /><div style="text-align: left;">The conjugate prior distribution for the binomial distribution is the beta.</div></div><div style="text-align: center;">$\theta_i \sim Beta(\mu, M)$</div><br />Fitting with the framework given above, I'm using $\mu = \alpha/(\alpha+\beta)$ and $M = \alpha + \beta$ instead of the traditional $\alpha, \beta$ parametrization, so that $\mu$ represents the league mean and $M$ controls the variation. <br /><br />The average variance is fairly complicated here. We need to find<br /><br /><div style="text-align: center;">$E[V(\theta_i)] = E[\theta_i(1-\theta_i)] = \displaystyle \int_0^1 \dfrac{\theta_i(1-\theta_i) \cdot \theta_i^{\mu M-1}(1-\theta_i)^{(1-\mu) M-1}}{\beta(\mu M, (1-\mu) M)} d\theta_i = \dfrac{\displaystyle \int_0^1 \theta_i^{\mu M}(1-\theta_i)^{(1-\mu) M} d\theta_i}{\beta(\mu M, (1-\mu) M)}$</div><br /> The top part is a $\beta(\mu M + 1, (1-\mu)M + 1)$ function. 
Utilizing the properties of the beta function, we have<br /><br /><div style="text-align: center;">$E[\theta_i(1-\theta_i)] = \dfrac{\beta(\mu M+1, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)} = \dfrac{\beta(\mu M, (1-\mu) M + 1)}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) = $</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$\dfrac{\beta(\mu M, (1-\mu) M )}{\beta(\mu M, (1-\mu) M)}\left(\dfrac{\mu M}{\mu M + (1-\mu) M + 1}\right) \left(\dfrac{(1-\mu) M}{\mu M + (1-\mu) M}\right) = \dfrac{\mu(1-\mu)M^2}{(M+1)M} = \dfrac{\mu(1-\mu) M}{M+1}$ </div><br /> The variance of the $\theta_i$ doesn't require nearly as much calculus, since it can be taken directly as the variance of a beta distribution<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">$Var(\theta_i) = \dfrac{\mu(1-\mu)}{M+1}$</div><br />The shrinkage coefficient $B$ is then<br /><br /><div style="text-align: center;">$B = \dfrac{\dfrac{\mu(1-\mu)M}{(M+1)}}{\dfrac{\mu(1-\mu)M}{(M+1)} +\dfrac{n_i \mu(1-\mu)}{(M+1)}} = \dfrac{M}{M + n_i}$</div><br />since $\mu(1-\mu)/(M+1)$ appears in every term on the top and bottom and so cancels out. Using this model, the shrinkage estimator is given by<br /><br /><div style="text-align: center;">$\hat{\theta_i} = \mu + \left(1 - \dfrac{M}{M + n_i}\right)\left(\bar{x_i} - \mu\right)$ </div><br /><br /><div style="text-align: center;"><h2 style="text-align: left;">Poisson-Gamma Example</h2><div style="text-align: left;"><br /></div><div style="text-align: left;">Now suppose that instead of a binary event, the outcome can be a count - zero, one, two, three, etc. 
Each count can be thought of as a sample from a Poisson distribution with parameter $\theta_i$ (these are the $Y_{ij}$, with $V(\theta_i) = \theta_i$) with $X_i$ as the sum total of counts, which also has a Poisson distribution with parameter $n_i \theta_i$.<br /><br /><div style="text-align: center;">$Y_{ij} \sim Poisson(\theta_i)$</div><div style="text-align: center;">$X_i \sim Poisson(n_i \theta_i)$<br /><br /><div style="text-align: left;">The conjugate prior distribution of $\theta_i$ for a Poisson is a gamma.</div></div><div style="text-align: center;">$\theta_i \sim Gamma(\mu, K)$</div><br />In this parametrization, I'm using $\mu = \alpha/\beta$ and $K = \beta$ as compared to the traditional $\alpha, \beta$ parametrization.<br /><br />The average variance is<br /><br /><div style="text-align: center;">$E[V(\theta_i)] = E[\theta_i] = \mu$</div><br />And the variance of the averages is<br /><br /><div style="text-align: center;">$Var(\theta_i) = \dfrac{\mu}{K}$</div><br />So the shrinkage coefficient $B$ is<br /><br /><div style="text-align: center;">$B = \dfrac{\mu}{\mu + \dfrac{n_i \mu}{K}} = \dfrac{1}{1 + \dfrac{n_i}{K}} = \dfrac{K}{K + n_i}$</div><br />Which gives a shrinkage estimator of<br /><br /><div style="text-align: center;">$\hat{\theta_i} = \mu + \left(1 - \dfrac{K}{K + n_i}\right)(\bar{x_i} - \mu)$ </div></div><div style="text-align: left;"><br /></div><h2 style="text-align: left;">What Statistics Fit Into this Framework?</h2><div style="text-align: left;"><br />Any counting statistic that is constructed as a sum of the same basic events falls under this framework. It's possible to combine multiple basic events into one "super" event, as long as they are considered to be equal. Examples of this include batting average, on-base percentage, earned run average, batting average on balls in play, fielding percentage, stolen base percentage, team win percentage, etc. 
It's possible to weight the sum, as long as you're just adding the same type of event to itself over and over.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Any statistic that is a sum, weighted or unweighted, of different events does <i>not</i> fall into this framework - examples include weighted on-base average, slugging percentage, on-base plus slugging percentage, fielding independent pitching, isolated power, etc. Also, any statistics that are ratios of counts - strikeout to walk ratio, for example - do not fall under this framework. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Statistics like wins above replacement are right out.<br /><br />I want to make clear that this is simply a discussion of what statistics fit nominally into a very specific theoretical framework. A statistic falling under the framework does not imply that the statistic is good, nor does falling outside it imply that the statistic is bad. Furthermore, even if a statistic does not fall under this framework, shrinkage estimation using these formulas may still work as a very good approximation - the best statistics in sabermetrics today are often weighted sums of counting events, and people have been using these shrinkage estimators on them successfully for years, so clearly they must be doing <i>something</i> right. This is simply what I can justify using statistical theory.</div><div style="text-align: left;"><br /></div><h2 style="text-align: left;">Performing the Analysis</h2><div style="text-align: left;"><br />The values of $\eta$ and $\mu$ must be chosen or estimated. If prior data exists - like, for example, historical baseball data - values can be chosen based upon a careful analysis of that information. 
If no prior data exists, one option is to estimate the parameters through either moment-based or marginal likelihood-based estimation, and then plug in those values - this method is known as parametric empirical Bayes. Another option is to place a hyperprior or hyperpriors on $\eta$ and $\mu$ and perform a full hierarchical Bayesian analysis, which will almost certainly involve MCMC. Depending on the form of your prior, your shrunk results will likely be similar to, but not equal to, the shrinkage estimators given here.<br /><br />What if none of the NEFQVF models appear to fit your data? You have a few options, such as nonparametric or hierarchical Bayesian modeling, but any such method is going to be more difficult and more computational. <br /><br /></div></div><h1>Normal-Normal Shrinkage Estimation by Empirical Bayes</h1>Posted 2015-07-24; updated 2016-02-16.<br /><br /><script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> Shrinkage estimation is a very common technique in baseball statistics. So is the normal model. 
It turns out that one way to shrink is to assume that both the data you see and the distribution of means of the data you see are normal - and then estimate the prior distribution of the means by use of the data itself.<br /><br /> The (non-annotated) code I used to generate these results <a href="https://github.com/Probabilaball/Blog-Code/tree/master/Normal-Normal-Empirical-Bayes" target="_blank">may be found on my github</a>.<br /><br /><h2><b>The Normal-Normal Model</b></h2><br />The basic normal-normal model assumes two "stages" of data - the first is the observed data, which we will call $Y_i$ - this is assumed to be normally distributed with mean $\theta_i$ and variance $\sigma^2$.<br /><br /><div style="text-align: center;">$Y_i \sim N(\theta_i, \sigma^2)$</div><div style="text-align: center;">$\theta_i \sim N(\mu, \tau^2)$</div><br />Bayes' rule says that if you assume the "second" level - the distribution of the $\theta_i$ - is also normal, and treat it as a prior, the posterior distribution of $\theta_i$ (that is, the distribution in belief of values of $\theta_i$ after taking into account the data $y_i$ - <a href="http://probabilaball.blogspot.com/2015/05/bayesian-inference.html" target="_blank">see my previous post on Bayesian inference</a>) is also normal with distribution<br /><br /><div style="text-align: center;">$\theta_i | y_i, \mu, \sigma^2, \tau^2 \sim N(B\mu + (1-B)y_i, (1-B)\sigma^2)$</div><br />where<br /><div style="text-align: center;">$B = \left(\dfrac{\sigma^2}{\sigma^2 + \tau^2}\right)$</div><br />The quantity $B$ gives the amount that the mean of the data shrinks towards the mean of the prior. If $\sigma^2$ is large compared to $\tau^2$ (and so the data has a large amount of variance relative to the prior) then the mean of the data gets shrunk towards the prior mean by quite a bit. 
If $\sigma^2$ is small compared to $\tau^2$ (and so the data has a small amount of variance relative to the prior) then the mean of the data doesn't get shrunk much at all - it tends to stay near the observed $y_i$.<br /><br />No matter what the prior is, this Bayesian estimator can be thought of as a shrinkage estimator. It's just a question of what you're shrinking to, and by how much. A Bayesian, if he or she wishes to be "noninformative" (and I'm going to completely ignore all the controversy of naming and choosing priors) might pick something like $\theta_i \sim N(0,1000)$ as a prior, so $B$ is very close to zero and the shrinkage is very small.<br /><br /><h2><b>Empirical Bayes</b></h2><br />What we're going to do, however, is focus on using the data to choose the prior distribution, by assuming the normal-normal model as described above and estimating the parameters of the prior <i>from the data itself </i>- this is known as empirical Bayes. The effect of using the data to choose the prior is that the data is shrunk towards the mean of the data, in an amount determined by the variance(s) of the data.<br /><br />How do we estimate $\mu$ and $\sigma^2$? One nice property of the normal-normal model is that the marginal distribution of $y_i$ is also normal:<br /><br /><div style="text-align: center;">$y_i | \sigma^2, \tau^2, \mu \sim N( \mu, \sigma^2 + \tau^2 )$</div><br />There are three quantities that need to be estimated to perform this - $\mu$, $\sigma^2$, and $\tau^2$. The formula above gives us two - the first, the population mean, can be estimated by the sample mean $\hat{\mu} = \bar{y}$. The variance is a bit trickier. 
If I just take the standard variance estimator of the $y_i$<br /><br /><div style="text-align: center;">$Var(Y) = \dfrac{\sum (y_i - \bar{y})^2}{N-1}$</div><br />then that gives an estimate $\hat{\sigma^2 + \tau^2}$ (the hat is over the entire thing - it's not estimating two individual variances and summing them, it's estimating the sum of two individual variances), assuming $\sigma^2$ is the same for every observation. So we're going to need to get some information from <i>somewhere </i>about what $\sigma^2$ is. If the $y_i$ are sums or averages of observations, we can use that. If not, we have to get creative.<br /><br /><br /><h2><b>Baseball Example</b></h2><br />Let's say $Y_i$ is the distribution of a player's batting average in $n_i$ at-bats (the $Y_i$ are already divided by $n_i$, so they are averaged), and the player has true batting average $\theta_i$. We know that a true 0.300 hitter isn't going to always hit 0.300 - he will hit sometimes above, sometimes below. The model says that the player's observed batting average follows a normal distribution with true batting average $\theta_i$ and variance $\sigma^2/n_i$ (since it is an <i>average</i>) - this is the "first" normal distribution as described above.<br /><br />The second stage is the underlying distribution of the $\theta_i$ - that is, the distribution of <i>all </i>players' true batting averages. It is normal with mean $\mu$ (the true mean league average) and variance $\tau^2$. So using this two-stage model is equivalent to saying that if I selected a random, unknown batter from the pool of all major league baseball players and observed his batting average $y_i$, it will follow a normal distribution with mean $\mu$ (the true league mean batting average) and variance $\sigma^2/n_i + \tau^2$ (the sum of the variation due naturally to a player's luck and the variation in batting averages between all players). 
I understand that this is a bit weird to think about, but imagine trying to describe the distribution of a player's observed batting average when we don't even know his name - you have to figure his true average is somewhere between 0.200 and 0.330, with a mean of around 0.265, and then on top of that add in the average amount of natural variation around his true average (sometimes above, sometimes below) at $n_i$ at-bats. That's what's going on here.<br /><br />In baseball terms, $\sigma^2/n_i$ can be thought of as the "within-player" variance or "luck" and $\tau^2$ can be thought of as the "between-player" variance or "talent." If a batter is very consistent in hitting near his true ability and the distribution of all batting averages is very spread out, then not much shrinkage will occur. Conversely, if a batter has a lot of variation in his observed batting average but there's not much apparent variation in the distribution of all batting averages, then the player will be shrunk heavily towards the league mean.<br /><br />We need three estimates in order to do the procedure - an estimate of each of $\mu$, $\sigma^2/n_i$, and $\tau^2$. The first, $\mu$, is the mean batting average of the league - and since $y_i$ is the batting average for player $i$, the estimate of the league mean batting average is the obvious one - $\hat{\mu} = \bar{y}$.<br /><br />In the case that $n_i$ is the same for all of your observations, taking $Var(y_i)$ gives us $\widehat{\sigma^2/n_i + \tau^2}$. If there are differing $n_i$, the estimation method gets more complex - it's worth its own post at some point to work through it. I'm going to assume that all the players have the same number of at-bats to keep things simple for this example, though it admittedly does make it feel rather artificial.<br /><br />Now we need an estimate of $\sigma^2/n_i$ - the "within-player" variance. But the normal distribution doesn't tell us what $\sigma^2/n_i$ should be. 
It's pretty typical to model a player's hits in $n_i$ at-bats as following a binomial distribution with true batting average $\theta_i$. Then the sample batting average (hits over at-bats) has variance<br /><br /><div style="text-align: center;">$Var(\textrm{Sample Batting Average}) = \dfrac{\theta_i(1-\theta_i)}{n_i}$</div><br />The binomial distribution is just a sum of independent, identical Bernoulli trials (at-bats, in this case) each with probability of getting a hit $\theta_i$. So the central limit theorem says that for large $n_i$, we can approximate the distribution of the sample batting average with a normal!<br /><br /><br /><div style="text-align: center;">$\textrm{Sample Batting Average} \sim N\left(\theta_i, \dfrac{\theta_i(1-\theta_i)}{n_i}\right)$ </div><br /><br />A value for $\theta_i$ is needed. It feels natural to use $y_i$ - the player's batting average - in the estimation. This is wrong, however - remember, we don't know the player's true talent level! We need to shrink by the average variance amount. The average variance amount is estimated by the variance at the league batting average - which is a quantity we also have an estimate for. 
The estimate of the within-player variance is then <br /><br /><div style="text-align: center;">$\dfrac{\hat{\sigma^2}}{n_i} \approx \dfrac{\bar{y}(1-\bar{y})}{n_i}$</div><br /> Then the empirical Bayes estimator is given by<br /><br /><div style="text-align: center;">$\hat{\theta_i} = \hat{B} \bar{y} + (1-\hat{B})y_i$</div><br />where<br /><br /><div style="text-align: center;"> $\hat{B} = \dfrac{\bar{y}(1-\bar{y})/n_i}{\sum (y_i - \bar{y})^2/(N-1)}$ </div><br /><h2><b>Comparison</b></h2><br />Using the famous Morris data set to compare these estimators (I'll call them $\hat{\theta}^{NN}$ for normal-normal) to the shrinkage estimators from the Beta-Binomial (<a href="http://probabilaball.blogspot.com/2015/05/beta-binomial-empirical-bayes.html" target="_blank">see post here</a> - I'll call these $\hat{\theta}^{BB}$) and James-Stein (<a href="http://probabilaball.blogspot.com/2015/05/the-james-stein-estimator.html" target="_blank">see post here</a> - and note that the James-Stein estimator is just a specific version of the normal-normal estimator - I'll call them $\hat{\theta}^{JS}$), we see that it performs well.<br /><br />\begin{array}{l c c c c c} \hline<br />\textrm{Player} & y_i & \hat{\theta}^{NN} & \hat{\theta}^{BB} & \hat{\theta}^{JS} & \theta \\ \hline<br />Clemente &0.400 &0.280 &0.280 &0.290 &0.346 \\<br />F. Robinson &0.378 &0.277 &0.278 &0.286 &0.298 \\<br />F. 
Howard &0.356 &0.275 &0.275 &0.282 &0.276 \\ <br />Johnstone &0.333 &0.273 &0.273 &0.277 &0.222\\<br />Barry &0.311 &0.270 &0.270 &0.273 &0.273\\<br />Spencer &0.311 &0.270 &0.270 &0.273 &0.270\\<br />Kessinger &0.289 &0.268 &0.268 &0.268 &0.263\\<br />L.Alvarado &0.267 &0.266 &0.266 &0.264 &0.210\\<br />Santo &0.244 &0.263 &0.263 &0.259 &0.269\\<br />Swoboda &0.244 &0.263 &0.263 &0.259 &0.230\\<br />Unser &0.222 &0.261 &0.261 &0.254 &0.264\\<br />Williams &0.222 &0.261 &0.261 &0.254 &0.256\\<br />Scott &0.222 &0.261 &0.261 &0.254 &0.303\\<br />Petrocelli &0.222 &0.261 &0.261 &0.254 &0.264\\<br />E. Rodriguez &0.222 &0.261 &0.261 &0.254 &0.226\\<br />Campaneris &0.200 &0.258 &0.258 &0.249 &0.285\\<br />Munson &0.178 &0.256 &0.256 &0.244 &0.316\\<br />Alvis &0.156 &0.254 &0.253 &0.239 &0.200\\ \hline<br />\end{array}<br /><br />For this data set, the normal-normal and beta-binomial estimates are almost identical. This shouldn't be a surprise - both the distribution of batting average talent and the variation around batting averages are roughly bell-shaped and symmetric, so the normal-normal and beta-binomial models are both flexible enough to take that shape.<br /><br />The normal-normal and beta-binomial estimators shrink the most, while the James-Stein estimator shrinks a moderate amount. For this specific data set, the James-Stein estimator seems to hold a slight advantage - not by much, though. <br /><br /><ul><li>$\sum (\hat{\theta}^{NN}_i - \theta_i)^2 = 0.0218$</li><li>$\sum (\hat{\theta}^{BB}_i - \theta_i)^2 = 0.0218$ </li><li>$\sum (\hat{\theta}^{JS}_i - \theta_i)^2 = 0.0215$</li><li>$\sum (y_i - \theta_i)^2 = 0.0753$. 
</li></ul><br />Whichever method you use to shrink, the resulting estimates, when judged using the squared error loss function, are far superior to the raw batting averages.<br /><br /><h2><b>Advantages/Disadvantages</b></h2><br />This estimator relies on the assumption of normality for both the data and the underlying distribution of means - this means it will work well for batting statistics (which tend to be constructed as sums, weighted or otherwise, of assumed independent, identically distributed events) but not as well for other statistics which don't naturally look "normal." Furthermore, if I have to estimate $\sigma^2$ with a binomial variance - why don't I just use a beta-binomial model? That doesn't depend on a large number of at-bats for normality of the distribution of the sample batting average. Overall, I think it will give nice results when used appropriately, but in many situations a different model will fit more naturally to the data.rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0tag:blogger.com,1999:blog-4128498738742055603.post-52867858018825006522015-07-16T11:37:00.000-05:002015-07-21T21:32:10.681-05:00Bayes' Rule<script type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]} }); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>(This post acts as sort of a prequel to <a href="http://probabilaball.blogspot.com/2015/05/bayesian-inference.html" target="_blank">my post on Bayesian inference</a>) <br /><br />I've talked enough about it, so I figured I would make an actual post on a probability rule that I've been using quite a bit - Bayes' rule.<br /><br /><h2><b>Purpose</b></h2><br />I want to jump straight into a probability example first. 
Much like the infield fly rule or the offside rule in soccer, I think Bayes' rule makes much more sense when you understand what it's trying to do before learning the technical details.<br /><br /><b> </b><br />Suppose a manager has exactly two pinch hitters available to him - let's call them Adam and José. Adam has a $0.350$ OBP and José has a $0.300$ OBP. The manager calls Adam 70% of the time and calls José 30% of the time.<br /><br />So without knowing who the manager will call, how do we calculate what the probability is that the pinch hitter will get on-base? Like so:<br /><br /><div style="text-align: center;">$P(\textrm{ On-Base }) = P(\textrm{ On-Base } | \textrm{ Adam })P(\textrm{ Adam })+P(\textrm{ On-Base } | \textrm{ José })P(\textrm{ José })$ </div><br />The notation $P(\textrm{ On-Base }|\textrm{ Adam })$ means the probability of getting on-base "given" that Adam was chosen to pinch hit - that is, if we knew that the manager selected Adam, there would be a $0.350$ probability of getting on-base. As stated above, $P(\textrm{ Adam })$ - the probability the manager selects player Adam - is $0.7$. Similarly, $P(\textrm{ On-Base }|\textrm{ José }) = 0.300$ and $P(\textrm{ José }) =0.3$. Plugging numbers into the formula above,<br /><br /><div style="text-align: center;">$P(\textrm{ On-Base }) = (0.350)(0.7) + (0.300)(0.3) = 0.335$ </div><div style="text-align: center;"><br /></div><div style="text-align: left;">The OBP is $0.350$ with probability $0.7$, and $0.300$ with probability $0.3$, so overall there is a $0.335$ probability that the pinch hitter will get on-base.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Now, let's flip what you know around - suppose you know that the pinch hitter got on base, but <i>not</i> which pinch hitter it was. Which player do you think was picked, Adam or José? Logically it's more likely to be Adam - but how much more likely? 
Can you give probabilities?</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><h2><b>Bayes' Rule</b></h2></div><div style="text-align: left;"><br /></div><div style="text-align: left;">This is the basic idea of Bayes' rule - it flips conditional probabilities around. Instead of $P(\textrm{ On-Base } | \textrm{ Adam })$, it allows you to find $P(\textrm{ Adam } | \textrm{ On-Base })$. For two events $A$ and $B$, the basic formulation is<br /><br /><div style="text-align: center;">$P(B | A) = \dfrac{P(A|B)P(B)}{P(A)}$</div><br />$P(A)$ on the bottom can be calculated as<br /><br /><div style="text-align: center;">$P(A) = P(A | B)P(B) + P(A | \textrm{ Not B }) P( \textrm{ Not B })$ </div><br />Note that this is why above I specified that there were exactly two pinch hitters, so saying that "Not Adam" is the same thing as saying "José". If there are more than two pinch hitters available, the formula above can be expanded, with one term in the sum for each pinch hitter.<br /><br />Applying this to the pinch hitter example, we have<br /><br /><div style="text-align: center;">$P(\textrm{ Adam }| \textrm{ On-Base }) = \dfrac{P(\textrm{ On-Base }|\textrm{ Adam })P(\textrm{ Adam })}{P(\textrm{ On-Base })}$</div><div style="text-align: center;"><br /></div><div style="text-align: center;">$ =\dfrac{P(\textrm{ On-Base }|\textrm{ Adam })P(\textrm{ Adam })}{ P(\textrm{ On-Base } | \textrm{ Adam })P(\textrm{ Adam })+P(\textrm{ On-Base } | \textrm{ José })P(\textrm{ José })} $</div><br />Plugging in numbers, we get<br /><br /><div style="text-align: center;">$P(\textrm{ Adam }| \textrm{ On-Base }) = \dfrac{(0.350)(0.7)}{0.335} \approx 0.731$</div><br />And similarly,<br /><br /><div style="text-align: center;">$P(\textrm{ José }| \textrm{ On-Base }) = \dfrac{(0.300)(0.3)}{0.335} \approx 0.269$</div><br />So, given that the pinch hitter got on base, there was approximately a 73.1% chance it was Adam and approximately a 26.9% chance it was José.<br /><br /><h2><b>One More Example</b></h2><br
/>Adam tests positive for PED use. Let's suppose 10% of all MLB players are using PEDs. The particular test Adam took has 95% sensitivity and specificity - that is, if the player is using PEDs the test will correctly say so 95% of the time (sensitivity), and if the player is <i>not</i> using PEDs it will correctly say so 95% of the time (specificity). Given a positive test, what is the probability that Adam is actually using PEDs? It's not 95%! We have to use Bayes' rule to figure it out.<br /><br />I'm going to use "+" to indicate a positive test (indicating the test says that the player is using drugs) and a "-" to indicate a negative test.<br /><br /><div style="text-align: center;">$P(\textrm{ PEDs } | +) = \dfrac{P( + | \textrm{ PEDs })P(\textrm{ PEDs })}{P( + | \textrm{ PEDs })P(\textrm{ PEDs })+P( + | \textrm{ Not PEDs })P(\textrm{ Not PEDs }) }$</div><br />Let's figure these out one by one. As stated in the problem description, if a person is using PEDs, the test will identify so 95% of the time. Hence, $P( + | \textrm{ PEDs }) = 0.95$. Furthermore, 10% of all players are using PEDs, so $P(\textrm{ PEDs }) = 0.1$ and $P(\textrm{ Not PEDs }) = 0.9$. Lastly, since $P( - | \textrm{ Not PEDs }) = 0.95$ as stated in the problem description, it must be that $P( + | \textrm{ Not PEDs }) = 0.05$.<br /><br />Plugging all these numbers in, we get<br /><br /><div style="text-align: center;"> $P(\textrm{ PEDs } | +) = \dfrac{(0.95)(0.1)}{(0.95)(0.1) + (0.05)(0.9)} \approx 0.679$</div><br />So given that Adam tests positive for PEDs, there's actually only about a 2/3 chance that he's using. It seems counter-intuitive given that the tests were pretty good - 95% sensitivity and specificity - but since most players aren't using (90%), there are bound to be a <i>lot</i> of false positives, so Adam would have a very good argument for being vindicated if he were suspended over this particular test alone.<br /><br />Put another way - suppose you have 200 MLB players. 
180 (90% of the total) are clean, and 20 (10% of the total) are using PEDs. Of the 180 that are clean, 171 (95% of the clean) test negative and 9 (5% of the clean) test positive. Of those that are using, 19 (95% of the PED users) test positive and 1 (5% of the PED users) tests negative.<br /><br />This gives 19 PED users testing positive and 9 clean players testing positive, so the probability of being a PED user given testing positive is 19/(9+19) ≈ 0.679.<br /><br />(Note that I made all these numbers up. I'm sure that the tests MLB actually uses have higher specificity and sensitivity than 95%, and I have no idea what proportion of all MLB players are using PEDs.)<br /><br /><h2><b>From Bayes' Rule to Inference</b></h2><br />So how do we go from the rule to inference? Given some sort of model with parameter $\theta$, we can calculate $p(x | \theta)$ - the probability of seeing the data that you saw given a particular value of the parameter. You may recognize this as the likelihood from earlier posts. Bayesians use Bayes' rule to flip around what's inside the probability statement and calculate $p(\theta | x)$ - the probability of a particular value of the parameter given the data that you saw - by<br /><br /><div style="text-align: center;">$p(\theta | x) = \dfrac{p(x | \theta)p(\theta)}{\int p(x | \theta)p(\theta) d\theta}$</div><br />where $p(\theta)$ is the <i>prior</i> distribution chosen by the Bayesian and $p(\theta | x)$ is the <i>posterior</i> distribution that is calculated. 
Inference about $\theta$ is then performed using the posterior.<br /><br />That's Bayesian inference in a nutshell - start with a model, calculate the probability of seeing the data $x$ given a parameter $\theta$, and then use Bayes' rule to flip that around to the probability of the parameter $\theta$ given that you saw the data $x$.<br /><br />(Oh, and then do checking to make sure your model fits - but that's another post) </div>rcfosterhttp://www.blogger.com/profile/09317049446493200529noreply@blogger.com0
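<br /><br />As a rough sketch (this is not code from the post or from my github), the two Bayes' rule calculations above - the pinch hitter and the PED test - can be reproduced with a few lines of Python. The function name `posterior` and the dictionary layout are assumptions made purely for illustration; the only inputs are the prior probabilities and the conditional probabilities given in the examples.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over a finite set of mutually exclusive hypotheses.

    priors:      dict mapping hypothesis -> P(hypothesis)
    likelihoods: dict mapping hypothesis -> P(evidence | hypothesis)
    Returns a dict mapping hypothesis -> P(hypothesis | evidence).
    """
    # Denominator: P(evidence), by the law of total probability.
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    # Numerator for each hypothesis: P(evidence | h) * P(h).
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Pinch hitter example: the evidence is "the pinch hitter got on base".
pinch = posterior(
    priors={"Adam": 0.7, "José": 0.3},
    likelihoods={"Adam": 0.350, "José": 0.300},
)

# PED example: the evidence is "the test came back positive".
ped = posterior(
    priors={"PEDs": 0.1, "Not PEDs": 0.9},
    likelihoods={"PEDs": 0.95, "Not PEDs": 0.05},
)

for name, prob in {**pinch, **ped}.items():
    print(f"{name}: {prob:.3f}")
```

Because the denominator is just a sum over the hypotheses, the same function handles more than two pinch hitters without any changes. It gives roughly 0.73 for Adam given an on-base event and roughly 0.68 for PED use given a positive test, matching the hand calculations.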