All of the code and data I used can be found on my GitHub.

For each statistic, I used a binomial distribution for the result of the at-bat or plate appearance and a beta distribution for the distribution of talent levels. I think a different model might work better for some of these statistics (and I'll have to work through the Poisson-gamma model anyway when I look at pitching statistics), but this is a decent approximation.
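As a rough illustration of that two-level model, here is a minimal simulation sketch using only the Python standard library. The function name `simulate_hitters`, the league mean `mu = 0.33`, and the parameterization of the beta distribution as `Beta(mu*M, (1-mu)*M)` are my assumptions for the example, not taken from the code on GitHub.

```python
import random

def simulate_hitters(mu, M, n_pa, n_players, seed=0):
    """Simulate a beta-binomial population (illustrative sketch).

    Each hitter's talent is drawn from Beta(mu*M, (1-mu)*M); that
    parameterization is an assumption made for this example. Observed
    successes are then binomial given the talent level.
    """
    rng = random.Random(seed)
    alpha, beta = mu * M, (1 - mu) * M
    results = []
    for _ in range(n_players):
        talent = rng.betavariate(alpha, beta)
        # A binomial draw, written as a sum of Bernoulli trials
        successes = sum(rng.random() < talent for _ in range(n_pa))
        results.append((talent, successes))
    return results

# Hypothetical inputs: mu and the sample sizes are made up
hitters = simulate_hitters(mu=0.33, M=296, n_pa=500, n_players=5)
for talent, successes in hitters:
    print(f"talent={talent:.3f}  observed={successes / 500:.3f}")
```

The observed rates scatter around each hitter's talent level, and that extra binomial noise is why a histogram of observed data is wider than the talent distribution.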

As I stated in a previous post, statistics that can *not* be constructed as a binomial event (such as wOBA) do not fall under the framework I am using, so I have not included them in the estimation. I could treat them as binomial events, fit a model, and run the estimation procedure, but I would have no way of knowing whether the results are correct.

All estimated stabilization points were calculated using unadjusted data from fangraphs.com for hitters from 2009-2014 with at least 300 PA, excluding pitchers.

\begin{array}{| l | l | c | c | c |} \hline
\textrm{Statistic} & \textrm{Formula} & \hat{M} & SE(\hat{M}) & \textrm{95\% CI} \\ \hline
\textrm{OBP} & \textrm{(H+BB+HBP)/PA} & 295.79 & 16.41 & (263.63, 327.95) \\
\textrm{BA} & \textrm{H/AB} & 465.92 & 34.23 & (398.83, 533.02) \\
\textrm{SO Rate} & \textrm{SO/PA} & 49.73 & 1.92 & (45.96, 53.50) \\
\textrm{BB Rate} & \textrm{(BB-IBB)/(PA-IBB)} & 110.91 & 4.84 & (101.44, 120.38) \\
\textrm{1B Rate} & \textrm{1B/PA} & 222.16 & 11.32 & (199.98, 244.34) \\
\textrm{2B Rate} & \textrm{2B/PA} & 1025.31 & 108.00 & (813.64, 1236.98) \\
\textrm{3B Rate} & \textrm{3B/PA} & 372.50 & 26.56 & (320.44, 424.56) \\
\textrm{XBH Rate} & \textrm{(2B+3B)/PA} & 1006.30 & 105.23 & (800.04, 1212.57) \\
\textrm{HR Rate} & \textrm{HR/PA} & 124.52 & 5.90 & (112.95, 136.09) \\
\textrm{HBP Rate} & \textrm{HBP/PA} & 297.41 & 18.26 & (261.61, 333.20) \\ \hline
\end{array}

I've also created histograms of each observed statistic with an overlay of the estimated distribution of true talent levels. They can be found in this imgur gallery. Remember that the dashed line represents the distribution of *talent levels*, not of observed data, so it's not necessarily bad if it is shaped differently than the observed data.

$\hat{M}$ is the estimated variance parameter of the underlying talent distribution. Under the model, it is equal to the number of plate appearances at which there is 50% shrinkage.
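To make "50% shrinkage" concrete, here is a small sketch of a shrinkage estimate. It assumes the usual beta-binomial form, in which the observed rate gets weight $n/(n+M)$ and the population mean gets weight $M/(n+M)$. The function name `shrunken_estimate` and the league mean of 0.330 are hypothetical, chosen only for illustration.

```python
def shrunken_estimate(successes, trials, league_mean, M):
    """Shrink an observed rate toward the league mean (illustrative sketch).

    Assumes the observed rate gets weight n / (n + M), so at n == M the
    estimate sits exactly halfway between the observed rate and the mean.
    """
    weight = trials / (trials + M)
    observed = successes / trials
    return weight * observed + (1 - weight) * league_mean

M_OBP = 295.79  # estimated M for OBP, from the table
# Hypothetical hitter: 120 times on base in 300 PA, league mean 0.330
print(f"{shrunken_estimate(120, 300, 0.330, M_OBP):.3f}")
```

With 300 PA against $\hat{M} \approx 296$, the estimate lands roughly midway between the hitter's observed .400 and the assumed league mean.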

$SE(\hat{M})$ is the standard error of the estimate $\hat{M}$. It is on the same scale as the denominator of the formula: PA for every statistic except batting average (AB) and walk rate (PA-IBB).

The 95% CI is calculated as

$\hat{M} \pm 1.96 SE(\hat{M})$

It represents a 95% confidence interval for the number of plate appearances at which there is 50% shrinkage.
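As a quick check of the interval formula, this sketch recomputes the OBP row of the table from its $\hat{M}$ and $SE(\hat{M})$. The function name `confidence_interval` is mine.

```python
def confidence_interval(m_hat, se, z=1.96):
    """Normal-approximation interval: m_hat +/- z * se."""
    half_width = z * se
    return (m_hat - half_width, m_hat + half_width)

# OBP row from the table: M-hat = 295.79, SE = 16.41
lo, hi = confidence_interval(295.79, 16.41)
print(f"({lo:.2f}, {hi:.2f})")  # (263.63, 327.95)
```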

For an arbitrary stabilization level $p$, the number of required plate appearances can be estimated as

$\hat{n} = \left(\dfrac{p}{1-p}\right) \hat{M}$
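This follows in one step if the stabilization level after $n$ plate appearances is taken to be $p = n/(n+M)$, which is my reading of the model and is consistent with the 50% point falling at $n = M$:

$p = \dfrac{n}{n + M} \;\Rightarrow\; p\,M = n\,(1-p) \;\Rightarrow\; n = \left(\dfrac{p}{1-p}\right) M$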

And a 95% confidence interval for the required number of plate appearances is given as

$\left(\dfrac{p}{1-p}\right) \hat{M} \pm 1.96 \left(\dfrac{p}{1-p}\right) SE(\hat{M})$
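Both formulas can be wrapped in one small helper. This is only a sketch: the function name `required_pa` is hypothetical, and the 0.7 stabilization level in the usage line is an arbitrary example.

```python
def required_pa(p, m_hat, se, z=1.96):
    """Estimated sample size for stabilization level p, with a 95% CI.

    Scales both M-hat and its standard error by p / (1 - p).
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    factor = p / (1 - p)
    n_hat = factor * m_hat
    half_width = z * factor * se
    return n_hat, (n_hat - half_width, n_hat + half_width)

# OBP row from the table, at an arbitrary stabilization level of 0.7
n_hat, (lo, hi) = required_pa(0.7, 295.79, 16.41)
print(f"n = {n_hat:.0f}, 95% CI = ({lo:.0f}, {hi:.0f})")
```

At $p = 0.5$ the factor is 1, so the helper reproduces the table row exactly.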

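Without confidence bounds, a plot of the sample size required for various stabilization levels is

[Figure: required sample size versus stabilization level]

And a plot of the stabilization level at various sample sizes is given as

[Figure: stabilization level versus sample size]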
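The stabilization level at a given sample size can also be tabulated directly. This sketch assumes the level after $n$ plate appearances is $n/(n+\hat{M})$, the inverse of the formula for $\hat{n}$ above. The function name and the sample sizes are arbitrary choices of mine, and the $\hat{M}$ values come from the table.

```python
def stabilization_level(n, m_hat):
    """Stabilization level after n trials, assuming p = n / (n + M)."""
    return n / (n + m_hat)

# M-hat values for three statistics, taken from the table
M_HAT = {"SO Rate": 49.73, "HR Rate": 124.52, "OBP": 295.79}

for n in (50, 100, 300, 600):
    row = ", ".join(
        f"{name}: {stabilization_level(n, m):.2f}" for name, m in M_HAT.items()
    )
    print(f"n={n:>3}  {row}")
```

Statistics with a small $\hat{M}$, such as strikeout rate, climb toward 1 much faster than those with a large $\hat{M}$.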

Comments are appreciated. Also, I'm currently in the process of learning ggplot, so hopefully my graphics won't be as awful in the near future.
