29 May, 2015

Bayesian Inference

A new batter is called up to the big leagues. Let's define $\theta$ as his "true" batting average (which we are assuming exists and is constant). In 50 at-bats, the batter gets a hit 15 times. What is our estimate of $\theta$? The answer, surprisingly, depends on your definition of probability.

Frequentist


The methodology taught today in most statistics classes is called frequentist statistics. To a frequentist, the probability of an event $P(A)$ is what you would get if you were to observe a random experiment (for example, an at-bat) an infinite number of times and calculate the proportion of times the event (for example, getting a hit) happened. Let's define $\hat{\theta}$ as our estimate of $\theta$ (in statistics, this is called "theta hat"). Without going into the exact method of estimation, this definition of probability leads to an estimate of  the batter's true ability as

$\hat{\theta} = \dfrac{\textrm{# of Hits}}{\textrm{# of At-bats}} = \dfrac{15}{50} = 0.300$

This is intuitive - I think that most people, with no statistical theory whatsoever, would come up with this estimate on their own. The most common frequentist estimators (method of moments and maximum likelihood) agree. So what's the problem?

Think about something like "The probability that Mike Trout will get injured this season." How does this fit into frequentist inference? There's no way to go back and recreate the entire season. You would have to imagine either a way to rewind time, or lean on some sort of many-worlds theory, in order to justify the frequentist interpretation of that probability.

Bayesian


A Bayesian (named after the work of the good Reverend Thomas Bayes) defines probability as a numerical representation of personal belief in the chance of an event happening. To a Bayesian, data does not define the probability - it updates the probability. The observer starts with some sort of prior knowledge, and the data is used to update it into posterior knowledge (that is, knowledge after the data is observed).

What the prior exactly is and how to represent the prior knowledge is a subject of endless debate, which I will not go into here - and most Bayesians go to great effort to properly determine the prior knowledge before performing inference. Suffice it to say, all approaches fundamentally agree that inference should be based on posterior knowledge, not on the long-term results of any process, real or theoretical.

Let's go back to the baseball player - an observer that is completely uninformed may well observe the 15 hits in 50 at-bats and conclude that 0.300 is a perfectly good value for $\hat{\theta}$, the estimate of the player's "true" batting average. A scout sitting elsewhere in the stadium may recognize that the player has excellent mechanics and calculate a value of above 0.300 for $\hat{\theta}$. A different scout may know that the player did not perform this well in the minor leagues and end up with a value lower than 0.300 for $\hat{\theta}$. To a Bayesian, these are all mathematically valid estimates that reflect updated views of the prior information that each observer had.

(I'd like to note here that I'm being somewhat "subjectivist" in this example to emphasize how different priors can lead to different posteriors - but there are those who would argue that none of the above observers is correct, and that there exists a correct prior that can and should be constructed based off of the large amount of historical knowledge about baseball.)

Let's use the batting average example in a more formal mathematical setting.

Beta-Binomial Example


The beta-binomial model assumes that the number of hits $k$ in $n$ at-bats with true success probability $\theta$ has a probability mass function given by

$p(k | \theta, n) = {n \choose k} \theta^k (1-\theta)^{n-k}$

For the example batter, $k = 15$ and $n = 50$.

For the distribution of prior belief, a beta distribution is used. This is a continuous distribution on the range $[0,1]$ defined by two parameters $\alpha$ and $\beta$

$p(\theta | \alpha, \beta) = \dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta) }$

The values chosen for $\alpha$ and $\beta$ represent the prior knowledge ($B(\alpha, \beta)$ is known as the beta function - its result is a normalizing constant that makes the entire distribution integrate to 1). The expected value of the beta distribution is given by $E[\theta] = \alpha/(\alpha + \beta)$.

Let's say that two observers (call them A and B) observe the baseball player get the 15 hits in 50 at-bats. The first observer knows nothing about baseball or the player, so this observer's distribution of prior belief could be represented by a distribution which says anything between 0 and 1 is equally likely - this is given by a beta distribution with $\alpha = 1$ and $\beta = 1$ (this is actually a more informative prior distribution than you might think, but that's another discussion). Observer B is a baseball fan but doesn't know much about the player himself, so observer B picks $\alpha = 53$ and $\beta = 147$ - this puts most of the probability between 0.200 and 0.300, with an average of 0.265.
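As a quick check on observer B's prior, here is a minimal R sketch (separate from the code linked at the end of this post) that verifies the claims about the Beta(53, 147) distribution:

    # Observer B's prior: Beta(alpha = 53, beta = 147)
    alpha_B <- 53
    beta_B  <- 147

    # Prior mean alpha / (alpha + beta)
    alpha_B / (alpha_B + beta_B)                                    # 0.265

    # Share of prior probability between 0.200 and 0.300
    pbeta(0.300, alpha_B, beta_B) - pbeta(0.200, alpha_B, beta_B)   # most of the mass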



(As a side note, if you want to think of the information in these priors in baseball terms - in the mind of observer A, before even playing a single game, the player starts off 1 for 2. In the mind of observer B, before even playing a single game, the player starts off 53 for 200.)

Without going into the mathematical details of Bayes' theorem, the posterior distribution that represents belief in values of $\theta$ is given by

$p(\theta | n, k, \alpha, \beta) =  \dfrac{\theta^{\alpha + k-1}(1-\theta)^{\beta + n - k-1}}{B(\alpha + k, \beta + n - k)}$

That is to say, it's a beta distribution with new parameters $\alpha' = \alpha + k$ and $\beta' = \beta + n - k$.

So for example, $\alpha'_A = 1 + 15 = 16$ and $\beta'_A = 1 + 50 - 15 = 36$, while $\alpha'_B = 53 + 15 = 68$ and $\beta'_B = 147 + 50 - 15 = 182$. The posterior distributions for observers A and B are therefore beta distributions with these updated parameters.

The two estimators, which I will choose as $\alpha'/(\alpha' + \beta')$ - the expected values of the posterior distributions - are then given by $\hat{\theta}_A \approx 0.308$ and $\hat{\theta}_B \approx 0.272$. These shouldn't be surprising - observer A didn't have much information to work with, so their estimate is close to 0.300 (but higher! - remember, the average of their prior belief was 0.5). Observer B used much more information in their estimate, so their posterior estimate is much closer to their prior estimate. Realistically, 50 at-bats isn't much to work with in determining the "true" batting average.

(Again as a side note, in baseball terms, the player is now 16 for 52 in the mind of observer A and 68 for 250 in the mind of observer B)
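The posterior update itself is short enough to sketch in R (a sketch, not the code linked at the end of this post):

    k <- 15; n <- 50                               # hits and at-bats observed

    # Prior parameters for observers A and B
    priors <- rbind(A = c(alpha = 1,  beta = 1),
                    B = c(alpha = 53, beta = 147))

    # Posterior parameters: alpha' = alpha + k, beta' = beta + n - k
    post <- cbind(alpha = priors[, "alpha"] + k,
                  beta  = priors[, "beta"]  + n - k)

    # Posterior means alpha' / (alpha' + beta'): 0.308 for A, 0.272 for B
    post[, "alpha"] / (post[, "alpha"] + post[, "beta"])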

Choice


Note the posterior parameters $\alpha' = \alpha + k$ and $\beta' = \beta + n - k$ - no matter what you choose for $\alpha$ and $\beta$, their effect goes away as the number of at-bats gets large (and so $n$ and $k$ get large). No matter what the prior beliefs are, given enough data, the specific Bayesian approach chosen here should eventually end up with essentially the same estimate.

I want to end by saying that there's somewhat of a false belief that Bayesian and frequentist ideologies are somehow "opposing" each other. Though this may have been true 50 years ago, it is no longer the case - most statisticians are trained with, and are comfortable using, either. In fact, there are ways to have Bayesian methods approximate frequentist methods, and to have frequentist methods approximate Bayesian methods! The most important thing is to make sure that whatever approach you choose is appropriate for the problem you are trying to solve.

The code for all the graphs and calculations in this article may be found on my GitHub.

25 May, 2015

A Quick Check for Independence between Runs Scored and Runs Allowed


Is there any relationship between the number of runs scored in an inning and the number of runs allowed in an inning? Intuitively, it seems like there shouldn't be - but perhaps choosing to increase the rate of runs scored per inning means sacrificing defense, and therefore increasing the rate of runs allowed per inning.

Below is a table of the total runs scored and allowed in innings for all major league baseball teams in 2014.


        RA
RS       0     1     2     3     4     5     6     7     8     9
   0 23628  4446  2042   935   408   143    66    32    12     5
   1  4446   862   421   188    69    39    11     3     2     2
   2  2042   421   178    94    25    17     7     2     0     0
   3   935   188    94    30    11     5     2     2     0     0
   4   408    69    25    11     0     1     2     0     1     0
   5   143    39    17     5     1     2     0     0     0     0
   6    66    11     7     2     2     0     0     0     0     0
   7    32     3     2     2     0     0     0     0     0     0
   8    12     2     0     0     1     0     0     0     0     0
   9     5     2     0     0     0     0     0     0     0     0



This table does not include all innings, as teams that did not finish the ninth inning will not have a corresponding runs scored/allowed. Also note that this table does double count innings, as one team's RS is another team's RA, and vice versa - I will look at specific teams in a moment to address this.

So for example, in 2014, there were 23,628 innings in which a team both gave up 0 runs and scored 0 runs. How can we use this to test for a relationship between runs scored and runs allowed? A chi-squared test could work, but many of the cells have 0s in them - not ideal. Fisher's exact test requires that row and column totals be fixed - clearly not the case here - and it is more commonly used with smaller samples and smaller contingency tables (though it is still valid for larger sample sizes).

The solution I chose was to combine 5, 6, 7, 8, or 9 runs (scored or allowed) into a single category that I called "5+". The contingency table is then

                            RA
                RS      0     1     2     3     4    5+
                  0 23628  4446  2042   935   408   258
                  1  4446   862   421   188    69    57
                  2  2042   421   178    94    25    26
                  3   935   188    94    30    11     9
                  4   408    69    25    11     0     4
                 5+   258    57    26     9     4     2 


This is still a bit awkward - a few cells have counts less than 10* - but overall much, much better than before. Performing a chi-square test on this contingency table (I chose to simulate the p-value in R) yields a p-value of approximately 0.178 - not significant at any reasonable $\alpha$.
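For reference, here is a minimal R version of that test, entering the collapsed table above by hand (the simulated p-value will vary slightly from run to run):

    # Pooled RS x RA table with the 5+ categories collapsed
    tab <- matrix(c(23628, 4446, 2042, 935, 408, 258,
                     4446,  862,  421, 188,  69,  57,
                     2042,  421,  178,  94,  25,  26,
                      935,  188,   94,  30,  11,   9,
                      408,   69,   25,  11,   0,   4,
                      258,   57,   26,   9,   4,   2),
                  nrow = 6, byrow = TRUE,
                  dimnames = list(RS = c(0:4, "5+"), RA = c(0:4, "5+")))

    # Chi-squared test of independence with a simulated p-value
    chisq.test(tab, simulate.p.value = TRUE, B = 100000)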

Individual Teams


As mentioned before, the table used was cheating - I combined all data for all major league teams, and so every result was double counted in some way. One way to address this is to look at each team individually, which has the added bonus of addressing the potential issue that dependence was washed out by the pooling. Below is a table of runs scored and runs allowed by the Atlanta Braves in the 2014 season.

                              RA
                RS    0   1   2   3   4   5   6
                  0 809 193  61  28   6   6   2
                  1 125  33  13   3   1   0   0
                  2  62  15   1   1   0   0   0
                  3  26   4   4   0   1   1   0
                  4  14   1   3   0   0   0   0
                  5   2   3   0   0   0   0   0
                  6   2   1   0   0   0   0   0
                  7   0   0   0   1   0   0   0


Again, not all innings were accounted for, since games that ended with only half of the ninth inning played would not have a corresponding RS or RA. To reduce the number of cells with zeroes, innings with 4 or more runs were combined into a single category.


                       RA
                    RS    0   1   2   3  4+
                      0 809 193  61  28  14
                      1 125  33  13   3   1
                      2  62  15   1   1   0
                      3  26   4   4   0   2
                     4+  18   5   3   1   0



The simulated p-value for the chi-square test with 100,000 simulations is 0.308 - still not significant. In fact, computing the p-value for each team's contingency table with 100,000 simulations yields



\begin{array}{c | c}\hline
\textrm{Team} & \textrm{p-value} \\ \hline
         ATL&       0.309\\
         ARI&       0.559\\
         BAL&       0.627\\
         BOS&       0.151\\
         CHC&       0.774\\
         CHW&       0.735\\
         CIN&       0.298\\
         CLE&       0.166\\
         COL&       0.986\\
        DET&       0.977\\
        HOU&       0.665\\
        KCR&       0.913\\
        LAA&       0.512\\
        LAD&       0.789\\
        MIA&       0.497\\
        MIL&       0.905\\
        MIN&       0.429\\
        NYM&       0.873\\
        NYY&       0.337\\
        OAK&       0.813\\
        PHI&       0.760\\
        PIT&       0.986\\
        SDP&       0.849\\
        SEA&       0.115\\
        SFG&       0.206\\
        STL&       0.902\\
        TBR&       0.629\\
        TEX&       0.126\\
        TOR&       0.597\\ \hline
\end{array}


There are no p-values below 0.1, so it looks reasonable to assume that runs scored and runs allowed per inning are independent.
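The per-team loop can be sketched the same way. The data frame `innings` below is hypothetical - one row per team-inning, with columns Team, RS, and RA, built from whatever play-by-play source you have - and the cutoff for the combined category may need to differ by team:

    # p-value for each team, assuming a hypothetical data frame `innings` with one
    # row per team-inning and columns Team, RS, RA (a value of 4 here means "4+")
    pvals <- sapply(split(innings, innings$Team), function(d) {
      tab <- table(pmin(d$RS, 4), pmin(d$RA, 4))
      chisq.test(tab, simulate.p.value = TRUE, B = 100000)$p.value
    })
    round(sort(pvals), 3)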

This is definitely not the only - or even the most appropriate - way to answer the question of independence between runs scored and runs allowed, but represents an ad-hoc test of the sort that can be useful as a sanity check before proceeding onto other analyses.

*The chi-square test for independence actually assumes that the expected cell counts are above a certain size (a common rule of thumb is at least 5) - but using the observed cell counts is useful as an approximation.

21 May, 2015

Beta-Binomial Empirical Bayes

When you have a set of random variables - say, batting averages of a group of players - building a model that allows for "borrowing" of information between them can greatly improve estimation. In the case of batting averages, say a player has a batting average of 0.400 over 45 at-bats. Any baseball fan knows that's unsustainably high - but the raw prediction doesn't "know." All the raw prediction knows is that a player went to bat 45 times and got 18 hits - so the best prediction of that player's future is 0.400. But if the model can also somehow know that 0.400 is the highest batting average in a group of players, then that information can be used - possibly, to predict that some amount of regression towards the mean is due.

A basic statistical model that involves a group of similar probability distributions can often be improved by adding something that allows the model from one player to access the information from another player. One technique common in statistical modeling is to add a distribution "underneath" (or on top of, depending on how you visualize it) the individual probability distributions for each player. This is known as a "mixing" distribution, and it represents the probability distribution of the "true" parameters for each player.



In the batting averages example, it represents the distribution of batting averages. For example, you could use the mixing distribution to answer questions such as "If I randomly select a batter, what is the probability that he has a true batting average greater than 0.250?" or "If I randomly select a batter, what is the probability that I see 5 hits in 30 at-bats?" - essentially, sampling from the pool of all batters rather than looking at an individual one.
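Both of those questions are integrals over the mixing distribution, and with a beta mixing distribution they reduce to one-liners in R. The parameter values below are purely illustrative - a Beta(79.5, 220.5), which has mean 0.265:

    a <- 79.5; b <- 220.5                    # illustrative mixing distribution only

    # P(a randomly selected batter has a true average above 0.250)
    1 - pbeta(0.250, a, b)

    # P(5 hits in 30 at-bats for a randomly selected batter):
    # the beta-binomial marginal, choose(n, k) * B(a + k, b + n - k) / B(a, b)
    choose(30, 5) * beta(a + 5, b + 25) / beta(a, b)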

The Beta-Binomial Model


A very common statistical model of this form is called the beta-binomial model (I'm going to assume you're familiar with both the beta and binomial distributions individually). The basic formulation is

$y_i \sim Binomial(n_i, \theta_i)$
$\theta_i \sim Beta(\alpha, \beta)$

This model assumes player $i$ has a "true" batting average represented by a constant proportion $\theta_i$, where $0 \le \theta_i \le 1$. Of course, we don't actually know that each $\theta_i$ exists, any more than we know that the hit counts follow a binomial distribution or that there's a beta distribution underneath it all. But what this model does allow is the sharing of information - in this case, the beta distribution is the mixing distribution - which is very, very useful to us.

If we knew the values of the parameters of the beta distribution $\alpha$ and $\beta$, we could use Bayes' rule to get an estimate of the $\theta_i$

$\hat{\theta_i} =\dfrac{y_i + \alpha}{n_i + \alpha + \beta}$

If we don't know $\alpha$ and $\beta$ and don't want to be fully Bayesian, however, another option is to use empirical Bayes - that is, to replace $\alpha$ and $\beta$ with estimates $\hat{\alpha}$ and $\hat{\beta}$ (these can be moment estimates, maximum likelihood estimates - any kind of estimates). Then our estimates of the $\theta_i$ become

$\hat{\theta_i} = \dfrac{y_i + \hat{\alpha}}{n_i + \hat{\alpha } + \hat{\beta}}$

The effect of this is that the raw estimates $y_i/n_i$ are pulled towards the group mean in an amount that is determined by two things - the variance between the raw batting averages (if the raw batting averages are very close together, that will increase the shrinkage) and the number of at-bats. Raw predictions that are further away from the mean are pulled harder - a player batting 0.400 or 0.100 is going to have his predicted batting average reduced or improved drastically, but a player batting 0.250 is not going to see much of a change. Furthermore, no matter what $\hat{\alpha}$ and $\hat{\beta}$ are, their effect will go to zero (and hence, shrinkage will lessen) as the number of at-bats $n_i$ gets larger.
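To make the shrinkage-versus-sample-size behavior concrete, here is a tiny sketch with made-up parameter estimates ($\hat{\alpha} = 79.5$ and $\hat{\beta} = 220.5$, chosen only so that the group mean works out to 0.265):

    alpha_hat <- 79.5; beta_hat <- 220.5          # illustrative values only

    eb_estimate <- function(y, n) (y + alpha_hat) / (n + alpha_hat + beta_hat)

    eb_estimate(18, 45)      # a .400 hitter over 45 at-bats is pulled down to about .283
    eb_estimate(240, 600)    # a .400 hitter over 600 at-bats keeps much more: about .355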


Estimating the Parameters



The two most common procedures used for estimation of the parameters in the beta-binomial model are the method of moments and the method of maximum likelihood. I'm going to briefly describe an iterative method-of-moments estimator, which is useful if all you need is a point estimate and you don't necessarily care about the uncertainty of the parameter estimates. The disadvantage is that moment estimators tend to be less efficient than likelihood estimators, and for small sample sizes can produce negative estimates of $\alpha$ and $\beta$.

Code in R for this procedure can be found on my GitHub.

As previously stated, the two-stage model for beta-binomial empirical Bayes is given by

$y_i \sim Binomial(n_i, \theta_i)$
$\theta_i \sim Beta(\alpha, \beta)$


For the purpose of estimation, I'm going to rewrite the parameters of the beta distribution as

$\mu = \dfrac{\alpha}{\alpha + \beta}$
$\phi = \dfrac{1}{\alpha + \beta + 1}$

This makes $\mu$ equal to the expected value of the beta distribution with $\phi$ as the dispersion parameter.

Begin by defining


$\tilde{\theta_i} = \dfrac{y_i}{n_i}$


These are the raw, un-shrunk estimates of the $\theta_i$. Estimate the mean as


$\hat{\mu} = \dfrac{\displaystyle \sum w_i \tilde{\theta_i}}{w}$

where $w_i$ are weights and $w = \sum w_i$. Initial values of weights must be chosen - using $w_i = n_i$ is a good choice.

Then define


$S = \displaystyle \sum w_i (\tilde{\theta_i} - \hat{\mu})^2$


The dispersion parameter $\phi$ of the beta distribution is estimated as


$\hat{\phi} = \dfrac{S - \hat{\mu}(1-\hat{\mu})\displaystyle \sum \dfrac{w_i}{n_i} \left(1 - \dfrac{w_i}{w}\right)}{\hat{\mu}(1-\hat{\mu})\left[\displaystyle \sum w_i \left(1 - \dfrac{w_i}{w}\right) -  \displaystyle \sum \dfrac{w_i}{n_i} \left(1 - \dfrac{w_i}{w}\right)\right]}$

The weights $w_i$ are then re-estimated as

$w_i = \dfrac{n_i}{1 + \hat{\phi}(n_i - 1)}$

And the process is repeated (starting at the calculation of $\hat{\mu}$) until the sum of squared differences between the old and new weights is less than a certain tolerance. The iterated moments estimator produces estimates that tend to be closer in value to the maximum likelihood estimates than the naive estimator found on Wikipedia, but both moments estimators will give the same estimates when all the sample sizes are equal.

To convert these estimates back to the traditional $\alpha, \beta$ format, apply the formulas:

$\hat{\alpha} = \hat{\mu} \left(\dfrac{1-\hat{\phi}}{\hat{\phi}}\right)$

$\hat{\beta} = (1- \hat{\mu}) \left(\dfrac{1-\hat{\phi}}{\hat{\phi}}\right)$

And the empirical Bayesian estimates of the $\theta_i$ are given by

$\hat{\theta_i} =\dfrac{y_i + \hat{\alpha}}{n_i + \hat{\alpha} + \hat{\beta}}$
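Put together, the procedure is only a few lines of R. The sketch below simply follows the steps above, including the dispersion formula as written - it is not the repository code referenced earlier, which may differ in details:

    # Iterative method-of-moments estimation for the beta-binomial model
    beta_binom_mom <- function(y, n, tol = 1e-10, max_iter = 100) {
      theta_raw <- y / n                              # raw estimates y_i / n_i
      w <- n                                          # initial weights w_i = n_i
      for (iter in seq_len(max_iter)) {
        w_tot <- sum(w)
        mu    <- sum(w * theta_raw) / w_tot           # weighted mean
        S     <- sum(w * (theta_raw - mu)^2)          # weighted sum of squares
        a     <- sum((w / n) * (1 - w / w_tot))
        b     <- sum(w * (1 - w / w_tot))
        phi   <- (S - mu * (1 - mu) * a) / (mu * (1 - mu) * (b - a))
        w_new <- n / (1 + phi * (n - 1))              # re-estimated weights
        done  <- sum((w_new - w)^2) < tol
        w     <- w_new
        if (done) break
      }
      alpha <- mu * (1 - phi) / phi                   # convert back to alpha, beta
      beta  <- (1 - mu) * (1 - phi) / phi
      list(mu = mu, phi = phi, alpha = alpha, beta = beta,
           theta_hat = (y + alpha) / (n + alpha + beta))   # shrunk estimates
    }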

Caution: it's important here not to treat the raw $y_i / n_i$ as observations from the beta distribution and use them to estimate $\alpha$ and $\beta$ directly. They're estimates of the $\theta_i$, and doing so ignores the uncertainty in those estimates - which can lead to poor estimation of the parameters of the underlying beta distribution when the $n_i$ are small or the noise around the $\theta_i$ is large, such as for batting average on balls in play.


Batting Averages


Recall the batting-averages data used in the James-Stein estimator post. That data required a messy transformation to make it fit the equal-variance, normality assumptions. The beta-binomial method requires neither. If we say that $y_i$ is the number of hits player $i$ gets in $n_i$ at-bats, then the moment estimates of the beta parameters are $\hat{\mu} = 0.265$ and $\hat{M} = \hat{\alpha} + \hat{\beta} = 367.988$. Also define $\theta_i$ as the batting average over the rest of the season.

\begin{array}{l c c c c c} \hline
\textrm{Player} & y_i & y_i/n_i & \textrm{JS Estimate} & \textrm{BB Estimate} & \theta_i\\ \hline
Clemente & 18 & .400 & .290  & .280 & .346 \\
F. Robinson & 17 & .378  & .286  & .278 & .298 \\
F. Howard & 16 & .356 &  .282 &  .275 & .276 \\
Johnstone & 15 & .333  &  .277 &  .273 & .222\\
Barry & 14 & .311 &  .273 & .270 & .273\\
Spencer & 14 & .311 &  .273 & .270 & .270\\
Kessinger & 13 & .289 & .268  & .268 & .263\\
L. Alvarado & 12 & .267 &  .264 & .266 & .210\\
Santo & 11 & .244 &  .259 &  .263 & .269\\
Swoboda & 11 & .244 & .259  &  .263 & .230\\
Unser & 10 &.222  &  .254 & .261 & .264\\
Williams & 10 & .222 & .254  & .261 & .256 \\
Scott & 10 & .222 &  .254 & .261 &  .303 \\
Petrocelli & 10 & .222 & .254  & .261 & .264\\
E. Rodriguez & 10 & .222 &  .254 & .261 & .226\\
Campaneris & 9 & .200 &  .249 & .258 & .285\\
Munson & 8 & .178 &  .244 & .256 & .316\\
Alvis & 7 & .156 &  .239  & .253 & .200\\ \hline
\end{array}


The beta-binomial empirical Bayesian estimates tend to shrink a bit harder towards the mean - sometimes for better, sometimes for worse. Both the James-Stein and empirical Bayesian estimates improve dramatically on the naive estimates for prediction - compare

  •  $\sum (\hat{\theta}^{JS}_i - \theta_i)^2 = 0.0215$
  •  $\sum (\hat{\theta}^{BB}_i - \theta_i)^2 = 0.0218$ 
  •  $\sum (y_i/n_i - \theta_i)^2 = 0.0753$. 

This type of estimation that pulls observations towards a group mean is known as shrinkage estimation. Whether by means of the James-Stein estimator or a beta-binomial empirical Bayesian method, it offers a clear improvement in prediction when the measurement criterion is squared error.
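The beta-binomial column and the squared-error sums above can be reproduced - approximately, since the quoted $\hat{\mu}$ and $\hat{M}$ are rounded - straight from the table:

    # Hits in the first 45 at-bats and rest-of-season averages, from the table above
    y     <- c(18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7)
    theta <- c(.346, .298, .276, .222, .273, .270, .263, .210, .269, .230,
               .264, .256, .303, .264, .226, .285, .316, .200)
    n <- 45

    mu_hat <- 0.265; M_hat <- 367.988             # moment estimates quoted in the text
    alpha_hat <- mu_hat * M_hat                   # alpha = mu * M; note n + M = n + alpha + beta

    theta_bb <- (y + alpha_hat) / (n + M_hat)     # beta-binomial shrinkage estimates
    round(theta_bb, 3)                            # close to the BB Estimate column above

    sum((theta_bb - theta)^2)                     # about 0.0218
    sum((y / n   - theta)^2)                      # about 0.0753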

16 May, 2015

The James-Stein Estimator


If you want to make a prediction, an important question is "How do I judge accuracy?" There are several different ways.

For example, suppose you want to take the batting averages of a group of players in April and use them to predict the final batting average at the end of the regular season. How do you want to judge how close you got? You could take the average of the differences between your predictions and the actual batting averages. You could take the maximum difference between your prediction and the actual batting average. You could even focus on one side and acknowledge that hey, you don't care if a player overperforms relative to the prediction - you just want to make sure to minimize how much a player can underperform. All of these ways of judging your predictions could potentially lead to different prediction methods.

The first method - summing or averaging some function of the distances between the predicted and actual values - is common, and it has an estimator that goes along with it that isn't at all obvious.

The James-Stein Estimator


Let's say you have a set of observations $x_i$ that represent occurrences of some phenomenon, each of which has a parameter $\theta_i$ that controls some underlying aspect of that phenomenon - a baseball example would be to let $\theta_i$ be the "true" batting average of player $i$, if we can model each player as having a true, constant batting average representing his ability.

A common problem is then, how do we estimate $\theta_i$ with some estimator $\hat{\theta}_i$? The answer to that, as discussed previously, depends on how you are going to measure the accuracy of your estimate. A common approach is to use a squared loss function

$L_i(\theta_i, \hat{\theta}_i) = (\hat{\theta}_i - \theta_i)^2$

This presents a pretty good balance between bias and variance, and makes a lot of intuitive sense - squaring the distance makes everything positive, so the summed losses can be meaningfully minimized, and it makes being very wrong a lot worse than being a little wrong.

How, then, do you measure the accuracy of the set of estimators? You could just sum the losses.

$L({\theta}, \hat{\theta}) = \displaystyle \sum_i (\hat{\theta}_i - \theta_i)^2$

In 1961, statisticians Willard James and Charles Stein showed a surprising result: first, let $x_i$ follow a normal distribution with known variance and mean $\theta_i$

$x_i \sim N(\theta_i, 1)$

where $i = 1, 2, ..., k$, for $k \ge 4$ (the number of $\theta$s has to be four or bigger - this is a hard rule that can't be broken)*. In the absence of any other assumptions about the form of the data, the simplest estimator of $\theta_i$ is $\hat{\theta}_i = x_i$ (in the case of multiple observations per group, the sample mean is also normal, so we can assume a single observation per group without loss of generality). This estimator tends to work well, and is, again, pretty intuitive - the baseball version would be to just use April's batting average to predict the final batting average.

 However, Stein showed that a better estimator is given by

$\tilde{\theta}_i = \overline{x} + \left(1 - \dfrac{k-3}{\sum (x_i - \overline{x})^2}\right)(x_i - \overline{x})$

The effect of this estimator is to shrink the estimate $\hat{\theta}_i$ towards the overall mean of the data $\overline{x}$, in an amount determined by both the variance of the $\hat{\theta}_i$ and the distance of $\hat{\theta}_i$ from the overall mean. On average, using $\tilde{\theta}_i$ actually works better in terms of the total squared loss defined above than using the naive estimator $\hat{\theta}_i$. In some cases, it works much better.
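In R, the estimator is a direct transcription of that formula - a sketch under the same assumptions (known variance 1 and $k \ge 4$):

    # James-Stein estimator for x_i ~ N(theta_i, 1), shrinking towards the sample mean
    james_stein <- function(x) {
      k <- length(x)
      stopifnot(k >= 4)                            # the k - 3 correction needs k >= 4
      xbar <- mean(x)
      shrink <- 1 - (k - 3) / sum((x - xbar)^2)    # common shrinkage factor
      xbar + shrink * (x - xbar)
    }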

Baseball


How do we apply this strictly to baseball? Professors Bradley Efron and Carl Morris gave a classic example. The batting averages of 18 players through their first 45 at-bats of the 1970 baseball season are given below. The problem is to use the first 45 at-bats to predict the batting average for the remainder of the season.

\begin{array}{l c c} \hline
\textrm{Player} & x_i & \theta_i \\ \hline
Clemente & .400 & .346 \\
F. Robinson & .378 & .298\\
F. Howard & .356 & .276\\
Johnstone & .333 & .222\\
Barry & .311 & .273\\
Spencer & .311 & .270\\
Kessinger & .289 & .263\\
L. Alvarado & .267 & .210\\
Santo & .244 & .269\\
Swoboda & .244 & .230\\
Unser & .222 & .264\\
Williams & .222 & .256\\
Scott & .222 & .303\\
Petrocelli & .222 & .264\\
E. Rodriguez & .222 & .226\\
Campaneris & .200 & .285\\
Munson & .178 & .316\\
Alvis & .156 & .200\\ \hline
\end{array}

where $x_i$ is the batting average for the first 45 at-bats and $\theta_i$ is the batting average for the remainder of the season.

The first 45 at-bats were taken because the James-Stein estimator requires equal variances, and the variance is going to depend on the number of at-bats you look at. Here, the distribution of this data is clearly not normal - it is binomial. Hence, the transformation $f(x_i) = \sqrt{n}\arcsin(2x_i - 1)$ (in this case, $n = 45$) is used. The transformed data then fits the James-Stein model - it will follow, approximately, a $N(\theta_i, 1)$ distribution (with the $\theta_i$ on the same transformed scale).

The maximum likelihood estimate of each player's remainder batting average would be $\hat{\theta}_i = x_i$. Applying the transformation to both $x_i$ and $\theta_i$, the squared loss using the naive estimates is then $17.56$. However, if the James-Stein estimator is applied to the transformed $x_i$, the squared loss is $5.01$ - an efficiency of $3.5 = \dfrac{17.56}{5.01}$.


\begin{array}{c c c c c} \hline
x_i & \textrm{Transformed }x_i & \textrm{JS transformed }x_i & \textrm{Back-transformed JS Estimates} & \theta_i\\ \hline
        0.400     & -1.351   &    -2.906   &         0.290 & 0.346\\
        0.378     & -1.653   &    -2.969   &         0.286 & 0.298\\
        0.356     & -1.960   &    -3.033   &         0.282 & 0.276\\
        0.333     & -2.284   &    -3.101   &         0.277 & 0.222\\
        0.311     & -2.600   &    -3.167   &         0.273 & 0.273\\
        0.311     & -2.600   &    -3.167   &         0.273 & 0.270\\
        0.289     & -2.922   &    -3.235   &         0.268 & 0.263\\
        0.267     & -3.252   &    -3.304   &         0.264 & 0.210\\
        0.244     & -3.606   &    -3.378   &         0.259 & 0.269\\
        0.244     & -3.606   &    -3.378   &         0.259 & 0.230\\
        0.222     & -3.955   &    -3.451   &         0.254 & 0.264\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.256\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.303\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.264\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.226\\
      0.200      &-4.317     &  -3.526     &       0.249 & 0.285\\
       0.178     & -4.694    &   -3.605    &        0.244 & 0.316\\
       0.156     & -5.090   &    -3.688    &        0.239 & 0.200\\ \hline
\end{array}

(R code for these calculations may be found on my GitHub)
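A self-contained sketch of the same calculation (the repository version may differ in details) - transform, shrink, and back-transform:

    # April batting averages from the table above
    x <- c(.400, .378, .356, .333, .311, .311, .289, .267, .244, .244,
           .222, .222, .222, .222, .222, .200, .178, .156)
    n <- 45

    z <- sqrt(n) * asin(2 * x - 1)                 # transformed x_i, roughly N(theta_i, 1)

    # James-Stein shrinkage towards the mean of the transformed values
    k <- length(z); zbar <- mean(z)
    z_js <- zbar + (1 - (k - 3) / sum((z - zbar)^2)) * (z - zbar)

    x_js <- (sin(z_js / sqrt(n)) + 1) / 2          # back to the batting-average scale
    round(x_js, 3)                                 # matches the back-transformed column above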

In fact, the James-Stein estimator dominates the least-squares estimator when the way to measure accuracy is the squared error function - outside of statistics speak, that means that no matter what your actual $\theta$s happen to be, on average, the James-Stein estimator will work better.

So what, then, is a situation where you would not want to use the James-Stein estimator? The answer is if you're concerned about the worst-case scenario for individual players. The James-Stein estimator - and shrinkage estimation in general - sacrifices accuracy on the individual level in order to increase accuracy on the group level. Players like Roberto Clemente, whom we know to be very good, are going to get shrunk the hardest. If the only player you care about is Roberto Clemente, then using the raw April batting results is going to have the best average performance - even though on the group level, the James-Stein estimator will still "win" most of the time.


Shrinkage Estimation


In general, techniques that "shrink" estimates towards the mean will work better for group prediction than looking at each prediction individually. The James-Stein estimator, however, requires very specific assumptions - normality and equality of variances - that may be hard to meet. In future posts, I will discuss some ways to perform shrinkage estimation in other scenarios.

*Stein's estimator originally shrank towards a fixed common value and used $k-2$ in the estimator - however, in situations where the data is shrunk towards the estimated mean, $k-3$ must be used as a correction.