## 09 June, 2015

### Central Limit Theorem Estimation of a Batting Average

A new batter gets called up to the major leagues. You're interested in the batting average of this new batter - let's call his true batting average $\theta$ (this is what the batting average would converge to if the batter were to take millions and millions of at-bats - so we're using a frequentist understanding of probability). You observe that in 50 at-bats, the batter gets 15 hits. Your estimate of the true batting average is
$\hat{\theta} = \dfrac{15}{50} = 0.300$

This is what's called a "point estimate," since it just gives a single point on the real number line. Is that estimate correct? Almost certainly not - there's going to be variability in the data, especially with only $n = 50$ at-bats. What statisticians like to do is take into account the variability of the point estimate and give what's called an interval estimate - that is, a range of values where we think the true batting average $\theta$ actually lies. It's the statistical equivalent of casting a net rather than just a single line. The advantage of doing this is we can control how often the net captures the true $\theta$ by changing how big the net is.

Do I think the player's true $\theta$ is actually 0.300? No, almost certainly not. How close am I? With a sample size of 50, there's still a good amount of variation - 0.250 is possible, since it's close to my point estimate, but 0.050 is probably right out, since it's so far away. And therein lies the core technique - a point estimate, plus or minus some margin of error. By giving an interval estimate of 0.300 plus or minus some margin of error, I can fine-tune the margin of error so that I can be as confident in my interval as I want.

## Proportion Confidence Interval

If you've ever taken a Statistics 101 course, you've probably encountered the formula for a confidence interval for a proportion:

$\hat{\theta} \pm z^* \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

Where $\hat{\theta}$ is the sample proportion. You might have even done a baseball example - take the baseball player who gets $15$ hits in $n = 50$ at-bats. For that player, $\hat{\theta} = 15/50 = 0.3$. A 95% confidence interval is

$0.3 \pm 1.96 \sqrt{\dfrac{0.3(1-0.3)}{50}} = 0.3 \pm 0.127 = (0.173, 0.427)$

That is to say, with 95% confidence, the true batting average $\theta$ of the player is between 0.173 and 0.427 (we'll leave off the discussion of exactly what "95% confidence" means for now - other than to say that it is not a 95% chance that $\theta$ is in the interval).
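
Here is a minimal sketch of that arithmetic in Python, in case you want to plug in other hit and at-bat counts. The function name `proportion_ci` is just something I made up for illustration, and the default of 1.96 is the 95% value from the formula above.

```python
import math

def proportion_ci(hits, at_bats, z=1.96):
    """Normal-approximation interval for a proportion (illustrative helper)."""
    theta_hat = hits / at_bats  # point estimate
    # margin of error: z times the estimated standard error
    margin = z * math.sqrt(theta_hat * (1 - theta_hat) / at_bats)
    return theta_hat - margin, theta_hat + margin

lower, upper = proportion_ci(15, 50)
print(f"({lower:.3f}, {upper:.3f})")  # prints "(0.173, 0.427)"
```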

Where does this formula come from? One way to get it is from the central limit theorem.

## Central Limit Theorem

The central limit theorem (CLT) is, I think, probably the most important theorem in the realm of applied statistics. What it says is:

If $x_1, x_2,...,x_n$ is a sequence of independent, identically distributed random variables (with finite variance), then the distribution of their average $\overline{x}$ gets closer and closer to a normal distribution as the sample size goes to infinity.

In the real world you're not going to encounter a random variable with non-finite variance, so it's safe to ignore that. What does independent mean? That if I know the outcome of one trial, it doesn't affect the outcomes of the other trials. Identical means that they all follow the same distribution. Think about flipping a fair coin 10 times in a row - the flips don't affect each other (independent) and all have the same distribution of 50% chance heads, 50% chance tails (identical).

So assuming that the sample size $n$ is large enough, we know that the average of identical, independent random variables approximately follows a normal distribution. What are the parameters (meaning, the mean and variance) of this distribution? The central limit theorem states that the mean of the normal distribution is the same as the mean of each individual $x_i$, and the variance of the normal distribution is the variance of each of the $x_i$ divided by $n$. In short,

$\overline{x} \sim N( E[x_i], Var(x_i)/n)$

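The advantage of this is that you can flip around probability calculations involving the normal distribution: if $\overline{x}$ lands within $z^* \sqrt{Var(x_i)/n}$ of $E[x_i]$ with some known probability, then $E[x_i]$ is within that same distance of $\overline{x}$ with the same probability. That turns this into an interval estimate for $E[x_i]$ as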

$\overline{x} \pm z^* \sqrt{Var(x_i)/n}$

Where $z^*$ is an appropriate quantile from the normal distribution. For a 95% confidence interval, $z^* = 1.96$.
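
If you want a confidence level other than 95%, the quantile can be looked up rather than memorized. The sketch below uses `NormalDist` from Python's standard `statistics` module. The helper name `z_star` is my own, and it assumes the usual two-sided interval that splits the leftover probability evenly between the two tails.

```python
from statistics import NormalDist

def z_star(confidence):
    """Two-sided standard normal quantile for a given confidence level."""
    tail = (1 - confidence) / 2           # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)  # standard normal: mean 0, sd 1

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))
```

For `level = 0.95` this gives the familiar 1.96 after rounding.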

## Batting Averages

Does this apply to batting averages? Yes! As the name tells you - a batting average is an average. An average of what? That takes a bit of creativity. I'm going to define a specific type of random variable. Let's say that $x_i$ represents the outcome of a single at-bat. We're going to say $x_i = 1$ if the batter gets a hit and $x_i = 0$ if the batter does not get a hit. Also say that the probability of getting a hit is $\theta$, the same for every single at-bat. The expected value of each of the $x_i$ is then $\theta$ with variance $\theta(1-\theta)$. The average of the $x_i$ is now the sample proportion!

$\hat{\theta} = \bar{x} = \dfrac{\displaystyle \sum_{i=1}^{n} x_i}{n} = \dfrac{\textrm{# of 1s}}{\textrm{# of trials}} = \dfrac{\textrm{# of hits}}{\textrm{# of at-bats}}$
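
As a quick illustration of the 0/1 coding, here is a small Python sketch using the 15 hits in 50 at-bats from earlier. The order of the outcomes in the list is made up - only the counts come from the example.

```python
from statistics import fmean

# 1 = hit, 0 = no hit; the ordering is arbitrary, only the counts matter
at_bats = [1] * 15 + [0] * 35

theta_hat = fmean(at_bats)                 # average of the x_i
hits_over_trials = sum(at_bats) / len(at_bats)  # number of hits / number of at-bats

print(theta_hat, hits_over_trials)  # both are 0.3
```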

Hence, a batting average can be seen as the average of the $x_i$ when they are specified in the way I described. What is $E[x_i]$? It's equal to $\theta$. What is $Var(x_i)$? It's equal to $\theta(1-\theta)$ (let's ignore how I got these for this post). So what is the distribution of $\hat{\theta} = \overline{x}$?

$\hat{\theta} \sim N\left(\theta, \dfrac{\theta(1-\theta)}{n}\right)$
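
One way to sanity-check this is to simulate it. The sketch below assumes a hypothetical batter with true $\theta = 0.3$, simulates many independent 50 at-bat stretches, and compares the mean and variance of the simulated batting averages to $\theta$ and $\theta(1-\theta)/n$. The function name, the seed, and the number of repetitions are all arbitrary choices of mine.

```python
import random
import statistics

def simulate_batting_averages(theta, n, reps, seed=0):
    """Simulate `reps` batting averages, each from n independent at-bats."""
    rng = random.Random(seed)
    averages = []
    for _ in range(reps):
        # each at-bat is a hit (1) with probability theta, otherwise 0
        hits = sum(1 for _ in range(n) if rng.random() < theta)
        averages.append(hits / n)
    return averages

theta, n = 0.3, 50  # assumed true average and sample size
averages = simulate_batting_averages(theta, n, reps=20000)

sim_mean = statistics.fmean(averages)
sim_var = statistics.pvariance(averages)
print("mean:    ", sim_mean, "vs", theta)
print("variance:", sim_var, "vs", theta * (1 - theta) / n)
```

The simulated values should land close to the theoretical ones, though not exactly on them, since this is a random simulation.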

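This normal approximation assumes the sample size is large enough - and I'll say here that $n = 50$ should be large enough. Since the true $\theta$ in the variance is unknown, plug in the estimate $\hat{\theta}$ in its place. The 95% confidence interval then becomes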

$\hat{\theta} \pm 1.96 \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

About 95% of all intervals constructed in this manner should contain the true $E[x_i] = \theta$.
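
The coverage claim can also be checked by simulation. This is a rough sketch, again assuming a hypothetical true $\theta = 0.3$ and $n = 50$: build the interval many times from fresh simulated data and count how often it contains $\theta$. The function names and the seed are mine.

```python
import math
import random

def interval_covers(theta, n, rng, z=1.96):
    """Simulate n at-bats and report whether the interval contains theta."""
    hits = sum(1 for _ in range(n) if rng.random() < theta)
    theta_hat = hits / n
    margin = z * math.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - margin <= theta <= theta_hat + margin

def estimate_coverage(theta, n, reps, seed=0):
    """Fraction of simulated intervals that contain the true theta."""
    rng = random.Random(seed)
    covered = sum(interval_covers(theta, n, rng) for _ in range(reps))
    return covered / reps

coverage = estimate_coverage(theta=0.3, n=50, reps=20000)
print(f"estimated coverage: {coverage:.3f}")
```

I'd expect the estimate to come out near 0.95 but not exactly on it, since the normal distribution is only an approximation at this sample size.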

There are a few disadvantages, however. If you want an interval for a specific parameter $\theta$, the expected value of your observations $E[x_i]$ has to exactly equal $\theta$, or at least some simple function of $\theta$. The sample size has to be large enough. It's entirely possible that the formula gives a lower bound that's less than $0$ or an upper bound that's bigger than $1$ - values that a batting average cannot take. And we're making some assumptions about at-bats here - namely, that every at-bat has the same probability $\theta$ of getting a hit (not true) and that at-bats are completely independent (not true). However, these assumptions should be close enough to true that the confidence interval is a reasonable procedure - a 95% confidence interval may not have exactly 95% coverage, but it should be close.
There are alternative methods that can be used here to find a confidence interval for $\theta$, which I will discuss in future posts.
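
To make the out-of-range problem concrete, here is a small sketch with two made-up batters, one going 1 for 10 and one going 9 for 10. The helper `proportion_ci` is illustrative and just applies the formula from this post.

```python
import math

def proportion_ci(hits, at_bats, z=1.96):
    """Normal-approximation interval for a proportion (illustrative helper)."""
    theta_hat = hits / at_bats
    margin = z * math.sqrt(theta_hat * (1 - theta_hat) / at_bats)
    return theta_hat - margin, theta_hat + margin

# hypothetical small samples near the edges of the [0, 1] range
low_lower, low_upper = proportion_ci(1, 10)
high_lower, high_upper = proportion_ci(9, 10)

print(f"1 for 10: ({low_lower:.3f}, {low_upper:.3f})")    # lower bound is below 0
print(f"9 for 10: ({high_lower:.3f}, {high_upper:.3f})")  # upper bound is above 1
```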