29 June, 2015

Likelihood Ratio Intervals for a Batting Average

I've previously discussed the central limit theorem and Wald theory as methods for giving intervals for parameters - in the simple case of identical, independent at-bats, with probability $\theta$ of getting a hit, both give the same result, which is

$\hat{\theta} \pm z^* \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

though for more complicated problems, the two methods may not necessarily give the same result.

Those aren't the only methods for deriving confidence intervals for a batting average. In this post I'm going to derive another type of interval based, again, on the likelihood function, but using it in a different fashion than before.

All the code used to generate the images in this post may be found on my github.


Likelihood Ratio


Statistical theory says that, in the case of a simple one-parameter model, a function of the ratio of the likelihoods evaluated at the maximum likelihood estimator $\hat{\theta}$ and at another value $\theta_0$ follows a certain distribution:

$\Delta(\theta_0) = -2 \log\left(\dfrac{L(\theta_0)}{L(\hat{\theta})}\right) = -2[\ell(\theta_0) - \ell(\hat{\theta})] \sim \chi^2_1$

where $L(\theta)$ and $\ell(\theta)$ are the likelihood and log-likelihood functions as defined in this post. This statement can be inverted to get an interval for $\theta$. For a $\chi^2_1$ distribution, the $0.95$ quantile (that is, the value $k$ so that $P(\chi^2_1 \le k) = 0.95$) is $k = 3.84$. Hence, by taking the set of $\theta_0$ so that

$ -2[\ell(\theta_0) - \ell(\hat{\theta})] \le 3.84$

you get a 95% confidence interval for $\theta$.

Batting Averages


Recall that for the batting average model with $P(x_i = 1) = \theta$ representing a hit, $P(x_i = 0) = 1-\theta$ representing a non-hit, and independent and identical at-bats, the log-likelihood function was

 $\ell(\theta) = \sum x_i \log(\theta) + (n - \sum x_i) \log(1-\theta)$

Which was maximized at the maximum likelihood estimator

$\hat{\theta} = \dfrac{\sum x_i}{n}$

Let's go back to having a player who gets $15$ hits in $n = 50$ at-bats. The maximum likelihood estimator for the batting average is then $\hat{\theta} = 15/50 = 0.300$.

Plugging in $\sum x_i = 15$, $n = 50$, and $\theta = 0.3$ into the log-likelihood equation gives $\ell(\hat{\theta}) = -30.54$. The function $\Delta$ then has the formula

$\Delta(\theta_0) = -2[\ell(\theta_0) - (-30.543)] = -2\ell(\theta_0) - 61.086$

And so a 95% confidence interval is given by the set of $\theta_0$ such that $\Delta(\theta_0) \le 3.84$. This is not easy to solve analytically, so the easiest option is to use a computer - graphing the function $\Delta(\theta_0)$ and placing a horizontal line at $3.84$ gives


You need to figure out where the curve crosses the line. There are a few ways to do this, but the easy way (which I did) is just to calculate $\Delta(\theta_0)$ over a grid (for a proportion, there is a finite range of values for $\theta$ - so this is easy) and figure out which values are closest to $3.84$. Doing so gave me a 95% confidence interval of $(0.185, 0.435)$.
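
For concreteness, here's a minimal Python sketch of that grid search (not necessarily the same code as on my github - the grid resolution and the $3.84$ cutoff are choices I'm making here):

```python
# Sketch: invert the likelihood ratio statistic on a grid of candidate values.
import numpy as np

def lr_interval(hits, n, crit=3.84):
    """Approximate 95% likelihood ratio interval for a binomial proportion."""
    def loglik(t):
        return hits * np.log(t) + (n - hits) * np.log(1 - t)
    theta_hat = hits / n
    grid = np.linspace(0.001, 0.999, 9999)           # candidate theta_0 values
    delta = -2 * (loglik(grid) - loglik(theta_hat))  # the Delta(theta_0) statistic
    inside = grid[delta <= crit]                     # theta_0 values not rejected
    return inside.min(), inside.max()

print(lr_interval(15, 50))  # roughly (0.185, 0.435)
print(lr_interval(1, 3))    # roughly (0.023, 0.839)
```

The second call previews the small-sample example in the next section.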

Advantages/Disadvantages


Unlike the Wald theory and CLT-based interval, this interval does not depend on the asymptotic normality of $\hat{\theta}$ itself (though the $\chi^2_1$ cutoff is still an approximation) - and so it tends to behave better for small sample sizes. In fact, if you use it for $1$ hit in $n = 3$ at-bats, you get a graph of the delta function that looks like


and gives a 95% confidence interval of $(0.023, 0.839)$ - not a very useful interval, true, but 3 at-bats is almost no information!

(And note here that if you attempted to do a Wald/CLT interval, you would get $(-0.200, 0.867)$ - saying there's a chance the batter's true average is negative.)

What the interval does depend on, however, is the curvature of the likelihood function at the maximum likelihood estimator $\hat{\theta}$. This is easy to check in one-dimensional cases, but not as easy when the dimensionality of the problem grows.

Another advantage is that the likelihood-based interval is invariant to transformation - that is to say, if you used this method to obtain an interval for a batter's odds of getting a hit, and transformed it back onto the average scale, you would get the same interval as if you had originally done it on the average scale. This is not true for the Wald/CLT intervals.

22 June, 2015

Confidence Interval for wOBA Based on the Multinomial Model

In previous posts I've used models for single events - hits in at bats in particular. As baseball fans, we know that a hit is not the only thing that can occur in an at-bat - there are singles, doubles, triples, home runs, walks (both intentional and unintentional), sacrifice plays, outs - multiple outcomes. We need a model that can account for every single possible outcome.

All the code used to generate the results in this post may be found on my github.


The Multinomial Model


If you observe $n$ independent, identical trials of an event (say, a plate appearance) with $k$ possible outcomes (single, double, triple, home run, walk, sacrifice, etc. - all the way up to an out), then the distribution of counts of each of the possible outcomes follows what's known as a multinomial model. The probability mass function is defined as

$p(x_1, x_2,...,x_k) = \dfrac{n!}{x_1!x_2!...x_k!}\theta_1^{x_1}\theta_2^{x_2}...\theta_{k-1}^{x_{k-1}}(1-\theta_1-\theta_2-...-\theta_{k-1})^{x_k}$

If this looks to you like a binomial, you have a good eye - a binomial is just a multinomial with $k = 2$. As a note, you may, if you want, define $(1-\theta_1-\theta_2-...-\theta_{k-1}) = \theta_k$ for notational convenience, but it must be written as shown above for estimation. The log-likelihood function is

$\ell(\theta_1, \theta_2,...,\theta_{k-1}) =\log(n!/(x_1!x_2!...x_k!)) + x_1\log(\theta_1) + ... + x_{k-1}\log(\theta_{k-1}) + x_k \log(1 - \theta_1 - ... - \theta_{k-1})$

Maximizing this function gives maximum likelihood estimates $\hat{\theta_1} = x_1/n$, $\hat{\theta_2} = x_2/n$, and in general $\hat{\theta_i} = x_i/n$ - the obvious estimates.

Furthermore, according to statistical theory, the information matrix is the expectation of the negative of the matrix of second derivatives of the log-likelihood function:

$I(\theta_1, \theta_2,...,\theta_{k-1}) = -E\left[ \dfrac{\partial^2 \ell}{\partial \theta_i \partial \theta_j} \right]$

And by Wald theory, the inverse of the information matrix gives the variance of the maximum likelihood estimates, which jointly follow a normal distribution (see my previous post here for a one-dimensional introduction to maximum likelihood and Wald theory). The math gets pretty tedious pretty quickly, so I'll just give you the solution - if, for example, $k = 4$ and there are three parameters, then

$\left[ \begin{array}{c} \hat{\theta_1} \\ \hat{\theta_2} \\ \hat{\theta_3}  \end{array}\right] \sim N\left( \left[ \begin{array}{c}  \theta_1 \\ \theta_2 \\ \theta_3 \end{array}\right] , V(\theta_1, \theta_2, \theta_3)\right)$


$V(\theta_1, \theta_2, \theta_3) =  I(\theta_1, \theta_2, \theta_3)^{-1} = \left[ \begin{array}{ccc}
  \dfrac{\theta_1(1-\theta_1)}{n} & -\dfrac{\theta_1 \theta_2}{n} & -\dfrac{\theta_1 \theta_3}{n} \\
  -\dfrac{\theta_2 \theta_1}{n} &  \dfrac{\theta_2(1-\theta_2)}{n} & -\dfrac{\theta_2 \theta_3}{n} \\
  -\dfrac{\theta_3 \theta_1}{n} & -\dfrac{\theta_3 \theta_2}{n} & \dfrac{\theta_3(1-\theta_3)}{n}\\
\end{array} \right]$

Notice that for any individual $\theta_i$, the variance is $\dfrac{\theta_i (1-\theta_i)}{n}$ - the same variance as in the binomial model. This pattern continues for any $k$: binomial variances on the diagonal and $-\dfrac{\theta_i \theta_j}{n}$ on the off-diagonals. When doing inference, we will, as usual, replace the $\theta_i$ with $\hat{\theta_i}$, the estimated probabilities.
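
The inversion itself is tedious by hand, but it's easy to check numerically that the claimed covariance matrix really is the inverse of the Fisher information - here's a quick sketch with made-up values of $n$ and the $\theta_i$ (not anything from the post's code):

```python
# Sketch: confirm that inverting the multinomial Fisher information gives
# the covariance matrix above (k = 4, so three free parameters).
import numpy as np

n = 500
theta = np.array([0.2, 0.3, 0.4])   # theta_1, theta_2, theta_3 (so theta_4 = 0.1)
theta_k = 1 - theta.sum()

# Expected information: n/theta_i on the diagonal plus n/theta_k everywhere
info = n * (np.diag(1 / theta) + 1 / theta_k)
V = np.linalg.inv(info)

# The claimed form: theta_i(1-theta_i)/n on the diagonal, -theta_i*theta_j/n off it
V_claimed = (np.diag(theta) - np.outer(theta, theta)) / n
print(np.allclose(V, V_claimed))  # True
```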

Weighted Linear Functions of Probabilities


For simplicity, I'm going to assume that $k = 3$ here, so there are only two parameters $\theta_1$ and $\theta_2$ - but the math will work for any $k \ge 2$. Let's say you want to take a weighted function of the parameters.

$g(\theta_1, \theta_2) = w_1 \theta_1 + w_2 \theta_2$

Then the estimated value of $g$ is

$g(\hat{\theta_1}, \hat{\theta_2}) =  w_1 \hat{\theta_1} + w_2 {\hat{\theta_2}}$

Where $\hat{\theta_i}$ is given as above. We know from the binomial distribution that $Var(\hat{\theta_i}) =  \dfrac{\theta_i (1-\theta_i)}{n}$ - but to get a confidence interval, we need to find

$Var(g(\hat{\theta_1},\hat{\theta_2})) = Var(w_1 \hat{\theta_1} + w_2 {\hat{\theta_2}})$

You may be tempted to take $Var(w_1 \hat{\theta_1} + w_2 \hat{\theta_2}) = w_1^2Var(\hat{\theta_1}) + w_2^2Var(\hat{\theta_2})$, but be careful! The preceding formula only applies when  $\hat{\theta}_1$ and $\hat{\theta}_2$ are independent. In situations where they are not, the correct formula is

$Var(w_1 \hat{\theta_1} + w_2 \hat{\theta_2}) = w_1^2 Var(\hat{\theta_1}) + w_2^2 Var(\hat{\theta_2}) + 2 w_1 w_2 Cov(\hat{\theta_1}, \hat{\theta_2})$

That is, you have to take the covariance into account. The multinomial model assumes that $n$ is fixed, and so the $\hat{\theta_i}$ are not independent - if you know, for example, that $n = 100$ is fixed and there are $25$ hits, that gives you information about what values the remaining $\hat{\theta_i}$ can take.

This general formulation - taking variance and covariance into account - is much easier to write out in matrix form. If I let $w = [w_1 \hspace{0.1in} w_2]$, then I can write $Var(g(\hat{\theta_1}, \hat{\theta_2})) = wVw^T$, or


$\hspace{1in}[w_1 \hspace{0.1in} w_2]$ $\left[ \begin{array}{cc}
  \dfrac{\hat{\theta_1}(1-\hat{\theta_1})}{n} & -\dfrac{\hat{\theta_1}\hat{\theta_2}}{n} \\
  -\dfrac{\hat{\theta_2} \hat{\theta_1}}{n} &  \dfrac{\hat{\theta_2}(1-\hat{\theta_2})}{n} \\ 
\end{array} \right]$$\left[ \begin{array}{c} w_1 \\ w_2\end{array} \right]$
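
As a quick sanity check (a sketch with made-up numbers, not anything from the post's code), the matrix form gives exactly the same answer as the expanded variance-plus-covariance formula:

```python
# Sketch: w V w^T agrees with the expanded variance formula for a k = 3 multinomial.
import numpy as np

n = 200
theta1_hat, theta2_hat = 0.25, 0.10
w = np.array([1.0, 2.0])                                  # arbitrary weights

V = np.array([[theta1_hat * (1 - theta1_hat), -theta1_hat * theta2_hat],
              [-theta2_hat * theta1_hat,       theta2_hat * (1 - theta2_hat)]]) / n

matrix_form = w @ V @ w
long_form = w[0]**2 * V[0, 0] + w[1]**2 * V[1, 1] + 2 * w[0] * w[1] * V[0, 1]
print(matrix_form, long_form)  # identical
```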


Weighted On Base Average


(For this example, I am stealing data and context from the fangraphs definition of wOBA. Go check it out if you want to learn more about where the weights come from. As defined in The Book, wOBA also includes reaching base on error - I chose to go with the fangraphs example because my notation was already getting out of control, but the math here will easily expand to include that factor as well.)

If we define $n$ as the number of plate appearances minus the number of intentional walks ($n = PA - IBB$), then we can count the outcomes of interest - singles, doubles, triples, home runs, unintentional walks, hit-by-pitches, and "other" - and model them with a multinomial distribution. This means we have 6 parameters - $\theta_{1B}, \theta_{2B}, \theta_{3B}, \theta_{HR}, \theta_{uBB},$ and $\theta_{HBP}$ - representing the probability of each event occurring.

In 2013, Mike Trout had, in $n = 706$ plate appearances (minus IBB), 100 unintentional walks, 9 hit by pitches, 115 singles, 39 doubles, 9 triples, and 27 home runs.

The estimators for each of those components of the model are then $\hat{\theta_{uBB}} = 100/706 = 0.142$, $\hat{\theta_{HBP}} = 9/706 = 0.013$, $\hat{\theta_{1B}} = 0.163$, $\hat{\theta_{2B}} = 0.055$, $\hat{\theta_{3B}} = 0.013$, and $\hat{\theta_{HR}} = 0.038$.

The vector of linear weights, in the order (uBB, HBP, 1B, 2B, 3B, HR), is given by

$w = \left[\begin{array}{cccccc}0.69 & 0.722 & 0.888 & 1.271 & 1.616 & 2.101
\end{array}\right]$

The full variance matrix $V$ is given by (and I apologize if the formatting comes out weird - there was no way to show this without massive notation, either with greek letters or scientific notation, since the individual elements are so small)

$\left[ \begin{array}{cccccc}
  \dfrac{\hat{\theta_{uBB}}(1-\hat{\theta_{uBB}})}{n} & -\dfrac{\hat{\theta_{uBB}}\hat{\theta_{HBP}}}{n} & -\dfrac{\hat{\theta_{uBB}}\hat{\theta_{1B}}}{n} & -\dfrac{\hat{\theta_{uBB}}\hat{\theta_{2B}}}{n} & -\dfrac{\hat{\theta_{uBB}}\hat{\theta_{3B}}}{n} & -\dfrac{\hat{\theta_{uBB}}\hat{\theta_{HR}}}{n}\\
  -\dfrac{\hat{\theta_{HBP}} \hat{\theta_{uBB}}}{n} &  \dfrac{\hat{\theta_{HBP}}(1-\hat{\theta_{HBP}})}{n} & -\dfrac{\hat{\theta_{HBP}}\hat{\theta_{1B}}}{n} & -\dfrac{\hat{\theta_{HBP}}\hat{\theta_{2B}}}{n} & -\dfrac{\hat{\theta_{HBP}}\hat{\theta_{3B}}}{n} & -\dfrac{\hat{\theta_{HBP}}\hat{\theta_{HR}}}{n}\\
  -\dfrac{\hat{\theta_{1B}} \hat{\theta_{uBB}}}{n} &  -\dfrac{\hat{\theta_{1B}} \hat{\theta_{HBP}}}{n} &  \dfrac{\hat{\theta_{1B}}(1-\hat{\theta_{1B}})}{n} & -\dfrac{\hat{\theta_{1B}}\hat{\theta_{2B}}}{n} & -\dfrac{\hat{\theta_{1B}}\hat{\theta_{3B}}}{n} & -\dfrac{\hat{\theta_{1B}}\hat{\theta_{HR}}}{n}\\
  -\dfrac{\hat{\theta_{2B}}\hat{\theta_{uBB}}}{n} & -\dfrac{\hat{\theta_{2B}}\hat{\theta_{HBP}}}{n} & -\dfrac{\hat{\theta_{2B}}\hat{\theta_{1B}}}{n} & \dfrac{\hat{\theta_{2B}}(1-\hat{\theta_{2B}})}{n} & -\dfrac{\hat{\theta_{2B}}\hat{\theta_{3B}}}{n} & -\dfrac{\hat{\theta_{2B}}\hat{\theta_{HR}}}{n}\\
  -\dfrac{\hat{\theta_{3B}} \hat{\theta_{uBB}}}{n} &  -\dfrac{\hat{\theta_{3B}}\hat{\theta_{HBP}}}{n} & -\dfrac{\hat{\theta_{3B}}\hat{\theta_{1B}}}{n} & -\dfrac{\hat{\theta_{3B}}\hat{\theta_{2B}}}{n} & \dfrac{\hat{\theta_{3B}}(1-\hat{\theta_{3B}})}{n} & -\dfrac{\hat{\theta_{3B}}\hat{\theta_{HR}}}{n}\\
  -\dfrac{\hat{\theta_{HR}} \hat{\theta_{uBB}}}{n} &  -\dfrac{\hat{\theta_{HR}} \hat{\theta_{HBP}}}{n} &  -\dfrac{\hat{\theta_{HR}}\hat{\theta_{1B}}}{n} & -\dfrac{\hat{\theta_{HR}}\hat{\theta_{2B}}}{n} & -\dfrac{\hat{\theta_{HR}}\hat{\theta_{3B}}}{n} & \dfrac{\hat{\theta_{HR}}(1-\hat{\theta_{HR}})}{n} \end{array}\right]$
 
And the standard deviation of wOBA for Mike Trout that year was given by

$\sqrt{wVw^T} = 0.021$

A 95% confidence interval for the wOBA of Mike Trout in 2013 is given by

$0.423 \pm 1.96(0.021) = (0.381, 0.464)$
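
Putting the whole calculation together in code (a sketch, not necessarily the code on my github, that reproduces the numbers above up to rounding):

```python
# Sketch of the full wOBA calculation for Trout's 2013 numbers,
# with weights and counts in the order (uBB, HBP, 1B, 2B, 3B, HR).
import numpy as np

n = 706
counts = np.array([100, 9, 115, 39, 9, 27])              # event counts
w = np.array([0.69, 0.722, 0.888, 1.271, 1.616, 2.101])  # linear weights

theta_hat = counts / n
V = (np.diag(theta_hat) - np.outer(theta_hat, theta_hat)) / n  # multinomial covariance

woba = w @ theta_hat                       # about 0.423
se = np.sqrt(w @ V @ w)                    # about 0.021
print(woba, se)
print(woba - 1.96 * se, woba + 1.96 * se)  # roughly (0.381, 0.464)
```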

Last Word


These intervals, complicated as they are, are actually too narrow. The formula treats the weights $w_i$ as constants, but really, they are estimates too - $w_i = \hat{w_i}$ - and the formulation does not account for $Var(\hat{w_i})$, the uncertainty in the estimation of the $w_i$. With as much data as has gone into the calculation of the weights, though, $Var(\hat{w_i})$ is negligible and can be safely ignored.

20 June, 2015

The Delta Method for a Confidence Interval for Odds

In my previous post, I discussed using Wald theory and maximum likelihood to get a confidence interval for a batting average, $\theta$. What if I want a function of that parameter instead?

Odds


Let's say that instead of a batting average $\theta$, I want the odds of getting a hit. To get the odds of a hit, apply the function

$g(\theta) = \dfrac{\theta}{1-\theta}$

So for example, a batter with a true $\theta = 0.250$ will have odds $g(0.250) = 0.250/0.750 = 1/3$, or one to three odds of getting a hit.

Delta Method


Suppose we have some estimator $\hat{\theta}$ that converges to a normal distribution with variance $\sigma^2$ - that is,

$\hat{\theta} \rightarrow N(\theta, \sigma^2)$

For example, assuming independent and identical at-bats, the sample batting average converges to a normal distribution

$\hat{\theta} \rightarrow N\left(\theta, \dfrac{\theta(1-\theta)}{n}\right)$

Then statistical theory says that any function $g(\hat{\theta})$ - assuming the first derivative $g'(\theta)$ exists and is nonzero - has distribution

$g(\hat{\theta}) \rightarrow N(g(\theta), [g'(\theta)]^2 \sigma^2)$

This gives us a handy way to calculate confidence intervals for functions of parameters, if we can calculate a confidence interval for the parameter itself.

Back to Odds of Getting a Hit


If we define the odds function as above, then the first derivative is given by

$g'(\theta) = \dfrac{1}{(1-\theta)^2}$

and so the distribution of the sample batting odds $g(\hat{\theta})$ converges to a normal distribution with mean $g(\theta)$ and variance

$[g'(\theta)]^2 \sigma^2 = \left[\dfrac{1}{(1-\theta)^2}\right]^2\left[\dfrac{\theta(1-\theta)}{n}\right] = \dfrac{\theta}{n(1-\theta)^3}$

And so a confidence interval for the odds of a hit, given the sample batting average $\hat{\theta}$, is given by

$\left(\dfrac{\hat{\theta}}{1-\hat{\theta}}\right) \pm z^* \sqrt{\dfrac{\hat{\theta}}{n(1-\hat{\theta})^3}}$

where $z^*$ is an appropriate quantile from the normal distribution.

Let's take our batter above, and suppose a $\hat{\theta} = 0.250$ batting average in $n = 40$ at-bats. Then a 95% confidence interval for the odds of getting a hit is given by

$\left(\dfrac{0.250}{1-0.250}\right) \pm1.96 \sqrt{\dfrac{0.250}{40(1 - 0.250)^3}} = (0.095, 0.572)$

A fairly wide interval - but then again, $n = 40$ at-bats isn't much information to work on. If it were instead $n = 400$ at-bats, the interval would be

$\left(\dfrac{0.250}{1-0.250}\right) \pm1.96 \sqrt{\dfrac{0.250}{400(1 - 0.250)^3}} = (0.258, 0.409)$

Which is much narrower, and much more usable.
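
As a quick sketch (again, not necessarily the code on my github), the whole delta-method interval fits in a few lines:

```python
# Sketch: delta-method interval for the odds of a hit, given a sample
# batting average theta_hat from n at-bats.
import numpy as np

def odds_interval(theta_hat, n, z=1.96):
    odds = theta_hat / (1 - theta_hat)
    se = np.sqrt(theta_hat / (n * (1 - theta_hat) ** 3))  # delta-method standard error
    return odds - z * se, odds + z * se

print(odds_interval(0.250, 40))   # roughly (0.095, 0.572)
print(odds_interval(0.250, 400))  # roughly (0.258, 0.409)
```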

The code I used to generate these results may be found on my github.

16 June, 2015

Maximum Likelihood Estimation of a Batting Average

In a previous post, I discussed how to generate a confidence interval for a batting average $\theta$ based on the central limit theorem. However, there are other ways to generate confidence intervals - one of them is by the use of maximum likelihood and information theory, which I will nominally call Wald theory.

Maximum Likelihood


Let's say a batter has some true batting average $\theta$ (once again, representing the limiting batting average over millions of at bats - so taking a frequentist view of probability). Assume each of this batter's at-bats are independent, and that he gets a hit with probability $\theta$ and does not get a hit with probability $1-\theta$.

In three at-bats, the batter gets a hit once and does not get a hit twice (I'll call these H and NH for short). How do I use this information to give an estimate $\hat{\theta}$ of the true batting average? Think about the joint probability of seeing one hit and two non-hits. If the true batting average were actually $\theta = 0.100$, we would more likely have seen three non-hits. Conversely, if by some miracle the true batting average were actually $\theta = 0.700$, we would probably expect to see more than one hit. Since $\theta$ is a fixed quantity, we can't talk about $P(\theta)$ - but what we can talk about is the joint probability of the data given a specific value of $\theta$, and let that guide our inference.

We assumed that at-bats were independent. This is important, because the assumption of independence allows us to exactly calculate the joint probability of one hit and two non-hits as

$P(H, NH, NH) = P(H)P(NH)P(NH) = (\theta)(1-\theta)(1-\theta) = \theta^1(1-\theta)^2$

and the principle of maximum likelihood says that I should use this formula to pick my estimate $\hat{\theta}$.  How? By choosing $\hat{\theta}$ as the value of $\theta$ that maximizes the joint probability. Essentially, it's the answer to the question of what "true" batting average would make it the most likely to get one hit and two non-hits. Whatever the answer is, that's our estimate of the true batting average.

What happens if I plug $\theta = 0.5$ into the equation above? Then I get a joint probability of $0.125$. What happens if I plug $\theta = 0.1$ in? I get a joint probability of $0.081$. What happens if I plug $\theta = 0.333$ in? I get a joint probability of approximately $0.148$. As it turns out, this is the maximum - so we would use $\hat{\theta} = 0.333$, because it makes it most likely to see the data that we saw.



(The code for this graph may be found on my github)
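
As a rough sketch of that calculation (not necessarily the same code as on my github), you can evaluate the joint probability on a grid and pick out the peak:

```python
# Sketch: evaluate the joint probability theta * (1 - theta)^2 on a grid
# and find where it is maximized.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
joint_prob = theta * (1 - theta) ** 2   # P(H, NH, NH) as a function of theta

print(theta[np.argmax(joint_prob)])     # about 0.333
print(0.5 * 0.5**2, 0.1 * 0.9**2)       # 0.125 and 0.081, the values checked above
```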

The obvious answer isn't always the correct one, but it is here - and it should serve as a sanity check that the method of maximum likelihood isn't suggesting that a player with one hit and two non-hits is actually a $0.900$ batter.

Deriving a General Maximum Likelihood Estimator


That was one example. However, you'd probably want to derive a more general method - see if you can figure out what the estimator is going to be before you actually see the data.

Once again, define a model as $x_i = 0$ if the player does not get a hit, and $x_i = 1$ if the player does get a hit. Then $P(x_i = 1) = \theta$ and $P(x_i = 0) = 1 - \theta$. The full mass function is then $P(x_i) = \theta^{x_i}(1-\theta)^{1-x_i}$, which limits the possible responses to $x_i = 1$ or $x_i = 0$.

Say that $x_1, x_2,..., x_n$ represents the outcome of $n$ at-bats. Let's again assume that each at-bat is independent and identical. That allows us to write the joint probability

$P(x_1, x_2, ..., x_n) = P(x_1)P(x_2)...P(x_n) = \displaystyle \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$

The formula looks complicated, but it's essentially

$P(x_1, x_2,...,x_n) = \theta^{\textrm{# Hits}}(1-\theta)^{\textrm{# Non-Hits}}$

I'm going to write this joint probability as a function of the parameter $\theta$ - call this the likelihood of $\theta$

$L(\theta| x_1, x_2,..., x_n) = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$

The goal is to maximize this function with respect to $\theta$ (i.e., find the value of $\theta$ that makes this the biggest). First, though, I'm going to take the natural log of both sides. There are a few reasons for doing this - it gives the same result but makes the math easier, it makes numerical computations more stable if/when you choose to do them, and there's statistical theory (which we'll get to in a bit) that suggests it's the correct thing to do. This gives the new function

$\ell(\theta | x_1, x_2, ..., x_n) = \log(L(\theta | x_1, x_2,...,x_n))$

From now on, I'm going to drop the $x_1, x_2,...,x_n$ to save space - the likelihood function still depends on the data, I'm just going to use $\ell(\theta)$ in place of $\ell(\theta| x_1, x_2, ..., x_n)$ now. The log-likelihood is then

 $\ell(\theta) = \log(\theta^{\sum x_i} (1-\theta)^{n - \sum x_i}) = \sum x_i \log(\theta) + (n - \sum x_i) \log(1-\theta)$

What's the best way to find a maximum of a function? Take the derivative, set it equal to zero, and solve.

$\ell'(\theta) = \dfrac{\sum x_i}{\theta} - \dfrac{n - \sum x_i}{1-\theta} = 0$

Solving this for $\theta$ yields $\theta = \sum x_i / n$, so the maximum likelihood estimator is $\hat{\theta} = \sum x_i/n$ - that is, the number of hits divided by the number of at-bats - the sample batting average.
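
If you'd rather let the computer do the calculus, a quick numerical check (a sketch with made-up data, not anything from the post) confirms that the maximizer is just the sample average:

```python
# Sketch: numerically maximize the log-likelihood and compare to the
# sample batting average (made-up data: 15 hits in 50 at-bats).
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1] * 15 + [0] * 35)   # 1 = hit, 0 = non-hit

def neg_loglik(theta):
    return -(x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta))

result = minimize_scalar(neg_loglik, bounds=(0.001, 0.999), method="bounded")
print(result.x, x.mean())  # both about 0.300
```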

From the Estimator to the Interval


In statistics, "information" is a real thing - and a measurable quantity. In general, the negative second derivative of the log likelihood function


$J(\theta | x_1, x_2, ..., x_n) = -\dfrac{\partial^2}{\partial \theta^2} \ell(\theta)$

is known as the observed information. The observed information is dependent on the data, so taking the expectation gives the expected (or Fisher) information.

$I(\theta) = E\left[-\dfrac{\partial^2}{\partial \theta^2} \ell(\theta)\right]$

Notice that $I(\theta)$ does not depend on the data, since it is acquired by taking the expectation over the data. Statistical theory says that under a few pretty reasonable conditions,

$\hat{\theta} \sim N\left(\theta, \dfrac{1}{I(\theta)}\right)$

That is, the maximum likelihood estimator $\hat{\theta}$ follows a normal distribution with mean $\theta$ (the true parameter) and variance given by one over the expected information.

This suggests a confidence interval of the form

$\hat{\theta}  \pm z^* \sqrt{\dfrac{1}{I(\theta)}}$

Where $z^*$ is an appropriate quantile from a normal distribution. Of course, we don't actually know what $\theta$ is, so we have to estimate $I(\theta)$ - common approaches are to use $I(\hat{\theta})$ or $J(\hat{\theta} | x_1, x_2, ..., x_n)$ - that is, the expected or observed information with our estimator $\hat{\theta}$ in place of $\theta$. 

Applied to Batting Averages


Recall that for the at-bat model with $P(x_i = 1) = \theta$ and $P(x_i = 0) = 1 - \theta$, the first derivative of the log-likelihood is

$\ell'(\theta) = \dfrac{\sum x_i}{\theta} - \dfrac{n - \sum x_i}{1-\theta}$

The second derivative is

$\ell''(\theta) = -\dfrac{\sum x_i}{\theta^2} - \dfrac{n - \sum x_i}{(1-\theta)^2}$

And making it negative gives the observed information as

$J(\theta | x_1, x_2, ..., x_n) = -\ell''(\theta) = \dfrac{\sum x_i}{\theta^2} + \dfrac{n - \sum x_i}{(1-\theta)^2}$

Lastly, we must take the expectation with respect to the data $x_i$. Remember that $E[x_i] = \theta$ by the probability model, and it's not hard to show from the definition of expected value that $E[aX + b] = aE[X]  + b$, where $X$ is a random variable and $a$ and $b$ are constants. The Fisher information is then

$I(\theta) = E[-\ell''(\theta) ]= E\left[\dfrac{\sum x_i}{\theta^2} + \dfrac{n - \sum x_i}{(1-\theta)^2}\right]$
$ = \dfrac{E[\sum x_i]}{\theta^2} + \dfrac{n - E[\sum x_i]}{(1-\theta)^2} = \dfrac{n\theta}{\theta^2} + \dfrac{n - n\theta}{(1-\theta)^2} = \dfrac{n}{\theta} + \dfrac{n}{1-\theta}$

And doing a little bit of algebra gives us

$I(\theta) = \dfrac{n}{\theta(1-\theta)}$

Since we don't know $\theta$, replace it with $\hat{\theta}$. Then by Wald theory, a 95% confidence interval for $\theta$ is given by

$\hat{\theta} \pm z^* \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

where $\hat{\theta} = \sum x_i/n$ is just the sample batting average. As it turns out, this is the same interval that you get from central limit theorem (moment-based) inference! This doesn't always have to happen - it turns out that the sample batting average is just a really good estimator for the true batting average, assuming independent and identical at-bats.
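
In code, the whole Wald interval calculation is short (a sketch using the 15-hits-in-50-at-bats example from before):

```python
# Sketch: Wald interval from the information calculation above,
# applied to 15 hits in n = 50 at-bats.
import numpy as np

hits, n = 15, 50
theta_hat = hits / n
info = n / (theta_hat * (1 - theta_hat))      # expected information at theta_hat
se = np.sqrt(1 / info)                        # equals sqrt(theta_hat*(1-theta_hat)/n)
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)  # roughly (0.173, 0.427)
```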

Advantages and Disadvantages


Unlike the central limit theorem interval, Wald theory intervals do not require that $E[x_i] = \theta$ - the maximum likelihood/inverse information approach will work to give intervals for a specific parameter even if the expected value is a function of several parameters. Furthermore, if you are coding a maximum likelihood estimator by hand (using, say, the Newton-Raphson algorithm) then you will have to calculate the first and second derivatives anyway - so Wald intervals tie in very well with numerical methods. In fact, oftentimes people only go so far as to code up the log-likelihood - all the other information needed to create a Wald interval can be found in the output of the numerical maximization program.

A disadvantage is that Wald theory intervals rely on the asymptotic normality of the maximum likelihood estimator, and so sample size considerations again come into play - with the graph above for one hit and two non-hits, you can clearly see that the distribution is non-normal, and so for small sample sizes nominally 95% Wald intervals will fail to reach that coverage. And you have the same problem as with the central limit theorem intervals - they can give numbers in an unrealistic range (batting averages less than 0 or more than 1).

09 June, 2015

Central Limit Theorem Estimation of a Batting Average

A new batter gets called up to the major leagues. You're interested in the batting average of this new batter - let's call his true batting average $\theta$ (this is what the batting average would converge to if the batter were to take millions and millions of at-bats - so we're using a frequentist understanding of probability). You observe that in 50 at-bats, the batter gets 15 hits. Your estimate of the true batting average is
$\hat{\theta} = \dfrac{15}{50} = 0.300$

This is what's called a "point estimate," since it just gives a single point on the real number line. Is that estimate correct? Almost certainly not - there's going to be variability in the data, especially with only $n = 50$ at-bats. What statisticians like to do is take into account the variability of the point estimate and give what's called an interval estimate - that is, a range of values where we think the true batting average $\theta$ actually lies. It's the statistical equivalent of casting a net rather than just a single line. The advantage of doing this is we can control how often the net captures the true $\theta$ by changing how big the net is.

Do I think the player's true $\theta$ is actually 0.300? No, almost certainly not. How close am I? With a sample size of 50, there's still a good amount of variation - 0.250 is possible, since it's close to my point estimate, but 0.050 is probably right out, since it's so far away. And therein lies the core technique - a point estimate, plus or minus some margin of error. By giving an interval estimate of 0.300 plus or minus some margin of error, I can fine-tune the margin of error so that I can be as confident in my interval as I want.

Proportion Confidence Interval 


If you've ever taken a Statistics 101 course, you've probably encountered the formula for a confidence interval for a proportion:

$\hat{\theta} \pm z^* \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

Where $\hat{\theta}$ is the sample proportion. You might have even done a baseball example - take the baseball player who gets $15$ hits in $n = 50$ at-bats. For that player, $\hat{\theta} = 15/50 =  0.3$. A 95% confidence interval is 

$0.3 \pm 1.96 \sqrt{\dfrac{0.3(1-0.3)}{50}} = (0.173, 0.427)$

That is to say, with 95% confidence, the true batting average $\theta$ of the player is between 0.173 and 0.427 (we'll leave off the discussion of exactly what "95% confidence" means for now - other than to say that it is not a 95% chance that $\theta$ is in the interval).

Where does this formula come from? One way to get it is from the central limit theorem.

Central Limit Theorem


The central limit theorem (CLT) is, I think, probably the most important theorem in the realm of applied statistics. What it says is:

If $x_1, x_2,...,x_n$ is a sequence of independent, identically distributed random variables (with finite variance), then the average $\overline{x}$ of the random variables converges to a normal distribution as the sample size goes to infinity.

In the real world you're not going to encounter a random variable with non-finite variance, so it's safe to ignore that. What does independent mean? That if I know the outcome of one trial, it doesn't affect the outcomes of the other trials. Identical means that they all follow the same distribution. Think about flipping a fair coin 10 times in a row - the flips don't affect each other (independent) and all have the same distribution of 50% chance heads, 50% chance tails (identical).

So, assuming that the sample size $n$ is large enough, we know that the average of identical, independent random variables approximately follows a normal distribution. What are the parameters (meaning, the mean and variance) of this distribution? The central limit theorem states that the mean of the normal distribution is the same as the mean of each individual $x_i$, and the variance of the normal distribution is the variance of each of the $x_i$ divided by $n$. In short,

$\overline{x} \sim N( E[x_i], Var(x_i)/n)$

The advantage of this is that by flipping around probability calculations involving the normal distribution, you can turn this into an interval estimate for $E[x_i]$ as

$\overline{x} \pm z^* \sqrt{Var(x_i)/n}$

Where $z^*$ is an appropriate quantile from the normal distribution. For a 95% confidence interval, $z^* = 1.96$.

Batting Averages


Does this apply to batting averages? Yes! As the name tells you - a batting average is an average. An average of what? That takes a bit of creativity. I'm going to define a specific type of random variable. Let's say that $x_i$ represents the outcome of a single at-bat. We're going to say $x_i = 1$ if the batter gets a hit and $x_i = 0$ if the player does not get a hit. Also say that the probability of getting a hit is $\theta$, the same for every single at-bat. The expected value of each of the $x_i$ is then $\theta$, with variance $\theta(1-\theta)$. The average of the $x_i$ is now exactly the sample proportion!

$\hat{\theta} = \bar{x} = \dfrac{\displaystyle \sum_{i=1}^{n} x_i}{n} = \dfrac{\textrm{# of 1s}}{\textrm{# of trials}} = \dfrac{\textrm{# of hits}}{\textrm{# of at-bats}}$

Hence, a batting average can be seen as the average of the $x_i$ when they are specified in the way I described. What is $E[x_i]$? It's equal to $\theta$. What is $Var(x_i)?$ It's equal to $\theta(1-\theta)$ (let's ignore how I got these for this post). So what is the distribution of $\hat{\theta} = \overline{x}$?

$\hat{\theta} \sim N\left(\theta, \dfrac{\theta(1-\theta)}{n}\right)$

This is assuming the sample size is large enough - and I'll say here that $n = 50$ should be large enough. The 95% confidence interval then becomes

$\hat{\theta} \pm 1.96 \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

95% of all intervals constructed in this manner should contain the true $E[x_i] = \theta$.

Advantages/Disadvantages


The central limit theorem is very handy, and provides good intervals when the conditions (usually reasonable to assume) are met.

There are a few disadvantages, however. If you want an interval for a specific parameter $\theta$, the expected value of your observations $E[x_i]$ has to exactly equal $\theta$, or at least some simple function of $\theta$. The sample size has to be large enough. It's entirely possible that the formula gives a lower bound less than $0$ or an upper bound bigger than $1$ - values that a batting average cannot take. And we're making some assumptions about at-bats here - namely, that every at-bat has the same probability $\theta$ of producing a hit (not true) and that at-bats are completely independent (not true). However, these assumptions should be close enough to true that the confidence interval is a reasonable procedure - a 95% confidence interval may not have exactly 95% coverage, but it should be close.
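
One way to see how close to 95% the coverage actually is, at least under the idealized model, is a quick simulation (a sketch - the true $\theta = 0.300$ and $n = 50$ here are made-up choices):

```python
# Sketch: simulate coverage of the CLT interval under the idealized model
# of independent, identical at-bats with a true theta of 0.300 and n = 50.
import numpy as np

rng = np.random.default_rng(0)
theta, n, covered = 0.300, 50, 0

for _ in range(10000):
    theta_hat = rng.binomial(n, theta) / n
    half_width = 1.96 * np.sqrt(theta_hat * (1 - theta_hat) / n)
    covered += (theta_hat - half_width <= theta <= theta_hat + half_width)

print(covered / 10000)  # typically a bit under 0.95 at this sample size
```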

There are alternative methods that can be used here to find a confidence interval for $\theta$, which I will discuss in future posts.

03 June, 2015

Restroom Queues

This is a post about a vaguely baseball-related problem that happened recently - on opening day at Wrigley Field, the lines to use the restrooms were a complete disaster. It turns out you can use statistical modeling to model queues (a fancy word for lines), and with some assumptions, you can actually figure out how long the average wait is and how many people are, on average, in line!

Before I get started, the purpose of this post is just as a demonstration of basic queuing theory, not a full analysis of the situation at Wrigley field. A full analysis would require many, many more assumptions, and would take computer modeling and simulation in order to produce estimates.

A queuing model envisions two "parts" of the queue - a line, and a "service center" that people spend time in at the end of the line, and then depart from. If people are leaving the service center faster than they are arriving to the line, the line will be short. If people are arriving to the line faster than they leave the service center, the line will be long.

Underlying queuing theory is the model of times between arrivals and the times for service. The basic model for these is

$A_i \sim Exp(\lambda)$
$S_i \sim Exp(\mu)$

That is, times between arrivals $A_i$ are exponentially distributed with a constant rate of $\lambda$ arrivals per unit of time and service times $S_i$ are exponentially distributed with a constant rate of $\mu$ services per unit of time. Of course in the real world the arrival and service rates are not constant, but we can make the assumption for the purpose of a quick discussion and it should work out okay.

So what happened at Wrigley Field? The answer is that the queuing system became unstable - meaning, essentially, that people were arriving at the restrooms much faster than they were leaving them. What happens in an unstable system? The line just keeps getting longer and longer - theoretically leading to an infinite line, but in the real world leading to situations like the one at Wrigley Field.

What makes a system unstable? That depends on the number of "service stations" available - we'll call this quantity $c$. In a bank with one teller, for example, $c = 1$. If there's only one service station, then the system becomes unstable when $\lambda > \mu$ - the arrival rate is greater than the service rate.

Now let's think about restrooms. Clearly in a restroom there can be multiple types of service stations - toilets, urinals, and sinks - all of which will have different service times, and some people may skip the sink altogether (gross). So clearly this is not a "basic" modeling situation, but again - demonstration, not full analysis.

Let's combine one toilet/urinal and one sink into a single service station, and say that there are $c$ of them in a restroom. The system is unstable, then, if $\lambda > c\mu$ - that is, the arrival rate is larger than $c$ times the service rate.

What about all the other quantities involved? There's a handy queuing theory calculator I've found on the internet - we're going to choose "M/M/c." The first and second M stand for "Memoryless" - referencing our choice of the exponential distribution for interarrival and service times - and the $c$ is the number of servers as defined above. 

Let's plug in a few numbers into the calculator - for example, say there are $\lambda = 60$ customers arriving per hour, there are $c = 10$ service stations (toilet/urinal and sink combined) that can handle $\mu = 8$ customers per hour each. What are the properties of the queuing system?


There are, on average, about 8.4 customers in the system (using the restroom plus in line). On average there are 0.92 customers in line. A customer will have to wait approximately 0.92 minutes, on average, in line before using the restroom. All well and good.

What happens as $\lambda$ approaches $c \mu$? Let's say now that $\lambda = 60$ customers per hour and $\mu = 8$ customers per hour, but two toilets break and now there are only $c = 8$ service stations. The properties of the queuing system are


A customer will now have to wait about 12.1 minutes, on average, before using the restroom, and there are, on average, about 12.1 people in line.
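
For reference, here is a sketch of the standard M/M/c formulas that a calculator like that implements (this is not the calculator's code, just the textbook Erlang C computation):

```python
# Sketch: standard M/M/c queue metrics via the Erlang C formula.
# lam = arrival rate, mu = service rate per station, c = number of stations.
from math import factorial

def mmc_queue(lam, mu, c):
    a = lam / mu                      # offered load
    rho = a / c                       # utilization; must be < 1 for stability
    p_wait = (a**c / factorial(c)) / (
        (1 - rho) * sum(a**k / factorial(k) for k in range(c)) + a**c / factorial(c)
    )                                 # Erlang C: probability an arrival must wait
    lq = p_wait * rho / (1 - rho)     # average number waiting in line
    wq = lq / lam                     # average wait in line (in hours here)
    return lq + a, lq, wq * 60        # in system, in line, wait in minutes

print(mmc_queue(60, 8, 10))  # roughly (8.4, 0.92, 0.92)
print(mmc_queue(60, 8, 8))   # roughly (19.6, 12.1, 12.1)
```

These formulas only make sense when $\rho = \lambda/(c\mu) < 1$; with $c = 7$ the system is unstable and the queue, in theory, grows without bound.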

As $\lambda$ and $c\mu$ move closer to each other, what tends to happen is that the average time spent waiting in the queue becomes longer and longer, and the average number of people in the queue becomes larger and larger, until the system reaches instability and the (theoretical) average number of people in the system and the average wait time go to infinity. In the restroom example, what happens if one more toilet breaks and the system now has $c = 7$ service stations? $60 > 7 \times 8 = 56$, and the system very quickly goes from "stable but with a wait" to - well, what you saw reported on the news at Wrigley Field.

So what do I think happened at Wrigley? If we think of each restroom as its own queuing system, the construction meant that there were fewer restrooms for more people, suddenly and drastically increasing $\lambda$ for each system while $\mu$ and $c$ stayed constant - leading to instability of what previously seemed like a very stable system.

02 June, 2015

2015 Number of Win Predictions (through May)

These predictions are based on my own "secret sauce" method, which I definitely think can be improved. I set the nominal coverage at 95% (meaning an interval should contain the team's final win total 95% of the time), but based on tests on previous seasons, the actual coverage at this point in the season is about 93%.

\begin{array} {c c c} 
Team  & Lower  & Upper \\ \hline
         ATL  &  68   &  92 \\
         ARI    &67   &  93 \\
         BAL    &66   &  91 \\
         BOS   & 62   &  87\\
         CHC   & 71   &  95\\
         CHW   & 64  &  89\\
         CIN    &64   &  89\\
         CLE   & 68   &  92\\
         COL   & 65   &  91\\
        DET    &71    & 96\\
        HOU   & 76   & 101\\
        KCR   & 77   & 102\\
        LAA   & 71    & 95\\
        LAD   & 77   & 102\\
        MIA   & 62    & 87\\
        MIL   & 56    & 81\\
        MIN   & 74   & 100\\
        NYM  &  72    & 96\\
        NYY   & 70    & 95\\
        OAK   & 62    & 87\\
        PHI    &56    & 82\\
        PIT   & 72    & 97\\
        SDP   & 67    & 91\\
        SEA   & 67    & 91\\
        SFG   & 74   & 98\\
        STL   & 81   & 106\\
        TBR   & 71   &  97\\
        TEX   & 69   &  93\\
        TOR   & 67    & 91\\
        WSN   & 73 &    97\\ \hline
\end{array}

'It's tough to make predictions, especially about the future.' - Yogi Berra

(Probably not something he actually said, but a great quote nonetheless)