##
**Maximum Likelihood**

Let's say a batter has some true batting average $\theta$ (once again, representing the limiting batting average over millions of at bats - so taking a frequentist view of probability). Assume each of this batter's at-bats are independent, and that he gets a hit with probability $\theta$ and does

*not*get a hit with probability $1-\theta$.

In three at-bats, the batter gets a hit once and does not get a hit twice (I'll call these H and NH for short). How do I use this information to give an estimate $\hat{\theta}$ of the true batting average? Think about the joint probability of seeing one hit and two non-hits. If the true batting average were actually $\theta = 0.100$, it would probably be more likely to see three non-hits. Conversely, if by some miracle the true batting average were actually $\theta = 0.700$, we would probably expect to see more than one hit. Since $\theta$ is a fixed quantity, we can't talk about $P(\theta)$ - but what we can talk about is the joint probability of the data

*given*a specific value of $\theta$, and let that guide our inference.

We assumed that at-bats were independent. This is important, because the assumption of independence allows us to exactly calculate the joint probability of one hit and two non-hits as

$P(H, NH, NH) = P(H)P(NH)P(NH) = (\theta)(1-\theta)(1-\theta) = \theta^1(1-\theta)^2$

and the principle of maximum likelihood says that I should use this formula pick my estimate $\hat{\theta}$. How? By choosing $\hat{\theta}$ as the value of $\theta$ that maximizes the joint probability. Essentially, it's the answer to the question of what "true" batting average would make it the most likely to get one hit and two non-hits. Whatever the answer is, that's our estimate of the true batting average.

What
happens if I plug $\theta = 0.5$ into the equation above? Then I get a joint probability of $0.125$. What happens if I plug $\theta = 0.1$ in? I get a joint
probability of $0.081$. What happens if I plug $\theta = 0.333$ in? I get a
joint probability of approx $0.148$. As it turns out, this is the maximum
- so we would use $\hat{\theta} = 0.333$, because it makes it most
likely to see the data that we saw

(The code for this graph may be found on my github)

The obvious answer isn't always the correct one, but it is here - and it should serve as a sanity check here that the method of maximum likelihood isn't suggesting that a player with one hit and two non-hits is actually a $0.900$ batter.

##
**Deriving a General Maximum Likelihood Estimator**

That was one example. However, you'd probably want to derive a more general method - see if you can figure out what the estimator is going to be before you actually see the data.

Once again, define a model as $x_i = 0$ if the player does not get a hit, and $x_i = 1$ if the player does get a hit. Then $P(x_i = 1) = \theta$ and $P(x_i = 0) = 1 - \theta$. The full mass function is then $P(x_i = k) = \theta^{x_i}(1-\theta)^{1-x_i}$, limiting the possible responses only to $x_i = 1$ or $x_i = 0$.

Say that $x_1, x_2,..., x_n$ represents the outcome of $n$ at-bats. Let's again assume that each at-bat is independent and identical. That allows us to write the joint probability

$P(x_1, x_2, ..., x_n) = P(x_1)P(x_2)...P(x_n) = \displaystyle \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$

The formula looks complicated, but it's essentially

$P(x_1, x_2,...,x_n) = \theta^{\textrm{# Hits}}(1-\theta)^{\textrm{# Non-Hits}}$

I'm going to write this joint probability as a function of the parameter $\theta$ - call this the likelihood of $\theta$

$L(\theta| x_1, x_2,..., x_n) = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$

The goal is to maximize this function with respect to $\theta$ (i.e., find the value of $\theta$ that makes this the biggest). First, though, I'm going to take the natural log of both sides. There are a few reasons for doing this - it gives the same result but makes the math easier, it makes numerical computations more stable if/when you choose to do them, and there's statistical theory that we'll get to in a bit that suggests it's the correct thing to do. This gives a new function of

$\ell(\theta | x_1, x_2, ..., x_n) = \log(L(\theta | x_1, x_2,...,x_n)$

From now on, I'm going to drop the $x_1, x_2,...,x_n$ to save space - the likelihood function still depends on the data, I'm just going to use $\ell(\theta)$ in place of $\ell(\theta| x_1, x_2, ..., x_n)$ now. The log-likelihood is then

$\ell(\theta) = \log(\theta^{\sum x_i} (1-\theta)^{n - \sum x_i}) = \sum x_i \log(\theta) + (n - \sum x_i) \log(1-\theta)$

What's the best way to find a maximum of a function? Take the derivative, set it equal to zero, and solve.

$\ell'(\theta) = \dfrac{\sum x_i}{\theta} - \dfrac{n - \sum x_i}{1-\theta} = 0$

Solving this for $\theta$ yields $\theta = \sum x_i / n$, so the maximum likelihood estimator is $\hat{\theta} = \sum x_i/n$ - that is, the number of hits divided by the number of at-bats - the sample batting average.

##
**From the Estimator to the Interval**

In statistics, "information" is a real thing - and a measurable quantity. In general, the negative second derivative of the log likelihood function

$J(\theta | x_1, x_2, ..., x_n) = -\dfrac{\partial^2}{\partial \theta^2} \ell(\theta)$

is known as the observed information. The observed information is dependent on the data, so taking the expectation gives the expected (or Fisher) information.

$I(\theta) = E\left[-\dfrac{\partial^2}{\partial \theta^2} \ell(\theta)\right]$

Notice that $I(\theta)$ does

*not*depend on the data, since it is acquired by taking the expectation over the data. Statistical theory says that under a few pretty reasonable conditions,

$\hat{\theta} \sim N\left(\theta, \dfrac{1}{I(\theta)}\right)$

That is, the maximum likelihood estimator $\hat{\theta}$ follows a normal distribution with mean $\theta$ (the true parameter) and variance given by one over the expected information.

This suggest a confidence interval of the form

$\hat{\theta} \pm z^* \sqrt{\dfrac{1}{I(\theta)}}$

Where $z^*$ is an appropriate quantile from a normal distribution. Of course, we don't actually know what $\theta$ is, so we have to estimate $I(\theta)$ - common approaches are to use $I(\hat{\theta})$ or $J(\hat{\theta} | x_1, x_2, ..., x_n)$ - that is, the expected or observed information with our estimator $\hat{\theta}$ in place of $\theta$.

##
**Applied to Batting Averages**

**Recall that for the at-bat model with $P(x_i = 1) = \theta$ and $P(x_i = 0) = 1 - \theta$, the first derivative of the likelihood is**

$\ell'(\theta) = \dfrac{\sum x_i}{\theta} - \dfrac{n - \sum x_i}{1-\theta}$

The second derivative is

$\ell''(\theta) = -\dfrac{\sum x_i}{\theta^2} - \dfrac{n - \sum x_i}{(1-\theta)^2}$

$J(\theta | x_1, x_2, ..., x_n) = -\ell''(\theta) = \dfrac{\sum x_i}{\theta^2} + \dfrac{n - \sum x_i}{(1-\theta)^2}$

And making it negative gives the observed information as

Lastly, we must take the derivative with respect to $x_i$. Remember that $E[x_i] = \theta$ by the probability model, and it's not hard to show from the definition of expected value that $E[aX + b] = aE[X] + b$, where $X$ is a random variable and $a$ and $b$ are constants. The fisher information is then

$I(\theta) = E[-\ell''(\theta) ]= E\left[\dfrac{\sum x_i}{\theta^2} + \dfrac{n - \sum x_i}{(1-\theta)^2}\right]$

$ = \dfrac{E[\sum x_i]}{\theta^2} + \dfrac{n - E[\sum x_i]}{(1-\theta)^2} = \dfrac{n\theta}{\theta^2} + \dfrac{n - n\theta}{(1-\theta)^2} = \dfrac{n}{\theta} + \dfrac{n}{1-\theta}$

And doing a little bit of algebra gives us

$I(\theta) = \dfrac{n}{\theta(1-\theta)}$

Since we don't know $\theta$, replace it with $\hat{\theta}$. Then by Wald theory, a 95% confidence interval for $\theta$ is given by

$\hat{\theta} \pm z^* \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

where $\hat{\theta} = \sum x_i/n$ is just the sample batting average. As it turns out, this is the same interval that you get from central limit theorem (moment-based) inference! This doesn't always have to happen - turns out the sample batting average is just a really great estimator for the true batting average, assuming independent and identical at-bats.

##

Unlike the central limit theorem interval, Wald theory intervals do not require that $E[x_i] = \theta$ - the maximum likelihood/inverse information approach will work to give intervals for a specific parameter even if the expected value is a function of one or more variables. Furthermore, if you are coding a maximum likelihood estimator by hand (using, say, the Newton-Rhapson algorithm) then you will have to calculate the first and second derivatives anyway - so they tie in very well with numerical methods. In fact, often times people only go so far as to calculate the log-likelihood - all other information needed to create a Wald interval can be found in the output of the numerical maximization program.

A disadvantages is that Wald theory intervals rely on the asymptotic normality of the maximum likelihood estimator, and so sample size considerations again come into play - with the graph above for one hit and two-non hits, you can clearly see that the distribution is non-normal, and so for small sample sizes nominally 95% Wald intervals will fail to reach that coverage. And you have the same problem as with the central limit theorem intervals - it can give numbers in an unrealistic range (batting averages less than 0 or more than 1).

##
**Advantages and Disadvantages**

Unlike the central limit theorem interval, Wald theory intervals do not require that $E[x_i] = \theta$ - the maximum likelihood/inverse information approach will work to give intervals for a specific parameter even if the expected value is a function of one or more variables. Furthermore, if you are coding a maximum likelihood estimator by hand (using, say, the Newton-Rhapson algorithm) then you will have to calculate the first and second derivatives anyway - so they tie in very well with numerical methods. In fact, often times people only go so far as to calculate the log-likelihood - all other information needed to create a Wald interval can be found in the output of the numerical maximization program.

A disadvantages is that Wald theory intervals rely on the asymptotic normality of the maximum likelihood estimator, and so sample size considerations again come into play - with the graph above for one hit and two-non hits, you can clearly see that the distribution is non-normal, and so for small sample sizes nominally 95% Wald intervals will fail to reach that coverage. And you have the same problem as with the central limit theorem intervals - it can give numbers in an unrealistic range (batting averages less than 0 or more than 1).

## No comments:

## Post a Comment