29 May, 2015

Bayesian Inference

A new batter is called up to the big leagues. Let's define $\theta$ as his "true" batting average (which we are assuming exists and is constant). In 50 at-bats, the batter gets a hit 15 times. What is our estimate of $\theta$? The answer, surprisingly, depends on your definition of probability.

Frequentist


The methodology taught today in most statistics classes is called frequentist statistics. To a frequentist, the probability of an event $P(A)$ is what you would get if you were to observe a random experiment (for example, an at-bat) an infinite number of times and calculate the proportion of times the event (for example, getting a hit) happened. Let's define $\hat{\theta}$ as our estimate of $\theta$ (in statistics, this is called "theta hat"). Without going into the exact method of estimation, this definition of probability leads to an estimate of the batter's true ability as

$\hat{\theta} = \dfrac{\textrm{# of Hits}}{\textrm{# of At-bats}} = \dfrac{15}{50} = 0.300$

This is intuitive - I think that most people, without any statistical theory at all, would come up with this estimate on their own. The most common frequentist estimators (method of moments and maximum likelihood) agree. So what's the problem?
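As a quick sanity check in code, here is a minimal Python sketch of that point estimate (my own sketch, not necessarily the code linked at the end of the post):

# Frequentist (maximum likelihood) estimate for a binomial model:
# the MLE of the success probability is simply hits / at-bats.
hits = 15
at_bats = 50

theta_hat = hits / at_bats
print(theta_hat)  # 0.3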

Think about something like "The probability that Mike Trout will get injured this season." How does this fit into frequentist inference? There's no way to go back and recreate the entire season. You would have to imagine either a way to rewind time, or some sort of multiple-worlds theory, in order to justify the frequentist interpretation of that probability.

Bayesian


A Bayesian (named after the work of the good reverend Thomas Bayes) defines probability as a numerical representation of a personal belief in the chance of an event happening. To a Bayesian, data does not define the probability, it updates the probability. The Bayesian starts with some sort of prior knowledge and uses the data to update it to posterior (that is, after the data is observed) knowledge.

What exactly the prior is and how to represent prior knowledge is a subject of endless debate, which I will not go into here - and most Bayesians go to great effort to properly determine the prior knowledge before performing inference. Suffice it to say, all approaches fundamentally agree that inference should be based on posterior knowledge, not on the long-term results of any process, real or theoretical.

Let's go back to the baseball player - an observer who is completely uninformed may well observe the 15 hits in 50 at-bats and conclude that 0.300 is a perfectly good value for $\hat{\theta}$, the estimate of the player's "true" batting average. A scout sitting elsewhere in the stadium may recognize that the player has excellent mechanics and calculate a value above 0.300 for $\hat{\theta}$. A different scout may know that the player did not perform this well in the minor leagues and end up with a value lower than 0.300 for $\hat{\theta}$. To a Bayesian, these are all mathematically valid estimates that reflect each observer's prior information, updated by the same data.

(I'd like to note here that I'm being somewhat "subjectivist" in this example to emphasize how different priors can lead to different posteriors - but there are those who would argue that none of the above observers are correct, and that there exists a correct prior that can and should be constructed from the large body of historical knowledge about baseball.)

Let's use the batting average example in a more formal mathematical setting.

Beta-Binomial Example


The beta-binomial model assumes that the number of hits $k$ in $n$ at-bats with true success probability $\theta$ has a probability mass function given by

$p(k | \theta, n) = {n \choose k} \theta^k (1-\theta)^{n-k}$

For the example batter, $k = 15$ and $n = 50$.
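As a concrete illustration, the likelihood of this exact outcome under a candidate value of $\theta$ can be evaluated directly with scipy (a Python sketch of my own; the post's linked code may do this differently):

from scipy.stats import binom

n, k = 50, 15        # at-bats and hits
theta = 0.300        # a candidate value of the true batting average

# Probability of exactly 15 hits in 50 at-bats if theta were really 0.300
print(binom.pmf(k, n, theta))  # roughly 0.12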

For the distribution of prior belief, a beta distribution is used. This is a continuous distribution on the range $[0,1]$ defined by two parameters $\alpha$ and $\beta$:

$p(\theta | \alpha, \beta) = \dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta) }$

The values chosen for $\alpha$ and $\beta$ represent the prior knowledge ($B(\alpha, \beta)$ is known as the beta function - its result is a normalizing constant that makes the entire distribution integrate to 1). The expected value of the beta distribution is given by $E[\theta] = \alpha/(\alpha + \beta)$.
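To make these pieces concrete, here is a small Python sketch (with illustrative parameter values of my own choosing) that evaluates the beta density both directly from the formula above and via scipy, and checks the expected value:

from scipy.special import beta as beta_fn    # the beta function B(alpha, beta)
from scipy.stats import beta as beta_dist

alpha, b = 10, 30    # illustrative prior parameters
theta = 0.25

# Density computed directly from the formula above...
direct = theta**(alpha - 1) * (1 - theta)**(b - 1) / beta_fn(alpha, b)
# ...matches scipy's built-in beta density
print(direct, beta_dist.pdf(theta, alpha, b))

# Expected value alpha / (alpha + beta)
print(beta_dist.mean(alpha, b))  # 0.25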

Let's say that two observers (call them A and B) observe the baseball player get the 15 hits in 50 at-bats. Observer A knows nothing about baseball or the player, so this observer's distribution of prior belief could be represented by a distribution which says anything between 0 and 1 is equally likely - this is given by a beta distribution with $\alpha = 1$ and $\beta = 1$ (this is actually a more informative prior distribution than you might think, but that's another discussion). Observer B is a baseball fan but doesn't know much about the player himself, so observer B picks $\alpha = 53$ and $\beta = 147$ - this puts most of the probability between 0.200 and 0.300, with an average of 0.265.



(As a side note, if you want to think of the information in these priors in baseball terms - in the mind of observer A, before even playing a single game, the player starts off 1 for 2. In the mind of observer B, before even playing a single game, the player starts off 53 for 200.)
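Here is how those two priors look in code (again a Python sketch of my own, not necessarily the post's code):

from scipy.stats import beta as beta_dist

# Observer A: uniform prior, Beta(1, 1)
a_A, b_A = 1, 1
# Observer B: informed prior, Beta(53, 147)
a_B, b_B = 53, 147

print(beta_dist.mean(a_A, b_A))   # 0.5
print(beta_dist.mean(a_B, b_B))   # 0.265

# How much of observer B's prior belief sits between 0.200 and 0.300?
mass = beta_dist.cdf(0.300, a_B, b_B) - beta_dist.cdf(0.200, a_B, b_B)
print(mass)   # most of the prior probability (roughly 0.85)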

Without going into the mathematical details of Bayes' theorem, the posterior distribution that represents belief in values of $\theta$ is given by

$p(\theta | n, k, \alpha, \beta) =  \dfrac{\theta^{\alpha + k-1}(1-\theta)^{\beta + n - k-1}}{B(\alpha + k, \beta + n - k)}$

That is to say, it's a beta distribution with new parameters $\alpha' = \alpha + k$ and $\beta' = \beta + n - k$.
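The update rule itself is one line of code. The following Python sketch (my own check, using observer A's uniform prior) also verifies numerically that multiplying the prior by the likelihood and renormalizing really does give the same beta density:

import numpy as np
from scipy.stats import beta as beta_dist, binom

def posterior_params(alpha, beta, k, n):
    # Conjugate beta-binomial update:
    # Beta(alpha, beta) prior + k hits in n at-bats -> Beta(alpha + k, beta + n - k)
    return alpha + k, beta + n - k

alpha, b, k, n = 1, 1, 15, 50                 # observer A's uniform prior
theta = np.linspace(0.0005, 0.9995, 1000)     # grid over possible batting averages
dtheta = theta[1] - theta[0]

# Prior times likelihood, renormalized on the grid...
unnormalized = beta_dist.pdf(theta, alpha, b) * binom.pmf(k, n, theta)
numeric_posterior = unnormalized / (unnormalized.sum() * dtheta)

# ...matches the closed-form Beta(alpha + k, beta + n - k) density
a_post, b_post = posterior_params(alpha, b, k, n)
closed_form = beta_dist.pdf(theta, a_post, b_post)
print(np.abs(numeric_posterior - closed_form).max())  # very close to zero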

So for example, $\alpha'_A = 1 + 15 = 16$ and $\beta'_A = 1 + 50 - 15 = 36$, while $\alpha'_B = 53 + 15 = 68$ and $\beta'_B = 147 + 50 - 15 = 182$. The posterior distributions for observers A and B are therefore $\textrm{Beta}(16, 36)$ and $\textrm{Beta}(68, 182)$, respectively.

The two estimators, which I will choose as $\alpha'/(\alpha' + \beta')$ - the expected values of the posterior distributions - are then given by $\hat{\theta}_A \approx 0.308$ and $\hat{\theta}_B \approx 0.272$. These shouldn't be surprising - observer A didn't have much information to work with, so their estimate is close to 0.300 (but higher! - remember, the average of their prior belief was 0.5). Observer B used much more information in their estimate, so their posterior estimate is much closer to their prior estimate. Realistically, 50 at-bats isn't much to work with in determining the "true" batting average.

(Again as a side note, in baseball terms, the player is now 16 for 52 in the mind of observer A and 68 for 250 in the mind of observer B)
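For completeness, here is the same arithmetic in a few lines of Python (again a sketch of my own):

hits, at_bats = 15, 50
priors = {"A": (1, 1), "B": (53, 147)}

for name, (alpha, beta) in priors.items():
    a_post = alpha + hits
    b_post = beta + at_bats - hits
    posterior_mean = a_post / (a_post + b_post)
    print(name, a_post, b_post, round(posterior_mean, 3))

# A: Beta(16, 36)  -> posterior mean ~0.308
# B: Beta(68, 182) -> posterior mean ~0.272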

Choice


Note the posterior parameters $\alpha' = \alpha + k$ and $\beta' = \beta + n - k$ - no matter what you choose for $\alpha$ and $\beta$, their effect fades as the number of at-bats gets large (and so $n$ and $k$ get large). No matter what the prior belief is, given enough data, the particular Bayesian approach chosen here will eventually end up with essentially the same estimate.
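To see this in action, here is a quick hypothetical sketch: hold the batter's hit rate at 0.300 but scale up the number of at-bats (these larger samples are made up purely for illustration), and the two posterior means converge:

priors = {"A": (1, 1), "B": (53, 147)}

# Hypothetical larger samples with the same 0.300 hit rate
for n in (50, 500, 5000, 50000):
    k = int(0.3 * n)
    for name, (alpha, beta) in priors.items():
        posterior_mean = (alpha + k) / (alpha + beta + n)
        print(n, name, round(posterior_mean, 4))

# As n grows, both posterior means approach k / n = 0.300,
# regardless of the prior.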

I want to end by saying that there's somewhat of a false belief that the Bayesian and frequentist ideologies are somehow "opposed" to each other. Though this may have been true 50 years ago, it is no longer the case - most statisticians are trained with and are comfortable using either. In fact, there are ways to have Bayesian methods approximate frequentist methods, and to have frequentist methods approximate Bayesian methods! The most important thing is to make sure that whatever approach you choose is appropriate for the problem you are trying to solve.

The code for all the graphs and calculations in this article may be found on my GitHub.
