25 October, 2023

For hitters, cold streaks run colder than hot streaks run hot

This blog post is just the product of a thought exercise: how much information do you get from a certain number of plate appearances? Suppose we observe $n = 25$ plate appearances for a batter. If the batter gets on-base $x = 5$ times, is that the same amount of information as if the batter gets on-base $x = 10$ times? 

The answer is no. As it turns out, the batter reaching base fewer times is more informative about the batter being "bad" than the batter reaching base more times is about the batter being "good." How is this possible? Consider the very simple case of forming a standard 95% confidence interval for a binomial proportion. From any statistics textbook, this is just

 

$\hat{p} \pm 1.96 \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

 

where $\hat{p}$ is the proportion of on-base events for $n$ plate appearances. Consider the second term, which I will refer to as the "margin of error" and which controls the width of the confidence interval. For $n = 25$ plate appearances and $x = 5$ on-base events, $\hat{p} = 5/25 = 0.2$ and the margin of error is


$1.96 \sqrt{\dfrac{0.2(1-0.2)}{25}} = 0.1568$

 

For $x = 10$ on-base events, $\hat{p} = 10/25 = 0.4$ and the margin of error is


$1.96 \sqrt{\dfrac{0.4(1-0.4)}{25}} = 0.19204$

 

The margin of error is about 22% larger for $\hat{p} = 0.4$ than for $\hat{p} = 0.2$! There is more uncertainty with the better result.
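As a quick check, here is a minimal Python sketch that reproduces both margins of error and the gap between them (the function name `margin_of_error` is my own, not anything from a library):

```python
import math

def margin_of_error(x, n, z=1.96):
    """Half-width of the standard Wald confidence interval for a proportion."""
    p_hat = x / n
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

moe_bad = margin_of_error(5, 25)     # 0.1568
moe_good = margin_of_error(10, 25)   # 0.19204
print(moe_bad, moe_good, moe_good / moe_bad - 1)  # ratio minus 1 is about 0.22
```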

Going to a Bayesian framework does not fix this issue, except possibly when a heavily weighted (informative) prior is used, which would be hard to justify in practice. Suppose the number of on-base events $x$ in $n$ plate appearances once again follows the (overly simplistic) binomial distribution with parameter $p$, and that $p$ is assumed to have a $Beta(1,1)$ prior, which is the simple uniform case.


$x \sim Bin(n, p)$

$p \sim Beta(1,1)$


For the case of $x = 5$ on-base events in $n = 25$ plate appearances, the posterior distribution, its standard deviation, and the 95% central credible interval are


$p | x = 5, n = 25 \sim Beta(6, 21)$

$SD(p | x = 5, n = 25) = \sqrt{\dfrac{(6)(21)}{(6 + 21)^2 (6 + 21 + 1)}} = 0.0786$

95% Central CI: $(0.0897,0.3935)$


For the case of $x = 10$ on-base events in $n = 25$ plate appearances, the posterior distribution, its standard deviation, and the 95% central credible interval are


$p | x = 10, n = 25 \sim Beta(11, 16)$

$SD(p | x = 10, n = 25) = \sqrt{\dfrac{(11)(16)}{(11 + 16)^2 (11 + 16 + 1)}} = 0.0929$

95% Central CI: $(0.2335,0.5942)$


Once again, the scenario with worse performance ($x = 5$ on-base events) has a smaller standard deviation, implying there is less posterior uncertainty about $p$. In addition, the width of the 95% central credible interval is smaller for $x = 5$ ($0.3038$) than for $x = 10$ ($0.3608$).
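Both posterior summaries are easy to reproduce. The sketch below assumes `scipy` is available and uses its standard `beta` distribution object; the helper name `posterior_summary` is my own:

```python
from scipy.stats import beta

def posterior_summary(x, n, a=1, b=1):
    """Posterior SD, 95% central credible interval, and its width under a Beta(a, b) prior."""
    post = beta(a + x, b + n - x)   # conjugate Beta posterior
    lo, hi = post.interval(0.95)    # central 95% credible interval
    return post.std(), (lo, hi), hi - lo

print(posterior_summary(5, 25))    # SD ~ 0.0786, interval ~ (0.0897, 0.3935), width ~ 0.304
print(posterior_summary(10, 25))   # SD ~ 0.0929, interval ~ (0.2335, 0.5942), width ~ 0.361
```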

So how much information is in the $n$ plate appearances? One way to define information, in a probabilistic sense, is with a concept called the observed information. The observed information is a sample-based estimate of the Fisher information, which measures the amount of information that a sample carries about an unknown parameter $\theta$. Unfortunately, calculating the Fisher information requires knowing the parameter in question, and so in practice it is estimated. The log-likelihood of a set of observations $\tilde{x} = \{x_1, x_2, \dots, x_n\}$ is defined as


$\displaystyle \ell (\theta | \tilde{x}) = \sum_{i = 1}^n \log[ f(x_i | \theta)]$

 

And the observed information is defined as the negative second derivative of the log-likelihood, taken with respect to $\theta$ and evaluated in practice at an estimate such as the maximum likelihood estimate.

 

$I(\tilde{x}) = -\dfrac{d^2}{d \theta^2}  \ell (\theta | \tilde{x})$

 

Note that Bayesians may replace the log-likelihood with the log-posterior distribution of $\theta$.

In general, the observed information must be derived for each model, but it is known in closed form for common distributions. For the binomial distribution, the observed information evaluated at the estimate $\hat{p}$ is

 

$I(\hat{p}) = \dfrac{n}{\hat{p}(1-\hat{p})}$


where $\hat{p} = x/n$. Hence, for the case of $x = 5$ on-base events in $n = 25$ plate appearances, $I(0.2) = 156.25$. For the case of $x = 10$ on-base events in $n = 25$ plate appearances, $I(0.4) = 104.1667$. There is quite literally more information in the case where the batter performed worse.
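If you want to verify the closed form against the definition, a short numerical sketch works; the helper names below are mine, and the log-likelihood drops the binomial coefficient, which does not depend on $p$:

```python
import math

def log_likelihood(p, x, n):
    # Binomial log-likelihood, up to an additive constant not involving p
    return x * math.log(p) + (n - x) * math.log(1 - p)

def observed_info_numeric(x, n, h=1e-5):
    # Negative second derivative of the log-likelihood at p_hat, via central differences
    p_hat = x / n
    f = lambda p: log_likelihood(p, x, n)
    return -(f(p_hat + h) - 2 * f(p_hat) + f(p_hat - h)) / h**2

def observed_info_closed_form(x, n):
    p_hat = x / n
    return n / (p_hat * (1 - p_hat))

print(observed_info_numeric(5, 25), observed_info_closed_form(5, 25))    # both ~156.25
print(observed_info_numeric(10, 25), observed_info_closed_form(10, 25))  # both ~104.17
```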

For pitchers, the opposite is the case. If we model the number of events in a fixed number of innings (such as runs allowed per nine innings or walks plus hits per inning pitched) as a count, the most appropriate simple distribution is the Poisson distribution with parameter $\lambda$. For $n$ observations (innings), the observed information is


$I(\hat{\lambda}) = \dfrac{n}{\hat{\lambda}}$


where $\hat{\lambda} = \bar{x}$, the sample mean of observations. 

Imagine two pitchers: one has allowed $x = 30$ walks plus hits in $n = 25$ innings pitched, while the other has allowed $x = 20$ walks plus hits in $n = 25$ innings pitched. For which pitcher do we have more information about their abilities?

For the first pitcher, their sample WHIP is $\hat{\lambda} = 30/25 = 1.2$ and their observed information is $I(1.2) = 25/1.2 = 20.8333$. For the second pitcher, their sample WHIP is $\hat{\lambda} = 20/25 = 0.8$ and their observed information is $I(0.8) = 25/0.8 = 31.25$. Hence, we have more information about the pitcher who has performed better.
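The same arithmetic in a few lines of Python (again, the function name is my own):

```python
def poisson_observed_info(events, innings):
    """Observed information n / lambda_hat for a Poisson model of per-inning counts."""
    lam_hat = events / innings   # sample WHIP
    return innings / lam_hat

print(poisson_observed_info(30, 25))  # ~20.83 (worse pitcher)
print(poisson_observed_info(20, 25))  # 31.25 (better pitcher)
```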

So the situation is reversed for batters and pitchers. For batters, we tend to have more information when they perform poorly. For pitchers, we tend to have more information when they perform well. This suggests certain managerial strategies in small samples: it is justifiable to pull a poorly performing batter, but also perhaps justifiable to allow a poorly performing pitcher to have more innings. We just have more information about the bad hitter than we do about the bad pitcher, thanks to information theory.