Is there any relationship between the number of runs scored in an inning and the number of runs allowed in an inning? Intuitively, it seems like there shouldn't be - but perhaps choosing to increase the rate of runs scored per inning means sacrificing defense, and therefore increasing the rate of runs allowed per inning.
Below is a table of the total runs scored and allowed in innings for all major league baseball teams in 2014.
RA
RS 0 1 2 3 4 5 6 7 8 9
0 23628 4446 2042 935 408 143 66 32 12 5
1 4446 862 421 188 69 39 11 3 2 2
2 2042 421 178 94 25 17 7 2 0 0
3 935 188 94 30 11 5 2 2 0 0
4 408 69 25 11 0 1 2 0 1 0
5 143 39 17 5 1 2 0 0 0 0
6 66 11 7 2 2 0 0 0 0 0
7 32 3 2 2 0 0 0 0 0 0
8 12 2 0 0 1 0 0 0 0 0
9 5 2 0 0 0 0 0 0 0 0
This table does not include all innings, as teams that did not finish the ninth inning will not have a corresponding runs scored/allowed. Also note that this table does double count innings, as one team's RS is another team's RA, and vice versa - I will look at specific teams in a moment to address this.
So for example, in 2014, there were 23,628 innings in which a team both gave up 0 runs and scored 0 runs. How can we use this to test for a relationship between runs scored and runs allowed? A chi-square test could work, but many of the cells have 0s in them - not ideal. Fisher's exact test requires that the row and column totals be fixed - clearly not the case here - and Fisher's exact test is more commonly used with smaller sample sizes and smaller contingency tables (though it is still valid for larger sample sizes).
The solution I chose is to combine innings with 5, 6, 7, 8, or 9 runs (scored or allowed) into a single category that I called "5+". The contingency table is then
RA
RS 0 1 2 3 4 5+
0 23628 4446 2042 935 408 258
1 4446 862 421 188 69 57
2 2042 421 178 94 25 26
3 935 188 94 30 11 9
4 408 69 25 11 0 4
5+ 258 57 26 9 4 2
This is still a bit awkward - a few cells have counts less than 10* - but overall much, much better than before. Performing a chi-square test on this contingency table (I chose to simulate the p-value in R) yields a p-value of approximately 0.178 - not significant at any reasonable $\alpha$.
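The collapsing step and the chi-square statistic can be sketched in a few lines. The post's analysis was done in R; what follows is a plain Python sketch, and the function names (`collapse`, `expected_counts`, `chi_square_statistic`) are made up for illustration. It rebuilds the 6x6 table from the full 10x10 table, computes the expected counts under independence, and lists the cells whose expected count falls below 5 (a common rule of thumb, assumed here rather than taken from the post).

```python
# Pooled 2014 table from the post: rows are runs scored (0-9),
# columns are runs allowed (0-9).
full = [
    [23628, 4446, 2042, 935, 408, 143, 66, 32, 12, 5],
    [4446, 862, 421, 188, 69, 39, 11, 3, 2, 2],
    [2042, 421, 178, 94, 25, 17, 7, 2, 0, 0],
    [935, 188, 94, 30, 11, 5, 2, 2, 0, 0],
    [408, 69, 25, 11, 0, 1, 2, 0, 1, 0],
    [143, 39, 17, 5, 1, 2, 0, 0, 0, 0],
    [66, 11, 7, 2, 2, 0, 0, 0, 0, 0],
    [32, 3, 2, 2, 0, 0, 0, 0, 0, 0],
    [12, 2, 0, 0, 1, 0, 0, 0, 0, 0],
    [5, 2, 0, 0, 0, 0, 0, 0, 0, 0],
]


def collapse(table, cap):
    """Merge every row and column with index >= cap into one 'cap+' category."""
    size = cap + 1
    out = [[0] * size for _ in range(size)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            out[min(i, cap)][min(j, cap)] += count
    return out


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]


def chi_square_statistic(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


collapsed = collapse(full, 5)
expected = expected_counts(collapsed)
statistic = chi_square_statistic(collapsed)

# Cells (row, column) whose expected count is below the assumed threshold of 5.
small_cells = [
    (i, j)
    for i, row in enumerate(expected)
    for j, e in enumerate(row)
    if e < 5
]

for row in collapsed:
    print(row)
print("chi-square statistic:", round(statistic, 2))
print("cells with expected count < 5:", small_cells)
```

Checking the expected counts rather than the observed ones is what the test's assumption actually refers to, which is the point of the footnote.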
Individual Teams
As mentioned before, the table used was cheating - I combined all data for all major league teams, and so every result was double counted in some way. One way to address this is to look at each team individually, which has the added bonus of addressing the potential issue that dependence was washed out by the pooling. Below is a table of runs scored and runs allowed by the Atlanta Braves in the 2014 season.
RA
RS 0 1 2 3 4 5 6
0 809 193 61 28 6 6 2
1 125 33 13 3 1 0 0
2 62 15 1 1 0 0 0
3 26 4 4 0 1 1 0
4 14 1 3 0 0 0 0
5 2 3 0 0 0 0 0
6 2 1 0 0 0 0 0
7 0 0 0 1 0 0 0
Again, not all innings were accounted for, since games that ended with only half of the ninth inning played would not have a corresponding RS or RA. To reduce the number of cells with zeroes, innings with 4 or more runs were combined into a single category.
RA
RS 0 1 2 3 4+
0 809 193 61 28 14
1 125 33 13 3 1
2 62 15 1 1 0
3 26 4 4 0 2
4+ 18 5 3 1 0
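The simulated p-value for the chi-square test with 100,000 simulations is 0.308 - still not significant. (Simulated p-values vary slightly from run to run, which is presumably why the table below lists 0.309 for Atlanta.) In fact, computing the p-value for each team's contingency table with 100,000 simulations yields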
Team p-value
ATL 0.309
ARI 0.559
BAL 0.627
BOS 0.151
CHC 0.774
CHW 0.735
CIN 0.298
CLE 0.166
COL 0.986
DET 0.977
HOU 0.665
KCR 0.913
LAA 0.512
LAD 0.789
MIA 0.497
MIL 0.905
MIN 0.429
NYM 0.873
NYY 0.337
OAK 0.813
PHI 0.760
PIT 0.986
SDP 0.849
SEA 0.115
SFG 0.206
STL 0.902
TBR 0.629
TEX 0.126
TOR 0.597
There are no p-values below 0.1, so it looks reasonable to assume that runs scored and runs allowed per inning are independent.
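For readers who want to see what a simulated p-value involves, here is a minimal Python sketch using the collapsed Braves table. It is not the R code used for the numbers above. It assumes a permutation scheme: expand the table into one (RS, RA) pair per inning, shuffle the RA labels so that both sets of margins stay fixed, and count how often the shuffled table's chi-square statistic is at least as large as the observed one. The function names, the seed, and the small simulation count (1,000 instead of 100,000, to keep the run short) are all choices made for illustration.

```python
import random

# Collapsed Braves table from the post: rows are RS (0, 1, 2, 3, 4+),
# columns are RA (0, 1, 2, 3, 4+).
braves = [
    [809, 193, 61, 28, 14],
    [125, 33, 13, 3, 1],
    [62, 15, 1, 1, 0],
    [26, 4, 4, 0, 2],
    [18, 5, 3, 1, 0],
]


def chi_square(table):
    """Pearson chi-square statistic for a table with positive margins."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat


def simulated_p_value(table, n_sim=1000, seed=0):
    """Monte Carlo p-value from shuffling column labels (margins stay fixed)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])

    # One row label and one column label per inning.
    rows, cols = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            rows.extend([i] * count)
            cols.extend([j] * count)

    observed = chi_square(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(cols)  # break any RS/RA association
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rows, cols):
            sim[i][j] += 1
        if chi_square(sim) >= observed:
            hits += 1

    # Add-one correction: a common convention so the estimate is never 0.
    return (hits + 1) / (n_sim + 1)


observed_stat = chi_square(braves)
p_value = simulated_p_value(braves, n_sim=1000, seed=0)
print("chi-square statistic:", round(observed_stat, 2))
print("simulated p-value:", round(p_value, 3))
```

With only 1,000 shuffles the estimate is noisier than one based on 100,000 simulations, so it should land in the same neighborhood as the value reported for Atlanta without matching it exactly.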
This is definitely not the only - or even the most appropriate - way to answer the question of independence between runs scored and runs allowed, but represents an ad-hoc test of the sort that can be useful as a sanity check before proceeding onto other analyses.
*The chi-square test for independence actually assumes that the expected cell counts are sufficiently large (a common rule of thumb is at least 5) - but looking at the observed cell counts is useful as an approximation.