# Fact check: Do high rated keepers perform worse than low rated keepers?

Our good friend Arlington69 has published yet another statistic. This time, he claims to have evidence that high rated keepers perform worse than low rated keepers. As always, we did the fact checking.

## The Claim

In a recent post, Arlington69 presents the results of an experiment aimed at testing how high rated keepers perform relative to lower rated keepers. The results are for some reason reported separately for various time intervals of the matches that went into the experiment.

Arlington69 has made two samples: the first contains matches where the opposing keeper was rated 92+, and the other contains matches where the opposing keeper was rated <92.

He finds that his ability to score goals (goals per shot on target) in some time intervals is better when the keeper is rated 92 or above.

He also did some statistical significance testing, and was able to confirm that the difference in the 76th to 90th minute interval is statistically significant.

The table below reproduces the main part of Arlington69’s statistic.

| Interval (minutes) | G / SoT % (92+ Keepers) | G / SoT % (<92 Keepers) | Difference |
| --- | --- | --- | --- |
| 0-15 | 34% | 37% | -3% |
| 16-30 | 38% | 37% | 1% |
| 31-45 | 40% | 40% | 0% |
| 46-60 | 35% | 38% | -3% |
| 61-75 | 42% | 40% | 2% |
| 76-90 | 47% | 41% | 7% |
| Total | 40% | 39% | 1% |

We assume that the results above are meant to support earlier claims made by Arlington69 about handicapping. In an earlier post, Arlington69 writes that he believes that EA has implemented *“handicapping stopping the highest [rated players from] performing to their potential”* and that there is a *“trend for the game to favour the losing team”*, meaning that *“the better team plays worse and the worse team plays better”*.

So, did Arlington69 manage to prove that EA handicaps players with high rated teams?

## Criticism

It is undisputed that the 7 % difference between the <92 and 92+ keepers observed above is statistically significant (95 % confidence level). But statistical significance isn’t evidence of causality, and even less so of handicapping or whatever claim Arlington69 is trying to prove.

And further, it is difficult to see how these results would fit with Arlington69’s earlier claims about handicapping. When you look at the total set of data rather than cherry picking between the time intervals, the difference is small and clearly insignificant.

In all fairness, Arlington69 doesn’t actually claim that these results support his earlier claims about handicapping. Instead, he has moved on to a slightly modified version of his earlier handicapping claim: Now, the game only handicaps the better team in the 76-90th minute but not in the rest of the match. He doesn’t explain why that would make sense.

The absence of a meaningful rationale is a good reason to question whether the alleged correlation between keeper rating and scoring ability perhaps could be a coincidence. And when we look closely at the significance test, that suspicion becomes even more nagging.

Arlington69 ran a Z-test and found the result to be statistically significant at the 95 % confidence level. This means that if keeper rating made no real difference, a gap at least this large would still show up by chance in up to 5 % of such experiments. We calculated the Z-value using (a) the actual numbers and (b) the same numbers, but with the number of goals scored against 92+ keepers reduced from 112 to 111. That tiny change is sufficient to change the conclusion from significant to *in*significant.
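The sensitivity described above can be reproduced with a short sketch. It assumes a pooled, two-sided two-proportion Z-test, which may differ from the exact variant Arlington69 used. The shot counts (236 and 1,824) are the ones reported for the 76-90 interval, and 112 is the goal count mentioned above. The goal count against <92 keepers is not published, so `GOALS_LOW = 743` is a hypothetical value chosen only because it rounds to the reported 41 %.

```python
from math import sqrt

# Two-sided critical value for a 95 % confidence level.
Z_CRITICAL = 1.96


def two_proportion_z(goals_a, shots_a, goals_b, shots_b):
    """Pooled two-proportion Z-statistic for goals per shot on target."""
    rate_a = goals_a / shots_a
    rate_b = goals_b / shots_b
    pooled = (goals_a + goals_b) / (shots_a + shots_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / shots_a + 1 / shots_b))
    return (rate_a - rate_b) / std_err


# Shots on target in the 76-90 interval (92+ keepers, <92 keepers).
SHOTS_HIGH, SHOTS_LOW = 236, 1824
# ASSUMPTION: hypothetical goal count against <92 keepers (about 41 %).
GOALS_LOW = 743

z_actual = two_proportion_z(112, SHOTS_HIGH, GOALS_LOW, SHOTS_LOW)
z_one_less = two_proportion_z(111, SHOTS_HIGH, GOALS_LOW, SHOTS_LOW)

for label, z in (("112 goals", z_actual), ("111 goals", z_one_less)):
    verdict = "significant" if abs(z) > Z_CRITICAL else "not significant"
    print(f"{label}: z = {z:.2f} -> {verdict}")
```

Under these assumptions, the first Z-value lands just above the threshold and the second just below it. A different assumed goal count for the <92 sample would shift both values, so the sketch illustrates how fragile the result is rather than reproducing Arlington69’s exact figures.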

### Sample sizes

Speaking of significance, the conclusion obviously depends on the sample sizes, since they enter the Z-test directly. Arlington69’s significance test is based on the assumption that the sizes of the two samples are given by the number of shots on target, cf. the table below.

| Interval (minutes) | n (92+ Keepers) | n (<92 Keepers) |
| --- | --- | --- |
| 0-15 | 194 | 1583 |
| 16-30 | 169 | 1753 |
| 31-45 | 246 | 1867 |
| 46-60 | 164 | 1365 |
| 61-75 | 174 | 1551 |
| 76-90 | 236 | 1824 |
| Total | 1183 | 9943 |

But there is a problem here: the two samples are not actually random samples, and the observations are not independent. As Arlington69’s samples are based on his own matches, the same matches, and hence the same match-ups, occur multiple times. This implies that the number of unique match-ups (independent observations) in the two samples is much lower than the numbers above indicate. Since the 76-90 minute result only barely clears the significance threshold, it is unlikely to remain significant once the sample sizes are corrected.
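To get a feel for how much non-independent observations can matter, here is a minimal sketch using the Kish design-effect approximation, which is our choice of illustration and not something Arlington69 applied. The average number of shots on target per match-up (`avg_cluster_size`) and the correlation between shots within the same match-up (`intra_cluster_corr`) are hypothetical values, since neither is known for his data.

```python
def effective_sample_size(n_obs, avg_cluster_size, intra_cluster_corr):
    """Kish approximation: n / (1 + (m - 1) * rho)."""
    design_effect = 1 + (avg_cluster_size - 1) * intra_cluster_corr
    return n_obs / design_effect


# Shots on target against 92+ keepers in the 76-90 interval.
shots_high = 236

# ASSUMPTION: 4 shots per match-up and rho = 0.1 are made-up values.
n_eff = effective_sample_size(shots_high, avg_cluster_size=4,
                              intra_cluster_corr=0.1)

# If both samples shrink by the same design effect, the Z-value
# shrinks by this factor.
z_shrink = (n_eff / shots_high) ** 0.5

print(f"effective n: {n_eff:.0f} of {shots_high}")
print(f"Z-value multiplied by: {z_shrink:.2f}")
```

Even a modest correlation between shots from the same match-up reduces the Z-value noticeably, which is enough to matter for a result that sits right at the significance threshold.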

In addition to that, the fact that the samples are based on Arlington69’s own matches naturally raises a concern in regard to observer bias or perhaps whether certain elements in his own style of playing could contribute to the result.

This is, however, not our main concern. Far from it, actually.

### Sample bias

Above, we discussed the possibility that the results could be a sheer coincidence. But there is another option: namely that there indeed is a correlation – caused by a sample bias.

When reading Arlington69’s article, it becomes clear that he made little effort to ensure that other factors, which possibly could impact the results, were kept out of the equation. Hence, we for example don’t know anything about whether all the other players, not least the players shooting, had the same quality level throughout his experiment.

Our hypothesis is that they didn’t, and that this has introduced a decisive bias in his sample.

When we analyzed Arlington69’s similar data from FIFA 18, we – perhaps not surprisingly – saw that both he and his average opponents improved their squads over the duration of the season.

It is possible – and definitely also likely – that the same applies to his FIFA 20 data. Why is this important?

Well, for FIFA 20, no keepers rated 92 or above were available when the game was released in September 2019. A couple of 92+ rated keepers were released in December, but most were released in the spring and summer of 2020 during TOTSSF (Team of the Season So Far) and the Summer Heat Objectives.

Therefore, two things must apply here:

- The matches involving 92+ rated keepers were played in the later parts of the season.
- The matches involving <92 rated keepers could have been played at any time during the season, but with a bias toward the earlier parts of the season, as 92+ rated keepers became available later.

Consequently, it is both possible and also likely that Arlington69’s own team on average was better when he played against 92+ keepers than when he played against <92 keepers.
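The mechanism can be illustrated with a small worked example. Every number in it is invented for illustration and none comes from Arlington69’s data. In this toy setup, the conversion rate depends only on the phase of the season (a stand-in for the quality of his own squad), and keeper rating has no effect at all. Still, pooling the shots makes 92+ keepers look worse, simply because they only appear late in the season.

```python
# ASSUMPTION: all numbers below are hypothetical illustration values.
# Conversion depends only on the season phase (own squad quality),
# not on the keeper's rating.
CONVERSION = {"early": 0.38, "late": 0.46}

# Shots on target by keeper group and season phase; 92+ keepers
# only exist late in the season.
SHOTS = {
    "<92": {"early": 7000, "late": 3000},
    "92+": {"early": 0, "late": 1000},
}


def pooled_rate(shots_by_phase):
    """Goals per shot on target after pooling all phases together."""
    goals = sum(n * CONVERSION[phase] for phase, n in shots_by_phase.items())
    return goals / sum(shots_by_phase.values())


rates = {group: pooled_rate(shots) for group, shots in SHOTS.items()}
gap = rates["92+"] - rates["<92"]

print(f"<92: {rates['<92']:.1%}, 92+: {rates['92+']:.1%}, gap: {gap:+.1%}")
```

Comparing the two keeper groups within the same phase of the season would remove the gap entirely in this example, which is why access to the match dates in the underlying data matters.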

## Conclusion

As it stands, it is both possible and likely that the results presented by Arlington69 are the product of either coincidence or sampling bias. As for sampling bias, we would like the opportunity to test our hypothesis. We therefore reached out to Arlington69 with a request for him to share his data. Unfortunately, he has so far declined, which is sad mainly because it clearly weakens his case.

Another question that calls for attention here is why Arlington69 decided to split his sample at rating level 92 rather than, for example, test for correlation between rating level and shooting accuracy. Equally strange is the fact that he divided his sample into 15-minute intervals rather than, say, 5-minute intervals. Both decisions appear unnecessary, and they inevitably raise the suspicion that he is cherry picking by presenting his data in a way that makes some of the results appear significant, even though the overall picture is the opposite.
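The cherry-picking concern can be made concrete with a rough calculation. When several intervals are each tested at the 5 % level, the chance that at least one of them comes out significant purely by accident grows with the number of intervals. The sketch below assumes the tests are independent, which is a simplification, since the intervals come from the same matches. The function name is ours.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    is significant at level alpha when there is no real effect."""
    return 1 - (1 - alpha) ** num_tests


# One test on the full 90 minutes, six 15-minute intervals,
# or eighteen 5-minute intervals.
for minutes in (90, 15, 5):
    intervals = 90 // minutes
    chance = chance_of_false_positive(intervals)
    print(f"{intervals:2d} interval(s) of {minutes} min: {chance:.1%}")
```

With six intervals, the chance of at least one accidental "significant" result is already roughly one in four under this assumption, so a single significant interval out of six is weak evidence on its own.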

In addition to the above, we would like to note that there are other experiments, although small in size, which reach the exact opposite conclusion. In this article, redditor Militantxyz tested how his players performed against higher rated keepers. His samples are small and clearly insufficient, but the results directly contradict Arlington69’s claims, even if they are somewhat controversial. Militantxyz concludes that *“[s]hooting at De Gea Draxler was a lot less accurate then when shooting at Adan.”* In other words, higher rated keepers are more effective, but their effectiveness is achieved by making it more difficult for the strikers to hit the target. Whether this conclusion would remain the same in a larger sample is an open question.