# Fact check: Do high rated keepers perform worse than low rated keepers?

Our good friend Arlington69 has published yet another article on his new blog, Futingfeefa.com. This time, he claims to have evidence showing that high rated keepers perform worse than low rated keepers in certain parts of the match. Although we admittedly found the claim more than a tad unlikely, we decided to let it undergo our normal stress test.

It’s time for another FUTfacts fact check.

## The Claim

Let’s start with a brief summary of Arlington69’s method, results and conclusion.

### Method

The essence of his study is a comparison of the opposing keeper’s performance in two samples of his own matches. Sample #1 contains all matches where the opponent’s keeper was rated 92+. Sample #2 contains all matches where the opponent’s keeper was rated <92. He uses his own ability to convert shots into goals (Goals per Shot on Target or G/SoT) as a measure of the opposing keeper’s performance.

The two samples are divided into 15 minute time intervals, which are analyzed separately, meaning that he compares the performance per 15 minute match section. He also drills down to 5 minute intervals.
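As a rough sketch of the metric described above, the snippet below buckets shots on target into fixed-width intervals and computes G/SoT per bucket. The function names, the `(minute, is_goal)` tuple layout and the example shots are assumptions made for illustration; Arlington69 has not published his code or raw data.

```python
from collections import defaultdict


def interval_label(minute, width=15, match_length=90):
    """Map a match minute to a label such as '0-15' or '16-30'."""
    # Clamp so that stoppage-time shots fall into the last bucket.
    minute = min(max(minute, 1), match_length)
    index = (minute - 1) // width
    start = index * width + (1 if index else 0)
    end = (index + 1) * width
    return f"{start}-{end}"


def goals_per_shot_on_target(shots, width=15):
    """shots: iterable of (minute, is_goal) tuples -> {interval: G/SoT}."""
    goals = defaultdict(int)
    totals = defaultdict(int)
    for minute, is_goal in shots:
        label = interval_label(minute, width)
        totals[label] += 1
        goals[label] += int(is_goal)
    return {label: goals[label] / totals[label] for label in totals}


# Made-up shots on target, purely to exercise the functions.
example_shots = [(3, True), (12, False), (15, False),
                 (16, True), (88, True), (90, False)]
rates = goals_per_shot_on_target(example_shots)
for label, rate in rates.items():
    print(label, f"{rate:.0%}")
```

Passing `width=5` gives the 5 minute drill-down. Note that the function only measures conversion rate; it says nothing about *why* a shot went in.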

### Results

Arlington69 finds that the keeper’s performance differs between the two samples *in some time intervals*. In two intervals, the lower rated keeper made 3 % fewer saves than the higher rated keeper, but in one interval, the higher rated keeper made 7 % fewer saves than the lower rated keeper. Overall, the higher rated keeper made 1 % fewer saves than the lower rated keeper.

As for the 1 % overall difference, Arlington69 concludes that it is insignificant, which is interesting because *“it would suggest that their is little benefit from having a highly rated goalkeeper”*.

He then narrows his focus to the 7 %, which is the biggest difference observed in any of the 15 minute intervals. Arlington69 conducts a Z-test and is able to demonstrate that this particular difference is statistically significant. He interprets this observation as evidence that high rated goalkeepers have a *“concede late goals”* trait.

The table below reproduces the main part of Arlington69’s comparison.

| Interval (minutes) | G/SoT % (92+ keepers) | G/SoT % (<92 keepers) | Difference |
|---|---|---|---|
| 0-15 | 34% | 37% | -3% |
| 16-30 | 38% | 37% | 1% |
| 31-45 | 40% | 40% | 0% |
| 46-60 | 35% | 38% | -3% |
| 61-75 | 42% | 40% | 2% |
| 76-90 | 47% | 41% | 7% |
| Total | 40% | 39% | 1% |

### Conclusion

Arlington69 ends up concluding that *“higher rated goals keepers were significantly less likely to stop a goal being scored”* during the last 5 minutes of the match. He hasn’t shared with us any information showing whether higher rated goal keepers were significantly more likely to stop goals in other time intervals, but this would indeed be an interesting addition.

Although he doesn’t mention handicapping explicitly, it is fair to assume that his experiment is meant to support his earlier claims, i.e. that EA has implemented *“handicapping stopping the highest [rated players from] performing to their potential”* and that *“the better team plays worse and the worse team plays better”*.

## Criticism

As you might have guessed, we aren’t exactly impressed with Arlington69’s latest attempt to prove that the only thing stopping him from winning virtually all his matches is big, bad EA. And you might also have guessed some of the reasons. But as always, his article offers some interesting lessons on how *not* to do research, in a FIFA setting and in general. And that is the main reason why we bother looking into it.

### A strange choice of method

Before we dig into the data, Arlington69’s choice of method deserves attention. His hypothesis is that higher rated keepers make *fewer* saves per shot than lower rated keepers. In essence, he wants to know how *keeper rating* (variable 1) relates to *keeper performance* (variable 2). Presuming that FIFA works as it should, we would expect these two variables to be directly correlated, meaning that a higher keeper rating goes along with better keeper performance. Arlington69, on the other hand, suspects that they are either inversely correlated or not correlated at all. If correct, this would imply that a higher rating produces either a decrease in performance or no change in performance.

Given his belief, or shall we say hypothesis, the natural choice of methodology would be a correlation analysis. A correlation analysis is a systematic analysis which lets you determine how strongly the two variables are related (the correlation coefficient) and how much performance changes per rating point (the slope, i.e. the ‘a’ in y=ax+b), spiced up with some significance testing on top.
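A minimal sketch of what such an analysis could look like is shown below. It computes the Pearson correlation coefficient, the least-squares slope and intercept, and the usual t-statistic for testing whether the correlation differs from zero. The function names and the input numbers are our own assumptions; the ratings and save rates are invented to exercise the code and are not measurements.

```python
from math import sqrt


def _sums(xs, ys):
    """Return means and centered sums of squares/products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def slope_intercept(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    mean_x, mean_y, sxy, sxx, _ = _sums(xs, ys)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def t_statistic(r, n):
    """t-statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Invented example: one average save rate per keeper rating.
ratings = [80, 84, 88, 92, 96]
save_rates = [0.58, 0.60, 0.61, 0.63, 0.65]

r = pearson_r(ratings, save_rates)
a, b = slope_intercept(ratings, save_rates)
t = t_statistic(r, len(ratings))
print(f"r = {r:.3f}, slope = {a:.5f} per rating point, t = {t:.2f}")
```

The point of the design is that every rating contributes to the estimate, so no arbitrary cut-off between "high" and "low" rated keepers is needed.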

This is why it is a bit of a mystery that Arlington69 has chosen to compare keepers rated above and below an arbitrary threshold, i.e. 92, and to compare the performance per x minute interval rather than simply look at the general performance.

### Less obvious choices, no answers

Why did he choose that particular method, and why did he choose to separate his two samples at 92 rather than any other random value? And why did he choose to compare in 5 / 15 minute intervals? And why doesn’t he present a detailed analysis of all the other 5 or 15 minute intervals instead of narrowing his attention to the last 5 / 15 minutes of the match?

One can of course only speculate, but it definitely raises the suspicion that he primarily made his choices because they gave him the result he wanted. Either that, or he hasn’t got a clue about what he is doing. Or both.

The fact that he may be driven by questionable motives doesn’t, however, change the fact that his data for the last 5 / 15 minute interval appears to show that higher rated keepers perform worse than lower rated keepers. The question is whether these results were the product of valid and reliable measurements of the relationship between keeper rating and performance.

### Inflated samples

We normally use significance testing to determine whether our measurements are reliable. Arlington69 claims that his results are statistically significant, and he did conduct a Z-test which apparently suggests that keepers rated 92+ make significantly fewer saves than keepers rated <92 in the last minutes of his matches.

The result of a Z-test depends on the sample sizes, i.e. the number of independent observations gathered through randomized sampling. Arlington69 uses the number of shots on goal as sample sizes, cf. the table below.

| Interval (minutes) | n (92+ keepers) | n (<92 keepers) |
|---|---|---|
| 0-15 | 194 | 1583 |
| 16-30 | 169 | 1753 |
| 31-45 | 246 | 1867 |
| 46-60 | 164 | 1365 |
| 61-75 | 174 | 1551 |
| 76-90 | 236 | 1824 |
| Total | 1183 | 9943 |

Multiple shots made against the same opponent in the same match do not constitute multiple independent observations of keeper performance. For example, there will be situations where multiple shots are taken in close succession in a match, meaning that the probability of shot #2 ending up in the net is directly dependent on shot #1. Think of the situation where you shoot, the keeper parries the ball and you shoot again from the rebound. These two shots are of course not independent observations of scoring probability and definitely not of keeper performance.

Arlington69 takes 5 shots on goal per match on average, meaning that the maximum number of independent observations in the data above is approximately 5 times lower than he reports. Because the Z-statistic scales with the square root of the sample size, it consequently drops well below the threshold used to determine significance.
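To make the effect of the inflated samples concrete, here is a hedged sketch of a pooled two-proportion Z-test for the 76-90 minute interval. It uses the rounded conversion rates and shot counts from the tables above, so the full-sample value will not exactly match whatever Arlington69 calculated from his unrounded data. The function name is our own, and dividing both sample sizes by 5 is our simplifying assumption for "one independent observation per match".

```python
from math import sqrt


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z-statistic for H0: p1 == p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# 76-90 minute interval, rounded figures from the tables above.
P_HIGH, N_HIGH = 0.47, 236    # 92+ keepers
P_LOW, N_LOW = 0.41, 1824     # <92 keepers

# Treating every shot on target as an independent observation.
z_per_shot = two_proportion_z(P_HIGH, N_HIGH, P_LOW, N_LOW)

# Assumption: about 5 shots per match, so at most n / 5 independent observations.
SHOTS_PER_MATCH = 5
z_per_match = two_proportion_z(P_HIGH, N_HIGH / SHOTS_PER_MATCH,
                               P_LOW, N_LOW / SHOTS_PER_MATCH)

print(f"z with shots as observations:   {z_per_shot:.2f}")
print(f"z with matches as observations: {z_per_match:.2f}")
print(f"ratio: {z_per_shot / z_per_match:.2f}")
```

Whatever the full-sample value is, shrinking both samples by a factor of 5 divides the Z-statistic by √5 ≈ 2.24, which is enough to push a borderline result far below the conventional two-sided 5 % threshold of 1.96.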

In addition, the fact that the samples are based on Arlington69’s own matches naturally raises a concern about observer bias, or about whether certain elements of his own playing style could contribute to the result. This is of course only speculation, but the burden of proof is on him.

So, problem #1: The samples are not made up of independent observations and the result is not statistically significant.

### My performance or yours?

Problem #2: Arlington69 didn’t perform a valid measurement of the relationship between keeper rating and performance.

A central part of the problem is that Arlington69 uses his own conversion rate to express the opposing keeper’s performance. This method comes with an inherent problem: at least three factors influence the probability of a shot ending up in the net, namely the keeper’s performance, the defense’s performance and the performance of the attacking team, i.e. Arlington69’s own. And here, we are trying to determine the keeper’s part only.

How do we know to what extent a conversion rate of 47.5 % was a product of bad keeper performance rather than bad defensive performance or good attacking performance? Surely this can be done. Under normal circumstances, you pick a sufficiently large sample and make sure that all other possible influencing factors are neutralized.

Arlington69 appears to recognize the issue, but his efforts to keep other influencing factors out of the equation are insufficient. He does mention that he used the same formation and tactics during all his 2000 matches. He also notes that he always makes three substitutions around the 60th minute. But what about his own team – did he keep that constant as well? Hardly.

### Upgraded strikers + upgraded keepers = 0 more goals

As his sample consists of more than 2000 of his own matches, we can assume that his matches were played over several months. This is important for two reasons:

First and foremost, most FIFA players upgrade their teams during the season. The fact that Arlington69 doesn’t mention that he kept his team constant in itself suggests that he also upgraded it. When we analyzed his data from FIFA 18, we found that both he and his opponents improved their squads over the duration of the season. This is of course only a guess, but the fact that he hasn’t ruled out this option is more than sufficient to reject his claims.

The second reason why the timing is important has to do with the keepers: No keepers rated 92+ were available when FIFA 20 was released in September 2019. The first couple of 92+ rated keepers were released in December, but most 92+ rated keepers weren’t released until the spring and summer of 2020 during TOTSSF (Team of the Season So Far) and the Summer Heat Objectives. Together with these 92+ keepers, EA released a load of other high-rated players.

This leaves us with two facts and a likely assumption:

- Fact #1: All matches involving 92+ rated keepers were played in the later parts of the season.
- Fact #2: The likelihood that a random match involving a <92 rated keeper was played earlier in the season is bigger than the likelihood that it was played later in the season.
- Assumption: Arlington69 improved his own team gradually during the 2020 season.

Provided that our assumption is correct, Arlington69 was using a better team when playing against 92+ rated keepers than when playing against <92 rated keepers. It is therefore not possible to rule out that the reason why the 92+ rated keepers in his sample didn’t outperform their cheaper colleagues is that they were facing better opponents than said cheaper colleagues.
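The arithmetic behind this argument can be illustrated with a deliberately simple toy model. Everything in it is an assumption: the additive formula, the function name and every number are invented to show the logic, not to describe how FIFA calculates anything.

```python
def conversion_rate(base, striker_bonus, keeper_bonus):
    """Toy additive model: chance that a shot on target becomes a goal."""
    return base + striker_bonus - keeper_bonus


# All numbers are invented for illustration only.
BASE = 0.40            # starter attack against a <92 keeper
STRIKER_UPGRADE = 0.05  # hypothetical gain from an upgraded attack
KEEPER_UPGRADE = 0.05   # hypothetical gain from a 92+ keeper

# Early season: starter attack against a <92 keeper.
early_vs_low = conversion_rate(BASE, 0.0, 0.0)

# Late season: upgraded attack against a 92+ keeper.
late_vs_high = conversion_rate(BASE, STRIKER_UPGRADE, KEEPER_UPGRADE)

# Late season: upgraded attack against a <92 keeper (the fair comparison).
late_vs_low = conversion_rate(BASE, STRIKER_UPGRADE, 0.0)

naive_difference = late_vs_high - early_vs_low      # what the pooled samples show
controlled_difference = late_vs_high - late_vs_low  # keeper effect, same attack

print(f"naive difference:      {naive_difference:+.2f}")
print(f"controlled difference: {controlled_difference:+.2f}")
```

In this made-up scenario the naive comparison shows no difference at all, although the better keeper lowers the conversion rate by 5 percentage points once the attacking team is held constant.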

## Conclusion

Arlington69’s experiment doesn’t prove that higher rated keepers perform worse than lower rated keepers. We don’t know whether and to what extent higher rated keepers perform better than lower rated keepers when facing the same level of opposition. But we know from other studies of the relationship between player rating and performance that higher rated versions of a striker produce more goals than lower rated versions. So, why would it be different with keepers?

In addition to the above, we would like to note that there are other experiments, although small in size, which reach the exact opposite conclusion to Arlington69’s. In this article, redditor Militantxyz tested how his players performed against higher rated keepers. His samples are small and clearly insufficient, but the results directly contradict Arlington69’s claims, even if they are somewhat controversial. Militantxyz concludes that *“[s]hooting at De Gea Draxler was a lot less accurate then when shooting at Adan.”* In other words, higher rated keepers are more effective, but their effectiveness is achieved by making it more difficult for the strikers to hit the target.

Whether this conclusion would stand with a larger sample is an open question.