Under scrutiny: Xtoonator’s handicap experiment


Now and then we come across home-brewed experiments aimed at testing whether scripting exists. Although most of them fail to meet basic scientific requirements, there is something to be learned from them. xtoonator’s experiment below is one such example.

The experiment that we are going to discuss in this article was presented in this report, which was posted on Reddit during FUT 15 by a user named xtoonator. In his report, xtoonator explains how he played 25 games and noted the usernames and squad ratings of his opponents, the results and numerous other pieces of information (the full dataset is available here). Following each match, xtoonator looked his opponents up via FUTSCOPE and noted their FUTSCOPE rating. On the basis of these observations, he ended up concluding that ‘I lose against players that seemed to be worse than me’ and ‘I lose to opponents that have a lower team rating than me.’

In his report, xtoonator writes that:

“So, after this hole thing, I guess I do believe in the handicap. I do believe that there is some kind of scripting that allows “worse” players to beat the better ones / beat the better teams. There is something that makes it easier for them to finish, that takes the other goalkeeper into a full retard, that makes Bale run like a 95th year old with a walking frame and that makes Ronaldo finish like Fernando Torres at Chelsea.”

As we will explain in the following sections, however, xtoonator’s data supports neither the conclusion that he actually lost against inferior players, nor that he lost against inferior teams.

Although the extremely small sample size is a huge problem on its own, there are other, perhaps more interesting, points of criticism to be made about xtoonator’s experiment.

Did xtoonator lose against inferior players?

During his run of 25 matches, xtoonator lost 8. Two observations led him to conclude that FIFA allows worse players to beat better ones:

  • He mostly lost against players who had a lower FUTSCOPE grade than him.
  • He mostly lost against players who resided in a lower division than him.

There are, however, a number of critical issues with this line of reasoning.

Skill can’t be measured without uncertainty

In case you aren’t familiar with FUTSCOPE, it was a huge thing during FUT 15. Like FUTHEAD Nexus, FUTSCOPE used player data collected via EA’s game data section (win ratio, goal difference etc.) to calculate a grade (A+ through F) indicating how good you were. Upon recording his results, xtoonator looked his opponents up via FUTSCOPE and recorded their grade. He then observed that in all 8 matches that he lost, the opponent had a lower grade. This led him to conclude that there is some sort of bias favoring the inferior player in a match.

The problem with xtoonator’s line of reasoning is that it relies heavily on FUTSCOPE’s grade system as a method for determining the skill level of one player relative to another. The entire argument rests on the premise that a player rated A ought to be able to beat a player rated B. But can you make this assumption? FUTSCOPE’s own match prediction algorithm doesn’t use grades this way. As a matter of fact, FUTSCOPE itself often predicts the lower-graded player as the winner:

[Screenshots: two FUTSCOPE match predictions]

Measuring length with a rubber band

FUTSCOPE’s grades are calculated based on the stats seen in the two examples above, i.e. win / loss ratio, goals per game, best division and so on.

Even though you can make a somewhat educated guess about a player’s skill level based on these data, your guess will be blurred by a large amount of uncertainty. You may assess a player as a C, but his actual skill level could be anything between A+ and D, meaning that he could potentially thrash an opponent that you assessed as a B+.
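The rubber-band effect is easy to sketch with a quick simulation. The numbers below (a 10-point true skill gap on an arbitrary scale, Gaussian measurement noise with a standard deviation of 15) are purely hypothetical, chosen only to illustrate how noisy grades misrank players:

```python
import random

random.seed(1)

TRIALS = 100_000
NOISE_SD = 15.0          # assumed measurement noise, purely illustrative

def noisy_grade(true_skill):
    """Observed grade = true skill plus random measurement error."""
    return true_skill + random.gauss(0, NOISE_SD)

# Player A is genuinely 10 points better than player B.
misranked = sum(
    1 for _ in range(TRIALS) if noisy_grade(60.0) < noisy_grade(50.0)
)
print(f"Grades rank the worse player higher in {misranked / TRIALS:.0%} of cases")
```

With that much noise, roughly a third of all comparisons put the genuinely worse player on top, so “lower grade” and “worse player” are far from the same thing.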

When you operate with a sample of just 8 matches and use a highly inaccurate prediction method, there is a good chance that you will predict the wrong winner in all 8 cases. This, of course, doesn’t mean that the game is rigged. It just means that your prediction method doesn’t work.

The inaccuracy problem is only accentuated further by the fact that FIFA uses ELO matchmaking, meaning that the game essentially is doing its very best to create matches that are difficult to predict, by mainly picking opponents with similar track records.
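To see why ELO-style pairing makes prediction hard, consider the textbook Elo expected-score formula (EA has never published FIFA’s exact matchmaking math, so this standard version is only an illustration):

```python
def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Matchmaking that pairs similar ratings produces near coin flips:
print(round(elo_expected_score(1500, 1500), 3))  # 0.5
print(round(elo_expected_score(1500, 1450), 3))  # 0.571
print(round(elo_expected_score(1500, 1300), 3))  # 0.76
```

When opponents sit within a few dozen rating points of each other, the better player’s win chance barely clears 50%, which is exactly what makes individual matches unpredictable.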

A larger sample would help

There is no doubt that a larger sample size would have improved the reliability of xtoonator’s experiment. After all, getting it wrong 8 times out of 8 can easily happen by chance, whereas 1000 out of 1000 can’t.
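The arithmetic behind that claim is straightforward. Assuming, pessimistically, that the grades predict no better than a coin flip, the chance of an unbroken wrong streak shrinks exponentially with sample size; a 1-in-256 streak of 8 will be seen by plenty of players, while 1000 straight misses never will:

```python
def p_all_wrong(p_correct, n):
    """Chance that an independent predictor with per-match accuracy
    p_correct misses every one of n predictions."""
    return (1.0 - p_correct) ** n

print(f"{p_all_wrong(0.5, 8):.4%}")     # 0.3906%, about 1 in 256
print(f"{p_all_wrong(0.5, 1000):.0e}")  # effectively zero
```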

Had xtoonator collected a sufficiently large sample, this is what he would have found:

In our article on ELO matchmaking, we did roughly the same experiment with the same type of data, but using a sample of 2,200 matches. We found that the player we assessed to be superior (i.e. the one with the better FUTSCOPE grade) won considerably more often than he lost. Equally important, we found a clear link between the assessed skill gap and the chance of the superior player winning (the blue graph).

Hence, if we were to apply xtoonator’s own logic to our much larger sample, we would conclude that handicapping doesn’t exist.

Prediction accuracy at various rating gaps

FIFA’s tricky divisions

xtoonator also noted the best completed division of his opponents. He only lost against players from lower divisions, an observation he uses to further support the claim that the game somehow helps lesser players. But can you actually draw that conclusion from this observation?

Best completed division is essentially an even less accurate way of assessing a player’s skill than the methods we just discussed in the previous sections. The most predominant problem is that FIFA’s promotion / relegation mechanisms make it extremely random what division you end up in. As explained here, players with roughly the same skills can end up multiple divisions apart. Hence, the assumption that a player four divisions below you shouldn’t be able to beat you is simply wrong.
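How random the ladder is can be sketched with a toy random walk. The model below is a deliberate simplification (real promotion is decided over whole seasons of matches, not a single coin flip), but it shows how far apart two identically skilled players can drift:

```python
import random

random.seed(42)

def final_division(seasons=20, start_div=5, promo_prob=0.5):
    """Toy model: each season the player is promoted or relegated one
    division. Division 1 is the top; divisions are clamped to 1-10."""
    div = start_div
    for _ in range(seasons):
        if random.random() < promo_prob:
            div = max(1, div - 1)    # promoted
        else:
            div = min(10, div + 1)   # relegated
    return div

# 10,000 pairs of players with *identical* skill (same promo_prob):
gaps = [abs(final_division() - final_division()) for _ in range(10_000)]
print(f"Average gap between equally skilled players: {sum(gaps) / len(gaps):.1f} divisions")
```

Even in this crude model, equally skilled players routinely end up several divisions apart, which is why division is such a poor proxy for skill.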

In addition to that, the division part of xtoonator’s experiment contains a couple of notable design flaws:

First of all, we have an interesting case of experimental bias on our hands. xtoonator describes himself as a division 1 regular, meaning that he could lose against opponents from his own division or a lower one, but never from a higher division, as there aren’t any. Concluding that you always lose against opponents from lower divisions, when you never play anyone from a higher division, is simply nonsense.

Second, xtoonator for some reason completely leaves out the fact that he won 13 matches, 10 of them against opponents from lower divisions. You simply can’t justify the claim that the game makes you lose against lesser players when you primarily win against supposedly lesser players.
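The first flaw, the ceiling effect, is easy to reproduce in code. In the toy model below, match outcomes are a fair coin with no handicap whatsoever; the opponent pool (divisions 1 to 3, since there is nothing above division 1) is an assumption, but any top-heavy pool gives the same picture:

```python
import random

random.seed(7)

# A division-1 regular's opponents can only come from division 1 or below.
opponent_division = [random.choice([1, 2, 3]) for _ in range(1_000)]

# Outcomes are a fair coin: no handicap is built into this model.
lost_to = [d for d in opponent_division if random.random() < 0.5]

share_lower = sum(1 for d in lost_to if d > 1) / len(lost_to)
print(f"{share_lower:.0%} of losses were against lower-division opponents")
```

Most losses land on lower-division opponents simply because most opponents are lower-division; this conditioning alone accounts for xtoonator’s observation without any handicap.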

Did xtoonator lose against players with lower rated squads?

xtoonator additionally observed that all 8 of his losses were against opponents with lower overall squad ratings. While this could be true, it’s worth noting that, according to his own data, all of xtoonator’s opponents used a bronze bench. Hence, the actual ratings of their starting lineups are unknown. A player using a bronze bench may well get a lower overall rating yet still field the better team on the pitch, which, one would suppose, ought to increase his chances of winning.

Thus, it can’t be determined who actually had the better team on the pitch in any of those 8 matches. That alone makes it impossible to conclude anything with regards to whether there is an advantage to having the lower rated squad.

Why one should approach such experiments with skepticism

One of the purposes of discussing a post like xtoonator’s is to demonstrate that it is relatively easy to create an experiment which at a glance appears to prove something, even though it’s nonsense from one end to the other.

Scientists normally work to ensure that their experiments are valid and reliable. Valid means that you measure what you are supposed to measure: a ruler is a valid way to measure distance, but of course not a valid measure of noise. Reliable means that repeated measurements give consistent results: using a rubber band to measure distance doesn’t give you reliable results, because the band stretches differently every time.

xtoonator’s study suffers from validity as well as reliability issues. The methods he uses to measure skill are not sufficiently valid, and his sample size makes his overall conclusions completely unreliable.