
The Elo fallacy
Anthropocene (also a recent documentary): relating to or denoting the current geological age, viewed as the period during which human activity has been the dominant influence on climate and the environment.
Allow me to coin Datacene: the last 20 years or so, in which easy and constant access to data, so-called cold hard facts, has dominated public discourse and fueled plenty of heated, emotional, irrational personal and societal trends and attitudes. As usual with human affairs, actual knowledge, understanding, sophistication, balance and wisdom lag behind by decades, or even centuries. I will blame our educational systems, again. It took me until university to learn that a number like "95% probability" does not stand on its own; it should come with something like "±3", a so-called standard deviation or confidence interval. Some probability theory has sneaked into our school curricula, but judging by my high school experience (and I did have to master quite a few technicalities to qualify for an engineering education), traditional schooling is really not equipped to give students real-life usable skills, and our young men and women leave school financially, legally and statistically illiterate.
Professional chess and developmental chess alike suffer more than ever from the Datacene. A milestone in our virtual insanity was the launch of 2700chess.com. A brilliant website in many senses, it is a constant voice in our head, perhaps Maurice Ashley's voice, announcing "And Kramnik draws Carlsen and becomes number 1 in Russia", five minutes later "Unbelievable, Grischuk won and is now the best in Russia", another 30 minutes pass and "Sensational, Nepo ragdolls the opposition, chews up and spits out super GMs to become the best Russian player ever, after Kasparov, Karpov and Kramnik". The corresponding live Elo ratings in this hypothetical scenario would be 2802, 2801 and 2800. Perhaps a more statistically accurate "there is a 90% chance these three are currently the best Russian players" will never find an audience.
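To put a number on that "90% chance" intuition, here is a back-of-the-envelope sketch of my own (not FIDE methodology, and the 25-point standard error is an assumption for illustration): if each published rating is only an estimate of true strength with some normal noise, a one- or two-point gap on a live list says almost nothing about who is actually stronger.

```python
import math

def prob_truly_stronger(rating_a: float, rating_b: float,
                        std_err: float = 25.0) -> float:
    """P(A's true strength > B's), assuming each published rating is the
    true strength plus independent normal noise with the given std error."""
    diff = rating_a - rating_b
    combined = math.sqrt(2.0) * std_err          # std dev of the rating difference
    # Normal CDF of diff/combined, via the error function.
    return 0.5 * (1.0 + math.erf(diff / (combined * math.sqrt(2.0))))

# The hypothetical live list above: 2802 vs 2800 is barely better than a coin flip.
print(round(prob_truly_stronger(2802, 2800), 3))   # ~0.52
```

Under these (assumed) error bars, the "number 1 in Russia" headline flips on what is essentially noise; only a gap of 100+ points starts to look like near-certainty.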
As a coach I often have to work with the chess moms and chess dads of the world, who cannot keep track of their progeny's chess after the first year or two. From that point onwards, it all becomes something like "Wow, he drew a 2100, good job", or "What am I paying you for? She lost to a lower-rated player twice this week". At the other end of the spectrum is somebody like Ivanchuk, who "lost to a 2300, again". But are there 2300 (FIDE Elo-rated) players, and are there perhaps 2300 moves?
I do not particularly enjoy resorting to definitions, but I am not sure the fundamentals of ratings and rankings have sunk in with the general public, so let's repeat them here: the Elo rating provides an estimate of what a player might score against another, over an infinite number of games. The estimate works best if the two players in question have played an infinite number of games against all other players in the system (even the dead ones, I'm afraid). If we are looking to cut these infinities down to manageable numbers, then perhaps we could accept 30 games per opponent and a "representative sample" of the opposition. I don't think anyone knows what a representative sample is, but I would go for 10 players from each of the top 30 chess countries. So I would be quite happy with the estimate after 10 × 30 × 30 games, a mere 9,000 FIDE-rated games. Maybe you are wondering where those dead players came from. The answer is that they come from the I-don't-know-what-it's-called principle, let's call it the Unchanging World Hypothesis: the idea that "samples come from the same distribution". It's the idea that in the 9,000 days it took you to play the 9,000 games you were not allowed to learn, or forget, or get a cold, or die. If you did get a cold and it affected your play in the first 4,500 games, you should get the same cold, no more no less, during the next 4,500 games. And if those first 30 games were your first 30 games and you improved a lot quickly, then you are not really the player you were at the beginning, and we may need to play another 30 games against you. Or you may now fit differently in the country sample that I suggested: perhaps your new characteristics make you identical to another player already in the sample and we have to replace one of the two, or they create another extreme for the national sample and we need to find another player to take the sample spot you were previously occupying.
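For reference, the "estimate of what a player might score against another" is the standard logistic expected-score formula of the Elo system; a minimal sketch:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B over many games,
    using the standard Elo logistic curve (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 100-point edge is worth roughly 64% over a long match.
print(round(expected_score(2400, 2300), 2))   # ~0.64
```

Note that the formula says nothing about how many games the estimate rests on; the paragraph's point is precisely that the published number hides that sample-size problem.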
For a paragraph that was supposed to be basic, the previous one went over the heads of most readers, I know, I know. The whole thing needs a lot of time to sink in. But let's fast-forward to the kind of player and Elo a coach has to work with. FIDE has made at least one accommodation to the Changing World Hypothesis: Elo calculations for a player's first steps in rated chess have a higher "velocity" factor, allowing a player to move up the rankings quickly up to a point, let's say until a "master" rating is established; from there on to the GM title and beyond, rating progress slows down, let's say FIDE wants to see "more proof". There are three "speeds" all in all for FIDE, and despite some (or a lot of) science that can be applied to the problem, it is anybody's guess how many speeds there should be and what exactly they should be. I am guessing most people would be OK with "compounding", i.e. if you start a tournament with 3 wins this should count against you: you should now be expected to continue winning, for diminishing rewards. In the current state of affairs, if you have your breakthrough event, win a national championship, gain 100 points and play your next tournament in the same month, the rating calculations will still pretend you are the same weak player you were in your previous rated tournament, perhaps from a year ago. This is more or less the damage young players inflict on the pros: they arrive at tournaments as 2100, play as 2500 and enter the next rating list as 2300, while playing the next season as 2600, wreaking more havoc. By the way, their arrival on the scene doubly inflates the Elo pool: not only is there "more Elo to go around", but it was also generated at two or three times the speed at which the pros lost Elo.
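The three "speeds" are K-factors. A simplified sketch of the rules as I understand them (this ignores FIDE's special provision for juniors, so treat it as an approximation, not the regulation text), together with the update formula that applies the same K to every game regardless of last month's breakthrough:

```python
def fide_k_factor(rated_games: int, ever_reached_2400: bool) -> int:
    """Approximate FIDE K-factor: 40 for a player's first 30 rated games,
    10 once a player has ever been rated 2400+, 20 otherwise.
    (The junior rule is deliberately left out of this sketch.)"""
    if rated_games < 30:
        return 40
    if ever_reached_2400:
        return 10
    return 20

def update(rating: float, expected: float, actual: float, k: int) -> float:
    """New rating after a batch of games: old rating plus K times
    (actual score minus expected score)."""
    return rating + k * (actual - expected)

# A newcomer (K=40) who scores 3/3 where 1.5 was expected gains 60 points...
print(update(2000.0, 1.5, 3.0, fide_k_factor(10, False)))   # 2060.0
# ...while an established 2400+ pro (K=10) gains only 15 for the same result.
print(update(2500.0, 1.5, 3.0, fide_k_factor(200, True)))   # 2515.0
```

The asymmetry in that last pair of lines is exactly the "doubly inflated pool" complaint: the newcomer's points are minted at four times the speed at which the pro's are lost.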
So, your golden boy rated 1600 lost to somebody else’s golden girl rated 1400.
DO NOT PANIC
Your golden girl rated 1700 drew a FIDE Trainer rated 2300.
KEEP THE CHAMPAGNE IN THE FRIDGE
Is there anyone rated 2300, really truly madly deeply 2300, published FIDE rating list notwithstanding? Well, kinda. The ascent of young players and the descent of old players suggests there may have been a plateau, or even a succession of plateaus, for a player. Grandmasters lose to players rated 500 points lower, or beat players rated 300 points higher, almost every day, but that's well within statistical theory. Statistical theory is quite agnostic as to whether a player can be at 2800 one year, 2400 the next, and then back at 2800, although to my knowledge it has never happened. It feels like somebody like Ivanchuk could do this on purpose, and with Morozevich's smaller swing we have to ask ourselves whether he somehow did it on purpose. His move from the top 5 to the top 100 in a year or two was quite the negative sensation. Were the memory banks in his brain wiped out with a format command? Is he pruning all his 15-move-deep calculations to a depth of 10 moves? Did the opposition reverse-engineer his style and come to the board with the right tools to refute Morochess? I am powerless to say.
The recent World Rapid and Blitz had a lot of the "usual suspects" reach the top. And also many unusual suspects. Sure, there is a greater element of chance in faster time controls, a larger variance to use the scientific term, but could the 21-game format bring more truth to the Elite System, a kind of red pill? If you could throw 200 players, from 2500 up to Magnus, into a 21-round classical event, would Magnus come out unscathed and the 2500s end up in the 2500 range? We have a lot of evidence that no, breaking the glass ceilings of sports and competitive activities reshapes the world from the ground up. Mind you, the world's elite can hold its own in a couple of games a month, or a year, against the pretenders, but if this opened up to 100 games a year, well, that would blast the whole Elo list wide open. Some of the lower-rated players would be able to learn from the encounters, would be able to motivate themselves to work harder for the encounters, and might even find sponsors for the encounters and take their game where it has never been before. Some "toppies" might be exposed for the drawing machines or one-trick ponies that they are, or might make breakthroughs for precisely the same reasons; drawing 90% of your games would no longer be enough for anything.
Ideally, a chess coach will be able to form a more nuanced "model" of a trainee: maybe the student plays Sicilian attacks like a 2500, bishop-versus-knight endgames like a 1500, "chokes" when within half a point of a major accomplishment, or plays best when facing grandmasters, because lesser opposition does not inspire concentration and hard work for many hours when the reward is a win over a nobody in a tournament played nowhere. The numbers 2500 and 1500 are mere placeholders; there are very few hard and fast guidelines a coach can use to quantify the student's performance, and sometimes maybe that one 1500 endgame by the student will produce enough analysis and motivation for the next one to be handled like a 2500 pro. We also see the opposite, of course: a single 1500 endgame could trigger a streak of ten 800-Elo endgames, the student having lost faith or motivation, or acting up. The point is not to find out whether it is the player's fault or the coach's fault or the parents' fault; the point is to work as a team to move forward.
Testing methodologies like chess.com's Puzzle Rush do provide some usable data for coaches; the leaderboards paint quite a neat picture of who is a GM, who is an FM and who is a developing player, again with a generous sprinkling of "outliers": amateurs whose eye-mouse coordination is enough to place them among the grandmasters, and, on the flip side, pros whose computer skills and motivation to take an "irrelevant test" seriously dampen their performance (pros train on tactics that may need 30 minutes for one position, not five minutes for 50 positions). Testing is riddled with the imperfections of rating and more, and you cannot skip the need for intelligent analysis, response and strategy; you cannot run tests on autopilot. Having said that, I do feel that chess training and education as a whole can and will become more data-driven, more fine-tuned to what an individual learner needs, wants, can or cannot do, and tests will be part of it, while the world waits for more "intelligent" analysis software that would create more detailed profiles of players in the first instance. Perhaps even psychological profiles could be derived. I vaguely recall the Greek author, psychologist and trainer Kourkounakis devoting a page or two to the analysis of 1.d4 in one of his books; you cannot do such psychological work without some controversy, but if it were a feature in ChessBase or Chess Assistant, now that would impress me!
So, if there is, kinda, a 2500 endgame or King's Indian Attack attack (sic), is there a 2500 move? This question is best answered by switching our attention to pro and semi-pro chess, let's say 2300 and above. The defining viewing angle for high-level chess has been expressed for at least 100 years, perhaps for thousands, as follows: it is a game of mistakes, a game of errors. Mind you, we are not talking about the "you did not play Stockfish's first move" kind of error, just the good old evaluation-altering or position-character-altering human error. And it does not have to be multiple errors either; we see games decided by a single mistake across the entire spectrum of chess ability. It's just that our "masters", starting with the lowly humble FIDE Masters, have shown the ability to play long, interesting games without "human error": basically logical, consistent games without losing control of the position. So, either some kind of logical draw, or, if it is a win due to winning a pawn or the two bishops at some point, then at no point were there serious drawing chances.
Or to rephrase: a 2300 player is not a player who plays 2300 moves 70% of the time, nor is a 2800 a player who plays 2800 moves 70% of the time. A 2300 player is more likely a player who plays 2500 moves 80% of the time, 2000 moves 19% of the time and 1500 moves 1% of the time. Or, as we tend to advise the higher-rated player, "keep playing, he will blunder at some point". I can think of many exceptions to these rules; let's take for example many of MVL's games. When he enters "complications", those hand-to-hand, full-body-combat fist fights, or equally when he liquidates to a pawn-race ending which may need a 20-move variation, I do have the feeling we are watching 2800 moves. Everything may not be nailed down the way a chess engine would do it, but it is not Puzzle Rush either, with one or two tactical points. It is a jungle in which lesser players will be bitten by a tarantula, suffocated by a boa or eaten alive by a tiger at every single turn. I think you know what I mean. I wish I could say the same thing about Carlsen's "positional" games, but I don't think I can. It is very hard to judge those at the absolute top, even for his colleagues, let alone a distant onlooker, but it would be very hard to say that those 80-move wins are composed of 70% 2900 moves and 30% 2700 moves. It is more like the game is mostly a 2750, and he is waiting for the 1800 from the opponent, and the opponent so often delivers! Call it what you want, magic, starstruck opponents, youthful stamina, the Carlsen of recent years is not the "3000 70% of the time" we saw in several of his games. He is a much more banal magnet for forced/unforced errors by his opponents. Don't take my word for it. Ask his peers about at least two opportunities Karjakin did not make use of in his World Championship match, and another two (or 1.5) by Caruana in 2018.
Granted, Carlsen did a reverse-Carlsen in his first game of the 2018 WC too, blundering away an advantage even candidate masters could bring home against Caruana (especially some candidate masters I recall from Kotov's books :) ).
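The "mixture" framing a couple of paragraphs back, a 2300 playing 2500 moves 80% of the time, 2000 moves 19% and 1500 moves 1% of the time, can be sketched as a toy simulation. The percentages are this essay's placeholders, not measured data, and a real game is of course not 40 independent coin flips:

```python
import random

# The essay's placeholder mixture for a "2300 player": (move level, probability).
MOVE_LEVELS = [(2500, 0.80), (2000, 0.19), (1500, 0.01)]

def sample_move(rng: random.Random) -> int:
    """Draw one move's quality level from the mixture."""
    r = rng.random()
    cumulative = 0.0
    for level, p in MOVE_LEVELS:
        cumulative += p
        if r < cumulative:
            return level
    return MOVE_LEVELS[-1][0]

def worst_move_in_game(rng: random.Random, moves: int = 40) -> int:
    """'Keep playing, he will blunder at some point': the worst move of a
    40-move game. The chance every move is 2500-level is 0.8**40, ~0.01%."""
    return min(sample_move(rng) for _ in range(moves))
```

Under these toy assumptions, a sub-2500 move appears in essentially every 40-move game, which is the whole coaching advice in one line: the published 2300 is a long-run average over a mixture, not the level of any individual move.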
Well, there you have it, almost a PhD's worth of my facts, opinions and intuitions about chess rating and chess development. An obsession with rating is similar to the use of lead or amalgam in dentistry: it has its uses, but there is no dose so small that it is known to be safe!