# On the increase of the K-factor - Part II

The hot debate around the K-factor that was going on last week attracted many interesting responses. Well-known rating experts such as Jeff Sonas, Ken Thompson and Hans Arild Runde all contributed insightful points. The discussion ended with what Chessbase called Dr. John Nunn’s ‘final installment’, which is unfortunate, because the discussion is just getting interesting!


*By Daan Zult*

In science, of course, there are no ‘final installments’, but more importantly, in this debate no decisive arguments have yet been provided by either side. I was happy to see that this time, Nunn expressed his ideas more clearly, because his first contribution was somewhat confusing. He reformulated his inaccurate argument concerning the frequency of rating lists, and he explained more clearly why he has problems with an increase in the K-factor and with Sonas’ analysis.

A strong argument by Nunn concerning Sonas' analysis is aimed at the fact that Sonas uses a different model to calculate the expected score: Nunn points out that Sonas' formula is a linear function instead of a normal probability distribution. Therefore Sonas’ optimal K-value of 24 belongs to a different statistical model altogether. Even though Nunn is completely correct in this respect, I think it’s harsh to conclude that Sonas’ analysis therefore has no relevance. The reason is a bit technical, so bear with me (this is the hardest part), but here’s why.

On a small scale, Elo’s formula is almost linear. This is important, because most players who play against each other have a small rating disparity. Therefore, for most games, the expected score based on Elo’s formula and the expected score based on a linear function will hardly differ. And so, the fact that Sonas' formula is linear does not explain why he finds a K-factor of 24 to predict results so much better than a K-factor of 10.
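To make this near-linearity concrete, here is a small sketch (my own illustration, not Sonas' actual formula) comparing Elo's expected-score curve with its tangent line at a rating difference of zero:

```python
import math

def elo_expected(diff):
    """Elo expected score for a player rated `diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def linear_expected(diff):
    """Tangent-line (linear) approximation of the Elo curve at diff = 0."""
    slope = math.log(10) / 1600.0  # derivative of the Elo curve at diff = 0
    return 0.5 + slope * diff

for diff in (0, 50, 100, 200):
    e, l = elo_expected(diff), linear_expected(diff)
    print(f"diff={diff:4d}  elo={e:.4f}  linear={l:.4f}  gap={abs(e - l):.4f}")
```

For a 50-point gap the two expected scores differ by less than 0.001; only at large rating differences (200 points and beyond, rare between regular opponents) does the gap become noticeable.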

**White scores better**

A second important difference between the Sonas and Elo formulas (not explicitly mentioned by Nunn) is that in Sonas’ model the expected outcome of a game differs for Black and White. This can be understood fairly easily. In the Sonas model, when two players with the same rating compete, the white player has an expected score of 0.541767, instead of the 0.5 in the current model. Sonas' expected score simply follows from the data of real chess games, where White on average scores 54%. In this case it makes perfect sense to say that Sonas' model provides better predictions, whether the K-factor is 10 or 24, since his model simply uses more information than the current Elo rating model does. Personally, I think the question of whether we want to rate Black and White games differently is more of a political than a statistical choice.

*Stats of the MegaBase + TWIC (4,171,030 games)*
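As an illustration of how a colour-aware expected score could work (a toy adjustment of my own, not Sonas' published formula), one can shift the rating difference by a fixed bonus for White, chosen so that equal-rated players give White roughly the observed 54% score:

```python
def elo_expected(diff):
    """Standard (colour-blind) Elo expected score."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Hypothetical illustration, NOT Sonas' formula: a fixed rating-point bonus
# for White, sized so that equal-rated players give White roughly 0.542.
WHITE_BONUS = 29.0

def expected_for_white(white_rating, black_rating):
    """Colour-aware expected score for the player with the white pieces."""
    return elo_expected(white_rating - black_rating + WHITE_BONUS)

print(f"{expected_for_white(2700, 2700):.4f}")  # close to Sonas' 0.541767
```

The point of the sketch is only that colour is extra information: any model that uses it starts with a head start in prediction accuracy, independently of the K-factor.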

The two points I addressed above, concerning the linearity and the difference in expected score between Black and White, are basically about the expected *outcome of one game* between two players with a certain rating difference. Both Elo’s and Sonas' formulas provide us with an expected score over one game. However, we should bear in mind that the K-factor is not related to the expected outcome of a game, but to the underlying dynamics in *chess skill*. In that respect there isn’t a big difference between Sonas' and Elo’s formulas, since they both provide a number for the expected outcome of a game, and do not directly affect the dynamics. The difference between Elo’s and Sonas' formulas therefore cannot fully explain the fact that under Sonas' formula the (much more dynamic) K=24 predicts results so much better than K=10. I therefore consider it very likely that if we use Elo’s formula and try to find the optimal K-factor, it will also turn out to be larger than 10.

*GM John Nunn*
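For reference, the K-factor enters only in the rating update, not in the expected-score formula. A minimal sketch of the standard Elo update, showing how K scales the same prediction error:

```python
def elo_expected(diff):
    """Elo expected score for a player rated `diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def update_rating(rating, expected, score, k=10):
    """Standard Elo update: the rating moves K times the prediction error."""
    return rating + k * (score - expected)

# The same surprise result under two K-factors: a 2500 player beats a 2600 player.
e = elo_expected(2500 - 2600)                 # roughly 0.36
print(update_rating(2500, e, 1.0, k=10))      # roughly 2506.4
print(update_rating(2500, e, 1.0, k=24))      # roughly 2515.4
```

Swapping in a different expected-score formula changes `e` slightly, but the dynamics, how fast ratings chase results, are governed entirely by K.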

**Cheating argument**

Wrapping up Nunn’s arguments: he writes that he considers the ‘cheating argument’ his most important contribution to the discussion about raising the K-factor. Nunn states that cheating becomes easier and more attractive with an increased K-factor. This is true, of course, because with a higher K-factor you can simply win more points by cheating! At the same time, I can’t help wondering: if this is such a strong argument not to raise the K-factor, then why not decrease it? Moreover, it does not counter Macieja’s argument that increasing the number of rating lists makes it harder to gain rating, thereby already making cheating less attractive.

Also, a legitimate question is whether we *should* use ratings to fight cheaters at all. Originally, cheating was not part of rating considerations, and I personally think it should remain that way. Since no rating/ranking system in the world is able to prevent cheating, it is strange to let it affect the accuracy of the rating model. In my opinion this is the job of the respective governing bodies, not the job of the rating system.

**Ken Thompson**

This brings us to the other reactions on the Chessbase website. Another famous contributor to the debate, computer expert Ken Thompson, also opposes an increase of the K-factor. He states that the only reason to increase the K-factor is to allow the ratings of rising stars to increase faster. This makes sense under the presumption that a chess player has some sort of ‘true chess skill’ that does not differ from one week to the next, but might suddenly change within a short period of time, only to become stable again.

However, there are two problems with this argument. First of all, there is no conclusive scientific proof that chess players in fact develop with jumps in skill (it might be true, or it might not). In fact, developmental psychologists are still gathering evidence that can be interpreted in various ways. Secondly, this statement ignores the possibility of ‘temporary shape’, and the question of whether we want ratings to express that temporary shape.

Next, Thompson criticizes Sonas' analysis by stating that the successful model predictions for the results of grandmaster Bu Xiangzhi are the result of “cherry picking”, that is to say, it is simply an attempt to find an example for whom the model works and then use it as proof, while it may simply be the result of coincidence. I don’t agree with Thompson here, because it’s not true that Sonas presents the Bu Xiangzhi case as his main proof. Bu Xiangzhi is just an illustration of the quality of the model. The real proof Sonas provides concerns an analysis of the full population of chess players.

Thompson also states that an increase in the K-factor introduces the risk of inflation or deflation. Well, it’s obvious that rating systems can be vulnerable to inflation or deflation, but I don’t see why this risk is particularly pressing for a higher K-factor. To me, it seems that for any value of the K-factor (except zero) there is a risk of inflation/deflation; a higher K-factor only magnifies the effect when it occurs. According to the FIDE handbook (article 12), this is carefully monitored.
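A quick simulation (a toy model of my own, assuming two equally rated players with random results and the same K for both) illustrates both halves of this point: the update itself is zero-sum, so no points enter or leave the pool, but a higher K produces larger swings around the true strength:

```python
import random

def simulate(k, games=5000, seed=42):
    """Two players, same K; return the largest drift from 2500 and the point total."""
    random.seed(seed)
    ra = rb = 2500.0
    max_gap = 0.0
    for _ in range(games):
        ea = 1.0 / (1.0 + 10.0 ** (-(ra - rb) / 400.0))  # expected score for A
        score = random.choice((0.0, 0.5, 1.0))            # toy result model
        delta = k * (score - ea)
        ra += delta
        rb -= delta  # mirror update: B loses exactly what A gains
        max_gap = max(max_gap, abs(ra - 2500.0))
    return max_gap, ra + rb  # the total stays at 5000: the update is zero-sum

for k in (10, 24):
    gap, total = simulate(k)
    print(f"K={k:2d}  max drift={gap:6.1f}  total points={total:.6f}")
```

In this toy setting, inflation or deflation cannot come from the update rule itself; in practice it comes from players entering and leaving the pool, which is what the monitoring is for.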

All in all, Nunn and Thompson give some decent arguments against a rise of the K-factor, but in my opinion, none of them closes the debate. Their arguments show that we should be cautious in interpreting Sonas' analysis, but they do not show that the current K-factor value of 10 is better than any other value. I’d say the only argument left standing is that we have been using this value for a long time and that, so far, it has served us well.

*Hans Arild Runde*

**The real problem**

In conclusion, I would like to point out the very insightful contribution by Hans Arild Runde, who runs the live ratings website. In my opinion, Runde managed to pinpoint the real problem of choosing the right K-factor, which is more of a philosophical one: what do we want ratings to be, anyway? Do we want ratings to predict “immediate” or “future” results? Do we want to know who is the best player at this very moment, or do we want to know who will perform best in the coming year?

This choice has implications for the K-factor. A high K-factor will produce rating lists that indicate who is in good shape right now, while a low K-factor produces more conservative rating lists, where rankings are likely to hold for longer periods. This is more of a political than an empirical question. Sonas' analysis focuses on predicting immediate results. So it seems that if we want to predict immediate results better, we need to increase the K-factor. However, Sonas' predictions do not consider results that lie further in the future, which might lead to a different optimal value of the K-factor.

In the end, it seems that before we can decide (with the use of empirical research) what the optimal K-factor is, we first need to decide what we want ratings to be.


*Daan Zult is a PhD student at the University of Amsterdam, currently pursuing research into the thinking of chess players. ChessVibes thanks Kung-Ming Tiong, Assistant Professor of Mathematics at the University of Nottingham, Malaysia Campus for providing insights and feedback.*