Details and Comments Regarding the Elo System

JollyPlayer

Updated: Mar 16, 2010, 2:17 AM

The Details of the Elo System
by Jim Fox (JollyPlayer), Ph.D.

Colors: Red: WikiPedia Black: My writing Blue: Comments Green: Research

There is one thing everyone needs to understand: no system of measuring “performance” is perfect. Some are good, some are great, and some are spectacular. But none are perfect.

Physics Professor Arpad Elo (born in Hungary as Élő Árpád) came up with the Elo point system. What makes the Elo system so amazing? It is accurate, and it only takes an understanding of Algebra to compute.

Systems using finite Markov chains and other complex statistical analyses are sometimes considered more accurate - but they have a huge drawback. I took the course at the Ph.D. level. Sure, with computers such a rating could be calculated, but what if the program is written with a small error? It would take a Ph.D. (or close to it) statistician to figure it out.

No, Professor Elo came up with what I consider a beautiful system that works quite well and is not hard for someone with a good high school Algebra education to understand. Simply put, it works on the delta (the difference in the ratings of the two players) and some basic multiplication and division.

The Elo rating system is a numerical rating system in chess to compare the performance of individual players. It is a common misconception that the letters "ELO" in the Elo-rating system are some sort of abbreviation; the system was named after the Hungarian-American Physics Professor Arpad Elo.

Chess was only one of the many hobbies of Dr. Elo, although he was quite a respected player at the Master Level. He won over forty tournaments, including eight Wisconsin State Championships. But Dr. Elo was also involved with the chess community in other ways; he was the president of the (old) American Chess Federation from 1935 to 1937, and he was a co-founder of the United States Chess Federation (USCF) in 1939.

Before the adoption of the Elo rating system, there were several other rating systems in use, but they were not considered to be very accurate. The USCF was using a rating system developed by Kenneth Harkness. In this system, 1500 points marked an average player; 2000 points a strong club player and 2500 points a grandmaster player. Dr. Elo more or less retained the existing level-range, but he provided a much sounder statistical basis for comparing the individual player scores.

The Elo rating system was adopted by the USCF in 1960, and in 1970 by the World Chess Federation, FIDE. Until 1980, Dr. Elo was in charge of calculating all the ratings for FIDE, using nothing more than a Hewlett-Packard calculator.

When a player's actual rated game scores exceed his/her expected scores, the Elo system takes this as evidence that the player's rating is too low and needs to be adjusted upward. Similarly, when a player's actual rated game scores fall short of his/her expected scores, that player's rating is adjusted downward. Elo's original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player overperformed or underperformed his expected score. The maximum possible adjustment per game (usually called the K-value) was set at K = 16 for masters and K = 32 for weaker players.

Note: You may have noticed that early on at Chess.com, your score would go up or down much more than 32 points per game. Why? Well this fits into what statisticians call the “law of a low number of samples”. FIDE and the USCF do NOT publish a rating unless you have played at least 14 rated games. This avoids the law of low sampling.

So why does Chess.com publish them? I do not work for them, so I can only venture a guess. But my guess is that they want newbies to know their rating to keep their interest. Experienced players, though, must remember that a newbie’s score is not very accurate until they have about 15 rated games to their credit.

It is interesting to note that the probability of drawing, as opposed to having a decisive result, is not specified in the Elo system. Instead a draw is considered half a win and half a loss. That is the way it is scored in most, if not all, tournaments anyway. So it works quite well with regard to draws.

Say Player A has an Elo rating of R_A and Player B has a rating of R_B. I like to remember it as R for rating, with sub A for player “A”, the higher ranked player, and sub B for player “B”, the lower ranked player. The exact formula (using the logistic curve) for the expected score of Player A (from WikiPedia) is

E_A = 1 / (1 + 10^((R_B - R_A) / 400))

Similarly the expected score for Player B is:

E_B = 1 / (1 + 10^((R_A - R_B) / 400))

Side Note: A common question is why there is a 400 in both denominators. It is a scaling constant: with it, a player rated 400 points above an opponent is expected to score about .91, or roughly ten points for every one the opponent scores. It should not be confused with the rule of 400 points, which goes, “A difference in rating of more than 400 points shall be counted for rating purposes as though it were a difference of 400 points.” If the rating difference between two chess players is less than 400 points, the rule of 400 points is not applied. Now this does not mean that a score can move 400 points -- a common misconception. It means a DIFFERENCE in ratings of more than 400 points cannot be put into the formula.
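The logistic expected-score formula with 400 in the denominator, and the capping of rating differences at 400 points, can be sketched in Python. This is only an illustration; the function names are my own and do not come from any federation's software.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B on the logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def capped_difference(rating_a: float, rating_b: float, cap: float = 400.0) -> float:
    """Rating difference (A minus B), limited to +/- cap as in the rule of 400 points."""
    diff = rating_a - rating_b
    return max(-cap, min(cap, diff))


if __name__ == "__main__":
    e_a = expected_score(1800, 1600)
    e_b = expected_score(1600, 1800)
    # The two expected scores are complementary and add up to 1.
    print(f"E_A = {e_a:.3f}, E_B = {e_b:.3f}, sum = {e_a + e_b:.3f}")
    # A 400-point edge gives an expected score of 10/11.
    print(f"400 points ahead: {expected_score(2000, 1600):.3f}")
    print(f"2700 vs 2100, capped difference: {capped_difference(2700, 2100)}")
```

Whether a given site or federation applies the cap before computing the expected score is a policy choice, so the two helpers are kept separate here.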

Now with a little Algebra, the expected scores above give the change in each player's rating after a game:

ΔR1 = K (Actual Score Player1 - Expected Score Player1)

and

ΔR2 = K (Actual Score Player2 - Expected Score Player2)

The actual score is 1 for a win, .5 for a draw and 0 for a loss.

Note that E_A + E_B = 1. Why? Because they are probabilities. In practice, since the “true chess playing strength” of each player is unknown and can change with practice, lack of practice, etc., the expected scores are calculated using the players' current ratings. Probabilities are like percentages, only in decimal form. For example, if you have a 25% chance of winning, it is a probability of .25! Therefore, logically, they must add up to 1, as the probability of the other player is 75% or .75!

This is a very important concept. Otherwise, if a player ranked low beat a GM, their ranking would go up 400 points! Not quite fair for one game. So the K factor sets a MAXIMUM a score may move for one game - either up or down. Simple in concept, but it adds a bit of Algebra. The USCF and FIDE use an average score for a player until they get to 14 games. Otherwise, strong players' points would move up too much and weak players' points, not enough.

Supposing Player A was expected to score E_A points but actually scored S_A points. The formula for updating his rating is:

ΔR1 = K (S_A - E_A)

and his new rating is R_A + ΔR1.
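The update described here, where a player expected to score E_A actually scores S_A, is conventionally written as new rating = old rating + K × (S_A − E_A). A minimal Python sketch follows; the function names and the sample ratings of 1500 and 1700 are hypothetical and only serve to show the arithmetic.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating: float, expected: float, actual: float, k: float = 32) -> float:
    """Return the new rating: old rating plus K times (actual minus expected)."""
    return rating + k * (actual - expected)


if __name__ == "__main__":
    mine, theirs = 1500, 1700  # hypothetical ratings
    e_mine = expected_score(mine, theirs)
    for label, actual in (("win", 1.0), ("draw", 0.5), ("loss", 0.0)):
        new = updated_rating(mine, e_mine, actual)
        print(f"{label}: {mine} -> {new:.1f}")
```

Because the actual score can never be more than 1 away from the expected score, the rating can never move by more than K points in a single game.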

That is a nice equation. Let's use real numbers and maybe it will become clearer. That tends to help me (and my former students when I used to teach college algebra).

ΔR1 is the adjustment that gives the new Elo score. K is the K-Factor as mentioned above, 16 or 32 (some federations use different floors [400], different K-factors [16 & 32] and different averages [1400 for USCF, 1200 for Chess.com]; that is why FIDE scores and USCF scores can be different). The probability of winning can be calculated using Calculus on the underlying curve.

But really, who wants to do THAT!

Instead, like in most statistics books, a table is used that is extremely close to what the Calculus would come up with. Statistics books have tables for the Gaussian curve (Normal Curve) and many others so you can take statistics without a year of Calculus first. Here is a sample table that Dr. Elo came up with:

Win expectancies (Exp.) as a function of Elo difference points (Diff.) between two rated players
---------------------------------------------------------------
Diff.    Exp. | Diff.    Exp. | Diff.    Exp. | Diff.    Exp.
---------------------------------------------------------------
0-3      .50  | 92-98    .63  | 198-206  .76  | 345-357  .89
4-10     .51  | 99-106   .64  | 207-215  .77  | 358-374  .90
11-17    .52  | 107-113  .65  | 216-225  .78  | 375-391  .91
18-25    .53  | 114-121  .66  | 226-235  .79  | 392-411  .92
26-32    .54  | 122-129  .67  | 236-245  .80  | 412-432  .93
33-39    .55  | 130-137  .68  | 246-256  .81  | 433-456  .94
40-46    .56  | 138-145  .69  | 257-267  .82  | 457-484  .95
47-53    .57  | 146-153  .70  | 268-278  .83  | 485-517  .96
54-61    .58  | 154-162  .71  | 279-290  .84  | 518-559  .97
62-68    .59  | 163-170  .72  | 291-302  .85  | 560-619  .98
69-76    .60  | 171-179  .73  | 303-315  .86  | 620-735  .99
77-83    .61  | 180-188  .74  | 316-328  .87  | > 735    1.0
84-91    .62  | 189-197  .75  | 329-344  .88  |
---------------------------------------------------------------
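For readers who would rather compute than look up, the table entries can be approximated in Python. The sketch below rests on an assumption: that the table follows a normal curve in which each player's performance has a standard deviation of 200 points. That assumption reproduces the sample entries checked here, but treat it as my reading of the numbers, not as Dr. Elo's own derivation. The logistic value is printed alongside for comparison.

```python
import math


def table_expectancy(diff: float) -> float:
    """Win expectancy of the higher-rated player for a rating difference `diff`.

    Assumes a normal curve with standard deviation 200 per player, so the
    difference of two performances has standard deviation 200 * sqrt(2).
    """
    return 0.5 * (1.0 + math.erf(diff / 400.0))


def logistic_expectancy(diff: float) -> float:
    """Win expectancy from the logistic curve with the 400-point scale."""
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


if __name__ == "__main__":
    for diff in (0, 100, 200, 307, 400):
        print(f"{diff:>4}  normal: {table_expectancy(diff):.2f}  "
              f"logistic: {logistic_expectancy(diff):.2f}")
```

The two curves agree closely for small differences and drift apart by a point or two of probability as the gap grows.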

An Elo score update can be performed after a rated game or a rated tournament, or after any suitable rating period. An example may help clarify.

Let's say I have a USCF rating of 2150. I play David Pruess (an IM who works for Chess.com) whose USCF rating is 2457 at the time of the writing of this article. I think most people would easily see that at 307 Elo points less than David, I stand little chance. And the fact of the matter is, I have a very low probability of beating him. But I do have a chance.

I could have been practicing and be playing above my rating. David may have an off day, etc. I have a chance, just not a high one.

So instead of doing Calculus, we look at Dr. Elo’s chart. A difference of 307 points gives David an 86% chance of winning, or a .86 probability. That leaves me a chance of 14%, or a .14 probability.

We play a rated game via the International Correspondence Chess Federation (ICCF - recognized by the USCF) since David lives in the Bay Area near San Francisco and I live in Southern Indiana. As a note, all USCF members are automatically members of the ICCF. The ICCF rules are generally 24 hours to make a move (via email or online) and 50 days to finish the game.

I lose as expected. David beats me in 29 moves and I feel fortunate to last THAT long. What is my new USCF Elo score?

Here is the formula again:

ΔR1 = K (Actual Score Player1 - Expected Score Player1)

My rating when we started playing was 2150. K is the K factor. Since I am not at the Master level, it is 32. I lost, so my actual score is 0, and my expected score from the chart is .14. So if you remember your Algebra, do what is inside the parentheses first. In this case (0 - .14), which gives you -.14. Now Algebra rules state do multiplication and division before addition and subtraction, from left to right.

Therefore we multiply 32 by -.14 and get -4.48. My score goes down about 4 points to a new score of 2146. Now I think David has to use a K factor of 16 since he is a master level player. His actual score is 1 and his expected score was .86, so his score goes up 16 times .14, which is 2.24, or rounded, 2. Therefore his new rating is 2459.

Now you can do this for every game, or for a full tournament. Here is an example from WikiPedia for a tournament:

Suppose Player A has a rating of 1613, and plays in a five-round tournament. He loses to a player rated 1609, draws with a player rated 1477, defeats a player rated 1388, defeats a player rated 1586, and loses to a player rated 1720. His actual score is (0 + 0.5 + 1 + 1 + 0) = 2.5. His expected score, calculated according to the formula above, was (0.506 + 0.686 + 0.785 + 0.539 + 0.351) = 2.867. Therefore his new rating is (1613 + 32(2.5 − 2.867)) = 1601.

Note that while two wins, two losses, and one draw may seem like an even score, it is worse than expected for Player A because his opponents were lower rated on average. Therefore he is slightly penalized. If he had scored two wins, one loss, and two draws, for a total score of three points, that would have been slightly better than expected, and his new rating would have been (1613 + 32· (3 − 2.867)) = 1617.
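The WikiPedia tournament example can be reproduced with a short Python sketch. The helper names are my own; the ratings, results and K = 32 are the ones quoted in the example, and the expected scores come from the logistic formula.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Expected score against one opponent (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def tournament_update(rating: float, results, k: float = 32) -> float:
    """Update a rating once for a whole event.

    `results` is a list of (opponent_rating, score) pairs,
    where score is 1 for a win, 0.5 for a draw and 0 for a loss.
    """
    expected_total = sum(expected_score(rating, opp) for opp, _ in results)
    actual_total = sum(score for _, score in results)
    return rating + k * (actual_total - expected_total)


if __name__ == "__main__":
    results = [(1609, 0.0), (1477, 0.5), (1388, 1.0), (1586, 1.0), (1720, 0.0)]
    expected_total = sum(expected_score(1613, opp) for opp, _ in results)
    print(f"expected total: {expected_total:.3f}")
    print(f"new rating: {round(tournament_update(1613, results))}")
```

Rating the event as a whole uses the pre-tournament rating for every game, which is why it can differ slightly from updating after each round.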

Most accurate K-factor?

Elo's original K-factor estimation was made without the benefit of huge databases and statistical evidence. Sonas indicates that a K-factor of 24 (for players rated above 2400) may be more accurate both as a predictive tool of future performance, and also more sensitive to performance.

Certain Internet chess sites seem to avoid a three-level K-factor staggering based on rating range. For example the ICC (Internet Chess Club) seems to adopt a global K=32 except when playing against provisionally rated players.

The USCF (which makes use of a logistic distribution as opposed to a normal distribution) has staggered the K-factor according to three main rating ranges:

Players below 2100 -> K factor of 32 used
Players between 2100 and 2400 -> K factor of 24 used
Players above 2400 -> K factor of 16 used

FIDE uses the following ranges:

K = 25 for a player new to the rating list until he has completed events with a total of at least 30 games.
K = 15 as long as a player's rating remains under 2400.
K = 10 once a player's published rating has reached 2400, and he has also completed events with a total of at least 30 games. Thereafter it remains permanently at 10.
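The two K-factor schedules listed above can be written as small Python lookup functions. This is a sketch only: the function names are hypothetical, and the lists do not say which K applies at exactly 2100 or 2400 for the USCF ranges, so the boundary handling below is my assumption.

```python
def uscf_k_factor(rating: float) -> int:
    """K-factor from the three USCF rating ranges quoted above.

    Assumption: ratings of exactly 2100 and 2400 fall in the middle band.
    """
    if rating < 2100:
        return 32
    if rating <= 2400:
        return 24
    return 16


def fide_k_factor(rating: float, games_played: int, has_reached_2400: bool = False) -> int:
    """K-factor from the FIDE ranges quoted above.

    `has_reached_2400` records that the published rating reached 2400 at some
    point after the player completed 30 games; K then stays at 10 permanently.
    """
    if games_played < 30:
        return 25
    if has_reached_2400 or rating >= 2400:
        return 10
    return 15


if __name__ == "__main__":
    print(uscf_k_factor(2150), uscf_k_factor(2457))
    print(fide_k_factor(2300, 12), fide_k_factor(2300, 80), fide_k_factor(2380, 200, True))
```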

In over-the-board chess, the staggering of K-factor is important to ensure minimal inflation at the top end of the rating spectrum. This assumption might in theory apply equally to an online chess server, as well as a standard over-the-board chess organization such as FIDE or USCF. In theory, it would make it harder for players to get the much higher ratings, if their K-factor sensitivity was lessened from 32 to 16 for example, when they get over 2400 rating. However, the ICC's help on K-factors indicates that it may simply be the choosing of opponents that enables 2800+ players to further increase their rating quite easily. Picking opponents to up your score or for that matter, lower your score ON PURPOSE is called sandbagging.

This would seem to hold true, for example, if one analyzed the games of a GM on the ICC: one can find a string of games against opponents who are all over 3100. In over-the-board chess, it would only be in very high level all-play-all events that this player would be able to find a steady stream of 2700+ opponents – in at least a category 15+ FIDE event. A category 10 FIDE event would mean the average rating of the players is between 2476 and 2500. However, if the player entered normal Swiss-paired open over-the-board chess tournaments, he would likely meet many opponents rated less than 2500 FIDE on a regular basis. A single loss or draw against a player rated less than 2500 would knock the GM's FIDE rating down significantly.

Even if the K-factor was 16, and the player defeated a 3100+ player several games in a row, his rating would still rise quite significantly in a short period of time, due to the speed of blitz games, and hence the ability to play many games within a few days. The K-factor would arguably only slow down the increases that the player achieves after each win. The evidence given in the ICC K-factor article relates to the auto-pairing system, where the maximum ratings achieved are seen to be only about 2500.
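The point about a string of wins can be illustrated with a short Python sketch. The starting rating of 2800, the run of ten wins, and the simplification that the opponent's rating stays fixed at 3100 are all assumptions made for illustration.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Expected score against one opponent (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def rating_after_wins(rating: float, opponent: float, wins: int, k: float) -> float:
    """Apply `wins` consecutive wins against a fixed-rating opponent."""
    for _ in range(wins):
        rating += k * (1.0 - expected_score(rating, opponent))
    return rating


if __name__ == "__main__":
    for k in (16, 32):
        final = rating_after_wins(2800, 3100, wins=10, k=k)
        print(f"K = {k}: 2800 -> {final:.0f} after 10 straight wins")
```

A lower K slows the climb but does not stop it, which is consistent with the argument that the choice of opponents matters more than the K-factor.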

So it seems that random-pairing as opposed to selective pairing is the key for combatting rating inflation at the top end of the rating spectrum, and possibly only to a much lesser extent, a slightly lower K-factor for a player >2400 rating.

In general the Elo system has increased the competitive climate for chess and inspired players for further study and improvement of their game. However, in some cases ratings can discourage game activity for players who wish to "protect their rating".

Examples:

They may choose their events or opponents more carefully where possible.
If a player is in a Swiss Tournament, and loses a couple of games in a row, they may feel the need to abandon the tournament in order to avoid any further rating "damage".
Junior players, who may have high provisional ratings might play less than they would, because of rating concerns.

In these examples, the rating "agenda" can sometimes conflict with the agenda of promoting chess activity and rated games.

Ratings inflation and deflation

An increase or decrease in the average rating over all players in the rating system is often referred to as rating inflation or rating deflation respectively. For example, if there is inflation, a modern rating of 2500 means less than a historical rating of 2500, while the reverse is true if there is deflation. Using ratings to compare players between different eras is made more difficult when inflation or deflation is present. (See also Greatest chess player of all time.)

It has been suggested that an overall increase in ratings reflects greater skill. The advent of strong chess computers allows a somewhat objective evaluation of the absolute playing skill of past chess masters, based on their recorded games, but this is also a measure of how computer-like the players' moves are, not merely a measure of how strongly they have played. This, I think, is false and blames the computer age.

Instead, I would like to think that the late Bobby Fischer, Boris Spassky, Karpov, Kasparov, Polgar and other Grand Masters have bred a new class of great players. The competition is getting better. Today you can make a living playing chess. A hundred years ago, you could not.

The number of people with ratings over 2700 has increased. Around 1979 there was only one active player (Anatoly Karpov) with a rating this high. This increased to 15 players in 1994, while 33 players have this rating in 2009, which has made this top echelon of chess mastery less exclusive. One possible cause for this inflation was the rating floor, which for a long time was at 2200, and if a player dropped below this they were stricken from the rating list. As a consequence, players at a skill level just below the floor would only be on the rating list if they were overrated, and this would cause them to feed points into the rating pool.

In a pure Elo-system, each game ends in an equal transaction of rating points. If the winner gains N rating points, the loser should drop by N rating points. While this prevents points from entering or leaving the system when games are played and rated, it typically results in rating deflation.
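The equal transaction can be checked with a short Python sketch. The names and sample ratings are hypothetical. The second call shows that once the two players use different K-factors, the transaction is no longer equal and points enter or leave the pool.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Expected score against one opponent (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def decisive_game(winner: float, loser: float, k_winner: float = 32, k_loser: float = 32):
    """Return the (winner, loser) ratings after a decisive game."""
    gain = k_winner * (1.0 - expected_score(winner, loser))
    loss = k_loser * (0.0 - expected_score(loser, winner))
    return winner + gain, loser + loss


if __name__ == "__main__":
    w, l = decisive_game(1700, 1900)
    print(f"same K: pool changes by {(w + l) - (1700 + 1900):+.2f}")
    w, l = decisive_game(1700, 1900, k_winner=32, k_loser=16)
    print(f"different K: pool changes by {(w + l) - (1700 + 1900):+.2f}")
```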

In 1995, the United States Chess Federation found that several young scholastic players were improving faster than the rating system was able to track. As a result, established players with stable ratings started to lose rating points to the young and underrated players. Several of the older established players were frustrated over what they considered an unfair rating decline, and some even quit chess over it.

Combating deflation

Because of the significant difference in timing of when inflation and deflation occur, and in order to combat deflation, most implementations of Elo ratings have a mechanism for injecting points into the system in order to maintain relative ratings over time. FIDE has two inflationary mechanisms. First, performances below a "ratings floor" are not tracked, so a player with true skill below the floor can only be unrated or overrated, never correctly rated. Second, established and higher-rated players have a lower K-factor. New players have a K=25, which drops to K=15 after 30 played games, and to K=10 when the player reaches 2400.

The current system in the United States includes a bonus point scheme which feeds rating points into the system in order to track improving players, and different K-values for different players.

Some methods, used in Norway for example, differentiate between juniors and seniors, and use a larger K factor for the young players, even boosting the rating progress by 100% when they score well above their predicted performance.

Rating floors in the USA work by guaranteeing that a player will never drop below a certain limit. This also combats deflation, but the chairman of the USCF Ratings Committee has been critical of this method because it does not feed the extra points to the improving players. (A possible motive for these rating floors is to combat sandbagging, i.e. deliberate lowering of ratings to be eligible for lower rating class sections and prizes.)

So as you can see, the Elo system is elegant and the K-factor very important. The debate rages over whether score inflation is caused by players getting better, by players selecting opponents, or by players dropping out of tournaments that are Swiss or Round Robin in nature. Similar debates go on in regard to score deflation.

Chess.com uses the Glicko Adjustment (see Erik's Blog). The floor of the Glicko Adjustment is a K-value of 16.

Chess.com reports your scores NOW, not once a quarter. Some FIDE GMs have gone up very high, but not stayed there until the next rating period; hence, it was not “official”.

Sandbagging happens even at Chess.com. If you play a lot of players with a 1200 rating (newbies) and beat them, your score will jump up and theirs will greatly fall. This, of course, discourages newbies, artificially inflates your score, and causes problems. Enter a tournament with your newly inflated score and you will most likely be beat like a drum.

Sandbagging works the other way. Lose on purpose to higher rated players so you can enter a tournament in a lower division and possibly win. I think that happens less on Chess.com because there are no entry fees and no cash prizes.

Greeter unrated games help combat the inflation problem. If you are UNSELFISH and wish Chess.com to flourish, you should be willing to play unrated games with newbies and help them out.

As a note, no Official Federation recognizes ICC scores. Sure, it gives a fine estimate of what an internet player can do in a Blitz game. Many ICC “masters” also play in over-the-board chess. Their ratings there are often significantly lower - and often nowhere near the master level. My Blitz rating at Chess.com is 400 points higher than my turn based or long game rating. That may be because I have only played about 5 games and won 4 of them - several on time. I no longer play Blitz chess - as I think it is mostly about quick traps and wild moves.

Fischer Random Chess (Chess 960) fits into a similar category. Since it is not consistent, the rating is a fun measure, but means little. In fact, it is less of a predictor than standard Elo ratings.

I hope this article helped. If you are not math oriented, well, formulas can be daunting, I know. But if you remember your high school or college Algebra, well, now you may understand Dr. Elo’s system better and why FIDE and USCF scores differ.

Questions or comments? Leave them below or email me at clergy@chess.com

Jim Fallet, Ph.D. Sept. 2009

Edit 1: The Glicko System or "Adjustment" to the Elo system. It has been posted elsewhere that Chess.com uses the Glicko system. Why another adjustment? Mark Glickman, Ph.D., who is at the University of Boston (see: http://www.freechess.org/Help/HelpFiles/glicko.html), suggested that the Elo system had a small flaw which caused an inaccuracy. This flaw was based on time.

The idea is that if you have played recently in a rated game or tournament, your rating would be more accurate. If you had not played recently, and as time marched on, your rating was not as accurate. Therefore, time would cause your Elo rating to move more sharply -- simply because of the unknown.

The more time, the more it should affect your score. If the two players have had the same time difference, then the Glicko adjustment would have little or no effect. In the other thread pointed to below, a question came up: if my Elo rating is X, what is it with the Glicko adjustment?

Well, the answer to that question would depend on the time since your last game and the time since your opponent's last game. The formula is a long one, and unlike the Elo system, is not easily explained without a degree in Statistics.
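For the curious, here is a minimal Python sketch of the original Glicko formulas as I understand them from Dr. Glickman's published description. It is not Chess.com's or the USCF's actual implementation. Each player carries a rating and a "rating deviation" (RD) that grows with time away from rated play. The growth constant `c` is chosen by whoever runs the rating system, so any value passed to `inflate_rd` below is an assumption.

```python
import math

Q = math.log(10) / 400.0  # scaling constant used throughout the Glicko formulas


def inflate_rd(rd: float, periods_inactive: float, c: float, rd_max: float = 350.0) -> float:
    """Grow the rating deviation with time since the last rated game."""
    return min(math.sqrt(rd * rd + c * c * periods_inactive), rd_max)


def g(rd: float) -> float:
    """Discount applied to a result according to the opponent's uncertainty."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def glicko_update(rating: float, rd: float, games):
    """Return (new_rating, new_rd) after one rating period.

    `games` is a list of (opponent_rating, opponent_rd, score) tuples,
    with score 1 for a win, 0.5 for a draw and 0 for a loss.
    """
    d2_inv = 0.0  # accumulates 1 / d^2
    total = 0.0   # accumulates g * (actual - expected)
    for opp_rating, opp_rd, score in games:
        g_j = g(opp_rd)
        e_j = 1.0 / (1.0 + 10 ** (-g_j * (rating - opp_rating) / 400.0))
        d2_inv += Q * Q * g_j * g_j * e_j * (1.0 - e_j)
        total += g_j * (score - e_j)
    denom = 1.0 / (rd * rd) + d2_inv
    return rating + (Q / denom) * total, math.sqrt(1.0 / denom)


if __name__ == "__main__":
    games = [(1400, 30, 1.0), (1550, 100, 0.0), (1700, 300, 0.0)]
    new_rating, new_rd = glicko_update(1500, 200, games)
    print(f"rating {new_rating:.0f}, RD {new_rd:.1f}")
```

In effect the RD plays the role of a K-factor that adjusts itself: the longer a player has been away, the larger the RD and the more a single result moves the rating.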

Personally, I find it sad that Chess.com has chosen to use this adjustment. As I stated above, the Elo system is so close, and simple, why mess with it? Well, that is what statisticians at the doctoral level do. I myself am guilty. But in my case, I tried to simplify a process. But with the age of computers, we can become more complex.

A PostScript file containing Mark Glickman's paper discussing this ratings system may be obtained via ftp. The ftp site is hustat.harvard.edu, the directory is /pub/glickman, and the file is called "glicko.ps". It is available at http://hustat.harvard.edu/pub/glickman/glicko.ps.

The paper listed above is no longer available. My guess is Harvard took it down. The reference used here was last updated Feb of 2008.

In the formula, there is at least one constant not defined. I would love to see the original paper, but it is no longer available. Time, in my opinion, is a double edged sword. You could be using your time to get better, sit and watch TV, or simply work to support your family. How big of a factor is time to a chess rating? Dr. Glickman considers it to be a rather large factor. So does the USCF and Chess.com it seems. Nobody suggested FIDE is using the system.

As of the time of the writing of the review cited, it was still being debugged. This bothers me greatly. Dr. Elo did ALL the calculations with an HP calculator for years. He could do that because of the elegant simplicity of the system.

The Glicko adjustment gives more "points" for a person who has laid off for a while but comes back and wins. The idea is that the old rating, when compared to a newer rating, would be less accurate. But if the system was still being debugged in 2008, have errors crept in? Who would know except Dr. Glickman? The paper I am sure is available in a research library. I live 20 miles from a university with such a library. Odds are I would have to order the paper. Such things can cost $10 to $25. If the library has it (paper or micro), well then, you can pay just for the photocopying.

There is no direct way to compare an Elo score to a Glicko adjusted score. You could do it once, but after that you would have to keep both of them separately to see the difference. It would make a very interesting research study. All research studies start with a hypothesis -- normally a null hypothesis. For such a study, the null hypothesis would be something like "There is no difference between Elo Scores and Glicko adjusted scores." My thought is that the null would probably be statistically upheld. But I am guessing since, from what I do know, I do not have enough information -- especially about the constant and why the K value must be 16 or more.

Dr. Glickman said his scores acted like a probability. Dr. Elo said they were probabilities and could derive a curve. Probabilities are always estimates. I think Dr. Glickman would be safe in saying they are probabilities in the true sense of the word, although some statisticians go crazy at definitions - so I can see why he hedged it a bit.

Your thoughts? Should you get more points if you have not played competitively for a few years and win over someone who has been playing? Do you lose skill with time? Certainly age has a negative effect on cognitive ability. But 30 to 34 is different than 35 to 64!

One thing does bother me besides my hypothesis about time. The K-factor. FIDE will go down to 10. But with the Glicko system, 16 is as low as it will go. Some say the USCF ratings are inflated. No doubt the K-factor and the Glicko system have a lot to do with that.

Edit 2: Dr. Glickman is now at Boston University. His paper was published in the journal Applied Statistics (Vol. 48, pp. 377-394). His paper was available at http://math.bu.edu/people/mg/research but is no longer. An explanation (much more in depth than the stub at Wikipedia) is available at http://math.bu.edu/people/mg/glicko/glicko.doc/glicko.html. This document also suffers from a lack of explanation.

Edit 3: I wonder about something. Nothing in the Glicko adjustment or the Elo System suggests this. You play someone rated 1200 who is playing 2 or 3 games. The game goes on for 5 or so moves and THEY time out.

Now my thought is, I had to be vigilant and wait for them. They time out, I get no points and they lose no points. Why? Is this unique to Chess.com? I lost my first USCF game and I lost points on my rating. This seems to be an "addition" to the system by Chess.com (or maybe by Dr. Glickman, but you have to buy his paper to know). I am sure this keeps the newbie interested, but I can think of a lot better ways to keep them interested. Nowhere in life, just because you are new, do you get a break. Usually it is just the opposite.

You must learn that losing costs you. Study. Buy a Diamond Membership. Something. But losing is not free. If you do not pass your driving test, they do not give you your license just because you are a beginner, now do they?
