Elo System with "streaks" added .....

JollyPlayer

Updated: Apr 7, 2010, 2:44 AM

Elo_s System?

By Dr. Jim Fox

I have been a strong opponent of the Glicko system as formulated by Dr. Glickman of Boston University. He has the credentials and is a very brilliant man. But to me, the Glicko system uses time as a negative factor to a chess player's rating. This to me has no solid logic to it. Since I started this paper, Dr. Glickman has made a new website.

Instead of info at Harvard and Boston University and the USCF, he has put it all in one place. The site is titled “Mark Glickman’s World”. Dr. Glickman is brilliant. The difference between myself and Dr. Glickman is a major one. I work in applied statistics. Dr. Glickman is much more theoretical. For statistics to be applied, they must be able to be explained to a wide audience. Theoretical statistics look to improve methods and are often accessible only to a handful of Ph.D.s who, in some cases, do not want to do the work of even verifying the model (as I found out in a conference).

Well, no formal study has been done on the Glicko system on a large scale to my knowledge. Probably the best study has been done right here on Chess.com. Thousands of players have been rated using the Glicko 2 system. What did we find? Players can climb fast. There are way too many players in the 2000 and greater range. Meanwhile, it seems low rated players (under 800 would be a good example) have a hard time finding players to play. Lots of players may leave after 1, 2 or 3 games.

I do not have access to the Chess.com database to write a program to calculate the mean and standard deviation of ALL rating scores. It is a huge sample and probably a waste of time and energy anyway. This is a common problem in statistics. So a sample can be taken. If the sample is random AND large enough (rule of thumb is 5% or 30 people) then the sample is considered fairly representative of the whole.

In a large group, to make sure all are appropriately represented and to be a truly random sample, each person must have had the same chance of being selected when the sample selection started. That criterion was met.

I built a spreadsheet of the most common letters, stratified by letter. I chose a random country that began with that letter, took all its players, and divided the count by 50 to get the sampling interval. Then, using a random number table, I picked a random number between 1 and 85 (85 being the interval in this example). Let’s say the number was 5. I took player 5, 90, 175, 260 ... to the end of the list.
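The every-85th-player selection described above is usually called systematic sampling. Here is a minimal sketch of it in Python. The function name, its parameters, and the 300-player list are my own illustration and not anything from Chess.com; the random start stands in for the random number table.

```python
import random


def systematic_sample(players, interval, start=None):
    """Return every `interval`-th player, beginning at 1-based position `start`.

    If `start` is not given, a random start between 1 and `interval` is drawn,
    which stands in for the random number table.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start is None:
        start = random.randint(1, interval)
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and interval")
    # Positions are 1-based in the essay, so subtract 1 for list indexing.
    return [players[pos - 1] for pos in range(start, len(players) + 1, interval)]


# Hypothetical list of 300 players, identified only by their position.
positions = list(range(1, 301))
print(systematic_sample(positions, interval=85, start=5))  # [5, 90, 175, 260]
```

With a start of 5 and an interval of 85, the sketch picks the same positions as the example: 5, 90, 175, 260.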

Let's tackle the low rated player problem first. Chess.com has a free option. When every player signs up, they are at 1200 (the mean number, or the “average player’s” rating). Actually I think that is a realistic number for an average player IF THEY PLAY. If a player plays 3 games and loses all three, their score can be 700 very easily. It is the player who signs up and NEVER plays or plays very little that is problematic. There are several problems:

They sign up - but never play a single game. This happened about 40% of the time. I was amazed. Several players had in their trophy case the “Welcome to Chess.com” and the “1 Year Anniversary” trophies, yet they never played a single game. Some people picked wonderful pictures for themselves, and a nice style sheet for their home page. Some even made several posts in the forums. Yet they never played any chess.
What do you enter for their score? One would say use 1200 (the average) for them. Others might theorize that they could not figure out the game and never played. Frozen -- scared to death. In that case their score should be zero.
Lastly, they could be skipped. That is a fine statistical solution, but at some point you will end up skipping four to ten players. Is that acceptable? It is done in statistics quite often.

There were a handful who played 3 to 5 games and had scores well below 1200. The hypothesis behind this behavior is that they lost early, got discouraged and left. That is where the Glicko System hurts a bit. Elo gives a provisional rating. The Glicko gives you a rating from the beginning.

But provisional ratings put people off. They have no desire to have no rating until they have 12 or 15 games. I have played over 30 "Welcome to Chess.com Greeter" games. Rarely do the players finish these games. Once they get things figured out they want to play "for real". That means to play for points.

So after randomly picking 50 players, what did I learn? Well, it REALLY depends on what you do with the registered non-playing players.

Here is how it breaks down. For what is commonly known as the average (more accurately, the mean), we get 586.6 for the rating and 24.3 for the number of games. That is using zero as the score for a player who never played.

Using 1200 as the rating for never-played players, the mean = 1347.8, and the number of games does not change and remains at 24.3.

Lastly, using only players who have played, the numbers skyrocket upward and the curve is drastically negatively skewed. Rating mean = 1602.4 and the number of games is 66.3.

Wow -- is the average really 1602? I think that method is skating on thin statistical ice. Being able to play on your PC, Mac, Linux machine, iPhone, or cell phone makes Chess.com very accessible, so if they never played, they do not want to play. I think the first set of numbers, with a rating of 586.6, number of games = 24.3 and a standard deviation of a whopping 770.9 (which seems realistic when the scores go from 0 to 2103), is the most realistic.

OK, using zeros seems unrealistic because anyone who has the IQ to sign up is better than zero. So realistic numbers seem to be

Mean Rating = 1347.8

Mean Games = 24.3

Standard Deviation = 266.4

There would be one more option and instead of using 1200, use the actual mean of the players instead. But that is fraught with circular problems.
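To see how much the answer depends on the treatment of never-played accounts, here is a small sketch that computes the mean and standard deviation three ways: zeros, 1200, or skipping. The `summarize` helper and the five-player list are HYPOTHETICAL toy data of my own. They are not the 50-player sample, which is not reproduced in this post.

```python
from statistics import mean, stdev


def summarize(ratings, fill=None):
    """Mean and sample standard deviation of ratings.

    `ratings` uses None for a registered player who never played.
    `fill` replaces each None with a number; fill=None skips those players.
    """
    if fill is None:
        values = [r for r in ratings if r is not None]
    else:
        values = [fill if r is None else r for r in ratings]
    return mean(values), stdev(values)


# HYPOTHETICAL toy data, not the 50-player sample from the essay.
toy_ratings = [None, None, 1500, 1700, 900]

for label, fill in [("zero", 0), ("1200", 1200), ("skip", None)]:
    m, s = summarize(toy_ratings, fill)
    print(f"{label:>5}: mean = {m:.1f}, sd = {s:.1f}")
```

Even on five made-up players the three treatments give very different means, which is the same pattern the real sample showed.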

So why all the gyrations? Was I bored and needed something to do? Well, I always love a good challenge. But I was thinking about how elegant the Elo system is compared to the Glicko system or the Elo system heavily modified as the USCF uses it (Dr. Glickman sat on that committee for a while).

Many players think that time between games should be used as a penalty against the player. Of course the idea was that the older you get, the less use you have of your mental faculties. Because another factor is used, many feel the Glicko 2 system is more accurate. Chess.com uses seconds since your last game. Are you really worse if it has been a week since you played? That would be 60 seconds x 60 minutes x 24 hours x 7 days. An adjustment to account for 604,800 seconds is necessary in the Glicko system. Is that really more accurate?

I doubt it. Complexity does not always correlate with accuracy. So what adjustment factor MIGHT affect a player's score? Being a Licensed Mental Health Therapist (I know, four college degrees in several fields - I am too educated for my own good), I started thinking about how a win streak gives me confidence and a losing streak seems to last forever. What if, instead of time, you adjust the probability for streaks? That makes more sense than time as an adjustment to chess scores.

Now, I have a personal problem from this point forward. I have the utmost respect for Dr. Elo. He was not only the “father” of modern chess scores and a brilliant man, but also a great chess player. He was Wisconsin State Champion several times.

His system is a piece of art. Improving it, or even suggesting an improvement, is like telling Beethoven he made a mistake in the 9th Symphony! Of course time does march on. John Williams, Richard Rodgers, Andrew Lloyd Webber and others stood on the shoulders of giants like Beethoven.

With my respects to Dr. Glickman, time is an unproven factor -- especially short amounts of time. But streaks have been proven. As you win, you feel better and usually (but not always) carry that confidence forward into winning. Lots of books have been written on the psychology of chess. None, to my knowledge, have been written on time between games.

Dr. Glickman is convinced time is a factor. Days, even seconds, between games would make you a lesser player. Statisticians who work with humans know that mental capacity does diminish with age. IQ tests, for example, are normed to age groups. But in some cases, people who keep going do not lose mental faculties, and if they took an IQ test their score would go up as the norms are coming down -- even though the player's mental faculties are staying the same.

But even with IQs you are talking about YEARS - not days or seconds or even months. One of the most fascinating studies done in years has to be the “Nun Study” as reported in the book “Aging With Grace: What the Nun Study Teaches Us About Leading Longer, Healthier, and More Meaningful Lives”. This book shows that quiet living, constant study and several other factors keep nuns living longer and very sharp in their old age. Many nuns in their 90s or 100s are mentally sharp as can be.

Dr. Mark Glickman came up with the Glicko systems, said they needed further study, and they have been left cold. What I am proposing is a very small variation to the great work of Dr. Arpad Elo. I do not want a patent like Dr. Glickman wanted, was tentatively given, and was later overturned. It seems it was about then he moved on in his career and left his chess system behind.

But he did sit on the board of the United States Chess Federation (USCF). Of course he was placed on a committee to evaluate and recommend changes to the calculations of the USCF ratings. If you would like to see the USCF formula in all its glory, it can be found in PDF format at the USCF site.

It takes 11 pages to explain the system. In contrast, Dr. Elo did a rating system with a simple formula and a table of scores derived from a normal curve of actual players' scores. It could be put on 1 to 2 pages.

I have been quite critical of the Glicko system. I personally have never met Dr. Glickman. I probably never will, being disabled, and I never travel far. I would like it to be known that Dr. Glickman may be a fine person and a great professor. Why did I go to the trouble of saying that? Well, I made some critiques of the current USCF system on the USCF bulletin board. I was attacked and asked to prove my case when nobody else was. I got the feeling that my message came out wrong, as my accusers told me how gracious Dr. Glickman is. Fine, but when you publish a work, it can be criticized.

I fully expect this work to be criticized - and me along with it.

I call it the Elo “streak” system, or Elo_s (Elo sub s), as the credit remains, and shall remain, with Dr. Elo. The S, of course, is for streak. The streak adjustment is a small change to the percentage based on a player's recent results. Beating someone on a great streak, in this system, is worth slightly more. And winning after losing many games on your own streak is worth slightly more.

Right off the bat I can hear it: streaks used like this “are prone to sandbagging” and cheating. Yes, they certainly are. But no more than the Elo or Glicko system without this adjustment. You can always lower your rating hoping to gain more points with a victory or two in a lower section of a good tournament. It is risky. Same with sandbagging here on Chess.com. Sandbagging is very risky -- you might get caught or, more likely, just flat beaten. A lower rating or a worse streak may give your opponent false confidence. It may backfire.

OK, so how does this work? I will try to explain so you do not need a Ph.D. in Statistics to understand. Dr. Elo worked out a system where you multiply the chances of winning by the K-value. The lower rated the match, the higher the K-value. Master level matches have a lower K-value.

Examples help demonstrate the point. I play another low rated amateur. The K-Value is 25. That is the maximum point “change”. But Dr. Elo knew statistical facts. Once you get 3 standard deviations apart, the percentage rounds to zero. Therefore, if the higher rated player won, they would win no points. If the lower rated player won, they would win all 25 points. This, of course, would eliminate lower rated players from ever getting games with higher rated players.

Dr. Elo “floored” the percentage at 84% or .84. So if you played in the K-value range of 25, the most you could win is .84 x 25 = 21 points and the most you could lose is 4 points.

Now where did the .84 come from? Dr. Elo studied players and recorded them and made a “percentage” chart (speaking statistically, based on standard deviations and a normal curve). If we are both rated, say, 1000, our percentage (chance of winning) would be 50/50 or 50% which is .50 in decimal form. .5 x 25 points is 12.5 points or rounded to 13 points. The winner would get 13 points and the loser would lose 13 points.

The K-Value goes down to 10 at the master level. So if we were both rated at 2600 then .5 x 10 = 5 points. The winner would get 5 points, the loser would lose 5 points. This is why it is hard for master players to quickly move up or down. It also is why getting to be a Grand Master is a great accomplishment. There are far, far fewer Grand Masters than people with earned Ph.D.s!
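The arithmetic in the last few paragraphs can be sketched in a few lines of Python. This is only my illustration of the simplified description above (percentage times K-value, rounded to the nearest point), not Dr. Elo's full tables. The function name is hypothetical, and percentages are kept as whole numbers (84 means .84) so that 12.5 rounds up to 13 as in the text.

```python
def points_at_stake(percent, k_value):
    """Points for a result: percentage times K-value, rounded half up.

    `percent` is a whole-number percentage (84 means .84). Integer math is
    used so that 12.5 rounds to 13, as in the essay's example.
    """
    return (percent * k_value + 50) // 100


print(points_at_stake(50, 25))  # two 1000-rated players: 13
print(points_at_stake(50, 10))  # two 2600-rated players: 5
print(points_at_stake(84, 25))  # most you could win at K = 25: 21
print(points_at_stake(16, 25))  # most you could lose at K = 25: 4
```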

FIDE uses the system and until 1980, Dr. Elo calculated all the ratings by hand. Towards the end of his career, he used one of the first HP calculators (I worked a good part of a summer to buy one when I was a college kid, about 1980 and it cost me $175 and did a lot less than a $15 calculator does today). The system, as I repeat myself, was, and is, elegant and wonderful.

What Dr. Elo could not foresee was the internet revolution. The Glicko System has one advantage over the Elo system that I can see. It does not have a “provisional” rating. In the 1950s, provisional ratings (your rating did not change for 14 games or so) were not a problem. The published ratings came out once a quarter. But the internet demands a rating now. I have a proposed solution for that too, a bit later in this paper.

But first, let's look at how a “streak” might affect the adjustment to the K-Value. Let us use the last 16 games a player has played. If a player has won 14 of 16 in his/her K-value range, they are on a streak. Throw out games not in their K-Value range. If the player has not had 16 games, use what they have.

8 wins in 16 games would be neutral. No streak, positive or negative. So no adjustment. Easy enough. Now how much harder would it be to break someone’s streak? How many extra points or percentage should that be worth? There is a whole section of statistics called “subjective statistics” or “subjective probability”.

For example, the odds of winning the Super Bowl are subjective, but a number is put to them. It is done in sports, and subjective statistics are even used in the insurance industry. For example, one tornado has hit your area in the last 100 years of recording. What is the chance of one hitting a new subdivision near your house? It is not 1 in 100. It could be higher (more building going on) or lower (that was a freak tornado, they almost never happen). Subjective.

I like round numbers, so let's use 10. 10 is the streak number, or .10 in decimal form. It is NOT multiplied as that would make the number lower (multiply two numbers less than 1 and they get closer to zero). They must be added. So let's go back to two players rated at 1000 and in the K-value range of 25. The probability would be 50% or .50 in decimal form. The middle would be zero.

If a player had no streak (8 for 16) there would be no adjustment. But a player 16 for 16 would have a 10% or .10 decimal adjustment. 16 for 16 would be 100%.

So I play you. I have a rating of 1000 and you 1300 (Dr. Elo tried to keep 100 as a standard deviation for those used to such ideas). So the K-value normally would be 25, and the percentage floored at 84% or .84 as mentioned before. Add .10. Now the percentage is 94% or .94 decimal. .94 x 25 would equal 23.5 or rounded to 24 points. Before it was .84 x 25 which equaled 21 points.

You would gain an extra 3 points for breaking a 16 game streak. The losing player would lose an extra 3 points for having the streak broken. Most of the time it would only make a zero to 1 point difference. But it has one great advantage. It would have the press talking about streaks because they would MEAN something.
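Here is the same example as a short Python sketch, assuming the streak adjustment is simply added to the floored percentage before multiplying by the K-value. The function and constant names are mine, and percentages are whole numbers (84 means .84) so the rounding matches the 23.5 to 24 step above.

```python
LIMIT_PERCENT = 84  # the .84 limit described in the essay


def streak_adjusted_points(base_percent, streak_percent, k_value):
    """Points at stake once a streak adjustment is ADDED to the percentage.

    All percentages are whole numbers (10 means .10). The base percentage
    is held to the .84 limit first, then the streak adjustment is added.
    """
    percent = min(base_percent, LIMIT_PERCENT) + streak_percent
    return (percent * k_value + 50) // 100  # round half up


without_streak = streak_adjusted_points(84, 0, 25)
with_streak = streak_adjusted_points(84, 10, 25)
print(without_streak, with_streak, with_streak - without_streak)  # 21 24 3
```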

Removing Provisional and Adjusting for many Amateurs

Ratings, especially in the Glicko system, move a lot in the first few games. Those of us on Chess.com have experienced this. One loss and HOLY COW, your rating jumps 90 points or 125 points. Let's eliminate that to keep more people interested in chess as a whole. Let's use an Elo provisional K-value of, say, 30. Floored at .84 as Dr. Elo did and adjusted maximally by .10 for streaks, the highest adjustment would be .94 x 30, which equals 28.2, or 28 points rounded. After 30 games, the K-Value goes to 25.

When a player reaches 2200, the K-value is 15. Once the player reaches 2400 (the International Master, or IM, range) then the K-value moves to 10. This is why it is so hard to become a Grand Master (over 2500 plus some other criteria). Once the K-Value reaches 15, it stays there (no going back to a higher value) and once it reaches 10, it stays there permanently. This is similar to the current FIDE system.
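The K-value schedule in this section can be written as a small Python function. This is a sketch under my own assumptions: “reaches” is read as “at or above”, the first 30 games use K = 30, and a new player who is already at 2200 or more gets the lower K-value right away. The function name and arguments are hypothetical.

```python
def next_k_value(current_k, rating, games_played):
    """K-value for a player's next game under the proposed schedule.

    Assumptions (mine, not spelled out in the essay): thresholds are
    inclusive, and the rating thresholds take priority over the
    30-game provisional period.
    """
    if rating >= 2400 or current_k == 10:
        return 10  # once at 10, it stays at 10 permanently
    if rating >= 2200 or current_k == 15:
        return 15  # once at 15, it never goes back up
    if games_played < 30:
        return 30  # provisional K-value for the first 30 games
    return 25


print(next_k_value(30, 1200, 0))    # brand new player: 30
print(next_k_value(30, 1200, 30))   # after 30 games: 25
print(next_k_value(25, 2200, 100))  # reaches 2200: 15
print(next_k_value(15, 2100, 120))  # drops back under 2200: still 15
print(next_k_value(15, 2400, 200))  # reaches 2400: 10
```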

In Summary ...

No adjustments for time. Small adjustments for streaks. No provisional ratings. There would be less movement early on in an online player’s career and hence I think more players would stay. If Chess.com and the USCF would adopt it (unlikely), the ratings would be close to the FIDE rating.

Catching cheaters would be easier and more fair. They must play 20 games of LIVE chess in 20 minute time control against players rated as high or higher. The streak adjustment would be a killer. If they could not beat LIVE players, well, Chess.com would have to do its thing.

To get details on the Elo System, see my blog post.

One last thing must be done. A curve for the streak adjustment is necessary.

0/16 add .10

1/16 add .09

2/16 add .08

3/16 add .07

4/16 add .05

5/16 add .03

6/16 add .02

7/16 add .01

8/16 add .00

Do the reverse for a negative streak (or add to the opponent which I prefer). Simple. Just like Dr. Elo’s system. Simple. Standing on the shoulders of a giant! I do not want credit for modifying Dr. Elo’s system. I just “modernized” it to accommodate the internet world - hopefully in a spirit Dr. Elo would approve.
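For anyone who wants to program the curve, here is a minimal Python lookup. It is a sketch under two assumptions of mine: the table is mirrored around 8 wins, so 16/16 gets the same .10 as 0/16 (matching the earlier 16 for 16 example), and only full 16-game histories are handled, since the scaling for shorter histories is not spelled out above. The function returns only the size of the adjustment in hundredths. Who receives it (the player or the opponent) is left to the rules above.

```python
# Adjustment in hundredths (10 means .10), indexed by wins 0..8 out of 16.
STREAK_CURVE = [10, 9, 8, 7, 5, 3, 2, 1, 0]


def streak_adjustment(wins, games=16):
    """Size of the streak adjustment, in hundredths, for a 16-game record.

    ASSUMPTION: the curve is mirrored, so 9..16 wins reuse the values
    for 7..0 wins. Histories shorter than 16 games are not covered.
    """
    if games != 16:
        raise ValueError("this sketch only covers full 16-game histories")
    if not 0 <= wins <= games:
        raise ValueError("wins must be between 0 and 16")
    distance_index = min(wins, games - wins)
    return STREAK_CURVE[distance_index]


print(streak_adjustment(16))  # 10, a perfect streak
print(streak_adjustment(14))  # 8
print(streak_adjustment(8))   # 0, no streak either way
```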

Dr. Jim Fox

April, 2010

“JollyPlayer” on Chess.com
