Math People Only!: Changes to how much ratings change...

jay

No idea, never seen their formulas.

meniscus

Looks standard. It did have a simple way of explaining RD, which I'll repost for anyone who is still clueless about Glicko.

This explanation is based on the FICS help page on Glicko ratings.



As you may have noticed, each user has a rating and an RD. RD stands for ratings deviation.

What RD represents

The ratings deviation measures how much a user's current rating should be trusted. A high RD indicates that the user may not compete frequently or may not have played many games yet at the current rating level. A low RD indicates that the user's rating is fairly well established. This is described in more detail under Mathematical Interpretation of RD below.

How RD Affects Ratings Changes

In general, if your RD is high, then your rating will change a lot each time you play. As it gets smaller, the ratings change per problem will go down. However, the opponent's RD has the opposite effect, to a smaller extent: if their RD is high, then your ratings change will be somewhat smaller than it would be otherwise.
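To make the effect concrete, here is a rough Python sketch of a single-game Glicko update using the standard published formulas (the site's actual constants, rounding, and minimum-K handling may differ):

    import math

    Q = math.log(10) / 400  # Glicko scaling constant

    def g(rd):
        # Attenuation factor: a high opponent RD shrinks g, and with it your rating change.
        return 1 / math.sqrt(1 + 3 * Q ** 2 * rd ** 2 / math.pi ** 2)

    def glicko_update(rating, rd, opp_rating, opp_rd, score):
        # score is 1 for a win, 0.5 for a draw, 0 for a loss.
        g_opp = g(opp_rd)
        expected = 1 / (1 + 10 ** (-g_opp * (rating - opp_rating) / 400))
        d_sq = 1 / (Q ** 2 * g_opp ** 2 * expected * (1 - expected))
        denom = 1 / rd ** 2 + 1 / d_sq
        new_rating = rating + (Q / denom) * g_opp * (score - expected)
        new_rd = math.sqrt(1 / denom)  # always a bit smaller than the old RD
        return new_rating, new_rd

With your own RD at 350 the multiplier Q / denom is large, so one result moves the rating a lot; at RD 50 the same result moves it only a handful of points, and a high opponent RD shrinks g_opp and damps the change further.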

How RD is Updated

In this system [emrald-meniscus], the RD will decrease somewhat each time you solve a problem, because when you solve more problems there is a stronger basis for concluding what your rating should be. However, if you go for a long time without solving any problems, your RD will increase to reflect the increased uncertainty in your rating due to the passage of time. Also, your RD will decrease more if the problem's rating is similar to yours, and decrease less if the problem's rating is much different.
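The "passage of time" part corresponds to the pre-game RD inflation in Glickman's write-up, roughly the following (the value of c and the time units are the site's own and are not public here):

    import math

    def inflate_rd(rd, elapsed, c):
        # RD grows with time since the last rated game, capped at the starting value of 350.
        return min(math.sqrt(rd ** 2 + c ** 2 * elapsed), 350)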

Mathematical Interpretation of RD

Direct from Mark Glickman:
Each tactician can be characterized as having a true (but unknown) rating that may be thought of as the tactician's average ability. We never get to know that value, partly because we only observe a finite number of problems, but also because that true rating changes over time as a tactician's ability changes. But we can estimate the unknown rating. Rather than restrict oneself to a single estimate of the true rating, we can describe our estimate as an interval of plausible values. The interval is wider if we are less sure about the tactician's unknown true rating, and the interval is narrower if we are more sure about the unknown rating. The RD quantifies the uncertainty in terms of probability:

  • The interval formed by current rating +/- RD contains your true rating with probability of about 0.67.
  • The interval formed by current rating +/- 2 * RD contains your true rating with probability of about 0.95.
  • The interval formed by current rating +/- 3 * RD contains your true rating with probability of about 0.997.
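Those three probabilities are just the usual normal-distribution coverage numbers, and are easy to check with any normal CDF:

    import math

    def coverage(k):
        # P(true rating within k * RD of the current rating) under a normal model
        return math.erf(k / math.sqrt(2))

    print([round(coverage(k), 3) for k in (1, 2, 3)])  # approximately 0.683, 0.954, 0.997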


Credits

The Glicko Ratings System was invented by Mark Glickman, Ph.D., who is currently at Boston University.

meniscus

For those of you who know something about statistics, those last intervals are not confidence intervals, but are called central posterior intervals because the derivation came from a Bayesian analysis of the problem. These numbers are found from the cumulative distribution function of the normal distribution with mean = current rating, and standard deviation = RD. For example, CDF[ N[1600,50], 1550 ] = .159 approximately (that's shorthand Mathematica notation.)
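For anyone without Mathematica, the same check in Python (scipy's normal distribution plays the role of N[1600,50] above):

    from scipy.stats import norm

    # P(true rating <= 1550) when the current rating is 1600 and the RD is 50
    print(norm.cdf(1550, loc=1600, scale=50))  # roughly 0.1587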

Atos

I am not an expert on rating systems, but if FIDE, the other chess federations, and chess sites use Elo, that seems like a good reason for us to use it as well. Irrespective of the comparative merits of the two systems, I think we would like to be part of the world's chess community, and we would like the ratings here to be as much in line as possible with those assigned by the major organizations.

On a more personal note, my experience with Glicko is that it 'punishes' me by taking away a lot of points when I have not played for a while and am just a bit rusty. By the time I have played myself back into shape, the RD will have fallen and it will take much longer to make up the points lost. I think that a high RD should only be applied to players who are new to the site.

ichabod801

Okay, I got my modified code and Jay's translated code working. I also created 50 imaginary players, and generated 2000 imaginary games between them. I created a random time stamp for each game based on an average between my rate of play and that of He-Who-Names-Himself-Constantly. Now, the imaginary players have "real" ratings that the game results are based on. That way, we can check not only things like how volatile the ratings are, we can test how accurate they are.
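The setup was presumably something along these lines (a rough reconstruction, not ichabod801's actual script; the mean gap between games is a made-up placeholder):

    import random

    random.seed(0)

    # 50 imaginary players with known "true" ratings
    true_ratings = [random.gauss(1500, 300) for _ in range(50)]

    def simulate_result(r_a, r_b):
        # Player A wins with the usual logistic expectation for the rating difference.
        p_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        return 1.0 if random.random() < p_a else 0.0

    games, t = [], 0.0
    for _ in range(2000):
        a, b = random.sample(range(50), 2)
        t += random.expovariate(1 / 600.0)  # placeholder: average of 600 minutes between games
        games.append((a, b, simulate_result(true_ratings[a], true_ratings[b]), t))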

I ran a test to compare my code to Jay's. Now, I don't round the ratings, round the RD's, or floor the K's, so I turned off those parts of Jay's code for the comparison. The biggest difference in ratings calculations between the two functions was less than 10^-12, which can be attributed to floating point errors due to the different order of calculation. Likewise, the biggest difference between RD calculations was less than 10^-13. So either we have both implemented glicko correctly, or we're both doing it wrong in the same way with different algorithms. I'm betting on the former.

If you reset Jay's code to round the ratings, round the RD's, and floor the K's, the biggest difference in the ratings calculations is about 7, and the biggest difference in the RD calculations is about 0.1.

Then I took Jay's code with the rounding of rating and rd, but without the flooring of K (since that isn't being done now), and tested it with c squared and c not squared. With c squared the average RD was 75. With c not squared the average RD was 113 (which is more than the maximum RD with c squared!).
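For anyone following along, the thing being tested is whether the time term in the RD inflation uses c squared or just c; Glickman's write-up squares it, so if c is less than 1 the un-squared variant adds a bigger term, which matches the higher average RD reported above:

    import math

    def inflate(rd, t, c, squared=True):
        term = (c ** 2 if squared else c) * t
        return min(math.sqrt(rd ** 2 + term), 350)

    # Illustrative numbers only; the real c and time units are not public here.
    print(inflate(50, 10000, 0.3, squared=True))   # modest growth in RD
    print(inflate(50, 10000, 0.3, squared=False))  # noticeably larger RD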

Then I took Jay's code squaring c and tried it with the floor for K on and off. There was no real difference between the two runs.

So my conclusion is to fix the c squared error, but see how that works before making any other changes.

jay
ichabod801 wrote:

Okay, I got my modified code and Jay's translated code working. I also created 50 imaginary players, and generated 2000 imaginary games between them. I created a random time stamp for each game based on an average between my rate of play and that of He-Who-Names-Himself-Constantly. Now, the imaginary players have "real" ratings that the game results are based on. That way, we can check not only things like how volatile the ratings are, we can test how accurate they are.

I ran a test to compare my code to Jay's. Now, I don't round the ratings, round the RD's, or floor the K's, so I turned off those parts of Jay's code for the comparison. The biggest difference in ratings calculations between the two functions was less than 10^-12, which can be attributed to floating point errors due to the different order of calculation. Likewise, the biggest difference between RD calculations was less than 10^-13. So either we have both implemented glicko correctly, or we're both doing it wrong in the same way with different algorithms. I'm betting on the former.

If you reset Jay's code to round the ratings, round the RD's, and floor the K's, the biggest difference in the ratings calculations is about 7, and the biggest difference in the RD calculations is about 0.1.

Then I took Jay's code with the rounding of rating and rd, but without the flooring of K (since that isn't being done now), and tested it with c squared and c not squared. With c squared the average RD was 75. With c not squared the average RD was 113 (which is more than the maximum RD with c squared!).

Then I took Jay's code squaring c and tried it with the floor for K on and off. There was no real difference between the two runs.

So my conclusion is to fix the c squared error, but see how that works before making any other changes.


Awesome work!! I'm in the process of doing that now. I'll release the code when I have a chance to do so this coming week.

LATITUDE
jay wrote:
ichabod801 wrote:

Okay, I got my modified code and Jay's translated code working. I also created 50 imaginary players, and generated 2000 imaginary games between them. I created a random time stamp for each game based on an average between my rate of play and that of He-Who-Names-Himself-Constantly. Now, the imaginary players have "real" ratings that the game results are based on. That way, we can check not only things like how volatile the ratings are, we can test how accurate they are.

I ran a test to compare my code to Jay's. Now, I don't round the ratings, round the RD's, or floor the K's, so I turned off those parts of Jay's code for the comparison. The biggest difference in ratings calculations between the two functions was less than 10^-12, which can be attributed to floating point errors due to the different order of calculation. Likewise, the biggest difference between RD calculations was less than 10^-13. So either we have both implemented glicko correctly, or we're both doing it wrong in the same way with different algorithms. I'm betting on the former.

If you reset Jay's code to round the ratings, round the RD's, and floor the K's, the biggest difference in the ratings calculations is about 7, and the biggest difference in the RD calculations is about 0.1.

Then I took Jay's code with the rounding of rating and rd, but without the flooring of K (since that isn't being done now), and tested it with c squared and c not squared. With c squared the average RD was 75. With c not squared the average RD was 113 (which is more than the maximum RD with c squared!).

Then I took Jay's code squaring c and tried it with the floor for K on and off. There was no real difference between the two runs.

So my conclusion is to fix the c squared error, but see how that works before making any other changes.

Cool

Awesome work!! I'm in the process of doing that now. I'll release the code when I have a chance to do so this coming week.


 OSTIA PEDRIN!!

jay

Alright guys, the new changes are live... let me know how it goes. :)

deepOzzzie

Awesome, thanks!

zankfrappa

Hooray for Jay!!!

jay

Alright, I have put the minimum K value back in, even though Mark Glickman had no idea why it was there, although FICS does use it, and apparently he helped them implement their formulas. Without the min K value, the ratings of people with low RDs just don't move at all (like the computers in live chess). I've also created a calculator you can use to test various scenarios. Math people, please get in there, run some tests, and let me know if the output looks correct. It certainly doesn't feel correct at times.
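If the K being floored here is the usual Glicko multiplier q / (1/RD^2 + 1/d^2), which is a guess based on this thread rather than anything confirmed, the floor would slot in roughly like this (K_MIN is a made-up placeholder, not the site's actual constant):

    import math

    Q = math.log(10) / 400
    K_MIN = 16  # hypothetical floor value

    def rating_change(rd, d_sq, g_opp, score, expected):
        k = Q / (1 / rd ** 2 + 1 / d_sq)  # shrinks toward zero as RD gets small
        k = max(k, K_MIN)                 # the floor keeps low-RD ratings from freezing
        return k * g_opp * (score - expected)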

 

Try scenarios like a GM rated 2500 against an E player rated 1200, and you'll see that if both players have normal RDs of around 50, their ratings don't move all that much, or not as much as you'd expect from such a huge upset.

 

http://www.chess.com/echess/rating_test.html

 

The output on the bottom can be interpreted as follows: the top array (1) holds the white player's values, and the bottom array (2) holds the black player's values. The table output is simply a cleaner presentation of these same values. However, it's interesting to see the various other values like E, A, K, etc.

 

thanks!
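Reusing the glicko_update sketch from earlier in the thread (plain textbook formulas with no minimum K and no rounding, so the numbers will not match the calculator exactly), the scenario above looks like:

    # Both players at RD 50, as suggested above
    print(glicko_update(2500, 50, 1200, 50, 1))  # the favorite wins: the ratings barely move
    print(glicko_update(1200, 50, 2500, 50, 1))  # the upset: a larger but still modest jump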

eddiewsox
jay wrote:

Alright, I have put the minimum K value back in, even though Mark Glickman had no idea why it was there, although FICS does use it, and apparently he helped them implement their formulas. Without the min K value, the ratings of people with low RDs just don't move at all (like the computers in live chess). I've also created a calculator you can use to test various scenarios. Math people, please get in there, run some tests, and let me know if the output looks correct. It certainly doesn't feel correct at times.

 

Try scenarios like a GM rated 2500 against an E player rated 1200, and you'll see that if both players have normal RDs of around 50, their ratings don't move all that much, or not as much as you'd expect from such a huge upset.

 

http://www.chess.com/echess/rating_test.html

 

The output on the bottom can be interpreted as follows: the top array (1) holds the white player's values, and the bottom array (2) holds the black player's values. The table output is simply a cleaner presentation of these same values. However, it's interesting to see the various other values like E, A, K, etc.

 

thanks!


Thank you Jay!

Kacparov

What's that "minutes" for?

jay

Minutes since the last game that person played.

Kacparov

I don't think it changes anything; I tried various numbers and it never changed the rating or the new RD.

jay

I've just adjusted the formulas a little this morning to try and get rid of some rounding errors and also make sure RD Prime is capped at 350. Minutes will not affect things very much unless you change the value drastically.
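That matches the shape of the formula: the minutes enter as c^2 * t added to RD^2 before the square root, and the result is then capped at 350, so modest changes to the minutes barely register:

    import math

    def pregame_rd(rd, minutes, c):
        # c is the site's own constant and is not public; the value below is illustrative only.
        return min(math.sqrt(rd ** 2 + c ** 2 * minutes), 350)

    print(pregame_rd(50, 60, 0.2), pregame_rd(50, 600, 0.2))  # the RD moves by a fraction of a point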

HiggsBoson

Is there a set ratings differential at which there is no ratings change when the more highly rated player wins? I know it would vary with the RDs, but it looks like with a low RD the diff is around 500 points.
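With the plain formulas (no minimum K and no rounding), you can scan for the point where the winner's gain drops below half a point; where exactly it lands depends on both RDs, and the site's K floor and rounding will shift it:

    import math

    Q = math.log(10) / 400

    def gain_for_winner(diff, rd, rd_opp):
        # Rating gain for the higher-rated player after a win, standard Glicko, no floor.
        g_o = 1 / math.sqrt(1 + 3 * Q ** 2 * rd_opp ** 2 / math.pi ** 2)
        e = 1 / (1 + 10 ** (-g_o * diff / 400))
        d_sq = 1 / (Q ** 2 * g_o ** 2 * e * (1 - e))
        return (Q / (1 / rd ** 2 + 1 / d_sq)) * g_o * (1 - e)

    for diff in range(100, 1001, 100):
        print(diff, round(gain_for_winner(diff, 50, 50), 2))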

Kacparov
jay wrote:

I've just adjusted the formulas a little this morning to try and get rid of some rounding errors and also make sure RD Prime is capped at 350. Minutes will not affect things very much unless you change the value drastically.


I've tried from 1 to 9999. No change.

jay

Well, what RD values are you using? 350?

Kacparov

I tried from 30 to 350.