Improving the 4PC Ratings Algorithm: A plausible crowd-sourced project?

kevinkirkpat

In the (pinned) thread on 4PC ratings:

 

VAOhlman wrote:

All in all I think a good discussion is in order while the current system continues. However would it be possible to code in another system as well and compare how well it predicts the games? Not use it for matching, but just for seeing if it is a better predictor. Because, in the long run, that is what we are looking for, no? A system that can fairly accurately predict how player one will do against player two.

 

This got me wondering: could we evaluate different ideas for rating functions more empirically?  I believe so... but doing so would require a lot of elbow grease.  To that end, I wondered if the chess.com team might be willing to tackle this in a crowd-sourced fashion:

  • Chess.com could do nothing more than publish[*] a large sample of historic game results (HUGE sticking point here: this entire idea hinges on the hope that chess.com actually retains records of historic game results),
  • members of the community could use that data to develop/optimize a new 4PC algorithm,
  • and chess.com could make a decision on how (or whether) it made sense to use it.

 

[*] The data dump could be as simple as a downloadable csv (comma-separated-value) file, limited to:

GAME_ID,RED_USERNAME,RED_SCORE,GREEN_USERNAME,GREEN_SCORE,BLUE_USERNAME,BLUE_SCORE,YELLOW_USERNAME,YELLOW_SCORE

(and if there are privacy concerns, perhaps just pseudo-anonymous PLAYER_ID numbers could be substituted for USERNAME)
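For what it's worth, consuming a dump like that would be trivial. Here's a minimal sketch of a loader in Python (the function name and the exact column handling are just my illustration of the layout above, not anything chess.com has committed to):

import csv

SEATS = ("RED", "GREEN", "BLUE", "YELLOW")

def load_games(path):
    """Return a list of games; each game is a list of four (username, score)
    pairs in seat order."""
    games = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            games.append([(row[f"{seat}_USERNAME"], int(row[f"{seat}_SCORE"]))
                          for seat in SEATS])
    return games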

 

The goal would be to encode different algorithms into functions that:

  • Took 12 input parameters describing each completed game (red's starting rating, a running count of red's rated games, red's final score; same for blue, yellow, and green)
  • Returned 4 output parameters: red's new rating, blue's new rating, yellow's new rating, and green's new rating,

... and assess the various algorithms for accuracy.  Assessment might be along the lines of:

1) Run the function against a training data set of games (i.e. first 90% of games) to compute each player's "training-data" rating.

2) Use the training-data ratings to predict results (player finishing order) of the last 10% of games.

3) Score those predictions (per game) by evaluating each game as six 2-player match-ups:

  • If there's a big gap between the two players' ratings (e.g. over 200 difference), then +2 for a correct prediction (1400 player finishes better than 1100 player) and -2 for an incorrect prediction (1500 player finishes worse than 1200 player).
  • If it's a medium-sized gap (between 50 and 200) then +1 for correct prediction and -1 for failed prediction
  • Ignore match-ups with small rating gaps (less than 50); reward +0 regardless of outcome.

Whichever function yields the highest average accuracy-score-per-game (for all of the 10% testing data) would be considered the "more accurate" function.  
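To make that concrete, here's a rough sketch of what the evaluation harness could look like. The 12-in/4-out rating-function signature, the 90/10 split, and the gap thresholds follow the proposal above; the 1200 starting rating, the tie handling, and all of the names are my own assumptions. Each game is a list of four (username, score) pairs in seat order, e.g. as produced by the loader sketch earlier.

from itertools import combinations

def evaluate(rating_fn, games, initial_rating=1200):
    ratings, counts = {}, {}
    split = int(len(games) * 0.9)

    # 1) Training pass: replay the first 90% of games through the candidate
    #    rating function to build each player's "training-data" rating.
    for game in games[:split]:
        inputs = []
        for player, score in game:
            inputs += [ratings.get(player, initial_rating),
                       counts.get(player, 0),
                       score]                    # 12 inputs: 3 per seat
        new_ratings = rating_fn(*inputs)         # 4 outputs: 1 per seat
        for (player, _), new_rating in zip(game, new_ratings):
            ratings[player] = new_rating
            counts[player] = counts.get(player, 0) + 1

    # 2) + 3) Prediction pass: score the last 10% of games as six
    # two-player match-ups, weighted by rating gap.
    total = 0
    for game in games[split:]:
        rated = [(ratings.get(p, initial_rating), s) for p, s in game]
        for (r_a, s_a), (r_b, s_b) in combinations(rated, 2):
            gap = abs(r_a - r_b)
            if gap < 50 or s_a == s_b:
                continue                         # small gap (or tie): no points
            weight = 2 if gap > 200 else 1
            correct = (r_a > r_b) == (s_a > s_b)
            total += weight if correct else -weight
    return total / max(len(games) - split, 1)    # average score per test game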

 

Some "sanity" guidelines should also be in play:

1) The logic should be sensible and (relatively) easily explained.  While the function may entail a lot of nuanced adjustments (perhaps accounting for things like relative ratings, score differentials, and positional aspects of the game), players should still have a basic sense of how their ratings should change based on the outcome of each game.  In particular, the player who finishes first should always experience an increase in rating and the 4th place player should always have a drop in rating.

2) Each game should be zero-sum: no net gain (or loss) of rating points among the 4 players of a game (a simple check for this is sketched after this list).

3) The function should be expressible using standard arithmetical functions and simple if-then-else constructs (i.e. no fancy libraries; easily portable to any language) 

4) The function should be deterministic (same inputs => same outputs) and exhibit "smooth" behavior (small changes in inputs should result in minor deviations in output) 

5) The function should be self-correcting.  If a single player has a rating of 1500 after starting at 1200 and playing 50 games, it should be possible to start that same player at 2200 and still wind up with a rating of ~1500 after those same 50 games.
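Guidelines 2 and 4, at least, lend themselves to quick automated checks. A sketch (here "game_inputs" is one game's 12-value input tuple in the order described above: rating, rated-game count, score for each of the four seats; the helper names are mine):

def check_zero_sum(rating_fn, game_inputs, tolerance=1e-6):
    old_ratings = game_inputs[0::3]          # the four starting ratings
    new_ratings = rating_fn(*game_inputs)
    return abs(sum(new_ratings) - sum(old_ratings)) < tolerance

def check_deterministic(rating_fn, game_inputs):
    # same inputs => same outputs
    return rating_fn(*game_inputs) == rating_fn(*game_inputs)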

 

Otherwise, there should be no restrictions on the logic and, IMO, there are a lot of ideas worth exploring (and vetting these possibilities is where the crowd-sourcing idea really seems to "fit"):

  • It may be that the rating of the player opposite is vitally important in deciding how strong a player is.  Perhaps a function should give more "street cred" to a player who wins against two adjacent 1400-rated opponents while opposite a 1200 player, than to a player who wins against two adjacent 1400 players while opposite a 1600 player.
  • It may be that "skewed ranking" games are (empirically) very poor predictors of player strength.  For instance, in a 1200 vs 1250 vs 1275 vs 1850 game, perhaps the function is "smart" enough to realize that no matter what happens to the 1850 player, it says very little about that player's skill level.  
  • It may be wise to use "provisional status" logic (looking at the running total of rated games played) to improve accuracy.  For example, a function might decide it's "less bad" to lose to a new 1200-rated player than to a well-established 1200-rated player; or that a new player who soundly defeats three 1600+ players is probably a *lot* better than their initial 1200 rating reflects.  
  • While not as important as final standing, I suspect score-differentials could shed some light on the players' skill levels.  It'd be interesting to see if a function could tease that out without overtly rewarding "running up the scoreboard" play over "quick claim win" play (which says more about players' patience levels than skill levels).  
Skeftomilos

Very interesting idea, but I bet my pet snake that chess.com doesn't retain records of historic game results.

kevinkirkpat

You may be right, but that may be a moot point.  As I circle back to this proposal with fresh eyes, I suspect the bigger hurdle is the request to publish chess.com's proprietary data (however watered-down).  There is a long history of companies releasing this kind of data (e.g. to researchers), only to discover that data-mining techniques could be used to reveal far more than what the publishers intended. 

 

The actual effort of compiling/anonymizing a dataset like this would probably be significantly less than the amount of work needed to obtain approval from the higher-ups to publish it.  

Skeftomilos

True. If I had this database in my hands, the first thing I would do would be to sort all pairs of players by how often they finish first and second together, to reveal all the cheaters! :-)

kevinkirkpat

Hah, good point!  More fun: after identifying cheaters, evaluate their rating (using conventional ELO method) to determine what their rating actually ought to be.  "Name and shame", as they say. :-)

Skeftomilos

I just stumbled upon an idea for a simpler rating mechanism for multiplayer games, by a guy named Tom Kerrigan. He describes his idea as follows:

1) At the end of a game, make a list of all the players and sort it by performance.
2) Think of each player as having played two matches: a loss vs. the player right above him on the list, and a win vs. the player right below him.
3) Update each player's rating accordingly using the two-player Elo equations.
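For concreteness, here is a minimal sketch of that neighbor-based update, using the standard two-player Elo equations and an assumed K of 32 (Kerrigan's own write-up may use different constants):

def elo_delta(rating, opponent, score, k=32):
    """Standard two-player Elo change; score is 1 for a win, 0 for a loss."""
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return k * (score - expected)

def kerrigan_update(ratings_in_finish_order, k=32):
    """ratings_in_finish_order[0] is 1st place. Each player 'loses' to the
    player just above and 'beats' the player just below; the two ends only
    get one virtual match each."""
    new = list(ratings_in_finish_order)
    last = len(ratings_in_finish_order) - 1
    for i, rating in enumerate(ratings_in_finish_order):
        if i > 0:      # virtual loss vs the player who finished just above
            new[i] += elo_delta(rating, ratings_in_finish_order[i - 1], 0, k)
        if i < last:   # virtual win vs the player who finished just below
            new[i] += elo_delta(rating, ratings_in_finish_order[i + 1], 1, k)
    return new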

What is interesting is that the inventor made the effort to measure the predictive ability of his algorithm, by having virtual players with known hypothetical true strengths play many 10-player games, and found that his algorithm performed better than Microsoft's TrueSkill. In other words, he didn't use actual data, he created it himself!
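That kind of experiment is easy to reproduce in miniature. A toy version (all of the numbers below are assumptions of mine, not Kerrigan's actual setup): virtual players get a hidden "true" strength, the finish order of each game is their noisy performance, and we check how often the learned ratings order a pair of players the same way their true strengths do. The "update" argument is any rule mapping ratings-in-finish-order to new ratings, e.g. the kerrigan_update sketch above.

import random

def run_experiment(update, n_players=10, n_games=2000, noise=200, seed=0):
    rng = random.Random(seed)
    true_strength = [rng.gauss(1500, 200) for _ in range(n_players)]
    rating = [1500.0] * n_players

    for _ in range(n_games):
        # finish order for this game: best noisy performance first
        order = sorted(range(n_players),
                       key=lambda i: true_strength[i] + rng.gauss(0, noise),
                       reverse=True)
        new = update([rating[i] for i in order])
        for place, i in enumerate(order):
            rating[i] = new[place]

    # crude accuracy: fraction of player pairs ranked the same way by the
    # learned ratings as by the hidden true strengths
    pairs = [(a, b) for a in range(n_players) for b in range(a + 1, n_players)]
    agree = sum((true_strength[a] > true_strength[b]) == (rating[a] > rating[b])
                for a, b in pairs)
    return agree / len(pairs)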

BabYagun

@Skeftomilos Tom Kerrigan's algorithm is simple. But consider this case:

 

1st place: 1500

2nd place: 1400

3rd place: 1600

4th place: 2000

 

Currently the 2nd-place player earns points for 2 wins (he beat the 1600 and 2000 players) and loses points once (he lost to the 1500 player). In Tom Kerrigan's algo he earns points for 1 win (over the 1600) and loses points once (to the 1500). It is not fair that the 1400 player played better than the 2000 one but gets no bonus for that.
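Putting rough numbers on that (K = 32 and the standard two-player Elo expectation assumed; the divide-by-3 follows this thread's description of the current scheme):

def elo_delta(rating, opponent, score, k=32):
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return k * (score - expected)

second = 1400
# current scheme: all three of the 2nd-place player's pairings, divided by 3
current = (elo_delta(second, 1500, 0)     # finished below the 1500
           + elo_delta(second, 1600, 1)   # finished above the 1600
           + elo_delta(second, 2000, 1)) / 3
# Kerrigan's scheme: only the immediate neighbours in the finish order count
kerrigan = elo_delta(second, 1500, 0) + elo_delta(second, 1600, 1)
print(round(current, 1), round(kerrigan, 1))   # roughly +14.6 vs +12.8

So under these assumptions the 2nd-place player still gains points either way, but the win over the 2000 contributes nothing in Kerrigan's scheme, which is exactly the point above.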

 

What is good in Tom Kerrigan's algo: if those 3 lower-rated players gang up on the 2000 player, only one of them profits from it.

Skeftomilos
BabYagun wrote:

What is good in Tom Kerrigan's algo: if those 3 lower-rated players gang up on the 2000 player, only one of them profits from it.

That's a clever thought! :-)
A rating system like this could invite some strange decisions on the board, like competing for third place instead of second, or even first! :-)

kevinkirkpat

 [EDIT: RETRACTION: Per follow-up with @Martin0, I believe the logic used in the link below is flawed, and thus my justification to switch to divide-by-6 is unfounded.  Divide-by-3 does, indeed, seem to be the correct approach]

 

(scroll to end for TL/DR)

 

I also came across this post:

http://sradack.blogspot.com/2008/06/elo-rating-system-multiple-players.html

This treats the topic in a more rigorous way, mathematically extending the 2-player ELO logic to the N-player scenario.

 

What I found most interesting is that the final equations wind up yielding the same results as the formula currently in use, with one important caveat: the "divide-by-3" correction should actually be "divide-by-6".  

 

Upon closer consideration, dividing by 6 makes sense.  The ELO algorithm is meant to not only adjust ratings in the right direction, but also by the "optimal" magnitude.   If the adjustments are too small, it may require players to play dozens, or even hundreds, of games before an accurate measure of their skill level is established.  But if they are too large, the rating pool becomes chaotic, with estimates jumping haphazardly from one game to the next.  So the math behind 2-player ELO is designed to take a new piece of information (the result of a single game) and make the optimal adjustment to the underlying ratings (estimates of players' skill levels), both in direction and in magnitude. 

 

When chess.com decided to use ELO for 4-player tracking, treating the 4-player game as 6 different individual 2-player games, they wound up getting the right "direction" of adjustment for each of the 4 players' ratings.   However, the net adjustment (total "churn" in the underlying pool of ratings) is equivalent to a situation where 6 new games have been played.  Before the "divide-by-3" correction was added, an 1800-rated player having a single bad game against 3 1200 players was being treated as equivalent to an 1800-rated chess player losing 6 consecutive games to a single 1200-rated player (despite having played just one game).  The "divide-by-3" correction definitely helped (now, an 1800 player having a bad game is "only" being treated as two consecutive bad games).  But this really needs to be tweaked further to "divide-by-6".

 

I'm working to build a simulation, similar to that in Skeftomilos's link, that I can use to compare the accuracy of the pure-ELO (with "divide-by-6" correction) methodology to the proposed simplification in that link (i.e. comparing each player to the ratings of the players finishing immediately above and below them).  However, as a rule of thumb: all else being equal, statistical methods that make use of more of the available information (here, the ratings of all 3 competitors rather than just one or two) tend to perform better.  Information-reduction techniques should only be considered when the more information-dense methods are untenable (or when the filtered-out information, on average, is definitively independent of the estimated value - almost definitely not the case here).

 

I should note two other things:

1) Based on the opening post, I've verified that chess.com is applying ELO using a "K-factor" of 32.  At least, that is the K-factor used in their example.  It's important to note that ELO is intended to be used with a K-factor that decreases over time... the more games a player has played, the closer their current rating probably is to their "true" skill level; thus, the less impact each new game should have on their overall rating.  Most standard chess organizations start new players with K-factors in the 30-40 range, but reduce that to 10-20 for players who've established a long history of play.  It may or may not be the case that chess.com is already applying this "declining-K-factor" method, but if they are not, I'd strongly recommend doing so.
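A declining-K schedule is easy to express in the same "simple arithmetic plus if-then-else" terms as the rest of the proposal. One possible shape (the thresholds and values below roughly mirror common FIDE-style practice, and are assumptions on my part, not chess.com's actual settings):

def k_factor(rated_games_played, rating):
    if rated_games_played < 30:
        return 40    # new player: let the rating move quickly
    if rating >= 2400:
        return 10    # established strong player: move slowly
    return 20        # established player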

 

2) Even if ELO (with divide-by-6) winds up being the best predictor in my upcoming simulations, I still think there's probably room for improvement.  Specifically, as many players have noted, the positioning of skilled/unskilled opponents can make a huge difference in any given game.  I'd much rather play opposite an 1800 player, while flanked by two 1500 players, than opposite a 1500 player and flanked by two 1800-rated players.  Furthermore, when one player massively out-ranks their 3 competitors, there is often a "gang-up-on-the-pro" effect, putting the strong player in a significantly disadvantaged position overall (and treating the game as a normal 1200-1200-1200-1800 matchup becomes highly dubious). If these patterns (and any others that might be pointed out) can be incorporated into my simulator, I may be able to explore various modifications to the ELO algorithm which take them into account.  

 

TL/DR: Chess.com really needs to modify current algorithm to divide final adjustment by 6, not 3.  Other methods may also improve ratings; further research underway.

kevinkirkpat

One final note: Players' current ratings have been reached under problematic implementations.  Assuming the "divide-by-6" fix is incorporated in the near future, I would also recommend a "temporarily-elevated K-factor", perhaps as high as 40 or 50, for the first 5 (or so) games every player plays under the fixed system.  This would help ensure that errors in current ratings are corrected more rapidly at first; and that players who had errantly-established ratings (either much too high or much too low) are more bluntly adjusted to conform to the more accurate system.

Martin0

There will eventually be a switch to Glicko instead of Elo, like all other ratings on chess.com. That change addresses these K-factor concerns: your rating changes less when your own rating is more accurate, and it also changes less when your opponent's rating is less accurate. You can read more here.

 

@kevinkirkpat, there are some errors in your logic for "divide-by-6". It's true that each game is calculated 6 times (red vs blue, red vs yellow, red vs green, blue vs yellow, blue vs green, yellow vs green), but dividing by 3 makes sense since each player is only in the calculation 3 times (once against each opponent). 

 

kevinkirkpat wrote:

Before the "divide-by-3" correction was added, an 1800-rated player having a single bad game against 3 1200 players was being treated as equivalent to an 1800-rated chess player losing 6 consecutive games to a single 1200-rated player (despite having played just one game).  The "divide-by-3" correction definitely helped (now, an 1800 player having a bad game is "only" being treated as two consecutive bad games).

This is incorrect. Before the "divide-by-3", the game would be treated the same as 3 losses for the 1800-rated player against a single 1200-rated player.  The other 3 games in the calculation are just games the 1200-rated players play against each other, which do not affect the 1800-rated player.
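The counting is easy to sanity-check in a couple of lines (a quick Python check, nothing more): 4 players produce 6 pairings, and each player appears in exactly 3 of them.

from itertools import combinations

players = ["red", "blue", "yellow", "green"]
pairings = list(combinations(players, 2))
print(len(pairings))                                                # 6
print({p: sum(p in pair for pair in pairings) for p in players})    # 3 each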

kevinkirkpat

[quick edit - fixed formula for 100-player scenario]

 

@Martin0

 

The link I cited performs a (seemingly) mathematically rigorous, step-by-step extension of the 2-player ELO algorithm to the N-player scenario, and determines that for an N-player match-up, the proper divisor is N(N-1)/2... that is, 6 for the 4-player game.

 

Intuition is nefariously misleading in the arena of probabilities (google "Monty Hall problem" for a famous example of this), so while I certainly "got" the intuitive case for "divide-by-3", I casually dismissed it in favor of the "more math-y" methodology used in the link.  I did not consider that the link itself, for all the "pretty" symbolic math, is a blog, not a peer-reviewed publication, and took the results more-or-less at face value.

 

Many times, it's easiest to resolve these questions by looking at extreme cases.  Consider a 100-player chess game (a massive Free-For-All 100-gon version of the current 4PC board).  Say player 1 has rating 1800 and the other 99 players have 1200 rating, and player 1 finishes in last place.  

Each individual 1800-vs-1200 match-up yields an adjustment of about -31 points, and player 1 appears in 99 such match-ups.  Your approach (divide by N-1 = 99) yields an adjustment of about (-31*99)/99 ≈ -31, which seems reasonable; the game impacts the 1800-rated player roughly as if they'd lost a single game to a single 1200 opponent.  The link's methodology (divide by N(N-1)/2 = 4950) would mean an adjustment of about (-31*99)/4950 ≈ -0.6... the 1800 player would need to finish dead last hundreds of times to end up with a suitably poor rating.
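Back-of-the-envelope version of that arithmetic (K = 32 and the standard Elo expectation assumed):

# each of player 1's 99 match-ups against a 1200, scored as a loss at 1800
per_matchup = 32 * (0 - 1 / (1 + 10 ** ((1200 - 1800) / 400)))   # ~ -31.0
print(per_matchup * 99 / 99)               # divide by N-1 = 99:        ~ -31
print(per_matchup * 99 / (100 * 99 / 2))   # divide by N(N-1)/2 = 4950: ~ -0.6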

 

 

Thanks for the pushback on this; I happily retract my criticism (and will add an edit accordingly)

kevinkirkpat

PS: how awesome would it be to organize the top-100 chess players of the world to play a single Death Match 100-PC game? :-)

Skeftomilos

@kevinkirkpat I hope that you'll manage to implement the simulation, because I am very interested to see the results!