In the (pinned) thread on 4PC ratings:
VAOhlman wrote:
All in all I think a good discussion is in order while the current system continues. However would it be possible to code in another system as well and compare how well it predicts the games? Not use it for matching, but just for seeing if it is a better predictor. Because, in the long run, that is what we are looking for, no? A system that can fairly accurately predict how player one will do against player two.
This got me wondering: could we evaluate different ideas for rating functions more empirically? I believe so, but it would take a lot of elbow grease. To that end, I wondered if the chess.com team might be willing to tackle this in a crowd-sourced fashion:
Chess.com could do nothing more than publish[*] a large sample of historical game results (HUGE sticking point here: this entire idea hinges on the hope that chess.com actually retains records of historical game results),
members of the community could use that data to develop/optimize a new 4PC algorithm,
and chess.com could then decide how (or whether) it makes sense to use the result.
[*] The data dump could be as simple as a downloadable CSV (comma-separated values) file, limited to:
GAME_ID,RED_USERNAME,RED_SCORE,GREEN_USERNAME,GREEN_SCORE,BLUE_USERNAME,BLUE_SCORE,YELLOW_USERNAME,YELLOW_SCORE
(and if there are privacy concerns, perhaps just pseudonymous PLAYER_ID numbers could be substituted for USERNAME)
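Just to make the contributor side concrete: reading such a dump would be trivial. Here's a minimal sketch in Python (the file name, the load_games name, and the dictionary layout are my own assumptions, not anything chess.com has committed to):

```python
import csv

def load_games(path):
    """Read the hypothetical results dump into a list of per-game dicts."""
    games = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            games.append({
                "game_id": row["GAME_ID"],
                "players": {
                    color: {
                        "username": row[f"{color.upper()}_USERNAME"],
                        "score": float(row[f"{color.upper()}_SCORE"]),
                    }
                    for color in ("red", "green", "blue", "yellow")
                },
            })
    return games

games = load_games("4pc_results.csv")  # hypothetical file name
```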
The goal would be to encode different algorithms into functions that:
Took 12 input parameters describing each completed game (red's starting rating, a running count of red's rated games, red's final score; same for blue, yellow, and green)
Returned 4 output parameters: red's new rating, blue's new rating, yellow's new rating, and green's new rating,
... and assess the various algorithms for accuracy (a rough sketch of such a harness appears after the scoring rules below). Assessment might be along the lines of:
1) Run the function against a training data set of games (i.e. first 90% of games) to compute each player's "training-data" rating.
2) Use the training-data ratings to predict results (player finishing order) of the last 10% of games.
3) Score those predictions (per game) by evaluating each game as six 2-player match-ups:
If there's a big gap between the two players' ratings (e.g. more than a 200-point difference), then +2 for a correct prediction (the 1400 player finishes better than the 1100 player) and -2 for an incorrect prediction (the 1500 player finishes worse than the 1200 player).
If it's a medium-sized gap (between 50 and 200 points), then +1 for a correct prediction and -1 for an incorrect one.
Ignore match-ups with small rating gaps (less than 50 points): +0 regardless of outcome.
Whichever function yields the highest average accuracy score per game (across the 10% of games held out for testing) would be considered the "more accurate" function.
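To make that concrete, here's a rough Python sketch of the whole harness, building on the loader above. Everything in it is my own guesswork about the shape of things: rating_fn stands for whichever candidate algorithm is under test (using the 12-input / 4-output interface described earlier), unrated players are assumed to start at 1200, and score order is used as the proxy for finishing order since that's all the dump would contain.

```python
COLORS = ("red", "green", "blue", "yellow")
DEFAULT_RATING = 1200  # assumed starting rating for players not yet seen

def evaluate(games, rating_fn, split=0.9):
    """Replay the first `split` fraction of games through rating_fn, then
    score its predictions on the rest. Returns average accuracy score per test game."""
    cutoff = int(len(games) * split)
    ratings, games_played = {}, {}

    # Step 1: compute "training-data" ratings by replaying the training games.
    for game in games[:cutoff]:
        inputs = []
        for color in COLORS:
            p = game["players"][color]
            inputs += [ratings.get(p["username"], DEFAULT_RATING),
                       games_played.get(p["username"], 0),
                       p["score"]]
        new_ratings = rating_fn(*inputs)  # 12 numbers in, 4 new ratings out
        for color, new_rating in zip(COLORS, new_ratings):
            name = game["players"][color]["username"]
            ratings[name] = new_rating
            games_played[name] = games_played.get(name, 0) + 1

    # Steps 2 & 3: score each test game as six 2-player match-ups.
    total = 0
    for game in games[cutoff:]:
        players = [(ratings.get(game["players"][c]["username"], DEFAULT_RATING),
                    game["players"][c]["score"]) for c in COLORS]
        for i in range(4):
            for j in range(i + 1, 4):
                (ra, sa), (rb, sb) = players[i], players[j]
                gap = abs(ra - rb)
                if gap < 50 or sa == sb:
                    continue  # small rating gap (or tied scores): +0 either way
                weight = 2 if gap > 200 else 1
                correct = (ra > rb) == (sa > sb)
                total += weight if correct else -weight
    return total / max(1, len(games) - cutoff)
```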
Some "sanity" guidelines should also be in play:
1) The logic should be sensible and (relatively) easy to explain. While the function may entail a lot of nuanced adjustments (perhaps accounting for things like relative ratings, score differentials, and positional aspects of the game), players should still have a basic sense of how their ratings will change based on the outcome of each game. In particular, the first-place player should always gain rating and the 4th-place player should always lose rating.
2) Each game should be zero-sum: no net gain (or loss) of rating points among the four players of a game.
3) The function should be expressible using standard arithmetic operations and simple if-then-else constructs (i.e. no fancy libraries; easily portable to any language).
4) The function should be deterministic (same inputs => same outputs) and exhibit "smooth" behavior (small changes in inputs should result in small changes in outputs).
5) The function should be self-correcting. If a player ends up rated 1500 after starting at 1200 and playing 50 games, it should be possible to start that same player at 2200 and still wind up at ~1500 after the same 50 games.
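Purely as an illustration of what a function meeting those guidelines might look like (not a proposal), here's a toy Elo-style baseline in the same 12-input / 4-output shape. It treats each game as six pairwise mini-matches, which makes it zero-sum by construction, deterministic, smooth, and self-correcting in the usual Elo sense; the K value is a made-up placeholder, and it ignores the games-played inputs entirely:

```python
K = 16  # placeholder step size; tuning this would be part of the exercise

def baseline_rating_fn(r_rating, r_games, r_score,
                       g_rating, g_games, g_score,
                       b_rating, b_games, b_score,
                       y_rating, y_games, y_score):
    """Toy Elo-style update: treat the game as six pairwise results by final score."""
    ratings = [r_rating, g_rating, b_rating, y_rating]
    scores = [r_score, g_score, b_score, y_score]
    deltas = [0.0, 0.0, 0.0, 0.0]
    for i in range(4):
        for j in range(i + 1, 4):
            expected = 1 / (1 + 10 ** ((ratings[j] - ratings[i]) / 400))
            if scores[i] > scores[j]:
                actual = 1.0
            elif scores[i] < scores[j]:
                actual = 0.0
            else:
                actual = 0.5
            change = K * (actual - expected)
            deltas[i] += change
            deltas[j] -= change  # mirror image keeps each game zero-sum
    return tuple(r + d for r, d in zip(ratings, deltas))
```
As long as there are no tied scores, this toy version also happens to satisfy guideline 1: the first-place player beats expectations in all three of their pairings and must gain points, while the fourth-place player must lose points.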
Beyond those guidelines, there should be no restrictions on the logic and, IMO, there are a lot of ideas worth exploring (and vetting these possibilities is where the crowd-sourcing idea really seems to "fit").
It may be that the rating of the player opposite is vitally important in deciding how strong a player is. Perhaps a function should give more "street cred" to a player who wins against two adjacent 1400-rated opponents while opposite a 1200 player, than to a player who wins against two adjacent 1400 players while opposite a 1600 player.
It may be that "skewed ranking" games are (empirically) very poor predictors of player strength. For instance, in a 1200 vs 1250 vs 1275 vs 1850 game; perhaps the function is "smart" enough to realize that no matter what happens to the 1850 player, it says very little about that player's skill levels.
It may be wise to use "provisional status" logic (looking at the running total of rated games played) to improve accuracy. For example, a function might decide it's "less bad" to lose to a new 1200-rated player than to a well-established 1200-rated player; or that a new player who soundly defeats three 1600+ players is probably a *lot* better than their initial 1200 rating reflects.
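If that kind of provisional logic pans out, it might be as small a change as scaling the step size by experience. A hypothetical tweak to the toy baseline above (the thresholds and multipliers are made-up numbers):

```python
def k_factor(games_played, base_k=16):
    """Hypothetical: move new players' ratings quickly, established players' slowly."""
    if games_played < 10:
        return base_k * 2.5   # provisional: big corrections
    if games_played < 30:
        return base_k * 1.5
    return base_k             # well-established: small corrections
```
One wrinkle: if the two players in a pairwise exchange use different K values, the game is no longer zero-sum, so guideline 2 would force both sides of each exchange to share a K (e.g. the average of the two players' values).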
While not as important as final standing, score differentials could, I suspect, shed some light on players' skill levels. It'd be interesting to see whether a function could tease that out without overtly rewarding "running up the scoreboard" play over "quick claim win" play (which says more about players' patience than their skill).