Very interesting idea, but I bet my pet snake that chess.com doesn't retain records of historic game results.

You may be right, but that may be a moot point. As I circle back to this proposal with fresh eyes, I suspect the bigger hurdle is the request to publish chess.com's proprietary data (however watered-down). There is a long history of companies releasing this kind of data (e.g. to researchers), only to discover that data-mining techniques could be used to reveal far more than what the publishers intended.
The actual effort of compiling/anonymizing a dataset like this would probably be significantly less than the amount of work needed to obtain approval from the higher-ups to publish it.

True. If I had this database in my hands, the first thing I would do would be to sort all pairs of players by the percentage of games in which they finish first and second, to reveal all the cheaters!

Hah, good point! More fun: after identifying the cheaters, re-evaluate their rating (using the conventional ELO method) to determine what it actually ought to be. "Name and shame", as they say. :-)

I just stumbled upon an idea for a simpler rating mechanism for multiplayer games, by a guy named Tom Kerrigan. He describes his idea as follows:
1) At the end of a game, make a list of all the players and sort it by performance.
2) Think of each player as having played two matches: a loss vs. the player right above him on the list, and a win vs. the player right below him.
3) Update each player's rating accordingly using the two-player Elo equations.
What is interesting is that the inventor made the effort to measure the predictive ability of his algorithm: he simulated virtual players of known (hypothetical) true strength playing many 10-player games, and found that his algorithm performed better than Microsoft's TrueSkill. In other words, he didn't use actual data; he created it himself!
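In case it helps to see it concretely, here is a minimal sketch of the update as I read his description, assuming the standard two-player Elo expected-score formula and a flat K-factor of 32 (the function names and structure are mine, not Kerrigan's code):

# Minimal sketch of Kerrigan's neighbor-based update as I read it.
# Assumes the standard two-player Elo formulas and a flat K = 32;
# the names and structure are mine, not Kerrigan's code.

K = 32

def expected(r_a, r_b):
    # Expected score of a player rated r_a against a player rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def kerrigan_update(ratings_in_finish_order):
    # ratings_in_finish_order[0] is 1st place, [-1] is last place.
    deltas = [0.0] * len(ratings_in_finish_order)
    for i, r in enumerate(ratings_in_finish_order):
        if i > 0:      # a loss vs. the player who finished just above
            deltas[i] += K * (0.0 - expected(r, ratings_in_finish_order[i - 1]))
        if i < len(ratings_in_finish_order) - 1:   # a win vs. the player just below
            deltas[i] += K * (1.0 - expected(r, ratings_in_finish_order[i + 1]))
    return [round(r + d) for r, d in zip(ratings_in_finish_order, deltas)]

print(kerrigan_update([1700, 1500, 1450, 1300]))   # example 4-player finish order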

@Skeftomilos, Tom Kerrigan's algorithm is simple. But consider this case:
1st place: 1500
2nd place: 1400
3rd place: 1600
4th place: 2000
Currently the 2nd-place player earns points for 2 wins (he beat the 1600 and 2000 players) and loses points once (he lost to the 1500 player). In Tom Kerrigan's algo he earns points for 1 win (vs. the 1600) and loses points once (vs. the 1500). It is not fair that the 1400 player played better than the 2000 one, yet gets no bonus for it.
What is good in Tom Kerrigan's algo: if those 3 lower-rated players gang up against the 2000 player, only one of them profits from it.
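For comparison, here is a rough sketch of the current all-pairs approach as described earlier in the thread (six pairwise Elo results, summed per player and divided by 3, assuming K = 32); running it on this example and comparing with the neighbor-only sketch above shows where the credit for beating the 2000 player goes. None of this is chess.com's actual code:

# Sketch of the "all pairs, divide by 3" scheme described in this thread (K = 32 assumed).
# Not chess.com's actual code.

K = 32

def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def all_pairs_update(ratings_in_finish_order, divisor=3):
    n = len(ratings_in_finish_order)
    deltas = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                score = 1.0 if i < j else 0.0   # lower index = better finish
                deltas[i] += K * (score - expected(ratings_in_finish_order[i],
                                                   ratings_in_finish_order[j]))
    return [round(d / divisor, 1) for d in deltas]

print(all_pairs_update([1500, 1400, 1600, 2000]))   # rating changes in finish order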

What is good in Tom Kerrigan's algo: if those 3 lower-rated players gang up against the 2000 player, only one of them profits from it.
That's a clever thought!
A rating system like this could invite some strange decisions on the board, like competing for third place instead of second, or even first!

[EDIT: RETRACTION: Per follow-up with @Martin, I believe the logic used in the link below is flawed, and thus my justification to switch to divide-by-6 is unfounded. Divide-by-3 does, indeed, seem to be the correct approach]
(scroll to end for TL/DR)
I also came across this post:
http://sradack.blogspot.com/2008/06/elo-rating-system-multiple-players.html
This treats the topic in a more rigorous way, mathematically extending the 2-player ELO logic to the N-player scenario.
What I found most interesting is that the final equations wind up yielding the same results as the algorithm currently in use, with one important caveat: the "divide-by-3" correction should actually be "divide-by-6".
Upon closer consideration, dividing by 6 makes sense. The ELO algorithm is meant not only to adjust ratings in the right direction, but also to adjust them by the "optimal" magnitude. If the adjustments are too small, it may require players to play dozens, or even hundreds, of games before an accurate measure of their skill level is established. But if they are too large, the rating pool becomes chaotic, with estimates jumping haphazardly from one game to the next. So the math behind 2-player ELO is designed to take a new piece of information (the result of a single game) and make the optimal adjustment to the underlying ratings (estimates of players' skill levels), both in direction and in magnitude.
When chess.com decided to use ELO for 4-player tracking, treating the 4-player game as 6 different individual 2-player games, they wound up getting the right "direction" of adjustment for each of the 4 players' ratings. However, the net adjustment (total "churn" in the underlying rating pool) is equivalent to a situation where 6 new games have been played. Before the "divide-by-3" correction was added, an 1800-rated player having a single bad game against 3 1200 players was being treated as equivalent to an 1800-rated chess player losing 6 consecutive games to a single 1200-rated player (despite having played just one game). The "divide-by-3" correction definitely helped (now, an 1800 player having a bad game is "only" being treated as two consecutive bad games). But this really needs to be tweaked further to "divide-by-6".
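To put rough numbers on the magnitudes involved (my own back-of-envelope arithmetic, assuming K = 32; none of this is chess.com's actual code):

# Back-of-envelope magnitudes for the 1800-vs-three-1200s example (K = 32 assumed).
# The 1800 player is on the losing side of 3 pairwise calculations (one vs. each 1200).

K = 32
expected_1800_vs_1200 = 1.0 / (1.0 + 10 ** ((1200 - 1800) / 400.0))   # ~0.97

per_pairing = K * (0.0 - expected_1800_vs_1200)   # ~ -31 points per pairwise loss
total = 3 * per_pairing                           # ~ -93 before any correction

print(per_pairing, total / 3, total / 6)          # raw, divide-by-3, divide-by-6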
I'm working to build a simulation, similar to that in Skeftomilos's link, that I can use to compare the accuracy of the pure-ELO (with "divide-by-6" correction) methodology to the proposed simplification in that link (i.e. comparing each player to the ratings of the players finishing immediately above and below them). However, as a rule of thumb: all else being equal, statistical methods that make use of the most available information (here, using the ratings of all 3 competitors vs. the ratings of just one or two competitors) tend to perform better. Information-reduction techniques should only be considered when the more information-dense methods are untenable (or when the filtered-out information, on average, is definitively independent of the estimated value - almost definitely not the case here).
I should note two other things:
1) Based on the opening post, I've verified that chess.com is applying ELO using a "K-factor" of 32. At least, that is the K-factor used in their example. It's important to note that ELO is intended to be used with a K-factor that decreases over time... the more games a player has played, the closer their current rating probably is to their "true" skill level; thus, the less impact each new game should have on their overall rating. Most standard chess organizations start new players with K-factors in the 30-40 range, but reduce that to 10-20 for players who've established a long history of play. It may or may not be the case that chess.com is already applying this "declining-K-factor" method, but if they are not, I'd strongly recommend doing so.
2) Even if ELO (with divide-by-6) winds up being the best predictor in my upcoming simulations, I still think there's probably room for improvement. Specifically, as many players have noted, the positioning of skilled/unskilled opponents can make a huge difference in any given game. I'd much rather play opposite an 1800 player, while flanked by two 1500 players, than opposite a 1500 player and flanked by two 1800-rated players. Furthermore, when one player massively out-ranks their 3 competitors, there is often a "gang-up-on-the-pro" effect, putting the strong player in a significantly disadvantaged position overall (and treating the game as a normal 1200-1200-1200-1800 matchup becomes highly dubious). If these patterns (and any others that might be pointed out) can be incorporated into my simulator, I may be able to explore various modifications to the ELO algorithm which take them into account.
TL/DR: Chess.com really needs to modify the current algorithm to divide the final adjustment by 6, not 3. Other methods may also improve ratings; further research is underway.

One final note: players' current ratings have been reached under problematic implementations. Assuming the "divide-by-6" fix is incorporated in the near future, I would also recommend a "temporarily-elevated K-factor", perhaps as high as 40 or 50, for the first 5 (or so) games every player plays under the fixed system. This would help ensure that errors in current ratings are corrected more rapidly at first, and that players with errantly-established ratings (either much too high or much too low) are more bluntly adjusted to conform to the more accurate system.
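Purely as an illustration of what a combined schedule (declining K-factor plus a temporarily-elevated K right after the fix) could look like; the breakpoints and values below are placeholders I made up, not a concrete recommendation:

# Illustrative K-factor schedule only; the breakpoints and values are placeholders,
# not chess.com's actual settings or a specific recommendation.

def k_factor(games_played_overall, games_since_fix):
    if games_since_fix < 5:          # temporarily-elevated K right after the fix
        return 48
    if games_played_overall < 30:    # newer players: larger adjustments
        return 32
    return 16                        # established players: smaller adjustments

print(k_factor(200, 2), k_factor(10, 40), k_factor(200, 40))   # 48, 32, 16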

There will eventually be a switch to Glicko instead of Elo, like all other ratings on chess.com. That change improves on these K-factor tweaks: your rating changes less when your own rating is already accurate, and also changes less when your opponent's rating is less accurate. You can read more here.
@kevinkirkpat, there are some errors in your logic with "divide-by-6". It's true that each game is calculated 6 times (red vs blue, red vs yellow, red vs green, blue vs yellow, blue vs green, yellow vs green), but dividing by 3 makes sense, since each player is only in the calculation 3 times (once against each opponent).
Before the "divide-by-3" correction was added, an 1800-rated player having a single bad game against 3 1200 players was being treated as equivalent to an 1800-rated chess player losing 6 consecutive games to a single 1200-rated player (despite having played just one game). The "divide-by-3" correction definitely helped (now, an 1800 player having a bad game is "only" being treated as two consecutive bad games).
This is incorrect. Before the "divide-by-3", the game would be treated the same as 3 losses for the 1800-rated player against a single 1200-rated player. The other 3 games in the calculation are just games the 1200-rated players play against each other, which do not affect the 1800-rated player.

[quick edit - fixed formula for 100-game scenario]
@Martin0
The link I cited takes a (seemingly) mathematically-rigorous, step-by-step approach to extending the 2-player ELO algorithm to the N-player scenario, and determines that for an N-player match-up the proper divisor is N(N-1)/2... that is, 6 for the 4-player game.
Intuition is nefariously misleading in the arena of probabilities (google the "Monty Hall problem" for a famous example of this), so while I certainly "got" the intuitive case for "divide-by-3", I casually dismissed it in favor of the "more math-y" methodology used in the link. I did not consider that the link itself, for all the "pretty" symbolic math, is a blog, not a peer-reviewed publication; and I took the results more-or-less at face value.
Many times, it's easiest to resolve these questions by looking at extreme cases. Consider a 100-player chess game (a massive Free-For-All, 100-gon version of the current 4PC board). Say player 1 has an 1800 rating, the other 99 players all have a 1200 rating, and player 1 finishes in last place.
Each individual 1800-vs-1200 match-up yields an adjustment of about -31 points, and the 1800 player appears in 99 of the pairwise match-ups. Your approach (divide by N-1 = 99) yields an adjustment of roughly (-31*99)/99 = -31, which seems reasonable: the game impacts the 1800-rated player roughly as if they'd lost a single game to a single 1200 opponent. The link's methodology (divide by N(N-1)/2 = 4950) would mean an adjustment of roughly (-31*99)/4950 ≈ -0.6... the 1800 player would need to finish dead last hundreds of times to get a reasonably-poor rating.
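A quick script to check that arithmetic (assuming K = 32 and that the 1800 player simply loses all 99 of their pairwise match-ups):

# Sanity check for the 100-player thought experiment (K = 32 assumed).

N, K = 100, 32

def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

per_pairing = K * (0.0 - expected(1800, 1200))   # ~ -31 points per pairwise loss
total = (N - 1) * per_pairing                    # the 1800 player is in N-1 = 99 pairings

print(total / (N - 1))             # divide by N-1 = 99: ~ -31, one full loss's worth
print(total / (N * (N - 1) / 2))   # divide by N(N-1)/2 = 4950: ~ -0.6, negligible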
Thanks for the pushback on this; I happily retract my criticism (and will add an edit accordingly)
In the (pinned) thread on 4PC ratings:
All in all I think a good discussion is in order while the current system continues. However would it be possible to code in another system as well and compare how well it predicts the games? Not use it for matching, but just for seeing if it is a better predictor. Because, in the long run, that is what we are looking for, no? A system that can fairly accurately predict how player one will do against player two.
This got me wondering: could we evaluate different ideas for rating functions more empirically? I believe so... but doing so would require a lot of elbow grease. To that end, I wondered if the chess.com team might be willing to tackle this in a crowd-sourced fashion:
The data dump could be as simple as a downloadable csv (comma-separated-value) file, limited to:
GAME_ID,RED_USERNAME,RED_SCORE,GREEN_USERNAME,GREEN_SCORE,BLUE_USERNAME,BLUE_SCORE,YELLOW_USERNAME,YELLOW_SCORE
(and if there are privacy concerns, perhaps just pseudo-anonymous PLAYER_ID numbers could be substituted for USERNAME)
The goal would be to encode the different rating algorithms as functions (taking game results and current ratings, and returning updated ratings), and assess the various algorithms for accuracy. Assessment might be along the lines of:
1) Run the function against a training data set of games (i.e. first 90% of games) to compute each player's "training-data" rating.
2) Use the training-data ratings to predict results (player finishing order) of the last 10% of games.
3) Score those predictions (per game) by evaluating each game as six 2-player match-ups.
Whichever function yields the highest average accuracy-score-per-game (for all of the 10% testing data) would be considered the "more accurate" function.
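A rough sketch of what that evaluation harness could look like, using the CSV layout suggested above; the rating-function interface and the simple "one point per correctly-predicted pairwise result" scoring rule here are my own assumptions, not a finished spec:

# Rough sketch of the proposed evaluation harness. The CSV columns follow the
# dump format suggested above; the rating_fn interface and the scoring rule
# (1 point per correctly-predicted pairwise result) are assumptions, not a spec.

import csv
from collections import defaultdict
from itertools import combinations

def evaluate(csv_path, rating_fn, initial_rating=1200, train_fraction=0.9):
    with open(csv_path, newline="") as f:
        games = list(csv.DictReader(f))

    split = int(len(games) * train_fraction)
    train, test = games[:split], games[split:]

    colors = ["RED", "GREEN", "BLUE", "YELLOW"]
    ratings = defaultdict(lambda: initial_rating)

    # 1) Replay the training games to compute "training-data" ratings.
    for g in train:
        players = [(g[c + "_USERNAME"], float(g[c + "_SCORE"]), ratings[g[c + "_USERNAME"]])
                   for c in colors]
        ratings.update(rating_fn(players))   # rating_fn returns {username: new_rating}

    # 2) + 3) Use those ratings to predict the test games, scored as six
    #         2-player match-ups per game.
    correct, total = 0, 0
    for g in test:
        players = [(g[c + "_USERNAME"], float(g[c + "_SCORE"])) for c in colors]
        for (name_a, score_a), (name_b, score_b) in combinations(players, 2):
            if score_a == score_b:
                continue   # ignore tied pairs under this simple scoring rule
            predicted = ratings[name_a] > ratings[name_b]
            correct += int(predicted == (score_a > score_b))
            total += 1
    return correct / total if total else 0.0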
Some "sanity" guidelines should also be in play:
1) The logic should be sensible and (relatively) easily explained. While the function may entail a lot of nuanced adjustments (perhaps accounting for things like relative ratings, score differentials, and positional aspects of the game), players should still have a basic sense of how their ratings should change based on the outcome of each game. In particular, the player who finishes first should always experience an increase in rating and the 4th place player should always have a drop in rating.
2) Each game should be zero-sum: no net gain (or loss) of rating points among 4 players of a game.
3) The function should be expressible using standard arithmetical functions and simple if-then-else constructs (i.e. no fancy libraries; easily portable to any language)
4) The function should be deterministic (same inputs => same outputs) and exhibit "smooth" behavior (small changes in inputs should result in minor deviations in output)
5) The function should be self-correcting. If a single player has a rating of 1500 after starting at 1200 and playing 50 games, it should be possible to start that same player at 2200 and still wind up with a rating of ~1500 after the same 50 games.
Otherwise, there should be no restrictions on the logic and, IMO, there are a lot of ideas worth exploring (and vetting these possibilities is where the crowd-sourcing idea really seems to "fit")
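To make those guidelines testable, here is a sketch of how a few of them could be checked automatically; the rating-function interface matches the harness sketch above and is, again, my own assumption:

# Quick property checks for a candidate rating function. The interface
# (list of (username, score, rating) tuples -> {username: new_rating})
# matches the harness sketch above and is an assumption, not a spec.

def check_zero_sum(rating_fn, game, tolerance=1e-6):
    # Guideline 2: no net gain or loss of rating points within a game.
    before = sum(rating for _, _, rating in game)
    after = sum(rating_fn(game).values())
    return abs(after - before) < tolerance

def check_deterministic(rating_fn, game):
    # Guideline 4 (first half): same inputs always produce the same outputs.
    return rating_fn(game) == rating_fn(game)

def check_first_place_gains(rating_fn, game):
    # Guideline 1 (in part): the first-place finisher's rating never drops.
    winner, _, old_rating = max(game, key=lambda p: p[1])
    return rating_fn(game)[winner] >= old_rating

# Example game record: (username, final score, current rating)
game = [("red", 41, 1500), ("green", 12, 1400), ("blue", 30, 1600), ("yellow", 5, 2000)]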