The computational cost is what makes this method impractical. I do, however, have a suggestion for how to approach something similar.
-Start with a single player's monthly games from the range you're most concerned about.
-Have an ETL that fetches the games of this player's rivals.
-Then iterate, checking the games of every player you have registered with at least 1 played game. You can filter by ELO if you want to stick to a given range.
-Repeat.
In a matter of minutes you will have tens of thousands of games, and that can scale up to a couple million games, all with a relatively solid Gaussian distribution centered on the ELO of the first player you selected. It's the fastest way of mass fetching games I've found with the current API, and it is enough to analyze a lot with the proper tools, while not overloading yourself with data from ELO ranges that aren't relevant to your requirements.
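A minimal sketch of the loop described above, in case it helps. It is not tied to any real API: `fetch_month` is a hypothetical function you would supply yourself (it takes a username and returns that player's games for the month), and the game layout with `id`, `white`, `black`, `username` and `rating` keys is an assumption made for illustration.

```python
from collections import deque


def snowball_games(seed, fetch_month, min_elo=None, max_elo=None, max_players=1000):
    """Breadth-first crawl: fetch a player's games, then queue their opponents.

    fetch_month is a hypothetical callable: username -> list of game dicts.
    Each game dict is assumed to look like
    {"id": ..., "white": {"username": ..., "rating": ...}, "black": {...}}.
    """
    queue = deque([seed])
    seen_players = {seed}
    games = {}  # game id -> game, so a game shared by two players is stored once
    visited = 0

    while queue and visited < max_players:
        player = queue.popleft()
        visited += 1
        for game in fetch_month(player):
            games[game["id"]] = game
            for side in ("white", "black"):
                name = game[side]["username"]
                rating = game[side]["rating"]
                if name in seen_players:
                    continue
                # Optional ELO filter: only follow players inside the range.
                if min_elo is not None and rating < min_elo:
                    continue
                if max_elo is not None and rating > max_elo:
                    continue
                seen_players.add(name)
                queue.append(name)
    return games
```

Note that the filter only decides which players get followed; a game against an out-of-range opponent is still stored. `max_players` is just a safety cap so the crawl stops at some point.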
Download all games played within a month for "ALL" players
The web spider is definitely a good idea!
Right now I am downloading the one-month histories of about 1.000.000 players ... and it indeed takes its time.
Cheers Arend
Well, there is tons of literature on computer chess, and of course on "how to play best". However, the majority of us are <2600 ELO, in fact more in the range 1200-2000, and while of course playing the best move is always right, I was wondering about openings: how easy it is to play them, and most importantly whether some openings are better than others given the player's strength. As an example, the Sicilian Defense will not work for a 900 Elo player, but I bet there is an Elo range where it is cool to know it, as it beats the repertoire of players in that Elo range.
Or to be more concrete: given your Elo, and the strength and preferences of your likely opponents (+/- 100 Elo), is there an opening that is better than others? So right now I am mapping win/draw/loss rates of openings for different ELO ranges, and that already looks super interesting. However, I only have the lichess data.
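As an illustration of that kind of mapping, here is a small sketch that tallies win/draw/loss rates per opening and Elo bucket. The input format (tuples of opening name, White's Elo, PGN-style result), the choice to bucket by White's Elo, and the 200-point bucket width are all assumptions for the example, not a description of the actual pipeline.

```python
from collections import Counter, defaultdict

RESULTS = ("1-0", "1/2-1/2", "0-1")  # white win, draw, black win


def wdl_by_opening(games, bucket_size=200):
    """Return {(opening, elo_bucket): {result: rate}}.

    games: iterable of (opening, white_elo, result) tuples, where result is
    one of the PGN result strings in RESULTS. Bucketing by White's Elo is an
    assumption made for this sketch.
    """
    counts = defaultdict(Counter)
    for opening, white_elo, result in games:
        bucket = (white_elo // bucket_size) * bucket_size  # e.g. 1510 -> 1400
        counts[(opening, bucket)][result] += 1

    rates = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        rates[key] = {r: counter[r] / total for r in RESULTS}
    return rates
```

With real data you would also want the number of games per bucket next to each rate, since thinly populated buckets give noisy percentages.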
Also, the ELO distribution between lichess and chess.com is different, but what is the difference exactly?
Lastly, and that is definitely more advanced: what is the difference between players of different ELO? I ran a preliminary test, and 400 ELO more only means a 2-3% higher chance of playing the best computer move, for example, but a much higher chance of not blundering. In other words, if I can nail that down, your focus might be to not blunder instead of playing sophisticatedly ... we kind of know that already, but where is the critical point where that advice stops: 800, 1200, 1600?
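One way such a comparison could be set up, as a rough sketch only: assume every move has already been scored by an engine as a centipawn loss relative to the engine's top choice. Treating a loss of 0 as "best move" and a loss of 200 or more as a "blunder" are assumptions for illustration, as are the input format and the 400-point buckets.

```python
from collections import defaultdict


def move_quality_by_elo(moves, bucket_size=400, blunder_cp=200):
    """Return {elo_bucket: {"best_rate": ..., "blunder_rate": ...}}.

    moves: iterable of (player_elo, centipawn_loss) pairs.
    Assumptions: loss == 0 counts as the best move, loss >= blunder_cp
    counts as a blunder. Both thresholds are illustrative.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # [total, best moves, blunders]
    for elo, cp_loss in moves:
        entry = stats[(elo // bucket_size) * bucket_size]
        entry[0] += 1
        entry[1] += cp_loss == 0
        entry[2] += cp_loss >= blunder_cp

    return {
        bucket: {"best_rate": best / total, "blunder_rate": blunders / total}
        for bucket, (total, best, blunders) in sorted(stats.items())
    }
```

Comparing how `best_rate` and `blunder_rate` change from one bucket to the next is then a direct way to look for the point where avoiding blunders stops being the dominant factor.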
From a scientific point of view, we have astonishingly little objective analysis and a lot of "books". Not saying they are wrong, just not peer reviewed.
Cheers Arend
That sounds very interesting. Do you plan to publish your findings?
I collected about 12.000.000 games from chess.com and 135.000.000 games from lichess.org (in the time categories that I am interested in) and I am writing this up right now for the Journal of Sports Analytics. I'll discuss the results with you all soon, once I have proper figures and stats. It is quite interesting so far. Obviously, you can't just do Elo_chess = Elo_lichess-500 ... and I found a set of nice blog posts: https://chessgoals.com/rating-comparison-explained/ (check all of them!) Their method is just one way of doing so; there is another one using "z-statistics" ... pretty much, you find your percent rank in one distribution and map that to the other. That is technically the better way of doing it. However, the lichess distributions seem to be not 100% normal: they have an extra "bimodal" hump and are biased to one side. In other words, in a perfect world this should have been "easy", but there are a couple of hiccups and I need to dig deeper before I can discuss those conclusions.
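To make the "percent rank" idea concrete, here is a small sketch of an empirical percentile mapping between two rating samples. It works directly on the observed ratings rather than on fitted z-scores, so it does not assume either distribution is normal. The function names and the mid-rank handling of ties are choices made for this example.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(sorted_ratings, rating):
    """Fraction of the sample at or below `rating`, using mid-rank for ties."""
    lo = bisect_left(sorted_ratings, rating)
    hi = bisect_right(sorted_ratings, rating)
    return (lo + hi) / (2 * len(sorted_ratings))


def map_rating(rating, source_ratings, target_ratings):
    """Map a rating from the source site to the rating with the same
    percent rank in the target site's sample."""
    source = sorted(source_ratings)
    target = sorted(target_ratings)
    p = percentile_rank(source, rating)
    index = min(int(p * len(target)), len(target) - 1)
    return target[index]
```

For example, `map_rating(1500, lichess_sample, chesscom_sample)` would return the chess.com rating sitting at the same percent rank as 1500 in the lichess sample, where both sample names are placeholders for lists of player ratings. The result is only as good as the two samples: if one is weighted by games played and the other by players, the mapping inherits that mismatch.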
Cheers Arend
Where did you get them from?
The weird thing is that they are normally distributed (Gaussian) but have this left tail cutoff. Also, it matters a bit how they defined the group entering this graph ... Anyways, awesome, thank you! I'll try to generate that for my lichess data; the one I showed is the one for games played, and needs to be corrected for that.
Cheers Arend