Download all games played within a month for "ALL" players

ahnt

Hi, I am new here, but I didn't find an answer in the previous posts: I need to download "all" games played within a month (Feb. 2023) like the monthly lichess data dumps. They contain all pgns for all games played on their servers in a given month.

I guess, since such a data dump doesn't exist, I would need to download the monthly data dumps for each player ... but for that I would need a more or less complete player list.

So, the question is how to obtain such a list, or whether there is a better method for obtaining that data. I am aware that I probably won't get all games within a month, but in order to compare properly, I need a number similar to what I got from lichess, which is ~100.000.000 games.

Cheers Arend

Pawnlings

That would be an insane amount of computation. One idea would be to parse the API for players by country

https://api.chess.com/pub/country/{iso}/players

but include ALL countries and then for each player it returns, you could pass that username into the endpoint below

https://api.chess.com/pub/player/{username}/games/{YYYY}/{MM}

This endpoint would pull their games for a specific month. As I mentioned though, this would be an insane amount of information to pull. 
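
A minimal sketch of that two-step crawl, assuming the country endpoint returns JSON with a `players` list and the monthly endpoint returns JSON with a `games` list (both are assumptions about the response shape, so check the API documentation first). The helper names are hypothetical, and the `opener` argument exists only so the logic can be tried offline without touching the servers.

```python
import json
import time
import urllib.request

BASE = "https://api.chess.com/pub"


def country_players_url(iso):
    """URL listing players for one country code, passed through as given."""
    return f"{BASE}/country/{iso}/players"


def monthly_games_url(username, year, month):
    """URL of one player's games for a given month (MM is zero-padded)."""
    return f"{BASE}/player/{username}/games/{year:04d}/{month:02d}"


def fetch_json(url, opener=urllib.request.urlopen):
    """Fetch and decode JSON; `opener` is injectable for offline testing."""
    # Identifying yourself is a courtesy; the address is a placeholder.
    headers = {"User-Agent": "research-crawler (contact: you@example.org)"}
    request = urllib.request.Request(url, headers=headers)
    with opener(request) as response:
        return json.load(response)


def crawl_month(iso_codes, year, month, delay=1.0,
                opener=urllib.request.urlopen):
    """Yield (username, games) for every listed player of every country."""
    for iso in iso_codes:
        listing = fetch_json(country_players_url(iso), opener)
        for username in listing.get("players", []):  # assumed key
            data = fetch_json(monthly_games_url(username, year, month), opener)
            yield username, data.get("games", [])    # assumed key
            time.sleep(delay)  # serial requests with a pause between players
```

Calling `crawl_month(all_iso_codes, 2023, 2)` would walk every country in turn; the list of country codes itself has to come from elsewhere.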

Ximoon

You'll run into limitations, though. For instance, the first endpoint will only give you a limited number of players per country (10000 if I remember correctly). There's no guarantee those will be active or representative players.

ahnt

Yep, I agree about the insanity of that. I asked support if they could do data dumps like lichess, ideally monthly; that would simplify things greatly.

I read that the 10000 players per country are updated daily, so ... if I do get the players for say a week, and keep the unique ones, I would have a solid set of players/country. Weirdly, that approach biases (as Ximoon said) the data towards regular players, whereas the lichess data is simply all, and since I need this data for scientific research, that is less than ideal.
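
A small sketch of the "collect for a week, keep the unique ones" idea, assuming each daily snapshot is simply a list of usernames. The function name is hypothetical, and lowercasing is an assumption that usernames are case-insensitive.

```python
def merge_snapshots(snapshots):
    """Union daily player lists.

    Returns the set of unique players and, per day, how many names
    were not seen on any earlier day.
    """
    seen = set()
    new_per_day = []
    for day in snapshots:
        # Assumption: usernames compare case-insensitively.
        fresh = {name.lower() for name in day} - seen
        new_per_day.append(len(fresh))
        seen |= fresh
    return seen, new_per_day


players, new_counts = merge_snapshots([["a", "b"], ["B", "c"], ["c"]])
print(len(players), new_counts)
```

Watching `new_counts` shrink from day to day gives a rough signal for when further snapshots stop paying off.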

Analysing 100.000.000 games on the other hand is surprisingly easy if you have access to high-performance computing - LOL

Tricky_Dicky

This game started on 1st June https://www.chess.com/game/daily/524400003

This game started on 1st July. https://www.chess.com/game/daily/535500003

The difference is 11.1 million. If we assume 10% of games are never started, that’s approximately 10 million games. And that is just daily games.

I have no idea of the ratio of live to daily but let’s assume 10 to 1 (probably low)

That would be 100 million live games.

So approximately 110 million games (and PGNs) each month. A typical PGN file for 1 game is about 15 KB.

You can do the maths.

The resource required to download 110m games a month is insane and the storage requirement is not insignificant.

And that doesn't include the research time required to identify the players who have available archives each month.
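
For anyone who would rather not do the maths by hand, the estimate can be written out as below. Every ratio is one of the assumptions stated in this post (10% of games never started, 10 live games per daily game, 15 KB per PGN), not a measured value.

```python
first_june_id = 524_400_003  # daily game that started on 1 June
first_july_id = 535_500_003  # daily game that started on 1 July

ids_issued = first_july_id - first_june_id  # 11,100,000 game IDs in a month
daily_games = ids_issued * 0.9              # assume 10% never start
live_games = daily_games * 10               # assumed 10:1 live-to-daily ratio
total_games = daily_games + live_games

pgn_kb = 15                                 # assumed average PGN size in KB
storage_tb = total_games * pgn_kb / 1e9     # KB to TB, decimal units

print(f"{total_games:,.0f} games, about {storage_tb:.2f} TB per month")
```

That comes to roughly 110 million games and somewhere around 1.65 TB of uncompressed PGN per month under these assumptions.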

stephen_33
ahnt wrote:

Hi, I am new here, but I didn't find an answer in the previous posts: I need to download "all" games played within a month (Feb. 2023) like the monthly lichess data dumps. They contain all pgns for all games played on their servers in a given month.

I guess, since such a data dump doesn't exist, I would need to download the monthly data dumps for each player ... but for that I would need a more or less complete player list.

So, the question is how to obtain such a list, or whether there is a better method for obtaining that data. I am aware that I probably won't get all games within a month, but in order to compare properly, I need a number similar to what I got from lichess, which is ~100.000.000 games.

The simple answer is, it's not possible. There is no way of obtaining a complete record of all games played on this site in a given month because there's no such thing as a set of site members that can be downloaded via the API.

You're probably better off sticking to lichess?

jas0501

A very, very rough estimate based on the game count display at the moment yields:

PLAYING    GAMES
172,174    15,226,045

Per month (x 30): 456,781,350

So roughly 500,000,000 (500 million) games per month.

---------------------

Why would Chess.com, a for-profit company, offer their intellectual property, i.e. billions of games, free to the public while having to pay for the resources required to support the server demands? They won't.

stephen_33
jas0501 wrote:

....

Why would Chess.com, a for-profit company, offer their intellectual property, i.e. billions of games, free to the public while having to pay for the resources required to support the server demands? They won't.

But the site does and has done since at least 2017 (?) without requiring that users of the API even bother to join the site at all.

I recently suggested to one of the development staff that access to the API be restricted at the very least to site members but was told the resource was always intended to be public and there were no plans to change that.

ahnt

The number is only mind-boggling if you consider getting those games via the API; if it were a simple data dump, it would be easier on everyone's resources (and nerves). However, you also need to keep in mind that chess.com's server resources are sufficient to support those games being played (and potentially analyzed) via a web browser or a mobile app interface ... the API download we are talking about is comparatively simple. Also, no authentication means no compute overhead, so the logic makes sense to me. Given that I only need a comparable number, I might get away with less data.
Regarding the intellectual property, that is a somewhat tricky issue. From an ethics point of view, since the data is easily anonymized (on my end) and it is considered "an observation of the world" (I could sit in a park and watch people play), this data is legally (ethically) considered public data. At the same time, since chess.com "owns" the compute resources (for which we paying members paid), one can consider the "providing the data" part as a service, for which one might want to get paid. At the same time, the players are the ones contributing the content. I think this is considered a fair deal: nobody gets paid and nobody pays as long as nobody makes a loss.
That also means that I understand the lack of willingness to do any extra service, like a page to download the games, since that would require one of the web-devs to actually invest time (money).
Punchline: not an ideal situation, but also not enough of an incentive to change something.
Cheers Arend

ahnt

Okay, brief update:
When getting all players for all country codes (with 10000 max per country) I get about 100.000 players, and this list updates every day, giving me so far 10.000 new players per day, but that number seems to be decreasing, so after a week I might end up with 150.000 names. Downloading all games from one month for all those players takes about a week with a time delay of ~1 second between each player, and as it looks right now, it will give me about 15.000.000 games.
Not bad, not perfect, but good enough. There is one weird caveat, a game between player A and B will be saved in A's as well as B's monthly pgn ... so I get fewer than 15.000.000 unique games - so be it.
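
The double counting can be removed after the download. A minimal sketch, assuming each game record is a dict carrying a unique `url` field (an assumption about the JSON form of the archive; for raw PGN text some per-game identifier from the headers would have to play the same role). The function name is hypothetical.

```python
def unique_games(archives):
    """Merge per-player game lists, keeping each game only once.

    `archives` maps username -> list of game dicts; the first copy
    of each game encountered is the one that is kept.
    """
    seen = {}
    for games in archives.values():
        for game in games:
            seen.setdefault(game["url"], game)  # assumed unique key
    return list(seen.values())


archives = {
    "player_a": [{"url": "g1"}, {"url": "g2"}],
    "player_b": [{"url": "g1"}, {"url": "g3"}],  # g1 is A vs B, stored twice
}
print(len(unique_games(archives)))
```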

Cheers Arend

AlexeyChess

Finally, there is life in our community)

Arend, could you share your hypothesis or ideas you want to test?

Maybe I can help you or at least save your time without putting unnecessary pressure on chess.com servers
There are a lot of subtleties depending on the task you have, from easy to impossible (using the current API).

ahnt

Well, there are tons of literature on computer chess, and of course on "how to play best". However, the majority of us are <2600 ELO, in fact more in the range 1200-2000, and while of course playing the best move is always right, I was wondering about openings, how easy it is to play them, and most importantly whether some openings are better than others given the player strength. As an example, the Sicilian defense will not work for a 900 Elo player, but I bet there is an Elo range where it is cool to know it, as it beats the repertoire of players in that Elo range.

Or to be more concrete: given your Elo, and the strength and preferences of your likely opponents (+/- 100 Elo), is there an opening that is better than others? So right now I am mapping win/draw/loss rates of opening moves for different ELO ranges, and that already looks super interesting. However, I only have the lichess data.
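
A rough sketch of such a win/draw/loss table, assuming each game has already been reduced to a dict with `white_elo`, `black_elo`, `opening` and `result` (hypothetical field names; the result strings follow the PGN convention `1-0`, `1/2-1/2`, `0-1`). The 200-point bucket width and the use of the mean rating are arbitrary illustrative choices.

```python
from collections import Counter, defaultdict

RESULTS = ("1-0", "1/2-1/2", "0-1")


def elo_bucket(elo, width=200):
    """Return the (low, high) rating band that contains `elo`."""
    low = (elo // width) * width
    return (low, low + width)


def wdl_by_bucket(games, width=200):
    """Count results per (rating band, opening)."""
    table = defaultdict(Counter)
    for g in games:
        mean_elo = (g["white_elo"] + g["black_elo"]) // 2
        table[(elo_bucket(mean_elo, width), g["opening"])][g["result"]] += 1
    return table


def rates(counter):
    """Turn raw counts into white-win / draw / black-win fractions."""
    total = sum(counter.values())
    return {r: counter[r] / total for r in RESULTS}


games = [
    {"white_elo": 1250, "black_elo": 1310, "opening": "Sicilian", "result": "1-0"},
    {"white_elo": 1200, "black_elo": 1300, "opening": "Sicilian", "result": "0-1"},
]
table = wdl_by_bucket(games)
print(rates(table[((1200, 1400), "Sicilian")]))
```

Bucketing by the mean rating only makes sense when both players are close in strength, which matches the +/- 100 Elo restriction above.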

Also, the ELO distribution differs between lichess and chess.com, but how exactly?

Lastly, and that is definitely more advanced, but what is the difference between players of different ELO? I ran a preliminary test, and 400 ELO more only means 2-3% more chance to play the best computer move for example, but a much larger fraction of not blundering. In other words, if I can nail that down, your focus might be to not blunder instead of playing sophisticatedly ... we kind of know that already, but where is the critical point where that advice stops 800, 1200, 1600?

From a scientific point of view, we have astonishingly little objective analysis and a lot of "books". Not saying they are wrong, just not peer reviewed.

Cheers Arend

jas0501

Determining what's the best opening given one's ELO is a bit more complicated than looking at a ton of games. In total, the games themselves obscure the player's temperament. Are they offensive or defensive minded? Tactical or strategic?

In order to draw any best-opening conclusions, the player's temperament needs to be considered.


Player differences based on ELO....
I think an outline of "proper" play and how well different ELO's conform to this outline might provide some structure to this question.
I'm no expert and this is just off the cuff. As to the proper play outline, typical training advice ideas in no particular order:
o Move center pawns first
o Knights before Bishops
o Do not move the queen early in the opening as she can be attacked with tempo
o Do not move pieces twice in the opening
o Castle
o Get your rooks in communication
o Rooks belong on open or half-open or soon to be open files
o etc.
o etc.

I expect that how soon in the game and how often these guidelines are violated should correlate with lower ELOs.
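
Two of these guidelines can be checked mechanically from the move list alone. A minimal sketch, assuming each game is available as a flat list of SAN moves with White's moves at even positions; the function names and the five-move cutoff are hypothetical choices, and rules that depend on the board position (such as rooks on open files) would need a real chess library.

```python
def side_moves(san_moves, white=True):
    """Moves of one side from a flat SAN list (White moves first)."""
    return san_moves[0 if white else 1::2]


def castled_by(san_moves, move_number, white=True):
    """True if the side castled within its first `move_number` moves."""
    own = side_moves(san_moves, white)[:move_number]
    return any(m.startswith("O-O") for m in own)  # also matches O-O-O


def early_queen_move(san_moves, move_number=5, white=True):
    """True if the side moved its queen within its first `move_number` moves."""
    own = side_moves(san_moves, white)[:move_number]
    return any(m.startswith("Q") for m in own)


italian = ["e4", "e5", "Nf3", "Nc6", "Bc4", "Bc5", "O-O", "Nf6"]
print(castled_by(italian, 4), early_queen_move(italian))
```

Counting how often each check fails per rating band would give a first, crude version of the correlation suggested above.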

ahnt

Thank you all for discussing these ideas, that is really helpful!

The last point you raise is pretty much spot on with what I am looking at. Given the opening "tree" where each branch is walking along a draw, how often does a player of a certain ELO leave this path? Compare the London, where (not entirely true) you can play any of the first four moves in almost any order, with the Sicilian defense, where you have a very "narrow" path. So for each step (move) along the branches, there is a win/draw/lose ratio, and that one changes with Elo, of course. This informs you a) about what is a "safer" opening for your elo, given that others with your elo lose less, and b) which opening you should look into next, as there might be ones that give you an edge over others with a similar elo. However, that is not as clear cut, as this is a rather messy tree.
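
One way to hold such a tree in memory is a trie keyed by moves, with a result counter on every node. This is only a sketch under assumptions: games arrive as SAN move lists plus a PGN result string, the depth limit of 12 plies is arbitrary, and one tree would be built per Elo range. All names are hypothetical.

```python
class Node:
    """One position in the opening tree, reached by a sequence of moves."""

    def __init__(self):
        self.children = {}  # SAN move -> Node
        self.results = {"1-0": 0, "1/2-1/2": 0, "0-1": 0}


def add_game(root, san_moves, result, max_plies=12):
    """Walk the game's first plies, counting the result at every node."""
    node = root
    node.results[result] += 1
    for move in san_moves[:max_plies]:
        node = node.children.setdefault(move, Node())
        node.results[result] += 1


def lookup(root, san_moves):
    """Result counts after the given moves, or None if never played."""
    node = root
    for move in san_moves:
        node = node.children.get(move)
        if node is None:
            return None
    return node.results


tree = Node()
add_game(tree, ["e4", "c5", "Nf3"], "1-0")
add_game(tree, ["e4", "c5", "Nc3"], "0-1")
add_game(tree, ["d4", "d5", "Bf4"], "1/2-1/2")
print(lookup(tree, ["e4", "c5"]))
```

Because the trie is keyed by move order, transpositions land on different nodes, which is part of why the tree is "messy".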

What you mentioned are the opening principles, which are very valuable, and once we have this tree, we can also map them onto each of the moves along the tree, how many of those principles are obeyed ... Bong cloud violates a lot for example, Scandinavian a bit, Spanish not at all. However, that is theory, so to speak, how do players with different elo relate to that? Do below 900 elo players forget to castle?

When you listen to Finegold in particular, the magic is to "make the best move" and when you don't know how, learn to do it, the truth hurts. While perfectly true, there is a learning process, and player behavior along the elo "ranks" reveals what people actually do, and might/should help to prepare.

Cheers Arend

AlexeyChess

Arend, thank you for the details. Sorry for the late reply.

About the Elo difference between platforms: Elo is designed to show strength inside a certain population (not absolute strength). It is fine for the same people to have different Elo if they are in different populations. And this difference is not a constant shift. For example, I am about 1500 on chess.com and 1800 on lichess, but that doesn’t mean lichess Elo is shifted by 300 points. Many titled players have gaps of less than 100 points, which makes perfect sense as populations of strong players are small and similar across platforms. Why and how the populations differ is a separate question; one guess is that the chess.com audience is shifted toward US time zones.

Another aspect is that even a changing population on the same platform will lead to Elo changes without implying a change in strength; it is simply an adjustment to the new population. You probably felt it when the Queen's Gambit wave reached chess.com (my estimate is that about 100 points of inflation happened in my elo range).

As a data scientist I like the approach of letting the data tell the story, but for your question “but what is the difference between players of different ELO” there is probably not enough data for a comprehensive study.

Ideally, to get deeper into the thought process, we need the same positions played many times by players of different Elo ranges. Then we can aggregate and work out that one range sees this, misses that, and so on. This is how some chess tests work: you are given a special multilayer position and the test checks which layer you get to.

While the starting position is always the same and ending positions are quite similar, I believe different elo ranges choose different paths. If we digitize the games and put them in a multidimensional space, different elo ranges will occupy different areas. A 1500 has to choose between tea, coffee and milk, a “GM” between beer, cider and whiskey, so how do we compare them?

Games with a high rating imbalance are very instructive from that perspective, because they put the stronger player in a position that is usual for the weaker opponent, and we get information on what the "GM" does differently. But the share of speedruns and of "GM" participation in open tournaments with a wide elo range is small. People have a strong tendency to play opponents of about the same elo as their own.

About games: according to chess.com officials there were 1 billion games played in Feb 2023, about half of them vs computers (you probably don’t want to spend time on those).

My estimate is that on average ~550M-600M games were played monthly vs humans in 2023, with March 2023 as the top month. Some of the games (I don’t know the share) are played without registration, under usernames like Guest1784343526. You probably don’t want those either, as they are in a grey area regarding elo.

While theoretically speaking it’s not hard to download 500M games, practically speaking it’s a very tedious job because there are many players with very few games played (including 0 games). A hypothetical example to illustrate the problem:

  • Case A, average games per user 500, to download all you need 1M API calls. 
  • Case B, average games per user 1, to download all you need at least 500M API calls. 

For case B there is at least 500 times more overhead on API calls, which will be the main time consumer.
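
The two cases can be put into numbers with a short sketch. It follows the simplification above (one API call per user, total games divided by average games per user) and borrows the ~1 second delay per call mentioned earlier in the thread; both function names are hypothetical.

```python
import math


def api_calls_needed(total_games, avg_games_per_user):
    """One call per user; users needed = games / average games per user."""
    return math.ceil(total_games / avg_games_per_user)


def crawl_days(calls, seconds_per_call=1.0):
    """Wall-clock days for a serial crawl at a fixed pace per call."""
    return calls * seconds_per_call / 86_400


TOTAL = 500_000_000
case_a = api_calls_needed(TOTAL, 500)
case_b = api_calls_needed(TOTAL, 1)

print(f"A: {case_a:,} calls, {crawl_days(case_a):.1f} days")
print(f"B: {case_b:,} calls, {crawl_days(case_b) / 365:.1f} years")
print(f"B needs {case_b // case_a}x the calls of A")
```

At one call per second, case A is a matter of days while case B would take well over a decade, which is why the number of near-empty accounts matters more than the total game count.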

But if you are not after full coverage, the main part (games of active players) can easily be downloaded given a user list.

ahnt

In discord I would now split the thread into "API" and "ELO", and cool that you are into data, in the real world I am: https://scholar.google.com/citations?user=9OItN4cAAAAJ&hl=en&oi=ao

You are totally right, that you can not just say ELO_chess.com=ELO_lichess-500, that doesn't work. However, Elo should (must?) follow a normal distribution giving us a mean and a variance, and once we have both distributions, we can devise a mapping function F(ELO_chess.com)= formula something ELO_lichess. I would then test the accuracy of that function using elos from players in both cohorts.
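
One possible shape for such a mapping function, offered only as a sketch: if both rating distributions really were normal, a rating could be mapped by matching z-scores. The parameters in the example are made-up placeholders, not measured values for either site, and the approach breaks down to the extent that the real distributions are not normal.

```python
from statistics import NormalDist


def fit(ratings):
    """Estimate mean and standard deviation from a sample of ratings."""
    return NormalDist.from_samples(ratings)


def map_rating(x, source, target):
    """Map a rating by matching its z-score in the target distribution."""
    z = (x - source.mean) / source.stdev
    return target.mean + z * target.stdev


# Placeholder parameters purely for illustration (not real site statistics).
site_a = NormalDist(mu=1200, sigma=300)
site_b = NormalDist(mu=1500, sigma=350)
print(map_rating(1500, site_a, site_b))
```

Testing it on players who are present in both cohorts, as proposed above, would show how far the normality assumption can be trusted.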
The following plot shows the elo distribution over all lichess games played in one month:

First of all, it is wrong, as it is over all games played, not over players, and we see an increased number of games for low-ranked players; I need to do the proper plot. We also see an overabundance at 500 and 1500, as those are the start ELOs at lichess. Obviously, that has an influence, since chess.com starts with other numbers (1200 ...). Anyway, having both distributions as number of players over ELO (not over the total number of games played) allows for the proper comparison (I hope).

Regarding the different paths that low vs. high elo players take, that is super interesting as well, and there is a paper about chess openings doing some sort of "similar position clustering" (https://www.nature.com/articles/s41598-023-31658-w). One might be able to do that for middle games of low and high elo players, though I don't see how. But they got their paper into Scientific Reports, so it is totally worth trying. However, the same must (I guess) be true for openings. The advantage is that there is an opening tree, large though and not properly well defined ... of course there is opening theory, but citable references are rare.

In the meantime, I have 50GB of games so far ... and I am ~15% through the list of players I got ... I'll stop at 100GB at the latest, this is nuts, and those numbers might still not be enough for the opening analysis, but for the ELO comparison they will be. It seems as if the players that get reported per country are more likely to be active players with many games, so your case A is the one we are in.

Cheers Arend

jas0501

FYI from Chess.com's Global Stats:

AlexeyChess

Doesn’t look like normal distribution to me )

Ximoon

Probably because people don't have a unique entry point; they can choose various levels at which to enter the distribution. They may stay unrated at 400... Maybe?

jas0501

Chess.com Rapid distribution: