Game archives for AI research - Chess Forums

Outis86

Dec 9, 2020

#1

Hello,

I'm wondering if it's possible to obtain a large collection of games for AI research. I'm training Leela Chess Zero (Lc0) networks based on human games to experiment with various learning methods.

For the past months I have been data-mining the Lichess game archives, which (at the time of writing) hold around 1.6 billion games. Since Chess.com has games archived back to about 2007, I would very much like to know whether any compressed downloads are available for this kind of research.

Lichess publishes its game archives per month, and a single download (for the recent months in 2020) contains 70+ million games. This makes it easy to data-mine large datasets.
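A minimal Python sketch of how such a monthly archive could be scanned for ratings and time controls without loading it into memory. It assumes the download is available as a text stream of standard PGN (for example via `bz2.open(path, "rt")` if the file is bzip2-compressed) and that each game carries `WhiteElo`, `BlackElo` and `TimeControl` tags. `iter_game_headers` is a hypothetical helper, not part of any library.

```python
import io
import re

# Matches one PGN tag pair line, e.g. [WhiteElo "1500"].
TAG_LINE = re.compile(r'^\[(\w+) "(.*)"\]$')


def iter_game_headers(lines):
    """Yield one dict of PGN tags per game from an iterable of text lines."""
    headers = {}
    seen_movetext = False
    for raw in lines:
        line = raw.strip()
        match = TAG_LINE.match(line)
        if match:
            # A tag line that follows movetext means a new game has started.
            if seen_movetext and headers:
                yield headers
                headers = {}
            seen_movetext = False
            headers[match.group(1)] = match.group(2)
        elif line:
            seen_movetext = True
    if headers:
        yield headers


# Tiny in-memory stand-in for a real archive (made-up games).
sample = io.StringIO(
    '[Event "Rated Blitz game"]\n[WhiteElo "1210"]\n[BlackElo "1235"]\n'
    '[TimeControl "300+0"]\n\n1. e4 e5 1-0\n\n'
    '[Event "Rated Bullet game"]\n[WhiteElo "2610"]\n[BlackElo "2650"]\n'
    '[TimeControl "60+0"]\n\n1. d4 d5 0-1\n'
)
games = list(iter_game_headers(sample))
print([(g["WhiteElo"], g["TimeControl"]) for g in games])
```

Only the tag section is parsed here; the movetext is skipped, which keeps a first filtering pass cheap.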

For me it's key to get a large portion of games in various rating ranges and time controls (preferably no bullet games):

1000-1050
1200-1250
1400-1450
...
2600+
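As an illustration, assigning a rating to one of these bands could look like the Python sketch below. Only the bands spelled out above are included; the elided ones would have to be added by hand. Treating both ends of a band as inclusive is an assumption, and `band_label` is a hypothetical helper.

```python
# Bands listed explicitly in the post; the elided bands are NOT filled in here.
LISTED_BANDS = [(1000, 1050), (1200, 1250), (1400, 1450)]
TOP_BAND_FLOOR = 2600  # the open-ended "2600+" band


def band_label(rating, bands=LISTED_BANDS, top_floor=TOP_BAND_FLOOR):
    """Return the band label for a rating, or None if it falls in no band.

    Assumption: both ends of a band are inclusive.
    """
    if rating >= top_floor:
        return f"{top_floor}+"
    for low, high in bands:
        if low <= rating <= high:
            return f"{low}-{high}"
    return None


print(band_label(1025), band_label(1100), band_label(2700))
```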

I have been reading the API documentation lately, and apparently serial access to the API should be unrestricted / uncapped, yet I have seen HTTP 429 (Too Many Requests) responses every now and then. Game archives on Chess.com are organised per player and per month.
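One common way to cope with occasional 429 responses is to retry serially with an increasing delay. The Python sketch below shows that pattern under some assumptions: `fetch` is any callable returning `(status, body)`, the delays and attempt count are arbitrary illustrative values, and `get_with_backoff` and `urllib_fetch` are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request


def urllib_fetch(url):
    """GET a URL and return (status, body); HTTP errors become status codes."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def get_with_backoff(url, fetch=urllib_fetch, max_attempts=5,
                     base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) one request at a time, doubling the wait after each 429."""
    delay = base_delay
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)  # wait before the next serial attempt
            delay *= 2
    return 429, b""  # still rate-limited after all attempts
```

Passing `fetch` and `sleep` in as parameters keeps the retry logic testable without touching the network.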

To my knowledge there is no fine-grained way to obtain players in a certain Elo range, so this would require some 'probing' to find the right subset of players (for example, by iterating over the player listings for each country).

I hope to get some pointers on how to achieve the above goals.

Tricky_Dicky

Dec 9, 2020

#2

There have been some excessive requests recently, from some individuals, which resulted in staff throttling the download rate.

For a large data-mining adventure I would suggest that approaching staff in advance would be best. They may be able to give advice on speeds and the best times for access.

Nevfy

Dec 9, 2020

#3

IMHO, with the API the only way to perform your task is:

Get the list of recently active players from a certain country ( https://api.chess.com/pub/country/RU/players );
For each player:
1. Get the player's rating ( https://api.chess.com/pub/player/outis86/stats );
2. Get the archive of the player's games for the last full month ( https://api.chess.com/pub/player/outis86/games/2020/11/pgn ).

Repeat for each country code. Then you can do all the post-analysis (like sorting players by rating range) on your PC.

But I'm not sure that Chess.com will allow you to load the server that much. Maybe you should write a message to the website's support asking for collaboration.
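The steps above can be sketched in Python as follows. The URL patterns are taken from the example links in this post; the helper names are hypothetical, and `list_players` is an assumed callable that returns the usernames found at a country URL. The sketch only builds the request plan and sends nothing itself.

```python
API_ROOT = "https://api.chess.com/pub"


def country_players_url(country_code):
    """URL listing players for a two-letter country code."""
    return f"{API_ROOT}/country/{country_code.upper()}/players"


def player_stats_url(username):
    """URL of a player's rating statistics."""
    return f"{API_ROOT}/player/{username.lower()}/stats"


def monthly_pgn_url(username, year, month):
    """URL of a player's games for one month, in PGN form."""
    return f"{API_ROOT}/player/{username.lower()}/games/{year:04d}/{month:02d}/pgn"


def plan_requests(country_codes, year, month, list_players):
    """Yield (username, stats_url, pgn_url) for every player of every country.

    list_players(url) is supplied by the caller (hypothetical); it is the
    only place where a network request would actually happen.
    """
    for code in country_codes:
        for username in list_players(country_players_url(code)):
            yield (username,
                   player_stats_url(username),
                   monthly_pgn_url(username, year, month))


# Example with a stubbed player listing instead of a real request.
for row in plan_requests(["RU"], 2020, 11, lambda url: ["outis86"]):
    print(row)
```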

Outis86

Dec 9, 2020

#4

Nevfy wrote:

IMHO, with the API the only way to perform your task is:

Get the list of recently active players from a certain country ( https://api.chess.com/pub/country/RU/players );
For each player:
1. Get the player's rating ( https://api.chess.com/pub/player/outis86/stats );
2. Get the archive of the player's games for the last full month ( https://api.chess.com/pub/player/outis86/games/2020/11/pgn ).

Repeat for each country code. Then you can do all the post-analysis (like sorting players by rating range) on your PC.

But I'm not sure that Chess.com will allow you to load the server that much. Maybe you should write a message to the website's support asking for collaboration.

I think iterating over the player names from the country listing and pulling the game archives directly is good enough, as the JSON (with the PGN data in it) contains each player's Elo rating at the time the game was played.
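A rough Python sketch of that filtering step, working on an already downloaded monthly archive. The field names (`games`, `time_class`, `white.rating`, `black.rating`) are how I recall the JSON layout and should be checked against the API documentation; requiring both players to be inside the band is an assumption, and `games_in_band` is a hypothetical helper.

```python
def games_in_band(archive, low, high, excluded_time_classes=("bullet",)):
    """Return the games whose two players were both rated within [low, high].

    archive is the parsed JSON of one monthly archive (assumed layout).
    Games in an excluded time class, or without ratings, are skipped.
    """
    selected = []
    for game in archive.get("games", []):
        if game.get("time_class") in excluded_time_classes:
            continue
        white = game.get("white", {}).get("rating")
        black = game.get("black", {}).get("rating")
        if white is None or black is None:
            continue
        if low <= white <= high and low <= black <= high:
            selected.append(game)
    return selected


# Made-up archive with the assumed layout, for illustration only.
example_archive = {
    "games": [
        {"time_class": "blitz", "white": {"rating": 1010}, "black": {"rating": 1040}},
        {"time_class": "bullet", "white": {"rating": 1020}, "black": {"rating": 1030}},
        {"time_class": "rapid", "white": {"rating": 1010}, "black": {"rating": 1300}},
    ]
}
print(len(games_in_band(example_archive, 1000, 1050)))
```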

Nevfy

Dec 9, 2020

#5

@Outis86 You are right. I forgot about it.

Outis86

Dec 10, 2020

#6

Who would be the best person to get in touch with for this inquiry?

Tricky_Dicky

Dec 10, 2020

#7

@bcurtis (Ben) is the senior developer for the API

bcurtis

Dec 13, 2020

#8

Thanks for bringing this up, @Outis86 — AI and related research projects are certainly interesting to us!

Unfortunately right now we are not equipped to provide bulk downloads, and the API methods are not suitable for a variety of reasons. The Published Data API was designed to help the community create tools for Chess.com players, and so as you discovered it is player-centric and obtaining a time-slice of all games that meet certain criteria is just not going to work — the rate-limiting means you would need to spend years obtaining data. And speaking of rate-limiting, that is in place because the type of data we deliver on demand is many thousands of times more of a burden on the servers than something like a bulk download. We discourage scans like you propose, because they are not done for the benefit of the players (for instance, there is no player asking you to process their games in this way), and the server load detracts from the performance for the players on the site. If you are concerned that your scripts may be creating a burden, make sure you include your contact information in the user-agent header of your HTTP request, and if we find that we need to limit or change your access then we know who to email.
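The suggestion to include contact information in the user-agent header could look like this in Python. The application name and e-mail address are placeholders, and `build_request` is a hypothetical helper; only `urllib.request.Request` comes from the standard library.

```python
import urllib.request


def build_request(url, contact_email, app_name="lc0-research-script"):
    """Build a GET request whose User-Agent names the script and a contact.

    app_name and contact_email are placeholders to replace with real values.
    """
    user_agent = f"{app_name} (contact: {contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request(
    "https://api.chess.com/pub/player/outis86/stats", "me@example.com"
)
# urllib stores header names capitalised as "User-agent".
print(request.get_header("User-agent"))
```

The request object can then be passed to `urllib.request.urlopen` in place of a bare URL.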

These are all temporary conditions. This year, we have discussed many ways and options for opening up data for study and use by people like yourself. For various reasons, we believe that we need to meet a higher legal standard of guarding privacy than other sites (for GDPR, CCPA, etc), and so we cannot simply publish the existing data. Figuring out exactly how we can approach these new and fascinating possibilities is going to take some work, and due to the significant increase in interest in chess this year we are a little short-handed.

But please know that this is a topic that is dear to me, and we will pursue it as we can. Thanks again for bringing this to the forum for discussion.

Mensch-Maschine

Oct 17, 2021

#9

Hi @bcurtis.

I have a couple of ideas for data science projects (rather than AI research) that would involve analysing a large dataset of games (ideally a timeslice providing a random sample of ratings, time formats, etc.).

I was wondering if any progress had been made at chess.com in providing such datasets rather than just the player-centric data currently available?

The "other chess site" has published monthly collections of games, but I'd prefer to use chess.com data, especially if I would be discussing results in a blog here (which I'm planning to do).

One option could be to publish a few datasets of various sizes, specifically for the purposes of study/research, separately from the current API system, which, as you point out, is player-centric by design. Is that feasible, or are there still unresolved legal/privacy issues with publishing research data like that?

Thanks.