Game archives for AI research

Avatar of Outis86

Hello,

I'm wondering if it's possible to obtain a large collection of games for AI research. I'm training Leela Chess Zero (Lc0) networks based on human games to experiment with various learning methods.

For the past few months I have been data-mining the Lichess game archives, which (at the time of writing) hold around 1.6 billion games. As Chess.com has archived games since about 2007, I'm very interested to know whether any compressed downloads are available for this kind of research.

Lichess has game archives per month which contain (for the recent months in 2020) up to 70+ million games per download. This makes it easy to data-mine larger data-sets.

For me it's key to get a large portion of games in various rating ranges and time controls (preferably no bullet games):

  • 1000-1050
  • 1200-1250
  • 1400-1450
  • <>
  • 2600+
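The bands above can be sketched as a small bucketing helper. This is only an illustration: the names `BANDS` and `band_for` are hypothetical, the upper bound of each band is assumed to be inclusive, and only the bands spelled out in the list are included (the elided middle bands are not guessed).

```python
from typing import List, Optional, Tuple

# Bands named in the post; a high of None means "open-ended" (2600+).
BANDS: List[Tuple[int, Optional[int]]] = [
    (1000, 1050),
    (1200, 1250),
    (1400, 1450),
    (2600, None),
]


def band_for(rating: int) -> Optional[str]:
    """Return the label of the band containing `rating`, or None if none does."""
    for low, high in BANDS:
        if high is None:
            if rating >= low:
                return f"{low}+"
        elif low <= rating <= high:  # assumption: both ends inclusive
            return f"{low}-{high}"
    return None
```

Games whose rating falls between bands return `None` and can simply be skipped.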

I have been reading the API documentation lately, and apparently serial access to the API should be unrestricted / uncapped, but I have seen 429 (Too Many Requests) responses every now and then. Game archives on Chess.com are per-player and organized by month.
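One common way to cope with occasional 429 responses is to make requests strictly one at a time and back off before retrying. The sketch below is an assumption about how a client might do this, not something taken from the API documentation: the helper names, the retry count, and the delays are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple

Fetch = Callable[[str], Tuple[int, bytes]]


def http_get(url: str) -> Tuple[int, bytes]:
    """Return (status, body); HTTP errors are returned as a status, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def get_with_backoff(
    url: str,
    fetch: Fetch = http_get,
    sleep: Callable[[float], None] = time.sleep,
    max_retries: int = 5,      # hypothetical value
    base_delay: float = 1.0,   # hypothetical value, in seconds
) -> Tuple[int, bytes]:
    """Fetch `url` serially, doubling the wait after every 429 response."""
    delay = base_delay
    status, body = fetch(url)
    for _ in range(max_retries):
        if status != 429:
            break
        sleep(delay)
        delay *= 2
        status, body = fetch(url)
    return status, body
```

`fetch` and `sleep` are parameters so the retry logic can be exercised without touching the network.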

To my knowledge there is no fine-grained way to obtain players in a certain Elo range, so this would require some 'probing' to find the right subset of players (for example, by iterating over the player listings for each country).

I hope to get some pointers on how to achieve the above goals.

Avatar of Tricky_Dicky

There have been some excessive requests recently, from some individuals, which resulted in staff throttling the download rate.

For a large data-mining adventure I would suggest that approaching staff in advance would be best. They may be able to give advice on speeds and the best times for access.

Avatar of Nevfy

IMHO, with the API the only way to perform your task is:

  1. Get the list of recently active players from a certain country ( https://api.chess.com/pub/country/RU/players );
  2. For each player:
    1. Get their rating ( https://api.chess.com/pub/player/outis86/stats );
    2. Get the archive of their games for the last full month ( https://api.chess.com/pub/player/outis86/games/2020/11/pgn ).

Repeat for each country code. Then you can do all the post-analysis (like sorting players by rating ranges) on your PC.
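The steps above can be sketched roughly as follows. The URL patterns come from the links in this post; everything else is an assumption made for illustration: the function names are hypothetical, the JSON keys `players` and `games` are assumed response fields, and the archive URL drops the `/pgn` suffix on the assumption that the same endpoint also serves JSON. The HTTP call is passed in as `fetch_json` so the traversal can be tried out without hitting the server.

```python
from typing import Any, Callable, Dict, Iterator

BASE = "https://api.chess.com/pub"
FetchJson = Callable[[str], Dict[str, Any]]


def country_players_url(code: str) -> str:
    return f"{BASE}/country/{code}/players"


def stats_url(username: str) -> str:
    return f"{BASE}/player/{username}/stats"


def archive_url(username: str, year: int, month: int) -> str:
    # JSON variant of the monthly archive (assumption); the post links ".../pgn".
    return f"{BASE}/player/{username}/games/{year}/{month:02d}"


def collect_country(
    code: str, year: int, month: int, fetch_json: FetchJson
) -> Iterator[Dict[str, Any]]:
    """Yield one record per listed player: username, raw stats, monthly games."""
    listing = fetch_json(country_players_url(code))
    for username in listing.get("players", []):  # "players" is an assumed key
        archive = fetch_json(archive_url(username, year, month))
        yield {
            "username": username,
            "stats": fetch_json(stats_url(username)),
            "games": archive.get("games", []),  # "games" is an assumed key
        }
```

Each player costs two requests per month of games, which is where the server-load concern comes from.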

But I'm not sure that Chess.com will allow you to load the server that much. Maybe you should write a message to the website's support asking for collaboration.

Avatar of Outis86

I think iterating over the player names from the country listing and pulling the game archives directly is good enough, as the JSON (with PGN data in it) contains each player's Elo rating at the time the game was played.
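A minimal sketch of that filtering step, assuming each game object in the archive JSON carries `white.rating`, `black.rating`, and a `time_class` field (these field names are assumptions, and `in_band` is a hypothetical helper):

```python
from typing import Any, Dict, Tuple


def in_band(
    game: Dict[str, Any],
    low: int,
    high: int,
    exclude_time_classes: Tuple[str, ...] = ("bullet",),
) -> bool:
    """True if both players' ratings at game time fall within [low, high]."""
    if game.get("time_class") in exclude_time_classes:
        return False
    ratings = [game.get(side, {}).get("rating") for side in ("white", "black")]
    # Games with a missing or non-integer rating are rejected.
    return all(isinstance(r, int) and low <= r <= high for r in ratings)
```

Requiring both players to be in the band is a design choice; relaxing it to either player would keep more games but mix rating levels.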

Avatar of Nevfy

@Outis86 You are right. I forgot about it.

Avatar of Outis86

Who would be the best person to get in touch with for this inquiry?

Avatar of Tricky_Dicky

@bcurtis (Ben) is the senior developer for the API.

Avatar of bcurtis

Thanks for bringing this up, @Outis86 — AI and related research projects are certainly interesting to us!

Unfortunately right now we are not equipped to provide bulk downloads, and the API methods are not suitable for a variety of reasons. The Published Data API was designed to help the community create tools for Chess.com players, and so as you discovered it is player-centric and obtaining a time-slice of all games that meet certain criteria is just not going to work — the rate-limiting means you would need to spend years obtaining data. And speaking of rate-limiting, that is in place because the type of data we deliver on demand is many thousands of times more of a burden on the servers than something like a bulk download. We discourage scans like you propose, because they are not done for the benefit of the players (for instance, there is no player asking you to process their games in this way), and the server load detracts from the performance for the players on the site. If you are concerned that your scripts may be creating a burden, make sure you include your contact information in the user-agent header of your HTTP request, and if we find that we need to limit or change your access then we know who to email.
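For the user-agent advice, a minimal sketch using Python's standard library might look like this. The application name, the contact address, and the exact layout of the header value are placeholders, not a format prescribed by Chess.com.

```python
import urllib.request


def build_request(url: str, app_name: str, contact: str) -> urllib.request.Request:
    """Build a GET request whose User-Agent header identifies the script's owner."""
    # The "name (contact: address)" layout is an assumption, not a required format.
    user_agent = f"{app_name} (contact: {contact})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical values; replace with your own project name and email address.
request = build_request(
    "https://api.chess.com/pub/player/outis86/stats",
    app_name="lc0-research-script",
    contact="someone@example.com",
)
```

The resulting `request` object can be passed to `urllib.request.urlopen` in place of a bare URL.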

These are all temporary conditions. This year, we have discussed many ways and options for opening up data for study and use by people like yourself. For various reasons, we believe that we need to meet a higher legal standard of guarding privacy than other sites (for GDPR, CCPA, etc), and so we cannot simply publish the existing data. Figuring out exactly how we can approach these new and fascinating possibilities is going to take some work, and due to the significant increase in interest in chess this year we are a little short-handed.

But please know that this is a topic that is dear to me, and we will pursue it as we can. Thanks again for bringing this to the forum for discussion.

Avatar of Mensch-Maschine

Hi @bcurtis.

I have a couple of ideas for data science projects (rather than AI research) that would involve analysing a large dataset of games (ideally a timeslice providing a random sample of ratings, time formats, etc.).

I was wondering if any progress had been made at chess.com in providing such datasets rather than just the player-centric data currently available?

The "other chess site" has published monthly collections of games, but I'd prefer to use chess.com data, especially if I would be discussing results in a blog here (which I'm planning to do).

One option could be to publish a few datasets of various sizes, specifically for the purposes of study/research, separately from the current API system, which, as you point out, is player-centric by design. Is that feasible, or are there still unresolved legal/privacy issues with publishing research data like that?

Thanks.