Random sampling

Mensch-Maschine

Hi devs.

I have a few different ideas for data-science projects based on chess.com data. Ideally what I'd like for most of these projects is a random sample of games or players, etc. However, I'm struggling to understand how to best use the API to get useful information of this kind.

I have published a few blog posts using data from the other website, which publishes monthly archives of all games played on the site. Being a snapshot of all games in a month, this is more or less the kind of random sample I'm after. As I understand it, chess.com does not publish similar archives, either directly on the website or through the API. (I have read through some of the reasons why chess.com chooses not to publish archives.)

So, I thought about a few ways to get random samples of data from the API description, such as:

  • Iterating through players from a particular country, and then using the player endpoint. However, as mentioned elsewhere, there is a limit of 10000 players per country via the API. This seems problematic for a few different reasons in taking a random sample from a large enough chess-playing country.
  • Iterating through tournament endpoints (which I discovered are numbered somewhat sequentially), but once again it's not really random sampling as not everyone plays tournaments (and choosing just from tournaments would bias the sample).
  • Scraping (I'd rather not do this for so many different reasons).
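For what it's worth, the first option could be sketched roughly as below. This is only a minimal sketch under assumptions: the `country/{iso}/players` path and the `"players"` key are how I understand the published API to respond, while `fetch_json`, `sample_country_players`, and the `User-Agent` string are made-up names for illustration. The draw is uniform only over whatever list the endpoint returns, so it inherits the per-country cap mentioned above.

```python
import json
import random
import urllib.request

API_ROOT = "https://api.chess.com/pub"  # assumed base URL of the public API


def fetch_json(url):
    """Fetch a URL and decode its JSON body (needs network access)."""
    req = urllib.request.Request(url, headers={"User-Agent": "sampling-sketch"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def sample_country_players(iso_code, k, fetch=fetch_json, seed=None):
    """Return up to k usernames drawn uniformly from one country's list.

    Assumes the response looks like {"players": ["name1", "name2", ...]}.
    `fetch` is injectable so the function can be tried without the network.
    """
    data = fetch(f"{API_ROOT}/country/{iso_code}/players")
    players = data.get("players", [])
    rng = random.Random(seed)
    # Sampling without replacement; never ask for more than is available.
    return rng.sample(players, min(k, len(players)))
```

A call such as `sample_country_players("NZ", 500, seed=42)` would then give a reproducible draw from that country's list, with the caveat that a country is not the whole site.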

So, are there any ideas for either: a) generating a random sample of games or players from the chess.com API, or b) publishing snapshots (or archives) of chess.com activity, perhaps through the API?

Thanks.

SneakyDeeCee

This is an interesting problem. One idea you could attempt is a social-network type of approach. There is something called "six degrees of separation": https://en.wikipedia.org/wiki/Six_degrees_of_separation where basically everyone is supposedly connected via 6 connections. I don't know much about it personally, but it may be worth looking into. What you could do is one of two approaches: 1) start with all the opponents you've faced, randomly select one of them, go to all the opponents they faced, and randomly select again, etc. Do this six or so times. 2) just collect the usernames of all your opponents and all your opponents' opponents, etc., to a depth of about 6. Use this collection as your population and take a random sample from it.

I understand these aren't perfect and may not truly be random because they all stem from your opponents (or another starting point). However, it may help give an idea of what to do next. Otherwise, I hope there is a better answer out there for ya!
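The two approaches above can be sketched as follows. This is a rough illustration, not a tested client: `get_opponents` is a hypothetical callable that returns the usernames a player has faced (in practice it would have to be built on top of the API's game data), and `TOY_GRAPH` is invented data so the sketch runs on its own.

```python
import random
from collections import deque

# Invented opponent graph, for illustration only.
TOY_GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a"},
    "d": {"b", "e"},
    "e": {"d"},
}


def toy_opponents(user):
    """Hypothetical lookup: the set of opponents a user has faced."""
    return TOY_GRAPH.get(user, set())


def random_walk(start, get_opponents, steps=6, rng=None):
    """Approach 1: hop to a randomly chosen opponent, `steps` times."""
    rng = rng or random.Random()
    current = start
    for _ in range(steps):
        opponents = sorted(get_opponents(current))  # sorted for reproducibility
        if not opponents:
            break  # dead end: player with no recorded opponents
        current = rng.choice(opponents)
    return current


def collect_to_depth(start, get_opponents, depth=6):
    """Approach 2: breadth-first collection of everyone within `depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        user, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the requested depth
        for opp in get_opponents(user):
            if opp not in seen:
                seen.add(opp)
                frontier.append((opp, d + 1))
    return seen
```

With real data, the set returned by `collect_to_depth` would be the population to sample from, for example with `random.sample(sorted(population), k)`. Bear in mind the population can grow very quickly with depth.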

Best of luck!

jas0501

First question is why a random sample is required? What about the project requires randomness?

Mensch-Maschine

Hi SneakyDeeCee.

Thanks. I hadn't thought of this, and it could be very useful given the random nature (within rating ranges) of the challenge matching process.

Mensch-Maschine
jas0501 wrote:

First question is why a random sample is required? What about the project requires randomness?

Say I want to draw inferences about the site-wide rating distribution and how it relates to different characteristics of the players.

Taking the first 10k players alphabetically misses out on player names like GM_xxx and instead gives us 00noob00, etc. Alternatively, taking players from countries with fewer than 10k players means potentially focusing on countries without a serious chess culture.

Not everyone plays tournaments. I expect very good players don't bother and very bad players get discouraged, while some players are obviously tournament junkies and so are over-represented.

Either way, a sample of players drawn by any of these methods is (potentially) biased.

SneakyDeeCee

You could also check friends. I just realized that you may be stuck within a certain rating range based on where you start (1200s would play other 1200s). You'll have to find a way to break out of this. Something to consider.

Mensch-Maschine

A couple of numerical experiments suggest that the distribution of a random walk through opponents will tend to underestimate the tails of the rating distribution since chess.com normally matches opponents within +/- 200 rating points. This means that a randomly selected opponent will be more likely to be closer to average than further away, unless there is some kind of weighting magic going on in the chess.com matching algo.
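A toy simulation along these lines illustrates the effect. It is a sketch under loud assumptions, not a reproduction of the experiments mentioned above: ratings are drawn from a normal distribution (mean 1500, SD 300, both invented), and each step picks the next player uniformly from everyone within +/- 200 points of the current one, which may include the current player again.

```python
import bisect
import random
import statistics


def simulate(n_players=2000, n_steps=20000, window=200, seed=0):
    """Compare the population rating spread with the spread seen by a walk.

    Returns (population_sd, walk_sd). All parameters are illustrative.
    """
    rng = random.Random(seed)
    # Assumed population: normally distributed ratings, kept sorted
    # so the +/- window lookup can use binary search.
    ratings = sorted(rng.gauss(1500, 300) for _ in range(n_players))

    idx = n_players // 2  # start the walk at the median player
    visited = []
    for _ in range(n_steps):
        r = ratings[idx]
        lo = bisect.bisect_left(ratings, r - window)
        hi = bisect.bisect_right(ratings, r + window)
        # Uniform choice among players within the window (never empty,
        # because the current player always lies inside it).
        idx = rng.randrange(lo, hi)
        visited.append(ratings[idx])

    return statistics.pstdev(ratings), statistics.pstdev(visited)


population_sd, walk_sd = simulate()
print(f"population SD: {population_sd:.0f}, random-walk SD: {walk_sd:.0f}")
```

Under these assumptions the walk's spread comes out narrower than the population's. Players near the average have more potential opponents within the window, so the walk visits them more often and the tails are under-sampled.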

Interesting problem though, with a lot to think through...

Thanks.

stephen_33

Have you considered looking at team matches? Matches over a certain size will usually include players at all rating levels, even titled ones. For example:

https://www.chess.com/club/matches/1334401

https://www.chess.com/club/matches/1334721

Although I'm starting to appreciate your problem, because higher-calibre players are certainly over-represented in matches like those.

I think using one or more country sets of members is going to yield the closest thing to a representative population. There's no resource on this site, available to members like us, for sampling members in a truly random way.

* Remember that if downloading sets of members by country, the endpoint for the US is non-functional! That's because it's so large.

Mensch-Maschine

Thanks, stephen_33.

Yes, I considered matches and tournaments (both are, or would be, fairly easy to iterate through with the API).

chess.com sometimes runs 24-hour tournaments, which is another possibility, as I imagine 99% of people drop in and play for an hour or so instead of getting their daily fix of chess.com through the random matching algorithm. Being 24 hours long, it would also presumably give good global coverage.

However, I'm not convinced tournaments or matches are a good way to sample truly randomly, as both would tend to attract certain types of players.

Regards.

stephen_33

I can't think of a more representative sample of players than that provided by the country-players endpoint.

jas0501

Still wondering why you need a random sample.