Randomly chosen from what set? The set of all accounts contains many that haven't been active for a while. If you want active users, you'll have to define 'active'.
Lichess' publicly available data might be more amenable to what you have in mind.
Random sample of US player usernames
As the problem is stated, can't you just take the few thousand you can get, and draw a random sample from them via other means (e.g. a little program, or random.org, or whatever)?
Obviously this wouldn't help in certain situations if there's more to your requirements than explicitly stated; for example:
* The sample *must* be taken from the full set;
* The "few thousand" you get is too small to draw a sufficiently-sized sample from;
* The fact that all the names would start with A (or whatever) is unacceptable.
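The "take what you can get and sample from it" idea above can be sketched roughly as follows. This is only a sketch under assumptions: it assumes the country endpoint returns JSON shaped like {"players": [...]}, the helper names fetch_country_players and draw_sample are made up for illustration, and the fetch function is defined but never called so that the example runs offline on a stand-in list.

```python
import json
import random
import urllib.request
from typing import List, Optional

COUNTRY_PLAYERS_URL = "https://api.chess.com/pub/country/US/players"


def fetch_country_players(url: str = COUNTRY_PLAYERS_URL) -> List[str]:
    """Download the username list (assumed shape: {"players": [...]})."""
    req = urllib.request.Request(url, headers={"User-Agent": "sampling-sketch"})
    with urllib.request.urlopen(req, timeout=20) as resp:
        return json.load(resp).get("players", [])


def draw_sample(usernames: List[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Draw up to k usernames uniformly at random, without replacement."""
    rng = random.Random(seed)          # a fixed seed makes the draw reproducible
    unique = sorted(set(usernames))    # dedupe, then sort so the seed is meaningful
    return rng.sample(unique, min(k, len(unique)))


# Offline demonstration with a stand-in list instead of a live API call.
demo = [f"player{i}" for i in range(1000)]
sample = draw_sample(demo, k=50, seed=42)
print(len(sample), sample[:3])
```

Recording the seed alongside the results lets someone else reproduce exactly the same draw. The sample is still only random within the list the endpoint returned, which is the limitation the bullet points above describe.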
Another way to gather a data set might be to harvest a list of American players currently playing in official tournaments and then randomly select some of them for analysis. But you'd have to access each specific tournament (e.g. https://api.chess.com/pub/tournament/84th-chess-com-daily-tournament-1001-1200) and then check each player's country separately.
You can also crawl players (start with a seed player, pull opponents, opponents of opponents, etc.). However, this is time-consuming (especially since you'd need to pull profiles one by one to check the flag) and there's no guarantee that the final sample will be representative.
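A rough sketch of the tournament approach, run on made-up data rather than live API calls. It assumes the tournament JSON carries a "players" list of objects with a "username" key, and that a player profile carries a "country" URL ending in /country/US for US players. The function names and the fake data are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

US_COUNTRY_SUFFIX = "/country/US"


def tournament_usernames(tournament_json: Dict) -> List[str]:
    """Extract usernames, assuming {"players": [{"username": ...}, ...]}."""
    return [p["username"] for p in tournament_json.get("players", []) if "username" in p]


def filter_us_players(usernames: Iterable[str],
                      get_profile: Callable[[str], Optional[Dict]]) -> List[str]:
    """Keep usernames whose profile 'country' URL ends with /country/US."""
    kept = []
    for name in usernames:
        profile = get_profile(name) or {}   # a missing profile counts as non-US
        if str(profile.get("country", "")).endswith(US_COUNTRY_SUFFIX):
            kept.append(name)
    return kept


# Made-up stand-ins for one tournament response and three profile responses.
fake_tournament = {"players": [
    {"username": "alice", "status": "active"},
    {"username": "bob", "status": "active"},
    {"username": "carol", "status": "eliminated"},
]}
fake_profiles = {
    "alice": {"country": "https://api.chess.com/pub/country/US"},
    "bob": {"country": "https://api.chess.com/pub/country/CA"},
    "carol": {"country": "https://api.chess.com/pub/country/US"},
}

names = tournament_usernames(fake_tournament)
us_names = filter_us_players(names, fake_profiles.get)
print(us_names)
```

In a real run, get_profile would wrap one profile request per username, which is where the "check each player's country separately" cost comes from.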
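The crawl described above (seed player, opponents, opponents of opponents) is a breadth-first search. Here is a minimal sketch, assuming a get_opponents callable that would, in practice, be built from each player's game archives; that callable, the function name crawl_players and the toy graph are all made up for illustration.

```python
from collections import deque
from typing import Callable, Iterable, Set


def crawl_players(seed: str,
                  get_opponents: Callable[[str], Iterable[str]],
                  max_depth: int,
                  max_players: int) -> Set[str]:
    """Breadth-first crawl: seed, opponents, opponents of opponents, ..."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        player, depth = queue.popleft()
        if depth == max_depth:
            continue                      # do not expand beyond the depth limit
        for opp in get_opponents(player):
            if opp not in seen:
                if len(seen) >= max_players:
                    return seen           # stop once the pool is big enough
                seen.add(opp)
                queue.append((opp, depth + 1))
    return seen


# Toy opponent graph standing in for data pulled from game archives.
toy_graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": ["dave"],
}
found = crawl_players("alice", lambda p: toy_graph.get(p, []), max_depth=2, max_players=100)
print(sorted(found))
```

Players with many games are reached sooner and more often than casual players, which is one concrete reason the resulting pool may not be representative.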
This might be a workaround for your problem:
Because of how matches are made, games form a small-world network, and that is what will really help you. Given how such networks work, you can reach distant points quite easily just by expanding in every direction from one point 6-11 times. You can also support the randomness claim, for scientific purposes, by measuring the degree of separation in your pool. Google the Watts–Strogatz model; I believe the math behind it will be very helpful.
So... you can try something like this: find some starting set (from the end of the global ranking at chess.com/leaderboard/live/rapid [that's still the 98th percentile], different clubs, tournaments, or some GM speedruns if you want a more normalized Elo distribution), then look at their unique opponents and friends who are active, and repeat the process for ALL active players among them 10+ times. Only once your pool is several times bigger than you need, filter for US players and draw your random sample.
I assume the Elo distribution in the US should be at least similar to the global Elo distribution. So if you put the players into a map keyed by e.g. elo/100, you could notice whether some range deviates from the global data, and crawl more over that range of players, either to improve your spread or to verify data quality/quantity.
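The elo/100 map can be sketched like this. The ratings and the reference distribution below are invented purely for illustration, and the 0.05 tolerance is an arbitrary assumption; real reference shares would have to come from global data.

```python
from collections import Counter
from typing import Dict, Iterable, List


def bucket_shares(ratings: Iterable[int], width: int = 100) -> Dict[int, float]:
    """Map each bucket (rating // width) to its share of the sample."""
    counts = Counter(r // width for r in ratings)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()} if total else {}


def underrepresented(sample_shares: Dict[int, float],
                     reference_shares: Dict[int, float],
                     tolerance: float = 0.05) -> List[int]:
    """Buckets whose sample share falls short of the reference by more than tolerance."""
    return sorted(b for b, ref in reference_shares.items()
                  if ref - sample_shares.get(b, 0.0) > tolerance)


# Made-up ratings and a made-up reference distribution, for illustration only.
sample_ratings = [850, 870, 910, 1010, 1050, 1120, 1180, 1210, 1530, 1580]
reference = {8: 0.20, 9: 0.10, 10: 0.20, 11: 0.20, 12: 0.10, 13: 0.10, 15: 0.10}

shares = bucket_shares(sample_ratings)
gaps = underrepresented(shares, reference)
print(gaps)  # buckets to crawl more, e.g. 13 means ratings 1300-1399
```

Buckets returned by underrepresented are the rating ranges where extra crawling might be worthwhile.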
There are tools that might help you visualize this kind of graph, and increase your paper's fanciness, or how much you will enjoy working with them. I recommend the https://ogp.me/ framework. Here is a practical example made by the Veritasium channel, if you want to check how much it appeals to you: https://www.veritasium.com/simulation1
You can show that your data is not clustered by measuring the average distance between samples.
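One way to read "average distance between samples" is the mean shortest-path length, in the opponent graph, between every pair of sampled players. The sketch below assumes that reading; the function names and the toy graph (a simple chain a-b-c-d-e) are made up for illustration.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Optional


def shortest_path_length(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[int]:
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def mean_pairwise_distance(graph: Dict[str, List[str]], sample: List[str]) -> Optional[float]:
    """Average hop count over all connected pairs in the sample."""
    dists = []
    for a, b in combinations(sample, 2):
        d = shortest_path_length(graph, a, b)
        if d is not None:
            dists.append(d)
    return sum(dists) / len(dists) if dists else None


# Toy undirected chain: a - b - c - d - e
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
spread_out = mean_pairwise_distance(chain, ["a", "c", "e"])
bunched_up = mean_pairwise_distance(chain, ["a", "b", "c"])
print(spread_out, bunched_up)
```

A sample whose mean distance is much smaller than that of the pool as a whole is probably bunched around the seed players.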
Remember to share your work here later! I would love to read about it and what you found out, and would really appreciate it. This is the kind of post this forum really misses.
I hope this helps, even though it's not really what you asked for. Good luck!
There is another way too: you can write an email to support, describing your purpose and asking for the data you need. There is a chance they might provide it, even though it's not currently accessible through the API.
Well now where's the fun in that?
(Said in humor.)
"""
Get chess.com usernames from Google Custom Search results (plain member URLs),
then check the Chess.com API to see if their country is United States.

Requires:
    pip install requests

Env vars:
    GOOGLE_API_KEY -> Google API key
    GOOGLE_CX      -> Programmable Search Engine ID (cx)
"""
import os
import time
import csv
import re
import requests
from urllib.parse import urlparse

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
GOOGLE_CX = os.getenv("GOOGLE_CX")
if not GOOGLE_API_KEY or not GOOGLE_CX:
    raise SystemExit("Set GOOGLE_API_KEY and GOOGLE_CX environment variables.")

# Query: modify if you'd like to tune the dork.
QUERY = (
    'site:www.chess.com/member ("United States" OR "USA" OR "United States of America") '
    '-inurl:/games -inurl:/stats -inurl:/tactics -inurl:/puzzles -inurl:/lessons '
    '-inurl:/club -inurl:/clubs -inurl:/blog -inurl:/live -inurl:/variants '
    '-inurl:/activity -inurl:/archive -inurl:/articles -inurl:/support '
    '-inurl:/streams -inurl:/videos -inurl:/forum -inurl:/follow -inurl:/photos -inurl:?'
)
MAX_RESULTS = 10000         # total plain member URLs to collect
PER_PAGE = 10               # Google returns up to 10 items per page
SLEEP_BETWEEN_GOOGLE = 0.5
SLEEP_BETWEEN_API = 0.5     # politeness for Chess.com API
OUT_CSV = "chess_google_usernames_checked.csv"
PLAIN_MEMBER_RE = re.compile(r"^https?://(www\.)?chess\.com/member/([^/?#]+)/?$", re.IGNORECASE)


def google_custom_search(query: str, start: int = 1):
    """Fetch one page of Custom Search results, starting at the 1-based index `start`."""
    url = "https://www.googleapis.com/customsearch/v1"
    params = {
        "key": GOOGLE_API_KEY,
        "cx": GOOGLE_CX,
        "q": query,
        "start": start,
    }
    r = requests.get(url, params=params, timeout=20)
    r.raise_for_status()
    return r.json()


def is_plain_member_url(url: str) -> bool:
    return bool(PLAIN_MEMBER_RE.match(url))


def username_from_url(url: str) -> str:
    m = PLAIN_MEMBER_RE.match(url)
    if m:
        return m.group(2)
    # last segment fallback
    p = urlparse(url).path.rstrip("/").split("/")
    return p[-1] if p else ""


def chess_api_player(username: str):
    """
    Call Chess.com public API for this username.
    Returns the JSON profile dict on success; otherwise a marker dict with
    '_http_status' (non-200 response) or '_error' (request failed).
    """
    api_url = f"https://api.chess.com/pub/player/{username}"
    try:
        r = requests.get(api_url, timeout=10)
        if r.status_code == 200:
            return r.json()
        # 404 -> user doesn't exist or is not public
        return {"_http_status": r.status_code}
    except Exception as e:
        return {"_error": str(e)}


def is_country_us_from_api(profile_json: dict):
    """
    Determine if the profile's country matches United States.
    Chess.com typically returns a 'country' field with a URL like:
        "https://api.chess.com/pub/country/US"
    or
        "https://api.chess.com/pub/country/united-states"
    We'll check for '/US' or 'united-states' in the country URL if present.
    Returns a (is_us, reason) tuple.
    """
    if profile_json is None:
        return False, "no-profile-json"
    if "_http_status" in profile_json:
        return False, f"http_status_{profile_json['_http_status']}"
    if "_error" in profile_json:
        return False, f"error:{profile_json['_error']}"
    country = profile_json.get("country")  # could be None or a URL
    if not country:
        return False, "no-country-field"
    country_l = country.lower()
    if country_l.endswith("/us") or "/country/us" in country_l or "/country/usa" in country_l:
        return True, f'country-url="{country}"'
    if "united-states" in country_l or "unitedstates" in country_l:
        return True, f'country-url="{country}"'
    # fallback: sometimes profiles include country name in 'location' field (free text)
    location = (profile_json.get("location") or "").lower()
    if "united states" in location or "usa" in location:
        return True, f'location="{profile_json.get("location")}"'
    # not matched
    return False, f'country-url="{country}"'


def main():
    collected = []
    start = 1
    while len(collected) < MAX_RESULTS:
        try:
            resp = google_custom_search(QUERY, start=start)
        except Exception as e:
            print("Google API error:", e)
            break
        items = resp.get("items", [])
        if not items:
            print("No search items returned.")
            break
        for it in items:
            link = it.get("link")
            if not link:
                continue
            # ensure plain member URL format
            if not is_plain_member_url(link):
                continue
            uname = username_from_url(link)
            # call Chess.com API for this username
            profile = chess_api_player(uname)
            # politeness
            time.sleep(SLEEP_BETWEEN_API)
            is_us, reason = is_country_us_from_api(profile)
            collected.append({
                "username": uname,
                "profile_url": link,
                "chess_api_country": profile.get("country") if isinstance(profile, dict) else None,
                "is_us": is_us,
                "reason": reason,
            })
            print(f"{uname}: is_us={is_us} ({reason})")
            if len(collected) >= MAX_RESULTS:
                break
        # next page
        start += PER_PAGE
        time.sleep(SLEEP_BETWEEN_GOOGLE)
        if start > 90:
            # The Custom Search JSON API never returns more than 100 results per
            # query, so stop paging here. With PER_PAGE = 10 this check ends the
            # run after 9 pages (at most 90 links), well below MAX_RESULTS.
            break
    # save CSV
    with open(OUT_CSV, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["username", "profile_url", "chess_api_country", "is_us", "reason"])
        writer.writeheader()
        for r in collected:
            writer.writerow(r)
    print(f"Done: wrote {len(collected)} rows to {OUT_CSV}")


if __name__ == "__main__":
    main()
This is a brilliant idea! I would never have thought of that, and I won't forget that one.
But I think this sample might be "contained" if not all users are indexed, but e.g. only the ones linked from social media. Do you maybe know something more about it?
It's amazing how many things in my life I've taught myself and achieved because of my social phobia, lol
The question is... how "random" is Google's sample? There are over 100,000 US chess players. If you hit the leaderboard and scroll to the end, it is 10,000 pages and it ends around 1200 Elo. 1200 is the top 20%, so there must be upwards of 500,000 ACTIVE US users. (These numbers are off the top of my head. I am sure someone knows better than I do.) That said, my code stops at 10,000, which is an under-2% sample.
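The back-of-the-envelope estimate above can be written out as a short calculation. The inputs are the post's own off-the-cuff figures, and treating the 100,000 as the number of players visible on the leaderboard is an assumption made here to make the arithmetic line up; the function names are made up.

```python
def estimated_total(visible_players: int, top_fraction: float) -> float:
    """If the visible players are the top `top_fraction`, scale up to everyone."""
    return visible_players / top_fraction


def sample_fraction(sample_size: int, population: float) -> float:
    """Share of the population that the sample covers."""
    return sample_size / population


visible = 100_000      # figure quoted in the post (assumed to be the leaderboard count)
top_share = 0.20       # "1200 is the top 20%"
script_cap = 10_000    # MAX_RESULTS in the script

total = estimated_total(visible, top_share)
frac = sample_fraction(script_cap, total)
print(f"estimated active US players: {total:,.0f}")
print(f"sample fraction: {frac:.1%}")
```

If the true population is larger than this estimate, the fraction only shrinks, which matches the "under 2%" wording.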
I started to calculate this on my own to verify it from another angle, but then I found this article, which says nothing about activity but a lot about the number of players: https://www.chess.com/article/view/chess-countries I checked the percentage of players on a few profiles, and 10-15% of all players being from the US seems to hold right up to the very top (e.g. 10% for Naka).
2/3 are players who registered in the last two years, and there are about 250,000 online at any time, so I guess we can assume there are at least 35 million active players, or more, like 2 years ago. So my guess is there are at least 4 million US players who played in the last 3 months, maybe even more. Which might mean you are correct, depending on what was meant by 'active'.
But it's less about how many and more about how it's sampled. For instance, if they are crawling and indexing social media, then many older players will not be there; the same goes for non-socializing ones who... just play chess. Or children/teenagers from some regions who might not be allowed to have social profiles (to be fair, I know nothing about US culture and law, so help me on that one: does this sound like a real-life scenario?).
It doesn't sound like a real problem, does it? But style/preferences/activity vary widely between different social groups (like elderly players, who rarely play bullet but play classical or correspondence, which I haven't even touched for the last 10 years). So you might sample over 50% and still get false results, just because the sample was slightly biased by who was chosen.
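A small worked example of why a biased sample misleads however large it is. Every number below is hypothetical and chosen only to illustrate the mechanism; none of it is measured data.

```python
from typing import Dict


def observed_rate(group_shares: Dict[str, float], group_rates: Dict[str, float]) -> float:
    """Overall rate as the share-weighted average of per-group rates."""
    return sum(group_shares[g] * group_rates[g] for g in group_shares)


# Hypothetical share of each group that plays bullet.
bullet_rate = {"younger": 0.60, "older": 0.10}

# Hypothetical make-up of the real population versus a search-indexed sample
# in which older players are heavily underrepresented.
population_mix = {"younger": 0.70, "older": 0.30}
indexed_mix = {"younger": 0.95, "older": 0.05}

true_rate = observed_rate(population_mix, bullet_rate)
biased_rate = observed_rate(indexed_mix, bullet_rate)
print(f"true bullet share:   {true_rate:.3f}")
print(f"biased bullet share: {biased_rate:.3f}")
```

Sample size never appears in the calculation: the gap comes entirely from the group mix, so collecting more players through the same biased channel would not close it.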
So for someone trying to find answers, this might be a real pain in the... study. It's still useful to know something, but it's very hard to prove how much you know without a way to tell what bias you might have in the data.
But if support isn't helpful, this might be a great starting sample for building a random graph. It's surely not clustered data, and I feel it's fair to assume that every active US player has played at least one match against another US player, so this really sounds like a good way to approach the topic.
Thanks everyone, the engagement with this has blown me away, to be honest. I'll be in touch with support, and if that doesn't work out, I'll follow a mix of the strategies offered above (I'll probably crawl from the tournament and country endpoints and validate my results with Lichess data too).
And I will share my results and work in due course.
: )
Hi all,
I'm trying to pull the full game history of a random sample of usernames taken from players with a USA flag in their profile.
Randomness is important to me, as I'll use the data for academic research.
The problem I'm having is that the https://api.chess.com/pub/country/US/players endpoint seems to return only a few thousand players, in alphabetical order.
Is there a way to generate a random sample of all usernames from the API?
Thanks for any help.
Ben