Hi Chess Friends,
I released a new version of Lumbra's Gigabase yesterday. The database of online games was updated as well, but my focus for release 2025-07-01 (it contains games only up to 06/30/2025) was on the quality of the OTB game database.
Over the last month I wrote a Python script that was, at first, only meant to deduplicate the database. In the end it became more than that: an advanced script that deduplicates the games AND improves the quality of their headers.
TLDR functionality of the script:
The system for deduplicating chess games processes PGN files in several phases to identify duplicates and optimize data quality. First, it reads PGN files, extracts and cleans essential data, calculates hashes, and recognizes metadata. Then, it consolidates player-pair groups using fuzzy name comparisons. This is followed by exact deduplication based on move sequence hashes, where the header of the best game is chosen as the master. Games with subsumed move sequences are also flagged. Another phase uses fuzzy matching for textual similarities of move sequences. Finally, the system exports the unique games and, optionally, the flagged duplicates, optimizing header quality through the integration of FIDE data and a detailed evaluation to ensure the master game contains the best available information.
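To make the phases a bit more concrete, here is a minimal sketch of three of them: fuzzy name comparison, exact deduplication by move-sequence hash with a master header, and flagging of subsumed move sequences. This is not the actual script. All function names, the SHA-1 hash, the 0.85 similarity threshold, and the simple header score are assumptions for illustration only; the real evaluation also integrates FIDE data, which is left out here.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize_moves(movetext: str) -> str:
    """Reduce PGN movetext to bare SAN moves separated by single spaces."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)               # drop {comments}
    text = re.sub(r"\d+\.(\.\.)?", " ", text)                # drop "1." / "1..."
    text = re.sub(r"(1-0|0-1|1/2-1/2|\*)\s*$", " ", text)    # drop trailing result
    return " ".join(text.split())


def move_hash(movetext: str) -> str:
    """Hash of the normalized move sequence (SHA-1 is an assumed choice)."""
    return hashlib.sha1(normalize_moves(movetext).encode("utf-8")).hexdigest()


def same_player(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy name comparison; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def header_score(headers: dict) -> int:
    """Hypothetical quality metric: count header fields with real content."""
    return sum(1 for v in headers.values() if v and v not in ("?", "????.??.??"))


def is_subsumed(short_moves: str, long_moves: str) -> bool:
    """True if the shorter game is a strict prefix of the longer one."""
    short = normalize_moves(short_moves).split()
    long_ = normalize_moves(long_moves).split()
    return len(short) < len(long_) and long_[: len(short)] == short


def deduplicate(games):
    """games: list of (headers, movetext). Returns (unique, duplicates)."""
    best = {}          # move hash -> current master game
    duplicates = []
    for headers, movetext in games:
        key = move_hash(movetext)
        if key not in best:
            best[key] = (headers, movetext)
        elif header_score(headers) > header_score(best[key][0]):
            duplicates.append(best[key])      # old master is demoted
            best[key] = (headers, movetext)
        else:
            duplicates.append((headers, movetext))
    return list(best.values()), duplicates


if __name__ == "__main__":
    sample = [
        ({"White": "Carlsen, Magnus", "Black": "?"}, "1. e4 e5 2. Nf3 Nc6 1-0"),
        ({"White": "Carlsen, Magnus", "Black": "Caruana, Fabiano"},
         "1.e4 e5 2.Nf3 {main line} Nc6 1-0"),
    ]
    unique, dups = deduplicate(sample)
    print(len(unique), "unique,", len(dups), "duplicate")
```

Hashing the normalized moves instead of the raw text means that differences in comments, spacing, or move-number formatting do not hide a duplicate, while the header score decides which copy survives as the master.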
You can find a more thorough description of how the script works on my website.
Final results of the deduplication:
Last cleanup with Scid
Finally, a cleanup is carried out with Scid. Scid still finds some duplicates here, which is mainly due to two things:
This cleanup catches approximately 1500 to 2000 additional duplicate games.
Have fun studying chess ;)
Regards,
Michael/Lumbra74
P.S.: I also have a blog post for the database, where I announce regular updates.