Chess database "Lumbra's Gigabase": New, quality-improved release.

Lumbra74

Hi Chess Friends,

I released a new version of Lumbra's Gigabase yesterday. The database containing online games was updated as well, but my focus with release 2025-07-01 (it contains games only up to 06/30/2025) was on the quality of the OTB game database.

Over the last month I wrote a Python script that was, at first, only meant to deduplicate the database. In the end it became more than that: an advanced script that deduplicates the chess games AND improves the quality of their headers.

TLDR functionality of the script:

The system for deduplicating chess games processes PGN files in several phases to identify duplicates and optimize data quality. First, it reads PGN files, extracts and cleans essential data, calculates hashes, and recognizes metadata. Then, it consolidates player-pair groups using fuzzy name comparisons. This is followed by exact deduplication based on move sequence hashes, where the header of the best game is chosen as the master. Games with subsumed move sequences are also flagged. Another phase uses fuzzy matching for textual similarities of move sequences. Finally, the system exports the unique games and, optionally, the flagged duplicates, optimizing header quality through the integration of FIDE data and a detailed evaluation to ensure the master game contains the best available information.
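To make the exact-deduplication and subsumption phases more concrete, here is a minimal sketch in Python. It is not the actual script: the helper names (`normalize_moves`, `header_score`, `deduplicate`) and the player names are made up for illustration, the "best header" is simply the one with the most known fields, "subsumed" is assumed to mean that one move list is a strict prefix of another, and the input is assumed to be a single player-pair group without variations in the movetext.

```python
import hashlib
import re
from collections import defaultdict

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}

def normalize_moves(movetext):
    """Reduce PGN movetext to a plain list of SAN moves."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop {comments}
    text = re.sub(r"\$\d+", " ", text)          # drop NAGs such as $1
    text = re.sub(r"\d+\.+", " ", text)         # drop move numbers "1." / "1..."
    return [tok for tok in text.split() if tok not in RESULTS]

def move_hash(moves):
    """Hash of the normalized move sequence, used as the exact-duplicate key."""
    return hashlib.sha1(" ".join(moves).encode("utf-8")).hexdigest()

def header_score(headers):
    """Assumed quality measure: number of header fields without a '?'."""
    return sum(1 for value in headers.values() if value and "?" not in value)

def deduplicate(games):
    """games: list of (headers, movetext). Returns (masters, exact, subsumed)."""
    groups = defaultdict(list)
    for headers, movetext in games:
        moves = normalize_moves(movetext)
        groups[move_hash(moves)].append((headers, moves))

    masters, exact = [], []
    for group in groups.values():
        # The game with the most complete header becomes the master.
        group.sort(key=lambda game: header_score(game[0]), reverse=True)
        masters.append(group[0])
        exact.extend(group[1:])

    # Flag masters whose moves are a strict prefix of a longer master.
    masters.sort(key=lambda game: len(game[1]), reverse=True)
    kept, subsumed = [], []
    for headers, moves in masters:
        if any(longer[:len(moves)] == moves for _, longer in kept):
            subsumed.append((headers, moves))
        else:
            kept.append((headers, moves))
    return kept, exact, subsumed

# Hypothetical sample data: two spellings of the same game plus a truncated copy.
games = [
    ({"White": "Doe, J", "Black": "Roe, J", "Date": "????.??.??"},
     "1. e4 e5 2. Nf3 Nc6 1-0"),
    ({"White": "Doe, John", "Black": "Roe, Jane", "Date": "2019.06.05"},
     "1.e4 e5 2.Nf3 {main line} Nc6 1-0"),
    ({"White": "Doe, John", "Black": "Roe, Jane", "Date": "2019.06.05"},
     "1. e4 e5 *"),
]
masters, exact, subsumed = deduplicate(games)
print(len(masters), len(exact), len(subsumed))
```

In this sample the second game becomes the master because its header has no unknown fields, the first game is an exact duplicate of it, and the third game is flagged as subsumed.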

You can find a more thorough description of how the script works here on my website.

Final results of the deduplication:

  • Total games in database: 10.064.281 (I accidentally didn't deduplicate in Scid for the last release)
  • Number of master games (unique): 9.561.489 –> exported games
  • Number of subsumed duplicates: 4.680
  • Number of exact duplicates: 364.089
  • Number of textual fuzzy duplicates: 134.023
  • Number of games with optimized headers: 631.747
  • Number of master games with at least one player linked to a unique FIDE ID: 8.367.855
  • Number of unique FIDE player IDs in master games: 223.455
  • Number of games with missing result (‘*’ or ‘?’): 0
  • Number of games with unknown or missing player names (White or Black): 573
  • Number of games where the date was cleaned/optimized: 324.491
  • Percentage of deduplicated games: 5.00%
  • Average number of duplicates per master game: 0.05
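As a quick sanity check, the figures in the list above can be recomputed from each other. The short Python snippet below only uses the numbers from the list; the variable names are my own.

```python
# Figures copied from the result list above.
total_games = 10_064_281
master_games = 9_561_489
subsumed = 4_680
exact = 364_089
fuzzy = 134_023
fide_linked = 8_367_855

# All three duplicate categories together must equal total minus masters.
duplicates = subsumed + exact + fuzzy
assert duplicates == total_games - master_games

pct_deduplicated = 100 * duplicates / total_games
avg_per_master = duplicates / master_games
fide_share = 100 * fide_linked / master_games

print(f"duplicates removed:       {duplicates}")
print(f"percentage deduplicated:  {pct_deduplicated:.2f}%")
print(f"duplicates per master:    {avg_per_master:.2f}")
print(f"masters with a FIDE link: {fide_share:.1f}%")
```

The three duplicate categories add up to 502.792 games, which matches the 5.00% and the 0.05 duplicates per master game reported above.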

Last cleanup with Scid

Finally, a cleanup is carried out with Scid. Scid still finds some duplicates here, which is mainly due to two things:

  1. Formation of the player-pair groupings: If the players’ names are spelled so differently that the games don't end up in the same grouping, the games cannot be recognized as duplicates.
  2. The maximum difference in move sequence length of 30%: If the difference in the number of plies exceeds 30%, the games are not recognized as duplicates either.
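The two limits above can be illustrated with a small Python sketch. This is only an approximation of the idea, not the script itself: the name threshold of 0.85, the use of `difflib.SequenceMatcher` for the fuzzy name comparison, the function names, and measuring the ply difference relative to the longer game are all assumptions made for the example. Only the 30% value comes from the description above.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.85  # assumed value, not taken from the script
MAX_PLY_DIFF = 0.30    # the 30% limit mentioned above

def name_similarity(a, b):
    """Similarity of two player names between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_pair_group(pair_a, pair_b, threshold=NAME_THRESHOLD):
    """True if both White and Black names are similar enough."""
    return all(name_similarity(x, y) >= threshold
               for x, y in zip(pair_a, pair_b))

def ply_difference(plies_a, plies_b):
    """Length difference relative to the longer game (assumed definition)."""
    longer = max(plies_a, plies_b)
    return abs(plies_a - plies_b) / longer if longer else 0.0

def is_candidate(pair_a, plies_a, pair_b, plies_b):
    """Two games are only compared if they pass both filters."""
    return (same_pair_group(pair_a, pair_b)
            and ply_difference(plies_a, plies_b) <= MAX_PLY_DIFF)

# Hypothetical player names for illustration.
pair = ("Doe, John", "Roe, Jane")
abbreviated = ("J. D.", "Roe, Jane")

print(is_candidate(pair, 80, pair, 60))         # length difference 25%
print(is_candidate(pair, 100, pair, 60))        # length difference 40%
print(is_candidate(pair, 80, abbreviated, 80))  # names too different
```

Only the first pair of games would be compared at all. The other two pairs are filtered out before any move comparison happens, which is the kind of case that is left for the Scid cleanup.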

This cleanup will catch approximately 1500 to 2000 additional duplicate games.

Have fun studying chess ;)

Regards,
Michael/Lumbra74
P.S.: I also have a blog post for the database, where I announce regular updates.