Chess database "Lumbra's Gigabase": New, quality-improved release. - Chess Forums

Jul 5, 2025

#1

Hi Chess Friends,

I released a new version of Lumbra's Gigabase yesterday. The database containing online games was updated as well, but my focus for release 2025-07-01 (it contains only games up to 06/30/2025) was on the quality of the OTB game database.

Over the last month I wrote a Python script that was, at first, only meant to deduplicate the database. In the end it became more than that: an advanced script that deduplicates the chess games AND improves the quality of their headers.

TLDR functionality of the script:

The system for deduplicating chess games processes PGN files in several phases to identify duplicates and optimize data quality. First, it reads PGN files, extracts and cleans essential data, calculates hashes, and recognizes metadata. Then, it consolidates player-pair groups using fuzzy name comparisons. This is followed by exact deduplication based on move sequence hashes, where the header of the best game is chosen as the master. Games with subsumed move sequences are also flagged. Another phase uses fuzzy matching for textual similarities of move sequences. Finally, the system exports the unique games and, optionally, the flagged duplicates, optimizing header quality through the integration of FIDE data and a detailed evaluation to ensure the master game contains the best available information.
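To make the exact-deduplication and subsumed-game phases more concrete, here is a minimal sketch in Python. It is not the actual script: the function names, the `header_score` heuristic, and the dict layout of a game are assumptions made up for illustration, and the movetext cleanup ignores variations in parentheses.

```python
import hashlib
import re
from collections import defaultdict

RESULT_TOKENS = {"1-0", "0-1", "1/2-1/2", "*"}

def normalize_moves(movetext):
    """Reduce PGN movetext to a bare list of SAN moves (no variations handled)."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop {comments}
    text = re.sub(r"\$\d+", " ", text)          # drop NAGs such as $1
    text = re.sub(r"\d+\.+", " ", text)         # drop move numbers (1. or 1...)
    return [tok for tok in text.split() if tok not in RESULT_TOKENS]

def move_hash(moves):
    """Hash the normalized move sequence so identical games collide."""
    return hashlib.sha1(" ".join(moves).encode("utf-8")).hexdigest()

def header_score(headers):
    """Hypothetical quality score: count headers that carry real information."""
    return sum(1 for value in headers.values() if value and value.strip("?. "))

def deduplicate_exact(games):
    """Group games by move hash; the best-scored header becomes the master."""
    groups = defaultdict(list)
    for game in games:
        groups[move_hash(normalize_moves(game["movetext"]))].append(game)
    masters, duplicates = [], []
    for group in groups.values():
        ranked = sorted(group, key=lambda g: header_score(g["headers"]), reverse=True)
        masters.append(ranked[0])
        duplicates.extend(ranked[1:])
    return masters, duplicates

def is_subsumed(short_moves, long_moves):
    """True if the shorter game is a strict prefix of the longer one."""
    return (len(short_moves) < len(long_moves)
            and long_moves[:len(short_moves)] == short_moves)
```

Hashing the normalized moves rather than the raw text means that differences in comments, move-number formatting, or the result token do not hide a duplicate.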

You can find a more thorough description of how the script works here on my website.

Final results of the deduplication:

Total games in database: 10.064.281 (I accidentally didn't deduplicate in Scid for the last release)
Number of master games (unique): 9.561.489 –> exported games
Number of subsumed duplicates: 4.680
Number of exact duplicates: 364.089
Number of textual fuzzy duplicates: 134.023
Number of games with optimized headers: 631.747
Number of master games with at least one player linked to a unique FIDE ID: 8.367.855
Number of unique FIDE player IDs in master games: 223.455
Number of games with missing result (‘*’ or ‘?’): 0
Number of games with unknown or missing player names (White or Black): 573
Number of games where the date was cleaned/optimized: 324.491
Percentage of deduplicated games: 5.00%
Average number of duplicates per master game: 0.05
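
The last two figures follow directly from the counts above. A short Python check, using only the numbers from this list (the variable names are my own):

```python
total_games = 10_064_281
master_games = 9_561_489
subsumed = 4_680
exact = 364_089
fuzzy = 134_023

# All three duplicate categories together account for every removed game.
duplicates = subsumed + exact + fuzzy

percent_deduplicated = duplicates / total_games * 100
duplicates_per_master = duplicates / master_games

print(duplicates == total_games - master_games)
print(round(percent_deduplicated, 2))
print(round(duplicates_per_master, 2))
```

The three duplicate categories add up to 502.792 games, which is exactly the difference between the total and the number of master games.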

Last cleanup with Scid

Finally, a cleanup is carried out with Scid. Scid still finds some duplicates here, which is mainly due to two things:

Formation of the player pair groupings: If the players’ names are spelled so differently that they aren't included in the grouping, they cannot be recognized as duplicates.
The maximum difference in move sequence length of 30%: If the difference in the number of plies exceeds 30%, the games are not compared and therefore not recognized as duplicates either.
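
The two limits above can be pictured as two gates that a pair of games has to pass before it is compared at all. The following Python sketch is only an illustration, not the script itself: the use of `difflib`, the name threshold of 0.85, and the choice to measure the 30% relative to the longer game are all assumptions.

```python
import re
from difflib import SequenceMatcher

def normalize_name(name):
    """Lowercase a player name and strip punctuation for comparison."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()

def same_player(name_a, name_b, threshold=0.85):
    """Fuzzy name match; the threshold is a made-up example value."""
    ratio = SequenceMatcher(None, normalize_name(name_a), normalize_name(name_b)).ratio()
    return ratio >= threshold

def ply_length_compatible(plies_a, plies_b, max_diff=0.30):
    """Allow comparison only if the ply counts differ by at most max_diff.

    Assumption: the difference is measured relative to the longer game.
    """
    longer = max(plies_a, plies_b)
    if longer == 0:
        return True
    return abs(plies_a - plies_b) / longer <= max_diff

def is_candidate_pair(game_a, game_b):
    """Both gates must pass before the move sequences are compared."""
    return (same_player(game_a["white"], game_b["white"])
            and same_player(game_a["black"], game_b["black"])
            and ply_length_compatible(game_a["plies"], game_b["plies"]))
```

A pair that fails either gate never reaches the move comparison, which is why Scid can still find a few duplicates afterwards.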

This cleanup will catch approximately 1500 to 2000 additional duplicate games.

Have fun studying chess ;)

Regards,
Michael/Lumbra74
P.S.: I also have a blog post for the database, where I announce regular updates.

stevenaaus

Jul 25, 2025

#2

I've just noticed ScidvsPC's linked-to db (Caissabase) is gone.

I'll change the default db to Lumbras maybe ??? https://lumbrasgigabase.com/en/

At least for now. Cheers

Lumbra74

Jul 26, 2025

#3

Feel free to link it
Thanks!

stevenaaus

Oct 16, 2025

#4

I just used the DB to find the 1970 candidates matches. Cheers.... They were better than the Millbase tournament games.