I especially like the addition of games from the Lichess Elite database and the chess transfer system. This will certainly be a valuable addition to your already impressive collection.
Free chess game database with over 11 million games (Scid vs. PC database format)
@Lumbra74: hey, I just noticed that on this page: https://lumbrasgigabase.com/download-the-scid-database/ the download link for the si4 file now only links to the update (5.68 MB 7zip file) rather than the full archive...
A new version of the database (version 2024-02-27) was uploaded yesterday.
The database will be updated weekly, usually Tuesdays, after the release of the most recent TWIC file. The following files will be uploaded:
- Database files (si5, si4 format)
- A differential PGN file containing the new games since the release of the last database
- A monthly PGN file containing the new games
- For instance: the current database was released on 02/27/2024, and the last database of March will be released on 03/26/2024. That is four weeks, so the monthly update file will contain all weekly updates between the two releases.
I'm visiting your site for the first time, and it says the database is being updated. Does that mean I should return in a few days to get the files? Thanks - this all sounds super helpful!
Pretty decent blog. I went to the webpage, but it was "being updated". The ....chessok.com link is dead too.
If you're using fewer than six sources, instead of having an expensive extra tag you may want to use the custom flags... Hmm, you probably have more than six sources, but it's a shame more use isn't made of the custom flags... Very quick searches.
Oops, thanks for the reminder. I totally forgot to turn off the maintenance mode while I was in a hurry. I'm SORRY!!!
The problem is, when I go to the web address I can still see the content, even with maintenance mode on. grmml...
It's now turned off and the site is reachable again.
If you're using fewer than six sources, instead of having an expensive extra tag you may want to use the custom flags... Hmm, you probably have more than six sources, but it's a shame more use isn't made of the custom flags... Very quick searches.
I'm using custom flags:
- LEDB - LichessEliteDatabase
- LGB - LumbrasGigaBase (Games out of a few other sources)
- PGNMent - PGN Mentor
- TWIC - The Week in Chess
- Masters - A database called "Games Of GM's"
- LichessB - Games pulled out of the Lichess Broadcast System
Good work. How did you bulk-add a "Source ..." tag? Unless I'm mistaken, I don't think that's possible with Scid vs. PC...
?
I'll write that feature today/sometime... it should come in handy.
Is there a way to add all of this information into an AI? Similar to Luk.ai
I have no idea. Never experimented with an AI...
And please: I've created a blog post for future information/discussions:
You still have the "Source" tags as well?
I stripped them, then compacted the game file. A big workout for the GUI, laugh, but no problems. The sizes are now:
-rw-r--r-- 1 steve 10285949 Jun 1 13:48 LumbrasGigaBase.sn4
-rw-rw-r-- 1 steve 642779279 Jun 1 13:51 LumbrasGigaBase.si4
-rw-rw-r-- 1 steve 1721331687 Jun 1 13:51 LumbrasGigaBase.sg4
Yeah, the tags still exist. At first they were just a help to differentiate between my sources, but then I thought it was a nice feature.
I don't know the sizes of the database files right now, but the version without these tags should be significantly smaller.

Let me know when you get the new chess game database with the top chess engines; that would be the crown for those who would like to get deeper into this game. For example, to get an idea of what to look for, this link lists the top 20 chess engines according to their performance:
https://tvlavin.blogspot.com/2025/12/los-32-mejores-motores-de-ajedrez-de.html
Hi Lumbra / Michael,
Thank you for sharing this and for the huge amount of work behind Lumbras GigaBase. Making such a clean and large OTB collection freely available is really appreciated.
I’m the developer of Chesspertise (), a low-cost chess database and training app for iPad, Android tablets, macOS, and Windows.
As a small thank-you, I’d be very happy to give you free access to Chesspertise if you’d like to try it.
Thanks again for contributing this resource to the chess community.
Francesco
Hi Francesco,
thanks for the praise. I've also found three new sources CodeKiddy, TheChessDog and BenBase. Sadly, a lot of the databases mentioned in the reddit thread are not available anymore.
A new, further improved database will be released shortly. I've spent a lot of time developing the Python script that cleans the database:
The Import Process
PGN Parsing and Chunk Processing
The import begins with reading the PGN files. Since chess databases often span several gigabytes, the files are not loaded completely into memory but split into manageable chunks. Each chunk contains a defined number of games (typically 50,000) and is processed independently.
The parser recognizes:
- Header Tags: All PGN tags such as Event, Site, Date, White, Black, Result, ECO, etc.
- Move Sequences: The complete notation including variations and comments
- Invalid Games: Faulty notation is written to a separate log file
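As a rough illustration of the chunked reading and header/movetext parsing described above (the function names and the 50,000-game default are stand-ins, not the actual script):

```python
import re
from itertools import islice

def iter_games(lines):
    """Yield raw PGN games as (headers, movetext) tuples from an iterable of lines."""
    headers, moves = {}, []
    for line in lines:
        line = line.strip()
        if line.startswith("["):                 # header tag, e.g. [Event "..."]
            if moves:                            # a new game starts: emit the previous one
                yield headers, " ".join(moves)
                headers, moves = {}, []
            m = re.match(r'\[(\w+)\s+"(.*)"\]', line)
            if m:
                headers[m.group(1)] = m.group(2)
        elif line:                               # movetext line
            moves.append(line)
    if headers or moves:
        yield headers, " ".join(moves)

def chunked(games, size=50_000):
    """Split a game stream into chunks so files never have to fit in memory at once."""
    it = iter(games)
    while chunk := list(islice(it, size)):
        yield chunk
```

Invalid games (where the regex or movetext parsing fails) would be diverted to a log file instead of being yielded.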
Parallel Processing with Worker Pools
To leverage the full power of modern multi-core processors, multiple worker processes run in parallel. Each worker:
- Reads a chunk
- Parses the PGN notation
- Normalizes moves and calculates hash values
- Writes the results to a staging table
Doing the intensive parsing and normalization up front saves a number of work steps and I/O-heavy database queries later on.
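The worker-pool pattern could be sketched like this (a minimal stand-in: `process_chunk` abbreviates the real parse/normalize/hash pipeline, and `sink` stands in for the bulk insert into the staging table):

```python
import zlib
from multiprocessing import Pool

def process_chunk(chunk):
    """Worker body: stand-in for parsing, normalizing, and hashing one chunk."""
    return [(moves, zlib.crc32(moves.encode())) for moves in chunk]

def run_import(chunks, sink, workers=4):
    """Fan chunks out to a worker pool; each result batch goes to `sink`
    (in the real script: a bulk INSERT into the staging table)."""
    if workers <= 1:                       # sequential fallback, handy for testing
        for chunk in chunks:
            sink(process_chunk(chunk))
        return
    with Pool(workers) as pool:
        for rows in pool.imap_unordered(process_chunk, chunks):
            sink(rows)
```

`imap_unordered` lets whichever worker finishes first hand back its rows, which keeps all cores busy when chunks take unequal time.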
Staging Table and Batch Transfer
The import uses a two-stage architecture for maximum robustness:
- Staging Phase: All games are first written to a temporary UNLOGGED table - fast, because it skips write-ahead logging
- Finalize Phase: Data is transferred batch-wise (default: 1 million rows per batch) to the final games table
The batch processing uses a temporary lookup table for player IDs and calculates header scores efficiently in SQL, maximizing transfer speed.
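The batch-wise finalize step can be reduced to this shape (a sketch: `insert_batch` stands in for the real `INSERT ... SELECT` from staging into the final games table):

```python
def finalize(staging_rows, insert_batch, batch_size=1_000_000):
    """Move rows from the staging table to the final table in fixed-size batches."""
    batch, moved = [], 0
    for row in staging_rows:
        batch.append(row)
        if len(batch) >= batch_size:
            insert_batch(batch)          # one transaction per batch, not per row
            moved += len(batch)
            batch = []
    if batch:                            # flush the final partial batch
        insert_batch(batch)
        moved += len(batch)
    return moved
```

Batching bounds memory and transaction size, so a crash mid-transfer loses at most one batch, not the whole import.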
All deduplication phases are fully parallelized. This includes:
- Exact Phase: Parallel processing of duplicate groups
- Join Phase: Parallel fuzzy matching of candidate pairs
- Header Merge: Parallel merging of metadata
- Variant Merge: Parallel integration of differing move sequences
Move Sequence Normalization
A critical step before deduplication: moves from different sources are often notated differently. The script normalizes all moves to a unified format:
- Piece symbols are standardized (K, Q, R, B, N)
- Superfluous characters (!, ?, +, #) are removed for hash comparison
- Move numbering is standardized
From the normalized move sequence, a hash value (xxHash) is calculated - a unique fingerprint of the game for fast comparisons.
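The normalization rules above can be sketched as follows (note: the real script uses xxHash for speed; `hashlib.sha1` stands in here so the example needs no third-party package):

```python
import hashlib
import re

def normalize_moves(movetext):
    """Strip annotations and move numbers so identical games hash identically."""
    s = re.sub(r"\{[^}]*\}", " ", movetext)   # remove {comments}
    s = re.sub(r"[!?+#]", "", s)              # strip !, ?, +, # annotations
    s = re.sub(r"\d+\.(\.\.)?", "", s)        # drop move numbers like 12. or 12...
    return " ".join(s.split())                # collapse whitespace

def game_hash(movetext):
    """Fingerprint of the normalized moves (stand-in for the script's xxHash)."""
    return hashlib.sha1(normalize_moves(movetext).encode()).hexdigest()
```

Two differently annotated transcriptions of the same game now produce the same fingerprint, which is what makes the later hash-based duplicate grouping work.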
Comprehensive Data Validation
All fields are strictly validated during import:
- FIDE IDs: Stored as BIGINT (supports all current and future IDs)
- Year values: Validated in range 0-2100
- ELO ratings: Validated in range 0-4000
- Date fields: Must conform to YYYY-MM-DD format
- Invalid values: Markers like "????" are automatically filtered
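The validation rules can be expressed as small guard functions (a sketch with hypothetical names; the actual script works inside the database schema):

```python
import re

PLACEHOLDERS = {"?", "??", "????", "-", ""}

def clean_field(value):
    """Map PGN placeholder markers like '????' to None."""
    v = value.strip() if isinstance(value, str) else value
    return None if v in PLACEHOLDERS else v

def validate_elo(raw):
    """Accept ratings in 0-4000, else None."""
    try:
        elo = int(clean_field(raw))
    except (TypeError, ValueError):
        return None
    return elo if 0 <= elo <= 4000 else None

def validate_date(raw):
    """Accept only full dates (PGN's '2024.02.27' becomes '2024-02-27')."""
    v = clean_field(raw)
    if v is None:
        return None
    v = v.replace(".", "-")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        return None
    return v if 0 <= int(v[:4]) <= 2100 else None
```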
Preparation for Deduplication
Player Name Normalization
One of the biggest challenges: the same player appears under different names. "Carlsen, Magnus", "Carlsen,Magnus", "Carlsen, M.", and "Magnus Carlsen" are all the same person. The script performs several normalization steps:
- Whitespace Normalization: Spaces and commas are unified
- Title Extraction: "GM", "IM", "FM" etc. are stored separately
- Unicode Normalization: Special characters and accents are standardized
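A compact version of these normalization steps might look like this (the title list and the prefix-title convention are illustrative assumptions):

```python
import re
import unicodedata

TITLES = {"GM", "IM", "FM", "WGM", "WIM", "WFM", "CM", "NM"}

def normalize_name(raw):
    """Return (normalized_name, title): accents stripped, whitespace and
    comma spacing unified, a leading title like 'GM' split off."""
    s = unicodedata.normalize("NFKD", raw)
    s = "".join(c for c in s if not unicodedata.combining(c))  # drop accents
    s = re.sub(r"\s*,\s*", ", ", s)                            # unify comma spacing
    s = " ".join(s.split())
    title, parts = None, s.split()
    if parts and parts[0].rstrip(".").upper() in TITLES:
        title = parts[0].rstrip(".").upper()
        s = " ".join(parts[1:])
    return s, title
```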
FIDE Player Lookup
A powerful feature: the script matches player names against the official FIDE database. When a match is found:
- The official FIDE ID is assigned
- The correct spelling of the name is adopted
- Nationality and title are verified
The lookup uses trigram-based similarity search (pg_trgm) to find the correct player even with typos or alternative spellings.
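To show the idea behind the trigram matching, here is a rough pure-Python approximation of PostgreSQL's `similarity()` (pg_trgm's exact padding and tokenization rules differ slightly; the 0.4 threshold is an assumption):

```python
def trigrams(s):
    """Three-character shingles of a lowercased, space-padded string."""
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard-style trigram similarity, in the spirit of pg_trgm."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_fide_match(name, fide_names, threshold=0.4):
    """Return the closest FIDE name, or None if nothing is similar enough."""
    score, match = max((similarity(name, f), f) for f in fide_names)
    return match if score >= threshold else None
```

Because trigram overlap survives single-character typos, "Carlson" still lands on "Carlsen" while unrelated names fall below the threshold.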
Reference ID Assignment
Similar player names are grouped into reference groups. All variants of a name receive the same reference_id, which enormously speeds up later queries. Instead of searching for "Carlsen, Magnus OR Carlsen,Magnus OR ...", a single ID suffices.
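The grouping idea can be sketched like this (grouping by surname is a deliberate oversimplification; the real script groups by the similarity matching described above):

```python
from collections import defaultdict

def assign_reference_ids(names, key=lambda n: n.split(",")[0].strip().lower()):
    """Give all variants of a name the same reference_id."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    ref_id = {}
    for i, (_, variants) in enumerate(sorted(groups.items()), start=1):
        for v in variants:
            ref_id[v] = i                 # one ID per group of name variants
    return ref_id
```

A query then filters on the single integer `reference_id` instead of OR-ing together every spelling variant.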
Materialized View for Duplicate Candidates
Before the actual deduplication, a Materialized View is created in PostgreSQL. This pre-computation identifies all game pairs with identical hashes and stores them for fast access. The deduplication can thus work directly on relevant candidates instead of having to search through all millions of games.
The Deduplication Phases
Phase 1: Exact Duplicate Detection
The first step identifies games with exactly identical move sequences. For each game, a unique hash value (fingerprint) is calculated from the normalized move sequence. Games with identical hashes are grouped as potential duplicates.
Phase 2: Subsumption Detection
Not all duplicates have exactly the same move sequence. Often a game was interrupted, or the notation in one source ends earlier than in another. Subsumption detection finds such cases:
- Game A contains moves 1-40
- Game B contains moves 1-35 (identical to A)
- → Game B is a subsumption of Game A
The shorter fragment is marked as a duplicate of the more complete game, while the complete version is preserved as the master.
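The prefix test at the heart of this phase is simple; a minimal sketch (with hypothetical names, working on lists of SAN moves):

```python
def is_subsumed(short_moves, long_moves):
    """True if the shorter game is an exact prefix of the longer one."""
    return (len(short_moves) < len(long_moves)
            and long_moves[:len(short_moves)] == short_moves)

def mark_subsumptions(games):
    """Given {game_id: move list}, map each fragment to its master game."""
    dup_of = {}
    # longest games first, so masters are considered before their fragments
    items = sorted(games.items(), key=lambda kv: len(kv[1]), reverse=True)
    for i, (gid, moves) in enumerate(items):
        for mid, master in items[:i]:
            if is_subsumed(moves, master):
                dup_of[gid] = mid
                break
    return dup_of
```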
Phase 3: Join-Lines Detection
Sometimes two games complement each other: one source has the opening in detail, another has the endgame. The Join-Lines phase detects such cases and can intelligently merge the move sequences to reconstruct the most complete version.
Phase 4: Intelligent Header Merging
Different sources provide different quality metadata. One source has the correct FIDE IDs of the players, another has the exact tournament date, yet another has the ELO ratings at the time of the game.
The script uses a score-based merge system:
- Each header value (player name, date, ELO, etc.) receives a quality score
- FIDE-verified data receives higher scores
- For each field, the value with the highest score is adopted
- The result is a "best-of-all" version of the game
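The score-based selection reduces to a per-field maximum; a minimal sketch, assuming each source's headers arrive as `field -> (value, score)` pairs:

```python
def merge_headers(versions):
    """Pick, per field, the value with the highest quality score.

    `versions` is a list of dicts mapping field -> (value, score);
    e.g. a FIDE-verified name would carry a higher score than a raw one.
    """
    best = {}
    for headers in versions:
        for field, (value, score) in headers.items():
            if field not in best or score > best[field][1]:
                best[field] = (value, score)
    return {field: value for field, (value, _) in best.items()}
```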
Phase 5: Variant Merge
A special feature of the algorithm: when duplicates have differing move sequences (e.g., due to different transcriptions or alternative moves in analysis databases), these are not discarded but embedded as PGN variations in the master game.
Example: If the main source records 10. Nf3, but another source has 10. Ng5, this appears in the result as:
10. Nf3 (10. Ng5) 10... Be7
This way, no information is lost, and players can trace the different sources.
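A simplified sketch of how such a variation could be embedded (it handles only the first divergence and takes SAN move lists; the real merge is more general):

```python
def merge_variant(master, variant):
    """Embed the first diverging move of `variant` as a PGN variation in `master`."""
    tokens = []
    diverged = renumber = False
    for ply, move in enumerate(master):
        num = ply // 2 + 1
        white = ply % 2 == 0
        if white:
            tokens.append(f"{num}. {move}")
        elif renumber:                      # black's reply after a variation: '10... Be7'
            tokens.append(f"{num}... {move}")
            renumber = False
        else:
            tokens.append(move)
        if not diverged and ply < len(variant) and variant[ply] != move:
            alt = (f"{num}. " if white else f"{num}... ") + variant[ply]
            tokens.append(f"({alt})")       # alternative move as a PGN variation
            diverged = True
            renumber = white
    return " ".join(tokens)
```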
Result
After processing, the database contains:
- Master Games: The most complete, highest-quality version of each unique game
- Duplicate markers when lines were merged (a tag Merged = True is added)
- Complete Provenance: All sources are traceably documented
Hello Michael,
thank you very much for introducing my database, I really appreciate it! I have just uploaded the latest version with TWIC 1528.
In the near future I will also extract opening books from the database, in a similar way to the PGN files, and make them available on the website. To do this, however, I will have to change the menus again.
Best regards,
Michael/Lumbra74