I especially like the addition of games from the Lichess Elite database and the chess transfer system. This will certainly be a valuable addition to your already impressive collection.
Free chess game database with over 11 million games (Scid vs. PC database format)
@Lumbra74: hey, I just noticed that on this page: https://lumbrasgigabase.com/download-the-scid-database/ the download link for the si4 file now only links to the update (5.68 MB 7zip file) rather than the full archive...
A new version of the database (version 2024-02-27) was uploaded yesterday.
The database will be updated weekly, usually Tuesdays, after the release of the most recent TWIC file. The following files will be uploaded:
- Database files (si5, si4 format)
- A differential PGN file containing the new games since the release of the last database
- A monthly PGN file containing the new games
- For instance: the current database was released on 02/27/2024, and the last database of March will be released on 03/26/2024. That is four weeks, so the monthly update file will contain all weekly updates between the two releases.
I'm visiting your site for the first time, and it says the database is being updated. Does that mean I should return in a few days to get the files? Thanks - this all sounds super helpful!
Pretty decent blog. I went to the webpage, but it was "being updated". The ....chessok.com link is dead too.
If you're using fewer than six sources, instead of having an expensive extra tag you may want to use the custom flags... Hmm, you probably have more than six sources, but it's a shame more use isn't made of the custom flags... Very quick searches.
Oops, thanks for the reminder. I totally forgot to turn off the maintenance mode while I was in a hurry. I'm SORRY!!!
The problem is, when I go to the web address I can still see the content, even with maintenance mode on. grmml...
It's now turned off and the site is reachable again.
If you're using fewer than six sources, instead of having an expensive extra tag you may want to use the custom flags... Hmm, you probably have more than six sources, but it's a shame more use isn't made of the custom flags... Very quick searches.
I'm using custom flags:
- LEDB - LichessEliteDatabase
- LGB - LumbrasGigaBase (Games out of a few other sources)
- PGNMent - PGN Mentor
- TWIC - The Week in Chess
- Masters - A database called "Games Of GM's"
- LichessB - Games pulled out of the Lichess Broadcast System
Good work. How did you bulk-add a "Source ..." tag? Unless I'm mistaken, I don't think that's possible with Scid vs. PC...
?
I'll write that feature today/sometime... it should come in handy.
Is there a way to add all of this information into an AI? Similar to Luk.ai
I have no idea. Never experimented with an AI...
And please: I've created a blog post for future information/discussions:
You still have the "Source" tags as well?
I stripped them, then compacted the game file. A big workout for the GUI, laugh, but no problems. The sizes are now:
-rw-r--r-- 1 steve 10285949 Jun 1 13:48 LumbrasGigaBase.sn4
-rw-rw-r-- 1 steve 642779279 Jun 1 13:51 LumbrasGigaBase.si4
-rw-rw-r-- 1 steve 1721331687 Jun 1 13:51 LumbrasGigaBase.sg4
Yeah, the tags still exist. At first they were just a help to differentiate between my sources, but then I thought it was a nice feature.
I don't know the sizes of the database files right now, but the version without these tags should be significantly smaller.

Let me know when you get the new chess game database with the top chess engines; that would be the crown for those who would like to get deeper into this game. For example, to get an idea of what to look for, this link lists the top 20 chess engines according to their performance:
https://tvlavin.blogspot.com/2025/12/los-32-mejores-motores-de-ajedrez-de.html
Hi Lumbra / Michael,
Thank you for sharing this and for the huge amount of work behind Lumbras GigaBase. Making such a clean and large OTB collection freely available is really appreciated.
I’m the developer of Chesspertise (), a low-cost chess database and training app for iPad, Android tablets, macOS, and Windows.
As a small thank-you, I’d be very happy to give you free access to Chesspertise if you’d like to try it.
Thanks again for contributing this resource to the chess community.
Francesco
Hi Francesco,
thanks for the praise. I've also found three new sources CodeKiddy, TheChessDog and BenBase. Sadly, a lot of the databases mentioned in the reddit thread are not available anymore.
A new, further improved database will be released shortly. I've spent a lot of time developing the Python script that cleans the database:
The Import Process
PGN Parsing and Chunk Processing
The import begins with reading the PGN files. Since chess databases often span several gigabytes, the files are not loaded completely into memory but split into manageable chunks. Each chunk contains a defined number of games (typically 50,000) and is processed independently.
The parser recognizes:
- Header Tags: All PGN tags such as Event, Site, Date, White, Black, Result, ECO, etc.
- Move Sequences: The complete notation including variations and comments
- Invalid Games: Faulty notation is written to a separate log file
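As a rough illustration of the chunked reading and header/movetext parsing described above (the function names and the 50,000-game default are stand-ins, not the actual script):

```python
import re
from itertools import islice

def iter_games(lines):
    """Yield raw PGN games as (headers, movetext) tuples from an iterable of lines."""
    headers, moves = {}, []
    for line in lines:
        line = line.strip()
        if line.startswith("["):                 # header tag, e.g. [Event "..."]
            if moves:                            # a new game starts: emit the previous one
                yield headers, " ".join(moves)
                headers, moves = {}, []
            m = re.match(r'\[(\w+)\s+"(.*)"\]', line)
            if m:
                headers[m.group(1)] = m.group(2)
        elif line:                               # movetext line
            moves.append(line)
    if headers or moves:
        yield headers, " ".join(moves)

def chunked(games, size=50_000):
    """Split a game stream into chunks so files never have to fit in memory at once."""
    it = iter(games)
    while chunk := list(islice(it, size)):
        yield chunk
```

Invalid games (where the regex or movetext parsing fails) would be diverted to a log file instead of being yielded.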
Parallel Processing with Worker Pools
To leverage the full power of modern multi-core processors, multiple worker processes run in parallel. Each worker:
- Reads a chunk
- Parses the PGN notation
- Normalizes moves and calculates hash values
- Writes the results to a staging table
Doing the intensive parsing and normalization up front saves a number of work steps and I/O-heavy database queries later on.
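The worker-pool pattern could be sketched like this (a minimal stand-in: `process_chunk` abbreviates the real parse/normalize/hash pipeline, and `sink` stands in for the bulk insert into the staging table):

```python
import zlib
from multiprocessing import Pool

def process_chunk(chunk):
    """Worker body: stand-in for parsing, normalizing, and hashing one chunk."""
    return [(moves, zlib.crc32(moves.encode())) for moves in chunk]

def run_import(chunks, sink, workers=4):
    """Fan chunks out to a worker pool; each result batch goes to `sink`
    (in the real script: a bulk INSERT into the staging table)."""
    if workers <= 1:                       # sequential fallback, handy for testing
        for chunk in chunks:
            sink(process_chunk(chunk))
        return
    with Pool(workers) as pool:
        for rows in pool.imap_unordered(process_chunk, chunks):
            sink(rows)
```

`imap_unordered` lets whichever worker finishes first hand back its rows, which keeps all cores busy when chunks take unequal time.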
Staging Table and Batch Transfer
The import uses a two-stage architecture for maximum robustness:
- Staging Phase: All games are first written to a temporary UNLOGGED table - fast, because it skips write-ahead logging
- Finalize Phase: Data is transferred batch-wise (default: 1 million rows per batch) to the final games table
The batch processing uses a temporary lookup table for player IDs and calculates header scores efficiently in SQL, maximizing transfer speed.
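The batch-wise finalize step can be reduced to this shape (a sketch: `insert_batch` stands in for the real `INSERT ... SELECT` from staging into the final games table):

```python
def finalize(staging_rows, insert_batch, batch_size=1_000_000):
    """Move rows from the staging table to the final table in fixed-size batches."""
    batch, moved = [], 0
    for row in staging_rows:
        batch.append(row)
        if len(batch) >= batch_size:
            insert_batch(batch)          # one transaction per batch, not per row
            moved += len(batch)
            batch = []
    if batch:                            # flush the final partial batch
        insert_batch(batch)
        moved += len(batch)
    return moved
```

Batching bounds memory and transaction size, so a crash mid-transfer loses at most one batch, not the whole import.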
All deduplication phases are fully parallelized. This includes:
- Exact Phase: Parallel processing of duplicate groups
- Join Phase: Parallel fuzzy matching of candidate pairs
- Header Merge: Parallel merging of metadata
- Variant Merge: Parallel integration of differing move sequences
Move Sequence Normalization
A critical step before deduplication: moves from different sources are often notated differently. The script normalizes all moves to a unified format:
- Piece symbols are standardized (K, Q, R, B, N)
- Superfluous characters (!, ?, +, #) are removed for hash comparison
- Move numbering is standardized
From the normalized move sequence, a hash value (xxHash) is calculated - a unique fingerprint of the game for fast comparisons.
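The normalization rules above can be sketched as follows (note: the real script uses xxHash for speed; `hashlib.sha1` stands in here so the example needs no third-party package):

```python
import hashlib
import re

def normalize_moves(movetext):
    """Strip annotations and move numbers so identical games hash identically."""
    s = re.sub(r"\{[^}]*\}", " ", movetext)   # remove {comments}
    s = re.sub(r"[!?+#]", "", s)              # strip !, ?, +, # annotations
    s = re.sub(r"\d+\.(\.\.)?", "", s)        # drop move numbers like 12. or 12...
    return " ".join(s.split())                # collapse whitespace

def game_hash(movetext):
    """Fingerprint of the normalized moves (stand-in for the script's xxHash)."""
    return hashlib.sha1(normalize_moves(movetext).encode()).hexdigest()
```

Two differently annotated transcriptions of the same game now produce the same fingerprint, which is what makes the later hash-based duplicate grouping work.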
Comprehensive Data Validation
All fields are strictly validated during import:
- FIDE IDs: Stored as BIGINT (supports all current and future IDs)
- Year values: Validated in range 0-2100
- ELO ratings: Validated in range 0-4000
- Date fields: Must conform to YYYY-MM-DD format
- Invalid values: Markers like "????" are automatically filtered
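The validation rules can be expressed as small guard functions (a sketch with hypothetical names; the actual script works inside the database schema):

```python
import re

PLACEHOLDERS = {"?", "??", "????", "-", ""}

def clean_field(value):
    """Map PGN placeholder markers like '????' to None."""
    v = value.strip() if isinstance(value, str) else value
    return None if v in PLACEHOLDERS else v

def validate_elo(raw):
    """Accept ratings in 0-4000, else None."""
    try:
        elo = int(clean_field(raw))
    except (TypeError, ValueError):
        return None
    return elo if 0 <= elo <= 4000 else None

def validate_date(raw):
    """Accept only full dates (PGN's '2024.02.27' becomes '2024-02-27')."""
    v = clean_field(raw)
    if v is None:
        return None
    v = v.replace(".", "-")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        return None
    return v if 0 <= int(v[:4]) <= 2100 else None
```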
Preparation for Deduplication
Player Name Normalization
One of the biggest challenges: the same player appears under different names. "Carlsen, Magnus", "Carlsen,Magnus", "Carlsen, M.", and "Magnus Carlsen" are all the same person. The script performs several normalization steps:
- Whitespace Normalization: Spaces and commas are unified
- Title Extraction: "GM", "IM", "FM" etc. are stored separately
- Unicode Normalization: Special characters and accents are standardized
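A compact version of these normalization steps might look like this (the title list and the prefix-title convention are illustrative assumptions):

```python
import re
import unicodedata

TITLES = {"GM", "IM", "FM", "WGM", "WIM", "WFM", "CM", "NM"}

def normalize_name(raw):
    """Return (normalized_name, title): accents stripped, whitespace and
    comma spacing unified, a leading title like 'GM' split off."""
    s = unicodedata.normalize("NFKD", raw)
    s = "".join(c for c in s if not unicodedata.combining(c))  # drop accents
    s = re.sub(r"\s*,\s*", ", ", s)                            # unify comma spacing
    s = " ".join(s.split())
    title, parts = None, s.split()
    if parts and parts[0].rstrip(".").upper() in TITLES:
        title = parts[0].rstrip(".").upper()
        s = " ".join(parts[1:])
    return s, title
```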
FIDE Player Lookup
A powerful feature: the script matches player names against the official FIDE database. When a match is found:
- The official FIDE ID is assigned
- The correct spelling of the name is adopted
- Nationality and title are verified
The lookup uses trigram-based similarity search (pg_trgm) to find the correct player even with typos or alternative spellings.
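To show the idea behind the trigram matching, here is a rough pure-Python approximation of PostgreSQL's `similarity()` (pg_trgm's exact padding and tokenization rules differ slightly; the 0.4 threshold is an assumption):

```python
def trigrams(s):
    """Three-character shingles of a lowercased, space-padded string."""
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard-style trigram similarity, in the spirit of pg_trgm."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_fide_match(name, fide_names, threshold=0.4):
    """Return the closest FIDE name, or None if nothing is similar enough."""
    score, match = max((similarity(name, f), f) for f in fide_names)
    return match if score >= threshold else None
```

Because trigram overlap survives single-character typos, "Carlson" still lands on "Carlsen" while unrelated names fall below the threshold.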
Reference ID Assignment
Similar player names are grouped into reference groups. All variants of a name receive the same reference_id, which enormously speeds up later queries. Instead of searching for "Carlsen, Magnus OR Carlsen,Magnus OR ...", a single ID suffices.
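The grouping idea can be sketched like this (grouping by surname is a deliberate oversimplification; the real script groups by the similarity matching described above):

```python
from collections import defaultdict

def assign_reference_ids(names, key=lambda n: n.split(",")[0].strip().lower()):
    """Give all variants of a name the same reference_id."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    ref_id = {}
    for i, (_, variants) in enumerate(sorted(groups.items()), start=1):
        for v in variants:
            ref_id[v] = i                 # one ID per group of name variants
    return ref_id
```

A query then filters on the single integer `reference_id` instead of OR-ing together every spelling variant.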
Materialized View for Duplicate Candidates
Before the actual deduplication, a Materialized View is created in PostgreSQL. This pre-computation identifies all game pairs with identical hashes and stores them for fast access. The deduplication can thus work directly on relevant candidates instead of having to search through all millions of games.
The Deduplication Phases
Phase 1: Exact Duplicate Detection
The first step identifies games with exactly identical move sequences. For each game, a unique hash value (fingerprint) is calculated from the normalized move sequence. Games with identical hashes are grouped as potential duplicates.
Phase 2: Subsumption Detection
Not all duplicates have exactly the same move sequence. Often a game was interrupted, or the notation in one source ends earlier than in another. Subsumption detection finds such cases:
- Game A contains moves 1-40
- Game B contains moves 1-35 (identical to A)
- → Game B is a subsumption of Game A
The shorter fragment is marked as a duplicate of the more complete game, while the complete version is preserved as the master.
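The prefix test at the heart of this phase is simple; a minimal sketch (with hypothetical names, working on lists of SAN moves):

```python
def is_subsumed(short_moves, long_moves):
    """True if the shorter game is an exact prefix of the longer one."""
    return (len(short_moves) < len(long_moves)
            and long_moves[:len(short_moves)] == short_moves)

def mark_subsumptions(games):
    """Given {game_id: move list}, map each fragment to its master game."""
    dup_of = {}
    # longest games first, so masters are considered before their fragments
    items = sorted(games.items(), key=lambda kv: len(kv[1]), reverse=True)
    for i, (gid, moves) in enumerate(items):
        for mid, master in items[:i]:
            if is_subsumed(moves, master):
                dup_of[gid] = mid
                break
    return dup_of
```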
Phase 3: Join-Lines Detection
Sometimes two games complement each other: one source has the opening in detail, another has the endgame. The Join-Lines phase detects such cases and can intelligently merge the move sequences to reconstruct the most complete version.
Phase 4: Intelligent Header Merging
Different sources provide different quality metadata. One source has the correct FIDE IDs of the players, another has the exact tournament date, yet another has the ELO ratings at the time of the game.
The script uses a score-based merge system:
- Each header value (player name, date, ELO, etc.) receives a quality score
- FIDE-verified data receives higher scores
- For each field, the value with the highest score is adopted
- The result is a "best-of-all" version of the game
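The score-based selection reduces to a per-field maximum; a minimal sketch, assuming each source's headers arrive as `field -> (value, score)` pairs:

```python
def merge_headers(versions):
    """Pick, per field, the value with the highest quality score.

    `versions` is a list of dicts mapping field -> (value, score);
    e.g. a FIDE-verified name would carry a higher score than a raw one.
    """
    best = {}
    for headers in versions:
        for field, (value, score) in headers.items():
            if field not in best or score > best[field][1]:
                best[field] = (value, score)
    return {field: value for field, (value, _) in best.items()}
```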
Phase 5: Variant Merge
A special feature of the algorithm: when duplicates have differing move sequences (e.g., due to different transcriptions or alternative moves in analysis databases), these are not discarded but embedded as PGN variations in the master game.
Example: If the main source records 10. Nf3, but another source has 10. Ng5, this appears in the result as:
10. Nf3 (10. Ng5) 10... Be7
This way, no information is lost, and players can trace the different sources.
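A simplified sketch of how such a variation could be embedded (it handles only the first divergence and takes SAN move lists; the real merge is more general):

```python
def merge_variant(master, variant):
    """Embed the first diverging move of `variant` as a PGN variation in `master`."""
    tokens = []
    diverged = renumber = False
    for ply, move in enumerate(master):
        num = ply // 2 + 1
        white = ply % 2 == 0
        if white:
            tokens.append(f"{num}. {move}")
        elif renumber:                      # black's reply after a variation: '10... Be7'
            tokens.append(f"{num}... {move}")
            renumber = False
        else:
            tokens.append(move)
        if not diverged and ply < len(variant) and variant[ply] != move:
            alt = (f"{num}. " if white else f"{num}... ") + variant[ply]
            tokens.append(f"({alt})")       # alternative move as a PGN variation
            diverged = True
            renumber = white
    return " ".join(tokens)
```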
Result
After processing, the database contains:
- Master Games: The most complete, highest-quality version of each unique game
- Duplicate markers when lines were merged (a tag Merged = True is added)
- Complete Provenance: All sources are traceably documented
Hello Michael,
thank you very much for introducing my database, I really appreciate it! I have just uploaded the latest version with TWIC 1528.
In the near future I will also extract opening books from the database, in a similar way to the PGN files, and make them available on the website. To do this, however, I will have to change the menus again.
Best regards,
Michael/Lumbra74