Could you please give me some info about these databases?

Raketonosets

Dear friends,

I have been roaming the Internet in search of good public-domain databases of chess games in PGN format, and I have found several large ones, such as these:

Chess Analysis Project's openings database (several million games, pgn-files sorted by ECO-code A00-E99. Regularly updated)

ICOFY database (over 4'000'000 games in PGN format. Regularly updated)

Million Base 1.74 (just under 2'000'000 games in PGN)

Pittsburgh U database of openings and players (I read somewhere that it has many mistakes)

University of Alabama "enormous" database (project of the Department of Computer Science)

Walter Eigenmann's database (computer chess games)

PGN Mentor database (public database of proprietary program)

I am sure I am missing other ones too.

My question is - what is the quality of these databases in terms of reliability?  Then, surely the biggest ones mostly overlap, no?  I mean - if one has 4*10^6 games and the other has 3*10^6 games, surely most of those are the same games, in particular if they are recorded from the same sources and if an effort is being made to collect only high-quality (players over 2000) games?
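One rough way to check the overlap is to reduce every game to a fingerprint of its bare moves and count how many fingerprints two collections share. The sketch below is only an illustration: the function names are made up for this example, the parser assumes tag lines of the form [Name "value"] and movetext without nested variations, and two different games with identical moves (short draws, for instance) would count as one.

```python
import re

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}

def split_games(pgn_text):
    """Yield (tags, movetext) pairs from PGN text (simplified parser)."""
    tags, moves = {}, []
    for line in pgn_text.splitlines():
        line = line.strip()
        match = TAG_RE.match(line)
        if match:
            if moves:  # a tag line after movetext starts a new game
                yield tags, " ".join(moves)
                tags, moves = {}, []
            tags[match.group(1)] = match.group(2)
        elif line:
            moves.append(line)
    if tags or moves:
        yield tags, " ".join(moves)

def fingerprint(movetext):
    """Reduce movetext to bare SAN moves so formatting differences vanish."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop {comments}
    text = re.sub(r"\$\d+", " ", text)          # drop numeric annotations
    text = re.sub(r"\d+\.+", " ", text)         # drop move numbers
    tokens = [t.rstrip("+#!?") for t in text.split()]
    return " ".join(t for t in tokens if t and t not in RESULTS)

def overlap(pgn_a, pgn_b):
    """Return (shared, distinct_in_a, distinct_in_b) fingerprint counts."""
    prints_a = {fingerprint(m) for _, m in split_games(pgn_a)}
    prints_b = {fingerprint(m) for _, m in split_games(pgn_b)}
    return len(prints_a & prints_b), len(prints_a), len(prints_b)
```

Because the fingerprint ignores the tags entirely, the same game entered as "Kasparov, G" in one base and "Kasparov, Garry" in another still counts as shared.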

And one more, probably naive, question: would it not be possible to combine all these databases into one, remove all the games played by players ranked, say, below 2200 and create a "central" high quality only database of chess games in the public domain?

Thanks for any pointers for the newbie that I am!

RN

chessoholicalien

The ICOFY one causes faults in CB10 when I try to remove doubles from a merge of it and Mega2009.

aansel

One question is what your goal is in combining such files. Many of these games do not have ratings (and those before 1970 almost never do), so that sort criterion would not work.
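To make the missing-ratings problem concrete, here is a minimal sketch of a rating filter over raw PGN text. It assumes each game starts with an [Event ...] tag and that ratings live in the standard WhiteElo and BlackElo tags; the function names and the keep_unrated switch are made up for this illustration. A game with a missing or non-numeric rating is treated as unrated, and the caller has to decide whether such games stay or go.

```python
import re

# Matches only numeric ratings; empty or "?" values count as missing.
ELO_RE = re.compile(r'^\[(WhiteElo|BlackElo)\s+"(\d+)"\]', re.MULTILINE)

def split_raw_games(pgn_text):
    """Split PGN text into raw game strings (assumes [Event ...] starts each game)."""
    parts = re.split(r'(?m)^(?=\[Event )', pgn_text)
    return [p.strip() for p in parts if p.strip()]

def filter_games(pgn_text, threshold=2200, keep_unrated=False):
    """Keep games where both players are rated at or above the threshold."""
    kept = []
    for game in split_raw_games(pgn_text):
        ratings = dict(ELO_RE.findall(game))
        if len(ratings) < 2:
            # At least one rating is missing: apply the caller's policy.
            if keep_unrated:
                kept.append(game)
            continue
        if min(int(r) for r in ratings.values()) >= threshold:
            kept.append(game)
    return "\n\n".join(kept) + ("\n" if kept else "")
```

With keep_unrated=False, every pre-1970 game without rating tags is dropped, which is exactly the side effect described above.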

Also, when you say errors, do you mean move transpositions, wrong results, wrong players? Even the best databases have plenty of errors, but they try to correct them.

Raketonosets

@chessoholicalien: I run the full ICOFY with SCID4.0 under Ubuntu GNU/Linux with no problems whatsoever.  I compacted it, filtered it, added ECO codes and ratings - no prob.  Could it be that the CB10 database format has a hard time importing the memory-hungry PGN format of the original base?

@aansel: you raise exactly the kind of issues which I would want to have fixed.  I might be missing a point here, but surely it would be preferable to purge the mega-databases of all the less-than-useful games (such as those without an ECO code or with players rated below 2200) and come up with a smaller, but high-quality reference database, no?

As for what type of errors the U. of Pitt databases have I am not sure - I only read on some chess discussion groups that these databases were full of them.  I cannot say either way.

As a newbie in this entire computer chess and databases business, I am just a little surprised by a) the number of databases out there and b) the fact that a lot of them seem to be aimed at quantity rather than quality.  My "dream database" would have only games with players rated 2200+, each game listed by ECO, and each ECO opening represented by several hundred games.  But maybe I am missing a lot of issues here, I am a noob after all...
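Whether a given collection comes anywhere near "several hundred games per ECO code" can be checked by counting the ECO tags. The sketch below is an illustration only: the function names are invented, the minimum of 300 is an arbitrary stand-in for "several hundred", and it assumes the codes sit in a standard [ECO "..."] tag, with any extended suffix after the three-character code ignored.

```python
import re
from collections import Counter

# Captures the three-character code; a trailing suffix such as "B90a" is ignored.
ECO_RE = re.compile(r'^\[ECO\s+"([A-E]\d\d)[^"]*"\]', re.MULTILINE)

def eco_counts(pgn_text):
    """Count games per ECO code found in the PGN tag sections."""
    return Counter(ECO_RE.findall(pgn_text))

def underrepresented(counts, minimum=300):
    """List ECO codes A00-E99 that have fewer than `minimum` games."""
    all_codes = [f"{letter}{n:02d}" for letter in "ABCDE" for n in range(100)]
    return [code for code in all_codes if counts[code] < minimum]
```

Running underrepresented() on the counts of a filtered 2200+ base would show which openings end up too thin once the low-rated and unrated games are gone.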

rigamagician

Icofy has not been updated since April 2008, and I suspect Chess Analysis Project has not been updated for years.  Some of the games in the UPitt archives have the game results, move orders or player names wrong, but this is a problem with all archives.  Jose and ChessDB also provide large free archives in their respective formats.  Free database archives usually don't have that many ratings.  Chessbase Light Premium has a feature to add in ECO codes.