Let's create a free, open, chess database

telefonica

I'm very perplexed by the chess community's insistence on using .pgn files to store game information. These files are not database friendly and cannot be easily searched or analyzed.

I propose that we, as a community of chess players, build a free, open, and online chess database. We can warehouse all of the important historical chess games in our database. We should use a format that's both ubiquitous and easy to use (I suggest SQL).

This database would serve both as a historical account of the chess world and as a tool for research. As a historical account, it would catalog information about the masters of the game and the games they played. As a research tool, it should be easy to analyze. A SQL database allows far more detailed queries than a collection of .pgn files and, as such, allows much more detailed analyses.

To this end, I've created a SQL database of 1.74 million chess games. The database is in its raw form now, and needs to be organized. I can't take credit for the data within the database, as I started with 'Rebel's' and Ed Schroeder's .pgn database: http://www.top-5000.nl/pgn.htm
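
For anyone who wants to experiment, here is a minimal sketch of how .pgn tag pairs could be loaded into a SQL table. It uses only Python's standard library (re and sqlite3). The games table, its columns, the function names and the two sample games are assumptions made up for illustration; they are not the layout of the database on the blog.

```python
import re
import sqlite3

# Matches one PGN tag pair such as: [White "Player A"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def parse_pgn(pgn_text):
    """Yield one dict per game: the tag pairs plus the raw movetext under "Moves".

    Simplification: a new game is detected when a tag line follows movetext,
    so a game with no moves at all would be merged into the next one.
    """
    headers, moves = {}, []
    for raw in pgn_text.splitlines():
        line = raw.strip()
        match = TAG_RE.match(line)
        if match:
            if moves:  # tag line after movetext: the previous game is complete
                headers["Moves"] = " ".join(moves)
                yield headers
                headers, moves = {}, []
            headers[match.group(1)] = match.group(2)
        elif line:
            moves.append(line)
    if headers:
        headers["Moves"] = " ".join(moves)
        yield headers


def load_games(conn, pgn_text):
    """Create the (hypothetical) games table if needed and insert every game."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS games (
               id INTEGER PRIMARY KEY,
               white TEXT, black TEXT, result TEXT,
               date TEXT, eco TEXT, moves TEXT)"""
    )
    rows = [
        (g.get("White"), g.get("Black"), g.get("Result"),
         g.get("Date"), g.get("ECO"), g.get("Moves", ""))
        for g in parse_pgn(pgn_text)
    ]
    conn.executemany(
        "INSERT INTO games (white, black, result, date, eco, moves) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


# Made-up sample data, only here so the sketch can be run as-is.
SAMPLE_PGN = """[Event "Example"]
[White "Player A"]
[Black "Player B"]
[Result "1-0"]
[ECO "C20"]

1. e4 e5 2. Qh5 Nc6 1-0

[Event "Example"]
[White "Player B"]
[Black "Player A"]
[Result "1/2-1/2"]
[ECO "D00"]

1. d4 d5 1/2-1/2
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_games(conn, SAMPLE_PGN), "games loaded")
```

Storing the movetext as one text column keeps the sketch short, but it does not make positions searchable; that would need a separate table of positions or moves.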

Finally, I've created a blog to track the progress and act as a landing page for the community. From this blog, you can download my SQL database. http://chessdata.wordpress.com/

I need your help! Right now, I'm at the very early stages of building this and need brainstorming ideas. Let's get a firm grasp of what we want and how we're going to do it. I've laid out my plan on my blog; feel free to comment. Next, I need help gathering and cleaning data. We have 1.74 million games already, but the data isn't pretty by any means. I also need help designing a database (I need people with database experience and "chess people" who know what kind of information we should be storing and where we can get it). Finally, I'd like to put this somewhere so everyone can access it (anyone know how to build a webpage?).

People of the chess world, UNITE! :)

This is a cross post (by me) on reddit.com/r/chess

http://www.reddit.com/r/chess/comments/1559f8/lets_create_a_free_open_chess_database/

ictavera

It already exists:

http://scid.sourceforge.net/

Scid uses its own database format, which allows for quickly searching positions, material imbalances, pawn structures, etc.

telefonica

Hi temp, thanks for the input. I only recently became aware of SCID. I haven't fully come to understand the power of SCID, but I think a SQL database should still offer a few advantages. For example, a query in SQL is fully customizable and should allow for very detailed analyses. I'm not sure that I can fully customize my queries in SCID?

ollave

SQL is a horribly inefficient way to store chess games. Sorry. I spent some time researching this recently; there are definitely multiple Ph.D. topics available for people who want to improve the state of the art.

As I'm not looking for a Ph.D. topic, I kinda gave up on my project.

I wish you luck, but I'll be very, very surprised if SQL turns out to be your answer. SQL queries backed by some other (very custom) database engine, possibly. But as I said, Ph.D. material.

All IMHO, naturally.

rooperi
telefonica wrote:

Hi temp, thanks for the input. I only recently became aware of SCID. I haven't fully come to understand the power of SCID, but I think a SQL database should still offer a few advantages. For example, a query in SQL is fully customizable and should allow for very detailed analyses. I'm not sure that I can fully customize my queries in SCID?

You can pretty much ask SCID anything. The quality of the result depends on the quality of the PGN (which should ideally include complete information on the game).

I'm sure other databases also have fully customized queries, once the PGN has been converted to their format.

Give an example of a query you think SCID won't do?

ChessFanNM

I realize this is an old thread, but ...

Give an example of a query you think SCID won't do?

The issue is not only functionality, but also the accessibility of the database from tools other than SCID. There is a huge number of tools that know how to work with a SQL database and can be used for querying, analyzing and updating it. SCID has its own scripting language, with a much higher barrier to entry for someone who just wants to play with the data (without necessarily even looking into the contents of the chess games).
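
As a rough illustration of what "playing with the data" could look like, here is a sketch that asks for White's score percentage per opening code using plain SQL through Python's built-in sqlite3 module. The games table with eco and result columns is an assumption, and the four rows are made-up demo data; the same SELECT statement could be pasted into any other SQL client.

```python
import sqlite3

# White's score per ECO code: a win counts 1, a draw 0.5, a loss 0.
# Unfinished games (result '*') are ignored.
SCORE_BY_ECO = """
SELECT eco,
       COUNT(*) AS games,
       ROUND(100.0 * SUM(CASE result
                           WHEN '1-0' THEN 1.0
                           WHEN '1/2-1/2' THEN 0.5
                           ELSE 0.0
                         END) / COUNT(*), 1) AS white_score_pct
FROM games
WHERE result IN ('1-0', '0-1', '1/2-1/2')
GROUP BY eco
ORDER BY games DESC, eco
"""


def white_score_by_eco(conn):
    """Return a list of (eco, games, white_score_pct) rows."""
    return conn.execute(SCORE_BY_ECO).fetchall()


def demo_connection():
    """Build a tiny in-memory table with made-up rows for trying the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE games (eco TEXT, result TEXT)")
    conn.executemany(
        "INSERT INTO games (eco, result) VALUES (?, ?)",
        [("C20", "1-0"), ("C20", "0-1"), ("C20", "1/2-1/2"),
         ("D00", "1-0"), ("D00", "*")],
    )
    return conn


if __name__ == "__main__":
    for row in white_score_by_eco(demo_connection()):
        print(row)
```

Nothing in this query looks at the moves themselves, which is the point: questions about players, results, dates or openings need no chess-specific tooling at all.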

telefonica

When I started this thread, I wasn't aware that SCID existed. Unfortunately, other things have come up and I haven't done much research into SCID's capabilities.

Some examples might be:

  1. Board position queries, e.g. statistical analysis comparing d5 and e5 pawn-chain structures.
  2. Determining the relative value of chess pieces at various game stages. Queries could determine which pieces each player has at each of the three game stages (opening, middlegame and endgame) and how each game then ended; that data could be used to statistically derive relative piece values.

Just a few ideas off the top of my head that could be interesting to investigate but would require a fairly detailed database / advanced querying engine.
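
To make the second idea a little more concrete, here is a rough sketch of the aggregation step. It assumes each game has already been replayed and sampled at some chosen stage, giving a list of (FEN position, result) pairs; producing those positions from the moves is the hard part and is not shown. The function names are made up for illustration.

```python
from collections import Counter, defaultdict

# White's score for each finished result; anything else (e.g. '*') is skipped.
RESULT_SCORE = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}


def piece_counts(fen):
    """Count pieces using the piece-placement (first) field of a FEN string."""
    placement = fen.split()[0]
    return Counter(ch for ch in placement if ch.isalpha())


def imbalance(fen):
    """Return White-minus-Black counts per piece type, omitting equal counts.

    Uppercase letters are White pieces and lowercase are Black in FEN.
    Example: {'N': 1, 'B': -1} means White has an extra knight and
    Black has an extra bishop.
    """
    counts = piece_counts(fen)
    diff = {}
    for piece in "PNBRQ":
        delta = counts[piece] - counts[piece.lower()]
        if delta:
            diff[piece] = delta
    return diff


def score_by_imbalance(samples):
    """Group (fen, result) pairs by material imbalance.

    Returns {imbalance_key: (games, average White score)}, where the key is
    a sorted tuple of (piece, difference) pairs and () means equal material.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for fen, result in samples:
        if result not in RESULT_SCORE:
            continue
        key = tuple(sorted(imbalance(fen).items()))
        totals[key][0] += 1
        totals[key][1] += RESULT_SCORE[result]
    return {key: (n, total / n) for key, (n, total) in totals.items()}
```

With enough games, comparing the average score for a key such as (('B', -1), ('N', 1)) against equal material would hint at how knights and bishops compare at that stage, though a real study would also have to control for things like player rating.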