How to store chess openings in a relational database?

Sort:
Sqod
game_designer wrote:

Then jump ahead in the future and you have the database and an interface like on this website using Menu | Learn | Analysis

 

Good stuff, game_designer. I'm busy with another project today (I thought this thread had been forgotten forever) so today I don't have time to look into all the claims made in the different posts. Two quick comments, though: (1) I'd never used the Analysis tool on this site until you mentioned it today. I posted a screen snap of what it gave me for the Hillbilly Attack of the Caro-Kann Defense as an example in case other users have never seen that interface, either:

 

null

 

(2) It looks like your facts table is trying to use all the info from an FEN string, so wouldn't you want the other two values from that string...? (https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)

()
Halfmove clock: This is the number of halfmoves since the last capture or pawn advance.
This is used to determine if a draw can be claimed under the fifty-move rule.

()
Fullmove number: The number of the full move. It starts at 1,
and is incremented after Black's move.

 

game_designer

Cool man.

 

First thing is now you have the "core engine" for the database.

 

I mean the core logic that drives the database, everything else is just built around this.

 

But now you need data so lets start with FEN strings, as you pointed out, Ply counts, so you ask yourself do I need to store this in the database or can it be calculated easily. In the moves table I have move_number, if it can be calculated from that no need to store it, otherwise just add two columns to the moves table and store that data.

 

Next grab a bunch of PGN files, I would say 8, and get the PGN specification itself.

 

Games need to be a mix, with comments, with no comments, draw, white win, black win, stalemate, checkmate, 3 fold rep, 50 move, insufficient material, blah, blah.

 

You then must manually create the data from the PGN files, look at the data using a text editor.

 

A very laboureless (sic) exercise but essential, you will shape the rest of the database doing this and will learn how the data works, you can't write code before you do it manually first.

 

For example, the games table will be created from the info in the PGN files, player names, tournament, location, date, round, time controls, result, blah, blah.

 

Bit Boards

 

I will use an empty board with a white king on A1 and a black king on H8 as an example.

 

Use A1 as the MSB (Most Significant Bit) and H8 as the LSB (Least Significant Bit)

 

In the facts table you will only ever have 2 rows to represent this position, one for white to move, one for black to move.

 

For the WK (White King) field in the table you will have a 1 bit followed by 63 zero bits, the actual number is not important, you are more interested in the bit flags (1 or 0)

 

For the BK (Black King) field, 63 zero bits followed by a 1 bit, the actual number in this case is 1.

 

Then player to move, zero for white, one for black or the other way around if you want.

 

For all other fields, zeros.

 

Once you have the database shaped, and it has the data you created manually, somebody writes code, you then force feed 100 PGN files into the database, check it out, tweak a few problems.

 

Then do 1,000 next, then 10,000, then ...

 

Then do SAS for all the fancy stuff.

 

Everything works?

 

Hit the button, start loading everything.

 

What will your key be on the games table, to identify every single game?

camter

I am still trying to get my head around the idea of "form".

So, I suppsed "normal" form is the first concept to grasp.

Does nature work this way?

Or, is this Virtual Nature?

Could not do it on a Magnetic Tape based sysrem, I would imagine.

But, it must get rid of a lot of sorting.

Do you have to have chains of links, which tell you where the next links are?

Are there such things as nested links.

But i do know that if my dog is registered, you will be able to find out the criminal record of my Bank Manager's Aunty, and her maiden name. 

From there you will be able tofind what she bought last week at the supermarket, and the sale price of the place she lived at paid by the owner before her, which is very useful to Real Estate salespersons and the Federal Reserve.

And, if you run out of space, you can whack it into the cloud for safekeeping, suitably encoded.

Those hackers must be very, very good to do all the stuff they do.

mgx9600

Hey OP, what is it that you want to do?

I see that you want to find open name(s) based on a board position?  What else?

 

If I understand your requirements correctly, it really isn't too big a program to write.  No need for RDBMS (in fact, RDBMS will likely get in the way because I feel it's going to be not mainly SQL queries but a lot of procedural programming).  If you can sell me on the idea, I can write the program (baesd on what I  think I know about this program, it'll probably take about a day, in which time I can play ~10 chess games, i.e. it'll have to be more useful/fun than 10 chess games).

 

game_designer

You stay in the swamp and shoot crocodiles until you run out of bullets, no escape now...

 

I will find the high ground in the swamp and then blast my way out after planning my escape, only killing the crocodiles that stand in my way.

 

I am outside the swamp now, I am now planning how to drain the swamp, I will kill all the crocodiles in one go without firing a single shot.

 

How many bullets you got left?

 

Good luck.

camter

The crocodiles have no reason to smile now! 

Sqod
Williamfwm wrote:

Things like fianchetto detection need to be hand-coded rules you implement as you need them. You can store this information alongside as you go. This shouldn't be solved at the database level. Trying to make a do-everything Uber Schema to accommodate everything you could ever want is a serious design mistake.

 

I'm not clear on your meaning. We could be talking about pure database queries, or a database queried by a regular computer program, where the program does further processing or pattern matching on the extracted data. So what do you mean by (hard-coded) "rules" and (store this information) "alongside"--are you referring to a program controlling the database?

At any rate, this sounds like the usual time-space tradeoff of computers: deciding the balance of how much should be stored to save time, versus how much should be found by search to save space.

One of the things I do all the time with my text file repertoire is search it for strings that I defined as flags for certain patterns. I've found this to be extremely useful, and this is partly why I can often respond to posts so quickly with a number of specific examples (*if* I have that pattern already described by a text string, with that text string marking most or all of the occurrences of that pattern). At the very least I would expect a good move database to do the same thing. What this means for database implementation is that there would need to be a field for every pattern of interest to me, at least for the most *common* patterns on which I search. You're right in that this could involve a lot of fields, so that might require a judgment call on which are worth including as hardcoded fields.

Some examples of patterns on which I often search are:

o pawn patterns, or piece placement combinations, to *help* in detecting transpositions, for example the 4-knights pattern where both players have both of their knights at KB3

o biff (B-N5 ...P-SR3) continuations: did the pinning bishop retreat, or capture, or back up along the line of the pin? Then, if the bishop retreated along the pin line, was it further chased off with P-SN4? Then, if the bishop retreated via B-SR4, was it further chased off with the opponent's knight (as in the Ruy Lopez)?

o "queen slaps" (QxQ+ ...KxQ), whether actually done or avoided via a different move

o "monarch door sequences" (B-B4 then P-SM3 then OB developed)

o a bishop placed directly in front on one's own central pawns in the opening

This raises some database/program interface questions to which I've never heard an opinion, so if anybody has an opinion or knows of a convention, please let me know:

If such features have been included as fields in a database, when is a good time to populate those fields, and how? For example, in a spreadsheet those fields would be populated automatically and instantly as soon as their definitions were inserted, but if the field requires a lot of calculation, especially using search on all moves in the game, then immediate insertion would not be wise or even possible. Should the program populate those fields periodically? Can or should the database automatically update computed fields if the needed queries are very complicated? Can and should subqueries be used to find such patterns, and is there a way to store common subqueries?

 

camter

Good luck with all that, fellas!

game_designer

Chess engines and bit parallel operations.

 

Modern chess engines use bit boards as the core component to drive the program.

 

Bit operations are very fast and bit boards can be used to perform bit parallel operations.

 

An example, give me all positions in the database where white has a passed pawn, or to be more specific a passed pawn on the A file, only kings on the board and the black king is out side the square of the pawn and can not stop it from queening.

 

With bit boards only a few calculations are required and basically any pattern on the board can be queried.

 

Note that the positions (facts) table in the database is structured using the exact same logic that chess engines use with bit boards.

 

This means that the database in it's core logic is already aligned with the core logic used in chess engines.

 

Note that I now call that fact table positions but it is actually game state because it also includes player to move, permanent castling rights and en passant capture square, this is also what chess engines do, they maintain state in global variables for all the values shown in that table.

 

Adding extra fields in a database for specific patterns or conditions is a serious database design flaw.

 

All the information that you need is already in the positions table, you then use SQL in conjunction with SQL bit functions to calculate whatever.

game_designer

@camter

 

You have contributed nothing constructive to this topic.

 

Please stop spamming on this topic, otherwise I will ask the OP to simply block you.

camter

Noted. Anything I say about it will be construed as spam, I suspect.

We inhabit a strange new world, indeed. 

I will read what is said in this discussion in silence.

And a certain foreboding.

Sqod
game_designer wrote:

Chess engines and bit parallel operations.

...

Note that the positions (facts) table in the database is structured using the exact same logic that chess engines use with bit boards.

...

This means that the database in it's core logic is already aligned with the core logic used in chess engines.

...
All the information that you need is already in the positions table, you then use SQL in conjunction with SQL bit functions to calculate whatever.

 

Aha. I didn't know SQL had bit functions. (I've never needed or used them for the commercial work I've done, but then that wasn't chess-oriented!)

OK, before getting abstract again let's discuss bitboards with some concrete examples applied to an example I gave earlier: biff continuations. This is an actual study I did this earlier this year by hand because I truly didn't know what White should do when biffed on Black's kingside in one opening I was studying. This is an example of a task that would be great to automate on computer. It just involves the computer reporting of statistics based on a given temporal pattern of interest.

(For those unfamiliar with bitboards, here's a good intro: http://webpages.charter.net/tlikens/tech/bitmaps/bit_intro.html)

 

For detection of a *static* position, here's a situation we can detect by bitboards combined in the following ways using all boolean variables, with a few similar sections of code removed...

 

Ruy Lopez, 3...a6

 

r.bqkbnr
.ppp.ppp
p.n.....
.B..p...
....P...
.....N..
PPPP.PPP
RNBQK..R

 

<Location of White's KB>:

........
........
........
.1......
........
........
........
........

 

<Location of Black's a-pawn>:

........
........
1.......
........
........
........
........
........

 

<Location of White's KB> OR <Location of Black's a-pawn>:

........
........
1.......
.1......
........
........
........
........

 

<Template of biff on Black's queenside>:

........
........
1.......
.1......
........
........
........
........

 

Note that in this case:
(<Template of biff on Black's queenside> == (<Location of White's KB> OR <Location of Black's a-pawn>)) = TRUE

 

Therefore by merely doing an equality comparison of the two final bitboards we know that White is being biffed on Black's queenside because the answer we got back was a boolean TRUE. Bit operations are supposed to be extremely fast on computers, so I'm assuming such comparison operators (here, the test of equality "==") are, too.

 

biff_on_blacks_qs = <Template of biff on Black's queenside> == (<Location of White's KB> OR <Location of Black's a-pawn>)
biff_on_blacks_ks = <Template of biff on Black's kingside> == (<Location of White's QB> OR <Location of Black's h-pawn>)
biff_on_whites_qs = <Template of biff on White's queenside> == (<Location of Black's KB> OR <Location of White's a-pawn>)
biff_on_whites_ks = <Template of biff on White's kingside> AND (<Location of Black's QB> OR <Location of White's h-pawn>)

 

So now we just "OR" all cases for which we checked above, and if the result is a boolean TRUE, then a biff exists:

 

biff_exists = biff_on_blacks_qs OR biff_on_blacks_ks OR biff_on_whites_qs OR biff_on_whites_ks

 

Now if we want to look for a *temporal* pattern--one position that appears after another in time--the following is one way to do it, generalized here so that the two positions are not required to appear *immediately* following each other...

 

Ruy Lopez, 3...a6 4. Ba4

 

r.bqkbnr
.ppp.ppp
p.n.....
....p...
B...P...
.....N..
PPPP.PPP
RNBQK..R

 

<Template of backed-off biff on Black's queenside>:

........
........
1.......
........
1.......
........
........
........

 

etc.

 

backed-off_biff_exists = backed-off_biff_on_blacks_qs OR backed-off_biff_on_blacks_ks OR backed-off_biff_on_whites_qs OR backed-off_biff_on_whites_ks

for each ply i of the current game:
{
if biff_exists then biff_ply_number = i
if backed-off_biff_exists then backed-off_biff_ply_number = i
}

biff_continuation_was_maintained_pin = (biff_ply_number < backed-off_biff_ply_number)

 

The logic is slightly faulty, and the bitboard could be rotated to save some code, but that gives an idea of the general structures and checks needed. All that just to detect a biff where the pinning bishop backed off and maintained the pin, in any of the standard positions on the board!

 

That's more detail than I wanted to get into, but I did want to take a look since I'd never seen bitboards used for detecting temporal patterns before. More in a little while...

 

As for the Camter Gallery, I already considered blocking him but at least he's not hostile, and he gives me the illusion that we're doing something really important. happy.png

 

Sqod
mgx9600 wrote:

Hey OP, what is it that you want to do?

I see that you want to find open name(s) based on a board position?  What else?

 

My motivation for wanting to write a small move database this year was mostly to detect transpositions as well as to be able to continue the same type of searches I do in my flat text file (which I find very useful), and to continue my practice of giving names to opening positions whenever I need a convenient label. I've run into several embarrassing situations when I posted lines and claims based on 365chess' database, which to my astonishment doesn't seem to detect transpositions, most notably in the Grand Prix Sicilian. But I have a few other complaints about that database that I'd like to fix by writing my own database, if it's even possible to do what I want in a relational database.

 

Here is my prioritized wish list:

 

(1) Add a few extra capabilities that a flat text file cannot provide, but in addition to the capabilities of a flat text file (namely string search, compete freedom to add comments and opening names), and that 365chess' database does not provide (to my knowledge), especially:

(a) Automatic transposition detection.

(b) Automated inclusion of variances along with averages, which is extremely time-consuming to calculate by hand (I can expand on this, if you want).

(c) Frequency percentage for each move (not just a count or ranking of each move), since I like this measure but it's awfully time-consuming to calculate by hand.

(d) Have the computer automatically insert my desired flagging text string at all positions that apply (e.g., flag every occurrence of B-N5 ...P-SR3 with the text string "$biff"). This is very time-consuming to do, especially for frequent patterns like that.

 

(2) Add some more very useful enhancements to an opening database for which I'm definitely going to be pushing one of these days:

(a) The placement statistics for each piece, at least for the opening (I'll have to think about whether this would be useful for the middlegame), since one of the most common (and fatal) mistakes in an opening is to misplace a piece or pawn (I can expand on this, if you want).

(b) The default move, averaged over all the opponent's responses. This might be done easily by running Stockfish, but can be done less accurately by summing database statistics, but this is extremely time-consuming to do by hand. (I can expand on this, if you want)

(c) The reason for the draw (e.g., agreement, grandmaster draw, repetition, perpetual check, pursuit, stalemate, insufficient mating material). This would be useful for finding specific games.

(d) Same as the previous, but the type of mate, if it has a name (e.g., back rank mate, smothered mate, Opera House Mate).

(e) A list of the transpositions that have been seen to lead to the current position.

(f) A list of the openings to which the current position can transpose, with their frequency. (e.g., Zukertort Opening can transpose to Queen's Gambit Declined or Four Knights Game, Scandinavian Defense can transpose to Advance French, Pirc Defense can transpose to Philidor's Defense)

 

(3) (These are pure luxury for now, but would be very impressive. Some would be very easy to do but happen to be not very important.) Add some deluxe statistics and automatic computations that I have not seen in any database. For each position I would like to see a huge amount of information, enough that the summary page for that position would be like looking at a Table Of Contents for an entire book on that opening/position. In roughly descending order of my interest:

(a) The pros and cons (in text format) for each move. 

(b) Ideas and plans behind each move, in textual format, both plans that are currently being followed, and plans that will be implied by making one of the listed moves. 

(c) Common sequences of moves (e.g., Bg5 ...h6 Bh4 ...g5 Bg3).

(d) Typical killer moves to watch out for (e.g., d5 in the Scandinavian Defense, or Nxb5 in the Najdorf Sicilian).

(e) The amount of space that each side has, also presented as a ratio of White's space to Black's space.

(f) Same as the previous, except a measure of mobility. 

(g) The school of chess that applies to this position (e.g., classical, hypermodern, romantic).

(h) Decisiveness measure.

(i) Average unit denudation rate. (I can expand on this, if you want)

(j) Ability to do very general queries like those I discussed. (Somebody told me there exists a Chess  Query Language, which might do what I want, but I'll have to look that up since I've never heard of it.) It would be interesting to be able to select only a range of ratings of the players of the games, too. (I suspect that the variance decreases as the rating increases.)

(k) Stockfish's recommended move included and flagged among every set of choices.

(l) The move number at which the losing player in the shown game was believed to have gone wrong.

(m) Automatic summary of the *counts* of pros versus cons for each move.

(n) A list or diagram of hanging pieces.

(o) A list or diagram of alignments that could be dangerous or exploited.

(p) Ability to detect "hubs," roughly what chessplayers call "tabiyas": very common and therefore important positions to learn. This would be done by counting the number of transpositions that give rise to a given position.

(q) Ability to pull up similar positions, either automatically or by query. This would be a research project in itself, for which combinations of metrics to use, since there would be many reasonable measures of similarity.

 

(4) Futuristic stuff, unrealistic for the near future.

(a) Computer self-learning of correlations of abstract features (e.g., that the BxKB2+ ...RxKB2 NxKB2 ...KxKB2 exchange tends to end in the KBN vs K endgame).

(b) Computer-generated abstract descriptions of general, temporal patterns the computer discovered for winning certain difficult endgames (e.g., KQ vs KR), in a form that is easily readable and understandable by humans.

(c) A search for discovery of unpublished (or undiscovered) heuristics for openings, based on frequent patterns (e.g., Bc4 is often played in response to ...f6 in any Double King's Pawn Game, "push" variations have poor statistical results, ...Ng4 is often played against Be3).

(d) A search for discovery of recurring but as-yet unnamed types of mates. (I've found and named a few of these, like the "Gai-Wan Mate" and "Corner Shelf Mate.")

(e) Statistics on which moves are fashionable, meaning recently played more than usual, so that trendiness can be assessed as well as historical counts of an opening.

 

As a very general statement, much of what I want from a database is a higher level of abstraction than merely the moves themselves, which are the most specific and least informative explanation about what is going on in the game as a whole. I'm using databases to learn to play better chess, which is difficult without such abstractions. Abstractions can be things like: ideas, plans, general pros and cons, or expectations about where units will move and what is most likely to unfold. With such enhancements we can not only throw away our opening books, but our strategy books as well, and merely study chess move databases!

Sqod

P.S.--

Some very interesting links somebody just sent me:

 

https://timkr.home.xs4all.nl/chess2/cql.htm

https://web.archive.org/web/20140130143815/http://www.rbnn.com/cql/

http://help.chessbase.com/Reader/12/Eng/index.html?000144.htm

 

It sounds like Chess Query Language (CQL) can't search on abstract features, which would be one of my main interests in a query, but it might be a step along the way to what I want. Also, ChessBase 12's "piece probability" function looks interesting, although it's about piece-on-square statistics, not the "next destination square" function that I would find most useful.

 

P.P.S.--Below is an excerpt from that great, unusual book I keep mentioning (on the Caro-Kann Defense) that has such piece placement info in it for each variation, along with pawn formation diagrams, and plans. Outstanding! (at least for my taste) This is exactly the kind of generalized information I would love to see included in a database. This excerpt is only for the Advance Variation, but every variation has the same type of general summary in that book, which makes each position quickly understandable.

 

(p. 4)

Typical pawn structure
Advance Variation

This pawn structure arises after
the moves 1 e4 c6 2 d4 d5 3 e5 and
later ... e7-e6--the Advance
Variation (Part Two)
By playing e4-e5 White fixes the
pawn center: now an exchange of
White's e-pawn for Black's d-pawn
is impossible. Whether this is better
for White or Black is a matter of
style. There are advantages and
disadvantages for both sides. White
obtains a long-term space ad-
vantage since he has a pawn in
Black's half of the board whereas
Black has no immediate prospect of
placing a pawn in White's half. The
pawn advance is rather committal,
however, and leaves White's d-
pawn immobile and thus a potential
target for Black to work against.

Generally, the most appropriate
development for each side in this
variation is as follows.
White King: Usually castles
Kingside only occasionally on the
Queenside.
White Queen: d3 or e2, oc-
casionally g4.
White Rooks: d1 and e1, or d1
and f1.
White King Bishop: d3, oc-
casionally e2.
White Queen Bishop: d2, e3, f4,
or g4. It is usually best to postpone
the development of this piece until
it is reasonably clear which square
is most suitable.
White King Knight: f3 or e2.
White Queen Knight: c3.
White pawns: Except for the d-
and c-pawns, none of the pawns
should be moved in the very early
stages of the game. The f-pawn is
often advanced later.
Black King: Castles Kingside.
Black Queen: b6 or a5, some-
times followed by moving to a6.
Black Rooks: One Rook nor-
mally goes to c8. This is sometimes
followed by ... Rc7 and then
playing the other Rook to c8. This
maneuver is called doubling Rooks.
Black King Bishop: e7.
Black Queen Bishop: f5.
Black King Knight: e7 and then
to f5 or g6. This Knight should be
developed before the King Bishop
to avoid a bottleneck on e7.
(p. 5)
Black Queen Knight: d7 or (after
... c6-c5) c6.
Black pawns: The c-pawn and d-
pawn having already moved, the
only other pawn moves that should
be considered in the early stages is
the e-pawn (to e6!) and the c-pawn
to c5. The moves ... h7-h5 and ...
g7-g6 may be good later to restrain
the advance of White's Kingside
pawns.
White's general strategy in the
Advance Variation is to attack on
the Kingside, which is usually
signalled by f2-f4-f5. Black's
overall strategy is to undermine
White's pawn center by preparing
and playing ... c6-c5, and then to
develop counterplay by making use
of the half-open c-file on which his
Rooks should be standing.

Keene, Raymond, Edmar Mednis, Jack Peters, Julio Kaplan, and Andy Soltis. 1980. Caro-Kann Defense. Great Neck, N.Y.: R.H.M. Press.

mgx9600

Sounds very complicated (ok, it's a very long post and I only skimmed it).  Now I feel bad for triggering such a lengthy post from you because I don't think I'm too interested ATM; just yesterday, I started a new chess project (it's kind of like the DGT e-board but cheaper and better : )

 

One advice I'd offer is that generally if you start doing bit operations as your primary thing, then RDBMS most likely isn't the way to proceed.  Part of the RDBMS power is flexible and able to answer questions based on relationships, and bit operations (and actually any type of parsing into the data types, removes that RDBMS advantage because the RDBMS cannot see the relationship (via SQL).

 

Anyway, I wish you success.

 

Sqod
mgx9600 wrote:

Sounds very complicated (ok, it's a very long post and I only skimmed it).  Now I feel bad for triggering such a lengthy post from you because I don't think I'm too interested ATM;

 

Yes, it's a good thing you didn't jump into such a project. Mostly, for my immediate needs, I want to use a computer to make those few types of calculations about any given position in the database, but each type of calculation I want is quite different from the others. The reason I listed all the more advanced desiderata is to show where I'm headed, since anything I would design would be aiming at being generalizable to the extent that it could be pushed in one of those more advanced directions without having to redesign the whole program. That's why I started by mentioning functions operating on trees: functions on trees is the concept that seems common to most those topics.

For example, to find the most common next square for a piece requires searching *forward* in the tree through many branches, then summarizing what was found only at the current location, whereas computing the frequency is just a matter of summing and dividing multiple times at only the *current* position in the tree. Listing transpositions that led to the current position requires looking *backward* in the tree and summarizing at the current location, which is yet another operation but still one that is based on summarizing statistics based on a certain type of tree search. Finding the most popular line (which I don't need to do since it's already obvious from the way move databases are shown online) is another example of a tree operation: just sort the tree by move popularity and choose the top branch at each step until you reach the end of the database. Partly I was hoping there was some type of software that already does such things--sorting trees, searching and summarizing trees in common ways, but it's sounding like there isn't such a thing.

Sqod
Williamfwm wrote:

But again, since we're talking about a learning tool and not an engine, you can get away with doing slower analysis on the fly, rather than pre-computing redundant data.

 

Yes, this project is definitely intended as a learning tool, which means I wouldn't care if it took hours or even days to run a query or calculation, since in many cases I'm looking for a piece of general knowledge that I might well remember for the rest of my life, like knowing to which square White's QB should be developed in a given opening, or maybe I'm just looking for a piece of insight that I will also always remember, or maybe I'm just looking for a table that could be posted or published that everyone would always find useful. That's why the discussion of bitboards, which are preferred for efficiency purposes, might be getting a little sidetracked.

 

game_designer

I have been thinking about this problem for quite sometime.

 

The main reason is that I have created a new chess game called Final Wars.

 

This new game is similar to Fischer Random but has new concepts called Groups and Reflection.

 

I have been designing this database, just in my head, nothing written down, then I saw your post...

 

I think that most chess databases are very basic designs, so like in the games table you will have a field called say game_data that is a string and basically just contains all the PGN files that have been uploaded into the database.

 

They then create a lot of isolated "analysis" tables that contain information that has been extracted from the PGN data in that field from above.

 

Say a table called "openings" that contains ECO codes with a child table on the many side of the one to many relationship for side lines or transpositions.

 

They basically have to write code to parse all the PGN files to populate all these extra tables when the database is created or updated with a fresh upload of games.

 

Not the best way to design a database, you are storing a lot of PGN string data that is not really structured and a lot of procedural code is required to parse this data.

 

Whenever you want to look at the data in a new way you have to create new tables to store the results and write new code to parse the PGN strings to populate those new tables.

 

With my database design you do not store the PGN data and it only gets parsed once to populate the various tables.

 

Note that the data is very structured, that is the important thing, this allows you to query the data in a standard way, I am referring to your examples from above.

 

It also has other benefits, you do not need a separate opening book or endgame book, it already exists in the database. 

 

Also chess engines could be hooked directly into the database when they are running because the engines and the database are built using the exact same core logic.

Sqod
game_designer wrote:

I personally however would prefer to use a bit board composite primary key to represent each position and I would reduce this to an auto number.

 

(I'm going to be busy again most of the day but I thought I'd add a few more thoughts.)

 

I agree that for a simple, preliminary schema it would be better to forget about hashing and just use some version of the bitboard or maybe FEN for each position, which will be the primary key in the positions table. There would be also be a table for complete games where the primary key doesn't carry any intrinsic interest. I agree that the positions and probably also the stored PGN would best be stored in a numerical format, which would require an outside program that converts textual PGN to a numerical version of the PGN before the database ever sees it.

 

I'm thinking that it is each position record that is going to be most interesting. That's where all those textual descriptions of any pros, cons, plans, measurements, or statuses I mentioned would be stored. I'm also thinking that there should be a field for the preceding ply's position and the following ply's position, so that each position can be followed backward or forward through the tree as needed. A tree-based database or a temporal database (I'd forgotten that those exist, too, which might be helpful in some ways: https://en.wikipedia.org/wiki/Temporal_database) would automatically store move order but likely won't have the built-in functions I'd need, so much of the query work in a relational database would involve traversing linked lists forward and backward.

 

Here are the things I still don't understand, though I still need need to study your proposed facts table more in order to fully understand what you proposed:

Where or how are the exotic functions I proposed going to be implemented? In my experience it's often very hard to get SQL to provide tree-based results. Can customized functions be put into a database system? (I've never heard of that being done.) Or should an outside computer program query the database and operate on the information the program extracts? If that's the solution, then most of the interest in going to be in that program, not in the database, which would then just serve as memory for the external program. Or should those bitboards that indicate piece locations and sought patterns be stored in a separate database table, so that queries would be specified in terms of boolean combinations of the bitboards and the stored game positions? I think that would work, but I've never seen such a thing done.

I thought about the bigger picture last night, and this is what I'm pretty sure is going to happen: (1) I develop a schema that will work, but relational databases won't be able to do the types of queries I want (unless maybe custom functions can be added). (2) I give up on relational databases, and search for another type of software that handles graphs and graph functions in a very versatile manner. (3) I learn that either no such software exists, or it is too expensive for me to afford. (4) I'm back to square one, unless I want to write my own versatile graph analysis program. I'm just being realistic.

I thought a lot about the nature of this problem. Directed graphs are an *extremely* versatile representation method that will allow almost any domain to be represented:

  • atoms (the nodes are atoms, the links are bonds, the data are bond angles & strengths)
  • chemistry (the nodes are molecules, the links are molecular bonds, the data are folding patterns & strengths)
  • communication (the nodes are communicating entities, the links are the communications, the data are times & durations)
  • images (the nodes are regions of the image, the links are the region relationships, the data are the directions & distances)
  • companies (the nodes are employees, the links are working relationships, the data are job titles & responsibilities)
  • transactions (the nodes are the purchasers & products, the links are purchase events, the data are the prices & dates)
  • math (the nodes are the variables, the links are the relationships, and the data are the values & domains)
  • transportation (the nodes are moving entities, the links are the paths, and the data are speeds & vehicle types)
  • physiology (the nodes are the body's organs, the links their interactions, and the data the signals & materials transferred between them)
  • ecosystems (the nodes are the entities in the ecosystem, the links their interactions, and the data the dependencies & transferred resources)
  • expert systems (the nodes are the states, the links are the implications, and the data the strength of the heuristics & the degree of belief)
  • neural networks (the nodes are the neurons, the links are the axons, and the data the activation functions & transferred signal strength)
  • events (the nodes are the events, the links are cause-and-effect, and the data the effect & degree of effect)
  • assembled devices (the nodes are the components, the links are their interactions, and the data their interactions & interaction strengths)
  • electronics (the nodes are the components, the links their connectivity, and the data the volts & amps)

Chess therefore becomes merely another instance of such a graph-based problem:

  • chess (the nodes are the positions, the links are the sequential move orders, the data are the popularities & plans)

 

In other words, at an abstract level almost every problem of interest in every domain can be stored and analyzed using directed graphs (https://en.wikipedia.org/wiki/Directed_graph) with suitable functions that operate on those graphs. A good graph analysis program would already have very versatile built-in functions and function customization ability, which would be ideal for this problem if such software exists or can be written. Later I'll add more of my thoughts and experiences of using such graph analytical software.

 

DiogenesDue

I just skimmed this, but some thoughts:

 

As someone else pointed out, fields for plys/chess moves is not the way to go.  A relational database really isn't the way to go, either.  Why add all the overhead of a MySQL driver and an instance running, etc. just to yank values out for calculations, or worse, use stored procs/triggers to handle fast bitwise operations or math, which they can definitely do, but aren't really designed for?

 

Don't use a FEN/PGN as your primary key, in fact, don't use anything that contains discrete data as your primary key.  Put the FEN in a column and index it if you want to.  Someday you will thank me when you avoid a complete overhaul of your schema wink.png.

 

The key decision really would be just what actually needs to be normalized and what tables you end up with.  What are your base database entities?  I did not really see them mentioned.  Before you worry about fields and datatypes, etc. you should have your entities down.

 

I am guessing that existing opening "explorers" handle transpositions at some level by tying a "final FEN" position to each opening, such that no matter how to arrive at the position, that position becomes authoritative for the opening you tie it to.  But then you will need some mechanism for breaking ties...when a position is one that can transpose into multiple openings, which one is authoritative for displaying?  Like 365chess.com...when you play an English c4 opening that plays d4, often you'll end up in a QG variation.  Somewhere there's probably a weighting value that says that QG is more authoritative than the English, or more likely, that e4/d4 openings are always weighted higher than c4/f4 openings, for example.  

 

I see a few entities here, but I don't see anyone defining them up front, rather more in a "we'll tackle those as we run into them" way:

 

Opening - the base entity, but needs specific definition...is the opening the resulting position at the end, or also all the positions in between (which are shared), or is the opening the exact sequence of moves and open ended?  Etc.

Opening group - you will need to have a hierarchy, which could probably be self-referencing...Ruy Lopez Exchange Variation rolls up into Ruy Lopez rolls up into king's pawn openings, rolls up into kingside openings, etc.

Variations and/or analysis lines - Are these just openings with their own records, in which case they must all be named and categorized, or sequences of moves that are tied to a particular FEN position?

Move order, ideal piece positions, etc. are things that might need their own data structures/handling, but these should all be identified in advance.

If you have hammered out your base schema correctly, any and all "where do we put this?", "how do we reflect that?", "what if?" question should be immediately answerable:  either they fit neatly into the designed schema, or they need to be addressed by an addition to the schema.  If they result in retooling the schema, you didn't design it well enough wink.png...and that's what you want to avoid because otherwise during development all these "what ifs" will result in multiple schema overhauls, which are bad in any case but much worse if you have created tables and populated data already.

 

What is the table structure of the ECO database?  I am guessing you will want to be at least compatible with it if not using the same values:

https://www.365chess.com/eco.php