Chess Analytics: The Long and the Short of It
It has been six years since I posted to this blog, and I thought my return should introduce something completely new and fascinating – Chess Analytics. By that I mean the application of modern data science to chess game data. In this first post of what will be many, I will briefly discuss the distribution of game lengths among chess players with Elo ratings of at least 2,000, i.e. experts and above.
But first I will give some background. My Chess Analytics posts will be based on a collection of almost 1.7 million games I have downloaded from the KingBase web site (http://www.kingbase-chess.net). To prepare these games for analysis I preprocessed them with a utility written for the R statistical analysis language by Joshua Kunst. This utility is in the file 01_pgn_parser.R and is available for code geeks at GitHub (https://github.com/jbkunst/chess-db). I then read that preprocessed game data into R and, using Kunst’s rchess package, wrote additional preprocessing code to compute and save several statistics for every game. One of these is the length of each game in half moves, i.e. all of white’s moves plus all of black’s moves.
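To make "length in half moves" concrete, here is a minimal Python sketch that counts the half moves in a string of PGN movetext. It is not the R code used for this post; the function name `count_half_moves` is made up for illustration, and it assumes plain movetext with no comments, variations, or annotation glyphs.

```python
import re

# Tokens that are not moves: move numbers such as "1." or "12...", and result markers.
MOVE_NUMBER = re.compile(r"^\d+\.+$")
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}

def count_half_moves(movetext: str) -> int:
    """Count half moves (plies) in plain PGN movetext.

    Assumption: no comments, variations, or annotation glyphs are present.
    """
    # Split move numbers that are glued to a move: "1.b4" becomes "1. b4".
    spaced = re.sub(r"(\d+\.+)", r"\1 ", movetext)
    tokens = spaced.split()
    # Every remaining token that is not a move number or a result is one half move.
    return sum(1 for t in tokens if not MOVE_NUMBER.match(t) and t not in RESULTS)

print(count_half_moves("1.b4 a5 2.f4 f5 3.e3 a4"))  # 6
```

A game with three white moves and three black moves therefore has a length of 6, which is the shortest length in the dataset.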
Here is a histogram that provides a visualization of these game lengths:
Along the x-axis is the length of each game from the shortest with only 6 half moves, to the longest, which was a marathon consisting of 475 half moves!
The y-axis shows us how many games were played at each game length. You can see that there is a lengthy ‘tail’ to the right, indicating that the really long games, say those with more than about 160 half moves, very rarely occurred. Among these extremely long games, there were 28 lengths that occurred only once, ranging from one game of 331 half moves to that marathon monster with 475. The most frequent game length was 81 half moves, which occurred 31,649 times! Notice how far above the madding crowd that highest blue dot is! Don’t you wonder why that game length results in such an enormous number of games?
The green vertical line is the location of the mean, or average, game length of 80.8 half moves, while the red vertical line at 78 is the median, which tells us that half of the nearly 1.7 million games were shorter than 78 half moves and the other half were longer. Similarly, the green horizontal line at 4,548 indicates the mean height of the blue dots, i.e. the mean number of games per game length, whereas the red horizontal line at 330 is the median number of games per game length.
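If the difference between the two pairs of lines is not obvious, the following Python sketch may help. It uses a tiny made-up frequency table (these counts are hypothetical and are not the KingBase figures) to compute the mean and median game length, which correspond to the vertical lines, and the mean and median number of games per length, which correspond to the horizontal lines.

```python
from statistics import mean, median

# Hypothetical toy frequency table: game length in half moves -> number of games.
# These counts are invented for illustration and are NOT the KingBase figures.
length_counts = {6: 1, 40: 3, 78: 5, 81: 6, 120: 4, 475: 1}

# Vertical lines: expand the table into one entry per game, then summarize lengths.
lengths = [length for length, n in length_counts.items() for _ in range(n)]
mean_length = mean(lengths)
median_length = median(lengths)

# Horizontal lines: summarize the counts themselves (the heights of the dots).
counts = list(length_counts.values())
mean_count = mean(counts)
median_count = median(counts)

print(mean_length, median_length)
print(mean_count, median_count)
```

In this toy table the single 475 half-move game pulls the mean length above the median, which is the same effect a long right tail has on the real data.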
An interesting oddity is that elevated blue dot at game length 120, which occurred 12,694 times, whereas its neighboring lengths of 119 and 121 occurred only 9,969 and 8,273 times, respectively. Perhaps in a future post I will explore this 120 half-move over-achiever to see if we can understand why it stands out. There is also a lesser, but still noticeable, bump at 160 half moves, which occurred 2,116 times vs. lengths 159 and 161 at 1,782 and 1,470 games, respectively.
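One simple way to put a number on these bumps is to compare each count with the average of its two neighbors. The Python sketch below does that using only the six counts quoted above; the helper name `spike_ratio` is made up for illustration, and this is just one of many possible measures.

```python
# Counts quoted in the post for the two bumps and their immediate neighbors.
counts = {119: 9969, 120: 12694, 121: 8273, 159: 1782, 160: 2116, 161: 1470}

def spike_ratio(counts: dict[int, int], length: int) -> float:
    """Ratio of the count at `length` to the average of its two neighbors."""
    neighbor_avg = (counts[length - 1] + counts[length + 1]) / 2
    return counts[length] / neighbor_avg

for length in (120, 160):
    print(length, round(spike_ratio(counts, length), 2))
# 120 1.39
# 160 1.3
```

By this measure, length 120 sits roughly 39% above the average of its neighbors and length 160 roughly 30% above.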
Of passing interest, though I place no significance upon it, out of the 1,696,607 games there are no games whatsoever with the following lengths: 339, 346, 350, 352, 357, 362, 364, 367, 371, 374, 375, 377–386, 389, 390, 392–397, 399, 401, 403–406, 408–419, 421–425, 427–438, 440–453, 455, 457–474.
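The list above also allows a quick consistency check on the mean number of games per game length. The Python sketch below counts the missing lengths and divides the total number of games by the number of lengths that remain. It assumes the list is complete, i.e. that every length from 6 to 475 half moves not listed above occurs at least once.

```python
# Missing lengths as listed in the post, with ranges written as (start, end) pairs.
missing_spec = [339, 346, 350, 352, 357, 362, 364, 367, 371, 374, 375, (377, 386),
                389, 390, (392, 397), 399, 401, (403, 406), (408, 419), (421, 425),
                (427, 438), (440, 453), 455, (457, 474)]

missing = set()
for item in missing_spec:
    if isinstance(item, tuple):
        start, end = item
        missing.update(range(start, end + 1))  # ranges are inclusive
    else:
        missing.add(item)

possible = 475 - 6 + 1              # every length from 6 to 475 half moves
observed = possible - len(missing)  # distinct lengths that occur (assumes the list is complete)
mean_games_per_length = 1_696_607 / observed

print(len(missing), observed, round(mean_games_per_length, 1))
# 97 373 4548.5
```

Under that assumption, 97 lengths are missing, 373 distinct lengths occur, and the mean of about 4,548.5 games per length agrees with the figure of 4,548 quoted for the histogram.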
It is of interest, however, that there were 98 nano-games that had only 3 white and 3 black moves. Somehow, 15 of those 98 games actually had a winner: 7 were won by black and 8 by white. Those 15 decisive games are:
| Year | White | Black | Result | White Elo | Black Elo | PGN |
|---|---|---|---|---|---|---|
| 2013 | Grandelius | Kurayan | 0-1 | 2576 | 2398 | 1.b4 a5 2.f4 f5 3.e3 a4 |
| 2013 | Yilmaz | Steindorsson | 1-0 | 2531 | 2235 | 1.c4 Nf6 2.Nc3 e6 3.e4 c5 |
| 2011 | Pridorozhni | Malakhov | 0-1 | 2542 | 2714 | 1.f4 Nc6 2.b4 d5 3.a3 Bg4 |
| 2010 | Radjabov | Nakamura | 0-1 | 2744 | 2741 | 1.Nf3 f5 2.c4 Nf6 3.Nc3 d6 |
| 2012 | Vernay | Riff | 1-0 | 2441 | 2494 | 1.d4 Nf6 2.c4 e6 3.Qc2 Bb4+ |
| 2009 | Svetlov | Sanzhaev | 1-0 | 2328 | 2112 | 1.d4 Nf6 2.Nf3 e6 3.c4 b6 |
| 2000 | Dinstuhl | Dautov | 0-1 | 2412 | 2606 | 1.d4 Nf6 2.c4 e6 3.Nf3 b6 |
| 1999 | Akhmetov | Sveshnikov | 0-1 | 2438 | 2541 | 1.Nf3 d5 2.g3 Nf6 3.Bg2 c6 |
| 2013 | Baraeva | Navrotescu | 1-0 | 2211 | 2147 | 1.d4 Nf6 2.c4 g6 3.Nc3 Nd5 |
| 2010 | Eljanov | Caruana | 1-0 | 2742 | 2709 | 1.d4 Nf6 2.c4 c5 3.d5 b5 |
| 1997 | Lyrberg | Akesson | 0-1 | 2430 | 2520 | 1.d4 Nf6 2.Qd3 d5 3.Qxh7 Rxh7 |
| 2013 | Babarykin | Paveliev | 1-0 | 2306 | 2386 | 1.e4 d5 2.exd5 Nf6 3.d4 Nxd5 |
| 2013 | Repkova | Frischmann | 1-0 | 2374 | 2234 | 1.e4 c5 2.Nf3 e6 3.b3 a6 |
| 2015 | Yaniuk | Stupak | 0-1 | 2098 | 2568 | 1.e4 e6 2.d4 d5 3.Nc3 Bb4 |
| 2013 | Rakhmanov | Minero Pineda | 1-0 | 2595 | 2412 | 1.e4 e6 2.d4 d5 3.exd5 exd5 |
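As a quick check of the win counts, the results in the table can be tallied with a few lines of Python. The data below is simply copied from the table, in the same order; nothing else is assumed.

```python
from collections import Counter

# (White, Black, Result) for the 15 decisive six-half-move games, in table order.
games = [
    ("Grandelius", "Kurayan", "0-1"), ("Yilmaz", "Steindorsson", "1-0"),
    ("Pridorozhni", "Malakhov", "0-1"), ("Radjabov", "Nakamura", "0-1"),
    ("Vernay", "Riff", "1-0"), ("Svetlov", "Sanzhaev", "1-0"),
    ("Dinstuhl", "Dautov", "0-1"), ("Akhmetov", "Sveshnikov", "0-1"),
    ("Baraeva", "Navrotescu", "1-0"), ("Eljanov", "Caruana", "1-0"),
    ("Lyrberg", "Akesson", "0-1"), ("Babarykin", "Paveliev", "1-0"),
    ("Repkova", "Frischmann", "1-0"), ("Yaniuk", "Stupak", "0-1"),
    ("Rakhmanov", "Minero Pineda", "1-0"),
]

# Count how often each result string appears.
tally = Counter(result for _, _, result in games)
print(tally["1-0"], "white wins,", tally["0-1"], "black wins")
# 8 white wins, 7 black wins
```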
And here is the behemoth 475 half-mover between Felber and Lapshun in 1998:
In my next post (https://www.chess.com/blog/kurtgodden/chess-analytics-introduction-to-material-and-mobility), the first of several on this large game dataset, I will discuss black and white material vs. mobility, and I will show you some extraordinary graphs. In the meantime, please let me know in the comments section if there is some analysis that you would like to see in a future post.