Statistics and Chess Improvement

  • GM Shankland | Feb 21, 2012

Today I would like to share with you an exercise I did at the end of 2011 to try to prepare for the events I would play in the next year. I logged 70 FIDE rated games in 2011. This is a decent but not huge sample size, and I decided to do a thorough statistical analysis of my results to try to find spots where I was performing well and spots where I could be playing better. I’ll show some of the results and notable statistics here:

Results by Rating

My Score vs. Opposition rated under 2400 FIDE: 23 wins, 1 draw, 0 losses: 97.9%, Performance Rating 2960

My Score vs. Opposition rated 2400-2499 FIDE: 3 wins, 8 draws, 0 losses: 72.2%, Performance Rating 2540

My Score vs. Opposition rated 2500-2599 FIDE: 5 wins, 13 draws, 3 losses: 54.5%, Performance Rating 2595

My Score vs. Opposition Rated 2600-2699 FIDE: 1 win, 7 draws, 4 losses: 37.5%, Performance Rating 2546

My Score vs. Opposition Rated 2700+ FIDE: 1 win, 2 draws, 1 loss: 50%, Performance Rating 2728
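
For readers who want to reproduce figures like these by hand, here is a minimal Python sketch. The percentage follows directly from the win/draw/loss counts. The performance rating uses the standard logistic Elo formula, which is an assumption: the article does not say how its performance ratings were computed, and the average opponent rating of 2650 below is a hypothetical placeholder, not a figure from the table.

```python
import math


def score_fraction(wins, draws, losses):
    """Fraction of available points scored (win = 1, draw = 0.5)."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def performance_rating(avg_opponent_rating, score):
    """Logistic Elo estimate (an assumption, not necessarily the tool's formula).

    The formula is undefined for a 0% or 100% score, so those raise ValueError.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return avg_opponent_rating + 400.0 * math.log10(score / (1.0 - score))


# 2600-2699 bracket from the table: 1 win, 7 draws, 4 losses.
score = score_fraction(1, 7, 4)
print(f"{score:.1%}")  # 37.5%

# 2650 is a hypothetical bracket average, used only for illustration.
print(round(performance_rating(2650, score)))
```

Because the rating term is just an offset from the average opponent rating, a different bracket average shifts the result by the same amount.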


From this data I could tell that the players I was having the most trouble with were those in the 2400-2700 range, and when I look back at the year, it makes sense. I was not beating 2400-2600 players as often as I should have, and I lost a couple of tough games to 2600s while only striking back once. I was holding my own against the really big boys, especially considering that one of my draws against 2700+ could be counted as a win, because I only agreed to the draw to win my match in an elimination-style format. However, this was only a sample of 4 games. I was quite happy that I was able to manage such an effective score against players rated below 2400.

Another thing I noticed was that while my score against 2400-2600 players left a lot to be desired, I was losing very rarely, and the 3 losses that did occur were against players rated 2590, 2590, and 2592, who might have been in the next category had I played them a month earlier or later. I decided to move those losses into the 2600-2700 category, and then I found that my performance rating against 2500-2599 was 2656 while against 2600-2700 it was only 2512. This ultimately led me to conclude that the players I needed to score better against were the 2400-2500 players and the 2600-2700 players. I decided to fix this by trying to play a bit less theoretically against the lower-rated players, to get them on their own earlier, and by trying to be more aggressive against the stronger players, because my solid play got me a bunch of draws but I got nicked for a couple of losses and did not manage to counter them with an equal number of wins. It appears this analysis and my new approach have paid off so far: I present you my results against these rating ranges from my first tournament of 2012.

My Score vs. Opposition rated 2400-2499 FIDE: 3 wins, 2 draws, 0 losses: 80%, Performance Rating 2698

My Score vs. Opposition rated 2600-2699 FIDE: 1 win, 1 draw, 0 losses: 75%, Performance Rating 2830

Results by Opening

Another key part of the statistical analysis is to look at openings: this will give you a sense of what you need to study more, both in terms of opening theory and the ensuing middlegames. This section was very detailed in my work, but I’ll present a more general version here:

Score with White in d4 d5 systems: 68%, 2557 Performance Rating

Score with White in Nimzo/QID systems: 70%, 2648 Performance Rating

Score with White in KID/Grunfeld: 67%, 2591 Performance Rating

Score with White in other Systems: 88%, 2755 Performance Rating


Score with Black vs. 1. e4: 58%, 2549 Performance Rating

Score with Black vs. 1. d4: 65%, 2639 Performance Rating

Score with Black vs. Other first moves: 56%, 2542 Performance Rating


With this information I deduced that I mostly need to work on my black repertoire against non-d4 moves and my white repertoire in the Slav and QGD. There is a lot more to the statistical analysis I did, including the use of the serious but unrated games (US Chess League, tiebreak games, rapid games, training games), filtering by how many moves the games lasted (which measures fatigue and level of endgame play), breaking the year in half (In January-May I performed at a 2580 level, while June through December I was over 2630, suggesting I probably improved and the more recent games are more relevant), and much more.

I would suggest to any reader, even those few who are not professional players, that statistical analysis can be an excellent way to examine your own play, and I would suggest breaking the analysis down by rating range and opening. Then, once you have determined a weakness, look at all of your games against this rating range or opening, including the wins, to determine what you might be doing wrong and how to improve it. And always keep in mind: a small sample of games will not have the same accuracy as a large sample. Lastly, I should point out that this analysis would have been extremely difficult to do without the help of Chessbase, and I highly recommend that everyone buy this software, for statistics as well as opening preparation and engine analysis. I know that I was much happier with my training regimen after doing a thorough statistical analysis of my own results, and so far it has paid off in my only 2012 event. I look forward to seeing if it can continue to pay dividends, and I hope you find it useful as well.
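
The breakdown by rating range and by opening can also be sketched without database software. The Python below is only an illustration under stated assumptions: the game records are invented, and the names `rating_band` and `summarize` are hypothetical helpers, not part of any chess tool.

```python
from collections import defaultdict

# Hypothetical records: (opponent_rating, opening_label, result),
# where result is 1, 0.5 or 0 from the player's point of view.
games = [
    (2385, "Slav", 1.0),
    (2450, "Slav", 0.5),
    (2470, "KID", 1.0),
    (2610, "Slav", 0.0),
    (2655, "Nimzo", 0.5),
]


def rating_band(rating, width=100):
    """Label such as '2400-2499' for the band containing `rating`."""
    low = (rating // width) * width
    return f"{low}-{low + width - 1}"


def summarize(games, key):
    """Map each group to (number of games, score fraction)."""
    totals = defaultdict(lambda: [0, 0.0])
    for game in games:
        bucket = totals[key(game)]
        bucket[0] += 1          # games played in this group
        bucket[1] += game[2]    # points scored in this group
    return {group: (n, points / n) for group, (n, points) in totals.items()}


by_band = summarize(games, key=lambda g: rating_band(g[0]))
by_opening = summarize(games, key=lambda g: g[1])

for group, (n, score) in sorted(by_band.items()):
    print(f"{group}: {n} games, {score:.0%}")
```

Keeping the game count next to each percentage makes it harder to forget how small some of the groups are.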


  • 4 years ago


    I keep it with Garry: Return to your losses over and over again!

  • 5 years ago


    "This leads me to think that we can actually obtain smaller amount of data to obtain trends."

    No, that's not correct.  The number of games you need for convergence depends only on the variance in the results.  The number of steps leading to the result means nothing.  The variance of GM game results was already given and is quite easy to calculate.

    Look at Topalov - Anand WC 2010, two of the best players in the world, playing the same opponent over and over again.  Your intuition might be that the variance would be quite low.  The results were (from Anand's view):

       0 1 .5 1 .5 .5 .5 0 .5 .5 .5 1

    That's 7 draws in 12 games, aha, low variance!  Except no, that's not low variance at all.  The standard deviation of these numbers is .33, just a bit higher than the standard deviation in my earlier calculation.
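
    The figure can be checked in a few lines of Python. This is a minimal sketch that assumes the sample standard deviation (dividing by n - 1) is the one intended; the population version comes out slightly lower.

```python
import statistics

# Game results from Anand's side, as listed above.
results = [0, 1, .5, 1, .5, .5, .5, 0, .5, .5, .5, 1]

mean = statistics.mean(results)
sd = statistics.stdev(results)  # sample standard deviation (n - 1 divisor)

print(f"mean {mean:.3f}, sd {sd:.3f}")
```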


    It's really that simple.  The only question left is whether or not I calculated the numbers correctly, which is why I gave links for every step in the process.


  • 5 years ago


    Also I would remind people that the writer of this article is a GM, so his concerns are very refined. This throws an interesting wrench into the typical statistical analysis methodology used with data. (It has been 10 years since I had research methods, so forgive me, but this is what I recall.) I feel like we can throw out the gross and typical errors that cause the vast majority of us to lose games. GMs are in the top 1 percent of players. This leads me to think that we can actually obtain smaller amount of data to obtain trends. In a normal data collection process we expect errors in our data and use methods to filter out the data. We expect a greater variety of error and outlying data that is not applicable in this situation.

    I am curious if there are any statistical methods to deal with the reliability of the data collected being higher than normal. I vaguely remember something, but it's been a long time.

    I suspect that the application of normal distribution methods might not apply to the data (games) collected in this case because the confidence factor in the actual data is much higher than normal due to the elimination of error by the players. 

    A side note:

    Another idea that I have seen used is to chart a game using a computer evaluation move by move.  If the game's evaluation tends to drop in certain openings, then that is probably a strong indicator that there is something to work on.

  • 5 years ago


    Hi Tony,

    There are two paths to improvement at odds here.  On the one hand, you can use your own understanding of the game, look at the moves you play, and make a qualitative analysis of your weaknesses.  In this case you are right, it takes very few games to see clear trends and make an informed decision about what to study.  The problem is, you must trust your own judgment and chess skill, so there is a bit of a chicken-and-egg problem about using your own chess insights to determine how your chess insights are flawed.


    The other approach is purely mathematical, and that is the approach championed in this article.  In this case, you use a statistical analysis that knows nothing about chess to point out areas in which you are under-performing.  The power of this approach is that it removes all your personal bias and flaws.  The downside is that it often takes a great deal of data to be useful.


    When these approaches agree, all is good.  What about when they don't agree?  What if your chess skill says "I am great against the Slav", but the stats tell you otherwise?  Which one should you believe?


    Example: your intuition tells you that a coin is fair.  You base this on your knowledge of how they are made, and on personal experience flipping coins.  Now I present you with a coin, and we flip it some number of times.  It comes up heads every time.  The statistics now disagree with your intuition!  Which one should you believe?  Of course the answer depends on the number of times you flipped the coin.  If you only flipped it twice, you should trust your intuition.  The variance on the expected number of heads is large.  If you flip it one thousand times though, you should believe the statistics.  The coin is biased with near certainty.
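
    To put numbers on the coin example, here is a small Python sketch of how likely an all-heads run is if the coin really is fair. The function name `prob_all_heads` is just an illustrative choice, and the flips are assumed to be independent.

```python
def prob_all_heads(flips, p_heads=0.5):
    """Chance that every one of `flips` independent tosses lands heads."""
    return p_heads ** flips


for flips in (2, 5, 10, 1000):
    print(flips, prob_all_heads(flips))
```

    Two heads in a row happens a quarter of the time with a fair coin, so it says almost nothing; a thousand in a row is so improbable that a biased coin is the only sensible explanation.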


    In this article, Sam chooses to believe the stats over his intuition, but he is wrong to do so.  In the analogy above, he has flipped a coin twice and decided that the coin is biased, a false conclusion based on random variance.  In my post, I give hard numbers for the variance on the stats being used here, so you can see how much data you need before the stats provide information that might be worth valuing over your personal knowledge of chess.


    Finally, you make a good point that chess is a game of many moves.  In fact, each move is itself not one decision, but the sum of many, as you analyze lines, choose candidate moves in different positions, etc.  It might feel like this makes the variance lower, but in fact the variance at the game level is very easy to measure.  You just look at the string of 1s, 1/2s, and 0s and calculate it.  That is the power in the approach... you don't need any information about how the results occurred to draw conclusions, you only need sufficient data.  You could do a separate analysis of "quality of moves vs engine recommendation" and turn 10 games into 500 data points.  This might very well yield some interesting results, as you suggest.  Once again it would be easy to measure the variance and create 99% confidence intervals to guide your study.  It's interesting that Sam may very well have enough data to say "I often make key blunders in endgames, I should study them more", but not nearly enough to say "I am weak against 2400s, I will change my opening repertoire" based purely on statistics.  Of course, he is eminently qualified to make either of these statements based on his chess intuition.  :)


    I hope that helps.


  • 5 years ago


    Just a thought, but one of the problems that occurred to me is that statistics take into account the number of data points and then try to make some judgments based on that. The issue that might come to mind here is that a chess game is not a single data point but is in itself a collection of data points. To draw or win a game, players generally cannot make large errors, especially at the GM level. A game itself can be a determining factor. Most rating systems want about 10-15 games to determine an accurate (post-provisional) rating. While 70 data points might seem paltry, it is actually a large amount of data if one considers the number of moves, etc. Just a thought.... While firm conclusions would be difficult to draw on the basis of a few games, some trends can be determined. Are the games lost in the opening, middlegame, or endgame? In which type of middlegame (open vs. closed), or in rook, pawn, or bishop endgames, etc.?

  • 5 years ago


    We seem to have slightly different ideas of what exactly "analysis" references, but I think we agree once we clean up the semantics.  :)


    Thanks for taking the time to read it all, I'm glad somebody got something out of it.  ;)

  • 5 years ago


    Thanks for elaborating on confidence intervals, elindauer.  Your description is very useful for those who do not know anything about making inferences from the results of statistical analyses.  I was actually pointing out that your statement was not accurate, logically.  The analyses are not flawed, despite the issue of error variance and ultimately confidence intervals.  Rather, what would be a flaw is to make any strong (or perhaps any) inferences from the results of the analyses in this specific case.  In other words, it's not the analyses that are flawed, but rather the conclusion.  This you recognize in noting that the analysis would be useful given a much larger sample.  Your intended point is valid, and your actual calculations illustrate an important point that MUST be considered before trying to identify personal weaknesses.

  • 5 years ago


    @ideological_slave: good question. I see that I didn't give much information about what a "confidence interval" is...


    If you see a stat like "my performance rating against 2600s is 2550", you may be inclined to make a decision with that information, like changing your opening rep, or studying different openings.  But first, you should ask this question: "how accurately does this number (2550) represent my true strength?"  Put another way, how often will my true playing strength be significantly different than the result I observed?


    If I tell you that your true strength was 2550 plus or minus 500 rating points, then you would conclude that the number is very inaccurate, so you should not use it to make decisions.  If I say your performance was 2550 plus or minus 2 rating points, then you would say that number is a very accurate representation of your true strength, and could use as a basis for decision making.  The margin of error on a statistic is very important.


    A "95% confidence interval" as constructed in my post simply says that 95% of the time, Sam's true strength against 2600s will lie someplace between 2400 and 2730.  Obviously this is a very wide range, so the number is not very accurate, and should not be used to make important decisions like what openings to play.  As you play more games, the stat becomes more accurate or "converges" to your true skill, and the confidence intervals become smaller.  In this case, perhaps 125 games would be enough to get the number to be accurate to within +/- 50 rating points, and you could begin to think about using it as a basis for decisions.  Even then, 5% of the time your true skill would be significantly different from the observed result, which is still fairly often.  You could construct a 99% confidence interval to help avoid this, but then of course you would need even MORE data to use this stat alone as justification for your decisions.
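
    The shrinking of the interval with more games can be sketched in Python. This is a rough illustration only: it works in score units (fraction of points per game) rather than rating points, it assumes a standard deviation of 0.31 per game and the normal approximation with z = 1.96, and `half_width` is a hypothetical helper name.

```python
import math


def half_width(sd, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * sd / math.sqrt(n)


sd = 0.31  # assumed per-game standard deviation of results
for n in (12, 50, 125, 500):
    print(n, round(half_width(sd, n), 3))
```

    The margin falls with the square root of the number of games, so halving it takes four times as many games. Turning a margin in score units into rating points needs a further assumption about the rating formula.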


    I hope that helps.



    ps. you might think that 95% is too strict to be useful, but consider that in this example, we are given 5 different rating ranges.  If you base your decisions on comparing their results, then all of the rating ranges must be accurate for your results to be useful.  The odds of your true skill being outside the 95% confidence interval for at least one grouping go up with every grouping you add.  In this case, you will have one "outlier" result about 23% of the time.  You would be drawing a lot of mistaken conclusions from the data, making changes that weren't needed, studying the wrong things etc.  When you think about how wide the confidence intervals are for this small an amount of data (ie, the data is a very inaccurate representation of your true skill), and then realize that about 23% of the time, doing a comparison like this will fail because the 95% confidence wasn't confident enough, you should realize that this analysis is useless for this amount of data.
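
    The multiple-comparison effect described here is easy to compute. The Python sketch below assumes the five intervals are independent of one another, which is a simplification; `prob_at_least_one_miss` is an illustrative name.

```python
def prob_at_least_one_miss(intervals, confidence=0.95):
    """Chance that at least one of several independent intervals misses."""
    return 1.0 - confidence ** intervals


print(f"{prob_at_least_one_miss(5):.1%}")  # 22.6%
```

    With five rating ranges the chance of at least one misleading interval is a little under one in four, in line with the rough figure quoted above.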


    Still, it's a very interesting approach, and could be very useful for players with more data.  People that write chess engine software, or online blitz players both come to mind as users who could benefit from this analysis.

  • 5 years ago


    Some good points being brought up in here.  The posts from zadiagnose and elindauer especially caught my attention.  @Zadiagnose, obviously, the countless number of variables necessary to predict outcomes makes it impossible to place much faith in the results of these types of analyses.  Basing decisions re: weaknesses and strengths, and subsequent decisions re: what to focus on and not focus on, on these results alone would be foolish.  As receipt1 noted, it is simply "a method to add to your arsenal!" As for elindauer, I'm not seeing how the analyses are flawed; interpreting them with confidence is a problem, but how are the analyses flawed?

  • 5 years ago


    Thank you Sam.  That was a good, helpful article.  The only disappointment was the nitpicky, snarky tone of some of the comments that followed it.  Please ignore the jerks, and keep the helpful articles and videos coming.  Thanks.

  • 5 years ago


    There are three kinds of lies: lies, damned lies, and statistics. Benjamin Disraeli, British politician (1804 - 1881)

  • 5 years ago


    pps.  Gotta love the irony of putting a picture of a normal distribution in your post, and then having your analysis pwned by a calculation of standard deviation...  :D :)


    oh man I'm such a nerd... :) :P

  • 5 years ago


    You know, this is really not that hard to calculate, I'm just going to do it for you.  I hope you will take the time to read this...


    Take, for example, your results against players range 2600 - 2699.  You apparently made chess decisions based on the fact that your performance rating was 2546 that you would not have made had your performance rating been 2595 (your rating vs 2500-2599), or 50 points higher.


    Your observed results against 2600s were: 1 .5 .5 .5 .5 .5 .5 .5 0 0 0 0, for a mean of .375 and a standard deviation of .310 (computed as: standard deviation {1, .5, .5, .5, .5, .5, .5, .5, 0, 0, 0, 0}).


    This gives a 95% confidence interval for your true score against this rating range of plus/minus .18, which suggests that 95% of the time, your true skill in this period was somewhere between a performance rating of 2400 and 2730.
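
    The score-level part of this calculation can be reproduced with a short Python sketch. It assumes the sample standard deviation and the normal approximation (z = 1.96), and it stops at the score interval; converting the endpoints to performance ratings depends on a rating formula that is not spelled out here.

```python
import math
import statistics

# Results against 2600-2699 opposition: 1 win, 7 draws, 4 losses.
results = [1] + [0.5] * 7 + [0] * 4

mean = statistics.mean(results)
sd = statistics.stdev(results)                      # sample standard deviation
half = 1.96 * sd / math.sqrt(len(results))          # 95% margin, normal approx.
low, high = mean - half, mean + half

print(f"mean {mean:.3f}, sd {sd:.3f}, 95% CI {low:.3f} to {high:.3f}")
```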


    To get a performance rating with a 95% confidence of being within 50 rating points of your observed result, assuming this st dev holds up (and it is probably too low as you have an unusually high number of draws in this sample), you will need to play about 125 games, or ten years' worth of results at this pace.  Given the likely rate of fluctuation in your true skill, you are simply not playing chess often enough to ever get your performance rating to converge with this kind of accuracy.


    A statistical analysis like this might be appropriate for an online blitz player who plays dozens of games / day.  You could take several weeks' worth of data and be willing to make changes that went against what your skill in chess analysis told you.


    IMO, this proves mathematically the analysis done here is flawed, and that live chess players would be much better served to trust their chess skill and simply review games to decide what to study and what openings to play.

       good luck,



    ps. Many a poker player has made this same mistake.  They play 10 or 20 or 30 thousand hands over the course of several months, win a bunch of money, and turn pro, only to discover that they were actually just running good and have no hope of paying the rent with their true win rate.  Human intuition about the variance of observed results is quite poor.

  • 5 years ago


    I love that you are trying to use math to guide your study!  Awesome.  Since you have mentioned poker though, here is an important result.  It is not unusual for a winning player to go 100,000 hands of break even poker, despite their true win rate being over 1 big bet / 100 hands.  100,000!  That's a year of play for most online pros, and a lifetime of live play.  The variance in that game is disturbingly high.


    (edit: removed a bunch of speculation that Shaky's sample size was too small.  Did the math.  See post above)


    tl;dr: stats converge very slowly, be wary, trust your instincts and review games to determine your true weaknesses.

  • 5 years ago


    What can I say about Sam in 2012? (I only looked at a few games: vs. Kaidanov, Kamsky, Leko, Erenburg, Hess, Gupta.) I see Sam is in very good company with Morphy, Capa and Fischer, the fathers of American chess: combinations are a STRONG side, the style is right (Morphy, Capa) but complicated and aggressive (Fischer), and endings and technical positions are at a good level. What about WEAKNESSES? Of course, TOO OPTIMISTIC (we are all optimists at such a young age (21)!), and defence (psychological, stemming from the optimistic style). A good decision would be a deep study of Akiba, Botvinnik, Kramnik. I think at such a level the PSYCHOLOGICAL factor is very important (stability). I believe Sam can play at the 2700-28... level after serious homework. But... this level is only for BAD BOYS... Good luck!

  • 5 years ago


    Hi Sam, this was very informative, thanks.

    I have also wondered whether chess strength and chess rating have a linear relationship:

    e.g. is it just as easy for a 2700 player to beat a 2600 as for a 2300 vs 2200, 1800 vs 1700, 1200 vs 1100?
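
    For reference, the Elo model itself assumes exactly this: the expected score depends only on the rating difference, not on the absolute level. The Python sketch below shows the standard Elo expectation formula; whether real results follow it equally well at every level is a separate, empirical question.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score for `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


for a, b in [(2700, 2600), (2300, 2200), (1800, 1700), (1200, 1100)]:
    print(a, b, round(expected_score(a, b), 3))
```

    Every pairing above is 100 points apart, so the model gives the same expected score, about 0.64, for each.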

    This statistic of yours is absolutely amazing:

    "My Score vs. Opposition rated under 2400 FIDE: 23 wins, 1 draw, 0 losses: 97.9%, Performance Rating 2960"

    I can't imagine that I will ever have the same statistic against players rated 170-200 lower than me. And that is probably because my chess is not well rounded (OK in certain aspects, weak in many others, so there is always a way for a weaker player to beat me; I am 1600 otb). I have sometimes even blown my opening against such guys because I underestimate weaker players.

    So I think there is some randomness but more amongst amateurs and much less so with masters.

    I think this is a very very interesting thing to dig deeper in - chess and stats.

    Regards Johan


    (....other comments have made jokes about this, so here is mine: "stats is like a bikini swimsuit....what it shows is interesting but what it hides is essential"...haha...ok here is another one: "we have identified GM Samy's weak spot, just reach +2900 then you have a good chance to beat him statistically"...haha)

  • 5 years ago


    This is an excellent article because, while Grandmaster Shankland identified its shortcomings, he gave us a method to approach the question of "How do I improve my chess?".  This sort of tracking and analysis can be improved upon with more data points and a realization that variances are "mutual" in chess, but that doesn't take away from GM Shankland's point that it is a method to add to your arsenal!

  • 5 years ago

    GM Shankland

    Dear All,

    In light of all the comments and differences of opinion this article has produced, I decided to write a supplementary comment. First of all, for those wondering how the numbers were found, I used the Chessbase "Statistics" function (the "S" key is a shortcut on Chessbase 11). For example, I will filter the database of my own games, setting one of the player's ratings to 2400-2499 and the year to 2011, and all of my games against 2400-2499 level players will show up. I then select all and type "S", then into the "Player Name" field I type my name, and it will analyze my results and performance across all those games. You can also filter by ECO code, number of moves, color, etc. I'm not exactly sure how it calculates performance rating; a performance over 2800 against 2400 and below would be impossible by another way of calculation, but I don't believe this is relevant, because as long as it is shown relative to my performance against other rating groups, weaknesses can be found. It could also read 1200, 1300, 1400; I would know that the 1200 portion would be my weakness.

    In regards to how this will help your chess, doing statistical analysis will not make you a better player. What it will do is try to isolate certain variables and see where you are underperforming. One of the most common questions I get is some glorified version of "How do I become a better chess player?" and there is no one-size-fits-all answer. I believe that for a player to attain a higher level than what he or she is currently at, he or she needs to level out their game so that they are not vulnerable in certain aspects of play: improving weaknesses. But a lot of the time finding out what a weakness is can be difficult, and different players will have different weaknesses. If you can statistically analyze a large sample of games, you may identify some of those weaknesses. For example, before this process it never occurred to me that I was underperforming with white against the Slav, possibly because as black I scored very well and my wins were more memorable than my losses. Of course there is some degree of randomness to this kind of thing, but I don't think it is nearly as big as a lot of people are suggesting (a 2500 plays like a 2700 one day and a 2300 the next). How often do you see a 2500 beat a 2700? Or a 2300 beat a 2500? It seems very rare to me, and the few times they do, it's almost always a product of the stronger player playing weaker than normal and the weaker player playing stronger: mutual variance, so to speak. Also, some things might not be as random as expected. I noticed that I performed extremely well in games that started before noon, but I don't think this is random at all. By the extreme standard of professional chess players, my normally waking up around 10am (while not in a tournament) makes me quite the early bird, which would suggest I will have an easier time playing morning games.

    I am not saying that statistical analysis will positively identify the exact strengths and weaknesses of your game, or that it is completely foolproof and 100% accurate, but it will definitely give you a pretty good guess, and I've found that the analysis I did was largely correct and beneficial.


    Sam Shankland

  • 5 years ago


    So the higher rated you are, the more likely draws occur, thank you! :)

  • 5 years ago


    @ GM SultanOfKings Yes, I agree with you, Sultan (I lost interest too). I think it is better to look at the 5 best and 5 worst games; after that, we can see all the strong and weak sides of ANY player. This JOB is more interesting...
