Sample Size and Confidence Levels

KevinOSh

I would like to get a better understanding of chess statistics and their meanings.

For example take this position:

Using the chess.com master games database, we want to figure out the best move for black (without resorting to an engine).

In this position, at the master level, white wins 34.2%, draws 17.7%, and black wins 48.1%.

There are 80 games in the chess.com master games database: https://www.chess.com/games/search?fen=rnbqkbnr/3p1ppp/p3p3/1p6/4P3/1BN2N2/PP3PPP/R1BQK2R%20b%20KQkq%20-%201%207

By far the most popular move is Bb7. There are 59 games in this position:

https://www.chess.com/games/search?fen=rn1qkbnr/1b1p1ppp/p3p3/1p6/4P3/1BN2N2/PP3PPP/R1BQK2R%20w%20KQkq%20-%202%208

White wins 32.2%, Draw 18.6%, Black wins 49.1%

The second most popular move is Nc6

There are 35 games in this position:

https://www.chess.com/games/search?fen=r1bqkbnr/3p1ppp/p1n1p3/1p6/4P3/1BN2N2/PP3PPP/R1BQK2R%20w%20KQkq%20-%202%208

White wins 28.6%, Draw 14.3%, Black wins 57.1%

There are also some other rare moves, but to avoid this exercise getting even more complicated I'll discount them.

So if you only look at the win rates the move Nc6 looks better, but that might be more to do with luck/chance than the strength of this particular move.

I did not do much statistics at school, but I read that you can compute confidence intervals from different sample sizes. For example, this site works through finding a 95% confidence interval for the amount of time spent watching TV per week: https://www.scribbr.com/statistics/confidence-interval/

Well chess is quite different from watching TV, but if we say a win for white is 1 point, a draw is 0.5 and a win for black is 0, then we can get mean values for each move.

So if the mean value is close to 0, that suggests the move is better for black, and if it is closer to 1, that suggests the move is better for white.

This is an idea I came up with this morning and thought I should post to get opinions on whether it makes any sense or how it is flawed.
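As a quick illustration of the idea, here is a minimal Python sketch. The win/draw/loss counts below are reconstructed from the percentages quoted above, so treat them as approximate, and the normal-approximation interval is only one of several ways to do this:

```python
import math

def mean_score_ci(wins, draws, losses, z=1.96):
    """Mean score from White's point of view (win=1, draw=0.5, loss=0)
    with a normal-approximation 95% confidence interval."""
    n = wins + draws + losses
    scores = [1.0] * wins + [0.5] * draws + [0.0] * losses
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)
    return mean, mean - half, mean + half

# Counts reconstructed from the quoted percentages (an assumption):
# 7...Bb7: 59 games -> 19 white wins, 11 draws, 29 black wins
# 7...Nc6: 35 games -> 10 white wins,  5 draws, 20 black wins
for name, (w, d, l) in {"Bb7": (19, 11, 29), "Nc6": (10, 5, 20)}.items():
    m, low, high = mean_score_ci(w, d, l)
    print(f"{name}: mean {m:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

For these samples the two intervals overlap heavily, which is exactly the point: 35 or 59 games are not enough to say Nc6 is genuinely better than Bb7.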

JackSmith_GCC

Maths was never my strong suit but this looks like it should make sense. 
However, I am not entirely sure how useful this method is for the chess side of things. In the words of two wise men: 

"There are three kinds of lies: lies, damned lies, and statistics." - Mark Twain

"On the chessboard lies and hypocrisy do not survive long. The creative combination lays bare the presumption of a lie; the merciless fact, culminating in a checkmate, contradicts the hypocrite." - Emanuel Lasker


My point being, I don't think statistics are particularly relevant to chess; they serve as a mildly useful guide, at best, in one's search for the best move.

Better is to potz down the lines (with an engine, or with a keen eye) to see the type of different middlegame plans Black will undertake. 

Let's compare the main lines: 


It seems to me there is some potential to Black's ...Nc6/...Na5 foray - the bishop is quite important in many lines - but I would say I trust the ...Bb7/...d6/...Nd7 setup a little bit more on principle. 

Ultimately it is up to you. Both are playable, and the best way to find out is to experiment with both. Or just play 3...g6 😉

KevinOSh

I have spent some time looking at opening databases and have been surprised by how many times the most frequently played move has a much worse win rate than a lesser played move.

Perhaps this is due to the surprise value of lesser played moves, or it is just a case of fooled by randomness?

Or perhaps there are many better moves out there waiting to be found and they aren't played enough because other moves are seen as better just because they have been played more?

JackSmith_GCC
KevinOSh wrote:

I have spent some time looking at opening databases and have been surprised by how many times the most frequently played move has a much worse win rate than a lesser played move.

Perhaps this is due to the surprise value of lesser played moves, or it is just a case of fooled by randomness?

Or perhaps there are many better moves out there waiting to be found and they aren't played enough because other moves are seen as better just because they have been played more?

I think there are sometimes lines which are more risky but also have higher reward - this is one reason for such results, as top gms prefer safety. 

Another possibility is that there has recently arisen a new development in an opening, which has had good results and lots of promise, but has not been tested in very many games yet. 

And yes some are anomalous. When you get under 100 total games, it increases the likelihood of a red herring.

 

It's always worth at least investigating the move with the highest win rate, but not worth playing just because it has the highest win rate. 

Deranged

Not to advertise any other sites, but there are some opening databases around that let you filter by rating, and I find these to be far more useful.

You'd rather know what works at the 1500-2000 level than what works at the Master level. Some trap lines can crush 1800s in blitz but don't do so well in OTB tournament games against grandmasters.

But regarding your main question: I'd say you want at least 100 games and at least a 5% difference in results for it to be statistically significant. If there's only like 20 games in question, then you're going to want a higher difference in results for it to be significant, like a 20% better result.

KevinOSh

I agree that there are other opening databases with more useful contextual information.

For most positions, there are much fewer than 100 games. It is quite common to see a position that has only had 2 games played, one of them ending in a win for white and the other ending in a win for black. The opening explorer gives the impression that one move is 100% winning and the other is 100% losing but that is not the case at all.

Clearly a single game is not statistically significant, but often that is all that is available to go on.

This is why I had the idea of confidence levels: to give an indication of how unreliable the statistics for certain positions in the database are!

Should a single game be completely discarded? Or does it still have some value?

We know it was played by a strong player, so there is some rationale behind the move. The game may have ended in defeat, but in most cases that will be due to a mistake later in the game rather than to that particular move.

In the case of a move with 20 games, say it has a 10% better result. It will depend on what you are comparing it to. If the main move has been played 10,000 times, then that's probably best. But if the main move has been played 25 times and the other move 20 times with 10% better results, I would usually be tempted to play the lesser played move.

Here is an example of statistical insignificance in action: the Sodium Attack (1.Na3) is a well-known garbage opening, but out of the 15 times it has been played at master level it has done quite well, winning 7 times, drawing twice, and losing 6 times:

https://www.chess.com/games/search?fen=rnbqkbnr/pppppppp/8/8/8/N7/PPPPPPPP/R1BQKBNR%20b%20KQkq%20-%201%201

The overall win rate for white here (7/15 ≈ 46.7%) is quite a lot higher than the win rate for more respectable moves like 1.d4 (38.9%) and 1.e4 (38.4%).

If we give 7 points for the wins and 2 × 0.5 = 1 point for the draws, we have a score of 8 out of 15, from a sample size of 15. What confidence level does this give us? Based on the results we have, what are the chances that 1.Na3 is actually a better opening move than 1.d4? If an infinite number of games were played we would know the exact win rates for both moves. A lot of 1.d4 games have been played, so we are probably already close to that figure. For 1.Na3 we know far less.
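One hedged way to put a number on this is a quick Monte Carlo. Assume, purely for illustration, that 1.Na3 is really just an average opening with roughly a 39% white win rate and a 30% draw rate (both assumed figures), and ask how often a 15-game sample would score 8/15 or better anyway:

```python
import random

random.seed(0)

# Null hypothesis (an assumption for illustration): 1.Na3 performs like an
# average master-level opening, taken here as roughly 39% white wins,
# 30% draws, 31% black wins.
W_RATE, D_RATE = 0.39, 0.30

def simulate_score(n_games):
    """White's total score over n_games drawn from the null W/D/L distribution."""
    score = 0.0
    for _ in range(n_games):
        r = random.random()
        if r < W_RATE:
            score += 1.0          # white win
        elif r < W_RATE + D_RATE:
            score += 0.5          # draw
    return score

observed = 7 + 2 * 0.5            # 7 wins and 2 draws out of 15 games = 8.0
trials = 100_000
at_least = sum(simulate_score(15) >= observed for _ in range(trials))
print(f"P(score >= {observed}/15 under the null) ~ {at_least / trials:.3f}")
```

With only 15 games, a score of 8/15 is entirely unremarkable under this null; the simulated probability comes out around one half, so the sample tells us almost nothing about whether 1.Na3 is genuinely better.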

 

landloch

Some points to consider in building a statistical model for this (which folks have already touched on):

1) It needs to account for the ratings of both players. An extreme (and unrealistic) example to illustrate the point: suppose that in the database being used, for a given line, the best move at move 12 was always played by the lower-rated player, while the second-best move was always played by the higher-rated player. In that case the best move will have worse results, because the lower-rated players would generally be outplayed. You could assume that with a large enough sample size this wouldn't be a problem ... but then again, maybe it would.

2) What is the time range of the database? Some lines that were once extremely popular have eventually been analyzed as being suboptimal (or maybe even bad). If the database goes far enough back in time, that suboptimal line may have a better winning percent than the new, better line. Indeed, the winning percent for the old line may always remain high, because now that people know to avoid it, it never gets played.

3) What are the time controls of the games? What works well in blitz may be poor in classical.

KevinOSh

All good points.

1. A nice thing is that player ratings give a pretty accurate indication of playing strength, so by simply summing the ratings of all the white players and comparing that against the sum for the black players, we can see what sort of skew there is.

2. No easy answers. I suspect it is better to have a wide time range on the assumption that having more data is better overall.

3. There are some opening databases that can be filtered on time control, and these days there tend to be more blitz and bullet games than anything else. Even at master level there are a lot of mistakes in those games. I would think it reasonable either to completely discount all fast chess games, or to include them but weight the slow chess games more heavily.

JackSmith_GCC

One small problem with the database on the "other" site is that its selection of master games is limited primarily to games played after 1980, which leaves a big hole where somewhat older opening theory is concerned, and especially impacts more antiquated openings such as the King's Gambit. 

That said, I've not found this issue has stopped me from gaining a good enough understanding of what's going on in a given opening.

landloch

Here’s a simple way to build the statistical model you are looking for.
 
For a given move, convert the W/D/L percentages to a score between 0 and 1. For example, W = 40%, D = 30%, L = 30%: the score for the move = W + 0.5D = 0.55.
 
Assume for any given position both players have an equal chance of scoring a point. This is rarely actually the case because of rating differences and because most positions are not equal … but you have to start somewhere. You can probably reduce the rating discrepancy by only selecting games of similarly rated players.
 
Then you do the math to figure out, given an expected score of 0.50, the probability of a score >= the actual one (e.g., 0.55) happening by chance in a sample of n games.
 
This is pretty basic stats, with instructions easily found on the internet. Would such a simple approach really be meaningful, given the assorted database issues discussed in this thread? I have no idea.
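A sketch of the calculation described above, using a normal approximation. The draw rate is an assumed parameter; with an expected score of 0.50 and equally likely wins and losses, the per-game variance works out to (1 - draw_rate)/4:

```python
import math

def prob_score_at_least(observed, n, draw_rate=0.3):
    """Probability of scoring >= `observed` (a fraction, e.g. 0.55) over n
    games if the true expected score is 0.50. Normal approximation; with
    equally likely wins and losses the per-game variance is (1 - draw_rate)/4.
    The draw rate here is an assumed parameter."""
    sd = math.sqrt((1 - draw_rate) / (4 * n))
    z = (observed - 0.5) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # standard normal survival function

print(prob_score_at_least(0.55, 100))    # landloch's example: 0.55 over 100 games
print(prob_score_at_least(0.55, 1000))   # the same score over ten times the games
```

Under these assumptions a 0.55 score over 100 games has roughly a 12% chance of arising from a dead-equal position, so it proves little; the same score over 1,000 games would be very unlikely by chance.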
 
In any event, using databases to find the best move is probably not a good approach. If you want to find the best move read theory and use computers. Databases are more of a rough guide that can steer you to playable positions and give you sense of what those positions entail. And eyeballing W/D/L% and sample size is probably just fine for that.

technical_knockout

dark wood board & neo pieces.  👍

are we metal background/dark mode twins too?

tygxc

Win rate means nothing.
Results correlate with rating difference, not with opening.

When a line is refuted in a single game, then the line is no longer played, but its win rate stands.

LeeEuler

One way to improve the model would be to use a Bayesian approach (too much info to explain it all but not too difficult to learn the basics of, there is plenty of stuff online about why it is useful, what it is used for, how to build models, etc.). Your prior could be based on the Elo difference between the players in the position, (e.g. a 100 pt spread means your prior is that the higher rated player will score ~2/3) as well as the expected scores of white or black at the start of the game. Your training set should be an amalgamation of pre and post computer era games, but honestly there are still some major problems with that method (I don't know much about openings, but I'd guess there are openings that used to score quite well, and now are considered borderline unplayable because of engine refutations, for example). 

Your point about sample size is a good one (though as a side note, I'm somewhat of a heretic in that I think both researchers and lay people tend to overemphasize a study's power when judging its usefulness). Most people's intuition matches reality: if our training set grows larger, the error in our predictions should decrease, and we need wider prediction intervals at smaller sample sizes. In the case of a gimmick like 1. Na3, it has been played so infrequently at the master level that any "takeaways" from analysis can probably be discarded.

But to give my two-cents I personally wouldn't guess that this would be a particularly useful exercise, unless you are doing it to just practice and learn different techniques. In other words, it will likely not give you insights that are better than the engine, for example. That being said, it could be interesting to look at very specific positions that the engine marginally favors one side, but for which in practical play, the reverse side scores higher. This might be particularly interesting if you break up the set into rating ranges. As another poster mentioned, how masters score in a position likely has little bearing on how well amateurs score in the position. 
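A minimal sketch of the Beta-prior idea described above. The prior strength `prior_games` and the draws-as-half-wins treatment are simplifying assumptions added for illustration, not part of LeeEuler's description:

```python
def elo_expected(diff):
    """Standard Elo expected score for a `diff`-point rating advantage."""
    return 1 / (1 + 10 ** (-diff / 400))

def posterior_mean(wins, draws, losses, rating_diff, prior_games=10):
    """Posterior mean score under a Beta prior centred on the Elo expectation.
    `prior_games` sets how many pseudo-games of weight the prior carries, and
    draws are counted as half a win plus half a loss; both are simplifying
    assumptions."""
    p0 = elo_expected(rating_diff)
    a = prior_games * p0 + wins + draws / 2        # pseudo-wins + observed
    b = prior_games * (1 - p0) + losses + draws / 2
    return a / (a + b)

# The 1.Na3 sample discussed earlier: 7 wins, 2 draws, 6 losses,
# with White roughly 60 Elo points stronger on average.
print(posterior_mean(7, 2, 6, rating_diff=60))
```

Here the posterior lands a little below the pure Elo expectation of about 0.59; the 15-game sample nudges the estimate down rather than up, which is the opposite of what the raw win rate suggests.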

KevinOSh
tygxc wrote:

Win rate means nothing.
Results correlate with rating difference, not with opening.

When a line is refuted in a single game, then the line is no longer played, but its win rate stands.

I can test this theory manually on the 1.Na3 move.

There are actually 23 games showing up on https://www.chess.com/games/search?fen=rnbqkbnr/pppppppp/8/8/8/N7/PPPPPPPP/R1BQKBNR%20b%20KQkq%20-%201%201

In one game a player is either unrated or the rating was not entered so I have removed that one.

White rating: 2489+2604+2630+2606+2451+2834+2385+2333+2138+2425+2486+2062+1802+2513+1944+2446+2400+2396+2395+2395+2375+2205

= 52,314 / 22 = 2378 average rating

Black rating:

2445+2494+2590+1866+2353+2484+2517+2129+2425+2528+2198+2216+2216+2045+2503+2303+2207+2302+2365+2235+2355+2215

= 50,991 / 22 = 2318 average rating

So on average white is 60 points stronger than black when playing 1.Na3, which probably explains the higher win rate.
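For anyone who wants to check the arithmetic, a short script reproducing the averages, plus the standard Elo expectation for the resulting rating gap:

```python
# Ratings transcribed from the 22 rated 1.Na3 games listed above.
white = [2489, 2604, 2630, 2606, 2451, 2834, 2385, 2333, 2138, 2425, 2486,
         2062, 1802, 2513, 1944, 2446, 2400, 2396, 2395, 2395, 2375, 2205]
black = [2445, 2494, 2590, 1866, 2353, 2484, 2517, 2129, 2425, 2528, 2198,
         2216, 2216, 2045, 2503, 2303, 2207, 2302, 2365, 2235, 2355, 2215]

avg_w = sum(white) / len(white)
avg_b = sum(black) / len(black)
diff = avg_w - avg_b
expected = 1 / (1 + 10 ** (-diff / 400))  # standard Elo expectation for White

print(f"White average {avg_w:.0f}, Black average {avg_b:.0f}, gap {diff:.0f}")
print(f"Score the rating gap alone predicts for White: {expected:.3f}")
```

A 60-point edge already predicts White scoring close to 0.59, so the rating skew alone accounts for (and even slightly exceeds) the "good" results of 1.Na3.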

blueemu
tygxc wrote:

When a line is refuted in a single game, then the line is no longer played, but its win rate stands.

Here's my favorite example of that:

 

 
This used to be an old main line of the Petroff Defense. If you check a database, you'll find that it has been played many times before, with White winning 35.7%, draws accounting for 21.4%, and Black winning 42.9% of the games.
 
Advantage: Black... yes?
 
... or maybe not.
 
 

 

Duckfest

Interesting question. Though I’m undecided on what the exact question is. I’ll just share my ideas on the topic.

These statistics are very dangerous to use without context and should be used in coordination with other information. Similarly to how it can be dangerous to go by engine suggestion without other context. I see many considerations already mentioned by others. 

  • The effect of player ratings, and difference between player ratings skewing results
  • The importance of time formats, winrates for classical or daily games are vastly different from bullet and blitz games
  • Refuted lines not being played

That being said, I also rely heavily on statistics from games played by others. Occasionally I will look at grandmaster games, but I'm more interested in what most players do. GMs, just like engines, tend to avoid lines that can be refuted, which leads to a very biased representation of a move's effect. That's why I prefer to use openingtree.com rather than the Explorer: it tells me what normal players do rather than what GMs do, and gives me a more predictable overview of the moves I need to consider and prepare for.

Winrate for the most popular move

KevinOSh wrote:

I have spent some time looking at opening databases and have been surprised by how many times the most frequently played move has a much worse win rate than a lesser played move.

Perhaps this is due to the surprise value of lesser played moves, or it is just a case of fooled by randomness?

Or perhaps there are many better moves out there waiting to be found and they aren't played enough because other moves are seen as better just because they have been played more?

You mentioned the most popular move having a lower winrate. A very important factor, no doubt, is that the most popular move is also the line your opponent is most familiar with. When other lines perform better in terms of winrate, I don't think it's because of surprise value or randomness; that makes it sound like an immediate effect. It's more that opponents don't know those positions as well as they know the most popular move. I'm going to simplify here. Assume the most popular move has a 51.5% winrate, and a less popular move has a 54% winrate, clearly higher than the #1. That means that if you play this alternative line, and you are as well prepared for it as the others who have played it, you can expect about two more wins and one more draw over the next 100 games, compared to the #1 move.

Or the other way around. If I know a position well, the best move played by my opponent is not challenging at all. No matter how objectively good his move is, if I already know what my response will be, the move will have zero impact. As long as we are both playing book moves, neither of us has any advantage; it's in unfamiliar territory that a player can make mistakes.

At the highest level that's essentially the core of the game: they all know all the main lines, and the entire focus of spectators and analysts is on which player will deviate from the main line, and when.

Statistics for winrates
It’s been a few decades since I actually studied statistics, so please allow this very basic line of reasoning. The core idea, when discussing sample sizes and confidence levels, is this: There is an underlying true value that we want to discover. For practical reasons we are not going to find the actual value, so we try to estimate that value by investigating a smaller sample. The confidence level refers to how certain we are about our estimation.

Underlying values
Ideally we would have every chess player play every chess position there is. After that we would have the exact winrate of every position, maybe even subdivided into subcategories controlling for time format, rating level, etc. In that case you could accurately determine the best move according to winrate.
As an example I will provide a position with the following winrates:

Option A. Will see Response AA 60% of the time, with a winrate of 55%; Response AB 30% of the time, with a winrate of 45%; and Response AC 10% of the time, with a winrate of 80%. Overall winrate: 54.5%.

Option B. Will see Response BA in 35% of games, with a 50% winrate; BB in 33% of games, with a 53.5% winrate; and BC in 32% of games, with a 52.6% winrate. Overall winrate: 52%.

Option C. An 80% chance of CA with a 51% winrate, a 16% chance of CB with a 48% winrate, and only a 4% chance of CC with a 12% winrate. Overall winrate: 49%.

Just to clarify: this is an oversimplification and the numbers are made up. Yet, for the sake of arguments, assume these values are reliable and a perfect representation of the outcomes for the entire population.
Looking at these numbers should give you some idea of the sample size needed. How many games would you need in the database before you discover the actual percentage with reasonable confidence? I haven't run a simulation on these numbers, but in order to get a good representation of these 'actual' numbers, the sample should be in the multiple hundreds at least.
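A quick simulation backs up the "multiple hundreds" estimate. It treats Option A's 54.5% as the true winrate (an assumption taken from the made-up example above) and looks at how much observed winrates scatter at different sample sizes:

```python
import random
import statistics

random.seed(1)

TRUE_RATE = 0.545  # Option A's "true" overall winrate from the example above

def observed_rates(n_games, trials=2000):
    """Observed winrates across many independent samples of n_games each."""
    return [sum(random.random() < TRUE_RATE for _ in range(n_games)) / n_games
            for _ in range(trials)]

for n in (20, 100, 500):
    spread = statistics.stdev(observed_rates(n))
    print(f"n={n:4d}: observed winrate {TRUE_RATE:.3f} +/- {spread:.3f} (1 sd)")
```

Even at 100 games the one-standard-deviation scatter is about five percentage points, so telling Option A (54.5%) apart from Option B (52%) really does take several hundred games.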

 

Using winrates
There are multiple arguments for not going by winrates alone, without context. That's what I did when I had just started playing again: I preferred the moves played by the global community over engine analysis, on the assumption that they were better players than I am and had probably done their homework. In many cases they hadn't.
The more I looked into move popularity and move winrates, the more I found out how unreliable they are on their own. Sure, without other information, I have no problem falling back on the wisdom of the crowds. As I started playing more common moves, I noticed my opponents did the same. Not surprising, as they have access to the same information I have.

Winrates don’t mean anything without context; the value is in connecting the numbers to their context. That’s my focus point. When I spot a discrepancy between the expected winrates and the actual winrates, that’s my starting point for further investigation.

 

Risk profiles
That’s why I try to combine winrates and ‘best moves’ with a risk profile. Just like in science, it’s harder to make a general statement than a specific one.
In a recent analysis I noticed a move that had a very decent winrate but the engine analysis was telling me it was not a good move. As it turned out, it had a positive winrate, in some lines over 60%, but against one specific response the winrate would drop significantly.


In this example the question was whether Nbd7 would be a good move. It's not a move that's played often, and it is probably very difficult to evaluate from overall statistics alone. However, when you zoom in and start subdividing your statements, it becomes much easier.

Consider the following statements

  • Against e4 and Nf3, the move Nbd7 has an average winrate of only 34%.
    • Both moves combined are played in 31% of games
  • Against all other moves, the winrate is around 59%.
    • In this sample, 116 games, equaling roughly 69% of games

All four statements can be verified even without a large sample. 

KevinOSh

To clarify, my main motivation behind this is to have a rule-of-thumb, non-cheating methodology to help me manually choose better opening moves in correspondence games.

The most accurate mathematical models would also estimate the probability of each of the next moves your opponent could make and the effect each would have on the eventual result of the game. That's an interesting project for anyone with a lot of time to invest, but as @duckfest says, the real issue with opening databases is that they give out data without proper context, and it is easy to be misled by them and choose bad moves.

For example, once upon a time a GM decided to play a joke opening move against an NM and won anyway, meaning the move has a 100% win rate, which looks a lot better than the 40% win rate of the actual best move.

It is also true that win rates in master games do not look anything like win rates in amateur games. In master games the draw percentage is about 30%, whereas in amateur games it is something like 5%.