Comparison of evaluations: Stockfish vs statistics

Yigor

Please look here

https://www.chess.com/forum/view/chess-openings/statistical-sharpness-and-evaluation

https://www.chess.com/forum/view/chess-openings/4-statistical-subcategories-of-moves-and-openings

for the definition of statistical evaluation. Stockfish 8 evaluations (0 lines, with a 1-minute limit on my computer) are on the left, and statistical evaluations are on the right. I consider openings with 100+ master games in the chesstempo database. (A sketch for reproducing the engine column follows the table.)

 

  • +0.35 Initial Position +0.39

 

  1. +0.24 1. d4 Queen's Pawn +0.46
  2. +0.21 1. Nf3 Réti +0.47
  3. +0.16 1. c4 English +0.48
  4. +0.06 1. e4 King's Pawn +0.30
  5. +0.04 1. e3 van't Kruijs -0.13
  6. -0.03 1. Nc3 van Geet -0.12
  7. -0.07 1. c3 Saragossa -0.05
  8. -0.09 1. f4 Bird -0.25
  9. -0.11 1. g3 Benko +0.42
  10. -0.11 1. a3 Anderssen -0.08
  11. -0.13 1. d3 Mieses -0.09
  12. -0.28 1. b3 Nimzo-Larsen +0.10
  13. -0.49 1. b4 Sokolsky -0.03
  14. -0.94 1. g4 Grob -0.21
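
For anyone who wants to reproduce the engine column, here is a minimal sketch using the python-chess library. The Stockfish binary path and the exact engine settings are assumptions; the statistical column comes from the chesstempo database, not from this script.

```python
# Minimal sketch: recompute the engine column of the table above.
# Assumes python-chess is installed and a Stockfish binary is on PATH.
import chess
import chess.engine

FIRST_MOVES = ["d4", "Nf3", "c4", "e4", "e3", "Nc3", "c3",
               "f4", "g3", "a3", "d3", "b3", "b4", "g4"]

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path is an assumption

# Evaluate the initial position, then each first move, with a 1-minute
# limit per position, mirroring the "1 min" setting described above.
positions = [("start", chess.Board())]
for san in FIRST_MOVES:
    board = chess.Board()
    board.push_san(san)
    positions.append((f"1. {san}", board))

for label, board in positions:
    info = engine.analyse(board, chess.engine.Limit(time=60))
    cp = info["score"].white().score(mate_score=100000)
    print(f"{label:8s} {cp / 100:+.2f}")

engine.quit()
```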
Yigor

It's quite consistent at the top, including the suboptimal place of 1. e4. Benko's Opening 1. g3 is the most notable exception, with a good statistical evaluation and a bad engine one. The Grob 1. g4, with its terrible evaluations, is probably refutable.

TatsumakiRonyk

Since Benko's Opening has statistically had significantly more success than the engine says it should, it seems likely to me that there is a strong refutation to it that isn't being utilized in the master games you collected the data from.

Yigor, do you have a way of telling what (if any) ECO codes those games transposed into after 1. g3?

Yigor
TatsumakiRonyk wrote:

Since Benko's Opening has statistically had significantly more success than the engine says it should, it seems likely to me that there is a strong refutation to it that isn't being utilized in the master games you collected the data from.

Yigor, do you have a way of telling what (if any) ECO codes those games transposed into after 1. g3?

 

No, there is no refutation; I guess it's some engine glitch. Most of the time it transposes into A07, the King's Indian Attack (KIA): 1. g3 d5 2. Nf3, which is quite a strong opening.
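
To answer the ECO question concretely, a sketch like the following could tally the ECO header of every 1. g3 game in a PGN export (the file name is a placeholder, and it assumes the database carries ECO tags):

```python
# Sketch: which ECO codes do 1. g3 games end up under?
# Tallies the ECO header of each game in a PGN file that opens 1. g3.
from collections import Counter

import chess.pgn

eco_counts = Counter()
with open("master_games.pgn") as f:  # placeholder file name
    while True:
        game = chess.pgn.read_game(f)
        if game is None:
            break
        first = next(iter(game.mainline_moves()), None)
        if first is not None and first.uci() == "g2g3":
            eco_counts[game.headers.get("ECO", "?")] += 1

# If the transposition claim is right, A07 (King's Indian Attack)
# should dominate the tally.
for eco, count in eco_counts.most_common(10):
    print(eco, count)
```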

TatsumakiRonyk

Thanks for the insightful posts!

ThrillerFan

Statistics and engine evaluations on move 1 are total hogwash. It's also laughable that the OP gives names to these when no opening has been established, except for the really rare first moves.

 

For example, 1.Nf3 is not the Reti.  The Reti specifically entails a d5 response by Black.  1.c4 is not necessarily an English.

 

1.c4 e5, now you are in an English.

1.c4 c5, now you will "almost always" end up in an English (see below).

1.c4 Nf6 2.Nc3 e6 3.e4, now you are in an English.

1.c4 e6 2.Nc3 d5, and now White has nothing better than 3.d4. This is not an English. This is a Queen's Gambit Declined.

1.c4 e6 2.Nf3 d5 3.b3 (or 3.g3).  This is not an English.  This is a Reti!

 

1.Nf3 c5 2.c4 - This is not a Reti, this is an English

1.Nf3 c5 2.e4 - This is not a Reti, this is a Sicilian

1.c4 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 g6 5.e4 - This is not an English or Reti, this is an Accelerated Dragon!

 

1.Nf3 d5 2.g3 is not yet a King's Indian Attack. This could just as easily end up a Reti if c4 instead of e4 is played. After 1.Nf3 d5 2.g3 c5 3.Bg2 Nc6 4.O-O Nf6 (Black has numerous other options; it's irrelevant which he chooses in this case) 5.d3 e6 6.Nbd2 Be7 7.Re1 O-O 8.e4, now that e4 has been established instead of c4, you are in a King's Indian Attack.

 

 

Long story short, neither engine analysis nor statistics mean squat at move 1. Statistics don't account for transpositions, and engines are long established to be best at forced tactical calculation, not opening or endgame evaluations. Computers still today claim K+R+N vs K+R is +3 for White. It's not!

 

When you look at statistics, so what if they say that White won 23,549 games after 1.c4, drew 17,539, and lost 19,539? Out of those games with 1.c4, many landed in QP openings. Let's say 2,491 of them end up as a King's Indian Defense. What good is factoring in the 2,491 games that started out 1.c4 Nf6 2.Nc3 g6 3.e4 d6 4.d4 Bg7 5.Nf3 O-O without factoring in the other 97 thousand games that went 1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.Nf3 O-O?

 

That's why statistics don't mean squat until you reach a position where the specific opening is established. The percentage for White (or Black) after 1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.Nf3 O-O 6.Be2 e5 7.O-O Nc6 8.d5 Ne7 9.Ne1 Nd7 10.Be3 f5 11.f3 f4 12.Bf2 g5 will paint you a more accurate picture than the percentage after 1.d4, because looking at just 1.d4 doesn't account for the garbage factored in there, or for direct transpositions to non-d4 openings, like 1.d4 e6 2.e4, or 1.d4 g6 2.e4 without c4 played, etc.
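
This transposition objection can be made mechanical: pool results by the position reached rather than by the first move. A rough sketch, assuming a PGN database and an arbitrary fingerprint depth of 10 plies (both placeholders):

```python
# Sketch: aggregate W/D/L by the *position* after 10 plies, so that
# 1.c4 and 1.d4 move orders reaching the same King's Indian tabiya
# are pooled together instead of being split across first moves.
from collections import defaultdict

import chess
import chess.pgn

PLIES = 10  # arbitrary depth at which to fingerprint the opening
RESULTS = ("1-0", "1/2-1/2", "0-1")
stats = defaultdict(lambda: [0, 0, 0])  # EPD -> [white wins, draws, black wins]

with open("master_games.pgn") as f:  # placeholder file name
    while True:
        game = chess.pgn.read_game(f)
        if game is None:
            break
        result = game.headers.get("Result", "*")
        if result not in RESULTS:
            continue
        board = chess.Board()
        for ply, move in enumerate(game.mainline_moves()):
            if ply >= PLIES:
                break
            board.push(move)
        stats[board.epd()][RESULTS.index(result)] += 1

# Most common positions after 10 plies, with pooled scores for White
for epd, (w, d, b) in sorted(stats.items(), key=lambda kv: -sum(kv[1]))[:10]:
    n = w + d + b
    print(f"{(w + d / 2) / n:.3f} over {n} games  {epd}")
```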

Yigor

ThrillerFan: For sceptical chess players like U, I created the thread:

https://www.chess.com/forum/view/chess-openings/derived-evaluations-for-better-understanding

Looking not only at first-layer statistical evaluations but also at the sequence of their derivatives, U get a better statistical understanding of the position. It takes into account all possible and imaginable transitions, transpositions, etc.

TwoMove

Finally somebody, ThrillerFan, says it. It was depressing that people were believing this mathematical-looking mumbo jumbo. When, eventually, some moves into the opening, you reach a position where decisions are important, the number of available games decreases sharply, especially if you only consider games of players above a certain rating.

Yigor

TwoMove: Everything seems to depress U. Enjoy life, buddy!

Yigor

Nothing is perfect. Nevertheless, the statistics basically give coherent results.

SmyslovFan

A friend on Facebook is considering writing a book about how to play the most challenging move in a bad position: not the computer's first-choice move, but the move that will give a human opponent the most difficult time.

Engine evaluations are good for measuring the objective value of a move against perfect play, but are just about useless as a practical tool when two humans go at it. Only a very few players in the world can even comprehend many of the comp's evals. 

One simple example is 0.01. That means the position is just about even, but there's not a forced draw in sight. Most fish see such an evaluation and declare the position drawn. Most tournament players know that there are many equal positions that are not even remotely drawish.

Statistics can help to determine how drawish a line really is, but even there, it's easy to be misled.

For example, the Slav Exchange (1.d4 d5 2.c4 c6 3.cxd5 cxd5) is statistically extremely drawish. But GMs play that variation to win quite a bit. When the game lasts more than 20 moves, the Slav Exchange no longer appears quite so drawish. In other words, the Slav Exchange appears drawish because so many players use the line to draw. The line itself has quite a bit of life in it, for those who are willing to go beyond the stats.
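
This claim is easy to test on a database: compare the Slav Exchange's draw rate overall with its draw rate in games that went past move 20. A sketch, with the file name as a placeholder:

```python
# Sketch: draw rate of the Slav Exchange (1.d4 d5 2.c4 c6 3.cxd5 cxd5)
# over all games vs. only games lasting more than 20 full moves.
# (Exact move order only; transpositions into the line are ignored.)
import chess.pgn

SLAV_EXCHANGE = ["d2d4", "d7d5", "c2c4", "c7c6", "c4d5", "c6d5"]

total = draws = long_total = long_draws = 0
with open("master_games.pgn") as f:  # placeholder file name
    while True:
        game = chess.pgn.read_game(f)
        if game is None:
            break
        moves = [m.uci() for m in game.mainline_moves()]
        if moves[:6] != SLAV_EXCHANGE:
            continue
        is_draw = game.headers.get("Result") == "1/2-1/2"
        total += 1
        draws += is_draw
        if len(moves) > 40:  # 40 plies = 20 full moves
            long_total += 1
            long_draws += is_draw

if total:
    print(f"all games:    {100 * draws / total:.1f}% draws ({total} games)")
if long_total:
    print(f"past move 20: {100 * long_draws / long_total:.1f}% draws ({long_total} games)")
```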

Yigor

SmyslovFan: Interesting thoughts.

Yigor

StupidGM: Wow, -0.75 vs Nakamura, that's a great result!

MickinMD

Note that Stockfish "1 min" evaluations can be different from longer or shorter (more or fewer plies) evaluations. If I want a quick analysis in Lucas Chess, I set it to Stockfish 8 at 12 plies. If I want a deeper analysis, I set it to 20 plies, which takes about 70 seconds per half-move on my quad-core, 3 GHz computer. The results can be much different for some moves.
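
For anyone who wants to see this effect for themselves, python-chess can run the same position under each limit; the binary path is an assumption:

```python
# Sketch: the same position analysed at 12 plies, 20 plies, and a
# 60-second limit can return noticeably different scores.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path is an assumption
board = chess.Board()  # initial position, as in the table above

for limit in (chess.engine.Limit(depth=12),
              chess.engine.Limit(depth=20),
              chess.engine.Limit(time=60)):
    info = engine.analyse(board, limit)
    print(limit, "->", info["score"].white())

engine.quit()
```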

Yigor

MickinMD: Yeah, sure. On my computer, 1-minute evaluations are approximately equivalent to 20 plies. And in the case of the initial position, they are not really different.