
How to Read Engine Evaluations
I encourage all readers to help me expand this post by contributing more examples!
Note: This post is meant to be scientific. Obviously, it's not a treatise on the subject matter, but I wanted to have a few original statistical insights to help support my main conclusions. Scientific or not, the motto is to have fun, so I hope that by writing this article I'll be able to introduce you, dear reader, to the wonderful world of chess engines!
Note 2: I mostly used Stockfish 5 in the analysis so that evaluation comparisons would be much more accurate (I'll always mention which engine I used and at what depth - the name of the engine will be included in parentheses next to the given example).
Note 3: This post was originally meant to be comprehensive, but I think that most chess players will be able to confirm my conclusions without me having to support them with over 50 experiments. In time, this post should be updated with an increasing number of examples that will help strengthen my conclusions, but for now, I see it as more of an exercise in futility to go to such lengths proving something I already know is true.
Note 4: The article needs to be read in its entirety because some concepts are clarified in multiple sections. I really hope you enjoy the content of this article!
Note 5: Please read the take-home message at the end of the article.
At some point in time, every chess player has stared into the eyes of evil and wondered, "What makes this darkness work?" Some chess players have even been noticeably frightened by this dark beast, trying to avoid it at all costs. In fact, even the strongest human players dare not look this monstrosity in the eyes. Actually, it has no eyes; it is too cold-blooded to even consider sharing anything with us pathetic human beings.
What am I talking about, you might be thinking? Well, I'm talking about chess engines, of course. The silicon beast is very frightening indeed, but if you tame it, it'll reward you with a huge amount of chess knowledge!
Without further ado, I would like to advise you on how to tame this machine once and for all.
1) Understand What They Want
Other than wanting to participate in some sort of Terminator-like apocalypse, chess engines also want to play the best moves. However, despite these machines' brilliance, they are not intelligent (yet?!), so there must be a way for us to know what exactly the machine is thinking.
Enter the concept of evaluations. The ability to make precise measurements is essential in science - and the evaluation function, with its brilliant simplicity, offers precisely such a means of numerically judging a chess position.
Here's how it works: The computer, using its raw processing power, will create countless possible board positions. Once those positions are created, the computer needs to know which one is best; to do that, it assigns a specific number to the board position. This number tells us human players who is better and by how much.
If the engine spits out a +3 evaluation, it thinks that White is ahead by three pawns (the "+" sign is used to denote that White is ahead, while the "-" sign is used to denote that Black is ahead). What does this mean? It means that whatever the position on the board, White is effectively up a minor piece*. Even if Black is actually up a minor piece, when the computer gives an evaluation like "+3," it thinks that White's initiative or long term compensation is so strong that White is effectively up a minor piece.
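To make this concrete, here is a minimal sketch of how you can query an engine for an evaluation yourself, using the python-chess library. It assumes a UCI engine binary named "stockfish" is on your PATH; note that UCI engines report scores in centipawns, so a +300 score corresponds to the "+3, effectively up a minor piece" reading discussed above.

```python
# A minimal sketch using python-chess (pip install python-chess).
# Assumes a UCI engine binary named "stockfish" is on your PATH.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")

board = chess.Board()  # the starting position; substitute any position you like
info = engine.analyse(board, chess.engine.Limit(depth=20))

# Scores come back from the side-to-move's point of view;
# .white() converts them to the usual "+ means White is better" convention.
score = info["score"].white()
print(score)  # a centipawn value such as +25, or a mate score

engine.quit()
```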
Now, despite all their merits in helping humans assign a numerical value to positions, evaluations aren't perfect. I will explain my position with supporting evidence later, but for now, here are some important ideas concerning evaluation functions:
Generally, the evaluations work out as follows:
0.00 to ±0.5: Equal
±0.5 to ±1.0: White / Black is slightly better
±1.0 to ±2.0: White / Black is much better
±2.0 to mate in X: White / Black is winning
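As a toy illustration of these bands, here is a small helper of my own (purely illustrative, not anything an engine actually uses) that turns a numerical evaluation in pawns into the verbal labels above:

```python
# A toy helper reflecting the rough bands above.
# Input is the evaluation in pawns: positive = White ahead, negative = Black ahead.
def describe_eval(pawns: float) -> str:
    side = "White" if pawns >= 0 else "Black"
    magnitude = abs(pawns)
    if magnitude < 0.5:
        return "Equal"
    if magnitude < 1.0:
        return f"{side} is slightly better"
    if magnitude < 2.0:
        return f"{side} is much better"
    return f"{side} is winning"

print(describe_eval(0.3))   # Equal
print(describe_eval(-1.4))  # Black is much better
print(describe_eval(2.5))   # White is winning
```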
1) Evaluations in the realm of ±0.3 aren't too helpful. They're not all that accurate; such positions are considered "equal," as in, closer to 0.00 than not. That said, the "equal" label isn't the whole story either - a +0.3 position does indeed reveal something: White is potentially better.
Usually, in a +0.3 position, White has more room for error than Black. However, it matters greatly at which stage of the game the evaluation is given: ±0.3 in the opening basically means nothing (more on that later), in the middlegame it is a decently strong predictor of which side will eventually be better and which side has more leeway for error, and in the endgame it is basically completely meaningless (continue reading to see why).
2) Unless the position is sufficiently complex, evaluations in the range of ±1 are quite accurate in the middlegame. In very complex middlegames, a -1 evaluation could be meaningless because there is some forced sequence that leads to an objectively drawn position (which the computer misevaluates as -1, or which it hasn't even reached due to the horizon effect). In the opening, ±1 evaluations are reasonably accurate at predicting which side has a close-to-decisive advantage. However, in the endgame, computers evaluate many drawn positions as ±1 because they essentially just count material when they see no way to make progress. For example, some completely drawn opposite-colored bishop endgames where one side is a pawn ahead are evaluated as ±1, but it is clear to any human player that the game will most certainly end in a draw. Of course, the "endgame effect" does not solely occur in overly simplified positions. Endgames are also breeding grounds for fortresses, so without human assessment, an engine would consider many endgame positions to be slightly better for one side when in reality the draw is as assured as one arising from stalemate.
3) ±2 or ±3 evaluations generally don't suffer from the endgame effect and are basically quite accurate (with what the engine deems "perfect play," White / Black should be able to win when given those evaluations). In some instances, ±2 or even ±3 evaluations do suffer from the endgame effect, but those occurrences are extremely rare. Any higher than ±3 and the win is basically guaranteed. One could even advance the strong hypothesis that if we presented a position with a +4 evaluation to two engines, giving the normal engine White and a "perfect play engine" (an engine that has solved chess) Black, the "weaker" engine should always be able to win, even against perfect play. In the opening and middlegame, ±2 positions are basically always wins for the stronger side, with what the engine deems perfect play.
Final note: To better understand computer evaluations, I came up with something I call the "stability factor." The idea is simple enough: below a certain depth, the engine's evaluation keeps shifting - it doesn't yet understand the position it is analyzing and needs more time to comprehend it. This problem is easily solved at higher depths, where a certain stability is reached, and that stability tells you when you should stop the analysis. Let's consider an example:
Say you have a certain position you want to analyze and the engine gives an evaluation of +1.2 at depth 25. At depth 27, however, it gives an evaluation of +1.8. At depth 31, it even gives an evaluation of +4.2! At depth 33, though, it gives +4.3. At depth 34, it's back at +4.2. Thus, we can say that at this point, the engine's evaluation has stabilized. It is rare for an engine to change its evaluation much after reaching a ~3-ply stretch of stabilization (that is, after its evaluation has remained almost constant across three or more increases in ply count), so most analysis past the point of stabilization is effectively useless.
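Here is a sketch of how one might automate this stopping rule with python-chess; the engine name, depth range, and tolerance are all illustrative choices of mine, not canonical values:

```python
# A sketch of the "stability factor" stopping rule: deepen until the
# evaluation has stayed within a small tolerance for three consecutive depths.
import chess
import chess.engine

STABLE_PLIES = 3    # consecutive depth increases that must agree
TOLERANCE = 0.15    # maximum change, in pawns, still counted as "stable"

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
board = chess.Board()  # substitute the position you want to analyze

last_eval = None
stable_count = 0
for depth in range(12, 41):
    info = engine.analyse(board, chess.engine.Limit(depth=depth))
    # Convert the centipawn score to pawns, folding mates into a huge number.
    pawns = info["score"].white().score(mate_score=100000) / 100.0
    if last_eval is not None and abs(pawns - last_eval) <= TOLERANCE:
        stable_count += 1
    else:
        stable_count = 0
    print(f"depth {depth}: {pawns:+.2f}")
    last_eval = pawns
    if stable_count >= STABLE_PLIES:
        print(f"Stabilized at depth {depth}; deeper analysis is likely wasted.")
        break

engine.quit()
```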
For the opening analysts out there, that means that having an engine keep analyzing a position once its evaluation has stabilized is pretty much useless - that's as far as the engine will see. The rare cases where the engine does change its evaluation past the stabilization point do not warrant the very high amount of time needed to run the engine that far.
If you're not satisfied with the engine's evaluations at the stabilization point, you either need to improve the engine's evaluation algorithms so that it can analyze more positions (buy a stronger engine, or contribute to / make one yourself!), or you need drastic improvements in hardware that help you achieve extremely high depths.
* Bishops and knights are traditionally worth three pawns, but these values are obviously not set in stone; Larry Kaufman, in his brilliant article, The Evaluation of Material Imbalances in Chess, gives the minor pieces a value of 3.25 pawns.
2) A Draw? - Why Computers are Liars
We've touched on this subject in the previous section. Basically, certain evaluations which give an advantage to either side may actually not be accurate at all.
Example 1 (Stockfish 5)
Let's see an exception to our rule (the "±3 rule," which states that evaluations above ±3 are basically always winning):
Here, the engine gives a +5 advantage to an endgame that's completely drawn! Now we get to see the importance of human intervention: let's consider that, in a previous position, there were two options. The first leads to this +5 endgame position (which is completely drawn), and the second leads to a +3 endgame position (which is a simple queen vs. rook win); which one would you choose, and which one would the engine choose?
Again, the exceptions to the ±3 rule are quite rare, and here it's clear why the computer misevaluates the position: it simply thinks the queen is much better than the rook, and that's the whole story.
Furthermore, fortress positions are pretty much never given the correct evaluation by engines - a +5 evaluation could very well mean that White has no way of making any progress. Again, in such endgames, the stability factor would make it clear whether there's a win or not (here, the evaluation remains at ~+4.94 for over 30 increases in ply count - the engine is finding no way forward!).
I encourage the readers to present the most difficult-to-crack and unbelievable fortresses they know of!
3) Depth Woes - Searching in the Mariana Trench
Arguably, the "depth matters" concept is one of the most important pieces of information you need to know about chess engines. There comes a point where increasing depth yields diminishing returns, but on average PC hardware it takes a long time for the evaluation to stabilize. Therefore, if you're doing analysis on your own computer, more often than not it's best to let the engine run for quite some time so that it comes as close as possible to the "truth" of the position.
Example 1 (Stockfish 5)
Here we see the difference between perfect play, an engine at very high depths, and the same engine at lower depths (it is rare for the engine to actually play like a tablebase! Remember, endgames are the engine's weakest point precisely because its evaluation functions are highly inaccurate in endgame positions).
Here's the tablebase's opinion of the position (in simplified positions, the tablebase has basically solved chess and is capable of playing perfectly):
With perfect play by both sides, Ra1 is the only drawing move! This position was actually taken from a game Carlsen played against Caruana at the 2013 Tal Memorial. Remarkably, Carlsen found Ra1 in the press room (!!!), but later discounted it after discussing it with Caruana. The best way to prove to a chess player that he knows nothing about the royal game is to have him play rook endgames against a tablebase (for more kicks, have him use an engine - the engine won't help him much in rook endgames, so he'll still be basically helpless against the tablebase). Note: Make sure you're not doing this exercise with Akiba Rubinstein - he might find a mate quicker than the tablebases would!
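If you want to consult a tablebase yourself, python-chess can probe local Syzygy files. A minimal sketch, assuming you have downloaded the tablebase files into a ./syzygy directory (the path and the sample position are illustrative, not from the game above):

```python
# A sketch of probing Syzygy tablebases with python-chess.
# Assumes the .rtbw/.rtbz files have been downloaded into ./syzygy.
import chess
import chess.syzygy

with chess.syzygy.open_tablebase("./syzygy") as tablebase:
    board = chess.Board("4k3/8/8/8/8/8/8/4K2Q w - - 0 1")  # KQ vs K, White to move
    wdl = tablebase.probe_wdl(board)  # 2 = win, 0 = draw, -2 = loss (side to move)
    dtz = tablebase.probe_dtz(board)  # distance to the next zeroing (pawn/capture) move
    print(wdl, dtz)
```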
Now, let's see what an engine at high depths would think of the previous position:
At depth 40 (note that the engine can reach very high search depths in simplified positions), the engine agrees with the tablebase! Do note: generally, this is a rare occurrence, and the engine had to go to very high depths for it to make accurate conclusions about the position and for its evaluation function to be accurate (remember that -0.3 is equal). I am not giving this example to show that the engine is capable of playing like a tablebase, but this example shows the importance of search depth.
So, here's the same position but at depth 32 (remember, when analyzing simplified endgame positions, the engine reaches higher depths more quickly: depth 32 in this position could be compared to much lower depths in the middlegame):
Now, in this example, Stockfish did find the tablebase move, but it was completely misevaluating it! If someone were to take a quick glance at the -2 evaluation, he or she would actually think that Black is winning! In practical terms, it's clear that this position is almost impossible for human players to defend (after all, Carlsen himself couldn't do it), but what is practically impossible is surely different from what is objectively impossible.
Example 2 (Stockfish 5)
Here, no matter the depth, the engine still gives this position as better for White. Is it an easy draw in human terms? It depends, but objectively speaking, it's a dead draw (Black's h-pawn will fall, at which point we can consult the 6-man tablebases)!
Here's what the engine thinks (a +1.41 advantage):
And the objective truth (which humans should be able to tell as well!):
4) The PV Breakdown
This is the principal variation, circled in red:
It is the set of moves that the engine thinks is best for both sides. PVs in endgames are helpful, but in middlegames, they tend to "break down."
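In python-chess, the PV comes back alongside the score as a list of moves. A minimal sketch (the engine name and depth are illustrative):

```python
# A sketch of reading the principal variation with python-chess.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
board = chess.Board()  # substitute the position you are analyzing
info = engine.analyse(board, chess.engine.Limit(depth=20))

# info["pv"] is the engine's best line for both sides, as Move objects;
# variation_san() renders it in human-readable notation.
print(board.variation_san(info["pv"]))

engine.quit()
```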
New TCEC (Thoresen Chess Engines Competition, the "Unofficial World Chess Computer Championship") viewers always exclaim that the engines' PVs are just crazy, and sure enough, they're right: the end position of the PV "breaks down" because the engine sometimes has to make tough decisions between two moves somewhere in the middle of the PV, and it usually ends up picking the worse of the two (human players will immediately see that the tail ends of such PVs make little to no sense).
Example 1 (Stockfish 5)
Here's a recent game between Fabiano Caruana and Magnus Carlsen. I chose this game because it's probably too complex for even a human + computer pair to completely understand (and this actually speaks volumes about the complexity of chess, since compared to incomprehensible games like Ivanchuk - Yusupov, 1991, the Caruana - Carlsen game is a piece of cake), and it highlights some very important ideas that one should be aware of when looking at the PVs that the engine gives.
Let's take a look at the PV that Stockfish gives for Black on his 22nd move (taken at depth 39):
Now, a critique of the individual moves:
Let's start with the ridiculous looking Qd3:
Obviously, the move isn't without its points, but many moves make much more sense (we are speaking from an engine perspective here! The move the computer thinks is best is not immediately obvious to a human).
In fact, here is the position just before Qd3:
Qd3 is the fourth best move!
Stockfish finds the ridiculous looking but actually quite strong (once you understand the point behind the move) Kg1!
Now, this is not an analysis of the Caruana - Carlsen game, but it's interesting to see just how strong of an idea Stockfish came up with:
Now, don't expect the engine to play like a complete patzer in the PV, but there are points where its play can indeed be improved upon.
Finally, let me present two important concepts that will allow you to interpret endgame evaluations more precisely. One: look at the PV and see what the engine is actually doing in it (if, for example, the engine is making no progress at all in the PV, the position is drawn). Two: notice whether multiple moves give the exact same evaluation - this is a good indicator of a draw, since many moves "improve" the position by the same numerical amount. Sometimes, of course, only one move draws, but if multiple moves give the same evaluation, it's almost certainly a draw (a sketch of this second check follows below).
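Here is that second check using the engine's MultiPV mode via python-chess, asking for the top three lines (the engine name and depth are, again, illustrative):

```python
# A sketch of the "multiple moves, same evaluation" check using MultiPV.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
board = chess.Board()  # substitute your endgame position

# Ask for the engine's three best lines instead of just one.
infos = engine.analyse(board, chess.engine.Limit(depth=25), multipv=3)
for info in infos:
    first_move = board.san(info["pv"][0])
    print(f"{first_move}: {info['score'].white()}")
# If several top moves come back with essentially the same score in an
# endgame, a draw is the likely verdict.

engine.quit()
```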
Conclusion
Chess is a complex game, but fret not, chess lover! Humanity's unquenchable thirst for knowledge has produced some high-powered machines capable of aiding us in understanding our beloved game. It's up to you to make good use of these helpful friends, and I hope that my article has at least helped a bit in introducing you, my dear reader, to such concepts as "The PV Breakdown," "The Engine Endgame Effect," and the "Stability Factor."
Take-Home Message: It is of prime importance that one not view the engine evaluation as a standalone tool that correctly evaluates the position on its own. Beyond the raw number, one should note how exactly the engine continues to play the position. If it gives +1 yet heads into a pawn-up rook endgame with extremely high drawing chances, its evaluation is clearly tracking the material on the board rather than the likely result of the game. Moreover, when going through an engine line to see if its evaluation makes sense, it is important to check whether the line it gives is actually best at every turn - more often than not, ten moves into the line, the engine will start giving some nonsense moves.
Therefore, to use an engine correctly, one must use it alongside an opening database and an endgame tablebase, as well as with the proper mindset: a) let it reach sufficient depths, b) play through the lines it gives, and c) improve the lines it gives lest they suffer from the "PV breakdown" issue already mentioned in the article.