Science of Chess: Proving yourself wrong
Unsplash stock image by Kenny Eliason


I count myself very fortunate to be a scientist. I love running my lab, and as we approach the beginning of the Fall term (where did summer go?) it's time for me to start designing the experiments we'll be deploying for kids and grown-ups to take part in over the next few months. Without a doubt this is my favorite part of the job. Thinking through how to put an experiment, or more typically a series of experiments, together so that you're confident that you'll learn something new when it's all wrapped up is endlessly challenging and fun. This is when my students and I dream up the image manipulations, the tasks, and the analyses that we hope will offer a new perspective on the human visual system.

A much younger NDPatzer looking at some EEG data from the Balas Lab at North Dakota State University.

Those dreams always feel loaded with promise and potential - this is when we think through how to use new tools for changing the appearance of faces or textures in clever ways, or consider how to get new kinds of data out of behaviors like drawing, coloring, or even just looking at things. Our discussions about what to do next always start with some idea we're excited about - some possible story about how human vision works that we think might be important. If we're lucky enough to be right, maybe this is when we'll find something out that really pushes the field forward.

"Kill your darlings."

That's the fun part.

The difficult part of all of this is that in the lab we can't just let ourselves keep dreaming of the exciting stuff that might happen in our experiments. To be more specific, the trap we have to avoid is getting too carried away by the possibility that we might be right. In the lab, it's critically important that we think through what should happen not just if we're right, but also if we are completely mistaken. Sure, those neat image manipulations and clever tasks often arise from thinking through how people should behave if we're right about the human visual system, but to ensure that our experiments can tell us anything meaningful we have to invest a lot of time poking holes in our own best ideas. What is something that should only happen if we're wrong? Hopefully we can come up with a few different ways we could be mistaken and a few corresponding ways to check. You can't hope to rule out every competing account, but you've got to give it a real try.

In the context of writing, there's a similar principle that is sometimes expressed via the phrase at the beginning of this section: Kill your darlings. For an author, this phrase is intended as advice to be merciless when reviewing your own work. Maybe you loved a particular turn of phrase or were dying to start your story with an introduction that you thought was surprising and creative - that's great and all, but when it's time to edit your work, you can't let your love for the first draft stop you from rejecting those ideas if they lead to a better second draft.

Stock photo.

It's tough. It's REALLY tough sometimes. As a fairly regular blogger, I can't tell you how many times I've had to really force myself to change passages (or reject entire posts!) that I felt invested in but that ultimately weren't very good. Besides the challenges of facing up to sunk-cost reasoning, killing your darlings requires you to make peace with the idea that even if you really liked something it still might not be all that great. Difficult, to be sure, but often necessary to produce work that's worthwhile.

Confirmation Bias and trying to Minimax OTB

As in the lab and on the page, so also on the chessboard.

While there are plenty of obvious differences, coming up with a good next move has a lot in common with designing a good experiment and writing good prose. In all three cases, you need to be willing to kill your darlings to really get ahead! What I mean by this is that over the board, it's crucial to avoid playing pure "Hope Chess" and imagining only the Best of All Possible Worlds for yourself. Sure, it's possible that your opponent will take the piece you're intending to sacrifice (brilliantly) and walk right into the 3-move mate you've calculated - but what if they don't? What if you're wrong about the next move they will make? To be a really strong player, you need to consistently think about the move your opponent could make that will cause you the most trouble and plan accordingly.

In the first class I took on Artificial Intelligence as an undergrad, we learned an algorithm for implementing this kind of reasoning called the Minimax Algorithm. Minimax is a decision-making procedure that uses a game tree (a branching structure capturing the full set of choices each player could make on each turn of a game) and an evaluation function (a way to measure the value of a position to a player) to determine which choices lead to the best outcome for one of the players. I'll walk you through a short example from geeksforgeeks.org in a second, but the underlying logic of the algorithm is that while the player wants to maximize the value of the position for themselves, decisions about the best next step need to include a minimizer who will always choose a next step that worsens the player's outcome as much as possible. To see how this works in a simple tree, take a look at the image below:

A diagram of a hypothetical game tree with evals from geeksforgeeks.org

If you're the player in this scenario, your job is to work out if it's better to take the path labeled L or the path labeled R. After you make that choice, your opponent will also get to choose which path to take at the next level of the tree and your "score" will be the number you end up on (the higher the number, the better it is for you). If we assume you're playing a game with perfect information, the idea is that you have access to those values to help you make a choice. Now here's the thing to pay attention to: That 9 at the bottom right looks REALLY tempting. It's by far the best outcome on the tree and you can only get there if you try the path labeled R. Maybe it's a good idea? You want to win, right? Why not aim for the outcome that's the best thing that could happen to you? After all, the opponent is one slip-up away from you winning big.

This, dear reader, is the maximizer talking - if we're only thinking about the best outcome we could reach, this sure seems like the right answer. The obvious (I hope) problem is that the maximizer is essentially very good at hope chess: You could end up at the 9, but your opponent could also choose to go left after you go right, and now you'd end up with a score of 2 - the worst outcome on the tree. To avoid getting stuck there, we have to consider the minimizer's relentless drive to punish us as well as the maximizer's sunny optimism. If we choose L, the minimizer will be sure to pick the 3 instead of the 5, so we have to consider that score to be the best we can do on that side of the tree. Likewise, if we choose R, we have to imagine that the minimizer will pick the 2 instead of the 9, making that the best we can hope for. Faced with the choice between the minimizer's 3 and 2, the maximizer may sigh a little, still dreaming of the 9 that might have been, and choose (correctly) to go to the left. In a small game tree like this with a reliable evaluation function, minimax is something that human players can implement reasonably well: If you've played Tic-Tac-Toe, Nim, or the Dot Game, you've almost certainly used some version of minimax reasoning fairly successfully.

Image credit: Yonidebest at Hebrew Wikipedia, transferred from he.wikipedia to Commons (Attribution). https://commons.wikimedia.org/w/index.php?curid=2237158
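
To make that walk-through concrete, here's a minimal Python sketch of the minimax recursion on the little tree above. The leaf values (3, 5, 2, 9) come from the geeksforgeeks example; the function and variable names are just my own.

```python
# Minimal minimax sketch for the small example tree above.
# Leaf values follow the geeksforgeeks diagram: L -> (3, 5), R -> (2, 9).

def minimax(node, maximizing):
    """Return the best achievable value from this node.

    `node` is either a number (a leaf evaluation) or a list of child nodes.
    `maximizing` is True when it's the maximizer's turn to choose.
    """
    if isinstance(node, (int, float)):  # leaf: just return its evaluation
        return node
    child_values = [minimax(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)

# The tree from the figure: the maximizer picks L or R, then the minimizer replies.
tree = [[3, 5], [2, 9]]

left_value = minimax(tree[0], maximizing=False)   # minimizer picks 3
right_value = minimax(tree[1], maximizing=False)  # minimizer picks 2
best = minimax(tree, maximizing=True)             # overall value of the position

print(left_value, right_value, best)  # 3 2 3 -> choose L, despite the tempting 9
```

Running this confirms the reasoning above: the 9 never enters the final answer because the minimizer will never let you reach it.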

What about chess, however? How do we do at minimaxing when faced with a much larger tree and an evaluation function that engines have access to, but that humans can generally only guess at? The study I want to tell you about next takes a look at exactly this, with an emphasis on trying to measure how willing players of different strength are to really consider killing their darlings on the chessboard.

Chess calculation as hypothesis testing - Cowley & Byrne (2004)

The authors of this study recruited two groups of 10 participants each: A group of experienced novices with a mean ELO of about 1500 and another group of expert players (mean ELO over 2200). Both participant groups were asked to consider a set of 6 different positions, all of which had equal chances for White and Black and were middlegame positions with 20 or so pieces still in play. These positions thus still provided lots of options for candidate moves, none of which were especially forcing (see below for an example). For each of these positions, participants were asked not just to choose the move they would like to play next, but also to do their best to narrate their thought process while making this decision. This is referred to as a Think Aloud protocol, and their goal in using it was to answer the following question: Do experts do a better job than novices at thinking through the worst possible outcomes?

Figure 1 from Cowley & Byrne (2004) depicting an example position from their study.

What I quite like about this approach is that it has the virtue of being quite simple for the participants to understand (they're mostly doing what they would usually do during regular play) but yields very rich data about the decision-making process. The tough part about data like this is working out what to do with it all, however! These days a lot of research groups are exploring how to use tools like LLMs to analyze natural language data like this, but those tools were still 20 years away when this study was conducted. Here, the authors used each bit of transcribed chess monologizing to come up with what they call a problem behavior graph. This is essentially the portion of the game tree that a player came up with in their transcribed speech, including both different move sequences they considered and their sense of whether this was going to lead to something good or something bad. The dashed lines you see here also capture an intriguing feature of this Think Aloud protocol: Players sometimes skip the opponents' moves! Though it doesn't seem like they use this data for much in the conference paper, I thought it was neat that it was part of the transcription.

Figure 2 from Cowley & Byrne (2004). This problem behaviour graph organizes the move sequences a player thought through, ending with the player's determination of whether this was a positive or negative outcome for them.
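
Just to give a flavor of the kind of structure the authors are working with, here's a toy encoding of a problem behaviour graph. To be clear, this is my own hypothetical representation (and the moves are invented for illustration), not the authors' actual format: each calculated line pairs a move sequence with the player's verdict, and a None entry stands in for a spot where the player skipped over the opponent's reply while thinking aloud.

```python
# A toy, hypothetical encoding of one player's problem behaviour graph.
# Each entry is a calculated line plus the player's own verdict on it.
# None marks a skipped opponent reply, like the dashed lines in Figure 2.

problem_behaviour_graph = [
    {"line": ["Nf5", "gxf5", "Qg3+"], "player_eval": "positive"},
    {"line": ["Nf5", None, "Qg3+"],   "player_eval": "positive"},  # opponent move skipped
    {"line": ["Rad1", "Rfe8"],        "player_eval": "negative"},
]

for entry in problem_behaviour_graph:
    skipped = sum(move is None for move in entry["line"])
    print(entry["line"], entry["player_eval"], f"skipped replies: {skipped}")
```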

The authors consider the graphs generated by just 10 players (5 experts and 5 novices), which to be honest is something I don't love about this paper. To be fair, this is challenging work to do (the transcription process is a real time sink - ask me how I know) and recruiting any special population (like the Expert group) is always tough. Still, I do want to advise you to take these results with some serious grains of salt because our statistical power is quite low.

That said, the authors coded the various move sequences in each problem behavior graph according to how the players' evaluation matched up with an engine's evaluation (the venerable Fritz 8 here) of the position. They defined confirmation bias sequences as those where the player evaluated the sequence as positive but the engine ruled it negative. Similarly, they defined falsification sequences as those where both the player and the engine thought the sequence was negative. The question is, how often did players in each group generate each kind of sequence?
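
The coding rule itself is simple enough to write down. Here's a small sketch of how I'd express it; the labels and function are my own shorthand for the definitions above, not anything taken from the paper.

```python
# Classify one calculated sequence by comparing the player's verdict
# with the engine's verdict (both "positive" or "negative" for the player).

def code_sequence(player_eval, engine_eval):
    if player_eval == "positive" and engine_eval == "negative":
        return "confirmation bias"  # player thinks the line works; engine disagrees
    if player_eval == "negative" and engine_eval == "negative":
        return "falsification"      # player correctly rules the line out
    if player_eval == engine_eval:
        return "agreement"          # player and engine see it the same way
    return "other"

print(code_sequence("positive", "negative"))  # confirmation bias
print(code_sequence("negative", "negative"))  # falsification
```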

Table 1 from Cowley & Byrne (2004). The authors classified players' move sequences according to the relationship between their evaluation of the outcome and the engine's evaluation.

First, it's worthwhile to point out that experts produced more sequences than novices in general. While novices considered about 6 move sequences on average, the experts came up with just over 8. In general, experts also tended to agree with the engine's evaluation more often than novices, generating about 6 correctly evaluated sequences on average compared to the novices' average of about 2.5. Experts think through more options and tend to be right about the outcome, too.

The potentially more interesting result, however, has to do with the coding of these sequences according to confirmation bias and falsification. What kinds of sequences are experts considering, and how different do they look from the novices'? It turns out that according to the authors' coding scheme, expert players engage in more falsification than novices and less confirmation bias. Again, be a little mindful of the stats here, but the figure below shows you the average number of sequences of each type generated by participants in the two groups. Modulo the worries about the small sample size, etc., this looks like evidence that stronger players spend much more time thinking about the minimizer than novice players do.

Cowley & Byrne's (2004) data comparing falsification and confirmation bias behavior in their Expert and Novice groups.

Some wishful thinking about the study and possible next steps

There are a lot of things I quite like about this study - enough of them that when I was asked by NM Ben Johnson to describe 3 studies I thought chess improvers should know about, this was one of the ones I named! What stands out to me about this one is both the willingness to wade into complicated behavioral data (coding the Think Aloud data, specifically) and what feels like a more direct relationship to real OTB chess. I try to be careful in these posts not to present scientific research about chess as some kind of ticket to improvement, mostly because these studies aren't focused on those outcomes. What I think is neat about this paper, though, is that it does feel to me like it provides some insight into something novices could stand to work on. Kill your darlings. Remember that your opponent isn't going to be kind to you, so don't let yourself off the hook when you calculate. If you see a good move, wait; it may turn out to be a bad one.

The thing is, though, that this is also a study where I think the rivets show rather a lot. I really wish there were more participants. I wish the authors had reported how often players in each group skipped moves in the sequences they generated! I also would kind of like to see (my AI skepticism notwithstanding) what modern tools for summarizing natural language could do to augment this coding scheme. Besides these fairly simple ideas for things to do to expand on this work, I also can't help but think about their definitions of falsification and confirmation bias. In particular, it doesn't feel quite right to me to say that the novices are engaging in more confirmation bias when part of the problem is that they're also just wrong about the eval function more often! It's interesting to see (as the authors report in the text) that novices tend to come up with positive impressions of positions a little more often than experts, but that's a subtler point than the one they're trying to make.

Figure 3 from Cowley & Byrne (2004) - Here, the authors consider how often both groups tended to agree with the engine's evaluation of positions they calculated (Objective outcomes) as opposed to thinking positions were better (positive) or worse (negative) than the engine did. Novices tended to be a little more optimistic overall, but not by much.

I hope the above discussion isn't taking you too far into the weeds, but given the big ideas behind this paper I think it's especially important to remember that constructive skepticism matters in chess and in science. This stands out to me as a compelling attempt to measure something quite complicated: How do different players reason differently during attempts to calculate? There are hints of something interesting here, but also room to do more - after all, what if we're wrong? Compared to some of the research I've discussed here, this one feels especially ready for some replication and some extension. If you've got ideas about this, I'd love to hear them in the comments. Regardless, hope you enjoyed reading about this study and hope to see you next time!

Support Science of Chess posts!

Thanks as always for reading! If you're enjoying these Science of Chess posts and would like to send a small donation my way ($1-$5), you can visit my Ko-fi page here: https://ko-fi.com/bjbalas - Never expected, but always appreciated!

References

Cowley, M., & Byrne, R. M. J. (2004). Chess masters' hypothesis testing. In K. D. Forbus, D. Gentner, & T. Regier (Eds.), Proceedings of the Twenty-Sixth Annual Conference of the Cognitive Science Society (pp. 250-255).

Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.

Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12, 129-140.

Monthly posts describing research into the cognitive science and neuroscience of chess.