
Science of Chess: Can you tell an easy puzzle from a tough one?
I love puzzles. My morning routine usually includes the New York Times Connections puzzle, the Strands grid, and the venerable Wordle. From there, it's off to Gisnep for a daily drop-quote, followed by my new favorite, Squardle. If I need a quick break during the day, I sneak in rounds of anagram solving over at WordWalls to convince myself that I might still have a shot at being Scrabble champion of North Dakota. (Note: I almost certainly do not have a shot at this, but let a fella dream every once in a while.) My wife and I still love to take part in the MIT Mystery Hunt when we're able, and we were fortunate to be part of a winning team a few times. This gave us a chance to write puzzles for the next year's solvers, which is very relevant to the study I'm about to describe for you. In general, puzzle-solving appeals to me not just for the challenge of trying to find the right answer(s), but because good puzzles can be surprising, or funny, or revelatory. For the longest time I didn't get what people liked about Nikoli's logic puzzles until one made me laugh out loud. Maybe that sounds weird to you, but if it does, I'd say you just haven't done enough Herugolf puzzles yet.

Among all the different kinds of puzzles I do, chess puzzles occupy a unique niche. The mathematician G.H. Hardy referred to them as "the hymn tunes of mathematics," dismissing them as unimportant in the grander scheme of understanding number, shape and space. Though they may be among the least important examples of mathematical reasoning and theorem-proving, they are of obvious importance to people who play the game and want to improve. We don't solve chess puzzles to arrive at some deeper truth about the universe! Instead, most of us solve chess puzzles because they're fun and also (perhaps) in the hope that they'll help us get better at playing full games. Puzzle Rush, The Woodpecker Method - all of these various types of puzzle drill are intended to support the acquisition and/or refinement of patterns, or "chunks," of chess knowledge that will help us see brilliant moves over the board. It's this presumed pedagogical value that I think makes chess puzzles a bit different from most of my morning puzzle routine, and it helps motivate the question we're going to consider here: How do you tell how hard a puzzle is?
How do you compose a good chess puzzle?
The key to a good puzzle of any kind is titrating its difficulty. It's very easy to write a puzzle so difficult that it's impenetrable, and it's just as straightforward to write one that's trivial. The tough part of puzzle composition of all types is giving the solver something that isn't obvious at first, but will give way after careful and clever enough thought. Writing the kinds of puzzles that make up a good Mystery Hunt, for example, is all about finding this sweet spot, and locating it depends on understanding the problem domain of the puzzle AND having a good sense of what a solver will do when confronted with it.
In cognitive science, the term of art I'd use here is "Theory of Mind," which refers to the ability to estimate what someone else might be thinking or feeling based on their experience of the world. Below, I've included an image of the classic "Sally-Anne" problem used for years as a means of measuring individuals' Theory of Mind capabilities. In this case, answering the question correctly depends on correctly understanding what each person does and doesn't know about the events depicted below. That same kind of understanding is necessary for writing good puzzles - what does my solver know and what will they try to do? What steps might be immediately apparent to them and what ideas will they have a hard time arriving at? It can be tough to take another person's perspective, but composing a puzzle that will be satisfying to solve depends on it.

The way that understanding puzzle difficulty relies on your own understanding of the kind of problem you're presenting AND your understanding of other people's minds makes this a fascinating topic for cognition research. Chess puzzles are also especially convenient for thinking about these issues of puzzle design and difficulty because we have a great deal of data about how hard individual chess puzzles are. Unlike most of the puzzles that people solve recreationally, chess puzzles don't just vary in difficulty in colloquial terms like "easy," "intermediate," and "hard" (or if you're an NYT Crossword buff, Monday vs. Friday). Instead, chess puzzles can have an Elo rating just like people do: Each attempt to solve a puzzle is a "match" between the puzzle and the solver. If the solver works it out correctly, they win and their puzzle-solving Elo increases! If they get it wrong, the puzzle is the victor and its rating increases. Just like regular tournament play, applying Elo this way should yield ratings that support good predictions about which solvers can succeed against which puzzles. Here's a look at my own puzzle Elo trajectory during my time over at chess.com.
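The bookkeeping behind that "puzzle as opponent" idea is simple enough to sketch in a few lines of Python. The sketch below is just the vanilla Elo update with an arbitrary K-factor, treating a puzzle attempt as a match; the rating systems the big sites actually use for puzzles are fancier Glicko-style variants, so take this purely as an illustration of the idea.

```python
# A minimal sketch of an Elo-style update for a single puzzle attempt, treating
# the attempt as a "match" between solver and puzzle. The K-factor and ratings
# below are arbitrary illustrations; real sites use Glicko-style systems.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_after_attempt(solver: float, puzzle: float, solved: bool, k: float = 32.0):
    """Return updated (solver, puzzle) ratings after one attempt."""
    score = 1.0 if solved else 0.0
    delta = k * (score - expected_score(solver, puzzle))
    return solver + delta, puzzle - delta

# Example: a 1600-rated solver fails an 1800-rated puzzle.
print(update_after_attempt(1600, 1800, solved=False))
# The solver's rating drops a little and the puzzle's rises by the same amount.
```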
As I've mentioned in other posts, the statistical tools used to characterize chess offer researchers a valuable resource for asking specific questions about the game, and by extension about the cognitive processes that support playing it and thinking about it. In this case, chess puzzles provide a unique opportunity to ask some particularly neat questions about puzzles, domain expertise, and Theory of Mind: How easy is it to determine how difficult a puzzle is? And how does puzzle difficulty estimation relate to playing strength? The quantitative data we have about chess puzzles' difficulty makes these problems the perfect sandbox for examining how people assess task difficulty based on their own expertise. Once again, chess turns out to be a very useful proxy for expertise considered more broadly, making it possible to address a specific question about cognition precisely.
Trying to predict puzzle difficulty in the lab
The target article I want to tell you about is really quite straightforward in a number of ways. At its core, there is a simple question that the authors try to answer: Do higher-rated players do a better job of predicting how hard puzzles are? For that matter, how good is anyone at guessing how hard a puzzle will be? The design and analysis of the study are similarly straightforward, which makes the experiment easy to get your head around. I thought the results were surprising, however, and they may hint at some compelling ideas for future work.
Study Design
First, the basics: The authors of this study recruited a group of just 12 chess experts whose FIDE ratings ranged between about 1850 and 2300. These participants were then presented with 12 chess puzzles from Chess Tempo grouped into Easy, Medium, and Hard categories based on their rating. Below, you can see some solving statistics from the manuscript: The higher the puzzle rating, the more time it tends to take people to solve it and the lower the solve rate. Those stats are important to see because they mean there is a clear difference in puzzle difficulty between these three categories that the experimenters constructed for their task.

Participants were asked to solve these puzzles in a randomized order (to avoid sequence effects) and were given three minutes to come up with the best move for each position. The entire testing session took about 30 minutes, and I should note that the researchers were also collecting eye-tracking data throughout the task. To me, the eye movement data didn't add a great deal to the study, so I'm going to focus on the simpler behavioral outcomes instead. If you're interested in studies that used eye tracking in service of some neat questions about chess and visual perception, however, I'll refer you to some of my other articles on that topic.
Measuring difficulty estimation with puzzle rankings
We're not just interested in how well these players solve puzzles, however, but in how well they can estimate how challenging a puzzle is. This means that the last important detail we need about this study is how they asked participants to rate puzzle difficulty. Their approach is about as simple as you can get: They just asked each participant to rank the puzzles in order from easiest to hardest. Then they used the formula below to measure the agreement between the real difficulty ranking and the ranking provided by each player.
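Written out in its pair-counting form (my own rendering of the usual formula, not a reproduction of the authors' notation), Kendall's Tau is:

\[
\tau = \frac{n_c - n_d}{n_c + n_d}
\]

where \(n_c\) is the number of concordant pairs and \(n_d\) is the number of discordant pairs among whatever pairs you decide to count.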
I'm showing you this formula for Kendall's Tau because I want to highlight something neat about their analysis of participants' rankings that makes the results especially interesting to me. In this formula, you're basically counting up pairs of puzzles in the rankings to arrive at the value you need to measure accuracy. In particular, you're counting how many discordant pairs you find in the data and how many concordant pairs you find. A discordant pair refers to a pair of puzzles that's in the wrong order in the participant's rankings: If the puzzle rated 1492 is ranked as more difficult than the one rated 2230, for example, that is one discordant pair. On the other hand, a concordant pair refers to a pair that's in the correct order.
Here's the really important thing: If you take a look at those puzzle stats up above, you'll notice that within the Easy, Medium, and Hard categories there are puzzles with very similar ratings. If it seems a little much to expect that the participants might be able to tell an 1878.1 puzzle apart from an 1878.6 puzzle, you're not alone in your concern! The researchers thought this was too much to expect as well, so they only counted discordant and concordant pairs from different difficulty categories! That is, the only thing they are really looking at is whether participants said Hard puzzles were harder than Medium and Easy ones, and whether they thought Easy puzzles were simpler than Medium ones. That's it! A high value of tau means you got that order right more often than not, and a low value means you didn't.
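If it helps to see that restricted pair count as code, here's a hedged sketch in Python. The ratings, category labels, and ranks below are made up for illustration, and the paper may handle ties differently.

```python
# Kendall's Tau computed over cross-category pairs only, per the restriction
# described above. All numbers here are invented for illustration.
from itertools import combinations

def restricted_tau(true_ratings, participant_ranks, categories):
    """Count concordant/discordant pairs across difficulty categories only."""
    concordant = discordant = 0
    for i, j in combinations(range(len(true_ratings)), 2):
        if categories[i] == categories[j]:
            continue  # skip within-category pairs (ratings too close to call)
        true_order = true_ratings[i] - true_ratings[j]
        judged_order = participant_ranks[i] - participant_ranks[j]
        if true_order * judged_order > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Toy example: four puzzles (two Easy, two Hard) and a participant who thinks
# one of the Hard puzzles is the easiest of the bunch.
ratings = [1500, 1550, 2100, 2150]
cats    = ["Easy", "Easy", "Hard", "Hard"]
ranks   = [2, 3, 1, 4]  # 1 = judged easiest, 4 = judged hardest
print(restricted_tau(ratings, ranks, cats))  # 0.0: half the counted pairs are out of order
```

So what happened?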
It's really tough to tell how hard a puzzle is, actually
I've got a couple of neat things to show you, but the bottom line is this: This group of experts wasn't great at ranking the puzzles by difficulty, even though the researchers only cared about Easy vs. Medium vs. Hard differences. Let me start by showing you the raw data, which in this case is small-scale enough for you to look at yourself. The table below shows the rankings from each of the participants in the study. In each column, the right answer should be 1, 2, 3...all the way to 12, but the scrambled orders you see in each case demonstrate how hard it was for players to come up with that order.

I'm sure you can find interesting and surprising errors of your own here, but check out how often some of the most difficult puzzles float to the top while much easier ones end up at the bottom! Clearly there is a big gap between how hard these puzzles are at the population level and how hard they appear to be to these players. In the bottom row of this table, you can also see the value of Kendall's Tau for each player, which is our quantitative tool for talking about how close players got to the right category-based ordering of the puzzles. I think it's easier to look at this in the graph below, however, to ask another question: Do stronger players do a better job of estimating puzzle difficulty?

I hope you can see from this graph that the answer to that last question is a big ol' nope. In general these players aren't doing a great job at the task according to Kendall's Tau, and there isn't a significant relationship between rating and tau to support the claim that you do better at difficulty estimation if you're a higher-rated player. The authors have a few more analyses in the full paper to look at this a few different ways, but the punchline stays the same. Even when you ask how performance on these puzzles relates to difficulty estimation, there just isn't a meaningful relationship there. It's just hard to guess how difficult a chess puzzle is.
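If you ever want to run that kind of check on numbers of your own (say, a club's worth of solvers), the core of it is just a correlation between rating and each solver's tau. Here's a tiny sketch with placeholder values; these are not the study's data, and I'm not claiming this is the authors' exact test.

```python
# Correlating rating with difficulty-ranking accuracy (tau). Placeholder data only.
from scipy.stats import spearmanr

ratings = [1870, 1925, 2010, 2080, 2150, 2230]   # hypothetical FIDE ratings
taus    = [0.33, -0.07, 0.20, 0.13, 0.40, 0.00]  # hypothetical tau values

rho, p_value = spearmanr(ratings, taus)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
# A small rho with a large p-value is the "no relationship" pattern described above.
```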
So what does this mean (and what comes next?)
Though it might seem disappointing that these players weren't great at this task, I think these results are rather exciting and lead to some cool ideas for related work. To motivate what I'm about to say, I'd like to show you one more figure from the paper - this time a table that shows off how the players did at the 12 Chess Tempo puzzles they were presented with during the study.

What you're seeing here is the percentage of participants who got each puzzle correct (the "Success" column) and the average time it took them to do so in the last column. What I want to call your attention to are some funny discrepancies between these values and the Chess Tempo stats for these puzzles I discussed previously. Puzzles 4 and 5 are supposed to be Medium puzzles, but fewer than half of these expert players got them right! On the other hand, Puzzle 10 is pretty challenging, but this group fared pretty well (and quickly to boot!). I don't want to delve into the details of specific puzzles in an attempt to get very granular about the design and the results, but I think these features of the data hint at an important possibility: chess puzzle performance might be very idiosyncratic.
You probably have a sense of what kinds of puzzles you're better at, and if you don't, you can get that kind of information pretty easily with the analytics tools on lichess or chess.com. Here's a look at my Puzzle Dashboard (which is pretty sparse at the moment - gotta do more drill!): I'm apparently atrocious at finding good defensive moves, but not so bad at exploiting pins. Your mileage may vary, however, and that is very much my point.

Puzzle difficulty may not be especially meaningful as a population-level construct. Instead, it may vary so much between players that the more relevant thing might be how difficult a puzzle is for a player like you. You may have taken one of those "Chess Personality" quizzes at some point (I really wanted to be Capablanca, but that is not at all in the cards it seems), but I suspect there is a lot of good work to be done to try and establish exactly what that might mean quantitatively, especially for beginner/intermediate players with variable skills. What kinds of clumps are there in dashboard data like I've displayed above? Is perceived difficulty intimately tied to the kinds of puzzles you struggle with vs. the ones you fly through with ease?
I think these are exciting questions and ones that can probably be answered with the data available through sites like this one. So while the current study doesn't give us an easy answer to the question of how we estimate the difficulty of a chess puzzle, it highlights how both population-level data and more granular analytics might be necessary to fully understand how different players see different positions.
Support Science of Chess posts!
Thanks as always for reading! If you're enjoying these Science of Chess posts and would like to send a small donation my way ($1-$5), you can visit my Ko-fi page here: https://ko-fi.com/bjbalas - Never expected, but always appreciated!
References
Baron-Cohen, S., Leslie, A. M., & Frith, U. (1985). Does the autistic child have a "theory of mind"? Cognition, 21(1), 37–46.
Hardy, G. H. (2012) [1st pub. 1940]. A Mathematician's Apology. With a foreword by C. P. Snow. Cambridge: Cambridge University Press. ISBN 9781107295599.
Hristova, D., Guid, M., & Bratko, I. (2014). Assessing the difficulty of chess tactical positions. International Journal on Advances in Intelligent Systems, 7(3), 728–738.
Iwamoto, C., Haruishi, M., & Ibusuki, T. (2018). Herugolf and Makaro are NP-complete. In 9th International Conference on Fun with Algorithms (FUN 2018), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 100, pp. 24:1–24:11. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.FUN.2018.24