Can you tell a human opponent from a machine?
Image by Pavel Danilyuk

When I first started learning how to play chess as a kid, I was also really excited about computers and programming. This meant that it didn't take long for me to set my sights on getting a computer training partner that I could run on my Apple IIc. In 1985, I was saving up for programs like Sargon III or Chessmaster 2000, both of which I played against tons of times to try and train for scholastic tournaments. Whenever I upgraded my computer, I made sure that I had an upgraded chess program too, even though I more or less stopped playing competitive chess in 8th grade (1995 or so, which is important due to what comes next).

Screen shot of Sargon III gameplay on the Apple IIc.

It wasn't long after I quit competitive play that computer chess took a huge leap forward. In 1996, Deep Blue managed to take a game from then-World Champion Garry Kasparov, and the next year it would outright beat him under tournament conditions. Though I had mostly left chess behind, the news that a computer could beat the best player in the world was equal parts exciting and disappointing. Part of me couldn't help but want Kasparov to win, but part of me was intrigued and inspired by the idea of building an algorithm that could play chess at the highest levels.

Kasparov facing off against Deep Blue - Chessbase.com

Fast-forward to 2022 and I decided to give chess a try again. I was prepared to have to re-learn a ton of what I had forgotten during years of not playing, but what I wasn't prepared for was just how thoroughly incredibly strong chess engines had been woven into the game. The idea that I could analyze my games with the help of an engine more powerful than Deep Blue was startling. The more I started to consume online chess content, the more I was amazed by the way commentary worked now: Streamers and commentators didn't have to speculate about a position the way I remembered, but could now look at an eval bar or an engine line in real time. Gone were the days of wondering if a certain line was best - now the question was whether or not a player would make the move we all knew, with the benefit of the engine, was strongest. Could a GM see what the machine saw?

What are "computer moves?"

The more chess content I watched, the more I kept hearing a phrase that fascinated me. That phrase was computer move. I'd watch a recap video, for example, and the commentator might point out the best line by saying something like, "But that's such a computer move - there's no way a human is going to play that." I also saw discussions like this in videos exposing chess cheaters: In many of these videos, a player with a rating not far from mine would uncork a move that seemed impossibly strange, but ended up putting enormous pressure on the opponent. A "computer move" strikes again!

Really? h4 is best?

From a cognitive science perspective, I thought this was fascinating. What this idea of "computer moves" suggested to me was that chess engines aren't just better, but that they also play chess differently than people. What could that mean? Why would some computer moves seem "hard to play" or "impossible to see" when they're right there on the board? One possibility is that while people bring a lot of cognitive biases to their decision-making (including confirmation bias, which I've written about elsewhere), computers have no such constraints on their thinking.

A classic example of confirmation bias: If I tell you that every even-numbered card is red on the other side, which cards do you turn over to check if that's right? Most people will say "8" and "Red," but it's really "8" and "Blue" that you need to check - only an even card with a non-red back can break the rule, so flipping the Red card tells you nothing!
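If it helps to see why those are the informative cards, here is a tiny Python sketch that just encodes the logic. The four visible faces (the odd card's value is made up for illustration) are the only thing you get to see, and the rule "even implies red" can only be falsified by a card showing an even number or a non-red color.

```python
# Illustrative sketch of the card task above. Rule being tested:
# "every even-numbered card is red on the other side."
cards = {
    "8": ("number", 8),
    "3": ("number", 3),      # the odd card's value is invented for this example
    "Red": ("color", "red"),
    "Blue": ("color", "blue"),
}

def worth_flipping(face):
    """Only a card that could hide an even number paired with a non-red back can break the rule."""
    kind, value = face
    if kind == "number":
        return value % 2 == 0   # an even card might have a non-red back
    return value != "red"       # a non-red card might have an even number on the back

print([name for name, face in cards.items() if worth_flipping(face)])
# -> ['8', 'Blue']: the Red card can never falsify the rule, no matter what's on its back.
```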

This idea of "computer moves" raises an interesting question, however: Can chess players really tell the difference between a machine opponent and a human opponent? Those examples I gave you above are compelling, but also aren't quite enough to provide a firm answer. To really find out if we can tell a person from a bot, we need to know if chess engines can pass the Turing Test.

What is the Turing Test?

Alan Turing (pictured below) is widely considered one of the founders of computer science. He made foundational theoretical contributions to the field (like his work on computability) as well as practical contributions to computing (like his wartime work on cryptanalysis). His work was wide-ranging, including topics like pattern formation in biological and chemical systems alongside his more mathematical work. Perhaps his most enduring idea is the Turing Test, which he suggested as a simple criterion for deciding whether or not a machine could be considered "intelligent."

See page for author, Public domain, via Wikimedia Commons

The key idea behind the Turing Test is that if a person cannot distinguish between a person and a machine in some setting, then we may as well say that the machine is intelligent. There is a LOT of argument about whether that's an acceptable stance to take, but nonetheless the Turing Test has been an important benchmark for AI ever since. The setting Turing described to introduce his test was one in which a person and a machine would each be asked questions by a third party he called the Interrogator. The Interrogator could ask whatever they liked of either the person or the machine; the machine would try to act person-like, while the person would try to help the Interrogator identify who was who. At the end of the questioning, the Interrogator would have to guess which entity was the real person.

A schematic view of the Turing Test: Can the Interrogator (C) determine which of A or B is the machine? Schoeneh, CC0, via Wikimedia Commons

Since this initial proposal, there have been attempts to conduct Turing Tests in widely different settings. These include conversation (like the original proposal), visual art (Daniele et al., 2021), music composition (Ariza, 2009), and even driving behavior (Bazilinskyy et al., 2021). But what about chess? This question is the subject of the study I'd like to tell you about, in which researchers examined how well chess players could succeed at an over-the-board version of Turing's "Imitation Game."

A Turing Test for chess (Eisma et al., 2024)

In this study, the researchers set up a straightforward "Imitation Game" scenario for evaluating how detectable machine play would be over the board. Their setup included a number of important design choices to ensure that their version of a chess Turing Test wouldn't be trivially easy for participants. For example, we all know that the best engines are simply far stronger than the vast majority of human players. Besides their strength, they will also choose good moves much faster than nearly any human, especially in difficult positions. Noticing that an opponent is playing with superhuman accuracy and speed is an easy way of working out that you're playing a machine, so what can we do to make a Turing Test for chess more meaningful?

The strongest engines are just better than humans - how do we make a Turing Test about more than superhuman chess strength?

To look more specifically at the nature of computer play vs. human play, the authors set up their Turing Test to carefully balance the properties of human and computer play. First, all participants were given the same set of starting positions to play from, all of which were selected from games played by GM Drvitman on Lichess. Each of these positions was the result of the first 10 moves of a Rapid game and was selected so that there were no forcing tactical moves available. You can see one of these positions below:

Figure 2 from Eisma et al. (2024), illustrating one of the starting positions used in their Turing Test experiment.
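To make that screening criterion a bit more concrete, here is one way you could check a candidate position yourself - my own illustration, not the authors' actual procedure. The idea is to ask Stockfish for its top two moves and discard positions where the best move is far better than the runner-up, since those are exactly the positions with an obvious forcing continuation. The sketch assumes the python-chess library and a local Stockfish binary; the 100-centipawn threshold, the search depth, and the FEN are arbitrary choices of mine.

```python
# Sketch (not the authors' method): flag positions where one move is clearly forced,
# by comparing Stockfish's top two moves in a multipv analysis.
import chess
import chess.engine

def is_roughly_non_forcing(fen: str, max_gap_cp: int = 100, depth: int = 18) -> bool:
    """Return True if the best and second-best moves are within max_gap_cp centipawns."""
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    scores = [info["score"].relative.score(mate_score=100_000) for info in infos]
    return abs(scores[0] - scores[1]) <= max_gap_cp

# Placeholder middlegame position (not one of the study's positions):
fen = "r1bq1rk1/pp2bppp/2n1pn2/2pp4/3P4/2P1PN2/PPBN1PPP/R1BQ1RK1 w - - 0 10"
print(is_roughly_non_forcing(fen))
```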

The participants (24 of 'em in all) had an average Lichess Blitz rating of about 1550 and were told that the objective was NOT to win, but to try and work out who they were playing in each of 8 positions: a person or a machine? In reality, they would face one human player (with a Lichess Blitz rating of 1136) and 3 different engines across these conditions. The engines included two versions of Stockfish (one set to Level 4 and one set to Level 20, the highest you can choose) and Maia, a neat alternative chess engine specifically trained to play like a human (McIlroy-Young et al., 2020). Users can choose different strengths for Maia that are intended to reflect specific human ratings, so the researchers selected a level meant to match an Elo rating of ~1100.

Unlike Stockfish, Maia is trained with deep learning specifically to mimic human play at different strength levels. (image from Maiachess.com)
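For readers who like to tinker, here is a minimal sketch of how you could stand up the "weak" and "strong" Stockfish opponents described above, using the python-chess library and a local Stockfish binary. The time limit and engine path are my own assumptions, not details from the paper. Maia works differently: as I understand it, it is distributed as neural-network weight files (one per target rating band) that you load into the lc0 engine, which is how you get its "plays like an ~1100" behavior.

```python
# Sketch: configuring the weak (Level 4) and strong (Level 20) Stockfish opponents.
# Assumes python-chess is installed and "stockfish" is on the PATH.
import chess
import chess.engine

def stockfish_move(board: chess.Board, skill_level: int) -> chess.Move:
    """Get a move from Stockfish using its built-in 'Skill Level' UCI option (0-20)."""
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        engine.configure({"Skill Level": skill_level})
        result = engine.play(board, chess.engine.Limit(time=1.0))
    return result.move

# The study used fixed middlegame positions; the start position is just a stand-in here.
board = chess.Board()
print("Level 4 plays: ", stockfish_move(board, skill_level=4))
print("Level 20 plays:", stockfish_move(board, skill_level=20))
```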

With these human and machine opponents lined up for each participant, players were each given 5 minutes on the clock. Participants were informed that their opponent's moves in each position would be made every 10 seconds, minimizing the usefulness of playing speed as a way of guessing who was the machine. After playing through each of the 8 positions (2 for each opponent, in a randomized order), the researchers asked the participants a number of questions about who they thought they were playing. These included asking whether the opponent was a human or machine, their confidence in that guess, and the estimated strength of the opponent. You can see the results of those questions below for the 4 different opponents used here - in order: Human, Maia, Stockfish (weak), and Stockfish (strong).

Figure 4 from Eisma et al. (2024), illustrating how players responded to various questions about the 4 opponents they faced.

There is a lot to see in this picture, so let me walk you through the main results. First, in general players thought both versions of Stockfish seemed more machine-like, while Maia was not rated as any less human than a real person. This suggests that even when Stockfish is relatively easy to beat (like the Level 4 opponent), there is still something machine-like about the way it plays that is different from what Maia will do. Second, players were consistently confident about their guesses across all opponents. They were also able to identify the difference in playing strength between the Level 4 and Level 20 versions of Stockfish.

One analysis I thought was particularly neat comes from recordings of what players said to themselves while playing. This allowed the researchers to measure how much players spoke when facing different opponents, as well as how often they expressed surprise or confusion about their opponents' moves. Players tended to speak less when facing the stronger version of Stockfish, but expressed more surprise during play against the weaker version. Machine-like play appears to be more unexpected to humans, eliciting either commentary about the weirdness of a move or perhaps a bit of stunned silence as you realize you're done for.

Photo by Courtney Hill on Unsplash

Conclusions

Overall, the data suggest that there really are such things as "computer moves," at least with regard to how Stockfish plays. The different training used to build Maia leads to more human-like play, with participants generally being tricked into thinking Maia was a human. One interesting bit of data the authors discuss, however, is that participants used blunders as evidence both that the human might be an engine ("It made that move because it's programmed to be weak sometimes!") and that Maia might be a human ("It makes mistakes just like me!"). There may be something interesting to unpack there regarding when people do and don't expect humans to make different kinds of mistakes during play - a neat question in its own right.

Given that there are "computer moves," the next big question for cognitive science in this domain (at least, I think!) is why even strong players have biases away from these optimal moves - biases they need to unlearn to get better. Many current GMs (Carlsen and Nakamura included) have talked about studying engines with the goal of understanding why certain puzzling moves do serve a purpose and incorporating those plans into their play. What makes some moves seem more natural than others, such that a machine plays like a machine and a human tends not to? For now, Stockfish can clearly beat the living daylights out of us all over the board, but it doesn't pass the Turing Test.

References

Ariza, C. (2009) The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems. Computer Music Journal, 33, 48-70.

Bazilinskyy, P., Sakuma, T. & de Winter, J. (2021) What driving style makes pedestrians think a passing vehicle is driving automatically? Applied Ergonomics, 95.

Daniele, A., Di Bernardi Luft, C. & Bryan-Kinns, N. (2021) What is Human? A Turing Test for Artistic Creativity. Proceedings of EvoMUSART 2021.

Eisma, Y.B., Koerts, R. & de Winter, J. (2024) Turing tests in chess: An experiment revealing the role of human subjectivity. Computers in Human Behavior Reports, 16, 100496.

McIlroy-Young, R., Sen, S., Kleinberg, J. & Anderson, A. (2020) Aligning superhuman AI with human behavior: Chess as a model system. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1677-1687.
