Science of Chess: Winning Streaks, Losing Streaks, and Skill

NDpatzer

The other day, everything was going great on the chessboard for yours truly. Maybe I was feeling the benefits of putting in more tactical training than usual, or maybe cutting back on my doom-scrolling was having salutary effects on my attention and focus. I was traveling with my family for a short trip to the East Coast, so maybe it was the change in scenery that had shaken something loose in my mind or brain. Whatever the reason, it felt like I couldn’t lose: I was finding ways to keep pressure on my opponents, the sacrifices I was making turned out to be sound, and even in situations where I was down material I was finding opportunities for counterplay that saved the day. As I watched my rating go up I wondered if maybe I’d turned some corner as a chess player. Maybe something had finally clicked and I’d ascended to a new level of chess mastery.

A chess.com record of consecutive wins.

Then I played some more the next day.

Friends, I had not ascended. Maybe it was the weak hotel coffee or maybe it was the unsettling dreams I’d had the night before while the in-room air conditioning unit rattled and chugged away in the corner. Whatever the reason, nothing worked. My opponents were only playing openings I hated to face and I was getting stuck in all the awkward lines I hadn’t yet taken the time to study. Now I was the one continually under pressure and all the forks, sacrifices, and skewers I’d seen just the day before escaped my notice in game after game. As I watched my rating slide back to where it had just climbed from, I wondered which day was the fluke: Was I secretly much better at chess than I might have thought, or secretly much worse?

A chess.com record of losses

Hot and cold streaks are an inescapable fact of chess, baseball, basketball, and most any other endeavor where we measure performance in successes and failures. A long run of successes can make you feel like you’re “in the zone,” while a long run of failures raises the spectre of being “on tilt.” How much should we read into hot and cold streaks, though? Do they really reflect anything meaningful about the state of an individual player during either a winning streak or a losing streak, or are they sound and fury that signify only the statistics of binary outcomes? I've read a few different posts lately about the relative contribution of luck vs. skill in chess: To what extent is our performance about our cunning plans in this game of perfect information and to what extent are we all taking random walks around the true value of our skill level?

Whether winning or losing streaks are “real” in the sense that they represent meaningful departures from randomness has been examined with a few different datasets, and in this short post I wanted to share a recent analysis along these lines using a large database of online chess community data. The analysis I’ll share with you is just one part of a bigger study with other interesting results I’ll write about later, but I thought this piece of their work was a nice look at chess expertise, losing slumps, and winning streaks via a quantitative analysis that’s fairly easy to describe.

Streaks are (probably) inevitable

When you're in the middle of a streak, good or bad, it can feel like you're being swept up in something that demands an explanation. Surely 10 wins in a row must mean something is afoot! Likewise, losing 12 or 13 in a row must be some kind of hint that there is some reason you're spiraling deeper and deeper into the Elo abyss. The thing is, if we're thinking about binary outcomes like winning or losing a game (we'll forget about draws for now), a series of outcomes is bound to have some runs of either 1's (wins) or 0's (losses). There's an exercise I give students in one of my computational classes to illustrate this point - the goal is to build intuitions for stochastic systems and demonstrate the value of simulating random processes to learn more about their behavior, but it's also just fun to watch them get surprised.

One group of students gets an actual quarter, and I ask them to flip it 150 times and write down what they get in order. Another group of students does NOT get the quarter, but I ask them to imagine flipping one 150 times and to record the imagined outcome of the flip each time. In both cases, we get a list of Heads and Tails outcomes, but every time I've done this exercise with students there is an immediate difference between the two lists as well. The first one, with the real quarter, usually has a run of either Heads or Tails that's 7 (sometimes more) in a row. The second one, that students just imagined, tends to top out at streaks of about 4 or 5. If you ask the second group why they didn't put in bigger runs of Heads or Tails, they'll usually tell you that it "didn't seem random enough" to have the same thing come up again and again. There's a nice cognitive science point here about the Gambler's Fallacy and a nice point about probability: What appears improbable to the human mind may in fact be quite likely if you know how to calculate!

Image credit: Joshua Hoehne

In this case, you can see why a 7-streak in 150 coin-flips isn't so strange fairly easily with a quick-and-dirty approximation. If we were thinking about getting 7 heads in a row or 7 tails in a row from EXACTLY 7 flips of the coin, that is quite a long shot. How long? There are 2^7 possible series of heads and tails we could end up with, and just 2 of those (7 Heads or 7 Tails) meet our criterion. That means that we've got a 1 in 64 shot of getting our streak. What about our 150 coin flips? This is the quick-and-dirty part, but we can think of this series as a bunch of overlapping 7-flip sequences - 144 of 'em to be exact. If you've got a 1 in 64 shot of something happening, it's not terribly likely if you only get one try. If I give you 144 tries, though, it's much more likely to actually happen.
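If you'd like to check the quick-and-dirty estimate against a simulation, here is a minimal Python sketch. The function names, the number of trials, and the seed are my own choices for illustration. One caveat worth keeping in mind: the window shortcut treats the 144 overlapping windows as if they were independent, which they are not, so it tends to overstate the chance of a streak. The simulation doesn't make that assumption.

```python
import random

def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    best = cur = 0
    prev = object()  # sentinel that never equals a real flip
    for x in seq:
        cur = cur + 1 if x == prev else 1
        best = max(best, cur)
        prev = x
    return best

def prob_run_at_least(n_flips, k, trials=5000, seed=0):
    """Monte Carlo estimate of P(some run of >= k Heads or Tails in n_flips)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        if longest_run(flips) >= k:
            hits += 1
    return hits / trials

# Quick-and-dirty version: 144 windows, each with a 2 / 2**7 = 1/64 chance,
# treated (incorrectly, but conveniently) as independent tries.
windows = 150 - 7 + 1
naive = 1 - (1 - 2 / 2**7) ** windows

simulated = prob_run_at_least(150, 7)
print(f"window shortcut: {naive:.3f}   simulation: {simulated:.3f}")
```

Either way you slice it, a 7-streak somewhere in 150 flips comes out as more likely than not, which is the point of the classroom exercise.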

We've played with these kinds of problems in my classes with other stochastic systems. One of my favorite assignments was asking students to analyze Joe DiMaggio's MLB record-setting hitting streak of 56 games. This one's particularly fun because at least a few years ago, it was the case that you should expect a streak of that length just about once in the history of baseball, so on one hand it maybe wasn't terribly surprising that someone did it. The even better catch, though, is that it turns out to be pretty unlikely that a player with DiMaggio's batting average (lifetime average of 0.325) would do it - Ty Cobb probably should have been the guy (0.366). Modeling random systems carefully and playing with the parameters in your model is a great way to understand what is and isn't likely in a concrete and computable way, and helps you get some intuition for thinking about what should and shouldn't surprise you.
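To give a flavor of what "playing with the parameters" looks like, here is a toy Python model of hitting streaks. Everything beyond the two batting averages is an assumption I'm making for illustration: a fixed 4 at-bats per game, independent at-bats, a constant average across a career, and a round 2000-game career. It is not the analysis from my class or from any published study, just a sketch of the idea.

```python
import random

def p_hit_in_game(batting_avg, at_bats=4):
    """Chance of at least one hit in a game, treating at-bats as independent."""
    return 1 - (1 - batting_avg) ** at_bats

def longest_hit_streak(batting_avg, n_games, rng, at_bats=4):
    """Longest run of consecutive games with a hit in one simulated career."""
    p = p_hit_in_game(batting_avg, at_bats)
    best = cur = 0
    for _ in range(n_games):
        if rng.random() < p:
            cur += 1
            best = max(best, cur)
        else:
            cur = 0
    return best

def mean_longest_streak(batting_avg, n_games=2000, careers=200, seed=0):
    """Average longest streak over many simulated careers (all values hypothetical)."""
    rng = random.Random(seed)
    total = sum(longest_hit_streak(batting_avg, n_games, rng) for _ in range(careers))
    return total / careers

for name, avg in [("DiMaggio", 0.325), ("Cobb", 0.366)]:
    print(name, round(p_hit_in_game(avg), 3), mean_longest_streak(avg))
```

Even in this crude version, a modest bump in batting average noticeably raises the per-game chance of a hit, and that compounds over a long streak. Changing `at_bats` or `n_games` is a quick way to see how sensitive the result is to each assumption.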

Public Domain photo. Ty Cobb may not hold the record for MLB's hitting streak, but if there's a record for sliding "spikes high" into home, I bet he's your guy.

Chess streaks depend on skill - Chowdhary et al. (2023)

Compared to coin flips and baseball, chess may not feel random to you. You don't roll the dice on your turn, after all - you and your opponent get to see everything on the board and make your best decision. Where's the luck or the randomness in that? Well - hang on. Do you have a headache? Does your opponent? Is your side of the board exposed to a particularly annoying beam of light? What color did you draw, by the way? Is your opponent completely prepped for your favorite pet line?

I imagine you get the point, even if you still want to argue a bit! Yes, chess is indeed not a game of chance, but there are also random factors that affect how we play that vary from game to game. The difference between your rating and your opponent's rating is essentially a way to arrive at a probabilistic (NOT deterministic!) estimate of who's likely to win: You can go check out some Elo calculators online to see how different rating gaps predict different success rates for the higher-rated player. That makes analyzing time-series data of wins and losses (draws, what draws?) a means of evaluating chess as a stochastic system and comparing real outcomes to different models of how wins and losses might pile up. This is one part of the results Chowdhary et al. (2023) report in their paper, in which they offer a number of different analyses of how elite players differ from beginners in their approach to the game and the results they achieve. There are a lot of intriguing results here that I'm planning to talk about in future posts, but for now I want to highlight their investigation of how beginners, intermediate players, advanced players, and experts exhibit "streaky" behavior online.
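If you'd rather not hunt down a calculator, the standard Elo expected-score formula is short enough to write out yourself. The sketch below is just that textbook logistic curve with the usual 400-point scale; the function name and the example ratings are mine. Two caveats: the "expected score" counts a draw as half a point, so it isn't strictly a win probability, and sites that use Glicko-2 (as lichess does) also factor in rating uncertainty, which this ignores.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score for player A against player B under the standard Elo curve."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# How the higher-rated player's expected score grows with the rating gap.
for gap in (0, 100, 200, 400):
    print(gap, round(elo_expected_score(1500 + gap, 1500), 3))
```

Notice that even a sizable rating gap leaves the lower-rated player with a real chance, which is exactly why a rating is a probabilistic estimate rather than a verdict.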

Given a time series of an individual player's wins and losses during their playing history, the simplest thing we could do is count up streaks of wins and losses to see how often we find runs of 5, 6, 7 or more wins or losses in a row. This is where the authors start in this analysis, separating players by skill level according to Glicko-2 quartiles in their dataset of lichess games. Critically, however, they can't stop here! Remember the coin flips and Joe DiMaggio's hitting streak from our previous discussion - streakiness can play out differently just as a function of the bias in the data favoring one outcome over another. To the extent that beginners may not win as often as experts, they may exhibit different streak probabilities just because of their different win-loss proportions. To put it another way, even after we count up streaks of different lengths, do hot and cold streaks just reflect the inevitability of randomly packing that many wins and losses into a sequence? To find out, the authors compared the streak counts they observed for different lengths to what they counted up in shuffled versions of the players' time-series data. That is, what's the ratio of streaks of each length that we see in the real player data to what we see in mixed-up data that only preserves the proportion of wins and losses?
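Here is a small Python sketch of that shuffle comparison, to make the logic concrete. To be clear, this is my own illustration and not the authors' code: the function names, the made-up "streaky player" generator, and the choice to count streaks of at least a given length (rather than exactly that length) are all assumptions on my part.

```python
import random
from collections import Counter
from itertools import groupby

def streak_counts(results, outcome):
    """Count maximal runs of `outcome` in a sequence, keyed by run length."""
    counts = Counter()
    for value, group in groupby(results):
        if value == outcome:
            counts[len(list(group))] += 1
    return counts

def streak_ratio(results, outcome, length, n_shuffles=100, seed=0):
    """Observed number of runs of >= `length`, divided by the mean over shuffles."""
    rng = random.Random(seed)

    def at_least(seq):
        return sum(c for n, c in streak_counts(seq, outcome).items() if n >= length)

    observed = at_least(results)
    shuffled = list(results)  # shuffling keeps the win/loss proportion, not the order
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += at_least(shuffled)
    mean_shuffled = total / n_shuffles
    return observed / mean_shuffled if mean_shuffled else float("nan")

def streaky_player(n_games, p_repeat, seed=1):
    """Hypothetical player who repeats their previous result with probability p_repeat."""
    rng = random.Random(seed)
    results = [rng.random() < 0.5]
    for _ in range(n_games - 1):
        last = results[-1]
        results.append(last if rng.random() < p_repeat else not last)
    return results

games = streaky_player(5000, p_repeat=0.7)  # True = win, False = loss
print("winning-streak ratio (5+):", round(streak_ratio(games, True, 5), 2))
print("losing-streak ratio (5+):", round(streak_ratio(games, False, 5), 2))
```

Setting `p_repeat=0.5` turns the fake player into a plain coin, and the ratios should then hover around 1. Anything that makes a result more likely to repeat pushes them above 1, which is the signature the authors are looking for in real game histories.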

Chowdhary et al. (2023), Figure 1. At the top, an example of one player's timeseries data depicting wins in red and losses in blue. Below left, the ratio of hot and cold streaks of each length in the real data relative to shuffled series. Below right, winning streak ratios of each streak length separated by player ability.

What they found is that, in general, hot and cold streaks happened more often in the real player data than in the shuffled data, suggesting that such streaks are NOT just a by-product of players' win-loss records. Streakiness can't be explained by modeling players as random win-loss coins. If it could, these ratios would be very close to 1, but instead you can see that longer streaks become increasingly frequent in the real data relative to the shuffled data for players of all skill levels. Also, note in the lower left of the figure above that slumps occur more often than winning streaks! We shouldn't jump to a firm conclusion based on this, but it may suggest that losing streaks are more clearly due to some deterministic factor than winning streaks are.

All of this changes as you get better, however - beginners are MUCH streakier than stronger players, which you can see in the lower right figure. What's this about? The authors are able to rule out a few accounts of this difference: Long temporal gaps between "consecutive" games don't explain their data, for example, and the rating difference between opponents during a streak is only weakly correlated with streak length. They were also able to rule out consecutive games with the same opponent, so what's left? Maybe there really is something to being on a bit of a tear, or to being on tilt: Whether that's a motivational effect, a sign of environmental conditions that are affecting your play, or some other aspect of individual behavior or social interaction, hot and cold streaks seem to be more meaningful than random chance.

OK, but...

I feel the need to be a little careful here, however. The "hot hand" has been examined in a bunch of different sports and the results of these analyses have been picked over and argued about for long enough that there is every chance there's more to this story. If you're interested in such statistical arguments, check out the references below for some of my favorite papers about streakiness in different sports. I haven't had the chance to think through the analysis I described above relative to some of the bias issues raised in these papers, but the Chowdhary paper is a great example of open data making it possible for you to play around with these numbers as much as you like. If you find something neat, let me know!

Support Science of Chess posts!

Thanks as always for reading! If you're enjoying these Science of Chess posts and would like to send a small donation my way ($1-$5), you can visit my Ko-fi page here: https://ko-fi.com/bjbalas - Never expected, but always appreciated!

References

Chowdhary, S.; Iacopini, I.; Battiston, F. (2023). "Quantifying human performance in chess". Scientific Reports. 13: 2113. doi:10.1038/s41598-023-27735-9

Gilovich, Thomas; Tversky, A.; Vallone, R. (1985). "The Hot Hand in Basketball: On the Misperception of Random Sequences". Cognitive Psychology. 17 (3): 295–314. doi:10.1016/0010-0285(85)90010-6

Miller, Joshua B.; Sanjurjo, Adam (2016). "Surprised by the Gambler's and Hot Hand Fallacies? A Truth in the Law of Small Numbers". IGIER Working Paper (552). doi:10.2139/ssrn.2627354

Raab, Markus; Gula, B.; Gigerenzer, G. (2011). "The Hot Hand Exists in Volleyball and Is Used for Allocation Decisions". Journal of Experimental Psychology: Applied. 18 (1): 81–94. doi:10.1037/a0025951
