Beyond the Board: A Non-Parametric Approach to Chess Fraud Detection


Welcome, fellow chess enthusiasts and data detectives! Today we’re diving into an unusual corner of the game: when consistency becomes suspicious. We all admire a flawless performance—but what happens when near-perfect play hides a darker secret? In this post, you’ll see how some robust statistical tools can pull back the curtain on a player whose centipawn losses are just a bit too tidy. Along the way, we’ll demystify log transforms, explore non-parametric tests, and even plot a control chart that would make any quality-control engineer proud.


1. What Is CP Loss?

  • Centipawn loss measures how far each move deviates, on average, from the engine’s top choice.

  • A score of 0 means a “perfect” move; higher scores indicate bigger mistakes.

  • We collect one CP_loss value per game by averaging the centipawn losses over all moves.
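To make that concrete, here's a minimal sketch of how one per-game value could be computed. The `move_losses` lists are hypothetical per-move losses, not real game data:

```python
def mean_cp_loss(move_losses):
    """Average centipawn loss over all moves of one game.

    move_losses: per-move gaps (in centipawns) between the engine's
    top choice and the move actually played; 0 means a perfect move.
    """
    return sum(move_losses) / len(move_losses) if move_losses else 0.0

# One CP_loss value per game:
games = [[0, 12, 34, 0, 8], [5, 0, 61, 22]]   # hypothetical per-move losses
cp_loss = [mean_cp_loss(g) for g in games]    # -> [10.8, 22.0]
```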


2. Data 

We have two datasets: 109 games for our “suspicious player” and 100 games for a reference group of similar Elo:

  • Suspicious player: CP_loss_j (centipawn loss mean per game)

  • Reference Group: CP_loss_g

Both series are strictly ≥ 0 (0 means flawless play).


3. Why a Log Transformation?

Because CP_loss cannot go below zero, its distribution is right-skewed (long tail of big mistakes). Many statistical tests assume more symmetry, so we apply:

$$\text{CP\_loss}_{\log} = \log(\text{CP\_loss} + 1)$$

  • Why +1? Ensures zero maps to $\log(1) = 0$.

  • Effect: Compresses large errors, spreads small ones—reducing skew while preserving the zero floor.
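In code this is just a `log1p`. Since the real data isn't published here, the sketch below uses synthetic right-skewed stand-ins (Gamma draws) for the two series; the later sketches reuse these arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the real per-game CP_loss series (>= 0, right-skewed)
cp_loss_j = rng.gamma(shape=4.0, scale=6.0, size=109)   # "suspicious player"
cp_loss_g = rng.gamma(shape=3.0, scale=14.0, size=100)  # reference group

# log(x + 1): maps 0 to 0 and compresses the long right tail
log_cp_j = np.log1p(cp_loss_j)
log_cp_g = np.log1p(cp_loss_g)
```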


Key Observations:

  1. Shape: Both remain right-skewed, but the suspicious player’s data is tighter around lower values.

  2. Spread: The suspicious player’s log-CP_loss occupies a narrower range than the group’s.

  3. Peak: The suspicious player has a sharper mode near 2.8–3.2; the group’s peak is broader, around 3.5–3.9.


4. Checking for “Gaussian” Behavior: Shapiro–Wilk Test to the Rescue

Before comparing groups, it helps to know whether our transformed data roughly follows the familiar bell-curve (a Normal distribution). Many classical statistics assume “normality,” so we use the Shapiro–Wilk test to check:

What it does: Tests the hypothesis

$H_0$: the data are Normal vs. $H_1$: the data are not Normal

Why it matters:  The suspicious player’s data—even after log-transform—deviates notably from a bell-curve (long or chunky tails). This rules out simple “t-tests” or standard F-tests for variance, so we need more robust methods.


For those who still don’t believe me, below are two visual proofs that the log-transformed data for our suspicious player is NOT Normally distributed, even if the histogram alone might look bell-shaped at first glance.
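If you’d like to run the check yourself, here’s a minimal sketch with scipy (reusing the hypothetical `log_cp_j` array from Section 3); the Q-Q plot is one of those visual checks:

```python
from scipy import stats
import matplotlib.pyplot as plt

# Shapiro–Wilk: H0 = the data are Normal
w, p = stats.shapiro(log_cp_j)
print(f"W = {w:.4f}, p = {p:.4g}")   # small p -> reject Normality

# Q-Q plot: Normal data would hug the diagonal reference line
stats.probplot(log_cp_j, dist="norm", plot=plt)
plt.title("Q-Q plot of log-transformed CP loss")
plt.show()
```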




5. Comparing Spread: Fligner–Killeen Test

With non-normal data, we still want to see if one series is tighter (less variable) than the other. Enter Fligner–Killeen, a test designed to compare variances without assuming Normality:

What it does:

  1. Ranks all data points across both groups.

  2. Checks whether the rank-based spread differs between the groups.

  • Since p < 0.05, we conclude the suspicious player's spread of errors is significantly smaller than the reference group's.
  • Practical meaning: In repeated independent games, you’d expect natural ups and downs in performance. A consistently low spread suggests something smoothing out those ups and downs—like engine help.
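A one-call sketch of the test with scipy, again on the synthetic stand-ins from Section 3:

```python
from scipy import stats

# Fligner–Killeen: rank-based test for equal spread, no Normality assumed
stat, p = stats.fligner(log_cp_j, log_cp_g)
print(f"statistic = {stat:.3f}, p = {p:.4g}")  # p < 0.05 -> spreads differ
```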


6. Comparing Typical Error: Mann–Whitney U Test

Beyond variance, we also want to know if the typical centipawn loss itself is lower. Because the data aren’t Normal, we use the Mann–Whitney U test, which compares the two samples by ranks (and is often read as a comparison of medians):

What it does:

  1. Ranks all CP_loss values combined.

  2. Compares the sum of ranks between groups.


An extremely small p-value means the two distributions are not just different in spread but also in central tendency: the suspicious player’s median error is significantly lower.

In plain English: Not only are their mistakes less variable, but on average they make fewer mistakes than peers—to a degree that almost never happens by chance.
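In scipy this is again one call; the one-sided alternative asks specifically whether the suspicious player’s losses tend to be lower (rank tests are unaffected by the log transform, so the raw series work too):

```python
from scipy import stats

# Mann–Whitney U, one-sided: do the suspicious player's CP losses
# tend to be lower than the reference group's?
u, p = stats.mannwhitneyu(cp_loss_j, cp_loss_g, alternative="less")
print(f"U = {u:.1f}, p = {p:.4g}")   # tiny p -> significantly lower
```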


Here’s a violin plot comparing the raw CP loss mean distributions:

  • Left violin: Suspicious Player

  • Right violin: Reference Group

Key points illustrated:

  • The width of each violin at a given CP_loss value shows how many games had that level of error.

  • The black bar inside each violin marks the median CP_loss.

  • You can see that the Suspicious Player’s violin is narrower overall (less spread) and shifted lower (lower median), which visually supports the Mann–Whitney U test conclusion: the Suspicious Player consistently makes fewer mistakes and does so more uniformly than peers.
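For reference, a minimal matplotlib sketch that reproduces this kind of plot with the stand-in arrays:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.violinplot([cp_loss_j, cp_loss_g], showmedians=True)  # median bar inside each violin
ax.set_xticks([1, 2])
ax.set_xticklabels(["Suspicious Player", "Reference Group"])
ax.set_ylabel("Mean CP loss per game")
plt.show()
```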


7. Visualizing Consistency: A Robust Control Chart

To see these patterns at a glance, we borrow a classic tool from manufacturing—adjusted for a zero-floor metric:

  1. Median (m): the “middle” CP_loss.

  2. Interquartile Range (IQR): the difference between the 75th and 25th percentiles.

  3. Upper Control Limit (UCL): the median plus a multiple of the IQR (here ≈ 42 CP).

  4. Lower Control Limit (LCL): the median minus the same multiple of the IQR, floored at 0 because CP_loss cannot be negative.



Most games fall between 0 and 42 CP, with only 7 outliers above that band.

Why this matters: In honest play, each game’s mistakes bounce around the median—some high, some low. Seeing nearly all games in a narrow window hints at an external factor smoothing out the natural variability.
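Here’s a sketch of how such a chart could be drawn. The multiplier k = 1.5 is an assumption on my part (the exact limits used above aren’t shown), but the zero floor on the LCL matches the adjustment described:

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.median(cp_loss_j)
q1, q3 = np.percentile(cp_loss_j, [25, 75])
iqr = q3 - q1

k = 1.5                      # assumed robustness multiplier
ucl = m + k * iqr            # upper control limit
lcl = max(0.0, m - k * iqr)  # lower limit, floored at the zero bound

plt.plot(cp_loss_j, "o-", label="CP loss per game")
plt.axhline(m, color="black", label=f"median = {m:.1f}")
plt.axhline(ucl, color="red", linestyle="--", label=f"UCL = {ucl:.1f}")
plt.axhline(lcl, color="red", linestyle="--", label=f"LCL = {lcl:.1f}")
plt.xlabel("Game #")
plt.ylabel("Mean CP loss")
plt.legend()
plt.show()
```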


8. Binomial Analysis of Win Rate

Objective
Determine what Elo rating the suspected cheater would need for winning 45 out of 54 games against 2131-rated opponents to have at least a 10% chance.

Input Summary

  • Opponents’ average Elo: 2131

  • Games played: 54

  • Wins achieved: 45

  • Suspected cheater’s current Elo: 2174

  • Expected win probability at Elo 2174 vs. 2131: 0.5604

  • Observed win rate: 45/54 ≈ 0.8333.


Statistical Model

We model the number of wins $K$ as

$$K \sim \mathrm{Binomial}\bigl(n = 54,\; p(R)\bigr), \qquad p(R) = \frac{1}{1 + 10^{(2131 - R)/400}},$$

where $p(R)$ is the Elo-expected win probability at rating $R$ (this gives $p(2174) \approx 0.56$, matching the input summary), and compute the tail probability:

$$P(K \ge 45) = \sum_{k=45}^{54} \binom{54}{k}\, p(R)^{k}\, \bigl(1 - p(R)\bigr)^{54 - k}$$
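A small sketch of this computation, using the standard Elo expectation formula and scipy’s binomial survival function; the scan for the required Elo mirrors the blue line in the figure below:

```python
import numpy as np
from scipy.stats import binom

def win_prob(elo, opp=2131):
    """Elo-expected score against an opponent rated `opp`."""
    return 1.0 / (1.0 + 10 ** ((opp - elo) / 400))

def tail_prob(elo, n=54, wins=45):
    """P(K >= wins) for K ~ Binomial(n, p(elo))."""
    return binom.sf(wins - 1, n, win_prob(elo))

print(f"{tail_prob(2174):.2e}")   # vanishingly small at the actual rating

# Smallest Elo at which 45/54 wins has at least a 10% chance
elos = np.arange(2174, 2500)
required = elos[np.argmax([tail_prob(e) >= 0.10 for e in elos])]
print(required)                   # ~2322, as in the analysis above
```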

Graphical Illustration

  • Orange curve: $P(K \ge 45)$ as a function of Elo

  • Green line: 10% probability threshold

  • Blue line: Elo ≈ 2322 required for $P(K \ge 45) \ge 0.10$

  • Y-axis (log scale): highlights how the probability drops steeply below this Elo


Interpretation

  • At Elo ≈ 2322, winning 45/54 games is just plausible ($\ge 10\%$).

  • At his actual Elo of 2174, the chance of 45 wins is vanishingly small—reinforcing the suspicion that his performance was far stronger than his rating implies.

  • This discrepancy suggests one of three possibilities: he is significantly underrated, had an extraordinary run, or received external assistance.


Final Conclusion

Our analyses converge on a single message: the suspected cheater’s performance is too consistent and too strong for his 2174 rating.

  • Error metrics: He shows significantly lower variance (Fligner–Killeen) and median centipawn loss (Mann–Whitney U) than peers, with a control chart revealing an unusually tight band of mistakes.

  • Win rate: Winning 45/54 games against 2131-rated opponents is virtually impossible at Elo 2174; to reach even a 10% chance, he’d need an Elo of ≈2322.

  • Implication: Such a gap—combined with non-normal error patterns—suggests he is either underrated, on an extraordinary streak, or receiving external assistance.

Taken together, these findings mark his results as statistically suspicious and deserving of deeper review.

 
