
When Style Stops Being Human (Statistically)
Good morning, fellow detectives. Today we’ll explore a different path in fair-play analysis: instead of expensive engine correlations, we’ll study the shape of a player’s style and how it unfolds move after move. The goal isn’t to celebrate isolated brilliance; it’s to spot improbable regularity, vanishing tails, or “too smooth” curves where medians jump without paying the price in variance.
What we measure (and why)
We will only work with PGNs: the position after each move and the “style” it sketches. No engine. We summarize the style’s center, spread, and tails across many moves and games, looking for repeated patterns and human variability.
The four families of style features
We group signals into four families. High one-off values don’t interest us; repeated structure and variability do. A minimal extraction sketch follows the list.
1) King Safety
- Attacker/Defender Score, Pawn Shield, Open Files near the king.
- Humans oscillate between tense and calm phases; coherent attack/defense plans leave a recognizable pulse.
2) Activity & Space
- Weighted Mobility, Control of Center, Developed Pieces, Space Control.
- Sustained activity is good, but truly human activity isn’t perfectly flat.
3) Pawn Structure
- Doubled / Isolated / Passed Pawns, Piece Coordination.
- Structure breathes with practical decisions and trade-offs.
4) Immediate Tactical Motifs
- Forks, Pins, Skewers, Threats.
- Captures the “tactical potential” available after each move.
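To make these families concrete, here is a minimal sketch (not our actual extractor) of computing a few such signals per move from a PGN with python-chess. The definitions of doubled pawns, center control, and mobility below are simplified illustrations.

```python
# Minimal sketch, NOT the article's actual extractor: per-move style features
# from a PGN with python-chess. Feature definitions are simplified.
import chess
import chess.pgn

CENTER = [chess.D4, chess.E4, chess.D5, chess.E5]

def doubled_pawns(board: chess.Board, color: chess.Color) -> int:
    """Count extra pawns of `color` stacked on the same file."""
    files = [chess.square_file(sq) for sq in board.pieces(chess.PAWN, color)]
    return sum(max(files.count(f) - 1, 0) for f in set(files))

def center_control(board: chess.Board, color: chess.Color) -> int:
    """Total number of attacks `color` aims at the four central squares."""
    return sum(len(board.attackers(color, sq)) for sq in CENTER)

def mobility(board: chess.Board, color: chess.Color) -> int:
    """Crude activity proxy: legal-move count if it were `color`'s turn."""
    proxy = board.copy(stack=False)
    proxy.turn = color
    return sum(1 for _ in proxy.legal_moves)

def game_feature_curves(game: chess.pgn.Game, color: chess.Color) -> list[dict]:
    """One row of features per move, i.e. the curve the style is drawn from."""
    board = game.board()
    rows = []
    for move in game.mainline_moves():
        board.push(move)
        rows.append({
            "doubled_pawns": doubled_pawns(board, color),
            "center_control": center_control(board, color),
            "mobility": mobility(board, color),
        })
    return rows

# Usage: collect curves for White across a PGN file.
# with open("games.pgn") as handle:
#     while (game := chess.pgn.read_game(handle)) is not None:
#         curves = game_feature_curves(game, chess.WHITE)
```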
“Same distribution, or not?” Multiple lenses, one verdict
Instead of fixating on one metric or one test, we look for a converging story across the whole curve.
If the center shifts (median higher or lower than the reference), order-based tests like Mann–Whitney tell you so, and the common-language effect size (“Theta”) says how often A tends to beat B.
If the spread changes, Levene’s test centered on the median calls it out. This is where “the curve tightens” shows up in black and white.
If the shape differs anywhere, KS picks it up; if the difference hides in the tails, Anderson–Darling weights those more heavily.
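As a minimal sketch, the same two samples (say, one game versus the player’s baseline) can be run through all of these lenses with SciPy; the function below is illustrative rather than our production code.

```python
# Minimal sketch: one feature, two samples, three lenses (center, spread, shape/tails).
import numpy as np
from scipy import stats

def compare_feature(sample_a: np.ndarray, sample_b: np.ndarray) -> dict:
    # Center: Mann-Whitney U; Theta = U / (n_a * n_b) is the common-language
    # effect size, i.e. how often a value from A tends to beat one from B.
    u_stat, p_center = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    theta = u_stat / (len(sample_a) * len(sample_b))

    # Spread: Levene's test centered on the median (Brown-Forsythe).
    _, p_spread = stats.levene(sample_a, sample_b, center="median")

    # Shape anywhere: two-sample Kolmogorov-Smirnov.
    _, p_shape = stats.ks_2samp(sample_a, sample_b)

    # Tails: k-sample Anderson-Darling (its p-value is clipped to [0.001, 0.25]).
    p_tails = stats.anderson_ksamp([sample_a, sample_b]).significance_level

    return {"theta": theta, "p_center": p_center, "p_spread": p_spread,
            "p_shape": p_shape, "p_tails": p_tails}

# rng = np.random.default_rng(0)
# print(compare_feature(rng.normal(0, 1, 400), rng.normal(0.3, 0.6, 400)))
```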
We add effect sizes (Hedges g; Cliff’s δ) so the story isn’t just “different” but “how different,” and distances like Wasserstein-1 (normalized by the reference IQR) to quantify how much work it would take to morph one curve into the other.
And because we look at several features, we control the false discovery rate with Benjamini–Hochberg and talk in q-values, not just p-values.
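For completeness, here is a sketch of the “how different” layer: Hedges’ g, Cliff’s δ derived from the Mann–Whitney U, Wasserstein-1 scaled by the reference IQR, and Benjamini–Hochberg q-values via statsmodels. Again, an illustration rather than our exact implementation.

```python
# Minimal sketch: effect sizes, a transport distance, and FDR control.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def hedges_g(a: np.ndarray, b: np.ndarray) -> float:
    """Bias-corrected standardized mean difference."""
    n_a, n_b = len(a), len(b)
    pooled_sd = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1))
                        / (n_a + n_b - 2))
    correction = 1.0 - 3.0 / (4.0 * (n_a + n_b) - 9.0)
    return correction * (a.mean() - b.mean()) / pooled_sd

def cliffs_delta(a: np.ndarray, b: np.ndarray) -> float:
    """P(A > B) - P(A < B), obtained from the Mann-Whitney U statistic."""
    u_stat, _ = stats.mannwhitneyu(a, b, alternative="two-sided")
    return 2.0 * u_stat / (len(a) * len(b)) - 1.0

def wasserstein_over_iqr(sample: np.ndarray, reference: np.ndarray) -> float:
    """Earth-mover (Wasserstein-1) distance scaled by the reference IQR."""
    q75, q25 = np.percentile(reference, [75, 25])
    return stats.wasserstein_distance(sample, reference) / (q75 - q25)

def bh_q_values(p_values: np.ndarray) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values (q-values) across features."""
    _, q_values, _, _ = multipletests(p_values, method="fdr_bh")
    return q_values
```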
Per-game outliers: catching regime changes, not one-off spikes
We judge each game against the player’s other games, feature by feature. For every feature f, we compute a robust z-score against the player’s own baseline: z_f = (x_f − median_f) / (1.4826 · MAD_f), where median_f and MAD_f come from the player’s remaining games. Then we fold the per-feature scores into a single per-game score, Zcomb.
In parallel, we ask: if you treat this game as one group and the player’s remaining games as the other, do the feature distributions differ? We run the per-feature tests, combine their p-values with Fisher, and control FDR to get a q_comb per game.
A game is flagged when those strands align: q_comb ≤ 0.01, Zcomb ≥ 6, and at least two features showing a strong deviation of their own, with a minimum number of moves so we’re not fooled by tiny samples. That’s not hunting for a single flashy move; it’s catching a coherent regime change. A minimal sketch of this per-game logic follows.
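The sketch below follows the robust z-score given above. Two things are assumptions made for illustration: Zcomb is folded as a root-mean-square of the per-feature scores, and the per-feature “strong deviation” threshold and minimum-move count in is_flagged are placeholders, not our calibrated values.

```python
# Minimal sketch of the per-game flagging logic. ASSUMPTIONS: the game and its
# baseline are summarized per feature (here, per-move medians), and Zcomb is an
# RMS fold of the per-feature robust z-scores.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def robust_z(game_summary: np.ndarray, baseline_summaries: np.ndarray) -> np.ndarray:
    """Per-feature (game - median) / (1.4826 * MAD), baseline = the other games."""
    center = np.median(baseline_summaries, axis=0)
    mad = stats.median_abs_deviation(baseline_summaries, axis=0, scale=1.0)
    return (game_summary - center) / (1.4826 * mad)

def z_comb(z_scores: np.ndarray) -> float:
    """Fold per-feature scores into one number (RMS here; an assumed choice)."""
    return float(np.sqrt(np.mean(z_scores ** 2)))

def q_comb_per_game(per_game_pvalues: list[np.ndarray]) -> np.ndarray:
    """Fisher-combine each game's per-feature p-values, then BH across games."""
    fisher_p = [stats.combine_pvalues(p, method="fisher")[1] for p in per_game_pvalues]
    _, q_comb, _, _ = multipletests(fisher_p, method="fdr_bh")
    return q_comb

def is_flagged(q: float, z: np.ndarray, n_moves: int,
               min_moves: int = 20, strong: float = 3.0) -> bool:
    """q_comb <= 0.01, Zcomb >= 6, at least two strongly deviating features,
    and enough moves (the 3.0 / 20 cut-offs are placeholders)."""
    return (q <= 0.01 and z_comb(z) >= 6.0
            and int(np.sum(np.abs(z) >= strong)) >= 2
            and n_moves >= min_moves)
```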
Warming up on a clean player
Before touching any suspicious case, we chart key features for a known, clean player to see how our metrics behave. For brevity, we won’t include all of them:
A tool built for style, not engine correlation
We built a tool that extracts all the features above directly from PGNs, aiming to estimate a player’s characteristic style and identify outliers. To keep things interesting, we also studied a confirmed cheater, but chose a non-obvious case (regular win rate, no huge streaks; the kind most would overlook). We analyzed the last 300 games; a few flags are shown below.
Each line shows [index — date | q | Z | top contributing features].
- 281 — 2025-08-07 | q=0 | Z=22.79 | Defender Score, Mobility, Space Control
- 30 — 2025-06-23 | q=0 | Z=21.80 | Defender Score, Open Files, Space Control
- 64 — 2025-06-23 | q=0 | Z=21.49 | Defender Score, Space Control, Piece Coordination
- 49 — 2025-06-23 | q=0 | Z=21.39 | Defender Score, Space Control, Control of Center
- 96 — 2025-06-23 | q=0 | Z=21.23 | Defender Score, Control of Center, Mobility
- 296 — 2025-08-07 | q=0 | Z=21.21 | Defender Score, Passed Pawns, Space Control
- 183 — 2025-06-21 | q=0 | Z=21.17 | Defender Score, Control of Center, Doubled Pawns
- 105 — 2025-06-21 | q=0 | Z=21.10 | Defender Score, Space Control, Piece Coordination
- 173 — 2025-06-21 | q=0 | Z=21.00 | Defender Score, Mobility, Piece Coordination
- 272 — 2025-08-07 | q=0 | Z=20.98 | Defender Score, Space Control, Doubled Pawns
- 164 — 2025-06-21 | q=0 | Z=20.92 | Defender Score, Mobility, Doubled Pawns
- 118 — 2025-06-21 | q=0 | Z=20.92 | Defender Score, Control of Center, Space Control
Repeated culprits are clear: Defender Score, Space Control, Mobility, Control of Center, Piece Coordination, and pawn-structure signals.
CAMS-II: Scaled Error Value (SEV)
To cross-check our style-based flags, we use a complementary, engine-agnostic summary: SEV, which will be introduced in the next update of CAMS. It adds an orthogonal view built on centipawn loss per move; no engine lines per se, just the loss distribution.
Per-move penalty (smooth, bounded): each move’s centipawn loss is mapped through a smooth, saturating penalty.
Aggregate error, SEV (0–100, lower is better): the per-move penalties are aggregated across the game.
- A scale parameter (in cp) sets when errors begin to “hurt”.
- Large outliers can be capped (e.g. at a fixed cp value).
This converts into a Precision Index (higher is better). As a rough guide:
SEV: <12 elite GM/IM · 12–25 strong/master · 25–45 club/intermediate · >45 beginner. Exact thresholds are not calibrated yet.
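Since SEV’s exact formula will only appear in the next CAMS update, the sketch below is an assumed illustration of the ingredients described above: capped centipawn losses, a smooth saturating per-move penalty, a 0–100 aggregate, and a complementary Precision Index (assumed here to be 100 − SEV).

```python
# ASSUMED illustration of SEV (the exact formula is not published yet).
import numpy as np

def sev(cp_losses: list[float], scale_cp: float = 100.0, cap_cp: float = 1000.0) -> float:
    """Aggregate a game's per-move centipawn losses into a 0-100 error score."""
    losses = np.clip(np.asarray(cp_losses, dtype=float), 0.0, cap_cp)  # cap large outliers
    penalties = 100.0 * losses / (losses + scale_cp)  # smooth, bounded in [0, 100)
    return float(penalties.mean())

def precision_index(sev_value: float) -> float:
    """Complementary 'higher is better' view (assumed: 100 - SEV)."""
    return 100.0 - sev_value

# Hypothetical per-move losses in centipawns:
# losses = [12, 0, 35, 210, 8, 60]
# s = sev(losses)
# print(round(s, 1), round(precision_index(s), 1))
```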
What we got in this case
We evaluated 300 games of the suspect (declared Elo 1200):
>78% of the games flagged as suspicious by our model had SEV > 25 (mid-club or worse).
Read together, SEV and our model correlate clearly: more than three-quarters of the games our model flags as suspicious are corroborated by SEV, even though our model uses no engine at all.
What we can (and cannot) conclude
What we can say (confidently):
- Our style pipeline detects distributional drift: shifts in center (Mann–Whitney + Θ), spread (Levene–median), and shape/tails (KS, Anderson–Darling), quantified with effect sizes (Hedges’ g, Cliff’s δ) and transport distance (Wasserstein-1 / IQR).
- At the game level, the combined evidence (Fisher + FDR) and the multifeature z-aggregation (Zcomb) isolate coherent regime changes rather than one-off spikes.
- In the “confirmed cheater” dataset, the top flags repeat the same handful of features (Defender Score, Space Control, Mobility, Center, Coordination, pawn-structure signals), which is exactly what a non-idiosyncratic (too regular) process would generate.
- In the 300-game, 1200-Elo case, >78% of the style-flagged games also had SEV > 25 (mid-club or worse). That shows our detector is not merely hunting for “engine-good” play; it’s picking up unnatural regularity irrespective of absolute strength. Style anomalies are about how moves evolve, not how strong they are in centipawns.
What we cannot say (from this alone):
- We do not infer intent or tooling. Style drift ≠ cheating. It could be coaching, time-scramble habits, premove patterns, drills, fatigue, opponent pool shifts, or site/time-control artifacts.
- We don’t estimate engine match rates or best-line agreement, by design. This is a complement to, not a substitute for, engine-correlation audits.
Final thoughts
This field is still largely underexplored. Despite its limitations, when combined with a well-designed ML layer, this approach could have broad applications in cheat detection, sandbagging identification, and abuse mitigation. This experiment, while grounded in solid statistical ideas, is only a very basic prototype meant to showcase a few of the many possible applications.
I find one use case especially relevant: detecting suspicious behavior within winning streaks. Yes, we may observe that a suspect plays strongly, but how far did they drift from their own stylistic baseline? Does that deviation matter? How plausible is it?
Another compelling scenario is account sharing. How feasible would it be to detect two clearly distinct playing patterns on the same account?
In my view, this kind of analysis expands the toolkit we have to combat fraud in chess. There’s plenty to refine (calibration, thresholds, validation cohorts), but the direction is promising.
Thanks for reading. Until next time, and see you over the board.
Jordi Agost