What do you mean, they're "claims"?
It's a fact that the version wasn't the latest, it's a fact that they used a nonsense time control, and it's a fact that they gave it 64 threads but only 1 GB of hash. That's a completely untested and unusual setup.
It was publicity for their project, which was an AI project, not a chess project.
If they had been interested in a chess match they could have done it under normal conditions, but that will never happen, because that wasn't the point.
In fact I can only assume they did tests with various versions of SF, at various time controls and hash sizes. Apparently they didn't think the results of A0 would look good enough if they'd done a normal match.
Grischuk is actually a better example. In one Candidates match-play tournament, he strove for draws as White and often got them in under 25 moves. His strategy worked in the shorter matches, but when he faced Gelfand in a longer match he lost.
A 2800 player intent on drawing is very difficult to beat with perfect play. And less than perfect play runs the risk of actually losing.
All of the +2800s have several games where they didn't make any mistakes at all.
Presumably that's "no mistakes at all" according to the engine that got 36% against AlphaZero? I have blitz and maybe even bullet games that meet that threshold (although that relies on ignoring small evaluation differences, which are very sensitive to computation time; not that larger ones never are).
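To make that concrete, here's a minimal sketch (assuming the python-chess library and a Stockfish binary on the PATH, both my own assumptions, nothing from the match) showing how the reported evaluation of a single position shifts as you give the engine more time:

```python
# Sketch: how an engine's evaluation of one position drifts with thinking
# time. Assumes python-chess is installed and a binary called "stockfish"
# is on the PATH; any UCI engine would do.
import chess
import chess.engine

board = chess.Board()  # starting position; substitute any FEN of interest

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
for seconds in (0.1, 1.0, 10.0):
    info = engine.analyse(board, chess.engine.Limit(time=seconds))
    # info["score"] is relative to the side to move; convert to White's view.
    print(f"{seconds:>5}s -> {info['score'].white()}")
engine.quit()
```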
A 2800 player might well be able to force a draw almost always against another 2800 player, by basing every choice on minimising the opponent's winning chances. But if they can manage it against a 3400 player, they are not a 2800 player, they are a 3400 player. Grischuk is not a 3400 player.
AlphaZero did not test itself against a fully functioning Stockfish. Stockfish was not optimized the way it is for the CCRL ratings. But let's agree that the version used was ~3300 and that AlphaZero scored 64%-36%. That's about a 100-point difference. AlphaZero didn't dominate by 1000 points, or anything close to that. The vast majority of the games played in the match were drawn, and AlphaZero only published its wins.
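For anyone who wants the arithmetic behind "about a 100-point difference": under the standard Elo logistic model, a 64% score maps to roughly a 100-point gap. A quick check:

```python
# Elo logistic model: rating difference implied by an expected score.
import math

def elo_diff(score: float) -> float:
    """Rating gap implied by an expected score, 0 < score < 1."""
    return -400 * math.log10(1 / score - 1)

print(round(elo_diff(0.64)))  # ~100 points for a 64%-36% result
```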
No one has justified the claim that the version of Stockfish used in the research was significantly handicapped. This would be easy to do if it were true: just match it up against an optimised version and smash it.
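For what it's worth, that test isn't hard to set up. Here's a rough sketch using python-chess; the binary paths, options and game count are placeholders I've made up, not the settings from the paper, and a meaningful result would need hundreds of games at a serious time control.

```python
# Rough sketch of a head-to-head test between two Stockfish setups, e.g.
# the allegedly handicapped configuration vs. an optimised one. Paths,
# options and game count are placeholders, not the paper's settings.
import chess
import chess.engine

def play_game(white, black, seconds_per_move=1.0):
    board = chess.Board()
    while not board.is_game_over():
        engine = white if board.turn == chess.WHITE else black
        result = engine.play(board, chess.engine.Limit(time=seconds_per_move))
        board.push(result.move)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

a = chess.engine.SimpleEngine.popen_uci("./stockfish_match_version")
b = chess.engine.SimpleEngine.popen_uci("./stockfish_optimised")
a.configure({"Threads": 64, "Hash": 1024})    # the disputed setup
b.configure({"Threads": 64, "Hash": 32768})   # an "optimised" setup

tally = {"match_version": 0, "optimised": 0, "draws": 0}
for i in range(10):                            # alternate colours each game
    white, black = (a, b) if i % 2 == 0 else (b, a)
    result = play_game(white, black)
    if result == "1/2-1/2":
        tally["draws"] += 1
    else:
        winner = white if result == "1-0" else black
        tally["match_version" if winner is a else "optimised"] += 1
print(tally)

a.quit()
b.quit()
```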
It is true that there should be potential for ALL players to increase their standard of play by allocating more time to the moves that demand it. However, this was the same for both players. The same is true of opening books, but these are really a crutch for an engine that is bad at finding good moves in the opening (which is really just the part of the non-endgame that has been seen in previous games). Picking a move from an opening book is a kind of cheat that uses lots of previous computing time by other players to provide assistance.
The 1 GB hash table is argued to have been a huge handicap. However, the only research I can find on this suggests that increasing the hash table to huge sizes has an inconsistent and small effect on performance. Again, this could easily be checked by anyone who wants to and has a 32-core machine handy to do the comparison.
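The same harness can isolate the hash question: identical binary and thread count, with only the transposition-table size changed. Again, the numbers below are illustrative assumptions, not the match settings.

```python
# Same idea as the sketch above (reuse its play_game() helper), but only
# the Hash option varies. Sizes are in MB, per the standard UCI "Hash" option.
import chess.engine

small = chess.engine.SimpleEngine.popen_uci("./stockfish")
large = chess.engine.SimpleEngine.popen_uci("./stockfish")
small.configure({"Threads": 32, "Hash": 1024})    # 1 GB, as in the match
large.configure({"Threads": 32, "Hash": 65536})   # 64 GB, a "huge" table
# ...then run the same alternating-colour loop and compare the tallies.
small.quit()
large.quit()
```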
It's finally worth mentioning that the processor Stockfish ran on for the match was an unusually powerful one by computer-chess standards. AlphaZero's hardware was more powerful in terms of raw operations, but not at all suitable for running programs like Stockfish; it's designed for general-purpose AI workloads.
The AlphaZero test was just a test. It wasn't rated by anyone. It's a very impressive test, and AlphaZero has shown some tremendous improvements over Stockfish. But it has not altered reality. It has not shown that chess is a forced win, and it has not shown any hint that a computer could reach 4000 Elo. If anything, it demonstrated that there's a lot of space between 3300 and 3600. It probably performed at about 3400 strength, but we just don't know.
The space between 3300 and 3600 is defined by Professor Elo.
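Concretely, the same logistic formula says what a 300-point gap is worth: roughly an 85%-15% expected score.

```python
# Expected score for a given Elo gap (the inverse of the earlier formula):
# a 300-point edge predicts roughly an 85%-15% split.
def expected_score(diff: float) -> float:
    return 1 / (1 + 10 ** (-diff / 400))

print(round(expected_score(300), 3))  # ~0.849
```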
The graph of Stockfish's improvement posted above does not suggest an imminent performance ceiling.
There is a problem with the rating scales used for comparison. Strictly speaking, for humans there is a single rating scale for a specific time control. For computers this can be the same, or it can be further restricted to specific hardware. But that would make it impossible to compare any computer with special hardware, which is obviously not good enough. Better to have a rating for a specific (but unrestricted) combination of hardware and software at a given time control. This also acknowledges that computer strength advances through increasing hardware power as well as through improving software.
The annoying thing is that the rating scale for computers is not firmly tied to the one for humans. In principle, keeping them consistent only requires a continuum of competitive play between the two populations. Part of the uncertainty is due to people being loose about the definition of computer ratings, forgetting that the time control and the hardware are crucial as well as the software.