Well, this is called a regression test: it checks whether the newly tuned or adjusted idea/pattern is better than the original one.
It works the same way in every chess engine's testing.
For example, in SF 9 you have an evaluation of 3.25 for the bishop and 3.25 for the knight. Then you make a minor evaluation change in a new patch, where the bishop is 3.35 and the knight stays at 3.25. A regression test is then run between the original SF 9 and the patched version to see whether your new idea gets a better outcome. If it does not, your idea is rejected. If your patch passes, it is accepted into the newer development version. There are thousands of similar patches between individual Stockfish versions, and each patch is tested over a minimum of 10,000 games to prevent statistical flukes.
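To make the accept/reject step concrete, here is a rough Python sketch of how a match result could be turned into a decision. The function names, the fixed 10,000-game minimum (taken from the post above), and the simple "Elo gain above zero" rule are assumptions for illustration only. Real testing frameworks typically use a sequential statistical test rather than a fixed threshold like this.

```python
import math


def match_score(wins, draws, losses):
    """Fraction of points the patched engine scored (a draw counts as half)."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    return (wins + 0.5 * draws) / games


def elo_difference(score):
    """Elo difference implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def patch_passes(wins, draws, losses, min_games=10_000, min_elo_gain=0.0):
    """Hypothetical acceptance rule: enough games and a positive Elo estimate.

    min_games and min_elo_gain are illustrative parameters, not values
    taken from any real testing framework.
    """
    games = wins + draws + losses
    if games < min_games:
        return False  # too few games to trust the result
    return elo_difference(match_score(wins, draws, losses)) > min_elo_gain


# Example: the patch scores 51% over 10,000 games, so it passes this toy rule.
print(patch_passes(2600, 5000, 2400))
```

A 60-40 result over only 100 games would be rejected by this rule even though the score looks better, which is the "statistical fluke" case the minimum game count is meant to guard against.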
Yes. After several thousand games, the original neural network (network A) gets feedback from the outcomes of those games: this pattern led to a win, that one to a loss, and so on. With that feedback and other weight adjustments, a newer network (network B) is produced. Then A and B play a match. If B beats A, B becomes the new master network. But sometimes B fails against A, and then the newer network is rejected.
This way, only stronger and stronger networks are carried over into future training.
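The A-versus-B gating described above can be sketched as a small loop. This is a toy model, not any project's actual training code: each "network" is reduced to a single strength number, `play_match` simulates decisive games with the logistic Elo formula, and the 55% promotion threshold is an assumed value.

```python
import random


def play_match(strength_a, strength_b, games, rng):
    """Toy match: return the fraction of games won by B against A.

    Each game is decided by B's expected score under the Elo model;
    draws are ignored to keep the sketch short.
    """
    p_b = 1.0 / (1.0 + 10.0 ** ((strength_a - strength_b) / 400.0))
    b_wins = sum(1 for _ in range(games) if rng.random() < p_b)
    return b_wins / games


def gated_training(initial, candidates, games=400, threshold=0.55, seed=0):
    """Promote a candidate to master only if it scores at least `threshold`.

    `candidates` stands in for the networks produced by successive
    training rounds; the threshold value is an assumption.
    """
    rng = random.Random(seed)
    master = initial
    history = [master]
    for candidate in candidates:
        if play_match(master, candidate, games, rng) >= threshold:
            master = candidate  # B beat A, so B becomes the new master
        history.append(master)  # a rejected candidate leaves the master unchanged
    return history


# The much weaker second candidate is rejected, so the master never gets worse.
print(gated_training(1500, [2500, 1000, 3500]))
```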
So in addition to being a neural net, it sounds like a genetic algorithm -- except there is no "mutation" operation? Or am I misunderstanding the GA concept here?