hyatt wrote: The problem is that if you test A vs A', you might get +30 with a search change. Then if you run A and A' against a range of engines, you may only get +5. It is easier, and more accurate, to use the same test setup each time, then there's no accidental changes to skew the results unknowingly.

I agree 100% with your diagnosis, but not with the conclusion.
This is not a "problem", this is a "feature" !!
As a developer I am not interested in ELO accuracy, but in reliably telling good patches from bad ones. These are two _different_ targets.
Your example is more theoretical than practical. In the real world, when you are modifying an already mature engine, what normally happens is that when you test A vs A' you might get +10 ELO, while if you run A and A' against a range of engines you may only get +3 ELO.
The difference between the two cases is not the quantitative ELO result; the _fundamental_ difference is that the first case is detectable by a test, while the second is _not_, because you are well below the error margin. So, when you don't have a cluster, A vs A' lets you detect as good a much broader set of patches than running A and A' against a range of engines does.
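To make the "error margin" point concrete, here is a small sketch (not from the post above; the draw ratio and game counts are illustrative assumptions) that estimates the rough 95% confidence half-width, in ELO, of a match result as a function of the number of games played:

```python
import math

def elo_error_margin(games: int, draw_ratio: float = 0.35, z: float = 1.96) -> float:
    """Rough half-width (in ELO) of a ~95% confidence interval on a match result,
    assuming the two engines are close in strength (score near 50%).
    The 0.35 draw ratio is an illustrative assumption, not a measured value."""
    win = loss = (1.0 - draw_ratio) / 2.0        # non-drawn games split evenly
    variance = win * 0.25 + loss * 0.25          # per-game score variance around 0.5
    se_score = math.sqrt(variance / games)       # standard error of the match score
    elo_per_score = 1600.0 / math.log(10.0)      # slope of ELO(score) at 50% (~695)
    return z * se_score * elo_per_score

for n in (1000, 5000, 20000, 100000):
    print(f"{n:6d} games -> +/- {elo_error_margin(n):4.1f} ELO")
```

Under these assumptions, even 20,000 games leave a margin of roughly +/-4 ELO, so a +3 ELO result against a range of engines is indistinguishable from noise, while a +10 ELO A vs A' result is already visible with a few thousand games.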
IOW, if you sort the test results into good / bad / unknown buckets, then when you test only against a range of engines a lot of good patches end up in the 'unknown' bucket, many more than when you run A vs A'.
If you choose to discard all but the reliably good ones, then, testing only against a range of engines, be prepared to discard a lot of good stuff.
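As a toy illustration of the bucketing idea (the threshold rule is my own assumption, not a procedure described in the post), a result can be called 'good' only when its whole confidence interval sits above zero, 'bad' when it sits below, and 'unknown' otherwise. It reuses the elo_error_margin() function from the sketch above:

```python
def classify(elo_gain: float, games: int) -> str:
    """Bucket a measured ELO difference using the elo_error_margin() sketch above."""
    margin = elo_error_margin(games)
    if elo_gain - margin > 0:
        return "good"       # whole interval above zero: keep the patch
    if elo_gain + margin < 0:
        return "bad"        # whole interval below zero: reject the patch
    return "unknown"        # interval straddles zero: the test can't tell

# A +10 ELO self-play result vs a +3 ELO gauntlet result, same number of games:
print(classify(10.0, 20000))   # 'good'    (margin is roughly +/-4 ELO here)
print(classify(3.0, 20000))    # 'unknown' (well inside the margin)
```

This is exactly the asymmetry described above: the same number of games keeps the self-play gain out of the 'unknown' bucket but not the gauntlet gain.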
P.S.: Please don't come back with something along the lines of "Of course, you need 1 million games and then you are sure!"... I think you have understood my point.