I quite agree. I discussed this with Larry (in PMs at the Rybka forum) back when you were first tossing this idea around. I thought I had a few ideas for how to tweak the search, but the robustness in the eval remains. Actually, now that I think of it, the later IvanHoes have some sort of "randomiser", which merely seems to perturb the eval by some amount (I'd have to check the details). Maybe I can test eval versus perturbed-eval to see how much noise one needs to create to get an effect. I also think taking (at least the open-source) engines and cross-comparing correlations from evaluate() with "go movetime 100" is a useful experiment. The tester seems to identify strong correlations between the playing styles of programs very clearly, and it does this better than I had hoped.
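To make the cross-comparison concrete, here's a minimal sketch (Python, using the python-chess library) of the kind of thing I mean: run two engines over the same set of positions at roughly "go movetime 100" and count how often they choose the same move. The engine paths, the FEN file, and the plain agreement fraction are all placeholders of mine, not what the tester actually does.

```python
# Minimal sketch: fraction of positions on which two UCI engines choose the
# same move at ~100 ms per move. Requires python-chess; engine paths and the
# FEN file are placeholders.
import chess
import chess.engine

ENGINE_A = "./stockfish"    # hypothetical engine paths
ENGINE_B = "./ivanhoe"
FEN_FILE = "positions.fen"  # one FEN per line

def move_agreement(path_a, path_b, fens, movetime=0.1):
    """Return the fraction of positions where both engines pick the same move."""
    a = chess.engine.SimpleEngine.popen_uci(path_a)
    b = chess.engine.SimpleEngine.popen_uci(path_b)
    try:
        limit = chess.engine.Limit(time=movetime)  # roughly "go movetime 100"
        matches = 0
        for fen in fens:
            board = chess.Board(fen)
            move_a = a.play(board, limit).move
            move_b = b.play(board, limit).move
            matches += (move_a == move_b)
        return matches / len(fens)
    finally:
        a.quit()
        b.quit()

if __name__ == "__main__":
    with open(FEN_FILE) as f:
        fens = [line.strip() for line in f if line.strip()]
    print(f"agreement: {move_agreement(ENGINE_A, ENGINE_B, fens):.3f}")
```

The same harness would work for eval-versus-perturbed-eval: point both paths at the same binary and perturb one side's eval (or use the randomiser option, if that's what IvanHoe exposes), then watch how the agreement fraction decays with noise.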
One thing I like about fixed depth is that there's no dispute about what the "default" level of matching is (at least w/o SMP). I'm not sure this outweighs any negatives. Given that the time allotted appears to be a secondary factor, I would opt for whichever is easier. One issue with using "stop" (which does improve on "go movetime", I agree) is how the OS does time slicing with a "waiting" process (typically I think the slices are 1/100 of a second on Linux). As noted in the Stockfish discussion, you can still hit a "polling" discretisation behaviour when I/O is only checked every 30K nodes and the search is taking maybe five times this amount. If nothing else, as with any experiment, there needs to be some quality control.
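As a bit of that quality control, one could probe the "stop" latency directly over raw UCI and see whether the time-slicing or polling granularity actually shows up. A rough sketch, with the engine path and the 100 ms search window as my own assumptions:

```python
# Quick-and-dirty probe of "stop" latency over raw UCI: start an infinite
# search, let it run ~100 ms, send "stop", and time how long the engine
# takes to answer with "bestmove". If input is only polled every ~30K
# nodes, that granularity should be visible here.
import subprocess
import time

ENGINE = "./stockfish"  # hypothetical path to any UCI engine binary

proc = subprocess.Popen([ENGINE], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd):
    proc.stdin.write(cmd + "\n")
    proc.stdin.flush()

def read_until(prefix):
    # skip "info" chatter until a line with the given prefix arrives
    while True:
        line = proc.stdout.readline()
        if line.startswith(prefix):
            return line

send("uci")
read_until("uciok")
send("position startpos")
send("go infinite")
time.sleep(0.1)             # let the search run for ~100 ms
t0 = time.time()
send("stop")
read_until("bestmove")      # engine replies once it notices the "stop"
print(f"stop latency: {(time.time() - t0) * 1000:.1f} ms")
send("quit")
proc.wait()
```

Running this over many positions (and a few movetimes) would give a distribution of stop latencies, which is exactly the discretisation effect to control for before trusting the matching numbers.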
One question I have about all of this: can this detect specific overlap in evaluation features, or is it more about evaluation numerology?