BB+ wrote:
I too was wondering about the reference here, as it seemed to me that GB was the TalkChess mod to whom he referred, while I too thought it unlikely he was a Rybka beta tester.

I never realized that Graham was a Rybka beta tester. Wow.

BB+ wrote:
"There is more uniformity of testing conditions intra-league as opposed to inter-league."
It is precisely this contention with which I disagree. Firstly, doing Crafty benchmarks to normalise 40/40 on a given machine is just not too precise. A given engine might be 5-10% slower or faster (relatively) due to setup issues, particularly with memory speed and/or caching. Furthermore, every tester can choose a different book, even on a match/tournament basis. These two aspects are fairly large sources of non-uniformity, even for one tester who has multiple non-identical machines and/or plays different tournaments with different books. More minor aspects of non-uniformity would be choice of TB usage and possibly GUI draw/resign rules (if applicable). In contrast, SSDF uniformised its conditions almost completely. I would like to see some actual evidence (or at least a compelling argument) that lumping together all "blitz" time controls from the alphabet of agencies is any worse than what is already extant from combining quasi-comparable data under conditions deemed equivalent.

Without a doubt there is a large amount of non-uniformity within, let's say, the CCRL. Of course, the use of a single benchmark to normalise game time controls is not very accurate, not to mention that the time control computed from the benchmark must be modified to match the time controls that are available in a given GUI. Yet, the time controls used by the more prominent rating lists differ a great deal more ( 40/10', 5'3" incremental, game in 10', all with faster hardware than the CCRL norm ). The different books used within the CCRL could be a large source of non-uniformity, but the ultimate source of those books is the same: human grandmaster games. I am not sure whether that lessens the non-uniformity or not. A marked difference from the CCRL is IPON ( the use of 50 start positions ) and the CEGT ( a considerable number of games in their database start from Nunn and Noomen positions ).
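To make that imprecision concrete, here is a minimal sketch of how such a benchmark normalisation works, assuming the obvious approach ( the reference figures, function names, and rounding rule are my own illustration, not the actual CCRL procedure ): measure the benchmark speed on the test machine, scale the nominal time control by the ratio to a reference machine, then round to something the GUI accepts.

Code:
# Sketch of benchmark-based time-control normalisation ( illustrative only;
# the reference figures and rounding rule are assumptions, not CCRL's actual procedure ).

REFERENCE_NPS = 1_000_000   # benchmark nodes/sec on a hypothetical reference machine
NOMINAL_MINUTES = 40        # the nominal "40 moves in 40 minutes" defined on that machine

def equivalent_minutes(local_nps):
    # Scale the nominal time so the total nodes searched matches the reference machine.
    return NOMINAL_MINUTES * REFERENCE_NPS / local_nps

def round_to_gui_steps(minutes, step=1.0):
    # A GUI only accepts certain time settings, so the computed value must be rounded.
    return max(step, round(minutes / step) * step)

local_nps = 1_800_000                  # this machine benchmarks 1.8x faster
t = equivalent_minutes(local_nps)      # about 22.2 minutes
print(round_to_gui_steps(t))           # 22.0 -- already a lossy approximation

A 5-10% error in the measured speed feeds straight into the scaled time, and the GUI rounding adds another small distortion on top of that.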
Despite what I wrote above, you very well may be right. The relative rankings of the engines common to the more
prominent rating lists do not appear to differ much. Engine strength appears to be invariant, at least to some degree,
in relation to test conditions. Combining the results from various lists, as Vincent Lejeune did at one point, may in fact
be no less "accurate" than any single list.
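For what it is worth, a combination along those lines could be done in a fairly simple way ( this is only a sketch under my own assumptions, not Vincent Lejeune's actual method, and the ratings in the example are made-up placeholders ): shift each list onto a common scale using the average difference over the engines it shares with an anchor list, then average the shifted ratings per engine.

Code:
# Naive rating-list merge ( illustrative sketch only, not any list's official method ):
# put every list on the first list's scale via the mean offset over shared engines,
# then average the adjusted ratings for each engine.

def merge_lists(lists):
    anchor = lists[0]
    adjusted = [dict(anchor)]
    for other in lists[1:]:
        common = set(anchor) & set(other)
        if not common:
            continue  # no shared engines, so this list cannot be placed on the scale
        offset = sum(anchor[e] - other[e] for e in common) / len(common)
        adjusted.append({e: r + offset for e, r in other.items()})
    merged = {}
    for table in adjusted:
        for engine, rating in table.items():
            merged.setdefault(engine, []).append(rating)
    return {e: sum(rs) / len(rs) for e, rs in merged.items()}

# Made-up placeholder ratings, purely to show the mechanics.
list_a = {"Houdini 1.5": 3200, "Rybka 4": 3150, "Stockfish 1.9": 3120}
list_b = {"Houdini 1.5": 2950, "Rybka 4": 2890}
print(merge_lists([list_a, list_b]))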
BB+ wrote:
I much appreciate CCRL, but feel that its focus has been mis-oriented, or at least mis-understood. For some reason, others have pointed to it as the "gold standard" of ratings lists, but the thing I find useful about it is that it canvasses so many engines.

The focus is misunderstood. Its greatest attribute is what you like about it; it canvasses ( or at least tries to ) many
engines. To accomplish that, resources from several people are pooled together. Therein lies the greatest weakness
of the CCRL, in terms of being a ratings list. The numerical ratings must be taken with a grain or two of salt.
But, as a list that informs engine authors and the curious of the relative strength of an engine ( or helps verify it ), it
does a pretty good job. That is why I joined. If everybody truly understood this, they would see that it does not actually
matter whether or not the CCRL tests certain engines. IPON and SWCR would do a better job of finding the difference
in Elo between Houdini 1.5 and Rybka 4 ( single testers focused on rating the top engines ).
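As a rough illustration of why a single tester pouring games into a handful of top engines pins that difference down more tightly, here is the standard logistic Elo calculation with an approximate 95% margin ( the win/draw/loss counts are arbitrary examples, not IPON or SWCR data ):

Code:
# Elo difference from a head-to-head score, with a rough 95% interval
# ( standard logistic Elo model; the W/D/L counts are arbitrary examples ).
import math

def elo_diff(wins, draws, losses):
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n                  # score fraction for engine A
    diff = 400.0 * math.log10(score / (1.0 - score))  # estimated Elo difference
    # per-game variance of the score, propagated through the Elo curve
    var = (wins * (1.0 - score) ** 2 + draws * (0.5 - score) ** 2 + losses * score ** 2) / n
    se_score = math.sqrt(var / n)
    slope = 400.0 / math.log(10) * (1.0 / score + 1.0 / (1.0 - score))
    return diff, 1.96 * se_score * slope              # estimate, +/- 95% margin

print(elo_diff(90, 220, 90))      # 400 games, even score: 0 Elo, roughly +/- 23
print(elo_diff(900, 2200, 900))   # 4000 games, same score: 0 Elo, roughly +/- 7

The estimated difference is the same in both cases; what shrinks with the extra games is the error margin, which is the whole point of concentrating games on a few top engines.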