Adam Hair wrote:
I am asking for a coherent argument. It is easy to lob criticism and insults; that is how most people go about
dismissing others. What would be much more enlightening are well-thought-out counterpoints to what I am about to
write.
1) Engine books are used in computer chess tournaments, which are actual competitions between engine authors. Your
reference to FIDE and the ECF would apply to that scenario, but it is irrelevant to rating lists. The SWCR, IPON, CEGT, and
CCRL do not use engine books because the goal is to measure the strength of each engine, not of engine+book.
orgfert wrote:
Human rating lists are of the complete chess player, which includes memorized openings and endgames as well as learning.
Without a doubt, you are correct. However, most chess engines are not so complete. A choice has been made to test
what is common among all chess engines.
Adam Hair wrote:
2) The "alien" books you refer to are forced on every engine. The purpose is to create a balanced position for the
engines to begin play. Whether or not balanced opening positions are actually achieved is another question.
orgfert wrote:
Most human rating lists are not composed of such events.
Adam Hair wrote:
3) TBs are not removed in general. The question of whether TBs improve Elo has been tested in some cases, but
the CCRL (and the other rating groups, I think) uses TBs.
orgfert wrote:
Ok.
Adam Hair wrote:
4) On ponder strategies: when comparing the various rating lists, whether ponder is on or off does not appear to affect
the relative rankings much. And ponder off allows more games to be played and more engines to be tested.
orgfert wrote:
Again, for the convenience of the tester. But in principle, this should not be done at all for reasons listed below.
Adam Hair wrote:
5) One thing not explicitly named, but which is part of the "total product", is learning. Testing an engine to
determine its Elo rating with learning on creates problems. If an engine is continually changing, then the
comparison of different engines' results against that engine has little meaning. Bayeselo assumes that each
engine has an unchanging true Elo. If an engine is changing, then its true Elo is changing. Several engines with
learning on would make any rating list constructed from their games even more inaccurate than it already is.
orgfert wrote:
I take it this means Bayeselo cannot rate humans, since they are dynamic, learning entities. If a program is written to be like a human, i.e. dynamic, its benefits will be concealed by this flawed testing. My analogy to human rating practices remains apropos. A program is a chess player, and its strength is composed of design elements that are then arbitrarily turned off by testers. This is grossly incorrect.
How many programs are written to be dynamic? As far as I have seen, very few have been. Crafty, ProDeo, and (I think)
RomiChess come to mind. I am sure that there are some others. Yet, the vast majority of engines are static entities,
unlike humans. I fail to see how your analogy to humans and human rating lists applies. If engines like the three I named
were in the majority, then I would be in your camp on this issue. But they are not.
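To put some numbers behind the Bayeselo point above, here is a minimal sketch of why a learning engine breaks a fixed-Elo model. It is not Bayeselo, and the ratings, the learner's 100 point improvement, and the game counts are made-up assumptions (win/loss only, no draws). Two opponents of identical true strength come out roughly 100 points apart simply because one faced the learner early and the other late:
[code]
import math
import random

random.seed(1)

def expected(diff):
    # Standard Elo expected score for a rating difference "diff"
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def play(r_a, r_b, n):
    # Simulate n games and return A's total score.
    # Simplification: win/loss only, no draws.
    return sum(1 for _ in range(n) if random.random() < expected(r_a - r_b))

def elo_diff(score, n):
    # Invert the expected-score formula to get an apparent rating difference
    p = score / n
    return 400.0 * math.log10(p / (1.0 - p))

N = 500
# Engine L "learns": its true strength is 2400 during its early games, 2500 later.
# Engines A and B are both truly 2450; A meets L early, B meets L late.
score_A = play(2450, 2400, N)
score_B = play(2450, 2500, N)

print("A appears %+.0f Elo relative to L" % elo_diff(score_A, N))
print("B appears %+.0f Elo relative to L" % elo_diff(score_B, N))
# A and B are equally strong, yet a model that treats L as a single
# fixed-Elo opponent places them roughly 100 points apart.
[/code]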
Adam Hair wrote:
6) Each rating list is an attempt at something approaching a scientific measurement of engine strength. How close
the approach comes is open to opinion.
In each case, there is an attempt to eliminate sources of variation.
Sometimes there are trade-offs (more testers allow for more games and engines, but create more statistical
noise), but at least there is some idea of each engine's strength (there are many more that should have been tested).
orgfert wrote:
This approach fundamentally destroys many design elements of a computer chess player's strength. Even if you discover that specific elements tend to make little difference, you are blinding the test to potentially effective strategies when they arrive in newer, more innovative versions.
Therefore, testing should be careful to include all design elements in a system for evaluation, whether they are deemed to differentiate or not. This is a fundamental principle that should never be violated.
This is done quite often in science:
define what you are trying to measure, try to eliminate sources of variation, then
measure it. The scope of the testing, in this case, is narrowly defined. We are trying to find the relative strength of
each engine. And there are a lot of engines out there, many being updated and new engines arriving each month. The
CCRL has been trying to test as many of them as possible. This goal may be at odds with what you would like to see done.
It has been helpful to others.
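To give a sense of the statistical noise mentioned above, here is a rough back-of-the-envelope sketch of how the error on a measured rating difference shrinks with the number of games. It uses only the standard Elo expected-score relation plus a normal approximation of my own; it ignores draws (which in practice reduce the variance), so the figures are illustrative, not anything a rating group publishes:
[code]
import math

def elo_error(n_games, p=0.5):
    # Approximate one-sigma uncertainty (in Elo points) of a rating
    # difference estimated from n_games with mean score p, via the
    # delta method on d = 400*log10(p/(1-p)).
    sigma_p = math.sqrt(p * (1.0 - p) / n_games)
    d_elo_dp = 400.0 / (math.log(10.0) * p * (1.0 - p))
    return d_elo_dp * sigma_p

for n in (100, 1000, 10000):
    e = elo_error(n)
    print("%6d games: about +/-%.0f Elo (95%% interval roughly +/-%.0f)" % (n, e, 2 * e))
[/code]
Roughly speaking, 100 games pins a head-to-head rating difference down to a few tens of Elo, and it takes thousands of games to get within a handful of points, which is why the lists prefer setups that let many games be played.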
Are you also caught up in the notion that the CCRL is some kind of accreditation organization?
If we were, then our tests would need to include all design elements. Well, we are not, and we do not pretend to be.