orgfert wrote: The real test is if you could do that to a human chess player, would he disagree with you?
You make it sound as if there is not very much left after turning off books, ponder, and learning. I believe many authors would disagree with you.
Adam Hair wrote: Put at a disadvantage? No. Invalidate their standing? No. I would think some of the extras are there in order to make the chess engine play more interesting, not because it would help it be higher on any rating list.
Ask a human intelligence whether the things you turn off in an artificial intelligence are extras he can do without.

Adam Hair wrote: Do you really think that list makers determine what should be in a chess program?
They determine how to limit and hobble all the AI, which must surely affect differing designs by differing and unknowable quanta. The result is then considered a scientific effort at objective measurement, though how it could be considered so with such arbitrary interference is puzzling.

Adam Hair wrote: You give the whole group too much credit. Some authors undoubtedly strive to climb the lists. Others pay more attention to giving their program a full set of features.
I think you'll find I give them almost no credit (no offense intended), due to arbitrary tampering with the designs. I think this is done innocently, in ignorance.
Adam Hair wrote: Anybody who does not understand that computer chess is artificial intelligence needs to do some reading. Yet, simply testing for engine strength does not dismiss that connection. How do you think Bob Hyatt tests Crafty?
With books, ponder, and learning on? No. When he competes with Crafty, then yes.
hyatt wrote: This is _really_ mixing apples and oranges. In my testing, I am not trying to find out how much better or worse my program is when compared to another program. I am testing different versions of the _same_ program to see if the changes are good or bad. That is far different than the intent of a chess tournament or a rating list.
Your intent really is not that much different than the intent of a rating list. One purpose of the CCRL lists is to compare engine versions. When I test a new version of an engine, I set up a gauntlet against the opponents the older version played against, in order to have some comparison between the two versions. The main difference between your intent and mine is that many of your test runs check individual changes, whereas I am checking the end result of all the changes made. Other than that, there does not appear to be much difference in intent. You check to see whether some change in the code results in an Elo increase over the previous test version, measured against the gauntlet of engines each version played against.
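To make that concrete, here is a rough sketch of the arithmetic behind such a gauntlet comparison, assuming the standard logistic Elo model; the function name and the win/draw/loss totals are invented for illustration and are not from either testing setup.

Code:
# Sketch: estimate each version's Elo relative to a common gauntlet from its
# win/draw/loss totals, using the logistic model d = 400 * log10(s / (1 - s)),
# where s is the score fraction. All numbers below are hypothetical.
import math

def elo_vs_gauntlet(wins, draws, losses):
    """Elo of one engine version relative to the average of its gauntlet opponents."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games      # score fraction in (0, 1)
    return 400.0 * math.log10(score / (1.0 - score))

# Hypothetical totals for the same gauntlet played by two versions of one engine.
old_elo = elo_vs_gauntlet(wins=380, draws=320, losses=300)   # about +28 Elo
new_elo = elo_vs_gauntlet(wins=400, draws=320, losses=280)   # about +42 Elo

print(f"old {old_elo:+.1f}, new {new_elo:+.1f}, change {new_elo - old_elo:+.1f} Elo vs the same gauntlet")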
hyatt wrote: I eliminate pondering, because it increases randomness. I eliminate the book because I don't want to have to test every opening, and I don't want to deal with the interference learning can cause. I don't use SMP because that is a performance issue that has nothing to do with modifying the program's search or evaluation to improve them. And all I am trying to measure is the change(s) we make to the evaluation or search. Was the change good or bad?
Actually, rating lists are trying to do the same sort of measurement, just for many more engines.
hyatt wrote: A rating list or a tournament is a different thing. There, the "whole system" is under scrutiny: book, search, eval, speed (SMP included), endgame tables, learning, whatever else your engine does to make it play better.
Definitely for a tournament. If I were competing in a tournament, I would want to use anything associated with my program that could help me win. However, a rating list is not a competitive event.
hyatt wrote: So, as I said, this is apples and oranges. How I test has nothing to do with how one should conduct a tournament, nor a rating list...
I used you as an example because how you test is widely known. How I conduct tests for the CCRL is similar, yet it is not modeled on your method; it is simply the best way to conduct the testing, given the goals and constraints I have.
hyatt wrote: I agree with your comments below. We already know that a book can make a huge difference in real games. A good book will both (a) guide the program into openings where it plays well and (b) guide the program away from openings that its eval or search seems ill-suited to handle. This means that one can either try to fix a hole in the evaluation, or use the book to avoid that hole. If you graft an odd book onto a program, you deny it this protection that the author depended on, and the results can be artificially worse. Or if your opponent uses a book that is ill-suited to it, your program might look better if it forces the opponent into openings it would normally avoid.
The only question is, which is better? If you only want to compare engines, no books could work, assuming you expect all engines to play all openings equally skillfully. But none of the authors really believe that is possible. Humans don't play that way; we avoid that which we don't understand or are unfamiliar with.
Here you briefly take my point, but then immediately toss it aside with little attempt at explanation.
Adam Hair wrote: But when he wants to find out if some changes in the code make Crafty stronger, all of that is turned off. The same for other authors. And the rating lists serve as a check for them.
But in this case, he is tuning search and eval only. To do this, he must isolate it from its dynamic AI functions. Why rating lists would only be interested in a subset of the total AI seems strange. Why is no one interested in the total AI?
Adam Hair wrote: We are not giving any program a UL listing. The fact is this: we are testing the chess engine, not the chess program.
Ok, but this has been clear from the start. What is not so clear is the reason why no one wants to know the relative strength of the AI.
Adam Hair wrote: Start testing all the bells and whistles yourself.
What you are calling bells and whistles are the holy grail of AI. One wonders what we are endeavoring to discover by crippling whatever abilities have been achieved. I don't understand the answers that have been given to this so far.
Adam Hair wrote: You certainly feel strongly about this. However, the strength of your convictions does not determine whether you are right or wrong about an issue.
I'm somewhat at a loss, since it seems completely obvious. Testing competing AIs would seem to be a goal with no shortage of champions, yet one finds it a goal of almost no one. And when it is suggested, eyebrows are raised as if the suggestion were utterly ludicrous (complete with laughing emoticons).