To kick off some technical discussions

Code, algorithms, languages, construction...
Post Reply
Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: To kick off some technical discussions

Post by Sentinel » Sat Jun 12, 2010 5:46 pm

thorstenczub wrote:forgive me for being an imperfect human ;-)

but as long as i see an engine lose because it has no clue that KBB-K is a draw when the bishops are on the same color, or that you cannot mate with 2 knights, or that a wrong-colored bishop is a draw,
...
IMO a good chess engine should identify those things without tablebases.

and when i cannot see the games, i cannot identify those weaknesses in 2800 or 3000 ELO engines.

it still astonishes me to see those things happen in programs that are that strong.

it gives them such a strange, mechanical, computerish skin... it does not fit my paradigm of intelligent programs or intelligent methods. it's not human. it's machine-like.

i am still not used to this.
The reason for this is quite simple. ELO is a valid measure of strength and people want results. Engine programmers are quite pragmatic people.
Let's take a hypothetical example. Suppose implementing all the checks you listed would cost your engine 10% of its speed (lost in additional evaluation). 10% of speed is 10 ELO lost. 10 ELO means you will lose 1.5% of the games you would otherwise draw. Do you really think the things you mentioned would (statistically) make you lose 1.5% of total games played that would otherwise be draws?
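As a rough check of that arithmetic (my numbers, not Sentinel's), the standard Elo model gives the expected score for a rating difference Δ as

\[ E(\Delta) = \frac{1}{1 + 10^{-\Delta/400}}, \qquad E(-10) \approx 0.486, \]

so being 10 Elo weaker costs roughly 1.4 percentage points of expected score over a long match, which is the order of magnitude behind the 1.5% figure above.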

User avatar
Chris Whittington
Posts: 437
Joined: Wed Jun 09, 2010 6:25 pm

Re: To kick off some technical discussions

Post by Chris Whittington » Sat Jun 12, 2010 5:52 pm

orgfert wrote:
Chris Whittington wrote:I've bolded the sections below where both posters are saying the same thing without actually saying it.

The testing methodology can give ELO increases, but these increases do NOT map onto strength or chess skill.

Computer ELO and computer ELO lists are a kind of misleading nonsense.
How do you know?
well, rating lists are based on maths, statistics and some science, but there are unstated assumptions. the maths and the statistics are just processes and are sound, but nobody in computer chess ever mentions the assumptions. everything is presented looking all scientific and sound, and everybody just accepts it as gospel.

I leave aside the possibility of (unprovable) corruption amongst list makers or the (untested) origins of much of their data, so let's assume (despite, for example, Ed Schroeder's exposure of actual corrupt practices by ELO list makers in the past) that the raw data on win/loss/draw is sound, that programs are not excluded/included/weighted etc., and that all the other minute details of getting the method correct are attended to.

some of the assumptions are:

computer chess rating list ELO = a measure of chess playing skill, commonly referred to as strength

that there is some kind of linear relationship between 'strength' and ELO list rating

that changes in ELO list rating map to equivalent changes in 'strength' in some sort of linear way

that the computer chess model of improvement (hill-climbing a sub-optimal hill, testing against similar machine opponents, maximising an 'ELO') actually works at all stages of the hill, in particular at the top

The first assumption is incorrect; I think the originator of the ELO system pointed that out himself. The other assumptions are unproven, and the onus is on the list makers, publishers and developers who use the technique and then use the results to prove them. It would be too convenient for any one of them, let alone all, to turn out true and dandy, so I, for one, assume they are unsound.

Science does not work on a wing and a prayer, does it now?

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: To kick off some technical discussions

Post by Sentinel » Sat Jun 12, 2010 6:04 pm

Chris Whittington wrote: that there is some kind of linear relationship between 'strength' and ELO list rating

that changes in ELO list rating map to equivalent changes in 'strength' in some sort of linear way
The word linear is wrong. Nobody has even tried to prove something like that, because it's simply not correct. However, the word monotonic in its place would be correct. Or, to be mathematically precise: ELO and engine 'strength' (or skill, as you wish to call it) are always positively correlated.
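To make the monotonic-versus-linear distinction concrete (my gloss, using the standard Elo formula): the model maps a rating gap Δ to an expected score

\[ E(\Delta) = \frac{1}{1 + 10^{-\Delta/400}}, \]

which is strictly increasing in Δ but far from linear: E(100) ≈ 0.64, E(200) ≈ 0.76, E(400) ≈ 0.91, so doubling a rating gap does not double the score advantage. Whether a list rating tracks 'strength' monotonically is a separate, empirical question.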

User avatar
Chris Whittington
Posts: 437
Joined: Wed Jun 09, 2010 6:25 pm

Re: To kick off some technical discussions

Post by Chris Whittington » Sat Jun 12, 2010 6:06 pm

Sentinel wrote:
thorstenczub wrote:forgive me for being an imperfect human ;-)

but as long as i see an engine lose because it has no clue that KBB-K is a draw when the bishops are on the same color, or that you cannot mate with 2 knights, or that a wrong-colored bishop is a draw,
...
IMO a good chess engine should identify those things without tablebases.

and when i cannot see the games, i cannot identify those weaknesses in 2800 or 3000 ELO engines.

it still astonishes me to see those things happen in programs that are that strong.

it gives them such a strange, mechanical, computerish skin... it does not fit my paradigm of intelligent programs or intelligent methods. it's not human. it's machine-like.

i am still not used to this.
The reason for this is quite simple. ELO is a valid measure of strength and people want results. Engine programmers are quite pragmatic people.
Let's take a hypothetical example. Suppose implementing all the checks you listed would cost your engine 10% of its speed (lost in additional evaluation). 10% of speed is 10 ELO lost. 10 ELO means you will lose 1.5% of the games you would otherwise draw. Do you really think the things you mentioned would (statistically) make you lose 1.5% of total games played that would otherwise be draws?
well, you prove his point by showing another flaw of the statistical method of 'improvement'.

stupid bean-counter program doesn't understand some stupidity or other in its play .....

the statistical method of machine development says: "why bother? it's lots of code and time to fix, it hardly ever makes any difference, and the fix is bound to damage performance elsewhere", so the stupidity stays in place

the human chess player says, "oh, right, KBB-K draws? I'll remember that for the future ...."

and the result is that all computer chess programs are stuffed full of ridiculous idiocies. we can only pray that such developmental laziness and sloppiness doesn't apply in the plutonium-processing industry (for example), and that other complex algorithms in general are not shot through with this approach.

Artificial intelligence?! Très drôle. Computer chess remains bean-counting stupidity, but quickly.
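To make concrete how little code the knowledge under discussion takes, here is a minimal, hypothetical sketch of a material-draw recogniser in C. It is not taken from Crafty, Rybka, or any other engine; the type and function names are invented for illustration. An evaluator would call known_material_draw() before positional scoring and return a draw score when it fires.

```c
/* Hypothetical sketch of the endgame knowledge discussed above:
 * KNN-K, KBB-K with same-coloured bishops, and lone-minor draws.
 * Not code from any real engine. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int pawns, knights, rooks, queens;
    int light_bishops, dark_bishops;   /* bishops split by square colour */
} SideMaterial;

static bool bare_king(const SideMaterial *s)
{
    return s->pawns + s->knights + s->rooks + s->queens +
           s->light_bishops + s->dark_bishops == 0;
}

/* True if `strong` cannot force mate against a bare king. */
bool known_material_draw(const SideMaterial *strong, const SideMaterial *weak)
{
    if (!bare_king(weak))
        return false;
    if (strong->pawns || strong->rooks || strong->queens)
        return false;

    int bishops = strong->light_bishops + strong->dark_bishops;

    /* KNN-K (and KN-K, K-K): knights alone cannot force mate. */
    if (bishops == 0 && strong->knights <= 2)
        return true;

    /* KBB-K with both bishops on the same colour of squares. */
    if (strong->knights == 0 && bishops == 2 &&
        (strong->light_bishops == 2 || strong->dark_bishops == 2))
        return true;

    /* A single minor piece can never mate. */
    if (strong->knights + bishops <= 1)
        return true;

    return false;
}

int main(void)
{
    /* KBB-K with both bishops on light squares: recognised as a draw. */
    SideMaterial strong = { .light_bishops = 2 };
    SideMaterial weak   = { 0 };
    printf("KBB-K, same-coloured bishops: %s\n",
           known_material_draw(&strong, &weak) ? "draw" : "not recognised");
    return 0;
}
```

The wrong-coloured-bishop draw (bishop plus rook pawn whose promotion square the bishop does not control, defending king in the corner) needs king and pawn locations as well as counts, so it is left out of the sketch. Whether such checks pay for their evaluation cost is exactly the trade-off Sentinel describes above.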

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: To kick off some technical discussions

Post by Sentinel » Sat Jun 12, 2010 6:10 pm

Chris Whittington wrote:Artificial intelligence?! Très drôle. Computer chess remains bean-counting stupidity, but quickly.
It has always been like that, no matter how sad that sounds. I would say seeing computer chess as AI is just ignorance; computer chess has always been bean counting. However, it is very smart bean counting ;).

User avatar
Chris Whittington
Posts: 437
Joined: Wed Jun 09, 2010 6:25 pm

Re: To kick off some technical discussions

Post by Chris Whittington » Sat Jun 12, 2010 6:15 pm

Sentinel wrote:
Chris Whittington wrote: that there is some kind of linear relationship between 'strength' and ELO list rating

that changes in ELO list rating map to equivalent changes in 'strength' in some sort of linear way
The word linear is wrong. Nobody has even tried to prove something like that, because it's simply not correct. However, the word monotonic in its place would be correct. Or, to be mathematically precise: ELO and engine 'strength' (or skill, as you wish to call it) are always positively correlated.
Can you prove the positive correlation?

Can you prove positive correlation on a sub-optimal hill using a pool composed of similar machines such that the resulting ELO list maps from the fantasy land of the incestuous sub-optimal hill to the reality of real chess strength / playing skill?

I think not.

A little thought experiment for you: what might well happen if you introduced, let's say, some turtles from the Galapagos Islands into some part of Africa? Can you assure me, absolutely, that those turtles would survive and thrive?

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: To kick off some technical discussions

Post by Sentinel » Sat Jun 12, 2010 6:19 pm

Chris Whittington wrote:Can you prove the positive correlation?

Can you prove positive correlation on a sub-optimal hill using a pool composed of similar machines such that the resulting ELO list maps from the fantasy land of the incestuous sub-optimal hill to the reality of real chess strength / playing skill?
You don't prove it. You simply take ELO as the measure of strength by definition; in other words, ELO is your metric for measuring skill/strength. And it's not that the definition is wrong (or at least there is no better one); the question is whether our methods of measuring ELO are correct. There your examples have a point.
However, the problem is that so far we don't have a better way to measure ELO.

User avatar
Chris Whittington
Posts: 437
Joined: Wed Jun 09, 2010 6:25 pm

Re: To kick off some technical discussions

Post by Chris Whittington » Sat Jun 12, 2010 6:37 pm

Sentinel wrote:
Chris Whittington wrote:Can you prove the positive correlation?

Can you prove positive correlation on a sub-optimal hill using a pool composed of similar machines such that the resulting ELO list maps from the fantasy land of the incestuous sub-optimal hill to the reality of real chess strength / playing skill?
You don't prove it. You simply take ELO as the measure of strength by definition; in other words, ELO is your metric for measuring skill/strength. And it's not that the definition is wrong (or at least there is no better one); the question is whether our methods of measuring ELO are correct. There your examples have a point.
However, the problem is that so far we don't have a better way to measure ELO.
well, if I have, let's say, a dozen objects 2, 3, 4, ..., 12, 13 cm long, and I wish to rank them in order, but my measuring device is hopelessly, randomly inaccurate (or, as you put it, "we don't have a better way"!), then my ranking is going to be hopelessly wrong.

So, do we know how hopelessly randomly inaccurate our measuring stick on the sub-optimal incestuous hill is? No we don't. Do we know how accurate it is? No we don't.

Yet the computer chess 'community' accepts these 'rating lists' as gospel.
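Chris's measuring-stick analogy can be quantified (illustrative numbers of mine, not his). If each measurement carries an independent Gaussian error with standard deviation σ, two objects whose true lengths differ by d get ranked the wrong way round with probability

\[ P(\text{swap}) = \Phi\!\left(-\frac{d}{\sigma\sqrt{2}}\right), \]

so with d = 1 cm the swap probability is about 0.02% for σ = 0.2 cm, about 24% for σ = 1 cm, and about 41% for σ = 3 cm. Whether the ranking is usable therefore depends entirely on how large σ is relative to d, which is the quantity Chris is asking about.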

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 7:23 pm

thorstenczub wrote:i think this testing method is wrong. by playing 30,000 games between SIMILAR versions of itself,
you will indeed be able to measure small ELO differences between komodo X and komodo X+1
and komodo X+8.

One version will indeed be "better" relative to the other.

but i still believe that testing with the help of humans watching the games,
and, when they find something and fix it, also playing 30,000 games to measure
the difference of the "fixed" version relative to the version before,
could make more progress.

IMO one should test not against OWN versions, but against opponent programs.
because this is the usual thing that happens LATER when the new engine is released:
it plays not against OWN engines similar to itself but against FOREIGN engines.
it has to compete against them in rating lists, in tournaments ...
We have now played way over 100M games in our cluster testing. To date, I have not played one single game of Crafty version N vs Crafty version N+1. I have _never_ trusted that kind of testing since we first studied this issue. Before we started "production testing" we tried some experiments with C vs C' and C vs others, and quite often the results were drastically different. In a number of cases, C' would beat C when we added some bit of knowledge, but then C' would do _worse_ against a variety of non-Crafty opponents. So we never thought about including C vs C' games when we started testing seriously.

so can we trust that the incest tester is right when it reports "progress"? and that the "progress"
is also there when it comes to playing foreign engines?
It must not be as bad as it seems (incestuous testing), as Larry Kaufman has explained many times that this is how Rybka is tested: the two versions play game/1sec games all night long.

i doubt that the 15 ELO you "realized" by testing against yourself will show up against other engines too.

in the stone age we had to play test games by hand.
then the autoplayer was invented and we needed a pc for each program,
so 8 or 12 pcs.

then came those wonderful GUIs such as ARENA for windows. suddenly you were able
to test eng-eng matches and engine tournaments on one pc.

and the next industrial method of testing was the autotester with very fast games
and only statistical measurement without even looking into the games.

maybe one should combine the methods.
use autotesting to prove that the new change in the program was successful by
playing 30,000 fast games against DIFFERENT opponents, not against own "clones".

but i would still prefer watching the games.

the games should be stored in pgn and should be replayed by humans
to find out what is going on.

IMO the reason why some programs make no progress although the programmers are clever
and work for years is this autotesting eng-eng without looking into the game data.

you tune on your own engine, but the progress you get is not real.
it's only real if people also test against clones of your own engine,
and this is something NOBODY besides the programmer is doing.

Replaying the games is an intractable problem. I am currently running a few tests to see if recent changes work, and also to test the cluster, which has been having an NFS server problem. In the last 4 days, I have played over one million games (about 30K per hour or so). Who is able to look at even a fraction of those?
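For a sense of scale (a back-of-the-envelope figure of mine, not Hyatt's): reviewing one million games at even three minutes of human attention per game is 1,000,000 × 3 min = 50,000 hours, roughly 25 working years for one person.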

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 7:31 pm

Rebel wrote:Hi Bob, long time no see, nice to meet you here. I hope this forum will be a new fresh start to talk in peace about computer chess in all its aspects.
hyatt wrote: (2) If you look at the crafty source, I have a "phase" variable that tells me what phase of the move selection I am in, from "HASH_MOVE" to "CAPTURE_MOVES" to "KILLER_MOVES" to "REMAINING_MOVES". I do not reduce moves until I get to REMAINING_MOVES (effectively the L (late) in LMR). So for me, there are no extra SEE calls. I have already used SEE to choose which captures are searched in CAPTURE_MOVES, leaving the rest for REMAINING_MOVES. I therefore reduce anything in REMAINING_MOVES (except for moves that give check). So there is really no extra SEE usage at all.
I stopped developing mine some years ago; if memory serves me well, my exclusion list is as follows:

1) No LMR in the last 3 plies of the search; this is because of the use of futility pruning;

2) Always search at least 3 moves;

3) Hash-move;

4) Captures, though I guess I have to try your idea of skipping bad captures;

5) Queen promotions (no minors);

6) Extended moves;

7) Moves that give check;

8) Moves that escape from check;

9) Static mate threats;

10) Killer moves;

Now lend me your cluster :lol: as I have understood, the key to making progress is to play at least 30,000+ eng-eng games. Perhaps you can elaborate a bit on the latter; I am a bit out of date these last years, but I still find the latest developments fascinating.

Regards,

Ed
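For readers who want to see what an exclusion list like Ed's looks like in code, here is a hypothetical sketch in C. It is illustrative only, not Rebel's or Crafty's actual implementation; the flags and the depth/move-count thresholds are simply transcribed from the ten items above. A search would call reduce_ok() before applying a late-move reduction.

```c
/* Hypothetical LMR gate modelled on the exclusion list quoted above.
 * Each numbered comment matches the corresponding item in the list. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  depth;              /* remaining search depth in plies        */
    int  moves_searched;     /* moves already tried at this node       */
    bool is_hash_move;
    bool is_capture;
    bool is_queen_promotion;
    bool is_extended;        /* move triggered a search extension      */
    bool gives_check;
    bool escapes_check;      /* side to move was in check              */
    bool is_mate_threat;     /* static mate threat detected            */
    bool is_killer;
} MoveInfo;

bool reduce_ok(const MoveInfo *m)
{
    if (m->depth <= 3)          return false;  /* 1) futility region   */
    if (m->moves_searched < 3)  return false;  /* 2) always search 3   */
    if (m->is_hash_move)        return false;  /* 3) hash move         */
    if (m->is_capture)          return false;  /* 4) captures          */
    if (m->is_queen_promotion)  return false;  /* 5) queen promotions  */
    if (m->is_extended)         return false;  /* 6) extended moves    */
    if (m->gives_check)         return false;  /* 7) checking moves    */
    if (m->escapes_check)       return false;  /* 8) check evasions    */
    if (m->is_mate_threat)      return false;  /* 9) mate threats      */
    if (m->is_killer)           return false;  /* 10) killer moves     */
    return true;                               /* late quiet move: reduce */
}

int main(void)
{
    MoveInfo quiet_late = { .depth = 6, .moves_searched = 7 };
    printf("reduce quiet late move: %s\n", reduce_ok(&quiet_late) ? "yes" : "no");
    return 0;
}
```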
Hi Ed...

my basic approach is to play a varied set of positions, two games per position, against a set of opponents that are reliable on our cluster (the main cluster has 128 nodes with 2 cpus each; the other cluster has 70 nodes with 8 cpus each). 30K games gives me an error bar of +/-3 Elo using BayesElo on the complete PGN. Many changes we make are just a few Elo up or down. Some are 10-20, but these are rarer and could be detected with fewer games. But doing this, Crafty's actual Elo has gone up by almost 300 points in 2 years, where we were lucky to get 40 in previous years, because it is so easy to make a change that sounds good in theory and looks good in particular positions but hurts overall.
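The ±3 Elo figure is consistent with a simple standard-error estimate (my arithmetic, not a description of BayesElo's internals). With a per-game score variance of roughly σ² ≈ 0.125–0.2 (depending on the draw rate), the standard error of the mean score over N = 30,000 games is

\[ \mathrm{SE} = \sqrt{\sigma^2/N} \approx 0.0020\text{–}0.0026, \]

and near equal strength the Elo curve has slope ln(10)/1600 ≈ 0.00144 score per Elo point, so that score error corresponds to about 1.4–1.8 Elo, i.e. a 95% interval in the neighbourhood of ±3 Elo. The exact interval BayesElo reports depends on its model and the observed draw rate.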

Using the 256-cpu cluster, playing fast games of 10 seconds on the clock + 0.1 second increment, we can complete an entire 30,000 game match in about an hour, which means we have a very accurate answer about whether a change was good or bad. And while it is not real-time, we can be working on the next change while testing the last one, so it is pretty efficient. We occasionally play longer games, and I have done 60+60 once (60 minutes on the clock, 60-second increment), which took almost 5 weeks to complete. Fortunately, testing has shown that almost all changes can be measured equally well at fast or slow time controls; only a few react differently given more or less time.

My starting set of positions was chosen by using a high-quality PGN game collection, going through each game one at a time and writing out the FEN when it is white's turn to move at move number 12, one position per game. These were then sorted by popularity to get rid of duplicates, and the first 5,000 or so were kept. We are currently using 3,000 positions, with colors alternating so there are two games per position, and we use opponents including Stockfish, fruit, toga, etc...

I just make a change, do a profile-guided compile, run the test, and look at the results in an hour or so.

Post Reply