To kick off some technical discussions

Code, algorithms, languages, construction...
mcostalba
Posts: 91
Joined: Thu Jun 10, 2010 11:45 pm
Real Name: Marco Costalba

Re: To kick off some technical discussions

Post by mcostalba » Sat Jun 12, 2010 9:28 am

thorstenczub wrote: the question for me was: does this incestuous testing really allow one to make progress?
In the old days of computer chess we generated games (at much longer time controls, of course)
and WATCHED the games, looking for errors.
Then the programmer tried to fix them, and the new engine went back onto the autoplayers.
Yes, IMHO, it is the only way to reliably make progress, but I am a weak chess player, so my opinion could be biased.

The key point is that the position you look at and the positions your engine looks at are completely different.

I mean, you look at one position and evaluate weak and strong points, attack possibilities and so on; then your engine makes a move that you judge weak, and perhaps the game proves you right because that move leads to a loss.

The point is that the position you looked at is _not_ the one your engine looked at. Your engine looked at (and evaluated) millions or tens of millions of positions up to 20-25 plies deeper in the search; the funny thing is that none of those millions and millions of positions is the starting position _you_ were looking at. ;-)

So the bottom line is that, just by looking at one position (the starting one), you have _no_ clue why your engine played a weak move.

I think KOMODO's testing methodology is the right one and very similar to ours: we play only one game at a time, so as not to introduce noise from running many games on different CPUs simultaneously, and with a slightly longer TC (1 minute instead of 30"), but we never look at the games and we _only_ trust test results, even though Joona is a very good chess player.

thorstenczub
Posts: 593
Joined: Wed Jun 09, 2010 12:51 pm
Real Name: Thorsten Czub
Location: United States of Europe, germany, NRW, Lünen

Re: To kick off some technical discussions

Post by thorstenczub » Sat Jun 12, 2010 9:48 am

I think this testing method is wrong. By playing 30,000 games between SIMILAR versions of itself,
you will indeed be able to measure small Elo differences between Komodo X, Komodo X+1,
and Komodo X+8.

One version will indeed be "better" when you compare them with each other.

But I still believe that testing with the help of humans watching the games,
and, once they have found and fixed something, also playing 30,000 games to measure
the difference between the "fixed" version and the version before,
could make more progress.

IMO one should test not against one's OWN versions but against opponent programs,
because that is what happens LATER when the new engine is released:
it plays not against OWN engines similar to itself but against FOREIGN engines.
It has to compete against them in rating lists, in tournaments...

So can we trust that the incestuous tester is right when it reports "progress"? And that the "progress"
is still there when it comes to playing foreign engines?

I doubt that the 15 Elo you "realized" by testing against yourself will also show up against other engines.

In the stone age we had to play test games by hand.
Then the autoplayer was invented, and we needed a PC for each program,
so 8 or 12 PCs.

Then came those wonderful GUIs such as ARENA for Windows. Suddenly you were able
to run eng-eng matches and engine tournaments on one PC.

And the next industrial method of testing was the autotester with very fast games
and purely statistical measurement, without even looking at the games.

Maybe one should combine the methods:
use autotesting to prove that the new change in the program was successful by
playing 30,000 fast games against DIFFERENT opponents, not against one's own "clones".

But I would still prefer watching the games.

The games should be stored in PGN and replayed by humans
to find out what is going on.

IMO the reason some programs make no progress, although the programmers are clever
and work for years, is this eng-eng autotesting without looking at the game data.

You tune against your own engine, but the progress you get is not real.
It would only be real if other people also tested against clones of your own engine,
and that is something NOBODY besides the programmer does.

Robert Houdart
Posts: 180
Joined: Thu Jun 10, 2010 4:55 pm

Re: To kick off some technical discussions

Post by Robert Houdart » Sat Jun 12, 2010 10:18 am

I also favor the combined approach:
- Watch games to spot things the engine does well and doesn't do well, and to come up with ideas for improvement.
- Use engine matches (against the previous version and against other engines) to validate the implementation of the ideas.

Robert

Rebel
Posts: 515
Joined: Wed Jun 09, 2010 7:45 pm
Real Name: Ed Schroder

Re: To kick off some technical discussions

Post by Rebel » Sat Jun 12, 2010 10:38 am

Hi Bob, long time no see; nice to meet you here. I hope this forum will be a fresh new start for talking in peace about computer chess in all its aspects.
hyatt wrote: (2) If you look at the crafty source, I have a "phase" variable that tells me what phase of the move selection I am in, from "HASH_MOVE" to "CAPTURE_MOVES" to "KILLER_MOVES" to "REMAINING_MOVES". I do not reduce moves until I get to REMAINING_MOVES (effectively the L (late) in LMR). So for me, there are no extra SEE calls. I have already used SEE to choose which captures are searched in CAPTURE_MOVES, leaving the rest for REMAINING_MOVES. I therefore reduce anything in REMAINING_MOVES (except for moves that give check). So there is really no extra SEE usage at all.
I stopped developing mine some years ago; if memory serves me well, my exclusion list is as follows:

1) No LMR in the last 3 plies of the search, because of the use of futility pruning;

2) Always search at least 3 moves;

3) Hash-move;

4) Captures (I guess I have to try your idea of skipping bad captures);

5) Queen promotions (no minors);

6) Extended moves;

7) Moves that give check;

8) Moves that escape from check;

9) Static mate threats;

10) Killer moves;
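
For illustration, here is a minimal compilable sketch of how an exclusion list like the one above can be expressed as a single "is this move reducible?" test. This is hypothetical code, not Rebel's, Crafty's, or Stockfish's actual source; the struct fields simply mirror the points above, and all names are made up.

[code]
// Hypothetical sketch of an LMR gate encoding the exclusion list above.
#include <cstdio>

struct MoveContext {
    int  depth;             // remaining depth in plies
    int  moveCount;         // moves already searched at this node
    bool isHashMove;
    bool isCapture;
    bool isQueenPromotion;
    bool isExtended;
    bool givesCheck;
    bool escapesCheck;
    bool isMateThreat;      // static mate threat detected
    bool isKiller;
};

bool isReducible(const MoveContext& m) {
    if (m.depth <= 3)       return false;  // 1) futility region, no LMR
    if (m.moveCount <= 3)   return false;  // 2) always search first 3 moves
    if (m.isHashMove)       return false;  // 3)
    if (m.isCapture)        return false;  // 4) (could allow bad captures)
    if (m.isQueenPromotion) return false;  // 5)
    if (m.isExtended)       return false;  // 6)
    if (m.givesCheck)       return false;  // 7)
    if (m.escapesCheck)     return false;  // 8)
    if (m.isMateThreat)     return false;  // 9)
    if (m.isKiller)         return false;  // 10)
    return true;                           // a "late" remaining move
}

int main() {
    MoveContext quietLateMove = { 8, 12, false, false, false,
                                  false, false, false, false, false };
    std::printf("reduce? %s\n", isReducible(quietLateMove) ? "yes" : "no");
}
[/code]

Bob's phase trick amounts to much the same thing: anything searched in the REMAINING_MOVES phase has already passed the hash-move, capture, and killer exclusions, so no extra SEE calls are needed.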

Now lend me your cluster :lol: as I understand that the key to making progress is to play at least 30,000+ eng-eng games. Perhaps you can elaborate a bit on the latter; I have been a bit out of date in recent years, but I still find the latest developments fascinating.

Regards,

Ed

Rebel
Posts: 515
Joined: Wed Jun 09, 2010 7:45 pm
Real Name: Ed Schroder

Re: To kick off some technical discussions

Post by Rebel » Sat Jun 12, 2010 10:58 am

hyatt wrote: We often find new ideas by looking at individual games, but this is usually in the form of "we are just not evaluating this very well" or "we have no term that attempts to quantify this particular positional concept". But as we fix those things, we don't just use the game where we made a boo-boo; we play 30,000 games to make sure that it helps in more cases than it hurts, which guarantees upward progress. I had way too many steps backward with Crafty prior to cluster-testing. Others seem to be doing the same, although with different approaches. Rybka apparently plays about 40,000 games at 1-second time controls to tune things. I prefer to occasionally vary the time control to make sure that something that helps at fast games doesn't hurt at slow games.
I am in agreement. I once wrote a small utility that just randomly generates win, draw, and loss scores. In most cases it took thousands of simulated scores before the result settled near the expected 50%, and even then it kept fluctuating. So I guess that over the years I have thrown away quite a number of good ideas because of insufficient testing. I played 400 games at 40/10 due to hardware limitations. Perhaps I should have followed your idea of 1 second per move; that would have given me 6,000 games, but I doubt whether even that is enough.
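
For reference, a minimal sketch of such a simulation (not Ed's actual utility; the draw rate is an assumed value): it plays N random games between two equal engines and prints the 2-sigma error margin of the measured score, which shows why a few hundred games are nowhere near enough.

[code]
// Sketch: simulate N games between two equal engines and report the
// 2-sigma error margin of the measured score. Draw rate is assumed.
#include <cmath>
#include <cstdio>
#include <initializer_list>
#include <random>

int main() {
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    const double drawRate = 0.32;                // assumed draw probability

    for (int n : {400, 6000, 30000}) {
        double sum = 0.0, sumSq = 0.0;
        for (int i = 0; i < n; ++i) {
            double r = u(rng);
            double s = r < drawRate ? 0.5                               // draw
                     : r < drawRate + (1.0 - drawRate) / 2.0 ? 1.0      // win
                     : 0.0;                                             // loss
            sum   += s;
            sumSq += s * s;
        }
        double mean  = sum / n;
        double sigma = std::sqrt((sumSq / n - mean * mean) / n);  // std. error
        // roughly 7 Elo per percentage point near a 50% score
        std::printf("%6d games: score %.1f%% +/- %.1f%% (about +/- %.0f Elo)\n",
                    n, 100 * mean, 200 * sigma, 2 * sigma * 100 * 7);
    }
}
[/code]

Under these assumptions, 400 games give a margin of roughly +/- 30 Elo, and even 6,000 games leave about +/- 7 Elo, which is why 10-15 Elo patches need runs of the size Bob describes.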

Question: how long does it take your cluster to finish 30,000 games?

Ed

Rebel
Posts: 515
Joined: Wed Jun 09, 2010 7:45 pm
Real Name: Ed Schroder

Re: To kick off some technical discussions

Post by Rebel » Sat Jun 12, 2010 11:05 am

thorstenczub wrote: I think this testing method is wrong. By playing 30,000 games between SIMILAR versions of itself,
you will indeed be able to measure small Elo differences between Komodo X, Komodo X+1, and Komodo X+8.
The methodology is perfect as long as the changes are search related.

Positional changes are another matter; for those you had better use a broad range of different opponents.

Ed

mcostalba
Posts: 91
Joined: Thu Jun 10, 2010 11:45 pm
Real Name: Marco Costalba

Re: To kick off some technical discussions

Post by mcostalba » Sat Jun 12, 2010 11:50 am

Rebel wrote: I stopped developing mine some years ago; if memory serves me well, my exclusion list is as follows:

1) No LMR in the last 3 plies of the search, because of the use of futility pruning;

2) Always search at least 3 moves;

3) Hash-move;

4) Captures (I guess I have to try your idea of skipping bad captures);

5) Queen promotions (no minors);

6) Extended moves;

7) Moves that give check;

8) Moves that escape from check;

9) Static mate threats;

10) Killer moves;

Apart from bad captures, which we still have to test (though I guess that because SF has razoring starting from depth 4 plies, the benefit of reducing the search of bad captures should be mitigated, given that in that case the position evaluation is far below beta and the node gets razored anyway), your list is the same as SF's, with the exception of point (10): we currently do not have special code to avoid reducing killer moves (perhaps something else to try ;-) ).
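
To make the razoring argument concrete, here is a self-contained toy sketch (illustrative only, not Stockfish's actual code; the margins are made-up centipawn values): at shallow depth, a node whose static evaluation is far below beta is handed to quiescence search instead of a full-width search, so a bad capture at such a node is typically never a candidate for LMR in the first place.

[code]
// Illustrative razoring condition: shallow depth + static eval far below beta.
#include <cstdio>

bool wouldRazor(int depth, int staticEval, int beta) {
    static const int margin[5] = { 0, 300, 350, 400, 450 };  // assumed margins
    return depth >= 1 && depth <= 4 && staticEval + margin[depth] < beta;
}

int main() {
    // A node five pawns down at depth 3 is razored, so bad captures there
    // never reach the LMR decision at all.
    std::printf("razor? %s\n", wouldRazor(3, -500, 0) ? "yes" : "no");
}
[/code]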

Chris Whittington
Posts: 437
Joined: Wed Jun 09, 2010 6:25 pm

Re: To kick off some technical discussions

Post by Chris Whittington » Sat Jun 12, 2010 11:59 am

zamar wrote:
Chris Whittington wrote: there's no guarantee, and in fact it is very likely, that the hill being climbed is in no way the highest hill around, and when you get to the top, or near the top, there's no way to jump over onto another, higher hill to repeat the process, because, as you say, "quite often a "fix" that improved the play in a single game we were examining would cause a drastic drop in Elo overall", which has the effect of keeping you on the same hill.

In fact, I think a case can be made that your Elo can continue to rise through the methodology used even when you have already reached the top of the (non-optimal) hill and there is nowhere higher to go.
It's easy to agree with you in theory. The practical problem is that there is no point in claiming "your testing method is sub-optimal" if you aren't able to propose a better testing method.
You'd be right if I were coming from a programmer's perspective, but, although an (ex-)programmer, I've always come at the problem from the human side, and in this case I'm interested in exposing the weaknesses, flaws, and false assumptions of the bean-counter paradigm. Statistical testing has several big holes in its methodology which are fun (for me) to explore.

mcostalba
Posts: 91
Joined: Thu Jun 10, 2010 11:45 pm
Real Name: Marco Costalba

Re: To kick off some technical discussions

Post by mcostalba » Sat Jun 12, 2010 12:02 pm

thorstenczub wrote:
I doubt that the 15 Elo you "realized" by testing against yourself will also show up against other engines.
I doubt it too; normally it is smaller. But the key aspect is that, while it is normally smaller, it has the same sign.

This is fundamental because it allows us to use self-testing as a kind of leverage effect, a magnifying lens, to see whether a patch is good or bad, even if the absolute value of the patch is very small against other engines but turns out to be measurable in self-play.

So the bottom line is that, as long as the sign is the same, the "incestuous" effect is a good thing to have IMHO, because it artificially increases the difference and moves it above the noise level.
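
For the numbers being discussed, here is the standard logistic conversion from a match score to an Elo difference; the percentages below are made-up examples only, showing how a magnified self-play score and a smaller score against other engines can still carry the same sign.

[code]
// Standard logistic Elo model: convert a match score (fraction of points
// won) into an Elo difference. The example scores are invented.
#include <cmath>
#include <cstdio>

double eloDiff(double score) {                 // score in (0, 1)
    return -400.0 * std::log10(1.0 / score - 1.0);
}

int main() {
    std::printf("self-play 52.0%% -> %+.1f Elo\n", eloDiff(0.520));
    std::printf("vs. others 50.7%% -> %+.1f Elo\n", eloDiff(0.507));
}
[/code]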

thorstenczub wrote: In the stone age we had to play test games by hand.
Then the autoplayer was invented, and we needed a PC for each program,
so 8 or 12 PCs.

Then came those wonderful GUIs such as ARENA for Windows. Suddenly you were able
to run eng-eng matches and engine tournaments on one PC.

And the next industrial method of testing was the autotester with very fast games
and purely statistical measurement, without even looking at the games.
Yes, and then came cutechess-cli (far better and of higher quality than the crappy Arena), and the road map you have summarized continues in that direction. So I don't see any reason to turn and look back; instead, we should go ahead along the lines you have already laid out.
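
For example, a 30,000-game run with cutechess-cli looks roughly like this (engine and file names are placeholders, and the exact flag syntax may differ between versions):

[code]
cutechess-cli \
  -engine cmd=./engine_new -engine cmd=./engine_old \
  -each proto=uci tc=60+0.6 \
  -openings file=openings.pgn format=pgn order=random \
  -rounds 30000 -concurrency 1 -pgnout games.pgn
[/code]

Keeping -concurrency at 1 matches the one-game-at-a-time setup described earlier in the thread.
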
thorstenczub wrote: Maybe one should combine the methods.
Maybe not ;-)

IMHO this is just an anthropocentric view that has no scientific basis apart from historical reasons, and it is just a waste of resources.

The sooner you realize that the quality metric applied to engines cannot contain human elements, the quicker your engine will progress.

Chris Whittington
Posts: 437
Joined: Wed Jun 09, 2010 6:25 pm

Re: To kick off some technical discussions

Post by Chris Whittington » Sat Jun 12, 2010 12:10 pm

hyatt wrote:
Chris Whittington wrote:
The above just confirms the point ....

You are hill climbing, together with a bunch of relatively similar machines, all competing against the same metric (statistical win rate, aka Elo).

there's no guarantee, and in fact it is very likely, that the hill being climbed is in no way the highest hill around, and when you get to the top, or near the top, there's no way to jump over onto another, higher hill to repeat the process, because, as you say, "quite often a "fix" that improved the play in a single game we were examining would cause a drastic drop in Elo overall", which has the effect of keeping you on the same hill.

In fact, I think a case can be made that your Elo can continue to rise through the methodology used even when you have already reached the top of the (non-optimal) hill and there is nowhere higher to go.
The alternative is witchcraft/voodoo/magic/etc. I personally believe that with a large set of opening positions, and opponents that are stronger than me, so long as I can close the gap I am getting better overall. This is a far sounder assumption than trying to look at a specific game, isolate a particular move, and adjust either the search or evaluation to choose a better move. Been there. Done that. Got the T-shirt. It is a flawed methodology.

As I mentioned, we tried a few of these early on, just to get a feel for what we _had_ been doing. I would see a game played on ICC where I could analyze and determine some point where a losing move was made. And with (sometimes) some GM help, we'd look at the good move vs the bad move, and try to determine if it was depth or knowledge that caused the error. And for the normal cases, after we came up with a fix that made us switch from the bad move to the good one, cluster testing would often show that the "fix" hurt overall. It is _very_ difficult to envision how a change in the evaluation for this position will affect all the other similar but subtly different positions we have to play through.

We often find new ideas by looking at individual games, but this is usually in the form of "we are just not evaluating this very well" or "we have no term that attempts to quantify this particular positional concept". But as we fix those things, we don't just use the game where we made a boo-boo; we play 30,000 games to make sure that it helps in more cases than it hurts, which guarantees upward progress. I had way too many steps backward with Crafty prior to cluster-testing. Others seem to be doing the same, although with different approaches. Rybka apparently plays about 40,000 games at 1-second time controls to tune things. I prefer to occasionally vary the time control to make sure that something that helps at fast games doesn't hurt at slow games.

But the point is that this is an objective mechanism, not a subjective one. Lots of "good ideas" have been tossed out because, even though they sounded reasonable, we could not find any implementation that didn't hurt overall results. I like the idea of making a change, then running a quick test and in an hour having a really good idea of whether the idea as implemented worked or not. If not, we try to figure out why, as on quite a few occasions the idea was good but the implementation had a bug. Humans think too highly of their subjective abilities. I've drifted away from that approach after proving over and over that my subjective opinion was quite a bit away from the real truth.

Is it possible to reach a local maximum? Of course. But we are not doing automated tuning; we are making changes and testing the resulting programs, which means that, as humans, we can recognize a trend that needs attention and do something about it, even if it requires a complete rewrite of something such as king safety or pawn structure or whatever.
Well, if you say the alternatives are magic and voodoo, you are really saying you don't know any other way forward.

I would like to suggest to you that one of the assumptions you, and others, are making is unsound.

You're on an Elo-increasing hill climb with similar machines, very likely on a sub-optimal hill. Your assumption is that Elo equates to strength, a representation of chess skill, and that increasing Elo shows an increase in your machine's chess skill.

I believe it's perfectly possible that this assumption is flawed and that you could be increasing Elo within your pool without any increase in "strength", i.e. it looks like you are making progress but in reality you're already at the top of the sub-optimal hill and going nowhere. The Elo increases are just a by-product of the testing methodology and not anything 'real'.



