To kick off some technical discussions

Code, algorithms, languages, construction...
Post Reply
hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 7:33 pm

Rebel wrote:
hyatt wrote: We often find new ideas by looking at individual games, but this is usually in the form of "we are just not evaluating this very well" or "we have no term that attempts to quantify this particular positional concept". But as we fix those things, we don't just use the game where we made a boo-boo, we play 30,000 games to make sure that it helps in more cases than it hurts, which guarantees upward progress. I had way too many steps backward with Crafty prior to cluster-testing. Others seem to be doing the same, although with different approaches. Rybka apparently plays about 40,000 games at 1 second per game to tune things. I prefer to occasionally vary the time control to make sure that something that helps at fast games doesn't hurt at slow games.
I am in agreement. I once wrote a small utility that just randomly generates win, draw and loss results. In most cases it took thousands of simulated results before the error margin settled around the expected 50%, and even then things still fluctuated. So I guess that through the years I have thrown away quite a number of good ideas because of insufficient testing. I played 400 games at 40/10 due to hardware limitations. Perhaps I should have followed your idea of 1 second per move; that would have given me 6,000 games, but I doubt even that is enough.

Question: how long does it take your cluster to finish 30,000 games?

Ed
Using 10 seconds on clock + 0.1 seconds increment, about 1 hour for the 256 cpu cluster, or about 30 mins for the bigger cluster that has 540 cpus. Most of my testing is on the 256 cpu box since the other one is more popular and I generally find this one almost unused.
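
As an aside for readers who want to reproduce Ed's observation about fluctuation: the sketch below is not Ed's original utility, just an assumed reconstruction in Python with made-up win/draw probabilities. It generates random results and reports how far the measured score can wander at different match lengths; the spread at 400 games is dramatically wider than at 30,000, which is the whole argument for the cluster.

Code:

import random

# Stand-in for the utility Ed describes: generate random win/draw/loss
# results and report how far the measured score can drift at different
# match lengths. The probabilities are made-up illustration values.
def score_spread(n_games, p_win=0.35, p_draw=0.30, trials=400):
    """Return the 2.5% and 97.5% quantiles of the measured score."""
    scores = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n_games):
            r = random.random()
            total += 1.0 if r < p_win else (0.5 if r < p_win + p_draw else 0.0)
        scores.append(total / n_games)
    scores.sort()
    return scores[int(0.025 * trials)], scores[int(0.975 * trials)]

for n in (400, 6000, 30000):
    lo, hi = score_spread(n)
    print("%6d games: 95%% of runs score between %.3f and %.3f" % (n, lo, hi))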

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 7:35 pm

mcostalba wrote:
Rebel wrote: I stopped developing mine some years ago; if memory serves me well, my exclusion list is as follows:

1) No LMR in the last 3 plies of the search, because of the use of futility pruning there;

2) Always search at least 3 moves;

3) Hash-move;

4) Captures (I guess I have to try your idea of skipping bad captures);

5) Queen promotions (no minors);

6) Extended moves;

7) Moves that give check;

8) Moves that escape from check;

9) Static mate threats;

10) Killer moves;

Apart from bad captures, which we still have to test (though I guess that because in SF there is razoring starting from depth 4 plies, the benefit of reducing the search of bad captures should be mitigated, given that in that case the position evaluation is far below beta and gets razored anyway), your list is the same as SF's, with the exception of point (10): we currently do not have special code to avoid reducing killer moves (perhaps something else to try ;-) )
I would never reduce killer moves, simply because those are the very moves I expect to fail high, if I am going to get a fail high after the capture search is done. I might try reducing them to see what the effect is, just for fun, since I have not tried this previously.
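
For readers following along, here is a rough sketch (my own illustration, not Crafty's or Stockfish's actual code) of how an exclusion list like Ed's typically looks inside the move loop. Every predicate on the hypothetical "pos" object is assumed to be something the engine already computes during move generation.

Code:

# Hypothetical LMR gate built from Ed's exclusion list.
# "pos" and all its predicates are assumed helpers, not a real engine API.
def lmr_allowed(pos, move, depth, moves_searched):
    if depth <= 3:                        # 1) too near the frontier; futility pruning region
        return False
    if moves_searched < 3:                # 2) always search the first few moves at full depth
        return False
    if move == pos.hash_move:             # 3) hash (transposition-table) move
        return False
    if pos.is_capture(move) and not pos.is_bad_capture(move):
        return False                      # 4) captures (bad captures might still be reduced)
    if pos.is_queen_promotion(move):      # 5) queen promotions (minors may be reduced)
        return False
    if pos.is_extended(move):             # 6) moves that triggered an extension
        return False
    if pos.gives_check(move):             # 7) moves that give check
        return False
    if pos.in_check:                      # 8) moves that escape from check
        return False
    if pos.static_mate_threat:            # 9) static mate threats
        return False
    if move in pos.killers:               # 10) killer moves
        return False
    return True                           # everything else is a candidate for reduction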

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 7:42 pm

Chris Whittington wrote:
hyatt wrote:
Chris Whittington wrote:
The above just confirms the point ....

you are hill climbing, together with a bunch of relatively similar machines, all competing against the same metric (statistical win rate, aka ELO)

there's no guarantee, and in fact it is very likely, that the hill being climbed is in no way the highest hill around, and when you get to the top, or near the top, there's no way to jump over onto another, higher hill to repeat the process, because, as you say, "quite often a "fix" that improved the play in a single game we were examining would cause a drastic drop in Elo overall", which has the effect of keeping you on the same hill.

In fact, I think a case can be made that your ELO can continue to rise through the methodology used even when you have already reached the top of the (non-optimal) hill and there is nowhere higher to go.
The alternative is witchcraft/voodoo/magic/etc. I personally believe that with a large set of opening positions, and opponents that are stronger than me, so long as I can close the gap I am getting better overall. This is a far sounder assumption than trying to look at a specific game, isolate a particular move, and adjust either the search or evaluation to choose a better move. Been there. Done that. Got the T-shirt. It is a flawed methodology.

As I mentioned, we tried a few of these early on, just to get a feel for what we _had_ been doing. I would see a game played on ICC where I could analyze and determine some point where a losing move was made. And with (sometimes) some GM help, we'd look at the good move vs the bad move, and try to determine whether it was depth or knowledge that caused the error. And for the normal cases, after we came up with a fix that made us switch from the bad move to the good one, cluster testing would often show that the "fix" hurt overall. It is _very_ difficult to envision how a change in the evaluation for this position will affect all the other similar but subtly different positions we have to play through.

We often find new ideas by looking at individual games, but this is usually in the form of "we are just not evaluating this very well" or "we have no term that attempts to quantify this particular positional concept". But as we fix those things, we don't just use the game where we made a boo-boo, we play 30,000 games to make sure that it helps in more cases than it hurts, which guarantees upward progress. I had way too many steps backward with Crafty prior to cluster-testing. Others seem to be doing the same, although with different approaches. Rybka apparently plays about 40,000 games at 1 second per game to tune things. I prefer to occasionally vary the time control to make sure that something that helps at fast games doesn't hurt at slow games.

But the point is that this is an objective mechanism, not subjective. Lots of "good ideas" have been tossed out because, even though they sounded reasonable, we could not find any implementation that didn't hurt overall results. I like the idea of making a change, then running a quick test, and in an hour having a really good idea of whether the idea as implemented worked or not. If not, we try to figure out why; on quite a few occasions the idea was good, but the implementation had a bug. Humans think too highly of their subjective abilities. I've drifted away from that approach after proving over and over that my subjective opinion was quite a bit away from the real truth.

Is it possible to reach a local maximum? Of course. But we are not doing automated tuning, we are making changes and testing the resulting programs. That means that, as humans, we can recognize a trend that needs attention and do something about it, even if it requires a complete rewrite of something, such as king safety or pawn structure or whatever.
Well, if you say the alternatives are magic and voodoo, you are really saying you don't know any other way forward.

I would like to suggest to you that one of the assumptions you, and others, are making is unsound.

You're on an ELO-increasing hill climb with similar machines, very likely on a sub-optimal hill. Your assumption is that ELO equates to strength, a representation of chess skill, and that increasing ELO shows an increase in your machine's chess skill.

I believe it's perfectly possible that this assumption is flawed and that you could be increasing ELO within your pool without an increase in "strength", i.e. it looks like you are making progress but in reality you are already at the top of the sub-optimal hill and going nowhere. The ELO increases are just a by-product of the testing methodology and not anything 'real'.




I am not sure you understand what I am doing. This is not a simulated-annealing auto-tuning process. The idea goes like this:

(1) we look at a few longish games, or when we play in tournaments, we make notes as to what we see that we don't like or think we might improve. We then write a piece of code to address one specific issue at a time. We then merge this into the program, compile and test. And then work on tuning that change (and maybe adjusting a few other parameters that might seem related). Once we finish that, we go on to the next idea. This is not a "hill-climbing approach." It is the classic evaluate, change and test approach, but the test part is purely objective and based on a bunch of games using different starting positions and opponents. The "creative" part has never been eliminated.

So don't conclude that we are just doing some sort of optimization-problem solution. We do test old ideas to see if old assumptions are still valid or not, but most of what we are doing is adding new code to implement new ideas. But by thorough testing, we bypass the old N steps forward, N+1 steps backward regression issue. Code that hurts never survives. So perhaps we could call this a pseudo-genetic algorithm, where the mutation is controlled by humans writing new code rather than by random changes, and the result is "survival of the fittest...."

I am not sure how we could be "not improving" when we are catching up to programs that are ranked above us on every rating list. For example, 4 years ago Fruit was about 200 points above us on every rating list. Now in our testing it is 200 behind us and we are working on slowly catching up to Stockfish. If we gain on Stockfish, and don't lose any ground to the others we have passed, we _must_ be getting better, since Stockfish is at or near the top of every rating list around. If I am climbing a hill to reach a local maximum, and that local maximum is still above Stockfish, I'll be happy enough until I get there. Then we continue the normal way of developing and continue to repair weaknesses as we find them. You can reach a local maximum if you don't add new code and just tweak values. But that is not what we are doing at all, so I am not sure this even applies.
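
To make "code that hurts never survives" concrete: one standard objective yardstick for "did the change help?" is the likelihood of superiority, the probability that the new version really is stronger given the observed wins and losses. The snippet below is a generic textbook approximation (it ignores draws), offered as an illustration rather than as the exact statistic Bob's BayesElo runs report.

Code:

from math import erf, sqrt

# Likelihood of superiority: probability that version A is stronger than B
# given W wins and L losses (draws are ignored in this common approximation).
def likelihood_of_superiority(wins, losses):
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

# Hypothetical 30,000-game run scoring +7600 -7400 =15000:
print(likelihood_of_superiority(7600, 7400))   # about 0.95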

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 7:55 pm

Chris Whittington wrote:
Sentinel wrote:
Chris Whittington wrote: that there is some kind of linear relationship between 'strength' and ELO list rating

that changes in ELO list rating map to equivalent changes in 'strength' in some sort of linear way
The word linear is wrong. Nobody has even tried to prove something like this, because it's simply not correct. However, the word monotonic in its place would be correct. Or, to be mathematically precise, ELO and engine 'strength' (or skill, as you wish to call it) are always positively correlated.
Can you prove the positive correlation?

Can you prove positive correlation on a sub-optimal hill using a pool composed of similar machines such that the resulting ELO list maps from the fantasy land of the incestuous sub-optimal hill to the reality of real chess strength / playing skill?

I think not.

A little thought experiment for you .... what might well happen if you introduced, let's say, some turtles from the Galapagos Islands into some part of Africa? Can you assure me, absolutely, that these turtles will survive and thrive?
With proper testing and observation for a few lifecycles, yep. :) That's why testing is so much better than guessing.
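
A small aside on the "monotonic, not linear" point in the quoted exchange above: the Elo model's own mapping from rating difference to expected score is strictly increasing but decidedly non-linear, which is the only sense in which a positive-correlation claim is built into the system. A minimal illustration:

Code:

# Expected score under the Elo model: strictly increasing in the rating
# difference, but not remotely linear in it.
def expected_score(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for d in (0, 50, 100, 200, 400):
    print(d, round(expected_score(d), 3))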

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 7:58 pm

Chris Whittington wrote:
Sentinel wrote:
Chris Whittington wrote:Can you prove the positive correlation?

Can you prove positive correlation on a sub-optimal hill using a pool composed of similar machines such that the resulting ELO list maps from the fantasy land of the incestuous sub-optimal hill to the reality of real chess strength / playing skill?
You don't prove it. You simply take ELO as the measure of strength by definition; in other words, ELO is your metric for measuring skill/strength. It's not that the definition is wrong (or at least there is no better one); the question is whether our methods of measuring ELO are correct. There your examples have a point.
However, the problem is that we don't have a better way to measure ELO so far.
Well, if I have, let's say, a dozen objects 2, 3, 4, ..., 12, 13 cm long, and I wish to rank them in order, but my measuring device is hopelessly, randomly inaccurate (or, as you put it, "we don't have a better way"!), then my ranking is going to be hopelessly wrong.

So, do we know how hopelessly, randomly inaccurate our measuring stick on the sub-optimal, incestuous hill is? No, we don't. Do we know how accurate it is? No, we don't.

Yet the computer chess 'community' accepts these 'rating lists' as gospel.
Actually, I believe we simply accept them as the best that can be done today. There is a difference between the two statements. Ultimately, chess is about winning, losing and drawing games. If I win more games, I am certainly getting better, and simple statistics proves that. If I win more games after a change, you are going to have a hard time convincing me that my change is bad, particularly if I can play enough games to drive the variance down far enough to eliminate overlap between the old and new ratings.
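
To put numbers on "drive the variance down far enough": the sketch below uses a simple normal approximation (not BayesElo's draw model) to convert a hypothetical win/draw/loss total into an Elo difference with a 95% interval. At 30,000 games the interval comes out in the neighbourhood of the +/-3 Elo figure Bob quotes elsewhere in the thread for his cluster runs.

Code:

from math import log10, sqrt

def score_to_elo(score):
    # Inverse of the Elo expected-score formula.
    return -400.0 * log10(1.0 / score - 1.0)

def elo_with_error(wins, draws, losses):
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n                  # measured score
    var = (wins + 0.25 * draws) / n - s * s       # per-game variance
    margin = 1.96 * sqrt(var / n)                 # 95% margin on the score
    return score_to_elo(s), score_to_elo(s - margin), score_to_elo(s + margin)

# Hypothetical 30,000-game result: +11000 =9000 -10000
elo, lo, hi = elo_with_error(11000, 9000, 10000)
print("Elo %+.1f, 95%% interval [%+.1f, %+.1f]" % (elo, lo, hi))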

govert
Posts: 9
Joined: Thu Jun 10, 2010 10:47 am
Real Name: Martin Helmer

Re: To kick off some technical discussions

Post by govert » Sat Jun 12, 2010 10:03 pm

Chris Whittington wrote:
orgfert wrote:
Chris Whittington wrote: I've bolded the sections below where both posters are saying the same thing without actually saying it.

The testing methodology can give ELO increases, but these increases do NOT map to strength or chess skill.

Computer ELO and computer ELO lists are a kind of misleading nonsense.
How do you know?
Well, rating lists are based on maths, statistics and some science, but there are some unstated assumptions. The maths and the statistics are just processes and are sound, but nobody in computer chess ever mentions the assumptions. Everything is presented looking all scientific and sound, and everybody just accepts it as gospel.

I won't mention the possibility of (unprovable) corruption amongst list makers or the (untested) origins of much of their data, so let's assume (despite, for example, Ed Schroeder's exposure of actual corrupt practices by ELO list makers in the past) that the raw data on win/loss/draw is sound, that programs are not excluded/included/weighted etc., and that all the other minute details of getting the method correct are attended to.

Some assumptions are:

computer chess rating list ELO = a measure of chess playing skill, commonly referred to as strength

that there is some kind of linear relationship between 'strength' and ELO list rating

that changes in ELO list rating map to equivalent changes in 'strength' in some sort of linear way

that the computer chess model of improvement technique (hill climbing a sub-optimal hill testing against similar machine opponents, maximising an 'ELO') actually works at all stages of the hill, in particular at the top

The first assumption is incorrect; I think the originator of the ELO system pointed that out. The other assumptions are unproven, and the onus is on the list makers, publishers and developers who use the technique and then use the results to prove them. It would be too convenient for any one of them, let alone all, to be fine and dandy, so I, for one, assume they are unsound.

Science does not work on a wing and a prayer, does it now?
Do you have any alternative way of measuring strength in mind?

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sat Jun 12, 2010 11:16 pm

That's a key issue. Elo is the best we have so far, and it is founded in statistical methods. Intuition and guesswork are certainly not as effective, based on experience.

User avatar
Rebel
Posts: 515
Joined: Wed Jun 09, 2010 7:45 pm
Real Name: Ed Schroder

Re: To kick off some technical discussions

Post by Rebel » Sun Jun 13, 2010 12:33 am

Impressive......... well done..........

Ed
hyatt wrote: Hi Ed...

My basic approach is to play a varied set of positions, two games per position, against a set of opponents that are reliable on our cluster (this one has 128 nodes with 2 cpus each; the other cluster has 70 nodes with 8 cpus each). 30K games gives me an error bar of +/-3 Elo using BayesElo on the complete PGN. Many changes we make are just a few Elo up or down. Some are 10-20, but these are rarer, and those could be detected with fewer games. But doing this, Crafty's actual Elo has gone up by almost 300 points in 2 years, whereas we were lucky to get 40 in previous years, because it is so easy to make a change that sounds good in theory, and which looks good in particular positions, but which hurts overall.

Using the 256 cpu cluster, playing fast games of 10 seconds on the clock + a 0.1 second increment, we can complete an entire 30,000 game match in about an hour, which means we have a very accurate answer about whether the change was good or bad. And while not in real time, we can be working on the next change while testing the last change, so it is pretty efficient. We occasionally play longer games, and I have done 60+60 once (60 minutes on the clock, 60 seconds increment), which took almost 5 weeks to complete. Fortunately, testing has shown that almost all changes can be measured equally well at fast or slow time controls; only a few react differently given more or less time.

My starting set of positions was chosen by using a high-quality PGN game collection, going through each game one at a time and writing out the FEN when it is white's turn to move at move number 12, one position per game. These were then sorted by popularity to get rid of duplicates, and then the first 5,000 or so were kept. We are currently using 3,000 positions, playing each position twice with colors alternated, and we use opponents including Stockfish, Fruit, Toga, etc...

I just make a change, do a profile-guided compile, and run the test and look at the results in an hour or so.
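
For anyone who wants to build a similar position set, here is a sketch of the harvesting step using the python-chess library (my own assumption; Bob presumably used his own tools). It records the FEN at White's 12th move of each game and counts duplicates so the most popular positions can be kept.

Code:

import collections
import chess
import chess.pgn

def harvest_fens(pgn_path, move_number=12):
    """Collect the position at White's `move_number`-th move of each game."""
    counts = collections.Counter()
    with open(pgn_path) as handle:
        while True:
            game = chess.pgn.read_game(handle)
            if game is None:
                break
            board = game.board()
            for move in game.mainline_moves():
                if board.fullmove_number == move_number and board.turn == chess.WHITE:
                    counts[board.fen()] += 1
                    break
                board.push(move)
    return counts

# Sort by popularity (duplicates collapse automatically) and keep the top 5,000:
# popular = [fen for fen, _ in harvest_fens("games.pgn").most_common(5000)]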

User avatar
LiquidNitrogen
Posts: 19
Joined: Sun Jun 13, 2010 1:20 am
Real Name: Ed Trice

Re: To kick off some technical discussions

Post by LiquidNitrogen » Sun Jun 13, 2010 5:19 am

hyatt wrote: And quite often a "fix" that improved the play in a single game we were examining would cause a drastic drop in Elo overall.
I can't count the number of times I had a "great idea" to make Gothic Vortex stronger, tested it on maybe a handful of "hard positions" with amazing results, only to have my faithful beta tester tell me 2 days later "Old Vortex beat New Vortex 20-12" or something like that.

One thing we should accept by now: humans and computers see chess very differently. It's not always obvious why some piece of knowledge which seems "obvious" to us is not worth its weight in search slowdown.

I agree with the massive statistical game-play approach. Human subjectivity through "observing key games" can't be allowed to override hard results.

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: To kick off some technical discussions

Post by hyatt » Sun Jun 13, 2010 5:27 am

LiquidNitrogen wrote:
hyatt wrote: And quite often a "fix" that improved the play in a single game we were examining would cause a drastic drop in Elo overall.
I can't count the number of times I had a "great idea" to make Gothic Vortex stronger, tested it on maybe a handful of "hard positions" with amazing results, only to have my faithful beta tester tell me 2 days later "Old Vortex beat New Vortex 20-12" or something like that.

One thing we should accept by now: humans and computers see chess very differently. It's not always obvious why some piece of knowledge which seems "obvious" to us is not worth its weight in search slowdown.

I agree with the massive statistical game-play approach. Human subjectivity through "observing key games" can't be allowed to override hard results.

I obviously agree. Human intuition can certainly be used to tell you what to do next in chess engine development. But each intuitive feature you add needs to be vetted by the harsh, revealing light of thorough objective testing.

Post Reply