
More on similarity testing

Posted: Tue Dec 28, 2010 9:09 pm
by BB+
As the topic name has changed, I took the liberty of forking a new thread.

Here are my latest complaints and results:

Rybka 2.3.2a also has "movetime" problems (of course, it is oodles better than Rybka 1.0 Beta
which typically uses 5 or so times the amount of time desired). An example:

Code: Select all

Rybka 2.3.2a
go movetime 1000
[...]
info time 814 nodes 186010 nps 233997
bestmove b1c3 ponder g8f6
I can understand using (a bit) *more* time with the "movetime" token, but using less is even stranger. Maybe the UCI protocol

Code: Select all

       * movetime
                search exactly x mseconds
is being interpreted rather loosely with regard to "exactly". :x
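
For the record, here is roughly how I measure this -- a minimal Python sketch, where the engine binary name and the handshake details are illustrative rather than any particular setup:

Code: Select all

import subprocess, time

# Measure how long an engine actually thinks on "go movetime";
# the binary name is illustrative.
engine = subprocess.Popen(["./engine"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd):
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

send("uci")
while "uciok" not in engine.stdout.readline():
    pass
send("position startpos")
start = time.monotonic()
send("go movetime 1000")              # ask for exactly 1000 ms
for line in iter(engine.stdout.readline, ""):
    if line.startswith("bestmove"):
        break
print("requested 1000 ms, used %.0f ms" % ((time.monotonic() - start) * 1000))
send("quit")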

One of the advantages of fixed depth and no SMP is (exact) reproducibility; another is that you can test engines in parallel with no worries. :) However, one problem with "go depth X" is that my (current) positional suite has some positions with one side up by a lot, and so the +5.07 hash bug/design of some Rybkas makes it painfully slow to reach even depth 10 in many positions. So that's another thing to worry about when constructing a test suite. [Given the number of problems already extant with testing specifications for a small number of engines, I hesitate to consider everything that could go wrong when a more numerous comparison is made].

I might also add that back when these "clone detectors" were first discussed many months ago, I had actually isolated the eval() function in Rybka 3, etc., and Alan Sassler had done some correlation analysis on the numbers generated (I think I had 1 million positions, as it takes so little time, but I forget, and the Rybka forum has it all hidden by now). Again, this would be a superior method for determining correlation of evaluation output (whether at the level of "framework" or "numerology" is a different question), though it requires some work to achieve a functional set-up for engines that do not provide source code.
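
For concreteness, once the raw eval() numbers are dumped (one score per line, same positions in the same order), the correlation step itself is trivial; a Python sketch, with hypothetical file names:

Code: Select all

import math

# Pearson correlation of raw eval() outputs from two engines over the
# same position set; the file names are illustrative.
def scores(path):
    with open(path) as f:
        return [float(line) for line in f]

x = scores("engine_a_evals.txt")
y = scores("engine_b_evals.txt")
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
sx = math.sqrt(sum((a - mx) ** 2 for a in x))
sy = math.sqrt(sum((b - my) ** 2 for b in y))
print("Pearson r =", cov / (sx * sy))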

Re: More on similarity testing

Posted: Tue Dec 28, 2010 9:13 pm
by BB+
In any event, I ran Rybka 1.2 (the lazy eval criterion seems correct therein, unlike in Rybka 1.0 Beta -- this would likely affect any purported test of "evaluation" rather notably) versus Rybka 2.3.2a at "go depth 10" for each, even though I think this means depth 12 in Rybka 1.2 and depth 13 in Rybka 2.3.2a. It took the former 7 minutes and the latter 10 minutes, with a correlation of 708. Without further testing I am loath to put much value on this, though I think it is a counterpoint to any claim that the eval in earlier Rybkas changed dramatically before Rybka 3.

To date, I've found the best ways to fool these testers are: erroneously use lazy eval way too much, so as to obfuscate your eval function; and fail to handle movetime properly, either screwing it up by a large factor or having problems when the given value is too small. :)

Here is the current matrix. I stress that I intend to re-do this with more positions, and eliminate those with "one good move" or an evaluation that is too large.

Code: Select all

                 R12 R23 F21 F10 S19 S15   Time
R12.dp10           0 708 659 561 639 637    7min
R232.dp10        708   0 644 533 627 643   10min
F21.dp10         659 644   0 606 639 652   17min
F10.dp9          561 533 606   0 540 570   13min
SF191.dp14       639 627 639 540   0 744   11min
SF151.dp13       637 643 652 570 744   0   16min
Also, simply because fixed depth is "reproducible" doesn't mean that I think much of this is "science", as it were.

Re: More on similarity testing

Posted: Tue Dec 28, 2010 9:22 pm
by Sentinel
BB+ wrote: To date, I've found the best ways to fool these testers are: erroneously use lazy eval way too much, so as to obfuscate your eval function; and fail to handle movetime properly, either screwing it up by a large factor or having problems when the given value is too small. :)
The funny thing with lazy eval is that, contrary to Don Dailey's "expert" opinion, you can increase its use a lot with practically no effect on strength (as long as you don't use it in PV nodes, and not too much in cut/all nodes).
So you can actually trick his "clone detector" quite easily, almost without impacting the strength.
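
In sketch form (Python, with a made-up margin; material() and full_eval() are just stand-ins), the gating looks like:

Code: Select all

# Sketch of node-type-gated lazy eval; the margin and helpers are made up.
LAZY_MARGIN = 300  # centipawns, illustrative only

def material(pos):        # stand-in for a cheap material-only score
    return pos["material"]

def full_eval(pos):       # stand-in for the full positional evaluation
    return pos["material"] + pos["positional"]

def evaluate(pos, alpha, beta, is_pv):
    fast = material(pos)
    # Never take the shortcut at PV nodes; elsewhere, skip the full
    # eval when the cheap score is already far outside the window.
    if not is_pv and (fast + LAZY_MARGIN < alpha or fast - LAZY_MARGIN > beta):
        return fast
    return full_eval(pos)

pos = {"material": 400, "positional": -40}
print(evaluate(pos, -50, 50, is_pv=False))  # 400: lazy, material only
print(evaluate(pos, -50, 50, is_pv=True))   # 360: full eval at a PV node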

Thanks again, BB, for the excellent analysis and for, as always, pointing out exactly the right points.

Re: More on similarity testing

Posted: Tue Dec 28, 2010 9:39 pm
by BB+
you can increase its use a lot [...] (as long as you don't use it in PV nodes, and not too much in cut/all nodes).
I'm left to wonder exactly what types of nodes are left at this point? :?: :? [Though I agree that one can change lazy eval margins a lot with typically rather little effect on strength. Whether one can be ridiculous about it as with Rybka 1.0 Beta -- well, that needs testing].

To recollect: (https://webspace.utexas.edu/zzw57/rtc/e ... erial.html)
However, there is a very serious bug in Rybka with regards to lazy evaluation. The upper and lower bounds are set to the root score at the end of every iteration that is at least 6 plies. However, Rybka deals with two different scales of evaluation: units of a centipawn and units of 1/32 of a centipawn. In this case, the two values are mixed up: Rybka's search value is in centipawns, but it sets the lazy eval as if this value were in 1/32 centipawn units. Thus, every evaluation (that happens to be less than 32 pawns in either direction, i.e. always) will cause the lazy evaluation bounds to be set based on a score of 0. This means that if the root score (before dividing by 3399) is >0, the bounds are set to -3 and 4, and if the score is <0, the bounds are set to -4 and 3. Every single position with a score outside of these bounds is lazily evaluated, which means that once the score is in this range, Rybka effectively switches to material-only evaluation.
If my understanding is correct (and I also haven't independently verified ZW's analysis), this at the very least means that positions with a large edge to one side or another should not be used in a test suite with Rybka 1.0 Beta.
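
To make the quoted arithmetic concrete, here is a small Python sketch of the mix-up as I understand it (the names and the exact scale factor are my reconstruction, not Rybka's code):

Code: Select all

# The root score is in centipawns, but the lazy bounds are computed as
# if it were in 1/32-centipawn units, so the conversion truncates any
# |score| below 32 pawns (3200 cp) to 0 before the margins are applied.
def lazy_bounds(root_score_cp):
    s = int(root_score_cp / 3200)     # wrong scale: 0 unless |score| >= 32 pawns
    if root_score_cp > 0:
        return (s - 3, s + 4)         # in practice always (-3, 4)
    return (s - 4, s + 3)             # in practice always (-4, 3)

print(lazy_bounds(150))    # (-3, 4): even a +1.50 root score centres the bounds on 0
print(lazy_bounds(-2500))  # (-4, 3)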

Re: More on similarity testing

Posted: Tue Dec 28, 2010 9:50 pm
by Sentinel
BB+ wrote:
you can increase its use a lot [...] (as long as you don't use it in PV nodes, and not too much in cut/all nodes).
I'm left to wonder exactly what types of nodes are left at this point? :?: :?
qsearch? ;)

Re: More on similarity testing

Posted: Wed Dec 29, 2010 9:10 pm
by BB+
I stress that I intend to re-do this with more positions, and eliminate those with "one good move" or an evaluation that is too large.
I did this, and put it in a new thread in the technical subforum.

Here are the bestmove-matching data:

Code: Select all

                 FR10 FR21 IH47 Ryb1 Ry12 R232 Ryb3 Gla2 SF15 SF19  Time
FR10.at.dp9         0 3920 3290 3529 3600 3581 3381 3876 3611 3528  3:36
FR21.at.dp10     3920    0 3927 4551 4478 4436 4064 4330 4248 4127  4:06
IH47c.at.dp15    3290 3927    0 4333 4423 4641 4921 3885 4370 4411  3:09
R1.at.dp10       3529 4551 4333    0 5523 5259 4552 4264 4408 4283  2:45
R12.at.dp11      3600 4478 4423 5523    0 5464 4638 4272 4468 4379  3:18
R232.at.dp11     3581 4436 4641 5259 5464    0 4840 4206 4454 4378  3:21
R3.at.dp10       3381 4064 4921 4552 4638 4840    0 4057 4434 4380  2:51
GL2.at.dp12      3876 4330 3885 4264 4272 4206 4057    0 4735 4365  2:41
SF151.at.dp13    3611 4248 4370 4408 4468 4454 4434 4735    0 5238  3:57
SF191.at.dp14    3528 4127 4411 4283 4379 4378 4380 4365 5238    0  2:35

Re: More on similarity testing

Posted: Thu Dec 30, 2010 2:35 am
by kingliveson
BB+ wrote: Here are the bestmove-matching data:

[...]
[image]

Re: More on similarity testing

Posted: Thu Dec 30, 2010 2:38 am
by BB+
I'm beginning to recall why I prefer text-based browsers for some purposes... :D

Re: More on similarity testing

Posted: Thu Dec 30, 2010 3:10 am
by Don
Hi BB,

I have not fully digested all the posts on this subject, as I just joined this forum, but I am happy that someone else is also looking at this.

My first comment concerns how to set up a program to think for 100 ms or whatever time you wish. The method I use is to set the level to a high ply depth or infinite, start the search, then wait 100 ms before sending the "stop" command. This is the most reliable way. Do not try to depend on a fixed movetime, as that is unreliable, as you have discovered.
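
In script form, the method is simply the following (a Python sketch; the engine binary name is illustrative):

Code: Select all

import subprocess, time

# Fixed thinking time via "go infinite" plus a delayed "stop", instead
# of trusting the engine's own movetime handling.
engine = subprocess.Popen(["./engine"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd):
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

send("position startpos")
send("go infinite")
time.sleep(0.1)          # let it think for 100 ms of wall-clock time
send("stop")             # the engine must now reply with a bestmove
for line in iter(engine.stdout.readline, ""):
    if line.startswith("bestmove"):
        print(line.strip())
        break
send("quit")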

My second comment is about what this test is supposed to do. When I first announced it on talk-chess, I purposely used a provocative title, just to get people's attention. I announced it as a "clone tester" and I got more of a reaction than I ever expected. I then asked to have the subject changed to "similarity tester".

The test is not a reliable way to identify a clone, and this was obvious to me right from the start. For example, it identifies Komodo as being strongly related to Rybka 3, probably because I made heavy use of Larry Kaufman in designing the evaluation function. It shows Naum as being VERY STRONGLY related to Rybka 2, when I have no doubt it is an original work.

The tester seems to very clearly identify strong correlations between the playing styles of programs, and it does this better than I had hoped. For example, take any program and past versions of that same program: they are almost always its closest matches. The programs most related to Komodo are all the other Komodos, for example, despite evaluation improvements between each version. Likewise Naum and Rybka 2, when it's known that Naum was tuned to play just like Rybka 2; the tester is very good at picking up on that. Stockfish 1.9 is like 1.8, 1.7, and 1.6, despite massive Elo improvements.

Other than possible statistical error, the test does not have false positives. It in fact measures what percentage of moves two programs have in common. So if it gives high similarity scores to Firebird and Robbolito, that is not a false positive, because the programs DO IN FACT play alike over my 10,000-move set. A false positive implies that the test is measuring something other than what it was intended to measure and is returning the wrong results. For example, Naum and Rybka 2 play a huge number of identical moves. That is not a false positive, it is a fact. If you call this a clone test, then it suddenly becomes a false positive.
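
The measurement itself is nothing more than this (a Python sketch with toy data; in the real tester the move lists come from running both engines over the same position set):

Code: Select all

# Percentage of positions where two engines chose the same best move.
def similarity(moves_a, moves_b):
    assert len(moves_a) == len(moves_b)
    same = sum(a == b for a, b in zip(moves_a, moves_b))
    return 100.0 * same / len(moves_a)

a = ["e2e4", "g1f3", "d2d4", "c2c4"]   # toy data
b = ["e2e4", "b1c3", "d2d4", "c2c4"]
print("%.1f%% of moves match" % similarity(a, b))   # 75.0%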

When two programs DO prove to have an unusual number of moves in common, it suggests, but does not prove, a common cause. In Komodo's case the common cause is Larry Kaufman. In the case of Naum and Rybka it is the fact that Naum was purposely tuned to play like Rybka. In the case of Firebird and Robbolito it is probably just a wild coincidence, since they are both completely original works.

Don

Re: More on similarity testing

Posted: Thu Dec 30, 2010 3:33 am
by Sentinel
Don wrote: When two programs DO prove to have an unusual number of moves in common, it suggests, but does not prove, a common cause. In Komodo's case the common cause is Larry Kaufman. In the case of Naum and Rybka it is the fact that Naum was purposely tuned to play like Rybka. In the case of Firebird and Robbolito it is probably just a wild coincidence, since they are both completely original works.
If you ignore strength correlation, polling correlation, etc., what is left is eval, and in the case of Ippo vs. R3 it is not the full eval that you see dominating the correlation at the output of your test, but lazy eval.
Meaning Ippo and R3 have similar material and PST values.
But that's old news (you have it in detail in BB's report).

P.S. As you noticed yourself, you can tune material and PST values even without disassembling. In that sense the Ippo authors are responsible to the same degree as Alex Naumov.