gull chess

ThinkingALot · Post by **ThinkingALot** » Tue Jun 22, 2010 4:50 pm

mcostalba wrote:
ThinkingALot wrote:Simple assembler functions can probably be borrowed from anywhere. I don't think it would be a GPL violation to use these functions and keep the code public domain.
I am sorry but I don't agree here.

Actually I meant the BSF intrinsic from Stockfish. It seems to me that this particular intrinsic can be used in a public domain program. Of course with "Square", "first_1" and "dummy" renamed to smth else.
Do you disagree with this statement? If the answer is "no" the issue is resolved: we were just talking about different things. Otherwise consider a simple line of code from some GPL engine:

Code: Select all

for (i = 0; i < 64; i++) {

This is clearly a "public domain" string, since it's used in public domain programs. So what's the principal difference between this string and the BSF code? I don't see any.

Sentinel · Post by **Sentinel** » Tue Jun 22, 2010 5:09 pm

ThinkingALot wrote:Actually I meant the BSF intrinsic from Stockfish. It seems to me that this particular intrinsic can be used in a public domain program. Of course with "Square", "first_1" and "dummy" renamed to smth else.

Who in today's world would use Stockfish's BSF for god sake???
In the world where even crappy P4 from 10 years ago uses only 2 cycles on bsf/bsr instructions who would use their tables which are like 2-3x slower???

mcostalba · Post by **mcostalba** » Tue Jun 22, 2010 5:23 pm

Sentinel wrote:
ThinkingALot wrote:Actually I meant the BSF intrinsic from Stockfish. It seems to me that this particular intrinsic can be used in a public domain program. Of course with "Square", "first_1" and "dummy" renamed to smth else.
Who in today's world would use Stockfish's BSF for god sake???
In the world where even crappy P4 from 10 years ago uses only 2 cycles on bsf/bsr instructions who would use their tables which are like 2-3x slower???

Are you sure ?

You have asked yourself / everybody two questions, please add a third one: if bsf/bsr is so faster why those idiots in SF still use the software version ?

In case you have some spare time you could take a look to this long thread:

http://talkchess.com/forum/viewtopic.ph ... highlight=

the bottom line is that speed difference, if any, is below 1% in most cases and code is not (easily) portable in asm.

Sentinel · Post by **Sentinel** » Tue Jun 22, 2010 5:34 pm

mcostalba wrote:Are you sure ?

I am. At least for 32bit OSes (for x64 you anyway use the hardware instruction). Simply randomly generate like 10M sparse bitboards and execute BSF on them in a loop. Measure the time. Use your version and asm version of BSF. You can repeat the test for anything starting from P4 up to i7 (Atom excluded, it's too primitive).

You have asked yourself / everybody two questions, please add a third one: if bsf/bsr is so faster why those idiots in SF still use the software version ?

Well, BSF things usually take 2-5% of total execution time, depending on the engine of course. Don't know the percentage on SF, but I would bet on something like 2.5%. Now if your BSF is like 50% slower that's 1.25% lower total speed. Not much of course. I don't say you are idiots ofc, just that you obviously don't care to squeeze every possible permil of speed.

the bottom line is that speed difference, if any, is below 1% in most cases and code is not (easily) portable in asm.

Portability for 5 lines of asm code is not an issue at all, more just an excuse.

mcostalba · Post by **mcostalba** » Tue Jun 22, 2010 7:18 pm

Sentinel wrote: Portability for 5 lines of asm code is not an issue at all, more just an excuse.

An excuse

perhaps it was not clear but if we had just implemented bsf in asm and forget it, we had avoided at least two weeks of testing all the different possibilities (and bsf among then BTW), so I don't understand an excuse for what ? for having done a much more time consuming research activity ?

Your loop test is not very realistic, the point is that you don't only need to get the lowest bit, but also to reset it, and if you test pop_1s_bit() function in a loop you see that speed difference is about 1-2% for just the function, not for the whole engine ! And for some CPU types is even less.

Software bsf at the end is just a multiplication and a table lookup + bit reset, and bit reset part is common to both methods and is a part that weights almost half of the function performance.

Sentinel · Post by **Sentinel** » Wed Jun 23, 2010 3:27 am

mcostalba wrote:Your loop test is not very realistic, the point is that you don't only need to get the lowest bit, but also to reset it, and if you test pop_1s_bit() function in a loop you see that speed difference is about 1-2% for just the function, not for the whole engine ! And for some CPU types is even less.

Ok, if you combine BSF and LSB_clear it changes a bit the story but not too much.
To cut the story short, I have no clue what and how you have measured but you obviously didn't measure it right.
If you take your pop_1st_bit function from SF and just change:

Code: Select all

ret = Square(BitTable[((u.dw.l ^ (u.dw.l - 1)) * 0x783a9b23) >> 26]);

lines into:

Code: Select all

_BitScanForward(&ret, u.dw.l);

(don't forget to add 32 before return ret) you get 5-10% speed up.
If you cleverly rewrite the whole function in asm, you get at least 15-20% speed up. So, I really don't get it, where you've found those 1-2%.

Software bsf at the end is just a multiplication and a table lookup + bit reset, and bit reset part is common to both methods and is a part that weights almost half of the function performance.

Nope, it's not just table lookup + bit reset, there is also bitwise XOR, multiplication and shift. These are all very costly operations.
And if that doesn't convince you, I'll give you a small asm MSVC++ 9.0 optimized sniffet of your original pop_1st_bit and pop_1st_bit when table lookup is replaced with hardware bsf. If you still think afterwards that those two pieces of code take the same time to execute, there is no purpose of continuing conversation.

So your original pop_1st_bit:

Code: Select all

00401118  mov         esi,ecx 
0040111A  test        ecx,ecx 
0040111C  je          main+139h (401139h) 
0040111E  lea         esi,[ecx-1] 
00401121  mov         edx,esi 
00401123  xor         edx,ecx 
00401125  imul        edx,edx,783A9B23h 
0040112B  shr         edx,1Ah 
0040112E  mov         edx,dword ptr ___xi_z+68h (402140h)[edx*4] 
00401135  and         esi,ecx 
00401137  jmp         main+156h (401156h) 
00401139  lea         eax,[edi-1] 
0040113C  mov         ecx,eax 
0040113E  xor         ecx,edi 
00401140  not         ecx  
00401142  imul        ecx,ecx,783A9B23h 
00401148  shr         ecx,1Ah 
0040114B  mov         edx,dword ptr ___xi_z+68h (402140h)[ecx*4] 
00401152  and         eax,edi 
00401154  mov         edi,eax

pop_1st_bit with hardware bsf:

Code: Select all

00401118  mov         ecx,edi 
0040111A  mov         esi,ebx 
0040111C  test        edi,edi 
0040111E  je          main+134h (401134h) 
00401120  bsf         ecx,edi 
00401123  mov         dword ptr [esp+18h],ecx 
00401127  mov         eax,dword ptr [esp+18h] 
0040112B  lea         ecx,[edi-1] 
0040112E  and         ecx,edi 
00401130  mov         edi,ecx 
00401132  jmp         main+147h (401147h) 
00401134  bsf         edx,ebx 
00401137  lea         esi,[ebx-1] 
0040113A  and         esi,ebx 
0040113C  mov         eax,edx 
0040113E  mov         dword ptr [esp+18h],edx 
00401142  mov         ebx,esi 
00401144  add         eax,20h

mcostalba · Post by **mcostalba** » Wed Jun 23, 2010 7:57 am

Sentinel wrote:If you take your pop_1st_bit function from SF and just change:
Code: Select all
ret = Square(BitTable[((u.dw.l ^ (u.dw.l - 1)) * 0x783a9b23) >> 26]);
lines into:
Code: Select all
_BitScanForward(&ret, u.dw.l);
(don't forget to add 32 before return ret) you get 5-10% speed up.

Ok, I will try that....

ThinkingALot · Post by **ThinkingALot** » Wed Jun 23, 2010 12:11 pm

Gull 0.10 is out: https://sourceforge.net/projects/gullch ... p/download. Looks like I've finally managed to fix all the search bugs...

ThinkingALot · Post by **ThinkingALot** » Wed Jun 23, 2010 1:09 pm

Time management bug found. Will fix it soon.
EDIT: fixed.

Hagen · Post by **Hagen** » Wed Jun 23, 2010 1:23 pm

Tested it this morning....Gullchess 010 forfeits on time. This needs a fix. Played 10/6 increments and it still lost.

OpenChess

OpenChess

gull chess

Re: gull chess

Re: gull chess

Re: gull chess

Re: gull chess

Re: gull chess

Re: gull chess

Re: gull chess

Re: gull chess

Re: gull chess

Re: gull chess