mcostalba wrote:Your loop test is not very realistic, the point is that you don't only need to get the lowest bit, but also to reset it, and if you test pop_1s_bit() function in a loop you see that speed difference is about 1-2% for just the function, not for the whole engine ! And for some CPU types is even less.
Ok, if you combine BSF and LSB_clear it changes a bit the story but not too much.
To cut the story short, I have no clue what and how you have measured but you obviously didn't measure it right.
If you take your pop_1st_bit function from SF and just change:
Code: Select all
ret = Square(BitTable[((u.dw.l ^ (u.dw.l - 1)) * 0x783a9b23) >> 26]);
lines into:
(don't forget to add 32 before return ret) you get
5-10% speed up.
If you cleverly rewrite the whole function in asm, you get at least
15-20% speed up. So, I really don't get it, where you've found those 1-2%.
Software bsf at the end is just a multiplication and a table lookup + bit reset, and bit reset part is common to both methods and is a part that weights almost half of the function performance.
Nope, it's not just table lookup + bit reset, there is also bitwise XOR, multiplication and shift. These are all very costly operations.
And if that doesn't convince you, I'll give you a small asm MSVC++ 9.0 optimized sniffet of your original pop_1st_bit and pop_1st_bit when table lookup is replaced with hardware bsf. If you still think afterwards that those two pieces of code take the same time to execute, there is no purpose of continuing conversation.
So your original pop_1st_bit:
Code: Select all
00401118 mov esi,ecx
0040111A test ecx,ecx
0040111C je main+139h (401139h)
0040111E lea esi,[ecx-1]
00401121 mov edx,esi
00401123 xor edx,ecx
00401125 imul edx,edx,783A9B23h
0040112B shr edx,1Ah
0040112E mov edx,dword ptr ___xi_z+68h (402140h)[edx*4]
00401135 and esi,ecx
00401137 jmp main+156h (401156h)
00401139 lea eax,[edi-1]
0040113C mov ecx,eax
0040113E xor ecx,edi
00401140 not ecx
00401142 imul ecx,ecx,783A9B23h
00401148 shr ecx,1Ah
0040114B mov edx,dword ptr ___xi_z+68h (402140h)[ecx*4]
00401152 and eax,edi
00401154 mov edi,eax
pop_1st_bit with hardware bsf:
Code: Select all
00401118 mov ecx,edi
0040111A mov esi,ebx
0040111C test edi,edi
0040111E je main+134h (401134h)
00401120 bsf ecx,edi
00401123 mov dword ptr [esp+18h],ecx
00401127 mov eax,dword ptr [esp+18h]
0040112B lea ecx,[edi-1]
0040112E and ecx,edi
00401130 mov edi,ecx
00401132 jmp main+147h (401147h)
00401134 bsf edx,ebx
00401137 lea esi,[ebx-1]
0040113A and esi,ebx
0040113C mov eax,edx
0040113E mov dword ptr [esp+18h],edx
00401142 mov ebx,esi
00401144 add eax,20h