This is the popcnt routine (for NON-popcnt instruction cpus) from ivanhoe. The actual question is at the bottom, below the code.
Code: Select all
static inline int POPCNT (uint64 w)
{
w = w - ((w >> 1) & 0x5555555555555555ull);
w = (w & 0x3333333333333333ull) + ((w >> 2) & 0x3333333333333333ull);
w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0full;
return (w * 0x0101010101010101ull) >> 56;
}
Code: Select all
typedef unsigned long long uint64; //assume this gives 64-bits
const uint64 m1 = 0x5555555555555555ULL; //binary: 0101...
const uint64 m2 = 0x3333333333333333ULL; //binary: 00110011..
const uint64 m4 = 0x0f0f0f0f0f0f0f0fULL; //binary: 4 zeros, 4 ones ...
const uint64 m8 = 0x00ff00ff00ff00ffULL; //binary: 8 zeros, 8 ones ...
const uint64 m16 = 0x0000ffff0000ffffULL; //binary: 16 zeros, 16 ones ...
const uint64 m32 = 0x00000000ffffffffULL; //binary: 32 zeros, 32 ones
const uint64 hff = 0xffffffffffffffffULL; //binary: all ones
const uint64 h01 = 0x0101010101010101ULL; //the sum of 256 to the power of 0,1,2,3...
/* Implement 'popcount_3' from Wikipedia */
/* """This uses fewer arithmetic operations than any other known
implementation on machines with fast multiplication.
It uses 12 arithmetic operations, one of which is a multiply.""" */
static inline int popcount_wikipedia_3(uint64 *buf, int n) {
int cnt=0;
uint64 x;
do {
x = *buf;
x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
cnt += (x * h01)>>56; //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24)+...
buf++;
} while (--n);
return cnt;
}
Code: Select all
int static __inline__ PopCnt(long word)
{
long dummy, dummy2, dummy3;
asm(" xorq %0, %0 " "\n\t"
" testq %1, %1 " "\n\t"
" jz 2f " "\n\t"
"1: leaq -1(%1),%2 " "\n\t"
" incq %0 " "\n\t"
" andq %2, %1 " "\n\t"
" jnz 1b " "\n\t"
"2: " "\n\t"
: "=&r"(dummy), "=&r"(dummy2), "=&r"(dummy3)
: "1"((long) (word))
: "cc");
return (dummy);
}
Code: Select all
template<>
inline int count_1s<CNT64>(Bitboard b) {
b -= ((b>>1) & 0x5555555555555555ULL);
b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
b *= 0x0101010101010101ULL;
return int(b >> 56);
}
Code: Select all
unsigned int popcount(unsigned int v)
{
unsigned int retVal;
__asm {
MOV EAX, [v] ;v
MOV EDX, EAX ;v
SHR EAX, 1 ;v >> 1
AND EAX, 055555555h ;(v >> 1) & 0x55555555
SUB EDX, EAX ;w = v - ((v >> 1) & 0x55555555)
MOV EAX, EDX ;w
SHR EDX, 2 ;w >> 2
AND EAX, 033333333h ;w & 0x33333333
AND EDX, 033333333h ;(w >> 2) & 0x33333333
ADD EAX, EDX ;x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
MOV EDX, EAX ;x
SHR EAX, 4 ;x >> 4
ADD EAX, EDX ;x + (x >> 4)
AND EAX, 00F0F0F0Fh ;y = (x + (x >> 4) & 0x0F0F0F0F)
IMUL EAX, 001010101h ;y * 0x01010101
SHR EAX, 24 ;population count = (y * 0x01010101) >> 24
MOV retVal, EAX ;store result
}
return (retVal);
}
Code: Select all
FreeBSD version 1 : 7508216 us; cnt=32500610
FreeBSD version 2 : 5855897 us; cnt=32500610
16-bit LUT : 3236626 us; cnt=32500610
8-bit LUT : 5387574 us; cnt=32500610
8-bit LUT v2 : 4152103 us; cnt=32500610
Wikipedia #2 : 4149610 us; cnt=32500610
Wikipedia #3 : 2980791 us; cnt=32500610
gcc popcount : 6520142 us; cnt=32500610
gcc popcountll : 3185706 us; cnt=32500610
HAKMEM 169/X11 : 5812659 us; cnt=32500610
Bitslice(7) : 4815994 us; cnt=32500610
Bitslice(24) : 3112013 us; cnt=32500610
naive : 56568623 us; cnt=32500610
Wegner/Kernigan : 33442013 us; cnt=32500610
Anderson : 10516925 us; cnt=32500610
8x shift and add : 49656566 us; cnt=32500610
32x shift and add : 47866979 us; cnt=32500610
And it happens that the version used by ivanhoe is the fastest on my core2.
And now, the Question (or questions): if I copy exactly Crafty's popcount implementation for my engine, 1) is that "impermissibly derived"?; and, 2) should credit be given by referencing Hyatt's Crafty as the origination of the implementation, even if Hyatt borrowed the implementation from some other source?
I'll give the following a name, by calling it the "Hyatt Rule", or "One-Input-One-Output Function Rule". Hyatt: "If a function that is consistent from input to output ... then perhaps copying that is not a deal-breaker on tournament participation." And, the "Hyatt Corollary": "If you choose to borrow something, ... one must at least provide a proper citation or give credit to the originator of the code that was used."
Okay. Let us apply the Hyatt Rule. At the outset, the various popcount implementations appear to me to be obvious examples of a one-input-one-output function. Is that correct? If true, the Hyatt Rule appears to allow rote code copying of these routines. However, there are numerous variations to popcount implementations. Stockfish's is identical to ivanhoe's, though one in c and the other c++. What I find is that the various implementations of this one-input-one-output function differ widely when it comes to speed. As you can see, the choice of a popcount implementation is significant. The "wikipedia #3/ivanhoe/stockfish" routine is the fastest. Whoever wrote these various routines a) knows a lot; b) used their knowledge; and, c) created something. I will certainly improve my engine, if I replace gcc's __builtin_popcount with wikipedia#3/stockfish/ivanhoe's routine. Now, Stockfish's 'bitcount.h' contains the standard GPL boilerplate above its bitcount routine. However, that routine is apparently public domain - it's on wikipedia without citation.
But, all of this is for illustration, so let us assume for the moment that Madame X authored the wikipedia#3/stockfish/ivanhoe version. Clearly, Madame X has created something of significance by inventing, or authoring, a function that nobody else did previously, and although this function takes the same input and produces the exact same output as all the other implementations, Madame X's is more than twice the speed of gcc's builtin. If Madame X uses her implementation in her chess engine, and it is not known or used by any other, am I free to disassemble her fine engine to discover her fast popcount implementation and use it in my engine without credit or citation? Given it's nature, after disassembly, my reconstruction in c would be identical to Madame X's.
How does the Hyatt Rule handle one-input-one-output functions whose various implementations differ widely in speed, speed being the primary objective? In other words, though probably impossible (and irrelevant with new processors), assume I managed to create another implementation that was three times faster than Madame X's? This is very low level stuff. Does the Hyatt Rule exonerate all who would disassemble my closed source engine and use my super fast method in their engines without credit. Again, the nature of this low level code would probably mean a c rewrite of disassemble code would be identical to my version without having ever seen my c code.