
To shift or not to shift

Posted: Wed Sep 09, 2015 1:52 am
by thevinenator
Consider the following code...

void BitmapTest()
{
    U64 i, j, val;
    U64 PowerOf2[64];

    for (i = 0; i <= 63; i++)
        PowerOf2[i] = 1ULL << i;

    clock_t begin = clock();

    for (j = 1; j < 1000000000; j++)
        for (i = 0; i < 64; i++)
        {
            // try using shifts
            val = 1ULL << i;

            // try array
            // val = PowerOf2[i];
        }
    printf("%llu\n", val); // needed to keep the optimizer from eliminating the loops
}

Compiled with VS2013, x64, full optimization for speed.
The CPU is an Intel i5-2520 at 2.50 GHz.

The line: val = 1ULL << i;
takes 31.2 seconds

while the table lookup

val = PowerOf2[i];
takes 11.9 seconds!

wow

I'm guessing the former is dependent on CPU type, but I doubt it will beat the table lookup method.
Have others found the table lookup method superior?

I was testing this to see which method would be superior for creating masks to modify a single bit in a variable.

Re: To shift or not to shift

Posted: Wed Sep 09, 2015 2:54 pm
by hyatt
thevinenator wrote:Consider the following code...

[snip -- full post quoted above: the BitmapTest() benchmark, 31.2 seconds for the shift vs. 11.9 seconds for the table lookup]



Have you looked at the asm???

:)

Hint: there is not a shift to be found using the code as is (without the array lookup). You are NOT measuring what you think you are... In fact, you will discover that the inner loop is ALSO missing. :) You REALLY don't think the compiler realizes that shifting something 64 bits ends up with zero? :)

Re: To shift or not to shift

Posted: Sun Sep 13, 2015 8:53 am
by Lasse Hansen
My experience is that a table lookup for SetBit() is always faster than a shift, both with my perft and my engine.
I use ClearBit(sqr) = ~SetBit(sqr) together with this, instead of a separate table for clearing a bit.
It is easy to change to using a shift later, though.
This is on an i7-2600K.

Regards, Lasse

Re: To shift or not to shift

Posted: Sun Sep 13, 2015 3:57 pm
by hyatt
I don't see how it can be faster. A shift is a one-cycle instruction. An array lookup requires the address computation and a cache probe. If you have a decent CPU and the value is in L1, they are going to be fairly close. If it is not in L1, the shift easily wins. But I don't see any way a shift can be slower than a memory/cache reference, even if it is an L1 hit. Shift = 1 micro-op. A memory access is at least two.

However, you do have to be careful and make sure your measurements are correct. The code posted above does nothing even remotely close to what the C code implies, for example. Drawing any conclusions from running that code will be based on incorrect assumptions and not knowing exactly what the compiler is doing. gcc -O -S is always your friend if you are looking at this kind of stuff.

IE try the -S with the array reference. There are NONE. Your printing "val" is only misleading you. The inner loop is thrown out, leaving val = PowerOf2[i]. But then i becomes a constant of 63, the last value through the loop (all other val computations are overwritten), so there are zero memory references for that code. You have to be VERY careful when writing code to benchmark something, lest the optimizer wreck your measurement as happens here. I'd ALWAYS use the shift for speed; it is always faster than a memory reference, so long as you don't do things that force multiple memory references, say one for the value to shift, and two or three to evaluate the expression for the shift amount. But then again, that's not particularly good programming either, and logically you would have to do the same thing for the array lookup.

Re: To shift or not to shift

Posted: Mon Sep 14, 2015 2:28 pm
by Lasse Hansen
hyatt wrote:I don't see how it can be faster. A shift is a one cycle instruction. An array lookup requires the address computation and a cache probe.
Well, it's faster for me (I tested it once again). It might be that when using a variable shift, RCX is clobbered due to the use of CL, and its contents have to be restored later when register pressure is high.
Lasse

Re: To shift or not to shift

Posted: Mon Sep 14, 2015 2:53 pm
by hyatt
That is still only an extra load from memory, same as the table lookup.

Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.

Re: To shift or not to shift

Posted: Mon Sep 14, 2015 3:55 pm
by Lasse Hansen
hyatt wrote:That is still only an extra load from memory, same as the table lookup.

Are you sure you are measuring what you think you are? IE your test program was totally wiped out by the optimizer.
Yes, I tried both with my C-engine Sillycon, and with my new C++ perft program.
For Sillycon the difference in NPS is around 0.7%; for perft it is around 2%.

Code:

gcc chess.c -pipe -lm -lpthread -O3 -march=native -mtune=native -finline-limit=220 
//#define SetBit(a)	(SetBitTbl[a])
//#define ClrBit(a)	(~SetBitTbl[a])
#define SetBit(a)	(1ull<<(a))
#define ClrBit(a)	(~(1ull<<(a)))

Code:

g++ -O2 main.cpp
//inline U64 SetBit(int sqr) { return SetBitTbl[sqr]; }
//inline U64 ClrBit(int sqr) { return ~SetBitTbl[sqr]; }
inline U64 SetBit(int sqr) { return 1ull<<sqr; }
inline U64 ClrBit(int sqr) { return ~SetBit(sqr); }
Lasse

Re: To shift or not to shift

Posted: Mon Sep 14, 2015 8:35 pm
by hyatt
Lasse Hansen wrote:
[snip -- previous exchange quoted above: NPS difference ~0.7% for Sillycon, ~2% for perft, with the SetBit()/ClrBit() definitions]

BTW if this is a 64 bit compile, you have a ton of registers the compiler can use. cl/cx/ecx/rcx should not be an issue.

Re: To shift or not to shift

Posted: Tue Sep 15, 2015 2:54 pm
by Lasse Hansen
hyatt wrote:BTW if this is a 64 bit compile, you have a ton of registers the compiler can use. cl/cx/ecx/rcx should not be an issue.
I use Ubuntu 14.04 64 bit, gcc and g++ version 4.8.4

BTW, if using 1<<shift and not a table, the code to clear a bit may be:

Code:

static inline U64 rol(U64 x, U64 r)
{
   asm ("rolq %%cl, %0" : "=r" (x) : "0" (x), "c" (r));
   return x;
}
inline U64 ClrBit(int sqr) { return rol(-2ull, sqr); }
Using rotate left on the constant -2ull saves a bitwise NOT operation, but also clobbers RCX (AFAIK). I tested and found no improvement in speed in my program, but I seldom use ClrBit() in the code.
Lasse

Re: To shift or not to shift

Posted: Tue Sep 15, 2015 4:45 pm
by hyatt
If you use cl, you certainly clobber rcx. :)

But regardless, that is a one-cycle instruction. The best you can hope for from a table probe is one cycle as well, assuming an L1 hit and assuming your L1 has a one-cycle hit latency. The shift will never vary, while the table probe won't always be an L1 hit. So I still don't see why the shift would be slower. It would be interesting to compile the actual code both ways, using -S, to see what, if anything, looks odd. GCC has had some interesting optimization bugs over the years, so perhaps here...

Here is what I get from my gcc. If I use the 1ULL << n approach:

movl $1, %eax
movl %edi, %ecx
salq %cl, %rax
ret

and if I use the 64 entry array approach:

movslq %edi, %rdi
movq _bits@GOTPCREL(%rip), %rax
movq (%rax,%rdi,8), %rax
ret

That last movq can be 1 or more cycles depending on L1, giving 3 cycles or more. The first approach has no cache dependencies and should always execute in a constant time of 3 cycles.