Hi everyone,
I'm also glad to see you around!
@Charles: Intel is very reluctant to publish exact data on their modern superscalar CPUs (the ones that allow "out-of-order" parallel execution of instructions in the U and V execution pipes), so I won't be able to quote figures for their quad-core iN CPUs. Their Core2 processors, however, are still very slow in multiply/divide latency, which is also heavily bitness-dependent:
0. DIV: 40 cycles in 32 bits, 116 cycles in 64 bits, non-pairable
1. MUL: 5 cycles in 32 bits, 8 cycles in 64 bits, non-pairable
2. IMUL: 3 cycles in 32 bits, 5 cycles in 64 bits, non-pairable
3. Left/right shift: 1 cycle in any bitness, freely pairable (i.e. 0.5 cycles if paired)
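This means that a left shift is still 3 times faster than the fastest IMUL (32 bits, measured against an unpaired shift, which is the worst case for the shift) and 16 times faster than the slowest MUL (64 bits, measured against a paired shift, which is the best case for the shift).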
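For anyone who wants to check the arithmetic, here is a minimal sketch in C that recomputes those ratios from the latencies quoted above. The names (shift_speedup, print_speedups, the enum constants) are mine and purely illustrative; the cycle counts are just the Core2 figures from this post, not fresh measurements.

```c
#include <stdio.h>

/* Latencies in cycles, copied from the Core2 figures quoted in this post. */
enum { IMUL32 = 3, IMUL64 = 5, MUL32 = 5, MUL64 = 8, DIV32 = 40, DIV64 = 116 };

/* How many times faster a shift is than an operation of the given latency.
   shift_cycles is 1.0 for an unpaired shift and 0.5 for a paired one. */
double shift_speedup(double op_cycles, double shift_cycles)
{
    return op_cycles / shift_cycles;
}

/* Prints the two ratios discussed in the post, plus the DIV extremes. */
void print_speedups(void)
{
    printf("IMUL32 vs unpaired shift: %.0fx\n", shift_speedup(IMUL32, 1.0));
    printf("MUL64  vs paired shift:   %.0fx\n", shift_speedup(MUL64, 0.5));
    printf("DIV32  vs unpaired shift: %.0fx\n", shift_speedup(DIV32, 1.0));
    printf("DIV64  vs paired shift:   %.0fx\n", shift_speedup(DIV64, 0.5));
}
```

Calling print_speedups() from main() prints the ratios; the first two lines correspond to the 3x and 16x figures.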
DIV indeed remains the toughest bottleneck, especially in 64 bits.
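To make the practical consequence concrete, here is a rough sketch in C of the usual substitution: when the multiplier or divisor is a power of two, a shift gives the same result for unsigned values. The helper names mul_pow2 and div_pow2 are hypothetical, and I'm assuming k is smaller than 32. An optimizing compiler will often do this on its own when the constant is known at compile time, so treat it as an illustration rather than a recommendation.

```c
#include <stdint.h>

/* Multiply x by 2^k with a left shift.
   Unsigned arithmetic, so overflow wraps just as x * (1u << k) would.
   Assumes k < 32. */
uint32_t mul_pow2(uint32_t x, unsigned k)
{
    return x << k;
}

/* Divide x by 2^k with a right shift.
   Equivalent to x / (1u << k) for unsigned values only.
   Assumes k < 32. */
uint32_t div_pow2(uint32_t x, unsigned k)
{
    return x >> k;
}
```

Note that the division trick is only safe as written for unsigned operands: for negative signed values, integer division truncates toward zero, while an arithmetic right shift rounds toward negative infinity, so the two can differ by one.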