Quote:
Here's an exhaustively tested routine generalized to handle vector bool char. It performs the described scan on the int elements, and also individually on the four chars of each 32-bit lane. The AND of those two is then the overall scan of a vector bool char.
Due to some strange bug in gcc's AltiVec optimization, which makes it consume HUGE amounts of RAM (it ran out of memory on my system, which has 1.5 GB in total!), I had to rewrite your code a bit. Apparently gcc 4.0 on Linux doesn't like inlined AltiVec intrinsics that much.
Code:
// isolates the rightmost 'true' element of a vector bool char
vector bool char RightSingularChar(vector bool char Mask) {
    vector uint8_t pack32_8 = { 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,4,8,12 };
    vector uint8_t unpack8_32 = { 12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15 };
    vector uint8_t vone8 = vec_splat_u8(1);
    vector uint32_t vzero32 = vec_splat_u32(0), vone32 = vec_splat_u32(1);
    // generate bool int mask from bool char mask
    vector bool int intMask = vec_cmpgt((vector uint32_t)Mask, vzero32);
    // compress; only the rightmost 32-bit element is really significant here
    intMask = vec_perm(intMask, intMask, pack32_8);
    // isolate rightmost bits of char mask and int mask:
    // x ^ (x & (x - 1)) keeps only the lowest set bit of each 32-bit lane
    vector bool int tempMask = (vector bool int)vec_sub((vector uint32_t)intMask, vone32);
    intMask = vec_xor(intMask, vec_and(intMask, tempMask));
    tempMask = (vector bool int)vec_sub((vector uint32_t)Mask, vone32);
    Mask = vec_xor(Mask, vec_and(Mask, (vector bool char)tempMask));
    // splat isolated bits (a byte equal to 1 becomes 0xff, everything else 0)
    intMask = (vector bool int)vec_cmpeq((vector uint8_t)intMask, vone8);
    Mask = vec_cmpeq((vector uint8_t)Mask, vone8);
    // unpack intMask to full size again
    intMask = vec_perm(intMask, intMask, unpack8_32);
    // combine Mask and intMask
    return vec_and((vector bool char)intMask, Mask);
}
The problem is that I can't get it to work.
Suppose I pass this mask to the routine:
Code:
vector bool char vmask = { 0xff,0xff,0xff,0xff, 0xff,0xff,0xff,0x00, 0xff,0xff,0xff,0xff, 0xff,0xff,0xff,0xff };
If I call RightSingularChar on vmask, like this:
Code:
vector bool char vres = RightSingularChar(vmask);
I get this result (if I print vres):
Code:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ff
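For comparison, here is a minimal scalar sketch of the two-level scan as the quoted description and the code comments describe it. It is only a model under stated assumptions, not the AltiVec routine itself: the function names are hypothetical, every mask element is assumed to be 0x00 or 0xff, element 0 is taken as the leftmost, and each group of four elements is treated as one big-endian 32-bit lane.

```c
#include <stdint.h>

/* Lowest set bit of x, written the same way as the vector code. */
static uint32_t lowest_set_bit(uint32_t x) {
    return x ^ (x & (x - 1u));
}

/* Hypothetical scalar model of the two-level scan.
   Assumes each mask element is 0x00 or 0xff and that four consecutive
   elements form one big-endian 32-bit lane (element 0 is the leftmost). */
static void right_singular_char_model(const uint8_t mask[16], uint8_t out[16]) {
    uint32_t packed = 0;    /* one byte per lane: 0xff if the lane is non-zero */
    uint32_t lane_bits[4];  /* lowest set bit inside each lane */

    for (int lane = 0; lane < 4; ++lane) {
        uint32_t word = 0;
        for (int b = 0; b < 4; ++b)
            word = (word << 8) | mask[4 * lane + b];
        packed = (packed << 8) | (word != 0 ? 0xffu : 0x00u);
        lane_bits[lane] = lowest_set_bit(word);
    }

    /* Marks the rightmost non-zero lane with a single 0x01 byte. */
    uint32_t lane_pick = lowest_set_bit(packed);

    for (int lane = 0; lane < 4; ++lane) {
        /* Byte of lane_pick that belongs to this lane (lane 0 is the top byte). */
        uint8_t lane_sel = (uint8_t)(lane_pick >> (8 * (3 - lane)));
        for (int b = 0; b < 4; ++b) {
            uint8_t char_sel = (uint8_t)(lane_bits[lane] >> (8 * (3 - b)));
            /* AND of the lane-level and char-level results. */
            out[4 * lane + b] = (lane_sel == 1 && char_sel == 1) ? 0xff : 0x00;
        }
    }
}
```

Under those assumptions, feeding this model the mask above sets only the last element to ff, which is the same as the printed vres. So if "rightmost true element" is the intended meaning, the output may already be what the routine is designed to give.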
Maybe it's my fault in rewriting it, but I tried to make as few changes as possible.
Also, in rewriting this code for left-to-right, where should I use the permutes? On the initial mask (I tried that and it didn't work), or inside the algorithm itself?
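One option that would leave the algorithm itself untouched is to reverse the element order of the mask before the call and reverse the result again afterwards, since permuting only the input returns the answer in reversed positions. The sketch below shows that idea in plain scalar C. All three helper names are hypothetical, and rightmost_true is a naive stand-in for the vector routine rather than a translation of it. On AltiVec the reversal would presumably be a vec_perm with the index vector 15..0, but that part is untested here.

```c
#include <stdint.h>

/* Hypothetical scalar reference: keep only the rightmost (highest-index)
   true element of a 16-element mask. */
static void rightmost_true(const uint8_t in[16], uint8_t out[16]) {
    int found = 0;
    for (int i = 15; i >= 0; --i) {
        out[i] = (!found && in[i]) ? 0xff : 0x00;
        if (in[i])
            found = 1;
    }
}

/* Reverse the element order (the scalar counterpart of a byte-reversing permute). */
static void reverse16(const uint8_t in[16], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = in[15 - i];
}

/* Leftmost true element: reverse, isolate the rightmost, reverse back. */
static void leftmost_true(const uint8_t in[16], uint8_t out[16]) {
    uint8_t rev[16], iso[16];
    reverse16(in, rev);
    rightmost_true(rev, iso);
    reverse16(iso, out);
}
```

With this arrangement both the input and the output pass through the same reversal, so the result lands back in the original element positions.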
Konstantinos