All times are UTC-06:00




Post new topic  Reply to topic  [ 6 posts ] 
Author Message
PostPosted: Sun Oct 31, 2004 11:38 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
(reposted here, on Sven's advice)
Hi guys,
I've just completed my first altivec program, a very simple one indeed and I have to say I'm really impressed!
I did an optimized version of a common routine strfill(), quite common (and used in this form in MySQL), which fills a given string with a given char. Here are the benchmarks for both scalar and altivec versions, with different string sizes:
Code:
#size scalar altivec
13 5 2
16 6 1
27 12 3
64 23 2
90 29 3
128 43 3
185 59 5
256 81 6
347 111 8
512 164 11
831 277 18
2048 686 40
3981 1316 86
(Note: I used both 'random' sizes and powers of 2, but this didn't seem to have any impact.)
and here is the code that produced this output:
Code:
#include <altivec.h>
#include <stdio.h>
#include <time.h>

// This one was shamelessy stolen and adapted from Apple's Altivec tutorial
vector unsigned char inline vec_ldsplatchar(unsigned char splatchar) {

vector unsigned char splatmap = vec_lvsl(0, &splatchar);
vector unsigned char result = vec_lde(0, &splatchar);

splatmap = vec_splat(splatmap, 0);

return vec_perm(result, result, splatmap);
}

unsigned char *vec_strfill(unsigned char *s, int len, char p) {
int i;
vector unsigned char sm = vec_ldsplatchar(p);
vector unsigned char *v1 = (vector unsigned char *)s;

vector unsigned char vec_a;
for (i=0; i < len-1; i = i+16) {
vec_a = vec_ld(0, v1);
vec_a = vec_splat(sm, 0);
vec_st(vec_a, i, s);
}

return s;
}

unsigned char *strfill(unsigned char *s, int len,char fill)
{
while (len--) *s++ = fill;
*(s) = '\0';
return(s);
} /* strfill */


int main( void )
{
int i, j, k, max, loops = 100000000;
time_t dt1, dt2, t1, t0;
int sizes[] = {13,16,27,64,90,128,185,256,347,512,831,2048,3981,8192,10000};

printf("#size\tscalar\taltivec\n");
for (k = 0; k < 16; k++) {
max = sizes[k];
unsigned char __attribute__ ((aligned(16))) test[max];
unsigned char __attribute__ ((aligned(16))) splatchar = rand();

t0 = time(NULL);
for (j=0; j < loops; j++) {
test[max] = '\0';
strfill(test, max, splatchar);
}
t1 = time(NULL);
dt1 = t1 - t0;

t0 = time(NULL);
for (j=0; j < loops; j++) {
vec_strfill(test, max, splatchar);
}
t1 = time(NULL);
dt2 = t1 - t0;
printf("%ld\t%ld\t%ld\n", max, dt1, dt2);
}
return 0;
}
I have to say, this code has bugs and is not very optimized, I'm a newbie when it comes to Altivec. But it did convince me that even a newbie like me can write fast Altivec code! I'm posting here for comments and maybe ideas to make it faster, safer etc, so if I've done something exceedingly stupid in this code, please tell me :-)
Also, I used a simple time() function, as I couldn't get pmon to work for me on 2.6.8.
Thanks to Luca for his excellent comments and pointers! Thanks to Genesi for designing such a nice system as Pegasos 2, the more I use it, the more I like it :-)

Regards
Konstantinos


Top
   
PostPosted: Mon Nov 01, 2004 9:45 am 
Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35
Here's a variant that handles alignment. It is untested, but it compiles. Convincing yourself that it does indeed work (or that it is buggy) is left as an enlightening exercise for the reader. :D
Code:
/* All AltiVec loads and stores silently round down their addresses to a
multiple of 16. Handling of arbitrary string buffers must take that into
account.

general case: partial vector at start and end of string

|.....--+-------+-------+-----..|

so we must handle a partial vector first, then a number of whole vectors,
and another partial vector at the end

special case: first and last char of string within same vector

|.----..|

partial vectors are handled by loading the original value, merging in the
modified bytes with vec_sel, and storing back the modified vector
*/
unsigned char *vec_strfill(unsigned char *s, int len, char p) {
unsigned char* ptrlast = s + (len - 1);
unsigned char* ptrfirst = s;

// for inline char -> vector char conversion
union {
vector unsigned char v;
unsigned char c[16];
} envelope;
envelope.c[0] = p;

// this code fails for len <= 0 when first and last positions are in
// different vectors, but that is an undefined case anyway
assert (len > 0);

// these maske are 0xff where we want to modify chars in memory, 0 otherwise
// (vec_lvsr creates a vector value based on the address alignment; these are
// usually used with vec_perm, but I compare them to a threshold value to
// create a mask vector of 0xff/0x00 chars)
vector unsigned char firstmask;
firstmask = (vector unsigned char) vec_cmpgt(vec_lvsr(0, ptrfirst),
vec_splat_u8(15));
// (the type cast works around a bug in Apple GCC)

// vec_splat can only take values from -16 to +15, so I have to build the
// needed value 17 manually (vec_splat_u8(15) is a common subexpression)
vector unsigned char lastmask;
lastmask = (vector unsigned char) vec_cmplt(vec_lvsr(0, ptrlast),
vec_add(vec_splat_u8(15),
vec_splat_u8(2)));
// (the type cast works around a bug in Apple GCC)

// build a vector full of fill chars
vector unsigned char fillchar = vec_splat(envelope.v, 0);

// first and last chars in same vector? (xor used as bit-wise compare)
if ((((long)ptrlast ^ (long)ptrfirst) & ~15L) == 0) {

firstmask = vec_and(firstmask, lastmask); // simply combine masks

// read, modify (based on mask), write
vec_st(vec_sel(vec_ld(0, ptrfirst), fillchar, firstmask), 0, ptrfirst);

return s;
}

// handle first vector: read, modify (based on mask), write
vec_st(vec_sel(vec_ld(0, ptrfirst), fillchar, firstmask), 0, ptrfirst);
// it can happen that firstmask is all 0xff, so it might pay off to test
// whether s is aligned and go directly to the while() loop

// reduce remaining length by the amount of chars covered by first vector
len -= 16 - ((int)ptrfirst & 15);

// now store a number of aligned vectors
while (len >= 16) {
vec_st(fillchar, 0, ptrfirst);
ptrfirst += 16;
len -= 16;
}

// finally there might be a last partial vector
// (if lastmask is all 0xff, then len is already zero here)
if (len > 0) {
// read, modify (based on mask), write
vec_st(vec_sel(vec_ld(0, ptrlast), fillchar, lastmask), 0, ptrlast);
}

return s;
}
This is perhaps not the ultimately fastest way to do things, but it should come quite close and is hopefully readable enough to serve as an illustrative example.


Top
   
PostPosted: Tue Nov 02, 2004 3:23 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Here's a variant that handles alignment. It is untested, but it compiles. Convincing yourself that it does indeed work (or that it is buggy) is left as an enlightening exercise for the reader. :D

(snip)

This is perhaps not the ultimately fastest way to do things, but it should come quite close and is hopefully readable enough to serve as an illustrative example.
Ok, I took the benchmark a little further, added this routine as altivec-ng, and added also memset() from libmoto. I #defined strfill() as follows:
Code:
#define vec_strfill_libmoto(s, len, p) ((unsigned char *)memset(s, p, len))
and I also added these checks:
Code:
if ( strcmp(test1, test2) == 0) {
printf("test1 = test2)\n");
}
if ( strcmp(test1, test3) == 0) {
printf("test1 = test3)\n");
}
if ( strcmp(test1, test4) == 0) {
printf("test1 = test4)\n");
}
if ( strcmp(test2, test3) == 0) {
printf("test2 = test3)\n");
}
if ( strcmp(test2, test4) == 0) {
printf("test2 = test4)\n");
}
if ( strcmp(test3, test4) == 0) {
printf("test3 = test4)\n");
}
at the end of each size loop.

Here are the results:
Code:
#size scalar altivec altivec-ng libmoto
13 6 1 5 5
test1 = test3)
test1 = test4)
test3 = test4)
16 7 1 5 6
test1 = test2)
test1 = test3)
test1 = test4)
test2 = test3)
test2 = test4)
test3 = test4)
27 11 2 4 4
test1 = test3)
test1 = test4)
test3 = test4)
64 27 2 5 4
test1 = test2)
test1 = test4)
test2 = test4)
90 37 3 6 4
test1 = test4)
128 58 4 7 6
test1 = test2)
test1 = test4)
test2 = test4)
185 75 6 7 8
test1 = test4)
256 106 7 9 7
test1 = test2)
test1 = test4)
test2 = test4)
347 141 10 10 10
test1 = test4)
512 208 14 13 11
test1 = test2)
test1 = test4)
test2 = test4)
831 338 21 20 15
test1 = test4)
2048 832 53 43 30
test1 = test2)
test1 = test4)
test2 = test4)
3981 1617 102 79 55
test1 = test4)
8192 3329 208 161 108
test1 = test2)
test1 = test4)
test2 = test4)
I included my version for reference, though I know it's not correct. I intend to work on Holger's version to correct it and also to understand it, but as for production code, I think that libmoto speaks for itself :-)

Konstantinos


Top
   
PostPosted: Tue Nov 02, 2004 4:53 am 
Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35
Quote:
I included my version for reference, though I know it's not correct. I intend to work on Holger's version to correct it and also to understand it, but as for production code, I think that libmoto speaks for itself :-)
It does, as it should!

If you are serious about an "ultimate" routine, you can start by replacing the loop of aligned stores with this:
Code:
// now store a number of aligned vectors
  while (len >= 64) {
    vec_st(fillchar, 0, ptrfirst);
    len -= 64;
    vec_st(fillchar, 16, ptrfirst);
    vec_st(fillchar, 32, ptrfirst);
    vec_st(fillchar, 48, ptrfirst);
    ptrfirst += 64;
  }

  if (len >= 32) {
    vec_st(fillchar, 0, ptrfirst);
    len -= 32;
    vec_st(fillchar, 16, ptrfirst);
    ptrfirst += 32;
  }

  if (len >= 16) {
    vec_st(fillchar, 0, ptrfirst);
    ptrfirst += 16;
    len -= 16;
  }
This will penalize short strings marginally, but help very notably with longer strings. You could also do cache hinting if you do not expect the target strings to reside in cache. This can often gain a factor of two in that case, but it will slow down the cache resident case.

(The ultimate tuning would come from profiling each call of strfill(), and replace the callee site with calls to specialized routines for either short strings, or long cache resident strings, or long uncached strings, respectively. But that would border on ridiculous over-engineering. Can be fun, though. :-))

There are more opportunities to shave off a cycle here and there, but the biggest gains come from vectorizing at all. In any real application, programmer time is usually best spent by doing five solidly tuned routines rather than a single perfectly tuned one.

However, the first lesson of AltiVec tuning was found by yourself: check if tuned library routines exist and use those. Minimal programming effort for a potentially optimal performance gain.


Top
   
PostPosted: Tue Nov 02, 2004 10:44 am 
Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35
In a vain attempt to regain my pride with more untested code :-) I made another variant that is more seriously aimed at ultimate speed. Even if I had commented it thoroughly, it wouldn't be nearly as readable as my first variant.
Code:
unsigned char *vec_strfill(unsigned char *s, int len, char p) {
assert(len > 0); // sanity check

// build a vector full of copies of p
vector unsigned char fillchar = vec_splat(vec_lvsl(0,(unsigned char*)&p), 0);
fillchar = vec_perm(vec_ld(0, (unsigned char*)&p), fillchar, fillchar);

vector unsigned char firstmask, lastmask;

{ // local scope limits lifetime of temporary variables
vector unsigned char ones, zeros; // to build constant values
ones = (vector unsigned char) vec_cmpeq(fillchar, fillchar); // 0xff
zeros = (vector unsigned char) vec_sub(fillchar, fillchar); // 0x00
// the instructions chosen here go to the VSIU and can potentially be
// executed in parallel with a pair going to VPU and LSU

firstmask = vec_perm(zeros, ones, vec_lvsr(0, s));
// lastmask will be zero when the string's end is aligned
lastmask = vec_perm(ones, zeros, vec_lvsr(len, s));
}

// check if firstmask and lastmask have to be combined
if (len <= (((int)s & 15) ^ 15)) {
vec_st(vec_sel(vec_ld(0,s), fillchar, vec_and(firstmask, lastmask)), 0,s);
return s;
}

// check if a last partial vector needs to be handled
if (vec_all_eq(vec_splat_u8(0), lastmask)) {
// no, string ends with aligned vector, so just do first partial vector
vec_st(vec_sel(vec_ld(0, s), fillchar, firstmask), 0, s);
} else {
// do both partial vectors in parallel for better pipeline utilization
vector unsigned char firstdata = vec_ld(0, s);
vector unsigned char lastdata = vec_ld(len, s);
firstdata = vec_sel(firstdata, fillchar, firstmask);
lastdata = vec_sel(lastdata, fillchar, lastmask);
vec_st(firstdata, 0, s);
vec_st(lastdata, len, s);
}

// handle remaining aligned vectors
len -= 1; // number of chars handled in first vector
len -= ((int)s & 15 ^ 15);
len &= ~15; // number of chars handled in last vector

// when there are many short strings, it might pay to do an early exit here
if (len == 0) return s;

unsigned char* cursor = s + 16; // skip first vector (already processed)

// now store a number of aligned vectors
while (len >= 64) {
vec_st(fillchar, 0, cursor);
len -= 64;
vec_st(fillchar, 16, cursor);
vec_st(fillchar, 32, cursor);
vec_st(fillchar, 48, cursor);
cursor += 64;
}

if (len >= 32) {
vec_st(fillchar, 0, cursor);
len -= 32;
vec_st(fillchar, 16, cursor);
cursor += 32;
}

if (len > 0) {
vec_st(fillchar, 0, cursor);
cursor += 16;
len -= 16;
}

assert(len == 0); // sanity check

return s;
}
I am not totally happy with the approach shown in here ... there ought to be a way to do things with fewer conditional branches. I am also leaving a bit of performance on the table when the string buffer starts on an aligned address. Still, initialization overhead should be a few cycles less for every call, and the unrolled loop should help with long strings.


Top
   
PostPosted: Thu Nov 18, 2004 1:52 pm 
Offline

Joined: Thu Nov 18, 2004 11:48 am
Posts: 110
Looks like I'm late and there aren't any space for improvement that I could think about ^^;

Just a notice :

Altivec intrinsics are implemented as quite long macros, if you nest them make sure you don't pass 2 levels of nesting or you could easly end up with different gb of preprocessed source.

gcc _should_ remove temporary variable for you so using them could solve many headache to people with less ram with a little/no overhead.


Top
   
Display posts from previous:  Sort by  
Post new topic  Reply to topic  [ 6 posts ] 

All times are UTC-06:00


Who is online

Users browsing this forum: No registered users and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  
PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy
Powered by phpBB® Forum Software © phpBB Group