All times are UTC - 6 hours




Post new topic Reply to topic  [ 6 posts ] 
Author Message
 Post subject: memchr() vectorized too
PostPosted: Sat Mar 12, 2005 7:21 am 
Offline


Wed Oct 13, 2004 7:26 am

342

Nafplion
I just finished memchr(), the code will be commited shortly to the cvs. The nice thing is that strlen is pretty much the same to memchr() so we have 2 vectorized functions actually :-)

Code:
$ ./altivectorize -v -s -g --tests memchr --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memchr
test: memchr
#size   arrays  glibc                   altivec (Effective bandwidth)
7       599186  0.050 (133.5 MB/s)      0.090 (74.2 MB/s) (0.6x)
13      325000  0.120 (103.3 MB/s)      0.100 (124.0 MB/s) (1.2x)
16      262144  0.080 (190.7 MB/s)      0.100 (152.6 MB/s) (0.8x)
20      209715  0.130 (146.7 MB/s)      0.160 (119.2 MB/s) (0.8x)
27      155344  0.140 (183.9 MB/s)      0.180 (143.1 MB/s) (0.8x)
35      119837  0.160 (208.6 MB/s)      0.180 (185.4 MB/s) (0.9x)
43      97542   0.120 (341.7 MB/s)      0.210 (195.3 MB/s) (0.6x)
54      77672   0.170 (302.9 MB/s)      0.190 (271.0 MB/s) (0.9x)
64      65536   0.200 (305.2 MB/s)      0.210 (290.6 MB/s) (1.0x)
90      46603   0.250 (343.3 MB/s)      0.200 (429.2 MB/s) (1.2x)
128     32768   0.340 (359.0 MB/s)      0.220 (554.9 MB/s) (1.5x)
185     22672   0.470 (375.4 MB/s)      0.280 (630.1 MB/s) (1.7x)
256     16384   0.640 (381.5 MB/s)      0.290 (841.9 MB/s) (2.2x)
347     12087   0.850 (389.3 MB/s)      0.340 (973.3 MB/s) (2.5x)
512     8192    1.170 (417.3 MB/s)      0.350 (1395.1 MB/s) (3.3x)
831     5047    1.840 (430.7 MB/s)      0.500 (1585.0 MB/s) (3.7x)
2048    2048    4.770 (409.5 MB/s)      1.060 (1842.6 MB/s) (4.5x)
3981    1053    8.920 (425.6 MB/s)      1.890 (2008.8 MB/s) (4.7x)
8192    512     18.420 (424.1 MB/s)     3.530 (2213.2 MB/s) (5.2x)
13488   311     30.380 (423.4 MB/s)     5.960 (2158.2 MB/s) (5.1x)
16384   256     38.550 (405.3 MB/s)     7.230 (2161.1 MB/s) (5.3x)
38893   108     92.030 (403.0 MB/s)     17.510 (2118.3 MB/s) (5.3x)
65536   64      153.320 (407.6 MB/s)    33.110 (1887.6 MB/s) (4.6x)
105001  40      252.450 (396.7 MB/s)    48.970 (2044.9 MB/s) (5.2x)
262144  16      738.860 (338.4 MB/s)    205.150 (1218.6 MB/s) (3.6x)
600000  7       3137.570 (182.4 MB/s)   2000.820 (286.0 MB/s) (1.6x)
1134355 4       7318.030 (147.8 MB/s)   6049.960 (178.8 MB/s) (1.2x)
2097152 2       14181.550 (141.0 MB/s)  12319.100 (162.3 MB/s) (1.2x)


I think, that from now on i'll just post the benchmarks, with very few comments :-)


Top
 Profile  
 
PostPosted: Sun Mar 13, 2005 7:32 pm 
Offline


Wed Oct 13, 2004 7:26 am

342

Nafplion
Same results without the --norandom flag (minimize cache hits).

Code:
$ ./altivectorize -v -s -g --tests memchr --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memchr
test: memchr
#size   arrays  glibc                   altivec (Effective bandwidth)
7       599186  0.160 (41.7 MB/s)       0.190 (35.1 MB/s) (0.8x)
13      325000  0.230 (53.9 MB/s)       0.220 (56.4 MB/s) (1.0x)
16      262144  0.610 (25.0 MB/s)       0.560 (27.2 MB/s) (1.1x)
20      209715  0.240 (79.5 MB/s)       0.280 (68.1 MB/s) (0.9x)
27      155344  0.210 (122.6 MB/s)      0.280 (92.0 MB/s) (0.7x)
35      119837  0.270 (123.6 MB/s)      0.260 (128.4 MB/s) (1.0x)
43      97542   0.290 (141.4 MB/s)      0.290 (141.4 MB/s) (1.0x)
54      77672   0.320 (160.9 MB/s)      0.300 (171.7 MB/s) (1.1x)
64      65536   0.970 (62.9 MB/s)       1.080 (56.5 MB/s) (0.9x)
90      46603   0.370 (232.0 MB/s)      0.310 (276.9 MB/s) (1.2x)
128     32768   1.430 (85.4 MB/s)       1.370 (89.1 MB/s) (1.0x)
185     22672   0.600 (294.0 MB/s)      0.360 (490.1 MB/s) (1.7x)
256     16384   2.010 (121.5 MB/s)      1.950 (125.2 MB/s) (1.0x)
347     12087   1.110 (298.1 MB/s)      0.550 (601.7 MB/s) (2.0x)
512     8192    3.950 (123.6 MB/s)      3.100 (157.5 MB/s) (1.3x)
831     5047    2.460 (322.2 MB/s)      0.860 (921.5 MB/s) (2.9x)
2048    2048    13.710 (142.5 MB/s)     11.320 (172.5 MB/s) (1.2x)
3981    1053    9.440 (402.2 MB/s)      1.960 (1937.0 MB/s) (4.8x)
8192    512     52.950 (147.5 MB/s)     43.840 (178.2 MB/s) (1.2x)
13488   311     48.320 (266.2 MB/s)     17.400 (739.3 MB/s) (2.8x)
16384   256     106.520 (146.7 MB/s)    89.950 (173.7 MB/s) (1.2x)
38893   108     218.540 (169.7 MB/s)    159.490 (232.6 MB/s) (1.4x)
65536   64      431.160 (145.0 MB/s)    349.550 (178.8 MB/s) (1.2x)
105001  40      620.610 (161.4 MB/s)    470.980 (212.6 MB/s) (1.3x)
262144  16      1691.410 (147.8 MB/s)   1387.750 (180.1 MB/s) (1.2x)
600000  7       3827.830 (149.5 MB/s)   3122.020 (183.3 MB/s) (1.2x)
1134355 4       7697.620 (140.5 MB/s)   6415.220 (168.6 MB/s) (1.2x)
2097152 2       14585.110 (137.1 MB/s)  11960.200 (167.2 MB/s) (1.2x)



Top
 Profile  
 
PostPosted: Mon Mar 14, 2005 9:37 pm 
Offline


Wed Oct 13, 2004 7:26 am

342

Nafplion
I just finished memrchr() and memmove(). The code is in the CVS.
Actually there's also memcmp(), seems to work, but it's terribly slow, I must work on it a little more to see what I'm doing wrong.

memcmp() does exist on libmotovec and i'm trying to read the code now to see how it's done there, because its performance is what's expected from altivec :-). If someone could look at memcmp() and explain the process there, I'd be more than grateful. I tried to read it but as I think I said too many times, I'm not an assembly guy :-)

Anyway, here are some benchmarks for memrchr() and memmove():

Code:
$ ./altivectorize -v -s -g --tests memrchr --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memrchr
test: memrchr
#size   arrays  glibc                   altivec (Effective bandwidth)
7       599186  0.050 (133.5 MB/s)      0.100 (66.8 MB/s) (0.5x)
13      325000  0.160 (77.5 MB/s)       0.100 (124.0 MB/s) (1.6x)
16      262144  0.090 (169.5 MB/s)      0.140 (109.0 MB/s) (0.6x)
20      209715  0.100 (190.7 MB/s)      0.240 (79.5 MB/s) (0.4x)
27      155344  0.110 (234.1 MB/s)      0.180 (143.1 MB/s) (0.6x)
35      119837  0.160 (208.6 MB/s)      0.190 (175.7 MB/s) (0.8x)
43      97542   0.180 (227.8 MB/s)      0.210 (195.3 MB/s) (0.9x)
54      77672   0.190 (271.0 MB/s)      0.190 (271.0 MB/s) (1.0x)
64      65536   0.200 (305.2 MB/s)      0.250 (244.1 MB/s) (0.8x)
90      46603   0.290 (296.0 MB/s)      0.230 (373.2 MB/s) (1.3x)
128     32768   0.360 (339.1 MB/s)      0.290 (420.9 MB/s) (1.2x)
185     22672   0.500 (352.9 MB/s)      0.270 (653.4 MB/s) (1.9x)
256     16384   0.660 (369.9 MB/s)      0.320 (762.9 MB/s) (2.1x)
347     12087   0.870 (380.4 MB/s)      0.310 (1067.5 MB/s) (2.8x)
512     8192    1.180 (413.8 MB/s)      0.420 (1162.6 MB/s) (2.8x)
831     5047    2.100 (377.4 MB/s)      0.600 (1320.8 MB/s) (3.5x)
2048    2048    5.090 (383.7 MB/s)      1.090 (1791.9 MB/s) (4.7x)
3981    1053    9.850 (385.4 MB/s)      1.960 (1937.0 MB/s) (5.0x)
8192    512     19.960 (391.4 MB/s)     4.330 (1804.3 MB/s) (4.6x)
13488   311     32.860 (391.5 MB/s)     6.670 (1928.5 MB/s) (4.9x)
16384   256     40.020 (390.4 MB/s)     7.880 (1982.9 MB/s) (5.1x)
38893   108     95.040 (390.3 MB/s)     20.350 (1822.7 MB/s) (4.7x)
(more...)

$ ./altivectorize -v -s -g --tests memmove --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memmove
#size   arrays  glibc                   altivec (Effective bandwidth)
7       599186  0.130 (51.4 MB/s)       0.120 (55.6 MB/s) (1.1x)
13      325000  0.150 (82.7 MB/s)       0.150 (82.7 MB/s) (1.0x)
16      262144  0.150 (101.7 MB/s)      0.160 (95.4 MB/s) (0.9x)
20      209715  0.160 (119.2 MB/s)      0.230 (82.9 MB/s) (0.7x)
27      155344  0.160 (160.9 MB/s)      0.170 (151.5 MB/s) (0.9x)
35      119837  0.180 (185.4 MB/s)      0.160 (208.6 MB/s) (1.1x)
43      97542   0.200 (205.0 MB/s)      0.180 (227.8 MB/s) (1.1x)
54      77672   0.210 (245.2 MB/s)      0.150 (343.3 MB/s) (1.4x)
64      65536   0.180 (339.1 MB/s)      0.220 (277.4 MB/s) (0.8x)
90      46603   0.210 (408.7 MB/s)      0.200 (429.2 MB/s) (1.1x)
128     32768   0.240 (508.6 MB/s)      0.230 (530.7 MB/s) (1.0x)
185     22672   0.270 (653.4 MB/s)      0.180 (980.2 MB/s) (1.5x)
256     16384   0.350 (697.5 MB/s)      0.280 (871.9 MB/s) (1.2x)
347     12087   0.420 (787.9 MB/s)      0.220 (1504.2 MB/s) (1.9x)
512     8192    0.560 (871.9 MB/s)      0.350 (1395.1 MB/s) (1.6x)
831     5047    0.930 (852.2 MB/s)      0.400 (1981.3 MB/s) (2.3x)
2048    2048    2.130 (917.0 MB/s)      0.960 (2034.5 MB/s) (2.2x)
3981    1053    3.770 (1007.0 MB/s)     1.050 (3615.8 MB/s) (3.6x)
8192    512     8.960 (871.9 MB/s)      3.340 (2339.1 MB/s) (2.7x)
13488   311     15.190 (846.8 MB/s)     5.410 (2377.7 MB/s) (2.8x)
16384   256     17.670 (884.3 MB/s)     6.760 (2311.4 MB/s) (2.6x)
38893   108     38.800 (956.0 MB/s)     17.900 (2072.1 MB/s) (2.2x)
(more...)


Top
 Profile  
 
PostPosted: Tue Mar 15, 2005 3:02 am 
Offline


Mon Oct 11, 2004 12:49 am

35
I think memcmp() is tricky, because there are three varying conditions even before you compary any two chars to each other: alignments of either source and string length. I see some danger of doing way too many conditional branches.

My first attempt would be to treat both strings as unaligned and handle them the usual way (vec_lvsl(ptr), vec_ld(0, ptr), vec_ld(15, ptr), vec_perm). Compare for equality (vec_all_eq) and loop until string length is reached. Special care has to be taken for the last partial vector; it might be easiest to do that in scalar code (but that would mean no speedup whatsoever for string sizes < 16). To vectorize this, I'd use a boolean mask to AND with, setting all extraneous chars to zero. The mask can probably be created by comparing the result of vec_lvsl or vec_lvsr of the length (cast to void*) to an appropriate threshold.

If at any time the vec_all_eq is false, you need to locate the first offending char position. This is one of the few things that AltiVec has no direct support for. I suspect it might be fastest to do that in scalar code, four bytes at a time with unsigned int compares. All G4 and G5 PPC models have reasonably efficient support for unaligned scalar loads, so performance should be okay. In the worst case, four scalar comparisons have to be done to find the difference in the last quarter of a vector.


Top
 Profile  
 
PostPosted: Tue Mar 15, 2005 12:22 pm 
Offline


Wed Oct 13, 2004 7:26 am

342

Nafplion
This is the memcmp() i wrote:

Code:
int memcmp_vec_aligned(const void *s1, const void *s2, size_t len) {
    unsigned char* s1first = s1;   
    unsigned char* s2first = s2;

    vector unsigned char s1vector, s2vector;

    //printf("s1first = %x\ts2first = %x\tlen = %ld\n", s1first, s2first, len);
    while (len >= 16) {
        s1vector = vec_ld(0, s1first);
        s2vector = vec_ld(0, s2first);
        if (vec_any_ne(s1vector, s2vector)) {
            //printf("Vectors not equal, calling memcmp\n");
            return memcmp(s1first, s2first, 16);
        }
        s1first += 16;
        s2first += 16;
        len -= 16;
        //printf("s1first = %x\ts2first = %x\tlen = %ld\n", s1first, s2first, len);
    }
   
    if (len > 0) {
        //printf("Stuff left, calling memcmp()\n");
        return memcmp(s1first, s2first, len);
    }
    return 0;
}

int memcmp_vec(const void *s1, const void *s2, size_t len) {
       
    if (len <= MEMCMP_THRESHOLD) {
        return memcmp(s1, s2, len);
    }
   
    // for inline  char -> vector char  conversion
    unsigned char* s1first = s1;   
    unsigned char* s2first = s2;
    unsigned char *temp1;
   
    int temp2, result;
    int s1offset = (int)s1first & 15;
    int s2offset = (int)s2first & 15;
   
    vector unsigned char s1vector, s2vector;
    vector unsigned char MSQ, LSQ;
    vector unsigned char mask;
   
#ifdef DEBUG
    printf("srcfirst = %x\tsrclast = %x\n", s1first, s1first+len-1);
    printf("dstfirst = %x\tdstlast = %x\n", s2first, s2first+len-1);
    printf("s1offset = %ld\ts2offset = %ld\n", s1offset, s2offset);
#endif
   
    if (s1offset == 0 && s2offset == 0) {
        // Both buffers are 16-byte aligned, just compare the aligned values
        return memcmp_vec_aligned(s1first, s2first, len);
    } else {   
        // Use standard memcmp to copy the few bytes at the beginning
        result = memcmp(s1first, s2first, s1offset);
        if (result)
            return result;
       
        // Advance both pointers appropriately, now s1first is 16-byte aligned
        s1first += s1offset;
        s2first += s1offset;
        len -= s1offset;
        // recalculate the s2offset
        s2offset = 16 - ((int)s2first & 15);
        if (s2offset == 0) {
            return memcmp_vec_aligned(s1first, s2first, len);
        }

        while (len >= 16) {
            MSQ = vec_ld(0, s2first);            // most significant quadword
            LSQ = vec_ld(15, s2first);           // least significant quadword
            mask = vec_lvsl(0, s2first);         // create the permute mask
            if (vec_any_ne(vec_ld(0, s1first), vec_perm(MSQ, LSQ, mask)))
                return memcmp(s1first, s2first, 16);
            s1first += 16;
            s2first += 16;
            len -= 16;
        }
       
        if (len > 0) {
            return memcmp(s1first, s2first, len);
        }
    }
    return 0;
}


and these are the results i got:

Code:
$ ./altivectorize -v -s -g --tests memcmp --norandom --loops 1000000
Altivec is supported
Verbose mode on
Will do both scalar and vector tests
Will also do glibc tests
loops: 1000000
output file:
will do tests: memcmp
#size   arrays  glibc                   altivec (Effective bandwidth)
7       599186  0.020 (333.8 MB/s)      0.100 (66.8 MB/s) (0.2x)
13      325000  0.030 (413.3 MB/s)      0.130 (95.4 MB/s) (0.2x)
16      262144  0.020 (762.9 MB/s)      0.130 (117.4 MB/s) (0.2x)
20      209715  0.030 (635.8 MB/s)      0.170 (112.2 MB/s) (0.2x)
27      155344  0.040 (643.7 MB/s)      0.190 (135.5 MB/s) (0.2x)
35      119837  0.040 (834.5 MB/s)      0.140 (238.4 MB/s) (0.3x)
43      97542   0.050 (820.2 MB/s)      0.190 (215.8 MB/s) (0.3x)
54      77672   0.040 (1287.5 MB/s)     0.160 (321.9 MB/s) (0.2x)
64      65536   0.040 (1525.9 MB/s)     0.140 (436.0 MB/s) (0.3x)
90      46603   0.060 (1430.5 MB/s)     0.200 (429.2 MB/s) (0.3x)
128     32768   0.070 (1743.9 MB/s)     0.160 (762.9 MB/s) (0.4x)
185     22672   0.090 (1960.3 MB/s)     0.210 (840.1 MB/s) (0.4x)
256     16384   0.120 (2034.5 MB/s)     0.210 (1162.6 MB/s) (0.6x)
347     12087   0.170 (1946.6 MB/s)     0.300 (1103.1 MB/s) (0.6x)
512     8192    0.250 (1953.1 MB/s)     0.320 (1525.9 MB/s) (0.8x)
831     5047    0.390 (2032.1 MB/s)     0.520 (1524.0 MB/s) (0.8x)
2048    2048    0.920 (2123.0 MB/s)     1.150 (1698.4 MB/s) (0.8x)
3981    1053    1.990 (1907.8 MB/s)     2.570 (1477.3 MB/s) (0.8x)
8192    512     3.630 (2152.2 MB/s)     5.060 (1544.0 MB/s) (0.7x)
13488   311     6.250 (2058.1 MB/s)     10.020 (1283.7 MB/s) (0.6x)
16384   256     7.940 (1967.9 MB/s)     10.630 (1469.9 MB/s) (0.7x)


Now, I really can't understand why I get such abysmal performance. :-(


Top
 Profile  
 
PostPosted: Thu Mar 17, 2005 9:06 am 
Offline


Thu Mar 17, 2005 8:53 am

1
I thought about the scalar memcmp a bit, and I figure that it probably unrolls the loop, and could theoretically do 16 bytes every 8 or 9 cycles. With the improvements I suggested on IRC, the unaligned case is going to look something like this in assembly:

lvx s1
lvx s2+16
permute
compare
branch
decrement len (and update CR)
increment index
branch

On a G4, this will take at least 10 cycles per iteration. More, actually, since the G4 will have to take a cycle to redirect flow on the looping branch.

This means that the scalar code will outperform the simple vector loop. You'll have to unroll the loop to beat it. Something like this:

a1 = vec_ld(0,s1)
LSQ = vec_ld(16,s2)
a2 = vec_ld(16,s1)
b2 = vec_ld(32,s2)

if (vec_any_ne(a1, vec_perm(MSQ, LSQ, mask)))
return memcmp(s1, s2, 16)

if (vec_any_ne(a2, vec_perm(LSQ, b2, mask)))
return memcmp(s1+16, s1+16, 16)

s1+=32
s2+=32
len -= 16

(loop)

Of course, this needs to be scheduled properly by the compiler, but we'll just hope it does that automatically.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 6 posts ] 

All times are UTC - 6 hours


Who is online

Users browsing this forum: No registered users and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group