All times are UTC - 6 hours




Post new topic Reply to topic  [ 12 posts ] 
Author Message
PostPosted: Sat Apr 19, 2008 5:50 am 
Offline


Wed Oct 13, 2004 7:26 am

347
Image

Matrix inversion is one of the most time consuming routines in 3D -and other math-related- applications. I managed to beat the scalar version by ~300% and also beat the highly optimized routine found in cellperformance.com by a small margin!

For detailed math & code analysis and benchmarks, please see

http://www.freevec.org/function/inverse ... rtitioning

I have a couple more matrix functions to optimize and then I will start on integrating these to Mesa/OpenGL.


Top
 Profile  
 
PostPosted: Mon Apr 21, 2008 1:33 am 
Offline


Mon Jan 08, 2007 3:40 am

195

Pinto, Madrid, Spain
markos wrote:
I managed to beat the scalar version by ~300% and also beat the highly optimized routine found in cellperformance.com by a small margin!


Konstantinos, I don't understand a word of this magic you show to us, but it looks impressive, and for sure is a big step forward for Altivec applications. I hope you can find more people that are able to understand your work, so their congratulations are more worth than mine. Translating this mathematical expressions into computer language is indeed work of art.

Congratulations!


Top
 Profile  
 
PostPosted: Mon Apr 21, 2008 8:54 am 
Offline


Wed Oct 13, 2004 7:26 am

347
jcmarcos wrote:
markos wrote:
I managed to beat the scalar version by ~300% and also beat the highly optimized routine found in cellperformance.com by a small margin!


Konstantinos, I don't understand a word of this magic you show to us, but it looks impressive, and for sure is a big step forward for Altivec applications. I hope you can find more people that are able to understand your work, so their congratulations are more worth than mine. Translating this mathematical expressions into computer language is indeed work of art.

Congratulations!


Hi Juan,

thanks for your kind words it really is nice to be appreciated :)
I have to say though, it may be all math and uninteresting to many, but when it will get integrated eg. into Mesa, I think many people will appreciate it even more :D
Stay tuned :)


Top
 Profile  
 
PostPosted: Mon Apr 21, 2008 12:47 pm 
Offline


Tue Jan 31, 2006 1:18 am

49

Bialystok, Poland
http://www.freevec.org/function/inverse ... rtitioning

Interesting article. I have one comment to it - you use a permutation mask on the last stage of calculation. If loading this mask from memory hurts performance, it can be easily generated algorythmically Here are the steps:

1. Generate a word with all zeros (splat operator).
2. Generate a word with all eights (splat too).
3. Make a word with 8 zeros followed by 8 eights (sld operator).
4. Generate a linearly increasing mask [0, 1, 2, 3, 4...]. I do not remember operator name now (have no PDF at hand).
5. Add words from 3. and 4.

It is 5 AltiVec instructions. I expect it to be faster on G4 even if mask is retrieved from L1 cache.


Top
 Profile  
 
PostPosted: Mon Apr 21, 2008 4:20 pm 
Offline
Site Admin


Fri Sep 24, 2004 1:39 am

1589

Alamo Heights, TX
Grzegorz Kraszewski wrote:
1. Generate a word with all zeros (splat operator).
2. Generate a word with all eights (splat too).
3. Make a word with 8 zeros followed by 8 eights (sld operator).
4. Generate a linearly increasing mask [0, 1, 2, 3, 4...]. I do not remember operator name now (have no PDF at hand).
5. Add words from 3. and 4.

It is 5 AltiVec instructions. I expect it to be faster on G4 even if mask is retrieved from L1 cache.


Konstantinos, is this the same thing you were asking on IRC about!?

Amazing stuff anyway. It's kind of staggering what you guys can come up with if you're working on the same code..

I wonder if we could get you guys to cooperate on this more. I think one of the better things about MacOS is the provisions of things like vecLib etc. - FIR/FFT, linear algebra all in a standard system library. If that library was available under Linux (and, why not MorphOS too?) with some Apple-compatibility stubs it could really benefit developers who need instant, speedy ways to optimize their applications.. Grzegorz you have all the experience in FIR :)
Matt Sealey, Genesi USA Inc.
Product Development Analyst


Top
 Profile  
 
 Post subject:
PostPosted: Tue Apr 22, 2008 12:26 am 
Offline
Genesi


Mon Jan 30, 2006 2:28 am

409

Finland
Hi all.

First of all, great work! This is really cool stuff.

Secondly, what IRC channel are you guys in? I want in on the fun as soon as I have some more time (in a couple of weeks). Creating a library like Matt pointed out makes sense doing. I'd like to help. I'd also like to expand on this for cryptographic applications.


Best regards,
Johan
Johan Dams, Genesi USA Inc.
Director, Software Engineering

Yep, I have a blog... PurpleAlienPlanet


Top
 Profile  
 
PostPosted: Tue Apr 22, 2008 2:57 am 
Offline


Wed Oct 13, 2004 7:26 am

347
Grzegorz Kraszewski wrote:
http://www.freevec.org/function/inverse_matrix_4x4_using_partitioning

Interesting article. I have one comment to it - you use a permutation mask on the last stage of calculation. If loading this mask from memory hurts performance, it can be easily generated algorythmically Here are the steps:

1. Generate a word with all zeros (splat operator).
2. Generate a word with all eights (splat too).
3. Make a word with 8 zeros followed by 8 eights (sld operator).
4. Generate a linearly increasing mask [0, 1, 2, 3, 4...]. I do not remember operator name now (have no PDF at hand).
5. Add words from 3. and 4.

It is 5 AltiVec instructions. I expect it to be faster on G4 even if mask is retrieved from L1 cache.


Hi,
I tried sth in the same lines (4 instructions more, I already had v0 splat), as I thought that it would be faster to create the mask instead of loading it. But strangely enough, the code became a bit slower (1.47 sec for 10000000 loops instead of 1.37 sec). I can't really explain why this happened, but I reverted loading the mask instead of creating it.


Top
 Profile  
 
 Post subject:
PostPosted: Tue Apr 22, 2008 3:08 am 
Offline


Wed Oct 13, 2004 7:26 am

347
PurpleAlien wrote:
Hi all.

First of all, great work! This is really cool stuff.

Secondly, what IRC channel are you guys in? I want in on the fun as soon as I have some more time (in a couple of weeks). Creating a library like Matt pointed out makes sense doing. I'd like to help. I'd also like to expand on this for cryptographic applications.


Best regards,
Johan


Hi Johan,

Cryptography is one of the areas that is mostly left untouched wrt to AltiVec optimizations, and this is strange since it was built for such stuff (sure there are a few routines optimized but compared to the multitude of encryption/compression routines out there this is nothing).

But from one point of view this is understandable as it is quite hard for someone to vectorize such an algorithm, and it is not always obvious that it is indeed possible (eg md5sum is not directly possible to vectorize and yet I know that some genious developer in Freescale has actually done that, vectorized the unvectorizable!). Still there are many more algorithms that are easier to tackle (I for one have given zlib 25% speed increase in decompression by tackling only 2 out of 6 time consuming functions).

What I would really really like to see is a reference ppc-oriented distro that really shows the platform's potential. Something that will not have to carry the usual baggage but be slim, optimized and only offer the grounds for derivative platforms.
But it should carry optimized versions of libc, libmcrypt, openssl, mesa3d, etc.

I've done some thinking on this and it's doable and not even that difficult (we don't have to reinvent the wheel, we could just base our work on sth existing), but we should refrain from trying to have everything under the sun like most distros do :)

As for the IRC channels, I'm on Freenode as markos_ and in quite a few channels, so if you're around just /query and we'll continue the discussion in a channel :)


Top
 Profile  
 
PostPosted: Tue Apr 22, 2008 3:27 am 
Offline


Mon Jan 08, 2007 3:40 am

195

Pinto, Madrid, Spain
Grzegorz Kraszewski wrote:
you use a permutation mask on the last stage of calculation. If loading this mask from memory hurts performance, it can be easily generated algorythmically. It is 5 AltiVec instructions.


Genius! More! more!


Top
 Profile  
 
 Post subject:
PostPosted: Tue Apr 22, 2008 3:34 am 
Offline
Genesi


Mon Jan 30, 2006 2:28 am

409

Finland
Hi

Quote:
Cryptography is one of the areas that is mostly left untouched wrt to AltiVec optimizations, and this is strange since it was built for such stuff (sure there are a few routines optimized but compared to the multitude of encryption/compression routines out there this is nothing).


It's just that Cryptography is my main research interest, especially applied to embedded systems. It would be nice to have an embedded chip in the future with Altivec unit to provide fast secure communication in a wireless sensor network for instance.
My main research is in Elliptic Curve cryptography, where, as an example, the Montgomery Modular Multiplication can be vectorized. In the non-ECC field, AES would be a prime candidate to test on the Altivec.


Johan.
Johan Dams, Genesi USA Inc.
Director, Software Engineering

Yep, I have a blog... PurpleAlienPlanet


Top
 Profile  
 
 Post subject:
PostPosted: Tue Apr 22, 2008 3:53 am 
Offline


Wed Oct 13, 2004 7:26 am

347
PurpleAlien wrote:
Hi
It's just that Cryptography is my main research interest, especially applied to embedded systems. It would be nice to have an embedded chip in the future with Altivec unit to provide fast secure communication in a wireless sensor network for instance.
My main research is in Elliptic Curve cryptography, where, as an example, the Montgomery Modular Multiplication can be vectorized. In the non-ECC field, AES would be a prime candidate to test on the Altivec.


Johan.


Wow! That is great! That was my drawback since I haven't studied any cryptography and i had to do the reading from scratch, but if you can help with the actual algorithms I would be glad to work on the altivec implementation! Whenever you feel like join IRC and we could start asap! :D


Top
 Profile  
 
 Post subject:
PostPosted: Tue Apr 22, 2008 7:28 am 
Offline
Site Admin


Fri Sep 24, 2004 1:39 am

1589

Alamo Heights, TX
markos wrote:
Wow! That is great!


In the meantime, why not start at the low end with some transformations that make heavy use of vperm for table lookups or other clever uses other than simply processing masses of data at once?

Algorithms already tuned to AltiVec that you can find on Google are.. the minor stuff like base64 encoding and decoding, and there are fast AES implementations which use *h u g e* tables. You may not be speeding up anything but the table lookup :)

However the full unrolled version is quite a neat thing; every part of the rounds are as follows;

Code:
   s0 = GETU32(pt     ) ^ rk[0];
   s1 = GETU32(pt +  4) ^ rk[1];
   s2 = GETU32(pt +  8) ^ rk[2];
   s3 = GETU32(pt + 12) ^ rk[3];
      t0 = Te0[s0 >> 24] ^ Te1[(s1 >> 16) & 0xff] ^ Te2[(s2 >>  8) & 0xff] ^ Te3[s3 & 0xff] ^ rk[ 4];
      t1 = Te0[s1 >> 24] ^ Te1[(s2 >> 16) & 0xff] ^ Te2[(s3 >>  8) & 0xff] ^ Te3[s0 & 0xff] ^ rk[ 5];
      t2 = Te0[s2 >> 24] ^ Te1[(s3 >> 16) & 0xff] ^ Te2[(s0 >>  8) & 0xff] ^ Te3[s1 & 0xff] ^ rk[ 6];
      t3 = Te0[s3 >> 24] ^ Te1[(s0 >> 16) & 0xff] ^ Te2[(s1 >>  8) & 0xff] ^ Te3[s2 & 0xff] ^ rk[ 7];


You can load in the pt indices to a vector (contiguous) and then the rk elements (which are contiguous anyway, too!) into another, and vec_xor.

This gives you a vector with {s0,s1,s2,s3} which will come in handy. You can create a mask of 24, 16, 8 and a mask of all 0xff with some tricks, easily.

The shift right and can be applied to all of the elements after the permute.

Then you have to do a table lookup and load 4 elements from these locations in memory.. the rk[] elements you can cheat with since they are 16 bytes contiguous in memory :)

Then you have to xor them together which is going to be a major pain because you can't do that in AltiVec, however.. couldn't the data be loaded in twice and then permuted into the 4th element each time (i.e. shift each 32-bit element down and xor and then only care about that element) then store that element?
Matt Sealey, Genesi USA Inc.
Product Development Analyst


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 12 posts ] 

All times are UTC - 6 hours


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group