All times are UTC-06:00




Post new topic  Reply to topic  [ 6 posts ] 
Author Message
PostPosted: Sat Oct 15, 2005 7:05 pm 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi,

I need to explain the virtues of using Altivec on memory operation to some other developers. I think the first step is to explain in a real simple way how a CPU works.

Here is my first approach:
Please be so kind to comment to it!

##
CLAIM: Many memory manipulation routines perform with poor performance.

To be able to write high performing memory accessing routines
its crutial to understand how a CPU accesses memory.

Lets look at a simple example: RTRIM_SPACES.

uchar *str= pointer to string
uint length = length of sting
uchar *end= str+length

WHILE ( end>str && end[-1]!=' ' )
end--;


The routine is memory based.
Everybody knows the memory is slower than a CPU.
So you would assume that it does not matter how the routine accesss the memory
as the memory will always be the bottleneck. This is not true.

Lets look at the performance of the routine on a simplified CPU model

Lets assume our test string is 32 chars long

FACT) A CPU reads cache lines from memory into the CPU caches
Our example CPU loads a 16 bytes cache line in 10 clocks.

To be able to compare the first byte the CPU needs load 1 cache line
=> 10 clocks

Then the CPU will iterate byte wise over the string.

Lets assume the CPU to the byte compare in 1 clock
and needs 1 clocks for the loop.

Time to process the 16 byte cache line will be 32 clock.

For the whole string our routine will need
10 clocks loading first cache line, 16 bytes
+16 clocks processing cache line
+16 clock for the loops
+10 cloack loading next 16 bytes
+16 clocks processing next cache line
+16 clocks for the loops

Total: 84 clocks


WILL LOOP UNROLLING HELP?
Yes, loop unrollong will reduce the penalty of the loop.

Lets say we unrool 4 times we will only need 4 loops for each cache line

For the whole string our routine will need
10 clocks loading first cache line, 16 bytes
+16 clocks processing cache line
+ 4 clock for the 4 loops
+10 cloack loading next 16 bytes
+16 clocks processing next cache line
+ 4 clocks for the 4 loops

Total: 60 clocks


Loopunrolling helps but in real live the time needed to process the data might be higher
so the impact of the unrolling will usually be lower.


TODAY ALL CPUS HAVE MULTIBLE INSTRUCTION UNITS
WHY CAN THE CPU PROCESS MULTIBLE BYTES PER CLOCK?

Yes, all modern CPU have multible instruction units and are often able to process
up to 4 independant integer instructions at the same time.
But many CPU can only perform one memory load per clock.
Because of the multible instruction units we assume that our example CPU only needs one clock to process each byte. We assume that the CPU is able to execute the str<end compare at the same time as the processing of the memory byte.

FACT: while modern CPUs have multible instruction units they often have a limited number of load/store units.
This means our CPU might be able to perform 4 integer operations per clock but only one load.


HOW CAN WE SPEED UP THE ROUTINE?
By operating on longer words. If we can find a way to access the memory in 32bit instead 8bit then we can highly increase the throughput.

Lets say we are able to work on 32 bit words and we loopunroll 4 times.

Example:
For the whole string our routine will need
10 clocks loading first cache line, 16 bytes
+ 4 clocks processing cache line (working on 4 bytes each)
+ 1 clock for the loop
+10 cloack loading next 16 bytes
+ 4 clocks processing next cache line
+ 1 clocks for the loop

Total: 30 clocks


HOW CAN ALTIVEC HELP?
ALTIVEC has powerfull string pattern matching operation and Altivec is able to process 16 bytes per instruction.

So with AlLTIVEC we could in theory get down to
10 clocks loading first cache line, 16 bytes
+ 1 clocks processing cache line (working on 16 bytes)
+ 1 clock for the loop
+10 cloack loading next 16 bytes
+ 1 clocks processing next cache line
+ 1 clocks for the loop

Total: 24 clocks


--
The above example is oversimplified.
If you have any ideas to improve it and how you can better explain this then please comment.

Cheers
Gunnar


Top
   
PostPosted: Sun Oct 16, 2005 11:05 pm 
Offline
Genesi

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1422
Gunnar, can we meet this child of yours?!

She/he must be a child prodigy! :-)

R&B


Top
   
PostPosted: Mon Oct 17, 2005 2:11 am 
Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35
The explanation is nice, as long as you mention that it is a general illustration, not an accurate description of every detail.

There are a few additional goodies in the hardware that are triggered more easily from vector code than from scalar code. Most importantly the "store miss merging" mechanism. The processor attempts to merge two pending stores of adjacent small quantities into one store of a larger quantity. This is recursively done each clock cycle, but the hardware cannot work miracles. It's simplest for the hardware when two adjacent vector stores can be merged, because then the merged transaction contains a whole cache line. In that particular case, the modified cacheline can be written out directly, instead of the more usual "read old contents then merge with new data" procedure. This can reduce bandwidth requirements of a copy operation by a third.

Another potential benefit of AltiVec is that there won't be conditional branches in the loop body; vectorization forces you to find a more streamlined algorithm simply because you cannot express spaghetti code with vectors. Another side effect of the inherent parallelism is that loops will be sort-of unrolled, as each iteration works on more than one element of data. Actual unrolling still helps, of course.

On the downside, you cannot scan along vectors sequentially. The hardware processes each element at the same time, so it just doesn't fit the paradigm. I could imagine one or two extra instructions that would help to vectorize that type of algorithm, but I don't know if those could practically be etched in silicon. Data dependencies across the full vector width are bad for the timing.


Top
   
PostPosted: Mon Oct 17, 2005 7:41 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Thanks Habold for the tips.

I think that many developers have a slight mispercention of how a CPU opperates
and because of this have problems to understand which algorythm will perform and which not.

My post was a first try to draw a simple picture of a CPU to explain a few things.

Any feedback to make this clearer to understand are highly appreciated.


Cheers
Gunnar


Top
   
PostPosted: Mon Oct 17, 2005 6:16 pm 
Offline

Joined: Wed Oct 20, 2004 11:20 pm
Posts: 82
Quote:
She/he must be a child prodigy! :-)
Yeah, but those are always the kids that are lead out of their house in handcuffs for hacking into banks or unleashing "I love you" virii on the world.

_________________
--Aaron
Gentoo 2006.1
Kernel-2.6.17-no4


Top
   
PostPosted: Tue Oct 18, 2005 3:26 am 
Offline

Joined: Mon Oct 11, 2004 12:49 am
Posts: 35
Yes, the explanation is good. Too many people still think that the processor fetches and executes one instruction after another. Their perception of performance is limited to the question "how long does this particular instruction take?".

Anything that educates them about the enormous amount of parallelism, that is really employed in even a comparably simple CPU like G4+, is a good thing. These days, performance means high throughput, not low latency.

Arriving at a result in fewer instructions is one trick, but it is equally important to hand enough independent work to the CPU so that all the parallel functional units (or pipeline stages) can simultaneously help to achieve the overall goal.

A traditional metaphor for a CPU used to be a file clerk working at his desk. But these days it is more appropriate to think of the processor as a big factory hall with lots of workers and machinery. That other image makes it obvious that a lot of work can be done in parallel, and that efficiency means to keep everyone busy with sensible work.


Top
   
Display posts from previous:  Sort by  
Post new topic  Reply to topic  [ 6 posts ] 

All times are UTC-06:00


Who is online

Users browsing this forum: No registered users and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  
cron
PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy
Powered by phpBB® Forum Software © phpBB Group