Quote:
I might be wrong but I feel a good working integer version should be enough for Linux software, shouldn't it?
It should. I figure the best-performing version for the most common denominator should be used; if the e300 improvements also help Cell, then that version is a candidate for Cell too.
However, optimizing for 128-byte cache lines and 64-bit processors is going to destroy performance on the weaker processors. The balance between how much you prefetch and how much work you do while a prefetch is happening is VERY complicated. But there are far more e300-class PowerPCs out there than there are POWER5 and Cell, so please the masses, not the people who have already spent thousands of dollars on brute force.
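To make that balance concrete, here is a minimal sketch in portable C of a copy loop that prefetches a fixed distance ahead of the data it is currently copying. The 32-byte line size and the 4-line prefetch distance are assumptions for illustration, not measured values, and `copy_with_prefetch` is a hypothetical name; `__builtin_prefetch` is the GCC/Clang builtin, which the compiler may turn into dcbt on PowerPC.

```c
#include <stddef.h>
#include <string.h>

/* Assumed tuning knobs for illustration, not measured values. */
#define LINE_BYTES     32   /* cache line size assumed for e300-class parts */
#define PREFETCH_LINES 4    /* hypothetical prefetch distance, in lines */
#define PREFETCH_BYTES (LINE_BYTES * PREFETCH_LINES)

/* Copy n bytes one line at a time, prefetching the source ahead. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= LINE_BYTES) {
        /* Only prefetch while the target is still inside the buffer. */
        if (n > PREFETCH_BYTES)
            __builtin_prefetch(s + PREFETCH_BYTES, 0, 0);

        memcpy(d, s, LINE_BYTES);   /* the work done per prefetch */
        d += LINE_BYTES;
        s += LINE_BYTES;
        n -= LINE_BYTES;
    }
    memcpy(d, s, n);                /* remaining tail, less than one line */
    return dst;
}
```

Changing LINE_BYTES to 128 while keeping the same distance in lines prefetches four times as many bytes ahead per iteration, which is the kind of retuning that can help a big machine and hurt a small one.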
Quote:
But I fear that The FPU or Altivec routines are difficult to integrate into Linux.
Well, like I said, you can't use the FPU or AltiVec stuff in the kernel. The context saving required means that, unless you did huge copies, any performance improvement would be swamped.
But in glibc this is already done for you (actually the kernel does it, together with the function prologue the compiler generates); you don't need to protect any task or do anything weird except copy memory.
I think there have to be a couple of solutions. You said a userspace-to-userspace copy in an application could gain some 10% from using FPU or AltiVec registers; I think that 10% is worth having if you have those registers.
I do think AltiVec is worth it, though, at least when it is using data streams (dstt) rather than standard cache management (dcbt), since you can separate the prefetching out. A task may use prefetch stream 0 or 1 to prefetch data for an algorithm of its own (perhaps MPEG decoding or so), while glibc would use a higher stream number. This is the recommended way in the AltiVec Programming Environments manual (userspace counts up from 0, kernel/libc counts down from 3) to keep the system software and userspace from stealing each other's streams.
The dcb* instructions interact with each other, and certain usages can cancel each other out, so you get one chance and have to hope the task using glibc memcpy does not do any cache prefetching using dcb*. You also have to hope the kernel doesn't use it.
I think glibc, the kernel and a task all using only standard dcb* prefetching may bring about a net performance loss compared to what is expected, whereas using data streams will not. The problem is, I have never seen a comprehensive use of data streams benchmarked. Most benchmarks are very tiny and do not interact with a lot of cache prefetching of either kind.
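As a sketch of the stream split described above (applications counting up from 0, libc/kernel counting down from 3), the following C builds a data stream control word and starts a transient stream on stream 3. The names `dst_control`, `libc_start_stream`, `libc_stop_stream` and the block size/count values are hypothetical; `vec_dstt` and `vec_dss` are the `<altivec.h>` intrinsics, and that path only compiles with an AltiVec-enabled PowerPC toolchain. Elsewhere the calls fall back to no-ops.

```c
#include <stdint.h>

#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Stream assignment as described in the post: applications count up
 * from 0, system software (kernel/libc) counts down from 3. */
#define APP_STREAM  0
#define LIBC_STREAM 3

/* Build a dst control word: block size in 16-byte vectors (5 bits),
 * block count (8 bits), signed byte stride between blocks (16 bits). */
static uint32_t dst_control(unsigned size_vectors, unsigned count, int stride)
{
    return ((uint32_t)(size_vectors & 0x1F) << 24) |
           ((uint32_t)(count & 0xFF) << 16) |
           ((uint32_t)stride & 0xFFFF);
}

/* Start a transient prefetch stream for a libc-style copy.
 * The values 2, 8, 32 are illustrative only: 8 blocks of two
 * vectors (32 bytes) each, laid out back to back. */
static void libc_start_stream(const unsigned char *src)
{
#ifdef __ALTIVEC__
    vec_dstt(src, dst_control(2, 8, 32), LIBC_STREAM);
#else
    (void)src;  /* no data streams on this target */
#endif
}

/* Stop the libc stream without touching the application's streams. */
static void libc_stop_stream(void)
{
#ifdef __ALTIVEC__
    vec_dss(LIBC_STREAM);
#endif
}
```

An application doing its own prefetching would pass APP_STREAM (or 1) as the stream ID, so neither side cancels the other's stream.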
Quote:
With proper prefetching I get the glib routine on EFIKA 100% faster. The same is true for PEGASOS. I'm not sure that using Altivec will give a big improvement over good integer code.
I think it is down to pipeline bubbles and use of the LSU: you can only do so many things at once. AltiVec helps memcpy because you can use two instructions to do two load/store ops which will fill an entire cache line. In theory the bus then has more opportunity to perform bursts, and if the data is cache-aligned, with prefetches and correct interleaving you will get by far the best performance (see libmotovec; the code is disgusting to read, but it leaves absolutely no dead cycles).
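A minimal sketch of the "two instructions per cache line" idea, assuming a 32-byte cache line and buffers aligned to it. `copy_lines` is a hypothetical name and this is not how libmotovec does it (no prefetching or interleaving here); `vec_ld` and `vec_st` are the `<altivec.h>` intrinsics, with a plain memcpy fallback where AltiVec is not available.

```c
#include <stddef.h>
#include <string.h>

#ifdef __ALTIVEC__
#include <altivec.h>
#endif

#define LINE_BYTES 32  /* assumed cache line size: two 16-byte vectors */

/* Copy n bytes one cache line at a time.
 * Assumes dst and src are 32-byte aligned and n is a multiple of 32. */
static void copy_lines(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t off = 0; off < n; off += LINE_BYTES) {
#ifdef __ALTIVEC__
        /* Two vector loads cover the whole source line... */
        vector unsigned char a = vec_ld(0, src + off);
        vector unsigned char b = vec_ld(16, src + off);
        /* ...and two vector stores cover the whole destination line. */
        vec_st(a, 0, dst + off);
        vec_st(b, 16, dst + off);
#else
        memcpy(dst + off, src + off, LINE_BYTES);
#endif
    }
}
```

The integer equivalent needs eight 32-bit loads and eight stores for the same line, which is where the pressure on the LSU comes from.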
Quote:
Its a bit odd that both Linux and GLIBC have to have memcpy functions - it would be better if these for performance so important functions would be only in one place.
Yes, you would have thought it would work better if glibc memcpy() used Linux's memcpy() syscall (this is how anyone would have designed it...), but then you do not get to use the FPU or AltiVec in kernelspace, or perhaps cannot utilize DMA directly from userspace. libc memcpy operates on virtual addresses, kernel memcpy operates on physical, no? You have to trade off some offloads for convenience.