As we mentioned memcpy in the previous post, perhaps now is the right moment to take a diversion from the core subject of the discussion and talk a bit about memcpy. IIRC, the ancient Sumerians had a saying along the lines of 'even if a man lived a good life, one day he'd have to face memcpy'.
The fastest (non-builtin) memcpy I've come across so far on the iMX515 is the one from Android's bionic libraries - I guess Google got fed up with the stock (e)glibc version, which is, well, a last resort for moving data around on a Cortex platform. Android's version, in contrast, uses NEON loads/stores, empirically-tuned read-prefetch patterns and all that jazz - overall a very reasonable memcpy effort.
Unfortunately, even that well-designed routine is not quite optimal under certain conditions - namely, when copying relatively small amounts of data that fit in the L1 d-cache (32KB on the iMX515 - the maximum supported by the Cortex-A8), particularly when the destination of those data happens to already reside in L2 (a unified cache, 256KB on the iMX515). Under such conditions the performance you'd normally get from Android's memcpy is as if data were being moved around in L2 alone, without much help from L1. But why? Did we not deliberately specify that our data fit in L1? Yes, we did, and yet that does not spare us those L1 write-misses. Wait, what write-misses? The answer is simple (and I was blissfully oblivious to it until last week) - the Cortex-A8's L1 d-cache does not operate in a write-allocate fashion. What's the issue with that? Well, it's a tad counter-intuitive, and most CPUs don't do it this way, so some of us may not have encountered it before. The A8's L1 d-cache works in write-back, but not write-allocate mode: writes to an already-cached location complete in the cache without stalling the pipeline, but a write that misses the cache does not cause a line to be allocated and filled from memory. A write-allocate cache, by contrast, would allocate a line for the missed location in anticipation of further accesses in its vicinity. The A8 does no such thing for L1 - it treats our writes as 'to cache if lucky, but not my problem otherwise'. As a result, for memory locations that were not in L1 beforehand, and which are accessed in a write-only streaming manner (which is exactly the case for the destination of memcpy), the A8 serves us a constant stream of L1 write-misses. Oh goodies, we just lost our L1 for writing!
Luckily, the solution is equally simple - we need to revert to 'manual control' and instruct the CPU to cache the locations we are about to write to. For Android's bionic memcpy, that can be achieved with a blunt 'prefetch destination' at the start of the main copy loop:
--- really_tiny_libc/memcpy.S 2010-09-01 18:15:32.000000000 -0500
+++ ../../really_tiny_libc/memcpy.S 2010-09-19 11:31:01.000000000 -0500
@@ -98,6 +98,7 @@
 	pld	[r1, #(PREFETCH_DISTANCE)]
 1:	/* The main loop copies 64 bytes at a time */
+	pld	[r0]
 	vld1.8	{d0 - d3}, [r1]!
 	vld1.8	{d4 - d7}, [r1]!
 	pld	[r1, #(PREFETCH_DISTANCE)]
The results speak for themselves (red is the original, blue is the tweaked version):
[Chart: Android memcpy: measured read+write bandwidth - by darkblu, on Flickr]
[Chart: Android memcpy: inferred one-way bandwidth - by darkblu, on Flickr]
Please note that the second chart is hypothetical, trying to give a naive answer to the question 'Good, but what if the access was one-way, and not read+write?' - to which the chart bluntly doubles the actual measured results ; )
So why not just patch Android's memcpy as described above and enjoy eternal memcpy bliss on the A8? Well, firstly, the patch is detrimental in scenarios where the destination is not already in L2 - there our impromptu prefetch quickly becomes prohibitively expensive (as seen on the charts), since it no longer operates from L2. Secondly, it's hardly worth the effort, as we don't often get to copy data to a destination that is already in L2 - perhaps when packing scattered data into a single container, but other than that - not much. And lastly, we really should try to be good citizens and refrain from relying too much on memcpy, for our own sake.