All times are UTC - 6 hours




PostPosted: Thu Apr 10, 2008 6:05 am 
Hello,

We did general benchmarking of the V4m 54455 dev board to get a better understanding of the overall performance of the V4m. The results give some information, but at the same time they lead to some more questions. Here are the results:



General System Benchmark result
Code:
-------------------------------------------
Processor & Memory Performance Bench v4.20
-------------------------------------------
Stop all program before the test.
Do not use the computer during the test.
The test will run some minutes, please be patient.
Total memory required = 4.2 MB.
Calibration loops: 8
-------------------------------------------
Comparing different CPU functions:
Results are in million instructions per sec.
Higher value is faster.
CPU-Benchmark      2MB   16KB    4KB    1KB
-------------------------------------------
addi             251.3  251.3  251.3  251.3
shift            293.2  293.2  251.3  251.3
mix              439.8  439.8  439.8  439.8
mul               67.7   67.7   67.7   65.2
bra-un            41.9   42.9   40.9   41.9
bra-pre          117.3  117.3  117.3  109.9
bsr               13.3   13.4   13.3   13.2
nop               45.1   44.0   44.0   44.0
-------------------------------------------
Measuring memory latency:
Result is Million random accesses per sec.
Higher value is faster.
Memory Latency     2MB   16KB    4KB    1KB
-------------------------------------------
random read        1.0
-------------------------------------------
Measuring memory throughput:
Results are in MB/sec. Higher value is faster.
Memory 2 Memory
Alignment 0-0      2MB   16KB    4KB    1KB
-------------------------------------------
glibc memcpy      67.7   67.7   66.4   56.7
read 8            81.8   74.9   74.9   69.0
read 16           95.1   95.1   95.1   88.0
read 32          121.3  121.3  117.3  109.9
read 32x4        121.3  121.3  121.3  106.6
read 32x4B       121.3  121.3  117.3  106.6
write 8           13.1   13.2   13.2   13.0
write 16          26.1   26.1   26.1   25.5
write 32          51.0   51.0   51.0   48.9
write 32x4        51.7   51.0   51.0   48.9
write 32x4B      185.2  185.2  175.9  153.0
copy 8            23.3   23.3   23.3   22.0
copy 32           67.7   67.7   67.7   57.7
copy 32x4         65.2   65.2   65.2   55.8
copy 32x4B       117.3  121.3  117.3   90.2
-------------------------------------------
Cache 2 Cache                             
Alignment 0-0      2MB   16KB    4KB    1KB
-------------------------------------------
glibc memcpy      67.7  106.6  106.6  103.5
read 8            74.9   92.6  140.7  135.3
read 16           92.6  125.7  219.9  219.9
read 32          121.3  175.9  586.4  502.6
read 32x4        121.3  185.2  703.7  703.7
read 32x4B       121.3  185.2  879.6  879.6
write 8           13.2   13.3   13.3   13.3
write 16          26.1   26.7   26.7   26.7
write 32          51.0   53.3   53.3   52.5
write 32x4        51.0   53.3   53.3   53.3
write 32x4B      185.2  207.0  207.0  207.0
copy 8            23.1   23.8   26.7   26.7
copy 32           67.7  100.5  106.6  103.5
copy 32x4         65.2  103.5  106.6  103.5
copy 32x4B       117.3  439.8  390.9  390.9
------------------------------------------- 


Those interested can find the source of the bench here:
bench.c
bench68k_test.s
Linux CF executable: benchcf

Quick analysis:

addi and shift show that the V4m can in general issue one integer instruction per clock.

mix shows that under some circumstances the V4 is able to execute 2 integer instructions per clock.

mul shows that the V4 needs 4 clocks for a normal integer multiplication.

bra-un and bra-pre show that correctly predicted branches are quite fast.

Overall the V4m seems to be a nice embedded CPU.
At first glance, clock for clock, the instruction unit of the V4 is more powerful than a 68040's.
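To illustrate what the "mix" figure suggests, here is a hypothetical sketch (my own, not code from bench.c; the function name and constants are invented) of the kind of loop where independent integer adds and shifts can be paired by a dual-issue core, while a single-issue core needs one clock per instruction:

```c
#include <stdint.h>

/* Hypothetical sketch (not from bench.c) of a "mix" style kernel:
 * the add and the shift have no data dependency on each other, so a
 * dual-issue integer unit can execute them in the same clock. */
uint32_t mix_kernel(uint32_t iterations)
{
    uint32_t a = 1, b = 1;
    for (uint32_t i = 0; i < iterations; i++) {
        a += 3;         /* integer add, independent of b */
        b >>= 1;        /* integer shift, can issue alongside the add */
        b |= a & 1;     /* keep b live so it is not optimized away */
    }
    return a + b;
}
```

Timing many iterations of such a kernel and dividing instructions executed by elapsed clocks gives MIPS figures of the kind shown in the table above.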


Latency
Random read = 1.0
260 clocks for one random memory read seems quite slow to me. I wonder what causes this latency.
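For reference, a random-read latency test is usually a dependent pointer chase over a shuffled cycle. A minimal sketch of the idea (my own illustration, assuming a working set like the 2 MB column above; this is not the actual bench.c code):

```c
#include <stdlib.h>

/* Chase a chain of indices; every load depends on the previous one,
 * so the achieved rate measures pure latency, not bandwidth. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];       /* dependent load: no overlap possible */
    return p;
}

/* Link all entries into one pseudo-random cycle so that prefetchers
 * and the cache get no help from the access pattern. */
void build_cycle(size_t *next, size_t n)
{
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        next[order[i]] = order[(i + 1) % n];
    free(order);
}
```

At the roughly 266 MHz core clock of the 54455, 1.0 million dependent accesses per second works out to the ~260 clocks per access mentioned above.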


Memory throughput:
The read performance looks good to me.

The write performance is very low.
It looks suspicious that the write performance when working on small cache-able blocks is the same as the normal memory write performance; normally the CPU should be much faster in this test. I wonder how the cache is set up.

Could it be that the cache is not running in "copy back" but in "write through" mode?
Does someone know which mode it is in?

Could someone explain the reason for using "write through"?

Cheers
Gunnar


Last edited by gunnar on Thu Apr 10, 2008 8:01 am, edited 1 time in total.

PostPosted: Thu Apr 10, 2008 6:46 am 
Gunnar,

I haven't looked at the code yet, but in the bottom part of your post you are asking about the cache setup. What init code did you use? In what environment are you running these tests? Linux, or bare metal?

-JWW


PostPosted: Thu Apr 10, 2008 6:50 am 
Just tried to grab the files... the links do not appear to work. FYI

-JWW


PostPosted: Thu Apr 10, 2008 8:04 am 
weiljw wrote:
Just tried to grab the files...Links do not appear to work. FYI

-JWW


Sorry, my fault: a typo in the URL.
I have corrected the links and added a download link to the compiled Linux CF executable.

I executed the test on the Linux that is installed on the CF dev boards.

John, many thanks for looking into this!
I'm very curious to understand the setup of the memory and cache here.

Cheers
Gunnar


PostPosted: Thu Apr 10, 2008 8:55 am 
Understanding the Coldfire Cache

Thinking about the data cache of the CF brings me to another question.
The V4 handbook describes the copyback cache behavior as follows:

Quote:

CFV4ebook - 8.7.4.3 Copyback Mode (Data Cache Only)
...
If a byte, word, longword, or line write access misses in the cache, the required cache line is read from memory, thereby
updating the cache.
...


It says that even for a line write access which misses in the cache, the required cache line is read from memory. Is this a typing error in the manual, or does it mean that the CF will fetch a line from memory even though it knows it will completely overwrite it in the next step? Or am I just misunderstanding this?

Kind regards,

Gunnar


PostPosted: Sat Apr 12, 2008 4:20 am 
gunnar wrote:
It says that even for a line write access which misses in the cache, the required cache line is read from memory. Is this a typing error in the manual, or does it mean that the CF will fetch a line from memory even though it knows it will completely overwrite it in the next step? Or am I just misunderstanding this?


As I understand it, a line write access may be made even if only a longword of that cache line changed. In the event that the other 3 longwords in the cache line have changed, the cache subsystem has to merge the two together in order to stay coherent.

It's all down to the way most caches work; they are rarely designed to load or store less than a cache line (this is the whole point of organising it in 16, 32 or 64 byte chunks in the first place). Also a processor usually has no idea what real memory is; everything is interfaced behind at least one level of cache (and turning the caches off simply means every cache access is a miss :) and some kind of platform bus - so for the processor to ever work on memory at all, it needs to make sure the data in the cache matches that in real memory, at any cost, simply for coherency's sake.
Matt Sealey, Genesi USA Inc.
Product Development Analyst


PostPosted: Sat Apr 12, 2008 6:45 am 
Neko wrote:
As I understand it, a line write access may be made even if only a longword of that cache line changed.


The point is that the Coldfire V4 can distinguish byte/word/longword writes from full line writes.
The Coldfire has an optimization for "full line writes" when using write-through cache mode.
A "full line write" is a write that will overwrite all 16 bytes of the cache line anyway. To create a full line write, Freescale recommends using the MOVEM instruction.

As far as I understand it, the Coldfire cache operates like this:

Write to memory address in Write through mode:
----------------------------------------------
(In write through the data is written to memory directly - data will NOT be fetched into the on-chip cache first)

Byte write (direct byte memory write of this data)
Word Write (direct word memory write of this data)
Long word write (direct longword memory write of this data)

Line write (line writes are generated by the MOVEM instruction):
In a line write the 16 bytes are burst out.

If you run with DDR2 memory then any access that does not burst is rather slow.
Some clarification on how the Coldfire DDR2 memory interface works would be appreciated.
It's clear that using line writes greatly improves performance.
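As a sketch of what such a full-line copy looks like (my own illustration with invented names, not code from bench68k_test.s):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy whole 16-byte lines, assuming the V4's cache line size.
 * Each iteration reads and writes one full line as four longwords.
 * Hand-coded for ColdFire this is "movem.l (a0),d0-d3" followed by
 * "movem.l d0-d3,(a1)", which the V4 can recognize as a full-line
 * write and burst out in one go instead of four single beats. */
void copy_lines(uint32_t *dst, const uint32_t *src, size_t lines)
{
    while (lines--) {
        uint32_t d0 = src[0], d1 = src[1], d2 = src[2], d3 = src[3];
        dst[0] = d0;
        dst[1] = d1;
        dst[2] = d2;
        dst[3] = d3;
        src += 4;
        dst += 4;
    }
}
```

The source and destination need to be 16-byte aligned for the line-write recognition to apply.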



Write to memory address in COPY BACK mode:
----------------------------------------------

Copy back always bursts a whole cache line in and out.
So if you alter one byte in memory which is not yet cached, the cache line gets burst in and altered in the on-chip cache. The CPU will burst out the content of the altered cache line when it needs the line for something else.

COPY BACK is in most cases MUCH more efficient than write through. I'm a bit puzzled that this Linux does not operate in COPY BACK mode. It would be nice to learn if there is a reason for this.
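A toy accounting model (entirely my own, not a description of the real Coldfire cache controller) makes the difference concrete: count external bus beats for N longword stores that all land in the same 16-byte line.

```c
#include <stddef.h>

typedef struct { size_t beats; } bus_t;   /* external bus beat counter */

/* Write through: every store is pushed straight out to memory. */
size_t write_through(bus_t *bus, size_t stores)
{
    bus->beats += stores;
    return bus->beats;
}

/* Copy back: at most one 4-longword burst in (on the initial miss)
 * and one burst out (when the dirty line is evicted); the stores
 * themselves stay on chip. */
size_t copy_back(bus_t *bus, size_t stores, int line_already_cached)
{
    (void)stores;                 /* all stores hit the on-chip cache */
    if (!line_already_cached)
        bus->beats += 4;          /* burst the line in */
    bus->beats += 4;              /* burst the dirty line out later */
    return bus->beats;
}
```

For 100 stores to one line the model charges 100 beats in write-through mode but only 8 in copy-back mode, which is why identical write figures for small cacheable blocks and main memory look like write-through behavior.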


The question that I had is how the CPU handles it when it recognizes a "full line write", as for example created by a MOVEM instruction.
Reading the line in completely only to then completely overwrite it is of course not efficient; I would say this could be regarded as a bug. The question we have is: is this a misprint or a misunderstanding of the manual, or does the V4 have a deficiency here?


PostPosted: Sat Apr 12, 2008 7:19 am 
gunnar wrote:
(In write through the data will be written to the memory directly - Data will NOT be fetched to on chip cache first)


Right, but the data in the cache is always valid at this point, during a write-through cache operation. The processor writes it to the cache and then the cache subsystem immediately pushes it back out to memory.

Copy-back may wait. In the event that it waits and in the meantime data has been changed, it *must* refresh the cache line with the contents of memory in order to maintain coherency.

The cache will prioritize absolutely the coherency of data in the cache (since all memory access, read or write, has to live in the cache at some time) but it may not be too concerned with the coherency of data in memory.

The relevant snippet (because it's pretty much the same thing) from the PowerPC 32-bit Programming Environments manual;

Quote:
5.2.4.1.1 Pages Designated as Write-Through

When a page is designated as write-through, store operations update the data in the cache and also update the data in main memory. The processor writes to the cache and through to main memory. Load operations use the data in the cache, if it is present. In write-back mode, the processor is required only to update data in the cache. The processor may (but is not required to) update main memory. Load and store operations use the data in the cache, if it is present.

The data in main memory does not necessarily stay consistent with that same location’s data in the cache. Many implementations automatically update main memory in response to a memory access by another device (for example, a snoop hit). In addition, dcbst and dcbf can explicitly force an update of main memory.

The write-through attribute is meaningless for locations designated as caching-inhibited.


The actual implementation really is irrelevant; I think the PowerPC explanation is a lot clearer, though.

If it's a bug to do this, then this method of cache management has been broken for 18 years in the m68k line. Do you really think this is true? REALLY?
Matt Sealey, Genesi USA Inc.
Product Development Analyst


PostPosted: Sat Apr 12, 2008 7:37 am 
Neko,

But please do NOT mix PowerPC with CF here.
Coldfire and PowerPC are not the same!
The behavior of "write through" on the CF is different from what your post describes. Please be so kind and let us refer to CF information in this discussion, to prevent causing confusion.

Quote:
If it's a bug to do this, then this method of cache management has been broken for 18 years in the m68k line. Do you really think this is true? REALLY?


Yes.
To avoid this behavior Motorola added the MOVE16 instruction to the 68k instruction set.

The CF does not support MOVE16 anymore, but it compensates for this by supporting burst recognition on the MOVEM instruction.
I think that losing MOVE16 is no loss, as MOVEM is more powerful on the Coldfire now.

My question is only whether this condition is handled correctly by the Coldfire's MOVEM, or whether this is an oversight in the current V4.


PostPosted: Sat Apr 12, 2008 10:18 am 
gunnar wrote:
But please do NOT mix PowerPC with CF here. Coldfire and PowerPC is not the same!


A cache is a cache is a cache. The basic operation of a write-through cache and a write-back cache was invented decades ago, and it has not changed. Write-back caches have ALWAYS had this caveat of requiring a little more bandwidth, on the basis that this operation is rare enough compared to the massive speed gains of not having to write through that it speeds everything up.

The actual implementation as logic may be slightly different but the fundamental operation - something you might see taught in Computer Science courses - is pretty much identical unless you want to study in-depth the benefits of the many different ways you can implement coherence protocols. At the high level most people work at, it is identical.

Try not to get too close to the metal. It won't help your code.

gunnar wrote:
My question only is if this condition is handled by the movem of the Coldfire correctly or if this is an oversight in the current V4.


Does this really, really impact your plans to emulate the 68000 opcodes not supported on ColdFire?

I think there is far more to do than test memory bandwidth and niggle over cache handling.

I really do not think that the figures you got from the simple benchmark are that bad. They are certainly far higher than the ones you would get on an original m68k processor. I think this shows you have a lot of room to soak up any overheads.

Wouldn't it be good to start trying to emulate a certain set of opcodes now? Perhaps, and I am not joking, movem.l is a good start, try implementing a handler which reimplements the extra addressing modes (decrements etc.) and see how much performance you can get out of it.
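For what it's worth, the core of such a handler might look like the following. This is a hypothetical sketch with invented names, not anyone's real emulation code; a real handler would fetch the opcode and register file from the exception stack frame. It emulates the 68000 store form "movem.l <list>,-(An)", which ColdFire's MOVEM does not support.

```c
#include <stdint.h>

/* For predecrement stores the 68000 reverses the register mask:
 * bit 0 means A7 and bit 15 means D0. Registers end up with the
 * lowest-numbered one at the lowest address. */
void emulate_movem_predec(uint32_t regs[16], /* D0-D7 then A0-A7 */
                          unsigned an,       /* index of An, 0-7 */
                          uint16_t mask,     /* mask word from the stream */
                          uint32_t *mem)     /* flat memory, longword indexed */
{
    uint32_t addr = regs[8 + an];
    for (int bit = 0; bit < 16; bit++) {
        if (mask & (1u << bit)) {
            addr -= 4;
            mem[addr / 4] = regs[15 - bit];  /* reversed mask order */
        }
    }
    regs[8 + an] = addr;                     /* An gets the final address */
}
```

Dispatching to a routine like this from the unsupported-instruction exception is the usual trap-and-emulate pattern.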
Matt Sealey, Genesi USA Inc.
Product Development Analyst


Last edited by Neko on Sat Apr 12, 2008 12:20 pm, edited 2 times in total.

PostPosted: Sat Apr 12, 2008 11:15 am 
Neko wrote:
gunnar wrote:
My question only is if this condition is handled by the movem of the Coldfire correctly or if this is an oversight in the current V4.


Does this really, really impact your plans to emulate the 68000 opcodes not supported on ColdFire?

I think there is far more to do than test memory bandwidth and niggle over cache handling.

I really do not think that the figures you got from the simple benchmark are that bad. They are certainly far higher than the ones you would get on an original m68k processor. I think this shows you have a lot of room to soak up any overheads.

Wouldn't it be good to start trying to emulate a certain set of opcodes now? Perhaps, and I am not joking, movem.l is a good start, try implementing a handler which reimplements the extra addressing modes (decrements etc.) and see how much performance you can get out of it.


Hi Gunnar,

Did you have a look at this?

http://www.microapl.co.uk/Porting/ColdF ... 8KLib.html

Czk.


PostPosted: Mon Apr 14, 2008 3:27 am 
Matt, Gunnar: although you might think that your discussion is getting heated, I like it a lot.
I struggle to understand most of it, but this is top-notch technical debate!


PostPosted: Mon Apr 14, 2008 4:53 am 
jcmarcos wrote:
I struggle to understand most things...


Hi Juan,
please ask if you have any questions.

