All times are UTC-06:00




PostPosted: Mon Apr 14, 2008 8:42 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Do you think it makes sense to post some C code snippets here, together with the ASM instructions GCC generates for them, as examples of how GCC operates?

The examples could be used for brainstorming and to identify patterns in the behavior of GCC.

Cheers
Gunnar


PostPosted: Mon Apr 14, 2008 8:55 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Do you think it makes sense to post some C code snippets here, together with the ASM instructions GCC generates for them, as examples of how GCC operates?

The examples could be used for brainstorming and to identify patterns in the behavior of GCC.

Cheers
Gunnar
Hi Gunnar,

However important I think the discussion here is, it is much more important that it happens in the proper place, i.e. in the GCC bug tracker and on the GCC mailing lists. Problems are much more likely to be fixed there by the right people, and even the Freescale/CodeSourcery guys are probably following those. Of course, if you don't mind, it would be interesting to read these here as well :)

Regards

Konstantinos


PostPosted: Mon Apr 14, 2008 7:39 pm 
Offline
Site Admin

Joined: Fri Sep 24, 2004 1:39 am
Posts: 1589
Location: Austin, TX
Quote:
However important I think the discussion here is, it is much more important that it happens in the proper place, i.e. in the GCC bug tracker and on the GCC mailing lists. Problems are much more likely to be fixed there by the right people, and even the Freescale/CodeSourcery guys are probably following those. Of course, if you don't mind, it would be interesting to read these here as well :)
I think the discussion is very relevant here :)

However, I don't think compiler performance should be Gunnar's goal here. We're talking about mimicking a 68k processor on ColdFire for a specific application. At this point a 200MB/s bus bandwidth for read is about 3x more than he would expect from a 68060 with EDO SDRAM with a 60ns access time. Certain versions of the GCC compiler generate adequate code, if not high-performance code (later versions, seemingly, do not). We know CodeWarrior, DIAB and Green Hills do better. In the end, mimicking the m68k does not depend on the compiler but on the technique used, which, as he has very competently explained in his project and elsewhere, could be one of 3 or 4 or 5 different approaches: perhaps a QEMU/UAE style virtual machine, or an instruction trap mechanism as with the 68000 fpsp or the 68040/68060.library mechanisms on AmigaOS, or something like the ShapeShifter/Sheepshaver MacOS emulation on the Amiga, where 90% of the instructions run natively but the important differences are emulated in order to keep the operating systems separate.
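One way to read the "about 3x" figure, as a rough sketch only: assume a 32-bit data bus that completes one transfer per 60 ns access, with no burst mode and no cache effects. The post does not state this model, so the bus width, the helper names and the formula below are all assumptions.

```c
#include <assert.h>

/* Peak bandwidth in MB/s (1 MB = 10^6 bytes), ASSUMING the bus completes
 * one transfer of bus_bytes every access_ns nanoseconds. This is a
 * back-of-the-envelope model, not a measurement. */
static double peak_mb_per_s(unsigned bus_bytes, double access_ns)
{
    assert(access_ns > 0.0);
    /* bytes per ns * 1e9 gives bytes per second; / 1e6 gives MB per second */
    return bus_bytes * 1000.0 / access_ns;
}

/* How many times larger `measured` is than `baseline`. */
static double speedup(double measured, double baseline)
{
    assert(baseline > 0.0);
    return measured / baseline;
}
```

Under these assumptions, peak_mb_per_s(4, 60.0) comes out at roughly 66.7 MB/s, and 200 MB/s is then about three times that.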

This is where the important work lies. Redefining the operation of GCC4 is a waste of time. You can code the emulator now, find the best method, and fix GCC later so that it enables compilation and linking of the best method with the least amount of manual hacking. But compiler reworks are the last resort - until, that is, you hit a compiler bug that refuses to generate working code or causes exceptions or doesn't even compile, THEN it is something worth fixing :)

_________________
Matt Sealey


PostPosted: Mon Apr 14, 2008 11:52 pm 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
I agree that the discussion about Gunnar's project and any details around it is relevant here. My point was about GCC bugs that would get too technical. It's not that they *shouldn't* be here, it's that they should be on the GCC bug tracker *too*! :)

After all, for any project most bugs are found somewhere other than the bug tracker or mailing lists, and only then filed upstream as bug reports.

In any case, don't mind me, please continue, it was an interesting read anyway :)

Konstantinos


PostPosted: Tue Apr 15, 2008 2:22 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi Matt,
Quote:
At this point a 200MB/s bus bandwidth for read is about 3x more than he would expect from a 68060 with EDO SDRAM with a 60ns access time.
So far I have only measured 120 MB/sec read for the V4m.
You can find a comparison of 680x0 and V4m results here:
http://www.powerdeveloper.org/forums/vi ... 0621#10621

I hope this helps you.


 Post subject:
PostPosted: Tue Apr 15, 2008 3:25 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
One GCC example:

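C source:
Code:
#include <stddef.h>   /* size_t */

/* Copies size/16 blocks of four 32-bit words each.
   A remainder of size % 16 bytes is not copied. */
void * copy_32x4a(void *destparam, const void *srcparam, size_t size)
{
    int *dest = destparam;
    const int *src = srcparam;
    int size32;
    size32 = size / 16;
    for (; size32; size32--) {
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
    }
    /* Note: no return statement, although the function is declared void *.
       The listing below was generated from the function in this form. */
}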
Compile option: m68k-linux-gnu-gcc -mcpu=54455 -msoft-float -o example -Os -fomit-frame-pointer example.c
We use -Os to focus on compact code.

Generated code:
Code:
04: 202f 000c movel %sp@(12),%d0
08: 226f 0004 moveal %sp@(4),%a1
0c: 206f 0008 moveal %sp@(8),%a0
10: e888 lsrl #4,%d0
12: 6022 bras 36
14: 2290 movel %a0@,%a1@
16: 2368 0004 0004 movel %a0@(4),%a1@(4)
1c: 2368 0008 0008 movel %a0@(8),%a1@(8)
22: 2368 000c 000c movel %a0@(12),%a1@(12)
28: d3fc 0000 0010 addal #16,%a1
2e: d1fc 0000 0010 addal #16,%a0
34: 5380 subql #1,%d0
36: 4a80 tstl %d0
38: 66da bnes 14
3a: 4e75 rts
Code length produced by GCC = 56 bytes
Length of work loop = 9 instructions, 38 bytes


Expected code:
Code:
04: 202f 000c movel %sp@(12),%d0
08: 226f 0004 moveal %sp@(4),%a1
0c: 206f 0008 moveal %sp@(8),%a0
10: e888 lsrl #4,%d0
12: 670c beqs 20
14: 22d8 movel %a0@+,%a1@+
16: 22d8 movel %a0@+,%a1@+
18: 22d8 movel %a0@+,%a1@+
1a: 22d8 movel %a0@+,%a1@+
1c: 5380 subql #1,%d0
1e: 66f4 bnes 14
20: 4e75 rts
Expected code length = 30 bytes
Length of work loop = 6 instructions, 12 bytes
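The byte counts can be double-checked by summing the per-instruction sizes, which you get from the address column of a listing (next address minus current address). A small sketch; the array and function names are made up for illustration, and the second array simply assumes six 2-byte instructions (four post-increment moves, subq, bne).

```c
#include <assert.h>
#include <stddef.h>

/* Byte sizes of the instructions in the GCC-generated work loop
 * (addresses 14..38 in the listing): one 2-byte move, three 6-byte
 * moves, two 6-byte adda.l, then subq.l, tst.l and bne.s. */
static const unsigned gcc_loop[] = { 2, 6, 6, 6, 6, 6, 2, 2, 2 };

/* Assumed sizes for a loop of four post-increment moves, subq.l, bne.s. */
static const unsigned postinc_loop[] = { 2, 2, 2, 2, 2, 2 };

/* Sums the instruction sizes of one loop body. */
static unsigned loop_bytes(const unsigned *sizes, size_t count)
{
    unsigned total = 0;
    assert(sizes != NULL);
    for (size_t i = 0; i < count; i++)
        total += sizes[i];
    return total;
}
```

With these tables, loop_bytes(gcc_loop, 9) gives 38 and loop_bytes(postinc_loop, 6) gives 12, so each pass fetches 26 fewer code bytes for the same 16 bytes copied.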



Issue 1:
Why does GCC not use the condition codes already set by the preceding 68k instruction (subq.l), and instead generate an unneeded tst.l?

Issue 2:
Why does GCC not use the much more efficient (An)+ addressing mode, and instead use d(An) mode plus an extra add instruction per pointer?
A move.l (An)+,(Am)+ instruction is 2 bytes instead of 6 bytes.
And move.l (An)+,(Am)+ does not need the two extra instructions to increment the pointers.
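At the C level both styles do exactly the same thing, so the choice of addressing mode is entirely up to the compiler. A small sketch with both forms written out by hand; the function names are made up, and nothing here says which instructions a given GCC version emits for either form.

```c
#include <assert.h>
#include <stddef.h>

/* Displacement style: indexed accesses, then bump both pointers once.
 * This mirrors the d(An) code plus the two pointer additions. */
static void copy_disp(int *dest, const int *src, size_t blocks)
{
    assert(dest != NULL && src != NULL);
    for (; blocks; blocks--) {
        dest[0] = src[0];
        dest[1] = src[1];
        dest[2] = src[2];
        dest[3] = src[3];
        dest += 4;
        src += 4;
    }
}

/* Post-increment style: every access advances its pointer.
 * This mirrors the (An)+ addressing mode. */
static void copy_postinc(int *dest, const int *src, size_t blocks)
{
    assert(dest != NULL && src != NULL);
    for (; blocks; blocks--) {
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
    }
}
```

Calling both with the same source and block count fills the destination buffers identically, which is why the post-increment encoding is a legal choice for the original source.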

Issue 3:
Assume GCC has decided to increment a pointer manually.
Why does it use adda.l with a 32-bit immediate to increment the pointer?
LEA should be the better choice here, as lea 16(An),An is 2 bytes shorter than adda.l #16,An.
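For reference, a sketch of the two opcode words involved, built from the 68k bit layout. The adda.l results can be compared with the d1fc/d3fc words in the generated listing; the helper names and the byte-size constants are just for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode word of ADDA.L #imm32,An: 1101 rrr 111 111100.
 * Two extension words carry the 32-bit immediate: 6 bytes in total. */
static uint16_t adda_l_imm_opcode(unsigned an)
{
    assert(an < 8);
    return (uint16_t)(0xD1FCu | (an << 9));
}

/* Opcode word of LEA (d16,An),An: 0100 rrr 111 101 rrr.
 * One extension word carries the 16-bit displacement: 4 bytes in total. */
static uint16_t lea_d16_opcode(unsigned an)
{
    assert(an < 8);
    return (uint16_t)(0x41E8u | (an << 9) | an);
}

/* Total instruction lengths, opcode word plus extension words. */
enum { ADDA_L_IMM_BYTES = 6, LEA_D16_BYTES = 4 };
```

For %a0 and %a1 this gives d1fc/d3fc for adda.l and 41e8/43e9 for lea, so replacing both pointer increments would save 4 bytes per loop.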


Last edited by gunnar on Tue Apr 15, 2008 4:06 am, edited 1 time in total.

 Post subject:
PostPosted: Tue Apr 15, 2008 4:04 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Second example:

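C source:
Code:
#include <stddef.h>   /* size_t */

/* Writes the value 1 to size/16 blocks of four 32-bit words each.
   srcparam is unused. */
void * write_32x4(void *destparam, const void *srcparam, size_t size)
{
    int value = 1;
    int *dst = destparam;
    size = size / 16;
    for (; size; size--) {
        *dst++ = value;
        *dst++ = value;
        *dst++ = value;
        *dst++ = value;
    }
    /* Note: no return statement, although the function is declared void *.
       The listing below was generated from the function in this form. */
}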

Generated output
Code:
<write_32x4>:
0a: 202f 000c movel %sp@(12),%d0
0e: 206f 0004 moveal %sp@(4),%a0
12: e888 lsrl #4,%d0
14: 601c bras 32
16: 20bc 0000 0001 movel #1,%a0@
1c: 7201 moveq #1,%d1
1e: 2141 0004 movel %d1,%a0@(4)
22: 2141 0008 movel %d1,%a0@(8)
26: 2141 000c movel %d1,%a0@(12)
2a: d1fc 0000 0010 addal #16,%a0
30: 5380 subql #1,%d0
32: 4a80 tstl %d0
34: 66e0 bnes 16
36: 4e75 rts
Generated code length = 46 bytes
Length of work loop = 9 instructions, 32 bytes


The expected result would be
Code:
<write_32x4>:
0a: 202f 000c movel %sp@(12),%d0
0e: 206f 0004 moveal %sp@(4),%a0
12: 7201 moveq #1,%d1
14: e888 lsrl #4,%d0
16: 670c beqs 24
18: 20c1 movel %d1,%a0@+
1a: 20c1 movel %d1,%a0@+
1c: 20c1 movel %d1,%a0@+
1e: 20c1 movel %d1,%a0@+
20: 5380 subql #1,%d0
22: 66f4 bnes 18
24: 4e75 rts
Expected code length = 28 bytes
Length of work loop = 6 instructions, 12 bytes


Issue 4:
Again we see the unneeded TST instruction.

Issue 5:
The compiler again uses a much bigger and slower addressing mode.
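The size difference between the addressing modes falls straight out of the MOVE.L encoding. Below is a rough sketch of an encoder that covers only the modes that show up in these listings; the enum and function names are invented for illustration, and it is not a general 68k assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Effective-address mode numbers used in the 68k opcode word. */
enum ea_mode {
    EA_DREG    = 0, /* Dn */
    EA_IND     = 2, /* (An) */
    EA_POSTINC = 3, /* (An)+ */
    EA_DISP16  = 5, /* d16(An) */
    EA_IMM     = 7  /* #imm, only valid here with register field 4 */
};

/* MOVE.L opcode word: 0010 | dst reg | dst mode | src mode | src reg. */
static uint16_t move_l_opcode(enum ea_mode src_mode, unsigned src_reg,
                              enum ea_mode dst_mode, unsigned dst_reg)
{
    assert(src_reg < 8 && dst_reg < 8);
    return (uint16_t)(0x2000u | (dst_reg << 9) | ((unsigned)dst_mode << 6)
                      | ((unsigned)src_mode << 3) | src_reg);
}

/* Extension bytes one operand adds after the opcode word. */
static unsigned ea_extra_bytes(enum ea_mode mode, unsigned reg)
{
    assert(mode != EA_IMM || reg == 4);   /* other mode-7 forms not handled */
    if (mode == EA_DISP16)
        return 2;                         /* 16-bit displacement */
    if (mode == EA_IMM)
        return 4;                         /* 32-bit immediate */
    return 0;                             /* register, (An), (An)+ */
}

/* Total length of one MOVE.L instruction in bytes. */
static unsigned move_l_bytes(enum ea_mode src_mode, unsigned src_reg,
                             enum ea_mode dst_mode, unsigned dst_reg)
{
    return 2 + ea_extra_bytes(src_mode, src_reg)
             + ea_extra_bytes(dst_mode, dst_reg);
}
```

For example, move_l_opcode(EA_DREG, 1, EA_DISP16, 0) reproduces the 2141 word from the generated listing at 4 bytes per store, while a store through (An)+ needs only the 2-byte opcode word.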

Issue 6:
The preload of the work value into register D1 is done inside the work loop. This should be done outside of the work loop.

Issue 7:
The compiler decides to put the literal work value #1 into the work register D1, but it does not always use this register: one store uses a literal move.l #1 instead, needlessly increasing the code size by 4 bytes.


Both GCC 4 examples have 9 instructions inside the work loop. Older GCC versions would solve the same task using a work loop of only 6 instructions.
Generally, the new GCC 4 code is bigger and a lot slower than before.


 Post subject:
PostPosted: Tue May 06, 2008 1:25 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
To help improve the code generated by GCC for ColdFire/68k, I've filed bugs 36133, 36134, 36135, and 36136 in the GCC bug tracker.


 Post subject: gcc 3.4?
PostPosted: Tue May 06, 2008 2:24 am 
Offline

Joined: Tue Nov 02, 2004 6:17 am
Posts: 28
Hi Gunnar,

Do you know by chance whether the code generated by GCC 3.4 suffers from the same problems?


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy