All times are UTC-06:00




 Post subject: Eigen port to ARM NEON!
Posted: Wed Mar 03, 2010 1:30 pm

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
http://bitbucket.org/eigen/eigen/change ... af7abc0af/

Here are some results from a matrix addition/multiplication benchmark (sizes 512x512) on the Efika MX:

Scalar:
$ ./bench_gemm.gcc4.4.1cs
eigen cpu 3.84s 0.0699051 GFLOPS (19.27s)
eigen real 3.8469s 0.0697796 GFLOPS (19.2648s)


NEON:
$ ./bench_gemm.gcc4.4.1cs.neon
eigen cpu 0.81s 0.331402 GFLOPS (4.07s)
eigen real 0.813919s 0.329806 GFLOPS (4.07218s)

~4.6x faster...


No comments, apart from one: if NEON is that good -and I think it is-, I don't think I'll miss AltiVec and PowerPC.

UPDATE: The results have been fixed. Apparently the scalar build was compiled without the -mfpu=vfp option, which is needed to actually use the FPU on ARM. ~4.5x faster is more plausible, but still very, very nice :) Sorry for the misunderstanding.
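
As a sanity check on the figures above, here is a small sketch of how the GFLOPS numbers relate to the timings. It assumes the benchmark counts 2*n^3 floating-point operations for one n x n matrix product (one multiply and one add per inner-loop step). That convention is an assumption, not something taken from the bench_gemm source, and the function names are made up for illustration.

```python
def gemm_flops(n: int) -> int:
    """Floating-point operations in one n x n by n x n product,
    assuming the usual 2*n^3 count (n^3 multiplies + n^3 adds)."""
    return 2 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Throughput in GFLOPS for one product that took `seconds`."""
    return gemm_flops(n) / seconds / 1e9


# Timings copied from the post (cpu time, n = 512).
scalar_s = 3.84
neon_s = 0.81

print(f"scalar: {gflops(512, scalar_s):.6f} GFLOPS")
print(f"NEON:   {gflops(512, neon_s):.6f} GFLOPS")
print(f"speedup from the timings: {scalar_s / neon_s:.2f}x")
```

Under that assumption the computed values agree with the reported 0.0699 and 0.331 GFLOPS, and the ratio of the two timings is in the same range as the rounded speedup quoted in the post.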


Last edited by markos on Wed Mar 03, 2010 5:49 pm, edited 2 times in total.

Posted: Wed Mar 03, 2010 1:52 pm
Genesi

Joined: Mon Jan 30, 2006 2:28 am
Posts: 409
Location: Finland
Nice :-)

Great work!


Johan.

_________________
Johan Dams, Genesi USA Inc.
Director, Software Engineering

Yep, I have a blog... PurpleAlienPlanet


Posted: Wed Mar 03, 2010 2:00 pm

Joined: Mon Mar 10, 2008 11:00 am
Posts: 56
Location: Poland/Chelm
Wow, I did not expect this result.
So NEON is not so bad :)

_________________
Past: Pegasos II G4 & Efika
Now : Mac Mini G4 1.5 Ghz & MorphOS 3.1
BlaBla Team Member -> http://blabla.ppa.pl


Posted: Wed Mar 03, 2010 3:07 pm

Joined: Wed Jul 27, 2005 9:20 am
Posts: 242
It would be really interesting to see a broad spectrum benchmark comparison between a 7447 G4 PPC CPU (as used in Pegasos 2) running @ 800MHz (or results recalculated from 1GHz to 800MHz accordingly) and an i.MX515 CPU.

I have no sense whatsoever of the ARM chip's performance (I have never seen or experienced one in action), but maybe it won't be too badly off? I think it would be interesting for more people than me here on Powerdeveloper.org to see a comparison with the Pegasos 2 G4 hardware, which most of us have experience with and can relate to!

Not that raw performance is the key goal of the chip, rather power efficiency, but anyway...


Posted: Wed Mar 03, 2010 4:21 pm

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
It would be really interesting to see a broad spectrum benchmark comparison between a 7447 G4 PPC CPU (as used in Pegasos 2) running @ 800MHz (or results recalculated from 1GHz to 800MHz accordingly) and an i.MX515 CPU.

I have no sense whatsoever of the ARM chip's performance (I have never seen or experienced one in action), but maybe it won't be too badly off? I think it would be interesting for more people than me here on Powerdeveloper.org to see a comparison with the Pegasos 2 G4 hardware, which most of us have experience with and can relate to!

Not that raw performance is the key goal of the chip, rather power efficiency, but anyway...
Tomorrow I will also provide Eigen results from the G4 for comparison. One thing is for certain though: NEON has some really good tricks up its sleeve that are not available in either SSE or AltiVec. For that alone it beats both, IMHO.


Posted: Wed Mar 03, 2010 8:55 pm

Joined: Tue Mar 31, 2009 10:24 pm
Posts: 171
Impressive results, Markos. What are your impressions of this SIMD ISA so far?

ps: don't you miss the permute? ; )


Posted: Thu Mar 04, 2010 3:44 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Impressive results, Markos. What are your impressions of this SIMD ISA so far?

ps: don't you miss the permute? ; )
The ISA is a very complete and orthogonal SIMD approach. It can do many more things than AltiVec or SSE (I especially like the fact that I can split a 128-bit vector into two 64-bit vectors, perform an operation and then combine them back into a 128-bit vector). It can also load/store 4x128-bit vectors at once.
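
To make the split/operate/combine idea concrete, here is a plain-Python model of it, with lists standing in for vector registers. This is only a sketch of the semantics, not NEON code; the helper names are hypothetical, and the comments name the NEON intrinsics (vget_low_f32, vget_high_f32, vcombine_f32, vadd_f32) that each step is meant to mirror.

```python
from typing import List, Tuple

Vec = List[float]


def split128(q: Vec) -> Tuple[Vec, Vec]:
    """Split a 4-lane vector into its low and high 2-lane halves
    (the role of vget_low_f32 / vget_high_f32 on NEON)."""
    assert len(q) == 4
    return q[:2], q[2:]


def combine64(lo: Vec, hi: Vec) -> Vec:
    """Join two 2-lane halves back into one 4-lane vector
    (the role of vcombine_f32 on NEON)."""
    assert len(lo) == 2 and len(hi) == 2
    return lo + hi


def add64(a: Vec, b: Vec) -> Vec:
    """Lane-wise add of two 2-lane vectors (the role of vadd_f32)."""
    return [x + y for x, y in zip(a, b)]


q = [1.0, 2.0, 3.0, 4.0]
lo, hi = split128(q)
# Operate on the 64-bit halves, then rebuild the 128-bit value:
# each half has the other half added to it.
result = combine64(add64(lo, hi), add64(hi, lo))
print(result)
```

The same pattern is handy for reductions, where the two halves are added together before a final pairwise step.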

PS. It has vtbl and vtbx, which do the same kind of thing; I haven't played around with them yet though :)
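
For readers who have not met these instructions, here is a plain-Python model of what a byte table lookup does, which is the job a permute does on AltiVec. It is a sketch of the semantics for a single 8-byte table only, with lists of small integers standing in for byte vectors; the function names simply borrow the instruction names.

```python
from typing import List


def vtbl(table: List[int], index: List[int]) -> List[int]:
    """Table lookup: out[i] = table[index[i]], or 0 when the index
    falls outside the table (the VTBL behaviour)."""
    return [table[i] if 0 <= i < len(table) else 0 for i in index]


def vtbx(dest: List[int], table: List[int], index: List[int]) -> List[int]:
    """Table lookup extension: like vtbl, but an out-of-range index
    leaves the corresponding byte of dest unchanged (the VTBX behaviour)."""
    return [table[i] if 0 <= i < len(table) else d
            for d, i in zip(dest, index)]


table = [10, 11, 12, 13, 14, 15, 16, 17]   # one 8-byte table
reverse = [7, 6, 5, 4, 3, 2, 1, 0]         # indices that reverse the bytes
mixed = [0, 1, 200, 3, 255, 5, 6, 7]       # two indices are out of range

print(vtbl(table, reverse))
print(vtbl(table, mixed))
print(vtbx([99] * 8, table, mixed))
```

The only difference between the two is what happens to out-of-range indices: vtbl writes zero, vtbx keeps the old destination byte.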


Posted: Thu Mar 04, 2010 2:21 pm

Joined: Sat Oct 27, 2007 12:18 pm
Posts: 26
Location: Grenoble, France
Quote:
~4.6x faster...

No comments, apart from one: if NEON is that good -and I think it is-, I don't think I'll miss AltiVec and PowerPC.
Results are impressive but the comment makes me sad ...

I didn't expect NEON was so good ...


Posted: Thu Mar 04, 2010 4:31 pm

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Quote:
~4.6x faster...

No comments, apart from one: if NEON is that good -and I think it is-, I don't think I'll miss AltiVec and PowerPC.
Results are impressive but the comment makes me sad ...

I didn't expect NEON was so good ...
It gets better: new benchmarks after some fine-tuning:

$ ./bench_gemm.gcc4.4.1cs.neon
eigen cpu 2.44s 0.880116 GFLOPS (12.29s)
eigen real 2.44403s 0.878666 GFLOPS (12.2967s)

(compiled with gcc 4.4.1 CodeSourcery)

$ ./bench_gemm.gcc4.5.neon
eigen cpu 2.36s 0.909951 GFLOPS (11.85s)
eigen real 2.36316s 0.908733 GFLOPS (11.8516s)

(compiled with gcc 4.5 experimental)

~12.9x faster. Yes, this time it's real. According to the Eigen developers, we have a theoretical limit of 1.6GFLOPS on the EfikaMX, so we still have a bit of work to do :)


Posted: Fri Mar 05, 2010 6:30 am

Joined: Mon Jan 08, 2007 3:40 am
Posts: 195
Location: Pinto, Madrid, Spain
Quote:
Quote:
~4.6x faster...

No comments, apart from one: if NEON is that good -and I think it is-, I don't think I'll miss AltiVec and PowerPC.
I didn't expect NEON was so good ...
Well, that's progress. You can't always be the best. Modern machines have to be better than older ones. By the way, great to see this progress, Konstantinos!


Posted: Fri Mar 05, 2010 7:53 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Quote:
Quote:
~4.6x faster...

No comments, apart from one: if NEON is that good -and I think it is-, I don't think I'll miss AltiVec and PowerPC.
I didn't expect NEON was so good ...
Well, that's progress. You can't always be the best. Modern machines have to be better than older ones. By the way, great to see this progress, Konstantinos!
AltiVec is still king though; check these results on the G4:

Scalar:
$ ./bench_gemm
eigen cpu 2.65264s 0.809565 GFLOPS (13.283s)
eigen real 2.6532s 0.809394 GFLOPS (13.2863s)

AltiVec:
$ ./bench_gemm
eigen cpu 1.17936s 1.82088 GFLOPS (5.90097s)
eigen real 1.17959s 1.82054 GFLOPS (5.90304s)

But keep in mind that PowerPC support is much better and more mature than ARM support (esp. wrt NEON) and that PowerPC is slightly faster at 1GHz. Theoretically the G4 can do 4GFLOPS at fp math and the iMX515 can do 1.6GFLOPS.
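
One way to put the two platforms on the same footing is to look at how much of the quoted theoretical limit each vectorized run reaches. A small sketch using only numbers posted in this thread (the gcc 4.5 NEON run and the AltiVec run); the dictionary layout and function name are just for illustration.

```python
def fraction_of_peak(measured_gflops: float, peak_gflops: float) -> float:
    """Share of the theoretical limit that a measured result reaches."""
    return measured_gflops / peak_gflops


# name: (measured GFLOPS, quoted theoretical GFLOPS), both from the thread
results = {
    "iMX515 NEON (gcc 4.5)": (0.909951, 1.6),
    "G4 AltiVec": (1.82088, 4.0),
}

for name, (measured, peak) in results.items():
    share = 100 * fraction_of_peak(measured, peak)
    print(f"{name}: {share:.1f}% of the quoted peak")
```

Taking the quoted limits at face value, the NEON build already reaches a somewhat larger share of its limit than the AltiVec build does, even though the G4 is faster in absolute terms.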


Posted: Fri Mar 05, 2010 10:00 am

Joined: Mon Jan 08, 2007 3:40 am
Posts: 195
Location: Pinto, Madrid, Spain
Quote:
PowerPC is slightly faster at 1GHz
Yes, but ARM is smarter, because it always sucks less electrons. Or am I wrong?

Have you seen that initiative to build new high-power ARM CPUs that are NOT targeted at mobile computers? What will happen when they free ("take off the handcuffs") these processors from the power restrictions they've always had?


Posted: Fri Mar 05, 2010 10:29 am

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Yes, but ARM is smarter, because it always sucks less electrons. Or am I wrong?

Have you seen that initiative to build new high-power ARM CPUs that are NOT targeted at mobile computers? What will happen when they free ("take off the handcuffs") these processors from the power restrictions they've always had?
I have remote access to a prototype quad-core ARM Cortex A9 :-P


Posted: Sun Mar 07, 2010 3:46 pm

Joined: Sat Oct 27, 2007 12:18 pm
Posts: 26
Location: Grenoble, France
Quote:
Quote:
PowerPC is slightly faster at 1GHz
Yes, but ARM is smarter, because it always sucks less electrons. Or am I wrong?

Have you seen that initiative to build new high-power ARM CPUs that are NOT targeted at mobile computers? What will happen when they free ("take off the handcuffs") these processors from the power restrictions they've always had?
Interesting link, thanks. Markos, you are lucky, because not many people can see or use a Cortex-A9 these days, even though it was announced years ago (at least 2 years).

jcmarcos: I liked ARM very much because it was small, efficient, easy to play with... But over the years they have added many things that were not planned, and it is sometimes ugly in my opinion. I am afraid to see it go the same way x86 did. But some features are great and it works well.

I work on ARM every day and I sometimes play with low level things.


Posted: Tue Mar 09, 2010 11:35 am

Joined: Tue Mar 09, 2010 10:41 am
Posts: 19
> the iMX515 can do 1.6GFLOPS

The NEON Pipeline has 4 Single Precision FP Multiply Units and 4 Accumulators ... it could handle 4 Floats/Cycle.

So shouldn't this be 3.2 GFLOPS or am I missing something?
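
The two figures differ only in how many floating-point operations per cycle one assumes. A sketch of the arithmetic, assuming an 800 MHz clock for the iMX515 (the clock figure brought up earlier in the thread for the G4 comparison; treat it as an assumption here). The function name is hypothetical.

```python
def peak_gflops(clock_mhz: float, flops_per_cycle: int) -> float:
    """Peak throughput = clock rate x floating-point operations per cycle."""
    return clock_mhz * 1e6 * flops_per_cycle / 1e9


CLOCK_MHZ = 800  # assumption: iMX515 clock, not stated in the post above

for flops_per_cycle in (2, 4):
    peak = peak_gflops(CLOCK_MHZ, flops_per_cycle)
    print(f"{flops_per_cycle} flops/cycle -> {peak:.1f} GFLOPS")
```

At that clock, 1.6 GFLOPS corresponds to 2 operations per cycle and 3.2 GFLOPS to 4 per cycle, so the question comes down to which per-cycle count the NEON unit can actually sustain.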


PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy