Power Developer :: Eigen port to ARM NEON!

Author:	markos [ Tue Mar 09, 2010 3:31 pm ]
Post subject:
Quote:
> > the iMX515 can do 1.6GFLOPS
>
> The NEON Pipeline has 4 Single Precision FP Multiply Units and 4 Accumulators ... it could handle 4 Floats/Cycle.
>
> So shouldn't this be 3.2 GFLOPS or am I missing something?

NEON's registers are 64-bit wide, so while I may issue a vaddq_f32 instruction (which performs its operation on 4x32-bit floats), it does the addition 64 bits at a time, not 128 bits at a time like true 128-bit SIMD units such as AltiVec or SSE do.

The benchmark results seem quite logical actually.

Author:	slyd [ Tue Mar 09, 2010 5:33 pm ]
Post subject:
The Registers are 128 Bits (Q Registers) wide but may be handled as 64 Bit Registers (D Registers) as well.

see:
http://infocenter.arm.com/help/index.js ... 03s02.html

Quote:
> The NEON unit can view the same register bank as:
> * sixteen 128-bit quadword registers, Q0-Q15
> * thirty-two 64-bit doubleword registers, D0-D31.

The 128 Bit Q Registers can hold 2x 64 Bit, 4x 32 Bit, 8x 16 Bit or 16x 8 Bit elements.

Also see:
http://infocenter.arm.com/help/index.js ... IIFHA.html

Here it performs 8x 16 Bit integer add.
I guess this also works with 4x 32 Bit float.

Or:
http://infocenter.arm.com/help/index.js ... 03s03.html

Quote:
> For example, VADD.I16 q0, q1, q2 indicates an operation on 16-bit integer elements stored in 128-bit Q registers. This means that the operation is on eight 16-bit lanes in parallel.

I did not verify the above with tests so far.
So of course I may misinterpret the docs - if you have different info, please let me know!

Quote:
> The benchmark results seem quite logical actually.

Sure - real world performance will not be even close to the theoretical maximum but there may still be some room for improvement
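The element counts listed in the post above follow from dividing the register width by the element width. Here is a minimal Python sketch of that arithmetic; the function name `lanes` is made up for illustration and is not part of any ARM tool or library.

```python
# Lane count = register width / element width.
# Register widths are the ones given in the ARM documentation quoted above.
Q_REGISTER_BITS = 128  # quadword registers Q0-Q15
D_REGISTER_BITS = 64   # doubleword registers D0-D31


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given width that fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


if __name__ == "__main__":
    for element_bits in (64, 32, 16, 8):
        count = lanes(Q_REGISTER_BITS, element_bits)
        print(f"Q register, {element_bits}-bit elements: {count} lanes")
```

Note that this only says how many elements fit in a register. It says nothing about how many cycles the hardware needs to process them.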

Author:	markos [ Tue Mar 09, 2010 6:36 pm ]
Post subject:
Quote:
> The Registers are 128 Bits (Q Registers) wide but may be handled as 64 Bit Registers (D Registers) as well.
>
> see:
> http://infocenter.arm.com/help/index.js ... 03s02.html
>
> Quote:
> > The NEON unit can view the same register bank as:
> > * sixteen 128-bit quadword registers, Q0-Q15
> > * thirty-two 64-bit doubleword registers, D0-D31.
>
> The 128 Bit Q Registers can hold 2x 64 Bit, 4x 32 Bit, 8x 16 Bit or 16x 8 Bit elements.
>
> Also see:
> http://infocenter.arm.com/help/index.js ... IIFHA.html
>
> Here it performs 8x 16 Bit integer add.
> I guess this also works with 4x 32 Bit float.
>
> Or:
> http://infocenter.arm.com/help/index.js ... 03s03.html
>
> Quote:
> > For example, VADD.I16 q0, q1, q2 indicates an operation on 16-bit integer elements stored in 128-bit Q registers. This means that the operation is on eight 16-bit lanes in parallel.
>
> I did not verify the above with tests so far.
> So of course I may misinterpret the docs - if you have different info, please let me know!

It all depends on how one looks at it, I read this differently:
http://infocenter.arm.com/help/topic/co ... dgcfe.html

Rather, it has a register file of 64-bit registers which it can map to 128-bit registers as well. But that is only a matter of convenience: it will sometimes take double the cycles to perform an instruction on a q-word.

It depends on the instruction:
http://infocenter.arm.com/help/index.js ... 06s06.html

E.g. for fp32 addition, vadd takes one cycle only for a d-word, not a q-word. So a 128-bit vadd will take two cycles. Some others take 1 cycle even for a q-word, e.g. integer addition.

Quote:
> Sure - real world performance will not be even close to the theoretical maximum but there may still be some room for improvement

There is, but I think the limit is 1.6GFLOPS not 3.2 :)
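To put numbers on the argument in this post: throughput is the lane count divided by the cycles the instruction takes. The sketch below uses only the cycle counts stated in the post (one cycle for a d-word fp32 vadd, two for a q-word fp32 vadd, one for a q-word integer add). They have not been checked against the ARM timing tables here, and the helper name is hypothetical.

```python
def elements_per_cycle(register_bits: int, element_bits: int, cycles: int) -> float:
    """Elements processed per cycle for one instruction.

    register_bits / element_bits gives the lane count; dividing by the
    cycle count gives the sustained rate for that instruction alone.
    """
    return (register_bits // element_bits) / cycles


# Cycle counts as claimed in the post above (assumptions, not measurements).
F32_D_WORD = elements_per_cycle(64, 32, cycles=1)    # vadd.f32 on a D register
F32_Q_WORD = elements_per_cycle(128, 32, cycles=2)   # vadd.f32 on a Q register
I16_Q_WORD = elements_per_cycle(128, 16, cycles=1)   # vadd.i16 on a Q register

if __name__ == "__main__":
    print("fp32 add, d-word:", F32_D_WORD, "floats/cycle")
    print("fp32 add, q-word:", F32_Q_WORD, "floats/cycle")
    print("int16 add, q-word:", I16_Q_WORD, "elements/cycle")
```

Under these cycle counts the fp32 rate is the same whether the code uses D or Q registers, which is the point being made: the wider register name does not double the float throughput.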

Author:	slyd [ Wed Mar 10, 2010 4:38 am ]
Post subject:
Quote:
> It depends on the instruction:

Ah OK I see - you're right.

Well, at least most ALU instructions seem to be single cycle and 1.6 GFLOPS is also not too bad for a low power CPU :)

Author:	markos [ Wed Mar 10, 2010 4:41 am ]
Post subject:
Quote:
> Quote:
> > It depends on the instruction:
>
> Ah OK I see - you're right.
>
> Well, at least most ALU instructions seem to be single cycle and 1.6 GFLOPS is also not too bad for a low power CPU :)

Here is a quote from a guy inside ARM - I asked him yesterday as I wanted to be sure myself :)

Quote:
> The dual-issue on A8 is limited to
> - one NEON ALU op
> - one NEON load/store/permute instr (e.g. vld/vst/vmov/vext sort of thing)
>
> So it's 8 ops/cycle if you count add of 8bit values
> For F32 it would be 2 ops/cycle because of single-issue of ALU ops

I hope this clarifies things a bit.
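The two figures argued over in this thread (1.6 and 3.2 GFLOPS) both come out of the same formula, peak rate = ops per cycle x clock. The sketch below assumes an 800 MHz clock for the iMX515. The thread never states the clock; 800 MHz is simply the value that makes 2 ops/cycle equal 1.6 GFLOPS, so treat it as an assumption. The function name is hypothetical.

```python
# ASSUMPTION: 800 MHz core clock, inferred from 1.6 GFLOPS / 2 ops per cycle.
ASSUMED_CLOCK_MHZ = 800


def peak_gflops(ops_per_cycle: float, clock_mhz: float = ASSUMED_CLOCK_MHZ) -> float:
    """Theoretical peak in GFLOPS: ops per cycle times clock (MHz -> GHz)."""
    return ops_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    # 2 ops/cycle: the F32 figure from the ARM quote above.
    print("2 ops/cycle:", peak_gflops(2), "GFLOPS")
    # 4 ops/cycle: the figure assumed in the original question.
    print("4 ops/cycle:", peak_gflops(4), "GFLOPS")
```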

Power Developer https://powerdeveloper.org/forums/

Eigen port to ARM NEON! https://powerdeveloper.org/forums/viewtopic.php?f=60&t=1776

Author:	slyd [ Wed Mar 10, 2010 9:52 am ]
Post subject:
Indeed, it does. Thanks for that info!

Author:	kgardas [ Fri Jun 17, 2011 2:28 am ]
Post subject:	Re: Eigen port to ARM NEON!
Quote:
> Quote:
> > Quote:
> > > I didn't expect NEON was so good ...
> >
> > Well, that's progress. You can't be the best always. Modern machines have to be better than older ones. By the way, great to see this progress, Konstantinos!
>
> Altivec is still king though, check these results on the G4:
>
> Scalar:
> $ ./bench_gemm
> eigen cpu 2.65264s 0.809565 GFLOPS (13.283s)
> eigen real 2.6532s 0.809394 GFLOPS (13.2863s)
>
> Altivec:
> $ ./bench_gemm
> eigen cpu 1.17936s 1.82088 GFLOPS (5.90097s)
> eigen real 1.17959s 1.82054 GFLOPS (5.90304s)
>
> But keep in mind that PowerPC support is much better and more mature than for ARM (esp. wrt NEON) and that the PowerPC is slightly faster at 1 GHz. Theoretically the G4 can do 4 GFLOPS at fp math and the iMX515 can do 1.6 GFLOPS.

So it looks like a dual-core A9 will be able to reach the G4 performance level at a fraction of its power consumption. Great!
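For reference, the quoted G4 numbers can be reduced to two ratios: the AltiVec speedup over scalar code and the fraction of the stated 4 GFLOPS theoretical peak that bench_gemm reached. This is a small Python sketch using only the "eigen cpu" figures quoted above; the variable names are made up for illustration.

```python
# "eigen cpu" figures from the quoted bench_gemm runs on the G4.
SCALAR_CPU_SECONDS = 2.65264
SCALAR_GFLOPS = 0.809565
ALTIVEC_CPU_SECONDS = 1.17936
ALTIVEC_GFLOPS = 1.82088

# Theoretical G4 peak as stated in the quoted post.
G4_PEAK_GFLOPS = 4.0

# Speedup computed two ways; both should agree if the runs did the same work.
speedup_by_rate = ALTIVEC_GFLOPS / SCALAR_GFLOPS
speedup_by_time = SCALAR_CPU_SECONDS / ALTIVEC_CPU_SECONDS

# How much of the theoretical peak the AltiVec run reached.
fraction_of_peak = ALTIVEC_GFLOPS / G4_PEAK_GFLOPS

if __name__ == "__main__":
    print(f"AltiVec speedup (by GFLOPS): {speedup_by_rate:.2f}x")
    print(f"AltiVec speedup (by time):   {speedup_by_time:.2f}x")
    print(f"Fraction of G4 peak:         {fraction_of_peak:.1%}")
```

So AltiVec gives roughly a 2.25x speedup here, and the vectorized run lands a bit under half of the theoretical peak.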

All times are UTC-06:00