A few days ago I implemented CAST-256 in C and was dissatisfied with how little use GCC made of the PowerPC ISA's less common instructions.
Type 1: I = ((Kmi + D) <<< Kri)
O = ((S1[Ia] ^ S2[Ib]) - S3[Ic]) + S4[Id]
This excerpt from RFC 2612 describes the first of the three F functions used in CAST-256. In C I used the following macro to implement it:
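/* Type 1 round function. I is a 32-bit temporary and rol() a rotate-left helper (neither is shown here). */
#define F1(D, R, M) \
( I = ( (M) + (D) ), \
  I = rol( (R), I ), \
  ( ( SBox1[I >> 24] ^ SBox2[(I >> 16) & 0xFF] ) - SBox3[(I >> 8) & 0xFF] ) + SBox4[I & 0xFF] )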
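For comparison, here is a minimal sketch of the same Type 1 function written as an inline function instead of a macro. The names rotl32 and cast_f1 and the idea of passing the S-boxes as parameters are assumptions made for illustration; they are not taken from my actual source. The rotation count is masked to 5 bits, on the assumption that the rotation subkey is used as a 5-bit value as RFC 2612 describes.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by the low 5 bits of n. */
static inline uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Type 1: I = ((Km + D) <<< Kr)
 *         O = ((S1[Ia] ^ S2[Ib]) - S3[Ic]) + S4[Id]
 * Ia is the most significant byte of I, Id the least significant.
 * The S-boxes are passed in so the sketch stays self-contained. */
static inline uint32_t cast_f1(uint32_t d, uint32_t kr, uint32_t km,
                               const uint32_t s1[256], const uint32_t s2[256],
                               const uint32_t s3[256], const uint32_t s4[256])
{
    uint32_t i = rotl32(km + d, kr);
    return ((s1[i >> 24] ^ s2[(i >> 16) & 0xFF]) - s3[(i >> 8) & 0xFF])
           + s4[i & 0xFF];
}
```

A function avoids the shared temporary I and the precedence pitfalls of a comma-expression macro. Whether GCC generates better PowerPC code for it is something I have not measured.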
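No matter what I tried, GCC always used at least three instructions to calculate the index into an S-box whenever more than a simple right shift or a mask was required. For ( I >> 16 ) & 0xFF, for example, GCC emitted srwi, andi, and slwi. These are three dependent operations, whereas a single rlwinm $1, $4, 18, 22, 29 would do the job: it rotates the word left by 18 bits and keeps only bits 22 to 29, which leaves the selected byte already multiplied by 4, ready to be used as a byte offset into a table of 32-bit words.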
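To check the rlwinm constants without a PowerPC at hand, the instruction can be modelled in portable C. The sketch below is only a model under stated assumptions: rlwinm_model and sbox_offset are hypothetical helper names, and the mask construction covers only the non-wrapping case MB <= ME, which is all that is needed here. Bits are numbered the IBM way, with bit 0 as the most significant.

```c
#include <stdint.h>

/* Model of "rlwinm ra, rs, sh, mb, me" for mb <= me:
 * rotate rs left by sh, then keep bits mb..me (bit 0 = MSB). */
static inline uint32_t rlwinm_model(uint32_t rs, unsigned sh,
                                    unsigned mb, unsigned me)
{
    uint32_t rot, mask;

    sh &= 31u;
    rot = (rs << sh) | (rs >> ((32u - sh) & 31u));
    mask = (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31u - me));
    return rot & mask;
}

/* Byte offset into a table of 32-bit words, selected by one byte of i.
 * byte = 0 picks the most significant byte, byte = 3 the least.
 * The rotate counts come out as 10, 18, 26 and 2. */
static inline uint32_t sbox_offset(uint32_t i, unsigned byte)
{
    return rlwinm_model(i, (10u + 8u * byte) & 31u, 22, 29);
}
```

For example, sbox_offset(I, 1) gives the same value as ((I >> 16) & 0xFF) << 2, which is what the three-instruction srwi, andi, slwi sequence computes.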
Is anyone here interested in the source once it works?
Current version of F1 in assembler:
; Dst0, Dst1, Dst2, Dst3, Src
rlwinm $0, $4, 10, 22, 29
rlwinm $1, $4, 18, 22, 29
rlwinm $2, $4, 26, 22, 29
rlwinm $3, $4, 2, 22, 29
; Dst0 & Idx0, .. , Dst3 & Idx3
lwzx $0, SBox1, $0
lwzx $1, SBox2, $1
lwzx $2, SBox3, $2
lwzx $3, SBox4, $3
; Dst0, Dst1, Dst2, Dst3, Src
Split_Word_Mul_4 $0, $1, $2, $3, $4
Load_SBoxes $0, $1, $2, $3
add Tmp3, Mask, Data
rotlw Tmp3, Tmp3, Rota
SBoxes Tmp0, Tmp1, Tmp2, Tmp3, Tmp3
xor Tmp0, Tmp0, Tmp1
sub Tmp0, Tmp0, Tmp2
add Tmp0, Tmp0, Tmp3
P.S.: I didn't reformat the code sections; they are just copied and pasted from Xcode.