GCC makes a mess

Wednesday, 13th May, 2009 - 2:16 am | Compilers, Optimisation, PowerPC

Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems with the exception of 32-bit PowerPC. On the PPC MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.

A session with oprofile exposes multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form a += b * c where b and c are 32 bits in size and a is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.

Suppose you need the high 32 bits from the 64-bit result of multiplying two 32-bit numbers. This is most easily written in C like this:

int mulh(int a, int b)
{
    return ((int64_t)a * (int64_t)b) >> 32;
}

It doesn’t take much thinking to see that the PowerPC mulhw instruction performs exactly this operation. Indeed, GCC knows of this instruction and uses it. But can we be really sure that those low 32 bits are not needed? GCC seems unconvinced:

mulhw   r9,  r4,  r3
mullw   r10, r4,  r3
srawi   r11, r9,  31
srawi   r12, r9,  0
mr      r3,  r12
blr

The second example is slightly more complicated:

int64_t mac(int64_t a, int b, int c, int d)
{
    a += (int64_t)b * (int64_t)c;
    a += (int64_t)b * (int64_t)d;
    return a;
}

This can, of course, be done with four multiplications and four additions. GCC, however, likes to be thorough, and uses twice the number of both instructions, plus some loads, stores and shifts for completeness:

stwu    r1,  -32(r1)
srawi   r0,  r6,  31
mullw   r0,  r0,  r5
srawi   r8,  r7,  31
stw     r29, 20(r1)
srawi   r29, r5,  31
stw     r27, 12(r1)
stw     r28, 16(r1)
mullw   r11, r29, r6
mulhwu  r9,  r6,  r5
add     r0,  r0,  r11
mullw   r10, r6,  r5
add     r9,  r0,  r9
mullw   r29, r29, r7
addc    r28, r10, r4
adde    r27, r9,  r3
mullw   r8,  r8,  r5
mulhwu  r9,  r7,  r5
add     r8,  r8,  r29
lwz     r29, 20(r1)
mullw   r10, r7,  r5
add     r9,  r8,  r9
addc    r12, r28, r10
adde    r11, r27, r9
lwz     r27, 12(r1)
mr      r4,  r12
lwz     r28, 16(r1)
mr      r3,  r11
addi    r1,  r1,  32
blr
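
To make the "four multiplications and four additions" count concrete, here is a sketch in C of the 32-bit decomposition, with each statement annotated by the PowerPC instruction it corresponds to. The name mac_split is made up for illustration, and the code assumes that right-shifting a negative int64_t is an arithmetic shift, as it is with GCC.

```c
#include <stdint.h>

/* Hypothetical helper: the mac() example done on 32-bit halves. */
static int64_t mac_split(int64_t a, int32_t b, int32_t c, int32_t d)
{
    uint32_t lo = (uint32_t)a;                    /* low word of a  */
    uint32_t hi = (uint32_t)((uint64_t)a >> 32);  /* high word of a */
    uint32_t p_lo, p_hi;

    /* a += b * c */
    p_lo = (uint32_t)b * (uint32_t)c;             /* mullw */
    p_hi = (uint32_t)(((int64_t)b * c) >> 32);    /* mulhw */
    lo  += p_lo;                                  /* addc  */
    hi  += p_hi + (lo < p_lo);                    /* adde: add carry from low word */

    /* a += b * d */
    p_lo = (uint32_t)b * (uint32_t)d;             /* mullw */
    p_hi = (uint32_t)(((int64_t)b * d) >> 32);    /* mulhw */
    lo  += p_lo;                                  /* addc  */
    hi  += p_hi + (lo < p_lo);                    /* adde  */

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

That is two mullw, two mulhw, two addc and two adde. The GCC 4.3.3 listing above does twice that work.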

Fortunately, this madness is easily fixed with a little inline assembler, which more than doubles the speed of the decoder and makes FFmpeg significantly faster than MAD on PowerPC as well.
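
The post does not show the inline assembler itself. As a rough sketch of what such a fix might look like for the first example, assuming GCC extended asm on 32-bit PowerPC (the preprocessor guard and the portable fallback are additions for illustration, not taken from FFmpeg):

```c
#include <stdint.h>

/* High 32 bits of the signed 64-bit product a * b. */
static inline int32_t mulh(int32_t a, int32_t b)
{
#if defined(__GNUC__) && defined(__powerpc__) && !defined(__powerpc64__)
    int32_t r;
    /* mulhw rD, rA, rB: rD = high word of the signed product rA * rB */
    __asm__ ("mulhw %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    /* Portable fallback; assumes >> on a negative int64_t is an
       arithmetic shift, as it is with GCC. */
    return (int32_t)(((int64_t)a * b) >> 32);
#endif
}
```

With the asm path the compiler has no chance to compute the unused low word: the function body is a single mulhw.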


35 Responses to GCC makes a mess

Adrian Bunk says:

Wednesday, 13th May, 2009 at 4:12 pm

gcc 4.4.0 gives

   0:   7d 24 18 96     mulhw   r9,r4,r3
   4:   7d 23 4b 78     mr      r3,r9
   8:   4e 80 00 20     blr

resp.

   0:   7c ec 3b 78     mr      r12,r7
   4:   7c eb fe 70     srawi   r11,r7,31
   8:   7c ca 33 78     mr      r10,r6
   c:   7c c9 fe 70     srawi   r9,r6,31
  10:   7d 4a 60 14     addc    r10,r10,r12
  14:   7d 29 59 14     adde    r9,r9,r11
  18:   7c ab fe 70     srawi   r11,r5,31
  1c:   7c 09 29 d6     mullw   r0,r9,r5
  20:   7d 6b 51 d6     mullw   r11,r11,r10
  24:   7d 0a 29 d6     mullw   r8,r10,r5
  28:   7c ea 28 16     mulhwu  r7,r10,r5
  2c:   7c 00 5a 14     add     r0,r0,r11
  30:   7d 06 43 78     mr      r6,r8
  34:   7c a0 3a 14     add     r5,r0,r7
  38:   7c c6 20 14     addc    r6,r6,r4
  3c:   7c a5 19 14     adde    r5,r5,r3
  40:   7c c4 33 78     mr      r4,r6
  44:   7c a3 2b 78     mr      r3,r5
  48:   4e 80 00 20     blr

Mans says:

Wednesday, 13th May, 2009 at 4:28 pm

That’s certainly an improvement, going from five times to only twice the optimal code size in the first case and from 3.6x to 2.25x in the second.

Too bad gcc 4.4 is unable to compile a working FFmpeg on PPC, see http://fate.multimedia.cx/
Adrian Bunk says:

Thursday, 14th May, 2009 at 8:16 am

Is anyone reporting all the bugs (both miscompilations and inefficient code) you discover to the gcc Bugzilla?

The “gcc 4.4 is unable to compile a working FFmpeg on PPC” even looks like a regression that is likely to be fixed in later 4.4 releases if reported.
- Mans says:
  
  Thursday, 14th May, 2009 at 8:38 am
  
  We were a bit put off reporting bugs after responses like this:
  
  just because code is syntactically “valid” GNU C doesn’t mean gcc can always compile it
Adrian Bunk says:

Thursday, 14th May, 2009 at 9:10 am

I just read through the whole bug, and in its context this statement sounds reasonable (syntactically correct code can cause trouble for gcc’s register allocator).

Wrong code generation is a very different issue than the issue discussed there.
compn says:

Friday, 15th May, 2009 at 5:14 pm

i was going to ask about mad being an int-only decoder but i see ffmpeg mpeg audio decoder has had an int-only version since 0.4.6…

was the original bugreport on ppc ?
- Mans says:
  
  Friday, 15th May, 2009 at 5:54 pm
  
  The original report was on m68k, but I don’t have one of those. I suspect the same gcc inefficiency may be to blame there.
ami_stuff says:

Wednesday, 20th May, 2009 at 1:22 pm

Is there a way to fix it with generic C code without asm inlines (probably no one will add m68k code)? If so, I can test it.
- Mans says:
  
  Wednesday, 20th May, 2009 at 1:50 pm
  
  I’d be happy to add m68k asm, but I have no hardware to test it on, and qemu doesn’t support 68020, which seems to be the minimum gcc/uclibc will work with.
  - Michael Kostylev says:
    
    Tuesday, 1st December, 2009 at 5:22 pm
    
    ColdFire (-mcpu=5475) is somehow supported, r20684 passes 308/310.

ami_stuff says:

Wednesday, 20th May, 2009 at 2:03 pm

int64_t mac(int64_t a, int b, int c, int d)
{
    a += (int64_t)b * (int64_t)c;
    a += (int64_t)b * (int64_t)d;
    return a;
}

GCC 3.4 (-m68060 -fomit-frame-pointer -s -O3):

#NO_APP
	.text
	.even
	.globl	_mac
_mac:
	moveml #0x3c20,sp@-
	movel sp@(24),d2
	movel sp@(28),d3
	movel sp@(32),d5
	smi d4
	extbl d4
	lea ___muldi3,a2
	movel sp@(36),d1
	smi d0
	extbl d0
	movel d1,sp@-
	movel d0,sp@-
	movel d5,sp@-
	movel d4,sp@-
	jbsr a2@
	lea sp@(16),sp
	addl d1,d3
	addxl d0,d2
	movel sp@(40),d1
	smi d0
	extbl d0
	movel d1,sp@-
	movel d0,sp@-
	movel d5,sp@-
	movel d4,sp@-
	jbsr a2@
	lea sp@(16),sp
	addl d3,d1
	addxl d2,d0
	moveml sp@+,#0x43c
	rts

GCC 4.3.2 (-m68060 -fomit-frame-pointer -s -O3):

#NO_APP
	.text
	.even
	.globl	_mac
_mac:
	movem.l #15392,-(sp)
	move.l 32(sp),d3
	smi d2
	extb.l d2
	lea ___muldi3,a2
	move.l d3,-(sp)
	move.l d2,-(sp)
	move.l 44(sp),-(sp)
	smi d0
	extb.l d0
	move.l d0,-(sp)
	jsr (a2)
	lea (16,sp),sp
	move.l 24(sp),d4
	move.l 28(sp),d5
	add.l d1,d5
	addx.l d0,d4
	move.l d3,-(sp)
	move.l d2,-(sp)
	move.l 48(sp),-(sp)
	smi d0
	extb.l d0
	move.l d0,-(sp)
	jsr (a2)
	lea (16,sp),sp
	add.l d1,d5
	addx.l d0,d4
	move.l d4,d0
	move.l d5,d1
	movem.l (sp)+,#1084
	rts

GCC 4.4 alpha 20081212 (-m68060 -fomit-frame-pointer -s -O3):

#NO_APP
	.text
	.even
	.globl	_mac
_mac:
	movem.l #15360,-(sp)
	move.l 36(sp),d3
	smi d2
	extb.l d2
	move.l 32(sp),d1
	smi d0
	extb.l d0
	move.l 28(sp),-(sp)
	smi d4
	extb.l d4
	move.l d4,-(sp)
	move.l d2,d4
	move.l d3,d5
	add.l d1,d5
	addx.l d0,d4
	move.l d5,-(sp)
	move.l d4,-(sp)
	jsr ___muldi3
	lea (16,sp),sp
	move.l d1,d2
	move.l d0,d1
	move.l 20(sp),d5
	add.l 24(sp),d2
	addx.l d5,d1
	move.l d1,d0
	move.l d2,d1
	movem.l (sp)+,#60
	rts

Mans says:

Wednesday, 20th May, 2009 at 3:18 pm

Try this:

int64_t MAC64(int64_t d, int a, int b)
{   
    union { int64_t x; int hl[2]; } x = { d };
    int h, l;
    __asm__ ("muls.l %5, %2, %3  \n\t"
             "add.l  %3, %1      \n\t"
             "addx.l %2, %0      \n\t"
             : "+dm"(x.hl[0]), "+dm"(x.hl[1]),
               "=d"(h), "=&d"(l)
             : "3"(a), "g"(b));
    return x.x;
}

BTW, gcc 4.3 and above for m68k all ICE in various places building FFmpeg.

ami_stuff says:

Wednesday, 20th May, 2009 at 2:07 pm

@Mans:

Try WinUAE – Amiga’s emulator. It supports 68040+FPU:

http://www.winuae.net/

PS. Right now site is down, but the latest executable is here:

http://eab.abime.net/showthread.php?t=40738&page=16
Mans says:

Wednesday, 20th May, 2009 at 2:26 pm

UAE is a full system emulator. It would be much easier to test things with Qemu userspace emulation. Real hardware would be even more fun of course. What hardware are you using?
ami_stuff says:

Wednesday, 20th May, 2009 at 2:34 pm

WinUAE, but most of the time I get the right timings: the same results I get from users of real hardware (which binary is faster, which is slower). So when something is faster on WinUAE, it is probably faster on real hardware too.

ami_stuff says:

Wednesday, 20th May, 2009 at 3:57 pm

Hmm, something is wrong:

#define int64_t  long long

int64_t MAC64(int64_t d, int a, int b)
{
    union { int64_t x; int hl[2]; } x = { d };
    int h, l;
    __asm__ ("muls.l %5, %2, %3  \n\t"
             "add.l  %3, %1      \n\t"
             "addx.l %2, %0      \n\t"
             : "+dm"(x.hl[0]), "+dm"(x.hl[1]),
               "=d"(h), "=&d"(l)
             : "3"(a), "g"(b));
    return x.x;
}

GCC 4.4 error:

libavcodec/mpegaudiodec.c:54 error: expected identifier or '(' before 'long'
libavcodec/mpegaudiodec.c:54 error: expected ')' before '+=' token

Line 54 is “int64_t MAC64(int64_t d, int a, int b)”.

Mans says:

Wednesday, 20th May, 2009 at 4:10 pm

You need to arrange for the generic macro in mathops.h to be suppressed. See how it’s done for ppc.

Bernd_afa says:

Wednesday, 20th May, 2009 at 4:18 pm

Interesting: at least on 68k, the experimental gcc 4.4.0 needs only one muldi3 call, while older compilers need two.

I am now compiling the gcc 4.4.0 release and will send it to ami_stuff.

If you need a 68k system that is easy to use and familiar from the Amiga days, here is a modern preconfigured system that works on Linux and Windows. Only the Kickstart ROM images of an Amiga are needed.

http://amikit.amiga.sk/what-is-it.htm
Bernd_afa says:

Wednesday, 20th May, 2009 at 4:31 pm
```
__asm__ ("muls.l %5, %2, %3  \n\t"
             "add.l  %3, %1      \n\t"
             "addx.l %2, %0      \n\t"
             : "+dm"(x.hl[0]), "+dm"(x.hl[1]),
```
muls supports only two operands on 68k; the result goes in the second register.
But I can't fix it, as I have trouble understanding gcc's register constraint syntax.
Mans says:

Wednesday, 20th May, 2009 at 4:45 pm

My code generates correct-looking machine code when compiled with gcc/gas. I just can’t test it.
ami_stuff says:

Wednesday, 20th May, 2009 at 4:53 pm

Mans@

C:\>ffmpeg -i test.mp3
FFmpeg version SVN-r18696, Copyright (c) 2000-2009 Fabrice Bellard, et al.
configuration: --enable-memalign-hack --prefix=/mingw --cross-prefix=i686-mingw32- --cc=ccache-i686-mingw32-gcc --target-os=mingw32 --arch=i686 --cpu=i686 --enable-avisynth --enable-gpl --enable-zlib --enable-bzlib --enable-libgsm --enable-libfaac --enable-libfaad --enable-pthreads --enable-libvorbis --enable-libtheora --enable-libspeex --enable-libmp3lame --enable-libopenjpeg --enable-libxvid --enable-libschroedinger --enable-libx264
libavutil 50. 3. 0 / 50. 3. 0
libavcodec 52.27. 0 / 52.27. 0
libavformat 52.32. 0 / 52.32. 0
libavdevice 52. 2. 0 / 52. 2. 0
libswscale 0. 7. 1 / 0. 7. 1
built on Apr 27 2009 04:01:39, gcc: 4.2.4
Input #0, mp3, from 'test.mp3':
Duration: 00:04:53.06, start: 0.000000, bitrate: 236 kb/s
Stream #0.0: Audio: mp3, 44100 Hz, stereo, s16, 192 kb/s
At least one output file must be specified

ffmpeg -i test.mp3 test.wav (MPEGAUDIO_HP)

noasm version: 1:24m
asm version: 1:14m

More m68k optimizations please! :)

Bernd_afa says:

Wednesday, 20th May, 2009 at 6:54 pm

I think I have now found out how gcc handles 64-bit arithmetic. The gcc file gcc4.4.0/gcc/longlong.h must contain valid asm code for the target; otherwise libgcc is used.

For the 68020 it is there, but the 68060 has no 32*32-bit multiply instruction with a 64-bit result, so the 68060 uses the code in libgcc2.c.

The code in longlong.h is below.
It seems gcc supports only umul, udiv and sdiv on all platforms, but no smul.

/* The '020, '030, '040, '060 and CPU32 have 32x32->64 and 64/32->32q-32r.  */
#if (defined (__mc68020__) && !defined (__mc68060__))
#define umul_ppmm(w1, w0, u, v) \
  __asm__ ("mulu%.l %3,%1:%0"						\
	   : "=d" ((USItype) (w0)),					\
	     "=d" ((USItype) (w1))					\
	   : "%0" ((USItype) (u)),					\
	     "dmi" ((USItype) (v)))
#define UMUL_TIME 45
#define udiv_qrnnd(q, r, n1, n0, d) \
  __asm__ ("divu%.l %4,%1:%0"						\
	   : "=d" ((USItype) (q)),					\
	     "=d" ((USItype) (r))					\
	   : "0" ((USItype) (n0)),					\
	     "1" ((USItype) (n1)),					\
	     "dmi" ((USItype) (d)))
#define UDIV_TIME 90
#define sdiv_qrnnd(q, r, n1, n0, d) \
  __asm__ ("divs%.l %4,%1:%0"						\
	   : "=d" ((USItype) (q)),					\
	     "=d" ((USItype) (r))					\
	   : "0" ((USItype) (n0)),					\
	     "1" ((USItype) (n1)),					\
	     "dmi" ((USItype) (d)))

I need to check more closely whether it works as expected.

Bernd_afa says:

Wednesday, 20th May, 2009 at 7:01 pm

here is what is possible in longlong.h
/* Define auxiliary asm macros.

1) umul_ppmm(high_prod, low_prod, multiplier, multiplicand) multiplies two
UWtype integers MULTIPLIER and MULTIPLICAND, and generates a two UWtype
word product in HIGH_PROD and LOW_PROD.

2) __umulsidi3(a,b) multiplies two UWtype integers A and B, and returns a
UDWtype product. This is just a variant of umul_ppmm.

3) udiv_qrnnd(quotient, remainder, high_numerator, low_numerator,
denominator) divides a UDWtype, composed by the UWtype integers
HIGH_NUMERATOR and LOW_NUMERATOR, by DENOMINATOR and places the quotient
in QUOTIENT and the remainder in REMAINDER. HIGH_NUMERATOR must be less
than DENOMINATOR for correct operation. If, in addition, the most
significant bit of DENOMINATOR must be 1, then the pre-processor symbol
UDIV_NEEDS_NORMALIZATION is defined to 1.

4) sdiv_qrnnd(quotient, remainder, high_numerator, low_numerator,
denominator). Like udiv_qrnnd but the numbers are signed. The quotient
is rounded towards 0.

5) count_leading_zeros(count, x) counts the number of zero-bits from the
msb to the first nonzero bit in the UWtype X. This is the number of
steps X needs to be shifted left to set the msb. Undefined for X == 0,
unless the symbol COUNT_LEADING_ZEROS_0 is defined to some value.

6) count_trailing_zeros(count, x) like count_leading_zeros, but counts
from the least significant end.

7) add_ssaaaa(high_sum, low_sum, high_addend_1, low_addend_1,
high_addend_2, low_addend_2) adds two UWtype integers, composed by
HIGH_ADDEND_1 and LOW_ADDEND_1, and HIGH_ADDEND_2 and LOW_ADDEND_2
respectively. The result is placed in HIGH_SUM and LOW_SUM. Overflow
(i.e. carry out) is not stored anywhere, and is lost.

8) sub_ddmmss(high_difference, low_difference, high_minuend, low_minuend,
high_subtrahend, low_subtrahend) subtracts two two-word UWtype integers,
composed by HIGH_MINUEND_1 and LOW_MINUEND_1, and HIGH_SUBTRAHEND_2 and
LOW_SUBTRAHEND_2 respectively. The result is placed in HIGH_DIFFERENCE
and LOW_DIFFERENCE. Overflow (i.e. carry out) is not stored anywhere,
and is lost.

If any of these macros are left undefined for a particular CPU,
C macros are used. */
ami_stuff says:

Wednesday, 20th May, 2009 at 10:19 pm

I compiled the second example with the GCC 4.4 final release and the asm output is identical to that of 4.4 alpha, so there are no changes for m68k.

I see that “ppc/mathops.h” has optimized versions of MULH, MLS64, MUL16 and MAC16, and all of these are used by “libavcodec/mpegaudiodec.c”.
The Amiga compiler also lacks the llrint() function, which I see is used by the decoder, so this slows down decoding too.
There are also no asm-optimized round() and roundf() functions.

Mans, if you know how to implement these functions as GCC m68k inline asm, it would be really great and would probably speed up FFmpeg’s decoder a lot.
- Mans says:
  
  Wednesday, 20th May, 2009 at 10:36 pm
  
  Please try out the patch at the top of this git branch: http://git.mansr.com/?p=ffmpeg.mru;a=shortlog;h=refs/heads/m68k
ami_stuff says:

Wednesday, 20th May, 2009 at 10:50 pm

Holy shit! Only 29 sec. now!
ami_stuff says:

Wednesday, 20th May, 2009 at 10:55 pm

It beats libmad by 1 sec. With more optimizations like asm llrint() it will be even faster.

Is there a way to get asm-optimized code for the 68060 build too?
Mans says:

Wednesday, 20th May, 2009 at 11:05 pm

What hardware are you running this on? Can you confirm that the decoded output is correct, please?

68060 doesn’t have the 32×32->64 multiply instruction, but I’m sure it’s not too hard to beat gcc even without it.

The floating point rounding functions are not used in speed-critical places so there is no need to optimise them.
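
On the point about the 68060 lacking a 32×32->64 multiply: one generic way to get the full product without it is to build it from 16×16->32 partial products and then correct for the signs. The sketch below shows that technique in plain C. The function names are made up for illustration, and whether code along these lines actually beats gcc's __muldi3 on a real 68060 has not been tested here.

```c
#include <stdint.h>

/* Unsigned 32x32->64 multiply from four 16x16->32 partial products. */
static uint64_t umul32x32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xffff, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffff, b_hi = b >> 16;

    uint32_t ll = a_lo * b_lo;   /* contributes at bit 0  */
    uint32_t lh = a_lo * b_hi;   /* contributes at bit 16 */
    uint32_t hl = a_hi * b_lo;   /* contributes at bit 16 */
    uint32_t hh = a_hi * b_hi;   /* contributes at bit 32 */

    /* Sum of the bit-16 column; at most 0x2fffd, so it cannot overflow. */
    uint32_t mid = (ll >> 16) + (lh & 0xffff) + (hl & 0xffff);
    uint32_t lo  = (mid << 16) | (ll & 0xffff);
    uint32_t hi  = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    return ((uint64_t)hi << 32) | lo;
}

/* Signed variant: reinterpreting a negative operand as unsigned adds
   2^32 to it, so subtract the other operand from the high word. */
static int64_t smul32x32(int32_t a, int32_t b)
{
    uint64_t p  = umul32x32((uint32_t)a, (uint32_t)b);
    uint32_t hi = (uint32_t)(p >> 32);

    if (a < 0) hi -= (uint32_t)b;
    if (b < 0) hi -= (uint32_t)a;

    return (int64_t)(((uint64_t)hi << 32) | (uint32_t)p);
}
```

When only the high word is needed, as in MULH, the final recombination can be dropped and hi returned directly.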
ami_stuff says:

Wednesday, 20th May, 2009 at 11:10 pm

I’m running it on the WinUAE emulator. I hear no difference, but the “find dups” program does not recognize the files decoded with the standard and the asm-optimized FFmpeg as identical. I can decode a short file and mail it to you if you want, so you can analyze it.
Mans says:

Wednesday, 20th May, 2009 at 11:48 pm

Could you try disabling the asm functions one at a time and see which causes the discrepancy?
ami_stuff says:

Thursday, 21st May, 2009 at 1:27 am

Bad news. The whole speedup came from my comparing a 68060 build of FFmpeg with a 68040 build. The 68060 build needs to emulate the “muls” instruction; that is where the speedup came from. Also, the 68040 build generates a different wav file compared to the 68060 build.

When I compile with your MAC64 & MLS64 functions I get only a 1 sec. speedup.

MULH does not compile: statement ‘mul.l (a6),d2:d1’ ignored, etc.
ami_stuff says:

Thursday, 21st May, 2009 at 1:30 am

so for 68060 CPU libmad is the best choice
Bernd_afa says:

Thursday, 21st May, 2009 at 10:53 am

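I tested this with
-O3 -m68020
-m68040 gives the same result

#include <stdint.h>
#include <stdio.h>

int64_t MULH(int a, int b)
{
    return ((int64_t)(a) * (int64_t)(b)) >> 32;
}

int main(int argc, char *argv[])
{
    /* argc and the argv pointer serve only as values unknown at compile time */
    printf("%lld\n", (long long)MULH(argc, (int)(long)argv));
    return 0;
}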

The code for the 68060 is very inefficient, because there is no asm macro for it in longlong.h.
The instruction MULS.L D4,D2:D6 is also not supported by the UAE JIT and is executed slowly by the interpreter.

I think gcc’s longlong.h is missing code for the 68060.
I see the ColdFire code, but it is very complex. Is there no easier way to get a 64-bit result from a 32-bit * 32-bit multiply?

gcc 3.4.0
MOVE.L 8(A5),D7
MOVE.L $C(A5),D4
BSR.L ___main ;
MOVE.L D7,D6
MULS.L D4,D2:D6
MOVE.L D2,-(A7)
SMI D0
EXTB.L D0
MOVE.L D0,-(A7)
PEA _MAC64+$2E(PC)
LEA _printf,A3 ;

gcc 4.3.2

MOVE.L $C(A5),D1
MULS.L 8(A5),D0:D1
MOVE.L D0,-(A7)
SMI D2
EXTB.L D2
MOVE.L D2,-(A7)
PEA _time_delay+$F8(PC)
LEA _printf,A3 ;10F94

gcc 4.4.0

MOVE.L 8(A5),D2
MOVE.L $C(A5),D3
JSR ___main ;10FD50
MOVE.L D3,D1
MULS.L D2,D0:D1
MOVE.L D0,-(A7)
SMI D2
EXTB.L D2
MOVE.L D2,-(A7)
PEA _time_delay+$FA(PC)
LEA _printf,A3 ;10FD53

now with -m68060

MOVE.L 8(A5),-(A7)
SMI D0
EXTB.L D0
MOVE.L D0,-(A7)
MOVE.L $C(A5),-(A7)
SMI D2
EXTB.L D2
MOVE.L D2,-(A7)
JSR ___muldi3 ;110504
LEA $10(A7),A7
MOVE.L D0,-(A7)
SMI D2
EXTB.L D2
MOVE.L D2,-(A7)
PEA _time_delay+$FA(PC)
LEA _printf,A3 ;110508
JSR (A3)

……

___muldi3
110504A8: MOVE.L A5,-(A7)
110504AA: MOVEA.L A7,A5
110504AC: MOVEM.L D2-D7/A2,-(A7)
110504B0: MOVE.L $C(A5),D5
110504B4: MOVE.L $14(A5),D6
110504B8: MOVEA.L 8(A5),A2
110504BC: MOVE.L $10(A5),D7
110504C0: MOVE.L D5,D0
110504C2: MOVE.L D6,D1
110504C4: MOVE.L D0,D2
110504C6: SWAP D0
110504C8: MOVE.L D1,D3
110504CA: SWAP D1
110504CC: MOVE D2,D4
110504CE: MULU D3,D4
110504D0: MULU D1,D2
110504D2: MULU D0,D3
110504D4: MULU D0,D1
110504D6: MOVE.L D4,D0
110504D8: EOR D0,D0
110504DA: SWAP D0
110504DC: ADD.L D0,D2
110504DE: ADD.L D3,D2
110504E0: BCC.S ___muldi3+$40 ;110504E8
110504E2: ADDI.L #$10000,D1
110504E8: SWAP D2
110504EA: MOVEQ #0,D0
110504EC: MOVE D2,D0
110504EE: MOVE D4,D2
110504F0: MOVEA.L D2,A1
110504F2: ADD.L D1,D0
110504F4: MOVEA.L D0,A0
110504F6: MOVE.L A1,D1
110504F8: MULS.L D7,D5
110504FC: MOVE.L A2,D2
110504FE: MULS.L D2,D6
11050502: ADD.L D6,D5
11050504: ADD.L A0,D5
11050506: MOVE.L D5,D0
11050508: MOVEM.L (A7)+,D2-D7/A2
1105050C: UNLK A5
1105050E: RTS
ami_stuff says:

Sunday, 7th June, 2009 at 12:42 pm

Mans, could you try to create code for the 68060? That way we would know whether this slowdown is caused by slow GCC-generated code, or whether it is normal without a hardware 32×32->64 multiply and nothing can be done without redesigning the mpegaudio decoder. Thanks
