ARM compiler shoot-out, round 2

In my recent test of ARM compilers, I had to leave out Texas Instruments’ compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present round two in this shoot-out.

The contenders this time were the fastest GCC variant from round one, ARM RVCT, and newcomer TI TMS470. With the same rules as last time, the exact versions and optimisation options were as follows:

  • CodeSourcery GCC 2009q1 (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
  • ARM RVCT 4.0 Build 591, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros
  • TI TMS470 4.7.0-a9229, -float_support=vfpv3 -mv=7a8 -O3 -mf=5

To keep things fair, I left the vectoriser off with the TI compiler as well. The table below lists the decoding times for the sample files, this time normalised against the participating GCC compiler. Remember, smaller numbers are better. Also keep in mind that this test was done with a development snapshot of TMS470, not an approved release.

Sample name       Codec        Code type       GCC   RVCT  TI
cathedral         H.264 CABAC  integer         1.00  0.95  1.02
NeroAVC           H.264 CABAC  integer         1.00  0.96  1.05
indiana_jones_4   H.264 CAVLC  integer         1.00  0.92  1.02
NeroRecodeSample  MPEG-4 ASP   integer         1.00  1.01  1.08
Silent_Light      MP3          64-bit integer  1.00  0.48  0.72
When_I_Grow_Up    FLAC         integer         1.00  0.87  0.93
Lumme-Badloop     Vorbis       float           1.00  0.94  1.05
Canyon            AC-3         float           1.00  0.88  1.01
lotr              DTS          float           1.00  1.00  1.08

Overall, the TI TMS470 compiler comes off slightly worse than GCC. In two cases (the MP3 and FLAC samples), however, it was significantly better than GCC, though still not as good as RVCT. Incidentally, those were also the samples where RVCT scored its biggest wins over GCC.

My conclusions from this test are twofold:

  • ARM’s own compiler is very hard to beat. They do seem to know how their chips work.
  • GCC is incredibly bad at 64-bit arithmetic on 32-bit machines.
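
To make the second point concrete, here is a minimal sketch in C of the kind of 64-bit arithmetic at issue. The function names are hypothetical and are not taken from FFmpeg; the code only illustrates the arithmetic, assuming a 32-bit `int` and an arithmetic right shift for negative values (implementation-defined in C, but the usual behaviour on ARM compilers). A widening 32x32 multiply can be a single instruction on ARM, while a generic 64x64 multiply built from 32-bit halves needs three multiplies, one of which is pure waste when an operand started life as a 32-bit value.

```c
#include <stdint.h>

/* High 32 bits of a signed 32x32 -> 64-bit product, as used in
   fixed-point code. On 32-bit ARM this can map to a single SMULL. */
static int32_t mulh32(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Generic 64x64 -> 64-bit multiply spelled out in 32-bit halves.
   This is the work a compiler must emit when it treats both
   operands as full 64-bit values: three multiplies. */
static uint64_t mul64_generic(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo    = (uint64_t)a_lo * b_lo;      /* widening multiply */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;  /* only the low 32 bits matter */

    return lo + ((uint64_t)cross << 32);
}

/* Same product when b is known to be a 32-bit value: its upper half
   is zero, so the a_lo * b_hi term disappears and two multiplies
   are enough. */
static uint64_t mul64_by_u32(uint64_t a, uint32_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);

    return (uint64_t)a_lo * b + ((uint64_t)(a_hi * b) << 32);
}
```

For any `uint64_t a` and `uint32_t b`, `mul64_generic(a, b)` and `mul64_by_u32(a, b)` return the same value; the difference is only in how much work is done to get there.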

The logical next step is to test these compilers with vectorisation enabled. FFmpeg should offer plenty of opportunities for this feature to shine. Unfortunately, that test will have to wait until the RVCT vectoriser is fixed. The current release does not compile FFmpeg with vectorisation enabled.


12 Responses to ARM compiler shoot-out, round 2

  1. veryzhang says:

    Does that mean the gcc standard library for ARM is not fully optimized for 64-bit integer arithmetic?

    • Mans says:

      It is not library code that is slow, it is the ARM code generated by gcc from the C code that is bad. One thing gcc often does when doing 64-bit computations on a 32-bit machine is to set a register to zero (the upper half of a 32-bit number converted to 64-bit), then multiply something by it. It shouldn’t take much to realise that multiplying by zero produces zero, and that part of the calculation can be dropped.

  2. Reimar says:

    s/incredibly bad/useless/
    Really, you just can’t use gcc to do multiplications larger than the native size if speed or code size are in any way relevant.

  3. Anton Korobeynikov says:

    Why are you using the softfp ABI for the benchmarks? The hardware FP ABI is much better suited to this kind of application and might yield a noticeable speedup.

    PS: Could you please also include LLVM for the tests?

  4. Just as a reminder: Unlike gcc, armcc optimizes for space by default. Please also specify -Otime to make sure it directs its effort to execution time. Please check this out, too. No guarantees, though.
