In my recent test of ARM compilers, I had to leave out Texas Instruments’ compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present round two in this shoot-out.
The contenders this time were the fastest GCC variant from round one, ARM RVCT, and newcomer TI TMS470. With the same rules as last time, the exact versions and optimisation options were like this:
- CodeSourcery GCC 2009q1 (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
- ARM RVCT 4.0 Build 591, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros
- TI TMS470 4.7.0-a9229, --float_support=vfpv3 -mv=7a8 -O3 -mf=5
To keep things fair, I left the vectoriser off for the TI compiler as well. The table below lists the decoding times for the sample files, this time normalised against the participating GCC compiler. Remember, smaller numbers are better. Also keep in mind that this test was done with a development snapshot of TMS470, not an approved release.
Sample name | Codec | Code type | GCC | RVCT | TI |
---|---|---|---|---|---|
cathedral | H.264 CABAC | integer | 1.00 | 0.95 | 1.02 |
NeroAVC | H.264 CABAC | integer | 1.00 | 0.96 | 1.05 |
indiana_jones_4 | H.264 CAVLC | integer | 1.00 | 0.92 | 1.02 |
NeroRecodeSample | MPEG-4 ASP | integer | 1.00 | 1.01 | 1.08 |
Silent_Light | MP3 | 64-bit integer | 1.00 | 0.48 | 0.72 |
When_I_Grow_Up | FLAC | integer | 1.00 | 0.87 | 0.93 |
Lumme-Badloop | Vorbis | float | 1.00 | 0.94 | 1.05 |
Canyon | AC-3 | float | 1.00 | 0.88 | 1.01 |
lotr | DTS | float | 1.00 | 1.00 | 1.08 |
Overall, the TI TMS470 compiler comes off slightly worse than GCC. In two cases (MP3 and FLAC), however, it was significantly better than GCC, though not as good as RVCT. Incidentally, those were also the ones where RVCT scored the biggest win over GCC.
My conclusions from this test are twofold:
- ARM’s own compiler is very hard to beat. They do seem to know how their chips work.
- GCC is incredibly bad at 64-bit arithmetic on 32-bit machines.
The logical next step is to test these compilers with vectorisation enabled. FFmpeg should offer plenty of opportunities for this feature to shine. Unfortunately, that test will have to wait until the RVCT vectoriser is fixed. The current release does not compile FFmpeg with vectorisation enabled.
Does that mean the gcc standard library for ARM is not fully optimized for 64-bit integer arithmetic?
It is not library code that is slow, it is the ARM code generated by gcc from the C code that is bad. One thing gcc often does when doing 64-bit computations on a 32-bit machine is to set a register to zero (the upper half of a 32-bit number converted to 64-bit), then multiply something by it. It shouldn’t take much to realise that multiplying by zero produces zero, and that part of the calculation can be dropped.
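To make the point concrete, here is a rough sketch in C of what a 64-bit multiply looks like when written out in 32-bit halves. The function names are made up for illustration and none of this is FFmpeg code. The instruction names in the comments (UMULL, MUL, MLA) are what one would hope an ARM compiler emits, not a claim about what any particular compiler actually produces.

```c
#include <assert.h>
#include <stdint.h>

/* General 64x64 -> 64 multiply built from 32-bit halves.
 * The ah*bh term is shifted out of the low 64 bits entirely. */
static uint64_t mul64_generic(uint64_t a, uint64_t b)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint64_t lo = (uint64_t)al * bl;     /* one UMULL */
    uint32_t cross = al * bh + ah * bl;  /* MUL + MLA, wraps modulo 2^32 */
    return lo + ((uint64_t)cross << 32);
}

/* a is a zero-extended 32-bit value, so ah == 0 and the ah*bl
 * product is known to be zero: no register needs to be cleared
 * and multiplied. */
static uint64_t mul_u32_by_u64(uint32_t a, uint64_t b)
{
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint64_t lo = (uint64_t)a * bl;      /* one UMULL */
    uint32_t cross = a * bh;             /* one MUL */
    return lo + ((uint64_t)cross << 32);
}

/* Both operands are 32-bit: both upper halves are zero, so a
 * single widening multiply is all that is required. */
static uint64_t mul_u32_u32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;              /* one UMULL */
}

/* Sanity check: the shortcuts agree with the general version. */
static void self_check(void)
{
    uint32_t a = 0x89abcdefu, c = 0xfedcba98u;
    uint64_t b = 0x0123456789abcdefULL;
    assert(mul_u32_by_u64(a, b) == mul64_generic(a, b));
    assert(mul_u32_u32(a, c) == mul64_generic(a, c));
}
```

The complaint is about a compiler that, given something like `mul_u32_u32`, effectively emits the work of `mul64_generic`, including the multiplications by a register it has just set to zero.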
s/incredibly bad/useless/
Really, you just can’t use gcc to do multiplications larger than the native size if speed or code size are in any way relevant.
Why are you using the softfp ABI for the benchmarks? The hardware FP ABI is much better suited for such applications and might yield a noticeable speedup.
PS: Could you please also include LLVM for the tests?
Most of the compilers support only softfp ABI. I compared soft and hard with gcc-csl 2009q1, and there was very little difference. FFmpeg passes floats as arguments or return values in very few places.
I’m planning a new round soon, and I’d be happy to include LLVM, if only I could figure out how to configure it as a cross-compiler.
Build llvm-gcc like a regular gcc. There is even a script to use the CodeSourcery-provided binutils as a “bootstrap” toolchain.
I see nothing usual about building llvm-gcc. There isn’t even a configure script.
Huh, how so? There is definitely one:
http://llvm.org/viewvc/llvm-project/llvm-gcc-4.2/trunk/configure
Make sure you checked out stuff properly. README.LLVM is also a good thing to read before building.
There is a script for an almost automatic build of cross-compilers. See http://llvm.org/viewvc/llvm-project/llvm/trunk/utils/crosstool/ARM/README
That is most definitely not what was in a tarball I downloaded. The only instructions I found involved a complicated procedure combining parts from an llvm base tarball with some llvm-gcc bits, and none of it made much sense, so I gave up.
Sorry, I really have no idea what you’ve downloaded. Release tarballs contain all the code from the SVN repository. For ARM stuff you might really want to check out the code from SVN, since a bunch of stuff was fixed or improved after the 2.6 release.
Just as a reminder: Unlike gcc, armcc optimizes for space by default. Please also specify -Otime to make sure it directs its effort to execution time. Please check this out, too. No guarantees, though.
I already said I’m using --translate_gcc which maps -O3 to -O3 -Otime.