ARM inline asm secrets

Although I generally recommend against using GCC inline assembly, preferring instead pure assembly code in separate files, there are occasions where inline is the appropriate solution. Should one, at a time like this, turn to the GCC documentation for guidance, one must be prepared for a degree of disappointment. As it happens, much of the inline asm syntax is left entirely undocumented. This article attempts to fill in some of the blanks for the ARM target.

ARM compiler update

Since my last shootout, all the tested vendors have updated their compilers. Here is a quick update on each of them.

Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to 4.3.4 and 4.4.2, respectively. Neither update contains anything particularly noteworthy.

The CodeSourcery 2009q3 release sees an update to a GCC 4.4 base, a significant change from the 4.3 base used in 2009q1. The update is a mixed blessing. In fact, it is mostly a curse and hardly a blessing at all. On the bright side, the floating-point speed regressions in 2009q1 are gone, 2009q3 being a few per cent faster even than 2007q3. Unfortunately, this improvement is completely overshadowed by a major speed regression on integer code, a whopping 24% in one case. This ties in with the slowdown previously observed with FSF GCC 4.4 compared to 4.3.

ARM RVCT 4.0 is now at Build 697. This update fixes some bugs and introduces others. Notably, it no longer builds FFmpeg correctly. The issue has been reported to ARM.

Texas Instruments, finally, have made a formal release, v4.6.1, of their TMS470 compiler incorporating various fixes allowing it to build a moderately patched FFmpeg. The performance remains somewhere between GCC and RVCT on average.

In light of the above, my recommendations remain unchanged:

  • For a free compiler, choose CodeSourcery 2009q1. It beats GCC 4.3.4 by 5-10% in most cases.
  • GNU purists are best served by GCC 4.3.4, which is up to 20% faster than 4.4.2 and rarely slower.
  • When price is not a concern, ARM RVCT is a good option, outperforming GCC by up to a factor of 2.
  • In all cases, disable any auto-vectorisation features.

Regardless of which compiler is chosen, I cannot overstress the importance of testing. All compilers are crawling with bugs, and even the most innocent-looking code change can trigger one of them. When using a compiler other than GCC, extra caution is advised, since a lot of code is developed using only GCC and may thus fall prey to bugs unique to that other compiler.

Beware the builtins

GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various targets.

For my test, I selected the following functions:

  • __builtin_bswap32: Byte-swap a 32-bit word.
  • __builtin_bswap64: Byte-swap a 64-bit word.
  • __builtin_clz: Count leading zeros in a word.
  • __builtin_ctz: Count trailing zeros in a word.
  • __builtin_prefetch: Prefetch data into cache.

To test the quality of these builtins, I wrapped each in a normal function, then compiled the code for these targets:

  • ARMv7
  • AVR32
  • MIPS
  • MIPS64
  • PowerPC
  • PowerPC64
  • x86
  • x86_64

In all cases, the compiler flags were -O3 -fomit-frame-pointer plus any flags required to select a modern CPU model.
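
As a rough illustration of the kind of wrappers described above, here is a minimal sketch. The function names (swap32, swap64, count_leading_zeros, count_trailing_zeros, prefetch) are made up for this example and are not necessarily those used in the actual test; only the builtins themselves come from GCC. Note that the results of __builtin_clz and __builtin_ctz are undefined for a zero argument.

```c
#include <stdint.h>

/* Hypothetical wrapper names; each simply forwards to one GCC builtin
 * so that the generated code can be inspected in isolation. */

/* Byte-swap a 32-bit word. */
uint32_t swap32(uint32_t x)
{
    return __builtin_bswap32(x);
}

/* Byte-swap a 64-bit word. */
uint64_t swap64(uint64_t x)
{
    return __builtin_bswap64(x);
}

/* Count leading zeros in a word; undefined for x == 0. */
int count_leading_zeros(unsigned x)
{
    return __builtin_clz(x);
}

/* Count trailing zeros in a word; undefined for x == 0. */
int count_trailing_zeros(unsigned x)
{
    return __builtin_ctz(x);
}

/* Hint that the data at p will be read soon. */
void prefetch(const void *p)
{
    __builtin_prefetch(p);
}
```

Compiling a file like this with -S and the flags above should give one short assembly function per builtin, which makes it easy to compare the output across targets.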

ARM compiler shoot-out, round 2

In my recent test of ARM compilers, I had to leave out Texas Instruments’ compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present round two of this shoot-out.

The contenders this time were the fastest GCC variant from round one, ARM RVCT, and newcomer TI TMS470. With the same rules as last time, the exact versions and optimisation options were as follows:

  • CodeSourcery GCC 2009q1 (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
  • ARM RVCT 4.0 Build 591, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros
  • TI TMS470 4.7.0-a9229, -float_support=vfpv3 -mv=7a8 -O3 -mf=5
