Beware the builtins

GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various targets.

For my test, I selected the following functions:

  • __builtin_bswap32: Byte-swap a 32-bit word.
  • __builtin_bswap64: Byte-swap a 64-bit word.
  • __builtin_clz: Count leading zeros in a word.
  • __builtin_ctz: Count trailing zeros in a word.
  • __builtin_prefetch: Prefetch data into cache.

To test the quality of these builtins, I wrapped each in a normal function, then compiled the code for these targets:

  • ARMv7
  • AVR32
  • MIPS
  • MIPS64
  • PowerPC
  • PowerPC64
  • x86
  • x86_64

In all cases I used compiler flags were -O3 -fomit-frame-pointer plus any flags required to select a modern CPU model.

ARM

Both  __builtin_clz and __builtin_prefetch generate the expected CLZ and PLD instructions respectively. The code for __builtin_ctz is reasonable for ARMv6 and earlier:

rsb     r3, r0, #0
and     r0, r3, r0
clz     r0, r0
rsb     r0, r0, #31

For ARMv7 (in fact v6T2), however, using the new bit-reversal instruction would have been better:

rbit    r0, r0
clz     r0, r0

I suspect this is simply a matter of the function not yet having been updated for ARMv7, which is perhaps even excusable given the relatively rare use cases for it.

The byte-reversal functions are where it gets shocking. Rather than use the REV instruction found from ARMv6 on, both of them generate external calls to __bswapsi2 and __bswapdi2 in libgcc, which is plain C code:

SItype
__bswapsi2 (SItype u)
{
  return ((((u) & 0xff000000) >> 24)
          | (((u) & 0x00ff0000) >>  8)
          | (((u) & 0x0000ff00) <<  8)
          | (((u) & 0x000000ff) << 24));
}

DItype
__bswapdi2 (DItype u)
{
   return ((((u) & 0xff00000000000000ull) >> 56)
          | (((u) & 0x00ff000000000000ull) >> 40)
          | (((u) & 0x0000ff0000000000ull) >> 24)
          | (((u) & 0x000000ff00000000ull) >>  8)
          | (((u) & 0x00000000ff000000ull) <<  8)
          | (((u) & 0x0000000000ff0000ull) << 24)
          | (((u) & 0x000000000000ff00ull) << 40)
          | (((u) & 0x00000000000000ffull) << 56));
}

While the 32-bit version compiles to a reasonable-looking shift/mask/or job, the 64-bit one is a real WTF. Brace yourselves:

push    {r4, r5, r6, r7, r8, r9, sl, fp}
mov     r5, #0
mov     r6, #65280      ; 0xff00
sub     sp, sp, #40     ; 0x28
and     r7, r0, r5
and     r8, r1, r6
str     r7, [sp, #8]
str     r8, [sp, #12]
mov     r9, #0
mov     r4, r1
and     r5, r0, r9
mov     sl, #255        ; 0xff
ldr     r9, [sp, #8]
and     r6, r4, sl
mov     ip, #16711680   ; 0xff0000
str     r5, [sp, #16]
str     r6, [sp, #20]
lsl     r2, r0, #24
and     ip, ip, r1
lsr     r7, r4, #24
mov     r1, #0
lsr     r5, r9, #24
mov     sl, #0
mov     r9, #-16777216  ; 0xff000000
and     fp, r0, r9
lsr     r6, ip, #8
orr     r9, r7, r1
and     ip, r4, sl
orr     sl, r1, r2
str     r6, [sp]
str     r9, [sp, #32]
str     sl, [sp, #36]   ; 0x24
add     r8, sp, #32
ldm     r8, {r7, r8}
str     r1, [sp, #4]
ldm     sp, {r9, sl}
orr     r7, r7, r9
orr     r8, r8, sl
str     r7, [sp, #32]
str     r8, [sp, #36]   ; 0x24
mov     r3, r0
mov     r7, #16711680   ; 0xff0000
mov     r8, #0
and     r9, r3, r7
and     sl, r4, r8
ldr     r0, [sp, #16]
str     fp, [sp, #24]
str     ip, [sp, #28]
stm     sp, {r9, sl}
ldr     r7, [sp, #20]
ldr     sl, [sp, #12]
ldr     fp, [sp, #12]
ldr     r8, [sp, #28]
lsr     r0, r0, #8
orr     r7, r0, r7, lsl #24
lsr     r6, sl, #24
orr     r5, r5, fp, lsl #8
lsl     sl, r8, #8
mov     fp, r7
add     r8, sp, #32
ldm     r8, {r7, r8}
orr     r6, r6, r8
ldr     r8, [sp, #20]
ldr     r0, [sp, #24]
orr     r5, r5, r7
lsr     r8, r8, #8
orr     sl, sl, r0, lsr #24
mov     ip, r8
ldr     r0, [sp, #4]
orr     fp, fp, r5
ldr     r5, [sp, #24]
orr     ip, ip, r6
ldr     r6, [sp]
lsl     r9, r5, #8
lsl     r8, r0, #24
orr     fp, fp, r9
lsl     r3, r3, #8
orr     r8, r8, r6, lsr #8
orr     ip, ip, sl
lsl     r7, r6, #24
and     r5, r3, #16711680       ; 0xff0000
orr     r7, r7, fp
orr     r8, r8, ip
orr     r4, r1, r7
orr     r5, r5, r8
mov     r9, r6
mov     r1, r5
mov     r0, r4
add     sp, sp, #40     ; 0x28
pop     {r4, r5, r6, r7, r8, r9, sl, fp}
bx      lr

That’s right, 91 instructions to move 8 bytes around a bit. GCC definitely has a problem with 64-bit numbers. It is perhaps worth noting that the bswap_64 macro in glibc splits the 64-bit value into 32-bit halves which are then reversed independently, thus side-stepping this weakness of gcc.

As a side note, ARM RVCT (armcc) compiles those functions perfectly into one and two REV instructions, respectively.

AVR32

There is not much to report here. The latest gcc version available is 4.2.4, which doesn’t appear to have the bswap functions. The other three are handled nicely, even using a bit-reverse for __builtin_ctz.

MIPS / MIPS64

The situation MIPS is similar to ARM. Both bswap builtins result in external libgcc calls, the rest giving sensible code.

PowerPC

I scarcely believe my eyes, but this one is actually not bad. The PowerPC has no byte-reversal instructions, yet someone seems to have taken the time to teach gcc a good instruction sequence for this operation. The PowerPC does have some powerful rotate-and-mask instructions which come in handy here. First the 32-bit version:

rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
mr      r3,r0
blr

The 64-bit byte-reversal simply applies the above code on each half of the value:

rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
rotlwi  r3,r4,8
rlwimi  r3,r4,24,0,7
rlwimi  r3,r4,24,16,23
mr      r4,r0
blr

Although I haven’t analysed that code carefully, it looks pretty good.

PowerPC64

Doing 64-bit operations is easier on a 64-bit CPU, right? For you and me perhaps, but not for gcc. Here __builtin_bswap64 gives us the now familiar __bswapdi2 call, and while not as bad as the ARM version, it is not pretty:

rldicr  r0,r3,8,55
rldicr  r10,r3,56,7
rldicr  r0,r0,56,15
rldicl  r11,r3,8,56
rldicr  r9,r3,16,47
or      r11,r10,r11
rldicr  r9,r9,48,23
rldicl  r10,r0,24,40
rldicr  r0,r3,24,39
or      r11,r11,r10
rldicl  r9,r9,40,24
rldicr  r0,r0,40,31
or      r9,r11,r9
rlwinm  r10,r3,0,0,7
rldicl  r0,r0,56,8
or      r0,r9,r0
rldicr  r10,r10,8,55
rlwinm  r11,r3,0,8,15
or      r0,r0,r10
rldicr  r11,r11,24,39
rlwinm  r3,r3,0,16,23
or      r0,r0,r11
rldicr  r3,r3,40,23
or      r3,r0,r3
blr

That is 6 times longer than the (presumably) hand-written 32-bit version.

x86 / x86_64

As one might expect, results on x86 are good. All the tested functions use the available special instructions. One word of caution though: the bit-counting instructions are very slow on some implementations, specifically the Atom, AMD chips, and the notoriously slow Pentium4E.

Conclusion

In conclusion, I would say gcc builtins can be useful to avoid fragile inline assembler. Before using them, however, one should make sure they are not in fact harmful on the required targets. Not even those builtins mapping directly to CPU instructions can be trusted.

Bookmark the permalink.

7 Responses to Beware the builtins

  1. Daniel Jacobowitz says:

    FYI, ARM __builtin_ctz is done (http://gcc.gnu.org/ml/gcc-patches/2009-07/msg00467.html).

    bswap32 was posted at least twice, too (latest http://gcc.gnu.org/ml/gcc-patches/2009-06/msg01100.html). I’ve poked the maintainers about that one.

    As for bswap64, well… that’s just gross.

  2. Nils says:

    I like your two instruction __builtin_ctz version. Haven’t thought about using bit-reverse for this.. Nice!

  3. Reimar says:

    Every time the same mess:
    mov r5, #0
    and r7, r0, r5
    gcc desperately needs to learn not to do multiplies, ands and adds with 0.
    But I guess the truth is nobody cares about the operand size > native size case.

  4. Fredrik Andersson says:

    The powerpc actually do have bytereversal, although it can only operate on memory, lwbrx for example (thus useful when you load data)

    • Mans says:

      Byte-reversing load/store is, in my view, a different class of instruction. I would not say a machine “has byte-reversal” unless it can perform that operation on typical operands, which in the case of PPC means registers.

      I am pleased to see, however, that __builtin_bswap32 applied to a memory location does in fact make use of lwbrx.

  5. stevenb says:

    There was a patch to GCC last week that should make bswap32 should work now (although I have not checked):

    * config/arm/arm.md (bswapsi2): Add support for bswapsi2.
    (arm_rev): New.
    (arm_legacy_rev): Likewise.
    (thumb_legacy_rev): Likewise.

  6. stevenb says:

    The 64 bit variant compiles to fairly good code with “GCC 4.5.0 [trunk revision 156572]“. At -O2 the compiler produces the following for ARM7:

    __bswapdi2:
    	mov	r3, r0
    	eor	r2, r0, r0, ror #16
    	eor	r0, r1, r1, ror #16
    	mov	r2, r2, lsr #8
    	mov	r0, r0, lsr #8
    	bic	r2, r2, #65280
    	bic	r0, r0, #65280
    	eor	r0, r0, r1, ror #8
    	eor	r1, r2, r3, ror #8
    	bx	lr
    

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>