Bit-field badness

Consider the following C code which is based on an real-world situation.

struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i < n);
}

How would we best write this in ARM assembler? This is how I would do it:

func:
        ldr     r3,  [r0], #4
        tst     r3,  #1
        add     r3,  r3,  r2,  lsl #1
        strne   r3,  [r0, #-4]
        subs    r1,  r1,  #1
        bgt     func
        bx      lr

The add instruction is unconditional to avoid a dependency on the comparison. Unrolling the loop would mask the latency of the ldr instruction as well, but that is outside the scope of this experiment.

Now compile this code with gcc -march=armv5te -O3 and watch in horror:

func:
        push    {r4}
        mov     ip, #0
        mov     r4, r2
loop:
        ldrb    r3, [r0]
        add     ip, ip, #1
        tst     r3, #1
        ldrne   r3, [r0]
        andne   r2, r3, #1
        addne   r3, r4, r3, lsr #1
        orrne   r2, r2, r3, lsl #1
        strne   r2, [r0]
        cmp     ip, r1
        add     r0, r0, #4
        blt     loop
        pop     {r4}
        bx      lr

This is nothing short of awful:

  • The same value is loaded from memory twice.
  • A complicated mask/shift/or operation is used where a simple shifted add would suffice.
  • Write-back addressing is not used.
  • The loop control counts up and compares instead of counting down.
  • Useless mov in the prologue; swapping the roles or r2 and r4 would avoid this.
  • Using lr in place of r4 would allow the return to be done with pop {pc}, saving one instruction (ignoring for the moment that no callee-saved registers are needed at all).

Even for this trivial function the gcc-generated code is more than twice the optimal size and slower by approximately the same factor.

The main issue I wanted to illustrate is the poor handling of bit-fields by gcc. When accessing bitfields from memory, gcc issues a separate load for each field even when they are contained in the same aligned memory word. Although each load after the first will most likely hit L1 cache, this is still bad for several reasons:

  • Loads have typically two or three cycles result latency compared to one cycle for data processing instructions. Any bit-field can be extracted from a register with two shifts, and on ARM the second of these can generally be achieved using a shifted second operand to a following instruction. The ARMv6T2 instruction set also adds the SBFX and UBFX instructions for extracting any signed or unsigned bit-field in one cycle.
  • Most CPUs have more data processing units than load/store units. It is thus more likely for an ALU instruction than a load/store to issue without delay on a superscalar processor.
  • Redundant memory accesses can trigger early flushing of store buffers rendering these less efficient.

No gcc bashing is complete without a comparison with another compiler, so without further ado, here is the ARM RVCT output (armcc --cpu 5te -O3):

func:
        mov     r3, #0
        push    {r4, lr}
loop:
        ldr     ip, [r0, r3, lsl #2]
        tst     ip, #1
        addne   ip, ip, r2, lsl #1
        strne   ip, [r0, r3, lsl #2]
        add     r3, r3, #1
        cmp     r3, r1
        blt     loop
        pop     {r4, pc}

This is much better, the core loop using only one instruction more than my version. The loop control is counting up, but at least this register is reused as offset for the memory accesses. More remarkable is the push/pop of two registers that are never used. I had not expected to see this from RVCT.

Even the best compilers are still no match for a human.

Bookmark the permalink.

21 Responses to Bit-field badness

  1. dconrad says:

    llvm with -march=armv6 or above results in nearly the same version as RVCT, only moving the a<<1 outside the loop and not saving unused registers:

    func:
            mov     r2, r2, lsl #1
            mov     r3, #0
    .LBB1_1:
            ldr     r12, [r0, +r3, lsl #2]
            tst     r12, #1
            addne   r12, r12, r2
            strne   r12, [r0, +r3, lsl #2]
            add     r3, r3, #1
            cmp     r3, r1
            bxge    lr
            b       .LBB1_1
    

    but apparently it thinks the bitfield could be unaligned since -march=armv5te results in:

    func:
            stmfd   sp!, {r4, r5, r11, lr}
            mov     r2, r2, lsl #1
            mov     r3, #0
    .LBB1_1:
            mov     r12, r0
            ldrb    lr, [r12, +r3, lsl #2]!
            ldrb    r4, [r12, #+3]
            ldrb    r5, [r12, #+2]
            ldrb    r12, [r12, #+1]
            orr     r4, r5, r4, lsl #8
            orr     r12, lr, r12, lsl #8
            orr     r12, r12, r4, lsl #16
            tst     r12, #1
            beq     .LBB1_3
            add     r12, r12, r2
            mov     lr, r0
            strb    r12, [lr, +r3, lsl #2]!
            mov     r4, r12, lsr #24
            mov     r5, r12, lsr #16
            mov     r12, r12, lsr #8
            strb    r4, [lr, #+3]
            strb    r5, [lr, #+2]
            strb    r12, [lr, #+1]
    .LBB1_3:
            add     r3, r3, #1
            cmp     r3, r1
            ldmfdge sp!, {r4, r5, r11, pc}
            b       .LBB1_1
    
  2. Lee Doolan says:

    For code sections of, say, a few hundred lines (of assembler) or shorter a human can absolutely blow a compiler away.

    But when you start talking about programs that are thousands or hundreds of thousands of lines in length, an optimizing compiler can easily outdo a human.

    • Mans says:

      I beg to differ. Any programme can be broken down into small pieces on each of which, according to yourself, a human will beat the compiler.

      Compilers are useful mainly for two reasons:

      • Compilers are good enough that the effort saved by writing in a high-level language are usually worth a sacrifice in performance.
      • Writing in an HLL allows the code to be compiled for any target machine without modifications.

      The above notwithstanding, until something radical happens to compilers, there will remain areas where the gains from hand-optimisation make the extra effort and loss of portability well worthwhile. A good example is the FFT used in FFmpeg. It is probably the fastest C FFT available, yet the same algorithm hand-written in NEON assembler runs 12 times faster on a Cortex-A8. With the C implementation, the FFT completely dominates the execution time of many popular audio decoders, while the assembler version is barely noticed in profiling.

      As a reference point, FFmpeg contains roughly 7000 lines of ARM assembler (excluding comments and whitespace). In H.264 decoding, FFmpeg on Cortex-A8 spends 30-40% of the total execution time in hand-written assembler, the rest is C code. I am convinced writing the C parts in assembler would double the speed of that code, equivalent to a total speedup of approximately 40%. Much of this code is, of course, very complex, and writing it in assembler would be very difficult. As with all engineering, there is a trade-off between perfection and practicality.

      FFmpeg is probably an extreme example, both in the demands for processing power and in the returns from hand-optimisation. A more mundane application, such as an email client, spends most of the time waiting for the network or user input, and optimising it beyond the basics makes little sense.

      The point I want to get across is that anything can be optimised, the trick lying in identifying the parts which should be. The better the compiler is, the less code will require hand-optimising, and it is precisely small, trivial bits of code like the function above where compilers can realistically be improved. Code like this is also what makes up the bulk of the applications we use.

      • Lee Doolan says:

        I suppose that I expected the reader to do a little thinking on his own. For programs of tens or hundreds of thousands of lines in length, human programmers cannot reasonably devote the resources that it would take to optimize the entire program even if it were modularized into reasonably sized components. To optimize it, that is, so that is better than what a compiler could produce.

  3. Falaina says:

    While your code is probably faster a few comments: Does it really matter (As in are the gains in speed worth the loss in readability)? Additionally, gcc is FAR from one of the “best compilers”; moreso when you’re not generating x86{-64} assembly.

    • Mans says:

      For one piece of code that size it probably doesn’t matter. For a large application built from many such pieces I would argue that it does matter. Consider a web browser on a mobile device. Would you like the browser to use less of precious flash space? Would you like web pages to render quicker? Would you like the battery to last longer? I’d be surprised if you didn’t answer yes to at least one of those questions. A better compiler could give you all three.

      When I said “best compiler”, I was referring to RVCT, which is to my knowledge the best ARM compiler around. If you know of a better one, I’m all ears.

      • Phil says:

        Check out Green Hills Software’s compiler. One of their selling points is that they persistently beat ARM’s own compiler in code size.

        • Mans says:

          My most recent experience with a Green Hills compiler involved a week-long debugging session culminating in the discovery that the compiler would occasionally access memory below the bottom of the stack. An interrupt at an unfortunate time would result in the registers being dumped on the stack, and the wrong value would be loaded. This only happened when compiling without debugging symbols.

  4. Josh says:

    Thanks for showing me this. I’m going to use assembly for everything now.

  5. OK, I’ll bite. RVCT’s loop optimizer doesn’t like structures. Blame it
    (the optimizer) for not recognizing the fact that it is looking at an
    integer operation in this case. Notice that the loop gets transformed
    into a down counting loop.

    void func(struct bf1_31 *p, int n, int a)
    {
        int i;
    
        unsigned int *q = (unsigned int *)p;
    
        #pragma unroll(1) // prevent loop unrolling for comparison
        for (i=0; i<n; i++) {
            if (*q & 1)
                *q++ = *q + 2*a;
        }
    }
    

    This is hardly more portable than assembler (depends on bit-field
    layout) and took about ten times as long to come up with :-) but this
    is the compiler generated assembler code (RVCT, Build 697, -O3 -Otime):

    func PROC
            CMP      r1,#0
            BXLE     lr
    |L1.148|
            LDR      r3,[r0,#0]
            TST      r3,#1
            ADDNE    r3,r3,r2,LSL #1
            STRNE    r3,[r0],#4
            SUBS     r1,r1,#1
            BNE      |L1.148|
            BX       lr
            ENDP
    

    Regards
    Marcus

  6. Even the best compilers are still no match for a human.

    I think a fairer thing to say is “even the best compilers are still no match for the best humans, given infinite time and ingenuity.”

    I frequently am pleasantly surprised at how clever a compiler is being on my behalf. For example, gcc knows how to turn “n / 1000000UL;” into “(n * 1125899907) >> 50″. Maybe if you locked me in a room for an hour and demanded that I optimize that expression I would have come up with that. But the compiler thought of it for me in a fraction of a second.

    After all, everything a compiler does is something a human thought of first, so naturally a compiler isn’t going to outdo the smartest human with an expertise for the architecture. But it’s definitely going to outdo a lot of people. It’s also much less error-prone.

    Also, picking on GCC-on-ARM is like bullying the pipsqueak on the playground. :)

  7. Scott Graves says:

    The reason for the “unnecessary” saving of registers is probably because you are not using -fomit-frame-pointer, which means that all functions must have a stack frame (so they can be debugged and unwound, although DWARF2 debug/unwind information makes that unnecessary)

  8. DavidEarlexarm says:

    The beauty of ARM’s compiler is you can write portably – not relying on bitfield layout, MarcusHarnisch – and still get great code.

    void func(struct bf1_31 *p, int n, int a)
    {
        struct bf1_31 *pend= &p[n];
        do {
            struct bf1_31 *q;
            if ((q=p++)->a)
                q->b += a;
        } while (p < pend);
    }
    

    Using old RVCT3.1 build 569 produces:

    func PROC
            ADD      r12,r0,r1,LSL #2
            PUSH     {lr}
    |L1.8|
            MOV      r3,r0
            LDR      r1,[r0],#4
            TST      r1,#1
            ADDNE    r1,r1,r2,LSL #1
            STRNE    r1,[r3,#0]
            CMP      r0,r12
            BCC      |L1.8|
            POP      {pc}
            ENDP
    

    This same source even generates good –thumb code.

    I’m however mystified why it pushes LR and pops PC in a leaf function. Maybe the branch predictor
    influences cache victim selection to not immediately evict the calling code.

    But I think the compiler will push multiples of two registers for architecture v5te and up to keep the stack 8-byte aligned. For that I'd blame xScale’s LDRD/STRD.

    • mh says:

      Nice! You almost got me there. However, the compiler I am using (4.0, build 697) does not like your code. I mean it does compile it of course, but the instruction sequence is significantly worse than that of the pure integer based loop. And yes, I did fix the loop in my example.

      func_david PROC
              PUSH     {r4,r5}
              ADD      r12,r0,r1,LSL #2
              LDR      r3,[r0,#0]
              ADD      r1,r0,#4
              SUB      r12,r12,r1
              TST      r3,#1
              ADDNE    r3,r3,r2,LSL #1
              STRNE    r3,[r0,#0]
              MOV      r0,#1
              ADD      r4,r0,r12,ASR #2
              CMP      r4,#1
              BLE      |L1.316|
      |L1.280|
              MOV      r3,r1
              ADD      r0,r0,#1
              ADD      r1,r1,#4
              LDR      r12,[r3,#0]
              TST      r12,#1
              ADDNE    r12,r12,r2,LSL #1
              STRNE    r12,[r3,#0]
              CMP      r4,r0
              BGT      |L1.280|
      |L1.316|
              POP      {r4,r5}
              BX       lr
              ENDP
      

      The comment was:

      “hardwarebug_20100130.c”, line 58: #1636-D: Could not optimize: Complicated use of variable (q)

      Stack alignment is only necessary in non-leaf functions as you stated. So every once in a while you will see single registers pushed on the stack. Why the compiler thought this was needed in your example, I don’t know.

      Regards
      Marcus

      • mh says:

        I love replying to myself. It is perhaps interesting to know that RVCT’s loop optimizer seems to interfere with the structure access in this example. Switching back to -O2 -Otime (the loop optimizer is only enabled at -O3, AFAIK) gives me the same efficient code that David posted.

  9. stevenb says:

    I have filed this in GCC bugzilla as bug 42972.

  10. Pingback: Lenguaje ensamblador – Fuente: Wikipedia. « zarateblog.wordpress.com

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>