Consider the following C code, based on a real-world situation.
struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i < n);
}
How would we best write this in ARM assembler? This is how I would do it:
func:
        ldr     r3, [r0], #4
        tst     r3, #1
        add     r3, r3, r2, lsl #1
        strne   r3, [r0, #-4]
        subs    r1, r1, #1
        bgt     func
        bx      lr
The add instruction is unconditional to avoid a dependency on the comparison. Unrolling the loop would mask the latency of the ldr instruction as well, but that is outside the scope of this experiment.
Now compile this code with gcc -march=armv5te -O3 and watch in horror:
func:
        push    {r4}
        mov     ip, #0
        mov     r4, r2
loop:
        ldrb    r3, [r0]
        add     ip, ip, #1
        tst     r3, #1
        ldrne   r3, [r0]
        andne   r2, r3, #1
        addne   r3, r4, r3, lsr #1
        orrne   r2, r2, r3, lsl #1
        strne   r2, [r0]
        cmp     ip, r1
        add     r0, r0, #4
        blt     loop
        pop     {r4}
        bx      lr
This is nothing short of awful:
- The same value is loaded from memory twice.
- A complicated mask/shift/or operation is used where a simple shifted add would suffice.
- Write-back addressing is not used.
- The loop control counts up and compares instead of counting down.
- Useless mov in the prologue; swapping the roles of r2 and r4 would avoid this.
- Using lr in place of r4 would allow the return to be done with pop {pc}, saving one instruction (ignoring for the moment that no callee-saved registers are needed at all).
Even for this trivial function the gcc-generated code is more than twice the optimal size and slower by approximately the same factor.
The main issue I wanted to illustrate is the poor handling of bit-fields by gcc. When accessing bitfields from memory, gcc issues a separate load for each field even when they are contained in the same aligned memory word. Although each load after the first will most likely hit L1 cache, this is still bad for several reasons:
- Loads typically have a result latency of two or three cycles, compared to one cycle for data-processing instructions. Any bit-field can be extracted from a register with two shifts, and on ARM the second of these can generally be folded into the shifted second operand of a following instruction. The ARMv6T2 instruction set also adds the SBFX and UBFX instructions for extracting any signed or unsigned bit-field in one cycle. (A sketch of the two-shift extraction follows this list.)
- Most CPUs have more data processing units than load/store units. It is thus more likely for an ALU instruction than a load/store to issue without delay on a superscalar processor.
- Redundant memory accesses can trigger early flushing of store buffers, rendering them less efficient.
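Here is a minimal sketch of the two-shift extraction, assuming the ARM ABI layout of our struct (a in bit 0, b in bits 1-31) and an arithmetic right shift for signed values:

#include <stdint.h>

/* Both fields of struct bf1_31 share one 32-bit word under the
 * ARM ABI: a in bit 0, b in bits 1..31. A single load recovers both. */
static unsigned get_a(uint32_t w) { return w & 1; }
static unsigned get_b(uint32_t w) { return w >> 1; }

/* A signed bit-field in bits lo..hi can be extracted with two shifts:
 * left to move bit hi up to bit 31, then an arithmetic right shift
 * (implementation-defined in C, but what ARM compilers provide). */
static int32_t extract_signed(uint32_t w, int lo, int hi)
{
    return (int32_t)(w << (31 - hi)) >> (31 - hi + lo);
}

The second shift is exactly the kind of operation that folds into the shifted operand of a following ARM data-processing instruction.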
No gcc bashing is complete without a comparison with another compiler, so without further ado, here is the ARM RVCT output (armcc --cpu 5te -O3):
func:
        mov     r3, #0
        push    {r4, lr}
loop:
        ldr     ip, [r0, r3, lsl #2]
        tst     ip, #1
        addne   ip, ip, r2, lsl #1
        strne   ip, [r0, r3, lsl #2]
        add     r3, r3, #1
        cmp     r3, r1
        blt     loop
        pop     {r4, pc}
This is much better, the core loop using only one instruction more than my version. The loop control is counting up, but at least this register is reused as offset for the memory accesses. More remarkable is the push/pop of two registers that are never used. I had not expected to see this from RVCT.
Even the best compilers are still no match for a human.
llvm with -march=armv6 or above produces nearly the same code as RVCT, only moving the a << 1 outside the loop and not saving unused registers:
but apparently it thinks the bitfield could be unaligned since -march=armv5te results in:
The ARM ABI specifies that a bit-field shall be contained within one naturally aligned instance of its declared type. If the bit-field is wider than the declared type special rules apply. In our test case, the alignment is clearly 4 bytes.
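A quick compile-time check of this (a sketch using C11's static_assert; the values follow from the ABI rule above):

#include <assert.h>

struct bf1_31 { unsigned a:1; unsigned b:31; };

/* Both bit-fields fit in one naturally aligned unsigned int, so the
 * struct is a single 4-byte word with 4-byte alignment. */
static_assert(sizeof(struct bf1_31) == 4, "one 32-bit container");
static_assert(_Alignof(struct bf1_31) == 4, "naturally aligned");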
I figured. Turns out it’s due to llvm-gcc’s lowering of bitfields into llvm-ir and probably affects all targets that don’t support unaligned loads; clang gets it right.
Filed http://llvm.org/bugs/show_bug.cgi?id=6185 so we’ll see.
For code sections of, say, a few hundred lines (of assembler) or shorter, a human can absolutely blow a compiler away.
But when you start talking about programs that are thousands or hundreds of thousands of lines in length, an optimizing compiler can easily outdo a human.
I beg to differ. Any programme can be broken down into small pieces on each of which, according to yourself, a human will beat the compiler.
Compilers are useful mainly for two reasons:
The above notwithstanding, until something radical happens to compilers, there will remain areas where the gains from hand-optimisation make the extra effort and loss of portability well worthwhile. A good example is the FFT used in FFmpeg. It is probably the fastest C FFT available, yet the same algorithm hand-written in NEON assembler runs 12 times faster on a Cortex-A8. With the C implementation, the FFT completely dominates the execution time of many popular audio decoders, while the assembler version is barely noticed in profiling.
As a reference point, FFmpeg contains roughly 7000 lines of ARM assembler (excluding comments and whitespace). In H.264 decoding, FFmpeg on Cortex-A8 spends 30-40% of the total execution time in hand-written assembler; the rest is C code. I am convinced that writing the C parts in assembler would double the speed of that code: halving the time spent in C would cut the total to roughly 70%, equivalent to a total speedup of approximately 40%. Much of this code is, of course, very complex, and writing it in assembler would be very difficult. As with all engineering, there is a trade-off between perfection and practicality.
FFmpeg is probably an extreme example, both in the demands for processing power and in the returns from hand-optimisation. A more mundane application, such as an email client, spends most of the time waiting for the network or user input, and optimising it beyond the basics makes little sense.
The point I want to get across is that anything can be optimised, the trick lying in identifying the parts which should be. The better the compiler is, the less code will require hand-optimising, and it is precisely small, trivial bits of code like the function above where compilers can realistically be improved. Code like this is also what makes up the bulk of the applications we use.
I suppose that I expected the reader to do a little thinking on his own. For programs of tens or hundreds of thousands of lines, human programmers cannot reasonably devote the resources it would take to optimize the entire program, even if it were modularized into reasonably sized components; to optimize it, that is, so that it is better than what a compiler could produce.
While your code is probably faster, a few comments: does it really matter (as in, are the gains in speed worth the loss in readability)? Additionally, gcc is FAR from one of the “best compilers”, more so when you’re not generating x86(-64) assembly.
For one piece of code that size it probably doesn’t matter. For a large application built from many such pieces I would argue that it does matter. Consider a web browser on a mobile device. Would you like the browser to use less of precious flash space? Would you like web pages to render quicker? Would you like the battery to last longer? I’d be surprised if you didn’t answer yes to at least one of those questions. A better compiler could give you all three.
When I said “best compiler”, I was referring to RVCT, which is to my knowledge the best ARM compiler around. If you know of a better one, I’m all ears.
Check out Green Hills Software’s compiler. One of their selling points is that they persistently beat ARM’s own compiler in code size.
My most recent experience with a Green Hills compiler involved a week-long debugging session culminating in the discovery that the compiler would occasionally access memory below the bottom of the stack. An interrupt at an unfortunate time would result in the registers being dumped on the stack, and the wrong value would be loaded. This only happened when compiling without debugging symbols.
Thanks for showing me this. I’m going to use assembly for everything now.
I hope you’re joking.
OK, I’ll bite. RVCT’s loop optimizer doesn’t like structures. Blame it (the optimizer) for not recognizing that it is looking at an integer operation in this case. Notice that the loop gets transformed into a down-counting loop.
This is hardly more portable than assembler (it depends on bit-field layout) and took about ten times as long to come up with :-) but this is the compiler-generated assembler code (RVCT, Build 697, -O3 -Otime):
Regards
Marcus
“Even the best compilers are still no match for a human.”
I think a fairer thing to say is “even the best compilers are still no match for the best humans, given infinite time and ingenuity.”
I frequently am pleasantly surprised at how clever a compiler is being on my behalf. For example, gcc knows how to turn “n / 1000000UL;” into “(n * 1125899907) >> 50”. Maybe if you locked me in a room for an hour and demanded that I optimize that expression I would have come up with that. But the compiler thought of it for me in a fraction of a second.
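For the curious, 1125899907 is ceil(2^50 / 10^6), and the rounding error is small enough that the shifted 64-bit product matches the true quotient for every 32-bit n. A hypothetical self-check of that identity (just the arithmetic, not gcc's actual machinery):

#include <assert.h>
#include <stdint.h>

/* Division by a constant via reciprocal multiplication:
 * n / 1000000 == (n * 1125899907) >> 50 for all 32-bit n,
 * using a 64-bit intermediate product. */
static uint32_t div_1000000(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 1125899907u) >> 50);
}

int main(void)
{
    /* Sparse sweep of the 32-bit range. */
    for (uint64_t n = 0; n <= UINT32_MAX; n += 9973)
        assert(div_1000000((uint32_t)n) == (uint32_t)n / 1000000u);
    return 0;
}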
After all, everything a compiler does is something a human thought of first, so naturally a compiler isn’t going to outdo the smartest human with expertise in the architecture. But it’s definitely going to outdo a lot of people. It’s also much less error-prone.
Also, picking on GCC-on-ARM is like bullying the pipsqueak on the playground. :)
The reason for the “unnecessary” saving of registers is probably that you are not using -fomit-frame-pointer, which means that all functions must have a stack frame (so they can be debugged and unwound, although DWARF2 debug/unwind information makes that unnecessary).
No, that is not the case. That flag has no effect on ARM.
The beauty of ARM’s compiler is you can write portably – not relying on bitfield layout, MarcusHarnisch – and still get great code.
Using old RVCT3.1 build 569 produces:
This same source even generates good --thumb code.
I’m however mystified why it pushes LR and pops PC in a leaf function. Maybe the branch predictor influences cache victim selection to not immediately evict the calling code.
But I think the compiler will push multiples of two registers for architecture v5te and up, to keep the stack 8-byte aligned. For that I’d blame XScale’s LDRD/STRD.
Nice! You almost got me there. However, the compiler I am using (4.0, build 697) does not like your code. I mean it does compile it of course, but the instruction sequence is significantly worse than that of the pure integer based loop. And yes, I did fix the loop in my example.
The comment was:
“hardwarebug_20100130.c”, line 58: #1636-D: Could not optimize: Complicated use of variable (q)
Stack alignment is only necessary in non-leaf functions, as you stated. So every once in a while you will see single registers pushed on the stack. Why the compiler thought this was needed in your example, I don’t know.
Regards
Marcus
I love replying to myself. It is perhaps interesting to know that RVCT’s loop optimizer seems to interfere with the structure access in this example. Switching back to -O2 -Otime (the loop optimizer is only enabled at -O3, AFAIK) gives me the same efficient code that David posted.
I have filed this in GCC bugzilla as bug 42972.