GCC inline asm annoyance

Doing some PowerPC work recently, I wanted to use the lwbrx instruction, which loads a little-endian word from memory. A simple asm statement wrapped in an inline function seemed like the easiest way to do this.

The lwbrx instruction comes with a minor limitation. It is only available in X-form, that is, the effective address is formed by adding the values of two register operands. Normal load instructions also have a D-form, which computes the effective address by adding an immediate offset to a register operand.

This means that my asm statement cannot use a normal “m” constraint for the memory operand, as this would allow GCC to use D-form addressing, which this instruction does not allow. I thus go in search of a special constraint to request X-form. GCC inline assembler supports a number of machine-specific constraints to cover situations like this one. To my dismay, the manual makes no mention of a suitable constraint to use.

Not giving up hope, I head for Google. Google always has answers. Almost always. None of the queries I can think of return a useful result. My quest finally comes to an end with the GCC machine description for PowerPC. This cryptic file suggests an (undocumented) “Z” constraint might work.

My first attempt at using the newly discovered “Z” constraint fails. The compiler still generates D-form address operands. Another examination of the machine description provides the answer. When referring to the operand, I must use %y0 in place of the usual %0. Needless to say, documentation explaining this syntax is nowhere to be found.

After spending the better part of an hour on a task I expected to take no more than five minutes, I finally arrive at a working solution:

static inline uint32_t load_le32(const uint32_t *p)
{
    uint32_t v;
    asm ("lwbrx %0, %y1" : "=r"(v) : "Z"(*p));
    return v;
}
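
For comparison, and as a way to check the asm version on any host, the same load can be written in plain C. This is only a sketch: the name load_le32_portable is hypothetical and not part of the original code, and it assembles the word byte by byte rather than using lwbrx, so it relies on the compiler to produce efficient code.

```c
#include <stdint.h>

/* Hypothetical portable fallback (not from the original post).
 * Builds the 32-bit value from individual bytes, lowest address first,
 * so the result is the same on big- and little-endian hosts. */
static inline uint32_t load_le32_portable(const void *p)
{
    const uint8_t *b = p;

    return (uint32_t)b[0]       |
           (uint32_t)b[1] <<  8 |
           (uint32_t)b[2] << 16 |
           (uint32_t)b[3] << 24;
}
```

Given the bytes 0x78, 0x56, 0x34, 0x12 in memory order, this returns 0x12345678, which is what load_le32() above should also return on PowerPC.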

ARM wish-list

Some time ago, I was asked for a multimedia hacker’s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list.

  1. Saturating unsigned+signed add/subtract.
    With the current instruction set, this operation requires six instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN) and two extra registers (one if optimal scheduling is not required) for 128-bit vectors. Furthermore, this is a frequently occurring operation, for instance in the H.264 loop filter.
  2. More registers.
    Having another, say, 8 vector registers would be very handy. Encoding this in the existing instructions would of course be tricky, if at all possible. A special VMOV and/or VSWP instruction to access the high registers would be an acceptable compromise, and would certainly be better than using scratch memory. An alternative option could be to make the high half of the existing register file banked. This could perhaps even be done in some clever way allowing the OS to skip save/restore of these registers for processes that never use them.
  3. 256-bit operations.
    8-element vectors are frequently used in video processing. One example is the ubiquitous 8×8 IDCT. In some instances, 32 bits per element are required in intermediate values to maintain adequate precision. The 8×8 IDCT is once again an example. In these cases, support for 8×32-bit vectors would clearly be an advantage.
  4. Vector sum.
    The sum of all elements in a vector is computed as a part of many algorithms, for instance anything involving a dot product and motion estimation in video encoding. Presently, the only option is to use a sequence of 3 or 4 VPADD instructions.
  5. Transposed load/store.
    When performing the same operation on each of a set of rows, one must load values row-wise into registers, and then transpose the registers before using the vector arithmetic instructions. When done computing, the values are again transposed before being stored row-wise. A set of load/store instructions transferring data between rows in memory and “columns” in the register file would save the cost of the transposing operations.
  6. Improved NEON to ARM transfer.
    On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles, during which time any normal access to the ARM register file will stall. This delay makes some potential use cases for NEON practically worthless. I am told this has been addressed in the almost-ready Cortex-A9.
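
To make items 1 and 4 concrete, here is a scalar C sketch of what the wished-for instructions would compute per element. The function names are hypothetical, and the element widths (unsigned 8-bit plus signed 8-bit for the saturating add, signed 16-bit for the sum) are assumptions chosen for illustration; the list above does not fix them.

```c
#include <stddef.h>
#include <stdint.h>

/* Item 1, per element: add a signed value to an unsigned one and
 * clamp the result to the unsigned range 0..255.
 * Element widths are an assumption made for this sketch. */
static inline uint8_t sat_add_u8_s8(uint8_t a, int8_t b)
{
    int sum = (int)a + (int)b;

    if (sum < 0)
        return 0;
    if (sum > 255)
        return 255;
    return (uint8_t)sum;
}

/* Item 4: the sum of all elements of a vector, accumulated in a
 * wider type so that the individual additions cannot overflow. */
static inline int32_t vector_sum_s16(const int16_t *v, size_t n)
{
    int32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```

For example, sat_add_u8_s8(250, 10) clamps to 255 and sat_add_u8_s8(5, -10) clamps to 0, while an ordinary wrapping add would give 4 and 251.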

On malice and stupidity

In my previous post, I attributed a quotation to one Robert J. Hanlon. This quotation, known as Hanlon’s Razor, deserves a little more attention.

Firstly, I altered the phrase slightly compared to its most common form, “Never attribute to malice that which can be adequately explained by stupidity,” substituting incompetence as the final word. I did this simply because I found this form more suitable in the context.

Secondly, the origin of this adage is disputable. A selection of alternatives follows.

  • In his 1980 book Murphy’s Law Book Two: More Reasons why Things Go Wrong!, Arthur Bloch credits Robert J. Hanlon as the creator, citing the above version.
  • Bill Clarke claims to have coined the phrase in 1974, in the story Axioms of a Mad Poet he published that year.
  • In the short story Logic of Empire (1941) by Robert A. Heinlein a similar phrase appears: “You have attributed conditions to villainy that simply result from stupidity.”
  • Napoleon Bonaparte allegedly uttered the words “Never ascribe to malice that which is adequately explained by incompetence,” although accurate references do not appear to exist.

Perhaps there is some truth to the saying that great minds think alike.

CodeSourcery’s defence

Having covered the spectacular failure of CodeSourcery’s latest ARM compiler a few days ago, I was engaged in a curious debate on IRC with one of their employees. Fiercely denying the problem at first, he eventually offered an explanation: they do not test the compiler output on real hardware; they use QEMU.

QEMU is a CPU emulator supporting a variety of targets. While great for casual development, and for running foreign applications, it is certainly no substitute for real hardware when testing a compiler. Like any piece of software, an emulator is bound to have a few errors, and as it happens, QEMU has known bugs in its handling of the NEON instruction set. Our friend at CodeSourcery should be well aware of these, also being a QEMU developer.

The use of emulators was explained as a necessity due to real hardware not being available. To be fair, CodeSourcery does develop against new hardware before it exists, so some reliance on emulators is unavoidable. This is, however, not the case this time. The Beagleboard was made available to selected developers quite some time ago (I have had one since May, others still longer), and is now being sold by the thousands. CodeSourcery developers, so I am told, were also given an offer of a free board, an offer they chose to refuse.

What does all this mean? Did Murphy decide to inflict maximum bad luck on the hard-working developers, or is there perhaps a larger conspiracy at work? I shall not attempt to speculate in this matter. I will merely repeat this excellent piece of advice given by Robert J. Hanlon: Never attribute to malice that which can be adequately explained by incompetence.

CodeSourcery GCC 2008q3: FAIL

A few days ago, CodeSourcery released their latest version of GCC for ARM, dubbed 2008q3. An announcement email boasts “Improved support for NEON and, in particular, auto-vectorization using NEON.” It is time to put that claim to the test.

FFmpeg has a history of triggering compiler bugs, making it a good test case. Some extra speed would do it good as well.

The new compiler builds FFmpeg without complaint, so everything is looking good so far. To check for any speedup from the improved compiler, I use an Indiana Jones trailer encoded with H.264. Disappointingly, I am unable to get any speed figures. The decoding stops after 160 frames, the immediate cause being an unaligned NEON load in a simple loop for copying a few bytes.

Is FFmpeg broken? The same code built with an older compiler release works perfectly, and the parameters passed to the failing function are similar-looking. The answer must lie in the copy loop itself. To verify this hypothesis, I set out to reproduce the error with a minimal test case.

The failure proves remarkably simple to trigger. The test case I arrive at consists of two C source files. The first file is our copy loop:

void copy(char *dst, char *src, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src[i];
}

The second file is our main() function, invoking the copy with suitably unaligned arguments:

extern void copy(char *dst, char *src, int len);
char src[20], dst[16];

int main(void)
{
    char *p = src + !((unsigned)src & 1);
    copy(dst, p, 16);
    return 0;
}

Compiling this with the flags -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -O3 results in a broken executable. Adding -fno-tree-vectorize makes the error go away.
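
The test case above only shows the failure as a crash. A self-checking variant could also catch a copy that completes but produces wrong bytes. The following is a sketch along those lines: the helper check_unaligned_copy and the buffer names are hypothetical, and everything is placed in one file for brevity, whereas the original keeps copy() in a separate file so that the compiler cannot see the alignment of its arguments.

```c
#include <stdint.h>
#include <string.h>

/* The same copy loop as in the test case above. */
void copy(char *dst, char *src, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src[i];
}

static char src_buf[20], dst_buf[16];

/* Hypothetical helper: copies 16 bytes from an odd address and
 * returns 0 if the destination matches the source, 1 otherwise. */
int check_unaligned_copy(void)
{
    char *p;
    int i;

    /* Fill the source with a recognisable pattern. */
    for (i = 0; i < (int)sizeof src_buf; i++)
        src_buf[i] = (char)(i + 1);

    /* Skip one byte if the buffer starts at an even address. */
    p = src_buf + !((uintptr_t)src_buf & 1);

    memset(dst_buf, 0, sizeof dst_buf);
    copy(dst_buf, p, 16);

    return memcmp(dst_buf, p, 16) != 0;
}
```

A main() that simply returns check_unaligned_copy() would then exit with status 0 from a correct compiler, and either fault or exit with a non-zero status from a broken one.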

So much for the improved auto-vectorisation.

Not testing every compiler on FFmpeg is understandable. Not testing even the most trivial of constructs is unforgivable.