Pointer peril

Use of pointers in the C programming language is subject to a number of constraints, violation of which results in the dreaded undefined behaviour. If a situation with undefined behaviour occurs, anything is permitted to happen. The program may produce unexpected results, crash, or demons may fly out of the user’s nose.

Some of these rules concern pointer arithmetic, addition and subtraction in which one or both operands are pointers. The C99 specification spells it out in section 6.5.6:

When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. […] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. […]

When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object; the result is the difference of the subscripts of the two array elements.

In simpler, if less accurate, terms, operands and results of pointer arithmetic must be within the same array object. If not, anything can happen.
(more…)

Shared library woes and the price of PIC

It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using MOVW/MOVT instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash.

When I pointed out this shortcoming to Paul Brook of CodeSourcery, his response was that such relocations in shared libraries are not supported by the GNU tools, will never be, and that shared libraries should be built with position-independent code (PIC). This is an unfortunate attitude, and doubly so considering that the latest CodeSourcery GCC version will generate these instructions with default settings. In other words, the 2008q3 release of CodeSourcery GCC will, with default flags, build crashing shared libraries without so much as a warning.

The refusal to support non-PIC shared libraries is unfortunate also from a performance point of view. Position independent code is inherently slower than normal code.

In order to find out just how much slower PIC is on ARM, I made two builds of FFmpeg, one normal and one with PIC. The PIC build is about 1.7% slower in several tests, among them H.264 video decoding.

On typically resource-constrained ARM systems it would be nice to have the option of space-saving shared libraries without paying the PIC penalty in performance. Until now this option has been a reality. With CodeSourcery lazily refusing to support the relocations required by the latest version of their own compiler, this option may soon be a thing of the past, at least if the bugs that have haunted recent compiler releases are fixed in upcoming versions.

CodeSourcery fails again

The bug I discovered in CodeSourcery’s 2008q3 release of their GCC version was apparently deemed serious enough for the company to publish an updated release, tagged 2008q3-72, earlier this week. I took it for a test drive.

Since last time, I have updated the FFmpeg regression test scripts, enabling a cross-build to be easily tested on the target device. For the compiler test this means that much more code will be checked for correct operation compared to the rather limited tests I performed on previous versions. Having verified all tests passing when built with the 2007q3 release, I proceeded with the new 2008q3-72 compiler.

All but one of the FFmpeg regression tests passed. Converting a colour image to 1-bit monochrome format failed. A few minutes of detective work revealed the erroneous code, and a simple test case was easily extracted.

The test case looks strikingly familiar:

extern unsigned char dst[512] __attribute__((aligned(8)));
extern unsigned char src[512] __attribute__((aligned(8)));

void array_shift(void)
{
    int i;
    for (i = 0; i < 512; i++)
        dst[i] = src[i] >> 7;
}

(more…)

CodeSourcery’s defence

Having covered the spectacular failure of CodeSourcery’s latest ARM compiler a few days ago, I was engaged in a curious debate on IRC with one of their employees. Fiercely denying the problem at first, he eventually offered an explanation: they do not test the compiler output on real hardware; they use QEMU.

QEMU is a CPU emulator supporting a variety of targets. While great for casual development, and for running foreign applications, it is certainly no substitute for real hardware when testing a compiler. Like any piece of software, an emulator is bound to have a few errors, and as it happens, QEMU has known bugs in its handling of the NEON instruction set. Our friend at CodeSourcery should be well aware of these, also being a QEMU developer.

The use of emulators was explained as a necessity due to real hardware not being available. To be fair, CodeSourcery does develop against new hardware before it exists, so some reliance on emulators is unavoidable. This is, however, not the case this time. The Beagleboard was made available to selected developers quite some time ago (I have had one since May, others still longer), and is now being sold by the thousands. CodeSourcery developers, so I am told, were also given an offer of a free board, an offer they chose to refuse.

What does all this mean? Did Murphy decide to inflict maximum bad luck on the hard-working developers, or is there perhaps a larger conspiracy at work? I shall not attempt to speculate in this matter. I will merely repeat this excellent piece of advice given by Robert J. Hanlon: Never attribute to malice that which can be adequately explained by incompetence.

CodeSourcery GCC 2008q3: FAIL

A few days ago, CodeSourcery released their latest version of GCC for ARM, dubbed 2008q3. An announcement email boasts “Improved support for NEON and, in particular, auto-vectorization using NEON.” It is time to put that claim to the test.

FFmpeg has a history of triggering compiler bugs, making it a good test case. Some extra speed would do it good as well.

The new compiler builds FFmpeg without complaint, so everything is looking good so far. To check for any speedup from the improved compiler, I use an Indiana Jones trailer encoded with H.264. Disappointingly, I am unable to get any speed figures. The decoding stops after 160 frames, the immediate cause being an unaligned NEON load in simple loop for copying a few bytes.

Is FFmpeg broken? The same code built with an older compiler release works perfectly, and the parameters passed to the failing function are similar-looking. The answer must lie in the copy loop itself. To verify this hypothesis, I set out to reproduce the error with a minimal test case.

The failure proves remarkably simple to trigger. The test case I arrive at consists of two C source files. The first file is our copy loop:

void copy(char *dst, char *src, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src[i];
}

The second file is our main() function, invoking the copy with suitably unaligned arguments:

extern void copy(char *dst, char *src, int len);
char src[20], dst[16];

int main(void)
{
    char *p = src + !((unsigned)src & 1);
    copy(dst, p, 16);
    return 0;
}

Compiling this with -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -O3 flags results in a broken executable. Adding -fno-tree-vectorize makes the error go away.

So much for the improved auto-vectorisation.

Not testing every compiler on FFmpeg is understandable. Not testing even the most trivial of constructs is unforgivable.