Pointer peril

Use of pointers in the C programming language is subject to a number of constraints, violation of which results in the dreaded undefined behaviour. If a situation with undefined behaviour occurs, anything is permitted to happen. The program may produce unexpected results, crash, or demons may fly out of the user’s nose.

Some of these rules concern pointer arithmetic, addition and subtraction in which one or both operands are pointers. The C99 specification spells it out in section 6.5.6:

When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. […] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. […]

When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object; the result is the difference of the subscripts of the two array elements.

In simpler, if less accurate, terms, operands and results of pointer arithmetic must be within the same array object. If not, anything can happen.
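To make the rule concrete, here is a minimal sketch of my own (not from the standard or the original post) contrasting expressions the rule permits with ones that are undefined, even though nothing is ever dereferenced:

    #include <stddef.h>

    int a[10];
    int b[10];

    void pointer_examples(void)
    {
        int *end = a + 10;        /* OK: one past the last element of a */
        ptrdiff_t n = end - a;    /* OK: both point into (or one past) a; n == 10 */

        int *bad = a + 11;        /* undefined: more than one past the end */
        ptrdiff_t m = end - b;    /* undefined: operands belong to different array objects */

        (void)n; (void)bad; (void)m;
    }

Merely forming the out-of-range pointer or subtracting pointers into different objects is enough to trigger undefined behaviour; no dereference is required.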

GCC makes a mess

Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems, with the exception of 32-bit PowerPC. On the PPC, MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.

A session with oprofile exposed multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form a += b * c, where b and c are 32-bit values and a is a 64-bit accumulator. GCC does not handle 64-bit maths on a 32-bit CPU well, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.
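For illustration, this is roughly the shape of the computation; it is my own sketch, not the actual FFmpeg code nor the compiled examples referred to above. Each step is a 32×32→64-bit multiply added to a 64-bit accumulator:

    #include <stdint.h>

    /* Hypothetical stand-in for the multiply-accumulate pattern described above. */
    static int64_t mac64(int64_t a, int32_t b, int32_t c)
    {
        return a + (int64_t)b * c;   /* 32x32 -> 64-bit multiply, 64-bit add */
    }

    int64_t dot64(const int32_t *v1, const int32_t *v2, int n)
    {
        int64_t sum = 0;
        for (int i = 0; i < n; i++)
            sum = mac64(sum, v1[i], v2[i]);
        return sum;
    }

The hardware support is there: on PPC the multiply can be done with a mullw/mulhw pair and the accumulation with addc/adde, and ARM even has a single smlal instruction for the whole operation. The complaint above is that GCC fails to make effective use of it.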

Thumbs up

ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.

Thumb-2 performance is claimed to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with FFmpeg as the target and compiled the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.

The Thumb-2 executable is 85% of the size of the ARM one, which, although a substantial reduction, falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, one H.264 video clip by 4%. This speedup is most likely due to reduced I-cache pressure: Thumb-2 and ARM instructions are executed identically after the initial decode stage, so the change of instruction set alone cannot account for any improvement. Only one test, MP3 audio decoding, was significantly slower (15%) than the ARM code.

In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.

Shared library woes and the price of PIC

It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using MOVW/MOVT instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash.

When I pointed out this shortcoming to Paul Brook of CodeSourcery, his response was that such relocations in shared libraries are not supported by the GNU tools, will never be, and that shared libraries should be built with position-independent code (PIC). This is an unfortunate attitude, and doubly so considering that the latest CodeSourcery GCC version will generate these instructions with default settings. In other words, the 2008q3 release of CodeSourcery GCC will, with default flags, build crashing shared libraries without so much as a warning.

The refusal to support non-PIC shared libraries is also unfortunate from a performance point of view. Position-independent code is inherently slower than normal code: global data has to be reached indirectly through the global offset table (GOT), and a register is typically tied up holding its address.
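As a rough illustration of my own (not taken from the post), consider what a plain global-variable access costs in each mode:

    /* Global data access, non-PIC vs. PIC (ARM, roughly):
     *
     *   non-PIC:  the address of 'counter' can be materialised directly,
     *             e.g. with a movw/movt pair on ARMv7, then loaded from.
     *
     *   PIC:      the address must first be loaded from the global offset
     *             table (GOT), adding an extra memory access and usually
     *             occupying a register with the GOT base.
     */
    int counter;

    int bump(void)
    {
        return ++counter;
    }

Compiling a file like this with and without -fPIC (for instance gcc -O2 -S against gcc -O2 -fPIC -S) and comparing the assembler output shows the extra indirection.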

In order to find out just how much slower PIC is on ARM, I made two builds of FFmpeg, one normal and one with PIC. The PIC build is about 1.7% slower in several tests, among them H.264 video decoding.

On typically resource-constrained ARM systems it would be nice to have the option of space-saving shared libraries without paying the PIC penalty in performance. Until now this option has been a reality. With CodeSourcery lazily refusing to support the relocations required by the latest version of their own compiler, this option may soon be a thing of the past, at least if the bugs that have haunted recent compiler releases are fixed in upcoming versions.