ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.
Thumb-2 performance is claimed to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with FFmpeg as the target and compiled the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.
The Thumb-2 executable is 85% of the ARM one in size, which although being a substantial reduction falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) compared to ARM code. The speedup is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.
In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.
It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using MOVW/MOVT instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash.
When I pointed out this shortcoming to Paul Brook of CodeSourcery, his response was that such relocations in shared libraries are not supported by the GNU tools, will never be, and that shared libraries should be built with position-independent code (PIC). This is an unfortunate attitude, and doubly so considering that the latest CodeSourcery GCC version will generate these instructions with default settings. In other words, the 2008q3 release of CodeSourcery GCC will, with default flags, build crashing shared libraries without so much as a warning.
The refusal to support non-PIC shared libraries is unfortunate also from a performance point of view. Position independent code is inherently slower than normal code.
In order to find out just how much slower PIC is on ARM, I made two builds of FFmpeg, one normal and one with PIC. The PIC build is about 1.7% slower in several tests, among them H.264 video decoding.
On typically resource-constrained ARM systems it would be nice to have the option of space-saving shared libraries without paying the PIC penalty in performance. Until now this option has been a reality. With CodeSourcery lazily refusing to support the relocations required by the latest version of their own compiler, this option may soon be a thing of the past, at least if the bugs that have haunted recent compiler releases are fixed in upcoming versions.
The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.
Although clearly important with a view to code optimisation, the Cortex-A8 Technical Reference Manual unfortunately does not mention any details about these hazards. In fact, it does not mention them at all.
To sched some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.
The bug I discovered in CodeSourcery’s 2008q3 release of their GCC version was apparently deemed serious enough for the company to publish an updated release, tagged 2008q3-72, earlier this week. I took it for a test drive.
Since last time, I have updated the FFmpeg regression test scripts, enabling a cross-build to be easily tested on the target device. For the compiler test this means that much more code will be checked for correct operation compared to the rather limited tests I performed on previous versions. Having verified all tests passing when built with the 2007q3 release, I proceeded with the new 2008q3-72 compiler.
All but one of the FFmpeg regression tests passed. Converting a colour image to 1-bit monochrome format failed. A few minutes of detective work revealed the erroneous code, and a simple test case was easily extracted.
The test case looks strikingly familiar:
extern unsigned char dst __attribute__((aligned(8)));
extern unsigned char src __attribute__((aligned(8)));
for (i = 0; i < 512; i++)
dst[i] = src[i] >> 7;
Some time ago, I was asked for a multimedia hacker’s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list.
- Saturating unsigned+signed add/subtract.
With the current instruction set, this operation requires six instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN) and two extra registers (one if optimal scheduling is not required) for 128-bit vectors. Furthermore, this is a frequently occuring operation, for instance in the H.264 loop filter.
- More registers.
Having another, say, 8 vector registers would be very handy. Encoding this in the existing instructions would of course be tricky, if at all possible. A special VMOV and/or VSWP instruction to access the high registers would be an acceptable compromise, and would certainly be better than using scratch memory. An alternative option could be to make the high half of the existing register file banked. This could perhaps even be done in some clever way allowing the OS to skip save/restore of these registers for processes that never use them.
- 256-bit operations.
8-element vectors are frequently used in video processing. One example is the ubiquitous 8×8 IDCT. In some instances, 32 bits per element are required in intermediate values to maintain adequate precision. The 8×8 IDCT is once again an example. In these cases, support for 8×32-bit vectors would clearly be an advantage.
- Vector sum.
The sum of all elements in a vector is computed as a part of many algorithms, for instance anything involving a dot product and motion estimation in video encoding. Presently, the only option is to use a sequence of 3 or 4 VPADD instructions.
- Transposed load/store.
When performing the same operation on each of a set of rows, one must load values row-wise into registers, and then transpose the registers before using the vector arithmetic instructions. When done computing, the values are again transposed before being stored row-wise. A set of load/store instructions transferring data between rows in memory and “columns” in the register file would save the cost of the transposing operations.
- Improved NEON to ARM transfer.
On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles, during which time any normal access to the ARM register file will stall. This delay makes some potential use cases for NEON practically worthless. I am told this has been addressed in the almost-ready Cortex-A9.