Jan
2
2009
It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using MOVW/MOVT instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash.
When I pointed out this shortcoming to Paul Brook of CodeSourcery, his response was that such relocations in shared libraries are not supported by the GNU tools, will never be, and that shared libraries should be built with position-independent code (PIC). This is an unfortunate attitude, and doubly so considering that the latest CodeSourcery GCC version will generate these instructions with default settings. In other words, the 2008q3 release of CodeSourcery GCC will, with default flags, build crashing shared libraries without so much as a warning.
The refusal to support non-PIC shared libraries is unfortunate also from a performance point of view. Position independent code is inherently slower than normal code.
In order to find out just how much slower PIC is on ARM, I made two builds of FFmpeg, one normal and one with PIC. The PIC build is about 1.7% slower in several tests, among them H.264 video decoding.
On typically resource-constrained ARM systems it would be nice to have the option of space-saving shared libraries without paying the PIC penalty in performance. Until now this option has been a reality. With CodeSourcery lazily refusing to support the relocations required by the latest version of their own compiler, this option may soon be a thing of the past, at least if the bugs that have haunted recent compiler releases are fixed in upcoming versions.
no comments | posted in ARM, Bugs, Compilers, Optimisation
Dec
31
2008
The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.
Although clearly important with a view to code optimisation, the Cortex-A8 Technical Reference Manual unfortunately does not mention any details about these hazards. In fact, it does not mention them at all.
To sched some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.
Continue reading
4 comments | posted in ARM, Optimisation
Nov
28
2008
The bug I discovered in CodeSourcery’s 2008q3 release of their GCC version was apparently deemed serious enough for the company to publish an updated release, tagged 2008q3-72, earlier this week. I took it for a test drive.
Since last time, I have updated the FFmpeg regression test scripts, enabling a cross-build to be easily tested on the target device. For the compiler test this means that much more code will be checked for correct operation compared to the rather limited tests I performed on previous versions. Having verified all tests passing when built with the 2007q3 release, I proceeded with the new 2008q3-72 compiler.
All but one of the FFmpeg regression tests passed. Converting a colour image to 1-bit monochrome format failed. A few minutes of detective work revealed the erroneous code, and a simple test case was easily extracted.
The test case looks strikingly familiar:
extern unsigned char dst[512] __attribute__((aligned(8)));
extern unsigned char src[512] __attribute__((aligned(8)));
void array_shift(void)
{
int i;
for (i = 0; i < 512; i++)
dst[i] = src[i] >> 7;
}
Continue reading
4 comments | posted in ARM, Bugs, Compilers
Oct
19
2008
Some time ago, I was asked for a multimedia hacker’s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list.
- Saturating unsigned+signed add/subtract.
With the current instruction set, this operation requires six instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN) and two extra registers (one if optimal scheduling is not required) for 128-bit vectors. Furthermore, this is a frequently occuring operation, for instance in the H.264 loop filter.
- More registers.
Having another, say, 8 vector registers would be very handy. Encoding this in the existing instructions would of course be tricky, if at all possible. A special VMOV and/or VSWP instruction to access the high registers would be an acceptable compromise, and would certainly be better than using scratch memory. An alternative option could be to make the high half of the existing register file banked. This could perhaps even be done in some clever way allowing the OS to skip save/restore of these registers for processes that never use them.
- 256-bit operations.
8-element vectors are frequently used in video processing. One example is the ubiquitous 8×8 IDCT. In some instances, 32 bits per element are required in intermediate values to maintain adequate precision. The 8×8 IDCT is once again an example. In these cases, support for 8×32-bit vectors would clearly be an advantage.
- Vector sum.
The sum of all elements in a vector is computed as a part of many algorithms, for instance anything involving a dot product and motion estimation in video encoding. Presently, the only option is to use a sequence of 3 or 4 VPADD instructions.
- Transposed load/store.
When performing the same operation on each of a set of rows, one must load values row-wise into registers, and then transpose the registers before using the vector arithmetic instructions. When done computing, the values are again transposed before being stored row-wise. A set of load/store instructions transferring data between rows in memory and “columns” in the register file would save the cost of the transposing operations.
- Improved NEON to ARM transfer.
On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles, during which time any normal access to the ARM register file will stall. This delay makes some potential use cases for NEON practically worthless. I am told this has been addressed in the almost-ready Cortex-A9.
1 comment | posted in ARM
Oct
14
2008
Having covered the spectacular failure of CodeSourcery’s latest ARM compiler a few days ago, I was engaged in a curious debate on IRC with one of their employees. Fiercely denying the problem at first, he eventually offered an explanation: they do not test the compiler output on real hardware; they use QEMU.
QEMU is a CPU emulator supporting a variety of targets. While great for casual development, and for running foreign applications, it is certainly no substitute for real hardware when testing a compiler. Like any piece of software, an emulator is bound to have a few errors, and as it happens, QEMU has known bugs in its handling of the NEON instruction set. Our friend at CodeSourcery should be well aware of these, also being a QEMU developer.
The use of emulators was explained as a necessity due to real hardware not being available. To be fair, CodeSourcery does develop against new hardware before it exists, so some reliance on emulators is unavoidable. This is, however, not the case this time. The Beagleboard was made available to selected developers quite some time ago (I have had one since May, others still longer), and is now being sold by the thousands. CodeSourcery developers, so I am told, were also given an offer of a free board, an offer they chose to refuse.
What does all this mean? Did Murphy decide to inflict maximum bad luck on the hard-working developers, or is there perhaps a larger conspiracy at work? I shall not attempt to speculate in this matter. I will merely repeat this excellent piece of advice given by Robert J. Hanlon: Never attribute to malice that which can be adequately explained by incompetence.
no comments | posted in ARM, Bugs, Compilers