A few months ago, I downloaded an evaluation copy of IBM’s XLC compiler to try it out on FFmpeg. The trial licence has now expired, so what better way to spend a few minutes than by cracking it?
The installation script, as expected, copied a number of files into a directory under /opt. More unusually, it also created a small shared library, libxlc101e.so.1, and placed it in /usr/lib. No other files from the installation package were modified, so this must be where the licence is hiding. Without further ado, we proceed to take it apart.
A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled FFmpeg using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a Beagle board clocked at 600 MHz.
These are the compilers I deemed worthy to participate in the test and the optimisation flags I used with each:
- GCC 4.3.3, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
- GCC 4.4.1, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
- CodeSourcery GCC 2007q3 (based on 4.2.1), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-tree-vectorize
- CodeSourcery GCC 2009q1 (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
- ARM RVCT 4.0 Build 591, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros
I would have also included the ARM compiler from Texas Instruments, had it been able to compile FFmpeg.
Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems with the exception of 32-bit PowerPC. On the PPC MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.
A session with oprofile exposes multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form a += b * c where b and c are 32 bits in size and a is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.
ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.
Thumb-2 performance is claimed to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with FFmpeg as the target and compiled the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.
The Thumb-2 executable is 85% of the ARM one in size, which although being a substantial reduction falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) compared to ARM code. The speedup is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.
In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.
Ever since Apple released their iPhone SDK, the FFmpeg mailing lists have seen a steady stream of error reports from users attempting to build FFmpeg for the iPhone, and eventually they got my attention.
The iPhone is built around an ARM1176 CPU, so the SDK includes an ARM cross-compiler and assembler. Most of the reported errors originate from the Apple assembler which appears to have trouble processing the assembler source files from FFmpeg.
The source files use the GNU assembler syntax, and the Apple assembler is based on an old GNU version, so one might reasonably expect it to work. What I had not realised was just how old a version Apple based their assembler on. The version they chose was 1.38.1, released in January 1991, 18 years ago. Features which have since been added to the GNU assembler, and there are many, have not been merged by Apple. As a result, many special directives and macro features used in FFmpeg are not recognised by the Apple assembler, and modifying the code to work with this assembler would render it unusable with modern GNU versions.
Why not replace the assembler in the SDK with a GNU version, one might ask. The answer is that this is not possible. The Apple system uses an object file format, Mach-O, not supported by the GNU tools. The chances of Apple updating their assembler to support the newer syntax appear slim, so our best hope is for the GNU binutils package to gain support for the Mach-O format. This will need a lot of work, and a working version cannot be expected for yet some time.
While this incompatibility persists, those wishing to run an optimised FFmpeg build on their iPhone will have to rely on patches to make it palatable to the Apple assembler. Supporting the Apple syntax directly in FFmpeg is unfortunately not feasible.