Thumbs up

ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.
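To make the mixing of instruction widths concrete, here is a minimal sketch of how a decoder can tell the two apart. It relies on the Thumb-2 encoding rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction, while anything else is a complete 16-bit instruction. The function names are my own, chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes (2 or 4) of the Thumb-2 instruction whose first
 * halfword is hw. Top five bits 0b11101, 0b11110 or 0b11111 mark the
 * first half of a 32-bit instruction; all other values are 16-bit. */
int thumb2_insn_length(uint16_t hw)
{
    unsigned top5 = hw >> 11;
    return top5 >= 0x1D ? 4 : 2;
}

/* Count the instructions in a buffer of n halfwords. A 32-bit
 * instruction truncated at the end of the buffer is still counted. */
size_t thumb2_count_insns(const uint16_t *code, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n) {
        i += (size_t)thumb2_insn_length(code[i]) / 2; /* advance in halfwords */
        count++;
    }
    return count;
}
```

For example, the halfword 0x4770 (BX LR) is a 16-bit instruction, whereas 0xF000 is the first half of a 32-bit BL.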

Thumb-2 performance is claimed to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with FFmpeg as the target and compiled the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.

The Thumb-2 executable is 85% of the size of the ARM one, a substantial reduction that nevertheless falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) compared to ARM code. The speedups are likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.

In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.
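When comparing two builds like this, it is the worst case rather than the average that should drive the decision. The following is a rough sketch of that bookkeeping. The function names are hypothetical, and the inputs are per-sample decode times in seconds that you would have to measure yourself.

```c
#include <stddef.h>

/* Percentage change in decode time when moving from the ARM build to
 * the Thumb-2 build. Positive means Thumb-2 is slower. */
double relative_change_percent(double arm_seconds, double thumb2_seconds)
{
    return 100.0 * (thumb2_seconds - arm_seconds) / arm_seconds;
}

/* Index of the sample with the largest slowdown. n must be at least 1. */
size_t worst_case_index(const double *arm, const double *thumb2, size_t n)
{
    size_t worst = 0;
    double worst_change = relative_change_percent(arm[0], thumb2[0]);

    for (size_t i = 1; i < n; i++) {
        double change = relative_change_percent(arm[i], thumb2[i]);
        if (change > worst_change) {
            worst_change = change;
            worst = i;
        }
    }
    return worst;
}
```

With made-up times of 20.0 s for ARM and 23.0 s for Thumb-2, relative_change_percent reports a 15% slowdown, the sort of outlier that an average over all samples would hide.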

New toy: Gdium netbook

A new toy arrived at my house today in the shape of a Gdium Liberty 1000 netbook. Based on a Loongson 2F CPU clocked at 900 MHz, the unit sports 512 MB of RAM, a 1024×600 LCD, and the usual array of external ports. Curiously absent is any form of internal mass-storage device. Operating system, applications, and data are stored on a 16 GB USB-attached flash device with a dedicated port.

The operating system is a customised version of Mandriva Linux. Its GNOME GUI somewhat overpowers the small machine, rendering the user experience less than stellar. A less bloated user interface would likely have allowed for smoother, albeit less visually rich, operation.

The selection of applications directly accessible through the main menu system is more or less what is expected for this class of machine: a graphical file manager, web browser, email client, word processor, and some simple utilities and games.

The less visible applications present a more interesting collection. Certain packages appear to have been installed with little consideration for utility. For instance, including GDB but not GCC strikes me as odd, as does the presence of Hylafax on a machine with no modem.

On the multimedia side the Gdium certainly earns points for trying. Both VLC and Totem are installed, as are a number of xine plugins; the main xine application, however, is missing. Despite all the players available, video playback performance is disappointing. Even a modest standard-definition MPEG2 video is enough to bring the player to its knees.

FFmpeg is there too, of course. The version found here reports itself as SVN-r11599 though it is undoubtedly patched to some degree, as is customary with distribution builds. Whatever may have been patched, I am pleased to see that nothing appears to have been disabled. A cursory review of the format list shows all the major formats are there, both encoders and decoders.

For a quick speed test, I ran a simple benchmark of FFmpeg on a selection of formats, and compared the results to the Beagle board at 600 MHz. In most tests the Gdium performance is within 10% of the Beagle board, faster for H.264 video and slower for MPEG2. This is unsurprising since FFmpeg has extensive SIMD optimisations for the Cortex-A8 ARM processor on the Beagle board. With floating-point-intensive audio codecs, the Gdium is 2-3 times faster than the Beagle, consistent with the limited floating-point unit of the Cortex-A8.

The Loongson CPU has SIMD capabilities, so compiler/assembler permitting, it should be possible to boost the multimedia performance considerably.

Rotten Apple

Ever since Apple released their iPhone SDK, the FFmpeg mailing lists have seen a steady stream of error reports from users attempting to build FFmpeg for the iPhone, and eventually they got my attention.

The iPhone is built around an ARM1176 CPU, so the SDK includes an ARM cross-compiler and assembler. Most of the reported errors originate from the Apple assembler, which appears to have trouble processing the assembler source files from FFmpeg.

The source files use the GNU assembler syntax, and the Apple assembler is based on an old GNU version, so one might reasonably expect it to work. What I had not realised was just how old a version Apple based their assembler on. The version they chose was 1.38.1, released in January 1991, 18 years ago. Features which have since been added to the GNU assembler, and there are many, have not been merged by Apple. As a result, many special directives and macro features used in FFmpeg are not recognised by the Apple assembler, and modifying the code to work with this assembler would render it unusable with modern GNU versions.

Why not replace the assembler in the SDK with a GNU version, one might ask. The answer is that this is not possible. The Apple system uses an object file format, Mach-O, not supported by the GNU tools. The chances of Apple updating their assembler to support the newer syntax appear slim, so our best hope is for the GNU binutils package to gain support for the Mach-O format. This will need a lot of work, and a working version cannot be expected for some time yet.

While this incompatibility persists, those wishing to run an optimised FFmpeg build on their iPhone will have to rely on patches to make it palatable to the Apple assembler. Supporting the Apple syntax directly in FFmpeg is unfortunately not feasible.

Analytics-enabled video lifestyle management

Press releases are always infested with current buzz-words, but this one is better than many.

The analytics-enabled video lifestyle management of the title is, apparently, some kind of video surveillance system targeted at home users. According to the press release, it uses mobile video intelligence (MVI), which has got to be a good thing, even having been given an acronym. With all this power, it delivers proactive, video-based information, and does so in a manner that fits today’s connected, mobile lifestyle.

This must be a truly amazing device. It provides users with better lifestyle management, and to top it off, the surveillance footage it supplies is allegedly so great that it also changes how consumers view video – from a passive, entertainment form to a source of rich, real-time information. Not a bad feat for a video of your back door, I must admit.

Shared library woes and the price of PIC

It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using MOVW/MOVT instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash.

When I pointed out this shortcoming to Paul Brook of CodeSourcery, his response was that such relocations in shared libraries are not supported by the GNU tools, will never be, and that shared libraries should be built with position-independent code (PIC). This is an unfortunate attitude, and doubly so considering that the latest CodeSourcery GCC version will generate these instructions with default settings. In other words, the 2008q3 release of CodeSourcery GCC will, with default flags, build crashing shared libraries without so much as a warning.

The refusal to support non-PIC shared libraries is unfortunate also from a performance point of view. Position-independent code is inherently slower than normal code.
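To illustrate where the difference comes from, consider a trivial access to a global table. The table and function below are hypothetical stand-ins for codec data, and the instruction sequences in the comments describe what an ARMv7 compiler typically emits, not the exact output of any particular release.

```c
/* Hypothetical global table, standing in for data in a shared library. */
int gain_table[4] = { 1, 2, 4, 8 };

/*
 * Non-PIC build: the compiler may form the absolute address directly,
 *
 *     movw r3, #:lower16:gain_table
 *     movt r3, #:upper16:gain_table
 *
 * which requires MOVW/MOVT relocations (R_ARM_MOVW_ABS_NC and
 * R_ARM_MOVT_ABS) to be resolved once the final address is known.
 *
 * PIC build (-fPIC): the address is instead fetched from the global
 * offset table through a PC-relative load, costing an extra memory
 * access and usually an extra register on every such reference.
 */
int lookup_gain(unsigned idx)
{
    return gain_table[idx & 3]; /* mask keeps the index in bounds */
}
```

The C source is identical in both cases; only the generated addressing code differs, which is why the cost of PIC shows up across the whole program and not in any one function.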

In order to find out just how much slower PIC is on ARM, I made two builds of FFmpeg, one normal and one with PIC. The PIC build is about 1.7% slower in several tests, among them H.264 video decoding.

On typically resource-constrained ARM systems it would be nice to have the option of space-saving shared libraries without paying the PIC penalty in performance. Until now this option has been a reality. With CodeSourcery lazily refusing to support the relocations required by the latest version of their own compiler, this option may soon be a thing of the past, at least if the bugs that have haunted recent compiler releases are fixed in upcoming versions.