May 13 2009

GCC makes a mess

Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems, with the exception of 32-bit PowerPC. On the PPC, MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.

A session with oprofile exposes multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form a += b * c where b and c are 32 bits in size and a is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.

Suppose you need the high 32 bits from the 64-bit result of multiplying two 32-bit numbers. This is most easily written in C like this:

int mulh(int a, int b)
{
    return ((int64_t)a * (int64_t)b) >> 32;
}

It doesn’t take much thinking to see that the PowerPC mulhw instruction performs exactly this operation. Indeed, GCC knows of this instruction and uses it. But can we really be sure that those low 32 bits are not needed? GCC seems unconvinced:

mulhw   r9,  r4,  r3
mullw   r10, r4,  r3
srawi   r11, r9,  31
srawi   r12, r9,  0
mr      r3,  r12
blr

The second example is slightly more complicated:

int64_t mac(int64_t a, int b, int c, int d)
{
    a += (int64_t)b * (int64_t)c;
    a += (int64_t)b * (int64_t)d;
    return a;
}

This can, of course, be done with four multiplications and four additions. GCC, however, likes to be thorough, and uses twice the number of both instructions, plus some loads, stores and shifts for completeness:

stwu    r1,  -32(r1)
srawi   r0,  r6,  31
mullw   r0,  r0,  r5
srawi   r8,  r7,  31
stw     r29, 20(r1)
srawi   r29, r5,  31
stw     r27, 12(r1)
stw     r28, 16(r1)
mullw   r11, r29, r6
mulhwu  r9,  r6,  r5
add     r0,  r0,  r11
mullw   r10, r6,  r5
add     r9,  r0,  r9
mullw   r29, r29, r7
addc    r28, r10, r4
adde    r27, r9,  r3
mullw   r8,  r8,  r5
mulhwu  r9,  r7,  r5
add     r8,  r8,  r29
lwz     r29, 20(r1)
mullw   r10, r7,  r5
add     r9,  r8,  r9
addc    r12, r28, r10
adde    r11, r27, r9
lwz     r27, 12(r1)
mr      r4,  r12
lwz     r28, 16(r1)
mr      r3,  r11
addi    r1,  r1,  32
blr
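
For reference, the four-multiplication, four-addition sequence can be modelled in portable C. The sketch below is only an illustration: mac_step and mac_model are hypothetical names, not FFmpeg code, and the comments name the PowerPC instruction each line is meant to correspond to. Like mulh() above, it relies on GCC's arithmetic right shift of negative values.

```c
#include <stdint.h>

/* Hypothetical helper: one signed 32x32->64 multiply-accumulate,
 * split into 32-bit halves the way a 32-bit CPU has to do it. */
static int64_t mac_step(int64_t acc, int32_t x, int32_t y)
{
    uint32_t acc_lo = (uint32_t)acc;
    uint32_t acc_hi = (uint32_t)((uint64_t)acc >> 32);

    uint32_t p_lo = (uint32_t)x * (uint32_t)y;           /* mullw: low word  */
    uint32_t p_hi = (uint32_t)(((int64_t)x * y) >> 32);  /* mulhw: high word */

    uint32_t r_lo  = acc_lo + p_lo;                      /* addc: carry out  */
    uint32_t carry = r_lo < acc_lo;
    uint32_t r_hi  = acc_hi + p_hi + carry;              /* adde: carry in   */

    return (int64_t)(((uint64_t)r_hi << 32) | r_lo);
}

/* Same result as mac() above: two steps, so two mullw, two mulhw,
 * two addc and two adde in an ideal translation. */
int64_t mac_model(int64_t a, int b, int c, int d)
{
    a = mac_step(a, b, c);
    a = mac_step(a, b, d);
    return a;
}
```

Each step costs two multiplications and two additions, which is where the count of four and four comes from.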

Fortunately, this madness is easily fixed with a little inline assembler, which more than doubles the speed of the decoder and makes FFmpeg significantly faster than MAD on PowerPC as well.
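
A minimal sketch of what such inline assembler might look like for the first example, assuming GCC's extended asm syntax on 32-bit PowerPC. The name mulh_asm is made up for illustration, and this is not necessarily the code that went into FFmpeg. On any other target it falls back to the plain C expression.

```c
#include <stdint.h>

/* Hypothetical replacement for mulh(): high 32 bits of a signed
 * 32x32->64 multiplication. */
static inline int mulh_asm(int a, int b)
{
#if defined(__GNUC__) && defined(__powerpc__) && !defined(__powerpc64__)
    int r;
    /* A single mulhw; the low word is never computed. */
    __asm__ ("mulhw %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    /* Portable fallback, identical to the C version shown earlier. */
    return (int)(((int64_t)a * (int64_t)b) >> 32);
#endif
}
```

Because the asm statement has no side effects beyond its output operand, the compiler remains free to schedule it or drop it when the result is unused.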


Mar 25 2009

Thumbs up

ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.

Thumb-2 performance is claimed to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with FFmpeg as the target and compiled the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.

The Thumb-2 executable is 85% of the size of the ARM one, which, although a substantial reduction, falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) than with ARM code. The speedup is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.

In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.


Feb 6 2009

New toy: Gdium netbook

A new toy arrived at my house today in the shape of a Gdium Liberty 1000 netbook. Based on a Loongson 2F CPU clocked at 900 MHz, the unit sports 512 MB of RAM, a 1024×600 LCD, and the usual array of external ports. Curiously absent is any form of internal mass-storage device. Operating system, applications, and data are stored on a 16 GB USB-attached flash device with a dedicated port.

The operating system is a customised version of Mandriva Linux. Its GNOME GUI somewhat overpowers the small machine, rendering the user experience less than stellar. A less bloated user interface would likely have allowed for smoother, albeit less visually rich, operation.

The selection of applications directly accessible through the main menu system is more or less what is expected for this class of machine: a graphical file manager, web browser, email client, word processor, and some simple utilities and games.

The less visible applications present a more interesting collection. Certain packages appear to have been installed with little consideration for utility. For instance, including GDB but not GCC strikes me as odd, as does the presence of Hylafax on a machine with no modem.

On the multimedia side the Gdium certainly earns points for trying. Both VLC and Totem are installed, as are a number of xine plugins; the main xine application is however missing. Despite all the players available, video playback performance is disappointing. Even a modest standard-definition MPEG2 video is enough to bring the player to its knees.

FFmpeg is there too, of course. The version found here reports itself as SVN-r11599 though it is undoubtedly patched to some degree, as is customary with distribution builds. Whatever may have been patched, I am pleased to see that nothing appears to have been disabled. A cursory review of the format list shows all the major formats are there, both encoders and decoders.

For a quick speed test, I ran a simple benchmark of FFmpeg on a selection of formats, and compared the results to the Beagle board at 600 MHz. In most tests the Gdium performance is within 10% of the Beagle board, faster for H.264 video and slower for MPEG2. This is unsurprising since FFmpeg has extensive SIMD optimisations for the Cortex-A8 ARM processor on the Beagle board. With floating-point-intensive audio codecs, the Gdium is 2-3 times faster than the Beagle, consistent with the limited floating-point unit of the Cortex-A8.

The Loongson CPU has SIMD capabilities, so compiler/assembler permitting, it should be possible to boost the multimedia performance considerably.


Jan 28 2009

Rotten Apple

Ever since Apple released their iPhone SDK, the FFmpeg mailing lists have seen a steady stream of error reports from users attempting to build FFmpeg for the iPhone, and eventually they got my attention.

The iPhone is built around an ARM1176 CPU, so the SDK includes an ARM cross-compiler and assembler. Most of the reported errors originate from the Apple assembler which appears to have trouble processing the assembler source files from FFmpeg.

The source files use the GNU assembler syntax, and the Apple assembler is based on an old GNU version, so one might reasonably expect it to work. What I had not realised was just how old a version Apple based their assembler on. The version they chose was 1.38.1, released in January 1991, 18 years ago. Features which have since been added to the GNU assembler, and there are many, have not been merged by Apple. As a result, many special directives and macro features used in FFmpeg are not recognised by the Apple assembler, and modifying the code to work with this assembler would render it unusable with modern GNU versions.

Why not replace the assembler in the SDK with a GNU version, one might ask. The answer is that this is not possible. The Apple system uses an object file format, Mach-O, not supported by the GNU tools. The chances of Apple updating their assembler to support the newer syntax appear slim, so our best hope is for the GNU binutils package to gain support for the Mach-O format. This will need a lot of work, and a working version cannot be expected for some time yet.

While this incompatibility persists, those wishing to run an optimised FFmpeg build on their iPhone will have to rely on patches to make it palatable to the Apple assembler. Supporting the Apple syntax directly in FFmpeg is unfortunately not feasible.


Jan 11 2009

Analytics-enabled video lifestyle management

Press releases are always infested with current buzz-words, but this one is better than many.

The analytics-enabled video lifestyle management of the title is, apparently, some kind of video surveillance system targeted at home users. According to the press release, it uses mobile video intelligence (MVI), which has got to be a good thing, even having been given an acronym. With all this power, it delivers proactive, video-based information, and does so in a manner that fits today’s connected, mobile lifestyle.

This must be a truly amazing device. It provides users with better lifestyle management, and to top it off, the surveillance footage it supplies is allegedly so great that it also changes how consumers view video – from a passive, entertainment form to a source of rich, real-time information. Not a bad feat for a video of your back door, I must admit.