ARM compiler shoot-out

A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled FFmpeg using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a Beagle board clocked at 600 MHz.

These are the compilers I deemed worthy to participate in the test and the optimisation flags I used with each:

  • GCC 4.3.3, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
  • GCC 4.4.1, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
  • CodeSourcery GCC 2007q3 (based on 4.2.1), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-tree-vectorize
  • CodeSourcery GCC 2009q1 (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
  • ARM RVCT 4.0 Build 591, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros

I would have also included the ARM compiler from Texas Instruments, had it been able to compile FFmpeg.

IJG is back

When FFmpeg released version 0.5 earlier this year, nearly five years had passed since the previous release, during which time the project had attracted frequent criticism for the lack of regular releases. There exists, however, a project whose release interval dwarfs that of FFmpeg. I speak of the Independent JPEG Group’s libjpeg, version 7 of which was recently released after 11 years of silence.

So what have they been doing during the last 11 years? Not a lot, it seems. The only change log entry I find noteworthy is the addition of arithmetic entropy coding, previously omitted due to patent concerns. Contrast this with the TO DO note from the previous release:

The major thrust for v7 will probably be improvement of visual quality. The current method for scaling the quantization tables is known not to be very good at low Q values.  We also intend to investigate block boundary smoothing, “poor man’s variable quantization”, and other means of improving quality-vs-file-size performance without sacrificing compatibility.

In future versions, we are considering supporting some of the upcoming JPEG Part 3 extensions — principally, variable quantization and the SPIFF file format.

As always, speeding things up is of great interest.

Eleven years is of course plenty of time for the developers to change their minds, or perhaps even lose them. The TO DO note in version 7 reads thus:

GCC makes a mess

Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems, with the exception of 32-bit PowerPC. On the PPC, MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.

A session with oprofile exposes multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form a += b * c where b and c are 32 bits in size and a is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.
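For readers who want to see the shape of the operation in question, here is a minimal sketch in C. It is not taken from the FFmpeg source; the names mac64 and dot32 are hypothetical and exist only for illustration. On 32-bit PowerPC the 64-bit product of two 32-bit values can be formed from a mullw/mulhw pair (low and high halves), so in principle little more than a carry-propagating add is needed on top of that.

```c
#include <stdint.h>

/* Hypothetical helper (not from FFmpeg): a += b * c with 32-bit b and c
 * and a 64-bit accumulator a. */
static inline int64_t mac64(int64_t a, int32_t b, int32_t c)
{
    /* Widen one operand first so the multiplication is carried out in
     * 64 bits; b * c on its own would be a 32-bit multiply that can
     * overflow. */
    return a + (int64_t)b * c;
}

/* Hypothetical usage: a dot product built from repeated multiply-accumulate,
 * similar in shape to a filter loop in an audio decoder. */
int64_t dot32(const int32_t *x, const int32_t *y, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum = mac64(sum, x[i], y[i]);
    return sum;
}
```

Compiling a function like dot32 for a 32-bit target and reading the generated assembly is one way to check how well the compiler maps the operation onto the hardware multiply instructions.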

Thumbs up

ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.

Thumb-2 performance is claimed to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with FFmpeg as the target and compiled the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.

The Thumb-2 executable is 85% of the size of the ARM one, a substantial reduction that nevertheless falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) compared to ARM code. The speedup is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.

In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.
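As a rough illustration of how such a comparison might be set up, the sketch below times a workload with the standard C clock() function and expresses the difference between two timings as a percentage. This is not the harness used for the numbers above; time_call, percent_change, and busy_loop are hypothetical names, and busy_loop merely stands in for a real decoding run.

```c
#include <time.h>

/* Hypothetical stand-in for a decode run: spins for *arg iterations. */
void busy_loop(void *arg)
{
    volatile unsigned sink = 0;
    unsigned n = *(unsigned *)arg;
    for (unsigned i = 0; i < n; i++)
        sink += i;
}

/* Processor time in seconds consumed by one call to fn(arg). */
double time_call(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Change of candidate relative to baseline, in percent.
 * Positive means the candidate took longer. */
double percent_change(double baseline, double candidate)
{
    return (candidate - baseline) * 100.0 / baseline;
}
```

With decode times of 100 s for one build and 115 s for the other, percent_change(100.0, 115.0) gives 15, the form in which a slowdown would be reported. In practice each measurement should be repeated several times, since a single run is easily disturbed by other activity on the system.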

New toy: Gdium netbook

A new toy arrived at my house today in the shape of a Gdium Liberty 1000 netbook. Based on a Loongson 2F CPU clocked at 900 MHz, the unit sports 512 MB of RAM, a 1024×600 LCD, and the usual array of external ports. Curiously absent is any form of internal mass-storage device. Operating system, applications, and data are stored on a 16 GB USB-attached flash device with a dedicated port.

The operating system is a customised version of Mandriva Linux. Its GNOME GUI somewhat overpowers the small machine, rendering the user experience less than stellar. A less bloated user interface would likely have allowed for smoother, albeit less visually rich, operation.

The selection of applications directly accessible through the main menu system is more or less what is expected for this class of machine: a graphical file manager, web browser, email client, word processor, and some simple utilities and games.

The less visible applications present a more interesting collection. Certain packages appear to have been installed with little consideration for utility. For instance, including GDB but not GCC strikes me as odd, as does the presence of Hylafax on a machine with no modem.

On the multimedia side the Gdium certainly earns points for trying. Both VLC and Totem are installed, as are a number of xine plugins; the main xine application is however missing. Despite all the players available, video playback performance is disappointing. Even a modest standard-definition MPEG2 video is enough to bring the player to its knees.

FFmpeg is there too, of course. The version found here reports itself as SVN-r11599 though it is undoubtedly patched to some degree, as is customary with distribution builds. Whatever may have been patched, I am pleased to see that nothing appears to have been disabled. A cursory review of the format list shows all the major formats are there, both encoders and decoders.

For a quick speed test, I ran a simple benchmark of FFmpeg on a selection of formats, and compared the results to the Beagle board at 600 MHz. In most tests the Gdium performance is within 10% of the Beagle board, faster for H.264 video and slower for MPEG2. This is unsurprising since FFmpeg has extensive SIMD optimisations for the Cortex-A8 ARM processor on the Beagle board. With floating-point-intensive audio codecs, the Gdium is 2-3 times faster than the Beagle, consistent with the limited floating-point unit of the Cortex-A8.

The Loongson CPU has SIMD capabilities, so compiler/assembler permitting, it should be possible to boost the multimedia performance considerably.