ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.
Thumb-2 code is claimed to reach 98% of the performance of the equivalent ARM code at only 74% of the size. I decided to put this claim to the test with FFmpeg as the target, compiling the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.
The Thumb-2 executable is 85% of the size of the ARM one, which, although a substantial reduction, falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) than with ARM code. The speedup is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.
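For anyone repeating the measurements, the percentages above are plain ratios. Assuming the speed figures are read as relative differences in decode time (one possible reading), the arithmetic can be sketched in C as follows. The function names and the sample numbers in the note below are made up for illustration.

```c
/* Size of a build as a percentage of the baseline build. */
double size_percent(double size, double baseline_size)
{
    return 100.0 * size / baseline_size;
}

/* Relative change in decode time versus the baseline, in percent.
 * Negative means faster than the baseline, positive means slower.
 * Assumption: "faster"/"slower" is measured on elapsed time. */
double time_change_percent(double time, double baseline_time)
{
    return 100.0 * (time - baseline_time) / baseline_time;
}
```

With made-up inputs, size_percent(850.0, 1000.0) gives 85, and time_change_percent(11.5, 10.0) gives 15, that is, 15% slower than the baseline.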
In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.
Regarding omitting hand-written assembler: It would be interesting to see the effects of recompiling it for a Thumb-2 target. I suspect the modifications necessary are minor (if any). Thanks to UAL (the unified assembler language), which most ARM tool chains support, the assembler source code differences between ARM and Thumb-2 are marginal in most cases.
Regarding MP3 performance: Were you able to find out why MP3 is that much slower, unlike any of the other algorithms? Do you have access to RealView Profiler?
The hand-written assembler would be almost entirely 32-bit instructions, so I doubt there would be any gains there. Also, the current GNU assembler doesn’t fully support UAL. Including the assembler code as ARM in an otherwise Thumb-2 build works fine.
I haven’t investigated the MP3 performance yet. I have the RealView software but no JTAG hardware. I will try oprofile first.
Even if rewriting the assembly routines for Thumb-2 yields little or no size reduction, there might be a gain from not requiring ISA switching (though I don’t know how the A8 behaves in that respect).
I don’t expect the assembler code to be significantly smaller than before when compiled for Thumb-2. The benefit of simply recompiling it would be that you could actually take advantage of the speed optimizations. And Thumb-2 wouldn’t be a special case any longer.
Switching to ARM state when entering the assembler functions works well. I did a quick benchmark comparing the overhead of calling a function with mode switching and without, and there wasn’t much difference.
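A comparison of this kind could be sketched in C as below. This is an illustration only, not the benchmark described above: the names step_arm, step_thumb and time_calls are hypothetical, and the per-function instruction set selection assumes GCC targeting 32-bit ARM. On any other compiler or CPU the two callees are built identically, so the code runs but the comparison tells you nothing.

```c
#include <stdint.h>
#include <time.h>

/* Assumption: GCC on 32-bit ARM, where target("arm") and target("thumb")
 * select the instruction set per function. Elsewhere these macros only
 * prevent inlining (or expand to nothing). */
#if defined(__GNUC__) && defined(__arm__) && !defined(__clang__)
#define IN_ARM_STATE   __attribute__((noinline, target("arm")))
#define IN_THUMB_STATE __attribute__((noinline, target("thumb")))
#elif defined(__GNUC__)
#define IN_ARM_STATE   __attribute__((noinline))
#define IN_THUMB_STATE __attribute__((noinline))
#else
#define IN_ARM_STATE
#define IN_THUMB_STATE
#endif

/* Two identical, deliberately tiny callees (hypothetical names). */
IN_ARM_STATE uint32_t step_arm(uint32_t x)
{
    return x * 1664525u + 1013904223u;
}

IN_THUMB_STATE uint32_t step_thumb(uint32_t x)
{
    return x * 1664525u + 1013904223u;
}

typedef uint32_t (*step_fn)(uint32_t);

/* Calls fn `iterations` times, chaining each result into the next call so
 * that no call can be optimised away. Returns the CPU time consumed in
 * seconds and stores the final value in *result. */
double time_calls(step_fn fn, uint32_t iterations, uint32_t *result)
{
    volatile step_fn call = fn;   /* volatile: forces a real indirect call */
    uint32_t x = 1;
    clock_t start = clock();

    for (uint32_t i = 0; i < iterations; i++)
        x = call(x);

    clock_t end = clock();
    *result = x;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(step_arm, n, &r) and time_calls(step_thumb, n, &r) from a caller built with -mthumb, with n large enough to dominate timer resolution, gives a rough idea of the per-call cost of the state switch. Only the difference between the two times is meaningful, since both include the same loop overhead.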
By any chance, do you have any Thumb numbers?
Now, because the EABI forces Thumb interworking, I believe a clever compiler could do interesting things:
– use 32-bit instructions for hot-path code and 16-bit Thumb for cold-path code (error cases, …)