Thumbs up
ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.
Thumb-2 performance is claimed to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with FFmpeg as the target and compiled the same source revision in ARM and Thumb-2 mode using the RVCT 4.0 compiler. For this test I disabled all hand-written assembler optimisations.
The Thumb-2 executable is 85% of the size of the ARM one, which, although a substantial reduction, falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a Beagle board. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) compared to ARM code. Where Thumb-2 was faster, the speedup is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.
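For anyone repeating the comparison, the percentages above can be derived from raw measurements along the following lines. This is only a sketch: the function names are made up for illustration, and the definition of "faster" (change in decode time relative to the ARM build) is an assumption, not necessarily the one behind the figures quoted here.

```c
#include <assert.h>

/* Hypothetical helper: size of the Thumb-2 build as a percentage of the
 * ARM build. A result of 85.0 means "85% of the ARM size". */
double size_ratio_percent(double thumb2_bytes, double arm_bytes)
{
    assert(arm_bytes > 0.0);
    return 100.0 * thumb2_bytes / arm_bytes;
}

/* Hypothetical helper: change in decode time of the Thumb-2 build relative
 * to the ARM build, in percent. Negative values mean Thumb-2 was faster,
 * positive values mean it was slower. This is one possible definition;
 * the source does not say which one it used. */
double time_change_percent(double thumb2_seconds, double arm_seconds)
{
    assert(arm_seconds > 0.0);
    return 100.0 * (thumb2_seconds - arm_seconds) / arm_seconds;
}
```

The inputs would be the executable sizes reported by the toolchain and the wall-clock decode times for each sample, ideally averaged over several runs.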
In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.
March 25th, 2009 at 12:17 pm
Regarding omitting hand-written assembler: It would be interesting to see the effects of recompiling it for a Thumb-2 target. I suspect the modifications necessary are minor (if any). Thanks to UAL (the Unified Assembler Language), which most ARM toolchains support, the assembler source code differences between ARM and Thumb-2 are marginal in most cases.
Regarding MP3 performance: Were you able to find out why MP3 is that much slower, unlike any of the other algorithms? Do you have access to RealView Profiler?
Regards
Marcus
March 25th, 2009 at 1:12 pm
The hand-written assembler would be almost entirely 32-bit instructions, so I doubt there would be any gains there. Also, the current GNU assembler doesn’t fully support UAL. Including the assembler code as ARM in an otherwise Thumb-2 build works fine.
I haven’t investigated the MP3 performance yet. I have the RealView software but no JTAG hardware. I will try oprofile first.
March 26th, 2009 at 6:50 pm
Even if rewriting the assembly routines in Thumb-2 yields little or no size reduction, there might be a gain from not having to switch ISA (though I don’t know how the Cortex-A8 behaves in that respect).
March 27th, 2009 at 8:19 am
@Mans
I don’t expect the assembler code to be significantly smaller than before when compiled for Thumb-2. The benefit of simply recompiling it would be that you could actually take advantage of the speed optimizations. And Thumb-2 wouldn’t be a special case any longer.
–
Marcus
March 27th, 2009 at 9:22 am
Switching to ARM state when entering the assembler functions works well. I did a quick benchmark comparing the overhead of calling a function with mode switching and without, and there wasn’t much difference.
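A rough sketch of what such a call-overhead benchmark could look like follows. It is not the code used for the measurement mentioned above. It assumes GCC 6 or later targeting a core that supports both ARM and Thumb state, where the `target("arm")` and `target("thumb")` function attributes select the instruction set per function; RVCT has its own mechanism for this. The names `add_arm`, `add_thumb` and `time_calls` are invented for the example. On other compilers or architectures the macros fall back to plain functions, so the code still builds but measures nothing interesting.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Assumption: GCC >= 6 on 32-bit ARM accepts target("arm") / target("thumb").
 * Elsewhere the ISA selection is dropped and only noinline (if any) remains. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__arm__)
#define ARM_STATE   __attribute__((target("arm"), noinline))
#define THUMB_STATE __attribute__((target("thumb"), noinline))
#elif defined(__GNUC__)
#define ARM_STATE   __attribute__((noinline))
#define THUMB_STATE __attribute__((noinline))
#else
#define ARM_STATE
#define THUMB_STATE
#endif

/* Two trivial callees that differ only in the instruction set they use. */
ARM_STATE unsigned add_arm(unsigned a, unsigned b) { return a + b; }
THUMB_STATE unsigned add_thumb(unsigned a, unsigned b) { return a + b; }

/* Call fn `calls` times through a pointer and return the elapsed time in
 * seconds. The running total is stored in *sink so that the compiler
 * cannot discard the calls. */
double time_calls(unsigned (*fn)(unsigned, unsigned), unsigned long calls,
                  unsigned *sink)
{
    struct timespec t0, t1;
    volatile unsigned acc = 0;
    unsigned long i;

    assert(fn != NULL && sink != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < calls; i++)
        acc = fn(acc, 1u);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sink = acc;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Calling `time_calls(add_arm, n, &sink)` and `time_calls(add_thumb, n, &sink)` with the same large `n` and comparing the two results gives an estimate of the switching cost. Which of the two calls involves a state switch depends on the instruction set the caller itself was compiled for.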
April 19th, 2009 at 8:40 am
By any chance, do you have any Thumb numbers?
Now that the EABI forces Thumb interworking, I believe a clever compiler could do interesting things:
- use 32-bit instructions for hot-path code and 16-bit Thumb for cold-path code (error cases, …)