Some time ago, I was asked for a multimedia hacker’s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list.
- Saturating unsigned+signed add/subtract.
With the current instruction set, this operation requires six instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN) and two extra registers (one if optimal scheduling is not required) for 128-bit vectors. Furthermore, this is a frequently occuring operation, for instance in the H.264 loop filter.
- More registers.
Having another, say, 8 vector registers would be very handy. Encoding this in the existing instructions would of course be tricky, if at all possible. A special VMOV and/or VSWP instruction to access the high registers would be an acceptable compromise, and would certainly be better than using scratch memory. An alternative option could be to make the high half of the existing register file banked. This could perhaps even be done in some clever way allowing the OS to skip save/restore of these registers for processes that never use them.
- 256-bit operations.
8-element vectors are frequently used in video processing. One example is the ubiquitous 8×8 IDCT. In some instances, 32 bits per element are required in intermediate values to maintain adequate precision. The 8×8 IDCT is once again an example. In these cases, support for 8×32-bit vectors would clearly be an advantage.
- Vector sum.
The sum of all elements in a vector is computed as a part of many algorithms, for instance anything involving a dot product and motion estimation in video encoding. Presently, the only option is to use a sequence of 3 or 4 VPADD instructions.
- Transposed load/store.
When performing the same operation on each of a set of rows, one must load values row-wise into registers, and then transpose the registers before using the vector arithmetic instructions. When done computing, the values are again transposed before being stored row-wise. A set of load/store instructions transferring data between rows in memory and “columns” in the register file would save the cost of the transposing operations.
- Improved NEON to ARM transfer.
On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles, during which time any normal access to the ARM register file will stall. This delay makes some potential use cases for NEON practically worthless. I am told this has been addressed in the almost-ready Cortex-A9.
Just wonder the effciency of NEON-to-ARM transfer. if it really takes 20 clks more, that would be a shame and useless.
Would you provide some related links about this issue?
Thanks a lot.
Improved NEON to ARM transfer.
On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles