The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.
Although these hazards are clearly important for code optimisation, the Cortex-A8 Technical Reference Manual unfortunately gives no details about them. In fact, it does not mention them at all.
To shed some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.
Since NEON execution lags behind the ARM pipeline, three types of hazard can occur:
- ARM load after NEON store
- ARM store after NEON load
- ARM store after NEON store
The characteristics of each are tested using a loop interleaving 64-bit NEON VLD1/VST1 and ARM LDR/STR instructions, with addresses at various intervals. The hardware used for the test is a Beagle Board clocked at 500 MHz with the L1NEON configuration bit set.
It quickly becomes evident that the basic granularity for the hazard detection is 16 bytes. In addition, some tests show secondary effects within a 64-byte block (cache line). NEON stores crossing a 16-byte boundary apparently incur an extra penalty.
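To make the two block sizes concrete, here is a minimal C sketch that classifies a pair of addresses the way the measurements below are grouped. It assumes the three cases are "ARM access in the same 16-byte block as the NEON access", "same 64-byte block", and "anywhere else"; that reading, and the names `hazard_zone`, `classify` and `spans_16_byte_boundary`, are illustrative assumptions rather than anything taken from the TRM or from the benchmark itself. For simplicity, `classify` looks only at the start address of each access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the three cases distinguished in the measurements. */
enum hazard_zone { ZONE_16_BYTE, ZONE_64_BYTE, ZONE_OTHER };

/* Nonzero if an access of `size` bytes starting at `addr` crosses a
 * 16-byte boundary, i.e. its first and last byte lie in different
 * 16-byte blocks. */
static int spans_16_byte_boundary(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (addr >> 4) != ((addr + size - 1) >> 4);
}

/* Classify an ARM access relative to a preceding NEON access, using
 * only the start address of each (an assumption of this sketch). */
static enum hazard_zone classify(uintptr_t neon_addr, uintptr_t arm_addr)
{
    if ((neon_addr >> 4) == (arm_addr >> 4))
        return ZONE_16_BYTE;   /* same 16-byte block */
    if ((neon_addr >> 6) == (arm_addr >> 6))
        return ZONE_64_BYTE;   /* same 64-byte block (cache line) */
    return ZONE_OTHER;         /* unrelated addresses */
}
```

For example, a 64-bit NEON store at offset 8 within a 16-byte block stays inside that block, while the same store at offset 12 crosses into the next one, which is the case that appears to incur the extra penalty.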
The following table lists the approximate number of cycles required for each pair of instructions when no access spans a 16-byte boundary.
| | 16-byte | 64-byte | other |
|---|---|---|---|
| ARM load after NEON store | 22 | 5 | 3 |
| ARM store after NEON load | 13 | 3 | 3 |
| ARM store after NEON store | 22 | 4 | 4 |
The delay of roughly 20 cycles after a NEON store corresponds nicely with the figure of 20 cycles the TRM quotes for an MRC transfer from NEON to ARM.
The next table lists the same timings when the NEON access spans a 16-byte boundary.
| | 16-byte | 64-byte | other |
|---|---|---|---|
| ARM load after NEON store | 22 | 7 | 5 |
| ARM store after NEON load | 13 | 3 | 3 |
| ARM store after NEON store | 22 | 52 | 48 |
I was somewhat baffled by the last line. Clearly such NEON stores are something to be avoided. Splitting the NEON store into two 32-bit stores has a dramatic effect:
| | 16-byte | 64-byte | other |
|---|---|---|---|
| ARM store after NEON store | 22 | 32 | 29 |
Although this is clearly an improvement, it is still bad enough that mixing such accesses could easily have a serious impact on performance. It should also be noted that in all other cases, the 64-bit store is faster.
You’re forgetting 1 very important hazard – the SIMD ones that cause NEON pipeline stalls. If you try to access the result of one NEON operation before the operation is complete, then the NEON pipeline will stall until the result is ready.
The purpose of this investigation was to determine the undocumented memory hazards. Instruction latencies are well-documented in the TRM.
Have you seen any issues with vld and registers q8-15?
For some reason, doing a load to registers below q8 works fine, but for q8 and above the registers don’t get set.
For example, the first won’t work but the second will.
any ideas?
You, or your OS, are obviously doing something wrong. That code should work.
You say at the end “It should also be noted that in all other cases, the 64-bit store is faster.”
Did you mean 64-bit or 64-byte? Also, could you please explain a little bit more about what your 16-byte, 64-byte and Other columns represent? Does it mean that ARM loads should be spaced at least 64 bytes after NEON stores?