ARM-NEON memory hazards

The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.

Although these hazards are clearly important for code optimisation, the Cortex-A8 Technical Reference Manual (TRM) unfortunately gives no details about them. In fact, it does not mention them at all.

To shed some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.

Since NEON execution lags behind the ARM pipeline, three types of hazard can occur:

  • ARM load after NEON store
  • ARM store after NEON load
  • ARM store after NEON store

The characteristics of each are tested using a loop that interleaves 64-bit NEON VLD1/VST1 and ARM LDR/STR instructions, with the two sets of addresses placed at various distances from each other. The hardware used for the test is a Beagle Board clocked at 500 MHz with the L1NEON configuration bit set.

It quickly becomes evident that the basic granularity for the hazard detection is 16 bytes. In addition, some tests show secondary effects within a 64-byte block (cache line). NEON stores crossing a 16-byte boundary apparently incur an extra penalty.
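As a rough illustration of these granularities, the sketch below sorts a pair of addresses into three classes matching the table columns. It assumes the columns mean "same 16-byte block", "same 64-byte block but a different 16-byte block", and "anything further apart"; that reading is an interpretation, not something the measurements spell out. The names hazard_class and classify_hazard are made up for this example.

```c
#include <stdint.h>

/* Hypothetical classification mirroring the table columns.
 * Assumption: blocks are naturally aligned, so two addresses share a
 * block when they agree in all bits above the block-offset bits. */
enum hazard_class {
    HAZARD_SAME_16B,  /* both accesses fall in one aligned 16-byte block */
    HAZARD_SAME_64B,  /* same aligned 64-byte block, different 16-byte block */
    HAZARD_OTHER      /* different 64-byte blocks */
};

static enum hazard_class classify_hazard(uintptr_t neon_addr, uintptr_t arm_addr)
{
    if ((neon_addr >> 4) == (arm_addr >> 4))
        return HAZARD_SAME_16B;
    if ((neon_addr >> 6) == (arm_addr >> 6))
        return HAZARD_SAME_64B;
    return HAZARD_OTHER;
}
```

With this reading, a NEON store to 0x1000 followed by an ARM load from 0x100c lands in the first column, a load from 0x1010 in the second, and a load from 0x1040 in the third.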

The following table lists the approximate number of cycles required for each pair of instructions when no access spans a 16-byte boundary.

                             16-byte  64-byte  other
ARM load after NEON store         22        5      3
ARM store after NEON load         13        3      3
ARM store after NEON store        22        4      4

The delay of roughly 20 cycles after a NEON store corresponds nicely with the figure of 20 cycles the TRM quotes for an MRC transfer from NEON to ARM.

The next table lists the same timings when the NEON access spans a 16-byte boundary.

                             16-byte  64-byte  other
ARM load after NEON store         22        7      5
ARM store after NEON load         13        3      3
ARM store after NEON store        22       52     48

I was somewhat baffled by the last line. Clearly such NEON stores are something to be avoided. Splitting the NEON store into two 32-bit stores has a dramatic effect:

                             16-byte  64-byte  other
ARM store after NEON store        22       32     29

Although clearly an improvement, it is still bad enough that mixing such accesses could easily impact performance seriously. It should also be noted that in all other cases, the 64-bit store is faster.
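To spot the problematic stores ahead of time, a small helper along the following lines could be used. It is only a sketch: the name spans_16_byte_boundary is invented here, and it assumes the penalty depends solely on whether the first and last byte of the access lie in different aligned 16-byte blocks.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes starting at `addr` touches more
 * than one aligned 16-byte block, 0 otherwise. Hypothetical helper; the
 * 16-byte figure is the granularity observed in the benchmark above. */
static int spans_16_byte_boundary(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;  /* an empty access cannot cross anything */
    return (addr >> 4) != ((addr + size - 1) >> 4);
}
```

For example, a 64-bit store at offset 8 within a block stays inside it, while the same store at offset 12 straddles the boundary and would fall into the slow case measured above.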


5 Responses to ARM-NEON memory hazards

  1. cf says:

    You’re forgetting 1 very important hazard – the SIMD ones that cause NEON pipeline stalls. If you try to access the result of one NEON operation before the operation is complete, then the NEON pipeline will stall until the result is ready.

  2. Brett says:

    Have you seen any issues with vld and registers q8-15?

    For some reason, doing a load to registers below q8 works fine, but for q8 and above the registers don’t get set.

    For example, the first won’t work but the second will.

    vld1.32	{d16-d17}, [r3]
    vld1.32	{d0-d1}, [r3]
    

    any ideas?

  3. You say at the end “It should also be noted that in all other cases, the 64-bit store is faster.”

    Did you mean 64-bit or 64-byte? Also, could you please explain a little bit more about what your 16-byte, 64-byte and Other columns represent? Does it mean that ARM loads should be spaced at least 64 bytes after NEON stores?
