ARM-NEON memory hazards

The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.

Although these hazards clearly matter for code optimisation, the Cortex-A8 Technical Reference Manual (TRM) unfortunately gives no details about them. In fact, it does not mention them at all.

To shed some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.

Since NEON execution lags behind the ARM pipeline, three types of hazard can occur:

ARM load after NEON store
ARM store after NEON load
ARM store after NEON store

The characteristics of each are tested using a loop that interleaves 64-bit NEON VLD1/VST1 and ARM LDR/STR instructions, with the two addresses at various intervals from each other. The hardware used for the test is a Beagle Board clocked at 500 MHz and with the L1NEON configuration bit set.

It quickly becomes evident that the basic granularity for the hazard detection is 16 bytes. In addition, some tests show secondary effects within a 64-byte block (cache line). NEON stores crossing a 16-byte boundary apparently incur an extra penalty.
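To make the two block sizes concrete, here is a minimal C sketch that sorts a pair of addresses into the categories the measurements suggest. The names `hazard_distance`, `classify_pair` and `crosses_16b` are hypothetical, and the assumption that only the containing 16-byte and 64-byte blocks matter comes from the timings, not from any documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper names; block sizes are taken from the measurements. */
enum hazard_distance {
    DIST_SAME_16B,  /* both accesses start in the same 16-byte block */
    DIST_SAME_64B,  /* different 16-byte blocks, same 64-byte block  */
    DIST_OTHER      /* different 64-byte blocks                      */
};

/* Classify two access start addresses by the blocks they fall in. */
static inline enum hazard_distance classify_pair(uintptr_t a, uintptr_t b)
{
    if ((a >> 4) == (b >> 4))
        return DIST_SAME_16B;
    if ((a >> 6) == (b >> 6))
        return DIST_SAME_64B;
    return DIST_OTHER;
}

/* Return 1 if an access of `size` bytes at `addr` spans a 16-byte boundary. */
static inline int crosses_16b(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (addr >> 4) != ((addr + size - 1) >> 4);
}
```

For example, a 64-bit NEON store at an address ending in 0xC spans a boundary (`crosses_16b(0x100C, 8)` is 1), while the same store at an address ending in 0x8 does not.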

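The following table lists the approximate number of cycles required for each pair of instructions when no access spans a 16-byte boundary. The columns distinguish the two accesses falling within the same 16-byte block, within the same 64-byte block, or further apart.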

| Hazard | 16-byte | 64-byte | other |
|---|---|---|---|
| ARM load after NEON store | 22 | 5 | 3 |
| ARM store after NEON load | 13 | 3 | 3 |
| ARM store after NEON store | 22 | 4 | 4 |

The delay of roughly 20 cycles after a NEON store corresponds nicely with the figure of 20 cycles the TRM quotes for an MRC transfer from NEON to ARM.

The next table lists the same timings when the NEON access spans a 16-byte boundary.

| Hazard | 16-byte | 64-byte | other |
|---|---|---|---|
| ARM load after NEON store | 22 | 7 | 5 |
| ARM store after NEON load | 13 | 3 | 3 |
| ARM store after NEON store | 22 | 52 | 48 |

I was somewhat baffled by the last line. Clearly such NEON stores are something to be avoided. Splitting the NEON store into two 32-bit stores has a dramatic effect:

| Hazard | 16-byte | 64-byte | other |
|---|---|---|---|
| ARM store after NEON store | 22 | 32 | 29 |

Although clearly an improvement, it is still bad enough that mixing such accesses could easily impact performance seriously. It should also be noted that in all other cases, the 64-bit store is faster.
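For quick estimates, the approximate figures from the first two tables can be put into a small lookup. This is only a sketch: the names are hypothetical, the column meaning (same 16-byte block, same 64-byte block, other) is assumed from the granularity results, and the numbers apply only to the test setup described above. The split 32-bit store variant is left out.

```c
#include <assert.h>

/* Hypothetical names; the cycle counts are the approximate measurements. */
enum hazard {
    HAZ_ARM_LOAD_AFTER_NEON_STORE,
    HAZ_ARM_STORE_AFTER_NEON_LOAD,
    HAZ_ARM_STORE_AFTER_NEON_STORE
};

/* Assumed column meaning: same 16-byte block, same 64-byte block, other. */
enum column { COL_16B, COL_64B, COL_OTHER };

/* No access spans a 16-byte boundary. */
static const int cycles_aligned[3][3] = {
    { 22, 5, 3 },
    { 13, 3, 3 },
    { 22, 4, 4 },
};

/* The 64-bit NEON access spans a 16-byte boundary. */
static const int cycles_spanning[3][3] = {
    { 22,  7,  5 },
    { 13,  3,  3 },
    { 22, 52, 48 },
};

/* Look up the approximate cycle count for one ARM/NEON instruction pair. */
static inline int approx_cycles(enum hazard h, enum column c, int neon_spans_16b)
{
    return neon_spans_16b ? cycles_spanning[h][c] : cycles_aligned[h][c];
}

/* Convert a cycle count to nanoseconds at the given clock rate in MHz. */
static inline double cycles_to_ns(int cycles, double mhz)
{
    assert(mhz > 0.0);
    return cycles * 1000.0 / mhz;
}
```

At the 500 MHz clock used here, the worst case of 52 cycles works out to about 104 ns per instruction pair.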

5 Responses to ARM-NEON memory hazards

cf says:

Thursday, 24th June, 2010 at 9:20 pm

You’re forgetting 1 very important hazard – the SIMD ones that cause NEON pipeline stalls. If you try to access the result of one NEON operation before the operation is complete, then the NEON pipeline will stall until the result is ready.
- Mans says:
  
  Thursday, 24th June, 2010 at 9:52 pm
  
  The purpose of this investigation was to determine the undocumented memory hazards. Instruction latencies are well-documented in the TRM.
Brett says:

Tuesday, 29th June, 2010 at 9:58 pm

Have you seen any issues with vld and registers q8-15?

For some reason, doing a load to registers below q8 works fine, but above q8 the registers don’t get set.

For example, the first won’t work but the second will.
```
vld1.32	{d16-d17}, [r3]
vld1.32	{d0-d1}, [r3]
```
Any ideas?
- Mans says:
  
  Wednesday, 30th June, 2010 at 10:54 pm
  
  You, or your OS, are obviously doing something wrong. That code should work.
Shervin Emami says:

Friday, 24th September, 2010 at 6:31 am

You say at the end “It should also be noted that in all other cases, the 64-bit store is faster.”

Did you mean 64-bit or 64-byte? Also, could you please explain a little bit more about what your 16-byte, 64-byte and Other columns represent? Does it mean that ARM loads should be spaced at least 64 bytes after NEON stores?
