<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hardwarebug &#187; Optimisation</title>
	<atom:link href="http://hardwarebug.org/category/optimisation/feed/" rel="self" type="application/rss+xml" />
	<link>http://hardwarebug.org</link>
	<description>Everything is broken</description>
	<lastBuildDate>Tue, 18 Oct 2011 00:41:58 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Pointer peril</title>
		<link>http://hardwarebug.org/2011/10/18/pointer-peril/</link>
		<comments>http://hardwarebug.org/2011/10/18/pointer-peril/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 00:26:00 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Bugs]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=587</guid>
		<description><![CDATA[Use of pointers in the C programming language is subject to a number of constraints, violation of which results in the dreaded undefined behaviour. If a situation with undefined behaviour occurs, anything is permitted to happen. The program may produce unexpected results, crash, or demons may fly out of the user&#8217;s nose. Some of these [...]]]></description>
			<content:encoded><![CDATA[<p>Use of pointers in the C programming language is subject to a number of constraints, violation of which results in the dreaded <em>undefined behaviour</em>. If a situation with undefined behaviour occurs, anything is permitted to happen. The program may produce unexpected results, crash, or demons may fly out of the user&#8217;s nose.</p>
<p>Some of these rules concern pointer arithmetic, that is, addition and subtraction in which one or both operands are pointers. The C99 specification spells it out in section 6.5.6:</p>
<blockquote><div class="frame-outer small">
<div style="text-align: left;">
When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. [&hellip;] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. [&hellip;]<br />
<br/>When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object; the result is the difference of the subscripts of the two array elements.
</div>
</div>
</blockquote>
<p>In simpler, if less accurate, terms, operands and results of pointer arithmetic must be within the same array object. If not, anything can happen.<br />
<span id="more-587"></span><br />
To see some of this undefined behaviour in action, consider the following example.</p>
<blockquote><div class="frame-outer small">
<pre style="text-align: left; margin: 0;">
#include &lt;stdio.h&gt;

int foo(void)
{
    int a, b;
    int d = &amp;b - &amp;a; /* undefined */
    int *p = &amp;a;
    b = 0;
    p[d] = 1;        /* undefined */
    return b;
}

int main(void)
{
    printf("%d\n", foo());
    return 0;
}
</pre>
</div>
</blockquote>
<p>This program breaks the above rules twice. Firstly, the <code>&amp;b - &amp;a</code> calculation is undefined because the pointers being subtracted do not point to elements of the same array.  Most compilers will nonetheless evaluate this to the distance between the two variables on the stack.  Secondly, accessing <code>p[d]</code> is undefined because <code>p</code> and <code>p + d</code> do not point to elements of the same array (unless the result of the first undefined expression happened to be zero).</p>
<p>It might be tempting to assume that on a modern system with a single, flat address space, these operations would result in the intuitively obvious outcomes, ultimately setting <code>b</code> to the value 1 and returning this same value.  However, undefined is undefined, and the compiler is free to do whatever it wants:</p>
<blockquote><div class="frame-outer small">
<pre style="text-align: left; margin: 0;">
$ gcc -O undef.c
$ ./a.out
0
</pre>
</div>
</blockquote>
<p>Even on a perfectly normal system, the program, when compiled with optimisation enabled, behaves as though the write to <code>p[d]</code> were ignored.  In fact, this is exactly what happened, as this test shows:</p>
<blockquote><div class="frame-outer small">
<pre style="text-align: left; margin: 0;">
$ gcc -O -fno-tree-pta undef.c
$ ./a.out
1
</pre>
</div>
</blockquote>
<p>Disabling the <a href="http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Optimize-Options.html#index-ftree_002dpta-802">tree-pta optimisation</a> in gcc gives us back the intuitive behaviour.  PTA stands for points-to analysis, which means the compiler analyses which objects any pointers can validly access.  In the example, the pointer <code>p</code>, having been set to <code>&amp;a</code>, cannot be used in a valid access to the variable <code>b</code>, since <code>a</code> and <code>b</code> are not part of the same array.  Between the assignment <code>b = 0</code> and the return statement, no valid access to <code>b</code> takes place, whence the return value is derived to be zero.  The entire function is, in fact, reduced to the assembly equivalent of a simple <code>return 0</code> statement, all because we decided to violate a couple of language rules.</p>
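<p>For contrast, here is a minimal sketch of a variant that stays within the rules quoted above. The two variables become elements of a single array, so both the subtraction and the indexed store are defined. The names <code>foo_defined</code> and <code>v</code> are made up for this illustration and do not appear in the original program.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Defined-behaviour variant of foo(): v[0] plays the role of a and
   v[1] the role of b. Both are elements of the same array object. */
int foo_defined(void)
{
    int v[2];
    ptrdiff_t d = &v[1] - &v[0];   /* defined: same array, so d == 1 */
    int *p = &v[0];

    assert(d == 1);
    v[1] = 0;
    p[d] = 1;                      /* defined: p + d points into v */
    return v[1];
}
```

<p>Because <code>p + d</code> is a valid pointer to <code>v[1]</code>, the compiler has to assume the store can modify it, and the function returns 1 at any optimisation level.</p>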
<p>While this example is obviously contrived for clarity, bugs rooted in these rules occur in real programs from time to time.  My most recent encounter with one was in <a href="http://pari.math.u-bordeaux.fr/cgi-bin/bugreport.cgi?bug=1237">PARI/GP</a>, where a somewhat more complicated <a href="http://pari.math.u-bordeaux.fr/cgi-bin/gitweb.cgi?p=pari.git;a=blob;f=src/headers/pariinl.h;h=4b0680a27b7615df56f84b54b16a15986db9b82e;hb=HEAD#l590">incarnation</a> of the example above can be found.  Unfortunately, the maintainers of this program are not responsive to reports of such bad practices in their code:</p>
<blockquote><div class="frame-outer small">
<div style="text-align: left;">
Undefined according to what rule? The code is only requiring the adress space to be flat which is true on all supported platforms.
</div>
</div>
</blockquote>
<p>The rule in question is, of course, the one quoted above.  Since the standard makes no exception for flat address spaces, no such exception exists.  Although the behaviour could be logically defined in this case, it is not, and all programs must still follow the rules.  Filing <a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49140">bug reports</a> against the compiler will not make them go away.  As of this writing, the issue remains unresolved.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2011/10/18/pointer-peril/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Bit-field badness</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/</link>
		<comments>http://hardwarebug.org/2010/01/30/bit-field-badness/#comments</comments>
		<pubDate>Sat, 30 Jan 2010 16:15:05 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=230</guid>
		<description><![CDATA[Consider the following C code which is based on a real-world situation. struct bf1_31 { unsigned a:1; unsigned b:31; }; void func(struct bf1_31 *p, int n, int a) { int i = 0; do { if (p[i].a) p[i].b += a; } while (++i &#60; n); } How would we best write this in ARM assembler? [...]]]></description>
			<content:encoded><![CDATA[<p>Consider the following C code which is based on a real-world situation.</p>
<blockquote>
<pre>struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i &lt; n);
}
</pre>
</blockquote>
<p>How would we best write this in ARM assembler? This is how I would do it:<br />
<span id="more-230"></span></p>
<blockquote>
<pre>func:
        ldr     r3,  [r0], #4
        tst     r3,  #1
        add     r3,  r3,  r2,  lsl #1
        strne   r3,  [r0, #-4]
        subs    r1,  r1,  #1
        bgt     func
        bx      lr
</pre>
</blockquote>
<p>The <code>add</code> instruction is unconditional to avoid a dependency on the comparison. Unrolling the loop would mask the latency of the <code>ldr</code> instruction as well, but that is outside the scope of this experiment.</p>
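<p>The same idea can be sketched in C by treating each struct as one 32-bit word. This assumes the layout the assembly above relies on, with <code>a</code> in bit 0 and <code>b</code> in bits 1 to 31; the C standard does not guarantee that layout, and the name <code>func_words</code> is hypothetical.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Word-level equivalent of func(), assuming a is bit 0 and b is
   bits 1..31 of each 32-bit word (an ABI assumption, not portable). */
void func_words(uint32_t *w, int n, int a)
{
    int i = 0;

    assert(w != 0);
    do {
        uint32_t v = w[i];                      /* single load */
        uint32_t sum = v + ((uint32_t)a << 1);  /* unconditional add */
        if (v & 1)                              /* test the a field */
            w[i] = sum;                         /* conditional store */
    } while (++i < n);
}
```

<p>Adding <code>a &lt;&lt; 1</code> to the whole word leaves bit 0 untouched and wraps modulo 2<sup>32</sup>, which matches the modulo 2<sup>31</sup> arithmetic of the 31-bit field.</p>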
<p>Now compile this code with <code>gcc -march=armv5te -O3</code> and watch in horror:</p>
<blockquote>
<pre>func:
        push    {r4}
        mov     ip, #0
        mov     r4, r2
loop:
        ldrb    r3, [r0]
        add     ip, ip, #1
        tst     r3, #1
        ldrne   r3, [r0]
        andne   r2, r3, #1
        addne   r3, r4, r3, lsr #1
        orrne   r2, r2, r3, lsl #1
        strne   r2, [r0]
        cmp     ip, r1
        add     r0, r0, #4
        blt     loop
        pop     {r4}
        bx      lr
</pre>
</blockquote>
<p>This is nothing short of awful:</p>
<ul>
<li>The same value is loaded from memory twice.</li>
<li>A complicated mask/shift/or operation is used where a simple shifted add would suffice.</li>
<li>Write-back addressing is not used.</li>
<li>The loop control counts up and compares instead of counting down.</li>
<li>Useless <code>mov</code> in the prologue; swapping the roles of <code>r2</code> and <code>r4</code> would avoid this.</li>
<li>Using <code>lr</code> in place of <code>r4</code> would allow the return to be done with <code>pop {pc}</code>, saving one instruction (ignoring for the moment that no callee-saved registers are needed at all).</li>
</ul>
<p>Even for this trivial function the gcc-generated code is more than twice the optimal size and slower by approximately the same factor.</p>
<p>The main issue I wanted to illustrate is the poor handling of bit-fields by gcc. When accessing bit-fields from memory, gcc issues a separate load for each field even when they are contained in the same aligned memory word. Although each load after the first will most likely hit L1 cache, this is still bad for several reasons:</p>
<ul>
<li>Loads typically have a result latency of two or three cycles, compared to one cycle for data processing instructions. Any bit-field can be extracted from a register with two shifts, and on ARM the second of these can generally be achieved using a shifted second operand to a following instruction. The ARMv6T2 instruction set also adds the <code>SBFX</code> and <code>UBFX</code> instructions for extracting any signed or unsigned bit-field in one cycle.</li>
<li>Most CPUs have more data processing units than load/store units. It is thus more likely for an ALU instruction than a load/store to issue without delay on a superscalar processor.</li>
<li>Redundant memory accesses can trigger early flushing of store buffers, rendering these less efficient.</li>
</ul>
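<p>The two-shift extraction mentioned in the list above can be sketched in C as follows. The helper name <code>extract_bits</code> and its parameters are hypothetical; the point is only that an unsigned field comes out of an already loaded word with a left shift followed by a right shift, with no further memory access.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Extract the unsigned field of `width` bits whose least significant
   bit is bit `lsb` of `word`, using two shifts. */
uint32_t extract_bits(uint32_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb + width <= 32);
    /* The left shift discards the bits above the field, the right
       shift discards the bits below it and moves the field to bit 0. */
    return (word << (32 - lsb - width)) >> (32 - width);
}
```

<p>With the struct from the start of the post, and assuming <code>a</code> sits in bit 0, <code>extract_bits(w, 0, 1)</code> would yield <code>a</code> and <code>extract_bits(w, 1, 31)</code> would yield <code>b</code> from a single loaded word <code>w</code>.</p>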
<p>No gcc bashing is complete without a comparison with another compiler, so without further ado, here is the ARM RVCT output (<code>armcc --cpu 5te -O3</code>):</p>
<blockquote>
<pre>func:
        mov     r3, #0
        push    {r4, lr}
loop:
        ldr     ip, [r0, r3, lsl #2]
        tst     ip, #1
        addne   ip, ip, r2, lsl #1
        strne   ip, [r0, r3, lsl #2]
        add     r3, r3, #1
        cmp     r3, r1
        blt     loop
        pop     {r4, pc}
</pre>
</blockquote>
<p>This is much better, the core loop using only one instruction more than my version. The loop control is counting up, but at least this register is reused as the offset for the memory accesses. More remarkable is the push/pop of two registers that are never used. I had not expected to see this from RVCT.</p>
<p>Even the best compilers are still no match for a human.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/30/bit-field-badness/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>GCC makes a mess</title>
		<link>http://hardwarebug.org/2009/05/13/gcc-makes-a-mess/</link>
		<comments>http://hardwarebug.org/2009/05/13/gcc-makes-a-mess/#comments</comments>
		<pubDate>Wed, 13 May 2009 02:16:38 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>
		<category><![CDATA[PowerPC]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=131</guid>
		<description><![CDATA[Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems with the exception of 32-bit PowerPC. On the PPC MAD was nearly twice as fast [...]]]></description>
			<content:encoded><![CDATA[<p>Following up on a report about <a href="http://ffmpeg.org/">FFmpeg</a> being slower at MPEG audio decoding than <a href="http://www.underbit.com/products/mad/">MAD</a>, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems, with the exception of 32-bit PowerPC. On the PPC, MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.</p>
<p>A session with oprofile exposes multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form <code>a += b * c</code> where <code>b</code> and <code>c</code> are 32 bits in size and <code>a</code> is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.<br />
<span id="more-131"></span><br />
Suppose you need the high 32 bits from the 64-bit result of multiplying two 32-bit numbers. This is most easily written in C like this:</p>
<blockquote>
<pre>#include &lt;stdint.h&gt;

int mulh(int a, int b)
{
    return ((int64_t)a * (int64_t)b) &gt;&gt; 32;
}</pre>
</blockquote>
<p>It doesn&#8217;t take much thinking to see that the PowerPC <code>mulhw</code> instruction performs exactly this operation. Indeed, GCC knows of this instruction and uses it. But can we be <em>really</em> sure that those low 32 bits are not needed? GCC seems unconvinced:</p>
<blockquote>
<pre>mulhw   r9,  r4,  r3
mullw   r10, r4,  r3
srawi   r11, r9,  31
srawi   r12, r9,  0
mr      r3,  r12
blr</pre>
</blockquote>
<p>The second example is slightly more complicated:</p>
<blockquote>
<pre>int64_t mac(int64_t a, int b, int c, int d)
{
    a += (int64_t)b * (int64_t)c;
    a += (int64_t)b * (int64_t)d;
    return a;
}</pre>
</blockquote>
<p>This can, of course, be done with four multiplications and four additions. GCC, however, likes to be thorough, and uses twice the number of both instructions, plus some loads, stores and shifts for completeness:</p>
<blockquote>
<pre>stwu    r1,  -32(r1)
srawi   r0,  r6,  31
mullw   r0,  r0,  r5
srawi   r8,  r7,  31
stw     r29, 20(r1)
srawi   r29, r5,  31
stw     r27, 12(r1)
stw     r28, 16(r1)
mullw   r11, r29, r6
mulhwu  r9,  r6,  r5
add     r0,  r0,  r11
mullw   r10, r6,  r5
add     r9,  r0,  r9
mullw   r29, r29, r7
addc    r28, r10, r4
adde    r27, r9,  r3
mullw   r8,  r8,  r5
mulhwu  r9,  r7,  r5
add     r8,  r8,  r29
lwz     r29, 20(r1)
mullw   r10, r7,  r5
add     r9,  r8,  r9
addc    r12, r28, r10
adde    r11, r27, r9
lwz     r27, 12(r1)
mr      r4,  r12
lwz     r28, 16(r1)
mr      r3,  r11
addi    r1,  r1,  32
blr</pre>
</blockquote>
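<p>For comparison, the four multiplications and four additions can be spelled out in C roughly as below. This is only a sketch: the names <code>mac64</code> and <code>mac_split</code> are hypothetical, the comments name the PowerPC instruction each line is meant to correspond to, and nothing guarantees that a given compiler will actually map the lines that way.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Add the 64-bit product x * y to the 64-bit value held in *hi:*lo. */
static void mac64(uint32_t *hi, uint32_t *lo, int x, int y)
{
    uint32_t p_lo = (uint32_t)x * (uint32_t)y;           /* mullw */
    /* Right shift of a negative value is implementation-defined in C;
       an arithmetic shift is assumed here, as in mulh() above. */
    uint32_t p_hi = (uint32_t)(((int64_t)x * y) >> 32);  /* mulhw */
    uint32_t s_lo = *lo + p_lo;                          /* addc  */
    uint32_t carry = s_lo < p_lo;
    *hi = *hi + p_hi + carry;                            /* adde  */
    *lo = s_lo;
}

int64_t mac_split(int64_t a, int b, int c, int d)
{
    uint32_t lo = (uint32_t)a;
    uint32_t hi = (uint32_t)((uint64_t)a >> 32);

    assert(sizeof(int) == 4);   /* the sketch assumes 32-bit int */
    mac64(&hi, &lo, b, c);
    mac64(&hi, &lo, b, d);
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

<p>Each call to <code>mac64</code> accounts for two multiplications and two additions, giving four of each for the whole function.</p>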
<p>Fortunately, this madness is easily fixed with a little inline assembler, more than doubling the speed of the decoder and making FFmpeg significantly faster than MAD on PowerPC as well.</p>
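<p>As an illustration of what such inline assembler might look like for the first example, here is a hedged sketch using GCC extended asm. It is not the actual FFmpeg code; the name <code>mulh_asm</code> and the preprocessor guard are assumptions, and on anything other than 32-bit PowerPC it falls back to the plain C expression.</p>

```c
#include <assert.h>
#include <stdint.h>

/* High 32 bits of the signed 64-bit product a * b. */
static inline int mulh_asm(int a, int b)
{
    assert(sizeof(int) == 4);   /* the sketch assumes 32-bit int */
#if defined(__GNUC__) && defined(__powerpc__) && !defined(__powerpc64__)
    int r;
    /* A single mulhw; the low half of the product is never computed. */
    __asm__ ("mulhw %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    /* Portable fallback, same expression as mulh() above. */
    return (int)(((int64_t)a * (int64_t)b) >> 32);
#endif
}
```

<p>The asm statement has no side effects beyond its output operand, so it does not need to be <code>volatile</code> and the compiler remains free to schedule or eliminate it.</p>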
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/05/13/gcc-makes-a-mess/feed/</wfw:commentRss>
		<slash:comments>34</slash:comments>
		</item>
		<item>
		<title>Thumbs up</title>
		<link>http://hardwarebug.org/2009/03/25/thumbs-up/</link>
		<comments>http://hardwarebug.org/2009/03/25/thumbs-up/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 03:27:04 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=125</guid>
		<description><![CDATA[ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions. Thumb-2 performance is claimed [...]]]></description>
			<content:encoded><![CDATA[<p>ARM processors have long supported the 16-bit <a href="http://arm.com/products/CPUs/archi-thumb.html">Thumb</a> instruction set, achieving smaller code size at the price of reduced performance. The <a href="http://arm.com/products/CPUs/archi-thumb2.html">Thumb-2</a> extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.</p>
<p>Thumb-2 performance is <a href="http://arm.com/pdfs/Thumb-2CoreTechnologyWhitepaper-Final4.pdf">claimed</a> to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with <a href="http://ffmpeg.org/">FFmpeg</a> as the target and compiled the same source revision in ARM and Thumb-2 mode using the <a href="http://arm.com/products/DevTools/RVCT.html">RVCT 4.0</a> compiler. For this test I disabled all hand-written assembler optimisations.</p>
<p>The Thumb-2 executable is 85% of the ARM one in size, which, although a substantial reduction, falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a <a href="http://beagleboard.org/">Beagle board</a>. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) compared to ARM code. The speedup is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.</p>
<p>In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/03/25/thumbs-up/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Shared library woes and the price of PIC</title>
		<link>http://hardwarebug.org/2009/01/02/shared-library-woes-and-the-price-of-pic/</link>
		<comments>http://hardwarebug.org/2009/01/02/shared-library-woes-and-the-price-of-pic/#comments</comments>
		<pubDate>Fri, 02 Jan 2009 18:28:53 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Bugs]]></category>
		<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=100</guid>
		<description><![CDATA[It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using MOVW/MOVT instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash. When I [...]]]></description>
			<content:encoded><![CDATA[<p>It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using <code>MOVW/MOVT</code> instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash.</p>
<p>When I pointed out this shortcoming to Paul Brook of CodeSourcery, his response was that such relocations in shared libraries are not supported by the GNU tools, will never be, and that shared libraries should be built with position-independent code (PIC). This is an unfortunate attitude, and doubly so considering that the latest CodeSourcery GCC version will generate these instructions with default settings. In other words, the 2008q3 release of CodeSourcery GCC will, with default flags, build crashing shared libraries without so much as a warning.</p>
<p>The refusal to support non-PIC shared libraries is unfortunate also from a performance point of view. Position-independent code is inherently slower than normal code.</p>
<p>In order to find out just how much slower PIC is on ARM, I made two builds of FFmpeg, one normal and one with PIC. The PIC build is about 1.7% slower in several tests, among them H.264 video decoding.</p>
<p>On typically resource-constrained ARM systems it would be nice to have the option of space-saving shared libraries without paying the PIC penalty in performance. Until now this option has been a reality. With CodeSourcery lazily refusing to support the relocations required by the latest version of their own compiler, this option may soon be a thing of the past, at least if the bugs that have haunted recent compiler releases are fixed in upcoming versions.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/01/02/shared-library-woes-and-the-price-of-pic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM-NEON memory hazards</title>
		<link>http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/</link>
		<comments>http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/#comments</comments>
		<pubDate>Wed, 31 Dec 2008 02:19:13 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=89</guid>
		<description><![CDATA[The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear [...]]]></description>
			<content:encoded><![CDATA[<p>The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.</p>
<p>Although clearly important with a view to code optimisation, the <a href="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344b/index.html">Cortex-A8 Technical Reference Manual</a> unfortunately does not mention any details about these hazards. In fact, it does not mention them at all.</p>
<p>To shed some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.</p>
<p><span id="more-89"></span>Since NEON execution lags behind the ARM pipeline, three types of hazard can occur:</p>
<ul>
<li>ARM load after NEON store</li>
<li>ARM store after NEON load</li>
<li>ARM store after NEON store</li>
</ul>
<p>The characteristics of each are tested with a loop interleaving 64-bit NEON <code>VLD1/VST1</code> and ARM <code>LDR/STR</code> instructions, with the addresses placed at various intervals. The hardware used for the test is a <a href="http://beagleboard.org/">Beagle Board</a> clocked at 500 MHz and with the <a href="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344b/Bgbffjhh.html">L1NEON</a> configuration bit set.</p>
<p>It quickly becomes evident that the basic granularity for the hazard detection is 16 bytes. In addition, some tests show secondary effects within a 64-byte block (cache line). NEON stores crossing a 16-byte boundary apparently incur an extra penalty.</p>
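<p>To make the granularity concrete, the following sketch classifies a pair of addresses the way the tables below are organised. It rests on an assumption about how to read the column headings: "16-byte" is taken to mean both accesses fall in the same aligned 16-byte block, "64-byte" the same aligned 64-byte cache line, and "other" anything else. All names here are hypothetical.</p>

```c
#include <assert.h>
#include <stdint.h>

enum hazard_class { SAME_16_BYTE, SAME_64_BYTE, OTHER };

/* Classify a NEON access and an ARM access by the smallest aligned
   block (16 or 64 bytes) that contains both start addresses. */
enum hazard_class classify(uintptr_t neon_addr, uintptr_t arm_addr)
{
    if ((neon_addr >> 4) == (arm_addr >> 4))
        return SAME_16_BYTE;
    if ((neon_addr >> 6) == (arm_addr >> 6))
        return SAME_64_BYTE;
    return OTHER;
}

/* Non-zero if an access of `size` bytes at `addr` crosses an aligned
   16-byte boundary. */
int spans_16_byte_boundary(uintptr_t addr, unsigned size)
{
    assert(size >= 1);
    return (addr >> 4) != ((addr + size - 1) >> 4);
}
```

<p>For example, a 64-bit NEON store at an address ending in <code>0xC</code> spans a 16-byte boundary, while the same store at an address ending in <code>0x8</code> does not.</p>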
<p>The following table lists the approximate number of cycles required for each pair of instructions when no access spans a 16-byte boundary.</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th>16-byte</th>
<th>64-byte</th>
<th>other</th>
</tr>
<tr>
<th>ARM load after NEON store</th>
<td>22</td>
<td>5</td>
<td>3</td>
</tr>
<tr>
<th>ARM store after NEON load</th>
<td>13</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<th>ARM store after NEON store</th>
<td>22</td>
<td>4</td>
<td>4</td>
</tr>
</tbody>
</table>
<p>The delay of roughly 20 cycles after a NEON store corresponds nicely with the figure of 20 cycles the <a href="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344b/ch16s05s02.html">TRM</a> quotes for an MRC transfer from NEON to ARM.</p>
<p>The next table lists the same timings when the NEON access spans a 16-byte boundary.</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th>16-byte</th>
<th>64-byte</th>
<th>other</th>
</tr>
<tr>
<th>ARM load after NEON store</th>
<td>22</td>
<td>7</td>
<td>5</td>
</tr>
<tr>
<th>ARM store after NEON load</th>
<td>13</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<th>ARM store after NEON store</th>
<td>22</td>
<td>52</td>
<td>48</td>
</tr>
</tbody>
</table>
<p>I was somewhat baffled by the last line. Clearly such NEON stores are something to be avoided. Splitting the NEON store into two 32-bit stores has a dramatic effect:</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th>16-byte</th>
<th>64-byte</th>
<th>other</th>
</tr>
<tr>
<th>ARM store after NEON store</th>
<td>22</td>
<td>32</td>
<td>29</td>
</tr>
</tbody>
</table>
<p>Although clearly an improvement, it is still bad enough that mixing such accesses could easily impact performance seriously. It should also be noted that in all other cases, the 64-bit store is faster.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

