<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hardwarebug &#187; Compilers</title>
	<atom:link href="http://hardwarebug.org/category/compilers/feed/" rel="self" type="application/rss+xml" />
	<link>http://hardwarebug.org</link>
	<description>Everything is broken</description>
	<lastBuildDate>Tue, 18 Oct 2011 00:41:58 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>ARM inline asm secrets</title>
		<link>http://hardwarebug.org/2010/07/06/arm-inline-asm-secrets/</link>
		<comments>http://hardwarebug.org/2010/07/06/arm-inline-asm-secrets/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 20:52:43 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=493</guid>
		<description><![CDATA[Although I generally recommend against using GCC inline assembler, preferring instead pure assembler code in separate files, there are occasions where inline is the appropriate solution. Should one, at a time like this, turn to the GCC documentation for guidance, one must be prepared for a degree of disappointment. As it happens, much of the [...]]]></description>
			<content:encoded><![CDATA[<p>Although I generally recommend against using GCC inline assembler, preferring instead pure assembler code in separate files, there are occasions where inline is the appropriate solution. Should one, at a time like this, turn to the GCC documentation for guidance, one must be prepared for a degree of disappointment. As it happens, much of the inline asm syntax is left entirely undocumented. This article attempts to fill in some of the blanks for the ARM target.<br />
<span id="more-493"></span></p>
<style>
.asm { border-collapse: collapse; }
.asm td { padding: 0.5em; }
.asm td:first-child { font-family: monospace; font-weight: bold; vertical-align: top }
</style>
<h3>Constraints</h3>
<p>Each operand of an inline asm block is described by a constraint string encoding the valid representations of the operand in the generated assembler. For example, the &#8220;r&#8221; code denotes a general-purpose register. In addition to the standard constraints, ARM allows a number of special codes, only some of which are documented. The full list, including a brief description, is available in the <code>constraints.md</code> file in the GCC source tree. The following table is an extract from this file consisting of the codes which are meaningful in an inline asm block (a few are only useful in the machine description itself).</p>
<table class="asm">
<tr>
<td>f</td>
<td>Legacy FPA registers <code>f0-f7</code>.</td>
</tr>
<tr>
<td>t</td>
<td>The VFP registers <code>s0-s31</code>.</td>
</tr>
<tr>
<td>v</td>
<td>The Cirrus Maverick co-processor registers.</td>
</tr>
<tr>
<td>w</td>
<td>The VFP registers <code>d0-d15</code>, or <code>d0-d31</code> for VFPv3.</td>
</tr>
<tr>
<td>x</td>
<td>The VFP registers <code>d0-d7</code>.</td>
</tr>
<tr>
<td>y</td>
<td>The Intel iWMMX co-processor registers.</td>
</tr>
<tr>
<td>z</td>
<td>The Intel iWMMX GR registers.</td>
</tr>
<tr>
<td>l</td>
<td>In Thumb state the core registers <code>r0-r7</code>.</td>
</tr>
<tr>
<td>h</td>
<td>In Thumb state the core registers <code>r8-r15</code>.</td>
</tr>
<tr>
<td>j</td>
<td>A constant suitable for a MOVW instruction. (ARM/Thumb-2)</td>
</tr>
<tr>
<td>b</td>
<td>Thumb only.  The union of the low registers and the stack register.</td>
</tr>
<tr>
<td>I</td>
<td>In ARM/Thumb-2 state a constant that can be used as an immediate value in a Data Processing instruction.  In Thumb-1 state a constant in the range 0 to 255.</td>
</tr>
<tr>
<td>J</td>
<td>In ARM/Thumb-2 state a constant in the range -4095 to 4095.  In Thumb-1 state a constant in the range -255 to -1.</td>
</tr>
<tr>
<td>K</td>
<td>In ARM/Thumb-2 state a constant that satisfies the <code>I</code> constraint if inverted.  In Thumb-1 state a constant that satisfies the <code>I</code> constraint multiplied by any power of 2.</td>
</tr>
<tr>
<td>L</td>
<td>In ARM/Thumb-2 state a constant that satisfies the <code>I</code> constraint if negated.  In Thumb-1 state a constant in the range -7 to 7.</td>
</tr>
<tr>
<td>M</td>
<td>In Thumb-1 state a constant that is a multiple of 4 in the range 0 to 1020.</td>
</tr>
<tr>
<td>N</td>
<td>In Thumb-1 state a constant in the range 0 to 31.</td>
</tr>
<tr>
<td>O</td>
<td>In Thumb-1 state a constant that is a multiple of 4 in the range -508 to 508.</td>
</tr>
<tr>
<td>Pa</td>
<td>In Thumb-1 state a constant in the range -510 to +510</td>
</tr>
<tr>
<td>Pb</td>
<td>In Thumb-1 state a constant in the range -262 to +262</td>
</tr>
<tr>
<td>Ps</td>
<td>In Thumb-2 state a constant in the range -255 to +255</td>
</tr>
<tr>
<td>Pt</td>
<td>In Thumb-2 state a constant in the range -7 to +7</td>
</tr>
<tr>
<td>G</td>
<td>In ARM/Thumb-2 state a valid FPA immediate constant.</td>
</tr>
<tr>
<td>H</td>
<td>In ARM/Thumb-2 state a valid FPA immediate constant when negated.</td>
</tr>
<tr>
<td>Da</td>
<td>In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with two Data Processing insns.</td>
</tr>
<tr>
<td>Db</td>
<td>In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with three Data Processing insns.</td>
</tr>
<tr>
<td>Dc</td>
<td>In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with four Data Processing insns.  This pattern is disabled if optimizing for space or when we have load-delay slots to fill.</td>
</tr>
<tr>
<td>Dn</td>
<td>In ARM/Thumb-2 state a const_vector which can be loaded with a Neon vmov immediate instruction.</td>
</tr>
<tr>
<td>Dl</td>
<td>In ARM/Thumb-2 state a const_vector which can be used with a Neon vorr or vbic instruction.</td>
</tr>
<tr>
<td>DL</td>
<td>In ARM/Thumb-2 state a const_vector which can be used with a Neon vorn or vand instruction.</td>
</tr>
<tr>
<td>Dv</td>
<td>In ARM/Thumb-2 state a const_double which can be used with a VFP fconsts instruction.</td>
</tr>
<tr>
<td>Dy</td>
<td>In ARM/Thumb-2 state a const_double which can be used with a VFP fconstd instruction.</td>
</tr>
<tr>
<td>Ut</td>
<td>In ARM/Thumb-2 state an address valid for loading/storing opaque structure types wider than TImode.</td>
</tr>
<tr>
<td>Uv</td>
<td>In ARM/Thumb-2 state a valid VFP load/store address.</td>
</tr>
<tr>
<td>Uy</td>
<td>In ARM/Thumb-2 state a valid iWMMX load/store address.</td>
</tr>
<tr>
<td>Un</td>
<td>In ARM/Thumb-2 state a valid address for Neon doubleword vector load/store instructions.</td>
</tr>
<tr>
<td>Um</td>
<td>In ARM/Thumb-2 state a valid address for Neon element and structure load/store instructions.</td>
</tr>
<tr>
<td>Us</td>
<td>In ARM/Thumb-2 state a valid address for non-offset loads/stores of quad-word values in four ARM registers.</td>
</tr>
<tr>
<td>Uq</td>
<td>In ARM state an address valid in ldrsb instructions.</td>
</tr>
<tr>
<td>Q</td>
<td>In ARM/Thumb-2 state an address that is a single base register.</td>
</tr>
</table>
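<p>As a rough illustration of how constraint codes are attached to operands, the sketch below combines the standard <code>r</code> constraint with the <code>I</code> constraint from the table. The function name <code>add_255</code>, the preprocessor guard and the portable fallback are assumptions made for this example only; they are not taken from the GCC sources.</p>

```c
/* Hypothetical helper for illustration: returns x + 255. */
unsigned add_255(unsigned x)
{
#if defined(__arm__) && (!defined(__thumb__) || defined(__thumb2__))
    unsigned r;
    /* "r" requests any core register. "I" requires a compile-time
       constant usable as a data-processing immediate; it is printed
       with a leading '#', giving e.g. "add r0, r1, #255". */
    __asm__("add %0, %1, %2" : "=r"(r) : "r"(x), "I"(255));
    return r;
#else
    /* Portable fallback so the sketch also builds on other targets. */
    return x + 255u;
#endif
}
```

<p>If the constant did not satisfy <code>I</code>, the compiler would reject the asm statement rather than emit an instruction the assembler cannot encode.</p>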
<h3>Operand codes</h3>
<p>Within the text of an inline asm block, operands are referenced as <code>%0</code>, <code>%1</code> etc. Register operands are printed as <code>rN</code>, memory operands as <code>[rN, #offset]</code>, and so forth.  In some situations, for example with operands occupying multiple registers, more detailed control of the output may be required, and once again, an undocumented feature comes to our rescue.</p>
<p>Special code letters inserted between the <code>%</code> and the operand number alter the output from the default for each type of operand.  The table below lists the more useful ones.</p>
<table class="asm">
<tr>
<td>c</td>
<td>An integer or symbol address without a preceding # sign</td>
</tr>
<tr>
<td>B</td>
<td>Bitwise inverse of integer or symbol without a preceding #</td>
</tr>
<tr>
<td>L</td>
<td>The low 16 bits of an immediate constant</td>
</tr>
<tr>
<td>m</td>
<td>The base register of a memory operand</td>
</tr>
<tr>
<td>M</td>
<td>A register range suitable for LDM/STM</td>
</tr>
<tr>
<td>H</td>
<td>The highest-numbered register of a pair</td>
</tr>
<tr>
<td>Q</td>
<td>The least significant register of a pair</td>
</tr>
<tr>
<td>R</td>
<td>The most significant register of a pair</td>
</tr>
<tr>
<td>P</td>
<td>A double-precision VFP register</td>
</tr>
<tr>
<td>p</td>
<td>The high single-precision register of a VFP double-precision register</td>
</tr>
<tr>
<td>q</td>
<td>A NEON quad register</td>
</tr>
<tr>
<td>e</td>
<td>The low doubleword register of a NEON quad register</td>
</tr>
<tr>
<td>f</td>
<td>The high doubleword register of a NEON quad register</td>
</tr>
<tr>
<td>h</td>
<td>A range of VFP/NEON registers suitable for VLD1/VST1</td>
</tr>
<tr>
<td>A</td>
<td>A memory operand for a VLD1/VST1 instruction</td>
</tr>
<tr>
<td>y</td>
<td>S register as indexed D register, e.g. <code>s5</code> becomes <code>d2[1]</code></td>
</tr>
</table>
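<p>A minimal sketch of the operand codes in use: the <code>Q</code> and <code>R</code> codes select the least and most significant register of a 64-bit operand held in a register pair. The function name <code>add64</code>, the guard macros and the fallback path are assumptions for this example.</p>

```c
#include <stdint.h>

/* Hypothetical helper for illustration: 64-bit addition. */
uint64_t add64(uint64_t a, uint64_t b)
{
#if defined(__arm__) && (!defined(__thumb__) || defined(__thumb2__))
    uint64_t r;
    /* %Q selects the low word and %R the high word of each pair.
       The early-clobber "&" keeps the output pair from overlapping the
       inputs, since the low word is written before the high words of
       the inputs are read. "cc" declares the flags as clobbered. */
    __asm__("adds %Q0, %Q1, %Q2\n\t"
            "adc  %R0, %R1, %R2"
            : "=&r"(r)
            : "r"(a), "r"(b)
            : "cc");
    return r;
#else
    /* Portable fallback so the sketch also builds on other targets. */
    return a + b;
#endif
}
```

<p>Using <code>%Q</code> and <code>%R</code> rather than <code>%0</code> and <code>%H0</code> keeps the code independent of which register of the pair holds the low word.</p>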
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/07/06/arm-inline-asm-secrets/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Bit-field badness</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/</link>
		<comments>http://hardwarebug.org/2010/01/30/bit-field-badness/#comments</comments>
		<pubDate>Sat, 30 Jan 2010 16:15:05 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=230</guid>
		<description><![CDATA[Consider the following C code which is based on a real-world situation. struct bf1_31 { unsigned a:1; unsigned b:31; }; void func(struct bf1_31 *p, int n, int a) { int i = 0; do { if (p[i].a) p[i].b += a; } while (++i &#60; n); } How would we best write this in ARM assembler? [...]]]></description>
			<content:encoded><![CDATA[<p>Consider the following C code which is based on a real-world situation.</p>
<blockquote>
<pre>struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i &lt; n);
}
</pre>
</blockquote>
<p>How would we best write this in ARM assembler? This is how I would do it:<br />
<span id="more-230"></span></p>
<blockquote>
<pre>func:
        ldr     r3,  [r0], #4
        tst     r3,  #1
        add     r3,  r3,  r2,  lsl #1
        strne   r3,  [r0, #-4]
        subs    r1,  r1,  #1
        bgt     func
        bx      lr
</pre>
</blockquote>
<p>The <code>add</code> instruction is unconditional to avoid a dependency on the comparison. Unrolling the loop would mask the latency of the <code>ldr</code> instruction as well, but that is outside the scope of this experiment.</p>
<p>Now compile this code with <code>gcc -march=armv5te -O3</code> and watch in horror:</p>
<blockquote>
<pre>func:
        push    {r4}
        mov     ip, #0
        mov     r4, r2
loop:
        ldrb    r3, [r0]
        add     ip, ip, #1
        tst     r3, #1
        ldrne   r3, [r0]
        andne   r2, r3, #1
        addne   r3, r4, r3, lsr #1
        orrne   r2, r2, r3, lsl #1
        strne   r2, [r0]
        cmp     ip, r1
        add     r0, r0, #4
        blt     loop
        pop     {r4}
        bx      lr
</pre>
</blockquote>
<p>This is nothing short of awful:</p>
<ul>
<li>The same value is loaded from memory twice.</li>
<li>A complicated mask/shift/or operation is used where a simple shifted add would suffice.</li>
<li>Write-back addressing is not used.</li>
<li>The loop control counts up and compares instead of counting down.</li>
<li>Useless <code>mov</code> in the prologue; swapping the roles of <code>r2</code> and <code>r4</code> would avoid this.</li>
<li>Using <code>lr</code> in place of <code>r4</code> would allow the return to be done with <code>pop {pc}</code>, saving one instruction (ignoring for the moment that no callee-saved registers are needed at all).</li>
</ul>
<p>Even for this trivial function the gcc-generated code is more than twice the optimal size and slower by approximately the same factor.</p>
<p>The main issue I wanted to illustrate is the poor handling of bit-fields by gcc. When accessing bitfields from memory, gcc issues a separate load for each field even when they are contained in the same aligned memory word. Although each load after the first will most likely hit L1 cache, this is still bad for several reasons:</p>
<ul>
<li>Loads typically have a result latency of two or three cycles, compared to one cycle for data processing instructions. Any bit-field can be extracted from a register with two shifts, and on ARM the second of these can generally be achieved using a shifted second operand to a following instruction. The ARMv6T2 instruction set also adds the <code>SBFX</code> and <code>UBFX</code> instructions for extracting any signed or unsigned bit-field in one cycle.</li>
<li>Most CPUs have more data processing units than load/store units. It is thus more likely for an ALU instruction than a load/store to issue without delay on a superscalar processor.</li>
<li>Redundant memory accesses can trigger early flushing of store buffers rendering these less efficient.</li>
</ul>
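<p>The single-load approach can also be sketched in C by treating each struct as a plain 32-bit word. This assumes that field <code>a</code> occupies bit 0 and field <code>b</code> bits 1 to 31, which depends on the ABI's bit-field layout; the name <code>func_words</code> is made up for this example.</p>

```c
#include <stdint.h>

/* Sketch only: equivalent to func() if a is bit 0 and b is bits 1-31
   of each word (an assumption about the bit-field layout). */
void func_words(uint32_t *p, int n, int a)
{
    int i = 0;
    do {
        uint32_t w = p[i];                  /* one load of the whole word */
        if (w & 1u)                         /* test field a */
            p[i] = w + ((uint32_t)a << 1);  /* add to field b, bit 0 kept */
    } while (++i < n);
}
```

<p>Adding <code>a &lt;&lt; 1</code> to the whole word leaves bit 0 untouched and wraps field <code>b</code> modulo 2<sup>31</sup>, which mirrors the shifted <code>add</code> in the hand-written assembler.</p>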
<p>No gcc bashing is complete without a comparison with another compiler, so without further ado, here is the ARM RVCT output (<code>armcc --cpu 5te -O3</code>):</p>
<blockquote>
<pre>func:
        mov     r3, #0
        push    {r4, lr}
loop:
        ldr     ip, [r0, r3, lsl #2]
        tst     ip, #1
        addne   ip, ip, r2, lsl #1
        strne   ip, [r0, r3, lsl #2]
        add     r3, r3, #1
        cmp     r3, r1
        blt     loop
        pop     {r4, pc}
</pre>
</blockquote>
<p>This is much better, the core loop using only one instruction more than my version. The loop control is counting up, but at least this register is reused as offset for the memory accesses. More remarkable is the push/pop of two registers that are never used. I had not expected to see this from RVCT.</p>
<p>Even the best compilers are still no match for a human.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/30/bit-field-badness/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>ARM compiler update</title>
		<link>http://hardwarebug.org/2010/01/15/arm-compiler-update/</link>
		<comments>http://hardwarebug.org/2010/01/15/arm-compiler-update/#comments</comments>
		<pubDate>Fri, 15 Jan 2010 18:48:38 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=228</guid>
		<description><![CDATA[Since my last shootout,  all the tested vendors have updated their compilers. Here is a quick update on each of them. Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to 4.3.4 and 4.4.2, respectively. Neither update contains anything particularly noteworthy. The CodeSourcery 2009q3 release sees an update to [...]]]></description>
			<content:encoded><![CDATA[<p>Since my <a href="http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/">last shootout</a>,  all the tested vendors have updated their compilers. Here is a quick update on each of them.</p>
<p>Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to <a href="http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&amp;resolution=FIXED&amp;target_milestone=4.3.4">4.3.4</a> and <a href="http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&amp;resolution=FIXED&amp;target_milestone=4.4.2">4.4.2</a>, respectively. Neither update contains anything particularly noteworthy.</p>
<p>The CodeSourcery 2009q3 release sees an update to a GCC 4.4 base, a significant change from the 4.3 base used in 2009q1. The update is a mixed blessing. In fact, it is mostly a curse and hardly a blessing at all. On the bright side, the floating-point speed regressions in 2009q1 are gone, 2009q3 being a few per cent faster even than 2007q3. Unfortunately, this improvement is completely overshadowed by a major speed regression on integer code, a whopping 24% in one case. This ties in with the slowdown previously observed with FSF GCC 4.4 compared to 4.3.</p>
<p>ARM RVCT 4.0 is now at Build 697. This update fixes some bugs and introduces others. Notably, it no longer builds FFmpeg correctly. The issue has been reported to ARM.</p>
<p>Texas Instruments, finally, have made a formal release, v4.6.1, of their TMS470 compiler incorporating various fixes allowing it to build a moderately patched FFmpeg. The performance remains somewhere between GCC and RVCT on average.</p>
<p>In light of the above, my recommendations remain unchanged:</p>
<ul>
<li>For a free compiler, choose CodeSourcery 2009q1. It beats GCC 4.3.4 by 5-10% in most cases.</li>
<li>GNU purists are best served by GCC 4.3.4, which is up to 20% faster than 4.4.2 and rarely slower.</li>
<li>When price is not a concern, ARM RVCT is a good option, outperforming GCC by up to a factor of 2.</li>
<li>In all cases, disable any auto-vectorisation features.</li>
</ul>
<p>Regardless of which compiler is chosen, I cannot overstress the importance of testing. All compilers are crawling with bugs, and even the most innocent-looking code change can trigger one of them. When using a compiler other than GCC, extra caution is advised considering a lot of code is developed using only GCC and may thus fall prey to bugs unique to said other compiler.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/15/arm-compiler-update/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Beware the builtins</title>
		<link>http://hardwarebug.org/2010/01/14/beware-the-builtins/</link>
		<comments>http://hardwarebug.org/2010/01/14/beware-the-builtins/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 01:02:27 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=215</guid>
		<description><![CDATA[GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various [...]]]></description>
			<content:encoded><![CDATA[<p>GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various targets.</p>
<p>For my test, I selected the following functions:</p>
<ul>
<li><code>__builtin_bswap32</code>: Byte-swap a 32-bit word.</li>
<li><code>__builtin_bswap64</code>: Byte-swap a 64-bit word.</li>
<li><code>__builtin_clz</code>: Count leading zeros in a word.</li>
<li><code>__builtin_ctz</code>: Count trailing zeros in a word.</li>
<li><code>__builtin_prefetch</code>: Prefetch data into cache.</li>
</ul>
<p>To test the quality of these builtins, I wrapped each in a normal function, then compiled the code for these targets:</p>
<ul>
<li>ARMv7</li>
<li>AVR32</li>
<li>MIPS</li>
<li>MIPS64</li>
<li>PowerPC</li>
<li>PowerPC64</li>
<li>x86</li>
<li>x86_64</li>
</ul>
<p>In all cases the compiler flags used were <code>-O3 -fomit-frame-pointer</code> plus any flags required to select a modern CPU model.<br />
<span id="more-215"></span></p>
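<p>The wrappers were of roughly the following shape. This is a sketch of the test setup described above; the wrapper names are assumptions, while the builtins themselves are the ones listed.</p>

```c
#include <stdint.h>

/* Each builtin wrapped in an ordinary function so the generated code
   can be inspected in isolation. Wrapper names are illustrative. */
uint32_t wrap_bswap32(uint32_t x) { return __builtin_bswap32(x); }
uint64_t wrap_bswap64(uint64_t x) { return __builtin_bswap64(x); }
int      wrap_clz(unsigned x)     { return __builtin_clz(x); }  /* undefined for x == 0 */
int      wrap_ctz(unsigned x)     { return __builtin_ctz(x); }  /* undefined for x == 0 */
void     wrap_prefetch(const void *p) { __builtin_prefetch(p); }
```

<p>Compiling such a file with <code>-S</code> leaves the assembler output for each function available for inspection.</p>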
<h3>ARM</h3>
<p>Both  <code>__builtin_clz</code> and <code>__builtin_prefetch</code> generate the expected <code>CLZ</code> and <code>PLD</code> instructions respectively. The code for <code>__builtin_ctz</code> is reasonable for ARMv6 and earlier:</p>
<blockquote>
<pre>rsb     r3, r0, #0
and     r0, r3, r0
clz     r0, r0
rsb     r0, r0, #31
</pre>
</blockquote>
<p>For ARMv7 (in fact v6T2), however, using the new bit-reversal instruction would have been better:</p>
<blockquote>
<pre>rbit    r0, r0
clz     r0, r0
</pre>
</blockquote>
<p>I suspect this is simply a matter of the function not yet having been updated for ARMv7, which is perhaps even excusable given the relatively rare use cases for it.</p>
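<p>For reference, the four-instruction sequence for ARMv6 and earlier corresponds to the following identity, written here in C. The sketch assumes a 32-bit <code>unsigned</code> and a non-zero argument; the function name is made up.</p>

```c
/* Sketch: count trailing zeros using only negate, and, and clz.
   Assumes 32-bit unsigned and x != 0. */
int ctz_via_clz(unsigned x)
{
    unsigned lowest = x & (0u - x);       /* rsb + and: isolate lowest set bit */
    return 31 - __builtin_clz(lowest);    /* clz + rsb #31: convert to bit index */
}
```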
<p>The byte-reversal functions are where it gets shocking. Rather than use the <code>REV</code> instruction found from ARMv6 on, both of them generate external calls to <code>__bswapsi2</code> and <code>__bswapdi2</code> in libgcc, which is plain C code:</p>
<blockquote>
<pre>SItype
__bswapsi2 (SItype u)
{
  return ((((u) &amp; 0xff000000) &gt;&gt; 24)
          | (((u) &amp; 0x00ff0000) &gt;&gt;  8)
          | (((u) &amp; 0x0000ff00) &lt;&lt;  8)
          | (((u) &amp; 0x000000ff) &lt;&lt; 24));
}

DItype
__bswapdi2 (DItype u)
{
   return ((((u) &amp; 0xff00000000000000ull) &gt;&gt; 56)
          | (((u) &amp; 0x00ff000000000000ull) &gt;&gt; 40)
          | (((u) &amp; 0x0000ff0000000000ull) &gt;&gt; 24)
          | (((u) &amp; 0x000000ff00000000ull) &gt;&gt;  8)
          | (((u) &amp; 0x00000000ff000000ull) &lt;&lt;  8)
          | (((u) &amp; 0x0000000000ff0000ull) &lt;&lt; 24)
          | (((u) &amp; 0x000000000000ff00ull) &lt;&lt; 40)
          | (((u) &amp; 0x00000000000000ffull) &lt;&lt; 56));
}
</pre>
</blockquote>
<p>While the 32-bit version compiles to a reasonable-looking shift/mask/or job, the 64-bit one is a real WTF. Brace yourselves:</p>
<blockquote>
<pre>push    {r4, r5, r6, r7, r8, r9, sl, fp}
mov     r5, #0
mov     r6, #65280      ; 0xff00
sub     sp, sp, #40     ; 0x28
and     r7, r0, r5
and     r8, r1, r6
str     r7, [sp, #8]
str     r8, [sp, #12]
mov     r9, #0
mov     r4, r1
and     r5, r0, r9
mov     sl, #255        ; 0xff
ldr     r9, [sp, #8]
and     r6, r4, sl
mov     ip, #16711680   ; 0xff0000
str     r5, [sp, #16]
str     r6, [sp, #20]
lsl     r2, r0, #24
and     ip, ip, r1
lsr     r7, r4, #24
mov     r1, #0
lsr     r5, r9, #24
mov     sl, #0
mov     r9, #-16777216  ; 0xff000000
and     fp, r0, r9
lsr     r6, ip, #8
orr     r9, r7, r1
and     ip, r4, sl
orr     sl, r1, r2
str     r6, [sp]
str     r9, [sp, #32]
str     sl, [sp, #36]   ; 0x24
add     r8, sp, #32
ldm     r8, {r7, r8}
str     r1, [sp, #4]
ldm     sp, {r9, sl}
orr     r7, r7, r9
orr     r8, r8, sl
str     r7, [sp, #32]
str     r8, [sp, #36]   ; 0x24
mov     r3, r0
mov     r7, #16711680   ; 0xff0000
mov     r8, #0
and     r9, r3, r7
and     sl, r4, r8
ldr     r0, [sp, #16]
str     fp, [sp, #24]
str     ip, [sp, #28]
stm     sp, {r9, sl}
ldr     r7, [sp, #20]
ldr     sl, [sp, #12]
ldr     fp, [sp, #12]
ldr     r8, [sp, #28]
lsr     r0, r0, #8
orr     r7, r0, r7, lsl #24
lsr     r6, sl, #24
orr     r5, r5, fp, lsl #8
lsl     sl, r8, #8
mov     fp, r7
add     r8, sp, #32
ldm     r8, {r7, r8}
orr     r6, r6, r8
ldr     r8, [sp, #20]
ldr     r0, [sp, #24]
orr     r5, r5, r7
lsr     r8, r8, #8
orr     sl, sl, r0, lsr #24
mov     ip, r8
ldr     r0, [sp, #4]
orr     fp, fp, r5
ldr     r5, [sp, #24]
orr     ip, ip, r6
ldr     r6, [sp]
lsl     r9, r5, #8
lsl     r8, r0, #24
orr     fp, fp, r9
lsl     r3, r3, #8
orr     r8, r8, r6, lsr #8
orr     ip, ip, sl
lsl     r7, r6, #24
and     r5, r3, #16711680       ; 0xff0000
orr     r7, r7, fp
orr     r8, r8, ip
orr     r4, r1, r7
orr     r5, r5, r8
mov     r9, r6
mov     r1, r5
mov     r0, r4
add     sp, sp, #40     ; 0x28
pop     {r4, r5, r6, r7, r8, r9, sl, fp}
bx      lr
</pre>
</blockquote>
<p>That&#8217;s right, 91 instructions to move 8 bytes around a bit. GCC definitely has a problem with 64-bit numbers. It is perhaps worth noting that the <code>bswap_64</code> macro in glibc splits the 64-bit value into 32-bit halves which are then reversed independently, thus side-stepping this weakness of gcc.</p>
<p>As a side note, ARM RVCT (armcc) compiles those functions perfectly into one and two <code>REV</code> instructions, respectively.</p>
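<p>The splitting approach mentioned for glibc can be sketched as follows. This is not the actual text of the <code>bswap_64</code> macro, only an illustration of the idea under the assumption that the 32-bit swap is done with the same shift/mask expression as <code>__bswapsi2</code>; both function names are made up.</p>

```c
#include <stdint.h>

/* 32-bit byte swap, same expression as the libgcc version above. */
static uint32_t swap32(uint32_t u)
{
    return ((u & 0xff000000u) >> 24)
         | ((u & 0x00ff0000u) >>  8)
         | ((u & 0x0000ff00u) <<  8)
         | ((u & 0x000000ffu) << 24);
}

/* Sketch: reverse each 32-bit half independently, then exchange them. */
uint64_t swap64_by_halves(uint64_t u)
{
    uint32_t hi = (uint32_t)(u >> 32);
    uint32_t lo = (uint32_t)u;
    return ((uint64_t)swap32(lo) << 32) | swap32(hi);
}
```

<p>Only 32-bit operations are applied to the data, so the compiler never has to reason about 64-bit masks and shifts.</p>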
<h3>AVR32</h3>
<p>There is not much to report here. The latest gcc version available is 4.2.4, which doesn&#8217;t appear to have the bswap functions. The other three are handled nicely, even using a bit-reverse for <code>__builtin_ctz</code>.</p>
<h3>MIPS / MIPS64</h3>
<p>The situation on MIPS is similar to ARM. Both bswap builtins result in external libgcc calls, the rest giving sensible code.</p>
<h3>PowerPC</h3>
<p>I scarcely believe my eyes, but this one is actually not bad. The PowerPC has no byte-reversal instructions, yet someone seems to have taken the time to teach gcc a good instruction sequence for this operation. The PowerPC does have some powerful rotate-and-mask instructions which come in handy here. First the 32-bit version:</p>
<blockquote>
<pre>rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
mr      r3,r0
blr
</pre>
</blockquote>
<p>The 64-bit byte-reversal simply applies the above code on each half of the value:</p>
<blockquote>
<pre>rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
rotlwi  r3,r4,8
rlwimi  r3,r4,24,0,7
rlwimi  r3,r4,24,16,23
mr      r4,r0
blr
</pre>
</blockquote>
<p>Although I haven&#8217;t analysed that code carefully, it looks pretty good.</p>
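<p>One way to check the 32-bit sequence is to model the instructions in C. The sketch below is such a model, not compiler output: <code>rotl32</code>, <code>rlwimi_model</code> and <code>ppc_bswap32_model</code> are hypothetical helper names, and the masks are the PowerPC bit ranges 0 to 7 and 16 to 23 translated to ordinary hexadecimal (PowerPC numbers bit 0 as the most significant).</p>

```c
#include <stdint.h>

/* Rotate left by n bits, valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Model of rlwimi: rotate src left by sh and insert the bits selected
   by mask into dst, leaving the other bits of dst unchanged. */
static uint32_t rlwimi_model(uint32_t dst, uint32_t src, unsigned sh, uint32_t mask)
{
    return (dst & ~mask) | (rotl32(src, sh) & mask);
}

/* Model of the three-instruction byte reversal shown above. */
uint32_t ppc_bswap32_model(uint32_t x)
{
    uint32_t r = rotl32(x, 8);                 /* rotlwi r0,r3,8 */
    r = rlwimi_model(r, x, 24, 0xff000000u);   /* rlwimi r0,r3,24,0,7 */
    r = rlwimi_model(r, x, 24, 0x0000ff00u);   /* rlwimi r0,r3,24,16,23 */
    return r;
}
```

<p>The initial rotate puts two of the four bytes in their final positions, and each <code>rlwimi</code> fixes one of the remaining two.</p>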
<h3>PowerPC64</h3>
<p>Doing 64-bit operations is easier on a 64-bit CPU, right? For you and me perhaps, but not for gcc. Here <code>__builtin_bswap64</code> gives us the now familiar <code>__bswapdi2</code> call, and while not as bad as the ARM version, it is not pretty:</p>
<blockquote>
<pre>rldicr  r0,r3,8,55
rldicr  r10,r3,56,7
rldicr  r0,r0,56,15
rldicl  r11,r3,8,56
rldicr  r9,r3,16,47
or      r11,r10,r11
rldicr  r9,r9,48,23
rldicl  r10,r0,24,40
rldicr  r0,r3,24,39
or      r11,r11,r10
rldicl  r9,r9,40,24
rldicr  r0,r0,40,31
or      r9,r11,r9
rlwinm  r10,r3,0,0,7
rldicl  r0,r0,56,8
or      r0,r9,r0
rldicr  r10,r10,8,55
rlwinm  r11,r3,0,8,15
or      r0,r0,r10
rldicr  r11,r11,24,39
rlwinm  r3,r3,0,16,23
or      r0,r0,r11
rldicr  r3,r3,40,23
or      r3,r0,r3
blr
</pre>
</blockquote>
<p>That is 6 times longer than the (presumably) hand-written 32-bit version.</p>
<h3>x86 / x86_64</h3>
<p>As one might expect, results on x86 are good. All the tested functions use the available special instructions. One word of caution though: the bit-counting instructions are very slow on some implementations, specifically the Atom, AMD chips, and the notoriously slow Pentium4E.</p>
<h3>Conclusion</h3>
<p>In conclusion, I would say gcc builtins can be useful to avoid fragile inline assembler. Before using them, however, one should make sure they are not in fact harmful on the required targets. Not even those builtins mapping directly to CPU instructions can be trusted.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/14/beware-the-builtins/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>ARM compiler shoot-out, round 2</title>
		<link>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/</link>
		<comments>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 20:20:35 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=204</guid>
		<description><![CDATA[In my recent test of ARM compilers, I had to leave out Texas Instruments&#8217; compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/">recent test</a> of ARM compilers, I had to leave out Texas Instruments&#8217; compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present round two in this shoot-out.</p>
<p>The contenders this time were the fastest GCC variant from round one, ARM RVCT, and newcomer TI TMS470. With the same rules as last time, the exact versions and optimisation options were like this:</p>
<ul>
<li><strong>CodeSourcery GCC 2009q1</strong> (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>ARM RVCT 4.0 Build 591</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros</li>
<li><strong>TI TMS470 4.7.0-a9229</strong>, <span>-</span>-float_support=vfpv3 -mv=7a8 -O3 -mf=5</li>
</ul>
<p><span id="more-204"></span><br />
To keep things fair, I also left the vectoriser off with the TI compiler. The table below lists the decoding times for the sample files, this time normalised against the participating GCC compiler. Remember, smaller numbers are better. Also keep in mind that this test was done with a development snapshot of TMS470, not an approved release.</p>
<table border="0" width="100%">
<col></col>
<col></col>
<col></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<thead>
<tr style="text-align: left;">
<th>Sample name</th>
<th>Codec</th>
<th>Code type</th>
<th>GCC</th>
<th>RVCT</th>
<th>TI</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4">cathedral</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.95</td>
<td>1.02</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4">NeroAVC</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.96</td>
<td>1.05</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/indiana_jones_4-tlr3_h640w.mov">indiana_jones_4</a></td>
<td>H.264 CAVLC</td>
<td>integer</td>
<td>1.00</td>
<td>0.92</td>
<td>1.02</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4">NeroRecodeSample</a></td>
<td>MPEG-4 ASP</td>
<td>integer</td>
<td>1.00</td>
<td>1.01</td>
<td>1.08</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3">Silent_Light</a></td>
<td>MP3</td>
<td>64-bit integer</td>
<td>1.00</td>
<td>0.48</td>
<td>0.72</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/flac/When I Grow Up.flac">When_I_Grow_Up</a></td>
<td>FLAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.87</td>
<td>0.93</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg">Lumme-Badloop</a></td>
<td>Vorbis</td>
<td>float</td>
<td>1.00</td>
<td>0.94</td>
<td>1.05</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/AC3/Canyon-5.1-48khz-448kbit.ac3">Canyon</a></td>
<td>AC-3</td>
<td>float</td>
<td>1.00</td>
<td>0.88</td>
<td>1.01</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/DTS/lotr_5.1_768.dts">lotr</a></td>
<td>DTS</td>
<td>float</td>
<td>1.00</td>
<td>1.00</td>
<td>1.08</td>
</tr>
</tbody>
</table>
<p>Overall, the TI TMS470 compiler comes off slightly worse than GCC. In two cases, however, it was significantly better than GCC, but not as good as RVCT. Incidentally, those were also the ones where RVCT scored the biggest win over GCC.</p>
<p>My conclusions from this test are twofold:</p>
<ul>
<li>ARM&#8217;s own compiler is very hard to beat. They do seem to know how their chips work.</li>
<li>GCC is incredibly bad at 64-bit arithmetic on 32-bit machines.</li>
</ul>
<p>The logical next step is to test these compilers with vectorisation enabled. FFmpeg should offer plenty of opportunities for this feature to shine. Unfortunately, that test will have to wait until the RVCT vectoriser is fixed. The current release does not compile FFmpeg with vectorisation enabled.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4" length="24154488" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4" length="6766583" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/indiana_jones_4-tlr3_h640w.mov" length="16215526" type="video/quicktime" />
<enclosure url="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4" length="31027653" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3" length="4206720" type="audio/mpeg" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg" length="5856908" type="audio/ogg" />
		</item>
		<item>
		<title>DRM the Big Blue way</title>
		<link>http://hardwarebug.org/2009/08/10/drm-the-big-blue-way/</link>
		<comments>http://hardwarebug.org/2009/08/10/drm-the-big-blue-way/#comments</comments>
		<pubDate>Mon, 10 Aug 2009 20:35:53 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>
		<category><![CDATA[PowerPC]]></category>
		<category><![CDATA[Reverse engineering]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=179</guid>
		<description><![CDATA[A few months ago, I downloaded an evaluation copy of IBM&#8217;s XLC compiler to try it out on FFmpeg. The trial licence has now expired, so what better way to spend a few minutes than by cracking it? The installation script, as expected, copied a number of files into a directory under /opt. More unusually, [...]]]></description>
			<content:encoded><![CDATA[<p>A few months ago, I downloaded an evaluation copy of IBM&#8217;s <a href="http://www-01.ibm.com/software/awdtools/xlcpp/linux/">XLC</a> compiler to try it out on FFmpeg. The trial licence has now expired, so what better way to spend a few minutes than by cracking it?</p>
<p>The installation script, as expected, copied a number of files into a directory under <code>/opt</code>. More unusually, it also created a small shared library, <code>libxlc101e.so.1</code>, and placed it in <code>/usr/lib</code>. No other files from the installation package were modified, so this must be where the licence is hiding. Without further ado, we proceed to take it apart.<br />
<span id="more-179"></span><br />
We begin by looking at the symbol table using <code>readelf -s</code>:</p>
<pre>Symbol table '.symtab' contains 44 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 000000b4     0 SECTION LOCAL  DEFAULT    1
     2: 0000017c     0 SECTION LOCAL  DEFAULT    2
     3: 0000036c     0 SECTION LOCAL  DEFAULT    3
     4: 0000055c     0 SECTION LOCAL  DEFAULT    4
     5: 00000580     0 SECTION LOCAL  DEFAULT    5
     6: 000005c4     0 SECTION LOCAL  DEFAULT    6
     7: 00010a3c     0 SECTION LOCAL  DEFAULT    7
     8: 00010a50     0 SECTION LOCAL  DEFAULT    8
     9: 00010ac8     0 SECTION LOCAL  DEFAULT    9
    10: 00010ad8     0 SECTION LOCAL  DEFAULT   10
    11: 00000000     0 SECTION LOCAL  DEFAULT   11
    12: 00000000     0 SECTION LOCAL  DEFAULT   12
    13: 00000000     0 SECTION LOCAL  DEFAULT   13
    14: 00000000     0 SECTION LOCAL  DEFAULT   14
    15: 00000000     0 FILE    LOCAL  DEFAULT  ABS xleval.c
    16: 00010a50     0 OBJECT  LOCAL  HIDDEN   ABS _DYNAMIC
    17: 00010acc     0 OBJECT  LOCAL  HIDDEN   ABS _GLOBAL_OFFSET_TABLE_
    18: 000005f0    24 OBJECT  GLOBAL DEFAULT    6 xlc_extended_eval_lic_dir
    19: 00000660    22 OBJECT  GLOBAL DEFAULT    6 libxlfextendeval_name
    20: 00000738    42 OBJECT  GLOBAL DEFAULT    6 stm_compiler_name
    21: 00000640    16 OBJECT  GLOBAL DEFAULT    6 libupclicense_name
    22: 00000764    32 OBJECT  GLOBAL DEFAULT    6 xlf_compiler_name
    23: 00000580    68 FUNC    GLOBAL DEFAULT    5 _xlgetevalbeta
    24: 00000678    22 OBJECT  GLOBAL DEFAULT    6 libxlcextendeval_name
    25: 00000608    24 OBJECT  GLOBAL DEFAULT    6 xlf_extended_eval_lic_dir
    26: 000006d8    17 OBJECT  GLOBAL DEFAULT    6 xlcmp_name
    27: 000006c0    12 OBJECT  GLOBAL DEFAULT    6 xlc_package_name
    28: 00000650    16 OBJECT  GLOBAL DEFAULT    6 libstmlicense_name
    29: 000005e0    16 OBJECT  GLOBAL DEFAULT    6 xlf_extend_eval_env_var
    30: 000005d0    16 OBJECT  GLOBAL DEFAULT    6 xlc_extend_eval_env_var
    31: 00000620    16 OBJECT  GLOBAL DEFAULT    6 libxlflicense_name
    32: 00010cd0     0 NOTYPE  GLOBAL DEFAULT  ABS __bss_start
    33: 00010a3c    20 OBJECT  GLOBAL DEFAULT    7 _xlevalbeta
    34: 000005c4    10 OBJECT  GLOBAL DEFAULT    6 liblicense_dir
    35: 00000690    22 OBJECT  GLOBAL DEFAULT    6 libupcextendeval_name
    36: 000006ec    30 OBJECT  GLOBAL DEFAULT    6 xlc_compiler_name
    37: 00000630    16 OBJECT  GLOBAL DEFAULT    6 libxlclicense_name
    38: 00010cd0     0 NOTYPE  GLOBAL DEFAULT  ABS _edata
    39: 00010cd0     0 NOTYPE  GLOBAL DEFAULT  ABS _end
    40: 000006cc    12 OBJECT  GLOBAL DEFAULT    6 xlf_package_name
    41: 000006a8    22 OBJECT  GLOBAL DEFAULT    6 libstmextendeval_name
    42: 00010ad8   504 OBJECT  GLOBAL DEFAULT   10 versionString
    43: 0000070c    42 OBJECT  GLOBAL DEFAULT    6 upc_compiler_name</pre>
<p>Notice the lone function at position 23, <code>_xlgetevalbeta</code>, which we proceed to disassemble:</p>
<pre>00000580 &lt;_xlgetevalbeta&gt;:
 580:   94 21 ff f0     stwu    r1,-16(r1)
 584:   93 c1 00 08     stw     r30,8(r1)
 588:   93 e1 00 0c     stw     r31,12(r1)
 58c:   7c 3f 0b 78     mr      r31,r1
 590:   7d 88 02 a6     mflr    r12
 594:   42 9f 00 05     bcl-    20,4*cr7+so,598 &lt;_xlgetevalbeta+0x18&gt;
 598:   7f c8 02 a6     mflr    r30
 59c:   3f de 00 01     addis   r30,r30,1
 5a0:   3b de 05 34     addi    r30,r30,1332
 5a4:   7d 88 03 a6     mtlr    r12
 5a8:   80 1e ff fc     lwz     r0,-4(r30)
 5ac:   7c 03 03 78     mr      r3,r0
 5b0:   81 61 00 00     lwz     r11,0(r1)
 5b4:   83 cb ff f8     lwz     r30,-8(r11)
 5b8:   83 eb ff fc     lwz     r31,-4(r11)
 5bc:   7d 61 5b 78     mr      r1,r11
 5c0:   4e 80 00 20     blr</pre>
<p>This is fairly standard, unoptimised code. After saving a few registers on the stack, it computes the address of the global offset table: <code>0x598 + 0x10000 + 1332 = 0x10acc</code>, matching <code>_GLOBAL_OFFSET_TABLE_</code> from the symbol table. Next, a value is loaded from the GOT, forming the return value of the function after the stack has been restored.</p>
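<p>The address arithmetic is easy to check by hand, but it can also be written down as a few lines of C. The sketch below merely mirrors the instructions from 0x594 to 0x5a8; the macro and function names are made up for illustration and appear nowhere in the library.</p>

```c
#include <stdint.h>

/* Link register value after the bcl at 0x594: the address of the next instruction. */
#define LR_AFTER_BCL 0x598u

/* Mirrors "addis r30,r30,1" followed by "addi r30,r30,1332". */
static uint32_t got_address(void)
{
    uint32_t r30 = LR_AFTER_BCL;
    r30 += 1u << 16;   /* addis adds the immediate shifted left by 16 bits */
    r30 += 1332;       /* addi adds the sign-extended 16-bit immediate */
    return r30;
}

/* Mirrors "lwz r0,-4(r30)": the word loaded sits four bytes below the GOT symbol. */
static uint32_t got_slot_address(void)
{
    return got_address() - 4;
}
```

<p>Here <code>got_address()</code> evaluates to 0x10acc and <code>got_slot_address()</code> to 0x10ac8, which is the address to look for when working out what the function returns.</p>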
<p>To find out what this return value really is, we look at the relocation table (by means of <code>readelf -r</code>):</p>
<pre>Relocation section '.rela.dyn' at offset 0x55c contains 3 entries:
 Offset     Info    Type            Sym.Value  Sym. Name + Addend
00010a40  00000016 R_PPC_RELATIVE                               00000784
00010a44  00000016 R_PPC_RELATIVE                               0000097c
00010ac8  00001414 R_PPC_GLOB_DAT    00010a3c   _xlevalbeta + 0</pre>
<p>The third entry is the one we are looking for: its offset, 0x10ac8, is exactly the location read by the load at 0x5a8, four bytes below the GOT address held in <code>r30</code>. This means the <code>_xlgetevalbeta</code> function is returning a pointer to <code>_xlevalbeta</code>, which makes some kind of sense.</p>
<p>Another quick look at the symbol table tells us <code>_xlevalbeta</code> lives at address 0x10a3c and is 20 bytes in size. The section header table (provided by <code>readelf -S</code>) helps us find the corresponding location in the file:</p>
<pre>Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg
  [ 0]                   NULL            00000000 000000 000000 00
  [ 1] .hash             HASH            000000b4 0000b4 0000c8 04   A
  [ 2] .dynsym           DYNSYM          0000017c 00017c 0001f0 10   A
  [ 3] .dynstr           STRTAB          0000036c 00036c 0001ee 00   A
  [ 4] .rela.dyn         RELA            0000055c 00055c 000024 0c   A
  [ 5] .text             PROGBITS        00000580 000580 000044 00  AX
  [ 6] .rodata           PROGBITS        000005c4 0005c4 000475 00   A
  [ 7] .data.rel.ro      PROGBITS        00010a3c 000a3c 000014 00  WA
  [ 8] .dynamic          DYNAMIC         00010a50 000a50 000078 08  WA
  [ 9] .got              PROGBITS        00010ac8 000ac8 000010 04  WA
  [10] .data             PROGBITS        00010ad8 000ad8 0001f8 00  WA
  [11] .comment          PROGBITS        00000000 000cd0 000028 00
  [12] .shstrtab         STRTAB          00000000 000cf8 000073 00
  [13] .symtab           SYMTAB          00000000 000fc4 0002c0 10
  [14] .strtab           STRTAB          00000000 001284 000206 00
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)</pre>
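<p>Translating a virtual address into a file offset with this table is a matter of finding the section that contains the address and adding the distance from the section start to the section's file offset. A minimal sketch in C, with the allocated sections from the listing typed in by hand (the <code>struct section</code> type and the <code>vaddr_to_offset</code> helper are illustrative names, not part of any tool):</p>

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the section header table, reduced to the fields used here. */
struct section {
    const char *name;
    uint32_t addr;   /* virtual address (Addr column) */
    uint32_t off;    /* offset in the file (Off column) */
    uint32_t size;   /* size in bytes (Size column) */
};

/* The allocated sections from the readelf -S listing above. */
static const struct section sections[] = {
    { ".hash",        0x000000b4, 0x0000b4, 0x0000c8 },
    { ".dynsym",      0x0000017c, 0x00017c, 0x0001f0 },
    { ".dynstr",      0x0000036c, 0x00036c, 0x0001ee },
    { ".rela.dyn",    0x0000055c, 0x00055c, 0x000024 },
    { ".text",        0x00000580, 0x000580, 0x000044 },
    { ".rodata",      0x000005c4, 0x0005c4, 0x000475 },
    { ".data.rel.ro", 0x00010a3c, 0x000a3c, 0x000014 },
    { ".dynamic",     0x00010a50, 0x000a50, 0x000078 },
    { ".got",         0x00010ac8, 0x000ac8, 0x000010 },
    { ".data",        0x00010ad8, 0x000ad8, 0x0001f8 },
};

/* Returns the file offset for a virtual address, or -1 if no section contains it. */
static long vaddr_to_offset(uint32_t vaddr)
{
    size_t i;
    for (i = 0; i < sizeof(sections) / sizeof(sections[0]); i++) {
        const struct section *s = &sections[i];
        if (vaddr >= s->addr && vaddr - s->addr < s->size)
            return (long)(s->off + (vaddr - s->addr));
    }
    return -1;
}
```

<p>Feeding it the address of <code>_xlevalbeta</code> gives the position to inspect in the file.</p>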
<p>The address we are looking for is at the start of the <code>.data.rel.ro</code> section, which can be found at offset 0xa3c in the file. It is time for the <code>hexdump</code> tool:</p>
<pre>00000a30  69 62 69 74 65 64 2e 00  00 00 00 00 00 00 00 01
00000a40  00 00 00 00 00 00 00 00  00 00 24 05 4a 0c 65 c8</pre>
<p>The last four bytes here, <code>4a 0c 65 c8</code>, are interesting. Taken as a 32-bit big-endian value, they are exactly equal to the modification time of the file, or in other words, the time the compiler was installed. This cannot be a coincidence, so using a hex editor, we replace them with the current time, <code>4a 80 74 31</code>.</p>
<p>Lo and behold, the compiler is working again.</p>
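<p>For reference, reading such a field by hand means assembling the bytes most significant first. The sketch below shows the byte-order handling only, under the assumption that the field is a plain count of seconds since the Unix epoch; the helper names are illustrative.</p>

```c
#include <stdint.h>

/* Assembles four bytes, most significant first, into a 32-bit value. */
static uint32_t be32_decode(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] <<  8) |  (uint32_t)b[3];
}

/* Splits a 32-bit value into four bytes, most significant first. */
static void be32_encode(uint32_t v, unsigned char b[4])
{
    b[0] = (unsigned char)(v >> 24);
    b[1] = (unsigned char)(v >> 16);
    b[2] = (unsigned char)(v >>  8);
    b[3] = (unsigned char)v;
}
```

<p>Decoding <code>4a 0c 65 c8</code> this way gives 1242326472, a moment in mid-May 2009, which fits an installation a few months before this post was written.</p>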
<p>One hopes the engineers at IBM developing the compiler are not the same ones thinking this copy protection method was a good idea. Then again, perhaps they are; it failed miserably at compiling FFmpeg.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/10/drm-the-big-blue-way/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ARM compiler shoot-out</title>
		<link>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/</link>
		<comments>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/#comments</comments>
		<pubDate>Wed, 05 Aug 2009 00:06:06 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=150</guid>
		<description><![CDATA[A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled FFmpeg using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a Beagle [...]]]></description>
			<content:encoded><![CDATA[<p>A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled <a href="http://ffmpeg.org/">FFmpeg</a> using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a <a href="http://beagleboard.org/">Beagle board</a> clocked at 600 MHz.</p>
<p>These are the compilers I deemed worthy to participate in the test and the optimisation flags I used with each:</p>
<ul>
<li><strong>GCC 4.3.3</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>GCC 4.4.1</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>CodeSourcery GCC 2007q3</strong> (based on 4.2.1), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-tree-vectorize</li>
<li><strong>CodeSourcery GCC 2009q1</strong> (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>ARM RVCT 4.0 Build 591</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros</li>
</ul>
<p>I would have also included the ARM compiler from Texas Instruments, had it been able to compile FFmpeg.<br />
<span id="more-150"></span><br />
With sample files chosen to exercise various types of code, the result of the test is, sadly, no surprise. The following table lists the runtimes of the different builds relative to the CodeSourcery 2007q3 build. Lower numbers are better.</p>
<table border="0" width="100%">
<col></col>
<col></col>
<col></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<thead>
<tr style="text-align: left;">
<th>Sample name</th>
<th>Codec</th>
<th>Code type</th>
<th>2009q1</th>
<th>4.3.3</th>
<th>4.4.1</th>
<th>RVCT</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4">cathedral</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>0.97</td>
<td>1.02</td>
<td>1.09</td>
<td>0.93</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4">NeroAVC</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>0.98</td>
<td>1.02</td>
<td>1.12</td>
<td>0.95</td>
</tr>
<tr>
<td><a href="http://movies.apple.com/movies/paramount/indiana_jones_4/indiana_jones_4-tlr3_h640w.mov">indiana_jones_4</a></td>
<td>H.264 CAVLC</td>
<td>integer</td>
<td>0.97</td>
<td>1.02</td>
<td>1.09</td>
<td>0.89</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4">NeroRecodeSample</a></td>
<td>MPEG-4 ASP</td>
<td>integer</td>
<td>0.96</td>
<td>1.03</td>
<td>1.27</td>
<td>0.96</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3">Silent_Light</a></td>
<td>MP3</td>
<td>64-bit integer</td>
<td>0.89</td>
<td>0.88</td>
<td>0.97</td>
<td>0.44</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/flac/When I Grow Up.flac">When_I_Grow_Up</a></td>
<td>FLAC</td>
<td>integer</td>
<td>0.98</td>
<td>0.98</td>
<td>0.93</td>
<td>0.86</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg">Lumme-Badloop</a></td>
<td>Vorbis</td>
<td>float</td>
<td>1.03</td>
<td>1.03</td>
<td>1.02</td>
<td>0.97</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/AC3/Canyon-5.1-48khz-448kbit.ac3">Canyon</a></td>
<td>AC-3</td>
<td>float</td>
<td>1.02</td>
<td>1.02</td>
<td>0.99</td>
<td>0.90</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/DTS/lotr_5.1_768.dts">lotr</a></td>
<td>DTS</td>
<td>float</td>
<td>1.02</td>
<td>1.02</td>
<td>1.00</td>
<td>1.03</td>
</tr>
</tbody>
</table>
<p>Looking at the table, I make these observations:</p>
<ul>
<li>CodeSourcery 2009q1 produces faster integer code, but slower floating-point code, than 2007q3.</li>
<li>GCC 4.4.1 produces much slower code than 4.3.3 in several cases, and is never significantly better.</li>
<li>CodeSourcery GCC generally beats FSF GCC.</li>
<li>ARM RVCT readily beats every GCC version. The MP3 figure is not a typo.</li>
</ul>
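<p>Since the table lists runtimes relative to a reference build, each figure can be turned into a speed-up factor by taking its reciprocal. A small sketch of both conversions, assuming the figures are simple ratios of decoding times (the function names are illustrative):</p>

```c
/* Runtime of a build relative to the reference build; lower is better. */
static double relative_runtime(double seconds, double reference_seconds)
{
    return seconds / reference_seconds;
}

/* Speed-up factor implied by a relative runtime; higher is better. */
static double speedup(double relative)
{
    return 1.0 / relative;
}
```

<p>RVCT's 0.44 for the MP3 sample thus means the RVCT build decodes it roughly 2.3 times as fast as the 2007q3 build.</p>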
<p>My recommendation for a free compiler is CodeSourcery 2009q1 unless your code makes heavy use of floating-point, in which case 2007q3 may give better results. If you prefer, for whatever reason, official GNU releases, 4.3.3 should be the version of choice. Avoid GCC 4.4.1; it is far too unpredictable.</p>
<h4>Bootnotes</h4>
<ul>
<li>See also Mike&#8217;s <a title="Intel Beats Up GCC" href="http://multimedia.cx/eggs/intel-beats-up-gcc/">test of x86 compilers</a>.</li>
<li>Thanks to ARM for providing the RVCT compiler.</li>
<li>Thanks to TI for providing the Beagle board.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4" length="24154488" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4" length="6766583" type="video/mp4" />
<enclosure url="http://movies.apple.com/movies/paramount/indiana_jones_4/indiana_jones_4-tlr3_h640w.mov" length="16215526" type="video/quicktime" />
<enclosure url="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4" length="31027653" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3" length="4206720" type="audio/mpeg" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg" length="5856908" type="audio/ogg" />
		</item>
		<item>
		<title>GCC makes a mess</title>
		<link>http://hardwarebug.org/2009/05/13/gcc-makes-a-mess/</link>
		<comments>http://hardwarebug.org/2009/05/13/gcc-makes-a-mess/#comments</comments>
		<pubDate>Wed, 13 May 2009 02:16:38 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>
		<category><![CDATA[PowerPC]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=131</guid>
		<description><![CDATA[Following up on a report about FFmpeg being slower at MPEG audio decoding than MAD, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems with the exception of 32-bit PowerPC. On the PPC MAD was nearly twice as fast [...]]]></description>
			<content:encoded><![CDATA[<p>Following up on a report about <a href="http://ffmpeg.org/">FFmpeg</a> being slower at MPEG audio decoding than <a href="http://www.underbit.com/products/mad/">MAD</a>, I compared the speed of the two decoders on a few machines. FFmpeg came out somewhat ahead of MAD on most of my test systems, with the exception of 32-bit PowerPC. On the PPC, MAD was nearly twice as fast as FFmpeg, suggesting something was going badly wrong in the compilation.</p>
<p>A session with oprofile exposes multiplication as the root of the problem. The MPEG audio decoder in FFmpeg includes many operations of the form <code>a += b * c</code> where <code>b</code> and <code>c</code> are 32 bits in size and <code>a</code> is 64-bit. 64-bit maths on a 32-bit CPU is not handled well by GCC, even when good hardware support is available. A couple of examples compiled with GCC 4.3.3 illustrate this.<br />
<span id="more-131"></span><br />
Suppose you need the high 32 bits from the 64-bit result of multiplying two 32-bit numbers. This is most easily written in C like this:</p>
<blockquote>
<pre>#include &lt;stdint.h&gt;

int mulh(int a, int b)
{
    return ((int64_t)a * (int64_t)b) &gt;&gt; 32;
}</pre>
</blockquote>
<p>It doesn&#8217;t take much thinking to see that the PowerPC <code>mulhw</code> instruction performs exactly this operation. Indeed, GCC knows of this instruction and uses it. But can we be <em>really</em> sure that those low 32 bits are not needed? GCC seems unconvinced:</p>
<blockquote>
<pre>mulhw   r9,  r4,  r3
mullw   r10, r4,  r3
srawi   r11, r9,  31
srawi   r12, r9,  0
mr      r3,  r12
blr</pre>
</blockquote>
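<p>What one would hope for here is a lone <code>mulhw</code>. As a rough sketch of how that can be forced with GCC inline assembler (not necessarily the code used in FFmpeg; the name <code>mulh_asm</code> and the preprocessor guard are my own choices), with a plain C fallback so that it still builds on other targets:</p>

```c
#include <stdint.h>

/* High 32 bits of the signed 64-bit product of a and b. */
static inline int mulh_asm(int a, int b)
{
#if defined(__GNUC__) && defined(__powerpc__) && !defined(__powerpc64__)
    int r;
    /* One instruction; the low half of the product is never computed. */
    __asm__ ("mulhw %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    /* Portable fallback so the sketch builds on other targets. */
    return (int)(((int64_t)a * (int64_t)b) >> 32);
#endif
}
```

<p>The asm statement has no side effects beyond its output, so the compiler remains free to schedule it or drop it when the result is unused.</p>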
<p>The second example is slightly more complicated:</p>
<blockquote>
<pre>#include &lt;stdint.h&gt;

int64_t mac(int64_t a, int b, int c, int d)
{
    a += (int64_t)b * (int64_t)c;
    a += (int64_t)b * (int64_t)d;
    return a;
}</pre>
</blockquote>
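<p>This can, of course, be done with four multiplications and four additions: a <code>mullw</code>/<code>mulhw</code> pair for each product and an <code>addc</code>/<code>adde</code> pair for each 64-bit addition. GCC, however, likes to be thorough, and uses twice as many of each, plus some loads, stores and shifts for completeness:</p>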
<blockquote>
<pre>stwu    r1,  -32(r1)
srawi   r0,  r6,  31
mullw   r0,  r0,  r5
srawi   r8,  r7,  31
stw     r29, 20(r1)
srawi   r29, r5,  31
stw     r27, 12(r1)
stw     r28, 16(r1)
mullw   r11, r29, r6
mulhwu  r9,  r6,  r5
add     r0,  r0,  r11
mullw   r10, r6,  r5
add     r9,  r0,  r9
mullw   r29, r29, r7
addc    r28, r10, r4
adde    r27, r9,  r3
mullw   r8,  r8,  r5
mulhwu  r9,  r7,  r5
add     r8,  r8,  r29
lwz     r29, 20(r1)
mullw   r10, r7,  r5
add     r9,  r8,  r9
addc    r12, r28, r10
adde    r11, r27, r9
lwz     r27, 12(r1)
mr      r4,  r12
lwz     r28, 16(r1)
mr      r3,  r11
addi    r1,  r1,  32
blr</pre>
</blockquote>
<p>Fortunately, this madness is easily fixed with a little inline assembler, more than doubling the speed of the decoder, thus making FFmpeg significantly faster than MAD also on PowerPC.</p>
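<p>To make the four-multiplication, four-addition count concrete, the sequence can be modelled in portable C, one helper per instruction. This is only a model of the intended instruction sequence, not the inline assembler that went into FFmpeg; the helpers are named after the PowerPC instructions they stand in for, and <code>mac_split</code> is a made-up name.</p>

```c
#include <stdint.h>

/* Stands in for mullw: low 32 bits of the product. */
static uint32_t mullw(int32_t a, int32_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* Stands in for mulhw: high 32 bits of the signed 64-bit product. */
static int32_t mulhw(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Stands in for an addc/adde pair: 64-bit add on two 32-bit halves. */
static void add64(uint32_t *hi, uint32_t *lo, uint32_t bhi, uint32_t blo)
{
    uint32_t sum = *lo + blo;
    uint32_t carry = sum < blo;   /* addc sets the carry on wrap-around */
    *lo = sum;
    *hi = *hi + bhi + carry;      /* adde consumes the carry */
}

/* Same result as mac() above, using 2 x (mullw + mulhw) and 2 x (addc + adde). */
int64_t mac_split(int64_t a, int32_t b, int32_t c, int32_t d)
{
    uint32_t hi = (uint32_t)((uint64_t)a >> 32);
    uint32_t lo = (uint32_t)a;

    add64(&hi, &lo, (uint32_t)mulhw(b, c), mullw(b, c));
    add64(&hi, &lo, (uint32_t)mulhw(b, d), mullw(b, d));

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

<p>The model relies on arithmetic right shifts of negative values and two's complement conversions, which GCC provides.</p>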
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/05/13/gcc-makes-a-mess/feed/</wfw:commentRss>
		<slash:comments>34</slash:comments>
		</item>
		<item>
		<title>Thumbs up</title>
		<link>http://hardwarebug.org/2009/03/25/thumbs-up/</link>
		<comments>http://hardwarebug.org/2009/03/25/thumbs-up/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 03:27:04 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=125</guid>
		<description><![CDATA[ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions. Thumb-2 performance is claimed [...]]]></description>
			<content:encoded><![CDATA[<p>ARM processors have long supported the 16-bit <a href="http://arm.com/products/CPUs/archi-thumb.html">Thumb</a> instruction set, achieving smaller code size at the price of reduced performance. The <a href="http://arm.com/products/CPUs/archi-thumb2.html">Thumb-2</a> extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.</p>
<p>Thumb-2 performance is <a href="http://arm.com/pdfs/Thumb-2CoreTechnologyWhitepaper-Final4.pdf">claimed</a> to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with <a href="http://ffmpeg.org/">FFmpeg</a> as the target and compiled the same source revision in ARM and Thumb-2 mode using the <a href="http://arm.com/products/DevTools/RVCT.html">RVCT 4.0</a> compiler. For this test I disabled all hand-written assembler optimisations.</p>
<p>The Thumb-2 executable is 85% of the ARM one in size, which, although a substantial reduction, falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a <a href="http://beagleboard.org/">Beagle board</a>. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) compared to ARM code. The speedups in the other cases are likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.</p>
<p>In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/03/25/thumbs-up/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Rotten Apple</title>
		<link>http://hardwarebug.org/2009/01/28/rotten-apple/</link>
		<comments>http://hardwarebug.org/2009/01/28/rotten-apple/#comments</comments>
		<pubDate>Wed, 28 Jan 2009 03:27:48 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=112</guid>
		<description><![CDATA[Ever since Apple released their iPhone SDK, the FFmpeg mailing lists have seen a steady stream of error reports from users attempting to build FFmpeg for the iPhone, and eventually they got my attention. The iPhone is built around an ARM1176 CPU, so the SDK includes an ARM cross-compiler and assembler. Most of the reported [...]]]></description>
			<content:encoded><![CDATA[<p>Ever since Apple released their iPhone SDK, the <a href="http://ffmpeg.org/">FFmpeg</a> mailing lists have seen a steady stream of error reports from users attempting to build FFmpeg for the iPhone, and eventually they got my attention.</p>
<p>The iPhone is built around an ARM1176 CPU, so the SDK includes an ARM cross-compiler and assembler. Most of the reported errors originate from the Apple assembler which appears to have trouble processing the assembler source files from FFmpeg.</p>
<p>The source files use the GNU assembler syntax, and the Apple assembler is based on an old GNU version, so one might reasonably expect it to work. What I had not realised was just how old a version Apple based their assembler on. The version they chose was 1.38.1, released in January 1991, 18 years ago. Features which have since been added to the GNU assembler, and there are many, have not been merged by Apple. As a result, many special directives and macro features used in FFmpeg are not recognised by the Apple assembler, and modifying the code to work with this assembler would render it unusable with modern GNU versions.</p>
<p>Why not replace the assembler in the SDK with a GNU version, one might ask. The answer is that this is not possible. The Apple system uses an object file format, Mach-O, not supported by the GNU tools. The chances of Apple updating their assembler to support the newer syntax appear slim, so our best hope is for the <a href="http://sourceware.org/binutils/">GNU binutils</a> package to gain support for the Mach-O format. This will need a lot of work, and a working version cannot be expected for some time yet.</p>
<p>While this incompatibility persists, those wishing to run an optimised FFmpeg build on their iPhone will have to rely on patches to make it palatable to the Apple assembler. Supporting the Apple syntax directly in FFmpeg is unfortunately not feasible.</p>
<h3>Links</h3>
<ul>
<li><a href="http://code.google.com/p/ffmpeg4iphone/">http://code.google.com/p/ffmpeg4iphone/</a></li>
<li><a href="https://lists.ubuntu.com/archives/ubuntu-ca/2008-February/004207.html">https://lists.ubuntu.com/archives/ubuntu-ca/2008-February/004207.html</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/01/28/rotten-apple/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

