<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hardwarebug &#187; ARM</title>
	<atom:link href="http://hardwarebug.org/category/arm/feed/" rel="self" type="application/rss+xml" />
	<link>http://hardwarebug.org</link>
	<description>Everything is broken</description>
	<lastBuildDate>Tue, 17 Aug 2010 14:47:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>ARM inline asm secrets</title>
		<link>http://hardwarebug.org/2010/07/06/arm-inline-asm-secrets/</link>
		<comments>http://hardwarebug.org/2010/07/06/arm-inline-asm-secrets/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 20:52:43 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=493</guid>
		<description><![CDATA[Although I generally recommend against using GCC inline assembler, preferring instead pure assembler code in separate files, there are occasions where inline is the appropriate solution. Should one, at a time like this, turn to the GCC documentation for guidance, one must be prepared for a degree of disappointment. As it happens, much of the [...]]]></description>
			<content:encoded><![CDATA[<p>Although I generally recommend against using GCC inline assembler, preferring instead pure assembler code in separate files, there are occasions where inline is the appropriate solution. Should one, at a time like this, turn to the GCC documentation for guidance, one must be prepared for a degree of disappointment. As it happens, much of the inline asm syntax is left entirely undocumented. This article attempts to fill in some of the blanks for the ARM target.<br />
<span id="more-493"></span></p>
<style>
.asm { border-collapse: collapse; }
.asm td { padding: 0.5em; }
.asm td:first-child { font-family: monospace; font-weight: bold; vertical-align: top }
</style>
<h3>Constraints</h3>
<p>Each operand of an inline asm block is described by a constraint string encoding the valid representations of the operand in the generated assembler. For example, the &#8220;r&#8221; code denotes a general-purpose register. In addition to the standard constraints, ARM allows a number of special codes, only some of which are documented. The full list, with a brief description of each code, is available in the <code>constraints.md</code> file in the GCC source tree. The following table is an extract from this file consisting of the codes which are meaningful in an inline asm block (a few others are only useful in the machine description itself).</p>
<table class="asm">
<tr>
<td>f</td>
<td>Legacy FPA registers <code>f0-f7</code>.</td>
</tr>
<tr>
<td>t</td>
<td>The VFP registers <code>s0-s31</code>.</td>
</tr>
<tr>
<td>v</td>
<td>The Cirrus Maverick co-processor registers.</td>
</tr>
<tr>
<td>w</td>
<td>The VFP registers <code>d0-d15</code>, or <code>d0-d31</code> for VFPv3.</td>
</tr>
<tr>
<td>x</td>
<td>The VFP registers <code>d0-d7</code>.</td>
</tr>
<tr>
<td>y</td>
<td>The Intel iWMMX co-processor registers.</td>
</tr>
<tr>
<td>z</td>
<td>The Intel iWMMX GR registers.</td>
</tr>
<tr>
<td>l</td>
<td>In Thumb state the core registers <code>r0-r7</code>.</td>
</tr>
<tr>
<td>h</td>
<td>In Thumb state the core registers <code>r8-r15</code>.</td>
</tr>
<tr>
<td>j</td>
<td>A constant suitable for a MOVW instruction. (ARM/Thumb-2)</td>
</tr>
<tr>
<td>b</td>
<td>Thumb only.  The union of the low registers and the stack register.</td>
</tr>
<tr>
<td>I</td>
<td>In ARM/Thumb-2 state a constant that can be used as an immediate value in a Data Processing instruction.  In Thumb-1 state a constant in the range 0 to 255.</td>
</tr>
<tr>
<td>J</td>
<td>In ARM/Thumb-2 state a constant in the range -4095 to 4095.  In Thumb-1 state a constant in the range -255 to -1.</td>
</tr>
<tr>
<td>K</td>
<td>In ARM/Thumb-2 state a constant that satisfies the <code>I</code> constraint if inverted.  In Thumb-1 state a constant that satisfies the <code>I</code> constraint multiplied by any power of 2.</td>
</tr>
<tr>
<td>L</td>
<td>In ARM/Thumb-2 state a constant that satisfies the <code>I</code> constraint if negated.  In Thumb-1 state a constant in the range -7 to 7.</td>
</tr>
<tr>
<td>M</td>
<td>In Thumb-1 state a constant that is a multiple of 4 in the range 0 to 1020.</td>
</tr>
<tr>
<td>N</td>
<td>In Thumb-1 state a constant in the range 0 to 31.</td>
</tr>
<tr>
<td>O</td>
<td>In Thumb-1 state a constant that is a multiple of 4 in the range -508 to 508.</td>
</tr>
<tr>
<td>Pa</td>
<td>In Thumb-1 state a constant in the range -510 to +510.</td>
</tr>
<tr>
<td>Pb</td>
<td>In Thumb-1 state a constant in the range -262 to +262.</td>
</tr>
<tr>
<td>Ps</td>
<td>In Thumb-2 state a constant in the range -255 to +255.</td>
</tr>
<tr>
<td>Pt</td>
<td>In Thumb-2 state a constant in the range -7 to +7.</td>
</tr>
<tr>
<td>G</td>
<td>In ARM/Thumb-2 state a valid FPA immediate constant.</td>
</tr>
<tr>
<td>H</td>
<td>In ARM/Thumb-2 state a valid FPA immediate constant when negated.</td>
</tr>
<tr>
<td>Da</td>
<td>In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with two Data Processing insns.</td>
</tr>
<tr>
<td>Db</td>
<td>In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with three Data Processing insns.</td>
</tr>
<tr>
<td>Dc</td>
<td>In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with four Data Processing insns.  This pattern is disabled if optimizing for space or when we have load-delay slots to fill.</td>
</tr>
<tr>
<td>Dn</td>
<td>In ARM/Thumb-2 state a const_vector which can be loaded with a Neon vmov immediate instruction.</td>
</tr>
<tr>
<td>Dl</td>
<td>In ARM/Thumb-2 state a const_vector which can be used with a Neon vorr or vbic instruction.</td>
</tr>
<tr>
<td>DL</td>
<td>In ARM/Thumb-2 state a const_vector which can be used with a Neon vorn or vand instruction.</td>
</tr>
<tr>
<td>Dv</td>
<td>In ARM/Thumb-2 state a const_double which can be used with a VFP fconsts instruction.</td>
</tr>
<tr>
<td>Dy</td>
<td>In ARM/Thumb-2 state a const_double which can be used with a VFP fconstd instruction.</td>
</tr>
<tr>
<td>Ut</td>
<td>In ARM/Thumb-2 state an address valid for loading/storing opaque structure types wider than TImode.</td>
</tr>
<tr>
<td>Uv</td>
<td>In ARM/Thumb-2 state a valid VFP load/store address.</td>
</tr>
<tr>
<td>Uy</td>
<td>In ARM/Thumb-2 state a valid iWMMX load/store address.</td>
</tr>
<tr>
<td>Un</td>
<td>In ARM/Thumb-2 state a valid address for Neon doubleword vector load/store instructions.</td>
</tr>
<tr>
<td>Um</td>
<td>In ARM/Thumb-2 state a valid address for Neon element and structure load/store instructions.</td>
</tr>
<tr>
<td>Us</td>
<td>In ARM/Thumb-2 state a valid address for non-offset loads/stores of quad-word values in four ARM registers.</td>
</tr>
<tr>
<td>Uq</td>
<td>In ARM state an address valid in ldrsb instructions.</td>
</tr>
<tr>
<td>Q</td>
<td>In ARM/Thumb-2 state an address that is a single base register.</td>
</tr>
</table>
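<p>As a rough illustration of how such codes are used, the sketch below applies the <code>I</code> and <code>Q</code> constraints. The function names are made up for this example, ARM or Thumb-2 state is assumed, and the <code>#else</code> branches are there only so that the code also compiles for other targets.</p>
<pre><code>
/* "I": the constant 255 must be usable as a data-processing immediate. */
static inline unsigned add_255(unsigned x)
{
    unsigned r;
#if defined(__arm__)
    __asm__ ("add %0, %1, %2" : "=r"(r) : "r"(x), "I"(255));
#else
    r = x + 255;   /* portable stand-in, not part of the technique */
#endif
    return r;
}

/* "Q": a memory operand addressed by a single base register,
   so it is printed as [rN] with no offset. */
static inline unsigned load_word(const unsigned *p)
{
    unsigned v;
#if defined(__arm__)
    __asm__ ("ldr %0, %1" : "=r"(v) : "Q"(*p));
#else
    v = *p;        /* portable stand-in, not part of the technique */
#endif
    return v;
}
</code></pre>
<p>If a constant does not satisfy the constraint it is paired with, the compiler should reject the asm statement rather than emit an instruction that cannot be encoded.</p>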
<h3>Operand codes</h3>
<p>Within the text of an inline asm block, operands are referenced as <code>%0</code>, <code>%1</code> etc. Register operands are printed as <code>rN</code>, memory operands as <code>[rN, #offset]</code>, and so forth.  In some situations, for example with operands occupying multiple registers, more detailed control of the output may be required, and once again, an undocumented feature comes to our rescue.</p>
<p>Special code letters inserted between the <code>%</code> and the operand number alter the output from the default for each type of operand.  The table below lists the more useful ones.</p>
<table class="asm">
<tr>
<td>c</td>
<td>An integer or symbol address without a preceding # sign</td>
</tr>
<tr>
<td>B</td>
<td>Bitwise inverse of integer or symbol without a preceding #</td>
</tr>
<tr>
<td>L</td>
<td>The low 16 bits of an immediate constant</td>
</tr>
<tr>
<td>m</td>
<td>The base register of a memory operand</td>
</tr>
<tr>
<td>M</td>
<td>A register range suitable for LDM/STM</td>
</tr>
<tr>
<td>H</td>
<td>The highest-numbered register of a pair</td>
</tr>
<tr>
<td>Q</td>
<td>The least significant register of a pair</td>
</tr>
<tr>
<td>R</td>
<td>The most significant register of a pair</td>
</tr>
<tr>
<td>P</td>
<td>A double-precision VFP register</td>
</tr>
<tr>
<td>p</td>
<td>The high single-precision register of a VFP double-precision register</td>
</tr>
<tr>
<td>q</td>
<td>A NEON quad register</td>
</tr>
<tr>
<td>e</td>
<td>The low doubleword register of a NEON quad register</td>
</tr>
<tr>
<td>f</td>
<td>The high doubleword register of a NEON quad register</td>
</tr>
<tr>
<td>h</td>
<td>A range of VFP/NEON registers suitable for VLD1/VST1</td>
</tr>
<tr>
<td>A</td>
<td>A memory operand for a VLD1/VST1 instruction</td>
</tr>
<tr>
<td>y</td>
<td>S register as indexed D register, e.g. <code>s5</code> becomes <code>d2[1]</code></td>
</tr>
</table>
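<p>The <code>Q</code> and <code>R</code> codes can be sketched with a 64-bit addition on a register pair. This is a minimal example under the assumption of ARM or Thumb-2 state. The name <code>add64</code> is hypothetical, and the <code>#else</code> branch exists only so the code builds on other targets.</p>
<pre><code>
/* 64-bit add on a register pair: %Q selects the least significant
   register of the pair, %R the most significant one. */
static inline unsigned long long add64(unsigned long long a,
                                       unsigned long long b)
{
#if defined(__arm__)
    __asm__ ("adds %Q0, %Q0, %Q1 \n\t"   /* low words, sets carry  */
             "adc  %R0, %R0, %R1"        /* high words, adds carry */
             : "+r"(a) : "r"(b) : "cc");
#else
    a += b;   /* portable stand-in, not part of the technique */
#endif
    return a;
}
</code></pre>
<p>A plain <code>%0</code> would name only one register of the pair, which is why the code letters are needed here. The <code>cc</code> clobber tells the compiler that the flags are modified.</p>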
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/07/06/arm-inline-asm-secrets/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>ARM compiler update</title>
		<link>http://hardwarebug.org/2010/01/15/arm-compiler-update/</link>
		<comments>http://hardwarebug.org/2010/01/15/arm-compiler-update/#comments</comments>
		<pubDate>Fri, 15 Jan 2010 18:48:38 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=228</guid>
		<description><![CDATA[Since my last shootout,  all the tested vendors have updated their compilers. Here is a quick update on each of them. Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to 4.3.4 and 4.4.2, respectively. Neither update contains anything particularly noteworthy. The CodeSourcery 2009q3 release sees an update to [...]]]></description>
			<content:encoded><![CDATA[<p>Since my <a href="http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/">last shootout</a>,  all the tested vendors have updated their compilers. Here is a quick update on each of them.</p>
<p>Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to <a href="http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&amp;resolution=FIXED&amp;target_milestone=4.3.4">4.3.4</a> and <a href="http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&amp;resolution=FIXED&amp;target_milestone=4.4.2">4.4.2</a>, respectively. Neither update contains anything particularly noteworthy.</p>
<p>The CodeSourcery 2009q3 release sees an update to a GCC 4.4 base, a significant change from the 4.3 base used in 2009q1. The update is a mixed blessing. In fact, it is mostly a curse and hardly a blessing at all. On the bright side, the floating-point speed regressions in 2009q1 are gone, 2009q3 being a few per cent faster even than 2007q3. Unfortunately, this improvement is completely overshadowed by a major speed regression on integer code, a whopping 24% in one case. This ties in with the slowdown previously observed with FSF GCC 4.4 compared to 4.3.</p>
<p>ARM RVCT 4.0 is now at Build 697. This update fixes some bugs and introduces others. Notably, it no longer builds FFmpeg correctly. The issue has been reported to ARM.</p>
<p>Texas Instruments, finally, have made a formal release, v4.6.1, of their TMS470 compiler incorporating various fixes allowing it to build a moderately patched FFmpeg. The performance remains somewhere between GCC and RVCT on average.</p>
<p>In light of the above, my recommendations remain unchanged:</p>
<ul>
<li>For a free compiler, choose CodeSourcery 2009q1. It beats GCC 4.3.4 by 5-10% in most cases.</li>
<li>GNU purists are best served by GCC 4.3.4, which is up to 20% faster than 4.4.2 and rarely slower.</li>
<li>When price is not a concern, ARM RVCT is a good option, outperforming GCC by up to a factor of 2.</li>
<li>In all cases, disable any auto-vectorisation features.</li>
</ul>
<p>Regardless of which compiler is chosen, I cannot overstress the importance of testing. All compilers are crawling with bugs, and even the most innocent-looking code change can trigger one of them. When using a compiler other than GCC, extra caution is advised considering a lot of code is developed using only GCC and may thus fall prey to bugs unique to said other compiler.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/15/arm-compiler-update/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ARM compiler shoot-out, round 2</title>
		<link>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/</link>
		<comments>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 20:20:35 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=204</guid>
		<description><![CDATA[In my recent test of ARM compilers, I had to leave out Texas Instruments&#8217; compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/">recent test</a> of ARM compilers, I had to leave out Texas Instruments&#8217; compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present round two in this shoot-out.</p>
<p>The contenders this time were the fastest GCC variant from round one, ARM RVCT, and newcomer TI TMS470. With the same rules as last time, the exact versions and optimisation options were like this:</p>
<ul>
<li><strong>CodeSourcery GCC 2009q1</strong> (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>ARM RVCT 4.0 Build 591</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros</li>
<li><strong>TI TMS470 4.7.0-a9229</strong>, <span>-</span>-float_support=vfpv3 -mv=7a8 -O3 -mf=5</li>
</ul>
<p><span id="more-204"></span><br />
To keep things fair, I also left the vectoriser off with the TI compiler. The table below lists the decoding times for the sample files, this time normalised against the participating GCC compiler. Remember, smaller numbers are better. Also keep in mind that this test was done with a development snapshot of TMS470, not an approved release.</p>
<table border="0" width="100%">
<col></col>
<col></col>
<col></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<thead>
<tr style="text-align: left;">
<th>Sample name</th>
<th>Codec</th>
<th>Code type</th>
<th>GCC</th>
<th>RVCT</th>
<th>TI</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4">cathedral</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.95</td>
<td>1.02</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4">NeroAVC</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.96</td>
<td>1.05</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/indiana_jones_4-tlr3_h640w.mov">indiana_jones_4</a></td>
<td>H.264 CAVLC</td>
<td>integer</td>
<td>1.00</td>
<td>0.92</td>
<td>1.02</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4">NeroRecodeSample</a></td>
<td>MPEG-4 ASP</td>
<td>integer</td>
<td>1.00</td>
<td>1.01</td>
<td>1.08</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3">Silent_Light</a></td>
<td>MP3</td>
<td>64-bit integer</td>
<td>1.00</td>
<td>0.48</td>
<td>0.72</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/flac/When I Grow Up.flac">When_I_Grow_Up</a></td>
<td>FLAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.87</td>
<td>0.93</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg">Lumme-Badloop</a></td>
<td>Vorbis</td>
<td>float</td>
<td>1.00</td>
<td>0.94</td>
<td>1.05</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/AC3/Canyon-5.1-48khz-448kbit.ac3">Canyon</a></td>
<td>AC-3</td>
<td>float</td>
<td>1.00</td>
<td>0.88</td>
<td>1.01</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/DTS/lotr_5.1_768.dts">lotr</a></td>
<td>DTS</td>
<td>float</td>
<td>1.00</td>
<td>1.00</td>
<td>1.08</td>
</tr>
</tbody>
</table>
<p>Overall, the TI TMS470 compiler comes off slightly worse than GCC. In two cases, however, it was significantly better than GCC, but not as good as RVCT. Incidentally, those were also the ones where RVCT scored the biggest win over GCC.</p>
<p>My conclusions from this test are twofold:</p>
<ul>
<li>ARM&#8217;s own compiler is very hard to beat. They do seem to know how their chips work.</li>
<li>GCC is incredibly bad at 64-bit arithmetic on 32-bit machines.</li>
</ul>
<p>The logical next step is to test these compilers with vectorisation enabled. FFmpeg should offer plenty of opportunities for this feature to shine. Unfortunately, that test will have to wait until the RVCT vectoriser is fixed. The current release does not compile FFmpeg with vectorisation enabled.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4" length="24154488" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4" length="6766583" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/indiana_jones_4-tlr3_h640w.mov" length="16215526" type="video/quicktime" />
<enclosure url="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4" length="31027653" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3" length="4206720" type="audio/mpeg" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg" length="5856908" type="audio/ogg" />
		</item>
		<item>
		<title>ARM compiler shoot-out</title>
		<link>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/</link>
		<comments>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/#comments</comments>
		<pubDate>Wed, 05 Aug 2009 00:06:06 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=150</guid>
		<description><![CDATA[A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled FFmpeg using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a Beagle [...]]]></description>
			<content:encoded><![CDATA[<p>A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled <a href="http://ffmpeg.org/">FFmpeg</a> using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a <a href="http://beagleboard.org/">Beagle board</a> clocked at 600 MHz.</p>
<p>These are the compilers I deemed worthy to participate in the test and the optimisation flags I used with each:</p>
<ul>
<li><strong>GCC 4.3.3</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>GCC 4.4.1</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>CodeSourcery GCC 2007q3</strong> (based on 4.2.1), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-tree-vectorize</li>
<li><strong>CodeSourcery GCC 2009q1</strong> (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>ARM RVCT 4.0 Build 591</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros</li>
</ul>
<p>I would have also included the ARM compiler from Texas Instruments, had it been able to compile FFmpeg.<br />
<span id="more-150"></span><br />
With sample files chosen to exercise various types of code, the result of the test is, sadly, no surprise. The following table lists the runtimes of the different builds relative to the CodeSourcery 2007q3 build. Lower numbers are better.</p>
<table border="0" width="100%">
<col></col>
<col></col>
<col></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<thead>
<tr style="text-align: left;">
<th>Sample name</th>
<th>Codec</th>
<th>Code type</th>
<th>2009q1</th>
<th>4.3.3</th>
<th>4.4.1</th>
<th>RVCT</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4">cathedral</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>0.97</td>
<td>1.02</td>
<td>1.09</td>
<td>0.93</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4">NeroAVC</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>0.98</td>
<td>1.02</td>
<td>1.12</td>
<td>0.95</td>
</tr>
<tr>
<td><a href="http://movies.apple.com/movies/paramount/indiana_jones_4/indiana_jones_4-tlr3_h640w.mov">indiana_jones_4</a></td>
<td>H.264 CAVLC</td>
<td>integer</td>
<td>0.97</td>
<td>1.02</td>
<td>1.09</td>
<td>0.89</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4">NeroRecodeSample</a></td>
<td>MPEG-4 ASP</td>
<td>integer</td>
<td>0.96</td>
<td>1.03</td>
<td>1.27</td>
<td>0.96</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3">Silent_Light</a></td>
<td>MP3</td>
<td>64-bit integer</td>
<td>0.89</td>
<td>0.88</td>
<td>0.97</td>
<td>0.44</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/flac/When I Grow Up.flac">When_I_Grow_Up</a></td>
<td>FLAC</td>
<td>integer</td>
<td>0.98</td>
<td>0.98</td>
<td>0.93</td>
<td>0.86</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg">Lumme-Badloop</a></td>
<td>Vorbis</td>
<td>float</td>
<td>1.03</td>
<td>1.03</td>
<td>1.02</td>
<td>0.97</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/AC3/Canyon-5.1-48khz-448kbit.ac3">Canyon</a></td>
<td>AC-3</td>
<td>float</td>
<td>1.02</td>
<td>1.02</td>
<td>0.99</td>
<td>0.90</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/DTS/lotr_5.1_768.dts">lotr</a></td>
<td>DTS</td>
<td>float</td>
<td>1.02</td>
<td>1.02</td>
<td>1.00</td>
<td>1.03</td>
</tr>
</tbody>
</table>
<p>Looking at the table, I make these observations:</p>
<ul>
<li>CodeSourcery 2009q1 produces faster integer code, but slower floating-point code, than 2007q3.</li>
<li>GCC 4.4.1 produces much slower code than 4.3.3 in several cases, and is never significantly better.</li>
<li>CodeSourcery GCC generally beats FSF GCC.</li>
<li>ARM RVCT readily beats every GCC version. The MP3 figure is not a typo.</li>
</ul>
<p>My recommendation for a free compiler is CodeSourcery 2009q1 unless your code makes heavy use of floating-point, in which case 2007q3 may give better results. If you prefer, for whatever reason, official GNU releases, 4.3.3 should be the version of choice. Avoid GCC 4.4.1; it is far too unpredictable.</p>
<h4>Bootnotes</h4>
<ul>
<li>See also Mike&#8217;s <a title="Intel Beats Up GCC" href="http://multimedia.cx/eggs/intel-beats-up-gcc/">test of x86 compilers</a>.</li>
<li>Thanks to ARM for providing the RVCT compiler.</li>
<li>Thanks to TI for providing the Beagle board.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4" length="24154488" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4" length="6766583" type="video/mp4" />
<enclosure url="http://movies.apple.com/movies/paramount/indiana_jones_4/indiana_jones_4-tlr3_h640w.mov" length="16215526" type="video/quicktime" />
<enclosure url="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4" length="31027653" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3" length="4206720" type="audio/mpeg" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg" length="5856908" type="audio/ogg" />
		</item>
		<item>
		<title>Thumbs up</title>
		<link>http://hardwarebug.org/2009/03/25/thumbs-up/</link>
		<comments>http://hardwarebug.org/2009/03/25/thumbs-up/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 03:27:04 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=125</guid>
		<description><![CDATA[ARM processors have long supported the 16-bit Thumb instruction set, achieving smaller code size at the price of reduced performance. The Thumb-2 extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions. Thumb-2 performance is claimed [...]]]></description>
			<content:encoded><![CDATA[<p>ARM processors have long supported the 16-bit <a href="http://arm.com/products/CPUs/archi-thumb.html">Thumb</a> instruction set, achieving smaller code size at the price of reduced performance. The <a href="http://arm.com/products/CPUs/archi-thumb2.html">Thumb-2</a> extension, introduced with the ARM1156T2-S processor, promises to regain most of this performance loss while retaining the small code size. This is accomplished by mixing 16-bit and 32-bit instructions.</p>
<p>Thumb-2 performance is <a href="http://arm.com/pdfs/Thumb-2CoreTechnologyWhitepaper-Final4.pdf">claimed</a> to reach 98% of the equivalent ARM code while being only 74% of the size. I decided to put this claim to the test with <a href="http://ffmpeg.org/">FFmpeg</a> as the target and compiled the same source revision in ARM and Thumb-2 mode using the <a href="http://arm.com/products/DevTools/RVCT.html">RVCT 4.0</a> compiler. For this test I disabled all hand-written assembler optimisations.</p>
<p>The Thumb-2 executable is 85% of the size of the ARM one, which, although a substantial reduction, falls somewhat short of the promised 74%. I tested the performance by measuring the time to decode a few sample media files on a <a href="http://beagleboard.org/">Beagle board</a>. Several of the samples actually decoded faster with the Thumb-2 build, with one H.264 video clip decoding 4% faster. Only one test, MP3 audio decoding, was significantly slower (15%) than the ARM code. The speedup, where there is one, is likely due to reduced I-cache pressure. Thumb-2 and ARM instructions are executed identically after the initial decode stage, so no improvement can result from the change of instruction set alone.</p>
<p>In conclusion, the Thumb-2 performance is better than I had expected. Nevertheless, a 15% slowdown in even one case is reason enough to carefully benchmark the effects before deciding on a switch.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/03/25/thumbs-up/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Shared library woes and the price of PIC</title>
		<link>http://hardwarebug.org/2009/01/02/shared-library-woes-and-the-price-of-pic/</link>
		<comments>http://hardwarebug.org/2009/01/02/shared-library-woes-and-the-price-of-pic/#comments</comments>
		<pubDate>Fri, 02 Jan 2009 18:28:53 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Bugs]]></category>
		<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=100</guid>
		<description><![CDATA[It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using MOVW/MOVT instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash. When I [...]]]></description>
			<content:encoded><![CDATA[<p>It recently came to my attention that the GNU linker on ARM lacks support for several relocation types in shared libraries. Specifically, code using <code>MOVW/MOVT</code> instruction pairs to load the address of data symbols will not work in a shared library. The linker silently drops the necessary relocations, resulting in a runtime crash.</p>
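<p>For readers unfamiliar with the instruction pair in question, here is a minimal sketch written as inline asm. The names <code>demo_table</code> and <code>demo_table_addr</code> are hypothetical, an ARMv7-A ELF target is assumed, and the <code>#else</code> branch exists only so the code builds elsewhere. In practice the compiler emits this pair itself. The <code>:lower16:</code> and <code>:upper16:</code> operators make the assembler emit <code>R_ARM_MOVW_ABS_NC</code> and <code>R_ARM_MOVT_ABS</code> relocations against the symbol, which is the kind of relocation at issue here.</p>
<pre><code>
int demo_table[4];

/* Load the absolute address of demo_table with a MOVW/MOVT pair. */
static inline int *demo_table_addr(void)
{
    int *p;
#if defined(__ARM_ARCH_7A__)
    __asm__ ("movw %0, #:lower16:demo_table \n\t"   /* bits 0-15  */
             "movt %0, #:upper16:demo_table"        /* bits 16-31 */
             : "=r"(p));
#else
    p = demo_table;   /* portable stand-in, not part of the technique */
#endif
    return p;
}
</code></pre>
<p>Because the two halves of the absolute address are encoded directly in the instructions, the code must be patched when the library is loaded at a different address, which is exactly what position-independent code avoids.</p>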
<p>When I pointed out this shortcoming to Paul Brook of CodeSourcery, his response was that such relocations in shared libraries are not supported by the GNU tools, will never be, and that shared libraries should be built with position-independent code (PIC). This is an unfortunate attitude, and doubly so considering that the latest CodeSourcery GCC version will generate these instructions with default settings. In other words, the 2008q3 release of CodeSourcery GCC will, with default flags, build crashing shared libraries without so much as a warning.</p>
<p>The refusal to support non-PIC shared libraries is unfortunate also from a performance point of view. Position independent code is inherently slower than normal code.</p>
<p>In order to find out just how much slower PIC is on ARM, I made two builds of FFmpeg, one normal and one with PIC. The PIC build is about 1.7% slower in several tests, among them H.264 video decoding.</p>
<p>On typically resource-constrained ARM systems it would be nice to have the option of space-saving shared libraries without paying the PIC penalty in performance. Until now this option has been a reality. With CodeSourcery lazily refusing to support the relocations required by the latest version of their own compiler, this option may soon be a thing of the past, at least if the bugs that have haunted recent compiler releases are fixed in upcoming versions.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/01/02/shared-library-woes-and-the-price-of-pic/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM-NEON memory hazards</title>
		<link>http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/</link>
		<comments>http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/#comments</comments>
		<pubDate>Wed, 31 Dec 2008 02:19:13 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=89</guid>
		<description><![CDATA[The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear [...]]]></description>
			<content:encoded><![CDATA[<p>The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.</p>
<p>Although clearly important with a view to code optimisation, the <a href="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344b/index.html">Cortex-A8 Technical Reference Manual</a> unfortunately does not mention any details about these hazards. In fact, it does not mention them at all.</p>
<p>To shed some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.</p>
<p><span id="more-89"></span>Since NEON execution lags behind the ARM pipeline, three types of hazard can occur:</p>
<ul>
<li>ARM load after NEON store</li>
<li>ARM store after NEON load</li>
<li>ARM store after NEON store</li>
</ul>
<p>The characteristics of each are tested using a loop interleaving 64-bit NEON <code>VLD1/VST1</code> and ARM <code>LDR/STR</code> instructions using addresses at various intervals. The hardware used for the test is a <a href="http://beagleboard.org/">Beagle Board</a> clocked at 500 MHz and with the <a href="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344b/Bgbffjhh.html">L1NEON</a> configuration bit set.</p>
<p>It quickly becomes evident that the basic granularity for the hazard detection is 16 bytes. In addition, some tests show secondary effects within a 64-byte block (cache line). NEON stores crossing a 16-byte boundary apparently incur an extra penalty.</p>
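<p>To make the column headings of the tables below concrete, here is a minimal C sketch of one way to classify a pair of addresses. It assumes the columns mean that the two accesses fall in the same 16-byte block, in the same 64-byte block, or further apart. The helper names <code>classify</code> and <code>spans_16</code> are hypothetical and not part of the benchmark itself.</p>

```c
#include <stdint.h>

/* How close two byte addresses are, using the block sizes observed
 * in the measurements (16 bytes and 64 bytes). */
enum proximity { SAME_16, SAME_64, OTHER };

static enum proximity classify(uintptr_t a, uintptr_t b)
{
    if ((a >> 4) == (b >> 4))
        return SAME_16;   /* same 16-byte block */
    if ((a >> 6) == (b >> 6))
        return SAME_64;   /* same 64-byte block (cache line) */
    return OTHER;
}

/* Non-zero if an access of len bytes (len >= 1) starting at addr
 * crosses a 16-byte boundary. */
static int spans_16(uintptr_t addr, unsigned len)
{
    return (addr >> 4) != ((addr + len - 1) >> 4);
}
```

<p>With these definitions, a 64-bit NEON store at an address ending in <code>0x8</code> stays inside one 16-byte block, whereas the same store at an address ending in <code>0xc</code> spans two.</p>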
<p>The following table lists the approximate number of cycles required for each pair of instructions when no access spans a 16-byte boundary.</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th>16-byte</th>
<th>64-byte</th>
<th>other</th>
</tr>
<tr>
<th>ARM load after NEON store</th>
<td>22</td>
<td>5</td>
<td>3</td>
</tr>
<tr>
<th>ARM store after NEON load</th>
<td>13</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<th>ARM store after NEON store</th>
<td>22</td>
<td>4</td>
<td>4</td>
</tr>
</tbody>
</table>
<p>The delay of roughly 20 cycles after a NEON store corresponds nicely with the figure of 20 cycles the <a href="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344b/ch16s05s02.html">TRM</a> quotes for an MRC transfer from NEON to ARM.</p>
<p>The next table lists the same timings when the NEON access spans a 16-byte boundary.</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th>16-byte</th>
<th>64-byte</th>
<th>other</th>
</tr>
<tr>
<th>ARM load after NEON store</th>
<td>22</td>
<td>7</td>
<td>5</td>
</tr>
<tr>
<th>ARM store after NEON load</th>
<td>13</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<th>ARM store after NEON store</th>
<td>22</td>
<td>52</td>
<td>48</td>
</tr>
</tbody>
</table>
<p>I was somewhat baffled by the last line. Clearly such NEON stores are something to be avoided. Splitting the NEON store into two 32-bit stores has a dramatic effect:</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th>16-byte</th>
<th>64-byte</th>
<th>other</th>
</tr>
<tr>
<th>ARM store after NEON store</th>
<td>22</td>
<td>32</td>
<td>29</td>
</tr>
</tbody>
</table>
<p>Although clearly an improvement, it is still bad enough that mixing such accesses could easily impact performance seriously. It should also be noted that in all other cases, the 64-bit store is faster.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>CodeSourcery fails again</title>
		<link>http://hardwarebug.org/2008/11/28/codesourcery-fails-again/</link>
		<comments>http://hardwarebug.org/2008/11/28/codesourcery-fails-again/#comments</comments>
		<pubDate>Fri, 28 Nov 2008 00:19:49 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Bugs]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=83</guid>
		<description><![CDATA[The bug I discovered in CodeSourcery&#8217;s 2008q3 release of their GCC version was apparently deemed serious enough for the company to publish an updated release, tagged 2008q3-72, earlier this week. I took it for a test drive. Since last time, I have updated the FFmpeg regression test scripts, enabling a cross-build to be easily tested [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://hardwarebug.org/2008/10/11/codesourcery-gcc-2008q3-fail/">bug</a> I discovered in CodeSourcery&#8217;s 2008q3 release of their GCC version was apparently deemed serious enough for the company to publish an updated release, tagged 2008q3-72, earlier this week. I took it for a test drive.</p>
<p>Since last time, I have updated the <a href="http://ffmpeg.org/">FFmpeg</a> regression test scripts, enabling a cross-build to be easily tested on the target device. For the compiler test this means that much more code will be checked for correct operation compared to the rather limited tests I performed on previous versions. Having verified all tests passing when built with the 2007q3 release, I proceeded with the new 2008q3-72 compiler.</p>
<p>All but one of the FFmpeg regression tests passed. Converting a colour image to 1-bit monochrome format failed. A few minutes of detective work revealed the erroneous code, and a simple test case was easily extracted.</p>
<p>The test case looks strikingly familiar:</p>
<blockquote>
<pre>extern unsigned char dst[512] __attribute__((aligned(8)));
extern unsigned char src[512] __attribute__((aligned(8)));

void array_shift(void)
{
    int i;
    for (i = 0; i &lt; 512; i++)
        dst[i] = src[i] &gt;&gt; 7;
}</pre>
</blockquote>
<p><span id="more-83"></span>The <code>aligned(8)</code> attribute is not required to trigger the bug; it merely removes some clutter from the generated assembler. Slightly edited for readability, the assembler output from the compiler looks like this:</p>
<blockquote>
<pre>array_shift:
        movw        ip, #:lower16:dst
        movw        r0, #:lower16:src
        movt        ip, #:upper16:dst
        movt        r0, #:upper16:src
        vmov.i32    d17, #249  @ v8qi
        mov         r1, #0
.L2:
        add         r2, ip, r1
        add         r3, r0, r1
        add         r1, r1, #8
        vldr        d16, [r3]
        cmp         r1, #512
        vshl.u8     d16, d16, d17
        vstr        d16, [r2]
        bne         .L2
        bx          lr</pre>
</blockquote>
<p>The vectoriser has done its job and decided to use NEON vector operations to process 8 elements in parallel. The mysterious-looking constant 249 is simply the 8-bit representation of -7. The error is in using the <code>vmov.i32</code> instruction, which writes an immediate value into all <strong>32-bit</strong> elements of the destination register. Using the resulting vector as the shift amount with the <code>vshl.u8</code> instruction, which operates on vectors of <strong>8-bit</strong> data, clearly will not work as intended. Only one in four elements of the array will be shifted, the rest being copied unchanged. The <code>v8qi</code> annotation next to the incorrect instruction is of particular interest. It indicates that the compiler in fact intended to create an 8-element vector of 8-bit values. The translation of this operation into an assembler instruction seems to have gone horribly wrong. A <code>vmov.i8</code> instruction would have been correct.</p>
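<p>The effect can be reproduced without NEON hardware using a scalar model. The sketch below is a simplified C model of the three instructions involved, assuming little-endian lane order and shift counts within -7..7; the function names are hypothetical and simply mirror the mnemonics.</p>

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of VSHL.U8 on a 64-bit D register: each byte of val is
 * shifted by the signed byte in the same lane of shift. Negative counts
 * shift right. Only counts in -7..7 are modelled. */
static void vshl_u8(uint8_t out[8], const uint8_t val[8], const uint8_t shift[8])
{
    for (int i = 0; i < 8; i++) {
        /* Reinterpret the lane as a signed 8-bit count. */
        int s = shift[i] >= 128 ? shift[i] - 256 : shift[i];
        out[i] = s >= 0 ? (uint8_t)(val[i] << s) : (uint8_t)(val[i] >> -s);
    }
}

/* VMOV.I32 Dd, #imm: the 8-bit immediate, zero-extended, fills each
 * 32-bit lane, so only bytes 0 and 4 receive it. */
static void vmov_i32(uint8_t d[8], uint8_t imm)
{
    memset(d, 0, 8);
    d[0] = imm;
    d[4] = imm;
}

/* VMOV.I8 Dd, #imm: the immediate fills every 8-bit lane. */
static void vmov_i8(uint8_t d[8], uint8_t imm)
{
    memset(d, imm, 8);
}
```

<p>Feeding eight bytes of <code>0x80</code> through <code>vshl_u8</code> with a shift vector from <code>vmov_i32(sh, 249)</code> changes only lanes 0 and 4 to 1, while <code>vmov_i8(sh, 249)</code> turns all eight lanes into 1, as the C source requires.</p>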
<p>As an experiment, I changed the arrays to <code>unsigned short</code>, i.e. 16-bit, elements. This is what the compiler produced:</p>
<blockquote>
<pre>array_shift:
        movw        ip, #:lower16:dst
        movw        r0, #:lower16:src
        movt        ip, #:upper16:dst
        movt        r0, #:upper16:src
        mov         r1, #0
        vldr        d17, .L6
.L2:
        add         r2, ip, r1
        add         r3, r0, r1
        add         r1, r1, #8
        vldr        d16, [r3, #0]
        cmp         r1, #1024
        vshl.u16    d16, d16, d17
        vstr        d16, [r2, #0]
        bne         .L2
        bx          lr
.L7:
        .align      3
.L6:
        .short      -7
        .short      0
        .short      -7
        .short      0</pre>
</blockquote>
<p>The immediate operand of the <code>vmov</code> instruction is limited to 8 bits, so the compiler has decided to load the constant vector from a literal pool following the function. The constant it has placed there is perfectly analogous to the flawed value from the first test: the 16-bit representation of -7 zero-extended into a vector of 32-bit elements.</p>
<p>Finally, I replaced the right shift with a left shift. To my astonishment, the compiler generated the correct <code>vmov.i8</code> instruction (with a constant of +7). It even repeated this feat with 16-bit arrays.</p>
<p>CodeSourcery insist they subject every compiler release to an extensive test suite. Evidently it does not extend to cover the right shift operator.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2008/11/28/codesourcery-fails-again/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ARM wish-list</title>
		<link>http://hardwarebug.org/2008/10/19/arm-wish-list/</link>
		<comments>http://hardwarebug.org/2008/10/19/arm-wish-list/#comments</comments>
		<pubDate>Sun, 19 Oct 2008 00:44:10 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=36</guid>
		<description><![CDATA[Some time ago, I was asked for a multimedia hacker&#8217;s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list. Saturating unsigned+signed add/subtract. With the current instruction set, this operation requires six instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN) and two extra registers (one if optimal [...]]]></description>
			<content:encoded><![CDATA[<p>Some time ago, I was asked for a multimedia hacker&#8217;s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list.</p>
<ol>
<li>Saturating unsigned+signed add/subtract.<br />
With the current instruction set, this operation requires six instructions (2x <code>VMOVL</code>, 2x <code>VADDW</code>, 2x <code>VQMOVUN</code>) and two extra registers (one if optimal scheduling is not required) for 128-bit vectors. Furthermore, this is a frequently occurring operation, for instance in the H.264 loop filter.</li>
<li>More registers.<br />
Having another, say, 8 vector registers would be very handy.  Encoding this in the existing instructions would of course be tricky, if at all possible.  A special <code>VMOV</code> and/or <code>VSWP</code> instruction to access the high registers would be an acceptable compromise, and would certainly be better than using scratch memory.  An alternative option could be to make the high half of the existing register file banked.  This could perhaps even be done in some clever way allowing the OS to skip save/restore of these registers for processes that never use them.</li>
<li>256-bit operations.<br />
8-element vectors are frequently used in video processing. One example is the ubiquitous 8&#215;8 IDCT. In some instances, 32 bits per element are required in intermediate values to maintain adequate precision. The 8&#215;8 IDCT is once again an example. In these cases, support for 8&#215;32-bit vectors would clearly be an advantage.</li>
<li>Vector sum.<br />
The sum of all elements in a vector is computed as a part of many algorithms, for instance anything involving a dot product, as well as motion estimation in video encoding. Presently, the only option is to use a sequence of 3 or 4 <code>VPADD</code> instructions.</li>
<li>Transposed load/store.<br />
When performing the same operation on each of a set of rows, one must load values row-wise into registers, and then transpose the registers before using the vector arithmetic instructions. When done computing, the values are again transposed before being stored row-wise. A set of load/store instructions transferring data between rows in memory and &#8220;columns&#8221; in the register file would save the cost of the transposing operations.</li>
<li>Improved NEON to ARM transfer.<br />
On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles, during which time any normal access to the ARM register file will stall. This delay makes some potential use cases for NEON practically worthless. I am told this has been addressed in the almost-ready Cortex-A9.</li>
</ol>
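<p>As a point of reference for the first item, the operation being asked for can be written in scalar C as follows. This is a minimal sketch assuming 8-bit elements, i.e. an unsigned pixel plus a signed delta clamped to 0..255; the function names are hypothetical.</p>

```c
#include <stdint.h>

/* Scalar reference for a saturating unsigned+signed add on 8-bit data. */
static uint8_t uqadd_signed(uint8_t pixel, int8_t delta)
{
    int sum = pixel + delta;   /* widen first, as VMOVL/VADDW do */
    if (sum < 0)
        return 0;              /* saturate, as VQMOVUN does */
    if (sum > 255)
        return 255;
    return (uint8_t)sum;
}

/* The same operation over a 16-element (128-bit) vector. */
static void uqadd_signed_x16(uint8_t dst[16], const uint8_t px[16],
                             const int8_t delta[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = uqadd_signed(px[i], delta[i]);
}
```

<p>The widening step is what costs the extra instructions and registers on NEON: the sum needs more than 8 bits before it can be clamped back into range.</p>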
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2008/10/19/arm-wish-list/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>CodeSourcery&#8217;s defence</title>
		<link>http://hardwarebug.org/2008/10/14/codesourcerys-defence/</link>
		<comments>http://hardwarebug.org/2008/10/14/codesourcerys-defence/#comments</comments>
		<pubDate>Tue, 14 Oct 2008 03:22:50 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Bugs]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=21</guid>
		<description><![CDATA[Having covered the spectacular failure of CodeSourcery&#8217;s latest ARM compiler a few days ago, I was engaged in a curious debate on IRC with one of their employees. Fiercely denying the problem at first, he eventually offered an explanation: they do not test the compiler output on real hardware; they use QEMU. QEMU is a [...]]]></description>
			<content:encoded><![CDATA[<p>Having covered the <a href="http://hardwarebug.org/2008/10/11/codesourcery-gcc-2008q3-fail/">spectacular failure</a> of CodeSourcery&#8217;s latest ARM compiler a few days ago, I was engaged in a curious <a href="http://www.beagleboard.org/irclogs/index.php?date=2008-10-11#T11:31:19">debate on IRC</a> with one of their employees. Fiercely denying the problem at first, he eventually offered an explanation: they do not test the compiler output on real hardware; they use <a href="http://bellard.org/qemu/">QEMU</a>.</p>
<p>QEMU is a CPU emulator supporting a variety of targets. While great for casual development, and for running foreign applications, it is certainly no substitute for real hardware when testing a compiler. Like any piece of software, an emulator is bound to have a few errors, and as it happens, QEMU has known bugs in its handling of the NEON instruction set. Our friend at CodeSourcery should be well aware of these, also being a QEMU developer.</p>
<p>The use of emulators was explained as a necessity due to real hardware not being available. To be fair, CodeSourcery does develop against new hardware before it exists, so some reliance on emulators is unavoidable. This is, however, not the case this time. The <a href="http://elinux.org/BeagleBoard">Beagleboard</a> was made available to selected developers quite some time ago (I have had one since May, others still longer), and is now being sold by the thousands. CodeSourcery developers, so I am told, were also given an offer of a free board, an offer they chose to refuse.</p>
<p>What does all this mean? Did Murphy decide to inflict maximum bad luck on the hard-working developers, or is there perhaps a larger conspiracy at work? I shall not attempt to speculate in this matter. I will merely repeat this excellent piece of advice given by Robert J. Hanlon: <em>Never attribute to malice that which can be adequately explained by incompetence.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2008/10/14/codesourcerys-defence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
