<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hardwarebug</title>
	<atom:link href="http://hardwarebug.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://hardwarebug.org</link>
	<description>Everything is broken</description>
	<lastBuildDate>Thu, 04 Mar 2010 00:26:17 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Ogg objections</title>
		<link>http://hardwarebug.org/2010/03/03/ogg-objections/</link>
		<comments>http://hardwarebug.org/2010/03/03/ogg-objections/#comments</comments>
		<pubDate>Wed, 03 Mar 2010 14:17:07 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Multimedia]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=374</guid>
		<description><![CDATA[The Ogg container format is being promoted by the Xiph Foundation for use with its Vorbis and Theora codecs. Unfortunately, a number of technical shortcomings in the format render it ill-suited to most, if not all, use cases.  This article examines the most severe of these flaws.

Overview of Ogg
The basic unit in an Ogg [...]]]></description>
			<content:encoded><![CDATA[<p>The Ogg container format is being promoted by the Xiph Foundation for use with its Vorbis and Theora codecs. Unfortunately, a number of technical shortcomings in the format render it ill-suited to most, if not all, use cases.  This article examines the most severe of these flaws.<br />
<span id="more-374"></span></p>
<h1>Overview of Ogg</h1>
<p>The basic unit in an Ogg stream is the <em>page</em> consisting of a header followed by one or more packets from a single elementary stream. A page can contain up to 255 packets, and a packet can span any number of pages. The following table describes the page header.</p>
<div class="frame-outer small">
<div style="text-align: left;">
<table>
<tr>
<th>Field</th>
<th>Size (bits)</th>
<th>Description</th>
</tr>
<tr>
<td>capture_pattern</td>
<td>32</td>
<td>magic number &#8220;OggS&#8221;</td>
</tr>
<tr>
<td>version</td>
<td>8</td>
<td>always zero</td>
</tr>
<tr>
<td>flags</td>
<td>8</td>
<td>header type flags (continued packet, first page, last page)</td>
</tr>
<tr>
<td>granule_position</td>
<td>64</td>
<td>abstract timestamp</td>
</tr>
<tr>
<td>bitstream_serial_number</td>
<td>32</td>
<td>elementary stream number</td>
</tr>
<tr>
<td>page_sequence_number</td>
<td>32</td>
<td>incremented by 1 each page</td>
</tr>
<tr>
<td>checksum</td>
<td>32</td>
<td>CRC of entire page</td>
</tr>
<tr>
<td>page_segments</td>
<td>8</td>
<td>length of segment_table</td>
</tr>
<tr>
<td>segment_table</td>
<td>variable</td>
<td>list of packet sizes</td>
</tr>
</table>
</div>
</div>
<p>Elementary stream types are identified by looking at the payload of the first few pages, which contain any setup data required by the decoders. For full details, see the official <a href="http://xiph.org/ogg/">format specification</a>.</p>
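<p>To make the layout concrete, here is a minimal sketch of a header parser in Python. It assumes little-endian fields in the order listed in the table. The function name <code>parse_page_header</code> and the dictionary keys are my own choices for illustration, and no checksum verification is attempted.</p>

```python
import struct

# Fixed part of the page header: 4 + 1 + 1 + 8 + 4 + 4 + 4 + 1 = 27 bytes.
HEADER = struct.Struct("<4sBBqIIIB")


def parse_page_header(data, offset=0):
    """Parse one page header starting at offset and return its fields."""
    (capture, version, flags, granule, serial,
     sequence, checksum, n_segments) = HEADER.unpack_from(data, offset)
    if capture != b"OggS":
        raise ValueError("capture pattern not found")
    table_start = offset + HEADER.size
    segment_table = data[table_start:table_start + n_segments]
    if len(segment_table) != n_segments:
        raise ValueError("truncated segment table")
    return {
        "version": version,
        "flags": flags,
        "granule_position": granule,
        "bitstream_serial_number": serial,
        "page_sequence_number": sequence,
        "checksum": checksum,
        "segment_table": list(segment_table),
        "header_size": HEADER.size + n_segments,
        "payload_size": sum(segment_table),
    }


# A hand-made header for illustration only (the checksum is a dummy value).
example = HEADER.pack(b"OggS", 0, 0, 1234, 42, 7, 0, 2) + bytes([255, 45])
fields = parse_page_header(example)
print(fields["header_size"], fields["payload_size"])
```

<p>Note that the parser learns nothing about what the payload contains; that knowledge lives entirely in the codec mappings.</p>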
<h1>Generality</h1>
<p>Ogg, legend tells, was designed to be a general-purpose container format. To most multimedia developers, a general-purpose format is one in which encoded data of any type can be encapsulated with a minimum of effort.</p>
<p>The Ogg format defined by the specification does not fit this description. For every format one wishes to use with Ogg, a complex <em>mapping</em> must first be defined. This mapping defines how to identify a codec, how to extract setup data, and even how timestamps are to be interpreted. All this is done differently for every codec. To correctly parse an Ogg stream, every such mapping ever defined must be known.</p>
<p>Under this premise, a centralised repository of codec mappings would seem like a sensible idea, but alas, no such thing exists. It is simply impossible to obtain an exhaustive list of defined mappings, which makes the task of creating a complete implementation somewhat daunting.</p>
<p>One brave soul, Tobias Waldvogel, created a mapping, OGM, capable of storing any Microsoft AVI compatible codec data in Ogg files. This format saw some use in the wild, but was <a href="http://www.xiph.org/container/ogm.html">frowned upon</a> by Xiph, and it was eventually displaced by other formats.</p>
<p>True generality is evidently not to be found with the Ogg format.</p>
<p>A good example of a general-purpose format is <a href="http://matroska.org/">Matroska</a>. This container can trivially accommodate any codec; all it requires is a unique string to identify the codec. For codecs requiring setup data, a standard location for this is provided in the container. Furthermore, an official list of codec identifiers is maintained, meaning all information required to fully support Matroska files is available from one place.</p>
<p>Matroska also has probably the greatest advantage of all: it is in active, widespread use. Historically, standards derived from existing practice have proven more successful than those created by a design committee.</p>
<h1>Overhead</h1>
<p>When designing a container format, one important consideration is that of overhead, i.e. the extra space required in addition to the elementary stream data being combined. For any given container, the overhead can be divided into a fixed part, independent of the total file size, and a variable part growing with increasing file size.  The fixed overhead is not of much concern, its relative contribution being negligible for typical file sizes.</p>
<p>The variable overhead in the Ogg format comes from the page headers, mostly from the <code>segment_table</code> field.  This field uses a most peculiar encoding, somewhat reminiscent of Roman numerals. In Roman times, numbers were written as a sequence of symbols, each representing a value, the combined value being the sum of the constituent values.</p>
<p>The <code>segment_table</code> field lists the sizes of all packets in the page. Each value in the list is coded as a number of bytes equal to 255 followed by a final byte with a smaller value. The packet size is simply the sum of all these bytes. Any strictly additive encoding, such as this, has the distinct drawback of coded length being linearly proportional to the encoded value.  A value of 5000, a reasonable packet size for video of moderate bitrate, requires no fewer than 20 bytes to encode.</p>
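<p>The additive encoding is easy to sketch in a few lines of Python. The two helper names are hypothetical, and the limit of 255 entries per page is not enforced here.</p>

```python
def lacing_values(packet_size):
    """Return the segment_table bytes that code one packet of the given size."""
    full, last = divmod(packet_size, 255)
    # A run of 255s followed by one terminating byte smaller than 255.
    return [255] * full + [last]


def decode_packet_sizes(segment_table):
    """Recover the packet sizes by summing each run up to its terminator."""
    sizes, current = [], 0
    for value in segment_table:
        current += value
        if value != 255:  # a value below 255 ends the packet
            sizes.append(current)
            current = 0
    return sizes


table = lacing_values(5000) + lacing_values(400)
print(len(lacing_values(5000)), decode_packet_sizes(table))
```

<p>For the 5000-byte packet this yields nineteen bytes of 255 and a final byte of 155, twenty bytes in all. A packet whose size is an exact multiple of 255 still needs a terminating zero byte.</p>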
<p>On top of this we have the 27-byte page header which, although paling in comparison to the packet size encoding, is still much larger than necessary. Starting at the top of the list:</p>
<ul>
<li>The <code>version</code> field could be disposed of, a single-bit marker being adequate to separate this first version from hypothetical future versions. One of the unused positions in the <code>flags</code> field could be used for this purpose.</li>
<li>A 64-bit <code>granule_position</code> is complete overkill. 32 bits would be more than enough for the vast majority of use cases. In extreme cases, a one-bit flag could be used to signal an extended timestamp field.</li>
<li>32-bit elementary stream number? Are they anticipating files with four billion elementary streams? An eight-bit field, if not smaller, would seem more appropriate here.</li>
<li>The 32-bit <code>page_sequence_number</code> is inexplicable. The intent is to allow detection of page loss due to transmission errors. ISO MPEG-TS uses a 4-bit counter per 188-byte packet for this purpose, and that format is used where packet loss actually happens, unlike any use of Ogg to date.</li>
<li>A mandatory 32-bit checksum is nothing but a waste of space when using a reliable storage/transmission medium. Again, a flag could be used to signal the presence of an optional checksum field.</li>
</ul>
<p>With the changes suggested above, the page header would shrink from 27 bytes to 12 bytes in size.</p>
<p>We thus see that in an Ogg file, the packet size fields alone contribute an overhead of 1/255 or approximately 0.4%. This is a hard lower bound on the overhead, not attainable even in theory. In reality the overhead tends to be closer to 1%.</p>
<p>Contrast this with the ISO MP4 file format, which can easily achieve an overhead of less than 0.05% with a 1 Mbps elementary stream.</p>
<h1>Latency</h1>
<p>In many applications end-to-end latency is an important factor. Examples include video conferencing, telephony, live sports events, interactive gaming, etc. With the codec layer contributing as little as <a href="http://x264dev.multimedia.cx/?p=249">10 milliseconds</a> of latency, the amount imposed by the container becomes an important factor.</p>
<p>Latency in an Ogg-based system is introduced at both the sender and the receiver. Since the page header depends on the entire contents of the page (packet sizes and checksum), a full page of packets must be buffered by the sender before a single bit can be transmitted. This sets a lower bound for the sending latency at the duration of a page.</p>
<p>On the receiving side, playback cannot commence until packets from all elementary streams are available. Hence, with two streams (audio and video) interleaved at the page level, playback is delayed by at least one page duration (two if checksums are verified).</p>
<p>Taking both send and receive latencies into account, the minimum end-to-end latency for Ogg is thus twice the duration of a page, triple if strict checksum verification is required. If page durations are variable, the maximum value must be used in order to avoid buffer underflows.</p>
<p>Minimum latency is clearly achieved by minimising the page duration, which in turn implies sending only one packet per page. This is where the size of the page header becomes important. The header for a single-packet page is <code>27 + packet_size/255</code> bytes in size. For a 1 Mbps video stream at 25 fps this gives an overhead of approximately 1%.  With a typical audio packet size of 400 bytes, the overhead becomes a staggering 7%. The average overhead for a multiplex of these two streams is 1.4%.</p>
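<p>The arithmetic behind these figures can be checked with a short Python sketch. It counts the lacing bytes exactly (the integer part of <code>packet_size/255</code> plus one terminating byte); the function name is my own.</p>

```python
def single_packet_page_overhead(packet_size):
    """Header bytes divided by payload bytes for a page holding one packet."""
    lacing_bytes = packet_size // 255 + 1
    header_bytes = 27 + lacing_bytes
    return header_bytes / packet_size


video_packet = 1_000_000 // 8 // 25  # 1 Mbps at 25 fps: 5000 bytes per frame
audio_packet = 400

video_overhead = single_packet_page_overhead(video_packet)
audio_overhead = single_packet_page_overhead(audio_packet)
print(f"video: {video_overhead:.2%}, audio: {audio_overhead:.2%}")
```

<p>The video packet needs a 47-byte header (0.94%) and the audio packet a 29-byte header (7.25%). The combined figure for a multiplex depends on the audio bitrate, which is why it is not computed here.</p>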
<p>As it stands, the Ogg format is clearly not a good choice for a low-latency application. The key to low latency is small packets and fine-grained interleaving of streams, and although Ogg can provide both of these, by sending a single packet per page, the price in overhead is simply too high.</p>
<p>ISO MPEG-PS has an overhead of 9 bytes on most packets (a 5-byte timestamp is added a few times per second), and Microsoft&#8217;s ASF has a 12-byte packet header. My suggestions for compacting the Ogg page header would bring it in line with these formats.</p>
<h1>Random access</h1>
<p>Any general-purpose container format needs to allow random access for direct seeking to any given position in the file. Despite this goal being explicitly mentioned in the Ogg specification, the format only allows the most crude of random access methods.</p>
<p>While many container formats include an index allowing a time to be directly translated into an offset into the file, Ogg has nothing of this kind, the stated rationale for the omission being that this would require a two-pass multiplexing, the second pass creating the index. This is obviously not true; the index could simply be written at the end of the file. Those objecting that this index would be unavailable in a streaming scenario are forgetting that seeking is impossible there regardless.</p>
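<p>As a rough illustration of how little an index asks of the multiplexer, consider the following Python sketch. The <code>SeekIndex</code> class is hypothetical and does not correspond to any particular container format; entries are assumed to be added in increasing time order while the file is being written, and the finished table would be appended at the end of the file.</p>

```python
import bisect


class SeekIndex:
    """Time-to-offset table collected in a single multiplexing pass."""

    def __init__(self):
        self.times = []
        self.offsets = []

    def add(self, time, offset):
        """Record a random access point; times must be non-decreasing."""
        self.times.append(time)
        self.offsets.append(offset)

    def lookup(self, time):
        """Return the offset of the last entry at or before the given time."""
        position = bisect.bisect_right(self.times, time) - 1
        return self.offsets[max(position, 0)]


index = SeekIndex()
for time, offset in [(0, 0), (1000, 5000), (2000, 12000)]:
    index.add(time, offset)
print(index.lookup(1500))
```

<p>A lookup costs one in-memory search followed by a single file-level seek.</p>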
<p>The method for seeking suggested by the Ogg documentation is to perform a binary search on the file, after each file-level seek operation scanning for a page header, extracting the timestamp, and comparing it to the desired position. When the elementary stream encoding allows only certain packets as random access points (video key frames), a second search will have to be performed to locate the entry point closest to the desired time. In a large file (sizes upwards of 10 GB are common), 50 seeks might be required to find the correct position.</p>
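<p>The first of these searches can be sketched in Python as follows. This is a simplified model under several assumptions: the stream is held in memory, it contains a single elementary stream whose timestamp is read directly from the header, the synthetic payload is all zeros so no header emulation can occur, and checksums are ignored. The helper names are mine.</p>

```python
import struct

HEADER = struct.Struct("<4sBBqIIIB")


def make_page(granule, payload_size):
    """Build a synthetic single-packet page (dummy checksum, zero payload)."""
    lacing = [255] * (payload_size // 255) + [payload_size % 255]
    header = HEADER.pack(b"OggS", 0, 0, granule, 1, 0, 0, len(lacing))
    return header + bytes(lacing) + bytes(payload_size)


def next_page(data, offset):
    """Scan forward from offset for a capture pattern; return (pos, granule)."""
    pos = data.find(b"OggS", offset)
    if pos < 0 or pos + HEADER.size > len(data):
        return None
    granule = HEADER.unpack_from(data, pos)[3]
    return pos, granule


def bisect_seek(data, target):
    """Return (offset of first page with granule >= target, probes used)."""
    low, high, probes = 0, len(data), 0
    best = None
    while low < high:
        mid = (low + high) // 2
        probes += 1  # each probe is one file-level seek plus a scan
        found = next_page(data, mid)
        if found is None or found[1] >= target:
            if found is not None:
                best = found[0]
            high = mid
        else:
            low = found[0] + 1
    return best, probes


stream = b"".join(make_page(granule, 5000) for granule in range(500))
offset, probes = bisect_seek(stream, 321)
print(offset, probes)
```

<p>Even on this 2.5 MB toy stream the search needs about twenty probes, each of which would be a separate seek on a real storage medium. The second search for a key frame is not modelled.</p>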
<p>A typical hard drive has an average seek time of roughly 10 ms, giving a total time for the seek operation of around 500 ms, an annoyingly long time. On a slow medium, such as an optical disc or files served over a network, the times are orders of magnitude longer.</p>
<p>A factor further complicating the seeking process is the possibility of header emulation within the elementary stream data. To safeguard against this, one has to read the entire page and verify the checksum. If the storage medium cannot provide data much faster than during normal playback, this provides yet another substantial delay towards finishing the seeking operation. This too applies to both network delivery and optical discs.</p>
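<p>For reference, the verification step looks roughly like this in Python. The Ogg specification defines the checksum as a CRC-32 with generator polynomial 0x04c11db7, zero initial value and no final XOR, computed over the whole page with the checksum field set to zero. The function names and the hand-built test page are my own, and the sketch favours clarity over speed.</p>

```python
def _make_table():
    """Build the byte-wise lookup table for polynomial 0x04C11DB7."""
    table = []
    for byte in range(256):
        crc = byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
        table.append(crc)
    return table


CRC_TABLE = _make_table()


def ogg_crc(data):
    """CRC-32, most significant bit first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ CRC_TABLE[(crc >> 24) ^ byte]
    return crc


def page_checksum_ok(page):
    """Check a complete page; the checksum occupies bytes 22 to 25."""
    stored = int.from_bytes(page[22:26], "little")
    zeroed = page[:22] + bytes(4) + page[26:]
    return ogg_crc(zeroed) == stored


# A hand-built page: 27-byte header, one lacing value, five payload bytes.
page = b"OggS" + bytes(22) + bytes([1]) + bytes([5]) + b"hello"
page = page[:22] + ogg_crc(page).to_bytes(4, "little") + page[26:]
corrupted = page[:-1] + b"!"
print(page_checksum_ok(page), page_checksum_ok(corrupted))
```

<p>The point to note is that <code>page_checksum_ok</code> needs the complete page, up to roughly 64 kB, before it can say anything about whether the header found was genuine.</p>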
<p>Although optical disc usage is perhaps in decline today, one should bear in mind that the Ogg format was designed at a time when CDs and DVDs were rapidly gaining ground, and network-based storage is most certainly on the rise.</p>
<p>The final nail in the coffin of seeking is the codec-dependent timestamp format. At each step in the seeking process, the timestamp parsing specified by the codec mapping corresponding to the current page must be invoked. If the mapping is not known, the best one can do is skip pages until one with a known mapping is found. This delays the seeking and complicates the implementation, both bad things.</p>
<h1>Timestamps</h1>
<p>A problem as old as multimedia itself is that of synchronising multiple elementary streams (e.g. audio and video) during playback; badly synchronised A/V is highly unpleasant to view. By the time Ogg was invented, solutions to this problem were long since explored and well-understood. The key to proper synchronisation lies in tagging elementary stream packets with timestamps, packets carrying the same timestamp intended for simultaneous presentation. The concept is as simple as it seems, so it is astonishing to see the amount of complexity with which the Ogg designers managed to imbue it. So bizarre is it, that I have devoted an <a href="http://hardwarebug.org/2008/11/17/ogg-timestamps-explored/">entire article</a> to the topic, and will not cover it further here.</p>
<h1>Complexity</h1>
<p>Video and audio decoding are time-consuming tasks, so containers should be designed to minimise extra processing required. With the data volumes involved, even an act as simple as copying a packet of compressed data can have a significant impact. Once again, however, Ogg lets us down. Despite the brevity of the specification, the format is remarkably complicated to parse properly.</p>
<p>The unusual and inefficient encoding of the packet sizes limits the page size to somewhat less than 64 kB. To still allow individual packets larger than this limit, it was decided to allow packets spanning multiple pages, a decision with unfortunate implications. A page-spanning packet as it arrives in the Ogg stream will be discontiguous in memory, a situation most decoders are unable to handle, and reassembly, i.e. copying, is required.</p>
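<p>The reassembly can be sketched in Python as below. Each page of a single elementary stream is modelled as a pair of segment table and payload; the generator name is hypothetical, and the continuation flag is ignored for simplicity.</p>

```python
def reassemble(pages):
    """Yield complete packets from (segment_table, payload) pairs."""
    pending = bytearray()
    for segment_table, payload in pages:
        position = 0
        for value in segment_table:
            # This is the copy a contiguous layout would have avoided.
            pending += payload[position:position + value]
            position += value
            if value != 255:  # a value below 255 ends the packet
                yield bytes(pending)
                pending = bytearray()


packet = (bytes(range(256)) * 3)[:600]
first_page = ([255, 255], packet[:510])
second_page = ([90, 10], packet[510:] + b"0123456789")
packets = list(reassemble([first_page, second_page]))
print([len(p) for p in packets])
```

<p>The 600-byte packet ends the first page with a lacing value of 255, so it is only known to be complete once the second page has arrived and been copied.</p>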
<p>The knowledgeable reader may at this point remark that the MPEG-TS format also splits packets into pieces requiring reassembly before decoding. There is, however, a significant difference there. MPEG-TS was designed for hardware demultiplexing feeding directly into hardware decoders. In such an implementation the fragmentation is not a problem. Rather, the fine-grained interleaving is a feature allowing smaller on-chip buffers.</p>
<p>Buffering is also an area in which Ogg suffers. To keep the overhead down, pages must be made as large as practically possible, and page size translates directly into demultiplexer buffer size. Playback of a file with two elementary streams thus requires 128 kB of buffer space. On a modern PC this is perhaps nothing to be concerned about, but in a small embedded system, e.g. a portable media player, it can be relevant.</p>
<p>In addition to the above, a number of other issues, some of them minor, others more severe, make Ogg processing a painful experience. A selection follows:</p>
<ul>
<li>32-bit random elementary stream identifiers mean a simple table-lookup cannot be used. Instead the list of streams must be searched for a match. While trivial to do in software, it is still annoying, and a hardware demultiplexer would be significantly more complicated than with a smaller identifier.</li>
<li>Semantically ambiguous streams are possible. For example, the continuation flag (bit 1) may conflict with continuation (or lack thereof) implied by the segment table on the preceding page. Such invalid files have been spotted in the wild.</li>
<li>Concatenating independent Ogg streams forms a valid stream. While finding a use case for this strange feature is difficult, an implementation must of course be prepared to encounter such streams. Detecting and dealing with these adds pointless complexity.</li>
<li>Unusual terminology: inventing new terms for well-known concepts is confusing for the developer trying to understand the format in relation to others. A few examples:<br />
<table style="text-align: left; width: 100%;">
<tr>
<th>Ogg name</th>
<th>Usual name</th>
</tr>
<tr>
<td>logical bitstream</td>
<td>elementary stream</td>
</tr>
<tr>
<td>grouping</td>
<td>multiplexing</td>
</tr>
<tr>
<td>lacing value</td>
<td>packet size (approximately)</td>
</tr>
<tr>
<td>segment</td>
<td>imaginary element serving no real purpose</td>
</tr>
<tr>
<td>granule position</td>
<td>timestamp</td>
</tr>
</table>
</li>
</ul>
<h1>Final words</h1>
<p>We have found the Ogg format to be a dubious choice in just about every situation. Why then do certain organisations and individuals persist in promoting it with such ferocity?</p>
<p>When challenged, three types of reaction are characteristic of the Ogg campaigners.</p>
<p>On occasion, these people will assume an apologetic tone, explaining how Ogg was only ever designed for simple audio-only streams (ignoring that it is as bad for these as for anything), and this is no doubt true. Why then, I ask again, do they continue to tout Ogg as the one-size-fits-all solution they already admitted it is not?</p>
<p>More commonly, the Ogg proponents will respond with hand-waving arguments best summarised as <em id="notbad" onmouseover="document.getElementById('notbad').textContent='Ogg isn\'t dead, it\'s just resting'" onmouseout="document.getElementById('notbad').textContent='Ogg isn\'t bad, it\'s just different'">Ogg isn&#8217;t bad, it&#8217;s just different</em>. My reply to this assertion is twofold:</p>
<ul>
<li>Being too different <em>is</em> bad. We live in a world where multimedia files come in many varieties, and a decent media player will need to handle the majority of them. Fortunately, most multimedia file formats share some basic traits, and they can easily be processed in the same general framework, the specifics being taken care of at the input stage. A format deviating too far from the standard model becomes problematic.</li>
<li>Ogg <em>is</em> bad. When every angle of examination reveals serious flaws, bad is the only fitting description.</li>
</ul>
<p>The third reaction bypasses all technical analysis: <em>Ogg is patent-free</em>, a claim I am not qualified to directly discuss. Assuming it is true, it still does not alter the fact that <u>Ogg is a bad format</u>. Being free from patents does not magically make Ogg a good choice as a file format. If all the standard formats are indeed covered by patents, the only proper solution is to design a new, good format which is not, this time hopefully avoiding the old mistakes.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/03/03/ogg-objections/feed/</wfw:commentRss>
		<slash:comments>58</slash:comments>
		</item>
		<item>
		<title>Cat pictures</title>
		<link>http://hardwarebug.org/2010/02/21/cat-pictures/</link>
		<comments>http://hardwarebug.org/2010/02/21/cat-pictures/#comments</comments>
		<pubDate>Sun, 21 Feb 2010 20:19:07 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=362</guid>
		<description><![CDATA[It has come to my attention that this blog suffers a complete lack of the single most important thing on the Internet: cat pictures.  Here is a feeble attempt to remedy this most shocking of shortfalls.




]]></description>
			<content:encoded><![CDATA[<p>It has come to my attention that this blog suffers a complete lack of the single most important thing on the Internet: <b>cat pictures</b>.  Here is a feeble attempt to remedy this most shocking of shortfalls.</p>
<p><a href="http://hardwarebug.org/wp-content/uploads/2010/02/dsc00508.jpg"><img src="http://hardwarebug.org/wp-content/uploads/2010/02/dsc00508-300x184.jpg" alt="" title="Kitten" width="300" height="184" class="aligncenter size-medium wp-image-363" /></a><br />
<span id="more-362"></span><br />
<a href="http://hardwarebug.org/wp-content/uploads/2010/02/img_1195.jpg"><img src="http://hardwarebug.org/wp-content/uploads/2010/02/img_1195-300x200.jpg" alt="" title="Cat" width="300" height="200" class="aligncenter size-medium wp-image-367" /></a></p>
<p><a href="http://hardwarebug.org/wp-content/uploads/2010/02/img_1186.jpg"><img src="http://hardwarebug.org/wp-content/uploads/2010/02/img_1186-300x204.jpg" alt="" title="Another cat" width="300" height="204" class="aligncenter size-medium wp-image-368" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/02/21/cat-pictures/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>1080p video on Beagle</title>
		<link>http://hardwarebug.org/2010/02/10/1080p-video-on-beagle/</link>
		<comments>http://hardwarebug.org/2010/02/10/1080p-video-on-beagle/#comments</comments>
		<pubDate>Wed, 10 Feb 2010 01:22:35 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Multimedia]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=299</guid>
		<description><![CDATA[FOSDEM 2010 is over, and I&#8217;d like to call it a success for FFmpeg.  The 11-man strong delegation showed the stunned audience a smashing demo featuring a Beagle-powered video wall. It looked like this:

The people who pulled this off look like this:

]]></description>
			<content:encoded><![CDATA[<p>FOSDEM 2010 is over, and I&#8217;d like to call it a success for <a href="http://ffmpeg.org/">FFmpeg</a>.  The 11-man strong delegation showed the stunned audience a smashing demo featuring a <a href="http://beagleboard.org/">Beagle</a>-powered video wall. It looked like this:</p>
<p><object class="frame-outer  size-full aligncenter" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="384" height="313" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://www.youtube.com/v/9pwUdRKllo0&amp;hl=en_US&amp;fs=1" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="384" height="313" src="http://www.youtube.com/v/9pwUdRKllo0&amp;hl=en_US&amp;fs=1" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p>The people who pulled this off look like this:</p>
<p style="text-align: center;"><a href="http://hardwarebug.org/wp-content/uploads/2010/02/img_3133.jpg"><img class="size-full wp-image-321 aligncenter" title="The FFmpeg team at FOSDEM 2010" src="http://hardwarebug.org/wp-content/uploads/2010/02/img_3133_small.jpg" alt="The FFmpeg team" width="384" height="256" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/02/10/1080p-video-on-beagle/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>IJG swings again, and misses</title>
		<link>http://hardwarebug.org/2010/02/01/ijg-swings-again-and-misses/</link>
		<comments>http://hardwarebug.org/2010/02/01/ijg-swings-again-and-misses/#comments</comments>
		<pubDate>Mon, 01 Feb 2010 16:18:24 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Multimedia]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=234</guid>
		<description><![CDATA[Earlier this month the IJG unleashed version 8 of its ubiquitous libjpeg library on the world. Eager to try out the &#8220;major breakthrough in image coding technology&#8221; promised in the README file accompanying v7, I downloaded the release. A glance at the README file suggests something major indeed is afoot:
Version 8.0 is the first release [...]]]></description>
			<content:encoded><![CDATA[<p>Earlier this month the <a href="http://ijg.org/">IJG</a> unleashed version 8 of its ubiquitous libjpeg library on the world. Eager to try out the &#8220;major breakthrough in image coding technology&#8221; <a href="http://hardwarebug.org/2009/08/04/ijg-is-back/">promised</a> in the README file accompanying v7, I downloaded the release. A glance at the README file suggests something major indeed is afoot:</p>
<blockquote><p>Version 8.0 is the first release of a new generation JPEG standard to overcome the limitations of the original JPEG specification.</p></blockquote>
<p>The text also hints at the existence of a <a href="http://jpegclub.org/temp/ITU-T-JPEG-Plus-Proposal_R3.doc">document</a> detailing these marvellous new features, and a Google search later a copy has found its way onto my monitor. As I read, however, my state of mind shifts from an initial excited curiosity, through bewilderment and disbelief, finally arriving at pure merriment.<br />
<span id="more-234"></span><br />
Already on the first page it becomes clear no new JPEG standard in fact exists. All we have is an unsolicited proposal sent to the ITU-T by members of the IJG. Realising that even the most brilliant of inventions must start off as mere proposals, I carry on reading. The summary informs me that I am about to witness the introduction of three extensions to the T.81 JPEG format:</p>
<ol>
<li>An alternative coefficient scan sequence for DCT coefficient serialization</li>
<li>A SmartScale extension in the Start-Of-Scan (SOS) marker segment</li>
<li>A Frame Offset definition in or in addition to the Start-Of-Frame (SOF) marker segment</li>
</ol>
<p>Together these three extensions will, it is promised, &#8220;bring DCT based JPEG back to the forefront of state-of-the-art image coding technologies.&#8221;</p>
<h3>Alternative scan</h3>
<p>The first of the proposed extensions introduces an alternative DCT coefficient scan sequence to be used in place of the zigzag scan employed in most block transform based codecs.</p>
<div id="attachment_239" class="wp-caption aligncenter" style="width: 335px"><img class="size-full wp-image-239" title="jpeg8-alt-scan" src="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-alt-scan.png" alt="Alternative scan sequence" width="325" height="256" /><p class="wp-caption-text">Alternative scan sequence</p></div>
<p>The advantage of this scan would be that combined with the existing progressive mode, it simplifies decoding of an initial low-resolution image which is enhanced through subsequent passes. The author of the document calls this scheme &#8220;image-pyramid/hierarchical multi-resolution coding.&#8221; It is not immediately obvious to me how this constitutes even a small advance in image coding technology.</p>
<p>At this point I am beginning to suspect that our friend from the IJG has been trapped in a half-world between interlaced GIF images transmitted down noisy phone lines and today&#8217;s inferno of SVC, MVC, and other buzzwords.</p>
<h3>(Not so) SmartScale</h3>
<p>Disguised behind this camel-cased moniker we encounter a method which, we are told, will provide better image quality at high compression ratios. The author has combined two well-known (to us) properties in a (to him) clever way.</p>
<p>The first property concerns the perceived impact of different types of distortion in an image. When encoding with JPEG, as the quantiser is increased, the decoded image becomes ever more blocky. At a certain point, a better subjective visual quality can be achieved by down-sampling the image before encoding it, thus allowing a lower quantiser to be used. If the decoded image is scaled back up to the original size, the unpleasant, blocky appearance is replaced with a smooth blur.</p>
<p>The second property belongs to the DCT where, as we all know, the top-left (DC) coefficient is the average of the entire block, its neighbours represent the lowest frequency components etc. A top-left-aligned subset of the coefficient block thus represents a low-resolution version of the full block in the spatial domain.</p>
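<p>This property is easily demonstrated with a small Python sketch. It uses an orthonormal DCT-II written out in plain Python, keeps the top-left coefficients of an 8&#215;8 block, and inverse-transforms them at the smaller size. The function names and the rescaling by <code>size / n</code> are my own choices for illustration; this is not the libjpeg implementation.</p>

```python
import math


def basis(k, i, n):
    """Orthonormal DCT-II basis function k evaluated at sample i of n."""
    scale = math.sqrt((1 if k == 0 else 2) / n)
    return scale * math.cos(math.pi * (i + 0.5) * k / n)


def dct_1d(samples):
    n = len(samples)
    return [sum(samples[i] * basis(k, i, n) for i in range(n)) for k in range(n)]


def idct_1d(coeffs):
    n = len(coeffs)
    return [sum(coeffs[k] * basis(k, i, n) for k in range(n)) for i in range(n)]


def apply_2d(block, transform):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform(list(row)) for row in block]
    cols = [transform(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def downscale_block(block, size):
    """Downscale an n x n block by keeping its lowest size x size coefficients."""
    n = len(block)
    coeffs = apply_2d(block, dct_1d)
    # Keep the top-left coefficients; rescale so the block mean is preserved.
    kept = [[coeffs[u][v] * size / n for v in range(size)] for u in range(size)]
    return apply_2d(kept, idct_1d)


ramp = [[10 * x for x in range(8)] for _ in range(8)]
dc = apply_2d(ramp, dct_1d)[0][0]
small = downscale_block(ramp, 4)
print(round(dc, 6), [round(value, 2) for value in small[0]])
```

<p>With this normalisation the DC coefficient is eight times the block mean, and the 4&#215;4 result has the same mean as the original block. Each block is processed in isolation, with no knowledge of its neighbours.</p>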
<p>In his flash of genius, our hero came up with the idea of using the DCT for down-scaling the image. Unfortunately, he appears to possess precious little knowledge of sampling theory and human visual perception. Any block-based resampling will inevitably produce sharp artefacts along the block edges. The human visual system is particularly sensitive to sharp edges, so this is one of the most unwanted types of distortion in an encoded image.</p>
<p>Despite the obvious flaws in this approach, I decided to give it a try. After all, the software is already written, allowing downscaling by factors from 8/8 to 8/16.</p>
<p>Using a 1280&#215;720 <a href="http://hardwarebug.org/wp-content/uploads/2010/02/parkrun.png">test image</a>, I encoded it with each of the nine scaling options, from unity to half size, each time adjusting the quality parameter for a final encoded file size of no more than 200000 bytes. The following table presents the encoded file size, the libjpeg quality parameter used, and the <a href="http://en.wikipedia.org/wiki/Structural_similarity">SSIM</a> metric for each of the images.</p>
<table width="50%" align="center">
<tbody>
<tr style="text-align: left">
<th>Scale</th>
<th>Size</th>
<th>Quality</th>
<th>SSIM</th>
</tr>
<tr>
<td>8/8</td>
<td>198462</td>
<td>59</td>
<td>0.940</td>
</tr>
<tr>
<td>8/9</td>
<td>196337</td>
<td>70</td>
<td>0.936</td>
</tr>
<tr>
<td>8/10</td>
<td>196133</td>
<td>79</td>
<td>0.934</td>
</tr>
<tr>
<td>8/11</td>
<td>197179</td>
<td>84</td>
<td>0.927</td>
</tr>
<tr>
<td>8/12</td>
<td>193872</td>
<td>89</td>
<td>0.915</td>
</tr>
<tr>
<td>8/13</td>
<td>197153</td>
<td>92</td>
<td>0.914</td>
</tr>
<tr>
<td>8/14</td>
<td>188334</td>
<td>94</td>
<td>0.899</td>
</tr>
<tr>
<td>8/15</td>
<td>198911</td>
<td>96</td>
<td>0.886</td>
</tr>
<tr>
<td>8/16</td>
<td>197190</td>
<td>97</td>
<td>0.869</td>
</tr>
</tbody>
</table>
<p>Although the smaller images allowed a higher quality setting to be used, the SSIM value drops significantly. Numbers may of course be misleading, but the images below speak for themselves. These are cut-outs from the full image, the original on the left, unscaled JPEG-compressed in the middle, and JPEG with 8/16 scaling to the right.</p>
<div style="text-align: center;"><a href="http://hardwarebug.org/wp-content/uploads/2010/02/parkrun-cut1.png"><img class="alignnone size-full wp-image-246" title="Original" src="http://hardwarebug.org/wp-content/uploads/2010/02/parkrun-cut1.png" alt="" width="128" height="140" /></a><a href="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-88-cut1.png"><img class="alignnone size-full wp-image-247" title="No scaling" src="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-88-cut1.png" alt="" width="128" height="140" /></a><a href="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-816-cut1.png"><img class="alignnone size-full wp-image-244" title="Scale 8/16" src="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-816-cut1.png" alt="" width="128" height="140" /></a></div>
<div style="text-align: center;"><a href="http://hardwarebug.org/wp-content/uploads/2010/02/parkrun-cut2.png"><img class="alignnone size-full wp-image-255" title="Original" src="http://hardwarebug.org/wp-content/uploads/2010/02/parkrun-cut2.png" alt="" width="128" height="140" /></a><a href="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-88-cut2.png"><img class="alignnone size-full wp-image-256" title="No scaling" src="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-88-cut2.png" alt="" width="128" height="140" /></a><a href="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-816-cut2.png"><img class="alignnone size-full wp-image-257" title="Scale 8/16" src="http://hardwarebug.org/wp-content/uploads/2010/02/jpeg8-816-cut2.png" alt="" width="128" height="140" /></a></div>
<p>Looking at these images, I do not need to hesitate before picking the JPEG variant I prefer.</p>
<h3>Frame offset</h3>
<p>The third and final extension proposed is quite simple and also quite pointless: a top-left cropping to be applied to the decoded image. The alleged utility of this feature would be to enable lossless cropping of a JPEG image. In a typical image workflow, however, JPEG is only used for the final published version, so the need for this feature appears quite far-fetched.</p>
<h3>The grand finale</h3>
<p>Throughout the text, the author makes references to &#8220;the fundamental DCT property for image representation.&#8221; In his own words:</p>
<blockquote><p>This property was found by the author during implementation of the new DCT scaling features and is after his belief one of the most important discoveries in digital image coding after releasing the JPEG standard in 1992.</p></blockquote>
<p>The secret is to be revealed in an annex to the main text. This annex quotes in full a post by the author to the comp.dsp Usenet group in a thread with the subject <a href="http://groups.google.com/group/comp.dsp/browse_frm/thread/4eb02c1d668daa7d">why DCT</a>. Reading the entire thread proves quite amusing. A few excerpts follow.</p>
<blockquote>
<div class="frame-outer small">
<div style="text-align: left;">The actual reason is much simpler, and therefore apparently very difficult to recognize by complicated-thinking people.<br/><br />
Here is the explanation:<br/><br />
What are people doing when they have a bunch of images and want a quick preview?  They use thumbnails!  What are thumbnails? Thumbnails are small downscaled versions of the original image! If you want more details of the image, you can zoom in stepwise by enlarging (upscaling) the image.</div>
</div>
</blockquote>
<blockquote>
<div class="frame-outer small">
<div style="text-align: left;">So with proper understanding of the fundamental DCT property, the MPEG folks could make their videos more scalable, but, as in the case of JPEG, they are unable to recognize this simple but basic property, unfortunately, and pursue rather inferior approaches in actual developments.</div>
</div>
</blockquote>
<blockquote>
<div class="frame-outer small">
<div style="text-align: left;">These are just phrases, and they don&#8217;t explain anything. But this is typical for the current state in this field: The relevant people ignore and deny the true reasons, and thus they turn in a circle and no progress is being made.</div>
</div>
</blockquote>
<blockquote>
<div class="frame-outer small">
<div style="text-align: left;">However, there are dark forces in action today which ignore and deny any fruitful advances in this field. That is the reason that we didn&#8217;t see any progress in JPEG for more than a decade, and as long as those forces dominate, we will see more confusion and less enlightenment. The truth is always simple, and the DCT *is* simple, but this fact is suppressed by established people who don&#8217;t want to lose their dubious position.</div>
</div>
</blockquote>
<p>I believe a trip to the <a href="http://en.wikipedia.org/wiki/Technology_in_The_Hitchhiker%27s_Guide_to_the_Galaxy#Total_Perspective_Vortex">Total Perspective Vortex</a> may be in order. Perhaps his tin-foil hat will save him.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/02/01/ijg-swings-again-and-misses/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bit-field badness</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/</link>
		<comments>http://hardwarebug.org/2010/01/30/bit-field-badness/#comments</comments>
		<pubDate>Sat, 30 Jan 2010 16:15:05 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>
		<category><![CDATA[Optimisation]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=230</guid>
		<description><![CDATA[Consider the following C code which is based on a real-world situation.

struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
    [...]]]></description>
			<content:encoded><![CDATA[<p>Consider the following C code which is based on a real-world situation.</p>
<blockquote>
<pre>struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i &lt; n);
}
</pre>
</blockquote>
<p>How would we best write this in ARM assembler? This is how I would do it:<br />
<span id="more-230"></span></p>
<blockquote>
<pre>func:
        ldr     r3,  [r0], #4
        tst     r3,  #1
        add     r3,  r3,  r2,  lsl #1
        strne   r3,  [r0, #-4]
        subs    r1,  r1,  #1
        bgt     func
        bx      lr
</pre>
</blockquote>
<p>The <code>add</code> instruction is unconditional to avoid a dependency on the comparison. Unrolling the loop would mask the latency of the <code>ldr</code> instruction as well, but that is outside the scope of this experiment.</p>
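<p>For readers who prefer C to assembler, the same idea can be sketched by treating each struct as a plain 32-bit word. The function name <code>func_word</code> is made up for this illustration, and the code assumes the layout the assembler relies on: field <code>a</code> in bit 0 and field <code>b</code> in bits 1 to 31.</p>

```c
#include <stdint.h>

/* Word-based equivalent of func(), assuming field a occupies bit 0 and
 * field b bits 1..31 of each 32-bit word. */
void func_word(uint32_t *p, int n, int a)
{
    int i = 0;
    do {
        uint32_t w = p[i];                 /* single load of the whole word */
        if (w & 1)                         /* test field a */
            p[i] = w + ((uint32_t)a << 1); /* b += a; bit 0 is unaffected */
    } while (++i < n);
}
```

<p>Bit-field layout is implementation-defined, so this is a model of the intended machine code rather than a portable replacement for the original function.</p>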
<p>Now compile this code with <code>gcc -march=armv5te -O3</code> and watch in horror:</p>
<blockquote>
<pre>func:
        push    {r4}
        mov     ip, #0
        mov     r4, r2
loop:
        ldrb    r3, [r0]
        add     ip, ip, #1
        tst     r3, #1
        ldrne   r3, [r0]
        andne   r2, r3, #1
        addne   r3, r4, r3, lsr #1
        orrne   r2, r2, r3, lsl #1
        strne   r2, [r0]
        cmp     ip, r1
        add     r0, r0, #4
        blt     loop
        pop     {r4}
        bx      lr
</pre>
</blockquote>
<p>This is nothing short of awful:</p>
<ul>
<li>The same value is loaded from memory twice.</li>
<li>A complicated mask/shift/or operation is used where a simple shifted add would suffice.</li>
<li>Write-back addressing is not used.</li>
<li>The loop control counts up and compares instead of counting down.</li>
<li>Useless <code>mov</code> in the prologue; swapping the roles of <code>r2</code> and <code>r4</code> would avoid this.</li>
<li>Using <code>lr</code> in place of <code>r4</code> would allow the return to be done with <code>pop {pc}</code>, saving one instruction (ignoring for the moment that no callee-saved registers are needed at all).</li>
</ul>
<p>Even for this trivial function the gcc-generated code is more than twice the optimal size and slower by approximately the same factor.</p>
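<p>To see why the shifted add suffices, the two update strategies can be modelled in C on a single word. The function names are invented for this sketch; the comments map each line to the corresponding instruction in the listings above.</p>

```c
#include <stdint.h>

/* What the gcc output does: split the word, add to b, reassemble. */
uint32_t update_mask_shift_or(uint32_t w, uint32_t a)
{
    uint32_t flag = w & 1;          /* andne r2, r3, #1         */
    uint32_t b    = a + (w >> 1);   /* addne r3, r4, r3, lsr #1 */
    return flag | (b << 1);         /* orrne r2, r2, r3, lsl #1 */
}

/* What the hand-written version does: one shifted add. */
uint32_t update_shifted_add(uint32_t w, uint32_t a)
{
    return w + (a << 1);            /* add r3, r3, r2, lsl #1 */
}
```

<p>Both return the same value for every input: the shifted addend has a zero in bit 0, so the flag bit is never disturbed, and any carry out of bit 31 is discarded in either case.</p>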
<p>The main issue I wanted to illustrate is the poor handling of bit-fields by gcc. When accessing bit-fields from memory, gcc issues a separate load for each field even when they are contained in the same aligned memory word. Although each load after the first will most likely hit L1 cache, this is still bad for several reasons:</p>
<ul>
<li>Loads typically have a result latency of two or three cycles, compared to one cycle for data processing instructions. Any bit-field can be extracted from a register with two shifts, and on ARM the second of these can generally be achieved using a shifted second operand to a following instruction. The ARMv6T2 instruction set also adds the <code>SBFX</code> and <code>UBFX</code> instructions for extracting any signed or unsigned bit-field in one cycle.</li>
<li>Most CPUs have more data processing units than load/store units. It is thus more likely for an ALU instruction than a load/store to issue without delay on a superscalar processor.</li>
<li>Redundant memory accesses can trigger early flushing of store buffers rendering these less efficient.</li>
</ul>
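<p>The two-shift extraction mentioned in the first point can be sketched in C as follows. The function names are made up, and the signed variant assumes that converting to <code>int32_t</code> wraps and that right-shifting a negative value is an arithmetic shift, as is the case with gcc.</p>

```c
#include <stdint.h>

/* Extract an unsigned field of `width` bits starting at bit `lsb`.
 * Requires width >= 1 and lsb + width <= 32. */
uint32_t extract_unsigned(uint32_t w, unsigned lsb, unsigned width)
{
    /* Shift the field to the top of the word, then back down to bit 0. */
    return (w << (32 - lsb - width)) >> (32 - width);
}

/* Signed variant: the second shift is arithmetic, replicating the
 * top bit of the field (assumes gcc-style signed shift behaviour). */
int32_t extract_signed(uint32_t w, unsigned lsb, unsigned width)
{
    return (int32_t)(w << (32 - lsb - width)) >> (32 - width);
}
```

<p>The left shift discards the bits above the field, and the right shift discards those below it while zero- or sign-extending the result.</p>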
<p>No gcc bashing is complete without a comparison with another compiler, so without further ado, here is the ARM RVCT output (<code>armcc --cpu 5te -O3</code>):</p>
<blockquote>
<pre>func:
        mov     r3, #0
        push    {r4, lr}
loop:
        ldr     ip, [r0, r3, lsl #2]
        tst     ip, #1
        addne   ip, ip, r2, lsl #1
        strne   ip, [r0, r3, lsl #2]
        add     r3, r3, #1
        cmp     r3, r1
        blt     loop
        pop     {r4, pc}
</pre>
</blockquote>
<p>This is much better, the core loop using only one instruction more than my version. The loop control is counting up, but at least this register is reused as offset for the memory accesses. More remarkable is the push/pop of two registers that are never used. I had not expected to see this from RVCT.</p>
<p>Even the best compilers are still no match for a human.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/30/bit-field-badness/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>ARM compiler update</title>
		<link>http://hardwarebug.org/2010/01/15/arm-compiler-update/</link>
		<comments>http://hardwarebug.org/2010/01/15/arm-compiler-update/#comments</comments>
		<pubDate>Fri, 15 Jan 2010 18:48:38 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=228</guid>
		<description><![CDATA[Since my last shootout,  all the tested vendors have updated their compilers. Here is a quick update on each of them.
Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to 4.3.4 and 4.4.2, respectively. Neither update contains anything particularly noteworthy.
The CodeSourcery 2009q3 release sees an update to a GCC [...]]]></description>
			<content:encoded><![CDATA[<p>Since my <a href="http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/">last shootout</a>,  all the tested vendors have updated their compilers. Here is a quick update on each of them.</p>
<p>Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to <a href="http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&amp;resolution=FIXED&amp;target_milestone=4.3.4">4.3.4</a> and <a href="http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&amp;resolution=FIXED&amp;target_milestone=4.4.2">4.4.2</a>, respectively. Neither update contains anything particularly noteworthy.</p>
<p>The CodeSourcery 2009q3 release sees an update to a GCC 4.4 base, a significant change from the 4.3 base used in 2009q1. The update is a mixed blessing. In fact, it is mostly a curse and hardly a blessing at all. On the bright side, the floating-point speed regressions in 2009q1 are gone, 2009q3 being a few per cent faster even than 2007q3. Unfortunately, this improvement is completely overshadowed by a major speed regression on integer code, a whopping 24% in one case. This ties in with the slowdown previously observed with FSF GCC 4.4 compared to 4.3.</p>
<p>ARM RVCT 4.0 is now at Build 697. This update fixes some bugs and introduces others. Notably, it no longer builds FFmpeg correctly. The issue has been reported to ARM.</p>
<p>Texas Instruments, finally, have made a formal release, v4.6.1, of their TMS470 compiler incorporating various fixes allowing it to build a moderately patched FFmpeg. The performance remains somewhere between GCC and RVCT on average.</p>
<p>In light of the above, my recommendations remain unchanged:</p>
<ul>
<li>For a free compiler, choose CodeSourcery 2009q1. It beats GCC 4.3.4 by 5-10% in most cases.</li>
<li>GNU purists are best served by GCC 4.3.4, which is up to 20% faster than 4.4.2 and rarely slower.</li>
<li>When price is not a concern, ARM RVCT is a good option, outperforming GCC by up to a factor of 2.</li>
<li>In all cases, disable any auto-vectorisation features.</li>
</ul>
<p>Regardless of which compiler is chosen, I cannot overstress the importance of testing. All compilers are crawling with bugs, and even the most innocent-looking code change can trigger one of them. When using a compiler other than GCC, extra caution is advised considering a lot of code is developed using only GCC and may thus fall prey to bugs unique to said other compiler.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/15/arm-compiler-update/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Beware the builtins</title>
		<link>http://hardwarebug.org/2010/01/14/beware-the-builtins/</link>
		<comments>http://hardwarebug.org/2010/01/14/beware-the-builtins/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 01:02:27 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=215</guid>
		<description><![CDATA[GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various [...]]]></description>
			<content:encoded><![CDATA[<p>GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various targets.</p>
<p>For my test, I selected the following functions:</p>
<ul>
<li><code>__builtin_bswap32</code>: Byte-swap a 32-bit word.</li>
<li><code>__builtin_bswap64</code>: Byte-swap a 64-bit word.</li>
<li><code>__builtin_clz</code>: Count leading zeros in a word.</li>
<li><code>__builtin_ctz</code>: Count trailing zeros in a word.</li>
<li><code>__builtin_prefetch</code>: Prefetch data into cache.</li>
</ul>
<p>To test the quality of these builtins, I wrapped each in a normal function, then compiled the code for these targets:</p>
<ul>
<li>ARMv7</li>
<li>AVR32</li>
<li>MIPS</li>
<li>MIPS64</li>
<li>PowerPC</li>
<li>PowerPC64</li>
<li>x86</li>
<li>x86_64</li>
</ul>
<p>In all cases the compiler flags used were <code>-O3 -fomit-frame-pointer</code> plus any flags required to select a modern CPU model.<br />
<span id="more-215"></span></p>
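<p>The test file amounts to something like the sketch below. The wrapper names are hypothetical; only the <code>__builtin_*</code> functions themselves come from GCC.</p>

```c
#include <stdint.h>

/* Hypothetical wrapper names; each function exposes one builtin so that
 * its generated code can be inspected in isolation. */
uint32_t test_bswap32(uint32_t x) { return __builtin_bswap32(x); }
uint64_t test_bswap64(uint64_t x) { return __builtin_bswap64(x); }
int test_clz(unsigned x) { return __builtin_clz(x); }   /* undefined for x == 0 */
int test_ctz(unsigned x) { return __builtin_ctz(x); }   /* undefined for x == 0 */
void test_prefetch(const void *p) { __builtin_prefetch(p); }
```

<p>Adding <code>-S</code> to the compiler flags leaves the generated assembler in a file for inspection.</p>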
<h3>ARM</h3>
<p>Both  <code>__builtin_clz</code> and <code>__builtin_prefetch</code> generate the expected <code>CLZ</code> and <code>PLD</code> instructions respectively. The code for <code>__builtin_ctz</code> is reasonable for ARMv6 and earlier:</p>
<blockquote>
<pre>rsb     r3, r0, #0
and     r0, r3, r0
clz     r0, r0
rsb     r0, r0, #31
</pre>
</blockquote>
<p>For ARMv7 (in fact v6T2), however, using the new bit-reversal instruction would have been better:</p>
<blockquote>
<pre>rbit    r0, r0
clz     r0, r0
</pre>
</blockquote>
<p>I suspect this is simply a matter of the function not yet having been updated for ARMv7, which is perhaps even excusable given the relatively rare use cases for it.</p>
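<p>Both sequences can be modelled in C to show that they compute the same thing. This is a sketch with made-up function names; <code>rbit32</code> is a software stand-in for the <code>RBIT</code> instruction, and the argument is assumed to be non-zero and 32 bits wide.</p>

```c
#include <stdint.h>

/* Isolate the lowest set bit, then locate it with clz (x must be non-zero). */
int ctz_via_clz(uint32_t x)
{
    return 31 - __builtin_clz(x & (0u - x));   /* rsb, and, clz, rsb */
}

/* Software stand-in for the RBIT instruction: reverse all 32 bits. */
uint32_t rbit32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);
    x = ((x >> 8) & 0x00ff00ffu) | ((x & 0x00ff00ffu) << 8);
    return (x >> 16) | (x << 16);
}

/* After reversal the trailing zeros have become leading zeros
 * (x must be non-zero). */
int ctz_via_rbit(uint32_t x)
{
    return __builtin_clz(rbit32(x));           /* rbit, clz */
}
```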
<p>The byte-reversal functions are where it gets shocking. Rather than use the <code>REV</code> instruction found from ARMv6 on, both of them generate external calls to <code>__bswapsi2</code> and <code>__bswapdi2</code> in libgcc, which is plain C code:</p>
<blockquote>
<pre>SItype
__bswapsi2 (SItype u)
{
  return ((((u) &amp; 0xff000000) &gt;&gt; 24)
          | (((u) &amp; 0x00ff0000) &gt;&gt;  8)
          | (((u) &amp; 0x0000ff00) &lt;&lt;  8)
          | (((u) &amp; 0x000000ff) &lt;&lt; 24));
}

DItype
__bswapdi2 (DItype u)
{
   return ((((u) &amp; 0xff00000000000000ull) &gt;&gt; 56)
          | (((u) &amp; 0x00ff000000000000ull) &gt;&gt; 40)
          | (((u) &amp; 0x0000ff0000000000ull) &gt;&gt; 24)
          | (((u) &amp; 0x000000ff00000000ull) &gt;&gt;  8)
          | (((u) &amp; 0x00000000ff000000ull) &lt;&lt;  8)
          | (((u) &amp; 0x0000000000ff0000ull) &lt;&lt; 24)
          | (((u) &amp; 0x000000000000ff00ull) &lt;&lt; 40)
          | (((u) &amp; 0x00000000000000ffull) &lt;&lt; 56));
}
</pre>
</blockquote>
<p>While the 32-bit version compiles to a reasonable-looking shift/mask/or job, the 64-bit one is a real WTF. Brace yourselves:</p>
<blockquote>
<pre>push    {r4, r5, r6, r7, r8, r9, sl, fp}
mov     r5, #0
mov     r6, #65280      ; 0xff00
sub     sp, sp, #40     ; 0x28
and     r7, r0, r5
and     r8, r1, r6
str     r7, [sp, #8]
str     r8, [sp, #12]
mov     r9, #0
mov     r4, r1
and     r5, r0, r9
mov     sl, #255        ; 0xff
ldr     r9, [sp, #8]
and     r6, r4, sl
mov     ip, #16711680   ; 0xff0000
str     r5, [sp, #16]
str     r6, [sp, #20]
lsl     r2, r0, #24
and     ip, ip, r1
lsr     r7, r4, #24
mov     r1, #0
lsr     r5, r9, #24
mov     sl, #0
mov     r9, #-16777216  ; 0xff000000
and     fp, r0, r9
lsr     r6, ip, #8
orr     r9, r7, r1
and     ip, r4, sl
orr     sl, r1, r2
str     r6, [sp]
str     r9, [sp, #32]
str     sl, [sp, #36]   ; 0x24
add     r8, sp, #32
ldm     r8, {r7, r8}
str     r1, [sp, #4]
ldm     sp, {r9, sl}
orr     r7, r7, r9
orr     r8, r8, sl
str     r7, [sp, #32]
str     r8, [sp, #36]   ; 0x24
mov     r3, r0
mov     r7, #16711680   ; 0xff0000
mov     r8, #0
and     r9, r3, r7
and     sl, r4, r8
ldr     r0, [sp, #16]
str     fp, [sp, #24]
str     ip, [sp, #28]
stm     sp, {r9, sl}
ldr     r7, [sp, #20]
ldr     sl, [sp, #12]
ldr     fp, [sp, #12]
ldr     r8, [sp, #28]
lsr     r0, r0, #8
orr     r7, r0, r7, lsl #24
lsr     r6, sl, #24
orr     r5, r5, fp, lsl #8
lsl     sl, r8, #8
mov     fp, r7
add     r8, sp, #32
ldm     r8, {r7, r8}
orr     r6, r6, r8
ldr     r8, [sp, #20]
ldr     r0, [sp, #24]
orr     r5, r5, r7
lsr     r8, r8, #8
orr     sl, sl, r0, lsr #24
mov     ip, r8
ldr     r0, [sp, #4]
orr     fp, fp, r5
ldr     r5, [sp, #24]
orr     ip, ip, r6
ldr     r6, [sp]
lsl     r9, r5, #8
lsl     r8, r0, #24
orr     fp, fp, r9
lsl     r3, r3, #8
orr     r8, r8, r6, lsr #8
orr     ip, ip, sl
lsl     r7, r6, #24
and     r5, r3, #16711680       ; 0xff0000
orr     r7, r7, fp
orr     r8, r8, ip
orr     r4, r1, r7
orr     r5, r5, r8
mov     r9, r6
mov     r1, r5
mov     r0, r4
add     sp, sp, #40     ; 0x28
pop     {r4, r5, r6, r7, r8, r9, sl, fp}
bx      lr
</pre>
</blockquote>
<p>That&#8217;s right, 91 instructions to move 8 bytes around a bit. GCC definitely has a problem with 64-bit numbers. It is perhaps worth noting that the <code>bswap_64</code> macro in glibc splits the 64-bit value into 32-bit halves which are then reversed independently, thus side-stepping this weakness of gcc.</p>
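<p>The split-into-halves approach looks roughly like the following. This is my own sketch of the idea with invented names, not the actual glibc macro.</p>

```c
#include <stdint.h>

/* Plain shift/mask/or reversal of one 32-bit word. */
static uint32_t swap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Reverse each 32-bit half independently, then exchange the halves. */
uint64_t swap64_split(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    return ((uint64_t)swap32(lo) << 32) | swap32(hi);
}
```

<p>Written this way, the only 64-bit operations left are the split and the final recombination, which on a 32-bit target are nothing more than register moves.</p>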
<p>As a side note, ARM RVCT (armcc) compiles those functions perfectly into one and two <code>REV</code> instructions, respectively.</p>
<h3>AVR32</h3>
<p>There is not much to report here. The latest gcc version available is 4.2.4, which doesn&#8217;t appear to have the bswap functions. The other three are handled nicely, even using a bit-reverse for <code>__builtin_ctz</code>.</p>
<h3>MIPS / MIPS64</h3>
<p>The situation on MIPS is similar to that on ARM. Both bswap builtins result in external libgcc calls, the rest giving sensible code.</p>
<h3>PowerPC</h3>
<p>I scarcely believe my eyes, but this one is actually not bad. The PowerPC has no byte-reversal instructions, yet someone seems to have taken the time to teach gcc a good instruction sequence for this operation. The PowerPC does have some powerful rotate-and-mask instructions which come in handy here. First the 32-bit version:</p>
<blockquote>
<pre>rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
mr      r3,r0
blr
</pre>
</blockquote>
<p>The 64-bit byte-reversal simply applies the above code on each half of the value:</p>
<blockquote>
<pre>rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
rotlwi  r3,r4,8
rlwimi  r3,r4,24,0,7
rlwimi  r3,r4,24,16,23
mr      r4,r0
blr
</pre>
</blockquote>
<p>Although I haven&#8217;t analysed that code carefully, it looks pretty good.</p>
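<p>A quick way to check the 32-bit sequence is to emulate the instructions in C. The helpers below are a sketch with made-up names; they model only the rotate and insert-under-mask behaviour used here, with mask bits numbered PowerPC-style from the most significant bit.</p>

```c
#include <stdint.h>

/* Rotate left by n bits, 0 <= n <= 31. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Mask covering bits mb..me in PowerPC numbering (bit 0 is the MSB),
 * for mb <= me only. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    return (0xffffffffu >> mb) & (0xffffffffu << (31 - me));
}

/* rlwimi: rotate rs left by sh and insert the masked bits into ra. */
static uint32_t rlwimi(uint32_t ra, uint32_t rs,
                       unsigned sh, unsigned mb, unsigned me)
{
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* The three-instruction sequence from the gcc output. */
uint32_t ppc_bswap32(uint32_t r3)
{
    uint32_t r0 = rotl32(r3, 8);       /* rotlwi r0,r3,8        */
    r0 = rlwimi(r0, r3, 24, 0, 7);     /* rlwimi r0,r3,24,0,7   */
    r0 = rlwimi(r0, r3, 24, 16, 23);   /* rlwimi r0,r3,24,16,23 */
    return r0;
}
```

<p>The initial rotate by 8 puts two of the four bytes in their final positions, and each <code>rlwimi</code> then drops one of the remaining bytes into place.</p>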
<h3>PowerPC64</h3>
<p>Doing 64-bit operations is easier on a 64-bit CPU, right? For you and me perhaps, but not for gcc. Here <code>__builtin_bswap64</code> gives us the now familiar <code>__bswapdi2</code> call, and while not as bad as the ARM version, it is not pretty:</p>
<blockquote>
<pre>rldicr  r0,r3,8,55
rldicr  r10,r3,56,7
rldicr  r0,r0,56,15
rldicl  r11,r3,8,56
rldicr  r9,r3,16,47
or      r11,r10,r11
rldicr  r9,r9,48,23
rldicl  r10,r0,24,40
rldicr  r0,r3,24,39
or      r11,r11,r10
rldicl  r9,r9,40,24
rldicr  r0,r0,40,31
or      r9,r11,r9
rlwinm  r10,r3,0,0,7
rldicl  r0,r0,56,8
or      r0,r9,r0
rldicr  r10,r10,8,55
rlwinm  r11,r3,0,8,15
or      r0,r0,r10
rldicr  r11,r11,24,39
rlwinm  r3,r3,0,16,23
or      r0,r0,r11
rldicr  r3,r3,40,23
or      r3,r0,r3
blr
</pre>
</blockquote>
<p>That is 6 times longer than the (presumably) hand-written 32-bit version.</p>
<h3>x86 / x86_64</h3>
<p>As one might expect, results on x86 are good. All the tested functions use the available special instructions. One word of caution though: the bit-counting instructions are very slow on some implementations, specifically the Atom, AMD chips, and the notoriously slow Pentium4E.</p>
<h3>Conclusion</h3>
<p>In conclusion, I would say gcc builtins can be useful to avoid fragile inline assembler. Before using them, however, one should make sure they are not in fact harmful on the required targets. Not even those builtins mapping directly to CPU instructions can be trusted.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2010/01/14/beware-the-builtins/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>ARM compiler shoot-out, round 2</title>
		<link>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/</link>
		<comments>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/#comments</comments>
		<pubDate>Thu, 20 Aug 2009 20:20:35 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=204</guid>
		<description><![CDATA[In my recent test of ARM compilers, I had to leave out Texas Instruments&#8217; compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/">recent test</a> of ARM compilers, I had to leave out Texas Instruments&#8217; compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present round two in this shoot-out.</p>
<p>The contenders this time were the fastest GCC variant from round one, ARM RVCT, and newcomer TI TMS470. With the same rules as last time, the exact versions and optimisation options were like this:</p>
<ul>
<li><strong>CodeSourcery GCC 2009q1</strong> (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>ARM RVCT 4.0 Build 591</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros</li>
<li><strong>TI TMS470 4.7.0-a9229</strong>, <span>-</span>-float_support=vfpv3 -mv=7a8 -O3 -mf=5</li>
</ul>
<p><span id="more-204"></span><br />
To keep things fair, I also left the vectoriser off with the TI compiler. The table below lists the decoding times for the sample files, this time normalised against the participating GCC compiler. Remember, smaller numbers are better.  Also keep in mind that this test was done with a development snapshot of TMS470, not an approved release.</p>
<table border="0" width="100%">
<col></col>
<col></col>
<col></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<thead>
<tr style="text-align: left;">
<th>Sample name</th>
<th>Codec</th>
<th>Code type</th>
<th>GCC</th>
<th>RVCT</th>
<th>TI</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4">cathedral</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.95</td>
<td>1.02</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4">NeroAVC</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.96</td>
<td>1.05</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/indiana_jones_4-tlr3_h640w.mov">indiana_jones_4</a></td>
<td>H.264 CAVLC</td>
<td>integer</td>
<td>1.00</td>
<td>0.92</td>
<td>1.02</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4">NeroRecodeSample</a></td>
<td>MPEG-4 ASP</td>
<td>integer</td>
<td>1.00</td>
<td>1.01</td>
<td>1.08</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3">Silent_Light</a></td>
<td>MP3</td>
<td>64-bit integer</td>
<td>1.00</td>
<td>0.48</td>
<td>0.72</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/flac/When I Grow Up.flac">When_I_Grow_Up</a></td>
<td>FLAC</td>
<td>integer</td>
<td>1.00</td>
<td>0.87</td>
<td>0.93</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg">Lumme-Badloop</a></td>
<td>Vorbis</td>
<td>float</td>
<td>1.00</td>
<td>0.94</td>
<td>1.05</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/AC3/Canyon-5.1-48khz-448kbit.ac3">Canyon</a></td>
<td>AC-3</td>
<td>float</td>
<td>1.00</td>
<td>0.88</td>
<td>1.01</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/DTS/lotr_5.1_768.dts">lotr</a></td>
<td>DTS</td>
<td>float</td>
<td>1.00</td>
<td>1.00</td>
<td>1.08</td>
</tr>
</tbody>
</table>
<p>Overall, the TI TMS470 compiler comes off slightly worse than GCC. In two cases, however, it was significantly better than GCC, but not as good as RVCT. Incidentally, those were also the ones where RVCT scored the biggest win over GCC.</p>
<p>My conclusions from this test are twofold:</p>
<ul>
<li>ARM&#8217;s own compiler is very hard to beat. They do seem to know how their chips work.</li>
<li>GCC is incredibly bad at 64-bit arithmetic on 32-bit machines.</li>
</ul>
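<p>As an illustration of the kind of 64-bit arithmetic in question, consider the high half of a 32&#215;32-bit multiply, a staple of fixed-point audio code. The function name is made up for this sketch, and I am assuming, without having profiled it, that operations of this sort dominate the MP3 result.</p>

```c
#include <stdint.h>

/* High 32 bits of a signed 32x32 -> 64-bit product. ARM can produce the
 * full product with one SMULL instruction; the shift then merely selects
 * the upper result register. */
int32_t mulh(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

<p>A compiler that handles 64-bit values well reduces this to a single multiply, whereas one that treats the widened operands as generic 64-bit quantities ends up doing far more work.</p>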
<p>The logical next step is to test these compilers with vectorisation enabled. FFmpeg should offer plenty of opportunities for this feature to shine. Unfortunately, that test will have to wait until the RVCT vectoriser is fixed. The current release does not compile FFmpeg with vectorisation enabled.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/20/arm-compiler-shoot-out-round-2/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4" length="24154488" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4" length="6766583" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/indiana_jones_4-tlr3_h640w.mov" length="16215526" type="video/quicktime" />
<enclosure url="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4" length="31027653" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3" length="4206720" type="audio/mpeg" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg" length="5856908" type="audio/ogg" />
		</item>
		<item>
		<title>DRM the Big Blue way</title>
		<link>http://hardwarebug.org/2009/08/10/drm-the-big-blue-way/</link>
		<comments>http://hardwarebug.org/2009/08/10/drm-the-big-blue-way/#comments</comments>
		<pubDate>Mon, 10 Aug 2009 20:35:53 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[Compilers]]></category>
		<category><![CDATA[PowerPC]]></category>
		<category><![CDATA[Reverse engineering]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=179</guid>
		<description><![CDATA[A few months ago, I downloaded an evaluation copy of IBM&#8217;s XLC compiler to try it out on FFmpeg. The trial licence has now expired, so what better way to spend a few minutes than by cracking it?
The installation script, as expected, copied a number of files into a directory under /opt. More unusually, it [...]]]></description>
			<content:encoded><![CDATA[<p>A few months ago, I downloaded an evaluation copy of IBM&#8217;s <a href="http://www-01.ibm.com/software/awdtools/xlcpp/linux/">XLC</a> compiler to try it out on FFmpeg. The trial licence has now expired, so what better way to spend a few minutes than by cracking it?</p>
<p>The installation script, as expected, copied a number of files into a directory under <code>/opt</code>. More unusually, it also created a small shared library, <code>libxlc101e.so.1</code>, and placed it in <code>/usr/lib</code>. No other files from the installation package were modified, so this must be where the licence is hiding. Without further ado, we proceed to take it apart.<br />
<span id="more-179"></span><br />
We begin by looking at the symbol table using <code>readelf -s</code>:</p>
<pre>Symbol table '.symtab' contains 44 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 000000b4     0 SECTION LOCAL  DEFAULT    1
     2: 0000017c     0 SECTION LOCAL  DEFAULT    2
     3: 0000036c     0 SECTION LOCAL  DEFAULT    3
     4: 0000055c     0 SECTION LOCAL  DEFAULT    4
     5: 00000580     0 SECTION LOCAL  DEFAULT    5
     6: 000005c4     0 SECTION LOCAL  DEFAULT    6
     7: 00010a3c     0 SECTION LOCAL  DEFAULT    7
     8: 00010a50     0 SECTION LOCAL  DEFAULT    8
     9: 00010ac8     0 SECTION LOCAL  DEFAULT    9
    10: 00010ad8     0 SECTION LOCAL  DEFAULT   10
    11: 00000000     0 SECTION LOCAL  DEFAULT   11
    12: 00000000     0 SECTION LOCAL  DEFAULT   12
    13: 00000000     0 SECTION LOCAL  DEFAULT   13
    14: 00000000     0 SECTION LOCAL  DEFAULT   14
    15: 00000000     0 FILE    LOCAL  DEFAULT  ABS xleval.c
    16: 00010a50     0 OBJECT  LOCAL  HIDDEN   ABS _DYNAMIC
    17: 00010acc     0 OBJECT  LOCAL  HIDDEN   ABS _GLOBAL_OFFSET_TABLE_
    18: 000005f0    24 OBJECT  GLOBAL DEFAULT    6 xlc_extended_eval_lic_dir
    19: 00000660    22 OBJECT  GLOBAL DEFAULT    6 libxlfextendeval_name
    20: 00000738    42 OBJECT  GLOBAL DEFAULT    6 stm_compiler_name
    21: 00000640    16 OBJECT  GLOBAL DEFAULT    6 libupclicense_name
    22: 00000764    32 OBJECT  GLOBAL DEFAULT    6 xlf_compiler_name
    23: 00000580    68 FUNC    GLOBAL DEFAULT    5 _xlgetevalbeta
    24: 00000678    22 OBJECT  GLOBAL DEFAULT    6 libxlcextendeval_name
    25: 00000608    24 OBJECT  GLOBAL DEFAULT    6 xlf_extended_eval_lic_dir
    26: 000006d8    17 OBJECT  GLOBAL DEFAULT    6 xlcmp_name
    27: 000006c0    12 OBJECT  GLOBAL DEFAULT    6 xlc_package_name
    28: 00000650    16 OBJECT  GLOBAL DEFAULT    6 libstmlicense_name
    29: 000005e0    16 OBJECT  GLOBAL DEFAULT    6 xlf_extend_eval_env_var
    30: 000005d0    16 OBJECT  GLOBAL DEFAULT    6 xlc_extend_eval_env_var
    31: 00000620    16 OBJECT  GLOBAL DEFAULT    6 libxlflicense_name
    32: 00010cd0     0 NOTYPE  GLOBAL DEFAULT  ABS __bss_start
    33: 00010a3c    20 OBJECT  GLOBAL DEFAULT    7 _xlevalbeta
    34: 000005c4    10 OBJECT  GLOBAL DEFAULT    6 liblicense_dir
    35: 00000690    22 OBJECT  GLOBAL DEFAULT    6 libupcextendeval_name
    36: 000006ec    30 OBJECT  GLOBAL DEFAULT    6 xlc_compiler_name
    37: 00000630    16 OBJECT  GLOBAL DEFAULT    6 libxlclicense_name
    38: 00010cd0     0 NOTYPE  GLOBAL DEFAULT  ABS _edata
    39: 00010cd0     0 NOTYPE  GLOBAL DEFAULT  ABS _end
    40: 000006cc    12 OBJECT  GLOBAL DEFAULT    6 xlf_package_name
    41: 000006a8    22 OBJECT  GLOBAL DEFAULT    6 libstmextendeval_name
    42: 00010ad8   504 OBJECT  GLOBAL DEFAULT   10 versionString
    43: 0000070c    42 OBJECT  GLOBAL DEFAULT    6 upc_compiler_name</pre>
<p>Notice the lone function at position 23, <code>_xlgetevalbeta</code>, which we proceed to disassemble:</p>
<pre>00000580 &lt;_xlgetevalbeta&gt;:
 580:   94 21 ff f0     stwu    r1,-16(r1)
 584:   93 c1 00 08     stw     r30,8(r1)
 588:   93 e1 00 0c     stw     r31,12(r1)
 58c:   7c 3f 0b 78     mr      r31,r1
 590:   7d 88 02 a6     mflr    r12
 594:   42 9f 00 05     bcl-    20,4*cr7+so,598 &lt;_xlgetevalbeta+0x18&gt;
 598:   7f c8 02 a6     mflr    r30
 59c:   3f de 00 01     addis   r30,r30,1
 5a0:   3b de 05 34     addi    r30,r30,1332
 5a4:   7d 88 03 a6     mtlr    r12
 5a8:   80 1e ff fc     lwz     r0,-4(r30)
 5ac:   7c 03 03 78     mr      r3,r0
 5b0:   81 61 00 00     lwz     r11,0(r1)
 5b4:   83 cb ff f8     lwz     r30,-8(r11)
 5b8:   83 eb ff fc     lwz     r31,-4(r11)
 5bc:   7d 61 5b 78     mr      r1,r11
 5c0:   4e 80 00 20     blr</pre>
<p>This is fairly standard, unoptimised code. After saving a few registers on the stack, it computes the address of the global offset table: <code>0x598 + 0x10000 + 1332 = 0x10acc</code>, matching <code>_GLOBAL_OFFSET_TABLE_</code> from the symbol table. Next, a value is loaded from the GOT, forming the return value of the function after the stack has been restored.</p>
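<p>The address arithmetic can be double-checked with a few lines of Python. This is only a sketch: the constants are copied from the disassembly above, and the helper name <code>got_address</code> is invented for illustration.</p>

```python
def got_address(link_register, addis_imm, addi_imm):
    """Mirror the addis/addi pair: the addis immediate is shifted left by 16 bits."""
    return link_register + addis_imm * 0x10000 + addi_imm

# The bcl at 0x594 leaves the address of the next instruction, 0x598, in LR.
got = got_address(0x598, 1, 1332)
# lwz r0,-4(r30) then reads the word four bytes below that address.
slot = got - 4
print(hex(got), hex(slot))
```

<p>Under these assumptions the script prints <code>0x10acc 0x10ac8</code>: the GOT symbol address and the slot the function actually loads from.</p>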
<p>To find out what this return value really is, we look at the relocation table (by means of <code>readelf -r</code>):</p>
<pre>Relocation section '.rela.dyn' at offset 0x55c contains 3 entries:
 Offset     Info    Type            Sym.Value  Sym. Name + Addend
00010a40  00000016 R_PPC_RELATIVE                               00000784
00010a44  00000016 R_PPC_RELATIVE                               0000097c
00010ac8  00001414 R_PPC_GLOB_DAT    00010a3c   _xlevalbeta + 0</pre>
<p>The third entry is the one we are looking for: its offset, 0x10ac8, is the location read by the instruction at 0x5a8, four bytes below the GOT address computed above. This means the <code>_xlgetevalbeta</code> function returns a pointer to <code>_xlevalbeta</code>, which makes some kind of sense.</p>
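<p>The Info column packs the symbol index and the relocation type into a single word. As a rough sketch of the ELF32 split (upper 24 bits for the symbol index, low 8 bits for the type), with only the two PowerPC type numbers seen here hard-coded; the function and table names are illustrative, not part of any tool:</p>

```python
# Relocation type numbers from the PowerPC ELF ABI; only the two that occur here.
PPC_RELOC_NAMES = {20: "R_PPC_GLOB_DAT", 22: "R_PPC_RELATIVE"}

def split_r_info(info):
    """Split an ELF32 r_info word into (symbol index, relocation type)."""
    sym_index, rel_type = divmod(info, 0x100)
    return sym_index, rel_type

for info in (0x00000016, 0x00001414):
    sym_index, rel_type = split_r_info(info)
    print(sym_index, PPC_RELOC_NAMES[rel_type])
```

<p>Note that the symbol index counts entries in <code>.dynsym</code>, so it need not equal the position of the symbol in the full symbol table listed earlier.</p>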
<p>Another quick look at the symbol table tells us <code>_xlevalbeta</code> lives at address 0x10a3c and is 20 bytes in size. The section header table (provided by <code>readelf -S</code>) helps us find the corresponding location in the file:</p>
<pre>Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg
  [ 0]                   NULL            00000000 000000 000000 00
  [ 1] .hash             HASH            000000b4 0000b4 0000c8 04   A
  [ 2] .dynsym           DYNSYM          0000017c 00017c 0001f0 10   A
  [ 3] .dynstr           STRTAB          0000036c 00036c 0001ee 00   A
  [ 4] .rela.dyn         RELA            0000055c 00055c 000024 0c   A
  [ 5] .text             PROGBITS        00000580 000580 000044 00  AX
  [ 6] .rodata           PROGBITS        000005c4 0005c4 000475 00   A
  [ 7] .data.rel.ro      PROGBITS        00010a3c 000a3c 000014 00  WA
  [ 8] .dynamic          DYNAMIC         00010a50 000a50 000078 08  WA
  [ 9] .got              PROGBITS        00010ac8 000ac8 000010 04  WA
  [10] .data             PROGBITS        00010ad8 000ad8 0001f8 00  WA
  [11] .comment          PROGBITS        00000000 000cd0 000028 00
  [12] .shstrtab         STRTAB          00000000 000cf8 000073 00
  [13] .symtab           SYMTAB          00000000 000fc4 0002c0 10
  [14] .strtab           STRTAB          00000000 001284 000206 00
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)</pre>
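<p>Translating a virtual address into a file offset is a matter of finding the section that contains it and applying that section's address-to-offset difference. A minimal sketch, using a handful of rows copied from the table above; the <code>SECTIONS</code> list and the <code>file_offset</code> helper are made up for this example:</p>

```python
# (name, virtual address, file offset, size), copied from the readelf -S output.
SECTIONS = [
    (".rodata", 0x000005c4, 0x0005c4, 0x000475),
    (".data.rel.ro", 0x00010a3c, 0x000a3c, 0x000014),
    (".dynamic", 0x00010a50, 0x000a50, 0x000078),
    (".got", 0x00010ac8, 0x000ac8, 0x000010),
    (".data", 0x00010ad8, 0x000ad8, 0x0001f8),
]

def file_offset(address):
    """Return (section name, file offset) for a virtual address, or None."""
    for name, start, offset, size in SECTIONS:
        if address in range(start, start + size):
            return name, address - start + offset
    return None

name, offset = file_offset(0x10a3c)
print(name, hex(offset))
```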
<p>The address we are looking for is at the start of the <code>.data.rel.ro</code> section, which can be found at offset 0xa3c in the file. It is time for the <code>hexdump</code> tool:</p>
<pre>00000a30  69 62 69 74 65 64 2e 00  00 00 00 00 00 00 00 01
00000a40  00 00 00 00 00 00 00 00  00 00 24 05 4a 0c 65 c8</pre>
<p>The last four bytes here, <code>4a 0c 65 c8</code>, are interesting. Taken as a 32-bit big-endian value, they are exactly equal to the modification time of the file, or in other words, the time the compiler was installed. This cannot be a coincidence, so using a hex editor, we replace them with the current time, <code>4a 80 74 31</code>.</p>
<p>Lo and behold, the compiler is working again.</p>
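<p>For readers who want to verify the two byte strings quoted above, the following sketch decodes them as big-endian Unix timestamps. It assumes the values count seconds since 1970 in UTC; the helper name <code>decode_stamp</code> is hypothetical.</p>

```python
from datetime import datetime, timezone

def decode_stamp(hex_bytes):
    """Interpret four hex bytes as a big-endian unsigned 32-bit Unix time."""
    seconds = int.from_bytes(bytes.fromhex(hex_bytes), "big")
    return seconds, datetime.fromtimestamp(seconds, tz=timezone.utc)

for stamp in ("4a 0c 65 c8", "4a 80 74 31"):
    seconds, when = decode_stamp(stamp)
    print(seconds, when.isoformat())
```

<p>Under that assumption the original value falls in May 2009, and the replacement on 10 August 2009.</p>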
<p>One hopes the engineers at IBM developing the compiler are not the same ones thinking this copy protection method was a good idea. Then again, perhaps they are; it failed miserably at compiling FFmpeg.</p>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/10/drm-the-big-blue-way/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM compiler shoot-out</title>
		<link>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/</link>
		<comments>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/#comments</comments>
		<pubDate>Wed, 05 Aug 2009 00:06:06 +0000</pubDate>
		<dc:creator>Mans</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Compilers]]></category>

		<guid isPermaLink="false">http://hardwarebug.org/?p=150</guid>
		<description><![CDATA[A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled FFmpeg using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a Beagle [...]]]></description>
			<content:encoded><![CDATA[<p>A proper comparison of different compilers targeting ARM is long overdue, so I decided to do my part. I compiled <a href="http://ffmpeg.org/">FFmpeg</a> using a selection of compilers, and measured the speed of the result when decoding various media samples. Since we are testing compilers, I disabled all hand-written assembler. The tests were run on a <a href="http://beagleboard.org/">Beagle board</a> clocked at 600 MHz.</p>
<p>These are the compilers I deemed worthy to participate in the test and the optimisation flags I used with each:</p>
<ul>
<li><strong>GCC 4.3.3</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>GCC 4.4.1</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>CodeSourcery GCC 2007q3</strong> (based on 4.2.1), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-tree-vectorize</li>
<li><strong>CodeSourcery GCC 2009q1</strong> (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize</li>
<li><strong>ARM RVCT 4.0 Build 591</strong>, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros</li>
</ul>
<p>I would also have included the ARM compiler from Texas Instruments, had it been able to compile FFmpeg.<br />
<span id="more-150"></span><br />
With sample files chosen to exercise various types of code, the result of the test is, sadly, no surprise. The following table lists the runtimes of the different builds relative to the CodeSourcery 2007q3 build. Lower numbers are better.</p>
<table border="0" width="100%">
<col></col>
<col></col>
<col></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<col width="10%"></col>
<thead>
<tr style="text-align: left;">
<th>Sample name</th>
<th>Codec</th>
<th>Code type</th>
<th>2009q1</th>
<th>4.3.3</th>
<th>4.4.1</th>
<th>RVCT</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4">cathedral</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>0.97</td>
<td>1.02</td>
<td>1.09</td>
<td>0.93</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4">NeroAVC</a></td>
<td>H.264 CABAC</td>
<td>integer</td>
<td>0.98</td>
<td>1.02</td>
<td>1.12</td>
<td>0.95</td>
</tr>
<tr>
<td><a href="http://movies.apple.com/movies/paramount/indiana_jones_4/indiana_jones_4-tlr3_h640w.mov">indiana_jones_4</a></td>
<td>H.264 CAVLC</td>
<td>integer</td>
<td>0.97</td>
<td>1.02</td>
<td>1.09</td>
<td>0.89</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4">NeroRecodeSample</a></td>
<td>MPEG-4 ASP</td>
<td>integer</td>
<td>0.96</td>
<td>1.03</td>
<td>1.27</td>
<td>0.96</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3">Silent_Light</a></td>
<td>MP3</td>
<td>64-bit integer</td>
<td>0.89</td>
<td>0.88</td>
<td>0.97</td>
<td>0.44</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/flac/When I Grow Up.flac">When_I_Grow_Up</a></td>
<td>FLAC</td>
<td>integer</td>
<td>0.98</td>
<td>0.98</td>
<td>0.93</td>
<td>0.86</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg">Lumme-Badloop</a></td>
<td>Vorbis</td>
<td>float</td>
<td>1.03</td>
<td>1.03</td>
<td>1.02</td>
<td>0.97</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/AC3/Canyon-5.1-48khz-448kbit.ac3">Canyon</a></td>
<td>AC-3</td>
<td>float</td>
<td>1.02</td>
<td>1.02</td>
<td>0.99</td>
<td>0.90</td>
</tr>
<tr>
<td><a href="http://samples.ffmpeg.org/A-codecs/DTS/lotr_5.1_768.dts">lotr</a></td>
<td>DTS</td>
<td>float</td>
<td>1.02</td>
<td>1.02</td>
<td>1.00</td>
<td>1.03</td>
</tr>
</tbody>
</table>
<p>Looking at the table, I make these observations:</p>
<ul>
<li>CodeSourcery 2009q1 produces faster integer code, but slower floating-point code, than 2007q3.</li>
<li>GCC 4.4.1 produces much slower code than 4.3.3 in several cases, and is never significantly better.</li>
<li>CodeSourcery GCC generally beats FSF GCC.</li>
<li>ARM RVCT matches or beats every GCC version on every sample except DTS, where it is marginally slower. The MP3 figure is not a typo.</li>
</ul>
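<p>Since the table lists runtimes relative to the 2007q3 baseline, a figure <em>r</em> corresponds to a speed-up factor of 1/<em>r</em>. A small sketch of that conversion for the RVCT column, with the numbers copied from the table; the dictionary and function names are invented for the example:</p>

```python
# RVCT column of the table: runtime relative to the CodeSourcery 2007q3 build.
RVCT_RELATIVE_RUNTIME = {
    "cathedral": 0.93,
    "NeroAVC": 0.95,
    "indiana_jones_4": 0.89,
    "NeroRecodeSample": 0.96,
    "Silent_Light": 0.44,
    "When_I_Grow_Up": 0.86,
    "Lumme-Badloop": 0.97,
    "Canyon": 0.90,
    "lotr": 1.03,
}

def speedup(relative_runtime):
    """A relative runtime r means the build runs 1/r times as fast as the baseline."""
    return 1.0 / relative_runtime

for sample, ratio in RVCT_RELATIVE_RUNTIME.items():
    print(f"{sample:18s} {speedup(ratio):.2f}x")
```

<p>For the MP3 sample this works out to roughly 2.27, meaning the RVCT build decodes it more than twice as fast as the baseline.</p>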
<p>My recommendation for a free compiler is CodeSourcery 2009q1 unless your code makes heavy use of floating-point, in which case 2007q3 may give better results. If you prefer, for whatever reason, official GNU releases, 4.3.3 should be the version of choice. Avoid GCC 4.4.1; it is far too unpredictable.</p>
<h4>Bootnotes</h4>
<ul>
<li>See also Mike&#8217;s <a title="Intel Beats Up GCC" href="http://multimedia.cx/eggs/intel-beats-up-gcc/">test of x86 compilers</a>.</li>
<li>Thanks to ARM for providing the RVCT compiler.</li>
<li>Thanks to TI for providing the Beagle board.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://hardwarebug.org/2009/08/05/arm-compiler-shoot-out/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/cathedral-beta2-400extra-crop-avc.mp4" length="24154488" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/V-codecs/h264/NeroAVC.mp4" length="6766583" type="video/mp4" />
<enclosure url="http://movies.apple.com/movies/paramount/indiana_jones_4/indiana_jones_4-tlr3_h640w.mov" length="16215526" type="video/quicktime" />
<enclosure url="http://samples.ffmpeg.org/MPEG-4/NeroRecodeSample-MP4/NeroRecodeSample.mp4" length="31027653" type="video/mp4" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/MP3/Silent_Light.mp3" length="4206720" type="audio/mpeg" />
<enclosure url="http://samples.ffmpeg.org/A-codecs/vorbis/Lumme-Badloop.ogg" length="5856908" type="audio/ogg" />
		</item>
	</channel>
</rss>
