<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Bit-field badness</title>
	<atom:link href="http://hardwarebug.org/2010/01/30/bit-field-badness/feed/" rel="self" type="application/rss+xml" />
	<link>http://hardwarebug.org/2010/01/30/bit-field-badness/</link>
	<description>Everything is broken</description>
	<lastBuildDate>Mon, 30 Aug 2010 09:33:38 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<item>
		<title>By: mh</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-562</link>
		<dc:creator>mh</dc:creator>
		<pubDate>Wed, 10 Feb 2010 20:40:37 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-562</guid>
		<description>I love replying to myself. It is perhaps interesting to know that RVCT&#039;s loop optimizer seems to interfere with the structure access in this example. Switching back to -O2 -Otime (the loop optimizer is only enabled at -O3, AFAIK) gives me the same efficient code that David posted.</description>
		<content:encoded><![CDATA[<p>I love replying to myself. It is perhaps interesting to know that RVCT&#8217;s loop optimizer seems to interfere with the structure access in this example. Switching back to -O2 -Otime (the loop optimizer is only enabled at -O3, AFAIK) gives me the same efficient code that David posted.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mh</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-543</link>
		<dc:creator>mh</dc:creator>
		<pubDate>Tue, 09 Feb 2010 20:40:22 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-543</guid>
		<description>Nice! You almost got me there. However, the compiler I am using (4.0, build 697) does not like your code. It does compile it, of course, but the instruction sequence is significantly worse than that of the pure integer-based loop. And yes, I did fix the loop in my example.

&lt;pre&gt;
func_david PROC
        PUSH     {r4,r5}
        ADD      r12,r0,r1,LSL #2
        LDR      r3,[r0,#0]
        ADD      r1,r0,#4
        SUB      r12,r12,r1
        TST      r3,#1
        ADDNE    r3,r3,r2,LSL #1
        STRNE    r3,[r0,#0]
        MOV      r0,#1
        ADD      r4,r0,r12,ASR #2
        CMP      r4,#1
        BLE      &#124;L1.316&#124;
&#124;L1.280&#124;
        MOV      r3,r1
        ADD      r0,r0,#1
        ADD      r1,r1,#4
        LDR      r12,[r3,#0]
        TST      r12,#1
        ADDNE    r12,r12,r2,LSL #1
        STRNE    r12,[r3,#0]
        CMP      r4,r0
        BGT      &#124;L1.280&#124;
&#124;L1.316&#124;
        POP      {r4,r5}
        BX       lr
        ENDP
&lt;/pre&gt;

The compiler&#039;s remark was:

&quot;hardwarebug_20100130.c&quot;, line 58: #1636-D: Could not optimize: Complicated use of variable (q)

Stack alignment is only necessary in non-leaf functions as you stated. So every once in a while you will see single registers pushed on the stack. Why the compiler thought this was needed in your example, I don&#039;t know.

Regards
Marcus</description>
		<content:encoded><![CDATA[<p>Nice! You almost got me there. However, the compiler I am using (4.0, build 697) does not like your code. It does compile it, of course, but the instruction sequence is significantly worse than that of the pure integer-based loop. And yes, I did fix the loop in my example.</p>
<pre>
func_david PROC
        PUSH     {r4,r5}
        ADD      r12,r0,r1,LSL #2
        LDR      r3,[r0,#0]
        ADD      r1,r0,#4
        SUB      r12,r12,r1
        TST      r3,#1
        ADDNE    r3,r3,r2,LSL #1
        STRNE    r3,[r0,#0]
        MOV      r0,#1
        ADD      r4,r0,r12,ASR #2
        CMP      r4,#1
        BLE      |L1.316|
|L1.280|
        MOV      r3,r1
        ADD      r0,r0,#1
        ADD      r1,r1,#4
        LDR      r12,[r3,#0]
        TST      r12,#1
        ADDNE    r12,r12,r2,LSL #1
        STRNE    r12,[r3,#0]
        CMP      r4,r0
        BGT      |L1.280|
|L1.316|
        POP      {r4,r5}
        BX       lr
        ENDP
</pre>
<p>The compiler&#8217;s remark was:</p>
<p>&#8220;hardwarebug_20100130.c&#8221;, line 58: #1636-D: Could not optimize: Complicated use of variable (q)</p>
<p>Stack alignment is only necessary in non-leaf functions as you stated. So every once in a while you will see single registers pushed on the stack. Why the compiler thought this was needed in your example, I don&#8217;t know.</p>
<p>Regards<br />
Marcus</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: stevenb</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-535</link>
		<dc:creator>stevenb</dc:creator>
		<pubDate>Fri, 05 Feb 2010 11:14:50 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-535</guid>
		<description>I have filed this in GCC bugzilla as bug 42972.</description>
		<content:encoded><![CDATA[<p>I have filed this in GCC bugzilla as bug 42972.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: DavidEarlexarm</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-534</link>
		<dc:creator>DavidEarlexarm</dc:creator>
		<pubDate>Thu, 04 Feb 2010 18:19:37 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-534</guid>
		<description>The beauty of ARM&#039;s compiler is that you can write portably - not relying on bit-field layout, Marcus Harnisch - and still get great code.
&lt;pre&gt;
void func(struct bf1_31 *p, int n, int a)
{
    struct bf1_31 *pend= &amp;p[n];
    do {
        struct bf1_31 *q;
        if ((q=p++)-&gt;a)
            q-&gt;b += a;
    } while (p &lt; pend);
}
&lt;/pre&gt;
Using old RVCT3.1 build 569 produces:
&lt;pre&gt;
func PROC
        ADD      r12,r0,r1,LSL #2
        PUSH     {lr}
&#124;L1.8&#124;
        MOV      r3,r0
        LDR      r1,[r0],#4
        TST      r1,#1
        ADDNE    r1,r1,r2,LSL #1
        STRNE    r1,[r3,#0]
        CMP      r0,r12
        BCC      &#124;L1.8&#124;
        POP      {pc}
        ENDP
&lt;/pre&gt;
This same source even generates good --thumb code.
 
I am, however, mystified as to why it pushes LR and pops PC in a leaf function. Maybe the branch predictor
influences cache victim selection to not immediately evict the calling code.

But I think the compiler will push multiples of two registers for architecture v5TE and up to keep the stack 8-byte aligned. For that I&#039;d blame XScale&#039;s LDRD/STRD.</description>
		<content:encoded><![CDATA[<p>The beauty of ARM&#8217;s compiler is that you can write portably &#8211; not relying on bit-field layout, Marcus Harnisch &#8211; and still get great code.</p>
<pre>
void func(struct bf1_31 *p, int n, int a)
{
    struct bf1_31 *pend= &amp;p[n];
    do {
        struct bf1_31 *q;
        if ((q=p++)-&gt;a)
            q-&gt;b += a;
    } while (p &lt; pend);
}
</pre>
<p>Using old RVCT3.1 build 569 produces:</p>
<pre>
func PROC
        ADD      r12,r0,r1,LSL #2
        PUSH     {lr}
|L1.8|
        MOV      r3,r0
        LDR      r1,[r0],#4
        TST      r1,#1
        ADDNE    r1,r1,r2,LSL #1
        STRNE    r1,[r3,#0]
        CMP      r0,r12
        BCC      |L1.8|
        POP      {pc}
        ENDP
</pre>
<p>This same source even generates good --thumb code.</p>
<p>I am, however, mystified as to why it pushes LR and pops PC in a leaf function. Maybe the branch predictor<br />
influences cache victim selection to not immediately evict the calling code.</p>
<p>But I think the compiler will push multiples of two registers for architecture v5TE and up to keep the stack 8-byte aligned. For that I&#8217;d blame XScale&#8217;s LDRD/STRD.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mans</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-530</link>
		<dc:creator>Mans</dc:creator>
		<pubDate>Tue, 02 Feb 2010 23:39:04 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-530</guid>
		<description>No, that is not the case.  That flag has no effect on ARM.</description>
		<content:encoded><![CDATA[<p>No, that is not the case.  That flag has no effect on ARM.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Scott Graves</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-529</link>
		<dc:creator>Scott Graves</dc:creator>
		<pubDate>Tue, 02 Feb 2010 23:24:53 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-529</guid>
		<description>The reason for the &quot;unnecessary&quot; saving of registers is probably that you are not using -fomit-frame-pointer, which means that all functions must have a stack frame (so they can be debugged and unwound, although DWARF2 debug/unwind information makes that unnecessary).</description>
		<content:encoded><![CDATA[<p>The reason for the &#8220;unnecessary&#8221; saving of registers is probably that you are not using -fomit-frame-pointer, which means that all functions must have a stack frame (so they can be debugged and unwound, although DWARF2 debug/unwind information makes that unnecessary).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joshua Haberman</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-528</link>
		<dc:creator>Joshua Haberman</dc:creator>
		<pubDate>Tue, 02 Feb 2010 01:21:53 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-528</guid>
		<description>&lt;i&gt;Even the best compilers are still no match for a human.&lt;/i&gt;

I think a fairer thing to say is &quot;even the best compilers are still no match for the best humans, given infinite time and ingenuity.&quot;

I frequently am pleasantly surprised at how clever a compiler is being on my behalf.  For example, gcc knows how to turn &quot;n / 1000000UL;&quot; into &quot;(n * 1125899907) &gt;&gt; 50&quot;.  Maybe if you locked me in a room for an hour and demanded that I optimize that expression I would have come up with that.  But the compiler thought of it for me in a fraction of a second.

After all, everything a compiler does is something a human thought of first, so naturally a compiler isn&#039;t going to outdo the smartest human with expertise in the architecture.  But it&#039;s definitely going to outdo a lot of people.  It&#039;s also much less error-prone.

Also, picking on GCC-on-ARM is like bullying the pipsqueak on the playground.  :)</description>
		<content:encoded><![CDATA[<p><i>Even the best compilers are still no match for a human.</i></p>
<p>I think a fairer thing to say is &#8220;even the best compilers are still no match for the best humans, given infinite time and ingenuity.&#8221;</p>
<p>I frequently am pleasantly surprised at how clever a compiler is being on my behalf.  For example, gcc knows how to turn &#8220;n / 1000000UL;&#8221; into &#8220;(n * 1125899907) &gt;&gt; 50&#8221;.  Maybe if you locked me in a room for an hour and demanded that I optimize that expression I would have come up with that.  But the compiler thought of it for me in a fraction of a second.</p>
<p>After all, everything a compiler does is something a human thought of first, so naturally a compiler isn&#8217;t going to outdo the smartest human with expertise in the architecture.  But it&#8217;s definitely going to outdo a lot of people.  It&#8217;s also much less error-prone.</p>
<p>Also, picking on GCC-on-ARM is like bullying the pipsqueak on the playground.  :)</p>
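<p>The multiply-and-shift constant quoted above can be sanity-checked with a short C sketch. It assumes n is a 32-bit unsigned value, that unsigned int is 32 bits wide and that unsigned long long is 64 bits wide. The names div_million and count_mismatches are made up for illustration, and 1125899907 is 2^50 / 1000000 rounded up.</p>

```c
/* Sketch only: assumes unsigned int is 32 bits and unsigned long long is 64. */

/* n / 1000000 written as a multiply and a shift; the product needs 64 bits. */
static unsigned int div_million(unsigned int n)
{
    return (unsigned int)(((unsigned long long)n * 1125899907ULL) >> 50);
}

/* Compare against plain division at every multiple of 1000000 and at the
   value just below it, where a rounding error would show up first.
   Returns the number of disagreements found. */
static unsigned int count_mismatches(void)
{
    unsigned int bad = 0;
    unsigned int k;

    for (k = 0; k <= 4294; k++) {
        unsigned int n = k * 1000000u;

        if (div_million(n) != n / 1000000u)
            bad++;
        if (n != 0) {
            if (div_million(n - 1) != (n - 1) / 1000000u)
                bad++;
        }
    }
    /* Largest 32-bit input. */
    if (div_million(4294967295u) != 4294967295u / 1000000u)
        bad++;
    return bad;
}
```

<p>The multiplication has to be carried out in 64 bits, which is why the trick is less obvious in C source than in the generated code.</p>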
]]></content:encoded>
	</item>
	<item>
		<title>By: Marcus Harnisch</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-526</link>
		<dc:creator>Marcus Harnisch</dc:creator>
		<pubDate>Mon, 01 Feb 2010 15:39:45 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-526</guid>
		<description>OK, I&#039;ll bite. RVCT&#039;s loop optimizer doesn&#039;t like structures. Blame it
(the optimizer) for not recognizing the fact that it is looking at an
integer operation in this case. Notice that the loop gets transformed
into a down-counting loop.
&lt;pre&gt;
void func(struct bf1_31 *p, int n, int a)
{
    int i;

    unsigned int *q = (unsigned int *)p;

    #pragma unroll(1) // prevent loop unrolling for comparison
    for (i=0; i&lt;n; i++) {
        if (*q &amp; 1)
            *q++ = *q + 2*a;
    }
}
&lt;/pre&gt;
This is hardly more portable than assembler (depends on bit-field
layout) and took about ten times as long to come up with :-) but this
is the compiler-generated assembler code (RVCT, Build 697, -O3 -Otime):
&lt;pre&gt;
func PROC
        CMP      r1,#0
        BXLE     lr
&#124;L1.148&#124;
        LDR      r3,[r0,#0]
        TST      r3,#1
        ADDNE    r3,r3,r2,LSL #1
        STRNE    r3,[r0],#4
        SUBS     r1,r1,#1
        BNE      &#124;L1.148&#124;
        BX       lr
        ENDP
&lt;/pre&gt;

Regards
Marcus</description>
		<content:encoded><![CDATA[<p>OK, I&#8217;ll bite. RVCT&#8217;s loop optimizer doesn&#8217;t like structures. Blame it<br />
(the optimizer) for not recognizing the fact that it is looking at an<br />
integer operation in this case. Notice that the loop gets transformed<br />
into a down-counting loop.</p>
<pre>
void func(struct bf1_31 *p, int n, int a)
{
    int i;

    unsigned int *q = (unsigned int *)p;

    #pragma unroll(1) // prevent loop unrolling for comparison
    for (i=0; i&lt;n; i++) {
        if (*q &amp; 1)
            *q++ = *q + 2*a;
    }
}
</pre>
<p>This is hardly more portable than assembler (depends on bit-field<br />
layout) and took about ten times as long to come up with :-) but this<br />
is the compiler-generated assembler code (RVCT, Build 697, -O3 -Otime):</p>
<pre>
func PROC
        CMP      r1,#0
        BXLE     lr
|L1.148|
        LDR      r3,[r0,#0]
        TST      r3,#1
        ADDNE    r3,r3,r2,LSL #1
        STRNE    r3,[r0],#4
        SUBS     r1,r1,#1
        BNE      |L1.148|
        BX       lr
        ENDP
</pre>
<p>Regards<br />
Marcus</p>
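<p>For readers without the original post to hand, the two styles argued over in this thread can be put side by side in a small C sketch. The struct definition is an assumption (it is not quoted in these comments), the function names are invented, and the raw version only matches the bit-field version on compilers that place the first bit-field in the least significant bit.</p>

```c
/* Assumed definition, not quoted in this thread: a 1-bit flag followed by
   a 31-bit value, packed into one unsigned int. */
struct bf1_31 {
    unsigned int a : 1;
    unsigned int b : 31;
};

/* Portable version: touches the fields by name only. */
static void add_if_set(struct bf1_31 *p, int n, int a)
{
    int i;

    for (i = 0; i < n; i++) {
        if (p[i].a)
            p[i].b += (unsigned int)a;  /* unsigned math, wraps to 31 bits */
    }
}

/* Layout-dependent version: treats each element as a plain word with the
   flag in bit 0 and the value in bits 1..31, so adding a to the value
   becomes adding 2*a to the word. */
static void add_if_set_raw(unsigned int *q, int n, int a)
{
    int i;

    for (i = 0; i < n; i++) {
        if (q[i] & 1)
            q[i] += 2u * (unsigned int)a;
    }
}

/* Returns 1 when this compiler uses the layout the raw version relies on. */
static int layout_matches(void)
{
    union { struct bf1_31 s; unsigned int u; } x;

    x.u = 0;
    x.s.a = 1;
    x.s.b = 1;
    return x.u == 3u;
}
```

<p>Calling layout_matches() before trusting add_if_set_raw() is one way to make the layout dependence explicit rather than silent.</p>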
]]></content:encoded>
	</item>
	<item>
		<title>By: Mans</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-525</link>
		<dc:creator>Mans</dc:creator>
		<pubDate>Mon, 01 Feb 2010 14:30:24 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-525</guid>
		<description>My most recent experience with a Green Hills compiler involved a week-long debugging session culminating in the discovery that the compiler would occasionally access memory below the bottom of the stack. An interrupt at an unfortunate time would result in the registers being dumped on the stack, and the wrong value would be loaded. This only happened when compiling without debugging symbols.</description>
		<content:encoded><![CDATA[<p>My most recent experience with a Green Hills compiler involved a week-long debugging session culminating in the discovery that the compiler would occasionally access memory below the bottom of the stack. An interrupt at an unfortunate time would result in the registers being dumped on the stack, and the wrong value would be loaded. This only happened when compiling without debugging symbols.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Phil</title>
		<link>http://hardwarebug.org/2010/01/30/bit-field-badness/comment-page-1/#comment-524</link>
		<dc:creator>Phil</dc:creator>
		<pubDate>Mon, 01 Feb 2010 06:46:06 +0000</pubDate>
		<guid isPermaLink="false">http://hardwarebug.org/?p=230#comment-524</guid>
		<description>Check out Green Hills Software&#039;s compiler. One of their selling points is that they consistently beat ARM&#039;s own compiler in code size.</description>
		<content:encoded><![CDATA[<p>Check out Green Hills Software&#8217;s compiler. One of their selling points is that they consistently beat ARM&#8217;s own compiler in code size.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
