ARM inline asm secrets

Although I generally recommend against using GCC inline assembly, preferring instead pure assembly code in separate files, there are occasions where inline is the appropriate solution. Should one, at a time like this, turn to the GCC documentation for guidance, one must be prepared for a degree of disappointment. As it happens, much of the inline asm syntax is left entirely undocumented. This article attempts to fill in some of the blanks for the ARM target.

Constraints

Each operand of an inline asm block is described by a constraint string encoding the valid representations of the operand in the generated assembly. For example, the “r” code denotes a general-purpose register. In addition to the standard constraints, the ARM target accepts a number of special codes, only some of which are documented. The full list, with a brief description of each code, is available in the constraints.md file in the GCC source tree. The following table is an extract from this file consisting of the codes which are meaningful in an inline asm block (a few are only useful in the machine description itself).

f Legacy FPA registers f0-f7.
t The VFP registers s0-s31.
v The Cirrus Maverick co-processor registers.
w The VFP registers d0-d15, or d0-d31 for VFPv3.
x The VFP registers d0-d7.
y The Intel iWMMX co-processor registers.
z The Intel iWMMX GR registers.
l In Thumb state the core registers r0-r7.
h In Thumb state the core registers r8-r15.
j A constant suitable for a MOVW instruction. (ARM/Thumb-2)
b Thumb only. The union of the low registers and the stack register.
I In ARM/Thumb-2 state a constant that can be used as an immediate value in a Data Processing instruction. In Thumb-1 state a constant in the range 0 to 255.
J In ARM/Thumb-2 state a constant in the range -4095 to 4095. In Thumb-1 state a constant in the range -255 to -1.
K In ARM/Thumb-2 state a constant that satisfies the I constraint if inverted. In Thumb-1 state a constant that satisfies the I constraint multiplied by any power of 2.
L In ARM/Thumb-2 state a constant that satisfies the I constraint if negated. In Thumb-1 state a constant in the range -7 to 7.
M In Thumb-1 state a constant that is a multiple of 4 in the range 0 to 1020.
N In Thumb-1 state a constant in the range 0 to 31.
O In Thumb-1 state a constant that is a multiple of 4 in the range -508 to 508.
Pa In Thumb-1 state a constant in the range -510 to +510
Pb In Thumb-1 state a constant in the range -262 to +262
Ps In Thumb-2 state a constant in the range -255 to +255
Pt In Thumb-2 state a constant in the range -7 to +7
G In ARM/Thumb-2 state a valid FPA immediate constant.
H In ARM/Thumb-2 state a valid FPA immediate constant when negated.
Da In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with two Data Processing insns.
Db In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with three Data Processing insns.
Dc In ARM/Thumb-2 state a const_int, const_double or const_vector that can be generated with four Data Processing insns. This pattern is disabled if optimizing for space or when we have load-delay slots to fill.
Dn In ARM/Thumb-2 state a const_vector which can be loaded with a Neon vmov immediate instruction.
Dl In ARM/Thumb-2 state a const_vector which can be used with a Neon vorr or vbic instruction.
DL In ARM/Thumb-2 state a const_vector which can be used with a Neon vorn or vand instruction.
Dv In ARM/Thumb-2 state a const_double which can be used with a VFP fconsts instruction.
Dy In ARM/Thumb-2 state a const_double which can be used with a VFP fconstd instruction.
Ut In ARM/Thumb-2 state an address valid for loading/storing opaque structure types wider than TImode.
Uv In ARM/Thumb-2 state a valid VFP load/store address.
Uy In ARM/Thumb-2 state a valid iWMMX load/store address.
Un In ARM/Thumb-2 state a valid address for Neon doubleword vector load/store instructions.
Um In ARM/Thumb-2 state a valid address for Neon element and structure load/store instructions.
Us In ARM/Thumb-2 state a valid address for non-offset loads/stores of quad-word values in four ARM registers.
Uq In ARM state an address valid in ldrsb instructions.
Q In ARM/Thumb-2 state an address that is a single base register.
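
As a minimal sketch of how a constant constraint from the table is used, the function below passes the literal 255 through the “I” constraint so that it appears as an immediate in an ADD instruction. The function name add_255 and the preprocessor guard are assumptions made for this illustration; on anything other than 32-bit ARM in ARM or Thumb-2 state, a plain C expression stands in so the example still builds.

```c
#include <stdint.h>

/* Hypothetical helper: adds the constant 255 to x.
 * "I" requires a compile-time constant that is a valid data-processing
 * immediate; 255 qualifies in ARM and Thumb-2 state. */
static inline uint32_t add_255(uint32_t x)
{
    uint32_t r;
#if defined(__GNUC__) && defined(__arm__) && \
    (!defined(__thumb__) || defined(__thumb2__))
    /* %2 is printed with a leading '#', e.g. "add r0, r1, #255". */
    __asm__ ("add %0, %1, %2" : "=r" (r) : "r" (x), "I" (255));
#else
    /* Portable stand-in for non-ARM builds. */
    r = x + 255u;
#endif
    return r;
}
```

If the value given to an “I” operand cannot be encoded as an immediate, the expected result is a compile-time error about an impossible constraint rather than a silently wrong instruction.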

Operand codes

Within the text of an inline asm block, operands are referenced as %0, %1 etc. Register operands are printed as rN, memory operands as [rN, #offset], and so forth. In some situations, for example with operands occupying multiple registers, more detailed control of the output may be required, and once again, an undocumented feature comes to our rescue.
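
To illustrate the default printing described above, here is a small sketch that loads a word through a memory operand. The function name load_word is hypothetical, and the guard macros are an assumption about the build environment; on non-ARM targets a plain C load is used instead.

```c
#include <stdint.h>

/* Hypothetical helper: loads one 32-bit word from *p.
 * %0 is a register operand and is printed as rN.
 * %1 is a memory operand and is printed as an address such as
 * [r3] or [r3, #8], depending on how the compiler reaches *p. */
static inline uint32_t load_word(const uint32_t *p)
{
    uint32_t v;
#if defined(__GNUC__) && defined(__arm__)
    __asm__ ("ldr %0, %1" : "=r" (v) : "m" (*p));
#else
    /* Portable stand-in for non-ARM builds. */
    v = *p;
#endif
    return v;
}
```

Compiling with -S and reading the generated assembly is the easiest way to see which form the compiler chose for each operand.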

Special code letters inserted between the % and the operand number alter the output from the default for each type of operand. The table below lists the more useful ones.

c An integer or symbol address without a preceding # sign
B Bitwise inverse of integer or symbol without a preceding #
L The low 16 bits of an immediate constant
m The base register of a memory operand
M A register range suitable for LDM/STM
H The highest-numbered register of a pair
Q The least significant register of a pair
R The most significant register of a pair
P A double-precision VFP register
p The high single-precision register of a VFP double-precision register
q A NEON quad register
e The low doubleword register of a NEON quad register
f The high doubleword register of a NEON quad register
h A range of VFP/NEON registers suitable for VLD1/VST1
A A memory operand for a VLD1/VST1 instruction
y S register as indexed D register, e.g. s5 becomes d2[1]
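
A common use of the Q and R codes is 64-bit arithmetic on register pairs. The sketch below adds two 64-bit values with an ADDS/ADC sequence; the function name add64 and the preprocessor guard are assumptions for illustration, and a plain C addition is substituted when not building for 32-bit ARM in ARM or Thumb-2 state.

```c
#include <stdint.h>

/* Hypothetical helper: 64-bit addition using a register pair per operand.
 * %Q selects the least significant register of a pair, %R the most
 * significant one, regardless of endianness. */
static inline uint64_t add64(uint64_t a, uint64_t b)
{
    uint64_t sum;
#if defined(__GNUC__) && defined(__arm__) && \
    (!defined(__thumb__) || defined(__thumb2__))
    __asm__ ("adds %Q0, %Q1, %Q2\n\t"   /* low words, sets carry     */
             "adc  %R0, %R1, %R2"       /* high words plus the carry */
             : "=&r" (sum)              /* early clobber: low half is
                                           written before the high
                                           halves of the inputs are read */
             : "r" (a), "r" (b)
             : "cc");                   /* the flags are modified */
#else
    /* Portable stand-in for non-ARM builds. */
    sum = a + b;
#endif
    return sum;
}
```

The early-clobber marker keeps the output pair from partially overlapping an input pair, which would otherwise corrupt the high word of an input before it is read.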

19 Responses to ARM inline asm secrets

  1. stevenb says:

    You shouldn’t have to look in constraints.md, all constraints should be documented, see http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Constraints.html#Constraints

    If something is missing, that’d be a bug.

    • Mans says:

      Are you trolling? That list is so far from complete it’s not even funny. The documentation does not mention the existence of modifier codes used with the % references at all, and less than half of the constraints are covered.

  2. ssvb says:

    As I understand it, and based on exchanging some e-mails with gcc maintainers long ago (unless I got it wrong), most of this stuff is undocumented on purpose, so that nobody will have any right to complain once/if it gets changed in future versions of gcc without notice. The use of fancy undocumented constraints is strongly discouraged. It is for the most part internal gcc stuff which is not supposed to be exposed.

    This is just like relying on the use of some internal undocumented API of some library. Sure, everyone can read the sources and figure out how it works. But you are the only one at fault if anything goes wrong, blaming the developers is pointless.

    Nevertheless, extending gcc documentation to add information about at least a few more useful constraints makes sense. If you want to get your opinion taken into account, it is better to use gcc bugzilla:
    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37188

    • Mans says:

      The currently documented constraints are mostly useless for ARM. Anything beyond a simple register operand requires something secret to work reliably.

      That bugzilla entry has been open for almost two years with no action taken. Typical GCC behaviour…

      • stevenb says:

        And typical user behavior to complain that volunteers do not work on something they apparently do not care so much about as you do. What stops you from submitting a patch if you can write a good article about this?

        • Mans says:

          Please do not give me the volunteer bullshit. CodeSourcery has an entire team of paid people working on gcc, as do other companies. The bug report is already there, yet nobody bothers to so much as confirm it, let alone do anything about it. One wonders what these people spend their days doing.

          • stevenb says:

            *shrug*
            Especially in the case of ARM and CodeSourcery, one would expect that your complaint would be addressed if ARM thinks it is important.

            Paid developers who get paid to care about something else than what you care about. I don’t see the difference. In both cases, you are pissing on people because they do not do what you want them to do, because they think they have better things to do. What gives you the right to complain about that?

          • Mans says:

            I do not need permission from anybody to write about an undocumented feature in a piece of free software.

  3. Stian Skjelstad says:

    Using undocumented features that might change in future gcc versions isn’t pure evil as some people claim.

    If you need some fancy features in a project you are working on, you probably lock down the versions of the libraries and toolchains you work with. Jumping from gcc3.x to gcc4.x, for instance, is normally not recommended in a live project, especially in the embedded world.

  4. Julien says:

    Hi Mans!

    Thanks for that list. I’ve struggled a long time before I finally found your site.

    I use the %q, %e, %f operand codes so that gcc will allocate registers automatically (I don’t like hardcoding registers in my inline assembly). But GCC seems pretty dumb when it comes to loading/saving values. Consider the following:

    void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
        asm volatile (
        // result = first column of B x first row of A
        "vmul.f32 %q0, %q4, %e8[0]\n\t" // %e
        "vmul.f32 %q1, %q4, %e9[0]\n\t"
        "vmul.f32 %q2, %q4, %e10[0]\n\t"
        "vmul.f32 %q3, %q4, %e11[0]\n\t"
        // result += second column of B x second row of A
        "vmla.f32 %q0, %q5, %e8[1]\n\t"
        "vmla.f32 %q1, %q5, %e9[1]\n\t"
        "vmla.f32 %q2, %q5, %e10[1]\n\t"
        "vmla.f32 %q3, %q5, %e11[1]\n\t"
        // result += third column of B x third row of A
        "vmla.f32 %q0, %q6, %f8[0]\n\t"
        "vmla.f32 %q1, %q6, %f9[0]\n\t"
        "vmla.f32 %q2, %q6, %f10[0]\n\t"
        "vmla.f32 %q3, %q6, %f11[0]\n\t"
        // result += last column of B x last row of A
        "vmla.f32 %q0, %q7, %f8[1]\n\t"
        "vmla.f32 %q1, %q7, %f9[1]\n\t"
        "vmla.f32 %q2, %q7, %f10[1]\n\t"
        "vmla.f32 %q3, %q7, %f11[1]\n\t"
        : "=w" (result.val[0]), "=w" (result.val[1]), "=w" (result.val[2]), "=w" (result.val[3])
        :  "w" (b.val[0]), "w" (b.val[1]), "w" (b.val[2]), "w" (b.val[3]),
           "w" (a.val[0]), "w" (a.val[1]), "w" (a.val[2]), "w" (a.val[3])
        :   
        );  
    }
    

    GCC 4.5 will produce the following code:

        vldmia  r2, {d22-d23}
        vldr    d20, [r2, #16] 
        vldr    d21, [r2, #24] 
        vldr    d18, [r2, #32] 
        vldr    d19, [r2, #40] 
        vldr    d16, [r2, #48] 
        vldr    d17, [r2, #56] 
        vldmia  r1, {d0-d1}
        vldr    d2, [r1, #16] 
        vldr    d3, [r1, #24] 
        vldr    d4, [r1, #32] 
        vldr    d5, [r1, #40] 
        vldr    d6, [r1, #48] 
        vldr    d7, [r1, #56] 
        vmul.f32 q3, q11, d0[0]
        vmul.f32 q2, q11, d2[0]
        vmul.f32 q1, q11, d4[0]
        vmul.f32 q0, q11, d6[0]
        vmla.f32 q3, q10, d0[1]
        vmla.f32 q2, q10, d2[1]
        vmla.f32 q1, q10, d4[1]
        vmla.f32 q0, q10, d6[1]
        vmla.f32 q3, q9, d1[0]
        vmla.f32 q2, q9, d3[0]
        vmla.f32 q1, q9, d5[0]
        vmla.f32 q0, q9, d7[0]
        vmla.f32 q3, q8, d1[1]
        vmla.f32 q2, q8, d3[1]
        vmla.f32 q1, q8, d5[1]
        vmla.f32 q0, q8, d7[1]
        
        vstmia  r0, {d6-d7}
        vstr    d4, [r0, #16] 
        vstr    d5, [r0, #24]
        vstr    d2, [r0, #32]
        vstr    d3, [r0, #40]
        vstr    d0, [r0, #48]
        vstr    d1, [r0, #56]
        bx  lr
    

    Why all the vldr/vstr when it could do everything with just 2 vldmia and 1 vstmia? Do you know of any way to tell it to treat result.val as a range of 4 values, as if I were using “r” (result.val), yet still use auto-allocated registers?

    Thanks!

    • Mans says:

      The GCC register allocator isn’t particularly clever (in fairness, it is a rather hard problem). My guess is that it allocates the registers requested by the constraints of the asm block first, then generates the necessary code to load the values.

      Is this function intended to be inlined? If not, I recommend writing the entire function in pure assembler or, if you must use inline asm, manually do the loads and stores to hardcoded registers. There might also be some way to coerce better behaviour from GCC using range constraints, but those are a bit tricky. If the function is being inlined, you must examine the generated code where it is used as this can differ dramatically from what you get for the function on its own.

  5. Julien says:

    Yes, it is meant to be inlined. When I chain matrix multiplications, GCC is smart enough not to store and reload the intermediate results, but it still uses lots of vldr at the front of the chain and lots of vstr to store the end result > <;

    I can't find anything about range constraints in the GCC undocumentation, unfortunately. I think I'm going to dig in codesourcery forums and the gcc dev mailing list for more info.

    Happy Christmas to you btw!

  6. Julien says:

    FWIW, the following inline assembly:

    void method(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
        asm (
        "vldmia   %m[a], { q4-q7 }\n\t"
        "vldmia   %m[b], { q8-q11 } \n\t"
        .................
        "vstmia   %m[result], { q0-q3 }"
        :   
        : [result] "Us" (result), [a] "Us" (a), [b] "Us" (b) 
        : "memory", "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11"
        );  
    }
    

    …seems to produce better code than using plain old

    void method(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
        asm (
        "vldmia   %[a], { q4-q7 }\n\t"
        "vldmia   %[b], { q8-q11 } \n\t"
        .........
        "vstmia   %[result], { q0-q3 }"
        :   
        : [result] "r" (result.val), [a] "r" (a.val), [b] "r" (b.val) 
        : "memory", "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11" //clobber
        );  
    }
    

    (note the use of “m” and “Us” versus “r”)

  7. Julien says:
    float32x4_t v;
    asm volatile("# range: %h[v]" : : [v] "w" (v) : );
    

    …produces the following comment in the assembly:

    # range: {d20-d21}
    

    GCC doesn’t seem able to produce a range big enough to hold a quadword, let alone a range of multiple quadword registers, so it’s kind of useless to use this IMO.

  8. Gregory Eckersley says:

    Thanks for your web page. The GCC constraints are unfathomably complex. I wanted 2 lines of assembly code as follows: "svc sym1", ".word symbol2". No constraint would allow symbol2 to be defined either as a symbolic address or as a 32-bit constant. Some crashed gcc, e.g.:

    #define svcgen(sym1,symbol2) __asm__("SVC %0\n\t .word %c1\n\t" : : "i" (sym1) , "i" (symbol2)  : )
    

    accepts 32-bit constants, but not symbols, when invoked:

    "svcgen(0x40,32_bit_constant)" works
    "svcgen(0x40,symbol)" fails on constraint
    

    My solution was:

    #define svcgen(sym1,symbol2) __asm__("SVC %0\n\t .word " symbol2 "\n\t" : : "i" (sym1)   : )
    

    Invoked by, e.g. :

    svcgen(0x40,"c_variable_or_constant")
    

    The main thing to realise is that gcc inline assembler constraints, however silly, cannot be circumvented. Quotes at least allowed the constraints to be dodged!

    Your web page helped me realise this earlier – you are definitely not trolling – thanks again. GPE
