The Cortex-A7 ARM core is a popular choice in low-power and low-cost designs. Unfortunately, the public TRM does not include instruction timing information. It does, however, reveal that execution is in-order, which makes measuring the throughput and latency of individual instructions relatively straightforward.
The table below lists the measured issue cycles (inverse throughput) and result latency of some commonly used instructions.
It should be noted that in some cases, the perceived latency depends on the instruction consuming the result. Most of the values were measured with the result used as input to the same instruction. For instructions with multiple outputs, the latencies of the result registers may also differ.
Finally, although instruction issue is in-order, completion is out of order, allowing independent instructions to issue and complete unimpeded while a multi-cycle instruction is executing in another unit. For example, a 3-cycle MUL instruction does not block ADD instructions following it in program order.
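This issue/latency behaviour can be captured in a toy scheduling model. The sketch below is purely illustrative (the `schedule` function and register names are my own invention, not anything from the A7 documentation): it issues in order, stalling only on unavailable source operands, and lets results complete out of order.

```python
# Toy model of in-order issue with out-of-order completion. Timing
# tuples use issue cycles and result latencies as in the tables below;
# this is a sketch of the observed behaviour, not the real A7 pipeline.
def schedule(program):
    ready = {}  # register -> cycle at which its value becomes available
    cycle = 0   # next free issue slot
    for name, dests, srcs, issue, latency in program:
        # In-order issue: stall until all source operands are ready.
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        for d in dests:
            ready[d] = start + latency  # result completes out of order
        cycle = start + issue           # next instruction may issue here
    return cycle

independent = [
    ("mul", ["r0"], ["r1", "r2"], 1, 3),  # 3-cycle result latency
    ("add", ["r3"], ["r4", "r5"], 1, 1),  # does not wait for the MUL
    ("add", ["r6"], ["r4", "r5"], 1, 1),
]
dependent = [
    ("mul", ["r0"], ["r1", "r2"], 1, 3),
    ("add", ["r3"], ["r0", "r4"], 1, 1),  # stalls for the MUL result
]
print(schedule(independent))  # 3 -- one issue slot per instruction
print(schedule(dependent))    # 4 -- the ADD waits until cycle 3
```

In the first sequence the 3-cycle MUL costs only its issue slot; only the second sequence, where an ADD consumes the MUL result, pays the multiply latency.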
ALU instructions | Issue cycles | Result latency |
---|---|---|
MOV Rd, Rm | 1/2 | 1 |
ADD Rd, Rn, #imm | 1/2 | 1 |
ADD Rd, Rn, Rm | 1 | 1 |
ADD Rd, Rn, Rm, LSL #imm | 1 | 1 |
ADD Rd, Rn, Rm, LSL Rs | 1 | 1 |
LSL Rd, Rn, #imm | 1 | 2 |
LSL Rd, Rn, Rs | 1 | 2 |
QADD Rd, Rn, Rm | 1 | 2 |
QADD8 Rd, Rn, Rm | 1 | 2 |
QADD16 Rd, Rn, Rm | 1 | 2 |
CLZ Rd, Rm | 1 | 1 |
RBIT Rd, Rm | 1 | 2 |
REV Rd, Rm | 1 | 2 |
SBFX Rd, Rn, #lsb, #width | 1 | 2 |
BFC Rd, #lsb, #width | 1 | 2 |
BFI Rd, Rn, #lsb, #width | 1 | 2 |
NOTE: Shifted operands and shift amounts are needed one cycle early. | ||
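The early-operand note means that an instruction producing a shift amount effectively has one extra cycle of latency into the shifter. A hypothetical sketch (register choices are illustrative):

```asm
@ Sketch only: the shifter runs a cycle ahead of the ALU, so r3 must be
@ ready one cycle before the ADD issues.
lsl  r3, r4, #1            @ r3 ready after 2 cycles
add  r0, r1, r2, lsl r3    @ sees an effective latency of 3 on r3
@ Hoisting an independent instruction between the two hides the stall.
```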
Multiply instructions | Issue cycles | Result latency |
---|---|---|
MUL Rd, Rn, Rm | 1 | 3 |
MLA Rd, Rn, Rm, Ra | 1 | 3¹ |
SMULL RdLo, RdHi, Rn, Rm | 1 | 3 |
SMLAL RdLo, RdHi, Rn, Rm | 1 | 3¹ |
SMMUL Rd, Rn, Rm | 1 | 3 |
SMMLA Rd, Rn, Rm, Ra | 1 | 3¹ |
SMULBB Rd, Rn, Rm | 1 | 3 |
SMLABB Rd, Rn, Rm, Ra | 1 | 3¹ |
SMULWB Rd, Rn, Rm | 1 | 3 |
SMLAWB Rd, Rn, Rm, Ra | 1 | 3¹ |
SMUAD Rd, Rn, Rm | 1 | 3 |
¹ Accumulator forwarding allows back-to-back MLA instructions without delay. | ||
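Accumulator forwarding is what makes dot-product style loops practical. A hypothetical sketch (registers are illustrative):

```asm
@ Sketch: a chain of MLAs accumulating into r0. Accumulator forwarding
@ lets each issue one cycle after the previous, despite the 3-cycle
@ multiply latency.
mla  r0, r1, r2, r0   @ r0 += r1 * r2
mla  r0, r3, r4, r0   @ issues next cycle: accumulator is forwarded
mla  r0, r5, r6, r0
@ Chaining through a multiplicand instead (e.g. mla r0, r0, r2, r3)
@ would pay the full 3-cycle multiply latency between instructions.
```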
Divide instructions | Issue cycles | Result latency |
---|---|---|
SDIV Rd, Rn, Rm | 4-20 | 6-22 |
UDIV Rd, Rn, Rm | 3-19 | 5-21 |
Load/store instructions | Issue cycles | Result latency |
---|---|---|
LDR Rt, [Rn] | 1 | 3 |
LDR Rt, [Rn, #imm] | 1 | 3 |
LDR Rt, [Rn, Rm] | 1 | 3 |
LDR Rt, [Rn, Rm, LSL #imm] | 1 | 3 |
LDRD Rt, Rt2, [Rn] | 1 | 3-4 |
LDM Rn, {regs} | 1-8 | 3-10 |
STR Rt, [Rn] | 1 | 2 |
STRD Rt, Rt2, [Rn] | 1 | 2 |
STM Rn, {regs} | 1-10 | 2-12 |
NOTE: Load results are forwarded to dependent stores without delay. | ||
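The load-to-store forwarding note is why a simple copy loop can run at the issue rate. A hypothetical sketch (registers and addressing modes are illustrative):

```asm
@ Sketch of a word-copy inner loop. Because load results forward to
@ dependent stores without delay, the STR need not wait out the
@ 3-cycle load latency on r3.
ldr  r3, [r1], #4     @ post-indexed load
str  r3, [r0], #4     @ forwards from the LDR; no stall
@ Feeding r3 into an ALU instruction instead would incur the full
@ 3-cycle load-use latency.
```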
VFP instructions | Issue cycles | Result latency |
---|---|---|
VMOV.F32 Sd, Sm | 1 | 4 |
VMOV.F64 Dd, Dm | 1 | 4 |
VNEG.F32 Sd, Sm | 1 | 4 |
VNEG.F64 Dd, Dm | 1 | 4 |
VABS.F32 Sd, Sm | 1 | 4 |
VABS.F64 Dd, Dm | 1 | 4 |
VADD.F32 Sd, Sn, Sm | 1 | 4 |
VADD.F64 Dd, Dn, Dm | 1 | 4 |
VMUL.F32 Sd, Sn, Sm | 1 | 4 |
VMUL.F64 Dd, Dn, Dm | 4 | 7 |
VMLA.F32 Sd, Sn, Sm | 1 | 8¹ |
VMLA.F64 Dd, Dn, Dm | 4 | 11² |
VFMA.F32 Sd, Sn, Sm | 1 | 8¹ |
VFMA.F64 Dd, Dn, Dm | 5 | 8² |
VDIV.F32 Sd, Sn, Sm | 15 | 18 |
VDIV.F64 Dd, Dn, Dm | 29 | 32 |
VSQRT.F32 Sd, Sm | 14 | 17 |
VSQRT.F64 Dd, Dm | 28 | 31 |
VCVT.F32.F64 Sd, Dm | 1 | 4 |
VCVT.F64.F32 Dd, Sm | 1 | 4 |
VCVT.F32.S32 Sd, Sm | 1 | 4 |
VCVT.F64.S32 Dd, Sm | 1 | 4 |
VCVT.S32.F32 Sd, Sm | 1 | 4 |
VCVT.S32.F64 Sd, Dm | 1 | 4 |
VCVT.F32.S32 Sd, Sd, #fbits | 1 | 4 |
VCVT.F64.S32 Dd, Dd, #fbits | 1 | 4 |
VCVT.S32.F32 Sd, Sd, #fbits | 1 | 4 |
VCVT.S32.F64 Dd, Dd, #fbits | 1 | 4 |
¹ 5 cycles with a dependency only on the accumulator. ² 8 cycles with a dependency only on the accumulator. | ||
NEON integer instructions | Issue cycles | Result latency |
---|---|---|
VADD.I8 Dd, Dn, Dm | 1 | 4 |
VADDL.S8 Qd, Dn, Dm | 2 | 4 |
VADD.I8 Qd, Qn, Qm | 2 | 4 |
VMUL.I8 Dd, Dn, Dm | 2 | 4 |
VMULL.S8 Qd, Dn, Dm | 2 | 4 |
VMUL.I8 Qd, Qn, Qm | 4 | 4 |
VMLA.I8 Dd, Dn, Dm | 2 | 4 |
VMLAL.S8 Qd, Dn, Dm | 2 | 4 |
VMLA.I8 Qd, Qn, Qm | 4 | 4 |
VADD.I16 Dd, Dn, Dm | 1 | 4 |
VADDL.S16 Qd, Dn, Dm | 2 | 4 |
VADD.I16 Qd, Qn, Qm | 2 | 4 |
VMUL.I16 Dd, Dn, Dm | 1 | 4 |
VMULL.S16 Qd, Dn, Dm | 2 | 4 |
VMUL.I16 Qd, Qn, Qm | 2 | 4 |
VMLA.I16 Dd, Dn, Dm | 1 | 4 |
VMLAL.S16 Qd, Dn, Dm | 2 | 4 |
VMLA.I16 Qd, Qn, Qm | 2 | 4 |
VADD.I32 Dd, Dn, Dm | 1 | 4 |
VADDL.S32 Qd, Dn, Dm | 2 | 4 |
VADD.I32 Qd, Qn, Qm | 2 | 4 |
VMUL.I32 Dd, Dn, Dm | 2 | 4 |
VMULL.S32 Qd, Dn, Dm | 2 | 4 |
VMUL.I32 Qd, Qn, Qm | 4 | 4 |
VMLA.I32 Dd, Dn, Dm | 2 | 4 |
VMLAL.S32 Qd, Dn, Dm | 2 | 4 |
VMLA.I32 Qd, Qn, Qm | 4 | 4 |
NEON floating-point instructions | Issue cycles | Result latency |
---|---|---|
VADD.F32 Dd, Dn, Dm | 2 | 4 |
VADD.F32 Qd, Qn, Qm | 4 | 4 |
VMUL.F32 Dd, Dn, Dm | 2 | 4 |
VMUL.F32 Qd, Qn, Qm | 4 | 4 |
VMLA.F32 Dd, Dn, Dm | 2 | 8¹ |
VMLA.F32 Qd, Qn, Qm | 4 | 8¹ |
¹ 5 cycles with a dependency only on the accumulator. | ||
NEON permute instructions | Issue cycles | Result latency |
---|---|---|
VEXT.n Dd, Dn, Dm, #imm | 1 | 4 |
VEXT.n Qd, Qn, Qm, #imm | 2 | 5 |
VTRN.n Dd, Dn, Dm | 2 | 5 |
VTRN.n Qd, Qn, Qm | 4 | 5 |
VUZP.n Dd, Dn, Dm | 2 | 5 |
VUZP.n Qd, Qn, Qm | 4 | 6 |
VZIP.n Dd, Dn, Dm | 2 | 5 |
VZIP.n Qd, Qn, Qm | 4 | 6 |
VTBL.8 Dd, {Dn}, Dm | 1 | 4 |
VTBL.8 Dd, {Dn-Dn+1}, Dm | 1 | 4 |
VTBL.8 Dd, {Dn-Dn+2}, Dm | 2 | 5 |
VTBL.8 Dd, {Dn-Dn+3}, Dm | 2 | 5 |
Have you seen any good documentation about what instructions can dual issue? I’ve heard that NEON/VFP instructions are single issue, but not much else.
I have not found any instruction pairs which can truly dual-issue. Some combinations can execute partially in parallel, but only one instruction can begin executing each cycle. If anything, I’d expect (predicted) branch instructions (these are often handled separately from the main execution pipelines) to possibly dual-issue, but this is tricky to measure.
That’s interesting. The ARM whitepapers (pre-release, at least) mention that it has some ability to dual-issue:
http://renesasmobile.com/share/news/2012/ARM-big.LITTLE-whitepaper.pdf
I assumed at least ALU and load/stores could issue on the same cycle if they were independent.
My measurements suggest that this is not the case in reality. A sequence of independent LDR/ADD pairs needs exactly one cycle per instruction.
Thanks for checking!
The ARM Cortex-A7 can dual-issue instructions with immediate operands, or, more likely, those that need only a single read from the register file. For example, “add r0, r1, r2” is bad, but “add r0, r1, #1” is good for it. The “mov” instruction is also a good dual-issue candidate.
Personally, I suspect they did this primarily to improve the handling of poorly generated code from bad compilers: redundant register moves are no longer that expensive. And if this hypothesis is plausible, maybe they also did not bother supporting ALU/LSU dual-issue because modern compilers are too dumb to make any good use of it.
You’re right, MOV and ADD immediate can dual-issue. Table updated to reflect this.
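Given that pairing rule, dual-issue-friendly code interleaves two-register-read instructions with single-read ones. A hypothetical sketch (the registers and exact pairing behaviour are illustrative, not documented):

```asm
@ Sketch of pairing-friendly code: every second instruction reads at
@ most one register, making it a candidate to dual-issue with the
@ instruction before it.
add  r0, r1, r2    @ two register reads
mov  r4, r5        @ single read: may pair with the preceding ADD
add  r6, r7, #1    @ immediate form: also a pairing candidate
mov  r8, r9
```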
Are the vmul/vmla.i16 results correct? Would be a little strange if i16 had twice the throughput of i8 and i32.
It seems strange, but I can find nothing wrong in my code. A plausible explanation is that there are four 16-bit multipliers, forcing VMUL.I8 to execute as two uops while VMUL.I16 can issue in a single cycle.
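The four-multiplier hypothesis can be sanity-checked with back-of-envelope arithmetic. In the sketch below, the multiplier count and the partial-product model are assumptions of mine, not documented facts; only the target issue-cycle numbers come from the measurements above.

```python
# Back-of-envelope check of the "four 16-bit multipliers" hypothesis.
# MULTIPLIERS and the partial-product model are assumptions; only the
# issue-cycle targets in the comments come from measurement.
MULTIPLIERS = 4  # assumed number of 16x16 multipliers

def vmul_issue_cycles(elem_bits, vector_bits=64):
    lanes = vector_bits // elem_bits
    # An NxN product built from 16x16 multipliers needs (N/16)^2 partial
    # products; a narrower multiply still occupies a whole multiplier.
    per_lane = max(1, (elem_bits // 16) ** 2)
    return -(-(lanes * per_lane) // MULTIPLIERS)  # ceiling division

print(vmul_issue_cycles(8))        # 2 -- matches VMUL.I8  Dd, Dn, Dm
print(vmul_issue_cycles(16))       # 1 -- matches VMUL.I16 Dd, Dn, Dm
print(vmul_issue_cycles(32))       # 2 -- matches VMUL.I32 Dd, Dn, Dm
print(vmul_issue_cycles(16, 128))  # 2 -- matches the Q-register form
```

Under these assumptions the model reproduces the measured issue cycles for all three element sizes, including the 32-bit case (each 32x32 product needing four 16x16 partial products).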
Hello! Could you please tell us which tools (e.g. benchmarks, code, etc.) you used to measure these instruction latencies? My research needs to measure some instruction latencies, too.
Thanks a lot!
To measure the throughput of the ADD instruction, I executed ADD r0, r1, r2 a million times in a loop, unrolled enough that the branches become insignificant, and measured the time using the CPU cycle counter. This loop takes a million cycles to run, so the throughput is one instruction per cycle. To get the latency, instead repeat ADD r0, r0, r1 so that each instruction depends on the previous one. This also gives one cycle per instruction, which means the result is ready for use by an instruction issued one cycle later. Repeat for other instructions. I used a lot of macros.
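Such unrolled loops can be generated with assembler repeat macros. The sketch below is illustrative only: the register choices and repeat counts are arbitrary, and reading the PMU cycle counter (PMCCNTR) from user space requires the kernel to have enabled user access to it.

```asm
@ Sketch of a throughput-measurement loop generated with .rept.
@ User-space PMCCNTR access must be enabled by the kernel; counts
@ and registers are illustrative.
        mrc     p15, 0, r8, c9, c13, 0   @ start: read PMCCNTR
        mov     r9, #1024                @ outer iterations
1:      .rept   1024                     @ unroll so the branch is noise
        add     r0, r1, r2               @ independent: throughput
        @ use "add r0, r0, r1" here instead to measure latency
        .endr
        subs    r9, r9, #1
        bne     1b
        mrc     p15, 0, r10, c9, c13, 0  @ stop: read PMCCNTR again
        sub     r10, r10, r8             @ elapsed cycles for ~2^20 adds
```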
Thank you very much !!
This really helps me a lot !
Do you do this in a bare-metal or a Linux-based environment? I try to measure instruction cycles with the cycle counter on a Linux-based system, but every time I get different values. For example, I measure 100 MOV instructions in a loop (100 iterations), and it takes about 5000-7000 cycles to complete. The unstable statistics are frustrating; how can I make my measurements more accurate? Thank you.
I run under Linux, measuring the cycles needed for a million instructions using a heavily unrolled loop, repeating the measurement five times and taking the minimum.
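The minimum-of-several-runs trick filters out interference from interrupts and the scheduler. A generic sketch, where `measure` is a stand-in for the real cycle-counter loop (the function name and sample values are hypothetical):

```python
# Noise filtering for cycle measurements under a multitasking OS:
# repeat the run and keep the minimum, i.e. the run least disturbed
# by interrupts and scheduling. measure() stands in for the real
# cycle-counter loop.
def best_of(measure, runs=5):
    return min(measure() for _ in range(runs))

# Demo with fake samples, some inflated by "noise":
samples = iter([1000012, 1000000, 1003451, 1000473, 1000208])
print(best_of(lambda: next(samples)))  # 1000000
```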
Thank you, genius. Thanks to your sharing, I can now confidently write scripts to measure the rest of the instructions. Thanks again.
Thanks, your list helped me.
According to http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/CJAJIIGG.html there are many more NEON instructions, like VRECPE and VRECPS for reciprocals. Would you also include those and some others in your list, or aren’t they supported on the A7?
Thanks in advance :)
All instructions are supported by the A7. It’s just a bit tedious to measure them, so I started with the most commonly used ones.
Thanks a lot for your detailed analysis; it clarifies that the A7 is a weaker design than the marketing material tries to make you believe.
Still, I find it inexcusable that this kind of information is not included in the official microarchitecture documentation. Is an assembly programmer really expected to measure stuff like this him/herself in order to write optimally scheduled code?
Is the source/assembly code to generate this information available somewhere? I would like to gather this information for some other ARM cores — A8, A9, A15, etc. Please send me a message. I would greatly appreciate your help.
The timings for the A8 and A9 are published on the ARM website. The A15 is much trickier to measure since it has out-of-order, multiple-issue pipelines.
Beware of the A8 timings published in the TRM:
http://www.avison.me.uk/ben/programming/cortex-a8.html
I did a batch of NEON timings on the A8 a while back and noticed that the devil can be in the details sometimes. For example, some paper mentioned “extensive support of key forwarding paths”, which immediately made me wonder what exactly the unsupported non-key forwarding paths were…
Well, between the NEON integer and floating-point pipelines, for example. This means that if you VMOV a source or destination operand of a floating-point operation, you need to keep a minimum issue distance between those two instructions (IIRC 7 cycles), or the CPU will enforce the delay for you.
(To clarify: the timings I linked to aren’t mine, although they did motivate me to perform some NEON timings myself, which he didn’t.)
I need the result latency of some other NEON floating-point instructions, like:
vceq.f32 q1, q2, #0
vbsl q1, q4, q3
Before I found this very helpful web page, I measured a few instructions (only ADD, SUB, MUL, DIV, SQRT) on a Cortex-M7 (Atmel SAME70).
The results are the same for the latency (I had only one arithmetic instruction per loop).
So it seems that the VFP unit of the M7 and the A7 is the same.
Perhaps this is documented somewhere, but I don’t know…
Thank you for the timing.
Where could I find similar timings for the Cortex-A17?
Sorry, I don’t have that information.
Hello,
Is there any documentation about cortex a5 which calculates the latencies like this?
I could not find anything.
I don’t know of any such documentation either.
Really useful. Thank you for taking the time to post this.
Can you please let me know the cycle details for the VLD1.F32 instruction?
Hello,
Thanks. This list is great, but I’m confused!
Some researchers in a scientific paper (“Micro-architectural simulation of embedded core heterogeneity with gem5 and McPAT”) have measured A7 instruction timings, but their results differ from yours. Could you explain that to me, please?