Cortex-A7 instruction cycle timings

Thursday, 15th May, 2014 - 3:15 pm | ARM

The Cortex-A7 ARM core is a popular choice in low-power and low-cost designs. Unfortunately, the public TRM does not include instruction timing information. It does reveal that execution is in-order, which makes measuring the throughput and latency of individual instructions relatively straightforward.

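The tables below list the measured issue cycles (inverse throughput) and result latency of some commonly used instructions. An issue cycle count of 1/2 marks an instruction that can dual-issue, so two of them can issue in the same cycle.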

It should be noted that in some cases, the perceived latency depends on the instruction consuming the result. Most of the values were measured with the result used as input to the same instruction. For instructions with multiple outputs, the latencies of the result registers may also differ.
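As a rough illustration of the two kinds of measurement, the following Python sketch builds the instruction sequences described in the comments below: independent copies of an instruction for throughput, and a dependent chain for latency. The function names and the cycle counts in the usage lines are made up for illustration; the real measurements used hand-written assembly macros and the CPU cycle counter, which this sketch does not reproduce.

```python
def throughput_test(mnemonic, n, dst="r0", srcs=("r1", "r2")):
    """Build n copies of an instruction that never read each other's result."""
    line = f"{mnemonic} {dst}, {', '.join(srcs)}"
    return [line] * n


def latency_test(mnemonic, n, dst="r0", other="r1"):
    """Build a chain in which each instruction reads the previous result."""
    return [f"{mnemonic} {dst}, {dst}, {other}"] * n


def cycles_per_insn(run_cycles, n_insns):
    """Use the fastest of several runs, which is the least disturbed one."""
    return min(run_cycles) / n_insns


print(throughput_test("ADD", 2))  # two independent ADD r0, r1, r2
print(latency_test("ADD", 2))     # two chained ADD r0, r0, r1

# Hypothetical cycle counts from three runs of a million instructions.
print(cycles_per_insn([1000300, 1000000, 1004100], 1_000_000))
```

Taking the minimum rather than the mean discards runs that were lengthened by interrupts or cache misses, which matters when measuring under an operating system.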

Finally, although instruction issue is in-order, completion is out of order, allowing independent instructions to issue and complete unimpeded while a multi-cycle instruction is executing in another unit. For example, a 3-cycle MUL instruction does not block ADD instructions following it in program order.
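To make the difference between issue cycles and result latency concrete, here is a minimal Python sketch of a single-issue, in-order pipeline. It is a simplified model and not a description of the real core: the `Insn` class and the `schedule` function are invented for illustration, dual issue and the early-operand rule for shifts are ignored, and the only inputs are the issue and latency figures from the tables.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    text: str
    dst: str
    srcs: tuple
    issue: int    # issue cycles (inverse throughput)
    latency: int  # result latency


def schedule(insns):
    """Return the cycle in which each instruction issues, in program order."""
    ready = {}      # register -> cycle in which its value becomes available
    next_slot = 0   # earliest cycle in which the issue stage is free
    issue_at = []
    for insn in insns:
        # An instruction waits for the issue stage and for all its sources.
        start = max([next_slot] + [ready.get(r, 0) for r in insn.srcs])
        issue_at.append(start)
        ready[insn.dst] = start + insn.latency
        next_slot = start + insn.issue
    return issue_at


mul = Insn("MUL r0, r1, r2", "r0", ("r1", "r2"), issue=1, latency=3)
independent = [
    mul,
    Insn("ADD r3, r4, r5", "r3", ("r4", "r5"), issue=1, latency=1),
    Insn("ADD r6, r7, r8", "r6", ("r7", "r8"), issue=1, latency=1),
]
dependent = [
    mul,
    Insn("ADD r3, r0, r4", "r3", ("r0", "r4"), issue=1, latency=1),
]

print(schedule(independent))  # [0, 1, 2]
print(schedule(dependent))    # [0, 3]
```

In the first sequence the ADDs issue back to back behind the MUL. In the second, the ADD that reads `r0` has to wait until the MUL result is available.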

ALU instructions	Issue cycles	Result latency
`MOV Rd, Rm`	1/2	1
`ADD Rd, Rn, #imm`	1/2	1
`ADD Rd, Rn, Rm`	1	1
`ADD Rd, Rn, Rm, LSL #imm`	1	1
`ADD Rd, Rn, Rm, LSL Rs`	1	1
`LSL Rd, Rn, #imm`	1	2
`LSL Rd, Rn, Rs`	1	2
`QADD Rd, Rn, Rm`	1	2
`QADD8 Rd, Rn, Rm`	1	2
`QADD16 Rd, Rn, Rm`	1	2
`CLZ Rd, Rm`	1	1
`RBIT Rd, Rm`	1	2
`REV Rd, Rm`	1	2
`SBFX Rd, Rn, #lsb, #width`	1	2
`BFC Rd, #lsb, #width`	1	2
`BFI Rd, Rn, #lsb, #width`	1	2
NOTE: Shifted operands and shift amounts are needed one cycle early.

Multiply instructions	Issue cycles	Result latency
`MUL Rd, Rn, Rm`	1	3
`MLA Rd, Rn, Rm, Ra`	1	3¹
`SMULL RdLo, RdHi, Rn, Rm`	1	3
`SMLAL RdLo, RdHi, Rn, Rm`	1	3¹
`SMMUL Rd, Rn, Rm`	1	3
`SMMLA Rd, Rn, Rm, Ra`	1	3¹
`SMULBB Rd, Rn, Rm`	1	3
`SMLABB Rd, Rn, Rm, Ra`	1	3¹
`SMULWB Rd, Rn, Rm`	1	3
`SMLAWB Rd, Rn, Rm, Ra`	1	3¹
`SMUAD Rd, Rn, Rm`	1	3
¹ Accumulator forwarding allows back-to-back `MLA` instructions without delay.

Divide instructions	Issue cycles	Result latency
`SDIV Rd, Rn, Rm`	4-20	6-22
`UDIV Rd, Rn, Rm`	3-19	5-21

Load/store instructions	Issue cycles	Result latency
`LDR Rt, [Rn]`	1	3
`LDR Rt, [Rn, #imm]`	1	3
`LDR Rt, [Rn, Rm]`	1	3
`LDR Rt, [Rn, Rm, LSL #imm]`	1	3
`LDRD Rt, Rt2, [Rn]`	1	3-4
`LDM Rn, {regs}`	1-8	3-10
`STR Rt, [Rn]`	1	2
`STRD Rt, Rt2, [Rn]`	1	2
`STM Rn, {regs}`	1-10	2-12
NOTE: Load results are forwarded to dependent stores without delay.

VFP instructions	Issue cycles	Result latency
`VMOV.F32 Sd, Sm`	1	4
`VMOV.F64 Dd, Dm`	1	4
`VNEG.F32 Sd, Sm`	1	4
`VNEG.F64 Dd, Dm`	1	4
`VABS.F32 Sd, Sm`	1	4
`VABS.F64 Dd, Dm`	1	4
`VADD.F32 Sd, Sn, Sm`	1	4
`VADD.F64 Dd, Dn, Dm`	1	4
`VMUL.F32 Sd, Sn, Sm`	1	4
`VMUL.F64 Dd, Dn, Dm`	4	7
`VMLA.F32 Sd, Sn, Sm`	1	8¹
`VMLA.F64 Dd, Dn, Dm`	4	11²
`VFMA.F32 Sd, Sn, Sm`	1	8¹
`VFMA.F64 Dd, Dn, Dm`	5	8²
`VDIV.F32 Sd, Sn, Sm`	15	18
`VDIV.F64 Dd, Dn, Dm`	29	32
`VSQRT.F32 Sd, Sm`	14	17
`VSQRT.F64 Dd, Dm`	28	31
`VCVT.F32.F64 Sd, Dm`	1	4
`VCVT.F64.F32 Dd, Sm`	1	4
`VCVT.F32.S32 Sd, Sm`	1	4
`VCVT.F64.S32 Dd, Sm`	1	4
`VCVT.S32.F32 Sd, Sm`	1	4
`VCVT.S32.F64 Sd, Dm`	1	4
`VCVT.F32.S32 Sd, Sd, #fbits`	1	4
`VCVT.F64.S32 Dd, Dd, #fbits`	1	4
`VCVT.S32.F32 Sd, Sd, #fbits`	1	4
`VCVT.S32.F64 Dd, Dd, #fbits`	1	4
¹ 5 cycles with dependency only on accumulator.
² 8 cycles with dependency only on accumulator.

NEON integer instructions	Issue cycles	Result latency
`VADD.I8 Dd, Dn, Dm`	1	4
`VADDL.S8 Qd, Dn, Dm`	2	4
`VADD.I8 Qd, Qn, Qm`	2	4
`VMUL.I8 Dd, Dn, Dm`	2	4
`VMULL.S8 Qd, Dn, Dm`	2	4
`VMUL.I8 Qd, Qn, Qm`	4	4
`VMLA.I8 Dd, Dn, Dm`	2	4
`VMLAL.S8 Qd, Dn, Dm`	2	4
`VMLA.I8 Qd, Qn, Qm`	4	4
`VADD.I16 Dd, Dn, Dm`	1	4
`VADDL.S16 Qd, Dn, Dm`	2	4
`VADD.I16 Qd, Qn, Qm`	2	4
`VMUL.I16 Dd, Dn, Dm`	1	4
`VMULL.S16 Qd, Dn, Dm`	2	4
`VMUL.I16 Qd, Qn, Qm`	2	4
`VMLA.I16 Dd, Dn, Dm`	1	4
`VMLAL.S16 Qd, Dn, Dm`	2	4
`VMLA.I16 Qd, Qn, Qm`	2	4
`VADD.I32 Dd, Dn, Dm`	1	4
`VADDL.S32 Qd, Dn, Dm`	2	4
`VADD.I32 Qd, Qn, Qm`	2	4
`VMUL.I32 Dd, Dn, Dm`	2	4
`VMULL.S32 Qd, Dn, Dm`	2	4
`VMUL.I32 Qd, Qn, Qm`	4	4
`VMLA.I32 Dd, Dn, Dm`	2	4
`VMLAL.S32 Qd, Dn, Dm`	2	4
`VMLA.I32 Qd, Qn, Qm`	4	4

NEON floating-point instructions	Issue cycles	Result latency
`VADD.F32 Dd, Dn, Dm`	2	4
`VADD.F32 Qd, Qn, Qm`	4	4
`VMUL.F32 Dd, Dn, Dm`	2	4
`VMUL.F32 Qd, Qn, Qm`	4	4
`VMLA.F32 Dd, Dn, Dm`	2	8¹
`VMLA.F32 Qd, Qn, Qm`	4	8¹
¹ 5 cycles with dependency only on accumulator.

NEON permute instructions	Issue cycles	Result latency
`VEXT.n Dd, Dn, Dm, #imm`	1	4
`VEXT.n Qd, Qn, Qm, #imm`	2	5
`VTRN.n Dd, Dn, Dm`	2	5
`VTRN.n Qd, Qn, Qm`	4	5
`VUZP.n Dd, Dn, Dm`	2	5
`VUZP.n Qd, Qn, Qm`	4	6
`VZIP.n Dd, Dn, Dm`	2	5
`VZIP.n Qd, Qn, Qm`	4	6
`VTBL.8 Dd, {Dn}, Dm`	1	4
`VTBL.8 Dd, {Dn-Dn+1}, Dm`	1	4
`VTBL.8 Dd, {Dn-Dn+2}, Dm`	2	5
`VTBL.8 Dd, {Dn-Dn+3}, Dm`	2	5
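
Footnote ¹ of the VFP table gives a 5-cycle latency for `VMLA.F32` when only the accumulator depends on the previous instruction. The Python sketch below estimates what that means for a long multiply-accumulate loop that is split over several independent accumulators. The function name and the formula are assumptions made for illustration: it is a rough lower bound that takes the larger of the issue limit and the longest dependency chain, and it was not measured on hardware.

```python
import math


def vmla_chain_cycles(n_ops, accumulators, issue=1, acc_latency=5):
    """Rough lower bound on cycles for n_ops scalar VMLA.F32 instructions.

    The defaults are the table values: 1 issue cycle, and 5 cycles of
    latency when only the accumulator depends on the previous result.
    """
    per_accumulator = math.ceil(n_ops / accumulators)
    issue_bound = n_ops * issue
    chain_bound = per_accumulator * acc_latency
    return max(issue_bound, chain_bound)


for k in (1, 2, 5):
    print(k, vmla_chain_cycles(100, k))
# 1 500
# 2 250
# 5 100
```

Under these assumptions, five independent accumulators are enough to hide the accumulator latency, after which the loop is limited by the one instruction per cycle issue rate.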

31 Responses to Cortex-A7 instruction cycle timings

mike says:

Thursday, 15th May, 2014 at 9:29 pm

Have you seen any good documentation about what instructions can dual issue? I’ve heard that NEON/VFP instructions are single issue, but not much else.
- Mans says:
  
  Thursday, 15th May, 2014 at 10:36 pm
  
  I have not found any instruction pairs which can truly dual-issue. Some combinations can execute partially in parallel, but only one instruction can begin executing each cycle. If anything, I’d expect (predicted) branch instructions (these are often handled separately from the main execution pipelines) to possibly dual-issue, but this is tricky to measure.
  - Mike says:
    
    Thursday, 15th May, 2014 at 10:48 pm
    
    That’s interesting. The ARM whitepapers (pre-release at least) mention that it has some ability to dual-issue:
    
    http://renesasmobile.com/share/news/2012/ARM-big.LITTLE-whitepaper.pdf
    
    I assumed at least ALU and load/stores could issue on the same cycle if they were independent.
    - Mans says:
      
      Thursday, 15th May, 2014 at 11:23 pm
      
      My measurements suggest that this is not the case in reality. A sequence of independent LDR/ADD pairs needs exactly one cycle per instruction.
      - mike says:
        
        Friday, 16th May, 2014 at 1:14 am
        
        Thanks for checking!
  - Siarhei Siamashka says:
    
    Saturday, 17th May, 2014 at 1:14 pm
    
    ARM Cortex-A7 can dual-issue instructions with immediate operands or, more likely, those that need only a single read from the register file. For example, “add r0, r1, r2” is bad, but “add r0, r1, #1” is good for it. The “mov” instruction is also good for dual-issue.
    
    Personally, I suspect that they did it primarily to improve handling of the poorly generated code from bad compilers. Now redundant register moves are not that expensive anymore! And if this hypothesis is plausible, maybe they did not bother supporting ALU/LSU dual-issue also because modern compilers are too dumb to make any good use of it.
    - Mans says:
      
      Saturday, 17th May, 2014 at 3:12 pm
      
      You’re right, MOV and ADD immediate can dual-issue. Table updated to reflect this.
Janne says:

Thursday, 15th May, 2014 at 11:36 pm

Are the vmul/vmla.i16 results correct? Would be a little strange if i16 had twice the throughput of i8 and i32.
- Mans says:
  
  Friday, 16th May, 2014 at 12:31 am
  
  It seems strange, but I find nothing wrong in my code. A plausible explanation is that there are four 16-bit multipliers forcing VMUL.I8 to execute as two uops while VMUL.I16 can issue in a single cycle.
Christine says:

Wednesday, 11th June, 2014 at 8:12 am

Hello. Could you please tell us which tools (e.g. benchmarks, code, etc.) you used to measure these instruction latencies? My research needs to measure some instruction latencies, too.
Thanks a lot!
- Mans says:
  
  Thursday, 12th June, 2014 at 12:06 pm
  
  To measure the throughput of the ADD instruction, I executed ADD r0, r1, r2 a million times in a loop unrolled enough that the branches become insignificant and measured the time using the CPU cycle counter. This loop takes a million cycles to run, so the throughput is one instruction per cycle. To get the latency, instead repeat ADD r0, r0, r1 so each depends on the previous. This also gives one cycle per instruction, which means the result is ready to be used by an instruction issued one cycle later. Repeat for other instructions. I used a lot of macros.
  - Christine says:
    
    Friday, 13th June, 2014 at 6:16 pm
    
    Thank you very much!
    This really helps me a lot!
  - Senchih says:
    
    Wednesday, 29th July, 2015 at 3:39 am
    
    Do you do this in a bare-metal environment or a Linux-based environment? I tried to measure instruction cycles with the cycle counter on a Linux-based system, but I get different values every time. For example, I measure 100 mov instructions in a for loop (100 iterations), and it takes about 5000~7000 cycles to complete. The unstable statistics make me feel sick. How can I make my measurements more accurate? Thank you.
    - Mans says:
      
      Wednesday, 29th July, 2015 at 12:04 pm
      
      I run under Linux, measuring the cycles needed for a million instructions using a rather unrolled loop five times and take the minimum.
      - Senchih says:
        
        Friday, 31st July, 2015 at 8:24 pm
        
        Thank you, genius. Thanks to your sharing, I can now confidently write the scripts to measure tons of instructions. Thanks again.
Fabjan says:

Saturday, 4th October, 2014 at 2:20 pm

Thanks, your list helped me.
According to http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/CJAJIIGG.html there are many more instructions for NEON, like VRECPE and VRECPS for reciprocal. Would you also add those and some others to your list, or aren’t they supported on the A7?
Thanks in advance :)
- Mans says:
  
  Sunday, 5th October, 2014 at 12:03 pm
  
  All instructions are supported by the A7. It’s just a bit tedious to measure them, so I started with the most commonly used ones.
Linuxhippy says:

Sunday, 15th February, 2015 at 6:24 pm

Thanks a lot for your detailed analysis. It clarifies that the A7 is a weaker design than the marketing material tries to make you believe.

Still, I find it inexcusable that this kind of information is not included in the official microarchitecture documentation. Is an assembly programmer really expected to measure stuff like this him/herself in order to write optimally scheduled code?
Stroller says:

Friday, 26th June, 2015 at 6:09 pm

Is the source/assembly code to generate this information available somewhere? I would like to gather this information for some other ARM cores — A8, A9, A15, etc. Please send me a message. I would greatly appreciate your help.
- Mans says:
  
  Saturday, 27th June, 2015 at 11:35 am
  
  The timings for A8 and A9 are published on the ARM website. A15 is much trickier to measure since it has out of order multiple issue pipelines.
  - Matthijs van Duin says:
    
    Thursday, 18th August, 2016 at 12:35 pm
    
    Beware of the A8 timings published in the TRM:
    http://www.avison.me.uk/ben/programming/cortex-a8.html
    
    I did a batch of NEON timings on the A8 a while back and noticed the devil can be in the details sometimes. For example some paper mentioned “extensive support of key forwarding paths” which immediately made me wonder what exactly the unsupported non-key forwarding paths were then…
    
    Well, between NEON integer and floating-point pipelines for example. This means if you VMOV a source or destination operand of a floating-point operation then you’ll need to keep a minimum issue distance between those two instructions (iirc 7 cycles) or the cpu will do it for you.
    - Matthijs van Duin says:
      
      Thursday, 18th August, 2016 at 12:38 pm
      
      (To clarify: the timings I linked to aren’t mine, although they did motivate me to perform some Neon timings myself, which he didn’t.)
ravi says:

Thursday, 31st December, 2015 at 4:14 am

I need the result latency of some other NEON floating-point instructions, such as `vceq.f32 q1, q2, #0` and `vbsl.f32 q1, q4, q3`.
Martin C says:

Thursday, 5th January, 2017 at 8:59 pm

Before I found this very helpful web page, I measured a few instructions (only add, sub, mul, div, sqrt) on a Cortex-M7 (Atmel SAME70).
The results are the same for the latency (I had only one calc instruction per loop).
So it seems that the VFP part of the M7 and the A7 is the same.
Perhaps this is documented somewhere, but I don’t know …
xianyi says:

Sunday, 8th January, 2017 at 1:22 pm

Thank you for the timing.

Where could I find similar timings for the Cortex-A17?
- Mans says:
  
  Monday, 23rd January, 2017 at 9:50 am
  
  Sorry, I don’t have that information.
ARGHAVAN MOHAMMADHASSANI says:

Saturday, 16th December, 2017 at 10:18 am

Hello,
Is there any documentation for the Cortex-A5 that lists latencies like this?
I could not find anything.
- Mans says:
  
  Saturday, 16th December, 2017 at 11:39 am
  
  I don’t know of any such documentation either.
Simon Marsden says:

Tuesday, 13th February, 2018 at 10:33 am

Really useful. Thank you for taking the time to post this.
IK says:

Friday, 28th December, 2018 at 12:15 pm

Can you please let me know the cycle details for the VLD1.F32 instruction?
ARM_Researcher says:

Friday, 11th January, 2019 at 9:59 pm

Hello,
Thanks. This list is great, but I’m confused!
Some researchers in a scientific paper (“Micro-architectural simulation of embedded core heterogeneity with gem5 and McPAT”) have measured A7 instruction timings, but their results are different from yours. Could you explain that to me, please?
