Cortex-A7 instruction cycle timings

The Cortex-A7 ARM core is a popular choice in low-power and low-cost designs. Unfortunately, the public TRM does not include instruction timing information. It does reveal that execution is in-order, which makes measuring the throughput and latency of individual instructions relatively straightforward.

The table below lists the measured issue cycles (inverse throughput) and result latency of some commonly used instructions.

Note that in some cases the perceived latency depends on the instruction consuming the result. Most of the values were measured with the result fed back as an input to another instance of the same instruction. For instructions with multiple outputs, the latencies of the individual result registers may also differ.

Finally, although instruction issue is in-order, completion is out of order, allowing independent instructions to issue and complete unimpeded while a multi-cycle instruction is executing in another unit. For example, a 3-cycle MUL instruction does not block ADD instructions following it in program order.
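
The interplay between issue cycles and result latency can be illustrated with a toy scoreboard model. The Python sketch below is not a description of the real pipeline: it assumes one instruction issues at a time (so it ignores the 1/2 entries in the ALU table), it ignores the early-operand requirement for shifts, and the names `TIMINGS` and `issue_times` are made up for illustration. Using the MUL and ADD values from the tables, it shows an independent ADD issuing directly behind a MUL, while an ADD that consumes the MUL result waits out the 3-cycle latency.

```python
# Toy in-order issue model; a simplification, not the real Cortex-A7 pipeline.
# Each entry is (issue cycles, result latency), taken from the tables.
TIMINGS = {
    "MUL": (1, 3),
    "ADD": (1, 1),
}

def issue_times(program, timings=TIMINGS):
    """Return the cycle at which each instruction issues.

    program is a list of (opcode, dest, sources) tuples.
    """
    ready = {}      # register -> cycle when its value becomes available
    next_slot = 0   # earliest cycle at which the next instruction may issue
    times = []
    for opcode, dest, sources in program:
        issue, latency = timings[opcode]
        # In-order issue: wait for the issue slot and for every source register.
        start = max([next_slot] + [ready.get(reg, 0) for reg in sources])
        times.append(start)
        ready[dest] = start + latency
        next_slot = start + issue
    return times

program = [
    ("MUL", "r0", ("r1", "r2")),
    ("ADD", "r3", ("r4", "r5")),  # independent of the MUL
    ("ADD", "r6", ("r0", "r3")),  # needs the MUL result
]
print(issue_times(program))  # [0, 1, 3]
```

Under these assumptions the second ADD stalls until cycle 3, so any independent work placed between it and the MUL is effectively free.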

ALU instructions                  Issue cycles  Result latency
MOV Rd, Rm                        1/2           1
ADD Rd, Rn, #imm                  1/2           1
ADD Rd, Rn, Rm                    1             1
ADD Rd, Rn, Rm, LSL #imm          1             1
ADD Rd, Rn, Rm, LSL Rs            1             1
LSL Rd, Rn, #imm                  1             2
LSL Rd, Rn, Rs                    1             2
QADD Rd, Rn, Rm                   1             2
QADD8 Rd, Rn, Rm                  1             2
QADD16 Rd, Rn, Rm                 1             2
CLZ Rd, Rm                        1             1
RBIT Rd, Rm                       1             2
REV Rd, Rm                        1             2
SBFX Rd, Rn, #lsb, #width         1             2
BFC Rd, #lsb, #width              1             2
BFI Rd, Rn, #lsb, #width          1             2
NOTE: Shifted operands and shift amounts are needed one cycle early.

Multiply instructions             Issue cycles  Result latency
MUL Rd, Rn, Rm                    1             3
MLA Rd, Rn, Rm, Ra                1             3 [1]
SMULL Rd, RdHi, Rn, Rm            1             3
SMLAL Rd, RdHi, Rn, Rm            1             3 [1]
SMMUL Rd, Rn, Rm                  1             3
SMMLA Rd, Rn, Rm, Ra              1             3 [1]
SMULBB Rd, Rn, Rm                 1             3
SMLABB Rd, Rn, Rm, Ra             1             3 [1]
SMULWB Rd, Rn, Rm                 1             3
SMLAWB Rd, Rn, Rm, Ra             1             3 [1]
SMUAD Rd, Rn, Rm                  1             3
[1] Accumulator forwarding allows back to back MLA instructions without delay.
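
To get a feel for what accumulator forwarding saves, the following Python sketch compares a chain of MLA instructions that all accumulate into one register, with and without forwarding. It is a back-of-the-envelope model only: `MLA_ACC_LATENCY = 1` is an assumption that reads "without delay" as one dependent MLA per cycle, the multiplicands are assumed to be ready well in advance, and the function name is hypothetical.

```python
MLA_ISSUE = 1         # issue cycles, from the table
MLA_LATENCY = 3       # result latency seen by a general consumer, from the table
MLA_ACC_LATENCY = 1   # assumption: "without delay" read as one MLA per cycle

def mla_chain_cycles(n, acc_latency):
    """Cycle at which the result of the last of n chained MLAs is available.

    acc_latency is the latency seen by the next MLA's accumulator input.
    """
    start = 0
    for _ in range(n - 1):
        # Each following MLA waits for the issue slot and for the accumulator.
        start += max(MLA_ISSUE, acc_latency)
    return start + MLA_LATENCY

print(mla_chain_cycles(8, MLA_ACC_LATENCY))  # 10, with forwarding
print(mla_chain_cycles(8, MLA_LATENCY))      # 24, if the full latency applied
```

In this model, an eight-term multiply-accumulate chain finishes in 10 cycles rather than 24, which is why a single accumulator is usually good enough for integer MLA loops on this core.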

Divide instructions               Issue cycles  Result latency
SDIV Rd, Rn, Rm                   4-20          6-22
UDIV Rd, Rn, Rm                   3-19          5-21

Load/store instructions           Issue cycles  Result latency
LDR Rt, [Rn]                      1             3
LDR Rt, [Rn, #imm]                1             3
LDR Rt, [Rn, Rm]                  1             3
LDR Rt, [Rn, Rm, LSL #imm]        1             3
LDRD Rt, Rt2, [Rn]                1             3-4
LDM Rn, {regs}                    1-8           3-10
STR Rt, [Rn]                      1             2
STRD Rt, Rt2, [Rn]                1             2
STM Rn, {regs}                    1-10          2-12
NOTE: Load results are forwarded to dependent stores without delay.

VFP instructions                  Issue cycles  Result latency
VMOV.F32 Sd, Sm                   1             4
VMOV.F64 Dd, Dm                   1             4
VNEG.F32 Sd, Sm                   1             4
VNEG.F64 Dd, Dm                   1             4
VABS.F32 Sd, Sm                   1             4
VABS.F64 Dd, Dm                   1             4
VADD.F32 Sd, Sn, Sm               1             4
VADD.F64 Dd, Dn, Dm               1             4
VMUL.F32 Sd, Sn, Sm               1             4
VMUL.F64 Dd, Dn, Dm               4             7
VMLA.F32 Sd, Sn, Sm               1             8 [1]
VMLA.F64 Dd, Dn, Dm               4             11 [2]
VFMA.F32 Sd, Sn, Sm               1             8 [1]
VFMA.F64 Dd, Dn, Dm               5             8 [2]
VDIV.F32 Sd, Sn, Sm               15            18
VDIV.F64 Dd, Dn, Dm               29            32
VSQRT.F32 Sd, Sm                  14            17
VSQRT.F64 Dd, Dm                  28            31
VCVT.F32.F64 Sd, Dm               1             4
VCVT.F64.F32 Dd, Sm               1             4
VCVT.F32.S32 Sd, Sm               1             4
VCVT.F64.S32 Dd, Sm               1             4
VCVT.S32.F32 Sd, Sm               1             4
VCVT.S32.F64 Sd, Dm               1             4
VCVT.F32.S32 Sd, Sd, #fbits       1             4
VCVT.F64.S32 Dd, Dd, #fbits       1             4
VCVT.S32.F32 Sd, Sd, #fbits       1             4
VCVT.S32.F64 Dd, Dd, #fbits       1             4
[1] 5 cycles with dependency only on accumulator.
[2] 8 cycles with dependency only on accumulator.
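
The accumulator-only latencies matter for sums of products, where a single accumulator serialises every VMLA. The sketch below models the usual workaround of spreading the work over several independent accumulators used round-robin. It is a simplified model rather than a measurement: it uses the VMLA.F32 values from the table (issue cycles 1, accumulator-only latency 5), assumes the multiplicands are always ready, ignores the final step of adding the partial sums together, and the function name is hypothetical.

```python
# Simplified model of a run of VMLA.F32 Sd, Sn, Sm instructions.
VMLA_ISSUE = 1         # issue cycles, from the VFP table
VMLA_ACC_LATENCY = 5   # latency with a dependency only on the accumulator

def vmla_cycles(n, accumulators):
    """Cycle at which the last of n VMLA results could feed another VMLA
    as its accumulator, with the accumulators used round-robin."""
    ready = [0] * accumulators   # cycle when each accumulator is next usable
    next_slot = 0                # earliest cycle the next VMLA may issue
    for i in range(n):
        acc = i % accumulators
        # In-order issue: wait for the slot and for this accumulator.
        start = max(next_slot, ready[acc])
        ready[acc] = start + VMLA_ACC_LATENCY
        next_slot = start + VMLA_ISSUE
    return max(ready)

print(vmla_cycles(10, 1))  # 50
print(vmla_cycles(10, 5))  # 14
```

With one accumulator each VMLA waits 5 cycles for its predecessor. With five, the model issues one VMLA per cycle, so under these assumptions further accumulators bring no additional benefit.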

NEON integer instructions         Issue cycles  Result latency
VADD.I8 Dd, Dn, Dm                1             4
VADDL.S8 Qd, Dn, Dm               2             4
VADD.I8 Qd, Qn, Qm                2             4
VMUL.I8 Dd, Dn, Dm                2             4
VMULL.S8 Qd, Dn, Dm               2             4
VMUL.I8 Qd, Qn, Qm                4             4
VMLA.I8 Dd, Dn, Dm                2             4
VMLAL.S8 Qd, Dn, Dm               2             4
VMLA.I8 Qd, Qn, Qm                4             4
VADD.I16 Dd, Dn, Dm               1             4
VADDL.S16 Qd, Dn, Dm              2             4
VADD.I16 Qd, Qn, Qm               2             4
VMUL.I16 Dd, Dn, Dm               1             4
VMULL.S16 Qd, Dn, Dm              2             4
VMUL.I16 Qd, Qn, Qm               2             4
VMLA.I16 Dd, Dn, Dm               1             4
VMLAL.S16 Qd, Dn, Dm              2             4
VMLA.I16 Qd, Qn, Qm               2             4
VADD.I32 Dd, Dn, Dm               1             4
VADDL.S32 Qd, Dn, Dm              2             4
VADD.I32 Qd, Qn, Qm               2             4
VMUL.I32 Dd, Dn, Dm               2             4
VMULL.S32 Qd, Dn, Dm              2             4
VMUL.I32 Qd, Qn, Qm               4             4
VMLA.I32 Dd, Dn, Dm               2             4
VMLAL.S32 Qd, Dn, Dm              2             4
VMLA.I32 Qd, Qn, Qm               4             4

NEON floating-point instructions  Issue cycles  Result latency
VADD.F32 Dd, Dn, Dm               2             4
VADD.F32 Qd, Qn, Qm               4             4
VMUL.F32 Dd, Dn, Dm               2             4
VMUL.F32 Qd, Qn, Qm               4             4
VMLA.F32 Dd, Dn, Dm               2             8 [1]
VMLA.F32 Qd, Qn, Qm               4             8 [1]
[1] 5 cycles with dependency only on accumulator.

NEON permute instructions         Issue cycles  Result latency
VEXT.n Dd, Dn, Dm, #imm           1             4
VEXT.n Qd, Qn, Qm, #imm           2             5
VTRN.n Dd, Dn, Dm                 2             5
VTRN.n Qd, Qn, Qm                 4             5
VUZP.n Dd, Dn, Dm                 2             5
VUZP.n Qd, Qn, Qm                 4             6
VZIP.n Dd, Dn, Dm                 2             5
VZIP.n Qd, Qn, Qm                 4             6
VTBL.8 Dd, {Dn}, Dm               1             4
VTBL.8 Dd, {Dn-Dn+1}, Dm          1             4
VTBL.8 Dd, {Dn-Dn+2}, Dm          2             5
VTBL.8 Dd, {Dn-Dn+3}, Dm          2             5