Intel Core2, Nehalem, and Sandybridge / IvyBridge: a maximum of 5 IPC, including 1 macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions. (Up to 2 of these can be micro-fused store or load+ALU.)
Haswell up to 9th Gen: a maximum of 6 instructions per cycle can be achieved using two pairs of macro-fusable ALU+branch instructions plus two more instructions that decode into two uops total (each potentially micro-fused). The max unfused-domain uop throughput is 7 uops per clock, according to my testing on Skylake.
Early P6-family: Pentium Pro/PII/PIII, and Pentium M. Also Pentium 4: a maximum of 3 instructions per cycle can be achieved using 3 instructions that are decoded into 3 uops. (No macro-fusion, and 3-wide decode and issue).
The max IPC on Sunny Cove may be 7, thanks to increased front-end bandwidth of 5 uops per clock.
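The arithmetic behind the figures in this list can be sketched as a toy model: the IPC ceiling is the issue width in fused-domain uops per clock, plus one extra instruction for every macro-fused pair that can issue per clock. The table below is only transcribed from the list above, not measured; the names `FRONT_ENDS` and `max_ipc` are made up for this sketch, and the Sunny Cove entry assumes it keeps Haswell's two fused pairs per clock.

```python
# Toy model, not a simulator: IPC ceiling = issue width (fused-domain uops/clock)
# + one extra instruction per macro-fused pair issued in that clock.
# Values are transcribed from the list above (assumption: Sunny Cove allows
# 2 macro-fused pairs per clock, like Haswell).
FRONT_ENDS = {
    "P6 / Pentium 4": (3, 0),
    "Core2 / Nehalem / SnB / IvB": (4, 1),
    "Haswell to 9th gen": (4, 2),
    "Sunny Cove (speculative)": (5, 2),
}


def max_ipc(issue_width, fused_pairs_per_clock):
    """Upper bound on instructions per clock for a given front-end."""
    # A macro-fused pair is 2 instructions occupying 1 issue slot,
    # so there can't be more pairs than issue slots.
    pairs = min(fused_pairs_per_clock, issue_width)
    return issue_width + pairs


for name, (width, pairs) in FRONT_ENDS.items():
    print(f"{name}: {max_ipc(width, pairs)} IPC")
```

This only captures the issue/rename limit; decode, back-end ports, and taken-branch throughput can all lower the number actually sustained.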
The out-of-order pipeline in Intel Core2 and later can issue/rename 4 fused-domain uops per clock. This is the bottleneck. Macro-fusion will combine a cmp / jcc into a single uop, but until Haswell this can only happen once per decode block.
Also decode is another important bottleneck before the uop-cache in SnB-family. (Up to 4 instructions into up-to-7 uops with a 4-1-1-1 pattern in Core2 and Nehalem; SnB-family is up-to 4 total, or up to 5 in Skylake, e.g. a 2-1-1-1 pattern from still only 4 decoders, not 5 as some sources incorrectly report). Multi-uop instructions have to decode in the first “slot”. See Agner Fog’s microarch guide for much more about the potential bottlenecks in Nehalem.
Nehalem InstLatx64 shows that nop surprisingly only has 0.33c throughput, not 0.25, but according to https://www.uops.info/table.html that’s because nop needs an ALU execution unit in CPUs before Sandybridge. Agner Fog says he didn’t detect a retirement bottleneck on Nehalem.
Even if you could arrange things so more than one macro-fused pair per 4 uops was in a loop, Nehalem has a throughput of only one fused test-and-branch uop per clock (port 5). So it couldn’t sustain more than one macro-fused compare-and-branch per clock even if some of them are not-taken. (Haswell can run not-taken branches on port 0 or port 6, so 6 IPC throughput can be sustained as long as at least one of the macro-fused branches is not-taken.)
    ;; Should run at one iteration per clock
    .l:
        mov   edx, [rsi]    ; doesn't need an ALU uop. A store would work here, too, but a NOP needs an ALU port on Nehalem.
        add   eax, edx
        inc   rsi
        cmp   rsi, rdi      ; macro-fuses
        jb    .l            ;   with this, into 1 cmp+branch uop
For ease of testing, and remove cache/memory bottlenecks, you could change it to load from the same location every time, instead of using the loop counter in the addressing mode. (As long as you avoid register-read stalls from too many cold registers.)
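As a sanity check on why that loop should hit one iteration per clock, here is a rough uop-accounting sketch. The per-clock limits are assumptions for a Nehalem-like core (4-wide issue, 3 ALU ports, 1 load port, 1 branch per clock on port 5); the names `LOOP_UOPS`, `LIMITS`, and `cycles_per_iteration` are hypothetical, and the model ignores latency chains, decode, and cache effects.

```python
from collections import Counter

# Each entry: (instructions covered, back-end resource needed).
# cmp + jb macro-fuse into one uop, so that entry covers 2 instructions.
LOOP_UOPS = [
    (1, "load"),    # mov edx, [rsi]
    (1, "alu"),     # add eax, edx
    (1, "alu"),     # inc rsi
    (2, "branch"),  # cmp rsi, rdi + jb .l (macro-fused)
]

# Assumed Nehalem-like per-clock limits (an assumption of this sketch).
ISSUE_WIDTH = 4
LIMITS = {"load": 1, "alu": 3, "branch": 1}


def cycles_per_iteration(uops):
    """Throughput bound in cycles per iteration: the tightest resource wins."""
    need = Counter(kind for _, kind in uops)
    # The fused compare-and-branch uop also occupies an ALU port (port 5).
    alu_total = need["alu"] + need["branch"]
    bounds = [
        len(uops) / ISSUE_WIDTH,              # issue/rename width
        need["load"] / LIMITS["load"],        # load port
        alu_total / LIMITS["alu"],            # ALU ports
        need["branch"] / LIMITS["branch"],    # branch unit
    ]
    return max(bounds)


instructions = sum(n for n, _ in LOOP_UOPS)
cycles = cycles_per_iteration(LOOP_UOPS)
print(f"{instructions} instructions / {cycles} cycles = {instructions / cycles} IPC")
```

Every bound comes out at exactly one cycle here, which is why the loop has no slack: swapping the load for a nop on Nehalem would push the ALU bound above one cycle.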
Note that pre-Haswell uarches only have three ALU ports. But mov loads or stores take pipeline bandwidth so there’s a benefit to having 4-wide issue/rename. It’s also useful for the front-end to be able to issue faster than the out-of-order core can execute, so there is always a buffer of work to do queued up in the scheduler, so it can find the instruction-level parallelism and get started on future loads early, and stuff like that.
I think other than load/store (including pop thanks to the stack engine), fxch might be the only fused-domain uop that doesn’t need an ALU port in Nehalem. Or maybe it actually does, like nop. On SnB-family uarches, xor same,same is handled in the rename/issue stage, and sometimes also reg-reg movs (IvB and later). nop is also never executed, unlike on Nehalem, so SnB/IvB have 0.25c throughput for nop even though they only have 3 ALU ports. mov reg,reg on Ivy Bridge can also be part of a loop that runs 4 front-end uops per clock with only 3 back-end ALU ports.
For maxing out back-end uop throughput, you need micro-fusion to get 2 back-end uops (load + ALU) through the front-end as a single fused-domain uop in decode, issue/rename, and in the ROB. https://www.agner.org/optimize/blog/read.php?i=415#852
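To make the fused-domain vs. unfused-domain distinction concrete, here is a small counting sketch. The instruction mix in `ISSUE_GROUP` is a hypothetical example chosen for illustration, not the sequence used in any test mentioned here, and `back_end_uops` is a made-up helper. It only counts uops; whether a given CPU can actually sustain the mix depends on its ports.

```python
# One issue group of 4 fused-domain uops, with the number of unfused-domain
# (back-end) uops each one expands to. Hypothetical mix, for illustration only.
ISSUE_GROUP = [
    ("add eax, [rsi]", 2),  # micro-fused load + ALU
    ("add ebx, [rdi]", 2),  # micro-fused load + ALU
    ("mov [rdx], ecx", 2),  # micro-fused store-address + store-data
    ("add r8d, 1", 1),      # plain single-uop ALU instruction
]


def back_end_uops(group, issue_width=4):
    """Count unfused-domain uops delivered by one issue group."""
    if len(group) > issue_width:
        raise ValueError("more fused-domain uops than issue slots")
    return sum(unfused for _, unfused in group)


print(back_end_uops(ISSUE_GROUP))  # 7 back-end uops from 4 front-end slots
```

Without micro-fusion the same four slots could deliver at most 4 back-end uops per clock, which is the point of the paragraph above.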