Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

No, there are some instructions that can only decode 1/clock

This effect is Intel-only, not AMD.

Theory: the “steering” logic that sends chunks of machine code to decoders looks for patterns in the opcode byte(s) during pre-decode, and any pattern-match that might be a multi-uop instruction has to get sent to the complex decoder. To save power (and latency?) it accepts some false-positive detections of instructions as being possibly multi-uop.

I think the steering logic is smart enough to look at the addressing mode to distinguish mov dword [rdi], 1 (1 uop micro-fused) from mov dword [rip+rel32], imm32 which can’t micro-fuse even in the decoders (because of RIP-relative and immediate) and thus is 2 uops. (TODO: test this, maybe with something that’s a load + immediate like rorx eax, [rdi], 4, and/or with an actual multi-uop instruction mixed in.)

Every case we’ve seen so far has been an instruction where a very similar instruction is multi-uop, as discussed in comments. Except for prefetch and popcnt; IDK what’s up with that, since popcnt is always single uop on Skylake with any operand-size, register or memory source.

Andreas Abel identified the affected instructions on Haswell (https://justpaste.it/1juoc) and Skylake (https://justpaste.it/85otd). These are the Skylake cases:

bswap r32 (1 uop) vs. bswap r64 (2 uops) differs only in the REX.W prefix, not in the opcode.
bt reg, imm or bt reg,reg is 1 uop, but bt with a memory operand is 2 or 10 uops (crazy CISC semantics with a register index into the bitstring). Same for bts/btr/btc: the memory-destination form is 3 or 11 uops.
cdq and cqo are 1 uop, but the same opcode with a 66 prefix is cwd, 2 uops on Sandybridge-family.
cbw / cwde / cdqe (opcode 98h) are all 1 uop on Skylake; perhaps they’re getting lumped in with cwd / cdq / cqo (opcode 99h), or this is leftover steering logic from some earlier uarch. I did confirm that it’s truly a decode bottleneck on Skylake by alternating with xor eax,eax to break the dependency.
all cmovcc and setcc: Some forms of cmovcc and setcc are 2 uops, since Broadwell changed to having SPAZO and CF as separate inputs to the instruction instead of needing FLAGS merging. Instead of special-casing seta/cmova and setbe/cmovbe as 2 uop instructions, all setcc and cmov instructions are steered to the complex decoder.
vpmovsx/zx with a YMM destination: vpmovzxbd ymm, xmm is 1 uop, but vpmovzxbd ymm, [rdi] can never micro-fuse so it’s 2 uops in the decoders. The steering logic doesn’t check for the register source version, at least in Skylake. In a SIMD loop, it will be running from the uop cache so this isn’t a problem. vpmovzxbd xmm, xmm isn’t affected, so the steering logic does check the vector width.
adc reg, 0 as 1 uop is a special case of adc reg, imm8 (2 uops) on Haswell and earlier. On Skylake the adc al, 0 special encoding is 2 uops for no reason, even though the 3-byte encoding is 1 uop, so that’s a separate missed-optimization in the CPU design. IIRC, adc reg, 0 can decode on any port on Skylake, since it’s a different opcode than the AL special case.
PREFETCHNTA / PREFETCHT0 /PREFETCHT1 / PREFETCHT2 – unexplained
popcnt r16/32/64, r/m – unexplained, all forms are single-uop.

Not every instruction with multi-uop forms is on the list; the steering logic apparently does more detailed checks to distinguish things like vinsertf128 and vinsertps xmm source (1 uop) from memory source (2 uops). But where there are decode slowdowns, it’s explainable by the pattern-matching for that opcode or group of opcodes not doing that extra checking. Except for popcnt and prefetch; perhaps they’re similar to some other opcode, or that’s a missed optimization in the CPU.

Experimental testing of uop cache (fast) vs. legacy decode (slow)

This proves there’s a real effect, and the bottleneck is in the legacy decoders.

Andreas’s comments indicate that xor eax,eax / setnle al seems to have a decode bottleneck of 1/clock. I found the same thing with cdq: it reads EAX and writes EDX, demonstrably runs faster from the DSB (uop cache), doesn’t involve partial registers or anything else weird, and doesn’t need a dep-breaking instruction.

Even better, being a single-byte instruction it can defeat the DSB with only a short block of instructions. (Leading to misleading results from testing on some CPUs, e.g. in Agner Fog’s tables and on https://uops.info/, e.g. SKX shown as 1c throughput.) https://www.uops.info/html-tp/SKX/CDQ-Measurements.html vs. https://www.uops.info/html-tp/CFL/CDQ-Measurements.html have inconsistent throughputs because of different testing methods: only the Coffee Lake test ever tested with a small enough unroll count (10) to not bust the DSB, finding a throughput of 0.6. (The actual throughput is 0.5 once you account for loop overhead, fully explained by back-end port pressure same as cqo. IDK why you’d find 0.6 instead of 0.55 with only one extra uop for p6 in the loop.)

(Zen can run this instruction with 0.25c throughput; no weird decode problems and handled by every integer-ALU port.)

times 10 cdq in a dec/jnz loop can run from the uop cache, and runs at 0.5c throughput on Skylake (p06), plus loop overhead which also competes for p6.

times 20 cdq is more than 3 uop cache lines for one 32-byte block of machine code, meaning the loop can only run from legacy decode (with the top of the loop aligned). On Skylake this runs at 1 cycle per cdq. Perf counters confirm MITE delivers 1 uop per cycle, rather than groups of 3 or 4 with idle cycles between.
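The uop-cache arithmetic behind those two cases can be sketched in a few lines of shell. This is only a rough model: the limits of 6 uops per uop-cache line and 3 lines per 32-byte block are the commonly documented Sandybridge-family figures, not something measured here, and dsb_lines / dsb_fits are made-up helper names. It also assumes dec/jnz macro-fuses into a single uop and that the whole loop sits in one aligned 32-byte block.

```shell
#!/bin/sh
# Assumed Sandybridge-family uop cache (DSB) limits; not measured in this test.
UOPS_PER_LINE=6   # max uops in one uop-cache line
MAX_LINES=3       # max uop-cache lines for one 32-byte block of machine code

# dsb_lines UOPS: uop-cache lines needed for UOPS uops from one 32-byte block
dsb_lines() {
    echo $(( ($1 + UOPS_PER_LINE - 1) / UOPS_PER_LINE ))   # ceiling division
}

# dsb_fits UOPS: prints "fits" if the block can be cached, else "legacy-decode"
dsb_fits() {
    if [ "$(dsb_lines "$1")" -le "$MAX_LINES" ]; then
        echo fits
    else
        echo legacy-decode
    fi
}

dsb_fits 11   # 10 cdq + macro-fused dec/jnz -> 2 lines: fits
dsb_fits 21   # 20 cdq + macro-fused dec/jnz -> 4 lines: legacy-decode
```

The real uop cache has more placement rules than this, so treat it as a sanity check on the unroll count rather than a predictor.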

default rel
%ifdef __YASM_VER__
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:
    mov  ebp, 1000000000

align 64
.loop:
    ;times 10 cdq   ; 0.5c throughput
    ;times 20 cdq   ; 1c throughput, 1 MITE uop per cycle front-end

    ; times 10 cqo        ; 0.5c throughput 2-byte insn fits uop cache
    ; times 10 cdqe       ; 1c throughput data dependency
    ;times 10 cld         ; ~4c throughput, 3 uops

    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

On my Arch Linux desktop, I built this into a static executable to run under perf:

i7-6700k with epp=balance_performance (max “turbo” = 3.9GHz)
microcode revision 0xd6 (so LSD disabled, not that it matters: loops can only run from the LSD loop buffer if all their uops are in the DSB uop cache, IIRC.)

  #   in a bash shell:
t=cdq-latency; nasm -f elf64 "$t".asm && ld -o "$t" "$t.o" && objdump -drwC -Mintel "$t" && 
  taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,frontend_retired.dsb_miss,idq.dsb_uops,idq.mite_uops,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok,idq.all_mite_cycles_4_uops ./"$t"

Disassembly:

0000000000401000 <_start>:
  401000:       bd 00 ca 9a 3b          mov    ebp,0x3b9aca00
  401005:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]
...
  40103d:       0f 1f 00                nop    DWORD PTR [rax]

0000000000401040 <_start.loop>:
  401040:       99                      cdq    
  401041:       99                      cdq    
  401042:       99                      cdq    
  401043:       99                      cdq    
...
  401052:       99                      cdq    
  401053:       99                      cdq             # 20 total CDQ
  401054:       ff cd                   dec    ebp
  401056:       75 e8                   jne    401040 <_start.loop>

0000000000401058 <_start.end>:
  401058:       31 ff                   xor    edi,edi
  40105a:       b8 e7 00 00 00          mov    eax,0xe7
  40105f:       0f 05                   syscall

Perf results:

 Performance counter stats for './cdq-latency':

          5,205.44 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 1      page-faults               #    0.000 K/sec                  
    20,124,711,776      cycles                    #    3.866 GHz                      (49.88%)
    22,015,118,295      instructions              #    1.09  insn per cycle           (59.91%)
    21,004,212,389      uops_issued.any           # 4035.049 M/sec                    (59.97%)
     1,005,872,141      frontend_retired.dsb_miss #  193.235 M/sec                    (60.03%)
                 0      idq.dsb_uops              #    0.000 K/sec                    (60.08%)
    20,997,157,414      idq.mite_uops             # 4033.694 M/sec                    (60.12%)
    19,996,447,738      idq.mite_cycles           # 3841.451 M/sec                    (40.03%)
    59,048,559,790      idq_uops_not_delivered.core # 11343.621 M/sec                   (39.97%)
       112,956,733      idq_uops_not_delivered.cycles_fe_was_ok #   21.700 M/sec                    (39.92%)
           209,490      idq.all_mite_cycles_4_uops #    0.040 M/sec                    (39.88%)

       5.206491348 seconds time elapsed

So the loop overhead (dec/jnz) happened basically for free, decoding in the same cycle as the last cdq. Counts are not exact because I used too many events in one run (with HT enabled), so perf did software multiplexing. From another run with fewer counters:

# same source, only these HW counters enabled to avoid multiplexing
          5,161.14 msec task-clock                #    1.000 CPUs utilized          

    20,107,065,550      cycles                    #    3.896 GHz                    
    20,000,134,955      idq.mite_cycles           # 3875.142 M/sec                  
    59,050,860,720      idq_uops_not_delivered.core # 11441.447 M/sec                 
        95,968,317      idq_uops_not_delivered.cycles_fe_was_ok #   18.594 M/sec

So we can see that MITE (legacy decode) was active basically every cycle, and that the front-end was basically never “ok” (i.e. the unfilled issue slots were almost never the result of a back-end stall).
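To make the ratios in those counters explicit, here is a small awk-based sketch that plugs in the numbers from the two runs above. The ratio helper is just a made-up convenience function, and the "out of 4" interpretation assumes Skylake's 4-wide issue stage. The MITE uops figure comes from the first, multiplexed run, so it is only approximate.

```shell
#!/bin/sh
# ratio A B: print A / B with 3 decimal places (hypothetical helper)
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

cdq_total=20000000000   # 20 cdq per iteration x 1,000,000,000 iterations

ratio 20107065550 "$cdq_total"   # cycles per cdq                      -> 1.005
ratio 20997157414 19996447738    # MITE uops per active MITE cycle     -> 1.050
ratio 59050860720 20107065550    # unfilled issue slots/cycle (of 4)   -> 2.937
```

About 2.94 of 4 slots going unfilled per cycle matches roughly 1 uop per cycle being delivered, with the macro-fused dec/jnz accounting for the small excess over 1.0.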

With only 10 CDQ instructions, allowing the DSB to work:

...
0000000000401040 <_start.loop>:
  401040:       99                      cdq    
  401041:       99                      cdq    
...
  401049:       99                      cdq        # 10 total CDQ insns
  40104a:       ff cd                   dec    ebp
  40104c:       75 f2                   jne    401040 <_start.loop>

 Performance counter stats for './cdq-latency' (4 runs):

          1,417.38 msec task-clock                #    1.000 CPUs utilized            ( +-  0.03% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 1      page-faults               #    0.001 K/sec                  
     5,511,283,047      cycles                    #    3.888 GHz                      ( +-  0.03% )  (49.83%)
    11,997,247,694      instructions              #    2.18  insn per cycle           ( +-  0.00% )  (59.99%)
    10,999,182,841      uops_issued.any           # 7760.224 M/sec                    ( +-  0.00% )  (60.17%)
           197,753      frontend_retired.dsb_miss #    0.140 M/sec                    ( +- 13.62% )  (60.21%)
    10,988,958,908      idq.dsb_uops              # 7753.010 M/sec                    ( +-  0.03% )  (60.21%)
        10,234,859      idq.mite_uops             #    7.221 M/sec                    ( +- 27.43% )  (60.21%)
         8,114,909      idq.mite_cycles           #    5.725 M/sec                    ( +- 26.11% )  (39.83%)
        40,588,332      idq_uops_not_delivered.core #   28.636 M/sec                    ( +- 21.83% )  (39.79%)
     5,502,581,002      idq_uops_not_delivered.cycles_fe_was_ok # 3882.221 M/sec                    ( +-  0.01% )  (39.79%)
            56,223      idq.all_mite_cycles_4_uops #    0.040 M/sec                    ( +-  3.32% )  (39.79%)

          1.417599 +- 0.000489 seconds time elapsed  ( +-  0.03% )

As reported by idq_uops_not_delivered.cycles_fe_was_ok, basically all the unused front-end uop slots were the fault of the back-end (port pressure on p0 / p6), not the front-end.
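As a cross-check on the port-pressure explanation, this sketch compares the measured cycles per iteration with a simple back-end bound. It assumes 11 fused-domain uops per iteration (10 cdq plus a macro-fused dec/jnz, consistent with the uops_issued.any count) all competing for the two ports p0 and p6; ratio is a made-up helper name, and the cycle count is the multiplexed average from the run above.

```shell
#!/bin/sh
# ratio A B: print A / B with 3 decimal places (hypothetical helper)
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

iterations=1000000000   # loop count from mov ebp, 1000000000

ratio 11 2                       # predicted: 11 uops over 2 ports -> 5.500 c/iter
ratio 5511283047 "$iterations"   # measured cycles per iteration   -> 5.511
ratio 5511283047 10000000000     # measured cycles per cdq         -> 0.551
```

The measured figure is within about 0.2% of the 5.5-cycle bound, which is what you'd expect if p0 / p6 throughput is the only bottleneck.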
