Can a processor do memory and arithmetic operations at the same time?

You’re right, a modern x86 will decode add dword [mem], 1 to 3 uops: a load, an ALU add, and a store. (This is actually a simplification of various things, including Intel’s micro-fusion and how AMD always keeps a load+ALU together in some parts of the pipeline…)

Those 3 dependent operations can’t happen at the same time because the later ones have to wait for the result of the earlier one.

But execution of independent instructions can overlap, and modern CPUs very aggressively look for and exploit “instruction level parallelism” to run your code faster than 1 uop per clock. See this answer for an intro to what a single CPU core can do in parallel, with links to more stuff, like Agner Fog’s x86 microarch guide, and David Kanter’s write-ups of Sandybridge and Bulldozer.
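
As a rough illustration of dependent versus independent work (not taken from the linked answers), here is a minimal C sketch. The function names are made up for this example. In the first function every add needs the result of the previous add, so the adds form one chain. In the second the four accumulators don't depend on each other, so an out-of-order core is free to overlap them. An optimizing compiler may transform either loop (for example by auto-vectorizing), so check the generated asm before drawing conclusions about a real build.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the previous add (one chain).
 * The loads are independent of each other in both versions. */
uint32_t sum_one_chain(const uint32_t *a, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four accumulators: four independent add chains that can overlap. */
uint32_t sum_four_chains(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;   /* unsigned wraparound keeps results equal */
}
```

Both functions return the same value for any input; only the shape of the dependency graph differs.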

But if you look at Intel’s P6 and Sandybridge microarchitecture families, a store is actually separate store-address and store-data uops. The store-address uop has no dependency on the load or ALU uop, and can write the store address into the store buffer at any time. (Intel’s optimization manual calls it the Memory Order Buffer).

To increase front-end throughput, store-address and store-data uops can decode as a micro-fused pair. For add, so can the load+alu operation, so an Intel CPU can decode add dword [rdi], 1 to 2 fused-domain uops. (The same load+add micro-fusion works for decoding add eax, [rdi] to a single uop, so any of the “simple” decoders can decode it, not just the “complex” decoder that can handle multi-uop instructions. This reduces front-end bottlenecks).

This is why add [mem], 1 is more efficient than inc [mem] on Intel CPUs, even though inc reg is just as efficient as (but smaller than) add reg,1. (inc can’t micro-fuse its load+inc, which sets flags differently than add). See INC instruction vs ADD 1: Does it matter?

But this is just helping the front-end get uops into the scheduler more quickly; the load still has to run separately from the add.

But a micro-fused load doesn’t have to wait for the rest of the whole instruction’s inputs to be ready. Consider an instruction like add [rdi], eax where RDI and EAX are both inputs to the instruction, but EAX isn’t needed until the ALU add uop. The load can execute as soon as the load-address is ready and there’s a free load execution unit (AGU + cache access). See also How are x86 uops scheduled, exactly?.
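
To make that timing argument concrete, here is a toy model in C. It is only a sketch under loud assumptions: unlimited execution ports, fixed latencies passed in by the caller, and no cache misses. The struct and function names are hypothetical and do not describe how any real scheduler is implemented. It computes the earliest cycle at which each unfused-domain uop of add [rdi], eax could start, given when RDI and EAX become ready.

```c
/* Earliest start cycle of each uop of `add [rdi], eax` (toy model). */
typedef struct {
    int load_start;           /* load uop: needs only RDI */
    int store_address_start;  /* store-address uop: needs only RDI */
    int add_start;            /* ALU uop: needs the load result and EAX */
    int store_data_start;     /* store-data uop: needs the ALU result */
} uop_times;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Assumption: latencies are constants chosen by the caller, not measured. */
uop_times schedule_add_mem_reg(int rdi_ready, int eax_ready,
                               int load_latency, int add_latency)
{
    uop_times t;
    t.load_start = rdi_ready;
    t.store_address_start = rdi_ready;
    t.add_start = max_int(t.load_start + load_latency, eax_ready);
    t.store_data_start = t.add_start + add_latency;
    return t;
}
```

With RDI ready at cycle 0 and EAX not ready until cycle 10, the model starts the load at cycle 0 and only the add waits for EAX. By the time EAX arrives, the load result is already available.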

“registers are read in Decode uOp, Store/Load in Memory uOp and we allow ourselves to take the value of a register at the Memory uOp”

All current x86 microarchitectures use out-of-order execution with register renaming (Tomasulo’s algorithm). Instructions are renamed and issued into the out-of-order part of the core (ROB and scheduler).

The physical register file isn’t read until an instruction is “dispatched” from the scheduler to an execution unit. (Or for recently-generated inputs, forwarded from other uops.)

Independent instructions can overlap their execution. For example, a Skylake CPU can sustain a throughput of 4 fused-domain / 7 unfused-domain uops per clock, including 2 loads + 1 store, in a carefully crafted loop:

.loop: ; HSW: 1.12c / iter. SKL: 1.0001c
    add edx, [rsp]           ; 1 fused-domain uop:  micro-fused load+add
    mov [rax], edi           ; 1 fused-domain uop:  micro-fused store-address+store-data
    blsi ebx, [rdi]          ; 1 fused-domain uop:  micro-fused load+bit-manip

    dec ecx
    jnz .loop                ; 1 fused-domain uop: macro-fused dec+branch runs on port 6
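
The 4 fused-domain / 7 unfused-domain figures for this loop can be checked with a little bookkeeping. The C sketch below just tabulates the counts stated in the comments: each micro-fused instruction is 1 fused-domain uop made of 2 unfused-domain uops, and the macro-fused dec+jnz is 1 uop in both domains. The table and function names are made up for illustration.

```c
#include <stddef.h>

typedef struct {
    const char *text;  /* instruction(s) as written in the loop */
    int fused;         /* fused-domain uops (front-end / issue) */
    int unfused;       /* unfused-domain uops (execution ports) */
} insn_uops;

static const insn_uops loop_body[] = {
    {"add edx, [rsp]",      1, 2},  /* load + add */
    {"mov [rax], edi",      1, 2},  /* store-address + store-data */
    {"blsi ebx, [rdi]",     1, 2},  /* load + bit-manip */
    {"dec ecx / jnz .loop", 1, 1},  /* macro-fused compare-and-branch */
};

static const size_t loop_len = sizeof loop_body / sizeof loop_body[0];

int total_fused(void)
{
    int sum = 0;
    for (size_t i = 0; i < loop_len; i++)
        sum += loop_body[i].fused;
    return sum;
}

int total_unfused(void)
{
    int sum = 0;
    for (size_t i = 0; i < loop_len; i++)
        sum += loop_body[i].unfused;
    return sum;
}
```

At 4 fused-domain uops per clock, 4 fused-domain uops per iteration gives the roughly 1 cycle per iteration measured on Skylake above.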

Sandybridge-family CPUs have an L1d cache capable of 2 reads + 1 write per clock. (Before Haswell, only 256-bit vectors could work around the AGU throughput limit, though. See How can cache be that fast?.)

Sandybridge-family front-end throughput is 4 fused-domain uops per clock, and they have lots of execution units in the back-end to handle various instruction mixes. (Haswell and later have 4 integer ALUs, 2 load ports, a store-data port, and a dedicated store-AGU for simple store addressing modes. So they can often “catch up” quickly after a cache-miss stalls execution, quickly making room in the out-of-order window to find more work to do.)
