x86 - w3toppers.com

Half-precision floating-point arithmetic on Intel chips

related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture – has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info. Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE754 binary16 format as F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16 which just has conversion to/from … Read more
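To make the format difference mentioned above concrete, here is a minimal, portable sketch that decodes both 16-bit encodings in software. It assumes the standard layouts: IEEE 754 binary16 has 1 sign, 5 exponent, and 10 mantissa bits, while bfloat16 is simply the upper half of a binary32. The function names `fp16_to_float` and `bf16_to_float` are made up for this illustration; real code on supporting hardware would more likely use the F16C or AVX512 conversion instructions.

```c
#include <stdint.h>
#include <string.h>

/* bfloat16: 1 sign, 8 exponent, 7 mantissa bits, i.e. the top 16 bits
   of a binary32. Widening is just a shift into the high half. */
float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* IEEE 754 binary16: 1 sign, 5 exponent (bias 15), 10 mantissa bits. */
float fp16_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0) {
        /* Zero or subnormal: the value is mant * 2^-24 (exact in binary32). */
        f = (float)mant * (1.0f / 16777216.0f);
        return sign ? -f : f;
    }
    if (exp == 0x1Fu)
        bits = sign | 0x7F800000u | (mant << 13);           /* Inf or NaN */
    else
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  /* rebias 15 -> 127 */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The same bit pattern means different things in the two formats: `0x3F80` is 1.0 when read as bfloat16 but 1.875 when read as binary16, which is why the two cannot be mixed without an explicit conversion.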

What is the maximum possible IPC that can be achieved by the Intel Nehalem microarchitecture?

TL:DR: Intel Core, Nehalem, and Sandybridge / IvyBridge: a maximum of 5 IPC, including 1 macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions. (Up to 2 of these can be micro-fused store or load+ALU.) Haswell up to 9th Gen: a maximum of 6 instructions per cycle can … Read more

Are two store buffer entries needed for split line/page stores on recent Intel?

What specifically marks an x86 cache line as dirty – any write, or is an explicit change required?

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores. There has been academic research on this and there is even a patent on “eliminating silent store invalidation propagation in shared memory cache coherency protocols”. (Googling ‘”silent store” cache’ if you are interested in more.) For x86, … Read more

Counting machine instructions using gdb

Try this:

set pagination off
set $count = 0
while $pc != 0xyourstoppingaddress
  stepi
  set $count++
end
print $count

Then go get a cup of coffee. Or a long lunch.

Why did Intel change the static branch prediction mechanism over these years?

The primary reason why static prediction is not favored in modern designs, to the point of perhaps not even being present, is that static predictions occur too late in the pipeline compared to dynamic predictions. The basic issue is that branch directions and target locations must be known before fetching them, but static predictions can … Read more

Bubble sort in x86 (masm32), the sort I wrote doesn’t work

Figured out what was wrong – the print statements in the middle of the program were hosing my memory. Here is the working sort. Thanks for the help everyone!

.data
aa DWORD 10 DUP(5, 7, 6, 1, 4, 3, 9, 2, 10, 8)
count DWORD -1 ; DB 8-bits, DW 16-bit, DWORD 32, WORD 16 … Read more

Are load ops deallocated from the RS when they dispatch, complete or some other time?

The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights. On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, … Read more

Find the first instance of a character using simd

You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that’s the optimal way to implement this for Intel CPUs at least. PMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector and movemask them separately and … Read more
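The compare-then-movemask approach described above could look roughly like the following sketch. It assumes an x86 target and GCC or Clang (it relies on the `target` attribute, `__builtin_cpu_supports`, and `__builtin_ctz`). The names `find_first` and `find_first_avx2` are hypothetical, and returning `len` for "not found" is a choice made for this example only.

```c
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first byte equal to c in buf[0..len),
   or len if there is none. Only call this on CPUs with AVX2. */
__attribute__((target("avx2")))
static size_t find_first_avx2(const unsigned char *buf, size_t len,
                              unsigned char c)
{
    const __m256i needle = _mm256_set1_epi8((char)c);
    size_t i = 0;

    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i eq    = _mm256_cmpeq_epi8(chunk, needle);
        /* One bit per byte; bit 0 corresponds to the lowest address. */
        unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
        if (mask != 0)
            return i + (size_t)__builtin_ctz(mask);
    }
    /* Scalar tail for the last len % 32 bytes. */
    for (; i < len; i++)
        if (buf[i] == c)
            return i;
    return len;
}

/* Dispatch wrapper: falls back to memchr when AVX2 is unavailable. */
size_t find_first(const void *buf, size_t len, unsigned char c)
{
    if (__builtin_cpu_supports("avx2"))
        return find_first_avx2((const unsigned char *)buf, len, c);

    const void *p = memchr(buf, c, len);
    return p ? (size_t)((const unsigned char *)p - (const unsigned char *)buf)
             : len;
}
```

A single 32-bit mask covers the whole 256-bit vector, so one bit-scan finds the first match without splitting the vector into its two 128-bit lanes.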

How do the store buffer and Line Fill Buffer interact with each other?

Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests? The store buffer is used to track stores, in order, both before they retire and after they retire but before they commit to the L1 cache. The store buffer conceptually is a totally local thing which … Read more