Half-precision floating-point arithmetic on Intel chips

Related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture has some information about BFloat16 in Cooper Lake and Sapphire Rapids, plus some non-Intel information. Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE 754 binary16 format as the F16C conversion instructions, not brain-float. AVX512-FP16 also supports most math operations, unlike BF16 which just has conversion to/from …
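
To make the format distinction above concrete, here is a minimal Python sketch comparing the bit layout of IEEE 754 binary16 (1 sign, 5 exponent, 10 fraction bits) with bfloat16 (1 sign, 8 exponent, 7 fraction bits). The helper names are hypothetical, and the bfloat16 conversion simply truncates a binary32 value, which is an assumption made for brevity; hardware conversions typically round to nearest.

```python
import struct


def binary16_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 binary16 (struct format 'e')."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def bfloat16_bits(x: float) -> int:
    """Bit pattern of x as bfloat16: the upper 16 bits of its binary32 encoding.

    Assumption: plain truncation, not round-to-nearest-even.
    """
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def fields(bits: int, exp_bits: int, frac_bits: int):
    """Split a 16-bit pattern into (sign, biased exponent, fraction)."""
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction


if __name__ == "__main__":
    for value in (1.0, -2.0, 0.1):
        h = binary16_bits(value)
        b = bfloat16_bits(value)
        print(value, hex(h), fields(h, 5, 10), hex(b), fields(b, 8, 7))
```

The wider exponent is why bfloat16 covers roughly the same range as binary32, while binary16 tops out at 65504 but keeps three more fraction bits.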

What specifically marks an x86 cache line as dirty – any write, or is an explicit change required?

Currently, no x86 implementation (or that of any other ISA, as far as I know) supports optimizing silent stores. There has been academic research on this, and there is even a patent on “eliminating silent store invalidation propagation in shared memory cache coherency protocols”. (Google ‘“silent store” cache’ if you are interested in more.) For x86, …
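
As an illustration of what "not optimizing silent stores" implies, the following toy Python model tracks a dirty bit for a single write-back cache line. It is a sketch under stated assumptions, not a description of any real microarchitecture: the `CacheLine` class and the `elide_silent` flag are hypothetical, and the flag exists only to contrast the default behavior with the kind of optimization the excerpt says no current hardware performs.

```python
class CacheLine:
    """Toy model of one write-back cache line (hypothetical, not real hardware)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.dirty = False

    def store(self, offset: int, value: int, elide_silent: bool = False) -> None:
        """Write one byte into the line.

        With elide_silent=False, every store marks the line dirty, even when
        the value written equals the value already present (a silent store).
        """
        is_silent = self.data[offset] == value
        if elide_silent and is_silent:
            return  # hypothetical silent-store optimization: nothing changes
        self.data[offset] = value
        self.dirty = True


if __name__ == "__main__":
    line = CacheLine(bytes(64))
    line.store(0, 0)  # writes the value that is already there
    print("dirty after silent store:", line.dirty)
```

In this model, a line that receives only silent stores is still marked dirty and would be written back on eviction unless the hypothetical flag is enabled.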

Why did Intel change the static branch prediction mechanism over these years?

The primary reason why static prediction is not favored in modern designs, to the point of perhaps not even being present, is that static predictions occur too late in the pipeline compared to dynamic predictions. The basic issue is that branch directions and target locations must be known before the instructions at those locations can be fetched, but static predictions can …

Are load ops deallocated from the RS when they dispatch, complete or some other time?

The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights. On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, …