Return address prediction stack buffer vs stack-stored return address?

Predictors are normally part of the fetch stage, in order to determine which instructions to fetch next. This takes place before the processor has decoded the instructions, so it doesn’t even know with certainty that a branch instruction exists. Like all predictors, the intent of the return address predictor is to get the direction / … Read more
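To make the idea of a return address prediction buffer concrete, here is a toy software model of one. It is only a sketch: the class name, the 16-entry depth and the "return 0 when empty" convention are assumptions made for illustration, not a description of any particular processor.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy model of a return address prediction buffer: a small circular
// buffer that is pushed on calls and popped on returns.
// The depth of 16 is an arbitrary, assumed value.
class ReturnAddressStack {
public:
    static constexpr std::size_t kDepth = 16;

    // On a call: remember the address of the instruction after the call.
    void on_call(std::uint64_t return_address) {
        top_ = (top_ + 1) % kDepth;
        entries_[top_] = return_address;
        if (count_ < kDepth) ++count_;  // when full, the oldest entry is overwritten
    }

    // On a return: pop the most recent entry and use it as the predicted target.
    // Returns 0 when the buffer is empty (no prediction available).
    std::uint64_t on_return() {
        if (count_ == 0) return 0;
        const std::uint64_t predicted = entries_[top_];
        top_ = (top_ + kDepth - 1) % kDepth;
        --count_;
        return predicted;
    }

    // The architecturally correct address is the one stored on the stack in
    // memory; the prediction is checked against it once it is known.
    static bool mispredicted(std::uint64_t predicted, std::uint64_t actual) {
        return predicted != actual;
    }

private:
    std::array<std::uint64_t, kDepth> entries_{};
    std::size_t top_ = kDepth - 1;
    std::size_t count_ = 0;
};
```

Nested calls pop in reverse order, so two `on_call` pushes followed by two `on_return` pops yield the second address first. If software overwrites the return address stored on the stack, `mispredicted` reports the mismatch in this model.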

How does Linux perf calculate the cache-references and cache-misses events

The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:

    523,288,816    cache-references         (architectural event: LLC Reference)
    205,331,370    cache-misses             (architectural event: LLC Misses)
    237,794,728    L1-dcache-load-misses    L1D.REPLACEMENT
    3,495,080,007  L1-dcache-loads          MEM_INST_RETIRED.ALL_LOADS
    2,039,344,725  L1-dcache-stores         MEM_INST_RETIRED.ALL_STORES
    531,452,853    L1-icache-load-misses    ICACHE_64B.IFTAG_MISS
    77,062,627     LLC-loads                OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
    27,462,249     LLC-load-misses          … Read more
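As a small worked example, the cache-misses and cache-references counts quoted above can be combined into a miss ratio. The helper name `miss_ratio` is hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of last-level cache references that missed.
// `references` must be non-zero.
double miss_ratio(std::uint64_t misses, std::uint64_t references) {
    assert(references > 0);
    return static_cast<double>(misses) / static_cast<double>(references);
}
```

Calling `miss_ratio(205331370, 523288816)` with the counts from the excerpt gives a value a little above 0.39, meaning roughly 39% of the counted LLC references were misses.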

CPUID implementations in C++

Accessing raw CPUID information is actually very easy. Here is a C++ class for that, which works on Windows, Linux and OSX:

    #ifndef CPUID_H
    #define CPUID_H

    #ifdef _WIN32
    #include <limits.h>
    #include <intrin.h>
    typedef unsigned __int32 uint32_t;
    #else
    #include <stdint.h>
    #endif

    class CPUID {
      uint32_t regs[4];

    public:
      explicit CPUID(unsigned i) {
    #ifdef _WIN32
        __cpuid((int *)regs, (int)i);
    … Read more
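Once the four registers are available, they still have to be interpreted. As one illustration, CPUID leaf 0 returns the 12-character vendor string in EBX, EDX, ECX (in that order). The helper below is a hypothetical sketch that decodes those three values; it does not execute CPUID itself, so it takes the register contents as plain arguments.

```cpp
#include <cstdint>
#include <string>

// Append the four bytes of a register, least significant byte first.
inline void append_le(std::string& out, std::uint32_t reg) {
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<char>((reg >> (8 * i)) & 0xFFu));
    }
}

// Decode the vendor string from the leaf 0 register values.
// Note the order: EBX first, then EDX, then ECX.
inline std::string vendor_from_leaf0(std::uint32_t ebx,
                                     std::uint32_t ecx,
                                     std::uint32_t edx) {
    std::string vendor;
    vendor.reserve(12);
    append_le(vendor, ebx);
    append_le(vendor, edx);
    append_le(vendor, ecx);
    return vendor;
}
```

For instance, `vendor_from_leaf0(0x756e6547, 0x6c65746e, 0x49656e69)` yields `"GenuineIntel"`.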

Which is faster: x

Potentially depends on the CPU. However, all modern CPUs (x86, ARM) use a “barrel shifter” — a hardware module specifically designed to perform arbitrary shifts in constant time. So the bottom line is… no. No difference.
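A minimal sketch of what this means in code: the shift amount is just an operand, and the same function serves small and large shifts alike. The function name and the 32-bit width are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Left shift by an arbitrary amount. On a CPU with a barrel shifter the
// cost does not depend on n. n must be below 32 to stay well defined.
inline std::uint32_t shift_left(std::uint32_t x, unsigned n) {
    assert(n < 32);
    return x << n;
}
```

`shift_left(3, 1)` gives 6 and `shift_left(3, 10)` gives 3072; only the shift count differs between the two calls.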

Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

No, there are some instructions that can only decode at 1/clock. This effect is Intel-only, not AMD. Theory: the “steering” logic that sends chunks of machine code to decoders looks for patterns in the opcode byte(s) during pre-decode, and any pattern-match that might be a multi-uop instruction has to get sent to the complex decoder. To … Read more

What does “rep; nop;” mean in x86 assembly? Is it the same as the “pause” instruction?

rep; nop is indeed the same as the pause instruction (opcode F3 90). It might be used with assemblers that don’t support the pause instruction yet. On previous processors, this simply did nothing, just like nop but in two bytes. On new processors which support hyperthreading, it is used as a hint to the processor that … Read more
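In C++ the pause instruction is usually reached through the `_mm_pause()` intrinsic, and its typical home is the body of a spin-wait loop. The sketch below assumes that use case; `cpu_relax` and `spin_until_set` are hypothetical names, and on non-x86 targets the helper falls back to doing nothing.

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
// Emits pause (encoded as rep; nop, F3 90).
inline void cpu_relax() { _mm_pause(); }
#else
// Assumed fallback for other architectures: no hint at all.
inline void cpu_relax() {}
#endif

// Spin until `flag` becomes true or `max_spins` iterations have passed.
// Returns the number of iterations spent waiting.
inline unsigned spin_until_set(const std::atomic<bool>& flag, unsigned max_spins) {
    unsigned spins = 0;
    while (!flag.load(std::memory_order_acquire) && spins < max_spins) {
        cpu_relax();
        ++spins;
    }
    return spins;
}
```

The `max_spins` bound is there so the loop cannot hang in this example; a real lock would more likely fall back to yielding or blocking once the bound is reached.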