X86 prefetching optimizations: “computed goto” threaded code

Jumping directly from block to block is often a win for branch prediction, vs. returning to one parent dispatch point with a single indirect branch, especially on CPUs older than Intel Haswell.


With jumps from the tail of each block, each branch has a different branch-predictor history. It’s probably common for a given block to usually jump to the same next block, or to have a simple pattern of a couple of target addresses. This can often be predicted well because each branch individually has a simpler pattern, and the branch history is distributed over multiple branches.

If all dispatching happens from a single indirect branch, there may be only one BTB (branch target buffer) entry for it, and the pattern will be too complicated for it to predict well.
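
For concreteness, here’s a minimal sketch of the tail-dispatch pattern using GNU C’s computed-goto extension (gcc/clang). The opcode numbering and the tiny bytecode program are invented for illustration; a switch-based dispatch loop would funnel every one of these dispatches through a single indirect branch instead.

    #include <stdint.h>
    #include <stdio.h>

    /* Each handler ends with its own indirect jump, so each dispatch site
       gets its own branch-prediction history. */
    int run(const uint8_t *bytecode) {
        static void *dispatch_table[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        const uint8_t *pc = bytecode;

        goto *dispatch_table[*pc++];      /* initial dispatch */

    op_inc:
        acc++;
        goto *dispatch_table[*pc++];      /* tail dispatch: its own branch site */
    op_dec:
        acc--;
        goto *dispatch_table[*pc++];      /* another separate branch site */
    op_halt:
        return acc;
    }

    int main(void) {
        const uint8_t prog[] = { 0, 0, 1, 2 };   /* inc, inc, dec, halt */
        printf("%d\n", run(prog));               /* prints 1 */
        return 0;
    }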

Modern TAGE branch predictors in Intel Haswell and later index the BTB using recent branch history, including indirect-branch destination, which does actually work around this problem. See comments on Indexed branch overhead on X86 64 bit mode, and search for Haswell in https://danluu.com/branch-prediction/ . Complicated patterns for one branch can scatter predictions for it across many of the BTB entries.

Specifically, Branch Prediction and the Performance of Interpreters – Don’t Trust Folklore (2015) by Rohou, Swamy, and Seznec compares Nehalem, Sandy Bridge, and Haswell on interpreter benchmarks, measuring the actual mispredict rate for dispatch loops with a single switch statement. They find that Haswell does much better, likely using an ITTAGE predictor.

They don’t test AMD CPUs. AMD has published some info about its CPUs since Piledriver using perceptron neural networks for branch prediction; I don’t know how well those handle dispatch loops with a single indirect branch. (AMD since Zen 2 uses IT-TAGE as a second-level branch predictor, in addition to the hashed perceptron it kept from Zen 1.)


Darek Mihocka discusses this pattern in the context of an interpreting CPU emulator, which jumps from block to block of handlers for different instructions (or simplified uops). He goes into a lot of detail about the performance of various strategies on Core 2, Pentium 4, and AMD Phenom (it was written in 2008). Of those, the branch predictors on current CPUs are most like Core 2’s.

He eventually presents what he calls the Nostradamus Distributor pattern for checking for an early-out (handler functions return a function pointer, or a “fire escape” sentinel) in a branch-prediction-friendly way. If you don’t need that, just see the early part of the article where he talks about direct chaining of jumps between blocks vs. a central distributor.
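
If you do need the early-out, the core of that return-a-function-pointer idea looks something like the following sketch. The struct, the tiny opcode set, and the NULL "fire escape" sentinel are all invented here; Mihocka’s actual code differs.

    #include <stdint.h>

    struct vm { long acc; const uint8_t *pc; };

    /* A handler returns the next handler to run; a struct wrapper lets the
       function-pointer type refer to itself. */
    typedef struct cont { struct cont (*fn)(struct vm *); } cont;

    static cont op_inc(struct vm *vm);
    static cont op_halt(struct vm *vm);
    static const cont table[] = { { op_inc }, { op_halt } };

    static cont op_inc(struct vm *vm) {
        vm->acc++;
        return table[*vm->pc++];      /* chain to the next block's handler */
    }
    static cont op_halt(struct vm *vm) {
        (void)vm;
        return (cont){ 0 };           /* the "fire escape" sentinel */
    }

    long run(struct vm *vm) {
        cont c = table[*vm->pc++];
        while (c.fn)                  /* dispatch until a handler bails out */
            c = c.fn(vm);
        return vm->acc;
    }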

He even bemoans the lack of a code-prefetch instruction in x86. That was probably a bigger deal on Pentium 4, where initial decode to populate the trace cache was very slow compared to running from the trace cache. The Sandy Bridge family has a decoded-uop cache, but it isn’t a trace cache, and the decoders are still strong enough not to suck when the uop cache misses. Ryzen is similar.

Is there a difference between accessing data relative to the stack pointer or any other pointer?

No. You could even set rsp after jumping, so each block can have its own stack. But if you have any signal handlers installed, rsp needs to point to valid memory. Also, if you want to be able to call any normal library functions, you need rsp to work as a stack pointer, because they will want to ret.

Is there prefetching for an indirect jump (a jump to a value stored in a register)?

Prefetch into L2 could be useful if you know the branch target address long before you’re ready to execute an indirect jump. All current x86 CPUs use split L1I / L1D caches, so prefetcht0 would pollute L1D for no gain, but prefetcht1 might be useful (fetch into L2 and L3). Or it might not be useful at all, if the code is already hot in L2.
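
If you want to try it, a sketch using the standard SSE prefetch intrinsic (the function and next_block names are invented here; _mm_prefetch with _MM_HINT_T1 compiles to prefetcht1):

    #include <xmmintrin.h>

    /* Prefetch a block of JITted code into L2/L3 (not L1d) as soon as its
       address is known, well before jumping there. */
    static inline void prefetch_code_l2(const void *next_block) {
        _mm_prefetch((const char *)next_block, _MM_HINT_T1);  /* prefetcht1 */
    }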

Also useful: calculate the jump target address as early as possible, so out-of-order execution can resolve the branch while lots of work is queued up in the out-of-order core. This minimizes the potential bubble in the pipeline. Keep the calculation independent of other stuff if possible.

Best case: the address is in a register many instructions before the jmp, so as soon as the jmp gets a cycle on an execution port, it can provide the correct destination to the front-end (and re-steer it if branch prediction got it wrong). Worst case: the branch target is the result of a long dependency chain of instructions right before the branch. A couple of independent instructions, and/or a memory-indirect jump, is fine; out-of-order execution should find cycles to run those instructions once they’re in the OoO scheduler.
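
In an interpreter, one way to do that is to hoist the load of the next dispatch target above the current handler’s body. A GNU C sketch, with the same invented bytecode format as above:

    #include <stdint.h>

    long run_early(const uint8_t *pc) {
        static void *table[] = { &&op_inc, &&op_halt };
        long acc = 0;
        goto *table[*pc++];

    op_inc:
        {
            void *next = table[*pc++];  /* resolve the dispatch target first... */
            acc++;                      /* ...then do the block's real work */
            goto *next;
        }
    op_halt:
        return acc;
    }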

There are also split L1iTLBs and L1dTLBs, but the L2 TLB is usually unified on most microarchitectures. But IIRC, the L2 TLB works as a victim cache for the L1 TLBs. A prefetch might trigger a page walk to populate an entry in the L1 data TLB, but on some microarchitectures that wouldn’t help avoid an iTLB miss. (At least it would get the page-table data itself into L1D, or maybe into internal page-directory caches in the page-walk hardware, so another page walk for the same entry would be fast. But CPUs before Intel Skylake only have one hardware page-walk unit, so if the iTLB miss happens while the first page walk is still in flight, it might not be able to start right away; the prefetch could actually hurt if your code is so scattered that you’re getting iTLB misses.)

Use 2MB hugepages for the chunk of memory you will JIT into to reduce TLB misses. Probably best to lay out the code in a fairly tight region, with the data separate. DRAM locality effects are a real thing. (A DRAM page is usually bigger than 4kiB, I think, but it’s a hardware thing and you can’t choose. It’s lower latency to access within an already-open page.)
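
On Linux, one way to get those 2MiB pages for an anonymous JIT buffer is transparent hugepages via madvise. A minimal sketch, assuming Linux; error handling and a proper W^X mapping policy are omitted:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* MADV_HUGEPAGE is only a hint; rounding the size up to a multiple of
       2MiB gives the kernel the best chance of using hugepages. */
    void *alloc_jit_region(size_t bytes) {
        void *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        madvise(buf, bytes, MADV_HUGEPAGE);
        return buf;
    }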

See Agner Fog’s microarch PDF, and also Intel’s optimization manual. (And AMD’s manual too, if you are worried about AMD CPUs.) See more links in the tag wiki.

Is this idea even viable?

Yes, probably.

If possible, when one block always jumps to another block, elide the jump by making the blocks contiguous so execution just falls through.
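
A sketch of what that could look like in a hypothetical emitter (the block struct and chain layout are invented):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct block { const uint8_t *code; size_t len; struct block *next; };

    /* Lay each block's machine code directly after its predecessor's, so
       execution falls through with no jmp between them at all. */
    uint8_t *emit_chain(uint8_t *out, const struct block *b) {
        for (; b; b = b->next) {
            memcpy(out, b->code, b->len);  /* block body, minus any tail jmp */
            out += b->len;                 /* successor starts right here */
        }
        return out;
    }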

Relative addressing for data is easy: x86-64 has RIP-relative addressing.

You can lea rdi, [rel some_label] and then index from there, or just use RIP-relative addressing directly for some of your static data.

You’re going to be JITting your code or something, so just calculate signed offsets from the end of the current instruction to the data to be accessed, and that’s your RIP-relative offset. Position-independent code + static data is easy in x86-64.
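
For example, a sketch of emitting lea rdi, [rip+rel32] into a JIT buffer. The helper name is invented, and this assumes the target is within ±2GiB of the code; subtracting pointers into separate allocations like this is the usual non-portable JIT idiom:

    #include <stdint.h>
    #include <string.h>

    /* Emit "lea rdi, [rip+rel32]" (48 8D 3D rel32). The rel32 displacement
       is measured from the END of the instruction, i.e. the next byte. */
    static uint8_t *emit_lea_rdi_rip(uint8_t *out, const void *target) {
        static const uint8_t opcode[] = { 0x48, 0x8D, 0x3D };
        memcpy(out, opcode, sizeof opcode);
        out += sizeof opcode;
        int32_t rel = (int32_t)((const uint8_t *)target - (out + 4));
        memcpy(out, &rel, sizeof rel);
        return out + 4;
    }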


In Granite Rapids and later, prefetchit0 [rip+rel32] prefetches code into “all levels” of cache, and prefetchit1 prefetches into all levels except L1i.

These instructions are a NOP with an addressing mode other than RIP-relative, or on CPUs that don’t support them. (Perhaps they also prime the iTLB or even the uop cache, or at least they could on paper.) The docs in Intel’s “future extensions” manual as of December 2022 recommend that the target address be the start of some instruction.

It’s only useful to prefetch if you do it early enough, and it doesn’t solve the mispredict problem. It might or might not be a win for an interpreter to prefetch the code for the bytecode instruction after the current one, but prefetchit0 can’t do that anyway: it only works with RIP-relative addressing, with no indirection. Perhaps that’s because the code-fetch parts of the CPU, like L1i and the iTLB, don’t have AGUs for arbitrary addresses, if it works by feeding the address to those. So it’s no help in prefetching a runtime-variable code location.
