Branch target prediction in conjunction with branch prediction?

Do read along with the Intel optimization manual; the current download location is here. If that link has gone stale (they move stuff around all the time), search the Intel site for “Architectures optimization manual”. Keep in mind the info there is fairly generic; they disclose only as much as needed to allow writing efficient code. Branch prediction implementation details are considered a trade secret and do change between architectures. Search the manual for “branch prediction” to find references; they are spread across the chapters.

I’ll give a summary of what’s found in the manual, adding details where appropriate:

Branch prediction is the job of the BPU (Branch Prediction Unit) in the core. It roughly corresponds to “BP” in your question. It contains several sub-units:

  • The branch history table. This table keeps track of previously taken conditional branches and is consulted by the predictor to decide if a branch is likely to be taken. It is fed with entries by the instruction retirement unit, the one that knows whether the branch was actually taken. This is the sub-unit that has changed the most as the architectures improved, getting deeper and smarter as more real estate became available.

  • The BTB, Branch Target Buffer. This buffer stores the target address of a previously taken indirect jump or call. This corresponds to “BTP” in your question. The manual does not state whether the buffer can store multiple targets per address, indexed by the history table, but I consider it likely for later architectures.

  • The Return Stack Buffer. This buffer acts as a “shadow” stack, storing the return address for CALL instructions. That makes the target of a RET instruction available with a high degree of confidence without the processor having to rely on the BTB, which is unlikely to be as effective here: a function called from several places returns to a different address each time. It is documented to be 16 levels deep.
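To make the three sub-units a bit more concrete, here is a toy model in Python. Treat it as a sketch under loud assumptions, not as what Intel actually builds: the manual does not document the prediction algorithm, so the 2-bit saturating counter used for the history table is just the textbook scheme, and the class names, the table size, and the modulo indexing are all made up for illustration. Only the 16-entry depth of the return stack comes from the manual.

```python
from collections import deque


class BranchHistoryTable:
    """Toy direction predictor: one 2-bit saturating counter per slot.

    ASSUMPTION: textbook scheme, not Intel's undisclosed implementation.
    """

    def __init__(self, size=1024):  # size is an arbitrary, hypothetical choice
        self.size = size
        self.counters = [1] * size  # range 0..3, start at "weakly not taken"

    def predict_taken(self, address):
        return self.counters[address % self.size] >= 2

    def retire(self, address, taken):
        """Record the real outcome, as the retirement unit would report it."""
        i = address % self.size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class BranchTargetBuffer:
    """Toy BTB: remembers the last target seen for each jump, call or return."""

    def __init__(self):
        self.targets = {}

    def predict_target(self, address):
        return self.targets.get(address)  # None if this branch was never seen

    def retire(self, address, target):
        self.targets[address] = target


class ReturnStackBuffer:
    """Toy shadow stack of return addresses, 16 levels deep."""

    def __init__(self, depth=16):
        self.stack = deque(maxlen=depth)  # oldest entry is dropped on overflow

    def on_call(self, return_address):
        self.stack.append(return_address)

    def predict_return(self):
        return self.stack.pop() if self.stack else None


# A loop branch that is taken 9 times and then falls through.
bht = BranchHistoryTable()
hits = 0
for taken in [True] * 9 + [False]:
    hits += bht.predict_taken(0x401000) == taken
    bht.retire(0x401000, taken)

# One function called from two different sites: the BTB entry for its RET
# always holds the previous caller, the return stack holds the right one.
btb = BranchTargetBuffer()
rsb = ReturnStackBuffer()
RET_ADDRESS = 0x500010  # hypothetical address of the RET instruction
btb_hits = rsb_hits = 0
for return_address in (0x401005, 0x402005):
    rsb.on_call(return_address)
    btb_hits += btb.predict_target(RET_ADDRESS) == return_address
    rsb_hits += rsb.predict_return() == return_address
    btb.retire(RET_ADDRESS, return_address)

print(hits, btb_hits, rsb_hits)  # prints: 8 0 2
```

The loop branch is mispredicted only on its first and last iteration, and the second half shows why a dedicated return stack beats a last-target buffer for RET: in this model the BTB never gets the return right, the shadow stack always does.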

Bullet 2) is a bit difficult to answer accurately; the manual only talks about the “Front End” and does not break down the details of the pipeline. Appropriately enough, since it is heavily architecture dependent. The diagram in section 2.2.5 is possibly illustrative. The execution trace cache plays a role: it stores previously decoded instructions, so it is the primary source of BPU consultations. Otherwise the consultation happens right after the instruction translator (aka decoder).
