What is the effect of the second argument in __builtin_prefetch()?

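As Margaret points out, the second argument is rw: 0 (the default) asks for a prefetch in preparation for a read, 1 for a prefetch in preparation for a write.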
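As a rough sketch of what passing rw=1 looks like in practice, here is a loop that will store to every element and asks for write intent a few cache lines ahead. The function name, the assumed 64-byte line size, and the prefetch distance are all illustrative assumptions, not tuned values and not from the question.

```c
#include <stddef.h>

/* Illustrative constants (assumptions): 64-byte lines hold 16 floats,
 * and we prefetch 4 lines ahead of the element being written. */
#define FLOATS_PER_LINE 16
#define AHEAD (4 * FLOATS_PER_LINE)

/* Hypothetical helper: scale an array in place. */
void scale_in_place(float *a, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        /* Issue at most one hint per FLOATS_PER_LINE elements,
         * and never for an address past the end of the array. */
        if (i % FLOATS_PER_LINE == 0 && i + AHEAD < n) {
            /* rw = 1: we intend to write; locality = 3: keep it close. */
            __builtin_prefetch(&a[i + AHEAD], 1, 3);
        }
        a[i] *= k;
    }
}
```

For a simple sequential loop like this, hardware prefetch will usually do the job on its own; the point is only to show where the rw argument goes.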

Baseline x86-64 (SSE2) does not include write-prefetch instructions, but they exist as ISA extensions. As usual, compilers won’t use them unless you tell them you’re compiling for a target that supports them. (But they will safely run as a NOP on any non-ancient CPU.)

The two instructions are: PREFETCHW (into L1d cache like PREFETCHT0) and PREFETCHWT1 (into L2 cache like PREFETCHT1). They prefetch a line into Exclusive MESI state by sending out an RFO (Read-For-Ownership). This invalidates every other copy of the line in every other core. From that state, the store buffer can commit data to the line (and flip it to Modified) without any further off-core traffic. Or, if the line is not modified before eviction, it can simply be dropped.

The PREFETCHW instruction is merely a hint and does not affect program behavior. If executed, this instruction moves data closer to the processor and invalidates other cached copies in anticipation of the line being written to in the future.

They have nearly the same machine encoding: the same 0F 0D opcode, differing only in /1 or /2 in the ModRM /r field, just like how read-prefetch PREFETCHT0/T1/T2/NTA share an opcode and are differentiated only by /0 (NTA), /1 (T0), etc. in the ModRM /r field. Using /r bits as extra opcode bits is not unique; other one-operand and immediate instructions also do that.
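To make the encoding concrete, here is a small sketch that classifies a prefetch from its second opcode byte and ModRM byte. The enum and function names are made up for illustration; only the 0F 0D /1, /2 and 0F 18 /0 to /3 assignments are covered, and anything else is reported as unknown rather than guessed.

```c
#include <stdint.h>

enum prefetch_kind {
    PF_UNKNOWN, PF_NTA, PF_T0, PF_T1, PF_T2, PF_W, PF_WT1
};

/* The /r field is bits 5:3 of the ModRM byte. */
static unsigned modrm_reg(uint8_t modrm)
{
    return (modrm >> 3) & 7u;
}

/* opcode2 is the byte after the 0F escape byte. */
enum prefetch_kind classify_prefetch(uint8_t opcode2, uint8_t modrm)
{
    unsigned reg = modrm_reg(modrm);

    if (opcode2 == 0x18) {            /* read prefetches */
        switch (reg) {
        case 0: return PF_NTA;
        case 1: return PF_T0;
        case 2: return PF_T1;
        case 3: return PF_T2;
        }
    } else if (opcode2 == 0x0D) {     /* write prefetches */
        switch (reg) {
        case 1: return PF_W;
        case 2: return PF_WT1;
        }
    }
    return PF_UNKNOWN;
}
```

For example, `prefetchw [rdi]` assembles to 0F 0D 0F and `prefetchwt1 [rdi]` to 0F 0D 17: the ModRM bytes 0x0F and 0x17 differ only in the /r field.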

related: Difference between prefetch for read or write

PREFETCHW originally appeared in AMD’s 3DNow!, but has its own feature bit so that CPUs can indicate support for it but not other 3DNow! (packed-float in MMX regs) instructions.

PREFETCHWT1 also has its own CPUID feature bit, but might be associated with AVX512PF. It appears to only be available in Xeon Phi (Knights Landing / Knights Mill), not mainstream Skylake-AVX512, same as AVX512PF (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). (Evidence: according to Intel’s Future Extensions manual, CPUID with EAX=7/ECX=0 gives a feature bitmap in ECX including “Bit 00: PREFETCHWT1 (Intel® Xeon Phi™ only)”. Also a mailing list post.)
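If you want to check the feature bits at run time, something like the following sketch should work with GCC or clang, using the helpers from `<cpuid.h>`. The function names are hypothetical. The PREFETCHWT1 bit position is the one quoted above; the PREFETCHW position (leaf 0x80000001, ECX bit 8) is from the vendor manuals as I remember them, so double-check it before relying on it. On non-x86 targets the sketch simply reports false.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* PREFETCHW: CPUID leaf 0x80000001, ECX bit 8. */
bool cpu_has_prefetchw(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;               /* leaf not supported */
    return (ecx >> 8) & 1u;
#else
    return false;
#endif
}

/* PREFETCHWT1: CPUID leaf 7, subleaf 0, ECX bit 0. */
bool cpu_has_prefetchwt1(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;               /* leaf not supported */
    return ecx & 1u;
#else
    return false;
#endif
}
```

A clear bit only means the CPU does not advertise the instruction; it says nothing about whether the encoding faults or runs as a NOP.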

__builtin_prefetch(p,1,2); compiles as follows with GCC:

  • PREFETCHT1 with no -m options, or -march=haswell or older Intel.
  • PREFETCHW with an AMD target, like -march=k8 or -march=bdver2 (Piledriver).
  • PREFETCHW with -march=broadwell or newer Intel SnB-family, and/or -mprfchw for any arch.
  • PREFETCHWT1 with -mprefetchwt1. (If PREFETCHW is also available, gcc uses it for locality=3, but PREFETCHWT1 for locality<=2.) GCC for some reason doesn’t enable this as part of -march=knl or -march=knm, but clang does. I think this is an oversight in GCC.

  • -mprefetchwt1 implies -mprfchw. See also the x86 options section in the GCC manual for more about -march=native vs. -march=whatever to enable a set of ISA extensions and set -mtune=whatever appropriately.
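To see which case from the list above a given set of flags lands in without reading the asm, you can test the predefined macros. This sketch assumes the compiler defines `__PRFCHW__` for -mprfchw and `__PREFETCHWT1__` for -mprefetchwt1; I believe GCC and clang do, but verify with `gcc -dM -E -x c /dev/null` plus your options. The function name is hypothetical.

```c
/* Reports what __builtin_prefetch(p, 1, 2) is expected to become
 * for this translation unit, judging only by predefined macros. */
const char *write_prefetch_for_this_build(void)
{
#if defined(__PREFETCHWT1__)
    return "prefetchwt1";              /* locality 3 would still use prefetchw */
#elif defined(__PRFCHW__)
    return "prefetchw";
#else
    return "read prefetch fallback";   /* e.g. prefetcht1 on x86 */
#endif
}
```

Newer compiler releases may have dropped the Xeon Phi options entirely, in which case the first branch can never be taken.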

Check it out on the Godbolt compiler explorer, for -march=haswell vs. -march=broadwell -mprefetchwt1. Or modify the compiler args yourself.

clang -O3 -march=knl, and gcc -O3 -march=broadwell -mprefetchwt1 make the same asm:

        prefetchwt1     [rdi]    #   __builtin_prefetch(p,1,2);  // KNL only, otherwise we get prefetchw
        prefetchw       [rdi]    #   __builtin_prefetch(p,1,3);

        prefetcht0      [rdi]    #   __builtin_prefetch(p,0,3);
        prefetcht1      [rdi]    #   __builtin_prefetch(p,0,2);
        prefetcht2      [rdi]    #   __builtin_prefetch(p,0,1);
        prefetchnta     [rdi]    #   __builtin_prefetch(p,0,0);
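The source for that listing is presumably something along these lines (the function name is my own placeholder; the calls are the ones shown in the asm comments). Note that the rw and locality arguments have to be compile-time constants.

```c
/* Each call should become one prefetch instruction, or a
 * read-prefetch fallback if the target lacks write prefetch. */
void prefetch_examples(const void *p)
{
    __builtin_prefetch(p, 1, 2);   /* write intent, moderate locality */
    __builtin_prefetch(p, 1, 3);   /* write intent, high locality */

    __builtin_prefetch(p, 0, 3);   /* read, high locality */
    __builtin_prefetch(p, 0, 2);
    __builtin_prefetch(p, 0, 1);
    __builtin_prefetch(p, 0, 0);   /* read, no temporal locality */
}
```

Prefetches are only hints, so calling this has no visible effect on the pointed-to data.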

Also note that their 0F 0D r/m8 machine code decodes as a multi-byte NOP on non-ancient CPUs that don’t have the PREFETCHW or 3DNow! feature-bit. On early 64-bit Intel CPUs, it’s an illegal instruction. (Newer versions of Windows require that PREFETCHW executes without faulting, and in that context people talk about a CPU “supporting PREFETCHW” even if it runs as a NOP).

It’s possible that CPUs which support PREFETCHW but not PREFETCHWT1 will actually run PREFETCHWT1 as if it were PREFETCHW, but I haven’t tested. (It should be testable by running threads on different cores, one doing repeated stores to a location and the other doing PREFETCHWT1 vs. PREFETCHW vs. read prefetch vs. NOP, and see how the writing thread’s throughput is affected.)

It might be preferable to use a read-intent prefetch instead of a NOP, though (like GCC does). But you probably don’t want to do both a PREFETCHW and a PREFETCHT0, because too many prefetch instructions aren’t a good thing, especially on Intel IvyBridge, which has some kind of performance bug for prefetch-instruction throughput. (But IvB would run PREFETCHW as a NOP, so you’re only getting one prefetch on that uarch.)

Tuning software-prefetch is hard: too much prefetching means fewer execution resources spent doing real work, if HW prefetch does its job successfully. See Cost of a sub-optimal cacheline prefetch and What Every Programmer Should Know About Memory?
