how do numactl & perf change memory placement policy of child processes?

TL;DR: The default policy used by numactl can cause performance issues, and so can the lack of OpenMP thread binding. numactl constraints are applied to all (forked) child processes.

Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind or --localalloc. It changes the behavior of the operating-system page allocation performed when a page is first touched. Here is the meaning of the policies:

  • --interleave: memory pages are allocated across the nodes specified in a nodeset, in a round-robin fashion;
  • --preferred: memory is allocated from a single preferred memory node; if sufficient memory is not available, memory can be allocated from other nodes;
  • --membind: memory is only allocated from the specified nodes; the allocation fails when there is not enough memory available on these nodes;
  • --localalloc: memory is always allocated on the current node (the one performing the first touch of the memory page).

In your case, specifying the --interleave or the --localalloc policy should give better performance. I guess the --localalloc policy should be the best choice if threads are bound to cores (see below).
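Here is a minimal sketch of how the two policies can be applied (assuming the compiled benchmark binary is named ./stream):

numactl --interleave=all ./stream     # round-robin page allocation across all NUMA nodes
numactl --localalloc ./stream         # pages allocated on the node doing the first touch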

Moreover, the STREAM_ARRAY_SIZE macro is set to a value that is too small by default (10,000,000 elements) to actually measure the performance of the RAM. Indeed, the AMD EPYC 7742 processor has a 256 MiB L3 cache, which is big enough to fit all the data of the benchmark. It is probably much better to compare the results with arrays bigger than the L3 cache (e.g. 1024 MiB).
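For example, the benchmark could be rebuilt with a larger array like this (a sketch assuming the official stream.c source and GCC; the exact size is just an illustration, and -mcmodel=medium is added because the static arrays exceed 2 GiB in total):

gcc -fopenmp -O2 -march=native -mcmodel=medium -DSTREAM_ARRAY_SIZE=134217728 stream.c -o stream    # 134217728 doubles = 1 GiB per array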

Finally, OpenMP threads may migrate from one NUMA node to another. This can drastically decrease the performance of the benchmark since, when a thread moves to another node, the memory pages it accesses are located on a remote node and the target NUMA node can get saturated. You need to bind the OpenMP threads so they cannot move, using for example the following environment variables for this specific processor: OMP_NUM_THREADS=64 OMP_PROC_BIND=TRUE OMP_PLACES="{0}:64:1", assuming SMT is disabled and the core IDs are in the range 0-63. If SMT is enabled, you should tune the OMP_PLACES variable using the command: OMP_PLACES={$(hwloc-calc --sep "},{" --intersect PU core:all.pu:0)} (which requires the hwloc package to be installed on the machine). You can check the thread binding with the command hwloc-ps.
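Putting the thread binding and the NUMA policy together, a run could look like the following sketch (assuming SMT is disabled, 64 cores, and a binary named ./stream):

OMP_NUM_THREADS=64 OMP_PROC_BIND=TRUE OMP_PLACES="{0}:64:1" numactl --localalloc ./stream &
sleep 2 && hwloc-ps -t    # -t lists per-thread bindings while the benchmark is running
wait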


Update:

numactl impacts the NUMA attributes of all the child processes (since child processes are created using fork on Linux, which should copy the NUMA attributes). You can check that with the following (bash) commands:

numactl --show
numactl --physcpubind=2 bash -c "(sleep 1; numactl --show; sleep 2;) & (sleep 2; numactl --show; sleep 1;); wait"

This first shows the NUMA attributes of the current shell, then sets them for a child bash process which launches 2 processes in parallel. Each sub-child process shows its NUMA attributes. The result is the following on my 1-node machine:

policy: default                    # Initial NUMA attributes
preferred node: current
physcpubind: 0 1 2 3 4 5 
cpubind: 0 
nodebind: 0 
membind: 0 
policy: default                    # Sub-child 1
preferred node: current
physcpubind: 2 
cpubind: 0 
nodebind: 0 
membind: 0 
policy: default                    # Sub-child 2
preferred node: current
physcpubind: 2 
cpubind: 0 
nodebind: 0 
membind: 0 

Here, we can see that the --physcpubind=2 constraint is applied to the two sub-child processes. It should be the same with --membind or with the NUMA policy on your multi-node machine. Note that calling perf before the sub-child processes should thus not have any impact on the NUMA attributes.
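You can verify that perf forwards the constraints in the same way with a quick check like this one (a sketch; the counted event is just an example):

numactl --physcpubind=2 perf stat -e task-clock -- numactl --show    # should still report physcpubind: 2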

perf should not directly impact the allocations. However, the default policy of the OS could balance the amount of allocated RAM on each NUMA node. If this is the case, the page allocations can be unbalanced, decreasing the bandwidth due to the saturation of some NUMA nodes. The OS memory-page balancing between NUMA nodes is very sensitive: writing a huge file between two benchmarks can impact the second run, for example (and thus impact the performance of the benchmark). This is why NUMA attributes (as well as process binding) should always be set manually for HPC benchmarks.
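If you want to check where the pages of a running benchmark actually end up, numastat (shipped with the numactl package) reports per-node allocations; for example (assuming the running process is named stream):

numastat -p $(pgrep stream)    # per-NUMA-node memory usage of the running benchmark
numastat -m                    # system-wide per-node memory breakdown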


PS: I assumed the STREAM benchmark has been compiled with OpenMP support and with optimizations enabled (i.e. -fopenmp -O2 -march=native on GCC and Clang).
