Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

Question

Tl;DR: For these three cases, a penalty of a few cycles is incurred when performing a load and store at the same time. The load latency is on the critical path in all of the three cases, but the penalty is different in different cases. Case 3 is about a cycle higher than case 1 due to the additional load.

Analysis Method 1: Using stall performance events

I was able to reproduce your results for the all of the three cases on IvB and SnB. The numbers I got are within 2% of your numbers. The number of cycles it takes to execute a single iteration of case 1, 2, and 4 is 5.4, 8.9, and 6.6, respectively.

Let’s start with the frontend. The LSD.CYCLES_4_UOPS and LSD.CYCLES_3_UOPS performance events show that basically all the uops are issued from the LSD. In addition, these events together with LSD.CYCLES_ACTIVE show that in every cycle in which the LSD is not stalled, 3 uops are issued in cases 1 and 2 and 4 uops are issued in case 3. In other words, as expected, the uops of every iteration are issued together in the same group in a single cycle.

In all of the following relations, the “=~” sign means that the difference is within 2%. I’ll start with the following empirical observation:

UOPS_ISSUED.STALL_CYCLES + LSD.CYCLES_ACTIVE =~ cycles

Note that the LSD event counts on SnB need to be adjusted as discussed in here.

We also have the following relations:

case 1: UOPS_ISSUED.STALL_CYCLES =~ RESOURCE_STALLS.ANY =~ 4.4c/iter
case 2: UOPS_ISSUED.STALL_CYCLES =~ RESOURCE_STALLS.ANY =~ 7.9c/iter
case 3: UOPS_ISSUED.STALL_CYCLES =~ RESOURCE_STALLS.ANY =~ 5.6c/iter

This means that the reason for the issue stalls is because one or more required resources in the backend are not available. Therefore, we can confidently eliminate the whole frontend from consideration. In cases 1 and 2, that resource is the RS. In case 3, stalls due to the RS constitute about 20% of all resource stalls¹.

Let’s focus now on case 1. There are a total of 4 unfused domain uops: 1 load uop, 1 STA, 1 STD, and 1 dec/jne. The load and STA uops depend on the previous load uop. Whenever the LSD issues a group of uops, the STD and jump uops can be dispatched in the next cycle, so the next cycle will not cause an execution stall event. However, the earliest point where the load and STA uops can be dispatched is in the same cycle in the which the load result is written back. The correlation between CYCLES_NO_EXECUTE and STALLS_LDM_PENDING indicates that the reason why there would be no uops ready for execution is because all of the uops that are in the RS are waiting for the L1 to service pending load requests. Specifically, half of the uops in the RS are load uops and the other half are STAs and they are all waiting for the load of the respective previous iteration to complete. LSD.CYCLES_3_UOPS shows that the LSD waits until there are at least 4 free entries in the RS, only then it issues a group of uops that constitute a full iteration. In the next cycle, two of these uops will be dispatched, thereby freeing 2 RS entries². The other will have to wait for the load they depend on to complete. Most probably the loads complete in program order. Therefore, the LSD waits until the STA and load uops of the oldest iteration that is yet to be executed leave the RS. Thus, UOPS_ISSUED.STALL_CYCLES + 1 =~ the average load latency³. We can conclude that the average load latency in case 1 is 5.4c. Most of this applies to case 2, except for one difference, as I’ll explain shortly.

Since the uops in each iteration form a dependency chain, we also have:

cycles =~ the average load latency.

Hence:

cycles =~ UOPS_ISSUED.STALL_CYCLES + 1 =~ the average load latency.

In case 1, the average load latency is 5.4c. We know that the best-case latency of the L1 cache is 4c, so there is a load latency penalty of 1.4c. But why is the effective load latency not 4c?

The scheduler will predict that the load on which the uops depend will complete within some constant latency and so it will schedule them to be dispatched accordingly. If the load takes more time than that for any reason (such as an L1 miss), the uops will be dispatched but the load result has not arrived yet. In this case, the uops will be replayed and the number of dispatched uops will be larger than the total number of issued uops.

The load and STA uops can only be dispatched to port 2 or 3. The events UOPS_EXECUTED_PORT.PORT_2 and UOPS_EXECUTED_PORT.PORT_3 can be used to count the number of uops dispatched to port 2 and 3, respectively.

case 1: UOPS_EXECUTED_PORT.PORT_2 + UOPS_EXECUTED_PORT.PORT_3 =~ 2uops/iter
case 2: UOPS_EXECUTED_PORT.PORT_2 + UOPS_EXECUTED_PORT.PORT_3 =~ 6uops/iter
case 3: UOPS_EXECUTED_PORT.PORT_2 + UOPS_EXECUTED_PORT.PORT_3 =~ 4.2uops/iter

In case 1, the total number of AGU uops dispatched is exactly equal to the number of AGU uops retired; there are no replays. So the scheduler never mispredicts. In case 2, there is on average 2 replays per AGU uop, which means that the scheduler mispredicts twice on average per AGU uop. Why are there mispredictions in case 2 but not in case 1?

The scheduler will replay uops dependent on a load for any of the following reasons:

L1 cache miss.
Memory disambiguation misprediction.
Memory consistency violation.
L1 cache hit, but there is L1-L2 traffic.
Virtual page number misprediction.
Some other (undocumented) reasons.

The first 5 reasons can be definitively ruled out using the corresponding performance events. Patrick Fay (Intel) says the following:

Lastly yes, there are ‘a few’ idle cycles when switching between a
load and a store. I’m told not to be more specific than ‘a few’.
…
SNB can read and write different banks at the same cycle.

I find these statements, perhaps intentionally, a little ambiguous. The first statement suggests that a load and store to the L1 can never fully overlap. The second one suggests that a load and store can be performed in the same cycle only if there are to different banks. Although being to different banks may neither be a necessary nor sufficient condition. But one thing is for sure, if there are concurrent load and store requests, the load (and the store) can be delayed for one or more cycles. This explains the average 1.4c penalty on the load latency in case 1.

There is a difference between case 1 and case 2. In case 1, the STA and load uops that depend on the same load uop are issued together in the same cycle. On the other hand, in case 2, the STA and load uops that depend on the same load uop belong to two different issue groups. The issue stall time per iteration would be essentially equal to the time it takes to sequentially execute one load and retire one store. The contribution of each operation can be estimated using CYCLE_ACTIVITY.STALLS_LDM_PENDING. It takes one cycle to execute the STA uop so the store can retire in the cycle that immediately follows the one in which the STA is dispatched.

The average load latency is CYCLE_ACTIVITY.STALLS_LDM_PENDING + 1 cycle (the cycle in which the load is dispatched) + 1 cycle (the cycle in the which the jump uop is dispatched). We need to add 2 cycles to CYCLE_ACTIVITY.STALLS_LDM_PENDING because there are no execution stalls in these cycles yet they constitute a fraction of the total load latency. This is equal to 6.8 + 2 = 8.8 cycles =~ cycles.

During the execution of the first dozen (or so) iterations, a jump and STD uops will be allocated in the RS every cycle. These will be always be dispatched for execution in the cycle that follows the issue cycle. At some point, the RS will get full and all of the entries that have not been dispatched yet will be STA and load uops that are waiting for the load uops of the respective previous iterations to complete (writeback their results). So the allocator will stall until there are enough free RS entries to issue a whole iteration. Let’s assume that the oldest load uop has written back its result at cycle T + 0. I’ll refer to the iteration to which that load uop belongs as the current iteration. The following sequence of events will occur:

At cycle T + 0: Dispatch the STA uop of the current iteration and the load uop of the next iteration. There is no allocation in this cycle because there aren’t enough RS entries. This cycle gets counted as an allocation stall cycle but not as an execution stall cycle.

At cycle T + 1: The STA uop completes execution and the store retires. The uops of the next iteration to be allocated are allocated. This cycle gets counted as an execution stall cycle but not as an allocation stall cycle.

At cycle T + 2: The jump and STD uops that were just allocated get dispatched. This cycle gets counted as an allocation stall cycle but not as an execution stall cycle.

At cycles T + 3 to T + 3 + CYCLE_ACTIVITY.STALLS_LDM_PENDING – 2: All of these cycles are counted as both execution and allocation stall cycles. Note that there are CYCLE_ACTIVITY.STALLS_LDM_PENDING – 1 cycles here.

Therefore, UOPS_ISSUED.STALL_CYCLES should be equal to 1 + 0 + 1 + CYCLE_ACTIVITY.STALLS_LDM_PENDING – 1. Let’s check: 7.9 = 1+0+1+6.8-1.

Following the reasoning on case 1, cycles should be equal to UOPS_ISSUED.STALL_CYCLES + 1 = 7.9 + 1 =~ the actual measured cycles. The penalty incurred when performing a load and store in at the same time is 3.6c higher than in case 1. It is as if the load is waiting for a store get committed. I think this also explains why there are replays in the case 2 but not in case 1.

In case 3, there are 1 STD, 1 STA, 2 loads, and 1 jump. The uops of a single iteration can all be allocated in one cycle because the IDQ-RS bandwidth is 4 fused uops per cycle. The uops get unfused on entrance to the RS. The 1 STD require 1 cycle to be dispatched. The jump also takes 1 cycle. There are three AGU uops but only 2 AGU ports. So it takes 2 cycles (compared to 1 in case 1 and 2) to dispatch the AGU uops. The group of AGU uops dispatched will be one of the following:

The second load uop and the STA uop of the same iteration. These are dependent on the first load uop of the same iteration. Both AGU ports are used.
The first load uop of the next iteration can be dispatched in the next cycle. This depends on the load of the previous iteration. Only one of the two AGU ports is used.

Since it takes one more cycle to free enough RS entries to accommodate an entire issue group, UOPS_ISSUED.STALL_CYCLES + 1 – 1 = UOPS_ISSUED.STALL_CYCLES =~ the average load latency =~ 5.6c, which is very close to that of case 1. The penalty is about 1.6c. This explains why, in case 3 compared to case 1 and 2, each AGU uop is dispatched 1.4 times on average.

Again, since it takes on more cycle to free enough RS entries to accommodate an entire issue group:

cycles =~ the average load latency + 1 = 6.6c/iter, which actually exactly matches cycles as measured on my system.

A detailed analysis similar to that one on case 2 can be done on case 3 as well. In case 3, the execution of the STA is overlapped with latency of the second load. The latencies of both loads are also mostly overlapped.

I don’t know why the penalties are different in the different cases. We would need to know how the L1D cache is exactly designed. Anyway, I feel confident enough that there is a penalty of “a few idle cycles” on the load latency (and the store latency) to post this answer.

Footnotes

(1) The other 80% of time is spent stalling on the load matrix. This structure is barely mentioned in the manual. It is used to specify dependencies between uops and load uops. It is estimated to have 32 entries on SnB and IvB. There is no documented performance event that can exclusively count stalls on the LM. All of the documented resource stall events are zero. In case 3, there are 3 out 5 uops per iteration that depend on the previous load, so most probably the LM will be filled before any of the other structures. The “effective” number of RS entries is estimated to be around 51 and 48 on IvB and SnB, respectively.

(2) I might have made a harmless simplification here. See Is it possible for the RESOURCE_STALLS.RS event to occur even when the RS is not completely full?.

(3) It may be helpful to create a visualization of the uop flow through the pipeline to see how this all fits together. You can use a simple load chain as a reference. This is easy for case 1, but difficult for case 2 due to replay.

Analysis Method 2: Using the load latency performance monitoring facility

I came up with another method to analyze the code. This method is much easier but less accurate. However, it does essentially lead us to the same conclusion.

The alternative method is based on the MEM_TRANS_RETIRED.LOAD_LATENCY_* performance events. These events are special in the sense that they can only be counted at the precise level (See: PERF STAT does not count memory-loads but counts memory-stores).

For example, MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 counts the number of loads whose latency is larger than 4 core cycles of a “randomly” selected sample of all executed loads. The latency is measured as follows. The cycle in which the load is dispatched for the first time is the first cycle that is considered as part of the latency of the load. The cycle in which the load result is written back is the last cycle that is considered as part of the latency. Hence, replays are accounted for. Also, starting with SnB (at least), all loads have latencies larger than 4 cycles according to this definition. The minimum latency threshold that is currently supported is 3 cycles.

Case 1
Lat Threshold  | Sample Count
 3             | 1426934
 4             | 1505684
 5             | 1439650
 6             | 1032657      << Drop 1
 7             |   47543      << Drop 2
 8             |   57681
 9             |   60803
10             |   76655
11             |     <10      << Drop 3

Case 2
Lat Threshold  | Sample Count
 3             | 1532028
 4             | 1536547
 5             | 1550828
 6             | 1541661
 7             | 1536371
 8             | 1537337
 9             | 1538440
10             | 1531577
11             |     <10      << Drop

Case 3
Lat Threshold  | Sample Count
 3             | 2936547
 4             | 2890162
 5             | 2921158
 6             | 2468704      << Drop 1
 7             | 1242425      << Drop 2
 8             | 1238254
 9             | 1249995
10             | 1240548
11             |     <10      << Drop 3

It’s critical to understand that these numbers represents the number of loads of the randomly selected sample of all loads. For example, of the total size of the sample of all loads is 10 million and only 1 million of these has a latency larger than the specified threshold, then the measured value is 1 million. However, the total number of executed loads could be 1 billion. Therefore, the absolute values are not very meaningful themselves. What really matters is the pattern across different thresholds.

In case 1, there are three significant drops in the number of loads whose latency is larger than a specific threshold. We can deduce that loads whose latency is equal to or smaller than 6 cycles are the most common, loads whose latency is equal to or smaller than 7 cycles but larger than 6 cycles are the second most common, and most other loads have a latency between 8-11 cycles.

we already know that the minimum latency is 4 cycles. Given these numbers, it’s reasonable to estimate the average load latency to be somewhere between 4 and 6 cycles, but closer to 6 than 4. We know from Method 1 that the average load latency is actually 5.4c. So we can make a fairly good estimation using these numbers.

In case 2, we can deduce that most loads have a latency that is smaller than or equal to 11 cycles. The average load latency is probably also much larger than 4, given the consistency in the measured number of loads across a wide range of latency thresholds. So it’s between 4 and 11, but closer to 11 than 4. We know from Method 1 that the average load latency is actually 8.8c, which is close to any reasonable estimation based on these numbers.

Case 3 is similar to case 1 and in fact they actual average load latency determined using Method 1 is almost the same for these two cases.

Making measurements using MEM_TRANS_RETIRED.LOAD_LATENCY_* is easy and such analysis can be done by someone with little knowledge about the microarchitecture.

More Related Contents:

Leave a Comment Cancel reply