Analyzing x86 output generated by the JIT in the context of volatile

A couple of things. First, “will be flushed to memory” – that’s pretty much erroneous. It’s almost never a flush to main memory: usually the store buffer is drained to L1, and it’s up to the cache-coherency protocol to sync the data between all the caches. But if it’s easier for you to understand the concept in those terms, that’s fine – just know that it’s slightly different and faster.

It’s a good question why the [StoreLoad] is there, and maybe this will clear things up a bit. volatile is indeed all about fences, and here is an example of which barriers would be inserted for various volatile operations. Let’s start with a volatile load:

  // i is some shared volatile field
  int tmp = i; // volatile load of "i"
  // [LoadLoad|LoadStore]

Notice the two barriers here, LoadStore and LoadLoad. In plain English it means that any Load or Store that comes after a volatile load/read cannot “move up” across the barrier; it cannot be re-ordered “above” that volatile load.

And here is the example for a volatile store:

 // "i" is a shared volatile variable
 // [StoreStore|LoadStore]
 i = tmp; // volatile store

It means that any Load or Store that comes before the volatile store cannot go “below” the store itself.

This basically builds the happens-before relationship: the volatile load is the acquiring load and the volatile store is the releasing store (this also has to do with how the CPU’s store and load buffers are implemented, but that’s pretty much out of the scope of the question).

If you think about it, this makes perfect sense given what we already know about volatile in general: once a volatile store has been observed by a volatile load, everything prior to that volatile store will be observed as well, and that is exactly what the memory barriers provide. When a volatile store takes place, everything above it cannot move below it, and once a volatile load happens, everything below it cannot move above it; otherwise that happens-before relationship would be broken.
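To make that concrete, here is a minimal sketch of the classic publication pattern (the class and field names are mine, just for illustration):

 class Publication {
     int data;                // plain shared field
     volatile boolean ready;  // volatile flag

     void writer() {
         data = 42;      // plain store
         // [StoreStore|LoadStore] -- "data = 42" cannot sink below the volatile store
         ready = true;   // volatile store (the releasing store)
     }

     void reader() {
         if (ready) {    // volatile load (the acquiring load)
             // [LoadLoad|LoadStore] -- the read of "data" cannot float above the flag
             assert data == 42; // guaranteed once "ready == true" has been observed
         }
     }
 }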

But that’s not all, there’s more. There also needs to be sequential consistency, which is why any sane implementation will guarantee that volatiles themselves are not re-ordered with each other, so two more fences are inserted:

 // any store of some other volatile
 // can not be reordered with this volatile load
 // [StoreLoad] -- this one
 int tmp = i; // volatile load of a shared variable "i"
 // [LoadStore|LoadLoad]

And one more here:

 // [StoreStore|LoadStore]
 i = tmp; // volatile store
 // [StoreLoad] -- and this one

Now, it turns out that on x86 three of these four memory barriers are free, since x86 has a strong memory model. The only one that needs an actual instruction is StoreLoad. On CPUs with weaker memory models, like ARM or PowerPC, the others need instructions too (lwsync on PowerPC, for example), but I don’t know much about those.

Usually an mfence is a good option for StoreLoad on x86, but the same guarantee is provided by a lock-prefixed instruction such as lock add (AFAIK more cheaply), which is why you see it in the generated code: that is the StoreLoad barrier. And yes, you are right in your last sentence: on a weaker memory model the StoreStore barrier would require an instruction as well. On a side-note, that is what is used when you safely publish a reference via final fields inside a constructor: upon exiting the constructor, two fences are inserted, LoadStore and StoreStore.
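As a sketch of that side-note (the names are mine, just to illustrate the pattern), the fences on constructor exit are what make this publication safe even though the reference itself is not volatile:

 class Holder {
     final int value;

     Holder(int v) {
         value = v;
         // [LoadStore|StoreStore] on constructor exit -- the store to "value"
         // cannot be re-ordered with the publication of the reference
     }
 }

 class Publisher {
     static Holder holder; // plain, non-volatile shared reference

     static void publish() {
         holder = new Holder(42); // any thread that sees a non-null "holder"
                                  // is guaranteed to see value == 42
     }
 }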

Take all this with a grain of salt – a JVM is free to omit these barriers as long as it does not break any rules; Aleksey Shipilev has a great talk about this.


EDIT

Suppose you have this case:

 // [StoreStore|LoadStore]
 x = 4; // volatile store of the shared "x" variable

 y = 3; // non-volatile store of the shared "y" variable

 int z = x; // volatile load of "x"
 // [LoadLoad|LoadStore]

Basically there is no barrier here that would prevent the volatile store from being re-ordered with the volatile load (i.e. the volatile load could be performed first), and that would obviously cause problems: sequential consistency would be violated.
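A classic way to see what breaks (a sketch, with field names of my own choosing): without a [StoreLoad] between a volatile store and a subsequent volatile load, both threads below could observe 0, an outcome sequential consistency forbids:

 class Dekker {
     volatile int x = 0;
     volatile int y = 0;
     int r1, r2;

     void thread1() {
         x = 1;   // volatile store
         // [StoreLoad] -- without it, the load below could be satisfied
         // before the store above has drained from the store buffer
         r1 = y;  // volatile load
     }

     void thread2() {
         y = 1;   // volatile store
         // [StoreLoad]
         r2 = x;  // volatile load
     }

     // with the barriers, at least one of r1/r2 must end up as 1;
     // without them, r1 == 0 && r2 == 0 would become possible
 }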

By the way, you are sort of missing the point here (if I am not mistaken) with “Every action after volatile load won't be reordered before volatile load is visible”. Re-ordering is not possible with the volatile itself – the other operations are free to be re-ordered among themselves. Let me give you an example:

 int tmp = i; // volatile load of a shared variable "i"
 // [LoadStore|LoadLoad]

 int x = 3; // plain store
 int y = 4; // plain store

The last two operations, x = 3 and y = 4, are absolutely free to be re-ordered: they can’t float above the volatile load, but they can be re-ordered with each other. So the following would be perfectly legal:

 int tmp = i; // volatile load
 // [LoadStore|LoadLoad]

 // see how they have been inverted here...
 int y = 4; // plain store
 int x = 3; // plain store
