Java for-loop optimization

It’s so easy to get fooled by hand-made microbenchmarks – you never know what they actually measure. That’s why there are special tools like JMH. But let’s analyze what happens to the primitive hand-made benchmark:

static class HDouble {
    double value;
}

public static void main(String[] args) {
    primitive();
    wrapper();
}

public static void primitive() {
    long start = System.nanoTime();
    for (double d = 0; d < 1000000000; d++) {
    }
    long end = System.nanoTime();
    System.out.printf("Primitive: %.3f s\n", (end - start) / 1e9);
}

public static void wrapper() {
    HDouble d = new HDouble();
    long start = System.nanoTime();
    for (d.value = 0; d.value < 1000000000; d.value++) {
    }
    long end = System.nanoTime();
    System.out.printf("Wrapper:   %.3f s\n", (end - start) / 1e9);
}

The results are somewhat similar to yours:

Primitive: 3.618 s
Wrapper:   1.380 s

Now repeat the test several times:

public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
        primitive();
        wrapper();
    }
}

It gets more interesting:

Primitive: 3.661 s
Wrapper:   1.382 s
Primitive: 3.461 s
Wrapper:   1.380 s
Primitive: 1.376 s <-- starting from 3rd iteration
Wrapper:   1.381 s <-- the timings become equal
Primitive: 1.371 s
Wrapper:   1.372 s
Primitive: 1.379 s
Wrapper:   1.378 s

Looks like both methods got finally optimized. Run it once again, now with logging JIT compiler activity:
-XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

    136    1 %           Test::primitive @ 6 (53 bytes)
   3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant
Primitive: 3.589 s
   3748    2 %           Test::wrapper @ 17 (73 bytes)
   5122    2 %           Test::wrapper @ -2 (73 bytes)   made not entrant
Wrapper:   1.374 s
   5122    3             Test::primitive (53 bytes)
   5124    4 %           Test::primitive @ 6 (53 bytes)
Primitive: 3.421 s
   8544    5             Test::wrapper (73 bytes)
   8547    6 %           Test::wrapper @ 17 (73 bytes)
Wrapper:   1.378 s
Primitive: 1.372 s
Wrapper:   1.375 s
Primitive: 1.378 s
Wrapper:   1.373 s
Primitive: 1.375 s
Wrapper:   1.378 s

Note % sign in the compilation log on the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode. Since then, starting from the third iteration, there was no difference between primitive and wrapper in execution speed.

What you’ve actually measured is the performance of OSR stub. It is usually not related to the real performance of an application and you shouldn’t care much about it.

But the question still remains, why OSR stub for a wrapper is compiled better than for a primitive variable? To find this out we need to get down to generated assembly code:
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

I’ll omit all unrelevant code leaving only the compiled loop.

Primitive:

0x00000000023e90d0: vmovsd 0x28(%rsp),%xmm1      <-- load double from the stack
0x00000000023e90d6: vaddsd -0x7e(%rip),%xmm1,%xmm1
0x00000000023e90de: test   %eax,-0x21f90e4(%rip)
0x00000000023e90e4: vmovsd %xmm1,0x28(%rsp)      <-- store to the stack
0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0    <-- compare with the stack value
0x00000000023e90f0: ja     0x00000000023e90d0

Wrapper:

0x00000000023ebe90: vaddsd -0x78(%rip),%xmm0,%xmm0
0x00000000023ebe98: vmovsd %xmm0,0x10(%rbx)      <-- store to the object field
0x00000000023ebe9d: test   %eax,-0x21fbea3(%rip)
0x00000000023ebea3: vucomisd %xmm0,%xmm1         <-- compare registers
0x00000000023ebea7: ja     0x00000000023ebe90

As you can see, the ‘primitive’ case makes a number of loads and stores to a stack location while ‘wrapper’ does mostly in-register operations. It is quite understandable why OSR stub refers to stack: in the interpreted mode local variables are stored on the stack, and OSR stub is made compatible with this interpreted frame. In a ‘wrapper’ case the value is stored on the heap, and the reference to the object is already cached in a register.

Leave a Comment