[FIXED] Does Java's JIT compiler "expand" trivial loops?

Issue

I am writing a performance-sensitive portion of my application and I am curious about how the JIT compiler (if at all) would optimize the following methods:

private static int alphaBlend(int foreground, int background) {
    int alpha = (foreground >> 24) & 0xFF;
    int subAlpha = 0xFF - alpha;
    return ((((((foreground >> 16) & 0xFF) * alpha) + (((background >> 16) & 0xFF)) * subAlpha) >> 8) << 16)
            | ((((((foreground >> 8) & 0xFF) * alpha) + (((background >> 8) & 0xFF)) * subAlpha) >> 8) << 8)
            | ((((foreground & 0xFF) * alpha) + ((background & 0xFF)) * subAlpha) >> 8);
}

private static int alphaBlendLoop(int foreground, int background) {
    int alpha = (foreground >> 24) & 0xFF;
    int subAlpha = 0xFF - alpha;
    int blended = 0;
    for (int shift = 16; shift >= 0; shift -= 8) {
        blended |= (((((foreground >> shift) & 0xFF) * alpha) + (((background >> shift) & 0xFF)) * subAlpha) >> 8) << shift;
    }
    return blended;
}

These methods perform alpha blending. Basically, they combine a foreground RGBA pixel with a background RGB pixel whose RGB component values have been pre-multiplied with alpha values.

Both of these methods return the same values for the same inputs, but their implementations are different. Personally, I find the latter implementation easier to read, but I am concerned that it may be less performant. The bytecode for both implementations is included below for those interested (it was generated using IntelliJ’s “Show Bytecode” view):

  private static alphaBlend(II)I
   L0
    LINENUMBER 95 L0
    ILOAD 0
    BIPUSH 24
    ISHR
    SIPUSH 255
    IAND
    ISTORE 2
   L1
    LINENUMBER 96 L1
    SIPUSH 255
    ILOAD 2
    ISUB
    ISTORE 3
   L2
    LINENUMBER 97 L2
    ILOAD 0
    BIPUSH 16
    ISHR
    SIPUSH 255
    IAND
    ILOAD 2
    IMUL
    ILOAD 1
    BIPUSH 16
    ISHR
    SIPUSH 255
    IAND
    ILOAD 3
    IMUL
    IADD
    BIPUSH 8
    ISHR
    BIPUSH 16
    ISHL
    ILOAD 0
    BIPUSH 8
    ISHR
    SIPUSH 255
    IAND
    ILOAD 2
    IMUL
    ILOAD 1
    BIPUSH 8
    ISHR
    SIPUSH 255
    IAND
    ILOAD 3
    IMUL
    IADD
    BIPUSH 8
    ISHR
    BIPUSH 8
    ISHL
    IOR
    ILOAD 0
    SIPUSH 255
    IAND
    ILOAD 2
    IMUL
    ILOAD 1
    SIPUSH 255
    IAND
    ILOAD 3
    IMUL
    IADD
    BIPUSH 8
    ISHR
    IOR
    IRETURN
   L3
    LOCALVARIABLE foreground I L0 L3 0
    LOCALVARIABLE background I L0 L3 1
    LOCALVARIABLE alpha I L1 L3 2
    LOCALVARIABLE subAlpha I L2 L3 3
    MAXSTACK = 4
    MAXLOCALS = 4

  private static alphaBlendLoop(II)I
   L0
    LINENUMBER 103 L0
    ILOAD 0
    BIPUSH 24
    ISHR
    SIPUSH 255
    IAND
    ISTORE 2
   L1
    LINENUMBER 104 L1
    SIPUSH 255
    ILOAD 2
    ISUB
    ISTORE 3
   L2
    LINENUMBER 105 L2
    ICONST_0
    ISTORE 4
   L3
    LINENUMBER 106 L3
    BIPUSH 16
    ISTORE 5
   L4
   FRAME FULL [I I I I I I] []
    ILOAD 5
    IFLT L5
   L6
    LINENUMBER 107 L6
    ILOAD 4
    ILOAD 0
    ILOAD 5
    ISHR
    SIPUSH 255
    IAND
    ILOAD 2
    IMUL
    ILOAD 1
    ILOAD 5
    ISHR
    SIPUSH 255
    IAND
    ILOAD 3
    IMUL
    IADD
    BIPUSH 8
    ISHR
    ILOAD 5
    ISHL
    IOR
    ISTORE 4
   L7
    LINENUMBER 106 L7
    IINC 5 -8
    GOTO L4
   L5
    LINENUMBER 109 L5
   FRAME CHOP 1
    ILOAD 4
    IRETURN
   L8
    LOCALVARIABLE shift I L4 L5 5
    LOCALVARIABLE foreground I L0 L8 0
    LOCALVARIABLE background I L0 L8 1
    LOCALVARIABLE alpha I L1 L8 2
    LOCALVARIABLE subAlpha I L2 L8 3
    LOCALVARIABLE blended I L3 L8 4
    MAXSTACK = 4
    MAXLOCALS = 6

Intuitively, the loop requires more “work” and operations (jumping, evaluating a condition, decrementing, etc.). However, the loop is also very predictable; it will always execute exactly three times and the variable defined within its scope will always have the same three values.

In such a scenario, would a JIT compiler (or a smarter static compiler?) be able to optimize a trivial loop like this by expanding it into perhaps a long one-liner like seen in the alphaBlend implementation? Or are loops typically something that cannot be optimized in such a way?

Solution

Yes, HotSpot JIT supports loop unrolling and constant propagation optimizations that make possible to convert alphaBlendLoop into something similar to manually unrolled alphaBlend.

I would personally prefer the third option: a small helper function that makes code even more readable:

private static int blend(int foreground, int background, int alpha, int shift) {
    int fg = (foreground >>> shift) & 0xff;
    int bg = (background >>> shift) & 0xff;
    return (fg * alpha + bg * (0xff - alpha)) >>> 8 << shift;
}

public static int alphaBlend(int foreground, int background) {
    int alpha = foreground >>> 24;
    int R = blend(foreground, background, alpha, 0);
    int G = blend(foreground, background, alpha, 8);
    int B = blend(foreground, background, alpha, 16);
    return R | G | B;
}

I’ve made a JMH benchmark to verify that all 3 options are similar in performance.
Tested on Java 8u77 x64.

Benchmark               Mode  Cnt  Score   Error  Units
Blend.alphaBlendInline  avgt   10  7,831 ± 0,045  ns/op
Blend.alphaBlendLoop    avgt   10  7,860 ± 0,025  ns/op
Blend.alphaBlendMethod  avgt   10  7,769 ± 0,056  ns/op

Answered By – apangin

Answer Checked By – Gilberto Lyons (Easybugfix Admin)

Leave a Reply

(*) Required, Your email will not be published