[FIXED] Android: why is native code so much faster than Java code

Issue

In the following SO question: https://stackoverflow.com/questions/2067955/fast-bitmap-blur-for-android-sdk @zeh claims a port of a java blur algorithm to C runs 40 times faster.

Given that the bulk of the code includes only calculations, and all allocations are only done “one time” before the actual algorithm number crunching – can anyone explain why this code runs 40 times faster? Shouldn’t the Dalvik JIT translate the bytecode and dramatically reduce the gap to native compiled code speed?

Note: I have not confirmed the x40 performance gain myself for this algorithm, but all serious image manipulation algorithm I encounter for Android, are using the NDK – so this supports the notion that NDK code will run much faster.

Solution

For algorithms that operate over arrays of data, there are two things that significantly change performance between a language like Java, and C:

  • Array bound checking: Java will check every access, bmap[i], and confirm i is within the array bounds. If the code tries to access out of bounds, you will get a useful exception. C & C++ do not check anything and just trust your code. The best case response to an out of bounds access is a page fault. A more likely result is "unexpected behavior".

  • Pointers: You can significantly reduce the operations by using pointers.

Take this innocent example of a common filter (similar to blur, but 1D):

for(int i = 0; i < ndata - ncoef; ++i) {  
    z[i] = 0;  
    for(int k = 0; k < ncoef; ++k) {  
        z[i] += c[k] * d[i + k];  
    }  
}  

When you access an array element, coef[k] is:

  • Load address of array coef into register;
  • Load value k into a register;
  • Sum them;
  • Go get memory at that address.

Every one of those array accesses can be improved because you know that the indexes are sequential. Neither the compiler, nor the JIT can know that the indexes are sequential so they cannot optimize fully (although they keep trying).

In C++, you would write code more like this:

int d[10000];  
int z[10000];  
int coef[10];  
int* zptr;  
int* dptr;  
int* cptr;  
dptr = &(d[0]); // Just being overly explicit here, more likely you would dptr = d;  
zptr = &(z[0]); // or zptr = z;  
for(int i = 0; i < (ndata - ncoef); ++i) {  
    *zptr = 0; 
    *cptr = coef;  
    *dptr = d + i;  
    for(int k = 0; k < ncoef; ++k) {  
        *zptr += *cptr * *dptr;  
        cptr++;  
        dptr++;  
    }  
    zptr++;  
}  
       

When you first do something like this (and succeed in getting it correct) you will be surprised how much faster it can be. All the array address calculations of fetching the index and summing the index and base address are replaced with an increment instruction.

For 2D array operations such as blur on an image, an innocent code data[r,c] involves two value fetches, a multiply and a sum. So with 2D arrays the benefits of pointers allows you to remove multiply operations.

So the language allows real reduction in the operations the CPU must perform. The cost is that the C++ code is horrendous to read and debug. Errors in pointers and buffer overflows are food for hackers. But when it comes to raw number grinding algorithms, the speed improvement is too tempting to ignore.

Answered By – jdr5ca

Answer Checked By – Robin (Easybugfix Admin)

Leave a Reply

(*) Required, Your email will not be published