Analyzing Code for Efficiency?

INSERTED: Let’s look at “statistical significance”.

Assume there’s a function call instruction somewhere. You can’t necessarily see it – a class, or a macro, or the compiler, may have slipped it in.
There are other calls to the same function nearby, but this call is in a loop, or its arguments are such as to make this call take a long time. So much time in fact, that if this call could be made to take zero time, then total execution time would be reduced by some amount, say 90%. (Impossible? Not at all.) Would timing pinpoint it? No. Would a call graph pinpoint it? No. Would call counts pinpoint it? No. Because the problem is not at the function level, but at the call instruction level.

Somehow the program is randomly stopped at a point in time, and its state is examined. Will it have stopped in that 90% of the time that would be saved if the instruction could be “zeroed”? Of course – with 90% probability, and the instruction will be pinpointed on the stack waiting for its “work” to be completed.

In fact, if you stop it randomly 20 times, that instruction will be on the stack an average of 18 times, with a standard deviation of +/- 1.3 times.

Is that statistically significant? You bet it is.
Did you need a large number of samples? You bet you didn’t.

Suppose the percentage is small, like 10% or 5%. The same principle applies.

In fact, no matter how few samples are taken, any instruction that is on >1 sample is statistically significant, and is a “hot spot”, “bottleneck”, or whatever you want to call it. If you could remove it, call it less, or somehow reduce it, it will save significant time. (Some you can’t, like “call _main”, but others you can. You just need to find them.)

Of course my code would never be so stupid, would it? Well then, prove it.

OK, now back to the story . . .

ORIGINAL ANSWER: I had heard of profilers, way back when, and I thought they must be pretty neat, but I didn’t have access to them/it. I was working on an embedded processor (an 8086 intel chip) that seemed to be taking an awful long time painting some floating-point numbers on a display screen. The hardware guys suggested providing from their plenty – adding timer chips so I could see how long things were taking. However, one weekend I fired it up with an Intel “Blue Box” in-circuit emulator and got it running. While it was being slow, I wondered “What the heck is it doing?”. So I just halted it to find out. The PC was in the floating-point library (there was no FP chip). That was no surprise, since it was painting floating-point numbers, but I wanted to know more. So I (laboriously) read the hex memory so as to follow the call stack. Guess what? It was in the process of taking the number to be painted, dividing it by 10, converting to integer, converting back to float, subtracting, and so on, just to get the next digit to paint. Needless to say, there were better ways to do that, resulting in a speedup of around 10 times.
That was found with one (1) sample!

On another occasion, on a 68K chip, there was some slowness. Again, a profiler was not available, but debugger “adb” was, so while it was being slow, I halted it a few times. The PC was in the math library, actually in the 32-bit integer multiply routine. Looking up the stack, I found this code:

struct {...} a[...];

int i;
for (i = 0; i < ...; ++i){ ... a[i] ... }

There’s no call to multiply in there – what’s going on? Turns out, for a[i], the compiler has to multiply i by the array element size. Since i is 32 bits (in that compiler) it generates a call to the 32-bit multiply routine, and the stack pinpoints that call instruction. By declaring i as short, the loop tripled in speed!

What’s the point? If you take samples at random times while the program’s being slow, the PC will tell you what it’s doing, but the call stack will tell you why, and lead you straight to the problem. Now, unless the problem is really bad, you may need to take multiple samples. Any statement that shows up on >1 sample is one you can be suspicious of. Notice, it pinpoints statements, instructions even, not big chunks of code like functions. This technique may be “quick and dirty”, but it is very effective.

ADDED: If you do this repeatedly, you can solve problem after problem in the same software. For example, if you get a speedup of 3 times, small performance problems that may have been insignificant before now consume 3 times more of the remaining time. That makes them much easier to hit with samples. You may have to add a temporary outer loop to make it run long enough to sample. In this way, I’ve seen compounded speedup factors of more than 40 times.

Leave a Comment