compiler-optimization
gcc optimization flag -O3 makes code slower than -O2
gcc -O3 uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov (which is 2 uops and 2 cycles of latency on your Intel Sandy Bridge CPU, according to Agner Fog’s instruction tables; see also the x86 tag wiki). This is one of the cases where cmov sucks. If … Read more
Why are elementwise additions much faster in separate loops than in a combined loop?
C loop optimization help for final assignment (with compiler optimization disabled)
Re-posting a modified version of my answer from optimized sum of an array of doubles in C, since that question got voted down to -5. The OP of the other question phrased it more as “what else is possible”, so I took him at his word and info-dumped about vectorizing and tuning for current CPU … Read more
How to remove “noise” from GCC/clang assembly output?
Stripping out the .cfi directives, unused labels, and comment lines is a solved problem: the scripts behind Matt Godbolt’s compiler explorer are open source in its GitHub project. It can even do colour highlighting to match source lines to asm lines (using the debug info). You can set it up locally so you can feed … Read more
Why doesn’t GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?
Because Floating Point Math is not Associative. The way you group the operands in floating point multiplication has an effect on the numerical accuracy of the answer. As a result, most compilers are very conservative about reordering floating point calculations unless they can be sure that the answer will stay the same, or unless you … Read more