Some [More] Notes on the Abstraction Penalty for IA86 C++ Compilers
Posted on February 25, 2008
by Tommy McGuire
I just happened to run across Dan Piponi's Some Notes on the Abstraction Penalty for IA86 C++ Compilers, which looks at the code generated for the function:
int test1(int z) {
Matrix<static<int,3,3> > m;
Vector<static<int,3> > x;
m[0][0] = 1;
m[0][1] = 0;
m[0][2] = 0;
m[1][0] = 0;
m[1][1] = z;
m[1][2] = 0;
m[2][0] = 0;
m[2][1] = z;
m[2][2] = z;
x[0] = 1;
x[1] = 2;
x[2] = 3;
return (m*x)[2];
}
This function (with the appropriate template definitions) multiplies its argument by five.
The Microsoft compiler (version 13.10.3052) generates three instructions: a mov, a lea, and a ret; the lea instruction multiplies the argument by five. GCC (version 3.3.1) produces about 150 instructions, one of which is a call to a matrix-times-vector function. The famed Intel compiler (version 8.1) produced something like 50 instructions.
Now, on the one hand, the Microsoft and Intel compilers have a significant optimization advantage (which is irrelevant in this case) in only compiling for one architecture. GCC's optimization capabilities are excellent in the dancing elephant sense: given everything GCC does, it optimizes amazingly well. On the other hand, this example involves template expansion and inlining, neither of which really involves the architecture.
However, GCC has significantly improved its C++ support in recent versions, so I wanted to find out what a more recent version (4.1.1) would produce using -O3. Here it is, in all its glory:
_Z5test1i:
.LFB1623:
leal (%rdi,%rdi,4), %eax
ret
.LFE1623:
...
nm reports that the function is 4 bytes, versus 853 unoptimized or 135 with -O2.
$ nm -t dec --demangle -S a.out | grep test1
0000000004197344 0000000000000012 t global constructors keyed to _Z5test1i
0000000004197376 0000000000000004 T test1(int)
Some progress has been made.