I'm dealing with some code that's getting 70-80% slower when compiled as PIC (position independent code), and looking for ways to alleviate the problem. A big part of the problem is that gcc insists on inserting the following in every single function:
call __i686.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_,%ebx
even if that ends up being 20% of the content of the function. Now, ebx is a call-preserved register, and every function in the relevant translation unit (source file) is loading it with the address of the GOT, and it's easily detectable that the static functions cannot be called from outside the translation unit (their addresses are never taken). So why can't gcc just load ebx once at the beginning of the big external-linkage functions, and generate the static-linkage functions so that they assume ebx has already been loaded with the address of the GOT? Is there any optimization flag I can use to force gcc to make this obvious and massive optimization, short of turning the inline limits up sky-high so everything gets inlined into the external functions?
There is probably no generic cure for this, but you could try playing around with the inlining options. I'd guess that static functions in a compilation unit don't have too many callers, so the overhead from code replication wouldn't be too bad.
The easiest way to force this with gcc would be to mark the functions with __attribute__((always_inline)). You could wrap that in a gcc-dependent macro to keep the code portable.
If you don't want to modify the code (though static inline would be good anyhow), you could use the -finline-limit option to fine-tune that.
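As a rough sketch of the macro idea: FORCE_INLINE, table, lookup and sum_table are names made up for this illustration, not anything from the original code. The macro expands to gcc's always_inline attribute when gcc (or a compatible compiler) is detected and falls back to plain static inline otherwise.

```c
#include <stddef.h>

/* FORCE_INLINE is a hypothetical macro name chosen for this sketch. */
#if defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline   /* only a hint on other compilers */
#endif

/* File-scope data: in 32-bit x86 PIC code, accessing it needs the GOT
   pointer that gcc sets up in ebx. */
static int table[4] = {1, 2, 3, 4};

/* Small static helper; once inlined it has no prologue of its own. */
FORCE_INLINE int lookup(size_t i)
{
    return table[i & 3];
}

/* External-linkage entry point; the helper is expanded in its body. */
int sum_table(void)
{
    int total = 0;
    for (size_t i = 0; i < 4; ++i)
        total += lookup(i);
    return total;
}
```

With the helper inlined, only sum_table should need the thunk call, but it is worth checking the generated assembly (gcc -S) to confirm that for your compiler version.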
Not really a solution, but: if the functions in question do not reference file-scope variables, you could put them all together in a single translation unit and compile it without the -fPIC flag. Then you link it together with the other files into the final shared object as usual.
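To illustrate what kind of function qualifies, here is a sketch of such a translation unit. The file name helpers.c and the functions rotl32 and mix are invented for this example; the point is only that they touch nothing but their arguments and locals, so they have no need for the GOT.

```c
#include <stdint.h>

/* helpers.c (hypothetical file name): no file-scope variables here,
   so nothing in this file needs the GOT address. */

/* Rotate x left by n bits; n is reduced modulo 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Toy mixing step built on rotl32; uses arguments and constants only. */
uint32_t mix(uint32_t h, uint32_t k)
{
    return rotl32(h ^ k, 13) * 5u + 0xe6546b64u;
}
```

Under that assumption you would build this one file without -fPIC, the remaining files with -fPIC, and pass all the object files to gcc -shared as usual. Whether the linker accepts non-PIC objects in a shared object depends on the target, so treat this as something to try rather than a guarantee.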