Post by Tony Houghton
Post by Nix
Because prelinking has to compute addresses for shared libraries that
require no relocation for *any binary on the system* (if possible).
This means it has to determine sizes and load addresses and things for
every accessible binary and every shared library that one of those
binaries uses before it knows where to relocate an unprelinked shared
library to.
How does dynamic linking avoid the need to know about every binary a
library might be linked with? Does it reserve two different address
ranges for libraries and binaries?
Well, normally when ld.so relocates a binary it only needs to know about
symbols in libraries that *this binary* (transitively) needs: it ensures
that all symbols in all those libraries are at addresses distinct from
all other such symbols.
As such, you could easily have this situation:
bin1:
        libreadline.so.4 => /lib/libreadline.so.4 (0x70048000)
        libhistory.so.4 => /lib/libhistory.so.4 (0x70090000)
        libncurses.so.5 => /lib/libncurses.so.5 (0x700a8000)
        libdl.so.2 => /lib/tls/libdl.so.2 (0x70104000)
        libc.so.6 => /lib/tls/libc.so.6 (0x70118000)
        /lib/ld-linux.so.2 (0x70000000)
bin2:
        libc.so.6 => /lib/tls/libc.so.6 (0x70048000)
        /lib/ld-linux.so.2 (0x70000000)
Note that libc.so.6 in bin2 has an address which *overlaps* the address
used for libreadline.so.4 in bin1. In non-prelinked, non-execstacked,
non-PaX Linux systems, this is how it always works: libraries are
relocated starting from the lowest available address, so they'll overlap
all over the place, and essentially every library needs relocation at
runtime to make them not overlap when starting any binary that uses that
library. This relocation is normally lazy --- i.e., it happens to each
function when that function is called --- but it still costs both at
runtime and at startup time. Plus, because relocation involves writing
to text pages (the GOT and PLT), it dirties those pages, leading to
greater memory consumption (those pages must now be swapped out, not
paged out, because they're not the same as they were in the binary).
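To check the overlap mechanically rather than by eye, here is a minimal sketch that parses the two listings above and reports which libraries share a base address across binaries and which library moved. The function names (`parse_ldd`, `shared_addresses`, `moved_libraries`) are made up for this illustration; only the listings come from the real output.

```python
import re

# Matches "libc.so.6 => /lib/tls/libc.so.6 (0x70118000)"
# and also "/lib/ld-linux.so.2 (0x70000000)".
LDD_LINE = re.compile(r"^\s*(?:(\S+)\s+=>\s+)?(\S+)\s+\((0x[0-9a-fA-F]+)\)\s*$")

BIN1 = """\
libreadline.so.4 => /lib/libreadline.so.4 (0x70048000)
libhistory.so.4 => /lib/libhistory.so.4 (0x70090000)
libncurses.so.5 => /lib/libncurses.so.5 (0x700a8000)
libdl.so.2 => /lib/tls/libdl.so.2 (0x70104000)
libc.so.6 => /lib/tls/libc.so.6 (0x70118000)
/lib/ld-linux.so.2 (0x70000000)
"""

BIN2 = """\
libc.so.6 => /lib/tls/libc.so.6 (0x70048000)
/lib/ld-linux.so.2 (0x70000000)
"""


def parse_ldd(text):
    """Return {library name: load address} for one ldd-style listing."""
    result = {}
    for line in text.splitlines():
        m = LDD_LINE.match(line)
        if m:
            soname, path, addr = m.groups()
            # Lines without "=>" (the dynamic loader) are keyed by path.
            result[soname or path] = int(addr, 16)
    return result


def shared_addresses(a, b):
    """Different libraries that were given the same base address in a and b."""
    return sorted(
        (lib_a, lib_b, hex(addr_a))
        for lib_a, addr_a in a.items()
        for lib_b, addr_b in b.items()
        if lib_a != lib_b and addr_a == addr_b
    )


def moved_libraries(a, b):
    """Libraries present in both listings but loaded at different addresses."""
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])


if __name__ == "__main__":
    bin1, bin2 = parse_ldd(BIN1), parse_ldd(BIN2)
    print(shared_addresses(bin1, bin2))
    print(moved_libraries(bin1, bin2))
```

On the listings above this flags libreadline.so.4 (in bin1) and libc.so.6 (in bin2) as sharing 0x70048000, and libc.so.6 as the library that sits at a different address in each binary.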
You can see that this relocation processing is happening:
***@amaterasu 37 /tmp% LD_DEBUG=statistics ./bin2
22949:
22949: runtime linker statistics:
22949: total startup time in dynamic loader: 667605 clock cycles
22949: time needed for relocation: 339611 clock cycles (50.8%)
22949: number of relocations: 107
22949: number of relocations from cache: 5
22949: number of relative relocations: 3889
22949: time needed to load objects: 176055 clock cycles (26.3%)
22949:
22949: runtime linker statistics:
22949: final number of relocations: 126
22949: final number of relocations from cache: 5
Here's the same pair of binaries on a prelinked system (well, actually
this is a different architecture, but that's academic):
bin1:
        libreadline.so.4 => /lib/libreadline.so.4 (0x47d17000)
        libhistory.so.4 => /lib/libhistory.so.4 (0x47cc4000)
        libdl.so.2 => /lib/tls/libdl.so.2 (0x42019000)
        libc.so.6 => /lib/tls/libc.so.6 (0x41ed1000)
        libncurses.so.5 => /lib/libncurses.so.5 (0x433c6000)
        /lib/ld-linux.so.2 (0x41eb8000)
bin2:
        libc.so.6 => /lib/tls/libc.so.6 (0x41ed1000)
        /lib/ld-linux.so.2 (0x41eb8000)
Note that the addresses of libc.so.6 on these two independent binaries
are *identical*, and they're all nonoverlapping for both these
binaries. The same is true of the largest possible set of shared
libraries on the system, given address space constraints. Therefore,
no relocation is necessary for either of those binaries to start.
But note that allocating addresses such that every shared library has
a nonoverlapping address in every binary on the system (or as close to
that state as possible) requires a global pass over *every* binary
and *all* those binaries' dependent shared libraries: furthermore,
expanding one shared library by even a few KB might cause it to
collide with some other library's address range.
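To make the "global pass" concrete, here is a deliberately naive sketch: gather every library used by any binary, then hand out page-aligned, nonoverlapping base addresses one after another. This is NOT prelink's actual algorithm (the real tool is considerably more elaborate); the sizes, the start address, and the names `assign_bases` and `overlaps` are all invented for the illustration.

```python
PAGE = 0x1000  # assumed page size for this sketch


def align_up(value, alignment=PAGE):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def assign_bases(binaries, lib_sizes, start=0x41000000):
    """Give every library one base address valid for all binaries.

    binaries:  {binary name: [library names it needs]}
    lib_sizes: {library name: size in bytes} (hypothetical figures)
    """
    # The global pass: every library needed by *any* binary takes part.
    needed = sorted({lib for libs in binaries.values() for lib in libs})
    bases = {}
    cursor = start
    for lib in needed:
        bases[lib] = cursor
        cursor = align_up(cursor + lib_sizes[lib])
    return bases


def overlaps(bases, lib_sizes):
    """True if any two assigned address ranges intersect."""
    spans = sorted((bases[lib], bases[lib] + lib_sizes[lib]) for lib in bases)
    return any(end > nxt for (_, end), (nxt, _) in zip(spans, spans[1:]))


if __name__ == "__main__":
    binaries = {
        "bin1": ["libreadline.so.4", "libdl.so.2", "libc.so.6"],
        "bin2": ["libc.so.6"],
    }
    sizes = {"libc.so.6": 0x140000, "libdl.so.2": 0x3000,
             "libreadline.so.4": 0x30000}
    before = assign_bases(binaries, sizes)

    # Grow libc by a few KB: everything placed after it has to move.
    grown = dict(sizes, **{"libc.so.6": sizes["libc.so.6"] + 0x2000})
    after = assign_bases(binaries, grown)
    for lib in before:
        print(lib, hex(before[lib]), "->", hex(after[lib]))
```

Even in this toy version, growing one library shifts every library laid out after it, which is why a single upgrade can force the whole layout to be recomputed.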
Here's proof of lack-of-relocation-processing:
***@hades 1 /home/nix% LD_DEBUG=statistics ./bin2
19661:
19661: runtime linker statistics:
19661: total startup time in dynamic loader: 1250146 clock cycles
19661: time needed for relocation: 14762 clock cycles (1.1%)
19661: number of relocations: 0
19661: number of relocations from cache: 19
19661: number of relative relocations: 0
19661: time needed to load objects: 217524 clock cycles (17.3%)
19661:
19661: runtime linker statistics:
19661: final number of relocations: 0
19661: final number of relocations from cache: 19
Note the difference in `time needed for relocation': roughly 340,000
versus 15,000 clock cycles. (The increased `time needed to load objects'
is because this machine is faster than the nonprelinked one, so each
disk hit costs more clock cycles.)
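If you want to compare runs without squinting at the output, a small parser along these lines would do. It is a sketch that assumes the statistics keep the "pid: label: number" layout shown above; `parse_stats` is a made-up helper, not part of any tool. Bear in mind that the two runs here come from different machines, so the ratio is only indicative.

```python
import re

# "22949: time needed for relocation: 339611 clock cycles (50.8%)"
STAT = re.compile(r"^\s*\d+:\s+(.+?):\s+(\d+)")

NONPRELINKED = """\
22949: total startup time in dynamic loader: 667605 clock cycles
22949: time needed for relocation: 339611 clock cycles (50.8%)
22949: number of relocations: 107
22949: number of relocations from cache: 5
22949: number of relative relocations: 3889
22949: time needed to load objects: 176055 clock cycles (26.3%)
"""

PRELINKED = """\
19661: total startup time in dynamic loader: 1250146 clock cycles
19661: time needed for relocation: 14762 clock cycles (1.1%)
19661: number of relocations: 0
19661: number of relocations from cache: 19
19661: number of relative relocations: 0
19661: time needed to load objects: 217524 clock cycles (17.3%)
"""


def parse_stats(text):
    """Return {label: integer value} for LD_DEBUG=statistics style lines."""
    stats = {}
    for line in text.splitlines():
        m = STAT.match(line)
        if m:
            # Keep the first occurrence of each label.
            stats.setdefault(m.group(1).strip(), int(m.group(2)))
    return stats


if __name__ == "__main__":
    slow = parse_stats(NONPRELINKED)
    fast = parse_stats(PRELINKED)
    key = "time needed for relocation"
    print(f"{key}: {slow[key]} vs {fast[key]} cycles "
          f"({slow[key] / fast[key]:.1f}x)")
```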
If you turn on LD_BIND_NOW, the number of relocations without prelink
goes up to the thousands.
This was a tiny binary: for larger ones, like OpenOffice, the time
difference can be titanic. I mean, here's a nonprelinked copy of koffice
starting up and then getting ctrl-C'ed because I don't have an X
server on this console so I can't quit it normally ;)
22996:
22996: runtime linker statistics:
22996: total startup time in dynamic loader: 1687696747 clock cycles
22996: time needed for relocation: 1259756503 clock cycles (74.6%)
22996: number of relocations: 27617
22996: number of relocations from cache: 60427
22996: number of relative relocations: 64362
22996: time needed to load objects: 414493752 clock cycles (24.5%)
[...]
23001:
23001: runtime linker statistics:
23001: final number of relocations: 27092
23001: final number of relocations from cache: 58000
That's not a small cost: this was on a 500MHz UltraSPARC, so relocation
alone ate at *least* 2.5 seconds, and that was only one of about a dozen processes
that got started when kwrite started up. With prelink, all of that cost
vanishes.
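For anyone who wants to check the arithmetic, here is the conversion from clock cycles to seconds, assuming the cycle counter ticks at the nominal 500MHz clock rate (an assumption; the statistics themselves do not state the rate).

```python
CLOCK_HZ = 500_000_000  # assumed: nominal clock of the 500MHz UltraSPARC


def cycles_to_seconds(cycles, hz=CLOCK_HZ):
    """Convert a cycle count to seconds at the given clock rate."""
    return cycles / hz


if __name__ == "__main__":
    relocation = cycles_to_seconds(1259756503)  # time needed for relocation
    total = cycles_to_seconds(1687696747)       # total startup time in loader
    print(f"relocation: {relocation:.2f} s, total in loader: {total:.2f} s")
```

That works out to about 2.52 seconds for relocation alone and about 3.38 seconds for the dynamic loader as a whole.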
Post by Tony Houghton
Has anyone thought of making a compromise, using simple arrays for
relocation tables instead of having to look up names at load time?
Windows does this (using integer `ordinals' as well as symbol names). It
speeds up some things (symbol comparison) and makes maintaining ABI
compatibility an utter nightmare. It's one to avoid. In any case, ELF
requires the symbol tables. :)
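A toy model of why position-based binding is fragile: the client remembers only an index into the library's export array, so inserting one export silently rebinds the client to the wrong function. The export names and both helper functions are invented for this sketch, and it does not reproduce the real Windows export table format.

```python
# Export tables of two versions of a hypothetical library. Version 2 adds
# flush_file, which lands in the middle of the array.
V1_EXPORTS = ["close_file", "open_file", "read_file"]
V2_EXPORTS = ["close_file", "flush_file", "open_file", "read_file"]


def bind_by_ordinal(exports, ordinal):
    """Bind to whatever currently sits at the recorded position."""
    return exports[ordinal]


def bind_by_name(exports, name):
    """Bind by symbol name; fails loudly if the symbol has gone away."""
    if name not in exports:
        raise LookupError(f"undefined symbol: {name}")
    return name


if __name__ == "__main__":
    # A client linked against version 1 recorded this ordinal for open_file.
    ordinal = V1_EXPORTS.index("open_file")

    print(bind_by_ordinal(V1_EXPORTS, ordinal))   # what the client wanted
    print(bind_by_ordinal(V2_EXPORTS, ordinal))   # what it gets after upgrade
    print(bind_by_name(V2_EXPORTS, "open_file"))  # name lookup still correct
```

Keeping ordinal binding safe means never letting an existing export change position, across every release; name lookup costs string comparisons but has no such constraint.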
There are certain fairly simple tricks involving merging of relocation
sections that can reduce the number of symbol name comparisons required
by a large factor (usually about 30--80%): this is implemented via the
`-z combreloc' linker flag, which is on by default. Almost certainly
everything on your system is linked with this option already.
(Note that combreloc merely speeds up part of relocation processing:
prelink *eliminates* it.)
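Roughly, the trick is that once relocations are merged and sorted so that those against the same symbol sit next to each other, the dynamic linker can reuse the result of its previous lookup instead of searching again (this is my understanding of what the `relocations from cache' counter in the statistics reflects). A toy model, assuming a one-entry cache and made-up relocation data:

```python
def count_lookups(reloc_symbols):
    """Count full symbol lookups when the last result is cached.

    reloc_symbols is the sequence of symbol names referenced by successive
    relocations; a lookup is only needed when the symbol differs from the
    one used by the previous relocation.
    """
    lookups = 0
    last = None
    for sym in reloc_symbols:
        if sym != last:
            lookups += 1
            last = sym
    return lookups


if __name__ == "__main__":
    # Hypothetical relocation order as it might appear without sorting.
    relocs = ["malloc", "free", "printf", "malloc",
              "free", "malloc", "printf", "free"]

    print("unsorted:", count_lookups(relocs))
    print("sorted:  ", count_lookups(sorted(relocs)))
```

In this made-up sequence every relocation needs its own lookup when unsorted, while the sorted order needs only one lookup per distinct symbol. The lookups that remain are still real work at every startup, which is the part prelink removes.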
A good read on prelink operation is the prelink paper prelink.pdf,
probably packaged with your copy of prelink. Section 2 discusses
relocation section merging and `-z combreloc'. (Parts of this are hard
to comprehend without a good understanding of ELF, but much can be
grasped anyway.)
--
This is like system("/usr/funky/bin/perl -e 'exec sleep 1'");
--- Peter da Silva