Up until now, malloc scanned the bits of the chunk bitmap from
position zero, skipping a random number of free slots and then
picking the next free one. This slowed things down, especially as
the number of full slots increased.
This changes the scanning to start at a random position in the
bitmap and then take the first available free slot, wrapping if
the end of the bitmap is reached. Of course we'll still scan more
if the bitmap becomes more full, but the extra iterations skipping
free slots and then some full slots are avoided.
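A minimal sketch of the new scan, not the committed code. The
function name find_free_slot, the SLOTS_PER_WORD constant, the
uint16_t bitmap words and the "set bit means free" convention are
all assumptions made for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_WORD	16	/* hypothetical: slots tracked per bitmap word */

/*
 * Return the index of the first free slot at or after `start`,
 * wrapping to slot 0 when the end of the bitmap is reached.
 * Returns -1 if no slot is free.  A set bit is assumed to mean "free".
 */
static int
find_free_slot(const uint16_t *bits, size_t total, size_t start)
{
	size_t i, n;

	if (total == 0)
		return -1;
	i = start % total;
	/* Visit each slot at most once, beginning at the random start. */
	for (n = 0; n < total; n++) {
		if (bits[i / SLOTS_PER_WORD] & (1U << (i % SLOTS_PER_WORD)))
			return (int)i;
		if (++i == total)
			i = 0;		/* wrap around */
	}
	return -1;
}
```

The loop still runs longer as the bitmap fills up, but it never
steps over a free slot on purpose.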
The random number is derived from a global, which is incremented
by a few random bits every time a chunk is needed (with a small
optimization if only one free slot is left).
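One way the start position could be derived, as a rough sketch only.
The names chunk_start and next_start are hypothetical, rand() merely
stands in for whatever random source the allocator really uses, four
bits is a guess at "a few", and reading the optimization as "skip
the random draw when only one slot is free" is an assumption:

```c
#include <stddef.h>
#include <stdlib.h>

static size_t chunk_start;	/* hypothetical global position counter */

/*
 * Advance the global counter by a few random bits and map it onto
 * a bitmap holding `total` slots.  When only one slot is free the
 * scan ends at that slot no matter where it starts, so the random
 * draw is skipped (assumed meaning of the optimization).
 */
static size_t
next_start(size_t total, size_t nfree)
{
	if (total == 0)
		return 0;
	if (nfree > 1)
		chunk_start += (size_t)rand() & 0xf;	/* 4 random bits */
	return chunk_start % total;
}
```

The returned value would then be used as the starting bit for the
bitmap scan.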
Thanks to the testers!