Get TCB address using the RDHWR instruction instead of __get_tcb().
This gives fast access to the address on systems that implement
the UserLocal register. TCB caching is still used when running
in the single-threaded mode in order not to penalize old systems.
The kernel counterpart of this change must be in place before
using this diff!
With guenther@