The parallel version of STBY did not take host endianness into
account, and also computed the incorrect address for STBY_E.
Bswap twice to handle the merge and store. Compute mask inside
the function rather than as a parameter. Force align the address,
rather than subtracting one.
Generalize the function to system mode by using probe_access().