md/raid456: downlevel multicore operations to raid_run_ops
The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsytem that initially showed some modest gains
(+4%). However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe(). As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).
By downleveling the parallelism to raid_run_ops the pathological
stripe_head bouncing is eliminated. This version still exhibits an
average 11% throughput loss for: