v2 (chk): handle all sizes, simplify the patch quite a bit
v3 (chk): adjust dw estimation as well
v4 (chk): use single loop, make end mask 64bit
Signed-off-by: Roger He <Hongbo.He@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Tested-by: Roger He <Hongbo.He@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>