aarch64:
before this patchset: 3.2 GiB/s
bind node before this patchset: 6.9 GiB/s
after this patchset: 7.9 GiB/s
bind node after this patchset: 8.0 GiB/s
x86 (bind node is not tested yet):
before this patchset: 7.0 GiB/s
after this patchset: 9.3 GiB/s
Please note that on the aarch64 test machine, memory access latency
across nodes is much worse than on the local node, which is why
bandwidth is much better when bound to a node.