Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-4.1-20190529' into staging

diff --git a/docs/rdma.txt b/docs/rdma.txt

index 45a4b1d50dc60bf330ba8cb5f4ea632a463cbf53..a86e992c84538609876baf26291fb5c3edbd4c9d 100644
--- a/docs/rdma.txt
+++ b/docs/rdma.txt
@@ -1,7 +1,7 @@
  (RDMA: Remote Direct Memory Access)
  RDMA Live Migration Specification, Version # 1
  ==============================================
-Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
+Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
  Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
  
  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
@@ -18,7 +18,7 @@ Contents:
  * RDMA Migration Protocol Description
  * Versioning and Capabilities
  * QEMUFileRDMA Interface
-* Migration of pc.ram
+* Migration of VM's ram
  * Error handling
  * TODO
  
@@ -30,12 +30,12 @@ of the significantly lower latency and higher throughput over TCP/IP. This is
  because the RDMA I/O architecture reduces the number of interrupts and
  data copies by bypassing the host networking stack. In particular, a TCP-based
  migration, under certain types of memory-bound workloads, may take a more
-unpredicatable amount of time to complete the migration if the amount of
+unpredictable amount of time to complete the migration if the amount of
  memory tracked during each live migration iteration round cannot keep pace
  with the rate of dirty memory produced by the workload.
  
  RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Convered Ethernet) as well as Infiniband-based. This implementation of
+over Converged Ethernet) as well as Infiniband-based. This implementation of
  migration using RDMA is capable of using both technologies because of
  the use of the OpenFabrics OFED software stack that abstracts out the
  programming model irrespective of the underlying hardware.
@@ -66,7 +66,7 @@ bulk-phase round of the migration and can be enabled for extremely
  high-performance RDMA hardware using the following command:
  
  QEMU Monitor Command:
-$ migrate_set_capability x-rdma-pin-all on # disabled by default
+$ migrate_set_capability rdma-pin-all on # disabled by default
  
  Performing this action will cause all 8GB to be pinned, so if that's
  not what you want, then please ignore this step altogether.
@@ -93,12 +93,12 @@ $ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
  
  Next, on the destination machine, add the following to the QEMU command line:
  
-qemu ..... -incoming x-rdma:host:port
+qemu ..... -incoming rdma:host:port
  
  Finally, perform the actual migration on the source machine:
  
  QEMU Monitor Command:
-$ migrate -d x-rdma:host:port
+$ migrate -d rdma:host:port
  
  PERFORMANCE
  ===========
@@ -120,8 +120,8 @@ For example, in the same 8GB RAM example with all 8GB of memory in
  active use and the VM itself is completely idle using the same 40 gbps
  infiniband link:
  
-1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
-2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
+1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
+2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
  
  These numbers would of course scale up to whatever size virtual machine
  you have to migrate using RDMA.
@@ -149,7 +149,7 @@ The only difference between a SEND message and an RDMA
  message is that SEND messages cause notifications
  to be posted to the completion queue (CQ) on the
  infiniband receiver side, whereas RDMA messages (used
-for pc.ram) do not (to behave like an actual DMA).
+for VM's ram) do not (to behave like an actual DMA).
  
  Messages in infiniband require two things:
  
@@ -188,9 +188,9 @@ header portion and a data portion (but together are transmitted
  as a single SEND message).
  
  Header:
-    * Length  (of the data portion, uint32, network byte order)
-    * Type    (what command to perform, uint32, network byte order)
-    * Repeat  (Number of commands in data portion, same type only)
+    * Length               (of the data portion, uint32, network byte order)
+    * Type                 (what command to perform, uint32, network byte order)
+    * Repeat               (Number of commands in data portion, same type only)
  
  The 'Repeat' field is here to support future multiple page registrations
  in a single message without any need to change the protocol itself
@@ -199,20 +199,22 @@ Version #1 requires that all server implementations of the protocol must
  check this field and register all requests found in the array of commands located
  in the data portion and return an equal number of results in the response.
  The maximum number of repeats is hard-coded to 4096. This is a conservative
-limit based on the maximum size of a SEND message along with emperical
+limit based on the maximum size of a SEND message along with empirical
  observations on the maximum future benefit of simultaneous page registrations.
  
-The 'type' field has 10 different command values:
-    1. Unused
-    2. Error              (sent to the source during bad things)
-    3. Ready              (control-channel is available)
-    4. QEMU File          (for sending non-live device state)
-    5. RAM Blocks request (used right after connection setup)
-    6. RAM Blocks result  (used right after connection setup)
-    7. Compress page      (zap zero page and skip registration)
-    8. Register request   (dynamic chunk registration)
-    9. Register result    ('rkey' to be used by sender)
-    10. Register finished  (registration for current iteration finished)
+The 'type' field has 12 different command values:
+     1. Unused
+     2. Error                      (sent to the source during bad things)
+     3. Ready                      (control-channel is available)
+     4. QEMU File                  (for sending non-live device state)
+     5. RAM Blocks request         (used right after connection setup)
+     6. RAM Blocks result          (used right after connection setup)
+     7. Compress page              (zap zero page and skip registration)
+     8. Register request           (dynamic chunk registration)
+     9. Register result            ('rkey' to be used by sender)
+    10. Register finished          (registration for current iteration finished)
+    11. Unregister request         (unpin previously registered memory)
+    12. Unregister finished        (confirmation that unpin completed)
  
  A single control message, as hinted above, can contain within the data
  portion an array of many commands of the same type. If there is more than
@@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
     from the receiver to tell us that the receiver
     is *ready* for us to transmit some new bytes.
  2. Optionally: if we are expecting a response from the command
-   (that we have no yet transmitted), let's post an RQ
+   (that we have not yet transmitted), let's post an RQ
     work request to receive that data a few moments later.
  3. When the READY arrives, librdmacm will
     unblock us and we immediately post a RQ work request
@@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
  at connection-setup time before any infiniband traffic is generated.
  
  Header:
-    * Version (protocol version validated before send/recv occurs), uint32, network byte order
-    * Flags   (bitwise OR of each capability), uint32, network byte order
+    * Version (protocol version validated before send/recv occurs),
+                                               uint32, network byte order
+    * Flags   (bitwise OR of each capability),
+                                               uint32, network byte order
  
  There is no data portion of this header right now, so there is
  no length field. The maximum size of the 'private data' section
@@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
  If the version is new, we only negotiate the capabilities that the
  requested version is able to perform and ignore the rest.
  
-Currently there is only *one* capability in Version #1: dynamic page registration
+Currently there is only one capability in Version #1: dynamic page registration
  
  Finally: Negotiation happens with the Flags field: If the primary-VM
  sets a flag, but the destination does not support this capability, it
@@ -326,8 +330,8 @@ QEMUFileRDMA Interface:
  
  QEMUFileRDMA introduces a couple of new functions:
  
-1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)
  
  These two functions are very short and simply use the protocol
  describe above to deliver bytes without changing the upper-level
@@ -351,7 +355,7 @@ If the buffer is empty, then we follow the same steps
  listed above and issue another "QEMU File" protocol command,
  asking for a new SEND message to re-fill the buffer.
  
-Migration of pc.ram:
+Migration of VM's ram:
  ====================
  
  At the beginning of the migration, (migration-rdma.c),
@@ -403,13 +407,14 @@ socket is broken during a non-RDMA based migration.
  
  TODO:
  =====
-1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
-   renamed to 'rdma' after the experimental phase of this work has
-   completed upstream.
-2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
-   are not compatible with infinband memory pinning and will result in
+1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
+   are not compatible with infiniband memory pinning and will result in
     an aborted migration (but with the source VM left unaffected).
-3. Use of the recent /proc/<pid>/pagemap would likely speed up
+2. Use of the recent /proc/<pid>/pagemap would likely speed up
     the use of KSM and ballooning while using RDMA.
-4. Also, some form of balloon-device usage tracking would also
+3. Also, some form of balloon-device usage tracking would also
     help alleviate some issues.
+4. Use LRU to provide more fine-grained direction of UNREGISTER
+   requests for unpinning memory in an overcommitted environment.
+5. Expose UNREGISTER support to the user by way of workload-specific
+   hints about application behavior.
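
Reviewer note: the control-message header touched by the hunks above (Length, Type, Repeat, and the command list that grows from 10 to 12 entries) can be sketched as below. This is a minimal illustration, not QEMU's implementation. Assumptions: Repeat is treated as a uint32 in network byte order like the other two fields (the text only states the width for Length and Type); the command numbers are the 1-based list positions from the patched text, and the values actually used on the wire may differ; the helper names pack_control_message and unpack_control_message are hypothetical.

```python
import struct

# Assumption: three uint32 fields in network byte order (Length, Type, Repeat).
# The document states the width for Length and Type; Repeat's width is assumed.
HEADER = struct.Struct("!III")

# 1-based positions copied from the 12-entry list in the patched text.
# Assumption: real wire values may be numbered differently.
COMMAND_TYPES = {
    1: "Unused",
    2: "Error",
    3: "Ready",
    4: "QEMU File",
    5: "RAM Blocks request",
    6: "RAM Blocks result",
    7: "Compress page",
    8: "Register request",
    9: "Register result",
    10: "Register finished",
    11: "Unregister request",
    12: "Unregister finished",
}

# The text says the maximum number of repeats is hard-coded to 4096.
MAX_REPEAT = 4096


def pack_control_message(cmd_type, data=b"", repeat=1):
    """Build header + data as one byte string (one SEND message)."""
    if cmd_type not in COMMAND_TYPES:
        raise ValueError(f"unknown command type: {cmd_type}")
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError(f"repeat out of range: {repeat}")
    return HEADER.pack(len(data), cmd_type, repeat) + data


def unpack_control_message(message):
    """Split a message into (type, repeat, data), checking the length field."""
    length, cmd_type, repeat = HEADER.unpack_from(message)
    data = message[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    if repeat > MAX_REPEAT:
        raise ValueError(f"repeat out of range: {repeat}")
    return cmd_type, repeat, data


if __name__ == "__main__":
    msg = pack_control_message(11, b"\x00" * 8, repeat=1)
    print(len(msg), unpack_control_message(msg)[:2])
```

Keeping the 4096 check on both the packing and unpacking side mirrors the requirement that every server implementation must check the Repeat field.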