Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault-tolerance/High Availability)
scenarios, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpointing and COLO take
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VMs execute
until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the amount of data sent
over the network during a vmstate checkpoint, the disk modification
operations of the Primary disk are asynchronously forwarded to the
Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests will be copied and forwarded to Secondary
       QEMU.
    2) Before Primary write requests are written to Secondary disk, the
       original sector content will be read from Secondary disk and
       buffered in the Disk buffer, but it will not overwrite the existing
       sector content (it could be from either "Secondary Write Requests" or
       previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests will be written to Secondary disk.
    4) Secondary write requests will be buffered in the Disk buffer and it
       will overwrite the existing sector content in the buffer.
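
The four steps above can be illustrated with a small model. This is only a
sketch for reasoning about the buffer semantics, not QEMU code: the class
name, the method names and the dict-based "disk" are all hypothetical, and
sectors are modelled as dictionary keys.

```python
class SecondaryStorage:
    """Toy model of the Secondary disk plus its Disk buffer (sector -> bytes)."""

    def __init__(self, disk):
        self.disk = dict(disk)   # Secondary disk contents
        self.buffer = {}         # Disk buffer

    def primary_write(self, sector, data):
        # Step 2: save the original content first, unless the buffer already
        # holds this sector (from a Secondary write or an earlier COW).
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector, b"\0")
        # Step 3: speculative write-through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes only touch the buffer and always overwrite.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # The Secondary VM sees buffered content in preference to the disk.
        return self.buffer.get(sector, self.disk.get(sector, b"\0"))

    def checkpoint(self):
        # At a checkpoint the buffer is dropped.
        self.buffer.clear()

    def failover(self):
        # On failover the buffer is flushed into the Secondary disk.
        self.disk.update(self.buffer)
        self.buffer.clear()


s = SecondaryStorage({0: b"old0", 1: b"old1"})
s.primary_write(0, b"pri0")      # COW saves b"old0", disk gets b"pri0"
s.secondary_write(1, b"sec1")    # buffered only
s.primary_write(1, b"pri1")      # buffer keeps b"sec1"
view_before = (s.secondary_read(0), s.secondary_read(1))
s.checkpoint()
view_after = (s.secondary_read(0), s.secondary_read(1))
```

Before the checkpoint the Secondary VM still sees its own view of the disk;
after the buffer is dropped it sees the content written by the Primary.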

== Architecture ==
Block replication is built from many basic blocks that are already in
QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||                                                           virtio-blk
        /        \        ||                                                               ^
   Primary    2 filter                                                                     |
     disk         ^                                                                   7 Quorum
                  |                                                                    /
                3 NBD  ------->  3 NBD                                                /
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||         blockdev-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern (fifo) for quorum can be extended
to make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (named replication) controls block replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and its format
should support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also start as an empty disk,
and its driver should support bdrv_make_empty() and backing files.

6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. So before block replication,
the primary disk and secondary disk should contain the same data.

7) The secondary also has a quorum node, so after secondary failover it
can become the new primary and continue replication.


== Failure Handling ==
Seven internal errors can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we just report block replication's error to the FT/HA manager (which
decides when to do a new checkpoint and when to do failover).
In case 7, if the active commit failed, we use the replication failover failed
state in the Secondary's write operation (which decides which target to write
to).
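
The routing of the seven cases can be summarised as a lookup. The enum
member names and the returned strings below are hypothetical labels chosen
for this sketch; only the grouping of cases comes from the list above.

```python
from enum import Enum


class ReplicationError(Enum):
    PRIMARY_DISK_IO = 1
    FORWARD_FAILED = 2
    BACKUP_FAILED = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY_FAILED = 6
    FAILOVER_FAILED = 7


def error_destination(err):
    """Return where an error is reported, following the cases listed above."""
    if err in (ReplicationError.PRIMARY_DISK_IO,
               ReplicationError.ACTIVE_DISK_IO):
        return "disk layer"                 # cases 1 and 5
    if err is ReplicationError.FAILOVER_FAILED:
        return "failover failed state"      # case 7
    return "FT/HA manager"                  # cases 2, 3, 4 and 6


destinations = {e.value: error_destination(e) for e in ReplicationError}
```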

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in migration/checkpoint thread.
   We must call replication_start_all() in secondary QEMU before
   calling replication_start_all() in primary QEMU. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
c. replication_get_error_all()
   This interface is called to check whether an error occurred in
   replication. The caller must hold the I/O mutex lock if it is in
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into
   Secondary Disk and stop block replication. The VM should be stopped
   before calling it if you use this API to shut down the guest, or for
   anything other than failover. The caller must hold the I/O mutex lock
   if it is in migration/checkpoint thread.
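
One plausible calling order for these interfaces is sketched below. The
stub class is a hypothetical stand-in that only records calls; it does not
model locking or the real C signatures, and checking for errors before each
checkpoint is an assumption of this sketch rather than something the text
above requires.

```python
class ReplicationStub:
    """Hypothetical stand-in that records calls instead of doing block I/O."""

    def __init__(self):
        self.calls = []
        self.error = None   # set to any value to simulate a replication error

    def start_all(self):
        self.calls.append("replication_start_all")

    def do_checkpoint_all(self):
        self.calls.append("replication_do_checkpoint_all")

    def get_error_all(self):
        self.calls.append("replication_get_error_all")
        return self.error

    def stop_all(self):
        self.calls.append("replication_stop_all")


def run_checkpoints(repl, checkpoints):
    """Start replication, run checkpoints, and stop on the first error."""
    repl.start_all()
    done = 0
    for _ in range(checkpoints):
        if repl.get_error_all() is not None:
            break   # the FT/HA manager would decide on failover here
        # ... all VM state would be transferred to the Secondary here ...
        repl.do_checkpoint_all()
        done += 1
    repl.stop_all()
    return done
```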

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following QMP commands in primary QEMU:
    { "execute": "human-monitor-command",
      "arguments": {
          "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
      }
    }
    { "execute": "x-blockdev-change",
      "arguments": {
          "parent": "colo1",
          "node": "nbd_client1"
      }
    }
  Note:
  1. There should be only one NBD client for each primary disk.
  2. host is the hostname or IP address of the secondary physical machine.
  3. Each disk must have its own export name.
  4. The -drive option is all a single argument, and you should ignore the
     leading whitespace.
  5. The QMP commands must be run after the QMP commands have been run in
     secondary QEMU.
  6. After primary failover we need to remove children.1 (the replication
     driver).

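When the primary is driven from a script, the two QMP commands can be
generated instead of typed by hand. The helper below is a hypothetical
sketch: the function name, its parameters and the example address are
assumptions, and it only builds the JSON, it does not send it to QEMU.

```python
import json


def primary_qmp_commands(host, port, export="colo1", parent="colo1",
                         node_name="nbd_client1"):
    """Build the two QMP commands shown above for the primary."""
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        f"file.driver=nbd,file.host={host},file.port={port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


# 192.0.2.10 and 8889 are placeholder values for the secondary host and port.
for cmd in primary_qmp_commands("192.0.2.10", 8889):
    print(json.dumps(cmd))
```
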
Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1 \
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
         vote-threshold=1,children.0=childs1

  Then run the following QMP commands in secondary QEMU:
    { "execute": "nbd-server-start",
      "arguments": {
          "addr": {
              "type": "inet",
              "data": {
                  "host": "xxx",
                  "port": "xxx"
              }
          }
      }
    }
    { "execute": "nbd-server-add",
      "arguments": {
          "device": "colo1",
          "writable": true
      }
    }

  Note:
  1. The export name in the secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same on the primary and
     the secondary.
  3. The QMP commands nbd-server-start and nbd-server-add must be run
     before running the QMP command migrate on primary QEMU.
  4. The active disk, hidden disk and NBD target should all have the same
     length.
  5. It is better to put the active disk and hidden disk on a ramdisk.
  6. Each -drive option is all a single argument, and you should ignore
     the leading whitespace.
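
The secondary's NBD server commands can be generated in the same way. The
function name, its parameters and the example address are hypothetical; the
port is passed as a string because that is how it appears in the command
above.

```python
import json


def secondary_qmp_commands(host, port, device="colo1"):
    """Build the nbd-server-start and nbd-server-add commands shown above."""
    return [
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": host, "port": str(port)}}}},
        {"execute": "nbd-server-add",
         "arguments": {"device": device, "writable": True}},
    ]


# These must be sent before the primary's QMP commands (see the notes above).
# 192.0.2.10 and 8889 are placeholder values.
for cmd in secondary_qmp_commands("192.0.2.10", 8889):
    print(json.dumps(cmd))
```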

After Failover:
Primary:
  The secondary host is down, so we should run the following QMP commands
  to remove the NBD child from the quorum:
  { "execute": "x-blockdev-change",
    "arguments": {
        "parent": "colo1",
        "child": "children.1"
    }
  }
  { "execute": "human-monitor-command",
    "arguments": {
        "command-line": "drive_del xxxx"
    }
  }
  Note: there is currently no QMP command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following QMP command:
  { "execute": "nbd-server-stop" }

Promote Secondary to Primary:
  see COLO-FT.txt

TODO:
1. Shared disk