=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations. Saving the state just does
that: it saves the state of each device that the guest is running.
Restoring a guest is just the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times. I.e. it can only restore the state in a guest that has
the same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration has
to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality is
requested: migration. This means that QEMU is able to start on one
machine and be "migrated" to another machine. I.e. it is moved to
another machine.

Next came the "live migration" functionality. This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all the state from one machine to another. Live
migration allows the guest to continue running while the state is
transferred. The guest has to be stopped only while the last part of
the state is transferred. Typically the time that the guest is
unresponsive during live migration is in the low hundreds of milliseconds
(notice that this depends on a lot of things).

Types of migration
==================

Now that we have talked about live migration, there are several ways
to do migration:

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU. QEMU doesn't care how this file descriptor is opened.

All four of these migration protocols use the same infrastructure to
save/restore device state. This infrastructure is shared with the
savevm/loadvm functionality.

State Live Migration
====================

This is used for RAM and block devices. It is not yet ported to vmstate.
<Fill more information here>

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see `migration/qemu-file.h`). In most cases this
is connected to a subtype of ``QIOChannel`` (see `io/`).

Saving the state of one device
==============================

The state of a device is saved using intermediate buffers. There are
some helper functions to assist this saving.

There is a new concept that we have to explain here: device state
version. When we migrate a device, we save/load the state as a series
of fields. Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information. We use the
version to identify each time that we make a change. Each version is
associated with a series of fields saved. The `save_state` always saves
the state as the newest version. But `load_state` is sometimes able to
load state from an older version.

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

    int register_savevm(DeviceState *dev,
                        const char *idstr,
                        int instance_id,
                        int version_id,
                        SaveStateHandler *save_state,
                        LoadStateHandler *load_state,
                        void *opaque);

    typedef void SaveStateHandler(QEMUFile *f, void *opaque);
    typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);

The important functions for the device state format are the `save_state`
and `load_state`. Notice that `load_state` receives a version_id
parameter to know what state format it is receiving. `save_state` doesn't
have a version_id parameter because it always uses the latest version.
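
As a minimal sketch of why only the load side takes a version, the
following self-contained example models a device whose ``mode`` field was
added in version 2. ``FakeStream``, ``ToyDevice`` and the ``toy_*``
functions are hypothetical stand-ins invented for illustration; they are
not QEMU APIs, and a real handler would read and write a ``QEMUFile``.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for QEMUFile: a fixed buffer with a cursor. */
    typedef struct {
        uint8_t data[16];
        size_t pos;
    } FakeStream;

    /* Hypothetical device state; 'mode' was added in version 2. */
    typedef struct {
        uint8_t status;
        uint8_t mode;
    } ToyDevice;

    enum { TOY_VERSION = 2 };

    /* Saving takes no version: it always writes the latest format. */
    void toy_save(FakeStream *f, const ToyDevice *d)
    {
        assert(f->pos + 2 <= sizeof(f->data));
        f->data[f->pos++] = d->status;
        f->data[f->pos++] = d->mode;
    }

    /* Loading is told which format the stream uses and adapts to it. */
    int toy_load(FakeStream *f, ToyDevice *d, int version_id)
    {
        if (version_id < 1 || version_id > TOY_VERSION) {
            return -1; /* unknown format */
        }
        if (f->pos + 2 > sizeof(f->data)) {
            return -1; /* not enough data left in the buffer */
        }
        d->status = f->data[f->pos++];
        /* Version 1 streams carry no 'mode'; fall back to a default. */
        d->mode = (version_id >= 2) ? f->data[f->pos++] : 0;
        return 0;
    }

The loader needs one branch per format it still accepts, and those
branches must be kept in step with the saver by hand.
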

VMState
-------

The legacy way of saving/loading the state of a device had the problem
that we have to keep two functions in sync. If we made a change
in one of them and not in the other, we would get a failed migration.

VMState changed the way that state is saved/loaded. Instead of using
a function to save the state and another to load it, it was changed to
a declarative description of what the state consists of. VMState is able
to interpret that definition to load/save the state. As
the state is declared only once, the save and load sides can't go out
of sync.

An example (from hw/input/pckbd.c)

.. code:: c

    static const VMStateDescription vmstate_kbd = {
        .name = "pckbd",
        .version_id = 3,
        .minimum_version_id = 3,
        .fields = (VMStateField[]) {
            VMSTATE_UINT8(write_cmd, KBDState),
            VMSTATE_UINT8(status, KBDState),
            VMSTATE_UINT8(mode, KBDState),
            VMSTATE_UINT8(pending, KBDState),
            VMSTATE_END_OF_LIST()
        }
    };

We are declaring the state with name "pckbd".
The `version_id` is 3, and the fields are 4 uint8_t in a KBDState structure.
We registered this with:

.. code:: c

    vmstate_register(NULL, 0, &vmstate_kbd, s);

Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.

You can search for ``VMSTATE_*`` macros for lots of types used in QEMU in
include/hw/hw.h.

More about versions
-------------------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backwards-migration
compatibility; in general most changes can be made by adding Subsections
(see below) or _TEST macros (see below) which won't break compatibility.

You can see that there are several version fields:

- `version_id`: the maximum version_id supported by VMState for that device.
- `minimum_version_id`: the minimum version_id that VMState is able to understand
  for that device.
- `minimum_version_id_old`: For devices that were not able to port to vmstate, we can
  assign a function that knows how to read this old state. This field is
  ignored if there is no `load_state_old` handler.

So, VMState is able to read versions from minimum_version_id to
version_id. And the function ``load_state_old()`` (if present) is able to
load state from minimum_version_id_old to minimum_version_id. This
function is deprecated and will be removed when no more users are left.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.
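
The accepted range can be sketched as a simple bounds check. This is an
illustration only: ``ToyVersions`` and ``toy_version_loadable()`` are
hypothetical names, and the sketch ignores the deprecated
``load_state_old()`` path.

.. code:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical summary of the version fields of one device. */
    typedef struct {
        int version_id;
        int minimum_version_id;
    } ToyVersions;

    /* True if an incoming section of 'stream_version' can be loaded. */
    bool toy_version_loadable(const ToyVersions *v, int stream_version)
    {
        assert(v != NULL);
        return stream_version >= v->minimum_version_id &&
               stream_version <= v->version_id;
    }

With ``version_id = 3`` and ``minimum_version_id = 2``, sections of
version 2 or 3 are accepted, while a version 4 section written by a newer
QEMU is rejected.
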

Massaging functions
-------------------

Sometimes, it is not enough to be able to save the state directly
from one structure; we need to fill in the correct values there first. One
example is when we are using kvm. Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using. And the
opposite when we are loading the state: we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

Example: You can look at hpet.c, which uses the three functions to
massage the state that is transferred.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a `post_load` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

Subsections
-----------

The use of version_id allows us to migrate from older versions
to newer versions of a device, but not the other way around. This
makes it very complicated to fix bugs in stable branches. If we need to
add anything to the state to fix a bug, we have to disable migration
to older versions that don't have that bug-fix (i.e. a new field).

But sometimes that bug-fix is only needed in some situations, not always. For
instance, if the device is in the middle of a DMA operation, or it is
using a specific functionality, ....

It is impossible to create a way to make migration from any version to
any other version work. But we can do better than only allowing
migration from older versions to newer ones. For those fields that are
only needed sometimes, we add the idea of subsections. A subsection
is "like" a device vmstate, but with one particularity: it has a Boolean
function that tells whether its values need to be sent or not. If
this function returns false, the subsection is not sent.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration. If we understand all
the subsections, then we load the state with success.

One important note is that the post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.

Example:

.. code:: c

    static bool ide_drive_pio_state_needed(void *opaque)
    {
        IDEState *s = opaque;

        return ((s->status & DRQ_STAT) != 0)
            || (s->bus->error_status & BM_STATUS_PIO_RETRY);
    }

    const VMStateDescription vmstate_ide_drive_pio_state = {
        .name = "ide_drive/pio_state",
        .version_id = 1,
        .minimum_version_id = 1,
        .pre_save = ide_drive_pio_pre_save,
        .post_load = ide_drive_pio_post_load,
        .needed = ide_drive_pio_state_needed,
        .fields = (VMStateField[]) {
            VMSTATE_INT32(req_nb_sectors, IDEState),
            VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                                 vmstate_info_uint8, uint8_t),
            VMSTATE_INT32(cur_io_buffer_offset, IDEState),
            VMSTATE_INT32(cur_io_buffer_len, IDEState),
            VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
            VMSTATE_INT32(elementary_transfer_size, IDEState),
            VMSTATE_INT32(packet_transfer_size, IDEState),
            VMSTATE_END_OF_LIST()
        }
    };

    const VMStateDescription vmstate_ide_drive = {
        .name = "ide_drive",
        .version_id = 3,
        .minimum_version_id = 0,
        .post_load = ide_drive_post_load,
        .fields = (VMStateField[]) {
            .... several fields ....
            VMSTATE_END_OF_LIST()
        },
        .subsections = (const VMStateDescription*[]) {
            &vmstate_ide_drive_pio_state,
            NULL
        }
    };

Here we have a subsection for the pio state. We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks). If DRQ_STAT is
not set (and no PIO retry is pending), the values in those fields are
garbage and don't need to be sent.

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backwards migration compatibility when
new subsections are added.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``HW_COMPAT_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function
   e) (potentially) Add a pre_load that sets up a default value for 'foo'
      to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions. ``pre_load`` functions can be used to initialise state
on the newer version so that they default to suitable values
when loading streams created by older QEMU versions that do not
generate the subsection.
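
Steps (c) and (e) can be sketched without any QEMU headers as follows.
Everything here is hypothetical: ``ToyFooState``, its ``support_foo``
member (standing in for the property), the default value and both
function names are made up for illustration. In real code the two
callbacks would be wired into a ``VMStateDescription`` through its
``.needed`` and ``.pre_load`` members.

.. code:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TOY_FOO_DEFAULT 42u /* hypothetical default for 'foo' */

    typedef struct {
        bool support_foo; /* property; false on older machine types */
        uint32_t foo;     /* value carried by the new subsection */
    } ToyFooState;

    /* Step (c): the condition function tested before sending. */
    bool toy_support_foo(void *opaque)
    {
        ToyFooState *s = opaque;

        assert(s != NULL);
        return s->support_foo;
    }

    /* Step (e): set a default in case the subsection never arrives. */
    int toy_foo_pre_load(void *opaque)
    {
        ToyFooState *s = opaque;

        assert(s != NULL);
        s->foo = TOY_FOO_DEFAULT;
        return 0;
    }

Because the default is written before any state is loaded, a subsection
that does arrive simply overwrites it.
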

In some cases subsections are added for data that had been accidentally
omitted by earlier versions; if the missing data causes the migration
process to succeed but the guest to behave badly then it may be better
to send the subsection and cause the migration to explicitly fail
with the unknown subsection error. If the bad behaviour only happens
with certain data values, making the subsection conditional on
the data value (rather than the machine type) allows migrations to succeed
in most cases. In general the preference is to tie the subsection to
the machine type, and allow reliable migrations, unless the behaviour
from omission of the subsection is really bad.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backwards migration compatibility.

The best way is to:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

     ``VMSTATE_UINT32(foo, barstruct)``

     becomes

     ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

     Sometime in the future when we no longer care about the ancient versions these can be killed off.

Return path
-----------

In most migration scenarios there is only a single data path that runs
from the source VM to the destination, typically along a single fd (although
possibly with another fd or similar for some fast way of throwing pages across).

However, some uses need two way communication; in particular the Postcopy
destination needs to be able to request pages on demand from the source.

For these scenarios there is a 'return path' from the destination to the source;
``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and the time it takes; the down side is that during
the postcopy phase, a failure of *either* side or the network connection causes
the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode. Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on. Issuing it after the end of a migration is harmless.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_speed`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source. As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up. This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy).
However, when a page request is received from the destination, the dirty page
scanning restarts from the requested location. This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.
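
The effect of a page request on the scan can be sketched with a toy
scanner over eight pages. ``ToyScanner`` and the ``toy_*`` functions are
hypothetical, and wrapping around to the start when the end is reached is
an assumption of this sketch rather than something stated above.

.. code:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    enum { TOY_PAGES = 8 };

    typedef struct {
        bool dirty[TOY_PAGES]; /* true: page still needs sending */
        size_t cursor;         /* where the background scan resumes */
    } ToyScanner;

    /* A page request moves the scan position to the requested page. */
    void toy_request_page(ToyScanner *s, size_t page)
    {
        assert(page < TOY_PAGES);
        s->cursor = page;
    }

    /*
     * Pick the next dirty page at or after the cursor (wrapping once),
     * mark it sent, and advance the cursor. Returns -1 if none is left.
     */
    int toy_next_page(ToyScanner *s)
    {
        for (size_t i = 0; i < TOY_PAGES; i++) {
            size_t page = (s->cursor + i) % TOY_PAGES;

            if (s->dirty[page]) {
                s->dirty[page] = false;
                s->cursor = (page + 1) % TOY_PAGES;
                return (int)page;
            }
        }
        return -1;
    }

After a request for page 5 the scanner sends pages 5, 6 and 7 before
returning to the pages it skipped, which is the locality effect described
above.
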

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

  All the data associated with the package - the ( ... ) section in the diagram -
  is read into memory, and the main thread recurses into qemu_loadvm_state_main
  to process the contents of the package (2) which contains commands (3,6) and
  devices (4...)

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

  a new thread (a) is started that takes over servicing the migration stream,
  while the main thread carries on loading the package. It loads normal
  background page data (b) but if during a device load a fault happens (5)
  the returned page (c) is loaded by the listen thread allowing the main
  thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

  letting the destination CPUs start running. At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
  is no longer used by migration, while the listen thread carries on servicing
  page data until the end of migration.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob. With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running. The main
    thread now finishes processing the migration package and
    carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
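
The ordering of the states can be sketched as a check that only the next
state in the sequence is reachable. The enum and function names below are
hypothetical, and treating the sequence as strictly step-by-step is an
assumption drawn from the list above; the real state handling may be more
permissive.

.. code:: c

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical mirror of the states listed above, in order. */
    typedef enum {
        TOY_PC_ADVISE,
        TOY_PC_DISCARD,
        TOY_PC_LISTEN,
        TOY_PC_RUNNING,
        TOY_PC_END
    } ToyPostcopyState;

    /* Accept only a single forward step; END has no successor. */
    bool toy_pc_transition_ok(ToyPostcopyState from, ToyPostcopyState to)
    {
        assert(to <= TOY_PC_END);
        return from != TOY_PC_END && (int)to == (int)from + 1;
    }
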

Source side page maps
---------------------

The source side keeps two bitmaps during postcopy; 'the migration bitmap'
and 'unsent map'. The 'migration bitmap' is basically the same as in
the precopy case, and holds a bit to indicate that a page is 'dirty' -
i.e. needs sending. During the precopy phase this is updated as the CPU
dirties pages, however during postcopy the source CPUs are stopped and nothing
should dirty anything any more.

The 'unsent map' is used for the transition to postcopy. It is a bitmap that
has a bit cleared whenever a page is sent to the destination, however during
the transition to postcopy mode it is combined with the migration bitmap
to form a set of pages that:

   a) Have been sent but then redirtied (which must be discarded)
   b) Have not yet been sent - which also must be discarded to cause any
      transparent huge pages built during precopy to be broken.

Note that the contents of the unsentmap are sacrificed during the calculation
of the discard set and thus aren't valid once in postcopy. The dirtymap
(i.e. the migration bitmap) is still valid and is used to ensure that no page
is sent more than once. Any request for a page that has already been sent is
ignored. Duplicate requests such as this can happen as a page is sent at
about the same time the destination accesses it.
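
Cases (a) and (b) together amount to the union of the two bitmaps: a page
is discarded if it is still unsent or if it is dirty. A minimal sketch,
using plain 64-bit words rather than QEMU's bitmap helpers and a
hypothetical function name, overwrites the unsent map with that union,
which is why its original contents are lost.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Turn 'unsent' into the discard set in place:
     *   discard = unsent | dirty
     * A set bit in 'unsent' means the page was never sent; a set bit in
     * 'dirty' means the page needs (re)sending.
     */
    void toy_build_discard_set(uint64_t *unsent, const uint64_t *dirty,
                               size_t nwords)
    {
        assert(unsent != NULL && dirty != NULL);
        for (size_t i = 0; i < nwords; i++) {
            unsent[i] |= dirty[i];
        }
    }

For example, with pages 2 and 3 unsent and pages 1 and 3 dirty, the
discard set covers pages 1, 2 and 3; page 0, which was sent and stayed
clean, is kept.
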

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
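
The figure in (d) is simple arithmetic: page size in bits divided by link
rate. The helper below is a hypothetical illustration that ignores
protocol overhead and latency, so real transfers will be somewhat slower.

.. code:: c

    #include <assert.h>
    #include <stdint.h>

    /* Ideal time in seconds to move 'bytes' over a link of the given rate. */
    double toy_transfer_seconds(uint64_t bytes, double bits_per_second)
    {
        assert(bits_per_second > 0.0);
        return (double)bytes * 8.0 / bits_per_second;
    }

Calling ``toy_transfer_seconds(1ull << 30, 10e9)`` gives roughly 0.86
seconds for a 1GB page, whereas a 2MB page (``2ull << 20`` bytes) takes
under 2 milliseconds on the same link.
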

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on `/dev/shm` memory and on `hugetlbfs`
(although the kernel doesn't provide an equivalent to `madvise(MADV_DONTNEED)`
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the `vhost-user-bridge` (in `tests/`) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault. The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

.. note::
  There are two future improvements that would be nice:

    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy. In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.