=====================
VFIO device migration
=====================

Migration of a virtual machine involves saving the state of each device that
the guest is running on the source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.

Migration of VFIO devices consists of two phases: the optional pre-copy phase,
and the stop-and-copy phase. The pre-copy phase is iterative and allows
accommodating VFIO devices that have a large amount of data that needs to be
transferred. The iterative pre-copy phase of migration allows the guest to
continue running while the VFIO device state is transferred to the destination,
which helps to reduce the total downtime of the VM. VFIO devices opt in to
pre-copy support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
VFIO_DEVICE_FEATURE_MIGRATION ioctl.

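For illustration, migration support and the pre-copy capability can be probed
from userspace with the ``VFIO_DEVICE_FEATURE`` ioctl. The following is a
minimal sketch (not QEMU code), assuming ``device_fd`` is an already opened
VFIO device file descriptor::

    #include <linux/vfio.h>
    #include <stdbool.h>
    #include <sys/ioctl.h>

    static int vfio_query_migration_flags(int device_fd, __u64 *mig_flags)
    {
        /* The migration feature payload follows the generic feature header */
        __u64 buf[(sizeof(struct vfio_device_feature) +
                   sizeof(struct vfio_device_feature_migration)) /
                  sizeof(__u64)] = {};
        struct vfio_device_feature *feature = (void *)buf;
        struct vfio_device_feature_migration *mig = (void *)feature->data;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

        if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return -1;  /* migration is not supported by this device/driver */
        }

        *mig_flags = mig->flags;
        return 0;
    }

    /* Pre-copy is optional; VFIO_MIGRATION_STOP_COPY is the baseline */
    static bool vfio_supports_precopy(__u64 mig_flags)
    {
        return mig_flags & VFIO_MIGRATION_PRE_COPY;
    }
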
When pre-copy is supported, it's possible to further reduce downtime by
enabling the "switchover-ack" migration capability.
VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
and recommends that the initial bytes are sent and loaded in the destination
before stopping the source VM. Enabling this migration capability guarantees
that, and can thus potentially reduce downtime even further.

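For illustration, the capability can be enabled via QMP on both source and
destination before starting the migration (the exact session depends on the
management stack; in QEMU, ``switchover-ack`` requires the ``return-path``
capability, so both are enabled here)::

    -> { "execute": "migrate-set-capabilities",
         "arguments": { "capabilities": [
             { "capability": "return-path", "state": true },
             { "capability": "switchover-ack", "state": true } ] } }
    <- { "return": {} }
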
To support migration of multiple devices that might do P2P transactions between
themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
the device, but the device can respond to incoming ones. Additionally, all
outstanding P2P transactions are guaranteed to have been completed by the time
the device enters this state.

All the devices that support P2P migration are first transitioned to the P2P
quiescent state and only then are they stopped or started. This makes migration
safe P2P-wise, since starting and stopping the devices is not done atomically
for all the devices together.

Thus, migration of multiple VFIO devices is allowed only if all of them
support P2P migration. Migration of a single VFIO device is allowed regardless
of P2P migration support.

A detailed description of the UAPI for VFIO device migration can be found in
the comment for the ``vfio_device_mig_state`` structure in the header file
linux-headers/linux/vfio.h.

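The state machine itself is driven through the ``VFIO_DEVICE_FEATURE`` ioctl
with the ``vfio_device_feature_mig_state`` structure. Below is a minimal
sketch (not QEMU code) of stopping a P2P-capable device: it is first moved to
the quiescent _RUNNING_P2P state and only then to _STOP; ``device_fd`` is an
assumption of the example::

    #include <linux/vfio.h>
    #include <sys/ioctl.h>

    static int vfio_set_mig_state(int device_fd, __u32 new_state)
    {
        __u64 buf[(sizeof(struct vfio_device_feature) +
                   sizeof(struct vfio_device_feature_mig_state)) /
                  sizeof(__u64)] = {};
        struct vfio_device_feature *feature = (void *)buf;
        struct vfio_device_feature_mig_state *mig = (void *)feature->data;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_SET |
                         VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
        mig->device_state = new_state;

        return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
    }

    /* Stop a P2P-capable device: quiesce P2P DMA first, then stop it */
    static int vfio_stop_device(int device_fd)
    {
        if (vfio_set_mig_state(device_fd, VFIO_DEVICE_STATE_RUNNING_P2P)) {
            return -1;
        }
        return vfio_set_mig_state(device_fd, VFIO_DEVICE_STATE_STOP);
    }
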
VFIO implements the device hooks for the iterative approach as follows (a
schematic mapping onto QEMU's ``SaveVMHandlers`` is sketched after the list):

* A ``save_setup`` function that sets up migration on the source.

* A ``load_setup`` function that sets the VFIO device on the destination in
  the _RESUMING state.

* A ``state_pending_estimate`` function that reports an estimate of the
  remaining pre-copy data that the vendor driver has yet to save for the VFIO
  device.

* A ``state_pending_exact`` function that reads pending_bytes from the vendor
  driver, which indicates the amount of data that the vendor driver has yet to
  save for the VFIO device.

* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
  active only when the VFIO device is in pre-copy states.

* A ``save_live_iterate`` function that reads the VFIO device's data from the
  vendor driver during the iterative pre-copy phase.

* A ``switchover_ack_needed`` function that checks if the VFIO device uses the
  "switchover-ack" migration capability when this capability is enabled.

* A ``save_state`` function to save the device config space if it is present.

* A ``save_live_complete_precopy`` function that sets the VFIO device in the
  _STOP_COPY state and iteratively copies the data for the VFIO device until
  the vendor driver indicates that no data remains.

* A ``load_state`` function that loads the config section and the data
  sections that are generated by the save functions above.

* ``cleanup`` functions for both save and load that perform any
  migration-related cleanup.

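The following is an illustrative sketch of how these hooks map onto QEMU's
``SaveVMHandlers`` table; the function names are placeholders rather than the
exact symbols used in hw/vfio/migration.c::

    static const SaveVMHandlers savevm_vfio_handlers = {
        /* Source side */
        .save_setup = vfio_save_setup,
        .save_cleanup = vfio_save_cleanup,
        .state_pending_estimate = vfio_state_pending_estimate,
        .state_pending_exact = vfio_state_pending_exact,
        .is_active_iterate = vfio_is_active_iterate,
        .save_live_iterate = vfio_save_iterate,                    /* pre-copy */
        .save_live_complete_precopy = vfio_save_complete_precopy, /* _STOP_COPY */
        .save_state = vfio_save_state,                         /* config space */
        .switchover_ack_needed = vfio_switchover_ack_needed,
        /* Destination side */
        .load_setup = vfio_load_setup,
        .load_cleanup = vfio_load_cleanup,
        .load_state = vfio_load_state,
    };
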
The VFIO migration code uses a VM state change handler to change the VFIO
device state when the VM state changes from running to not-running, and
vice versa.

Similarly, a migration state change handler is used to trigger a transition of
the VFIO device state when certain changes of the migration state occur. For
example, the VFIO device state is transitioned back to _RUNNING in case a
migration failed or was canceled.

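For illustration, the VM state change handler could look like the sketch below
(placeholder names; ``vfio_device_set_state()`` is a hypothetical helper, and
P2P-capable devices additionally pass through the quiescent state described
above)::

    static void vfio_vmstate_change(void *opaque, bool running, RunState state)
    {
        VFIODevice *vbasedev = opaque;

        if (running) {
            /* VM resumed: move the device back to its running state */
            vfio_device_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING);
        } else {
            /* VM stopped: stop the device as well */
            vfio_device_set_state(vbasedev, VFIO_DEVICE_STATE_STOP);
        }
    }

    /* Registered once per device when migration is set up */
    qemu_add_vm_change_state_handler(vfio_vmstate_change, vbasedev);
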
System memory dirty pages tracking
----------------------------------

The ``log_global_start`` and ``log_global_stop`` memory listener callbacks
inform the VFIO dirty tracking module to start and stop dirty page tracking. A
``log_sync`` memory listener callback queries the dirty page bitmap from the
dirty tracking module and marks system memory pages which were DMA-ed by the
VFIO device as dirty. The dirty page bitmap is queried per container.

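These callbacks are members of the VFIO ``MemoryListener``. An illustrative
sketch of the dirty-tracking related entries, with placeholder function names,
is::

    static MemoryListener vfio_memory_listener = {
        .name = "vfio",
        .region_add = vfio_listener_region_add,              /* DMA map added   */
        .region_del = vfio_listener_region_del,              /* DMA map removed */
        .log_global_start = vfio_listener_log_global_start,  /* start tracking  */
        .log_global_stop = vfio_listener_log_global_stop,    /* stop tracking   */
        .log_sync = vfio_listener_log_sync,     /* query bitmap and mark dirty */
    };
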
Currently there are two ways dirty page tracking can be done:
(1) Device dirty tracking:
In this method the device is responsible for logging and reporting its DMAs.
This method can be used only if the device is capable of tracking its DMAs.
Discovering device capability, starting and stopping dirty tracking, and
syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
More info about the uAPI can be found in the comments of the
``vfio_device_feature_dma_logging_control`` and
``vfio_device_feature_dma_logging_report`` structures in the header file
linux-headers/linux/vfio.h.

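For illustration, starting device dirty tracking over a single IOVA range
could look like the following sketch (not QEMU code; ``device_fd``, ``iova``
and ``length`` are assumptions of the example)::

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static int vfio_dma_logging_start(int device_fd, __u64 iova, __u64 length)
    {
        struct vfio_device_feature_dma_logging_range range = {
            .iova = iova,
            .length = length,
        };
        __u64 buf[(sizeof(struct vfio_device_feature) +
                   sizeof(struct vfio_device_feature_dma_logging_control)) /
                  sizeof(__u64)] = {};
        struct vfio_device_feature *feature = (void *)buf;
        struct vfio_device_feature_dma_logging_control *control =
            (void *)feature->data;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_SET |
                         VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
        control->page_size = sysconf(_SC_PAGESIZE);  /* tracking granularity */
        control->num_ranges = 1;
        control->ranges = (__u64)(uintptr_t)&range;

        return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
    }
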
(2) VFIO IOMMU module:
In this method dirty tracking is done by the IOMMU. However, there is currently
no IOMMU support for dirty page tracking. For this reason, all pages are
perpetually marked dirty, unless the device driver pins pages through external
APIs in which case only those pinned pages are perpetually marked dirty.

If the above two methods are not supported, all pages are perpetually marked
dirty by QEMU.

By default, dirty pages are tracked during the pre-copy as well as the
stop-and-copy phase. So, a page marked as dirty will be copied to the
destination in both phases. Copying dirty pages in the pre-copy phase helps
QEMU to predict whether it can achieve its downtime tolerances: if QEMU keeps
finding dirty pages continuously during the pre-copy phase, it can expect to
keep finding dirty pages in the stop-and-copy phase as well, and can estimate
the downtime accordingly.

QEMU also provides a per-device opt-out option ``pre-copy-dirty-page-tracking``
which disables querying the dirty bitmap during the pre-copy phase. If it is
set to off, all dirty pages will be copied to the destination in the
stop-and-copy phase only.

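For example, on the QEMU command line (the host address below is only a
placeholder)::

    -device vfio-pci,host=0000:65:00.0,pre-copy-dirty-page-tracking=off
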
System memory dirty pages tracking when vIOMMU is enabled
---------------------------------------------------------

With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
phase of migration. In that case, the unmap ioctl returns any dirty pages in
that range and QEMU reports the corresponding guest physical pages as dirty.
During the stop-and-copy phase, an IOMMU notifier is used to get a callback for
mapped pages and then the dirty pages bitmap is fetched from the VFIO IOMMU
module for those mapped ranges. If device dirty tracking is enabled with
vIOMMU, live migration will be blocked.

Flow of state changes during Live migration
===========================================

Below is the state change flow during live migration for a VFIO device that
supports both pre-copy and P2P migration. The flow for devices that don't
support them is similar, except that the relevant states for pre-copy and P2P
are skipped.
The values in the parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.

Live migration save path
------------------------

::

                          QEMU normal running state
                          (RUNNING, _NONE, _RUNNING)
                                      |
                     migrate_init spawns migration_thread
            Migration thread then calls each device's .save_setup()
                         (RUNNING, _SETUP, _PRE_COPY)
                                      |
                        (RUNNING, _ACTIVE, _PRE_COPY)
   If device is active, get pending_bytes by .state_pending_{estimate,exact}()
         If total pending_bytes >= threshold_size, call .save_live_iterate()
                 Data of VFIO device for pre-copy phase is copied
       Iterate till total pending bytes converge and are less than threshold
                                      |
     On migration completion, the vCPUs and the VFIO device are stopped
             The VFIO device is first put in P2P quiescent state
                   (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
                                      |
               Then the VFIO device is put in _STOP_COPY state
                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
         .save_live_complete_precopy() is called for each active device
        For the VFIO device, iterate in .save_live_complete_precopy() until
                               pending data is 0
                                      |
                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
           Migration thread schedules cleanup bottom half and exits
                                      |
                          .save_cleanup() is called
                       (POSTMIGRATE, _COMPLETED, _STOP)

Live migration resume path
--------------------------

::

            Incoming migration calls .load_setup() for each device
                         (RESTORE_VM, _ACTIVE, _STOP)
                                      |
     For each device, .load_state() is called for that device section data
                       (RESTORE_VM, _ACTIVE, _RESUMING)
                                      |
   At the end, .load_cleanup() is called for each device and vCPUs are started
             The VFIO device is first put in P2P quiescent state
                      (RUNNING, _ACTIVE, _RUNNING_P2P)
                                      |
                          (RUNNING, _NONE, _RUNNING)

Postcopy
========

Postcopy migration is currently not supported for VFIO devices.