]>
Commit | Line | Data |
---|---|---|
04d0583a PD |
1 | .. |
2 | Copyright (c) 2022, ISP RAS | |
43185f7b | 3 | Written by Pavel Dovgalyuk and Alex Bennée |
04d0583a PD |
4 | |
5 | ======================= | |
6 | Execution Record/Replay | |
7 | ======================= | |
8 | ||
43185f7b PD |
9 | Core concepts |
10 | ============= | |
11 | ||
12 | Record/replay functions are used for the deterministic replay of qemu | |
13 | execution. Execution recording writes a non-deterministic events log, which | |
14 | can be later used for replaying the execution anywhere and for unlimited | |
15 | number of times. Execution replaying reads the log and replays all | |
16 | non-deterministic events including external input, hardware clocks, | |
17 | and interrupts. | |
18 | ||
19 | Several parts of QEMU include function calls to make event log recording | |
20 | and replaying. | |
21 | Devices' models that have non-deterministic input from external devices were | |
22 | changed to write every external event into the execution log immediately. | |
23 | E.g. network packets are written into the log when they arrive into the virtual | |
24 | network adapter. | |
25 | ||
26 | All non-deterministic events are coming from these devices. But to | |
27 | replay them we need to know at which moments they occur. We specify | |
28 | these moments by counting the number of instructions executed between | |
29 | every pair of consecutive events. | |
30 | ||
31 | Academic papers with description of deterministic replay implementation: | |
32 | ||
33 | * `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_ | |
34 | * `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_ | |
35 | ||
36 | Modifications of qemu include: | |
37 | ||
38 | * wrappers for clock and time functions to save their return values in the log | |
39 | * saving different asynchronous events (e.g. system shutdown) into the log | |
40 | * synchronization of the bottom halves execution | |
41 | * synchronization of the threads from thread pool | |
42 | * recording/replaying user input (mouse, keyboard, and microphone) | |
43 | * adding internal checkpoints for cpu and io synchronization | |
44 | * network filter for recording and replaying the packets | |
45 | * block driver for making block layer deterministic | |
46 | * serial port input record and replay | |
47 | * recording of random numbers obtained from the external sources | |
48 | ||
49 | Instruction counting | |
50 | -------------------- | |
51 | ||
52 | QEMU should work in icount mode to use record/replay feature. icount was | |
53 | designed to allow deterministic execution in absence of external inputs | |
54 | of the virtual machine. We also use icount to control the occurrence of the | |
55 | non-deterministic events. The number of instructions elapsed from the last event | |
56 | is written to the log while recording the execution. In replay mode we | |
57 | can predict when to inject that event using the instruction counter. | |
58 | ||
59 | Locking and thread synchronisation | |
60 | ---------------------------------- | |
61 | ||
62 | Previously the synchronisation of the main thread and the vCPU thread | |
63 | was ensured by the holding of the BQL. However the trend has been to | |
64 | reduce the time the BQL was held across the system including under TCG | |
65 | system emulation. As it is important that batches of events are kept | |
66 | in sequence (e.g. expiring timers and checkpoints in the main thread | |
67 | while instruction checkpoints are written by the vCPU thread) we need | |
68 | another lock to keep things in lock-step. This role is now handled by | |
69 | the replay_mutex_lock. It used to be held only for each event being | |
70 | written but now it is held for a whole execution period. This results | |
71 | in a deterministic ping-pong between the two main threads. | |
72 | ||
73 | As the BQL is now a finer grained lock than the replay_lock it is almost | |
74 | certainly a bug, and a source of deadlocks, to take the | |
75 | replay_mutex_lock while the BQL is held. This is enforced by an assert. | |
76 | While the unlocks are usually in the reverse order, this is not | |
77 | necessary; you can drop the replay_lock while holding the BQL, without | |
78 | doing a more complicated unlock_iothread/replay_unlock/lock_iothread | |
79 | sequence. | |
80 | ||
81 | Checkpoints | |
82 | ----------- | |
83 | ||
84 | Replaying the execution of virtual machine is bound by sources of | |
85 | non-determinism. These are inputs from clock and peripheral devices, | |
86 | and QEMU thread scheduling. Thread scheduling affect on processing events | |
87 | from timers, asynchronous input-output, and bottom halves. | |
978ae0e9 | 88 | |
43185f7b PD |
89 | Invocations of timers are coupled with clock reads and changing the state |
90 | of the virtual machine. Reads produce non-deterministic data taken from | |
91 | host clock. And VM state changes should preserve their order. Their relative | |
92 | order in replay mode must replicate the order of callbacks in record mode. | |
93 | To preserve this order we use checkpoints. When a specific clock is processed | |
94 | in record mode we save to the log special "checkpoint" event. | |
95 | Checkpoints here do not refer to virtual machine snapshots. They are just | |
96 | record/replay events used for synchronization. | |
97 | ||
98 | QEMU in replay mode will try to invoke timers processing in random moment | |
99 | of time. That's why we do not process a group of timers until the checkpoint | |
100 | event will be read from the log. Such an event allows synchronizing CPU | |
101 | execution and timer events. | |
102 | ||
103 | Two other checkpoints govern the "warping" of the virtual clock. | |
104 | While the virtual machine is idle, the virtual clock increments at | |
105 | 1 ns per *real time* nanosecond. This is done by setting up a timer | |
106 | (called the warp timer) on the virtual real time clock, so that the | |
107 | timer fires at the next deadline of the virtual clock; the virtual clock | |
108 | is then incremented (which is called "warping" the virtual clock) as | |
109 | soon as the timer fires or the CPUs need to go out of the idle state. | |
110 | Two functions are used for this purpose; because these actions change | |
111 | virtual machine state and must be deterministic, each of them creates a | |
112 | checkpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so | |
113 | starts accounting real time to virtual clock. ``icount_account_warp_timer`` | |
114 | is called when the CPUs get an interrupt or when the warp timer fires, | |
115 | and it warps the virtual clock by the amount of real time that has passed | |
116 | since ``icount_start_warp_timer``. | |
117 | ||
118 | Virtual devices | |
119 | =============== | |
120 | ||
121 | Record/replay mechanism, that could be enabled through icount mode, expects | |
122 | the virtual devices to satisfy the following requirement: | |
123 | everything that affects | |
978ae0e9 PD |
124 | the guest state during execution in icount mode should be deterministic. |
125 | ||
126 | Timers | |
04d0583a | 127 | ------ |
978ae0e9 | 128 | |
43185f7b PD |
129 | Timers are used to execute callbacks from different subsystems of QEMU |
130 | at the specified moments of time. There are several kinds of timers: | |
131 | ||
132 | * Real time clock. Based on host time and used only for callbacks that | |
133 | do not change the virtual machine state. For this reason real time | |
134 | clock and timers does not affect deterministic replay at all. | |
135 | * Virtual clock. These timers run only during the emulation. In icount | |
136 | mode virtual clock value is calculated using executed instructions counter. | |
137 | That is why it is completely deterministic and does not have to be recorded. | |
138 | * Host clock. This clock is used by device models that simulate real time | |
139 | sources (e.g. real time clock chip). Host clock is the one of the sources | |
140 | of non-determinism. Host clock read operations should be logged to | |
141 | make the execution deterministic. | |
142 | * Virtual real time clock. This clock is similar to real time clock but | |
143 | it is used only for increasing virtual clock while virtual machine is | |
144 | sleeping. Due to its nature it is also non-deterministic as the host clock | |
145 | and has to be logged too. | |
146 | ||
978ae0e9 PD |
147 | All virtual devices should use virtual clock for timers that change the guest |
148 | state. Virtual clock is deterministic, therefore such timers are deterministic | |
149 | too. | |
150 | ||
151 | Virtual devices can also use realtime clock for the events that do not change | |
152 | the guest state directly. When the clock ticking should depend on VM execution | |
153 | speed, use virtual clock with EXTERNAL attribute. It is not deterministic, | |
154 | but its speed depends on the guest execution. This clock is used by | |
155 | the virtual devices (e.g., slirp routing device) that lie outside the | |
156 | replayed guest. | |
157 | ||
43185f7b PD |
158 | Block devices |
159 | ------------- | |
160 | ||
161 | Block devices record/replay module (``blkreplay``) intercepts calls of | |
162 | bdrv coroutine functions at the top of block drivers stack. | |
163 | ||
164 | All block completion operations are added to the queue in the coroutines. | |
165 | When the queue is flushed the information about processed requests | |
166 | is recorded to the log. In replay phase the queue is matched with | |
167 | events read from the log. Therefore block devices requests are processed | |
168 | deterministically. | |
169 | ||
978ae0e9 | 170 | Bottom halves |
04d0583a | 171 | ------------- |
978ae0e9 PD |
172 | |
173 | Bottom half callbacks, that affect the guest state, should be invoked through | |
43185f7b | 174 | ``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions. |
978ae0e9 PD |
175 | Their invocations are saved in record mode and synchronized with the existing |
176 | log in replay mode. | |
177 | ||
43185f7b PD |
178 | Disk I/O events are completely deterministic in our model, because |
179 | in both record and replay modes we start virtual machine from the same | |
180 | disk state. But callbacks that virtual disk controller uses for reading and | |
181 | writing the disk may occur at different moments of time in record and replay | |
182 | modes. | |
183 | ||
184 | Reading and writing requests are created by CPU thread of QEMU. Later these | |
185 | requests proceed to block layer which creates "bottom halves". Bottom | |
186 | halves consist of callback and its parameters. They are processed when | |
187 | main loop locks the global mutex. These locks are not synchronized with | |
188 | replaying process because main loop also processes the events that do not | |
189 | affect the virtual machine state (like user interaction with monitor). | |
190 | ||
191 | That is why we had to implement saving and replaying bottom halves callbacks | |
192 | synchronously to the CPU execution. When the callback is about to execute | |
193 | it is added to the queue in the replay module. This queue is written to the | |
194 | log when its callbacks are executed. In replay mode callbacks are not processed | |
195 | until the corresponding event is read from the events log file. | |
196 | ||
197 | Sometimes the block layer uses asynchronous callbacks for its internal purposes | |
198 | (like reading or writing VM snapshots or disk image cluster tables). In this | |
199 | case bottom halves are not marked as "replayable" and do not saved | |
200 | into the log. | |
201 | ||
978ae0e9 | 202 | Saving/restoring the VM state |
04d0583a | 203 | ----------------------------- |
978ae0e9 PD |
204 | |
205 | All fields in the device state structure (including virtual timers) | |
206 | should be restored by loadvm to the same values they had before savevm. | |
207 | ||
208 | Avoid accessing other devices' state, because the order of saving/restoring | |
209 | is not defined. It means that you should not call functions like | |
43185f7b | 210 | ``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid |
978ae0e9 PD |
211 | the dependencies that may make restoring the VM state non-deterministic. |
212 | ||
213 | Stopping the VM | |
04d0583a | 214 | --------------- |
978ae0e9 PD |
215 | |
216 | Stopping the guest should not interfere with its state (with the exception | |
217 | of the network connections, that could be broken by the remote timeouts). | |
218 | VM can be stopped at any moment of replay by the user. Restarting the VM | |
219 | after that stop should not break the replay by the unneeded guest state change. | |
43185f7b PD |
220 | |
221 | Replay log format | |
222 | ================= | |
223 | ||
224 | Record/replay log consists of the header and the sequence of execution | |
225 | events. The header includes 4-byte replay version id and 8-byte reserved | |
226 | field. Version is updated every time replay log format changes to prevent | |
227 | using replay log created by another build of qemu. | |
228 | ||
229 | The sequence of the events describes virtual machine state changes. | |
230 | It includes all non-deterministic inputs of VM, synchronization marks and | |
231 | instruction counts used to correctly inject inputs at replay. | |
232 | ||
233 | Synchronization marks (checkpoints) are used for synchronizing qemu threads | |
234 | that perform operations with virtual hardware. These operations may change | |
235 | system's state (e.g., change some register or generate interrupt) and | |
236 | therefore should execute synchronously with CPU thread. | |
237 | ||
238 | Every event in the log includes 1-byte event id and optional arguments. | |
239 | When argument is an array, it is stored as 4-byte array length | |
240 | and corresponding number of bytes with data. | |
241 | Here is the list of events that are written into the log: | |
242 | ||
243 | - EVENT_INSTRUCTION. Instructions executed since last event. Followed by: | |
244 | ||
245 | - 4-byte number of executed instructions. | |
246 | ||
247 | - EVENT_INTERRUPT. Used to synchronize interrupt processing. | |
248 | - EVENT_EXCEPTION. Used to synchronize exception handling. | |
249 | - EVENT_ASYNC. This is a group of events. When such an event is generated, | |
250 | it is stored in the queue and processed in icount_account_warp_timer(). | |
251 | Every such event has it's own id from the following list: | |
252 | ||
253 | - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes | |
254 | callbacks that affect virtual machine state, but normally called | |
255 | asynchronously. Followed by: | |
256 | ||
257 | - 8-byte operation id. | |
258 | ||
259 | - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains | |
260 | parameters of keyboard and mouse input operations | |
261 | (key press/release, mouse pointer movement). Followed by: | |
262 | ||
263 | - 9-16 bytes depending of input event. | |
264 | ||
265 | - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. | |
266 | - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input | |
267 | initiated by the sender. Followed by: | |
268 | ||
269 | - 1-byte character device id. | |
270 | - Array with bytes were read. | |
271 | ||
272 | - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize | |
273 | operations with disk and flash drives with CPU. Followed by: | |
274 | ||
275 | - 8-byte operation id. | |
276 | ||
277 | - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by: | |
278 | ||
279 | - 1-byte network adapter id. | |
280 | - 4-byte packet flags. | |
281 | - Array with packet bytes. | |
282 | ||
283 | - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, | |
284 | e.g., by closing the window. | |
285 | - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by: | |
286 | ||
287 | - 4-byte output function return value. | |
288 | - 4-byte offset in the output array. | |
289 | ||
290 | - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, | |
291 | initiated by qemu. Followed by: | |
292 | ||
293 | - Array with bytes that were read. | |
294 | ||
295 | - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, | |
296 | initiated by qemu. Followed by: | |
297 | ||
298 | - 4-byte error code. | |
299 | ||
300 | - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by: | |
301 | ||
302 | - 8-byte clock value. | |
303 | ||
304 | - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of | |
305 | CPU, internal threads, and asynchronous input events. | |
306 | - EVENT_END. Last event in the log. |