Commit | Line | Data |
---|---|---|
d73abd6d PD |
1 | Copyright (c) 2010-2015 Institute for System Programming |
2 | of the Russian Academy of Sciences. | |
3 | ||
4 | This work is licensed under the terms of the GNU GPL, version 2 or later. | |
5 | See the COPYING file in the top-level directory. | |
6 | ||
7 | Record/replay | |
8 | ------------- | |
9 | ||
7273db9d | 10 | Record/replay functions are used for the deterministic replay of qemu execution. |
d73abd6d PD |
11 | Execution recording writes a log of non-deterministic events, which can later |
12 | be used to replay the execution anywhere and any number of times. | |
7273db9d | 13 | It also supports checkpointing for faster rewind to a specific replay moment. |
d73abd6d PD |
14 | Execution replaying reads the log and replays all non-deterministic events |
15 | including external input, hardware clocks, and interrupts. | |
16 | ||
17 | Deterministic replay has the following features: | |
18 | * Deterministically replays whole system execution and all contents of | |
19 | the memory, state of the hardware devices, clocks, and screen of the VM. | |
20 | * Writes the execution log to a file so that it can be replayed multiple times | |
21 | on different machines. | |
6fe6d6c9 | 22 | * Supports i386, x86_64, and Arm hardware platforms. |
d73abd6d PD |
23 | * Performs deterministic replay of all operations with keyboard and mouse |
24 | input devices. | |
25 | ||
26 | Usage of the record/replay: | |
7273db9d PD |
27 | * First, record the execution with the following command line: |
28 | qemu-system-i386 \ | |
29 | -icount shift=7,rr=record,rrfile=replay.bin \ | |
de499eb6 | 30 | -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ |
7273db9d PD |
31 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ |
32 | -device ide-hd,drive=img-blkreplay \ | |
33 | -netdev user,id=net1 -device rtl8139,netdev=net1 \ | |
34 | -object filter-replay,id=replay,netdev=net1 | |
35 | * After recording, you can replay it by using another command line: | |
36 | qemu-system-i386 \ | |
37 | -icount shift=7,rr=replay,rrfile=replay.bin \ | |
de499eb6 | 38 | -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ |
7273db9d PD |
39 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ |
40 | -device ide-hd,drive=img-blkreplay \ | |
41 | -netdev user,id=net1 -device rtl8139,netdev=net1 \ | |
42 | -object filter-replay,id=replay,netdev=net1 | |
43 | The only difference from recording is that the rr option changes | |
44 | from record to replay. | |
45 | * Block device images are not actually changed in the recording mode, | |
d73abd6d | 46 | because all of the changes are written to the temporary overlay file. |
7273db9d PD |
47 | This behavior is enabled by using the blkreplay driver. It should be used |
48 | for every enabled block device, as described in the 'Block devices' section. | |
49 | * The '-net none' option should be specified when the network is not used, | |
50 | because QEMU adds a network card by default. When the network is needed, | |
51 | it should be configured explicitly with the replay filter, as described | |
52 | in the 'Network devices' section. | |
53 | * Interaction with audio devices and serial ports is recorded and replayed | |
54 | automatically when such devices are enabled. | |
55 | ||
56 | Academic papers describing the deterministic replay implementation: | |
d73abd6d PD |
57 | http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html |
58 | http://dl.acm.org/citation.cfm?id=2786805.2803179 | |
59 | ||
60 | Modifications of qemu include: | |
61 | * wrappers for clock and time functions to save their return values in the log | |
62 | * saving different asynchronous events (e.g. system shutdown) into the log | |
63 | * synchronization of the bottom halves execution | |
64 | * synchronization of the threads from thread pool | |
7273db9d | 65 | * recording/replaying user input (mouse, keyboard, and microphone) |
d73abd6d | 66 | * adding internal checkpoints for cpu and io synchronization |
7273db9d PD |
67 | * network filter for recording and replaying the packets |
68 | * block driver for making block layer deterministic | |
69 | * serial port input record and replay | |
878ec29b | 70 | * recording of random numbers obtained from the external sources |
d73abd6d | 71 | |
d759c951 AB |
72 | Locking and thread synchronisation |
73 | ---------------------------------- | |
74 | ||
75 | Previously the synchronisation of the main thread and the vCPU thread | |
76 | was ensured by the holding of the BQL. However the trend has been to | |
77 | reduce the time the BQL was held across the system including under TCG | |
78 | system emulation. As it is important that batches of events are kept | |
79 | in sequence (e.g. expiring timers and checkpoints in the main thread | |
80 | while instruction checkpoints are written by the vCPU thread) we need | |
81 | another lock to keep things in lock-step. This role is now handled by | |
82 | the replay_mutex_lock. It used to be held only for each event being | |
83 | written but now it is held for a whole execution period. This results | |
84 | in a deterministic ping-pong between the two main threads. | |
85 | ||
86 | As the BQL is now a finer grained lock than the replay_lock it is almost | |
87 | certainly a bug, and a source of deadlocks, to take the | |
88 | replay_mutex_lock while the BQL is held. This is enforced by an assert. | |
89 | While the unlocks are usually in the reverse order, this is not | |
90 | necessary; you can drop the replay_lock while holding the BQL, without | |
91 | doing a more complicated unlock_iothread/replay_unlock/lock_iothread | |
92 | sequence. | |
93 | ||
d73abd6d PD |
94 | Non-deterministic events |
95 | ------------------------ | |
96 | ||
97 | Our record/replay system is based on saving and replaying non-deterministic | |
98 | events (e.g. keyboard input) and simulating deterministic ones (e.g. reading | |
99 | from HDD or memory of the VM). Saving only non-deterministic events makes | |
7273db9d | 100 | the log file smaller and the simulation faster. |
d73abd6d PD |
101 | |
102 | The following non-deterministic data from peripheral devices is saved into | |
103 | the log: mouse and keyboard input, network packets, audio controller input, | |
7273db9d | 104 | serial port input, and hardware clocks (they are non-deterministic |
d73abd6d PD |
105 | too, because their values are taken from the host machine). Inputs from |
106 | simulated hardware, memory of VM, software interrupts, and execution of | |
107 | instructions are not saved into the log, because they are deterministic and | |
108 | can be replayed by simulating the behavior of virtual machine starting from | |
109 | initial state. | |
110 | ||
111 | We had to solve three tasks to implement deterministic replay: recording | |
112 | non-deterministic events, replaying non-deterministic events, and checking | |
113 | that there is no divergence between record and replay modes. | |
114 | ||
115 | We changed several parts of QEMU to implement event log recording and replaying. | |
116 | Device models that have non-deterministic input from external devices were | |
117 | changed to write every external event into the execution log immediately. | |
118 | E.g. network packets are written into the log when they arrive into the virtual | |
119 | network adapter. | |
120 | ||
121 | All non-deterministic events come from these devices. But to | |
122 | replay them we need to know at which moments they occur. We specify | |
123 | these moments by counting the number of instructions executed between | |
124 | every pair of consecutive events. | |
125 | ||
126 | Instruction counting | |
127 | -------------------- | |
128 | ||
129 | QEMU must run in icount mode to use the record/replay feature. icount was | |
130 | designed to allow deterministic execution in the absence of external inputs | |
131 | to the virtual machine. We also use icount to control the occurrence of the | |
132 | non-deterministic events. The number of instructions elapsed from the last event | |
133 | is written to the log while recording the execution. In replay mode we | |
134 | can predict when to inject that event using the instruction counter. | |
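
The idea can be illustrated with a toy model. The following Python sketch is
not QEMU code: the function names record_run and replay_run and the
(count, payload) log entries are hypothetical. It logs the number of
instructions executed since the previous event and, on replay, uses that count
to inject each event at the same point of execution.

```python
def record_run(total_instructions, external_events):
    """Record mode: external_events maps an absolute instruction count to a payload.

    Returns a log of (instructions_since_last_event, payload) pairs.
    """
    log = []
    last_event_at = 0
    for icount in range(1, total_instructions + 1):
        # ... one guest instruction would be executed here ...
        if icount in external_events:
            log.append((icount - last_event_at, external_events[icount]))
            last_event_at = icount
    return log


def replay_run(total_instructions, log):
    """Replay mode: inject each logged event after the recorded number of instructions."""
    injected = {}
    pending = iter(log)
    next_event = next(pending, None)
    since_last = 0
    for icount in range(1, total_instructions + 1):
        since_last += 1
        if next_event is not None and since_last == next_event[0]:
            injected[icount] = next_event[1]
            since_last = 0
            next_event = next(pending, None)
    return injected
```

Replaying the log produced by record_run places every event at the same
absolute instruction count at which it originally occurred.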
135 | ||
136 | Timers | |
137 | ------ | |
138 | ||
139 | Timers are used to execute callbacks from different subsystems of QEMU | |
140 | at the specified moments of time. There are several kinds of timers: | |
141 | * Real time clock. Based on host time and used only for callbacks that | |
142 | do not change the virtual machine state. For this reason real time | |
143 | clock and timers do not affect deterministic replay at all. | |
144 | * Virtual clock. These timers run only during the emulation. In icount | |
145 | mode virtual clock value is calculated using executed instructions counter. | |
146 | That is why it is completely deterministic and does not have to be recorded. | |
147 | * Host clock. This clock is used by device models that simulate real time | |
148 | sources (e.g. real time clock chip). Host clock is one of the sources | |
149 | of non-determinism. Host clock read operations should be logged to | |
150 | make the execution deterministic. | |
e76d1798 | 151 | * Virtual real time clock. This clock is similar to real time clock but |
d73abd6d PD |
152 | it is used only for increasing virtual clock while virtual machine is |
153 | sleeping. Due to its nature it is non-deterministic, like the host clock, | |
154 | and has to be logged too. | |
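
As an illustration of logging clock reads, here is a minimal Python sketch. It
is not the QEMU implementation: the class LoggedClock and its parameters are
hypothetical. In record mode every read of the non-deterministic source is
appended to the log; in replay mode the source is never consulted and the
logged values are returned in the same order.

```python
import time


class LoggedClock:
    """Toy wrapper around a non-deterministic clock source."""

    def __init__(self, mode, log=None, source=time.time_ns):
        self.mode = mode              # "record" or "replay"
        self.log = list(log or [])    # clock values in the order they were read
        self.source = source          # host clock; used in record mode only
        self.position = 0             # next log entry to return in replay mode

    def read(self):
        if self.mode == "record":
            value = self.source()
            self.log.append(value)    # save the return value in the log
            return value
        # Replay: return the recorded value instead of asking the host.
        value = self.log[self.position]
        self.position += 1
        return value
```

A deterministic clock, such as the virtual clock in icount mode, needs no such
wrapper because its value can be recomputed from the instruction counter.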
155 | ||
156 | Checkpoints | |
157 | ----------- | |
158 | ||
159 | Replaying of the execution of virtual machine is bound by sources of | |
160 | non-determinism. These are inputs from clock and peripheral devices, | |
161 | and QEMU thread scheduling. Thread scheduling affects the processing of events | |
162 | from timers, asynchronous input-output, and bottom halves. | |
163 | ||
164 | Invocations of timers are coupled with clock reads and changing the state | |
165 | of the virtual machine. Reads produce non-deterministic data taken from | |
166 | host clock. And VM state changes should preserve their order. Their relative | |
167 | order in replay mode must replicate the order of callbacks in record mode. | |
168 | To preserve this order we use checkpoints. When a specific clock is processed | |
169 | in record mode we save a special "checkpoint" event to the log. | |
170 | Checkpoints here do not refer to virtual machine snapshots. They are just | |
171 | record/replay events used for synchronization. | |
172 | ||
173 | QEMU in replay mode may try to invoke timer processing at an arbitrary moment | |
174 | of time. That's why we do not process a group of timers until the checkpoint | |
175 | event is read from the log. Such an event allows synchronizing CPU | |
176 | execution and timer events. | |
177 | ||
e76d1798 PD |
178 | Two other checkpoints govern the "warping" of the virtual clock. |
179 | While the virtual machine is idle, the virtual clock increments at | |
180 | 1 ns per *real time* nanosecond. This is done by setting up a timer | |
181 | (called the warp timer) on the virtual real time clock, so that the | |
182 | timer fires at the next deadline of the virtual clock; the virtual clock | |
183 | is then incremented (which is called "warping" the virtual clock) as | |
184 | soon as the timer fires or the CPUs need to go out of the idle state. | |
185 | Two functions are used for this purpose; because these actions change | |
186 | virtual machine state and must be deterministic, each of them creates a | |
8191d368 CF |
187 | checkpoint. icount_start_warp_timer checks if the CPUs are idle and if so |
188 | starts accounting real time to virtual clock. icount_account_warp_timer | |
e76d1798 PD |
189 | is called when the CPUs get an interrupt or when the warp timer fires, |
190 | and it warps the virtual clock by the amount of real time that has passed | |
8191d368 | 191 | since icount_start_warp_timer. |
d73abd6d PD |
192 | |
193 | Bottom halves | |
194 | ------------- | |
195 | ||
196 | Disk I/O events are completely deterministic in our model, because | |
197 | in both record and replay modes we start virtual machine from the same | |
198 | disk state. But callbacks that virtual disk controller uses for reading and | |
199 | writing the disk may occur at different moments of time in record and replay | |
200 | modes. | |
201 | ||
202 | Reading and writing requests are created by CPU thread of QEMU. Later these | |
203 | requests proceed to block layer which creates "bottom halves". Bottom | |
204 | halves consist of a callback and its parameters. They are processed when | |
205 | the main loop locks the global mutex. These locks are not synchronized with | |
206 | the replaying process because the main loop also processes the events that do not | |
207 | affect the virtual machine state (like user interaction with monitor). | |
208 | ||
209 | That is why we had to implement saving and replaying bottom halves callbacks | |
210 | synchronously to the CPU execution. When the callback is about to execute | |
211 | it is added to the queue in the replay module. This queue is written to the | |
212 | log when its callbacks are executed. In replay mode callbacks are not processed | |
213 | until the corresponding event is read from the events log file. | |
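
The deferral described above can be sketched as a toy Python model. This is
not QEMU code; the class BottomHalfQueue and its methods are hypothetical. In
record mode queued callbacks run at the flush point and their ids are written
to the log. In replay mode a callback runs only when its id is the next entry
in the log.

```python
from collections import deque


class BottomHalfQueue:
    """Toy model of deferring callbacks until the log allows them to run."""

    def __init__(self, mode, log=None):
        self.mode = mode               # "record" or "replay"
        self.log = deque(log or [])    # operation ids in execution order
        self.pending = {}              # operation id -> callback (insertion ordered)

    def schedule(self, op_id, callback):
        """Called when a callback is about to execute: queue it instead."""
        self.pending[op_id] = callback

    def flush(self):
        """Run queued callbacks at a synchronization point."""
        if self.mode == "record":
            for op_id in list(self.pending):
                self.pending.pop(op_id)()
                self.log.append(op_id)  # write the event to the log
        else:
            # Replay: run a callback only when its event is next in the log.
            while self.log and self.log[0] in self.pending:
                self.pending.pop(self.log.popleft())()
```

In replay mode a callback that was scheduled earlier than in the recording
simply waits in the queue until the log reaches its event.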
214 | ||
215 | Sometimes the block layer uses asynchronous callbacks for its internal purposes | |
216 | (like reading or writing VM snapshots or disk image cluster tables). In this | |
217 | case bottom halves are not marked as "replayable" and are not saved | |
218 | into the log. | |
63785678 PD |
219 | |
220 | Block devices | |
221 | ------------- | |
222 | ||
223 | Block devices record/replay module intercepts calls of | |
224 | bdrv coroutine functions at the top of block drivers stack. | |
225 | To record and replay block operations the drive must be configured | |
226 | as follows: | |
de499eb6 | 227 | -drive file=disk.qcow2,if=none,snapshot,id=img-direct |
63785678 PD |
228 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay |
229 | -device ide-hd,drive=img-blkreplay | |
230 | ||
231 | The blkreplay driver should be inserted between the disk image and the virtual | |
232 | drive controller, so that all disk requests can be recorded and replayed. | |
233 | ||
234 | All block completion operations are added to the queue in the coroutines. | |
235 | Queue is flushed at checkpoints and information about processed requests | |
236 | is recorded to the log. In replay phase the queue is matched with | |
237 | events read from the log. Therefore block devices requests are processed | |
238 | deterministically. | |
646c5478 | 239 | |
9c2037d0 PD |
240 | Snapshotting |
241 | ------------ | |
242 | ||
243 | New VM snapshots may be created in replay mode. They can be used later | |
244 | to recover the desired VM state. All VM states created in replay mode | |
245 | are associated with the moment of time in the replay scenario. | |
246 | After recovering the VM state replay will start from that position. | |
247 | ||
248 | Default starting snapshot name may be specified with icount field | |
249 | rrsnapshot as follows: | |
250 | -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name | |
251 | ||
252 | This snapshot is created at start of recording and restored at start | |
253 | of replaying. It also can be loaded while replaying to roll back | |
254 | the execution. | |
255 | ||
de499eb6 PD |
256 | 'snapshot' flag of the disk image must be removed to save the snapshots |
257 | in the overlay (or original image) instead of using the temporary overlay. | |
258 | -drive file=disk.ovl,if=none,id=img-direct | |
259 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay | |
260 | -device ide-hd,drive=img-blkreplay | |
261 | ||
7273db9d PD |
262 | Use the QEMU monitor to create additional snapshots. The 'savevm <name>' command |
263 | creates a snapshot and 'loadvm <name>' restores it. To prevent corruption | |
264 | of the original disk image, use overlay files linked to the original images. | |
265 | Therefore all new snapshots (including the starting one) will be saved in | |
266 | overlays and the original image remains unchanged. | |
267 | ||
9a608af3 PD |
268 | When you need to use snapshots with a diskless virtual machine, |
269 | it must be started with an 'orphan' qcow2 image. This image will be used | |
270 | for storing VM snapshots. Here is an example of the command line for this: | |
271 | ||
272 | qemu-system-i386 -icount shift=3,rr=replay,rrfile=record.bin,rrsnapshot=init \ | |
273 | -net none -drive file=empty.qcow2,if=none,id=rr | |
274 | ||
275 | The empty.qcow2 drive is not connected to any virtual block device and is used | |
276 | for VM snapshots only. | |
277 | ||
646c5478 PD |
278 | Network devices |
279 | --------------- | |
280 | ||
281 | Record and replay for network interactions is performed with the network filter. | |
282 | Each backend must have its own instance of the replay filter as follows: | |
283 | -netdev user,id=net1 -device rtl8139,netdev=net1 | |
284 | -object filter-replay,id=replay,netdev=net1 | |
285 | ||
286 | Replay network filter is used to record and replay network packets. While | |
287 | recording the virtual machine this filter puts all packets coming from | |
288 | the outer world into the log. In replay mode packets from the log are | |
289 | injected into the network device. All interactions with network backend | |
290 | in replay mode are disabled. | |
3d4d16f4 PD |
291 | |
292 | Audio devices | |
293 | ------------- | |
294 | ||
295 | Audio data is recorded and replayed automatically. The command lines for recording | |
296 | and replaying must contain identical specifications of audio hardware, e.g.: | |
297 | -soundhw ac97 | |
bb040e00 | 298 | |
7273db9d PD |
299 | Serial ports |
300 | ------------ | |
301 | ||
302 | Serial port input is recorded and replayed automatically. The command lines | |
303 | for recording and replaying must contain an identical number of ports, | |
304 | but their backends may differ. | |
305 | E.g., '-serial stdio' in record mode, and '-serial null' in replay mode. | |
306 | ||
9a608af3 PD |
307 | Reverse debugging |
308 | ----------------- | |
309 | ||
310 | Reverse debugging allows "executing" the program in reverse direction. | |
311 | GDB remote protocol supports "reverse step" and "reverse continue" | |
312 | commands. The first one steps single instruction backwards in time, | |
313 | and the second one finds the last breakpoint in the past. | |
314 | ||
315 | Recorded executions may be used to enable reverse debugging. QEMU can't | |
316 | execute the code in backwards direction, but can load a snapshot and | |
317 | replay forward to find the desired position or breakpoint. | |
318 | ||
319 | The following GDB commands are supported: | |
320 | - reverse-stepi (or rsi) - step one instruction backwards | |
321 | - reverse-continue (or rc) - find last breakpoint in the past | |
322 | ||
323 | Reverse step loads the nearest snapshot and replays the execution until | |
324 | the required instruction is met. | |
325 | ||
326 | Reverse continue may include several passes of examining the execution | |
327 | between the snapshots. Each of the passes includes the following steps: | |
328 | 1. loading the snapshot | |
329 | 2. replaying to examine the breakpoints | |
330 | 3. if breakpoint or watchpoint was met | |
ac9574bc | 331 | - loading the snapshot again |
9a608af3 PD |
332 | - replaying to the required breakpoint |
333 | 4. else | |
334 | - proceeding to step 1 with the earlier snapshot | |
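
The passes of reverse continue can be sketched with a toy Python model. It is
not the QEMU implementation: the function reverse_continue is hypothetical, and
snapshots and breakpoint hits are represented only by instruction counts.
Replaying is modelled as walking forward over those counts.

```python
def reverse_continue(snapshots, breakpoint_hits, current):
    """Return the instruction count of the last breakpoint hit before `current`.

    snapshots: instruction counts at which snapshots exist.
    breakpoint_hits: instruction counts at which a breakpoint would trigger.
    Returns None when no breakpoint is reachable from any snapshot.
    """
    hits = set(breakpoint_hits)
    end = current
    # Examine the intervals between snapshots, starting from the nearest one.
    for snap in sorted((s for s in snapshots if s < current), reverse=True):
        # Steps 1-2: load the snapshot and replay forward, noting breakpoints.
        last_hit = None
        for icount in range(snap, end):
            if icount in hits:
                last_hit = icount
        if last_hit is not None:
            # Step 3: load the snapshot again and replay to that breakpoint.
            return last_hit
        # Step 4: nothing found, repeat with the earlier snapshot.
        end = snap
    return None
```

The model also shows why at least one snapshot is needed: with an empty
snapshot list there is no state to replay from and the search returns nothing.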
335 | ||
336 | Therefore usage of the reverse debugging requires at least one snapshot | |
337 | created in advance. This can be done by omitting 'snapshot' option | |
338 | for the block drives and adding 'rrsnapshot' for both record and replay | |
339 | command lines. | |
340 | See the "Snapshotting" section to learn more about running record/replay | |
341 | and creating the snapshot in these modes. | |
342 | ||
bb040e00 PD |
343 | Replay log format |
344 | ----------------- | |
345 | ||
806be373 | 346 | Record/replay log consists of the header and the sequence of execution |
bb040e00 PD |
347 | events. The header includes 4-byte replay version id and 8-byte reserved |
348 | field. Version is updated every time replay log format changes to prevent | |
349 | using replay log created by another build of qemu. | |
350 | ||
351 | The sequence of the events describes virtual machine state changes. | |
352 | It includes all non-deterministic inputs of VM, synchronization marks and | |
353 | instruction counts used to correctly inject inputs at replay. | |
354 | ||
355 | Synchronization marks (checkpoints) are used for synchronizing qemu threads | |
356 | that perform operations with virtual hardware. These operations may change | |
357 | system's state (e.g., change some register or generate interrupt) and | |
358 | therefore should execute synchronously with CPU thread. | |
359 | ||
360 | Every event in the log includes 1-byte event id and optional arguments. | |
361 | When argument is an array, it is stored as 4-byte array length | |
362 | and corresponding number of bytes with data. | |
363 | Here is the list of events that are written into the log: | |
364 | ||
365 | - EVENT_INSTRUCTION. Instructions executed since last event. | |
366 | Argument: 4-byte number of executed instructions. | |
367 | - EVENT_INTERRUPT. Used to synchronize interrupt processing. | |
368 | - EVENT_EXCEPTION. Used to synchronize exception handling. | |
369 | - EVENT_ASYNC. This is a group of events. They are always processed | |
370 | together with checkpoints. When such an event is generated, it is | |
371 | stored in the queue and processed only when checkpoint occurs. | |
372 | Every such event is followed by 1-byte checkpoint id and 1-byte | |
373 | async event id from the following list: | |
374 | - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes | |
375 | callbacks that affect virtual machine state but are normally called | |
963e64a4 | 376 | asynchronously. |
bb040e00 PD |
377 | Argument: 8-byte operation id. |
378 | - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains | |
379 | parameters of keyboard and mouse input operations | |
380 | (key press/release, mouse pointer movement). | |
381 | Arguments: 9-16 bytes depending on the input event. | |
382 | - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. | |
383 | - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input | |
384 | initiated by the sender. | |
385 | Arguments: 1-byte character device id. | |
386 | Array with the bytes that were read. | |
387 | - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize | |
388 | operations with disk and flash drives with CPU. | |
389 | Argument: 8-byte operation id. | |
390 | - REPLAY_ASYNC_EVENT_NET. Incoming network packet. | |
391 | Arguments: 1-byte network adapter id. | |
392 | 4-byte packet flags. | |
393 | Array with packet bytes. | |
394 | - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, | |
395 | e.g., by closing the window. | |
396 | - EVENT_CHAR_WRITE. Used to synchronize character output operations. | |
397 | Arguments: 4-byte output function return value. | |
398 | 4-byte offset in the output array. | |
399 | - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, | |
400 | initiated by qemu. | |
401 | Argument: Array with bytes that were read. | |
402 | - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, | |
403 | initiated by qemu. | |
404 | Argument: 4-byte error code. | |
405 | - EVENT_CLOCK + clock_id. Group of events for host clock read operations. | |
406 | Argument: 8-byte clock value. | |
407 | - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of | |
408 | CPU, internal threads, and asynchronous input events. May be followed | |
409 | by one or more EVENT_ASYNC events. | |
410 | - EVENT_END. Last event in the log. |
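
The header and array layout described in this section can be sketched in
Python. The field sizes come from the text above; everything else is an
assumption made for illustration: the big-endian byte order, the helper names,
and the version value used in the usage note are hypothetical and are not taken
from the QEMU sources.

```python
import struct

# 4-byte version id followed by an 8-byte reserved field.
# Big-endian byte order is an assumption of this sketch.
HEADER = struct.Struct(">IQ")


def write_header(version):
    """Serialize the log header with a zeroed reserved field."""
    return HEADER.pack(version, 0)


def read_header(data):
    """Return (version, offset of the first event)."""
    version, _reserved = HEADER.unpack_from(data, 0)
    return version, HEADER.size


def write_array(payload):
    """Array argument: 4-byte length followed by that many bytes."""
    return struct.pack(">I", len(payload)) + payload


def read_array(data, offset):
    """Return (payload, offset just past the array)."""
    (length,) = struct.unpack_from(">I", data, offset)
    start = offset + 4
    return data[start:start + length], start + length
```

For example, write_header(0x1234) + write_array(b"abc") yields 19 bytes: a
12-byte header, a 4-byte length, and 3 bytes of data.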