Commit | Line | Data |
---|---|---|
6012d9a9 | 1 | .. SPDX-License-Identifier: GPL-2.0 |
f392eb25 | 2 | |
6012d9a9 MCC |
3 | ====================================================== |
4 | Timekeeping Virtualization for X86-Based Architectures | |
5 | ====================================================== | |
f392eb25 | 6 | |
6012d9a9 MCC |
7 | :Author: Zachary Amsden <zamsden@redhat.com> |
8 | :Copyright: (c) 2010, Red Hat. All rights reserved. | |
f392eb25 | 9 | |
6012d9a9 | 10 | .. Contents |
f392eb25 | 11 | |
6012d9a9 MCC |
12 | 1) Overview |
13 | 2) Timing Devices | |
14 | 3) TSC Hardware | |
15 | 4) Virtualization Problems | |
f392eb25 | 16 | |
6012d9a9 MCC |
17 | 1. Overview |
18 | =========== | |
f392eb25 ZA |
19 | |
20 | One of the most complicated parts of the X86 platform, and specifically, | |
21 | the virtualization of this platform is the plethora of timing devices available | |
22 | and the complexity of emulating those devices. In addition, virtualization of | |
23 | time introduces a new set of challenges because it introduces a multiplexed | |
24 | division of time beyond the control of the guest CPU. | |
25 | ||
26 | First, we will describe the various timekeeping hardware available, then | |
27 | present some of the problems which arise and solutions available, giving | |
28 | specific recommendations for certain classes of KVM guests. | |
29 | ||
30 | The purpose of this document is to collect data and information relevant to | |
31 | timekeeping which may be difficult to find elsewhere, specifically, | |
32 | information relevant to KVM and hardware-based virtualization. | |
33 | ||
6012d9a9 MCC |
34 | 2. Timing Devices |
35 | ================= | |
f392eb25 ZA |
36 | |
37 | First we discuss the basic hardware devices available. TSC and the related | |
38 | KVM clock are special enough to warrant a full exposition and are described in | |
39 | the following section. | |
40 | ||
6012d9a9 MCC |
41 | 2.1. i8254 - PIT |
42 | ---------------- | |
f392eb25 ZA |
43 | |
44 | One of the first timer devices available is the programmable interrupt timer, | |
45 | or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three | |
46 | channels which can be programmed to deliver periodic or one-shot interrupts. | |
47 | These three channels can be configured in different modes and have individual | |
48 | counters. Channels 1 and 2 were not available for general use in the original | |
49 | IBM PC, and historically were connected to control RAM refresh and the PC | |
50 | speaker. Now the PIT is typically integrated as part of an emulated chipset | |
51 | and a separate physical PIT is not used. | |
52 | ||
53 | The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done | |
54 | using single or multiple byte access to the I/O ports. There are 6 modes | |
55 | available, but not all modes are available to all timers, as only timer 2 | |
56 | has a connected gate input, required for modes 1 and 5. The gate line is | |
6012d9a9 | 57 | controlled by port 61h, bit 0, as illustrated in the following diagram:: |
f392eb25 | 58 | |
     --------------             ----------------
    |              |           |                |
    |  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
    |    Clock     |   |       |                |
     --------------    |    +->| GATE  TIMER 0  |
                       |        ----------------
                       |
                       |        ----------------
                       |       |                |
                       |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                       |       |                |            (aka /dev/null)
                       |    +->| GATE  TIMER 1  |
                       |        ----------------
                       |
                       |        ----------------
                       |       |                |
                       |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                               |                |      |
    Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----     ____
                                ----------------          _|    )--|LPF|---Speaker
                                                         / *----   \___/
    Port 61h, bit 1 ------------------------------------/
f392eb25 ZA |
81 | |
82 | The timer modes are now described. | |
83 | ||
6012d9a9 MCC |
84 | Mode 0: Single Timeout. |
85 | This is a one-shot software timeout that counts down | |
f392eb25 ZA |
86 | when the gate is high (always true for timers 0 and 1). When the count |
87 | reaches zero, the output goes high. | |
88 | ||
6012d9a9 MCC |
89 | Mode 1: Triggered One-shot. |
90 | The output is initially set high. When the gate | |
f392eb25 ZA |
91 | line is set high, a countdown is initiated (which does not stop if the gate is |
92 | lowered), during which the output is set low. When the count reaches zero, | |
93 | the output goes high. | |
94 | ||
6012d9a9 MCC |
95 | Mode 2: Rate Generator. |
96 | The output is initially set high. When the countdown | |
f392eb25 ZA |
97 | reaches 1, the output goes low for one count and then returns high. The value |
98 | is reloaded and the countdown automatically resumes. If the gate line goes | |
99 | low, the count is halted. If the output is low when the gate is lowered, the | |
100 | output automatically goes high (this only affects timer 2). | |
101 | ||
6012d9a9 MCC |
102 | Mode 3: Square Wave. |
103 | This generates a high / low square wave. The count | |
f392eb25 ZA |
104 | determines the length of the pulse, which alternates between high and low |
105 | when zero is reached. The count only proceeds when gate is high and is | |
106 | automatically reloaded on reaching zero. The count is decremented twice at | |
107 | each clock to generate a full high / low cycle at the full periodic rate. | |
108 | If the count is even, the clock remains high for N/2 counts and low for N/2 | |
109 | counts; if the count is odd, the clock is high for (N+1)/2 counts and low | |
110 | for (N-1)/2 counts. Only even values are latched by the counter, so odd | |
111 | values are not observed when reading. This is the intended mode for timer 2, | |
112 | which generates sine-like tones by low-pass filtering the square wave output. | |
113 | ||
6012d9a9 MCC |
114 | Mode 4: Software Strobe. |
115 | After programming this mode and loading the counter, | |
f392eb25 ZA |
116 | the output remains high until the counter reaches zero. Then the output |
117 | goes low for 1 clock cycle and returns high. The counter is not reloaded. | |
118 | Counting only occurs when gate is high. | |
119 | ||
6012d9a9 MCC |
120 | Mode 5: Hardware Strobe. |
121 | After programming and loading the counter, the | |
f392eb25 ZA |
122 | output remains high. When the gate is raised, a countdown is initiated |
123 | (which does not stop if the gate is lowered). When the counter reaches zero, | |
124 | the output goes low for 1 clock cycle and then returns high. The counter is | |
125 | not reloaded. | |
126 | ||
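The even / odd split described for Mode 3 can be written down in a few lines. The following is a minimal sketch only; the structure and function names are hypothetical and are not taken from any kernel or emulator source.

```c
#include <stdint.h>

/* Length of the high and low phases of one Mode 3 period, in PIT clocks. */
struct pit_mode3_phases {
	uint32_t high;
	uint32_t low;
};

/*
 * Hypothetical helper: split a programmed count N into its two phases.
 * Even N gives N/2 and N/2; odd N gives (N+1)/2 and (N-1)/2.
 */
static struct pit_mode3_phases pit_mode3_split(uint32_t count)
{
	struct pit_mode3_phases p;

	p.high = (count + 1) / 2;
	p.low = count / 2;
	return p;
}
```

For example, a count of 7 yields a high phase of 4 clocks and a low phase of 3 clocks, so the full period is still 7 clocks.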
127 | In addition to normal binary counting, the PIT supports BCD counting. The | |
128 | command port, 0x43, is used to set the counter and mode for each of the three | |
129 | timers. | |
130 | ||
6012d9a9 | 131 | PIT commands are issued to port 0x43 using the following bit encoding:: |
f392eb25 | 132 | |
6012d9a9 MCC |
133 | Bit 7-4: Command (See table below) |
134 | Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) | |
135 | Bit 0 : Binary (0) / BCD (1) | |
f392eb25 | 136 | |
6012d9a9 | 137 | Command table:: |
f392eb25 | 138 | |
6012d9a9 | 139 | 0000 - Latch Timer 0 count for port 0x40 |
f392eb25 ZA |
140 | sample and hold the count to be read in port 0x40; |
141 | additional commands ignored until counter is read; | |
142 | mode bits ignored. | |
143 | ||
6012d9a9 | 144 | 0001 - Set Timer 0 LSB mode for port 0x40 |
f392eb25 ZA |
145 | set timer to read LSB only and force MSB to zero; |
146 | mode bits set timer mode | |
147 | ||
6012d9a9 | 148 | 0010 - Set Timer 0 MSB mode for port 0x40 |
f392eb25 ZA |
149 | set timer to read MSB only and force LSB to zero; |
150 | mode bits set timer mode | |
151 | ||
6012d9a9 | 152 | 0011 - Set Timer 0 16-bit mode for port 0x40 |
f392eb25 ZA |
153 | set timer to read / write LSB first, then MSB; |
154 | mode bits set timer mode | |
155 | ||
6012d9a9 MCC |
156 | 0100 - Latch Timer 1 count for port 0x41 - as described above |
157 | 0101 - Set Timer 1 LSB mode for port 0x41 - as described above | |
158 | 0110 - Set Timer 1 MSB mode for port 0x41 - as described above | |
159 | 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above | |
f392eb25 | 160 | |
6012d9a9 MCC |
161 | 1000 - Latch Timer 2 count for port 0x42 - as described above |
162 | 1001 - Set Timer 2 LSB mode for port 0x42 - as described above | |
163 | 1010 - Set Timer 2 MSB mode for port 0x42 - as described above | |
164 | 1011 - Set Timer 2 16-bit mode for port 0x42 - as described above | |
f392eb25 | 165 | |
6012d9a9 | 166 | 1101 - General counter latch |
f392eb25 ZA |
167 | Latch combination of counters into corresponding ports |
168 | Bit 3 = Counter 2 | |
169 | Bit 2 = Counter 1 | |
170 | Bit 1 = Counter 0 | |
171 | Bit 0 = Unused | |
172 | ||
6012d9a9 | 173 | 1110 - Latch timer status |
f392eb25 ZA |
174 | Latch combination of counter mode into corresponding ports |
175 | Bit 3 = Counter 2 | |
176 | Bit 2 = Counter 1 | |
177 | Bit 1 = Counter 0 | |
178 | ||
179 | The output of ports 0x40-0x42 following this command will be: | |
180 | ||
181 | Bit 7 = Output pin | |
182 | Bit 6 = Count loaded (0 if timer has expired) | |
183 | Bit 5-4 = Read / Write mode | |
184 | 01 = LSB only | |
185 | 10 = MSB only | |
186 | 11 = LSB / MSB (16-bit) | |
187 | Bit 3-1 = Mode | |
188 | Bit 0 = Binary (0) / BCD mode (1) | |
189 | ||
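As an illustration of the bit encoding above, a command byte can be assembled from the timer number, access mode, timer mode and BCD flag, and a reload value can be derived from the 1.193182 MHz base clock. This is a sketch under stated assumptions: the names are hypothetical, and it relies on the 8254 convention that a 16-bit reload value of 0 means 65536 in binary mode.

```c
#include <stdint.h>

#define PIT_BASE_HZ 1193182u	/* base clock given in the text */

/* Bits 5-4 of the command byte, following the command table. */
enum pit_access {
	PIT_LATCH = 0,
	PIT_LSB = 1,
	PIT_MSB = 2,
	PIT_LSB_MSB = 3
};

/* Hypothetical helper: bits 7-6 timer, 5-4 access, 3-1 mode, 0 BCD. */
static uint8_t pit_command(unsigned timer, enum pit_access access,
			   unsigned mode, int bcd)
{
	return (uint8_t)(((timer & 0x3u) << 6) |
			 (((unsigned)access & 0x3u) << 4) |
			 ((mode & 0x7u) << 1) |
			 (bcd ? 1u : 0u));
}

/*
 * Hypothetical helper: reload value for a target frequency, rounded to
 * the nearest count. A result of 0 stands for the maximum count, 65536.
 */
static uint16_t pit_reload_for_hz(uint32_t hz)
{
	uint32_t div;

	if (hz == 0)
		return 0;	/* slowest possible rate */
	div = (PIT_BASE_HZ + hz / 2) / hz;
	if (div < 1)
		div = 1;
	if (div > 65536)
		div = 65536;
	return (uint16_t)div;	/* 65536 truncates to 0 */
}
```

With these definitions, timer 0 in 16-bit access and Mode 2 encodes as 0x34, and timer 2 in 16-bit access and Mode 3 encodes as 0xB6. A 1000 Hz request rounds to a reload value of 1193.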
6012d9a9 MCC |
190 | 2.2. RTC |
191 | -------- | |
f392eb25 ZA |
192 | |
193 | The second device which was available in the original PC was the MC146818 real | |
194 | time clock. The original device is now obsolete, and usually emulated by the | |
195 | system chipset, sometimes by an HPET and some frankenstein IRQ routing. | |
196 | ||
197 | The RTC is accessed through CMOS variables, which use an index register to | |
198 | control which bytes are read. Since there is only one index register, read | |
199 | of the CMOS and read of the RTC require lock protection (in addition, it is | |
200 | dangerous to allow userspace utilities such as hwclock to have direct RTC | |
201 | access, as they could corrupt kernel reads and writes of CMOS memory). | |
202 | ||
203 | The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt | |
204 | can function as a periodic timer or as a once-a-day alarm, and can issue | |
205 | interrupts after an update of the CMOS registers by the MC146818 is complete. | |
206 | The type of interrupt is signalled in the RTC status registers. | |
207 | ||
208 | The RTC will update the current time fields by battery power even while the | |
209 | system is off. The current time fields should not be read while an update is | |
210 | in progress, as indicated in the status register. | |
211 | ||
212 | The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be | |
213 | programmed to a 32kHz divider if the RTC is to count seconds. | |
214 | ||
6012d9a9 MCC |
215 | This is the RAM map originally used for the RTC/CMOS:: |
216 | ||
217 | Location Size Description | |
218 | ------------------------------------------ | |
219 | 00h byte Current second (BCD) | |
220 | 01h byte Seconds alarm (BCD) | |
221 | 02h byte Current minute (BCD) | |
222 | 03h byte Minutes alarm (BCD) | |
223 | 04h byte Current hour (BCD) | |
224 | 05h byte Hours alarm (BCD) | |
225 | 06h byte Current day of week (BCD) | |
226 | 07h byte Current day of month (BCD) | |
227 | 08h byte Current month (BCD) | |
228 | 09h byte Current year (BCD) | |
229 | 0Ah byte Register A | |
f392eb25 ZA |
230 | bit 7 = Update in progress |
231 | bit 6-4 = Divider for clock | |
232 | 000 = 4.194 MHz | |
233 | 001 = 1.049 MHz | |
234 | 010 = 32 kHz | |
235 | 10X = test modes | |
236 | 110 = reset / disable | |
237 | 111 = reset / disable | |
238 | bit 3-0 = Rate selection for periodic interrupt | |
239 | 0000 = periodic timer disabled | |
240 | 0001 = 3.90625 mS | |
241 | 0010 = 7.8125 mS | |
242 | 0011 = .122070 mS | |
243 | 0100 = .244141 mS | |
244 | ... | |
245 | 1101 = 125 mS | |
246 | 1110 = 250 mS | |
247 | 1111 = 500 mS | |
6012d9a9 | 248 | 0Bh byte Register B |
f392eb25 ZA |
249 | bit 7 = Run (0) / Halt (1) |
250 | bit 6 = Periodic interrupt enable | |
251 | bit 5 = Alarm interrupt enable | |
252 | bit 4 = Update-ended interrupt enable | |
253 | bit 3 = Square wave output enable | |
254 | bit 2 = BCD calendar (0) / Binary (1) | |
255 | bit 1 = 12-hour mode (0) / 24-hour mode (1) | |
256 | bit 0 = 0 (DST off) / 1 (DST enabled) | |
6012d9a9 | 257 | 0Ch byte Register C (read only) |
f392eb25 ZA |
258 | bit 7 = interrupt request flag (IRQF) |
259 | bit 6 = periodic interrupt flag (PF) | |
260 | bit 5 = alarm interrupt flag (AF) | |
261 | bit 4 = update interrupt flag (UF) | |
262 | bit 3-0 = reserved | |
6012d9a9 | 263 | 0Dh byte Register D (read only) |
f392eb25 ZA |
264 | bit 7 = RTC has power |
265 | bit 6-0 = reserved | |
6012d9a9 | 266 | 32h byte Current century BCD (*) |
f392eb25 ZA |
267 | (*) location vendor specific and now determined from ACPI global tables |
268 | ||
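Two small helpers illustrate how the map above is typically interpreted: decoding a BCD field, and turning the rate selection bits of Register A into a periodic interrupt frequency. This is a sketch only. The function names are hypothetical, and the rate formula assumes the 32.768 kHz time base, where rate codes 1 and 2 alias to 256 Hz and 128 Hz as on the MC146818.

```c
#include <stdint.h>

/* Convert one packed BCD byte (e.g. the seconds field) to binary. */
static unsigned rtc_bcd_to_bin(uint8_t bcd)
{
	return (unsigned)(bcd >> 4) * 10u + (bcd & 0x0Fu);
}

/* Bit 7 of Register A: time fields must not be read while this is set. */
static int rtc_update_in_progress(uint8_t reg_a)
{
	return (reg_a & 0x80u) != 0;
}

/*
 * Periodic interrupt frequency in Hz for bits 3-0 of Register A,
 * assuming a 32.768 kHz time base. Returns 0 when the timer is disabled.
 */
static unsigned rtc_periodic_hz(uint8_t reg_a)
{
	unsigned rate = reg_a & 0x0Fu;

	if (rate == 0)
		return 0;
	if (rate <= 2)
		rate += 7;	/* codes 1 and 2 give 256 Hz and 128 Hz */
	return 32768u >> (rate - 1);
}
```

A rate code of 1101 gives 8 Hz, which matches the 125 mS entry in the table, and 1111 gives 2 Hz, matching 500 mS.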
6012d9a9 MCC |
269 | 2.3. APIC |
270 | --------- | |
f392eb25 ZA |
271 | |
272 | On Pentium and later processors, an on-board timer is available to each CPU | |
273 | as part of the Advanced Programmable Interrupt Controller. The APIC is | |
274 | accessed through memory-mapped registers and provides interrupt service to each | |
275 | CPU, used for IPIs and local timer interrupts. | |
276 | ||
277 | Although in theory the APIC is a safe and stable source for local interrupts, | |
278 | in practice, many bugs and glitches have occurred due to the special nature of | |
279 | the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect | |
280 | the use of the APIC and that workarounds may be required. In addition, some of | |
281 | these workarounds pose unique constraints for virtualization - requiring either | |
282 | extra overhead incurred from extra reads of memory-mapped I/O or additional | |
283 | functionality that may be more computationally expensive to implement. | |
284 | ||
285 | Since the APIC is documented quite well in the Intel and AMD manuals, we will | |
286 | avoid repetition of the detail here. It should be pointed out that the APIC | |
287 | timer is programmed through the LVT (local vector table) timer register, is capable | |
288 | of one-shot or periodic operation, and is based on the bus clock divided down | |
289 | by the programmable divider register. | |
290 | ||
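Because the APIC timer counts the bus clock divided by the programmable divider, the initial count for a desired period follows from simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, the bus frequency must be supplied by the caller (it is normally calibrated against another clock), and register programming is left out entirely.

```c
#include <stdint.h>

/*
 * Hypothetical helper: initial count for a timer period of period_ns,
 * given the bus clock in Hz and the configured divider (1, 2, 4, ... 128).
 * The result is clamped to the 32-bit width of the count register.
 * Very long periods at very high bus clocks can overflow the 64-bit
 * intermediate; this sketch does not guard against that.
 */
static uint32_t apic_timer_initial_count(uint64_t bus_hz, uint32_t divider,
					 uint64_t period_ns)
{
	uint64_t ticks;

	if (divider == 0)
		return 0;
	ticks = (bus_hz / divider) * period_ns / 1000000000ull;
	if (ticks > UINT32_MAX)
		ticks = UINT32_MAX;
	return (uint32_t)ticks;
}
```

For instance, a 200 MHz bus clock with a divider of 16 and a 1 ms period gives an initial count of 12500.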
6012d9a9 MCC |
291 | 2.4. HPET |
292 | --------- | |
f392eb25 ZA |
293 | |
294 | HPET is quite complex, and was originally intended to replace the PIT / RTC | |
295 | support of the X86 PC. It remains to be seen whether that will be the case, as | |
296 | the de facto standard of PC hardware is to emulate these older devices. Some | |
297 | systems designated as legacy free may support only the HPET as a hardware timer | |
298 | device. | |
299 | ||
300 | The HPET spec is rather loose and vague, requiring at least 3 hardware timers, | |
301 | but allowing implementation freedom to support many more. It also imposes no | |
302 | fixed rate on the timer frequency, but does impose some extremal values on | |
303 | frequency, error and slew. | |
304 | ||
305 | In general, the HPET is recommended as a high precision (compared to PIT / RTC) | |
306 | time source which is independent of local variation (as there is only one HPET | |
307 | in any given system). The HPET is also memory-mapped, and its presence is | |
308 | indicated through ACPI tables by the BIOS. | |
309 | ||
310 | Detailed specification of the HPET is beyond the current scope of this | |
311 | document, as it is also very well documented elsewhere. | |
312 | ||
6012d9a9 MCC |
313 | 2.5. Offboard Timers |
314 | -------------------- | |
f392eb25 ZA |
315 | |
316 | Several cards, both proprietary (watchdog boards) and commonplace (e1000) have | |
317 | timing chips built into the cards which may have registers which are accessible | |
318 | to kernel or user drivers. To the author's knowledge, using these to generate | |
319 | a clocksource for a Linux or other kernel has not yet been attempted and is in | |
320 | general frowned upon as not playing by the agreed rules of the game. Such a | |
321 | timer device would require additional support to be virtualized properly and is | |
322 | not considered important at this time as no known operating system does this. | |
323 | ||
6012d9a9 MCC |
324 | 3. TSC Hardware |
325 | =============== | |
f392eb25 ZA |
326 | |
327 | The TSC or time stamp counter is relatively simple in theory; it counts | |
328 | instruction cycles issued by the processor, which can be used as a measure of | |
329 | time. In practice, due to a number of problems, it is the most complicated | |
330 | timekeeping device to use. | |
331 | ||
332 | The TSC is represented internally as a 64-bit MSR which can be read with the | |
333 | RDMSR, RDTSC, or RDTSCP (when available) instructions. Writing the TSC has | |
334 | long been possible, but on old hardware it was generally only possible to | |
335 | write the low 32 bits of the 64-bit counter, and the upper 32 bits of the | |
336 | counter were cleared. Now, however, on Intel processors family 0Fh, for | |
337 | models 3, 4 and 6, and family 06h, models e and f, this restriction has been | |
338 | lifted and all 64 bits are writable. On AMD systems, the ability to | |
339 | write the TSC MSR is not an architectural guarantee. | |
340 | ||
341 | The TSC is accessible from CPL 0 and, conditionally, from CPL > 0 software. The | |
342 | CR4.TSD bit, when set, disables CPL > 0 TSC access. | |
343 | ||
344 | Some vendors have implemented an additional instruction, RDTSCP, which returns | |
345 | atomically not just the TSC, but an indicator which corresponds to the | |
346 | processor number. This can be used to index into an array of TSC variables to | |
347 | determine offset information in SMP systems where TSCs are not synchronized. | |
348 | The presence of this instruction must be determined by consulting CPUID feature | |
349 | bits. | |
350 | ||
351 | Both VMX and SVM provide extension fields in the virtualization hardware which | |
352 | allow the guest-visible TSC to be offset by a constant. Newer implementations | |
353 | promise to allow the TSC to additionally be scaled, but this hardware is not | |
354 | yet widely available. | |
355 | ||
6012d9a9 MCC |
356 | 3.1. TSC synchronization |
357 | ------------------------ | |
f392eb25 ZA |
358 | |
359 | The TSC is a CPU-local clock in most implementations. This means, on SMP | |
360 | platforms, the TSCs of different CPUs may start at different times depending | |
361 | on when the CPUs are powered on. Generally, CPUs on the same die will share | |
362 | the same clock; however, this is not always the case. | |
363 | ||
364 | The BIOS may attempt to resynchronize the TSCs during the power-on process and | |
365 | the operating system or other system software may attempt to do this as well. | |
366 | Several hardware limitations make the problem worse - if it is not possible to | |
367 | write the full 64-bits of the TSC, it may be impossible to match the TSC in | |
368 | newly arriving CPUs to that of the rest of the system, resulting in | |
369 | unsynchronized TSCs. Resynchronization may be attempted by BIOS or system software, but in | |
370 | practice, getting a perfectly synchronized TSC will not be possible unless all | |
371 | values are read from the same clock, which generally only is possible on single | |
372 | socket systems or those with special hardware support. | |
373 | ||
6012d9a9 MCC |
374 | 3.2. TSC and CPU hotplug |
375 | ------------------------ | |
f392eb25 ZA |
376 | |
377 | As touched on already, CPUs which arrive later than the boot time of the system | |
378 | may not have a TSC value that is synchronized with the rest of the system. | |
379 | Either system software, BIOS, or SMM code may actually try to establish the TSC | |
380 | to a value matching the rest of the system, but a perfect match is usually not | |
381 | a guarantee. This can have the effect of bringing a system from a state where | |
382 | TSC is synchronized back to a state where TSC synchronization flaws, however | |
383 | small, may be exposed to the OS and any virtualization environment. | |
384 | ||
6012d9a9 MCC |
385 | 3.3. TSC and multi-socket / NUMA |
386 | -------------------------------- | |
f392eb25 ZA |
387 | |
388 | Multi-socket systems, especially large multi-socket systems, are likely to have | |
389 | individual clocksources rather than a single, universally distributed clock. | |
390 | Since these clocks are driven by different crystals, they will not have | |
391 | perfectly matched frequency, and temperature and electrical variations will | |
392 | cause the CPU clocks, and thus the TSCs, to drift over time. Depending on the | |
393 | exact clock and bus design, the drift may or may not be fixed in absolute | |
394 | error, and may accumulate over time. | |
395 | ||
396 | In addition, very large systems may deliberately slew the clocks of individual | |
397 | cores. This technique, known as spread-spectrum clocking, reduces EMI at the | |
398 | clock frequency and harmonics of it, which may be required to pass FCC | |
399 | standards for telecommunications and computer equipment. | |
400 | ||
401 | It is recommended not to trust the TSCs to remain synchronized on NUMA or | |
402 | multiple socket systems for these reasons. | |
403 | ||
6012d9a9 MCC |
404 | 3.4. TSC and C-states |
405 | --------------------- | |
f392eb25 ZA |
406 | |
407 | C-states, or idling states of the processor, especially C1E and deeper sleep | |
408 | states, may be problematic for TSC as well. The TSC may stop advancing in such | |
409 | a state, resulting in a TSC which is behind that of other CPUs when execution | |
410 | is resumed. Such CPUs must be detected and flagged by the operating system | |
411 | based on CPU and chipset identifications. | |
412 | ||
413 | The TSC in such a case may be corrected by catching it up to a known external | |
414 | clocksource. | |
415 | ||
6012d9a9 MCC |
416 | 3.5. TSC frequency change / P-states |
417 | ------------------------------------ | |
f392eb25 ZA |
418 | |
419 | To make things slightly more interesting, some CPUs may change frequency. They | |
420 | may or may not run the TSC at the same rate, and because the frequency change | |
421 | may be staggered or slewed, at some points in time, the TSC rate may not be | |
422 | known other than falling within a range of values. In this case, the TSC will | |
423 | not be a stable time source, and must be calibrated against a known, stable, | |
424 | external clock to be a usable source of time. | |
425 | ||
426 | Whether the TSC runs at a constant rate or scales with the P-state is model | |
427 | dependent and must be determined by inspecting CPUID, chipset or vendor | |
428 | specific MSR fields. | |
429 | ||
430 | In addition, some vendors have known bugs where the P-state is actually | |
431 | compensated for properly during normal operation, but when the processor is | |
432 | inactive, the P-state may be raised temporarily to service cache misses from | |
433 | other processors. In such cases, the TSC on halted CPUs could advance faster | |
434 | than that of non-halted processors. AMD Turion processors are known to have | |
435 | this problem. | |
436 | ||
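Calibrating the TSC against a stable external clock, as this section calls for, amounts to sampling both clocks twice and dividing the deltas. The following is a minimal sketch; the function name and the choice of kHz as the result unit are assumptions made for illustration, and reading the actual clocks is left to the caller.

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate the TSC rate in kHz from two paired
 * samples of the TSC and of a reference clock expressed in nanoseconds.
 * Returns 0 if the reference clock did not advance. The multiplication
 * can overflow for very long measurement windows (hours at GHz rates).
 */
static uint64_t tsc_khz_from_samples(uint64_t tsc_start, uint64_t tsc_end,
				     uint64_t ref_ns_start, uint64_t ref_ns_end)
{
	uint64_t delta_tsc = tsc_end - tsc_start;
	uint64_t delta_ns = ref_ns_end - ref_ns_start;

	if (delta_ns == 0)
		return 0;
	return delta_tsc * 1000000ull / delta_ns;
}
```

If the TSC advances by 30,000,000 counts while the reference clock advances by 10 ms, the estimate is 3,000,000 kHz. On CPUs whose TSC rate follows the P-state, the result is only valid for the frequency in effect during the measurement window.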
6012d9a9 MCC |
437 | 3.6. TSC and STPCLK / T-states |
438 | ------------------------------ | |
f392eb25 ZA |
439 | |
440 | External signals given to the processor may also have the effect of stopping | |
441 | the TSC. This is typically done for thermal emergency power control to prevent | |
442 | an overheating condition, and typically, there is no way to detect that this | |
443 | condition has happened. | |
444 | ||
6012d9a9 MCC |
445 | 3.7. TSC virtualization - VMX |
446 | ----------------------------- | |
f392eb25 ZA |
447 | |
448 | VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | |
449 | instructions, which is enough for full virtualization of TSC in any manner. In | |
450 | addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET | |
451 | field specified in the VMCS. Special instructions must be used to read and | |
452 | write the VMCS field. | |
453 | ||
6012d9a9 MCC |
454 | 3.8. TSC virtualization - SVM |
455 | ----------------------------- | |
f392eb25 ZA |
456 | |
457 | SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | |
458 | instructions, which is enough for full virtualization of TSC in any manner. In | |
459 | addition, SVM allows passing through the host TSC plus an additional offset | |
460 | field specified in the SVM control block. | |
461 | ||
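For both VMX and SVM, the offset mechanism described above means the guest observes the host TSC plus a per-guest constant. The arithmetic can be sketched as follows; the function names are hypothetical, and the code shows only the computation, not how the field is written into the VMCS or the SVM control block.

```c
#include <stdint.h>

/*
 * Hypothetical helper: offset that makes the guest read
 * desired_guest_tsc at the moment the host reads host_tsc.
 * Unsigned subtraction wraps modulo 2^64, so a "negative" offset
 * is represented naturally.
 */
static uint64_t tsc_offset_for(uint64_t host_tsc, uint64_t desired_guest_tsc)
{
	return desired_guest_tsc - host_tsc;
}

/* Value the guest observes for a given host TSC and programmed offset. */
static uint64_t guest_visible_tsc(uint64_t host_tsc, uint64_t offset)
{
	return host_tsc + offset;
}
```

A guest started when the host TSC reads 1,000,000 and meant to see a TSC beginning at zero gets an offset of -1,000,000 (modulo 2^64); 500 host cycles later the guest reads 500.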
6012d9a9 MCC |
462 | 3.9. TSC feature bits in Linux |
463 | ------------------------------ | |
f392eb25 ZA |
464 | |
465 | In summary, there is no way to guarantee the TSC remains in perfect | |
466 | synchronization unless it is explicitly guaranteed by the architecture. Even | |
467 | if so, the TSCs in multi-socket or NUMA systems may still run independently | |
468 | despite being locally consistent. | |
469 | ||
470 | The following feature bits are used by Linux to signal various TSC attributes, | |
471 | but they can only be taken to be meaningful for UP or single node systems. | |
472 | ||
6012d9a9 MCC |
473 | ========================= ======================================= |
474 | X86_FEATURE_TSC The TSC is available in hardware | |
475 | X86_FEATURE_RDTSCP The RDTSCP instruction is available | |
476 | X86_FEATURE_CONSTANT_TSC The TSC rate is unchanged with P-states | |
477 | X86_FEATURE_NONSTOP_TSC The TSC does not stop in C-states | |
478 | X86_FEATURE_TSC_RELIABLE TSC sync checks are skipped (VMware) | |
479 | ========================= ======================================= | |
f392eb25 | 480 | |
6012d9a9 MCC |
481 | 4. Virtualization Problems |
482 | ========================== | |
f392eb25 ZA |
483 | |
484 | Timekeeping is especially problematic for virtualization because a number of | |
485 | challenges arise. The most obvious problem is that time is now shared between | |
486 | the host and, potentially, a number of virtual machines. Thus the virtual | |
487 | operating system does not run with 100% usage of the CPU, despite the fact that | |
488 | it may very well make that assumption. It may expect it to remain true to very | |
489 | exacting bounds when interrupt sources are disabled, but in reality only its | |
490 | virtual interrupt sources are disabled, and the machine may still be preempted | |
491 | at any time. This causes problems as the passage of real time, the injection | |
492 | of machine interrupts and the associated clock sources are no longer completely | |
493 | synchronized with real time. | |
494 | ||
17180032 | 495 | This same problem can occur on native hardware to a degree, as SMM mode may |
f392eb25 ZA |
496 | steal cycles from the operating system on X86 systems when SMM mode is used by the | |
497 | BIOS, but not in such an extreme fashion. However, the fact that SMM mode may | |
498 | cause similar problems to virtualization makes it a good justification for | |
499 | solving many of these problems on bare metal. | |
500 | ||
6012d9a9 MCC |
501 | 4.1. Interrupt clocking |
502 | ----------------------- | |
f392eb25 ZA |
503 | |
504 | One of the most immediate problems that occurs with legacy operating systems | |
505 | is that the system timekeeping routines are often designed to keep track of | |
506 | time by counting periodic interrupts. These interrupts may come from the PIT | |
507 | or the RTC, but the problem is the same: the host virtualization engine may not | |
508 | be able to deliver the proper number of interrupts per second, and so guest | |
509 | time may fall behind. This is especially problematic if a high interrupt rate | |
510 | is selected, such as 1000 HZ, which is unfortunately the default for many Linux | |
511 | guests. | |
512 | ||
513 | There are three approaches to solving this problem; first, it may be possible | |
514 | to simply ignore it. Guests which have a separate time source for tracking | |
515 | 'wall clock' or 'real time' may not need any adjustment of their interrupts to | |
516 | maintain proper time. If this is not sufficient, it may be necessary to inject | |
517 | additional interrupts into the guest in order to increase the effective | |
518 | interrupt rate. This approach leads to complications in extreme conditions, | |
519 | where host load or guest lag is too much to compensate for, and thus another | |
520 | solution to the problem has arisen: the guest may need to become aware of lost | |
521 | ticks and compensate for them internally. Although promising in theory, the | |
522 | implementation of this policy in Linux has been extremely error prone, and a | |
523 | number of buggy variants of lost tick compensation are distributed across | |
524 | commonly used Linux systems. | |
525 | ||
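The bookkeeping behind tick catch-up, whether performed by the host when injecting extra interrupts or by the guest when compensating for lost ticks, can be sketched as follows. This is an illustrative sketch only, assuming a trustworthy reference clock in nanoseconds is available; the structure and function names are hypothetical and do not reflect the Linux implementation.

```c
#include <stdint.h>

struct tick_state {
	uint64_t period_ns;	/* tick period, e.g. 1000000 for 1000 HZ */
	uint64_t last_tick_ns;	/* reference time already accounted for */
};

/*
 * Hypothetical helper: number of whole ticks that have elapsed on the
 * reference clock since the last accounted tick. The state advances by
 * whole periods only, so fractional time is carried to the next call.
 */
static uint64_t ticks_due(struct tick_state *ts, uint64_t now_ns)
{
	uint64_t n;

	if (ts->period_ns == 0 || now_ns < ts->last_tick_ns)
		return 0;
	n = (now_ns - ts->last_tick_ns) / ts->period_ns;
	ts->last_tick_ns += n * ts->period_ns;
	return n;
}
```

A return value greater than one indicates that ticks were missed; the caller then decides whether to inject them, account for them in one step, or cap the catch-up rate.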
526 | Windows uses periodic RTC clocking as a means of keeping time internally, and | |
527 | thus requires interrupt slewing to keep proper time. It does use a low enough | |
528 | rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in | |
529 | practice. | |
530 | ||
6012d9a9 MCC |
531 | 4.2. TSC sampling and serialization |
532 | ----------------------------------- | |
f392eb25 ZA |
533 | |
534 | As the highest precision time source available, the cycle counter of the CPU | |
535 | has aroused much interest from developers. As explained above, this timer has | |
536 | many problems unique to its nature as a local, potentially unstable and | |
537 | potentially unsynchronized source. One issue which is not unique to the TSC, | |
538 | but is highlighted because of its very precise nature is sampling delay. By | |
539 | definition, the counter, once read is already old. However, it is also | |
540 | possible for the counter to be read ahead of the actual use of the result. | |
541 | This is a consequence of the superscalar execution of the instruction stream, | |
542 | which may execute instructions out of order. Such execution is called | |
543 | non-serialized. Forcing serialized execution is necessary for precise | |
544 | measurement with the TSC, and requires a serializing instruction, such as CPUID | |
545 | or an MSR read. | |
546 | ||
547 | Since CPUID may actually be virtualized by a trap and emulate mechanism, this | |
548 | serialization can pose a performance issue for hardware virtualization. An | |
549 | accurate time stamp counter reading may therefore not always be available, and | |
550 | it may be necessary for an implementation to guard against "backwards" reads of | |
551 | the TSC as seen from other CPUs, even in an otherwise perfectly synchronized | |
552 | system. | |
553 | ||
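One way to guard against such backwards reads is to clamp every sample against the largest value handed out so far. The sketch below uses C11 atomics so that the guard also holds across CPUs; the names are hypothetical, the TSC sample is passed in by the caller, and the shared variable is itself a point of contention, so this is a sketch of the idea rather than a recommended implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Largest timestamp returned to any caller so far. */
static _Atomic uint64_t last_seen;

/*
 * Hypothetical helper: return the sample unless a later value has
 * already been published, in which case return that later value.
 */
static uint64_t clamp_monotonic(uint64_t sample)
{
	uint64_t last = atomic_load(&last_seen);

	do {
		if (sample <= last)
			return last;
		/* On failure, 'last' is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(&last_seen, &last, sample));

	return sample;
}
```

The cost is a loss of resolution: two reads that race may both return the same value, but neither will observe time moving backwards.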
6012d9a9 MCC |
554 | 4.3. Timespec aliasing |
555 | ---------------------- | |
f392eb25 ZA |
556 | |
557 | Additionally, this lack of serialization from the TSC poses another challenge | |
558 | when using results of the TSC when measured against another time source. As | |
559 | the TSC is much higher precision, many possible values of the TSC may be read | |
560 | while another clock is still expressing the same value. | |
561 | ||
562 | That is, you may read (T,T+10) while external clock C maintains the same value. | |
563 | Due to non-serialized reads, you may actually end up with a range which | |
564 | fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but | |
565 | calibrated against an external value may have a range of valid values. | |
566 | Re-calibrating this computation may actually cause time, as computed after the | |
567 | calibration, to go backwards, compared with time computed before the | |
568 | calibration. | |
569 | ||
570 | This problem is particularly pronounced with an internal time source in Linux, | |
571 | the kernel time, which is expressed in the theoretically high resolution | |
572 | timespec - but which advances in much larger granularity intervals, sometimes | |
573 | at the rate of jiffies, and possibly in catchup modes, at a much larger step. | |
574 | ||
575 | This aliasing requires care in the computation and recalibration of kvmclock | |
576 | and any other values derived from TSC computation (such as TSC virtualization | |
577 | itself). | |
578 | ||
6012d9a9 MCC |
579 | 4.4. Migration |
580 | -------------- | |
f392eb25 ZA |
581 | |
582 | Migration of a virtual machine raises problems for timekeeping in two ways. | |
583 | First, the migration itself may take time, during which interrupts cannot be | |
584 | delivered, and after which, the guest time may need to be caught up. NTP may | |
585 | be able to help to some degree here, as the clock correction required is | |
586 | typically small enough to fall in the NTP-correctable window. | |
587 | ||
588 | An additional concern is that timers based on the TSC (or HPET, if the raw bus | |
589 | clock is exposed) may now be running at different rates, requiring compensation | |
590 | in some way in the hypervisor by virtualizing these timers. In addition, | |
591 | migrating to a faster machine may preclude the use of a passthrough TSC, as a | |
592 | faster clock cannot be made visible to a guest without the potential of time | |
593 | advancing faster than usual. A slower clock is less of a problem, as it can | |
594 | always be caught up to the original rate. KVM clock avoids these problems by | |
595 | simply storing multipliers and offsets against the TSC for the guest to convert | |
596 | back into nanosecond resolution values. | |
597 | ||
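The multiplier-and-offset conversion described above might look roughly like the following on the guest side. This is a sketch under stated assumptions, loosely modeled on a pvclock-style layout; `struct clock_params`, its field names and `clock_read_ns` are hypothetical and do not reproduce the actual ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameters published by the hypervisor for the guest. */
struct clock_params {
    uint64_t tsc_base;  /* TSC value when the parameters were written */
    uint64_t ns_base;   /* guest time in nanoseconds at tsc_base */
    uint32_t mul;       /* ns per (shifted) tick, scaled by 2^32 */
    int8_t   shift;     /* pre-scaling shift applied to the TSC delta */
};

/* Convert a raw TSC reading into nanoseconds of guest time. */
uint64_t clock_read_ns(const struct clock_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_base;
    uint64_t hi, lo;

    assert(p->shift > -32 && p->shift < 32);
    if (p->shift >= 0)
        delta <<= p->shift;
    else
        delta >>= -p->shift;

    /* Compute (delta * mul) >> 32 without a 128-bit intermediate. */
    hi = delta >> 32;
    lo = delta & 0xffffffffu;
    return p->ns_base + hi * p->mul + ((lo * p->mul) >> 32);
}
```

With `mul = 0x80000000` and `shift = 0` each tick counts as 0.5 ns, matching a 2 GHz TSC. After migration to a host with a different TSC rate, the hypervisor would only need to publish new parameters; the guest-side conversion stays the same.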
6012d9a9 MCC |
598 | 4.5. Scheduling |
599 | --------------- | |
f392eb25 ZA |
600 | |
601 | Since scheduling may be based on precise timing and firing of interrupts, the | |
602 | scheduling algorithms of an operating system may be adversely affected by | |
603 | virtualization. In theory, the effect is random and should be uniformly | |
604 | distributed, but in contrived as well as real scenarios (guest device access, | |
605 | causes of virtualization exits, possible context switch), this may not always | |
606 | be the case. The effect of this has not been well studied. | |
607 | ||
608 | In an attempt to work around this, several implementations have provided a | |
609 | paravirtualized scheduler clock, which reveals the true amount of CPU time for | |
610 | which a virtual machine has been running. | |
611 | ||
6012d9a9 MCC |
612 | 4.6. Watchdogs |
613 | -------------- | |
f392eb25 ZA |
614 | |
615 | Watchdog timers, such as the lockup detector in Linux, may fire accidentally when | |
616 | running under hardware virtualization due to timer interrupts being delayed or | |
617 | misinterpretation of the passage of real time. Usually, these warnings are | |
618 | spurious and can be ignored, but in some circumstances it may be necessary to | |
619 | disable such detection. | |
620 | ||
6012d9a9 MCC |
621 | 4.7. Delays and precision timing |
622 | -------------------------------- | |
f392eb25 ZA |
623 | |
624 | Precise timing and delays may not be possible in a virtualized system. This | |
625 | can happen if the system is controlling physical hardware, or issues delays to | |
626 | compensate for slower I/O to and from devices. The first issue is not solvable | |
627 | in general for a virtualized system; hardware control software can't be | |
628 | adequately virtualized without a full real-time operating system, which would | |
629 | require an RT-aware virtualization platform. | |
630 | ||
631 | The second issue may cause performance problems, but this is unlikely to be a | |
632 | significant issue. In many cases these delays may be eliminated through | |
633 | configuration or paravirtualization. | |
634 | ||
6012d9a9 MCC |
635 | 4.8. Covert channels and leaks |
636 | ------------------------------ | |
f392eb25 ZA |
637 | |
638 | In addition to the above problems, time information will inevitably leak to the | |
639 | guest about the host in anything but a perfect implementation of virtualized | |
640 | time. This may allow the guest to infer the presence of a hypervisor (as in a | |
641 | red-pill type detection), and it may allow information to leak between guests | |
642 | by using CPU utilization itself as a signalling channel. Preventing such | |
643 | problems would require completely isolated virtual time which may not track | |
644 | real time any longer. This may be useful in certain security or QA contexts, | |
645 | but in general isn't recommended for real-world deployment scenarios. |