.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat. All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
its virtualization, is the plethora of timing devices available and the
complexity of emulating those devices. In addition, virtualization of time
introduces a new set of challenges because it introduces a multiplexed
division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available. TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters. Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker. Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.
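
As a rough illustration of how the fixed base clock translates into interrupt
rates, the following Python sketch computes the 16-bit reload value for a
requested frequency. The function names are hypothetical. Treating a reload
value of 0 as 65536 is standard 8254 behaviour in binary counting mode, not
something stated in this document.

```python
PIT_HZ = 1_193_182  # base clock given above, in Hz


def pit_reload_value(target_hz: float) -> int:
    """Return the 16-bit reload value that best approximates target_hz."""
    if target_hz <= 0:
        raise ValueError("target_hz must be positive")
    divisor = round(PIT_HZ / target_hz)
    # The counter is 16 bits wide; a programmed value of 0 counts 65536 clocks.
    divisor = max(1, min(divisor, 65536))
    return divisor & 0xFFFF


def pit_actual_hz(reload: int) -> float:
    """Frequency actually produced by a given reload value."""
    return PIT_HZ / (reload or 65536)
```

Requesting 100 Hz yields a reload value of 11932. Because the divisor is an
integer, the delivered rate is only approximately 100 Hz, which is one reason
tick counting alone drifts against wall clock time.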

The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports. There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5. The gate line is
controlled by port 61h, bit 0, as illustrated in the following diagram::

   --------------             ----------------
  |              |           |                |
  |  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
  |    Clock     |   |       |                |
   --------------    |    +->| GATE  TIMER 0  |
                     |        ----------------
                     |
                     |        ----------------
                     |       |                |
                     |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                     |       |                |            (aka /dev/null)
                     |    +->| GATE  TIMER 1  |
                     |        ----------------
                     |
                     |        ----------------
                     |       |                |
                     |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                             |                |      |
  Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____
                              ----------------         _|    )--|LPF|---Speaker
                                                      / *----   \___/
  Port 61h, bit 1 -----------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1). When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high. When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low. When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high. When the countdown
 reaches 1, the output goes low for one count and then returns high. The value
 is reloaded and the countdown automatically resumes. If the gate line goes
 low, the count is halted. If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave. The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached. The count only proceeds when gate is high and is
 automatically reloaded on reaching zero. The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the clock remains high for N/2 counts and low for N/2
 counts; if the count is odd, the clock is high for (N+1)/2 counts and low
 for (N-1)/2 counts. Only even values are latched by the counter, so odd
 values are not observed when reading. This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero. Then the output
 goes low for 1 clock cycle and returns high. The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high. When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered). When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high. The counter is
 not reloaded.
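
The Mode 3 duty cycle rule can be written down directly. This is a small
Python sketch with a hypothetical function name; the minimum count of 2 is an
assumption taken from common 8254 documentation rather than from the text
above.

```python
def square_wave_counts(n: int) -> tuple:
    """Return (high, low) durations, in input clocks, for a Mode 3 count n."""
    if n < 2:
        raise ValueError("Mode 3 is assumed to need a count of at least 2")
    high = (n + 1) // 2  # N/2 for even N, (N+1)/2 for odd N
    low = n // 2         # N/2 for even N, (N-1)/2 for odd N
    return high, low
```

For any count the two phases add up to N input clocks, so the output
frequency is the base clock divided by N regardless of whether N is odd or
even; only the duty cycle differs.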

In addition to normal binary counting, the PIT supports BCD counting. The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
         sample and hold the count to be read in port 0x40;
         additional commands ignored until counter is read;
         mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
         set timer to read LSB only and force MSB to zero;
         mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
         set timer to read MSB only and force LSB to zero;
         mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
         set timer to read / write LSB first, then MSB;
         mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
         Latch combination of counters into corresponding ports
         Bit 3 = Counter 2
         Bit 2 = Counter 1
         Bit 1 = Counter 0
         Bit 0 = Unused

  1110 - Latch timer status
         Latch combination of counter mode into corresponding ports
         Bit 3 = Counter 2
         Bit 2 = Counter 1
         Bit 1 = Counter 0

         The output of ports 0x40-0x42 following this command will be:

         Bit 7 = Output pin
         Bit 6 = Count loaded (0 if timer has expired)
         Bit 5-4 = Read / Write mode
             01 = LSB only
             10 = MSB only
             11 = LSB / MSB (16-bit)
         Bit 3-1 = Mode
         Bit 0 = Binary (0) / BCD mode (1)
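
The bit layout above can be exercised with a short Python sketch. The helper
names and the dictionary keys are hypothetical; the sketch only packs and
unpacks bits as laid out in the tables and performs no port I/O.

```python
# Low two bits of the 4-bit command field (bits 5-4 of the command byte).
ACCESS = {"latch": 0b00, "lsb": 0b01, "msb": 0b10, "lsb_msb": 0b11}


def pit_command(timer: int, access: str, mode: int = 0, bcd: bool = False) -> int:
    """Build the byte written to port 0x43 for timer 0, 1 or 2."""
    if timer not in (0, 1, 2):
        raise ValueError("timer must be 0, 1 or 2")
    if not 0 <= mode <= 5:
        raise ValueError("mode must be 0..5")
    # Bits 7-6 select the timer, bits 5-4 the access type,
    # bits 3-1 the mode and bit 0 binary / BCD counting.
    return (timer << 6) | (ACCESS[access] << 4) | (mode << 1) | int(bcd)


def decode_status(status: int) -> dict:
    """Split a latched status byte into the fields listed above."""
    return {
        "output": bool(status & 0x80),
        "count_loaded": bool(status & 0x40),
        "access": (status >> 4) & 0x3,
        "mode": (status >> 1) & 0x7,
        "bcd": bool(status & 0x01),
    }
```

For example, ``pit_command(0, "lsb_msb", mode=2)`` evaluates to 0x34, the
value commonly used to program timer 0 as a 16-bit rate generator.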

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock. The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read. Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
can function as a periodic timer or as an additional once-a-day alarm, and can
be issued after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields by battery power even while the
system is off. The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
programmed to a 32kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                  0000 = periodic timer disabled
                                  0001 = 3.90625 mS
                                  0010 = 7.8125 mS
                                  0011 = .122070 mS
                                  0100 = .244141 mS
                                     ...
                                  1101 = 125 mS
                                  1110 = 250 mS
                                  1111 = 500 mS
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave interrupt enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables
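
Two of the encodings in this map lend themselves to a short Python sketch:
decoding a BCD time field, and turning the rate selection bits of register A
into a periodic interrupt frequency. The function names are hypothetical. The
rate formula assumes the 32.768 kHz time base and follows the MC146818
convention that selections 0001 and 0010 alias 1000 and 1001.

```python
def bcd_to_int(value: int) -> int:
    """Decode one packed-BCD byte, e.g. 0x59 -> 59."""
    return (value >> 4) * 10 + (value & 0x0F)


def periodic_rate_hz(register_a: int) -> float:
    """Periodic interrupt frequency selected by bits 3-0 of register A."""
    rs = register_a & 0x0F
    if rs == 0:
        return 0.0          # periodic timer disabled
    if rs < 3:
        rs += 7             # 0001 and 0010 behave like 1000 and 1001
    return 32768 / (1 << (rs - 1))
```

With register A set to 0x26 (32 kHz divider, rate selection 0110) the sketch
gives 1024 Hz, and rate selection 1111 gives 2 Hz, matching the 500 mS entry
in the table.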

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller. The APIC is
accessed through memory-mapped registers and provides interrupt service to each
CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect
the use of the APIC and that workarounds may be required. In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here. It should be pointed out that the APIC
timer is programmed through its LVT (local vector table) register, is capable
of one-shot or periodic operation, and is based on the bus clock divided down
by the programmable divider register.
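
The relationship between bus clock, divider and timer period is simple
arithmetic. The Python sketch below assumes a known bus frequency (in
practice it has to be calibrated), a 32-bit initial count, and the usual
power-of-two divider values; the function name is hypothetical and no
register encodings are modelled.

```python
VALID_DIVIDERS = (1, 2, 4, 8, 16, 32, 64, 128)


def apic_initial_count(bus_hz: int, divider: int, period_s: float) -> int:
    """Initial count that makes the timer expire after period_s seconds."""
    if divider not in VALID_DIVIDERS:
        raise ValueError("divider must be a power of two from 1 to 128")
    count = round(bus_hz / divider * period_s)
    if not 0 < count <= 0xFFFFFFFF:
        raise ValueError("period not representable in a 32-bit count")
    return count
```

For an assumed 100 MHz bus clock divided by 16, a 1 ms period corresponds to
an initial count of 6250.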

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC. It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices. Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more. It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system). The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000), have
built-in timing chips whose registers may be accessible to kernel or user
drivers. To the author's knowledge, using these to generate a clocksource for
a Linux or other kernel has not yet been attempted and is in general frowned
upon as not playing by the agreed rules of the game. Such a timer device would
require additional support to be virtualized properly and is not considered
important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time. In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
limitations meant that although the TSC could be written, on old hardware it
was generally only possible to write the low 32 bits of the 64-bit counter, and
the upper 32 bits of the counter were cleared. Now, however, on Intel
processors family 0Fh, for models 3, 4 and 6, and family 06h, models e and f,
this restriction has been lifted and all 64 bits are writable. On AMD systems,
the ability to write the TSC MSR is not an architectural guarantee.

The TSC is accessible from CPL 0 and, conditionally, from CPL > 0 software.
The CR4.TSD bit, when set, disables CPL > 0 TSC access.

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number. This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant. Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.
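
The constant-offset model amounts to 64-bit modular arithmetic. The Python
sketch below uses hypothetical helper names and does not correspond to any
VMX or SVM field accessor; it only shows how an offset can be chosen and what
the guest then observes.

```python
MASK64 = (1 << 64) - 1


def tsc_offset_for(host_tsc: int, desired_guest_tsc: int) -> int:
    """Offset that makes the guest see desired_guest_tsc at host_tsc."""
    return (desired_guest_tsc - host_tsc) & MASK64


def guest_tsc(host_tsc: int, offset: int) -> int:
    """Guest-visible TSC: host TSC plus a constant offset, wrapping at 2^64."""
    return (host_tsc + offset) & MASK64
```

A guest started when the host TSC reads 10000 and expected to see its own
counter begin at zero gets an offset of 2^64 - 10000. When the host counter
later reads 10500, the guest reads 500.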

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations. This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on. Generally, CPUs on the same die will share
the same clock; however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the poweron process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs. Resynchronization may be attempted by BIOS or system
software, but in practice, getting a perfectly synchronized TSC will not be
possible unless all values are read from the same clock, which generally is
only possible on single socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
System software, BIOS, or SMM code may try to set the TSC to a value matching
the rest of the system, but a perfect match is usually not guaranteed. This
can have the effect of bringing a system from a state where TSC is synchronized
back to a state where TSC synchronization flaws, however small, may be exposed
to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems, are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs, to drift over time. Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.

In addition, very large systems may deliberately slew the clocks of individual
cores. This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states, may be problematic for TSC as well. The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed. Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency. They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values. In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.
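
Calibration against an external clock reduces to dividing a TSC delta by the
elapsed reference time. The Python sketch below assumes the reference clock
is read in nanoseconds and that both samples of each clock are taken as close
together as possible; the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def tsc_khz(tsc_start: int, tsc_end: int,
            ref_start_ns: int, ref_end_ns: int) -> int:
    """Estimate the TSC rate in kHz from two paired samples."""
    delta_tsc = (tsc_end - tsc_start) & MASK64   # tolerate counter wraparound
    delta_ns = ref_end_ns - ref_start_ns
    if delta_ns <= 0:
        raise ValueError("reference clock did not advance")
    # cycles per nanosecond, scaled by 10^6, is cycles per millisecond (kHz)
    return delta_tsc * 1_000_000 // delta_ns
```

If 30,000,000 cycles elapse over a 10 ms reference interval, the estimate is
3,000,000 kHz (3 GHz). The result is only valid while the TSC rate stays
constant, so it must be repeated after every frequency change on CPUs where
the TSC scales with the P-state.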

Whether the TSC runs at a constant rate or scales with the P-state is model
dependent and must be determined by inspecting CPUID, chipset or vendor
specific MSR fields.

In addition, some vendors have known bugs where the P-state is actually
compensated for properly during normal operation, but when the processor is
inactive, the P-state may be raised temporarily to service cache misses from
other processors. In such cases, the TSC on halted CPUs could advance faster
than that of non-halted processors. AMD Turion processors are known to have
this problem.

3.6. TSC and STPCLK / T-states
------------------------------

External signals given to the processor may also have the effect of stopping
the TSC. This is typically done for thermal emergency power control to prevent
an overheating condition, and typically, there is no way to detect that this
condition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner. In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS. Special instructions must be used to read and
write the VMCS field.

3.8. TSC virtualization - SVM
-----------------------------

SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner. In
addition, SVM allows passing through the host TSC plus an additional offset
field specified in the SVM control block.

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture. Even
if so, the TSCs in multi-sockets or NUMA systems may still run independently
despite being locally consistent.

The following feature bits are used by Linux to signal various TSC attributes,
but they can only be taken to be meaningful for UP or single node systems.

=========================  =======================================
X86_FEATURE_TSC            The TSC is available in hardware
X86_FEATURE_RDTSCP         The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC   The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC    The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE   TSC sync checks are skipped (VMware)
=========================  =======================================

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise. The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines. Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption. It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time. This causes problems as the passage of real time, the injection
of machine interrupts and the associated clock sources are no longer completely
synchronized with real time.

This same problem can occur on native hardware to a degree, as SMM may steal
cycles from the operating system on X86 systems when it is used by the BIOS,
but not in such an extreme fashion. However, the fact that SMM may cause
problems similar to those of virtualization makes it a good justification for
solving many of these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts. These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind. This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem; first, it may be possible
to simply ignore it. Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time. If this is not sufficient, it may be necessary to inject
additional interrupts into the guest in order to increase the effective
interrupt rate. This approach leads to complications in extreme conditions,
where host load or guest lag is too much to compensate for, and thus another
solution to the problem has arisen: the guest may need to become aware of lost
ticks and compensate for them internally. Although promising in theory, the
implementation of this policy in Linux has been extremely error prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.
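
The second approach, injecting extra interrupts to catch up, can be sketched
as simple bookkeeping. This Python class is purely illustrative: the class
name, the ``max_burst`` cap and the nanosecond interface are assumptions made
for the example and do not describe how KVM implements tick reinjection.

```python
class TickInjector:
    """Track how many periodic ticks a guest is owed and release them in bursts."""

    def __init__(self, hz: int, max_burst: int = 10):
        self.period_ns = 1_000_000_000 // hz
        self.max_burst = max_burst        # cap so a long stall cannot flood the guest
        self.delivered_until_ns = 0       # guest time covered by ticks so far

    def ticks_due(self, now_ns: int) -> int:
        """Return the number of ticks to inject at host time now_ns."""
        owed = max(0, (now_ns - self.delivered_until_ns) // self.period_ns)
        burst = min(owed, self.max_burst)
        self.delivered_until_ns += burst * self.period_ns
        return burst
```

With a 1000 Hz guest and a cap of 5, a 10 ms stall is repaid over two calls
of 5 ticks each rather than in one burst. The cap is the point where the
"extreme conditions" mentioned above show up: if the guest is starved for
longer than the cap can repay, it stays behind.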

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time. It does use a low enough
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers. As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source. One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay. By
definition, the counter, once read, is already old. However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order. Such execution is called
non-serialized. Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR read.

Since CPUID may actually be virtualized by a trap and emulate mechanism, this
serialization can pose a performance issue for hardware virtualization. An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.
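
One common way to guard against "backwards" reads is to remember the largest
value handed out so far and never return anything smaller. The Python sketch
below shows the idea with a lock-protected global; the class name is
hypothetical, and a real implementation would use an atomic compare-and-swap
rather than a lock.

```python
import threading


class MonotonicTsc:
    """Clamp TSC samples so that successive observers never see time go back."""

    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def observe(self, sample: int) -> int:
        """Return sample, or the previous maximum if sample is older."""
        with self._lock:
            if sample > self._last:
                self._last = sample
            return self._last
```

The cost of this guard is a shared variable touched on every read, which is
exactly the kind of cross-CPU traffic a CPU-local counter was meant to avoid.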

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when using results of the TSC when measured against another time source. As
the TSC is much higher precision, many possible values of the TSC may be read
while another clock is still expressing the same value.

That is, you may read (T, T+10) while external clock C maintains the same
value. Due to non-serialized reads, you may actually end up with a range which
fluctuates - from (T-1 .. T+10). Thus, any time calculated from a TSC, but
calibrated against an external value, may have a range of valid values.
Re-calibrating this computation may actually cause time, as computed after the
calibration, to go backwards, compared with time computed before the
calibration.
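
A small worked example makes the backwards step concrete. The Python sketch
below uses arbitrary numbers and, for simplicity, assumes one TSC cycle
equals one unit of the coarse clock C; the function name is hypothetical.

```python
def tsc_time(tsc: int, base_tsc: int, base_clock: int) -> int:
    """Time derived from the TSC relative to a (base_tsc, base_clock) pair."""
    return base_clock + (tsc - base_tsc)


T, C = 1_000, 50_000

first = (T, C)         # calibrated just as the coarse clock ticked to C
second = (T + 9, C)    # recalibrated 9 cycles later; C still reads the same

before = tsc_time(T + 10, *first)    # computed with the old calibration
after = tsc_time(T + 11, *second)    # computed one cycle later, new calibration
```

Here ``before`` is C + 10 and ``after`` is C + 2: the TSC advanced by one
cycle, yet the computed time moved back by eight units because both
calibration pairs were valid readings of the same coarse clock value.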

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec - but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which, the guest time may need to be caught up. NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based off the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers. In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual. A slower clock is less of a problem, as it can
always be caught up to the original rate. KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.
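
The multiplier-and-offset conversion can be sketched in a few lines of
Python. The parameter names are modelled on the fields of the Linux pvclock
time info structure (``tsc_timestamp``, ``system_time``,
``tsc_to_system_mul``, ``tsc_shift``), but the function itself is an
illustrative assumption, not the kernel's code, and it omits the version
check a guest needs to read the structure consistently.

```python
MASK64 = (1 << 64) - 1


def pvclock_ns(tsc: int, tsc_timestamp: int, system_time_ns: int,
               mul: int, shift: int) -> int:
    """Convert a TSC sample to nanoseconds using a stored multiplier and offset."""
    delta = (tsc - tsc_timestamp) & MASK64
    # Pre-scale by a power of two so that mul can stay a 32-bit fraction.
    delta = delta << shift if shift >= 0 else delta >> -shift
    # mul is a 32.32 fixed-point factor: nanoseconds per (scaled) cycle.
    return system_time_ns + ((delta * mul) >> 32)
```

For a 2 GHz TSC one cycle is 0.5 ns, so ``mul`` is 2^31 with a shift of 0.
After migration to a 4 GHz host the hypervisor only needs to publish a shift
of -1 (or a halved multiplier) together with a fresh timestamp pair, and the
guest keeps computing nanoseconds with the same formula.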

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization. In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case. The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time. Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system. This
can be a problem if the system is controlling physical hardware, or issues
delays to compensate for slower I/O to and from devices. The first issue is
not solvable in general for a virtualized system; hardware control software
can't be adequately virtualized without a full real-time operating system,
which would require an RT aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue. In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time. This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel. Preventing such
problems would require completely isolated virtual time which may not track
real time any longer. This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.