Clock sources, Clock events, sched_clock() and delay timers
-----------------------------------------------------------

This document tries to briefly explain some basic kernel timekeeping
abstractions. It partly pertains to the drivers usually found in
drivers/clocksource in the kernel tree, but the code may be spread out
across the kernel.

If you grep through the kernel source you will find a number of architecture-
specific implementations of clock sources, clockevents and several likewise
architecture-specific overrides of the sched_clock() function and some
delay timers.

To provide timekeeping for your platform, the clock source provides
the basic timeline, whereas clock events shoot interrupts on certain points
on this timeline, providing facilities such as high-resolution timers.
sched_clock() is used for scheduling and timestamping, and delay timers
provide an accurate delay source using hardware counters.


Clock sources
-------------

The purpose of the clock source is to provide a timeline for the system that
tells you where you are in time. For example issuing the command 'date' on
a Linux system will eventually read the clock source to determine exactly
what time it is.

Typically the clock source is a monotonic, atomic counter which will provide
n bits which count from 0 to (2^n)-1 and then wrap around to 0 and start over.
It will ideally NEVER stop ticking as long as the system is running. It
may stop during system suspend.

The clock source shall have as high resolution as possible, and the frequency
shall be as stable and correct as possible as compared to a real-world wall
clock. It should not move unpredictably back and forth in time or miss a few
cycles here and there.

It must be immune to the kind of effects that occur in hardware where, e.g.,
the counter register is read in two phases on the bus: the lowest 16 bits
first and the higher 16 bits in a second bus cycle. The counter bits may be
updated between the two reads, leading to the risk of very strange values
from the counter.
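
One common way for a driver to guard against such torn reads is to read the
high half, then the low half, then the high half again, and retry if the two
high reads differ. The following is only a user-space sketch of that idea:
the struct, the function names and the simulated "hardware" (which advances
on every access) are all hypothetical and not taken from any kernel driver.

```c
#include <stdint.h>

/* Hypothetical 32-bit counter that is only readable as two 16-bit halves. */
struct split_counter {
	uint16_t (*read_lo)(void);
	uint16_t (*read_hi)(void);
};

/* Simulated hardware, for illustration: the counter advances on every access. */
static uint32_t sim_count;
static uint32_t sim_step;

static uint16_t sim_read_lo(void)
{
	uint16_t v = (uint16_t)(sim_count & 0xffff);

	sim_count += sim_step;
	return v;
}

static uint16_t sim_read_hi(void)
{
	uint16_t v = (uint16_t)(sim_count >> 16);

	sim_count += sim_step;
	return v;
}

/* Read high, low, high; retry if a carry into the high half slipped in. */
static uint32_t split_counter_read(const struct split_counter *c)
{
	uint16_t hi, lo, hi2;

	do {
		hi = c->read_hi();
		lo = c->read_lo();
		hi2 = c->read_hi();
	} while (hi != hi2);

	return ((uint32_t)hi << 16) | lo;
}
```

With the simulated counter starting at 0x0001ffff, a naive high-then-low read
would combine the old high half with the new low half and report 0x00010000,
far in the past. The retry loop detects the changed high half and reads again.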

When the wall-clock accuracy of the clock source isn't satisfactory, there
are various quirks and layers in the timekeeping code for e.g. synchronizing
the user-visible time to RTC clocks in the system or against networked time
servers using NTP, but all they do basically is update an offset against
the clock source, which provides the fundamental timeline for the system.
These measures do not affect the clock source per se; they only adapt the
system to its shortcomings.

The clock source struct shall provide means to translate the provided counter
into a nanosecond value as an unsigned long long (unsigned 64 bit) number.
Since this operation may be invoked very often, doing this in a strict
mathematical sense is not desirable: instead the number is taken as close as
possible to a nanosecond value using only the arithmetic operations
multiply and shift, so in clocksource_cyc2ns() you find:

  ns ~= (clocksource * mult) >> shift

You will find a number of helper functions in the clock source code intended
to aid in providing these mult and shift values, such as
clocksource_khz2mult() and clocksource_hz2mult(), which help determine the
mult factor from a fixed shift, and clocksource_register_hz() and
clocksource_register_khz(), which help assign both shift and mult
factors using the frequency of the clock source as the only input.
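
To make the multiply-and-shift arithmetic concrete, here is a small
user-space sketch. It is loosely modelled on what a "hz to mult" helper has
to compute, but the function names sketch_hz2mult() and sketch_cyc2ns() are
made up for this example and the code is not the kernel implementation.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Pick mult for a given counter frequency and a fixed shift so that
 * (cycles * mult) >> shift approximates nanoseconds.
 */
static uint32_t sketch_hz2mult(uint32_t hz, uint32_t shift)
{
	uint64_t tmp = NSEC_PER_SEC << shift;

	tmp += hz / 2;		/* round to nearest instead of truncating */
	return (uint32_t)(tmp / hz);
}

/* Convert a cycle count to (approximately) nanoseconds. */
static uint64_t sketch_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

For a 1 MHz counter and shift = 10 this yields mult = 1024000, so 1000 cycles
convert to 1000000 ns. A larger shift gives a more precise mult, but the
intermediate product cycles * mult overflows 64 bits sooner, which is the
trade-off the real helpers have to balance.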

For really simple clock sources accessed from a single I/O memory location
there is nowadays even clocksource_mmio_init() which will take a memory
location, bit width, a parameter telling whether the counter in the
register counts up or down, and the timer clock rate, and then conjure all
necessary parameters.

Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43
seconds, the code handling the clock source will have to compensate for this.
That is the reason why the clock source struct also contains a 'mask'
member telling how many bits of the source are valid. This way the timekeeping
code knows when the counter will wrap around and can insert the necessary
compensation code on both sides of the wrap point so that the system timeline
remains monotonic.
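
The core of the wrap compensation can be sketched as a masked subtraction:
as long as the counter is sampled at least once per wrap period, the
difference between two readings, truncated to the valid bits, is the number
of elapsed cycles even across the wrap point. The helper names below are
hypothetical and the code is a user-space illustration only.

```c
#include <stdint.h>

/* Mask with the lowest 'bits' bits set, for an n-bit counter (1..64). */
static uint64_t sketch_mask(unsigned int bits)
{
	return bits < 64 ? (1ULL << bits) - 1 : ~0ULL;
}

/* Cycles elapsed between two readings, correct across one wrap. */
static uint64_t sketch_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}
```

For a 32-bit counter, a previous reading of 0xfffffff0 and a current reading
of 0x00000010 give a delta of 0x20 cycles, not a huge negative step.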


Clock events
------------

Clock events are the conceptual reverse of clock sources: they take a
desired time specification value and calculate the values to poke into
hardware timer registers.

Clock events are orthogonal to clock sources. The same hardware
and register range may be used for the clock event, but it is essentially
a different thing. The hardware driving clock events has to be able to
fire interrupts, so as to trigger events on the system timeline. On an SMP
system, it is ideal (and customary) to have one such event-driving timer per
CPU core, so that each core can trigger events independently of any other
core.

You will notice that the clock event device code is based on the same basic
idea about translating counters to nanoseconds using mult and shift
arithmetic, and you find the same family of helper functions again for
assigning these values. The clock event driver does not need a 'mask'
attribute however: the system will not try to plan events beyond the time
horizon of the clock event.
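
The reverse conversion can be sketched in the same style. Here mult and
shift translate nanoseconds into timer cycles, and the requested delta is
clamped to what the device can program, which is why no mask is needed. The
struct layout, field names and functions are assumptions made for this
example and do not mirror the kernel's clock event device exactly.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, simplified clock event device description. */
struct sketch_clockevent {
	uint32_t mult;		/* ns -> cycles multiplier */
	uint32_t shift;
	uint64_t min_delta_ns;	/* shortest programmable interval */
	uint64_t max_delta_ns;	/* time horizon of the device */
};

/* Pick mult for a timer frequency and a fixed shift (ns -> cycles). */
static uint32_t sketch_event_mult(uint32_t hz, uint32_t shift)
{
	return (uint32_t)(((uint64_t)hz << shift) / NSEC_PER_SEC);
}

/* Cycles to poke into the timer for an event delta_ns from now. */
static uint64_t sketch_ns2cyc(const struct sketch_clockevent *dev,
			      uint64_t delta_ns)
{
	if (delta_ns < dev->min_delta_ns)
		delta_ns = dev->min_delta_ns;
	if (delta_ns > dev->max_delta_ns)
		delta_ns = dev->max_delta_ns;

	return (delta_ns * dev->mult) >> dev->shift;
}
```

For a 500 MHz timer and shift = 8, mult is 128, so a request for 1000 ns
becomes 500 cycles. A request beyond max_delta_ns is simply shortened to the
horizon; the caller is expected to reprogram the device when it fires.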


sched_clock()
-------------

In addition to the clock sources and clock events there is a special weak
function in the kernel called sched_clock(). This function shall return the
number of nanoseconds since the system was started. An architecture may or
may not provide an implementation of sched_clock() on its own. If a local
implementation is not provided, the system jiffy counter will be used as
sched_clock().

As the name suggests, sched_clock() is used for scheduling the system,
determining the absolute timeslice for a certain process in the CFS scheduler
for example. It is also used for printk timestamps when you have selected to
include time information in printk for things like bootcharts.

Compared to clock sources, sched_clock() has to be very fast: it is called
much more often, especially by the scheduler. If you have to trade accuracy
against speed, you may sacrifice accuracy for speed in sched_clock(), which
you should not do for the clock source. It does, however, require some of the
same basic characteristics as the clock source, i.e. it should be monotonic.

The sched_clock() function may wrap only on unsigned long long boundaries,
i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
after circa 585 years. (For most practical systems this means "never".)
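
Both wrap figures quoted in this document (about 43 seconds for a 32-bit
counter at 100 MHz, about 585 years for a 64-bit nanosecond value) follow
from the same division of 2^bits by the tick rate. A throwaway user-space
helper, with a made-up name, makes the arithmetic easy to redo for other
counters:

```c
#include <stdint.h>

/* Seconds until a counter of 'bits' width (1..64) ticking at hz wraps. */
static double sketch_wrap_seconds(unsigned int bits, double hz)
{
	/* 2^bits computed as 2^(bits-1) * 2 so that bits == 64 works too. */
	return (double)(1ULL << (bits - 1)) * 2.0 / hz;
}
```

sketch_wrap_seconds(32, 100e6) is roughly 42.9 seconds, and
sketch_wrap_seconds(64, 1e9) divided by the number of seconds in a year
comes out between 584 and 585 years.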

If an architecture does not provide its own implementation of this function,
it will fall back to using jiffies, making its maximum resolution 1/HZ
seconds, i.e. one jiffy, for the architecture. This will affect scheduling
accuracy and will likely show up in system benchmarks.

The clock driving sched_clock() may stop or reset to zero during system
suspend/sleep. This does not matter to the function it serves of scheduling
events on the system. However it may result in interesting timestamps in
printk().

The sched_clock() function should be callable in any context, be IRQ- and
NMI-safe, and return a sane value in each of them.

Some architectures may have a limited set of time sources and lack a nice
counter to derive a 64-bit nanosecond value, so for example on the ARM
architecture, special helper functions have been created to provide a
sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the
same counter that is also used as clock source is used for this purpose.
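
The idea behind such helpers can be sketched as follows: keep an "epoch"
pair of (nanoseconds, raw counter value), refresh it at least once per wrap
period of the narrow counter, and derive the current time as the epoch plus
the masked cycle delta converted with mult and shift. This is a simplified
user-space model with hypothetical names; it leaves out the locking that a
real implementation needs so that readers never see a half-updated epoch.

```c
#include <stdint.h>

/* Hypothetical state for extending a narrow counter to 64-bit ns. */
struct sketch_sched_clock {
	uint64_t epoch_ns;	/* nanoseconds accumulated at last update */
	uint64_t epoch_cyc;	/* raw counter value at last update */
	uint64_t mask;		/* valid bits of the raw counter */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;
};

/* Nanoseconds since start, given the current raw counter value. */
static uint64_t sketch_sched_clock_read(const struct sketch_sched_clock *sc,
					uint64_t cyc)
{
	uint64_t delta = (cyc - sc->epoch_cyc) & sc->mask;

	return sc->epoch_ns + ((delta * sc->mult) >> sc->shift);
}

/* Must run at least once per wrap period of the raw counter. */
static void sketch_sched_clock_update(struct sketch_sched_clock *sc,
				      uint64_t cyc)
{
	sc->epoch_ns = sketch_sched_clock_read(sc, cyc);
	sc->epoch_cyc = cyc & sc->mask;
}
```

With a 16-bit counter at 1 MHz (mult = 1000, shift = 0), updating the epoch
at raw value 0xf000 and then reading at raw value 0x1000, i.e. after the
counter has wrapped, still yields a larger nanosecond value than before.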

On SMP systems, it is crucial for performance that sched_clock() can be called
independently on each CPU without any synchronization performance hits.
Some hardware (such as the x86 TSC) will cause the sched_clock() function to
drift between the CPUs on the system. The kernel can work around this by
enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
that makes sched_clock() different from the ordinary clock source.


Delay timers (some architectures only)
--------------------------------------

On systems with variable CPU frequency, the various kernel delay() functions
will sometimes behave strangely. Basically these delays usually use a hard
loop to delay a certain number of jiffy fractions using a "lpj" (loops per
jiffy) value, calibrated on boot.

Let's hope that your system is running at maximum frequency when this value
is calibrated: as a consequence, when the frequency is geared down to half
the full frequency, any delay() will be twice as long. Usually this does not
hurt, as you're commonly requesting that amount of delay *or more*. But
basically the semantics are quite unpredictable on such systems.
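
A back-of-the-envelope model shows where the factor of two comes from. It
assumes that the number of loop iterations per second scales linearly with
the CPU frequency, which is a simplification, and both function names are
invented for this example.

```c
#include <stdint.h>

/* Loop iterations a busy-wait spins for a delay of 'usecs'. */
static uint64_t sketch_delay_loops(uint64_t lpj, unsigned int hz,
				   uint64_t usecs)
{
	return lpj * hz * usecs / 1000000ULL;
}

/*
 * Real duration in microseconds of a loop-based delay when lpj was
 * calibrated at cal_khz but the CPU currently runs at cur_khz.
 */
static uint64_t sketch_actual_usecs(uint64_t usecs, uint32_t cal_khz,
				    uint32_t cur_khz)
{
	return usecs * cal_khz / cur_khz;
}
```

With lpj calibrated at 1 GHz, a 100 microsecond delay requested while the
CPU runs at 500 MHz takes about 200 microseconds under this model.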

Enter timer-based delays. Using these, a timer read may be used instead of
a hard-coded loop for providing the desired delay.

This is done by declaring a struct delay_timer and assigning the appropriate
function pointers and rate settings for this delay timer.

This is available on some architectures like OpenRISC or ARM.
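
The principle can be sketched in user space as below. The struct is modelled
on the description above (a read function pointer plus a rate), but its
exact layout, the function names and the simulated timer, which advances by
7 ticks per read, are assumptions made for this example rather than the
definition used by any particular architecture.

```c
#include <stdint.h>

/* Hypothetical delay timer: a free-running counter and its rate in Hz. */
struct sketch_delay_timer {
	unsigned long (*read_current_timer)(void);
	unsigned long freq;
};

/* Simulated free-running timer, for illustration only. */
static unsigned long sim_timer_now;

static unsigned long sim_read_timer(void)
{
	return sim_timer_now += 7;
}

/*
 * Spin until at least 'usecs' worth of timer ticks have elapsed.
 * Returns the number of ticks actually waited.
 */
static unsigned long sketch_timer_udelay(const struct sketch_delay_timer *t,
					 unsigned long usecs)
{
	unsigned long ticks =
		(unsigned long)(((uint64_t)usecs * t->freq) / 1000000u);
	unsigned long start = t->read_current_timer();
	unsigned long now;

	do {
		now = t->read_current_timer();
	} while (now - start < ticks);	/* unsigned math tolerates wrap */

	return now - start;
}
```

Because the loop compares timer readings rather than counting iterations,
the delay no longer depends on the CPU frequency, only on the timer rate.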
179	This is available on some architectures like OpenRISC or ARM.