Commit | Line | Data |
---|---|---|
0c87f9b5 PM |
1 | NO_HZ: Reducing Scheduling-Clock Ticks |
2 | ||
3 | ||
4 | This document describes Kconfig options and boot parameters that can | |
5 | reduce the number of scheduling-clock interrupts, thereby improving energy | |
6 | efficiency and reducing OS jitter. Reducing OS jitter is important for | |
7 | some types of computationally intensive high-performance computing (HPC) | |
8 | applications and for real-time applications. | |
9 | ||
10 | There are two main contexts in which the number of scheduling-clock | |
11 | interrupts can be reduced compared to the old-school approach of sending | |
12 | a scheduling-clock interrupt to all CPUs every jiffy whether they need | |
13 | it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels): | |
14 | ||
15 | 1. Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels). | |
16 | ||
17 | 2. CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y). | |
18 | ||
19 | These two cases are described in the following two sections, followed | |
20 | by a third section on RCU-specific considerations and a fourth and final | |
21 | section listing known issues. | |
22 | ||
23 | ||
24 | IDLE CPUs | |
25 | ||
26 | If a CPU is idle, there is little point in sending it a scheduling-clock | |
27 | interrupt. After all, the primary purpose of a scheduling-clock interrupt | |
28 | is to force a busy CPU to shift its attention among multiple duties, | |
29 | and an idle CPU has no duties to shift its attention among. | |
30 | ||
31 | The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending | |
32 | scheduling-clock interrupts to idle CPUs, which is critically important | |
33 | both to battery-powered devices and to highly virtualized mainframes. | |
34 | A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would | |
35 | drain its battery very quickly, easily 2-3 times as fast as would the | |
36 | same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running | |
37 | 1,500 OS instances might find that half of its CPU time was consumed by | |
38 | unnecessary scheduling-clock interrupts. In these situations, there | |
39 | is strong motivation to avoid sending scheduling-clock interrupts to | |
40 | idle CPUs. That said, dyntick-idle mode is not free: | |
41 | ||
42 | 1. It increases the number of instructions executed on the path | |
43 | to and from the idle loop. | |
44 | ||
45 | 2. On many architectures, dyntick-idle mode also increases the | |
46 | number of expensive clock-reprogramming operations. | |
47 | ||
48 | Therefore, systems with aggressive real-time response constraints often | |
49 | run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels) | |
50 | in order to avoid degrading from-idle transition latencies. | |
51 | ||
52 | An idle CPU that is not receiving scheduling-clock interrupts is said to | |
53 | be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running | |
54 | tickless". The remainder of this document will use "dyntick-idle mode". | |
55 | ||
56 | There is also a boot parameter "nohz=" that can be used to disable | |
57 | dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off". | |
58 | By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling | |
59 | dyntick-idle mode. | |
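As a rough illustration of the "nohz=" behavior described above, the following Python sketch decides whether dyntick-idle mode would be requested for a given kernel command line. The function name is hypothetical, and the rule that a later "nohz=" token overrides an earlier one is an assumption made for this sketch, not something stated in this document.

```python
def dyntick_idle_requested(cmdline):
    """Return False only if the command line asks for "nohz=off".

    Assumes a CONFIG_NO_HZ_IDLE=y kernel, for which the default is
    "nohz=on".  Assumption: if "nohz=" appears more than once, the
    last occurrence wins.
    """
    enabled = True  # default for CONFIG_NO_HZ_IDLE=y kernels
    for token in cmdline.split():
        if token == "nohz=off":
            enabled = False
        elif token == "nohz=on":
            enabled = True
    return enabled
```

On a running Linux system, the string to pass in could be read from /proc/cmdline.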
60 | ||
61 | ||
62 | CPUs WITH ONLY ONE RUNNABLE TASK | |
63 | ||
64 | If a CPU has only one runnable task, there is little point in sending it | |
65 | a scheduling-clock interrupt because there is no other task to switch to. | |
66 | ||
67 | The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid | |
68 | sending scheduling-clock interrupts to CPUs with a single runnable task, | |
69 | and such CPUs are said to be "adaptive-ticks CPUs". This is important | |
70 | for applications with aggressive real-time response constraints because | |
71 | it allows them to improve their worst-case response times by the maximum | |
72 | duration of a scheduling-clock interrupt. It is also important for | |
73 | computationally intensive short-iteration workloads: If any CPU is | |
74 | delayed during a given iteration, all the other CPUs will be forced to | |
75 | wait idle while the delayed CPU finishes. Thus, the delay is multiplied | |
76 | by one less than the number of CPUs. In these situations, there is | |
77 | again strong motivation to avoid sending scheduling-clock interrupts. | |
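The multiplication of delay described above can be made concrete with a little arithmetic. This is only a sketch of the reasoning: the function name and the sample numbers are illustrative and are not measurements.

```python
def aggregate_wait_us(delay_us, ncpus):
    """CPU time spent waiting when one CPU is delayed by delay_us.

    Models a short-iteration workload in which every other CPU must
    wait idle until the delayed CPU finishes the iteration, so the
    delay is multiplied by (ncpus - 1).
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return delay_us * (ncpus - 1)
```

For example, a hypothetical 10-microsecond delay on one CPU of a 64-CPU job gives aggregate_wait_us(10, 64), or 630 microseconds of CPU time spent waiting in that iteration.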
78 | ||
79 | By default, no CPU will be an adaptive-ticks CPU. The "nohz_full=" | |
80 | boot parameter specifies the adaptive-ticks CPUs. For example, | |
81 | "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks | |
82 | CPUs. Note that you are prohibited from marking all of the CPUs as | |
83 | adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain | |
84 | online to handle timekeeping tasks in order to ensure that system calls | |
85 | like gettimeofday() return accurate values on adaptive-tick CPUs. | |
86 | (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no | |
87 | running user processes to observe slight drifts in clock rate.) | |
88 | Therefore, the boot CPU is prohibited from entering adaptive-ticks | |
89 | mode. Specifying a "nohz_full=" mask that includes the boot CPU will | |
90 | result in a boot-time error message, and the boot CPU will be removed | |
91 | from the mask. | |
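The handling of the "nohz_full=" list can be sketched in Python as follows. This is a simplified model, not the kernel's parser: it accepts only single CPUs and "a-b" ranges, the function names are hypothetical, and treating CPU 0 as the boot CPU is an assumption.

```python
def parse_cpu_list(spec):
    """Parse a list such as "1,6-8" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError("bad CPU range: " + part)
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return cpus


def adaptive_tick_cpus(spec, boot_cpu=0):
    """Return the adaptive-ticks CPUs, with the boot CPU removed.

    Assumption: the boot CPU is CPU 0 unless the caller says otherwise.
    """
    requested = parse_cpu_list(spec)
    if boot_cpu in requested:
        print("warning: boot CPU %d removed from nohz_full mask" % boot_cpu)
    return requested - {boot_cpu}
```

With this sketch, parse_cpu_list("1,6-8") yields CPUs 1, 6, 7, and 8, matching the example in the text, and adaptive_tick_cpus("0-3") drops CPU 0 and keeps CPUs 1, 2, and 3.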
92 | ||
93 | Alternatively, the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter specifies | |
94 | that all CPUs other than the boot CPU are adaptive-ticks CPUs. This | |
95 | Kconfig parameter will be overridden by the "nohz_full=" boot parameter, | |
96 | so that if both the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter and | |
97 | the "nohz_full=1" boot parameter are specified, the boot parameter will | |
98 | prevail so that only CPU 1 will be an adaptive-ticks CPU. | |
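The precedence between the Kconfig parameter and the boot parameter can be summarized by the following sketch. The function and argument names are hypothetical, and the CPU sets are assumed to have been parsed already.

```python
def effective_adaptive_cpus(possible_cpus, boot_cpu, full_all, nohz_full=None):
    """Model which CPUs end up as adaptive-ticks CPUs.

    possible_cpus: iterable of CPU numbers present in the system
    boot_cpu:      the CPU that must keep its tick
    full_all:      True if built with CONFIG_NO_HZ_FULL_ALL=y
    nohz_full:     CPUs named by "nohz_full=", or None if not given
    """
    if nohz_full is not None:
        chosen = set(nohz_full)      # the boot parameter prevails
    elif full_all:
        chosen = set(possible_cpus)  # Kconfig default: everything...
    else:
        chosen = set()               # default: no adaptive-ticks CPUs
    return chosen - {boot_cpu}       # ...except the boot CPU
```

For an 8-CPU system built with CONFIG_NO_HZ_FULL_ALL=y and booted with "nohz_full=1", the call effective_adaptive_cpus(range(8), 0, True, {1}) returns only CPU 1.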
99 | ||
100 | Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded. | |
101 | This is covered in the "RCU IMPLICATIONS" section below. | |
102 | ||
103 | Normally, a CPU remains in adaptive-ticks mode as long as possible. | |
104 | In particular, transitioning to kernel mode does not automatically change | |
105 | the mode. Instead, the CPU will exit adaptive-ticks mode only if needed, | |
106 | for example, if that CPU enqueues an RCU callback. | |
107 | ||
108 | Just as with dyntick-idle mode, the benefits of adaptive-tick mode do | |
109 | not come for free: | |
110 | ||
111 | 1. CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run | |
112 | adaptive ticks without also running dyntick idle. This dependency | |
113 | extends down into the implementation, so that all of the costs | |
114 | of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL. | |
115 | ||
116 | 2. The user/kernel transitions are slightly more expensive due | |
117 | to the need to inform kernel subsystems (such as RCU) about | |
118 | the change in mode. | |
119 | ||
120 | 3. POSIX CPU timers on adaptive-tick CPUs may miss their deadlines | |
121 | (perhaps indefinitely) because they currently rely on | |
122 | scheduling-tick interrupts. This will likely be fixed in | |
123 | one of two ways: (1) Prevent CPUs with POSIX CPU timers from | |
124 | entering adaptive-tick mode, or (2) Use hrtimers or some other | |
125 | adaptive-ticks-immune mechanism to cause the POSIX CPU timer to | |
126 | fire properly. | |
127 | ||
128 | 4. If there are more perf events pending than the hardware can | |
129 | accommodate, they are normally round-robined so as to collect | |
130 | all of them over time. Adaptive-tick mode may prevent this | |
131 | round-robining from happening. This will likely be fixed by | |
132 | preventing CPUs with large numbers of perf events pending from | |
133 | entering adaptive-tick mode. | |
134 | ||
135 | 5. Scheduler statistics for adaptive-tick CPUs may be computed | |
136 | slightly differently than those for non-adaptive-tick CPUs. | |
137 | This might in turn perturb load-balancing of real-time tasks. | |
138 | ||
139 | 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. | |
140 | ||
141 | Although improvements are expected over time, adaptive ticks is quite | |
142 | useful for many types of real-time and compute-intensive applications. | |
143 | However, the drawbacks listed above mean that adaptive ticks should not | |
144 | (yet) be enabled by default. | |
145 | ||
146 | ||
147 | RCU IMPLICATIONS | |
148 | ||
149 | There are situations in which idle CPUs cannot be permitted to | |
150 | enter either dyntick-idle mode or adaptive-tick mode, the most | |
151 | common being when that CPU has RCU callbacks pending. | |
152 | ||
153 | The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs | |
154 | to enter dyntick-idle mode or adaptive-tick mode anyway. In this case, | |
155 | a timer will awaken these CPUs every four jiffies in order to ensure | |
156 | that the RCU callbacks are processed in a timely fashion. | |
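The wall-clock length of "four jiffies" depends on the kernel's HZ setting. The following sketch does the conversion. The function name is hypothetical and the HZ values in the usage note are only common examples.

```python
def fast_no_hz_wakeup_ms(hz, jiffies=4):
    """Convert the CONFIG_RCU_FAST_NO_HZ wakeup period to milliseconds.

    hz is the kernel's scheduling-clock frequency (CONFIG_HZ), and the
    default of four jiffies is the period given in the text above.
    """
    if hz <= 0:
        raise ValueError("hz must be positive")
    return jiffies * 1000.0 / hz
```

For example, with HZ=1000 the CPU is awakened every 4 milliseconds, with HZ=250 every 16 milliseconds, and with HZ=100 every 40 milliseconds.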
157 | ||
158 | Another approach is to offload RCU callback processing to "rcuo" kthreads | |
159 | using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to | |
160 | offload may be selected via several methods: | |
161 | ||
162 | 1. One of three mutually exclusive Kconfig options specifies a | |
163 | build-time default for the CPUs to offload: | |
164 | ||
165 | a. The CONFIG_RCU_NOCB_CPU_NONE=y Kconfig option results in | |
166 | no CPUs being offloaded. | |
167 | ||
168 | b. The CONFIG_RCU_NOCB_CPU_ZERO=y Kconfig option causes | |
169 | CPU 0 to be offloaded. | |
170 | ||
171 | c. The CONFIG_RCU_NOCB_CPU_ALL=y Kconfig option causes all | |
172 | CPUs to be offloaded. Note that the callbacks will be | |
173 | offloaded to "rcuo" kthreads, and that those kthreads | |
174 | will in fact run on some CPU. However, this approach | |
175 | gives fine-grained control over exactly which CPUs the | |
176 | callbacks run on, along with their scheduling priority | |
177 | (including the default of SCHED_OTHER), and it further | |
178 | allows this control to be varied dynamically at runtime. | |
179 | ||
180 | 2. The "rcu_nocbs=" kernel boot parameter takes a comma-separated | |
181 | list of CPUs and CPU ranges. For example, "1,3-5" selects CPUs | |
182 | 1, 3, 4, and 5. The specified CPUs will be offloaded in addition | |
183 | to any CPUs specified as offloaded by CONFIG_RCU_NOCB_CPU_ZERO=y | |
184 | or CONFIG_RCU_NOCB_CPU_ALL=y. This means that the "rcu_nocbs=" | |
185 | boot parameter has no effect with CONFIG_RCU_NOCB_CPU_ALL=y. | |
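The way the build-time default and the "rcu_nocbs=" boot parameter combine can be modeled as a set union. In this sketch the function name and the "NONE"/"ZERO"/"ALL" strings are hypothetical stand-ins for the three Kconfig options, and clipping the result to the CPUs actually present is an assumption.

```python
def offloaded_cpus(possible_cpus, kconfig_default, rcu_nocbs=()):
    """Model which CPUs have their RCU callbacks offloaded.

    kconfig_default: "NONE", "ZERO", or "ALL", standing in for
                     CONFIG_RCU_NOCB_CPU_{NONE,ZERO,ALL}=y
    rcu_nocbs:       CPUs named by the "rcu_nocbs=" boot parameter
    """
    possible = set(possible_cpus)
    if kconfig_default == "ALL":
        base = set(possible)
    elif kconfig_default == "ZERO":
        base = {0}
    elif kconfig_default == "NONE":
        base = set()
    else:
        raise ValueError("unknown default: " + kconfig_default)
    # The boot parameter adds to the build-time default.
    # Assumption: CPUs that do not exist are ignored.
    return (base | set(rcu_nocbs)) & possible
```

With the "ZERO" default and "rcu_nocbs=1,3-5" on an 8-CPU system, this gives CPUs 0, 1, 3, 4, and 5. With the "ALL" default, the boot parameter changes nothing, as noted above.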
186 | ||
187 | The offloaded CPUs will never queue RCU callbacks, and therefore RCU | |
188 | never prevents offloaded CPUs from entering either dyntick-idle mode | |
189 | or adaptive-tick mode. That said, note that it is up to userspace to | |
190 | pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the | |
191 | scheduler will decide where to run them, which might or might not be | |
192 | where you want them to run. | |
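One possible way for userspace to pin the "rcuo" kthreads is sketched below. It assumes that the kthread names reported in /proc/<pid>/comm begin with "rcuo", that the script runs with enough privilege to change kthread affinity, and that the caller chooses the housekeeping CPUs. The function names are hypothetical.

```python
import os


def find_rcuo_kthreads(proc_root="/proc"):
    """Return sorted (pid, name) pairs for tasks whose name starts with "rcuo"."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if name.startswith("rcuo"):
            found.append((int(entry), name))
    return sorted(found)


def pin_rcuo_kthreads(housekeeping_cpus, proc_root="/proc"):
    """Restrict every "rcuo" kthread to the given set of CPUs."""
    pinned = []
    for pid, _name in find_rcuo_kthreads(proc_root):
        os.sched_setaffinity(pid, housekeeping_cpus)
        pinned.append(pid)
    return pinned
```

For example, pin_rcuo_kthreads({0}) would confine the callback processing to CPU 0, leaving the remaining CPUs free of it.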
193 | ||
194 | ||
195 | KNOWN ISSUES | |
196 | ||
197 | o Dyntick-idle slows transitions to and from idle slightly. | |
198 | In practice, this has not been a problem except for the most | |
199 | aggressive real-time workloads, which have the option of disabling | |
200 | dyntick-idle mode, an option that most of them take. However, | |
201 | some workloads will no doubt want to use adaptive ticks to | |
202 | eliminate scheduling-clock interrupt latencies. Here are some | |
203 | options for these workloads: | |
204 | ||
205 | a. Use PMQOS from userspace to inform the kernel of your | |
206 | latency requirements (preferred). | |
207 | ||
208 | b. On x86 systems, use the "idle=mwait" boot parameter. | |
209 | ||
210 | c. On x86 systems, use the "intel_idle.max_cstate=" boot | |
211 | parameter to limit the maximum C-state depth. | |
212 | ||
213 | d. On x86 systems, use the "idle=poll" boot parameter. | |
214 | However, please note that use of this parameter can cause | |
215 | your CPU to overheat, which may cause thermal throttling | |
216 | to degrade your latencies -- and that this degradation can | |
217 | be even worse than that of dyntick-idle. Furthermore, | |
218 | this parameter effectively disables Turbo Mode on Intel | |
219 | CPUs, which can significantly reduce maximum performance. | |
220 | ||
221 | o Adaptive-ticks slows user/kernel transitions slightly. | |
222 | This is not expected to be a problem for computationally intensive | |
223 | workloads, which have few such transitions. Careful benchmarking | |
224 | will be required to determine whether or not other workloads | |
225 | are significantly affected by this slowdown. | |
226 | ||
227 | o Adaptive-ticks does not do anything unless there is only one | |
228 | runnable task for a given CPU, even though there are a number | |
229 | of other situations where the scheduling-clock tick is not | |
230 | needed. To give but one example, consider a CPU that has one | |
231 | runnable high-priority SCHED_FIFO task and an arbitrary number | |
232 | of low-priority SCHED_OTHER tasks. In this case, the CPU is | |
233 | required to run the SCHED_FIFO task until it either blocks or | |
234 | some other higher-priority task awakens on (or is assigned to) | |
235 | this CPU, so there is no point in sending a scheduling-clock | |
236 | interrupt to this CPU. However, the current implementation | |
237 | nevertheless sends scheduling-clock interrupts to CPUs having a | |
238 | single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER | |
239 | tasks, even though these interrupts are unnecessary. | |
240 | ||
241 | Better handling of these sorts of situations is future work. | |
242 | ||
243 | o A reboot is required to reconfigure both adaptive ticks and RCU | |
244 | callback offloading. Runtime reconfiguration could be provided | |
245 | if needed. However, due to the complexity of reconfiguring RCU at | |
246 | runtime, there would need to be an earthshakingly good reason, | |
247 | especially given that you have the straightforward option of | |
248 | simply offloading RCU callbacks from all CPUs and pinning them | |
249 | where you want them whenever you want them pinned. | |
250 | ||
251 | o Additional configuration is required to deal with other sources | |
252 | of OS jitter, including interrupts and system-utility tasks | |
253 | and processes. This configuration normally involves binding | |
254 | interrupts and tasks to particular CPUs. | |
255 | ||
256 | o Some sources of OS jitter can currently be eliminated only by | |
257 | constraining the workload. For example, the only way to eliminate | |
258 | OS jitter due to global TLB shootdowns is to avoid the unmapping | |
259 | operations (such as kernel module unload operations) that | |
260 | result in these shootdowns. For another example, page faults | |
261 | and TLB misses can be reduced (and in some cases eliminated) by | |
262 | using huge pages and by constraining the amount of memory used | |
263 | by the application. Pre-faulting the working set can also be | |
264 | helpful, especially when combined with the mlock() and mlockall() | |
265 | system calls. | |
266 | ||
267 | o Unless all CPUs are idle, at least one CPU must keep the | |
268 | scheduling-clock interrupt going in order to support accurate | |
269 | timekeeping. | |
270 | ||
271 | o If there are adaptive-ticks CPUs, there will be at least one | |
272 | CPU keeping the scheduling-clock interrupt going, even if all | |
273 | CPUs are otherwise idle. |
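The advice in the list above about pre-faulting the working set and using mlockall() might look like the following in a Python application. This is a sketch under several assumptions: the MCL_CURRENT and MCL_FUTURE values are the ones used by Linux on x86, the process has permission to lock memory, and the buffer is freshly allocated (pre-faulting here overwrites one byte per page). The function names are hypothetical.

```python
import ctypes
import ctypes.util
import mmap
import os

# Assumption: flag values from the Linux x86 headers; other
# architectures may use different values.
MCL_CURRENT = 1
MCL_FUTURE = 2


def prefault(buf):
    """Write one byte to each page of buf so that every page is faulted in now.

    Returns the number of pages touched.  Intended for a freshly
    allocated buffer, because the touched bytes are set to zero.
    """
    touched = 0
    for offset in range(0, len(buf), mmap.PAGESIZE):
        buf[offset:offset + 1] = b"\0"
        touched += 1
    return touched


def lock_all_memory():
    """Ask the kernel to keep current and future pages resident."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A latency-sensitive application might call lock_all_memory() once at startup, then allocate its working buffers (for example with mmap.mmap(-1, size)) and pass each one to prefault() before entering its time-critical loop.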