Commit | Line | Data |
---|---|---|
0c87f9b5 PM |
1 | NO_HZ: Reducing Scheduling-Clock Ticks |
2 | ||
3 | ||
4 | This document describes Kconfig options and boot parameters that can | |
5 | reduce the number of scheduling-clock interrupts, thereby improving energy | |
6 | efficiency and reducing OS jitter. Reducing OS jitter is important for | |
7 | some types of computationally intensive high-performance computing (HPC) | |
8 | applications and for real-time applications. | |
9 | ||
295fde89 PM |
10 | There are three main ways of managing scheduling-clock interrupts |
11 | (also known as "scheduling-clock ticks" or simply "ticks"): | |
0c87f9b5 | 12 | |
295fde89 PM |
13 | 1. Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or |
14 | CONFIG_NO_HZ=n for older kernels). You normally will -not- | |
15 | want to choose this option. | |
0c87f9b5 | 16 | |
295fde89 PM |
17 | 2. Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or |
18 | CONFIG_NO_HZ=y for older kernels). This is the most common | |
19 | approach, and should be the default. | |
0c87f9b5 | 20 | |
295fde89 PM |
21 | 3. Omit scheduling-clock ticks on CPUs that are either idle or that |
22 | have only one runnable task (CONFIG_NO_HZ_FULL=y). Unless you | |
23 | are running realtime applications or certain types of HPC | |
24 | workloads, you will normally -not- want this option. | |
25 | ||
26 | These three cases are described in the following three sections, followed | |
8bdf7a25 PM |
27 | by a fourth section on RCU-specific considerations, a fifth section |
28 | discussing testing, and a sixth and final section listing known issues. | |
0c87f9b5 PM |
29 | |
30 | ||
295fde89 PM |
31 | NEVER OMIT SCHEDULING-CLOCK TICKS |
32 | ||
33 | Very old versions of Linux from the 1990s and the very early 2000s | |
34 | are incapable of omitting scheduling-clock ticks. It turns out that | |
35 | there are some situations where this old-school approach is still the | |
36 | right approach, for example, in heavy workloads with lots of tasks | |
37 | that use short bursts of CPU, where there are very frequent idle | |
38 | periods, but where these idle periods are also quite short (tens or | |
39 | hundreds of microseconds). For these types of workloads, | |
40 | scheduling-clock interrupts will normally be delivered anyway because there | |
41 | will frequently be multiple runnable tasks per CPU. In these cases, | |
42 | attempting to turn off the scheduling-clock interrupt will have no effect | |
43 | other than increasing the overhead of switching to and from idle and | |
44 | transitioning between user and kernel execution. | |
45 | ||
46 | This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or | |
47 | CONFIG_NO_HZ=n for older kernels). | |
48 | ||
49 | However, if you are instead running a light workload with long idle | |
50 | periods, failing to omit scheduling-clock interrupts will result in | |
51 | excessive power consumption. This is especially bad on battery-powered | |
52 | devices, where it results in extremely short battery lifetimes. If you | |
53 | are running light workloads, you should therefore read the following | |
54 | section. | |
55 | ||
56 | In addition, if you are running either a real-time workload or an HPC | |
57 | workload with short iterations, the scheduling-clock interrupts can | |
58 | degrade your application's performance. If this describes your workload, | |
59 | you should read the following two sections. | |
60 | ||
61 | ||
62 | OMIT SCHEDULING-CLOCK TICKS FOR IDLE CPUs | |
0c87f9b5 PM |
63 | |
64 | If a CPU is idle, there is little point in sending it a scheduling-clock | |
65 | interrupt. After all, the primary purpose of a scheduling-clock interrupt | |
66 | is to force a busy CPU to shift its attention among multiple duties, | |
67 | and an idle CPU has no duties to shift its attention among. | |
68 | ||
69 | The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending | |
70 | scheduling-clock interrupts to idle CPUs, which is critically important | |
71 | both to battery-powered devices and to highly virtualized mainframes. | |
72 | A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would | |
73 | drain its battery very quickly, easily 2-3 times as fast as would the | |
74 | same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running | |
75 | 1,500 OS instances might find that half of its CPU time was consumed by | |
76 | unnecessary scheduling-clock interrupts. In these situations, there | |
77 | is strong motivation to avoid sending scheduling-clock interrupts to | |
78 | idle CPUs. That said, dyntick-idle mode is not free: | |
79 | ||
80 | 1. It increases the number of instructions executed on the path | |
81 | to and from the idle loop. | |
82 | ||
83 | 2. On many architectures, dyntick-idle mode also increases the | |
84 | number of expensive clock-reprogramming operations. | |
85 | ||
86 | Therefore, systems with aggressive real-time response constraints often | |
87 | run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels) | |
88 | in order to avoid degrading from-idle transition latencies. | |
89 | ||
90 | An idle CPU that is not receiving scheduling-clock interrupts is said to | |
91 | be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running | |
92 | tickless". The remainder of this document will use "dyntick-idle mode". | |
93 | ||
94 | There is also a boot parameter "nohz=" that can be used to disable | |
95 | dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off". | |
96 | By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling | |
97 | dyntick-idle mode. | |
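
As a rough illustration of the "nohz=" rule above, the following Python sketch inspects a kernel command line (for example, the contents of /proc/cmdline) and reports whether dyntick-idle mode was left enabled at boot. The function name is hypothetical. The sketch assumes a CONFIG_NO_HZ_IDLE=y kernel and does not check the Kconfig, and it assumes that the last "nohz=" occurrence wins, which this document does not specify.

```python
def dyntick_idle_requested(cmdline: str) -> bool:
    """Return False if the command line disables dyntick-idle mode.

    Assumes a CONFIG_NO_HZ_IDLE=y kernel, where "nohz=on" is the default.
    If "nohz=" appears more than once, the last occurrence wins here
    (an assumption made for this sketch).
    """
    enabled = True  # default for CONFIG_NO_HZ_IDLE=y kernels
    for word in cmdline.split():
        # "nohz_full=..." does not match this prefix, so it is ignored.
        if word.startswith("nohz="):
            enabled = word[len("nohz="):] != "off"
    return enabled
```

A caller would typically pass in the text read from /proc/cmdline on the running system.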
98 | ||
99 | ||
295fde89 | 100 | OMIT SCHEDULING-CLOCK TICKS FOR CPUs WITH ONLY ONE RUNNABLE TASK |
0c87f9b5 PM |
101 | |
102 | If a CPU has only one runnable task, there is little point in sending it | |
103 | a scheduling-clock interrupt because there is no other task to switch to. | |
295fde89 PM |
104 | Note that omitting scheduling-clock ticks for CPUs with only one runnable |
105 | task implies also omitting them for idle CPUs. | |
0c87f9b5 PM |
106 | |
107 | The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid | |
108 | sending scheduling-clock interrupts to CPUs with a single runnable task, | |
109 | and such CPUs are said to be "adaptive-ticks CPUs". This is important | |
110 | for applications with aggressive real-time response constraints because | |
111 | it allows them to improve their worst-case response times by the maximum | |
112 | duration of a scheduling-clock interrupt. It is also important for | |
113 | computationally intensive short-iteration workloads: If any CPU is | |
114 | delayed during a given iteration, all the other CPUs will be forced to | |
115 | wait idle while the delayed CPU finishes. Thus, the delay is multiplied | |
116 | by one less than the number of CPUs. In these situations, there is | |
117 | again strong motivation to avoid sending scheduling-clock interrupts. | |
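
The multiplication described above can be made concrete with a small Python sketch. The function name and the numbers in the usage note are purely illustrative and do not come from any measurement.

```python
def idle_cpu_time_lost(delay_us: float, ncpus: int) -> float:
    """CPU time wasted in one iteration when a single CPU is delayed.

    Each of the other (ncpus - 1) CPUs sits idle for delay_us while
    the delayed CPU finishes its share of the iteration.
    """
    if ncpus < 1:
        raise ValueError("need at least one CPU")
    return delay_us * (ncpus - 1)
```

For example, under these assumptions a 10-microsecond delay on one CPU of a 64-CPU job costs idle_cpu_time_lost(10, 64), or 630 CPU-microseconds, in that iteration.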
118 | ||
119 | By default, no CPU will be an adaptive-ticks CPU. The "nohz_full=" | |
120 | boot parameter specifies the adaptive-ticks CPUs. For example, | |
121 | "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks | |
122 | CPUs. Note that you are prohibited from marking all of the CPUs as | |
123 | adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain | |
8bdf7a25 PM |
124 | online to handle timekeeping tasks in order to ensure that system |
125 | calls like gettimeofday() return accurate values on adaptive-tick CPUs. | |
126 | (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running | |
127 | user processes to observe slight drifts in clock rate.) Therefore, the | |
128 | boot CPU is prohibited from entering adaptive-ticks mode. Specifying a | |
129 | "nohz_full=" mask that includes the boot CPU will result in a boot-time | |
130 | error message, and the boot CPU will be removed from the mask. Note that | |
131 | this means that your system must have at least two CPUs in order for | |
132 | CONFIG_NO_HZ_FULL=y to do anything for you. | |
0c87f9b5 PM |
133 | |
134 | Alternatively, the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter specifies | |
135 | that all CPUs other than the boot CPU are adaptive-ticks CPUs. This | |
136 | Kconfig parameter will be overridden by the "nohz_full=" boot parameter, | |
137 | so that if both the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter and | |
138 | the "nohz_full=1" boot parameter are specified, the boot parameter will | |
139 | prevail so that only CPU 1 will be an adaptive-ticks CPU. | |
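
The rules in the two preceding paragraphs can be sketched in Python as follows. This is only a model of the behavior described in this document, not the kernel's implementation: the function names are hypothetical, the boot CPU is assumed to be CPU 0 unless stated otherwise, and the parser handles only the comma-and-range syntax shown above.

```python
def parse_cpu_list(spec: str) -> set:
    """Parse a CPU list such as "1,6-8" into the set {1, 6, 7, 8}."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            first, last = int(lo), int(hi)
            if first > last:
                raise ValueError("bad CPU range: " + part)
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return cpus


def adaptive_ticks_cpus(all_cpus, boot_cpu=0, nohz_full=None, full_all=False):
    """Model which CPUs end up as adaptive-ticks CPUs.

    nohz_full -- value of the "nohz_full=" boot parameter, or None if absent
    full_all  -- True to model CONFIG_NO_HZ_FULL_ALL=y
    """
    if nohz_full is not None:
        requested = parse_cpu_list(nohz_full)  # boot parameter prevails
    elif full_all:
        requested = set(all_cpus)
    else:
        requested = set()  # default: no adaptive-ticks CPUs
    # The boot CPU is always removed so that it can handle timekeeping.
    return (requested & set(all_cpus)) - {boot_cpu}
```

On a hypothetical four-CPU system, adaptive_ticks_cpus(range(4), nohz_full="1", full_all=True) yields only CPU 1, matching the override described above.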
140 | ||
141 | Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded. | |
142 | This is covered in the "RCU IMPLICATIONS" section below. | |
143 | ||
144 | Normally, a CPU remains in adaptive-ticks mode as long as possible. | |
145 | In particular, transitioning to kernel mode does not automatically change | |
146 | the mode. Instead, the CPU will exit adaptive-ticks mode only if needed, | |
147 | for example, if that CPU enqueues an RCU callback. | |
148 | ||
149 | Just as with dyntick-idle mode, the benefits of adaptive-tick mode do | |
150 | not come for free: | |
151 | ||
152 | 1. CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run | |
153 | adaptive ticks without also running dyntick idle. This dependency | |
154 | extends down into the implementation, so that all of the costs | |
155 | of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL. | |
156 | ||
157 | 2. The user/kernel transitions are slightly more expensive due | |
158 | to the need to inform kernel subsystems (such as RCU) about | |
159 | the change in mode. | |
160 | ||
c2519784 PM |
161 | 3. POSIX CPU timers prevent CPUs from entering adaptive-tick mode. |
162 | Real-time applications needing to take actions based on CPU time | |
163 | consumption need to use other means of doing so. | |
0c87f9b5 PM |
164 | |
165 | 4. If there are more perf events pending than the hardware can | |
166 | accommodate, they are normally round-robined so as to collect | |
167 | all of them over time. Adaptive-tick mode may prevent this | |
168 | round-robining from happening. This will likely be fixed by | |
169 | preventing CPUs with large numbers of perf events pending from | |
170 | entering adaptive-tick mode. | |
171 | ||
172 | 5. Scheduler statistics for adaptive-tick CPUs may be computed | |
173 | slightly differently than those for non-adaptive-tick CPUs. | |
174 | This might in turn perturb load-balancing of real-time tasks. | |
175 | ||
176 | 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. | |
177 | ||
178 | Although improvements are expected over time, adaptive ticks is quite | |
179 | useful for many types of real-time and compute-intensive applications. | |
180 | However, the drawbacks listed above mean that adaptive ticks should not | |
181 | (yet) be enabled by default. | |
182 | ||
183 | ||
184 | RCU IMPLICATIONS | |
185 | ||
186 | There are situations in which idle CPUs cannot be permitted to | |
187 | enter either dyntick-idle mode or adaptive-tick mode, the most | |
188 | common being when such a CPU has RCU callbacks pending. | |
189 | ||
190 | The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs | |
191 | to enter dyntick-idle mode or adaptive-tick mode anyway. In this case, | |
192 | a timer will awaken these CPUs every four jiffies in order to ensure | |
193 | that the RCU callbacks are processed in a timely fashion. | |
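
Because a jiffy lasts 1/HZ seconds, the wall-clock length of the four-jiffy wakeup interval depends on the kernel's CONFIG_HZ setting. The following Python sketch shows the conversion. The function name is hypothetical, and the HZ values in the usage note are merely common choices, not values taken from this document.

```python
def jiffies_to_ms(jiffies: int, hz: int) -> float:
    """Convert a jiffy count to milliseconds for a CONFIG_HZ=hz kernel."""
    if hz <= 0:
        raise ValueError("hz must be positive")
    return jiffies * 1000.0 / hz
```

Under this conversion, four jiffies correspond to 4 ms with HZ=1000 and to 16 ms with HZ=250.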
194 | ||
195 | Another approach is to offload RCU callback processing to "rcuo" kthreads | |
196 | using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to | |
44c65ff2 PM |
197 | offload may be selected using the "rcu_nocbs=" kernel boot parameter, |
198 | which takes a comma-separated list of CPUs and CPU ranges, for example, | |
199 | "1,3-5" selects CPUs 1, 3, 4, and 5. | |
0c87f9b5 PM |
200 | |
201 | The offloaded CPUs will never queue RCU callbacks, and therefore RCU | |
202 | never prevents offloaded CPUs from entering either dyntick-idle mode | |
203 | or adaptive-tick mode. That said, note that it is up to userspace to | |
204 | pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the | |
205 | scheduler will decide where to run them, which might or might not be | |
206 | where you want them to run. | |
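
One way userspace might do this pinning is sketched below in Python. The sketch assumes that the kthreads of interest can be recognized by a task name beginning with "rcuo" in /proc/<pid>/comm, that the caller has sufficient privileges to change their affinity, and that the chosen CPUs are online. The function names are hypothetical.

```python
import os


def find_rcuo_pids(proc_root="/proc"):
    """Return the sorted PIDs of tasks whose name starts with "rcuo"."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-task entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if name.startswith("rcuo"):
            pids.append(int(entry))
    return sorted(pids)


def pin_rcuo_kthreads(cpus, proc_root="/proc"):
    """Bind every "rcuo" kthread found to the given set of CPUs."""
    pinned = []
    for pid in find_rcuo_pids(proc_root):
        os.sched_setaffinity(pid, cpus)  # normally requires root
        pinned.append(pid)
    return pinned
```

For instance, pin_rcuo_kthreads({0}) would confine all such kthreads to CPU 0, keeping them away from the CPUs whose callbacks were offloaded.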
207 | ||
208 | ||
8bdf7a25 PM |
209 | TESTING |
210 | ||
211 | So you enable all the OS-jitter features described in this document, | |
212 | but do not see any change in your workload's behavior. Is this because | |
213 | your workload isn't affected that much by OS jitter, or is it because | |
214 | something else is in the way? This section helps answer this question | |
215 | by providing a simple OS-jitter test suite, which is available on branch | |
216 | master of the following git archive: | |
217 | ||
218 | git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git | |
219 | ||
220 | Clone this archive and follow the instructions in the README file. | |
221 | This test procedure will produce a trace that will allow you to evaluate | |
222 | whether or not you have succeeded in removing OS jitter from your system. | |
223 | If this trace shows that you have removed OS jitter as much as is | |
224 | possible, then you can conclude that your workload is not all that | |
225 | sensitive to OS jitter. | |
226 | ||
227 | Note: this test requires that your system have at least two CPUs. | |
228 | We do not currently have a good way to remove OS jitter from single-CPU | |
229 | systems. | |
230 | ||
231 | ||
0c87f9b5 PM |
232 | KNOWN ISSUES |
233 | ||
234 | o Dyntick-idle slows transitions to and from idle slightly. | |
235 | In practice, this has not been a problem except for the most | |
236 | aggressive real-time workloads, which have the option of disabling | |
237 | dyntick-idle mode, an option that most of them take. However, | |
238 | some workloads will no doubt want to use adaptive ticks to | |
239 | eliminate scheduling-clock interrupt latencies. Here are some | |
240 | options for these workloads: | |
241 | ||
242 | a. Use PMQOS from userspace to inform the kernel of your | |
243 | latency requirements (preferred). | |
244 | ||
245 | b. On x86 systems, use the "idle=mwait" boot parameter. | |
246 | ||
247 | c. On x86 systems, use the "intel_idle.max_cstate=" boot parameter | |
248 | to limit the maximum C-state depth. | |
249 | ||
250 | d. On x86 systems, use the "idle=poll" boot parameter. | |
251 | However, please note that use of this parameter can cause | |
252 | your CPU to overheat, which may cause thermal throttling | |
253 | to degrade your latencies -- and that this degradation can | |
254 | be even worse than that of dyntick-idle. Furthermore, | |
255 | this parameter effectively disables Turbo Mode on Intel | |
256 | CPUs, which can significantly reduce maximum performance. | |
257 | ||
258 | o Adaptive-ticks slows user/kernel transitions slightly. | |
259 | This is not expected to be a problem for computationally intensive | |
260 | workloads, which have few such transitions. Careful benchmarking | |
261 | will be required to determine whether or not other workloads | |
262 | are significantly affected by this slowdown. | |
263 | ||
264 | o Adaptive-ticks does not do anything unless there is only one | |
265 | runnable task for a given CPU, even though there are a number | |
266 | of other situations where the scheduling-clock tick is not | |
267 | needed. To give but one example, consider a CPU that has one | |
268 | runnable high-priority SCHED_FIFO task and an arbitrary number | |
269 | of low-priority SCHED_OTHER tasks. In this case, the CPU is | |
270 | required to run the SCHED_FIFO task until it either blocks or | |
271 | some other higher-priority task awakens on (or is assigned to) | |
272 | this CPU, so there is no point in sending a scheduling-clock | |
273 | interrupt to this CPU. However, the current implementation | |
274 | nevertheless sends scheduling-clock interrupts to CPUs having a | |
275 | single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER | |
276 | tasks, even though these interrupts are unnecessary. | |
277 | ||
ce5f4fc8 PM |
278 | And even when there are multiple runnable tasks on a given CPU, |
279 | there is little point in interrupting that CPU until the current | |
280 | currently running task's timeslice expires, which is almost always way | |
281 | longer than the time of the next scheduling-clock interrupt. | |
282 | ||
0c87f9b5 PM |
283 | Better handling of these sorts of situations is future work. |
284 | ||
285 | o A reboot is required to reconfigure both adaptive ticks and RCU | |
286 | callback offloading. Runtime reconfiguration could be provided | |
287 | if needed; however, due to the complexity of reconfiguring RCU at | |
288 | runtime, there would need to be an earthshakingly good reason, | |
289 | especially given that you have the straightforward option of | |
290 | simply offloading RCU callbacks from all CPUs and pinning them | |
291 | where you want them whenever you want them pinned. | |
292 | ||
293 | o Additional configuration is required to deal with other sources | |
294 | of OS jitter, including interrupts and system-utility tasks | |
295 | and processes. This configuration normally involves binding | |
296 | interrupts and tasks to particular CPUs. | |
297 | ||
298 | o Some sources of OS jitter can currently be eliminated only by | |
299 | constraining the workload. For example, the only way to eliminate | |
300 | OS jitter due to global TLB shootdowns is to avoid the unmapping | |
301 | operations (such as kernel module unload operations) that | |
302 | result in these shootdowns. For another example, page faults | |
303 | and TLB misses can be reduced (and in some cases eliminated) by | |
304 | using huge pages and by constraining the amount of memory used | |
305 | by the application. Pre-faulting the working set can also be | |
306 | helpful, especially when combined with the mlock() and mlockall() | |
307 | system calls. | |
308 | ||
309 | o Unless all CPUs are idle, at least one CPU must keep the | |
310 | scheduling-clock interrupt going in order to support accurate | |
311 | timekeeping. | |
312 | ||
ce5f4fc8 PM |
313 | o If there might potentially be some adaptive-ticks CPUs, there |
314 | will be at least one CPU keeping the scheduling-clock interrupt | |
315 | going, even if all CPUs are otherwise idle. | |
316 | ||
317 | Better handling of this situation is ongoing work. | |
318 | ||
319 | o Some process-handling operations still require the occasional | |
320 | scheduling-clock tick. These operations include calculating CPU | |
321 | load, maintaining sched average, computing CFS entity vruntime, | |
322 | computing avenrun, and carrying out load balancing. They are | |
323 | currently accommodated by a scheduling-clock tick every second | |
324 | or so. Ongoing work will eliminate the need even for these | |
325 | infrequent scheduling-clock ticks. |
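
Returning to the item above on constraining the workload, the following Python sketch shows one way an application might pre-fault a buffer and then ask the kernel to lock its memory with mlockall(). It is a sketch under stated assumptions: the MCL_* values are those used by Linux on most architectures (some architectures differ), the C library is reached through ctypes, and the function names are hypothetical. The mlockall() call can fail without sufficient privileges or a large enough RLIMIT_MEMLOCK.

```python
import ctypes
import mmap

# Assumed flag values; correct on most Linux architectures, not all.
MCL_CURRENT = 1
MCL_FUTURE = 2


def prefault(buf) -> int:
    """Write one byte to each page of buf so that every page is faulted in.

    Overwrites the first byte of each page; returns the number of pages
    touched.
    """
    touched = 0
    for offset in range(0, len(buf), mmap.PAGESIZE):
        buf[offset] = 0  # a write forces the page to be allocated
        touched += 1
    return touched


def lock_all_memory() -> bool:
    """Call mlockall() for current and future mappings; True on success."""
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0
```

A latency-sensitive application might create its working buffer with mmap.mmap(-1, size), call prefault() on it during initialization, and then call lock_all_memory() before entering its time-critical loop.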