Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | CPUSETS |
2 | ------- | |
3 | ||
4 | Copyright (C) 2004 BULL SA. | |
5 | Written by Simon.Derr@bull.net | |
6 | ||
b4fb3766 | 7 | Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. |
1da177e4 | 8 | Modified by Paul Jackson <pj@sgi.com> |
b4fb3766 | 9 | Modified by Christoph Lameter <clameter@sgi.com> |
8793d854 | 10 | Modified by Paul Menage <menage@google.com> |
4d5f3553 | 11 | Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> |
1da177e4 LT |
12 | |
13 | CONTENTS: | |
14 | ========= | |
15 | ||
16 | 1. Cpusets | |
17 | 1.1 What are cpusets ? | |
18 | 1.2 Why are cpusets needed ? | |
19 | 1.3 How are cpusets implemented ? | |
bd5e09cf | 20 | 1.4 What are exclusive cpusets ? |
8793d854 PM |
21 | 1.5 What is memory_pressure ? |
22 | 1.6 What is memory spread ? | |
029190c5 | 23 | 1.7 What is sched_load_balance ? |
4d5f3553 HS |
24 | 1.8 What is sched_relax_domain_level ? |
25 | 1.9 How do I use cpusets ? | |
1da177e4 LT |
26 | 2. Usage Examples and Syntax |
27 | 2.1 Basic Usage | |
28 | 2.2 Adding/removing cpus | |
29 | 2.3 Setting flags | |
30 | 2.4 Attaching processes | |
31 | 3. Questions | |
32 | 4. Contact | |
33 | ||
34 | 1. Cpusets | |
35 | ========== | |
36 | ||
37 | 1.1 What are cpusets ? | |
38 | ---------------------- | |
39 | ||
40 | Cpusets provide a mechanism for assigning a set of CPUs and Memory | |
0e1e7c7a CL |
41 | Nodes to a set of tasks. In this document "Memory Node" refers to |
42 | an on-line node that contains memory. | |
1da177e4 LT |
43 | |
44 | Cpusets constrain the CPU and Memory placement of tasks to only | |
45 | the resources within a task's current cpuset. They form a nested | |
46 | hierarchy visible in a virtual file system. These are the essential | |
47 | hooks, beyond what is already present, required to manage dynamic | |
48 | job placement on large systems. | |
49 | ||
8793d854 PM |
50 | Cpusets use the generic cgroup subsystem described in |
51 | Documentation/cgroup.txt. | |
52 | ||
53 | Requests by a task, using the sched_setaffinity(2) system call to | |
54 | include CPUs in its CPU affinity mask, and using the mbind(2) and | |
55 | set_mempolicy(2) system calls to include Memory Nodes in its memory | |
56 | policy, are both filtered through that task's cpuset, filtering out any | |
57 | CPUs or Memory Nodes not in that cpuset. The scheduler will not | |
58 | schedule a task on a CPU that is not allowed in its cpus_allowed | |
59 | vector, and the kernel page allocator will not allocate a page on a | |
60 | node that is not allowed in the requesting task's mems_allowed vector. | |
61 | ||
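The filtering described above can be modelled as a set intersection. The following is a minimal Python sketch, not kernel code: the function name and the example CPU numbers are hypothetical, and treating an empty result as an error is an assumption modelled on sched_setaffinity(2) reporting EINVAL for an unusable mask.

```python
def filter_affinity(requested_cpus, cpuset_cpus):
    """Return the CPUs a task may actually use after cpuset filtering."""
    effective = set(requested_cpus) & set(cpuset_cpus)
    if not effective:
        # Assumption: model the empty result as an error, much as
        # sched_setaffinity(2) reports EINVAL for an unusable mask.
        raise ValueError("no requested CPU is allowed by the cpuset")
    return effective


cpuset_cpus = {4, 5, 6, 7}  # hypothetical cpuset
print(sorted(filter_affinity({2, 3, 4, 5}, cpuset_cpus)))  # prints [4, 5]
```

The same intersection applies to Memory Nodes requested through mbind(2) and set_mempolicy(2).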
62 | User level code may create and destroy cpusets by name in the cgroup | |
1da177e4 LT |
63 | virtual file system, manage the attributes and permissions of these |
64 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | |
65 | specify and query to which cpuset a task is assigned, and list the | |
66 | task pids assigned to a cpuset. | |
67 | ||
68 | ||
69 | 1.2 Why are cpusets needed ? | |
70 | ---------------------------- | |
71 | ||
72 | The management of large computer systems, with many processors (CPUs), | |
73 | complex memory cache hierarchies and multiple Memory Nodes having | |
74 | non-uniform access times (NUMA) presents additional challenges for | |
75 | the efficient scheduling and memory placement of processes. | |
76 | ||
77 | Frequently more modest sized systems can be operated with adequate | |
78 | efficiency just by letting the operating system automatically share | |
79 | the available CPU and Memory resources amongst the requesting tasks. | |
80 | ||
81 | But larger systems, which benefit more from careful processor and | |
82 | memory placement to reduce memory access times and contention, | |
83 | and which typically represent a larger investment for the customer, | |
33430dc5 | 84 | can benefit from explicitly placing jobs on properly sized subsets of |
1da177e4 LT |
85 | the system. |
86 | ||
87 | This can be especially valuable on: | |
88 | ||
89 | * Web Servers running multiple instances of the same web application, | |
90 | * Servers running different applications (for instance, a web server | |
91 | and a database), or | |
92 | * NUMA systems running large HPC applications with demanding | |
93 | performance characteristics. | |
94 | ||
95 | These subsets, or "soft partitions", must be able to be dynamically | |
96 | adjusted, as the job mix changes, without impacting other concurrently | |
b4fb3766 CL |
97 | executing jobs. The location of the running jobs' pages may also be moved |
98 | when the memory locations are changed. | |
1da177e4 LT |
99 | |
100 | The kernel cpuset patch provides the minimum essential kernel | |
101 | mechanisms required to efficiently implement such subsets. It | |
102 | leverages existing CPU and Memory Placement facilities in the Linux | |
103 | kernel to avoid any additional impact on the critical scheduler or | |
104 | memory allocator code. | |
105 | ||
106 | ||
107 | 1.3 How are cpusets implemented ? | |
108 | --------------------------------- | |
109 | ||
b4fb3766 CL |
110 | Cpusets provide a Linux kernel mechanism to constrain which CPUs and |
111 | Memory Nodes are used by a process or set of processes. | |
1da177e4 LT |
112 | |
113 | The Linux kernel already has a pair of mechanisms to specify on which | |
114 | CPUs a task may be scheduled (sched_setaffinity) and on which Memory | |
115 | Nodes it may obtain memory (mbind, set_mempolicy). | |
116 | ||
117 | Cpusets extend these two mechanisms as follows: | |
118 | ||
119 | - Cpusets are sets of allowed CPUs and Memory Nodes, known to the | |
120 | kernel. | |
121 | - Each task in the system is attached to a cpuset, via a pointer | |
8793d854 | 122 | in the task structure to a reference counted cgroup structure. |
1da177e4 LT |
123 | - Calls to sched_setaffinity are filtered to just those CPUs |
124 | allowed in that task's cpuset. | |
125 | - Calls to mbind and set_mempolicy are filtered to just | |
126 | those Memory Nodes allowed in that task's cpuset. | |
127 | - The root cpuset contains all the system's CPUs and Memory | |
128 | Nodes. | |
129 | - For any cpuset, one can define child cpusets containing a subset | |
130 | of the parent's CPU and Memory Node resources. | |
131 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for | |
132 | browsing and manipulation from user space. | |
133 | - A cpuset may be marked exclusive, which ensures that no other | |
134 | cpuset (except direct ancestors and descendents) may contain | |
135 | any overlapping CPUs or Memory Nodes. | |
136 | - You can list all the tasks (by pid) attached to any cpuset. | |
137 | ||
138 | The implementation of cpusets requires a few, simple hooks | |
139 | into the rest of the kernel, none in performance critical paths: | |
140 | ||
864913f3 | 141 | - in init/main.c, to initialize the root cpuset at system boot. |
1da177e4 LT |
142 | - in fork and exit, to attach and detach a task from its cpuset. |
143 | - in sched_setaffinity, to mask the requested CPUs by what's | |
144 | allowed in that task's cpuset. | |
145 | - in sched.c migrate_all_tasks(), to keep migrating tasks within | |
146 | the CPUs allowed by their cpuset, if possible. | |
147 | - in the mbind and set_mempolicy system calls, to mask the requested | |
148 | Memory Nodes by what's allowed in that task's cpuset. | |
864913f3 | 149 | - in page_alloc.c, to restrict memory to allowed nodes. |
1da177e4 LT |
150 | - in vmscan.c, to restrict page recovery to the current cpuset. |
151 | ||
8793d854 PM |
152 | You should mount the "cgroup" filesystem type in order to enable |
153 | browsing and modifying the cpusets presently known to the kernel. No | |
154 | new system calls are added for cpusets - all support for querying and | |
155 | modifying cpusets is via this cpuset file system. | |
1da177e4 LT |
156 | |
157 | The /proc/<pid>/status file for each task has two added lines, | |
158 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) | |
159 | and mems_allowed (on which Memory Nodes it may obtain memory), | |
160 | in the format seen in the following example: | |
161 | ||
162 | Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff | |
163 | Mems_allowed: ffffffff,ffffffff | |
164 | ||
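A small Python sketch of how such a mask line might be decoded into CPU or node numbers. It assumes the text is a comma-separated list of 32-bit hexadecimal words with the most significant word first; the function name is hypothetical.

```python
def parse_allowed_mask(mask):
    """Convert 'ffffffff,0000000f' style text into a sorted list of bit numbers."""
    value = 0
    # Assumption: words are 32 bits wide, most significant word first.
    for word in mask.strip().split(","):
        value = (value << 32) | int(word, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(parse_allowed_mask("00000000,00000005"))  # prints [0, 2]
```

Under that assumption, the Cpus_allowed value in the example above describes 128 allowed CPUs and the Mems_allowed value describes 64 allowed Memory Nodes.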
8793d854 PM |
165 | Each cpuset is represented by a directory in the cgroup file system |
166 | containing (on top of the standard cgroup files) the following | |
167 | files describing that cpuset: | |
1da177e4 LT |
168 | |
169 | - cpus: list of CPUs in that cpuset | |
170 | - mems: list of Memory Nodes in that cpuset | |
45b07ef3 | 171 | - memory_migrate flag: if set, move pages to cpuset's nodes |
1da177e4 LT |
172 | - cpu_exclusive flag: is cpu placement exclusive? |
173 | - mem_exclusive flag: is memory placement exclusive? | |
78608366 | 174 | - mem_hardwall flag: is memory allocation hardwalled? |
bd5e09cf PJ |
175 | - memory_pressure: measure of how much paging pressure in cpuset |
176 | ||
177 | In addition, only the root cpuset has the following file: | |
178 | - memory_pressure_enabled flag: compute memory_pressure? | |
1da177e4 LT |
179 | |
180 | New cpusets are created using the mkdir system call or shell | |
181 | command. The properties of a cpuset, such as its flags, allowed | |
182 | CPUs and Memory Nodes, and attached tasks, are modified by writing | |
183 | to the appropriate file in that cpuset's directory, as listed above. | |
184 | ||
185 | The named hierarchical structure of nested cpusets allows partitioning | |
186 | a large system into nested, dynamically changeable, "soft-partitions". | |
187 | ||
188 | The attachment of each task, automatically inherited at fork by any | |
189 | children of that task, to a cpuset allows organizing the work load | |
190 | on a system into related sets of tasks such that each set is constrained | |
191 | to using the CPUs and Memory Nodes of a particular cpuset. A task | |
192 | may be re-attached to any other cpuset, if allowed by the permissions | |
193 | on the necessary cpuset file system directories. | |
194 | ||
195 | Such management of a system "in the large" integrates smoothly with | |
196 | the detailed placement done on individual tasks and memory regions | |
197 | using the sched_setaffinity, mbind and set_mempolicy system calls. | |
198 | ||
199 | The following rules apply to each cpuset: | |
200 | ||
201 | - Its CPUs and Memory Nodes must be a subset of its parent's. | |
6a7d68e8 | 202 | - It can't be marked exclusive unless its parent is. |
1da177e4 LT |
203 | - If its cpu or memory is exclusive, they may not overlap any sibling. |
204 | ||
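The three rules above can be expressed as a simple check. This is a Python sketch only: the Cpuset class and violations() helper are hypothetical names, and treating an overlap as a violation when either sibling is exclusive is an assumption based on the definition of exclusive cpusets.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Cpuset:
    name: str
    cpus: set
    mems: set
    cpu_exclusive: bool = False
    mem_exclusive: bool = False
    parent: "Cpuset" = None
    children: list = field(default_factory=list)


def violations(cs):
    """List which of the three rules a non-root cpuset would break."""
    problems = []
    p = cs.parent
    if p is None:
        return problems
    if not (cs.cpus <= p.cpus and cs.mems <= p.mems):
        problems.append("not a subset of parent")
    if (cs.cpu_exclusive and not p.cpu_exclusive) or \
       (cs.mem_exclusive and not p.mem_exclusive):
        problems.append("exclusive but parent is not")
    for sib in p.children:
        if sib is cs:
            continue
        # Assumption: an overlap counts if either side is exclusive.
        if (cs.cpu_exclusive or sib.cpu_exclusive) and cs.cpus & sib.cpus:
            problems.append(f"cpus overlap sibling {sib.name}")
        if (cs.mem_exclusive or sib.mem_exclusive) and cs.mems & sib.mems:
            problems.append(f"mems overlap sibling {sib.name}")
    return problems


root = Cpuset("root", {0, 1, 2, 3}, {0, 1}, cpu_exclusive=True, mem_exclusive=True)
a = Cpuset("a", {0, 1}, {0}, cpu_exclusive=True, parent=root)
b = Cpuset("b", {1, 2}, {1}, parent=root)
root.children += [a, b]
print(violations(a))  # prints ['cpus overlap sibling b']
```

Note that only the parent and the siblings are inspected, which is the point the following paragraph makes about avoiding a scan of all cpusets.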
205 | These rules, and the natural hierarchy of cpusets, enable efficient | |
206 | enforcement of the exclusive guarantee, without having to scan all | |
207 | cpusets every time any of them changes to ensure nothing overlaps an | |
208 | exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | |
209 | to represent the cpuset hierarchy provides for a familiar permission | |
210 | and name space for cpusets, with a minimum of additional kernel code. | |
211 | ||
38837fc7 PJ |
212 | The cpus and mems files in the root (top_cpuset) cpuset are |
213 | read-only. The cpus file automatically tracks the value of | |
214 | cpu_online_map using a CPU hotplug notifier, and the mems file | |
0b720378 | 215 | automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e., |
0e1e7c7a | 216 | nodes with memory--using the cpuset_track_online_nodes() hook. |
4c4d50f7 | 217 | |
bd5e09cf PJ |
218 | |
219 | 1.4 What are exclusive cpusets ? | |
220 | -------------------------------- | |
221 | ||
222 | If a cpuset is cpu or mem exclusive, no other cpuset, other than | |
223 | a direct ancestor or descendent, may share any of the same CPUs or | |
224 | Memory Nodes. | |
225 | ||
78608366 PM |
226 | A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", |
227 | i.e. it restricts kernel allocations for page, buffer and other data | |
228 | commonly shared by the kernel across multiple users. All cpusets, | |
229 | whether hardwalled or not, restrict allocations of memory for user | |
230 | space. This enables configuring a system so that several independent | |
231 | jobs can share common kernel data, such as file system pages, while | |
232 | isolating each job's user allocation in its own cpuset. To do this, | |
233 | construct a large mem_exclusive cpuset to hold all the jobs, and | |
234 | construct child, non-mem_exclusive cpusets for each individual job. | |
235 | Only a small amount of typical kernel memory, such as requests from | |
236 | interrupt handlers, is allowed to be taken outside even a | |
237 | mem_exclusive cpuset. | |
bd5e09cf PJ |
238 | |
239 | ||
8793d854 | 240 | 1.5 What is memory_pressure ? |
bd5e09cf PJ |
241 | ----------------------------- |
242 | The memory_pressure of a cpuset provides a simple per-cpuset metric | |
243 | of the rate at which the tasks in a cpuset are attempting to free up | |
244 | in-use memory on the nodes of the cpuset to satisfy additional memory | |
245 | requests. | |
246 | ||
247 | This enables batch managers monitoring jobs running in dedicated | |
248 | cpusets to efficiently detect what level of memory pressure that job | |
249 | is causing. | |
250 | ||
251 | This is useful both on tightly managed systems running a wide mix of | |
252 | submitted jobs, which may choose to terminate or re-prioritize jobs that | |
253 | are trying to use more memory than allowed on the nodes assigned them, | |
254 | and with tightly coupled, long running, massively parallel scientific | |
255 | computing jobs that will dramatically fail to meet required performance | |
256 | goals if they start to use more memory than allowed to them. | |
257 | ||
258 | This mechanism provides a very economical way for the batch manager | |
259 | to monitor a cpuset for signs of memory pressure. It's up to the | |
260 | batch manager or other user code to decide what to do about it and | |
261 | take action. | |
262 | ||
263 | ==> Unless this feature is enabled by writing "1" to the special file | |
264 | /dev/cpuset/memory_pressure_enabled, the hook in the rebalance | |
265 | code of __alloc_pages() for this metric reduces to simply noticing | |
266 | that the cpuset_memory_pressure_enabled flag is zero. So only | |
267 | systems that enable this feature will compute the metric. | |
268 | ||
269 | Why a per-cpuset, running average: | |
270 | ||
271 | Because this meter is per-cpuset, rather than per-task or mm, | |
272 | the system load imposed by a batch scheduler monitoring this | |
273 | metric is sharply reduced on large systems, because a scan of | |
274 | the tasklist can be avoided on each set of queries. | |
275 | ||
276 | Because this meter is a running average, instead of an accumulating | |
277 | counter, a batch scheduler can detect memory pressure with a | |
278 | single read, instead of having to read and accumulate results | |
279 | for a period of time. | |
280 | ||
281 | Because this meter is per-cpuset rather than per-task or mm, | |
282 | the batch scheduler can obtain the key information, memory | |
283 | pressure in a cpuset, with a single read, rather than having to | |
284 | query and accumulate results over all the (dynamically changing) | |
285 | set of tasks in the cpuset. | |
286 | ||
287 | A per-cpuset simple digital filter (requires a spinlock and 3 words | |
288 | of data per-cpuset) is kept, and updated by any task attached to that | |
289 | cpuset, if it enters the synchronous (direct) page reclaim code. | |
290 | ||
291 | A per-cpuset file provides an integer number representing the recent | |
292 | (half-life of 10 seconds) rate of direct page reclaims caused by | |
293 | the tasks in the cpuset, in units of reclaims attempted per second, | |
294 | times 1000. | |
295 | ||
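To illustrate the idea of a running average with a 10 second half-life, here is a simplified Python sketch. It is not the kernel's filter: it uses floating point and continuous exponential decay, and the class and method names are hypothetical. Each recorded reclaim adds a fixed increment chosen so that a steady stream of events settles near events per second times 1000.

```python
import math

HALF_LIFE_SECONDS = 10.0
SCALE = 1000


class PressureMeter:
    def __init__(self, now=0.0):
        self.value = 0.0  # decayed reclaim rate, already scaled by SCALE
        self.last = now   # time of the last update, in seconds

    def _decay(self, now):
        elapsed = now - self.last
        if elapsed > 0:
            # Halve the stored value for every HALF_LIFE_SECONDS elapsed.
            self.value *= 0.5 ** (elapsed / HALF_LIFE_SECONDS)
            self.last = now

    def mark(self, now):
        """Record one direct-reclaim attempt at time `now`."""
        self._decay(now)
        self.value += SCALE * math.log(2) / HALF_LIFE_SECONDS

    def read(self, now):
        """Return the current metric as an integer, as a single read would."""
        self._decay(now)
        return int(self.value)


meter = PressureMeter()
for second in range(100):
    meter.mark(float(second))  # one reclaim per second
print(meter.read(99.0), meter.read(109.0))
```

The second value printed is roughly half of the first, since ten quiet seconds separate the two reads. A monitor needs only one read per cpuset to see recent pressure.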
296 | ||
8793d854 | 297 | 1.6 What is memory spread ? |
825a46af PJ |
298 | --------------------------- |
299 | There are two boolean flag files per cpuset that control where the | |
300 | kernel allocates pages for the file system buffers and related | |
301 | in-kernel data structures. They are called 'memory_spread_page' and | |
302 | 'memory_spread_slab'. | |
303 | ||
304 | If the per-cpuset boolean flag file 'memory_spread_page' is set, then | |
305 | the kernel will spread the file system buffers (page cache) evenly | |
306 | over all the nodes that the faulting task is allowed to use, instead | |
307 | of preferring to put those pages on the node where the task is running. | |
308 | ||
309 | If the per-cpuset boolean flag file 'memory_spread_slab' is set, | |
310 | then the kernel will spread some file system related slab caches, | |
311 | such as for inodes and dentries, evenly over all the nodes that the | |
312 | faulting task is allowed to use, instead of preferring to put those | |
313 | pages on the node where the task is running. | |
314 | ||
315 | The setting of these flags does not affect anonymous data segment or | |
316 | stack segment pages of a task. | |
317 | ||
318 | By default, both kinds of memory spreading are off, and memory | |
319 | pages are allocated on the node local to where the task is running, | |
320 | except perhaps as modified by the task's NUMA mempolicy or cpuset | |
321 | configuration, so long as sufficient free memory pages are available. | |
322 | ||
323 | When new cpusets are created, they inherit the memory spread settings | |
324 | of their parent. | |
325 | ||
326 | Setting memory spreading causes allocations for the affected page | |
327 | or slab caches to ignore the task's NUMA mempolicy and be spread | |
328 | instead. Tasks using mbind() or set_mempolicy() calls to set NUMA | |
329 | mempolicies will not notice any change in these calls as a result of | |
330 | their containing task's memory spread settings. If memory spreading | |
331 | is turned off, then the currently specified NUMA mempolicy once again | |
332 | applies to memory page allocations. | |
333 | ||
334 | Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | |
335 | files. By default they contain "0", meaning that the feature is off | |
336 | for that cpuset. If a "1" is written to that file, then that turns | |
337 | the named feature on. | |
338 | ||
339 | The implementation is simple. | |
340 | ||
341 | Setting the flag 'memory_spread_page' turns on a per-process flag | |
342 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | |
343 | joins that cpuset. The page allocation calls for the page cache | |
344 | are modified to perform an inline check for this PF_SPREAD_PAGE task | |
345 | flag, and if set, a call to a new routine cpuset_mem_spread_node() | |
346 | returns the node to prefer for the allocation. | |
347 | ||
6a7d68e8 | 348 | Similarly, setting 'memory_spread_slab' turns on the flag |
825a46af PJ |
349 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate |
350 | pages from the node returned by cpuset_mem_spread_node(). | |
351 | ||
352 | The cpuset_mem_spread_node() routine is also simple. It uses the | |
353 | value of a per-task rotor cpuset_mem_spread_rotor to select the next | |
354 | node in the current task's mems_allowed to prefer for the allocation. | |
355 | ||
356 | This memory placement policy is also known (in other contexts) as | |
357 | round-robin or interleave. | |
358 | ||
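The rotor selection described above can be sketched in Python as follows. This is an illustration of round-robin node selection, not the kernel routine: the Task class, the starting rotor value of -1, and the node numbers are hypothetical.

```python
def spread_node(rotor, mems_allowed):
    """Return the next allowed node after `rotor`, wrapping to the first."""
    nodes = sorted(mems_allowed)
    for node in nodes:
        if node > rotor:
            return node
    return nodes[0]


class Task:
    def __init__(self, mems_allowed):
        self.mems_allowed = set(mems_allowed)
        self.rotor = -1  # hypothetical starting value for the per-task rotor

    def next_spread_node(self):
        # Advance the rotor and prefer that node for this allocation.
        self.rotor = spread_node(self.rotor, self.mems_allowed)
        return self.rotor


task = Task({0, 2, 3})
print([task.next_spread_node() for _ in range(5)])  # prints [0, 2, 3, 0, 2]
```

Because the rotor is kept per task, each task cycles through its own allowed nodes independently of other tasks in the cpuset.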
359 | This policy can provide substantial improvements for jobs that need | |
360 | to place thread local data on the corresponding node, but that need | |
361 | to access large file system data sets that need to be spread across | |
362 | the several nodes in the job's cpuset in order to fit. Without this | |
363 | policy, especially for jobs that might have one thread reading in the | |
364 | data set, the memory allocation across the nodes in the job's cpuset | |
365 | can become very uneven. | |
366 | ||
029190c5 PJ |
367 | 1.7 What is sched_load_balance ? |
368 | -------------------------------- | |
825a46af | 369 | |
029190c5 PJ |
370 | The kernel scheduler (kernel/sched.c) automatically load balances |
371 | tasks. If one CPU is underutilized, kernel code running on that | |
372 | CPU will look for tasks on other more overloaded CPUs and move those | |
373 | tasks to itself, within the constraints of such placement mechanisms | |
374 | as cpusets and sched_setaffinity. | |
375 | ||
376 | The algorithmic cost of load balancing and its impact on key shared | |
377 | kernel data structures such as the task list increases more than | |
378 | linearly with the number of CPUs being balanced. So the scheduler | |
379 | has support to partition the system's CPUs into a number of sched | |
380 | domains such that it only load balances within each sched domain. | |
381 | Each sched domain covers some subset of the CPUs in the system; | |
382 | no two sched domains overlap; some CPUs might not be in any sched | |
383 | domain and hence won't be load balanced. | |
384 | ||
385 | Put simply, it costs less to balance between two smaller sched domains | |
386 | than one big one, but doing so means that overloads in one of the | |
387 | two domains won't be load balanced to the other one. | |
388 | ||
389 | By default, there is one sched domain covering all CPUs, except those | |
390 | marked isolated using the kernel boot time "isolcpus=" argument. | |
391 | ||
392 | This default load balancing across all CPUs is not well suited for | |
393 | the following two situations: | |
394 | 1) On large systems, load balancing across many CPUs is expensive. | |
395 | If the system is managed using cpusets to place independent jobs | |
396 | on separate sets of CPUs, full load balancing is unnecessary. | |
397 | 2) Systems supporting realtime on some CPUs need to minimize | |
398 | system overhead on those CPUs, including avoiding task load | |
399 | balancing if that is not needed. | |
400 | ||
401 | When the per-cpuset flag "sched_load_balance" is enabled (the default | |
402 | setting), it requests that all the CPUs in that cpuset's allowed 'cpus' | |
403 | be contained in a single sched domain, ensuring that load balancing | |
404 | can move a task (not otherwise pinned, as by sched_setaffinity) | |
405 | from any CPU in that cpuset to any other. | |
406 | ||
407 | When the per-cpuset flag "sched_load_balance" is disabled, then the | |
408 | scheduler will avoid load balancing across the CPUs in that cpuset, | |
409 | --except-- in so far as is necessary because some overlapping cpuset | |
410 | has "sched_load_balance" enabled. | |
411 | ||
412 | So, for example, if the top cpuset has the flag "sched_load_balance" | |
413 | enabled, then the scheduler will have one sched domain covering all | |
414 | CPUs, and the setting of the "sched_load_balance" flag in any other | |
415 | cpusets won't matter, as we're already fully load balancing. | |
416 | ||
417 | Therefore in the above two situations, the top cpuset flag | |
418 | "sched_load_balance" should be disabled, and only some of the smaller, | |
419 | child cpusets have this flag enabled. | |
420 | ||
421 | When doing this, you don't usually want to leave any unpinned tasks in | |
422 | the top cpuset that might use non-trivial amounts of CPU, as such tasks | |
423 | may be artificially constrained to some subset of CPUs, depending on | |
424 | the particulars of this flag setting in descendent cpusets. Even if | |
425 | such a task could use spare CPU cycles in some other CPUs, the kernel | |
426 | scheduler might not consider the possibility of load balancing that | |
427 | task to that underused CPU. | |
428 | ||
429 | Of course, tasks pinned to a particular CPU can be left in a cpuset | |
430 | that disables "sched_load_balance" as those tasks aren't going anywhere | |
431 | else anyway. | |
432 | ||
433 | There is an impedance mismatch here, between cpusets and sched domains. | |
434 | Cpusets are hierarchical and nest. Sched domains are flat; they don't | |
435 | overlap and each CPU is in at most one sched domain. | |
436 | ||
437 | It is necessary for sched domains to be flat because load balancing | |
438 | across partially overlapping sets of CPUs would risk unstable dynamics | |
439 | that would be beyond our understanding. So if each of two partially | |
440 | overlapping cpusets enables the flag 'sched_load_balance', then we | |
441 | form a single sched domain that is a superset of both. We won't move | |
442 | a task to a CPU outside its cpuset, but the scheduler load balancing | |
443 | code might waste some compute cycles considering that possibility. | |
444 | ||
445 | This mismatch is why there is not a simple one-to-one relation | |
446 | between which cpusets have the flag "sched_load_balance" enabled, | |
447 | and the sched domain configuration. If a cpuset enables the flag, it | |
448 | will get balancing across all its CPUs, but if it disables the flag, | |
449 | it will only be assured of no load balancing if no other overlapping | |
450 | cpuset enables the flag. | |
451 | ||
452 | If two cpusets have partially overlapping 'cpus' allowed, and only | |
453 | one of them has this flag enabled, then the other may find its | |
454 | tasks only partially load balanced, just on the overlapping CPUs. | |
455 | This is just the general case of the top_cpuset example given a few | |
456 | paragraphs above. In the general case, as in the top cpuset case, | |
457 | don't leave tasks that might use non-trivial amounts of CPU in | |
458 | such partially load balanced cpusets, as they may be artificially | |
459 | constrained to some subset of the CPUs allowed to them, for lack of | |
460 | load balancing to the other CPUs. | |
461 | ||
462 | 1.7.1 sched_load_balance implementation details. | |
463 | ------------------------------------------------ | |
464 | ||
465 | The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary | |
466 | to most cpuset flags.) When enabled for a cpuset, the kernel will | |
467 | ensure that it can load balance across all the CPUs in that cpuset | |
468 | (makes sure that all the CPUs in the cpus_allowed of that cpuset are | |
469 | in the same sched domain.) | |
470 | ||
471 | If two overlapping cpusets both have 'sched_load_balance' enabled, | |
472 | then they will be (must be) both in the same sched domain. | |
473 | ||
474 | If, as is the default, the top cpuset has 'sched_load_balance' enabled, | |
475 | then by the above that means there is a single sched domain covering | |
476 | the whole system, regardless of any other cpuset settings. | |
477 | ||
478 | The kernel commits to user space that it will avoid load balancing | |
479 | where it can. It will pick as fine a granularity partition of sched | |
480 | domains as it can while still providing load balancing for any set | |
481 | of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | |
482 | ||
483 | The internal kernel cpuset to scheduler interface passes from the | |
484 | cpuset code to the scheduler code a partition of the load balanced | |
485 | CPUs in the system. This partition is a set of subsets (represented | |
486 | as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all | |
487 | the CPUs that must be load balanced. | |
488 | ||
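One way to picture how such a partition could be derived is to merge every group of overlapping load-balanced cpusets into a single set. The Python sketch below illustrates that idea only. It is not the kernel's algorithm, and the function name and CPU numbers are hypothetical.

```python
def build_partition(balanced_cpusets):
    """Merge overlapping CPU sets into pairwise disjoint sched domains."""
    domains = []
    for cpus in balanced_cpusets:
        merged = set(cpus)
        if not merged:
            continue
        untouched = []
        for domain in domains:
            if domain & merged:
                merged |= domain  # overlapping sets must share a domain
            else:
                untouched.append(domain)
        untouched.append(merged)
        domains = untouched
    return sorted(domains, key=min)


# 'cpus' of the cpusets that have sched_load_balance enabled (made up)
partition = build_partition([{0, 1, 2, 3}, {2, 3, 4}, {8, 9}])
print([sorted(d) for d in partition])  # prints [[0, 1, 2, 3, 4], [8, 9]]
```

CPUs 5 to 7 appear in no element of the result, so in this model they would not be load balanced at all.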
489 | Whenever the 'sched_load_balance' flag changes, or CPUs come or go | |
490 | from a cpuset with this flag enabled, or a cpuset with this flag | |
491 | enabled is removed, the cpuset code builds a new such partition and | |
492 | passes it to the scheduler sched domain setup code, to have the sched | |
493 | domains rebuilt as necessary. | |
494 | ||
495 | This partition exactly defines what sched domains the scheduler should | |
496 | set up - one sched domain for each element (cpumask_t) in the partition. | |
497 | ||
498 | The scheduler remembers the currently active sched domain partitions. | |
499 | When the scheduler routine partition_sched_domains() is invoked from | |
500 | the cpuset code to update these sched domains, it compares the new | |
501 | partition requested with the current, and updates its sched domains, | |
502 | removing the old and adding the new, for each change. | |
503 | ||
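The comparison of the requested partition with the current one amounts to two set differences. A minimal Python sketch, assuming each sched domain is identified purely by its set of CPUs; the function name is hypothetical and this is not the code of partition_sched_domains().

```python
def diff_partitions(current, new):
    """Return (domains to remove, domains to add) between two partitions."""
    cur = {frozenset(d) for d in current}
    nxt = {frozenset(d) for d in new}
    # Domains present in both partitions are left untouched.
    return cur - nxt, nxt - cur


destroy, build = diff_partitions([{0, 1}, {2, 3}], [{0, 1}, {2, 3, 4}])
print(sorted(map(sorted, destroy)), sorted(map(sorted, build)))
# prints [[2, 3]] [[2, 3, 4]]
```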
4d5f3553 HS |
504 | |
505 | 1.8 What is sched_relax_domain_level ? | |
506 | -------------------------------------- | |
507 | ||
508 | Within a sched domain, the scheduler migrates tasks in 2 ways: periodic | |
509 | load balancing on tick, and at the time of some scheduling events. | |
510 | ||
511 | When a task is woken up, the scheduler tries to move it to an idle CPU. | |
512 | For example, if a task A running on CPU X activates another task B | |
513 | on the same CPU X, and if CPU Y is X's sibling and is idle, | |
514 | then the scheduler migrates task B to CPU Y so that task B can start on | |
515 | CPU Y without waiting for task A on CPU X. | |
516 | ||
517 | And if a CPU runs out of tasks in its runqueue, the CPU tries to pull | |
518 | extra tasks from other busy CPUs to help them before it goes | |
519 | idle. | |
520 | ||
521 | Of course, finding movable tasks and/or idle CPUs has a searching cost, | |
522 | so the scheduler might not search all CPUs in the domain | |
523 | every time. In fact, on some architectures, the searching range on | |
524 | events is limited to the socket or node where the CPU is located, | |
525 | while the load balance on tick searches all CPUs. | |
526 | ||
527 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | |
528 | is idle while CPU X and its siblings are busy, the scheduler can't migrate | |
529 | woken task B from X to Z since Z is out of its searching range. | |
530 | As a result, task B on CPU X needs to wait for task A or for load balancing | |
531 | on the next tick. For some applications in special situations, waiting | |
532 | 1 tick may be too long. | |
533 | ||
534 | The 'sched_relax_domain_level' file allows you to request changing | |
535 | this searching range as you like. This file takes an int value which | |
536 | indicates the size of the searching range in levels, ideally as follows; | |
537 | the initial value -1 indicates that the cpuset has no request. | |
538 | ||
539 | -1 : no request. use system default or follow request of others. | |
540 | 0 : no search. | |
541 | 1 : search siblings (hyperthreads in a core). | |
542 | 2 : search cores in a package. | |
543 | 3 : search cpus in a node [= system wide on non-NUMA system] | |
544 | ( 4 : search nodes in a chunk of node [on NUMA system] ) | |
545 | ( 5~ : search system wide [on NUMA system]) | |
546 | ||
547 | This file is per-cpuset and affects the sched domain to which the cpuset | |
548 | belongs. Therefore if the flag 'sched_load_balance' of a cpuset | |
549 | is disabled, then 'sched_relax_domain_level' has no effect since | |
550 | there is no sched domain belonging to the cpuset. | |
551 | ||
552 | If multiple cpusets are overlapping and hence they form a single sched | |
553 | domain, the largest value among those is used. Be careful: if one | |
554 | requests 0 and the others are -1, then 0 is used. | |
555 | ||
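The rule for overlapping cpusets can be stated as taking the maximum of the requested levels. A Python sketch with a hypothetical function name, where -1 stands for "no request" as in the table above:

```python
def effective_relax_level(requests):
    """Pick the level used for a sched domain shared by several cpusets."""
    # -1 means "no request"; any explicit request, even 0, outranks it.
    return max(requests, default=-1)


print(effective_relax_level([-1, 0]))     # prints 0
print(effective_relax_level([1, 3, -1]))  # prints 3
```

The first call shows the case the text warns about: a single cpuset requesting 0 ("no search") overrides the system default for every cpuset sharing that sched domain.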
556 | Note that modifying this file will have both good and bad effects, | |
557 | and whether it is acceptable or not will depend on your situation. | |
558 | Don't modify this file if you are not sure. | |
559 | ||
560 | If your situation is: | |
561 | - The migration costs between CPUs can be assumed to be considerably | |
562 | small (for you) due to your special application's behavior or | |
563 | special hardware support for CPU cache etc. | |
564 | - The searching cost has no impact (for you) or you can make | |
565 | the searching cost small enough by keeping cpusets compact etc. | |
566 | - Low latency is required even if it sacrifices cache hit rate etc. | |
567 | then increasing 'sched_relax_domain_level' would benefit you. | |
568 | ||
569 | ||
570 | 1.9 How do I use cpusets ? | |
1da177e4 LT |
571 | -------------------------- |
572 | ||
573 | In order to minimize the impact of cpusets on critical kernel | |
574 | code, such as the scheduler, and due to the fact that the kernel | |
575 | does not support one task updating the memory placement of another | |
576 | task directly, the impact on a task of changing its cpuset CPU | |
577 | or Memory Node placement, or of changing to which cpuset a task | |
578 | is attached, is subtle. | |
579 | ||
580 | If a cpuset has its Memory Nodes modified, then for each task attached | |
581 | to that cpuset, the next time that the kernel attempts to allocate | |
582 | a page of memory for that task, the kernel will notice the change | |
583 | in the task's cpuset, and update its per-task memory placement to | |
584 | remain within the new cpuset's memory placement. If the task was using | |
585 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with | |
586 | its new cpuset, then the task will continue to use whatever subset | |
587 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task | |
588 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | |
589 | in the new cpuset, then the task will be essentially treated as if it | |
590 | was MPOL_BIND bound to the new cpuset (even though its NUMA placement, | |
591 | as queried by get_mempolicy(), doesn't change). If a task is moved | |
592 | from one cpuset to another, then the kernel will adjust the task's | |
593 | memory placement, as above, the next time that the kernel attempts | |
594 | to allocate a page of memory for that task. | |
595 | ||
8f5aa26c PJ |
596 | If a cpuset has its 'cpus' modified, then each task in that cpuset |
597 | will have its allowed CPU placement changed immediately. Similarly, | |
598 | if a task's pid is written to a cpuset's 'tasks' file, in either its | |
599 | current cpuset or another cpuset, then its allowed CPU placement is | |
600 | changed immediately. If such a task had been bound to some subset | |
601 | of its cpuset using the sched_setaffinity() call, the task will be | |
602 | allowed to run on any CPU allowed in its new cpuset, negating the | |
603 | effect of the prior sched_setaffinity() call. | |
1da177e4 LT |
604 | |
605 | In summary, the memory placement of a task whose cpuset is changed is | |
606 | updated by the kernel on the next allocation of a page for that task, | |
607 | while the processor placement is updated immediately, as described | |
608 | above. Deferring the memory update avoids having one task directly | |
609 | modify the memory placement of another, which the kernel does not | |
610 | support. | |
611 | ||
45b07ef3 PJ |
612 | Normally, once a page is allocated (given a physical page |
613 | of main memory) then that page stays on whatever node it | |
614 | was allocated on, so long as it remains allocated, even if the | |
615 | cpuset's memory placement policy 'mems' subsequently changes. | |
616 | If the cpuset flag file 'memory_migrate' is set true, then when | |
617 | a task is attached to that cpuset, any pages that task had | |
618 | allocated to it on nodes in its previous cpuset are migrated | |
b4fb3766 CL |
619 | to the task's new cpuset. The relative placement of the page within |
620 | the cpuset is preserved during these migration operations if possible. | |
621 | For example, if the page was on the second valid node of the prior cpuset, | |
622 | then the page will be placed on the second valid node of the new cpuset. | |
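
The example above can be pictured as a position-preserving mapping
between two node lists. The following shell sketch assumes a
hypothetical helper named remap_node. The wrap-around it uses when the
new cpuset has fewer nodes than the old one is an assumption made for
illustration only, not a statement about what the kernel does in that
case:

```shell
# Hypothetical helper: remap_node NODE "OLD NODES" "NEW NODES"
# Prints the node in the new list that sits at the same position
# as NODE in the old list.
remap_node() {
    node=$1
    old_nodes=$2
    new_nodes=$3
    # Find the zero-based position of NODE in the old list.
    pos=0
    found=no
    for n in $old_nodes; do
        if [ "$n" -eq "$node" ]; then
            found=yes
            break
        fi
        pos=$((pos + 1))
    done
    set -- $new_nodes
    # A page that was not on a node of the prior cpuset is not moved.
    if [ "$found" = no ] || [ "$#" -eq 0 ]; then
        echo "$node"
        return 0
    fi
    # Assumption: wrap around when the new list is shorter.
    pos=$((pos % $#))
    shift "$pos"
    echo "$1"
}

# Node 5 is the second node of "4 5", so it maps to the second
# node of "0 1".
remap_node 5 "4 5" "0 1"    # prints 1
```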
623 | ||
45b07ef3 PJ |
624 | Also, if 'memory_migrate' is set true, then if that cpuset's |
625 | 'mems' file is modified, pages allocated to tasks in that | |
626 | cpuset, that were on nodes in the previous setting of 'mems', | |
b4fb3766 CL |
627 | will be moved to nodes in the new setting of 'mems'. |
628 | Pages that were not in the task's prior cpuset, or in the cpuset's | |
629 | prior 'mems' setting, will not be moved. | |
45b07ef3 | 630 | |
d533f671 | 631 | There is an exception to the above. If hotplug functionality is used |
1da177e4 LT |
632 | to remove all the CPUs that are currently assigned to a cpuset, |
633 | then the kernel will automatically update the cpus_allowed of all | |
b39c4fab | 634 | tasks attached to CPUs in that cpuset to allow all CPUs. When memory |
1da177e4 LT |
635 | hotplug functionality for removing Memory Nodes is available, a |
636 | similar exception is expected to apply there as well. In general, | |
637 | the kernel prefers to violate cpuset placement, over starving a task | |
638 | that has had all its allowed CPUs or Memory Nodes taken offline. User | |
639 | code should reconfigure cpusets to only refer to online CPUs and Memory | |
640 | Nodes when using hotplug to add or remove such resources. | |
641 | ||
642 | There is a second exception to the above. GFP_ATOMIC requests are | |
643 | kernel internal allocations that must be satisfied immediately. | |
644 | The kernel may drop some request, in rare cases even panic, if a | |
645 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within | |
646 | the current task's cpuset, then we relax the cpuset, and look for | |
647 | memory anywhere we can find it. It's better to violate the cpuset | |
648 | than stress the kernel. | |
649 | ||
650 | To start a new job that is to be contained within a cpuset, the steps are: | |
651 | ||
652 | 1) mkdir /dev/cpuset | |
8793d854 | 653 | 2) mount -t cgroup -ocpuset cpuset /dev/cpuset |
1da177e4 LT |
654 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in |
655 | the /dev/cpuset virtual file system. | |
656 | 4) Start a task that will be the "founding father" of the new job. | |
657 | 5) Attach that task to the new cpuset by writing its pid to the | |
658 | /dev/cpuset tasks file for that cpuset. | |
659 | 6) fork, exec or clone the job tasks from this founding father task. | |
660 | ||
661 | For example, the following sequence of commands will set up a cpuset | |
662 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | |
663 | and then start a subshell 'sh' in that cpuset: | |
664 | ||
8793d854 | 665 | mount -t cgroup -ocpuset cpuset /dev/cpuset |
1da177e4 LT |
666 | cd /dev/cpuset |
667 | mkdir Charlie | |
668 | cd Charlie | |
669 | /bin/echo 2-3 > cpus | |
670 | /bin/echo 1 > mems | |
671 | /bin/echo $$ > tasks | |
672 | sh | |
673 | # The subshell 'sh' is now running in cpuset Charlie | |
674 | # The next line should display '/Charlie' | |
675 | cat /proc/self/cpuset | |
676 | ||
1da177e4 LT |
677 | In the future, a C library interface to cpusets will likely be |
678 | available. For now, the only way to query or modify cpusets is | |
679 | via the cpuset file system, using the various cd, mkdir, echo, cat, | |
680 | rmdir commands from the shell, or their equivalent from C. | |
681 | ||
682 | The sched_setaffinity calls can also be done at the shell prompt using | |
683 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | |
684 | calls can be done at the shell prompt using the numactl command | |
685 | (part of Andi Kleen's numa package). | |
686 | ||
687 | 2. Usage Examples and Syntax | |
688 | ============================ | |
689 | ||
690 | 2.1 Basic Usage | |
691 | --------------- | |
692 | ||
693 | Creating, modifying, and using cpusets can be done through the cpuset | |
694 | virtual filesystem. | |
695 | ||
696 | To mount it, type: | |
8793d854 | 697 | # mount -t cgroup -o cpuset cpuset /dev/cpuset |
1da177e4 LT |
698 | |
699 | Then under /dev/cpuset you can find a tree that corresponds to the | |
700 | tree of the cpusets in the system. For instance, /dev/cpuset | |
701 | is the cpuset that holds the whole system. | |
702 | ||
703 | If you want to create a new cpuset under /dev/cpuset: | |
704 | # cd /dev/cpuset | |
705 | # mkdir my_cpuset | |
706 | ||
707 | Now you want to do something with this cpuset. | |
708 | # cd my_cpuset | |
709 | ||
710 | In this directory you can find several files: | |
711 | # ls | |
6a7d68e8 MX |
712 | cpu_exclusive memory_migrate mems tasks |
713 | cpus memory_pressure notify_on_release | |
714 | mem_exclusive memory_spread_page sched_load_balance | |
715 | mem_hardwall memory_spread_slab sched_relax_domain_level | |
1da177e4 LT |
716 | |
717 | Reading them will give you information about the state of this cpuset: | |
718 | the CPUs and Memory Nodes it can use, the processes that are using | |
719 | it, its properties. By writing to these files you can manipulate | |
720 | the cpuset. | |
721 | ||
722 | Set some flags: | |
723 | # /bin/echo 1 > cpu_exclusive | |
724 | ||
725 | Add some cpus: | |
726 | # /bin/echo 0-7 > cpus | |
727 | ||
2400ff77 SH |
728 | Add some mems: |
729 | # /bin/echo 0-7 > mems | |
730 | ||
1da177e4 LT |
731 | Now attach your shell to this cpuset: |
732 | # /bin/echo $$ > tasks | |
733 | ||
734 | You can also create cpusets inside your cpuset by using mkdir in this | |
735 | directory. | |
736 | # mkdir my_sub_cs | |
737 | ||
738 | To remove a cpuset, just use rmdir: | |
739 | # rmdir my_sub_cs | |
740 | This will fail if the cpuset is in use (has cpusets inside, or has | |
741 | processes attached). | |
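
Before calling rmdir it can be handy to check both conditions. A
sketch, assuming a hypothetical helper named cpuset_in_use that takes a
cpuset directory as its only argument. It reads the 'tasks' file rather
than testing its size, on the assumption that a virtual file system may
not report a meaningful file size:

```shell
# Hypothetical helper: succeed (status 0) if the cpuset directory
# still has attached processes or child cpusets.
cpuset_in_use() {
    dir=$1
    # Any pid listed in 'tasks' means processes are attached.
    if [ -n "$(cat "$dir/tasks" 2>/dev/null)" ]; then
        return 0
    fi
    # Any subdirectory is a child cpuset.
    for child in "$dir"/*/; do
        if [ -d "$child" ]; then
            return 0
        fi
    done
    return 1
}
```

It could be used as 'cpuset_in_use my_sub_cs || rmdir my_sub_cs'.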
742 | ||
8793d854 PM |
743 | Note that for legacy reasons, the "cpuset" filesystem exists as a |
744 | wrapper around the cgroup filesystem. | |
745 | ||
746 | The command | |
747 | ||
748 | mount -t cpuset X /dev/cpuset | |
749 | ||
750 | is equivalent to | |
751 | ||
752 | mount -t cgroup -ocpuset X /dev/cpuset | |
753 | echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | |
754 | ||
1da177e4 LT |
755 | 2.2 Adding/removing cpus |
756 | ------------------------ | |
757 | ||
758 | This is the syntax to use when writing in the cpus or mems files | |
759 | in cpuset directories: | |
760 | ||
761 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | |
762 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | |
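
Both forms above describe the same set. To make the equivalence
concrete, here is a small shell sketch that expands a list into
individual numbers. The helper name expand_list is hypothetical, and it
assumes well-formed input made only of comma-separated numbers and N-M
ranges:

```shell
# Hypothetical helper: expand "1-4" or "1,2,3,4" into "1 2 3 4".
expand_list() {
    out=""
    # Split the argument on commas.
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs
    for item in "$@"; do
        case $item in
            *-*)
                # A range N-M: emit every number from N to M.
                i=${item%-*}
                last=${item#*-}
                while [ "$i" -le "$last" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="$out $item"
                ;;
        esac
    done
    # Unquoted on purpose, to drop the leading space.
    echo $out
}

expand_list 1-4        # prints 1 2 3 4
expand_list 1,2,3,4    # prints 1 2 3 4
```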
763 | ||
764 | 2.3 Setting flags | |
765 | ----------------- | |
766 | ||
767 | The syntax is very simple: | |
768 | ||
769 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | |
770 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | |
771 | ||
772 | 2.4 Attaching processes | |
773 | ----------------------- | |
774 | ||
775 | # /bin/echo PID > tasks | |
776 | ||
777 | Note that it is PID, not PIDs. You can only attach ONE task at a time. | |
778 | If you have several tasks to attach, you have to do it one after another: | |
779 | ||
780 | # /bin/echo PID1 > tasks | |
781 | # /bin/echo PID2 > tasks | |
782 | ... | |
783 | # /bin/echo PIDn > tasks | |
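
The one-pid-per-write rule lends itself to a loop. A minimal sketch,
assuming a hypothetical helper named attach_pids whose first argument
is the path of a cpuset's 'tasks' file and whose remaining arguments
are the pids to attach:

```shell
# Hypothetical helper: attach_pids TASKS_FILE PID...
attach_pids() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        # One write per pid, so a failure can be pinned on that pid.
        if ! /bin/echo "$pid" > "$tasks_file"; then
            echo "failed to attach pid $pid" >&2
            return 1
        fi
    done
}
```

For instance, 'attach_pids /dev/cpuset/my_cpuset/tasks PID1 PID2' would
attach PID1 first and stop at the first pid that cannot be attached.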
784 | ||
785 | ||
786 | 3. Questions | |
787 | ============ | |
788 | ||
789 | Q: What's up with this '/bin/echo' ? | |
790 | A: bash's builtin 'echo' command does not check calls to write() against | |
791 | errors. If you use it in the cpuset file system, you won't be | |
792 | able to tell whether a command succeeded or failed. | |
793 | ||
794 | Q: When I attach processes, only the first one on the line really gets attached ! | |
795 | A: We can only return one error code per call to write(). So you should also | |
796 | put only ONE pid. | |
797 | ||
798 | 4. Contact | |
799 | ========== | |
800 | ||
801 | Web: http://www.bullopensource.org/cpuset |