Commit | Line | Data |
---|---|---|
42b88e6a LS |
1 | |
2 | What is Linux Memory Policy? | |
3 | ||
4 | In the Linux kernel, "memory policy" determines from which node the kernel will | |
5 | allocate memory in a NUMA system or in an emulated NUMA system. Linux has | |
6 | supported platforms with Non-Uniform Memory Access architectures since 2.4.?. | |
7 | The current memory policy support was added to Linux 2.6 around May 2004. This | |
8 | document attempts to describe the concepts and APIs of the 2.6 memory policy | |
9 | support. | |
10 | ||
11 | Memory policies should not be confused with cpusets (Documentation/cpusets.txt) | |
12 | which is an administrative mechanism for restricting the nodes from which | |
13 | memory may be allocated by a set of processes. Memory policies are a | |
14 | programming interface that a NUMA-aware application can take advantage of. When | |
15 | both cpusets and policies are applied to a task, the restrictions of the cpuset | |
16 | take priority. See "MEMORY POLICIES AND CPUSETS" below for more details. | |
17 | ||
18 | MEMORY POLICY CONCEPTS | |
19 | ||
20 | Scope of Memory Policies | |
21 | ||
22 | The Linux kernel supports _scopes_ of memory policy, described here from | |
23 | most general to most specific: | |
24 | ||
25 | System Default Policy: this policy is "hard coded" into the kernel. It | |
26 | is the policy that governs all page allocations that aren't controlled | |
27 | by one of the more specific policy scopes discussed below. When the | |
28 | system is "up and running", the system default policy will use "local | |
29 | allocation" described below. However, during boot up, the system | |
30 | default policy will be set to interleave allocations across all nodes | |
31 | with "sufficient" memory, so as not to overload the initial boot node | |
32 | with boot-time allocations. | |
33 | ||
34 | Task/Process Policy: this is an optional, per-task policy. When defined | |
35 | for a specific task, this policy controls all page allocations made by or | |
36 | on behalf of the task that aren't controlled by a more specific scope. | |
37 | If a task does not define a task policy, then all page allocations that | |
38 | would have been controlled by the task policy "fall back" to the System | |
39 | Default Policy. | |
40 | ||
41 | The task policy applies to the entire address space of a task. Thus, | |
42 | it is inheritable, and indeed is inherited, across both fork() | |
43 | [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task | |
44 | to establish the task policy for a child task exec()'d from an | |
45 | executable image that has no awareness of memory policy. See the | |
46 | MEMORY POLICY APIs section, below, for an overview of the system call | |
47 | that a task may use to set/change its task/process policy. | |
48 | ||
49 | In a multi-threaded task, task policies apply only to the thread | |
50 | [Linux kernel task] that installs the policy and any threads | |
51 | subsequently created by that thread. Any sibling threads existing | |
52 | at the time a new task policy is installed retain their current | |
53 | policy. | |
54 | ||
55 | A task policy applies only to pages allocated after the policy is | |
56 | installed. Any pages already faulted in by the task when the task | |
57 | changes its task policy remain where they were allocated based on | |
58 | the policy at the time they were allocated. | |
59 | ||
60 | VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's | |
61 | virtual address space. A task may define a specific policy for a range | |
62 | of its virtual address space. See the MEMORY POLICY APIs section, | |
63 | below, for an overview of the mbind() system call used to set a VMA | |
64 | policy. | |
65 | ||
66 | A VMA policy will govern the allocation of pages that back this region of | |
67 | the address space. Any regions of the task's address space that don't | |
68 | have an explicit VMA policy will fall back to the task policy, which may | |
69 | itself fall back to the System Default Policy. | |
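The fallback chain between scopes can be sketched in C as follows. This is an illustrative model only: `struct policy`, `system_default_policy` and `effective_policy()` are hypothetical names chosen for this sketch, not kernel interfaces, and the special handling of file mappings described below is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a policy; not the kernel's struct mempolicy. */
struct policy {
    const char *name;
};

/* Stand-in for the hard-coded System Default Policy. */
static const struct policy system_default_policy = { "system default" };

/*
 * Pick the policy that governs an allocation: the VMA policy if one is
 * installed, else the task policy, else the System Default Policy.
 * NULL means "no policy installed at this scope".
 */
const struct policy *effective_policy(const struct policy *vma_pol,
                                      const struct policy *task_pol)
{
    if (vma_pol != NULL)
        return vma_pol;
    if (task_pol != NULL)
        return task_pol;
    return &system_default_policy;
}
```

Calling `effective_policy(NULL, NULL)` models a task with no policy of its own, which ends up with the System Default Policy.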
70 | ||
71 | VMA policies have a few complicating details: | |
72 | ||
73 | VMA policy applies ONLY to anonymous pages. These include pages | |
74 | allocated for anonymous segments, such as the task stack and heap, and | |
75 | any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. | |
76 | If a VMA policy is applied to a file mapping, it will be ignored if | |
77 | the mapping used the MAP_SHARED flag. If the file mapping used the | |
78 | MAP_PRIVATE flag, the VMA policy will only be applied when an | |
79 | anonymous page is allocated on an attempt to write to the mapping-- | |
80 | i.e., at Copy-On-Write. | |
81 | ||
82 | VMA policies are shared between all tasks that share a virtual address | |
83 | space--a.k.a. threads--independent of when the policy is installed; and | |
84 | they are inherited across fork(). However, because VMA policies refer | |
85 | to a specific region of a task's address space, and because the address | |
86 | space is discarded and recreated on exec*(), VMA policies are NOT | |
87 | inheritable across exec(). Thus, only NUMA-aware applications may | |
88 | use VMA policies. | |
89 | ||
90 | A task may install a new VMA policy on a sub-range of a previously | |
91 | mmap()ed region. When this happens, Linux splits the existing virtual | |
92 | memory area into 2 or 3 VMAs, each with its own policy. | |
93 | ||
94 | By default, VMA policy applies only to pages allocated after the policy | |
95 | is installed. Any pages already faulted into the VMA range remain | |
96 | where they were allocated based on the policy at the time they were | |
97 | allocated. However, since 2.6.16, Linux supports page migration via | |
98 | the mbind() system call, so that page contents can be moved to match | |
99 | a newly installed policy. | |
100 | ||
101 | Shared Policy: Conceptually, shared policies apply to "memory objects" | |
102 | mapped shared into one or more tasks' distinct address spaces. An | |
103 | application installs shared policies the same way as VMA policies--using | |
104 | the mbind() system call specifying a range of virtual addresses that map | |
105 | the shared object. However, unlike VMA policies, which can be considered | |
106 | to be an attribute of a range of a task's address space, shared policies | |
107 | apply directly to the shared object. Thus, all tasks that attach to the | |
108 | object share the policy, and all pages allocated for the shared object, | |
109 | by any task, will obey the shared policy. | |
110 | ||
111 | As of 2.6.22, only shared memory segments, created by shmget() or | |
112 | mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared | |
113 | policy support was added to Linux, the associated data structures were | |
114 | added to hugetlbfs shmem segments. At the time, hugetlbfs did not | |
115 | support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs | |
116 | shmem segments were never "hooked up" to the shared policy support. | |
117 | Although hugetlbfs segments now support lazy allocation, their support | |
118 | for shared policy has not been completed. | |
119 | ||
120 | As mentioned above [re: VMA policies], allocations of page cache | |
121 | pages for regular files mmap()ed with MAP_SHARED ignore any VMA | |
122 | policy installed on the virtual address range backed by the shared | |
123 | file mapping. Rather, shared page cache pages, including pages backing | |
124 | private mappings that have not yet been written by the task, follow | |
125 | task policy, if any, else System Default Policy. | |
126 | ||
127 | The shared policy infrastructure supports different policies on subset | |
128 | ranges of the shared object. However, Linux still splits the VMA of | |
129 | the task that installs the policy for each range of distinct policy. | |
130 | Thus, different tasks that attach to a shared memory segment can have | |
131 | different VMA configurations mapping that one shared object. This | |
132 | can be seen by examining the /proc/<pid>/numa_maps of tasks sharing | |
133 | a shared memory region, when one task has installed shared policy on | |
134 | one or more ranges of the region. | |
135 | ||
136 | Components of Memory Policies | |
137 | ||
65d66fc0 DR |
138 | A Linux memory policy consists of a "mode", optional mode flags, and an |
139 | optional set of nodes. The mode determines the behavior of the policy, | |
140 | the optional mode flags determine the behavior of the mode, and the | |
141 | optional set of nodes can be viewed as the arguments to the policy | |
142 | behavior. | |
42b88e6a LS |
143 | |
144 | Internally, memory policies are implemented by a reference counted | |
145 | structure, struct mempolicy. Details of this structure will be discussed | |
146 | in context, below, as required to explain the behavior. | |
147 | ||
42b88e6a LS |
148 | Linux memory policy supports the following 4 behavioral modes: |
149 | ||
150 | Default Mode--MPOL_DEFAULT: The behavior specified by this mode is | |
151 | context or scope dependent. | |
152 | ||
153 | As mentioned in the Policy Scope section above, during normal | |
154 | system operation, the System Default Policy is hard coded to | |
155 | contain the Default mode. | |
156 | ||
157 | In this context, default mode means "local" allocation--that is | |
158 | attempt to allocate the page from the node associated with the cpu | |
159 | where the fault occurs. If the "local" node has no memory, or the | |
160 | node's memory is exhausted [no free pages available], local | |
161 | allocation will "fall back to"--attempt to allocate pages from-- | |
162 | "nearby" nodes, in order of increasing "distance". | |
163 | ||
164 | Implementation detail -- subject to change: "Fallback" uses | |
165 | a per node list of sibling nodes--called zonelists--built at | |
166 | boot time, or when nodes or memory are added or removed from | |
167 | the system [memory hotplug]. These per node zonelists are | |
168 | constructed with nodes in order of increasing distance based | |
169 | on information provided by the platform firmware. | |
170 | ||
171 | When a task/process policy or a shared policy contains the Default | |
172 | mode, this also means "local allocation", as described above. | |
173 | ||
174 | In the context of a VMA, Default mode means "fall back to task | |
175 | policy"--which may or may not specify Default mode. Thus, Default | |
176 | mode cannot be counted on to mean local allocation when used | |
177 | on a non-shared region of the address space. However, see | |
178 | MPOL_PREFERRED below. | |
179 | ||
65d66fc0 DR |
180 | It is an error for the set of nodes specified for this policy to |
181 | be non-empty. | |
42b88e6a LS |
182 | |
183 | MPOL_BIND: This mode specifies that memory must come from the | |
19770b32 MG |
184 | set of nodes specified by the policy. Memory will be allocated from |
185 | the node in the set with sufficient free memory that is closest to | |
186 | the node where the allocation takes place. | |
42b88e6a LS |
187 | |
188 | MPOL_PREFERRED: This mode specifies that the allocation should be | |
189 | attempted from the single node specified in the policy. If that | |
190 | allocation fails, the kernel will search other nodes, exactly as | |
191 | it would for a local allocation that started at the preferred node | |
192 | in increasing distance from the preferred node. "Local" allocation | |
193 | policy can be viewed as a Preferred policy that starts at the node | |
194 | containing the cpu where the allocation takes place. | |
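The search order for Preferred and local allocation can be sketched as a nearest-node lookup. This is a simplified model under stated assumptions: `MAX_NODES`, the `dist` table and the `free_pages` array are hypothetical inputs invented for this sketch, the distance from a node to itself is assumed to be the smallest entry in its row, and ties go to the lowest node id. The kernel itself walks prebuilt zonelists rather than scanning a table.

```c
#include <assert.h>

#define MAX_NODES 4  /* hypothetical node count for this sketch */

/*
 * Return the node to allocate from: 'start' if it has free pages,
 * otherwise the node with free pages at the smallest distance from
 * 'start'. Returns -1 if no node has free pages.
 */
int pick_node(int start, int dist[MAX_NODES][MAX_NODES],
              const unsigned long free_pages[MAX_NODES])
{
    int best = -1;

    assert(start >= 0 && start < MAX_NODES);
    for (int n = 0; n < MAX_NODES; n++) {
        if (free_pages[n] == 0)
            continue;               /* nothing to allocate here */
        if (best < 0 || dist[start][n] < dist[start][best])
            best = n;               /* strictly closer candidate */
    }
    return best;
}
```

Passing the preferred node as `start` models MPOL_PREFERRED; passing the node of the faulting cpu models local allocation.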
195 | ||
196 | Internally, the Preferred policy uses a single node--the | |
197 | preferred_node member of struct mempolicy. A "distinguished" | |
198 | value of this preferred_node, currently '-1', is interpreted | |
199 | as "the node containing the cpu where the allocation takes | |
200 | place"--local allocation. This is the way to specify | |
201 | local allocation for a specific range of addresses--i.e. for | |
202 | VMA policies. | |
203 | ||
3e1f0645 DR |
204 | It is possible for the user to specify that local allocation is |
205 | always preferred by passing an empty nodemask with this mode. | |
206 | If an empty nodemask is passed, the policy cannot use the | |
207 | MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described | |
208 | below. | |
209 | ||
42b88e6a LS |
210 | MPOL_INTERLEAVE: This mode specifies that page allocations be |
211 | interleaved, on a page granularity, across the nodes specified in | |
212 | the policy. This mode also behaves slightly differently, based on | |
213 | the context where it is used: | |
214 | ||
215 | For allocation of anonymous pages and shared memory pages, | |
216 | Interleave mode indexes the set of nodes specified by the policy | |
217 | using the page offset of the faulting address into the segment | |
218 | [VMA] containing the address modulo the number of nodes specified | |
219 | by the policy. It then attempts to allocate a page, starting at | |
220 | the selected node, as if the node had been specified by a Preferred | |
221 | policy or had been selected by a local allocation. That is, | |
222 | allocation will follow the per node zonelist. | |
223 | ||
224 | For allocation of page cache pages, Interleave mode indexes the set | |
225 | of nodes specified by the policy using a node counter maintained | |
226 | per task. This counter wraps around to the lowest specified node | |
227 | after it reaches the highest specified node. This will tend to | |
228 | spread the pages out over the nodes specified by the policy based | |
229 | on the order in which they are allocated, rather than based on any | |
230 | page offset into an address range or file. During system boot up, | |
231 | the temporary interleaved system default policy works in this | |
232 | mode. | |
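The offset-based selection used for anonymous and shared memory pages can be sketched as follows. This is a minimal model, assuming the policy's nodes fit in one unsigned long; `node_count()`, `nth_node()` and `interleave_node()` are hypothetical helper names, and the sketch returns only the starting node, not the zonelist fallback that follows it.

```c
#include <assert.h>

/* Number of bits set in 'mask', i.e. the number of nodes in the policy. */
unsigned int node_count(unsigned long mask)
{
    unsigned int n = 0;

    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/* Node id of the index-th set bit in 'mask' (counting from 0), or -1. */
int nth_node(unsigned long mask, unsigned int index)
{
    for (int node = 0; mask != 0; node++, mask >>= 1) {
        if ((mask & 1UL) && index-- == 0)
            return node;
    }
    return -1;
}

/*
 * Starting node for a page at page offset 'pgoff' into its segment:
 * the offset modulo the number of nodes indexes the policy's node set.
 */
int interleave_node(unsigned long mask, unsigned long pgoff)
{
    unsigned int n = node_count(mask);

    assert(n > 0);
    return nth_node(mask, (unsigned int)(pgoff % n));
}
```

With nodes 1, 3 and 5 in the policy (mask 0x2A), successive page offsets start at nodes 1, 3, 5, 1, and so on.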
233 | ||
65d66fc0 DR |
234 | Linux memory policy supports the following optional mode flags: |
235 | ||
236 | MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by | |
237 | the user should not be remapped if the task or VMA's set of allowed | |
238 | nodes changes after the memory policy has been defined. | |
239 | ||
240 | Without this flag, anytime a mempolicy is rebound because of a | |
241 | change in the set of allowed nodes, the node (Preferred) or | |
242 | nodemask (Bind, Interleave) is remapped to the new set of | |
243 | allowed nodes. This may result in nodes being used that were | |
244 | previously undesired. | |
245 | ||
246 | With this flag, if the user-specified nodes overlap with the | |
247 | nodes allowed by the task's cpuset, then the memory policy is | |
248 | applied to their intersection. If the two sets of nodes do not | |
249 | overlap, the Default policy is used. | |
250 | ||
251 | For example, consider a task that is attached to a cpuset with | |
252 | mems 1-3 that sets an Interleave policy over the same set. If | |
253 | the cpuset's mems change to 3-5, the Interleave will now occur | |
254 | over nodes 3, 4, and 5. With this flag, however, since only node | |
255 | 3 is allowed from the user's nodemask, the "interleave" only | |
256 | occurs over that node. If no nodes from the user's nodemask are | |
257 | now allowed, the Default behavior is used. | |
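The difference between the two behaviors in this example can be modeled with plain bit masks. The sketch below is an approximation, not kernel code: `static_nodes()`, `remap_nodes()` and the helpers are hypothetical names, node sets are assumed to fit in one unsigned long, and the positional remap (the n-th node of the old allowed set maps to the n-th node of the new one) is an assumption about how the default rebind behaves.

```c
#include <assert.h>

#define NODE_BITS ((int)(8 * sizeof(unsigned long)))

/* Number of nodes in 'mask'. */
unsigned int mask_weight(unsigned long mask)
{
    unsigned int n = 0;

    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/* Node id of the index-th set bit in 'mask' (counting from 0), or -1. */
int mask_nth(unsigned long mask, unsigned int index)
{
    for (int node = 0; mask != 0; node++, mask >>= 1) {
        if ((mask & 1UL) && index-- == 0)
            return node;
    }
    return -1;
}

/* With MPOL_F_STATIC_NODES: user nodes that are still allowed.
 * An empty result stands for "Default behavior is used". */
unsigned long static_nodes(unsigned long user, unsigned long allowed)
{
    return user & allowed;
}

/* Without the flag: move each user node to the node holding the same
 * position in the new allowed set (assumed positional remap). */
unsigned long remap_nodes(unsigned long user, unsigned long old_allowed,
                          unsigned long new_allowed)
{
    unsigned long result = 0;
    unsigned int new_count = mask_weight(new_allowed);

    assert(new_count > 0);
    for (int node = 0; node < NODE_BITS; node++) {
        unsigned long bit = 1UL << node;

        if (!(user & bit))
            continue;
        if (!(old_allowed & bit)) {
            result |= bit;  /* not in the old set: left as is here */
            continue;
        }
        unsigned int pos = mask_weight(old_allowed & (bit - 1));
        result |= 1UL << mask_nth(new_allowed, pos % new_count);
    }
    return result;
}
```

For the example above, a user mask of nodes 1-3 (0x0E) and new mems 3-5 (0x38) give nodes 3-5 from `remap_nodes()` and only node 3 from `static_nodes()`.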
258 | ||
3e1f0645 DR |
259 | MPOL_F_STATIC_NODES cannot be combined with the |
260 | MPOL_F_RELATIVE_NODES flag. It also cannot be used for | |
261 | MPOL_PREFERRED policies that were created with an empty nodemask | |
262 | (local allocation). | |
65d66fc0 DR |
263 | |
264 | MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed | |
265 | by the user will be mapped relative to the task or VMA's | |
266 | set of allowed nodes. The kernel stores the user-passed nodemask, | |
267 | and if the allowed nodes change, then that original nodemask will | |
268 | be remapped relative to the new set of allowed nodes. | |
269 | ||
270 | Without this flag (and without MPOL_F_STATIC_NODES), anytime a | |
271 | mempolicy is rebound because of a change in the set of allowed | |
272 | nodes, the node (Preferred) or nodemask (Bind, Interleave) is | |
273 | remapped to the new set of allowed nodes. That remap may not | |
274 | preserve the relative nature of the user's passed nodemask to its | |
275 | set of allowed nodes upon successive rebinds: a nodemask of | |
276 | 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of | |
277 | allowed nodes is restored to its original state. | |
278 | ||
279 | With this flag, the remap is done so that the node numbers from | |
280 | the user's passed nodemask are relative to the set of allowed | |
281 | nodes. In other words, if nodes 0, 2, and 4 are set in the user's | |
282 | nodemask, the policy will be effected over the first (and in the | |
283 | Bind or Interleave case, the third and fifth) nodes in the set of | |
284 | allowed nodes. The nodemask passed by the user represents nodes | |
285 | relative to the task or VMA's set of allowed nodes. | |
286 | ||
287 | If the user's nodemask includes nodes that are outside the range | |
288 | of the new set of allowed nodes (for example, node 5 is set in | |
289 | the user's nodemask when the set of allowed nodes is only 0-3), | |
290 | then the remap wraps around to the beginning of the nodemask and, | |
291 | if not already set, sets the node in the mempolicy nodemask. | |
292 | ||
293 | For example, consider a task that is attached to a cpuset with | |
294 | mems 2-5 that sets an Interleave policy over the same set with | |
295 | MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the | |
296 | interleave now occurs over nodes 3,5-7. If the cpuset's mems | |
297 | then change to 0,2-3,5, then the interleave occurs over nodes | |
298 | 0,2-3,5. | |
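The relative mapping with wrap-around can be sketched with bit masks. This is a simplified model rather than the kernel implementation: `relative_nodes()`, `count_nodes()` and `node_at()` are hypothetical names, and node sets are assumed to fit in one unsigned long. Each node number in the user's mask is treated as an index into the allowed set, wrapping around when the index is past the end.

```c
#include <assert.h>

#define NODE_BITS ((int)(8 * sizeof(unsigned long)))

/* Number of nodes in 'mask'. */
unsigned int count_nodes(unsigned long mask)
{
    unsigned int n = 0;

    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/* Node id of the index-th set bit in 'mask' (counting from 0), or -1. */
int node_at(unsigned long mask, unsigned int index)
{
    for (int node = 0; mask != 0; node++, mask >>= 1) {
        if ((mask & 1UL) && index-- == 0)
            return node;
    }
    return -1;
}

/*
 * Map relative node numbers in 'user' onto the 'allowed' set: relative
 * node r selects the (r modulo number-of-allowed-nodes)-th allowed node.
 */
unsigned long relative_nodes(unsigned long user, unsigned long allowed)
{
    unsigned long result = 0;
    unsigned int n = count_nodes(allowed);

    assert(n > 0);
    for (int rel = 0; rel < NODE_BITS; rel++) {
        if (user & (1UL << rel))
            result |= 1UL << node_at(allowed, (unsigned int)rel % n);
    }
    return result;
}
```

With relative nodes 0, 2 and 4 (mask 0x15) and allowed nodes 3-7 (mask 0xF8), the sketch selects the first, third and fifth allowed nodes: 3, 5 and 7. With relative node 5 and allowed nodes 0-3, the index wraps around to node 1.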
299 | ||
300 | Thanks to the consistent remapping, applications preparing | |
301 | nodemasks to specify memory policies using this flag should | |
302 | disregard their current, actual cpuset imposed memory placement | |
303 | and prepare the nodemask as if they were always located on | |
304 | memory nodes 0 to N-1, where N is the number of memory nodes the | |
305 | policy is intended to manage. Let the kernel then remap to the | |
306 | set of memory nodes allowed by the task's cpuset, as that may | |
307 | change over time. | |
308 | ||
3e1f0645 DR |
309 | MPOL_F_RELATIVE_NODES cannot be combined with the |
310 | MPOL_F_STATIC_NODES flag. It also cannot be used for | |
311 | MPOL_PREFERRED policies that were created with an empty nodemask | |
312 | (local allocation). | |
65d66fc0 | 313 | |
42b88e6a LS |
314 | MEMORY POLICY APIs |
315 | ||
316 | Linux supports 3 system calls for controlling memory policy. These APIs | |
317 | always affect only the calling task, the calling task's address space, or | |
318 | some shared object mapped into the calling task's address space. | |
319 | ||
320 | Note: the headers that define these APIs and the parameter data types | |
321 | for user space applications reside in a package that is not part of | |
322 | the Linux kernel. The kernel system call interfaces, with the 'sys_' | |
323 | prefix, are defined in <linux/syscalls.h>; the mode and flag | |
324 | definitions are defined in <linux/mempolicy.h>. | |
325 | ||
326 | Set [Task] Memory Policy: | |
327 | ||
328 | long set_mempolicy(int mode, const unsigned long *nmask, | |
329 | unsigned long maxnode); | |
330 | ||
331 | Sets the calling task's "task/process memory policy" to the mode | |
332 | specified by the 'mode' argument and the set of nodes defined | |
333 | by 'nmask'. 'nmask' points to a bit mask of node ids containing | |
65d66fc0 DR |
334 | at least 'maxnode' ids. Optional mode flags may be passed by |
335 | combining the 'mode' argument with the flag (for example: | |
336 | MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). | |
42b88e6a LS |
337 | |
338 | See the set_mempolicy(2) man page for more details. | |
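A minimal sketch of a call follows. It invokes the system call directly through syscall(2) so that it does not depend on the separately packaged wrapper library; `set_interleave_policy()` is a hypothetical helper name, and the node set is assumed to fit in one unsigned long. The call fails with -1 and errno set if, for example, a requested node is not available to the task or the kernel lacks NUMA support.

```c
#define _GNU_SOURCE
#include <linux/mempolicy.h>   /* MPOL_* mode and flag definitions */
#include <sys/syscall.h>       /* SYS_set_mempolicy */
#include <unistd.h>            /* syscall() */

/*
 * Install an Interleave task policy over the nodes set in 'nmask'.
 * 'mode_flags' is 0 or an optional mode flag such as
 * MPOL_F_STATIC_NODES. Returns 0 on success, -1 on failure.
 */
long set_interleave_policy(unsigned long nmask, int mode_flags)
{
    /* Upper bound on the number of node ids described by nmask. */
    unsigned long maxnode = 8 * sizeof(nmask);

    return syscall(SYS_set_mempolicy, (long)(MPOL_INTERLEAVE | mode_flags),
                   &nmask, maxnode);
}
```

For example, `set_interleave_policy(0x3UL, MPOL_F_STATIC_NODES)` requests interleaving over nodes 0 and 1 with a static nodemask.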
339 | ||
340 | ||
341 | Get [Task] Memory Policy or Related Information | |
342 | ||
343 | long get_mempolicy(int *mode, | |
344 | unsigned long *nmask, unsigned long maxnode, | |
345 | void *addr, int flags); | |
346 | ||
347 | Queries the "task/process memory policy" of the calling task, or | |
348 | the policy or location of a specified virtual address, depending | |
349 | on the 'flags' argument. | |
350 | ||
351 | See the get_mempolicy(2) man page for more details. | |
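A minimal sketch of querying the calling task's policy mode follows, again through syscall(2) rather than the packaged wrappers. `query_task_policy_mode()` is a hypothetical helper name; the sketch passes no nodemask, no address and no flags, which is assumed to be enough to retrieve the task policy mode alone.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>       /* SYS_get_mempolicy */
#include <unistd.h>            /* syscall() */

/*
 * Store the calling task's policy mode in '*mode'.
 * Returns 0 on success, -1 with errno set on failure.
 */
long query_task_policy_mode(int *mode)
{
    return syscall(SYS_get_mempolicy, mode, NULL, 0UL, NULL, 0UL);
}
```

On success the value stored in `*mode` can be compared against the MPOL_* definitions from <linux/mempolicy.h>.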
352 | ||
353 | ||
354 | Install VMA/Shared Policy for a Range of Task's Address Space | |
355 | ||
356 | long mbind(void *start, unsigned long len, int mode, | |
357 | const unsigned long *nmask, unsigned long maxnode, | |
358 | unsigned flags); | |
359 | ||
360 | mbind() installs the policy specified by (mode, nmask, maxnode) as | |
361 | a VMA policy for the range of the calling task's address space | |
362 | specified by the 'start' and 'len' arguments. Additional actions | |
363 | may be requested via the 'flags' argument. | |
364 | ||
365 | See the mbind(2) man page for more details. | |
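A minimal sketch of installing a VMA policy on a fresh anonymous mapping follows. It calls mbind through syscall(2); `map_preferring_node0()` is a hypothetical helper name, and the choice of MPOL_PREFERRED on node 0 is only an example. Because the policy is installed before any page is touched, pages later faulted into the range follow it.

```c
#define _GNU_SOURCE
#include <linux/mempolicy.h>   /* MPOL_PREFERRED */
#include <stddef.h>
#include <sys/mman.h>          /* mmap(), munmap() */
#include <sys/syscall.h>       /* SYS_mbind */
#include <unistd.h>            /* syscall() */

/*
 * Map 'len' bytes of anonymous memory and install a VMA policy that
 * prefers node 0 for the range. Returns the mapping, or NULL if
 * mmap() or mbind() fails.
 */
void *map_preferring_node0(size_t len)
{
    unsigned long nmask = 1UL << 0;             /* node 0 only */
    unsigned long maxnode = 8 * sizeof(nmask);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    if (syscall(SYS_mbind, p, (unsigned long)len, (long)MPOL_PREFERRED,
                &nmask, maxnode, 0UL) != 0) {
        munmap(p, len);                         /* do not leak the mapping */
        return NULL;
    }
    return p;
}
```

The caller releases the region with munmap() when done. A NULL return only means the sketch could not set things up, for example on a kernel without NUMA support.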
366 | ||
367 | MEMORY POLICY COMMAND LINE INTERFACE | |
368 | ||
369 | Although not strictly part of the Linux implementation of memory policy, | |
370 | a command line tool, numactl(8), exists that allows one to: | |
371 | ||
372 | + set the task policy for a specified program via set_mempolicy(2), fork(2) and | |
373 | exec(2) | |
374 | ||
375 | + set the shared policy for a shared memory segment via mbind(2) | |
376 | ||
377 | The numactl(8) tool is packaged with the run-time version of the library | |
378 | containing the memory policy system call wrappers. Some distributions | |
379 | package the headers and compile-time libraries in a separate development | |
380 | package. | |
381 | ||
382 | ||
383 | MEMORY POLICIES AND CPUSETS | |
384 | ||
385 | Memory policies work within cpusets as described above. For memory policies | |
386 | that require a node or set of nodes, the nodes are restricted to the set of | |
754af6f5 | 387 | nodes whose memories are allowed by the cpuset constraints. If the nodemask |
65d66fc0 DR |
388 | specified for the policy contains nodes that are not allowed by the cpuset and |
389 | MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes | |
390 | specified for the policy and the set of nodes with memory is used. If the | |
391 | result is the empty set, the policy is considered invalid and cannot be | |
392 | installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped | |
393 | onto and folded into the task's set of allowed nodes as previously described. | |
394 | ||
395 | The interaction of memory policies and cpusets can be problematic when tasks | |
396 | in two cpusets share access to a memory region, such as shared memory segments | |
397 | created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If | |
398 | any of the tasks install shared policy on the region, only nodes whose | |
399 | memories are allowed in both cpusets may be used in the policies. Obtaining | |
400 | this information requires "stepping outside" the memory policy APIs to use the | |
401 | cpuset information and requires that one know in what cpusets other tasks might | |
402 | be attaching to the shared region. Furthermore, if the cpusets' allowed | |
403 | memory sets are disjoint, "local" allocation is the only valid policy. |