Commit | Line | Data |
---|---|---|
1 | |
2 | This document describes the Linux memory management "Unevictable LRU" | |
3 | infrastructure and the use of this infrastructure to manage several types | |
4 | of "unevictable" pages. The document attempts to provide the overall | |
5 | rationale behind this mechanism and the rationale for some of the design | |
6 | decisions that drove the implementation. The latter design rationale is | |
7 | discussed in the context of an implementation description. Admittedly, one | |
8 | can obtain the implementation details--the "what does it do?"--by reading the | |
9 | code. One hopes that the descriptions below add value by providing the answer | |
10 | to "why does it do that?". | |
11 | ||
12 | Unevictable LRU Infrastructure: | |
13 | ||
14 | The Unevictable LRU adds an additional LRU list to track unevictable pages | |
15 | and to hide these pages from vmscan. This mechanism is based on a patch by | |
16 | Larry Woodman of Red Hat to address several scalability problems with page | |
17 | reclaim in Linux. The problems have been observed at customer sites on large | |
18 | memory x86_64 systems. For example, a non-NUMA x86_64 platform with 128GB | |
19 | of main memory will have over 32 million 4k pages in a single zone. When a | |
20 | large fraction of these pages are not evictable for any reason [see below], | |
21 | vmscan will spend a lot of time scanning the LRU lists looking for the small | |
22 | fraction of pages that are evictable. This can result in a situation where | |
23 | all cpus are spending 100% of their time in vmscan for hours or days on end, | |
24 | with the system completely unresponsive. | |
25 | ||
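The page count quoted above can be checked with a little arithmetic. The sketch below is only an illustration; the helper name pages_in_zone() is hypothetical and is not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): number of whole base pages
 * needed to cover mem_bytes of memory with pages of page_bytes each.
 */
static uint64_t pages_in_zone(uint64_t mem_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return mem_bytes / page_bytes;
}
```

For 128GB of memory and 4k pages, pages_in_zone(128ULL << 30, 4096) gives 2^25 pages, roughly 33.5 million entries for vmscan to walk if they all sit on one zone's LRU lists.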
26 | The Unevictable LRU infrastructure addresses the following classes of | |
27 | unevictable pages: | |
28 | ||
29 | + pages owned by ramfs | |
30 | + pages mapped into SHM_LOCKed shared memory regions | |
31 | + pages mapped into VM_LOCKED [mlock()ed] vmas | |
32 | ||
33 | The infrastructure might be able to handle other conditions that make pages | |
34 | unevictable, either by definition or by circumstance, in the future. | |
35 | ||
36 | ||
37 | The Unevictable LRU List | |
38 | ||
39 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list | |
40 | called the "unevictable" list and an associated page flag, PG_unevictable, to | |
41 | indicate that the page is being managed on the unevictable list. The | |
42 | PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active | |
43 | flag in that it indicates on which LRU list a page resides when PG_lru is set. | |
44 | The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU | |
45 | Kconfig option. | |
46 | ||
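The relationship between PG_lru, PG_active and PG_unevictable can be sketched as a small user-space model. This is not the kernel's code: struct model_page, enum model_lru and model_page_lru() are names invented for this illustration, and the model ignores the file/anon split of the evictable lists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical list identifiers for this model only. */
enum model_lru { MODEL_LRU_NONE, MODEL_LRU_INACTIVE,
                 MODEL_LRU_ACTIVE, MODEL_LRU_UNEVICTABLE };

/* Hypothetical stand-in for the relevant struct page flags. */
struct model_page {
    bool lru;          /* models PG_lru: page is on some LRU list */
    bool active;       /* models PG_active */
    bool unevictable;  /* models PG_unevictable */
};

/* Decide which list a page is on, based only on its flags. */
static enum model_lru model_page_lru(const struct model_page *p)
{
    if (!p->lru)
        return MODEL_LRU_NONE;
    /* The two flags are mutually exclusive, as described above. */
    assert(!(p->active && p->unevictable));
    if (p->unevictable)
        return MODEL_LRU_UNEVICTABLE;
    return p->active ? MODEL_LRU_ACTIVE : MODEL_LRU_INACTIVE;
}
```

The point of the model is that the flags only have meaning while PG_lru is set; a page that is isolated from the LRU reports no list at all.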
47 | The Unevictable LRU infrastructure maintains unevictable pages on an additional | |
48 | LRU list for a few reasons: | |
49 | ||
50 | 1) We get to "treat unevictable pages just like we treat other pages in the | |
51 | system, which means we get to use the same code to manipulate them, the | |
52 | same code to isolate them (for migrate, etc.), the same code to keep track | |
53 | of the statistics, etc..." [Rik van Riel] | |
54 | ||
55 | 2) We want to be able to migrate unevictable pages between nodes--for memory | |
56 | defragmentation, workload management and memory hotplug. The linux kernel | |
57 | can only migrate pages that it can successfully isolate from the lru lists. | |
58 | If we were to maintain pages elsewise than on an lru-like list, where they | |
59 | can be found by isolate_lru_page(), we would prevent their migration, unless | |
60 | we reworked migration code to find the unevictable pages. | |
61 | ||
62 | ||
63 | The unevictable LRU list does not differentiate between file backed and swap | |
64 | backed [anon] pages. This differentiation is only important while the pages | |
65 | are, in fact, evictable. | |
66 | ||
67 | The unevictable LRU list benefits from the "arrayification" of the per-zone | |
68 | LRU lists and statistics originally proposed and posted by Christoph Lameter. | |
69 | ||
70 | The unevictable list does not use the lru pagevec mechanism. Rather, | |
71 | unevictable pages are placed directly on the page's zone's unevictable | |
72 | list under the zone lru_lock. The reason for this is to prevent stranding | |
73 | of pages on the unevictable list when one task has the page isolated from the | |
74 | lru and other tasks are changing the "evictability" state of the page. | |
75 | ||
76 | ||
77 | Unevictable LRU and Memory Controller Interaction | |
78 | ||
79 | The memory controller data structure automatically gets a per zone unevictable | |
80 | lru list as a result of the "arrayification" of the per-zone LRU lists. The | |
81 | memory controller tracks the movement of pages to and from the unevictable list. | |
82 | When a memory control group comes under memory pressure, the controller will | |
83 | not attempt to reclaim pages on the unevictable list. This has a couple of | |
84 | effects. Because the pages are "hidden" from reclaim on the unevictable list, | |
85 | the reclaim process can be more efficient, dealing only with pages that have | |
86 | a chance of being reclaimed. On the other hand, if too many of the pages | |
87 | charged to the control group are unevictable, the evictable portion of the | |
88 | working set of the tasks in the control group may not fit into the available | |
89 | memory. This can cause the control group to thrash or to oom-kill tasks. | |
90 | ||
91 | ||
92 | Unevictable LRU: Detecting Unevictable Pages | |
93 | ||
94 | The function page_evictable(page, vma) in vmscan.c determines whether a | |
95 | page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, | |
96 | page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's | |
97 | address space using a wrapper function. Wrapper functions are used to set, | |
98 | clear and test the flag to reduce the requirement for #ifdef's throughout the | |
99 | source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. | |
100 | This flag remains for the life of the inode. | |
101 | ||
102 | For shared memory regions, AS_UNEVICTABLE is set when an application | |
103 | successfully SHM_LOCKs the region and is removed when the region is | |
104 | SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page | |
105 | tables for the region as does, for example, mlock(). So, we make no special | |
106 | effort to push any pages in the SHM_LOCKed region to the unevictable list. | |
107 | Vmscan will do this when/if it encounters the pages during reclaim. On | |
108 | SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the | |
109 | unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed | |
110 | region is destroyed, the pages are also "rescued" from the unevictable list in | |
111 | the process of freeing them. | |
112 | ||
113 | page_evictable() detects mlock()ed pages by testing an additional page flag, | |
114 | PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a | |
115 | non-NULL vma is supplied, page_evictable() will check whether the vma is | |
116 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and | |
117 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | |
118 | efficient "culling" of pages in the fault path that are being faulted in to | |
119 | VM_LOCKED vmas. | |
120 | ||
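The checks described in this section can be summarized in a small user-space model. It is a sketch under stated assumptions, not the kernel implementation: the structs, the nr_mlock counter and the model_ prefixed functions are all hypothetical stand-ins for the address space flag, the page flag and the vma flag named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mapping { bool as_unevictable; };  /* models AS_UNEVICTABLE */
struct model_vma     { bool vm_locked; };       /* models VM_LOCKED */
struct model_page {
    struct model_mapping *mapping;  /* NULL if the page has no mapping */
    bool mlocked;                   /* models PG_mlocked */
};

static long nr_mlock;  /* models the mlocked-page statistic */

/* Models is_mlocked_vma(): mark the page mlocked if the vma is locked. */
static bool model_is_mlocked_vma(const struct model_vma *vma,
                                 struct model_page *page)
{
    if (!vma->vm_locked)
        return false;
    if (!page->mlocked) {
        page->mlocked = true;
        nr_mlock++;
    }
    return true;
}

/* Models page_evictable(page, vma); vma may be NULL. */
static bool model_page_evictable(struct model_page *page,
                                 const struct model_vma *vma)
{
    if (page->mapping && page->mapping->as_unevictable)
        return false;   /* ramfs or SHM_LOCKed region */
    if (page->mlocked)
        return false;   /* already noticed as mlocked */
    if (vma && model_is_mlocked_vma(vma, page))
        return false;   /* culled while faulting into a locked vma */
    return true;
}
```

Note the side effect in the last test: asking whether a page is evictable on behalf of a VM_LOCKED vma also marks the page mlocked, which is what makes culling in the fault path cheap.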
121 | ||
122 | Unevictable Pages and Vmscan [shrink_*_list()] | |
123 | ||
124 | If unevictable pages are culled in the fault path, or moved to the unevictable | |
125 | list at mlock() or mmap() time, vmscan will never encounter the pages until | |
126 | they have become evictable again, for example, via munlock() and have been | |
127 | "rescued" from the unevictable list. However, there may be situations where we | |
128 | decide, for the sake of expediency, to leave an unevictable page on one of the | |
129 | regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for | |
130 | such pages in all of the shrink_{active|inactive|page}_list() functions and | |
131 | will "cull" such pages that it encounters--that is, it diverts those pages to | |
132 | the unevictable list for the zone being scanned. | |
133 | ||
134 | There may be situations where a page is mapped into a VM_LOCKED vma, but the | |
135 | page is not marked as PageMlocked. Such pages will make it all the way to | |
136 | shrink_page_list() where they will be detected when vmscan walks the reverse | |
137 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() | |
138 | will cull the page at that point. | |
139 | ||
140 | To "cull" an unevictable page, vmscan simply puts the page back on the lru |
141 | list using putback_lru_page()--the inverse operation to isolate_lru_page()-- | |
142 | after dropping the page lock. Because the condition which makes the page | |
143 | unevictable may change once the page is unlocked, putback_lru_page() will | |
144 | recheck the unevictable state of a page that it places on the unevictable lru | |
145 | list. If the page has become evictable, putback_lru_page() removes it from | |
146 | the list and retries, including the page_evictable() test. Because such a | |
147 | race is a rare event and movement of pages onto the unevictable list should be | |
148 | rare, these extra evictability checks should not occur in the majority of calls | |
149 | to putback_lru_page(). | |
150 | ||
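The recheck-and-retry behavior of putback_lru_page() can be modeled in user space as follows. This is a sketch of the idea as we read it, not the kernel function: the types, the "pinned" field standing in for whatever makes a page unevictable, and the race hook used to simulate another task are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_list { MODEL_EVICTABLE_LIST, MODEL_UNEVICTABLE_LIST };

struct model_page {
    bool pinned;            /* true while some condition makes it unevictable */
    enum model_list list;   /* list the page was last placed on */
};

/* Hook simulating another task acting in the unlocked window. */
typedef void (*model_race_hook)(struct model_page *);

/* Example hook: a racing munlock removes the unevictable condition. */
static void model_racing_munlock(struct model_page *page)
{
    page->pinned = false;
}

/* Returns the number of placement attempts that were needed. */
static int model_putback_lru_page(struct model_page *page,
                                  model_race_hook hook)
{
    int attempts = 0;

    for (;;) {
        attempts++;
        page->list = page->pinned ? MODEL_UNEVICTABLE_LIST
                                  : MODEL_EVICTABLE_LIST;
        if (hook)
            hook(page);     /* state may change once the page is unlocked */
        /* Recheck: an evictable page must not stay on the unevictable list. */
        if (page->list == MODEL_UNEVICTABLE_LIST && !page->pinned)
            continue;       /* take it off again and retry */
        return attempts;
    }
}
```

With no racing task a single attempt suffices; with model_racing_munlock as the hook, the first placement is found to be stale and the second attempt puts the page on an evictable list.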
151 | ||
152 | Mlocked Pages: Prior Work | |
153 | ||
154 | The "Unevictable Mlocked Pages" infrastructure is based on work originally | |
155 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". | |
156 | Nick posted his patch as an alternative to a patch posted by Christoph | |
157 | Lameter to achieve the same objective--hiding mlocked pages from vmscan. | |
158 | In Nick's patch, he used one of the struct page lru list link fields as a count | |
159 | of VM_LOCKED vmas that map the page. This use of the link field for a count | |
160 | prevented the management of the pages on an LRU list. Thus, mlocked pages were | |
161 | not migratable as isolate_lru_page() could not find them and the lru list link | |
162 | field was not available to the migration subsystem. Nick resolved this by | |
163 | putting mlocked pages back on the lru list before attempting to isolate them, | |
164 | thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated | |
165 | with the Unevictable LRU work, the count was replaced by walking the reverse | |
166 | map to determine whether any VM_LOCKED vmas mapped the page. More on this | |
167 | below. | |
168 | ||
169 | ||
170 | Mlocked Pages: Basic Management | |
171 | ||
172 | Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of | |
173 | unevictable pages. When such a page has been "noticed" by the memory | |
174 | management subsystem, the page is marked with the PG_mlocked [PageMlocked()] | |
175 | flag. A PageMlocked() page will be placed on the unevictable LRU list when | |
176 | it is added to the LRU. Pages can be "noticed" by memory management in | |
177 | several places: | |
178 | ||
179 | 1) in the mlock()/mlockall() system call handlers. | |
180 | 2) in the mmap() system call handler when mmap()ing a region with the | |
181 | MAP_LOCKED flag, or mmap()ing a region in a task that has called | |
182 | mlockall() with the MCL_FUTURE flag. Both of these conditions result | |
183 | in the VM_LOCKED flag being set for the vma. | |
184 | 3) in the fault path, if mlocked pages are "culled" in the fault path, | |
185 | and when a VM_LOCKED stack segment is expanded. | |
186 | 4) as mentioned above, in vmscan:shrink_page_list() when attempting to |
187 | reclaim a page in a VM_LOCKED vma via try_to_unmap(). | |
188 | |
189 | Mlocked pages become unlocked and rescued from the unevictable list when: | |
190 | ||
191 | 1) mapped in a range unlocked via the munlock()/munlockall() system calls. | |
192 | 2) munmapped() out of the last VM_LOCKED vma that maps the page, including | |
193 | unmapping at task exit. | |
194 | 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. | |
195 | 4) before a page is COWed in a VM_LOCKED vma. | |
196 | ||
197 | ||
198 | Mlocked Pages: mlock()/mlockall() System Call Handling | |
199 | ||
200 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() | |
201 | for each vma in the range specified by the call. In the case of mlockall(), | |
202 | this is the entire active address space of the task. Note that mlock_fixup() | |
203 | is used for both mlock()ing and munlock()ing a range of memory. A call to | |
204 | mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED | |
205 | is treated as a no-op--mlock_fixup() simply returns. | |
206 | ||
207 | If the vma passes some filtering described in "Mlocked Pages: Filtering Special Vmas" | |
208 | below, mlock_fixup() will attempt to merge the vma with its neighbors or split | |
209 | off a subset of the vma if the range does not cover the entire vma. Once the | |
210 | vma has been merged or split or neither, mlock_fixup() will call | |
211 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and | |
212 | to mark the pages as mlocked via mlock_vma_page(). | |
213 | ||
214 | Note that the vma being mlocked might be mapped with PROT_NONE. In this case, | |
215 | get_user_pages() will be unable to fault in the pages. That's OK. If pages | |
216 | do end up getting faulted into this VM_LOCKED vma, we'll handle them in the | |
217 | fault path or in vmscan. | |
218 | ||
219 | Also note that a page returned by get_user_pages() could be truncated or | |
220 | migrated out from under us, while we're trying to mlock it. To detect | |
221 | this, __mlock_vma_pages_range() tests the page_mapping after acquiring | |
222 | the page lock. If the page is still associated with its mapping, we'll | |
223 | go ahead and call mlock_vma_page(). If the mapping is gone, we just | |
224 | unlock the page and move on. Worst case, this results in a page mapped | |
225 | in a VM_LOCKED vma remaining on a normal LRU list without being | |
226 | PageMlocked(). Again, vmscan will detect and cull such pages. | |
227 | ||
228 | mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will | |
229 | TestSetPageMlocked() for each page returned by get_user_pages(). We use | |
230 | TestSetPageMlocked() because the page might already be mlocked by another | |
231 | task/vma and we don't want to do extra work. We especially do not want to | |
232 | count an mlocked page more than once in the statistics. If the page was | |
233 | already mlocked, mlock_vma_page() is done. | |
234 | ||
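The reason for using a test-and-set operation can be shown with a small user-space model. It is only a sketch: the real TestSetPageMlocked() is an atomic page flag operation, whereas the hypothetical model_test_set_mlocked() below is a plain, non-atomic stand-in, and nr_mlock is an invented counter.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page { bool mlocked; };  /* models PG_mlocked */

static long nr_mlock;  /* models the count of mlocked pages */

/* Set the flag and report whether it was already set (non-atomic model). */
static bool model_test_set_mlocked(struct model_page *page)
{
    bool was_set = page->mlocked;

    page->mlocked = true;
    return was_set;
}

/* Models the accounting step of mlock_vma_page(). */
static void model_mlock_vma_page(struct model_page *page)
{
    if (!model_test_set_mlocked(page))
        nr_mlock++;   /* count each page once, on the first mlock only */
}
```

Calling model_mlock_vma_page() twice on the same page, as two tasks mapping the same page might, leaves nr_mlock at 1.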
235 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the | |
236 | page from the LRU, as it is likely on the appropriate active or inactive list | |
237 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will | |
238 | putback the page--putback_lru_page()--which will notice that the page is now | |
239 | mlocked and divert the page to the zone's unevictable LRU list. If | |
240 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle | |
241 | it later if/when it attempts to reclaim the page. | |
242 | ||
243 | ||
244 | Mlocked Pages: Filtering Special Vmas | |
245 | ||
246 | mlock_fixup() filters several classes of "special" vmas: | |
247 | ||
248 | 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind | |
249 | these mappings are inherently pinned, so we don't need to mark them as | |
250 | mlocked. In any case, most of the pages have no struct page in which to | |
251 | so mark the page. Because of this, get_user_pages() will fail for these | |
252 | vmas, so there is no sense in attempting to visit them. | |
253 | ||
254 | 2) vmas mapping hugetlbfs pages are already effectively pinned into memory. | |
255 | We neither need nor want to mlock() these pages. However, to preserve the | |
256 | prior behavior of mlock()--before the unevictable/mlock changes-- | |
257 | mlock_fixup() will call make_pages_present() in the hugetlbfs vma range | |
258 | to allocate the huge pages and populate the ptes. | |
259 | ||
260 | 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of | |
261 | kernel pages, such as the vdso page, relay channel pages, etc. These pages | |
262 | are inherently unevictable and are not managed on the LRU lists. | |
263 | mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls | |
264 | make_pages_present() to populate the ptes. | |
265 | ||
266 | Note that for all of these special vmas, mlock_fixup() does not set the | |
267 | VM_LOCKED flag. Therefore, we won't have to deal with them later during | |
268 | munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() | |
269 | account these vmas against the task's "locked_vm". | |
270 | ||
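The three filtering rules above amount to a small decision function. The following user-space sketch illustrates that decision only; the flag bit values, the enum and the function name are hypothetical and do not match the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen for this model only. */
#define MODEL_VM_IO          (1u << 0)
#define MODEL_VM_PFNMAP      (1u << 1)
#define MODEL_VM_DONTEXPAND  (1u << 2)
#define MODEL_VM_RESERVED    (1u << 3)

enum model_mlock_action {
    MODEL_SKIP,           /* do not touch the vma at all */
    MODEL_MAKE_PRESENT,   /* populate ptes, but do not mlock the pages */
    MODEL_MLOCK_PAGES     /* normal vma: set VM_LOCKED and mlock the pages */
};

/* Classify a vma the way the filtering rules above describe. */
static enum model_mlock_action model_filter_vma(unsigned int flags,
                                                bool is_hugetlb)
{
    if (flags & (MODEL_VM_IO | MODEL_VM_PFNMAP))
        return MODEL_SKIP;
    if (is_hugetlb || (flags & (MODEL_VM_DONTEXPAND | MODEL_VM_RESERVED)))
        return MODEL_MAKE_PRESENT;
    return MODEL_MLOCK_PAGES;
}
```

Only the MODEL_MLOCK_PAGES case corresponds to a vma that ends up VM_LOCKED and accounted in locked_vm.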
271 | Mlocked Pages: Downgrading the Mmap Semaphore. | |
272 | ||
273 | mlock_fixup() must be called with the mmap semaphore held for write, because | |
274 | it may have to merge or split vmas. However, mlocking a large region of | |
275 | memory can take a long time--especially if vmscan must reclaim pages to | |
275 | satisfy the region's requirements. Faulting in a large region with the mmap | |
277 | semaphore held for write can hold off other faults on the address space, in | |
278 | the case of a multi-threaded task. It can also hold off scans of the task's | |
279 | address space via /proc. While testing under heavy load, it was observed that | |
280 | the ps(1) command could be held off for many minutes while a large segment was | |
281 | mlock()ed down. | |
282 | ||
283 | To address this issue, and to make the system more responsive during mlock()ing | |
284 | of large segments, mlock_fixup() downgrades the mmap semaphore to read mode | |
285 | during the call to __mlock_vma_pages_range(). This works fine. However, the | |
286 | callers of mlock_fixup() expect the semaphore to be returned in write mode. | |
287 | So, mlock_fixup() "upgrades" the semaphore to write mode. Linux does not | |
288 | support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore | |
289 | and reacquire it in write mode. In a multi-threaded task, it is possible for | |
290 | the task memory map to change while the semaphore is dropped. Therefore, | |
291 | mlock_fixup() looks up the vma at the range start address after reacquiring | |
292 | the semaphore in write mode and verifies that it still covers the original | |
293 | range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of | |
294 | mlock_fixup() have been changed to deal with this new error condition. | |
295 | ||
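The revalidation step after the semaphore is reacquired can be sketched in user space. This models only the "does a vma at the start address still cover the original range?" check; the vma array, model_find_vma() and model_revalidate_range() are hypothetical, and the lookup here is stricter than a real vma lookup in that it requires the address to lie inside the vma.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_vma { unsigned long start, end; };  /* covers [start, end) */

/* Find the vma containing addr, or NULL if none does. */
static const struct model_vma *model_find_vma(const struct model_vma *vmas,
                                              size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

/*
 * After "upgrading" the semaphore: return 0 if a single vma still covers
 * [start, end), or -EAGAIN if the address space changed underneath us.
 */
static int model_revalidate_range(const struct model_vma *vmas, size_t n,
                                  unsigned long start, unsigned long end)
{
    const struct model_vma *vma = model_find_vma(vmas, n, start);

    if (!vma || vma->end < end)
        return -EAGAIN;
    return 0;
}
```

If another thread split or unmapped the vma while the semaphore was dropped, the check fails and the caller sees -EAGAIN.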
296 | Note: when munlocking a region, all of the pages should already be resident-- | |
297 | unless we have racing threads mlocking() and munlocking() regions. So, | |
298 | unlocking should not have to wait for page allocations nor faults of any kind. | |
299 | Therefore mlock_fixup() does not downgrade the semaphore for munlock(). | |
300 | ||
301 | ||
302 | Mlocked Pages: munlock()/munlockall() System Call Handling | |
303 | ||
304 | The munlock() and munlockall() system calls are handled by the same functions-- | |
305 | do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock | |
306 | vs lock operation indicated by an argument. So, these system calls are also | |
307 | handled by mlock_fixup(). Again, if called for an already munlock()ed vma, | |
308 | mlock_fixup() simply returns. Because of the vma filtering discussed above, | |
309 | VM_LOCKED will not be set in any "special" vmas. So, these vmas will be | |
310 | ignored for munlock. | |
311 | ||
312 | If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off | |
313 | the specified range. The range is then munlocked via the function | |
314 | __mlock_vma_pages_range()--the same function used to mlock a vma range-- | |
315 | passing a flag to indicate that munlock() is being performed. | |
316 | ||
317 | Because the vma access protections could have been changed to PROT_NONE after | |
318 | faulting in and mlocking pages, get_user_pages() was unreliable for visiting | |
319 | these pages for munlocking. Because we don't want to leave pages mlocked(), | |
320 | get_user_pages() was enhanced to accept a flag to ignore the permissions when | |
321 | fetching the pages--all of which should be resident as a result of previous | |
322 | mlock()ing. | |
323 | ||
324 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling | |
325 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked | |
326 | flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() | |
327 | uses the Test*PageMlocked() function to handle the case where the page might | |
328 | have already been unlocked by another task. If the page was mlocked, | |
329 | munlock_vma_page() updates the zone statistics for the number of mlocked | |
330 | pages. Note, however, that at this point we haven't checked whether the page | |
331 | is mapped by other VM_LOCKED vmas. | |
332 | ||
333 | We can't call try_to_munlock(), the function that walks the reverse map to check | |
334 | for other VM_LOCKED vmas, without first isolating the page from the LRU. | |
335 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page | |
336 | not be on an lru list. [More on these below.] However, the call to | |
337 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). | |
338 | So, we go ahead and clear PG_mlocked up front, as this might be the only chance | |
339 | we have. If we can successfully isolate the page, we go ahead and | |
340 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone | |
341 | page statistics if it finds another vma holding the page mlocked. If we fail | |
342 | to isolate the page, we'll have left a potentially mlocked page on the LRU. | |
343 | This is fine, because we'll catch it later when/if vmscan tries to reclaim the | |
344 | page. This should be relatively rare. | |
345 | ||
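The ordering described above (clear the flag first, then try to isolate, then let the reverse map walk restore the flag) can be modeled as follows. This is a user-space sketch, not kernel code: the can_isolate and other_locked_vmas parameters are hypothetical stand-ins for the outcome of isolate_lru_page() and for what try_to_munlock() would find.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page { bool mlocked; };  /* models PG_mlocked */

static long nr_mlock;  /* models the zone's mlocked-page statistic */

/*
 * can_isolate:       would isolating the page from the LRU succeed?
 * other_locked_vmas: number of other VM_LOCKED vmas mapping the page,
 *                    as a reverse map walk would discover.
 */
static void model_munlock_vma_page(struct model_page *page,
                                   bool can_isolate, int other_locked_vmas)
{
    if (!page->mlocked)
        return;               /* another task already munlocked it */

    /* Clear up front: this may be the only chance we get. */
    page->mlocked = false;
    nr_mlock--;

    if (!can_isolate)
        return;               /* left for vmscan to sort out later */

    /* Models try_to_munlock(): restore the flag if still pinned. */
    if (other_locked_vmas > 0) {
        page->mlocked = true;
        nr_mlock++;
    }
}
```

The case worth noticing is can_isolate == false with other_locked_vmas > 0: the page ends up unmarked even though a VM_LOCKED vma still maps it, which is the situation vmscan later repairs via try_to_unmap().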
346 | Mlocked Pages: Migrating Them... | |
347 | ||
348 | A page that is being migrated has been isolated from the lru lists and is | |
349 | held locked across unmapping of the page, updating the page's mapping | |
350 | [address_space] entry and copying the contents and state, until the | |
351 | page table entry has been replaced with an entry that refers to the new | |
352 | page. Linux supports migration of mlocked pages and other unevictable | |
353 | pages. This involves simply moving the PageMlocked and PageUnevictable states | |
354 | from the old page to the new page. | |
355 | ||
356 | Note that page migration can race with mlocking or munlocking of the same | |
357 | page. This has been discussed from the mlock/munlock perspective in the | |
358 | respective sections above. Both processes [migration, m[un]locking], hold | |
359 | the page locked. This provides the first level of synchronization. Page | |
360 | migration zeros out the page_mapping of the old page before unlocking it, | |
361 | so m[un]lock can skip these pages by testing the page mapping under page | |
362 | lock. | |
363 | ||
364 | When completing page migration, we place the new and old pages back onto the | |
365 | lru after dropping the page lock. The "unneeded" page--old page on success, | |
366 | new page on failure--will be freed when the reference count held by the | |
367 | migration process is released. To ensure that we don't strand pages on the | |
368 | unevictable list because of a race between munlock and migration, page | |
369 | migration uses the putback_lru_page() function to add migrated pages back to | |
370 | the lru. | |
371 | ||
372 | ||
373 | Mlocked Pages: mmap(MAP_LOCKED) System Call Handling | |
374 | ||
375 | In addition to the mlock()/mlockall() system calls, an application can request | |
376 | that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() | |
377 | call. Furthermore, any mmap() call or brk() call that expands the heap by a | |
378 | task that has previously called mlockall() with the MCL_FUTURE flag will result | |
379 | in the newly mapped memory being mlocked. Before the unevictable/mlock changes, | |
380 | the kernel simply called make_pages_present() to allocate pages and populate | |
381 | the page table. | |
382 | ||
383 | To mlock a range of memory under the unevictable/mlock infrastructure, the | |
384 | mmap() handler and task address space expansion functions call | |
385 | mlock_vma_pages_range() specifying the vma and the address range to mlock. | |
386 | mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in | |
387 | "Mlocked Pages: Filtering Special Vmas". It will clear the VM_LOCKED flag, which will | |
388 | have already been set by the caller, in filtered vmas. Thus these vmas need | |
389 | not be visited for munlock when the region is unmapped. | |
390 | ||
391 | For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to | |
392 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), | |
393 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before | |
394 | attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore | |
395 | back to write mode before returning. | |
396 | ||
397 | The callers of mlock_vma_pages_range() will have already added the memory | |
398 | range to be mlocked to the task's "locked_vm". To account for filtered vmas, | |
399 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the | |
400 | callers then subtract a non-negative return value from the task's locked_vm. | |
401 | A negative return value represents an error--for example, from get_user_pages() | |
402 | attempting to fault in a vma with PROT_NONE access. In this case, we leave | |
403 | the memory range accounted as locked_vm, as the protections could be changed | |
404 | later and pages allocated into that region. | |
405 | ||
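The accounting rule for the return value can be written down as a one-line adjustment. The sketch below is a user-space illustration with a hypothetical helper name; it is not the code used by the callers of mlock_vma_pages_range().

```c
#include <assert.h>

/*
 * Hypothetical helper: adjust a task's locked_vm (in pages) given the
 * return value of the mlock attempt. A non-negative value is the number
 * of pages NOT mlocked; a negative value is an error and leaves the
 * accounting untouched.
 */
static long model_adjust_locked_vm(long locked_vm, long ret)
{
    if (ret >= 0)
        locked_vm -= ret;
    return locked_vm;
}
```

So a caller that pre-charged 100 pages and got back 30 ends up with 70 pages accounted, while an error return keeps all 100 accounted in case the range becomes populated later.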
406 | ||
407 | Mlocked Pages: munmap()/exit()/exec() System Call Handling | |
408 | ||
409 | When unmapping an mlocked region of memory, whether by an explicit call to | |
410 | munmap() or via an internal unmap from exit() or exec() processing, we must | |
411 | munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. | |
412 | Before the unevictable/mlock changes, mlocking did not mark the pages in any |
413 | way, so unmapping them required no processing. | |
414 | |
415 | To munlock a range of memory under the unevictable/mlock infrastructure, the | |
416 | munmap() handler and task address space tear down function call | |
417 | munlock_vma_pages_all(). The name reflects the observation that one always | |
418 | specifies the entire vma range when munlock()ing during unmap of a region. | |
419 | Because of the vma filtering when mlocking() regions, only "normal" vmas that | |
420 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). | |
421 | ||
422 | munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() | |
423 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table | |
424 | for the vma's memory range and munlock_vma_page() each resident page mapped by | |
425 | the vma. This effectively munlocks the page, only if this is the last | |
426 | VM_LOCKED vma that maps the page. | |
427 | ||
428 | ||
429 | Mlocked Pages: try_to_unmap() | |
430 | ||
431 | [Note: the code changes represented by this section are really quite small | |
432 | compared to the text to describe what is happening and why, and to discuss the | |
433 | implications.] | |
434 | ||
435 | Pages can, of course, be mapped into multiple vmas. Some of these vmas may | |
436 | have the VM_LOCKED flag set. It is possible for a page mapped into one or more | |
437 | VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one | |
438 | of the active or inactive LRU lists. This could happen if, for example, a | |
439 | task in the process of munlock()ing the page could not isolate the page from | |
440 | the LRU. As a result, vmscan/shrink_page_list() might encounter such a page | |
441 | as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To | |
442 | handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED | |
443 | vmas while it is walking a page's reverse map. | |
444 | ||
445 | try_to_unmap() is always called, by either vmscan for reclaim or for page | |
446 | migration, with the argument page locked and isolated from the LRU. BUG_ON() | |
447 | assertions enforce this requirement. Separate functions handle anonymous and | |
448 | mapped file pages, as these types of pages have different reverse map | |
449 | mechanisms. | |
450 | ||
451 | try_to_unmap_anon() | |
452 | ||
453 | To unmap anonymous pages, each vma in the list anchored in the anon_vma must be | |
454 | visited--at least until a VM_LOCKED vma is encountered. If the page is being | |
455 | unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked | |
456 | pages are migratable. However, for reclaim, if the page is mapped into a | |
457 | VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap | |
458 | semaphore of the mm_struct to which the vma belongs in read mode. If this is | |
459 | successful, try_to_unmap() will mlock the page via mlock_vma_page()--we | |
460 | wouldn't have gotten to try_to_unmap() if the page were already mlocked--and | |
461 | will return SWAP_MLOCK, indicating that the page is unevictable. If the | |
462 | mmap semaphore cannot be acquired, we are not sure whether the page is really | |
463 | unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. | |
464 | ||
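The decision made while walking the anon_vma list can be sketched as a user-space model. The enum values, structs and function name below are hypothetical; the mmap_sem_available field stands in for whether the read-mode acquisition of the mmap semaphore would succeed, and the actual unmapping of ptes is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; they do not carry the kernel's values. */
enum model_swap { MODEL_SWAP_SUCCESS, MODEL_SWAP_AGAIN, MODEL_SWAP_MLOCK };

struct model_vma {
    bool vm_locked;            /* models VM_LOCKED */
    bool mmap_sem_available;   /* would the read-mode acquire succeed? */
};

struct model_page { bool mlocked; };  /* models PG_mlocked */

static enum model_swap model_try_to_unmap_anon(struct model_page *page,
                                               const struct model_vma *vmas,
                                               size_t n, bool migration)
{
    for (size_t i = 0; i < n; i++) {
        /* Migration does not stop at VM_LOCKED vmas; reclaim does. */
        if (!migration && vmas[i].vm_locked) {
            if (!vmas[i].mmap_sem_available)
                return MODEL_SWAP_AGAIN;   /* cannot tell for sure */
            page->mlocked = true;          /* models mlock_vma_page() */
            return MODEL_SWAP_MLOCK;
        }
        /* The pte in this vma would be unmapped here. */
    }
    return MODEL_SWAP_SUCCESS;
}
```

A return of MODEL_SWAP_MLOCK is what lets vmscan cull the page to the unevictable list instead of recirculating it.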
465 | try_to_unmap_file() -- linear mappings | |
466 | ||
467 | Unmapping of a mapped file page works the same, except that the scan visits | |
468 | all vmas that map the page's index/page offset in the page's mapping's | |
469 | reverse map priority search tree. It must also visit each vma in the page's | |
470 | mapping's non-linear list, if the list is non-empty. As for anonymous pages, | |
471 | on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will | |
472 | attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, | |
473 | returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. | |
474 | ||
475 | try_to_unmap_file() -- non-linear mappings | |
476 | ||
477 | If a page's mapping contains a non-empty non-linear mapping vma list, then | |
478 | try_to_un{map|lock}() must also visit each vma in that list to determine | |
479 | whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit | |
480 | all vmas in the non-linear list to ensure that the page is not/should not be | |
481 | mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. | |
482 | However, there is no easy way to determine whether the page is actually mapped | |
483 | in a given vma--either for unmapping or testing whether the VM_LOCKED vma | |
484 | actually pins the page. | |
485 | ||
486 | So, try_to_unmap_file() handles non-linear mappings by scanning a certain | |
487 | number of pages--a "cluster"--in each non-linear vma associated with the page's | |
488 | mapping, for each file mapped page that vmscan tries to unmap. If this happens | |
489 | to unmap the page we're trying to unmap, try_to_unmap() will notice this on | |
490 | return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it | |
491 | will return SWAP_AGAIN, causing vmscan to recirculate this page. We take | |
492 | advantage of the cluster scan in try_to_unmap_cluster() as follows: | |
493 | ||
494 | For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap | |
495 | semaphore of the associated mm_struct for read without blocking. If this | |
496 | attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will | |
497 | retain the mmap semaphore for the scan; otherwise it drops it here. Then, | |
498 | for each page in the cluster, if we're holding the mmap semaphore for a locked | |
499 | vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This | |
500 | call is a no-op if the page is already locked, but will mlock any pages in | |
501 | the non-linear mapping that happen to be unlocked. If one of the pages so | |
502 | mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will | |
503 | return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan | |
504 | to cull the page, rather than recirculating it on the inactive list. Again, | |
505 | if try_to_unmap_cluster() cannot acquire the vma's mmap semaphore, it returns | |
506 | SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but | |
507 | couldn't be mlocked. | |
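The cluster-scan decision described above can be sketched as a small userspace model. This is only an illustration, not the kernel's code: the model_* types and helpers are hypothetical stand-ins for struct page, the vma and mlock_vma_page(), the numeric values of the return codes are arbitrary, and the unmapping of pages in unlocked vmas is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes named as in the text; the numeric values are arbitrary. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical stand-in for struct page. */
struct model_page { bool mlocked; };

/* Hypothetical stand-in for one non-linear vma and its cluster of pages. */
struct model_vma {
	bool vm_locked;               /* models the VM_LOCKED flag */
	bool mmap_sem_free;           /* models whether a read trylock succeeds */
	struct model_page **cluster;  /* pages covered by this cluster scan */
	size_t nr_pages;
};

/* Models mlock_vma_page(): has no further effect on an mlocked page. */
static void model_mlock_page(struct model_page *page)
{
	page->mlocked = true;
}

static enum swap_result
model_unmap_cluster(struct model_vma *vma, struct model_page *check_page)
{
	enum swap_result ret = SWAP_AGAIN;	/* the default result */
	bool locked_vma;
	size_t i;

	assert(vma != NULL && check_page != NULL);

	/* Retain the semaphore only if the trylock worked AND vma is locked. */
	locked_vma = vma->mmap_sem_free && vma->vm_locked;

	for (i = 0; i < vma->nr_pages; i++) {
		struct model_page *page = vma->cluster[i];

		if (locked_vma) {
			model_mlock_page(page);
			if (page == check_page)
				ret = SWAP_MLOCK;	/* lets vmscan cull it */
			continue;
		}
		/* Unmapping pages of an unlocked vma is not modelled here. */
	}
	return ret;
}
```

In this model, a VM_LOCKED vma whose semaphore cannot be taken leaves every page untouched and yields SWAP_AGAIN, which corresponds to the recirculation case in the text.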
508 | ||
509 | ||
510 | Mlocked pages: try_to_munlock() Reverse Map Scan | |
511 | ||
512 | TODO/FIXME: a better name might be page_mlocked()--analogous to the | |
63d6c5ad | 513 | page_referenced() reverse map walker. |
fa07e787 | 514 | |
63d6c5ad HD |
515 | When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() |
516 | System Call Handling" above--tries to munlock a page, it needs to | |
fa07e787 LS |
517 | determine whether or not the page is mapped by any VM_LOCKED vma, without |
518 | actually attempting to unmap all ptes from the page. For this purpose, the | |
519 | unevictable/mlock infrastructure introduced a variant of try_to_unmap() called | |
520 | try_to_munlock(). | |
521 | ||
522 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | |
523 | mapped file pages with an additional argument specifying unlock versus unmap | |
524 | processing. Again, these functions walk the respective reverse maps looking | |
525 | for VM_LOCKED vmas. When such a vma is found for anonymous pages and file | |
526 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions | |
527 | attempt to acquire the associated mmap semaphore, mlock the page via | |
528 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the | |
63d6c5ad | 529 | pre-clearing of the page's PG_mlocked done by munlock_vma_page(). |
fa07e787 LS |
530 | |
531 | If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap | |
532 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() | |
533 | to recycle the page on the inactive list and hope that it has better luck | |
534 | with the page next time. | |
535 | ||
536 | For file pages mapped into non-linear vmas, the try_to_munlock() logic works | |
537 | slightly differently. On encountering a VM_LOCKED non-linear vma that might | |
538 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking | |
539 | the page. munlock_vma_page() will just leave the page unlocked and let | |
540 | vmscan deal with it--the usual fallback position. | |
541 | ||
542 | Note that try_to_munlock()'s reverse map walk must visit every vma in a page's | |
543 | reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. | |
544 | However, the scan can terminate when it encounters a VM_LOCKED vma and can | |
545 | successfully acquire the vma's mmap semaphore for read and mlock the page. | |
546 | Although try_to_munlock() can be called many [very many!] times when | |
547 | munlock()ing a large region or tearing down a large address space that has been | |
63d6c5ad | 548 | mlocked via mlockall(), overall this is a fairly rare event. |
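The asymmetry in the walk (every vma must be visited to prove the page is not locked, but one successful mlock ends the scan) can be illustrated with a minimal userspace sketch. The model_vma type and model_try_to_munlock() are hypothetical, the reverse map is flattened into an array, and returning SWAP_SUCCESS when no VM_LOCKED vma is found is an assumption of this model rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes named as in the text; the numeric values are arbitrary. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical stand-in for one vma found in the page's reverse map. */
struct model_vma {
	bool vm_locked;      /* models the VM_LOCKED flag */
	bool mmap_sem_free;  /* models whether a read trylock succeeds */
};

static enum swap_result
model_try_to_munlock(const struct model_vma *vmas, size_t n,
		     bool *page_mlocked)
{
	/* Assumption: SWAP_SUCCESS means "no VM_LOCKED vma maps the page". */
	enum swap_result ret = SWAP_SUCCESS;
	size_t i;

	assert(page_mlocked != NULL);

	for (i = 0; i < n; i++) {
		if (!vmas[i].vm_locked)
			continue;	/* must keep walking to prove "not locked" */

		if (vmas[i].mmap_sem_free) {
			*page_mlocked = true;	/* models mlock_vma_page() */
			return SWAP_MLOCK;	/* the scan can terminate here */
		}

		/* Locked vma seen, but it could not be confirmed: keep looking. */
		ret = SWAP_AGAIN;
	}
	return ret;
}
```

A walk over vmas where only a contended VM_LOCKED vma is found ends with SWAP_AGAIN and the page left unlocked, matching the fallback in which vmscan deals with the page later.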
fa07e787 LS |
549 | |
550 | Mlocked Pages: Page Reclaim in shrink_*_list() | |
551 | ||
552 | shrink_active_list() culls any obviously unevictable pages--i.e., | |
553 | !page_evictable(page, NULL)--diverting these to the unevictable lru | |
554 | list. However, shrink_active_list() only sees unevictable pages that | |
555 | made it onto the active/inactive lru lists. Note that these pages do not | |
556 | have PageUnevictable set--otherwise, they would be on the unevictable list and | |
557 | shrink_active_list() would never see them. | |
558 | ||
559 | Some examples of these unevictable pages on the LRU lists are: | |
560 | ||
561 | 1) ramfs pages that have been placed on the lru lists when first allocated. | |
562 | ||
563 | 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to | |
564 | allocate or fault in the pages in the shared memory region. This happens | |
565 | when an application accesses the page the first time after SHM_LOCKing | |
566 | the segment. | |
567 | ||
568 | 3) Mlocked pages that could not be isolated from the lru and moved to the | |
569 | unevictable list in mlock_vma_page(). | |
570 | ||
571 | 4) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't | |
572 | acquire the vma's mmap semaphore to test the flags and set PageMlocked. | |
573 | munlock_vma_page() was forced to let the page back on to the normal | |
574 | LRU list for vmscan to handle. | |
575 | ||
63d6c5ad HD |
576 | shrink_inactive_list() also culls any unevictable pages that it finds on |
577 | the inactive lists, again diverting them to the appropriate zone's unevictable | |
fa07e787 LS |
578 | lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became |
579 | SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or | |
580 | pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from | |
581 | the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice | |
582 | the latter, but will pass them on to shrink_page_list(). | |
583 | ||
584 | shrink_page_list() again culls obviously unevictable pages that it could | |
63d6c5ad | 585 | encounter for reasons similar to shrink_inactive_list(). Pages mapped into |
fa07e787 | 586 | VM_LOCKED vmas but without PG_mlocked set will make it all the way to |
63d6c5ad HD |
587 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list |
588 | when try_to_unmap() returns SWAP_MLOCK, as discussed above. |
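How shrink_page_list() routes a page, as described above, might be summarized by the following sketch. It is a simplified model only: the model_dest enum and model_route_page() are hypothetical names, the "evictable" flag stands in for the page_evictable() test, and only the three return codes discussed in this section are covered.

```c
#include <assert.h>
#include <stdbool.h>

/* Return codes named as in the text; the numeric values are arbitrary. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical summary of where a page ends up after one reclaim pass. */
enum model_dest {
	DEST_CONTINUE_RECLAIM,	/* unmapped; reclaim of the page may proceed */
	DEST_INACTIVE_LIST,	/* recirculated; try again on a later pass */
	DEST_UNEVICTABLE_LIST	/* culled to the unevictable lru list */
};

static enum model_dest
model_route_page(bool evictable, enum swap_result unmap_ret)
{
	/* Obviously unevictable pages are culled before any unmap attempt. */
	if (!evictable)
		return DEST_UNEVICTABLE_LIST;

	switch (unmap_ret) {
	case SWAP_MLOCK:
		/* Mapped by a VM_LOCKED vma and now mlocked: cull it. */
		return DEST_UNEVICTABLE_LIST;
	case SWAP_AGAIN:
		/* Could not decide, e.g. mmap semaphore was contended. */
		return DEST_INACTIVE_LIST;
	case SWAP_SUCCESS:
	default:
		return DEST_CONTINUE_RECLAIM;
	}
}
```

The point of the model is that a page can reach the unevictable list by two routes: the up-front evictability check, or a SWAP_MLOCK result from the reverse map walk.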