.. _hmm:

=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into regular kernel path, with the cornerstone
of this being specialized struct page for such memory (see sections 5 to 7 of
this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access a program's address space coherently
with the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPU, DSP, or FPGA are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.

.. contents:: :local:

Problems of using a device specific memory allocator
=====================================================

Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because a device can only access memory allocated
through a device specific API. This implies that all memory objects in a
program are not equal from the device point of view, which complicates large
programs that rely on a wide set of libraries.

Concretely this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private,
mmap share) and memory allocated through the device driver API (this still ends
up with an mmap but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
complex data sets (list, tree, ...) are hard to get right. Duplicating a
complex data set needs to re-map all the pointer relations between each of its
elements. This is error prone and programs get harder to debug because of the
duplicate data set and addresses.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library and thus each library
might have to duplicate its input data set using the device specific memory
allocator. Large projects suffer from this and waste resources because of the
various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler identified patterns
are only do-able with a shared address space. It is also more reasonable to use
a shared address space for all other patterns.


I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from a CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is
worse in the other direction: the CPU can only access a limited range of the
device memory and cannot perform atomic operations on it. Thus device memory
cannot be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
The final limitation is latency. Access to main memory from the device has an
order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow two-
way cache coherency between CPU and device and allow all atomic operations the
architecture supports. Sadly, not all platforms are following this trend and
some major architectures are left without hardware solutions to these problems.

So for a shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


Shared address space and migration
==================================

HMM intends to provide two main features. The first one is to share the
address space by duplicating the CPU page table in the device page table so
the same address points to the same physical memory for any valid main memory
address in the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you
must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations, and
flush, ...). This cannot be done through common code for all devices. Hence
why HMM provides helpers to factor out everything that can be while leaving the
hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of the device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms and everything
looks like a page is swapped out to disk from the CPU point of view. Using a
struct page gives the easiest and cleanest integration with existing mm
mechanisms. Here again, HMM only provides helpers, first to hotplug new
ZONE_DEVICE memory for the device memory and second to perform migration.
Policy decisions of what and when to migrate things are left to the device
driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror a process
address space and keep both CPU and device page tables synchronized, but also
leverages device memory by migrating the part of the data set that is actively
being used by the device.


Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
CPU page table into a device page table; HMM helps keep both synchronized. A
device driver that wants to mirror a process address space must start with the
registration of an hmm_mirror struct::

 int hmm_mirror_register(struct hmm_mirror *mirror,
                         struct mm_struct *mm);
 int hmm_mirror_register_locked(struct hmm_mirror *mirror,
                                struct mm_struct *mm);

The locked variant is to be used when the driver is already holding the
mmap_sem of the mm in write mode. The mirror struct has a set of callbacks that
are used to propagate CPU page tables::

 struct hmm_mirror_ops {
     /* sync_cpu_device_pagetables() - synchronize page tables
      *
      * @mirror: pointer to struct hmm_mirror
      * @update_type: type of update that occurred to the CPU page table
      * @start: virtual start address of the range to update
      * @end: virtual end address of the range to update
      *
      * This callback ultimately originates from mmu_notifiers when the CPU
      * page table is updated. The device driver must update its page table
      * in response to this callback. The update argument tells what action
      * to perform.
      *
      * The device driver must not return from this callback until the device
      * page tables are completely updated (TLBs flushed, etc); this is a
      * synchronous call.
      */
     void (*update)(struct hmm_mirror *mirror,
                    enum hmm_update action,
                    unsigned long start,
                    unsigned long end);
 };

The device driver must perform the update action to the range (mark range
read only, or fully unmap, ...). The device must be done with the update before
the driver callback returns.

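To make this concrete, below is a minimal sketch of how a hypothetical driver
might register a mirror and implement the callback above. Every name prefixed
with ``dummy_`` (including the device-side invalidation helper) is illustrative
and not part of the HMM API, and it assumes the mirror struct carries an ops
pointer as in include/linux/hmm.h::

 /* Hypothetical driver-side mirror wrapping the HMM mirror struct. */
 struct dummy_mirror {
     struct hmm_mirror mirror;       /* registered with HMM             */
     struct dummy_device *ddev;      /* driver private device state     */
     struct mutex update_lock;       /* the driver->update lock (below) */
 };

 static void dummy_mirror_update(struct hmm_mirror *mirror,
                                 enum hmm_update action,
                                 unsigned long start,
                                 unsigned long end)
 {
     struct dummy_mirror *dmirror;

     dmirror = container_of(mirror, struct dummy_mirror, mirror);

     mutex_lock(&dmirror->update_lock);
     /*
      * Build and schedule the device specific command buffer that unmaps
      * or write protects [start, end) and wait for the device to report
      * completion before returning, as required by the callback contract.
      */
     dummy_device_invalidate_range(dmirror->ddev, action, start, end);
     mutex_unlock(&dmirror->update_lock);
 }

 static const struct hmm_mirror_ops dummy_mirror_ops = {
     .update = dummy_mirror_update,
 };

 static int dummy_mirror_register(struct dummy_mirror *dmirror,
                                  struct mm_struct *mm)
 {
     dmirror->mirror.ops = &dummy_mirror_ops;
     return hmm_mirror_register(&dmirror->mirror, mm);
 }
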
When the device driver wants to populate a range of virtual addresses, it can
use either::

 int hmm_vma_get_pfns(struct vm_area_struct *vma,
                      struct hmm_range *range,
                      unsigned long start,
                      unsigned long end,
                      hmm_pfn_t *pfns);
 int hmm_vma_fault(struct vm_area_struct *vma,
                   struct hmm_range *range,
                   unsigned long start,
                   unsigned long end,
                   hmm_pfn_t *pfns,
                   bool write,
                   bool block);

The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
entries and will not trigger a page fault on missing or non-present entries.
The second one does trigger a page fault on missing or read-only entries if
the write parameter is true. Page faults use the generic mm page fault code
path just like a CPU page fault.

Both functions copy CPU page table entries into their pfns array argument. Each
entry in that array corresponds to an address in the virtual range. HMM
provides a set of flags to help the driver identify special CPU page table
entries.

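As an illustration, a driver walking that array to fill its own page table
might look like the sketch below. The flag names (HMM_PFN_VALID,
HMM_PFN_WRITE) and the hmm_pfn_t_to_page() helper are assumptions based on
include/linux/hmm.h of the same era; check the header for the authoritative
set. The ``dummy_`` helpers are placeholders for device specific code::

 static void dummy_populate_device_ptes(struct dummy_mirror *dmirror,
                                        unsigned long start,
                                        unsigned long end,
                                        const hmm_pfn_t *pfns)
 {
     unsigned long addr;
     unsigned long i;

     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
         if (!(pfns[i] & HMM_PFN_VALID)) {
             /* Hole or special entry: leave the device PTE invalid. */
             dummy_clear_device_pte(dmirror, addr);
             continue;
         }
         /* Map the page, read only unless the write flag is set. */
         dummy_set_device_pte(dmirror, addr, hmm_pfn_t_to_page(pfns[i]),
                              !!(pfns[i] & HMM_PFN_WRITE));
     }
 }
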
Locking with the update() callback is the most important aspect the driver must
respect in order to keep things properly synchronized. The usage pattern is::

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...
 again:
      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
      if (ret)
          return ret;
      take_lock(driver->update);
      if (!hmm_vma_range_done(vma, &range)) {
          release_lock(driver->update);
          goto again;
      }

      // Use pfns array content to update device page table

      release_lock(driver->update);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
update() callback. That lock must be held before hmm_vma_range_done() to avoid
any race with a concurrent CPU page table update.

HMM implements all this on top of the mmu_notifier API because we wanted a
simpler API and also to be able to perform optimizations later on, like doing
concurrent device updates in multi-device scenarios.

HMM also serves as an impedance mismatch between how CPU page table updates
are done (by the CPU writing to the page table and flushing the TLB) and how
devices update their own page table. Device updates are a multi-step process.
First, appropriate commands are written to a buffer, then this buffer is
scheduled for execution on the device. It is only once the device has executed
the commands in the buffer that the update is done. Creating and scheduling the
update command buffer can happen concurrently for multiple devices. Waiting
for each device to report commands as executed is serialized (there is no
point in doing this concurrently).


Represent and manage device memory from core kernel point of view
==================================================================

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory, which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

HMM provides a set of helpers to register and hotplug device memory as a new
region needing a struct page. This is offered through a very simple API::

 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                   struct device *device,
                                   unsigned long size);
 void hmm_devmem_remove(struct hmm_devmem *devmem);

The hmm_devmem_ops is where most of the important things are::

 struct hmm_devmem_ops {
     void (*free)(struct hmm_devmem *devmem, struct page *page);
     int (*fault)(struct hmm_devmem *devmem,
                  struct vm_area_struct *vma,
                  unsigned long addr,
                  struct page *page,
                  unsigned flags,
                  pmd_t *pmdp);
 };

The first callback (free()) happens when the last reference on a device page is
dropped. This means the device page is now free and no longer used by anyone.
The second callback happens whenever the CPU tries to access a device page,
which it cannot do. This second callback must trigger a migration back to
system memory.

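A minimal sketch of how a hypothetical driver might use these helpers follows.
Only hmm_devmem_add() and the ops structure above are the actual API; the
``dummy_`` helpers, the device structure, and the exact return convention of
the fault callback are illustrative assumptions::

 static void dummy_devmem_free(struct hmm_devmem *devmem, struct page *page)
 {
     /* Last reference dropped: return the page to the driver allocator. */
     dummy_free_device_page(page);
 }

 static int dummy_devmem_fault(struct hmm_devmem *devmem,
                               struct vm_area_struct *vma,
                               unsigned long addr,
                               struct page *page,
                               unsigned flags,
                               pmd_t *pmdp)
 {
     /*
      * The CPU touched a device page it cannot map: migrate the page
      * back to system memory (typically with the migrate_vma() helper
      * described in the next section).
      */
     return dummy_migrate_back_to_ram(devmem, vma, addr, page);
 }

 static const struct hmm_devmem_ops dummy_devmem_ops = {
     .free  = dummy_devmem_free,
     .fault = dummy_devmem_fault,
 };

 static int dummy_hotplug_device_memory(struct dummy_device *ddev)
 {
     struct hmm_devmem *devmem;

     devmem = hmm_devmem_add(&dummy_devmem_ops, ddev->dev, ddev->vram_size);
     if (IS_ERR(devmem))
         return PTR_ERR(devmem);

     ddev->devmem = devmem;
     return 0;
 }
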
Migration to and from device memory
===================================

Because the CPU cannot access device memory, migration must use the device DMA
engine to perform copy from and to device memory. For this we need a new
migration helper::

 int migrate_vma(const struct migrate_vma_ops *ops,
                 struct vm_area_struct *vma,
                 unsigned long mentries,
                 unsigned long start,
                 unsigned long end,
                 unsigned long *src,
                 unsigned long *dst,
                 void *private);

Unlike other migration functions it works on a range of virtual addresses.
There are two reasons for that. First, device DMA copy has a high setup
overhead cost and thus batching multiple pages is needed, as otherwise the
migration overhead makes the whole exercise pointless. The second reason is
because the migration might be for a range of addresses the device is actively
accessing.

The migrate_vma_ops struct defines two callbacks. The first one
(alloc_and_copy()) controls destination memory allocation and the copy
operation. The second one is there to allow the device driver to perform
cleanup operations after migration::

 struct migrate_vma_ops {
     void (*alloc_and_copy)(struct vm_area_struct *vma,
                            const unsigned long *src,
                            unsigned long *dst,
                            unsigned long start,
                            unsigned long end,
                            void *private);
     void (*finalize_and_map)(struct vm_area_struct *vma,
                              const unsigned long *src,
                              const unsigned long *dst,
                              unsigned long start,
                              unsigned long end,
                              void *private);
 };

It is important to stress that these migration helpers allow for holes in the
virtual address range. Some pages in the range might not be migrated for all
the usual reasons (page is pinned, page is locked, ...). This helper does not
fail but just skips over those pages.

The alloc_and_copy() might also decide to not migrate all pages in the
range (for reasons under the callback control). For those, the callback just
has to leave the corresponding dst entry empty.

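To illustrate, an alloc_and_copy() callback migrating a range into device
memory might look like the sketch below. The ``dummy_`` helpers are
placeholders, and the MIGRATE_PFN_* encoding of the src/dst arrays is only
assumed here; include/linux/migrate.h is the authoritative reference::

 static void dummy_alloc_and_copy(struct vm_area_struct *vma,
                                  const unsigned long *src,
                                  unsigned long *dst,
                                  unsigned long start,
                                  unsigned long end,
                                  void *private)
 {
     struct dummy_device *ddev = private;
     unsigned long addr;
     unsigned long i;

     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
         struct page *dpage;

         /* Pages the core could not isolate are not candidates. */
         if (!(src[i] & MIGRATE_PFN_MIGRATE))
             continue;

         dpage = dummy_alloc_device_page(ddev);
         if (!dpage) {
             /* Leaving the dst entry empty skips this page. */
             dst[i] = 0;
             continue;
         }

         /* Queue a DMA copy from the source page into device memory. */
         dummy_dma_copy(ddev, dpage, migrate_pfn_to_page(src[i]));
         dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
     }

     /* Wait for all queued DMA copies before returning. */
     dummy_dma_wait(ddev);
 }

 static const struct migrate_vma_ops dummy_migrate_ops = {
     .alloc_and_copy   = dummy_alloc_and_copy,
     .finalize_and_map = dummy_finalize_and_map,
 };
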
Finally, the migration of the struct page might fail (for a file backed page)
for various reasons (failure to freeze reference, or update page cache, ...).
If that happens, then the finalize_and_map() can catch any pages that were not
migrated. Note those pages were still copied to a new page and thus we wasted
bandwidth, but this is considered as a rare event and a price that we are
willing to pay to keep all the code simpler.


Memory cgroup (memcg) and rss accounting
========================================

For now device memory is accounted as any regular page in rss counters (either
anonymous if the device page is used for anonymous memory, file if the device
page is used for a file backed page, or shmem if the device page is used for
shared memory). This is a deliberate choice to keep existing applications, that
might start using device memory without knowing about it, running unimpacted.

A drawback is that the OOM killer might kill an application using a lot of
device memory and not a lot of regular system memory and thus not freeing much
system memory. We want to gather more real world experience on how applications
and systems react under memory pressure in the presence of device memory before
deciding to account device memory differently.


The same decision was made for memory cgroups. Device memory pages are
accounted against the same memory cgroup a regular page would be accounted to.
This does simplify migration to and from device memory. This also means that
migration back from device memory to regular memory cannot fail because it
would go above the memory cgroup limit. We might revisit this choice later on
once we get more experience in how device memory is used and its impact on
memory resource control.


Note that device memory can never be pinned, neither by a device driver nor
through GUP, and thus such memory is always freed upon process exit, or when
the last reference is dropped in case of shared memory or file backed memory.

.. _page_migration:

==============
Page migration
==============

Page migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the
system rearranges the physical location of those pages.

The main intent of page migration is to reduce the latency of memory access
by moving pages near to the processor where the process accessing that memory
is running.

Page migration allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
a new memory policy via mbind(). The pages of a process can also be relocated
from another process using the sys_migrate_pages() function call. The
migrate_pages() function call takes two sets of nodes and moves the pages of a
process that are located on the from nodes to the destination nodes.
Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required. Get it from
ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
which provides an interface similar to other numa functionality for page
migration. ``cat /proc/<pid>/numa_maps`` allows an easy review of where the
pages of a process are located. See also the numa_maps documentation in the
proc(5) man page.

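For example, a process can ask for its own, already allocated pages to be
migrated by changing the policy of a range. Below is a minimal userspace
sketch using mbind() from libnuma's <numaif.h> (MPOL_MF_MOVE is the flag
referred to as MF_MOVE above; ``buf`` must be page aligned and error handling
is kept minimal)::

 #include <numaif.h>
 #include <stdio.h>

 /* Bind buf to node 1 and migrate its existing pages there. */
 static int bind_and_move_to_node1(void *buf, unsigned long len)
 {
     unsigned long nodemask = 1UL << 1;      /* only node 1 */

     if (mbind(buf, len, MPOL_BIND, &nodemask,
               sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0) {
         perror("mbind");
         return -1;
     }
     return 0;
 }
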
Manual migration is useful if for example the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
administrator may detect the situation and move the pages of the process
nearer to the new processor. The kernel itself only provides
manual page migration support. Automatic page migration may be implemented
through user space processes that move pages. A special function call
"move_pages" allows the moving of individual pages within a process.
A NUMA profiler may, for example, obtain a log showing frequent off node
accesses and may use the result to move pages to more advantageous
locations.

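A minimal sketch of move_pages(2), again via <numaif.h>, moving a single page
of the calling process to node 0 and reading back where it ended up::

 #include <numaif.h>
 #include <stdio.h>

 static int move_one_page_to_node0(void *addr)
 {
     void *pages[1]  = { addr };
     int   nodes[1]  = { 0 };
     int   status[1] = { 0 };

     /* pid 0 means the calling process. */
     if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0) {
         perror("move_pages");
         return -1;
     }
     printf("page now on node %d (negative means error)\n", status[0]);
     return 0;
 }
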
Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset (See
Documentation/cgroup-v1/cpusets.txt).
Cpusets allow the automation of process locality. If a task is moved to
a new cpuset then all its pages are moved with it so that the
performance of the process does not sink dramatically. Also the pages
of processes in a cpuset are moved if the allowed memory nodes of a
cpuset are changed.

Page migration allows the preservation of the relative location of pages
within a group of nodes for all migration techniques which will preserve a
particular memory allocation pattern generated even after migrating a
process. This is necessary in order to preserve the memory latencies.
Processes will run with similar performance after migration.

Page migration occurs in several steps. First a high level
description for those trying to use migrate_pages() from the kernel
(for userspace usage see Andi Kleen's numactl package mentioned above)
and then a low level description of how the low level details work.

In kernel use of migrate_pages()
================================

1. Remove pages from the LRU.

   Lists of pages to be migrated are generated by scanning over
   pages and moving them into lists. This is done by
   calling isolate_lru_page().
   Calling isolate_lru_page() increases the references to the page
   so that it cannot vanish while the page migration occurs.
   It also prevents the swapper or other scans from encountering
   the page.

2. We need to have a function of type new_page_t that can be
   passed to migrate_pages(). This function should figure out
   how to allocate the correct new page given the old page.

3. The migrate_pages() function is called which attempts
   to do the migration. It will call the function to allocate
   the new page for each page that is considered for
   moving. A condensed sketch of these three steps follows the list.

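The sketch below migrates a single page to node 1 following the three steps
above. Exact signatures vary between kernel versions, so treat this as an
illustration rather than a reference::

 /* Step 2: new_page_t allocator, here simply allocating on node 1. */
 static struct page *new_page_on_node1(struct page *page, unsigned long private)
 {
     return alloc_pages_node(1, GFP_HIGHUSER_MOVABLE, 0);
 }

 static int migrate_one_page_to_node1(struct page *page)
 {
     LIST_HEAD(pagelist);
     int ret;

     /* Step 1: take the page off the LRU so it cannot vanish. */
     ret = isolate_lru_page(page);
     if (ret)
         return ret;
     list_add_tail(&page->lru, &pagelist);

     /* Step 3: try to migrate everything on the list. */
     ret = migrate_pages(&pagelist, new_page_on_node1, NULL, 0,
                         MIGRATE_SYNC, MR_SYSCALL);
     if (ret)
         putback_movable_pages(&pagelist);
     return ret;
 }
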
How migrate_pages() works
=========================

migrate_pages() does several passes over its list of pages. A page is moved
if all references to a page are removable at the time. The page has
already been removed from the LRU via isolate_lru_page() and the refcount
is increased so that the page cannot be freed while page migration occurs.

Steps:

1. Lock the page to be migrated.

2. Ensure that writeback is complete.

3. Lock the new page that we want to move to. It is locked so that accesses to
   this (not yet uptodate) page immediately lock while the move is in progress.

4. All the page table references to the page are converted to migration
   entries. This decreases the mapcount of a page. If the resulting
   mapcount is not zero then we do not migrate the page. All user space
   processes that attempt to access the page will now wait on the page lock.

5. The i_pages lock is taken. This will cause all processes trying
   to access the page via the mapping to block on the spinlock.

6. The refcount of the page is examined and we back out if references remain;
   otherwise we know that we are the only one referencing this page.

7. The radix tree is checked and if it does not contain the pointer to this
   page then we back out because someone else modified the radix tree.

8. The new page is prepped with some settings from the old page so that
   accesses to the new page will discover a page with the correct settings.

9. The radix tree is changed to point to the new page.

10. The reference count of the old page is dropped because the address space
    reference is gone. A reference to the new page is established because
    the new page is referenced by the address space.

11. The i_pages lock is dropped. With that lookups in the mapping
    become possible again. Processes will move from spinning on the lock
    to sleeping on the locked new page.

12. The page contents are copied to the new page.

13. The remaining page flags are copied to the new page.

14. The old page flags are cleared to indicate that the page does
    not provide any information anymore.

15. Queued up writeback on the new page is triggered.

16. If migration entries were inserted then replace them with real ptes. Doing
    so will enable access for user space processes not already waiting for
    the page lock.

17. The page locks are dropped from the old and new page.
    Processes waiting on the page lock will redo their page faults
    and will reach the new page.

18. The new page is moved to the LRU and can be scanned by the swapper
    etc. again.

Non-LRU page migration
======================

Although migration originally aimed at reducing the latency of memory accesses
for NUMA, compaction, which wants to create high-order pages, is also a main
customer.

The current problem of the implementation is that it is designed to migrate
only *LRU* pages. However, there are potential non-LRU pages which can be
migrated in drivers, for example, zsmalloc and virtio-balloon pages.

For virtio-balloon pages, some parts of the migration code path have been
hooked up and virtio-balloon specific functions were added to intercept the
migration logic. This is too specific to one driver, so other drivers that
want to make their pages movable would have to add their own specific hooks
in the migration path.

To overcome the problem, VM supports non-LRU page migration which provides
generic functions for non-LRU movable pages without driver specific hooks in
the migration path.

If a driver wants to make its own pages movable, it should define three
functions which are function pointers of struct address_space_operations
(a combined sketch of the three callbacks appears at the end of this section).

1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);``

What VM expects from the isolate_page() function of a driver is to return
*true* if the driver isolates the page successfully. On returning true, VM
marks the page as PG_isolated so concurrent isolation on several CPUs skips
the page for isolation. If a driver cannot isolate the page, it should return
*false*.

Once the page is successfully isolated, VM uses the page.lru fields so the
driver shouldn't expect to preserve values in those fields.

2. ``int (*migratepage) (struct address_space *mapping,``
   | ``struct page *newpage, struct page *oldpage, enum migrate_mode);``

After isolation, VM calls the migratepage() of the driver with the isolated
page. The function of migratepage() is to move the content of the old page to
the new page and set up the fields of struct page newpage. Keep in mind that
you should indicate to the VM that the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage successfully
and return MIGRATEPAGE_SUCCESS. If the driver cannot migrate the page at the
moment, the driver can return -EAGAIN. On -EAGAIN, VM will retry page
migration in a short time because VM interprets -EAGAIN as "temporary
migration failure". On returning any error except -EAGAIN, VM will give up
the page migration without retrying.

The driver shouldn't touch the page.lru field while VM is using it.

3. ``void (*putback_page)(struct page *);``

If migration fails on the isolated page, VM should return the isolated page
to the driver, so VM calls the driver's putback_page() with the migration
failed page. In this function, the driver should put the isolated page back
into its own data structure.

4. non-LRU movable page flags

There are two page flags for supporting non-LRU movable pages.

* PG_movable

The driver should use the function below to make a page movable under
page_lock::

        void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs the address_space argument for registering the migration family of
functions which will be called by VM. Exactly speaking, PG_movable is not a
real flag of struct page. Rather, VM reuses the lower bits of page->mapping to
represent it::

        #define PAGE_MAPPING_MOVABLE 0x2
        page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so a driver shouldn't access page->mapping directly. Instead, the driver
should use page_mapping(), which masks off the low two bits of page->mapping
under page lock, so it can get the right struct address_space.

For testing of non-LRU movable pages, VM supports the __PageMovable() function.
However, it doesn't guarantee to identify non-LRU movable pages because the
page->mapping field is unified with other variables in struct page. Also, if
the driver releases the page after isolation by VM, page->mapping doesn't have
a stable value although it has PAGE_MAPPING_MOVABLE (look at
__ClearPageMovable). But __PageMovable() is cheap to check whether a page is
LRU or non-LRU movable once the page has been isolated, because LRU pages can
never have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just
peeking to test non-LRU movable pages before the more expensive check with
lock_page() in pfn scanning to select a victim.

For guaranteeing non-LRU movable pages, VM provides the PageMovable() function.
Unlike __PageMovable(), PageMovable() validates page->mapping and
mapping->a_ops->isolate_page under lock_page(). The lock_page() prevents
sudden destruction of page->mapping.

A driver using __SetPageMovable() should clear the flag via
__ClearPageMovable() under page_lock before releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks an isolated page
as PG_isolated under lock_page(). So if a CPU encounters a PG_isolated non-LRU
movable page, it can skip it. The driver doesn't need to manipulate the flag
because VM will set/clear it automatically. Keep in mind that if the driver
sees a PG_isolated page, it means the page has been isolated by VM, so it
shouldn't touch the page.lru field. PG_isolated is aliased with the PG_reclaim
flag, so the driver shouldn't use the flag for its own purpose.

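Putting the pieces above together, a hypothetical driver could wire up its
callbacks and page registration as sketched below. The ``dummy_`` helpers
stand in for the driver's own bookkeeping, and the driver is assumed to own an
address_space whose a_ops points at these callbacks::

 static bool dummy_isolate_page(struct page *page, isolate_mode_t mode)
 {
     /* Detach the page from the driver's lists; true means isolated. */
     return dummy_detach_page(page);
 }

 static int dummy_migratepage(struct address_space *mapping,
                              struct page *newpage, struct page *oldpage,
                              enum migrate_mode mode)
 {
     /* Move content and driver metadata from oldpage to newpage. */
     if (!dummy_copy_page(newpage, oldpage))
         return -EAGAIN;         /* temporary failure, VM will retry */

     /* The old page stops being movable once its content has moved. */
     __ClearPageMovable(oldpage);
     return MIGRATEPAGE_SUCCESS;
 }

 static void dummy_putback_page(struct page *page)
 {
     /* Migration failed: take the isolated page back. */
     dummy_attach_page(page);
 }

 static const struct address_space_operations dummy_aops = {
     .isolate_page = dummy_isolate_page,
     .migratepage  = dummy_migratepage,
     .putback_page = dummy_putback_page,
 };

 /* Called when the driver allocates a page it wants the VM to migrate. */
 static void dummy_make_page_movable(struct page *page,
                                     struct address_space *mapping)
 {
     lock_page(page);
     __SetPageMovable(page, mapping);
     unlock_page(page);
 }
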
Christoph Lameter, May 8, 2006.
Minchan Kim, Mar 28, 2016.