.. _userfaultfd:

===========
Userfaultfd
===========

Objective
=========

Userfaults allow the implementation of on-demand paging from userland
and, more generally, they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example, userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

Design
======

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd and that allow userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfaults never involve
heavyweight structures like vmas in any of their operations (in fact
the userfaultfd runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span
terabytes. Too many vmas would be needed for that.

The userfaultfd, once opened by invoking the syscall, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use the userfaultfd
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).

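Passing the userfaultfd to a manager over a unix domain socket could
look roughly like the sketch below, which uses the generic SCM_RIGHTS
mechanism of sendmsg(2)/recvmsg(2). The helper names send_fd() and
recv_fd() are hypothetical and not part of any userfaultfd API; any
file descriptor can be passed this way.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over the connected unix domain socket sock.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char byte = 0;  /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    memset(&ctl, 0, sizeof(ctl));

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    assert(c != NULL);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor from sock.
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    memset(&ctl, 0, sizeof(ctl));

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
    };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The monitored process would call send_fd() with its userfaultfd, and
the manager would then poll and resolve faults on the descriptor
returned by recv_fd().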
API
===

When first opened, the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API
version), which specifies the read/POLLIN protocol userland intends to
speak on the UFFD, and with the uffdio_api.features userland
requires. If successful (i.e. if the requested uffdio_api.api is also
spoken by the running kernel and the requested features are going to
be enabled), the UFFDIO_API ioctl returns in uffdio_api.features and
uffdio_api.ioctls two 64-bit bitmasks of, respectively, all the
available features of the read(2) protocol and the generic ioctls
available.

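A minimal sketch of this handshake follows. It assumes the calling
process is permitted to use the userfaultfd syscall and that the C
library exposes SYS_userfaultfd; uffd_open() is a hypothetical helper
name, not a kernel or libc function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a userfaultfd and enable it with UFFDIO_API.
 * `features` holds the UFFD_FEATURE_* bits the caller requires (0 for none).
 * On success returns the descriptor and stores the kernel's reply
 * (features and ioctls bitmasks) in *api_out.
 * On failure returns -1 with errno set.
 */
static int uffd_open(uint64_t features, struct uffdio_api *api_out)
{
    assert(api_out != NULL);

    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd < 0)
        return -1;

    struct uffdio_api api = { .api = UFFD_API, .features = features };
    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        int saved = errno;      /* keep the ioctl error across close() */
        close(fd);
        errno = saved;
        return -1;
    }

    *api_out = api;
    return fd;
}
```

After a successful call, api_out->ioctls can be checked for
UFFDIO_REGISTER before any range is registered.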
The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
defines what memory types are supported by the userfaultfd and what
events, except page fault notifications, may be generated.

If the kernel supports registering userfaultfd ranges on hugetlbfs
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
set if the kernel supports registering userfaultfd ranges on shared
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
MAP_SHARED, memfd_create, etc).

A userland application that wants to use userfaultfd with hugetlbfs
or shared memory needs to set the corresponding flag in
uffdio_api.features to enable those features.

If userland desires to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has the
appropriate UFFD_FEATURE_EVENT_* bits set. These events are described
in more detail in the "Non-cooperative userfaultfd" section below.

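Checking the returned bitmask can be as simple as the following
sketch. The helpers uffd_has_features() and uffd_missing_features()
are hypothetical names used only for illustration; the
UFFD_FEATURE_* constants come from <linux/userfaultfd.h>.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>

/* True if every bit in `wanted` is set in `features`, the bitmask
 * returned in uffdio_api.features. */
static bool uffd_has_features(uint64_t features, uint64_t wanted)
{
    return (features & wanted) == wanted;
}

/* The subset of `wanted` that `features` does not offer. */
static uint64_t uffd_missing_features(uint64_t features, uint64_t wanted)
{
    return wanted & ~features;
}
```

A manager that needs fork and remap notifications on shared memory
could, for instance, test for UFFD_FEATURE_MISSING_SHMEM |
UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP and fall back or
exit if uffd_missing_features() is non-zero.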
Once the userfaultfd has been enabled, the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask specifies to the kernel which kinds of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING tracks missing
pages). The UFFDIO_REGISTER ioctl returns the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the registered range. Not all ioctls will necessarily be
supported for all memory types; this depends on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real file-backed
mappings).

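Registering a range for missing-page tracking might be sketched as
follows. uffd_register_missing() and uffd_range_supports_copy() are
hypothetical helper names; the sketch assumes addr and len are page
aligned and that uffd has already been enabled with UFFDIO_API.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Register the range starting at addr, len bytes long, for
 * missing-page faults. Returns 0 and stores the per-range ioctl
 * bitmask in *ioctls_out, or returns -1 with errno set.
 */
static int uffd_register_missing(int uffd, void *addr, size_t len,
                                 uint64_t *ioctls_out)
{
    assert(ioctls_out != NULL);

    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return -1;

    *ioctls_out = reg.ioctls;
    return 0;
}

/* True if UFFDIO_COPY is among the ioctls usable on the range. */
static int uffd_range_supports_copy(uint64_t ioctls)
{
    return (ioctls & ((uint64_t)1 << _UFFDIO_COPY)) != 0;
}
```

Checking the returned bitmask before relying on a given ioctl keeps
the code working across memory types that support only a subset.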
Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could be triggered just before userland maps the user-faulted page in
the background.

The primary ioctl to resolve userfaults is UFFDIO_COPY. It
atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
UFFDIO_COPY. They are atomic in the sense that nothing can see a
half-copied page, since the page keeps userfaulting until the copy has
finished.

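Resolving a single fault with UFFDIO_COPY could be sketched as below.
page_floor() and uffd_copy_page() are hypothetical helper names; the
sketch assumes the caller already holds the page contents in src and
knows the page size of the registered range.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Round addr down to the start of its page.
 * page_size must be a power of two. */
static uint64_t page_floor(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/*
 * Resolve a missing-page fault at fault_addr by copying one page
 * from src. If wake is zero, UFFDIO_COPY_MODE_DONTWAKE is set and the
 * blocked userfault is not woken by this call.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int uffd_copy_page(int uffd, uint64_t fault_addr, const void *src,
                          uint64_t page_size, int wake)
{
    struct uffdio_copy copy = {
        .dst  = page_floor(fault_addr, page_size),
        .src  = (uintptr_t)src,
        .len  = page_size,
        .mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE,
    };
    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
        return -1;
    return 0;
}
```

The destination is rounded down because the fault address reported by
the kernel may point anywhere inside the faulting page.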
QEMU/KVM
========

QEMU/KVM uses the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (e.g. network bound) can keep running in
the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).

A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread calls read()
on the userfaultfd and receives the fault address (or -EAGAIN in case
the userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE
run by the parallel QEMU migration thread).

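The poll()/read() step of such a listener thread could be sketched as
follows. This is not QEMU's code; uffd_next_msg() and
uffd_fault_address() are hypothetical helper names, and the sketch
assumes the userfaultfd was opened with O_NONBLOCK so that read() can
report EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for one message on uffd.
 * Returns 1 with *msg filled in, 0 if nothing is pending (poll timed
 * out, or read() reported EAGAIN because the fault was already
 * resolved), and -1 with errno set on error.
 */
static int uffd_next_msg(int uffd, int timeout_ms, struct uffd_msg *msg)
{
    assert(msg != NULL);
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };

    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;                       /* timed out */
    if (!(pfd.revents & POLLIN)) {      /* POLLERR, POLLHUP or POLLNVAL */
        errno = EIO;
        return -1;
    }

    ssize_t got = read(uffd, msg, sizeof(*msg));
    if (got < 0)
        return errno == EAGAIN ? 0 : -1;
    if (got != (ssize_t)sizeof(*msg)) { /* short read: not a full message */
        errno = EIO;
        return -1;
    }
    return 1;
}

/* Store the fault address in *addr_out and return 1 if msg is a page
 * fault notification; return 0 for any other event. */
static int uffd_fault_address(const struct uffd_msg *msg, uint64_t *addr_out)
{
    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return 0;
    *addr_out = msg->arg.pagefault.address;
    return 1;
}
```

A listener would call uffd_next_msg() in a loop and forward each
address obtained from uffd_fault_address() to whoever supplies the
page contents.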
After the QEMU postcopy thread (running in the destination node) gets
the userfault address, it writes the information about the missing page
into the socket. The QEMU source node receives the information,
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around, and a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page,
the bitmap is of course updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).

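The source-side bookkeeping described above can be illustrated with a
small sketch. It is not QEMU's implementation; struct missing_map and
the mm_*() helpers are hypothetical names, and the bitmap layout (one
bit per page, set while the page is still missing on the destination)
is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per guest page, set while the page is still missing. */
struct missing_map {
    uint8_t *bits;
    size_t   npages;
    size_t   cursor;    /* next page index to consider */
};

static bool mm_test(const struct missing_map *m, size_t i)
{
    return (m->bits[i / 8] & (1u << (i % 8))) != 0;
}

static void mm_clear(struct missing_map *m, size_t i)
{
    m->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* An incoming userfault moves the cursor to the requested page. */
static void mm_seek(struct missing_map *m, size_t page)
{
    assert(page < m->npages);
    m->cursor = page;
}

/*
 * Pick the next missing page at or after the cursor, wrapping around
 * at most once, and mark it as sent. Returns false once no page is
 * missing any more.
 */
static bool mm_next(struct missing_map *m, size_t *page_out)
{
    for (size_t n = 0; n < m->npages; n++) {
        size_t i = (m->cursor + n) % m->npages;
        if (mm_test(m, i)) {
            mm_clear(m, i);
            m->cursor = (i + 1) % m->npages;
            *page_out = i;
            return true;
        }
    }
    return false;
}
```

Because mm_next() clears the bit before returning, a userfault for a
page that was already sent only moves the cursor and never causes the
same page to be sent twice.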
Non-cooperative userfaultfd
===========================

When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in uffdio_api.features passed to UFFDIO_API ioctl:

UFFD_FEATURE_EVENT_FORK
        enable userfaultfd hooks for fork(). When this feature is
        enabled, the userfaultfd context of the parent process is
        duplicated into the newly created process. The manager
        receives UFFD_EVENT_FORK with the file descriptor of the new
        userfaultfd context in the uffd_msg.fork.

UFFD_FEATURE_EVENT_REMAP
        enable notifications about mremap() calls. When the
        non-cooperative process moves a virtual memory area to a
        different location, the manager will receive
        UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
        new addresses of the area and its original length.

UFFD_FEATURE_EVENT_REMOVE
        enable notifications about madvise(MADV_REMOVE) and
        madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
        be generated upon these calls to madvise. The uffd_msg.remove
        will contain start and end addresses of the removed area.

UFFD_FEATURE_EVENT_UNMAP
        enable notifications about memory unmapping. The manager will
        get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
        end addresses of the unmapped area.

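A manager's dispatch on the message type could be sketched as below.
The enum manager_action and uffd_classify() are hypothetical names
that only label what a manager would do next; the event constants and
the uffd_msg layout come from <linux/userfaultfd.h>.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>

enum manager_action {
    ACT_RESOLVE_FAULT,  /* page fault: resolve the faulting page */
    ACT_ADOPT_CHILD,    /* fork: start monitoring msg->arg.fork.ufd */
    ACT_MOVE_RANGE,     /* mremap: area moved from arg.remap.from to arg.remap.to */
    ACT_DROP_PAGES,     /* madvise: contents between arg.remove.start and
                         * arg.remove.end are gone, area still monitored */
    ACT_FORGET_RANGE,   /* unmap: stop tracking arg.remove.start..arg.remove.end */
    ACT_UNKNOWN,        /* event this sketch does not know about */
};

/* Map one message read from the userfaultfd to the action it calls for. */
static enum manager_action uffd_classify(const struct uffd_msg *msg)
{
    assert(msg != NULL);

    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return ACT_RESOLVE_FAULT;
    case UFFD_EVENT_FORK:
        return ACT_ADOPT_CHILD;
    case UFFD_EVENT_REMAP:
        return ACT_MOVE_RANGE;
    case UFFD_EVENT_REMOVE:
        return ACT_DROP_PAGES;
    case UFFD_EVENT_UNMAP:
        return ACT_FORGET_RANGE;
    default:
        return ACT_UNKNOWN;
    }
}
```

Keeping the classification separate from the handling code makes it
easy to see which events the manager has actually enabled and handles.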
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
are pretty similar, they differ considerably in the action expected
from the userfaultfd manager. In the former case, the virtual memory is
removed, but the area is not: it remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault
is to zeromap the faulting address. In the latter case, however, when
an area is unmapped, either explicitly (with the munmap() system call)
or implicitly (e.g. during mremap()), the area is removed, and in turn
the userfaultfd context for that area disappears too, so the manager
will not get further userland page faults from the removed area. Still,
the notification is required in order to prevent the manager from using
UFFDIO_COPY on the unmapped area.

Unlike userland page faults, which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The userfaultfd manager should
carefully synchronize calls to UFFDIO_COPY with the event
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
return -ENOSPC when the monitored process exits at the time of
UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed
its virtual memory layout simultaneously with an outstanding
UFFDIO_COPY operation.

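How a manager might react to these error codes is sketched below. The
enum copy_outcome and classify_copy_errno() are hypothetical names.
Treating ESRCH like ENOSPC is an assumption based on later kernel
versions reporting ESRCH for an exited process; it goes beyond what
this text states.

```c
#include <assert.h>
#include <errno.h>

enum copy_outcome {
    COPY_OK,              /* page was copied */
    COPY_LAYOUT_CHANGED,  /* process pending events before retrying */
    COPY_TARGET_GONE,     /* monitored process exited, stop serving it */
    COPY_FAILED,          /* any other error */
};

/* Classify the errno value left by a UFFDIO_COPY ioctl (0 on success). */
static enum copy_outcome classify_copy_errno(int err)
{
    assert(err >= 0);

    switch (err) {
    case 0:
        return COPY_OK;
    case ENOENT:    /* layout changed during the copy */
        return COPY_LAYOUT_CHANGED;
    case ENOSPC:    /* process exited, as described in the text */
    case ESRCH:     /* assumption: same condition on later kernels */
        return COPY_TARGET_GONE;
    default:
        return COPY_FAILED;
    }
}
```

On COPY_LAYOUT_CHANGED the manager would typically read and handle the
queued events first, since one of them explains why the copy target
is no longer valid.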
The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative userfaultfd manager implementations. A
synchronous event delivery model can be added later as a new
userfaultfd feature to facilitate multithreading enhancements of the
non-cooperative manager, for example to allow UFFDIO_COPY ioctls to
run in parallel to the event reception. Single threaded
implementations should continue to use the current async event
delivery model instead.