= Userfaultfd =

== Objective ==

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

== Design ==

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd that allow userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfaults never involve
heavyweight structures like vmas in any of their operations (in fact
the userfaultfd runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.

The userfaultfd, once opened by invoking the syscall, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware of what is going on
(well of course unless they later try to use the userfaultfd
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).

== API ==

When first opened the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl, specifying a uffdio_api.api value set to UFFD_API
(or a later API version) which will specify the read/POLLIN protocol
userland intends to speak on the UFFD and the uffdio_api.features
userland requires. The UFFDIO_API ioctl, if successful (i.e. if the
requested uffdio_api.api is spoken also by the running kernel and the
requested features are going to be enabled), will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of,
respectively, all the available features of the read(2) protocol and
the generic ioctls available.

The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
defines what memory types are supported by the userfaultfd and what
events, other than page fault notifications, may be generated.

If the kernel supports registering userfaultfd ranges on hugetlbfs
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
set if the kernel supports registering userfaultfd ranges on shared
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
MAP_SHARED, memfd_create, etc).

A userland application that wants to use userfaultfd with hugetlbfs
or shared memory needs to set the corresponding flag in
uffdio_api.features to enable those features.

If userland desires to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has the
appropriate UFFD_FEATURE_EVENT_* bits set. These events are described
in more detail below in the "Non-cooperative userfaultfd" section.

Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask will specify to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types, depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real filebacked
mappings).

Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could trigger just before userland maps the user-faulted page in the
background.

The primary ioctl to resolve userfaults is UFFDIO_COPY. It atomically
copies a page into the userfault registered range and wakes up the
blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
UFFDIO_COPY. They're atomic in the sense that nothing can see a
half-copied page, since the faulting thread will keep userfaulting
until the copy has finished.

== QEMU/KVM ==

QEMU/KVM is using the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in
the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).

A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread read()s from
the userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
by the parallel QEMU migration thread).

After the QEMU postcopy thread (running in the destination node) gets
the userfault address it writes the information about the missing page
into the socket. The QEMU source node receives the information and
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with UFFDIO_COPY|ZEROPAGE (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around and a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page of
course the bitmap is updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).

== Non-cooperative userfaultfd ==

When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting the
appropriate bits in uffdio_api.features passed to the UFFDIO_API
ioctl:

UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When
this feature is enabled, the userfaultfd context of the parent process
is duplicated into the newly created process. The manager receives
UFFD_EVENT_FORK with the file descriptor of the new userfaultfd
context in uffd_msg.fork.

UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap()
calls. When the non-cooperative process moves a virtual memory area to
a different location, the manager will receive UFFD_EVENT_REMAP. The
uffd_msg.remap will contain the old and new addresses of the area and
its original length.

UFFD_FEATURE_EVENT_REMOVE - enable notifications about
madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event
UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The
uffd_msg.remove will contain the start and end addresses of the
removed area.

UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory
unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove
containing the start and end addresses of the unmapped area.

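A manager's read() loop would dispatch on uffd_msg.event to tell these
notifications apart; a minimal dispatch sketch (the handle_event()
helper name is ours, and real handlers would do more than print):

```c
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>

/* Dispatch one already-read uffd_msg in a non-cooperative manager's
 * event loop. Returns the event code that was handled. */
int handle_event(const struct uffd_msg *msg)
{
	switch (msg->event) {
	case UFFD_EVENT_FORK:
		/* uffd_msg.fork.ufd is the child's new userfaultfd */
		printf("fork: child uffd %d\n", msg->arg.fork.ufd);
		break;
	case UFFD_EVENT_REMAP:
		printf("remap: %llx -> %llx len %llx\n",
		       (unsigned long long)msg->arg.remap.from,
		       (unsigned long long)msg->arg.remap.to,
		       (unsigned long long)msg->arg.remap.len);
		break;
	case UFFD_EVENT_REMOVE:
	case UFFD_EVENT_UNMAP:	/* both carry uffd_msg.remove */
		printf("remove/unmap: %llx-%llx\n",
		       (unsigned long long)msg->arg.remove.start,
		       (unsigned long long)msg->arg.remove.end);
		break;
	default:
		break;
	}
	return msg->event;
}
```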
Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP are
pretty similar, they differ quite a bit in the action expected from
the userfaultfd manager. In the former case, the virtual memory is
removed, but the area is not: the area remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault
is to zeromap the faulting address. However, in the latter case, when
an area is unmapped, either explicitly (with the munmap() system
call), or implicitly (e.g. during mremap()), the area is removed and
in turn the userfaultfd context for such an area disappears too and
the manager will not get further userland page faults from the
removed area. Still, the notification is required in order to prevent
the manager from using UFFDIO_COPY on the unmapped area.

Unlike userland page faults, which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The userfaultfd manager should
carefully synchronize calls to UFFDIO_COPY with the event
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
return -ENOSPC when the monitored process exits at the time of
UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed
its virtual memory layout simultaneously with an outstanding
UFFDIO_COPY operation.

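In code, that synchronization reduces to classifying the UFFDIO_COPY
errno; a sketch (the try_copy() helper name and return convention are
ours):

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Attempt a UFFDIO_COPY in a non-cooperative manager. Returns 1 when
 * the copy succeeded, 0 when it failed for a reason the event stream
 * explains (monitored process exited, or its layout changed under
 * us), and -1 on a genuine error. */
int try_copy(int uffd, unsigned long src, unsigned long dst,
	     unsigned long len)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.src = src;
	copy.dst = dst;
	copy.len = len;
	if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
		return 1;
	if (errno == ENOSPC)	/* monitored process has exited */
		return 0;
	if (errno == ENOENT)	/* layout changed (remap/remove/unmap) */
		return 0;
	return -1;
}
```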
The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative userfaultfd manager implementations. A
synchronous event delivery model can be added later as a new
userfaultfd feature to facilitate multithreading enhancements of the
non-cooperative manager, for example to allow UFFDIO_COPY ioctls to
run in parallel to the event reception. Single threaded
implementations should continue to use the current async event
delivery model instead.