Commit | Line | Data |
---|---|---|
60a427db JM |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
4917f55b JM |
3 | ========================================= |
4 | A vmemmap diet for HugeTLB and Device DAX | |
5 | ========================================= | |
6 | ||
7 | HugeTLB | |
8 | ======= | |
60a427db | 9 | |
dff03381 MS |
10 | This section is to explain how HugeTLB Vmemmap Optimization (HVO) works. |
11 | ||
838691a1 MS |
12 | The ``struct page`` structures are used to describe a physical page frame. By |
13 | default, there is a one-to-one mapping from a page frame to its corresponding | |
14 | ``struct page``. | |
60a427db JM |
15 | |
16 | HugeTLB pages consist of multiple base-page-size pages and are supported by many | |
17 | architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more | |
18 | details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are | |
19 | currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page | |
20 | consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages. | |
838691a1 | 21 | For each base page, there is a corresponding ``struct page``. |
60a427db | 22 | |
838691a1 MS |
23 | Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to |
24 | contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides | |
25 | this upper limit. The only 'useful' information in the remaining ``struct page`` | |
60a427db JM |
26 | is the compound_head field, and this field is the same for all tail pages. |
27 | ||
838691a1 | 28 | By removing redundant ``struct page`` for HugeTLB pages, memory can be returned |
60a427db JM |
29 | to the buddy allocator for other uses. |
30 | ||
31 | Different architectures support different HugeTLB page sizes. For example, the | |
32 | following table lists the HugeTLB page sizes supported by the x86 and arm64 | |
33 | architectures. Because arm64 supports 4k, 16k, and 64k base pages and also | |
34 | supports contiguous entries, it supports many different HugeTLB page | |
35 | sizes. | |
36 | ||
37 | +--------------+-----------+-----------------------------------------------+ | |
38 | | Architecture | Page Size | HugeTLB Page Size | | |
39 | +--------------+-----------+-----------+-----------+-----------+-----------+ | |
40 | | x86-64 | 4KB | 2MB | 1GB | | | | |
41 | +--------------+-----------+-----------+-----------+-----------+-----------+ | |
42 | | | 4KB | 64KB | 2MB | 32MB | 1GB | | |
43 | | +-----------+-----------+-----------+-----------+-----------+ | |
44 | | arm64 | 16KB | 2MB | 32MB | 1GB | | | |
45 | | +-----------+-----------+-----------+-----------+-----------+ | |
46 | | | 64KB | 2MB | 512MB | 16GB | | | |
47 | +--------------+-----------+-----------+-----------+-----------+-----------+ | |
48 | ||
838691a1 | 49 | When the system boots up, every HugeTLB page has more than one ``struct page`` |
60a427db JM |
50 | structs, whose total size is (unit: pages):: |
51 | ||
52 | struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE | |
53 | ||
54 | Where HugeTLB_Size is the size of the HugeTLB page. We know that the size | |
55 | of the HugeTLB page is always n times PAGE_SIZE. So we can get the following | |
56 | relationship:: | |
57 | ||
58 | HugeTLB_Size = n * PAGE_SIZE | |
59 | ||
60 | Then:: | |
61 | ||
62 | struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE | |
63 | = n * sizeof(struct page) / PAGE_SIZE | |
64 | ||
65 | We can use huge mapping at the pud/pmd level for the HugeTLB page. | |
66 | ||
67 | For a HugeTLB page mapped at the pmd level:: | |
68 | ||
69 | struct_size = n * sizeof(struct page) / PAGE_SIZE | |
70 | = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE | |
71 | = sizeof(struct page) / sizeof(pte_t) | |
72 | = 64 / 8 | |
73 | = 8 (pages) | |
74 | ||
75 | Where n is the number of pte entries that one page can contain. So the value of | |
76 | n is (PAGE_SIZE / sizeof(pte_t)). | |
77 | ||
78 | This optimization only supports 64-bit systems, so the value of sizeof(pte_t) | |
838691a1 MS |
79 | is 8. This optimization is also applicable only when the size of ``struct page`` |
80 | is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g. | |
60a427db | 81 | x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the |
838691a1 | 82 | size of its ``struct page`` structs is 8 page frames, whose size depends on the |
60a427db JM |
83 | size of the base page. |
84 | ||
85 | For a HugeTLB page mapped at the pud level:: | |
86 | ||
87 | struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd) | |
88 | = PAGE_SIZE / 8 * 8 (pages) | |
89 | = PAGE_SIZE (pages) | |
90 | ||
838691a1 | 91 | Where the struct_size(pmd) is the size of the ``struct page`` structs of a |
60a427db JM |
92 | HugeTLB page of the pmd level mapping. |
93 | ||
94 | E.g.: On x86_64, the ``struct page`` structs of a 2MB HugeTLB page occupy 8 page | |
95 | frames, while those of a 1GB HugeTLB page occupy 4096. | |
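
The arithmetic above can be checked with a small userspace sketch. It is not kernel code: the function name ``vmemmap_pages()`` and the macro ``STRUCT_PAGE_SIZE`` are hypothetical, and the value 64 is the typical ``sizeof(struct page)`` mentioned in the text, taken here as an assumption.

```c
#include <assert.h>

/* Assumed value: typical sizeof(struct page) on x86-64 and arm64. */
#define STRUCT_PAGE_SIZE 64ULL

/*
 * Hypothetical helper: number of base pages occupied by the
 * struct page array describing one HugeTLB page, i.e.
 *   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
 */
unsigned long long vmemmap_pages(unsigned long long hugetlb_size,
                                 unsigned long long page_size)
{
    unsigned long long n;

    assert(page_size != 0);
    n = hugetlb_size / page_size;   /* base pages per HugeTLB page */

    return n * STRUCT_PAGE_SIZE / page_size;
}
```

With a 4KB base page, ``vmemmap_pages(2ULL << 20, 4096)`` evaluates to 8 and ``vmemmap_pages(1ULL << 30, 4096)`` evaluates to 4096, matching the figures quoted for x86_64.
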
96 | ||
97 | Next, we take the pmd level mapping of the HugeTLB page as an example to | |
98 | show the internal implementation of this optimization. There are 8 pages of | |
838691a1 | 99 | ``struct page`` structs associated with a HugeTLB page which is pmd mapped. |
60a427db JM |
100 | |
101 | Here is how things look before optimization:: | |
102 | ||
103 | HugeTLB struct pages(8 pages) page frame(8 pages) | |
104 | +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | |
105 | | | | 0 | -------------> | 0 | | |
106 | | | +-----------+ +-----------+ | |
107 | | | | 1 | -------------> | 1 | | |
108 | | | +-----------+ +-----------+ | |
109 | | | | 2 | -------------> | 2 | | |
110 | | | +-----------+ +-----------+ | |
111 | | | | 3 | -------------> | 3 | | |
112 | | | +-----------+ +-----------+ | |
113 | | | | 4 | -------------> | 4 | | |
114 | | PMD | +-----------+ +-----------+ | |
115 | | level | | 5 | -------------> | 5 | | |
116 | | mapping | +-----------+ +-----------+ | |
117 | | | | 6 | -------------> | 6 | | |
118 | | | +-----------+ +-----------+ | |
119 | | | | 7 | -------------> | 7 | | |
120 | | | +-----------+ +-----------+ | |
121 | | | | |
122 | | | | |
123 | | | | |
124 | +-----------+ | |
125 | ||
126 | The value of page->compound_head is the same for all tail pages. The first | |
838691a1 MS |
127 | page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4 |
128 | ``struct page`` necessary to describe the HugeTLB. The only use of the remaining | |
129 | pages of ``struct page`` (page 1 to page 7) is to point to page->compound_head. | |
130 | Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page`` | |
60a427db JM |
131 | will be used for each HugeTLB page. This will allow us to free the remaining |
132 | 7 pages to the buddy allocator. | |
133 | ||
134 | Here is how things look after remapping:: | |
135 | ||
136 | HugeTLB struct pages(8 pages) page frame(8 pages) | |
137 | +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | |
138 | | | | 0 | -------------> | 0 | | |
139 | | | +-----------+ +-----------+ | |
140 | | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^ | |
141 | | | +-----------+ | | | | | | | |
142 | | | | 2 | -----------------+ | | | | | | |
143 | | | +-----------+ | | | | | | |
144 | | | | 3 | -------------------+ | | | | | |
145 | | | +-----------+ | | | | | |
146 | | | | 4 | ---------------------+ | | | | |
147 | | PMD | +-----------+ | | | | |
148 | | level | | 5 | -----------------------+ | | | |
149 | | mapping | +-----------+ | | | |
150 | | | | 6 | -------------------------+ | | |
151 | | | +-----------+ | | |
152 | | | | 7 | ---------------------------+ | |
153 | | | +-----------+ | |
154 | | | | |
155 | | | | |
156 | | | | |
157 | +-----------+ | |
158 | ||
159 | When a HugeTLB page is freed to the buddy system, we should allocate 7 pages for | |
160 | vmemmap pages and restore the previous mapping relationship. | |
161 | ||
162 | The HugeTLB page of the pud level mapping is handled similarly. | |
163 | We can also use this approach to free (PAGE_SIZE - 1) vmemmap pages. | |
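
As a rough illustration of what this saves, the sketch below converts the freed vmemmap pages into bytes. The helper ``hvo_freed_bytes()`` is hypothetical and assumes a 64-byte ``struct page``; it only restates the "all but one vmemmap page" rule from the text and does not model any other accounting the kernel may do.

```c
#include <assert.h>

/* Assumed value: typical sizeof(struct page), as stated in the text. */
#define STRUCT_PAGE_SIZE 64ULL

/*
 * Hypothetical helper: bytes returned to the buddy allocator for one
 * HugeTLB page when every vmemmap page except the head one is freed.
 */
unsigned long long hvo_freed_bytes(unsigned long long hugetlb_size,
                                   unsigned long long page_size)
{
    unsigned long long vmemmap;

    assert(page_size != 0);
    vmemmap = hugetlb_size / page_size * STRUCT_PAGE_SIZE / page_size;
    assert(vmemmap > 1);            /* nothing to free otherwise */

    return (vmemmap - 1) * page_size;
}
```

With a 4KB base page this gives 7 * 4KB = 28KB per 2MB HugeTLB page and 4095 * 4KB (just under 16MB) per 1GB HugeTLB page.
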
164 | ||
165 | Apart from the HugeTLB page of the pmd/pud level mapping, some architectures | |
166 | (e.g. aarch64) provide a contiguous bit in the translation table entries | |
167 | that hints to the MMU to indicate that it is one of a contiguous set of | |
168 | entries that can be cached in a single TLB entry. | |
169 | ||
170 | The contiguous bit is used to increase the mapping size at the pmd and pte | |
171 | (last) level. So this type of HugeTLB page can be optimized only when the | |
838691a1 | 172 | size of its ``struct page`` structs is greater than **1** page. |
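
The size condition above can be tried against the arm64 sizes from the table with a short sketch. The function ``hvo_candidate()`` is hypothetical and assumes a 64-byte ``struct page``; it encodes only the "greater than 1 page" rule stated here, not any further checks the kernel may apply.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value: typical sizeof(struct page), as stated in the text. */
#define STRUCT_PAGE_SIZE 64ULL

/*
 * Hypothetical helper: true when the struct page array of one HugeTLB
 * page is larger than a single base page, so there is at least one
 * tail vmemmap page that could be remapped and freed.
 */
bool hvo_candidate(unsigned long long hugetlb_size,
                   unsigned long long page_size)
{
    unsigned long long vmemmap_bytes;

    assert(page_size != 0);
    vmemmap_bytes = hugetlb_size / page_size * STRUCT_PAGE_SIZE;

    return vmemmap_bytes > page_size;
}
```

Under these assumptions a 64KB HugeTLB page on a 4KB base page needs only 1KB of ``struct page`` and does not qualify, while a 32MB HugeTLB page on a 16KB base page needs 8 vmemmap pages and does.
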
60a427db JM |
173 | |
174 | Notice: The head vmemmap page is not freed to the buddy allocator and all | |
175 | tail vmemmap pages are mapped to the head vmemmap page frame. So we can see | |
838691a1 MS |
176 | more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB |
177 | page) associated with each HugeTLB page. ``compound_head()`` can handle | |
178 | this correctly. There is only **one** head ``struct page``; the tail | |
179 | ``struct page`` with ``PG_head`` are fake head ``struct page``. We need an | |
180 | approach to distinguish between those two different types of ``struct page`` so | |
181 | that ``compound_head()`` can return the real head ``struct page`` when the | |
182 | parameter is the tail ``struct page`` but with ``PG_head``. The following code | |
183 | snippet describes how to distinguish between real and fake head ``struct page``. | |
184 | ||
185 | .. code-block:: c | |
186 | ||
187 | if (test_bit(PG_head, &page->flags)) { | |
188 | unsigned long head = READ_ONCE(page[1].compound_head); | |
189 | ||
190 | if (head & 1) { | |
191 | if (head == (unsigned long)page + 1) | |
192 | /* head struct page */ | |
193 | else | |
194 | /* tail struct page */ | |
195 | } else { | |
196 | /* head struct page */ | |
197 | } | |
198 | } | |
199 | ||
200 | We can safely access the fields of **page[1]** when ``PG_head`` is set because the | |
201 | page is a compound page composed of at least two contiguous pages. | |
202 | For the implementation, refer to ``page_fixed_fake_head()``. | |
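
The check can be exercised outside the kernel with a simplified model. Everything below is an assumption made for illustration: ``struct model_page``, ``PG_HEAD_FLAG``, ``model_init()``, ``is_real_head()`` and ``model_is_real_head()`` are hypothetical names, the remapping is emulated by copying the first vmemmap page over the other seven, and ``READ_ONCE()`` is omitted. It is a sketch of the idea, not the code of ``page_fixed_fake_head()``.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PG_HEAD_FLAG 1UL        /* hypothetical flag bit for PG_head */
#define PAGE_SIZE_B  4096       /* assumed 4KB base page */
#define NR_VMEMMAP   8          /* vmemmap pages per 2MB HugeTLB page */

/* Hypothetical 64-byte stand-in for struct page. */
struct model_page {
    unsigned long flags;
    uintptr_t compound_head;
    unsigned char pad[64 - sizeof(unsigned long) - sizeof(uintptr_t)];
};

#define PER_VMEMMAP_PAGE (PAGE_SIZE_B / sizeof(struct model_page))
#define NR_PAGES         (NR_VMEMMAP * PER_VMEMMAP_PAGE)

/* Fill vmemmap page 0, then emulate remapping pages 1..7 onto it. */
void model_init(struct model_page *pages)
{
    size_t i, k;

    memset(pages, 0, NR_PAGES * sizeof(*pages));
    pages[0].flags = PG_HEAD_FLAG;
    for (i = 1; i < PER_VMEMMAP_PAGE; i++)
        pages[i].compound_head = (uintptr_t)&pages[0] | 1;

    /* After remapping, every vmemmap page shows page 0's contents. */
    for (k = 1; k < NR_VMEMMAP; k++)
        memcpy(&pages[k * PER_VMEMMAP_PAGE], &pages[0], PAGE_SIZE_B);
}

/* Same decision tree as the snippet in the text. */
int is_real_head(const struct model_page *page)
{
    uintptr_t head;

    if (!(page->flags & PG_HEAD_FLAG))
        return 0;               /* PG_head not set: not a head */

    head = page[1].compound_head;
    if (head & 1)
        return head == (uintptr_t)page + 1;   /* fake head if unequal */
    return 1;
}

/* Convenience wrapper: build the model and classify entry idx. */
int model_is_real_head(size_t idx)
{
    static struct model_page pages[NR_PAGES];

    assert(idx < NR_PAGES);
    model_init(pages);
    return is_real_head(&pages[idx]);
}
```

In this model only entry 0 is reported as the real head. Entries 64, 128, ... 448 also carry the head flag, because they show the contents of entry 0, but the ``compound_head`` of their successor still points at entry 0, so they are classified as fake heads.
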
4917f55b JM |
203 | |
204 | Device DAX | |
205 | ========== | |
206 | ||
207 | The device-dax interface uses the same tail deduplication technique explained | |
208 | in the previous chapter, except when used with the vmemmap in | |
209 | the device (altmap). | |
210 | ||
211 | The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64), | |
212 | PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64). | |
213 | ||
214 | The differences with HugeTLB are relatively minor. | |
215 | ||
838691a1 | 216 | It only uses 3 ``struct page`` for storing all information as opposed |
4917f55b JM |
217 | to 4 on HugeTLB pages. |
218 | ||
219 | There's no remapping of vmemmap given that device-dax memory is not part of | |
220 | System RAM ranges initialized at boot. Thus the tail page deduplication | |
221 | happens at a later stage when we populate the sections. HugeTLB reuses the | |
222 | head vmemmap page, whereas device-dax reuses the tail | |
223 | vmemmap page. This results in only half of the savings compared to HugeTLB. | |
224 | ||
225 | Deduplicated tail pages are not mapped read-only. | |
226 | ||
227 | Here's how things look on device-dax after the sections are populated:: | |
228 | ||
229 | +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | |
230 | | | | 0 | -------------> | 0 | | |
231 | | | +-----------+ +-----------+ | |
232 | | | | 1 | -------------> | 1 | | |
233 | | | +-----------+ +-----------+ | |
234 | | | | 2 | ----------------^ ^ ^ ^ ^ ^ | |
235 | | | +-----------+ | | | | | | |
236 | | | | 3 | ------------------+ | | | | | |
237 | | | +-----------+ | | | | | |
238 | | | | 4 | --------------------+ | | | | |
239 | | PMD | +-----------+ | | | | |
240 | | level | | 5 | ----------------------+ | | | |
241 | | mapping | +-----------+ | | | |
242 | | | | 6 | ------------------------+ | | |
243 | | | +-----------+ | | |
244 | | | | 7 | --------------------------+ | |
245 | | | +-----------+ | |
246 | | | | |
247 | | | | |
248 | | | | |
249 | +-----------+ |