Commit | Line | Data |
---|---|---|
61f5e1a3 DHB |
1 | |
2 | NUMA mechanics for sPAPR (pseries machines) | |
3 | ============================================ | |
4 | ||
5 | NUMA in sPAPR works differently from the System Locality Distance | |
6 | Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR | |
7 | 1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This | |
8 | document aims to complement this specification, providing details | |
9 | of the elements that impact how QEMU views NUMA in pseries. | |
10 | ||
11 | Associativity and ibm,associativity property | |
12 | -------------------------------------------- | |
13 | ||
14 | Associativity is defined as a group of platform resources that have | |
15 | similar mean performance (or in our context here, distance) relative to | |
16 | everyone else outside of the group. | |
17 | ||
18 | The format of the ibm,associativity property varies with the value of | |
19 | bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with | |
20 | bit 0 equal to zero is deprecated. In the current format, with bit 0 | |
21 | set to one, the ibm,associativity property represents the | |
22 | physical hierarchy of the platform as one or more lists that start | |
23 | with the highest level grouping and go down to the smallest. Considering the | |
24 | following topology: | |
25 | ||
26 | :: | |
27 | ||
28 | Mem M1 ---- Proc P1 | | |
29 | ----------------- | Socket S1 ---| | |
30 | chip C1 | | | |
31 | | HW module 1 (MOD1) | |
32 | Mem M2 ---- Proc P2 | | | |
33 | ----------------- | Socket S2 ---| | |
34 | chip C2 | | |
35 | ||
36 | The ibm,associativity property for the processors would be: | |
37 | ||
38 | * P1: {MOD1, S1, C1, P1} | |
39 | * P2: {MOD1, S2, C2, P2} | |
40 | ||
41 | Each allocable resource has an ibm,associativity property. The LOPAPR | |
42 | specification allows multiple lists to be present in this property, | |
43 | considering that the same resource can have multiple connections to the | |
44 | platform. | |
45 | ||
46 | Relative Performance Distance and ibm,associativity-reference-points | |
47 | -------------------------------------------------------------------- | |
48 | ||
49 | The ibm,associativity-reference-points property is an array that is used | |
50 | to define the relevant performance/distance related boundaries, defining | |
51 | the NUMA levels for the platform. | |
52 | ||
53 | The definition of its elements also varies with the value of bit 0 of byte 5 | |
54 | of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero | |
55 | is also deprecated. With the current format, each integer of the | |
56 | ibm,associativity-reference-points represents a 1-based ordinal index (i.e. | |
57 | the first element is 1) of the ibm,associativity array. The first | |
58 | boundary is the most significant to application performance, followed by | |
59 | less significant boundaries. Allocated resources that belong to the | |
60 | same performance boundaries are expected to have relative NUMA distance | |
61 | that matches the relevancy of the boundary itself. Resources that belong | |
62 | to the same first boundary will have the shortest distance from each | |
63 | other. Subsequent boundaries represent greater distances and degraded | |
64 | performance. | |
65 | ||
66 | Using the previous example, the following reference points setting defines | |
67 | three NUMA levels: | |
68 | ||
69 | * ibm,associativity-reference-points = {0x3, 0x2, 0x1} | |
70 | ||
71 | The first NUMA level (0x3) is interpreted as the third element of each | |
72 | ibm,associativity array, the second level is the second element and | |
73 | the third level is the first element. Let's also consider that elements | |
74 | belonging to the first NUMA level have distance equal to 10 from each | |
75 | other, and each NUMA level doubles the distance from the previous. This | |
76 | means that the second would be 20 and the third level 40. For the P1 and | |
77 | P2 processors, we would have the following NUMA levels: | |
78 | ||
79 | :: | |
80 | ||
81 | * ibm,associativity-reference-points = {0x3, 0x2, 0x1} | |
82 | ||
83 | * P1: associativity{MOD1, S1, C1, P1} | |
84 | ||
85 | First NUMA level (0x3) => associativity[2] = C1 | |
86 | Second NUMA level (0x2) => associativity[1] = S1 | |
87 | Third NUMA level (0x1) => associativity[0] = MOD1 | |
88 | ||
89 | * P2: associativity{MOD1, S2, C2, P2} | |
90 | ||
91 | First NUMA level (0x3) => associativity[2] = C2 | |
92 | Second NUMA level (0x2) => associativity[1] = S2 | |
93 | Third NUMA level (0x1) => associativity[0] = MOD1 | |
94 | ||
95 | P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40 | |
96 | ||
97 | Changing the ibm,associativity-reference-points array changes the performance | |
98 | distance attributes for the same associativity arrays, as the following | |
99 | example illustrates: | |
100 | ||
101 | :: | |
102 | ||
103 | * ibm,associativity-reference-points = {0x2} | |
104 | ||
105 | * P1: associativity{MOD1, S1, C1, P1} | |
106 | ||
107 | First NUMA level (0x2) => associativity[1] = S1 | |
108 | ||
109 | * P2: associativity{MOD1, S2, C2, P2} | |
110 | ||
111 | First NUMA level (0x2) => associativity[1] = S2 | |
112 | ||
113 | P1 and P2 do not have a common performance boundary. Since this is a one level | |
114 | NUMA configuration, distance between them is one boundary above the first | |
115 | level, 20. | |
116 | ||
117 | ||
118 | In a hypothetical platform where all resources inside the same hardware module | |
119 | are considered to be on the same performance boundary: | |
120 | ||
121 | :: | |
122 | ||
123 | * ibm,associativity-reference-points = {0x1} | |
124 | ||
125 | * P1: associativity{MOD1, S1, C1, P1} | |
126 | ||
127 | First NUMA level (0x1) => associativity[0] = MOD1 | |
128 | ||
129 | * P2: associativity{MOD1, S2, C2, P2} | |
130 | ||
131 | First NUMA level (0x1) => associativity[0] = MOD1 | |
132 | ||
133 | P1 and P2 belong to the same first order boundary. The distance between them | |
134 | is 10. | |
135 | ||
136 | ||
137 | How the pseries Linux guest calculates NUMA distances | |
138 | ===================================================== | |
139 | ||
140 | Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is | |
141 | how the distances are expressed. The SLIT table provides the NUMA distance | |
142 | value between the relevant resources. LOPAPR does not provide a standard | |
143 | way to calculate it. We have the ibm,associativity for each resource, which | |
144 | provides a common-performance hierarchy, and the ibm,associativity-reference-points | |
145 | array that tells which level of associativity is considered to be relevant | |
146 | or not. | |
147 | ||
148 | The result is that each OS is free to implement and to interpret the distance | |
149 | as it sees fit. For the pseries Linux guest, each level of NUMA doubles | |
150 | the distance of the previous level, and the maximum number of levels is | |
151 | limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the | |
152 | kernel tree). This results in the following distances: | |
153 | ||
154 | * both resources in the first NUMA level: 10 | |
155 | * resources one NUMA level apart: 20 | |
156 | * resources two NUMA levels apart: 40 | |
157 | * resources three NUMA levels apart: 80 | |
158 | * resources four NUMA levels apart: 160 | |
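
The doubling rule above can be sketched in a few lines of Python. This is only an illustration of the logic described in this document, not the kernel code: the function name ``numa_distance`` and the use of strings as domain identifiers are assumptions made for the example.

```python
LOCAL_DISTANCE = 10  # distance between resources in the same first NUMA level


def numa_distance(assoc_a, assoc_b, reference_points):
    """Distance between two resources, given their associativity arrays.

    reference_points holds 1-based indexes into the associativity arrays,
    with the most significant boundary first.
    """
    distance = LOCAL_DISTANCE
    for ref in reference_points:
        if assoc_a[ref - 1] == assoc_b[ref - 1]:
            break  # common boundary found at this NUMA level
        distance *= 2  # one more NUMA level apart: double the distance
    return distance


# Associativity arrays of the P1/P2 example used earlier in this document.
p1 = ["MOD1", "S1", "C1", "P1"]
p2 = ["MOD1", "S2", "C2", "P2"]

print(numa_distance(p1, p2, [3, 2, 1]))  # 40
print(numa_distance(p1, p2, [2]))        # 20
print(numa_distance(p1, p2, [1]))        # 10
```

With four reference points and no match in any of them, the same loop yields 160, the largest value in the list above.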
159 | ||
160 | ||
307e7a34 DHB |
161 | pseries NUMA mechanics |
162 | ====================== | |
163 | ||
164 | Starting in QEMU 5.2, the pseries machine considers user input when setting NUMA | |
165 | topology of the guest. The overall design is: | |
166 | ||
167 | * ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing | |
168 | for 4 distinct NUMA distance values based on the NUMA levels | |
169 | ||
170 | * ibm,max-associativity-domains supports multiple associativity domains in all | |
171 | NUMA levels, granting user flexibility | |
172 | ||
173 | * ibm,associativity for all resources varies with user input | |
174 | ||
175 | These changes are only effective for pseries-5.2 and newer machines that are | |
176 | created with more than one NUMA node (not counting NUMA nodes created by | |
177 | the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been | |
178 | around for such a long time, with users seeing NUMA distances 10 and 40 | |
179 | (and 80 if using NVLink2 GPUs), and there is no need to disrupt the | |
180 | existing experience of those guests. | |
181 | ||
182 | To bring to pseries the experience x86 users have when tuning NUMA, we had | |
183 | to operate under the current pseries Linux kernel logic described in | |
184 | `How the pseries Linux guest calculates NUMA distances`_. The result | |
185 | is that we needed to translate NUMA distance user input to pseries | |
186 | Linux kernel input. | |
187 | ||
188 | Translating user distance to kernel distance | |
189 | -------------------------------------------- | |
190 | ||
191 | User input for NUMA distance can vary from 10 to 254. We need to translate | |
192 | that to the values that the Linux kernel operates on (10, 20, 40, 80, 160). | |
193 | This is how it is being done: | |
194 | ||
195 | * user distance 11 to 30 will be interpreted as 20 | |
196 | * user distance 31 to 60 will be interpreted as 40 | |
197 | * user distance 61 to 120 will be interpreted as 80 | |
198 | * user distance 121 and beyond will be interpreted as 160 | |
199 | * user distance 10 stays 10 | |
200 | ||
ac9574bc | 201 | The reasoning behind this approximation is to avoid any round up to the local |
307e7a34 DHB |
202 | distance (10), keeping it exclusive to the 4th NUMA level (which is still |
203 | exclusive to the node_id). All other ranges were chosen at the developer's | |
204 | discretion of what would be (somewhat) sensible considering the user input. | |
205 | Any other strategy can be used here, but in the end the reality is that we'll | |
206 | have to accept that a large array of values will be translated to the same | |
207 | NUMA topology in the guest, e.g. this user input: | |
208 | ||
209 | :: | |
210 | ||
211 | 0 1 2 | |
212 | 0 10 31 120 | |
213 | 1 31 10 30 | |
214 | 2 120 30 10 | |
215 | ||
216 | And this other user input: | |
217 | ||
218 | :: | |
219 | ||
220 | 0 1 2 | |
221 | 0 10 60 61 | |
222 | 1 60 10 11 | |
223 | 2 61 11 10 | |
224 | ||
225 | Will both be translated to the same values internally: | |
226 | ||
227 | :: | |
228 | ||
229 | 0 1 2 | |
230 | 0 10 40 80 | |
231 | 1 40 10 20 | |
232 | 2 80 20 10 | |
233 | ||
234 | Users are encouraged to use only the kernel values in the NUMA definition to | |
235 | avoid being taken by surprise by what the guest is actually seeing in the | |
236 | topology. There are enough potential surprises that are inherent to the | |
237 | associativity domain assignment process, discussed below. | |
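
The translation ranges listed in this section can be written down as a small Python sketch. The names ``kernel_distance`` and ``translate`` are hypothetical and exist only for this example; the code assumes the input is already within the valid user range (10 to 254).

```python
def kernel_distance(user_distance):
    """Round a user supplied NUMA distance to a value the kernel operates on."""
    if user_distance == 10:
        return 10  # local distance is never the result of rounding
    if user_distance <= 30:
        return 20
    if user_distance <= 60:
        return 40
    if user_distance <= 120:
        return 80
    return 160


def translate(matrix):
    """Apply kernel_distance() to every entry of a distance matrix."""
    return [[kernel_distance(d) for d in row] for row in matrix]


# The two user inputs shown above.
first = [[10, 31, 120], [31, 10, 30], [120, 30, 10]]
second = [[10, 60, 61], [60, 10, 11], [61, 11, 10]]

print(translate(first))
print(translate(first) == translate(second))  # True
```

Both inputs collapse to the same matrix, which is why sticking to 10, 20, 40, 80 and 160 in the NUMA definition makes the outcome easier to predict.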
238 | ||
239 | ||
240 | How associativity domains are assigned | |
241 | -------------------------------------- | |
242 | ||
243 | LOPAPR allows more than one associativity array (or 'string') per allocated | |
244 | resource. This would be used to represent that the resource has multiple | |
245 | connections with the board, and then the operating system, when deciding | |
246 | NUMA distancing, should consider the associativity information that provides | |
247 | the shortest distance. | |
248 | ||
249 | The spapr implementation does not support multiple associativity arrays per | |
250 | resource, and neither does the pseries Linux kernel. We'll have to represent the | |
251 | NUMA topology using one associativity per resource, which means that choices | |
252 | and compromises are going to be made. | |
253 | ||
254 | Consider the following NUMA topology entered by user input: | |
255 | ||
256 | :: | |
257 | ||
258 | 0 1 2 3 | |
259 | 0 10 40 20 40 | |
260 | 1 40 10 80 40 | |
261 | 2 20 80 10 20 | |
262 | 3 40 40 20 10 | |
263 | ||
264 | All the associativity arrays are initialized with NUMA id in all associativity | |
265 | domains: | |
266 | ||
267 | * node 0: 0 0 0 0 | |
268 | * node 1: 1 1 1 1 | |
269 | * node 2: 2 2 2 2 | |
270 | * node 3: 3 3 3 3 | |
271 | ||
272 | ||
273 | Honoring just the relative distances of node 0 to every other node, we find the | |
274 | NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for | |
275 | each distance: | |
276 | ||
277 | * distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match | |
278 | at 0x2) | |
279 | * distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3) | |
280 | * distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match | |
281 | at 0x2) | |
282 | ||
283 | We'll copy the associativity domains of node 0 to all other nodes, based on | |
284 | the NUMA level matches. Between 0 and 1, with a match in 0x2, we'll copy | |
285 | the domains 0x2 and 0x1 from 0 to 1. This will give us: | |
286 | ||
287 | * node 0: 0 0 0 0 | |
288 | * node 1: 0 0 1 1 | |
289 | ||
290 | Doing the same to node 2 and node 3, these are the associativity arrays | |
291 | after considering all matches with node 0: | |
292 | ||
293 | * node 0: 0 0 0 0 | |
294 | * node 1: 0 0 1 1 | |
295 | * node 2: 0 0 0 2 | |
296 | * node 3: 0 0 3 3 | |
297 | ||
298 | The distances related to node 0 are accounted for. For node 1, and keeping | |
299 | in mind that we don't need to revisit node 0 again, the distance from | |
300 | node 1 to 2 is 80, matching at 0x1, and distance from 1 to 3 is 40, | |
301 | match in 0x2. Repeating the same logic of copying all domains up to | |
302 | the NUMA level match: | |
303 | ||
304 | * node 0: 0 0 0 0 | |
305 | * node 1: 1 0 1 1 | |
306 | * node 2: 1 0 0 2 | |
307 | * node 3: 1 0 3 3 | |
308 | ||
309 | In the last step we will analyze just nodes 2 and 3. The desired distance | |
310 | between 2 and 3 is 20, i.e. a match in 0x3: | |
311 | ||
312 | * node 0: 0 0 0 0 | |
313 | * node 1: 1 0 1 1 | |
314 | * node 2: 1 0 0 2 | |
315 | * node 3: 1 0 0 3 | |
316 | ||
317 | ||
318 | The kernel will read these arrays and will calculate the following NUMA topology for | |
319 | the guest: | |
320 | ||
321 | :: | |
322 | ||
323 | 0 1 2 3 | |
324 | 0 10 40 20 20 | |
325 | 1 40 10 40 40 | |
326 | 2 20 40 10 20 | |
327 | 3 20 40 20 10 | |
328 | ||
329 | Note that this is not what the user wanted - the desired distance between | |
330 | 0 and 3 is 40, we calculated it as 20. This is what the current logic and | |
331 | implementation constraints of the kernel and QEMU will provide inside the | |
332 | LOPAPR specification. | |
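
The procedure above can be approximated with the following Python sketch. It is a simplified reading of the steps in this section, not QEMU's code: the function names are hypothetical, and the domain values it builds at each step are not guaranteed to match the lists above digit for digit. Only the distances the guest would compute are checked against the table.

```python
def numa_level(distance):
    """How many leading domains (0x1 up to the match) two nodes share."""
    if distance == 10:
        return 4
    if distance <= 30:
        return 3  # match at 0x3
    if distance <= 60:
        return 2  # match at 0x2
    if distance <= 120:
        return 1  # match at 0x1
    return 0      # no match at all


def assign_domains(matrix):
    """Build one associativity array per node from a user distance matrix."""
    count = len(matrix)
    # Every node starts with its own id in all four domains.
    assoc = [[node] * 4 for node in range(count)]
    for src in range(count):
        # Nodes before src were already handled and are not revisited.
        for dst in range(src + 1, count):
            level = numa_level(matrix[src][dst])
            # Copy domains 0x1..level from src to dst.
            assoc[dst][:level] = assoc[src][:level]
    return assoc


def guest_distance(assoc_a, assoc_b, reference_points=(4, 3, 2, 1)):
    """Distance as the guest would see it, doubling per unmatched level."""
    distance = 10
    for ref in reference_points:
        if assoc_a[ref - 1] == assoc_b[ref - 1]:
            break
        distance *= 2
    return distance


user = [
    [10, 40, 20, 40],
    [40, 10, 80, 40],
    [20, 80, 10, 20],
    [40, 40, 20, 10],
]
assoc = assign_domains(user)
guest = [[guest_distance(a, b) for b in assoc] for a in assoc]
for row in guest:
    print(row)
```

Feeding other matrices to ``assign_domains`` is a cheap way to preview which distances are likely to survive the translation before starting a guest.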
333 | ||
334 | Users are welcome to use this knowledge and experiment with the input to get | |
335 | the NUMA topology they want, or as close to it as possible. The important thing | |
336 | is to keep expectations up to par with what we are capable of providing at this | |
337 | moment: an approximation. | |
338 | ||
339 | Limitations of the implementation | |
61f5e1a3 DHB |
340 | --------------------------------- |
341 | ||
307e7a34 DHB |
342 | As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate |
343 | user choice. The Linux kernel, and PAPR itself, do not provide QEMU with the ways | |
344 | to fully map user input to actual NUMA distance the guest will use. These constraints | |
345 | create two notable limitations in our support: | |
346 | ||
347 | * Asymmetrical topologies aren't supported. We only support NUMA topologies where | |
348 | the distance from node A to B is always the same as B to A. We do not support | |
349 | any A-B pair where the distance back and forth is asymmetric. For example, the | |
350 | following topology isn't supported and the pSeries guest will not boot with this | |
351 | user input: | |
352 | ||
353 | :: | |
354 | ||
355 | 0 1 | |
356 | 0 10 40 | |
357 | 1 20 10 | |
358 | ||
359 | ||
360 | * 'non-transitive' topologies will be poorly translated to the guest. This is the | |
361 | kind of topology where the distance from a node A to B is X, B to C is X, but | |
362 | the distance A to C is not X. E.g.: | |
363 | ||
364 | :: | |
365 | ||
366 | 0 1 2 3 | |
367 | 0 10 20 20 40 | |
368 | 1 20 10 80 40 | |
369 | 2 20 80 10 20 | |
370 | 3 40 40 20 10 | |
371 | ||
372 | In the example above, distance 0 to 2 is 20, 2 to 3 is 20, but 0 to 3 is 40. | |
373 | The kernel will always match with the shortest associativity domain possible, | |
374 | and we're attempting to retain the previously established relations between the | |
375 | nodes. This means that a distance equal to 20 between nodes 0 and 2 and the | |
376 | same distance 20 between nodes 2 and 3 will cause the distance between 0 and 3 | |
377 | to also be 20. | |
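
Since asymmetric input is rejected, it can be useful to validate a distance matrix before passing it to QEMU. The check below is a minimal sketch; ``is_symmetric`` is a hypothetical helper written for this example and is not part of QEMU.

```python
def is_symmetric(matrix):
    """True if the distance from A to B equals the distance from B to A."""
    count = len(matrix)
    return all(
        matrix[a][b] == matrix[b][a]
        for a in range(count)
        for b in range(a + 1, count)
    )


# The unsupported two node example from the first limitation above.
print(is_symmetric([[10, 40], [20, 10]]))  # False
print(is_symmetric([[10, 40], [40, 10]]))  # True
```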
378 | ||
379 | ||
380 | Legacy (5.1 and older) pseries NUMA mechanics | |
381 | ============================================= | |
382 | ||
383 | In short, we can summarize the NUMA distances seen in pseries Linux guests, using | |
384 | QEMU up to 5.1, as follows: | |
385 | ||
386 | * local distance, i.e. the distance of the resource to its own NUMA node: 10 | |
387 | * if it's an NVLink GPU device, distance: 80 | |
388 | * every other resource, distance: 40 | |
389 | ||
61f5e1a3 DHB |
390 | The way the pseries Linux guest calculates NUMA distances has a direct effect |
391 | on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is | |
392 | the default ibm,associativity-reference-points being used in the pseries | |
393 | machine: | |
394 | ||
395 | ibm,associativity-reference-points = {0x4, 0x4, 0x2} | |
396 | ||
397 | The first and second level are equal, 0x4, and a third one was added in | |
398 | commit a6030d7e0b35 exclusively for NVLink GPUs support. This means that | |
399 | regardless of how the ibm,associativity properties are being created in | |
400 | the device tree, the pseries Linux guest will only recognize three scenarios | |
401 | as far as NUMA distance goes: | |
402 | ||
403 | * if the resources belong to the same first NUMA level = 10 | |
404 | * second level is skipped since it's equal to the first | |
405 | * all resources that aren't an NVLink GPU are guaranteed to belong | |
406 | to the same third NUMA level, having distance = 40 | |
407 | * for NVLink GPUs, distance = 80 from everything else | |
408 | ||
61f5e1a3 DHB |
409 | This also means that user input in QEMU command line does not change the |
410 | NUMA distancing inside the guest for the pseries machine. |