Commit | Line | Data |
---|---|---|
f20e5789 FY |
1 | User Interface for Resource Allocation in Intel Resource Director Technology |
2 | ||
3 | Copyright (C) 2016 Intel Corporation | |
4 | ||
5 | Fenghua Yu <fenghua.yu@intel.com> | |
6 | Tony Luck <tony.luck@intel.com> | |
a9cad3d4 | 7 | Vikas Shivappa <vikas.shivappa@intel.com> |
f20e5789 | 8 | |
de918abb VS |
9 | This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the |
10 | X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3". | |
f20e5789 FY |
11 | |
12 | To use the feature mount the file system: | |
13 | ||
14 | # mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl | |
15 | ||
16 | mount options are: | |
17 | ||
18 | "cdp": Enable code/data prioritization in L3 cache allocations. | |
19 | ||
de918abb VS |
20 | RDT features are orthogonal. A particular system may support only |
21 | monitoring, only control, or both monitoring and control. | |
22 | ||
23 | The mount succeeds if either allocation or monitoring is present, but | |
24 | only those files and directories supported by the system will be created. | |
25 | For more details on the behavior of the interface during monitoring | |
26 | and allocation, see the "Resource alloc and monitor groups" section. | |
f20e5789 | 27 | |
458b0d6e TG |
28 | Info directory |
29 | -------------- | |
30 | ||
31 | The 'info' directory contains information about the enabled | |
32 | resources. Each resource has its own subdirectory. The subdirectory | |
a9cad3d4 | 33 | names reflect the resource names. |
de918abb VS |
34 | |
35 | Each subdirectory contains the following files with respect to | |
36 | allocation: | |
37 | ||
38 | The cache resource (L3/L2) subdirectories contain the following files | |
39 | related to allocation: | |
458b0d6e | 40 | |
a9cad3d4 VS |
41 | "num_closids": The number of CLOSIDs which are valid for this |
42 | resource. The kernel uses the smallest number of | |
43 | CLOSIDs of all enabled resources as limit. | |
458b0d6e | 44 | |
a9cad3d4 VS |
45 | "cbm_mask": The bitmask which is valid for this resource. |
46 | This mask is equivalent to 100%. | |
458b0d6e | 47 | |
a9cad3d4 VS |
48 | "min_cbm_bits": The minimum number of consecutive bits which |
49 | must be set when writing a mask. | |
458b0d6e | 50 | |
9ebf47f1 FY |
51 | "shareable_bits": Bitmask of shareable resource with other executing |
52 | entities (e.g. I/O). User can use this when | |
53 | setting up exclusive cache partitions. Note that | |
54 | some platforms support devices that have their | |
55 | own settings for cache use which can over-ride | |
56 | these bits. | |
57 | ||
de918abb VS |
58 | Memory bandwidth (MB) subdirectory contains the following files |
59 | with respect to allocation: | |
a9cad3d4 VS |
60 | |
61 | "min_bandwidth": The minimum memory bandwidth percentage which | |
62 | user can request. | |
63 | ||
64 | "bandwidth_gran": The granularity in which the memory bandwidth | |
65 | percentage is allocated. The allocated | |
66 | b/w percentage is rounded off to the next | |
67 | control step available on the hardware. The | |
68 | available bandwidth control steps are: | |
69 | min_bandwidth + N * bandwidth_gran. | |
70 | ||
71 | "delay_linear": Indicates if the delay scale is linear or | |
72 | non-linear. This field is purely | |
73 | informational. | |
458b0d6e | 74 | |
de918abb VS |
75 | If RDT monitoring is available there will be an "L3_MON" directory |
76 | with the following files: | |
77 | ||
78 | "num_rmids": The number of RMIDs available. This is the | |
79 | upper bound for how many "CTRL_MON" + "MON" | |
80 | groups can be created. | |
81 | ||
82 | "mon_features": Lists the monitoring events if | |
83 | monitoring is enabled for the resource. | |
84 | ||
85 | "max_threshold_occupancy": | |
86 | Read/write file provides the largest value (in | |
87 | bytes) at which a previously used LLC_occupancy | |
88 | counter can be considered for re-use. | |
89 | ||
90 | ||
91 | Resource alloc and monitor groups | |
92 | --------------------------------- | |
93 | ||
f20e5789 | 94 | Resource groups are represented as directories in the resctrl file |
de918abb VS |
95 | system. The default group is the root directory which, immediately |
96 | after mounting, owns all the tasks and cpus in the system and can make | |
97 | full use of all resources. | |
98 | ||
99 | On a system with RDT control features additional directories can be | |
100 | created in the root directory that specify different amounts of each | |
101 | resource (see "schemata" below). The root and these additional top level | |
102 | directories are referred to as "CTRL_MON" groups below. | |
103 | ||
104 | On a system with RDT monitoring the root directory and other top level | |
105 | directories contain a directory named "mon_groups" in which additional | |
106 | directories can be created to monitor subsets of tasks in the CTRL_MON | |
107 | group that is their ancestor. These are called "MON" groups in the rest | |
108 | of this document. | |
109 | ||
110 | Removing a directory will move all tasks and cpus owned by the group it | |
111 | represents to the parent. Removing one of the created CTRL_MON groups | |
112 | will automatically remove all MON groups below it. | |
113 | ||
114 | All groups contain the following files: | |
115 | ||
116 | "tasks": | |
117 | Reading this file shows the list of all tasks that belong to | |
118 | this group. Writing a task id to the file will add a task to the | |
119 | group. If the group is a CTRL_MON group the task is removed from | |
120 | whichever previous CTRL_MON group owned the task and also from | |
121 | any MON group that owned the task. If the group is a MON group, | |
122 | then the task must already belong to the CTRL_MON parent of this | |
123 | group. The task is removed from any previous MON group. | |
124 | ||
125 | ||
126 | "cpus": | |
127 | Reading this file shows a bitmask of the logical CPUs owned by | |
128 | this group. Writing a mask to this file will add and remove | |
129 | CPUs to/from this group. As with the tasks file a hierarchy is | |
130 | maintained where MON groups may only include CPUs owned by the | |
131 | parent CTRL_MON group. | |
132 | ||
133 | ||
134 | "cpus_list": | |
135 | Just like "cpus", only using ranges of CPUs instead of bitmasks. | |
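
The relationship between the "cpus" bitmask and the "cpus_list" ranges can be illustrated with a small conversion helper. This is only a sketch: cpulist_to_mask is a hypothetical name, not something resctrl provides, and it relies on shell integer arithmetic, so it assumes CPU numbers below 63.

```shell
#!/bin/sh
# cpulist_to_mask LIST: convert a CPU list such as "4-7" or "0,2,4-5" into
# the hexadecimal bitmask form used by the "cpus" file.
# Hypothetical helper for illustration; assumes CPU numbers below 63.
cpulist_to_mask() (
    mask=0
    IFS=,                       # split the list on commas (subshell-local)
    for item in $1; do
        lo=${item%-*}           # "4-7" -> 4, "4" -> 4
        hi=${item#*-}           # "4-7" -> 7, "4" -> 4
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
)

cpulist_to_mask 4-7             # prints f0
```

Writing "4-7" to "cpus_list" and writing the printed mask to "cpus" describe the same set of CPUs.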
f20e5789 | 136 | |
f20e5789 | 137 | |
de918abb | 138 | When control is enabled all CTRL_MON groups will also contain: |
f20e5789 | 139 | |
de918abb VS |
140 | "schemata": |
141 | A list of all the resources available to this group. | |
142 | Each resource has its own line and format - see below for details. | |
f20e5789 | 143 | |
de918abb | 144 | When monitoring is enabled all MON groups will also contain: |
4ffa3c97 | 145 | |
de918abb VS |
146 | "mon_data": |
147 | This contains a set of files organized by L3 domain and by | |
148 | RDT event. E.g. on a system with two L3 domains there will | |
149 | be subdirectories "mon_L3_00" and "mon_L3_01". Each of these | |
150 | directories has one file per event (e.g. "llc_occupancy", | |
151 | "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these | |
152 | files provide a read out of the current value of the event for | |
153 | all tasks in the group. In CTRL_MON groups these files provide | |
154 | the sum for all tasks in the CTRL_MON group and all tasks in | |
155 | MON groups. Please see example section for more details on usage. | |
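
Because "mon_data" holds one directory per L3 domain, a per-group total has to be added up by the reader. A minimal sketch follows, assuming the "mon_L3_*" layout described above; sum_event is a hypothetical helper, and skipping non-numeric readings is a choice made for this sketch only.

```shell
#!/bin/sh
# sum_event GROUP_DIR EVENT: add up one event file across all L3 domains
# of a group, e.g. "llc_occupancy" for /sys/fs/resctrl/p1.
# Hypothetical helper; values that are not plain numbers are skipped.
sum_event() (
    total=0
    for f in "$1"/mon_data/mon_L3_*/"$2"; do
        [ -r "$f" ] || continue         # glob matched nothing
        v=$(cat "$f")
        case $v in
            ''|*[!0-9]*) continue ;;    # not a plain decimal number
        esac
        total=$(( total + $v ))
    done
    echo "$total"
)

# Total LLC occupancy of the default group; prints 0 when resctrl is not
# mounted or monitoring is unavailable.
sum_event /sys/fs/resctrl llc_occupancy
```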
f20e5789 | 156 | |
de918abb VS |
157 | Resource allocation rules |
158 | ------------------------- | |
159 | When a task is running the following rules define which resources are | |
160 | available to it: | |
f20e5789 FY |
161 | |
162 | 1) If the task is a member of a non-default group, then the schemata | |
de918abb | 163 | for that group is used. |
f20e5789 FY |
164 | |
165 | 2) Else if the task belongs to the default group, but is running on a | |
de918abb VS |
166 | CPU that is assigned to some specific group, then the schemata for the |
167 | CPU's group is used. | |
f20e5789 FY |
168 | |
169 | 3) Otherwise the schemata for the default group is used. | |
170 | ||
de918abb VS |
171 | Resource monitoring rules |
172 | ------------------------- | |
173 | 1) If a task is a member of a MON group, or non-default CTRL_MON group | |
174 | then RDT events for the task will be reported in that group. | |
175 | ||
176 | 2) If a task is a member of the default CTRL_MON group, but is running | |
177 | on a CPU that is assigned to some specific group, then the RDT events | |
178 | for the task will be reported in that group. | |
179 | ||
180 | 3) Otherwise RDT events for the task will be reported in the root level | |
181 | "mon_data" group. | |
182 | ||
183 | ||
184 | Notes on cache occupancy monitoring and control | |
185 | ----------------------------------------------- | |
186 | When moving a task from one group to another you should remember that | |
187 | this only affects *new* cache allocations by the task. E.g. you may have | |
188 | a task in a monitor group showing 3 MB of cache occupancy. If you move | |
189 | to a new group and immediately check the occupancy of the old and new | |
190 | groups you will likely see that the old group is still showing 3 MB and | |
191 | the new group zero. When the task accesses locations still in cache from | |
192 | before the move, the h/w does not update any counters. On a busy system | |
193 | you will likely see the occupancy in the old group go down as cache lines | |
194 | are evicted and re-used while the occupancy in the new group rises as | |
195 | the task accesses memory and loads into the cache are counted based on | |
196 | membership in the new group. | |
197 | ||
198 | The same applies to cache allocation control. Moving a task to a group | |
199 | with a smaller cache partition will not evict any cache lines. The | |
200 | process may continue to use them from the old partition. | |
201 | ||
202 | Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource Monitoring ID) | |
203 | to identify a control group and a monitoring group respectively. Each | |
204 | resource group is mapped to these IDs based on the kind of group. The | |
205 | numbers of CLOSIDs and RMIDs are limited by the hardware, so creating | |
206 | a "CTRL_MON" directory may fail if we run out of either CLOSIDs or RMIDs, | |
207 | and creating a "MON" group may fail if we run out of RMIDs. | |
208 | ||
209 | max_threshold_occupancy - generic concepts | |
210 | ------------------------------------------ | |
211 | ||
212 | Note that an RMID, once freed, may not be immediately available for use as | |
213 | the RMID is still tagged to the cache lines of its previous user. | |
214 | Hence such RMIDs are placed on a limbo list and rechecked once the cache | |
215 | occupancy has gone down. If at some point the system has a lot of | |
216 | limbo RMIDs that are not yet ready to be used, the user may see -EBUSY | |
217 | during mkdir. | |
218 | ||
219 | max_threshold_occupancy is a user configurable value to determine the | |
220 | occupancy at which an RMID can be freed. | |
f20e5789 FY |
221 | |
222 | Schemata files - general concepts | |
223 | --------------------------------- | |
224 | Each line in the file describes one resource. The line starts with | |
225 | the name of the resource, followed by specific values to be applied | |
226 | in each of the instances of that resource on the system. | |
227 | ||
228 | Cache IDs | |
229 | --------- | |
230 | On current generation systems there is one L3 cache per socket and L2 | |
231 | caches are generally just shared by the hyperthreads on a core, but this | |
232 | isn't an architectural requirement. We could have multiple separate L3 | |
233 | caches on a socket, multiple cores could share an L2 cache. So instead | |
234 | of using "socket" or "core" to define the set of logical cpus sharing | |
235 | a resource we use a "Cache ID". At a given cache level this will be a | |
236 | unique number across the whole system (but it isn't guaranteed to be a | |
237 | contiguous sequence, there may be gaps). To find the ID for each logical | |
238 | CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id | |
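
The lookup can be scripted as below. This is a sketch only: list_cache_ids is a hypothetical helper, and its optional argument, which overrides the sysfs CPU directory, is an assumption added so the function can be tried on a copy of the tree.

```shell
#!/bin/sh
# list_cache_ids [CPU_DIR]: print "cpuN indexM ID" for every cache "id"
# file found below CPU_DIR (default: /sys/devices/system/cpu).
# Hypothetical helper for illustration.
list_cache_ids() (
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu*/cache/index*/id; do
        [ -r "$f" ] || continue         # glob matched nothing
        index=${f%/id}
        index=${index##*/}              # e.g. index3
        cpu=${f%/cache/*}
        cpu=${cpu##*/}                  # e.g. cpu0
        printf '%s %s %s\n' "$cpu" "$index" "$(cat "$f")"
    done
)

list_cache_ids
```

The "level" file next to each "id" file tells which cache level a given indexM directory describes.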
239 | ||
240 | Cache Bit Masks (CBM) | |
241 | --------------------- | |
242 | For cache resources we describe the portion of the cache that is available | |
243 | for allocation using a bitmask. The maximum value of the mask is defined | |
244 | by each cpu model (and may be different for different cache levels). It | |
245 | is found using CPUID, but is also provided in the "info" directory of | |
246 | the resctrl file system in "info/{resource}/cbm_mask". X86 hardware | |
247 | requires that these masks have all the '1' bits in a contiguous block. So | |
248 | 0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 | |
249 | and 0xA are not. On a system with a 20-bit mask each bit represents 5% | |
250 | of the capacity of the cache. You could partition the cache into four | |
251 | equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. | |
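
The contiguity rule and the four-way split above can be checked with shell arithmetic. Both helpers are hypothetical and only illustrate the bit manipulation; cbm_partition assumes the mask width divides evenly by the number of parts.

```shell
#!/bin/sh
# cbm_is_contiguous MASK: succeed if the set bits of MASK form one block.
cbm_is_contiguous() (
    m=$(( $1 ))
    [ "$m" -ne 0 ] || exit 1
    # Drop trailing zero bits so the block starts at bit 0.
    while [ $(( m & 1 )) -eq 0 ]; do
        m=$(( m >> 1 ))
    done
    # A block of ones starting at bit 0 has the form 2^k - 1.
    [ $(( m & (m + 1) )) -eq 0 ]
)

# cbm_partition NBITS PARTS INDEX: print the mask of slice INDEX (counted
# from 0) when an NBITS-wide mask is split into PARTS equal slices.
cbm_partition() (
    width=$(( $1 / $2 ))
    printf '%x\n' $(( ((1 << width) - 1) << ($3 * width) ))
)

# The four quarters of a 20-bit mask: prints 1f, 3e0, 7c00 and f8000.
for i in 0 1 2 3; do
    cbm_partition 20 4 "$i"
done
```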
252 | ||
a9cad3d4 VS |
253 | Memory bandwidth(b/w) percentage |
254 | -------------------------------- | |
255 | For Memory b/w resource, user controls the resource by indicating the | |
256 | percentage of total memory b/w. | |
257 | ||
258 | The minimum bandwidth percentage value for each cpu model is predefined | |
259 | and can be looked up through "info/MB/min_bandwidth". The bandwidth | |
260 | granularity that is allocated is also dependent on the cpu model and can | |
261 | be looked up at "info/MB/bandwidth_gran". The available bandwidth | |
262 | control steps are: min_bw + N * bw_gran. Intermediate values are rounded | |
263 | to the next control step available on the hardware. | |
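
The control steps can be enumerated from the two info values. The sketch below uses hypothetical helpers and makes two assumptions that the text does not confirm: "next" is read as rounding upwards, and 100 is itself a control step. The hardware and kernel may behave differently.

```shell
#!/bin/sh
# mb_steps MIN GRAN: list every control step MIN + N * GRAN up to 100%.
mb_steps() (
    bw=$1
    while [ "$bw" -le 100 ]; do
        echo "$bw"
        bw=$(( bw + $2 ))
    done
)

# mb_round_up MIN GRAN REQUEST: print the first control step that is not
# below REQUEST. Rounding upwards is an assumption of this sketch, and
# requests outside MIN..100 are treated as invalid.
mb_round_up() (
    min=$1 gran=$2 req=$3
    [ "$req" -ge "$min" ] && [ "$req" -le 100 ] || exit 1
    n=$(( (req - min + gran - 1) / gran ))
    echo $(( min + n * gran ))
)

# With min_bandwidth=10 and bandwidth_gran=10, a request of 35 maps to 40.
mb_round_up 10 10 35
```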
264 | ||
265 | The bandwidth throttling is a core specific mechanism on some Intel | |
266 | SKUs. Using a high bandwidth and a low bandwidth setting on two threads | |
267 | sharing a core will result in both threads being throttled to use the | |
268 | low bandwidth. | |
f20e5789 | 269 | |
de918abb VS |
270 | L3 schemata file details (code and data prioritization disabled) |
271 | ---------------------------------------------------------------- | |
f20e5789 FY |
272 | With CDP disabled the L3 schemata format is: |
273 | ||
274 | L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | |
275 | ||
de918abb VS |
276 | L3 schemata file details (CDP enabled via mount option to resctrl) |
277 | ------------------------------------------------------------------ | |
f20e5789 FY |
278 | When CDP is enabled L3 control is split into two separate resources |
279 | so you can specify independent masks for code and data like this: | |
280 | ||
281 | L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | |
282 | L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | |
283 | ||
de918abb VS |
284 | L2 schemata file details |
285 | ------------------------ | |
f20e5789 FY |
286 | L2 cache does not support code and data prioritization, so the |
287 | schemata format is always: | |
288 | ||
289 | L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... | |
290 | ||
a9cad3d4 VS |
291 | Memory b/w Allocation details |
292 | ----------------------------- | |
293 | ||
294 | Memory b/w domain is L3 cache. | |
295 | ||
296 | MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... | |
297 | ||
c4026b7b TL |
298 | Reading/writing the schemata file |
299 | --------------------------------- | |
300 | Reading the schemata file will show the state of all resources | |
301 | on all domains. When writing you only need to specify those values | |
302 | which you wish to change. E.g. | |
303 | ||
304 | # cat schemata | |
305 | L3DATA:0=fffff;1=fffff;2=fffff;3=fffff | |
306 | L3CODE:0=fffff;1=fffff;2=fffff;3=fffff | |
307 | # echo "L3DATA:2=3c0;" > schemata | |
308 | # cat schemata | |
309 | L3DATA:0=fffff;1=fffff;2=3c0;3=fffff | |
310 | L3CODE:0=fffff;1=fffff;2=fffff;3=fffff | |
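
A script that needs one value from this output can split each line on ':' and ';'. The sketch below uses a hypothetical helper, schemata_get, that reads schemata text on standard input. Tolerating leading blanks before the resource name is a precaution taken by this sketch.

```shell
#!/bin/sh
# schemata_get RESOURCE DOMAIN: read schemata text on stdin and print the
# value assigned to DOMAIN on the line for RESOURCE.
# Hypothetical helper for illustration.
schemata_get() (
    while IFS=: read -r name values; do
        name=${name##* }                # tolerate leading blanks
        [ "$name" = "$1" ] || continue
        IFS=';'                         # split "0=fffff;1=fffff;..."
        for pair in $values; do
            if [ "${pair%%=*}" = "$2" ]; then
                echo "${pair#*=}"
            fi
        done
    done
)

# Using the output shown above: prints 3c0.
printf '%s\n' \
    'L3DATA:0=fffff;1=fffff;2=3c0;3=fffff' \
    'L3CODE:0=fffff;1=fffff;2=fffff;3=fffff' |
    schemata_get L3DATA 2
```

On a live system the input would come from the group's file, e.g. "schemata_get L3DATA 2 < /sys/fs/resctrl/schemata".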
311 | ||
de918abb VS |
312 | Examples for RDT allocation usage: |
313 | ||
f20e5789 FY |
314 | Example 1 |
315 | --------- | |
316 | On a two socket machine (one L3 cache per socket) with just four bits | |
a9cad3d4 VS |
317 | for cache bit masks, minimum b/w of 10% with a memory bandwidth |
318 | granularity of 10% | |
f20e5789 FY |
319 | |
320 | # mount -t resctrl resctrl /sys/fs/resctrl | |
321 | # cd /sys/fs/resctrl | |
322 | # mkdir p0 p1 | |
a9cad3d4 VS |
323 | # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata |
324 | # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata | |
f20e5789 FY |
325 | |
326 | The default resource group is unmodified, so we have access to all parts | |
327 | of all caches (its schemata file reads "L3:0=f;1=f"). | |
328 | ||
329 | Tasks that are under the control of group "p0" may only allocate from the | |
330 | "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. | |
331 | Tasks in group "p1" use the "lower" 50% of cache on both sockets. | |
332 | ||
a9cad3d4 VS |
333 | Similarly, tasks that are under the control of group "p0" may use a |
334 | maximum memory b/w of 50% on socket0 and 50% on socket 1. | |
335 | Tasks in group "p1" may also use 50% memory b/w on both sockets. | |
336 | Note that unlike cache masks, memory b/w cannot specify whether these | |
337 | allocations can overlap or not. The allocation specifies the maximum | |
338 | b/w that the group may be able to use and the system admin can configure | |
339 | the b/w accordingly. | |
340 | ||
f20e5789 FY |
341 | Example 2 |
342 | --------- | |
343 | Again two sockets, but this time with a more realistic 20-bit mask. | |
344 | ||
345 | Two real time tasks pid=1234 running on processor 0 and pid=5678 running on | |
346 | processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy | |
347 | neighbors, each of the two real-time tasks exclusively occupies one quarter | |
348 | of L3 cache on socket 0. | |
349 | ||
350 | # mount -t resctrl resctrl /sys/fs/resctrl | |
351 | # cd /sys/fs/resctrl | |
352 | ||
353 | First we reset the schemata for the default group so that the "upper" | |
a9cad3d4 VS |
354 | 50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by |
355 | ordinary tasks: | |
f20e5789 | 356 | |
a9cad3d4 | 357 | # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata |
f20e5789 FY |
358 | |
359 | Next we make a resource group for our first real time task and give | |
360 | it access to the "top" 25% of the cache on socket 0. | |
361 | ||
362 | # mkdir p0 | |
363 | # echo "L3:0=f8000;1=fffff" > p0/schemata | |
364 | ||
365 | Finally we move our first real time task into this resource group. We | |
366 | also use taskset(1) to ensure the task always runs on a dedicated CPU | |
367 | on socket 0. Most uses of resource groups will also constrain which | |
368 | processors tasks run on. | |
369 | ||
370 | # echo 1234 > p0/tasks | |
371 | # taskset -cp 1 1234 | |
372 | ||
373 | Ditto for the second real time task (with the remaining 25% of cache): | |
374 | ||
375 | # mkdir p1 | |
376 | # echo "L3:0=7c00;1=fffff" > p1/schemata | |
377 | # echo 5678 > p1/tasks | |
378 | # taskset -cp 2 5678 | |
379 | ||
a9cad3d4 VS |
380 | For the same 2 socket system with memory b/w resource and CAT L3 the |
381 | schemata would look like this (assuming min_bandwidth is 10 and bandwidth_gran is | |
382 | 10): | |
383 | ||
384 | For our first real time task this would request 20% memory b/w on socket | |
385 | 0. | |
386 | ||
387 | # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata | |
388 | ||
389 | For our second real time task this would request another 20% memory b/w | |
390 | on socket 0. | |
391 | ||
392 | # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata | |
393 | ||
f20e5789 FY |
394 | Example 3 |
395 | --------- | |
396 | ||
397 | A single socket system which has real-time tasks running on core 4-7 and | |
398 | non real-time workload assigned to core 0-3. The real-time tasks share text | |
399 | and data, so a per task association is not required and due to interaction | |
400 | with the kernel it's desired that the kernel on these cores shares L3 with | |
401 | the tasks. | |
402 | ||
403 | # mount -t resctrl resctrl /sys/fs/resctrl | |
404 | # cd /sys/fs/resctrl | |
405 | ||
406 | First we reset the schemata for the default group so that the "upper" | |
a9cad3d4 VS |
407 | 50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 |
408 | cannot be used by ordinary tasks: | |
f20e5789 | 409 | |
a9cad3d4 | 410 | # echo -e "L3:0=3ff\nMB:0=50" > schemata |
f20e5789 | 411 | |
a9cad3d4 VS |
412 | Next we make a resource group for our real time cores and give it access |
413 | to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on | |
414 | socket 0. | |
f20e5789 FY |
415 | |
416 | # mkdir p0 | |
a9cad3d4 | 417 | # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata |
f20e5789 FY |
418 | |
419 | Finally we move core 4-7 over to the new group and make sure that the | |
a9cad3d4 VS |
420 | kernel and the tasks running there get 50% of the cache. They should |
421 | also get 50% of memory bandwidth assuming that the cores 4-7 are SMT | |
422 | siblings and only the real time threads are scheduled on the cores 4-7. | |
f20e5789 | 423 | |
fb8fb46c | 424 | # echo F0 > p0/cpus |
3c2a769d MT |
425 | |
426 | 4) Locking between applications | |
427 | ||
428 | Certain operations on the resctrl filesystem, composed of read/writes | |
429 | to/from multiple files, must be atomic. | |
430 | ||
431 | As an example, the allocation of an exclusive reservation of L3 cache | |
432 | involves: | |
433 | ||
434 | 1. Read the CBMs from the "schemata" file of each directory | |
435 | 2. Find a contiguous set of bits in the global CBM bitmask that is clear | |
436 | in every one of the directory CBMs | |
437 | 3. Create a new directory | |
438 | 4. Write the bits found in step 2 to the new directory's "schemata" file | |
439 | ||
440 | If two applications attempt to allocate space concurrently then they can | |
441 | end up allocating the same bits so the reservations are shared instead of | |
442 | exclusive. | |
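
Step 2 of the list above is plain bit arithmetic and can be sketched as follows. find_free_cbm is a hypothetical helper. It assumes the caller has already collected the masks of all existing groups, and its result is only trustworthy if no other application changes the groups in the meantime.

```shell
#!/bin/sh
# find_free_cbm FULL WANT USED...: print the lowest block of WANT
# contiguous bits inside FULL that is clear in every USED mask.
# Fails (non-zero status, no output) when no such block exists.
# Hypothetical helper for illustration.
find_free_cbm() (
    full=$(( $1 ))
    want=$2
    shift 2
    used=0
    for m in "$@"; do
        used=$(( used | $m ))           # union of all existing masks
    done
    free=$(( full & ~used ))
    cand=$(( (1 << want) - 1 ))         # WANT ones, starting at bit 0
    while [ "$cand" -le "$full" ]; do
        if [ $(( cand & free )) -eq "$cand" ]; then
            printf '%x\n' "$cand"
            exit 0
        fi
        cand=$(( cand << 1 ))           # slide the block up by one bit
    done
    exit 1
)

# 20-bit cache, one group using 0x3ff and another using 0xf8000:
# the only free 5-bit block is 7c00.
find_free_cbm 0xfffff 5 0x3ff 0xf8000
```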
443 | ||
444 | To coordinate atomic operations on the resctrlfs and to avoid the problem | |
445 | above, the following locking procedure is recommended: | |
446 | ||
447 | Locking is based on flock, which is available in libc and also as a shell | |
448 | script command | |
449 | ||
450 | Write lock: | |
451 | ||
452 | A) Take flock(LOCK_EX) on /sys/fs/resctrl | |
453 | B) Read/write the directory structure. | |
454 | C) Release the lock with flock(LOCK_UN) | |
455 | ||
456 | Read lock: | |
457 | ||
458 | A) Take flock(LOCK_SH) on /sys/fs/resctrl | |
459 | B) If success read the directory structure. | |
460 | C) Release the lock with flock(LOCK_UN) | |
461 | ||
462 | Example with bash: | |
463 | ||
464 | # Atomically read directory structure | |
465 | $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl | |
466 | ||
467 | # Read directory contents and create new subdirectory | |
468 | ||
469 | $ cat create-dir.sh | |
470 | find /sys/fs/resctrl/ > output.txt | |
471 | mask=$(function-of output.txt) # function-of: placeholder for your own mask computation | |
472 | mkdir /sys/fs/resctrl/newres/ | |
473 | echo "$mask" > /sys/fs/resctrl/newres/schemata | |
474 | ||
475 | $ flock /sys/fs/resctrl/ ./create-dir.sh | |
476 | ||
477 | Example with C: | |
478 | ||
479 | /* | |
480 | * Example code to take advisory locks | |
481 | * before accessing resctrl filesystem | |
482 | */ | |
483 | #include <sys/file.h> | |
484 | #include <stdlib.h> | |
| #include <stdio.h> /* perror() */ | |
| #include <fcntl.h> /* open(), O_DIRECTORY */ | |
485 | ||
486 | void resctrl_take_shared_lock(int fd) | |
487 | { | |
488 | int ret; | |
489 | ||
490 | /* take shared lock on resctrl filesystem */ | |
491 | ret = flock(fd, LOCK_SH); | |
492 | if (ret) { | |
493 | perror("flock"); | |
494 | exit(-1); | |
495 | } | |
496 | } | |
497 | ||
498 | void resctrl_take_exclusive_lock(int fd) | |
499 | { | |
500 | int ret; | |
501 | ||
502 | /* take exclusive lock on resctrl filesystem */ | |
503 | ret = flock(fd, LOCK_EX); | |
504 | if (ret) { | |
505 | perror("flock"); | |
506 | exit(-1); | |
507 | } | |
508 | } | |
509 | ||
510 | void resctrl_release_lock(int fd) | |
511 | { | |
512 | int ret; | |
513 | ||
514 | /* release lock on resctrl filesystem */ | |
515 | ret = flock(fd, LOCK_UN); | |
516 | if (ret) { | |
517 | perror("flock"); | |
518 | exit(-1); | |
519 | } | |
520 | } | |
521 | ||
522 | int main(void) | |
523 | { | |
524 | int fd; | |
525 | ||
526 | fd = open("/sys/fs/resctrl", O_DIRECTORY); | |
527 | if (fd == -1) { | |
528 | perror("open"); | |
529 | exit(-1); | |
530 | } | |
531 | resctrl_take_shared_lock(fd); | |
532 | /* code to read directory contents */ | |
533 | resctrl_release_lock(fd); | |
534 | ||
535 | resctrl_take_exclusive_lock(fd); | |
536 | /* code to read and write directory contents */ | |
537 | resctrl_release_lock(fd); | |
538 | } | |
de918abb VS |
539 | |
540 | Examples for RDT Monitoring along with allocation usage: | |
541 | ||
542 | Reading monitored data | |
543 | ---------------------- | |
544 | Reading an event file (e.g. mon_data/mon_L3_00/llc_occupancy) | |
545 | shows the current snapshot of LLC occupancy of the corresponding MON | |
546 | group or CTRL_MON group. | |
547 | ||
548 | ||
549 | Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) | |
550 | --------- | |
551 | On a two socket machine (one L3 cache per socket) with just four bits | |
552 | for cache bit masks | |
553 | ||
554 | # mount -t resctrl resctrl /sys/fs/resctrl | |
555 | # cd /sys/fs/resctrl | |
556 | # mkdir p0 p1 | |
557 | # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata | |
558 | # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata | |
559 | # echo 5678 > p1/tasks | |
560 | # echo 5679 > p1/tasks | |
561 | ||
562 | The default resource group is unmodified, so we have access to all parts | |
563 | of all caches (its schemata file reads "L3:0=f;1=f"). | |
564 | ||
565 | Tasks that are under the control of group "p0" may only allocate from the | |
566 | "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. | |
567 | Tasks in group "p1" use the "lower" 50% of cache on both sockets. | |
568 | ||
569 | Create monitor groups and assign a subset of tasks to each monitor group. | |
570 | ||
571 | # cd /sys/fs/resctrl/p1/mon_groups | |
572 | # mkdir m11 m12 | |
573 | # echo 5678 > m11/tasks | |
574 | # echo 5679 > m12/tasks | |
575 | ||
576 | fetch data (data shown in bytes) | |
577 | ||
578 | # cat m11/mon_data/mon_L3_00/llc_occupancy | |
579 | 16234000 | |
580 | # cat m11/mon_data/mon_L3_01/llc_occupancy | |
581 | 14789000 | |
582 | # cat m12/mon_data/mon_L3_00/llc_occupancy | |
583 | 16789000 | |
584 | ||
585 | The parent CTRL_MON group shows the aggregated data. | |
586 | ||
587 | # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy | |
588 | 31234000 | |
589 | ||
590 | Example 2 (Monitor a task from its creation) | |
591 | --------- | |
592 | On a two socket machine (one L3 cache per socket) | |
593 | ||
594 | # mount -t resctrl resctrl /sys/fs/resctrl | |
595 | # cd /sys/fs/resctrl | |
596 | # mkdir p0 p1 | |
597 | ||
598 | An RMID is allocated to the group once it is created and hence the <cmd> | |
599 | below is monitored from its creation. | |
600 | ||
601 | # echo $$ > /sys/fs/resctrl/p1/tasks | |
602 | # <cmd> | |
603 | ||
604 | Fetch the data | |
605 | ||
606 | # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy | |
607 | 31789000 | |
608 | ||
609 | Example 3 (Monitor without CAT support or before creating CAT groups) | |
610 | --------- | |
611 | ||
612 | Assume a system like HSW has only CQM and no CAT support. In this case | |
613 | resctrl will still mount, but CTRL_MON directories cannot be created. | |
614 | The user can still create different MON groups within the root group and | |
615 | thereby monitor all tasks, including kernel threads. | |
616 | ||
617 | This can also be used to profile a job's cache footprint before | |
618 | assigning it to an allocation group. | |
619 | ||
620 | # mount -t resctrl resctrl /sys/fs/resctrl | |
621 | # cd /sys/fs/resctrl | |
622 | # mkdir mon_groups/m01 | |
623 | # mkdir mon_groups/m02 | |
624 | ||
625 | # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks | |
626 | # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks | |
627 | ||
628 | Monitor the groups separately and also get per domain data. From the | |
629 | output below it is apparent that the tasks are mostly doing work on | |
630 | domain (socket) 0. | |
631 | ||
632 | # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy | |
633 | 31234000 | |
634 | # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy | |
635 | 34555 | |
636 | # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy | |
637 | 31234000 | |
638 | # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy | |
639 | 32789 | |
640 | ||
641 | ||
642 | Example 4 (Monitor real time tasks) | |
643 | ----------------------------------- | |
644 | ||
645 | A single socket system which has real time tasks running on cores 4-7 | |
646 | and non real time tasks on other cpus. We want to monitor the cache | |
647 | occupancy of the real time threads on these cores. | |
648 | ||
649 | # mount -t resctrl resctrl /sys/fs/resctrl | |
650 | # cd /sys/fs/resctrl | |
651 | # mkdir p1 | |
652 | ||
653 | Move the cpus 4-7 over to p1 | |
654 | # echo f0 > p1/cpus | |
655 | ||
656 | View the llc occupancy snapshot | |
657 | ||
658 | # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy | |
659 | 11234000 |