Commit | Line | Data |
---|---|---|
387b1468 | 1 | ============================== |
a6537be9 | 2 | RT-mutex implementation design |
387b1468 MCC |
3 | ============================== |
4 | ||
5 | Copyright (c) 2006 Steven Rostedt | |
6 | ||
7 | Licensed under the GNU Free Documentation License, Version 1.2 | |
8 | ||
a6537be9 SR |
9 | |
10 | This document tries to describe the design of the rtmutex.c implementation. | |
11 | It doesn't describe the reasons why rtmutex.c exists. For that please see | |
387b1468 | 12 | Documentation/locking/rt-mutex.rst. This document does explain problems |
a6537be9 SR |
13 | that happen without this code, but only as background needed to understand |
14 | what the code actually is doing. | |
15 | ||
16 | The goal of this document is to help others understand the priority | |
17 | inheritance (PI) algorithm that is used, as well as the reasons for | |
18 | implementing PI the way it was done. | |
19 | ||
20 | ||
21 | Unbounded Priority Inversion | |
22 | ---------------------------- | |
23 | ||
24 | Priority inversion is when a lower priority process executes while a higher | |
25 | priority process wants to run. This happens for several reasons, and | |
26 | most of the time it can't be helped. Anytime a high priority process wants | |
27 | to use a resource that a lower priority process has (a mutex for example), | |
28 | the high priority process must wait until the lower priority process is done | |
29 | with the resource. This is a priority inversion. What we want to prevent | |
30 | is something called unbounded priority inversion. That is when the high | |
31 | priority process is prevented from running by a lower priority process for | |
32 | an undetermined amount of time. | |
33 | ||
c79a8d85 | 34 | The classic example of unbounded priority inversion is where you have three |
a6537be9 SR |
35 | processes, let's call them processes A, B, and C, where A is the highest |
36 | priority process, C is the lowest, and B is in between. A tries to grab a lock | |
37 | that C owns and must wait, letting C run to release the lock. But in the | |
38 | meantime, B executes, and since B is of a higher priority than C, it preempts C, | |
39 | but by doing so, it is in fact preempting A which is a higher priority process. | |
40 | Now there's no way of knowing how long A will be sleeping waiting for C | |
41 | to release the lock, because for all we know, B is a CPU hog and will | |
42 | never give C a chance to release the lock. This is called unbounded priority | |
43 | inversion. | |
44 | ||
387b1468 | 45 | Here's a little ASCII art to show the problem:: |
a6537be9 | 46 | |
387b1468 MCC |
47 | grab lock L1 (owned by C) |
48 | | | |
49 | A ---+ | |
50 | C preempted by B | |
51 | | | |
52 | C +----+ | |
a6537be9 | 53 | |
387b1468 MCC |
54 | B +--------> |
55 | B now keeps A from running. | |
a6537be9 SR |
56 | |
57 | ||
58 | Priority Inheritance (PI) | |
59 | ------------------------- | |
60 | ||
61 | There are several ways to solve this issue, but other ways are out of scope | |
62 | for this document. Here we only discuss PI. | |
63 | ||
64 | PI is where a process inherits the priority of another process if the other | |
65 | process blocks on a lock owned by the current process. To make this easier | |
66 | to understand, let's use the previous example, with processes A, B, and C again. | |
67 | ||
68 | This time, when A blocks on the lock owned by C, C would inherit the priority | |
69 | of A. So now if B becomes runnable, it would not preempt C, since C now has | |
70 | the high priority of A. As soon as C releases the lock, it loses its | |
71 | inherited priority, and A then can continue with the resource that C had. | |
72 | ||
73 | Terminology | |
74 | ----------- | |
75 | ||
76 | Here I explain some terminology that is used in this document to help describe | |
77 | the design that is used to implement PI. | |
78 | ||
387b1468 MCC |
79 | PI chain |
80 | - The PI chain is an ordered series of locks and processes that cause | |
a6537be9 SR |
81 | processes to inherit priorities from a previous process that is |
82 | blocked on one of its locks. This is described in more detail | |
83 | later in this document. | |
84 | ||
387b1468 MCC |
85 | mutex |
86 | - In this document, to differentiate between locks that implement | |
a6537be9 SR |
87 | PI and spin locks that are used in the PI code, from now on |
88 | the PI locks will be called mutexes. | |
89 | ||
387b1468 MCC |
90 | lock |
91 | - In this document from now on, I will use the term lock when | |
a6537be9 SR |
92 | referring to spin locks that are used to protect parts of the PI |
93 | algorithm. These locks disable preemption for UP (when | |
94 | CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from | |
95 | entering critical sections simultaneously. | |
96 | ||
387b1468 MCC |
97 | spin lock |
98 | - Same as lock above. | |
a6537be9 | 99 | |
387b1468 MCC |
100 | waiter |
101 | - A waiter is a struct that is stored on the stack of a blocked | |
a6537be9 SR |
102 | process. Since the scope of the waiter is within the code for |
103 | a process being blocked on the mutex, it is fine to allocate | |
104 | the waiter on the process's stack (local variable). This | |
105 | structure holds a pointer to the task, as well as the mutex that | |
f1824df1 AS |
106 | the task is blocked on. It also has rbtree node structures to |
107 | place the task in the waiters rbtree of a mutex as well as the | |
108 | pi_waiters rbtree of a mutex owner task (described below). | |
a6537be9 SR |
109 | |
110 | waiter is sometimes used in reference to the task that is waiting | |
111 | on a mutex. This is the same as waiter->task. | |
112 | ||
387b1468 MCC |
113 | waiters |
114 | - A list of processes that are blocked on a mutex. | |
a6537be9 | 115 | |
387b1468 MCC |
116 | top waiter |
117 | - The highest priority process waiting on a specific mutex. | |
a6537be9 | 118 | |
387b1468 MCC |
119 | top pi waiter |
120 | - The highest priority process waiting on one of the mutexes | |
a6537be9 SR |
121 | that a specific process owns. |
122 | ||
387b1468 MCC |
123 | Note: |
124 | task and process are used interchangeably in this document, mostly to | |
a6537be9 SR |
125 | differentiate between two processes that are being described together. |
126 | ||
127 | ||
128 | PI chain | |
129 | -------- | |
130 | ||
131 | The PI chain is a list of processes and mutexes that may cause priority | |
132 | inheritance to take place. Multiple chains may converge, but a chain | |
133 | would never diverge, since a process can't be blocked on more than one | |
134 | mutex at a time. | |
135 | ||
387b1468 | 136 | Example:: |
a6537be9 SR |
137 | |
138 | Process: A, B, C, D, E | |
139 | Mutexes: L1, L2, L3, L4 | |
140 | ||
141 | A owns: L1 | |
142 | B blocked on L1 | |
143 | B owns L2 | |
144 | C blocked on L2 | |
145 | C owns L3 | |
146 | D blocked on L3 | |
147 | D owns L4 | |
148 | E blocked on L4 | |
149 | ||
387b1468 | 150 | The chain would be:: |
a6537be9 SR |
151 | |
152 | E->L4->D->L3->C->L2->B->L1->A | |
153 | ||
154 | To show where two chains merge, we could add another process F and | |
155 | another mutex L5 where B owns L5 and F is blocked on mutex L5. | |
156 | ||
387b1468 | 157 | The chain for F would be:: |
a6537be9 SR |
158 | |
159 | F->L5->B->L1->A | |
160 | ||
161 | Since a process may own more than one mutex, but never be blocked on more than | |
162 | one, the chains merge. | |
163 | ||
387b1468 | 164 | Here we show both chains:: |
a6537be9 SR |
165 | |
166 | E->L4->D->L3->C->L2-+ | |
167 | | | |
168 | +->B->L1->A | |
169 | | | |
170 | F->L5-+ | |
171 | ||
172 | For PI to work, the processes at the right end of these chains (or we may | |
173 | also call it the Top of the chain) must be equal to or higher in priority | |
174 | than the processes to the left or below in the chain. | |
175 | ||
176 | Also since a mutex may have more than one process blocked on it, we can | |
177 | have multiple chains merge at mutexes. If we add another process G that is | |
387b1468 | 178 | blocked on mutex L2:: |
a6537be9 SR |
179 | |
180 | G->L2->B->L1->A | |
181 | ||
182 | And once again, to show how this can grow I will show the merging chains | |
387b1468 | 183 | again:: |
a6537be9 SR |
184 | |
185 | E->L4->D->L3->C-+ | |
186 | +->L2-+ | |
187 | | | | |
188 | G-+ +->B->L1->A | |
189 | | | |
190 | F->L5-+ | |
191 | ||
f1824df1 AS |
192 | If process G has the highest priority in the chain, then all the tasks up |
193 | the chain (A and B in this example), must have their priorities increased | |
194 | to that of G. | |
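The boosting described above can be sketched as a simple walk up the chain. This is only an illustration: the types `pi_task` and `pi_mutex` and the function `pi_boost_chain` are hypothetical names invented for this example, all locking is left out, and it is not how rtmutex.c is actually written.

```c
#include <assert.h>
#include <stddef.h>

struct pi_mutex;

/* Hypothetical, simplified task: a lower prio number means higher priority. */
struct pi_task {
    int prio;                     /* effective priority */
    struct pi_mutex *blocked_on;  /* at most one mutex, so chains never diverge */
};

/* Hypothetical, simplified mutex. */
struct pi_mutex {
    struct pi_task *owner;        /* NULL when the mutex is free */
};

/*
 * Walk from a blocked task toward the top of the chain and boost every
 * owner whose priority is lower than the waiter's.
 * Returns the number of tasks that were boosted.
 */
int pi_boost_chain(struct pi_task *waiter)
{
    int boosted = 0;
    struct pi_mutex *m;

    assert(waiter != NULL);
    m = waiter->blocked_on;
    while (m != NULL && m->owner != NULL) {
        struct pi_task *owner = m->owner;

        if (owner->prio <= waiter->prio)
            break;                /* owner is already at least as important */
        owner->prio = waiter->prio;
        boosted++;
        m = owner->blocked_on;    /* continue up the chain */
    }
    return boosted;
}
```

With G blocked on L2 (owned by B) and B blocked on L1 (owned by A), calling `pi_boost_chain()` on G raises both B and A to the priority of G.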
a6537be9 | 195 | |
f1824df1 | 196 | Mutex Waiters Tree |
387b1468 | 197 | ------------------ |
a6537be9 | 198 | |
f1824df1 AS |
199 | Every mutex keeps track of all the waiters that are blocked on itself. The |
200 | mutex has a rbtree to store these waiters by priority. This tree is protected | |
201 | by a spin lock that is located in the struct of the mutex. This lock is called | |
202 | wait_lock. | |
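To make the idea of "waiters stored by priority" concrete, here is a minimal sketch. It assumes a sorted linked list as a stand-in for the rbtree, and the names `sketch_waiter`, `sketch_mutex`, `sketch_enqueue` and `sketch_top_waiter` are made up for this example. The wait_lock that would protect the structure is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter: a lower prio number means higher priority. */
struct sketch_waiter {
    int prio;
    struct sketch_waiter *next;
};

/* Hypothetical mutex: a sorted list stands in for the rbtree. */
struct sketch_mutex {
    struct sketch_waiter *waiters;   /* head of the list is the top waiter */
};

/* Insert in priority order; in this sketch equal priorities stay FIFO. */
void sketch_enqueue(struct sketch_mutex *m, struct sketch_waiter *w)
{
    struct sketch_waiter **pos;

    assert(m != NULL && w != NULL);
    pos = &m->waiters;
    while (*pos != NULL && (*pos)->prio <= w->prio)
        pos = &(*pos)->next;
    w->next = *pos;
    *pos = w;
}

/* The top waiter is simply the first entry (NULL if nobody waits). */
struct sketch_waiter *sketch_top_waiter(const struct sketch_mutex *m)
{
    return m->waiters;
}
```

Whatever the container, the property that matters is that finding the top waiter is cheap, because the PI code needs it every time a waiter is added or removed.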
a6537be9 SR |
203 | |
204 | ||
f1824df1 | 205 | Task PI Tree |
a6537be9 SR |
206 | ------------ |
207 | ||
f1824df1 AS |
208 | To keep track of the PI chains, each process has its own PI rbtree. This is |
209 | a tree of all top waiters of the mutexes that are owned by the process. | |
210 | Note that this tree only holds the top waiters and not all waiters that are | |
a6537be9 SR |
211 | blocked on mutexes owned by the process. |
212 | ||
f1824df1 | 213 | The top of the task's PI tree is always the highest priority task that |
a6537be9 SR |
214 | is waiting on a mutex that is owned by the task. So if the task has |
215 | inherited a priority, it will always be the priority of the task that is | |
f1824df1 | 216 | at the top of this tree. |
a6537be9 | 217 | |
f1824df1 AS |
218 | This tree is stored in the task structure of a process as a rbtree called |
219 | pi_waiters. It is protected by a spin lock also in the task structure, | |
a6537be9 SR |
220 | called pi_lock. This lock may also be taken in interrupt context, so when |
221 | locking the pi_lock, interrupts must be disabled. | |
222 | ||
223 | ||
224 | Depth of the PI Chain | |
225 | --------------------- | |
226 | ||
227 | The maximum depth of the PI chain is not dynamic, and could actually be | |
228 | defined. But it is very complex to figure out, since it depends on all | |
229 | the nesting of mutexes. Let's look at the example where we have 3 mutexes, | |
230 | L1, L2, and L3, and four separate functions func1, func2, func3 and func4. | |
231 | The following shows a locking order of L1->L2->L3, but may not actually | |
387b1468 | 232 | be directly nested that way:: |
a6537be9 | 233 | |
387b1468 MCC |
234 | void func1(void) |
235 | { | |
a6537be9 SR |
236 | mutex_lock(L1); |
237 | ||
238 | /* do anything */ | |
239 | ||
240 | mutex_unlock(L1); | |
387b1468 | 241 | } |
a6537be9 | 242 | |
387b1468 MCC |
243 | void func2(void) |
244 | { | |
a6537be9 SR |
245 | mutex_lock(L1); |
246 | mutex_lock(L2); | |
247 | ||
248 | /* do something */ | |
249 | ||
250 | mutex_unlock(L2); | |
251 | mutex_unlock(L1); | |
387b1468 | 252 | } |
a6537be9 | 253 | |
387b1468 MCC |
254 | void func3(void) |
255 | { | |
a6537be9 SR |
256 | mutex_lock(L2); |
257 | mutex_lock(L3); | |
258 | ||
259 | /* do something else */ | |
260 | ||
261 | mutex_unlock(L3); | |
262 | mutex_unlock(L2); | |
387b1468 | 263 | } |
a6537be9 | 264 | |
387b1468 MCC |
265 | void func4(void) |
266 | { | |
a6537be9 SR |
267 | mutex_lock(L3); |
268 | ||
269 | /* do something again */ | |
270 | ||
271 | mutex_unlock(L3); | |
387b1468 | 272 | } |
a6537be9 SR |
273 | |
274 | Now we add 4 processes that run each of these functions separately. | |
275 | Processes A, B, C, and D which run functions func1, func2, func3 and func4 | |
276 | respectively, and such that D runs first and A last. With D being preempted | |
387b1468 | 277 | in func4 in the "do something again" area, we have a locking that follows:: |
a6537be9 | 278 | |
387b1468 MCC |
279 | D owns L3 |
280 | C blocked on L3 | |
281 | C owns L2 | |
282 | B blocked on L2 | |
283 | B owns L1 | |
284 | A blocked on L1 | |
a6537be9 | 285 | |
387b1468 | 286 | And thus we have the chain A->L1->B->L2->C->L3->D. |
a6537be9 SR |
287 | |
288 | This gives us a PI depth of 4 (four processes), but looking at any of the | |
289 | functions individually, it seems as though they only have at most a locking | |
290 | depth of two. So, although the locking depth is defined at compile time, | |
291 | it still is very difficult to find the possibilities of that depth. | |
292 | ||
293 | Now since mutexes can be defined by user-land applications, we don't want a | |
294 | denial-of-service (DoS) type of application that nests a large number of mutexes to create a large | |
295 | PI chain, and have the code holding spin locks while looking at a large | |
296 | amount of data. So to prevent this, the implementation not only implements | |
297 | a maximum lock depth, but also only holds at most two different locks at a | |
298 | time, as it walks the PI chain. More about this below. | |
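As a rough illustration of a bounded walk, the following sketch counts the processes on a chain and gives up once a limit is exceeded. The types `d_task` and `d_mutex`, the function `chain_depth` and the `max_depth` parameter are assumptions made for this example only; locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct d_mutex;

/* Hypothetical, minimal task and mutex, just enough to form a chain. */
struct d_task  { struct d_mutex *blocked_on; };
struct d_mutex { struct d_task *owner; };

/*
 * Count the processes on the chain starting at 'task'. Returns -1 as soon
 * as 'max_depth' (a hypothetical limit) is exceeded, so a very long or
 * circular chain cannot keep the walker busy forever.
 */
int chain_depth(const struct d_task *task, int max_depth)
{
    int depth = 1;

    assert(task != NULL);
    while (task->blocked_on != NULL && task->blocked_on->owner != NULL) {
        if (++depth > max_depth)
            return -1;
        task = task->blocked_on->owner;
    }
    return depth;
}
```

For the chain A->L1->B->L2->C->L3->D from the example above, the walk starting at A visits four processes.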
299 | ||
300 | ||
301 | Mutex owner and flags | |
302 | --------------------- | |
303 | ||
304 | The mutex structure contains a pointer to the owner of the mutex. If the | |
305 | mutex is not owned, this owner is set to NULL. Since all architectures | |
f1824df1 AS |
306 | have the task structure on at least a two byte alignment (and if this is |
307 | not true, the rtmutex.c code will be broken!), this allows for the least | |
308 | significant bit to be used as a flag. Bit 0 is used as the "Has Waiters" | |
309 | flag. It's set whenever there are waiters on a mutex. | |
a6537be9 | 310 | |
387b1468 | 311 | See Documentation/locking/rt-mutex.rst for further details. |
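The trick of keeping a flag in bit 0 of an aligned pointer can be sketched in plain C. The names `o_task`, `owner_word`, `owner_of` and `word_has_waiters` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HAS_WAITERS ((uintptr_t)1)   /* bit 0 of the owner word */

/* Hypothetical task type; its alignment is at least two bytes. */
struct o_task { long dummy; };

/* Pack an owner pointer and the "Has Waiters" flag into one word. */
uintptr_t owner_word(const struct o_task *owner, bool waiters)
{
    uintptr_t word = (uintptr_t)owner;

    /* This only works because of the alignment guarantee. */
    assert((word & SKETCH_HAS_WAITERS) == 0);
    return waiters ? (word | SKETCH_HAS_WAITERS) : word;
}

/* Recover the owner by masking the flag bit off again. */
struct o_task *owner_of(uintptr_t word)
{
    return (struct o_task *)(word & ~SKETCH_HAS_WAITERS);
}

bool word_has_waiters(uintptr_t word)
{
    return (word & SKETCH_HAS_WAITERS) != 0;
}
```

Keeping both pieces of information in a single word is what later allows the owner and the flag to be checked and updated in one atomic operation.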
a6537be9 SR |
312 | |
313 | cmpxchg Tricks | |
314 | -------------- | |
315 | ||
316 | Some architectures implement an atomic cmpxchg (Compare and Exchange). This | |
317 | is used (when applicable) to keep the fast path of grabbing and releasing | |
318 | mutexes short. | |
319 | ||
387b1468 | 320 | cmpxchg is basically the following function performed atomically:: |
a6537be9 | 321 | |
387b1468 MCC |
322 | unsigned long _cmpxchg(unsigned long *A, unsigned long B, unsigned long C) |
323 | { | |
9ba0bdfd JA |
324 | unsigned long T = *A; |
325 | if (T == B) { | |
326 | *A = C; | |
327 | } | |
328 | return T; | |
387b1468 MCC |
329 | } |
330 | #define cmpxchg(a,b,c) _cmpxchg(&(a),(b),(c)) | |
a6537be9 SR |
331 | |
332 | This is really nice to have, since it allows you to only update a variable | |
333 | if the variable is what you expect it to be. You know if it succeeded if | |
334 | the return value (the old value of A) is equal to B. | |
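For readers who want to try this outside the kernel, here is a minimal user-space sketch using C11 atomics. The function name `sketch_cmpxchg` is made up for this example, and this is not the kernel's cmpxchg implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Returns the value that was in *ptr before the call. 'new_val' is stored
 * only if that value was equal to 'expected'.
 */
unsigned long sketch_cmpxchg(_Atomic unsigned long *ptr,
                             unsigned long expected,
                             unsigned long new_val)
{
    assert(ptr != NULL);
    /*
     * On success 'expected' is left alone and already equals the old value.
     * On failure C11 writes the value it found into 'expected'.
     * Either way 'expected' ends up holding the old value of *ptr.
     */
    atomic_compare_exchange_strong(ptr, &expected, new_val);
    return expected;
}
```

The caller knows the update happened if the return value equals the expected value it passed in.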
335 | ||
336 | The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If | |
337 | the architecture does not support CMPXCHG, then this macro is simply set | |
338 | to fail every time. But if CMPXCHG is supported, then it helps | |
339 | considerably in keeping the fast path short. | |
340 | ||
341 | The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize | |
342 | the system for architectures that support it. This will also be explained | |
343 | later in this document. | |
344 | ||
345 | ||
346 | Priority adjustments | |
347 | -------------------- | |
348 | ||
349 | The implementation of the PI code in rtmutex.c has several places that a | |
f1824df1 | 350 | process must adjust its priority. With the help of the pi_waiters of a |
a6537be9 SR |
351 | process, it is rather easy to know what needs to be adjusted. |
352 | ||
f1824df1 AS |
353 | The functions implementing the task adjustments are rt_mutex_adjust_prio |
354 | and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio. | |
a6537be9 | 355 | |
f1824df1 AS |
356 | rt_mutex_adjust_prio examines the priority of the task, and the highest |
357 | priority process that is waiting on any of the mutexes owned by the task. Since | |
358 | the pi_waiters of a task holds an order by priority of all the top waiters | |
359 | of all the mutexes that the task owns, we simply need to compare the top | |
360 | pi waiter to its own normal/deadline priority and take the higher one. | |
361 | Then rt_mutex_setprio is called to adjust the priority of the task to the | |
362 | new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c | |
363 | to implement the actual change in priority. | |
a6537be9 | 364 | |
387b1468 MCC |
365 | Note: |
366 | For the "prio" field in task_struct, the lower the number, the | |
f1824df1 | 367 | higher the priority. A "prio" of 5 is of higher priority than a |
387b1468 | 368 | "prio" of 10. |
a6537be9 | 369 | |
f1824df1 | 370 | It is interesting to note that rt_mutex_adjust_prio can either increase |
a6537be9 | 371 | or decrease the priority of the task. In the case that a higher priority |
f1824df1 | 372 | process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio |
a6537be9 SR |
373 | would increase/boost the task's priority. But if a higher priority task |
374 | were for some reason to leave the mutex (timeout or signal), this same function | |
f1824df1 | 375 | would decrease/unboost the priority of the task. That is because the pi_waiters |
a6537be9 SR |
376 | always contains the highest priority task that is waiting on a mutex owned |
377 | by the task, so we only need to compare the priority of that top pi waiter | |
378 | to the normal priority of the given task. | |
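The comparison described above fits in a few lines. The sketch below assumes a hypothetical `p_task` structure that caches a pointer to its top pi waiter, ignores deadline scheduling, and leaves out the pi_lock; `sketch_adjust_prio` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task: a lower prio number means higher priority. */
struct p_task {
    int normal_prio;                     /* priority without inheritance */
    int prio;                            /* effective priority */
    const struct p_task *top_pi_waiter;  /* top pi waiter, or NULL */
};

/*
 * Since a lower number is a higher priority, "take the higher one" is a
 * minimum. The same code boosts and unboosts.
 */
void sketch_adjust_prio(struct p_task *task)
{
    int target;

    assert(task != NULL);
    target = task->normal_prio;
    if (task->top_pi_waiter != NULL && task->top_pi_waiter->prio < target)
        target = task->top_pi_waiter->prio;
    task->prio = target;
}
```

If the top pi waiter goes away (for example after a timeout), calling the same function again drops the task back to its normal priority.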
379 | ||
380 | ||
381 | High level overview of the PI chain walk | |
382 | ---------------------------------------- | |
383 | ||
384 | The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. | |
385 | ||
386 | The implementation has gone through several iterations, and has ended up | |
387 | with what we believe is the best. It walks the PI chain by only grabbing | |
388 | at most two locks at a time, and is very efficient. | |
389 | ||
390 | The rt_mutex_adjust_prio_chain can be used either to boost or lower process | |
391 | priorities. | |
392 | ||
393 | rt_mutex_adjust_prio_chain is called with a task to be checked for PI | |
394 | (de)boosting (the owner of a mutex that a process is blocking on), a flag to | |
f1824df1 | 395 | check for deadlocking, the mutex that the task owns, a pointer to a waiter |
a6537be9 | 396 | that is the process's waiter struct that is blocked on the mutex (although this |
f1824df1 AS |
397 | parameter may be NULL for deboosting), a pointer to the mutex on which the task |
398 | is blocked, and a top_task as the top waiter of the mutex. | |
a6537be9 SR |
399 | |
400 | For this explanation, I will not mention deadlock detection. This explanation | |
401 | will try to stay at a high level. | |
402 | ||
403 | When this function is called, there are no locks held. That also means | |
404 | that the state of the owner and lock can change once this function is entered. | |
405 | ||
406 | Before this function is called, the task has already had rt_mutex_adjust_prio | |
407 | performed on it. This means that the task is set to the priority that it | |
f1824df1 AS |
408 | should be at, but the rbtree nodes of the task's waiter have not been updated |
409 | with the new priorities, and this task may not be in the proper locations | |
410 | in the pi_waiters and waiters trees that the task is blocked on. This function | |
a6537be9 SR |
411 | solves all that. |
412 | ||
f1824df1 AS |
413 | The main operation of this function is summarized by Thomas Gleixner in |
414 | rtmutex.c. See the 'Chain walk basics and protection scope' comment for further | |
415 | details. | |
a6537be9 SR |
416 | |
417 | Taking of a mutex (The walk through) | |
418 | ------------------------------------ | |
419 | ||
420 | OK, now let's take a look at the detailed walk through of what happens when | |
421 | taking a mutex. | |
422 | ||
423 | The first thing that is tried is the fast taking of the mutex. This is | |
424 | done when we have CMPXCHG enabled (otherwise the fast taking automatically | |
425 | fails). Only when the owner field of the mutex is NULL can the lock be | |
426 | taken with the CMPXCHG and nothing else needs to be done. | |
427 | ||
f1824df1 AS |
428 | If there is contention on the lock, we go about the slow path |
429 | (rt_mutex_slowlock). | |
a6537be9 SR |
430 | |
431 | The slow path function is where the task's waiter structure is created on | |
432 | the stack. This is because the waiter structure is only needed for the | |
433 | scope of this function. The waiter structure holds the nodes to store | |
f1824df1 AS |
434 | the task on the waiters tree of the mutex, and if need be, the pi_waiters |
435 | tree of the owner. | |
a6537be9 SR |
436 | |
437 | The wait_lock of the mutex is taken since the slow path of unlocking the | |
438 | mutex also takes this lock. | |
439 | ||
440 | We then call try_to_take_rt_mutex. This is where the architecture that | |
441 | does not implement CMPXCHG would always grab the lock (if there's no | |
442 | contention). | |
443 | ||
444 | try_to_take_rt_mutex is used every time the task tries to grab a mutex in the | |
445 | slow path. The first thing that is done here is an atomic setting of | |
f1824df1 AS |
446 | the "Has Waiters" flag of the mutex's owner field. By setting this flag |
447 | now, the current owner of the mutex being contended for can't release the mutex | |
448 | without going into the slow unlock path, and it would then need to grab the | |
449 | wait_lock, which this code currently holds. So setting the "Has Waiters" flag | |
450 | forces the current owner to synchronize with this code. | |
451 | ||
452 | The lock is taken if the following are true: | |
387b1468 | 453 | |
f1824df1 AS |
454 | 1) The lock has no owner |
455 | 2) The current task has the highest priority among all other | |
456 | waiters of the lock | |
457 | ||
458 | If the task succeeds in acquiring the lock, then the task is set as the | |
459 | owner of the lock, and if the lock still has waiters, the top_waiter | |
460 | (highest priority task waiting on the lock) is added to this task's | |
461 | pi_waiters tree. | |
462 | ||
463 | If the lock is not taken by try_to_take_rt_mutex(), then the | |
464 | task_blocks_on_rt_mutex() function is called. This will add the task to | |
465 | the lock's waiter tree and propagate the pi chain of the lock as well | |
466 | as the lock's owner's pi_waiters tree. This is described in the next | |
467 | section. | |
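The two conditions for taking the lock can be written down as a small decision function. This is a sketch under simplifying assumptions: `t_task`, `t_mutex` and `sketch_try_to_take` are hypothetical names, the mutex only remembers its top waiter, and the waiter and pi_waiters bookkeeping as well as the "Has Waiters" flag are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task: a lower prio number means higher priority. */
struct t_task { int prio; };

/* Hypothetical mutex that only tracks its owner and its top waiter. */
struct t_mutex {
    struct t_task *owner;        /* NULL when the mutex is free */
    struct t_task *top_waiter;   /* highest priority waiter, or NULL */
};

/* Returns true and sets the owner if both conditions hold. */
bool sketch_try_to_take(struct t_mutex *m, struct t_task *task)
{
    assert(m != NULL && task != NULL);

    if (m->owner != NULL)
        return false;            /* condition 1: the lock has an owner */

    if (m->top_waiter != NULL && m->top_waiter != task &&
        m->top_waiter->prio <= task->prio)
        return false;            /* condition 2: a waiter outranks the task */

    m->owner = task;             /* dequeueing of 'task' is not shown */
    return true;
}
```

In this sketch a task that ties with the top waiter does not get the lock; that choice is an assumption of the example.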
a6537be9 SR |
468 | |
469 | Task blocks on mutex | |
470 | -------------------- | |
471 | ||
472 | The accounting of a mutex and process is done with the waiter structure of | |
473 | the process. The "task" field is set to the process, and the "lock" field | |
f1824df1 AS |
474 | to the mutex. The rbtree nodes of the waiter are initialized to the process's |
475 | current priority. | |
a6537be9 SR |
476 | |
477 | Since the wait_lock was taken at the entry of the slow lock, we can safely | |
f1824df1 AS |
478 | add the waiter to the waiters tree of the mutex. If the current process is the |
479 | highest priority process currently waiting on this mutex, then we remove the | |
480 | previous top waiter process (if it exists) from the pi_waiters of the owner, | |
481 | and add the current process to that tree. Since the pi_waiters of the owner |
a6537be9 SR |
482 | has changed, we call rt_mutex_adjust_prio on the owner to see if the owner |
483 | should adjust its priority accordingly. | |
484 | ||
f1824df1 | 485 | If the owner is also blocked on a lock, and had its pi_waiters changed |
a6537be9 SR |
486 | (or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead |
487 | and run rt_mutex_adjust_prio_chain on the owner, as described earlier. | |
488 | ||
489 | Now all locks are released, and if the current process is still blocked on a | |
490 | mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). | |
491 | ||
492 | Waking up in the loop | |
493 | --------------------- | |
494 | ||
f1824df1 AS |
495 | The task can then wake up for a couple of reasons: |
496 | 1) The previous lock owner released the lock, and the task now is top_waiter | |
497 | 2) The task received a signal or timed out | |
a6537be9 | 498 | |
f1824df1 AS |
499 | In both cases, the task will try again to acquire the lock. If it |
500 | does, then it will take itself off the waiters tree and set itself back | |
501 | to the TASK_RUNNING state. | |
a6537be9 | 502 | |
f1824df1 AS |
503 | In the first case, if the lock was acquired by another task before this task |
504 | could get the lock, then it will go back to sleep and wait to be woken again. | |
a6537be9 | 505 | |
f1824df1 AS |
506 | The second case is only applicable for tasks that are grabbing a mutex |
507 | that can wake up before getting the lock, either due to a signal or | |
508 | a timeout (e.g. rt_mutex_timed_futex_lock()). When woken, it will try to | |
509 | take the lock again. If it succeeds, then the task will return with the | |
510 | lock held, otherwise it will return with -EINTR if the task was woken | |
511 | by a signal, or -ETIMEDOUT if it timed out. | |
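The outcome of one pass through the wake-up loop can be summarized as a small decision function. The function `sketch_after_wakeup` and the marker `SKETCH_SLEEP_AGAIN` are hypothetical, and checking the signal before the timeout when both are pending is an arbitrary choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical marker meaning "go back to sleep and wait to be woken". */
#define SKETCH_SLEEP_AGAIN 1

/* The marker must not collide with 0 (success) or a negative errno value. */
static_assert(SKETCH_SLEEP_AGAIN > 0, "marker must be positive");

/*
 * Decide what happens after a wake-up. Getting the lock always wins, even
 * if a signal or timeout is also pending.
 */
int sketch_after_wakeup(bool got_lock, bool signal_pending, bool timed_out)
{
    if (got_lock)
        return 0;                 /* return with the lock held */
    if (signal_pending)
        return -EINTR;            /* woken by a signal */
    if (timed_out)
        return -ETIMEDOUT;        /* the timeout expired */
    return SKETCH_SLEEP_AGAIN;    /* another task got the lock first */
}
```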
a6537be9 SR |
512 | |
513 | ||
514 | Unlocking the Mutex | |
515 | ------------------- | |
516 | ||
517 | The unlocking of a mutex also has a fast path for those architectures with | |
518 | CMPXCHG. Since the taking of a mutex on contention always sets the | |
519 | "Has Waiters" flag of the mutex's owner, we use this to know if we need to | |
520 | take the slow path when unlocking the mutex. If the mutex doesn't have any | |
521 | waiters, the owner field of the mutex would equal the current process and | |
522 | the mutex can be unlocked by just replacing the owner field with NULL. | |
523 | ||
524 | If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), | |
525 | the slow unlock path is taken. | |
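The fast paths for taking and releasing a mutex can be sketched with C11 atomics. This is a user-space illustration only: `f_task`, `f_mutex`, `fast_lock`, `fast_unlock` and `F_HAS_WAITERS` are made-up names, and the default sequentially consistent ordering is used rather than whatever ordering rtmutex.c relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define F_HAS_WAITERS ((uintptr_t)1)   /* bit 0 of the owner word */

struct f_task  { long dummy; };              /* hypothetical task */
struct f_mutex { _Atomic uintptr_t owner; }; /* owner pointer plus flag bit */

/* Fast lock: succeeds only if the owner word is NULL (free, no waiters). */
bool fast_lock(struct f_mutex *m, struct f_task *self)
{
    uintptr_t expected = 0;

    assert(m != NULL && self != NULL);
    return atomic_compare_exchange_strong(&m->owner, &expected,
                                          (uintptr_t)self);
}

/*
 * Fast unlock: succeeds only if the owner word is exactly 'self'. If the
 * "Has Waiters" bit is set the comparison fails and the caller would have
 * to take the slow path (not shown).
 */
bool fast_unlock(struct f_mutex *m, struct f_task *self)
{
    uintptr_t expected = (uintptr_t)self;

    assert(m != NULL && self != NULL);
    return atomic_compare_exchange_strong(&m->owner, &expected,
                                          (uintptr_t)0);
}
```

Because the flag lives in the same word as the owner pointer, a single compare and exchange is enough to detect both "somebody else owns it" and "somebody is waiting".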
526 | ||
527 | The first thing done in the slow unlock path is to take the wait_lock of the | |
528 | mutex. This synchronizes the locking and unlocking of the mutex. | |
529 | ||
530 | A check is made to see if the mutex has waiters or not. On architectures that | |
531 | do not have CMPXCHG, this is the location that the owner of the mutex will | |
532 | determine if a waiter needs to be awoken or not. On architectures that | |
533 | do have CMPXCHG, that check is done in the fast path, but it is still needed | |
534 | in the slow path too. If a waiter of a mutex woke up because of a signal | |
535 | or timeout between the time the owner failed the fast path CMPXCHG check and | |
536 | the grabbing of the wait_lock, the mutex may not have any waiters, thus the | |
9ba0bdfd | 537 | owner still needs to make this check. If there are no waiters then the mutex |
a6537be9 SR |
538 | owner field is set to NULL, the wait_lock is released and nothing more is |
539 | needed. | |
540 | ||
f1824df1 | 541 | If there are waiters, then we need to wake one up. |
a6537be9 SR |
542 | |
543 | In the wake up code, the pi_lock of the current owner is taken. The top | |
f1824df1 AS |
544 | waiter of the lock is found and removed from the waiters tree of the mutex |
545 | as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is | |
546 | marked to prevent lower priority tasks from stealing the lock. | |
a6537be9 SR |
547 | |
548 | Finally we unlock the pi_lock of the current owner and wake up the top waiter. | |
549 | ||
550 | ||
551 | Contact | |
552 | ------- | |
553 | ||
554 | For updates on this document, please email Steven Rostedt <rostedt@goodmis.org> | |
555 | ||
556 | ||
557 | Credits | |
558 | ------- | |
559 | ||
560 | Author: Steven Rostedt <rostedt@goodmis.org> | |
387b1468 | 561 | |
f1824df1 | 562 | Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017 |
a6537be9 | 563 | |
387b1468 MCC |
564 | Original Reviewers: |
565 | Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and | |
f1824df1 | 566 | Randy Dunlap |
387b1468 | 567 | |
f1824df1 | 568 | Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior |
a6537be9 SR |
569 | |
570 | Updates | |
571 | ------- | |
572 | ||
573 | This document was originally written for 2.6.17-rc3-mm1 and | |
f1824df1 | 574 | was updated for 4.12. |