]>
Commit | Line | Data |
---|---|---|
dc7a12bd | 1 | ====================================== |
9762f12d DM |
2 | vlocks for Bare-Metal Mutual Exclusion |
3 | ====================================== | |
4 | ||
5 | Voting Locks, or "vlocks" provide a simple low-level mutual exclusion | |
6 | mechanism, with reasonable but minimal requirements on the memory | |
7 | system. | |
8 | ||
9 | These are intended to be used to coordinate critical activity among CPUs | |
10 | which are otherwise non-coherent, in situations where the hardware | |
11 | provides no other mechanism to support this and ordinary spinlocks | |
12 | cannot be used. | |
13 | ||
14 | ||
15 | vlocks make use of the atomicity provided by the memory system for | |
16 | writes to a single memory location. To arbitrate, every CPU "votes for | |
17 | itself", by storing a unique number to a common memory location. The | |
18 | final value seen in that memory location when all the votes have been | |
19 | cast identifies the winner. | |
20 | ||
21 | In order to make sure that the election produces an unambiguous result | |
22 | in finite time, a CPU will only enter the election in the first place if | |
23 | no winner has been chosen and the election does not appear to have | |
24 | started yet. | |
25 | ||
26 | ||
27 | Algorithm | |
28 | --------- | |
29 | ||
dc7a12bd | 30 | The easiest way to explain the vlocks algorithm is with some pseudo-code:: |
9762f12d DM |
31 | |
32 | ||
33 | int currently_voting[NR_CPUS] = { 0, }; | |
34 | int last_vote = -1; /* no votes yet */ | |
35 | ||
36 | bool vlock_trylock(int this_cpu) | |
37 | { | |
38 | /* signal our desire to vote */ | |
39 | currently_voting[this_cpu] = 1; | |
40 | if (last_vote != -1) { | |
41 | /* someone already volunteered himself */ | |
42 | currently_voting[this_cpu] = 0; | |
43 | return false; /* not ourself */ | |
44 | } | |
45 | ||
46 | /* let's suggest ourself */ | |
47 | last_vote = this_cpu; | |
48 | currently_voting[this_cpu] = 0; | |
49 | ||
50 | /* then wait until everyone else is done voting */ | |
51 | for_each_cpu(i) { | |
52 | while (currently_voting[i] != 0) | |
53 | /* wait */; | |
54 | } | |
55 | ||
56 | /* result */ | |
57 | if (last_vote == this_cpu) | |
58 | return true; /* we won */ | |
59 | return false; | |
60 | } | |
61 | ||
62 | bool vlock_unlock(void) | |
63 | { | |
64 | last_vote = -1; | |
65 | } | |
66 | ||
67 | ||
68 | The currently_voting[] array provides a way for the CPUs to determine | |
69 | whether an election is in progress, and plays a role analogous to the | |
70 | "entering" array in Lamport's bakery algorithm [1]. | |
71 | ||
72 | However, once the election has started, the underlying memory system | |
73 | atomicity is used to pick the winner. This avoids the need for a static | |
74 | priority rule to act as a tie-breaker, or any counters which could | |
75 | overflow. | |
76 | ||
77 | As long as the last_vote variable is globally visible to all CPUs, it | |
78 | will contain only one value that won't change once every CPU has cleared | |
79 | its currently_voting flag. | |
80 | ||
81 | ||
82 | Features and limitations | |
83 | ------------------------ | |
84 | ||
85 | * vlocks are not intended to be fair. In the contended case, it is the | |
86 | _last_ CPU which attempts to get the lock which will be most likely | |
87 | to win. | |
88 | ||
89 | vlocks are therefore best suited to situations where it is necessary | |
90 | to pick a unique winner, but it does not matter which CPU actually | |
91 | wins. | |
92 | ||
93 | * Like other similar mechanisms, vlocks will not scale well to a large | |
94 | number of CPUs. | |
95 | ||
96 | vlocks can be cascaded in a voting hierarchy to permit better scaling | |
dc7a12bd | 97 | if necessary, as in the following hypothetical example for 4096 CPUs:: |
9762f12d DM |
98 | |
99 | /* first level: local election */ | |
100 | my_town = towns[(this_cpu >> 4) & 0xf]; | |
101 | I_won = vlock_trylock(my_town, this_cpu & 0xf); | |
102 | if (I_won) { | |
103 | /* we won the town election, let's go for the state */ | |
104 | my_state = states[(this_cpu >> 8) & 0xf]; | |
105 | I_won = vlock_lock(my_state, this_cpu & 0xf)); | |
106 | if (I_won) { | |
107 | /* and so on */ | |
108 | I_won = vlock_lock(the_whole_country, this_cpu & 0xf]; | |
109 | if (I_won) { | |
110 | /* ... */ | |
111 | } | |
112 | vlock_unlock(the_whole_country); | |
113 | } | |
114 | vlock_unlock(my_state); | |
115 | } | |
116 | vlock_unlock(my_town); | |
117 | ||
118 | ||
119 | ARM implementation | |
120 | ------------------ | |
121 | ||
122 | The current ARM implementation [2] contains some optimisations beyond | |
123 | the basic algorithm: | |
124 | ||
125 | * By packing the members of the currently_voting array close together, | |
126 | we can read the whole array in one transaction (providing the number | |
127 | of CPUs potentially contending the lock is small enough). This | |
128 | reduces the number of round-trips required to external memory. | |
129 | ||
130 | In the ARM implementation, this means that we can use a single load | |
dc7a12bd | 131 | and comparison:: |
9762f12d DM |
132 | |
133 | LDR Rt, [Rn] | |
134 | CMP Rt, #0 | |
135 | ||
dc7a12bd | 136 | ...in place of code equivalent to:: |
9762f12d DM |
137 | |
138 | LDRB Rt, [Rn] | |
139 | CMP Rt, #0 | |
140 | LDRBEQ Rt, [Rn, #1] | |
141 | CMPEQ Rt, #0 | |
142 | LDRBEQ Rt, [Rn, #2] | |
143 | CMPEQ Rt, #0 | |
144 | LDRBEQ Rt, [Rn, #3] | |
145 | CMPEQ Rt, #0 | |
146 | ||
147 | This cuts down on the fast-path latency, as well as potentially | |
148 | reducing bus contention in contended cases. | |
149 | ||
150 | The optimisation relies on the fact that the ARM memory system | |
151 | guarantees coherency between overlapping memory accesses of | |
152 | different sizes, similarly to many other architectures. Note that | |
153 | we do not care which element of currently_voting appears in which | |
154 | bits of Rt, so there is no need to worry about endianness in this | |
155 | optimisation. | |
156 | ||
157 | If there are too many CPUs to read the currently_voting array in | |
158 | one transaction then multiple transations are still required. The | |
159 | implementation uses a simple loop of word-sized loads for this | |
160 | case. The number of transactions is still fewer than would be | |
161 | required if bytes were loaded individually. | |
162 | ||
163 | ||
164 | In principle, we could aggregate further by using LDRD or LDM, but | |
165 | to keep the code simple this was not attempted in the initial | |
166 | implementation. | |
167 | ||
168 | ||
169 | * vlocks are currently only used to coordinate between CPUs which are | |
170 | unable to enable their caches yet. This means that the | |
171 | implementation removes many of the barriers which would be required | |
172 | when executing the algorithm in cached memory. | |
173 | ||
174 | packing of the currently_voting array does not work with cached | |
175 | memory unless all CPUs contending the lock are cache-coherent, due | |
176 | to cache writebacks from one CPU clobbering values written by other | |
177 | CPUs. (Though if all the CPUs are cache-coherent, you should be | |
178 | probably be using proper spinlocks instead anyway). | |
179 | ||
180 | ||
181 | * The "no votes yet" value used for the last_vote variable is 0 (not | |
182 | -1 as in the pseudocode). This allows statically-allocated vlocks | |
183 | to be implicitly initialised to an unlocked state simply by putting | |
184 | them in .bss. | |
185 | ||
186 | An offset is added to each CPU's ID for the purpose of setting this | |
187 | variable, so that no CPU uses the value 0 for its ID. | |
188 | ||
189 | ||
190 | Colophon | |
191 | -------- | |
192 | ||
193 | Originally created and documented by Dave Martin for Linaro Limited, for | |
194 | use in ARM-based big.LITTLE platforms, with review and input gratefully | |
195 | received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for | |
196 | grabbing most of this text out of the relevant mail thread and writing | |
197 | up the pseudocode. | |
198 | ||
199 | Copyright (C) 2012-2013 Linaro Limited | |
200 | Distributed under the terms of Version 2 of the GNU General Public | |
201 | License, as defined in linux/COPYING. | |
202 | ||
203 | ||
204 | References | |
205 | ---------- | |
206 | ||
207 | [1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming | |
208 | Problem", Communications of the ACM 17, 8 (August 1974), 453-455. | |
209 | ||
ae13c65b | 210 | https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm |
9762f12d DM |
211 | |
212 | [2] linux/arch/arm/common/vlock.S, www.kernel.org. |