# Introduction: the sleep module

The code in this module governs when worker threads should go to
sleep. This is a tricky topic -- the work-stealing algorithm relies on
having active worker threads running around stealing from one
another. But, if there isn't a lot of work, this can be a bit of a
drag, because it requires high CPU usage.

The code in this module takes a fairly simple approach to the
problem. It allows worker threads to fall asleep if they have failed
to steal work after various thresholds; however, whenever new work
appears, they will wake up briefly and try to steal again. There are some
shortcomings in this current approach:

- it can (to some extent) scale *down* the number of threads, but they
  can never be scaled *up*. The latter might be useful in the case of
  user tasks that must (perhaps very occasionally and unpredictably)
  block for long periods of time.
  - however, the preferred approach to this is for users to adopt futures
    instead (and indeed this sleeping work is intended to enable future
    integration).
- we have no way to wake up threads in a fine-grained or targeted
  manner. The current system wakes up *all* sleeping threads whenever
  *any* of them might be interested in an event. This means that while
  we can scale CPU usage down, we do so in a fairly "bursty" manner,
  where everyone comes online, then some of them go back offline.

# The interface for workers

Workers interact with the sleep module by invoking three methods:

- `work_found()`: signals that the worker found some work and is about
  to execute it.
- `no_work_found()`: signals that the worker searched all available sources
  for work and found none.
  - It is important for the coherence of the algorithm that if work
    was available **before the search started**, it would have been
    found. If work was made available during the search, then it's ok that
    it might have been overlooked.
- `tickle()`: indicates that new work is available (e.g., a job has
  been pushed to the local deque) or that some other blocking
  condition has been resolved (e.g., a latch has been set). Wakes up any
  sleeping workers.

When in a loop searching for work, workers also have to maintain an
integer `yields` that they provide to the sleep module (which will
return a new value for the next time). Thus the basic worker "find
work" loop looks like this (this is `wait_until()`, basically):
```rust
let mut yields = 0;
while /* not done */ {
    if let Some(job) = search_for_work() {
        yields = work_found(self.index, yields);
        // ... execute `job` ...
    } else {
        yields = no_work_found(self.index, yields);
    }
}
```

# Getting sleepy and falling asleep

The basic idea here is that every worker goes through three states:

- **Awake:** actively hunting for tasks.
- **Sleepy:** still actively hunting for tasks, but we have signaled that
  we might go to sleep soon if we don't find any.
- **Asleep:** actually asleep (blocked on a condition variable).

At any given time, only **one** worker can be in the sleepy
state. This allows us to coordinate the entire sleep protocol using a
single `AtomicUsize` and without the need for epoch counters or other
things that might roll over and so forth.

Whenever a worker invokes `work_found()`, it transitions back to the
**awake** state. In other words, if it was sleepy, it stops being
sleepy. (`work_found()` cannot be invoked when the worker is asleep,
since then it is not doing anything.)

On the other hand, whenever a worker invokes `no_work_found()`, it
*may* transition to a more sleepy state. To track this, we use the
counter `yields` that is maintained by the worker's steal loop. This
counter starts at 0. Whenever work is found, the counter is returned
to 0. But each time that **no** work is found, the counter is
incremented. Eventually it will reach a threshold
`ROUNDS_UNTIL_SLEEPY`. At this point, the worker will try to become
the sleepy one. It does this by executing a CAS into the global
registry state (details on this below). If that attempt is successful,
then the counter is incremented again, so that it is equal to
`ROUNDS_UNTIL_SLEEPY + 1`. Otherwise, the counter stays the same (and
hence we will keep trying to become sleepy until either work is found
or we are successful).
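
The counter logic above can be sketched like so. This is a simplified
model, not the actual rayon-core code: the threshold value, the helper
names, and the exact handling of CAS failure here are illustrative
assumptions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical threshold; the real crate may use a different value.
const ROUNDS_UNTIL_SLEEPY: usize = 32;

// A sleepy worker with index `ix` is stored as `(ix + 1) << 1`
// (see "The global state" below); 0 means "no sleepy worker".
fn sleepy_value(worker_index: usize) -> usize {
    (worker_index + 1) << 1
}

// Sketch of `no_work_found`'s counter handling: count failed rounds,
// then try to install ourselves as the unique sleepy worker via CAS.
fn no_work_found(state: &AtomicUsize, worker_index: usize, yields: usize) -> usize {
    if yields == ROUNDS_UNTIL_SLEEPY {
        match state.compare_exchange(
            0,
            sleepy_value(worker_index),
            Ordering::SeqCst,
            Ordering::SeqCst,
        ) {
            Ok(_) => yields + 1, // we are now the sleepy worker
            Err(_) => yields,    // someone else is sleepy; retry next round
        }
    } else {
        yields + 1
    }
}
```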

Becoming sleepy does not put us to sleep immediately. Instead, we keep
iterating and looking for work for some further number of rounds. If
during this search we **do** find work, then we will return the
counter to 0 and also reset the global state to indicate we are no
longer sleepy.

But if again no work is found, `yields` will eventually reach the
value `ROUNDS_UNTIL_ASLEEP`. At that point, we will try to transition
from **sleepy** to **asleep**. This is done by the helper fn
`sleep()`, which executes another CAS on the global state that removes
our worker as the sleepy worker and instead sets a flag to indicate
that there are sleeping workers present (the flag may already have
been set, that's ok). Assuming that CAS succeeds, we will block on a
condition variable.
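
A minimal sketch of that sleepy-to-asleep transition, assuming a
hypothetical `Sleep` struct holding the packed state word alongside a
mutex and condition variable (the real implementation differs in its
details):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

pub struct Sleep {
    state: AtomicUsize, // packed word described under "The global state"
    lock: Mutex<()>,
    condvar: Condvar,
}

impl Sleep {
    // Called once `yields` reaches ROUNDS_UNTIL_ASLEEP: trade our
    // "sleepy worker" entry for the "workers are asleep" flag (bit 0),
    // then block on the condition variable.
    pub fn sleep(&self, worker_index: usize) {
        let sleepy = (worker_index + 1) << 1;
        let guard = self.lock.lock().unwrap();
        if self
            .state
            .compare_exchange(sleepy, 1, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
        {
            // Block until another thread tickles us.
            let _guard = self.condvar.wait(guard).unwrap();
        }
        // If the CAS failed, someone tickled us in the meantime and
        // cleared our sleepy entry, so we stay awake and keep searching.
    }
}
```

In this sketch the CAS is performed while holding the lock; a tickler
that sees the "asleep" flag must likewise take the lock before
notifying, so the wakeup cannot slip in between the CAS and the `wait`.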

# Tickling workers

Of course, while all the stuff in the previous section is happening,
other workers are (hopefully) producing new work. There are three kinds of
events that can allow a blocked worker to make progress:

1. A new task is pushed onto a worker's deque. This task could be stolen.
2. A new task is injected into the thread-pool from the outside. This
   task could be uninjected and executed.
3. A latch is set. One of the sleeping workers might have been waiting for
   that before it could go on.

Whenever one of these things happens, the worker (or thread, more generally)
responsible must invoke `tickle()`. Tickle will basically wake up **all**
the workers:

- If any worker was the sleepy one, then the global state is changed
  so that there is no sleepy worker. The sleepy one will notice this
  when it next invokes `no_work_found()` and return to the *awake* state
  (with a yield counter of zero).
- If any workers were actually **asleep**, then we invoke
  `notify_all()` on the condition variable, which will cause them to
  awaken and start over from the awake state (with a yield counter of
  zero).

Because `tickle()` is invoked very frequently -- and hopefully most of
the time it is not needed, because the workers are already actively
stealing -- it is important that it be very cheap. The current design
requires, in the case where nobody is even sleepy, just a load and a
compare. If there are sleepy workers, a swap is needed. If there are
workers *asleep*, we must naturally acquire the lock and signal the
condition variable.
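
That cost structure can be seen in a sketch of `tickle()` itself. As
elsewhere in this document's examples, this `Sleep` struct is a
hypothetical stand-in for the real one, but the fast path (a single
seq-cst load and compare) matches the description:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

pub struct Sleep {
    state: AtomicUsize, // packed word described under "The global state"
    lock: Mutex<()>,
    condvar: Condvar,
}

impl Sleep {
    pub fn tickle(&self) {
        // Fast path: nobody is sleepy or asleep, so a load and a
        // compare is all we pay.
        if self.state.load(Ordering::SeqCst) != 0 {
            // Someone is sleepy or asleep: clear the state with a swap
            // (this de-sleepifies any sleepy worker).
            let old_state = self.state.swap(0, Ordering::SeqCst);
            if old_state & 1 != 0 {
                // Workers are actually asleep: take the lock and wake
                // them all up.
                let _guard = self.lock.lock().unwrap();
                self.condvar.notify_all();
            }
        }
    }
}
```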

# The global state

We manage all of the above state transitions using a small bit of global
state (well, global to the registry). This is stored in the `Sleep` struct.
The primary thing is a single `AtomicUsize`. The value in this usize packs
in two pieces of information:

1. **Are any workers asleep?** This is just one bit (yes or no).
2. **Which worker is the sleepy worker, if any?** This is a worker id.

We use bit 0 to indicate whether any workers are asleep. So if `state
& 1` is zero, then no workers are sleeping. But if `state & 1` is 1,
then some workers are either sleeping or on their way to falling
asleep (i.e., they have acquired the lock).

The remaining bits are used to store if there is a sleepy worker. We
want `0` to indicate that there is no sleepy worker. If there is a sleepy
worker with index `worker_index`, we would store `(worker_index + 1)
<< 1`. The `+1` is there because worker indices are 0-based, so this
ensures that the value is non-zero, and the shift skips over the
sleepy bit.

Some examples:

- `0`: everyone is awake, nobody is sleepy
- `1`: some workers are asleep, no sleepy worker
- `2`: no workers are asleep, but worker 0 is sleepy (`(0 + 1) << 1 == 2`).
- `3`: some workers are asleep, and worker 0 is sleepy.

# Correctness level 1: avoiding deadlocks etc

In general, we do not want to miss wakeups. Two bad things could happen:

- **Suboptimal performance**: If this is a wakeup about a new job being
  pushed into a local deque, it won't deadlock, but it will cause
  things to run slowly. The reason that it won't deadlock is that we
  know at least one thread is active (the one doing the pushing), and
  it will (sooner or later) try to pop this item from its own local
  deque.
- **Deadlocks:** If this is a wakeup about an injected job or a latch that got set, however,
  this can cause deadlocks. In the former case, if a job is injected but no thread ever
  wakes to process it, the injector will likely block forever. In the latter case,
  imagine this scenario:
  - thread A calls join, forking a task T1, then executing task T2
  - thread B steals T1, forks a task T3, and executes T4
  - thread A completes task T2 and blocks on T1
  - thread A steals task T3 from thread B
  - thread B finishes T4 and goes to sleep, blocking on T3
  - thread A completes task T3 and makes a wakeup, but it gets lost

  At this point, thread B is still asleep and will never signal T1, so thread A will itself
  go to sleep. Bad.

It turns out that guaranteeing we don't miss a wakeup while retaining
good performance is fairly tricky. This is because of some details of
the C++11 memory model. But let's ignore those for now and generally
assume sequential consistency. In that case, our scheme should work
perfectly.

Even if you assume seq-cst, though, ensuring that you don't miss
wakeups can be fairly tricky in the absence of a central queue. For
example, consider the simplest scheme: imagine we just had a boolean
flag indicating whether anyone was asleep. Then you could imagine that
when workers find no work, they flip this flag to true. When work is
published, if the flag is true, we issue a wakeup.

The problem here is that checking for new work is not an atomic
action. So it's possible that worker 1 could start looking for work
and (say) see that worker 0's queue is empty and then search workers
2..N. While that searching is taking place, worker 0 publishes some
new work. At the time when the new work is published, the "anyone
sleeping?" flag is still false, so nothing happens. Then worker 1, who
failed to find any work, goes to sleep --- completely missing the wakeup!

We use the "sleepy worker" idea to sidestep this problem. Under our
scheme, instead of going right to sleep at the end, worker 1 would
become sleepy. Worker 1 would then do **at least** one additional
scan. During this scan, they should find the work published by worker
0, so they will stop being sleepy and go back to work (here of course
we are assuming that no one else has stolen the worker 0 work yet; if
someone else stole it, worker 1 may still go to sleep, but that's ok,
since there is no more work to be had).

Now you may be wondering -- how does being sleepy help? What if,
instead of publishing its job right before worker 1 became sleepy,
worker 0 waited until right before worker 1 was going to go to sleep? In
other words, suppose the sequence was like this:

- worker 1 gets sleepy
- worker 1 starts its scan, scanning worker 0's deque
- worker 0 publishes its job, but nobody is sleeping yet, so no wakeups occur
- worker 1 finishes its scan, goes to sleep, missing the wakeup

The reason that this doesn't occur is because, when worker 0 publishes
its job, it will see that there is a sleepy worker. It will clear the
global state to 0. Then, when worker 1 finishes its scan, it will notice that
it is no longer sleepy, and hence it will not go to sleep. Instead it
will awaken and keep searching for work.

The sleepy worker phase thus also serves as a cheap way to signal that
work is around: instead of doing the whole dance of acquiring a lock
and issuing notifications, when we publish work we can just swap a
single atomic counter and let the sleepy worker notice that on their
own.

## Beyond seq-cst

Unfortunately, the C++11 memory model doesn't generally guarantee
seq-cst. And, somewhat annoyingly, it's not easy for the sleep module
**in isolation** to guarantee the properties we need. The key
challenge has to do with the *synchronized-with* relation. Typically,
we try to use acquire-release reasoning, and in that case the idea is
that **if** a load observes a store, it will also observe those writes
that preceded the store. But nothing says that the load **must**
observe the store -- at least not right away.

The place that this is most relevant is the load in the `tickle()`
routine. The routine begins by reading from the global state. If it
sees anything other than 0, it then does a swap and -- if necessary --
acquires a lock and does a notify. This load is a seq-cst load (as are
the other accesses in tickle). This ensures that it is sensible to
talk about a tickle happening *before* a worker gets sleepy and so
forth.

It turns out that to get things right, if we use the current tickle
routine, we have to use seq-cst operations **both in the sleep module
and when publishing work**. We'll walk through two scenarios to
show what I mean.

### Scenario 1: get-sleepy-then-get-tickled

This scenario shows why the operations in sleep must be seq-cst. We
want to ensure that once a worker gets sleepy, any other worker that
does a tickle will observe that. In other words, we want to ensure
that the following scenario **cannot happen**:

1. worker 1 is blocked on latch L
2. worker 1 becomes sleepy
    - becoming sleepy involves a CAS on the global state to set it to 4 ("worker 1 is sleepy")
3. worker 0 sets latch L
4. worker 0 tickles **but does not see that worker 1 is sleepy**

Let's diagram this. The notation `read_xxx(A) = V` means that a read
of location `A` was executed with the result `V`. The `xxx` is the
ordering and the location `A` is either `L` (latch) or `S` (global
state). I will leave the ordering on the latch as `xxx` as it is not
relevant here. The numbers correspond to the steps above.

```
  worker 0                      worker 1
 |                          +-- 2: cas_sc(S, 4)
s|  3: write_xxx(L)         |
b|  4: read_sc(S) = ???  <-sc-+
 v
```

Clearly, this cannot happen with sc orderings, because read 4 will
always return `4` here. However, if we tried to use acquire-release
orderings on the global state, then there would be **no guarantee**
that the tickle will observe that a sleepy worker occurred. We would
be guaranteed only that worker 0 would **eventually** observe that
worker 1 had become sleepy (and, at that time, that it would see other
writes). But it could take time -- and if we indeed miss that worker 1
is sleepy, it could lead to deadlock or loss of efficiency, as
explained earlier.

### Scenario 2: tickle-then-get-sleepy

<a name="tickle-then-get-sleepy"></a>

This scenario shows why latch operations must *also* be seq-cst (and,
more generally, any operations that publish work before a tickle). We
wish to ensure that this ordering of events **cannot occur**:

1. worker 1 is blocked on latch L
2. worker 1 reads latch L, sees false, starts searching for work
3. worker 0 sets latch L
4. worker 0 tickles
    - the tickle reads from the global state, sees 0
5. worker 1 finishes searching, becomes sleepy
    - becoming sleepy involves a CAS on the global state to set it to 4 ("worker 1 is sleepy")
6. worker 1 reads latch L **but does not see that worker 0 set it**
7. worker 1 may then proceed to fall asleep

In other words, we want to ensure that if worker 0 sets a latch and
does a tickle *before worker 1 gets sleepy*, then worker 1 will
observe that latch as set when it calls probe. We'll see that, with
the current scheme, this implies that the latch memory orderings must
be seq-cst as well.

Here is the diagram:

```
  worker 0                      worker 1
 |                              2: read_xxx(L) = false
s|  3: write_xxx(L, true)
b|  4: read_sc(S) = 0    --+
 |                         +-sc---> 5: cas_sc(S, 4)
 v                              6: read_xxx(L) = ???
```

The diagram shows that each thread's actions are related by
*sequenced-before* (sb). Moreover the read and write of `S` are
related by `sc` (the seq-cst ordering). However, and this is crucial,
this **does not** imply that oper 4 *synchronizes-with* oper 5. This
is because a read never synchronizes-with a store, only the
reverse. Hence, if the latch were using acq-rel orderings, it would be
legal for oper 6 to return false. But if the latch were to use an
**sc** ordering itself, then we know that oper 6 must return true,
since `3 -sc-> 4 -sc-> 5 -sc-> 6`.
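
A latch that participates in that `sc` chain can be sketched like this.
It is a simplified stand-in for the crate's latch type, shown only to
make the orderings concrete:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Publishing with a seq-cst store and probing with a seq-cst load puts
// operations 3 and 6 into the single total order of all seq-cst
// operations, which is what rules out the bad execution above.
pub struct Latch {
    set: AtomicBool,
}

impl Latch {
    pub fn new() -> Latch {
        Latch { set: AtomicBool::new(false) }
    }

    // Oper 3: publish (after this, the caller invokes `tickle()`).
    pub fn set(&self) {
        self.set.store(true, Ordering::SeqCst);
    }

    // Oper 6: the probe performed during the worker's scan.
    pub fn probe(&self) -> bool {
        self.set.load(Ordering::SeqCst)
    }
}
```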

**Note** that this means that, before we tickle, we must execute some
seq-cst stores to publish our work (and during the scan we must load
from those same locations) **if we wish to guarantee that the work we
published WILL be seen by the other threads** (as opposed to
*may*). This is true for setting a latch -- if a latch is set but
another thread misses it, then the system could deadlock. However, in
the case of pushing new work to a deque, we choose not to use a seq-cst
ordering. This is for several reasons:

- If we miss a wakeup, the consequences are less dire: we simply run
  less efficiently (after all, the current thread will eventually
  complete its current task and pop the new task off the deque).
- It is inconvenient: the deque code is beyond our control (it lies in another package). However,
  we could create a dummy `AtomicBool` for each deque and do a seq-cst write to it
  (with whatever value) after we push to the deque, and a seq-cst load whenever
  we steal from the deque.
- The cost of using a dummy variable was found to be quite high for some benchmarks:
  - 8-10% overhead on nbody-parreduce
  - 15% overhead on increment-all
  - 40% overhead on join-recursively

### Alternative solutions

In both cases above, our problems arose because tickle is merely
performing a seq-cst read. If we instead had tickle perform a release
*swap*, that would be a write action of the global state. No matter
the ordering mode, all writes to the same memory location have a total
ordering, and hence we would not have to worry about others storing a
value that we fail to read (as in scenario 1). Similarly, as a release
write, a swap during tickle would synchronize-with a later cas and so
scenario 2 should be averted. So you might wonder why we don't do
that. The simple reason was that it didn't perform as well! In my
measurements, many benchmarks were unaffected by using a swap, but
some of them were hit hard:

- 8-10% overhead on nbody-parreduce
- 35% overhead on increment-all
- 245% overhead on join-recursively

# Deadlock detection

This module tracks a number of variables in order to detect deadlocks due to user code blocking.
These variables are stored in the `SleepData` struct which itself is kept behind a mutex.
It contains the following fields:

- `worker_count` - The number of threads in the thread pool.
- `active_threads` - The number of threads in the thread pool which are running
  and aren't blocked in user code or sleeping.
- `blocked_threads` - The number of threads which are blocked in user code.
  This doesn't include threads blocked by Rayon.

User code can indicate blocking by calling `mark_blocked` before blocking and
calling `mark_unblocked` before unblocking a thread.
This will adjust `active_threads` and `blocked_threads` accordingly.
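
A sketch of this bookkeeping, using the field names from the text but
otherwise illustrative (the real `SleepData` carries more context and
its methods live on the sleep module's types):

```rust
use std::sync::Mutex;

struct SleepData {
    worker_count: usize,
    active_threads: usize,
    blocked_threads: usize,
}

impl SleepData {
    // The deadlock condition described below: nobody is running, yet
    // someone is blocked in user code waiting on work that no thread
    // can ever produce.
    fn deadlocked(&self) -> bool {
        self.active_threads == 0 && self.blocked_threads > 0
    }
}

// Called by user code before it blocks.
fn mark_blocked(data: &Mutex<SleepData>) {
    let mut d = data.lock().unwrap();
    d.active_threads -= 1;
    d.blocked_threads += 1;
}

// Called by user code before it unblocks a thread.
fn mark_unblocked(data: &Mutex<SleepData>) {
    let mut d = data.lock().unwrap();
    d.active_threads += 1;
    d.blocked_threads -= 1;
}
```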

When we tickle the thread pool in `Sleep::tickle_cold`, we set `active_threads` to
`worker_count` - `blocked_threads` since we wake up all Rayon threads, but not threads blocked
by user code.

A deadlock is detected by checking if `active_threads` is 0 and `blocked_threads` is above 0.
If we ignored `blocked_threads` we would have a deadlock
immediately when creating the thread pool.
We would also deadlock once the thread pool ran out of work.
It is not possible for Rayon itself to deadlock.
Deadlocks can only be caused by user code blocking, so this condition doesn't miss any deadlocks.

We check for the deadlock condition when
threads fall asleep in `mark_unblocked` and in `Sleep::sleep`.
If there's a deadlock detected we call the user provided deadlock handler while we hold the
lock to `SleepData`. This means the deadlock handler cannot call `mark_blocked` and
`mark_unblocked`. The user is expected to handle the deadlock in some non-Rayon thread.
Once the deadlock handler returns, the thread which called the deadlock handler will go to sleep.