# Introduction: the sleep module

The code in this module governs when worker threads should go to
sleep. This is a tricky topic -- the work-stealing algorithm relies on
having active worker threads running around stealing from one
another. But, if there isn't a lot of work, this can be a bit of a
drag, because it requires high CPU usage.

The code in this module takes a fairly simple approach to the
problem. It allows worker threads to fall asleep if they have failed
to steal work after various thresholds; however, whenever new work
appears, they will wake up briefly and try to steal again. There are some
shortcomings in this current approach:

- it can (to some extent) scale *down* the number of threads, but they
  can never be scaled *up*. The latter might be useful in the case of
  user tasks that must (perhaps very occasionally and unpredictably)
  block for long periods of time.
  - however, the preferred approach to this is for users to adopt futures
    instead (and indeed this sleeping work is intended to enable future
    integration).
- we have no way to wake up threads in a fine-grained or targeted
  manner. The current system wakes up *all* sleeping threads whenever
  *any* of them might be interested in an event. This means that while
  we can scale CPU usage down, we do so in a fairly "bursty" manner,
  where everyone comes online, then some of them go back offline.
# The interface for workers

Workers interact with the sleep module by invoking three methods:

- `work_found()`: signals that the worker found some work and is about
  to execute it.
- `no_work_found()`: signals that the worker searched all available sources
  for work and found none.
  - It is important for the coherence of the algorithm that if work
    was available **before the search started**, it would have been
    found. If work was made available during the search, then it's ok that
    it might have been overlooked.
- `tickle()`: indicates that new work is available (e.g., a job has
  been pushed to the local deque) or that some other blocking
  condition has been resolved (e.g., a latch has been set). Wakes up any
  sleeping workers.
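
In sketch form, that worker-facing surface might look like the
following (hypothetical signatures, consistent with the loop below but
not the crate's exact code):

```rust
// Illustrative skeleton of the sleep module's worker-facing API.
// `Sleep` stands in for the real struct, which owns the atomic state
// described under "The global state" below.
struct Sleep { /* global state, lock, condvar */ }

impl Sleep {
    /// The worker found work and is about to execute it; resets sleepiness.
    fn work_found(&self, worker_index: usize, yields: usize) -> usize {
        unimplemented!()
    }

    /// A full search found nothing; may move the worker toward sleep.
    fn no_work_found(&self, worker_index: usize, yields: usize) -> usize {
        unimplemented!()
    }

    /// New work is available, or a blocking condition was resolved.
    fn tickle(&self, worker_index: usize) {
        unimplemented!()
    }
}
```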

When in a loop searching for work, workers also have to maintain an
integer `yields` that they provide to the `sleep` module (which will
return a new value for the next time). Thus the basic worker "find
work" loop looks like this (this is `wait_until()`, basically):

```rust
let mut yields = 0;
while /* not done */ {
    if let Some(job) = search_for_work() {
        yields = work_found(self.index, yields);
    } else {
        yields = no_work_found(self.index, yields);
    }
}
```

# Getting sleepy and falling asleep

The basic idea here is that every worker goes through three states:

- **Awake:** actively hunting for tasks.
- **Sleepy:** still actively hunting for tasks, but we have signaled that
  we might go to sleep soon if we don't find any.
- **Asleep:** actually asleep (blocked on a condition variable).

At any given time, only **one** worker can be in the sleepy
state. This allows us to coordinate the entire sleep protocol using a
single `AtomicUsize`, without the need for epoch counters or other
things that might roll over and so forth.

Whenever a worker invokes `work_found()`, it transitions back to the
**awake** state. In other words, if it was sleepy, it stops being
sleepy. (`work_found()` cannot be invoked when the worker is asleep,
since then it is not doing anything.)

On the other hand, whenever a worker invokes `no_work_found()`, it
*may* transition to a more sleepy state. To track this, we use the
counter `yields` that is maintained by the worker's steal loop. This
counter starts at 0. Whenever work is found, the counter is returned
to 0. But each time that **no** work is found, the counter is
incremented. Eventually it will reach a threshold
`ROUNDS_UNTIL_SLEEPY`. At this point, the worker will try to become
the sleepy one. It does this by executing a CAS into the global
registry state (details on this below). If that attempt is successful,
then the counter is incremented again, so that it is equal to
`ROUNDS_UNTIL_SLEEPY + 1`. Otherwise, the counter stays the same (and
hence we will keep trying to become sleepy until either work is found
or we are successful).
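
As a rough sketch (the constants and encoding follow the text, but the
structure of the real crate's code is simplified):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const ROUNDS_UNTIL_SLEEPY: usize = 32; // illustrative value
const ROUNDS_UNTIL_ASLEEP: usize = 64; // illustrative value

// Sketch of the `no_work_found` bookkeeping. `state` is the global
// `AtomicUsize`; the encoding `(worker_index + 1) << 1` is explained
// under "The global state" below.
fn no_work_found(state: &AtomicUsize, worker_index: usize, yields: usize) -> usize {
    if yields == ROUNDS_UNTIL_SLEEPY {
        // Try to become the sleepy worker (CAS from "no sleepy worker").
        // The real code must also cope with the sleeping bit being set.
        let sleepy = (worker_index + 1) << 1;
        match state.compare_exchange(0, sleepy, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => yields + 1, // success: now equal to ROUNDS_UNTIL_SLEEPY + 1
            Err(_) => yields,    // failure: stay put and try again next round
        }
    } else if yields < ROUNDS_UNTIL_ASLEEP {
        // Keep searching; the real code also notices here if a tickle
        // has cleared our sleepy status, returning us to "awake".
        yields + 1
    } else {
        // Reached ROUNDS_UNTIL_ASLEEP: try to actually fall asleep.
        // sleep(state, worker_index); // see the sketch below
        0
    }
}
```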

Becoming sleepy does not put us to sleep immediately. Instead, we keep
iterating and looking for work for some further number of rounds. If
during this search we **do** find work, then we will return the
counter to 0 and also reset the global state to indicate we are no
longer sleepy.

But if again no work is found, `yields` will eventually reach the
value `ROUNDS_UNTIL_ASLEEP`. At that point, we will try to transition
from **sleepy** to **asleep**. This is done by the helper fn
`sleep()`, which executes another CAS on the global state that removes
our worker as the sleepy worker and instead sets a flag to indicate
that there are sleeping workers present (the flag may already have
been set, that's ok). Assuming that CAS succeeds, we will block on a
condition variable.
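
The sleepy-to-asleep transition might be sketched like so (again
hypothetical; the real `sleep()` re-checks for missed tickles and
handles the case where the sleeping flag is already set):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

const SLEEPING_FLAG: usize = 1; // bit 0: some workers are asleep

// Sketch: stop being the sleepy worker, set the sleeping flag, block.
fn sleep(state: &AtomicUsize, worker_index: usize, lock: &Mutex<()>, condvar: &Condvar) {
    let sleepy = (worker_index + 1) << 1;
    let guard = lock.lock().unwrap();
    // Swap "worker is sleepy" for "workers are asleep". If this CAS
    // fails, a tickle intervened and we should go back to searching.
    if state
        .compare_exchange(sleepy, SLEEPING_FLAG, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
    {
        // Block until some `tickle()` calls `notify_all()`.
        let _guard = condvar.wait(guard).unwrap();
    }
}
```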

# Tickling workers

Of course, while all the stuff in the previous section is happening,
other workers are (hopefully) producing new work. There are three kinds of
events that can allow a blocked worker to make progress:

1. A new task is pushed onto a worker's deque. This task could be stolen.
2. A new task is injected into the thread-pool from the outside. This
   task could be uninjected and executed.
3. A latch is set. One of the sleeping workers might have been waiting for
   that before it could go on.

Whenever one of these things happens, the worker (or thread, more generally)
responsible must invoke `tickle()`. Tickle will basically wake up **all**
the workers:

- If any worker was the sleepy one, then the global state is changed
  so that there is no sleepy worker. The sleepy one will notice this
  when it next invokes `no_work_found()` and return to the *awake* state
  (with a yield counter of zero).
- If any workers were actually **asleep**, then we invoke
  `notify_all()` on the condition variable, which will cause them to
  awaken and start over from the awake state (with a yield counter of
  zero).

Because `tickle()` is invoked very frequently -- and hopefully most of
the time it is not needed, because the workers are already actively
stealing -- it is important that it be very cheap. The current design
requires, in the case where nobody is even sleepy, just a load and a
compare. If there are sleepy workers, a swap is needed. If there are
workers *asleep*, we must naturally acquire the lock and signal the
condition variable.
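
A sketch of that cost profile, with the slow path split into a
separate function as the real crate does with `Sleep::tickle_cold`
(mentioned again under deadlock detection below; the real version also
updates deadlock-detection counters):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

// Sketch of `tickle()`. Fast path: one seq-cst load and a compare.
fn tickle(state: &AtomicUsize, lock: &Mutex<()>, condvar: &Condvar) {
    let old_state = state.load(Ordering::SeqCst);
    if old_state != 0 {
        tickle_cold(state, lock, condvar);
    }
}

// Slow path: clear the sleepy worker and the sleeping flag in one swap;
// if anyone was actually asleep, take the lock and wake everyone.
fn tickle_cold(state: &AtomicUsize, lock: &Mutex<()>, condvar: &Condvar) {
    let old_state = state.swap(0, Ordering::SeqCst);
    if old_state & 1 != 0 {
        let _guard = lock.lock().unwrap();
        condvar.notify_all();
    }
}
```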

# The global state

We manage all of the above state transitions using a small bit of global
state (well, global to the registry). This is stored in the `Sleep` struct.
The primary thing is a single `AtomicUsize`. The value in this usize packs
in two pieces of information:

1. **Are any workers asleep?** This is just one bit (yes or no).
2. **Which worker is the sleepy worker, if any?** This is a worker id.

We use bit 0 to indicate whether any workers are asleep. So if `state
& 1` is zero, then no workers are sleeping. But if `state & 1` is 1,
then some workers are either sleeping or on their way to falling
asleep (i.e., they have acquired the lock).

The remaining bits are used to store if there is a sleepy worker. We
want `0` to indicate that there is no sleepy worker. If there is a sleepy
worker with index `worker_index`, we would store `(worker_index + 1)
<< 1`. The `+1` is there because worker indices are 0-based, so this
ensures that the value is non-zero, and the shift skips over the
sleepy bit.

Some examples:

- `0`: everyone is awake, nobody is sleepy
- `1`: some workers are asleep, no sleepy worker
- `2`: no workers are asleep, but worker 0 is sleepy (`(0 + 1) << 1 == 2`).
- `3`: some workers are asleep, and worker 0 is sleepy.
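
These rules are easy to capture in a couple of helper functions (an
illustrative sketch, not the crate's code):

```rust
// Bit 0: "some workers are asleep". Remaining bits: sleepy worker id + 1.
fn any_asleep(state: usize) -> bool {
    state & 1 != 0
}

fn sleepy_worker(state: usize) -> Option<usize> {
    if state >> 1 == 0 { None } else { Some((state >> 1) - 1) }
}

fn main() {
    // The examples from the list above:
    assert_eq!((false, None), (any_asleep(0), sleepy_worker(0)));
    assert_eq!((true, None), (any_asleep(1), sleepy_worker(1)));
    assert_eq!((false, Some(0)), (any_asleep(2), sleepy_worker(2)));
    assert_eq!((true, Some(0)), (any_asleep(3), sleepy_worker(3)));
}
```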

# Correctness level 1: avoiding deadlocks etc

In general, we do not want to miss wakeups. Two bad things could happen:

- **Suboptimal performance**: If this is a wakeup about a new job being
  pushed into a local deque, it won't deadlock, but it will cause
  things to run slowly. The reason that it won't deadlock is that we
  know at least one thread is active (the one doing the pushing), and
  it will (sooner or later) try to pop this item from its own local
  deque.
- **Deadlocks:** If this is a wakeup about an injected job or a latch that got set, however,
  this can cause deadlocks. In the former case, if a job is injected but no thread ever
  wakes to process it, the injector will likely block forever. In the latter case,
  imagine this scenario:
  - thread A calls join, forking a task T1, then executing task T2
  - thread B steals T1, forks a task T3, and executes T4
  - thread A completes task T2 and blocks on T1
  - thread A steals task T3 from thread B
  - thread B finishes T4 and goes to sleep, blocking on T3
  - thread A completes task T3 and makes a wakeup, but it gets lost
  At this point, thread B is still asleep and will never signal T1, so thread A will itself
  go to sleep. Bad.

It turns out that guaranteeing we don't miss a wakeup while retaining
good performance is fairly tricky. This is because of some details of
the C++11 memory model. But let's ignore those for now and generally
assume sequential consistency. In that case, our scheme should work
perfectly.

Even if you assume seqcst, though, ensuring that you don't miss
wakeups can be fairly tricky in the absence of a central queue. For
example, consider the simplest scheme: imagine we just had a boolean
flag indicating whether anyone was asleep. Then you could imagine that
when workers find no work, they flip this flag to true. When work is
published, if the flag is true, we issue a wakeup.

The problem here is that checking for new work is not an atomic
action. So it's possible that worker 1 could start looking for work
and (say) see that worker 0's queue is empty and then search workers
2..N. While that searching is taking place, worker 0 publishes some
new work. At the time when the new work is published, the "anyone
sleeping?" flag is still false, so nothing happens. Then worker 1, who
failed to find any work, goes to sleep --- completely missing the wakeup!
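
To make the race concrete, the naive scheme looks something like this
(a hypothetical sketch; the comments mark the problematic interleaving):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static ANYONE_ASLEEP: AtomicBool = AtomicBool::new(false);

// Worker 1, having found nothing:
fn naive_go_to_sleep() {
    // (a) scan every deque; all empty
    // (b) <-- worker 0's push can land exactly here
    ANYONE_ASLEEP.store(true, Ordering::SeqCst);
    // (c) block on a condvar; the wakeup for (b) was never sent
}

// Worker 0, publishing work:
fn naive_publish() {
    // push a job onto the local deque...
    if ANYONE_ASLEEP.load(Ordering::SeqCst) {
        // notify sleepers. But if the push happened at (b), this load
        // still sees `false`, so worker 1 sleeps through the new job.
    }
}
```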

We use the "sleepy worker" idea to sidestep this problem. Under our
scheme, instead of going right to sleep at the end, worker 1 would
become sleepy. Worker 1 would then do **at least** one additional
scan. During this scan, they should find the work published by worker
0, so they will stop being sleepy and go back to work (here of course
we are assuming that no one else has stolen worker 0's work yet; if
someone else stole it, worker 1 may still go to sleep, but that's ok,
since there is no more work to be had).

Now you may be wondering -- how does being sleepy help? What if,
instead of publishing its job right before worker 1 became sleepy,
worker 0 waited until right before worker 1 was going to go to sleep? In
other words, the sequence was like this:

- worker 1 gets sleepy
- worker 1 starts its scan, scanning worker 0's deque
- worker 0 publishes its job, but nobody is sleeping yet, so no wakeups occur
- worker 1 finishes its scan, goes to sleep, missing the wakeup

The reason that this doesn't occur is because, when worker 0 publishes
its job, it will see that there is a sleepy worker. It will clear the
global state to 0. Then, when worker 1 finishes its scan, it will notice that
it is no longer sleepy, and hence it will not go to sleep. Instead it
will awaken and keep searching for work.

The sleepy worker phase thus also serves as a cheap way to signal that
work is around: instead of doing the whole dance of acquiring a lock
and issuing notifications, when we publish work we can just swap a
single atomic counter and let the sleepy worker notice that on their
own.

## Beyond seq-cst

Unfortunately, the C++11 memory model doesn't generally guarantee
seq-cst. And, somewhat annoyingly, it's not easy for the sleep module
**in isolation** to guarantee the properties we need. The key
challenge has to do with the *synchronized-with* relation. Typically,
we try to use acquire-release reasoning, and in that case the idea is
that **if** a load observes a store, it will also observe those writes
that preceded the store. But nothing says that the load **must**
observe the store -- at least not right away.

The place where this is most relevant is the load in the `tickle()`
routine. The routine begins by reading from the global state. If it
sees anything other than 0, it then does a swap and -- if necessary --
acquires a lock and does a notify. This load is a seq-cst load (as are
the other accesses in tickle). This ensures that it is sensible to
talk about a tickle happening *before* a worker gets sleepy and so
forth.

It turns out that to get things right, if we use the current tickle
routine, we have to use seq-cst operations **both in the sleep module
and when publishing work**. We'll walk through two scenarios to
show what I mean.

### Scenario 1: get-sleepy-then-get-tickled

This scenario shows why the operations in sleep must be seq-cst. We
want to ensure that once a worker gets sleepy, any other worker that
does a tickle will observe that. In other words, we want to ensure
that the following scenario **cannot happen**:

1. worker 1 is blocked on latch L
2. worker 1 becomes sleepy
   - becoming sleepy involves a CAS on the global state to set it to 4 ("worker 1 is sleepy")
3. worker 0 sets latch L
4. worker 0 tickles **but does not see that worker 1 is sleepy**

Let's diagram this. The notation `read_xxx(A) = V` means that a read
of location `A` was executed with the result `V`. The `xxx` is the
ordering and the location `A` is either `L` (latch) or `S` (global
state). I will leave the ordering on the latch as `xxx` as it is not
relevant here. The numbers correspond to the steps above.

```
 worker 0                       worker 1
    |                           +- 2: cas_sc(S, 4)
 s  |  3: write_xxx(L)          +
 b  |  4: read_sc(S) = ??? <-sc-+
    v
```

Clearly, this cannot happen with sc orderings, because read 4 will
always return `4` here. However, if we tried to use acquire-release
orderings on the global state, then there would be **no guarantee**
that the tickle will observe that a sleepy worker occurred. We would
be guaranteed only that worker 0 would **eventually** observe that
worker 1 had become sleepy (and, at that time, that it would see other
writes). But it could take time -- and if we indeed miss that worker 1
is sleepy, it could lead to deadlock or loss of efficiency, as
explained earlier.

### Scenario 2: tickle-then-get-sleepy

<a name="tickle-then-get-sleepy"></a>

This scenario shows why latch operations must *also* be seq-cst (and,
more generally, any operations that publish work before a tickle). We
wish to ensure that this ordering of events **cannot occur**:

1. worker 1 is blocked on latch L
2. worker 1 reads latch L, sees false, starts searching for work
3. worker 0 sets latch L
4. worker 0 tickles
   - the tickle reads from the global state, sees 0
5. worker 1 finishes searching, becomes sleepy
   - becoming sleepy involves a CAS on the global state to set it to 4 ("worker 1 is sleepy")
6. worker 1 reads latch L **but does not see that worker 0 set it**
7. worker 1 may then proceed to fall asleep

In other words, we want to ensure that if worker 0 sets a latch and
does a tickle *before worker 1 gets sleepy*, then worker 1 will
observe that latch as set when it calls probe. We'll see that, with
the current scheme, this implies that the latch memory orderings must
be seq-cst as well.

Here is the diagram:

```
 worker 0                          worker 1
    |                              2: read_xxx(L) = false
 s  |  3: write_xxx(L, true)
 b  |  4: read_sc(S) = 0 -+
    |                     +-sc---> 5: cas_sc(S, 4)
    v                              6: read_xxx(L) = ???
```

The diagram shows that each thread's actions are related by
*sequenced-before* (sb). Moreover the read and write of `S` are
related by `sc` (the seq-cst ordering). However, and this is crucial,
this **does not** imply that oper 4 *synchronizes-with* oper 5. This
is because a read never synchronizes-with a store, only the
reverse. Hence, if the latch were using acq-rel orderings, it would be
legal for oper 6 to return false. But if the latch were to use an
**sc** ordering itself, then we know that oper 6 must return true,
since `3 -sc-> 4 -sc-> 5 -sc-> 6`.
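
Scenario 2 is exactly the classic "store-buffering" litmus test. As a
standalone sketch (not code from the crate): with seq-cst operations,
at least one of the two loads must observe the other thread's store,
while with acquire-release both may miss.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static L: AtomicUsize = AtomicUsize::new(0); // the latch
static S: AtomicUsize = AtomicUsize::new(0); // the global sleep state

fn main() {
    let worker0 = thread::spawn(|| {
        L.store(1, Ordering::SeqCst); // 3: set latch L
        S.load(Ordering::SeqCst)      // 4: tickle reads the global state
    });
    let worker1 = thread::spawn(|| {
        S.store(1, Ordering::SeqCst); // 5: become sleepy (a store stands in for the CAS)
        L.load(Ordering::SeqCst)      // 6: re-check the latch
    });
    let r0 = worker0.join().unwrap();
    let r1 = worker1.join().unwrap();
    // With SeqCst there is a single total order over all four operations,
    // so `r0 == 0 && r1 == 0` is impossible: either the tickle sees the
    // sleepy worker, or the sleepy worker sees the latch. With
    // Acquire/Release orderings instead, both loads may legally return 0.
    assert!(r0 == 1 || r1 == 1);
}
```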

**Note** that this means that, before we tickle, we must execute some
seq-cst stores to publish our work (and during the scan we must load
from those same locations) **if we wish to guarantee that the work we
published WILL be seen by the other threads** (as opposed to
*may*). This is true for setting a latch -- if a latch is set but
another thread misses it, then the system could deadlock. However, in
the case of pushing new work to a deque, we choose not to use a seqcst
ordering. This is for several reasons:

- If we miss a wakeup, the consequences are less dire: we simply run
  less efficiently (after all, the current thread will eventually
  complete its current task and pop the new task off the deque).
- It is inconvenient: the deque code is beyond our control (it lies in another package). However,
  we could create a dummy `AtomicBool` for each deque and do a seqcst write to it
  (with whatever value) after we push to the deque, and a seqcst load whenever
  we steal from the deque (see the sketch after this list).
- The cost of using a dummy variable was found to be quite high for some benchmarks:
  - 8-10% overhead on nbody-parreduce
  - 15% overhead on increment-all
  - 40% overhead on join-recursively
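
That dummy-flag workaround would look something like the following
sketch (a hypothetical type; the `VecDeque` stands in for the real
work-stealing deque, which lives in another crate):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical wrapper pairing a deque with a dummy flag. The flag's
// value is irrelevant; only the seq-cst write/read pair matters, since
// it forces the push to participate in the single total order of
// seq-cst operations that the sleep protocol relies on.
struct DequeWithFence<T> {
    deque: VecDeque<T>, // stand-in for the real work-stealing deque
    fence: AtomicBool,
}

impl<T> DequeWithFence<T> {
    fn push(&mut self, job: T) {
        self.deque.push_back(job);
        self.fence.store(true, Ordering::SeqCst); // dummy seq-cst write
    }

    fn steal(&mut self) -> Option<T> {
        let _ = self.fence.load(Ordering::SeqCst); // dummy seq-cst read
        self.deque.pop_front()
    }
}
```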

### Alternative solutions

In both cases above, our problems arose because tickle is merely
performing a seq-cst read. If we instead had tickle perform a release
*swap*, that would be a write action on the global state. No matter
the ordering mode, all writes to the same memory location have a total
ordering, and hence we would not have to worry about others storing a
value that we fail to read (as in scenario 1). Similarly, as a release
write, a swap during tickle would synchronize-with a later cas and so
scenario 2 should be averted. So you might wonder why we don't do
that. The simple reason was that it didn't perform as well! In my
measurements, many benchmarks were unaffected by using a swap, but
some of them were hit hard:

- 8-10% overhead on nbody-parreduce
- 35% overhead on increment-all
- 245% overhead on join-recursively

# Deadlock detection

This module tracks a number of variables in order to detect deadlocks due to user code blocking.
These variables are stored in the `SleepData` struct which itself is kept behind a mutex.
It contains the following fields:

- `worker_count` - The number of threads in the thread pool.
- `active_threads` - The number of threads in the thread pool which are running
  and aren't blocked in user code or sleeping.
- `blocked_threads` - The number of threads which are blocked in user code.
  This doesn't include threads blocked by Rayon.

User code can indicate blocking by calling `mark_blocked` before blocking and
calling `mark_unblocked` before unblocking a thread.
This will adjust `active_threads` and `blocked_threads` accordingly.

When we tickle the thread pool in `Sleep::tickle_cold`, we set `active_threads` to
`worker_count - blocked_threads`, since we wake up all Rayon threads, but not threads blocked
by user code.
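
In sketch form (field names from the text; the method names are
illustrative):

```rust
struct SleepData {
    worker_count: usize,    // threads in the thread pool
    active_threads: usize,  // running; neither sleeping nor blocked in user code
    blocked_threads: usize, // blocked in user code (never by Rayon itself)
}

impl SleepData {
    // The deadlock condition described below: no thread can make
    // progress, and at least one thread is waiting on user code.
    fn is_deadlocked(&self) -> bool {
        self.active_threads == 0 && self.blocked_threads > 0
    }

    // What `Sleep::tickle_cold` does to the counters: all Rayon-slept
    // threads wake up, but user-blocked threads stay blocked.
    fn on_tickle_cold(&mut self) {
        self.active_threads = self.worker_count - self.blocked_threads;
    }
}
```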

A deadlock is detected by checking if `active_threads` is 0 and `blocked_threads` is above 0.
If we ignored `blocked_threads` we would have a deadlock
immediately when creating the thread pool.
We would also deadlock once the thread pool ran out of work.
It is not possible for Rayon itself to deadlock.
Deadlocks can only be caused by user code blocking, so this condition doesn't miss any deadlocks.

We check for the deadlock condition when a thread becomes blocked in `mark_blocked`
and when a thread falls asleep in `Sleep::sleep`.
If a deadlock is detected we call the user-provided deadlock handler while we hold the
lock on `SleepData`. This means the deadlock handler cannot call `mark_blocked` or
`mark_unblocked`. The user is expected to handle the deadlock in some non-Rayon thread.
Once the deadlock handler returns, the thread which called the deadlock handler will go to sleep.