<HTML>
<HEAD>
<TITLE>Garbage collector scalability</TITLE>
</HEAD>
<BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe. It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation
flag. This has primarily two effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any one time. This inherently
limits performance of multi-threaded applications on multiprocessors.
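<P>
For concreteness, the following is a minimal sketch of a client that allocates
from two threads against a thread-safe collector build. It assumes a
Linux/Pthreads build of the collector and that defining
<TT>GC_LINUX_THREADS</tt> before including <TT>gc.h</tt> redirects thread
creation to the collector's wrappers; the exact macro and build details vary
by platform.
<PRE>
/* Hedged sketch: a client of a thread-safe collector build on Linux.      */
/* Assumes the collector library was built with -DGC_LINUX_THREADS and     */
/* that gc.h redirects pthread_create() to the collector's wrapper when    */
/* GC_LINUX_THREADS is defined before inclusion.                           */
#define GC_LINUX_THREADS
#include &lt;pthread.h&gt;
#include "gc.h"

static void *worker(void *arg)
{
    int i;
    for (i = 0; i &lt; 100000; ++i) {
        /* All allocation goes through the single allocator lock. */
        void *p = GC_MALLOC(64);
        ((char *)p)[0] = 1;          /* touch the object */
    }
    return 0;
}

int main(void)
{
    pthread_t t1, t2;
    GC_INIT();
    pthread_create(&amp;t1, 0, worker, 0);
    pthread_create(&amp;t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
</PRE>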
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off. Longer wait times are implemented
by yielding and/or sleeping. If a collection is in progress, the pure
spinning stage is skipped. This has the advantage that uncontested
(and thus most uniprocessor) lock acquisitions are very cheap. It has the
disadvantage that the application may sleep for small periods of time
even when there is work to be done. And threads may be unnecessarily
woken up for short periods. Nonetheless, this scheme empirically
outperforms native queue-based mutual exclusion implementations in most
cases, sometimes drastically so.
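<P>
In outline, that acquisition path looks something like the following. This is
an illustrative sketch only, not the collector's actual code; the spin bound
and the <TT>sched_yield</tt>/<TT>usleep</tt> fallbacks are assumptions.
<PRE>
/* Illustrative sketch only: spin with exponential back-off, then yield,   */
/* then sleep for longer waits.  The real collector's lock is more         */
/* elaborate and skips the spinning stage during a collection.             */
#include &lt;sched.h&gt;
#include &lt;unistd.h&gt;

extern int try_lock(void);   /* hypothetical: 1 on success, 0 otherwise */

void acquire_lock(void)
{
    volatile int i;
    int spin;
    for (spin = 1; spin &lt;= 1024; spin *= 2) {
        if (try_lock()) return;
        for (i = 0; i &lt; spin; ++i) { /* busy wait, growing each round */ }
    }
    for (;;) {                        /* longer waits: yield, then sleep */
        if (try_lock()) return;
        sched_yield();
        if (try_lock()) return;
        usleep(1000);
    }
}
</PRE>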
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors. As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other,
otherwise supported, Pthreads platforms should be straightforward.
They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors. The mark phase typically consumes the large majority of the
collection time. Thus this largely parallelizes the garbage collector
itself, though not the allocation process. Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated
threads, where <I>N</i> is the number of processors detected by the collector.
The dedicated threads are created once at initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently
(see the sketch following this list).
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for thread
local allocation. It does not, by itself, cause thread local allocation
to be used. It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators. It may be used by other
threads. The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that. This greatly reduces
locking. The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>.
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock by a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, and thus the number of short-duration lock acquisitions
is greatly reduced.
</ul>
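<P>
As an illustration of <TT>GC_malloc_many</tt>, the sketch below refills a
private free list with a single call and then hands out objects from it
without further locking. <TT>GC_malloc_many</tt> and the <TT>GC_NEXT</tt>
link macro are declared in <TT>gc.h</tt>; the surrounding cache structure is
hypothetical.
<PRE>
/* Sketch of refilling a private free list with one locked call and then   */
/* allocating from it without further locking.  Objects returned by        */
/* GC_malloc_many() are linked through their first word; GC_NEXT() reads   */
/* that link.  The cache structure here is purely illustrative.            */
#include "gc.h"

struct obj_cache {
    void *free_list;          /* objects linked through their first word */
};

void *cached_alloc(struct obj_cache *c, size_t object_size)
{
    void *p = c->free_list;
    if (p == 0) {
        /* One call, one lock acquisition, many objects returned. */
        c->free_list = GC_malloc_many(object_size);
        p = c->free_list;
        if (p == 0) return 0;         /* out of memory */
    }
    c->free_list = GC_NEXT(p);        /* unlink the first object */
    GC_NEXT(p) = 0;                   /* clear the link word before use */
    return p;
}
</PRE>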
<P>
The easiest way to switch an application to thread-local allocation is to
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
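<P>
A minimal sketch of those three steps, assuming a collector built with
<TT>-DTHREAD_LOCAL_ALLOC</tt> and thread support as described above (the data
structure and function names are illustrative):
<PRE>
/* Sketch of the three steps above.  GC_REDIRECT_TO_LOCAL must be defined  */
/* before gc.h is included so that GC_MALLOC and friends map onto the      */
/* thread-local allocators from gc_local_alloc.h.                          */
#define GC_REDIRECT_TO_LOCAL
#include "gc.h"

struct node { struct node *next; int value; };

struct node *make_node(int value)
{
    /* With the redirection in place, this may allocate from a             */
    /* thread-local free list, avoiding the global allocator lock.         */
    struct node *n = GC_MALLOC(sizeof(struct node));
    n->next = 0;
    n->value = value;
    return n;
}

int main(void)
{
    GC_thr_init();                 /* step 2: before any allocation */
    make_node(42);
    return 0;
}
</PRE>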
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability. Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry.
This removal
is done with no synchronization. Thus it is possible for more than
one worker to remove the same entry, resulting in some work duplication.
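<P>
In outline, a marker's removal step might look like the following. This is an
illustrative simplification of the scheme just described, not the collector's
actual data structures.
<PRE>
/* Illustrative simplification of removing entries from the global mark    */
/* work queue.  A zero descriptor marks an entry as already claimed.       */
/* More than one marker may copy the same entry before the descriptor is   */
/* cleared, which merely causes some duplicated marking work.              */
struct mark_entry {
    void *base;                 /* object (or range) to be scanned */
    unsigned long descriptor;   /* 0 means "nothing left to do here" */
};

extern struct mark_entry global_queue[];
extern int global_queue_size;

int steal_work(struct mark_entry *local_stack, int local_top)
{
    int i;
    for (i = 0; i &lt; global_queue_size; ++i) {
        unsigned long d = global_queue[i].descriptor;
        if (d != 0) {
            /* Copy first, then clear; no lock is taken, so a race can     */
            /* hand the same entry to two markers.                         */
            local_stack[local_top].base = global_queue[i].base;
            local_stack[local_top].descriptor = d;
            ++local_top;
            global_queue[i].descriptor = 0;
        }
    }
    return local_top;
}
</PRE>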
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one. This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing. It does require
synchronization, but should be relatively rare.
<P>
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
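<P>
For example (a sketch; <TT>GC_enable_incremental()</tt> and <TT>GC_INIT()</tt>
are declared in <TT>gc.h</tt>):
<PRE>
/* Sketch: enable generational/incremental collection alongside the        */
/* parallel marker.                                                         */
#include "gc.h"

int main(void)
{
    GC_INIT();
    GC_enable_incremental();   /* generational collection; parallel mark   */
                               /* phases still run to completion           */
    /* ... application code that allocates with GC_MALLOC() ... */
    return 0;
}
</PRE>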
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection. They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable). If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
the collector to immediately yield the processor instead of busy waiting
first. In the case of a multiprocessor and a client with multiple
simultaneously runnable threads, this may have disastrous performance
consequences (e.g. a factor of 10 slowdown).
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they share
a heap.
This benchmark involves very little work outside of memory allocation.
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds. With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds respectively. The benchmark favors a
garbage collector, since most objects are small.)
<P>
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
the client using either one or two marker threads, and running
one or two client threads. Note that the client uses thread-local
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
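<P>
In outline (an illustrative sketch of that scheme; the retry bound is an
assumption):
<PRE>
/* Sketch of the THREAD_LOCAL_ALLOC locking scheme described above:        */
/* spin on pthread_mutex_trylock() a bounded number of times, then fall    */
/* back to a blocking (queued) pthread_mutex_lock().                       */
#include &lt;pthread.h&gt;

#define MAX_TRIES 10   /* illustrative; the real bound is tuned */

void lock_allocator(pthread_mutex_t *ml)
{
    int i;
    for (i = 0; i &lt; MAX_TRIES; ++i) {
        if (pthread_mutex_trylock(ml) == 0) return;   /* got it cheaply */
        /* busy-wait briefly between attempts */
    }
    pthread_mutex_lock(ml);    /* queue up and block */
}
</PRE>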
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker. We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2 client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that the
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e. the same execution time for 1 client on one
processor as 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector,
the processors usually share a single memory bus, and thus
the aggregate memory bandwidth does not increase in
proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues. Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.

</body>
</html>