]>
Commit | Line | Data |
---|---|---|
1 | <HTML> | |
2 | <HEAD> | |
3 | <TITLE>Garbage collector scalability</TITLE> | |
4 | </HEAD> | |
5 | <BODY> | |
6 | <H1>Garbage collector scalability</h1> | |
7 | In its default configuration, the Boehm-Demers-Weiser garbage collector | |
8 | is not thread-safe. It can be made thread-safe for a number of environments | |
9 | by building the collector with the appropriate | |
10 | <TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation | |
11 | flag. This has primarily two effects: | |
12 | <OL> | |
13 | <LI> It causes the garbage collector to stop all other threads when | |
14 | it needs to see a consistent memory state. | |
15 | <LI> It causes the collector to acquire a lock around essentially all | |
16 | allocation and garbage collection activity. | |
17 | </ol> | |
18 | Since a single lock is used for all allocation-related activity, only one | |
19 | thread can be allocating or collecting at one point. This inherently | |
20 | limits performance of multi-threaded applications on multiprocessors. | |
21 | <P> | |
22 | On most platforms, the allocator/collector lock is implemented as a | |
23 | spin lock with exponential back-off. Longer wait times are implemented | |
24 | by yielding and/or sleeping. If a collection is in progress, the pure | |
25 | spinning stage is skipped. This has the advantage that uncontested and | |
26 | thus most uniprocessor lock acquisitions are very cheap. It has the | |
27 | disadvantage that the application may sleep for small periods of time | |
28 | even when there is work to be done. And threads may be unnecessarily | |
29 | woken up for short periods. Nonetheless, this scheme empirically | |
30 | outperforms native queue-based mutual exclusion implementations in most | |
31 | cases, sometimes drastically so. | |
32 | <H2>Options for enhanced scalability</h2> | |
33 | Version 6.0 of the collector adds two facilities to enhance collector | |
34 | scalability on multiprocessors. As of 6.0alpha1, these are supported | |
35 | only under Linux on X86 and IA64 processors, though ports to other | |
36 | otherwise supported Pthreads platforms should be straightforward. | |
37 | They are intended to be used together. | |
38 | <UL> | |
39 | <LI> | |
40 | Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to | |
41 | run the mark phase in parallel in multiple threads, and thus on multiple | |
42 | processors. The mark phase typically consumes the large majority of the | |
43 | collection time. Thus this largely parallelizes the garbage collector | |
44 | itself, though not the allocation process. Currently the marking is | |
45 | performed by the thread that triggered the collection, together with | |
46 | <I>N</i>-1 dedicated | |
47 | threads, where <I>N</i> is the number of processors detected by the collector. | |
48 | The dedicated threads are created once at initialization time. | |
49 | <P> | |
50 | A second effect of this flag is to switch to a more concurrent | |
51 | implementation of <TT>GC_malloc_many</tt>, so that free lists can be | |
52 | built, and memory can be cleared, by more than one thread concurrently. | |
53 | <LI> | |
54 | Building the collector with -DTHREAD_LOCAL_ALLOC adds support for thread | |
55 | local allocation. It does not, by itself, cause thread local allocation | |
56 | to be used. It simply allows the use of the interface in | |
57 | <TT>gc_local_alloc.h</tt>. | |
58 | <P> | |
59 | Memory returned from thread-local allocators is completely interchangeable | |
60 | with that returned by the standard allocators. It may be used by other | |
61 | threads. The only difference is that, if the thread allocates enough | |
62 | memory of a certain kind, it will build a thread-local free list for | |
63 | objects of that kind, and allocate from that. This greatly reduces | |
64 | locking. The thread-local free lists are refilled using | |
65 | <TT>GC_malloc_many</tt>. | |
66 | <P> | |
67 | An important side effect of this flag is to replace the default | |
68 | spin-then-sleep lock to be replace by a spin-then-queue based implementation. | |
69 | This <I>reduces performance</i> for the standard allocation functions, | |
70 | though it usually improves performance when thread-local allocation is | |
71 | used heavily, and thus the number of short-duration lock acquisitions | |
72 | is greatly reduced. | |
73 | </ul> | |
74 | <P> | |
75 | The easiest way to switch an application to thread-local allocation is to | |
76 | <OL> | |
77 | <LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>, | |
78 | and then include the <TT>gc.h</tt> | |
79 | header in each client source file. | |
80 | <LI> Invoke <TT>GC_thr_init()</tt> before any allocation. | |
81 | <LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>, | |
82 | and/or <TT>GC_GCJ_MALLOC</tt>. | |
83 | </ol> | |
84 | <H2>The Parallel Marking Algorithm</h2> | |
85 | We use an algorithm similar to | |
86 | <A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by | |
87 | Endo, Taura, and Yonezawa</a> at the University of Tokyo. | |
88 | However, the data structures and implementation are different, | |
89 | and represent a smaller change to the original collector source, | |
90 | probably at the expense of extreme scalability. Some of | |
91 | the refinements they suggest, <I>e.g.</i> splitting large | |
92 | objects, were also incorporated into out approach. | |
93 | <P> | |
94 | The global mark stack is transformed into a global work queue. | |
95 | Unlike the usual case, it never shrinks during a mark phase. | |
96 | The mark threads remove objects from the queue by copying them to a | |
97 | local mark stack and changing the global descriptor to zero, indicating | |
98 | that there is no more work to be done for this entry. | |
99 | This removal | |
100 | is done with no synchronization. Thus it is possible for more than | |
101 | one worker to remove the same entry, resulting in some work duplication. | |
102 | <P> | |
103 | The global work queue grows only if a marker thread decides to | |
104 | return some of its local mark stack to the global one. This | |
105 | is done if the global queue appears to be running low, or if | |
106 | the local stack is in danger of overflowing. It does require | |
107 | synchronization, but should be relatively rare. | |
108 | <P> | |
109 | The sequential marking code is reused to process local mark stacks. | |
110 | Hence the amount of additional code required for parallel marking | |
111 | is minimal. | |
112 | <P> | |
113 | It should be possible to use generational collection in the presence of the | |
114 | parallel collector, by calling <TT>GC_enable_incremental()</tt>. | |
115 | This does not result in fully incremental collection, since parallel mark | |
116 | phases cannot currently be interrupted, and doing so may be too | |
117 | expensive. | |
118 | <P> | |
119 | Gcj-style mark descriptors do not currently mix with the combination | |
120 | of local allocation and incremental collection. They should work correctly | |
121 | with one or the other, but not both. | |
122 | <P> | |
123 | The number of marker threads is set on startup to the number of | |
124 | available processors (or to the value of the <TT>GC_NPROCS</tt> | |
125 | environment variable). If only a single processor is detected, | |
126 | parallel marking is disabled. | |
127 | <P> | |
128 | Note that setting GC_NPROCS to 1 also causes some lock acquisitions inside | |
129 | the collector to immediately yield the processor instead of busy waiting | |
130 | first. In the case of a multiprocessor and a client with multiple | |
131 | simultaneously runnable threads, this may have disastrous performance | |
132 | consequences (e.g. a factor of 10 slowdown). | |
133 | <H2>Performance</h2> | |
134 | We conducted some simple experiments with a version of | |
135 | <A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to | |
136 | run multiple concurrent client threads in the same address space. | |
137 | Each client thread does the same work as the original benchmark, but they share | |
138 | a heap. | |
139 | This benchmark involves very little work outside of memory allocation. | |
140 | This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine | |
141 | under Linux 2.2.12. | |
142 | <P> | |
143 | Running with a thread-unsafe collector, the benchmark ran in 9 | |
144 | seconds. With the simple thread-safe collector, | |
145 | built with <TT>-DLINUX_THREADS</tt>, the execution time | |
146 | increased to 10.3 seconds, or 23.5 elapsed seconds with two clients. | |
147 | (The times for the <TT>malloc</tt>/i<TT>free</tt> version | |
148 | with glibc <TT>malloc</tt> | |
149 | are 10.51 (standard library, pthreads not linked), | |
150 | 20.90 (one thread, pthreads linked), | |
151 | and 24.55 seconds respectively. The benchmark favors a | |
152 | garbage collector, since most objects are small.) | |
153 | <P> | |
154 | The following table gives execution times for the collector built | |
155 | with parallel marking and thread-local allocation support | |
156 | (<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested | |
157 | the client using either one or two marker threads, and running | |
158 | one or two client threads. Note that the client uses thread local | |
159 | allocation exclusively. With -DTHREAD_LOCAL_ALLOC the collector | |
160 | switches to a locking strategy that is better tuned to less frequent | |
161 | lock acquisition. The standard allocation primitives thus peform | |
162 | slightly worse than without -DTHREAD_LOCAL_ALLOC, and should be | |
163 | avoided in time-critical code. | |
164 | <P> | |
165 | (The results using <TT>pthread_mutex_lock</tt> | |
166 | directly for allocation locking would have been worse still, at | |
167 | least for older versions of linuxthreads. | |
168 | With THREAD_LOCAL_ALLOC, we first repeatedly try to acquire the | |
169 | lock with pthread_mutex_try_lock(), busy_waiting between attempts. | |
170 | After a fixed number of attempts, we use pthread_mutex_lock().) | |
171 | <P> | |
172 | These measurements do not use incremental collection, nor was prefetching | |
173 | enabled in the marker. We used the C version of the benchmark. | |
174 | All measurements are in elapsed seconds on an unloaded machine. | |
175 | <P> | |
176 | <TABLE BORDER ALIGN="CENTER"> | |
177 | <TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th> | |
178 | <TH>2 marker threads (secs.)</th></tr> | |
179 | <TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td> | |
180 | <TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td> | |
181 | </table> | |
182 | <PP> | |
183 | The execution time for the single threaded case is slightly worse than with | |
184 | simple locking. However, even the single-threaded benchmark runs faster than | |
185 | even the thread-unsafe version if a second processor is available. | |
186 | The execution time for two clients with thread local allocation time is | |
187 | only 1.4 times the sequential execution time for a single thread in a | |
188 | thread-unsafe environment, even though it involves twice the client work. | |
189 | That represents close to a | |
190 | factor of 2 improvement over the 2 client case with the old collector. | |
191 | The old collector clearly | |
192 | still suffered from some contention overhead, in spite of the fact that the | |
193 | locking scheme had been fairly well tuned. | |
194 | <P> | |
195 | Full linear speedup (i.e. the same execution time for 1 client on one | |
196 | processor as 2 clients on 2 processors) | |
197 | is probably not achievable on this kind of | |
198 | hardware even with such a small number of processors, | |
199 | since the memory system is | |
200 | a major constraint for the garbage collector, | |
201 | the processors usually share a single memory bus, and thus | |
202 | the aggregate memory bandwidth does not increase in | |
203 | proportion to the number of processors. | |
204 | <P> | |
205 | These results are likely to be very sensitive to both hardware and OS | |
206 | issues. Preliminary experiments with an older Pentium Pro machine running | |
207 | an older kernel were far less encouraging. | |
208 | ||
209 | </body> | |
210 | </html> |