<HTML>
<HEAD>
<TITLE> Conservative GC Algorithmic Overview </TITLE>
<AUTHOR> Hans-J. Boehm, HP Labs (Some of this was written at SGI)</author>
</HEAD>
<BODY>
<H1> <I>This is under construction, and may always be.</i> </h1>
<H1> Conservative GC Algorithmic Overview </h1>
<P>
This is a description of the algorithms and data structures used in our
conservative garbage collector. I expect the level of detail to increase
with time. For a survey of GC algorithms, see for example
<A HREF="ftp://ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps"> Paul Wilson's
excellent paper</a>. For an overview of the collector interface,
see <A HREF="gcinterface.html">here</a>.
<P>
This description is targeted primarily at someone trying to understand the
source code. It specifically refers to variable and function names.
It may also be useful for understanding the algorithms at a higher level.
<P>
The description here assumes that the collector is used in default mode.
In particular, we assume that it is used as a garbage collector, and not just
a leak detector. We initially assume that it is used in stop-the-world,
non-incremental mode, though the presence of the incremental collector
will be apparent in the design.
We assume the default finalization model, but the code affected by that
is very localized.
<H2> Introduction </h2>
The garbage collector uses a modified mark-sweep algorithm. Conceptually
it operates roughly in four phases, which are performed occasionally
as part of a memory allocation:

<OL>

<LI>
<I>Preparation</i> Each object has an associated mark bit.
Clear all mark bits, indicating that all objects
are potentially unreachable.

<LI>
<I>Mark phase</i> Marks all objects that may be reachable via chains of
pointers from variables. Often the collector has no real information
about the location of pointer variables in the heap, so it
views all static data areas, stacks and registers as potentially containing
pointers. Any bit patterns that represent addresses inside
heap objects managed by the collector are viewed as pointers.
Unless the client program has made heap object layout information
available to the collector, any heap objects found to be reachable from
variables are again scanned similarly.

<LI>
<I>Sweep phase</i> Scans the heap for inaccessible, and hence unmarked,
objects, and returns them to an appropriate free list for reuse. This is
not really a separate phase; even in non-incremental mode this operation
is usually performed on demand during an allocation that discovers an empty
free list. Thus the sweep phase is very unlikely to touch a page that
would not have been touched shortly thereafter anyway.

<LI>
<I>Finalization phase</i> Unreachable objects which had been registered
for finalization are enqueued for finalization outside the collector.

</ol>

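<P>
As a concrete (and purely illustrative) rendering of these four phases,
the following self-contained C toy runs one mark-sweep cycle over a tiny,
hard-wired object graph. None of the names below are the collector's own;
the real collector works on heap blocks and mark bits as described in the
following sections.
<PRE>
#include &lt;stdio.h&gt;

#define NOBJS 4

static int marked[NOBJS];            /* one conceptual mark bit per object      */
static int edges[NOBJS][NOBJS];      /* edges[i][j] != 0: object i points to j  */

static void mark(int i)              /* mark i and everything reachable from it */
{
    int j;
    if (marked[i]) return;
    marked[i] = 1;
    for (j = 0; j != NOBJS; ++j)
        if (edges[i][j]) mark(j);
}

int main(void)
{
    int i;
    int root = 0;                    /* pretend a variable points at object 0    */
    edges[0][1] = 1;                 /* object 0 points at 1; 2 and 3 are garbage */

    for (i = 0; i != NOBJS; ++i)     /* preparation: clear all mark bits          */
        marked[i] = 0;
    mark(root);                      /* mark phase: trace from the "roots"        */
    for (i = 0; i != NOBJS; ++i)     /* sweep phase: reclaim unmarked objects     */
        if (!marked[i]) printf("object %d is unreachable\n", i);
    /* a real cycle would also enqueue unreachable objects that had been  */
    /* registered for finalization                                        */
    return 0;
}
</PRE>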
<P>
The remaining sections describe the memory allocation data structures,
and then the last 3 collection phases in more detail. We conclude by
outlining some of the additional features implemented in the collector.

<H2>Allocation</h2>
The collector includes its own memory allocator. The allocator obtains
memory from the system in a platform-dependent way. Under UNIX, it
uses either <TT>malloc</tt>, <TT>sbrk</tt>, or <TT>mmap</tt>.
<P>
Most static data used by the allocator, as well as that needed by the
rest of the garbage collector, is stored inside the
<TT>_GC_arrays</tt> structure.
This allows the garbage collector to easily ignore the collector's own
data structures when it searches for root pointers. Other allocator
and collector internal data structures are allocated dynamically
with <TT>GC_scratch_alloc</tt>. <TT>GC_scratch_alloc</tt> does not
allow for deallocation, and is therefore used only for permanent data
structures.
<P>
The allocator allocates objects of different <I>kinds</i>.
Different kinds are handled somewhat differently by certain parts
of the garbage collector. Certain kinds are scanned for pointers,
others are not. Some may have per-object type descriptors that
determine pointer locations. Or a specific kind may correspond
to one specific object layout. Two built-in kinds are uncollectable.
One (<TT>STUBBORN</tt>) is immutable without special precautions.
In spite of that, it is very likely that most C clients of the
collector currently
use at most two kinds: <TT>NORMAL</tt> and <TT>PTRFREE</tt> objects.
The <A HREF="http://gcc.gnu.org/java">gcj</a> runtime also makes
heavy use of a kind (allocated with <TT>GC_gcj_malloc</tt>) that stores
type information at a known offset in method tables.
<P>
The collector uses a two-level allocator. A large block is defined to
be one larger than half of <TT>HBLKSIZE</tt>, which is a power of 2,
typically on the order of the page size.
<P>
Large block sizes are rounded up to
the next multiple of <TT>HBLKSIZE</tt> and then allocated by
<TT>GC_allochblk</tt>. Recent versions of the collector
use an approximate best fit algorithm by keeping free lists for
several large block sizes.
The actual
implementation of <TT>GC_allochblk</tt>
is significantly complicated by black-listing issues
(see below).
<P>
Small blocks are allocated in chunks of size <TT>HBLKSIZE</tt>.
Each chunk is
dedicated to only one object size and kind. The allocator maintains
separate free lists for each size and kind of object.
<P>
Once a large block is split for use in smaller objects, it can only
be used for objects of that size, unless the collector discovers a completely
empty chunk. Completely empty chunks are restored to the appropriate
large block free list.
<P>
In order to avoid allocating blocks for too many distinct object sizes,
the collector normally does not directly allocate objects of every possible
request size. Instead requests are rounded up to one of a smaller number
of allocated sizes, for which free lists are maintained. The exact
allocated sizes are computed on demand, but subject to the constraint
that they increase roughly in geometric progression. Thus objects
requested early in the execution are likely to be allocated with exactly
the requested size, subject to alignment constraints.
See <TT>GC_init_size_map</tt> for details.
<P>
The actual size rounding operation during small object allocation is
implemented as a table lookup in <TT>GC_size_map</tt>.
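<P>
The following sketch shows only the table-lookup idea behind this rounding;
the granule size, bound, and table contents are invented here, while the real
table is <TT>GC_size_map</tt>, filled in lazily by <TT>GC_init_size_map</tt>:
<PRE>
#define MAX_SMALL_BYTES 512                 /* invented bound for this sketch      */

/* size_map[b] = allocated size (in bytes) used for a request of b bytes.          */
static unsigned size_map[MAX_SMALL_BYTES + 1];

static void init_size_map(void)
{
    unsigned b, sz = 16;                    /* smallest allocated size (invented)  */
    for (b = 0; b != MAX_SMALL_BYTES + 1; ++b) {
        while (b > sz)                      /* allocated sizes grow roughly        */
            sz += (sz / 4 > 16 ? sz / 4 : 16);  /* geometrically                   */
        size_map[b] = sz;
    }
}

static unsigned rounded_size(unsigned bytes)
{
    return size_map[bytes];                 /* the fast path is one table lookup   */
}
</PRE>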
<P>
Both collector initialization and computation of allocated sizes are
handled carefully so that they do not slow down the small object fast
allocation path. An attempt to allocate before the collector is initialized,
or before the appropriate <TT>GC_size_map</tt> entry is computed,
will take the same path as an allocation attempt with an empty free list.
This results in a call to the slow path code (<TT>GC_generic_malloc_inner</tt>),
which performs the appropriate initialization checks.
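<P>
A sketch of this fast-path/slow-path split follows. Everything here except
the mention of <TT>GC_generic_malloc_inner</tt> is invented for illustration;
the point is only that an uninitialized collector or an uncomputed size map
entry looks exactly like an empty free list:
<PRE>
typedef void *ptr_t;

static ptr_t small_free_lists[512];         /* one list head per rounded size       */

extern ptr_t slow_path_alloc(unsigned sz);  /* stands in for GC_generic_malloc_inner */

static ptr_t fast_alloc(unsigned sz)        /* sz is an already-rounded size index   */
{
    ptr_t result = small_free_lists[sz];
    if (result == 0)                        /* empty list, or collector not yet      */
        return slow_path_alloc(sz);         /* initialized: do init checks, sweep,   */
                                            /* possibly collect, then allocate       */
    small_free_lists[sz] = *(ptr_t *)result;  /* lists are linked through first words */
    return result;
}
</PRE>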
<P>
In non-incremental mode, we make a decision about whether to garbage collect
whenever an allocation would otherwise have failed with the current heap size.
If the total amount of allocation since the last collection is less than
the heap size divided by <TT>GC_free_space_divisor</tt>, we try to
expand the heap. Otherwise, we initiate a garbage collection. This ensures
that the amount of garbage collection work per allocated byte remains
constant.
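<P>
In code, the simplified rule reads roughly as below; only
<TT>GC_free_space_divisor</tt> is a real collector variable, and the
refinements listed next are ignored:
<PRE>
static unsigned long GC_free_space_divisor = 3;   /* illustrative value             */

static unsigned long heap_size_bytes;             /* invented names for this sketch */
static unsigned long bytes_allocd_since_gc;

static int should_collect_rather_than_expand(void)
{
    /* Collect once a fraction 1/GC_free_space_divisor of the heap has been */
    /* allocated since the last collection; otherwise try to grow the heap. */
    return bytes_allocd_since_gc * GC_free_space_divisor >= heap_size_bytes;
}
</PRE>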
<P>
The above is in fact an oversimplification of the real heap expansion
and GC triggering heuristic, which adjusts slightly for root size
and certain kinds of
fragmentation. In particular:
<UL>
<LI> Programs with a large root set size and
little live heap memory will expand the heap to amortize the cost of
scanning the roots.
<LI> Versions 5.x of the collector actually collect more frequently in
non-incremental mode. The large block allocator usually refuses to split
large heap blocks once the garbage collection threshold is
reached. This often has the effect of collecting well before the
heap fills up, thus reducing fragmentation and working set size at the
expense of GC time. Versions 6.x choose an intermediate strategy depending
on how much large object allocation has taken place in the past.
(If the collector is configured to unmap unused pages, versions 6.x
use the 5.x strategy.)
<LI> In calculating the amount of allocation since the last collection we
give partial credit for objects we expect to be explicitly deallocated.
Even if all objects are explicitly managed, it is often desirable to collect
on rare occasions, since that is our only mechanism for coalescing completely
empty chunks.
</ul>
<P>
It has been suggested that this should be adjusted so that we favor
expansion if the resulting heap still fits into physical memory.
In many cases, that would no doubt help. But it is tricky to do this
in a way that remains robust if multiple applications are contending
for a single pool of physical memory.

<H2>Mark phase</h2>

At each collection, the collector marks all objects that are
possibly reachable from pointer variables. Since it cannot generally
tell where pointer variables are located, it scans the following
<I>root segments</i> for pointers:
<UL>
<LI>The registers. Depending on the architecture, this may be done using
assembly code, or by calling a <TT>setjmp</tt>-like function which saves
register contents on the stack.
<LI>The stack(s). In the case of a single-threaded application,
on most platforms this
is done by scanning the memory between (an approximation of) the current
stack pointer and <TT>GC_stackbottom</tt>. (For Itanium, the register stack
is scanned separately.) The <TT>GC_stackbottom</tt> variable is set in
a highly platform-specific way depending on the appropriate configuration
information in <TT>gcconfig.h</tt>. Note that the currently active
stack needs to be scanned carefully, since callee-save registers of
client code may appear inside collector stack frames, which may
change during the mark process. This is addressed by scanning
some sections of the stack "eagerly", effectively capturing a snapshot
at one point in time.
<LI>Static data region(s). In the simplest case, this is the region
between <TT>DATASTART</tt> and <TT>DATAEND</tt>, as defined in
<TT>gcconfig.h</tt>. However, in most cases, this will also involve
static data regions associated with dynamic libraries. These are
identified by the mostly platform-specific code in <TT>dyn_load.c</tt>.
</ul>
The marker maintains an explicit stack of memory regions that are known
to be accessible, but that have not yet been searched for contained pointers.
Each stack entry contains the starting address of the block to be scanned,
as well as a descriptor of the block. If no layout information is
available for the block, then the descriptor is simply a length.
(For other possibilities, see <TT>gc_mark.h</tt>.)
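<P>
Conceptually, a mark stack entry looks like the structure below; the real
definition lives in <TT>gc_mark.h</tt> and the field names here are
illustrative only:
<PRE>
typedef unsigned long word;

struct mark_stack_entry {
    char *start;        /* first address of the region still to be scanned     */
    word  descriptor;   /* simplest case: the length of the region in bytes;   */
                        /* other encodings describe object layout (gc_mark.h)  */
};
</PRE>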
<P>
At the beginning of the mark phase, all root segments
(as described above) are pushed on the
stack by <TT>GC_push_roots</tt>. (Registers and eagerly processed
stack sections are processed by pushing the referenced objects instead
of the stack section itself.) If <TT>ALL_INTERIOR_PTRS</tt> is not
defined, then stack roots require special treatment. In this case, the
normal marking code ignores interior pointers, but <TT>GC_push_all_stack</tt>
explicitly checks for interior pointers and pushes descriptors for target
objects.
<P>
The marker is structured to allow incremental marking.
Each call to <TT>GC_mark_some</tt> performs a small amount of
work towards marking the heap.
It maintains
explicit state in the form of <TT>GC_mark_state</tt>, which
identifies a particular sub-phase. Some other pieces of state, most
notably the mark stack, identify how much work remains to be done
in each sub-phase. The normal progression of mark states for
a stop-the-world collection is as follows (a schematic code sketch
appears after the list):
<OL>
<LI> <TT>MS_INVALID</tt> indicating that there may be accessible unmarked
objects. In this case <TT>GC_objects_are_marked</tt> will simultaneously
be false, so the mark state is advanced to
<LI> <TT>MS_PUSH_UNCOLLECTABLE</tt> indicating that it suffices to push
uncollectable objects, roots, and then mark everything reachable from them.
<TT>Scan_ptr</tt> is advanced through the heap until all uncollectable
objects are pushed, and objects reachable from them are marked.
At that point, the next call to <TT>GC_mark_some</tt> calls
<TT>GC_push_roots</tt> to push the roots. It then advances the
mark state to
<LI> <TT>MS_ROOTS_PUSHED</tt> asserting that once the mark stack is
empty, all reachable objects are marked. Once in this state, we work
only on emptying the mark stack. Once this is completed, the state
changes to
<LI> <TT>MS_NONE</tt> indicating that all reachable objects are marked.
</ol>
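<P>
The following schematic shows how one call to something like
<TT>GC_mark_some</tt> advances this state machine. The state names match
the list above, but the helper functions are invented and the real control
flow (including marking from uncollectable objects as they are pushed) is
considerably more involved:
<PRE>
enum mark_state { MS_NONE, MS_PUSH_UNCOLLECTABLE, MS_ROOTS_PUSHED, MS_INVALID };

static enum mark_state GC_mark_state = MS_NONE;

extern int  push_next_uncollectable_block(void);  /* returns 0 when done (invented) */
extern void push_all_roots(void);                 /* stands in for GC_push_roots    */
extern int  mark_stack_is_empty(void);
extern void drain_some_of_mark_stack(void);       /* a bounded call to GC_mark_from */

static void mark_some_sketch(void)                /* one small increment of marking */
{
    switch (GC_mark_state) {
    case MS_INVALID:                     /* accessible unmarked objects may exist   */
        GC_mark_state = MS_PUSH_UNCOLLECTABLE;
        break;
    case MS_PUSH_UNCOLLECTABLE:          /* push uncollectable objects, then roots  */
        if (!push_next_uncollectable_block()) {
            push_all_roots();
            GC_mark_state = MS_ROOTS_PUSHED;
        }
        break;
    case MS_ROOTS_PUSHED:                /* now just empty the mark stack           */
        drain_some_of_mark_stack();
        if (mark_stack_is_empty()) GC_mark_state = MS_NONE;
        break;
    case MS_NONE:                        /* all reachable objects are marked        */
        break;
    }
}
</PRE>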

The core mark routine, <TT>GC_mark_from</tt>, is called
repeatedly by several of the sub-phases when the mark stack starts to fill
up. It is also called repeatedly in the <TT>MS_ROOTS_PUSHED</tt> state
to empty the mark stack.
The routine is designed to perform only a limited amount of marking at
each call, so that it can also be used by the incremental collector.
It is fairly carefully tuned, since it usually consumes a large majority
of the garbage collection time.
<P>
The fact that it performs only a small amount of work per call also
allows it to be used as the core routine of the parallel marker. In that
case it is normally invoked on thread-private mark stacks instead of the
global mark stack. More details can be found in
<A HREF="scale.html">scale.html</a>.
<P>
The marker correctly handles mark stack overflows. Whenever the mark stack
overflows, the mark state is reset to <TT>MS_INVALID</tt>.
Since there are already marked objects in the heap,
this eventually forces a complete
scan of the heap, searching for pointers, during which any unmarked objects
referenced by marked objects are again pushed on the mark stack. This
process is repeated until the mark phase completes without a stack overflow.
Each time the stack overflows, an attempt is made to grow the mark stack.
All pieces of the collector that push regions onto the mark stack have to be
careful to ensure forward progress, even in case of repeated mark stack
overflows. Every mark attempt results in additional marked objects.
<P>
Each mark stack entry is processed by examining all candidate pointers
in the range described by the entry. If the region has no associated
type information, then this typically requires that each 4-byte aligned
quantity (8-byte aligned with 64-bit pointers) be considered a candidate
pointer.
<P>
We determine whether a candidate pointer is actually the address of
a heap block. This is done in the following steps (sketched in code
after the list):
<OL>
<LI> The candidate pointer is checked against rough heap bounds.
These heap bounds are maintained such that all actual heap objects
fall between them. In order to facilitate black-listing (see below)
we also include address regions that the heap is likely to expand into.
Most non-pointers fail this initial test.
<LI> The candidate pointer is divided into two pieces; the most significant
bits identify a <TT>HBLKSIZE</tt>-sized page in the address space, and
the least significant bits specify an offset within that page.
(A hardware page may actually consist of multiple such pages.
<TT>HBLKSIZE</tt> is usually the page size divided by a small power of two.)
<LI>
The page address part of the candidate pointer is looked up in a
<A HREF="tree.html">table</a>.
Each table entry contains either 0, indicating that the page is not part
of the garbage collected heap, a small integer <I>n</i>, indicating
that the page is part of a large object, starting at least <I>n</i> pages
back, or a pointer to a descriptor for the page. In the first case,
the candidate pointer is not a true pointer and can be safely ignored.
In the last two cases, we can obtain a descriptor for the page containing
the beginning of the object.
<LI>
The starting address of the referenced object is computed.
The page descriptor contains the size of the object(s)
in that page, the object kind, and the necessary mark bits for those
objects. The size information can be used to map the candidate pointer
to the object starting address. To accelerate this process, the page header
also contains a pointer to a precomputed map of page offsets to displacements
from the beginning of an object. The use of this map avoids a
potentially slow integer remainder operation in computing the object
start address.
<LI>
The mark bit for the target object is checked and set. If the object
was previously unmarked, the object is pushed on the mark stack.
The descriptor is read from the page descriptor. (This is computed
from information in <TT>GC_obj_kinds</tt> when the page is first allocated.)
</ol>
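<P>
The second and third steps can be sketched as follows. <TT>HBLKSIZE</tt> is
real, but the flat index array, its bound, the header type, and the cut-off
distinguishing small integers from header pointers are all invented here;
the collector actually uses the multi-level structure described in
<A HREF="tree.html">tree.html</a>, and the sketch assumes a 32-bit address
space:
<PRE>
#define HBLKSIZE  4096                      /* illustrative power-of-two page size  */
#define MAX_PAGES (1024 * 1024)             /* enough for a 32-bit address space    */

struct hblkhdr;                             /* page (heap block) descriptor         */

/* Each entry is 0 (page not in the GC heap), a small integer n (the page lies     */
/* inside a large object starting at least n pages back), or a header pointer.     */
static void *page_index[MAX_PAGES];

static struct hblkhdr *header_for(void *candidate)
{
    unsigned long page = (unsigned long)candidate / HBLKSIZE;  /* high bits         */
    void *entry = page_index[page];

    while ((unsigned long)entry != 0 && 128 > (unsigned long)entry) {
        page -= (unsigned long)entry;       /* skip back toward the object start    */
        entry = page_index[page];
    }
    return (struct hblkhdr *)entry;         /* 0 means "not a pointer into the heap" */
}
</PRE>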
<P>
At the end of the mark phase, mark bits for left-over free lists are cleared,
in case a free list was accidentally marked due to a stray pointer.

<H2>Sweep phase</h2>

At the end of the mark phase, all blocks in the heap are examined.
Unmarked large objects are immediately returned to the large object free list.
Each small object page is checked to see if all mark bits are clear.
If so, the entire page is returned to the large object free list.
Small object pages containing some reachable object are queued for later
sweeping, unless we determine that the page contains very little free
space, in which case it is not examined further.
<P>
This initial sweep pass touches only block headers, not
the blocks themselves. Thus it does not require significant paging, even
if large sections of the heap are not in physical memory.
<P>
Nonempty small object pages are swept when an allocation attempt
encounters an empty free list for that object size and kind.
Pages for the correct size and kind are repeatedly swept until at
least one empty block is found. Sweeping such a page involves
scanning the mark bit array in the page header, and building a free
list linked through the first words in the objects themselves.
This does involve touching the appropriate data page, but in most cases
it will be touched only just before it is used for allocation.
Hence any paging is essentially unavoidable.
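<P>
A sketch of sweeping one small-object page appears below. All names are
invented (the real, size-specialized routines are in <TT>reclaim.c</tt>),
and the mark bit array is simplified to one flag per object:
<PRE>
#define HBLKSIZE 4096                          /* illustrative page size              */

typedef void *ptr_t;

struct page_header {                           /* simplified page (heap block) header */
    unsigned obj_bytes;                        /* size of every object in the page    */
    unsigned char marked[HBLKSIZE];            /* one flag per object, for simplicity */
};

/* Link the unmarked objects of one page onto a free list through their first     */
/* words, clearing them as the text above requires for pointer-containing kinds.  */
static ptr_t sweep_page(struct page_header *hdr, char *page, ptr_t free_list)
{
    unsigned i, k, n = HBLKSIZE / hdr->obj_bytes;
    for (i = 0; i != n; ++i) {
        char *obj = page + i * hdr->obj_bytes;
        if (!hdr->marked[i]) {
            for (k = 0; k != hdr->obj_bytes; ++k) obj[k] = 0;  /* clear the object  */
            *(ptr_t *)obj = free_list;         /* link through the first word        */
            free_list = obj;
        }
    }
    return free_list;
}
</PRE>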
<P>
Except in the case of pointer-free objects, we maintain the invariant
that any object in a small object free list is cleared (except possibly
for the link field). Thus it becomes the burden of the small object
sweep routine to clear objects. This has the advantage that we can
easily recover from accidentally marking a free list, though that could
also be handled by other means. The collector currently spends a fair
amount of time clearing objects, and this approach should probably be
revisited.
<P>
In most configurations, we use specialized sweep routines to handle common
small object sizes. Since we allocate one mark bit per word, it becomes
easier to examine the relevant mark bits if the object size divides
the word length evenly. We also suitably unroll the inner sweep loop
in each case. (It is conceivable that profile-based procedure cloning
in the compiler could make this unnecessary and counterproductive. I
know of no existing compiler to which this applies.)
<P>
The sweeping of small object pages could be avoided completely at the expense
of examining mark bits directly in the allocator. This would probably
be more expensive, since each allocation call would have to reload
a large amount of state (e.g. next object address to be swept, position
in mark bit table) before it could do its work. The current scheme
keeps the allocator simple and allows useful optimizations in the sweeper.

<H2>Finalization</h2>
Both <TT>GC_register_disappearing_link</tt> and
<TT>GC_register_finalizer</tt> add the request to a corresponding hash
table. The hash table is allocated out of collected memory, but
the reference to the finalizable object is hidden from the collector.
Currently finalization requests are processed non-incrementally at the
end of a mark cycle.
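<P>
From the client's point of view, registration looks like the following
example. It assumes the usual declarations of <TT>GC_MALLOC</tt>,
<TT>GC_register_finalizer</tt>, and <TT>GC_finalization_proc</tt> in
<TT>gc.h</tt>; check the installed header for the exact prototypes:
<PRE>
#include "gc.h"
#include &lt;stdio.h&gt;

/* A finalizer receives the object and the client data that were passed  */
/* when it was registered.                                                */
static void my_finalizer(void *obj, void *client_data)
{
    printf("finalizing %p (%s)\n", obj, (char *)client_data);
}

void register_example(void)
{
    void *p = GC_MALLOC(100);
    /* The last two arguments, if non-null, receive any previously        */
    /* registered finalizer and client data for p.                        */
    GC_register_finalizer(p, my_finalizer, "example", 0, 0);
}
</PRE>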
<P>
The collector makes an initial pass over the table of finalizable objects,
pushing the contents of unmarked objects onto the mark stack.
After pushing each object, the marker is invoked to mark all objects
reachable from it. The object itself is not explicitly marked.
This ensures that objects on which a finalizer depends are neither
collected nor finalized.
<P>
If in the process of marking from an object the
object itself becomes marked, we have uncovered
a cycle involving the object. This usually results in a warning from the
collector. Such objects are not finalized, since it may be
unsafe to do so. See the more detailed
<A HREF="http://www.hpl.hp.com/personal/Hans_Boehm/gc/finalization.html">discussion of finalization semantics</a>.
<P>
Any objects remaining unmarked at the end of this process are added to
a queue of objects whose finalizers can be run. Depending on collector
configuration, finalizers are dequeued and run either implicitly during
allocation calls, or explicitly in response to a user request.
(Note that the former is unfortunately both the default and not generally safe.
If finalizers perform synchronization, it may result in deadlocks.
Nontrivial finalizers generally need to perform synchronization, and
thus require a different collector configuration.)
<P>
The collector provides a mechanism for replacing the procedure that is
used to mark through objects. This is used both to provide support for
Java-style unordered finalization, and to ignore certain kinds of cycles,
<I>e.g.</i> those arising from C++ implementations of virtual inheritance.

<H2>Generational Collection and Dirty Bits</h2>
We basically use the concurrent and generational GC algorithm described in
<A HREF="http://www.hpl.hp.com/personal/Hans_Boehm/gc/papers/pldi91.ps.Z">"Mostly Parallel Garbage Collection"</a>,
by Boehm, Demers, and Shenker.
<P>
The most significant modification is that
the collector always starts running in the allocating thread.
There is no separate garbage collector thread. (If parallel GC is
enabled, helper threads may also be woken up.)
If an allocation attempt either requests a large object, or encounters
an empty small object free list, and notices that there is a collection
in progress, it immediately performs a small amount of marking work
as described above.
<P>
This change was made both because we wanted to easily accommodate
single-threaded environments, and because a separate GC thread requires
very careful control over the scheduler to prevent the mutator from
out-running the collector, and hence provoking unneeded heap growth.
<P>
In incremental mode, the heap is always expanded when we encounter
insufficient space for an allocation. Garbage collection is triggered
whenever we notice that more than
<TT>GC_heap_size</tt>/2 * <TT>GC_free_space_divisor</tt>
bytes of allocation have taken place.
After <TT>GC_full_freq</tt> minor collections a major collection
is started.
<P>
All collections initially run uninterrupted until a predetermined
amount of time (50 msecs by default) has expired. If this allows
the collection to complete entirely, we can avoid correcting
for data structure modifications during the collection. If it does
not complete, we return control to the mutator, and perform small
amounts of additional GC work during those later allocations that
cannot be satisfied from small object free lists. When marking completes,
the set of modified pages is retrieved, and we mark once again from
marked objects on those pages, this time with the mutator stopped.
<P>
We keep track of modified pages using one of several distinct mechanisms:
<OL>
<LI>
Through explicit mutator cooperation. Currently this requires
the use of <TT>GC_malloc_stubborn</tt>, and is rarely used.
<LI>
(<TT>MPROTECT_VDB</tt>) By write-protecting physical pages and
catching write faults. This is
implemented for many Unix-like systems and for win32. It is not possible
in a few environments. (A minimal sketch of this technique appears after
the list.)
<LI>
(<TT>PROC_VDB</tt>) By retrieving dirty bit information from /proc.
(Currently only Sun's
Solaris supports this. Though this is considerably cleaner, performance
may actually be better with mprotect and signals.)
<LI>
(<TT>PCR_VDB</tt>) By relying on an external dirty bit implementation, in this
case the one in Xerox PCR.
<LI>
(<TT>DEFAULT_VDB</tt>) By treating all pages as dirty. This is the default if
none of the other techniques is known to be usable, and
<TT>GC_malloc_stubborn</tt> is not used. Practical only for testing, or if
the vast majority of objects use <TT>GC_malloc_stubborn</tt>.
</ol>
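<P>
The following is a minimal, Unix-flavored sketch of the
<TT>MPROTECT_VDB</tt> idea only: write-protect the heap, and record a page
as dirty when the first write to it faults. It is not the collector's
implementation, which must also forward faults it does not own, handle
system calls that write to the heap, use <TT>SIGBUS</tt> where required,
and so on:
<PRE>
#include &lt;signal.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define PAGE_SIZE 4096              /* illustrative; query the system in practice   */
#define MAX_DIRTY 1024              /* invented bound for the sketch                */

static void *dirty_pages[MAX_DIRTY];
static int n_dirty;

static void write_fault_handler(int sig, siginfo_t *info, void *ctx)
{
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));
    if (n_dirty != MAX_DIRTY) dirty_pages[n_dirty++] = page;
    /* un-protect the page so the faulting write succeeds when we return */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    (void)sig; (void)ctx;
}

static void start_tracking_dirty_pages(void *heap, size_t heap_bytes)
{
    struct sigaction sa;
    sa.sa_sigaction = write_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, 0);                /* some systems deliver SIGBUS instead */
    n_dirty = 0;
    mprotect(heap, heap_bytes, PROT_READ);     /* writes now fault and get recorded   */
}
</PRE>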

<H2>Black-listing</h2>

The collector implements <I>black-listing</i> of pages, as described
in
<A HREF="http://www.acm.org/pubs/citations/proceedings/pldi/155090/p197-boehm/">
Boehm, ``Space Efficient Conservative Collection'', PLDI '93</a>, also available
<A HREF="papers/pldi93.ps.Z">here</a>.
<P>
During the mark phase, the collector tracks ``near misses'', i.e. attempts
to follow a ``pointer'' to just outside the garbage-collected heap, or
to a currently unallocated page inside the heap. Pages that have been
the targets of such near misses are likely to be the targets of
misidentified ``pointers'' in the future. To minimize the future
damage caused by such misidentifications, they will be allocated only to
small pointer-free objects.
<P>
The collector understands two different kinds of black-listing. A
page may be black-listed for interior pointer references
(<TT>GC_add_to_black_list_stack</tt>), if it was the target of a near
miss from a location that requires interior pointer recognition,
<I>e.g.</i> the stack, or the heap if <TT>GC_all_interior_pointers</tt>
is set. In this case, we also avoid allocating large blocks that include
this page.
<P>
If the near miss came from a source that did not require interior
pointer recognition, it is black-listed with
<TT>GC_add_to_black_list_normal</tt>.
A page black-listed in this way may appear inside a large object,
so long as it is not the first page of a large object.
<P>
The <TT>GC_allochblk</tt> routine respects black-listing when assigning
a block to a particular object kind and size. It occasionally
drops (i.e. allocates and forgets) blocks that are completely black-listed
in order to avoid excessively long large block free lists containing
only unusable blocks. This would otherwise become an issue
if there is low demand for small pointer-free objects.

<H2>Thread support</h2>
We support several different threading models. Unfortunately Pthreads,
the only reasonably well standardized thread model, supports too narrow
an interface for conservative garbage collection. There appears to be
no completely portable way to allow the collector
to coexist with various Pthreads
implementations. Hence we currently support only the more
common Pthreads implementations.
<P>
In particular, it is very difficult for the collector to stop all other
threads in the system and examine the register contents. This is currently
accomplished with very different mechanisms for some Pthreads
implementations. The Solaris implementation temporarily disables much
of the user-level threads implementation by stopping kernel-level threads
("lwp"s). The Linux/HPUX/OSF1 and Irix implementations send signals to
individual Pthreads and have them wait in the signal handler.
<P>
The Linux and Irix implementations use
only documented Pthreads calls, but rely on extensions to their semantics.
The Linux implementation <TT>linux_threads.c</tt> relies on only very
mild extensions to the pthreads semantics, and already supports a large number
of other Unix-like pthreads implementations. Our goal is to make this the
only pthread support in the collector.
<P>
(The Irix implementation is separate only for historical reasons and should
clearly be merged. The current Solaris implementation probably performs
better in the uniprocessor case, but does not support thread operations in the
collector. Hence it cannot support the parallel marker.)
<P>
All implementations must
intercept thread creation and a few other thread-specific calls to allow
enumeration of threads and location of thread stacks. This is currently
accomplished with <TT># define</tt>'s in <TT>gc.h</tt>
(really <TT>gc_pthread_redirects.h</tt>), or optionally
by using ld's function call wrapping mechanism under Linux.
<P>
Recent versions of the collector support several facilities to enhance
the processor-scalability and thread performance of the collector.
These are discussed in more detail <A HREF="scale.html">here</a>.
We briefly outline the approach to thread-local allocation in the
next section.
<H2>Thread-local allocation</h2>
If thread-local allocation is enabled, the collector keeps separate
arrays of free lists for each thread. Thread-local allocation
is currently only supported on a few platforms.
<P>
The free list arrays associated
with each thread are only used to satisfy requests for objects that
are both very small, and belong to one of a small number of well-known
kinds. These currently include "normal" and pointer-free objects.
Depending on the configuration, "gcj" objects may also be included.
<P>
Thread-local free list entries contain either a pointer to the first
element of a free list, or they contain a counter of the number of
allocation "granules" allocated so far. Initially they contain the
value one, i.e. a small counter value.
<P>
Thread-local allocation goes directly through the global
allocator if the object is of a size or kind not covered by the
local free lists.
<P>
If there is an appropriate local free list, the allocator checks whether it
contains a sufficiently small counter value. If so, the counter is simply
incremented, and the global allocator is used.
In this way, the initial few allocations of a given size bypass the local
allocator. A thread that only allocates a handful of objects of a given
size will not build up its own free list for that size. This avoids
wasting space for unpopular object sizes or kinds.
<P>
Once the counter passes a threshold, <TT>GC_malloc_many</tt> is called
to allocate roughly <TT>HBLKSIZE</tt> bytes of space and put it on the
corresponding local free list. Further allocations of that size and kind
then use this free list, and no longer need to acquire the allocation lock.
The allocation procedure is otherwise similar to that for the global free lists.
The local free lists are also linked using the first word in the object.
In most cases this means they require considerably less time.
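<P>
A sketch of the resulting fast path is shown below. The threshold, the slot
representation, and the helper standing in for the global allocator are all
invented; <TT>GC_malloc_many</tt> is the real refill call, and telling a
counter apart from a pointer relies on small counters being smaller than any
valid object address:
<PRE>
#include "gc.h"

#define COUNT_THRESHOLD 32                    /* invented value                      */

typedef void *ptr_t;

/* One slot per small (size, kind) pair; it initially holds the counter value 1.    */
static ptr_t local_slot = (ptr_t)1;

extern ptr_t global_alloc(unsigned bytes);    /* stands in for the global allocator  */

static ptr_t local_malloc(unsigned bytes)
{
    unsigned long v = (unsigned long)local_slot;

    if (v != 0 && COUNT_THRESHOLD > v) {      /* still a small counter               */
        local_slot = (ptr_t)(v + 1);          /* count the allocation ...            */
        return global_alloc(bytes);           /* ... but satisfy it globally         */
    }
    if (v == 0 || v == COUNT_THRESHOLD) {     /* exhausted list, or threshold hit    */
        local_slot = GC_malloc_many(bytes);   /* refill: objects linked through      */
        if (local_slot == 0)                  /* their first words                   */
            return global_alloc(bytes);
    }
    {
        ptr_t result = local_slot;            /* pop one object; no lock is needed   */
        local_slot = *(ptr_t *)result;
        return result;
    }
}
</PRE>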
<P>
On thread exit, any remaining thread-local free list entries are
transferred back to the global free list.
<P>
Note that if the collector is configured for thread-local allocation,
GC versions before 7 do not invoke the thread-local allocator by default.
<TT>GC_malloc</tt> only uses thread-local allocation in version 7 and later.
In earlier versions, <TT>GC_MALLOC</tt> (all caps) may be directed
to use thread-local allocation by defining <TT>GC_REDIRECT_TO_LOCAL</tt>
and then including <TT>gc_local_alloc.h</tt>.
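<P>
For such an earlier version, a minimal usage sketch (assuming the collector
was built with thread-local allocation support) looks like this:
<PRE>
#define GC_REDIRECT_TO_LOCAL      /* must appear before the GC headers are included */
#include "gc.h"
#include "gc_local_alloc.h"

void *alloc_node(void)
{
    return GC_MALLOC(64);         /* GC_MALLOC now expands to the thread-local path */
}
</PRE>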
<P>
For some more details see <A HREF="scale.html">here</a>, and the
technical report entitled
<A HREF="http://www.hpl.hp.com/techreports/2000/HPL-2000-165.html">
``Fast Multiprocessor Memory Allocation and Garbage Collection''</a>.
<P>
<HR>
<P>
Comments are appreciated. Please send mail to
<A HREF="mailto:boehm@acm.org"><TT>boehm@acm.org</tt></a> or
<A HREF="mailto:Hans.Boehm@hp.com"><TT>Hans.Boehm@hp.com</tt></a>.
<P>
This is a modified copy of a page written while the author was at SGI.
The original was <A HREF="http://reality.sgi.com/boehm/gcdescr.html">here</a>.
</body>
</html>