Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | <HTML> |
2 | <HEAD> | |
3 | <TITLE>Debugging Garbage Collector Related Problems</title> | |
4 | </head> | |
5 | <BODY> | |
6 | <H1>Debugging Garbage Collector Related Problems</h1> | |
7 | This page contains some hints on | |
8 | debugging issues specific to | |
9 | the Boehm-Demers-Weiser conservative garbage collector. | |
10 | It applies both to debugging issues in client code that manifest themselves | |
11 | as collector misbehavior, and to debugging the collector itself. | |
12 | <P> | |
13 | If you suspect a bug in the collector itself, it is strongly recommended | |
14 | that you try the latest collector release, even if it is labelled as "alpha", | |
15 | before proceeding. | |
16 | <H2>Bus Errors and Segmentation Violations</h2> | |
17 | <P> | |
18 | If the fault occurred in GC_find_limit, or with incremental collection enabled, | |
19 | this is probably normal. The collector installs handlers to take care of | |
20 | these. You will not see these unless you are using a debugger. | |
21 | Your debugger <I>should</i> allow you to continue. | |
22 | It's often preferable to tell the debugger to ignore SIGBUS and SIGSEGV | |
23 | ("<TT>handle SIGSEGV SIGBUS nostop noprint</tt>" in gdb, | |
24 | "<TT>ignore SIGSEGV SIGBUS</tt>" in most versions of dbx) | |
25 | and set a breakpoint in <TT>abort</tt>. | |
26 | The collector will call abort if the signal had another cause, | |
27 | and there was no other handler previously installed. | |
28 | <P> | |
29 | We recommend debugging without incremental collection if possible. | |
30 | (This applies directly to UNIX systems. | |
31 | Debugging with incremental collection under win32 is worse. See README.win32.) | |
32 | <P> | |
33 | If the application generates an unhandled SIGSEGV or equivalent, it may | |
34 | often be easiest to set the environment variable GC_LOOP_ON_ABORT. On many | |
35 | platforms, this will cause the collector to loop in a handler when the | |
36 | SIGSEGV is encountered (or when the collector aborts for some other reason), | |
37 | and a debugger can then be attached to the looping | |
38 | process. This sidesteps common operating system problems related | |
39 | to incomplete core files for multithreaded applications, etc. | |
40 | <H2>Other Signals</h2> | |
41 | On most platforms, the multithreaded version of the collector needs one or | |
42 | two other signals for internal use by the collector in stopping threads. | |
43 | It is normally wise to tell the debugger to ignore these. On Linux, | |
44 | the collector currently uses SIGPWR and SIGXCPU by default. | |
45 | <H2>Warning Messages About Needing to Allocate Blacklisted Blocks</h2> | |
46 | The garbage collector generates warning messages of the form | |
47 | <PRE> | |
48 | Needed to allocate blacklisted block at 0x... | |
49 | </pre> | |
50 | or | |
51 | <PRE> | |
52 | Repeated allocation of very large block ... | |
53 | </pre> | |
54 | when it needs to allocate a block at a location that it knows to be | |
55 | referenced by a false pointer. These false pointers can be either permanent | |
56 | (<I>e.g.</i> a static integer variable that never changes) or temporary. | |
57 | In the latter case, the warning is largely spurious, and the block will | |
58 | eventually be reclaimed normally. | |
59 | In the former case, the program will still run correctly, but the block | |
60 | will never be reclaimed. Unless the block is intended to be | |
61 | permanent, the warning indicates a memory leak. | |
62 | <OL> | |
63 | <LI>Ignore these warnings while you are using GC_DEBUG. Some of the routines | |
64 | mentioned below don't have debugging equivalents. (Alternatively, write | |
65 | the missing routines and send them to me.) | |
66 | <LI>Replace allocator calls that request large blocks with calls to | |
67 | <TT>GC_malloc_ignore_off_page</tt> or | |
68 | <TT>GC_malloc_atomic_ignore_off_page</tt>. You may want to set a | |
69 | breakpoint in <TT>GC_default_warn_proc</tt> to help you identify such calls. | |
70 | Make sure that a pointer to somewhere near the beginning of the resulting block | |
71 | is maintained in a (preferably volatile) variable as long as | |
72 | the block is needed. | |
73 | <LI> | |
74 | If the large blocks are allocated with realloc, we suggest instead allocating | |
75 | them with something like the following. Note that the realloc size increment | |
76 | should be fairly large (e.g. a factor of 3/2) for this to exhibit reasonable | |
77 | performance. But we all know we should do that anyway. | |
78 | <PRE> | |
79 | void * big_realloc(void *p, size_t new_size) | |
80 | { | |
81 | size_t old_size = GC_size(p); | |
82 | void * result; | |
83 | ||
84 | if (new_size <= 10000) return(GC_realloc(p, new_size)); | |
85 | if (new_size <= old_size) return(p); | |
86 | result = GC_malloc_ignore_off_page(new_size); | |
87 | if (result == 0) return(0); | |
88 | memcpy(result,p,old_size); | |
89 | GC_free(p); | |
90 | return(result); | |
91 | } | |
92 | </pre> | |
93 | ||
94 | <LI> In the unlikely case that even relatively small object | |
95 | (<20KB) allocations are triggering these warnings, then your address | |
96 | space contains lots of "bogus pointers", i.e. values that appear to | |
97 | be pointers but aren't. Usually this can be solved by using GC_malloc_atomic | |
98 | or the routines in gc_typed.h to allocate large pointer-free regions of bitmaps, etc. Sometimes the problem can be solved with trivial changes of encoding | |
99 | in certain values. It is possible to identify the source of the bogus | |
100 | pointers by building the collector with <TT>-DPRINT_BLACK_LIST</tt>, | |
101 | which will cause it to print the "bogus pointers", along with their location. | |
102 | ||
103 | <LI> If you get only a fixed number of these warnings, you are probably only | |
104 | introducing a bounded leak by ignoring them. If the data structures being | |
105 | allocated are intended to be permanent, then it is also safe to ignore them. | |
106 | The warnings can be turned off by calling GC_set_warn_proc with a procedure | |
107 | that ignores these warnings (e.g. by doing absolutely nothing). | |
108 | </ol> | |
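<P>
The advice above about using a large realloc size increment can be illustrated with a small, self-contained sketch. It does not call the collector at all. The names <TT>grow_geometric</tt>, <TT>grow_fixed</tt> and <TT>count_growth_steps</tt>, and the fixed step of 10000 bytes, are assumptions made for illustration. Each growth step stands for one call to something like <TT>big_realloc</tt> that has to copy the block.

```c
#include <assert.h>
#include <stddef.h>

/* Grow by a factor of 3/2, the factor suggested in the text. */
size_t grow_geometric(size_t cap) { return cap + cap / 2; }

/* Grow by a fixed step (hypothetical policy, for comparison only). */
size_t grow_fixed(size_t cap) { return cap + 10000; }

/* Count how many growth steps, each implying a copy of the block,
 * are needed to get from start to at least target bytes. */
unsigned count_growth_steps(size_t start, size_t target,
                            size_t (*grow)(size_t))
{
    unsigned steps = 0;

    while (start < target) {
        size_t next = grow(start);

        assert(next > start);   /* the policy must make progress */
        start = next;
        steps++;
    }
    return steps;
}
```

Under these assumptions, growing from 10000 to 1000000 bytes takes 12 steps with the 3/2 factor and 99 steps with the fixed increment, so the geometric policy copies the block far less often.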
109 | ||
110 | <H2>The Collector References a Bad Address in <TT>GC_malloc</tt></h2> | |
111 | ||
112 | This typically happens while the collector is trying to remove an entry from | |
113 | its free list, and the free list pointer is bad because the free list link | |
114 | in the last allocated object was bad. | |
115 | <P> | |
116 | With > 99% probability, you wrote past the end of an allocated object. | |
117 | Try setting <TT>GC_DEBUG</tt> before including <TT>gc.h</tt> and | |
118 | allocating with <TT>GC_MALLOC</tt>. This will try to detect such | |
119 | overwrite errors. | |
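<P>
The failure mode described above can be reproduced in a toy model that does not use the collector. This is only a sketch: <TT>ToyHeap</tt> and every <TT>toy_</tt> name are hypothetical, and the real collector's free lists are organized differently. The model keeps the free-list link in the first bytes of each free slot, so a client write that runs past the end of one slot destroys the link stored in the next one, and the damage only shows up on a later allocation.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOT_SIZE 16   /* bytes per object slot (hypothetical) */
#define TOY_NSLOTS    4    /* slots in the toy heap (hypothetical) */
#define TOY_END       (-1) /* end-of-free-list marker */

typedef struct {
    unsigned char mem[TOY_SLOT_SIZE * TOY_NSLOTS];
    int free_head;         /* index of the first free slot, or TOY_END */
} ToyHeap;

/* A free slot stores the index of the next free slot in its first bytes. */
static void toy_set_link(ToyHeap *h, int slot, int next)
{
    memcpy(h->mem + slot * TOY_SLOT_SIZE, &next, sizeof next);
}

static int toy_get_link(const ToyHeap *h, int slot)
{
    int next;

    memcpy(&next, h->mem + slot * TOY_SLOT_SIZE, sizeof next);
    return next;
}

void toy_init(ToyHeap *h)
{
    memset(h->mem, 0, sizeof h->mem);
    for (int i = 0; i < TOY_NSLOTS; i++)
        toy_set_link(h, i, i + 1 < TOY_NSLOTS ? i + 1 : TOY_END);
    h->free_head = 0;
}

/* Pop the first free slot; the new head is whatever link that slot held. */
int toy_alloc(ToyHeap *h)
{
    int slot = h->free_head;

    assert(slot >= 0 && slot < TOY_NSLOTS);
    h->free_head = toy_get_link(h, slot);
    return slot;
}

int toy_head_is_sane(const ToyHeap *h)
{
    return h->free_head == TOY_END
           || (h->free_head >= 0 && h->free_head < TOY_NSLOTS);
}

/* Simulate the client writing n bytes starting at the beginning of a slot.
 * Passing n > TOY_SLOT_SIZE models writing past the end of the object. */
void toy_client_write(ToyHeap *h, int slot, size_t n)
{
    assert((size_t)slot * TOY_SLOT_SIZE + n <= sizeof h->mem);
    memset(h->mem + slot * TOY_SLOT_SIZE, 0x41, n);
}
```

In this model the overrun itself goes unnoticed. The free-list head only becomes invalid after the next <TT>toy_alloc</tt> call, which is why the fault tends to appear in the allocator and not at the offending write.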
120 | ||
121 | <H2>Unexpectedly Large Heap</h2> | |
122 | ||
123 | Unexpected heap growth can be due to one of the following: | |
124 | <OL> | |
125 | <LI> Data structures that are being unintentionally retained. This | |
126 | is commonly caused by data structures that are no longer being used, | |
127 | but were not cleared, or by caches growing without bounds. | |
128 | <LI> Pointer misidentification. The garbage collector is interpreting | |
129 | integers or other data as pointers and retaining the "referenced" | |
130 | objects. A common symptom is that GC_dump() shows much of the heap | |
131 | as black-listed. | |
132 | <LI> Heap fragmentation. This should never result in unbounded growth, | |
133 | but it may account for larger heaps. This is most commonly caused | |
134 | by allocation of large objects. On some platforms it can be reduced | |
135 | by building with -DUSE_MUNMAP, which will cause the collector to unmap | |
136 | memory corresponding to pages that have not been recently used. | |
137 | <LI> Per-object overhead. This is usually a relatively minor effect, but | |
138 | it may be worth considering. If the collector recognizes interior | |
139 | pointers, object sizes are increased, so that one-past-the-end pointers | |
140 | are correctly recognized. The collector can be configured not to do this | |
141 | (<TT>-DDONT_ADD_BYTE_AT_END</tt>). | |
142 | <P> | |
143 | The collector rounds up object sizes so the result fits well into the | |
144 | chunk size (<TT>HBLKSIZE</tt>, normally 4K on 32-bit machines, 8K | |
145 | on 64-bit machines) used by the collector. Thus it may be worth avoiding | |
146 | objects of size 2K + 1 (or 2K if a byte is being added at the end). | |
147 | </ol> | |
148 | The last two cases can often be identified by looking at the output | |
149 | of a call to <TT>GC_dump()</tt>. Among other things, it will print the | |
150 | list of free heap blocks, and a very brief description of all chunks in | |
151 | the heap, the object sizes they correspond to, and how many live objects | |
152 | were found in the chunk at the last collection. | |
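<P>
The effect of object size on chunk utilization, mentioned under per-object overhead above, can be worked through with a simplified model. This sketch assumes a 4K chunk, ignores block headers, and does not reproduce the collector's actual size classes. <TT>MODEL_HBLKSIZE</tt> and both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HBLKSIZE 4096u  /* assumed chunk size; see HBLKSIZE in the text */

/* Number of objects of the given size that fit in one chunk
 * in this simplified model. */
size_t model_objects_per_chunk(size_t obj_size)
{
    assert(obj_size > 0 && obj_size <= MODEL_HBLKSIZE);
    return MODEL_HBLKSIZE / obj_size;
}

/* Bytes of the chunk left unused once it is filled with such objects. */
size_t model_wasted_bytes(size_t obj_size)
{
    return MODEL_HBLKSIZE - model_objects_per_chunk(obj_size) * obj_size;
}
```

In this model a 2048-byte object packs two to a chunk with nothing wasted, while a 2049-byte object gets a chunk to itself and leaves 2047 bytes unused. That is the reason for avoiding sizes just over half the chunk size.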
153 | <P> | |
154 | Growing data structures can usually be identified by | |
155 | <OL> | |
156 | <LI> Building the collector with <TT>-DKEEP_BACK_PTRS</tt>, | |
157 | <LI> Preferably using debugging allocation (defining <TT>GC_DEBUG</tt> | |
158 | before including <TT>gc.h</tt> and allocating with <TT>GC_MALLOC</tt>), | |
159 | so that objects will be identified by their allocation site, | |
160 | <LI> Running the application long enough so | |
161 | that most of the heap is composed of "leaked" memory, and | |
162 | <LI> Then calling <TT>GC_generate_random_backtrace()</tt> from backptr.h | |
163 | a few times to determine why some randomly sampled objects in the heap are | |
164 | being retained. | |
165 | </ol> | |
166 | <P> | |
167 | The same technique can often be used to identify problems with false | |
168 | pointers, by noting whether the reference chains printed by | |
169 | <TT>GC_generate_random_backtrace()</tt> involve any misidentified pointers. | |
170 | An alternate technique is to build the collector with | |
171 | <TT>-DPRINT_BLACK_LIST</tt>, which will cause it to report values that | |
172 | almost, but not quite, look like heap pointers. It is very likely that | |
173 | actual false pointers will come from similar sources. | |
174 | <P> | |
175 | In the unlikely case that false pointers are an issue, it can usually | |
176 | be resolved using one or more of the following techniques: | |
177 | <OL> | |
178 | <LI> Use <TT>GC_malloc_atomic</tt> for objects containing no pointers. | |
179 | This is especially important for large arrays containing compressed data, | |
180 | pseudo-random numbers, and the like. It is also likely to improve GC | |
181 | performance, perhaps drastically so if the application is paging. | |
182 | <LI> If you allocate large objects containing only | |
183 | one or two pointers at the beginning, either try the typed allocation | |
184 | primitives in <TT>gc_typed.h</tt>, or separate out the pointer-free component. | |
185 | <LI> Consider using <TT>GC_malloc_ignore_off_page()</tt> | |
186 | to allocate large objects. (See <TT>gc.h</tt> and above for details. | |
187 | Large means > 100K in most environments.) | |
188 | <LI> If your heap size is larger than 100MB or so, build the collector with | |
189 | -DLARGE_CONFIG. This allows the collector to keep more precise black-list | |
190 | information. | |
191 | <LI> If you are using heaps close to, or larger than, a gigabyte on a 32-bit | |
192 | machine, you may want to consider moving to a platform with 64-bit pointers. | |
193 | This is very likely to resolve any false pointer issues. | |
194 | </ol> | |
195 | <H2>Prematurely Reclaimed Objects</h2> | |
196 | The usual symptom of this is a segmentation fault, or an obviously overwritten | |
197 | value in a heap object. This should, of course, be impossible. In practice, | |
198 | it may happen for reasons like the following: | |
199 | <OL> | |
200 | <LI> The collector did not intercept the creation of threads correctly in | |
201 | a multithreaded application, <I>e.g.</i> because the client called | |
202 | <TT>pthread_create</tt> without including <TT>gc.h</tt>, which redefines it. | |
203 | <LI> The last pointer to an object in the garbage collected heap was stored | |
204 | somewhere where the collector couldn't see it, <I>e.g.</i> in an | |
205 | object allocated with system <TT>malloc</tt>, in certain types of | |
206 | <TT>mmap</tt>ed files, | |
207 | or in some data structure visible only to the OS. (On some platforms, | |
208 | thread-local storage is one of these.) | |
209 | <LI> The last pointer to an object was somehow disguised, <I>e.g.</i> by | |
210 | XORing it with another pointer. | |
211 | <LI> Incorrect use of <TT>GC_malloc_atomic</tt> or typed allocation. | |
212 | <LI> An incorrect <TT>GC_free</tt> call. | |
213 | <LI> The client program overwrote an internal garbage collector data structure. | |
214 | <LI> A garbage collector bug. | |
215 | <LI> (Empirically less likely than any of the above.) A compiler optimization | |
216 | that disguised the last pointer. | |
217 | </ol> | |
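<P>
The disguised-pointer case in the list above can be made concrete without calling the collector. The sketch below is an illustration only: <TT>disguise</tt>, <TT>reveal</tt> and <TT>looks_like_pointer_into</tt> are hypothetical names, and the range test is a rough stand-in for the kind of address check a conservative scan performs, not the collector's real logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hide a pointer by XORing its bits with a key. */
uintptr_t disguise(const void *p, uintptr_t key)
{
    return (uintptr_t)p ^ key;
}

/* Recover the original pointer from the disguised value. */
void *reveal(uintptr_t d, uintptr_t key)
{
    return (void *)(d ^ key);
}

/* Rough model of a conservative test: does this word, read as an
 * address, fall inside the object occupying [base, base + len)? */
int looks_like_pointer_into(uintptr_t word, const void *base, size_t len)
{
    uintptr_t lo = (uintptr_t)base;

    assert(len > 0);
    return word >= lo && word - lo < len;
}
```

If the disguised word is the only remaining copy of the address, a scan of this kind finds no reference to the object, so the object may be reclaimed even though the program can still recover the pointer with <TT>reveal</tt>.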
218 | The following relatively simple techniques should be tried first to narrow | |
219 | down the problem: | |
220 | <OL> | |
221 | <LI> If you are using the incremental collector, try turning it off for | |
222 | debugging. | |
223 | <LI> If you are using shared libraries, try linking statically. If that works, | |
224 | ensure that DYNAMIC_LOADING is defined on your platform. | |
225 | <LI> Try to reproduce the problem with fully debuggable unoptimized code. | |
226 | This will eliminate the last possibility, as well as making debugging easier. | |
227 | <LI> Try replacing any suspect typed allocation and <TT>GC_malloc_atomic</tt> | |
228 | calls with calls to <TT>GC_malloc</tt>. | |
229 | <LI> Try removing any GC_free calls (<I>e.g.</i> with a suitable | |
230 | <TT>#define</tt>). | |
231 | <LI> Rebuild the collector with <TT>-DGC_ASSERTIONS</tt>. | |
232 | <LI> If the following works on your platform (i.e. if gctest still works | |
233 | if you do this), try building the collector with | |
234 | <TT>-DREDIRECT_MALLOC=GC_malloc_uncollectable</tt>. This will cause | |
235 | the collector to scan memory allocated with malloc. | |
236 | </ol> | |
237 | If all else fails, you will have to attack this with a debugger. | |
238 | Suggested steps: | |
239 | <OL> | |
240 | <LI> Call <TT>GC_dump()</tt> from the debugger around the time of the failure. Verify | |
241 | that the collector's idea of the root set (i.e. static data regions which | |
242 | it should scan for pointers) looks plausible. If not, i.e. if it doesn't | |
243 | include some static variables, report this as | |
244 | a collector bug. Be sure to describe your platform precisely, since this sort | |
245 | of problem is nearly always very platform dependent. | |
246 | <LI> Especially if the failure is not deterministic, try to isolate it to | |
247 | a relatively small test case. | |
248 | <LI> Set a breakpoint in <TT>GC_finish_collection</tt>. This is a good | |
249 | point to examine what has been marked, i.e. found reachable, by the | |
250 | collector. | |
251 | <LI> If the failure is deterministic, run the process | |
252 | up to the last collection before the failure. | |
253 | Note that the variable <TT>GC_gc_no</tt> counts collections and can be used | |
254 | to set a conditional breakpoint in the right one. It is incremented just | |
255 | before the call to GC_finish_collection. | |
256 | If object <TT>p</tt> was prematurely recycled, it may be helpful to | |
257 | look at <TT>*GC_find_header(p)</tt> at the failure point. | |
258 | The <TT>hb_last_reclaimed</tt> field will identify the collection number | |
259 | during which its block was last swept. | |
260 | <LI> Verify that the offending object still has its correct contents at | |
261 | this point. | |
262 | Then call <TT>GC_is_marked(p)</tt> from the debugger to verify that the | |
263 | object has not been marked, and is about to be reclaimed. Note that | |
264 | <TT>GC_is_marked(p)</tt> expects the real address of an object (the | |
265 | address of the debug header if there is one), and thus it may | |
266 | be more appropriate to call <TT>GC_is_marked(GC_base(p))</tt> | |
267 | instead. | |
268 | <LI> Determine a path from a root, i.e. static variable, stack, or | |
269 | register variable, | |
270 | to the reclaimed object. Call <TT>GC_is_marked(q)</tt> for each object | |
271 | <TT>q</tt> along the path, trying to locate the first unmarked object, say | |
272 | <TT>r</tt>. | |
273 | <LI> If <TT>r</tt> is pointed to by a static root, | |
274 | verify that the location | |
275 | pointing to it is part of the root set printed by <TT>GC_dump()</tt>. If it | |
276 | is on the stack in the main (or only) thread, verify that | |
277 | <TT>GC_stackbottom</tt> is set correctly to the base of the stack. If it is | |
278 | in another thread stack, check the collector's thread data structure | |
279 | (<TT>GC_thread[]</tt> on several platforms) to make sure that stack bounds | |
280 | are set correctly. | |
281 | <LI> If <TT>r</tt> is pointed to by heap object <TT>s</tt>, check that the | |
282 | collector's layout description for <TT>s</tt> is such that the pointer field | |
283 | will be scanned. Examine <TT>*GC_find_header(s)</tt> to look at the descriptor | |
284 | for the heap chunk. The <TT>hb_descr</tt> field specifies the layout | |
285 | of objects in that chunk. See gc_mark.h for the meaning of the descriptor. | |
286 | (If its low-order 2 bits are zero, then it is just the length of the | |
287 | object prefix to be scanned. This form is always used for objects allocated | |
288 | with <TT>GC_malloc</tt> or <TT>GC_malloc_atomic</tt>.) | |
289 | <LI> If the failure is not deterministic, you may still be able to apply some | |
290 | of the above techniques at the point of failure. But remember that objects | |
291 | allocated since the last collection will not have been marked, even if the | |
292 | collector is functioning properly. On some platforms, the collector | |
293 | can be configured to save call chains in objects for debugging. | |
294 | Enabling this feature will also cause it to save the call stack at the | |
295 | point of the last GC in GC_arrays._last_stack. | |
296 | <LI> When looking at GC internal data structures remember that a number | |
297 | of <TT>GC_</tt><I>xxx</i> variables are really macro defined to | |
298 | <TT>GC_arrays._</tt><I>xxx</i>, so that | |
299 | the collector can avoid scanning them. | |
300 | </ol> | |
301 | </body> | |
302 | </html> | |