# Debug Info Module

This module generates debug symbols. We use LLVM's
[source level debugging](https://llvm.org/docs/SourceLevelDebugging.html)
features to generate the debug information. The general principle is
this:

Given the right metadata in the LLVM IR, the LLVM code generator is able to
create DWARF debug symbols for the given code. The
[metadata](https://llvm.org/docs/LangRef.html#metadata-type) is structured
much like DWARF *debugging information entries* (DIE), representing type
information such as datatype layout, function signatures, block layout,
variable location and scope information, etc. It is the purpose of this
module to generate correct metadata and insert it into the LLVM IR.

As the exact format of metadata trees may change between different LLVM
versions, we now use LLVM
[DIBuilder](https://llvm.org/docs/doxygen/html/classllvm_1_1DIBuilder.html)
to create metadata where possible. This will hopefully ease the adaptation
of this module to future LLVM versions.

The public API of the module is a set of functions that will insert the
correct metadata into the LLVM IR when called with the right parameters.
The module is thus driven from an outside client with functions like
`debuginfo::create_local_var_metadata(bx: block, local: &ast::local)`.

Internally the module will try to reuse already created metadata by
utilizing a cache. The way to get a shared metadata node when needed is
thus to just call the corresponding function in this module:

    let file_metadata = file_metadata(cx, file);

The function will take care of probing the cache for an existing node for
that exact file path.
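The probe-then-create pattern can be sketched in isolation. This is a minimal illustration under stated assumptions, not the module's actual code: `MetadataCache`, `NodeId`, and the method shown here are hypothetical stand-ins, and a plain integer takes the place of an LLVM metadata node.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a handle to an LLVM metadata node.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(pub usize);

/// Hypothetical cache of file metadata nodes, keyed by file path.
#[derive(Default)]
pub struct MetadataCache {
    files: HashMap<PathBuf, NodeId>,
    nodes_created: usize,
}

impl MetadataCache {
    /// Returns the node for `path`, creating one only on a cache miss.
    pub fn file_metadata(&mut self, path: &Path) -> NodeId {
        if let Some(&node) = self.files.get(path) {
            return node; // cache hit: reuse the shared node
        }
        // Cache miss: "create" a new node and remember it for later calls.
        let node = NodeId(self.nodes_created);
        self.nodes_created += 1;
        self.files.insert(path.to_path_buf(), node);
        node
    }

    /// Number of distinct nodes created so far.
    pub fn nodes_created(&self) -> usize {
        self.nodes_created
    }
}
```

Calling `file_metadata` twice with the same path yields the same `NodeId`, so callers never need to consult the cache themselves.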

All private state used by the module is stored within either the
`CrateDebugContext` struct (owned by the `CodegenCx`) or the
`FunctionDebugContext` (owned by the `FunctionCx`).

This file consists of three conceptual sections:
1. The public interface of the module
2. Module-internal metadata creation functions
3. Minor utility functions


## Recursive Types

Some kinds of types, such as structs and enums, can be recursive. That means
that the type definition of some type X refers to some other type which in
turn (transitively) refers to X. This introduces cycles into the type
referral graph. A naive algorithm doing an on-demand, depth-first traversal
of this graph when describing types can get trapped in an endless loop
when it reaches such a cycle.

For example, the following simple type for a singly-linked list...

```
struct List {
    value: i32,
    tail: Option<Box<List>>,
}
```

will generate the following call stack with a naive DFS algorithm:

```
describe(t = List)
  describe(t = i32)
  describe(t = Option<Box<List>>)
    describe(t = Box<List>)
      describe(t = List) // at the beginning again...
      ...
```

To break cycles like these, we use "forward declarations". That is, when
the algorithm encounters a possibly recursive type (any struct or enum), it
immediately creates a type description node and inserts it into the cache
*before* describing the members of the type. This type description is just
a stub (as type members are not described and added to it yet) but it
allows the algorithm to already refer to the type. After the stub is
inserted into the cache, the algorithm continues as before. If it now
encounters a recursive reference, it hits the cache and does not try to
describe the type anew.

This behavior is encapsulated in the `RecursiveTypeDescription` enum,
which represents a kind of continuation, storing all state needed to
continue traversal at the type members after the type has been registered
with the cache. (This implementation approach might be a tad
over-engineered and may change in the future.)
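The stub-first idea can be shown on a toy type graph. The sketch below is an illustration only and does not mirror `RecursiveTypeDescription`: `Type`, `Node`, and `Describer` are hypothetical names, and for simplicity every type (not just structs and enums) gets its stub registered up front.

```rust
use std::collections::HashMap;

/// Toy type model (hypothetical; far simpler than the compiler's own types).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum Type {
    I32,
    /// A named struct whose field types are looked up in `Describer::defs`.
    Struct(&'static str),
    Boxed(Box<Type>),
    Opt(Box<Type>),
}

/// A type description node; `members` stays empty while the node is a stub.
#[derive(Debug, Default)]
pub struct Node {
    pub members: Vec<usize>,
}

pub struct Describer {
    defs: HashMap<&'static str, Vec<Type>>,
    cache: HashMap<Type, usize>,
    nodes: Vec<Node>,
}

impl Describer {
    pub fn new(defs: HashMap<&'static str, Vec<Type>>) -> Self {
        Describer { defs, cache: HashMap::new(), nodes: Vec::new() }
    }

    /// Describes `t` and returns the index of its node.
    pub fn describe(&mut self, t: &Type) -> usize {
        if let Some(&id) = self.cache.get(t) {
            return id; // a recursive (or repeated) reference hits the cache
        }
        // Forward declaration: register a stub *before* visiting members.
        let id = self.nodes.len();
        self.nodes.push(Node::default());
        self.cache.insert(t.clone(), id);

        let member_types: Vec<Type> = match t {
            Type::I32 => Vec::new(),
            Type::Struct(name) => self.defs.get(name).cloned().unwrap_or_default(),
            Type::Boxed(inner) | Type::Opt(inner) => vec![(**inner).clone()],
        };
        // Members are described only now; cycles end at the cached stub.
        let mut members = Vec::with_capacity(member_types.len());
        for member in &member_types {
            members.push(self.describe(member));
        }
        self.nodes[id].members = members; // complete the stub
        id
    }

    pub fn node_count(&self) -> usize {
        self.nodes.len()
    }

    pub fn members(&self, id: usize) -> &[usize] {
        &self.nodes[id].members
    }
}
```

Describing the `List` type from the example terminates with four nodes (`List`, `i32`, `Option<Box<List>>`, `Box<List>`), and the `Box<List>` node points back at the node that was registered as a stub.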


## Source Locations and Line Information

In addition to data type descriptions, the debugging information must also
map machine code locations back to source code locations in order to be
useful. This functionality is also handled in this module. The following
functions control source mappings:

+ `set_source_location()`
+ `clear_source_location()`
+ `start_emitting_source_locations()`

`set_source_location()` sets the current source location. All IR
instructions created after a call to this function will be linked to the
given source location, until another location is specified with
`set_source_location()` or the source location is cleared with
`clear_source_location()`. In the latter case, subsequent IR instructions
will not be linked to any source location. As you can see, this is a
stateful API (mimicking the one in LLVM), so be careful with source
locations set by previous calls. It's probably best to not rely on any
specific state being present at a given point in code.

One topic that deserves some extra attention is *function prologues*. At
the beginning of a function's machine code there are typically a few
instructions for loading argument values into allocas and checking if
there's enough stack space for the function to execute. This *prologue* is
not visible in the source code and LLVM puts a special PROLOGUE END marker
into the line table at the first non-prologue instruction of the function.
In order to find out where the prologue ends, LLVM looks for the first
instruction in the function body that is linked to a source location. So,
when generating prologue instructions we have to make sure that we don't
emit source location information until the 'real' function body begins. For
this reason, source location emission is disabled by default for any new
function being codegened and is only activated after a call to the third
function from the list above, `start_emitting_source_locations()`. This
function should be called right before regularly starting to codegen the
top-level block of the given function.
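The interplay of the three functions can be illustrated with a toy builder. This is a sketch under stated assumptions rather than the module's implementation: `ToyBuilder`, `SourceLoc`, `emit`, `location_of`, and `prologue_end` are hypothetical, and the sketch assumes that location requests made while emission is disabled are simply ignored.

```rust
/// Hypothetical source position.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SourceLoc {
    pub line: u32,
    pub col: u32,
}

/// Toy IR builder that tags each emitted instruction with the current
/// source location (if any).
#[derive(Default)]
pub struct ToyBuilder {
    emitting: bool, // disabled by default for a new function
    current: Option<SourceLoc>,
    instructions: Vec<(String, Option<SourceLoc>)>,
}

impl ToyBuilder {
    /// Enables source location emission; call right before the function body.
    pub fn start_emitting_source_locations(&mut self) {
        self.emitting = true;
    }

    /// Sets the location for all subsequently emitted instructions.
    /// Assumption: ignored while emission is still disabled.
    pub fn set_source_location(&mut self, loc: SourceLoc) {
        if self.emitting {
            self.current = Some(loc);
        }
    }

    /// Subsequent instructions will not be linked to any location.
    pub fn clear_source_location(&mut self) {
        self.current = None;
    }

    /// Emits an instruction tagged with whatever state is current.
    pub fn emit(&mut self, inst: &str) {
        self.instructions.push((inst.to_string(), self.current));
    }

    pub fn location_of(&self, index: usize) -> Option<SourceLoc> {
        self.instructions[index].1
    }

    /// Index of the first instruction linked to a location, which is where
    /// a prologue-end marker would go in this toy model.
    pub fn prologue_end(&self) -> Option<usize> {
        self.instructions.iter().position(|(_, loc)| loc.is_some())
    }
}
```

Because the builder is stateful, an instruction emitted after `clear_source_location()` carries no location even if an earlier instruction in the same block did.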

There is one exception to the above rule: `llvm.dbg.declare` instructions
must be linked to the source location of the variable being declared. For
function parameters these `llvm.dbg.declare` instructions typically occur
in the middle of the prologue; however, they are ignored by LLVM's prologue
detection. The `create_argument_metadata()` and related functions take care
of linking the `llvm.dbg.declare` instructions to the correct source
locations even while source location emission is still disabled, so there
is no need to do anything special with source location handling here.

## Unique Type Identification

In order for link-time optimization to work properly, LLVM needs a unique
type identifier that tells it across compilation units which types are the
same as others. This type identifier is created by
`TypeMap::get_unique_type_id_of_type()` using the following algorithm:

1. Primitive types have their name as ID

2. Structs, enums and traits have a multipart identifier

   1. The first part is the SVH (strict version hash) of the crate they
      were originally defined in

   2. The second part is the `ast::NodeId` of the definition in their
      original crate

   3. The final part is a concatenation of the type IDs of their concrete
      type arguments if they are generic types.

3. Tuple, pointer, and function types are structurally identified, which
   means that they are equivalent if their component types are equivalent
   (e.g., `(i32, i32)` is the same regardless of which crate it is used in).
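The three rules above can be sketched as a recursive function over a toy type model. Everything here is hypothetical and chosen for illustration: the `Ty` enum, the `unique_type_id` function, and the textual ID format are assumptions and do not reproduce what `TypeMap::get_unique_type_id_of_type()` actually emits.

```rust
/// Toy type model (hypothetical).
pub enum Ty {
    /// Rule 1: identified by name.
    Primitive(&'static str),
    /// Rule 2: a nominal type, identified by the hash of its defining
    /// crate, its definition ID in that crate, and its type arguments.
    Adt { crate_svh: u64, def_id: u32, args: Vec<Ty> },
    /// Rule 3: structural types, identified by their components.
    Tuple(Vec<Ty>),
    Pointer(Box<Ty>),
}

/// Builds an identifier string for `ty`; the format is made up for this sketch.
pub fn unique_type_id(ty: &Ty) -> String {
    match ty {
        Ty::Primitive(name) => name.to_string(),
        Ty::Adt { crate_svh, def_id, args } => {
            let args: Vec<String> = args.iter().map(unique_type_id).collect();
            // Crate hash, then definition ID, then concatenated argument IDs.
            format!("{{{:x}/{}<{}>}}", crate_svh, def_id, args.join(","))
        }
        Ty::Tuple(elems) => {
            let elems: Vec<String> = elems.iter().map(unique_type_id).collect();
            format!("({})", elems.join(","))
        }
        Ty::Pointer(inner) => format!("*{}", unique_type_id(inner)),
    }
}
```

Two `(i32, i32)` tuples built in different places produce the same ID, while two nominal types with the same definition ID but different crate hashes do not.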

This algorithm also provides a stable ID for types that are defined in one
crate but instantiated from metadata within another crate. We just have to
take care to always map crate and `NodeId`s back to the original crate
context.

As a side effect, these unique type IDs also help to solve a problem arising
from lifetime parameters. Since lifetime parameters are completely omitted
in debuginfo, more than one `Ty` instance may map to the same debuginfo
type metadata. That is, some struct `Struct<'a>` may have N instantiations
with different concrete substitutions for `'a`, and thus there will be N
`Ty` instances for the type `Struct<'a>` even though it is not generic
otherwise. Unfortunately this means that we cannot use `ty::type_id()` as
a cheap identifier for type metadata. We have done this in the past, but it
led to unnecessary metadata duplication in the best case and LLVM
assertions in the worst. However, the unique type ID as described above
*can* be used as an identifier. Since it is comparatively expensive to
construct, though, `ty::type_id()` is still used additionally as an
optimization for cases where the exact same type has been seen before
(which is most of the time).
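The two-level lookup described here can be sketched as follows. This is an assumption-laden illustration: `ToyTypeMap` and `metadata_for` are hypothetical, a `u32` stands in for the cheap per-`Ty` key, and the unique ID is passed as a closure so that it is only computed when the fast path misses.

```rust
use std::collections::HashMap;

/// Hypothetical two-level cache for type metadata nodes.
#[derive(Default)]
pub struct ToyTypeMap {
    /// Fast path: cheap per-`Ty` key -> metadata node.
    by_ty_key: HashMap<u32, usize>,
    /// Slow path: unique type ID -> metadata node.
    by_unique_id: HashMap<String, usize>,
    nodes_created: usize,
}

impl ToyTypeMap {
    /// Returns the metadata node for a type, computing the (expensive)
    /// unique ID only if the cheap key has not been seen before.
    pub fn metadata_for(
        &mut self,
        ty_key: u32,
        unique_id: impl FnOnce() -> String,
    ) -> usize {
        if let Some(&node) = self.by_ty_key.get(&ty_key) {
            return node; // exact same type seen before
        }
        let id = unique_id();
        let node = match self.by_unique_id.get(&id).copied() {
            // A different `Ty` instance mapped to the same debuginfo type.
            Some(node) => node,
            None => {
                let node = self.nodes_created;
                self.nodes_created += 1;
                self.by_unique_id.insert(id, node);
                node
            }
        };
        self.by_ty_key.insert(ty_key, node);
        node
    }

    pub fn nodes_created(&self) -> usize {
        self.nodes_created
    }
}
```

Two cheap keys that stand for `Struct<'a>` with different lifetimes resolve to one node through the shared unique ID, so no duplicate metadata is created.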