==============
The memory API
==============

The memory API models the memory and I/O buses and controllers of a QEMU
machine. It attempts to allow modelling of:

- ordinary RAM
- memory-mapped I/O (MMIO)
- memory controllers that can dynamically reroute physical memory regions
  to different destinations

The memory model provides support for

- tracking RAM changes by the guest
- setting up coalesced memory for kvm
- setting up ioeventfd regions for kvm

Memory is modelled as an acyclic graph of MemoryRegion objects. Sinks
(leaves) are RAM and MMIO regions, while other nodes represent
buses, memory controllers, and memory regions that have been rerouted.

In addition to MemoryRegion objects, the memory API provides AddressSpace
objects for every root and possibly for intermediate MemoryRegions too.
These represent memory as seen from the CPU or a device's viewpoint.

Types of regions
----------------

There are multiple types of memory regions (all represented by a single C type
MemoryRegion):

- RAM: a RAM region is simply a range of host memory that can be made available
  to the guest.
  You typically initialize these with memory_region_init_ram(). Some special
  purposes require the variants memory_region_init_resizeable_ram(),
  memory_region_init_ram_from_file(), or memory_region_init_ram_ptr().

- MMIO: a range of guest memory that is implemented by host callbacks;
  each read or write causes a callback to be called on the host.
  You initialize these with memory_region_init_io(), passing it a
  MemoryRegionOps structure describing the callbacks.

- ROM: a ROM memory region works like RAM for reads (directly accessing
  a region of host memory), and forbids writes. You initialize these with
  memory_region_init_rom().

- ROM device: a ROM device memory region works like RAM for reads
  (directly accessing a region of host memory), but like MMIO for
  writes (invoking a callback). You initialize these with
  memory_region_init_rom_device().

- IOMMU region: an IOMMU region translates addresses of accesses made to it
  and forwards them to some other target memory region. As the name suggests,
  these are only needed for modelling an IOMMU, not for simple devices.
  You initialize these with memory_region_init_iommu().

- container: a container simply includes other memory regions, each at
  a different offset. Containers are useful for grouping several regions
  into one unit. For example, a PCI BAR may be composed of a RAM region
  and an MMIO region.

  A container's subregions are usually non-overlapping. In some cases it is
  useful to have overlapping regions; for example a memory controller that
  can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
  that does not prevent cards from claiming overlapping BARs.

  You initialize a pure container with memory_region_init().

- alias: a subsection of another region. Aliases allow a region to be
  split apart into discontiguous regions. Examples of uses are memory banks
  used when the guest address space is smaller than the amount of RAM
  addressed, or a memory controller that splits main memory to expose a "PCI
  hole". Aliases may point to any type of region, including other aliases,
  but an alias may not point back to itself, directly or indirectly.
  You initialize these with memory_region_init_alias().

- reservation region: a reservation region is primarily for debugging.
  It claims I/O space that is not supposed to be handled by QEMU itself.
  The typical use is to track parts of the address space which will be
  handled by the host kernel when KVM is enabled. You initialize these
  by passing a NULL callback parameter to memory_region_init_io().

It is valid to add subregions to a region which is not a pure container
(that is, to an MMIO, RAM or ROM region). This means that the region
will act like a container, except that any addresses within the container's
region which are not claimed by any subregion are handled by the
container itself (i.e. by its MMIO callbacks or RAM backing). However
it is generally possible to achieve the same effect with a pure container
one of whose subregions is a low priority "background" region covering
the whole address range; this is often clearer and is preferred.
Subregions cannot be added to an alias region.

Migration
---------

Where the memory region is backed by host memory (RAM, ROM and
ROM device memory region types), this host memory needs to be
copied to the destination on migration. These APIs which allocate
the host memory for you will also register the memory so it is
migrated:

- memory_region_init_ram()
- memory_region_init_rom()
- memory_region_init_rom_device()

For most devices and boards this is the correct thing. If you
have a special case where you need to manage the migration of
the backing memory yourself, you can call the functions:

- memory_region_init_ram_nomigrate()
- memory_region_init_rom_nomigrate()
- memory_region_init_rom_device_nomigrate()

which only initialize the MemoryRegion and leave handling
migration to the caller.

The functions:

- memory_region_init_resizeable_ram()
- memory_region_init_ram_from_file()
- memory_region_init_ram_from_fd()
- memory_region_init_ram_ptr()
- memory_region_init_ram_device_ptr()

are for special cases only, and so they do not automatically
register the backing memory for migration; the caller must
manage migration if necessary.

Region names
------------

Regions are assigned names by the constructor. For most regions these are
only used for debugging purposes, but RAM regions also use the name to identify
live migration sections. This means that RAM region names need to have ABI
stability.

Region lifecycle
----------------

A region is created by one of the memory_region_init*() functions and
attached to an object, which acts as its owner or parent. QEMU ensures
that the owner object remains alive as long as the region is visible to
the guest, or as long as the region is in use by a virtual CPU or another
device. For example, the owner object will not die between an
address_space_map operation and the corresponding address_space_unmap.

After creation, a region can be added to an address space or a
container with memory_region_add_subregion(), and removed using
memory_region_del_subregion().

Various region attributes (read-only, dirty logging, coalesced mmio,
ioeventfd) can be changed during the region lifecycle. They take effect
as soon as the region is made visible. This can be immediately, later,
or never.

Destruction of a memory region happens automatically when the owner
object dies.

If however the memory region is part of a dynamically allocated data
structure, you should call object_unparent() to destroy the memory region
before the data structure is freed. For an example see VFIOMSIXInfo
and VFIOQuirk in hw/vfio/pci.c.

You must not destroy a memory region as long as it may be in use by a
device or CPU. To ensure this, as a general rule do not create or
destroy memory regions dynamically during a device's lifetime, and only
call object_unparent() in the memory region owner's instance_finalize
callback. The dynamically allocated data structure that contains the
memory region should then be freed in the instance_finalize
callback as well.

If you break this rule, the following situation can happen:

- the memory region's owner had a reference taken via memory_region_ref
  (for example by address_space_map)

- the region is unparented, and has no owner anymore

- when address_space_unmap is called, the reference to the memory region's
  owner is leaked.


There is an exception to the above rule: it is okay to call
object_unparent at any time for an alias or a container region. It is
therefore also okay to create or destroy alias and container regions
dynamically during a device's lifetime.

This exceptional usage is valid because aliases and containers only help
QEMU build the guest's memory map; they are never accessed directly.
memory_region_ref and memory_region_unref are never called on aliases
or containers, and the above situation then cannot happen. Exploiting
this exception is rarely necessary, and therefore it is discouraged,
but nevertheless it is used in a few places.

For regions that "have no owner" (NULL is passed at creation time), the
machine object is actually used as the owner. Since instance_finalize is
never called for the machine object, you must never call object_unparent
on regions that have no owner, unless they are aliases or containers.


Overlapping regions and priority
--------------------------------
Usually, regions may not overlap each other; a memory address decodes into
exactly one target. In some cases it is useful to allow regions to overlap,
and sometimes to control which of the overlapping regions is visible to the
guest. This is done with memory_region_add_subregion_overlap(), which
allows the region to overlap any other region in the same container, and
specifies a priority that allows the core to decide which of two regions at
the same address is visible (highest wins).
Priority values are signed, and the default value is zero. This means that
you can use memory_region_add_subregion_overlap() both to specify a region
that must sit 'above' any others (with a positive priority) and also a
background region that sits 'below' others (with a negative priority).

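The effect of signed priorities among sibling regions can be sketched with
a small standalone model. This is not QEMU code: ``ToyRegion``, ``toy_pick()``
and the ``toy_layout`` table are hypothetical names invented for this
illustration, and the layout (a background region at priority -1, RAM at the
default priority 0, an overlay at priority 1) is an assumed example.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a subregion mapped into one container (not a MemoryRegion). */
typedef struct ToyRegion {
    const char *name;
    uint64_t offset;   /* start address within the container */
    uint64_t size;     /* length in bytes */
    int priority;      /* signed; higher wins, 0 is the default */
} ToyRegion;

/* Return the highest-priority sibling covering addr, or NULL if none does. */
static const ToyRegion *toy_pick(const ToyRegion *regions, size_t n,
                                 uint64_t addr)
{
    const ToyRegion *best = NULL;

    for (size_t i = 0; i < n; i++) {
        const ToyRegion *r = &regions[i];

        if (addr < r->offset || addr - r->offset >= r->size) {
            continue;               /* addr is outside this sibling */
        }
        if (best == NULL || r->priority > best->priority) {
            best = r;
        }
    }
    return best;
}

/* Hypothetical layout: all three regions live in the same container. */
static const ToyRegion toy_layout[] = {
    { "background", 0x0000, 0x10000, -1 },  /* sits 'below' everything  */
    { "ram",        0x0000, 0x08000,  0 },  /* default priority         */
    { "overlay",    0x4000, 0x01000,  1 },  /* sits 'above' the RAM     */
};
```

With this layout, ``toy_pick(toy_layout, 3, 0x4800)`` selects the overlay,
``0x1000`` selects the RAM, and ``0x9000`` falls through to the background
region because nothing else covers it.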
If the higher priority region in an overlap is a container or alias, then
the lower priority region will appear in any "holes" that the higher priority
region has left by not mapping subregions to that area of its address range.
(This applies recursively -- if the subregions are themselves containers or
aliases that leave holes then the lower priority region will appear in these
holes too.)

For example, suppose we have a container A of size 0x8000 with two subregions
B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is
an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two
of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at
offset 0x2000. As a diagram::

        0      1000   2000   3000   4000   5000   6000   7000   8000
        |------|------|------|------|------|------|------|------|
  A:    [                                                       ]
  C:    [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
  B:                  [                          ]
  D:                  [DDDDD]
  E:                                [EEEEE]

The regions that will be seen within this address range then are::

        [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]

Since B has higher priority than C, its subregions appear in the flat map
even where they overlap with C. In ranges where B has not mapped anything
C's region appears.

If B had provided its own MMIO operations (i.e. it was not a pure container)
then these would be used for any addresses in its range not handled by
D or E, and the result would be::

        [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB]

Priority values are local to a container, because the priorities of two
regions are only compared when they are both children of the same container.
This means that the device in charge of the container (typically modelling
a bus or a memory controller) can use them to manage the interaction of
its child regions without any side effects on other parts of the system.
In the example above, the priorities of D and E are unimportant because
they do not overlap each other. It is the relative priority of B and C
that causes D and E to appear on top of C: D and E's priorities are never
compared against the priority of C.

Visibility
----------
The memory core uses the following rules to select a memory region when the
guest accesses an address:

- all direct subregions of the root region are matched against the address, in
  descending priority order

  - if the address lies outside the region offset/size, the subregion is
    discarded
  - if the subregion is a leaf (RAM or MMIO), the search terminates, returning
    this leaf region
  - if the subregion is a container, the same algorithm is used within the
    subregion (after the address is adjusted by the subregion offset)
  - if the subregion is an alias, the search is continued at the alias target
    (after the address is adjusted by the subregion offset and alias offset)
  - if a recursive search within a container or alias subregion does not
    find a match (because of a "hole" in the container's coverage of its
    address range), then if this is a container with its own MMIO or RAM
    backing the search terminates, returning the container itself. Otherwise
    we continue with the next subregion in priority order

- if none of the subregions match the address then the search terminates
  with no match found

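The selection rules above can be sketched as a recursive lookup over a toy
region tree. This is a simplified standalone model, not the memory core's
implementation: ``ToyMR``, ``toy_resolve()`` and the ``toy_*`` objects are
hypothetical names, and the model assumes each container's subregion array
is already sorted in descending priority order. The tree reuses the
container A with subregions B, C, D and E from the priority example in this
document.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_LEAF, TOY_CONTAINER, TOY_ALIAS } ToyKind;

typedef struct ToyMR ToyMR;
struct ToyMR {
    const char *name;
    ToyKind kind;
    uint64_t offset;             /* position inside the parent container */
    uint64_t size;
    int has_backing;             /* container with its own MMIO/RAM backing */
    const ToyMR *alias_target;   /* TOY_ALIAS only */
    uint64_t alias_offset;       /* TOY_ALIAS only */
    const ToyMR *const *subs;    /* TOY_CONTAINER only, descending priority */
    size_t nsubs;
};

/* Look up addr (relative to the start of mr); NULL means "no match". */
static const ToyMR *toy_resolve(const ToyMR *mr, uint64_t addr)
{
    if (addr >= mr->size) {
        return NULL;                       /* outside offset/size: discard */
    }
    switch (mr->kind) {
    case TOY_LEAF:
        return mr;                         /* leaf: search terminates */
    case TOY_ALIAS:                        /* continue at the alias target */
        return toy_resolve(mr->alias_target, addr + mr->alias_offset);
    case TOY_CONTAINER:
        for (size_t i = 0; i < mr->nsubs; i++) {
            const ToyMR *sub = mr->subs[i];
            const ToyMR *hit;

            if (addr < sub->offset) {
                continue;
            }
            hit = toy_resolve(sub, addr - sub->offset);
            if (hit != NULL) {
                return hit;
            }
        }
        /* Hole: a backed container claims it, otherwise report no match
         * so that the caller tries its next subregion. */
        return mr->has_backing ? mr : NULL;
    }
    return NULL;
}

static const ToyMR toy_d = { .name = "D", .kind = TOY_LEAF,
                             .offset = 0x0000, .size = 0x1000 };
static const ToyMR toy_e = { .name = "E", .kind = TOY_LEAF,
                             .offset = 0x2000, .size = 0x1000 };
static const ToyMR *const toy_b_subs[] = { &toy_d, &toy_e };
static const ToyMR toy_b = { .name = "B", .kind = TOY_CONTAINER,
                             .offset = 0x2000, .size = 0x4000,
                             .subs = toy_b_subs, .nsubs = 2 };
static const ToyMR toy_c = { .name = "C", .kind = TOY_LEAF,
                             .offset = 0x0000, .size = 0x6000 };
/* B (priority 2) is listed before C (priority 1). */
static const ToyMR *const toy_a_subs[] = { &toy_b, &toy_c };
static const ToyMR toy_a = { .name = "A", .kind = TOY_CONTAINER,
                             .size = 0x8000,
                             .subs = toy_a_subs, .nsubs = 2 };

/* Variant in which B has its own backing and therefore fills its holes. */
static const ToyMR toy_b_backed = { .name = "B", .kind = TOY_CONTAINER,
                                    .offset = 0x2000, .size = 0x4000,
                                    .has_backing = 1,
                                    .subs = toy_b_subs, .nsubs = 2 };
static const ToyMR *const toy_a2_subs[] = { &toy_b_backed, &toy_c };
static const ToyMR toy_a2 = { .name = "A", .kind = TOY_CONTAINER,
                              .size = 0x8000,
                              .subs = toy_a2_subs, .nsubs = 2 };
```

Resolving 0x3000 in ``toy_a`` returns C, because B is a pure container with
a hole at that address; in ``toy_a2`` the same address returns B itself.
Address 0x6000 matches nothing in either tree.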
Example memory map
------------------

::

  system_memory: container@0-2^48-1
   |
   +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff)
   |
   +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff)
   |
   +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff)
   |      (prio 1)
   |
   +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff)

  pci (0-2^32-1)
   |
   +--- vga-area: container@0xa0000-0xbffff
   |      |
   |      +--- alias@0x00000-0x7fff  ---> #vram (0x010000-0x017fff)
   |      |
   |      +--- alias@0x08000-0xffff  ---> #vram (0x020000-0x027fff)
   |
   +---- vram: ram@0xe1000000-0xe1ffffff
   |
   +---- vga-mmio: mmio@0xe2000000-0xe200ffff

  ram: ram@0x00000000-0xffffffff

This is a (simplified) PC memory map. The 4GB RAM block is mapped into the
system address space via two aliases: "lomem" is a 1:1 mapping of the first
3.5GB; "himem" maps the last 0.5GB at address 4GB. This leaves 0.5GB for the
so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with
4GB of memory.

The memory controller diverts addresses in the range 640K-768K to the PCI
address space. This is modelled using the "vga-window" alias, mapped at a
higher priority so it obscures the RAM at the same addresses. The vga window
can be removed by programming the memory controller; this is modelled by
removing the alias and exposing the RAM underneath.

The pci address space is not a direct child of the system address space, since
we only want parts of it to be visible (we accomplish this using aliases).
It has two subregions: vga-area models the legacy vga window and is occupied
by two 32K memory banks pointing at two sections of the framebuffer.
In addition the vram is mapped as a BAR at address 0xe1000000, and an additional
BAR containing MMIO registers is mapped after it.

Note that if the guest maps a BAR outside the PCI hole, it would not be
visible as the pci-hole alias clips it to a 0.5GB range.

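The address arithmetic behind the four system_memory aliases can be checked
with a short standalone sketch. It is a toy table, not QEMU code:
``ToyAlias``, ``ToyHit``, ``toy_map`` and ``toy_translate()`` are hypothetical
names, and the table simply copies the ranges and the one non-default
priority from the map above.

```c
#include <stddef.h>
#include <stdint.h>

/* One alias in the toy system_memory container. */
typedef struct ToyAlias {
    const char *name;
    uint64_t start;        /* first system address covered */
    uint64_t end;          /* last system address covered (inclusive) */
    const char *target;    /* name of the aliased region */
    uint64_t target_off;   /* offset in the target that 'start' maps to */
    int priority;
} ToyAlias;

typedef struct ToyHit {
    const ToyAlias *alias;   /* NULL if the address is unmapped */
    uint64_t target_addr;    /* address within alias->target */
} ToyHit;

static const ToyAlias toy_map[] = {
    { "lomem",      0x0ULL,         0xdfffffffULL,  "ram", 0x0ULL,        0 },
    { "himem",      0x100000000ULL, 0x11fffffffULL, "ram", 0xe0000000ULL, 0 },
    { "vga-window", 0xa0000ULL,     0xbffffULL,     "pci", 0xa0000ULL,    1 },
    { "pci-hole",   0xe0000000ULL,  0xffffffffULL,  "pci", 0xe0000000ULL, 0 },
};

/* Pick the highest-priority alias covering addr and translate the address. */
static ToyHit toy_translate(uint64_t addr)
{
    ToyHit hit = { NULL, 0 };

    for (size_t i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++) {
        const ToyAlias *a = &toy_map[i];

        if (addr < a->start || addr > a->end) {
            continue;
        }
        if (hit.alias == NULL || a->priority > hit.alias->priority) {
            hit.alias = a;
            hit.target_addr = a->target_off + (addr - a->start);
        }
    }
    return hit;
}
```

For instance, system address 0x100000000 (4GB) lands in "himem" and
translates to RAM offset 0xe0000000, while 0xa0000 is claimed by
"vga-window" rather than "lomem" because of its higher priority.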
MMIO Operations
---------------

MMIO regions are provided with ->read() and ->write() callbacks,
which are sufficient for most devices. Some devices change behaviour
based on the attributes used for the memory transaction, or need
to be able to respond that the access should provoke a bus error
rather than completing successfully; those devices can use the
->read_with_attrs() and ->write_with_attrs() callbacks instead.

In addition various constraints can be supplied to control how these
callbacks are called:

- .valid.min_access_size, .valid.max_access_size define the access sizes
  (in bytes) which the device accepts; accesses outside this range will
  have device and bus specific behaviour (ignored, or machine check)
- .valid.unaligned specifies that the *device being modelled* supports
  unaligned accesses; if false, unaligned accesses will invoke the
  appropriate bus or CPU specific behaviour.
- .impl.min_access_size, .impl.max_access_size define the access sizes
  (in bytes) supported by the *implementation*; other access sizes will be
  emulated using the ones available. For example a 4-byte write will be
  emulated using four 1-byte writes, if .impl.max_access_size = 1.
- .impl.unaligned specifies that the *implementation* supports unaligned
  accesses; if false, unaligned accesses will be emulated by two aligned
  accesses.
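
The .impl.max_access_size example above (a 4-byte write emulated as four
1-byte writes) can be illustrated with a standalone sketch. It is a toy
model under stated assumptions, not the memory core's code: ``ToyWriteFn``,
``toy_split_write()``, ``ToyDev`` and ``toy_dev_write()`` are hypothetical
names, the callback only loosely mirrors the shape of a ->write() callback,
the byte order is assumed to be little-endian, access sizes are assumed to
be powers of two, and .impl.min_access_size is ignored.

```c
#include <stdint.h>

/* Toy write callback: (opaque, address, data, size in bytes). */
typedef void (*ToyWriteFn)(void *opaque, uint64_t addr, uint64_t data,
                           unsigned size);

/*
 * Emulate a write of 'size' bytes using accesses no larger than impl_max.
 * Assumes little-endian ordering and power-of-two sizes (1, 2, 4 or 8).
 */
static void toy_split_write(ToyWriteFn write, void *opaque, uint64_t addr,
                            uint64_t value, unsigned size, unsigned impl_max)
{
    unsigned chunk = size < impl_max ? size : impl_max;
    uint64_t mask = chunk >= 8 ? UINT64_MAX : (1ULL << (chunk * 8)) - 1;

    for (unsigned i = 0; i < size; i += chunk) {
        /* Byte i of the value goes to address addr + i. */
        write(opaque, addr + i, (value >> (i * 8)) & mask, chunk);
    }
}

/* Toy device whose implementation only handles 1-byte writes. */
typedef struct ToyDev {
    uint8_t regs[8];
    unsigned calls;     /* number of times the callback was invoked */
} ToyDev;

static void toy_dev_write(void *opaque, uint64_t addr, uint64_t data,
                          unsigned size)
{
    ToyDev *d = opaque;

    (void)size;                      /* always 1 in this toy device */
    d->regs[addr % sizeof(d->regs)] = (uint8_t)data;
    d->calls++;
}

static ToyDev toy_dev;               /* zero-initialised example device */
```

Calling ``toy_split_write(toy_dev_write, &toy_dev, 0, 0x11223344, 4, 1)``
invokes the callback four times and stores 0x44, 0x33, 0x22 and 0x11 in
``regs[0]`` to ``regs[3]``.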