..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
system-mode emulation. User-mode emulation has always mirrored the
thread structure of the translated executable, although some of the
changes done for MTTCG system emulation have improved the stability of
linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all FE/BE
combinations where the host memory model is able to accommodate the
guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

   tb_jmp_cache (per-vCPU, cache of recent jumps)
   tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page this code
is critical to performance as looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention to do it.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
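
To make the lockless hot-path pattern concrete, here is a minimal
sketch of a per-vCPU jump cache built from C11 atomic slots. The
structure and function names (JmpCache, jc_lookup, jc_insert) and the
cache geometry are illustrative assumptions, not QEMU's actual
tb_jmp_cache implementation:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sizes only; QEMU's real cache is larger. */
#define JMP_CACHE_BITS 4
#define JMP_CACHE_SIZE (1u << JMP_CACHE_BITS)

struct TB {
    uintptr_t pc;               /* guest PC this block translates */
};

/* Per-vCPU cache: one atomic pointer per slot, no lock needed. */
struct JmpCache {
    _Atomic(struct TB *) slot[JMP_CACHE_SIZE];
};

static inline unsigned jc_hash(uintptr_t pc)
{
    return (pc >> 2) & (JMP_CACHE_SIZE - 1);
}

/* Readers use an acquire load so a fully built TB is visible. */
struct TB *jc_lookup(struct JmpCache *jc, uintptr_t pc)
{
    struct TB *tb = atomic_load_explicit(&jc->slot[jc_hash(pc)],
                                         memory_order_acquire);
    if (tb && tb->pc == pc) {
        return tb;
    }
    return NULL;                /* miss: fall back to the QHT slow path */
}

/* Writers publish with a release store after the TB is complete. */
void jc_insert(struct JmpCache *jc, struct TB *tb)
{
    atomic_store_explicit(&jc->slot[jc_hash(tb->pc)], tb,
                          memory_order_release);
}
```

Readers never block: a miss simply falls through to the slower hash
table lookup, and a racing insert at worst causes one redundant
slow-path lookup.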

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle, including any
post-generation patching of the translated code. This also implies a
shared translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and start
from scratch again. Some operations also force a full flush of
translations, including:

- debugging operations (breakpoint insertion/removal)
- some CPU helper functions
- linux-user spawning its first thread

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.
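
The quiescence guarantee can be sketched as a start/end-exclusive
rendezvous around the vCPU execution loop. QEMU's real mechanism (the
exclusive work support in cpus-common.c) is considerably more
involved, so treat the names and structure here as illustrative only:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t cpu_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t exclusive_cond = PTHREAD_COND_INITIALIZER;
static pthread_cond_t exclusive_resume = PTHREAD_COND_INITIALIZER;
static int n_running;           /* vCPU threads currently executing TBs */
static bool exclusive_pending;  /* a safe-work request is waiting */

/* Called by a vCPU thread before it enters the execution loop. */
void cpu_exec_start(void)
{
    pthread_mutex_lock(&cpu_list_lock);
    while (exclusive_pending) {         /* park until safe work is done */
        pthread_cond_wait(&exclusive_resume, &cpu_list_lock);
    }
    n_running++;
    pthread_mutex_unlock(&cpu_list_lock);
}

/* Called when the vCPU thread leaves the execution loop. */
void cpu_exec_end(void)
{
    pthread_mutex_lock(&cpu_list_lock);
    n_running--;
    if (exclusive_pending && n_running == 0) {
        pthread_cond_signal(&exclusive_cond);
    }
    pthread_mutex_unlock(&cpu_list_lock);
}

/* Run fn(opaque) only once every vCPU is quiescent. */
void run_safe_work(void (*fn)(void *), void *opaque)
{
    pthread_mutex_lock(&cpu_list_lock);
    exclusive_pending = true;
    while (n_running > 0) {             /* wait for vCPUs to drain */
        pthread_cond_wait(&exclusive_cond, &cpu_list_lock);
    }
    fn(opaque);                         /* e.g. flush the codegen buffer */
    exclusive_pending = false;
    pthread_cond_broadcast(&exclusive_resume);
    pthread_mutex_unlock(&cpu_list_lock);
}

/* Single-threaded demo: a vCPU enters and leaves the loop, then the
 * safe work runs while no vCPU is running. */
static int flush_done;
static void flush_cb(void *opaque) { (void)opaque; flush_done = 1; }

int safe_work_demo(void)
{
    cpu_exec_start();
    cpu_exec_end();
    run_safe_work(flush_cb, NULL);
    return flush_done;
}
```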

More granular translation invalidation events are typically due
to a change of the state of a physical page:

- code modification (self-modifying code, patching code)
- page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the C code.

There are a number of look-up caches that need to be properly updated
including:

- the jump lookup cache
- the physical-to-tb lookup hash table
- the global page table

The global page table (l1_map) provides a multi-level look-up
for PageDesc structures which contain pointers to the start of a
linked list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
- safely patch/revert direct jumps
- remove central PageDesc lookup entries
- ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.
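
A minimal sketch of the destination-lock idiom follows. The field and
function names are hypothetical, chosen for the example; QEMU's
TranslationBlock uses different, denser encodings for its jump lists:

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-in for a TranslationBlock's jump bookkeeping. */
struct TransBlock {
    pthread_mutex_t jmp_lock;      /* protects the incoming-jump list */
    struct TransBlock *jmp_list_head; /* blocks that jump into this one */
    struct TransBlock *jmp_list_next; /* link when on another's list */
    struct TransBlock *jmp_dest;      /* where our direct jump goes */
};

/* Record that 'from' jumps directly into 'to': take the lock of the
 * *destination* block, as the text above describes. */
void tb_add_jump(struct TransBlock *from, struct TransBlock *to)
{
    pthread_mutex_lock(&to->jmp_lock);
    from->jmp_dest = to;
    from->jmp_list_next = to->jmp_list_head;
    to->jmp_list_head = from;
    pthread_mutex_unlock(&to->jmp_lock);
}

/* On invalidation of 'tb', unlink every block that jumps into it.
 * Holding tb->jmp_lock keeps the list stable while we walk it. */
int tb_reset_jumps(struct TransBlock *tb)
{
    int n = 0;
    pthread_mutex_lock(&tb->jmp_lock);
    for (struct TransBlock *from = tb->jmp_list_head; from; ) {
        struct TransBlock *next = from->jmp_list_next;
        from->jmp_dest = NULL;     /* real code would patch the jump back */
        from->jmp_list_next = NULL;
        from = next;
        n++;
    }
    tb->jmp_list_head = NULL;
    pthread_mutex_unlock(&tb->jmp_lock);
    return n;
}
```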

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.
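
The cmpxchg insertion idiom can be sketched for a single level of such
a tree: each thread allocates a candidate node and tries to publish it
with a compare-and-swap; losers free their candidate and adopt the
winner's. The names and sizes here are illustrative, not l1_map's:

```c
#include <stdatomic.h>
#include <stdlib.h>

#define LEVEL_ENTRIES 512          /* illustrative fan-out */

struct Level {
    _Atomic(struct Level *) ptr[LEVEL_ENTRIES];
};

/* Descend one level, allocating the next node if it is missing. */
struct Level *level_get_or_alloc(struct Level *parent, unsigned idx)
{
    struct Level *next = atomic_load_explicit(&parent->ptr[idx],
                                              memory_order_acquire);
    if (next) {
        return next;
    }
    struct Level *fresh = calloc(1, sizeof(*fresh));
    struct Level *expected = NULL;
    if (atomic_compare_exchange_strong(&parent->ptr[idx],
                                       &expected, fresh)) {
        return fresh;              /* we won: our node is published */
    }
    free(fresh);                   /* lost the race: discard ours */
    return expected;               /* 'expected' now holds the winner */
}
```

Because the slot only ever goes from NULL to a valid pointer, readers
need no lock at all: an acquire load either sees nothing or a fully
initialised node.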

The lookup caches are updated atomically and the lookup hash uses QHT,
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

- Memory regions (dividing up access to PIO, MMIO and RAM)
- Dirty page tracking (for code gen, SMC detection, migration and display)
- Virtual TLB (for translating guest address->real address)
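
The flag trick works because the generated fast-path comparison
includes the flag bits, so any set flag forces a mismatch and a
slow-path exit. The flag names and page size below are illustrative
(QEMU's real flags, such as TLB_MMIO, live in the low bits of the TLB
entry's address field):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits stored below the page-size boundary. */
#define TLB_INVALID   (1u << 0)
#define TLB_MMIO      (1u << 1)
#define TLB_NOTDIRTY  (1u << 2)
#define TLB_FLAG_MASK (TLB_INVALID | TLB_MMIO | TLB_NOTDIRTY)
#define PAGE_MASK     (~(uintptr_t)0xfff)   /* assume 4K pages */

struct TLBEntry {
    uintptr_t addr;     /* page address | flag bits */
};

/* Fast-path check as generated code would perform it: the flag bits
 * take part in the comparison, so a flagged entry can never match a
 * page-aligned access address and the slow path is forced. */
bool tlb_hit_fast(const struct TLBEntry *e, uintptr_t vaddr)
{
    return (vaddr & PAGE_MASK) ==
           (e->addr & (PAGE_MASK | TLB_FLAG_MASK));
}
```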

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

- TLB Flush All/Page
  - can be across-vCPUs
  - cross-vCPU TLB flush may need other vCPUs brought to halt
  - change may need to be visible to the calling vCPU immediately
- TLB Flag Update
  - usually cross-vCPU
  - want change to be visible as soon as possible
- TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
  - This is a per-vCPU table - by definition can't race
  - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state
when it next runs its work (a few instructions later).

A new set of operations (tlb_flush_*_all_cpus) takes an additional
flag which, when set, will force synchronisation by setting the source
vCPU's work as "safe work" and exiting the cpu run loop. This ensures
that by the time execution restarts all flush operations have
completed.
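
A minimal sketch of the deferral pattern, assuming a simplified
per-vCPU work list rather than QEMU's actual cpus-common.c
implementation; the point is only that the flush runs on the target
vCPU's own thread, at a point where its state is coherent:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

struct WorkItem {
    void (*func)(void *opaque);
    void *opaque;
    struct WorkItem *next;
};

struct VCPU {
    pthread_mutex_t work_lock;
    struct WorkItem *work_list;
};

/* Any thread may queue work for a target vCPU... */
void async_work_on_cpu(struct VCPU *cpu, void (*func)(void *),
                       void *opaque)
{
    struct WorkItem *wi = malloc(sizeof(*wi));
    wi->func = func;
    wi->opaque = opaque;
    pthread_mutex_lock(&cpu->work_lock);
    wi->next = cpu->work_list;
    cpu->work_list = wi;
    pthread_mutex_unlock(&cpu->work_lock);
}

/* ...which the vCPU drains at the top of its execution loop, so the
 * effect (e.g. a TLB flush) is seen before the next guest insn. */
int process_queued_work(struct VCPU *cpu)
{
    int done = 0;
    pthread_mutex_lock(&cpu->work_lock);
    struct WorkItem *list = cpu->work_list;
    cpu->work_list = NULL;
    pthread_mutex_unlock(&cpu->work_lock);
    while (list) {
        struct WorkItem *next = list->next;
        list->func(list->opaque);
        free(list);
        list = next;
        done++;
    }
    return done;
}

/* Stand-in for a deferred TLB flush callback. */
static int flush_count;
static void tlb_flush_cb(void *opaque) { (void)opaque; flush_count++; }
```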

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is automatically
protected by the global iothread mutex, also known as the BQL (Big
QEMU Lock). Any IO region that doesn't use the global mutex is expected
to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.
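
The signal-flag/payload idiom looks like this in C11 atomics; this is
a generic sketch, not QEMU code:

```c
#include <stdatomic.h>

/* Lockless publish/consume: the release store on 'ready' ensures the
 * payload write is visible before the flag, and the acquire load
 * ensures the reader sees the payload after observing the flag. */

static int payload;             /* plain data, written before publish */
static atomic_int ready;        /* signal flag */

void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Spin until the flag is set, then read the payload safely. */
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        /* spin; a real consumer would yield or block */
    }
    return payload;
}
```

On a strongly ordered host the release/acquire pair may compile to
plain loads and stores; on a weakly ordered host it emits the
barriers this section is about.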

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
- host systems with stronger implied guarantees can skip some barriers
- merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

- target-i386
- target-arm
- target-aarch64
- target-alpha
- target-mips

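
The merging idea can be sketched by treating barrier strength as a
bitmask of ordering requirements, so a run of adjacent barriers folds
into a single barrier carrying the union of their bits. The flag
names below are illustrative, not TCG's actual barrier flags:

```c
/* Each bit names one ordering the barrier must enforce. */
enum {
    MB_LD_LD = 1 << 0,      /* loads ordered against later loads */
    MB_LD_ST = 1 << 1,      /* loads ordered against later stores */
    MB_ST_LD = 1 << 2,      /* stores ordered against later loads */
    MB_ST_ST = 1 << 3,      /* stores ordered against later stores */
    MB_FULL  = MB_LD_LD | MB_LD_ST | MB_ST_LD | MB_ST_ST,
};

/* Fold a run of consecutive barrier ops into their strongest form;
 * a real optimizer pass would also rewrite the op stream so that
 * only one barrier op remains. */
int merge_barriers(const int *masks, int n)
{
    int merged = 0;
    for (int i = 0; i < n; i++) {
        merged |= masks[i];
    }
    return merged;
}
```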
Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offer a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.
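
For example, a cmpxchg-based lock acquisition in C11 atomics (a
generic sketch, not guest code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A compare-and-swap either installs the new value atomically or
 * reports failure; there is no window in which another core can
 * observe a half-done update. */
bool try_take_lock(atomic_int *lock, int owner)
{
    int expected = 0;               /* 0 == unlocked */
    return atomic_compare_exchange_strong(lock, &expected, owner);
}
```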

The second type offer a pair of load/store instructions which offer a
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair, where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?
(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically presenting a locking ABI which
assumes cmpxchg-like semantics.
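
A sketch of how an ldrex/strex-style pair reduces to a host cmpxchg,
and where ABA enters; the function names are illustrative, not the
actual tcg_gen_atomic_* helpers. The exclusiveness check degrades to
"does the location still hold the value we loaded?", so if the value
changed from A to B and back to A between the pair, the store still
succeeds even though real hardware's strex would have failed:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* "Load exclusive": remember the value seen at the address. */
uint32_t load_exclusive(_Atomic uint32_t *addr, uint32_t *marker)
{
    *marker = atomic_load(addr);
    return *marker;
}

/* "Store exclusive": succeed only if the location still holds the
 * remembered value - a value check, not a true access check, hence
 * the ABA exposure described above. */
bool store_exclusive(_Atomic uint32_t *addr, uint32_t marker,
                     uint32_t val)
{
    uint32_t expected = marker;
    return atomic_compare_exchange_strong(addr, &expected, val);
}
```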

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.