Copyright (c) 2015-2016 Linaro Ltd.

This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.

Introduction
============

This document outlines the design for multi-threaded TCG system-mode
emulation. The current user-mode emulation mirrors the thread
structure of the translated executable. Some of the work will be
applicable to both system and linux-user emulation.

The original system-mode TCG implementation was single-threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This will be enabled by default for all
frontend/backend (FE/BE) combinations that have had the required work
done to support this safely.

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

  tb_jmp_cache (per-vCPU, cache of recent jumps)
  tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention needed
to do so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
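
As an illustration of the hot-path pattern (a sketch, not QEMU's
actual code; all names here are invented), a per-vCPU jump cache can
be read and written with C11 atomics so readers never take a lock:

    /* Sketch: lockless per-vCPU jump cache using C11 atomics. */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stddef.h>

    #define JMP_CACHE_SIZE 4096

    struct TB {
        uint64_t pc;        /* guest PC this block starts at */
        /* ... generated code pointer, jump lists, etc ... */
    };

    struct vcpu {
        _Atomic(struct TB *) jmp_cache[JMP_CACHE_SIZE];
    };

    static size_t jmp_cache_hash(uint64_t pc)
    {
        return (pc >> 2) & (JMP_CACHE_SIZE - 1);
    }

    /* Hot path: a single acquire load, no lock. A miss or stale
     * entry just falls back to the slower QHT-style lookup. */
    static struct TB *lookup_tb_fast(struct vcpu *cpu, uint64_t pc)
    {
        struct TB *tb = atomic_load_explicit(
            &cpu->jmp_cache[jmp_cache_hash(pc)], memory_order_acquire);
        return (tb && tb->pc == pc) ? tb : NULL;
    }

    /* Publishing is a single release store: concurrent readers see
     * either the old or the new pointer, never a torn value. */
    static void cache_tb(struct vcpu *cpu, struct TB *tb)
    {
        atomic_store_explicit(&cpu->jmp_cache[jmp_cache_hash(tb->pc)],
                              tb, memory_order_release);
    }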

Global TCG State
----------------

We need to protect the entire code generation cycle, including any
post-generation patching of the translated code. This also implies a
shared translation buffer which contains code running on all cores.
Any execution path that comes to the main run loop will need to hold
a mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need
to modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Mainly as part of the linux-user work, all code generation is
serialised with a tb_lock(). For SoftMMU, the tb_lock() also takes
the place of the mmap_lock() used in linux-user mode.
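
A minimal sketch of that serialisation pattern (not QEMU's actual
tb_lock() implementation) is a single process-wide mutex bracketing
every path that generates or patches code:

    /* Sketch: one global mutex serialises all code generation. */
    #include <pthread.h>
    #include <stdint.h>

    static pthread_mutex_t tb_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void tb_lock(void)   { pthread_mutex_lock(&tb_mutex); }
    static void tb_unlock(void) { pthread_mutex_unlock(&tb_mutex); }

    void generate_tb_for(uint64_t pc)
    {
        tb_lock();
        /* ... allocate from the shared translation buffer, emit
         * host code, insert into the lookup structures ... */
        tb_unlock();
    }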

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and a
restart from scratch. Some operations also force a full flush of
translations, including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.
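
As a sketch of the "safe work" idea (QEMU's real
async_safe_run_on_cpu() machinery is more involved; names here are
invented), the work only runs once every vCPU thread has parked
outside translated code:

    /* Sketch: run work only when all vCPU threads are quiescent. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static pthread_barrier_t quiesce;  /* init'd for the vCPU count */
    static atomic_bool safe_work_pending;
    static void (*safe_work)(void);

    /* Any thread can queue work; vCPUs must also be kicked out of
     * generated code so they reach their run-loop check soon. */
    void queue_safe_work(void (*fn)(void))
    {
        safe_work = fn;
        atomic_store(&safe_work_pending, true);
    }

    /* Polled by each vCPU thread at the top of its run loop. */
    void vcpu_maybe_run_safe_work(void)
    {
        if (!atomic_load(&safe_work_pending)) {
            return;
        }
        /* All vCPUs wait here, so none are in translated code. */
        if (pthread_barrier_wait(&quiesce)
            == PTHREAD_BARRIER_SERIAL_THREAD) {
            safe_work();              /* exactly one thread runs it */
            atomic_store(&safe_work_pending, false);
        }
        pthread_barrier_wait(&quiesce);   /* release together */
    }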

More granular translation invalidation events are typically due
to a change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of
other book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now-invalid blocks need their jump patches reverted so they will
return to the C code.

There are a number of look-up caches that need to be properly
updated, including:

  - the jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
  - safely patch/revert direct jumps
  - remove central PageDesc lookup entries
  - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modification to the linked lists that allow
searching for linked pages is done under the protection of the
tb_lock().
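
Conceptually (the real tb_set_jmp_target() rewrites host
instructions in place; this sketch models the jump as a data slot),
an atomically updated jump target means a concurrent executor sees
either the old or the new destination, never a half-written one:

    /* Sketch: atomic jump-target patching as a data slot. */
    #include <stdatomic.h>
    #include <stddef.h>

    struct TB;

    struct tb_jump {
        _Atomic(struct TB *) dest;  /* NULL: return to the C code */
    };

    static void set_jmp_target(struct tb_jump *jmp, struct TB *dest)
    {
        atomic_store_explicit(&jmp->dest, dest, memory_order_release);
    }

    /* Invalidation resets the jump so execution falls back to the
     * main loop, which can look the next TB up safely. */
    static void reset_jump(struct tb_jump *jmp)
    {
        set_jmp_target(jmp, NULL);
    }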

The global page table is protected by the tb_lock() in system-mode
and the mmap_lock() in linux-user mode.

The lookup caches are updated atomically and the lookup hash uses
QHT, which is designed for safe concurrent lookups.


Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will
allow a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)
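
A sketch of the flag trick (field and flag names invented for the
example): the flags live in the low bits of the page-aligned address
stored in the entry, so for an aligned access a single compare both
matches the page and rejects any flagged entry:

    /* Sketch: flag bits in a TLB entry force the slow path. */
    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS    12
    #define PAGE_MASK    (~(uint64_t)((1ULL << PAGE_BITS) - 1))
    #define TLB_MMIO     (1 << 0)  /* device memory: slow path */
    #define TLB_NOTDIRTY (1 << 1)  /* dirty tracking: slow path */
    #define TLB_FLAGS    (TLB_MMIO | TLB_NOTDIRTY)

    struct tlb_entry {
        uint64_t addr_write;  /* page address | flag bits */
        uintptr_t addend;     /* guest->host offset for RAM */
    };

    /* Fast path check for a page-aligned write at addr: any flag
     * bit set in the entry makes the compare fail, so flagged
     * pages always fall through to the slow-path helper. */
    static bool tlb_hit_fast(const struct tlb_entry *e, uint64_t addr)
    {
        return (addr & (PAGE_MASK | TLB_FLAGS)) == e->addr_write;
    }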

When a vCPU's TLB tables are updated by a thread other than its own,
we need to ensure it is done in a safe way so that no inconsistent
state is seen by the owning vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - a cross-vCPU TLB flush may need the other vCPU brought to a halt
    - the change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want the change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition it can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state
when it next runs its work (a few instructions later).

A new set of operations (tlb_flush_*_all_cpus) takes an additional
flag which, when set, will force synchronisation by setting the
source vCPU's work as "safe work" and exiting the cpu run loop. This
ensures that by the time execution restarts all flush operations have
completed.
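
A sketch of the deferral pattern (names invented; QEMU's cputlb.c
builds on async_run_on_cpu()): a flush aimed at another vCPU is
queued to run on that vCPU's own thread, so each TLB is only ever
written from its owning thread:

    /* Sketch: defer a TLB flush to the owning vCPU thread. */
    struct vcpu_state;  /* per-vCPU state, including its TLB */

    /* Assumed primitive: run fn(cpu) on cpu's own thread at the
     * next safe point in its execution loop. */
    extern void run_on_vcpu(struct vcpu_state *cpu,
                            void (*fn)(struct vcpu_state *));

    static void do_tlb_flush(struct vcpu_state *cpu)
    {
        (void)cpu;
        /* ... clear this vCPU's TLB tables; safe because we are
         * running on the owning thread ... */
    }

    void tlb_flush_vcpu(struct vcpu_state *current,
                        struct vcpu_state *target)
    {
        if (target == current) {
            do_tlb_flush(target);               /* own TLB: flush now */
        } else {
            run_on_vcpu(target, do_tlb_flush);  /* defer to owner */
        }
    }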

TLB flag updates are all done atomically and are also protected by
the tb_lock(), which is used by the functions that update the TLB in
bulk.

(Known limitation)

Not strictly a limitation, but the wait mechanism is overly strict
for some architectures, which only need flushes completed by a
barrier instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is
automatically protected by the global iothread mutex, also known as
the BQL (Big QEMU Lock). Any IO region that doesn't use the global
mutex is expected to do its own locking.

However, IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model-specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system, we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently ARM targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL, as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load
re-ordering can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or to just loads or stores.

The Linux kernel has an excellent write-up on the various forms of
memory barrier and the guarantees they can provide [1].

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.
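
The flag/payload pattern in C11 form (a generic illustration, not
QEMU code): the release store on the flag guarantees the payload
write is visible to any reader that observes the flag via an acquire
load:

    /* Sketch: publish a payload safely with release/acquire. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static int payload;
    static atomic_bool ready;

    void producer(void)
    {
        payload = 42;                            /* plain write */
        atomic_store_explicit(&ready, true,
                              memory_order_release);  /* publish */
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire)) {
            /* spin until the flag becomes visible */
        }
        return payload;  /* guaranteed to see 42, not a stale value */
    }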

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On strongly ordered single-core
(non-SMP) backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 loads/stores come with
fairly strong guarantees of sequential consistency, whereas ARM has
special variants of load/store instructions that imply
acquire/release semantics.

In the case of a strongly ordered guest architecture being emulated
on a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
  - host systems with stronger implied guarantees can skip some barriers
  - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context.
The tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated:

  - target-i386
  - target-arm
  - target-aarch64
  - target-alpha
  - target-mips
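
The merging idea can be sketched as follows (a simplified
illustration; the real pass lives in tcg/optimize.c and the mask
names are invented): barrier kinds form a bitmask, so collapsing two
adjacent barriers into their strongest form is a bitwise OR:

    /* Sketch: merge runs of adjacent barrier ops. */
    #include <stddef.h>

    enum {
        MB_LD_LD = 1 << 0,  /* order loads against loads   */
        MB_LD_ST = 1 << 1,  /* order loads against stores  */
        MB_ST_LD = 1 << 2,  /* order stores against loads  */
        MB_ST_ST = 1 << 3,  /* order stores against stores */
    };

    enum { OP_BARRIER, OP_LOAD, OP_STORE };

    struct op {
        int kind;
        unsigned mb_mask;   /* only meaningful for OP_BARRIER */
    };

    /* Rewrite ops[] in place, folding each barrier into a directly
     * preceding one; returns the new op count. */
    static size_t merge_barriers(struct op ops[], size_t n)
    {
        size_t out = 0;
        for (size_t i = 0; i < n; i++) {
            if (ops[i].kind == OP_BARRIER && out > 0 &&
                ops[out - 1].kind == OP_BARRIER) {
                ops[out - 1].mb_mask |= ops[i].mb_mask; /* strongest */
            } else {
                ops[out++] = ops[i];
            }
        }
        return out;
    }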

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour, these
instructions are often seen when code modification has taken place,
to ensure the changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which guarantees
that some sort of test-and-conditional-store will be truly atomic
w.r.t. other cores sharing access to the memory. The classic example
is the x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is ARM's ldrex/strex
pair, where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block,
so they will have completed before another CPU is scheduled. However
with the ability to have multiple threads running to emulate multiple
CPUs we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
ARM's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem, typically presenting a locking ABI which
assumes cmpxchg-like semantics.
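
A sketch of the combination (invented names; the real helpers live
in the TCG core) emulates a load-exclusive/store-exclusive pair with
a compare-and-swap, which is exactly where the ABA window comes from:
if the location changes A->B->A between the two steps, the store
still succeeds:

    /* Sketch: ldrex/strex emulated with compare-and-swap. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct excl_state {
        _Atomic uint32_t *addr;  /* address marked exclusive */
        uint32_t val;            /* value observed at ldrex time */
    };

    static uint32_t emu_ldrex(struct excl_state *s, _Atomic uint32_t *p)
    {
        s->addr = p;
        s->val = atomic_load(p);
        return s->val;
    }

    /* Returns true on a successful store (strex status 0). The
     * cmpxchg succeeds iff the value still matches the one read at
     * ldrex time - hence the ABA susceptibility noted above. */
    static bool emu_strex(struct excl_state *s, _Atomic uint32_t *p,
                          uint32_t newval)
    {
        if (s->addr != p) {
            return false;        /* not the exclusive address */
        }
        return atomic_compare_exchange_strong(p, &s->val, newval);
    }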

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now, there may be a
need to look at solutions that can more closely model the guest
architecture's semantics.

==========

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt