Copyright (c) 2015-2016 Linaro Ltd.

This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.

Introduction
============

This document outlines the design for multi-threaded TCG system-mode
emulation. The current user-mode emulation mirrors the thread
structure of the translated executable. Some of the work will be
applicable to both system and linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This will be enabled by default for all front-end
and back-end (FE/BE) combinations that have had the required work done
to support this safely.

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
They are used to look up the next translation block to execute and
include:

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, looking up
the next TB to execute is the most common reason to exit the generated
code, which makes this code critical to performance.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention while
doing so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
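
As a rough illustration of the lockless fast path described above, the
following sketch models a per-vCPU jump cache with C11 atomics. It is
not the QEMU implementation: DemoTB, DemoJmpCache, the demo_*
functions and the hash are hypothetical stand-ins, and the slow path
(hash table lookup or translation under a lock) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_JMP_CACHE_BITS 4
#define DEMO_JMP_CACHE_SIZE (1u << DEMO_JMP_CACHE_BITS)

/* Hypothetical, much reduced stand-in for a TranslationBlock. */
typedef struct DemoTB {
    uint64_t pc;
} DemoTB;

/* Per-vCPU cache of recently used blocks, indexed by a hash of the PC. */
typedef struct DemoJmpCache {
    _Atomic(DemoTB *) slot[DEMO_JMP_CACHE_SIZE];
} DemoJmpCache;

/* Assumed hash: drop the two low bits and keep DEMO_JMP_CACHE_BITS bits. */
static unsigned demo_hash(uint64_t pc)
{
    return (unsigned)(pc >> 2) & (DEMO_JMP_CACHE_SIZE - 1);
}

/* Fast path: one atomic load, no lock. Returns NULL on a miss. */
static DemoTB *demo_cache_lookup(DemoJmpCache *c, uint64_t pc)
{
    DemoTB *tb = atomic_load_explicit(&c->slot[demo_hash(pc)],
                                      memory_order_acquire);
    return (tb != NULL && tb->pc == pc) ? tb : NULL;
}

/* Called once the slow path has found or generated a block. */
static void demo_cache_insert(DemoJmpCache *c, DemoTB *tb)
{
    assert(tb != NULL);
    atomic_store_explicit(&c->slot[demo_hash(tb->pc)], tb,
                          memory_order_release);
}

/* Invalidation only has to clear the slot with an atomic store. */
static void demo_cache_evict(DemoJmpCache *c, uint64_t pc)
{
    atomic_store_explicit(&c->slot[demo_hash(pc)], NULL,
                          memory_order_release);
}
```

A reader either sees the old pointer or the new one, never a torn
value, which is why the fast path needs no lock in this sketch.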

Global TCG State
----------------

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Mainly as part of the linux-user work all code generation is
serialised with a tb_lock(). For the SoftMMU tb_lock() also takes the
place of mmap_lock() in linux-user.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which when full will force a flush of all translations and start from
scratch again. Some operations also force a full flush of translations
including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due
to a change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the C code.

There are a number of look-up caches that need to be properly updated
including the:

  - jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
                      - safely patch/revert direct jumps
                      - remove central PageDesc lookup entries
                      - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are made under the protection of the
tb_lock().

The global page table is protected by the tb_lock() in system-mode and
mmap_lock() in linux-user mode.

The lookup caches are updated atomically and the lookup hash uses QHT,
which is designed for concurrent-safe lookup.
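
The jump patching and its reversal can be pictured with the sketch
below. It is an assumption-laden model rather than the real
tb_set_jmp_target() code: the jump target is a plain atomic word
instead of patched host code, the list of incoming jumps is a simple
array, and the caller is assumed to hold whatever lock protects that
list. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a block with one patchable exit jump. */
typedef struct DemoTB {
    _Atomic uintptr_t jmp_target; /* where the exit jump currently goes */
    uintptr_t epilogue;           /* address that returns to the C code */
    atomic_bool invalid;          /* set when the block must not be used */
} DemoTB;

static void demo_tb_init(DemoTB *tb, uintptr_t epilogue)
{
    assert(tb != NULL);
    tb->epilogue = epilogue;
    atomic_init(&tb->jmp_target, epilogue);
    atomic_init(&tb->invalid, false);
}

/* Chain src to dest_code with a single atomic store, so a vCPU running
 * src observes either the old or the new target. */
static void demo_tb_link(DemoTB *src, uintptr_t dest_code)
{
    atomic_store_explicit(&src->jmp_target, dest_code,
                          memory_order_release);
}

/* Revert the patch: the exit goes back to the C code. */
static void demo_tb_unlink(DemoTB *src)
{
    atomic_store_explicit(&src->jmp_target, src->epilogue,
                          memory_order_release);
}

/* Mark dest invalid and revert every block that jumps into it.
 * The caller is assumed to hold the lock protecting incoming[]. */
static void demo_tb_invalidate(DemoTB *dest, DemoTB *incoming[], size_t n)
{
    atomic_store(&dest->invalid, true);
    for (size_t i = 0; i < n; i++) {
        demo_tb_unlink(incoming[i]);
    }
}
```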


Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which once populated will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than their own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - cross vCPU TLB flush may need other vCPU brought to halt
    - change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state when
it next runs its work (in a few instructions' time).

A new set of operations (tlb_flush_*_all_cpus) take an additional flag
which when set will force synchronisation by setting the source vCPU's
work as "safe work" and exiting the cpu run loop. This ensures that by
the time execution restarts all flush operations have completed.
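
The idea of deferring a cross-vCPU flush to the owning thread can be
sketched as follows. This deliberately replaces the
async_run_on_cpu() work queue with a much simpler pending bitmap, so
it only illustrates the principle that other threads record a request
and the owner performs it. DemoVCPU, the table sizes and the demo_*
functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define DEMO_NB_MMU_MODES 4
#define DEMO_TLB_SIZE     8

/* Hypothetical vCPU: the TLB is only ever written by its own thread. */
typedef struct DemoVCPU {
    unsigned long tlb[DEMO_NB_MMU_MODES][DEMO_TLB_SIZE];
    atomic_uint pending_flush;   /* bitmap of MMU modes to flush */
} DemoVCPU;

/* May be called from any thread: records the request, touches no TLB. */
static void demo_request_flush(DemoVCPU *target, unsigned mmu_idx_bitmap)
{
    assert((mmu_idx_bitmap >> DEMO_NB_MMU_MODES) == 0);
    atomic_fetch_or_explicit(&target->pending_flush, mmu_idx_bitmap,
                             memory_order_release);
}

/* Called by the owning vCPU thread before it re-enters translated
 * code. Returns the bitmap of MMU modes that were flushed. */
static unsigned demo_process_pending(DemoVCPU *self)
{
    unsigned todo = atomic_exchange_explicit(&self->pending_flush, 0,
                                             memory_order_acquire);
    for (unsigned i = 0; i < DEMO_NB_MMU_MODES; i++) {
        if (todo & (1u << i)) {
            memset(self->tlb[i], 0, sizeof(self->tlb[i]));
        }
    }
    return todo;
}
```

Because only the owner writes its TLB, the table itself needs no lock
in this model; the atomic bitmap is the only shared state.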

TLB flag updates are all done atomically and are also protected by the
tb_lock() which is used by the functions that update the TLB in bulk.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently thanks to KVM work any access to IO memory is automatically
protected by the global iothread mutex, also known as the BQL (Big
QEMU Lock). Any IO region that doesn't use the global mutex is
expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update the state of more than a single vCPU should take
the BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently ARM targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross vCPU.

Memory Consistency
==================

Between emulated guests and host systems there is a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest requires it.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.

The Linux kernel has an excellent write-up on the various forms of
memory barrier and the guarantees they can provide [1].

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring for example
a change to a signal flag will only be visible once the changes to
payload are.
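
The signal flag and payload pattern can be sketched with C11 fences.
This is a generic illustration, not QEMU code: DemoMailbox and the
demo_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox: one writer publishes, one reader consumes. */
typedef struct DemoMailbox {
    int payload;
    atomic_bool ready;   /* the signal flag */
} DemoMailbox;

/* Writer: the release fence orders the payload store before the flag
 * store, so the flag can only become visible once the payload is. */
static void demo_publish(DemoMailbox *m, int value)
{
    m->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&m->ready, true, memory_order_relaxed);
}

/* Reader: the acquire fence orders the flag load before the payload
 * load. Returns false if nothing has been published yet. */
static bool demo_try_consume(DemoMailbox *m, int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_relaxed)) {
        return false;
    }
    atomic_thread_fence(memory_order_acquire);
    *out = m->payload;
    return true;
}
```

Without the two fences a weakly ordered host would be free to make the
flag visible before the payload, and the reader could observe a stale
value.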

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas ARM has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
       - host systems with stronger implied guarantees can skip some barriers
       - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

    - target-i386
    - target-arm
    - target-aarch64
    - target-alpha
    - target-mips

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offer a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offer a pair of load/store instructions which offer a
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is ARM's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
ARM's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem, typically presenting a locking ABI which
assumes cmpxchg-like semantics.
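
A minimal sketch of emulating an exclusive pair on top of a
compare-and-swap, and of why that is open to the ABA problem, is shown
below. It uses C11 atomics in place of the tcg_gen_atomic_* helpers;
DemoCPU and the demo_* functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU record of the open exclusive access. */
typedef struct DemoCPU {
    _Atomic uint32_t *excl_addr;  /* NULL when no exclusive is open */
    uint32_t excl_val;            /* value seen by the exclusive load */
} DemoCPU;

/* Load exclusive: remember the address and the value that was read. */
static uint32_t demo_load_exclusive(DemoCPU *cpu, _Atomic uint32_t *addr)
{
    assert(addr != NULL);
    cpu->excl_addr = addr;
    cpu->excl_val = atomic_load(addr);
    return cpu->excl_val;
}

/* Store exclusive: succeeds only if memory still holds the remembered
 * value. Returns 0 on success and 1 on failure, like strex. */
static int demo_store_exclusive(DemoCPU *cpu, _Atomic uint32_t *addr,
                                uint32_t newval)
{
    int fail = 1;

    if (cpu->excl_addr == addr) {
        uint32_t expected = cpu->excl_val;
        if (atomic_compare_exchange_strong(addr, &expected, newval)) {
            fail = 0;
        }
    }
    cpu->excl_addr = NULL;        /* the exclusive access is consumed */
    return fail;
}
```

The compare-and-swap only checks the value, not whether the location
was written in between. If another vCPU changes the word from A to B
and back to A, the store exclusive in this sketch still succeeds,
whereas real hardware would report a failure.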

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.

==========

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt