..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
system-mode emulation. User-mode emulation has always mirrored the
thread structure of the translated executable, although some of the
changes done for MTTCG system emulation have improved the stability of
linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all FE/BE
combinations where the host memory model is able to accommodate the
guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

   tb_jmp_cache (per-vCPU, cache of recent jumps)
   tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page this code
is critical to performance as looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention to do it.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
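
To make the lockless hot-path pattern concrete, here is a minimal
sketch of a per-vCPU jump cache built from C11 atomic slots. The
structure and function names (JmpCache, jc_lookup, jc_insert) and the
cache geometry are illustrative assumptions, not QEMU's actual
tb_jmp_cache implementation:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sizes only; QEMU's real cache is larger. */
#define JMP_CACHE_BITS 4
#define JMP_CACHE_SIZE (1u << JMP_CACHE_BITS)

struct TB {
    uintptr_t pc;               /* guest PC this block translates */
};

/* Per-vCPU cache: one atomic pointer per slot, no lock needed. */
struct JmpCache {
    _Atomic(struct TB *) slot[JMP_CACHE_SIZE];
};

static inline unsigned jc_hash(uintptr_t pc)
{
    return (pc >> 2) & (JMP_CACHE_SIZE - 1);
}

/* Readers use an acquire load so a fully built TB is visible. */
struct TB *jc_lookup(struct JmpCache *jc, uintptr_t pc)
{
    struct TB *tb = atomic_load_explicit(&jc->slot[jc_hash(pc)],
                                         memory_order_acquire);
    if (tb && tb->pc == pc) {
        return tb;
    }
    return NULL;                /* miss: fall back to the QHT slow path */
}

/* Writers publish with a release store after the TB is complete. */
void jc_insert(struct JmpCache *jc, struct TB *tb)
{
    atomic_store_explicit(&jc->slot[jc_hash(tb->pc)], tb,
                          memory_order_release);
}
```

Readers never block: a miss simply falls through to the slower hash
table lookup, and a racing insert at worst causes one redundant
slow-path lookup.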

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle, including any
post-generation patching of the translated code. This also implies a
shared translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and start
from scratch again. Some operations also force a full flush of
translations, including:

- debugging operations (breakpoint insertion/removal)
- some CPU helper functions
- linux-user spawning its first thread

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.
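
The quiescence guarantee can be sketched as a start/end-exclusive
rendezvous around the vCPU execution loop. QEMU's real mechanism (the
exclusive work support in cpus-common.c) is considerably more
involved, so treat the names and structure here as illustrative only:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t cpu_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t exclusive_cond = PTHREAD_COND_INITIALIZER;
static pthread_cond_t exclusive_resume = PTHREAD_COND_INITIALIZER;
static int n_running;           /* vCPU threads currently executing TBs */
static bool exclusive_pending;  /* a safe-work request is waiting */

/* Called by a vCPU thread before it enters the execution loop. */
void cpu_exec_start(void)
{
    pthread_mutex_lock(&cpu_list_lock);
    while (exclusive_pending) {         /* park until safe work is done */
        pthread_cond_wait(&exclusive_resume, &cpu_list_lock);
    }
    n_running++;
    pthread_mutex_unlock(&cpu_list_lock);
}

/* Called when the vCPU thread leaves the execution loop. */
void cpu_exec_end(void)
{
    pthread_mutex_lock(&cpu_list_lock);
    n_running--;
    if (exclusive_pending && n_running == 0) {
        pthread_cond_signal(&exclusive_cond);
    }
    pthread_mutex_unlock(&cpu_list_lock);
}

/* Run fn(opaque) only once every vCPU is quiescent. */
void run_safe_work(void (*fn)(void *), void *opaque)
{
    pthread_mutex_lock(&cpu_list_lock);
    exclusive_pending = true;
    while (n_running > 0) {             /* wait for vCPUs to drain */
        pthread_cond_wait(&exclusive_cond, &cpu_list_lock);
    }
    fn(opaque);                         /* e.g. flush the codegen buffer */
    exclusive_pending = false;
    pthread_cond_broadcast(&exclusive_resume);
    pthread_mutex_unlock(&cpu_list_lock);
}

/* Single-threaded demo: a vCPU enters and leaves the loop, then the
 * safe work runs while no vCPU is running. */
static int flush_done;
static void flush_cb(void *opaque) { (void)opaque; flush_done = 1; }

int safe_work_demo(void)
{
    cpu_exec_start();
    cpu_exec_end();
    run_safe_work(flush_cb, NULL);
    return flush_done;
}
```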

More granular translation invalidation events are typically due
to a change of the state of a physical page:

- code modification (self-modifying code, patching code)
- page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the C code.

There are a number of look-up caches that need to be properly updated
including:

- the jump lookup cache
- the physical-to-tb lookup hash table
- the global page table

The global page table (l1_map) provides a multi-level look-up
for PageDesc structures which contain pointers to the start of a
linked list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
- safely patch/revert direct jumps
- remove central PageDesc lookup entries
- ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.
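
A minimal sketch of the destination-lock idiom follows. The field and
function names are hypothetical, chosen for the example; QEMU's
TranslationBlock uses different, denser encodings for its jump lists:

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-in for a TranslationBlock's jump bookkeeping. */
struct TransBlock {
    pthread_mutex_t jmp_lock;      /* protects the incoming-jump list */
    struct TransBlock *jmp_list_head; /* blocks that jump into this one */
    struct TransBlock *jmp_list_next; /* link when on another's list */
    struct TransBlock *jmp_dest;      /* where our direct jump goes */
};

/* Record that 'from' jumps directly into 'to': take the lock of the
 * *destination* block, as the text above describes. */
void tb_add_jump(struct TransBlock *from, struct TransBlock *to)
{
    pthread_mutex_lock(&to->jmp_lock);
    from->jmp_dest = to;
    from->jmp_list_next = to->jmp_list_head;
    to->jmp_list_head = from;
    pthread_mutex_unlock(&to->jmp_lock);
}

/* On invalidation of 'tb', unlink every block that jumps into it.
 * Holding tb->jmp_lock keeps the list stable while we walk it. */
int tb_reset_jumps(struct TransBlock *tb)
{
    int n = 0;
    pthread_mutex_lock(&tb->jmp_lock);
    for (struct TransBlock *from = tb->jmp_list_head; from; ) {
        struct TransBlock *next = from->jmp_list_next;
        from->jmp_dest = NULL;     /* real code would patch the jump back */
        from->jmp_list_next = NULL;
        from = next;
        n++;
    }
    tb->jmp_list_head = NULL;
    pthread_mutex_unlock(&tb->jmp_lock);
    return n;
}
```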

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.
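
The cmpxchg insertion idiom can be sketched for a single level of such
a tree: each thread allocates a candidate node and tries to publish it
with a compare-and-swap; losers free their candidate and adopt the
winner's. The names and sizes here are illustrative, not l1_map's:

```c
#include <stdatomic.h>
#include <stdlib.h>

#define LEVEL_ENTRIES 512          /* illustrative fan-out */

struct Level {
    _Atomic(struct Level *) ptr[LEVEL_ENTRIES];
};

/* Descend one level, allocating the next node if it is missing. */
struct Level *level_get_or_alloc(struct Level *parent, unsigned idx)
{
    struct Level *next = atomic_load_explicit(&parent->ptr[idx],
                                              memory_order_acquire);
    if (next) {
        return next;
    }
    struct Level *fresh = calloc(1, sizeof(*fresh));
    struct Level *expected = NULL;
    if (atomic_compare_exchange_strong(&parent->ptr[idx],
                                       &expected, fresh)) {
        return fresh;              /* we won: our node is published */
    }
    free(fresh);                   /* lost the race: discard ours */
    return expected;               /* 'expected' now holds the winner */
}
```

Because the slot only ever goes from NULL to a valid pointer, readers
need no lock at all: an acquire load either sees nothing or a fully
initialised node.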

The lookup caches are updated atomically and the lookup hash uses QHT,
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

- Memory regions (dividing up access to PIO, MMIO and RAM)
- Dirty page tracking (for code gen, SMC detection, migration and display)
- Virtual TLB (for translating guest address->real address)
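
The flag trick works because the generated fast-path comparison
includes the flag bits, so any set flag forces a mismatch and a
slow-path exit. The flag names and page size below are illustrative
(QEMU's real flags, such as TLB_MMIO, live in the low bits of the TLB
entry's address field):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits stored below the page-size boundary. */
#define TLB_INVALID   (1u << 0)
#define TLB_MMIO      (1u << 1)
#define TLB_NOTDIRTY  (1u << 2)
#define TLB_FLAG_MASK (TLB_INVALID | TLB_MMIO | TLB_NOTDIRTY)
#define PAGE_MASK     (~(uintptr_t)0xfff)   /* assume 4K pages */

struct TLBEntry {
    uintptr_t addr;     /* page address | flag bits */
};

/* Fast-path check as generated code would perform it: the flag bits
 * take part in the comparison, so a flagged entry can never match a
 * page-aligned access address and the slow path is forced. */
bool tlb_hit_fast(const struct TLBEntry *e, uintptr_t vaddr)
{
    return (vaddr & PAGE_MASK) ==
           (e->addr & (PAGE_MASK | TLB_FLAG_MASK));
}
```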

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

- TLB Flush All/Page
  - can be across-vCPUs
  - cross-vCPU TLB flush may need other vCPUs brought to halt
  - change may need to be visible to the calling vCPU immediately
- TLB Flag Update
  - usually cross-vCPU
  - want change to be visible as soon as possible
- TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
  - This is a per-vCPU table - by definition can't race
  - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state
when it next runs its work (a few instructions later).

A new set of operations (tlb_flush_*_all_cpus) takes an additional
flag which, when set, will force synchronisation by setting the source
vCPU's work as "safe work" and exiting the cpu run loop. This ensures
that by the time execution restarts all flush operations have
completed.
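
A minimal sketch of the deferral pattern, assuming a simplified
per-vCPU work list rather than QEMU's actual cpus-common.c
implementation; the point is only that the flush runs on the target
vCPU's own thread, at a point where its state is coherent:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

struct WorkItem {
    void (*func)(void *opaque);
    void *opaque;
    struct WorkItem *next;
};

struct VCPU {
    pthread_mutex_t work_lock;
    struct WorkItem *work_list;
};

/* Any thread may queue work for a target vCPU... */
void async_work_on_cpu(struct VCPU *cpu, void (*func)(void *),
                       void *opaque)
{
    struct WorkItem *wi = malloc(sizeof(*wi));
    wi->func = func;
    wi->opaque = opaque;
    pthread_mutex_lock(&cpu->work_lock);
    wi->next = cpu->work_list;
    cpu->work_list = wi;
    pthread_mutex_unlock(&cpu->work_lock);
}

/* ...which the vCPU drains at the top of its execution loop, so the
 * effect (e.g. a TLB flush) is seen before the next guest insn. */
int process_queued_work(struct VCPU *cpu)
{
    int done = 0;
    pthread_mutex_lock(&cpu->work_lock);
    struct WorkItem *list = cpu->work_list;
    cpu->work_list = NULL;
    pthread_mutex_unlock(&cpu->work_lock);
    while (list) {
        struct WorkItem *next = list->next;
        list->func(list->opaque);
        free(list);
        list = next;
        done++;
    }
    return done;
}

/* Stand-in for a deferred TLB flush callback. */
static int flush_count;
static void tlb_flush_cb(void *opaque) { (void)opaque; flush_count++; }
```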

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is automatically
protected by the global iothread mutex, also known as the BQL (Big
QEMU Lock). Any IO region that doesn't use the global mutex is expected
to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.
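
The signal-flag/payload idiom looks like this in C11 atomics; this is
a generic sketch, not QEMU code:

```c
#include <stdatomic.h>

/* Lockless publish/consume: the release store on 'ready' ensures the
 * payload write is visible before the flag, and the acquire load
 * ensures the reader sees the payload after observing the flag. */

static int payload;             /* plain data, written before publish */
static atomic_int ready;        /* signal flag */

void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Spin until the flag is set, then read the payload safely. */
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        /* spin; a real consumer would yield or block */
    }
    return payload;
}
```

On a strongly ordered host the release/acquire pair may compile to
plain loads and stores; on a weakly ordered host it emits the
barriers this section is about.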

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
- host systems with stronger implied guarantees can skip some barriers
- merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

- target-i386
- target-arm
- target-aarch64
- target-alpha
- target-mips

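
The merging idea can be sketched by treating barrier strength as a
bitmask of ordering requirements, so a run of adjacent barriers folds
into a single barrier carrying the union of their bits. The flag
names below are illustrative, not TCG's actual barrier flags:

```c
/* Each bit names one ordering the barrier must enforce. */
enum {
    MB_LD_LD = 1 << 0,      /* loads ordered against later loads */
    MB_LD_ST = 1 << 1,      /* loads ordered against later stores */
    MB_ST_LD = 1 << 2,      /* stores ordered against later loads */
    MB_ST_ST = 1 << 3,      /* stores ordered against later stores */
    MB_FULL  = MB_LD_LD | MB_LD_ST | MB_ST_LD | MB_ST_ST,
};

/* Fold a run of consecutive barrier ops into their strongest form;
 * a real optimizer pass would also rewrite the op stream so that
 * only one barrier op remains. */
int merge_barriers(const int *masks, int n)
{
    int merged = 0;
    for (int i = 0; i < n; i++) {
        merged |= masks[i];
    }
    return merged;
}
```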
Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offer a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.
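
For example, a cmpxchg-based lock acquisition in C11 atomics (a
generic sketch, not guest code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A compare-and-swap either installs the new value atomically or
 * reports failure; there is no window in which another core can
 * observe a half-done update. */
bool try_take_lock(atomic_int *lock, int owner)
{
    int expected = 0;               /* 0 == unlocked */
    return atomic_compare_exchange_strong(lock, &expected, owner);
}
```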

The second type offer a pair of load/store instructions which offer a
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair, where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?
(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically presenting a locking ABI which
assumes cmpxchg-like semantics.
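
A sketch of how an ldrex/strex-style pair reduces to a host cmpxchg,
and where ABA enters; the function names are illustrative, not the
actual tcg_gen_atomic_* helpers. The exclusiveness check degrades to
"does the location still hold the value we loaded?", so if the value
changed from A to B and back to A between the pair, the store still
succeeds even though real hardware's strex would have failed:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* "Load exclusive": remember the value seen at the address. */
uint32_t load_exclusive(_Atomic uint32_t *addr, uint32_t *marker)
{
    *marker = atomic_load(addr);
    return *marker;
}

/* "Store exclusive": succeed only if the location still holds the
 * remembered value - a value check, not a true access check, hence
 * the ABA exposure described above. */
bool store_exclusive(_Atomic uint32_t *addr, uint32_t marker,
                     uint32_t val)
{
    uint32_t expected = marker;
    return atomic_compare_exchange_strong(addr, &expected, val);
}
```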

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.