[mirror_ubuntu-hirsute-kernel.git] / Documentation / arm / vlocks.rst

======================================
vlocks for Bare-Metal Mutual Exclusion
======================================

Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
mechanism, with reasonable but minimal requirements on the memory
system.

These are intended to be used to coordinate critical activity among CPUs
which are otherwise non-coherent, in situations where the hardware
provides no other mechanism to support this and ordinary spinlocks
cannot be used.


vlocks make use of the atomicity provided by the memory system for
writes to a single memory location.  To arbitrate, every CPU "votes for
itself", by storing a unique number to a common memory location.  The
final value seen in that memory location when all the votes have been
cast identifies the winner.

In order to make sure that the election produces an unambiguous result
in finite time, a CPU will only enter the election in the first place if
no winner has been chosen and the election does not appear to have
started yet.


Algorithm
---------

The easiest way to explain the vlocks algorithm is with some pseudo-code::


	int currently_voting[NR_CPUS] = { 0, };
	int last_vote = -1; /* no votes yet */

	bool vlock_trylock(int this_cpu)
	{
		/* signal our desire to vote */
		currently_voting[this_cpu] = 1;
		if (last_vote != -1) {
			/* someone already volunteered himself */
			currently_voting[this_cpu] = 0;
			return false; /* not ourself */
		}

		/* let's suggest ourself */
		last_vote = this_cpu;
		currently_voting[this_cpu] = 0;

		/* then wait until everyone else is done voting */
		for_each_cpu(i) {
			while (currently_voting[i] != 0)
				/* wait */;
		}

		/* result */
		if (last_vote == this_cpu)
			return true; /* we won */
		return false;
	}

	bool vlock_unlock(void)
	{
		last_vote = -1;
	}


The currently_voting[] array provides a way for the CPUs to determine
whether an election is in progress, and plays a role analogous to the
"entering" array in Lamport's bakery algorithm [1].

However, once the election has started, the underlying memory system
atomicity is used to pick the winner.  This avoids the need for a static
priority rule to act as a tie-breaker, or any counters which could
overflow.

As long as the last_vote variable is globally visible to all CPUs, it
will contain only one value that won't change once every CPU has cleared
its currently_voting flag.


Features and limitations
------------------------

 * vlocks are not intended to be fair.  In the contended case, it is the
   _last_ CPU which attempts to get the lock which will be most likely
   to win.

   vlocks are therefore best suited to situations where it is necessary
   to pick a unique winner, but it does not matter which CPU actually
   wins.

 * Like other similar mechanisms, vlocks will not scale well to a large
   number of CPUs.

   vlocks can be cascaded in a voting hierarchy to permit better scaling
   if necessary, as in the following hypothetical example for 4096 CPUs::

	/* first level: local election */
	my_town = towns[(this_cpu >> 4) & 0xf];
	I_won = vlock_trylock(my_town, this_cpu & 0xf);
	if (I_won) {
		/* we won the town election, let's go for the state */
		my_state = states[(this_cpu >> 8) & 0xf];
		I_won = vlock_lock(my_state, this_cpu & 0xf));
		if (I_won) {
			/* and so on */
			I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
			if (I_won) {
				/* ... */
			}
			vlock_unlock(the_whole_country);
		}
		vlock_unlock(my_state);
	}
	vlock_unlock(my_town);


ARM implementation
------------------

The current ARM implementation [2] contains some optimisations beyond
the basic algorithm:

 * By packing the members of the currently_voting array close together,
   we can read the whole array in one transaction (providing the number
   of CPUs potentially contending the lock is small enough).  This
   reduces the number of round-trips required to external memory.

   In the ARM implementation, this means that we can use a single load
   and comparison::

	LDR	Rt, [Rn]
	CMP	Rt, #0

   ...in place of code equivalent to::

	LDRB	Rt, [Rn]
	CMP	Rt, #0
	LDRBEQ	Rt, [Rn, #1]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #2]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #3]
	CMPEQ	Rt, #0

   This cuts down on the fast-path latency, as well as potentially
   reducing bus contention in contended cases.

   The optimisation relies on the fact that the ARM memory system
   guarantees coherency between overlapping memory accesses of
   different sizes, similarly to many other architectures.  Note that
   we do not care which element of currently_voting appears in which
   bits of Rt, so there is no need to worry about endianness in this
   optimisation.

   If there are too many CPUs to read the currently_voting array in
   one transaction then multiple transations are still required.  The
   implementation uses a simple loop of word-sized loads for this
   case.  The number of transactions is still fewer than would be
   required if bytes were loaded individually.


   In principle, we could aggregate further by using LDRD or LDM, but
   to keep the code simple this was not attempted in the initial
   implementation.


 * vlocks are currently only used to coordinate between CPUs which are
   unable to enable their caches yet.  This means that the
   implementation removes many of the barriers which would be required
   when executing the algorithm in cached memory.

   packing of the currently_voting array does not work with cached
   memory unless all CPUs contending the lock are cache-coherent, due
   to cache writebacks from one CPU clobbering values written by other
   CPUs.  (Though if all the CPUs are cache-coherent, you should be
   probably be using proper spinlocks instead anyway).


 * The "no votes yet" value used for the last_vote variable is 0 (not
   -1 as in the pseudocode).  This allows statically-allocated vlocks
   to be implicitly initialised to an unlocked state simply by putting
   them in .bss.

   An offset is added to each CPU's ID for the purpose of setting this
   variable, so that no CPU uses the value 0 for its ID.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, for
use in ARM-based big.LITTLE platforms, with review and input gratefully
received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
grabbing most of this text out of the relevant mail thread and writing
up the pseudocode.

Copyright (C) 2012-2013  Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.


References
----------

[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.

    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[2] linux/arch/arm/common/vlock.S, www.kernel.org.
Commit	Line	Data
dc7a12bd	1	======================================
9762f12d DM	2	vlocks for Bare-Metal Mutual Exclusion
	3	======================================
	4
	5	Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
	6	mechanism, with reasonable but minimal requirements on the memory
	7	system.
	8
	9	These are intended to be used to coordinate critical activity among CPUs
	10	which are otherwise non-coherent, in situations where the hardware
	11	provides no other mechanism to support this and ordinary spinlocks
	12	cannot be used.
	13
	14
	15	vlocks make use of the atomicity provided by the memory system for
	16	writes to a single memory location. To arbitrate, every CPU "votes for
	17	itself", by storing a unique number to a common memory location. The
	18	final value seen in that memory location when all the votes have been
	19	cast identifies the winner.
	20
	21	In order to make sure that the election produces an unambiguous result
	22	in finite time, a CPU will only enter the election in the first place if
	23	no winner has been chosen and the election does not appear to have
	24	started yet.
	25
	26
	27	Algorithm
	28	---------
	29
dc7a12bd	30	The easiest way to explain the vlocks algorithm is with some pseudo-code::
9762f12d DM	31
	32
	33	int currently_voting[NR_CPUS] = { 0, };
	34	int last_vote = -1; /* no votes yet */
	35
	36	bool vlock_trylock(int this_cpu)
	37	{
	38	/* signal our desire to vote */
	39	currently_voting[this_cpu] = 1;
	40	if (last_vote != -1) {
	41	/* someone already volunteered himself */
	42	currently_voting[this_cpu] = 0;
	43	return false; /* not ourself */
	44	}
	45
	46	/* let's suggest ourself */
	47	last_vote = this_cpu;
	48	currently_voting[this_cpu] = 0;
	49
	50	/* then wait until everyone else is done voting */
	51	for_each_cpu(i) {
	52	while (currently_voting[i] != 0)
	53	/* wait */;
	54	}
	55
	56	/* result */
	57	if (last_vote == this_cpu)
	58	return true; /* we won */
	59	return false;
	60	}
	61
	62	bool vlock_unlock(void)
	63	{
	64	last_vote = -1;
	65	}
	66
	67
	68	The currently_voting[] array provides a way for the CPUs to determine
	69	whether an election is in progress, and plays a role analogous to the
	70	"entering" array in Lamport's bakery algorithm [1].
	71
	72	However, once the election has started, the underlying memory system
	73	atomicity is used to pick the winner. This avoids the need for a static
	74	priority rule to act as a tie-breaker, or any counters which could
	75	overflow.
	76
	77	As long as the last_vote variable is globally visible to all CPUs, it
	78	will contain only one value that won't change once every CPU has cleared
	79	its currently_voting flag.
	80
	81
	82	Features and limitations
	83	------------------------
	84
	85	* vlocks are not intended to be fair. In the contended case, it is the
	86	_last_ CPU which attempts to get the lock which will be most likely
	87	to win.
	88
	89	vlocks are therefore best suited to situations where it is necessary
	90	to pick a unique winner, but it does not matter which CPU actually
	91	wins.
	92
	93	* Like other similar mechanisms, vlocks will not scale well to a large
	94	number of CPUs.
95
96	vlocks can be cascaded in a voting hierarchy to permit better scaling
dc7a12bd	97	if necessary, as in the following hypothetical example for 4096 CPUs::
9762f12d DM	98
	99	/* first level: local election */
	100	my_town = towns[(this_cpu >> 4) & 0xf];
	101	I_won = vlock_trylock(my_town, this_cpu & 0xf);
	102	if (I_won) {
	103	/* we won the town election, let's go for the state */
	104	my_state = states[(this_cpu >> 8) & 0xf];
	105	I_won = vlock_lock(my_state, this_cpu & 0xf));
	106	if (I_won) {
	107	/* and so on */
	108	I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
	109	if (I_won) {
	110	/* ... */
	111	}
	112	vlock_unlock(the_whole_country);
	113	}
	114	vlock_unlock(my_state);
	115	}
	116	vlock_unlock(my_town);
	117
	118
	119	ARM implementation
	120	------------------
	121
	122	The current ARM implementation [2] contains some optimisations beyond
	123	the basic algorithm:
	124
	125	* By packing the members of the currently_voting array close together,
	126	we can read the whole array in one transaction (providing the number
	127	of CPUs potentially contending the lock is small enough). This
	128	reduces the number of round-trips required to external memory.
	129
	130	In the ARM implementation, this means that we can use a single load
dc7a12bd	131	and comparison::
9762f12d DM	132
	133	LDR Rt, [Rn]
	134	CMP Rt, #0
	135
dc7a12bd	136	...in place of code equivalent to::
9762f12d DM	137
	138	LDRB Rt, [Rn]
	139	CMP Rt, #0
	140	LDRBEQ Rt, [Rn, #1]
	141	CMPEQ Rt, #0
	142	LDRBEQ Rt, [Rn, #2]
	143	CMPEQ Rt, #0
	144	LDRBEQ Rt, [Rn, #3]
	145	CMPEQ Rt, #0
	146
	147	This cuts down on the fast-path latency, as well as potentially
	148	reducing bus contention in contended cases.
	149
	150	The optimisation relies on the fact that the ARM memory system
	151	guarantees coherency between overlapping memory accesses of
	152	different sizes, similarly to many other architectures. Note that
	153	we do not care which element of currently_voting appears in which
	154	bits of Rt, so there is no need to worry about endianness in this
	155	optimisation.
	156
	157	If there are too many CPUs to read the currently_voting array in
	158	one transaction then multiple transations are still required. The
	159	implementation uses a simple loop of word-sized loads for this
	160	case. The number of transactions is still fewer than would be
	161	required if bytes were loaded individually.
	162
	163
	164	In principle, we could aggregate further by using LDRD or LDM, but
	165	to keep the code simple this was not attempted in the initial
	166	implementation.
	167
	168
	169	* vlocks are currently only used to coordinate between CPUs which are
	170	unable to enable their caches yet. This means that the
	171	implementation removes many of the barriers which would be required
	172	when executing the algorithm in cached memory.
	173
	174	packing of the currently_voting array does not work with cached
	175	memory unless all CPUs contending the lock are cache-coherent, due
	176	to cache writebacks from one CPU clobbering values written by other
	177	CPUs. (Though if all the CPUs are cache-coherent, you should be
	178	probably be using proper spinlocks instead anyway).
	179
	180
	181	* The "no votes yet" value used for the last_vote variable is 0 (not
	182	-1 as in the pseudocode). This allows statically-allocated vlocks
	183	to be implicitly initialised to an unlocked state simply by putting
	184	them in .bss.
	185
	186	An offset is added to each CPU's ID for the purpose of setting this
	187	variable, so that no CPU uses the value 0 for its ID.
	188
	189
	190	Colophon
	191	--------
	192
	193	Originally created and documented by Dave Martin for Linaro Limited, for
	194	use in ARM-based big.LITTLE platforms, with review and input gratefully
	195	received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for
	196	grabbing most of this text out of the relevant mail thread and writing
	197	up the pseudocode.
	198
	199	Copyright (C) 2012-2013 Linaro Limited
	200	Distributed under the terms of Version 2 of the GNU General Public
201	License, as defined in linux/COPYING.
202
203
204	References
205	----------
206
207	[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
208	Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
209
ae13c65b	210	https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
9762f12d DM	211
9762f12d DM	212	[2] linux/arch/arm/common/vlock.S, www.kernel.org.