=================
Monitor Elections
=================

The Original Algorithm
======================
Historically, monitor leader elections have been very simple: the lowest-ranked
monitor wins!

This is accomplished using a low-state "Elector" module (though it has now
been split into an Elector that handles message-passing, and an ElectionLogic
that makes the voting choices). It tracks the election epoch and not much
else. Odd epochs are elections; even epochs have a leader and let the monitor
do its ongoing work. When a timeout occurs or the monitor asks for a
new election, we bump the epoch and send out Propose messages to all known
monitors.

In general, if we receive an old message we either drop it or trigger a new
election (if we think the sender is newly-booted and needs to join quorum). If
we receive a message from a newer epoch, we bump up our epoch to match and
either Defer to the Proposer or else bump the epoch again and Propose
ourselves if we expect to win over them. When we receive a Propose within
our current epoch, we either Defer to the sender or ignore them (we ignore them
if they are of a higher rank than us, or higher than the rank we have already
deferred to).

(Note that if we have the highest rank it is possible for us to defer to every
other monitor in sequence within the same election epoch!)
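
As a rough illustration (a simplified sketch, not the actual Elector
code), the classic Propose-handling rule looks something like this::

  #include <optional>

  struct ClassicElector {
    int my_rank;                     // lower rank wins
    std::optional<int> deferred_to;  // rank we already deferred to, if any

    // Returns true if we should send a Defer to sender_rank.
    bool handle_propose(int sender_rank) {
      if (sender_rank > my_rank)
        return false;  // ignore: they rank higher (worse) than us
      if (deferred_to && sender_rank > *deferred_to)
        return false;  // ignore: worse than whoever we already chose
      deferred_to = sender_rank;
      return true;     // defer to the better-ranked proposer
    }
  };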

This resolves under normal circumstances because all monitors agree on the
priority voting order, and epochs are only bumped when a monitor isn't
participating or sees a possible conflict with the known proposers.

The Problems
============
The original algorithm didn't work at all under a variety of netsplit
conditions. This didn't manifest often in practice but has become
important as the community and commercial vendors move Ceph into
spaces requiring the use of "stretch clusters".

The New Algorithms
==================
We still default to the original ("classic") election algorithm, but
support letting users change to new ones via the CLI. These
algorithms are implemented as different functions and switch statements
within the ElectionLogic class.
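
For example, on clusters that support multiple election strategies, the
mode can be switched at runtime with a command along these lines::

  ceph mon set election_strategy connectivity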

The first algorithm is very simple: "disallow" lets you add monitors
to a list of disallowed leaders.
The second, "connectivity", incorporates connection score ratings
and elects the monitor with the best score.

Algorithm: disallow
===================
If a monitor is in the disallowed list, it always defers to another
monitor, no matter the rank. Otherwise, it is the same as the classic
algorithm.
Since changing the disallowed list requires a paxos update, monitors
in an election together should always have the same set. This means
the election order is constant and static across the full monitor set
and elections resolve trivially (assuming a connected network).
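
In recent releases the disallowed list is managed with monitor commands
of roughly this form (verify the exact syntax against your release's
documentation)::

  ceph mon add disallowed_leader {name}
  ceph mon rm disallowed_leader {name}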

This algorithm really just exists as a demo and stepping-stone to
the more advanced connectivity mode, but it may have utility in asymmetric
networks and clusters.

Algorithm: connectivity
=======================
This algorithm takes as input scores for each connection
(both ways, discussed in the next section) and attempts to elect the monitor
with the highest total score. We keep the same basic message-passing flow as the
classic algorithm, in which elections are driven by reacting to Propose messages.
But this has several challenges since, unlike ranks, scores are not static (and
might change during an election!). To guarantee an election epoch does not
produce multiple leaders, we must maintain two key invariants:

* Monitors must maintain static scores during an election epoch
* Any deferral must be transitive -- if A defers to B and then to C,
  B had better defer to C as well!

We handle these very explicitly: by branching a copy, stable_peer_tracker,
of our peer_tracker scoring object whenever starting an election (or
bumping the epoch), and by refusing to defer to a monitor if it won't
be deferred to by our current leader choice. (All Propose messages include
a copy of the scores the leader is working from, so peers can evaluate them.)

Of course, those modifications can easily block. To guarantee forward progress,
we make several further adjustments:

* If we want to defer to a new peer, but have already deferred to a peer
  whose scores don't allow that, we bump the election epoch and start()
  the election over again.
* All election messages include the scores the sender is aware of.
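
Putting the pieces together, a minimal sketch of this logic (with
hypothetical names; the real implementation is spread across
ElectionLogic and ConnectionTracker) might look like::

  #include <map>

  struct ConnectivityElection {
    int epoch = 0;
    int deferred_to = -1;                // -1: no deferral yet this epoch
    std::map<int, double> peer_tracker;  // live scores, updated by pings
    std::map<int, double> stable_peer_tracker;  // frozen per epoch

    void start() {
      // Invariant 1: scores must stay static during an epoch, so we
      // branch a stable copy whenever the election (re)starts.
      // (Epoch parity handling is omitted in this toy.)
      ++epoch;
      deferred_to = -1;
      stable_peer_tracker = peer_tracker;
    }

    void maybe_defer(int peer) {
      if (deferred_to == -1 ||
          stable_peer_tracker[peer] >= stable_peer_tracker[deferred_to]) {
        deferred_to = peer;  // consistent with any earlier deferral
      } else {
        // Invariant 2 (transitive deferral) would be violated, so bump
        // the epoch and run the election over from scratch.
        start();
      }
    }
  };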

This guarantees we will resolve the election as long as the network is
reasonably stable (even if disconnected): As long as all score "views"
result in the same deferral order, an election will complete normally. And by
broadly sharing scores across the full set of monitors, monitors rapidly
converge on the global newest state.

This algorithm has one further important feature compared to the classic and
disallowed handlers: it can ignore out-of-quorum peers. Normally, whenever
a monitor B receives a Propose from an out-of-quorum peer C, B will itself trigger
a new election to give C an opportunity to join. But because the
highest-scoring monitor A may be netsplit from C, this is not desirable. So in
the connectivity election algorithm, B only "forwards" Propose messages when B's
scores indicate the cluster would choose a leader other than A.
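
That forwarding decision can be pictured like this (a toy helper, not
the real code)::

  #include <map>

  // Should monitor B restart the election because of a Propose from an
  // out-of-quorum peer? Only if B's own score totals say someone other
  // than the current leader ought to win. (Assumes current_leader is
  // present in totals.)
  bool should_forward_propose(int current_leader,
                              const std::map<int, double>& totals) {
    int best = current_leader;
    for (const auto& [rank, score] : totals)
      if (score > totals.at(best)) best = rank;
    return best != current_leader;
  }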

Connection Scoring
==================
We implement scoring within the ConnectionTracker class, which is
driven by the Elector and provided to ElectionLogic as a resource. Elector
is responsible for sending out MMonPing messages, and for reporting the
results in to the ConnectionTracker as calls to report_[live|dead]_connection
with the relevant peer and the time units the call counts for. (These time units
are seconds in the monitor, but the ConnectionTracker is agnostic and our unit
tests count simple time steps.)

We configure a "half life" and each report updates the peer's current status
(alive or dead) and its total score. For a live report, the new score is::

  new_score = current_score * (1 - units_alive / (2 * half_life))
              + units_alive / (2 * half_life)

(For a dead report, we of course subtract the new delta, rather than adding
it.)
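
To make the arithmetic concrete, here is a toy reimplementation of the
update (not the ConnectionTracker itself; the clamp at zero is an
assumption of this sketch)::

  #include <cstdio>

  // Toy half-life scoring in abstract time units, as in the unit tests.
  struct ToyScore {
    double half_life;
    double score = 0.0;  // tends toward 1.0 while alive, 0.0 while dead

    void report_live(double units) {
      double f = units / (2 * half_life);
      score = score * (1 - f) + f;
    }
    void report_dead(double units) {
      double f = units / (2 * half_life);
      score = score * (1 - f) - f;
      if (score < 0) score = 0;  // assumption: never drop below zero
    }
  };

  int main() {
    ToyScore s{/*half_life=*/4};
    s.report_live(4);  // f = 0.5:  score = 0*(1-0.5) + 0.5  = 0.500
    s.report_dead(2);  // f = 0.25: score = 0.5*0.75 - 0.25  = 0.125
    std::printf("%.3f\n", s.score);
    return 0;
  }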

We can further encode and decode the ConnectionTracker for wire transmission,
and receive_peer_report()s of a full ConnectionTracker (containing all
known scores) or a ConnectionReport (representing a single peer's scores)
to slurp up the scores from peers. These scores are of course all versioned so
we are in no danger of accidentally going backwards in time.
We can query an individual connection score (if the connection is down, it's 0)
or the total score of a specific monitor, which is the connection score from all
other monitors going into that one.
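
The versioning rule can be sketched as follows (a hypothetical
simplification of what receive_peer_report() does)::

  #include <cstdint>
  #include <map>

  // Toy model: one reporter's scores plus a monotonically increasing
  // version, so merging can never move backwards in time.
  struct Report {
    uint64_t version = 0;
    std::map<int, double> scores;  // peer rank -> connection score
  };

  void receive_report(std::map<int, Report>& known, int reporter,
                      const Report& incoming) {
    auto& current = known[reporter];
    if (incoming.version > current.version)
      current = incoming;  // newer view wins; stale reports are dropped
  }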

By default, we consider pings failed after 2 seconds (mon_elector_ping_timeout)
and ping live connections every second (mon_elector_ping_timeout divided by
mon_elector_ping_divisor). The half life is 12 hours
(mon_con_tracker_score_halflife).