=================
Monitor Elections
=================

The Original Algorithm
======================
Historically, monitor leader elections have been very simple: the lowest-ranked
monitor wins!

This is accomplished using a low-state "Elector" module (though it has now
been split into an Elector that handles message-passing, and an ElectionLogic
that makes the voting choices). It tracks the election epoch and not much
else. Odd epochs are elections; even epochs have a leader and let the monitor
do its ongoing work. When a timeout occurs or the monitor asks for a
new election, we bump the epoch and send out Propose messages to all known
monitors.

In general, if we receive an old message we either drop it or trigger a new
election (if we think the sender is newly booted and needs to join quorum). If
we receive a message from a newer epoch, we bump up our epoch to match and
either Defer to the Proposer or else bump the epoch again and Propose
ourselves if we expect to win over them. When we receive a Propose within
our current epoch, we either Defer to the sender or ignore them (we ignore them
if they are of a higher rank than us, or higher than the rank we have already
deferred to).

(Note that if we have the highest rank it is possible for us to defer to every
other monitor in sequence within the same election epoch!)

This resolves under normal circumstances because all monitors agree on the
priority voting order, and epochs are only bumped when a monitor isn't
participating or sees a possible conflict with the known proposers.

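The ignore-or-defer choice on receiving a Propose reduces to a rank
comparison. A minimal sketch, with illustrative names rather than Ceph's
actual ElectionLogic API:

```python
def handle_propose(my_rank, proposer_rank, deferred_to=None):
    """Classic election: react to a Propose within our current epoch.

    Lower rank wins.  deferred_to is the rank we have already deferred
    to in this epoch, or None if we have not deferred yet.
    """
    if proposer_rank > my_rank:
        return "ignore"  # the proposer ranks below us
    if deferred_to is not None and proposer_rank > deferred_to:
        return "ignore"  # we already deferred to a lower rank
    return "defer"
```

For example, rank 3 defers to a rank-0 proposer even after deferring to
rank 1, but ignores a rank-2 proposer in the same situation.
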
The Problems
============
The original algorithm didn't work at all under a variety of netsplit
conditions. This didn't manifest often in practice but has become
important as the community and commercial vendors move Ceph into
spaces requiring the use of "stretch clusters".

The New Algorithms
==================
We still default to the original ("classic") election algorithm, but
support letting users change to new ones via the CLI. These
algorithms are implemented as different functions and switch statements
within the ElectionLogic class.

The first algorithm is very simple: "disallow" lets you add monitors
to a list of disallowed leaders.
The second, "connectivity", incorporates connection score ratings
and elects the monitor with the best score.

Algorithm: disallow
===================
If a monitor is in the disallowed list, it always defers to another
monitor, no matter the rank. Otherwise, it behaves the same as the classic
algorithm.
Since changing the disallowed list requires a paxos update, monitors
in an election together should always have the same set. This means
the election order is constant and static across the full monitor set
and elections resolve trivially (assuming a connected network).
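
The resulting order can be modeled as a simple sort key. This is a toy
sketch, not the ElectionLogic implementation: allowed monitors, by rank,
sort ahead of disallowed ones.

```python
def election_winner(ranks, disallowed):
    # Toy model of disallow mode: every allowed monitor sorts ahead of
    # every disallowed one, and within each group the lowest rank wins.
    return min(ranks, key=lambda rank: (rank in disallowed, rank))
```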

This algorithm really just exists as a demo and stepping-stone to
the more advanced connectivity mode, but it may have utility in asymmetric
networks and clusters.

Algorithm: connectivity
=======================
This algorithm takes as input scores for each connection
(both ways, discussed in the next section) and attempts to elect the monitor
with the highest total score. We keep the same basic message-passing flow as the
classic algorithm, in which elections are driven by reacting to Propose messages.
But this has several challenges since unlike ranks, scores are not static (and
might change during an election!). To guarantee an election epoch does not
produce multiple leaders, we must maintain two key invariants:

* Monitors must maintain static scores during an election epoch
* Any deferral must be transitive -- if A defers to B and then to C,
  B had better defer to C as well!

We handle these very explicitly: by branching a copy, stable_peer_tracker,
of our peer_tracker scoring object whenever starting an election (or
bumping the epoch), and by refusing to defer to a monitor if it won't
be deferred to by our current leader choice. (All Propose messages include
a copy of the scores the leader is working from, so peers can evaluate them.)
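
One way to picture why this preserves transitivity: with a shared, stable
snapshot of scores, every monitor computes the same single leader and defers
only to that candidate. A hypothetical sketch (names and tie-breaking are
assumptions, not the ConnectionTracker/ElectionLogic code):

```python
def leader_choice(scores):
    # scores: rank -> total connectivity score, from the stable snapshot
    # taken when the election started.  Highest score wins; ties break
    # toward the lower rank.
    return max(scores, key=lambda rank: (scores[rank], -rank))

def should_defer(proposer_rank, scores):
    # Defer only to the snapshot's chosen leader.  Monitors sharing the
    # same view pick the same leader, so a chain A -> B -> C where B
    # refuses to defer to C cannot form from consistent snapshots.
    return proposer_rank == leader_choice(scores)
```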

Of course, those modifications can easily block. To guarantee forward progress,
we make several further adjustments:

* If we want to defer to a new peer, but have already deferred to a peer
  whose scores don't allow that, we bump the election epoch and start()
  the election over again.
* All election messages include the scores the sender is aware of.

This guarantees we will resolve the election as long as the network is
reasonably stable (even if disconnected): As long as all score "views"
result in the same deferral order, an election will complete normally. And by
broadly sharing scores across the full set of monitors, monitors rapidly
converge on the global newest state.

This algorithm has one further important feature compared to the classic and
disallowed handlers: it can ignore out-of-quorum peers. Normally, whenever
a monitor B receives a Propose from an out-of-quorum peer C, B will itself trigger
a new election to give C an opportunity to join. But because the
highest-scoring monitor A may be netsplit from C, this is not desirable. So in
the connectivity election algorithm, B only "forwards" Propose messages when B's
scores indicate the cluster would choose a leader other than A.
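
That gating rule might look like the following sketch (a hypothetical
helper, assuming the same highest-score-wins choice as above):

```python
def should_restart_for_out_of_quorum_propose(current_leader, scores):
    # B triggers a new election for an out-of-quorum peer's Propose only
    # when B's own scores say the cluster would elect someone other than
    # the current leader; otherwise the Propose is ignored.
    best = max(scores, key=lambda rank: (scores[rank], -rank))
    return best != current_leader
```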

Connection Scoring
==================
We implement scoring within the ConnectionTracker class, which is
driven by the Elector and provided to ElectionLogic as a resource. Elector
is responsible for sending out MMonPing messages, and for reporting the
results in to the ConnectionTracker as calls to report_[live|dead]_connection
with the relevant peer and the time units the call counts for. (These time units
are seconds in the monitor, but the ConnectionTracker is agnostic and our unit
tests count simple time steps.)

We configure a "half life" and each report updates the peer's current status
(alive or dead) and its total score. For a live report covering
``units_alive`` time units, the new score is::

  new_score = current_score * (1 - units_alive / (2 * half_life))
              + units_alive / (2 * half_life)

(For a dead report, we of course subtract the new delta, rather than adding it.)
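
As a sketch of that update rule (names are illustrative, not the
ConnectionTracker API): with a half life of 12 units, a one-unit live
report moves a score of 0 to 1/24, and repeated live reports converge
toward 1.

```python
def update_score(current_score, units, half_life, alive):
    # Each report blends the old score with a delta weighted by the time
    # units covered relative to twice the half-life; live reports pull
    # the score up toward 1, dead reports push it down.
    fraction = units / (2.0 * half_life)
    decayed = current_score * (1.0 - fraction)
    return decayed + fraction if alive else decayed - fraction
```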

We can further encode and decode the ConnectionTracker for wire transmission,
and receive_peer_report()s of a full ConnectionTracker (containing all
known scores) or a ConnectionReport (representing a single peer's scores)
to slurp up the scores from peers. These scores are of course all versioned so
we are in no danger of accidentally going backwards in time.

We can query an individual connection score (if the connection is down, it's 0)
or the total score of a specific monitor, which is the connection score from all
other monitors going in to that one.
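
A minimal sketch of that total-score query; the flat dict of directed
connection scores is an assumption for illustration, not ConnectionTracker's
storage layout:

```python
def total_score(connection_scores, target):
    # connection_scores[(src, dst)] is src's score for its connection to
    # dst (0 when the connection is down).  The total score of `target`
    # sums the scores from all *other* monitors going in to it.
    return sum(score for (src, dst), score in connection_scores.items()
               if dst == target and src != target)
```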

By default, we consider pings failed after 2 seconds (mon_elector_ping_timeout)
and ping live connections every second (mon_elector_ping_divisor). The half life
is 12 hours (mon_con_tracker_score_halflife).