=================
Monitor Elections
=================

The Original Algorithm
======================
Historically, monitor leader elections have been very simple: the lowest-ranked
monitor wins!

This is accomplished using a low-state "Elector" module (though it has now
been split into an Elector that handles message-passing, and an ElectionLogic
that makes the voting choices). It tracks the election epoch and not much
else. Odd epochs are elections; even epochs have a leader and let the monitor
do its ongoing work. When a timeout occurs or the monitor asks for a
new election, we bump the epoch and send out Propose messages to all known
monitors.

In general, if we receive an old message we either drop it or trigger a new
election (if we think the sender is newly-booted and needs to join quorum). If
we receive a message from a newer epoch, we bump up our epoch to match and
either Defer to the Proposer or else bump the epoch again and Propose
ourselves if we expect to win over them. When we receive a Propose within
our current epoch, we either Defer to the sender or ignore them (we ignore them
if they are of a higher rank than us, or higher than the rank we have already
deferred to).

(Note that if we have the highest rank it is possible for us to defer to every
other monitor in sequence within the same election epoch!)
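
As a rough illustration (a simplified sketch, not the actual Elector
code), the classic Propose-handling rule looks something like this::

  #include <optional>

  struct ClassicElector {
    int my_rank;                     // lower rank wins
    std::optional<int> deferred_to;  // rank we already deferred to, if any

    // Returns true if we should send a Defer to sender_rank.
    bool handle_propose(int sender_rank) {
      if (sender_rank > my_rank)
        return false;  // ignore: they rank higher (worse) than us
      if (deferred_to && sender_rank > *deferred_to)
        return false;  // ignore: worse than whoever we already chose
      deferred_to = sender_rank;
      return true;     // defer to the better-ranked proposer
    }
  };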

This resolves under normal circumstances because all monitors agree on the
priority voting order, and epochs are only bumped when a monitor isn't
participating or sees a possible conflict with the known proposers.

The Problems
============
The original algorithm didn't work at all under a variety of netsplit
conditions. This didn't manifest often in practice but has become
important as the community and commercial vendors move Ceph into
spaces requiring the use of "stretch clusters".

The New Algorithms
==================
We still default to the original ("classic") election algorithm, but
support letting users change to new ones via the CLI. These
algorithms are implemented as different functions and switch statements
within the ElectionLogic class.
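
For example, on clusters that support multiple election strategies, the
mode can be switched at runtime with a command along these lines::

  ceph mon set election_strategy connectivity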

The first algorithm is very simple: "disallow" lets you add monitors
to a list of disallowed leaders.
The second, "connectivity", incorporates connection score ratings
and elects the monitor with the best score.

Algorithm: disallow
===================
If a monitor is in the disallowed list, it always defers to another
monitor, no matter the rank. Otherwise, it is the same as the classic
algorithm.
Since changing the disallowed list requires a paxos update, monitors
in an election together should always have the same set. This means
the election order is constant and static across the full monitor set
and elections resolve trivially (assuming a connected network).
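
In recent releases the disallowed list is managed with monitor commands
of roughly this form (verify the exact syntax against your release's
documentation)::

  ceph mon add disallowed_leader {name}
  ceph mon rm disallowed_leader {name}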

This algorithm really just exists as a demo and stepping-stone to
the more advanced connectivity mode, but it may have utility in asymmetric
networks and clusters.

Algorithm: connectivity
=======================
This algorithm takes as input scores for each connection
(both ways, discussed in the next section) and attempts to elect the monitor
with the highest total score. We keep the same basic message-passing flow as the
classic algorithm, in which elections are driven by reacting to Propose messages.
But this has several challenges since, unlike ranks, scores are not static (and
might change during an election!). To guarantee an election epoch does not
produce multiple leaders, we must maintain two key invariants:

* Monitors must maintain static scores during an election epoch
* Any deferral must be transitive -- if A defers to B and then to C,
  B had better defer to C as well!

We handle these very explicitly: by branching a copy, stable_peer_tracker,
of our peer_tracker scoring object whenever starting an election (or
bumping the epoch), and by refusing to defer to a monitor if it won't
be deferred to by our current leader choice. (All Propose messages include
a copy of the scores the leader is working from, so peers can evaluate them.)

Of course, those modifications can easily block. To guarantee forward progress,
we make several further adjustments:

* If we want to defer to a new peer, but have already deferred to a peer
  whose scores don't allow that, we bump the election epoch and start()
  the election over again.
* All election messages include the scores the sender is aware of.
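
Putting the pieces together, a minimal sketch of this logic (with
hypothetical names; the real implementation is spread across
ElectionLogic and ConnectionTracker) might look like::

  #include <map>

  struct ConnectivityElection {
    int epoch = 0;
    int deferred_to = -1;                // -1: no deferral yet this epoch
    std::map<int, double> peer_tracker;  // live scores, updated by pings
    std::map<int, double> stable_peer_tracker;  // frozen per epoch

    void start() {
      // Invariant 1: scores must stay static during an epoch, so we
      // branch a stable copy whenever the election (re)starts.
      // (Epoch parity handling is omitted in this toy.)
      ++epoch;
      deferred_to = -1;
      stable_peer_tracker = peer_tracker;
    }

    void maybe_defer(int peer) {
      if (deferred_to == -1 ||
          stable_peer_tracker[peer] >= stable_peer_tracker[deferred_to]) {
        deferred_to = peer;  // consistent with any earlier deferral
      } else {
        // Invariant 2 (transitive deferral) would be violated, so bump
        // the epoch and run the election over from scratch.
        start();
      }
    }
  };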

This guarantees we will resolve the election as long as the network is
reasonably stable (even if disconnected): As long as all score "views"
result in the same deferral order, an election will complete normally. And by
broadly sharing scores across the full set of monitors, monitors rapidly
converge on the global newest state.

This algorithm has one further important feature compared to the classic and
disallowed handlers: it can ignore out-of-quorum peers. Normally, whenever
a monitor B receives a Propose from an out-of-quorum peer C, B will itself trigger
a new election to give C an opportunity to join. But because the
highest-scoring monitor A may be netsplit from C, this is not desirable. So in
the connectivity election algorithm, B only "forwards" Propose messages when B's
scores indicate the cluster would choose a leader other than A.
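
That forwarding decision can be pictured like this (a toy helper, not
the real code)::

  #include <map>

  // Should monitor B restart the election because of a Propose from an
  // out-of-quorum peer? Only if B's own score totals say someone other
  // than the current leader ought to win. (Assumes current_leader is
  // present in totals.)
  bool should_forward_propose(int current_leader,
                              const std::map<int, double>& totals) {
    int best = current_leader;
    for (const auto& [rank, score] : totals)
      if (score > totals.at(best)) best = rank;
    return best != current_leader;
  }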

Connection Scoring
==================
We implement scoring within the ConnectionTracker class, which is
driven by the Elector and provided to ElectionLogic as a resource. Elector
is responsible for sending out MMonPing messages, and for reporting the
results in to the ConnectionTracker as calls to report_[live|dead]_connection
with the relevant peer and the time units the call counts for. (These time units
are seconds in the monitor, but the ConnectionTracker is agnostic and our unit
tests count simple time steps.)

We configure a "half life" and each report updates the peer's current status
(alive or dead) and its total score. For a live report, the new score is::

  new_score = current_score * (1 - units_alive / (2 * half_life))
              + units_alive / (2 * half_life)

(For a dead report, we of course subtract the new delta, rather than adding
it.)
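
To make the arithmetic concrete, here is a toy reimplementation of the
update (not the ConnectionTracker itself; the clamp at zero is an
assumption of this sketch)::

  #include <cstdio>

  // Toy half-life scoring in abstract time units, as in the unit tests.
  struct ToyScore {
    double half_life;
    double score = 0.0;  // tends toward 1.0 while alive, 0.0 while dead

    void report_live(double units) {
      double f = units / (2 * half_life);
      score = score * (1 - f) + f;
    }
    void report_dead(double units) {
      double f = units / (2 * half_life);
      score = score * (1 - f) - f;
      if (score < 0) score = 0;  // assumption: never drop below zero
    }
  };

  int main() {
    ToyScore s{/*half_life=*/4};
    s.report_live(4);  // f = 0.5:  score = 0*(1-0.5) + 0.5  = 0.500
    s.report_dead(2);  // f = 0.25: score = 0.5*0.75 - 0.25  = 0.125
    std::printf("%.3f\n", s.score);
    return 0;
  }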

We can further encode and decode the ConnectionTracker for wire transmission,
and receive_peer_report()s of a full ConnectionTracker (containing all
known scores) or a ConnectionReport (representing a single peer's scores)
to slurp up the scores from peers. These scores are of course all versioned so
we are in no danger of accidentally going backwards in time.
We can query an individual connection score (if the connection is down, it's 0)
or the total score of a specific monitor, which is the connection score from all
other monitors going into that one.
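
The versioning rule can be sketched as follows (a hypothetical
simplification of what receive_peer_report() does)::

  #include <cstdint>
  #include <map>

  // Toy model: one reporter's scores plus a monotonically increasing
  // version, so merging can never move backwards in time.
  struct Report {
    uint64_t version = 0;
    std::map<int, double> scores;  // peer rank -> connection score
  };

  void receive_report(std::map<int, Report>& known, int reporter,
                      const Report& incoming) {
    auto& current = known[reporter];
    if (incoming.version > current.version)
      current = incoming;  // newer view wins; stale reports are dropped
  }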

By default, we consider pings failed after 2 seconds (mon_elector_ping_timeout)
and ping live connections every second (mon_elector_ping_timeout divided by
mon_elector_ping_divisor). The half life is 12 hours
(mon_con_tracker_score_halflife).