.. _dev_mon_elections:

=================
Monitor Elections
=================

The Original Algorithm
======================
Historically, monitor leader elections have been very simple: the lowest-ranked
monitor wins!

This is accomplished using a low-state "Elector" module (though it has now
been split into an Elector that handles message-passing and an ElectionLogic
that makes the voting choices). It tracks the election epoch and not much
else. Odd epochs are elections; even epochs have a leader and let the monitor
do its ongoing work. When a timeout occurs or the monitor asks for a
new election, we bump the epoch and send out Propose messages to all known
monitors.

In general, if we receive an old message we either drop it or trigger a new
election (if we think the sender is newly booted and needs to join quorum). If
we receive a message from a newer epoch, we bump up our epoch to match and
either Defer to the Proposer, or else bump the epoch again and Propose
ourselves if we expect to win over them. When we receive a Propose within
our current epoch, we either Defer to the sender or ignore them (we ignore them
if they are of a higher rank than us, or higher than the rank we have already
deferred to).

(Note that if we have the highest rank it is possible for us to defer to every
other monitor in sequence within the same election epoch!)

This resolves under normal circumstances because all monitors agree on the
priority voting order, and epochs are only bumped when a monitor isn't
participating or sees a possible conflict with the known proposers.

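The per-Propose decision above can be sketched as follows. The function name
and shape are illustrative only (not the actual Elector/ElectionLogic API);
lower rank numbers win:

```python
# Hypothetical sketch of the classic per-Propose decision: defer only to a
# proposer that outranks both us and anyone we have already deferred to.
# Lower rank numbers are "better" (rank 0 wins by default).

def handle_propose(my_rank, proposer_rank, deferred_to=None):
    """deferred_to is the best (lowest) rank we have deferred to this
    epoch, or None if we have not deferred yet."""
    best = my_rank if deferred_to is None else min(my_rank, deferred_to)
    if proposer_rank < best:
        return "defer"   # proposer beats us and our current choice
    return "ignore"      # we (or our current choice) should win instead
```

With three monitors of ranks 0, 1, and 2, rank 2 may defer first to rank 1
and then to rank 0 within the same epoch, matching the note above.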
The Problems
============
The original algorithm didn't work at all under a variety of netsplit
conditions. This didn't manifest often in practice but has become
important as the community and commercial vendors move Ceph into
spaces requiring the use of "stretch clusters".

The New Algorithms
==================
We still default to the original ("classic") election algorithm, but
support letting users change to new ones via the CLI. These
algorithms are implemented as different functions and switch statements
within the ElectionLogic class.

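As commonly documented for recent Ceph releases, switching strategies from
the CLI looks like the following (check the exact command against your
version):

```shell
# Select the election strategy cluster-wide (one of the three modes).
ceph mon set election_strategy classic
ceph mon set election_strategy disallow
ceph mon set election_strategy connectivity

# "ceph mon dump" reports the currently configured election_strategy.
ceph mon dump | grep election_strategy
```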
The first algorithm is very simple: "disallow" lets you add monitors
to a list of disallowed leaders.
The second, "connectivity", incorporates connection score ratings
and elects the monitor with the best score.

Algorithm: disallow
===================
If a monitor is in the disallowed list, it always defers to another
monitor, no matter the rank. Otherwise, it behaves the same as the classic
algorithm.
Since changing the disallowed list requires a paxos update, monitors
in an election together should always have the same set. This means
the election order is constant and static across the full monitor set,
and elections resolve trivially (assuming a connected network).

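As a sketch, "disallow" only adds one check in front of the classic rank
comparison. The helper below is hypothetical, and it assumes a monitor also
refuses to defer *to* a disallowed peer:

```python
# Hypothetical decision helper for the "disallow" strategy. Lower rank
# wins, exactly as in the classic algorithm, except that a monitor on the
# disallowed list never leads (and is never deferred to).

def should_defer(my_name, my_rank, peer_name, peer_rank, disallowed):
    if my_name in disallowed:
        return True                 # we may never lead, so always defer
    if peer_name in disallowed:
        return False                # never help a disallowed peer win
    return peer_rank < my_rank      # otherwise: classic lowest-rank-wins
```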
This algorithm really just exists as a demo and stepping-stone to
the more advanced connectivity mode, but it may have utility in asymmetric
networks and clusters.

Algorithm: connectivity
=======================
This algorithm takes as input scores for each connection
(both ways, discussed in the next section) and attempts to elect the monitor
with the highest total score. We keep the same basic message-passing flow as the
classic algorithm, in which elections are driven by reacting to Propose messages.
But this has several challenges since, unlike ranks, scores are not static (and
might change during an election!). To guarantee an election epoch does not
produce multiple leaders, we must maintain two key invariants:

* Monitors must maintain static scores during an election epoch.
* Any deferral must be transitive -- if A defers to B and then to C,
  B had better defer to C as well!

We handle these very explicitly: by branching a copy, ``stable_peer_tracker``,
of our ``peer_tracker`` scoring object whenever starting an election (or
bumping the epoch), and by refusing to defer to a monitor if it won't
be deferred to by our current leader choice. (All Propose messages include
a copy of the scores the leader is working from, so peers can evaluate them.)

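Under simplifying assumptions (a plain dict of total scores standing in for
the real tracker), the two mechanisms might look like this sketch; none of
these names match the actual implementation:

```python
import copy

# Sketch of the two invariants: (1) scores are frozen for the epoch by
# deep-copying the live tracker, and (2) a new deferral is only allowed if
# our current choice would also defer to the candidate, i.e. the candidate
# scores strictly higher. A plain dict stands in for the real tracker.

class ElectionSketch:
    def __init__(self, peer_tracker):
        self.peer_tracker = peer_tracker   # live scores; may keep changing
        self.stable_peer_tracker = None    # frozen copy for this epoch
        self.deferred_to = None

    def start_epoch(self):
        self.stable_peer_tracker = copy.deepcopy(self.peer_tracker)
        self.deferred_to = None

    def may_defer_to(self, candidate):
        scores = self.stable_peer_tracker
        if self.deferred_to is None:
            return True
        # Transitivity check: our current choice must also lose to the
        # candidate, or we refuse the deferral.
        return scores[candidate] > scores[self.deferred_to]
```

When ``may_defer_to`` returns ``False``, the real logic does not simply get
stuck; the adjustments described next (bumping the epoch and restarting)
keep the election moving.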
Of course, those modifications can easily block. To guarantee forward progress,
we make several further adjustments:

* If we want to defer to a new peer, but have already deferred to a peer
  whose scores don't allow that, we bump the election epoch and ``start()``
  the election over again.
* All election messages include the scores the sender is aware of.

This guarantees we will resolve the election as long as the network is
reasonably stable (even if disconnected): as long as all score "views"
result in the same deferral order, an election will complete normally. And by
broadly sharing scores across the full set of monitors, monitors rapidly
converge on the global newest state.

This algorithm has one further important feature compared to the classic and
disallow handlers: it can ignore out-of-quorum peers. Normally, whenever
a monitor B receives a Propose from an out-of-quorum peer C, B will itself trigger
a new election to give C an opportunity to join. But because the
highest-scoring monitor A may be netsplit from C, this is not desirable. So in
the connectivity election algorithm, B only "forwards" Propose messages when B's
scores indicate the cluster would choose a leader other than A.

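A minimal sketch of that forwarding decision, again using a plain dict of
total scores as an assumed stand-in for the real tracker:

```python
# Hypothetical check monitor B runs on a Propose from an out-of-quorum
# peer: only restart the election if B's own scores already say the
# cluster would elect someone other than the current leader.

def should_restart_election(total_scores, current_leader):
    best = max(total_scores, key=total_scores.get)
    return best != current_leader
```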
Connection Scoring
==================
We implement scoring within the ConnectionTracker class, which is
driven by the Elector and provided to ElectionLogic as a resource. Elector
is responsible for sending out MMonPing messages, and for reporting the
results to the ConnectionTracker as calls to ``report_[live|dead]_connection``
with the relevant peer and the time units the call counts for. (These time units
are seconds in the monitor, but the ConnectionTracker is agnostic and our unit
tests count simple time steps.)

We configure a "half life" and each report updates the peer's current status
(alive or dead) and its total score. For a live report, the new score is::

  new_score = current_score * (1 - units_alive / (2 * half_life))
              + units_alive / (2 * half_life)

(For a dead report, we of course subtract the new delta, rather than adding it.)

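The update is easy to express directly. Clamping the result to [0, 1] is an
assumption in this sketch, not something stated above:

```python
# The score update from the formula above. For a dead report the delta is
# subtracted instead of added. Clamping to [0, 1] is an assumption here.

def update_score(current_score, units, half_life, alive=True):
    delta = units / (2.0 * half_life)
    new = current_score * (1.0 - delta) + (delta if alive else -delta)
    return min(1.0, max(0.0, new))
```

For example, a fresh connection (score 0) reported live for one full
half-life of time units moves to 0.5.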
We can further encode and decode the ConnectionTracker for wire transmission,
and ``receive_peer_report()`` accepts either a full ConnectionTracker (containing
all known scores) or a ConnectionReport (representing a single peer's scores)
to slurp up the scores from peers. These scores are all versioned, so
we are in no danger of accidentally going backwards in time.
We can query an individual connection score (if the connection is down, it's 0)
or the total score of a specific monitor, which is the sum of the connection
scores from all other monitors going into that one.

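The two queries can be sketched over a nested dict keyed by reporter and then
peer; this layout is an illustration, not ConnectionTracker's actual
encoding:

```python
# Hypothetical store: reports[frm][to] = {"alive": bool, "score": float}.

def connection_score(reports, frm, to):
    entry = reports.get(frm, {}).get(to)
    if entry is None or not entry["alive"]:
        return 0.0                      # a down connection scores zero
    return entry["score"]

def total_score(reports, mon):
    # Sum the scores reported toward `mon` by every other monitor.
    return sum(connection_score(reports, frm, mon)
               for frm in reports if frm != mon)
```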
By default, we consider pings failed after 2 seconds (``mon_elector_ping_timeout``)
and ping live connections every second (``mon_elector_ping_divisor``). The
half-life is 12 hours (``mon_con_tracker_score_halflife``).
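These defaults can be inspected through the usual config mechanism; the
option names come from the text above, and the noted values are the stated
defaults:

```shell
# Inspect the election/scoring knobs (names from the section above).
ceph config get mon mon_elector_ping_timeout          # default: 2 seconds
ceph config get mon mon_elector_ping_divisor
ceph config get mon mon_con_tracker_score_halflife    # default: 12 hours
```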