=============
PastIntervals
=============

Purpose
-------

There are two situations where we need to consider the set of all acting-set
OSDs for a PG back to some epoch ``e``:

* During peering, we need to consider the acting set for every epoch back to
  ``last_epoch_started``, the last epoch in which the PG completed peering and
  became active.
  (See :doc:`/dev/osd_internals/last_epoch_started` for a detailed explanation.)
* During recovery, we need to consider the acting set for every epoch back to
  ``last_epoch_clean``, the last epoch at which all of the OSDs in the acting
  set were fully recovered and the acting set was full.
18
For either of these purposes, we could build such a set by iterating backwards
from the current OSDMap to the relevant epoch. Instead, we maintain a
``PastIntervals`` structure for each PG.
22
An ``interval`` is a contiguous sequence of OSDMap epochs over which the PG
mapping does not change. A new interval begins on any change to the acting
set, the up set, the primary, or several other parameters fully spelled out
in PastIntervals::check_new_interval.
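The interval-boundary test can be sketched as follows. This is a simplified
illustration, not the real PastIntervals::check_new_interval; the type
``PGMapping`` and the function ``is_new_interval`` are invented for the
example, and the real check compares many more parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using epoch_t = std::uint32_t;

// Simplified view of a PG's mapping in one OSDMap epoch (invented for
// this sketch; the real code also considers pool size, min_size,
// split/merge state, and more).
struct PGMapping {
    std::vector<int> up;      // up set
    std::vector<int> acting;  // acting set
    int up_primary = -1;
    int acting_primary = -1;
};

// Returns true if the mapping changed between two consecutive OSDMap
// epochs, i.e. a new interval begins at the newer epoch.
bool is_new_interval(const PGMapping& prev, const PGMapping& next) {
    return prev.up != next.up ||
           prev.acting != next.acting ||
           prev.up_primary != next.up_primary ||
           prev.acting_primary != next.acting_primary;
}
```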
27
Maintenance and Trimming
------------------------
30
The PastIntervals structure stores a record for each ``interval`` back to
``last_epoch_clean``. On each new ``interval`` (see the AdvMap reactions,
PeeringState::should_restart_peering, and PeeringState::start_peering_interval),
each OSD with the PG adds the new ``interval`` to its local PastIntervals.
Activation messages to OSDs which do not already have the PG contain the
sender's PastIntervals so that the recipient needn't rebuild it (see the
needs_past_intervals check in PeeringState::activate).
38
PastIntervals are trimmed in two places. First, when the primary marks the
PG clean, it clears its past_intervals instance
(PeeringState::try_mark_clean()). The replicas do the same when they
receive the updated info (see PeeringState::update_history).
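As an illustration of the bookkeeping, here is a toy model of the per-PG
record keeping. ``ToyPastIntervals`` and its members are invented for this
sketch and are not the actual Ceph structures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using epoch_t = std::uint32_t;

// Toy stand-in for PastIntervals: one record per past interval, keyed
// by the interval's first epoch.
struct ToyPastIntervals {
    struct Interval {
        epoch_t first, last;      // inclusive epoch range of the interval
        std::vector<int> acting;  // acting set during the interval
    };
    std::map<epoch_t, Interval> intervals;

    // Called when an interval change is detected
    // (cf. PeeringState::start_peering_interval).
    void add_interval(epoch_t first, epoch_t last, std::vector<int> acting) {
        intervals[first] = Interval{first, last, std::move(acting)};
    }

    // Called when the PG goes clean (cf. PeeringState::try_mark_clean):
    // nothing older than last_epoch_clean is needed anymore.
    void clear_on_clean() { intervals.clear(); }
};
```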
43
The second, more complex, case is in PeeringState::start_peering_interval. In
the event of a "map gap", we assume that the PG actually has gone clean, but we
haven't received a pg_info_t with the updated ``last_epoch_clean`` value yet.
To explain this behavior, we need to discuss OSDMap trimming.
48
OSDMap Trimming
---------------

OSDMaps are created by the Monitor quorum and gossiped out to the OSDs. The
Monitor cluster also determines when OSDs (and the Monitors) are allowed to
trim old OSDMap epochs. For the reasons explained above in this document, the
primary constraint is that we must retain all OSDMaps back to some epoch such
that all PGs have been clean at that or a later epoch (min_last_epoch_clean).
(See OSDMonitor::get_trim_to.)
58
The Monitor quorum determines min_last_epoch_clean through MOSDBeacon messages
sent periodically by each OSD. Each message contains the set of PGs for which
the OSD is primary at that moment, as well as the min_last_epoch_clean across
that set. The Monitors track these values in OSDMonitor::last_epoch_clean.
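The monitor-side computation reduces to taking a minimum over the reported
values. The sketch below is a simplified stand-in for OSDMonitor::get_trim_to;
the beacon bookkeeping is collapsed into a plain map from OSD id to that OSD's
reported min_last_epoch_clean, and the function name is invented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

using epoch_t = std::uint32_t;

// Sketch: the cluster-wide OSDMap trim bound is the minimum
// last_epoch_clean reported across all OSDs' beacons. Trimming up to
// this epoch is safe because every PG was clean at or after it.
epoch_t compute_trim_to(const std::map<int, epoch_t>& beacon_lec) {
    epoch_t trim_to = UINT32_MAX;  // no constraint if no beacons yet
    for (const auto& [osd, lec] : beacon_lec)
        trim_to = std::min(trim_to, lec);
    return trim_to;
}
```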
63
There is a subtlety in the min_last_epoch_clean value used by the OSD to
populate the MOSDBeacon. OSD::collect_pg_stats invokes PG::with_pg_stats to
obtain the lec value, which actually uses
pg_stat_t::get_effective_last_epoch_clean() rather than
info.history.last_epoch_clean. If the PG is currently clean,
pg_stat_t::get_effective_last_epoch_clean() is the current epoch rather than
last_epoch_clean. This works because the PG is clean at that epoch, and it
allows OSDMaps to be trimmed during periods when OSDMaps are being created
(due to snapshot activity, perhaps) but no PGs are undergoing ``interval``
changes.
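The effective value boils down to a single conditional. This is a sketch of
the rule described above, not the real pg_stat_t method; the free-function
form and parameter names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = std::uint32_t;

// Sketch of the idea behind pg_stat_t::get_effective_last_epoch_clean():
// a PG that is clean *now* is by definition clean at the current epoch,
// so it may report the current epoch instead of the older
// info.history.last_epoch_clean value, letting the trim bound advance
// even when no interval changes are occurring.
epoch_t effective_last_epoch_clean(bool pg_is_clean,
                                   epoch_t current_epoch,
                                   epoch_t last_epoch_clean) {
    return pg_is_clean ? current_epoch : last_epoch_clean;
}
```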
74
Back to PastIntervals
---------------------

We can now understand our second trimming case above. If OSDMaps have been
trimmed up to epoch ``e``, we know that the PG must have been clean at some
epoch >= ``e`` (indeed, **all** PGs must have been), so we can drop our
PastIntervals.
81
This dependency also pops up in PeeringState::check_past_interval_bounds().
PeeringState::get_required_past_interval_bounds takes as a parameter
oldest_epoch, which comes from OSDSuperblock::cluster_osdmap_trim_lower_bound.
We use cluster_osdmap_trim_lower_bound rather than a specific OSD's oldest_map
because an OSD does not necessarily trim all the way to
MOSDMap::cluster_osdmap_trim_lower_bound in one pass: to avoid doing too much
work at once, the number of OSDMaps trimmed is limited by
``osd_target_transaction_size`` in OSD::trim_maps(). For this reason, a
specific OSD's oldest_map can lag behind
OSDSuperblock::cluster_osdmap_trim_lower_bound for a while.
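The bound comparison at the heart of this can be sketched as a single
predicate. The function name is invented for the example, and the real
PeeringState::check_past_interval_bounds() performs considerably more
validation than this.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = std::uint32_t;

// Sketch: if the oldest epoch for which we would need past intervals
// precedes the cluster-wide OSDMap trim lower bound, the monitors must
// have seen every PG clean at or after that bound, so the stale past
// intervals can be discarded rather than treated as missing history.
bool past_intervals_still_required(epoch_t required_oldest_epoch,
                                   epoch_t cluster_osdmap_trim_lower_bound) {
    return required_oldest_epoch >= cluster_osdmap_trim_lower_bound;
}
```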
92
See https://tracker.ceph.com/issues/49689 for an example.