Preventing Stale Reads
======================

We write synchronously to all replicas before sending an ACK to the
client, which limits the potential for inconsistency
in the write path.  However, by default we serve reads from just
one replica (the lead/primary OSD for each PG), and the
client will use whatever OSDMap it has to select the OSD from which to read.
In most cases, this is fine: either the client map is correct,
or the OSD that we think is the primary for the object knows that it
is not the primary anymore, and can feed the client an updated map
that indicates a newer primary.

The key is to ensure that this is *always* true.  In particular, we
need to ensure that an OSD that is fenced off from its peers and has
not learned about a map update does not continue to service read
requests from similarly stale clients at any point after which a new
primary may have been allowed to make a write.

We accomplish this via a mechanism that works much like a read lease.
Each pool may have a ``read_lease_interval`` property which defines
how long the lease lasts, although by default we simply set it to
``osd_pool_default_read_lease_ratio`` (default: 0.8) times the
``osd_heartbeat_grace``.  (This way the lease will generally have
expired by the time we mark a failed OSD down.)

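As a rough illustration of the default above, the arithmetic can be
sketched in Python.  The function name and the 20 second grace value are
assumptions made for this example; they are not taken from the Ceph
source.

.. code-block:: python

   def default_read_lease_interval(heartbeat_grace: float,
                                   lease_ratio: float = 0.8) -> float:
       """Return the default read lease length, in seconds.

       Mirrors the rule in the text: ratio times heartbeat grace.
       """
       if heartbeat_grace <= 0 or lease_ratio <= 0:
           raise ValueError("grace and ratio must be positive")
       return lease_ratio * heartbeat_grace


   # Illustrative value only: a 20 second heartbeat grace.
   lease = default_read_lease_interval(20.0)
   # With a ratio below 1 the lease is shorter than the grace period,
   # so it runs out before the OSD would be marked down.
   assert lease < 20.0
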
readable_until
--------------

Primary and replica both track a couple of values:

* *readable_until* is how long we are allowed to service (read)
  requests before *our* "lease" expires.
* *readable_until_ub* is an upper bound on *readable_until* for any
  OSD in the acting set.

The primary manages these two values by sending *pg_lease_t* messages
to replicas that increase the upper bound.  Once all acting OSDs have
acknowledged they've seen the higher bound, the primary increases its
own *readable_until* and shares that (in a subsequent *pg_lease_t*
message).  The resulting invariant is that any acting OSD's
*readable_until* is always <= any acting OSD's *readable_until_ub*.

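The exchange described above can be modeled with a small Python sketch.
The ``PrimaryLease`` class, its methods, and the replica names are
hypothetical and exist only for illustration; the real logic lives in
the OSD's C++ code and differs in detail.

.. code-block:: python

   from dataclasses import dataclass, field


   @dataclass
   class PrimaryLease:
       """Toy model of the primary's view of one PG lease."""
       interval: float                 # lease length in seconds
       replicas: set                   # ids of the acting replicas
       readable_until: float = 0.0
       readable_until_ub: float = 0.0
       acked_ub: dict = field(default_factory=dict)

       def propose(self, now: float) -> float:
           """Raise the upper bound; the result is sent to replicas."""
           self.readable_until_ub = max(self.readable_until_ub,
                                        now + self.interval)
           return self.readable_until_ub

       def ack(self, replica: str, ub_seen: float) -> None:
           """Record the highest upper bound a replica has acknowledged."""
           # Never trust an ack above what we actually proposed.
           ub_seen = min(ub_seen, self.readable_until_ub)
           self.acked_ub[replica] = max(self.acked_ub.get(replica, 0.0),
                                        ub_seen)
           # Extend our own lease only to a bound every replica has seen.
           if all(r in self.acked_ub for r in self.replicas):
               common = min((self.acked_ub[r] for r in self.replicas),
                            default=self.readable_until_ub)
               self.readable_until = max(self.readable_until, common)

In this model ``readable_until`` can never exceed ``readable_until_ub``,
because it is only ever raised to a bound that was proposed first and
then acknowledged by every replica.
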
In order to avoid any problems with clock skew, we use monotonic
clocks (which are only accurate locally and unaffected by time
adjustments) throughout to manage these leases.  Peer OSDs calculate
upper and lower bounds on the deltas between OSD-local clocks,
allowing the primary to share timestamps based on its local clock
while replicas translate that to an appropriate bound for their own
local clocks.

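One way to picture the translation is the following Python sketch.  It
assumes the replica knows only that the offset between the two clocks
(``replica_clock - primary_clock``) lies somewhere in
``[delta_lb, delta_ub]``.  The function and parameter names are
hypothetical, and the choice of which bound applies to which value is
this example's reading of "appropriate bound", not a statement about the
actual implementation.

.. code-block:: python

   def to_replica_clock(primary_readable_until: float,
                        primary_readable_until_ub: float,
                        delta_lb: float,
                        delta_ub: float) -> tuple:
       """Translate primary-clock timestamps into replica-clock bounds."""
       if delta_lb > delta_ub:
           raise ValueError("delta_lb must not exceed delta_ub")
       # Use the smallest possible offset so we never serve reads for
       # longer than the primary intended.
       local_readable_until = primary_readable_until + delta_lb
       # Use the largest possible offset so the result is still a valid
       # upper bound on the replica's clock.
       local_readable_until_ub = primary_readable_until_ub + delta_ub
       return local_readable_until, local_readable_until_ub


   # Illustrative numbers: the offset is known to within half a second.
   ru, ru_ub = to_replica_clock(116.0, 121.0, 999.5, 1000.5)
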
Prior Intervals
---------------

Whenever there is an interval change, we need to have an upper bound
on the *readable_until* values for any OSDs in the prior interval.
All OSDs from that interval have this value (*readable_until_ub*), and
share it as part of the *pg_history_t* during peering.

Because peering may involve OSDs that were not already communicating
before and may not have bounds on their clock deltas, the bound in
*pg_history_t* is shared as a simple duration: the time remaining until
the upper bound expires.  This means that the bound slips forward in
time by the transit time of the peering message, but that is generally
quite short, and moving the bound later in time is safe since it is an
*upper* bound.

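The effect of sharing a duration rather than a timestamp can be seen in
a short Python sketch.  The helper names and all clock values below are
made up for the example.

.. code-block:: python

   def encode_remaining(readable_until_ub: float, sender_now: float) -> float:
       """Sender side: time left until the bound expires (never negative)."""
       return max(0.0, readable_until_ub - sender_now)


   def decode_remaining(remaining: float, receiver_now: float) -> float:
       """Receiver side: rebuild the bound on the receiver's own clock."""
       return receiver_now + remaining


   # The sender's clock reads 500.0 and its bound expires at 510.0.
   remaining = encode_remaining(510.0, sender_now=500.0)

   # At the moment of sending, the receiver's clock read 9000.0, so the
   # true expiry on the receiver's clock is 9010.0.  The message takes
   # 0.25 seconds to arrive.
   true_expiry = 9000.0 + remaining
   decoded = decode_remaining(remaining, receiver_now=9000.25)

   # The decoded bound is later than the true expiry by the transit time,
   # which is safe because it is only used as an upper bound.
   assert decoded >= true_expiry
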
PG "laggy" state
----------------

While the PG is active, *pg_lease_t* and *pg_lease_ack_t* messages are
regularly exchanged.  However, if a client request comes in and the
lease has expired (*readable_until* has passed), the PG will go into a
*LAGGY* state and the request will be blocked.  Once the lease is
renewed, the request(s) will be requeued.

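The block-and-requeue behavior can be modeled as below.  This is a toy
Python sketch; the ``LaggyPG`` class and its method names are
hypothetical and are not the OSD's actual interfaces.

.. code-block:: python

   from collections import deque


   class LaggyPG:
       """Toy PG that blocks reads once its lease has expired."""

       def __init__(self, readable_until: float):
           self.readable_until = readable_until
           self.laggy = False
           self.waiting = deque()

       def handle_read(self, op: str, now: float):
           """Serve the read, or park it if the lease has run out."""
           if now >= self.readable_until:
               self.laggy = True
               self.waiting.append(op)
               return None
           return f"served {op}"

       def renew_lease(self, new_readable_until: float, now: float) -> list:
           """Extend the lease and return the ops that can be requeued."""
           self.readable_until = max(self.readable_until, new_readable_until)
           if now >= self.readable_until:
               return []          # still expired, stay laggy
           self.laggy = False
           requeued = list(self.waiting)
           self.waiting.clear()
           return requeued
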
PG "wait" state
---------------

If peering completes but the prior interval's OSDs may still be
readable, the PG will go into the *WAIT* state until sufficient time
has passed.  Any OSD requests will block during that period.  Recovery
may proceed while in this state, since the logical, user-visible
content of objects does not change.

Dead OSDs
---------

Generally speaking, we need to wait until prior intervals' OSDs *know*
that they should no longer be readable.  If an OSD is known to have
crashed (e.g., because the process is no longer running, which we may
infer because we get an ECONNREFUSED error), then we can infer that it
is not readable.

Similarly, if an OSD is marked down, gets a map update telling it so,
and then informs the monitor that it knows it was marked down, we can
infer that it is no longer serving requests for a prior interval.

When a PG is in the *WAIT* state, it will watch new maps for OSDs'
*dead_epoch* value indicating they are aware of their dead-ness.  If
all down OSDs from the prior interval are so aware, we can exit the
WAIT state early.
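
The early-exit check can be sketched in Python as follows.  The function
and parameter names are hypothetical.  In particular, treating an OSD as
"aware" when its *dead_epoch* is at least the epoch in which the prior
interval ended is an assumption made for this example, not a description
of the exact comparison used by the OSD.

.. code-block:: python

   def can_leave_wait(now: float,
                      prior_readable_until_ub: float,
                      down_prior_osds: set,
                      dead_epoch: dict,
                      interval_end_epoch: int) -> bool:
       """Decide whether a PG in WAIT may start serving requests."""
       # Normal exit: the prior interval's upper bound has passed.
       if now >= prior_readable_until_ub:
           return True
       # Early exit: every down OSD from the prior interval has confirmed
       # that it knows it is dead (assumed test: dead_epoch is recent
       # enough).  OSDs with no recorded dead_epoch count as unaware.
       # With no down OSDs the check is vacuously true; this sketch
       # assumes the remaining prior OSDs took part in peering.
       return all(dead_epoch.get(osd, 0) >= interval_end_epoch
                  for osd in down_prior_osds)
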