Preventing Stale Reads
======================

We write synchronously to all replicas before sending an ACK to the
client, which limits the potential for inconsistency
in the write path.  However, by default we serve reads from just
one replica (the lead/primary OSD for each PG), and the
client will use whatever OSDMap it has to select the OSD from which to read.
In most cases, this is fine: either the client map is correct,
or the OSD that we think is the primary for the object knows that it
is not the primary anymore, and can feed the client an updated map
that indicates a newer primary.

The key is to ensure that this is *always* true.  In particular, we
need to ensure that an OSD that is fenced off from its peers and has
not learned about a map update does not continue to service read
requests from similarly stale clients at any point after which a new
primary may have been allowed to make a write.

We accomplish this via a mechanism that works much like a read lease.
Each pool may have a ``read_lease_interval`` property which defines
how long the lease lasts, although by default we simply set it to
``osd_pool_default_read_lease_ratio`` (default: 0.8) times the
``osd_heartbeat_grace``.  (This way the lease will generally have
expired by the time we mark a failed OSD down.)
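As a rough illustration of the default described above, the following
sketch computes the lease length from the ratio and the heartbeat grace.
The helper name ``default_read_lease_interval`` and the 20-second grace
used in the example are assumptions made for illustration only.

```python
def default_read_lease_interval(heartbeat_grace: float,
                                lease_ratio: float = 0.8) -> float:
    """Return the lease length in seconds: ratio times the heartbeat grace."""
    return lease_ratio * heartbeat_grace


# Assumed grace of 20 seconds, chosen only for this example.
grace = 20.0
lease = default_read_lease_interval(grace)

# With a ratio below 1 the lease is shorter than the grace period, so it
# will generally have run out before a failed OSD is marked down.
assert lease < grace
```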

readable_until
--------------

Primary and replica both track a couple of values:

* *readable_until* is the time until which we are allowed to service
  (read) requests, i.e., when *our* "lease" expires.
* *readable_until_ub* is an upper bound on *readable_until* for any
  OSD in the acting set.

The primary manages these two values by sending *pg_lease_t* messages
to replicas that increase the upper bound.  Once all acting OSDs have
acknowledged they've seen the higher bound, the primary increases its
own *readable_until* and shares that (in a subsequent *pg_lease_t*
message).  The resulting invariant is that any acting OSD's
*readable_until* is always <= any acting OSD's *readable_until_ub*.
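The two-step renewal can be sketched as a small simulation.  This is a
toy model, not the OSD implementation: the classes ``Primary`` and
``Replica``, the method names, and the helper ``invariant_holds`` are all
hypothetical, messages are delivered instantly, and every OSD is assumed
to share one clock.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    readable_until: float = 0.0
    readable_until_ub: float = 0.0

    def handle_lease(self, readable_until: float, readable_until_ub: float) -> float:
        """Apply a lease message and return the upper bound being acknowledged."""
        self.readable_until_ub = max(self.readable_until_ub, readable_until_ub)
        self.readable_until = max(self.readable_until, readable_until)
        return self.readable_until_ub


@dataclass
class Primary:
    replicas: list
    readable_until: float = 0.0
    readable_until_ub: float = 0.0

    def renew(self, now: float, interval: float) -> None:
        """Run one lease round at time `now`."""
        # Step 1: raise the upper bound and send it, together with the
        # primary's current readable_until, to every replica.
        proposed_ub = now + interval
        self.readable_until_ub = max(self.readable_until_ub, proposed_ub)
        acks = [r.handle_lease(self.readable_until, self.readable_until_ub)
                for r in self.replicas]
        # Step 2: only once every replica has acknowledged the higher bound
        # does the primary extend its own readable_until.  Replicas learn
        # this new value in the next round.
        if all(ack >= proposed_ub for ack in acks):
            self.readable_until = max(self.readable_until, proposed_ub)


def invariant_holds(primary: Primary) -> bool:
    """Check that every readable_until is <= every readable_until_ub."""
    osds = [primary, *primary.replicas]
    return (max(o.readable_until for o in osds)
            <= min(o.readable_until_ub for o in osds))


primary = Primary([Replica(), Replica()])
primary.renew(now=0.0, interval=16.0)
assert invariant_holds(primary)
primary.renew(now=5.0, interval=16.0)
assert invariant_holds(primary)
```

Because the upper bound is always distributed before any
*readable_until* is raised to match it, the invariant holds after every
round in this model.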

In order to avoid any problems with clock skew, we use monotonic
clocks (which are only accurate locally and unaffected by time
adjustments) throughout to manage these leases.  Peer OSDs calculate
upper and lower bounds on the deltas between OSD-local clocks,
allowing the primary to share timestamps based on its local clock
while replicas translate that to an appropriate bound for their own
local clocks.
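One way such bounds could be derived and applied is sketched below.  The
names (``ClockDeltaBounds``, ``bounds_from_exchange``, and the two
translate helpers) are hypothetical, and the choice of which bound to
apply to which value is an assumption based on staying conservative; the
actual OSD code may differ.  The sketch relies only on the fact that a
message cannot be received before it is sent.

```python
from dataclasses import dataclass


@dataclass
class ClockDeltaBounds:
    """Bounds on (replica clock - primary clock), in seconds."""
    lower: float
    upper: float


def bounds_from_exchange(primary_sent_at: float, replica_received_at: float,
                         replica_sent_at: float, primary_received_at: float
                         ) -> ClockDeltaBounds:
    # primary -> replica: the receive stamp includes transit time, so
    #   delta <= replica_received_at - primary_sent_at
    # replica -> primary: the same argument in the other direction gives
    #   delta >= replica_sent_at - primary_received_at
    return ClockDeltaBounds(lower=replica_sent_at - primary_received_at,
                            upper=replica_received_at - primary_sent_at)


def translate_readable_until(primary_ts: float, bounds: ClockDeltaBounds) -> float:
    """Earliest replica-clock time that primary_ts could correspond to."""
    return primary_ts + bounds.lower


def translate_upper_bound(primary_ts: float, bounds: ClockDeltaBounds) -> float:
    """Latest replica-clock time that primary_ts could correspond to."""
    return primary_ts + bounds.upper


# Example: the replica clock is really 100 s ahead and each message takes 0.5 s.
bounds = bounds_from_exchange(primary_sent_at=10.0, replica_received_at=110.5,
                              replica_sent_at=111.0, primary_received_at=11.5)
assert bounds.lower <= 100.0 <= bounds.upper
```

Translating a lease expiry with the lower bound can only make the
replica stop serving reads early, and translating an upper bound with
the upper bound can only make it later, so both errors are on the safe
side.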

Prior Intervals
---------------

Whenever there is an interval change, we need to have an upper bound
on the *readable_until* values for any OSDs in the prior interval.
All OSDs from that interval have this value (*readable_until_ub*), and
share it as part of the *pg_history_t* during peering.

Because peering may involve OSDs that were not already communicating
before and may not have bounds on their clock deltas, the bound in
*pg_history_t* is shared as a simple duration before the upper bound
expires.  This means that the bound slips forward in time due to the
transit time for the peering message, but that is generally quite
short, and moving the bound later in time is safe since it is an
*upper* bound.
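The effect of sharing a duration rather than a timestamp can be shown
with a short sketch.  The function names ``encode_remaining`` and
``decode_remaining`` are hypothetical, and the clock readings are made
up for the example.

```python
def encode_remaining(readable_until_ub: float, sender_now: float) -> float:
    """Turn an absolute bound on the sender's clock into a remaining duration."""
    return max(0.0, readable_until_ub - sender_now)


def decode_remaining(remaining: float, receiver_now: float) -> float:
    """Turn a remaining duration into an absolute bound on the receiver's clock."""
    return receiver_now + remaining


# The sender's clock reads 50.0 and its upper bound is 58.0: 8 s remain.
remaining = encode_remaining(readable_until_ub=58.0, sender_now=50.0)

# The receiver's unrelated monotonic clock read 1000.0 when the message was
# sent and reads 1000.25 when it arrives (0.25 s of transit time).
true_expiry_on_receiver = 1000.0 + 8.0
decoded = decode_remaining(remaining, receiver_now=1000.25)

# The decoded bound is later than the true expiry by the transit time,
# which is safe because it is only ever used as an upper bound.
assert decoded >= true_expiry_on_receiver
```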

PG "laggy" state
----------------

While the PG is active, *pg_lease_t* and *pg_lease_ack_t* messages are
regularly exchanged.  However, if a client request comes in and the
lease has expired (*readable_until* has passed), the PG will go into a
*LAGGY* state and the request will be blocked.  Once the lease is renewed,
the request(s) will be requeued.
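The blocking and requeueing behavior can be modeled roughly as follows.
``LeasedPG`` and its methods are hypothetical names for this sketch, and
the caller passes in the current time rather than reading a real
monotonic clock.

```python
from collections import deque


class LeasedPG:
    def __init__(self) -> None:
        self.readable_until = 0.0
        self.laggy = False
        self.waiting = deque()  # requests blocked while the PG is laggy
        self.served = []        # requests that were answered

    def handle_read(self, request, now: float) -> bool:
        """Serve the request if the lease is valid, otherwise block it."""
        if now >= self.readable_until:
            self.laggy = True
            self.waiting.append(request)
            return False
        self.served.append(request)
        return True

    def renew_lease(self, new_readable_until: float, now: float) -> None:
        """Extend the lease and requeue anything that was blocked."""
        self.readable_until = max(self.readable_until, new_readable_until)
        if self.laggy and now < self.readable_until:
            self.laggy = False
            pending, self.waiting = self.waiting, deque()
            for request in pending:
                self.handle_read(request, now)


pg = LeasedPG()
pg.renew_lease(10.0, now=0.0)
pg.handle_read("read-a", now=5.0)    # lease valid: served immediately
pg.handle_read("read-b", now=12.0)   # lease expired: PG becomes laggy
pg.renew_lease(28.0, now=13.0)       # renewal requeues the blocked read
```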

PG "wait" state
---------------

If peering completes but the prior interval's OSDs may still be
readable, the PG will go into the *WAIT* state until sufficient time
has passed.  Any OSD requests will block during that period.  Recovery
may proceed while in this state, since the logical, user-visible
content of objects does not change.

Dead OSDs
---------

Generally speaking, we need to wait until prior intervals' OSDs *know*
that they should no longer be readable.  If an OSD is known to have
crashed (e.g., because the process is no longer running, which we may
infer because we get an ECONNREFUSED error), then we can infer that it
is not readable.

Similarly, if an OSD is marked down, gets a map update telling it so,
and then informs the monitor that it knows it was marked down, we can
infer that it is not still serving requests for a prior interval.

When a PG is in the *WAIT* state, it will watch new maps for OSDs'
*dead_epoch* value indicating they are aware of their dead-ness.  If
all down OSDs from the prior interval are so aware, we can exit the WAIT
state early.
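The exit condition for the *WAIT* state can be sketched as a predicate
over the prior interval's bound and the latest map.  Everything here is
illustrative: ``OsdState``, ``may_leave_wait``, and the rule that an OSD
counts as dead when its *dead_epoch* is newer than the epoch at which it
last came up (``up_from``) are assumptions, not a description of the
actual map structures.

```python
from dataclasses import dataclass


@dataclass
class OsdState:
    up: bool
    up_from: int     # epoch at which the OSD last came up
    dead_epoch: int  # epoch at which it confirmed being down (0 if never)

    def knows_it_is_dead(self) -> bool:
        # Assumed rule: the confirmation must postdate the last time it came up.
        return not self.up and self.dead_epoch > self.up_from


def may_leave_wait(now: float, prior_readable_until_ub: float,
                   prior_down_osds: list, osdmap: dict) -> bool:
    """Return True once no prior-interval OSD can still be serving reads."""
    if now >= prior_readable_until_ub:
        return True  # the old leases have expired on their own
    # Early exit: every down OSD from the prior interval is known to be dead.
    return all(osdmap[osd].knows_it_is_dead() for osd in prior_down_osds)


osdmap = {
    1: OsdState(up=False, up_from=10, dead_epoch=0),   # down, not yet confirmed
    2: OsdState(up=False, up_from=8, dead_epoch=12),   # down and confirmed dead
}
still_waiting = not may_leave_wait(5.0, 16.0, [1, 2], osdmap)

# A newer map records that osd.1 has also acknowledged being marked down.
osdmap[1].dead_epoch = 13
left_early = may_leave_wait(5.0, 16.0, [1, 2], osdmap)
```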