.. _log-based-pg:

============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular pg through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-pg version number
-- this is referred to in the code with type eversion_t and stored as
pg_info_t::last_update. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up to date locally (see recovery and
backfill). In practice, the log will extend much further
(osd_min_pg_log_entries when clean, osd_max_pg_log_entries when not
clean) because it's handy for quickly performing recovery.
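
The two-number summary above can be sketched as follows. This is a
simplified stand-in with invented names (the real type is eversion_t in
osd_types.h): a map epoch plus a monotonically increasing per-PG version,
compared epoch-first.

```cpp
#include <cstdint>
#include <tuple>

// Simplified sketch (names invented; the real type is eversion_t in
// osd_types.h): the current state of a PG is summarized by the map
// epoch in which the most recent write started plus a monotonically
// increasing per-PG version number.
struct eversion_sketch {
  uint64_t epoch = 0;    // map epoch of the interval the write started in
  uint64_t version = 0;  // per-PG sequence number, bumped on each write

  bool operator<(const eversion_sketch &rhs) const {
    // Compare by epoch first, then by version.
    return std::tie(epoch, version) < std::tie(rhs.epoch, rhs.version);
  }
};
```

A write started in a newer interval always sorts after one started in an
older interval, regardless of the per-PG version numbers.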

Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: we can choose any log entry
between the oldest head remembered by an element of that set (anything
newer cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and ec pools in
PG/PrimaryLogPG: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them.
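
The choice described above can be sketched like this. This is a
hypothetical illustration, not the real peering code: it reduces each
contacted OSD's remembered log head to a single version number and picks
one end of the valid range.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch (not the real peering code): given the log heads
// remembered by the OSDs we contacted, any entry between the oldest and
// the newest head is a valid choice. Replicated pools take the newest,
// saving clients from replaying operations; EC pools take the oldest,
// keeping every entry past the chosen head rollbackable.
uint64_t choose_head(const std::vector<uint64_t> &peer_heads, bool ec_pool) {
  auto [oldest, newest] =
      std::minmax_element(peer_heads.begin(), peer_heads.end());
  return ec_pool ? *oldest : *newest;
}
```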

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an ec object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a 4+2 encoding
where 3 of the shards remember a write, but the other 3 do not --
neither version of the object has the 4 shards needed to reconstruct
it). For this reason, log entries representing *unstable* writes
(writes not yet committed to the client) must be rollbackable using
only local information on ec pools. Log entries in general may
therefore be rollbackable (in that case, either via delayed
application or via a set of instructions for rolling back an in-place
update) or not. Replicated pool log entries are never rollbackable.
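
The rollbackable/not-rollbackable distinction can be sketched as a
tagged union on the log entry. All names here are invented for
illustration; the real structure is pg_log_entry_t in osd_types.h.

```cpp
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical sketch of the distinction above (the real structure is
// pg_log_entry_t in osd_types.h): an EC log entry for an unstable write
// must carry enough purely local information to undo the write.
struct RollbackToAppendOffset { uint64_t old_size; };  // undo an append
struct RollbackFromStash { std::string stash_id; };    // undo a write-aside
struct NotRollbackable {};  // replicated-pool entries are always this

struct log_entry_sketch {
  uint64_t version;
  std::variant<RollbackToAppendOffset, RollbackFromStash, NotRollbackable>
      rollback_info;

  bool can_rollback() const {
    return !std::holds_alternative<NotRollbackable>(rollback_info);
  }
};
```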

For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
osd_types.h:pg_log_entry_t, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

So, the fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
implementations of PrimaryLogPG, one for each of the two, if there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for ec
#. How writes work -- either restricted to append, or must write aside
   and do a two-phase commit (tpc)
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of osds
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over
the differences:

#. PGBackend
#. PGTransaction
#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for ec
#. Misc other kludges here and there

PGBackend and PGTransaction abstract over differences 1 and 2, and allow
the extra rollback information from difference 4 to be added to the log
entries as needed.

The replicated implementation is in ReplicatedBackend.h/cc and doesn't
require much explanation, I think. More detail on the ECBackend can be
found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.
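
The shape of that abstraction can be sketched as below. The names here
are invented for illustration (the real interface lives in
src/osd/PGBackend.h): PrimaryLogPG drives a backend through a narrow
virtual interface, and the replicated and EC implementations differ in
how reads and writes are actually carried out.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch of the abstraction (names invented; the real
// interface is in src/osd/PGBackend.h).
struct PGTransactionSketch {
  std::string oid;
  std::string data;  // in this sketch a write just replaces the contents
};

class PGBackendSketch {
public:
  virtual ~PGBackendSketch() = default;
  virtual void objects_read_async(
      const std::string &oid,
      std::function<void(std::string)> on_complete) = 0;
  virtual void submit_transaction(PGTransactionSketch t) = 0;
};

// Toy "replicated" backend: reads are served from the local store and
// writes are destructive in-place updates -- exactly what EC cannot do.
class ReplicatedBackendSketch : public PGBackendSketch {
  std::map<std::string, std::string> store;
public:
  void objects_read_async(
      const std::string &oid,
      std::function<void(std::string)> on_complete) override {
    on_complete(store[oid]);  // local data, so it completes immediately
  }
  void submit_transaction(PGTransactionSketch t) override {
    store[t.oid] = std::move(t.data);
  }
};
```

An EC implementation would override the same two methods but gather
shards remotely for reads and stage rollbackable updates for writes.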

PGBackend Interface Explanation
===============================

Note: this is from a design document that predates the original firefly
release and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable iff it is present on
the primary (at the right version). For an ec pool, we need at least
M shards present to do a read, and we need it on the primary. For
this reason, PGBackend needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that the
PG's objects are recoverable.

Core Changes:

- PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
  objects to allow the user to make these determinations.

Client Reads
------------

Reads with the replicated strategy can always be satisfied
synchronously out of the primary OSD. With an erasure coded strategy,
the primary will need to request data from some number of replicas in
order to satisfy a read. PGBackend will therefore need to provide
separate objects_read_sync and objects_read_async interfaces where
the former won't be implemented by the ECBackend.

PGBackend interfaces:

- objects_read_sync
- objects_read_async
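
The sync/async split can be sketched with an optional return value.
This is a hypothetical illustration with invented names: the EC case
returns nothing, forcing callers onto the async path that gathers
shards from remote OSDs, while the replicated case answers straight
from the primary's local copy.

```cpp
#include <optional>
#include <string>

// Hypothetical sketch (names invented): a backend may or may not be
// able to satisfy a read synchronously from local data.
std::optional<std::string>
objects_read_sync_sketch(bool is_ec, const std::string &local_copy) {
  if (is_ec)
    return std::nullopt;  // cannot reconstruct from the primary alone
  return local_copy;  // replicated: serve the local copy directly
}
```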

Scrub
-----

We currently have two scrub modes with different default frequencies:

#. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
#. deep scrub: compares the set of objects, metadata, and a crc32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a crc32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.

Most of this can work essentially unchanged with an erasure coded PG
with the caveat that the PGBackend implementation must be in charge of
actually doing the scan.

PGBackend interfaces:

- be_*

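The comparison step above can be sketched as follows. This is a
hypothetical reduction (names invented): each replica reports a
"scrubmap" of object name to checksum, and the primary flags any object
that is missing on some replica or whose checksums disagree.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of the primary's scrubmap comparison.
using ScrubMapSketch = std::map<std::string, uint32_t>;

std::set<std::string>
find_inconsistent(const std::vector<ScrubMapSketch> &maps) {
  std::set<std::string> inconsistent;
  for (const auto &m : maps)
    for (const auto &[oid, crc] : m)
      for (const auto &other : maps) {
        auto it = other.find(oid);
        if (it == other.end() || it->second != crc)
          inconsistent.insert(oid);  // missing somewhere, or crc mismatch
      }
  return inconsistent;
}
```
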
Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.

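The EC recovery shape described above can be sketched like this. All
names are invented for illustration; the real work happens in ECBackend
and the erasure-code plugins. For a k+m code: gather any k surviving
chunks, reconstruct the object, and re-encode replacement chunks for
the lost shards.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of EC recovery for a k+m code (names invented).
struct ECRecoverySketch {
  size_t k;  // chunks needed to reconstruct the object

  std::map<int, std::string>
  recover(const std::map<int, std::string> &surviving_chunks,
          const std::vector<int> &missing_shards) const {
    if (surviving_chunks.size() < k)
      throw std::runtime_error("unrecoverable: fewer than k chunks");
    // A real implementation would decode the object from k chunks here,
    // then re-encode the specific chunks that were lost.
    std::map<int, std::string> replacements;
    for (int shard : missing_shards)
      replacements[shard] = "<re-encoded chunk for shard>";
    return replacements;
  }
};
```
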
Another difference is that objects in an erasure coded pg may be
unrecoverable without being unfound: some chunks may be present, just
not enough of them to reconstruct the object. The "unfound" concept
should probably then be renamed to unrecoverable. Also, the
PGBackend implementation will have to be able to direct the search
for pg replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.

Core changes:

- s/unfound/unrecoverable

PGBackend interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_