.. _log-based-pg:

============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all Ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded (EC) and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG, since that lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem,
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-PG version number
-- the latter is referred to in the code with type ``eversion_t`` and
stored as ``pg_info_t::last_update``. Furthermore, we maintain a log of
"recent" operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up-to-date locally (see recovery and
backfill). In practice, the log will extend much further
(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when
not clean) because it's handy for quickly performing recovery.
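
The two numbers above can be modeled with a short sketch (Python for
illustration; ``Eversion`` and its fields are hypothetical stand-ins for
``eversion_t``): pairs order first by epoch and then by per-PG version,
so the maximum over the log plays the role of ``pg_info_t::last_update``.

```python
from functools import total_ordering

@total_ordering
class Eversion:
    """Toy model of ``eversion_t``: (map epoch, per-PG version).

    Illustrative only -- names and methods are not Ceph's actual API.
    """
    def __init__(self, epoch, version):
        self.epoch = epoch      # epoch of the map in which the write started
        self.version = version  # increasing per-PG version number

    def _key(self):
        # Epoch dominates: a write from a newer interval orders after
        # every write from an older one, regardless of version.
        return (self.epoch, self.version)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

# The PG's current state is summarized by the newest entry's eversion:
log = [Eversion(10, 5), Eversion(10, 6), Eversion(11, 1)]
last_update = max(log)   # plays the role of pg_info_t::last_update
```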

Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: as the new head, we can choose
any log entry between the oldest head remembered by an element of that
set (anything newer cannot have completed without that log containing
it) and the newest head remembered (clearly, all writes in the log were
started, so it's fine for us to remember them). This is the
main point of divergence between replicated pools and EC pools in
``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them.
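
A minimal sketch of that choice, assuming heads compare like eversions
(the function name is hypothetical): any head between the oldest and
newest reported by the contacted OSDs is safe, and the pool type picks
the end of the range.

```python
def choose_new_head(reported_heads, pool_is_ec):
    """Pick the new log head from the heads reported by the contacted
    OSDs during peering (a sketch of the freedom described above).

    Replicated pools take the newest reported head, avoiding client
    replay; EC pools take the oldest, keeping unstable entries
    rollbackable.  Heads are assumed comparable, e.g. (epoch, version).
    """
    return min(reported_heads) if pool_is_ec else max(reported_heads)

heads = [(10, 5), (10, 7), (10, 6)]
replicated_head = choose_new_head(heads, pool_is_ec=False)
ec_head = choose_new_head(heads, pool_is_ec=True)
```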

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an EC object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
where 3 of the replicas remember a write, but the other 3 do not -- we
don't have 3 copies of either version). For this reason, log entries
representing *unstable* writes (writes not yet committed to the
client) must be rollbackable using only local information on EC pools.
Log entries in general may therefore be rollbackable (and in that case,
rolled back either via a delayed application or via a set of instructions
for rolling back an in-place update) or not. Replicated pool log entries
are never rollbackable.
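
The rollback idea can be sketched as follows (hypothetical helper, not
Ceph code; the real machinery lives in ``PGLog`` and the backends):
before applying an unstable write, a shard saves the bytes it is about
to overwrite, so the entry can later be undone using only local
information.

```python
class RollbackableWrite:
    """Sketch of a rollbackable log entry for an in-place update.

    The saved extent is the 'set of instructions' for rolling the
    write back locally, without consulting any other shard.
    """
    def __init__(self, store, oid, offset, data):
        old = store.get(oid, b"")
        self.store, self.oid, self.offset = store, oid, offset
        self.saved = old[offset:offset + len(data)]  # undo information
        self.wrote = len(data)
        buf = bytearray(old)
        buf[offset:offset + len(data)] = data        # apply the write
        store[oid] = bytes(buf)

    def rollback(self):
        # Undo using only locally saved state -- no other shard needed.
        buf = bytearray(self.store[self.oid])
        buf[self.offset:self.offset + self.wrote] = self.saved
        self.store[self.oid] = bytes(buf)

store = {"obj": b"abcd"}
w = RollbackableWrite(store, "obj", 2, b"XYZ")
```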

For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
``osd_types.h:pg_log_entry_t``, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

The fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
implementations of ``PrimaryLogPG``, since there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside
   and do a two-phase commit
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of OSDs
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over
the differences:

#. ``PGBackend``
#. ``PGTransaction``
#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and
   ``calc_ec_acting``
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for EC
#. Misc other kludges here and there

``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2
above, and the addition of the extra information of difference 4 to the log
entries as needed.
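
Difference 2 can be sketched like this (all names hypothetical): a
replicated backend applies a transaction destructively in place, while
an EC backend stages the update aside and commits in a second phase, so
nothing is destroyed while the write is still unstable.

```python
class PGTransactionSketch:
    """Toy stand-in for ``PGTransaction``: an ordered list of writes."""
    def __init__(self):
        self.writes = []                  # (oid, offset, data)

    def write(self, oid, offset, data):
        self.writes.append((oid, offset, data))

def apply_replicated(store, txn):
    # Replicated pools: destructive, in-place application.
    for oid, off, data in txn.writes:
        buf = bytearray(store.get(oid, b""))
        buf[off:off + len(data)] = data
        store[oid] = bytes(buf)

def prepare_ec(staging, txn):
    # EC pools, phase one of a two-phase commit: write aside only.
    # The old object is untouched, so the write stays rollbackable.
    for oid, off, data in txn.writes:
        staging.setdefault(oid, []).append((off, data))

def commit_ec(store, staging):
    # Phase two: fold the staged extents into the objects destructively.
    for oid, extents in staging.items():
        buf = bytearray(store.get(oid, b""))
        for off, data in extents:
            buf[off:off + len(data)] = data
        store[oid] = bytes(buf)
    staging.clear()
```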

The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
require much additional explanation. More detail on the ``ECBackend`` can be
found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.

PGBackend Interface Explanation
===============================

Note: this is from a design document that predated the Firefly release
and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable IFF it is present on
the primary (at the right version). For an EC pool, we need at least
`k` shards present to perform a read, and we need it on the primary. For
this reason, ``PGBackend`` needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to proceed.

Core Changes:

- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
  | objects to allow the user to make these determinations.
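
A sketch of what such a predicate might look like (this simplified
model, with hypothetical names, takes only whether the primary has its
shard and how many shards are present and current):

```python
def make_is_readable(pool_is_ec, k=None):
    """Build a readability predicate over the set of current shards.

    Simplified model of ``IsPGReadablePredicate``: for replicated pools
    the primary's copy suffices; for an EC pool with ``k`` data shards,
    the primary's shard plus at least ``k`` present shards are required.
    """
    def is_readable(primary_has_shard, n_current_shards):
        if not primary_has_shard:
            return False          # the primary must hold its piece
        if pool_is_ec:
            return n_current_shards >= k
        return True               # replicated: one full copy is enough
    return is_readable

replicated_ok = make_is_readable(pool_is_ec=False)
ec_ok = make_is_readable(pool_is_ec=True, k=4)
```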

Client Reads
------------

Reads from a replicated pool can always be satisfied
synchronously by the primary OSD. Within an erasure coded pool,
the primary will need to request data from some number of replicas in
order to satisfy a read. ``PGBackend`` will therefore need to provide
separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
the former won't be implemented by the ``ECBackend``.

``PGBackend`` interfaces:

- ``objects_read_sync``
- ``objects_read_async``
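
A sketch of the split (class and store shapes are hypothetical): the
replicated backend answers from the primary's local store, while the EC
backend gathers shards and completes via callback, and simply has no
synchronous path.

```python
class ReplicatedBackendSketch:
    def __init__(self, local_store):
        self.local = local_store

    def objects_read_sync(self, oid):
        # The primary's local copy is complete: answer immediately.
        return self.local[oid]

class ECBackendSketch:
    def __init__(self, shard_stores, reconstruct):
        self.shards = shard_stores      # one mapping per remote shard
        self.reconstruct = reconstruct  # decode function for the pool

    def objects_read_sync(self, oid):
        raise NotImplementedError("EC reads need remote shards")

    def objects_read_async(self, oid, on_complete):
        # Gather the shards (remote round trips in the real backend),
        # decode, then deliver the result via callback.
        pieces = [s[oid] for s in self.shards]
        on_complete(self.reconstruct(pieces))
```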
7c673cae | 154 | |
f67539c2 TL |
155 | Scrubs |
156 | ------ | |
7c673cae FG |
157 | |
158 | We currently have two scrub modes with different default frequencies: | |
159 | ||
160 | #. [shallow] scrub: compares the set of objects and metadata, but not | |
161 | the contents | |
f67539c2 | 162 | #. deep scrub: compares the set of objects, metadata, and a CRC32 of |
7c673cae FG |
163 | the object contents (including omap) |
164 | ||
165 | The primary requests a scrubmap from each replica for a particular | |
166 | range of objects. The replica fills out this scrubmap for the range | |
f67539c2 | 167 | of objects including, if the scrub is deep, a CRC32 of the contents of |
7c673cae FG |
168 | each object. The primary gathers these scrubmaps from each replica |
169 | and performs a comparison identifying inconsistent objects. | |
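
The flow can be sketched as follows (the scrubmap shape is hypothetical,
and zlib's CRC32 stands in for Ceph's checksumming):

```python
import zlib

def make_scrubmap(objects, deep=False):
    """One replica's scrubmap: object name -> (size, CRC32-or-None)."""
    return {name: (len(data), zlib.crc32(data) if deep else None)
            for name, data in objects.items()}

def compare_scrubmaps(primary_map, replica_maps):
    """Objects whose size or CRC disagrees anywhere across replicas."""
    names = set(primary_map)
    for rmap in replica_maps:
        names |= set(rmap)
    return {n for n in names
            if any(m.get(n) != primary_map.get(n) for m in replica_maps)}

objs = {"a": b"hello", "b": b"world"}
corrupt = {"a": b"hello", "b": b"w0rld"}      # bit flip in "b"
bad = compare_scrubmaps(make_scrubmap(objs, deep=True),
                        [make_scrubmap(corrupt, deep=True)])
```

Note that a shallow scrub of the same data would report nothing, since
the sizes still match; only the deep CRC comparison exposes the flip.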

Most of this can work essentially unchanged with erasure-coded PGs, with
the caveat that the ``PGBackend`` implementation must be in charge of
actually doing the scan.

``PGBackend`` interfaces:

- ``be_*``

Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.
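
As a sketch of the EC path, here is a toy ``k=2, m=1`` XOR code (real
pools use pluggable erasure code plugins such as jerasure; all names
here are illustrative): any one missing shard equals the XOR of the two
that remain, so recovery reads the minimum two shards and pushes only
the rebuilt one.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode_k2m1(data):
    """Split ``data`` into two data shards plus one XOR parity shard."""
    half = len(data) // 2
    d0, d1 = data[:half], data[half:half * 2]
    return {0: d0, 1: d1, 2: xor_bytes(d0, d1)}

def recover_shard(available):
    """Rebuild the one missing shard from the two that are present.

    For this XOR code the missing shard, whichever it is, equals the
    XOR of the other two -- the minimum number of chunks to read.
    """
    a, b = available.values()
    return xor_bytes(a, b)

shards = encode_k2m1(b"abcdef")
lost = shards.pop(0)              # one shard's OSD is gone
rebuilt = recover_shard(shards)   # read 2 shards, push 1 back
```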

Another difference is that objects in an erasure-coded PG may be
unrecoverable without being unfound. The ``unfound`` state
should probably be renamed to ``unrecoverable``. Also, the
``PGBackend`` implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.

Core changes:

- ``s/unfound/unrecoverable``

PGBackend interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_