.. _log-based-pg:

============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all Ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-PG version number
-- this is referred to in the code with type eversion_t and stored as
pg_info_t::last_update. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up to date locally (see recovery and
backfill). In practice, the log will extend much further
(osd_min_pg_log_entries when clean, osd_max_pg_log_entries when not
clean) because it's handy for quickly performing recovery.

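To make this bookkeeping concrete, here is a deliberately simplified
C++ sketch of the state described above. It is an illustration only:
the real eversion_t, pg_info_t, and pg_log_t in src/osd/osd_types.h
carry many more fields and methods.

.. code-block:: cpp

    #include <cstdint>
    #include <list>
    #include <string>

    // Simplified sketch only -- not the actual Ceph definitions.
    struct eversion_t {
      uint64_t epoch = 0;    // map epoch in which the write started
      uint64_t version = 0;  // increasing per-PG sequence number
      bool operator<(const eversion_t &rhs) const {
        return epoch < rhs.epoch ||
               (epoch == rhs.epoch && version < rhs.version);
      }
    };

    struct pg_log_entry_sketch_t {
      eversion_t version;    // position of this write in the PG ordering
      std::string object;    // object touched (an hobject_t in reality)
      bool can_rollback = false;  // ec pools: can this be undone locally?
    };

    struct pg_info_sketch_t {
      eversion_t last_update;  // version of the most recent update logged
    };

    struct pg_log_sketch_t {
      eversion_t tail;  // oldest retained entry
      eversion_t head;  // newest entry; matches pg_info_t::last_update
      std::list<pg_log_entry_sketch_t> entries;  // "recent" ops, oldest first
    };
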
Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: we can choose any log entry
between the oldest head remembered by an element of that set (anything
newer cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and ec pools in
PG/PrimaryLogPG: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them, as sketched below.

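The following hypothetical helper sketches that freedom; the names and
the simple min/max choice are illustrative, not the actual peering
code.

.. code-block:: cpp

    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // eversion_t as sketched earlier.
    struct eversion_t {
      uint64_t epoch = 0, version = 0;
      bool operator<(const eversion_t &rhs) const {
        return epoch < rhs.epoch ||
               (epoch == rhs.epoch && version < rhs.version);
      }
    };

    // Any head between the oldest and the newest reported head is a legal
    // choice. Replicated pools lean towards the newest (avoid client replay
    // and recover the other copies); ec pools lean towards the oldest
    // (divergent newer entries get rolled back instead).
    eversion_t choose_new_head(const std::vector<eversion_t> &peer_heads,
                               bool ec_pool) {
      assert(!peer_heads.empty());
      auto [oldest, newest] =
          std::minmax_element(peer_heads.begin(), peer_heads.end());
      return ec_pool ? *oldest : *newest;
    }
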
The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an ec object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a 4+2 encoding
where 3 of the shards remember a write, but the other 3 do not -- we
do not have the 4 shards needed to reconstruct either version). For
this reason, log entries representing *unstable* writes (writes not
yet committed to the client) must be rollbackable using only local
information on ec pools. Log entries in general may therefore be
rollbackable (in which case rollback happens either via delayed
application or via a set of instructions for rolling back an in-place
update) or not. Replicated pool log entries are never able to be
rolled back.

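One way to picture the rollback information is as optional "how to
undo this" data attached to each log entry; the sketch below is purely
illustrative (the real encoding lives alongside pg_log_entry_t in
osd_types.h).

.. code-block:: cpp

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>

    // Illustrative only: what a rollbackable entry needs to remember so the
    // write can be undone with purely local information.
    struct rollback_info_sketch_t {
      // "the write was an append; truncate back to this length"
      std::optional<uint64_t> truncate_to;
      // "these attrs were overwritten; restore the previous values"
      std::map<std::string, std::string> old_attrs;
    };

    struct log_entry_sketch_t {
      // Present: the entry can be rolled back locally (ec pools, including
      //   entries whose application was simply delayed until commit).
      // Absent: it cannot (replicated pool entries are never rollbackable).
      std::optional<rollback_info_sketch_t> rollback_info;
    };
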
For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
osd_types.h:pg_log_entry_t, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

So, the fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed two entire
implementations of PrimaryLogPG, one for each, when there are really
only a few fundamental differences:

#. How reads work -- async only, requires remote reads for ec
#. How writes work -- either restricted to append, or must write aside and do a
   two-phase commit (tpc)
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of osds
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over the differences:

#. PGBackend
#. PGTransaction
#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for ec
#. Misc other kludges here and there

PGBackend and PGTransaction enable abstraction of differences 1 and 2,
and the addition of 4 as needed to the log entries.

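As a rough illustration of that boundary, a heavily simplified sketch
follows; the real interface in src/osd/PGBackend.h is much larger, and
every name suffixed with _sketch_t is hypothetical.

.. code-block:: cpp

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    // Hypothetical stand-ins for hobject_t and bufferlist.
    struct hobject_sketch_t {
      std::string name;
      bool operator<(const hobject_sketch_t &o) const { return name < o.name; }
    };
    struct buffer_sketch_t { std::string data; };

    // A PGTransaction-like description of a client write, built by the
    // generic PrimaryLogPG code and translated by each backend into its own
    // on-disk form (in-place update for replication, append/write-aside plus
    // two-phase commit for ec).
    struct pg_transaction_sketch_t {
      std::map<hobject_sketch_t, buffer_sketch_t> writes;
    };

    class pg_backend_sketch_t {
    public:
      virtual ~pg_backend_sketch_t() = default;

      // Difference 1: reads. Replication can serve them synchronously from
      // the primary; ec must gather shards, so only the async form is
      // implemented by every backend.
      virtual int objects_read_sync(const hobject_sketch_t &obj,
                                    uint64_t off, uint64_t len,
                                    buffer_sketch_t *out) = 0;
      virtual void objects_read_async(
          const hobject_sketch_t &obj, uint64_t off, uint64_t len,
          std::function<void(int, buffer_sketch_t)> on_complete) = 0;

      // Difference 2: writes. The backend decides how the transaction is
      // persisted, replicated or encoded, and (for ec) made rollbackable.
      virtual void submit_transaction(pg_transaction_sketch_t &&t,
                                      std::function<void()> on_commit) = 0;
    };
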
The replicated implementation is in ReplicatedBackend.h/cc and doesn't
require much explanation, I think. More detail on the ECBackend can be
found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.

PGBackend Interface Explanation
===============================

Note: this is from a design document that predates the original Firefly
release and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable iff it is present on
the primary (at the right version). For an ec pool, we need at least
*k* shards present (enough to reconstruct the data) to do a read, and
we need it on the primary. For this reason, PGBackend needs to include
some interfaces for determining when recovery is required to serve a
read vs a write. This also changes the rules for when peering has
enough logs to prove that the PG is recoverable.

Core Changes:

- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
  | objects to allow the user to make these determinations.

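A minimal sketch of the predicate idea the bullet above refers to,
with hypothetical names and a plain int standing in for pg_shard_t:

.. code-block:: cpp

    #include <set>

    using shard_id_sketch_t = int;  // stand-in for pg_shard_t

    // "Given that these shards/copies are currently available, can we
    // read (or recover) the object?"
    struct is_readable_sketch_t {
      virtual ~is_readable_sketch_t() = default;
      virtual bool operator()(const std::set<shard_id_sketch_t> &have) const = 0;
    };

    // Replicated pools: readable iff the primary has it.
    struct replicated_readable_sketch_t : is_readable_sketch_t {
      shard_id_sketch_t primary;
      explicit replicated_readable_sketch_t(shard_id_sketch_t p) : primary(p) {}
      bool operator()(const std::set<shard_id_sketch_t> &have) const override {
        return have.count(primary) != 0;
      }
    };

    // ec pools: readable iff enough shards to reconstruct (k of them) exist.
    struct ec_readable_sketch_t : is_readable_sketch_t {
      unsigned k;
      explicit ec_readable_sketch_t(unsigned k_) : k(k_) {}
      bool operator()(const std::set<shard_id_sketch_t> &have) const override {
        return have.size() >= k;
      }
    };
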
Client Reads
------------

Reads with the replicated strategy can always be satisfied
synchronously out of the primary OSD. With an erasure coded strategy,
the primary will need to request data from some number of replicas in
order to satisfy a read. PGBackend will therefore need to provide
separate objects_read_sync and objects_read_async interfaces where
the former won't be implemented by the ECBackend.

PGBackend interfaces:

- objects_read_sync
- objects_read_async

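For illustration, the generic layer can cope with the two styles
roughly like this (hypothetical names; only the shape of the split
matters):

.. code-block:: cpp

    #include <cstdint>
    #include <functional>
    #include <string>

    struct read_result_sketch_t { int r = 0; std::string data; };

    class backend_reads_sketch_t {
    public:
      virtual ~backend_reads_sketch_t() = default;
      // false for ec: there is no synchronous read path.
      virtual bool can_read_sync() const = 0;
      virtual read_result_sketch_t read_sync(const std::string &oid,
                                             uint64_t off, uint64_t len) = 0;
      virtual void read_async(
          const std::string &oid, uint64_t off, uint64_t len,
          std::function<void(read_result_sketch_t)> on_complete) = 0;
    };

    void serve_read(backend_reads_sketch_t &be, const std::string &oid,
                    uint64_t off, uint64_t len,
                    std::function<void(read_result_sketch_t)> reply) {
      if (be.can_read_sync()) {
        reply(be.read_sync(oid, off, len));   // replicated: answer in line
      } else {
        be.read_async(oid, off, len, reply);  // ec: gather shards, reply later
      }
    }
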
Scrub
-----

We currently have two scrub modes with different default frequencies:

#. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
#. deep scrub: compares the set of objects, metadata, and a crc32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a crc32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.

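For a replicated pool, the comparison step amounts to something like
the sketch below (illustrative only; the real logic lives in the be_*
methods and is considerably more careful).

.. code-block:: cpp

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <set>
    #include <string>
    #include <vector>

    // Per-object scrub data: metadata always, content crc only for deep scrub.
    struct scrub_object_sketch_t {
      uint64_t size = 0;
      std::optional<uint32_t> content_crc;  // filled in by deep scrub only
    };
    using scrubmap_sketch_t = std::map<std::string, scrub_object_sketch_t>;

    // Compare the scrubmaps gathered from each replica and flag objects on
    // which they disagree. Simplified: the first map is treated as the
    // reference, and objects missing from it are not detected.
    std::set<std::string>
    find_inconsistent(const std::vector<scrubmap_sketch_t> &maps) {
      std::set<std::string> bad;
      if (maps.empty())
        return bad;
      for (const auto &[oid, expected] : maps.front()) {
        for (const auto &m : maps) {
          auto it = m.find(oid);
          if (it == m.end() || it->second.size != expected.size ||
              it->second.content_crc != expected.content_crc) {
            bad.insert(oid);
            break;
          }
        }
      }
      return bad;
    }
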
Most of this can work essentially unchanged with an erasure coded PG,
with the caveat that the PGBackend implementation must be in charge of
actually doing the scan.


PGBackend interfaces:

- be_*

Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.

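Schematically, and with entirely hypothetical helper names, the two
flows differ roughly as follows:

.. code-block:: cpp

    #include <cassert>
    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    using oid_sketch_t = std::string;
    using shard_sketch_t = int;

    // Hypothetical I/O hooks standing in for the messenger/ObjectStore paths.
    struct recovery_io_sketch_t {
      std::function<std::string(shard_sketch_t, const oid_sketch_t &)> pull_copy;
      std::function<void(shard_sketch_t, const oid_sketch_t &,
                         const std::string &)> push_copy;
      std::function<std::string(shard_sketch_t, const oid_sketch_t &)> read_shard;
      std::function<void(shard_sketch_t, const oid_sketch_t &,
                         const std::string &)> push_shard;
      std::function<std::map<shard_sketch_t, std::string>(
          const std::map<shard_sketch_t, std::string> &)> reconstruct;
    };

    // Replicated: pull one whole copy to the primary, then push full copies
    // out to every replica that is missing the object.
    void recover_replicated(recovery_io_sketch_t &io, const oid_sketch_t &oid,
                            shard_sketch_t source,
                            const std::vector<shard_sketch_t> &missing) {
      std::string copy = io.pull_copy(source, oid);
      for (shard_sketch_t peer : missing)
        io.push_copy(peer, oid, copy);   // concurrently in the real code
    }

    // ec: read only the minimum number of shards needed to reconstruct,
    // rebuild the missing shards, and push just those out concurrently.
    void recover_ec(recovery_io_sketch_t &io, const oid_sketch_t &oid,
                    unsigned k, const std::vector<shard_sketch_t> &available,
                    const std::vector<shard_sketch_t> &missing) {
      assert(available.size() >= k);
      std::map<shard_sketch_t, std::string> have;
      for (unsigned i = 0; i < k; ++i)   // k shard reads, no more
        have[available[i]] = io.read_shard(available[i], oid);
      auto rebuilt = io.reconstruct(have);  // decode + re-encode missing shards
      for (shard_sketch_t peer : missing)
        io.push_shard(peer, oid, rebuilt.at(peer));
    }
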
Another difference is that objects in an erasure coded PG may be
unrecoverable without being unfound. The "unfound" concept
should probably then be renamed to unrecoverable. Also, the
PGBackend implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.


Core changes:

- s/unfound/unrecoverable

PGBackend interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_