.. _log-based-pg:

============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all Ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded (EC) and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-PG version number
-- this is referred to in the code by the type ``eversion_t`` and stored as
``pg_info_t::last_update``. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up-to-date locally (see recovery and
backfill). In practice, the log will extend much further
(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not
clean) because it's handy for quickly performing recovery.
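
For illustration, here is a minimal sketch of such an (epoch, version)
pair, loosely modeled on ``eversion_t`` (the struct and field names are
invented for the sketch):

.. code-block:: cpp

   #include <cstdint>
   #include <tuple>

   // Illustrative stand-in for eversion_t: a (map epoch, per-PG
   // version) pair ordered lexicographically, so any two writes in a
   // PG's history are comparable.
   struct eversion_sketch_t {
     uint32_t epoch = 0;    // map epoch in which the write started
     uint64_t version = 0;  // increasing per-PG counter

     bool operator<(const eversion_sketch_t &rhs) const {
       return std::tie(epoch, version) < std::tie(rhs.epoch, rhs.version);
     }
   };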

Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: we can choose any log entry
between the oldest head remembered by an element of that set (any
newer write cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and EC pools in
``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them.
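
That peering-time choice reduces to taking one end or the other of the
safe range of heads. A hypothetical helper (reusing
``eversion_sketch_t`` from the sketch above; this is not the actual
``PG::choose_acting`` logic):

.. code-block:: cpp

   #include <algorithm>
   #include <vector>

   // Any head between the oldest and newest reported is safe; the two
   // pool types pick opposite ends.  Assumes heads is non-empty.
   eversion_sketch_t choose_new_head(
       const std::vector<eversion_sketch_t> &heads, bool ec_pool) {
     return ec_pool
         ? *std::min_element(heads.begin(), heads.end())   // EC: oldest
         : *std::max_element(heads.begin(), heads.end());  // replicated: newest
   }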

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an EC object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a ``k=4,m=2``
encoding where 3 of the 6 shards remember a write, but the other 3 do
not -- neither version has the 4 shards needed to reconstruct it). For
this reason, log entries representing *unstable* writes (writes not yet
committed to the client) must be rollbackable using only local
information on EC pools. Log entries in general may therefore be
rollbackable (and in that case, either via a delayed application or via
a set of instructions for rolling back an in-place update) or not.
Replicated pool log entries are never able to be rolled back.
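
A minimal sketch of what "rollbackable using only local information"
could mean at the log-entry level (the real structure is
``pg_log_entry_t`` in ``osd_types.h``; everything here is illustrative):

.. code-block:: cpp

   #include <string>
   #include <vector>

   // Illustrative only: a log entry carrying enough local information
   // to undo an unstable write without contacting peers.
   struct log_entry_sketch_t {
     eversion_sketch_t version;  // position of this entry in the log

     // Instructions for undoing the update, e.g. "restore this extent
     // from that stashed copy".  Empty means the entry cannot be
     // rolled back -- always the case for replicated pools.
     std::vector<std::string> rollback_ops;

     bool can_rollback() const { return !rollback_ops.empty(); }
   };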

For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
``osd_types.h:pg_log_entry_t``, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

The fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
implementations of ``PrimaryLogPG`` since there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside and do a
   two-phase commit (TPC)
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of OSDs
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over the differences:

#. ``PGBackend``
#. ``PGTransaction``
#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and ``calc_ec_acting``
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for EC
#. Misc other kludges here and there

``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2 above
and the addition of 4 as needed to the log entries.
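
As a rough, hypothetical sketch of the shape of that abstraction (the
real interface in ``src/osd/PGBackend.h`` has many more hooks and
different signatures):

.. code-block:: cpp

   // Heavily simplified, invented sketch of the PGBackend split.
   class PGBackendSketch {
   public:
     virtual ~PGBackendSketch() = default;

     // Difference 1 (reads): ReplicatedBackend can serve reads
     // synchronously from the primary; ECBackend must gather shards
     // from peers first.
     virtual void objects_read_async(/* hoid, extents, completion */) = 0;

     // Difference 2 (writes): the backend turns an abstract
     // PGTransaction into either a destructive update (replicated) or
     // an append / write-aside two-phase commit (EC).
     virtual void submit_transaction(/* PGTransaction, log entries */) = 0;
   };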

The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
require much additional explanation. More detail on the ``ECBackend`` can be
found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.

PGBackend Interface Explanation
===============================

Note: this is from a design document that predated the Firefly release
and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable IFF it is present on
the primary (at the right version). For an EC pool, we need at least
``k`` shards present to perform a read, and we need it on the primary. For
this reason, ``PGBackend`` needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that it
has found all writes which could have been reported as committed.

Core Changes:

- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
  | objects to allow the user to make these determinations (sketched below).
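
A sketch of what such a predicate could look like for an EC pool,
assuming the readability rule above (the type name and signature are
invented; the real predicates are supplied by the backends):

.. code-block:: cpp

   #include <set>

   // Invented sketch of a readability predicate for an EC pool: an
   // object is readable iff at least k shards of it are available.
   struct IsECReadablePredicateSketch {
     unsigned k;  // shards required to reconstruct an object

     bool operator()(const std::set<int> &shards_present) const {
       return shards_present.size() >= k;
     }
   };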

Client Reads
------------

Reads from a replicated pool can always be satisfied
synchronously by the primary OSD. Within an erasure-coded pool,
the primary will need to request data from some number of replicas in
order to satisfy a read. ``PGBackend`` will therefore need to provide
separate ``objects_read_sync`` and ``objects_read_async`` interfaces,
where the former won't be implemented by the ``ECBackend`` (both are
sketched below).

``PGBackend`` interfaces:

- ``objects_read_sync``
- ``objects_read_async``
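
A sketch of how those two read paths might be declared, with
simplified, invented signatures (``bufferlike`` stands in for the real
buffer type):

.. code-block:: cpp

   #include <cstdint>
   #include <functional>
   #include <string>

   using bufferlike = std::string;  // stand-in for the real buffer type

   struct PGBackendReadsSketch {
     virtual ~PGBackendReadsSketch() = default;

     // Synchronous, local read -- not implemented by ECBackend.
     virtual int objects_read_sync(const std::string &oid, uint64_t off,
                                   uint64_t len, bufferlike *out) = 0;

     // Asynchronous read -- the completion fires once the needed
     // shards or replica data have been gathered.
     virtual void objects_read_async(
         const std::string &oid, uint64_t off, uint64_t len,
         std::function<void(int result, bufferlike data)> on_complete) = 0;
   };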

Scrubs
------

We currently have two scrub modes with different default frequencies:

#. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
#. deep scrub: compares the set of objects, metadata, and a CRC32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a CRC32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.
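
A sketch of the shape of a scrubmap and of the primary's comparison
step (types and names invented for illustration; the real structure is
``ScrubMap``):

.. code-block:: cpp

   #include <cstdint>
   #include <map>
   #include <string>

   // Illustrative per-object scrubmap entry: metadata always, content
   // digests only when the scrub is deep.
   struct scrub_object_sketch_t {
     uint64_t size = 0;
     uint32_t data_crc = 0;  // filled in only for deep scrub
     uint32_t omap_crc = 0;  // filled in only for deep scrub
   };

   using scrubmap_sketch_t = std::map<std::string, scrub_object_sketch_t>;

   // The comparison step, sketched: two replicas agree on an object
   // iff the metadata (and, for deep scrub, the digests) match.
   bool entries_match(const scrub_object_sketch_t &a,
                      const scrub_object_sketch_t &b, bool deep) {
     if (a.size != b.size)
       return false;
     return !deep || (a.data_crc == b.data_crc && a.omap_crc == b.omap_crc);
   }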

Most of this can work essentially unchanged with an erasure-coded PG, with
the caveat that the ``PGBackend`` implementation must be in charge of
actually doing the scan.

``PGBackend`` interfaces:

- ``be_*``

Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure-coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.
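
A hypothetical sketch of that read step: with ``k`` data shards
required (``k=4`` in the earlier ``k=4,m=2`` example), any ``k``
surviving shards suffice, so we can read the first ``k`` available and
re-encode the missing shards from the result:

.. code-block:: cpp

   #include <set>
   #include <vector>

   // Invented helper: pick k of the surviving shards to read; the
   // decoded object is then re-encoded to regenerate missing shards.
   std::vector<int> choose_recovery_reads(
       const std::set<int> &shards_available, unsigned k) {
     std::vector<int> to_read;
     for (int shard : shards_available) {
       if (to_read.size() >= k)
         break;
       to_read.push_back(shard);
     }
     return to_read;  // caller must verify to_read.size() == k
   }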

Another difference is that objects in an erasure-coded PG may be
unrecoverable without being unfound. The ``unfound`` state
should probably be renamed to ``unrecoverable``. Also, the
``PGBackend`` implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.
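
In predicate form (again with invented names), the distinction is that
a replicated object is recoverable from any single surviving copy,
while an EC object with fewer than ``k`` surviving shards is
unrecoverable even though some of its shards were found:

.. code-block:: cpp

   // Replicated: any single surviving copy suffices.
   bool replicated_recoverable(unsigned copies_present) {
     return copies_present >= 1;
   }

   // EC: fewer than k surviving shards means the object is
   // unrecoverable, even though some shards were found.
   bool ec_recoverable(unsigned shards_present, unsigned k) {
     return shards_present >= k;
   }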

Core changes:

- ``s/unfound/unrecoverable``

PGBackend interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_