.. _log-based-pg:

============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all Ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded (EC) and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG, since that lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem,
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-PG version number
-- the latter is referred to in the code with type ``eversion_t`` and
stored as ``pg_info_t::last_update``. Furthermore, we maintain a log of
"recent" operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up-to-date locally (see recovery and
backfill). In practice, the log will extend much further
(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when
not clean) because it's handy for quickly performing recovery.
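
The two numbers above can be modeled with a short sketch (Python for
illustration; ``Eversion`` and its fields are hypothetical stand-ins for
``eversion_t``): pairs order first by epoch and then by per-PG version,
so the maximum over the log plays the role of ``pg_info_t::last_update``.

```python
from functools import total_ordering

@total_ordering
class Eversion:
    """Toy model of ``eversion_t``: (map epoch, per-PG version).

    Illustrative only -- names and methods are not Ceph's actual API.
    """
    def __init__(self, epoch, version):
        self.epoch = epoch      # epoch of the map in which the write started
        self.version = version  # increasing per-PG version number

    def _key(self):
        # Epoch dominates: a write from a newer interval orders after
        # every write from an older one, regardless of version.
        return (self.epoch, self.version)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

# The PG's current state is summarized by the newest entry's eversion:
log = [Eversion(10, 5), Eversion(10, 6), Eversion(11, 1)]
last_update = max(log)   # plays the role of pg_info_t::last_update
```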

Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: as the new head, we can choose
any log entry between the oldest head remembered by an element of that
set (anything newer cannot have completed without that log containing
it) and the newest head remembered (clearly, all writes in the log were
started, so it's fine for us to remember them). This is the
main point of divergence between replicated pools and EC pools in
``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them.
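
A minimal sketch of that choice, assuming heads compare like eversions
(the function name is hypothetical): any head between the oldest and
newest reported by the contacted OSDs is safe, and the pool type picks
the end of the range.

```python
def choose_new_head(reported_heads, pool_is_ec):
    """Pick the new log head from the heads reported by the contacted
    OSDs during peering (a sketch of the freedom described above).

    Replicated pools take the newest reported head, avoiding client
    replay; EC pools take the oldest, keeping unstable entries
    rollbackable.  Heads are assumed comparable, e.g. (epoch, version).
    """
    return min(reported_heads) if pool_is_ec else max(reported_heads)

heads = [(10, 5), (10, 7), (10, 6)]
replicated_head = choose_new_head(heads, pool_is_ec=False)
ec_head = choose_new_head(heads, pool_is_ec=True)
```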

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an EC object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
where 3 of the replicas remember a write, but the other 3 do not -- we
don't have 3 copies of either version). For this reason, log entries
representing *unstable* writes (writes not yet committed to the
client) must be rollbackable using only local information on EC pools.
Log entries in general may therefore be rollbackable (and in that case,
rolled back either via a delayed application or via a set of instructions
for rolling back an in-place update) or not. Replicated pool log entries
are never rollbackable.
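
The rollback idea can be sketched as follows (hypothetical helper, not
Ceph code; the real machinery lives in ``PGLog`` and the backends):
before applying an unstable write, a shard saves the bytes it is about
to overwrite, so the entry can later be undone using only local
information.

```python
class RollbackableWrite:
    """Sketch of a rollbackable log entry for an in-place update.

    The saved extent is the 'set of instructions' for rolling the
    write back locally, without consulting any other shard.
    """
    def __init__(self, store, oid, offset, data):
        old = store.get(oid, b"")
        self.store, self.oid, self.offset = store, oid, offset
        self.saved = old[offset:offset + len(data)]  # undo information
        self.wrote = len(data)
        buf = bytearray(old)
        buf[offset:offset + len(data)] = data        # apply the write
        store[oid] = bytes(buf)

    def rollback(self):
        # Undo using only locally saved state -- no other shard needed.
        buf = bytearray(self.store[self.oid])
        buf[self.offset:self.offset + self.wrote] = self.saved
        self.store[self.oid] = bytes(buf)

store = {"obj": b"abcd"}
w = RollbackableWrite(store, "obj", 2, b"XYZ")
```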

For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
``osd_types.h:pg_log_entry_t``, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

The fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
implementations of ``PrimaryLogPG``, since there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside
   and do a two-phase commit
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of OSDs
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over
the differences:

#. ``PGBackend``
#. ``PGTransaction``
#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and
   ``calc_ec_acting``
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for EC
#. Misc other kludges here and there

``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2
above, and the addition of the extra information of difference 4 to the log
entries as needed.
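
Difference 2 can be sketched like this (all names hypothetical): a
replicated backend applies a transaction destructively in place, while
an EC backend stages the update aside and commits in a second phase, so
nothing is destroyed while the write is still unstable.

```python
class PGTransactionSketch:
    """Toy stand-in for ``PGTransaction``: an ordered list of writes."""
    def __init__(self):
        self.writes = []                  # (oid, offset, data)

    def write(self, oid, offset, data):
        self.writes.append((oid, offset, data))

def apply_replicated(store, txn):
    # Replicated pools: destructive, in-place application.
    for oid, off, data in txn.writes:
        buf = bytearray(store.get(oid, b""))
        buf[off:off + len(data)] = data
        store[oid] = bytes(buf)

def prepare_ec(staging, txn):
    # EC pools, phase one of a two-phase commit: write aside only.
    # The old object is untouched, so the write stays rollbackable.
    for oid, off, data in txn.writes:
        staging.setdefault(oid, []).append((off, data))

def commit_ec(store, staging):
    # Phase two: fold the staged extents into the objects destructively.
    for oid, extents in staging.items():
        buf = bytearray(store.get(oid, b""))
        for off, data in extents:
            buf[off:off + len(data)] = data
        store[oid] = bytes(buf)
    staging.clear()
```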

The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
require much additional explanation. More detail on the ``ECBackend`` can be
found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.

PGBackend Interface Explanation
===============================

Note: this is from a design document that predated the Firefly release
and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable IFF it is present on
the primary (at the right version). For an EC pool, we need at least
`k` shards present to perform a read, and we need it on the primary. For
this reason, ``PGBackend`` needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to proceed.

Core Changes:

- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
  | objects to allow the user to make these determinations.
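
A sketch of what such a predicate might look like (this simplified
model, with hypothetical names, takes only whether the primary has its
shard and how many shards are present and current):

```python
def make_is_readable(pool_is_ec, k=None):
    """Build a readability predicate over the set of current shards.

    Simplified model of ``IsPGReadablePredicate``: for replicated pools
    the primary's copy suffices; for an EC pool with ``k`` data shards,
    the primary's shard plus at least ``k`` present shards are required.
    """
    def is_readable(primary_has_shard, n_current_shards):
        if not primary_has_shard:
            return False          # the primary must hold its piece
        if pool_is_ec:
            return n_current_shards >= k
        return True               # replicated: one full copy is enough
    return is_readable

replicated_ok = make_is_readable(pool_is_ec=False)
ec_ok = make_is_readable(pool_is_ec=True, k=4)
```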

Client Reads
------------

Reads from a replicated pool can always be satisfied
synchronously by the primary OSD. Within an erasure coded pool,
the primary will need to request data from some number of replicas in
order to satisfy a read. ``PGBackend`` will therefore need to provide
separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
the former won't be implemented by the ``ECBackend``.

``PGBackend`` interfaces:

- ``objects_read_sync``
- ``objects_read_async``
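
A sketch of the split (class and store shapes are hypothetical): the
replicated backend answers from the primary's local store, while the EC
backend gathers shards and completes via callback, and simply has no
synchronous path.

```python
class ReplicatedBackendSketch:
    def __init__(self, local_store):
        self.local = local_store

    def objects_read_sync(self, oid):
        # The primary's local copy is complete: answer immediately.
        return self.local[oid]

class ECBackendSketch:
    def __init__(self, shard_stores, reconstruct):
        self.shards = shard_stores      # one mapping per remote shard
        self.reconstruct = reconstruct  # decode function for the pool

    def objects_read_sync(self, oid):
        raise NotImplementedError("EC reads need remote shards")

    def objects_read_async(self, oid, on_complete):
        # Gather the shards (remote round trips in the real backend),
        # decode, then deliver the result via callback.
        pieces = [s[oid] for s in self.shards]
        on_complete(self.reconstruct(pieces))
```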
7c673cae | 154 | |
f67539c2 TL |
155 | Scrubs |
156 | ------ | |
7c673cae FG |
157 | |
158 | We currently have two scrub modes with different default frequencies: | |
159 | ||
160 | #. [shallow] scrub: compares the set of objects and metadata, but not | |
161 | the contents | |
f67539c2 | 162 | #. deep scrub: compares the set of objects, metadata, and a CRC32 of |
7c673cae FG |
163 | the object contents (including omap) |
164 | ||
165 | The primary requests a scrubmap from each replica for a particular | |
166 | range of objects. The replica fills out this scrubmap for the range | |
f67539c2 | 167 | of objects including, if the scrub is deep, a CRC32 of the contents of |
7c673cae FG |
168 | each object. The primary gathers these scrubmaps from each replica |
169 | and performs a comparison identifying inconsistent objects. | |
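
The flow can be sketched as follows (the scrubmap shape is hypothetical,
and zlib's CRC32 stands in for Ceph's checksumming):

```python
import zlib

def make_scrubmap(objects, deep=False):
    """One replica's scrubmap: object name -> (size, CRC32-or-None)."""
    return {name: (len(data), zlib.crc32(data) if deep else None)
            for name, data in objects.items()}

def compare_scrubmaps(primary_map, replica_maps):
    """Objects whose size or CRC disagrees anywhere across replicas."""
    names = set(primary_map)
    for rmap in replica_maps:
        names |= set(rmap)
    return {n for n in names
            if any(m.get(n) != primary_map.get(n) for m in replica_maps)}

objs = {"a": b"hello", "b": b"world"}
corrupt = {"a": b"hello", "b": b"w0rld"}      # bit flip in "b"
bad = compare_scrubmaps(make_scrubmap(objs, deep=True),
                        [make_scrubmap(corrupt, deep=True)])
```

Note that a shallow scrub of the same data would report nothing, since
the sizes still match; only the deep CRC comparison exposes the flip.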

Most of this can work essentially unchanged with erasure-coded PGs, with
the caveat that the ``PGBackend`` implementation must be in charge of
actually doing the scan.

``PGBackend`` interfaces:

- ``be_*``

Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.
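
As a sketch of the EC path, here is a toy ``k=2, m=1`` XOR code (real
pools use pluggable erasure code plugins such as jerasure; all names
here are illustrative): any one missing shard equals the XOR of the two
that remain, so recovery reads the minimum two shards and pushes only
the rebuilt one.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode_k2m1(data):
    """Split ``data`` into two data shards plus one XOR parity shard."""
    half = len(data) // 2
    d0, d1 = data[:half], data[half:half * 2]
    return {0: d0, 1: d1, 2: xor_bytes(d0, d1)}

def recover_shard(available):
    """Rebuild the one missing shard from the two that are present.

    For this XOR code the missing shard, whichever it is, equals the
    XOR of the other two -- the minimum number of chunks to read.
    """
    a, b = available.values()
    return xor_bytes(a, b)

shards = encode_k2m1(b"abcdef")
lost = shards.pop(0)              # one shard's OSD is gone
rebuilt = recover_shard(shards)   # read 2 shards, push 1 back
```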

Another difference is that objects in an erasure-coded PG may be
unrecoverable without being unfound. The ``unfound`` state
should probably be renamed to ``unrecoverable``. Also, the
``PGBackend`` implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.

Core changes:

- ``s/unfound/unrecoverable``

PGBackend interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_