RADOS client protocol
=====================

This is very incomplete, but one must start somewhere.

Basics
------

Requests are MOSDOp messages.  Replies are MOSDOpReply messages.

An object request is targeted at an hobject_t, which includes a pool,
hash value, object name, placement key (usually empty), and snapid.

The hash value is a 32-bit value, normally generated by hashing
the object name.  The hobject_t can be arbitrarily constructed,
though, with any hash value and name.  Note that in the MOSDOp these
components are spread across several fields and not logically
assembled in an actual hobject_t member (mainly for historical reasons).

A request can also target a PG.  In this case, the *ps* value matches
a specific PG, the object name is empty, and (hopefully) the ops in
the request are PG ops.

Either way, the request ultimately targets a PG, either by using the
explicit pgid or by folding the hash value onto the current number of
pgs in the pool.  The client sends the request to the primary for the
associated PG.
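
The folding step can be illustrated with a short sketch.  It is modeled
on the ``ceph_stable_mod()`` helper in the Ceph source; the Python names
used here (``pg_mask``, ``stable_mod``, ``fold_hash``) are illustrative
only and are not part of any client API.

.. code-block:: python

    def pg_mask(pg_num: int) -> int:
        """Smallest (2**n - 1) mask that covers pg_num placement groups."""
        return (1 << (pg_num - 1).bit_length()) - 1


    def stable_mod(x: int, b: int, bmask: int) -> int:
        """Fold x into [0, b) so most values keep their slot when b grows."""
        if (x & bmask) < b:
            return x & bmask
        return x & (bmask >> 1)


    def fold_hash(hash_value: int, pg_num: int) -> int:
        """Map a 32-bit object hash onto a placement seed for the pool."""
        return stable_mod(hash_value & 0xFFFFFFFF, pg_num, pg_mask(pg_num))

With 12 PGs the mask is 15, so a hash whose low four bits are 13 (no
such PG exists yet) falls back to its low three bits and lands in PG 5.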

Each request is assigned a unique tid.

Resends
-------

If there is a connection drop, the client will resend any outstanding
requests.

Any time there is a PG mapping change such that the primary changes,
the client is responsible for resending the request.  Note that
although there may be an interval change from the OSD's perspective
(triggering PG peering), if the primary doesn't change then the client
need not resend.

There are a few exceptions to this rule:

 * There is a last_force_op_resend field in the pg_pool_t in the
   OSDMap.  If this changes, then the clients are forced to resend any
   outstanding requests. (This happens when tiering is adjusted, for
   example.)
 * Some requests are such that they are resent on *any* PG interval
   change, as defined by pg_interval_t's is_new_interval() (the same
   criteria used by peering in the OSD).
 * If the PAUSE OSDMap flag is set and then unset, outstanding requests
   are resent.

Each time a request is sent to the OSD, the *attempt* field is incremented:
it is 0 on the first send, 1 on the next, and so on.
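
The resend conditions above can be summarized in a small sketch.  The
``Request`` structure, the ``needs_resend()`` and ``send()`` helpers, and
their parameters are hypothetical names chosen for illustration; they do
not mirror the actual client code.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class Request:
        """Client-side view of one outstanding request (hypothetical)."""
        tid: int
        primary: int = -1                 # OSD the request was last sent to
        last_force_op_resend: int = 0     # pool value seen at the last send
        resend_on_any_interval: bool = False
        attempt: int = -1                 # becomes 0 on the first send


    def needs_resend(req, *, primary, last_force_op_resend,
                     new_interval=False, pause_cleared=False,
                     connection_dropped=False):
        """Return True if any resend condition from the text applies."""
        if connection_dropped or primary != req.primary:
            return True
        if last_force_op_resend != req.last_force_op_resend:
            return True
        if new_interval and req.resend_on_any_interval:
            return True
        return pause_cleared


    def send(req, *, primary, last_force_op_resend):
        """Record a send: bump the attempt counter, remember the target."""
        req.attempt += 1
        req.primary = primary
        req.last_force_op_resend = last_force_op_resend
        return req.tid, req.attempt, primary

Note that an interval change alone does not trigger a resend here unless
the request is flagged as one that must be resent on any interval change.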

Backoff
-------

Ordinarily the OSD will simply queue any requests it can't immediately
process in memory until such time as it can.  This can become
problematic because the OSD limits the total amount of RAM consumed by
incoming messages: if either of the thresholds for the number of
messages or the number of bytes is reached, new messages will not be
read off the network socket, causing backpressure through the network.

In some cases, though, the OSD knows or expects that a PG or object
will be unavailable for some time and does not want to consume memory
by queuing requests.  In these cases it can send a MOSDBackoff message
to the client.

A backoff request has four properties:

#. the op code (block, unblock, or ack-block)
#. *id*, a unique id assigned within this session
#. hobject_t begin
#. hobject_t end

There are two types of backoff: a *PG* backoff will plug all requests
targeting an entire PG at the client, as described by a range of the
hash/hobject_t space [begin,end), while an *object* backoff will plug
all requests targeting a single object (begin == end).
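
As a minimal sketch of the two range forms, the following models an
hobject_t as a ``(hash, name)`` tuple.  This is a simplifying assumption:
the real hobject_t has more fields and a more involved sort order.  The
``Backoff`` class and its ``covers()`` method are illustrative names.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Tuple

    # Simplified stand-in for hobject_t (assumption): (hash, object name).
    HObject = Tuple[int, str]


    @dataclass(frozen=True)
    class Backoff:
        id: int            # unique within the session
        begin: HObject
        end: HObject

        def covers(self, oid: HObject) -> bool:
            """True if a request for oid must be withheld."""
            if self.begin == self.end:
                # Object backoff: exactly one object is plugged.
                return oid == self.begin
            # PG backoff: half-open range [begin, end).
            return self.begin <= oid < self.end

For example, ``Backoff(1, (0, ""), (16, ""))`` covers ``(7, "foo")`` but
not ``(16, "")``, because the end of the range is exclusive.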

When the client receives a *block* backoff message, it is now
responsible for *not* sending any requests for hobject_ts described by
the backoff.  The backoff remains in effect until the backoff is
cleared (via an *unblock* message) or the OSD session is closed.  An
*ack-block* message is sent back to the OSD immediately to acknowledge
receipt of the backoff.

When an unblock is received, it will reference a specific id that the
client had previously blocked.  However, the range described by the
unblock may be smaller than the original range, as the PG may have split
on the OSD.  The unblock should *only* unblock the range specified in the
unblock message.  Any requests that fall within the unblock request range
are reexamined and, if no other installed backoff applies, resent.
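
The client-side handling of block and unblock messages might look roughly
like the sketch below.  The ``Session`` class and its method names are
hypothetical, an hobject_t is again simplified to a ``(hash, name)``
tuple, and the sketch assumes the id is retired when its unblock arrives.
It does not model what happens to the remainder of a split range.

.. code-block:: python

    def in_range(oid, begin, end):
        """Object backoff if begin == end, otherwise [begin, end)."""
        if begin == end:
            return oid == begin
        return begin <= oid < end


    class Session:
        def __init__(self):
            self.backoffs = {}   # id -> (begin, end)
            self.held = []       # requests withheld by a backoff
            self.outbox = []     # messages sent to the OSD, in order

        def blocked(self, oid):
            return any(in_range(oid, b, e)
                       for b, e in self.backoffs.values())

        def submit(self, oid):
            if self.blocked(oid):
                self.held.append(oid)
            else:
                self.outbox.append(("op", oid))

        def on_block(self, id, begin, end):
            self.backoffs[id] = (begin, end)
            # Acknowledge immediately, as the protocol requires.
            self.outbox.append(("ack-block", id))

        def on_unblock(self, id, begin, end):
            # Assumption: the id is retired once its unblock arrives.
            self.backoffs.pop(id, None)
            still_held = []
            for oid in self.held:
                # Only requests inside the unblock range are reexamined.
                if in_range(oid, begin, end) and not self.blocked(oid):
                    self.outbox.append(("op", oid))
                else:
                    still_held.append(oid)
            self.held = still_held

A request that lies in the original range but outside the range named by
the unblock stays held in this sketch, which is one way to honor the rule
that only the specified range is unblocked.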

On the OSD, backoffs are also tracked across ranges of the hash space and
exist in three states:

#. new
#. acked
#. deleting

A newly installed backoff is set to *new* and a message is sent to the
client.  When the *ack-block* message is received it is changed to the
*acked* state.  The OSD may process other messages from the client that
are covered by the backoff in the *new* state, but once the backoff is
*acked* it should never see a blocked request unless there is a bug.

If the OSD wants to remove a backoff in the *acked* state it can
simply remove it and notify the client.  If the backoff is in the
*new* state it must move it to the *deleting* state and continue to
use it to discard client requests until the *ack-block* message is
received, at which point it can finally be removed.  This is necessary to
preserve the order of operations processed by the OSD.
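
The OSD-side state transitions can be sketched as a small state machine.
The ``OsdBackoffs`` class and its methods are hypothetical names.  The
sketch also assumes that the unblock notification is sent at removal time
in both cases; the text above states this only for the *acked* case.

.. code-block:: python

    from enum import Enum


    class State(Enum):
        NEW = "new"
        ACKED = "acked"
        DELETING = "deleting"


    class OsdBackoffs:
        def __init__(self):
            self.state = {}    # backoff id -> State
            self.outbox = []   # messages sent to the client, in order

        def install(self, id):
            self.state[id] = State.NEW
            self.outbox.append(("block", id))

        def on_ack_block(self, id):
            st = self.state.get(id)
            if st is State.NEW:
                self.state[id] = State.ACKED
            elif st is State.DELETING:
                # The client has seen the block; safe to forget it now.
                del self.state[id]

        def remove(self, id):
            st = self.state.get(id)
            if st is State.ACKED:
                del self.state[id]
                self.outbox.append(("unblock", id))
            elif st is State.NEW:
                # Keep the entry so in-flight requests are still matched
                # against it until the ack-block arrives.
                self.state[id] = State.DELETING
                self.outbox.append(("unblock", id))  # assumption

        def state_of(self, id):
            return self.state.get(id)

While a backoff is in the *deleting* state its entry is still present, so
requests the client sent before it saw the block can still be matched
against it and discarded.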