RADOS client protocol
=====================

This is very incomplete, but one must start somewhere.

Basics
------

Requests are MOSDOp messages. Replies are MOSDOpReply messages.

An object request is targeted at an hobject_t, which includes a pool,
hash value, object name, placement key (usually empty), and snapid.

The hash is a 32-bit value, normally generated by hashing
the object name. The hobject_t can be arbitrarily constructed,
though, with any hash value and name. Note that in the MOSDOp these
components are spread across several fields and not logically
assembled in an actual hobject_t member (mainly for historical reasons).

A request can also target a PG. In this case, the *ps* value matches
a specific PG, the object name is empty, and (hopefully) the ops in
the request are PG ops.

Either way, the request ultimately targets a PG, either by using the
explicit pgid or by folding the hash value onto the current number of
PGs in the pool. The client sends the request to the primary for the
associated PG.
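
As a rough illustration of "folding the hash value onto the current number
of PGs", here is a minimal sketch. It assumes a stable-mod style fold
(believed to resemble Ceph's ``ceph_stable_mod`` helper); the function names
``pg_num_mask``, ``stable_mod`` and ``fold_hash_to_ps`` are hypothetical and
the text above does not specify the exact fold.

```python
def pg_num_mask(pg_num: int) -> int:
    """Smallest mask of the form 2**n - 1 that covers pg_num - 1 (assumed)."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold x into [0, b) so that most values keep their slot as b grows."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def fold_hash_to_ps(hash32: int, pg_num: int) -> int:
    """Map a 32-bit object hash to a placement seed below pg_num (pg_num >= 1)."""
    return stable_mod(hash32 & 0xFFFFFFFF, pg_num, pg_num_mask(pg_num))
```

With 12 PGs the mask is 15, so a hash whose low four bits are 13 does not fit
and falls back to its low three bits (5), while a hash ending in 11 maps
directly to PG 11.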

Each request is assigned a unique tid.

Resends
-------

If there is a connection drop, the client will resend any outstanding
requests.

Any time there is a PG mapping change such that the primary changes,
the client is responsible for resending the request. Note that
although there may be an interval change from the OSD's perspective
(triggering PG peering), if the primary doesn't change then the client
need not resend.

There are a few exceptions to this rule:

 * There is a last_force_op_resend field in the pg_pool_t in the
   OSDMap. If this changes, then the clients are forced to resend any
   outstanding requests. (This happens when tiering is adjusted, for
   example.)
 * Some requests are such that they are resent on *any* PG interval
   change, as defined by pg_interval_t's is_new_interval() (the same
   criteria used by peering in the OSD).
 * Requests are also resent if the PAUSE OSDMap flag is set and then
   unset.
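
The resend rules above can be condensed into a single predicate. This is a
minimal sketch, not the client's actual code: ``MapView`` and
``needs_resend`` are hypothetical names, and the flags ``new_interval`` and
``resend_on_any_interval`` stand in for the result of is_new_interval() and
for the per-request property mentioned in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapView:
    """What one OSDMap epoch says about a request's PG (hypothetical type)."""
    primary: int               # OSD id of the PG's primary
    last_force_op_resend: int  # value taken from the pool's pg_pool_t
    paused: bool               # whether the PAUSE flag is set


def needs_resend(old: MapView, new: MapView, *,
                 connection_dropped: bool = False,
                 new_interval: bool = False,
                 resend_on_any_interval: bool = False) -> bool:
    """Return True if an outstanding request should be sent again."""
    if connection_dropped:
        return True
    if old.primary != new.primary:
        return True
    # The exceptions: forced resend, interval-sensitive ops, and unpausing.
    if old.last_force_op_resend != new.last_force_op_resend:
        return True
    if resend_on_any_interval and new_interval:
        return True
    if old.paused and not new.paused:
        return True
    return False
```

Note that an interval change alone does not trigger a resend unless the
request is one of those marked as sensitive to any interval change.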

Each time a request is sent to the OSD the *attempt* field is incremented. The
first time it is 0, the next 1, etc.

Backoff
-------

Ordinarily the OSD will simply queue any requests it can't immediately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consumed by
incoming messages: if either of the thresholds for the number of
messages or the number of bytes is reached, new messages will not be
read off the network socket, causing backpressure through the network.

In some cases, though, the OSD knows or expects that a PG or object
will be unavailable for some time and does not want to consume memory
by queuing requests. In these cases it can send a MOSDBackoff message
to the client.

A backoff request has four properties:

#. the op code (block, unblock, or ack-block)
#. *id*, a unique id assigned within this session
#. hobject_t begin
#. hobject_t end

There are two types of backoff: a *PG* backoff will plug all requests
targeting an entire PG at the client, as described by a range of the
hash/hobject_t space [begin,end), while an *object* backoff will plug
all requests targeting a single object (begin == end).
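
The difference between the two backoff types comes down to how the
[begin,end) pair is interpreted. The sketch below is illustrative only: the
``HObj`` tuple is a hypothetical, simplified stand-in for hobject_t (the real
type has more fields and its own sort order), and ``Backoff.covers`` is not a
Ceph API.

```python
from typing import NamedTuple, Tuple

# Hypothetical simplified key: (pool, hash, object name), compared as a tuple.
HObj = Tuple[int, int, str]


class Backoff(NamedTuple):
    id: int
    begin: HObj
    end: HObj

    def covers(self, target: HObj) -> bool:
        """True if a request for `target` must be held back."""
        if self.begin == self.end:
            # Object backoff: exactly one object is plugged.
            return target == self.begin
        # PG backoff: half-open range, so `end` itself is not covered.
        return self.begin <= target < self.end


pg_backoff = Backoff(id=1, begin=(1, 0, ""), end=(1, 16, ""))
obj_backoff = Backoff(id=2, begin=(1, 5, "foo"), end=(1, 5, "foo"))
```

Here ``pg_backoff`` plugs every object whose key sorts inside the range,
while ``obj_backoff`` plugs only the object ``foo``.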

When the client receives a *block* backoff message, it is now
responsible for *not* sending any requests for hobject_ts described by
the backoff. The backoff remains in effect until the backoff is
cleared (via an *unblock* message) or the OSD session is closed. An
*ack-block* message is sent back to the OSD immediately to acknowledge
receipt of the backoff.

When an unblock is received, it will reference a specific id that the
client had previously blocked. However, the range described by the
unblock may be smaller than the original range, as the PG may have split
on the OSD. The unblock should *only* unblock the range specified in the
unblock message. Any requests that fall within the unblock request range
are reexamined and, if no other installed backoff applies, resent.
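
A minimal sketch of the client-side bookkeeping described above, under
simplifying assumptions: positions in the hash/hobject_t space are modeled as
plain integers, only range (PG) backoffs are handled, and the class
``ClientBackoffs`` with its method names is hypothetical rather than taken
from the client code.

```python
class ClientBackoffs:
    """Tracks installed backoffs and the requests held behind them."""

    def __init__(self):
        self.blocked = {}  # backoff id -> list of (begin, end) still in effect
        self.held = []     # targets of requests waiting behind a backoff

    def on_block(self, backoff_id, begin, end):
        """Install the backoff and return the acknowledgement to send."""
        self.blocked[backoff_id] = [(begin, end)]
        return ("ack-block", backoff_id)

    def is_blocked(self, target):
        return any(b <= target < e
                   for spans in self.blocked.values() for b, e in spans)

    def submit(self, target):
        """Return True if the request may be sent now, else hold it."""
        if self.is_blocked(target):
            self.held.append(target)
            return False
        return True

    def on_unblock(self, backoff_id, begin, end):
        """Clear only [begin, end) of the backoff; return targets to resend."""
        remaining = []
        for b, e in self.blocked.pop(backoff_id, []):
            if b < min(e, begin):          # part left of the unblocked range
                remaining.append((b, min(e, begin)))
            if max(b, end) < e:            # part right of the unblocked range
                remaining.append((max(b, end), e))
        if remaining:
            self.blocked[backoff_id] = remaining
        in_range = [t for t in self.held if begin <= t < end]
        resend = [t for t in in_range if not self.is_blocked(t)]
        self.held = [t for t in self.held if t not in resend]
        return resend
```

If a backoff over [0, 16) is later unblocked only for [0, 8), for example
after a PG split, requests in [8, 16) stay held until a second unblock
arrives for that remainder.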

On the OSD, backoffs are also tracked across ranges of the hash space, and
exist in three states:

#. new
#. acked
#. deleting

A newly installed backoff is set to *new* and a message is sent to the
client. When the *ack-block* message is received it is changed to the
*acked* state. The OSD may process other messages from the client that
are covered by the backoff in the *new* state, but once the backoff is
*acked* it should never see a blocked request unless there is a bug.

If the OSD wants to remove a backoff in the *acked* state it can
simply remove it and notify the client. If the backoff is in the
*new* state it must move it to the *deleting* state and continue to
use it to discard client requests until the *ack-block* message is
received, at which point it can finally be removed. This is necessary to
preserve the order of operations processed by the OSD.
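
The three OSD-side states form a small state machine. The following sketch
models only the transitions described above; the class ``OsdBackoff`` and its
method names are hypothetical, and sending the actual block/unblock messages
is left out.

```python
from enum import Enum


class State(Enum):
    NEW = "new"            # block sent, ack-block not yet received
    ACKED = "acked"        # client has acknowledged the block
    DELETING = "deleting"  # removal wanted, still waiting for ack-block


class OsdBackoff:
    def __init__(self, backoff_id, begin, end):
        self.id, self.begin, self.end = backoff_id, begin, end
        self.state = State.NEW

    def on_ack_block(self) -> bool:
        """Handle ack-block; return True if the backoff can now be dropped."""
        if self.state is State.NEW:
            self.state = State.ACKED
            return False
        # A deleting backoff was only kept around to wait for this ack.
        return self.state is State.DELETING

    def release(self) -> bool:
        """OSD wants the backoff gone; return True if removable right away."""
        if self.state is State.ACKED:
            return True
        # Not yet acked: keep it (to discard requests) until the ack arrives.
        self.state = State.DELETING
        return False
```

Releasing an unacknowledged backoff therefore takes two steps: ``release()``
moves it to *deleting*, and the later ``on_ack_block()`` reports that it can
finally be removed.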