Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | RADOS client protocol |
2 | ===================== | |
3 | ||
4 | This is very incomplete, but one must start somewhere. | |
5 | ||
6 | Basics | |
7 | ------ | |
8 | ||
9 | Requests are MOSDOp messages. Replies are MOSDOpReply messages. | |
10 | ||
11fdf7f2 | 11 | An object request is targeted at an hobject_t, which includes a pool, |
7c673cae FG |
12 | hash value, object name, placement key (usually empty), and snapid. |
13 | ||
14 | The hash is a 32-bit value, normally generated by hashing | |
15 | the object name. The hobject_t can be arbitrarily constructed, | |
16 | though, with any hash value and name. Note that in the MOSDOp these | |
17 | components are spread across several fields and not logically | |
18 | assembled in an actual hobject_t member (mainly for historical reasons). | |
19 | ||
20 | A request can also target a PG. In this case, the *ps* value matches | |
21 | a specific PG, the object name is empty, and (hopefully) the ops in | |
22 | the request are PG ops. | |
23 | ||
24 | Either way, the request ultimately targets a PG, either by using the | |
25 | explicit pgid or by folding the hash value onto the current number of | |
26 | pgs in the pool. The client sends the request to the primary for the | |
11fdf7f2 | 27 | associated PG. |
7c673cae FG |
28 | |
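The text above does not say how the hash value is folded onto the PG count. The following is a minimal sketch that assumes a "stable modulo" fold, modeled on the helper named ceph_stable_mod in the Ceph source. The Python function names (pg_num_mask, stable_mod, hash_to_ps) are illustrative only and are not part of the protocol.

```python
def pg_num_mask(pg_num: int) -> int:
    """Return the smallest (2**n - 1) that is >= pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold x into [0, b).

    Assumption: this mirrors a stable modulo, where growing b moves
    only some values to new buckets instead of reshuffling all of them.
    """
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def hash_to_ps(hash_value: int, pg_num: int) -> int:
    """Map a 32-bit object hash to a PG number within a pool of pg_num PGs."""
    return stable_mod(hash_value & 0xFFFFFFFF, pg_num, pg_num_mask(pg_num))
```

With pg_num = 12, a hash whose low four bits are 13 does not fit below 12, so it is folded with the next smaller mask and lands in PG 5.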
29 | Each request is assigned a unique tid. | |
30 | ||
31 | Resends | |
32 | ------- | |
33 | ||
34 | If there is a connection drop, the client will resend any outstanding | |
11fdf7f2 | 35 | requests. |
7c673cae FG |
36 | |
37 | Any time there is a PG mapping change such that the primary changes, | |
38 | the client is responsible for resending the request. Note that | |
39 | although there may be an interval change from the OSD's perspective | |
40 | (triggering PG peering), if the primary doesn't change then the client | |
41 | need not resend. | |
42 | ||
43 | There are a few exceptions to this rule: | |
44 | ||
45 | * There is a last_force_op_resend field in the pg_pool_t in the | |
46 | OSDMap. If this changes, then the clients are forced to resend any | |
47 | outstanding requests. (This happens when tiering is adjusted, for | |
48 | example.) | |
49 | * Some requests are such that they are resent on *any* PG interval | |
50 | change, as defined by pg_interval_t's is_new_interval() (the same | |
51 | criteria used by peering in the OSD). | |
52 | * If the PAUSE OSDMap flag is set and then unset, outstanding requests are resent. | |
53 | ||
54 | Each time a request is sent to the OSD the *attempt* field is incremented. The | |
55 | first time it is 0, the next 1, etc. | |
56 | ||
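The resend rules above can be summarized as a single predicate. This is only a sketch: PgView, Request, needs_resend, and send are hypothetical names, and the *interval* counter stands in for whatever an is_new_interval()-style check would report. Resends caused by a connection drop are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PgView:
    """Hypothetical snapshot of what the client knows about a PG."""
    primary: int                # OSD id of the current primary
    interval: int               # assumed counter, bumped on each new interval
    last_force_op_resend: int   # value taken from the pool in the OSDMap
    paused: bool = False        # whether the PAUSE flag is set


@dataclass
class Request:
    tid: int
    resend_on_any_interval: bool = False
    attempt: int = 0            # value carried by the next send


def needs_resend(req: Request, old: PgView, new: PgView) -> bool:
    """Decide whether an outstanding request must be resent after a map change."""
    if new.primary != old.primary:
        return True
    if new.last_force_op_resend != old.last_force_op_resend:
        return True
    if req.resend_on_any_interval and new.interval != old.interval:
        return True
    if old.paused and not new.paused:
        return True
    return False


def send(req: Request) -> int:
    """Return the attempt number used for this send, then increment it."""
    sent = req.attempt
    req.attempt += 1
    return sent
```

Note that an interval change alone does not trigger a resend for an ordinary request; only the primary change, the forced resend, the per-request flag, or the PAUSE transition does.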
57 | Backoff | |
58 | ------- | |
59 | ||
11fdf7f2 | 60 | Ordinarily the OSD will simply queue any requests it can't immediately |
7c673cae FG |
61 | process in memory until such time as it can. This can become |
62 | problematic because the OSD limits the total amount of RAM consumed by | |
63 | incoming messages: if either of the thresholds for the number of | |
64 | messages or the number of bytes is reached, new messages will not be | |
65 | read off the network socket, causing backpressure through the network. | |
66 | ||
67 | In some cases, though, the OSD knows or expects that a PG or object | |
68 | will be unavailable for some time and does not want to consume memory | |
69 | by queuing requests. In these cases it can send a MOSDBackoff message | |
70 | to the client. | |
71 | ||
72 | A backoff request has four properties: | |
73 | ||
74 | #. the op code (block, unblock, or ack-block) | |
75 | #. *id*, a unique id assigned within this session | |
76 | #. hobject_t begin | |
77 | #. hobject_t end | |
78 | ||
79 | There are two types of backoff: a *PG* backoff will plug all requests | |
11fdf7f2 | 80 | targeting an entire PG at the client, as described by a range of the |
7c673cae | 81 | hash/hobject_t space [begin,end), while an *object* backoff will plug |
11fdf7f2 | 82 | all requests targeting a single object (begin == end). |
7c673cae FG |
83 | |
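As a rough illustration of the two backoff types, the sketch below checks whether a request target falls under a backoff. It makes a loud simplification: an hobject_t is modeled as a (hash, name) tuple compared lexicographically, which is not the real hobject_t sort order. The Backoff class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Simplified stand-in for hobject_t: (hash, object name).
HObject = Tuple[int, str]


@dataclass(frozen=True)
class Backoff:
    id: int          # unique within the session
    begin: HObject
    end: HObject

    def is_object_backoff(self) -> bool:
        """An object backoff is described by begin == end."""
        return self.begin == self.end

    def covers(self, target: HObject) -> bool:
        """Return True if requests for target must be held back."""
        if self.is_object_backoff():
            return target == self.begin
        # PG backoff: half-open range [begin, end)
        return self.begin <= target < self.end
```

The half-open range means a target equal to *end* is not covered by a PG backoff, while an object backoff covers exactly one target.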
84 | When the client receives a *block* backoff message, it is now | |
85 | responsible for *not* sending any requests for hobject_ts described by | |
86 | the backoff. The backoff remains in effect until the backoff is | |
87 | cleared (via an 'unblock' message) or the OSD session is closed. An | |
88 | *ack-block* message is sent back to the OSD immediately to acknowledge | |
89 | receipt of the backoff. | |
90 | ||
91 | When an unblock is | |
92 | received, it will reference a specific id that the client previously had | |
93 | blocked. However, the range described by the unblock may be smaller | |
94 | than the original range, as the PG may have split on the OSD. The unblock | |
95 | should *only* unblock the range specified in the unblock message. Any requests | |
96 | that fall within the unblock request range are reexamined and, if no other | |
97 | installed backoff applies, resent. | |
98 | ||
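The client-side behavior described above, including a partial unblock after a PG split, might look roughly like the following. Everything here is an assumption made for illustration: ClientSession and its methods are hypothetical, hobject_t is simplified to a (hash, name) tuple with lexicographic ordering, and message sending is reduced to return values.

```python
from typing import List, Tuple

HObject = Tuple[int, str]             # simplified stand-in for hobject_t
Range = Tuple[int, HObject, HObject]  # (backoff id, begin, end)


def covers(begin: HObject, end: HObject, target: HObject) -> bool:
    """begin == end describes a single object; otherwise [begin, end)."""
    if begin == end:
        return target == begin
    return begin <= target < end


class ClientSession:
    def __init__(self) -> None:
        self.backoffs: List[Range] = []
        self.held: List[Tuple[int, HObject]] = []  # (tid, target)

    def submit(self, tid: int, target: HObject) -> bool:
        """Return True if the request may be sent now; otherwise hold it."""
        if any(covers(b, e, target) for _, b, e in self.backoffs):
            self.held.append((tid, target))
            return False
        return True

    def on_block(self, backoff_id: int, begin: HObject, end: HObject) -> str:
        """Install the backoff and return the reply to send to the OSD."""
        self.backoffs.append((backoff_id, begin, end))
        return "ack-block"

    def on_unblock(self, backoff_id: int, begin: HObject, end: HObject) -> List[int]:
        """Clear only the given range; return the tids that can be resent."""
        remaining: List[Range] = []
        for bid, b, e in self.backoffs:
            if bid != backoff_id:
                remaining.append((bid, b, e))
            elif b == e:
                if not covers(begin, end, b):
                    remaining.append((bid, b, e))
            else:
                if b < begin:                 # piece left of the unblocked range
                    remaining.append((bid, b, min(e, begin)))
                if end < e:                   # piece right of the unblocked range
                    remaining.append((bid, max(b, end), e))
        self.backoffs = remaining
        resend, still_held = [], []
        for tid, target in self.held:
            in_range = covers(begin, end, target)
            blocked = any(covers(b, e, target) for _, b, e in self.backoffs)
            (resend if in_range and not blocked else still_held).append((tid, target))
        self.held = still_held
        return [tid for tid, _ in resend]
```

In this sketch, an unblock that covers only the first half of an installed range releases the held requests in that half and leaves the rest blocked under the same id.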
99 | On the OSD, backoffs are also tracked across ranges of the hash space, and | |
100 | exist in three states: | |
101 | ||
102 | #. new | |
103 | #. acked | |
104 | #. deleting | |
105 | ||
106 | A newly installed backoff is set to *new* and a message is sent to the | |
107 | client. When the *ack-block* message is received it is changed to the | |
108 | *acked* state. The OSD may process other messages from the client that | |
109 | are covered by the backoff in the *new* state, but once the backoff is | |
110 | *acked* it should never see a blocked request unless there is a bug. | |
111 | ||
112 | If the OSD wants to remove a backoff in the *acked* state it can | |
113 | simply remove it and notify the client. If the backoff is in the | |
114 | *new* state it must move it to the *deleting* state and continue to | |
115 | use it to discard client requests until the *ack-block* message is | |
116 | received, at which point it can finally be removed. This is necessary to | |
117 | preserve the order of operations processed by the OSD. |
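The OSD-side lifecycle described above can be sketched as a small state machine. The class and method names (OsdBackoff, on_ack_block, release, must_discard) are hypothetical, and the sketch tracks only state transitions, not the messages exchanged with the client.

```python
from enum import Enum


class State(Enum):
    NEW = "new"
    ACKED = "acked"
    DELETING = "deleting"


class OsdBackoff:
    """Hypothetical model of one backoff as tracked by the OSD."""

    def __init__(self, backoff_id: int) -> None:
        self.id = backoff_id
        self.state = State.NEW   # a block message has been sent, no ack yet
        self.removed = False

    def on_ack_block(self) -> None:
        """Handle the client's ack-block for this backoff."""
        if self.state is State.NEW:
            self.state = State.ACKED
        elif self.state is State.DELETING:
            # The ack finally arrived, so the backoff can be dropped.
            self.removed = True

    def release(self) -> bool:
        """The OSD no longer needs the backoff.

        Returns True if it was removed immediately, False if removal
        must wait for the outstanding ack-block.
        """
        if self.state is State.ACKED:
            self.removed = True
            return True
        if self.state is State.NEW:
            self.state = State.DELETING
        return False

    def must_discard(self) -> bool:
        """True while a deleting backoff is still used to discard requests."""
        return self.state is State.DELETING and not self.removed
```

Keeping the backoff in the *deleting* state until the ack arrives is what lets the OSD keep discarding requests the client sent before it saw the block, which is how the ordering guarantee mentioned above is preserved.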