======================
OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0`` using the following convention. ::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter it in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
        osd journal size = 5120

    [osd.0]
        host = osd-host-a

    [osd.1]
        host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as it
             makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common practice
is to partition the journal drive (often an SSD), and mount it such that Ceph
uses the entire partition for the journal.

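For example, assuming a journal device that sustains roughly 100 MB/s and a
``filestore max sync interval`` of 5 seconds (both figures are illustrative
assumptions, not recommendations), the journal should be at least
``2 * 100 MB/s * 5 s = 1000 MB``:

.. code-block:: ini

    # Hypothetical sizing: 2 x (expected drive speed) x (filestore max sync interval)
    [osd]
        osd journal size = 1024
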

``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd max object size``

:Description: The maximum size of a RADOS object in bytes.
:Type: 32-bit Unsigned Integer
:Default: 128MB


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd mkfs options xfs = -f -d agcount=24


``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
drives) it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

The default ``osd journal size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file::

    osd journal size = 10240


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.

Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Integer
:Default: ``1``


``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, these define a time
              window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window whenever the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub begin week day``

:Description: This restricts scrubbing to this day of the week or later.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``0``


``osd scrub end week day``

:Description: This restricts scrubbing to days of the week earlier than this.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``7``

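For example, a sketch that confines scheduled scrubs to early-morning hours on
weekdays (the specific hours and days are illustrative assumptions; a scrub will
still run outside this window if a placement group exceeds
``osd scrub max interval``):

.. code-block:: ini

    [osd]
        osd scrub begin hour = 1
        osd scrub end hour = 7
        osd scrub begin week day = 1
        osd scrub end week day = 6
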

``osd scrub during recovery``

:Description: Allow scrub during recovery. Setting this to ``false`` will disable
              scheduling new scrubs (and deep-scrubs) while there is active
              recovery. Already running scrubs will be continued. This might be
              useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The normalized maximum load. Ceph will not scrub when the system
              load (as defined by ``getloadavg() / number of online cpus``) is
              higher than this number. Default is ``0.5``.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimal number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks. Increasing
              this value will slow down the whole scrub operation while client
              operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              effectively spreads the scrubs out randomly over the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``


``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


``osd scrub auto repair``

:Description: Setting this to ``true`` will enable automatic PG repair when
              errors are found in scrub or deep-scrub. However, if more than
              ``osd scrub auto repair num errors`` errors are found, a repair is
              NOT performed.
:Type: Boolean
:Default: ``false``


``osd scrub auto repair num errors``

:Description: Auto repair will not occur if more than this many errors are found.
:Type: 32-bit Integer
:Default: ``5``

11fdf7f2 | 377 | .. index:: OSD; operations settings |
7c673cae | 378 | |
11fdf7f2 TL |
379 | Operations |
380 | ========== | |
7c673cae FG |
381 | |
382 | ``osd op queue`` | |
383 | ||
384 | :Description: This sets the type of queue to be used for prioritizing ops | |
385 | in the OSDs. Both queues feature a strict sub-queue which is | |
386 | dequeued before the normal queue. The normal queue is different | |
387 | between implementations. The original PrioritizedQueue (``prio``) uses a | |
388 | token bucket system which when there are sufficient tokens will | |
389 | dequeue high priority queues first. If there are not enough | |
390 | tokens available, queues are dequeued low priority to high priority. | |
c07f9fc5 | 391 | The WeightedPriorityQueue (``wpq``) dequeues all priorities in |
7c673cae FG |
392 | relation to their priorities to prevent starvation of any queue. |
393 | WPQ should help in cases where a few OSDs are more overloaded | |
c07f9fc5 FG |
394 | than others. The new mClock based OpClassQueue |
395 | (``mclock_opclass``) prioritizes operations based on which class | |
396 | they belong to (recovery, scrub, snaptrim, client op, osd subop). | |
397 | And, the mClock based ClientQueue (``mclock_client``) also | |
398 | incorporates the client identifier in order to promote fairness | |
399 | between clients. See `QoS Based on mClock`_. Requires a restart. | |
7c673cae FG |
400 | |
401 | :Type: String | |
c07f9fc5 | 402 | :Valid Choices: prio, wpq, mclock_opclass, mclock_client |
7c673cae FG |
403 | :Default: ``prio`` |
404 | ||
405 | ||
406 | ``osd op queue cut off`` | |
407 | ||
408 | :Description: This selects which priority ops will be sent to the strict | |
409 | queue verses the normal queue. The ``low`` setting sends all | |
410 | replication ops and higher to the strict queue, while the ``high`` | |
411 | option sends only replication acknowledgement ops and higher to | |
412 | the strict queue. Setting this to ``high`` should help when a few | |
413 | OSDs in the cluster are very busy especially when combined with | |
414 | ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy | |
415 | handling replication traffic could starve primary client traffic | |
416 | on these OSDs without these settings. Requires a restart. | |
417 | ||
418 | :Type: String | |
419 | :Valid Choices: low, high | |
420 | :Default: ``low`` | |
421 | ||
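For example, a sketch of the ``wpq`` plus ``high`` cut off combination described
above (applying either setting requires an OSD restart):

.. code-block:: ini

    [osd]
        osd op queue = wpq
        osd op queue cut off = high
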

``osd client op priority``

:Description: The priority set for client operations.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations, if not specified by the
              pool's ``recovery_op_priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The default priority set for a scheduled scrub work queue when the
              pool doesn't specify a value of ``scrub_priority``. This can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd requested scrub priority``

:Description: The priority set for user requested scrub on the work queue. If
              this value is smaller than ``osd client op priority``, it can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``120``


``osd snap trim priority``

:Description: The priority set for the snap trim work queue.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim sleep``

:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend specific variants.

:Type: Float
:Default: ``0``


``osd snap trim sleep hdd``

:Description: Time in seconds to sleep before the next snap trim op
              for HDDs.

:Type: Float
:Default: ``5``


``osd snap trim sleep ssd``

:Description: Time in seconds to sleep before the next snap trim op
              for SSDs.

:Type: Float
:Default: ``0``


``osd snap trim sleep hybrid``

:Description: Time in seconds to sleep before the next snap trim op
              when osd data is on HDD and osd journal is on SSD.

:Type: Float
:Default: ``2``


``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by client
- osd subop: the iops issued by primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

And the resources are partitioned using the following three sets of tags. In
other words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity or the
   system is oversubscribed.

In Ceph, operations are graded with "cost". And the resources allocated
for serving various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5
requests per second serviced, even if it requires so (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
neither will they exhaust all the I/O resources. 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. And in the meantime, the client ops can enjoy a larger
portion of the I/O resource, because their weight is "9", while their
competitor's is "1". In the case of client ops, they are not clamped by
the limit setting, so they can make use of all the resources if there is
no recovery ongoing.

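As a sketch, the example above could be expressed with the ``mclock_opclass``
options documented later in this section (keeping in mind that limit values are
not yet enforced; see the CURRENT IMPLEMENTATION NOTE below):

.. code-block:: ini

    [osd]
        osd op queue = mclock_opclass
        osd op queue mclock recov res = 1.0
        osd op queue mclock recov lim = 5.0
        osd op queue mclock recov wgt = 1.0
        osd op queue mclock client op res = 2.0
        osd op queue mclock client op lim = 0.0
        osd op queue mclock client op wgt = 9.0
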
Along with *mclock_opclass* another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first
class. However, that will only happen once the reservations are met,
and those values include the operations executed under the reservation
phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000


``osd recovery max chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: The reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: The weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: The limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: The reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: The weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: The limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: The reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: The weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: The limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: The reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: The weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: The limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: The reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: The weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: The limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating placement
groups and the objects they contain can reduce the cluster's operational
performance considerably. To maintain operational performance, Ceph performs
this migration with 'backfilling', which allows Ceph to set backfill operations
to a lower priority than requests to read or write data.

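For example, a hypothetical sketch that allows more concurrent backfills per OSD
to speed up rebalancing, at the cost of more load competing with client I/O (the
values are illustrative, not recommendations):

.. code-block:: ini

    [osd]
        osd max backfills = 2
        osd backfill scan max = 1024
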

``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

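For example, a hypothetical tweak that caches more OSD maps on a cluster whose
map epochs grow quickly (the value is purely illustrative):

.. code-block:: ini

    [osd]
        osd map cache size = 200
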

``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``50``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``40``


.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

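For example, a hypothetical sketch that throttles recovery further on a
latency-sensitive cluster by reducing concurrent recovery ops and adding a small
sleep between them (the values are illustrative assumptions):

.. code-block:: ini

    [osd]
        osd recovery max active = 1
        osd recovery sleep = 0.1
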
1adf2230 | 911 | ``osd recovery delay start`` |
7c673cae | 912 | |
1adf2230 | 913 | :Description: After peering completes, Ceph will delay for the specified number |
7c673cae FG |
914 | of seconds before starting to recover objects. |
915 | ||
916 | :Type: Float | |
1adf2230 | 917 | :Default: ``0`` |
7c673cae FG |
918 | |
919 | ||
1adf2230 | 920 | ``osd recovery max active`` |
7c673cae | 921 | |
1adf2230 AA |
922 | :Description: The number of active recovery requests per OSD at one time. More |
923 | requests will accelerate recovery, but the requests places an | |
7c673cae FG |
924 | increased load on the cluster. |
925 | ||
926 | :Type: 32-bit Integer | |
31f18b77 | 927 | :Default: ``3`` |
7c673cae FG |
928 | |
929 | ||
1adf2230 | 930 | ``osd recovery max chunk`` |
7c673cae | 931 | |
1adf2230 | 932 | :Description: The maximum size of a recovered chunk of data to push. |
c07f9fc5 | 933 | :Type: 64-bit Unsigned Integer |
1adf2230 | 934 | :Default: ``8 << 20`` |
7c673cae FG |
935 | |
936 | ||
31f18b77 FG |
937 | ``osd recovery max single start`` |
938 | ||
939 | :Description: The maximum number of recovery operations per OSD that will be | |
940 | newly started when an OSD is recovering. | |
c07f9fc5 | 941 | :Type: 64-bit Unsigned Integer |
31f18b77 FG |
942 | :Default: ``1`` |
943 | ||
944 | ||
1adf2230 | 945 | ``osd recovery thread timeout`` |
7c673cae FG |
946 | |
947 | :Description: The maximum time in seconds before timing out a recovery thread. | |
948 | :Type: 32-bit Integer | |
949 | :Default: ``30`` | |
950 | ||
951 | ||
952 | ``osd recover clone overlap`` | |
953 | ||
1adf2230 | 954 | :Description: Preserves clone overlap during recovery. Should always be set |
7c673cae FG |
955 | to ``true``. |
956 | ||
957 | :Type: Boolean | |
958 | :Default: ``true`` | |
959 | ||
31f18b77 FG |
960 | |
961 | ``osd recovery sleep`` | |
962 | ||
c07f9fc5 FG |
963 | :Description: Time in seconds to sleep before next recovery or backfill op. |
964 | Increasing this value will slow down recovery operation while | |
965 | client operations will be less impacted. | |
31f18b77 FG |
966 | |
967 | :Type: Float | |
c07f9fc5 FG |
968 | :Default: ``0`` |
969 | ||
970 | ||
971 | ``osd recovery sleep hdd`` | |
972 | ||
973 | :Description: Time in seconds to sleep before next recovery or backfill op | |
974 | for HDDs. | |
975 | ||
976 | :Type: Float | |
977 | :Default: ``0.1`` | |
978 | ||
979 | ||
980 | ``osd recovery sleep ssd`` | |
981 | ||
982 | :Description: Time in seconds to sleep before next recovery or backfill op | |
983 | for SSDs. | |
984 | ||
985 | :Type: Float | |
986 | :Default: ``0`` | |
31f18b77 | 987 | |
d2e6a577 FG |
988 | |
989 | ``osd recovery sleep hybrid`` | |
990 | ||
991 | :Description: Time in seconds to sleep before next recovery or backfill op | |
992 | when osd data is on HDD and osd journal is on SSD. | |
993 | ||
994 | :Type: Float | |
995 | :Default: ``0.025`` | |
996 | ||
11fdf7f2 TL |
997 | |
998 | ``osd recovery priority`` | |
999 | ||
1000 | :Description: The default priority set for recovery work queue. Not | |
1001 | related to a pool's ``recovery_priority``. | |
1002 | ||
1003 | :Type: 32-bit Integer | |
1004 | :Default: ``5`` | |
1005 | ||
1006 | ||
Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio