======================
OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

  osd.0
  osd.1
  osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd journal size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as it
             makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common
practice is to partition the journal drive (often an SSD), and mount it such
that Ceph uses the entire partition for the journal.

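
As a worked example of that sizing rule: a drive that sustains roughly 100 MB/s,
combined with a ``filestore max sync interval`` of 5 seconds, needs a journal of
at least 2 \* 100 MB/s \* 5 s = 1000 MB. The figures below are assumptions for
illustration only, not recommendations:

.. code-block:: ini

    [osd]
    # Assuming a drive that sustains ~100 MB/s and a
    # filestore max sync interval of 5 seconds:
    # 2 * 100 MB/s * 5 s = 1000 MB
    osd journal size = 1000
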

``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd max object size``

:Description: The maximum size of a RADOS object in bytes.
:Type: 32-bit Unsigned Integer
:Default: 128MB


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

  osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

  osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

  /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be on the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSDs, NVMe) with slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

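
As an illustration of the second layout, the sketch below points one OSD's
journal at a partition on a faster device while the data volume stays on the
slower drive; the OSD ID and device path are hypothetical placeholders:

.. code-block:: ini

    [osd.0]
    # Hypothetical example: journal on a partition of a separate SSD.
    osd journal = /dev/disk/by-partlabel/osd-0-journal
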

The default ``osd journal size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file::

  osd journal size = 10240


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, it defines a time
              window in which scrubs can happen. However, a scrub will be
              performed whether the time window allows it or not, as long as the
              placement group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub begin week day``

:Description: This restricts scrubbing to this day of the week or later.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``0``


``osd scrub end week day``

:Description: This restricts scrubbing to days of the week earlier than this.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``7``

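
As an example of combining these four options, the sketch below confines
scheduled scrubs to early mornings on weekdays; the specific hours and days are
illustrative only, and scrubs whose interval has already exceeded
``osd scrub max interval`` will still run outside this window:

.. code-block:: ini

    [osd]
    # Illustrative values: only begin scheduled scrubs between
    # 01:00 and 07:00, Monday through Friday.
    osd scrub begin hour = 1
    osd scrub end hour = 7
    osd scrub begin week day = 1
    osd scrub end week day = 6
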

``osd scrub during recovery``

:Description: Allow scrub during recovery. Setting this to ``false`` will disable
              scheduling new scrubs (and deep-scrubs) while there is active
              recovery. Scrubs that are already running will continue. This might
              be useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``false``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The normalized maximum load. Ceph will not scrub when the system
              load (as defined by ``getloadavg() / number of online cpus``) is
              higher than this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimum interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimum number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks. Increasing
              this value will slow down the whole scrub operation, while client
              operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              practically spreads the scrubs out randomly over the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

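
For instance, a cluster that should scrub more often than the defaults might
shorten the intervals as in this sketch; the values (in seconds) are
illustrative, not recommendations:

.. code-block:: ini

    [osd]
    # Illustrative values only, expressed in seconds.
    osd scrub min interval = 43200    # scrub when load is low and the last
                                      # scrub is older than 12 hours
    osd scrub max interval = 259200   # scrub regardless of load after 3 days
    osd deep scrub interval = 604800  # deep scrub once a week
    osd scrub interval randomize ratio = 0.5
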

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


``osd scrub auto repair``

:Description: Setting this to ``true`` will enable automatic pg repair when
              errors are found in scrub or deep-scrub. However, if more than
              ``osd scrub auto repair num errors`` errors are found, a repair is
              NOT performed.
:Type: Boolean
:Default: ``false``


``osd scrub auto repair num errors``

:Description: Auto repair will not occur if more than this many errors are found.
:Type: 32-bit Integer
:Default: ``5``


.. index:: OSD; operations settings

Operations
==========

``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system which, when there are sufficient tokens,
              will dequeue high priority queues first. If there are not enough
              tokens available, queues are dequeued low priority to high priority.
              The WeightedPriorityQueue (``wpq``) dequeues all priorities in
              relation to their priorities to prevent starvation of any queue.
              WPQ should help in cases where a few OSDs are more overloaded
              than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              The mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``wpq``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgment ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. Without these settings,
              OSDs that are very busy handling replication traffic could starve
              primary client traffic on these OSDs. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``high``

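
The sketch below shows these two options together in ``ceph.conf``; the values
happen to match the documented defaults, and because both options require a
restart they only take effect once the OSD daemons have been restarted:

.. code-block:: ini

    [osd]
    # These match the documented defaults; changes take effect
    # only after the OSD daemons are restarted.
    osd op queue = wpq
    osd op queue cut off = high
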

``osd client op priority``

:Description: The priority set for client operations.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations, if not specified by the
              pool's ``recovery_op_priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The default priority set for a scheduled scrub work queue when the
              pool doesn't specify a value of ``scrub_priority``. This can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd requested scrub priority``

:Description: The priority set for user requested scrub on the work queue. If
              this value is smaller than ``osd client op priority``, it can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``120``


``osd snap trim priority``

:Description: The priority set for the snap trim work queue.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63

``osd snap trim sleep``

:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.

:Type: Float
:Default: ``0``


``osd snap trim sleep hdd``

:Description: Time in seconds to sleep before the next snap trim op
              for HDDs.

:Type: Float
:Default: ``5``


``osd snap trim sleep ssd``

:Description: Time in seconds to sleep before the next snap trim op
              for SSDs.

:Type: Float
:Default: ``0``


``osd snap trim sleep hybrid``

:Description: Time in seconds to sleep before the next snap trim op
              when osd data is on an HDD and the osd journal is on an SSD.

:Type: Float
:Default: ``2``

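
For example, to further reduce the impact of snap trimming on HDD-backed OSDs,
one might raise the HDD-specific sleep; the values below are illustrative, not
recommendations:

.. code-block:: ini

    [osd]
    # Illustrative values, not recommendations.
    osd snap trim sleep hdd = 10   # throttle snap trimming on HDD-backed OSDs
    osd snap trim sleep ssd = 0    # leave SSD-backed OSDs unthrottled
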

``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds has elapsed.

:Type: Float
:Default: ``30``


``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by client
- osd subop: the iops issued by primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity or the
   system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resources it is
guaranteed to possess, as long as it requires them. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests
per second serviced, even if it requests more and no other services are
competing with it (see CURRENT IMPLEMENTATION NOTE below). But if the
clients start to issue a large amount of I/O requests, they will not
exhaust all the I/O resources either. 1 request per second is always
allocated for recovery jobs as long as there are any such requests. So
the recovery jobs won't be starved even in a cluster with high load. In
the meantime, the client ops can enjoy a larger portion of the I/O
resources, because their weight is "9", while their competitor's is "1".
In the case of client ops, they are not clamped by the limit setting, so
they can make use of all the resources if there is no recovery ongoing.

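
For illustration only, the example tags above could be written with the mClock
options documented later in this section; the numbers simply restate the
(r, l, w) values from the example and are not tuning advice:

.. code-block:: ini

    [osd]
    # Recovery: (r:1, l:5, w:1)
    osd op queue mclock recov res = 1.0
    osd op queue mclock recov lim = 5.0
    osd op queue mclock recov wgt = 1.0
    # Client ops: (r:2, l:0, w:9)
    osd op queue mclock client op res = 2.0
    osd op queue mclock client op lim = 0.0
    osd op queue mclock client op wgt = 9.0
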

Along with *mclock_opclass*, another mClock operation queue named
*mclock_client* is available. It divides operations based on category,
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then requests of the latter class
should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000

``osd recovery max chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: The reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: The weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: The limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: The reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: The weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: The limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: The reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: The weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: The limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: The reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: The weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: The limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: The reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: The weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: The limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add or remove Ceph OSD Daemons to or from a cluster, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating placement
groups and the objects they contain can reduce the cluster's operational
performance considerably. To maintain operational performance, Ceph performs
this migration with 'backfilling', which allows Ceph to set backfill operations
to a lower priority than requests to read or write data.

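
For example, an operator willing to trade some client performance for faster
rebalancing might raise the limit described below; the value is purely
illustrative:

.. code-block:: ini

    [osd]
    # Illustrative: allow two concurrent backfills per OSD instead of the
    # default of one; higher values speed up rebalancing but increase load.
    osd max backfills = 2
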

``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``50``


``osd map message max``

:Description: The maximum number of map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``40``



.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

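
As a hedged sketch, the values below bias HDD-backed OSDs toward client traffic
at the expense of recovery speed; they are illustrative, not recommendations:

.. code-block:: ini

    [osd]
    # Illustrative values for a cluster where client latency matters
    # more than recovery speed.
    osd recovery max active hdd = 1
    osd recovery sleep hdd = 0.2
    osd recovery op priority = 1
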

``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

              This value is only used if it is non-zero. Normally it
              is ``0``, which means that the ``hdd`` or ``ssd`` values
              (below) are used, depending on the type of the primary
              device backing the OSD.

:Type: 32-bit Integer
:Default: ``0``

``osd recovery max active hdd``

:Description: The number of active recovery requests per OSD at one time, if the
              primary device is rotational.

:Type: 32-bit Integer
:Default: ``3``

``osd recovery max active ssd``

:Description: The number of active recovery requests per OSD at one time, if the
              primary device is non-rotational (i.e., an SSD).

:Type: 32-bit Integer
:Default: ``10``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operations, while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when osd data is on an HDD and the osd journal is on an SSD.

:Type: Float
:Default: ``0.025``


``osd recovery priority``

:Description: The default priority set for the recovery work queue. Not
              related to a pool's ``recovery_priority``.

:Type: 32-bit Integer
:Default: ``5``


Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd delete sleep``

:Description: Time in seconds to sleep before the next removal transaction. This
              helps to throttle the PG deletion process.

:Type: Float
:Default: ``0``


``osd delete sleep hdd``

:Description: Time in seconds to sleep before the next removal transaction
              for HDDs.

:Type: Float
:Default: ``5``


``osd delete sleep ssd``

:Description: Time in seconds to sleep before the next removal transaction
              for SSDs.

:Type: Float
:Default: ``0``


``osd delete sleep hybrid``

:Description: Time in seconds to sleep before the next removal transaction
              when osd data is on an HDD and the osd journal is on an SSD.

:Type: Float
:Default: ``2``

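
For example, PG deletion on rotational devices could be throttled further; the
value below is illustrative only:

.. code-block:: ini

    [osd]
    # Illustrative: sleep 10 seconds between removal transactions on HDDs.
    osd delete sleep hdd = 10
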

``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore the old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio