1 ======================
2 OSD Config Reference
3 ======================
4
5 .. index:: OSD; configuration
6
You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent
releases, the central config store), but Ceph OSD Daemons can run with default
values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``osd_journal_size`` (for Filestore) and ``host``, and uses
default values for nearly everything else.
12
Ceph OSD Daemons are identified numerically, in incremental fashion beginning
with ``0``, using the following convention::
15
16 osd.0
17 osd.1
18 osd.2
19
20 In a configuration file, you may specify settings for all Ceph OSD Daemons in
21 the cluster by adding configuration settings to the ``[osd]`` section of your
22 configuration file. To add settings directly to a specific Ceph OSD Daemon
23 (e.g., ``host``), enter it in an OSD-specific section of your configuration
24 file. For example:
25
26 .. code-block:: ini
27
28 [osd]
29 osd_journal_size = 5120
30
31 [osd.0]
32 host = osd-host-a
33
34 [osd.1]
35 host = osd-host-b
36
37
38 .. index:: OSD; config settings
39
40 General Settings
41 ================
42
43 The following settings provide a Ceph OSD Daemon's ID, and determine paths to
44 data and journals. Ceph deployment scripts typically generate the UUID
45 automatically.
46
.. warning:: **DO NOT** change the default paths for data or journals, as
   doing so makes it more difficult to troubleshoot Ceph later.
49
When using Filestore, the journal size should be at least twice the expected
drive speed multiplied by ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD), and mount it
such that Ceph uses the entire partition for the journal.
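
As a minimal sketch of that calculation, assume a drive that sustains roughly
100 MB/s and a ``filestore_max_sync_interval`` of 5 seconds (both values are
illustrative):

.. code-block:: ini

    [osd]
    # illustrative minimum: 2 x 100 MB/s x 5 s = 1000 MB, rounded up
    osd_journal_size = 1024

In practice, the 5120 MB default or a dedicated journal partition is usually
more comfortable than this calculated minimum.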
54
55
56 ``osd_uuid``
57
58 :Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
59 :Type: UUID
60 :Default: The UUID.
61 :Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
62 applies to the entire cluster.
63
64
65 ``osd_data``
66
:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.
70
71 :Type: String
72 :Default: ``/var/lib/ceph/osd/$cluster-$id``
73
74
75 ``osd_max_write_size``
76
77 :Description: The maximum size of a write in megabytes.
78 :Type: 32-bit Integer
79 :Default: ``90``
80
81
82 ``osd_max_object_size``
83
84 :Description: The maximum size of a RADOS object in bytes.
85 :Type: 32-bit Unsigned Integer
86 :Default: 128MB
87
88
89 ``osd_client_message_size_cap``
90
91 :Description: The largest client data message allowed in memory.
92 :Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``
94
95
96 ``osd_class_dir``
97
98 :Description: The class path for RADOS class plug-ins.
99 :Type: String
100 :Default: ``$libdir/rados-classes``
101
102
103 .. index:: OSD; file system
104
105 File System Settings
106 ====================
107 Ceph builds and mounts file systems which are used for Ceph OSDs.
108
109 ``osd_mkfs_options {fs-type}``
110
111 :Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.
112
113 :Type: String
114 :Default for xfs: ``-f -i 2048``
115 :Default for other file systems: {empty string}
116
For example::

    osd_mkfs_options_xfs = -f -d agcount=24
119
120 ``osd_mount_options {fs-type}``
121
122 :Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.
123
124 :Type: String
125 :Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw,noatime``
127
For example::

    osd_mount_options_xfs = rw,noatime,inode64,logbufs=8
130
131
132 .. index:: OSD; journal settings
133
134 Journal Settings
135 ================
136
This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.
139
140 By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
141 the following path, which is usually a symlink to a device or partition::
142
143 /var/lib/ceph/osd/$cluster-$id/journal
144
145 When using a single device type (for example, spinning drives), the journals
146 should be *colocated*: the logical volume (or partition) should be in the same
147 device as the ``data`` logical volume.
148
149 When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
150 drives) it makes sense to place the journal on the faster device, while
151 ``data`` occupies the slower device fully.
152
153 The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
154 larger, in which case it will need to be set in the ``ceph.conf`` file.
155 A value of 10 gigabytes is common in practice::
156
157 osd_journal_size = 10240
158
159
160 ``osd_journal``
161
162 :Description: The path to the OSD's journal. This may be a path to a file or a
163 block device (such as a partition of an SSD). If it is a file,
164 you must create the directory to contain it. We recommend using a
165 separate fast device when the ``osd_data`` drive is an HDD.
166
167 :Type: String
168 :Default: ``/var/lib/ceph/osd/$cluster-$id/journal``
169
170
171 ``osd_journal_size``
172
173 :Description: The size of the journal in megabytes.
174
175 :Type: 32-bit Integer
176 :Default: ``5120``
177
178
179 See `Journal Config Reference`_ for additional details.
180
181
182 Monitor OSD Interaction
183 =======================
184
185 Ceph OSD Daemons check each other's heartbeats and report to monitors
186 periodically. Ceph can use default values in many cases. However, if your
187 network has latency issues, you may need to adopt longer intervals. See
188 `Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
189
190
191 Data Placement
192 ==============
193
194 See `Pool & PG Config Reference`_ for details.
195
196
197 .. index:: OSD; scrubbing
198
199 Scrubbing
200 =========
201
202 In addition to making multiple copies of objects, Ceph ensures data integrity by
203 scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
204 object storage layer. For each placement group, Ceph generates a catalog of all
205 objects and compares each primary object and its replicas to ensure that no
206 objects are missing or mismatched. Light scrubbing (daily) checks the object
207 size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
208 to ensure data integrity.
209
210 Scrubbing is important for maintaining data integrity, but it can reduce
211 performance. You can adjust the following settings to increase or decrease
212 scrubbing operations.
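
For example, one possible starting point for reducing scrub impact, using the
settings documented below (the non-default value is illustrative only):

.. code-block:: ini

    [osd]
    # default: one concurrent scrub per OSD
    osd_max_scrubs = 1
    # illustrative: pause between groups of scrubbed chunks (default is 0)
    osd_scrub_sleep = 0.1
    # default: skip new scrubs when the normalized load is above this value
    osd_scrub_load_threshold = 0.5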
213
214
215 ``osd_max_scrubs``
216
217 :Description: The maximum number of simultaneous scrub operations for
218 a Ceph OSD Daemon.
219
:Type: 32-bit Integer
221 :Default: ``1``
222
223 ``osd_scrub_begin_hour``
224
:Description: This restricts scrubbing to this hour of the day or later.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing the entire day. Together with
              ``osd_scrub_end_hour``, this defines a time window in which
              scrubs can happen. However, a scrub will be performed regardless
              of the time window whenever the placement group's scrub interval
              exceeds ``osd_scrub_max_interval``.
232 :Type: Integer in the range of 0 to 23
233 :Default: ``0``
234
235
236 ``osd_scrub_end_hour``
237
:Description: This restricts scrubbing to hours of the day earlier than this.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing for the entire day. Together with
              ``osd_scrub_begin_hour``, this defines a time window in which
              scrubs can happen. However, a scrub will be performed regardless
              of the time window whenever the placement group's scrub interval
              exceeds ``osd_scrub_max_interval``.
244 :Type: Integer in the range of 0 to 23
245 :Default: ``0``
246
247
248 ``osd_scrub_begin_week_day``
249
:Description: This restricts scrubbing to this day of the week or later.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_end_week_day``, this
              defines a time window in which scrubs can happen. However, a
              scrub will be performed regardless of the time window whenever
              the PG's scrub interval exceeds ``osd_scrub_max_interval``.
257 :Type: Integer in the range of 0 to 6
258 :Default: ``0``
259
260
261 ``osd_scrub_end_week_day``
262
:Description: This restricts scrubbing to days of the week earlier than this.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_begin_week_day``, this
              defines a time window in which scrubs can happen. However, a
              scrub will be performed regardless of the time window whenever
              the placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
270 :Type: Integer in the range of 0 to 6
271 :Default: ``0``
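
For example, the following sketch restricts scheduled scrubs to the hours
between midnight and 06:00 on any day of the week (the hours chosen are
illustrative):

.. code-block:: ini

    [osd]
    # illustrative: allow scrubs only from 00:00 to before 06:00
    osd_scrub_begin_hour = 0
    osd_scrub_end_hour = 6
    # defaults: 0 and 0 leave all days of the week allowed
    osd_scrub_begin_week_day = 0
    osd_scrub_end_week_day = 0

Remember that a PG whose scrub interval exceeds ``osd_scrub_max_interval`` is
scrubbed regardless of this window.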
272
273
``osd_scrub_during_recovery``

:Description: Allow scrubs during recovery. Setting this to ``false`` will
              disable the scheduling of new scrubs (and deep scrubs) while
              there is active recovery. Scrubs that are already running will
              continue. This might be useful to reduce load on busy clusters.
280 :Type: Boolean
281 :Default: ``false``
282
283
284 ``osd_scrub_thread_timeout``
285
286 :Description: The maximum time in seconds before timing out a scrub thread.
287 :Type: 32-bit Integer
288 :Default: ``60``
289
290
291 ``osd_scrub_finalize_thread_timeout``
292
293 :Description: The maximum time in seconds before timing out a scrub finalize
294 thread.
295
296 :Type: 32-bit Integer
297 :Default: ``10*60``
298
299
300 ``osd_scrub_load_threshold``
301
:Description: The normalized maximum load. Ceph will not scrub when the
              system load (as defined by ``getloadavg() / number of online
              CPUs``) is higher than this number.
305
306 :Type: Float
307 :Default: ``0.5``
308
309
310 ``osd_scrub_min_interval``
311
312 :Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
313 when the Ceph Storage Cluster load is low.
314
315 :Type: Float
316 :Default: Once per day. ``24*60*60``
317
318 .. _osd_scrub_max_interval:
319
320 ``osd_scrub_max_interval``
321
322 :Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
323 irrespective of cluster load.
324
325 :Type: Float
326 :Default: Once per week. ``7*24*60*60``
327
328
329 ``osd_scrub_chunk_min``
330
:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during a
              scrub.
333
334 :Type: 32-bit Integer
335 :Default: 5
336
337
338 ``osd_scrub_chunk_max``
339
:Description: The maximum number of object store chunks to scrub during a
              single operation.
341
342 :Type: 32-bit Integer
343 :Default: 25
344
345
346 ``osd_scrub_sleep``
347
348 :Description: Time to sleep before scrubbing the next group of chunks. Increasing this value will slow
349 down the overall rate of scrubbing so that client operations will be less impacted.
350
351 :Type: Float
352 :Default: 0
353
354
355 ``osd_deep_scrub_interval``
356
357 :Description: The interval for "deep" scrubbing (fully reading all data). The
358 ``osd_scrub_load_threshold`` does not affect this setting.
359
360 :Type: Float
361 :Default: Once per week. ``7*24*60*60``
362
363
364 ``osd_scrub_interval_randomize_ratio``
365
:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling
              the next scrub job for a PG. The delay is a random
              value less than ``osd_scrub_min_interval`` \*
              ``osd_scrub_interval_randomize_ratio``. The default setting
              spreads scrubs throughout the allowed time
              window of ``[1, 1.5]`` \* ``osd_scrub_min_interval``.
372 :Type: Float
373 :Default: ``0.5``
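
As a worked example (both values shown are the defaults): with
``osd_scrub_min_interval`` at one day and the ratio at ``0.5``, the next scrub
for a PG is scheduled at a random point between 24 and 36 hours after the
previous one:

.. code-block:: ini

    [osd]
    # default: 24 hours
    osd_scrub_min_interval = 86400
    # default: spreads the next scrub across the 24-36 hour window
    osd_scrub_interval_randomize_ratio = 0.5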
374
375 ``osd_deep_scrub_stride``
376
377 :Description: Read size when doing a deep scrub.
378 :Type: 32-bit Integer
379 :Default: 512 KB. ``524288``
380
381
382 ``osd_scrub_auto_repair``
383
:Description: Setting this to ``true`` will enable automatic PG repair when
              errors are found by scrubs or deep scrubs. However, if more than
              ``osd_scrub_auto_repair_num_errors`` errors are found, a repair
              is NOT performed.
387 :Type: Boolean
388 :Default: ``false``
389
390
391 ``osd_scrub_auto_repair_num_errors``
392
393 :Description: Auto repair will not occur if more than this many errors are found.
394 :Type: 32-bit Integer
395 :Default: ``5``
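
For example, a sketch that enables automatic repair while keeping the default
error cutoff (enabling auto repair is the only non-default value here):

.. code-block:: ini

    [osd]
    # illustrative: automatically repair PGs after scrub errors
    osd_scrub_auto_repair = true
    # default: skip auto repair when more errors than this are found
    osd_scrub_auto_repair_num_errors = 5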
396
397
398 .. index:: OSD; operations settings
399
400 Operations
401 ==========
402
403 ``osd_op_queue``
404
405 :Description: This sets the type of queue to be used for prioritizing ops
406 within each OSD. Both queues feature a strict sub-queue which is
407 dequeued before the normal queue. The normal queue is different
408 between implementations. The WeightedPriorityQueue (``wpq``)
409 dequeues operations in relation to their priorities to prevent
410 starvation of any queue. WPQ should help in cases where a few OSDs
411 are more overloaded than others. The new mClockQueue
412 (``mclock_scheduler``) prioritizes operations based on which class
413 they belong to (recovery, scrub, snaptrim, client op, osd subop).
414 See `QoS Based on mClock`_. Requires a restart.
415
416 :Type: String
417 :Valid Choices: wpq, mclock_scheduler
418 :Default: ``wpq``
419
420
421 ``osd_op_queue_cut_off``
422
:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the
              ``high`` option sends only replication acknowledgment ops and
              higher to the strict queue. Setting this to ``high`` should
              help when a few OSDs in the cluster are very busy, especially
              when combined with ``wpq`` in the ``osd_op_queue`` setting.
              OSDs that are very busy handling replication traffic could
              starve primary client traffic on these OSDs without these
              settings. Requires a restart.
432
433 :Type: String
434 :Valid Choices: low, high
435 :Default: ``high``
436
437
438 ``osd_client_op_priority``
439
440 :Description: The priority set for client operations. This value is relative
441 to that of ``osd_recovery_op_priority`` below. The default
442 strongly favors client ops over recovery.
443
444 :Type: 32-bit Integer
445 :Default: ``63``
446 :Valid Range: 1-63
447
448
449 ``osd_recovery_op_priority``
450
451 :Description: The priority of recovery operations vs client operations, if not specified by the
452 pool's ``recovery_op_priority``. The default value prioritizes client
453 ops (see above) over recovery ops. You may adjust the tradeoff of client
454 impact against the time to restore cluster health by lowering this value
455 for increased prioritization of client ops, or by increasing it to favor
456 recovery.
457
458 :Type: 32-bit Integer
459 :Default: ``3``
460 :Valid Range: 1-63
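
For example, a sketch that keeps client traffic strongly favored while further
de-prioritizing recovery (the recovery value shown is illustrative):

.. code-block:: ini

    [osd]
    # default: highest priority for client ops
    osd_client_op_priority = 63
    # illustrative: favor client ops over recovery even more (default is 3)
    osd_recovery_op_priority = 1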
461
462
463 ``osd_scrub_priority``
464
465 :Description: The default work queue priority for scheduled scrubs when the
466 pool doesn't specify a value of ``scrub_priority``. This can be
467 boosted to the value of ``osd_client_op_priority`` when scrubs are
468 blocking client operations.
469
470 :Type: 32-bit Integer
471 :Default: ``5``
472 :Valid Range: 1-63
473
474
475 ``osd_requested_scrub_priority``
476
:Description: The priority set for user-requested scrubs on the work queue.
              If this value is smaller than ``osd_client_op_priority``, it can
              be boosted to the value of ``osd_client_op_priority`` when a
              scrub is blocking client operations.
481
482 :Type: 32-bit Integer
483 :Default: ``120``
484
485
486 ``osd_snap_trim_priority``
487
488 :Description: The priority set for the snap trim work queue.
489
490 :Type: 32-bit Integer
491 :Default: ``5``
492 :Valid Range: 1-63
493
494 ``osd_snap_trim_sleep``
495
:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.
499
500 :Type: Float
501 :Default: ``0``
502
503
504 ``osd_snap_trim_sleep_hdd``
505
506 :Description: Time in seconds to sleep before next snap trim op
507 for HDDs.
508
509 :Type: Float
510 :Default: ``5``
511
512
513 ``osd_snap_trim_sleep_ssd``
514
515 :Description: Time in seconds to sleep before next snap trim op
516 for SSD OSDs (including NVMe).
517
518 :Type: Float
519 :Default: ``0``
520
521
522 ``osd_snap_trim_sleep_hybrid``
523
524 :Description: Time in seconds to sleep before next snap trim op
525 when OSD data is on an HDD and the OSD journal or WAL+DB is on an SSD.
526
527 :Type: Float
528 :Default: ``2``
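
For example, a sketch that slows snap trimming on HDD-backed OSDs a little
more than the default (the value is illustrative):

.. code-block:: ini

    [osd]
    # illustrative: default is 5 seconds
    osd_snap_trim_sleep_hdd = 10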
529
530 ``osd_op_thread_timeout``
531
532 :Description: The Ceph OSD Daemon operation thread timeout in seconds.
533 :Type: 32-bit Integer
534 :Default: ``15``
535
536
537 ``osd_op_complaint_time``
538
:Description: An operation becomes complaint-worthy after the specified
              number of seconds has elapsed.
541
542 :Type: Float
543 :Default: ``30``
544
545
546 ``osd_op_history_size``
547
548 :Description: The maximum number of completed operations to track.
549 :Type: 32-bit Unsigned Integer
550 :Default: ``20``
551
552
553 ``osd_op_history_duration``
554
:Description: The age, in seconds, of the oldest completed operation to track.
556 :Type: 32-bit Unsigned Integer
557 :Default: ``600``
558
559
560 ``osd_op_log_threshold``
561
:Description: How many operation logs to display at once.
563 :Type: 32-bit Integer
564 :Default: ``5``
565
566
567 .. _dmclock-qos:
568
569 QoS Based on mClock
570 -------------------
571
572 Ceph's use of mClock is currently experimental and should
573 be approached with an exploratory mindset.
574
575 Core Concepts
576 `````````````
577
578 Ceph's QoS support is implemented using a queueing scheduler
579 based on `the dmClock algorithm`_. This algorithm allocates the I/O
580 resources of the Ceph cluster in proportion to weights, and enforces
581 the constraints of minimum reservation and maximum limitation, so that
582 the services can compete for the resources fairly. Currently the
583 *mclock_scheduler* operation queue divides Ceph services involving I/O
584 resources into following buckets:
585
- client op: the IOPS issued by clients
- osd subop: the IOPS issued by the primary OSD
588 - snap trim: the snap trimming related requests
589 - pg recovery: the recovery related requests
590 - pg scrub: the scrub related requests
591
The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:
594
595 #. reservation: the minimum IOPS allocated for the service.
596 #. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity or
   the system is oversubscribed.
599
In Ceph, operations are graded with a "cost", and the resources allocated for
serving the various services are consumed by these "costs". So, for example,
the more reservation a service has, the more resources it is guaranteed to
possess, as long as it requires them. Assume there are two services, recovery
and client ops:
605
606 - recovery: (r:1, l:5, w:1)
607 - client ops: (r:2, l:0, w:9)
608
The settings above ensure that recovery won't be serviced at more than 5
requests per second, even if it asks for more and no other services are
competing with it (see the CURRENT IMPLEMENTATION NOTE below). And even if
clients start to issue a large amount of I/O requests, they will not exhaust
all the I/O resources: 1 request per second is always allocated for recovery
jobs as long as there are any such requests, so recovery jobs won't be starved
even in a cluster with high load. In the meantime, client ops can enjoy a
larger portion of the I/O resources, because their weight is "9" while their
competitor's is "1". Client ops are not clamped by the limit setting, so they
can make use of all the resources if there is no recovery ongoing.
621
622 CURRENT IMPLEMENTATION NOTE: the current experimental implementation
623 does not enforce the limit values. As a first approximation we decided
624 not to prevent operations that would otherwise enter the operation
625 sequencer from doing so.
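
As a sketch, the recovery and client-op tags from the example above map onto
the ``osd_mclock_scheduler_*`` options listed later in this section. Leaving
the client limit at its very high default effectively leaves client ops
unclamped; all values shown are illustrative:

.. code-block:: ini

    [osd]
    # requires a restart to take effect
    osd_op_queue = mclock_scheduler
    # client ops: (r:2, l:unlimited, w:9)
    osd_mclock_scheduler_client_res = 2
    osd_mclock_scheduler_client_wgt = 9
    # recovery: (r:1, l:5, w:1); note the limit is not enforced yet
    osd_mclock_scheduler_background_recovery_res = 1
    osd_mclock_scheduler_background_recovery_lim = 5
    osd_mclock_scheduler_background_recovery_wgt = 1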
626
627 Subtleties of mClock
628 ````````````````````
629
630 The reservation and limit values have a unit of requests per
631 second. The weight, however, does not technically have a unit and the
632 weights are relative to one another. So if one class of requests has a
633 weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.
637
638 Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
640 requests. If the weight is *W*, then for a given class of requests,
641 the next one that comes in will have a weight tag of *1/W* plus the
642 previous weight tag or the current time, whichever is larger. That
643 means if *W* is sufficiently large and therefore *1/W* is sufficiently
644 small, the calculated tag may never be assigned as it will get a value
645 of the current time. The ultimate lesson is that values for weight
646 should not be too large. They should be under the number of requests
one expects to be serviced each second.
648
649 Caveats
650 ```````
651
652 There are some factors that can reduce the impact of the mClock op
653 queues within Ceph. First, requests to an OSD are sharded by their
654 placement group identifier. Each shard has its own mClock queue and
655 these queues neither interact nor share information among them. The
656 number of shards can be controlled with the configuration options
657 ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
658 ``osd_op_num_shards_ssd``. A lower number of shards will increase the
659 impact of the mClock queues, but may have other deleterious effects.
660
661 Second, requests are transferred from the operation queue to the
662 operation sequencer, in which they go through the phases of
663 execution. The operation queue is where mClock resides and mClock
664 determines the next op to transfer to the operation sequencer. The
665 number of operations allowed in the operation sequencer is a complex
666 issue. In general we want to keep enough operations in the sequencer
667 so it's always getting work done on some operations while it's waiting
668 for disk and network access to complete on other operations. On the
669 other hand, once an operation is transferred to the operation
670 sequencer, mClock no longer has control over it. Therefore to maximize
671 the impact of mClock, we want to keep as few operations in the
672 operation sequencer as possible. So we have an inherent tension.
673
674 The configuration options that influence the number of operations in
675 the operation sequencer are ``bluestore_throttle_bytes``,
676 ``bluestore_throttle_deferred_bytes``,
677 ``bluestore_throttle_cost_per_io``,
678 ``bluestore_throttle_cost_per_io_hdd``, and
679 ``bluestore_throttle_cost_per_io_ssd``.
680
681 A third factor that affects the impact of the mClock algorithm is that
682 we're using a distributed system, where requests are made to multiple
683 OSDs and each OSD has (can have) multiple shards. Yet we're currently
684 using the mClock algorithm, which is not distributed (note: dmClock is
685 the distributed version of mClock).
686
687 Various organizations and individuals are currently experimenting with
688 mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
690 mClock and dmClock experiments on the ``ceph-devel`` mailing list.
691
692
693 ``osd_push_per_object_cost``
694
:Description: The overhead for serving a push op.
696
697 :Type: Unsigned Integer
698 :Default: 1000
699
700
701 ``osd_recovery_max_chunk``
702
:Description: The maximum total size of data chunks a recovery op can carry.
704
705 :Type: Unsigned Integer
706 :Default: 8 MiB
707
708
709 ``osd_mclock_scheduler_client_res``
710
711 :Description: IO proportion reserved for each client (default).
712
713 :Type: Unsigned Integer
714 :Default: 1
715
716
717 ``osd_mclock_scheduler_client_wgt``
718
719 :Description: IO share for each client (default) over reservation.
720
721 :Type: Unsigned Integer
722 :Default: 1
723
724
725 ``osd_mclock_scheduler_client_lim``
726
727 :Description: IO limit for each client (default) over reservation.
728
729 :Type: Unsigned Integer
730 :Default: 999999
731
732
733 ``osd_mclock_scheduler_background_recovery_res``
734
735 :Description: IO proportion reserved for background recovery (default).
736
737 :Type: Unsigned Integer
738 :Default: 1
739
740
741 ``osd_mclock_scheduler_background_recovery_wgt``
742
743 :Description: IO share for each background recovery over reservation.
744
745 :Type: Unsigned Integer
746 :Default: 1
747
748
749 ``osd_mclock_scheduler_background_recovery_lim``
750
751 :Description: IO limit for background recovery over reservation.
752
753 :Type: Unsigned Integer
754 :Default: 999999
755
756
757 ``osd_mclock_scheduler_background_best_effort_res``
758
759 :Description: IO proportion reserved for background best_effort (default).
760
761 :Type: Unsigned Integer
762 :Default: 1
763
764
765 ``osd_mclock_scheduler_background_best_effort_wgt``
766
767 :Description: IO share for each background best_effort over reservation.
768
769 :Type: Unsigned Integer
770 :Default: 1
771
772
773 ``osd_mclock_scheduler_background_best_effort_lim``
774
775 :Description: IO limit for background best_effort over reservation.
776
777 :Type: Unsigned Integer
778 :Default: 999999
779
780 .. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
781
782
783 .. index:: OSD; backfilling
784
785 Backfilling
786 ===========
787
When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
790 to restore balanced utilization. The process of migrating placement groups and
791 the objects they contain can reduce the cluster's operational performance
792 considerably. To maintain operational performance, Ceph performs this migration
793 with 'backfilling', which allows Ceph to set backfill operations to a lower
794 priority than requests to read or write data.
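
For example, a conservative sketch of the backfill throttles; ``osd_max_backfills``
is left at its default and the retry interval shown is illustrative:

.. code-block:: ini

    [osd]
    # default: one concurrent backfill to or from each OSD
    osd_max_backfills = 1
    # illustrative: back off longer before retrying backfill (default is 10.0)
    osd_backfill_retry_interval = 30.0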
795
796
797 ``osd_max_backfills``
798
799 :Description: The maximum number of backfills allowed to or from a single OSD.
800 Note that this is applied separately for read and write operations.
801 :Type: 64-bit Unsigned Integer
802 :Default: ``1``
803
804
805 ``osd_backfill_scan_min``
806
807 :Description: The minimum number of objects per backfill scan.
808
809 :Type: 32-bit Integer
810 :Default: ``64``
811
812
813 ``osd_backfill_scan_max``
814
815 :Description: The maximum number of objects per backfill scan.
816
817 :Type: 32-bit Integer
818 :Default: ``512``
819
820
821 ``osd_backfill_retry_interval``
822
823 :Description: The number of seconds to wait before retrying backfill requests.
824 :Type: Double
825 :Default: ``10.0``
826
827 .. index:: OSD; osdmap
828
829 OSD Map
830 =======
831
832 OSD maps reflect the OSD daemons operating in the cluster. Over time, the
833 number of map epochs increases. Ceph provides some settings to ensure that
834 Ceph performs well as the OSD map grows larger.
835
836
837 ``osd_map_dedup``
838
839 :Description: Enable removing duplicates in the OSD map.
840 :Type: Boolean
841 :Default: ``true``
842
843
844 ``osd_map_cache_size``
845
846 :Description: The number of OSD maps to keep cached.
847 :Type: 32-bit Integer
848 :Default: ``50``
849
850
851 ``osd_map_message_max``
852
853 :Description: The maximum map entries allowed per MOSDMap message.
854 :Type: 32-bit Integer
855 :Default: ``40``
856
857
858
859 .. index:: OSD; recovery
860
861 Recovery
862 ========
863
864 When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
865 begins peering with other Ceph OSD Daemons before writes can occur. See
866 `Monitoring OSDs and PGs`_ for details.
867
868 If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
869 sync with other Ceph OSD Daemons containing more recent versions of objects in
870 the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
871 mode and seeks to get the latest copy of the data and bring its map back up to
872 date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
873 and placement groups may be significantly out of date. Also, if a failure domain
874 went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
875 the same time. This can make the recovery process time consuming and resource
876 intensive.
877
To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.
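
For example, a sketch that further throttles recovery on HDD-backed OSDs to
protect client I/O (both non-default values are illustrative):

.. code-block:: ini

    [osd]
    # illustrative: default is 3 for rotational primary devices
    osd_recovery_max_active_hdd = 1
    # illustrative: default is 0.1 seconds
    osd_recovery_sleep_hdd = 0.2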
881
882
883 ``osd_recovery_delay_start``
884
885 :Description: After peering completes, Ceph will delay for the specified number
886 of seconds before starting to recover RADOS objects.
887
888 :Type: Float
889 :Default: ``0``
890
891
892 ``osd_recovery_max_active``
893
:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.
897
898 This value is only used if it is non-zero. Normally it
899 is ``0``, which means that the ``hdd`` or ``ssd`` values
900 (below) are used, depending on the type of the primary
901 device backing the OSD.
902
903 :Type: 32-bit Integer
904 :Default: ``0``
905
906 ``osd_recovery_max_active_hdd``
907
908 :Description: The number of active recovery requests per OSD at one time, if the
909 primary device is rotational.
910
911 :Type: 32-bit Integer
912 :Default: ``3``
913
914 ``osd_recovery_max_active_ssd``
915
916 :Description: The number of active recovery requests per OSD at one time, if the
917 primary device is non-rotational (i.e., an SSD).
918
919 :Type: 32-bit Integer
920 :Default: ``10``
921
922
923 ``osd_recovery_max_chunk``
924
925 :Description: The maximum size of a recovered chunk of data to push.
926 :Type: 64-bit Unsigned Integer
927 :Default: ``8 << 20``
928
929
930 ``osd_recovery_max_single_start``
931
932 :Description: The maximum number of recovery operations per OSD that will be
933 newly started when an OSD is recovering.
934 :Type: 64-bit Unsigned Integer
935 :Default: ``1``
936
937
938 ``osd_recovery_thread_timeout``
939
940 :Description: The maximum time in seconds before timing out a recovery thread.
941 :Type: 32-bit Integer
942 :Default: ``30``
943
944
945 ``osd_recover_clone_overlap``
946
947 :Description: Preserves clone overlap during recovery. Should always be set
948 to ``true``.
949
950 :Type: Boolean
951 :Default: ``true``
952
953
954 ``osd_recovery_sleep``
955
956 :Description: Time in seconds to sleep before the next recovery or backfill op.
957 Increasing this value will slow down recovery operation while
958 client operations will be less impacted.
959
960 :Type: Float
961 :Default: ``0``
962
963
964 ``osd_recovery_sleep_hdd``
965
966 :Description: Time in seconds to sleep before next recovery or backfill op
967 for HDDs.
968
969 :Type: Float
970 :Default: ``0.1``
971
972
973 ``osd_recovery_sleep_ssd``
974
975 :Description: Time in seconds to sleep before the next recovery or backfill op
976 for SSDs.
977
978 :Type: Float
979 :Default: ``0``
980
981
982 ``osd_recovery_sleep_hybrid``
983
984 :Description: Time in seconds to sleep before the next recovery or backfill op
985 when OSD data is on HDD and OSD journal / WAL+DB is on SSD.
986
987 :Type: Float
988 :Default: ``0.025``
989
990
991 ``osd_recovery_priority``
992
:Description: The default priority set for the recovery work queue. Not
              related to a pool's ``recovery_priority``.
995
996 :Type: 32-bit Integer
997 :Default: ``5``
998
999
1000 Tiering
1001 =======
1002
1003 ``osd_agent_max_ops``
1004
1005 :Description: The maximum number of simultaneous flushing ops per tiering agent
1006 in the high speed mode.
1007 :Type: 32-bit Integer
1008 :Default: ``4``
1009
1010
1011 ``osd_agent_max_low_ops``
1012
1013 :Description: The maximum number of simultaneous flushing ops per tiering agent
1014 in the low speed mode.
1015 :Type: 32-bit Integer
1016 :Default: ``2``
1017
1018 See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
1019 objects within the high speed mode.
1020
1021 Miscellaneous
1022 =============
1023
1024
1025 ``osd_snap_trim_thread_timeout``
1026
1027 :Description: The maximum time in seconds before timing out a snap trim thread.
1028 :Type: 32-bit Integer
1029 :Default: ``1*60*60``
1030
1031
1032 ``osd_backlog_thread_timeout``
1033
1034 :Description: The maximum time in seconds before timing out a backlog thread.
1035 :Type: 32-bit Integer
1036 :Default: ``1*60*60``
1037
1038
1039 ``osd_default_notify_timeout``
1040
1041 :Description: The OSD default notification timeout (in seconds).
1042 :Type: 32-bit Unsigned Integer
1043 :Default: ``30``
1044
1045
1046 ``osd_check_for_log_corruption``
1047
1048 :Description: Check log files for corruption. Can be computationally expensive.
1049 :Type: Boolean
1050 :Default: ``false``
1051
1052
1053 ``osd_remove_thread_timeout``
1054
1055 :Description: The maximum time in seconds before timing out a remove OSD thread.
1056 :Type: 32-bit Integer
1057 :Default: ``60*60``
1058
1059
1060 ``osd_command_thread_timeout``
1061
1062 :Description: The maximum time in seconds before timing out a command thread.
1063 :Type: 32-bit Integer
1064 :Default: ``10*60``
1065
1066
1067 ``osd_delete_sleep``
1068
1069 :Description: Time in seconds to sleep before the next removal transaction. This
1070 throttles the PG deletion process.
1071
1072 :Type: Float
1073 :Default: ``0``
1074
1075
1076 ``osd_delete_sleep_hdd``
1077
1078 :Description: Time in seconds to sleep before the next removal transaction
1079 for HDDs.
1080
1081 :Type: Float
1082 :Default: ``5``
1083
1084
1085 ``osd_delete_sleep_ssd``
1086
1087 :Description: Time in seconds to sleep before the next removal transaction
1088 for SSDs.
1089
1090 :Type: Float
1091 :Default: ``0``
1092
1093
1094 ``osd_delete_sleep_hybrid``
1095
1096 :Description: Time in seconds to sleep before the next removal transaction
1097 when OSD data is on HDD and OSD journal or WAL+DB is on SSD.
1098
1099 :Type: Float
1100 :Default: ``1``
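
For example, a sketch that throttles PG deletion more aggressively on
HDD-backed OSDs (the value is illustrative):

.. code-block:: ini

    [osd]
    # illustrative: default is 5 seconds
    osd_delete_sleep_hdd = 10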
1101
1102
1103 ``osd_command_max_records``
1104
1105 :Description: Limits the number of lost objects to return.
1106 :Type: 32-bit Integer
1107 :Default: ``256``
1108
1109
1110 ``osd_fast_fail_on_connection_refused``
1111
1112 :Description: If this option is enabled, crashed OSDs are marked down
1113 immediately by connected peers and MONs (assuming that the
1114 crashed OSD host survives). Disable it to restore old
1115 behavior, at the expense of possible long I/O stalls when
1116 OSDs crash in the middle of I/O operations.
1117 :Type: Boolean
1118 :Default: ``true``
1119
1120
1121
1122 .. _pool: ../../operations/pools
1123 .. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
1124 .. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
1125 .. _Pool & PG Config Reference: ../pool-pg-config-ref
1126 .. _Journal Config Reference: ../journal-ref
1127 .. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio