.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

	osd.0
	osd.1
	osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter it in an OSD-specific section of your configuration
file. For example::

	[osd]
		osd journal size = 1024

	[osd.1]
		host = osd-host-a

	[osd.2]
		host = osd-host-b

.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically. We **DO NOT** recommend changing the default paths for data or
journals, as it makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common
practice is to partition the journal drive (often an SSD), and mount it such
that Ceph uses the entire partition for the journal.

``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.

``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Default: ``/var/lib/ceph/osd/$cluster-$id``

``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.
:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

	``osd mkfs options xfs = -f -d agcount=24``

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.
:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

	``osd mount options xfs = rw, noatime, inode64, logbufs=8``

.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

	/var/lib/ceph/osd/$cluster-$id/journal

Without performance optimization, Ceph stores the journal on the same disk as
the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
a separate disk to store journal data (e.g., a solid state drive delivers high
performance journaling).

Ceph's default ``osd journal size`` is 0, so you will need to set this in your
``ceph.conf`` file. To size the journal, take the product of ``filestore
max sync interval`` and the expected throughput, and multiply the product by
two (2)::

	osd journal size = {2 * (expected throughput * filestore max sync interval)}

The expected throughput number should include the expected disk throughput
(i.e., sustained data transfer rate), and network throughput. For example,
a 7200 RPM disk will likely have a throughput of approximately 100 MB/s. Taking
the ``min()`` of the disk and network throughput should provide a reasonable
expected throughput. Some users just start off with a 10GB journal size. For
example::

	osd journal size = 10000

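As a worked example (the throughput and interval values here are illustrative
assumptions, not recommendations): with a data disk sustaining roughly 100 MB/s,
a network that is not the bottleneck, and the default ``filestore max sync
interval`` of 5 seconds, the formula yields 2 * 100 MB/s * 5 s = 1000 MB::

	osd journal size = 1000
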
``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``

``osd journal size``

:Description: The size of the journal in megabytes. If this is 0, and the
              journal is a block device, the entire block device is used.
              Since v0.54, this is ignored if the journal is a block device,
              and the entire block device is used.

:Type: 32-bit Integer
:Recommended: Begin with 1GB. Should be at least twice the product of the
              expected speed and ``filestore max sync interval``.

See `Journal Config Reference`_ for additional details.

Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.

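For example, a minimal sketch (the values are illustrative assumptions) of
lengthening the heartbeat settings on a high-latency network; the settings
themselves are described in `Configuring Monitor/OSD Interaction`_::

	[osd]
		osd heartbeat interval = 12
		osd heartbeat grace = 40
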
.. index:: OSD; data placement

Data Placement
==============

See `Pool & PG Config Reference`_ for details.

.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.

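For example, a sketch (with illustrative values) that confines scheduled scrubs
to a nighttime window and to periods of low load, using the settings described
below::

	[osd]
		osd scrub begin hour = 1
		osd scrub end hour = 7
		osd scrub load threshold = 0.5
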
``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.
:Type: 32-bit Integer
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, they define a time
              window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window if the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``

``osd scrub during recovery``

:Description: Allow scrubs during recovery. Setting this to ``false`` will disable
              scheduling new scrubs (and deep scrubs) while there is active
              recovery. Already running scrubs will continue. This might be
              useful to reduce load on busy clusters.

``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.
:Type: 32-bit Integer

``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.
:Type: Float
:Default: ``0.5``

``osd scrub min interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.
:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.
:Type: Float
:Default: Once per week. ``7*60*60*24``

``osd scrub chunk min``

:Description: The minimal number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during scrub.
:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.
:Type: 32-bit Integer
:Default: 25

``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks. Increasing
              this value will slow down the whole scrub operation while client
              operations will be less impacted.
:Type: Float
:Default: 0

``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.
:Type: Float
:Default: Once per week. ``60*60*24*7``

``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              practically randomly spreads the scrubs out in the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

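For example, with the default ``osd scrub min interval`` of 24 hours and the
default ratio of ``0.5``, the next scrub for a placement group is scheduled at
a random point between 24 hours (24 + 0) and 36 hours (24 + 24 * 0.5) after the
previous one.
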
``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``

.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. Setting ``osd op threads`` to ``0`` disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set operations priority weights between client operations and
recovery operations to ensure optimal performance during recovery.

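For example, a sketch (with illustrative values) that raises the operation
thread count and states the default complaint time explicitly, using the
settings described below::

	[osd]
		osd op threads = 4
		osd op complaint time = 30
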
``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``

``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``) uses a
              token bucket system which, when there are sufficient tokens, will
              dequeue high priority queues first. If there are not enough
              tokens available, queues are dequeued from low priority to high priority.
              The newer WeightedPriorityQueue (``wpq``) dequeues all priorities in
              relation to their priorities to prevent starvation of any queue.
              WPQ should help in cases where a few OSDs are more overloaded
              than others. Requires a restart.

:Valid Choices: prio, wpq
:Default: ``prio``

``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Valid Choices: low, high
:Default: ``low``

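For example, a sketch combining the two queue settings as suggested above for
clusters where a few OSDs are overloaded (both changes require an OSD restart
to take effect)::

	[osd]
		osd op queue = wpq
		osd op queue cut off = high
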
``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``


``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``

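For example, a sketch (with illustrative values) that favors client traffic by
keeping client priority at its maximum while lowering recovery priority below
its usual default::

	[osd]
		osd client op priority = 63
		osd recovery op priority = 1
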
``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``30``


``osd op complaint time``

:Description: An operation becomes complaint-worthy after the specified number
              of seconds have elapsed.
:Type: Float
:Default: ``30``

``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``

``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non-default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread. Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: Only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see the osd priority options instead.
:Type: String
:Default: the empty string

``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non-default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread, ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (i.e., due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can have priority.
              Note: Only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7 or -1 if not to be used.
:Default: ``-1``

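For example, a sketch that deprioritizes the disk thread of one busy OSD so
that scrubbing yields to client I/O (only effective with the Linux Kernel CFQ
scheduler, and only when both settings are non-default)::

	[osd.0]
		osd disk thread ioprio class = idle
		osd disk thread ioprio priority = 7
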
``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating placement
groups and the objects they contain can reduce the cluster's operational
performance considerably. To maintain operational performance, Ceph performs
this migration with 'backfilling', which allows Ceph to set backfill operations
to a lower priority than requests to read or write data.

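For example, a sketch (with an illustrative value) that limits each OSD to a
single concurrent backfill to keep rebalancing gentle::

	[osd]
		osd max backfills = 1
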
``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

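For example, a sketch (with an illustrative value) that enlarges the OSD map
cache on a cluster with a long map history, using a setting described below::

	[osd]
		osd map cache size = 1024
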
``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer


``osd map cache bl size``

:Description: The size of the in-memory OSD map cache in OSD daemons.
:Type: 32-bit Integer


``osd map cache bl inc size``

:Description: The size of the in-memory OSD map cache incrementals in
              OSD daemons.
:Type: 32-bit Integer


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer

.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads and object chunk sizes which allows
Ceph to perform well in a degraded state.

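For example, a sketch (with illustrative values) that throttles recovery on a
busy cluster by limiting active recovery requests and inserting a short sleep
between recovery operations, using settings described below::

	[osd]
		osd recovery max active = 1
		osd recovery sleep = 0.1
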
``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``

``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer

``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``

``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time to sleep before the next recovery. Increasing this value will
              slow down the recovery operation while client operations will be
              less impacted.

:Type: Float
:Default: ``0``

.. index:: OSD; tiering agent

Tiering Agent
=============

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

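For example, a sketch (with illustrative values) that halves the flushing
concurrency of the tiering agent in both modes::

	[osd]
		osd agent max ops = 2
		osd agent max low ops = 1
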
Miscellaneous
=============

``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``

``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``

``osd auto upgrade tmap``

:Description: Uses ``tmap`` for ``omap`` on old objects.
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``

``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio