======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd journal size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as it
             makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common
practice is to partition the journal drive (often an SSD), and mount it such
that Ceph uses the entire partition for the journal.
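
For example, a drive that sustains roughly 100 MB/s of writes with a
``filestore max sync interval`` of 5 seconds would need a journal of at least
2 * 100 MB/s * 5 s = 1000 MB. A hedged ``ceph.conf`` sketch (the numbers are
illustrative assumptions, not recommendations):

.. code-block:: ini

    [osd]
    # Assuming ~100 MB/s sustained writes and a 5 second sync interval:
    # 2 * 100 MB/s * 5 s = 1000 MB, rounded up generously here to 2 GB.
    osd journal size = 2048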


``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
drives) it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.
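
For example, a hedged ``ceph.conf`` sketch that points one OSD's journal at a
partition on a faster device (``/dev/sdb1`` is a hypothetical SSD partition
used purely for illustration):

.. code-block:: ini

    [osd.0]
    # osd data stays at the default path on the slower device, while the
    # journal lives on a hypothetical SSD partition.
    osd journal = /dev/sdb1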

The default ``osd journal size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file::

    osd journal size = 10240


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.
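
For example, a hedged ``ceph.conf`` sketch that confines scheduled scrubs to a
quiet window and keeps them away from loaded or recovering OSDs (the hours and
threshold below are illustrative choices, not recommendations):

.. code-block:: ini

    [osd]
    # Illustrative values: only schedule scrubs between midnight and 06:00,
    # skip scrubbing while the load average is above 0.5, and do not start
    # new scrubs while recovery is active.
    osd scrub begin hour = 0
    osd scrub end hour = 6
    osd scrub load threshold = 0.5
    osd scrub during recovery = false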


``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, it defines a time
              window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window as long as the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub during recovery``

:Description: Allow scrubbing during recovery. Setting this to ``false`` will
              disable scheduling new scrubs (and deep-scrubs) while there is
              active recovery. Already running scrubs will be continued. This
              might be useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimum interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during
              scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a
              single operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the whole scrub operation,
              while client operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. In effect, the default
              setting spreads the scrubs randomly over the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``
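
For example, with the default ``osd scrub min interval`` of 86400 seconds (one
day) and the default ratio of ``0.5``, the next scrub of a placement group is
scheduled at a random point between 86400 and 129600 seconds (24 to 36 hours)
after the previous one.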

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set operations priority weights between client operations and
recovery operations to ensure optimal performance during recovery.
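
For example, the default priority weights look like the following in
``ceph.conf`` (shown only to illustrate where the relative weights live;
``63`` and ``3`` are the documented defaults):

.. code-block:: ini

    [osd]
    # Client ops keep the highest priority (63) while recovery ops run at a
    # much lower relative priority (3); both are the default values.
    osd client op priority = 63
    osd recovery op priority = 3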


``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``


``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system which, when there are sufficient
              tokens, will dequeue high priority queues first. If there are not
              enough tokens available, queues are dequeued low priority to high
              priority. The WeightedPriorityQueue (``wpq``) dequeues all
              priorities in relation to their priorities to prevent starvation
              of any queue. WPQ should help in cases where a few OSDs are more
              overloaded than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              And, the mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``
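
For example, a hedged ``ceph.conf`` sketch for a cluster where a few very busy
OSDs starve primary client traffic (both settings require an OSD restart to
take effect):

.. code-block:: ini

    [osd]
    # Use the WeightedPriorityQueue and keep only replication acknowledgements
    # (and higher) in the strict queue, as suggested above.
    osd op queue = wpq
    osd op queue cut off = high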


``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``

``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread. Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: Only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see the osd priority options instead.
:Type: String
:Default: the empty string

``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (i.e. due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can have priority.
              Note: Only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7 or -1 if not to be used.
:Default: ``-1``

``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by client
- osd subop: the iops issued by primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

And the resources are partitioned using the following three sets of tags. In
other words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with "cost". And the resources allocated
for serving various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resources it is
guaranteed to possess, as long as it requires them. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that the recovery won't get more than 5
requests per second serviced, even if it requests more (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
neither will they exhaust all the I/O resources. 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. And in the meantime, the client ops can enjoy a larger
portion of the I/O resource, because its weight is "9", while its
competitor's is "1". In the case of client ops, it is not clamped by the
limit setting, so it can make use of all the resources if there is no
recovery ongoing.
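
As a hedged sketch, the (reservation, limit, weight) example above would map
onto the mclock options documented later in this section roughly as follows
(illustrative values only; remember that the current implementation does not
enforce the limit values):

.. code-block:: ini

    [osd]
    # Switch to the experimental class-based mClock queue (requires a restart).
    osd op queue = mclock_opclass

    # recovery: (r:1, l:5, w:1)
    osd op queue mclock recov res = 1.0
    osd op queue mclock recov lim = 5.0
    osd op queue mclock recov wgt = 1.0

    # client ops: (r:2, l:0, w:9)
    osd op queue mclock client op res = 2.0
    osd op queue mclock client op lim = 0.0
    osd op queue mclock client op wgt = 9.0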

Along with *mclock_opclass* another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However that will only happen once the reservations are met and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.
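
A hedged sketch of how these knobs might be combined to give mClock more
influence; the values below are purely illustrative assumptions and should be
validated against your own hardware and workload:

.. code-block:: ini

    [osd]
    # Fewer shards concentrate requests into fewer mClock queues.
    osd op num shards = 2
    # Smaller BlueStore throttles keep fewer operations queued in the
    # operation sequencer, where mClock no longer has control over them.
    bluestore throttle bytes = 33554432
    bluestore throttle deferred bytes = 67108864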

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: the overhead for serving a push op

:Type: Unsigned Integer
:Default: 1000

``osd recovery max chunk``

:Description: the maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: the reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: the weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: the limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: the reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: the weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: the limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: the reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: the weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: the limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: the reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: the weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: the limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: the reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: the weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: the limit of scrub jobs.

:Type: Float
:Default: ``0.001``

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add or remove Ceph OSD Daemons to or from a cluster, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating
placement groups and the objects they contain can reduce the cluster's
operational performance considerably. To maintain operational performance,
Ceph performs this migration with 'backfilling', which allows Ceph to set
backfill operations to a lower priority than requests to read or write data.
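
For example, a hedged ``ceph.conf`` sketch that keeps backfill gentle on
client I/O; the values shown are simply the documented defaults, and raising
``osd max backfills`` trades client latency for faster rebalancing:

.. code-block:: ini

    [osd]
    # At most one concurrent backfill to or from each OSD (the default).
    osd max backfills = 1
    # Wait ten seconds before retrying a backfill request (the default).
    osd backfill retry interval = 10.0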


``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``500``


``osd map cache bl size``

:Description: The size of the in-memory OSD map cache in OSD daemons.
:Type: 32-bit Integer
:Default: ``50``


``osd map cache bl inc size``

:Description: The size of the in-memory OSD map cache incrementals in
              OSD daemons.

:Type: 32-bit Integer
:Default: ``100``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``100``


.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads and object chunk sizes, which allows
Ceph to perform well in a degraded state.
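
For example, a hedged ``ceph.conf`` sketch that throttles recovery further on
an HDD-backed cluster where client latency matters more than recovery speed
(illustrative values; the documented defaults are ``3`` and ``0.1``):

.. code-block:: ini

    [osd]
    # Fewer concurrent recovery requests per OSD and a longer pause between
    # recovery ops on HDDs than the defaults.
    osd recovery max active = 1
    osd recovery sleep hdd = 0.2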


``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer
:Default: ``3``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operations while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when OSD data is on HDD and the OSD journal is on SSD.

:Type: Float
:Default: ``0.025``

Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd auto upgrade tmap``

:Description: Uses ``tmap`` for ``omap`` on old objects.
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio