======================
OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

  osd.0
  osd.1
  osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd journal size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as it
             makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common
practice is to partition the journal drive (often an SSD), and mount it such
that Ceph uses the entire partition for the journal.

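
As a worked example of that sizing rule: a drive that sustains roughly 100 MB/s,
combined with a ``filestore max sync interval`` of 5 seconds, needs a journal of
at least 2 \* 100 MB/s \* 5 s = 1000 MB. The figures below are assumptions for
illustration only, not recommendations:

.. code-block:: ini

    [osd]
    # Assuming a drive that sustains ~100 MB/s and a
    # filestore max sync interval of 5 seconds:
    # 2 * 100 MB/s * 5 s = 1000 MB
    osd journal size = 1000
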

``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd max object size``

:Description: The maximum size of a RADOS object in bytes.
:Type: 32-bit Unsigned Integer
:Default: 128MB


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

  osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

  osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

  /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be on the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSDs, NVMe) with slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

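
As an illustration of the second layout, the sketch below points one OSD's
journal at a partition on a faster device while the data volume stays on the
slower drive; the OSD ID and device path are hypothetical placeholders:

.. code-block:: ini

    [osd.0]
    # Hypothetical example: journal on a partition of a separate SSD.
    osd journal = /dev/disk/by-partlabel/osd-0-journal
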

The default ``osd journal size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file::

  osd journal size = 10240


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, it defines a time
              window in which scrubs can happen. However, a scrub will be
              performed whether the time window allows it or not, as long as the
              placement group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub begin week day``

:Description: This restricts scrubbing to this day of the week or later.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``0``


``osd scrub end week day``

:Description: This restricts scrubbing to days of the week earlier than this.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``7``

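
As an example of combining these four options, the sketch below confines
scheduled scrubs to early mornings on weekdays; the specific hours and days are
illustrative only, and scrubs whose interval has already exceeded
``osd scrub max interval`` will still run outside this window:

.. code-block:: ini

    [osd]
    # Illustrative values: only begin scheduled scrubs between
    # 01:00 and 07:00, Monday through Friday.
    osd scrub begin hour = 1
    osd scrub end hour = 7
    osd scrub begin week day = 1
    osd scrub end week day = 6
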

``osd scrub during recovery``

:Description: Allow scrub during recovery. Setting this to ``false`` will disable
              scheduling new scrubs (and deep-scrubs) while there is active
              recovery. Scrubs that are already running will continue. This might
              be useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``false``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The normalized maximum load. Ceph will not scrub when the system
              load (as defined by ``getloadavg() / number of online cpus``) is
              higher than this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimum interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimum number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks. Increasing
              this value will slow down the whole scrub operation, while client
              operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              practically spreads the scrubs out randomly over the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

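
For instance, a cluster that should scrub more often than the defaults might
shorten the intervals as in this sketch; the values (in seconds) are
illustrative, not recommendations:

.. code-block:: ini

    [osd]
    # Illustrative values only, expressed in seconds.
    osd scrub min interval = 43200    # scrub when load is low and the last
                                      # scrub is older than 12 hours
    osd scrub max interval = 259200   # scrub regardless of load after 3 days
    osd deep scrub interval = 604800  # deep scrub once a week
    osd scrub interval randomize ratio = 0.5
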

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


``osd scrub auto repair``

:Description: Setting this to ``true`` will enable automatic pg repair when
              errors are found in scrub or deep-scrub. However, if more than
              ``osd scrub auto repair num errors`` errors are found, a repair is
              NOT performed.
:Type: Boolean
:Default: ``false``


``osd scrub auto repair num errors``

:Description: Auto repair will not occur if more than this many errors are found.
:Type: 32-bit Integer
:Default: ``5``


.. index:: OSD; operations settings

Operations
==========

``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system which, when there are sufficient tokens,
              will dequeue high priority queues first. If there are not enough
              tokens available, queues are dequeued low priority to high priority.
              The WeightedPriorityQueue (``wpq``) dequeues all priorities in
              relation to their priorities to prevent starvation of any queue.
              WPQ should help in cases where a few OSDs are more overloaded
              than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              The mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``wpq``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgment ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. Without these settings,
              OSDs that are very busy handling replication traffic could starve
              primary client traffic on these OSDs. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``high``

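
The sketch below shows these two options together in ``ceph.conf``; the values
happen to match the documented defaults, and because both options require a
restart they only take effect once the OSD daemons have been restarted:

.. code-block:: ini

    [osd]
    # These match the documented defaults; changes take effect
    # only after the OSD daemons are restarted.
    osd op queue = wpq
    osd op queue cut off = high
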

``osd client op priority``

:Description: The priority set for client operations.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations, if not specified by the
              pool's ``recovery_op_priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The default priority set for a scheduled scrub work queue when the
              pool doesn't specify a value of ``scrub_priority``. This can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd requested scrub priority``

:Description: The priority set for user requested scrub on the work queue. If
              this value is smaller than ``osd client op priority``, it can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``120``


``osd snap trim priority``

:Description: The priority set for the snap trim work queue.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63

``osd snap trim sleep``

:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.

:Type: Float
:Default: ``0``


``osd snap trim sleep hdd``

:Description: Time in seconds to sleep before the next snap trim op
              for HDDs.

:Type: Float
:Default: ``5``


``osd snap trim sleep ssd``

:Description: Time in seconds to sleep before the next snap trim op
              for SSDs.

:Type: Float
:Default: ``0``


``osd snap trim sleep hybrid``

:Description: Time in seconds to sleep before the next snap trim op
              when osd data is on an HDD and the osd journal is on an SSD.

:Type: Float
:Default: ``2``

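
For example, to further reduce the impact of snap trimming on HDD-backed OSDs,
one might raise the HDD-specific sleep; the values below are illustrative, not
recommendations:

.. code-block:: ini

    [osd]
    # Illustrative values, not recommendations.
    osd snap trim sleep hdd = 10   # throttle snap trimming on HDD-backed OSDs
    osd snap trim sleep ssd = 0    # leave SSD-backed OSDs unthrottled
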

``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds has elapsed.

:Type: Float
:Default: ``30``


``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by client
- osd subop: the iops issued by primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity or the
   system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resources it is
guaranteed to possess, as long as it requires them. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests
per second serviced, even if it requests more and no other services are
competing with it (see CURRENT IMPLEMENTATION NOTE below). But if the
clients start to issue a large amount of I/O requests, they will not
exhaust all the I/O resources either. 1 request per second is always
allocated for recovery jobs as long as there are any such requests. So
the recovery jobs won't be starved even in a cluster with high load. In
the meantime, the client ops can enjoy a larger portion of the I/O
resources, because their weight is "9", while their competitor's is "1".
In the case of client ops, they are not clamped by the limit setting, so
they can make use of all the resources if there is no recovery ongoing.

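
For illustration only, the example tags above could be written with the mClock
options documented later in this section; the numbers simply restate the
(r, l, w) values from the example and are not tuning advice:

.. code-block:: ini

    [osd]
    # Recovery: (r:1, l:5, w:1)
    osd op queue mclock recov res = 1.0
    osd op queue mclock recov lim = 5.0
    osd op queue mclock recov wgt = 1.0
    # Client ops: (r:2, l:0, w:9)
    osd op queue mclock client op res = 2.0
    osd op queue mclock client op lim = 0.0
    osd op queue mclock client op wgt = 9.0
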

Along with *mclock_opclass*, another mClock operation queue named
*mclock_client* is available. It divides operations based on category,
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then requests of the latter class
should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000

``osd recovery max chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: The reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: The weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: The limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: The reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: The weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: The limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: The reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: The weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: The limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: The reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: The weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: The limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: The reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: The weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: The limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add or remove Ceph OSD Daemons to or from a cluster, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating placement
groups and the objects they contain can reduce the cluster's operational
performance considerably. To maintain operational performance, Ceph performs
this migration with 'backfilling', which allows Ceph to set backfill operations
to a lower priority than requests to read or write data.

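
For example, an operator willing to trade some client performance for faster
rebalancing might raise the limit described below; the value is purely
illustrative:

.. code-block:: ini

    [osd]
    # Illustrative: allow two concurrent backfills per OSD instead of the
    # default of one; higher values speed up rebalancing but increase load.
    osd max backfills = 2
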

``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``50``


``osd map message max``

:Description: The maximum number of map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``40``



.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

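
As a hedged sketch, the values below bias HDD-backed OSDs toward client traffic
at the expense of recovery speed; they are illustrative, not recommendations:

.. code-block:: ini

    [osd]
    # Illustrative values for a cluster where client latency matters
    # more than recovery speed.
    osd recovery max active hdd = 1
    osd recovery sleep hdd = 0.2
    osd recovery op priority = 1
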

``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

              This value is only used if it is non-zero. Normally it
              is ``0``, which means that the ``hdd`` or ``ssd`` values
              (below) are used, depending on the type of the primary
              device backing the OSD.

:Type: 32-bit Integer
:Default: ``0``

``osd recovery max active hdd``

:Description: The number of active recovery requests per OSD at one time, if the
              primary device is rotational.

:Type: 32-bit Integer
:Default: ``3``

``osd recovery max active ssd``

:Description: The number of active recovery requests per OSD at one time, if the
              primary device is non-rotational (i.e., an SSD).

:Type: 32-bit Integer
:Default: ``10``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operations, while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when osd data is on an HDD and the osd journal is on an SSD.

:Type: Float
:Default: ``0.025``


``osd recovery priority``

:Description: The default priority set for the recovery work queue. Not
              related to a pool's ``recovery_priority``.

:Type: 32-bit Integer
:Default: ``5``


Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd delete sleep``

:Description: Time in seconds to sleep before the next removal transaction. This
              helps to throttle the PG deletion process.

:Type: Float
:Default: ``0``


``osd delete sleep hdd``

:Description: Time in seconds to sleep before the next removal transaction
              for HDDs.

:Type: Float
:Default: ``5``


``osd delete sleep ssd``

:Description: Time in seconds to sleep before the next removal transaction
              for SSDs.

:Type: Float
:Default: ``0``


``osd delete sleep hybrid``

:Description: Time in seconds to sleep before the next removal transaction
              when osd data is on an HDD and the osd journal is on an SSD.

:Type: Float
:Default: ``2``

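
For example, PG deletion on rotational devices could be throttled further; the
value below is illustrative only:

.. code-block:: ini

    [osd]
    # Illustrative: sleep 10 seconds between removal transactions on HDDs.
    osd delete sleep hdd = 10
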

``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore the old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio