1 ======================
2 OSD Config Reference
3 ======================
4
5 .. index:: OSD; configuration
6
You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent
releases, the central config store), but Ceph OSD Daemons can run with default
values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``osd_journal_size`` (for Filestore) and ``host``, and uses
default values for nearly everything else.
12
Ceph OSD Daemons are identified numerically, in incremental fashion beginning
with ``0``, using the following convention::
15
16 osd.0
17 osd.1
18 osd.2
19
20 In a configuration file, you may specify settings for all Ceph OSD Daemons in
21 the cluster by adding configuration settings to the ``[osd]`` section of your
22 configuration file. To add settings directly to a specific Ceph OSD Daemon
23 (e.g., ``host``), enter it in an OSD-specific section of your configuration
24 file. For example:
25
26 .. code-block:: ini
27
28 [osd]
29 osd_journal_size = 5120
30
31 [osd.0]
32 host = osd-host-a
33
34 [osd.1]
35 host = osd-host-b
36
37
38 .. index:: OSD; config settings
39
40 General Settings
41 ================
42
43 The following settings provide a Ceph OSD Daemon's ID, and determine paths to
44 data and journals. Ceph deployment scripts typically generate the UUID
45 automatically.
46
.. warning:: **DO NOT** change the default paths for data or journals, as
   doing so makes it more difficult to troubleshoot Ceph later.
49
When using Filestore, the journal size should be at least twice the expected
drive speed multiplied by ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD), and mount it
such that Ceph uses the entire partition for the journal.
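
As a minimal sketch of that calculation, assume a drive that sustains roughly
100 MB/s and a ``filestore_max_sync_interval`` of 5 seconds (both values are
illustrative):

.. code-block:: ini

    [osd]
    # illustrative minimum: 2 x 100 MB/s x 5 s = 1000 MB, rounded up
    osd_journal_size = 1024

In practice, the 5120 MB default or a dedicated journal partition is usually
more comfortable than this calculated minimum.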
54
55
56 ``osd_uuid``
57
58 :Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
59 :Type: UUID
60 :Default: The UUID.
61 :Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
62 applies to the entire cluster.
63
64
65 ``osd_data``
66
:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.
70
71 :Type: String
72 :Default: ``/var/lib/ceph/osd/$cluster-$id``
73
74
75 ``osd_max_write_size``
76
77 :Description: The maximum size of a write in megabytes.
78 :Type: 32-bit Integer
79 :Default: ``90``
80
81
82 ``osd_max_object_size``
83
84 :Description: The maximum size of a RADOS object in bytes.
85 :Type: 32-bit Unsigned Integer
86 :Default: 128MB
87
88
89 ``osd_client_message_size_cap``
90
91 :Description: The largest client data message allowed in memory.
92 :Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``
94
95
96 ``osd_class_dir``
97
98 :Description: The class path for RADOS class plug-ins.
99 :Type: String
100 :Default: ``$libdir/rados-classes``
101
102
103 .. index:: OSD; file system
104
105 File System Settings
106 ====================
107 Ceph builds and mounts file systems which are used for Ceph OSDs.
108
109 ``osd_mkfs_options {fs-type}``
110
111 :Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.
112
113 :Type: String
114 :Default for xfs: ``-f -i 2048``
115 :Default for other file systems: {empty string}
116
For example::

    osd_mkfs_options_xfs = -f -d agcount=24
119
120 ``osd_mount_options {fs-type}``
121
122 :Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.
123
124 :Type: String
125 :Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw,noatime``
127
For example::

    osd_mount_options_xfs = rw,noatime,inode64,logbufs=8
130
131
132 .. index:: OSD; journal settings
133
134 Journal Settings
135 ================
136
This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.
139
140 By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
141 the following path, which is usually a symlink to a device or partition::
142
143 /var/lib/ceph/osd/$cluster-$id/journal
144
145 When using a single device type (for example, spinning drives), the journals
146 should be *colocated*: the logical volume (or partition) should be in the same
147 device as the ``data`` logical volume.
148
149 When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
150 drives) it makes sense to place the journal on the faster device, while
151 ``data`` occupies the slower device fully.
152
153 The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
154 larger, in which case it will need to be set in the ``ceph.conf`` file.
155 A value of 10 gigabytes is common in practice::
156
157 osd_journal_size = 10240
158
159
160 ``osd_journal``
161
162 :Description: The path to the OSD's journal. This may be a path to a file or a
163 block device (such as a partition of an SSD). If it is a file,
164 you must create the directory to contain it. We recommend using a
165 separate fast device when the ``osd_data`` drive is an HDD.
166
167 :Type: String
168 :Default: ``/var/lib/ceph/osd/$cluster-$id/journal``
169
170
171 ``osd_journal_size``
172
173 :Description: The size of the journal in megabytes.
174
175 :Type: 32-bit Integer
176 :Default: ``5120``
177
178
179 See `Journal Config Reference`_ for additional details.
180
181
182 Monitor OSD Interaction
183 =======================
184
185 Ceph OSD Daemons check each other's heartbeats and report to monitors
186 periodically. Ceph can use default values in many cases. However, if your
187 network has latency issues, you may need to adopt longer intervals. See
188 `Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
189
190
191 Data Placement
192 ==============
193
194 See `Pool & PG Config Reference`_ for details.
195
196
197 .. index:: OSD; scrubbing
198
199 Scrubbing
200 =========
201
202 In addition to making multiple copies of objects, Ceph ensures data integrity by
203 scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
204 object storage layer. For each placement group, Ceph generates a catalog of all
205 objects and compares each primary object and its replicas to ensure that no
206 objects are missing or mismatched. Light scrubbing (daily) checks the object
207 size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
208 to ensure data integrity.
209
210 Scrubbing is important for maintaining data integrity, but it can reduce
211 performance. You can adjust the following settings to increase or decrease
212 scrubbing operations.
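
For example, one possible starting point for reducing scrub impact, using the
settings documented below (the non-default value is illustrative only):

.. code-block:: ini

    [osd]
    # default: one concurrent scrub per OSD
    osd_max_scrubs = 1
    # illustrative: pause between groups of scrubbed chunks (default is 0)
    osd_scrub_sleep = 0.1
    # default: skip new scrubs when the normalized load is above this value
    osd_scrub_load_threshold = 0.5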
213
214
215 ``osd_max_scrubs``
216
217 :Description: The maximum number of simultaneous scrub operations for
218 a Ceph OSD Daemon.
219
:Type: 32-bit Integer
221 :Default: ``1``
222
223 ``osd_scrub_begin_hour``
224
:Description: This restricts scrubbing to this hour of the day or later.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing the entire day. Together with
              ``osd_scrub_end_hour``, this defines a time window in which
              scrubs can happen. However, a scrub will be performed regardless
              of the time window whenever the placement group's scrub interval
              exceeds ``osd_scrub_max_interval``.
232 :Type: Integer in the range of 0 to 23
233 :Default: ``0``
234
235
236 ``osd_scrub_end_hour``
237
:Description: This restricts scrubbing to hours of the day earlier than this.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing for the entire day. Together with
              ``osd_scrub_begin_hour``, this defines a time window in which
              scrubs can happen. However, a scrub will be performed regardless
              of the time window whenever the placement group's scrub interval
              exceeds ``osd_scrub_max_interval``.
244 :Type: Integer in the range of 0 to 23
245 :Default: ``0``
246
247
248 ``osd_scrub_begin_week_day``
249
:Description: This restricts scrubbing to this day of the week or later.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_end_week_day``, this
              defines a time window in which scrubs can happen. However, a
              scrub will be performed regardless of the time window whenever
              the PG's scrub interval exceeds ``osd_scrub_max_interval``.
257 :Type: Integer in the range of 0 to 6
258 :Default: ``0``
259
260
261 ``osd_scrub_end_week_day``
262
:Description: This restricts scrubbing to days of the week earlier than this.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_begin_week_day``, this
              defines a time window in which scrubs can happen. However, a
              scrub will be performed regardless of the time window whenever
              the placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
270 :Type: Integer in the range of 0 to 6
271 :Default: ``0``
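
For example, the following sketch restricts scheduled scrubs to the hours
between midnight and 06:00 on any day of the week (the hours chosen are
illustrative):

.. code-block:: ini

    [osd]
    # illustrative: allow scrubs only from 00:00 to before 06:00
    osd_scrub_begin_hour = 0
    osd_scrub_end_hour = 6
    # defaults: 0 and 0 leave all days of the week allowed
    osd_scrub_begin_week_day = 0
    osd_scrub_end_week_day = 0

Remember that a PG whose scrub interval exceeds ``osd_scrub_max_interval`` is
scrubbed regardless of this window.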
272
273
``osd_scrub_during_recovery``

:Description: Allow scrubs during recovery. Setting this to ``false`` will
              disable the scheduling of new scrubs (and deep scrubs) while
              there is active recovery. Scrubs that are already running will
              continue. This might be useful to reduce load on busy clusters.
280 :Type: Boolean
281 :Default: ``false``
282
283
284 ``osd_scrub_thread_timeout``
285
286 :Description: The maximum time in seconds before timing out a scrub thread.
287 :Type: 32-bit Integer
288 :Default: ``60``
289
290
291 ``osd_scrub_finalize_thread_timeout``
292
293 :Description: The maximum time in seconds before timing out a scrub finalize
294 thread.
295
296 :Type: 32-bit Integer
297 :Default: ``10*60``
298
299
300 ``osd_scrub_load_threshold``
301
:Description: The normalized maximum load. Ceph will not scrub when the
              system load (as defined by ``getloadavg() / number of online
              CPUs``) is higher than this number.
305
306 :Type: Float
307 :Default: ``0.5``
308
309
310 ``osd_scrub_min_interval``
311
312 :Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
313 when the Ceph Storage Cluster load is low.
314
315 :Type: Float
316 :Default: Once per day. ``24*60*60``
317
318 .. _osd_scrub_max_interval:
319
320 ``osd_scrub_max_interval``
321
322 :Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
323 irrespective of cluster load.
324
325 :Type: Float
326 :Default: Once per week. ``7*24*60*60``
327
328
329 ``osd_scrub_chunk_min``
330
:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during a
              scrub.
333
334 :Type: 32-bit Integer
335 :Default: 5
336
337
338 ``osd_scrub_chunk_max``
339
:Description: The maximum number of object store chunks to scrub during a
              single operation.
341
342 :Type: 32-bit Integer
343 :Default: 25
344
345
346 ``osd_scrub_sleep``
347
348 :Description: Time to sleep before scrubbing the next group of chunks. Increasing this value will slow
349 down the overall rate of scrubbing so that client operations will be less impacted.
350
351 :Type: Float
352 :Default: 0
353
354
355 ``osd_deep_scrub_interval``
356
357 :Description: The interval for "deep" scrubbing (fully reading all data). The
358 ``osd_scrub_load_threshold`` does not affect this setting.
359
360 :Type: Float
361 :Default: Once per week. ``7*24*60*60``
362
363
364 ``osd_scrub_interval_randomize_ratio``
365
:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling
              the next scrub job for a PG. The delay is a random
              value less than ``osd_scrub_min_interval`` \*
              ``osd_scrub_interval_randomize_ratio``. The default setting
              spreads scrubs throughout the allowed time
              window of ``[1, 1.5]`` \* ``osd_scrub_min_interval``.
372 :Type: Float
373 :Default: ``0.5``
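
As a worked example (both values shown are the defaults): with
``osd_scrub_min_interval`` at one day and the ratio at ``0.5``, the next scrub
for a PG is scheduled at a random point between 24 and 36 hours after the
previous one:

.. code-block:: ini

    [osd]
    # default: 24 hours
    osd_scrub_min_interval = 86400
    # default: spreads the next scrub across the 24-36 hour window
    osd_scrub_interval_randomize_ratio = 0.5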
374
375 ``osd_deep_scrub_stride``
376
377 :Description: Read size when doing a deep scrub.
378 :Type: 32-bit Integer
379 :Default: 512 KB. ``524288``
380
381
382 ``osd_scrub_auto_repair``
383
:Description: Setting this to ``true`` will enable automatic PG repair when
              errors are found by scrubs or deep scrubs. However, if more than
              ``osd_scrub_auto_repair_num_errors`` errors are found, a repair
              is NOT performed.
387 :Type: Boolean
388 :Default: ``false``
389
390
391 ``osd_scrub_auto_repair_num_errors``
392
393 :Description: Auto repair will not occur if more than this many errors are found.
394 :Type: 32-bit Integer
395 :Default: ``5``
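
For example, a sketch that enables automatic repair while keeping the default
error cutoff (enabling auto repair is the only non-default value here):

.. code-block:: ini

    [osd]
    # illustrative: automatically repair PGs after scrub errors
    osd_scrub_auto_repair = true
    # default: skip auto repair when more errors than this are found
    osd_scrub_auto_repair_num_errors = 5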
396
397
398 .. index:: OSD; operations settings
399
400 Operations
401 ==========
402
403 ``osd_op_queue``
404
405 :Description: This sets the type of queue to be used for prioritizing ops
406 within each OSD. Both queues feature a strict sub-queue which is
407 dequeued before the normal queue. The normal queue is different
408 between implementations. The WeightedPriorityQueue (``wpq``)
409 dequeues operations in relation to their priorities to prevent
410 starvation of any queue. WPQ should help in cases where a few OSDs
411 are more overloaded than others. The new mClockQueue
412 (``mclock_scheduler``) prioritizes operations based on which class
413 they belong to (recovery, scrub, snaptrim, client op, osd subop).
414 See `QoS Based on mClock`_. Requires a restart.
415
416 :Type: String
417 :Valid Choices: wpq, mclock_scheduler
418 :Default: ``wpq``
419
420
421 ``osd_op_queue_cut_off``
422
:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the
              ``high`` option sends only replication acknowledgment ops and
              higher to the strict queue. Setting this to ``high`` should
              help when a few OSDs in the cluster are very busy, especially
              when combined with ``wpq`` in the ``osd_op_queue`` setting.
              OSDs that are very busy handling replication traffic could
              starve primary client traffic on these OSDs without these
              settings. Requires a restart.
432
433 :Type: String
434 :Valid Choices: low, high
435 :Default: ``high``
436
437
438 ``osd_client_op_priority``
439
440 :Description: The priority set for client operations. This value is relative
441 to that of ``osd_recovery_op_priority`` below. The default
442 strongly favors client ops over recovery.
443
444 :Type: 32-bit Integer
445 :Default: ``63``
446 :Valid Range: 1-63
447
448
449 ``osd_recovery_op_priority``
450
451 :Description: The priority of recovery operations vs client operations, if not specified by the
452 pool's ``recovery_op_priority``. The default value prioritizes client
453 ops (see above) over recovery ops. You may adjust the tradeoff of client
454 impact against the time to restore cluster health by lowering this value
455 for increased prioritization of client ops, or by increasing it to favor
456 recovery.
457
458 :Type: 32-bit Integer
459 :Default: ``3``
460 :Valid Range: 1-63
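
For example, a sketch that keeps client traffic strongly favored while further
de-prioritizing recovery (the recovery value shown is illustrative):

.. code-block:: ini

    [osd]
    # default: highest priority for client ops
    osd_client_op_priority = 63
    # illustrative: favor client ops over recovery even more (default is 3)
    osd_recovery_op_priority = 1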
461
462
463 ``osd_scrub_priority``
464
465 :Description: The default work queue priority for scheduled scrubs when the
466 pool doesn't specify a value of ``scrub_priority``. This can be
467 boosted to the value of ``osd_client_op_priority`` when scrubs are
468 blocking client operations.
469
470 :Type: 32-bit Integer
471 :Default: ``5``
472 :Valid Range: 1-63
473
474
475 ``osd_requested_scrub_priority``
476
:Description: The priority set for user-requested scrubs on the work queue.
              If this value is smaller than ``osd_client_op_priority``, it can
              be boosted to the value of ``osd_client_op_priority`` when a
              scrub is blocking client operations.
481
482 :Type: 32-bit Integer
483 :Default: ``120``
484
485
486 ``osd_snap_trim_priority``
487
488 :Description: The priority set for the snap trim work queue.
489
490 :Type: 32-bit Integer
491 :Default: ``5``
492 :Valid Range: 1-63
493
494 ``osd_snap_trim_sleep``
495
:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.
499
500 :Type: Float
501 :Default: ``0``
502
503
504 ``osd_snap_trim_sleep_hdd``
505
506 :Description: Time in seconds to sleep before next snap trim op
507 for HDDs.
508
509 :Type: Float
510 :Default: ``5``
511
512
513 ``osd_snap_trim_sleep_ssd``
514
515 :Description: Time in seconds to sleep before next snap trim op
516 for SSD OSDs (including NVMe).
517
518 :Type: Float
519 :Default: ``0``
520
521
522 ``osd_snap_trim_sleep_hybrid``
523
524 :Description: Time in seconds to sleep before next snap trim op
525 when OSD data is on an HDD and the OSD journal or WAL+DB is on an SSD.
526
527 :Type: Float
528 :Default: ``2``
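
For example, a sketch that slows snap trimming on HDD-backed OSDs a little
more than the default (the value is illustrative):

.. code-block:: ini

    [osd]
    # illustrative: default is 5 seconds
    osd_snap_trim_sleep_hdd = 10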
529
530 ``osd_op_thread_timeout``
531
532 :Description: The Ceph OSD Daemon operation thread timeout in seconds.
533 :Type: 32-bit Integer
534 :Default: ``15``
535
536
537 ``osd_op_complaint_time``
538
:Description: An operation becomes complaint-worthy after the specified
              number of seconds has elapsed.
541
542 :Type: Float
543 :Default: ``30``
544
545
546 ``osd_op_history_size``
547
548 :Description: The maximum number of completed operations to track.
549 :Type: 32-bit Unsigned Integer
550 :Default: ``20``
551
552
553 ``osd_op_history_duration``
554
:Description: The age, in seconds, of the oldest completed operation to track.
556 :Type: 32-bit Unsigned Integer
557 :Default: ``600``
558
559
560 ``osd_op_log_threshold``
561
:Description: How many operation logs to display at once.
563 :Type: 32-bit Integer
564 :Default: ``5``
565
566
567 .. _dmclock-qos:
568
569 QoS Based on mClock
570 -------------------
571
572 Ceph's use of mClock is currently experimental and should
573 be approached with an exploratory mindset.
574
575 Core Concepts
576 `````````````
577
578 Ceph's QoS support is implemented using a queueing scheduler
579 based on `the dmClock algorithm`_. This algorithm allocates the I/O
580 resources of the Ceph cluster in proportion to weights, and enforces
581 the constraints of minimum reservation and maximum limitation, so that
582 the services can compete for the resources fairly. Currently the
583 *mclock_scheduler* operation queue divides Ceph services involving I/O
584 resources into following buckets:
585
- client op: the IOPS issued by clients
- osd subop: the IOPS issued by the primary OSD
588 - snap trim: the snap trimming related requests
589 - pg recovery: the recovery related requests
590 - pg scrub: the scrub related requests
591
The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:
594
595 #. reservation: the minimum IOPS allocated for the service.
596 #. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity or
   the system is oversubscribed.
599
In Ceph, operations are graded with a "cost", and the resources allocated for
serving the various services are consumed by these "costs". So, for example,
the more reservation a service has, the more resources it is guaranteed to
possess, as long as it requires them. Assume there are two services, recovery
and client ops:
605
606 - recovery: (r:1, l:5, w:1)
607 - client ops: (r:2, l:0, w:9)
608
The settings above ensure that recovery won't be serviced at more than 5
requests per second, even if it asks for more and no other services are
competing with it (see the CURRENT IMPLEMENTATION NOTE below). And even if
clients start to issue a large amount of I/O requests, they will not exhaust
all the I/O resources: 1 request per second is always allocated for recovery
jobs as long as there are any such requests, so recovery jobs won't be starved
even in a cluster with high load. In the meantime, client ops can enjoy a
larger portion of the I/O resources, because their weight is "9" while their
competitor's is "1". Client ops are not clamped by the limit setting, so they
can make use of all the resources if there is no recovery ongoing.
621
622 CURRENT IMPLEMENTATION NOTE: the current experimental implementation
623 does not enforce the limit values. As a first approximation we decided
624 not to prevent operations that would otherwise enter the operation
625 sequencer from doing so.
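
As a sketch, the recovery and client-op tags from the example above map onto
the ``osd_mclock_scheduler_*`` options listed later in this section. Leaving
the client limit at its very high default effectively leaves client ops
unclamped; all values shown are illustrative:

.. code-block:: ini

    [osd]
    # requires a restart to take effect
    osd_op_queue = mclock_scheduler
    # client ops: (r:2, l:unlimited, w:9)
    osd_mclock_scheduler_client_res = 2
    osd_mclock_scheduler_client_wgt = 9
    # recovery: (r:1, l:5, w:1); note the limit is not enforced yet
    osd_mclock_scheduler_background_recovery_res = 1
    osd_mclock_scheduler_background_recovery_lim = 5
    osd_mclock_scheduler_background_recovery_wgt = 1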
626
627 Subtleties of mClock
628 ````````````````````
629
630 The reservation and limit values have a unit of requests per
631 second. The weight, however, does not technically have a unit and the
632 weights are relative to one another. So if one class of requests has a
633 weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.
637
638 Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
640 requests. If the weight is *W*, then for a given class of requests,
641 the next one that comes in will have a weight tag of *1/W* plus the
642 previous weight tag or the current time, whichever is larger. That
643 means if *W* is sufficiently large and therefore *1/W* is sufficiently
644 small, the calculated tag may never be assigned as it will get a value
645 of the current time. The ultimate lesson is that values for weight
646 should not be too large. They should be under the number of requests
one expects to be serviced each second.
648
649 Caveats
650 ```````
651
652 There are some factors that can reduce the impact of the mClock op
653 queues within Ceph. First, requests to an OSD are sharded by their
654 placement group identifier. Each shard has its own mClock queue and
655 these queues neither interact nor share information among them. The
656 number of shards can be controlled with the configuration options
657 ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
658 ``osd_op_num_shards_ssd``. A lower number of shards will increase the
659 impact of the mClock queues, but may have other deleterious effects.
660
661 Second, requests are transferred from the operation queue to the
662 operation sequencer, in which they go through the phases of
663 execution. The operation queue is where mClock resides and mClock
664 determines the next op to transfer to the operation sequencer. The
665 number of operations allowed in the operation sequencer is a complex
666 issue. In general we want to keep enough operations in the sequencer
667 so it's always getting work done on some operations while it's waiting
668 for disk and network access to complete on other operations. On the
669 other hand, once an operation is transferred to the operation
670 sequencer, mClock no longer has control over it. Therefore to maximize
671 the impact of mClock, we want to keep as few operations in the
672 operation sequencer as possible. So we have an inherent tension.
673
674 The configuration options that influence the number of operations in
675 the operation sequencer are ``bluestore_throttle_bytes``,
676 ``bluestore_throttle_deferred_bytes``,
677 ``bluestore_throttle_cost_per_io``,
678 ``bluestore_throttle_cost_per_io_hdd``, and
679 ``bluestore_throttle_cost_per_io_ssd``.
680
681 A third factor that affects the impact of the mClock algorithm is that
682 we're using a distributed system, where requests are made to multiple
683 OSDs and each OSD has (can have) multiple shards. Yet we're currently
684 using the mClock algorithm, which is not distributed (note: dmClock is
685 the distributed version of mClock).
686
687 Various organizations and individuals are currently experimenting with
688 mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
690 mClock and dmClock experiments on the ``ceph-devel`` mailing list.
691
692
693 ``osd_push_per_object_cost``
694
:Description: The overhead for serving a push op.
696
697 :Type: Unsigned Integer
698 :Default: 1000
699
700
701 ``osd_recovery_max_chunk``
702
:Description: The maximum total size of data chunks a recovery op can carry.
704
705 :Type: Unsigned Integer
706 :Default: 8 MiB
707
708
709 ``osd_mclock_scheduler_client_res``
710
711 :Description: IO proportion reserved for each client (default).
712
713 :Type: Unsigned Integer
714 :Default: 1
715
716
717 ``osd_mclock_scheduler_client_wgt``
718
719 :Description: IO share for each client (default) over reservation.
720
721 :Type: Unsigned Integer
722 :Default: 1
723
724
725 ``osd_mclock_scheduler_client_lim``
726
727 :Description: IO limit for each client (default) over reservation.
728
729 :Type: Unsigned Integer
730 :Default: 999999
731
732
733 ``osd_mclock_scheduler_background_recovery_res``
734
735 :Description: IO proportion reserved for background recovery (default).
736
737 :Type: Unsigned Integer
738 :Default: 1
739
740
741 ``osd_mclock_scheduler_background_recovery_wgt``
742
743 :Description: IO share for each background recovery over reservation.
744
745 :Type: Unsigned Integer
746 :Default: 1
747
748
749 ``osd_mclock_scheduler_background_recovery_lim``
750
751 :Description: IO limit for background recovery over reservation.
752
753 :Type: Unsigned Integer
754 :Default: 999999
755
756
757 ``osd_mclock_scheduler_background_best_effort_res``
758
759 :Description: IO proportion reserved for background best_effort (default).
760
761 :Type: Unsigned Integer
762 :Default: 1
763
764
765 ``osd_mclock_scheduler_background_best_effort_wgt``
766
767 :Description: IO share for each background best_effort over reservation.
768
769 :Type: Unsigned Integer
770 :Default: 1
771
772
773 ``osd_mclock_scheduler_background_best_effort_lim``
774
775 :Description: IO limit for background best_effort over reservation.
776
777 :Type: Unsigned Integer
778 :Default: 999999
779
780 .. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
781
782
783 .. index:: OSD; backfilling
784
785 Backfilling
786 ===========
787
When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
790 to restore balanced utilization. The process of migrating placement groups and
791 the objects they contain can reduce the cluster's operational performance
792 considerably. To maintain operational performance, Ceph performs this migration
793 with 'backfilling', which allows Ceph to set backfill operations to a lower
794 priority than requests to read or write data.
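
For example, a conservative sketch of the backfill throttles; ``osd_max_backfills``
is left at its default and the retry interval shown is illustrative:

.. code-block:: ini

    [osd]
    # default: one concurrent backfill to or from each OSD
    osd_max_backfills = 1
    # illustrative: back off longer before retrying backfill (default is 10.0)
    osd_backfill_retry_interval = 30.0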
795
796
797 ``osd_max_backfills``
798
799 :Description: The maximum number of backfills allowed to or from a single OSD.
800 Note that this is applied separately for read and write operations.
801 :Type: 64-bit Unsigned Integer
802 :Default: ``1``
803
804
805 ``osd_backfill_scan_min``
806
807 :Description: The minimum number of objects per backfill scan.
808
809 :Type: 32-bit Integer
810 :Default: ``64``
811
812
813 ``osd_backfill_scan_max``
814
815 :Description: The maximum number of objects per backfill scan.
816
817 :Type: 32-bit Integer
818 :Default: ``512``
819
820
821 ``osd_backfill_retry_interval``
822
823 :Description: The number of seconds to wait before retrying backfill requests.
824 :Type: Double
825 :Default: ``10.0``
826
827 .. index:: OSD; osdmap
828
829 OSD Map
830 =======
831
832 OSD maps reflect the OSD daemons operating in the cluster. Over time, the
833 number of map epochs increases. Ceph provides some settings to ensure that
834 Ceph performs well as the OSD map grows larger.
835
836
837 ``osd_map_dedup``
838
839 :Description: Enable removing duplicates in the OSD map.
840 :Type: Boolean
841 :Default: ``true``
842
843
844 ``osd_map_cache_size``
845
846 :Description: The number of OSD maps to keep cached.
847 :Type: 32-bit Integer
848 :Default: ``50``
849
850
851 ``osd_map_message_max``
852
853 :Description: The maximum map entries allowed per MOSDMap message.
854 :Type: 32-bit Integer
855 :Default: ``40``
856
857
858
859 .. index:: OSD; recovery
860
861 Recovery
862 ========
863
864 When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
865 begins peering with other Ceph OSD Daemons before writes can occur. See
866 `Monitoring OSDs and PGs`_ for details.
867
868 If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
869 sync with other Ceph OSD Daemons containing more recent versions of objects in
870 the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
871 mode and seeks to get the latest copy of the data and bring its map back up to
872 date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
873 and placement groups may be significantly out of date. Also, if a failure domain
874 went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
875 the same time. This can make the recovery process time consuming and resource
876 intensive.
877
To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.
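
For example, a sketch that further throttles recovery on HDD-backed OSDs to
protect client I/O (both non-default values are illustrative):

.. code-block:: ini

    [osd]
    # illustrative: default is 3 for rotational primary devices
    osd_recovery_max_active_hdd = 1
    # illustrative: default is 0.1 seconds
    osd_recovery_sleep_hdd = 0.2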
881
882
883 ``osd_recovery_delay_start``
884
885 :Description: After peering completes, Ceph will delay for the specified number
886 of seconds before starting to recover RADOS objects.
887
888 :Type: Float
889 :Default: ``0``
890
891
892 ``osd_recovery_max_active``
893
:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.
897
898 This value is only used if it is non-zero. Normally it
899 is ``0``, which means that the ``hdd`` or ``ssd`` values
900 (below) are used, depending on the type of the primary
901 device backing the OSD.
902
903 :Type: 32-bit Integer
904 :Default: ``0``
905
906 ``osd_recovery_max_active_hdd``
907
908 :Description: The number of active recovery requests per OSD at one time, if the
909 primary device is rotational.
910
911 :Type: 32-bit Integer
912 :Default: ``3``
913
914 ``osd_recovery_max_active_ssd``
915
916 :Description: The number of active recovery requests per OSD at one time, if the
917 primary device is non-rotational (i.e., an SSD).
918
919 :Type: 32-bit Integer
920 :Default: ``10``
921
922
923 ``osd_recovery_max_chunk``
924
925 :Description: The maximum size of a recovered chunk of data to push.
926 :Type: 64-bit Unsigned Integer
927 :Default: ``8 << 20``
928
929
930 ``osd_recovery_max_single_start``
931
932 :Description: The maximum number of recovery operations per OSD that will be
933 newly started when an OSD is recovering.
934 :Type: 64-bit Unsigned Integer
935 :Default: ``1``
936
937
938 ``osd_recovery_thread_timeout``
939
940 :Description: The maximum time in seconds before timing out a recovery thread.
941 :Type: 32-bit Integer
942 :Default: ``30``
943
944
945 ``osd_recover_clone_overlap``
946
947 :Description: Preserves clone overlap during recovery. Should always be set
948 to ``true``.
949
950 :Type: Boolean
951 :Default: ``true``
952
953
954 ``osd_recovery_sleep``
955
956 :Description: Time in seconds to sleep before the next recovery or backfill op.
957 Increasing this value will slow down recovery operation while
958 client operations will be less impacted.
959
960 :Type: Float
961 :Default: ``0``
962
963
964 ``osd_recovery_sleep_hdd``
965
966 :Description: Time in seconds to sleep before next recovery or backfill op
967 for HDDs.
968
969 :Type: Float
970 :Default: ``0.1``
971
972
973 ``osd_recovery_sleep_ssd``
974
975 :Description: Time in seconds to sleep before the next recovery or backfill op
976 for SSDs.
977
978 :Type: Float
979 :Default: ``0``
980
981
982 ``osd_recovery_sleep_hybrid``
983
984 :Description: Time in seconds to sleep before the next recovery or backfill op
985 when OSD data is on HDD and OSD journal / WAL+DB is on SSD.
986
987 :Type: Float
988 :Default: ``0.025``
989
990
991 ``osd_recovery_priority``
992
:Description: The default priority set for the recovery work queue. Not
              related to a pool's ``recovery_priority``.
995
996 :Type: 32-bit Integer
997 :Default: ``5``
998
999
1000 Tiering
1001 =======
1002
1003 ``osd_agent_max_ops``
1004
1005 :Description: The maximum number of simultaneous flushing ops per tiering agent
1006 in the high speed mode.
1007 :Type: 32-bit Integer
1008 :Default: ``4``
1009
1010
1011 ``osd_agent_max_low_ops``
1012
1013 :Description: The maximum number of simultaneous flushing ops per tiering agent
1014 in the low speed mode.
1015 :Type: 32-bit Integer
1016 :Default: ``2``
1017
1018 See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
1019 objects within the high speed mode.
1020
1021 Miscellaneous
1022 =============
1023
1024
1025 ``osd_snap_trim_thread_timeout``
1026
1027 :Description: The maximum time in seconds before timing out a snap trim thread.
1028 :Type: 32-bit Integer
1029 :Default: ``1*60*60``
1030
1031
1032 ``osd_backlog_thread_timeout``
1033
1034 :Description: The maximum time in seconds before timing out a backlog thread.
1035 :Type: 32-bit Integer
1036 :Default: ``1*60*60``
1037
1038
1039 ``osd_default_notify_timeout``
1040
1041 :Description: The OSD default notification timeout (in seconds).
1042 :Type: 32-bit Unsigned Integer
1043 :Default: ``30``
1044
1045
1046 ``osd_check_for_log_corruption``
1047
1048 :Description: Check log files for corruption. Can be computationally expensive.
1049 :Type: Boolean
1050 :Default: ``false``
1051
1052
1053 ``osd_remove_thread_timeout``
1054
1055 :Description: The maximum time in seconds before timing out a remove OSD thread.
1056 :Type: 32-bit Integer
1057 :Default: ``60*60``
1058
1059
1060 ``osd_command_thread_timeout``
1061
1062 :Description: The maximum time in seconds before timing out a command thread.
1063 :Type: 32-bit Integer
1064 :Default: ``10*60``
1065
1066
1067 ``osd_delete_sleep``
1068
1069 :Description: Time in seconds to sleep before the next removal transaction. This
1070 throttles the PG deletion process.
1071
1072 :Type: Float
1073 :Default: ``0``
1074
1075
1076 ``osd_delete_sleep_hdd``
1077
1078 :Description: Time in seconds to sleep before the next removal transaction
1079 for HDDs.
1080
1081 :Type: Float
1082 :Default: ``5``
1083
1084
1085 ``osd_delete_sleep_ssd``
1086
1087 :Description: Time in seconds to sleep before the next removal transaction
1088 for SSDs.
1089
1090 :Type: Float
1091 :Default: ``0``
1092
1093
1094 ``osd_delete_sleep_hybrid``
1095
1096 :Description: Time in seconds to sleep before the next removal transaction
1097 when OSD data is on HDD and OSD journal or WAL+DB is on SSD.
1098
1099 :Type: Float
1100 :Default: ``1``
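
For example, a sketch that throttles PG deletion more aggressively on
HDD-backed OSDs (the value is illustrative):

.. code-block:: ini

    [osd]
    # illustrative: default is 5 seconds
    osd_delete_sleep_hdd = 10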
1101
1102
1103 ``osd_command_max_records``
1104
1105 :Description: Limits the number of lost objects to return.
1106 :Type: 32-bit Integer
1107 :Default: ``256``
1108
1109
1110 ``osd_fast_fail_on_connection_refused``
1111
1112 :Description: If this option is enabled, crashed OSDs are marked down
1113 immediately by connected peers and MONs (assuming that the
1114 crashed OSD host survives). Disable it to restore old
1115 behavior, at the expense of possible long I/O stalls when
1116 OSDs crash in the middle of I/O operations.
1117 :Type: Boolean
1118 :Default: ``true``
1119
1120
1121
1122 .. _pool: ../../operations/pools
1123 .. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
1124 .. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
1125 .. _Pool & PG Config Reference: ../pool-pg-config-ref
1126 .. _Journal Config Reference: ../journal-ref
1127 .. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio