======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or, in recent
releases, the central config store), but Ceph OSD Daemons can use default values
with only a minimal configuration. A minimal Ceph OSD Daemon configuration sets
``host`` and uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0`` and following this convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding them to the ``[osd]`` section of your configuration file.
To apply a setting to a specific Ceph OSD Daemon (e.g., ``host``), enter it in
an OSD-specific section of your configuration file. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b

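If your cluster uses the central config store, most of these options can also be
set at runtime with the ``ceph config set`` command instead of being written to
``ceph.conf``. A minimal sketch of the two scopes shown above (the option names
and values here are illustrative, not recommendations)::

    # apply a setting to all OSDs
    ceph config set osd osd_journal_size 5120

    # apply a setting to a single OSD
    ceph config set osd.0 osd_max_write_size 90
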
.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID and determine the paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals: doing so
   makes it harder to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD) and mount it
such that Ceph uses the entire partition for the journal.

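As a rough worked example (the numbers are illustrative): with an expected
sustained drive speed of 100 MB/s and a ``filestore_max_sync_interval`` of 5
seconds, the journal should be at least 2 x 100 MB/s x 5 s = 1000 MB.
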
.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw,noatime``

For example::

    osd_mount_options_xfs = rw,noatime,inode64,logbufs=8

.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be on the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSDs, NVMe) and slower ones (like spinning
drives), it makes sense to place the journal on the faster device while
``data`` fully occupies the slower device.

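As a sketch of how such a split layout might be created with ``ceph-volume``
(the device names are placeholders, and the journal is assumed to be a
partition on the faster device)::

    ceph-volume lvm create --filestore --data /dev/sdb --journal /dev/nvme0n1p1
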
The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file. A value
of 10 gigabytes is common in practice::

    osd_journal_size = 10240

.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.

Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.

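For example, on a network with occasional latency spikes you might lengthen the
heartbeat grace period (a sketch; the value is illustrative, not a
recommendation)::

    ceph config set osd osd_heartbeat_grace 30
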

Data Placement
==============

See `Pool & PG Config Reference`_ for details.

.. index:: OSD; scrubbing

.. _rados_config_scrubbing:

Scrubbing
=========

One way that Ceph ensures data integrity is by "scrubbing" placement groups.
Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
generates a catalog of all objects in each placement group and compares each
primary object to its replicas, ensuring that no objects are missing or
mismatched. Light scrubbing checks the object size and attributes, and is
usually done daily. Deep scrubbing reads the data and uses checksums to ensure
data integrity, and is usually done weekly. The frequencies of both light
scrubbing and deep scrubbing are determined by the cluster's configuration,
which is fully under your control and subject to the settings explained below.

Although scrubbing is important for maintaining data integrity, it can reduce
the performance of the Ceph cluster. You can adjust the following settings to
increase or decrease the frequency and depth of scrubbing operations.

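For example, to restrict scrubbing to overnight hours on all OSDs (a sketch;
the hours are illustrative)::

    ceph config set osd osd_scrub_begin_hour 22
    ceph config set osd osd_scrub_end_hour 6
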

.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors

.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold
.. confval:: osd_op_thread_suicide_timeout
.. note:: See https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ for
   more on ``osd_op_thread_suicide_timeout``. Be aware that this is a link to a
   reworking of a blog post from 2017, and that its conclusion will direct you
   back to this page "for more information".

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler based on `the
dmClock algorithm`_. This algorithm allocates the I/O resources of the Ceph
cluster in proportion to weights, and enforces the constraints of minimum
reservation and maximum limitation, so that services can compete for the
resources fairly. Currently the *mclock_scheduler* operation queue divides the
Ceph services involving I/O resources into the following buckets:

- client op: the IOPS issued by a client
- osd subop: the IOPS issued by a primary OSD
- snap trim: snapshot trimming requests
- pg recovery: recovery-related requests
- pg scrub: scrub-related requests

The resources are partitioned using the following three sets of tags; in other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity when there is extra capacity or
   the system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated for
serving the various services are consumed by these costs. So, for example, the
more reservation a service has, the more resources it is guaranteed to possess,
for as long as it requires them. Assume there are two services, recovery and
client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it requires that many (see the CURRENT IMPLEMENTATION
NOTE below) and no other services are competing with it. But if the clients
start to issue a large number of I/O requests, they will not exhaust all of the
I/O resources either: 1 request per second is always allocated for recovery
jobs as long as there are any such requests, so recovery jobs won't be starved
even in a cluster with a high load. In the meantime, the client ops can enjoy a
larger portion of the I/O resources, because their weight is "9" while their
competitor's is "1". In the case of client ops, they are not clamped by the
limit setting, so they can make use of all the resources if there is no
recovery underway.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.

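The per-bucket reservation, weight, and limit map to the
``osd_mclock_scheduler_*`` options listed at the end of this section. As a
sketch, the weights from the illustration above might be applied as follows
(this assumes the ``custom`` mClock profile; the units expected by the
``*_res`` and ``*_lim`` options depend on the release, so consult the
`mClock Config Reference`_ before setting them)::

    ceph config set osd osd_mclock_profile custom
    ceph config set osd osd_mclock_scheduler_client_wgt 9
    ceph config set osd osd_mclock_scheduler_background_recovery_wgt 1
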
Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per second. The
weight, however, does not technically have a unit and the weights are relative
to one another. So if one class of requests has a weight of 1 and another a
weight of 9, then requests of the latter class should be executed at a 9 to 1
ratio relative to the first class. However, that happens only once the
reservations are met, and the ratio counts the operations executed under the
reservation phase.

Even though the weights do not have units, one must be careful in choosing
their values due to how the algorithm assigns weight tags to requests. If the
weight is *W*, then for a given class of requests, the next one that comes in
will have a weight tag of *1/W* plus the previous weight tag or the current
time, whichever is larger. That means if *W* is sufficiently large and
therefore *1/W* is sufficiently small, the calculated tag may never be used
because it will be overridden by the current time. The ultimate lesson is that
values for weight should not be too large: they should stay under the number
of requests one expects to be serviced each second.

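As an illustrative (not prescriptive) example: if a class of requests has a
weight of 500 but receives only 100 requests per second, each request advances
the weight tag by 1/500 = 0.002 s while wall-clock time advances by roughly
0.01 s per request, so the tag is repeatedly reset to the current time and the
weight stops differentiating that class. A weight of 50 would keep the tags
ahead of the clock at that request rate.
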
Caveats
```````

There are some factors that can reduce the impact of the mClock op queues
within Ceph. First, requests to an OSD are sharded by their placement group
identifier. Each shard has its own mClock queue, and these queues neither
interact nor share information with one another. The number of shards can be
controlled with the configuration options :confval:`osd_op_num_shards`,
:confval:`osd_op_num_shards_hdd`, and :confval:`osd_op_num_shards_ssd`. A
lower number of shards will increase the impact of the mClock queues, but may
have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore, to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.

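To increase mClock's influence, one might therefore lower these throttles so
that fewer operations sit in the sequencer outside mClock's control. A sketch
(the values are purely illustrative and should be determined by benchmarking,
not copied)::

    ceph config set osd bluestore_throttle_bytes 33554432            # 32 MiB, illustrative
    ceph config set osd bluestore_throttle_deferred_bytes 33554432   # 32 MiB, illustrative
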
A third factor that affects the impact of the mClock algorithm is that we're
using a distributed system, where requests are made to multiple OSDs and each
OSD has (or can have) multiple shards. Yet we're currently using the mClock
algorithm, which is not distributed (note: dmClock is the distributed version
of mClock).

Various organizations and individuals are currently experimenting with mClock
as it exists in this code base, along with their own modifications to the code
base. We hope you'll share your experiences with your mClock and dmClock
experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs to
restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.

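For example, to reduce the impact of backfill on client I/O you might lower the
number of concurrent backfills allowed per OSD (a sketch; the value is
illustrative, and when the mClock scheduler is active such overrides may also
require settings described in the `mClock Config Reference`_)::

    ceph config set osd osd_max_backfills 1
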
.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max

.. index:: OSD; recovery

Recovery
========

When the cluster starts, or when a Ceph OSD Daemon crashes and restarts, the
OSD begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, it will usually be out of
sync with the other Ceph OSD Daemons, which contain more recent versions of the
objects in its placement groups. When this happens, the Ceph OSD Daemon goes
into recovery mode and seeks to get the latest copy of the data and bring its
map back up to date. Depending upon how long the Ceph OSD Daemon was down, the
OSD's objects and placement groups may be significantly out of date. Also, if a
failure domain went down (e.g., a rack), more than one Ceph OSD Daemon may come
back online at the same time. This can make the recovery process time consuming
and resource intensive.

To maintain operational performance, Ceph performs recovery with limits on the
number of recovery requests, threads, and object chunk sizes, which allows Ceph
to perform well in a degraded state.

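For example, to further reduce the impact of recovery on client traffic you
might lower the number of active recovery operations per OSD and add a short
sleep between them (a sketch; the values are illustrative, not
recommendations)::

    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_recovery_sleep 0.1
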
.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority

Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref