======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or, in
recent releases, in the central config store), but Ceph OSD Daemons can use
the default values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``osd journal size`` (for Filestore) and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

        osd.0
        osd.1
        osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

        [osd]
        osd_journal_size = 5120

        [osd.0]
        host = osd-host-a

        [osd.1]
        host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID and determine the paths
to data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as doing
             so makes it more problematic to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD) and mount it
such that Ceph uses the entire partition for the journal.

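For example, assuming (purely for illustration) a journal device that sustains
about 500 MB/s and the default ``filestore_max_sync_interval`` of 5 seconds,
the journal should be at least 2 x 500 MB/s x 5 s = 5000 MB. A minimal sketch
of the corresponding setting (``osd_journal_size`` is expressed in megabytes):

.. code-block:: ini

        [osd]
        # assumed for illustration: ~500 MB/s drive, 5 s filestore_max_sync_interval
        # 2 x 500 MB/s x 5 s = 5000 MB
        osd_journal_size = 5000
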
.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

  osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

  osd_mount_options_xfs = rw, noatime, inode64, logbufs=8


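Written out as they would appear in ``ceph.conf``, the XFS defaults above
correspond to the following sketch (it simply restates the defaults with the
expanded option names, to show how the ``{fs-type}`` placeholder resolves):

.. code-block:: ini

        [osd]
        # the XFS defaults listed above, spelled out explicitly
        osd_mkfs_options_xfs  = -f -i 2048
        osd_mount_options_xfs = rw,noatime,inode64
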
.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

        /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

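In that case, ``osd_journal`` can be pointed at the faster device. A minimal
sketch, assuming a hypothetical dedicated journal partition (the path below is
illustrative only):

.. code-block:: ini

        [osd.0]
        # hypothetical journal partition on a faster (SSD/NVMe) device
        osd_journal = /dev/disk/by-partlabel/osd0-journal
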
The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file.
A value of 10 gigabytes is common in practice::

        osd_journal_size = 10240


.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors

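For example, a minimal sketch that confines scrubbing to off-peak hours using
two of the settings above (the hours shown are illustrative, not a
recommendation):

.. code-block:: ini

        [osd]
        # illustrative: prefer to begin new scrubs between 23:00 and 06:00 local time
        osd_scrub_begin_hour = 23
        osd_scrub_end_hour = 6
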
.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps as described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the IOPS issued by a client
- osd subop: the IOPS issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity exists or the
   system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assume there are two
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5
requests per second serviced, even if it requires them (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large number of I/O requests,
neither will they exhaust all the I/O resources: 1 request per second
is always allocated for recovery jobs as long as there are any such
requests, so the recovery jobs won't be starved even in a cluster with
high load. In the meantime, the client ops can enjoy a larger
portion of the I/O resource, because their weight is "9", while their
competitor's is "1". Client ops are not clamped by the limit setting,
so they can make use of all the resources if there is no
recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.

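If you tune these values directly (the profiles in `mClock Config Reference`_
are the recommended route), the example above roughly maps onto the
``osd_mclock_scheduler_*`` options listed at the end of this section. A minimal
sketch with the same illustrative numbers, assuming the ``custom`` mClock
profile is in effect:

.. code-block:: ini

        [osd]
        # illustrative values taken from the recovery/client example above
        osd_mclock_scheduler_background_recovery_res = 1
        osd_mclock_scheduler_background_recovery_lim = 5
        osd_mclock_scheduler_background_recovery_wgt = 1
        osd_mclock_scheduler_client_res = 2
        osd_mclock_scheduler_client_lim = 0
        osd_mclock_scheduler_client_wgt = 9
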
Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then requests of the latter
class should be executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

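Expressed as a formula (a paraphrase of the rule above, not taken verbatim from
the scheduler code), the weight tag assigned to the *i*-th request of a class
with weight *W* is:

.. math::

   \mathrm{tag}_i = \max\left(\mathrm{tag}_{i-1} + \frac{1}{W},\; t_{\mathrm{now}}\right)

When *1/W* is tiny compared to the gaps between request arrivals, the maximum is
always won by the current time, which is why very large weights stop
differentiating the classes.
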
Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among themselves. The
number of shards can be controlled with the configuration options
:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore, to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
to restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.


.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval

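For example, a minimal sketch that uses the first setting above to cap the
number of concurrent backfills allowed to or from a single OSD (the value is
illustrative):

.. code-block:: ini

        [osd]
        # illustrative: allow at most one concurrent backfill per OSD
        osd_max_backfills = 1
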
.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max

.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority

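For example, a minimal sketch that throttles recovery further on HDD-backed
OSDs using two of the settings above (the values are illustrative, not tuned
recommendations):

.. code-block:: ini

        [osd]
        # illustrative: one active recovery op per OSD, with a longer sleep
        # between recovery ops on HDDs
        osd_recovery_max_active = 1
        osd_recovery_sleep_hdd = 0.2
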
Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref