======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or, in
recent releases, the central config store), but Ceph OSD Daemons can use
default values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``osd_journal_size`` (for Filestore) and ``host``, and uses
default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

You may specify settings for all Ceph OSD Daemons in the cluster by adding them
to the ``[osd]`` section of your configuration file. To add settings directly
to a specific Ceph OSD Daemon (e.g., ``host``), enter them in an OSD-specific
section. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b

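
The same settings can also be stored centrally on the monitors instead of in
``ceph.conf``. A minimal sketch using the ``ceph config`` command (the values
shown are illustrative)::

    ceph config set osd osd_journal_size 5120
    ceph config set osd.0 osd_max_backfills 2
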

.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as it
             makes it more problematic to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD), and mount it
such that Ceph uses the entire partition for the journal.

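
For example, assuming a drive that sustains roughly 200 MB/s of writes and a
``filestore_max_sync_interval`` of 5 seconds (both figures are illustrative),
the journal should be at least 2 x 200 MB/s x 5 s = 2000 MB:

.. code-block:: ini

    [osd]
    # 2 * 200 MB/s * 5 s = 2000 MB, rounded up to 2 GiB
    osd_journal_size = 2048
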

``osd_uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd_data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd_max_write_size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd_max_object_size``

:Description: The maximum size of a RADOS object in bytes.
:Type: 32-bit Unsigned Integer
:Default: 128MB


``osd_client_message_size_cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd_class_dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type
              {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type
              {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd_mount_options_xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
drives) it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file.
A value of 10 gigabytes is common in practice::

    osd_journal_size = 10240

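
When the journal lives on a separate fast device, ``osd_journal`` (described
below) can point directly at a partition of that device. A minimal sketch, in
which the partition path is purely illustrative:

.. code-block:: ini

    [osd.0]
    # journal on a dedicated SSD partition; data stays on the slower drive
    osd_journal = /dev/disk/by-partlabel/ceph-journal-0
    osd_journal_size = 10240
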

``osd_journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              separate fast device when the ``osd_data`` drive is an HDD.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd_journal_size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


``osd_max_scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd_scrub_begin_hour``

:Description: This restricts scrubbing to this hour of the day or later.
              Together with ``osd_scrub_end_hour``, it defines a time window
              within which scrubs can happen. Use ``osd_scrub_begin_hour = 0``
              and ``osd_scrub_end_hour = 0`` to allow scrubbing the entire day.
              Note that a scrub will still be performed outside this window
              whenever a placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 23
:Default: ``0``


``osd_scrub_end_hour``

:Description: This restricts scrubbing to hours of the day earlier than this.
              Together with ``osd_scrub_begin_hour``, it defines a time window
              within which scrubs can happen. Use ``osd_scrub_begin_hour = 0``
              and ``osd_scrub_end_hour = 0`` to allow scrubbing for the entire
              day. Note that a scrub will still be performed outside this
              window whenever a placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 23
:Default: ``0``

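
For example, to confine routine scrubbing to a nightly window between 23:00 and
06:00 (bearing in mind that a PG whose scrub interval has already exceeded
``osd_scrub_max_interval`` will still be scrubbed outside the window):

.. code-block:: ini

    [osd]
    osd_scrub_begin_hour = 23
    osd_scrub_end_hour = 6
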

``osd_scrub_begin_week_day``

:Description: This restricts scrubbing to this day of the week or later.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_end_week_day``, it defines
              a time window within which scrubs can happen. Note that a scrub
              will still be performed outside this window whenever a placement
              group's scrub interval exceeds ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 6
:Default: ``0``


``osd_scrub_end_week_day``

:Description: This restricts scrubbing to days of the week earlier than this.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_begin_week_day``, it
              defines a time window within which scrubs can happen. Note that a
              scrub will still be performed outside this window whenever a
              placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 6
:Default: ``0``


``osd_scrub_during_recovery``

:Description: Allow scrubbing during recovery. Setting this to ``false`` will
              disable the scheduling of new scrubs (and deep-scrubs) while
              there is active recovery. Scrubs that are already running will
              continue. This can be useful for reducing load on busy clusters.
:Type: Boolean
:Default: ``false``

``osd_scrub_thread_timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd_scrub_finalize_thread_timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``10*60``


``osd_scrub_load_threshold``

:Description: The normalized maximum load. Ceph will not scrub when the system
              load (as defined by ``getloadavg() / number of online CPUs``) is
              higher than this number.

:Type: Float
:Default: ``0.5``

``osd_scrub_min_interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``24*60*60``

.. _osd_scrub_max_interval:

``osd_scrub_max_interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*24*60*60``

``osd_scrub_chunk_min``

:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during a
              scrub.

:Type: 32-bit Integer
:Default: 5


``osd_scrub_chunk_max``

:Description: The maximum number of object store chunks to scrub during a
              single operation.

:Type: 32-bit Integer
:Default: 25


``osd_scrub_sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the overall rate of
              scrubbing so that client operations will be less impacted.

:Type: Float
:Default: 0

``osd_deep_scrub_interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd_scrub_load_threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``7*24*60*60``


``osd_scrub_interval_randomize_ratio``

:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling
              the next scrub job for a PG. The delay is a random value less
              than ``osd_scrub_min_interval`` \*
              ``osd_scrub_interval_randomize_ratio``. The default setting
              spreads scrubs throughout the allowed time window of
              ``[1, 1.5]`` \* ``osd_scrub_min_interval``.
:Type: Float
:Default: ``0.5``

``osd_deep_scrub_stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


``osd_scrub_auto_repair``

:Description: Setting this to ``true`` will enable automatic PG repair when
              errors are found by scrubs or deep-scrubs. However, if more than
              ``osd_scrub_auto_repair_num_errors`` errors are found, a repair
              is NOT performed.
:Type: Boolean
:Default: ``false``


``osd_scrub_auto_repair_num_errors``

:Description: Auto repair will not occur if more than this many errors are
              found.
:Type: 32-bit Integer
:Default: ``5``

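
For example, to let scrubbing repair small numbers of inconsistencies
automatically while leaving anything larger for operator investigation (a
sketch that keeps the default error threshold):

.. code-block:: ini

    [osd]
    osd_scrub_auto_repair = true
    osd_scrub_auto_repair_num_errors = 5
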

.. index:: OSD; operations settings

Operations
==========

``osd_op_queue``

:Description: This sets the type of queue to be used for prioritizing ops
              within each OSD. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The WeightedPriorityQueue (``wpq``)
              dequeues operations in relation to their priorities to prevent
              starvation of any queue. WPQ should help in cases where a few
              OSDs are more overloaded than others. The new mClockQueue
              (``mclock_scheduler``) prioritizes operations based on which
              class they belong to (recovery, scrub, snaptrim, client op,
              osd subop). See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: wpq, mclock_scheduler
:Default: ``wpq``


``osd_op_queue_cut_off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the
              ``high`` option sends only replication acknowledgment ops and
              higher to the strict queue. Setting this to ``high`` should help
              when a few OSDs in the cluster are very busy, especially when
              combined with ``wpq`` in the ``osd_op_queue`` setting. OSDs that
              are very busy handling replication traffic could otherwise starve
              primary client traffic on these OSDs. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``high``

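
For example, to state the queue configuration explicitly (these values match
the defaults shown above; the OSDs must be restarted for a change to either
option to take effect):

.. code-block:: ini

    [osd]
    osd_op_queue = wpq
    osd_op_queue_cut_off = high
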

``osd_client_op_priority``

:Description: The priority set for client operations. This value is relative
              to that of ``osd_recovery_op_priority`` below. The default
              strongly favors client ops over recovery.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd_recovery_op_priority``

:Description: The priority of recovery operations vs client operations, if not
              specified by the pool's ``recovery_op_priority``. The default
              value prioritizes client ops (see above) over recovery ops. You
              may adjust the tradeoff of client impact against the time to
              restore cluster health by lowering this value for increased
              prioritization of client ops, or by increasing it to favor
              recovery.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd_scrub_priority``

:Description: The default work queue priority for scheduled scrubs when the
              pool doesn't specify a value of ``scrub_priority``. This can be
              boosted to the value of ``osd_client_op_priority`` when scrubs
              are blocking client operations.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63

``osd_requested_scrub_priority``

:Description: The priority set for user-requested scrubs on the work queue. If
              this value is smaller than ``osd_client_op_priority``, it can be
              boosted to the value of ``osd_client_op_priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``120``


``osd_snap_trim_priority``

:Description: The priority set for the snap trim work queue.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63

``osd_snap_trim_sleep``

:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.

:Type: Float
:Default: ``0``


``osd_snap_trim_sleep_hdd``

:Description: Time in seconds to sleep before the next snap trim op
              for HDDs.

:Type: Float
:Default: ``5``


``osd_snap_trim_sleep_ssd``

:Description: Time in seconds to sleep before the next snap trim op
              for SSD OSDs (including NVMe).

:Type: Float
:Default: ``0``


``osd_snap_trim_sleep_hybrid``

:Description: Time in seconds to sleep before the next snap trim op
              when OSD data is on an HDD and the OSD journal or WAL+DB is on
              an SSD.

:Type: Float
:Default: ``2``

``osd_op_thread_timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd_op_complaint_time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd_op_history_size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd_op_history_duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd_op_log_threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by a client
- osd subop: the iops issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity or the
   system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated for
serving the various services are consumed by these costs. So, for example, the
more reservation a service has, the more resources it is guaranteed to possess,
as long as it requires them. Assume there are two services, recovery and client
ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it requires that many (see CURRENT IMPLEMENTATION NOTE
below) and no other services are competing with it. But if the clients start to
issue a large number of I/O requests, neither will they exhaust all the I/O
resources: 1 request per second is always allocated for recovery jobs as long
as there are any such requests. So the recovery jobs won't be starved even in a
cluster with high load. In the meantime, the client ops can enjoy a larger
portion of the I/O resources, because their weight is "9" while their
competitor's is "1". Client ops are not clamped by the limit setting, so they
can make use of all the resources if there is no recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.

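
As a rough sketch, the recovery/client example above could be expressed with
the ``osd_mclock_scheduler_*`` options documented later in this section (the
values are illustrative, and the client limit is simply left at its effectively
unbounded default rather than being set to 0):

.. code-block:: ini

    [osd]
    osd_op_queue = mclock_scheduler
    # recovery: (r:1, l:5, w:1)
    osd_mclock_scheduler_background_recovery_res = 1
    osd_mclock_scheduler_background_recovery_lim = 5
    osd_mclock_scheduler_background_recovery_wgt = 1
    # client ops: (r:2, w:9)
    osd_mclock_scheduler_client_res = 2
    osd_mclock_scheduler_client_wgt = 9
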

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However that will only happen once the reservations are met and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ``ceph-devel`` mailing list.

``osd_push_per_object_cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000


``osd_recovery_max_chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd_mclock_scheduler_client_res``

:Description: IO proportion reserved for each client (default).

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_client_wgt``

:Description: IO share for each client (default) over reservation.

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_client_lim``

:Description: IO limit for each client (default) over reservation.

:Type: Unsigned Integer
:Default: 999999


``osd_mclock_scheduler_background_recovery_res``

:Description: IO proportion reserved for background recovery (default).

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_recovery_wgt``

:Description: IO share for each background recovery over reservation.

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_recovery_lim``

:Description: IO limit for background recovery over reservation.

:Type: Unsigned Integer
:Default: 999999


``osd_mclock_scheduler_background_best_effort_res``

:Description: IO proportion reserved for background best_effort (default).

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_best_effort_wgt``

:Description: IO share for each background best_effort over reservation.

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_best_effort_lim``

:Description: IO limit for background best_effort over reservation.

:Type: Unsigned Integer
:Default: 999999

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
to restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.


``osd_max_backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
              Note that this is applied separately for read and write
              operations.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd_backfill_scan_min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd_backfill_scan_max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd_backfill_retry_interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

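
Backfill pressure is commonly adjusted at runtime through the central config
store, for example to allow more concurrent backfills during a planned
expansion and to restore the default afterwards (a sketch; the values are
illustrative)::

    ceph config set osd osd_max_backfills 3
    # once the cluster has settled:
    ceph config set osd osd_max_backfills 1
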

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd_map_dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd_map_cache_size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``50``


``osd_map_message_max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``40``


.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.


``osd_recovery_delay_start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover RADOS objects.

:Type: Float
:Default: ``0``


``osd_recovery_max_active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

              This value is only used if it is non-zero. Normally it
              is ``0``, which means that the ``hdd`` or ``ssd`` values
              (below) are used, depending on the type of the primary
              device backing the OSD.

:Type: 32-bit Integer
:Default: ``0``

``osd_recovery_max_active_hdd``

:Description: The number of active recovery requests per OSD at one time, if
              the primary device is rotational.

:Type: 32-bit Integer
:Default: ``3``

``osd_recovery_max_active_ssd``

:Description: The number of active recovery requests per OSD at one time, if
              the primary device is non-rotational (i.e., an SSD).

:Type: 32-bit Integer
:Default: ``10``

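
For example, to throttle recovery on spinning disks while allowing more
parallelism on flash, the per-device-class values can be overridden (a sketch;
the numbers are illustrative):

.. code-block:: ini

    [osd]
    osd_recovery_max_active_hdd = 1
    osd_recovery_max_active_ssd = 5
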

``osd_recovery_max_chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd_recovery_max_single_start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd_recovery_thread_timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd_recover_clone_overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd_recovery_sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operations while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd_recovery_sleep_hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd_recovery_sleep_ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd_recovery_sleep_hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when OSD data is on HDD and the OSD journal / WAL+DB is on SSD.

:Type: Float
:Default: ``0.025``


``osd_recovery_priority``

:Description: The default priority set for the recovery work queue. Not
              related to a pool's ``recovery_priority``.

:Type: 32-bit Integer
:Default: ``5``


Tiering
=======

``osd_agent_max_ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd_agent_max_low_ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd_snap_trim_thread_timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``1*60*60``


``osd_backlog_thread_timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``1*60*60``


``osd_default_notify_timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd_check_for_log_corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd_remove_thread_timeout``

:Description: The maximum time in seconds before timing out a remove OSD
              thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd_command_thread_timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd_delete_sleep``

:Description: Time in seconds to sleep before the next removal transaction.
              This throttles the PG deletion process.

:Type: Float
:Default: ``0``


``osd_delete_sleep_hdd``

:Description: Time in seconds to sleep before the next removal transaction
              for HDDs.

:Type: Float
:Default: ``5``


``osd_delete_sleep_ssd``

:Description: Time in seconds to sleep before the next removal transaction
              for SSDs.

:Type: Float
:Default: ``0``


``osd_delete_sleep_hybrid``

:Description: Time in seconds to sleep before the next removal transaction
              when OSD data is on HDD and the OSD journal or WAL+DB is on SSD.

:Type: Float
:Default: ``1``


``osd_command_max_records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd_fast_fail_on_connection_refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore the old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``


.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref