======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

  osd.0
  osd.1
  osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd journal size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as it
             makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common practice
is to partition the journal drive (often an SSD), and mount it such that Ceph
uses the entire partition for the journal.
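
For example, a hypothetical journal device that sustains about 100 MB/s,
combined with a ``filestore max sync interval`` of 5 seconds, calls for at
least 2 * 100 MB/s * 5 s = 1000 MB of journal space, which you might round up
in ``ceph.conf``:

.. code-block:: ini

    [osd]
    # 2 * 100 MB/s * 5 s = 1000 MB minimum; rounded up for headroom
    osd journal size = 2048
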

``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

  osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

  osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

  /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

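
For example, a minimal sketch of pointing one OSD's journal at a dedicated
partition on a faster device (the partition label below is purely
hypothetical; the ``osd journal`` path option itself is described below):

.. code-block:: ini

    [osd.0]
    # hypothetical partition on an SSD reserved for this OSD's journal
    osd journal = /dev/disk/by-partlabel/osd-0-journal
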
The default ``osd journal size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file::

  osd journal size = 10240


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.

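
For example, a minimal sketch that confines scheduled scrubs to a nightly
window and throttles them slightly (the values are illustrative only; all of
the options are described below):

.. code-block:: ini

    [osd]
    # only start scheduled scrubs between 01:00 and 07:00
    osd scrub begin hour = 1
    osd scrub end hour = 7
    # pause briefly between chunks to reduce the impact on client I/O
    osd scrub sleep = 0.1
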

``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Integer
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Together with ``osd scrub begin hour``, it defines a
              time window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window as long as the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub during recovery``

:Description: Allow scrub during recovery. Setting this to ``false`` will disable
              scheduling new scrubs (and deep scrubs) while there is active
              recovery. Already running scrubs will be continued. This might be
              useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimum interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during a
              scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a
              single operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the whole scrub operation,
              while client operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              effectively spreads the scrubs out randomly over the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. Setting ``osd op threads`` to ``0`` disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set priority weights between client operations and recovery
operations to ensure optimal performance during recovery.

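
For example, a minimal sketch that biases an OSD toward client traffic while
recovery is in progress (the values are illustrative; both settings are
described later in this section):

.. code-block:: ini

    [osd]
    # keep client ops at the highest priority (the default)
    osd client op priority = 63
    # deprioritize recovery work relative to client I/O
    osd recovery op priority = 1
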
``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``


``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. These queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system: when there are sufficient tokens, the
              high priority queues are dequeued first; if there are not enough
              tokens available, queues are dequeued from low priority to high
              priority. The WeightedPriorityQueue (``wpq``) dequeues all
              priorities in relation to their priorities to prevent starvation
              of any queue. WPQ should help in cases where a few OSDs are more
              overloaded than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              And, the mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``

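
For example, a sketch of the combination suggested above for clusters where a
small number of OSDs are much busier than the rest; both options require an
OSD restart to take effect:

.. code-block:: ini

    [osd]
    osd op queue = wpq
    osd op queue cut off = high
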

``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint-worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``

``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread. Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: Only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see the osd priority options instead.
:Type: String
:Default: the empty string

``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (e.g., due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can have priority.
              Note: Only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7 or -1 if not to be used.
:Default: ``-1``

``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by a client
- osd subop: the iops issued by the primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

And the resources are partitioned using the following three sets of tags. In
other words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resources it is
guaranteed to possess, as long as it requires them. Assume there are two
services, recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5
requests per second serviced, even if it asks for more (see CURRENT
IMPLEMENTATION NOTE below) and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
they will not exhaust all the I/O resources either: 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. In the meantime, the client ops can enjoy a larger
portion of the I/O resource, because their weight is "9", while their
competitor's is "1". Client ops are not clamped by the limit setting,
so they can make use of all the resources if there is no recovery
ongoing.

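
A minimal sketch of how the hypothetical reservations, limits, and weights
above could be expressed with the ``osd op queue mclock`` options documented
later in this section, assuming the ``mclock_opclass`` queue is selected (the
numbers are purely illustrative):

.. code-block:: ini

    [osd]
    osd op queue = mclock_opclass
    # recovery: (r:1, l:5, w:1)
    osd op queue mclock recov res = 1.0
    osd op queue mclock recov lim = 5.0
    osd op queue mclock recov wgt = 1.0
    # client ops: (r:2, l:0, w:9); a limit of 0 means client ops are not clamped
    osd op queue mclock client op res = 2.0
    osd op queue mclock client op lim = 0.0
    osd op queue mclock client op wgt = 9.0
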
Along with *mclock_opclass*, another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

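
A hedged sketch of nudging an OSD toward the mClock queue's decisions along
the lines described above; the option names come from the paragraphs above,
but the values are arbitrary starting points rather than recommendations:

.. code-block:: ini

    [osd]
    # fewer shards means each mClock queue sees more of the OSD's traffic
    osd op num shards = 1
    # smaller BlueStore throttles keep fewer operations in the sequencer
    bluestore throttle bytes = 33554432
    bluestore throttle deferred bytes = 33554432
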
A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000

``osd recovery max chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: The reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: The weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: The limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: The reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: The weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: The limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: The reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: The weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: The limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: The reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: The weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: The limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: The reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: The weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: The limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating
placement groups and the objects they contain can reduce the cluster's
operational performance considerably. To maintain operational performance,
Ceph performs this migration with 'backfilling', which allows Ceph to set
backfill operations to a lower priority than requests to read or write data.

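
For example, a minimal sketch of an ``[osd]`` section that keeps backfill
conservative (the options are described below; ``osd max backfills = 1``
simply restates the default, and the smaller scan values are illustrative):

.. code-block:: ini

    [osd]
    osd max backfills = 1
    osd backfill scan min = 16
    osd backfill scan max = 128
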

``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``500``


``osd map cache bl size``

:Description: The size of the in-memory OSD map cache in OSD daemons.
:Type: 32-bit Integer
:Default: ``50``


``osd map cache bl inc size``

:Description: The size of the in-memory OSD map cache incrementals in
              OSD daemons.

:Type: 32-bit Integer
:Default: ``100``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``100``



.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

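
For example, a minimal sketch of slowing recovery down further to protect
client traffic on a latency-sensitive cluster (illustrative values; the
options are described below):

.. code-block:: ini

    [osd]
    osd recovery max active = 1
    osd recovery sleep hdd = 0.2
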

``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer
:Default: ``3``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down the recovery operation,
              while client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when OSD data is on an HDD and the OSD journal is on an SSD.

:Type: Float
:Default: ``0.025``

Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out an OSD remove
              thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd auto upgrade tmap``

:Description: Uses ``tmap`` for ``omap`` on old objects.
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore the old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio