======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0`` using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter it in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
        osd journal size = 5120

    [osd.0]
        host = osd-host-a

    [osd.1]
        host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as it
             makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common practice
is to partition the journal drive (often an SSD), and mount it such that Ceph
uses the entire partition for the journal.
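
For example, assuming a drive that sustains roughly 100 MB/s of writes and the
default ``filestore max sync interval`` of 5 seconds, the guideline above works
out to at least 2 * 100 MB/s * 5 s = 1000 MB. A sketch of such a setting (the
drive speed here is only an illustrative assumption):

.. code-block:: ini

    [osd]
        # 2 x (expected drive speed) x (filestore max sync interval)
        # = 2 x 100 MB/s x 5 s = 1000 MB
        osd journal size = 1000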

``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd max object size``

:Description: The maximum size of a RADOS object in bytes.
:Type: 32-bit Unsigned Integer
:Default: 128MB


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``

.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd mount options xfs = rw, noatime, inode64, logbufs=8

.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
drives) it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

The default ``osd journal size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file::

    osd journal size = 10240


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.

``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, it defines a time
              window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window as long as the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub begin week day``

:Description: This restricts scrubbing to this day of the week or later.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``0``


``osd scrub end week day``

:Description: This restricts scrubbing to days of the week earlier than this.
              0 or 7 = Sunday, 1 = Monday, etc.
:Type: Integer in the range of 0 to 7
:Default: ``7``

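
As an illustration, the following sketch restricts scheduled scrubs to the
hours between midnight and 06:00 on any day of the week; the specific values
are only an example, not a recommendation:

.. code-block:: ini

    [osd]
        # allow scheduled scrubs only between 00:00 and 06:00
        osd scrub begin hour = 0
        osd scrub end hour = 6
        # 0 (Sunday) through 7 leaves all week days allowed
        osd scrub begin week day = 0
        osd scrub end week day = 7

Note that a scrub will still run outside this window if a placement group's
scrub interval exceeds ``osd scrub max interval``.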

``osd scrub during recovery``

:Description: Allow scrub during recovery. Setting this to ``false`` will disable
              scheduling new scrubs (and deep-scrubs) while there is active recovery.
              Already running scrubs will be continued. This might be useful to reduce
              load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The normalized maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg() / number of online cpus``) is higher than
              this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimum number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks. Increasing
              this value will slow down the whole scrub operation, while client
              operations will be less impacted.

:Type: Float
:Default: 0

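
On clusters where scrubbing noticeably competes with client I/O, the settings
above can be combined to throttle it. The values below are only an illustrative
sketch, not tuned recommendations:

.. code-block:: ini

    [osd]
        # pause 0.1 s between chunk scrubs to leave room for client ops
        osd scrub sleep = 0.1
        # scrub fewer chunks per operation
        osd scrub chunk max = 5
        # skip scheduled scrubs when the normalized load exceeds 0.3
        osd scrub load threshold = 0.3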

``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              effectively spreads the scrubs randomly over the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


``osd scrub auto repair``

:Description: Setting this to ``true`` will enable automatic pg repair when errors
              are found in scrub or deep-scrub. However, if more than
              ``osd scrub auto repair num errors`` errors are found, a repair is
              NOT performed.
:Type: Boolean
:Default: ``false``


``osd scrub auto repair num errors``

:Description: Auto repair will not occur if more than this many errors are found.
:Type: 32-bit Integer
:Default: ``5``

.. index:: OSD; operations settings

Operations
==========

``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. All queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system: when there are sufficient tokens, it
              dequeues high priority queues first; if there are not enough
              tokens available, queues are dequeued from low priority to high
              priority. The WeightedPriorityQueue (``wpq``) dequeues all
              priorities in relation to their priorities to prevent starvation
              of any queue. WPQ should help in cases where a few OSDs are more
              overloaded than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              The mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``

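
The text above suggests one common combination; as a sketch, enabling the
weighted priority queue together with the ``high`` cut off would look like
this (both options require an OSD restart to take effect):

.. code-block:: ini

    [osd]
        osd op queue = wpq
        osd op queue cut off = high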

``osd client op priority``

:Description: The priority set for client operations.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations, if not specified by the
              pool's ``recovery_op_priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The default priority set for a scheduled scrub work queue when the
              pool doesn't specify a value of ``scrub_priority``. This can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd requested scrub priority``

:Description: The priority set for user requested scrub on the work queue. If
              this value is smaller than ``osd client op priority``, it can be
              boosted to the value of ``osd client op priority`` when scrub is
              blocking client operations.

:Type: 32-bit Integer
:Default: ``120``


``osd snap trim priority``

:Description: The priority set for the snap trim work queue.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63

``osd snap trim sleep``

:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.

:Type: Float
:Default: ``0``


``osd snap trim sleep hdd``

:Description: Time in seconds to sleep before the next snap trim op
              for HDDs.

:Type: Float
:Default: ``5``


``osd snap trim sleep ssd``

:Description: Time in seconds to sleep before the next snap trim op
              for SSDs.

:Type: Float
:Default: ``0``


``osd snap trim sleep hybrid``

:Description: Time in seconds to sleep before the next snap trim op
              when osd data is on HDD and osd journal is on SSD.

:Type: Float
:Default: ``2``

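The generic ``osd snap trim sleep`` overrides the per-device variants, so a
common approach is to leave it at ``0`` and tune only the variant that matches
the device class. A sketch that slows trimming on HDD-backed OSDs (the value
is illustrative only):

.. code-block:: ini

    [osd]
        # trim more gently on HDD-backed OSDs (default is 5 seconds)
        osd snap trim sleep hdd = 10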

``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds has elapsed.

:Type: Float
:Default: ``30``


``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operations logs to display at once.
:Type: 32-bit Integer
:Default: ``5``

QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the IOPS issued by clients
- osd subop: the IOPS issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with "cost". The resources allocated
for serving various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5
requests per second serviced, even if it requires them (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
neither will they exhaust all the I/O resources: 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. And in the meantime, the client ops can enjoy a larger
portion of the I/O resource, because their weight is "9", while their
competitor's is "1". In the case of client ops, they are not clamped by
the limit setting, so they can make use of all the resources if there is
no recovery ongoing.
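
As a sketch, the hypothetical reservation/weight/limit triples above would be
expressed with the *mclock_opclass* options documented later in this section
roughly as follows:

.. code-block:: ini

    [osd]
        osd op queue = mclock_opclass
        # client ops: (r:2, l:0, w:9)
        osd op queue mclock client op res = 2.0
        osd op queue mclock client op lim = 0.0
        osd op queue mclock client op wgt = 9.0
        # recovery: (r:1, l:5, w:1)
        osd op queue mclock recov res = 1.0
        osd op queue mclock recov lim = 5.0
        osd op queue mclock recov wgt = 1.0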

Along with *mclock_opclass* another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000

``osd recovery max chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: The reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: The weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: The limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: The reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: The weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: The limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: The reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: The weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: The limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: The reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: The weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: The limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: The reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: The weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: The limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating placement
groups and the objects they contain can reduce the cluster's operational
performance considerably. To maintain operational performance, Ceph performs
this migration with 'backfilling', which allows Ceph to set backfill operations
to a lower priority than requests to read or write data.

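If rebalancing is too slow and the cluster has performance headroom, the
settings described below can be raised; conversely, the defaults already favor
client I/O. A sketch that speeds up rebalancing at some cost to client
performance (the value is illustrative only):

.. code-block:: ini

    [osd]
        # allow more concurrent backfills per OSD than the default of 1
        osd max backfills = 4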

``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.
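
As an illustrative sketch, an OSD host with ample memory could cache more map
epochs than the default; the value shown below is only an example, not a
recommendation:

.. code-block:: ini

    [osd]
        # cache more OSD map epochs than the default of 50
        osd map cache size = 200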

``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``50``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``40``


.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.
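
If recovery is still too intrusive, the settings below can be tuned down; if a
degraded cluster needs to heal faster, they can be raised instead. A sketch of
a more conservative configuration (the values are illustrative only):

.. code-block:: ini

    [osd]
        # fewer concurrent recovery requests per OSD (default is 3)
        osd recovery max active = 1
        # pause between recovery/backfill ops on HDD-backed OSDs
        osd recovery sleep hdd = 0.2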

``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer
:Default: ``3``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down the recovery operation while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when osd data is on HDD and osd journal is on SSD.

:Type: Float
:Default: ``0.025``


``osd recovery priority``

:Description: The default priority set for the recovery work queue. Not
              related to a pool's ``recovery_priority``.

:Type: 32-bit Integer
:Default: ``5``

Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio