======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd journal size = 1024

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically. We **DO NOT** recommend changing the default paths for data or
journals, as it makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common
practice is to partition the journal drive (often an SSD), and mount it such
that Ceph uses the entire partition for the journal.


``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB default. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd mount options xfs = rw, noatime, inode64, logbufs=8

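As a minimal sketch, both options can be set together in the ``[osd]`` section
of ``ceph.conf``; the values below simply restate the xfs defaults listed above
and are illustrative rather than recommendations:

.. code-block:: ini

    [osd]
    osd mkfs options xfs = -f -i 2048
    osd mount options xfs = rw,noatime,inode64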

.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

    /var/lib/ceph/osd/$cluster-$id/journal

Without performance optimization, Ceph stores the journal on the same disk as
the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
a separate disk to store journal data (e.g., a solid state drive delivers high
performance journaling).

Ceph's default ``osd journal size`` is 0, so you will need to set this in your
``ceph.conf`` file. The journal size should be twice the product of the
expected throughput and ``filestore max sync interval``::

    osd journal size = {2 * (expected throughput * filestore max sync interval)}

The expected throughput number should include the expected disk throughput
(i.e., sustained data transfer rate), and network throughput. For example,
a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()``
of the disk and network throughput should provide a reasonable expected
throughput. Some users just start off with a 10GB journal size. For
example::

    osd journal size = 10000

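As a worked sketch, assuming a sustained throughput of roughly 100 MB/s and a
``filestore max sync interval`` of 5 seconds, the formula above yields
2 * 100 MB/s * 5 s = 1000 MB, which could be expressed as:

.. code-block:: ini

    [osd]
    osd journal size = 1000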

``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes. If this is 0, and the
              journal is a block device, the entire block device is used.
              Since v0.54, this is ignored if the journal is a block device,
              and the entire block device is used.

:Type: 32-bit Integer
:Default: ``5120``
:Recommended: Begin with 1GB. Should be at least twice the product of the
              expected speed and ``filestore max sync interval``.


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.

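For example, a sketch that makes scrubbing more conservative on a busy cluster
might lower the load threshold and add a short sleep between chunks (both
settings are described below); the values here are illustrative only:

.. code-block:: ini

    [osd]
    osd max scrubs = 1
    osd scrub load threshold = 0.3
    osd scrub sleep = 0.1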

``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``


``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, these define a
              time window in which scrubs can happen. A scrub will be performed
              regardless of the time window, however, as long as the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``

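As a sketch, restricting scheduled scrubs to the early morning hours (midnight
to 6:00 here, a purely illustrative window) could look like:

.. code-block:: ini

    [osd]
    osd scrub begin hour = 0
    osd scrub end hour = 6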

``osd scrub during recovery``

:Description: Allow scrubbing during recovery. Setting this to ``false`` will
              disable scheduling new scrubs (and deep scrubs) while there is
              active recovery. Already running scrubs will continue. This might
              be useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.
              Default is ``0.5``.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimal number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the whole scrub operation
              while client operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              practically randomly spreads the scrubs out in the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set operations priority weights between client operations and
recovery operations to ensure optimal performance during recovery.


``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``


``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system: when there are sufficient tokens, it
              dequeues high priority queues first; when there are not enough
              tokens available, queues are dequeued low priority to high
              priority. The WeightedPriorityQueue (``wpq``) dequeues all
              priorities in relation to their priorities to prevent starvation
              of any queue. WPQ should help in cases where a few OSDs are more
              overloaded than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              The mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``

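A sketch of the combination discussed above, ``wpq`` together with a ``high``
cut off, which only takes effect after the OSDs are restarted:

.. code-block:: ini

    [osd]
    osd op queue = wpq
    osd op queue cut off = high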

``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63

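For instance, a sketch that keeps client traffic strongly favored over
background work leaves the client priority at its maximum and keeps the
background priorities low; the values below are the documented defaults, shown
only to illustrate placement in ``ceph.conf``:

.. code-block:: ini

    [osd]
    osd client op priority = 63
    osd recovery op priority = 3
    osd scrub priority = 5
    osd snap trim priority = 5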

``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``

``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread. Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: Only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see osd priority options instead.
:Type: String
:Default: the empty string

``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (i.e. due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can have priority.
              Note: Only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7 or -1 if not to be used.
:Default: ``-1``

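A sketch of the ``idle`` arrangement described above, which only has an effect
when both options are set to non-default values and the CFQ scheduler is in
use:

.. code-block:: ini

    [osd]
    osd disk thread ioprio class = idle
    osd disk thread ioprio priority = 7
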
``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operations logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by client
- osd subop: the iops issued by primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

And the resources are partitioned using the following three sets of tags. In
other words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity or system
   oversubscribed.

In Ceph, operations are graded with "cost". And the resources allocated
for serving various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that the recovery won't get more than 5
requests per second serviced, even if it requires so (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
neither will they exhaust all the I/O resources. 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. And in the meantime, the client ops can enjoy a larger
portion of the I/O resource, because its weight is "9", while its
competitor's is "1". In the case of client ops, it is not clamped by the
limit setting, so it can make use of all the resources if there is no
recovery ongoing.

Along with *mclock_opclass* another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

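As a purely illustrative sketch, the example tags above could be expressed with
the ``osd op queue mclock *`` options documented below (assuming the
``mclock_opclass`` queue is selected via ``osd op queue``); remember that the
current implementation does not enforce the limit values:

.. code-block:: ini

    [osd]
    osd op queue = mclock_opclass
    osd op queue mclock recov res = 1.0
    osd op queue mclock recov lim = 5.0
    osd op queue mclock recov wgt = 1.0
    osd op queue mclock client op res = 2.0
    osd op queue mclock client op lim = 0.0
    osd op queue mclock client op wgt = 9.0
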
Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first
class. However that will only happen once the reservations are met and
those values include the operations executed under the reservation
phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

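A hedged sketch of an experiment along these lines, shrinking the shard count
and the BlueStore throttles to give mClock more control; the values are
arbitrary starting points for experimentation, not recommendations:

.. code-block:: ini

    [osd]
    osd_op_num_shards = 1
    bluestore_throttle_bytes = 33554432
    bluestore_throttle_deferred_bytes = 33554432
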
A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000

``osd recovery max chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: The reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: The weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: The limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: The reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: The weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: The limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: The reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: The weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: The limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: The reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: The weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: The limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: The reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: The weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: The limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating placement
groups and the objects they contain can reduce the cluster's operational
performance considerably. To maintain operational performance, Ceph performs
this migration with 'backfilling', which allows Ceph to set backfill operations
to a lower priority than requests to read or write data.

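To keep backfill gentle, a sketch might leave ``osd max backfills`` at its
default of ``1`` and shrink the scan sizes; the scan values are illustrative
only, and the settings themselves are described below:

.. code-block:: ini

    [osd]
    osd max backfills = 1
    osd backfill scan min = 32
    osd backfill scan max = 256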

``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``500``


``osd map cache bl size``

:Description: The size of the in-memory OSD map cache in OSD daemons.
:Type: 32-bit Integer
:Default: ``50``


``osd map cache bl inc size``

:Description: The size of the in-memory OSD map cache incrementals in
              OSD daemons.

:Type: 32-bit Integer
:Default: ``100``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``100``



.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

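As an illustrative sketch only, a cluster that prioritizes client I/O during
recovery might keep the number of active recovery requests low and add a small
recovery sleep (both settings are described below):

.. code-block:: ini

    [osd]
    osd recovery max active = 1
    osd recovery sleep = 0.1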

``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer
:Default: ``3``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operations while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when osd data is on HDD and osd journal is on SSD.

:Type: Float
:Default: ``0.025``


Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd auto upgrade tmap``

:Description: Uses ``tmap`` for ``omap`` on old objects.
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio