======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0`` using the following convention. ::

        osd.0
        osd.1
        osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter it in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

        [osd]
        osd journal size = 1024

        [osd.0]
        host = osd-host-a

        [osd.1]
        host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically. We **DO NOT** recommend changing the default paths for data or
journals, as doing so makes it more difficult to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common practice
is to partition the journal drive (often an SSD), and mount it such that Ceph
uses the entire partition for the journal.


``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Integer Unsigned
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

        osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

        osd mount options xfs = rw, noatime, inode64, logbufs=8
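
Both options can live together in the ``[osd]`` section of ``ceph.conf``. The
values below simply restate the examples above; they are illustrative, not
recommendations:

.. code-block:: ini

        [osd]
        # mkfs options applied when a new XFS-backed OSD is created
        osd mkfs options xfs = -f -d agcount=24
        # mount options applied whenever the OSD's file system is mounted
        osd mount options xfs = rw, noatime, inode64, logbufs=8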


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

        /var/lib/ceph/osd/$cluster-$id/journal

Without performance optimization, Ceph stores the journal on the same disk as
the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
a separate disk to store journal data (e.g., a solid state drive delivers high
performance journaling).

Ceph's default ``osd journal size`` is 0, so you will need to set this in your
``ceph.conf`` file. To size the journal, take the product of ``filestore max
sync interval`` and the expected throughput, and multiply that product by
two (2)::

        osd journal size = {2 * (expected throughput * filestore max sync interval)}

The expected throughput number should include the expected disk throughput
(i.e., sustained data transfer rate) and network throughput. For example,
a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()``
of the disk and network throughput should provide a reasonable expected
throughput. Some users just start off with a 10GB journal size. For
example::

        osd journal size = 10000
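
As a rough worked example of the formula above (assuming a single data disk
sustaining about 100 MB/s and the default ``filestore max sync interval`` of
5 seconds)::

        osd journal size = 2 * (100 MB/s * 5 s) = 1000 MB

Rounding up (to a few gigabytes, or the 10GB figure mentioned above) simply
leaves headroom for write bursts.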


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes. If this is 0 and the
              journal is a block device, the entire block device is used.
              Since v0.54, this setting is ignored if the journal is a block
              device, and the entire block device is used.

:Type: 32-bit Integer
:Default: ``5120``
:Recommended: Begin with 1GB. Should be at least twice the product of the
              expected speed and ``filestore max sync interval``.


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Integer
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, it defines a time
              window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window as long as the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``
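
For example, an illustrative snippet (the hours are arbitrary) that confines
scheduled scrubs to a quiet overnight window:

.. code-block:: ini

        [osd]
        # only start scheduled scrubs between 01:00 and 07:00 local time
        osd scrub begin hour = 1
        osd scrub end hour = 7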


``osd scrub during recovery``

:Description: Allow scrub during recovery. Setting this to ``false`` will
              disable scheduling new scrubs (and deep scrubs) while there is
              active recovery. Scrubs that are already running will continue.
              This might be useful for reducing load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during
              scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a
              single operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the whole scrub operation,
              while client operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. With the default setting,
              scrubs are therefore spread out over the allowed time window of
              ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``
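
As a concrete illustration with the defaults (a minimum interval of one day and
a ratio of ``0.5``), the next scrub of a placement group is scheduled somewhere
in the range::

        [60*60*24, 1.5 * 60*60*24] seconds, i.e. between 24 and 36 hours later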

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set operations priority weights between client operations and
recovery operations to ensure optimal performance during recovery.


``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``


``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system which, when there are sufficient
              tokens, dequeues high priority queues first. If there are not
              enough tokens available, queues are dequeued from low priority
              to high priority. The new WeightedPriorityQueue (``wpq``)
              dequeues all priorities in relation to their priorities to
              prevent starvation of any queue. WPQ should help in cases where
              a few OSDs are more overloaded than others. Requires a restart.

:Type: String
:Valid Choices: prio, wpq
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``
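
For example, a sketch of a ``ceph.conf`` snippet that combines the two settings
as suggested above for clusters with a few chronically busy OSDs (both options
only take effect after the OSDs are restarted):

.. code-block:: ini

        [osd]
        osd op queue = wpq
        osd op queue cut off = high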


``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds has elapsed.

:Type: Float
:Default: ``30``


``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``

``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread. Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: Only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see the osd priority options instead.
:Type: String
:Default: the empty string

``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread, ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (i.e. due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can take priority.
              Note: Only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7 or -1 if not to be used.
:Default: ``-1``
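
For instance, an illustrative per-OSD snippet (the OSD ID is arbitrary) that
deprioritizes the disk thread of one OSD sharing a CFQ-scheduled controller
with busier neighbors:

.. code-block:: ini

        [osd.0]
        # only effective when both options are set to non-default values
        osd disk thread ioprio class = idle
        osd disk thread ioprio priority = 7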

``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``

.. index:: OSD; backfilling

Backfilling
===========

When you add or remove Ceph OSD Daemons to or from a cluster, the CRUSH
algorithm will want to rebalance the cluster by moving placement groups to or
from Ceph OSD Daemons to restore the balance. The process of migrating
placement groups and the objects they contain can reduce the cluster's
operational performance considerably. To maintain operational performance,
Ceph performs this migration with 'backfilling', which allows Ceph to set
backfill operations to a lower priority than requests to read or write data.


``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``
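
For example, an illustrative snippet that raises the per-OSD backfill limit
while leaving the scan sizes at their defaults (the value ``2`` is an example,
not a recommendation; higher limits speed up rebalancing at the cost of some
client performance):

.. code-block:: ini

        [osd]
        osd max backfills = 2
        osd backfill scan min = 64
        osd backfill scan max = 512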

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``500``


``osd map cache bl size``

:Description: The size of the in-memory OSD map cache in OSD daemons.
:Type: 32-bit Integer
:Default: ``50``


``osd map cache bl inc size``

:Description: The size of the in-memory OSD map cache incrementals in
              OSD daemons.

:Type: 32-bit Integer
:Default: ``100``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``100``


.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.


``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer
:Default: ``3``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Integer Unsigned
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Integer Unsigned
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time to sleep before the next recovery operation. Increasing this
              value will slow down recovery, while client operations will be
              less impacted.

:Type: Float
:Default: ``0.01``
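
Putting several of these knobs together, the snippet below is an illustrative
(not prescriptive) example of throttling recovery further in favor of client
traffic; the exact values should be tuned per cluster:

.. code-block:: ini

        [osd]
        osd recovery max active = 1
        osd recovery op priority = 1
        osd recovery sleep = 0.1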

Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Integer Unsigned
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd auto upgrade tmap``

:Description: Uses ``tmap`` for ``omap`` on old objects.
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio