======================
 OSD Config Reference
======================

.. index:: OSD; configuration
6
7You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent
8releases, the central config store), but Ceph OSD
9Daemons can use the default values and a very minimal configuration. A minimal
10Ceph OSD Daemon configuration sets ``osd journal size`` (for Filestore), ``host``, and
11uses default values for nearly everything else.
12
13Ceph OSD Daemons are numerically identified in incremental fashion, beginning
14with ``0`` using the following convention. ::
15
16 osd.0
17 osd.1
18 osd.2
19
20In a configuration file, you may specify settings for all Ceph OSD Daemons in
21the cluster by adding configuration settings to the ``[osd]`` section of your
22configuration file. To add settings directly to a specific Ceph OSD Daemon
23(e.g., ``host``), enter it in an OSD-specific section of your configuration
24file. For example:
25
26.. code-block:: ini
27
28 [osd]
29 osd_journal_size = 5120
30
31 [osd.0]
32 host = osd-host-a
33
34 [osd.1]
35 host = osd-host-b
36
37
38.. index:: OSD; config settings
39
40General Settings
41================
42
43The following settings provide a Ceph OSD Daemon's ID, and determine paths to
44data and journals. Ceph deployment scripts typically generate the UUID
45automatically.
46
47.. warning:: **DO NOT** change the default paths for data or journals, as it
48 makes it more problematic to troubleshoot Ceph later.
49
When using Filestore, the journal size should be at least twice the expected
drive speed multiplied by ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD) and mount it
such that Ceph uses the entire partition for the journal.
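
As a rough worked example (assuming, hypothetically, a drive that sustains
about 100 MB/s of writes and the default ``filestore_max_sync_interval`` of 5
seconds), the sizing rule above yields a minimum of about 1000 MB, so the
stock default already provides headroom:

.. code-block:: ini

   [osd]
   # 2 x 100 MB/s x 5 s = 1000 MB minimum for this hypothetical drive;
   # the stock default of 5120 MB comfortably exceeds it
   osd_journal_size = 5120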
54
55
56``osd_uuid``
57
58:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
59:Type: UUID
60:Default: The UUID.
61:Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
62 applies to the entire cluster.
63
64
65``osd_data``
66
:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.
70
71:Type: String
72:Default: ``/var/lib/ceph/osd/$cluster-$id``
73
74
75``osd_max_write_size``
76
77:Description: The maximum size of a write in megabytes.
78:Type: 32-bit Integer
79:Default: ``90``
80
81
82``osd_max_object_size``
83
84:Description: The maximum size of a RADOS object in bytes.
85:Type: 32-bit Unsigned Integer
86:Default: 128MB
87
88
89``osd_client_message_size_cap``
90
91:Description: The largest client data message allowed in memory.
92:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``
94
95
96``osd_class_dir``
97
98:Description: The class path for RADOS class plug-ins.
99:Type: String
100:Default: ``$libdir/rados-classes``
101
102
103.. index:: OSD; file system
104
File System Settings
====================

Ceph builds and mounts file systems that are used by Filestore OSDs.
108
109``osd_mkfs_options {fs-type}``
110
111:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.
112
113:Type: String
114:Default for xfs: ``-f -i 2048``
115:Default for other file systems: {empty string}
116
For example::

  osd_mkfs_options_xfs = -f -d agcount=24
119
120``osd_mount_options {fs-type}``
121
122:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.
123
124:Type: String
125:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw,noatime``
127
For example::

  osd_mount_options_xfs = rw,noatime,inode64,logbufs=8
130
131
132.. index:: OSD; journal settings
133
134Journal Settings
135================
136
This section applies only to the older Filestore OSD back end. Since the
Luminous release, BlueStore has been the default and preferred back end.
139
140By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
141the following path, which is usually a symlink to a device or partition::
142
143 /var/lib/ceph/osd/$cluster-$id/journal
144
145When using a single device type (for example, spinning drives), the journals
146should be *colocated*: the logical volume (or partition) should be in the same
147device as the ``data`` logical volume.
148
149When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
150drives) it makes sense to place the journal on the faster device, while
151``data`` occupies the slower device fully.
152
153The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
154larger, in which case it will need to be set in the ``ceph.conf`` file.
155A value of 10 gigabytes is common in practice::
156
157 osd_journal_size = 10240
158
159
160``osd_journal``
161
162:Description: The path to the OSD's journal. This may be a path to a file or a
163 block device (such as a partition of an SSD). If it is a file,
164 you must create the directory to contain it. We recommend using a
165 separate fast device when the ``osd_data`` drive is an HDD.
166
167:Type: String
168:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``
169
170
171``osd_journal_size``
172
173:Description: The size of the journal in megabytes.
174
175:Type: 32-bit Integer
176:Default: ``5120``
177
178
179See `Journal Config Reference`_ for additional details.
180
181
182Monitor OSD Interaction
183=======================
184
185Ceph OSD Daemons check each other's heartbeats and report to monitors
186periodically. Ceph can use default values in many cases. However, if your
187network has latency issues, you may need to adopt longer intervals. See
188`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
189
190
191Data Placement
192==============
193
194See `Pool & PG Config Reference`_ for details.
195
196
197.. index:: OSD; scrubbing
198
199Scrubbing
200=========
201
202In addition to making multiple copies of objects, Ceph ensures data integrity by
203scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
204object storage layer. For each placement group, Ceph generates a catalog of all
205objects and compares each primary object and its replicas to ensure that no
206objects are missing or mismatched. Light scrubbing (daily) checks the object
207size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
208to ensure data integrity.
209
210Scrubbing is important for maintaining data integrity, but it can reduce
211performance. You can adjust the following settings to increase or decrease
212scrubbing operations.
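
For example, a conservative, purely illustrative starting point for reducing
the client-facing impact of scrubbing on a busy cluster (not a tuned
recommendation) might look like this:

.. code-block:: ini

   [osd]
   # keep the default of one concurrent scrub per OSD
   osd_max_scrubs = 1
   # pause briefly between scrubbed chunks to leave room for client I/O
   osd_scrub_sleep = 0.1
   # skip scheduling new scrubs while the normalized load is high (default)
   osd_scrub_load_threshold = 0.5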
213
214
215``osd_max_scrubs``
216
217:Description: The maximum number of simultaneous scrub operations for
218 a Ceph OSD Daemon.
219
220:Type: 32-bit Int
221:Default: ``1``
222
223``osd_scrub_begin_hour``
224
:Description: This restricts scrubbing to this hour of the day or later.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing at any time of day. Together with
              ``osd_scrub_end_hour``, this defines a time window in which
              scrubs can happen. However, a scrub is performed regardless of
              the time window whenever the placement group's scrub interval
              exceeds ``osd_scrub_max_interval``.
232:Type: Integer in the range of 0 to 23
233:Default: ``0``
234
235
236``osd_scrub_end_hour``
237
:Description: This restricts scrubbing to hours of the day earlier than this.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing at any time of day. Together with
              ``osd_scrub_begin_hour``, this defines a time window in which
              scrubs can happen. However, a scrub is performed regardless of
              the time window whenever the placement group's scrub interval
              exceeds ``osd_scrub_max_interval``.
244:Type: Integer in the range of 0 to 23
245:Default: ``0``
246
247
248``osd_scrub_begin_week_day``
249
:Description: This restricts scrubbing to this day of the week or later.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing on any day
              of the week. Together with ``osd_scrub_end_week_day``, this
              defines a time window in which scrubs can happen. However, a
              scrub is performed regardless of the time window whenever the
              placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
257:Type: Integer in the range of 0 to 6
258:Default: ``0``
259
260
261``osd_scrub_end_week_day``
262
:Description: This restricts scrubbing to days of the week earlier than this.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing on any day
              of the week. Together with ``osd_scrub_begin_week_day``, this
              defines a time window in which scrubs can happen. However, a
              scrub is performed regardless of the time window whenever the
              placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
270:Type: Integer in the range of 0 to 6
271:Default: ``0``
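
Taken together, the four options above can confine scrubbing to a quiet
period. The following sketch (illustrative values only; scrubs can still run
outside the window once ``osd_scrub_max_interval`` is exceeded) restricts
scrubs to weekday early-morning hours:

.. code-block:: ini

   [osd]
   # Monday (1) through Friday (6 means "before Saturday") ...
   osd_scrub_begin_week_day = 1
   osd_scrub_end_week_day = 6
   # ... from 01:00 up to, but not including, 06:00
   osd_scrub_begin_hour = 1
   osd_scrub_end_hour = 6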
272
273
``osd_scrub_during_recovery``

:Description: Allow scrubs during recovery. Setting this to ``false`` disables
              the scheduling of new scrubs (and deep scrubs) while there is
              active recovery. Scrubs that are already running will continue.
              This can be useful for reducing load on busy clusters.
280:Type: Boolean
281:Default: ``false``
282
283
284``osd_scrub_thread_timeout``
285
286:Description: The maximum time in seconds before timing out a scrub thread.
287:Type: 32-bit Integer
288:Default: ``60``
289
290
291``osd_scrub_finalize_thread_timeout``
292
293:Description: The maximum time in seconds before timing out a scrub finalize
294 thread.
295
296:Type: 32-bit Integer
297:Default: ``10*60``
298
299
300``osd_scrub_load_threshold``
301
:Description: The normalized maximum load. Ceph will not scrub when the system
              load (as defined by ``getloadavg() / number of online CPUs``) is
              higher than this number.
305
306:Type: Float
307:Default: ``0.5``
308
309
310``osd_scrub_min_interval``
311
312:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
313 when the Ceph Storage Cluster load is low.
314
315:Type: Float
316:Default: Once per day. ``24*60*60``
317
318.. _osd_scrub_max_interval:
319
320``osd_scrub_max_interval``
321
322:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
323 irrespective of cluster load.
324
325:Type: Float
326:Default: Once per week. ``7*24*60*60``
327
328
329``osd_scrub_chunk_min``
330
:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during a
              scrub.
333
334:Type: 32-bit Integer
335:Default: 5
336
337
338``osd_scrub_chunk_max``
339
:Description: The maximum number of object store chunks to scrub during a single operation.
341
342:Type: 32-bit Integer
343:Default: 25
344
345
346``osd_scrub_sleep``
347
348:Description: Time to sleep before scrubbing the next group of chunks. Increasing this value will slow
349 down the overall rate of scrubbing so that client operations will be less impacted.
350
351:Type: Float
352:Default: 0
353
354
355``osd_deep_scrub_interval``
356
357:Description: The interval for "deep" scrubbing (fully reading all data). The
358 ``osd_scrub_load_threshold`` does not affect this setting.
359
360:Type: Float
361:Default: Once per week. ``7*24*60*60``
362
363
364``osd_scrub_interval_randomize_ratio``
365
:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling
              the next scrub job for a PG. The delay is a random
              value less than ``osd_scrub_min_interval`` \*
              ``osd_scrub_interval_randomize_ratio``. The default setting
              spreads scrubs throughout the allowed time
              window of ``[1, 1.5]`` \* ``osd_scrub_min_interval``.
372:Type: Float
373:Default: ``0.5``
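
With the defaults, the arithmetic works out as sketched below (comment-only
illustration):

.. code-block:: ini

   # osd_scrub_min_interval             = 86400   (24 hours)
   # osd_scrub_interval_randomize_ratio = 0.5
   # => each PG's next scrub is scheduled at a random point between
   #    86400 s (24 h) and 86400 + 0.5 * 86400 = 129600 s (36 h)
   #    after its previous scrub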
374
375``osd_deep_scrub_stride``
376
377:Description: Read size when doing a deep scrub.
378:Type: 32-bit Integer
379:Default: 512 KB. ``524288``
380
381
382``osd_scrub_auto_repair``
383
384:Description: Setting this to ``true`` will enable automatic PG repair when errors
385 are found by scrubs or deep-scrubs. However, if more than
386 ``osd_scrub_auto_repair_num_errors`` errors are found a repair is NOT performed.
387:Type: Boolean
388:Default: ``false``
389
390
391``osd_scrub_auto_repair_num_errors``
392
393:Description: Auto repair will not occur if more than this many errors are found.
394:Type: 32-bit Integer
395:Default: ``5``
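
As a hedged example, automatic repair of scrub errors can be enabled while
keeping the default cap on the number of errors:

.. code-block:: ini

   [osd]
   # repair inconsistencies found by scrubs and deep-scrubs automatically ...
   osd_scrub_auto_repair = true
   # ... but only when no more than this many errors are found
   osd_scrub_auto_repair_num_errors = 5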
396
397
398.. index:: OSD; operations settings
399
400Operations
401==========
402
``osd_op_queue``
404
405:Description: This sets the type of queue to be used for prioritizing ops
406 within each OSD. Both queues feature a strict sub-queue which is
407 dequeued before the normal queue. The normal queue is different
408 between implementations. The WeightedPriorityQueue (``wpq``)
409 dequeues operations in relation to their priorities to prevent
410 starvation of any queue. WPQ should help in cases where a few OSDs
411 are more overloaded than others. The new mClockQueue
412 (``mclock_scheduler``) prioritizes operations based on which class
413 they belong to (recovery, scrub, snaptrim, client op, osd subop).
414 See `QoS Based on mClock`_. Requires a restart.
415
416:Type: String
417:Valid Choices: wpq, mclock_scheduler
418:Default: ``wpq``
419
420
421``osd_op_queue_cut_off``
422
:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the
              ``high`` option sends only replication acknowledgment ops and
              higher to the strict queue. Setting this to ``high`` should help
              when a few OSDs in the cluster are very busy, especially when
              combined with ``wpq`` in the ``osd_op_queue`` setting. Without
              these settings, OSDs that are very busy handling replication
              traffic could starve primary client traffic. Requires a restart.
432
433:Type: String
434:Valid Choices: low, high
435:Default: ``high``
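
The two options above are typically considered together. A hedged example
that sets both explicitly in the configuration file (both require an OSD
restart to take effect, as noted above):

.. code-block:: ini

   [osd]
   # keep the default WeightedPriorityQueue scheduler ...
   osd_op_queue = wpq
   # ... and send only the highest-priority ops to the strict queue
   osd_op_queue_cut_off = high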
436
437
438``osd_client_op_priority``
439
440:Description: The priority set for client operations. This value is relative
441 to that of ``osd_recovery_op_priority`` below. The default
442 strongly favors client ops over recovery.
443
444:Type: 32-bit Integer
445:Default: ``63``
446:Valid Range: 1-63
447
448
449``osd_recovery_op_priority``
450
451:Description: The priority of recovery operations vs client operations, if not specified by the
452 pool's ``recovery_op_priority``. The default value prioritizes client
453 ops (see above) over recovery ops. You may adjust the tradeoff of client
454 impact against the time to restore cluster health by lowering this value
455 for increased prioritization of client ops, or by increasing it to favor
456 recovery.
457
458:Type: 32-bit Integer
459:Default: ``3``
460:Valid Range: 1-63
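
For instance, to favor client I/O over recovery even more strongly on a
latency-sensitive cluster, the recovery priority can be lowered relative to
``osd_client_op_priority`` (illustrative values; the defaults already favor
client ops):

.. code-block:: ini

   [osd]
   # clients keep the maximum priority (the default)
   osd_client_op_priority = 63
   # push recovery ops further down (the default is 3)
   osd_recovery_op_priority = 1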
461
462
463``osd_scrub_priority``
464
465:Description: The default work queue priority for scheduled scrubs when the
466 pool doesn't specify a value of ``scrub_priority``. This can be
467 boosted to the value of ``osd_client_op_priority`` when scrubs are
468 blocking client operations.
469
470:Type: 32-bit Integer
471:Default: ``5``
472:Valid Range: 1-63
473
474
475``osd_requested_scrub_priority``
476
:Description: The priority set for user-requested scrubs on the work queue. If
              this value is smaller than ``osd_client_op_priority``, it can be
              boosted to the value of ``osd_client_op_priority`` when a scrub
              is blocking client operations.
481
482:Type: 32-bit Integer
483:Default: ``120``
484
485
486``osd_snap_trim_priority``
487
488:Description: The priority set for the snap trim work queue.
489
490:Type: 32-bit Integer
491:Default: ``5``
492:Valid Range: 1-63
493
494``osd_snap_trim_sleep``
495
:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.
499
500:Type: Float
501:Default: ``0``
502
503
504``osd_snap_trim_sleep_hdd``
505
:Description: Time in seconds to sleep before the next snap trim op
              for HDDs.
508
509:Type: Float
510:Default: ``5``
511
512
513``osd_snap_trim_sleep_ssd``
514
:Description: Time in seconds to sleep before the next snap trim op
              for SSD OSDs (including NVMe).
517
518:Type: Float
519:Default: ``0``
520
521
522``osd_snap_trim_sleep_hybrid``
523
:Description: Time in seconds to sleep before the next snap trim op
              when OSD data is on an HDD and the OSD journal or WAL+DB is on an SSD.
526
527:Type: Float
528:Default: ``2``
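
If snapshot deletion is visibly affecting client latency, the snap trim
sleeps can be raised; the values below are an illustrative sketch, not tuned
recommendations:

.. code-block:: ini

   [osd]
   # pause longer between snap trim ops on spinning media (default 5)
   osd_snap_trim_sleep_hdd = 10
   # add a small pause even on flash-backed OSDs (default 0)
   osd_snap_trim_sleep_ssd = 0.5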
529
530``osd_op_thread_timeout``
531
532:Description: The Ceph OSD Daemon operation thread timeout in seconds.
533:Type: 32-bit Integer
534:Default: ``15``
535
536
537``osd_op_complaint_time``
538
:Description: An operation becomes complaint-worthy after the specified number
              of seconds has elapsed.
541
542:Type: Float
543:Default: ``30``
544
545
546``osd_op_history_size``
547
548:Description: The maximum number of completed operations to track.
549:Type: 32-bit Unsigned Integer
550:Default: ``20``
551
552
553``osd_op_history_duration``
554
:Description: The duration in seconds for which to track completed operations.
556:Type: 32-bit Unsigned Integer
557:Default: ``600``
558
559
560``osd_op_log_threshold``
561
:Description: How many operation logs to display at once.
563:Type: 32-bit Integer
564:Default: ``5``
565
566
567.. _dmclock-qos:
568
569QoS Based on mClock
570-------------------
571
Ceph's use of mClock is now more refined and can be configured by following
the steps described in `mClock Config Reference`_.
574
575Core Concepts
576`````````````
577
578Ceph's QoS support is implemented using a queueing scheduler
579based on `the dmClock algorithm`_. This algorithm allocates the I/O
580resources of the Ceph cluster in proportion to weights, and enforces
581the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
resources into the following buckets:
585
- client op: the IOPS issued by clients
- osd subop: the IOPS issued by the primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests
591
The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity when there is extra capacity or
   the system is oversubscribed.
599
In Ceph, operations are graded with a "cost", and the resources allocated
for serving the various services are consumed by these costs. So, for
example, the more reservation a service has, the more resources it is
guaranteed to possess, as long as it requires them. Assume there are two
services, recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it asks for more (see CURRENT IMPLEMENTATION NOTE
below) and no other service is competing with it. But if clients start to
issue a large number of I/O requests, they will not exhaust all the I/O
resources either: 1 request per second is always allocated to recovery jobs
as long as there are any such requests, so recovery won't be starved even in
a cluster with high load. In the meantime, client ops enjoy a larger portion
of the I/O resources because their weight is "9" while their competitor's is
"1". Client ops are not clamped by the limit setting, so they can make use of
all the resources if there is no recovery ongoing.
621
622CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
623values. Therefore, if a service crosses the enforced limit, the op remains
624in the operation queue until the limit is restored.
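
As a rough, hedged sketch only: expressed with the ``osd_mclock_scheduler_*``
options documented later in this section, the (r:1, l:5, w:1) / (r:2, l:0,
w:9) example above would look approximately like the following. The limit of
0 in the example is rendered here as a very large value, since these options
take unsigned integers, and depending on the release additional steps (such
as selecting a custom mClock profile) may be needed; see `mClock Config
Reference`_.

.. code-block:: ini

   [osd]
   osd_op_queue = mclock_scheduler
   # recovery: reservation 1, limit 5, weight 1
   osd_mclock_scheduler_background_recovery_res = 1
   osd_mclock_scheduler_background_recovery_lim = 5
   osd_mclock_scheduler_background_recovery_wgt = 1
   # client ops: reservation 2, effectively no limit, weight 9
   osd_mclock_scheduler_client_res = 2
   osd_mclock_scheduler_client_lim = 999999
   osd_mclock_scheduler_client_wgt = 9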
625
626Subtleties of mClock
627````````````````````
628
629The reservation and limit values have a unit of requests per
630second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then requests of the latter class
should be executed at a 9 to 1 ratio relative to the first class. However,
that will only happen once the reservations are met, and those values include
the operations executed under the reservation phase.
636
637Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
639requests. If the weight is *W*, then for a given class of requests,
640the next one that comes in will have a weight tag of *1/W* plus the
641previous weight tag or the current time, whichever is larger. That
642means if *W* is sufficiently large and therefore *1/W* is sufficiently
643small, the calculated tag may never be assigned as it will get a value
644of the current time. The ultimate lesson is that values for weight
645should not be too large. They should be under the number of requests
646one expects to be serviced each second.
647
648Caveats
649```````
650
651There are some factors that can reduce the impact of the mClock op
652queues within Ceph. First, requests to an OSD are sharded by their
653placement group identifier. Each shard has its own mClock queue and
654these queues neither interact nor share information among them. The
655number of shards can be controlled with the configuration options
656``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
657``osd_op_num_shards_ssd``. A lower number of shards will increase the
658impact of the mClock queues, but may have other deleterious effects.
659
660Second, requests are transferred from the operation queue to the
661operation sequencer, in which they go through the phases of
662execution. The operation queue is where mClock resides and mClock
663determines the next op to transfer to the operation sequencer. The
664number of operations allowed in the operation sequencer is a complex
665issue. In general we want to keep enough operations in the sequencer
666so it's always getting work done on some operations while it's waiting
667for disk and network access to complete on other operations. On the
668other hand, once an operation is transferred to the operation
669sequencer, mClock no longer has control over it. Therefore to maximize
670the impact of mClock, we want to keep as few operations in the
671operation sequencer as possible. So we have an inherent tension.
672
673The configuration options that influence the number of operations in
674the operation sequencer are ``bluestore_throttle_bytes``,
675``bluestore_throttle_deferred_bytes``,
676``bluestore_throttle_cost_per_io``,
677``bluestore_throttle_cost_per_io_hdd``, and
678``bluestore_throttle_cost_per_io_ssd``.
679
680A third factor that affects the impact of the mClock algorithm is that
681we're using a distributed system, where requests are made to multiple
682OSDs and each OSD has (can have) multiple shards. Yet we're currently
683using the mClock algorithm, which is not distributed (note: dmClock is
684the distributed version of mClock).
685
Various organizations and individuals are currently experimenting with
mClock as it exists in this code base, along with their own modifications to
the code base. We hope you will share your experiences with your mClock and
dmClock experiments on the ``ceph-devel`` mailing list.
690
691
692``osd_push_per_object_cost``
693
:Description: The overhead for serving a push op.
695
696:Type: Unsigned Integer
697:Default: 1000
698
699
700``osd_recovery_max_chunk``
701
:Description: The maximum total size of data chunks a recovery op can carry.
703
704:Type: Unsigned Integer
705:Default: 8 MiB
706
707
708``osd_mclock_scheduler_client_res``
709
710:Description: IO proportion reserved for each client (default).
711
712:Type: Unsigned Integer
713:Default: 1
714
715
716``osd_mclock_scheduler_client_wgt``
717
718:Description: IO share for each client (default) over reservation.
719
720:Type: Unsigned Integer
721:Default: 1
722
723
724``osd_mclock_scheduler_client_lim``
725
726:Description: IO limit for each client (default) over reservation.
727
728:Type: Unsigned Integer
729:Default: 999999
730
731
732``osd_mclock_scheduler_background_recovery_res``
733
734:Description: IO proportion reserved for background recovery (default).
735
736:Type: Unsigned Integer
737:Default: 1
738
739
740``osd_mclock_scheduler_background_recovery_wgt``
741
742:Description: IO share for each background recovery over reservation.
743
744:Type: Unsigned Integer
745:Default: 1
746
747
748``osd_mclock_scheduler_background_recovery_lim``
749
750:Description: IO limit for background recovery over reservation.
751
752:Type: Unsigned Integer
753:Default: 999999
754
755
756``osd_mclock_scheduler_background_best_effort_res``
757
758:Description: IO proportion reserved for background best_effort (default).
759
760:Type: Unsigned Integer
761:Default: 1
762
763
764``osd_mclock_scheduler_background_best_effort_wgt``
765
766:Description: IO share for each background best_effort over reservation.
767
768:Type: Unsigned Integer
769:Default: 1
770
771
772``osd_mclock_scheduler_background_best_effort_lim``
773
774:Description: IO limit for background best_effort over reservation.
775
776:Type: Unsigned Integer
777:Default: 999999
778
779.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
780
781
782.. index:: OSD; backfilling
783
784Backfilling
785===========
786
When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH
rebalances the cluster by moving placement groups to or from Ceph OSDs
to restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.
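
For example, to let a cluster refill a replaced OSD more quickly at the cost
of more client impact, ``osd_max_backfills`` can be raised temporarily; the
value below is illustrative, not a recommendation:

.. code-block:: ini

   [osd]
   # allow up to 2 concurrent backfills to or from each OSD (default 1)
   osd_max_backfills = 2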
794
795
796``osd_max_backfills``
797
798:Description: The maximum number of backfills allowed to or from a single OSD.
799 Note that this is applied separately for read and write operations.
800:Type: 64-bit Unsigned Integer
801:Default: ``1``
802
803
804``osd_backfill_scan_min``
805
806:Description: The minimum number of objects per backfill scan.
807
808:Type: 32-bit Integer
809:Default: ``64``
810
811
812``osd_backfill_scan_max``
813
814:Description: The maximum number of objects per backfill scan.
815
816:Type: 32-bit Integer
817:Default: ``512``
818
819
820``osd_backfill_retry_interval``
821
822:Description: The number of seconds to wait before retrying backfill requests.
823:Type: Double
824:Default: ``10.0``
825
826.. index:: OSD; osdmap
827
828OSD Map
829=======
830
831OSD maps reflect the OSD daemons operating in the cluster. Over time, the
832number of map epochs increases. Ceph provides some settings to ensure that
833Ceph performs well as the OSD map grows larger.
834
835
836``osd_map_dedup``
837
838:Description: Enable removing duplicates in the OSD map.
839:Type: Boolean
840:Default: ``true``
841
842
843``osd_map_cache_size``
844
845:Description: The number of OSD maps to keep cached.
846:Type: 32-bit Integer
847:Default: ``50``
848
849
850``osd_map_message_max``
851
852:Description: The maximum map entries allowed per MOSDMap message.
853:Type: 32-bit Integer
854:Default: ``40``
855
856
857
858.. index:: OSD; recovery
859
860Recovery
861========
862
863When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
864begins peering with other Ceph OSD Daemons before writes can occur. See
865`Monitoring OSDs and PGs`_ for details.
866
867If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
868sync with other Ceph OSD Daemons containing more recent versions of objects in
869the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
870mode and seeks to get the latest copy of the data and bring its map back up to
871date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
872and placement groups may be significantly out of date. Also, if a failure domain
873went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
874the same time. This can make the recovery process time consuming and resource
875intensive.
876
To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.
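
As a hedged illustration, recovery pressure on HDD-backed OSDs could be
reduced below the defaults with something like the following (values are
examples only):

.. code-block:: ini

   [osd]
   # fewer concurrent recovery requests per HDD-backed OSD (default 3)
   osd_recovery_max_active_hdd = 1
   # a longer pause between recovery/backfill ops on HDDs (default 0.1)
   osd_recovery_sleep_hdd = 0.2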
880
881
882``osd_recovery_delay_start``
883
884:Description: After peering completes, Ceph will delay for the specified number
885 of seconds before starting to recover RADOS objects.
886
887:Type: Float
888:Default: ``0``
889
890
891``osd_recovery_max_active``
892
:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.
896
897 This value is only used if it is non-zero. Normally it
898 is ``0``, which means that the ``hdd`` or ``ssd`` values
899 (below) are used, depending on the type of the primary
900 device backing the OSD.
901
902:Type: 32-bit Integer
903:Default: ``0``
904
905``osd_recovery_max_active_hdd``
906
907:Description: The number of active recovery requests per OSD at one time, if the
908 primary device is rotational.
909
910:Type: 32-bit Integer
911:Default: ``3``
912
913``osd_recovery_max_active_ssd``
914
915:Description: The number of active recovery requests per OSD at one time, if the
916 primary device is non-rotational (i.e., an SSD).
917
918:Type: 32-bit Integer
919:Default: ``10``
920
921
922``osd_recovery_max_chunk``
923
924:Description: The maximum size of a recovered chunk of data to push.
925:Type: 64-bit Unsigned Integer
926:Default: ``8 << 20``
927
928
929``osd_recovery_max_single_start``
930
931:Description: The maximum number of recovery operations per OSD that will be
932 newly started when an OSD is recovering.
933:Type: 64-bit Unsigned Integer
934:Default: ``1``
935
936
937``osd_recovery_thread_timeout``
938
939:Description: The maximum time in seconds before timing out a recovery thread.
940:Type: 32-bit Integer
941:Default: ``30``
942
943
944``osd_recover_clone_overlap``
945
946:Description: Preserves clone overlap during recovery. Should always be set
947 to ``true``.
948
949:Type: Boolean
950:Default: ``true``
951
952
953``osd_recovery_sleep``
954
955:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operations so
              that client operations are less impacted.
958
959:Type: Float
960:Default: ``0``
961
962
963``osd_recovery_sleep_hdd``
964
965:Description: Time in seconds to sleep before next recovery or backfill op
966 for HDDs.
967
968:Type: Float
969:Default: ``0.1``
970
971
972``osd_recovery_sleep_ssd``
973
974:Description: Time in seconds to sleep before the next recovery or backfill op
975 for SSDs.
976
977:Type: Float
978:Default: ``0``
979
980
981``osd_recovery_sleep_hybrid``
982
983:Description: Time in seconds to sleep before the next recovery or backfill op
984 when OSD data is on HDD and OSD journal / WAL+DB is on SSD.
985
986:Type: Float
987:Default: ``0.025``
988
989
990``osd_recovery_priority``
991
:Description: The default priority set for the recovery work queue. This is not
              related to a pool's ``recovery_priority``.
994
995:Type: 32-bit Integer
996:Default: ``5``
997
998
999Tiering
1000=======
1001
1002``osd_agent_max_ops``
1003
1004:Description: The maximum number of simultaneous flushing ops per tiering agent
1005 in the high speed mode.
1006:Type: 32-bit Integer
1007:Default: ``4``
1008
1009
1010``osd_agent_max_low_ops``
1011
1012:Description: The maximum number of simultaneous flushing ops per tiering agent
1013 in the low speed mode.
1014:Type: 32-bit Integer
1015:Default: ``2``
1016
1017See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
1018objects within the high speed mode.
1019
1020Miscellaneous
1021=============
1022
1023
1024``osd_snap_trim_thread_timeout``
1025
1026:Description: The maximum time in seconds before timing out a snap trim thread.
1027:Type: 32-bit Integer
1028:Default: ``1*60*60``
1029
1030
1031``osd_backlog_thread_timeout``
1032
1033:Description: The maximum time in seconds before timing out a backlog thread.
1034:Type: 32-bit Integer
1035:Default: ``1*60*60``
1036
1037
1038``osd_default_notify_timeout``
1039
1040:Description: The OSD default notification timeout (in seconds).
1041:Type: 32-bit Unsigned Integer
1042:Default: ``30``
1043
1044
1045``osd_check_for_log_corruption``
1046
1047:Description: Check log files for corruption. Can be computationally expensive.
1048:Type: Boolean
1049:Default: ``false``
1050
1051
1052``osd_remove_thread_timeout``
1053
1054:Description: The maximum time in seconds before timing out a remove OSD thread.
1055:Type: 32-bit Integer
1056:Default: ``60*60``
1057
1058
1059``osd_command_thread_timeout``
1060
1061:Description: The maximum time in seconds before timing out a command thread.
1062:Type: 32-bit Integer
1063:Default: ``10*60``
1064
1065
1066``osd_delete_sleep``
1067
1068:Description: Time in seconds to sleep before the next removal transaction. This
1069 throttles the PG deletion process.
1070
1071:Type: Float
1072:Default: ``0``
1073
1074
1075``osd_delete_sleep_hdd``
1076
1077:Description: Time in seconds to sleep before the next removal transaction
1078 for HDDs.
1079
1080:Type: Float
1081:Default: ``5``
1082
1083
1084``osd_delete_sleep_ssd``
1085
1086:Description: Time in seconds to sleep before the next removal transaction
1087 for SSDs.
1088
1089:Type: Float
1090:Default: ``0``
1091
1092
1093``osd_delete_sleep_hybrid``
1094
1095:Description: Time in seconds to sleep before the next removal transaction
1096 when OSD data is on HDD and OSD journal or WAL+DB is on SSD.
1097
1098:Type: Float
1099:Default: ``1``
1100
1101
1102``osd_command_max_records``
1103
1104:Description: Limits the number of lost objects to return.
1105:Type: 32-bit Integer
1106:Default: ``256``
1107
1108
1109``osd_fast_fail_on_connection_refused``
1110
1111:Description: If this option is enabled, crashed OSDs are marked down
1112 immediately by connected peers and MONs (assuming that the
1113 crashed OSD host survives). Disable it to restore old
1114 behavior, at the expense of possible long I/O stalls when
1115 OSDs crash in the middle of I/O operations.
1116:Type: Boolean
1117:Default: ``true``
1118
1119
1120
1121.. _pool: ../../operations/pools
1122.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
1123.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
1124.. _Pool & PG Config Reference: ../pool-pg-config-ref
1125.. _Journal Config Reference: ../journal-ref
1126.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
1127.. _mClock Config Reference: ../mclock-config-ref