======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or, in
recent releases, the central config store), but Ceph OSD Daemons can use
default values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``host`` and uses default values for nearly everything
else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as
             doing so makes it more difficult to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD), and mount it
such that Ceph uses the entire partition for the journal.
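
For example, assuming a journal device that sustains roughly 100 MB/s of
writes (an assumed figure for illustration) and the default
``filestore_max_sync_interval`` of 5 seconds, the journal should be at least
2 × 100 MB/s × 5 s = 1000 MB, so a corresponding minimal setting (in MB) would
be:

.. code-block:: ini

    [osd]
    # Assumes ~100 MB/s sustained writes and a 5-second sync interval:
    # 2 x 100 MB/s x 5 s = 1000 MB (osd_journal_size is expressed in MB).
    osd_journal_size = 1024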

.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd_mount_options_xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be on the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSDs, NVMe) with slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file.
A value of 10 gigabytes is common in practice::

    osd_journal_size = 10240


.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.

Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
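
For example, on a network with occasional latency spikes you might lengthen
the heartbeat interval and grace period. The options are described in
`Configuring Monitor/OSD Interaction`_; the values below are an illustrative
sketch only, not recommendations:

.. code-block:: ini

    [osd]
    # Illustrative values only: ping peers less frequently and allow more
    # time before a silent peer is reported down.
    osd_heartbeat_interval = 12
    osd_heartbeat_grace = 40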


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

.. _rados_config_scrubbing:

Scrubbing
=========

One way that Ceph ensures data integrity is by "scrubbing" placement groups.
Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
generates a catalog of all objects in each placement group and compares each
primary object to its replicas, ensuring that no objects are missing or
mismatched. Light scrubbing checks the object size and attributes, and is
usually done daily. Deep scrubbing reads the data and uses checksums to ensure
data integrity, and is usually done weekly. The frequencies of both light
scrubbing and deep scrubbing are determined by the cluster's configuration,
which is fully under your control via the settings explained below.

Although scrubbing is important for maintaining data integrity, it can reduce
the performance of the Ceph cluster. You can adjust the following settings to
increase or decrease the frequency and depth of scrubbing operations.


.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors
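
As an illustration, the following sketch restricts scrubbing to overnight
hours, throttles scrub I/O slightly, and stretches the deep-scrub cadence. The
values are examples only and should be adapted to your workload:

.. code-block:: ini

    [osd]
    # Example values only: scrub between 23:00 and 06:00 local time.
    osd_scrub_begin_hour = 23
    osd_scrub_end_hour = 6
    # Pause briefly between scrub chunks to reduce client impact.
    osd_scrub_sleep = 0.1
    # Deep scrub every two weeks (value is in seconds).
    osd_deep_scrub_interval = 1209600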

.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold
.. confval:: osd_op_thread_suicide_timeout
.. note:: See https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ for
   more on ``osd_op_thread_suicide_timeout``. Be aware that this is a link to a
   reworking of a blog post from 2017, and that its conclusion will direct you
   back to this page "for more information".

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the IOPS issued by clients
- osd subop: the IOPS issued by the primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assume there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it needs more (see CURRENT IMPLEMENTATION NOTE below)
and no other services are competing with it. Conversely, if clients start to
issue a large number of I/O requests, they will not exhaust all the I/O
resources either: 1 request per second is always allocated for recovery jobs
as long as there are any such requests. So recovery jobs won't be starved even
in a cluster under high load. Meanwhile, client ops can enjoy a larger portion
of the I/O resources, because their weight is "9" while their competitor's is
"1". Client ops are not clamped by a limit setting, so they can make use of
all the resources if there is no recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.
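
Expressed in terms of the mClock scheduler options listed later in this
section, a configuration along these lines might look like the sketch below.
Note that the built-in mClock profiles normally manage these options (setting
``osd_mclock_profile = custom`` is required before adjusting them directly),
and that the units of the reservation and limit values differ across releases
(absolute IOPS in older releases, fractions of the OSD's IOPS capacity in more
recent ones), so treat the numbers as purely illustrative:

.. code-block:: ini

    [osd]
    # Illustrative sketch only; see the mClock Config Reference for the
    # units and valid ranges that apply to your release.
    osd_mclock_profile = custom
    # Guarantee a small share to background recovery, cap it, and give
    # client ops a much larger weight (a limit of 0 means "no limit").
    osd_mclock_scheduler_background_recovery_res = 0.1
    osd_mclock_scheduler_background_recovery_lim = 0.5
    osd_mclock_scheduler_background_recovery_wgt = 1
    osd_mclock_scheduler_client_res = 0.2
    osd_mclock_scheduler_client_lim = 0
    osd_mclock_scheduler_client_wgt = 9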

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.
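
As a sketch of the tag assignment just described, writing :math:`t_i` for the
arrival time of the *i*-th request in a class with weight :math:`W` (this
notation is ours, not the scheduler's internal naming), the weight tag evolves
as

.. math::

   \text{tag}_i = \max\left(\text{tag}_{i-1} + \frac{1}{W},\; t_i\right)

so for a sufficiently large :math:`W` the increment :math:`1/W` becomes
negligible and the tag simply tracks the current time.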

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact with one another nor share information. The
number of shards can be controlled with the configuration options
:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.
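
For example, a sketch of reducing the shard counts so that each mClock queue
sees a wider slice of the workload (the values are illustrative; fewer shards
also means less parallelism within the OSD):

.. code-block:: ini

    [osd]
    # Illustrative values only: fewer, larger shards per OSD.
    osd_op_num_shards_hdd = 2
    osd_op_num_shards_ssd = 4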

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore, to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.
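
A sketch of lowering the BlueStore throttles so that fewer operations sit in
the operation sequencer and the mClock queues see more of the workload (the
byte values are illustrative starting points, not recommendations):

.. code-block:: ini

    [osd]
    # Illustrative values only: smaller throttles keep more work in the
    # mClock queue, at the potential cost of device utilization.
    # 32 MiB:
    bluestore_throttle_bytes = 33554432
    # 64 MiB:
    bluestore_throttle_deferred_bytes = 67108864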

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (or can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base, along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
to restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.


.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval
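
As an illustration, a conservative backfill limit might look like the sketch
below. The value is an example only; on recent releases that use the mClock
scheduler, such limits may be managed by the active mClock profile (see
`mClock Config Reference`_) and may require an override before they take
effect:

.. code-block:: ini

    [osd]
    # Example value: allow only one concurrent backfill per OSD to keep
    # client impact low while data migrates.
    osd_max_backfills = 1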

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max
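
For example, a sketch that trims the memory used for cached OSD maps on
memory-constrained nodes (the values are illustrative; the defaults are
usually fine):

.. code-block:: ini

    [osd]
    # Illustrative values only: cache fewer historical OSD maps and send
    # fewer maps per message.
    osd_map_cache_size = 20
    osd_map_message_max = 20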

.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority
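
As an illustration, a sketch that slows recovery on HDD-backed OSDs to favor
client I/O (the values are examples only; as with backfill limits, recent
releases using the mClock scheduler may manage these settings via the active
profile):

.. code-block:: ini

    [osd]
    # Example values only: fewer concurrent recovery ops and a short
    # sleep between them on HDDs.
    osd_recovery_max_active_hdd = 1
    osd_recovery_sleep_hdd = 0.2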

Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref