======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or, in recent
releases, the central config store), but Ceph OSD Daemons can use the default
values and a very minimal configuration. A minimal Ceph OSD Daemon configuration
sets ``osd_journal_size`` (for Filestore) and ``host``, and uses default values
for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0`` using the following convention. ::

    osd.0
    osd.1
    osd.2

In a configuration file, you can specify settings for all Ceph OSD Daemons in
the cluster by adding them to the ``[osd]`` section. To add settings for a
specific Ceph OSD Daemon (e.g., ``host``), enter them in an OSD-specific
section of your configuration file. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as
   doing so makes it more problematic to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD) and mount it
such that Ceph uses the entire partition for the journal.

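As a rough worked example, assume a journal drive that sustains about 200 MB/s
of writes and a ``filestore_max_sync_interval`` of 5 seconds::

    2 * 200 MB/s * 5 s = 2000 MB

A journal of at least 2 GB would therefore be a reasonable starting point for
such a drive.
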
.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

  osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

  osd_mount_options_xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be on the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSDs, NVMe) with slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

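For example, here is a minimal sketch of a per-OSD journal override in
``ceph.conf``, assuming a hypothetical partition label on a faster device
(deployment tools typically set this up for you when the OSD is created):

.. code-block:: ini

    [osd.0]
    # Hypothetical fast partition dedicated to this OSD's journal.
    osd_journal = /dev/disk/by-partlabel/osd0-journal
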
The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file.
A value of 10 gigabytes is common in practice::

    osd_journal_size = 10240


.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.

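For example, here is a hedged sketch that lengthens the heartbeat grace period
on a high-latency network; the option is described in
`Configuring Monitor/OSD Interaction`_ and the value shown is illustrative:

.. code-block:: ini

    [global]
    # Set in [global] so that monitors and OSDs observe the same grace period.
    osd_heartbeat_grace = 30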

Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.

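For example, here is a hedged sketch that confines routine scrubbing to
overnight hours (the values are illustrative, and scrubs that have exceeded
``osd_scrub_max_interval`` may still run outside this window):

.. code-block:: ini

    [osd]
    # Illustrative values: allow routine scrubs only between 23:00 and 06:00.
    osd_scrub_begin_hour = 23
    osd_scrub_end_hour = 6
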
.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors

.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the IOPS issued by clients
- osd subop: the IOPS issued by the primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity or the
   system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resources it is
guaranteed to possess, as long as it requires them. Assuming there are two
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it requires more (see CURRENT IMPLEMENTATION NOTE
below) and no other services are competing with it. But if the clients start
to issue a large amount of I/O requests, they won't exhaust all the I/O
resources either: 1 request per second is always allocated for recovery jobs
as long as there are any such requests, so the recovery jobs won't be starved
even in a cluster with high load. In the meantime, the client ops can enjoy a
larger portion of the I/O resources, because their weight is "9" while their
competitor's is "1". Client ops are not clamped by a limit setting, so they
can make use of all the resources if there is no recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.

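As a hedged sketch only, the recovery and client-ops example above could be
expressed with the mClock scheduler options listed later in this section. This
assumes the ``custom`` mClock profile is in effect (see `mClock Config Reference`_)
and that your release interprets the reservation and limit values as IOPS:

.. code-block:: ini

    [osd]
    osd_op_queue = mclock_scheduler
    osd_mclock_profile = custom
    # client ops: (r:2, l:0, w:9); a limit of 0 means "not clamped by a limit"
    osd_mclock_scheduler_client_res = 2
    osd_mclock_scheduler_client_lim = 0
    osd_mclock_scheduler_client_wgt = 9
    # recovery: (r:1, l:5, w:1)
    osd_mclock_scheduler_background_recovery_res = 1
    osd_mclock_scheduler_background_recovery_lim = 5
    osd_mclock_scheduler_background_recovery_wgt = 1
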
Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

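As a sketch of the rule just described, the weight tag assigned to the next
request in a class with weight *W* can be written as::

    weight_tag(next) = max(weight_tag(previous) + 1/W, current_time)

Once ``current_time`` dominates, increasing *W* further has no effect, which is
why very large weight values are not useful.
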
Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information with one another. The
number of shards can be controlled with the configuration options
:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
to restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.

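For example, here is a hedged sketch that throttles backfill on each OSD (the
value shown is illustrative, and the best setting depends on your hardware and
workload):

.. code-block:: ini

    [osd]
    # Allow at most one concurrent backfill to or from each OSD.
    osd_max_backfills = 1
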
.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max

.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limits on the
number of recovery requests, threads, and object chunk sizes, which allows Ceph
to perform well in a degraded state.

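For example, here is a hedged sketch that slows recovery on HDD-backed OSDs so
that client I/O keeps priority (the values are illustrative only):

.. code-block:: ini

    [osd]
    # Illustrative throttles: fewer concurrent recovery ops per OSD and a
    # short pause between recovery ops on HDD-backed OSDs.
    osd_recovery_max_active = 1
    osd_recovery_sleep_hdd = 0.1
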
.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority

Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref