======================
OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent
releases, the central config store), but Ceph OSD Daemons can use the default
values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``host`` and uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as
   doing so makes it more problematic to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD) and mount it
such that Ceph uses the entire partition for the journal.
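
The sizing rule above amounts to a quick calculation. The sketch below uses a
hypothetical helper name and example numbers (the 150 MB/s drive speed is an
assumption for illustration; ``filestore_max_sync_interval`` defaults to 5
seconds):

```python
def min_filestore_journal_mb(drive_speed_mb_s, max_sync_interval_s):
    """The journal should hold at least twice the data the drive can
    write during one filestore_max_sync_interval."""
    return 2 * drive_speed_mb_s * max_sync_interval_s

# e.g. an HDD sustaining ~150 MB/s with a 5-second sync interval:
print(min_filestore_journal_mb(150, 5))  # 1500 (MB)
```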

.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type
              {fs-type}.
:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type
              {fs-type}.
:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd_mount_options_xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since the
Luminous release, BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSD, NVMe) and slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` fully occupies the slower device.

The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file. A
value of 10 gigabytes is common in practice::

    osd_journal_size = 10240


.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

.. _rados_config_scrubbing:

Scrubbing
=========

One way that Ceph ensures data integrity is by "scrubbing" placement groups.
Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
generates a catalog of all objects in each placement group and compares each
primary object to its replicas, ensuring that no objects are missing or
mismatched. Light scrubbing checks the object size and attributes, and is
usually done daily. Deep scrubbing reads the data and uses checksums to ensure
data integrity, and is usually done weekly. The frequencies of both light
scrubbing and deep scrubbing are determined by the cluster's configuration,
which is fully under your control and subject to the settings explained below
in this section.

Although scrubbing is important for maintaining data integrity, it can reduce
the performance of the Ceph cluster. You can adjust the following settings to
increase or decrease the frequency and depth of scrubbing operations.


.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors
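
To build intuition for how ``osd_scrub_min_interval`` and
``osd_scrub_interval_randomize_ratio`` interact, here is a simplified sketch of
the scheduling idea (an approximation only; the exact in-tree formula may
differ):

```python
import random

def next_shallow_scrub(last_scrub, min_interval, randomize_ratio):
    """Spread scrubs out by adding a random slack of up to
    min_interval * randomize_ratio on top of the minimum interval."""
    slack = random.uniform(0, min_interval * randomize_ratio)
    return last_scrub + min_interval + slack

# With a min interval of one day and a randomize ratio of 0.5, a PG
# becomes eligible for scrubbing 24 to 36 hours after its last scrub.
day = 24 * 3600
t = next_shallow_scrub(0, day, 0.5)
assert day <= t <= 1.5 * day
```

The randomization prevents all placement groups from becoming scrub-eligible at
the same moment, which would otherwise cause periodic load spikes.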

.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold
.. confval:: osd_op_thread_suicide_timeout

.. note:: See https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ for
   more on ``osd_op_thread_suicide_timeout``. Be aware that this is a link to a
   reworking of a blog post from 2017, and that its conclusion will direct you
   back to this page "for more information".

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler based on `the
dmClock algorithm`_. This algorithm allocates the I/O resources of the Ceph
cluster in proportion to weights, and enforces the constraints of minimum
reservation and maximum limitation, so that services can compete for the
resources fairly. Currently the *mclock_scheduler* operation queue divides
the Ceph services involving I/O resources into the following buckets:

- client op: the IOPS issued by a client
- osd subop: the IOPS issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with "cost", and the resources allocated for
serving the various services are consumed by these "costs". So, for example,
the more reservation a service has, the more resources it is guaranteed to
possess, as long as it requires them. Assume there are two services, recovery
and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it requires more (see CURRENT IMPLEMENTATION NOTE
below) and no other services are competing with it. But if the clients start
to issue a large amount of I/O requests, they will not exhaust all the I/O
resources either: 1 request per second is always allocated for recovery jobs
as long as there are any such requests, so the recovery jobs won't be starved
even in a cluster with high load. Meanwhile, the client ops can enjoy a larger
portion of the I/O resources, because their weight is "9" while their
competitor's is "1". In the case of client ops, they are not clamped by the
limit setting, so they can make use of all the resources if there is no
recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.
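
To make the arithmetic of the example profile concrete, here is a toy
allocation under a hypothetical fixed capacity of 100 IOPS. This is a
simplification for intuition only: the real scheduler is tag-based and decides
per request, it does not divide a fixed capacity, and this toy does not
redistribute the surplus freed by a clamped service.

```python
def toy_allocation(capacity_iops, services):
    """services maps name -> (reservation, limit, weight); limit 0 means
    unlimited. Grant every reservation first, then split the spare
    capacity by weight, clamping each service at its limit."""
    alloc = {name: res for name, (res, lim, wgt) in services.items()}
    spare = capacity_iops - sum(alloc.values())
    total_wgt = sum(wgt for _, _, wgt in services.values())
    for name, (res, lim, wgt) in services.items():
        share = res + spare * wgt / total_wgt
        alloc[name] = min(share, lim) if lim > 0 else share
    return alloc

alloc = toy_allocation(100, {
    "recovery":   (1, 5, 1),   # r:1, l:5, w:1
    "client_ops": (2, 0, 9),   # r:2, l:0 (no limit), w:9
})
# recovery is clamped at its limit of 5 IOPS even though its weighted
# share would be larger; client ops keep their full weighted share.
```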

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per second. The
weight, however, does not technically have a unit, and the weights are
relative to one another. So if one class of requests has a weight of 1 and
another a weight of 9, then the latter class of requests should get executed
at a 9 to 1 ratio relative to the first class. However, that will only happen
once the reservations are met, and those values include the operations
executed under the reservation phase.

Even though the weights do not have units, one must be careful in choosing
their values due to how the algorithm assigns weight tags to requests. If the
weight is *W*, then for a given class of requests, the next one that comes in
will have a weight tag of *1/W* plus the previous weight tag, or the current
time, whichever is larger. That means if *W* is sufficiently large and
therefore *1/W* is sufficiently small, the calculated tag may never be
assigned, as it will get the value of the current time. The ultimate lesson
is that values for weight should not be too large. They should be under the
number of requests one expects to be serviced each second.
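
The tag rule described above can be demonstrated directly. This sketch covers
only the weight-tag update, not the full scheduler, and the numbers are
illustrative:

```python
def next_weight_tag(prev_tag, now, weight):
    """Weight tag is the previous tag advanced by 1/W, or the current
    time, whichever is larger."""
    return max(prev_tag + 1.0 / weight, now)

now = 1000.0

# Moderate weight: the 1/10 s tag spacing is meaningful, so a burst of
# 20 requests gets tags spread out well beyond the current time.
tag_moderate = now
for _ in range(20):
    tag_moderate = next_weight_tag(tag_moderate, now, 10)
print(tag_moderate)  # 1002.0

# Huge weight: 1/W is so small that the tags collapse onto the current
# time, and the ordering information the tags should carry is lost.
tag_huge = now
for _ in range(20):
    tag_huge = next_weight_tag(tag_huge, now, 10**9)
print(tag_huge - now)  # ~2e-08, effectively stuck at "now"
```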

Caveats
```````

There are some factors that can reduce the impact of the mClock op queues
within Ceph. First, requests to an OSD are sharded by their placement group
identifier. Each shard has its own mClock queue, and these queues neither
interact nor share information among themselves. The number of shards can be
controlled with the configuration options :confval:`osd_op_num_shards`,
:confval:`osd_op_num_shards_hdd`, and :confval:`osd_op_num_shards_ssd`. A
lower number of shards will increase the impact of the mClock queues, but may
have other deleterious effects.

Second, requests are transferred from the operation queue to the operation
sequencer, in which they go through the phases of execution. The operation
queue is where mClock resides, and mClock determines the next op to transfer
to the operation sequencer. The number of operations allowed in the operation
sequencer is a complex issue. In general, we want to keep enough operations
in the sequencer so that it is always getting work done on some operations
while it is waiting for disk and network access to complete on other
operations. On the other hand, once an operation is transferred to the
operation sequencer, mClock no longer has control over it. Therefore, to
maximize the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in the
operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.

A third factor that affects the impact of the mClock algorithm is that we're
using a distributed system, where requests are made to multiple OSDs and each
OSD has (or can have) multiple shards. Yet we're currently using the mClock
algorithm, which is not distributed (note: dmClock is the distributed version
of mClock).

Various organizations and individuals are currently experimenting with mClock
as it exists in this code base, along with their modifications to the code
base. We hope you'll share your experiences with your mClock and dmClock
experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs to
restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this
migration with 'backfilling', which allows Ceph to set backfill operations to
a lower priority than requests to read or write data.


.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max

.. index:: OSD; recovery

Recovery
========

When the cluster starts, or when a Ceph OSD Daemon crashes and restarts, the
OSD begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with the other Ceph OSD Daemons, which contain more recent versions of
objects in the placement groups. When this happens, the Ceph OSD Daemon goes
into recovery mode and seeks to get the latest copy of the data and bring its
map back up to date. Depending upon how long the Ceph OSD Daemon was down, the
OSD's objects and placement groups may be significantly out of date. Also, if
a failure domain went down (e.g., a rack), more than one Ceph OSD Daemon may
come back online at the same time. This can make the recovery process time
consuming and resource intensive.

To maintain operational performance, Ceph performs recovery with limitations
on the number of recovery requests, threads, and object chunk sizes, which
allows Ceph to perform well in a degraded state.

.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority

Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref