======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or, in
recent releases, in the central config store), but Ceph OSD Daemons can use
the default values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``osd journal size`` (for Filestore) and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

        osd.0
        osd.1
        osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

        [osd]
        osd_journal_size = 5120

        [osd.0]
        host = osd-host-a

        [osd.1]
        host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID and determine the paths
to data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as doing
             so makes it more problematic to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD) and mount it
such that Ceph uses the entire partition for the journal.

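For example, assuming (purely for illustration) a journal device that sustains
about 500 MB/s and the default ``filestore_max_sync_interval`` of 5 seconds,
the journal should be at least 2 x 500 MB/s x 5 s = 5000 MB. A minimal sketch
of the corresponding setting (``osd_journal_size`` is expressed in megabytes):

.. code-block:: ini

        [osd]
        # assumed for illustration: ~500 MB/s drive, 5 s filestore_max_sync_interval
        # 2 x 500 MB/s x 5 s = 5000 MB
        osd_journal_size = 5000
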
.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

  osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

  osd_mount_options_xfs = rw, noatime, inode64, logbufs=8


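Written out as they would appear in ``ceph.conf``, the XFS defaults above
correspond to the following sketch (it simply restates the defaults with the
expanded option names, to show how the ``{fs-type}`` placeholder resolves):

.. code-block:: ini

        [osd]
        # the XFS defaults listed above, spelled out explicitly
        osd_mkfs_options_xfs  = -f -i 2048
        osd_mount_options_xfs = rw,noatime,inode64
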
.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

        /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast (SSDs, NVMe) devices with slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

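In that case, ``osd_journal`` can be pointed at the faster device. A minimal
sketch, assuming a hypothetical dedicated journal partition (the path below is
illustrative only):

.. code-block:: ini

        [osd.0]
        # hypothetical journal partition on a faster (SSD/NVMe) device
        osd_journal = /dev/disk/by-partlabel/osd0-journal
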
The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file.
A value of 10 gigabytes is common in practice::

        osd_journal_size = 10240


.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors

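For example, a minimal sketch that confines scrubbing to off-peak hours using
two of the settings above (the hours shown are illustrative, not a
recommendation):

.. code-block:: ini

        [osd]
        # illustrative: prefer to begin new scrubs between 23:00 and 06:00 local time
        osd_scrub_begin_hour = 23
        osd_scrub_end_hour = 6
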
.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps as described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the IOPS issued by a client
- osd subop: the IOPS issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity exists or the
   system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assume there are two
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5
requests per second serviced, even if it requires them (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large number of I/O requests,
neither will they exhaust all the I/O resources: 1 request per second
is always allocated for recovery jobs as long as there are any such
requests, so the recovery jobs won't be starved even in a cluster with
high load. In the meantime, the client ops can enjoy a larger
portion of the I/O resource, because their weight is "9", while their
competitor's is "1". Client ops are not clamped by the limit setting,
so they can make use of all the resources if there is no
recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.

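If you tune these values directly (the profiles in `mClock Config Reference`_
are the recommended route), the example above roughly maps onto the
``osd_mclock_scheduler_*`` options listed at the end of this section. A minimal
sketch with the same illustrative numbers, assuming the ``custom`` mClock
profile is in effect:

.. code-block:: ini

        [osd]
        # illustrative values taken from the recovery/client example above
        osd_mclock_scheduler_background_recovery_res = 1
        osd_mclock_scheduler_background_recovery_lim = 5
        osd_mclock_scheduler_background_recovery_wgt = 1
        osd_mclock_scheduler_client_res = 2
        osd_mclock_scheduler_client_lim = 0
        osd_mclock_scheduler_client_wgt = 9
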
Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then requests of the latter
class should be executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

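Expressed as a formula (a paraphrase of the rule above, not taken verbatim from
the scheduler code), the weight tag assigned to the *i*-th request of a class
with weight *W* is:

.. math::

   \mathrm{tag}_i = \max\left(\mathrm{tag}_{i-1} + \frac{1}{W},\; t_{\mathrm{now}}\right)

When *1/W* is tiny compared to the gaps between request arrivals, the maximum is
always won by the current time, which is why very large weights stop
differentiating the classes.
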
Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among themselves. The
number of shards can be controlled with the configuration options
:confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
:confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore, to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
to restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.


.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval

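For example, a minimal sketch that uses the first setting above to cap the
number of concurrent backfills allowed to or from a single OSD (the value is
illustrative):

.. code-block:: ini

        [osd]
        # illustrative: allow at most one concurrent backfill per OSD
        osd_max_backfills = 1
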
.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max

.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority

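For example, a minimal sketch that throttles recovery further on HDD-backed
OSDs using two of the settings above (the values are illustrative, not tuned
recommendations):

.. code-block:: ini

        [osd]
        # illustrative: one active recovery op per OSD, with a longer sleep
        # between recovery ops on HDDs
        osd_recovery_max_active = 1
        osd_recovery_sleep_hdd = 0.2
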
Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref