======================
OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent
releases, the central config store), but Ceph OSD Daemons can use the default
values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``host`` and uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as
   doing so makes it more problematic to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD) and mount it
such that Ceph uses the entire partition for the journal.
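
The sizing rule above amounts to a quick calculation. The sketch below uses a
hypothetical helper name and example numbers (the 150 MB/s drive speed is an
assumption for illustration; ``filestore_max_sync_interval`` defaults to 5
seconds):

```python
def min_filestore_journal_mb(drive_speed_mb_s, max_sync_interval_s):
    """The journal should hold at least twice the data the drive can
    write during one filestore_max_sync_interval."""
    return 2 * drive_speed_mb_s * max_sync_interval_s

# e.g. an HDD sustaining ~150 MB/s with a 5-second sync interval:
print(min_filestore_journal_mb(150, 5))  # 1500 (MB)
```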

.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type
              {fs-type}.
:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type
              {fs-type}.
:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd_mount_options_xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since the
Luminous release, BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSD, NVMe) and slower ones (like spinning
drives), it makes sense to place the journal on the faster device, while
``data`` fully occupies the slower device.

The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file. A
value of 10 gigabytes is common in practice::

    osd_journal_size = 10240


.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

.. _rados_config_scrubbing:

Scrubbing
=========

One way that Ceph ensures data integrity is by "scrubbing" placement groups.
Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
generates a catalog of all objects in each placement group and compares each
primary object to its replicas, ensuring that no objects are missing or
mismatched. Light scrubbing checks the object size and attributes, and is
usually done daily. Deep scrubbing reads the data and uses checksums to ensure
data integrity, and is usually done weekly. The frequencies of both light
scrubbing and deep scrubbing are determined by the cluster's configuration,
which is fully under your control and subject to the settings explained below
in this section.

Although scrubbing is important for maintaining data integrity, it can reduce
the performance of the Ceph cluster. You can adjust the following settings to
increase or decrease the frequency and depth of scrubbing operations.


.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors
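
To build intuition for how ``osd_scrub_min_interval`` and
``osd_scrub_interval_randomize_ratio`` interact, here is a simplified sketch of
the scheduling idea (an approximation only; the exact in-tree formula may
differ):

```python
import random

def next_shallow_scrub(last_scrub, min_interval, randomize_ratio):
    """Spread scrubs out by adding a random slack of up to
    min_interval * randomize_ratio on top of the minimum interval."""
    slack = random.uniform(0, min_interval * randomize_ratio)
    return last_scrub + min_interval + slack

# With a min interval of one day and a randomize ratio of 0.5, a PG
# becomes eligible for scrubbing 24 to 36 hours after its last scrub.
day = 24 * 3600
t = next_shallow_scrub(0, day, 0.5)
assert day <= t <= 1.5 * day
```

The randomization prevents all placement groups from becoming scrub-eligible at
the same moment, which would otherwise cause periodic load spikes.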

.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold
.. confval:: osd_op_thread_suicide_timeout

.. note:: See https://old.ceph.com/planet/dealing-with-some-osd-timeouts/ for
   more on ``osd_op_thread_suicide_timeout``. Be aware that this is a link to a
   reworking of a blog post from 2017, and that its conclusion will direct you
   back to this page "for more information".

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler based on `the
dmClock algorithm`_. This algorithm allocates the I/O resources of the Ceph
cluster in proportion to weights, and enforces the constraints of minimum
reservation and maximum limitation, so that services can compete for the
resources fairly. Currently the *mclock_scheduler* operation queue divides
the Ceph services involving I/O resources into the following buckets:

- client op: the IOPS issued by a client
- osd subop: the IOPS issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with "cost", and the resources allocated for
serving the various services are consumed by these "costs". So, for example,
the more reservation a service has, the more resources it is guaranteed to
possess, as long as it requires them. Assume there are two services, recovery
and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it requires more (see CURRENT IMPLEMENTATION NOTE
below) and no other services are competing with it. But if the clients start
to issue a large amount of I/O requests, they will not exhaust all the I/O
resources either: 1 request per second is always allocated for recovery jobs
as long as there are any such requests, so the recovery jobs won't be starved
even in a cluster with high load. Meanwhile, the client ops can enjoy a larger
portion of the I/O resources, because their weight is "9" while their
competitor's is "1". In the case of client ops, they are not clamped by the
limit setting, so they can make use of all the resources if there is no
recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.
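
To make the arithmetic of the example profile concrete, here is a toy
allocation under a hypothetical fixed capacity of 100 IOPS. This is a
simplification for intuition only: the real scheduler is tag-based and decides
per request, it does not divide a fixed capacity, and this toy does not
redistribute the surplus freed by a clamped service.

```python
def toy_allocation(capacity_iops, services):
    """services maps name -> (reservation, limit, weight); limit 0 means
    unlimited. Grant every reservation first, then split the spare
    capacity by weight, clamping each service at its limit."""
    alloc = {name: res for name, (res, lim, wgt) in services.items()}
    spare = capacity_iops - sum(alloc.values())
    total_wgt = sum(wgt for _, _, wgt in services.values())
    for name, (res, lim, wgt) in services.items():
        share = res + spare * wgt / total_wgt
        alloc[name] = min(share, lim) if lim > 0 else share
    return alloc

alloc = toy_allocation(100, {
    "recovery":   (1, 5, 1),   # r:1, l:5, w:1
    "client_ops": (2, 0, 9),   # r:2, l:0 (no limit), w:9
})
# recovery is clamped at its limit of 5 IOPS even though its weighted
# share would be larger; client ops keep their full weighted share.
```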

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per second. The
weight, however, does not technically have a unit, and the weights are
relative to one another. So if one class of requests has a weight of 1 and
another a weight of 9, then the latter class of requests should get executed
at a 9 to 1 ratio relative to the first class. However, that will only happen
once the reservations are met, and those values include the operations
executed under the reservation phase.

Even though the weights do not have units, one must be careful in choosing
their values due to how the algorithm assigns weight tags to requests. If the
weight is *W*, then for a given class of requests, the next one that comes in
will have a weight tag of *1/W* plus the previous weight tag, or the current
time, whichever is larger. That means if *W* is sufficiently large and
therefore *1/W* is sufficiently small, the calculated tag may never be
assigned, as it will get the value of the current time. The ultimate lesson
is that values for weight should not be too large. They should be under the
number of requests one expects to be serviced each second.
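
The tag rule described above can be demonstrated directly. This sketch covers
only the weight-tag update, not the full scheduler, and the numbers are
illustrative:

```python
def next_weight_tag(prev_tag, now, weight):
    """Weight tag is the previous tag advanced by 1/W, or the current
    time, whichever is larger."""
    return max(prev_tag + 1.0 / weight, now)

now = 1000.0

# Moderate weight: the 1/10 s tag spacing is meaningful, so a burst of
# 20 requests gets tags spread out well beyond the current time.
tag_moderate = now
for _ in range(20):
    tag_moderate = next_weight_tag(tag_moderate, now, 10)
print(tag_moderate)  # 1002.0

# Huge weight: 1/W is so small that the tags collapse onto the current
# time, and the ordering information the tags should carry is lost.
tag_huge = now
for _ in range(20):
    tag_huge = next_weight_tag(tag_huge, now, 10**9)
print(tag_huge - now)  # ~2e-08, effectively stuck at "now"
```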

Caveats
```````

There are some factors that can reduce the impact of the mClock op queues
within Ceph. First, requests to an OSD are sharded by their placement group
identifier. Each shard has its own mClock queue, and these queues neither
interact nor share information among themselves. The number of shards can be
controlled with the configuration options :confval:`osd_op_num_shards`,
:confval:`osd_op_num_shards_hdd`, and :confval:`osd_op_num_shards_ssd`. A
lower number of shards will increase the impact of the mClock queues, but may
have other deleterious effects.

Second, requests are transferred from the operation queue to the operation
sequencer, in which they go through the phases of execution. The operation
queue is where mClock resides, and mClock determines the next op to transfer
to the operation sequencer. The number of operations allowed in the operation
sequencer is a complex issue. In general, we want to keep enough operations
in the sequencer so that it is always getting work done on some operations
while it is waiting for disk and network access to complete on other
operations. On the other hand, once an operation is transferred to the
operation sequencer, mClock no longer has control over it. Therefore, to
maximize the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in the
operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.

A third factor that affects the impact of the mClock algorithm is that we're
using a distributed system, where requests are made to multiple OSDs and each
OSD has (or can have) multiple shards. Yet we're currently using the mClock
algorithm, which is not distributed (note: dmClock is the distributed version
of mClock).

Various organizations and individuals are currently experimenting with mClock
as it exists in this code base, along with their modifications to the code
base. We hope you'll share your experiences with your mClock and dmClock
experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs to
restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this
migration with 'backfilling', which allows Ceph to set backfill operations to
a lower priority than requests to read or write data.


.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max

.. index:: OSD; recovery

Recovery
========

When the cluster starts, or when a Ceph OSD Daemon crashes and restarts, the
OSD begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with the other Ceph OSD Daemons, which contain more recent versions of
objects in the placement groups. When this happens, the Ceph OSD Daemon goes
into recovery mode and seeks to get the latest copy of the data and bring its
map back up to date. Depending upon how long the Ceph OSD Daemon was down, the
OSD's objects and placement groups may be significantly out of date. Also, if
a failure domain went down (e.g., a rack), more than one Ceph OSD Daemon may
come back online at the same time. This can make the recovery process time
consuming and resource intensive.

To maintain operational performance, Ceph performs recovery with limitations
on the number of recovery requests, threads, and object chunk sizes, which
allows Ceph to perform well in a degraded state.

.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority

Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref