======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or, in
recent releases, the central config store), but Ceph OSD Daemons can run with
default values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``osd_journal_size`` (for Filestore) and ``host``, and uses
default values for nearly everything else.
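
For example, in releases that support the central config store, a setting can
be stored centrally with the ``ceph config`` command instead of being written
to ``ceph.conf``. A minimal sketch (the option and targets shown are
illustrative):

.. code-block:: console

    # Apply a setting to all OSDs.
    $ ceph config set osd osd_max_backfills 2
    # Apply a setting to a single OSD.
    $ ceph config set osd.0 osd_max_backfills 1
    # Confirm the value an OSD will use.
    $ ceph config get osd.0 osd_max_backfills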

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b

.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as
   doing so makes it more difficult to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD), and mount it
such that Ceph uses the entire partition for the journal.
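
As a worked example (with illustrative numbers): for an expected sustained
drive speed of 100 MB/s and the default ``filestore_max_sync_interval`` of 5
seconds, the journal should be at least 2 × 100 MB/s × 5 s = 1000 MB.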

.. confval:: osd_uuid
.. confval:: osd_data
.. confval:: osd_max_write_size
.. confval:: osd_max_object_size
.. confval:: osd_client_message_size_cap
.. confval:: osd_class_dir
   :default: $libdir/rados-classes

.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw,noatime``

For example::

    osd_mount_options_xfs = rw,noatime,inode64,logbufs=8

.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since Luminous,
BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be in the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSDs, NVMe) with slower ones (like spinning
drives) it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

The default ``osd_journal_size`` value is 5120 megabytes (5 gigabytes), but it
can be larger, in which case it will need to be set in the ``ceph.conf`` file.
A value of 10 gigabytes is common in practice::

    osd_journal_size = 10240

.. confval:: osd_journal
.. confval:: osd_journal_size

See `Journal Config Reference`_ for additional details.

Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.

.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity
by scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations. An example of confining scrubs to off-peak hours follows
the list below.

.. confval:: osd_max_scrubs
.. confval:: osd_scrub_begin_hour
.. confval:: osd_scrub_end_hour
.. confval:: osd_scrub_begin_week_day
.. confval:: osd_scrub_end_week_day
.. confval:: osd_scrub_during_recovery
.. confval:: osd_scrub_load_threshold
.. confval:: osd_scrub_min_interval
.. confval:: osd_scrub_max_interval
.. confval:: osd_scrub_chunk_min
.. confval:: osd_scrub_chunk_max
.. confval:: osd_scrub_sleep
.. confval:: osd_deep_scrub_interval
.. confval:: osd_scrub_interval_randomize_ratio
.. confval:: osd_deep_scrub_stride
.. confval:: osd_scrub_auto_repair
.. confval:: osd_scrub_auto_repair_num_errors
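
For example, to confine scrubbing to the hours between 23:00 and 06:00, you
might set the begin and end hours in the central config store. A minimal
sketch (the values are illustrative; scrubs that are overdue past
``osd_scrub_max_interval`` may still run outside this window):

.. code-block:: console

    $ ceph config set osd osd_scrub_begin_hour 23
    $ ceph config set osd osd_scrub_end_hour 6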

.. index:: OSD; operations settings

Operations
==========

.. confval:: osd_op_num_shards
.. confval:: osd_op_num_shards_hdd
.. confval:: osd_op_num_shards_ssd
.. confval:: osd_op_queue
.. confval:: osd_op_queue_cut_off
.. confval:: osd_client_op_priority
.. confval:: osd_recovery_op_priority
.. confval:: osd_scrub_priority
.. confval:: osd_requested_scrub_priority
.. confval:: osd_snap_trim_priority
.. confval:: osd_snap_trim_sleep
.. confval:: osd_snap_trim_sleep_hdd
.. confval:: osd_snap_trim_sleep_ssd
.. confval:: osd_snap_trim_sleep_hybrid
.. confval:: osd_op_thread_timeout
.. confval:: osd_op_complaint_time
.. confval:: osd_op_history_size
.. confval:: osd_op_history_duration
.. confval:: osd_op_log_threshold

.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler based on `the
dmClock algorithm`_. This algorithm allocates the I/O resources of the Ceph
cluster in proportion to weights, and enforces the constraints of minimum
reservation and maximum limitation, so that the services can compete for the
resources fairly. Currently the *mclock_scheduler* operation queue divides
Ceph services involving I/O resources into the following buckets:

- client op: the IOPS issued by clients
- osd subop: the IOPS issued by the primary OSD
- snap trim: snap-trimming-related requests
- pg recovery: recovery-related requests
- pg scrub: scrub-related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated for
serving the various services are consumed by these "costs". So, for example,
the more reservation a service has, the more resources it is guaranteed to
possess, as long as it requires them. Assume there are two services, recovery
and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5 requests per
second serviced, even if it requires more (see CURRENT IMPLEMENTATION NOTE
below) and no other services are competing with it. But even if the clients
start to issue a large number of I/O requests, they will not exhaust all the
I/O resources: 1 request per second is always allocated for recovery jobs as
long as there are any such requests. So recovery jobs won't be starved even in
a cluster with high load. In the meantime, the client ops can enjoy a larger
portion of the I/O resources, because their weight is "9" while their
competitor's is "1". In the case of client ops, the limit setting does not
clamp them, so they can make use of all the resources if there is no recovery
ongoing.
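
As an illustration, the recovery and client-ops example above could be written
with the ``mclock_scheduler`` options listed later in this section. This is a
sketch that mirrors the example values; it is not tuning advice:

.. code-block:: ini

    [osd]
    osd_op_queue = mclock_scheduler
    # recovery: (r:1, l:5, w:1)
    osd_mclock_scheduler_background_recovery_res = 1
    osd_mclock_scheduler_background_recovery_lim = 5
    osd_mclock_scheduler_background_recovery_wgt = 1
    # client ops: (r:2, l:0, w:9); a limit of 0 means the class is not clamped
    osd_mclock_scheduler_client_res = 2
    osd_mclock_scheduler_client_lim = 0
    osd_mclock_scheduler_client_wgt = 9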

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains in
the operation queue until the limit is restored.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per second. The
weight, however, does not technically have a unit and the weights are relative
to one another. So if one class of requests has a weight of 1 and another a
weight of 9, then the latter class of requests should be executed at a 9 to 1
ratio relative to the first class. However, that will only happen once the
reservations are met, and those values include the operations executed under
the reservation phase.

Even though the weights do not have units, one must be careful in choosing
their values due to how the algorithm assigns weight tags to requests. If the
weight is *W*, then for a given class of requests, the next one that comes in
will have a weight tag of *1/W* plus the previous weight tag, or the current
time, whichever is larger. That means if *W* is sufficiently large and
therefore *1/W* is sufficiently small, the calculated tag may never be
assigned as it will get the value of the current time. The ultimate lesson is
that values for weight should not be too large. They should be under the
number of requests one expects to be serviced each second.
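
Restated as a formula: if *W* is the weight, *tag(i-1)* is the previous weight
tag for the class, and *t* is the current time, then the next request receives
the weight tag

.. math::

   tag_i = \max\left(tag_{i-1} + \frac{1}{W},\; t\right)

so for a sufficiently large *W* the *1/W* increment becomes negligible and the
tag collapses to the current time.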

Caveats
```````

There are some factors that can reduce the impact of the mClock op queues
within Ceph. First, requests to an OSD are sharded by their placement group
identifier. Each shard has its own mClock queue and these queues neither
interact nor share information among them. The number of shards can be
controlled with the configuration options :confval:`osd_op_num_shards`,
:confval:`osd_op_num_shards_hdd`, and :confval:`osd_op_num_shards_ssd`. A
lower number of shards will increase the impact of the mClock queues, but may
have other deleterious effects.

Second, requests are transferred from the operation queue to the operation
sequencer, in which they go through the phases of execution. The operation
queue is where mClock resides and mClock determines the next op to transfer to
the operation sequencer. The number of operations allowed in the operation
sequencer is a complex issue. In general we want to keep enough operations in
the sequencer so that it is always getting work done on some operations while
it is waiting for disk and network access to complete on other operations. On
the other hand, once an operation is transferred to the operation sequencer,
mClock no longer has control over it. Therefore, to maximize the impact of
mClock, we want to keep as few operations in the operation sequencer as
possible. So we have an inherent tension.

The configuration options that influence the number of operations in the
operation sequencer are :confval:`bluestore_throttle_bytes`,
:confval:`bluestore_throttle_deferred_bytes`,
:confval:`bluestore_throttle_cost_per_io`,
:confval:`bluestore_throttle_cost_per_io_hdd`, and
:confval:`bluestore_throttle_cost_per_io_ssd`.
314 | |
315 | A third factor that affects the impact of the mClock algorithm is that | |
316 | we're using a distributed system, where requests are made to multiple | |
317 | OSDs and each OSD has (can have) multiple shards. Yet we're currently | |
318 | using the mClock algorithm, which is not distributed (note: dmClock is | |
319 | the distributed version of mClock). | |

Various organizations and individuals are currently experimenting with mClock
as it exists in this code base, along with their modifications to the code
base. We hope you'll share your experiences with your mClock and dmClock
experiments on the ``ceph-devel`` mailing list.

.. confval:: osd_async_recovery_min_cost
.. confval:: osd_push_per_object_cost
.. confval:: osd_mclock_scheduler_client_res
.. confval:: osd_mclock_scheduler_client_wgt
.. confval:: osd_mclock_scheduler_client_lim
.. confval:: osd_mclock_scheduler_background_recovery_res
.. confval:: osd_mclock_scheduler_background_recovery_wgt
.. confval:: osd_mclock_scheduler_background_recovery_lim
.. confval:: osd_mclock_scheduler_background_best_effort_res
.. confval:: osd_mclock_scheduler_background_best_effort_wgt
.. confval:: osd_mclock_scheduler_background_best_effort_lim

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them from it, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs to
restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this
migration with 'backfilling', which allows Ceph to set backfill operations to
a lower priority than requests to read or write data. An example of
throttling backfill follows the list below.

.. confval:: osd_max_backfills
.. confval:: osd_backfill_scan_min
.. confval:: osd_backfill_scan_max
.. confval:: osd_backfill_retry_interval
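
For example, to reduce the impact of rebalancing on client I/O, you might cap
the number of concurrent backfill operations per OSD. A minimal sketch (the
value is illustrative):

.. code-block:: console

    $ ceph config set osd osd_max_backfills 1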

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.

.. confval:: osd_map_dedup
.. confval:: osd_map_cache_size
.. confval:: osd_map_message_max

.. index:: OSD; recovery

Recovery
========

When the cluster starts, or when a Ceph OSD Daemon crashes and restarts, the
OSD begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons that contain more recent versions of objects
in the placement groups. When this happens, the Ceph OSD Daemon goes into
recovery mode and seeks to get the latest copy of the data and bring its map
back up to date. Depending upon how long the Ceph OSD Daemon was down, the
OSD's objects and placement groups may be significantly out of date. Also, if
a failure domain went down (e.g., a rack), more than one Ceph OSD Daemon may
come back online at the same time. This can make the recovery process time
consuming and resource intensive.

To maintain operational performance, Ceph performs recovery with limitations
on the number of recovery requests, threads, and object chunk sizes, which
allows Ceph to perform well in a degraded state. An example of throttling
recovery follows the list below.

.. confval:: osd_recovery_delay_start
.. confval:: osd_recovery_max_active
.. confval:: osd_recovery_max_active_hdd
.. confval:: osd_recovery_max_active_ssd
.. confval:: osd_recovery_max_chunk
.. confval:: osd_recovery_max_single_start
.. confval:: osd_recover_clone_overlap
.. confval:: osd_recovery_sleep
.. confval:: osd_recovery_sleep_hdd
.. confval:: osd_recovery_sleep_ssd
.. confval:: osd_recovery_sleep_hybrid
.. confval:: osd_recovery_priority
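
For example, to slow recovery on HDD-backed OSDs while leaving SSD-backed OSDs
unchanged, you might add a delay between recovery operations. A minimal sketch
(the value, in seconds, is illustrative):

.. code-block:: console

    $ ceph config set osd osd_recovery_sleep_hdd 0.2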

Tiering
=======

.. confval:: osd_agent_max_ops
.. confval:: osd_agent_max_low_ops

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============

.. confval:: osd_default_notify_timeout
.. confval:: osd_check_for_log_corruption
.. confval:: osd_delete_sleep
.. confval:: osd_delete_sleep_hdd
.. confval:: osd_delete_sleep_ssd
.. confval:: osd_delete_sleep_hybrid
.. confval:: osd_command_max_records
.. confval:: osd_fast_fail_on_connection_refused

.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref