======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file (or in recent
releases, the central config store), but Ceph OSD Daemons can use the default
values and a very minimal configuration. A minimal Ceph OSD Daemon
configuration sets ``osd_journal_size`` (for Filestore) and ``host``, and uses
default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0``, using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter them in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd_journal_size = 5120

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b

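In recent Ceph releases you can set the same options at runtime through the
centralized configuration database instead of editing ``ceph.conf``. A minimal
sketch (assuming a release with the central config store, i.e. Mimic or
later)::

    ceph config set osd osd_journal_size 5120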

.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID and determine the paths
to data and journals. Ceph deployment scripts typically generate the UUID
automatically.

.. warning:: **DO NOT** change the default paths for data or journals, as
   doing so makes it more problematic to troubleshoot Ceph later.

When using Filestore, the journal size should be at least twice the product of
the expected drive speed and ``filestore_max_sync_interval``. However, the most
common practice is to partition the journal drive (often an SSD), and mount it
such that Ceph uses the entire partition for the journal.
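
For example (illustrative values): with an expected drive speed of 100 MB/s
and ``filestore_max_sync_interval = 5`` seconds, the minimum journal size
would be::

    2 * 100 MB/s * 5 s = 1000 MB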


``osd_uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd_uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd_data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd_max_write_size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd_max_object_size``

:Description: The maximum size of a RADOS object in bytes.
:Type: 32-bit Unsigned Integer
:Default: 128MB


``osd_client_message_size_cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd_class_dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``
.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems that are used for Ceph Filestore OSDs.

``osd_mkfs_options {fs-type}``

:Description: Options used when creating a new Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd_mkfs_options_xfs = -f -d agcount=24

``osd_mount_options {fs-type}``

:Description: Options used when mounting a Ceph Filestore OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw,noatime``

For example::

    osd_mount_options_xfs = rw,noatime,inode64,logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

This section applies only to the older Filestore OSD back end. Since the
Luminous release, BlueStore has been the default and preferred back end.

By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::

    /var/lib/ceph/osd/$cluster-$id/journal

When using a single device type (for example, spinning drives), the journals
should be *colocated*: the logical volume (or partition) should be on the same
device as the ``data`` logical volume.

When using a mix of fast devices (SSDs, NVMe) and slower ones (like spinning
drives) it makes sense to place the journal on the faster device, while
``data`` occupies the slower device fully.

The default ``osd_journal_size`` value is 5120 (5 gigabytes), but it can be
larger, in which case it will need to be set in the ``ceph.conf`` file.
A value of 10 gigabytes is common in practice::

    osd_journal_size = 10240


``osd_journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              separate fast device when the ``osd_data`` drive is an HDD.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd_journal_size``

:Description: The size of the journal in megabytes.

:Type: 32-bit Integer
:Default: ``5120``


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity
by scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.


``osd_max_scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd_scrub_begin_hour``

:Description: This restricts scrubbing to this hour of the day or later.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing the entire day. Together with
              ``osd_scrub_end_hour``, this defines a time window in which
              scrubs can happen. However, regardless of the time window, a
              scrub will be performed whenever a placement group's scrub
              interval exceeds ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 23
:Default: ``0``


``osd_scrub_end_hour``

:Description: This restricts scrubbing to hours of the day earlier than this.
              Use ``osd_scrub_begin_hour = 0`` and ``osd_scrub_end_hour = 0``
              to allow scrubbing for the entire day. Together with
              ``osd_scrub_begin_hour``, this defines a time window in which
              scrubs can happen. However, regardless of the time window, a
              scrub will be performed whenever a placement group's scrub
              interval exceeds ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 23
:Default: ``0``

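For example, to permit scheduled scrubs only between 1 AM and 6 AM (an
illustrative sketch; adjust the hours to your own quiet period):

.. code-block:: ini

    [osd]
    osd_scrub_begin_hour = 1
    osd_scrub_end_hour = 6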

``osd_scrub_begin_week_day``

:Description: This restricts scrubbing to this day of the week or later.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_end_week_day``, this
              defines a time window in which scrubs can happen. However,
              regardless of the time window, a scrub will be performed whenever
              a placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 6
:Default: ``0``


``osd_scrub_end_week_day``

:Description: This restricts scrubbing to days of the week earlier than this.
              0 = Sunday, 1 = Monday, etc. Use ``osd_scrub_begin_week_day = 0``
              and ``osd_scrub_end_week_day = 0`` to allow scrubbing for the
              entire week. Together with ``osd_scrub_begin_week_day``, this
              defines a time window in which scrubs can happen. However,
              regardless of the time window, a scrub will be performed whenever
              a placement group's scrub interval exceeds
              ``osd_scrub_max_interval``.
:Type: Integer in the range of 0 to 6
:Default: ``0``


``osd_scrub_during_recovery``

:Description: Allow scrubs during recovery. Setting this to ``false`` will
              disable scheduling new scrubs (and deep scrubs) while there is
              active recovery. Already running scrubs will be continued. This
              might be useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``false``


``osd_scrub_thread_timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd_scrub_finalize_thread_timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``10*60``


``osd_scrub_load_threshold``

:Description: The normalized maximum load. Ceph will not scrub when the system
              load (as defined by ``getloadavg() / number of online CPUs``) is
              higher than this number.

:Type: Float
:Default: ``0.5``


``osd_scrub_min_interval``

:Description: The minimum interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``24*60*60``

.. _osd_scrub_max_interval:

``osd_scrub_max_interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*24*60*60``

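For example, to keep the default regular scrub intervals but deep scrub only
every two weeks (values in seconds; illustrative, not a recommendation):

.. code-block:: ini

    [osd]
    osd_scrub_min_interval = 86400        # 24*60*60: once per day
    osd_scrub_max_interval = 604800       # 7*24*60*60: once per week
    osd_deep_scrub_interval = 1209600     # 14*24*60*60: every two weeks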

``osd_scrub_chunk_min``

:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during a
              scrub.

:Type: 32-bit Integer
:Default: 5


``osd_scrub_chunk_max``

:Description: The maximum number of object store chunks to scrub during a
              single operation.

:Type: 32-bit Integer
:Default: 25


``osd_scrub_sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the overall rate of
              scrubbing so that client operations will be less impacted.

:Type: Float
:Default: 0


``osd_deep_scrub_interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd_scrub_load_threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``7*24*60*60``


``osd_scrub_interval_randomize_ratio``

:Description: Add a random delay to ``osd_scrub_min_interval`` when scheduling
              the next scrub job for a PG. The delay is a random
              value less than ``osd_scrub_min_interval`` \*
              ``osd_scrub_interval_randomize_ratio``. The default setting
              spreads scrubs throughout the allowed time
              window of ``[1, 1.5]`` \* ``osd_scrub_min_interval``.
:Type: Float
:Default: ``0.5``

``osd_deep_scrub_stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


``osd_scrub_auto_repair``

:Description: Setting this to ``true`` will enable automatic PG repair when
              errors are found by scrubs or deep scrubs. However, if more than
              ``osd_scrub_auto_repair_num_errors`` errors are found, a repair
              is NOT performed.
:Type: Boolean
:Default: ``false``


``osd_scrub_auto_repair_num_errors``

:Description: Auto repair will not occur if more than this many errors are
              found.
:Type: 32-bit Integer
:Default: ``5``

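For example, to enable automatic repair at runtime through the central config
store (a sketch; the same option can be set in ``ceph.conf`` instead)::

    ceph config set osd osd_scrub_auto_repair true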

.. index:: OSD; operations settings

Operations
==========

``osd_op_queue``

:Description: This sets the type of queue to be used for prioritizing ops
              within each OSD. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The WeightedPriorityQueue (``wpq``)
              dequeues operations in relation to their priorities to prevent
              starvation of any queue. WPQ should help in cases where a few
              OSDs are more overloaded than others. The newer mClock queue
              (``mclock_scheduler``) prioritizes operations based on which
              class they belong to (recovery, scrub, snaptrim, client op, osd
              subop). See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: wpq, mclock_scheduler
:Default: ``wpq``


``osd_op_queue_cut_off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the
              ``high`` option sends only replication acknowledgment ops and
              higher to the strict queue. Setting this to ``high`` should help
              when a few OSDs in the cluster are very busy, especially when
              combined with ``wpq`` in the ``osd_op_queue`` setting. Without
              these settings, OSDs that are very busy handling replication
              traffic could starve primary client traffic on those OSDs.
              Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``high``

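For example, to switch a cluster to the mClock scheduler (a sketch; both
options take effect only after the OSDs are restarted)::

    ceph config set osd osd_op_queue mclock_scheduler
    ceph config set osd osd_op_queue_cut_off high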

``osd_client_op_priority``

:Description: The priority set for client operations. This value is relative
              to that of ``osd_recovery_op_priority`` below. The default
              strongly favors client ops over recovery.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd_recovery_op_priority``

:Description: The priority of recovery operations vs. client operations, if not
              specified by the pool's ``recovery_op_priority``. The default
              value prioritizes client ops (see above) over recovery ops. You
              may adjust the tradeoff of client impact against the time to
              restore cluster health by lowering this value for increased
              prioritization of client ops, or by increasing it to favor
              recovery.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd_scrub_priority``

:Description: The default work queue priority for scheduled scrubs when the
              pool doesn't specify a value of ``scrub_priority``. This can be
              boosted to the value of ``osd_client_op_priority`` when scrubs
              are blocking client operations.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd_requested_scrub_priority``

:Description: The priority set for user-requested scrubs on the work queue. If
              this value is smaller than ``osd_client_op_priority``, it can be
              boosted to the value of ``osd_client_op_priority`` when a scrub
              is blocking client operations.

:Type: 32-bit Integer
:Default: ``120``


``osd_snap_trim_priority``

:Description: The priority set for the snap trim work queue.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63

``osd_snap_trim_sleep``

:Description: Time in seconds to sleep before the next snap trim op.
              Increasing this value will slow down snap trimming.
              This option overrides backend-specific variants.

:Type: Float
:Default: ``0``


``osd_snap_trim_sleep_hdd``

:Description: Time in seconds to sleep before the next snap trim op
              for HDDs.

:Type: Float
:Default: ``5``


``osd_snap_trim_sleep_ssd``

:Description: Time in seconds to sleep before the next snap trim op
              for SSD OSDs (including NVMe).

:Type: Float
:Default: ``0``


``osd_snap_trim_sleep_hybrid``

:Description: Time in seconds to sleep before the next snap trim op
              when OSD data is on an HDD and the OSD journal or WAL+DB is on
              an SSD.

:Type: Float
:Default: ``2``

``osd_op_thread_timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd_op_complaint_time``

:Description: An operation becomes complaint-worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd_op_history_size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd_op_history_duration``

:Description: The age in seconds of the oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd_op_log_threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


.. _dmclock-qos:

QoS Based on mClock
-------------------

Ceph's use of mClock is now more refined and can be used by following the
steps described in `mClock Config Reference`_.

Core Concepts
`````````````

Ceph's QoS support is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by a client
- osd subop: the iops issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

The resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity when extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with a "cost", and the resources allocated
to the various services are consumed by these "costs". So, for
example, the larger the reservation a service has, the more resources it is
guaranteed to possess, as long as it requires them. Assume there are two
services, recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that recovery won't get more than 5
requests per second serviced, even if it requires them (see CURRENT
IMPLEMENTATION NOTE below) and no other services are competing with
it. But even if the clients start to issue a large number of I/O requests,
they will not exhaust all the I/O resources: 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. Meanwhile, the client ops can enjoy a larger
portion of the I/O resources, because their weight is "9", while their
competitor's is "1". In the case of client ops, they are not clamped by the
limit setting, so they can make use of all the resources if there is no
recovery ongoing.

CURRENT IMPLEMENTATION NOTE: the current implementation enforces the limit
values. Therefore, if a service crosses the enforced limit, the op remains
in the operation queue until the limit is restored.

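As a rough sketch, the (r, l, w) tags from the example above would correspond
to the ``osd_mclock_scheduler_*`` options documented later in this section
(this mapping is an assumption for illustration; the values are not a tuning
recommendation):

.. code-block:: ini

    [osd]
    osd_op_queue = mclock_scheduler
    # recovery: (r:1, l:5, w:1)
    osd_mclock_scheduler_background_recovery_res = 1
    osd_mclock_scheduler_background_recovery_lim = 5
    osd_mclock_scheduler_background_recovery_wgt = 1
    # client ops: (r:2, w:9); l:0 (no limit) expressed as a very large value
    osd_mclock_scheduler_client_res = 2
    osd_mclock_scheduler_client_lim = 999999
    osd_mclock_scheduler_client_wgt = 9
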
Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then requests of the latter class
should be executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

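In other words, the weight tag of the next request in a class is computed as
(notation is illustrative)::

    tag(n+1) = max(tag(n) + 1/W, now)

If *W* is, say, 1000 but the class receives only 100 requests per second, then
``tag(n) + 1/W`` always lags behind ``now``, every request is simply tagged
with the current time, and the weights cease to differentiate the classes.
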
Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among themselves. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (or can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base, along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ``ceph-devel`` mailing list.


``osd_push_per_object_cost``

:Description: The overhead for serving a push op.

:Type: Unsigned Integer
:Default: 1000


``osd_recovery_max_chunk``

:Description: The maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd_mclock_scheduler_client_res``

:Description: IO proportion reserved for each client (default).

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_client_wgt``

:Description: IO share for each client (default) over reservation.

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_client_lim``

:Description: IO limit for each client (default) over reservation.

:Type: Unsigned Integer
:Default: 999999


``osd_mclock_scheduler_background_recovery_res``

:Description: IO proportion reserved for background recovery (default).

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_recovery_wgt``

:Description: IO share for each background recovery over reservation.

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_recovery_lim``

:Description: IO limit for background recovery over reservation.

:Type: Unsigned Integer
:Default: 999999


``osd_mclock_scheduler_background_best_effort_res``

:Description: IO proportion reserved for background best_effort (default).

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_best_effort_wgt``

:Description: IO share for each background best_effort over reservation.

:Type: Unsigned Integer
:Default: 1


``osd_mclock_scheduler_background_best_effort_lim``

:Description: IO limit for background best_effort over reservation.

:Type: Unsigned Integer
:Default: 999999

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add Ceph OSD Daemons to a cluster or remove them, CRUSH will
rebalance the cluster by moving placement groups to or from Ceph OSDs
to restore balanced utilization. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.


``osd_max_backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
              Note that this is applied separately for read and write
              operations.
:Type: 64-bit Unsigned Integer
:Default: ``1``

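For example, to temporarily allow more concurrent backfills while expanding a
cluster (a sketch; remember to revert the value afterwards)::

    ceph config set osd osd_max_backfills 2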

``osd_backfill_scan_min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd_backfill_scan_max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd_backfill_retry_interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd_map_dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd_map_cache_size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``50``


``osd_map_message_max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``40``



.. index:: OSD; recovery

Recovery
========

When the cluster starts, or when a Ceph OSD Daemon crashes and restarts, the
OSD begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure
domain went down (e.g., a rack), more than one Ceph OSD Daemon may come back
online at the same time. This can make the recovery process time consuming and
resource intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

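For example, to further soften the impact of recovery on client I/O for
HDD-backed OSDs (the options are documented below; the values are illustrative,
not a recommendation):

.. code-block:: ini

    [osd]
    osd_recovery_max_active_hdd = 1
    osd_recovery_sleep_hdd = 0.2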

``osd_recovery_delay_start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover RADOS objects.

:Type: Float
:Default: ``0``


``osd_recovery_max_active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

              This value is only used if it is non-zero. Normally it
              is ``0``, which means that the ``hdd`` or ``ssd`` values
              (below) are used, depending on the type of the primary
              device backing the OSD.

:Type: 32-bit Integer
:Default: ``0``

``osd_recovery_max_active_hdd``

:Description: The number of active recovery requests per OSD at one time, if
              the primary device is rotational.

:Type: 32-bit Integer
:Default: ``3``

``osd_recovery_max_active_ssd``

:Description: The number of active recovery requests per OSD at one time, if
              the primary device is non-rotational (i.e., an SSD).

:Type: 32-bit Integer
:Default: ``10``


``osd_recovery_max_chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd_recovery_max_single_start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd_recovery_thread_timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd_recover_clone_overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd_recovery_sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operations while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd_recovery_sleep_hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd_recovery_sleep_ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd_recovery_sleep_hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when OSD data is on an HDD and the OSD journal or WAL+DB is on
              an SSD.

:Type: Float
:Default: ``0.025``


``osd_recovery_priority``

:Description: The default priority set for the recovery work queue. Not
              related to a pool's ``recovery_priority``.

:Type: 32-bit Integer
:Default: ``5``


Tiering
=======

``osd_agent_max_ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd_agent_max_low_ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd_snap_trim_thread_timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``1*60*60``


``osd_backlog_thread_timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``1*60*60``


``osd_default_notify_timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd_check_for_log_corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd_remove_thread_timeout``

:Description: The maximum time in seconds before timing out a remove OSD
              thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd_command_thread_timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd_delete_sleep``

:Description: Time in seconds to sleep before the next removal transaction.
              This throttles the PG deletion process.

:Type: Float
:Default: ``0``


``osd_delete_sleep_hdd``

:Description: Time in seconds to sleep before the next removal transaction
              for HDDs.

:Type: Float
:Default: ``5``


``osd_delete_sleep_ssd``

:Description: Time in seconds to sleep before the next removal transaction
              for SSDs.

:Type: Float
:Default: ``0``


``osd_delete_sleep_hybrid``

:Description: Time in seconds to sleep before the next removal transaction
              when OSD data is on an HDD and the OSD journal or WAL+DB is on
              an SSD.

:Type: Float
:Default: ``1``


``osd_command_max_records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd_fast_fail_on_connection_refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore the old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
.. _mClock Config Reference: ../mclock-config-ref