======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0`` using the following convention. ::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter it in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
        osd journal size = 1024

    [osd.0]
        host = osd-host-a

    [osd.1]
        host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically. We **DO NOT** recommend changing the default paths for data or
journals, as it makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed multiplied by ``filestore max sync interval``. However, the most common
practice is to partition the journal drive (often an SSD), and mount it such
that Ceph uses the entire partition for the journal.


``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================
Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

    osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

    osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

    /var/lib/ceph/osd/$cluster-$id/journal

Without performance optimization, Ceph stores the journal on the same disk as
the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
a separate disk to store journal data (e.g., a solid state drive delivers high
performance journaling).

Ceph's default ``osd journal size`` is 0, so you will need to set this in your
``ceph.conf`` file. The journal size should be at least twice the product of
the expected throughput and ``filestore max sync interval``::

    osd journal size = {2 * (expected throughput * filestore max sync interval)}

The expected throughput number should include the expected disk throughput
(i.e., sustained data transfer rate), and network throughput. For example,
a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()``
of the disk and network throughput should provide a reasonable expected
throughput. Some users just start off with a 10GB journal size. For
example::

    osd journal size = 10000

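To make the arithmetic concrete, assume a drive that sustains roughly 100 MB/s
and a ``filestore max sync interval`` of 5 seconds (both values are assumptions
for the sake of illustration, not recommendations). The formula above then
yields ``2 * (100 * 5) = 1000`` megabytes, i.e.::

    osd journal size = 1000
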

``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes. If this is 0, and the
              journal is a block device, the entire block device is used.
              Since v0.54, this is ignored if the journal is a block device,
              and the entire block device is used.

:Type: 32-bit Integer
:Default: ``5120``
:Recommended: Begin with 1GB. Should be at least twice the product of the
              expected speed multiplied by ``filestore max sync interval``.


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.

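For instance, a minimal sketch (the values are illustrative assumptions, not
recommendations) that restricts scheduled scrubs to a nightly window and
tolerates a slightly higher system load:

.. code-block:: ini

    [osd]
        osd scrub begin hour = 1
        osd scrub end hour = 7
        osd scrub load threshold = 0.8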


``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, it defines a time
              window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window whenever the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub during recovery``

:Description: Allow scrubbing during recovery. Setting this to ``false`` will
              disable scheduling new scrubs (and deep scrubs) while there is
              active recovery. Already running scrubs will be continued. This
              might be useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimal number of object store chunks to scrub during a single
              operation. Ceph blocks writes to a single chunk during scrub.

:Type: 32-bit Integer
:Default: 5


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a single
              operation.

:Type: 32-bit Integer
:Default: 25


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks. Increasing
              this value will slow down the whole scrub operation while client
              operations will be less impacted.

:Type: Float
:Default: 0


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \*
              ``osd scrub interval randomize ratio``. So the default setting
              practically randomly spreads the scrubs out in the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

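To make the arithmetic concrete (using the defaults above, purely as an
illustration): with an ``osd scrub min interval`` of one day and a ratio of
``0.5``, the next scrub for a placement group is scheduled at a random point
between 24 and 36 hours after the previous one::

    next scrub = last scrub + 24h + random(0, 0.5 * 24h)
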
``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set operations priority weights between client operations and
recovery operations to ensure optimal performance during recovery.


``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``


``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``)
              uses a token bucket system which, when there are sufficient
              tokens, will dequeue high priority queues first. If there are not
              enough tokens available, queues are dequeued low priority to high
              priority. The WeightedPriorityQueue (``wpq``) dequeues all
              priorities in relation to their priorities to prevent starvation
              of any queue. WPQ should help in cases where a few OSDs are more
              overloaded than others. The new mClock based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              And, the mClock based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``

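For example, a sketch of switching a busy cluster to the weighted priority
queue as described above (a judgment call to evaluate on your own cluster, not
a default recommendation); both options require an OSD restart to take effect:

.. code-block:: ini

    [osd]
        osd op queue = wpq
        osd op queue cut off = high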

``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63

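As an illustration of the priority weighting mentioned at the start of this
section (the values are assumptions chosen to favor client traffic, not a
prescription), further de-prioritizing recovery relative to client I/O could
look like:

.. code-block:: ini

    [osd]
        osd client op priority = 63
        osd recovery op priority = 1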

``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``

``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread. Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: Only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see the osd priority options instead.
:Type: String
:Default: the empty string

``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (i.e. due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can have priority.
              Note: Only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7 or -1 if not to be used.
:Default: ``-1``

``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operations logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by client
- osd subop: the iops issued by primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

And the resources are partitioned using the following three sets of tags. In
other words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if extra capacity is available
   or the system is oversubscribed.

In Ceph, operations are graded with "cost". And the resources allocated
for serving various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that the recovery won't get more than 5
requests per second serviced, even if it requires so (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
neither will they exhaust all the I/O resources. 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. And in the meantime, the client ops can enjoy a larger
portion of the I/O resource, because its weight is "9", while its
competitor's is "1". In the case of client ops, it is not clamped by the
limit setting, so it can make use of all the resources if there is no
recovery ongoing.

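As a rough sketch, the example above could be expressed with the
``osd op queue mclock *`` options documented later in this section (assuming
the ``mclock_opclass`` queue is selected; the numbers simply mirror the
illustration and are not recommended values):

.. code-block:: ini

    [osd]
        osd op queue = mclock_opclass
        osd op queue mclock recov res = 1.0
        osd op queue mclock recov lim = 5.0
        osd op queue mclock recov wgt = 1.0
        osd op queue mclock client op res = 2.0
        osd op queue mclock client op lim = 0.0
        osd op queue mclock client op wgt = 9.0
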
Along with *mclock_opclass* another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then the latter class of
requests should get executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met, and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

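Restating that tag rule as a small formula (a sketch of the scheduling math
described above, not code taken from the implementation)::

    weight_tag(next) = max(weight_tag(previous) + 1/W, current_time)
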
Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer. The
number of operations allowed in the operation sequencer is a complex
issue. In general we want to keep enough operations in the sequencer
so it's always getting work done on some operations while it's waiting
for disk and network access to complete on other operations. On the
other hand, once an operation is transferred to the operation
sequencer, mClock no longer has control over it. Therefore to maximize
the impact of mClock, we want to keep as few operations in the
operation sequencer as possible. So we have an inherent tension.

The configuration options that influence the number of operations in
the operation sequencer are ``bluestore_throttle_bytes``,
``bluestore_throttle_deferred_bytes``,
``bluestore_throttle_cost_per_io``,
``bluestore_throttle_cost_per_io_hdd``, and
``bluestore_throttle_cost_per_io_ssd``.

A third factor that affects the impact of the mClock algorithm is that
we're using a distributed system, where requests are made to multiple
OSDs and each OSD has (can have) multiple shards. Yet we're currently
using the mClock algorithm, which is not distributed (note: dmClock is
the distributed version of mClock).

Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ceph-devel mailing list.


``osd push per object cost``

:Description: the overhead for serving a push op

:Type: Unsigned Integer
:Default: 1000

``osd recovery max chunk``

:Description: the maximum total size of data chunks a recovery op can carry.

:Type: Unsigned Integer
:Default: 8 MiB


``osd op queue mclock client op res``

:Description: the reservation of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock client op wgt``

:Description: the weight of client op.

:Type: Float
:Default: 500.0


``osd op queue mclock client op lim``

:Description: the limit of client op.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop res``

:Description: the reservation of osd subop.

:Type: Float
:Default: 1000.0


``osd op queue mclock osd subop wgt``

:Description: the weight of osd subop.

:Type: Float
:Default: 500.0


``osd op queue mclock osd subop lim``

:Description: the limit of osd subop.

:Type: Float
:Default: 0.0


``osd op queue mclock snap res``

:Description: the reservation of snap trimming.

:Type: Float
:Default: 0.0


``osd op queue mclock snap wgt``

:Description: the weight of snap trimming.

:Type: Float
:Default: 1.0


``osd op queue mclock snap lim``

:Description: the limit of snap trimming.

:Type: Float
:Default: 0.001


``osd op queue mclock recov res``

:Description: the reservation of recovery.

:Type: Float
:Default: 0.0


``osd op queue mclock recov wgt``

:Description: the weight of recovery.

:Type: Float
:Default: 1.0


``osd op queue mclock recov lim``

:Description: the limit of recovery.

:Type: Float
:Default: 0.001


``osd op queue mclock scrub res``

:Description: the reservation of scrub jobs.

:Type: Float
:Default: 0.0


``osd op queue mclock scrub wgt``

:Description: the weight of scrub jobs.

:Type: Float
:Default: 1.0


``osd op queue mclock scrub lim``

:Description: the limit of scrub jobs.

:Type: Float
:Default: 0.001

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf


.. index:: OSD; backfilling

Backfilling
===========

When you add or remove Ceph OSD Daemons to a cluster, the CRUSH algorithm will
want to rebalance the cluster by moving placement groups to or from Ceph OSD
Daemons to restore the balance. The process of migrating placement groups and
the objects they contain can reduce the cluster's operational performance
considerably. To maintain operational performance, Ceph performs this migration
with 'backfilling', which allows Ceph to set backfill operations to a lower
priority than requests to read or write data.

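For example, a cautious sketch (illustrative assumptions, not a tuning
recommendation) that keeps backfill concurrency at its default while making
backfill retries less aggressive:

.. code-block:: ini

    [osd]
        osd max backfills = 1
        osd backfill retry interval = 30.0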

``osd max backfills``

:Description: The maximum number of backfills allowed to or from a single OSD.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd backfill scan min``

:Description: The minimum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``64``


``osd backfill scan max``

:Description: The maximum number of objects per backfill scan.

:Type: 32-bit Integer
:Default: ``512``


``osd backfill retry interval``

:Description: The number of seconds to wait before retrying backfill requests.
:Type: Double
:Default: ``10.0``

.. index:: OSD; osdmap

OSD Map
=======

OSD maps reflect the OSD daemons operating in the cluster. Over time, the
number of map epochs increases. Ceph provides some settings to ensure that
Ceph performs well as the OSD map grows larger.


``osd map dedup``

:Description: Enable removing duplicates in the OSD map.
:Type: Boolean
:Default: ``true``


``osd map cache size``

:Description: The number of OSD maps to keep cached.
:Type: 32-bit Integer
:Default: ``500``


``osd map cache bl size``

:Description: The size of the in-memory OSD map cache in OSD daemons.
:Type: 32-bit Integer
:Default: ``50``


``osd map cache bl inc size``

:Description: The size of the in-memory OSD map cache incrementals in
              OSD daemons.

:Type: 32-bit Integer
:Default: ``100``


``osd map message max``

:Description: The maximum map entries allowed per MOSDMap message.
:Type: 32-bit Integer
:Default: ``100``



.. index:: OSD; recovery

Recovery
========

When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
begins peering with other Ceph OSD Daemons before writes can occur. See
`Monitoring OSDs and PGs`_ for details.

If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
sync with other Ceph OSD Daemons containing more recent versions of objects in
the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
mode and seeks to get the latest copy of the data and bring its map back up to
date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
and placement groups may be significantly out of date. Also, if a failure domain
went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
the same time. This can make the recovery process time consuming and resource
intensive.

To maintain operational performance, Ceph performs recovery with limitations on
the number of recovery requests, threads, and object chunk sizes, which allows
Ceph to perform well in a degraded state.

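For instance, a hedged sketch (illustrative values only) that throttles
recovery further on a latency-sensitive cluster by lowering concurrency and
adding a small sleep between recovery ops on HDDs:

.. code-block:: ini

    [osd]
        osd recovery max active = 1
        osd recovery sleep hdd = 0.2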

``osd recovery delay start``

:Description: After peering completes, Ceph will delay for the specified number
              of seconds before starting to recover objects.

:Type: Float
:Default: ``0``


``osd recovery max active``

:Description: The number of active recovery requests per OSD at one time. More
              requests will accelerate recovery, but the requests place an
              increased load on the cluster.

:Type: 32-bit Integer
:Default: ``3``


``osd recovery max chunk``

:Description: The maximum size of a recovered chunk of data to push.
:Type: 64-bit Unsigned Integer
:Default: ``8 << 20``


``osd recovery max single start``

:Description: The maximum number of recovery operations per OSD that will be
              newly started when an OSD is recovering.
:Type: 64-bit Unsigned Integer
:Default: ``1``


``osd recovery thread timeout``

:Description: The maximum time in seconds before timing out a recovery thread.
:Type: 32-bit Integer
:Default: ``30``


``osd recover clone overlap``

:Description: Preserves clone overlap during recovery. Should always be set
              to ``true``.

:Type: Boolean
:Default: ``true``


``osd recovery sleep``

:Description: Time in seconds to sleep before the next recovery or backfill op.
              Increasing this value will slow down recovery operation while
              client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd recovery sleep hdd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for HDDs.

:Type: Float
:Default: ``0.1``


``osd recovery sleep ssd``

:Description: Time in seconds to sleep before the next recovery or backfill op
              for SSDs.

:Type: Float
:Default: ``0``


``osd recovery sleep hybrid``

:Description: Time in seconds to sleep before the next recovery or backfill op
              when osd data is on HDD and osd journal is on SSD.

:Type: Float
:Default: ``0.025``

Tiering
=======

``osd agent max ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the high speed mode.
:Type: 32-bit Integer
:Default: ``4``


``osd agent max low ops``

:Description: The maximum number of simultaneous flushing ops per tiering agent
              in the low speed mode.
:Type: 32-bit Integer
:Default: ``2``

See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
objects within the high speed mode.

Miscellaneous
=============


``osd snap trim thread timeout``

:Description: The maximum time in seconds before timing out a snap trim thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd backlog thread timeout``

:Description: The maximum time in seconds before timing out a backlog thread.
:Type: 32-bit Integer
:Default: ``60*60*1``


``osd default notify timeout``

:Description: The OSD default notification timeout (in seconds).
:Type: 32-bit Unsigned Integer
:Default: ``30``


``osd check for log corruption``

:Description: Check log files for corruption. Can be computationally expensive.
:Type: Boolean
:Default: ``false``


``osd remove thread timeout``

:Description: The maximum time in seconds before timing out a remove OSD thread.
:Type: 32-bit Integer
:Default: ``60*60``


``osd command thread timeout``

:Description: The maximum time in seconds before timing out a command thread.
:Type: 32-bit Integer
:Default: ``10*60``


``osd command max records``

:Description: Limits the number of lost objects to return.
:Type: 32-bit Integer
:Default: ``256``


``osd auto upgrade tmap``

:Description: Uses ``tmap`` for ``omap`` on old objects.
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio