========================
mClock Config Reference
========================

.. index:: mclock; configuration

QoS support in Ceph is implemented using a queuing scheduler based on `the
dmClock algorithm`_. See the :ref:`dmclock-qos` section for more details.

.. note:: The *mclock_scheduler* is supported only for BlueStore OSDs. For
   Filestore OSDs, *osd_op_queue* is set to *wpq* and this setting is enforced
   even if you attempt to change it.

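Before configuring any mclock options, it can be useful to confirm which
scheduler a given OSD is actually using. As a quick check (``osd.0`` is used
here only as an example name), the effective value of ``osd_op_queue`` can be
queried as shown below:

.. prompt:: bash #

   ceph config show osd.0 osd_op_queue
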
To make mclock usage more user-friendly and intuitive, mclock config profiles
are introduced. The mclock profiles mask the low-level details from users,
making it easier to configure and use mclock.

The following input parameters are required for an mclock profile to configure
the QoS-related parameters:

* total capacity (IOPS) of each OSD (determined automatically -
  see `OSD Capacity Determination (Automated)`_)

* an mclock profile type to enable

Using the settings in the specified profile, an OSD determines and applies the
lower-level mclock and Ceph parameters. The parameters applied by the mclock
profile make it possible to tune the QoS between client I/O and background
operations in the OSD.


.. index:: mclock; mclock clients

mClock Client Types
===================

The mclock scheduler handles requests from different types of Ceph services.
Each service can be considered a type of client from mclock's perspective.
Depending on the type of requests handled, mclock clients are classified into
the buckets shown in the following table:

+------------------------+----------------------------------------------------+
| Client Type            | Request Types                                      |
+========================+====================================================+
| Client                 | I/O requests issued by external clients of Ceph    |
+------------------------+----------------------------------------------------+
| Background recovery    | Internal recovery/backfill requests                |
+------------------------+----------------------------------------------------+
| Background best-effort | Internal scrub, snap trim and PG deletion requests |
+------------------------+----------------------------------------------------+

The mclock profiles allocate parameters like reservation, weight and limit
(see :ref:`dmclock-qos`) differently for each client type. The next sections
describe the mclock profiles in greater detail.


.. index:: mclock; profile definition

mClock Profiles - Definition and Purpose
========================================

An mclock profile is *“a configuration setting that, when applied on a running
Ceph cluster, enables the throttling of the operations (IOPS) belonging to
different client classes (background recovery, scrub, snaptrim, client op,
osd subop)”*.

The mclock profile uses the capacity limits and the mclock profile type selected
by the user to determine the low-level mclock resource control configuration
parameters and apply them transparently. Additionally, other Ceph configuration
parameters are also applied. See the sections below for more information.

The low-level mclock resource control parameters are the *reservation*,
*limit*, and *weight* that provide control of the resource shares, as
described in the :ref:`dmclock-qos` section.


.. index:: mclock; profile types

mClock Profile Types
====================

mclock profiles can be broadly classified into *built-in* and *custom* profiles.

Built-in Profiles
-----------------
Users can choose between the following built-in profile types:

.. note:: The values mentioned in the tables below represent the percentage
   of the total IOPS capacity of the OSD allocated for the service type.

high_client_ops (*default*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile is enabled by default. The table
shows the resource control parameters set by the profile:

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 25%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background best-effort | 25%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

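As a rough illustration of how these percentages translate into IOPS, assume
(purely as an example) an OSD whose automatically determined capacity
(``osd_mclock_max_capacity_iops_ssd``) is 3000 IOPS. Under *high_client_ops*,
roughly 0.5 * 3000 = 1500 IOPS would be reserved for client operations and
0.25 * 3000 = 750 IOPS for background recovery; the exact values are computed
internally by the scheduler on a per-shard basis. The value in effect on an
OSD (``osd.0`` is used here as an example) can be inspected with:

.. prompt:: bash #

   ceph config show osd.0 osd_mclock_scheduler_client_res
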
high_recovery_ops
^^^^^^^^^^^^^^^^^
This profile optimizes background recovery performance as compared to external
clients and other background operations within the OSD. This profile, for
example, may be enabled by an administrator temporarily to speed up background
recoveries during non-peak hours. The table shows the resource control
parameters set by the profile:

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | 80%   |
+------------------------+-------------+--------+-------+
| background recovery    | 60%         | 2      | 200%  |
+------------------------+-------------+--------+-------+
| background best-effort | 1 (MIN)     | 2      | MAX   |
+------------------------+-------------+--------+-------+

balanced
^^^^^^^^
This profile allocates equal reservation to client I/O operations and background
recovery operations. This means that equal I/O resources are allocated to both
external and background recovery operations. This profile, for example, may be
enabled by an administrator when the external client performance requirement is
not critical and there are other background operations that still need attention
within the OSD.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 40%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | 150%  |
+------------------------+-------------+--------+-------+
| background best-effort | 20%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

.. note:: Across the built-in profiles, internal background best-effort clients
   of mclock ("scrub", "snap trim", and "pg deletion") are given lower
   reservations but no limits (MAX). This ensures that requests from such
   clients are able to complete quickly if there are no other competing
   operations.


Custom Profile
--------------
This profile gives users complete control over all the mclock configuration
parameters. This profile should be used with caution and is meant for advanced
users who understand mclock and Ceph-related configuration options.


.. index:: mclock; built-in profiles

mClock Built-in Profiles
========================

When a built-in profile is enabled, the mClock scheduler calculates the
low-level mclock parameters [*reservation*, *weight*, *limit*] based on the
profile enabled for each client type. The mclock parameters are calculated
based on the max OSD capacity provided beforehand. As a result, the following
mclock config parameters cannot be modified when using any of the built-in
profiles:

- :confval:`osd_mclock_scheduler_client_res`
- :confval:`osd_mclock_scheduler_client_wgt`
- :confval:`osd_mclock_scheduler_client_lim`
- :confval:`osd_mclock_scheduler_background_recovery_res`
- :confval:`osd_mclock_scheduler_background_recovery_wgt`
- :confval:`osd_mclock_scheduler_background_recovery_lim`
- :confval:`osd_mclock_scheduler_background_best_effort_res`
- :confval:`osd_mclock_scheduler_background_best_effort_wgt`
- :confval:`osd_mclock_scheduler_background_best_effort_lim`

The following Ceph options will not be modifiable by the user:

- :confval:`osd_max_backfills`
- :confval:`osd_recovery_max_active`

This is because the above options are internally modified by the mclock
scheduler in order to maximize the impact of the set profile.

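The values of these options that are currently in effect on a running OSD
(``osd.0`` is shown here only as an example) can be inspected with the
following command:

.. prompt:: bash #

   ceph config show osd.0 osd_max_backfills
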
By default, the *high_client_ops* profile is enabled to ensure that a larger
chunk of the bandwidth allocation goes to client ops. Background recovery ops
are given a lower allocation (and therefore take a longer time to complete). But
there might be instances that necessitate giving higher allocations to either
client ops or recovery ops. In order to deal with such a situation, the
alternate built-in profiles may be enabled by following the steps mentioned
in the next section.

If any mClock profile (including "custom") is active, the following Ceph config
sleep options will be disabled:

- :confval:`osd_recovery_sleep`
- :confval:`osd_recovery_sleep_hdd`
- :confval:`osd_recovery_sleep_ssd`
- :confval:`osd_recovery_sleep_hybrid`
- :confval:`osd_scrub_sleep`
- :confval:`osd_delete_sleep`
- :confval:`osd_delete_sleep_hdd`
- :confval:`osd_delete_sleep_ssd`
- :confval:`osd_delete_sleep_hybrid`
- :confval:`osd_snap_trim_sleep`
- :confval:`osd_snap_trim_sleep_hdd`
- :confval:`osd_snap_trim_sleep_ssd`
- :confval:`osd_snap_trim_sleep_hybrid`

The above sleep options are disabled to ensure that the mclock scheduler is able
to determine when to pick the next op from its operation queue and transfer it
to the operation sequencer. This results in the desired QoS being provided
across all its clients.


.. index:: mclock; enable built-in profile

Steps to Enable mClock Profile
==============================

As already mentioned, the default mclock profile is set to *high_client_ops*.
The other values for the built-in profiles include *balanced* and
*high_recovery_ops*.

If there is a requirement to change the default profile, then the option
:confval:`osd_mclock_profile` may be set at runtime by using the following
command:

.. prompt:: bash #

   ceph config set osd.N osd_mclock_profile <value>

For example, to change the profile to allow faster recoveries on "osd.0", the
following command can be used to switch to the *high_recovery_ops* profile:

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_profile high_recovery_ops

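To verify the profile that is currently in effect on an OSD (again using
``osd.0`` as an example), the active value of :confval:`osd_mclock_profile`
can be queried as shown below:

.. prompt:: bash #

   ceph config get osd.0 osd_mclock_profile
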
.. note:: The *custom* profile is not recommended unless you are an advanced
   user.

And that's it! You are ready to run workloads on the cluster and check if the
QoS requirements are being met.


Switching Between Built-in and Custom Profiles
==============================================

There may be situations requiring switching from a built-in profile to the
*custom* profile and vice-versa. The following sections outline the steps to
accomplish this.

Steps to Switch From a Built-in to the Custom Profile
-----------------------------------------------------

The following command can be used to switch to the *custom* profile. For
example, to change the profile to *custom* on all OSDs:

.. prompt:: bash #

   ceph config set osd osd_mclock_profile custom

After switching to the *custom* profile, the desired mClock configuration
option may be modified. For example, to change the client reservation IOPS
allocation for a specific OSD (say osd.0), the following command can be used:

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_client_res 3000

.. important:: Care must be taken to change the reservations of other services
   like recovery and background best-effort accordingly, to ensure that the sum
   of the reservations does not exceed the maximum IOPS capacity of the OSD.

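For example, the reservations of the other service types on the same OSD may be
lowered along with the change above (the values below are purely illustrative
and should be chosen based on the measured capacity of your OSD):

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_background_recovery_res 1000

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_background_best_effort_res 500
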
.. tip:: The reservation and limit parameter allocations are per-shard based on
   the type of backing device (HDD/SSD) under the OSD. See
   :confval:`osd_op_num_shards_hdd` and :confval:`osd_op_num_shards_ssd` for
   more details.

Steps to Switch From the Custom Profile to a Built-in Profile
--------------------------------------------------------------

Switching from the *custom* profile to a built-in profile requires an
intermediate step of removing the custom settings from the central config
database for the changes to take effect.

The following sequence of commands can be used to switch to a built-in profile:

#. Set the desired built-in profile using:

   .. prompt:: bash #

      ceph config set osd osd_mclock_profile <built-in profile>

   For example, to set the built-in profile to ``high_client_ops`` on all
   OSDs, run the following command:

   .. prompt:: bash #

      ceph config set osd osd_mclock_profile high_client_ops

#. Determine the existing custom mClock configuration settings in the central
   config database using the following command:

   .. prompt:: bash #

      ceph config dump

#. Remove the custom mClock configuration settings determined in the previous
   step from the central config database:

   .. prompt:: bash #

      ceph config rm osd <mClock Configuration Option>

   For example, to remove the configuration option
   :confval:`osd_mclock_scheduler_client_res` that was set on all OSDs, run the
   following command:

   .. prompt:: bash #

      ceph config rm osd osd_mclock_scheduler_client_res

#. After all existing custom mClock configuration settings have been removed
   from the central config database, the configuration settings pertaining to
   ``high_client_ops`` will come into effect. For example, to verify the
   settings on osd.0, run:

   .. prompt:: bash #

      ceph config show osd.0

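The output of the above command can be fairly long. To focus on only the
mclock-related settings, the output may be filtered, for example:

.. prompt:: bash #

   ceph config show osd.0 | grep mclock
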
Switch Temporarily Between mClock Profiles
------------------------------------------

To switch between mClock profiles on a temporary basis, the following commands
may be used to override the settings:

.. warning:: This section is for advanced users or for experimental testing. The
   recommendation is to not use the below commands on a running cluster as they
   could have unexpected outcomes.

.. note:: The configuration changes made on an OSD using the below commands are
   ephemeral and are lost when it restarts. Also note that config options
   overridden using the below commands can no longer be modified using the
   *ceph config set osd.N ...* command; any such changes will not take effect
   until the OSD is restarted. This is intentional, as per the config subsystem
   design. However, any further modification can still be made ephemerally using
   the commands mentioned below.

#. Run the *injectargs* command as shown to override the mclock settings:

   .. prompt:: bash #

      ceph tell osd.N injectargs '--<mClock Configuration Option>=<value>'

   For example, the following command overrides the
   :confval:`osd_mclock_profile` option on osd.0:

   .. prompt:: bash #

      ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops'

#. An alternate command that can be used is:

   .. prompt:: bash #

      ceph daemon osd.N config set <mClock Configuration Option> <value>

   For example, the following command overrides the
   :confval:`osd_mclock_profile` option on osd.0:

   .. prompt:: bash #

      ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops

The individual QoS-related config options for the *custom* profile can also be
modified ephemerally using the above commands.
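
For example, the following command temporarily overrides the client reservation
on osd.0 (the value shown is illustrative):

.. prompt:: bash #

   ceph tell osd.0 injectargs '--osd_mclock_scheduler_client_res=3000'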


OSD Capacity Determination (Automated)
======================================

The OSD capacity in terms of total IOPS is determined automatically during OSD
initialization. This is achieved by running the OSD bench tool and overriding
the default value of the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
depending on the device type. No other action/input is expected from the user
to set the OSD capacity.

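If the capacity of an OSD needs to be re-measured, one option is to force the
benchmark to run again at the next OSD startup by setting
:confval:`osd_mclock_force_run_benchmark_on_init` (``osd.0`` is shown here as
an example):

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_force_run_benchmark_on_init true
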
.. note:: If you wish to manually benchmark OSD(s) or manually tune the
   Bluestore throttle parameters, see section
   `Steps to Manually Benchmark an OSD (Optional)`_.

You may verify the capacity of an OSD after the cluster is brought up by using
the following command:

.. prompt:: bash #

   ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]

For example, the following command shows the max capacity for "osd.0" on a Ceph
node whose underlying device type is SSD:

.. prompt:: bash #

   ceph config show osd.0 osd_mclock_max_capacity_iops_ssd


Steps to Manually Benchmark an OSD (Optional)
=============================================

.. note:: These steps are only necessary if you want to override the OSD
   capacity already determined automatically during OSD initialization.
   Otherwise, you may skip this section entirely.

.. tip:: If you have already determined the benchmark data and wish to manually
   override the max OSD capacity for an OSD, you may skip to section
   `Specifying Max OSD Capacity`_.


Any existing benchmarking tool can be used for this purpose. In this case, the
steps use the *Ceph OSD Bench* command described in the next section. Regardless
of the tool/command used, the steps outlined further below remain the same.

As already described in the :ref:`dmclock-qos` section, the number of
shards and the BlueStore throttle parameters have an impact on the mclock op
queues. Therefore, it is critical to set these values carefully in order to
maximize the impact of the mclock scheduler.

:Number of Operational Shards:
  We recommend using the default number of shards as defined by the
  configuration options ``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
  ``osd_op_num_shards_ssd``. In general, a lower number of shards will increase
  the impact of the mclock queues.

:Bluestore Throttle Parameters:
  We recommend using the default values as defined by
  :confval:`bluestore_throttle_bytes` and
  :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be
  determined during the benchmarking phase as described below.

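The values of these options that are currently in effect on a given OSD
(``osd.0`` is used below as an example) can be checked before benchmarking,
for instance:

.. prompt:: bash #

   ceph config show osd.0 osd_op_num_shards_ssd

.. prompt:: bash #

   ceph config show osd.0 bluestore_throttle_bytes
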
OSD Bench Command Syntax
------------------------

The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
used for benchmarking is shown below:

.. prompt:: bash #

   ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]

where:

* ``TOTAL_BYTES``: Total number of bytes to write
* ``BYTES_PER_WRITE``: Block size per write
* ``OBJ_SIZE``: Bytes per object
* ``NUM_OBJS``: Number of objects to write

Benchmarking Test Steps Using OSD Bench
---------------------------------------

The steps below use the default number of shards and detail how to determine
the correct BlueStore throttle values (optional).

#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that
   you wish to benchmark.
#. Run a simple 4KiB random write workload on an OSD using the following
   commands:

   .. note:: Before running the test, caches must be cleared to get an
      accurate measurement.

   For example, if you are running the benchmark test on osd.0, run the
   following commands:

   .. prompt:: bash #

      ceph tell osd.0 cache drop

   .. prompt:: bash #

      ceph tell osd.0 bench 12288000 4096 4194304 100

#. Note the overall throughput (IOPS) obtained from the output of the osd bench
   command. This value is the baseline throughput (IOPS) when the default
   bluestore throttle options are in effect.
#. If the intent is to determine the bluestore throttle values for your
   environment, then set the two options, :confval:`bluestore_throttle_bytes`
   and :confval:`bluestore_throttle_deferred_bytes`, to 32 KiB (32768 bytes)
   each to begin with, as shown in the example following this list. Otherwise,
   you may skip to the next section.
#. Run the 4KiB random write test as before using OSD bench.
#. Note the overall throughput from the output and compare the value
   against the baseline throughput recorded in step 3.
#. If the throughput does not match the baseline, double the bluestore
   throttle options and repeat steps 5 through 7 until the obtained
   throughput is very close to the baseline value.

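As a concrete illustration of step 4, the two throttle options may be set to
the 32 KiB starting values on the OSD being benchmarked (``osd.0`` is used here
as an example):

.. prompt:: bash #

   ceph config set osd.0 bluestore_throttle_bytes 32768

.. prompt:: bash #

   ceph config set osd.0 bluestore_throttle_deferred_bytes 32768
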
For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
for both bluestore throttle and deferred bytes was determined to maximize the
impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
overall throughput was roughly equal to the baseline throughput. Note that in
general for HDDs, the bluestore throttle values are expected to be higher when
compared to SSDs.


Specifying Max OSD Capacity
----------------------------

The steps in this section may be performed only if you want to override the
max OSD capacity automatically set during OSD initialization. The option
``osd_mclock_max_capacity_iops_[hdd, ssd]`` for an OSD can be set by running the
following command:

.. prompt:: bash #

   ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value>

For example, the following command sets the max capacity for a specific OSD
(say "osd.0") whose underlying device type is HDD to 350 IOPS:

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350

Alternatively, you may specify the max capacity for OSDs within the Ceph
configuration file under the respective [osd.N] section. See
:ref:`ceph-conf-settings` for more details.


.. index:: mclock; config settings

mClock Config Options
=====================

.. confval:: osd_mclock_profile
.. confval:: osd_mclock_max_capacity_iops_hdd
.. confval:: osd_mclock_max_capacity_iops_ssd
.. confval:: osd_mclock_cost_per_io_usec
.. confval:: osd_mclock_cost_per_io_usec_hdd
.. confval:: osd_mclock_cost_per_io_usec_ssd
.. confval:: osd_mclock_cost_per_byte_usec
.. confval:: osd_mclock_cost_per_byte_usec_hdd
.. confval:: osd_mclock_cost_per_byte_usec_ssd
.. confval:: osd_mclock_force_run_benchmark_on_init
.. confval:: osd_mclock_skip_benchmark

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf