=========================================
 QoS Study with mClock and WPQ Schedulers
=========================================

Introduction
============

The mClock scheduler provides three controls for each service using it. In
Ceph, the services using mClock include client I/O, background recovery, scrub,
snap trim and PG deletes. The three controls, namely *reservation*, *weight*
and *limit*, are used for predictable allocation of resources to each service
in proportion to its weight, subject to the constraint that the service
receives at least its reservation and no more than its limit. In Ceph, these
controls are used to allocate IOPS for each service type, provided that the
IOPS capacity of each OSD is known. The mClock scheduler is based on
`the dmClock algorithm`_. See the :ref:`dmclock-qos` section for more details.

Ceph's use of mClock was primarily experimental and approached with an
exploratory mindset. This remains true for other organizations and individuals
who continue to either use the code base or modify it according to their needs.

DmClock exists in its own repository_. Before the Ceph *Pacific* release,
mClock could be enabled by setting the :confval:`osd_op_queue` Ceph option to
"mclock_scheduler". Additional mClock parameters like *reservation*, *weight*
and *limit* for each service type could be set using Ceph options. For example,
``osd_mclock_scheduler_client_[res,wgt,lim]`` is one such option. See the
:ref:`dmclock-qos` section for more details. Even with all the mClock options
set, the full capability of mClock could not be realized due to:

- Unknown OSD capacity in terms of throughput (IOPS).
- No limit enforcement. In other words, services using mClock were allowed to
  exceed their limits, resulting in the desired QoS goals not being met.
- The share of each service type was not distributed across the number of
  operational shards.

To resolve the above, refinements were made to the mClock scheduler in the Ceph
code base. See :doc:`/rados/configuration/mclock-config-ref`. With the
refinements, the usage of mClock is a bit more user-friendly and intuitive. This
is one step of many to refine and optimize the way mClock is used in Ceph.

Overview
========

A comparison study was performed as part of efforts to refine the mClock
scheduler. The study involved running tests with client ops and background
recovery operations in parallel with the two schedulers. The results were
collated and then compared. The following statistics were compared between the
schedulers from the test results for each service type:

- External client

  - Average throughput (IOPS)
  - Average and percentile (95th, 99th, 99.5th) latencies

- Background recovery

  - Average recovery throughput
  - Number of misplaced objects recovered per second

Test Environment
================

1. **Software Configuration**: CentOS 8.1.1911, Linux Kernel 4.18.0-193.6.3.el8_2.x86_64
2. **CPU**: 2 x Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz
3. **nproc**: 40
4. **System Memory**: 64 GiB
5. **Tuned-adm Profile**: network-latency
6. **Ceph Version**: 17.0.0-2125-g94f550a87f (94f550a87fcbda799afe9f85e40386e6d90b232e) quincy (dev)
7. **Storage**:

   - Intel® NVMe SSD DC P3700 Series (SSDPE2MD800G4) [4 x 800GB]
   - Seagate Constellation 7200 RPM 64MB Cache SATA 6.0Gb/s HDD (ST91000640NS) [4 x 1TB]

Test Methodology
================

Ceph cbt_ was used to test the recovery scenarios. A new recovery test to
generate background recoveries with client I/Os in parallel was created.
See the next section for the detailed test steps. The test was executed 3 times
with the default *Weighted Priority Queue (WPQ)* scheduler for comparison
purposes. This was done to establish a credible mean value against which the
mClock scheduler results could be compared at a later point.

After this, the same test was executed with the mClock scheduler and with
different mClock profiles, i.e., *high_client_ops*, *balanced* and
*high_recovery_ops*, and the results were collated for comparison. With each
profile, the test was executed 3 times, and the average of those runs is
reported in this study.

.. note:: Tests with HDDs were performed with and without the bluestore WAL and
   DB configured. The charts discussed further below help bring out the
   comparison across the schedulers and their configurations.

Establish Baseline Client Throughput (IOPS)
===========================================

Before the actual recovery tests, the baseline throughput was established for
both the SSDs and the HDDs on the test machine by following the steps mentioned
in the :doc:`/rados/configuration/mclock-config-ref` document under
the "Benchmarking Test Steps Using CBT" section. For this study, the following
baseline throughput for each device type was determined:

+--------------------------------------+-------------------------------------------+
| Device Type                          | Baseline Throughput (4 KiB Random Writes) |
+======================================+===========================================+
| **NVMe SSD**                         | 21500 IOPS (84 MiB/s)                     |
+--------------------------------------+-------------------------------------------+
| **HDD (with bluestore WAL & DB)**    | 340 IOPS (1.33 MiB/s)                     |
+--------------------------------------+-------------------------------------------+
| **HDD (without bluestore WAL & DB)** | 315 IOPS (1.23 MiB/s)                     |
+--------------------------------------+-------------------------------------------+

.. note:: The :confval:`bluestore_throttle_bytes` and
   :confval:`bluestore_throttle_deferred_bytes` options for the SSDs were each
   determined to be 256 KiB. For the HDDs, the value was 40 MiB. The above
   throughput was obtained by running 4 KiB random writes at a queue depth of
   64 for 300 secs.

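For readers who want to approximate the baseline measurement outside of cbt, a
fio job along the following lines reproduces the 4 KiB random-write, queue
depth 64, 300 second workload described in the note above. This is a sketch
only: the device path is an assumption, the run is destructive to that device,
and the authoritative procedure remains the cbt-based one referenced earlier.

.. code-block:: bash

   # Approximate the baseline 4 KiB random-write workload (destructive!).
   # /dev/nvme0n1 is a placeholder; point it at a scratch device.
   fio --name=baseline-4k-randwrite \
       --filename=/dev/nvme0n1 \
       --ioengine=libaio --direct=1 \
       --rw=randwrite --bs=4k --iodepth=64 \
       --numjobs=1 --runtime=300 --time_based
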
MClock Profile Allocations
==========================

The low-level mClock shares per profile are shown in the tables below. For
parameters like *reservation* and *limit*, the shares are represented as a
percentage of the total OSD capacity. For the *high_client_ops* profile, the
*reservation* parameter is set to 50% of the total OSD capacity. Therefore, for
the NVMe device (baseline 21500 IOPS), a minimum of 10750 IOPS is reserved for
client operations. These allocations are made under the hood once a profile is
enabled.

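As a quick sanity check of the arithmetic above, the reserved IOPS for a
service can be derived directly from the baseline capacity and the profile
percentage. The snippet below only illustrates that calculation; the values are
the NVMe baseline and the *high_client_ops* client reservation used in this
study.

.. code-block:: bash

   # Derive the reserved client IOPS from the OSD capacity and the profile's
   # reservation percentage (illustrative values from this study).
   OSD_CAPACITY_IOPS=21500   # measured NVMe baseline
   CLIENT_RES_PCT=50         # high_client_ops client reservation
   echo $(( OSD_CAPACITY_IOPS * CLIENT_RES_PCT / 100 ))   # prints 10750
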
The *weight* parameter is unitless. See :ref:`dmclock-qos`.

high_client_ops (default)
`````````````````````````

This profile allocates more reservation and limit to external client ops when
compared to background recoveries and other internal clients within Ceph. This
profile is enabled by default.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 25%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background best effort | 25%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

balanced
````````

This profile allocates equal reservations to client ops and background
recovery ops. The internal best effort clients get a lower reservation but a
very high limit so that they can complete quickly if there are no competing
services.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 40%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | 150%  |
+------------------------+-------------+--------+-------+
| background best effort | 20%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

high_recovery_ops
`````````````````

This profile allocates more reservation to background recoveries when compared
to external clients and other internal clients within Ceph. For example, an
admin may enable this profile temporarily to speed up background recoveries
during non-peak hours (an example follows the table below).

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | 80%   |
+------------------------+-------------+--------+-------+
| background recovery    | 60%         | 2      | 200%  |
+------------------------+-------------+--------+-------+
| background best effort | 1 (MIN)     | 2      | MAX   |
+------------------------+-------------+--------+-------+
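
Switching profiles is done through a single Ceph option. As a minimal sketch
(assuming a running cluster where the option is applied to all OSDs via the
``osd`` config section), an admin could temporarily favor recoveries and then
restore the default profile afterwards:

.. code-block:: bash

   # Temporarily favor background recoveries during a maintenance window.
   ceph config set osd osd_mclock_profile high_recovery_ops

   # Once recoveries are done, restore the default profile.
   ceph config set osd osd_mclock_profile high_client_ops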

custom
``````

The custom profile allows the user to have complete control over the mClock
and Ceph configuration parameters. To use this profile, the user must have a
deep understanding of the workings of Ceph and the mClock scheduler. All the
*reservation*, *weight* and *limit* parameters of the different service types
must be set manually along with any Ceph option(s). This profile may be used
for experimental and exploratory purposes or if the built-in profiles do not
meet the requirements. In such cases, adequate testing must be performed prior
to enabling this profile.

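As an illustration only, manually setting the client shares under the custom
profile might look like the following. The option names are the ones referenced
earlier in this document; the numeric values are placeholders, and the expected
units for the *reservation* and *limit* options depend on the Ceph release, so
consult :doc:`/rados/configuration/mclock-config-ref` before applying anything
like this.

.. code-block:: bash

   # Hypothetical example: switch to the custom profile, then set the mClock
   # shares for the client service type by hand. Values are placeholders.
   ceph config set osd osd_mclock_profile custom
   ceph config set osd osd_mclock_scheduler_client_res 3000
   ceph config set osd osd_mclock_scheduler_client_wgt 2
   ceph config set osd osd_mclock_scheduler_client_lim 6000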

Recovery Test Steps
===================

Before bringing up the Ceph cluster, the following mClock configuration
parameters were set appropriately based on the baseline throughput obtained in
the previous section:

- :confval:`osd_mclock_max_capacity_iops_hdd`
- :confval:`osd_mclock_max_capacity_iops_ssd`
- :confval:`osd_mclock_profile`

See :doc:`/rados/configuration/mclock-config-ref` for more details.
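
As a concrete illustration, these options can be set cluster-wide with commands
along the following lines. This is a sketch, not the exact procedure used by
the test harness; the IOPS values shown are the baselines measured earlier in
this study, and applying them to the ``osd`` config section is an assumption
about how a reader might reproduce the setup.

.. code-block:: bash

   # Publish the measured OSD capacities and pick a profile before the test.
   ceph config set osd osd_mclock_max_capacity_iops_ssd 21500
   ceph config set osd osd_mclock_max_capacity_iops_hdd 340
   ceph config set osd osd_mclock_profile high_client_ops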

Test Steps (Using cbt)
``````````````````````

1. Bring up the Ceph cluster with 4 OSDs.
2. Configure the OSDs with a replication factor of 3.
3. Create a recovery pool to populate recovery data.
4. Create a client pool and prefill some objects in it.
5. Create the recovery thread and mark an OSD down and out (a manual
   equivalent of this step is sketched after this list).
6. After the cluster handles the OSD down event, recovery data is prefilled
   into the recovery pool. For the tests involving SSDs, prefill 100K 4 MiB
   objects into the recovery pool. For the tests involving HDDs, prefill 5K
   4 MiB objects into the recovery pool.
7. After the prefill stage is completed, the downed OSD is brought up and in.
   The backfill phase starts at this point.
8. As soon as the backfill/recovery starts, the test proceeds to initiate
   client I/O on the client pool on another thread using a single client.
9. During step 8 above, statistics related to the client latency and
   bandwidth are captured by cbt. The test also captures the total number of
   misplaced objects and the number of misplaced objects recovered per second.

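The OSD down/out and up/in transitions in steps 5 and 7 are driven by cbt in
the actual test, but their manual equivalent is sketched below for reference.
The OSD id and the systemd unit name are assumptions about a typical
deployment.

.. code-block:: bash

   # Step 5 (manual equivalent): take osd.0 down and out of the cluster.
   sudo systemctl stop ceph-osd@0
   ceph osd out 0

   # Step 7 (manual equivalent): bring osd.0 back up and in; backfill starts.
   sudo systemctl start ceph-osd@0
   ceph osd in 0
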
To summarize, the steps above create 2 pools during the test. Recovery is
triggered on one pool and client I/O is triggered on the other simultaneously.
Statistics captured during the tests are discussed below.


Non-Default Ceph Recovery Options
`````````````````````````````````

Apart from the non-default bluestore throttle already mentioned above, the
following Ceph recovery-related options were modified for tests with both the
WPQ and mClock schedulers.

- :confval:`osd_max_backfills` = 1000
- :confval:`osd_recovery_max_active` = 1000
- :confval:`osd_async_recovery_min_cost` = 1

The above options set a high limit on the number of concurrent local and remote
backfill operations per OSD. Under these conditions, the capability of the
mClock scheduler was tested and the results are discussed below.
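
For reference, the equivalent runtime commands for these non-default settings
would look roughly as follows. This is a sketch; in the study the options were
applied through the test configuration rather than interactively.

.. code-block:: bash

   # Raise the recovery/backfill limits so that recoveries are effectively
   # unthrottled, leaving QoS enforcement to the scheduler under test.
   ceph config set osd osd_max_backfills 1000
   ceph config set osd osd_recovery_max_active 1000
   ceph config set osd osd_async_recovery_min_cost 1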

Test Results
============

Test Results With NVMe SSDs
```````````````````````````

Client Throughput Comparison
----------------------------

The chart below shows the average client throughput comparison across the
schedulers and their respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_NVMe_SSD_WPQ_vs_mClock.png


WPQ(Def) in the chart shows the average client throughput obtained using the
WPQ scheduler with all other Ceph configuration settings set to default values.
The default setting for :confval:`osd_max_backfills` limits the number of
concurrent local and remote backfills or recoveries per OSD to 1. As a result,
the average client throughput obtained is impressive at just over 18000 IOPS
when compared to the baseline value of 21500 IOPS.

However, with the WPQ scheduler along with the non-default options mentioned in
the section `Non-Default Ceph Recovery Options`_, things are quite different as
shown in the chart for WPQ(BST). In this case, the average client throughput
obtained drops dramatically to only 2544 IOPS. The non-default recovery options
clearly had a significant impact on the client throughput. In other words,
recovery operations overwhelmed the client operations. Sections further below
discuss the recovery rates under these conditions.

With the non-default options, the same test was executed with mClock and with
the default profile (*high_client_ops*) enabled. As per the profile allocation,
the reservation goal of 50% (10750 IOPS) was met, with an average throughput of
11209 IOPS during the course of recovery operations. This is more than 4x the
throughput obtained with WPQ(BST).

Similar throughput was obtained with the *balanced* (11017 IOPS) and
*high_recovery_ops* (11153 IOPS) profiles, as seen in the chart above. This
clearly demonstrates that mClock is able to provide the desired QoS for the
client with multiple concurrent backfill/recovery operations in progress.

Client Latency Comparison
-------------------------

The chart below shows the average completion latency (*clat*) along with the
average 95th, 99th and 99.5th percentiles across the schedulers and their
respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_NVMe_SSD_WPQ_vs_mClock.png

The average *clat* latency obtained with WPQ(Def) was 3.535 msec. But in this
case the number of concurrent recoveries was very much limited, at an average
of around 97 objects/sec or ~388 MiB/s, and this was a major contributing
factor to the low latency seen by the client.

With WPQ(BST) and with the non-default recovery options, things are very
different, with the average *clat* latency shooting up to almost 25 msec, which
is 7x worse! This is due to the high number of concurrent recoveries, which was
measured to be ~350 objects/sec or ~1.4 GiB/s and is close to the maximum OSD
bandwidth.

With mClock enabled and with the default *high_client_ops* profile, the average
*clat* latency was 5.688 msec, which is impressive considering the high number
of concurrent active background backfills/recoveries. The recovery rate was
throttled down by mClock to an average of 80 objects/sec or ~320 MiB/s
according to the minimum profile allocation of 25% of the maximum OSD
bandwidth, thus allowing the client operations to meet the QoS goal.

With the other profiles like *balanced* and *high_recovery_ops*, the average
client *clat* latency didn't change much and stayed between 5.7 and 5.8 msec,
with variations in the average percentile latencies as observed in the chart
above.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Perhaps a more interesting chart is the comparison chart shown above that
tracks the average *clat* latency variations through the duration of the test.
The chart shows the differences in the average latency between the WPQ
scheduler and the mClock profiles. During the initial phase of the test, for
about 150 secs, the differences in the average latency between the WPQ
scheduler and the mClock profiles are quite evident and self-explanatory. The
*high_client_ops* profile shows the lowest latency, followed by the *balanced*
and *high_recovery_ops* profiles. WPQ(BST) had the highest average latency
through the course of the test.

Recovery Statistics Comparison
------------------------------

Another important aspect to consider is how the recovery bandwidth and recovery
time are affected by mClock profile settings. The chart below outlines the
recovery rates and times for each mClock profile and how they differ from the
WPQ scheduler. The total number of objects to be recovered in all the cases was
around 75000 objects, as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Intuitively, the *high_client_ops* profile should impact recovery operations
the most, and this is indeed the case, as it took an average of 966 secs for
the recovery to complete at 80 objects/sec. The recovery bandwidth, as
expected, was the lowest at an average of ~320 MiB/s.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_NVMe_SSD_WPQ_vs_mClock.png

The *balanced* profile provides a good middle ground by allocating the same
reservation and weight to client and recovery operations. The recovery rate
curve falls between the *high_recovery_ops* and *high_client_ops* curves, with
an average bandwidth of ~480 MiB/s, taking an average of ~647 secs at ~120
objects/sec to complete the recovery.

The *high_recovery_ops* profile provides the fastest way to complete recovery
operations at the expense of other operations. The recovery bandwidth, at ~635
MiB/s, was nearly 2x the bandwidth observed using the *high_client_ops*
profile. The average object recovery rate was ~159 objects/sec and the recovery
completed the fastest, in approximately 488 secs.

Test Results With HDDs (WAL and DB configured)
``````````````````````````````````````````````

The recovery tests were performed on HDDs with the bluestore WAL and DB
configured on faster NVMe SSDs. The baseline throughput measured was 340 IOPS.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput comparison for WPQ and the mClock profiles is
shown in the chart below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_WALdB_WPQ_vs_mClock.png

With WPQ(Def), the average client throughput obtained was ~308 IOPS since the
number of concurrent recoveries was very much limited. The average *clat*
latency was ~208 msec.

However, for WPQ(BST), due to the concurrent recoveries, client throughput was
affected significantly, dropping to 146 IOPS with an average *clat* latency of
433 msec.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_WALdB_WPQ_vs_mClock.png

With the *high_client_ops* profile, mClock was able to meet the QoS requirement
for client operations with an average throughput of 271 IOPS, which is nearly
80% of the baseline throughput, at an average *clat* latency of 235 msec.

For the *balanced* and *high_recovery_ops* profiles, the average client
throughput came down marginally to ~248 IOPS and ~240 IOPS respectively. The
average *clat* latency, as expected, increased to ~258 msec and ~265 msec
respectively.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_WALdB_WPQ_vs_mClock.png

The *clat* latency comparison chart above provides a more comprehensive insight
into the differences in latency through the course of the test. As observed in
the NVMe SSD case, the *high_client_ops* profile shows the lowest latency in
the HDD case as well, followed by the *balanced* and *high_recovery_ops*
profiles. It is fairly easy to discern this between the profiles during the
first 200 secs of the test.

Recovery Statistics Comparison
------------------------------

The charts below compare the recovery rates and times. The total number of
objects to be recovered in all the cases using HDDs with WAL and DB was around
4000 objects, as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_WALdB_WPQ_vs_mClock.png

As expected, the *high_client_ops* profile impacts recovery operations the
most, as it took an average of ~1409 secs for the recovery to complete at ~3
objects/sec. The recovery bandwidth, as expected, was the lowest at ~11 MiB/s.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_WALdB_WPQ_vs_mClock.png

The *balanced* profile, as expected, provides a decent compromise, with an
average bandwidth of ~16.5 MiB/s, taking an average of ~966 secs at ~4
objects/sec to complete the recovery.

The *high_recovery_ops* profile is the fastest, with nearly 2x the bandwidth at
~21 MiB/s when compared to the *high_client_ops* profile. The average object
recovery rate was ~5 objects/sec and the recovery completed in approximately
747 secs. This is somewhat similar to the recovery time observed with WPQ(Def),
which completed in 647 secs with a bandwidth of 23 MiB/s at a rate of 5.8
objects/sec.

Test Results With HDDs (No WAL and DB configured)
`````````````````````````````````````````````````

The recovery tests were also performed on HDDs without the bluestore WAL and
DB configured. The baseline throughput measured was 315 IOPS.

This type of configuration, without the WAL and DB, is probably rare, but
testing was nevertheless performed to get a sense of how mClock performs in a
very restrictive environment where the OSD capacity is at the lower end.
The sections and charts below are very similar to the ones presented above and
are provided here for reference.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput, latency and percentiles are compared as before
in the set of charts shown below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Recovery Statistics Comparison
------------------------------

The recovery rates and times are shown in the charts below.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Key Takeaways and Conclusion
============================

- mClock is able to provide the desired QoS using profiles to allocate proper
  *reservation*, *weight* and *limit* shares to the service types.
- By using the cost per I/O and the cost per byte parameters, mClock can
  schedule operations appropriately for the different device types (SSD/HDD).

The study so far shows promising results with the refinements made to the
mClock scheduler. Additional refinements to mClock and profile tuning are
planned. Further improvements will also be based on feedback from broader
testing on larger clusters and with different workloads.

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
.. _repository: https://github.com/ceph/dmclock
.. _cbt: https://github.com/ceph/cbt