=========================================
QoS Study with mClock and WPQ Schedulers
=========================================

Introduction
============

The mClock scheduler provides three controls for each service using it. In
Ceph, the services using mClock include client I/O, background recovery,
scrub, snap trim and PG deletion. The three controls, *reservation*, *weight*
and *limit*, are used for predictable allocation of resources to each service
in proportion to its weight, subject to the constraint that the service
receives at least its reservation and no more than its limit. In Ceph, these
controls are used to allocate IOPS for each service type provided the IOPS
capacity of each OSD is known. The mClock scheduler is based on
`the dmClock algorithm`_. See :ref:`dmclock-qos` section for more details.

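To make the semantics of the three controls concrete, the sketch below models
only their intent for a hypothetical OSD: share capacity in proportion to
weight, but never below the reservation and never above the limit. This is
*not* the dmClock tag-based scheduling algorithm, and the service names and
numbers are illustrative only.

.. code-block:: python

   def illustrate_shares(capacity_iops, services):
       """Toy steady-state split of an OSD's IOPS budget.

       services maps a service name to (reservation_iops, weight, limit_iops).
       """
       total_weight = sum(w for _, w, _ in services.values())
       shares = {}
       for name, (res, wgt, lim) in services.items():
           # Weight-proportional share of the full capacity ...
           share = capacity_iops * wgt / total_weight
           # ... clamped so the service receives at least its reservation
           # and no more than its limit.
           shares[name] = max(res, min(share, lim))
       return shares

   # Hypothetical OSD rated at 1000 IOPS shared by two services.
   print(illustrate_shares(1000, {
       "client":   (500, 2, float("inf")),   # reservation 500 IOPS, no limit
       "recovery": (250, 1, 1000),           # reservation 250 IOPS, limit 1000
   }))
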
Ceph's use of mClock was primarily experimental and approached with an
exploratory mindset. This remains true for other organizations and individuals
who continue to either use the codebase or modify it according to their needs.

dmClock exists in its own repository_. Before the Ceph *Pacific* release,
mClock could be enabled by setting the :confval:`osd_op_queue` Ceph option to
"mclock_scheduler". Additional mClock parameters like *reservation*, *weight*
and *limit* for each service type could be set using Ceph options.
For example, ``osd_mclock_scheduler_client_[res,wgt,lim]`` are some such
options. See :ref:`dmclock-qos` section for more details. Even with all the
mClock options set, the full capability of mClock could not be realized due
to:

- Unknown OSD capacity in terms of throughput (IOPS).
- No limit enforcement. In other words, services using mClock were allowed to
  exceed their limits, resulting in the desired QoS goals not being met.
- The share of each service type was not distributed across the number of
  operational shards.

To resolve the above, refinements were made to the mClock scheduler in the
Ceph codebase. See :doc:`/rados/configuration/mclock-config-ref`. With the
refinements, the usage of mClock is a bit more user-friendly and intuitive.
This is one step of many to refine and optimize the way mClock is used in
Ceph.

Overview
========

A comparison study was performed as part of the efforts to refine the mClock
scheduler. The study involved running tests with client ops and background
recovery operations in parallel with the two schedulers. The results were
collated and then compared. The following statistics from the test results
were compared between the schedulers for each service type:

- External client

  - Average throughput (IOPS)
  - Average and percentile (95th, 99th, 99.5th) latency

- Background recovery

  - Average recovery throughput
  - Number of misplaced objects recovered per second

Test Environment
================

1. **Software Configuration**: CentOS 8.1.1911, Linux kernel 4.18.0-193.6.3.el8_2.x86_64
2. **CPU**: 2 x Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz
3. **nproc**: 40
4. **System Memory**: 64 GiB
5. **Tuned-adm Profile**: network-latency
6. **Ceph Version**: 17.0.0-2125-g94f550a87f (94f550a87fcbda799afe9f85e40386e6d90b232e) quincy (dev)
7. **Storage**:

   - Intel® NVMe SSD DC P3700 Series (SSDPE2MD800G4) [4 x 800GB]
   - Seagate Constellation 7200 RPM 64MB Cache SATA 6.0Gb/s HDD (ST91000640NS) [4 x 1TB]

Test Methodology
================

Ceph cbt_ was used to test the recovery scenarios. A new recovery test to
generate background recoveries with client I/Os in parallel was created.
See the next section for the detailed test steps. The test was executed 3
times with the default *Weighted Priority Queue (WPQ)* scheduler for
comparison purposes. This was done to establish a credible mean value against
which the mClock scheduler results could be compared at a later point.

After this, the same test was executed with the mClock scheduler and with the
different mClock profiles, i.e., *high_client_ops*, *balanced* and
*high_recovery_ops*, and the results were collated for comparison. With each
profile, the test was executed 3 times, and the average of those runs is
reported in this study.

.. note:: Tests with HDDs were performed with and without the bluestore WAL
          and DB configured. The charts discussed further below help bring
          out the comparison across the schedulers and their configurations.

Establish Baseline Client Throughput (IOPS)
===========================================

Before the actual recovery tests, the baseline throughput was established for
both the SSDs and the HDDs on the test machine by following the steps
mentioned in the :doc:`/rados/configuration/mclock-config-ref` document under
the "Benchmarking Test Steps Using CBT" section. For this study, the following
baseline throughput for each device type was determined:

+--------------------------------------+---------------------------------------------+
| Device Type                          | Baseline Throughput (@4 KiB Random Writes)  |
+======================================+=============================================+
| **NVMe SSD**                         | 21500 IOPS (84 MiB/s)                       |
+--------------------------------------+---------------------------------------------+
| **HDD (with bluestore WAL & DB)**    | 340 IOPS (1.33 MiB/s)                       |
+--------------------------------------+---------------------------------------------+
| **HDD (without bluestore WAL & DB)** | 315 IOPS (1.23 MiB/s)                       |
+--------------------------------------+---------------------------------------------+

.. note:: The :confval:`bluestore_throttle_bytes` and
          :confval:`bluestore_throttle_deferred_bytes` options for SSDs were
          determined to be 256 KiB. For HDDs, the value determined was 40 MiB.
          The above throughput was obtained by running 4 KiB random writes at
          a queue depth of 64 for 300 secs.

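As a quick sanity check, the bandwidth figures in the table above follow
directly from the IOPS figures and the 4 KiB block size used for the baseline
runs. The short sketch below reproduces the conversion using only the numbers
from the table.

.. code-block:: python

   def iops_to_mib_per_sec(iops, block_size_kib=4):
       """Convert an IOPS figure at a fixed block size to MiB/s."""
       return iops * block_size_kib / 1024

   for device, iops in [("NVMe SSD", 21500),
                        ("HDD (with WAL & DB)", 340),
                        ("HDD (without WAL & DB)", 315)]:
       print(f"{device}: {iops_to_mib_per_sec(iops):.2f} MiB/s")
   # NVMe SSD: 83.98 MiB/s
   # HDD (with WAL & DB): 1.33 MiB/s
   # HDD (without WAL & DB): 1.23 MiB/s
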
mClock Profile Allocations
==========================

The low-level mClock shares per profile are shown in the tables below. For
the *reservation* and *limit* parameters, the shares are represented as a
percentage of the total OSD capacity. For the *high_client_ops* profile, the
*reservation* parameter is set to 50% of the total OSD capacity. Therefore,
for the NVMe device (baseline 21500 IOPS), a minimum of 10750 IOPS is reserved
for client operations. These allocations are made under the hood once a
profile is enabled.

The *weight* parameter is unitless. See :ref:`dmclock-qos`.

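The percentage shares in the profile tables below translate into absolute
IOPS once an OSD's capacity is known. The short sketch below works through
that arithmetic for the NVMe baseline used in this study; the percentages are
taken from the *high_client_ops* table that follows.

.. code-block:: python

   OSD_CAPACITY_IOPS = 21500  # NVMe baseline measured above

   # (reservation fraction, limit fraction) per service for high_client_ops;
   # None stands for MAX (no limit enforced).
   HIGH_CLIENT_OPS = {
       "client":                 (0.50, None),
       "background recovery":    (0.25, 1.00),
       "background best effort": (0.25, None),
   }

   for service, (res_pct, lim_pct) in HIGH_CLIENT_OPS.items():
       res = OSD_CAPACITY_IOPS * res_pct
       lim = "MAX" if lim_pct is None else f"{OSD_CAPACITY_IOPS * lim_pct:.0f} IOPS"
       print(f"{service}: reservation {res:.0f} IOPS, limit {lim}")
   # client: reservation 10750 IOPS, limit MAX
   # background recovery: reservation 5375 IOPS, limit 21500 IOPS
   # background best effort: reservation 5375 IOPS, limit MAX
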
high_client_ops (default)
`````````````````````````

This profile allocates more reservation and limit to external client ops
when compared to background recoveries and other internal clients within
Ceph. This profile is enabled by default.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 25%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background best effort | 25%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

balanced
````````

This profile allocates equal reservations to client ops and background
recovery ops. The internal best effort clients get a lower reservation but a
very high limit so that they can complete quickly when there are no competing
services.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 40%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | 150%  |
+------------------------+-------------+--------+-------+
| background best effort | 20%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

high_recovery_ops
`````````````````

This profile allocates more reservation to background recoveries when
compared to external clients and other internal clients within Ceph. For
example, an admin may enable this profile temporarily to speed up background
recoveries during non-peak hours.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | 80%   |
+------------------------+-------------+--------+-------+
| background recovery    | 60%         | 2      | 200%  |
+------------------------+-------------+--------+-------+
| background best effort | 1 (MIN)     | 2      | MAX   |
+------------------------+-------------+--------+-------+

custom
``````

The custom profile allows the user to have complete control of the mClock
and Ceph config parameters. To use this profile, the user must have a deep
understanding of the workings of Ceph and the mClock scheduler. All the
*reservation*, *weight* and *limit* parameters of the different service types
must be set manually, along with any Ceph option(s). This profile may be used
for experimental and exploratory purposes, or if the built-in profiles do not
meet the requirements. In such cases, adequate testing must be performed
prior to enabling this profile.

Recovery Test Steps
===================

Before bringing up the Ceph cluster, the following mClock configuration
parameters were set appropriately based on the baseline throughput obtained
in the previous section:

- :confval:`osd_mclock_max_capacity_iops_hdd`
- :confval:`osd_mclock_max_capacity_iops_ssd`
- :confval:`osd_mclock_profile`

See :doc:`/rados/configuration/mclock-config-ref` for more details.

Test Steps (Using cbt)
``````````````````````

1. Bring up the Ceph cluster with 4 OSDs.
2. Configure the OSDs with replication factor 3.
3. Create a recovery pool to populate recovery data.
4. Create a client pool and prefill some objects in it.
5. Create the recovery thread and mark an OSD down and out.
6. After the cluster handles the OSD down event, recovery data is prefilled
   into the recovery pool. For the tests involving SSDs, prefill 100K 4 MiB
   objects into the recovery pool. For the tests involving HDDs, prefill 5K
   4 MiB objects into the recovery pool.
7. After the prefill stage is completed, the downed OSD is brought up and in.
   The backfill phase starts at this point.
8. As soon as the backfill/recovery starts, the test proceeds to initiate
   client I/O on the client pool on another thread using a single client.
9. During step 8 above, statistics related to the client latency and
   bandwidth are captured by cbt. The test also captures the total number of
   misplaced objects and the number of misplaced objects recovered per second.

To summarize, the steps above create 2 pools during the test. Recovery is
triggered on one pool and client I/O is triggered simultaneously on the
other. Statistics captured during the tests are discussed below.

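For a sense of scale, the prefill counts in step 6 translate into the
following amounts of data to be recovered, using only the 4 MiB object size
from the test:

.. code-block:: python

   OBJECT_SIZE_MIB = 4

   for test, num_objects in [("SSD test", 100_000), ("HDD test", 5_000)]:
       total_gib = num_objects * OBJECT_SIZE_MIB / 1024
       print(f"{test}: {num_objects} objects -> ~{total_gib:.0f} GiB prefilled")
   # SSD test: 100000 objects -> ~391 GiB prefilled
   # HDD test: 5000 objects -> ~20 GiB prefilled
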
Non-Default Ceph Recovery Options
`````````````````````````````````

Apart from the non-default bluestore throttle already mentioned above, the
following set of Ceph recovery-related options was modified for tests with
both the WPQ and mClock schedulers.

- :confval:`osd_max_backfills` = 1000
- :confval:`osd_recovery_max_active` = 1000
- :confval:`osd_async_recovery_min_cost` = 1

The above options set a high limit on the number of concurrent local and
remote backfill operations per OSD. Under these conditions, the capability of
the mClock scheduler was tested and the results are discussed below.

Test Results
============

Test Results With NVMe SSDs
```````````````````````````

Client Throughput Comparison
----------------------------

The chart below shows the average client throughput comparison across the
schedulers and their respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_NVMe_SSD_WPQ_vs_mClock.png

WPQ(Def) in the chart shows the average client throughput obtained using the
WPQ scheduler with all other Ceph configuration settings set to default
values. The default setting for :confval:`osd_max_backfills` limits the
number of concurrent local and remote backfills or recoveries per OSD to 1.
As a result, the average client throughput obtained is impressive at just
over 18000 IOPS when compared to the baseline value of 21500 IOPS.

However, with the WPQ scheduler and the non-default options mentioned in
section `Non-Default Ceph Recovery Options`_, things are quite different as
shown in the chart for WPQ(BST). In this case, the average client throughput
drops dramatically to only 2544 IOPS. The non-default recovery options
clearly had a significant impact on the client throughput. In other words,
recovery operations overwhelmed the client operations. Sections further below
discuss the recovery rates under these conditions.

With the non-default options, the same test was executed with mClock and with
the default profile (*high_client_ops*) enabled. As per the profile
allocation, the reservation goal of 50% (10750 IOPS) was met with an average
throughput of 11209 IOPS during the course of the recovery operations. This
is more than 4x the throughput obtained with WPQ(BST).

Similar throughput was obtained with the *balanced* (11017 IOPS) and
*high_recovery_ops* (11153 IOPS) profiles, as seen in the chart above. This
clearly demonstrates that mClock is able to provide the desired QoS for the
client with multiple concurrent backfill/recovery operations in progress.

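The claims above can be verified directly from the profile allocation and the
measured throughput figures; nothing beyond the numbers already quoted is
assumed.

.. code-block:: python

   baseline_iops = 21500
   # high_client_ops reserves 50% of the OSD capacity for client ops.
   print(0.50 * baseline_iops)   # 10750.0, met by the measured 11209 IOPS

   wpq_bst_iops = 2544
   print(11209 / wpq_bst_iops)   # ~4.41, i.e. more than 4x WPQ(BST)
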
Client Latency Comparison
-------------------------

The chart below shows the average completion latency (*clat*) along with the
average 95th, 99th and 99.5th percentiles across the schedulers and their
respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_NVMe_SSD_WPQ_vs_mClock.png

The average *clat* latency obtained with WPQ(Def) was 3.535 msec. In this
case, however, the number of concurrent recoveries was very limited at an
average of around 97 objects/sec or ~388 MiB/s, which was a major
contributing factor to the low latency seen by the client.

With WPQ(BST) and the non-default recovery options, things are very
different, with the average *clat* latency shooting up to almost 25 msec,
which is 7x worse! This is due to the high number of concurrent recoveries,
measured at ~350 objects/sec or ~1.4 GiB/s, which is close to the maximum OSD
bandwidth.

With mClock enabled and with the default *high_client_ops* profile, the
average *clat* latency was 5.688 msec, which is impressive considering the
high number of concurrent active background backfills/recoveries. The
recovery rate was throttled down by mClock to an average of 80 objects/sec or
~320 MiB/s according to the minimum profile allocation of 25% of the maximum
OSD bandwidth, thus allowing the client operations to meet the QoS goal.

With the other profiles, i.e., *balanced* and *high_recovery_ops*, the
average client *clat* latency didn't change much and stayed between 5.7 and
5.8 msec, with variations in the average percentile latency as observed in
the chart above.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Perhaps a more interesting chart is the comparison chart shown above, which
tracks the average *clat* latency variations through the duration of the
test. The chart shows the differences in the average latency between the WPQ
scheduler and the mClock profiles. During the initial phase of the test, for
about 150 secs, the differences in the average latency between the WPQ
scheduler and the mClock profiles are quite evident and self-explanatory. The
*high_client_ops* profile shows the lowest latency, followed by the
*balanced* and *high_recovery_ops* profiles. WPQ(BST) had the highest average
latency through the course of the test.

Recovery Statistics Comparison
------------------------------

Another important aspect to consider is how the recovery bandwidth and
recovery time are affected by the mClock profile settings. The chart below
outlines the recovery rates and times for each mClock profile and how they
differ from the WPQ scheduler. The total number of objects to be recovered in
all the cases was around 75000 objects, as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Intuitively, the *high_client_ops* profile should impact recovery operations
the most, and this is indeed the case: it took an average of 966 secs for the
recovery to complete at 80 objects/sec. The recovery bandwidth, as expected,
was the lowest at an average of ~320 MiB/s.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_NVMe_SSD_WPQ_vs_mClock.png

The *balanced* profile provides a good middle ground by allocating the same
reservation and weight to client and recovery operations. The recovery rate
curve falls between the *high_recovery_ops* and *high_client_ops* curves,
with an average bandwidth of ~480 MiB/s and an average of ~647 secs at ~120
objects/sec to complete the recovery.

The *high_recovery_ops* profile provides the fastest way to complete recovery
operations at the expense of other operations. The recovery bandwidth, at
~635 MiB/s, was nearly 2x the bandwidth observed with the *high_client_ops*
profile. The average object recovery rate was ~159 objects/sec and the
recovery completed the fastest, in approximately 488 secs.

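The recovery figures above are internally consistent: with 4 MiB objects, the
object recovery rate determines the recovery bandwidth, and together with the
~75000 objects to recover it approximates the total recovery time (the
measured times are slightly longer than this idealized estimate).

.. code-block:: python

   TOTAL_OBJECTS = 75000
   OBJECT_SIZE_MIB = 4

   for profile, objects_per_sec in [("high_client_ops", 80),
                                    ("balanced", 120),
                                    ("high_recovery_ops", 159)]:
       bandwidth_mib = objects_per_sec * OBJECT_SIZE_MIB
       est_secs = TOTAL_OBJECTS / objects_per_sec
       print(f"{profile}: ~{bandwidth_mib} MiB/s, ~{est_secs:.0f} secs")
   # high_client_ops: ~320 MiB/s, ~938 secs   (measured: ~966 secs)
   # balanced: ~480 MiB/s, ~625 secs          (measured: ~647 secs)
   # high_recovery_ops: ~636 MiB/s, ~472 secs (measured: ~488 secs)
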
Test Results With HDDs (WAL and DB configured)
``````````````````````````````````````````````

The recovery tests were performed on HDDs with the bluestore WAL and DB
configured on faster NVMe SSDs. The baseline throughput measured was 340 IOPS.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput comparison for WPQ and mClock and its profiles
is shown in the chart below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_WALdB_WPQ_vs_mClock.png

With WPQ(Def), the average client throughput obtained was ~308 IOPS since the
number of concurrent recoveries was very limited. The average *clat* latency
was ~208 msec.

However, for WPQ(BST), client throughput was affected significantly by the
concurrent recoveries, dropping to 146 IOPS with an average *clat* latency of
433 msec.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_WALdB_WPQ_vs_mClock.png

With the *high_client_ops* profile, mClock was able to meet the QoS
requirement for client operations with an average throughput of 271 IOPS,
which is nearly 80% of the baseline throughput, at an average *clat* latency
of 235 msec.

For the *balanced* and *high_recovery_ops* profiles, the average client
throughput came down marginally to ~248 IOPS and ~240 IOPS respectively. The
average *clat* latency, as expected, increased to ~258 msec and ~265 msec
respectively.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_WALdB_WPQ_vs_mClock.png

The *clat* latency comparison chart above provides a more comprehensive
insight into the differences in latency through the course of the test. As
observed in the NVMe SSD case, the *high_client_ops* profile shows the lowest
latency in the HDD case as well, followed by the *balanced* and
*high_recovery_ops* profiles. It's fairly easy to discern this between the
profiles during the first 200 secs of the test.

Recovery Statistics Comparison
------------------------------

The charts below compare the recovery rates and times. The total number of
objects to be recovered in all the cases using HDDs with WAL and DB was
around 4000 objects, as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_WALdB_WPQ_vs_mClock.png

As expected, the *high_client_ops* profile impacts recovery operations the
most: it took an average of ~1409 secs for the recovery to complete at ~3
objects/sec. The recovery bandwidth, as expected, was the lowest at ~11 MiB/s.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_WALdB_WPQ_vs_mClock.png

The *balanced* profile, as expected, provides a decent compromise with an
average bandwidth of ~16.5 MiB/s, taking an average of ~966 secs at ~4
objects/sec to complete the recovery.

The *high_recovery_ops* profile is the fastest, with nearly 2x the bandwidth
at ~21 MiB/s when compared to the *high_client_ops* profile. The average
object recovery rate was ~5 objects/sec and the recovery completed in
approximately 747 secs. This is somewhat similar to the recovery time
observed with WPQ(Def) at 647 secs, with a bandwidth of 23 MiB/s and a rate
of 5.8 objects/sec.

Test Results With HDDs (No WAL and DB configured)
`````````````````````````````````````````````````

The recovery tests were also performed on HDDs without the bluestore WAL and
DB configured. The baseline throughput measured was 315 IOPS.

This type of configuration, without the WAL and DB configured, is probably
rare, but testing was nevertheless performed to get a sense of how mClock
performs in a very restrictive environment where the OSD capacity is at the
lower end. The sections and charts below are very similar to the ones
presented above and are provided here for reference.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput, latency and percentiles are compared as before
in the set of charts shown below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Recovery Statistics Comparison
------------------------------

The recovery rates and times are shown in the charts below.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Key Takeaways and Conclusion
============================

- mClock is able to provide the desired QoS using profiles to allocate proper
  *reservation*, *weight* and *limit* to the service types.
- By using the cost per I/O and the cost per byte parameters, mClock can
  schedule operations appropriately for the different device types (SSD/HDD).

The study so far shows promising results with the refinements made to the
mClock scheduler. Further refinements to mClock and profile tuning are
planned. Additional improvements will also be based on feedback from broader
testing on larger clusters and with different workloads.

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
.. _repository: https://github.com/ceph/dmclock
.. _cbt: https://github.com/ceph/cbt