=========================================
 QoS Study with mClock and WPQ Schedulers
=========================================

Introduction
============

The mClock scheduler provides three controls for each service using it. In
Ceph, the services using mClock include client I/O, background recovery,
scrub, snap trim and PG deletes. The three controls, *weight*, *reservation*
and *limit*, are used for predictable allocation of resources to each service
in proportion to its weight, subject to the constraint that the service
receives at least its reservation and no more than its limit. In Ceph, these
controls are used to allocate IOPS to each service type, provided the IOPS
capacity of each OSD is known. The mClock scheduler is based on
`the dmClock algorithm`_. See the :ref:`dmclock-qos` section for more details.

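
The sketch below gives a rough intuition for how the three controls interact.
It tags incoming requests the way `the dmClock algorithm`_ describes: each
service advances a reservation, proportional-share and limit tag per request,
the scheduler first dispatches requests whose reservation tags are due, and
the remainder are dispatched by proportional-share tag while skipping services
whose limit tag still lies in the future. This is a simplified illustration
only, not the Ceph or dmclock library implementation, and all names and values
below are made up for the example.

.. code-block:: python

   # Simplified mClock-style request tagging (illustration only).
   class Service:
       """Tracks reservation (r), weight (w) and limit (l) tags for one service."""

       def __init__(self, name, reservation, weight, limit):
           self.name = name
           self.r, self.w, self.l = reservation, weight, limit
           self.r_tag = self.p_tag = self.l_tag = 0.0

       def tag(self, now):
           # Each request advances the tags by the inverse of the control value;
           # max(..., now) lets a service that was idle catch up to "now".
           self.r_tag = max(self.r_tag + 1.0 / self.r, now)
           self.p_tag = max(self.p_tag + 1.0 / self.w, now)
           self.l_tag = max(self.l_tag + 1.0 / self.l, now)
           return self.r_tag, self.p_tag, self.l_tag

   # Hypothetical services: the client gets a higher reservation than recovery.
   client = Service("client", reservation=100.0, weight=2.0, limit=1000.0)
   recovery = Service("recovery", reservation=50.0, weight=1.0, limit=200.0)

   for now in (0.0, 0.001, 0.002):
       for svc in (client, recovery):
           print(svc.name, ["%.4f" % t for t in svc.tag(now)])
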

Ceph's use of mClock was primarily experimental and approached with an
exploratory mindset. This is still true of other organizations and individuals
who continue to either use the codebase or modify it according to their needs.

dmClock exists in its own repository_. Before the Ceph *Pacific* release,
mClock could be enabled by setting the :confval:`osd_op_queue` Ceph option to
"mclock_scheduler". Additional mClock parameters like *reservation*, *weight*
and *limit* for each service type could be set using Ceph options. For
example, ``osd_mclock_scheduler_client_[res,wgt,lim]`` is one such option. See
the :ref:`dmclock-qos` section for more details. Even with all the mClock
options set, the full capability of mClock could not be realized due to:

- Unknown OSD capacity in terms of throughput (IOPS).
- No limit enforcement. In other words, services using mClock were allowed to
  exceed their limits, resulting in the desired QoS goals not being met.
- The share of each service type not being distributed across the number of
  operational shards.

To resolve the above, refinements were made to the mClock scheduler in the
Ceph code base. See :doc:`/rados/configuration/mclock-config-ref`. With these
refinements, the usage of mClock is more user-friendly and intuitive. This is
one step of many to refine and optimize the way mClock is used in Ceph.

Overview
========

A comparison study was performed as part of the efforts to refine the mClock
scheduler. The study involved running tests with client ops and background
recovery operations in parallel with the two schedulers. The results were
collated and then compared. The following statistics were compared between the
schedulers from the test results for each service type:

- External client

  - Average throughput (IOPS)
  - Average and percentile (95th, 99th, 99.5th) latency

- Background recovery

  - Average recovery throughput
  - Number of misplaced objects recovered per second

Test Environment
================

1. **Software Configuration**: CentOS 8.1.1911 with Linux kernel 4.18.0-193.6.3.el8_2.x86_64
2. **CPU**: 2 x Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz
3. **nproc**: 40
4. **System Memory**: 64 GiB
5. **Tuned-adm Profile**: network-latency
6. **Ceph Version**: 17.0.0-2125-g94f550a87f (94f550a87fcbda799afe9f85e40386e6d90b232e) quincy (dev)
7. **Storage**:

   - Intel® NVMe SSD DC P3700 Series (SSDPE2MD800G4) [4 x 800GB]
   - Seagate Constellation 7200 RPM 64MB Cache SATA 6.0Gb/s HDD (ST91000640NS) [4 x 1TB]

Test Methodology
================

Ceph cbt_ was used to test the recovery scenarios. A new recovery test to
generate background recoveries with client I/Os in parallel was created.
See the next section for the detailed test steps. The test was executed 3
times with the default *Weighted Priority Queue (WPQ)* scheduler for
comparison purposes. This was done to establish a credible mean value against
which to compare the mClock scheduler results at a later point.

After this, the same test was executed with the mClock scheduler and with
different mClock profiles, i.e., *high_client_ops*, *balanced* and
*high_recovery_ops*, and the results were collated for comparison. With each
profile, the test was executed 3 times, and the average of those runs is
reported in this study.

.. note:: Tests with HDDs were performed with and without the bluestore WAL and
          DB configured. The charts discussed further below help bring out the
          comparison across the schedulers and their configurations.

Establish Baseline Client Throughput (IOPS)
===========================================

Before the actual recovery tests, the baseline throughput was established for
both the SSDs and the HDDs on the test machine by following the steps mentioned
in the :doc:`/rados/configuration/mclock-config-ref` document under
the "Benchmarking Test Steps Using CBT" section. For this study, the following
baseline throughput for each device type was determined:

+--------------------------------------+--------------------------------------------+
| Device Type                          | Baseline Throughput (4 KiB Random Writes)  |
+======================================+============================================+
| **NVMe SSD**                         | 21500 IOPS (84 MiB/s)                      |
+--------------------------------------+--------------------------------------------+
| **HDD (with bluestore WAL & DB)**    | 340 IOPS (1.33 MiB/s)                      |
+--------------------------------------+--------------------------------------------+
| **HDD (without bluestore WAL & DB)** | 315 IOPS (1.23 MiB/s)                      |
+--------------------------------------+--------------------------------------------+

.. note:: The :confval:`bluestore_throttle_bytes` and
          :confval:`bluestore_throttle_deferred_bytes` options for SSDs were
          determined to be 256 KiB. For HDDs, they were 40 MiB. The above
          throughput was obtained by running 4 KiB random writes at a queue
          depth of 64 for 300 secs.

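
As a quick sanity check, the IOPS and bandwidth columns in the table above are
consistent for 4 KiB writes; the short helper below (illustrative arithmetic
only) reproduces the MiB/s figures from the IOPS values.

.. code-block:: python

   # Convert a 4 KiB random-write IOPS figure into MiB/s (illustration only).
   def iops_to_mib_per_sec(iops, block_size_kib=4):
       return iops * block_size_kib / 1024.0

   for device, iops in {"NVMe SSD": 21500,
                        "HDD (WAL & DB)": 340,
                        "HDD (no WAL & DB)": 315}.items():
       print(f"{device:18s} {iops:6d} IOPS ~= {iops_to_mib_per_sec(iops):5.2f} MiB/s")
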

Factoring I/O Cost in mClock
============================

The services using mClock have a cost associated with them. The cost can be
different for each service type. The mClock scheduler factors in the cost
during the calculations for parameters like *reservation*, *weight* and
*limit*. The calculations determine when the next op for a service type can be
dequeued from the operation queue. In general, the higher the cost, the longer
an op remains in the operation queue.

A cost modeling study was performed to determine the cost per I/O and the cost
per byte for the SSD and HDD device types. The following cost-specific options
are used under the hood by mClock:

- :confval:`osd_mclock_cost_per_io_usec`
- :confval:`osd_mclock_cost_per_io_usec_hdd`
- :confval:`osd_mclock_cost_per_io_usec_ssd`
- :confval:`osd_mclock_cost_per_byte_usec`
- :confval:`osd_mclock_cost_per_byte_usec_hdd`
- :confval:`osd_mclock_cost_per_byte_usec_ssd`

See :doc:`/rados/configuration/mclock-config-ref` for more details.
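
The sketch below illustrates the general idea behind these options: an op
carries a fixed per-I/O cost plus a per-byte cost proportional to its size,
and a costlier op waits longer in the operation queue. This is a simplified
model for illustration only; the values shown are hypothetical placeholders,
not Ceph defaults, and the actual cost handling lives in the OSD code.

.. code-block:: python

   # Simplified per-op cost model based on cost-per-I/O and cost-per-byte
   # style options (values are hypothetical placeholders, NOT Ceph defaults).
   def op_cost_usec(op_bytes, cost_per_io_usec, cost_per_byte_usec):
       # Fixed per-I/O overhead plus a size-proportional component.  A larger
       # cost translates into the op waiting longer in the operation queue.
       return cost_per_io_usec + op_bytes * cost_per_byte_usec

   hdd_like = dict(cost_per_io_usec=10000.0, cost_per_byte_usec=2.5)  # placeholder
   ssd_like = dict(cost_per_io_usec=50.0, cost_per_byte_usec=0.01)    # placeholder

   for size in (4 * 1024, 64 * 1024, 4 * 1024 * 1024):  # 4 KiB, 64 KiB, 4 MiB
       print(f"{size:>8d} B  hdd ~{op_cost_usec(size, **hdd_like):>11.0f} usec"
             f"  ssd ~{op_cost_usec(size, **ssd_like):>9.0f} usec")
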

MClock Profile Allocations
==========================

The low-level mClock shares per profile are shown in the tables below. For
parameters like *reservation* and *limit*, the shares are represented as a
percentage of the total OSD capacity. For the *high_client_ops* profile, the
*reservation* parameter is set to 50% of the total OSD capacity. Therefore,
for the NVMe device (baseline 21500 IOPS), a minimum of 10750 IOPS is reserved
for client operations. These allocations are made under the hood once a
profile is enabled.

The *weight* parameter is unitless. See :ref:`dmclock-qos`.

high_client_ops (default)
`````````````````````````

This profile allocates more reservation and limit to external client ops
when compared to background recoveries and other internal clients within
Ceph. This profile is enabled by default. A worked example translating these
shares into IOPS follows the table below.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 25%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background best effort | 25%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

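
For concreteness, here is that worked example: converting the
*high_client_ops* shares into IOPS for an OSD benchmarked at the NVMe baseline
of 21500 IOPS. The percentages come from the table above; the scheduler applies
these allocations internally (and, after the refinements described earlier,
distributes them across the OSD's op shards), so this is only a
back-of-the-envelope calculation.

.. code-block:: python

   # Translate the high_client_ops shares into IOPS for a 21500-IOPS OSD
   # (the NVMe baseline used in this study).  Illustrative arithmetic only.
   OSD_CAPACITY_IOPS = 21500

   # (reservation, limit) as fractions of the OSD capacity; None means MAX.
   high_client_ops = {
       "client":                 (0.50, None),
       "background recovery":    (0.25, 1.00),
       "background best effort": (0.25, None),
   }

   for service, (res, lim) in high_client_ops.items():
       res_iops = res * OSD_CAPACITY_IOPS
       lim_iops = "MAX" if lim is None else f"{lim * OSD_CAPACITY_IOPS:.0f} IOPS"
       print(f"{service:23s} reservation ~{res_iops:>6.0f} IOPS   limit {lim_iops}")
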

balanced
`````````

This profile allocates equal reservations to client ops and background
recovery ops. The internal best effort clients get a lower reservation
but a very high limit so that they can complete quickly if there are
no competing services.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 40%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | 150%  |
+------------------------+-------------+--------+-------+
| background best effort | 20%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

high_recovery_ops
`````````````````

This profile allocates more reservation to background recoveries when
compared to external clients and other internal clients within Ceph. For
example, an admin may enable this profile temporarily to speed up background
recoveries during non-peak hours.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | 80%   |
+------------------------+-------------+--------+-------+
| background recovery    | 60%         | 2      | 200%  |
+------------------------+-------------+--------+-------+
| background best effort | 1 (MIN)     | 2      | MAX   |
+------------------------+-------------+--------+-------+

custom
```````

The custom profile allows the user to have complete control of the mClock
and Ceph config parameters. To use this profile, the user must have a deep
understanding of the workings of Ceph and the mClock scheduler. All the
*reservation*, *weight* and *limit* parameters of the different service types
must be set manually, along with any related Ceph option(s). This profile may
be used for experimental and exploratory purposes, or if the built-in profiles
do not meet the requirements. In such cases, adequate testing must be
performed prior to enabling this profile.

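
As a shape-only illustration of what a custom profile entails, the snippet
below lists the per-client options mentioned earlier in this document
(``osd_mclock_scheduler_client_[res,wgt,lim]``); analogous options cover the
other service types. The values are placeholders rather than recommendations,
and the valid ranges and units are described in the mClock configuration
reference and the :ref:`dmclock-qos` documentation.

.. code-block:: python

   # Shape of a hand-rolled "custom" profile: every service type needs explicit
   # reservation / weight / limit values.  Placeholder values only.
   custom_profile = {
       # Client options named earlier in this document:
       "osd_mclock_scheduler_client_res": "<reservation>",
       "osd_mclock_scheduler_client_wgt": "<weight>",
       "osd_mclock_scheduler_client_lim": "<limit>",
       # Analogous options exist for the background recovery and best-effort
       # service types (see the dmClock QoS documentation for the full list).
   }

   for option, value in custom_profile.items():
       print(f"{option} = {value}")
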

Recovery Test Steps
===================

Before bringing up the Ceph cluster, the following mClock configuration
parameters were set appropriately, based on the baseline throughput obtained
in the previous section:

- :confval:`osd_mclock_max_capacity_iops_hdd`
- :confval:`osd_mclock_max_capacity_iops_ssd`
- :confval:`osd_mclock_profile`

See :doc:`/rados/configuration/mclock-config-ref` for more details.

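
For example, the baselines from this study map onto these options roughly as
sketched below. The snippet only shows the option/value pairs; how they are
applied to a cluster (``ceph.conf``, runtime configuration, and so on) is
covered in the mClock configuration reference linked above, and the profile
shown is just one of the profiles described earlier.

.. code-block:: python

   # Map the measured per-OSD baselines onto the mClock capacity options
   # (illustration only; see the mClock configuration reference for how to
   # apply these settings to a cluster).
   baseline_iops = {"ssd": 21500, "hdd": 340}   # from the baselining section

   mclock_settings = {
       "osd_mclock_max_capacity_iops_ssd": baseline_iops["ssd"],
       "osd_mclock_max_capacity_iops_hdd": baseline_iops["hdd"],
       "osd_mclock_profile": "high_client_ops",  # or balanced / high_recovery_ops
   }

   for option, value in mclock_settings.items():
       print(f"{option} = {value}")
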

Test Steps (Using cbt)
``````````````````````

1. Bring up the Ceph cluster with 4 OSDs.
2. Configure the OSDs with replication factor 3.
3. Create a recovery pool to populate recovery data.
4. Create a client pool and prefill some objects in it.
5. Create the recovery thread and mark an OSD down and out.
6. After the cluster handles the OSD down event, recovery data is prefilled
   into the recovery pool. For the tests involving SSDs, prefill 100K 4 MiB
   objects into the recovery pool. For the tests involving HDDs, prefill 5K
   4 MiB objects into the recovery pool.
7. After the prefill stage is completed, the downed OSD is brought up and in.
   The backfill phase starts at this point.
8. As soon as the backfill/recovery starts, the test proceeds to initiate
   client I/O on the client pool on another thread using a single client.
9. During step 8 above, statistics related to the client latency and
   bandwidth are captured by cbt. The test also captures the total number of
   misplaced objects and the number of misplaced objects recovered per second.

To summarize, the steps above create 2 pools during the test. Recovery is
triggered on one pool and client I/O is triggered on the other simultaneously.
Statistics captured during the tests are discussed below.


Non-Default Ceph Recovery Options
`````````````````````````````````

Apart from the non-default bluestore throttle already mentioned above, the
following Ceph recovery-related options were modified for tests with both
the WPQ and mClock schedulers.

- :confval:`osd_max_backfills` = 1000
- :confval:`osd_recovery_max_active` = 1000
- :confval:`osd_async_recovery_min_cost` = 1

The above options set a high limit on the number of concurrent local and
remote backfill operations per OSD. Under these conditions, the capability of
the mClock scheduler was tested and the results are discussed below.

Test Results
============

Test Results With NVMe SSDs
```````````````````````````

Client Throughput Comparison
----------------------------

The chart below shows the average client throughput comparison across the
schedulers and their respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_NVMe_SSD_WPQ_vs_mClock.png

WPQ(Def) in the chart shows the average client throughput obtained using the
WPQ scheduler with all other Ceph configuration settings set to default
values. The default setting for :confval:`osd_max_backfills` limits the number
of concurrent local and remote backfills or recoveries per OSD to 1. As a
result, the average client throughput obtained was impressive at just over
18000 IOPS when compared to the baseline value of 21500 IOPS.

However, with the WPQ scheduler and the non-default options mentioned in the
`Non-Default Ceph Recovery Options`_ section, things are quite different as
shown in the chart for WPQ(BST). In this case, the average client throughput
drops dramatically to only 2544 IOPS. The non-default recovery options clearly
had a significant impact on the client throughput. In other words, recovery
operations overwhelm the client operations. Sections further below discuss the
recovery rates under these conditions.

With the non-default options, the same test was executed with mClock and with
the default profile (*high_client_ops*) enabled. As per the profile
allocation, the reservation goal of 50% (10750 IOPS) was met, with an average
throughput of 11209 IOPS during the course of recovery operations. This is
more than 4x the throughput obtained with WPQ(BST).

Similar throughput was obtained with the *balanced* (11017 IOPS) and
*high_recovery_ops* (11153 IOPS) profiles, as seen in the chart above. This
clearly demonstrates that mClock is able to provide the desired QoS for the
client with multiple concurrent backfill/recovery operations in progress.
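
The ratios behind these statements can be checked directly from the figures
quoted above (the arithmetic below is only a consistency check; all IOPS
numbers are taken from the text).

.. code-block:: python

   # Consistency check of the NVMe SSD client-throughput figures quoted above.
   baseline = 21500                       # baseline IOPS of the NVMe SSD OSDs
   results = {
       "WPQ(BST)":           2544,
       "high_client_ops":   11209,
       "balanced":          11017,
       "high_recovery_ops": 11153,
   }

   for name, iops in results.items():
       print(f"{name:17s} {iops:6d} IOPS  "
             f"{iops / baseline:6.1%} of baseline  "
             f"{iops / results['WPQ(BST)']:4.1f}x WPQ(BST)")
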

Client Latency Comparison
-------------------------

The chart below shows the average completion latency (*clat*) along with the
average 95th, 99th and 99.5th percentile latencies across the schedulers and
their respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_NVMe_SSD_WPQ_vs_mClock.png

The average *clat* latency obtained with WPQ(Def) was 3.535 msec. But in this
case the number of concurrent recoveries was very limited, at an average of
around 97 objects/sec or ~388 MiB/s, which was a major contributing factor to
the low latency seen by the client.

With WPQ(BST) and the non-default recovery options, things are very different:
the average *clat* latency shot up to almost 25 msec, which is 7x worse! This
is due to the high number of concurrent recoveries, which was measured at ~350
objects/sec or ~1.4 GiB/s, close to the maximum OSD bandwidth.

With mClock enabled and with the default *high_client_ops* profile, the
average *clat* latency was 5.688 msec, which is impressive considering the
high number of concurrent active background backfills/recoveries. The recovery
rate was throttled down by mClock to an average of 80 objects/sec or ~320
MiB/s, in line with the profile's allocation of 25% of the maximum OSD
bandwidth, thus allowing the client operations to meet the QoS goal.

With the other profiles, *balanced* and *high_recovery_ops*, the average
client *clat* latency didn't change much and stayed between 5.7 and 5.8 msec,
with variations in the average percentile latency as observed from the chart
above.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Perhaps a more interesting chart is the comparison chart shown above, which
tracks the average *clat* latency variations through the duration of the test.
The chart shows the differences in average latency between the WPQ scheduler
and the mClock profiles. During the initial phase of the test, for about 150
secs, these differences are quite evident and self-explanatory. The
*high_client_ops* profile shows the lowest latency, followed by the *balanced*
and *high_recovery_ops* profiles. WPQ(BST) had the highest average latency
through the course of the test.

Recovery Statistics Comparison
------------------------------

Another important aspect to consider is how the recovery bandwidth and
recovery time are affected by the mClock profile settings. The chart below
outlines the recovery rates and times for each mClock profile and how they
differ from the WPQ scheduler. The total number of objects to be recovered in
all the cases was around 75000 objects, as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Intuitively, the *high_client_ops* profile should impact recovery operations
the most, and this is indeed the case: it took an average of 966 secs for the
recovery to complete at 80 objects/sec. The recovery bandwidth, as expected,
was the lowest at an average of ~320 MiB/s.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_NVMe_SSD_WPQ_vs_mClock.png

The *balanced* profile provides a good middle ground by allocating the same
reservation and weight to client and recovery operations. The recovery rate
curve falls between the *high_recovery_ops* and *high_client_ops* curves, with
an average bandwidth of ~480 MiB/s, taking an average of ~647 secs at ~120
objects/sec to complete the recovery.

The *high_recovery_ops* profile provides the fastest way to complete recovery
operations at the expense of other operations. The recovery bandwidth, at ~635
MiB/s, was nearly 2x that observed with the *high_client_ops* profile. The
average object recovery rate was ~159 objects/sec, and the recovery completed
the fastest, in approximately 488 secs.
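
These recovery figures hang together arithmetically: with roughly 75000
misplaced 4 MiB objects, the bandwidth is the object rate multiplied by the
object size, and the recovery time is bounded below by the object count
divided by the rate (the measured times above are slightly longer, as
expected). The check below uses only numbers quoted in this section.

.. code-block:: python

   # Rough consistency check of the NVMe recovery numbers quoted above.
   TOTAL_OBJECTS = 75000        # misplaced objects (approximate)
   OBJECT_SIZE_MIB = 4          # the recovery pool holds 4 MiB objects

   rates = {"high_client_ops": 80, "balanced": 120, "high_recovery_ops": 159}

   for profile, objs_per_sec in rates.items():
       bandwidth_mib = objs_per_sec * OBJECT_SIZE_MIB     # MiB/s
       est_seconds = TOTAL_OBJECTS / objs_per_sec         # lower bound on time
       print(f"{profile:17s} ~{bandwidth_mib:3d} MiB/s  >= {est_seconds:5.0f} secs")
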

Test Results With HDDs (WAL and DB configured)
``````````````````````````````````````````````

The recovery tests were performed on HDDs with the bluestore WAL and DB
configured on faster NVMe SSDs. The baseline throughput measured was 340 IOPS.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput comparison for WPQ and for mClock with its
profiles is shown in the chart below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_WALdB_WPQ_vs_mClock.png

With WPQ(Def), the average client throughput obtained was ~308 IOPS, since the
number of concurrent recoveries was very limited. The average *clat* latency
was ~208 msec.

However, for WPQ(BST), client throughput was affected significantly due to the
concurrent recoveries, dropping to 146 IOPS with an average *clat* latency of
433 msec.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_WALdB_WPQ_vs_mClock.png

With the *high_client_ops* profile, mClock was able to meet the QoS
requirement for client operations with an average throughput of 271 IOPS,
which is nearly 80% of the baseline throughput, at an average *clat* latency
of 235 msec.

For the *balanced* and *high_recovery_ops* profiles, the average client
throughput came down marginally to ~248 IOPS and ~240 IOPS respectively. The
average *clat* latency, as expected, increased to ~258 msec and ~265 msec
respectively.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_WALdB_WPQ_vs_mClock.png

The *clat* latency comparison chart above provides a more comprehensive
insight into the differences in latency through the course of the test. As
observed in the NVMe SSD case, the *high_client_ops* profile shows the lowest
latency in the HDD case as well, followed by the *balanced* and
*high_recovery_ops* profiles. It's fairly easy to discern this between the
profiles during the first 200 secs of the test.

Recovery Statistics Comparison
------------------------------

The charts below compare the recovery rates and times. The total number of
objects to be recovered in all the cases using HDDs with WAL and DB was around
4000 objects, as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_WALdB_WPQ_vs_mClock.png

As expected, the *high_client_ops* profile impacts recovery operations the
most: it took an average of ~1409 secs for the recovery to complete at ~3
objects/sec. The recovery bandwidth, as expected, was the lowest at ~11 MiB/s.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_WALdB_WPQ_vs_mClock.png

The *balanced* profile, as expected, provides a decent compromise, with an
average bandwidth of ~16.5 MiB/s, taking an average of ~966 secs at ~4
objects/sec to complete the recovery.

The *high_recovery_ops* profile is the fastest, with nearly 2x the bandwidth
(~21 MiB/s) compared to the *high_client_ops* profile. The average object
recovery rate was ~5 objects/sec, and the recovery completed in approximately
747 secs. This is somewhat similar to the recovery time observed with
WPQ(Def), which was 647 secs with a bandwidth of 23 MiB/s at a rate of 5.8
objects/sec.

Test Results With HDDs (No WAL and DB configured)
`````````````````````````````````````````````````

The recovery tests were also performed on HDDs without the bluestore WAL and
DB configured. The baseline throughput measured was 315 IOPS.

Such a configuration is probably rare, but testing was nevertheless performed
to get a sense of how mClock performs in a very restrictive environment where
the OSD capacity is at the lower end. The sections and charts below are very
similar to the ones presented above and are provided here for reference.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput, latency and percentiles are compared as before
in the set of charts shown below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Recovery Statistics Comparison
------------------------------

The recovery rates and times are shown in the charts below.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Key Takeaways and Conclusion
============================

- mClock is able to provide the desired QoS using profiles to allocate proper
  *reservation*, *weight* and *limit* shares to the service types.
- By using the cost per I/O and the cost per byte parameters, mClock can
  schedule operations appropriately for the different device types (SSD/HDD).

The study so far shows promising results with the refinements made to the
mClock scheduler. Further refinements to mClock and profile tuning are
planned, along with improvements based on feedback from broader testing on
larger clusters and with different workloads.

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
.. _repository: https://github.com/ceph/dmclock
.. _cbt: https://github.com/ceph/cbt