===============
 Cache Tiering
===============

A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
or relatively slower/cheaper devices configured to act as an economical storage
tier. The Ceph objecter handles where to place the objects and the tiering
agent determines when to flush objects from the cache to the backing storage
tier. So the cache tier and the backing storage tier are completely transparent
to Ceph clients.


.. ditaa::
           +-------------+
           | Ceph Client |
           +------+------+
                  ^
     Tiering is   |
    Transparent   |              Faster I/O
        to Ceph   |           +---------------+
     Client Ops   |           |               |
                  |    +----->+   Cache Tier  |
                  |    |      |               |
                  |    |      +-----+---+-----+
                  |    |            |   ^
                  v    v            |   |   Active Data in Cache Tier
           +------+----+--+         |   |
           |   Objecter   |         |   |
           +-----------+--+         |   |
                       ^            |   |   Inactive Data in Storage Tier
                       |            v   |
                       |      +-----+---+-----+
                       |      |               |
                       +----->|  Storage Tier |
                              |               |
                              +---------------+
                                 Slower I/O


The cache tiering agent handles the migration of data between the cache tier
and the backing storage tier automatically. However, admins have the ability to
configure how this migration takes place by setting the ``cache-mode``. There are
two main scenarios:

- **writeback** mode: If the base tier and the cache tier are configured in
  ``writeback`` mode, Ceph clients receive an ACK from the base tier every time
  they write data to it. Then the cache tiering agent determines whether
  ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it
  has been set and the data has been written more than a specified number of
  times per interval, the data is promoted to the cache tier.

  When Ceph clients need access to data stored in the base tier, the cache
  tiering agent reads the data from the base tier and returns it to the client.
  While data is being read from the base tier, the cache tiering agent consults
  the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and
  decides whether to promote that data from the base tier to the cache tier.
  When data has been promoted from the base tier to the cache tier, the Ceph
  client is able to perform I/O operations on it using the cache tier. This is
  well-suited for mutable data (for example, photo/video editing, transactional
  data).

- **readproxy** mode: This mode will use any objects that already
  exist in the cache tier, but if an object is not present in the
  cache the request will be proxied to the base tier. This is useful
  for transitioning from ``writeback`` mode to a disabled cache as it
  allows the workload to function properly while the cache is drained,
  without adding any new objects to the cache.

Other cache modes are:

- **readonly** promotes objects to the cache on read operations only; write
  operations are forwarded to the base tier. This mode is intended for
  read-only workloads that do not require consistency to be enforced by the
  storage system. (**Warning**: when objects are updated in the base tier,
  Ceph makes **no** attempt to sync these updates to the corresponding objects
  in the cache. Since this mode is considered experimental, a
  ``--yes-i-really-mean-it`` option must be passed in order to enable it, as
  shown in the example below.)

- **none** is used to completely disable caching.

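As a rough sketch of how these modes would be selected (assuming a cache pool
named ``hot-storage``, the name used in the examples later in this document):

.. prompt:: bash $

   # readonly is experimental and requires an explicit acknowledgement
   ceph osd tier cache-mode hot-storage readonly --yes-i-really-mean-it
   # disable caching entirely
   ceph osd tier cache-mode hot-storage none
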

A word of caution
=================

Cache tiering will *degrade* performance for most workloads. Users should use
extreme caution before using this feature.

* *Workload dependent*: Whether a cache will improve performance is
  highly dependent on the workload. Because there is a cost
  associated with moving objects into or out of the cache, it can only
  be effective when there is a *large skew* in the access pattern in
  the data set, such that most of the requests touch a small number of
  objects. The cache pool should be large enough to capture the
  working set for your workload to avoid thrashing.

* *Difficult to benchmark*: Most benchmarks that users run to measure
  performance will show terrible performance with cache tiering, in
  part because very few of them skew requests toward a small set of
  objects, because it can take a long time for the cache to "warm up,"
  and because the warm-up cost can be high.

* *Usually slower*: For workloads that are not cache tiering-friendly,
  performance is often slower than that of a normal RADOS pool without
  cache tiering enabled.

* *librados object enumeration*: The librados-level object enumeration
  API is not meant to be coherent in the presence of a cache. If
  your application is using librados directly and relies on object
  enumeration, cache tiering will probably not work as expected.
  (This is not a problem for RGW, RBD, or CephFS.)

* *Complexity*: Enabling cache tiering means that a lot of additional
  machinery and complexity within the RADOS cluster is being used.
  This increases the probability that you will encounter a bug in the system
  that other users have not yet encountered and will put your deployment at a
  higher level of risk.

Known Good Workloads
--------------------

* *RGW time-skewed*: If the RGW workload is such that almost all read
  operations are directed at recently written objects, a simple cache
  tiering configuration that destages recently written objects from
  the cache to the base tier after a configurable period can work
  well.

Known Bad Workloads
-------------------

The following configurations are *known to work poorly* with cache
tiering.

* *RBD with replicated cache and erasure-coded base*: This is a common
  request, but usually does not perform well. Even reasonably skewed
  workloads still send some small writes to cold objects, and because
  small writes are not yet supported by the erasure-coded pool, entire
  (usually 4 MB) objects must be migrated into the cache in order to
  satisfy a small (often 4 KB) write. Only a handful of users have
  successfully deployed this configuration, and it only works for them
  because their data is extremely cold (backups) and they are not in
  any way sensitive to performance.

* *RBD with replicated cache and base*: RBD with a replicated base
  tier does better than when the base is erasure coded, but it is
  still highly dependent on the amount of skew in the workload, and
  very difficult to validate. The user will need to have a good
  understanding of their workload and will need to tune the cache
  tiering parameters carefully.


Setting Up Pools
================

To set up cache tiering, you must have two pools. One will act as the
backing storage and the other will act as the cache.


Setting Up a Backing Storage Pool
---------------------------------

Setting up a backing storage pool typically involves one of two scenarios:

- **Standard Storage**: In this scenario, the pool stores multiple copies
  of an object in the Ceph Storage Cluster.

- **Erasure Coding:** In this scenario, the pool uses erasure coding to
  store data much more efficiently with a small performance tradeoff.

In the standard storage scenario, you can set up a CRUSH rule to establish
the failure domain (e.g., osd, host, chassis, rack, or row). Ceph OSD
Daemons perform optimally when all storage drives in the rule are of the
same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
for details on creating a rule. Once you have created a rule, create
a backing storage pool.

In the erasure coding scenario, the pool creation arguments will generate the
appropriate rule automatically. See `Create a Pool`_ for details.

In subsequent examples, we will refer to the backing storage pool
as ``cold-storage``.

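For illustration, a backing pool named ``cold-storage`` might be created as a
replicated pool or as an erasure-coded pool roughly as follows (the PG count of
``128`` and the use of the default erasure code profile are placeholder
choices; size them for your own cluster):

.. prompt:: bash $

   # replicated backing pool
   ceph osd pool create cold-storage 128

   # or an erasure-coded backing pool using the default profile
   ceph osd pool create cold-storage 128 128 erasure
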

Setting Up a Cache Pool
-----------------------

Setting up a cache pool follows the same procedure as the standard storage
scenario, but with this difference: the drives for the cache tier are typically
high-performance drives that reside in their own servers and have their own
CRUSH rule. When setting up such a rule, it should take into account the hosts
that have the high-performance drives while omitting the hosts that don't. See
:ref:`CRUSH Device Class<crush-map-device-class>` for details.

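As a sketch of this approach (the rule name ``highspeed``, the ``ssd`` device
class, and the PG count are illustrative assumptions; substitute values that
match your hardware and cluster size), a device-class-based rule and a cache
pool that uses it could be created like so:

.. prompt:: bash $

   # rule that places data only on SSD-class OSDs, with host failure domain
   ceph osd crush rule create-replicated highspeed default host ssd

   # replicated cache pool bound to that rule
   ceph osd pool create hot-storage 128 128 replicated highspeed
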

In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
the backing pool as ``cold-storage``.

For cache tier configuration and default values, see
`Pools - Set Pool Values`_.


Creating a Cache Tier
=====================

Setting up a cache tier involves associating a backing storage pool with
a cache pool:

.. prompt:: bash $

   ceph osd tier add {storagepool} {cachepool}

For example:

.. prompt:: bash $

   ceph osd tier add cold-storage hot-storage

To set the cache mode, execute the following:

.. prompt:: bash $

   ceph osd tier cache-mode {cachepool} {cache-mode}

For example:

.. prompt:: bash $

   ceph osd tier cache-mode hot-storage writeback

Because the cache tier overlays the backing storage tier, one additional step
is required: you must direct all client traffic from the storage pool to the
cache pool. To direct client traffic to the cache pool, execute the following:

.. prompt:: bash $

   ceph osd tier set-overlay {storagepool} {cachepool}

For example:

.. prompt:: bash $

   ceph osd tier set-overlay cold-storage hot-storage

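As a quick sanity check (not a required step), you can list the pools in
detail; the entries for the two pools should show the tier association (for
example ``tier_of``, ``read_tier``, and ``write_tier`` fields) and the cache
mode:

.. prompt:: bash $

   ceph osd pool ls detail
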

Configuring a Cache Tier
========================

Cache tiers have several configuration options. You can set
cache tier configuration options with the following command:

.. prompt:: bash $

   ceph osd pool set {cachepool} {key} {value}

See `Pools - Set Pool Values`_ for details.

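To review the values currently in effect, the matching ``get`` form of the
command can be used (shown here for the ``hot-storage`` pool; ``all`` is a
convenience key that prints every retrievable setting):

.. prompt:: bash $

   ceph osd pool get {cachepool} {key}
   ceph osd pool get hot-storage all
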

Target Size and Type
--------------------

Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``:

.. prompt:: bash $

   ceph osd pool set {cachepool} hit_set_type bloom

For example:

.. prompt:: bash $

   ceph osd pool set hot-storage hit_set_type bloom

The ``hit_set_count`` and ``hit_set_period`` define how many HitSets to
store, and how much time each HitSet should cover:

.. prompt:: bash $

   ceph osd pool set {cachepool} hit_set_count 12
   ceph osd pool set {cachepool} hit_set_period 14400
   ceph osd pool set {cachepool} target_max_bytes 1000000000000

.. note:: A larger ``hit_set_count`` results in more RAM consumed by
   the ``ceph-osd`` process.

Binning accesses over time allows Ceph to determine whether a Ceph client
accessed an object at least once, or more than once over a time period
("age" vs "temperature").

The ``min_read_recency_for_promote`` parameter defines how many HitSets to
check for the existence of an object when handling a read operation. The result
is used to decide whether to promote the object asynchronously. Its value
should be between 0 and ``hit_set_count``. If it is set to 0, the object is
always promoted. If it is set to 1, only the current HitSet is checked, and the
object is promoted if it is found there. For higher values, that exact number
of HitSets is checked, and the object is promoted if it is found in any of the
most recent ``min_read_recency_for_promote`` HitSets.

A similar parameter, ``min_write_recency_for_promote``, can be set for write
operations:

.. prompt:: bash $

   ceph osd pool set {cachepool} min_read_recency_for_promote 2
   ceph osd pool set {cachepool} min_write_recency_for_promote 2

.. note:: The longer the period and the higher the
   ``min_read_recency_for_promote`` and
   ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
   daemon consumes. In particular, when the agent is active to flush
   or evict cache objects, all ``hit_set_count`` HitSets are loaded
   into RAM.

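If you want to double-check what a pool is currently using, the hit set
settings can be read back individually with ``ceph osd pool get`` (shown for
the ``hot-storage`` example pool):

.. prompt:: bash $

   ceph osd pool get hot-storage hit_set_type
   ceph osd pool get hot-storage hit_set_count
   ceph osd pool get hot-storage hit_set_period
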

Cache Sizing
------------

The cache tiering agent performs two main functions:

- **Flushing:** The agent identifies modified (or dirty) objects and forwards
  them to the storage pool for long-term storage.

- **Evicting:** The agent identifies objects that have not been modified
  (clean objects) and evicts the least recently used among them from the cache.


Absolute Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects based upon the total number
of bytes or the total number of objects. To specify a maximum number of bytes,
execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TB, execute the following:

.. prompt:: bash $

   ceph osd pool set hot-storage target_max_bytes 1099511627776

To specify the maximum number of objects, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict at 1M objects, execute the following:

.. prompt:: bash $

   ceph osd pool set hot-storage target_max_objects 1000000

.. note:: Ceph is not able to determine the size of a cache pool automatically,
   so configuration of an absolute size is required here; otherwise, flushing
   and evicting will not work. If you specify both limits, the cache tiering
   agent will begin flushing or evicting when either threshold is triggered.

.. note:: Client requests are blocked only when ``target_max_bytes`` or
   ``target_max_objects`` has been reached.

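Because either threshold triggers the agent, you can set both limits; putting
the two earlier examples together for the ``hot-storage`` pool:

.. prompt:: bash $

   ceph osd pool set hot-storage target_max_bytes 1099511627776
   ceph osd pool set hot-storage target_max_objects 1000000
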
Relative Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects relative to the size of the
cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
`Absolute Sizing`_). When the cache pool consists of a certain percentage of
modified (or dirty) objects, the cache tiering agent will flush them to the
storage pool. To set the ``cache_target_dirty_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to ``0.4`` will begin flushing modified
(dirty) objects when they reach 40% of the cache pool's capacity:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When dirty objects reach a certain higher percentage of the cache pool's
capacity, the agent flushes them at a higher rate. To set the
``cache_target_dirty_high_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to ``0.6`` will begin aggressively flushing
dirty objects when they reach 60% of the cache pool's capacity. This value
should lie between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its capacity, the cache
tiering agent will evict objects to maintain free capacity. To set the
``cache_target_full_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, setting the value to ``0.8`` will begin evicting unmodified
(clean) objects when they reach 80% of the cache pool's capacity:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_full_ratio 0.8

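Taken together, a cache pool configured with the example values above would
begin flushing dirty objects at 40% capacity, flush them more aggressively at
60%, and evict clean objects at 80%:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
   ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
   ceph osd pool set hot-storage cache_target_full_ratio 0.8
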

Cache Age
---------

You can specify the minimum age of an object before the cache tiering agent
flushes a recently modified (or dirty) object to the backing storage pool:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (or dirty) objects after 10 minutes, execute the
following:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_min_flush_age 600

You can specify the minimum age of an object before it will be evicted from the
cache tier:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects after 30 minutes, execute the following:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_min_evict_age 1800


Removing a Cache Tier
=====================

Removing a cache tier differs depending on whether it is a writeback
cache or a read-only cache.


Removing a Read-Only Cache
--------------------------

Since a read-only cache does not have modified data, you can disable
and remove it without losing any recent changes to objects in the cache.

#. Change the cache-mode to ``none`` to disable it:

   .. prompt:: bash $

      ceph osd tier cache-mode {cachepool} none

   For example:

   .. prompt:: bash $

      ceph osd tier cache-mode hot-storage none

#. Remove the cache pool from the backing pool:

   .. prompt:: bash $

      ceph osd tier remove {storagepool} {cachepool}

   For example:

   .. prompt:: bash $

      ceph osd tier remove cold-storage hot-storage


Removing a Writeback Cache
--------------------------

Since a writeback cache may have modified data, you must take steps to ensure
that you do not lose any recent changes to objects in the cache before you
disable and remove it.


#. Change the cache mode to ``proxy`` so that new and modified objects will
   flush to the backing storage pool:

   .. prompt:: bash $

      ceph osd tier cache-mode {cachepool} proxy

   For example:

   .. prompt:: bash $

      ceph osd tier cache-mode hot-storage proxy

#. Ensure that the cache pool has been flushed. This may take a few minutes:

   .. prompt:: bash $

      rados -p {cachepool} ls

   If the cache pool still has objects, you can flush them manually.
   For example:

   .. prompt:: bash $

      rados -p {cachepool} cache-flush-evict-all

#. Remove the overlay so that clients will not direct traffic to the cache:

   .. prompt:: bash $

      ceph osd tier remove-overlay {storagepool}

   For example:

   .. prompt:: bash $

      ceph osd tier remove-overlay cold-storage

#. Finally, remove the cache tier pool from the backing storage pool:

   .. prompt:: bash $

      ceph osd tier remove {storagepool} {cachepool}

   For example:

   .. prompt:: bash $

      ceph osd tier remove cold-storage hot-storage


.. _Create a Pool: ../pools#create-a-pool
.. _Pools - Set Pool Values: ../pools#set-pool-values
.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter
.. _CRUSH Maps: ../crush-map
.. _Absolute Sizing: #absolute-sizing