===============
 Cache Tiering
===============

A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
or relatively slower/cheaper devices configured to act as an economical storage
tier. The Ceph objecter handles where to place the objects and the tiering
agent determines when to flush objects from the cache to the backing storage
tier. So the cache tier and the backing storage tier are completely transparent
to Ceph clients.


.. ditaa::
           +-------------+
           | Ceph Client |
           +------+------+
                  ^
     Tiering is   |
    Transparent   |              Faster I/O
        to Ceph   |           +---------------+
     Client Ops   |           |               |
                  |    +----->+   Cache Tier  |
                  |    |      |               |
                  |    |      +-----+---+-----+
                  |    |            |   ^
                  v    v            |   |   Active Data in Cache Tier
           +------+----+--+         |   |
           |   Objecter   |         |   |
           +-----------+--+         |   |
                       ^            |   |   Inactive Data in Storage Tier
                       |            v   |
                       |      +-----+---+-----+
                       |      |               |
                       +----->|  Storage Tier |
                              |               |
                              +---------------+
                                 Slower I/O


The cache tiering agent handles the migration of data between the cache tier
and the backing storage tier automatically. However, admins have the ability to
configure how this migration takes place by setting the ``cache-mode``. There are
two main scenarios:

- **writeback** mode: If the base tier and the cache tier are configured in
  ``writeback`` mode, Ceph clients receive an ACK from the base tier every time
  they write data to it. Then the cache tiering agent determines whether
  ``osd_tier_default_cache_min_write_recency_for_promote`` has been set. If it
  has been set and the data has been written more than a specified number of
  times per interval, the data is promoted to the cache tier.

  When Ceph clients need access to data stored in the base tier, the cache
  tiering agent reads the data from the base tier and returns it to the client.
  While data is being read from the base tier, the cache tiering agent consults
  the value of ``osd_tier_default_cache_min_read_recency_for_promote`` and
  decides whether to promote that data from the base tier to the cache tier.
  When data has been promoted from the base tier to the cache tier, the Ceph
  client is able to perform I/O operations on it using the cache tier. This is
  well-suited for mutable data (for example, photo/video editing, transactional
  data).

- **readproxy** mode: This mode will use any objects that already
  exist in the cache tier, but if an object is not present in the
  cache the request will be proxied to the base tier. This is useful
  for transitioning from ``writeback`` mode to a disabled cache as it
  allows the workload to function properly while the cache is drained,
  without adding any new objects to the cache.

Other cache modes are:

- **readonly** promotes objects to the cache on read operations only; write
  operations are forwarded to the base tier. This mode is intended for
  read-only workloads that do not require consistency to be enforced by the
  storage system. (**Warning**: when objects are updated in the base tier,
  Ceph makes **no** attempt to sync these updates to the corresponding objects
  in the cache. Since this mode is considered experimental, a
  ``--yes-i-really-mean-it`` option must be passed in order to enable it, as
  shown in the example below.)

- **none** is used to completely disable caching.

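As a rough sketch of how these modes would be selected (assuming a cache pool
named ``hot-storage``, the name used in the examples later in this document):

.. prompt:: bash $

   # readonly is experimental and requires an explicit acknowledgement
   ceph osd tier cache-mode hot-storage readonly --yes-i-really-mean-it
   # disable caching entirely
   ceph osd tier cache-mode hot-storage none
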

A word of caution
=================

Cache tiering will *degrade* performance for most workloads. Users should use
extreme caution before using this feature.

* *Workload dependent*: Whether a cache will improve performance is
  highly dependent on the workload. Because there is a cost
  associated with moving objects into or out of the cache, it can only
  be effective when there is a *large skew* in the access pattern in
  the data set, such that most of the requests touch a small number of
  objects. The cache pool should be large enough to capture the
  working set for your workload to avoid thrashing.

* *Difficult to benchmark*: Most benchmarks that users run to measure
  performance will show terrible performance with cache tiering, in
  part because very few of them skew requests toward a small set of
  objects, because it can take a long time for the cache to "warm up,"
  and because the warm-up cost can be high.

* *Usually slower*: For workloads that are not cache tiering-friendly,
  performance is often slower than that of a normal RADOS pool without
  cache tiering enabled.

* *librados object enumeration*: The librados-level object enumeration
  API is not meant to be coherent in the presence of a cache. If
  your application is using librados directly and relies on object
  enumeration, cache tiering will probably not work as expected.
  (This is not a problem for RGW, RBD, or CephFS.)

* *Complexity*: Enabling cache tiering means that a lot of additional
  machinery and complexity within the RADOS cluster is being used.
  This increases the probability that you will encounter a bug in the system
  that other users have not yet encountered and will put your deployment at a
  higher level of risk.

Known Good Workloads
--------------------

* *RGW time-skewed*: If the RGW workload is such that almost all read
  operations are directed at recently written objects, a simple cache
  tiering configuration that destages recently written objects from
  the cache to the base tier after a configurable period can work
  well.

Known Bad Workloads
-------------------

The following configurations are *known to work poorly* with cache
tiering.

* *RBD with replicated cache and erasure-coded base*: This is a common
  request, but usually does not perform well. Even reasonably skewed
  workloads still send some small writes to cold objects, and because
  small writes are not yet supported by the erasure-coded pool, entire
  (usually 4 MB) objects must be migrated into the cache in order to
  satisfy a small (often 4 KB) write. Only a handful of users have
  successfully deployed this configuration, and it only works for them
  because their data is extremely cold (backups) and they are not in
  any way sensitive to performance.

* *RBD with replicated cache and base*: RBD with a replicated base
  tier does better than when the base is erasure coded, but it is
  still highly dependent on the amount of skew in the workload, and
  very difficult to validate. The user will need to have a good
  understanding of their workload and will need to tune the cache
  tiering parameters carefully.


Setting Up Pools
================

To set up cache tiering, you must have two pools. One will act as the
backing storage and the other will act as the cache.


Setting Up a Backing Storage Pool
---------------------------------

Setting up a backing storage pool typically involves one of two scenarios:

- **Standard Storage**: In this scenario, the pool stores multiple copies
  of an object in the Ceph Storage Cluster.

- **Erasure Coding:** In this scenario, the pool uses erasure coding to
  store data much more efficiently with a small performance tradeoff.

In the standard storage scenario, you can set up a CRUSH rule to establish
the failure domain (e.g., osd, host, chassis, rack, or row). Ceph OSD
Daemons perform optimally when all storage drives in the rule are of the
same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
for details on creating a rule. Once you have created a rule, create
a backing storage pool.

In the erasure coding scenario, the pool creation arguments will generate the
appropriate rule automatically. See `Create a Pool`_ for details.

In subsequent examples, we will refer to the backing storage pool
as ``cold-storage``.

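For illustration, a backing pool named ``cold-storage`` might be created as a
replicated pool or as an erasure-coded pool roughly as follows (the PG count of
``128`` and the use of the default erasure code profile are placeholder
choices; size them for your own cluster):

.. prompt:: bash $

   # replicated backing pool
   ceph osd pool create cold-storage 128

   # or an erasure-coded backing pool using the default profile
   ceph osd pool create cold-storage 128 128 erasure
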

Setting Up a Cache Pool
-----------------------

Setting up a cache pool follows the same procedure as the standard storage
scenario, but with this difference: the drives for the cache tier are typically
high-performance drives that reside in their own servers and have their own
CRUSH rule. When setting up such a rule, it should take into account the hosts
that have the high-performance drives while omitting the hosts that don't. See
:ref:`CRUSH Device Class<crush-map-device-class>` for details.

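As a sketch of this approach (the rule name ``highspeed``, the ``ssd`` device
class, and the PG count are illustrative assumptions; substitute values that
match your hardware and cluster size), a device-class-based rule and a cache
pool that uses it could be created like so:

.. prompt:: bash $

   # rule that places data only on SSD-class OSDs, with host failure domain
   ceph osd crush rule create-replicated highspeed default host ssd

   # replicated cache pool bound to that rule
   ceph osd pool create hot-storage 128 128 replicated highspeed
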

In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
the backing pool as ``cold-storage``.

For cache tier configuration and default values, see
`Pools - Set Pool Values`_.


Creating a Cache Tier
=====================

Setting up a cache tier involves associating a backing storage pool with
a cache pool:

.. prompt:: bash $

   ceph osd tier add {storagepool} {cachepool}

For example:

.. prompt:: bash $

   ceph osd tier add cold-storage hot-storage

To set the cache mode, execute the following:

.. prompt:: bash $

   ceph osd tier cache-mode {cachepool} {cache-mode}

For example:

.. prompt:: bash $

   ceph osd tier cache-mode hot-storage writeback

Because the cache tier overlays the backing storage tier, one additional step
is required: you must direct all client traffic from the storage pool to the
cache pool. To direct client traffic to the cache pool, execute the following:

.. prompt:: bash $

   ceph osd tier set-overlay {storagepool} {cachepool}

For example:

.. prompt:: bash $

   ceph osd tier set-overlay cold-storage hot-storage

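As a quick sanity check (not a required step), you can list the pools in
detail; the entries for the two pools should show the tier association (for
example ``tier_of``, ``read_tier``, and ``write_tier`` fields) and the cache
mode:

.. prompt:: bash $

   ceph osd pool ls detail
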

Configuring a Cache Tier
========================

Cache tiers have several configuration options. You can set
cache tier configuration options with the following command:

.. prompt:: bash $

   ceph osd pool set {cachepool} {key} {value}

See `Pools - Set Pool Values`_ for details.

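To review the values currently in effect, the matching ``get`` form of the
command can be used (shown here for the ``hot-storage`` pool; ``all`` is a
convenience key that prints every retrievable setting):

.. prompt:: bash $

   ceph osd pool get {cachepool} {key}
   ceph osd pool get hot-storage all
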

Target Size and Type
--------------------

Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``:

.. prompt:: bash $

   ceph osd pool set {cachepool} hit_set_type bloom

For example:

.. prompt:: bash $

   ceph osd pool set hot-storage hit_set_type bloom

The ``hit_set_count`` and ``hit_set_period`` define how many HitSets to
store, and how much time each HitSet should cover:

.. prompt:: bash $

   ceph osd pool set {cachepool} hit_set_count 12
   ceph osd pool set {cachepool} hit_set_period 14400
   ceph osd pool set {cachepool} target_max_bytes 1000000000000

.. note:: A larger ``hit_set_count`` results in more RAM consumed by
   the ``ceph-osd`` process.

Binning accesses over time allows Ceph to determine whether a Ceph client
accessed an object at least once, or more than once over a time period
("age" vs "temperature").

The ``min_read_recency_for_promote`` parameter defines how many HitSets to
check for the existence of an object when handling a read operation. The result
is used to decide whether to promote the object asynchronously. Its value
should be between 0 and ``hit_set_count``. If it is set to 0, the object is
always promoted. If it is set to 1, only the current HitSet is checked, and the
object is promoted if it is found there. For higher values, that exact number
of HitSets is checked, and the object is promoted if it is found in any of the
most recent ``min_read_recency_for_promote`` HitSets.

A similar parameter, ``min_write_recency_for_promote``, can be set for write
operations:

.. prompt:: bash $

   ceph osd pool set {cachepool} min_read_recency_for_promote 2
   ceph osd pool set {cachepool} min_write_recency_for_promote 2

.. note:: The longer the period and the higher the
   ``min_read_recency_for_promote`` and
   ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
   daemon consumes. In particular, when the agent is active to flush
   or evict cache objects, all ``hit_set_count`` HitSets are loaded
   into RAM.

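If you want to double-check what a pool is currently using, the hit set
settings can be read back individually with ``ceph osd pool get`` (shown for
the ``hot-storage`` example pool):

.. prompt:: bash $

   ceph osd pool get hot-storage hit_set_type
   ceph osd pool get hot-storage hit_set_count
   ceph osd pool get hot-storage hit_set_period
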

Cache Sizing
------------

The cache tiering agent performs two main functions:

- **Flushing:** The agent identifies modified (or dirty) objects and forwards
  them to the storage pool for long-term storage.

- **Evicting:** The agent identifies objects that have not been modified
  (clean objects) and evicts the least recently used among them from the cache.


Absolute Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects based upon the total number
of bytes or the total number of objects. To specify a maximum number of bytes,
execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TB, execute the following:

.. prompt:: bash $

   ceph osd pool set hot-storage target_max_bytes 1099511627776

To specify the maximum number of objects, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict at 1M objects, execute the following:

.. prompt:: bash $

   ceph osd pool set hot-storage target_max_objects 1000000

.. note:: Ceph is not able to determine the size of a cache pool automatically,
   so configuration of an absolute size is required here; otherwise, flushing
   and evicting will not work. If you specify both limits, the cache tiering
   agent will begin flushing or evicting when either threshold is triggered.

.. note:: Client requests are blocked only when ``target_max_bytes`` or
   ``target_max_objects`` has been reached.

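Because either threshold triggers the agent, you can set both limits; putting
the two earlier examples together for the ``hot-storage`` pool:

.. prompt:: bash $

   ceph osd pool set hot-storage target_max_bytes 1099511627776
   ceph osd pool set hot-storage target_max_objects 1000000
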
Relative Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects relative to the size of the
cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
`Absolute Sizing`_). When the cache pool consists of a certain percentage of
modified (or dirty) objects, the cache tiering agent will flush them to the
storage pool. To set the ``cache_target_dirty_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to ``0.4`` will begin flushing modified
(dirty) objects when they reach 40% of the cache pool's capacity:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When dirty objects reach a certain higher percentage of the cache pool's
capacity, the agent flushes them at a higher rate. To set the
``cache_target_dirty_high_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to ``0.6`` will begin aggressively flushing
dirty objects when they reach 60% of the cache pool's capacity. This value
should lie between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its capacity, the cache
tiering agent will evict objects to maintain free capacity. To set the
``cache_target_full_ratio``, execute the following:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, setting the value to ``0.8`` will begin evicting unmodified
(clean) objects when they reach 80% of the cache pool's capacity:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_full_ratio 0.8

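Taken together, a cache pool configured with the example values above would
begin flushing dirty objects at 40% capacity, flush them more aggressively at
60%, and evict clean objects at 80%:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
   ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
   ceph osd pool set hot-storage cache_target_full_ratio 0.8
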

Cache Age
---------

You can specify the minimum age of an object before the cache tiering agent
flushes a recently modified (or dirty) object to the backing storage pool:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (or dirty) objects after 10 minutes, execute the
following:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_min_flush_age 600

You can specify the minimum age of an object before it will be evicted from the
cache tier:

.. prompt:: bash $

   ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects after 30 minutes, execute the following:

.. prompt:: bash $

   ceph osd pool set hot-storage cache_min_evict_age 1800


Removing a Cache Tier
=====================

Removing a cache tier differs depending on whether it is a writeback
cache or a read-only cache.


Removing a Read-Only Cache
--------------------------

Since a read-only cache does not have modified data, you can disable
and remove it without losing any recent changes to objects in the cache.

#. Change the cache-mode to ``none`` to disable it:

   .. prompt:: bash $

      ceph osd tier cache-mode {cachepool} none

   For example:

   .. prompt:: bash $

      ceph osd tier cache-mode hot-storage none

#. Remove the cache pool from the backing pool:

   .. prompt:: bash $

      ceph osd tier remove {storagepool} {cachepool}

   For example:

   .. prompt:: bash $

      ceph osd tier remove cold-storage hot-storage


Removing a Writeback Cache
--------------------------

Since a writeback cache may have modified data, you must take steps to ensure
that you do not lose any recent changes to objects in the cache before you
disable and remove it.


#. Change the cache mode to ``proxy`` so that new and modified objects will
   flush to the backing storage pool:

   .. prompt:: bash $

      ceph osd tier cache-mode {cachepool} proxy

   For example:

   .. prompt:: bash $

      ceph osd tier cache-mode hot-storage proxy

#. Ensure that the cache pool has been flushed. This may take a few minutes:

   .. prompt:: bash $

      rados -p {cachepool} ls

   If the cache pool still has objects, you can flush them manually.
   For example:

   .. prompt:: bash $

      rados -p {cachepool} cache-flush-evict-all

#. Remove the overlay so that clients will not direct traffic to the cache:

   .. prompt:: bash $

      ceph osd tier remove-overlay {storagepool}

   For example:

   .. prompt:: bash $

      ceph osd tier remove-overlay cold-storage

#. Finally, remove the cache tier pool from the backing storage pool:

   .. prompt:: bash $

      ceph osd tier remove {storagepool} {cachepool}

   For example:

   .. prompt:: bash $

      ceph osd tier remove cold-storage hot-storage


.. _Create a Pool: ../pools#create-a-pool
.. _Pools - Set Pool Values: ../pools#set-pool-values
.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter
.. _CRUSH Maps: ../crush-map
.. _Absolute Sizing: #absolute-sizing