===============
 Cache Tiering
===============

A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
or relatively slower/cheaper devices configured to act as an economical storage
tier. The Ceph objecter handles where to place the objects and the tiering
agent determines when to flush objects from the cache to the backing storage
tier. So the cache tier and the backing storage tier are completely transparent
to Ceph clients.

.. ditaa::
           +-------------+
           | Ceph Client |
           +------+------+
                  ^
     Tiering is   |
    Transparent   |              Faster I/O
        to Ceph   |           +---------------+
     Client Ops   |           |               |
                  |    +----->+   Cache Tier  |
                  |    |      |               |
                  |    |      +-----+---+-----+
                  |    |            |   ^
                  v    v            |   |   Active Data in Cache Tier
           +------+----+--+         |   |
           |   Objecter   |         |   |
           +-----------+--+         |   |
                       ^            |   |   Inactive Data in Storage Tier
                       |            v   |
                       |      +-----+---+-----+
                       |      |               |
                       +----->|  Storage Tier |
                              |               |
                              +---------------+
                                  Slower I/O

The cache tiering agent handles the migration of data between the cache tier
and the backing storage tier automatically. However, admins have the ability to
configure how this migration takes place by setting the ``cache-mode``. There are
two main scenarios:

- **writeback** mode: When admins configure tiers with ``writeback`` mode, Ceph
  clients write data to the cache tier and receive an ACK from the cache tier.
  In time, the data written to the cache tier migrates to the storage tier
  and gets flushed from the cache tier. Conceptually, the cache tier is
  overlaid "in front" of the backing storage tier. When a Ceph client needs
  data that resides in the storage tier, the cache tiering agent migrates the
  data to the cache tier on read, then it is sent to the Ceph client.
  Thereafter, the Ceph client can perform I/O using the cache tier, until the
  data becomes inactive. This is ideal for mutable data (e.g., photo/video
  editing, transactional data, etc.).

- **readproxy** mode: This mode will use any objects that already
  exist in the cache tier, but if an object is not present in the
  cache the request will be proxied to the base tier. This is useful
  for transitioning from ``writeback`` mode to a disabled cache as it
  allows the workload to function properly while the cache is drained,
  without adding any new objects to the cache.

Other cache modes are:

- **readonly** promotes objects to the cache on read operations only; write
  operations are forwarded to the base tier. This mode is intended for
  read-only workloads that do not require consistency to be enforced by the
  storage system. (**Warning**: when objects are updated in the base tier,
  Ceph makes **no** attempt to sync these updates to the corresponding objects
  in the cache. Since this mode is considered experimental, a
  ``--yes-i-really-mean-it`` option must be passed in order to enable it;
  see the example after this list.)

- **none** is used to completely disable caching.

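For instance, a minimal sketch of enabling the experimental ``readonly`` mode
on the ``hot-storage`` cache pool used in the examples later in this document
(the pool name here is illustrative)::

  ceph osd tier cache-mode hot-storage readonly --yes-i-really-mean-it
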
A word of caution
=================

Cache tiering will *degrade* performance for most workloads. Users should
exercise extreme caution before using this feature.

* *Workload dependent*: Whether a cache will improve performance is
  highly dependent on the workload. Because there is a cost
  associated with moving objects into or out of the cache, it can only
  be effective when there is a *large skew* in the access pattern in
  the data set, such that most of the requests touch a small number of
  objects. The cache pool should be large enough to capture the
  working set for your workload to avoid thrashing.

* *Difficult to benchmark*: Most benchmarks that users run to measure
  performance will show terrible performance with cache tiering, in
  part because very few of them skew requests toward a small set of
  objects, because it can take a long time for the cache to "warm up,"
  and because the warm-up cost can be high.

* *Usually slower*: For workloads that are not cache tiering-friendly,
  performance is often slower than a normal RADOS pool without cache
  tiering enabled.

* *librados object enumeration*: The librados-level object enumeration
  API is not meant to be coherent in the presence of a cache. If
  your application is using librados directly and relies on object
  enumeration, cache tiering will probably not work as expected.
  (This is not a problem for RGW, RBD, or CephFS.)

* *Complexity*: Enabling cache tiering means that a lot of additional
  machinery and complexity within the RADOS cluster is being used.
  This increases the probability that you will encounter a bug in the system
  that other users have not yet encountered and will put your deployment at a
  higher level of risk.

Known Good Workloads
--------------------

* *RGW time-skewed*: If the RGW workload is such that almost all read
  operations are directed at recently written objects, a simple cache
  tiering configuration that destages recently written objects from
  the cache to the base tier after a configurable period can work
  well.

Known Bad Workloads
-------------------

The following configurations are *known to work poorly* with cache
tiering.

* *RBD with replicated cache and erasure-coded base*: This is a common
  request, but usually does not perform well. Even reasonably skewed
  workloads still send some small writes to cold objects, and because
  small writes are not yet supported by the erasure-coded pool, entire
  (usually 4 MB) objects must be migrated into the cache in order to
  satisfy a small (often 4 KB) write. Only a handful of users have
  successfully deployed this configuration, and it only works for them
  because their data is extremely cold (backups) and they are not in
  any way sensitive to performance.

* *RBD with replicated cache and base*: RBD with a replicated base
  tier does better than when the base is erasure coded, but it is
  still highly dependent on the amount of skew in the workload, and
  very difficult to validate. The user will need to have a good
  understanding of their workload and will need to tune the cache
  tiering parameters carefully.

Setting Up Pools
================

To set up cache tiering, you must have two pools. One will act as the
backing storage and the other will act as the cache.

Setting Up a Backing Storage Pool
---------------------------------

Setting up a backing storage pool typically involves one of two scenarios:

- **Standard Storage**: In this scenario, the pool stores multiple copies
  of an object in the Ceph Storage Cluster.

- **Erasure Coding:** In this scenario, the pool uses erasure coding to
  store data much more efficiently with a small performance tradeoff.

In the standard storage scenario, you can set up a CRUSH rule to establish
the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD
Daemons perform optimally when all storage drives in the rule are of the
same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
for details on creating a rule. Once you have created a rule, create
a backing storage pool.

In the erasure coding scenario, the pool creation arguments will generate the
appropriate rule automatically. See `Create a Pool`_ for details.

In subsequent examples, we will refer to the backing storage pool
as ``cold-storage``.

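As a rough sketch, ``cold-storage`` could be created in either scenario as
follows; the placement group counts and the use of the default erasure-code
profile are illustrative assumptions, not recommendations::

  # Standard (replicated) backing pool
  ceph osd pool create cold-storage 128 128 replicated

  # Erasure-coded backing pool using the default erasure-code profile
  ceph osd pool create cold-storage 128 128 erasure
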
Setting Up a Cache Pool
-----------------------

Setting up a cache pool follows the same procedure as the standard storage
scenario, but with this difference: the drives for the cache tier are typically
high performance drives that reside in their own servers and have their own
CRUSH rule. When setting up such a rule, it should take account of the hosts
that have the high performance drives while omitting the hosts that don't. See
:ref:`CRUSH Device Class<crush-map-device-class>` for details.

In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
the backing pool as ``cold-storage``.

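For example, a minimal sketch of such a setup, assuming the fast drives are
reported with the ``ssd`` device class; the rule name and placement group
counts below are hypothetical::

  # CRUSH rule that only selects OSDs backed by ssd-class devices
  ceph osd crush rule create-replicated hot-storage-rule default host ssd

  # Cache pool placed on that rule
  ceph osd pool create hot-storage 32 32 replicated hot-storage-rule
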
For cache tier configuration and default values, see
`Pools - Set Pool Values`_.

Creating a Cache Tier
=====================

Setting up a cache tier involves associating a backing storage pool with
a cache pool ::

  ceph osd tier add {storagepool} {cachepool}

For example ::

  ceph osd tier add cold-storage hot-storage

To set the cache mode, execute the following::

  ceph osd tier cache-mode {cachepool} {cache-mode}

For example::

  ceph osd tier cache-mode hot-storage writeback

The cache tiers overlay the backing storage tier, so they require one
additional step: you must direct all client traffic from the storage pool to
the cache pool. To direct client traffic directly to the cache pool, execute
the following::

  ceph osd tier set-overlay {storagepool} {cachepool}

For example::

  ceph osd tier set-overlay cold-storage hot-storage

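As a quick sanity check (not required), the tier relationship and overlay are
reflected in the pool entries printed by ``ceph osd dump``, so grepping for the
two pool names is one way to confirm the association took effect::

  ceph osd dump | grep -E 'cold-storage|hot-storage'
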
Configuring a Cache Tier
========================

Cache tiers have several configuration options. You may set
cache tier configuration options with the following usage::

  ceph osd pool set {cachepool} {key} {value}

See `Pools - Set Pool Values`_ for details.

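The matching ``get`` subcommand reads a value back, which is handy for
verifying the settings described below; as a sketch::

  ceph osd pool get {cachepool} {key}

For example, to dump every readable setting of ``hot-storage`` at once::

  ceph osd pool get hot-storage all
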
Target Size and Type
--------------------

Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``::

  ceph osd pool set {cachepool} hit_set_type bloom

For example::

  ceph osd pool set hot-storage hit_set_type bloom

The ``hit_set_count`` and ``hit_set_period`` define how many such HitSets to
store, and how much time each HitSet should cover. ::

  ceph osd pool set {cachepool} hit_set_count 12
  ceph osd pool set {cachepool} hit_set_period 14400
  ceph osd pool set {cachepool} target_max_bytes 1000000000000

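For example, applying those settings to the ``hot-storage`` pool (the values
above are only a starting point, not tuned recommendations)::

  ceph osd pool set hot-storage hit_set_count 12
  ceph osd pool set hot-storage hit_set_period 14400
  ceph osd pool set hot-storage target_max_bytes 1000000000000
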
.. note:: A larger ``hit_set_count`` results in more RAM consumed by
   the ``ceph-osd`` process.

Binning accesses over time allows Ceph to determine whether a Ceph client
accessed an object at least once, or more than once over a time period
("age" vs "temperature").

The ``min_read_recency_for_promote`` defines how many HitSets to check for the
existence of an object when handling a read operation. The checking result is
used to decide whether to promote the object asynchronously. Its value should be
between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted.
If it's set to 1, only the current HitSet is checked: if the object is in the
current HitSet, it is promoted; otherwise it is not. For higher values, that
exact number of the most recent HitSets is checked, and the object is promoted
if it is found in any of the most recent ``min_read_recency_for_promote``
HitSets.

A similar parameter can be set for the write operation, which is
``min_write_recency_for_promote``. ::

  ceph osd pool set {cachepool} min_read_recency_for_promote 2
  ceph osd pool set {cachepool} min_write_recency_for_promote 2

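For example, to require that an object appear in one of the two most recent
HitSets before it is promoted on either reads or writes to ``hot-storage``::

  ceph osd pool set hot-storage min_read_recency_for_promote 2
  ceph osd pool set hot-storage min_write_recency_for_promote 2
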
.. note:: The longer the period and the higher the
   ``min_read_recency_for_promote`` and
   ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
   daemon consumes. In particular, while the agent is actively flushing
   or evicting cache objects, all ``hit_set_count`` HitSets are loaded
   into RAM.

Cache Sizing
------------

The cache tiering agent performs two main functions:

- **Flushing:** The agent identifies modified (or dirty) objects and forwards
  them to the storage pool for long-term storage.

- **Evicting:** The agent identifies objects that haven't been modified
  (or clean) and evicts the least recently used among them from the cache.


Absolute Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects based upon the total number
of bytes or the total number of objects. To specify a maximum number of bytes,
execute the following::

  ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TB, execute the following::

  ceph osd pool set hot-storage target_max_bytes 1099511627776

To specify the maximum number of objects, execute the following::

  ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict at 1M objects, execute the following::

  ceph osd pool set hot-storage target_max_objects 1000000

.. note:: Ceph is not able to determine the size of a cache pool automatically,
   so a configuration for the absolute size is required here; otherwise,
   flushing and eviction will not work. If you specify both limits, the cache
   tiering agent will begin flushing or evicting when either threshold is
   triggered.

.. note:: Client requests are blocked only when the ``target_max_bytes`` or
   ``target_max_objects`` limit has been reached.

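For instance, both limits can be set on ``hot-storage`` at the same time, in
which case whichever threshold is reached first triggers the agent::

  ceph osd pool set hot-storage target_max_bytes 1099511627776
  ceph osd pool set hot-storage target_max_objects 1000000
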
Relative Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects relative to the size of the
cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
`Absolute Sizing`_). When the cache pool contains a certain percentage of
modified (or dirty) objects, the cache tiering agent will flush them to the
storage pool. To set the ``cache_target_dirty_ratio``, execute the following::

  ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to ``0.4`` will begin flushing modified
(dirty) objects when they reach 40% of the cache pool's capacity::

  ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When the dirty objects reach a higher percentage of the cache pool's capacity,
the agent flushes them more aggressively. To set the
``cache_target_dirty_high_ratio``, execute the following::

  ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to ``0.6`` will begin aggressively flushing
dirty objects when they reach 60% of the cache pool's capacity. This value
should be set between ``cache_target_dirty_ratio`` and
``cache_target_full_ratio``::

  ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its capacity, the cache
tiering agent will evict objects to maintain free capacity. To set the
``cache_target_full_ratio``, execute the following::

  ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, setting the value to ``0.8`` will begin evicting unmodified
(clean) objects when they reach 80% of the cache pool's capacity::

  ceph osd pool set hot-storage cache_target_full_ratio 0.8

Cache Age
---------

You can specify the minimum age of an object before the cache tiering agent
flushes a recently modified (or dirty) object to the backing storage pool::

  ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (or dirty) objects after 10 minutes, execute
the following::

  ceph osd pool set hot-storage cache_min_flush_age 600

You can specify the minimum age of an object before it will be evicted from
the cache tier::

  ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects after 30 minutes, execute the following::

  ceph osd pool set hot-storage cache_min_evict_age 1800

Removing a Cache Tier
=====================

Removing a cache tier differs depending on whether it is a writeback
cache or a read-only cache.


Removing a Read-Only Cache
--------------------------

Since a read-only cache does not have modified data, you can disable
and remove it without losing any recent changes to objects in the cache.

#. Change the cache-mode to ``none`` to disable it. ::

     ceph osd tier cache-mode {cachepool} none

   For example::

     ceph osd tier cache-mode hot-storage none

#. Remove the cache pool from the backing pool. ::

     ceph osd tier remove {storagepool} {cachepool}

   For example::

     ceph osd tier remove cold-storage hot-storage


Removing a Writeback Cache
--------------------------

Since a writeback cache may have modified data, you must take steps to ensure
that you do not lose any recent changes to objects in the cache before you
disable and remove it.

#. Change the cache mode to ``proxy`` so that new and modified objects will
   flush to the backing storage pool. ::

     ceph osd tier cache-mode {cachepool} proxy

   For example::

     ceph osd tier cache-mode hot-storage proxy

#. Ensure that the cache pool has been flushed. This may take a few minutes::

     rados -p {cachepool} ls

   If the cache pool still has objects, you can flush them manually.
   For example::

     rados -p hot-storage cache-flush-evict-all

#. Remove the overlay so that clients will not direct traffic to the cache. ::

     ceph osd tier remove-overlay {storagepool}

   For example::

     ceph osd tier remove-overlay cold-storage

#. Finally, remove the cache tier pool from the backing storage pool. ::

     ceph osd tier remove {storagepool} {cachepool}

   For example::

     ceph osd tier remove cold-storage hot-storage

.. _Create a Pool: ../pools#create-a-pool
.. _Pools - Set Pool Values: ../pools#set-pool-values
.. _Bloom Filter: https://en.wikipedia.org/wiki/Bloom_filter
.. _CRUSH Maps: ../crush-map
.. _Absolute Sizing: #absolute-sizing