As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``
At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.
This document exists to detail:

1. Manifest data structures
2. RADOS operations for manipulating manifests.
For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.
As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.
Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.
Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:
* Demote cold RBD chunk to slow pool:

  1. Read object, noting current user_version.
  2. In memory, run CDC implementation to fingerprint object.
  3. Write out each resulting extent to an object in the cold pool.
  4. Submit operation to base pool:

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the cold pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.
RBD users should then either see the state prior to the demotion or
subsequent to it.
Note that between 3 and 4, we potentially leak references, so a
periodic scrub would be needed to validate refcounts.
* Promote cold RBD chunk to fast pool:

  1. Submit ``TIER_PROMOTE``
For clones, all of the above would be identical except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.
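
To make the demote sequence concrete, here is a hypothetical, self-contained
sketch. Fixed-size chunking and ``std::hash`` stand in for the real CDC and
fingerprint algorithms, ``ColdPool``/``BaseObject`` are illustrative types
rather than Ceph code, and the ``ASSERT_VER`` race check is not modeled:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Stand-in fingerprint; a real agent would use sha1/sha256/sha512.
inline std::string fingerprint(const std::string& data) {
    return "fp:" + std::to_string(std::hash<std::string>{}(data));
}

struct ColdPool {
    // fingerprint -> {chunk data, refcount}
    std::map<std::string, std::pair<std::string, int>> objects;
    void put(const std::string& fp, const std::string& data) {
        auto it = objects.find(fp);
        if (it == objects.end())
            objects.emplace(fp, std::make_pair(data, 1)); // new CAS object
        else
            it->second.second++;                          // dedup hit: take a ref
    }
};

struct BaseObject {
    std::string data;                          // in-pool payload
    std::map<uint64_t, std::string> chunk_map; // offset -> fingerprint
};

// Demote: fingerprint the extents, write them to the cold pool (step 3),
// then SET_CHUNK each extent and EVICT_CHUNK the payload (step 4).
void demote(BaseObject& obj, ColdPool& cold, size_t chunk_size) {
    for (size_t off = 0; off < obj.data.size(); off += chunk_size) {
        std::string chunk = obj.data.substr(off, chunk_size);
        std::string fp = fingerprint(chunk);
        cold.put(fp, chunk);
        obj.chunk_map[off] = fp;
    }
    obj.data.clear(); // extents are now MISSING in the base pool
}

// Read path: MISSING extents are served from the cold pool.
std::string read_all(const BaseObject& obj, const ColdPool& cold) {
    std::string out;
    for (const auto& [off, fp] : obj.chunk_map)
        out += cold.objects.at(fp).first;
    return out;
}
```

Note how identical extents collapse to one cold-pool object with a bumped
refcount, while the read path remains oblivious to the demotion.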
For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.
For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
manifest.
Status and Future Work
======================
At the moment, initial versions of a manifest data structure along
with IO path support and rados control operations exist. This section
is meant to outline next steps.
At a high level, our future work plan is:

- Cleanups: Address immediate inconsistencies and shortcomings outlined
  in this document.
- Testing: RADOS relies heavily on teuthology failure testing to validate
  features like cache/tiering. We'll need corresponding tests for
  manifest operations.
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the rados snapshot system. As such, the
  rados operations below need to be extended to work correctly on
  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.
The existing implementation has some things that need to be cleaned up:
* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as user_modify gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    ``user_modify``.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent exists already in the object. Should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked;
    that's probably wrong. We should return an error if it's a
    redirect ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          ...
* ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
  to copy them back in, but does not unset the redirect or clear the
  reference. This violates the invariant that a redirect object
  should be empty in the base pool. In particular, as long as the
  redirect is set, it appears that all operations will be proxied
  even after the promote, defeating the purpose. We do want ``PROMOTE``
  to be able to atomically replace a redirect with the actual
  object, so the solution is to clear the redirect at the end of the
  promote.
* For a chunked manifest, we appear to flush prior to promoting.
  Promotion will often be used to prepare an object for low latency
  reads and writes; accordingly, the only effect should be to read
  any ``MISSING`` extents into the base pool. No flushing should be done.
* It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
  at a dedup extent. Writing the mutated extent back to the dedup pool
  requires writing a new object since the previous one cannot be mutated,
  just as it would if it hadn't been dedup'd yet. Thus, we should always
  drop the reference and remove the manifest pointer.
* There isn't currently a way to "evict" an object region. With the above
  change to ``SET_CHUNK`` to always retain the existing object region, we
  need an ``EVICT_CHUNK`` operation to then remove the extent.
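
The proposed policy above for writes to dedup'd extents (drop the reference
and remove the manifest pointer rather than marking ``FLAG_DIRTY``) can be
sketched as follows. ``ChunkMap`` and the ``refcounts`` table are illustrative
stand-ins, not OSD types:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

struct Chunk { uint64_t length; std::string target_oid; };
using ChunkMap = std::map<uint64_t, Chunk>;  // offset -> backing chunk

std::map<std::string, int> refcounts;        // backing chunk oid -> refs

// On a write to a dedup'd extent: don't mark it DIRTY; instead drop the
// ref on the immutable backing chunk and remove the manifest pointer,
// leaving the extent as if it had never been dedup'd.
void write_to_extent(ChunkMap& manifest, uint64_t offset) {
    auto it = manifest.find(offset);
    if (it == manifest.end())
        return;                            // extent not dedup'd: nothing to do
    refcounts[it->second.target_oid]--;    // drop reference
    manifest.erase(it);                    // remove manifest pointer
}
```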
We rely really heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:
* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).
Add a test that runs an RBD workload concurrently with blind
promote/evict operations.
Add a test that runs an RGW workload concurrently with blind
promote/evict operations.
Fundamentally, we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
clones.
As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:
* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.
See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.
The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.
An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

  write foo [0, 1024)
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  snap 10
  write foo [0, 512) ->
    head: [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  flush foo ->
    head: [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
  snap 20
  write foo [0, 512) (same contents as the original write)
    head: [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::

  head: [0, 512) aaa, [512, 1024) bbb
  20 : [0, 512) ccc, [512, 1024) bbb
  10 : [0, 512) aaa, [512, 1024) bbb
  refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
  remove 20 ->
    head: [0, 512) aaa, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0
At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side which will then match.
See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.
This seems complicated, but it gets us two valuable properties:

1) The refcount change from make_writeable will not block on
   incrementing refcounts in the backing pool.
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.
All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.
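
The adjacency rule can be sketched with a small model in which chunk maps
are reduced to offset → chunk-id. This is an illustration loosely following
the intent of ``calc_refs_to_drop_on_removal``, not the actual OSD code: a
ref is dropped when the removed clone's run of references dies, and also
when the runs on either side of it merge into one:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using ChunkMap = std::map<uint64_t, std::string>; // offset -> backing chunk id

std::map<std::string, int> refs_to_drop_on_removal(
    const ChunkMap& prev, const ChunkMap& cur, const ChunkMap& next) {
    auto at = [](const ChunkMap& m, uint64_t o) -> const std::string* {
        auto it = m.find(o);
        return it == m.end() ? nullptr : &it->second;
    };
    std::map<std::string, int> drops;
    // Case 1: cur holds a ref neither neighbour shares -> its run dies.
    for (const auto& [off, id] : cur) {
        const std::string* p = at(prev, off);
        const std::string* n = at(next, off);
        if ((!p || *p != id) && (!n || *n != id))
            drops[id]++;
    }
    // Case 2: both neighbours share a ref that cur lacks -> two separate
    // runs merge into one, so one of the two refcounts must be dropped.
    for (const auto& [off, id] : prev) {
        const std::string* c = at(cur, off);
        const std::string* n = at(next, off);
        if ((!c || *c != id) && n && *n == id)
            drops[id]++;
    }
    return drops;
}
```

Run against the remove-``20`` example above, this drops one ref for ``ccc``
(its run dies with the clone) and one for ``aaa`` (the head and ``10`` runs
merge), matching ``refcount(aaa)=1, refcount(ccc)=0``.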
There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.
See ``cache-pool.rst``
The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly ``start_flush``.
In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:
* Online object access information: The osd already has pool configs
  for maintaining bloom filters which provide estimates of access
  recency for objects. We probably need to modify this to permit
  hitset maintenance for a normal pool -- there are already
  ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The osd already has a background tiering agent which
  would need to be modified to instead flush and evict using
  manifest operations.
* Use existing features regarding the cache flush policy such as
  hit sets, age, and ratio.
* Add tiering-mode to ``manifest-tiering``
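
As an illustration of the access-information piece, a hitset can be
approximated by a Bloom filter: inserting records an access, and membership
answers "possibly accessed recently" with no false negatives. A minimal
sketch (the size and hash scheme here are arbitrary, not the OSD's actual
implementation):

```cpp
#include <cassert>
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

struct HitSet {
    std::bitset<4096> bits;
    static std::size_t h1(const std::string& oid) {
        return std::hash<std::string>{}(oid) % 4096;
    }
    static std::size_t h2(const std::string& oid) {
        return (std::hash<std::string>{}(oid) / 4096) % 4096;
    }
    void insert(const std::string& oid) {     // record an access
        bits.set(h1(oid));
        bits.set(h2(oid));
    }
    // May return a false positive, never a false negative.
    bool maybe_contains(const std::string& oid) const {
        return bits.test(h1(oid)) && bits.test(h2(oid));
    }
};
```

A tiering agent would consult such a structure to decide which objects are
cold enough to demote.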
Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``)::

  struct object_manifest_t {
    typedef enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    } manifest_t;
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };
The ``type`` enum reflects three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t`` as shown below, also
   specifying the ``length``, target ``OID``, and ``flags``.
::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    ...
    cflag_t flags;   // FLAG_*
  };
``FLAG_DIRTY`` at this time can be set if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.
For manifest operations listed below, we return ``NOOP`` and continue onto
dedicated handling within ``do_osd_ops``.
For redirect objects which haven't been promoted (apparently
``oi.size > 0`` indicates that it's present?) we proxy reads and writes.
For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t`` ``chunk_map``),
we proxy requests to those objects.
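
The chunked-read proxying can be pictured as splitting the requested range
across ``chunk_map`` entries. A hypothetical sketch (``ChunkMap`` and
``SubRead`` are illustrative types; the real check lives in
``can_proxy_chunked_read``):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Chunk { uint64_t length; std::string target_oid; };
using ChunkMap = std::map<uint64_t, Chunk>;  // offset -> chunk

struct SubRead { std::string oid; uint64_t offset; uint64_t length; };

// Split a read [off, off+len) into per-chunk sub-reads to proxy.
// Returns false if any part of the range is not covered by the
// chunk_map, in which case the read cannot be fully proxied.
bool map_read(const ChunkMap& cm, uint64_t off, uint64_t len,
              std::vector<SubRead>* out) {
    uint64_t end = off + len;
    for (const auto& [coff, c] : cm) {
        uint64_t cend = coff + c.length;
        if (cend <= off || coff >= end)
            continue;                       // chunk outside the range
        if (coff > off)
            return false;                   // gap in manifest coverage
        uint64_t e = std::min(end, cend);
        out->push_back({c.target_oid, off - coff, e - off});
        off = e;                            // advance past covered part
    }
    return off >= end;
}
```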
To set up deduplication one must provision two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it
Create the base and chunk objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference
Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
in the ``target_pool``.
A redirected object will forward all operations from the client to the
``target_object``. ::

  void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                    uint64_t tgt_version, int flag = 0);

  rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
    <target_object>
Returns ``ENOENT`` if the object does not exist (TODO: why?)
Returns ``EINVAL`` if the object already is a redirect.
Takes a reference to target as part of operation, can possibly leak a ref
if the acting set resets and the client dies between taking the ref and
recording the redirect.
Truncates object, clears omap, and clears xattrs as a side effect.
At the top of ``do_osd_ops``, does not set user_modify.

This operation is not a user mutation and does not trigger a clone to be created.
There are two purposes of ``set_redirect``:

1. Redirect all operations to the target object (like a proxy)
2. Cache when ``tier_promote`` is called (the redirect will be cleared at this time).
Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
``target_object``. ::

  void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                 std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

  rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
    <caspool> <target_object> <target-offset>
Returns ``ENOENT`` if the object does not exist (TODO: why?)
Returns ``EINVAL`` if the object already is a redirect.
Returns ``EINVAL`` if given an ill-formed parameter buffer.
Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.
Takes references to targets as part of operation, can possibly leak refs
if the acting set resets and the client dies between taking the ref and
recording the redirect.

Truncates object, clears omap, and clears xattrs as a side effect.

This operation is not a user mutation and does not trigger a clone to be created.
TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

  if (!oi.manifest.is_chunked()) {
    oi.manifest.clear();
  }
Clears an extent from an object, leaving only the manifest link between
it and the ``target_object``. ::

  void evict_chunk(
    uint64_t offset, uint64_t length, int flag = 0);

  rados -p base_pool evict-chunk <offset> <length> <object>
Returns ``EINVAL`` if the extent is not present in the manifest.

Note: this does not exist yet.
Promotes the object, ensuring that subsequent reads and writes will be local. ::

  rados -p base_pool tier-promote <obj-name>
Returns ``ENOENT`` if the object does not exist

For a redirect manifest, copies data to head.

TODO: Promote on a redirect object needs to clear the redirect.

For a chunked manifest, reads all ``MISSING`` extents into the base pool;
subsequent reads and writes will be served from the base pool.
Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
resulting ``copy_get`` operation will issue reads which will then be redirected by
the normal manifest read machinery.
Does not set the ``user_modify`` flag.

Future work will involve adding support for specifying a ``clone_id``.
Unsets the manifest info in an object that has a manifest. ::

  void unset_manifest();

  rados -p base_pool unset-manifest <obj-name>
Clears manifest chunks or redirect. Lazily releases references, may
leak.

``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
and so will trigger a snapshot. Note, this will be true even for a
redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
be fixed -- ``unset-manifest`` should not be a ``user_modify``.
Flushes the object which has chunks to the chunk pool. ::

  rados -p base_pool tier-flush <obj-name>
Included in the ``user_modify=false`` ``ignorelist``, does not trigger a clone.

Does not evict the extents.
``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).
* Find an optimal chunk offset

a. Fixed chunk

To find an optimal fixed chunk length, you need to run the following command many
times while changing the ``chunk_size``. ::

  ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
    --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
b. Rabin chunk (Rabin-Karp algorithm)

Rabin-Karp is a string-searching algorithm based
on a rolling hash. But a rolling hash is not enough to do deduplication because
we don't know the chunk boundary. So, we need content-based slicing using
a rolling hash for content-defined chunking.
The current implementation uses the simplest approach: look for chunk boundaries
by inspecting the rolling hash for a pattern (like the
lower N bits being all zeroes).
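
A minimal content-defined chunker along those lines can be sketched as
follows (a toy polynomial rolling hash with arbitrary constants; the tool's
actual rabin parameters differ):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Slide a fixed window over the data, maintain a rolling hash, and
// declare a chunk boundary wherever the lower `mask_bits` bits of the
// hash are all zero.  Boundaries depend only on window contents, so
// they re-align after insertions, unlike fixed-size chunking.
std::vector<size_t> cdc_boundaries(const std::string& data,
                                   size_t window, unsigned mask_bits) {
    const uint64_t prime = 1099511628211ULL;       // arbitrary multiplier
    const uint64_t mask = (1ULL << mask_bits) - 1;
    uint64_t pow = 1;                              // prime^(window-1)
    for (size_t i = 1; i < window; ++i)
        pow *= prime;
    std::vector<size_t> cuts;
    uint64_t hash = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        hash = hash * prime + (uint8_t)data[i];    // shift in new byte
        if (i >= window)
            hash -= pow * prime * (uint8_t)data[i - window]; // drop oldest
        if (i + 1 >= window && (hash & mask) == 0)
            cuts.push_back(i + 1);                 // chunk ends after byte i
    }
    return cuts;
}
```

Because a boundary depends only on the bytes inside the window, prepending a
byte shifts every boundary by exactly one position rather than scrambling
them.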
Users who want to use deduplication need to find an ideal chunk offset.
To find the ideal chunk offset, users should discover
the optimal configuration for their data workload via ``ceph-dedup-tool``.
This information will then be used for object chunking through
the ``set-chunk`` API. ::

  ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
    --chunk-algorithm rabin --fingerprint-algorithm rabin
``ceph-dedup-tool`` has many options for ``rabin chunk``. ::

  --mod-prime <uint64_t>
  --rabin-prime <uint64_t>
  --chunk-mask-bit <uint32_t>
  --window-size <uint32_t>
  --min-chunk <uint32_t>
  --max-chunk <uint64_t>
Users should refer to the following equation when using the above options for ``rabin chunk``. ::

  rabin_hash = (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
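
The equation can be checked by confirming that each rolling step reproduces
a direct recomputation over the window. A sketch with small illustrative
constants (not the tool's defaults; ``pow`` is assumed here to be
``rabin_prime^window_size % mod_prime``):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Direct (non-rolling) hash of window [start, start+win):
// sum of byte[j] * rabin_prime^(win-1-j), mod mod_prime.
uint64_t direct_hash(const std::string& s, size_t start, size_t win,
                     uint64_t rabin_prime, uint64_t mod_prime) {
    uint64_t h = 0;
    for (size_t j = 0; j < win; ++j)
        h = (h * rabin_prime + (uint8_t)s[start + j]) % mod_prime;
    return h;
}

// One step of the rolling update from the equation above, done with
// unsigned modular arithmetic (add mod_prime before subtracting).
uint64_t roll(uint64_t hash, uint8_t new_byte, uint8_t old_byte,
              uint64_t rabin_prime, uint64_t mod_prime, uint64_t pow) {
    uint64_t h = (hash * rabin_prime + new_byte) % mod_prime;
    uint64_t drop = ((uint64_t)old_byte * pow) % mod_prime;
    return (h + mod_prime - drop) % mod_prime;
}
```

Sliding over any string, every rolled value equals the hash recomputed from
scratch over the current window, which is what makes the per-byte update
O(1) instead of O(window).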
c. Fixed chunk vs content-defined chunk

Content-defined chunking may or may not be the optimal solution.
Data chunk ``A`` : ``abcdefgabcdefgabcdefg``
Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
from ``1`` to ``7`` (``abcdefg``). So, if we use fixed chunks, ``7`` is the optimal
chunk length. But in the case of content-based slicing, the optimal chunk
length might not be found (the dedup ratio will not be 100%), because we
would need to find optimal parameters such as the boundary bit, window size,
and prime value. This is not as easy as the fixed chunk case. But
content-defined chunking is very effective in the following case.
Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``
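
The point about data chunks ``B`` and ``C`` can be demonstrated: with fixed
7-byte chunks, ``B`` dedups perfectly against itself, but the one-byte shift
in ``C`` means none of its chunks match ``B``'s, whereas a content-defined
chunker would re-align after the inserted byte. A small self-contained
check:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Fixed-size chunking.
std::vector<std::string> fixed_chunks(const std::string& s, size_t len) {
    std::vector<std::string> out;
    for (size_t i = 0; i < s.size(); i += len)
        out.push_back(s.substr(i, len));
    return out;
}

// How many of b's chunks duplicate some chunk of a?
size_t shared_chunks(const std::vector<std::string>& a,
                     const std::vector<std::string>& b) {
    std::set<std::string> seen(a.begin(), a.end());
    size_t n = 0;
    for (const auto& c : b)
        n += seen.count(c);
    return n;
}
```

``B`` splits into three identical ``abcdefg`` chunks, while ``C`` splits into
``Tabcdef``, ``gabcdef``, ``gabcdef``, ``g`` -- none of which match ``B``'s
chunks.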
* Fix reference count

The reference counting scheme for dedup errs on the side of false positives:
``(manifest object (no ref), chunk object (has ref))`` can occur instead of
``(manifest object (has ref), chunk object (no ref))``.
To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::

  ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL