========
Manifest
========


Introduction
============

As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``.

At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

1. Manifest data structures
2. RADOS operations for manipulating manifests
3. Status and plans


Intended Usage Model
====================

RBD
---

For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.

As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.

Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.

Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:

* Demote cold RBD chunk to slow pool:

  1. Read object, noting current user_version.
  2. In memory, run CDC implementation to fingerprint object.
  3. Write out each resulting extent to an object in the cold pool
     using the CAS class.
  4. Submit operation to base pool (see the sketch below):

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the cold pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.

  RBD users should then either see the state prior to the demotion or
  subsequent to it.

  Note that between 3 and 4, we potentially leak references, so a
  periodic scrub would be needed to validate refcounts.

* Promote cold RBD chunk to fast pool:

  1. Submit ``TIER_PROMOTE``.

For clones, all of the above would be identical except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.
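
Below is a hedged, illustrative librados sketch of the demotion submit step
(step 4 above) for a single 4MB extent. The handles and object names are
assumptions, ``CEPH_OSD_OP_FLAG_WITH_REFERENCE`` comes from Ceph's internal
``include/rados.h``, and ``EVICT_CHUNK`` has no librados binding yet (see
Cleanups below), so only the version assert and the chunk mapping are shown::

  #include <rados/librados.hpp>

  // Illustrative only: map [0, 4MB) of base_pool/foo onto a CAS object
  // already written to the cold pool with the CAS class. Assumes
  // base_ioctx/cold_ioctx are open IoCtx handles and user_version was
  // captured from the initial read.
  int demote_extent(librados::IoCtx& base_ioctx, librados::IoCtx& cold_ioctx,
                    uint64_t user_version, const std::string& fp_oid)
  {
    librados::ObjectWriteOperation op;
    op.assert_version(user_version);  // fail if mutated since the read
    op.set_chunk(0, 4 * 1024 * 1024, cold_ioctx, fp_oid, 0,
                 CEPH_OSD_OP_FLAG_WITH_REFERENCE);
    return base_ioctx.operate("foo", &op);
  }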

RadosGW
-------

For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.

For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
chunks set.

Status and Future Work
======================

At the moment, initial versions of a manifest data structure along
with IO path support and rados control operations exist. This section
is meant to outline next steps.

At a high level, our future work plan is:

- Cleanups: Address immediate inconsistencies and shortcomings outlined
  in the next section.
- Testing: RADOS relies heavily on teuthology failure testing to validate
  features like cache/tiering. We'll need corresponding tests for
  manifest operations.
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the rados snapshot system. As such, the
  rados operations below need to be extended to work correctly on
  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.


Cleanups
--------

The existing implementation has some things that need to be cleaned up:

* ``SET_REDIRECT``: Should create the object if it doesn't exist; otherwise,
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as user_modify gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    user_modify.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent already exists in the object. It should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked;
    that's probably wrong. We should return an error if it's a
    ``REDIRECT`` ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          result = -EINVAL;
          goto fail;
        }

* ``TIER_PROMOTE``:

  * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
    to copy them back in, but does not unset the redirect or clear the
    reference. This violates the invariant that a redirect object
    should be empty in the base pool. In particular, as long as the
    redirect is set, it appears that all operations will be proxied
    even after the promote, defeating the purpose. We do want ``PROMOTE``
    to be able to atomically replace a redirect with the actual
    object, so the solution is to clear the redirect at the end of the
    promote.
  * For a chunked manifest, we appear to flush prior to promoting.
    Promotion will often be used to prepare an object for low latency
    reads and writes; accordingly, the only effect should be to read
    any ``MISSING`` extents into the base pool. No flushing should be done.

* High Level:

  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
    at a dedup extent. Writing the mutated extent back to the dedup pool
    requires writing a new object since the previous one cannot be mutated,
    just as it would if it hadn't been dedup'd yet. Thus, we should always
    drop the reference and remove the manifest pointer.

  * There isn't currently a way to "evict" an object region. With the above
    change to ``SET_CHUNK`` to always retain the existing object region, we
    need an ``EVICT_CHUNK`` operation to then remove the extent.

Testing
-------

We rely heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:

* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).

* RBD tests

  Add a test that runs an RBD workload concurrently with blind
  promote/evict operations.

* RGW

  Add a test that runs an RGW workload concurrently with blind
  promote/evict operations.

Snapshots
---------

Fundamentally, we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.

As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:

* Modify the op processing pipeline to permit ``SET_CHUNK`` and ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.

See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.

An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before, because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

  write foo [0, 1024)
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  snapshot 10
  write foo [0, 512) ->
    head: [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  flush foo ->
    head: [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
  snapshot 20
  write foo [0, 512) (same contents as the original write)
    head: [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
  flush foo
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1

What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::

  initial:
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
  trim 20
    head: [0, 512) aaa, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0

At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side, which will then match.

See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.

This seems complicated, but it gets us two valuable properties:

1) The refcount change from ``make_writeable`` will not block on
   incrementing a ref.
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.

All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.
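
For illustration only, the adjacency rule might be sketched as follows; this
is a simplified stand-in for ``calc_refs_to_drop_on_removal``, with a
hypothetical ``chunk_map_t`` that maps each offset to a backing chunk id::

  #include <cstdint>
  #include <map>
  #include <string>

  using chunk_map_t = std::map<uint64_t, std::string>;  // offset -> chunk oid

  // Does removing a clone drop the ref held at `offset`? Under the rule
  // above, a shared refcount survives if either adjacent clone holds the
  // same reference at the same offset.
  bool drops_ref(const chunk_map_t* prev,     // preceding clone (or nullptr)
                 const chunk_map_t& removed,  // clone being removed
                 const chunk_map_t* next,     // succeeding clone (or nullptr)
                 uint64_t offset)
  {
    auto it = removed.find(offset);
    if (it == removed.end())
      return false;  // the removed clone holds no ref at this offset
    auto matches = [&](const chunk_map_t* adj) {
      if (!adj) return false;
      auto a = adj->find(offset);
      return a != adj->end() && a->second == it->second;
    };
    // A matching neighbor inherits the shared refcount; only drop the ref
    // when neither neighbor shares the reference.
    return !matches(prev) && !matches(next);
  }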

Cache/Tiering
-------------

There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.

See ``cache-pool.rst``.

The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly ``stat_flush``.

In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:

* Online object access information: The OSD already has pool configs
  for maintaining bloom filters which provide estimates of access
  recency for objects. We probably need to modify this to permit
  hitset maintenance for a normal pool -- there are already
  ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The OSD already has a background tiering agent which
  would need to be modified to instead flush and evict using
  manifests.
* Use the existing cache flush policy features, such as:

  - hitset
  - age, ratio, bytes

* Add a tiering mode to ``manifest-tiering``:

  - Writeback
  - Read-only


Data Structures
===============

Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

The ``type`` enum reflects three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t`` as shown below, also
   specifying the ``length``, target ``OID``, and ``flags``.

::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    uint32_t offset;
    uint32_t length;
    hobject_t oid;
    cflag_t flags;  // FLAG_*
  };

``FLAG_DIRTY`` at this time can happen if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
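
As a purely hypothetical illustration (oids and sizes invented), a 12 KiB
object deduplicated into three 4 KiB chunks, the last of which has been
evicted from the base pool, might carry a ``chunk_map`` like::

  chunk_map = {
    0    -> { offset: 0,    length: 4096, oid: fp_a,
              flags: HAS_REFERENCE|HAS_FINGERPRINT },
    4096 -> { offset: 4096, length: 4096, oid: fp_b,
              flags: HAS_REFERENCE|HAS_FINGERPRINT },
    8192 -> { offset: 8192, length: 4096, oid: fp_c,
              flags: MISSING|HAS_REFERENCE|HAS_FINGERPRINT },
  }

Reads of the first two extents can be served from the base pool, while
reads of [8192, 12288) must be proxied to ``fp_c`` or promoted first.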

Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.

For manifest operations listed below, we return ``NOOP`` and continue on to
dedicated handling within ``do_osd_ops``.

For redirect objects which haven't been promoted (apparently ``oi.size >
0`` indicates that it's present?), we proxy reads and writes.

For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t`` ``chunk_map``),
we proxy requests to those objects.

RADOS Interface
===============

To set up deduplication, one must provision two pools: one to act as the
base pool and the other to act as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows.

::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

Create objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $LENGTH --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference

Operations:

* ``set-redirect``

  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
  in the ``target_pool``.
  A redirected object will forward all operations from the client to the
  ``target_object``. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.

  Takes a reference to the target as part of the operation; can possibly leak a ref
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  At the top of ``do_osd_ops``, does not set user_modify.

  This operation is not a user mutation and does not trigger a clone to be created.

  There are two purposes of ``set_redirect``:

  1. Redirect all operations to the target object (like a proxy)
  2. Cache when ``tier_promote`` is called (the redirect will be cleared at this time)
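
  A hedged usage sketch from C++ (object and pool names are illustrative;
  error handling omitted)::

    // Make base_pool/foo a redirect to target_pool/foo-target.
    // Assumes base_ioctx and target_ioctx are open IoCtx handles.
    librados::ObjectWriteOperation op;
    op.set_redirect("foo-target", target_ioctx, 0 /* tgt_version */);
    int r = base_ioctx.operate("foo", &op);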

* ``set-chunk``

  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
  ``target_object``. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.
  Returns ``EINVAL`` on an ill-formed parameter buffer.
  Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.

  Takes references to targets as part of the operation; can possibly leak refs
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  This operation is not a user mutation and does not trigger a clone to be created.
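
  A hedged usage sketch mirroring the CLI above; names are illustrative, and
  ``CEPH_OSD_OP_FLAG_WITH_REFERENCE`` comes from Ceph's internal
  ``include/rados.h``::

    // Map [0, 4096) of base_pool/foo onto chunk_pool/foo-chunk at offset 0,
    // taking a reference on the target chunk as part of the operation.
    librados::ObjectWriteOperation op;
    op.set_chunk(0, 4096, chunk_ioctx, "foo-chunk", 0,
                 CEPH_OSD_OP_FLAG_WITH_REFERENCE);
    int r = base_ioctx.operate("foo", &op);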

  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }

* ``evict-chunk``

  Clears an extent from an object, leaving only the manifest link between
  it and the ``target_object``. ::

    void evict_chunk(
      uint64_t offset, uint64_t length, int flag = 0);

    rados -p base_pool evict-chunk <offset> <length> <object>

  Returns ``EINVAL`` if the extent is not present in the manifest.

  Note: this operation does not exist yet.

* ``tier-promote``

  Promotes the object, ensuring that subsequent reads and writes will be local ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ``ENOENT`` if the object does not exist.

  For a redirect manifest, copies data to head.

  TODO: Promote on a redirect object needs to clear the redirect.

  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
  subsequent reads and writes will be served from the base pool.

  Implementation note: for a chunked manifest, calls ``start_copy`` on itself. The
  resulting ``copy_get`` operation will issue reads which will then be redirected by
  the normal manifest read machinery.

  Does not set the ``user_modify`` flag.

  Future work will involve adding support for specifying a ``clone_id``.
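
  A hedged usage sketch (object name illustrative)::

    // Pull any MISSING extents of base_pool/foo back into the base pool so
    // that subsequent reads and writes are served locally.
    librados::ObjectWriteOperation op;
    op.tier_promote();
    int r = base_ioctx.operate("foo", &op);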

* ``unset-manifest``

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

    rados -p base_pool unset-manifest <obj-name>

  Clears manifest chunks or redirect. Lazily releases references; may
  leak.

  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
  and so will trigger a snapshot. Note, this will be true even for a
  redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
  be fixed -- ``unset-manifest`` should not be a ``user_modify``.
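
  A hedged usage sketch (object name illustrative)::

    // Strip the manifest (redirect or chunk_map) from base_pool/foo.
    // Backing references are released lazily and may leak (see above).
    librados::ObjectWriteOperation op;
    op.unset_manifest();
    int r = base_ioctx.operate("foo", &op);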

* ``tier-flush``

  Flush the object which has chunks to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the ``user_modify=false`` ``ignorelist``; does not trigger a clone.

  Does not evict the extents.

ceph-dedup-tool
===============

``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).

* Find an optimal chunk offset

  a. Fixed chunk

  To find a good fixed chunk length, you need to run the following command many
  times while changing the ``chunk_size``. ::

    ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
      --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

  Rabin-Karp is a string-searching algorithm based
  on a rolling hash. A rolling hash alone is not enough for deduplication because
  we don't know the chunk boundaries, so we need content-based slicing on top of
  the rolling hash for content-defined chunking.
  The current implementation uses the simplest approach: look for chunk boundaries
  by inspecting the rolling hash for a pattern (such as the
  lower N bits all being zero).

  Users who want to use deduplication need to find an ideal chunk offset,
  discovering the optimal configuration for their data workload via ``ceph-dedup-tool``.
  This information will then be used for object chunking through
  the ``set-chunk`` API. ::

    ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
      --chunk-algorithm rabin --fingerprint-algorithm rabin

  ``ceph-dedup-tool`` has many options for ``rabin chunk``. ::

    --mod-prime <uint64_t>
    --rabin-prime <uint64_t>
    --pow <uint64_t>
    --chunk-mask-bit <uint32_t>
    --window-size <uint32_t>
    --min-chunk <uint32_t>
    --max-chunk <uint64_t>

  Users should refer to the following equation when using the above options for
  ``rabin chunk``. ::

    rabin_hash =
      (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
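
  For clarity, a minimal C++ sketch of that rolling-hash update (parameter
  names follow the CLI options above; this is not the tool's actual
  implementation)::

    #include <cstdint>

    // Slide the window one byte: fold in new_byte and remove old_byte's
    // contribution, where pow = rabin_prime^(window_size - 1) % mod_prime.
    // Adding mod_prime before subtracting guards against unsigned underflow.
    uint64_t rabin_update(uint64_t rabin_hash, uint8_t new_byte, uint8_t old_byte,
                          uint64_t rabin_prime, uint64_t pow, uint64_t mod_prime) {
      uint64_t h = (rabin_hash * rabin_prime + new_byte) % mod_prime;
      return (h + mod_prime - (old_byte * pow) % mod_prime) % mod_prime;
    }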

  c. Fixed chunk vs. content-defined chunk

  Content-defined chunking may or may not be the optimal solution.
  For example:

  Data chunk ``A`` : ``abcdefgabcdefgabcdefg``

  Consider deduplicating data chunk ``A``. The ideal chunk offset is
  from ``1`` to ``7`` (``abcdefg``); so, with fixed chunking, ``7`` is the
  optimal chunk length. With content-based slicing, however, the optimal
  chunking may not be found (the dedup ratio will not be 100%), because we
  need to find optimal parameters such as the boundary bits, window size,
  and prime value, which is not as easy as picking a fixed chunk length.
  But content-defined chunking is very effective in cases such as the
  following, where a shifted prefix would defeat fixed chunking:

  Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

  Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``

* Fix reference count

  The key idea behind reference counting for dedup is that it is
  false-positive tolerant: ``(manifest object: no ref, chunk object: has ref)``
  can happen, instead of ``(manifest object: has ref, chunk object: no ref)``.
  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL