========
Manifest
========


Introduction
============

As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``.

At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

1. Manifest data structures
2. RADOS operations for manipulating manifests
3. Status and Plans


Intended Usage Model
====================

RBD
---

For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.

As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.

Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.

Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations (a code sketch follows the list):

* Demote cold RBD chunk to slow pool:

  1. Read the object, noting the current ``user_version``.
  2. In memory, run the CDC implementation to fingerprint the object.
  3. Write out each resulting extent to an object in the cold pool
     using the CAS class.
  4. Submit an operation to the base pool:

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the cold pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.

  RBD users should then see either the state prior to the demotion or
  the state subsequent to it.

  Note that between 3 and 4, we potentially leak references, so a
  periodic scrub would be needed to validate refcounts.

* Promote cold RBD chunk to fast pool:

  1. Submit ``TIER_PROMOTE``.

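The following is a minimal sketch of the demote sequence using the librados
C++ API declared later in this document. The ``cdc()`` helper is a
hypothetical stand-in for the CDC/fingerprint step, the chunk writes use a
plain ``write_full`` where a real agent would use the CAS class, and the
``EVICT_CHUNK`` step is omitted because that operation does not exist yet. ::

  #include <rados/librados.hpp>
  #include <ctime>
  #include <string>
  #include <vector>

  // One CDC extent: its offset/length in the source object, the
  // fingerprint-derived name of its backing chunk object, and its bytes.
  struct Extent {
    uint64_t offset;
    uint64_t length;
    std::string fp_oid;         // e.g. hex digest of the chunk's bytes
    librados::bufferlist data;
  };

  // Hypothetical helper: run the CDC implementation over the object's
  // bytes and fingerprint each resulting chunk.
  std::vector<Extent> cdc(const librados::bufferlist& bl);

  int demote(librados::IoCtx& base, librados::IoCtx& cold,
             const std::string& oid)
  {
    // 1. Read the object, noting its current version.
    uint64_t size;
    time_t mtime;
    int r = base.stat(oid, &size, &mtime);
    if (r < 0)
      return r;
    librados::bufferlist bl;
    r = base.read(oid, bl, size, 0);
    if (r < 0)
      return r;
    uint64_t ver = base.get_last_version();

    // 2-3. Chunk and fingerprint in memory, then write each extent to
    // the cold pool. A real agent would use the CAS class here so that
    // an already-present chunk simply gains a reference.
    std::vector<Extent> extents = cdc(bl);
    for (auto& e : extents) {
      r = cold.write_full(e.fp_oid, e.data);
      if (r < 0)
        return r;
    }

    // 4. One op against the base pool: fail if the object has changed
    // since the read, then record each chunk mapping with a reference.
    librados::ObjectWriteOperation op;
    op.assert_version(ver);
    for (auto& e : extents) {
      op.set_chunk(e.offset, e.length, cold, e.fp_oid, 0,
                   CEPH_OSD_OP_FLAG_WITH_REFERENCE);
    }
    return base.operate(oid, &op);
  }
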
For clones, all of the above would be identical, except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist,
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.

RadosGW
-------

For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between an object
that has been dedup'd and one present in the base pool.

For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
chunks set.

Status and Future Work
======================

At the moment, initial versions of a manifest data structure along
with IO path support and RADOS control operations exist. This section
is meant to outline next steps.

At a high level, our future work plan is:

- Cleanups: Address immediate inconsistencies and shortcomings outlined
  in the next section.
- Testing: RADOS relies heavily on teuthology failure testing to validate
  features like cache/tiering. We'll need corresponding tests for
  manifest operations.
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the RADOS snapshot system. As such, the
  RADOS operations below need to be extended to work correctly on
  clones (e.g., we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.


Cleanups
--------

The existing implementation has some things that need to be cleaned up:

* ``SET_REDIRECT``: Should create the object if it doesn't exist; otherwise,
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as ``user_modify`` gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    ``user_modify``.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent already exists in the object. It should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked;
    that's probably wrong. We should return an error if it's a
    ``REDIRECT``: ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          result = -EINVAL;
          goto fail;
        }


* ``TIER_PROMOTE``:

  * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
    to copy them back in, but does not unset the redirect or clear the
    reference. This violates the invariant that a redirect object
    should be empty in the base pool. In particular, as long as the
    redirect is set, it appears that all operations will be proxied
    even after the promote, defeating the purpose. We do want ``PROMOTE``
    to be able to atomically replace a redirect with the actual
    object, so the solution is to clear the redirect at the end of the
    promote.
  * For a chunked manifest, we appear to flush prior to promoting.
    Promotion will often be used to prepare an object for low-latency
    reads and writes; accordingly, the only effect should be to read
    any ``MISSING`` extents into the base pool. No flushing should be done.

* High Level:

  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
    at a dedup extent. Writing the mutated extent back to the dedup pool
    requires writing a new object since the previous one cannot be mutated,
    just as it would if it hadn't been dedup'd yet. Thus, we should always
    drop the reference and remove the manifest pointer.

  * There isn't currently a way to "evict" an object region. With the above
    change to ``SET_CHUNK`` to always retain the existing object region, we
    need an ``EVICT_CHUNK`` operation to then remove the extent.


Testing
-------

We rely heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:

* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).

* RBD tests

  Add a test that runs an RBD workload concurrently with blind
  promote/evict operations.

* RGW

  Add a test that runs an RGW workload concurrently with blind
  promote/evict operations.


Snapshots
---------

Fundamentally, we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.

As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high-level
tasks:

* Modify the op processing pipeline to permit ``SET_CHUNK`` and ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the ``object_manifest`` prior to trying to
  use the overlaps in ``clone_range``. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.

See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.

An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references, with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before, because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

  write foo [0, 1024)
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  snapshot 10
  write foo [0, 512) ->
    head: [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  flush foo ->
    head: [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
  snapshot 20
  write foo [0, 512) (same contents as the original write)
    head: [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1
  flush foo
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1

What should the refcount for ``aaa`` be at the end? By our
rule above, it should be ``2``, since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::

  initial:
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
  trim 20
    head: [0, 512) aaa, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0

At this point, our rule dictates that ``refcount(aaa)`` is ``1``, since the
two ``aaa`` runs have become contiguous. This means that removing ``20``
needs to check for refs held by the clones on either side, which will then
match.

See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.
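
The following is a simplified sketch of that rule (illustrative stand-ins, not
the actual ``osd_types.h`` code): the refs to drop when trimming a clone are
computed from the chunk maps of the two adjacent clones. ::

  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  // Stand-in for chunk_map: extent offset -> backing chunk object.
  using ChunkMap = std::map<uint64_t, std::string>;

  std::vector<std::string> refs_to_drop_on_removal(
      const ChunkMap& prev,     // clone preceding the one being removed
      const ChunkMap& removed,  // the clone being trimmed
      const ChunkMap& next)     // clone succeeding it
  {
    auto shares = [](const ChunkMap& m, uint64_t off, const std::string& tgt) {
      auto it = m.find(off);
      return it != m.end() && it->second == tgt;
    };

    std::vector<std::string> to_drop;

    // Case 1: neither neighbor shares the chunk, so the removed clone
    // held the only ref in its run and that refcount must drop.
    for (const auto& [off, tgt] : removed)
      if (!shares(prev, off, tgt) && !shares(next, off, tgt))
        to_drop.push_back(tgt);

    // Case 2: prev and next share a chunk the removed clone lacked.
    // Removal merges two previously separate runs (e.g. the ``aaa``
    // runs in the trim example above), so two refcounts become one.
    for (const auto& [off, tgt] : prev)
      if (!shares(removed, off, tgt) && shares(next, off, tgt))
        to_drop.push_back(tgt);

    return to_drop;
  }
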

This seems complicated, but it gets us two valuable properties:

1) The refcount change from ``make_writeable`` will not block on
   incrementing a ref.
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.

All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.

Data Structures
===============

Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

The ``type`` enum reflects the three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t``, as shown below, which
   also specifies the ``length``, target ``oid``, and ``flags``.

::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    uint32_t offset;
    uint32_t length;
    hobject_t oid;
    cflag_t flags; // FLAG_*
  };

At present, ``FLAG_DIRTY`` can be set when an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
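
For illustration, a 1024-byte object stored as two 512-byte, fingerprint-named
chunks might carry a ``chunk_map`` like the following (a sketch using
simplified stand-ins for the OSD types, not actual code). ::

  #include <cstdint>
  #include <map>
  #include <string>

  // Simplified stand-in for chunk_info_t; a string replaces hobject_t.
  struct chunk_info {
    uint32_t offset;   // offset within the target (chunk) object
    uint32_t length;
    std::string oid;   // fingerprint-derived chunk object name
    uint32_t flags;    // FLAG_* bits
  };

  constexpr uint32_t FLAG_HAS_REFERENCE   = 4;
  constexpr uint32_t FLAG_HAS_FINGERPRINT = 8;

  int main() {
    // Keyed by the offset of each extent in the source object. Both
    // extents are present (FLAG_MISSING clear), refcounted, and
    // fingerprinted.
    std::map<uint64_t, chunk_info> chunk_map;
    chunk_map[0]   = {0, 512, "aaa", FLAG_HAS_REFERENCE | FLAG_HAS_FINGERPRINT};
    chunk_map[512] = {0, 512, "bbb", FLAG_HAS_REFERENCE | FLAG_HAS_FINGERPRINT};
  }
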


Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.

For the manifest operations listed below, we return ``NOOP`` and continue on
to dedicated handling within ``do_osd_ops``.

For redirect objects which haven't been promoted (apparently ``oi.size > 0``
indicates that the object is present?), we proxy reads and writes.

For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` holds (basically, all
of the ops are reads of extents in the ``object_manifest_t`` ``chunk_map``),
we proxy requests to those objects.
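
The following illustrative pseudocode summarizes that dispatch (self-contained
stand-ins; the real logic lives in ``PrimaryLogPG::maybe_handle_manifest_detail``). ::

  #include <cstdint>

  enum class cache_result_t { NOOP, HANDLED_PROXY };

  // Stand-ins for the op context and manifest state inspected by the OSD.
  struct manifest_t { bool redirect = false; bool chunked = false; };
  struct op_ctx_t {
    bool is_manifest_op;            // SET_CHUNK, TIER_PROMOTE, ...
    bool is_read;
    bool reads_only_mapped_extents; // can_proxy_chunked_read analogue
    uint64_t oi_size;               // oi.size
    manifest_t manifest;
  };

  cache_result_t maybe_handle_manifest(const op_ctx_t& ctx) {
    if (ctx.is_manifest_op)
      return cache_result_t::NOOP;           // handled later in do_osd_ops

    if (ctx.manifest.redirect && ctx.oi_size == 0)
      return cache_result_t::HANDLED_PROXY;  // proxy reads and writes

    if (ctx.manifest.chunked && ctx.is_read && ctx.reads_only_mapped_extents)
      return cache_result_t::HANDLED_PROXY;  // proxy reads to backing chunks

    return cache_result_t::NOOP;             // normal processing
  }
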


RADOS Interface
================

To set up deduplication, one must provision two pools: one to act as the
base pool and the other as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows.

::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

Create objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $OFFSET $LENGTH --target-pool chunk_pool foo-chunk $TARGET_OFFSET --with-reference

Operations:

* ``set-redirect``

  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
  in the ``target_pool``.
  A redirected object will forward all operations from the client to the
  ``target_object``. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

  Returns ``ENOENT`` if the object does not exist (TODO: why?).
  Returns ``EINVAL`` if the object already is a redirect.

  Takes a reference to the target as part of the operation; can possibly leak a ref
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  At the top of ``do_osd_ops``, does not set ``user_modify``.

  This operation is not a user mutation and does not trigger a clone to be created.

  There are two purposes of ``set_redirect``:

  1. Redirect all operations to the target object (like a proxy)
  2. Cache data until ``tier_promote`` is called (the redirect will be cleared at that time)
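
  As a usage sketch (assuming the C++ binding declared above is exposed on
  ``librados::ObjectWriteOperation``), creating a redirect might look like
  this: ::

    #include <rados/librados.hpp>

    // Minimal sketch: make base_pool/foo a redirect to
    // target_pool/foo-target. Assumes both IoCtxs are already open on
    // the same cluster; tgt_version is passed as 0 here.
    int make_redirect(librados::IoCtx& base, librados::IoCtx& target)
    {
      librados::ObjectWriteOperation op;
      op.set_redirect("foo-target", target, 0 /* tgt_version */);
      return base.operate("foo", &op);  // -ENOENT if "foo" does not exist
    }
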

* ``set-chunk``

  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
  ``target_object``. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

  Returns ``ENOENT`` if the object does not exist (TODO: why?).
  Returns ``EINVAL`` if the object already is a redirect.
  Returns ``EINVAL`` on an ill-formed parameter buffer.
  Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.

  Takes references to the targets as part of the operation; can possibly leak refs
  if the acting set resets and the client dies between taking the refs and
  recording the chunk mapping.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  This operation is not a user mutation and does not trigger a clone to be created.

  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }

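  As a usage sketch (again assuming the declaration above is exposed on
  ``librados::ObjectWriteOperation``), mapping a single extent might look like
  this: ::

    #include <rados/librados.hpp>

    // Minimal sketch: link extent [0, 512) of base_pool/foo to offset 0
    // of chunk_pool/foo-chunk, which must already hold the chunk's
    // bytes. CEPH_OSD_OP_FLAG_WITH_REFERENCE (from rados.h) mirrors the
    // CLI's --with-reference and takes a ref on the target.
    int map_chunk(librados::IoCtx& base, librados::IoCtx& chunks)
    {
      librados::ObjectWriteOperation op;
      op.set_chunk(0, 512, chunks, "foo-chunk", 0,
                   CEPH_OSD_OP_FLAG_WITH_REFERENCE);
      return base.operate("foo", &op);
    }
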
* ``evict-chunk``

  Clears an extent from an object, leaving only the manifest link between
  it and the ``target_object``. ::

    void evict_chunk(
      uint64_t offset, uint64_t length, int flag = 0);

    rados -p base_pool evict-chunk <offset> <length> <object>

  Returns ``EINVAL`` if the extent is not present in the manifest.

  Note: this operation does not exist yet.

* ``tier-promote``

  Promotes the object, ensuring that subsequent reads and writes will be local. ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ``ENOENT`` if the object does not exist.

  For a redirect manifest, copies data to head.

  TODO: Promote on a redirect object needs to clear the redirect.

  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
  subsequent reads and writes will be served from the base pool.

  Implementation note: for a chunked manifest, this calls ``start_copy`` on itself. The
  resulting ``copy_get`` operation will issue reads which will then be redirected by
  the normal manifest read machinery.

  Does not set the ``user_modify`` flag.

  Future work will involve adding support for specifying a ``clone_id``.

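  A usage sketch (assuming the declaration above): ::

    #include <rados/librados.hpp>
    #include <string>

    // Minimal sketch: promote oid so subsequent reads and writes are
    // served from the base pool.
    int promote(librados::IoCtx& base, const std::string& oid)
    {
      librados::ObjectWriteOperation op;
      op.tier_promote();
      return base.operate(oid, &op);  // -ENOENT if the object is absent
    }
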
* ``unset-manifest``

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

    rados -p base_pool unset-manifest <obj-name>

  Clears manifest chunks or the redirect. Lazily releases references; may
  leak.

  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
  and so it will trigger a snapshot. Note that this is true even for a
  redirect, though ``SET_REDIRECT`` does not flip ``user_modify``. This should
  be fixed -- ``unset-manifest`` should not be a ``user_modify``.

* ``tier-flush``

  Flush the object, which has chunks, to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the ``user_modify=false`` ``ignorelist``; does not trigger a clone.

  Does not evict the extents.


ceph-dedup-tool
===============

``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).

* Find an optimal chunk offset

  a. Fixed chunk

     To find an optimal fixed chunk length, run the following command several
     times while varying the ``chunk_size``. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
         --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

     Rabin-Karp is a string-searching algorithm based on a rolling hash.
     A rolling hash alone is not enough for deduplication because
     we don't know the chunk boundaries, so we need content-based slicing
     using a rolling hash for content-defined chunking.
     The current implementation uses the simplest approach: look for chunk
     boundaries by inspecting the rolling hash for a pattern (such as the
     lower N bits all being zeroes).

     Users who want to use deduplication need to find the optimal chunking
     configuration for their data workload via ``ceph-dedup-tool``.
     This information will then be used for object chunking through
     the ``set-chunk`` API. ::

       ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
         --chunk-algorithm rabin --fingerprint-algorithm rabin

     ``ceph-dedup-tool`` has many options for tuning ``rabin chunk``: ::

       --mod-prime <uint64_t>
       --rabin-prime <uint64_t>
       --pow <uint64_t>
       --chunk-mask-bit <uint32_t>
       --window-size <uint32_t>
       --min-chunk <uint32_t>
       --max-chunk <uint64_t>

     These options relate to the rolling hash via the following recurrence. ::

       rabin_hash =
         (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

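     For illustration, here is a sketch of that recurrence together with the
     boundary test implied by ``--chunk-mask-bit`` (not the tool's actual
     implementation; real code must also guard against 64-bit overflow). ::

       #include <cstdint>

       struct RabinParams {
         uint64_t mod_prime;
         uint64_t rabin_prime;
         uint64_t pow;            // rabin_prime^(window_size-1) % mod_prime
         uint32_t chunk_mask_bit; // boundary when the low N bits are zero
       };

       // Slide the window one byte and report whether this position is a
       // chunk boundary. mod_prime is added before subtracting so that
       // the unsigned arithmetic cannot wrap below zero.
       bool roll(uint64_t& rabin_hash, uint8_t new_byte, uint8_t old_byte,
                 const RabinParams& p)
       {
         rabin_hash = (rabin_hash * p.rabin_prime + new_byte + p.mod_prime
                       - (old_byte * p.pow) % p.mod_prime) % p.mod_prime;
         uint64_t mask = (uint64_t(1) << p.chunk_mask_bit) - 1;
         return (rabin_hash & mask) == 0;
       }
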
  c. Fixed chunk vs. content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example, consider:

     Data chunk ``A`` : ``abcdefgabcdefgabcdefg``

     The ideal unit for deduplicating ``A`` is ``abcdefg``, so with fixed
     chunking, ``7`` is the optimal chunk length. In the case of
     content-based slicing, however, that optimal chunk length may not be
     found (the dedup ratio will not be 100%), because we first need to find
     optimal parameters such as the boundary bit, window size, and prime
     value, which is not as easy as picking a fixed chunk length.
     Content-defined chunking is, however, very effective in the following
     case, where a single inserted byte shifts all subsequent data:

     Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

     Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``

     With fixed chunks, every chunk of ``C`` differs from the corresponding
     chunk of ``B``, whereas content-defined chunking re-synchronizes at the
     next boundary after the inserted ``T`` and the common data still dedups.

580
581* Fix reference count
582
583 The key idea behind of reference counting for dedup is false-positive, which means
584 ``(manifest object (no ref),, chunk object(has ref))`` happen instead of
585 ``(manifest object (has ref), chunk 1(no ref))``.
586 To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::
587
588 ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
589