========
Manifest
========


Introduction
============

As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``.

At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

1. Manifest data structures
2. RADOS operations for manipulating manifests
3. Status and plans


Intended Usage Model
====================

RBD
---

For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.

As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.

Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.

Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:

* Demote cold RBD chunk to slow pool:

  1. Read object, noting current user_version.
  2. In memory, run CDC implementation to fingerprint object.
  3. Write out each resulting extent to an object in the cold pool
     using the CAS class.
  4. Submit operation to base pool (see the sketch below):

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the cold pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.

  RBD users should then either see the state prior to the demotion or
  subsequent to it.

  Note that between 3 and 4, we potentially leak references, so a
  periodic scrub would be needed to validate refcounts.

* Promote cold RBD chunk to fast pool:

  1. Submit ``TIER_PROMOTE``.

For clones, all of the above would be identical except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.
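
Below is a hedged, illustrative librados sketch of the demotion submit step
(step 4 above) for a single 4MB extent. The handles and object names are
assumptions, ``CEPH_OSD_OP_FLAG_WITH_REFERENCE`` comes from Ceph's internal
``include/rados.h``, and ``EVICT_CHUNK`` has no librados binding yet (see
Cleanups below), so only the version assert and the chunk mapping are shown::

  #include <rados/librados.hpp>

  // Illustrative only: map [0, 4MB) of base_pool/foo onto a CAS object
  // already written to the cold pool with the CAS class. Assumes
  // base_ioctx/cold_ioctx are open IoCtx handles and user_version was
  // captured from the initial read.
  int demote_extent(librados::IoCtx& base_ioctx, librados::IoCtx& cold_ioctx,
                    uint64_t user_version, const std::string& fp_oid)
  {
    librados::ObjectWriteOperation op;
    op.assert_version(user_version);  // fail if mutated since the read
    op.set_chunk(0, 4 * 1024 * 1024, cold_ioctx, fp_oid, 0,
                 CEPH_OSD_OP_FLAG_WITH_REFERENCE);
    return base_ioctx.operate("foo", &op);
  }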

RadosGW
-------

For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.

For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
chunks set.

Status and Future Work
======================

At the moment, initial versions of a manifest data structure along
with IO path support and rados control operations exist. This section
is meant to outline next steps.

At a high level, our future work plan is:

- Cleanups: Address immediate inconsistencies and shortcomings outlined
  in the next section.
- Testing: RADOS relies heavily on teuthology failure testing to validate
  features like cache/tiering. We'll need corresponding tests for
  manifest operations.
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the rados snapshot system. As such, the
  rados operations below need to be extended to work correctly on
  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.


Cleanups
--------

The existing implementation has some things that need to be cleaned up:

* ``SET_REDIRECT``: Should create the object if it doesn't exist; otherwise,
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as user_modify gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    user_modify.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent already exists in the object. It should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked;
    that's probably wrong. We should return an error if it's a
    ``REDIRECT`` ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          result = -EINVAL;
          goto fail;
        }

* ``TIER_PROMOTE``:

  * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
    to copy them back in, but does not unset the redirect or clear the
    reference. This violates the invariant that a redirect object
    should be empty in the base pool. In particular, as long as the
    redirect is set, it appears that all operations will be proxied
    even after the promote, defeating the purpose. We do want ``PROMOTE``
    to be able to atomically replace a redirect with the actual
    object, so the solution is to clear the redirect at the end of the
    promote.
  * For a chunked manifest, we appear to flush prior to promoting.
    Promotion will often be used to prepare an object for low latency
    reads and writes; accordingly, the only effect should be to read
    any ``MISSING`` extents into the base pool. No flushing should be done.

* High Level:

  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
    at a dedup extent. Writing the mutated extent back to the dedup pool
    requires writing a new object since the previous one cannot be mutated,
    just as it would if it hadn't been dedup'd yet. Thus, we should always
    drop the reference and remove the manifest pointer.

  * There isn't currently a way to "evict" an object region. With the above
    change to ``SET_CHUNK`` to always retain the existing object region, we
    need an ``EVICT_CHUNK`` operation to then remove the extent.

Testing
-------

We rely heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:

* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).

* RBD tests

  Add a test that runs an RBD workload concurrently with blind
  promote/evict operations.

* RGW

  Add a test that runs an RGW workload concurrently with blind
  promote/evict operations.

Snapshots
---------

Fundamentally, we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.

As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:

* Modify the op processing pipeline to permit ``SET_CHUNK`` and ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.

See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.

An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before, because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

  write foo [0, 1024)
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  snapshot 10
  write foo [0, 512) ->
    head: [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  flush foo ->
    head: [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
  snapshot 20
  write foo [0, 512) (same contents as the original write)
    head: [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
  flush foo
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1

What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::

  initial:
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
  trim 20
    head: [0, 512) aaa, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0

At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side, which will then match.

See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.

This seems complicated, but it gets us two valuable properties:

1) The refcount change from ``make_writeable`` will not block on
   incrementing a ref.
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.

All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.
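
For illustration only, the adjacency rule might be sketched as follows; this
is a simplified stand-in for ``calc_refs_to_drop_on_removal``, with a
hypothetical ``chunk_map_t`` that maps each offset to a backing chunk id::

  #include <cstdint>
  #include <map>
  #include <string>

  using chunk_map_t = std::map<uint64_t, std::string>;  // offset -> chunk oid

  // Does removing a clone drop the ref held at `offset`? Under the rule
  // above, a shared refcount survives if either adjacent clone holds the
  // same reference at the same offset.
  bool drops_ref(const chunk_map_t* prev,     // preceding clone (or nullptr)
                 const chunk_map_t& removed,  // clone being removed
                 const chunk_map_t* next,     // succeeding clone (or nullptr)
                 uint64_t offset)
  {
    auto it = removed.find(offset);
    if (it == removed.end())
      return false;  // the removed clone holds no ref at this offset
    auto matches = [&](const chunk_map_t* adj) {
      if (!adj) return false;
      auto a = adj->find(offset);
      return a != adj->end() && a->second == it->second;
    };
    // A matching neighbor inherits the shared refcount; only drop the ref
    // when neither neighbor shares the reference.
    return !matches(prev) && !matches(next);
  }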

Cache/Tiering
-------------

There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.

See ``cache-pool.rst``.

The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly ``stat_flush``.

In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:

* Online object access information: The OSD already has pool configs
  for maintaining bloom filters which provide estimates of access
  recency for objects. We probably need to modify this to permit
  hitset maintenance for a normal pool -- there are already
  ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The OSD already has a background tiering agent which
  would need to be modified to instead flush and evict using
  manifests.
* Use the existing cache flush policy features, such as:

  - hitset
  - age, ratio, bytes

* Add a tiering mode to ``manifest-tiering``:

  - Writeback
  - Read-only


Data Structures
===============

Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

The ``type`` enum reflects three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t`` as shown below, also
   specifying the ``length``, target ``OID``, and ``flags``.

::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    uint32_t offset;
    uint32_t length;
    hobject_t oid;
    cflag_t flags;  // FLAG_*
  };

``FLAG_DIRTY`` at this time can happen if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
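
As a purely hypothetical illustration (oids and sizes invented), a 12 KiB
object deduplicated into three 4 KiB chunks, the last of which has been
evicted from the base pool, might carry a ``chunk_map`` like::

  chunk_map = {
    0    -> { offset: 0,    length: 4096, oid: fp_a,
              flags: HAS_REFERENCE|HAS_FINGERPRINT },
    4096 -> { offset: 4096, length: 4096, oid: fp_b,
              flags: HAS_REFERENCE|HAS_FINGERPRINT },
    8192 -> { offset: 8192, length: 4096, oid: fp_c,
              flags: MISSING|HAS_REFERENCE|HAS_FINGERPRINT },
  }

Reads of the first two extents can be served from the base pool, while
reads of [8192, 12288) must be proxied to ``fp_c`` or promoted first.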

Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.

For manifest operations listed below, we return ``NOOP`` and continue on to
dedicated handling within ``do_osd_ops``.

For redirect objects which haven't been promoted (apparently ``oi.size >
0`` indicates that it's present?), we proxy reads and writes.

For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t`` ``chunk_map``),
we proxy requests to those objects.

RADOS Interface
===============

To set up deduplication, one must provision two pools: one to act as the
base pool and the other to act as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows.

::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

Create objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $LENGTH --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference

Operations:

* ``set-redirect``

  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
  in the ``target_pool``.
  A redirected object will forward all operations from the client to the
  ``target_object``. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.

  Takes a reference to the target as part of the operation; can possibly leak a ref
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  At the top of ``do_osd_ops``, does not set user_modify.

  This operation is not a user mutation and does not trigger a clone to be created.

  There are two purposes of ``set_redirect``:

  1. Redirect all operations to the target object (like a proxy)
  2. Cache when ``tier_promote`` is called (the redirect will be cleared at this time)
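
  A hedged usage sketch from C++ (object and pool names are illustrative;
  error handling omitted)::

    // Make base_pool/foo a redirect to target_pool/foo-target.
    // Assumes base_ioctx and target_ioctx are open IoCtx handles.
    librados::ObjectWriteOperation op;
    op.set_redirect("foo-target", target_ioctx, 0 /* tgt_version */);
    int r = base_ioctx.operate("foo", &op);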

* ``set-chunk``

  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
  ``target_object``. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.
  Returns ``EINVAL`` on an ill-formed parameter buffer.
  Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.

  Takes references to targets as part of the operation; can possibly leak refs
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  This operation is not a user mutation and does not trigger a clone to be created.
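
  A hedged usage sketch mirroring the CLI above; names are illustrative, and
  ``CEPH_OSD_OP_FLAG_WITH_REFERENCE`` comes from Ceph's internal
  ``include/rados.h``::

    // Map [0, 4096) of base_pool/foo onto chunk_pool/foo-chunk at offset 0,
    // taking a reference on the target chunk as part of the operation.
    librados::ObjectWriteOperation op;
    op.set_chunk(0, 4096, chunk_ioctx, "foo-chunk", 0,
                 CEPH_OSD_OP_FLAG_WITH_REFERENCE);
    int r = base_ioctx.operate("foo", &op);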

  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }

* ``evict-chunk``

  Clears an extent from an object, leaving only the manifest link between
  it and the ``target_object``. ::

    void evict_chunk(
      uint64_t offset, uint64_t length, int flag = 0);

    rados -p base_pool evict-chunk <offset> <length> <object>

  Returns ``EINVAL`` if the extent is not present in the manifest.

  Note: this operation does not exist yet.

* ``tier-promote``

  Promotes the object, ensuring that subsequent reads and writes will be local ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ``ENOENT`` if the object does not exist.

  For a redirect manifest, copies data to head.

  TODO: Promote on a redirect object needs to clear the redirect.

  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
  subsequent reads and writes will be served from the base pool.

  Implementation note: for a chunked manifest, calls ``start_copy`` on itself. The
  resulting ``copy_get`` operation will issue reads which will then be redirected by
  the normal manifest read machinery.

  Does not set the ``user_modify`` flag.

  Future work will involve adding support for specifying a ``clone_id``.
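
  A hedged usage sketch (object name illustrative)::

    // Pull any MISSING extents of base_pool/foo back into the base pool so
    // that subsequent reads and writes are served locally.
    librados::ObjectWriteOperation op;
    op.tier_promote();
    int r = base_ioctx.operate("foo", &op);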

* ``unset-manifest``

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

    rados -p base_pool unset-manifest <obj-name>

  Clears manifest chunks or redirect. Lazily releases references; may
  leak.

  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
  and so will trigger a snapshot. Note, this will be true even for a
  redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
  be fixed -- ``unset-manifest`` should not be a ``user_modify``.
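
  A hedged usage sketch (object name illustrative)::

    // Strip the manifest (redirect or chunk_map) from base_pool/foo.
    // Backing references are released lazily and may leak (see above).
    librados::ObjectWriteOperation op;
    op.unset_manifest();
    int r = base_ioctx.operate("foo", &op);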

* ``tier-flush``

  Flush the object which has chunks to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the ``user_modify=false`` ``ignorelist``; does not trigger a clone.

  Does not evict the extents.

ceph-dedup-tool
===============

``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).

* Find an optimal chunk offset

  a. Fixed chunk

  To find a good fixed chunk length, you need to run the following command many
  times while changing the ``chunk_size``. ::

    ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
      --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

  Rabin-Karp is a string-searching algorithm based
  on a rolling hash. A rolling hash alone is not enough for deduplication because
  we don't know the chunk boundaries, so we need content-based slicing on top of
  the rolling hash for content-defined chunking.
  The current implementation uses the simplest approach: look for chunk boundaries
  by inspecting the rolling hash for a pattern (such as the
  lower N bits all being zero).

  Users who want to use deduplication need to find an ideal chunk offset,
  discovering the optimal configuration for their data workload via ``ceph-dedup-tool``.
  This information will then be used for object chunking through
  the ``set-chunk`` API. ::

    ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
      --chunk-algorithm rabin --fingerprint-algorithm rabin

  ``ceph-dedup-tool`` has many options for ``rabin chunk``. ::

    --mod-prime <uint64_t>
    --rabin-prime <uint64_t>
    --pow <uint64_t>
    --chunk-mask-bit <uint32_t>
    --window-size <uint32_t>
    --min-chunk <uint32_t>
    --max-chunk <uint64_t>

  Users should refer to the following equation when using the above options for
  ``rabin chunk``. ::

    rabin_hash =
      (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
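
  For clarity, a minimal C++ sketch of that rolling-hash update (parameter
  names follow the CLI options above; this is not the tool's actual
  implementation)::

    #include <cstdint>

    // Slide the window one byte: fold in new_byte and remove old_byte's
    // contribution, where pow = rabin_prime^(window_size - 1) % mod_prime.
    // Adding mod_prime before subtracting guards against unsigned underflow.
    uint64_t rabin_update(uint64_t rabin_hash, uint8_t new_byte, uint8_t old_byte,
                          uint64_t rabin_prime, uint64_t pow, uint64_t mod_prime) {
      uint64_t h = (rabin_hash * rabin_prime + new_byte) % mod_prime;
      return (h + mod_prime - (old_byte * pow) % mod_prime) % mod_prime;
    }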

  c. Fixed chunk vs. content-defined chunk

  Content-defined chunking may or may not be the optimal solution.
  For example:

  Data chunk ``A`` : ``abcdefgabcdefgabcdefg``

  Consider deduplicating data chunk ``A``. The ideal chunk offset is
  from ``1`` to ``7`` (``abcdefg``); so, with fixed chunking, ``7`` is the
  optimal chunk length. With content-based slicing, however, the optimal
  chunking may not be found (the dedup ratio will not be 100%), because we
  need to find optimal parameters such as the boundary bits, window size,
  and prime value, which is not as easy as picking a fixed chunk length.
  But content-defined chunking is very effective in cases such as the
  following, where a shifted prefix would defeat fixed chunking:

  Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

  Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``

* Fix reference count

  The key idea behind reference counting for dedup is that it is
  false-positive tolerant: ``(manifest object: no ref, chunk object: has ref)``
  can happen, instead of ``(manifest object: has ref, chunk object: no ref)``.
  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL