update source to Ceph Pacific 16.2.2

diff --git a/ceph/doc/dev/deduplication.rst b/ceph/doc/dev/deduplication.rst

index def15955e43798cd5ead5e2fb1b0ba51e4139de4..b715896b6f7b40f231617b798b2196e47cc61c3e 100644
--- a/ceph/doc/dev/deduplication.rst
+++ b/ceph/doc/dev/deduplication.rst
@@ -89,176 +89,63 @@ scheme between replication and erasure coding depending on
  its usage and each pool can be placed in a different storage
  location depending on the required performance.
  
-Manifest Object: 
-Metadata objects are stored in the
-base pool, which contains metadata for data deduplication.
-
-::
-  
-        struct object_manifest_t {
-                enum {
-                        TYPE_NONE = 0,
-                        TYPE_REDIRECT = 1,
-                        TYPE_CHUNKED = 2,
-                };
-                uint8_t type;  // redirect, chunked, ...
-                hobject_t redirect_target;
-                std::map<uint64_t, chunk_info_t> chunk_map;
-        }
-
-
-A chunk Object: 
-Chunk objects are stored in the chunk pool. Chunk object contains chunk data 
-and its reference count information.
-
-
-Although chunk objects and manifest objects have a different purpose 
-from existing objects, they can be handled the same way as 
-original objects. Therefore, to support existing features such as replication,
-no additional operations for dedup are needed.
-
-Usage
-=====
-
-To set up deduplication pools, you must have two pools. One will act as the 
-base pool and the other will act as the chunk pool. The base pool need to be
-configured with fingerprint_algorithm option as follows.
-
-::
-
-  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512 
-  --yes-i-really-mean-it
-
-1. Create objects ::
-
-        - rados -p base_pool put foo ./foo
-
-        - rados -p chunk_pool put foo-chunk ./foo-chunk
-
-2. Make a manifest object ::
-
-        - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool 
-        chunk_pool foo-chunk $START_OFFSET --with-reference
-
-
-Interface
-=========
-
-* set-redirect 
-
-  set redirection between a base_object in the base_pool and a target_object 
-  in the target_pool.
-  A redirected object will forward all operations from the client to the 
-  target_object. ::
-  
-        rados -p base_pool set-redirect <base_object> --target-pool <target_pool> 
-         <target_object>
-
-* set-chunk 
-
-  set chunk-offset in a source_object to make a link between it and a 
-  target_object. ::
-  
-        rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool 
-         <caspool> <target_object> <taget-offset> 
-
-* tier-promote 
-
-  promote the object (including chunks). ::
-
-        rados -p base_pool tier-promote <obj-name> 
-
-* unset-manifest
-
-  unset manifest option from the object that has manifest. ::
-
-        rados -p base_pool unset-manifest <obj-name>
-
-* tier-flush
-
-  flush the object that has chunks to the chunk pool. ::
-
-        rados -p base_pool tier-flush <obj-name>
-
-Dedup tool
-==========
-
-Dedup tool has two features: finding optimal chunk offset for dedup chunking 
-and fixing the reference count.
-
-* find optimal chunk offset
-
-  a. fixed chunk  
-
-    To find out a fixed chunk length, you need to run following command many 
-    times while changing the chunk_size. ::
-
-            ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size  
-              --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
-
-  b. rabin chunk(Rabin-karp algorithm) 
-
-    As you know, Rabin-karp algorithm is string-searching algorithm based
-    on a rolling-hash. But rolling-hash is not enough to do deduplication because 
-    we don't know the chunk boundary. So, we need content-based slicing using 
-    a rolling hash for content-defined chunking.
-    The current implementation uses the simplest approach: look for chunk boundaries 
-    by inspecting the rolling hash for pattern(like the
-    lower N bits are all zeroes). 
-      
-    - Usage
-
-      Users who want to use deduplication need to find an ideal chunk offset.
-      To find out ideal chunk offset, Users should discover
-      the optimal configuration for their data workload via ceph-dedup-tool.
-      And then, this chunking information will be used for object chunking through
-      set-chunk api. ::
-
-              ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size  
-                --chunk-algorithm rabin --fingerprint-algorithm rabin
-
-      ceph-dedup-tool has many options to utilize rabin chunk.
-      These are options for rabin chunk. ::
-
-              --mod-prime <uint64_t>
-              --rabin-prime <uint64_t>
-              --pow <uint64_t>
-              --chunk-mask-bit <uint32_t>
-              --window-size <uint32_t>
-              --min-chunk <uint32_t>
-              --max-chunk <uint64_t>
-
-      Users need to refer following equation to use above options for rabin chunk. ::
-
-              rabin_hash = 
-                (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
-
-  c. Fixed chunk vs content-defined chunk
-
-    Content-defined chunking may or not be optimal solution.
-    For example,
-
-    Data chunk A : abcdefgabcdefgabcdefg
+For usage instructions, see ``osd_internals/manifest.rst``.
+
+Usage Patterns
+==============
+
+The different Ceph interface layers present potentially different opportunities
+and costs for deduplication and tiering in general.
+
+RadosGW
+-------
+
+S3 big data workloads seem like a good opportunity for deduplication.  These
+objects tend to be write-once, read-mostly objects that don't see partial
+overwrites.  As such, it makes sense to fingerprint and dedup up front.
+
+Unlike cephfs and rbd, radosgw has a system for storing
+explicit metadata in the head object of a logical s3 object for
+locating the remaining pieces.  As such, radosgw could use the
+refcounting machinery (``osd_internals/refcount.rst``) directly without
+needing direct support from rados for manifests.
+
+RBD/Cephfs
+----------
+
+RBD and CephFS both use deterministic naming schemes to partition
+block devices/file data over rados objects.  As such, the redirection
+metadata would need to be included as part of rados, presumably
+transparently.
+
+Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
+For those objects, we don't really want to perform dedup, and we don't
+want to pay a write latency penalty in the hot path to do so anyway.
+As such, performing tiering and dedup on cold objects in the background
+is likely to be preferred.
+   
+One important wrinkle, however, is that both rbd and cephfs workloads
+often feature usage of snapshots.  This means that the rados manifest
+support needs robust support for snapshots.
+
+RADOS Machinery
+===============
  
-    Let's think about Data chunk A's deduplication. Ideal chunk offset is
-    from 1 to 7 (abcdefg). So, if we use fixed chunk, 7 is optimal chunk length.
-    But, in the case of content-based slicing, the optimal chunk length
-    could not be found (dedup ratio will not be 100%).
-    Because we need to find optimal parameter such
-    as boundary bit, window size and prime value. This is as easy as fixed chunk.
-    But, content defined chunking is very effective in the following case.
+For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
+For more information on rados refcount support, see ``osd_internals/refcount.rst``.
  
-    Data chunk B : abcdefgabcdefgabcdefg
+Status and Future Work
+======================
  
-    Data chunk C : Tabcdefgabcdefgabcdefg
-      
+At the moment, there exists some preliminary support for manifest
+objects within the OSD as well as a dedup tool.
  
-* fix reference count
-  
-  The key idea behind of reference counting for dedup is false-positive, which means 
-  (manifest object (no ref), chunk object(has ref)) happen instead of 
-  (manifest object (has ref), chunk 1(no ref)).
-  To fix such inconsistency, ceph-dedup-tool supports chunk_scrub. ::
+RadosGW data warehouse workloads probably represent the largest
+opportunity for this feature, so the first priority is probably to add
+direct support for fingerprinting and redirects into the refcount pool
+to radosgw.
  
-          ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
+Aside from radosgw, completing work on manifest object support in the
+OSD, particularly as it relates to snapshots, would be the next step for
+rbd and cephfs workloads.