Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and processing of the
original data.
In a typical deduplication system, the input source is treated as a data
object that is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with
the chunks already stored in the underlying storage.
To this end, the deduplication system employs a fingerprint index that stores
the hash value of each stored chunk, so that existing chunks can be found by
comparing hash values rather than scanning all of the data that resides in
the underlying storage.
There are many challenges in implementing deduplication on top
of Ceph. Among them, two issues are essential.
The first is managing the scalability of the fingerprint index; the second is
ensuring compatibility between the newly added deduplication metadata and the
existing metadata.
1. Content hashing (double hashing): Each client can find the object data
   for an object ID using CRUSH. With CRUSH, the client knows the object's
   location in the base tier.
   By hashing the object's content at the base tier, a new OID (chunk ID) is
   generated. The chunk tier stores the partial content of the original
   object under that new OID. ::

     Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
     CRUSH(K) -> chunk's location

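   As an illustration only (hypothetical object names; in the actual system
   the base tier computes the fingerprint), the idea is that the fingerprint
   of a chunk's content becomes the OID under which the chunk is stored in
   the chunk pool. ::

     # compute HASH(content) = K for a chunk of the original object
     FP=$(sha1sum ./foo-chunk | awk '{print $1}')
     # store the chunk under OID = K; CRUSH(K) then gives the chunk's location
     rados -p chunk_pool put $FP ./foo-chunk
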
2. Self-contained object: A design based on external metadata
   makes integration with existing storage features difficult,
   because those features cannot recognize the
   additional external data structures. If the data
   deduplication system is designed without any external component, the
   original storage features can be reused as-is.

More details are available at https://ieeexplore.ieee.org/document/8416369

[Architecture diagram: client I/O goes through the Objecter, transparently to
Ceph; the base pool holds the dedup metadata (chunk offsets and fingerprints)
and the chunk pool holds the chunk data.]

Pool-based object management:

The base (metadata) pool stores metadata objects and the chunk pool stores
chunk objects. Since the two pools are divided by purpose and usage, each
pool can be managed more efficiently according to its own characteristics.
The base pool and the chunk pool can independently select a redundancy
scheme (replication or erasure coding) depending on their usage, and each
pool can be placed on different storage devices depending on the required
performance.

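For example, a replicated base pool and an erasure-coded chunk pool could be
created as follows (pool names and PG counts are only illustrative). ::

  ceph osd pool create base_pool 64
  ceph osd pool create chunk_pool 64 64 erasure
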
Metadata objects are stored in the base pool and contain the metadata for
data deduplication, represented by the following manifest structure. ::

  struct object_manifest_t {
    uint8_t type;                                // redirect, chunked, ...
    hobject_t redirect_target;                   // target object if redirected
    std::map<uint64_t, chunk_info_t> chunk_map;  // chunk offset -> chunk info
  };

Chunk objects are stored in the chunk pool. A chunk object contains the chunk
data and its reference count information.

Although chunk objects and manifest objects serve a different purpose than
existing objects, they can be handled in the same way as the original
objects. Therefore, no additional dedup-specific operations are needed to
support existing features such as replication.

To set up deduplication you need two pools: one acts as the base pool and the
other acts as the chunk pool. The base pool needs to be configured with the
fingerprint_algorithm option as follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

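For example, assuming a base pool named base_pool that should use SHA-1
fingerprints. ::

  ceph osd pool set base_pool fingerprint_algorithm sha1 --yes-i-really-mean-it
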
1. Create objects ::

   - rados -p base_pool put foo ./foo
   - rados -p chunk_pool put foo-chunk ./foo-chunk

2. Make a manifest object ::

   - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
     chunk_pool foo-chunk $START_OFFSET --with-reference

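   For example (hypothetical offsets, assuming foo-chunk holds the first
   4096 bytes of foo). ::

     rados -p base_pool set-chunk foo 0 4096 --target-pool chunk_pool foo-chunk 0 --with-reference
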
* set-redirect

  Set up redirection between a base_object in the base_pool and a
  target_object in a target pool.
  A redirected object forwards all operations from the client to the
  target_object. ::

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

* set-chunk

  Set the chunk offset in a source_object to make a link between it and a
  target_object. ::

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

* tier-promote

  Promote the object (including chunks). ::

    rados -p base_pool tier-promote <obj-name>

* unset-manifest

  Unset the manifest option on an object that has a manifest. ::

    rados -p base_pool unset-manifest <obj-name>

* tier-flush

  Flush an object that has chunks to the chunk pool. ::

    rados -p base_pool tier-flush <obj-name>

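For example (a sketch with a hypothetical object name), an object can be
promoted into the base pool and its chunks later flushed back to the chunk
pool. ::

  rados -p base_pool tier-promote foo
  rados -p base_pool tier-flush foo
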
The dedup tool (ceph-dedup-tool) has two features: finding an optimal chunk
offset for dedup chunking and fixing the reference count.

* find optimal chunk offset

a. fixed chunk

To find a suitable fixed chunk length, you need to run the following command
several times while changing the chunk_size. ::

  ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
    --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

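For example (hypothetical pool name and chunk size), to estimate the dedup
ratio with 16 KiB fixed chunks and SHA-1 fingerprints. ::

  ceph-dedup-tool --op estimate --pool rbd --chunk-size 16384
    --chunk-algorithm fixed --fingerprint-algorithm sha1
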
b. rabin chunk (Rabin-Karp algorithm)

The Rabin-Karp algorithm is a string-searching algorithm based on a rolling
hash, but a rolling hash by itself is not enough for deduplication because we
do not know where the chunk boundaries are. So we need content-based slicing
using a rolling hash for content-defined chunking.
The current implementation uses the simplest approach: it looks for chunk
boundaries by inspecting the rolling hash for a pattern (such as the lower N
bits all being zero).

Users who want to use deduplication need to find an ideal chunk offset. To do
so, users should discover the optimal configuration for their data workload
via ceph-dedup-tool; this chunking information is then used for object
chunking through the set-chunk API. ::

  ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
    --chunk-algorithm rabin --fingerprint-algorithm rabin

ceph-dedup-tool provides several options for rabin chunking. ::

  --mod-prime <uint64_t>
  --rabin-prime <uint64_t>
  --pow <uint64_t>
  --chunk-mask-bit <uint32_t>
  --window-size <uint32_t>
  --min-chunk <uint32_t>
  --max-chunk <uint64_t>

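For example (hypothetical pool name and parameter values), a rabin estimate
that bounds the chunk size and sets the boundary mask explicitly might look
like this. ::

  ceph-dedup-tool --op estimate --pool rbd --chunk-algorithm rabin --fingerprint-algorithm rabin
    --min-chunk 8192 --max-chunk 65536 --window-size 48 --chunk-mask-bit 14
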
Users should refer to the following equation when using the above options for
rabin chunking. ::

  rabin_hash = (rabin_hash * rabin_prime + new_byte - old_byte * pow) % mod_prime

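Below is a minimal C++ sketch (not the actual ceph-dedup-tool implementation;
the parameter values are only illustrative) of how such a rolling hash can be
maintained and used to detect chunk boundaries. ::

  #include <cstdint>

  // Illustrative values only; ceph-dedup-tool exposes these knobs as
  // --rabin-prime, --mod-prime, --pow, --window-size and --chunk-mask-bit.
  constexpr uint64_t RABIN_PRIME = 3;
  constexpr uint64_t MOD_PRIME   = 1000000007ULL;
  constexpr uint32_t WINDOW_SIZE = 48;
  constexpr uint32_t MASK_BITS   = 14;  // boundary when the lower 14 bits are zero

  // Weight of the byte leaving the window under the update order above,
  // i.e. pow = RABIN_PRIME^WINDOW_SIZE mod MOD_PRIME.
  uint64_t pow_term() {
    uint64_t p = 1;
    for (uint32_t i = 0; i < WINDOW_SIZE; ++i)
      p = (p * RABIN_PRIME) % MOD_PRIME;
    return p;
  }

  // One rolling-hash step: mix in new_byte entering the window and drop
  // old_byte leaving it (MOD_PRIME is added before subtracting to avoid
  // unsigned underflow).
  uint64_t roll(uint64_t rabin_hash, uint8_t new_byte, uint8_t old_byte, uint64_t pow) {
    uint64_t drop = (old_byte * pow) % MOD_PRIME;
    return (rabin_hash * RABIN_PRIME + new_byte + MOD_PRIME - drop) % MOD_PRIME;
  }

  // Content-defined chunking declares a boundary wherever the lower
  // MASK_BITS bits of the rolling hash are all zero.
  bool is_boundary(uint64_t rabin_hash) {
    return (rabin_hash & ((1u << MASK_BITS) - 1)) == 0;
  }
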
c. Fixed chunk vs content-defined chunk

Content-defined chunking may or may not be the optimal solution.
For example ::

  Data chunk A : abcdefgabcdefgabcdefg

Consider Data chunk A's deduplication. The ideal chunk offset is from 1 to 7
(abcdefg), so with fixed chunking, 7 is the optimal chunk length. In the case
of content-based slicing, however, the optimal chunk length might not be
found (the dedup ratio will not be 100%), because we first need to find
suitable parameters such as the boundary bits, window size and prime value,
which is not as easy as choosing a fixed chunk size.
Content-defined chunking is, however, very effective in the following case. ::

  Data chunk B : abcdefgabcdefgabcdefg

  Data chunk C : Tabcdefgabcdefgabcdefg

Data chunk C is Data chunk B with a single byte (T) inserted at the front.
With fixed chunking, every chunk of C is shifted by one byte and no longer
matches the chunks of B, whereas content-defined chunking re-aligns on the
same content-derived boundaries, so the repeated abcdefg data can still be
deduplicated.

* fix reference count

Reference counting for dedup is designed to err on the side of false
positives: an inconsistency of the form (manifest object has no reference,
chunk object still has a reference) may occur, but not the reverse (manifest
object has a reference, chunk object has none), since the latter could allow
a chunk to be removed while it is still referenced.
To fix such false-positive inconsistencies, ceph-dedup-tool supports
chunk_scrub. ::

  ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL