===============
 Deduplication
===============


Introduction
============

Applying data deduplication to an existing software stack is not easy
because of the additional metadata management and the processing
required on the original data.

In a typical deduplication system, an input data object is split into
multiple chunks by a chunking algorithm. The deduplication system then
compares each chunk with the data chunks that are already stored.
To this end, the system maintains a fingerprint index that stores the
hash value of each chunk, so that existing chunks can be found by
comparing hash values rather than searching through all of the content
that resides in the underlying storage.

There are many challenges to implementing deduplication on top of
Ceph. Two of them are essential: the first is managing the scalability
of the fingerprint index; the second is ensuring compatibility between
the newly introduced deduplication metadata and the existing metadata.

Key Idea
========
1. Content hashing (Double hashing): Each client can find object data
   for an object ID using CRUSH. With CRUSH, a client knows the object's
   location in the base tier.
   By hashing the object's content at the base tier, a new OID (chunk ID)
   is generated. The chunk tier stores, under this new OID, a chunk that
   holds part of the original object's content (see the command-line
   sketch after this list).

   Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
   CRUSH(K) -> chunk's location


2. Self-contained object: An external metadata design makes integration
   with existing storage features difficult, because those features cannot
   recognize the additional external data structures. If the data
   deduplication system is designed without any external component, the
   original storage features can be reused as-is.

More details can be found at https://ieeexplore.ieee.org/document/8416369

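A minimal command-line sketch of the double-hashing flow described in
item 1 above, assuming a base pool named ``base_pool``, a chunk pool
named ``chunk_pool``, an object ``obj1``, and SHA-1 as the fingerprint
algorithm (all of these names are illustrative assumptions, not fixed
names):

.. code:: bash

   # First hash: CRUSH maps the object name to its location in the base tier.
   ceph osd map base_pool obj1

   # Second hash: the object's *content* yields the fingerprint K, which
   # becomes the chunk object's ID in the chunk tier.
   rados -p base_pool get obj1 /tmp/obj1.bin
   K=$(sha1sum /tmp/obj1.bin | awk '{print $1}')

   # CRUSH(K) then gives the chunk's location in the chunk pool.
   ceph osd map chunk_pool "$K"
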
Design
======

.. ditaa::

            +-------------+
            | Ceph Client |
            +------+------+
                   ^
      Tiering is   |
     Transparent   |              Metadata
       to Ceph     |           +---------------+
      Client Ops   |           |               |
                   |    +----->+   Base Pool   |
                   |    |      |               |
                   |    |      +-----+---+-----+
                   |    |            |   ^
                   v    v            |   |  Dedup metadata in Base Pool
            +------+----+--+         |   |  (Dedup metadata contains chunk offsets
            |   Objecter   |         |   |   and fingerprints)
            +-----------+--+         |   |
                        ^            |   |  Data in Chunk Pool
                        |            v   |
                        |      +-----+---+-----+
                        |      |               |
                        +----->|   Chunk Pool  |
                               |               |
                               +---------------+
                                     Data


Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided based on
purpose and usage, each pool can be managed more efficiently
according to its own characteristics. The base pool and the
chunk pool can independently select a redundancy scheme, either
replication or erasure coding, depending on their usage, and
each pool can be placed on different storage media depending
on the required performance.

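For example, the two pools can be given different redundancy schemes
when they are created. The following is only an illustrative sketch;
the pool names, placement-group counts, and erasure-code profile are
arbitrary:

.. code:: bash

   # Replicated base (metadata) pool, e.g. on fast media.
   ceph osd pool create base_pool 64 64 replicated

   # Erasure-coded chunk pool for the deduplicated data.
   ceph osd erasure-code-profile set dedup_profile k=4 m=2
   ceph osd pool create chunk_pool 64 64 erasure dedup_profile
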
For details on how this machinery is used, see ``osd_internals/manifest.rst``.

Usage Patterns
==============

Each Ceph interface layer presents unique opportunities and costs for
deduplication and tiering in general.

RadosGW
-------

S3 big data workloads seem like a good opportunity for deduplication. These
objects tend to be write once, read mostly objects which don't see partial
overwrites. As such, it makes sense to fingerprint and dedup up front.

Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.

RBD/CephFS
----------

RBD and CephFS both use deterministic naming schemes to partition
block devices/file data over rados objects. As such, the redirection
metadata would need to be included as part of rados, presumably
transparently.

Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
For those objects, we don't really want to perform dedup, and we don't
want to pay a write latency penalty in the hot path to do so anyway.
As such, performing tiering and dedup on cold objects in the background
is likely to be preferred.

One important wrinkle, however, is that both rbd and cephfs workloads
often feature usage of snapshots. This means that the rados manifest
support needs robust support for snapshots.

RADOS Machinery
===============

For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.

Status and Future Work
======================

At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.

RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
direct support for fingerprinting and redirects into the refcount pool
to radosgw.

Aside from radosgw, completing work on manifest object support in the
OSD, particularly as it relates to snapshots, would be the next step for
rbd and cephfs workloads.

How to use deduplication
========================

* This feature is highly experimental and is subject to change or removal.

Ceph provides deduplication using RADOS machinery.
Below we explain how to perform deduplication.


1. Estimate the space-saving ratio of a target pool using ``ceph-dedup-tool``.

.. code:: bash

   ceph-dedup-tool --op estimate --pool POOL --chunk-size CHUNK_SIZE
     --chunk-algorithm fixed|fastcdc --fingerprint-algorithm sha1|sha256|sha512
     --max-thread THREAD_COUNT

This command shows how much storage space can be saved when deduplication
is applied to the pool. If the amount of saved space is higher than expected,
the pool is probably worth deduplicating.
Users should specify ``POOL``, the pool in which the objects to be
deduplicated are stored. Users also need to run ``ceph-dedup-tool`` multiple
times with varying ``CHUNK_SIZE`` to find the optimal chunk size. Note that the
optimal value probably differs depending on the content of each object when the
``fastcdc`` chunk algorithm is used (not ``fixed``). Example output:

::

   {
     "chunk_algo": "fastcdc",
     "chunk_sizes": [
       {
         "target_chunk_size": 8192,
         "dedup_bytes_ratio": 0.4897049,
         "dedup_object_ratio": 34.567315,
         "chunk_size_average": 64439,
         "chunk_size_stddev": 33620
       }
     ],
     "summary": {
       "examined_objects": 95,
       "examined_bytes": 214968649
     }
   }

The above is example output from ``estimate``. ``target_chunk_size`` is the
``CHUNK_SIZE`` given by the user. ``dedup_bytes_ratio`` shows how many bytes are
redundant among the examined bytes; for instance, 1 - ``dedup_bytes_ratio`` is the
fraction of storage space that can be saved. ``dedup_object_ratio`` is the number of
generated chunk objects divided by ``examined_objects``. ``chunk_size_average``
is the average chunk size produced when performing CDC---this may differ from
``target_chunk_size`` because CDC generates different chunk boundaries depending
on the content. ``chunk_size_stddev`` is the standard deviation of the chunk size.

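Because the optimal chunk size is content dependent, one practical approach is
to sweep several candidate chunk sizes and compare the reported ratios. A rough
sketch (the pool name, sizes, and thread count below are placeholders):

.. code:: bash

   # Sweep several target chunk sizes and compare dedup_bytes_ratio.
   for CS in 8192 16384 32768 65536; do
       echo "== chunk size ${CS} =="
       ceph-dedup-tool --op estimate --pool mypool --chunk-size "${CS}" \
           --chunk-algorithm fastcdc --fingerprint-algorithm sha1 \
           --max-thread 4
   done
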

2. Create a chunk pool.

.. code:: bash

   ceph osd pool create CHUNK_POOL


3. Run the dedup command (there are two ways).

.. code:: bash

   ceph-dedup-tool --op sample-dedup --pool POOL --chunk-pool CHUNK_POOL --chunk-size
     CHUNK_SIZE --chunk-algorithm fastcdc --fingerprint-algorithm sha1|sha256|sha512
     --chunk-dedup-threshold THRESHOLD --max-thread THREAD_COUNT --sampling-ratio SAMPLE_RATIO
     --wakeup-period WAKEUP_PERIOD --loop --snap

The ``sample-dedup`` command spawns ``THREAD_COUNT`` threads to deduplicate objects in
``POOL``. The threads sample objects according to ``SAMPLE_RATIO`` (a value of 100 means a
full search) and deduplicate a chunk only if it is found to be redundant more than
``THRESHOLD`` times during the iteration.
If ``--loop`` is set, the threads wake up again after ``WAKEUP_PERIOD``; otherwise they exit
after one iteration.

.. code:: bash

   ceph-dedup-tool --op object-dedup --pool POOL --object OID --chunk-pool CHUNK_POOL
     --fingerprint-algorithm sha1|sha256|sha512 --dedup-cdc-chunk-size CHUNK_SIZE

The ``object-dedup`` command triggers deduplication on the RADOS object specified by ``OID``.
All parameters shown above must be specified. ``CHUNK_SIZE`` should be taken from
the results of step 1 above.
Note that when this command is executed, ``fastcdc`` is used by default, and other parameters
such as the fingerprint algorithm and ``CHUNK_SIZE`` are set as defaults for the pool.
Deduplicated objects appear in the chunk pool. If the object is mutated over time, the user
needs to re-run ``object-dedup`` because the chunk boundaries must be recalculated based on
the updated content.
The user needs to specify ``--snap`` if the target object is snapshotted. After deduplication
is done, the target object in ``POOL`` has size zero (it is evicted) and the generated chunk
objects appear in ``CHUNK_POOL``.

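To check the result, the chunk pool and the original object can be inspected
with ordinary RADOS commands; ``POOL``, ``CHUNK_POOL``, and ``OID`` below are the
placeholders used above. As described in the Key Idea section, chunk objects are
named by their content fingerprint:

.. code:: bash

   # Chunk objects (named by fingerprint) live in the chunk pool.
   rados -p CHUNK_POOL ls

   # The deduplicated (evicted) object in the base pool now reports size 0.
   rados -p POOL stat OID
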

4. Read/write I/Os

After step 3, users don't need to do anything special for I/O. Deduplicated objects are
fully compatible with existing RADOS operations.

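For example, a deduplicated object can still be read with a normal RADOS
operation; the chunks are fetched from the chunk pool transparently (pool and
object names are the placeholders used above):

.. code:: bash

   # A normal read against the base pool works as before; the chunks in
   # CHUNK_POOL are resolved behind the scenes.
   rados -p POOL get OID /tmp/restored_object
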

5. Run scrub to fix reference counts.

Reference mismatches can occur on rare occasions due to false positives when handling
reference counts for deduplicated RADOS objects. These mismatches can be fixed by
periodically scrubbing the pool:

.. code:: bash

   ceph-dedup-tool --op chunk-scrub --chunk-pool CHUNK_POOL --pool POOL --max-thread THREAD_COUNT