===============
 Deduplication
===============


Introduction
============

Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and changes to the
original data processing procedure.

In a typical deduplication system, an input data object is split into
multiple chunks by a chunking algorithm. The deduplication system then
compares each chunk with the data chunks already stored in the
underlying storage. To this end, a fingerprint index that stores the
hash value of each chunk is employed, so that existing chunks can be
found by comparing hash values rather than searching all of the stored
contents.

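As a rough illustration of this flow, the sketch below uses fixed-size
chunking and SHA-256 fingerprints, with an in-memory dictionary standing in
for the fingerprint index; a real deduplication system distributes the index
and may use a different chunking algorithm.

.. code-block:: python

   # Minimal sketch of chunking plus a fingerprint index (illustration only).
   # Fixed-size chunking and an in-memory dict stand in for the real system's
   # chunking algorithm and distributed fingerprint index.
   import hashlib

   CHUNK_SIZE = 4 * 1024 * 1024  # example chunk size

   def chunk(data: bytes, size: int = CHUNK_SIZE):
       """Split an object's data into fixed-size chunks."""
       return [data[i:i + size] for i in range(0, len(data), size)]

   fingerprint_index = {}  # fingerprint -> chunk data (or chunk location)

   def dedup_store(data: bytes):
       """Store only chunks whose fingerprint has not been seen before."""
       refs = []
       for c in chunk(data):
           fp = hashlib.sha256(c).hexdigest()
           if fp not in fingerprint_index:   # new content: store the chunk
               fingerprint_index[fp] = c
           refs.append(fp)                   # known content: just reference it
       return refs                           # list of fingerprints for the object
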
There are many challenges to implementing deduplication on top of
Ceph. Among them, two issues are essential. The first is managing the
scalability of the fingerprint index; the second is ensuring
compatibility between the newly introduced deduplication metadata and
the existing metadata.

Key Idea
========
1. Content hashing (Double hashing): Each client can find an object's data
for an object ID using CRUSH. With CRUSH, a client knows the object's
location in the Base tier.
By hashing the object's content at the Base tier, a new OID (chunk ID) is
generated. The Chunk tier stores the partial content of the original object
under that new OID (a minimal sketch follows at the end of this section).

Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
CRUSH(K) -> chunk's location


2. Self-contained object: An external metadata design makes integration
with existing storage features difficult, since those features cannot
recognize the additional external data structures. If the data
deduplication system is designed without any external component, the
original storage features can be reused.

More details in https://ieeexplore.ieee.org/document/8416369

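The following is a minimal sketch of the double-hashing idea in item 1
above: CRUSH locates the object in the Base tier by its OID, and a hash of
the object's content produces the chunk ID, which CRUSH in turn maps to a
location in the Chunk tier. The ``crush_locate()`` helper here is a
stand-in for CRUSH placement, not a Ceph API.

.. code-block:: python

   # Illustrative sketch of double hashing; crush_locate() is a stand-in for
   # CRUSH placement and is not a real Ceph API.
   import hashlib

   def crush_locate(oid: str, pool: str) -> str:
       """Stand-in for CRUSH: deterministically map an OID to a placement."""
       h = hashlib.md5(f"{pool}/{oid}".encode()).hexdigest()
       return f"{pool}:pg.{int(h, 16) % 128}"

   def locate_chunk(object_id: str, content: bytes):
       # First hash: CRUSH maps the object ID to its location in the Base tier.
       base_location = crush_locate(object_id, "base_pool")
       # Second hash: the content fingerprint becomes the chunk's OID (chunk ID).
       chunk_oid = hashlib.sha256(content).hexdigest()
       # CRUSH then maps the chunk OID to its location in the Chunk tier.
       chunk_location = crush_locate(chunk_oid, "chunk_pool")
       return base_location, chunk_oid, chunk_location
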
Design
======

.. ditaa::

           +-------------+
           | Ceph Client |
           +------+------+
                  ^
     Tiering is   |
    Transparent   | Metadata
      to Ceph     |           +---------------+
     Client Ops   |           |               |
                  |    +----->+   Base Pool   |
                  |    |      |               |
                  |    |      +-----+---+-----+
                  |    |            |   ^
                  v    v            |   |   Dedup metadata in Base Pool
           +------+----+--+         |   |   (Dedup metadata contains chunk offsets
           |   Objecter   |         |   |    and fingerprints)
           +-----------+--+         |   |
                       ^            |   |   Data in Chunk Pool
                       |            v   |
                       |      +-----+---+-----+
                       |      |               |
                       +----->|   Chunk Pool  |
                              |               |
                              +---------------+
                                    Data


Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since the two pools are divided by purpose and usage,
each can be managed more efficiently according to its own
characteristics. The base pool and the chunk pool can each select a
redundancy scheme (replication or erasure coding) appropriate to its
usage, and each pool can be placed on different storage depending on
the required performance.

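As a rough illustration of the two-pool split, the sketch below uses the
``rados`` Python binding to store a chunk (named by its fingerprint) in a
chunk pool and a manifest-like metadata object in the base pool. The pool
names are assumptions, and real deduplication metadata is maintained by the
OSD as described in ``osd_internals/manifest.rst`` rather than written by
hand like this.

.. code-block:: python

   # Rough illustration only: pool names are assumptions, and real dedup
   # metadata is maintained by the OSD, not written by hand like this.
   import hashlib, json
   import rados

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()

   base = cluster.open_ioctx('base_pool')     # metadata objects
   chunks = cluster.open_ioctx('chunk_pool')  # chunk objects

   data = b'example object contents'
   fp = hashlib.sha256(data).hexdigest()

   # Chunk data lives in the chunk pool, keyed by its fingerprint.
   chunks.write_full(fp, data)

   # A manifest-like record in the base pool points at the chunk.
   manifest = {'chunks': [{'offset': 0, 'length': len(data), 'fingerprint': fp}]}
   base.write_full('example-object', json.dumps(manifest).encode())

   base.close()
   chunks.close()
   cluster.shutdown()
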
For details on how to use this, see ``osd_internals/manifest.rst``.

Usage Patterns
==============

The different Ceph interface layers present potentially different
opportunities and costs for deduplication and tiering in general.

RadosGW
-------

S3 big data workloads seem like a good opportunity for deduplication. These
objects tend to be write-once, read-mostly and don't see partial
overwrites. As such, it makes sense to fingerprint and dedup them up front.

Unlike cephfs and rbd, radosgw has a system for storing explicit
metadata in the head object of a logical s3 object for locating the
remaining pieces. As such, radosgw could use the refcounting machinery
(``osd_internals/refcount.rst``) directly, without needing direct
support from rados for manifests.

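The refcounting idea can be sketched as follows: every logical S3 object
that references a chunk takes a reference on it, and the chunk is removed
only when the last reference is dropped. The toy in-memory version below
only mirrors the semantics of the object-class machinery described in
``osd_internals/refcount.rst``.

.. code-block:: python

   # Toy sketch of refcounted chunks; the real machinery is an OSD object
   # class (see osd_internals/refcount.rst), not an in-memory dict.
   chunk_store = {}   # fingerprint -> chunk data
   refcounts = {}     # fingerprint -> number of logical objects referencing it

   def refcount_get(fp: str, data: bytes):
       """Take a reference on a chunk, storing it if it is new."""
       if fp not in chunk_store:
           chunk_store[fp] = data
       refcounts[fp] = refcounts.get(fp, 0) + 1

   def refcount_put(fp: str):
       """Drop a reference; remove the chunk when no references remain."""
       refcounts[fp] -= 1
       if refcounts[fp] == 0:
           del refcounts[fp]
           del chunk_store[fp]
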
RBD/Cephfs
----------

RBD and CephFS both use deterministic naming schemes to partition
block devices/file data over rados objects. As such, the redirection
metadata would need to be included as part of rados, presumably
transparently.

Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
For those objects, we don't really want to perform dedup, and we don't
want to pay a write latency penalty in the hot path to do so anyway.
As such, performing tiering and dedup on cold objects in the background
is likely to be preferred.

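A background pass over cold objects might look roughly like the sketch
below, which uses the ``rados`` Python binding to skip recently modified
objects and hand the rest to a deduplication step; the pool name, age
threshold, and ``dedup_object()`` helper are placeholders rather than Ceph
APIs.

.. code-block:: python

   # Rough sketch of a background pass that dedups only cold objects.
   # The pool name, age threshold, and dedup_object() helper are placeholders.
   import time
   import rados

   COLD_AGE = 24 * 3600  # only touch objects idle for a day (example threshold)

   def dedup_object(ioctx, name):
       """Placeholder for chunking, fingerprinting and tiering one object."""
       pass

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   ioctx = cluster.open_ioctx('rbd')  # example pool

   now = time.time()
   for obj in ioctx.list_objects():
       size, mtime = ioctx.stat(obj.key)          # mtime is a time.struct_time
       if now - time.mktime(mtime) > COLD_AGE:    # cold object: dedup in background
           dedup_object(ioctx, obj.key)

   ioctx.close()
   cluster.shutdown()
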
One important wrinkle, however, is that both rbd and cephfs workloads
often make use of snapshots. This means that the rados manifest
support needs robust support for snapshots.

RADOS Machinery
===============

For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.

Status and Future Work
======================

At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.

RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
direct support to radosgw for fingerprinting and for redirecting chunks
into the refcount pool.

Aside from radosgw, completing work on manifest object support in the
OSD, particularly as it relates to snapshots, would be the next step for
rbd and cephfs workloads.