Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and processing of the
original data.
In a typical deduplication system, the input source is treated as a data
object that is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with
the chunks already stored in the underlying storage.
To this end, the deduplication system employs a fingerprint index that stores
the hash value of each stored chunk, so that existing chunks can be found by
comparing hash values rather than scanning all of the data that resides in
the underlying storage.
There are many challenges in implementing deduplication on top
of Ceph. Among them, two issues are essential.
The first is managing the scalability of the fingerprint index; the second is
ensuring compatibility between the newly added deduplication metadata and the
existing metadata.
1. Content hashing (double hashing): Each client can find the object data
   for an object ID using CRUSH. With CRUSH, the client knows the object's
   location in the base tier.
   By hashing the object's content at the base tier, a new OID (chunk ID) is
   generated. The chunk tier stores the partial content of the original
   object under that new OID. ::

     Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
     CRUSH(K) -> chunk's location

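   As an illustration only (hypothetical object names; in the actual system
   the base tier computes the fingerprint), the idea is that the fingerprint
   of a chunk's content becomes the OID under which the chunk is stored in
   the chunk pool. ::

     # compute HASH(content) = K for a chunk of the original object
     FP=$(sha1sum ./foo-chunk | awk '{print $1}')
     # store the chunk under OID = K; CRUSH(K) then gives the chunk's location
     rados -p chunk_pool put $FP ./foo-chunk
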
2. Self-contained object: A design based on external metadata
   makes integration with existing storage features difficult,
   because those features cannot recognize the
   additional external data structures. If the data
   deduplication system is designed without any external component, the
   original storage features can be reused as-is.

More details are available at https://ieeexplore.ieee.org/document/8416369

[Architecture diagram: client I/O goes through the Objecter, transparently to
Ceph; the base pool holds the dedup metadata (chunk offsets and fingerprints)
and the chunk pool holds the chunk data.]

Pool-based object management:

The base (metadata) pool stores metadata objects and the chunk pool stores
chunk objects. Since the two pools are divided by purpose and usage, each
pool can be managed more efficiently according to its own characteristics.
The base pool and the chunk pool can independently select a redundancy
scheme (replication or erasure coding) depending on their usage, and each
pool can be placed on different storage devices depending on the required
performance.

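For example, a replicated base pool and an erasure-coded chunk pool could be
created as follows (pool names and PG counts are only illustrative). ::

  ceph osd pool create base_pool 64
  ceph osd pool create chunk_pool 64 64 erasure
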
Metadata objects are stored in the base pool and contain the metadata for
data deduplication, represented by the following manifest structure. ::

  struct object_manifest_t {
    uint8_t type;                                // redirect, chunked, ...
    hobject_t redirect_target;                   // target object if redirected
    std::map<uint64_t, chunk_info_t> chunk_map;  // chunk offset -> chunk info
  };

Chunk objects are stored in the chunk pool. A chunk object contains the chunk
data and its reference count information.

Although chunk objects and manifest objects serve a different purpose than
existing objects, they can be handled in the same way as the original
objects. Therefore, no additional dedup-specific operations are needed to
support existing features such as replication.

To set up deduplication you need two pools: one acts as the base pool and the
other acts as the chunk pool. The base pool needs to be configured with the
fingerprint_algorithm option as follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

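For example, assuming a base pool named base_pool that should use SHA-1
fingerprints. ::

  ceph osd pool set base_pool fingerprint_algorithm sha1 --yes-i-really-mean-it
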
1. Create objects ::

   - rados -p base_pool put foo ./foo
   - rados -p chunk_pool put foo-chunk ./foo-chunk

2. Make a manifest object ::

   - rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
     chunk_pool foo-chunk $START_OFFSET --with-reference

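   For example (hypothetical offsets, assuming foo-chunk holds the first
   4096 bytes of foo). ::

     rados -p base_pool set-chunk foo 0 4096 --target-pool chunk_pool foo-chunk 0 --with-reference
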
* set-redirect

  Set up redirection between a base_object in the base_pool and a
  target_object in a target pool.
  A redirected object forwards all operations from the client to the
  target_object. ::

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

* set-chunk

  Set the chunk offset in a source_object to make a link between it and a
  target_object. ::

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

* tier-promote

  Promote the object (including chunks). ::

    rados -p base_pool tier-promote <obj-name>

* unset-manifest

  Unset the manifest option on an object that has a manifest. ::

    rados -p base_pool unset-manifest <obj-name>

* tier-flush

  Flush an object that has chunks to the chunk pool. ::

    rados -p base_pool tier-flush <obj-name>

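For example (a sketch with a hypothetical object name), an object can be
promoted into the base pool and its chunks later flushed back to the chunk
pool. ::

  rados -p base_pool tier-promote foo
  rados -p base_pool tier-flush foo
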
The dedup tool (ceph-dedup-tool) has two features: finding an optimal chunk
offset for dedup chunking and fixing the reference count.

* find optimal chunk offset

a. fixed chunk

To find a suitable fixed chunk length, you need to run the following command
several times while changing the chunk_size. ::

  ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
    --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

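For example (hypothetical pool name and chunk size), to estimate the dedup
ratio with 16 KiB fixed chunks and SHA-1 fingerprints. ::

  ceph-dedup-tool --op estimate --pool rbd --chunk-size 16384
    --chunk-algorithm fixed --fingerprint-algorithm sha1
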
b. rabin chunk (Rabin-Karp algorithm)

The Rabin-Karp algorithm is a string-searching algorithm based on a rolling
hash, but a rolling hash by itself is not enough for deduplication because we
do not know where the chunk boundaries are. So we need content-based slicing
using a rolling hash for content-defined chunking.
The current implementation uses the simplest approach: it looks for chunk
boundaries by inspecting the rolling hash for a pattern (such as the lower N
bits all being zero).

Users who want to use deduplication need to find an ideal chunk offset. To do
so, users should discover the optimal configuration for their data workload
via ceph-dedup-tool; this chunking information is then used for object
chunking through the set-chunk API. ::

  ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
    --chunk-algorithm rabin --fingerprint-algorithm rabin

ceph-dedup-tool provides several options for rabin chunking. ::

  --mod-prime <uint64_t>
  --rabin-prime <uint64_t>
  --pow <uint64_t>
  --chunk-mask-bit <uint32_t>
  --window-size <uint32_t>
  --min-chunk <uint32_t>
  --max-chunk <uint64_t>

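For example (hypothetical pool name and parameter values), a rabin estimate
that bounds the chunk size and sets the boundary mask explicitly might look
like this. ::

  ceph-dedup-tool --op estimate --pool rbd --chunk-algorithm rabin --fingerprint-algorithm rabin
    --min-chunk 8192 --max-chunk 65536 --window-size 48 --chunk-mask-bit 14
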
Users should refer to the following equation when using the above options for
rabin chunking. ::

  rabin_hash = (rabin_hash * rabin_prime + new_byte - old_byte * pow) % mod_prime

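Below is a minimal C++ sketch (not the actual ceph-dedup-tool implementation;
the parameter values are only illustrative) of how such a rolling hash can be
maintained and used to detect chunk boundaries. ::

  #include <cstdint>

  // Illustrative values only; ceph-dedup-tool exposes these knobs as
  // --rabin-prime, --mod-prime, --pow, --window-size and --chunk-mask-bit.
  constexpr uint64_t RABIN_PRIME = 3;
  constexpr uint64_t MOD_PRIME   = 1000000007ULL;
  constexpr uint32_t WINDOW_SIZE = 48;
  constexpr uint32_t MASK_BITS   = 14;  // boundary when the lower 14 bits are zero

  // Weight of the byte leaving the window under the update order above,
  // i.e. pow = RABIN_PRIME^WINDOW_SIZE mod MOD_PRIME.
  uint64_t pow_term() {
    uint64_t p = 1;
    for (uint32_t i = 0; i < WINDOW_SIZE; ++i)
      p = (p * RABIN_PRIME) % MOD_PRIME;
    return p;
  }

  // One rolling-hash step: mix in new_byte entering the window and drop
  // old_byte leaving it (MOD_PRIME is added before subtracting to avoid
  // unsigned underflow).
  uint64_t roll(uint64_t rabin_hash, uint8_t new_byte, uint8_t old_byte, uint64_t pow) {
    uint64_t drop = (old_byte * pow) % MOD_PRIME;
    return (rabin_hash * RABIN_PRIME + new_byte + MOD_PRIME - drop) % MOD_PRIME;
  }

  // Content-defined chunking declares a boundary wherever the lower
  // MASK_BITS bits of the rolling hash are all zero.
  bool is_boundary(uint64_t rabin_hash) {
    return (rabin_hash & ((1u << MASK_BITS) - 1)) == 0;
  }
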
c. Fixed chunk vs content-defined chunk

Content-defined chunking may or may not be the optimal solution.
For example ::

  Data chunk A : abcdefgabcdefgabcdefg

Consider Data chunk A's deduplication. The ideal chunk offset is from 1 to 7
(abcdefg), so with fixed chunking, 7 is the optimal chunk length. In the case
of content-based slicing, however, the optimal chunk length might not be
found (the dedup ratio will not be 100%), because we first need to find
suitable parameters such as the boundary bits, window size and prime value,
which is not as easy as choosing a fixed chunk size.
Content-defined chunking is, however, very effective in the following case. ::

  Data chunk B : abcdefgabcdefgabcdefg

  Data chunk C : Tabcdefgabcdefgabcdefg

Data chunk C is Data chunk B with a single byte (T) inserted at the front.
With fixed chunking, every chunk of C is shifted by one byte and no longer
matches the chunks of B, whereas content-defined chunking re-aligns on the
same content-derived boundaries, so the repeated abcdefg data can still be
deduplicated.

* fix reference count

Reference counting for dedup is designed to err on the side of false
positives: an inconsistency of the form (manifest object has no reference,
chunk object still has a reference) may occur, but not the reverse (manifest
object has a reference, chunk object has none), since the latter could allow
a chunk to be removed while it is still referenced.
To fix such false-positive inconsistencies, ceph-dedup-tool supports
chunk_scrub. ::

  ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL