===============
 Deduplication
===============


Introduction
============

Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and changes to the
original data-processing procedure.

In a typical deduplication system, the input source (a data object) is
split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with the data chunks
that are already stored in the system. To this end, the deduplication
system maintains a fingerprint index that stores the hash value of each
chunk, so that existing chunks can be found by comparing hash values
rather than by searching all contents that reside in the underlying
storage.

There are many challenges in implementing deduplication on top of Ceph.
Two of them are essential. First, the fingerprint index must scale;
second, it is complex to ensure compatibility between the newly added
deduplication metadata and the existing metadata.

Key Idea
========

1. Content hashing (double hashing): Each client can find the object data
for an object ID using CRUSH. With CRUSH, a client knows the object's
location in the base tier.
By hashing the object's content at the base tier, a new OID (chunk ID) is
generated. The chunk tier stores, under the new OID, a partial content of
the original object (a sketch of this double hashing appears after this
list).

  Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
  CRUSH(K) -> chunk's location


2. Self-contained object: An external metadata design makes integration
with existing storage features difficult, because those features cannot
recognize the additional external data structures. If we can design the
data deduplication system without any external component, the original
storage features can be reused.

More details in https://ieeexplore.ieee.org/document/8416369

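The double hashing in item 1 above can be illustrated with a minimal
sketch. This is not Ceph code: the fixed 8 KiB chunk size and the SHA-256
fingerprint are illustrative assumptions, and the final CRUSH step is only
indicated in a comment because it is performed by the cluster itself. ::

  import hashlib

  CHUNK_SIZE = 8192   # assumed chunk size, for illustration only

  def chunk_oids(object_data: bytes, algo: str = "sha256"):
      """Derive chunk-tier OIDs from the object's content (the second hash)."""
      oids = []
      for off in range(0, len(object_data), CHUNK_SIZE):
          chunk = object_data[off:off + CHUNK_SIZE]
          # HASH(content) = K; K becomes the chunk object's OID, so identical
          # chunks in different objects map to the same chunk object.
          oid = hashlib.new(algo, chunk).hexdigest()
          oids.append((off, len(chunk), oid))
      return oids

  # CRUSH(K) -> chunk's location: the cluster maps each chunk OID to its
  # placement in the chunk pool, just as for any other RADOS object.
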
Design
======

.. ditaa::

           +-------------+
           | Ceph Client |
           +------+------+
                  ^
     Tiering is   |
    Transparent   | Metadata
      to Ceph     |           +---------------+
     Client Ops   |           |               |
                  |    +----->+   Base Pool   |
                  |    |      |               |
                  |    |      +-----+---+-----+
                  |    |            |   ^
                  v    v            |   | Dedup metadata in Base Pool
           +------+----+--+         |   | (Dedup metadata contains chunk offsets
           |   Objecter   |         |   |  and fingerprints)
           +-----------+--+         |   |
                       ^            |   | Data in Chunk Pool
                       |            v   |
                       |      +-----+---+-----+
                       |      |               |
                       +----->|   Chunk Pool  |
                              |               |
                              +---------------+
                                    Data


Pool-based object management:
We define two pools.
The base pool stores metadata objects and the chunk pool stores
chunk objects. Since the two pools are divided by purpose and usage, each
pool can be managed more efficiently according to its own characteristics.
The base pool and the chunk pool can independently select a redundancy
scheme (replication or erasure coding) depending on their usage, and each
pool can be placed on different storage depending on the required
performance.

Manifest object:
Manifest (metadata) objects are stored in the base pool and contain the
metadata for data deduplication.

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;                // redirect, chunked, ...
    hobject_t redirect_target;   // used when type == TYPE_REDIRECT
    // chunk_map: offset within the object -> chunk_info_t
    // (chunk length and the OID of the chunk object in the chunk pool)
    std::map<uint64_t, chunk_info_t> chunk_map;
  };


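For illustration only (this is not Ceph code and the values are made up),
a TYPE_CHUNKED manifest for a small object can be pictured as a mapping
from offsets to chunk lengths and to the fingerprint-derived OIDs of the
chunk objects in the chunk pool. ::

  # Conceptual view of a chunked manifest; offsets, lengths and fingerprints
  # are invented for this example.
  chunk_map = {
      0:     {"length": 8192, "chunk_oid": "3a7bd3e2360a..."},  # bytes [0, 8192)
      8192:  {"length": 8192, "chunk_oid": "9f86d081884c..."},  # bytes [8192, 16384)
      16384: {"length": 8192, "chunk_oid": "3a7bd3e2360a..."},  # duplicate of the first chunk
  }
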
Chunk object:
Chunk objects are stored in the chunk pool. A chunk object contains the
chunk data and its reference count information.


Although chunk objects and manifest objects have a different purpose from
existing objects, they can be handled in the same way as original objects.
Therefore, no additional dedup-specific operations are needed to support
existing features such as replication.

Usage
=====

To set up deduplication pools, you must have two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to
be configured with the fingerprint_algorithm option as follows.

::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

1. Create objects ::

     rados -p base_pool put foo ./foo

     rados -p chunk_pool put foo-chunk ./foo-chunk

2. Make a manifest object ::

     rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
       chunk_pool foo-chunk $START_OFFSET --with-reference


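The two steps above can also be scripted per chunk. The following is a
minimal sketch rather than a supported tool: it assumes the pool names
base_pool and chunk_pool used above, an arbitrary fixed 8 KiB chunk size,
SHA-256 fingerprints used directly as chunk-object names, and it simply
drives the rados commands shown in this section (the set-chunk arguments
follow the Interface section below). ::

  import hashlib
  import subprocess
  import tempfile

  CHUNK_SIZE = 8192   # assumed chunk size for this sketch

  def rados(*args):
      subprocess.run(["rados", *args], check=True)

  def dedup_object(base_oid, path):
      data = open(path, "rb").read()
      rados("-p", "base_pool", "put", base_oid, path)         # step 1: base object
      for off in range(0, len(data), CHUNK_SIZE):
          chunk = data[off:off + CHUNK_SIZE]
          fp = hashlib.sha256(chunk).hexdigest()              # fingerprint names the chunk
          with tempfile.NamedTemporaryFile() as tmp:
              tmp.write(chunk)
              tmp.flush()
              rados("-p", "chunk_pool", "put", fp, tmp.name)  # step 1: chunk object
          # step 2: record this offset in the manifest of the base object
          rados("-p", "base_pool", "set-chunk", base_oid, str(off), str(len(chunk)),
                "--target-pool", "chunk_pool", fp, "0", "--with-reference")

  # Example: dedup_object("foo", "./foo")
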
Interface
=========

* set-redirect

  set redirection between a base_object in the base_pool and a target_object
  in the target_pool.
  A redirected object will forward all operations from the client to the
  target_object. ::

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

* set-chunk

  set the chunk-offset in a source_object to make a link between it and a
  target_object. ::

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

* tier-promote

  promote the object (including chunks). ::

    rados -p base_pool tier-promote <obj-name>

* unset-manifest

  unset the manifest option from an object that has a manifest. ::

    rados -p base_pool unset-manifest <obj-name>

* tier-flush

  flush an object that has chunks to the chunk pool. ::

    rados -p base_pool tier-flush <obj-name>

Dedup tool
==========

The dedup tool has two features: finding an optimal chunk offset for dedup
chunking and fixing the reference count.

* find optimal chunk offset

  a. fixed chunk

     To find a good fixed chunk length, you need to run the following
     command many times while changing the chunk_size. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
         --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. rabin chunk (Rabin-Karp algorithm)

     The Rabin-Karp algorithm is a string-searching algorithm based on a
     rolling hash, but a rolling hash alone is not enough for deduplication
     because we do not know the chunk boundaries. So we need content-based
     slicing that uses a rolling hash for content-defined chunking.
     The current implementation uses the simplest approach: look for chunk
     boundaries by inspecting the rolling hash for a pattern (for example,
     the lower N bits are all zeroes).

     - Usage

       Users who want to use deduplication need to find an ideal chunk
       offset. To do so, users should discover the optimal configuration for
       their data workload via ceph-dedup-tool. This chunking information
       will then be used for object chunking through the set-chunk API. ::

         ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
           --chunk-algorithm rabin --fingerprint-algorithm rabin

       ceph-dedup-tool has many options to control rabin chunking.
       These are the rabin chunk options. ::

         --mod-prime <uint64_t>
         --rabin-prime <uint64_t>
         --pow <uint64_t>
         --chunk-mask-bit <uint32_t>
         --window-size <uint32_t>
         --min-chunk <uint32_t>
         --max-chunk <uint64_t>

       Users need to refer to the following equation to use the above
       options for rabin chunking; a sketch of this rolling hash follows
       the equation. ::

         rabin_hash =
           (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

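     The sketch below shows how such a rolling hash can drive content-defined
     chunking. It is an illustration of the equation above, not the
     ceph-dedup-tool implementation: the cdc_boundaries() helper is
     hypothetical, the parameter values are arbitrary examples, and the
     relation pow = rabin_prime ** window_size is an assumption. ::

       def cdc_boundaries(data, window_size=48, rabin_prime=31,
                          mod_prime=1000000007, chunk_mask_bit=14,
                          min_chunk=2048, max_chunk=65536):
           """Yield (offset, length) chunk boundaries chosen by the rolling hash."""
           mask = (1 << chunk_mask_bit) - 1
           # "pow" term of the equation; assumed to be rabin_prime ** window_size
           pow_ = pow(rabin_prime, window_size, mod_prime)
           rabin_hash = 0
           start = 0
           for i, new_byte in enumerate(data):
               old_byte = data[i - window_size] if i >= window_size else 0
               rabin_hash = (rabin_hash * rabin_prime + new_byte
                             - old_byte * pow_) % mod_prime
               size = i - start + 1
               # cut when the lower chunk_mask_bit bits are all zero, or at max_chunk;
               # simplified: the rolling window is not reset at a boundary
               if (size >= min_chunk and (rabin_hash & mask) == 0) or size >= max_chunk:
                   yield (start, size)
                   start = i + 1
           if start < len(data):
               yield (start, len(data) - start)
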
  c. Fixed chunk vs content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example,

     Data chunk A : abcdefgabcdefgabcdefg

     Let's think about Data chunk A's deduplication. The ideal chunk boundary
     is from 1 to 7 (abcdefg), so if we use fixed chunking, 7 is the optimal
     chunk length. In the case of content-based slicing, however, the optimal
     chunk length may not be found (the dedup ratio will not be 100%),
     because we first need to find good parameters such as the boundary bit,
     the window size and the prime value, which is not as easy as with fixed
     chunking. On the other hand, content-defined chunking is very effective
     in the following case, where the same content is shifted by one prefix
     byte (see the sketch below).

     Data chunk B : abcdefgabcdefgabcdefg

     Data chunk C : Tabcdefgabcdefgabcdefg

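     A short sketch (illustration only, not part of ceph-dedup-tool) shows
     why the one-byte shift in Data chunk C defeats fixed chunking, while a
     content-defined cut point chosen from the data itself would keep
     producing the repeated abcdefg chunks. ::

       def fixed_chunks(data, size=7):
           return [data[i:i + size] for i in range(0, len(data), size)]

       B = b"abcdefg" * 3          # Data chunk B
       C = b"T" + b"abcdefg" * 3   # Data chunk C: B with one prefix byte

       # Fixed 7-byte chunks dedup B perfectly: only one unique chunk.
       print(set(fixed_chunks(B)))   # {b'abcdefg'}
       # The prefix byte misaligns every fixed chunk of C, so none of them
       # match B's chunks any more.
       print(set(fixed_chunks(C)))   # {b'Tabcdef', b'gabcdef', b'g'}

       # A content-defined chunker cuts where the content itself says to
       # (e.g. where a rolling hash matches a pattern), so the cut points
       # shift together with the data and the abcdefg chunks are still found.
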

* fix reference count

  The reference counting for dedup is designed to tolerate false positives:
  (manifest object (no ref), chunk object (has ref)) can occur, but not
  (manifest object (has ref), chunk object (no ref)).
  To fix such an inconsistency, ceph-dedup-tool supports chunk_scrub. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL