===============
 Deduplication
===============


Introduction
============

Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and changes to the
original data-processing procedure.

In a typical deduplication system, the input source is treated as a data
object that is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with the existing data
chunks previously stored. To this end, the deduplication system employs
a fingerprint index that stores the hash value of each chunk, so that an
existing chunk can be found by comparing hash values rather than by
searching all of the contents that reside in the underlying storage.

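As a concrete illustration, here is a minimal sketch (not Ceph code; all
names are illustrative) of a write path that consults a fingerprint index
to deduplicate chunks::

  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>
  #include <vector>

  // Illustrative stand-in for a cryptographic fingerprint such as SHA-1.
  std::string fingerprint(const std::string& chunk) {
    return std::to_string(std::hash<std::string>{}(chunk));
  }

  struct ChunkRef {
    size_t location;    // where the chunk is stored
    uint64_t refcount;  // how many objects point at this chunk
  };

  std::vector<std::string> chunk_store;      // stand-in for the storage
  std::map<std::string, ChunkRef> fp_index;  // fingerprint -> stored chunk

  // Store one chunk; a duplicate only bumps the reference count.
  size_t store_chunk(const std::string& chunk) {
    const std::string fp = fingerprint(chunk);
    auto it = fp_index.find(fp);
    if (it != fp_index.end()) {
      ++it->second.refcount;         // chunk already stored: deduplicate
      return it->second.location;
    }
    chunk_store.push_back(chunk);    // new chunk: write it out
    fp_index.emplace(fp, ChunkRef{chunk_store.size() - 1, 1});
    return chunk_store.size() - 1;
  }
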
There are many challenges to implementing deduplication on top of Ceph.
Among them, two issues are essential. The first is the scalability of the
fingerprint index; the second is the complexity of keeping the newly added
deduplication metadata compatible with the existing metadata.

Key Idea
========
1. Content hashing (double hashing): Each client can find an object's data
for an object ID using CRUSH. With CRUSH, a client knows the object's
location in the base tier.
By hashing the object's content at the base tier, a new OID (chunk ID) is
generated. The chunk tier stores the partial content of the original
object under the new OID.

  Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
  CRUSH(K) -> chunk's location

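A minimal sketch of the double-hashing idea (the helpers below are
illustrative stand-ins for CRUSH and for the pool's configured fingerprint
algorithm)::

  #include <functional>
  #include <string>

  // Stand-in for CRUSH: maps an OID to a location deterministically.
  std::string crush_locate(const std::string& oid) {
    return "osd." + std::to_string(std::hash<std::string>{}(oid) % 10);
  }

  // Stand-in for the pool's fingerprint algorithm (e.g. SHA-1).
  std::string content_hash(const std::string& data) {
    return std::to_string(std::hash<std::string>{}(data));
  }

  // Double hashing: the chunk's *content* produces a new OID (K), and
  // CRUSH(K) then yields the chunk's location in the chunk pool.
  std::string chunk_location(const std::string& chunk_data) {
    const std::string chunk_oid = content_hash(chunk_data);  // OID = K
    return crush_locate(chunk_oid);                          // CRUSH(K)
  }
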

2. Self-contained object: An external metadata design makes integration
with storage features difficult, because the existing storage features
cannot recognize the additional external data structures. If we can
design the data deduplication system without any external component, the
original storage features can be reused.

More details are available at https://ieeexplore.ieee.org/document/8416369

Design
======

.. ditaa::

        +-------------+
        | Ceph Client |
        +------+------+
               ^
   Tiering is  |
  Transparent  | Metadata
    to Ceph    |      +---------------+
   Client Ops  |      |               |
          |    +----->+   Base Pool   |
          |    |      |               |
          |    |      +-----+---+-----+
          |    |            |   ^
          v    v            |   | Dedup metadata in Base Pool
    +-----+----+--+         |   | (Dedup metadata contains chunk offsets
    |   Objecter  |         |   |  and fingerprints)
    +----------+--+         |   |
               ^            |   | Data in Chunk Pool
               |            v   |
               |      +-----+---+-----+
               |      |               |
               +----->|   Chunk Pool  |
                      |               |
                      +---------------+
                           Data


Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided by purpose
and usage, each pool can be managed more efficiently according to
its characteristics. The base pool and the chunk pool can each select
a redundancy scheme (replication or erasure coding) depending on
their usage, and each pool can be placed on a different storage
location depending on the required performance.

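For example, the two pools could be created with different redundancy
schemes (the pool names and PG counts here are illustrative)::

  # replicated base pool for metadata, erasure-coded pool for chunk data
  ceph osd pool create base_pool 128
  ceph osd pool create chunk_pool 128 128 erasure
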
Manifest Object:
Metadata objects are stored in the
base pool, which contains metadata for data deduplication.

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

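Each entry of chunk_map maps an offset within the manifest object to the
chunk that backs it. A simplified sketch of what chunk_info_t carries (the
authoritative definition lives in src/osd/osd_types.h; the fields shown
here are illustrative)::

  struct chunk_info_t {
    uint32_t offset;  // offset within the target chunk object
    uint32_t length;  // length of the chunk
    hobject_t oid;    // chunk object in the chunk pool
    uint32_t flags;   // chunk state flags (e.g. dirty, missing)
  };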

Chunk Object:
Chunk objects are stored in the chunk pool. A chunk object contains
chunk data and its reference count information.


Although chunk objects and manifest objects have a different purpose
from existing objects, they can be handled in the same way as
original objects. Therefore, to support existing features such as
replication, no additional operations for dedup are needed.

Usage
=====

To set up deduplication pools, you must have two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to
be configured with the fingerprint_algorithm option as follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

1. Create objects ::

  rados -p base_pool put foo ./foo

  rados -p chunk_pool put foo-chunk ./foo-chunk

2. Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
    chunk_pool foo-chunk $START_OFFSET --with-reference

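For example, to deduplicate the first 4096 bytes of foo (the object names
and offsets here are illustrative)::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk
  rados -p base_pool set-chunk foo 0 4096 --target-pool chunk_pool
    foo-chunk 0 --with-reference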

Interface
=========

* set-redirect

  Set up redirection between a base_object in the base_pool and a
  target_object in the target_pool.
  A redirected object forwards all operations from the client to the
  target_object. ::

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

* set-chunk

  Set the chunk offset in a source_object to make a link between it and a
  target_object. ::

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

* tier-promote

  Promote the object (including chunks). ::

    rados -p base_pool tier-promote <obj-name>

* unset-manifest

  Unset the manifest option on an object that has a manifest. ::

    rados -p base_pool unset-manifest <obj-name>

* tier-flush

  Flush an object that has chunks to the chunk pool. ::

    rados -p base_pool tier-flush <obj-name>

Dedup tool
==========

The dedup tool has two features: finding the optimal chunk offset for dedup
chunking and fixing the reference count.

* find optimal chunk offset

  a. fixed chunk

     To find a good fixed chunk length, you need to run the following
     command many times while changing the chunk_size. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
         --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

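     For example, one could sweep several chunk sizes and compare the
     reported dedup ratios (the sizes below are illustrative)::

       for chunk_size in 4096 8192 16384 32768; do
         # estimate the dedup ratio at this fixed chunk size
         ceph-dedup-tool --op estimate --pool $POOL \
           --chunk-size $chunk_size --chunk-algorithm fixed \
           --fingerprint-algorithm sha1
       done
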
  b. rabin chunk (Rabin-Karp algorithm)

     The Rabin-Karp algorithm is a string-searching algorithm based on
     a rolling hash. But a rolling hash by itself is not enough for
     deduplication because we do not know the chunk boundaries. So we
     need content-based slicing that uses a rolling hash for
     content-defined chunking.
     The current implementation uses the simplest approach: look for
     chunk boundaries by inspecting the rolling hash for a pattern (such
     as the lower N bits all being zero).

     - Usage

       Users who want to use deduplication need to find an ideal chunk
       offset. To do that, users should discover the optimal configuration
       for their data workload via ceph-dedup-tool. This chunking
       information will then be used for object chunking through the
       set-chunk API. ::

         ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
           --chunk-algorithm rabin --fingerprint-algorithm rabin

       ceph-dedup-tool has many options for rabin chunking. ::

         --mod-prime <uint64_t>
         --rabin-prime <uint64_t>
         --pow <uint64_t>
         --chunk-mask-bit <uint32_t>
         --window-size <uint32_t>
         --min-chunk <uint32_t>
         --max-chunk <uint64_t>

       Users should refer to the following equation when using the above
       options for rabin chunking. ::

         rabin_hash =
           (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

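       The sketch below shows how this recurrence can drive content-defined
       chunking: the hash is rolled over a fixed window, and a chunk
       boundary is declared when the low chunk-mask-bit bits of the hash
       are zero (all parameter values are illustrative, not the tool's
       defaults)::

         #include <cstddef>
         #include <cstdint>
         #include <vector>

         const uint64_t RABIN_PRIME    = 3;
         const uint64_t MOD_PRIME      = 1000000007;  // illustrative prime
         const uint32_t WINDOW_SIZE    = 48;    // rolling window in bytes
         const uint32_t CHUNK_MASK_BIT = 13;    // cut if low 13 bits are 0
         const uint32_t MIN_CHUNK      = 4096;
         const uint64_t MAX_CHUNK      = 65536;

         // pow = rabin_prime^window_size % mod_prime: removes the byte
         // that leaves the window, per the equation above.
         uint64_t pow_term() {
           uint64_t p = 1;
           for (uint32_t i = 0; i < WINDOW_SIZE; ++i)
             p = (p * RABIN_PRIME) % MOD_PRIME;
           return p;
         }

         // Return the end offsets of content-defined chunks in data.
         std::vector<size_t> chunk_boundaries(
             const std::vector<uint8_t>& data) {
           const uint64_t POW = pow_term();
           const uint64_t mask = (1ULL << CHUNK_MASK_BIT) - 1;
           std::vector<size_t> cuts;
           uint64_t hash = 0;
           size_t chunk_start = 0;
           for (size_t i = 0; i < data.size(); ++i) {
             // rabin_hash = (rabin_hash * rabin_prime + new_byte
             //               - old_byte * pow) % mod_prime
             hash = (hash * RABIN_PRIME + data[i]) % MOD_PRIME;
             if (i >= WINDOW_SIZE) {
               uint64_t drop = data[i - WINDOW_SIZE] * POW % MOD_PRIME;
               hash = (hash + MOD_PRIME - drop) % MOD_PRIME;
             }
             size_t len = i - chunk_start + 1;
             if ((len >= MIN_CHUNK && (hash & mask) == 0)
                 || len >= MAX_CHUNK) {
               cuts.push_back(i + 1);  // chunk ends after byte i
               chunk_start = i + 1;
             }
           }
           if (chunk_start < data.size())
             cuts.push_back(data.size());  // trailing partial chunk
           return cuts;
         }
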
  c. Fixed chunk vs content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example,

     Data chunk A : abcdefgabcdefgabcdefg

     Let's consider Data chunk A's deduplication. The ideal chunk offset
     is from 1 to 7 (abcdefg), so with fixed chunking a chunk length of 7
     is optimal. But in the case of content-based slicing, the optimal
     chunk length might not be found (the dedup ratio will not be 100%),
     because we need to find optimal parameters such as the boundary bit,
     window size and prime value, and this is not as easy as with fixed
     chunking. However, content-defined chunking is very effective in the
     following case.

     Data chunk B : abcdefgabcdefgabcdefg

     Data chunk C : Tabcdefgabcdefgabcdefg

     With fixed chunking, the single leading byte "T" in chunk C shifts
     every chunk boundary, so no chunk of C matches a chunk of B. With
     content-defined chunking, the boundaries re-synchronize after the
     insertion, so most chunks still deduplicate.

* fix reference count

  Reference counting for dedup is designed so that only false positives
  can occur: an inconsistency may leave a chunk object holding a reference
  with no corresponding manifest object reference (manifest object: no
  ref, chunk object: has ref), but never the reverse (manifest object:
  has ref, chunk object: no ref).
  To fix such inconsistencies, ceph-dedup-tool supports chunk_scrub. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL