========
Manifest
========


Introduction
============

As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``

At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

1. Manifest data structures
2. RADOS operations for manipulating manifests.
3. Status and plans


Intended Usage Model
====================

RBD
---

For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.

As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.

Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.

Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:

* Demote cold RBD chunk to slow pool:

  1. Read object, noting current user_version.
  2. In memory, run CDC implementation to fingerprint object.
  3. Write out each resulting extent to an object in the cold pool
     using the CAS class.
  4. Submit operation to base pool:

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the cold pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.

  RBD users should then either see the state prior to the demotion or
  subsequent to it.

  Note that between 3 and 4, we potentially leak references, so a
  periodic scrub would be needed to validate refcounts.

* Promote cold RBD chunk to fast pool:

  1. Submit ``TIER_PROMOTE``.

For clones, all of the above would be identical except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.

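
The demote sequence above can be sketched as a small in-memory model. This is an illustrative Python sketch, not OSD code: the dict-based pools, the fixed-size chunker standing in for a real CDC implementation, and SHA-256 CAS naming are all assumptions.

```python
import hashlib

CHUNK = 4  # tiny stand-in for the 4MB RBD extent size

def fingerprint(data: bytes) -> str:
    """Content fingerprint, used as the CAS object name."""
    return hashlib.sha256(data).hexdigest()

def demote(base_pool, cold_pool, name, expected_version):
    """Steps 1-4 above: read, chunk, CAS-write, then SET_CHUNK + EVICT_CHUNK."""
    obj = base_pool[name]
    # Step 1 / 4a: note user_version; ASSERT_VER analogue.
    if obj['user_version'] != expected_version:
        raise RuntimeError('object mutated since read')
    data = obj['data']
    chunk_map = {}
    for off in range(0, len(data), CHUNK):
        extent = data[off:off + CHUNK]
        fp = fingerprint(extent)  # Step 2: fingerprint (fixed-size chunking here).
        # Step 3: write the extent to the cold pool via the CAS convention.
        cold_pool.setdefault(fp, {'data': extent, 'refs': 0})
        cold_pool[fp]['refs'] += 1
        # Steps 4b/4c: SET_CHUNK records the mapping, EVICT_CHUNK marks it MISSING.
        chunk_map[off] = {'oid': fp, 'len': len(extent), 'flags': {'MISSING'}}
    obj['chunk_map'] = chunk_map
    obj['data'] = None  # extents evicted from the base pool
    return chunk_map

base = {'foo': {'data': b'aaaabbbb', 'user_version': 7}}
cold = {}
cm = demote(base, cold, 'foo', 7)
```

The reference leak window between steps 3 and 4 shows up here too: if ``demote`` failed after the CAS writes, ``cold`` would hold refs with no manifest pointing at them.
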
RadosGW
-------

For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.

For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
chunks set.

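
That fingerprint-first write path could look roughly like the following. This is a hedged Python sketch with assumed names (``rgw_put``, ``cdc_fixed``) and a trivial fixed-size chunker in place of real CDC; the point is only the ordering: CAS objects first, then a single atomic manifest write.

```python
import hashlib

def cdc_fixed(data: bytes, size: int = 4):
    """Trivial fixed-size chunker standing in for a real CDC algorithm."""
    return [(off, data[off:off + size]) for off in range(0, len(data), size)]

def rgw_put(base_pool, cas_pool, name, data: bytes):
    """Fingerprint before writing: CAS objects first, then one manifest write."""
    chunk_map = {}
    for off, extent in cdc_fixed(data):
        fp = hashlib.sha256(extent).hexdigest()
        cas_pool.setdefault(fp, extent)     # idempotent CAS write
        chunk_map[off] = (fp, len(extent))
    # single write in the base pool with the chunks already set
    base_pool[name] = {'chunk_map': chunk_map}

base, cas = {}, {}
rgw_put(base, cas, 'obj1', b'xxxxyyyy')
rgw_put(base, cas, 'obj2', b'xxxxzzzz')   # shares the b'xxxx' chunk with obj1
```

Because the CAS write is keyed by fingerprint, the shared extent is stored once and both manifests point at it.
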
Status and Future Work
======================

At the moment, initial versions of a manifest data structure along
with IO path support and rados control operations exist. This section
is meant to outline next steps.

At a high level, our future work plan is:

- Cleanups: Address immediate inconsistencies and shortcomings outlined
  in the next section.
- Testing: RADOS relies heavily on teuthology failure testing to validate
  features like cache/tiering. We'll need corresponding tests for
  manifest operations.
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the rados snapshot system. As such, the
  rados operations below need to be extended to work correctly on
  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.


Cleanups
--------

The existing implementation has some things that need to be cleaned up:

* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as user_modify gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    user_modify.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent exists already in the object. Should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked,
    which is probably wrong. We should return an error if it's a
    ``REDIRECT`` ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          result = -EINVAL;
          goto fail;
        }

* ``TIER_PROMOTE``:

  * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
    to copy them back in, but does not unset the redirect or clear the
    reference. This violates the invariant that a redirect object
    should be empty in the base pool. In particular, as long as the
    redirect is set, it appears that all operations will be proxied
    even after the promote, defeating the purpose. We do want ``PROMOTE``
    to be able to atomically replace a redirect with the actual
    object, so the solution is to clear the redirect at the end of the
    promote.
  * For a chunked manifest, we appear to flush prior to promoting.
    Promotion will often be used to prepare an object for low latency
    reads and writes; accordingly, the only effect should be to read
    any ``MISSING`` extents into the base pool. No flushing should be done.

* High Level:

  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
    at a dedup extent. Writing the mutated extent back to the dedup pool
    requires writing a new object since the previous one cannot be mutated,
    just as it would if it hadn't been dedup'd yet. Thus, we should always
    drop the reference and remove the manifest pointer.

  * There isn't currently a way to "evict" an object region. With the above
    change to ``SET_CHUNK`` to always retain the existing object region, we
    need an ``EVICT_CHUNK`` operation to then remove the extent.

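
The proposed semantics — ``SET_CHUNK`` leaving the extent clean and a separate ``EVICT_CHUNK`` removing it — can be modeled as follows. This is a Python sketch of the intended behavior, not the actual OSD code; the error names mirror the text, and the dict-based object representation is an assumption.

```python
def set_chunk(obj, offset, length, target_oid):
    """Proposed semantics: record the mapping but leave the local extent intact."""
    if obj.get('redirect'):
        raise ValueError('EINVAL: object is a redirect')
    if offset in obj['chunk_map']:
        raise ValueError('EOPNOTSUPP: overlapping chunk mapping')
    obj['chunk_map'][offset] = {'oid': target_oid, 'len': length, 'flags': set()}

def evict_chunk(obj, offset):
    """Proposed EVICT_CHUNK: drop the local extent and mark the chunk MISSING."""
    if offset not in obj['chunk_map']:
        raise ValueError('EINVAL: extent not in manifest')
    obj['chunk_map'][offset]['flags'].add('MISSING')
    del obj['extents'][offset]

obj = {'chunk_map': {}, 'extents': {0: b'data'}, 'redirect': None}
set_chunk(obj, 0, 4, 'cas-123')
assert 0 in obj['extents']   # SET_CHUNK left the extent clean, not cleared
evict_chunk(obj, 0)          # only now is the local copy dropped
```
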

Testing
-------

We rely really heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:

* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).

* RBD tests

  Add a test that runs an RBD workload concurrently with blind
  promote/evict operations.

* RGW

  Add a test that runs an RGW workload concurrently with blind
  promote/evict operations.

Snapshots
---------

Fundamentally we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.

As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:

* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.

See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.

An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

  write foo [0, 1024)
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  snapshot 10
  write foo [0, 512) ->
    head: [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  flush foo ->
    head: [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
  snapshot 20
  write foo [0, 512) (same contents as the original write)
    head: [512, 1024) bbb
    20  : [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1
  flush foo
    head: [0, 512) aaa, [512, 1024) bbb
    20  : [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1

What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::

  initial:
    head: [0, 512) aaa, [512, 1024) bbb
    20  : [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
  trim 20
    head: [0, 512) aaa, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0

At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side, which will then match.

See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.

This seems complicated, but it gets us two valuable properties:

1) The refcount change from make_writeable will not block on
   incrementing a ref
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.

All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.

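
The convention can be sketched as follows. This is a simplified Python model of the rule, not the actual ``calc_refs_to_drop_on_removal`` logic: clones are ordered oldest first with head last, and each clone is a map from offset to backing-chunk OID.

```python
def refs_to_drop_on_removal(clones, idx, offset):
    """Return the chunk OID whose refcount should drop when removing
    clones[idx] at the given offset. Per the convention, consecutive
    clones sharing a ref at the same offset share one refcount, so a
    ref is dropped only if neither neighbour holds the same ref."""
    oid = clones[idx].get(offset)
    if oid is None:
        return None
    prev_oid = clones[idx - 1].get(offset) if idx > 0 else None
    next_oid = clones[idx + 1].get(offset) if idx + 1 < len(clones) else None
    if oid in (prev_oid, next_oid):
        return None   # a neighbour shares the refcount; nothing to drop
    return oid

# The "trim 20" example from the text: oldest clone first, head last.
clones = [
    {0: 'aaa', 512: 'bbb'},   # clone 10
    {0: 'ccc', 512: 'bbb'},   # clone 20
    {0: 'aaa', 512: 'bbb'},   # head
]
```

Trimming clone ``20`` drops the ref on ``ccc`` (no neighbour shares it) but not on ``bbb``, and only the two adjacent clones were consulted, matching property 2 above.
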

Data Structures
===============

Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

The ``type`` enum reflects three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t`` as shown below, also
   specifying the ``length``, target ``OID``, and ``flags``.

::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    uint32_t offset;
    uint32_t length;
    hobject_t oid;
    cflag_t flags;   // FLAG_*
  };

``FLAG_DIRTY`` at this time can happen if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.

Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.

For manifest operations listed below, we return ``NOOP`` and continue onto
dedicated handling within ``do_osd_ops``.

For redirect objects which haven't been promoted (apparently ``oi.size >
0`` indicates that it's present?) we proxy reads and writes.

For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t`` ``chunk_map``),
we proxy requests to those objects.

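
A simplified model of that check follows. This sketch assumes each read must fall entirely within a single mapped extent, which may be stricter than the real ``can_proxy_chunked_read``; the op tuples and dict-based chunk map are illustrative only.

```python
def can_proxy_chunked_read(ops, chunk_map):
    """Proxy only if every op is a read fully contained in a mapped extent."""
    def covered(off, length):
        # chunk_map: extent offset -> extent length
        for coff, clen in chunk_map.items():
            if coff <= off and off + length <= coff + clen:
                return True
        return False
    return all(op[0] == 'read' and covered(op[1], op[2]) for op in ops)

chunk_map = {0: 4096, 4096: 4096}                      # two mapped extents
assert can_proxy_chunked_read([('read', 0, 4096)], chunk_map)
assert not can_proxy_chunked_read([('write', 0, 512)], chunk_map)  # writes never proxy
```
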

RADOS Interface
===============

To set up deduplication one must provision two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512 \
    --yes-i-really-mean-it

Create objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference

Operations:

* ``set-redirect``

  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
  in the ``target_pool``.
  A redirected object will forward all operations from the client to the
  ``target_object``. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.

  Takes a reference to target as part of operation, can possibly leak a ref
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates object, clears omap, and clears xattrs as a side effect.

  At the top of ``do_osd_ops``, does not set user_modify.

  This operation is not a user mutation and does not trigger a clone to be created.

  There are two purposes of ``set_redirect``:

  1. Redirect all operations to the target object (like a proxy)
  2. Cache when ``tier_promote`` is called (the redirect will be cleared at this time).

* ``set-chunk``

  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
  ``target_object``. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.
  Returns ``EINVAL`` on an ill-formed parameter buffer.
  Returns ``ENOTSUPP`` if existing mapped chunks overlap with new chunk mapping.

  Takes references to targets as part of operation, can possibly leak refs
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates object, clears omap, and clears xattrs as a side effect.

  This operation is not a user mutation and does not trigger a clone to be created.

  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }

* ``evict-chunk``

  Clears an extent from an object leaving only the manifest link between
  it and the ``target_object``. ::

    void evict_chunk(
      uint64_t offset, uint64_t length, int flag = 0);

    rados -p base_pool evict-chunk <offset> <length> <object>

  Returns ``EINVAL`` if the extent is not present in the manifest.

  Note: this does not exist yet.

* ``tier-promote``

  Promotes the object, ensuring that subsequent reads and writes will be local. ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ``ENOENT`` if the object does not exist.

  For a redirect manifest, copies data to head.

  TODO: Promote on a redirect object needs to clear the redirect.

  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
  subsequent reads and writes will be served from the base pool.

  Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
  resulting ``copy_get`` operation will issue reads which will then be redirected by
  the normal manifest read machinery.

  Does not set the ``user_modify`` flag.

  Future work will involve adding support for specifying a ``clone_id``.

* ``unset-manifest``

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

    rados -p base_pool unset-manifest <obj-name>

  Clears manifest chunks or redirect. Lazily releases references, may
  leak.

  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
  and so will trigger a snapshot. Note, this will be true even for a
  redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
  be fixed -- ``unset-manifest`` should not be a ``user_modify``.

* ``tier-flush``

  Flush the chunks of an object to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the ``user_modify=false`` ``ignorelist``; does not trigger a clone.

  Does not evict the extents.


ceph-dedup-tool
===============

``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).

* Find an optimal chunk offset

  a. Fixed chunk

     To find a good fixed chunk length, you need to run the following command many
     times while changing the ``chunk_size``. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
         --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

     Rabin-Karp is a string-searching algorithm based
     on a rolling hash. A rolling hash alone is not enough for deduplication because
     we don't know the chunk boundaries. So, we need content-based slicing using
     a rolling hash for content-defined chunking.
     The current implementation uses the simplest approach: look for chunk boundaries
     by inspecting the rolling hash for a pattern (like the
     lower N bits all being zeroes).

     Users who want to use deduplication need to find an ideal chunk offset.
     To do so, they should discover
     the optimal configuration for their data workload via ``ceph-dedup-tool``.
     This information will then be used for object chunking through
     the ``set-chunk`` API. ::

       ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
         --chunk-algorithm rabin --fingerprint-algorithm rabin

     ``ceph-dedup-tool`` has many options for tuning ``rabin chunk``.
     These are the options for ``rabin chunk``. ::

       --mod-prime <uint64_t>
       --rabin-prime <uint64_t>
       --pow <uint64_t>
       --chunk-mask-bit <uint32_t>
       --window-size <uint32_t>
       --min-chunk <uint32_t>
       --max-chunk <uint64_t>

     Refer to the following equation to see how the above options are used in ``rabin chunk``. ::

       rabin_hash =
         (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

  c. Fixed chunk vs content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example,

     Data chunk ``A`` : ``abcdefgabcdefgabcdefg``

     Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
     from ``1`` to ``7`` (``abcdefg``). So, if we use fixed chunking, ``7`` is the optimal chunk length.
     But, in the case of content-based slicing, the optimal chunk length
     might not be found (the dedup ratio will not be 100%),
     because we need to find optimal parameters such
     as the boundary bit, window size and prime value. This is not as easy as with fixed chunking.
     But content-defined chunking is very effective in the following case.

     Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

     Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``

* Fix reference count

  The key idea behind reference counting for dedup is that inconsistencies are
  allowed only in the false-positive direction: we may end up with
  ``(manifest object (no ref), chunk object (has ref))`` instead of
  ``(manifest object (has ref), chunk object (no ref))``.
  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
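
The fixed-vs-content-defined comparison in section c above can be demonstrated with a toy chunker. This Python sketch substitutes a trivial boundary predicate (cut after every ``g`` byte) for the rolling-hash test; it is illustrative only, not the tool's algorithm.

```python
def cdc(data: bytes, boundary=ord('g')):
    """Content-defined chunking with a trivial boundary predicate standing in
    for the rolling-hash test: cut after every occurrence of the boundary byte."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == boundary:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed(data: bytes, size=7):
    """Fixed-size chunking at the given length."""
    return [data[i:i + size] for i in range(0, len(data), size)]

b = b'abcdefgabcdefgabcdefg'
c = b'T' + b   # same content shifted by one byte
```

With fixed chunks, the one-byte shift in ``C`` misaligns every chunk and nothing deduplicates against ``B``; the content-defined cuts resynchronize after the first boundary, so the repeated ``abcdefg`` chunks are shared.
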