===============
 Deduplication
===============


Introduction
============

Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and changes to the
original data-processing procedure.

In a typical deduplication system, the input source is treated as a data
object that is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with the existing data
chunks previously stored. To this end, the deduplication system employs
a fingerprint index that stores the hash value of each chunk, so that an
existing chunk can be found by comparing hash values rather than by
searching all of the contents that reside in the underlying storage.

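As a concrete illustration, here is a minimal sketch (not Ceph code; all
names are illustrative) of a write path that consults a fingerprint index
to deduplicate chunks::

  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>
  #include <vector>

  // Illustrative stand-in for a cryptographic fingerprint such as SHA-1.
  std::string fingerprint(const std::string& chunk) {
    return std::to_string(std::hash<std::string>{}(chunk));
  }

  struct ChunkRef {
    size_t location;    // where the chunk is stored
    uint64_t refcount;  // how many objects point at this chunk
  };

  std::vector<std::string> chunk_store;      // stand-in for the storage
  std::map<std::string, ChunkRef> fp_index;  // fingerprint -> stored chunk

  // Store one chunk; a duplicate only bumps the reference count.
  size_t store_chunk(const std::string& chunk) {
    const std::string fp = fingerprint(chunk);
    auto it = fp_index.find(fp);
    if (it != fp_index.end()) {
      ++it->second.refcount;         // chunk already stored: deduplicate
      return it->second.location;
    }
    chunk_store.push_back(chunk);    // new chunk: write it out
    fp_index.emplace(fp, ChunkRef{chunk_store.size() - 1, 1});
    return chunk_store.size() - 1;
  }
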
There are many challenges to implementing deduplication on top of Ceph.
Among them, two issues are essential. The first is the scalability of the
fingerprint index; the second is the complexity of keeping the newly added
deduplication metadata compatible with the existing metadata.

Key Idea
========
1. Content hashing (double hashing): Each client can find an object's data
for an object ID using CRUSH. With CRUSH, a client knows the object's
location in the base tier.
By hashing the object's content at the base tier, a new OID (chunk ID) is
generated. The chunk tier stores the partial content of the original
object under the new OID.

  Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
  CRUSH(K) -> chunk's location

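A minimal sketch of the double-hashing idea (the helpers below are
illustrative stand-ins for CRUSH and for the pool's configured fingerprint
algorithm)::

  #include <functional>
  #include <string>

  // Stand-in for CRUSH: maps an OID to a location deterministically.
  std::string crush_locate(const std::string& oid) {
    return "osd." + std::to_string(std::hash<std::string>{}(oid) % 10);
  }

  // Stand-in for the pool's fingerprint algorithm (e.g. SHA-1).
  std::string content_hash(const std::string& data) {
    return std::to_string(std::hash<std::string>{}(data));
  }

  // Double hashing: the chunk's *content* produces a new OID (K), and
  // CRUSH(K) then yields the chunk's location in the chunk pool.
  std::string chunk_location(const std::string& chunk_data) {
    const std::string chunk_oid = content_hash(chunk_data);  // OID = K
    return crush_locate(chunk_oid);                          // CRUSH(K)
  }
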

2. Self-contained object: An external metadata design makes integration
with storage features difficult, because the existing storage features
cannot recognize the additional external data structures. If we can
design the data deduplication system without any external component, the
original storage features can be reused.

More details are available at https://ieeexplore.ieee.org/document/8416369

Design
======

.. ditaa::

        +-------------+
        | Ceph Client |
        +------+------+
               ^
   Tiering is  |
  Transparent  | Metadata
    to Ceph    |      +---------------+
   Client Ops  |      |               |
          |    +----->+   Base Pool   |
          |    |      |               |
          |    |      +-----+---+-----+
          |    |            |   ^
          v    v            |   | Dedup metadata in Base Pool
    +-----+----+--+         |   | (Dedup metadata contains chunk offsets
    |   Objecter  |         |   |  and fingerprints)
    +----------+--+         |   |
               ^            |   | Data in Chunk Pool
               |            v   |
               |      +-----+---+-----+
               |      |               |
               +----->|   Chunk Pool  |
                      |               |
                      +---------------+
                           Data


Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided by purpose
and usage, each pool can be managed more efficiently according to
its characteristics. The base pool and the chunk pool can each select
a redundancy scheme (replication or erasure coding) depending on
their usage, and each pool can be placed on a different storage
location depending on the required performance.

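For example, the two pools could be created with different redundancy
schemes (the pool names and PG counts here are illustrative)::

  # replicated base pool for metadata, erasure-coded pool for chunk data
  ceph osd pool create base_pool 128
  ceph osd pool create chunk_pool 128 128 erasure
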
Manifest Object:
Metadata objects are stored in the
base pool, which contains metadata for data deduplication.

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

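Each entry of chunk_map maps an offset within the manifest object to the
chunk that backs it. A simplified sketch of what chunk_info_t carries (the
authoritative definition lives in src/osd/osd_types.h; the fields shown
here are illustrative)::

  struct chunk_info_t {
    uint32_t offset;  // offset within the target chunk object
    uint32_t length;  // length of the chunk
    hobject_t oid;    // chunk object in the chunk pool
    uint32_t flags;   // chunk state flags (e.g. dirty, missing)
  };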

Chunk Object:
Chunk objects are stored in the chunk pool. A chunk object contains
chunk data and its reference count information.


Although chunk objects and manifest objects have a different purpose
from existing objects, they can be handled in the same way as
original objects. Therefore, to support existing features such as
replication, no additional operations for dedup are needed.

Usage
=====

To set up deduplication pools, you must have two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to
be configured with the fingerprint_algorithm option as follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

1. Create objects ::

  rados -p base_pool put foo ./foo

  rados -p chunk_pool put foo-chunk ./foo-chunk

2. Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
    chunk_pool foo-chunk $START_OFFSET --with-reference

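For example, to deduplicate the first 4096 bytes of foo (the object names
and offsets here are illustrative)::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk
  rados -p base_pool set-chunk foo 0 4096 --target-pool chunk_pool
    foo-chunk 0 --with-reference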

Interface
=========

* set-redirect

  Set up redirection between a base_object in the base_pool and a
  target_object in the target_pool.
  A redirected object forwards all operations from the client to the
  target_object. ::

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

* set-chunk

  Set the chunk offset in a source_object to make a link between it and a
  target_object. ::

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

* tier-promote

  Promote the object (including chunks). ::

    rados -p base_pool tier-promote <obj-name>

* unset-manifest

  Unset the manifest option on an object that has a manifest. ::

    rados -p base_pool unset-manifest <obj-name>

* tier-flush

  Flush an object that has chunks to the chunk pool. ::

    rados -p base_pool tier-flush <obj-name>

Dedup tool
==========

The dedup tool has two features: finding the optimal chunk offset for dedup
chunking and fixing the reference count.

* find optimal chunk offset

  a. fixed chunk

     To find a good fixed chunk length, you need to run the following
     command many times while changing the chunk_size. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
         --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

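     For example, one could sweep several chunk sizes and compare the
     reported dedup ratios (the sizes below are illustrative)::

       for chunk_size in 4096 8192 16384 32768; do
         # estimate the dedup ratio at this fixed chunk size
         ceph-dedup-tool --op estimate --pool $POOL \
           --chunk-size $chunk_size --chunk-algorithm fixed \
           --fingerprint-algorithm sha1
       done
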
  b. rabin chunk (Rabin-Karp algorithm)

     The Rabin-Karp algorithm is a string-searching algorithm based on
     a rolling hash. But a rolling hash by itself is not enough for
     deduplication because we do not know the chunk boundaries. So we
     need content-based slicing that uses a rolling hash for
     content-defined chunking.
     The current implementation uses the simplest approach: look for
     chunk boundaries by inspecting the rolling hash for a pattern (such
     as the lower N bits all being zero).

     - Usage

       Users who want to use deduplication need to find an ideal chunk
       offset. To do that, users should discover the optimal configuration
       for their data workload via ceph-dedup-tool. This chunking
       information will then be used for object chunking through the
       set-chunk API. ::

         ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
           --chunk-algorithm rabin --fingerprint-algorithm rabin

       ceph-dedup-tool has many options for rabin chunking. ::

         --mod-prime <uint64_t>
         --rabin-prime <uint64_t>
         --pow <uint64_t>
         --chunk-mask-bit <uint32_t>
         --window-size <uint32_t>
         --min-chunk <uint32_t>
         --max-chunk <uint64_t>

       Users should refer to the following equation when using the above
       options for rabin chunking. ::

         rabin_hash =
           (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

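       The sketch below shows how this recurrence can drive content-defined
       chunking: the hash is rolled over a fixed window, and a chunk
       boundary is declared when the low chunk-mask-bit bits of the hash
       are zero (all parameter values are illustrative, not the tool's
       defaults)::

         #include <cstddef>
         #include <cstdint>
         #include <vector>

         const uint64_t RABIN_PRIME    = 3;
         const uint64_t MOD_PRIME      = 1000000007;  // illustrative prime
         const uint32_t WINDOW_SIZE    = 48;    // rolling window in bytes
         const uint32_t CHUNK_MASK_BIT = 13;    // cut if low 13 bits are 0
         const uint32_t MIN_CHUNK      = 4096;
         const uint64_t MAX_CHUNK      = 65536;

         // pow = rabin_prime^window_size % mod_prime: removes the byte
         // that leaves the window, per the equation above.
         uint64_t pow_term() {
           uint64_t p = 1;
           for (uint32_t i = 0; i < WINDOW_SIZE; ++i)
             p = (p * RABIN_PRIME) % MOD_PRIME;
           return p;
         }

         // Return the end offsets of content-defined chunks in data.
         std::vector<size_t> chunk_boundaries(
             const std::vector<uint8_t>& data) {
           const uint64_t POW = pow_term();
           const uint64_t mask = (1ULL << CHUNK_MASK_BIT) - 1;
           std::vector<size_t> cuts;
           uint64_t hash = 0;
           size_t chunk_start = 0;
           for (size_t i = 0; i < data.size(); ++i) {
             // rabin_hash = (rabin_hash * rabin_prime + new_byte
             //               - old_byte * pow) % mod_prime
             hash = (hash * RABIN_PRIME + data[i]) % MOD_PRIME;
             if (i >= WINDOW_SIZE) {
               uint64_t drop = data[i - WINDOW_SIZE] * POW % MOD_PRIME;
               hash = (hash + MOD_PRIME - drop) % MOD_PRIME;
             }
             size_t len = i - chunk_start + 1;
             if ((len >= MIN_CHUNK && (hash & mask) == 0)
                 || len >= MAX_CHUNK) {
               cuts.push_back(i + 1);  // chunk ends after byte i
               chunk_start = i + 1;
             }
           }
           if (chunk_start < data.size())
             cuts.push_back(data.size());  // trailing partial chunk
           return cuts;
         }
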
  c. Fixed chunk vs content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example,

     Data chunk A : abcdefgabcdefgabcdefg

     Let's consider Data chunk A's deduplication. The ideal chunk offset
     is from 1 to 7 (abcdefg), so with fixed chunking a chunk length of 7
     is optimal. But in the case of content-based slicing, the optimal
     chunk length might not be found (the dedup ratio will not be 100%),
     because we need to find optimal parameters such as the boundary bit,
     window size and prime value, and this is not as easy as with fixed
     chunking. However, content-defined chunking is very effective in the
     following case.

     Data chunk B : abcdefgabcdefgabcdefg

     Data chunk C : Tabcdefgabcdefgabcdefg

     With fixed chunking, the single leading byte "T" in chunk C shifts
     every chunk boundary, so no chunk of C matches a chunk of B. With
     content-defined chunking, the boundaries re-synchronize after the
     insertion, so most chunks still deduplicate.

* fix reference count

  Reference counting for dedup is designed so that only false positives
  can occur: an inconsistency may leave a chunk object holding a reference
  with no corresponding manifest object reference (manifest object: no
  ref, chunk object: has ref), but never the reverse (manifest object:
  has ref, chunk object: no ref).
  To fix such inconsistencies, ceph-dedup-tool supports chunk_scrub. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL