===============
 Deduplication
===============


Introduction
============

Applying data deduplication to an existing software stack is not easy
because of the additional metadata management and the processing
required on the original data.

In a typical deduplication system, an input data object is split into
multiple chunks by a chunking algorithm. The deduplication system then
compares each chunk with the data chunks that are already stored.
To this end, the system maintains a fingerprint index that stores the
hash value of each chunk, so that existing chunks can be found by
comparing hash values rather than searching through all of the content
that resides in the underlying storage.

There are many challenges to implementing deduplication on top of
Ceph. Two of them are essential: the first is managing the scalability
of the fingerprint index; the second is ensuring compatibility between
the newly introduced deduplication metadata and the existing metadata.

Key Idea
========
1. Content hashing (Double hashing): Each client can find object data
   for an object ID using CRUSH. With CRUSH, a client knows the object's
   location in the base tier.
   By hashing the object's content at the base tier, a new OID (chunk ID)
   is generated. The chunk tier stores, under this new OID, a chunk that
   holds part of the original object's content (see the command-line
   sketch after this list).

   Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
   CRUSH(K) -> chunk's location


2. Self-contained object: An external metadata design makes integration
   with existing storage features difficult, because those features cannot
   recognize the additional external data structures. If the data
   deduplication system is designed without any external component, the
   original storage features can be reused as-is.

More details can be found at https://ieeexplore.ieee.org/document/8416369

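A minimal command-line sketch of the double-hashing flow described in
item 1 above, assuming a base pool named ``base_pool``, a chunk pool
named ``chunk_pool``, an object ``obj1``, and SHA-1 as the fingerprint
algorithm (all of these names are illustrative assumptions, not fixed
names):

.. code:: bash

   # First hash: CRUSH maps the object name to its location in the base tier.
   ceph osd map base_pool obj1

   # Second hash: the object's *content* yields the fingerprint K, which
   # becomes the chunk object's ID in the chunk tier.
   rados -p base_pool get obj1 /tmp/obj1.bin
   K=$(sha1sum /tmp/obj1.bin | awk '{print $1}')

   # CRUSH(K) then gives the chunk's location in the chunk pool.
   ceph osd map chunk_pool "$K"
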
Design
======

.. ditaa::

            +-------------+
            | Ceph Client |
            +------+------+
                   ^
      Tiering is   |
     Transparent   |              Metadata
       to Ceph     |           +---------------+
      Client Ops   |           |               |
                   |    +----->+   Base Pool   |
                   |    |      |               |
                   |    |      +-----+---+-----+
                   |    |            |   ^
                   v    v            |   |  Dedup metadata in Base Pool
            +------+----+--+         |   |  (Dedup metadata contains chunk offsets
            |   Objecter   |         |   |   and fingerprints)
            +-----------+--+         |   |
                        ^            |   |  Data in Chunk Pool
                        |            v   |
                        |      +-----+---+-----+
                        |      |               |
                        +----->|   Chunk Pool  |
                               |               |
                               +---------------+
                                     Data


Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided based on
purpose and usage, each pool can be managed more efficiently
according to its own characteristics. The base pool and the
chunk pool can independently select a redundancy scheme, either
replication or erasure coding, depending on their usage, and
each pool can be placed on different storage media depending
on the required performance.

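For example, the two pools can be given different redundancy schemes
when they are created. The following is only an illustrative sketch;
the pool names, placement-group counts, and erasure-code profile are
arbitrary:

.. code:: bash

   # Replicated base (metadata) pool, e.g. on fast media.
   ceph osd pool create base_pool 64 64 replicated

   # Erasure-coded chunk pool for the deduplicated data.
   ceph osd erasure-code-profile set dedup_profile k=4 m=2
   ceph osd pool create chunk_pool 64 64 erasure dedup_profile
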
For details on how this machinery is used, see ``osd_internals/manifest.rst``.

Usage Patterns
==============

Each Ceph interface layer presents unique opportunities and costs for
deduplication and tiering in general.

RadosGW
-------

S3 big data workloads seem like a good opportunity for deduplication. These
objects tend to be write once, read mostly objects which don't see partial
overwrites. As such, it makes sense to fingerprint and dedup up front.

Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.

RBD/CephFS
----------

RBD and CephFS both use deterministic naming schemes to partition
block devices/file data over rados objects. As such, the redirection
metadata would need to be included as part of rados, presumably
transparently.

Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
For those objects, we don't really want to perform dedup, and we don't
want to pay a write latency penalty in the hot path to do so anyway.
As such, performing tiering and dedup on cold objects in the background
is likely to be preferred.

One important wrinkle, however, is that both rbd and cephfs workloads
often feature usage of snapshots. This means that the rados manifest
support needs robust support for snapshots.

RADOS Machinery
===============

For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.

Status and Future Work
======================

At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.

RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
direct support for fingerprinting and redirects into the refcount pool
to radosgw.

Aside from radosgw, completing work on manifest object support in the
OSD, particularly as it relates to snapshots, would be the next step for
rbd and cephfs workloads.

How to use deduplication
========================

* This feature is highly experimental and is subject to change or removal.

Ceph provides deduplication using RADOS machinery.
Below we explain how to perform deduplication.


1. Estimate the space-saving ratio of a target pool using ``ceph-dedup-tool``.

.. code:: bash

   ceph-dedup-tool --op estimate --pool POOL --chunk-size CHUNK_SIZE
     --chunk-algorithm fixed|fastcdc --fingerprint-algorithm sha1|sha256|sha512
     --max-thread THREAD_COUNT

This command shows how much storage space can be saved when deduplication
is applied to the pool. If the amount of saved space is higher than expected,
the pool is probably worth deduplicating.
Users should specify ``POOL``, the pool in which the objects to be
deduplicated are stored. Users also need to run ``ceph-dedup-tool`` multiple
times with varying ``CHUNK_SIZE`` to find the optimal chunk size. Note that the
optimal value probably differs depending on the content of each object when the
``fastcdc`` chunk algorithm is used (not ``fixed``). Example output:

::

   {
     "chunk_algo": "fastcdc",
     "chunk_sizes": [
       {
         "target_chunk_size": 8192,
         "dedup_bytes_ratio": 0.4897049,
         "dedup_object_ratio": 34.567315,
         "chunk_size_average": 64439,
         "chunk_size_stddev": 33620
       }
     ],
     "summary": {
       "examined_objects": 95,
       "examined_bytes": 214968649
     }
   }

The above is example output from ``estimate``. ``target_chunk_size`` is the
``CHUNK_SIZE`` given by the user. ``dedup_bytes_ratio`` shows how many bytes are
redundant among the examined bytes; for instance, 1 - ``dedup_bytes_ratio`` is the
fraction of storage space that can be saved. ``dedup_object_ratio`` is the number of
generated chunk objects divided by ``examined_objects``. ``chunk_size_average``
is the average chunk size produced when performing CDC---this may differ from
``target_chunk_size`` because CDC generates different chunk boundaries depending
on the content. ``chunk_size_stddev`` is the standard deviation of the chunk size.

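Because the optimal chunk size is content dependent, one practical approach is
to sweep several candidate chunk sizes and compare the reported ratios. A rough
sketch (the pool name, sizes, and thread count below are placeholders):

.. code:: bash

   # Sweep several target chunk sizes and compare dedup_bytes_ratio.
   for CS in 8192 16384 32768 65536; do
       echo "== chunk size ${CS} =="
       ceph-dedup-tool --op estimate --pool mypool --chunk-size "${CS}" \
           --chunk-algorithm fastcdc --fingerprint-algorithm sha1 \
           --max-thread 4
   done
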

2. Create a chunk pool.

.. code:: bash

   ceph osd pool create CHUNK_POOL


3. Run the dedup command (there are two ways).

.. code:: bash

   ceph-dedup-tool --op sample-dedup --pool POOL --chunk-pool CHUNK_POOL --chunk-size
     CHUNK_SIZE --chunk-algorithm fastcdc --fingerprint-algorithm sha1|sha256|sha512
     --chunk-dedup-threshold THRESHOLD --max-thread THREAD_COUNT --sampling-ratio SAMPLE_RATIO
     --wakeup-period WAKEUP_PERIOD --loop --snap

The ``sample-dedup`` command spawns ``THREAD_COUNT`` threads to deduplicate objects in
``POOL``. The threads sample objects according to ``SAMPLE_RATIO`` (a value of 100 means a
full search) and deduplicate a chunk only if it is found to be redundant more than
``THRESHOLD`` times during the iteration.
If ``--loop`` is set, the threads wake up again after ``WAKEUP_PERIOD``; otherwise they exit
after one iteration.

.. code:: bash

   ceph-dedup-tool --op object-dedup --pool POOL --object OID --chunk-pool CHUNK_POOL
     --fingerprint-algorithm sha1|sha256|sha512 --dedup-cdc-chunk-size CHUNK_SIZE

The ``object-dedup`` command triggers deduplication on the RADOS object specified by ``OID``.
All parameters shown above must be specified. ``CHUNK_SIZE`` should be taken from
the results of step 1 above.
Note that when this command is executed, ``fastcdc`` is used by default, and other parameters
such as the fingerprint algorithm and ``CHUNK_SIZE`` are set as defaults for the pool.
Deduplicated objects appear in the chunk pool. If the object is mutated over time, the user
needs to re-run ``object-dedup`` because the chunk boundaries must be recalculated based on
the updated content.
The user needs to specify ``--snap`` if the target object is snapshotted. After deduplication
is done, the target object in ``POOL`` has size zero (it is evicted) and the generated chunk
objects appear in ``CHUNK_POOL``.

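To check the result, the chunk pool and the original object can be inspected
with ordinary RADOS commands; ``POOL``, ``CHUNK_POOL``, and ``OID`` below are the
placeholders used above. As described in the Key Idea section, chunk objects are
named by their content fingerprint:

.. code:: bash

   # Chunk objects (named by fingerprint) live in the chunk pool.
   rados -p CHUNK_POOL ls

   # The deduplicated (evicted) object in the base pool now reports size 0.
   rados -p POOL stat OID
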

4. Read/write I/Os

After step 3, users don't need to do anything special for I/O. Deduplicated objects are
fully compatible with existing RADOS operations.

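For example, a deduplicated object can still be read with a normal RADOS
operation; the chunks are fetched from the chunk pool transparently (pool and
object names are the placeholders used above):

.. code:: bash

   # A normal read against the base pool works as before; the chunks in
   # CHUNK_POOL are resolved behind the scenes.
   rados -p POOL get OID /tmp/restored_object
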

5. Run scrub to fix reference counts.

Reference mismatches can occur on rare occasions due to false positives when handling
reference counts for deduplicated RADOS objects. These mismatches can be fixed by
periodically scrubbing the pool:

.. code:: bash

   ceph-dedup-tool --op chunk-scrub --chunk-pool CHUNK_POOL --pool POOL --max-thread THREAD_COUNT