======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.

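As a concrete illustration of the parameters mentioned above, the
sketch below creates such a profile and a pool that uses it. The
profile and pool names are placeholders and the placement group counts
are only suitable for a small test cluster. Note that this layout
stores 8 + 4 + 3 = 15 chunks per object, so with
``crush-failure-domain=host`` it needs at least fifteen hosts::

    $ ceph osd erasure-code-profile set LRCk8m4l4 \
        plugin=lrc \
        k=8 m=4 l=4 \
        crush-failure-domain=host
    $ ceph osd pool create lrck8m4pool 12 12 erasure LRCk8m4l4
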
Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed::

    $ ceph osd erasure-code-profile set LRCprofile \
        plugin=lrc \
        k=4 m=2 l=3 \
        crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

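To confirm what was stored, the resulting profile can be displayed;
this is an optional check rather than part of the original example::

    $ ceph osd erasure-code-profile get LRCprofile
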
Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
        plugin=lrc \
        k=4 m=2 l=3 \
        crush-locality=rack \
        crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

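Creating the pool also generates a crush rule. Listing the rules and
dumping the one associated with the pool shows the rack-aware
placement steps; the rule is usually named after the pool, but the
first command will confirm the exact name::

    $ ceph osd crush rule ls
    $ ceph osd crush rule dump lrcpool
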
Create an lrc profile
=====================

To create a new lrc erasure code profile (a combined example using the
optional parameters follows the parameter descriptions below)::

    ceph osd erasure-code-profile set {name} \
        plugin=lrc \
        k={data-chunks} \
        m={coding-chunks} \
        l={locality} \
        [crush-root={root}] \
        [crush-locality={bucket-type}] \
        [crush-failure-domain={bucket-type}] \
        [crush-device-class={device-class}] \
        [directory={directory}] \
        [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile with the same name.

:Type: String
:Required: No.

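Putting several of the optional parameters together, a profile that
spreads each locality set across racks, keeps chunks on distinct hosts
and is restricted to rotational drives could be sketched as follows.
The profile name, the ``hdd`` device class and the rack hierarchy are
assumptions about the cluster, not requirements of the plugin::

    $ ceph osd erasure-code-profile set LRChdd \
        plugin=lrc \
        k=4 m=2 l=3 \
        crush-root=default \
        crush-locality=rack \
        crush-failure-domain=host \
        crush-device-class=hdd
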
Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks into
datacenters. The **k/m/l** parameters are implemented by generating a
low level configuration.

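As a sketch of the first possibility, a layout with one group of four
chunks and one group of three chunks could be written with the low
level ``mapping``/``layers`` syntax as below. This is only an
illustration of the syntax, with placeholder names, and unlike the
**k/m/l** form it has no coding step spanning both groups::

    $ ceph osd erasure-code-profile set LRCtwogroups \
        plugin=lrc \
        mapping=_DDD_DD \
        layers='[
                  [ "cDDD___", "" ],
                  [ "____cDD", "" ],
                ]'

Each group can then repair the loss of any single one of its chunks by
reading only the other chunks of the same group.
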
The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks, and the
loss of chunk *2* can be recovered with the first four chunks.

Erasure code profile examples using low level configuration
============================================================

Minimal testing
---------------

The following profile is strictly equivalent to using the default
erasure code profile: the *DD* implies *k=2*, the *c* implies *m=1*
and the *jerasure* plugin is used by default::

    $ ceph osd erasure-code-profile set LRCprofile \
        plugin=lrc \
        mapping=DD_ \
        layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

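To exercise the pool, an object can be written and read back with
*rados*; the object name and file paths below are arbitrary::

    $ echo SOMETHING > /tmp/lrc-test-in
    $ rados --pool lrcpool put lrc-test-object /tmp/lrc-test-in
    $ rados --pool lrcpool get lrc-test-object /tmp/lrc-test-out
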
Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. The following profile is equivalent to **k=4**, **m=2** and
**l=3** although the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
        plugin=lrc \
        mapping=__DD__DD \
        layers='[
                  [ "_cDD_cDD", "" ],
                  [ "cDDD____", "" ],
                  [ "____cDDD", "" ],
                ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
        plugin=lrc \
        mapping=__DD__DD \
        layers='[
                  [ "_cDD_cDD", "" ],
                  [ "cDDD____", "" ],
                  [ "____cDDD", "" ],
                ]' \
        crush-steps='[
                       [ "choose", "rack", 2 ],
                       [ "chooseleaf", "host", 4 ],
                     ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per-layer basis using the low
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool::

    $ ceph osd erasure-code-profile set LRCprofile \
        plugin=lrc \
        mapping=DD_ \
        layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer::

    $ ceph osd erasure-code-profile set LRCprofile \
        plugin=lrc \
        mapping=__DD__DD \
        layers='[
                  [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                  [ "cDDD____", "plugin=isa" ],
                  [ "____cDDD", "plugin=jerasure" ],
                ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

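The *isa* plugin relies on the Intel ISA-L library and may not be
available on every platform. Before depending on it, you can check
that its shared library was installed in the erasure code plugin
directory (the default documented above)::

    $ ls /usr/lib/ceph/erasure-code
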
Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in chunks 1 and 5, respectively.

The *step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

The *step 3* stores a single coding chunk *c* at position 4. The three
chunks created by *step 1* at positions 5, 6 and 7 are used to compute
this coding chunk, i.e. the coding chunk from *step 1* becomes a data
chunk in *step 3*.

If chunk *2* is lost::

   chunk nr    01234567

   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows the
content of chunk *2* to be recovered. There are no more chunks to
recover and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

   chunk nr    01234567

   step 1      _c  _c D
   step 2      cD  __ _
   step 3      __  cD D

The *step 3* can recover the content of chunk *6*::

   chunk nr    01234567

   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

The *step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow
the content of chunks *2* and *3* to be recovered::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that will select two crush buckets of type
*rack* and for each of them choose four OSDs, each of them located in
different buckets of type *host*.

The ruleset can also be manually crafted for finer control.
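
For such manual crafting, the usual workflow is to extract the crush
map, decompile it, edit the rule, recompile and inject it back; the
file names below are placeholders::

    $ ceph osd getcrushmap -o /tmp/crushmap.bin
    $ crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
    $ # edit /tmp/crushmap.txt to adjust the rule
    $ crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
    $ ceph osd setcrushmap -i /tmp/crushmap.new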