======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance, if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance, if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.

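For instance, a profile matching the *k=8*, *m=4*, *l=4* layout described
above could be created as follows (a sketch only; the profile and pool
names are illustrative)::

    $ ceph osd erasure-code-profile set LRC84profile \
         plugin=lrc \
         k=8 m=4 l=4 \
         ruleset-failure-domain=host
    $ ceph osd pool create lrc84pool 12 12 erasure LRC84profile
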
Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. For example::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

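The resulting profile can be inspected with the usual profile commands
(shown here only as a quick check; the exact output depends on the
cluster)::

    $ ceph osd erasure-code-profile ls
    $ ceph osd erasure-code-profile get LRCprofile
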
Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk. For example::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-locality=rack \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [ruleset-root={root}] \
         [ruleset-locality={bucket-type}] \
         [ruleset-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding-chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``ruleset-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``ruleset-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``ruleset-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile with the same name.

:Type: String
:Required: No.

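For instance, a profile combining most of the optional parameters could
be created as follows (a sketch; the values are only illustrative and
assume a CRUSH map with *rack* and *host* buckets under the *default*
root)::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-root=default \
         ruleset-locality=rack \
         ruleset-failure-domain=host
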
Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks into
datacenters. The **k/m/l** parameters are implemented by generating a
low level configuration.
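
As a rough illustration of the uneven grouping mentioned above, a layout
with one group of four chunks (three data plus one local parity) and one
group of three chunks (two data plus one local parity) might be written
as follows. This is an unvalidated sketch that only uses the
*mapping*/*layers* syntax explained in the examples below::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DDD_DD_ \
         layers='[ [ "DDDc___", "" ], [ "____DDc", "" ] ]'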

The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

    chunk nr    01234567
    step 1      _cDD_cDD
    step 2      cDDD____
    step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks, and the
loss of chunk *2* can be recovered with the first four
chunks.

Erasure code profile examples using low level configuration
===========================================================

Minimal testing
---------------

The following profile is strictly equivalent to using the default
erasure code profile: the *DD* implies *k=2*, the *c* implies *m=1* and
the *jerasure* plugin is used by default::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

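A quick way to check that the pool works (assuming the *rados* command
line tool is available; the object name is arbitrary) is to write and
read back a small object::

    $ echo ABCDEFGHI | rados --pool lrcpool put NYAN -
    $ rados --pool lrcpool get NYAN -
    ABCDEFGHI
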
Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ],
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk. For example::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ],
                 ]' \
         ruleset-steps='[
                         [ "choose", "rack", 2 ],
                         [ "chooseleaf", "host", 4 ],
                        ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

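The CRUSH ruleset generated from **ruleset-steps** can be inspected once
the pool has been created (shown as a quick check; the exact rule name
depends on the Ceph version)::

    $ ceph osd crush rule ls
    $ ceph osd crush rule dump
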
Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per-layer basis using the low
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ],
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

    chunk nr    01234567

    step 1      _cDD_cDD
    step 2      cDDD____
    step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in the chunks 1 and 5, respectively.

*Step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

*Step 3* stores a single coding chunk *c* at position 4. The three
chunks created by *step 1* are used to compute this coding chunk,
i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.

If chunk *2* is lost::

    chunk nr    01234567

    step 1      _c D_cDD
    step 2      cD D____
    step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

*Step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows the content
of chunk *2* to be recovered. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

    chunk nr    01234567

    step 1      _c  _c D
    step 2      cD  __ _
    step 3      __  cD D

*Step 3* can recover the content of chunk *6*::

    chunk nr    01234567

    step 1      _c  _cDD
    step 2      cD  ____
    step 3      __  cDDD

*Step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow
the content of chunks *2* and *3* to be recovered::

    chunk nr    01234567

    step 1      _cDD_cDD
    step 2      cDDD____
    step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

    chunk nr    01234567

    step 1      _cDD_cDD
    step 2      cDDD____
    step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

    ruleset-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that selects two crush buckets of type
*rack* and, for each of them, chooses four OSDs located in
different buckets of type *host*.

The ruleset can also be manually crafted for finer control.