======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.

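As a back-of-the-envelope illustration of those figures (hypothetical
Python helpers, not part of Ceph), repairing one lost chunk with plain
*jerasure* reads every remaining chunk, while *lrc* only reads the
surviving members of the lost chunk's local set::

    # Hypothetical helpers restating the repair costs quoted above.
    def repair_reads_jerasure(k, m):
        # one lost chunk: read all k + m - 1 remaining chunks
        return k + m - 1

    def repair_reads_lrc(l):
        # one lost chunk: read the other l chunks of its local set
        # (l chunks plus one local parity, minus the lost one)
        return l

    print(repair_reads_jerasure(k=8, m=4))  # 11
    print(repair_reads_lrc(l=4))            # 4
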
Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-locality=rack \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [crush-root={root}] \
         [crush-locality={bucket-type}] \
         [crush-failure-domain={bucket-type}] \
         [crush-device-class={device-class}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              CRUSH rule step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile of the same name.

:Type: String
:Required: No.

Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks within
datacenters. The **k/m/l** profile parameters are implemented by
generating a low level configuration.

The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks. And the
loss of chunk *2* can be recovered with the first four
chunks.

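This claim can be checked mechanically. The following Python sketch
(an illustration only, not the plugin's code) finds, for a single lost
chunk, which chunks a repair reads, trying the steps in the reverse
order used by the decoding algorithm described later in this
document::

    # Illustration only: which chunks a single-loss repair reads,
    # given the step descriptions above.
    steps = ["_cDD_cDD", "cDDD____", "____cDDD"]

    def reads_for(lost):
        # Try the steps in reverse order, as decoding does; the first
        # step covering the lost chunk supplies the chunks to read.
        for step in reversed(steps):
            if step[lost] != '_':
                return [i for i, c in enumerate(step)
                        if c != '_' and i != lost]

    print(reads_for(7))  # [4, 5, 6] -- within the last four chunks
    print(reads_for(2))  # [0, 1, 3] -- within the first four chunks
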
Erasure code profile examples using low level configuration
===========================================================

Minimal testing
---------------

The following profile is strictly equivalent to using the default
erasure code profile. The *DD* implies *K=2*, the *c* implies *M=1*
and the *jerasure* plugin is used by default.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

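The correspondence between a layer description and *K*/*M* is a matter
of counting letters, as a throwaway Python helper (hypothetical, shown
only for illustration) makes explicit::

    # Hypothetical helper: derive K and M from a layer description by
    # counting its data ('D') and coding ('c') positions.
    def km_from_layer(layer):
        return layer.count('D'), layer.count('c')

    print(km_from_layer("DDc"))       # (2, 1)
    print(km_from_layer("_cDD_cDD"))  # (4, 2)
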
Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]' \
         crush-steps='[
                        [ "choose", "rack", 2 ],
                        [ "chooseleaf", "host", 4 ]
                      ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per layer basis using the low
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in chunks 1 and 5, respectively.

The *step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

The *step 3* stores a single coding chunk *c* at position 4. The three
chunks created by *step 1* are used to compute this coding chunk,
i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.

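A short Python sketch (an illustration only, not the plugin's code)
makes the resulting layout explicit by labelling each position with
what the encode pass stores there::

    # Illustration only: label each chunk position according to the
    # mapping and the step that writes its coding chunk.
    mapping = "__DD__DD"
    steps = ["_cDD_cDD", "cDDD____", "____cDDD"]

    labels = {i: "data" for i, ch in enumerate(mapping) if ch == 'D'}
    for n, step in enumerate(steps, 1):
        for i, ch in enumerate(step):
            if ch == 'c':
                labels[i] = "coding chunk written by step %d" % n

    for i in sorted(labels):
        print(i, labels[i])
    # 0 coding chunk written by step 2
    # 1 coding chunk written by step 1
    # 2 data
    # 3 data
    # 4 coding chunk written by step 3
    # 5 coding chunk written by step 1
    # 6 data
    # 7 data
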
If chunk *2* is lost::

   chunk nr    01234567

   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows
recovering the content of chunk *2*. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

   chunk nr    01234567

   step 1      _c  _c D
   step 2      cD  __ _
   step 3      __  cD D

The *step 3* can recover the content of chunk *6*::

   chunk nr    01234567

   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

The *step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow
recovering the content of chunks *2* and *3*::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

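The walkthrough above generalizes to a simple loop. The following
Python sketch (an illustration only, under the assumption that each
step can repair at most as many missing chunks as it has coding
chunks) reproduces the recovery order just described::

    # Illustration only: iterate over the steps in reverse order,
    # repairing a step's missing chunks whenever it is missing no more
    # chunks than it has coding chunks ('c').
    steps = ["_cDD_cDD", "cDDD____", "____cDDD"]

    def decode(lost):
        lost, order = set(lost), []
        progress = True
        while lost and progress:
            progress = False
            for step in reversed(steps):
                covered = {i for i, ch in enumerate(step) if ch != '_'}
                missing = lost & covered
                if missing and len(missing) <= step.count('c'):
                    order += sorted(missing)   # this step repairs them
                    lost -= missing
                    progress = True
        return order

    print(decode({2}))        # [2]       -- via step 2
    print(decode({2, 3, 6}))  # [6, 2, 3] -- step 3, then step 1
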
Controlling CRUSH placement
===========================

The default CRUSH rule provides OSDs that are on different hosts. For instance::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a rule that selects two crush buckets of type
*rack* and, for each of them, chooses four OSDs, each located in
a different bucket of type *host*.

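As a toy illustration of what such a rule selects (plain Python, not
CRUSH itself; the rack, host and OSD names are made up), the two steps
translate into nested sampling::

    # Toy model of the two-step rule above: pick 2 racks, then 4
    # distinct hosts per rack, taking one OSD from each host.
    # (Illustration only; real CRUSH uses deterministic hashing.)
    import random

    cluster = {
        "rack-a": {"host-%d" % i: ["osd.%d" % i] for i in range(4)},
        "rack-b": {"host-%d" % i: ["osd.%d" % i] for i in range(4, 8)},
    }

    placement = []
    for rack in random.sample(sorted(cluster), 2):            # choose rack 2
        for host in random.sample(sorted(cluster[rack]), 4):  # chooseleaf host 4
            placement.append(random.choice(cluster[rack][host]))

    print(placement)  # eight OSDs: four per rack, one per host
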
The CRUSH rule can also be manually crafted for finer control.