======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.

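For instance, such a profile could be created as follows (a sketch
following the syntax described below; the profile and pool names are
arbitrary)::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=8 m=4 l=4 \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
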
Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-locality=rack \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [ruleset-root={root}] \
         [ruleset-locality={bucket-type}] \
         [ruleset-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding-chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``ruleset-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``ruleset-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``ruleset-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host**, no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.

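Once a profile exists, it can be listed and its stored parameters
displayed with the standard profile commands (assuming a profile named
*LRCprofile* as in the examples above)::

    $ ceph osd erasure-code-profile ls
    $ ceph osd erasure-code-profile get LRCprofile
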
Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks into
datacenters. The **k/m/l** parameters are implemented by generating a
low level configuration.

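For instance, the two-group layout mentioned above, one group of four
chunks and one group of three, could be sketched as follows (the
*mapping* and *layers* syntax used here is described in the sections
below; the exact layout is only illustrative)::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=_DDD_DD \
         layers='[ [ "cDDD___", "" ], [ "____cDD", "" ] ]'
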
The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks. And the
loss of chunk *2* can be recovered with the first four chunks.

Erasure code profile examples using low level configuration
============================================================

Minimal testing
---------------

It is strictly equivalent to using the default erasure code profile. The *DD*
implies *k=2*, the *c* implies *m=1* and the *jerasure* plugin is used
by default.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]' \
         ruleset-steps='[
                          [ "choose", "rack", 2 ],
                          [ "chooseleaf", "host", 4 ]
                        ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per layer basis using the low
level configuration. The second argument in ``layers='[ [ "DDc", "" ] ]'``
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in the chunks 1 and 5, respectively.

*Step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

*Step 3* stores a single coding chunk *c* at position 4. It reads the
three chunks at positions *5, 6, 7*, including the coding chunk that
*step 1* stored at position *5*, i.e. a coding chunk from *step 1*
becomes a data chunk in *step 3*.

If chunk *2* is lost::

   chunk nr    01234567

   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

*Step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows it to
recover the content of chunk *2*. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

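This works because a step with a single coding chunk behaves like a
simple parity block (with the default backend a lone coding chunk is
the XOR of the data chunks of its step, an assumption that may not
hold for every backend), so for *step 2*::

   chunk 0 = chunk 1 XOR chunk 2 XOR chunk 3

and the missing chunk can be recomputed from the three others::

   chunk 2 = chunk 0 XOR chunk 1 XOR chunk 3
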
If chunks *2, 3, 6* are lost::

   chunk nr    01234567

   step 1      _c  _c D
   step 2      cD  __ _
   step 3      __  cD D

*Step 3* can recover the content of chunk *6*::

   chunk nr    01234567

   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

*Step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, make it
possible to recover the content of chunks *2* and *3*::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   ruleset-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that selects two crush buckets of type
*rack* and, for each of them, chooses four OSDs, each located in a
different bucket of type *host*.

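The same placement idea can be applied at other levels of the crush
hierarchy. For instance (an illustrative sketch, assuming the crush
map defines buckets of type *datacenter*), the two groups of four
chunks could be spread across datacenters instead of racks with::

   ruleset-steps='[ [ "choose", "datacenter", 2 ], [ "chooseleaf", "host", 4 ] ]'
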
The ruleset can also be manually crafted for finer control.