======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks so that it
can recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.
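
For instance, a profile matching the *k=8*, *m=4*, *l=4* layout above
could be created as shown below; the profile name is an arbitrary
example, and the command syntax is the same as in the examples that
follow::

    $ ceph osd erasure-code-profile set LRCk8m4l4 \
         plugin=lrc \
         k=8 m=4 l=4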

Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
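
Once the profile exists it can be displayed to double check the stored
parameters; the ``get`` subcommand is part of the regular
``erasure-code-profile`` CLI::

    $ ceph osd erasure-code-profile get LRCprofile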


Reduce recovery bandwidth between racks
---------------------------------------

In Firefly, the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-locality=rack \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile


Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [ruleset-root={root}] \
         [ruleset-locality={bucket-type}] \
         [ruleset-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``ruleset-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``ruleset-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``ruleset-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile with the same name.

:Type: String
:Required: No.

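Putting several of these parameters together, a profile that keeps the
default crush root, places each group of **l** chunks in a different
rack and never puts two chunks on the same host could be created as
follows; the values are only meant to illustrate the syntax::

    ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-root=default \
         ruleset-locality=rack \
         ruleset-failure-domain=host
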
Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters, and racks within
datacenters. The **k/m/l** parameters are implemented by generating a
low level configuration.
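
As an untested sketch of the two group case mentioned above, using the
**mapping** and **layers** syntax shown in the examples below, one
group of four chunks (three data plus one local parity) and one group
of three chunks (two data plus one local parity) might be described as
follows (the profile name is arbitrary)::

    $ ceph osd erasure-code-profile set LRCuneven \
         plugin=lrc \
         mapping=_DDD_DD \
         layers='[
                   [ "cDDD___", "" ],
                   [ "____cDD", "" ]
                 ]'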

The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks and the
loss of chunk *2* can be recovered with the first four chunks.

Erasure code profile examples using low level configuration
===========================================================

Minimal testing
---------------

This is strictly equivalent to using the default erasure code profile:
the *DD* implies *k=2*, the *c* implies *m=1* and the *jerasure*
plugin is used by default::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
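
As a quick sanity check, an object can then be written to and read
back from the pool; the object name *NYAN* and its content are
arbitrary::

    $ echo ABCDEFGHI | rados --pool=lrcpool put NYAN -
    $ rados --pool=lrcpool get NYAN -
    ABCDEFGHI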

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile


Reduce recovery bandwidth between racks
---------------------------------------

In Firefly, the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]' \
         ruleset-steps='[
                   [ "choose", "rack", 2 ],
                   [ "chooseleaf", "host", 4 ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per-layer basis using the low
level configuration. The second argument in ``layers='[ [ "DDc", "" ] ]'``
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in lrcpool::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile


Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in chunks 2, 3, 6 and 7, in order. From
these, two coding chunks are calculated (the two lowercase c). The
coding chunks are stored in chunks 1 and 5, respectively.

The *step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last
four chunks, marked with an underscore (*_*) for readability, are
ignored.

The *step 3* stores a single coding chunk *c* at position 4. The last
three chunks from *step 1* (positions 5, 6 and 7) are used to compute
this coding chunk, i.e. the coding chunk from *step 1* becomes a data
chunk in *step 3*.

If chunk *2* is lost::

   chunk nr    01234567

   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows the
content of chunk *2* to be recovered. There are no more chunks to
recover and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3* and *6* are lost::

   chunk nr    01234567

   step 1      _c  _c D
   step 2      cD  __ _
   step 3      __  cD D

The *step 3* can recover the content of chunk *6*::

   chunk nr    01234567

   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

The *step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow
the content of chunks *2* and *3* to be recovered::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   ruleset-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that selects two crush buckets of type *rack*
and, for each of them, chooses four OSDs, each located in a different
bucket of type *host*.

The ruleset can also be manually crafted for finer control.
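
The crush rule that was actually generated for a pool can be inspected
with the crush CLI; the rule name used below is an assumption, it
usually matches the pool or profile name, so list the rules first if
in doubt::

    $ ceph osd crush rule ls
    $ ceph osd crush rule dump lrcpool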