CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings
in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing a single OSD requires
reading from the *d=8* remaining OSDs to repair it. Recovery of, say, 1GiB therefore
requires a download of 8 X 1GiB = 8GiB of information.

However, in the case of the *clay* plugin, *d* is configurable within the limits:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as this provides the greatest savings in terms
of network bandwidth and disk IO. In the case of the *clay* plugin configured with
*k=8*, *m=4* and *d=11*, when a single OSD fails, d=11 OSDs are contacted and
250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB
of information. More general formulas are provided in the table below. The benefits are substantial
when the repair is carried out for a rack that stores information on the order of
terabytes.

+-------------+---------------------------+
| plugin      | total amount of disk IO   |
+=============+===========================+
| jerasure    | k*S                       |
+-------------+---------------------------+
| clay        | d*S/(d-k+1) = (k+m-1)*S/m |
+-------------+---------------------------+

where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have
used the largest possible value of *d*, as this will result in the smallest amount of data download needed
to achieve recovery from an OSD failure.

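As a quick check of the formulas in the table, the following Python sketch (the
``repair_download`` helper is ours for illustration only and is not part of Ceph) reproduces the
8GiB versus 2.75GiB numbers from the example above::

    def repair_download(k, m, d, osd_data_bytes):
        """Total bytes read to repair one failed OSD, per the table above."""
        jerasure = k * osd_data_bytes                # k full chunks are read
        clay = d * osd_data_bytes / (d - k + 1)      # d partial chunks are read
        return jerasure, clay

    GiB = 1024 ** 3
    # k=8, m=4, d=11 and 1 GiB stored on the failed OSD, as in the example above
    j, c = repair_download(k=8, m=4, d=11, osd_data_bytes=1 * GiB)
    print(j / GiB, c / GiB)   # 8.0 2.75
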
Erasure-code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage::

    $ ceph osd erasure-code-profile set CLAYprofile \
         plugin=clay \
         k=4 m=2 d=5 \
         crush-failure-domain=host
    $ ceph osd pool create claypool 12 12 erasure CLAYprofile

Creating a clay profile
=======================

To create a new clay code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=clay \
         k={data-chunks} \
         m={coding-chunks} \
         [d={helper-chunks}] \
         [scalar_mds={plugin-name}] \
         [technique={technique-name}] \
         [crush-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each of which is stored on a different OSD.

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the better the
              savings (see the sketch after this parameter list).

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa*, or *shec*.

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the 'scalar_mds' plugin specified. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van',
              'cauchy' for isa and 'single', 'multiple' for shec.

:Default: reed_sol_van (for jerasure, isa), single (for shec)

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

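The range check on ``d`` can be expressed in a few lines. The following Python sketch
(the ``choose_d`` helper is ours for illustration only and is not part of Ceph) returns the
default *d=k+m-1* when *d* is not given and rejects values outside the allowed range::

    def choose_d(k, m, d=None):
        """Return a valid d; default to k+m-1, which gives the greatest savings."""
        if d is None:
            return k + m - 1
        if not (k + 1 <= d <= k + m - 1):
            raise ValueError("d must satisfy k+1 <= d <= k+m-1")
        return d

    print(choose_d(k=8, m=4))        # 11
    print(choose_d(k=4, m=2, d=5))   # 5
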
Notion of sub-chunks
====================

The clay code is able to save disk IO and network bandwidth because it is a
vector code: it can view and manipulate data within a chunk at a finer
granularity, termed a sub-chunk. The number of sub-chunks within a chunk for a
clay code is given by:

    sub-chunk count = q\ :sup:`(k+m)/q`, where q = d-k+1

During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = sub-chunk count / q

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk
   is read during repair.
#. When *k=8*, *m=4*, *d=11*, the sub-chunk count is 64 and the repair
   sub-chunk count is 16. A quarter of a chunk is read from an available
   OSD for repair of a failed chunk. These counts can be reproduced with
   the sketch below.

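The counts in the two examples follow directly from the formulas above. A minimal Python
sketch (the ``sub_chunk_counts`` helper is ours for illustration only and is not part of Ceph)
that reproduces them is::

    def sub_chunk_counts(k, m, d):
        """Sub-chunk count and repair sub-chunk count of a clay code."""
        q = d - k + 1
        count = q ** ((k + m) // q)   # q^((k+m)/q); assumes q divides k+m
        return count, count // q      # repair sub-chunk count = count / q

    print(sub_chunk_counts(k=4, m=2, d=5))    # (8, 4)
    print(sub_chunk_counts(k=8, m=4, d=11))   # (64, 16)
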
How to choose a configuration given a workload
==============================================

During repair, only a few of the sub-chunks within a chunk are read, and these
sub-chunks are not necessarily stored contiguously within the chunk. For best
disk IO performance, it is helpful to read contiguous data. For this reason, it
is suggested that you choose the stripe size such that the sub-chunk size is
sufficiently large.

For a given stripe size (which is fixed based on the workload), choose ``k``, ``m``, ``d`` such that::

    sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...

#. For large-size workloads for which the stripe size is large, it is easy to choose k, m, d.
   For example, consider a stripe size of 64MB; choosing *k=16*, *m=4* and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB (see the sketch
   after this list).
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.

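Applying the formula above, the 64MB example in the first item can be checked with a short
Python sketch (the ``sub_chunk_size`` helper is ours for illustration only and is not part of
Ceph)::

    def sub_chunk_size(stripe_size, k, m, d):
        """sub-chunk size = stripe-size / (k * sub-chunk count)."""
        q = d - k + 1
        count = q ** ((k + m) // q)
        return stripe_size / (k * count)

    MB, KB = 1024 ** 2, 1024
    print(sub_chunk_size(64 * MB, k=16, m=4, d=19) / KB)   # 4.0 (KB)
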
Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
disk IO during single-OSD recovery. However, LRCs focus on keeping the number of
OSDs contacted during repair (d) minimal, and this comes at the cost of additional
storage overhead. The *clay* code has a storage overhead of m/k. An *lrc* stores
(k+m)/d parities in addition to the ``m`` parities, resulting in a storage overhead
of (m+(k+m)/d)/k. Both *clay* and *lrc* can recover from the failure of any ``m`` OSDs.

+-----------------+----------------------------------+----------------------------------+
| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+=================+==================================+==================================+
| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+-----------------+----------------------------------+----------------------------------+
| (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
+-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on the single OSD being recovered.

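The table entries can be recomputed from the disk IO and storage overhead formulas given
earlier. The Python sketch below (the ``lrc_cost`` and ``clay_cost`` helpers are ours for
illustration only and are not part of Ceph) shows the (k=10, m=4) row::

    def lrc_cost(k, m, d, S=1.0):
        """(disk IO, storage overhead) for an LRC, per the formulas above."""
        return d * S, (m + (k + m) / d) / k

    def clay_cost(k, m, d, S=1.0):
        """(disk IO, storage overhead) for a clay code."""
        return d * S / (d - k + 1), m / k

    print(lrc_cost(k=10, m=4, d=7), clay_cost(k=10, m=4, d=13))
    # (7.0, 0.6) (3.25, 0.4)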