================
CLAY code plugin
================

CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings
in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
reading from the *d=8* others to repair. Recovery of, say, 1GiB then requires
a download of 8 X 1GiB = 8GiB of information.
However, in the case of the *clay* plugin *d* is configurable within the limits:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as this provides the greatest savings in terms
of network bandwidth and disk IO. In the case of the *clay* plugin configured with
*k=8*, *m=4* and *d=11*, when a single OSD fails, d=11 OSDs are contacted and
250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB
of information. More general parameters are provided below. The benefits are substantial
when the repair is carried out for a rack that stores information on the order of
terabytes.
    +--------------+---------------------------+
    | plugin       | total amount of disk IO   |
    +==============+===========================+
    | jerasure,isa | k*S                       |
    +--------------+---------------------------+
    | clay         | d*S/(d-k+1) = (k+m-1)*S/m |
    +--------------+---------------------------+

where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have
used the largest possible value of *d* as this will result in the smallest amount of data download needed
to achieve recovery from an OSD failure.
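
The savings can be checked with a short calculation. The sketch below is not part
of Ceph and the helper name is made up; it simply evaluates the two formulas from
the table above for the *k=8*, *m=4* example, with S = 1GiB stored on the failed OSD::

    def repair_download_gib(S_gib, k, m, d=None, plugin="clay"):
        """Total data downloaded to repair one failed OSD, per the table above."""
        if plugin in ("jerasure", "isa"):
            return k * S_gib                   # each of the k helpers sends a full chunk
        d = d if d is not None else k + m - 1  # clay default: d = k+m-1
        return d * S_gib / (d - k + 1)         # each of the d helpers sends S/(d-k+1)

    print(repair_download_gib(1.0, k=8, m=4, plugin="jerasure"))  # 8.0 GiB
    print(repair_download_gib(1.0, k=8, m=4, d=11))               # 2.75 GiB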

Erasure-code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage::

    $ ceph osd erasure-code-profile set CLAYprofile \
         plugin=clay \
         k=4 m=2 d=5 \
         crush-failure-domain=host
    $ ceph osd pool create claypool 12 12 erasure CLAYprofile

Creating a clay profile
=======================

To create a new clay code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=clay \
         k={data-chunks} \
         m={coding-chunks} \
         [d={helper-chunks}] \
         [scalar_mds={plugin-name}] \
         [technique={technique-name}] \
         [crush-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each of which is stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the greater the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa*, or *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be used
              within the plugin specified by 'scalar_mds'. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', and 'liber8tion' for jerasure; 'reed_sol_van'
              and 'cauchy' for isa; and 'single' and 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile with the same name.

:Type: String
:Required: No.

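The constraint on *d* and its default can be summarized in a short sketch. This is
not part of Ceph and the helper name is made up; it only illustrates how *d*
defaults to *k+m-1* and which values are accepted::

    def clay_helper_chunks(k, m, d=None):
        """Apply the documented default and check k+1 <= d <= k+m-1
        (hypothetical helper, not a Ceph API)."""
        if d is None:
            d = k + m - 1                  # default: largest d, largest savings
        if not (k + 1 <= d <= k + m - 1):
            raise ValueError(f"d={d} must satisfy {k+1} <= d <= {k+m-1}")
        return d

    print(clay_helper_chunks(4, 2))        # 5
    print(clay_helper_chunks(8, 4, d=11))  # 11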

Notion of sub-chunks
====================

The Clay code is able to save disk IO and network bandwidth because it is a
vector code: it can view and manipulate data within a chunk at a finer
granularity, termed a sub-chunk. The number of sub-chunks within
a chunk for a Clay code is given by:

    sub-chunk count = q\ :sup:`(k+m)/q`, where q=d-k+1


During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = sub-chunk count / q

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4*, *d=11*, the sub-chunk count is 64 and the repair sub-chunk count
   is 16. A quarter of a chunk is read from an available OSD for repair of a failed
   chunk.

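These counts follow directly from the formulas above; the sketch below is only an
illustration (not Ceph code) that reproduces the two examples::

    def sub_chunk_counts(k, m, d):
        """Return (sub-chunk count, repair sub-chunk count) with q = d-k+1.
        The example parameters here satisfy q | (k+m), as the formula assumes."""
        q = d - k + 1
        count = q ** ((k + m) // q)
        return count, count // q

    print(sub_chunk_counts(4, 2, 5))   # (8, 4)   -> half of a chunk read on repair
    print(sub_chunk_counts(8, 4, 11))  # (64, 16) -> a quarter of a chunk read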


How to choose a configuration given a workload
==============================================

Only a few of the sub-chunks within a chunk are read during repair, and these
sub-chunks are not necessarily stored consecutively within the chunk. For best
disk IO performance, it is helpful to read contiguous data. For this reason, it
is suggested that you choose the stripe size such that the sub-chunk size is
sufficiently large.

For a given stripe-size (which is fixed based on the workload), choose ``k``, ``m``, ``d`` such that::

    sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...

#. For workloads with a large stripe size, it is easy to choose k, m, d.
   For example, consider a stripe size of 64MB: choosing *k=16*, *m=4* and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB (see the sketch below).
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.

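A quick check of that first example, using the sub-chunk count formula from the
previous section (an illustrative calculation, not Ceph code)::

    def sub_chunk_size_kib(stripe_size_kib, k, m, d):
        """sub-chunk size = stripe-size / (k * sub-chunk count), with q = d-k+1."""
        q = d - k + 1
        sub_chunk_count = q ** ((k + m) // q)
        return stripe_size_kib / (k * sub_chunk_count)

    # 64MB stripe with k=16, m=4, d=19 -> sub-chunk count 1024, sub-chunk size 4KB
    print(sub_chunk_size_kib(64 * 1024, k=16, m=4, d=19))  # 4.0
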
Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
disk IO during single-OSD recovery. However, the focus of LRCs is to keep the
number of OSDs contacted during repair (d) minimal, and this comes at the cost of storage overhead.
The *clay* code has a storage overhead of m/k. An *lrc* stores (k+m)/d parities in
addition to the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc*
can recover from the failure of any ``m`` OSDs.

    +-----------------+----------------------------------+----------------------------------+
    | Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
    +=================+==================================+==================================+
    | (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
    +-----------------+----------------------------------+----------------------------------+
    | (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
    +-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on the single OSD being recovered.
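
The entries in the table can be reproduced from the formulas in this section; the
following sketch is only an illustration of that arithmetic, not Ceph code::

    def lrc_cost(k, m, d):
        """(disk IO in units of S, storage overhead) for LRC: d full chunks are
        read, and (k+m)/d local parities are stored on top of the m parities."""
        return d, (m + (k + m) / d) / k

    def clay_cost(k, m, d):
        """(disk IO in units of S, storage overhead) for CLAY: d helpers each
        send S/(d-k+1), and only the m parities are stored."""
        return d / (d - k + 1), m / k

    print(lrc_cost(10, 4, d=7), clay_cost(10, 4, d=13))   # (7, 0.6) (3.25, 0.4)
    print(lrc_cost(16, 4, d=4), clay_cost(16, 4, d=19))   # (4, 0.5625) (4.75, 0.25)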