================
CLAY code plugin
================

CLAY (short for coupled-layer) codes are erasure codes designed to bring about
significant savings in network bandwidth and disk IO when a failed
node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
reading from the *d=8* others to repair. Recovering, say, 1 GiB of data
therefore requires downloading 8 X 1 GiB = 8 GiB of information.

However, in the case of the *clay* plugin *d* is configurable within the limits:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as this provides the greatest
savings in network bandwidth and disk IO. In the case of the *clay* plugin
configured with *k=8*, *m=4* and *d=11*, when a single OSD fails the *d=11*
surviving OSDs are contacted and 250 MiB is downloaded from each of them,
resulting in a total download of 11 X 250 MiB = 2.75 GiB of information. The
general formulas are provided below. The benefits are substantial when the
repair is carried out for a rack that stores information on the order of
terabytes.

    +-------------+---------------------------------------------------------+
    | plugin      | total amount of disk IO                                 |
    +=============+=========================================================+
    |jerasure,isa | :math:`k S`                                             |
    +-------------+---------------------------------------------------------+
    | clay        | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` |
    +-------------+---------------------------------------------------------+

where *S* is the amount of data stored on a single OSD undergoing repair. In the
table above, we have used the largest possible value of *d*, as this results in
the smallest amount of data that must be downloaded to recover from an OSD
failure.
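
For instance, plugging the *k=8*, *m=4*, *d=11* example above into the *clay* row
of the table (with *S* = 1 GiB) reproduces the 2.75 GiB figure quoted in the
introduction:

    total disk IO = :math:`\frac{d S}{d - k + 1} = \frac{11 \times 1\,\text{GiB}}{11 - 8 + 1} = 2.75\,\text{GiB}`

whereas *jerasure* or *isa* would need to read :math:`k S = 8\,\text{GiB}`.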

Erasure-code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage:

.. prompt:: bash $

   ceph osd erasure-code-profile set CLAYprofile \
        plugin=clay \
        k=4 m=2 d=5 \
        crush-failure-domain=host
   ceph osd pool create claypool erasure CLAYprofile
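
To confirm what the new profile contains (including the defaults filled in for
parameters that were not specified, such as ``scalar_mds`` and ``technique``),
inspect it with:

.. prompt:: bash $

   ceph osd erasure-code-profile get CLAYprofile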


Creating a clay profile
=======================

To create a new clay code profile:

.. prompt:: bash $

   ceph osd erasure-code-profile set {name} \
        plugin=clay \
        k={data-chunks} \
        m={coding-chunks} \
        [d={helper-chunks}] \
        [scalar_mds={plugin-name}] \
        [technique={technique-name}] \
        [crush-failure-domain={bucket-type}] \
        [crush-device-class={device-class}] \
        [directory={directory}] \
        [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each of which is stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the better the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa*, or *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the ``scalar_mds`` plugin specified. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', and 'liber8tion' for jerasure; 'reed_sol_van' and
              'cauchy' for isa; and 'single' and 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default


``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.

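As an illustration of the optional parameters, a profile that explicitly selects
the *isa* building block with its *cauchy* technique might look like the
following sketch (the profile name ``CLAYisa`` is only an illustration; adjust
the failure domain to suit your cluster):

.. prompt:: bash $

   ceph osd erasure-code-profile set CLAYisa \
        plugin=clay \
        k=4 m=2 d=5 \
        scalar_mds=isa \
        technique=cauchy \
        crush-failure-domain=host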

Notion of sub-chunks
====================

The Clay code is able to save disk IO and network bandwidth because it is a
vector code: it is able to view and manipulate data within a chunk at a finer
granularity, termed a sub-chunk. The number of sub-chunks within a chunk for a
Clay code is given by:

    sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1`


During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = :math:`\frac{\text{sub-chunk count}}{q}`

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is
   read during repair (the arithmetic is worked out below).
#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and the repair sub-chunk
   count is 16. A quarter of a chunk is read from an available OSD for repair
   of a failed chunk.

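The arithmetic behind the first example is simply:

    :math:`q = d - k + 1 = 2`, sub-chunk count = :math:`q^{\frac{k+m}{q}} = 2^{\frac{4+2}{2}} = 8`,
    repair sub-chunk count = :math:`\frac{8}{2} = 4`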


How to choose a configuration given a workload
==============================================

During repair, only a few of the sub-chunks within a chunk are read, and these
sub-chunks are not necessarily stored consecutively within the chunk. For best
disk IO performance, it is helpful to read contiguous data. For this reason, it
is suggested that you choose the stripe size such that the sub-chunk size is
sufficiently large.

For a given stripe size (fixed based on the workload), choose ``k``, ``m``, ``d`` such that:

    sub-chunk size = :math:`\frac{\text{stripe size}}{k \times \text{sub-chunk count}}` = 4KB, 8KB, 12KB ...

#. For large-size workloads, for which the stripe size is large, it is easy to
   choose *k*, *m*, and *d*. For example, with a stripe size of 64MB, choosing
   *k=16*, *m=4* and *d=19* results in a sub-chunk count of 1024 and a
   sub-chunk size of 4KB (the arithmetic is shown below).
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides
   both network and disk IO benefits.

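The numbers in the first item follow from the formulas above:

    :math:`q = 19 - 16 + 1 = 4`, sub-chunk count = :math:`4^{\frac{16+4}{4}} = 4^{5} = 1024`,
    sub-chunk size = :math:`\frac{64\,\text{MB}}{16 \times 1024}` = 4KB
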
Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
disk IO during single-OSD recovery. However, the focus of LRC is to keep the
number of OSDs contacted during repair (d) minimal, but this comes at the cost
of storage overhead. The *clay* code has a storage overhead of m/k. An *lrc*
stores (k+m)/d parities in addition to the ``m`` parities, resulting in a
storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc* can recover from the
failure of any ``m`` OSDs.

    +-----------------+----------------------------------+----------------------------------+
    | Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
    +=================+==================================+==================================+
    | (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
    +-----------------+----------------------------------+----------------------------------+
    | (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
    +-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on a single OSD being recovered.
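
As a sanity check, the first row of the table follows from the formulas above:
for *lrc* with *d=7*, the disk IO is :math:`d S = 7S` and the storage overhead is
:math:`\frac{m + (k+m)/d}{k} = \frac{4 + 14/7}{10} = 0.6`; for *clay* with
*d=13*, the disk IO is :math:`\frac{d S}{d - k + 1} = \frac{13 S}{4} = 3.25 S`
and the storage overhead is :math:`\frac{m}{k} = 0.4`.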