================
CLAY code plugin
================

CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings
in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
reading from the *d=8* other OSDs in order to repair it. Recovering, say, 1GiB of data
then requires downloading 8 X 1GiB = 8GiB of information.

However, in the case of the *clay* plugin, *d* is configurable within the limits:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as this provides the greatest savings in terms
of network bandwidth and disk IO. When the *clay* plugin is configured with
*k=8*, *m=4* and *d=11* and a single OSD fails, *d=11* OSDs are contacted and
250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB
of information. More general expressions are provided below. The benefits are substantial
when the repair is carried out for a rack that stores information on the order of
terabytes.

+-------------+---------------------------------------------------------+
| plugin      | total amount of disk IO                                 |
+=============+=========================================================+
|jerasure,isa | :math:`k S`                                             |
+-------------+---------------------------------------------------------+
| clay        | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` |
+-------------+---------------------------------------------------------+

where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have
used the largest possible value of *d* as this will result in the smallest amount of data download needed
to achieve recovery from an OSD failure.
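
As a rough illustration of the table above (the helper functions are hypothetical,
not part of Ceph), the repair traffic for the two earlier examples can be computed
directly from these expressions::

    # Data downloaded to repair one OSD holding S units of data, per the table above:
    #   jerasure/isa: k * S
    #   clay:         d * S / (d - k + 1), with k+1 <= d <= k+m-1 (default d = k+m-1)
    def repair_traffic_jerasure(k, s):
        return k * s

    def repair_traffic_clay(k, m, s, d=None):
        if d is None:
            d = k + m - 1                        # plugin default
        assert k + 1 <= d <= k + m - 1, "d must satisfy k+1 <= d <= k+m-1"
        return d * s / (d - k + 1.0)

    s_gib = 1.0                                  # 1GiB stored on the failed OSD
    print(repair_traffic_jerasure(8, s_gib))     # 8.0  -> 8GiB for k=8, m=4
    print(repair_traffic_clay(8, 4, s_gib))      # 2.75 -> 2.75GiB for k=8, m=4, d=11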

Erasure-code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage::

        $ ceph osd erasure-code-profile set CLAYprofile \
             plugin=clay \
             k=4 m=2 d=5 \
             crush-failure-domain=host
        $ ceph osd pool create claypool erasure CLAYprofile

Creating a clay profile
=======================

To create a new clay code profile::

        ceph osd erasure-code-profile set {name} \
             plugin=clay \
             k={data-chunks} \
             m={coding-chunks} \
             [d={helper-chunks}] \
             [scalar_mds={plugin-name}] \
             [technique={technique-name}] \
             [crush-failure-domain={bucket-type}] \
             [crush-device-class={device-class}] \
             [directory={directory}] \
             [--force]

Where:

``k={data chunks}``

:Description: Each object is split into **data-chunks** parts,
              each of which is stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the greater the savings.
              (A small sketch of this constraint follows the parameter list below.)

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa*, or *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the 'scalar_mds' plugin specified. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', 'liber8tion' for jerasure; 'reed_sol_van',
              'cauchy' for isa; and 'single', 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.
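
The relation between *k*, *m*, and *d* described above can be illustrated with a
small sketch. This is purely illustrative Python (the helper is hypothetical and
not part of Ceph); it only encodes the constraint k+1 <= d <= k+m-1 and the
default *d=k+m-1* documented in this section::

    # Hypothetical helper: validate a (k, m, d) choice for the clay plugin and
    # fill in the documented default d = k + m - 1 when d is not given.
    def clay_params(k, m, d=None):
        if d is None:
            d = k + m - 1                        # default used by the plugin
        if not (k + 1 <= d <= k + m - 1):
            raise ValueError("d must satisfy k+1 <= d <= k+m-1")
        return {"k": k, "m": m, "d": d}

    print(clay_params(4, 2))        # {'k': 4, 'm': 2, 'd': 5}, as in CLAYprofile above
    print(clay_params(8, 4, 11))    # {'k': 8, 'm': 4, 'd': 11}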

Notion of sub-chunks
====================

The Clay code is able to save disk IO and network bandwidth because it is a vector
code: it can view and manipulate data within a chunk at a finer granularity, termed
a sub-chunk. The number of sub-chunks within a chunk for a Clay code is given by:

    sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1`


During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = :math:`\frac{\textrm{sub-chunk count}}{q}`

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and the repair sub-chunk count
   is 16. A quarter of a chunk is read from an available OSD for repair of a failed
   chunk.
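
The sub-chunk counts in the examples above follow directly from the formulas in this
section. The sketch below is only an illustration of those formulas (the helper is
hypothetical, not part of the plugin), and it assumes, as in both examples, that
*k+m* is divisible by *q*::

    # Sub-chunk bookkeeping for a clay code, per the formulas above:
    #   q = d - k + 1
    #   sub-chunk count = q ** ((k + m) / q)
    #   repair sub-chunk count = sub-chunk count / q
    def sub_chunk_counts(k, m, d):
        q = d - k + 1
        total = q ** ((k + m) // q)    # assumes (k + m) divisible by q
        return total, total // q

    print(sub_chunk_counts(4, 2, 5))    # (8, 4)   -> half of a chunk read on repair
    print(sub_chunk_counts(8, 4, 11))   # (64, 16) -> a quarter of a chunk read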


How to choose a configuration given a workload
==============================================

Only a few of the sub-chunks within a chunk are read during repair, and these sub-chunks
are not necessarily stored consecutively within the chunk. For best disk IO
performance, it is helpful to read contiguous data. For this reason, it is suggested that
you choose the stripe size such that the sub-chunk size is sufficiently large.

For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that:

    sub-chunk size = :math:`\frac{\textrm{stripe-size}}{k \times \textrm{sub-chunk count}}` = 4KB, 8KB, 12KB ...

#. For large-size workloads, for which the stripe size is large, it is easy to choose *k*, *m*, *d*.
   For example, consider a stripe-size of 64MB; choosing *k=16*, *m=4* and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB.
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.
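
As a quick check of the guideline above (again an illustration only; the helper is
hypothetical), the sub-chunk size for the large-workload example can be computed as
the stripe-size divided by *k* times the sub-chunk count::

    # sub-chunk size = stripe-size / (k * sub-chunk count), with the sub-chunk
    # count taken from the formula q ** ((k + m) / q), q = d - k + 1.
    def sub_chunk_size(stripe_size, k, m, d):
        q = d - k + 1
        sub_chunk_count = q ** ((k + m) // q)
        return stripe_size / (k * sub_chunk_count)

    stripe = 64 * 1024 * 1024                  # 64MB stripe
    print(sub_chunk_size(stripe, 16, 4, 19))   # 4096.0 -> 4KB sub-chunks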

Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
disk IO during single-OSD recovery. However, the focus of LRCs is to keep the
number of OSDs contacted during repair (d) minimal, and this comes at the cost of storage overhead.
The *clay* code has a storage overhead of m/k. An *lrc* stores (k+m)/d parities in
addition to the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc*
can recover from the failure of any ``m`` OSDs.

+-----------------+----------------------------------+----------------------------------+
| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+=================+==================================+==================================+
| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+-----------------+----------------------------------+----------------------------------+
| (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
+-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on a single OSD being recovered.
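
The numbers in the table above follow from the expressions given in this section and
can be reproduced with a short sketch (illustrative only; the helper functions are
hypothetical and ``S`` is normalised to 1)::

    # Disk IO and storage overhead, per the formulas in this section:
    #   clay: disk IO = d * S / (d - k + 1), overhead = m / k
    #   lrc : disk IO = d * S,               overhead = (m + (k + m) / d) / k
    def clay_cost(k, m, d, s=1.0):
        return d * s / (d - k + 1), m / float(k)

    def lrc_cost(k, m, d, s=1.0):
        return d * s, (m + (k + m) / float(d)) / k

    print(lrc_cost(10, 4, 7))     # (7.0, 0.6)
    print(clay_cost(10, 4, 13))   # (3.25, 0.4)
    print(lrc_cost(16, 4, 4))     # (4.0, 0.5625)
    print(clay_cost(16, 4, 19))   # (4.75, 0.25)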