================
CLAY code plugin
================

CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings
in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
reading from the *d=8* other OSDs in order to repair it. Recovering, say, 1GiB of data
therefore requires a download of 8 X 1GiB = 8GiB of information.

However, in the case of the *clay* plugin *d* is configurable within the limits:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as it provides the greatest savings in terms
of network bandwidth and disk IO. In the case of the *clay* plugin configured with
*k=8*, *m=4* and *d=11*, when a single OSD fails, d=11 OSDs are contacted and
250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB
of information to recover the same 1GiB. More general parameters are provided in the table
below. The benefits are substantial when the repair is carried out for a rack that stores
information on the order of terabytes.

+--------------+---------------------------+
| plugin       | total amount of disk IO   |
+==============+===========================+
| jerasure,isa | k*S                       |
+--------------+---------------------------+
| clay         | d*S/(d-k+1) = (k+m-1)*S/m |
+--------------+---------------------------+

where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have
used the largest possible value of *d* as this will result in the smallest amount of data download needed
to achieve recovery from an OSD failure.

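The savings can be verified with a quick calculation. The following Python snippet (an
illustration only, not part of Ceph) evaluates the two formulas from the table above and
reproduces the 8GiB versus 2.75GiB figures from the example::

    def repair_download(k, m, d, s):
        """Total bytes read to repair one failed OSD that stored *s* bytes.

        Formulas from the table above; illustration only, not code used
        by the clay plugin itself.
        """
        jerasure_isa = k * s            # MDS codes read k full chunks
        clay = d * s / (d - k + 1)      # clay reads a fraction of each of d chunks
        return jerasure_isa, clay

    GiB = 1024 ** 3
    print(repair_download(k=8, m=4, d=11, s=1 * GiB))
    # (8589934592, 2952790016.0), i.e. 8GiB vs 2.75GiB
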
Erasure-code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage::

     $ ceph osd erasure-code-profile set CLAYprofile \
          plugin=clay \
          k=4 m=2 d=5 \
          crush-failure-domain=host
     $ ceph osd pool create claypool 12 12 erasure CLAYprofile


Creating a clay profile
=======================

To create a new clay code profile::

     ceph osd erasure-code-profile set {name} \
          plugin=clay \
          k={data-chunks} \
          m={coding-chunks} \
          [d={helper-chunks}] \
          [scalar_mds={plugin-name}] \
          [technique={technique-name}] \
          [crush-failure-domain={bucket-type}] \
          [directory={directory}] \
          [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each of which is stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the greater the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa* or *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the ``scalar_mds`` plugin specified. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good' and 'liber8tion' for jerasure; 'reed_sol_van'
              and 'cauchy' for isa; and 'single' and 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default


``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host**, no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.


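When scripting profile creation, it can help to sanity-check the parameters before invoking
the command above. The Python sketch below is an illustration only (the helper is not part of
Ceph); it encodes the defaults and the k+1 <= d <= k+m-1 constraint documented above::

    VALID_TECHNIQUES = {
        # taken from the technique descriptions above
        "jerasure": {"reed_sol_van", "reed_sol_r6_op", "cauchy_orig",
                     "cauchy_good", "liber8tion"},
        "isa": {"reed_sol_van", "cauchy"},
        "shec": {"single", "multiple"},
    }

    def clay_profile(k, m, d=None, scalar_mds="jerasure", technique=None):
        """Return clay profile parameters with the documented defaults applied."""
        if d is None:
            d = k + m - 1                    # default: the largest allowed d
        if not (k + 1 <= d <= k + m - 1):
            raise ValueError("d must satisfy k+1 <= d <= k+m-1")
        if technique is None:
            technique = "single" if scalar_mds == "shec" else "reed_sol_van"
        if technique not in VALID_TECHNIQUES[scalar_mds]:
            raise ValueError(f"{technique!r} is not valid for {scalar_mds}")
        return {"plugin": "clay", "k": k, "m": m, "d": d,
                "scalar_mds": scalar_mds, "technique": technique}

    print(clay_profile(k=4, m=2))
    # {'plugin': 'clay', 'k': 4, 'm': 2, 'd': 5,
    #  'scalar_mds': 'jerasure', 'technique': 'reed_sol_van'}
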
Notion of sub-chunks
====================

The Clay code is a vector code: it can view and manipulate data within a chunk
at a finer granularity, termed a sub-chunk, and this is what enables its savings in
disk IO and network bandwidth. The number of sub-chunks within
a chunk for a Clay code is given by:

    sub-chunk count = q\ :sup:`(k+m)/q`, where q = d-k+1


During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = sub-chunk count / q

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4*, *d=11*, the sub-chunk count is 64 and the repair sub-chunk count
   is 16. A quarter of a chunk is read from an available OSD for repair of a failed
   chunk.
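
These counts follow directly from the two formulas above. A short Python check (an
illustration, not part of Ceph; it assumes q divides k+m evenly, as in both examples)
reproduces them::

    def sub_chunk_counts(k, m, d):
        """Sub-chunk count and repair sub-chunk count for a clay code,
        using the formulas above (assumes q divides k+m evenly)."""
        q = d - k + 1
        sub_chunks = q ** ((k + m) // q)
        return sub_chunks, sub_chunks // q   # (total, read during repair)

    print(sub_chunk_counts(k=4, m=2, d=5))    # (8, 4)
    print(sub_chunk_counts(k=8, m=4, d=11))   # (64, 16)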


How to choose a configuration given a workload
==============================================

Only a few of the sub-chunks within a chunk are read during repair, and these sub-chunks
are not necessarily stored consecutively within a chunk. For best disk IO
performance, it is helpful to read contiguous data. For this reason, it is suggested that
you choose the stripe size such that the sub-chunk size is sufficiently large.

For a given stripe size (which is fixed by the workload), choose ``k``, ``m``, ``d`` such that::

    sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...

#. For large workloads for which the stripe size is large, it is easy to choose k, m, d.
   For example, consider a stripe size of 64MB; choosing *k=16*, *m=4* and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB (see the sketch below).
#. For small workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.

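To check a candidate configuration against this rule, the sub-chunk size can be computed
directly. The snippet below (an illustration only; it reuses the sub-chunk count formula
from the previous section) reproduces the 4KB figure from the 64MB example::

    def sub_chunk_size(stripe_size, k, m, d):
        """Sub-chunk size in bytes = stripe-size / (k * sub-chunk count)."""
        q = d - k + 1
        sub_chunk_count = q ** ((k + m) // q)
        return stripe_size / (k * sub_chunk_count)

    MiB = 1024 ** 2
    print(sub_chunk_size(64 * MiB, k=16, m=4, d=19))   # 4096.0 bytes, i.e. 4KB
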
Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and disk IO
during single-OSD recovery. However, the focus of LRCs is to keep the number of OSDs
contacted during repair (d) to a minimum, and this comes at the cost of storage overhead.
The *clay* code has a storage overhead of m/k. An *lrc* stores (k+m)/d additional parities
on top of the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k. Both *clay*
and *lrc* can recover from the failure of any ``m`` OSDs.

+-----------------+----------------------------------+----------------------------------+
| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+=================+==================================+==================================+
| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+-----------------+----------------------------------+----------------------------------+
| (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
+-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on a single OSD being recovered.
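
The table entries can be recomputed from the formulas given above. A short Python check
(an illustration only, not part of Ceph) produces the same numbers::

    def lrc_vs_clay(k, m, d_lrc, d_clay):
        """Disk IO (as a multiple of S) and storage overhead for LRC and CLAY,
        using the formulas in this section (illustration only)."""
        lrc = (d_lrc, (m + (k + m) / d_lrc) / k)     # LRC reads d_lrc full chunks
        clay = (d_clay / (d_clay - k + 1), m / k)    # clay reads a fraction of d_clay chunks
        return lrc, clay

    print(lrc_vs_clay(k=10, m=4, d_lrc=7, d_clay=13))   # ((7, 0.6), (3.25, 0.4))
    print(lrc_vs_clay(k=16, m=4, d_lrc=4, d_clay=19))   # ((4, 0.5625), (4.75, 0.25))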