================
CLAY code plugin
================

CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings
in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
reading from the *d=8* surviving OSDs to repair it, and recovery of, say, 1GiB of data needs
a download of 8 X 1GiB = 8GiB of information.

However, in the case of the *clay* plugin *d* is configurable within the limits:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as this provides the greatest savings in terms
of network bandwidth and disk IO. In the case of the *clay* plugin configured with
*k=8*, *m=4*, and *d=11*, when a single OSD fails, d=11 OSDs are contacted and
250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB.
More general parameters are provided below. The benefits are substantial
when the repair is carried out for a rack that stores information on the order of
terabytes.

+-------------+---------------------------------------------------------+
| plugin      | total amount of disk IO                                 |
+=============+=========================================================+
|jerasure,isa | :math:`k S`                                             |
+-------------+---------------------------------------------------------+
| clay        | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` |
+-------------+---------------------------------------------------------+

where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have
used the largest possible value of *d*, as this will result in the smallest amount of data download needed
to achieve recovery from an OSD failure.

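The figures quoted earlier (8GiB for *jerasure*, 2.75GiB for *clay*) follow directly from the formulas in the table above. The following Python sketch is illustrative only, with hypothetical helper names that are not part of any Ceph API:

```python
# Total data downloaded to repair one failed OSD holding S bytes,
# using the two formulas from the table above. Illustrative only.

def jerasure_repair_io(k, s):
    # jerasure/isa read a full chunk of size S from each of k survivors
    return k * s

def clay_repair_io(k, m, s, d=None):
    # clay reads only S / (d - k + 1) bytes from each of d helper OSDs
    if d is None:
        d = k + m - 1          # default d, giving the greatest savings
    return d * s / (d - k + 1)

GiB = 1024 ** 3
print(jerasure_repair_io(8, GiB) / GiB)        # 8.0  (8GiB, as above)
print(clay_repair_io(8, 4, GiB, d=11) / GiB)   # 2.75 (2.75GiB, as above)
```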
Erasure-code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage:

.. prompt:: bash $

   ceph osd erasure-code-profile set CLAYprofile \
      plugin=clay \
      k=4 m=2 d=5 \
      crush-failure-domain=host
   ceph osd pool create claypool erasure CLAYprofile

Creating a clay profile
=======================

To create a new clay code profile:

.. prompt:: bash $

   ceph osd erasure-code-profile set {name} \
      plugin=clay \
      k={data-chunks} \
      m={coding-chunks} \
      [d={helper-chunks}] \
      [scalar_mds={plugin-name}] \
      [technique={technique-name}] \
      [crush-failure-domain={bucket-type}] \
      [crush-device-class={device-class}] \
      [directory={directory}] \
      [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each of which is stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding-chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the greater the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa*, or *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the ``scalar_mds`` plugin specified. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', and 'liber8tion' for jerasure; 'reed_sol_van'
              and 'cauchy' for isa; and 'single' and 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the CRUSH bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default


``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host**, no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the CRUSH device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.

Notion of sub-chunks
====================

The Clay code is able to save disk IO and network bandwidth because it
is a vector code: it is able to view and manipulate data within a chunk
at a finer granularity, termed a sub-chunk. The number of sub-chunks within
a chunk for a Clay code is given by:

    sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1`

During the repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = :math:`\frac{\text{sub-chunk count}}{q}`

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4*, *d=11*, the sub-chunk count is 64 and the repair sub-chunk count
   is 16. A quarter of a chunk is read from an available OSD for the repair of a failed
   chunk.

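Both examples above can be verified with a short Python sketch. It is illustrative only (the function names are hypothetical, not Ceph APIs) and assumes that *k+m* is divisible by *q*, as it is in both examples:

```python
# Sub-chunk arithmetic, using the formulas given above.

def sub_chunk_count(k, m, d):
    # q = d - k + 1; sub-chunk count = q^((k+m)/q)
    q = d - k + 1
    assert k + 1 <= d <= k + m - 1, "d outside the allowed range"
    assert (k + m) % q == 0, "sketch assumes q divides k+m"
    return q ** ((k + m) // q)

def repair_sub_chunk_count(k, m, d):
    # only a 1/q fraction of the sub-chunks is read during repair
    return sub_chunk_count(k, m, d) // (d - k + 1)

print(sub_chunk_count(4, 2, 5), repair_sub_chunk_count(4, 2, 5))    # 8 4
print(sub_chunk_count(8, 4, 11), repair_sub_chunk_count(8, 4, 11))  # 64 16
```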
How to choose a configuration given a workload
==============================================

Only a few of the sub-chunks within a chunk are read during repair, and these sub-chunks
are not necessarily stored consecutively within a chunk. For best disk IO
performance, it is helpful to read contiguous data. For this reason, it is suggested that
you choose the stripe size such that the sub-chunk size is sufficiently large.

For a given stripe size (fixed based on the workload), choose ``k``, ``m``, ``d`` such that:

    sub-chunk size = :math:`\frac{\text{stripe size}}{k \times \text{sub-chunk count}}` = 4KB, 8KB, 12KB ...

#. For large-size workloads for which the stripe size is large, it is easy to choose k, m, d.
   For example, consider a stripe size of 64MB: choosing *k=16*, *m=4*, and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB.
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.

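The first example above can be checked with the same arithmetic. This Python sketch is illustrative only (the function name is hypothetical, not a Ceph API), and assumes *k+m* is divisible by *q*:

```python
# sub-chunk size = stripe size / (k * sub-chunk count), per the rule above.

def sub_chunk_size(stripe_size, k, m, d):
    q = d - k + 1
    count = q ** ((k + m) // q)   # sub-chunk count formula from above
    return stripe_size // (k * count)

MB = 1024 * 1024
# 64MB stripe with k=16, m=4, d=19 -> 1024 sub-chunks of 4KB each
print(sub_chunk_size(64 * MB, 16, 4, 19))  # 4096
```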
Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network
bandwidth and disk IO during single-OSD recovery. However, the focus of LRCs is to keep the
number of OSDs contacted during repair (d) minimal, and this comes at the cost of storage overhead.
The *clay* code has a storage overhead of m/k. An *lrc* stores (k+m)/d parities in
addition to the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc*
can recover from the failure of any ``m`` OSDs.

+-----------------+----------------------------------+----------------------------------+
| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+=================+==================================+==================================+
| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+-----------------+----------------------------------+----------------------------------+
| (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
+-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on the single OSD being recovered.
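
Each table entry follows from the formulas stated in this section. As a sanity check, here is an illustrative Python sketch (the helper names are hypothetical, not Ceph APIs):

```python
# Repair disk IO (as a multiple of S) and storage overhead for LRC
# and CLAY, using the formulas given in the text above.

def lrc_stats(k, m, d):
    # LRC reads d full chunks; it stores (k+m)/d local parities
    # in addition to the m global parities
    return d, (m + (k + m) / d) / k

def clay_stats(k, m, d):
    # CLAY reads a 1/(d-k+1) fraction of a chunk from each of d helpers
    return d / (d - k + 1), m / k

print(lrc_stats(10, 4, 7), clay_stats(10, 4, 13))   # (7, 0.6) (3.25, 0.4)
print(lrc_stats(16, 4, 4), clay_stats(16, 4, 19))   # (4, 0.5625) (4.75, 0.25)
```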