.. _ecpool:

=============
Erasure code
=============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks (this
multiple copying is the "replication").

In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure-coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

Erasure code profiles
---------------------

The default erasure code profile can sustain the loss of two OSDs. This erasure
code profile is equivalent to a replicated pool of size three, but requires
only 2TB to store 1TB of data instead of 3TB. The default profile can be
displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

.. note::
   The default erasure-coded pool, the profile of which is displayed here, is
   not the same as the simplest erasure-coded pool.

   The default erasure-coded pool has two data chunks (k) and two coding chunks
   (m). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (k) and one coding chunk
   (m). The profile of the simplest erasure-coded pool is "k=2 m=1".

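The space arithmetic above can be sanity-checked with a few lines of Python (an
illustrative sketch, not Ceph code; ``ec_overhead`` is a made-up helper): an
erasure-code profile consumes (K+M)/K units of raw space per unit of user data,
whereas a replicated pool of size N consumes N units.

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw space consumed per unit of user data for an EC profile."""
    return (k + m) / k

# Default profile (k=2, m=2): 2TB of raw space per 1TB of data,
# versus 3TB for a replicated pool of size three.
assert ec_overhead(2, 2) == 2.0

# Simplest (RAID5-like) profile (k=2, m=1): 1.5TB per 1TB of data.
assert ec_overhead(2, 1) == 1.5
```
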
Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a different profile, you must create a new pool with a different (and presumably
more carefully considered) profile, and then move all objects from the
wrongly configured pool to the newly created pool.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain*, because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data *chunks* (*K=3*) and two
additional coding *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting creates a CRUSH rule that ensures
that no two *chunks* are stored in the same rack.

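The division step can be sketched in a few lines of Python (a simplification
for illustration; ``split_into_data_chunks`` is a hypothetical helper, and in
Ceph the chunking and encoding actually happen inside the erasure-code plugin):

```python
def split_into_data_chunks(payload: bytes, k: int) -> list[bytes]:
    """Divide an object's payload into k equally sized data chunks."""
    chunk_len = len(payload) // k
    return [payload[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

# The NYAN object's payload split into K=3 data chunks; the plugin
# would then compute the M=2 coding chunks (YXY and QGC in the figure).
chunks = split_into_data_chunks(b"ABCDEFGHI", k=3)
assert chunks == [b"ABC", b"DEF", b"GHI"]
```
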
.. ditaa::

                              +-------------------+
                         name |       NYAN        |
                              +-------------------+
                      content |     ABCDEFGHI     |
                              +--------+----------+
                                       |
                                       |
                                       v
                                +------+------+
                   +------------+ encode(3,2) +------------+
                   |            +--+---+---+--+            |
                   |               |   |   |               |
                   |         +-----+   |   +-----+         |
                   |         |         |         |         |
                +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
           name | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
                +------+  +------+  +------+  +------+  +------+
          shard |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
                +------+  +------+  +------+  +------+  +------+
        content | ABC  |  | DEF  |  | GHI  |  | YXY  |  | QGC  |
                +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
                   |         |         |         |         |
                   |         |         v         |         |
                   |         |      +--+---+     |         |
                   |         |      | OSD1 |     |         |
                   |         |      +------+     |         |
                   |         |                   |         |
                   |         |      +------+     |         |
                   |         +----->| OSD2 |     |         |
                   |                +------+     |         |
                   |                             |         |
                   |                +------+     |         |
                   |                | OSD3 |<----+         |
                   |                +------+               |
                   |                                       |
                   |                +------+               |
                   |                | OSD4 |<--------------+
                   |                +------+
                   |
                   |                +------+
                   +--------------->| OSD5 |
                                    +------+

More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.

Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure-coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.

192 | |
193 | ||
194 | Erasure coded pool and cache tiering | |
195 | ------------------------------------ | |
196 | ||
197 | Erasure coded pools require more resources than replicated pools and | |
39ae355f | 198 | lack some functionality such as omap. To overcome these |
7c673cae FG |
199 | limitations, one can set up a `cache tier <../cache-tiering>`_ |
200 | before the erasure coded pool. | |
201 | ||
39ae355f TL |
202 | For instance, if the pool *hot-storage* is made of fast storage: |
203 | ||
204 | .. prompt:: bash $ | |
7c673cae | 205 | |
39ae355f TL |
206 | ceph osd tier add ecpool hot-storage |
207 | ceph osd tier cache-mode hot-storage writeback | |
208 | ceph osd tier set-overlay ecpool hot-storage | |
7c673cae FG |
209 | |
210 | will place the *hot-storage* pool as tier of *ecpool* in *writeback* | |
211 | mode so that every write and read to the *ecpool* are actually using | |
212 | the *hot-storage* and benefit from its flexibility and speed. | |
213 | ||
214 | More information can be found in the `cache tiering | |
39ae355f TL |
215 | <../cache-tiering>`_ documentation. Note however that cache tiering |
216 | is deprecated and may be removed completely in a future release. | |
7c673cae | 217 | |
Erasure coded pool recovery
---------------------------

If an erasure-coded pool loses some data shards, it must recover them from the
others. This recovery involves reading from the remaining shards,
reconstructing the data, and writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as there
are at least *K* shards available. (With fewer than *K* shards, you have
actually lost data!)

Prior to Octopus, erasure-coded pools required that at least ``min_size``
shards be available, even if ``min_size`` was greater than ``K``. This
conservative decision was made out of an abundance of caution when designing
the new pool mode. As a result, pools that had lost OSDs, but in which no data
was completely lost, were unable to recover and go active without manual
intervention to temporarily change the ``min_size`` setting. We recommend that
``min_size`` be ``K+2`` or greater to prevent loss of writes and data.

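The recovery rules described in this section can be condensed into a small
predicate (illustrative Python, not Ceph code; ``can_recover`` and its
parameters are made-up names for the quantities discussed above):

```python
def can_recover(available_shards: int, k: int, min_size: int,
                octopus_or_later: bool) -> bool:
    """Whether an EC pool can rebuild its missing shards and go active."""
    if available_shards < k:
        return False  # fewer than K shards: data has actually been lost
    if octopus_or_later:
        return True   # since Octopus, K available shards are sufficient
    # Before Octopus, at least min_size shards had to be available,
    # even when min_size was greater than K.
    return available_shards >= min_size

# k=8, m=4, min_size=k+2=10: losing four OSDs leaves exactly 8 shards.
assert can_recover(8, k=8, min_size=10, octopus_or_later=True)
assert not can_recover(8, k=8, min_size=10, octopus_or_later=False)
```

The pre-Octopus case in the sketch is exactly the situation that required
manually lowering ``min_size`` to let the pool recover.
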
Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size:
   data chunks, which can be concatenated to reconstruct the original object,
   and coding chunks, which can be used to rebuild a lost chunk.

*K*
   The number of data *chunks*, that is, the number of *chunks* into which the
   original object is divided. For example, if *K* = 2, a 10KB object is
   divided into two *chunks* of 5KB each.

*M*
   The number of coding *chunks*, that is, the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*, then 2
   OSDs can be out of service without losing data.

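As a toy illustration of how a coding chunk rebuilds a lost chunk, the
following sketch implements the simplest possible erasure code: a single XOR
parity chunk (*M* = 1). This is not what Ceph's default ``jerasure`` plugin
does internally (it uses Reed-Solomon, which generalizes the idea to *M* > 1);
it only shows the principle:

```python
def xor_parity(chunks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized chunks; serves as an M=1 coding chunk."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return parity

data = [b"ABC", b"DEF", b"GHI"]  # K=3 data chunks
coding = xor_parity(data)        # the single M=1 coding chunk

# If chunk 2 (b"DEF") is lost, XORing the survivors with the coding
# chunk rebuilds it, because XOR is its own inverse.
rebuilt = xor_parity([data[0], data[2], coding])
assert rebuilt == b"DEF"
```

Losing any one chunk (data or coding) is survivable here; losing two is not,
which is why profiles meant to tolerate multiple failures use *M* of 2 or more.
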

Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay