.. _ecpool:

==============
 Erasure code
==============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".

By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

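
As a toy illustration of how a coding chunk can be used to rebuild a lost data
chunk, the following Python sketch uses a single XOR parity chunk (two data
chunks and one coding chunk). This is only an assumption-laden teaching aid:
it is not how Ceph's erasure-code plugins compute coding chunks, and the helper
name ``xor_bytes`` is hypothetical.

.. code-block:: python

   def xor_bytes(a: bytes, b: bytes) -> bytes:
       """Return the byte-wise XOR of two equal-length byte strings."""
       if len(a) != len(b):
           raise ValueError("chunks must have the same length")
       return bytes(x ^ y for x, y in zip(a, b))


   # Two data chunks (K=2) and one XOR coding chunk (M=1).
   data_chunks = [b"ABCD", b"EFGH"]
   coding_chunk = xor_bytes(data_chunks[0], data_chunks[1])

   # Pretend that the first data chunk was lost, and rebuild it from the
   # surviving data chunk and the coding chunk.
   rebuilt = xor_bytes(coding_chunk, data_chunks[1])
   print(rebuilt)  # b'ABCD'

A single XOR chunk can rebuild only one lost chunk. Tolerating the loss of
more than one chunk requires a more general code.
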

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.

Creating a sample erasure-coded pool
------------------------------------

The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

Erasure-code profiles
---------------------

The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. In terms of durability, this profile is equivalent to a
replicated pool of size three, but it has lower storage requirements: instead
of requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

.. note::
   The profile just displayed is for the *default* erasure-coded pool, not the
   *simplest* erasure-coded pool. These two pools are not the same:

   The default erasure-coded pool has two data chunks (K) and two coding chunks
   (M). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (K) and one coding chunk
   (M). The profile of the simplest erasure-coded pool is "k=2 m=1".

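
The storage figures quoted above can be checked with a short Python sketch. It
assumes that an erasure-coded pool consumes (K+M)/K units of raw capacity per
unit of stored data and that a replicated pool consumes ``size`` units, and it
ignores metadata and allocation overhead. The function names are hypothetical.

.. code-block:: python

   def ec_raw_per_stored(k: int, m: int) -> float:
       """Raw capacity used per unit of stored data with K data and M coding chunks."""
       return (k + m) / k


   def replicated_raw_per_stored(size: int) -> float:
       """Raw capacity used per unit of stored data with `size` full copies."""
       return float(size)


   print(ec_raw_per_stored(2, 2))       # 2.0 (default profile: 2TB raw per 1TB)
   print(ec_raw_per_stored(2, 1))       # 1.5 (simplest profile)
   print(replicated_raw_per_stored(3))  # 3.0 (replicated pool of size three)
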

Choosing the right profile is important because the profile of a pool cannot
be modified after the pool has been created. If you find that you need an
erasure-coded pool with a different profile, you must create a new pool with
that profile and then move all objects from the wrongly configured pool to the
newly created pool.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67% (two coding
chunks for every three data chunks), the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data chunks (*K=3*), and two
additional coding chunks will be created (*M=2*). The value of *M* defines how
many OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting will create a CRUSH rule that ensures that
no two *chunks* are stored in the same rack.

.. ditaa::
                             +-------------------+
                        name |       NYAN        |
                             +-------------------+
                     content |     ABCDEFGHI     |
                             +--------+----------+
                                      |
                                      |
                                      v
                               +------+------+
               +---------------+ encode(3,2) +-----------+
               |               +--+--+---+---+           |
               |                  |  |   |               |
               |          +-------+  |   +-----+         |
               |          |          |         |         |
            +--v---+   +--v---+   +--v---+  +--v---+  +--v---+
       name | NYAN |   | NYAN |   | NYAN |  | NYAN |  | NYAN |
            +------+   +------+   +------+  +------+  +------+
      shard |  1   |   |  2   |   |  3   |  |  4   |  |  5   |
            +------+   +------+   +------+  +------+  +------+
    content | ABC  |   | DEF  |   | GHI  |  | YXY  |  | QGC  |
            +--+---+   +--+---+   +--+---+  +--+---+  +--+---+
               |          |          |         |         |
               |          |          v         |         |
               |          |       +--+---+     |         |
               |          |       | OSD1 |     |         |
               |          |       +------+     |         |
               |          |                    |         |
               |          |       +------+     |         |
               |          +------>| OSD2 |     |         |
               |                  +------+     |         |
               |                               |         |
               |                  +------+     |         |
               |                  | OSD3 |<----+         |
               |                  +------+               |
               |                                         |
               |                  +------+               |
               |                  | OSD4 |<--------------+
               |                  +------+
               |
               |                  +------+
               +----------------->| OSD5 |
                                  +------+

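
The division of the *NYAN* object into data chunks, as described above, can be
sketched in Python. This is a simplified illustration that assumes contiguous
splitting with zero padding. The actual on-disk layout used by Ceph may differ,
and the function name ``split_into_data_chunks`` is hypothetical. Computing the
coding chunks (``YXY`` and ``QGC`` in the diagram) is left to the erasure-code
plugin and is not attempted here.

.. code-block:: python

   from typing import List


   def split_into_data_chunks(payload: bytes, k: int) -> List[bytes]:
       """Split payload into k equal-size chunks, zero-padding the last one."""
       if k < 1:
           raise ValueError("k must be at least 1")
       chunk_size = -(-len(payload) // k)  # ceiling division
       padded = payload.ljust(chunk_size * k, b"\0")
       return [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]


   k, m = 3, 2
   data_chunks = split_into_data_chunks(b"ABCDEFGHI", k)
   print(data_chunks)  # [b'ABC', b'DEF', b'GHI']

   # With crush-failure-domain=rack, each of the K+M chunks needs its own rack.
   print(k + m)        # 5
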
More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

Overwrites can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must configure RBD and CephFS to store their data in an
erasure-coded pool and their metadata in a replicated pool. For RBD, this
means using the erasure-coded pool as the ``--data-pool`` during image
creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.


Erasure-coded pools and cache tiering
-------------------------------------

.. note:: Cache tiering is deprecated in Reef.

Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example, omap).
To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_
in front of the erasure-coded pool.

For example, if the pool *hot-storage* is made of fast storage, the following commands
will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

The result is that every read from and write to *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure-coded pool recovery
---------------------------

If an erasure-coded pool loses any data shards, it must recover them from the
shards that remain. This recovery involves reading from the surviving shards,
reconstructing the data, and writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as there
are at least *K* shards available. (With fewer than *K* shards, you have
actually lost data!)

Prior to Octopus, erasure-coded pools required that at least ``min_size`` shards be
available, even if ``min_size`` was greater than ``K``. This was a conservative
decision made out of an abundance of caution when designing the new pool
mode. As a result, however, pools with lost OSDs but without complete data loss were
unable to recover and go active without manual intervention to temporarily change
the ``min_size`` setting.

We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes and
loss of data.

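
The thresholds described in this section can be summarized in a small Python
sketch. It models only the shard counts mentioned above (fewer than *K* shards
means lost data, and at least *K* shards allows recovery in Octopus and later
releases). The function name and the returned labels are hypothetical and are
not Ceph placement-group states.

.. code-block:: python

   def shard_status(k: int, m: int, min_size: int, available: int) -> str:
       """Classify a set of shards by how many of the K+M shards are available.

       The returned labels are illustrative only; they are not Ceph PG states.
       """
       if not 0 <= available <= k + m:
           raise ValueError("available must be between 0 and k + m")
       if available < k:
           return "data lost"
       if available < min_size:
           return "recoverable, below min_size"
       return "at or above min_size"


   # K=3, M=2, and min_size set to the recommended K+1.
   k, m = 3, 2
   min_size = k + 1
   for available in range(k + m, 1, -1):
       print(available, shard_status(k, m, min_size, available))
   # 5 at or above min_size
   # 4 at or above min_size
   # 3 recoverable, below min_size
   # 2 data lost
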
Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size as each other. There are two
   kinds of chunks: (1) *data chunks*, which can be concatenated to reconstruct the original object, and
   (2) *coding chunks*, which can be used to rebuild a lost chunk.

*K*
   The number of data chunks into which an object is divided. For example, if *K* = 2, then a 10KB object
   is divided into two data chunks of 5KB each.

*M*
   The number of coding chunks computed by the encoding function. *M* is equal to the number of OSDs that can
   be missing from the cluster without the cluster suffering data loss. For example, if there are two coding
   chunks, then two OSDs can be missing without data loss.

Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay