.. _ecpool:

=============
Erasure code
=============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the "replication".

In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.
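
The parity idea can be illustrated with a toy single-parity code, analogous to
RAID5. This is a sketch for intuition only; Ceph's default jerasure plugin uses
Reed-Solomon coding, which generalizes the idea to more than one coding chunk:

.. code-block:: python

   # Toy single-parity erasure code: k data chunks plus one coding chunk.
   # For intuition only -- not how Ceph's jerasure plugin is implemented.

   def encode(data_chunks):
       """Return the coding chunk: the bytewise XOR of all data chunks."""
       parity = bytes(len(data_chunks[0]))
       for chunk in data_chunks:
           parity = bytes(a ^ b for a, b in zip(parity, chunk))
       return parity

   def recover(surviving_chunks, parity):
       """Rebuild the single missing data chunk from survivors and parity."""
       missing = parity
       for chunk in surviving_chunks:
           missing = bytes(a ^ b for a, b in zip(missing, chunk))
       return missing

   data = [b"ABC", b"DEF", b"GHI"]            # k=3 data chunks
   parity = encode(data)                      # one coding chunk
   assert recover([data[0], data[2]], parity) == b"DEF"   # lost chunk rebuilt

Because XOR is its own inverse, XORing the parity with the surviving data
chunks yields the missing chunk; this is exactly why such a scheme survives
the loss of any single drive.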


Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

Erasure code profiles
---------------------

The default erasure code profile can sustain the loss of two OSDs. This erasure
code profile is equivalent to a replicated pool of size three, but requires
2TB to store 1TB of data instead of 3TB to store 1TB of data. The default
profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

.. note::
   The default erasure-coded pool, the profile of which is displayed here, is
   not the same as the simplest erasure-coded pool.

   The default erasure-coded pool has two data chunks (k) and two coding chunks
   (m). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (k) and one coding chunk
   (m). The profile of the simplest erasure-coded pool is "k=2 m=1".
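
The storage arithmetic behind this comparison is simple: an object of size *S*
consumes *S × (K+M) / K* of raw capacity in an erasure-coded pool. A quick
sketch of that calculation (not a Ceph API):

.. code-block:: python

   def raw_usage(data_tb, k, m):
       """Raw capacity consumed by data_tb of user data under a k+m profile."""
       return data_tb * (k + m) / k

   assert raw_usage(1, 2, 2) == 2.0   # default profile: 2TB of raw per 1TB
   assert raw_usage(1, 2, 1) == 1.5   # simplest profile (RAID5-like): 1.5TB
   # A replicated pool of size three would need 3TB for the same 1TB of data.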

Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a profile different from the one you have created, you must create a new pool
with a different (and presumably more carefully considered) profile. When the
new pool is created, all objects from the wrongly configured pool must be moved
to the newly created pool. There is no way to alter the profile of a pool after
its creation.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain*, because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data chunks (*K=3*) and two
additional coding *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. Setting
*crush-failure-domain=rack* creates a CRUSH rule that ensures no two *chunks*
are stored in the same rack.
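
The division into data chunks can be sketched as follows (the coding-chunk
payloads *YXY* and *QGC* shown in the figure stand in for actual Reed-Solomon
output and are not computed here):

.. code-block:: python

   def split(obj, k):
       """Divide an object into k equally sized data chunks."""
       assert len(obj) % k == 0   # Ceph pads the object when needed
       size = len(obj) // k
       return [obj[i * size:(i + 1) * size] for i in range(k)]

   chunks = split(b"ABCDEFGHI", k=3)
   assert chunks == [b"ABC", b"DEF", b"GHI"]
   # With M=2 coding chunks added, any two of the five chunks can be
   # lost without losing data.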

.. ditaa::

                          +-------------------+
                     name |       NYAN        |
                          +-------------------+
                  content |     ABCDEFGHI     |
                          +--------+----------+
                                   |
                                   |
                                   v
                            +------+------+
              +-------------+ encode(3,2) +-----------+
              |             +--+--+---+---+           |
              |                |  |   |               |
              |         +------+  |   +-----+         |
              |         |         |         |         |
           +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
      name | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
           +------+  +------+  +------+  +------+  +------+
     shard |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
           +------+  +------+  +------+  +------+  +------+
   content |  ABC |  |  DEF |  |  GHI |  |  YXY |  |  QGC |
           +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
              |         |         |         |         |
              |         |         v         |         |
              |         |      +--+---+     |         |
              |         |      | OSD1 |     |         |
              |         |      +------+     |         |
              |         |                   |         |
              |         |      +------+     |         |
              |         +----->| OSD2 |     |         |
              |                +------+     |         |
              |                             |         |
              |                +------+     |         |
              |                | OSD3 |<----+         |
              |                +------+               |
              |                                       |
              |                +------+               |
              |                | OSD4 |<--------------+
              |                +------+
              |
              |                +------+
              +--------------->| OSD5 |
                               +------+


More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools work only with applications like RGW that
perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, it also yields lower performance than BlueStore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.


Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.

For instance, if the pool *hot-storage* is made of fast storage:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

These commands place the *hot-storage* pool as a tier of *ecpool* in
*writeback* mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure coded pool recovery
---------------------------

If an erasure coded pool loses some data shards, it must recover them from the
others. This involves reading from the remaining shards, reconstructing the
data, and writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as at
least *K* shards are available. (With fewer than *K* shards, you have actually
lost data!)

Prior to Octopus, erasure coded pools required at least ``min_size`` shards to
be available, even if ``min_size`` was greater than ``K``. This conservative
decision was made out of an abundance of caution when designing the new pool
mode. As a result, pools that had lost OSDs, but had not suffered complete
loss of any data, were unable to recover and go active without manual
intervention to temporarily change the ``min_size`` setting. We recommend
that ``min_size`` be ``K+2`` or more to prevent loss of writes and data.
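
The two behaviours can be summarized as a predicate on the number of surviving
shards (a sketch; ``octopus_or_later`` is an illustrative flag, not a real Ceph
option):

.. code-block:: python

   def can_recover(available_shards, k, min_size, octopus_or_later=True):
       """Whether a PG can go active and recover, given surviving shards."""
       if available_shards < k:
           return False                     # fewer than K shards: data is lost
       if octopus_or_later:
           return True                      # K shards suffice since Octopus
       return available_shards >= min_size  # older releases also check min_size

   # k=2, m=2 pool with min_size=3, two shards surviving:
   assert can_recover(2, k=2, min_size=3, octopus_or_later=True)
   assert not can_recover(2, k=2, min_size=3, octopus_or_later=False)
   assert not can_recover(1, k=2, min_size=3)   # data genuinely lost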

Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size:
   data chunks, which can be concatenated to reconstruct the original object,
   and coding chunks, which can be used to rebuild a lost chunk.

*K*
   The number of data *chunks*, i.e. the number of *chunks* into which the
   original object is divided. For instance, if *K* = 2, a 10KB object
   will be divided into *K* chunks of 5KB each.

*M*
   The number of coding *chunks*, i.e. the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*,
   then 2 OSDs can be out without losing data.

Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay