.. _ecpool:

==============
Erasure code
==============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".

By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.

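
The idea can be sketched in a few lines of Python. This is an illustration
only: it uses a single XOR parity chunk (two data chunks, one coding chunk),
not the Reed-Solomon arithmetic that Ceph's erasure-code plugins actually use,
and the function names here are ours:

```python
# Illustrative sketch: k=2 data chunks plus m=1 XOR coding chunk.
# Real erasure codes (e.g. Reed-Solomon) generalize this to any k and m.

def encode(data: bytes):
    """Split data into two data chunks and compute one coding chunk."""
    half = (len(data) + 1) // 2
    d1 = data[:half]
    d2 = data[half:].ljust(half, b"\0")  # pad so the chunks are equal size
    coding = bytes(a ^ b for a, b in zip(d1, d2))  # XOR parity
    return d1, d2, coding

def recover_d2(d1: bytes, coding: bytes) -> bytes:
    """Rebuild the lost second data chunk from the survivor and the parity."""
    return bytes(a ^ b for a, b in zip(d1, coding))

d1, d2, coding = encode(b"ABCDEFGHI")
assert recover_d2(d1, coding) == d2  # a lost chunk is rebuilt from parity
```

Losing any one of the three chunks leaves enough information to rebuild it;
this is exactly the property that coding chunks provide.
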
Creating a sample erasure-coded pool
------------------------------------

The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

Erasure-code profiles
---------------------

The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. This erasure-code profile is equivalent to a replicated
pool of size three, but with different storage requirements: instead of
requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

.. note::
   The profile just displayed is for the *default* erasure-coded pool, not the
   *simplest* erasure-coded pool. These two pools are not the same:

   The default erasure-coded pool has two data chunks (K) and two coding chunks
   (M). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (K) and one coding chunk
   (M). The profile of the simplest erasure-coded pool is "k=2 m=1".

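
The storage arithmetic behind these profiles can be sketched in Python. This
is a hypothetical helper of our own, not a Ceph API: an erasure-coded pool
consumes (K+M)/K units of raw capacity per unit of data stored.

```python
def raw_capacity_needed(user_data_tb: float, k: int, m: int) -> float:
    """Raw capacity consumed by an erasure-coded pool: data grows by (k+m)/k."""
    return user_data_tb * (k + m) / k

# Default profile (k=2, m=2): 1 TB of data needs 2 TB of raw capacity,
# versus 3 TB for a replicated pool of size three.
print(raw_capacity_needed(1.0, 2, 2))  # 2.0
# Simplest profile (k=2, m=1): 1.5 TB of raw capacity per TB of data.
print(raw_capacity_needed(1.0, 2, 1))  # 1.5
```
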
Choosing the right profile is important, because a profile cannot be modified
after a pool is created: if you find that you need an erasure-coded pool with
a different profile, you must create a new pool with a different (and
presumably more carefully considered) profile and then move all objects from
the wrongly configured pool to the newly created pool.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data *chunks* (*K=3*) and two
additional coding *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

.. ditaa::

                              +-----------------+
                         name |      NYAN       |
                              +-----------------+
                      content |    ABCDEFGHI    |
                              +--------+--------+
                                       |
                                       v
                                +------+------+
                 +--------------+ encode(3,2) +--------------+
                 |              +--+---+---+--+              |
                 |                 |   |   |                 |
                 |          +------+   |   +------+          |
                 |          |          |          |          |
              +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
         name | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
              +------+   +------+   +------+   +------+   +------+
        shard |  1   |   |  2   |   |  3   |   |  4   |   |  5   |
              +------+   +------+   +------+   +------+   +------+
      content | ABC  |   | DEF  |   | GHI  |   | YXY  |   | QGC  |
              +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
                 |          |          |          |          |
                 |          |          |          |          |
                 |          |       +--v---+      |          |
                 |          |       | OSD1 |      |          |
                 |          |       +------+      |          |
                 |          |                     |          |
                 |          |       +------+      |          |
                 |          +------>| OSD2 |      |          |
                 |                  +------+      |          |
                 |                                |          |
                 |                  +------+      |          |
                 |                  | OSD3 |<-----+          |
                 |                  +------+                 |
                 |                                           |
                 |                  +------+                 |
                 |                  | OSD4 |<----------------+
                 |                  +------+
                 |
                 |                  +------+
                 +----------------->| OSD5 |
                                    +------+

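
The division step can be sketched in Python. This illustrates only the split
into data chunks (the function name is ours); the real encoding also produces
the coding chunks shown as shards 4 and 5:

```python
def split_into_data_chunks(obj: bytes, k: int):
    """Divide an object into k data chunks of equal size (the last chunk
    may need padding in a real encoder)."""
    size = -(-len(obj) // k)  # ceiling division
    return [obj[i * size:(i + 1) * size] for i in range(k)]

print(split_into_data_chunks(b"ABCDEFGHI", 3))  # [b'ABC', b'DEF', b'GHI']
```
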
More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure-coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.


Erasure-coded pools and cache tiering
-------------------------------------

.. note:: Cache tiering is deprecated in Reef.

Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example,
omap). To overcome these limitations, one can set up a `cache tier
<../cache-tiering>`_ before setting up the erasure-coded pool.

For example, if the pool *hot-storage* is made of fast storage, the following
commands will place the *hot-storage* pool as a tier of *ecpool* in
*writeback* mode:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure-coded pool recovery
---------------------------

If an erasure-coded pool loses any data shards, it must recover them from the
others. This recovery involves reading from the remaining shards,
reconstructing the data, and writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as
there are at least *K* shards available. (With fewer than *K* shards, you
have actually lost data!)

Prior to Octopus, erasure-coded pools required that at least ``min_size``
shards be available, even if ``min_size`` was greater than ``K``. This was a
conservative decision made out of an abundance of caution when designing the
new pool mode. As a result, however, pools with lost OSDs but without complete
data loss were unable to recover and go active without manual intervention to
temporarily change the ``min_size`` setting.

We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of
writes and loss of data.

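
The rule above can be expressed as a small predicate. This is our own
illustrative helper, not a Ceph API:

```python
def can_recover(available_shards: int, k: int, min_size: int,
                octopus_or_later: bool = True) -> bool:
    """Since Octopus, recovery needs only K shards; earlier releases also
    required at least min_size shards to be available."""
    if available_shards < k:
        return False  # fewer than K shards: data has been lost
    if not octopus_or_later:
        return available_shards >= min_size
    return True

# k=3, m=2, min_size=4 (K+1), with two OSDs down (3 shards left):
print(can_recover(3, k=3, min_size=4))                          # True
print(can_recover(3, k=3, min_size=4, octopus_or_later=False))  # False
```
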
Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size as
   each other. There are two kinds of chunks: (1) *data chunks*, which can be
   concatenated to reconstruct the original object, and (2) *coding chunks*,
   which can be used to rebuild a lost chunk.

*K*
   The number of data chunks into which an object is divided. For example, if
   *K* = 2, then a 10KB object is divided into two chunks of 5KB each.

*M*
   The number of coding chunks computed by the encoding function. *M* is equal
   to the number of OSDs that can be missing from the cluster without the
   cluster suffering data loss. For example, if there are two coding chunks,
   then two OSDs can be missing without data loss.

Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay