.. _ecpool:

=============
 Erasure code
=============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks (this
multiple copying is the "replication").

In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "encoding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

Erasure code profiles
---------------------

The default erasure code profile can sustain the loss of two OSDs. This erasure
code profile is equivalent to a replicated pool of size three, but requires
2TB to store 1TB of data instead of the 3TB that a replicated pool requires.
The default profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

.. note::
   The default erasure-coded pool, the profile of which is displayed here, is
   not the same as the simplest erasure-coded pool.

   The default erasure-coded pool has two data chunks (k) and two coding chunks
   (m). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (k) and one coding chunk
   (m). The profile of the simplest erasure-coded pool is "k=2 m=1".

Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a profile different than the one you have created, you must create a new pool
with a different (and presumably more carefully considered) profile, and then
move all objects from the wrongly configured pool to the newly created pool.
There is no way to alter the profile of a pool after its creation.
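
The capacity figures above follow directly from *K* and *M*: storing one unit
of data consumes (K + M) / K units of raw capacity. A quick sketch of that
arithmetic in plain Python (independent of Ceph; the function name is
illustrative):

```python
def raw_required(data_tb, k, m):
    """Raw capacity (TB) needed to store `data_tb` of user data with an
    erasure code profile of k data chunks and m coding chunks."""
    return data_tb * (k + m) / k

# Default profile (k=2, m=2): 1 TB of data needs 2 TB of raw capacity,
# versus 3 TB for a replicated pool of size three.
print(raw_required(1, 2, 2))  # -> 2.0
print(1 * 3)                  # -> 3 (replicated, size 3)
```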

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data *chunks* (*K=3*) and two
additional coding *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

.. ditaa::

                                   +-------------------+
                              name |        NYAN       |
                                   +-------------------+
                           content |      ABCDEFGHI    |
                                   +--------+----------+
                                            |
                                            |
                                            v
                                     +------+------+
                     +---------------+ encode(3,2) +-------------+
                     |               +--+--+---+---+             |
                     |                  |  |   |                 |
                     |          +-------+  |   +------+          |
                     |          |          |          |          |
                  +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
       name       | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
                  +------+   +------+   +------+   +------+   +------+
       shard      |   1  |   |   2  |   |   3  |   |   4  |   |   5  |
                  +------+   +------+   +------+   +------+   +------+
       content    |  ABC |   |  DEF |   |  GHI |   |  YXY |   |  QGC |
                  +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
                     |          |          |          |          |
                     |          |          v          |          |
                     |          |       +--+---+      |          |
                     |          |       | OSD1 |      |          |
                     |          |       +------+      |          |
                     |          |                     |          |
                     |          |       +------+      |          |
                     |          +------>| OSD2 |      |          |
                     |                  +------+      |          |
                     |                                |          |
                     |                  +------+      |          |
                     |                  | OSD3 |<-----+          |
                     |                  +------+                 |
                     |                                           |
                     |                  +------+                 |
                     |                  | OSD4 |<----------------+
                     |                  +------+
                     |
                     |                  +------+
                     +----------------->| OSD5 |
                                        +------+

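
The chunking shown in the diagram can be illustrated with a toy encoder. The
sketch below only mirrors the *K=3* data-chunk split (``ABC``, ``DEF``,
``GHI``) and adds a single XOR parity chunk so that any one lost chunk can be
rebuilt. It is not the Reed-Solomon technique that the jerasure plugin
actually uses, which can produce several independent coding chunks (*M=2* in
the example above); plain XOR can tolerate only one loss.

```python
from functools import reduce

def xor(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split `data` into k equal data chunks plus one XOR parity chunk.
    Any single lost chunk can then be rebuilt from the survivors."""
    assert len(data) % k == 0
    size = len(data) // k
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [reduce(xor, chunks)]

def rebuild(chunks):
    """Rebuild the one missing chunk (marked None) from all the others."""
    missing = chunks.index(None)
    survivors = [c for c in chunks if c is not None]
    repaired = list(chunks)
    repaired[missing] = reduce(xor, survivors)
    return repaired

shards = encode(b"ABCDEFGHI", k=3)   # [b'ABC', b'DEF', b'GHI', parity]
shards[1] = None                     # lose the shard holding 'DEF'
assert rebuild(shards)[1] == b"DEF"  # rebuilt from the survivors
```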

More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools work only with uses like RGW that
perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. In addition to being unsafe, using Filestore with
EC overwrites results in lower performance compared to BlueStore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.

Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.

For instance, if the pool *hot-storage* is made of fast storage:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

These commands place the *hot-storage* pool as a tier of *ecpool* in
*writeback* mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure coded pool recovery
---------------------------

If an erasure coded pool loses some data shards, it must recover them from
the others. This involves reading from the remaining shards, reconstructing
the data, and writing new shards. In Octopus and later releases, erasure
coded pools can recover as long as there are at least *K* shards available.
(With fewer than *K* shards, you have actually lost data!)

Prior to Octopus, erasure coded pools required at least ``min_size`` shards to
be available, even if ``min_size`` was greater than ``K``. We recommend that
``min_size`` be ``K+2`` or more to prevent loss of writes and data. This
conservative decision was made out of an abundance of caution when designing
the new pool mode. As a result, pools with lost OSDs but without complete loss
of any data were unable to recover and go active without manual intervention
to temporarily change the ``min_size`` setting.

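
The recovery rules above reduce to a simple predicate. In this hypothetical
sketch (the function is illustrative, not a Ceph API), Octopus and later need
only *K* surviving shards, while earlier releases also demanded ``min_size``:

```python
def can_recover(surviving_shards, k, min_size, octopus_or_later=True):
    """Whether an erasure-coded placement group can recover and go active."""
    if surviving_shards < k:
        return False  # fewer than K shards: data is actually lost
    if octopus_or_later:
        return True   # K shards suffice to reconstruct everything
    return surviving_shards >= min_size  # pre-Octopus: min_size required

# k=3, m=2, min_size=4: losing two OSDs leaves 3 shards.
assert can_recover(3, k=3, min_size=4)                              # Octopus+
assert not can_recover(3, k=3, min_size=4, octopus_or_later=False)  # pre-Octopus
```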
Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size:
   data chunks, which can be concatenated to reconstruct the original object,
   and coding chunks, which can be used to rebuild a lost chunk.

*K*
   The number of data *chunks*, i.e. the number of *chunks* into which the
   original object is divided. For instance, if *K* = 2, a 10KB object
   will be divided into *K* chunks of 5KB each.

*M*
   The number of coding *chunks*, i.e. the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*,
   then 2 OSDs can be out without losing data.


Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay