By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".

By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

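The relationship between the two kinds of chunks can be illustrated with a toy
XOR parity scheme (a simplified sketch only; Ceph's erasure-code plugins
implement real codes such as Reed-Solomon):

.. code-block:: python

   # A toy illustration with K=2 data chunks and M=1 coding (parity) chunk,
   # using XOR parity. This is a sketch, not Ceph's actual implementation.

   data = b"ABCDEFGH"
   data_chunks = [data[:4], data[4:]]                          # K = 2
   coding_chunk = bytes(a ^ b for a, b in zip(*data_chunks))   # M = 1

   # Concatenating the data chunks reconstructs the original object.
   assert b"".join(data_chunks) == data

   # The coding chunk can rebuild a lost data chunk from the survivor.
   rebuilt = bytes(a ^ b for a, b in zip(data_chunks[1], coding_chunk))
   assert rebuilt == data_chunks[0]
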
Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.

Creating a sample erasure-coded pool
------------------------------------

The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

Write an object into the pool and read it back:

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. This erasure-code profile is equivalent to a replicated
pool of size three, but with different storage requirements: instead of
requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

The profile just displayed is for the *default* erasure-coded pool, not the
*simplest* erasure-coded pool. These two pools are not the same:

The default erasure-coded pool has two data chunks (K) and two coding chunks
(M). The profile of the default erasure-coded pool is "k=2 m=2".

The simplest erasure-coded pool has two data chunks (K) and one coding chunk
(M). The profile of the simplest erasure-coded pool is "k=2 m=1".

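The storage cost of the two profiles follows directly from the ratio
``(K + M) / K``; this back-of-envelope sketch reproduces the 2TB-per-1TB
figure given above (the function name is illustrative, not a Ceph API):

.. code-block:: python

   # Raw storage consumed per 1 TB of logical data for a k+m profile.
   def raw_per_logical_tb(k: int, m: int) -> float:
       """Raw-to-logical storage ratio: (k + m) / k."""
       return (k + m) / k

   assert raw_per_logical_tb(2, 2) == 2.0   # default profile: 2 TB raw per 1 TB
   assert raw_per_logical_tb(2, 1) == 1.5   # simplest profile: 1.5 TB raw per 1 TB
   # Each profile tolerates m overlapping OSD failures: 2 and 1 respectively.
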
Choosing the right profile is important because the profile of a pool cannot
be modified after the pool has been created. If you find that you need an
erasure-coded pool with a profile different than the one you have created, you
must create a new pool with a different (and presumably more carefully
considered) profile, and then move all objects from the wrongly configured
pool to the newly created pool.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

The *NYAN* object will be divided into three chunks (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

::

                           +-------------------+
                      name |       NYAN        |
                           +-------------------+
                   content |     ABCDEFGHI     |
                           +--------+----------+
                                    |
                                    v
                       +------------+------------+
                +------+       encode(3,2)       +------+
                |      +--+---------+---------+--+      |
                |         |         |         |         |
             +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
        name | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
             +------+  +------+  +------+  +------+  +------+
       shard |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
             +------+  +------+  +------+  +------+  +------+
     content | ABC  |  | DEF  |  | GHI  |  | YXY  |  | QGC  |
             +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
                |         |         |         |         |
             +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
             | OSD1 |  | OSD2 |  | OSD3 |  | OSD4 |  | OSD5 |
             +------+  +------+  +------+  +------+  +------+

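The split into the three data chunks shown above can be sketched as follows
(the zero-padding rule is an assumption for illustration; Ceph pads the object
so that every chunk has the same size):

.. code-block:: python

   # Divide an object into K equally sized data chunks (illustrative sketch).
   def split_into_data_chunks(obj: bytes, k: int) -> list[bytes]:
       """Split obj into k equal data chunks, zero-padding the tail."""
       chunk_len = -(-len(obj) // k)                 # ceiling division
       padded = obj.ljust(chunk_len * k, b"\0")
       return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

   # The NYAN object from the example, divided into K=3 data chunks.
   assert split_into_data_chunks(b"ABCDEFGHI", 3) == [b"ABC", b"DEF", b"GHI"]
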
More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.

Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure-coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.

Erasure-coded pools and cache tiering
-------------------------------------

.. note:: Cache tiering is deprecated in Reef.

Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example,
omap). To overcome these limitations, one can set up a `cache tier
<../cache-tiering>`_ before setting up the erasure-coded pool.

For example, if the pool *hot-storage* is made of fast storage, the following
commands will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure-coded pool recovery
---------------------------

If an erasure-coded pool loses any data shards, it must recover them from the
others. This recovery involves reading from the remaining shards,
reconstructing the data, and writing new shards to replacement OSDs.

In Octopus and later releases, erasure-coded pools can recover as long as
there are at least *K* shards available. (With fewer than *K* shards, you
have actually lost data!)

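The "any *K* of *K+M* shards" property can be demonstrated with a toy
``k=2, m=1`` XOR code (an illustrative sketch; Ceph's plugins implement
real codes such as Reed-Solomon):

.. code-block:: python

   # Toy k=2, m=1 XOR code: any 2 of the 3 shards reconstruct the object.

   def make_shards(data):
       half = len(data) // 2
       d1, d2 = data[:half], data[half:]
       parity = bytes(a ^ b for a, b in zip(d1, d2))
       return [d1, d2, parity]

   def reconstruct(shards):
       d1, d2, parity = shards
       if d1 is None:                                   # data shard 1 lost
           d1 = bytes(a ^ b for a, b in zip(d2, parity))
       if d2 is None:                                   # data shard 2 lost
           d2 = bytes(a ^ b for a, b in zip(d1, parity))
       return d1 + d2

   original = b"ABCDEFGH"
   for lost in range(3):            # lose each of the 3 shards in turn
       shards = make_shards(original)
       shards[lost] = None          # simulate one failed OSD
       assert reconstruct(shards) == original   # K=2 survivors always suffice

With fewer than *K* surviving shards there is no combination of XORs that can
recover the missing data, which is exactly the "you have actually lost data"
case above.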
Prior to Octopus, erasure-coded pools required that at least ``min_size``
shards be available, even if ``min_size`` was greater than ``K``. This was a
conservative decision made out of an abundance of caution when designing the
new pool mode. As a result, however, pools with lost OSDs but without complete
data loss were unable to recover and go active without manual intervention to
temporarily change the ``min_size`` setting.

We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes
and data.

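The arithmetic behind the ``K+1`` recommendation can be sketched as follows
(the function name is illustrative, not a Ceph API):

.. code-block:: python

   # Margin left after a write is acknowledged with a given number of
   # active shards (arithmetic sketch only).
   def redundancy_after_write(active_shards: int, k: int) -> int:
       """Shards a freshly acknowledged write can still lose without data loss."""
       return active_shards - k

   k = 2
   assert redundancy_after_write(k, k) == 0       # min_size == K: no margin
   assert redundancy_after_write(k + 1, k) == 1   # min_size == K+1: one to spare

With ``min_size == K``, a write acknowledged at the minimum has zero
redundancy: losing one more shard loses the data.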
Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size as
   each other. There are two kinds of chunks: (1) *data chunks*, which can be
   concatenated to reconstruct the original object, and (2) *coding chunks*,
   which can be used to rebuild a lost chunk.

*K*
   The number of data chunks into which an object is divided. For example, if
   *K* = 2, then a 10KB object is divided into two chunks of 5KB each.

*M*
   The number of coding chunks computed by the encoding function. *M* is equal
   to the number of OSDs that can be missing from the cluster without the
   cluster suffering data loss. For example, if there are two coding chunks,
   then two OSDs can be missing without data loss.