.. _ecpool:

==============
 Erasure code
==============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".

By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.


Creating a sample erasure-coded pool
------------------------------------

The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI
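
To confirm which profile a new pool uses, you can read it back from the pool.
This is a quick sanity check; ``ecpool`` is the pool created above, and a pool
created without naming a profile uses the *default* profile:

.. prompt:: bash $

   ceph osd pool get ecpool erasure_code_profile

::

   erasure_code_profile: default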

Erasure-code profiles
---------------------

The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. This erasure-code profile is equivalent to a replicated
pool of size three, but with different storage requirements: instead of
requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van
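
The storage overhead follows directly from *K* and *M*: an object is stored as
*K+M* chunks, each 1/*K* of the object's size, so the raw-space multiplier is
(*K*+*M*)/*K*. A quick sketch of the arithmetic in plain shell (nothing
Ceph-specific here):

.. prompt:: bash $

   # The raw-space multiplier of an EC profile is (k+m)/k.
   # For the default profile (k=2, m=2): (2+2)/2 = 2, so storing 1TB
   # consumes 2TB of raw capacity, matching the figures above.
   k=2 m=2
   echo "scale=2; ($k + $m) / $k" | bc

::

   2.00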

.. note::
   The profile just displayed is for the *default* erasure-coded pool, not the
   *simplest* erasure-coded pool. These two pools are not the same:

   The default erasure-coded pool has two data chunks (K) and two coding chunks
   (M). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (K) and one coding chunk
   (M). The profile of the simplest erasure-coded pool is "k=2 m=1".

Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a different profile, you must create a new pool with a different (and presumably
more carefully considered) profile and then move all objects from the wrongly
configured pool into the newly created pool.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data chunks (*K=3*), and two
additional coding *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. Setting
*crush-failure-domain=rack* creates a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

.. ditaa::

                            +-------------------+
               name         |       NYAN        |
                            +-------------------+
               content      |     ABCDEFGHI     |
                            +--------+----------+
                                     |
                                     |
                                     v
                              +------+------+
              +---------------+ encode(3,2) +-----------+
              |               +--+--+---+---+           |
              |                  |  |   |               |
              |          +-------+  |   +-----+         |
              |          |          |         |         |
           +--v---+   +--v---+   +--v---+  +--v---+  +--v---+
     name  | NYAN |   | NYAN |   | NYAN |  | NYAN |  | NYAN |
           +------+   +------+   +------+  +------+  +------+
    shard  |  1   |   |  2   |   |  3   |  |  4   |  |  5   |
           +------+   +------+   +------+  +------+  +------+
  content  | ABC  |   | DEF  |   | GHI  |  | YXY  |  | QGC  |
           +--+---+   +--+---+   +--+---+  +--+---+  +--+---+
              |          |          |         |         |
              |          |          v         |         |
              |          |       +--+---+     |         |
              |          |       | OSD1 |     |         |
              |          |       +------+     |         |
              |          |                    |         |
              |          |       +------+     |         |
              |          +------>| OSD2 |     |         |
              |                  +------+     |         |
              |                               |         |
              |                  +------+     |         |
              |                  | OSD3 |<----+         |
              |                  +------+               |
              |                                         |
              |                  +------+               |
              |                  | OSD4 |<--------------+
              |                  +------+
              |
              |                  +------+
              +----------------->| OSD5 |
                                 +------+
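
To see where the chunks of *NYAN* were placed, you can ask Ceph for the
object's mapping. The epoch, placement group, and OSD ids below are
illustrative; your cluster will report different values:

.. prompt:: bash $

   ceph osd map ecpool NYAN

::

   osdmap e63 pool 'ecpool' (9) object 'NYAN' -> pg 9.30e9b719 (9.19) -> up ([3,1,4,0,2], p3) acting ([3,1,4,0,2], p3)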

More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true
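
To verify that the flag took effect, read the setting back (a quick sanity
check on the ``ec_pool`` from the command above):

.. prompt:: bash $

   ceph osd pool get ec_pool allow_ec_overwrites

::

   allow_ec_overwrites: true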

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure-coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.
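
For example, a directory can be pointed at an erasure-coded data pool roughly
as follows. This is a sketch: the pool name ``cephfs_ec_pool``, the file
system name ``cephfs``, and the mount point are placeholders for your own
names:

.. prompt:: bash $

   ceph osd pool set cephfs_ec_pool allow_ec_overwrites true
   ceph fs add_data_pool cephfs cephfs_ec_pool
   setfattr -n ceph.dir.layout.pool -v cephfs_ec_pool /mnt/cephfs/ec_dir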


Erasure-coded pools and cache tiering
-------------------------------------

.. note:: Cache tiering is deprecated in Reef.

Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example, omap).
To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_
before setting up the erasure-coded pool.

For example, if the pool *hot-storage* is made of fast storage, the following
commands will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.
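
Because cache tiering is deprecated, you may eventually want to undo such a
setup. One possible removal sequence is sketched below; it assumes the cache
can be flushed safely, and you should consult the cache tiering documentation
before running it on a live cluster:

.. prompt:: bash $

   ceph osd tier cache-mode hot-storage proxy
   rados -p hot-storage cache-flush-evict-all
   ceph osd tier remove-overlay ecpool
   ceph osd tier remove ecpool hot-storage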

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure-coded pool recovery
---------------------------

If an erasure-coded pool loses any shards, it must recover them from the
others. This recovery involves reading from the remaining shards,
reconstructing the data, and writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as
there are at least *K* shards available. (With fewer than *K* shards, you
have actually lost data!)

Prior to Octopus, erasure-coded pools required that at least ``min_size``
shards be available, even if ``min_size`` was greater than ``K``. This was a
conservative decision made out of an abundance of caution when designing the
new pool mode. As a result, however, pools with lost OSDs but without complete
data loss were unable to recover and go active without manual intervention to
temporarily change the ``min_size`` setting.

We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes
and loss of data.
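
For example, with the ``k=3 m=2`` profile created earlier, this recommendation
means a ``min_size`` of at least 4 (``ecpool`` here is the pool from the
earlier example):

.. prompt:: bash $

   ceph osd pool set ecpool min_size 4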


Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size as
   each other. There are two kinds of chunks: (1) *data chunks*, which can be
   concatenated to reconstruct the original object, and (2) *coding chunks*,
   which can be used to rebuild a lost chunk.

*K*
   The number of data chunks into which an object is divided. For example, if
   *K* = 2, then a 10KB object is divided into two chunks of 5KB each.

*M*
   The number of coding chunks computed by the encoding function. *M* is equal
   to the number of OSDs that can be missing from the cluster without the
   cluster suffering data loss. For example, if there are two coding chunks,
   then two OSDs can be missing without data loss.

Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay