.. _ecpool:

=============
Erasure code
=============

Each Ceph pool is associated with a type that determines how it
sustains the loss of an OSD (i.e. a disk, since most of the time
there is one OSD per disk). The default choice when `creating a pool
<../pools>`_ is *replicated*, meaning that every object is copied onto
multiple disks. The `Erasure Code
<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used
instead to save space.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts::

    $ ceph osd pool create ecpool erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

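As a quick sanity check, the profile that a pool was created with can be
read back from the pool itself. The example below assumes the *ecpool*
created above, which uses the *default* profile::

    $ ceph osd pool get ecpool erasure_code_profile
    erasure_code_profile: default
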
Erasure code profiles
---------------------

The default erasure code profile sustains the loss of two OSDs. It
is equivalent to a replicated pool of size three but requires 2TB
instead of 3TB to store 1TB of data. The default profile can be
displayed with::

    $ ceph osd erasure-code-profile get default
    k=2
    m=2
    plugin=jerasure
    crush-failure-domain=host
    technique=reed_sol_van

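Several profiles can coexist on a cluster. To see which profiles are
already defined before adding a new one, list them; only ``default`` is
guaranteed to exist, so the output below is a minimal example::

    $ ceph osd erasure-code-profile ls
    default
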
Choosing the right profile is important because it cannot be modified
after the pool is created: a new pool with a different profile needs
to be created and all objects from the previous pool moved to the new
pool.

The most important parameters of the profile are *K*, *M* and
*crush-failure-domain* because they define the storage overhead and
the data durability. For instance, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%, the
following profile can be defined::

    $ ceph osd erasure-code-profile set myprofile \
       k=3 \
       m=2 \
       crush-failure-domain=rack
    $ ceph osd pool create ecpool erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

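The parameters stored for the new profile can be read back at any time
to confirm what was set; it should echo ``k=3``, ``m=2`` and
``crush-failure-domain=rack`` together with the plugin defaults (the
exact set of keys shown depends on the release)::

    $ ceph osd erasure-code-profile get myprofile
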
The *NYAN* object will be divided into three (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting will create a CRUSH rule that
ensures no two *chunks* are stored in the same rack.

.. ditaa::
                            +-------------------+
                       name |       NYAN        |
                            +-------------------+
                    content |     ABCDEFGHI     |
                            +--------+----------+
                                     |
                                     |
                                     v
                              +------+------+
              +---------------+ encode(3,2) +-------------+
              |               +--+--+---+---+             |
              |                  |  |   |                 |
              |          +-------+  |   +------+          |
              |          |          |          |          |
           +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
     name  | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
           +------+   +------+   +------+   +------+   +------+
    shard  |  1   |   |  2   |   |  3   |   |  4   |   |  5   |
           +------+   +------+   +------+   +------+   +------+
  content  | ABC  |   | DEF  |   | GHI  |   | YXY  |   | QGC  |
           +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
              |          |          |          |          |
              |          |          v          |          |
              |          |       +--+---+      |          |
              |          |       | OSD1 |      |          |
              |          |       +------+      |          |
              |          |                     |          |
              |          |       +------+      |          |
              |          +------>| OSD2 |      |          |
              |                  +------+      |          |
              |                                |          |
              |                  +------+      |          |
              |                  | OSD3 |<-----+          |
              |                  +------+                 |
              |                                           |
              |                  +------+                 |
              |                  | OSD4 |<----------------+
              |                  +------+
              |
              |                  +------+
              +----------------->| OSD5 |
                                 +------+

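To see where the chunks of a given object actually land, the object can
be mapped to its placement group and acting set of OSDs. The example
below assumes the *ecpool*/*NYAN* objects from above; the epoch,
placement group ID and OSD numbers will differ on a real cluster::

    $ ceph osd map ecpool NYAN
    osdmap e63 pool 'ecpool' (9) object 'NYAN' -> pg 9.10e0ba27 (9.7) -> up ([4,2,0,8,6], p4) acting ([4,2,0,8,6], p4)

With *K=3* and *M=2* the acting set contains five OSDs, one per chunk.
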
More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools only work with workloads like RGW that
perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool::

    ceph osd pool set ec_pool allow_ec_overwrites true

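To confirm that the flag took effect, the same per-pool setting can be
read back (again using the hypothetical *ec_pool* name from above), which
should report ``allow_ec_overwrites: true``::

    ceph osd pool get ec_pool allow_ec_overwrites
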
This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
EC overwrites also yields lower performance than bluestore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation::

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.

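As an illustration of the file-layout route, a sketch like the following
could attach a hypothetical erasure coded data pool named
``cephfs_ec_data`` to an existing file system called ``cephfs`` and
direct one directory of a mounted client to it; the pool, file system
and mount-point names are only examples::

    # allow partial overwrites on the EC pool (requires bluestore OSDs)
    ceph osd pool set cephfs_ec_data allow_ec_overwrites true
    # attach the EC pool to the file system as an additional data pool
    ceph fs add_data_pool cephfs cephfs_ec_data
    # store new files under this directory in the EC pool via a file layout
    setfattr -n ceph.dir.layout.pool -v cephfs_ec_data /mnt/cephfs/ecdir
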
Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionalities such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.

For instance, if the pool *hot-storage* is made of fast storage::

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

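A cache tier configured this way usually needs a few additional
parameters before it behaves sensibly; the values below are only
illustrative placeholders and should be sized for the actual cluster
(see the cache tiering documentation for the full list)::

    $ ceph osd pool set hot-storage hit_set_type bloom
    $ ceph osd pool set hot-storage target_max_bytes 1099511627776

The first command enables hit set tracking, which the tiering agent
relies on, and the second caps the cache at roughly 1TB before the
agent starts flushing and evicting objects.
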
More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.

Erasure coded pool recovery
---------------------------
If an erasure coded pool loses some shards, it must recover them from the others.
This generally involves reading from the remaining shards, reconstructing the data, and
writing it to the new peer.
In Octopus, erasure coded pools can recover as long as there are at least *K* shards
available. (With fewer than *K* shards, you have actually lost data!)

Prior to Octopus, erasure coded pools required at least *min_size* shards to be
available, even if *min_size* was greater than *K*. (We generally recommend that
*min_size* be *K+2* or more to prevent loss of writes and data.)
This conservative decision was made out of an abundance of caution when designing the new pool
mode, but it also meant that pools with lost OSDs but no data loss were unable to recover and go
active without manual intervention to change *min_size*.

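The current value can be checked, and on a pre-Octopus cluster stuck in
this situation temporarily lowered, with the usual pool commands. The
pool name and values below are only examples; *min_size* should be
restored to a safe value (at least *K+1*, ideally *K+2*) once recovery
completes::

    $ ceph osd pool get ecpool min_size
    min_size: 4
    $ ceph osd pool set ecpool min_size 3
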
Glossary
--------

*chunk*
   when the encoding function is called, it returns chunks of the same
   size: data chunks, which can be concatenated to reconstruct the original
   object, and coding chunks, which can be used to rebuild a lost chunk.

*K*
   the number of data *chunks*, i.e. the number of *chunks* into which the
   original object is divided. For instance, if *K* = 2 a 10KB object
   will be divided into *K* chunks of 5KB each.

*M*
   the number of coding *chunks*, i.e. the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*,
   it means 2 OSDs can be out without losing data.

Table of contents
-----------------

.. toctree::
    :maxdepth: 1

    erasure-code-profile
    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec
    erasure-code-clay