.. _ecpool:

=============
Erasure code
=============

A Ceph pool is associated with a type that determines how it sustains
the loss of an OSD (i.e. a disk, since most of the time there is one
OSD per disk). The default choice when `creating a pool <../pools>`_
is *replicated*, meaning every object is copied onto multiple disks.
The `Erasure Code <https://en.wikipedia.org/wiki/Erasure_code>`_ pool
type can be used instead to save space.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts::

    $ ceph osd pool create ecpool 12 12 erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

.. note:: the 12 in *pool create* stands for
          `the number of placement groups <../pools>`_.

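Which erasure code profile a pool uses can be checked after the fact; for
the sample pool above this should report the *default* profile described
in the next section::

    $ ceph osd pool get ecpool erasure_code_profile
    erasure_code_profile: default
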
Erasure code profiles
---------------------

The default erasure code profile sustains the loss of a single OSD. It
is equivalent to a replicated pool of size two but requires 1.5TB
instead of 2TB to store 1TB of data. The default profile can be
displayed with::

    $ ceph osd erasure-code-profile get default
    k=2
    m=1
    plugin=jerasure
    crush-failure-domain=host
    technique=reed_sol_van

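The profiles known to the cluster, including any custom profiles created
later, can be listed by name; on a fresh cluster only *default* is
present::

    $ ceph osd erasure-code-profile ls
    default
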
Choosing the right profile is important because it cannot be modified
after the pool is created: a new pool with a different profile needs
to be created and all objects from the previous pool must be moved to
the new one.

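As a rough illustration of such a migration, objects can be streamed from
one pool to the other with the ``rados`` CLI. The pool names *oldpool* and
*newpool* are placeholders, and this naive loop ignores omap data, xattrs
and snapshots, so it is a sketch rather than a complete migration
procedure::

    rados -p oldpool ls | while read obj ; do
        rados -p oldpool get "$obj" - | rados -p newpool put "$obj" -
    done
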
The most important parameters of the profile are *K*, *M* and
*crush-failure-domain* because they define the storage overhead and
the data durability. For instance, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%, the
following profile can be defined::

    $ ceph osd erasure-code-profile set myprofile \
       k=3 \
       m=2 \
       crush-failure-domain=rack
    $ ceph osd pool create ecpool 12 12 erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

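The 67% figure follows directly from *K* and *M*: the raw space consumed
is (K+M)/K times the logical data, and the extra space on top of the data
itself is M/K. For this profile::

    raw space factor = (K + M) / K = (3 + 2) / 3 ≈ 1.67
    storage overhead = M / K       = 2 / 3       ≈ 67%
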
The *NYAN* object will be divided into three (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

.. ditaa::
                         +-------------------+
                    name |        NYAN       |
                         +-------------------+
                 content |     ABCDEFGHI     |
                         +--------+----------+
                                  |
                                  |
                                  v
                           +------+------+
              +------------+ encode(3,2) +------------+
              |            +--+---+---+--+            |
              |               |   |   |               |
              |         +-----+   |   +-----+         |
              |         |         |         |         |
           +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
     name  | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
           +------+  +------+  +------+  +------+  +------+
    shard  |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
           +------+  +------+  +------+  +------+  +------+
  content  | ABC  |  | DEF  |  | GHI  |  | YXY  |  | QGC  |
           +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
              |         |         |         |         |
              |         |         v         |         |
              |         |      +--+---+     |         |
              |         |      | OSD1 |     |         |
              |         |      +------+     |         |
              |         |                   |         |
              |         |      +------+     |         |
              |         +----->| OSD2 |     |         |
              |                +------+     |         |
              |                             |         |
              |                +------+     |         |
              |                | OSD3 |<----+         |
              |                +------+               |
              |                                       |
              |                +------+               |
              |                | OSD4 |<--------------+
              |                +------+
              |
              |                +------+
              +--------------->| OSD5 |
                               +------+


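To see where the chunks of an object were actually placed, the object can
be mapped to its placement group and acting set::

    $ ceph osd map ecpool NYAN

The acting set in the output lists the five OSDs (*K+M* = 5) holding the
chunks, each chosen from a different rack whenever the CRUSH rule above
can be satisfied.
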
More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools only work with workloads like RGW that
perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool::

    ceph osd pool set ec_pool allow_ec_overwrites true

This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
EC overwrites yields lower performance than bluestore.

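Whether overwrites are enabled on a pool can be checked with the matching
``get`` command, which reports the current value of the flag::

    ceph osd pool get ec_pool allow_ec_overwrites
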
Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation::

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

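The split can be verified on the resulting image: ``rbd info`` on
``replicated_pool/image_name`` reports the erasure coded pool in its
``data_pool`` field, while the image metadata stays in the replicated
pool::

    rbd info replicated_pool/image_name
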
For CephFS, an erasure coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.
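
For example, once the erasure coded pool has been added to an existing file
system, a directory can be pointed at it through its layout. The file system
name *cephfs*, the mount point and the directory below are hypothetical
placeholders::

    ceph fs add_data_pool cephfs ec_pool
    setfattr -n ceph.dir.layout.pool -v ec_pool /mnt/cephfs/ecdir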


Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality, such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
in front of the erasure coded pool.

For instance, if the pool *hot-storage* is made of fast storage::

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

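In practice a writeback tier also needs basic hit set and sizing
parameters before it behaves sensibly. A minimal sketch with illustrative
values, which must be adapted to the actual capacity of the *hot-storage*
pool::

    $ ceph osd pool set hot-storage hit_set_type bloom
    $ ceph osd pool set hot-storage hit_set_count 12
    $ ceph osd pool set hot-storage hit_set_period 14400
    $ ceph osd pool set hot-storage target_max_bytes 1000000000000
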
More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.

Glossary
--------

*chunk*
   when the encoding function is called, it returns chunks of the same
   size: data chunks, which can be concatenated to reconstruct the
   original object, and coding chunks, which can be used to rebuild a
   lost chunk.

*K*
   the number of data *chunks*, i.e. the number of *chunks* into which the
   original object is divided. For instance, if *K* = 2 a 10KB object
   will be divided into two *chunks* of 5KB each.

*M*
   the number of coding *chunks*, i.e. the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*,
   it means 2 OSDs can be out without losing data.


Table of contents
-----------------

.. toctree::
    :maxdepth: 1

    erasure-code-profile
    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec
    erasure-code-clay