=============
 Erasure code
=============

A Ceph pool is associated with a type that determines how it sustains
the loss of an OSD (i.e. a disk, since most of the time there is one
OSD per disk). The default choice when `creating a pool <../pools>`_
is *replicated*, meaning every object is copied on multiple disks. The
`Erasure Code <https://en.wikipedia.org/wiki/Erasure_code>`_ pool type
can be used instead to save space.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts::

    $ ceph osd pool create ecpool 12 12 erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

.. note:: the 12 in *pool create* stands for
          `the number of placement groups <../pools>`_.

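If it is unclear which profile a pool ended up with, the pool's
``erasure_code_profile`` attribute can be queried; the output below is
illustrative and assumes the pool was created with the default profile
as above::

    $ ceph osd pool get ecpool erasure_code_profile
    erasure_code_profile: default
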
Erasure code profiles
---------------------

The default erasure code profile sustains the loss of a single OSD. It
is equivalent to a replicated pool of size two but requires 1.5TB
instead of 2TB to store 1TB of data. The default profile can be
displayed with::

    $ ceph osd erasure-code-profile get default
    k=2
    m=1
    plugin=jerasure
    crush-failure-domain=host
    technique=reed_sol_van

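The erasure code profiles defined in the cluster can be listed as
well; on a fresh cluster only the stock profile is expected to
appear::

    $ ceph osd erasure-code-profile ls
    default
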
Choosing the right profile is important because it cannot be modified
after the pool is created: a new pool with a different profile needs
to be created and all objects from the previous pool moved to the new
pool.

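A minimal sketch of such a migration, using only the rados CLI and
assuming plain objects without snapshots or omap data (the pool names
*oldpool* and *newpool* are placeholders)::

    $ for obj in $(rados --pool oldpool ls); do
          rados --pool oldpool get $obj - | rados --pool newpool put $obj -
      done

For large or busy pools an application-level copy is usually
preferable.
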
The most important parameters of the profile are *K*, *M* and
*crush-failure-domain* because they define the storage overhead and
the data durability. For instance, if the desired architecture must
sustain the loss of two racks with a storage overhead of 40% (two
coding chunks out of every five chunks stored), the following profile
can be defined::

    $ ceph osd erasure-code-profile set myprofile \
       k=3 \
       m=2 \
       crush-failure-domain=rack
    $ ceph osd pool create ecpool 12 12 erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

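The profile that was just created can be read back to check that the
remaining parameters were filled in with their defaults; the field
ordering in the output may differ::

    $ ceph osd erasure-code-profile get myprofile
    crush-failure-domain=rack
    k=3
    m=2
    plugin=jerasure
    technique=reed_sol_van
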
The *NYAN* object will be divided into three chunks (*K=3*) and two
additional *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* will create a CRUSH ruleset that ensures
no two *chunks* are stored in the same rack.

.. ditaa::
                                 +-------------------+
                            name |       NYAN        |
                                 +-------------------+
                         content |     ABCDEFGHI     |
                                 +--------+----------+
                                          |
                                          |
                                          v
                                   +------+------+
                   +---------------+ encode(3,2) +-------------+
                   |               +--+--+---+---+             |
                   |                  |  |   |                 |
                   |          +-------+  |   +------+          |
                   |          |          |          |          |
            +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
       name | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
            +------+   +------+   +------+   +------+   +------+
      shard |  1   |   |  2   |   |  3   |   |  4   |   |  5   |
            +------+   +------+   +------+   +------+   +------+
    content | ABC  |   | DEF  |   | GHI  |   | YXY  |   | QGC  |
            +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
               |          |          |          |          |
               |          |          v          |          |
               |          |       +--+---+      |          |
               |          |       | OSD1 |      |          |
               |          |       +------+      |          |
               |          |                     |          |
               |          |       +------+      |          |
               |          +------>| OSD2 |      |          |
               |                  +------+      |          |
               |                                |          |
               |                  +------+      |          |
               |                  | OSD3 |<-----+          |
               |                  +------+                 |
               |                                           |
               |                  +------+                 |
               |                  | OSD4 |<----------------+
               |                  +------+
               |
               |                  +------+
               +----------------->| OSD5 |
                                  +------+

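The chunk placement illustrated above can be checked on a live cluster
with ``ceph osd map``, which prints the placement group of the object
and the set of OSDs acting for it (five in this example; the exact IDs
depend on the cluster)::

    $ ceph osd map ecpool NYAN
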
More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools only work with applications that
perform full object writes and appends, such as RGW.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool::

    ceph osd pool set ec_pool allow_ec_overwrites true

This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
ec overwrites yields low performance compared to bluestore.

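The flag can be read back to confirm it took effect;
``allow_ec_overwrites`` is queried like any other pool attribute (the
output shown here is illustrative)::

    ceph osd pool get ec_pool allow_ec_overwrites
    allow_ec_overwrites: true
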
Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation::

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

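As a quick check, ``rbd info`` reports the data pool of an image when
one was set at creation time::

    rbd info replicated_pool/image_name
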
For CephFS, using an erasure coded pool means setting that pool in
a `file layout <../../../cephfs/file-layouts>`_.

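As a rough sketch (the file system name *cephfs*, the mount point and
the directory are placeholders), the erasure coded pool is first added
to the file system and then selected for a directory through its
layout::

    ceph fs add_data_pool cephfs ec_pool
    mkdir /mnt/cephfs/ecdir
    setfattr -n ceph.dir.layout.pool -v ec_pool /mnt/cephfs/ecdir
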
Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these limitations,
one can set up a `cache tier <../cache-tiering>`_ before the erasure
coded pool.

For instance, if the pool *hot-storage* is made of fast storage::

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

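Note that a *writeback* cache tier usually also needs hit set tracking
configured before it can flush and evict sensibly. A minimal,
illustrative configuration might look like the following; the values
are only examples and the cache tiering documentation should be
consulted for proper sizing::

    $ ceph osd pool set hot-storage hit_set_type bloom
    $ ceph osd pool set hot-storage hit_set_count 12
    $ ceph osd pool set hot-storage hit_set_period 14400
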
More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.

Glossary
--------

*chunk*
   when the encoding function is called, it returns chunks of the same
   size: data chunks, which can be concatenated to reconstruct the
   original object, and coding chunks, which can be used to rebuild a
   lost chunk.

*K*
   the number of data *chunks*, i.e. the number of *chunks* into which
   the original object is divided. For instance, if *K* = 2, a 10KB
   object will be divided into *K* chunks of 5KB each.

*M*
   the number of coding *chunks*, i.e. the number of additional
   *chunks* computed by the encoding functions. If there are 2 coding
   *chunks*, it means 2 OSDs can be out without losing data.


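Putting *K* and *M* together with the earlier example: with *K* = 3
and *M* = 2, a 9KB object is stored as three 3KB data chunks plus two
3KB coding chunks, i.e. 15KB of raw storage, and any two of the five
chunks can be lost without losing the object.
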
Table of contents
-----------------

.. toctree::
    :maxdepth: 1

    erasure-code-profile
    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec