=============
Erasure code
=============

A Ceph pool is associated with a type that determines how it sustains
the loss of an OSD (i.e. a disk, since most of the time there is one
OSD per disk). The default choice when `creating a pool <../pools>`_
is *replicated*, meaning every object is copied onto multiple disks.
The `Erasure Code <https://en.wikipedia.org/wiki/Erasure_code>`_ pool
type can be used instead to save space.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts::

    $ ceph osd pool create ecpool 12 12 erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

.. note:: the 12 in *pool create* stands for
          `the number of placement groups <../pools>`_.

Erasure code profiles
---------------------

The default erasure code profile sustains the loss of a single OSD. It
is equivalent to a replicated pool of size two but requires 1.5TB
instead of 2TB to store 1TB of data. The default profile can be
displayed with::

    $ ceph osd erasure-code-profile get default
    k=2
    m=1
    plugin=jerasure
    ruleset-failure-domain=host
    technique=reed_sol_van

Choosing the right profile is important because it cannot be modified
after the pool is created: a new pool with a different profile needs
to be created and all objects from the previous pool moved to the new
one.

The most important parameters of the profile are *K*, *M* and
*ruleset-failure-domain* because they define the storage overhead and
the data durability. For instance, if the desired architecture must
sustain the loss of two racks with a storage overhead of about 67%
(*K* + *M* = 5 chunks are stored for every *K* = 3 chunks worth of
data), the following profile can be defined::

    $ ceph osd erasure-code-profile set myprofile \
       k=3 \
       m=2 \
       ruleset-failure-domain=rack
    $ ceph osd pool create ecpool 12 12 erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

The *NYAN* object will be divided into three (*K=3*) and two
additional *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*ruleset-failure-domain=rack* will create a CRUSH ruleset that ensures
no two *chunks* are stored in the same rack.

.. ditaa::
                               +-------------------+
                          name |       NYAN       |
                               +-------------------+
                       content |     ABCDEFGHI     |
                               +--------+----------+
                                        |
                                        |
                                        v
                                 +------+------+
                 +---------------+ encode(3,2) +-------------+
                 |               +--+--+---+---+             |
                 |                  |  |   |                 |
                 |          +-------+  |   +------+          |
                 |          |          |          |          |
              +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
        name  | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
              +------+   +------+   +------+   +------+   +------+
       shard  |  1   |   |  2   |   |  3   |   |  4   |   |  5   |
              +------+   +------+   +------+   +------+   +------+
     content  |  ABC |   |  DEF |   |  GHI |   |  YXY |   |  QGC |
              +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
                 |          |          |          |          |
                 |          |          v          |          |
                 |          |       +--+---+      |          |
                 |          |       | OSD1 |      |          |
                 |          |       +------+      |          |
                 |          |                     |          |
                 |          |       +------+      |          |
                 |          +------>| OSD2 |      |          |
                 |                  +------+      |          |
                 |                                |          |
                 |                  +------+      |          |
                 |                  | OSD3 |<-----+          |
                 |                  +------+                 |
                 |                                           |
                 |                  +------+                 |
                 |                  | OSD4 |<----------------+
                 |                  +------+
                 |
                 |                  +------+
                 +----------------->| OSD5 |
                                    +------+

More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools only work with use cases like RGW that
perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool::

    ceph osd pool set ec_pool allow_ec_overwrites true

This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
ec overwrites yields lower performance compared to bluestore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation::

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, using an erasure coded pool means setting that pool in
a `file layout <../../../cephfs/file-layouts>`_.

Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these limitations,
one can set up a `cache tier <../cache-tiering>`_ before the erasure
coded pool.

For instance, if the pool *hot-storage* is made of fast storage::

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.

Glossary
--------

*chunk*
   when the encoding function is called, it returns chunks of the same
   size: data chunks, which can be concatenated to reconstruct the
   original object, and coding chunks, which can be used to rebuild a
   lost chunk.

*K*
   the number of data *chunks*, i.e. the number of *chunks* in which
   the original object is divided. For instance, if *K* = 2, a 10KB
   object will be divided into *K* chunks of 5KB each.

*M*
   the number of coding *chunks*, i.e. the number of additional
   *chunks* computed by the encoding functions. If there are 2 coding
   *chunks*, it means 2 OSDs can be lost without losing data.


Table of contents
-----------------

.. toctree::
    :maxdepth: 1

    erasure-code-profile
    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec