]>
Commit | Line | Data |
---|---|---|
1 | ============================ | |
2 | Erasure Code developer notes | |
3 | ============================ | |
4 | ||
5 | Introduction | |
6 | ------------ | |
7 | ||
8 | Each chapter of this document explains an aspect of the implementation | |
9 | of the erasure code within Ceph. It is mostly based on examples being | |
10 | explained to demonstrate how things work. | |
11 | ||
12 | Reading and writing encoded chunks from and to OSDs | |
13 | --------------------------------------------------- | |
14 | ||
15 | An erasure coded pool stores each object as K+M chunks. It is divided | |
16 | into K data chunks and M coding chunks. The pool is configured to have | |
17 | a size of K+M so that each chunk is stored in an OSD in the acting | |
18 | set. The rank of the chunk is stored as an attribute of the object. | |
19 | ||
20 | Let's say an erasure coded pool is created to use five OSDs ( K+M = | |
21 | 5 ) and sustain the loss of two of them ( M = 2 ). | |
22 | ||
23 | When the object *NYAN* containing *ABCDEFGHI* is written to it, the | |
24 | erasure encoding function splits the content in three data chunks, | |
25 | simply by dividing the content in three : the first contains *ABC*, | |
26 | the second *DEF* and the last *GHI*. The content will be padded if the | |
27 | content length is not a multiple of K. The function also creates two | |
28 | coding chunks : the fourth with *YXY* and the fifth with *GQC*. Each | |
29 | chunk is stored in an OSD in the acting set. The chunks are stored in | |
30 | objects that have the same name ( *NYAN* ) but reside on different | |
31 | OSDs. The order in which the chunks were created must be preserved and | |
32 | is stored as an attribute of the object ( shard_t ), in addition to its | |
33 | name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4* | |
34 | contains *YXY* and is stored on *OSD3*. | |
35 | ||
36 | :: | |
37 | ||
38 | +-------------------+ | |
39 | name | NYAN | | |
40 | +-------------------+ | |
41 | content | ABCDEFGHI | | |
42 | +--------+----------+ | |
43 | | | |
44 | | | |
45 | v | |
46 | +------+------+ | |
47 | +---------------+ encode(3,2) +-----------+ | |
48 | | +--+--+---+---+ | | |
49 | | | | | | | |
50 | | +-------+ | +-----+ | | |
51 | | | | | | | |
52 | +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ | |
53 | name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | | |
54 | +------+ +------+ +------+ +------+ +------+ | |
55 | shard | 1 | | 2 | | 3 | | 4 | | 5 | | |
56 | +------+ +------+ +------+ +------+ +------+ | |
57 | content | ABC | | DEF | | GHI | | YXY | | QGC | | |
58 | +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ | |
59 | | | | | | | |
60 | | | | | | | |
61 | | | +--+---+ | | | |
62 | | | | OSD1 | | | | |
63 | | | +------+ | | | |
64 | | | +------+ | | | |
65 | | +------>| OSD2 | | | | |
66 | | +------+ | | | |
67 | | +------+ | | | |
68 | | | OSD3 |<----+ | | |
69 | | +------+ | | |
70 | | +------+ | | |
71 | | | OSD4 |<--------------+ | |
72 | | +------+ | |
73 | | +------+ | |
74 | +----------------->| OSD5 | | |
75 | +------+ | |
76 | ||
77 | ||
78 | ||
79 | ||
80 | When the object *NYAN* is read from the erasure coded pool, the | |
81 | decoding function reads three chunks : chunk *1* containing *ABC*, | |
82 | chunk *3* containing *GHI* and chunk *4* containing *YXY* and rebuild | |
83 | the original content of the object *ABCDEFGHI*. The decoding function | |
84 | is informed that the chunks *2* and *5* are missing ( they are called | |
85 | *erasures* ). The chunk *5* could not be read because the *OSD4* is | |
86 | *out*. | |
87 | ||
88 | The decoding function could be called as soon as three chunks are | |
89 | read : *OSD2* was the slowest and its chunk does not need to be taken into | |
90 | account. This optimization is not implemented in Firefly. | |
91 | ||
92 | :: | |
93 | ||
94 | +-------------------+ | |
95 | name | NYAN | | |
96 | +-------------------+ | |
97 | content | ABCDEFGHI | | |
98 | +--------+----------+ | |
99 | ^ | |
100 | | | |
101 | | | |
102 | +------+------+ | |
103 | | decode(3,2) | | |
104 | | erasures 2,5| | |
105 | +-------------->| | | |
106 | | +-------------+ | |
107 | | ^ ^ | |
108 | | | +-----+ | |
109 | | | | | |
110 | +--+---+ +------+ +--+---+ +--+---+ | |
111 | name | NYAN | | NYAN | | NYAN | | NYAN | | |
112 | +------+ +------+ +------+ +------+ | |
113 | shard | 1 | | 2 | | 3 | | 4 | | |
114 | +------+ +------+ +------+ +------+ | |
115 | content | ABC | | DEF | | GHI | | YXY | | |
116 | +--+---+ +--+---+ +--+---+ +--+---+ | |
117 | ^ . ^ ^ | |
118 | | TOO . | | | |
119 | | SLOW . +--+---+ | | |
120 | | ^ | OSD1 | | | |
121 | | | +------+ | | |
122 | | | +------+ | | |
123 | | +-------| OSD2 | | | |
124 | | +------+ | | |
125 | | +------+ | | |
126 | | | OSD3 |-----+ | |
127 | | +------+ | |
128 | | +------+ | |
129 | | | OSD4 | OUT | |
130 | | +------+ | |
131 | | +------+ | |
132 | +------------------| OSD5 | | |
133 | +------+ | |
134 | ||
135 | ||
136 | Erasure code library | |
137 | -------------------- | |
138 | ||
139 | Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_, | |
140 | with parameters K+M, object O is encoded by dividing it into chunks O1, | |
141 | O2, ... OM and computing coding chunks P1, P2, ... PK. Any K chunks | |
142 | out of the available K+M chunks can be used to obtain the original | |
143 | object. If data chunk O2 or coding chunk P2 are lost, they can be | |
144 | repaired using any K chunks out of the K+M chunks. If more than M | |
145 | chunks are lost, it is not possible to recover the object. | |
146 | ||
147 | Reading the original content of object O can be a simple | |
148 | concatenation of O1, O2, ... OM, because the plugins are using | |
149 | `systematic codes | |
150 | <http://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks | |
151 | must be given to the erasure code library *decode* method to retrieve | |
152 | the content of the object. | |
153 | ||
154 | Performance depend on the parameters to the encoding functions and | |
155 | is also influenced by the packet sizes used when calling the encoding | |
156 | functions ( for Cauchy or Liberation for instance ): smaller packets | |
157 | means more calls and more overhead. | |
158 | ||
159 | Although Reed-Solomon is provided as a default, Ceph uses it via an | |
160 | `abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to | |
161 | allow each pool to choose the plugin that implements it using | |
162 | key=value pairs stored in an `erasure code profile`_. | |
163 | ||
164 | .. _erasure code profile: ../../../erasure-coded-pool | |
165 | ||
166 | :: | |
167 | ||
168 | $ ceph osd erasure-code-profile set myprofile \ | |
169 | ruleset-failure-domain=osd | |
170 | $ ceph osd erasure-code-profile get myprofile | |
171 | directory=/usr/lib/ceph/erasure-code | |
172 | k=2 | |
173 | m=1 | |
174 | plugin=jerasure | |
175 | technique=reed_sol_van | |
176 | ruleset-failure-domain=osd | |
177 | $ ceph osd pool create ecpool 12 12 erasure myprofile | |
178 | ||
179 | The *plugin* is dynamically loaded from *directory* and expected to | |
180 | implement the *int __erasure_code_init(char *plugin_name, char *directory)* function | |
181 | which is responsible for registering an object derived from *ErasureCodePlugin* | |
182 | in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads: | |
183 | ||
184 | :: | |
185 | ||
186 | ErasureCodePluginRegistry &instance = | |
187 | ErasureCodePluginRegistry::instance(); | |
188 | instance.add(plugin_name, new ErasureCodePluginExample()); | |
189 | ||
190 | The *ErasureCodePlugin* derived object must provide a factory method | |
191 | from which the concrete implementation of the *ErasureCodeInterface* | |
192 | object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads: | |
193 | ||
194 | :: | |
195 | ||
196 | virtual int factory(const map<std::string,std::string> ¶meters, | |
197 | ErasureCodeInterfaceRef *erasure_code) { | |
198 | *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters)); | |
199 | return 0; | |
200 | } | |
201 | ||
202 | The *parameters* argument is the list of *key=value* pairs that were | |
203 | set in the erasure code profile, before the pool was created. | |
204 | ||
205 | :: | |
206 | ||
207 | ceph osd erasure-code-profile set myprofile \ | |
208 | directory=<dir> \ # mandatory | |
209 | plugin=jerasure \ # mandatory | |
210 | m=10 \ # optional and plugin dependant | |
211 | k=3 \ # optional and plugin dependant | |
212 | technique=reed_sol_van \ # optional and plugin dependant | |
213 | ||
214 | Notes | |
215 | ----- | |
216 | ||
217 | If the objects are large, it may be impractical to encode and decode | |
218 | them in memory. However, when using *RBD* a 1TB device is divided in | |
219 | many individual 4MB objects and *RGW* does the same. | |
220 | ||
221 | Encoding and decoding is implemented in the OSD. Although it could be | |
222 | implemented client side for read write, the OSD must be able to encode | |
223 | and decode on its own when scrubbing. |