======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.

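For instance, such a profile could be created as follows (a sketch
following the syntax described below; the profile and pool names are
arbitrary)::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=8 m=4 l=4 \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
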
Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         ruleset-locality=rack \
         ruleset-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [ruleset-root={root}] \
         [ruleset-locality={bucket-type}] \
         [ruleset-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding-chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``ruleset-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``ruleset-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``ruleset-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host**, no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.

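Once a profile exists, it can be listed and its stored parameters
displayed with the standard profile commands (assuming a profile named
*LRCprofile* as in the examples above)::

    $ ceph osd erasure-code-profile ls
    $ ceph osd erasure-code-profile get LRCprofile
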
Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks into
datacenters. The **k/m/l** parameters are implemented by generating a
low level configuration.

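For instance, the two-group layout mentioned above, one group of four
chunks and one group of three, could be sketched as follows (the
*mapping* and *layers* syntax used here is described in the sections
below; the exact layout is only illustrative)::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=_DDD_DD \
         layers='[ [ "cDDD___", "" ], [ "____cDD", "" ] ]'
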
The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks. And the
loss of chunk *2* can be recovered with the first four chunks.

Erasure code profile examples using low level configuration
============================================================

Minimal testing
---------------

It is strictly equivalent to using the default erasure code profile. The *DD*
implies *k=2*, the *c* implies *m=1* and the *jerasure* plugin is used
by default.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]' \
         ruleset-steps='[
                          [ "choose", "rack", 2 ],
                          [ "chooseleaf", "host", 4 ]
                        ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per layer basis using the low
level configuration. The second argument in ``layers='[ [ "DDc", "" ] ]'``
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in the chunks 1 and 5, respectively.

*Step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

*Step 3* stores a single coding chunk *c* at position 4. It reads the
three chunks at positions *5, 6, 7*, including the coding chunk that
*step 1* stored at position *5*, i.e. a coding chunk from *step 1*
becomes a data chunk in *step 3*.

If chunk *2* is lost::

   chunk nr    01234567

   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

*Step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows it to
recover the content of chunk *2*. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

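This works because a step with a single coding chunk behaves like a
simple parity block (with the default backend a lone coding chunk is
the XOR of the data chunks of its step, an assumption that may not
hold for every backend), so for *step 2*::

   chunk 0 = chunk 1 XOR chunk 2 XOR chunk 3

and the missing chunk can be recomputed from the three others::

   chunk 2 = chunk 0 XOR chunk 1 XOR chunk 3
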
If chunks *2, 3, 6* are lost::

   chunk nr    01234567

   step 1      _c  _c D
   step 2      cD  __ _
   step 3      __  cD D

*Step 3* can recover the content of chunk *6*::

   chunk nr    01234567

   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

*Step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, make it
possible to recover the content of chunks *2* and *3*::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   ruleset-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that selects two crush buckets of type
*rack* and, for each of them, chooses four OSDs, each located in a
different bucket of type *host*.

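The same placement idea can be applied at other levels of the crush
hierarchy. For instance (an illustrative sketch, assuming the crush
map defines buckets of type *datacenter*), the two groups of four
chunks could be spread across datacenters instead of racks with::

   ruleset-steps='[ [ "choose", "datacenter", 2 ], [ "chooseleaf", "host", 4 ] ]'
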
The ruleset can also be manually crafted for finer control.