======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.

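As a back-of-the-envelope illustration of those figures (hypothetical
Python helpers, not part of Ceph), repairing one lost chunk with plain
*jerasure* reads every remaining chunk, while *lrc* only reads the
surviving members of the lost chunk's local set::

    # Hypothetical helpers restating the repair costs quoted above.
    def repair_reads_jerasure(k, m):
        # one lost chunk: read all k + m - 1 remaining chunks
        return k + m - 1

    def repair_reads_lrc(l):
        # one lost chunk: read the other l chunks of its local set
        # (l chunks plus one local parity, minus the lost one)
        return l

    print(repair_reads_jerasure(k=8, m=4))  # 11
    print(repair_reads_lrc(l=4))            # 4
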
Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-locality=rack \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [crush-root={root}] \
         [crush-locality={bucket-type}] \
         [crush-failure-domain={bucket-type}] \
         [crush-device-class={device-class}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              CRUSH rule step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile of the same name.

:Type: String
:Required: No.

Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks within
datacenters. The **k/m/l** profile parameters are implemented by
generating a low level configuration.

The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks. And the
loss of chunk *2* can be recovered with the first four
chunks.

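This claim can be checked mechanically. The following Python sketch
(an illustration only, not the plugin's code) finds, for a single lost
chunk, which chunks a repair reads, trying the steps in the reverse
order used by the decoding algorithm described later in this
document::

    # Illustration only: which chunks a single-loss repair reads,
    # given the step descriptions above.
    steps = ["_cDD_cDD", "cDDD____", "____cDDD"]

    def reads_for(lost):
        # Try the steps in reverse order, as decoding does; the first
        # step covering the lost chunk supplies the chunks to read.
        for step in reversed(steps):
            if step[lost] != '_':
                return [i for i, c in enumerate(step)
                        if c != '_' and i != lost]

    print(reads_for(7))  # [4, 5, 6] -- within the last four chunks
    print(reads_for(2))  # [0, 1, 3] -- within the first four chunks
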
Erasure code profile examples using low level configuration
===========================================================

Minimal testing
---------------

The following profile is strictly equivalent to using the default
erasure code profile. The *DD* implies *K=2*, the *c* implies *M=1*
and the *jerasure* plugin is used by default.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

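The correspondence between a layer description and *K*/*M* is a matter
of counting letters, as a throwaway Python helper (hypothetical, shown
only for illustration) makes explicit::

    # Hypothetical helper: derive K and M from a layer description by
    # counting its data ('D') and coding ('c') positions.
    def km_from_layer(layer):
        return layer.count('D'), layer.count('c')

    print(km_from_layer("DDc"))       # (2, 1)
    print(km_from_layer("_cDD_cDD"))  # (4, 2)
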
Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ]
                 ]' \
         crush-steps='[
                        [ "choose", "rack", 2 ],
                        [ "chooseleaf", "host", 4 ]
                      ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per layer basis using the low
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer.::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ]
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in chunks 1 and 5, respectively.

The *step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

The *step 3* stores a single coding chunk *c* at position 4. The three
chunks created by *step 1* are used to compute this coding chunk,
i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.

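A short Python sketch (an illustration only, not the plugin's code)
makes the resulting layout explicit by labelling each position with
what the encode pass stores there::

    # Illustration only: label each chunk position according to the
    # mapping and the step that writes its coding chunk.
    mapping = "__DD__DD"
    steps = ["_cDD_cDD", "cDDD____", "____cDDD"]

    labels = {i: "data" for i, ch in enumerate(mapping) if ch == 'D'}
    for n, step in enumerate(steps, 1):
        for i, ch in enumerate(step):
            if ch == 'c':
                labels[i] = "coding chunk written by step %d" % n

    for i in sorted(labels):
        print(i, labels[i])
    # 0 coding chunk written by step 2
    # 1 coding chunk written by step 1
    # 2 data
    # 3 data
    # 4 coding chunk written by step 3
    # 5 coding chunk written by step 1
    # 6 data
    # 7 data
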
If chunk *2* is lost::

   chunk nr    01234567

   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows
recovering the content of chunk *2*. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

   chunk nr    01234567

   step 1      _c  _c D
   step 2      cD  __ _
   step 3      __  cD D

The *step 3* can recover the content of chunk *6*::

   chunk nr    01234567

   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

The *step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow
recovering the content of chunks *2* and *3*::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

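The walkthrough above generalizes to a simple loop. The following
Python sketch (an illustration only, under the assumption that each
step can repair at most as many missing chunks as it has coding
chunks) reproduces the recovery order just described::

    # Illustration only: iterate over the steps in reverse order,
    # repairing a step's missing chunks whenever it is missing no more
    # chunks than it has coding chunks ('c').
    steps = ["_cDD_cDD", "cDDD____", "____cDDD"]

    def decode(lost):
        lost, order = set(lost), []
        progress = True
        while lost and progress:
            progress = False
            for step in reversed(steps):
                covered = {i for i, ch in enumerate(step) if ch != '_'}
                missing = lost & covered
                if missing and len(missing) <= step.count('c'):
                    order += sorted(missing)   # this step repairs them
                    lost -= missing
                    progress = True
        return order

    print(decode({2}))        # [2]       -- via step 2
    print(decode({2, 3, 6}))  # [6, 2, 3] -- step 3, then step 1
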
Controlling CRUSH placement
===========================

The default CRUSH rule provides OSDs that are on different hosts. For instance::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a rule that selects two crush buckets of type
*rack* and, for each of them, chooses four OSDs, each located in
a different bucket of type *host*.

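As a toy illustration of what such a rule selects (plain Python, not
CRUSH itself; the rack, host and OSD names are made up), the two steps
translate into nested sampling::

    # Toy model of the two-step rule above: pick 2 racks, then 4
    # distinct hosts per rack, taking one OSD from each host.
    # (Illustration only; real CRUSH uses deterministic hashing.)
    import random

    cluster = {
        "rack-a": {"host-%d" % i: ["osd.%d" % i] for i in range(4)},
        "rack-b": {"host-%d" % i: ["osd.%d" % i] for i in range(4, 8)},
    }

    placement = []
    for rack in random.sample(sorted(cluster), 2):            # choose rack 2
        for host in random.sample(sorted(cluster[rack]), 4):  # chooseleaf host 4
            placement.append(random.choice(cluster[rack][host]))

    print(placement)  # eight OSDs: four per rack, one per host
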
The CRUSH rule can also be manually crafted for finer control.