================
CLAY code plugin
================

CLAY (short for coupled-layer) codes are erasure codes designed to bring about significant savings
in terms of network bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
reading from the *d=8* surviving OSDs to repair it, and recovery of, say, 1GiB of data needs
a download of 8 X 1GiB = 8GiB of information.

However, in the case of the *clay* plugin *d* is configurable within the limits:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as this provides the greatest savings in terms
of network bandwidth and disk IO. In the case of the *clay* plugin configured with
*k=8*, *m=4*, and *d=11*, when a single OSD fails, d=11 OSDs are contacted and
250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB.
More general parameters are provided below. The benefits are substantial
when the repair is carried out for a rack that stores information on the order of
terabytes.

+-------------+---------------------------------------------------------+
| plugin      | total amount of disk IO                                 |
+=============+=========================================================+
|jerasure,isa | :math:`k S`                                             |
+-------------+---------------------------------------------------------+
| clay        | :math:`\frac{d S}{d - k + 1} = \frac{(k + m - 1) S}{m}` |
+-------------+---------------------------------------------------------+

where *S* is the amount of data stored on a single OSD undergoing repair. In the table above, we have
used the largest possible value of *d*, as this will result in the smallest amount of data download needed
to achieve recovery from an OSD failure.

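The figures quoted earlier (8GiB for *jerasure*, 2.75GiB for *clay*) follow directly from the formulas in the table above. The following Python sketch is illustrative only, with hypothetical helper names that are not part of any Ceph API:

```python
# Total data downloaded to repair one failed OSD holding S bytes,
# using the two formulas from the table above. Illustrative only.

def jerasure_repair_io(k, s):
    # jerasure/isa read a full chunk of size S from each of k survivors
    return k * s

def clay_repair_io(k, m, s, d=None):
    # clay reads only S / (d - k + 1) bytes from each of d helper OSDs
    if d is None:
        d = k + m - 1          # default d, giving the greatest savings
    return d * s / (d - k + 1)

GiB = 1024 ** 3
print(jerasure_repair_io(8, GiB) / GiB)        # 8.0  (8GiB, as above)
print(clay_repair_io(8, 4, GiB, d=11) / GiB)   # 2.75 (2.75GiB, as above)
```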
Erasure-code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage:

.. prompt:: bash $

   ceph osd erasure-code-profile set CLAYprofile \
      plugin=clay \
      k=4 m=2 d=5 \
      crush-failure-domain=host
   ceph osd pool create claypool erasure CLAYprofile

Creating a clay profile
=======================

To create a new clay code profile:

.. prompt:: bash $

   ceph osd erasure-code-profile set {name} \
      plugin=clay \
      k={data-chunks} \
      m={coding-chunks} \
      [d={helper-chunks}] \
      [scalar_mds={plugin-name}] \
      [technique={technique-name}] \
      [crush-failure-domain={bucket-type}] \
      [crush-device-class={device-class}] \
      [directory={directory}] \
      [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each of which is stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding-chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the greater the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa*, or *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the ``scalar_mds`` plugin specified. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', and 'liber8tion' for jerasure; 'reed_sol_van'
              and 'cauchy' for isa; and 'single' and 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the CRUSH bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default


``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host**, no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the CRUSH device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.

Notion of sub-chunks
====================

The Clay code is able to save disk IO and network bandwidth because it
is a vector code: it is able to view and manipulate data within a chunk
at a finer granularity, termed a sub-chunk. The number of sub-chunks within
a chunk for a Clay code is given by:

    sub-chunk count = :math:`q^{\frac{k+m}{q}}`, where :math:`q = d - k + 1`

During the repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = :math:`\frac{\text{sub-chunk count}}{q}`

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4*, *d=11*, the sub-chunk count is 64 and the repair sub-chunk count
   is 16. A quarter of a chunk is read from an available OSD for the repair of a failed
   chunk.

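Both examples above can be verified with a short Python sketch. It is illustrative only (the function names are hypothetical, not Ceph APIs) and assumes that *k+m* is divisible by *q*, as it is in both examples:

```python
# Sub-chunk arithmetic, using the formulas given above.

def sub_chunk_count(k, m, d):
    # q = d - k + 1; sub-chunk count = q^((k+m)/q)
    q = d - k + 1
    assert k + 1 <= d <= k + m - 1, "d outside the allowed range"
    assert (k + m) % q == 0, "sketch assumes q divides k+m"
    return q ** ((k + m) // q)

def repair_sub_chunk_count(k, m, d):
    # only a 1/q fraction of the sub-chunks is read during repair
    return sub_chunk_count(k, m, d) // (d - k + 1)

print(sub_chunk_count(4, 2, 5), repair_sub_chunk_count(4, 2, 5))    # 8 4
print(sub_chunk_count(8, 4, 11), repair_sub_chunk_count(8, 4, 11))  # 64 16
```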
How to choose a configuration given a workload
==============================================

Only a few of the sub-chunks within a chunk are read during repair, and these sub-chunks
are not necessarily stored consecutively within a chunk. For best disk IO
performance, it is helpful to read contiguous data. For this reason, it is suggested that
you choose the stripe size such that the sub-chunk size is sufficiently large.

For a given stripe size (fixed based on the workload), choose ``k``, ``m``, ``d`` such that:

    sub-chunk size = :math:`\frac{\text{stripe size}}{k \times \text{sub-chunk count}}` = 4KB, 8KB, 12KB ...

#. For large-size workloads for which the stripe size is large, it is easy to choose k, m, d.
   For example, consider a stripe size of 64MB: choosing *k=16*, *m=4*, and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB.
#. For small-size workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.

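The first example above can be checked with the same arithmetic. This Python sketch is illustrative only (the function name is hypothetical, not a Ceph API), and assumes *k+m* is divisible by *q*:

```python
# sub-chunk size = stripe size / (k * sub-chunk count), per the rule above.

def sub_chunk_size(stripe_size, k, m, d):
    q = d - k + 1
    count = q ** ((k + m) // q)   # sub-chunk count formula from above
    return stripe_size // (k * count)

MB = 1024 * 1024
# 64MB stripe with k=16, m=4, d=19 -> 1024 sub-chunks of 4KB each
print(sub_chunk_size(64 * MB, 16, 4, 19))  # 4096
```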
Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network
bandwidth and disk IO during single-OSD recovery. However, the focus of LRCs is to keep the
number of OSDs contacted during repair (d) minimal, and this comes at the cost of storage overhead.
The *clay* code has a storage overhead of m/k. An *lrc* stores (k+m)/d parities in
addition to the ``m`` parities, resulting in a storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc*
can recover from the failure of any ``m`` OSDs.

+-----------------+----------------------------------+----------------------------------+
| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+=================+==================================+==================================+
| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+-----------------+----------------------------------+----------------------------------+
| (k=16, m=4)     | 4 * S, 0.5625 (d=4)              | 4.75 * S, 0.25 (d=19)            |
+-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on the single OSD being recovered.
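
Each table entry follows from the formulas stated in this section. As a sanity check, here is an illustrative Python sketch (the helper names are hypothetical, not Ceph APIs):

```python
# Repair disk IO (as a multiple of S) and storage overhead for LRC
# and CLAY, using the formulas given in the text above.

def lrc_stats(k, m, d):
    # LRC reads d full chunks; it stores (k+m)/d local parities
    # in addition to the m global parities
    return d, (m + (k + m) / d) / k

def clay_stats(k, m, d):
    # CLAY reads a 1/(d-k+1) fraction of a chunk from each of d helpers
    return d / (d - k + 1), m / k

print(lrc_stats(10, 4, 7), clay_stats(10, 4, 13))   # (7, 0.6) (3.25, 0.4)
print(lrc_stats(16, 4, 4), clay_stats(16, 4, 19))   # (4, 0.5625) (4.75, 0.25)
```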