=============
Erasure code
=============

A Ceph pool is associated with a type that determines how it sustains
the loss of an OSD (i.e. a disk, since most of the time there is one
OSD per disk). The default choice when `creating a pool <../pools>`_
is *replicated*, meaning every object is copied onto multiple disks.
The `Erasure Code <https://en.wikipedia.org/wiki/Erasure_code>`_ pool
type can be used instead to save space.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts::

    $ ceph osd pool create ecpool 12 12 erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

.. note:: the 12 in *pool create* stands for
          `the number of placement groups <../pools>`_.

Erasure code profiles
---------------------

The default erasure code profile sustains the loss of a single OSD. It
is equivalent to a replicated pool of size two but requires 1.5TB
instead of 2TB to store 1TB of data. The default profile can be
displayed with::

    $ ceph osd erasure-code-profile get default
    k=2
    m=1
    plugin=jerasure
    ruleset-failure-domain=host
    technique=reed_sol_van

Choosing the right profile is important because it cannot be modified
after the pool is created: a new pool with a different profile needs
to be created and all objects from the previous pool moved to the new
one.

The most important parameters of the profile are *K*, *M* and
*ruleset-failure-domain* because they define the storage overhead and
the data durability. For instance, if the desired architecture must
sustain the loss of two racks with a storage overhead of about 67%
(*K* + *M* = 5 chunks are stored for every *K* = 3 chunks worth of
data), the following profile can be defined::

    $ ceph osd erasure-code-profile set myprofile \
       k=3 \
       m=2 \
       ruleset-failure-domain=rack
    $ ceph osd pool create ecpool 12 12 erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

The *NYAN* object will be divided into three (*K=3*) and two
additional *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*ruleset-failure-domain=rack* will create a CRUSH ruleset that ensures
no two *chunks* are stored in the same rack.

.. ditaa::
                               +-------------------+
                          name |       NYAN       |
                               +-------------------+
                       content |     ABCDEFGHI     |
                               +--------+----------+
                                        |
                                        |
                                        v
                                 +------+------+
                 +---------------+ encode(3,2) +-------------+
                 |               +--+--+---+---+             |
                 |                  |  |   |                 |
                 |          +-------+  |   +------+          |
                 |          |          |          |          |
              +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
        name  | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
              +------+   +------+   +------+   +------+   +------+
       shard  |  1   |   |  2   |   |  3   |   |  4   |   |  5   |
              +------+   +------+   +------+   +------+   +------+
     content  |  ABC |   |  DEF |   |  GHI |   |  YXY |   |  QGC |
              +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
                 |          |          |          |          |
                 |          |          v          |          |
                 |          |       +--+---+      |          |
                 |          |       | OSD1 |      |          |
                 |          |       +------+      |          |
                 |          |                     |          |
                 |          |       +------+      |          |
                 |          +------>| OSD2 |      |          |
                 |                  +------+      |          |
                 |                                |          |
                 |                  +------+      |          |
                 |                  | OSD3 |<-----+          |
                 |                  +------+                 |
                 |                                           |
                 |                  +------+                 |
                 |                  | OSD4 |<----------------+
                 |                  +------+
                 |
                 |                  +------+
                 +----------------->| OSD5 |
                                    +------+

More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools only work with use cases like RGW that
perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool::

    ceph osd pool set ec_pool allow_ec_overwrites true

This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
ec overwrites yields lower performance compared to bluestore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation::

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, using an erasure coded pool means setting that pool in
a `file layout <../../../cephfs/file-layouts>`_.

Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these limitations,
one can set up a `cache tier <../cache-tiering>`_ before the erasure
coded pool.

For instance, if the pool *hot-storage* is made of fast storage::

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.

Glossary
--------

*chunk*
   when the encoding function is called, it returns chunks of the same
   size: data chunks, which can be concatenated to reconstruct the
   original object, and coding chunks, which can be used to rebuild a
   lost chunk.

*K*
   the number of data *chunks*, i.e. the number of *chunks* in which
   the original object is divided. For instance, if *K* = 2, a 10KB
   object will be divided into *K* chunks of 5KB each.

*M*
   the number of coding *chunks*, i.e. the number of additional
   *chunks* computed by the encoding functions. If there are 2 coding
   *chunks*, it means 2 OSDs can be lost without losing data.


Table of contents
-----------------

.. toctree::
    :maxdepth: 1

    erasure-code-profile
    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec