.. _ecpool:

=============
 Erasure code
=============

A Ceph pool is associated with a type that determines how it sustains
the loss of an OSD (i.e. a disk, since most of the time there is one
OSD per disk). The default choice when `creating a pool <../pools>`_
is *replicated*, meaning every object is copied onto multiple disks.
The `Erasure Code <https://en.wikipedia.org/wiki/Erasure_code>`_ pool
type can be used instead to save space.

Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts::

    $ ceph osd pool create ecpool erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

.. note:: the number of `placement groups <../pools>`_ can optionally be
   specified right after the pool name in *pool create*; if it is omitted,
   a default value is used.

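If you want to confirm that the object was stored, the contents of the
pool can be listed; *ecpool* and *NYAN* are simply the names used in the
example above::

    $ rados --pool ecpool ls
    NYAN
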
Erasure code profiles
---------------------

The default erasure code profile sustains the loss of two OSDs. It
is equivalent to a replicated pool of size three but requires 2TB
instead of 3TB to store 1TB of data. The default profile can be
displayed with::

    $ ceph osd erasure-code-profile get default
    k=2
    m=2
    plugin=jerasure
    crush-failure-domain=host
    technique=reed_sol_van

Choosing the right profile is important because it cannot be modified
after the pool is created: a new pool with a different profile needs
to be created and all objects from the previous pool moved to the new
pool.

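To check which profile an existing pool was created with, the defined
profiles and the pool itself can be queried; *ecpool* is the pool created
in the example above::

    $ ceph osd erasure-code-profile ls
    default
    $ ceph osd pool get ecpool erasure_code_profile
    erasure_code_profile: default
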
The most important parameters of the profile are *K*, *M* and
*crush-failure-domain* because they define the storage overhead and
the data durability. For instance, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined::

    $ ceph osd erasure-code-profile set myprofile \
       k=3 \
       m=2 \
       crush-failure-domain=rack
    $ ceph osd pool create ecpool erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

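The newly created profile can be displayed the same way as the default
one; the command below reports the values that were just set, together
with the defaults filled in by the plugin::

    $ ceph osd erasure-code-profile get myprofile
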
The *NYAN* object will be divided into three *chunks* (*K=3*) and two
additional *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

.. ditaa::
                           +-------------------+
                      name |       NYAN        |
                           +-------------------+
                   content |     ABCDEFGHI     |
                           +--------+----------+
                                    |
                                    |
                                    v
                             +------+------+
             +---------------+ encode(3,2) +-------------+
             |               +--+--+---+---+             |
             |                  |  |   |                 |
             |          +-------+  |   +------+          |
             |          |          |          |          |
          +--v---+   +--v---+   +--v---+   +--v---+   +--v---+
     name | NYAN |   | NYAN |   | NYAN |   | NYAN |   | NYAN |
          +------+   +------+   +------+   +------+   +------+
    shard |  1   |   |  2   |   |  3   |   |  4   |   |  5   |
          +------+   +------+   +------+   +------+   +------+
  content | ABC  |   | DEF  |   | GHI  |   | YXY  |   | QGC  |
          +--+---+   +--+---+   +--+---+   +--+---+   +--+---+
             |          |          |          |          |
             |          |          v          |          |
             |          |       +--+---+      |          |
             |          |       | OSD1 |      |          |
             |          |       +------+      |          |
             |          |                     |          |
             |          |       +------+      |          |
             |          +------>| OSD2 |      |          |
             |                  +------+      |          |
             |                                |          |
             |                  +------+      |          |
             |                  | OSD3 |<-----+          |
             |                  +------+                 |
             |                                           |
             |                  +------+                 |
             |                  | OSD4 |<----------------+
             |                  +------+
             |
             |                  +------+
             +----------------->| OSD5 |
                                +------+


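To see where the *chunks* of an object were actually placed, the object
can be mapped to its placement group and set of OSDs; *ecpool* and
*NYAN* are the names from the example above, and the OSD ids reported
will of course differ from one cluster to another::

    $ ceph osd map ecpool NYAN

The *up* and *acting* sets in the output list the five (*K+M*) OSDs
holding the chunks.
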
More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools only work with workloads such as RGW
that perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool::

    ceph osd pool set ec_pool allow_ec_overwrites true

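The current value of the flag can be checked with the corresponding
*get* command; *ec_pool* is simply the pool name used above::

    ceph osd pool get ec_pool allow_ec_overwrites
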
This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
ec overwrites yields low performance compared to bluestore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation::

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

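Putting the pieces together, a minimal sketch of the whole sequence
could look like the following; the pool and image names are simply the
ones used in this section, and the pool creation options on a real
cluster may differ::

    # metadata (and omap) go to a replicated pool, data to the ec pool
    ceph osd pool create replicated_pool
    ceph osd pool create ec_pool erasure
    ceph osd pool set ec_pool allow_ec_overwrites true
    rbd pool init replicated_pool
    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

``rbd info replicated_pool/image_name`` should then report *ec_pool* as
the image's data pool.
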
For CephFS, an erasure coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.


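For an existing file system, one possible sequence is to add the erasure
coded pool as an additional data pool and then point a directory at it
with a file layout; *cephfs* is an assumed file system name and
*/mnt/cephfs/ec-data* an assumed mount point, so adjust them to your
setup::

    ceph fs add_data_pool cephfs ec_pool
    mkdir /mnt/cephfs/ec-data
    setfattr -n ceph.dir.layout.pool -v ec_pool /mnt/cephfs/ec-data

New files created under that directory will have their data written to
*ec_pool*.
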
Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.

For instance, if the pool *hot-storage* is made of fast storage::

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.

Erasure coded pool recovery
---------------------------

If an erasure coded pool loses some shards, it must recover them from the others.
This generally involves reading from the remaining shards, reconstructing the missing
data, and writing it to the new OSDs.
In Octopus, erasure coded pools can recover as long as there are at least *K* shards
available. (With fewer than *K* shards, you have actually lost data!)

Prior to Octopus, erasure coded pools required at least *min_size* shards to be
available, even if *min_size* was greater than *K*. (We generally recommend that
*min_size* be *K+2* or more to prevent loss of writes and data.)
This conservative decision was made out of an abundance of caution when designing
the new pool mode, but it also meant that pools with lost OSDs but no data loss were
unable to recover and go active without manual intervention to change *min_size*.

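*min_size* is an ordinary pool setting, so it can be inspected and, if
necessary, adjusted per pool; *ecpool* is the pool used in the examples
above, and *4* is only an illustrative value (it corresponds to *K+1*
for the *K=3*, *M=2* profile defined earlier)::

    $ ceph osd pool get ecpool min_size
    $ ceph osd pool set ecpool min_size 4

Lowering *min_size* trades safety for availability, so it should only be
done deliberately.
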
Glossary
--------

*chunk*
   when the encoding function is called, it returns chunks of the same
   size: data chunks, which can be concatenated to reconstruct the original
   object, and coding chunks, which can be used to rebuild a lost chunk.

*K*
   the number of data *chunks*, i.e. the number of *chunks* into which the
   original object is divided. For instance, if *K* = 2, a 10KB object
   will be divided into *K* chunks of 5KB each.

*M*
   the number of coding *chunks*, i.e. the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*,
   it means 2 OSDs can be out without losing data.


Table of contents
-----------------

.. toctree::
    :maxdepth: 1

    erasure-code-profile
    erasure-code-jerasure
    erasure-code-isa
    erasure-code-lrc
    erasure-code-shec
    erasure-code-clay