.. _ecpool:

=============
Erasure code
=============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the "replication".

In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.
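
The parity idea can be illustrated with a toy single-parity code, analogous to
RAID5. This is a sketch for intuition only; Ceph's default jerasure plugin uses
Reed-Solomon coding, which generalizes the idea to more than one coding chunk:

.. code-block:: python

   # Toy single-parity erasure code: k data chunks plus one coding chunk.
   # For intuition only -- not how Ceph's jerasure plugin is implemented.

   def encode(data_chunks):
       """Return the coding chunk: the bytewise XOR of all data chunks."""
       parity = bytes(len(data_chunks[0]))
       for chunk in data_chunks:
           parity = bytes(a ^ b for a, b in zip(parity, chunk))
       return parity

   def recover(surviving_chunks, parity):
       """Rebuild the single missing data chunk from survivors and parity."""
       missing = parity
       for chunk in surviving_chunks:
           missing = bytes(a ^ b for a, b in zip(missing, chunk))
       return missing

   data = [b"ABC", b"DEF", b"GHI"]            # k=3 data chunks
   parity = encode(data)                      # one coding chunk
   assert recover([data[0], data[2]], parity) == b"DEF"   # lost chunk rebuilt

Because XOR is its own inverse, XORing the parity with the surviving data
chunks yields the missing chunk; this is exactly why such a scheme survives
the loss of any single drive.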


Creating a sample erasure coded pool
------------------------------------

The simplest erasure coded pool is equivalent to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

Erasure code profiles
---------------------

The default erasure code profile can sustain the loss of two OSDs. This erasure
code profile is equivalent to a replicated pool of size three, but requires
2TB to store 1TB of data instead of 3TB to store 1TB of data. The default
profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

.. note::
   The default erasure-coded pool, the profile of which is displayed here, is
   not the same as the simplest erasure-coded pool.

   The default erasure-coded pool has two data chunks (k) and two coding chunks
   (m). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (k) and one coding chunk
   (m). The profile of the simplest erasure-coded pool is "k=2 m=1".
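
The storage arithmetic behind this comparison is simple: an object of size *S*
consumes *S × (K+M) / K* of raw capacity in an erasure-coded pool. A quick
sketch of that calculation (not a Ceph API):

.. code-block:: python

   def raw_usage(data_tb, k, m):
       """Raw capacity consumed by data_tb of user data under a k+m profile."""
       return data_tb * (k + m) / k

   assert raw_usage(1, 2, 2) == 2.0   # default profile: 2TB of raw per 1TB
   assert raw_usage(1, 2, 1) == 1.5   # simplest profile (RAID5-like): 1.5TB
   # A replicated pool of size three would need 3TB for the same 1TB of data.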

Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a profile different from the one you have created, you must create a new pool
with a different (and presumably more carefully considered) profile. When the
new pool is created, all objects from the wrongly configured pool must be moved
to the newly created pool. There is no way to alter the profile of a pool after
its creation.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain*, because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data chunks (*K=3*) and two
additional coding *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. Setting
*crush-failure-domain=rack* creates a CRUSH rule that ensures no two *chunks*
are stored in the same rack.
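
The division into data chunks can be sketched as follows (the coding-chunk
payloads *YXY* and *QGC* shown in the figure stand in for actual Reed-Solomon
output and are not computed here):

.. code-block:: python

   def split(obj, k):
       """Divide an object into k equally sized data chunks."""
       assert len(obj) % k == 0   # Ceph pads the object when needed
       size = len(obj) // k
       return [obj[i * size:(i + 1) * size] for i in range(k)]

   chunks = split(b"ABCDEFGHI", k=3)
   assert chunks == [b"ABC", b"DEF", b"GHI"]
   # With M=2 coding chunks added, any two of the five chunks can be
   # lost without losing data.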

.. ditaa::

                          +-------------------+
                     name |       NYAN        |
                          +-------------------+
                  content |     ABCDEFGHI     |
                          +--------+----------+
                                   |
                                   |
                                   v
                            +------+------+
              +-------------+ encode(3,2) +-----------+
              |             +--+--+---+---+           |
              |                |  |   |               |
              |         +------+  |   +-----+         |
              |         |         |         |         |
           +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
      name | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
           +------+  +------+  +------+  +------+  +------+
     shard |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
           +------+  +------+  +------+  +------+  +------+
   content |  ABC |  |  DEF |  |  GHI |  |  YXY |  |  QGC |
           +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
              |         |         |         |         |
              |         |         v         |         |
              |         |      +--+---+     |         |
              |         |      | OSD1 |     |         |
              |         |      +------+     |         |
              |         |                   |         |
              |         |      +------+     |         |
              |         +----->| OSD2 |     |         |
              |                +------+     |         |
              |                             |         |
              |                +------+     |         |
              |                | OSD3 |<----+         |
              |                +------+               |
              |                                       |
              |                +------+               |
              |                | OSD4 |<--------------+
              |                +------+
              |
              |                +------+
              +--------------->| OSD5 |
                               +------+


More information can be found in the `erasure code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure coded pools work only with applications like RGW that
perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, it also yields lower performance than BlueStore.

Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.


Erasure coded pool and cache tiering
------------------------------------

Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.

For instance, if the pool *hot-storage* is made of fast storage:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

These commands place the *hot-storage* pool as a tier of *ecpool* in
*writeback* mode, so that every write and read to *ecpool* actually uses
*hot-storage* and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure coded pool recovery
---------------------------

If an erasure coded pool loses some data shards, it must recover them from the
others. This involves reading from the remaining shards, reconstructing the
data, and writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as at
least *K* shards are available. (With fewer than *K* shards, you have actually
lost data!)

Prior to Octopus, erasure coded pools required at least ``min_size`` shards to
be available, even if ``min_size`` was greater than ``K``. This conservative
decision was made out of an abundance of caution when designing the new pool
mode. As a result, pools that had lost OSDs, but had not suffered complete
loss of any data, were unable to recover and go active without manual
intervention to temporarily change the ``min_size`` setting. We recommend
that ``min_size`` be ``K+2`` or more to prevent loss of writes and data.
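
The two behaviours can be summarized as a predicate on the number of surviving
shards (a sketch; ``octopus_or_later`` is an illustrative flag, not a real Ceph
option):

.. code-block:: python

   def can_recover(available_shards, k, min_size, octopus_or_later=True):
       """Whether a PG can go active and recover, given surviving shards."""
       if available_shards < k:
           return False                     # fewer than K shards: data is lost
       if octopus_or_later:
           return True                      # K shards suffice since Octopus
       return available_shards >= min_size  # older releases also check min_size

   # k=2, m=2 pool with min_size=3, two shards surviving:
   assert can_recover(2, k=2, min_size=3, octopus_or_later=True)
   assert not can_recover(2, k=2, min_size=3, octopus_or_later=False)
   assert not can_recover(1, k=2, min_size=3)   # data genuinely lost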

Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size:
   data chunks, which can be concatenated to reconstruct the original object,
   and coding chunks, which can be used to rebuild a lost chunk.

*K*
   The number of data *chunks*, i.e. the number of *chunks* into which the
   original object is divided. For instance, if *K* = 2, a 10KB object
   will be divided into *K* chunks of 5KB each.

*M*
   The number of coding *chunks*, i.e. the number of additional *chunks*
   computed by the encoding functions. If there are 2 coding *chunks*,
   then 2 OSDs can be out without losing data.

Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay