By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".

By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

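The relationship between the two kinds of chunks can be illustrated with a toy
XOR parity scheme (a simplified sketch only; Ceph's erasure-code plugins
implement real codes such as Reed-Solomon):

.. code-block:: python

   # A toy illustration with K=2 data chunks and M=1 coding (parity) chunk,
   # using XOR parity. This is a sketch, not Ceph's actual implementation.

   data = b"ABCDEFGH"
   data_chunks = [data[:4], data[4:]]                          # K = 2
   coding_chunk = bytes(a ^ b for a, b in zip(*data_chunks))   # M = 1

   # Concatenating the data chunks reconstructs the original object.
   assert b"".join(data_chunks) == data

   # The coding chunk can rebuild a lost data chunk from the survivor.
   rebuilt = bytes(a ^ b for a, b in zip(data_chunks[1], coding_chunk))
   assert rebuilt == data_chunks[0]
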
Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.

Creating a sample erasure-coded pool
------------------------------------

The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

Write an object into the pool and read it back:

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. This erasure-code profile is equivalent to a replicated
pool of size three, but with different storage requirements: instead of
requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

The profile just displayed is for the *default* erasure-coded pool, not the
*simplest* erasure-coded pool. These two pools are not the same:

The default erasure-coded pool has two data chunks (K) and two coding chunks
(M). The profile of the default erasure-coded pool is "k=2 m=2".

The simplest erasure-coded pool has two data chunks (K) and one coding chunk
(M). The profile of the simplest erasure-coded pool is "k=2 m=1".

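The storage cost of the two profiles follows directly from the ratio
``(K + M) / K``; this back-of-envelope sketch reproduces the 2TB-per-1TB
figure given above (the function name is illustrative, not a Ceph API):

.. code-block:: python

   # Raw storage consumed per 1 TB of logical data for a k+m profile.
   def raw_per_logical_tb(k: int, m: int) -> float:
       """Raw-to-logical storage ratio: (k + m) / k."""
       return (k + m) / k

   assert raw_per_logical_tb(2, 2) == 2.0   # default profile: 2 TB raw per 1 TB
   assert raw_per_logical_tb(2, 1) == 1.5   # simplest profile: 1.5 TB raw per 1 TB
   # Each profile tolerates m overlapping OSD failures: 2 and 1 respectively.
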
Choosing the right profile is important because the profile of a pool cannot
be modified after the pool has been created. If you find that you need an
erasure-coded pool with a profile different than the one you have created, you
must create a new pool with a different (and presumably more carefully
considered) profile, and then move all objects from the wrongly configured
pool to the newly created pool.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

The *NYAN* object will be divided into three chunks (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.

::

                           +-------------------+
                      name |       NYAN        |
                           +-------------------+
                   content |     ABCDEFGHI     |
                           +--------+----------+
                                    |
                                    v
                       +------------+------------+
                +------+       encode(3,2)       +------+
                |      +--+---------+---------+--+      |
                |         |         |         |         |
             +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
        name | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
             +------+  +------+  +------+  +------+  +------+
       shard |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
             +------+  +------+  +------+  +------+  +------+
     content | ABC  |  | DEF  |  | GHI  |  | YXY  |  | QGC  |
             +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
                |         |         |         |         |
             +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
             | OSD1 |  | OSD2 |  | OSD3 |  | OSD4 |  | OSD5 |
             +------+  +------+  +------+  +------+  +------+

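The split into the three data chunks shown above can be sketched as follows
(the zero-padding rule is an assumption for illustration; Ceph pads the object
so that every chunk has the same size):

.. code-block:: python

   # Divide an object into K equally sized data chunks (illustrative sketch).
   def split_into_data_chunks(obj: bytes, k: int) -> list[bytes]:
       """Split obj into k equal data chunks, zero-padding the tail."""
       chunk_len = -(-len(obj) // k)                 # ceiling division
       padded = obj.ljust(chunk_len * k, b"\0")
       return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

   # The NYAN object from the example, divided into K=3 data chunks.
   assert split_into_data_chunks(b"ABCDEFGHI", 3) == [b"ABC", b"DEF", b"GHI"]
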
More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.

Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure-coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.

Erasure-coded pools and cache tiering
-------------------------------------

.. note:: Cache tiering is deprecated in Reef.

Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example,
omap). To overcome these limitations, one can set up a `cache tier
<../cache-tiering>`_ before setting up the erasure-coded pool.

For example, if the pool *hot-storage* is made of fast storage, the following
commands will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure-coded pool recovery
---------------------------

If an erasure-coded pool loses any data shards, it must recover them from the
others. This recovery involves reading from the remaining shards,
reconstructing the data, and writing new shards to replacement OSDs.

In Octopus and later releases, erasure-coded pools can recover as long as
there are at least *K* shards available. (With fewer than *K* shards, you
have actually lost data!)

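The "any *K* of *K+M* shards" property can be demonstrated with a toy
``k=2, m=1`` XOR code (an illustrative sketch; Ceph's plugins implement
real codes such as Reed-Solomon):

.. code-block:: python

   # Toy k=2, m=1 XOR code: any 2 of the 3 shards reconstruct the object.

   def make_shards(data):
       half = len(data) // 2
       d1, d2 = data[:half], data[half:]
       parity = bytes(a ^ b for a, b in zip(d1, d2))
       return [d1, d2, parity]

   def reconstruct(shards):
       d1, d2, parity = shards
       if d1 is None:                                   # data shard 1 lost
           d1 = bytes(a ^ b for a, b in zip(d2, parity))
       if d2 is None:                                   # data shard 2 lost
           d2 = bytes(a ^ b for a, b in zip(d1, parity))
       return d1 + d2

   original = b"ABCDEFGH"
   for lost in range(3):            # lose each of the 3 shards in turn
       shards = make_shards(original)
       shards[lost] = None          # simulate one failed OSD
       assert reconstruct(shards) == original   # K=2 survivors always suffice

With fewer than *K* surviving shards there is no combination of XORs that can
recover the missing data, which is exactly the "you have actually lost data"
case above.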
Prior to Octopus, erasure-coded pools required that at least ``min_size``
shards be available, even if ``min_size`` was greater than ``K``. This was a
conservative decision made out of an abundance of caution when designing the
new pool mode. As a result, however, pools with lost OSDs but without complete
data loss were unable to recover and go active without manual intervention to
temporarily change the ``min_size`` setting.

We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes
and data.

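The arithmetic behind the ``K+1`` recommendation can be sketched as follows
(the function name is illustrative, not a Ceph API):

.. code-block:: python

   # Margin left after a write is acknowledged with a given number of
   # active shards (arithmetic sketch only).
   def redundancy_after_write(active_shards: int, k: int) -> int:
       """Shards a freshly acknowledged write can still lose without data loss."""
       return active_shards - k

   k = 2
   assert redundancy_after_write(k, k) == 0       # min_size == K: no margin
   assert redundancy_after_write(k + 1, k) == 1   # min_size == K+1: one to spare

With ``min_size == K``, a write acknowledged at the minimum has zero
redundancy: losing one more shard loses the data.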
Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size as
   each other. There are two kinds of chunks: (1) *data chunks*, which can be
   concatenated to reconstruct the original object, and (2) *coding chunks*,
   which can be used to rebuild a lost chunk.

*K*
   The number of data chunks into which an object is divided. For example, if
   *K* = 2, then a 10KB object is divided into two chunks of 5KB each.

*M*
   The number of coding chunks computed by the encoding function. *M* is equal
   to the number of OSDs that can be missing from the cluster without the
   cluster suffering data loss. For example, if there are two coding chunks,
   then two OSDs can be missing without data loss.