.. _stretch_mode:

================
Stretch Clusters
================


Stretch Clusters
================
Ceph generally expects all parts of its network and overall cluster to be
equally reliable, with failures randomly distributed across the CRUSH map.
So you may lose a switch that knocks out a number of OSDs, but we expect
the remaining OSDs and monitors to route around that.

This is usually a good choice, but it may not work well in some
stretched cluster configurations where a significant part of your cluster
is stuck behind a single network component. For instance, a single
cluster may span multiple data centers, and you may want it to
sustain the loss of a full DC.

There are two standard configurations we've seen deployed, with either
two or three data centers (or, in clouds, availability zones). With two
zones, we expect each site to hold a copy of the data, and a third
site to host a tiebreaker monitor (this can be a VM, or can run at higher
latency than the main sites) to pick a winner if the network connection fails
and both DCs remain alive. For three sites, we expect a copy of the data and
an equal number of monitors in each site.

Note that the standard Ceph configuration will survive MANY failures of the
network or data centers and it will never compromise data consistency. If you
bring back enough Ceph servers following a failure, it will recover. If you
lose a data center, but can still form a quorum of monitors and have all the data
available (with enough copies to satisfy pools' ``min_size``, or CRUSH rules
that will re-replicate to meet it), Ceph will maintain availability.

What can't it handle?

Stretch Cluster Issues
======================
No matter what happens, Ceph will not compromise on data integrity
and consistency. If there's a failure in your network or a loss of nodes and
you can restore service, Ceph will return to normal functionality on its own.

But there are scenarios where you lose data availability despite having
enough servers available to satisfy Ceph's consistency and sizing constraints,
or where you may be surprised to find that Ceph's constraints are not
satisfied. The first important category of these failures revolves around
inconsistent networks -- if there's a netsplit, Ceph may be unable to mark OSDs
down and kick them out of the acting PG sets even though the primary is unable
to replicate data to them. If this happens, IO will not be permitted, because
Ceph can't satisfy its durability guarantees.

The second important category of failures is when you think you have data
replicated across data centers, but the constraints aren't sufficient to
guarantee this. For instance, you might have data centers A and B, and your
CRUSH rule targets 3 copies and places a copy in each data center with a
``min_size`` of 2. The PG may go active with 2 copies in site A and no copies
in site B, which means that if you then lose site A you have lost data and
Ceph can't operate on it. This situation is surprisingly difficult to avoid
with standard CRUSH rules.

Stretch Mode
============
The new stretch mode is designed to handle the 2-site case. Three sites are
just as susceptible to netsplit issues, but are much more tolerant of
component availability outages than 2-site clusters are.

To enter stretch mode, you must set the location of each monitor, matching
your CRUSH map. For instance, to place ``mon.a`` in your first data center:

.. prompt:: bash $

   ceph mon set_location a datacenter=site1

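The same must be done for every other monitor in the cluster. As an
illustrative sketch, assuming five monitors named ``a`` through ``e`` and two
data center buckets named ``site1`` and ``site2`` (the tiebreaker ``mon.e``
gets its own bucket later on), the remaining locations might be set like this:

.. prompt:: bash $

   ceph mon set_location b datacenter=site1
   ceph mon set_location c datacenter=site2
   ceph mon set_location d datacenter=site2
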
Next, generate a CRUSH rule which will place 2 copies in each data center. This
will require editing the CRUSH map directly:

.. prompt:: bash $

   ceph osd getcrushmap > crush.map.bin
   crushtool -d crush.map.bin -o crush.map.txt

Now edit the ``crush.map.txt`` file to add a new rule. Here
there is only one other rule, so this is ID 1, but you may need
to use a different rule ID. We also have two datacenter buckets
named ``site1`` and ``site2``::

    rule stretch_rule {
            id 1
            type replicated
            min_size 1
            max_size 10
            step take site1
            step chooseleaf firstn 2 type host
            step emit
            step take site2
            step chooseleaf firstn 2 type host
            step emit
    }

Finally, inject the CRUSH map to make the rule available to the cluster:

.. prompt:: bash $

   crushtool -c crush.map.txt -o crush2.map.bin
   ceph osd setcrushmap -i crush2.map.bin

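If you want to sanity-check the placement the new rule produces, ``crushtool``
has a test mode. This is optional, and the exact mappings depend on your map,
but with the example rule ID of 1 each input should map to 4 OSDs, 2 per site:

.. prompt:: bash $

   crushtool -i crush2.map.bin --test --rule 1 --num-rep 4 --show-mappings
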
If you aren't already running your monitors in connectivity mode, do so with
the instructions in `Changing Monitor Elections`_.

.. _Changing Monitor Elections: ../change-mon-elections

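As described on that page, this amounts to switching the monitor election
strategy to ``connectivity``:

.. prompt:: bash $

   ceph mon set election_strategy connectivity
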
And lastly, tell the cluster to enter stretch mode. Here, ``mon.e`` is the
tiebreaker and we are splitting across data centers. ``mon.e`` should also be
assigned a datacenter, one that differs from ``site1`` and ``site2``. For this
purpose you can create another datacenter bucket named ``site3`` in your
CRUSH map and place ``mon.e`` there:

.. prompt:: bash $

   ceph mon set_location e datacenter=site3
   ceph mon enable_stretch_mode e stretch_rule datacenter

When stretch mode is enabled, the OSDs will only take PGs active when
they peer across data centers (or whatever other CRUSH bucket type
you specified), assuming both are alive. Pools will increase in size
from the default 3 to 4, expecting 2 copies in each site. OSDs will only
be allowed to connect to monitors in the same data center. New monitors
will not be allowed to join the cluster if they do not specify a location.

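You can confirm the effect on existing pools afterwards; for example,
``ceph osd pool ls detail`` should now report ``size 4`` and ``min_size 2``
for replicated pools (the exact output format varies by release):

.. prompt:: bash $

   ceph osd pool ls detail
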
If all the OSDs and monitors from a data center become inaccessible
at once, the surviving data center will enter a degraded stretch mode. This
will issue a warning, reduce the ``min_size`` to 1, and allow
the cluster to go active with data in the single remaining site. Note that
we do not change the pool size, so you will also get warnings that the
pools are too small -- but a special stretch mode flag will prevent the OSDs
from creating extra copies in the remaining data center (so it will only keep
2 copies, as before).

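The degraded state is surfaced through the usual health reporting, so a quick
way to see whether the cluster has entered (or left) degraded stretch mode is
to check the health output:

.. prompt:: bash $

   ceph health detail
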
When the missing data center comes back, the cluster will enter
recovery stretch mode. This changes the warning and allows peering, but
still only requires OSDs from the data center which was up the whole time.
When all PGs are in a known state, and are neither degraded nor incomplete,
the cluster transitions back to regular stretch mode, ends the warning,
restores ``min_size`` to its starting value (2), requires both sites to peer,
and stops requiring the always-alive site when peering (so that you can fail
over to the other site, if necessary).

Stretch Mode Limitations
========================
As implied by the setup, stretch mode only handles 2 sites with OSDs.

While it is not enforced, you should run 2 monitors in each site plus
a tiebreaker, for a total of 5. This is because OSDs can only connect
to monitors in their own site when in stretch mode.

You cannot use erasure-coded pools with stretch mode. If you try, it will
refuse, and it will not allow you to create EC pools once in stretch mode.

You must create your own CRUSH rule which provides 2 copies in each site, and
you must use 4 total copies with 2 in each site. If you have existing pools
with non-default ``size``/``min_size``, Ceph will object when you attempt to
enable stretch mode.

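If Ceph objects because of such a pool, one option is to return the pool to
the default values before enabling stretch mode (stretch mode will then raise
the size to 4 itself). As a sketch, for a hypothetical pool named ``mypool``:

.. prompt:: bash $

   ceph osd pool set mypool size 3
   ceph osd pool set mypool min_size 2
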
Because it runs with ``min_size 1`` when degraded, you should only use stretch
mode with all-flash OSDs. This minimizes the time needed to recover once
connectivity is restored, and thus minimizes the potential for data loss.

Hopefully, future development will extend this feature to support EC pools and
running with more than 2 full sites.

Other commands
==============
If your tiebreaker monitor fails for some reason, you can replace it. Turn on
a new monitor and run:

.. prompt:: bash $

   ceph mon set_new_tiebreaker mon.<new_mon_name>

This command will protest if the new monitor is in the same location as existing
non-tiebreaker monitors. This command WILL NOT remove the previous tiebreaker
monitor; you should do so yourself.

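As an illustrative sketch, assuming the old tiebreaker was ``mon.e`` and its
replacement is a freshly deployed ``mon.f`` in a separate location, the
sequence might look like this:

.. prompt:: bash $

   ceph mon set_location f datacenter=site3
   ceph mon set_new_tiebreaker mon.f
   ceph mon remove e
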
If you are writing your own tooling for deploying Ceph, you can use a new
``--set-crush-location`` option when booting monitors, instead of running
``ceph mon set_location``. This option accepts only a single ``bucket=loc``
pair, e.g. ``ceph-mon --set-crush-location 'datacenter=a'``, and the bucket
type must match the bucket type you specified when running
``enable_stretch_mode``.

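For example, a deployment tool booting the tiebreaker monitor from the earlier
example might pass the option like this (a sketch; whatever other ``ceph-mon``
arguments your tooling needs are unchanged):

.. prompt:: bash $

   ceph-mon -i e --set-crush-location 'datacenter=site3'
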
When in degraded stretch mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that doesn't
work, or you want to enable recovery mode early, you can invoke:

.. prompt:: bash $

   ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

But this command should not be necessary; it is included to deal with
unanticipated situations.

When in recovery mode, the cluster should go back into normal stretch mode
when the PGs are healthy. If this doesn't happen, or you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), you can invoke:

.. prompt:: bash $

   ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command should not be necessary; it is included to deal with
unanticipated situations. But you might wish to invoke it to remove
the ``HEALTH_WARN`` state which recovery mode generates.