.. _stretch_mode:

================
Stretch Clusters
================


Stretch Clusters
================

A stretch cluster is a cluster that has servers in geographically separated
data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed
and low-latency connections, but limited links. Stretch clusters have a higher
likelihood of (possibly asymmetric) network splits, and a higher likelihood of
temporary or complete loss of an entire data center (which can represent
one-third to one-half of the total cluster).

Ceph is designed with the expectation that all parts of its network and cluster
will be reliable and that failures will be distributed randomly across the
CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is
designed so that the remaining OSDs and monitors will route around such a loss.

Sometimes this cannot be relied upon. If you have a "stretched-cluster"
deployment in which much of your cluster is behind a single network component,
you might need to use **stretch mode** to ensure data integrity.

We will here consider two standard configurations: a configuration with two
data centers (or, in clouds, two availability zones), and a configuration with
three data centers (or, in clouds, three availability zones).

In the two-site configuration, Ceph expects each of the sites to hold a copy of
the data, and Ceph also expects there to be a third site that has a tiebreaker
monitor. This tiebreaker monitor picks a winner if the network connection fails
and both data centers remain alive.

The tiebreaker monitor can be a VM. It can also have high latency relative to
the two main sites.

The standard Ceph configuration is able to survive MANY network failures or
data-center failures without ever compromising data availability. If enough
Ceph servers are brought back following a failure, the cluster *will* recover.
If you lose a data center but are still able to form a quorum of monitors and
still have all the data available, Ceph will maintain availability. (This
assumes that the cluster has enough copies to satisfy the pools' ``min_size``
configuration option, or (failing that) that the cluster has CRUSH rules in
place that will cause the cluster to re-replicate the data until the
``min_size`` configuration option has been met.)

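If you are not sure what a given pool's ``size`` and ``min_size`` are, you can
check them before planning for failures. This is a small illustrative example;
``mypool`` is a placeholder for one of your own pool names:

.. prompt:: bash $

   ceph osd pool get mypool size
   ceph osd pool get mypool min_size
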
Stretch Cluster Issues
======================

Ceph does not permit the compromise of data integrity and data consistency
under any circumstances. When service is restored after a network failure or a
loss of Ceph nodes, Ceph will restore itself to a state of normal functioning
without operator intervention.

Ceph does not permit the compromise of data integrity or data consistency, but
there are situations in which *data availability* is compromised. These
situations can occur even though there are enough servers available to satisfy
Ceph's consistency and sizing constraints. In some situations, you might
discover that your cluster does not satisfy those constraints.

The first category of these failures that we will discuss involves inconsistent
networks -- if there is a netsplit (a disconnection between two servers that
splits the network into two pieces), Ceph might be unable to mark OSDs ``down``
and remove them from the acting PG sets. This failure to mark OSDs ``down``
will occur despite the fact that the PG's primary OSD is unable to replicate
data (a situation that, under normal non-netsplit circumstances, would result
in the marking of affected OSDs as ``down`` and their removal from the PG). If
this happens, Ceph will be unable to satisfy its durability guarantees and
consequently IO will not be permitted.

The second category of failures that we will discuss involves the situation in
which the constraints are not sufficient to guarantee the replication of data
across data centers, though it might seem that the data is correctly replicated
across data centers. For example, in a scenario in which there are two data
centers named Data Center A and Data Center B, and the CRUSH rule targets three
replicas and places a replica in each data center with a ``min_size`` of ``2``,
the PG might go active with two replicas in Data Center A and zero replicas in
Data Center B. In a situation of this kind, the loss of Data Center A means
that the data is lost and Ceph will not be able to operate on it. This
situation is surprisingly difficult to avoid using only standard CRUSH rules.

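To see why, consider a sketch of the usual default replicated CRUSH rule (the
rule name and the ``default`` root are common defaults, but your map may
differ). The rule spreads replicas across hosts, yet it says nothing about data
centers, so nothing prevents every chosen host from being in the same site:

::

     rule replicated_rule {
             id 0
             type replicated
             step take default
             step chooseleaf firstn 0 type host
             step emit
     }

Even with a site-aware rule, CRUSH controls only where replicas are *placed*;
whether a PG may go active is decided by ``min_size`` and peering, which is the
gap that stretch mode closes.
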

Stretch Mode
============
Stretch mode is designed to handle deployments in which you cannot guarantee the
replication of data across two data centers. This kind of situation can arise
when the cluster's CRUSH rule specifies that three copies are to be made, but
then a copy is placed in each data center with a ``min_size`` of 2. Under such
conditions, a placement group can become active with two copies in the first
data center and no copies in the second data center.


Entering Stretch Mode
---------------------

To enable stretch mode, you must set the location of each monitor, matching
your CRUSH map. This procedure shows how to do this.


#. Place ``mon.a`` in your first data center (place your remaining monitors
   the same way; a complete example is shown after this procedure):

   .. prompt:: bash $

      ceph mon set_location a datacenter=site1

#. Generate a CRUSH rule that places two copies in each data center.
   This requires editing the CRUSH map directly:

   .. prompt:: bash $

      ceph osd getcrushmap > crush.map.bin
      crushtool -d crush.map.bin -o crush.map.txt

#. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one
   other rule (so the new rule is given ``id 1``), but you might need to use a
   different rule ID. We have two data-center buckets named ``site1`` and
   ``site2``:

   ::

        rule stretch_rule {
                id 1
                min_size 1
                max_size 10
                type replicated
                step take site1
                step chooseleaf firstn 2 type host
                step emit
                step take site2
                step chooseleaf firstn 2 type host
                step emit
        }

#. Inject the CRUSH map to make the rule available to the cluster:

   .. prompt:: bash $

      crushtool -c crush.map.txt -o crush2.map.bin
      ceph osd setcrushmap -i crush2.map.bin

#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_.

#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the
   tiebreaker monitor and we are splitting across data centers. The tiebreaker
   monitor must be assigned a data center that is neither ``site1`` nor
   ``site2``. For this purpose you can create another data-center bucket named
   ``site3`` in your CRUSH map and place ``mon.e`` there:

   .. prompt:: bash $

      ceph mon set_location e datacenter=site3
      ceph mon enable_stretch_mode e stretch_rule datacenter

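The procedure above shows the location command only for ``mon.a`` and the
tiebreaker, but every monitor needs a location before stretch mode is enabled.
As an illustrative sketch of a typical five-monitor layout (the monitor names
``b``, ``c``, and ``d`` and the site names are assumptions that must match your
own monitors and CRUSH buckets), the remaining monitors are placed the same
way, and the assignments and the injected rule can then be checked:

.. prompt:: bash $

   ceph mon set_location b datacenter=site1
   ceph mon set_location c datacenter=site2
   ceph mon set_location d datacenter=site2
   ceph mon dump
   ceph osd crush rule ls

You can also ask ``crushtool`` to simulate the new rule before relying on it;
with four replicas, each input should map to OSDs drawn from both sites:

.. prompt:: bash $

   crushtool -i crush2.map.bin --test --rule 1 --num-rep 4 --show-mappings
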
When stretch mode is enabled, PGs will become active only when they peer
across data centers (or across whichever CRUSH bucket type was specified),
assuming both are alive. Pools will increase in size from the default ``3`` to
``4``, and two copies will be expected in each site. OSDs will be allowed to
connect to monitors only if they are in the same data center as the monitors.
New monitors will not be allowed to join the cluster if they do not specify a
location.

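One way to confirm that pools have picked up the stretch sizing after stretch
mode is enabled is to list them in detail; for replicated pools the output
should show ``size 4`` and ``min_size 2`` (the exact formatting varies by
release):

.. prompt:: bash $

   ceph osd pool ls detail
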
If all OSDs and monitors in one of the data centers become inaccessible at once,
the surviving data center enters a "degraded stretch mode". A warning will be
issued, the ``min_size`` will be reduced to ``1``, and the cluster will be
allowed to go active with the data in the single remaining site. The pool size
does not change, so warnings will be generated that report that the pools are
too small -- but a special stretch mode flag will prevent the OSDs from
creating extra copies in the remaining data center. This means that the data
center will keep only two copies, just as before.

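The degraded state and its associated warnings are reported through the normal
health output, so checking the cluster health is the quickest way to see
whether the surviving site is operating in degraded stretch mode:

.. prompt:: bash $

   ceph status
   ceph health detail
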
When the missing data center comes back, the cluster will enter a "recovery
stretch mode". This changes the warning and allows peering, but peering still
requires OSDs from the data center that remained ``up`` throughout the
downtime. When all PGs are in a known state, and are neither degraded nor
incomplete, the cluster transitions back to regular stretch mode, ends the
warning, restores ``min_size`` to its original value (``2``), requires both
sites to peer, and no longer requires the site that remained ``up`` when
peering (which makes failover to the other site possible, if needed).

.. _Changing Monitor elections: ../change-mon-elections

Limitations of Stretch Mode
===========================
When using stretch mode, OSDs must be located at exactly two sites.

Two monitors should be run in each data center, plus a tiebreaker in a third
(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
will connect only to monitors within the data center in which they are located.
OSDs *DO NOT* connect to the tiebreaker monitor.

Erasure-coded pools cannot be used with stretch mode. Attempts to use
erasure-coded pools with stretch mode will fail, and erasure-coded pools cannot
be created while in stretch mode.

To use stretch mode, you will need to create a CRUSH rule that provides two
replicas in each data center. Ensure that there are four total replicas: two in
each data center. If pools exist in the cluster that do not have the default
``size`` or ``min_size``, Ceph will not enter stretch mode. An example of such
a CRUSH rule is given above.

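If Ceph refuses to enter stretch mode because a pool has a non-default ``size``
or ``min_size``, one option is to return that pool to the defaults first. This
is only a sketch; ``mypool`` stands in for the offending pool, and you should
confirm that the default values (``size 3``, ``min_size 2``) are appropriate
for it:

.. prompt:: bash $

   ceph osd pool set mypool size 3
   ceph osd pool set mypool min_size 2
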
Because stretch mode runs with ``min_size`` set to ``1`` while a data center is
down, we recommend enabling stretch mode only when using OSDs on SSDs
(including NVMe OSDs). Hybrid HDD+SSD or HDD-only OSDs are not recommended
because of the long time they take to recover after connectivity between data
centers has been restored; faster media shortens this recovery period and
thereby reduces the potential for data loss.

In the future, stretch mode might support erasure-coded pools and might support
deployments that have more than two data centers.

Other commands
==============

Replacing a failed tiebreaker monitor
-------------------------------------

Turn on a new monitor and run the following command:

.. prompt:: bash $

   ceph mon set_new_tiebreaker mon.<new_mon_name>

This command protests if the new monitor is in the same location as the
existing non-tiebreaker monitors. **This command WILL NOT remove the previous
tiebreaker monitor.** Remove the previous tiebreaker monitor yourself.

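For example, after the new tiebreaker has been accepted, you might remove the
old monitor from the monitor map (``<old_mon_name>`` is a placeholder; if your
monitors are managed by an orchestrator such as cephadm, remove the daemon
through the orchestrator instead):

.. prompt:: bash $

   ceph mon remove <old_mon_name>
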
Using "--set-crush-location" and not "ceph mon set_location"
------------------------------------------------------------

If you write your own tooling for deploying Ceph, use the
``--set-crush-location`` option when booting monitors instead of running ``ceph
mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
match the bucket type that was specified when running ``enable_stretch_mode``.

Forcing recovery stretch mode
-----------------------------

When in stretch degraded mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that does not
happen or you want to enable recovery mode early, run the following command:

.. prompt:: bash $

   ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

Forcing normal stretch mode
---------------------------

When in recovery mode, the cluster should go back into normal stretch mode when
the PGs are healthy. If this fails to happen or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), run the following command:

.. prompt:: bash $

   ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command can be used to remove the ``HEALTH_WARN`` state, which recovery
mode generates.