ceph/doc/rados/configuration/mon-config-ref.rst

   1 ==========================
   2  Monitor Config Reference
   3 ==========================
   4
   5 Understanding how to configure a :term:`Ceph Monitor` is an important part of
   6 building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters
   7 have at least one monitor**. A monitor configuration usually remains fairly
   8 consistent, but you can add, remove or replace a monitor in a cluster. See
   9 `Adding/Removing a Monitor`_ and `Add/Remove a Monitor (ceph-deploy)`_ for
  10 details.
  11
  12
  13 .. index:: Ceph Monitor; Paxos
  14
  15 Background
  16 ==========
  17
  18 Ceph Monitors maintain a "master copy" of the :term:`cluster map`, which means a
  19 :term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD
  20 Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and
  21 retrieving a current cluster map. Before Ceph Clients can read from or write to
  22 Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor
  23 first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph
  24 Client can compute the location for any object. The ability to compute object
  25 locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a
  26 very important aspect of Ceph's high scalability and performance. See
  27 `Scalability and High Availability`_ for additional details.
  28
  29 The primary role of the Ceph Monitor is to maintain a master copy of the cluster
  30 map. Ceph Monitors also provide authentication and logging services. Ceph
  31 Monitors write all changes in the monitor services to a single Paxos instance,
  32 and Paxos writes the changes to a key/value store for strong consistency. Ceph
  33 Monitors can query the most recent version of the cluster map during sync
  34 operations. Ceph Monitors leverage the key/value store's snapshots and iterators
  35 (using leveldb) to perform store-wide synchronization.
  36
  37 .. ditaa::
  38  /-------------\               /-------------\
  39  |   Monitor   | Write Changes |    Paxos    |
  40  |   cCCC      +-------------->+   cCCC      |
  41  |             |               |             |
  42  +-------------+               \------+------/
  43  |    Auth     |                      |
  44  +-------------+                      | Write Changes
  45  |    Log      |                      |
  46  +-------------+                      v
  47  | Monitor Map |               /------+------\
  48  +-------------+               | Key / Value |
  49  |   OSD Map   |               |    Store    |
  50  +-------------+               |  cCCC       |
  51  |   PG Map    |               \------+------/
  52  +-------------+                      ^
  53  |   MDS Map   |                      | Read Changes
  54  +-------------+                      |
  55  |    cCCC     |*---------------------+
  56  \-------------/
  57
  58
  59 .. deprecated:: version 0.58
  60
  61 In Ceph versions 0.58 and earlier, Ceph Monitors used a Paxos instance for
  62 each service and stored the map as a file.
  63
  64 .. index:: Ceph Monitor; cluster map
  65
  66 Cluster Maps
  67 ------------
  68
  69 The cluster map is a composite of maps, including the monitor map, the OSD map,
  70 the placement group map and the metadata server map. The cluster map tracks a
  71 number of important things: which processes are ``in`` the Ceph Storage Cluster;
  72 which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running
  73 or ``down``; whether the placement groups are ``active`` or ``inactive``, and
  74 ``clean`` or in some other state; and other details that reflect the current
  75 state of the cluster, such as the total amount of storage space and the amount
  76 of storage used.
  77
  78 When there is a significant change in the state of the cluster--e.g., a Ceph OSD
  79 Daemon goes down, a placement group falls into a degraded state, etc.--the
  80 cluster map gets updated to reflect the current state of the cluster.
  81 The Ceph Monitor also maintains a history of the prior states of the
  82 cluster. The monitor map, OSD map, placement group map and metadata server
  83 map each maintain a history of their map versions. We call each version an
  84 "epoch."
  85
  86 When operating your Ceph Storage Cluster, keeping track of these states is an
  87 important part of your system administration duties. See `Monitoring a Cluster`_
  88 and `Monitoring OSDs and PGs`_ for additional details.
  89
  90 .. index:: high availability; quorum
  91
  92 Monitor Quorum
  93 --------------
  94
  95 Our Configuring Ceph section provides a trivial `Ceph configuration file`_ that
  96 sets up one monitor in the test cluster. A cluster will run fine with a
  97 single monitor; however, **a single monitor is a single point of failure**. To
  98 ensure high availability in a production Ceph Storage Cluster, you should run
  99 Ceph with multiple monitors so that the failure of a single monitor **WILL NOT**
 100 bring down your entire cluster.
 101
 102 When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability,
 103 Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map.
 104 Consensus requires that a majority of the monitors be running to establish a
 105 quorum for agreeing on the cluster map (e.g., 1 out of 1; 2 out of 3; 3 out of
 106 5; 4 out of 6; etc.).
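The majority rule above can be written as a one-line calculation. The following
is a minimal sketch rather than Ceph code; the function name ``quorum_size`` is
hypothetical.

.. code-block:: python

    def quorum_size(num_monitors: int) -> int:
        """Return the smallest number of monitors that forms a strict majority."""
        if num_monitors < 1:
            raise ValueError("a cluster needs at least one monitor")
        return num_monitors // 2 + 1


    # Reproduce the examples from the text.
    for n in (1, 3, 5, 6):
        print(f"{quorum_size(n)} out of {n}")

Note that an even monitor count adds no failure tolerance in this model: six
monitors need four for a quorum and so tolerate two failures, the same as five.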
 107
 108 ``mon force quorum join``
 109
 110 :Description: Force monitor to join quorum even if it has been previously removed from the map
 111 :Type: Boolean
 112 :Default: ``False``
 113
 114 .. index:: Ceph Monitor; consistency
 115
 116 Consistency
 117 -----------
 118
 119 When you add monitor settings to your Ceph configuration file, you need to be
 120 aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes
 121 strict consistency requirements** for a Ceph Monitor when discovering another
 122 Ceph Monitor within the cluster. Whereas Ceph Clients and other Ceph daemons
 123 use the Ceph configuration file to discover monitors, monitors discover each
 124 other using the monitor map (monmap), not the Ceph configuration file.
 125
 126 A Ceph Monitor always refers to the local copy of the monmap when discovering
 127 other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the
 128 Ceph configuration file avoids errors that could break the cluster (e.g., typos
 129 in ``ceph.conf`` when specifying a monitor address or port). Since monitors use
 130 monmaps for discovery and they share monmaps with clients and other Ceph
 131 daemons, **the monmap provides monitors with a strict guarantee that their
 132 consensus is valid.**
 133
 134 Strict consistency also applies to updates to the monmap. As with any other
 135 updates on the Ceph Monitor, changes to the monmap always run through a
 136 distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on
 137 each update to the monmap, such as adding or removing a Ceph Monitor, to ensure
 138 that each monitor in the quorum has the same version of the monmap. Updates to
 139 the monmap are incremental so that Ceph Monitors have the latest agreed upon
 140 version, and a set of previous versions. Maintaining a history enables a Ceph
 141 Monitor that has an older version of the monmap to catch up with the current
 142 state of the Ceph Storage Cluster.
 143
 144 If Ceph Monitors discovered each other through the Ceph configuration file
 145 instead of through the monmap, it would introduce additional risks because the
 146 Ceph configuration files are not updated and distributed automatically. Ceph
 147 Monitors might inadvertently use an older Ceph configuration file, fail to
 148 recognize a Ceph Monitor, fall out of a quorum, or develop a situation where
 149 `Paxos`_ is not able to determine the current state of the system accurately.
 150
 151
 152 .. index:: Ceph Monitor; bootstrapping monitors
 153
 154 Bootstrapping Monitors
 155 ----------------------
 156
 157 In most configuration and deployment cases, tools that deploy Ceph may help
 158 bootstrap the Ceph Monitors by generating a monitor map for you (e.g.,
 159 ``ceph-deploy``). A Ceph Monitor requires a few explicit
 160 settings:
 161
 162 - **Filesystem ID**: The ``fsid`` is the unique identifier for your
 163   object store. Since you can run multiple clusters on the same
 164   hardware, you must specify the unique ID of the object store when
 165   bootstrapping a monitor.  Deployment tools usually do this for you
 166   (e.g., ``ceph-deploy`` can call a tool like ``uuidgen``), but you
 167   may specify the ``fsid`` manually too.
 168
 169 - **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within
 170   the cluster. It is an alphanumeric value, and by convention the identifier
 171   usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This
 172   can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.),
 173   by a deployment tool, or using the ``ceph`` command line.
 174
 175 - **Keys**: The monitor must have secret keys. A deployment tool such as
 176   ``ceph-deploy`` usually does this for you, but you may
 177   perform this step manually too. See `Monitor Keyrings`_ for details.
 178
 179 For additional details on bootstrapping, see `Bootstrapping a Monitor`_.
 180
 181 .. index:: Ceph Monitor; configuring monitors
 182
 183 Configuring Monitors
 184 ====================
 185
 186 To apply configuration settings to the entire cluster, enter the configuration
 187 settings under ``[global]``. To apply configuration settings to all monitors in
 188 your cluster, enter the configuration settings under ``[mon]``. To apply
 189 configuration settings to specific monitors, specify the monitor instance
 190 (e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation.
 191
 192 .. code-block:: ini
 193
 194         [global]
 195
 196         [mon]
 197
 198         [mon.a]
 199
 200         [mon.b]
 201
 202         [mon.c]
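The scoping of these sections can be illustrated with a short lookup routine.
This is only a sketch using Python's standard ``configparser``, not Ceph's own
parser. It assumes the usual Ceph behavior that a more specific section
overrides a more general one; the ``lookup`` helper and the sample values are
hypothetical, and the sketch does not model Ceph's treatment of spaces and
underscores in option names as equivalent.

.. code-block:: python

    import configparser

    # Sample file: the same option set at three levels of specificity.
    CEPH_CONF = """
    [global]
    mon data avail warn = 30

    [mon]
    mon data avail warn = 25

    [mon.a]
    mon data avail warn = 20
    """


    def lookup(conf, daemon_type, daemon_id, option):
        """Return the value from the most specific section that sets the option."""
        for section in (f"{daemon_type}.{daemon_id}", daemon_type, "global"):
            if conf.has_option(section, option):
                return conf.get(section, option)
        return None


    conf = configparser.ConfigParser()
    # Strip the indentation used above so section headers start in column 0.
    conf.read_string("\n".join(line.strip() for line in CEPH_CONF.splitlines()))

    print(lookup(conf, "mon", "a", "mon data avail warn"))  # set in [mon.a]
    print(lookup(conf, "mon", "b", "mon data avail warn"))  # falls back to [mon]
    print(lookup(conf, "osd", "0", "mon data avail warn"))  # falls back to [global]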
 203
 204
 205 Minimum Configuration
 206 ---------------------
 207
 208 The bare minimum monitor settings for a Ceph monitor via the Ceph configuration
 209 file include a hostname and a monitor address for each monitor. You can configure
 210 these under ``[mon]`` or under the entry for a specific monitor.
 211
 212 .. code-block:: ini
 213
 214         [global]
 215                 mon host = 10.0.0.2,10.0.0.3,10.0.0.4
 216
 217 .. code-block:: ini
 218
 219         [mon.a]
 220                 host = hostname1
 221                 mon addr = 10.0.0.10:6789
 222
 223 See the `Network Configuration Reference`_ for details.
 224
 225 .. note:: This minimum configuration for monitors assumes that a deployment
 226    tool generates the ``fsid`` and the ``mon.`` key for you.
 227
 228 Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP address of
 229 the monitors. However, if you decide to change the monitor's IP address, you
 230 must follow a specific procedure. See `Changing a Monitor's IP Address`_ for
 231 details.
 232
 233 Monitors can also be found by clients using DNS SRV records. See `Monitor lookup through DNS`_ for details.
 234
 235 Cluster ID
 236 ----------
 237
 238 Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it
 239 usually appears under the ``[global]`` section of the configuration file.
 240 Deployment tools usually generate the ``fsid`` and store it in the monitor map,
 241 so the value may not appear in a configuration file. The ``fsid`` makes it
 242 possible to run daemons for multiple clusters on the same hardware.
 243
 244 ``fsid``
 245
 246 :Description: The cluster ID. One per cluster.
 247 :Type: UUID
 248 :Required: Yes.
 249 :Default: N/A. May be generated by a deployment tool if not specified.
 250
 251 .. note:: Do not set this value if you use a deployment tool that does
 252    it for you.
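If you do need to set the ``fsid`` by hand, any standard UUID generator will
do. As a minimal sketch, Python's ``uuid`` module produces a value of the same
form as the ``uuidgen`` tool mentioned earlier; the variable name ``fsid`` is
just for illustration.

.. code-block:: python

    import uuid

    # Generate a random (version 4) UUID to use as the cluster's fsid.
    fsid = str(uuid.uuid4())

    # Print it in the form it would take in the [global] section.
    print(f"fsid = {fsid}")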
 253
 254
 255 .. index:: Ceph Monitor; initial members
 256
 257 Initial Members
 258 ---------------
 259
 260 We recommend running a production Ceph Storage Cluster with at least three Ceph
 261 Monitors to ensure high availability. When you run multiple monitors, you may
 262 specify the initial monitors that must be members of the cluster in order to
 263 establish a quorum. This may reduce the time it takes for your cluster to come
 264 online.
 265
 266 .. code-block:: ini
 267
 268         [mon]
 269                 mon initial members = a,b,c
 270
 271
 272 ``mon initial members``
 273
 274 :Description: The IDs of initial monitors in a cluster during startup. If
 275               specified, Ceph requires an odd number of monitors to form an
 276               initial quorum (e.g., 3).
 277
 278 :Type: String
 279 :Default: None
 280
 281 .. note:: A *majority* of monitors in your cluster must be able to reach
 282    each other in order to establish a quorum. You can decrease the initial
 283    number of monitors to establish a quorum with this setting.
 284
 285 .. index:: Ceph Monitor; data path
 286
 287 Data
 288 ----
 289
 290 Ceph provides a default path where Ceph Monitors store data. For optimal
 291 performance in a production Ceph Storage Cluster, we recommend running Ceph
 292 Monitors on separate hosts and drives from Ceph OSD Daemons. Because leveldb uses
 293 ``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk
 294 very often, which can interfere with Ceph OSD Daemon workloads if the data
 295 store is co-located with the OSD Daemons.
 296
 297 In Ceph versions 0.58 and earlier, Ceph Monitors store their data in files. This
 298 approach allows users to inspect monitor data with common tools like ``ls``
 299 and ``cat``. However, it doesn't provide strong consistency.
 300
 301 In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value
 302 pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents
 303 recovering Ceph Monitors from running corrupted versions through Paxos, and it
 304 enables multiple modification operations in one single atomic batch, among other
 305 advantages.
 306
 307 Generally, we do not recommend changing the default data location. If you modify
 308 the default location, we recommend that you make it uniform across Ceph Monitors
 309 by setting it in the ``[mon]`` section of the configuration file.
 310
 311
 312 ``mon data``
 313
 314 :Description: The monitor's data location.
 315 :Type: String
 316 :Default: ``/var/lib/ceph/mon/$cluster-$id``
 317
 318
 319 ``mon data size warn``
 320
 321 :Description: Issue a ``HEALTH_WARN`` in cluster log when the monitor's data
 322               store grows beyond this size in bytes (15GB by default).
 323
 324 :Type: Integer
 325 :Default: ``15*1024*1024*1024``
 326
 327
 328 ``mon data avail warn``
 329
 330 :Description: Issue a ``HEALTH_WARN`` in cluster log when the available disk
 331               space of the monitor's data store is lower than or equal to this
 332               percentage.
 333
 334 :Type: Integer
 335 :Default: ``30``
 336
 337
 338 ``mon data avail crit``
 339
 340 :Description: Issue a ``HEALTH_ERR`` in cluster log when the available disk
 341               space of the monitor's data store is lower than or equal to this
 342               percentage.
 343
 344 :Type: Integer
 345 :Default: ``5``
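The two availability thresholds above combine as sketched below. This is an
illustration of the documented rule ("lower than or equal to this
percentage"), not the monitor's actual code; the function name
``avail_health`` is hypothetical, and the monitor's exact rounding of the
percentage may differ.

.. code-block:: python

    def avail_health(avail_bytes, total_bytes, warn_pct=30, crit_pct=5):
        """Classify free space using the warn and crit percentage thresholds."""
        if total_bytes <= 0:
            raise ValueError("total_bytes must be positive")
        avail_pct = 100 * avail_bytes / total_bytes
        # Check the critical threshold first because it is the stricter one.
        if avail_pct <= crit_pct:
            return "HEALTH_ERR"
        if avail_pct <= warn_pct:
            return "HEALTH_WARN"
        return "HEALTH_OK"


    # 200GB store volume with 50GB free is 25% available.
    print(avail_health(50 * 1024**3, 200 * 1024**3))

To check a real host, the two byte counts could come from
``shutil.disk_usage()`` called on the ``mon data`` path.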
 346
 347
 348 ``mon warn on cache pools without hit sets``
 349
 350 :Description: Issue a ``HEALTH_WARN`` in cluster log if a cache pool does not
 351               have the ``hit_set_type`` value configured.
 352               See :ref:`hit_set_type <hit_set_type>` for more
 353               details.
 354
 355 :Type: Boolean
 356 :Default: ``True``
 357
 358
 359 ``mon warn on crush straw calc version zero``
 360
 361 :Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH's
 362               ``straw_calc_version`` is zero. See
 363               :ref:`CRUSH map tunables <crush-map-tunables>` for
 364               details.
 365
 366 :Type: Boolean
 367 :Default: ``True``
 368
 369
 370 ``mon warn on legacy crush tunables``
 371
 372 :Description: Issue a ``HEALTH_WARN`` in cluster log if
 373               CRUSH tunables are too old (older than ``mon crush min required version``).
 374
 375 :Type: Boolean
 376 :Default: ``True``
 377
 378
 379 ``mon crush min required version``
 380
 381 :Description: The minimum tunable profile version required by the cluster.
 382               See
 383               :ref:`CRUSH map tunables <crush-map-tunables>` for
 384               details.
 385
 386 :Type: String
 387 :Default: ``hammer``
 388
 389
 390 ``mon warn on osd down out interval zero``
 391
 392 :Description: Issue a ``HEALTH_WARN`` in cluster log if
 393               ``mon osd down out interval`` is zero. Having this option set to
 394               zero on the leader acts much like the ``noout`` flag. A cluster
 395               that behaves as if ``noout`` were set, without the flag actually
 396               being set, is hard to troubleshoot, so we report a warning in
 397               this case.
 398
 399 :Type: Boolean
 400 :Default: ``True``
 401
 402
 403 ``mon warn on slow ping ratio``
 404
 405 :Description: Issue a ``HEALTH_WARN`` in cluster log if any heartbeat
 406               between OSDs exceeds ``mon warn on slow ping ratio``
 407               of ``osd heartbeat grace``.  The default is 5%.
 408 :Type: Float
 409 :Default: ``0.05``
 410
 411
 412 ``mon warn on slow ping time``
 413
 414 :Description: Override ``mon warn on slow ping ratio`` with a specific value.
 415               Issue a ``HEALTH_WARN`` in cluster log if any heartbeat
 416               between OSDs exceeds ``mon warn on slow ping time``
 417               milliseconds.  The default is 0 (disabled).
 418 :Type: Integer
 419 :Default: ``0``
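How the two slow-ping settings interact can be sketched as follows. This is an
illustration of the descriptions above, not Ceph source code. The function
name is hypothetical, and the 20-second grace period in the example is an
assumed value; check your cluster's ``osd heartbeat grace`` setting.

.. code-block:: python

    def slow_ping_threshold_ms(heartbeat_grace_s, ratio=0.05, override_ms=0):
        """Return the heartbeat ping time (in ms) above which a warning is raised."""
        # A non-zero "mon warn on slow ping time" takes precedence over the ratio.
        if override_ms > 0:
            return float(override_ms)
        # Otherwise the threshold is a fraction of the heartbeat grace period.
        return heartbeat_grace_s * ratio * 1000.0


    print(slow_ping_threshold_ms(20))                   # ratio-based threshold
    print(slow_ping_threshold_ms(20, override_ms=250))  # explicit override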
 420
 421
 422 ``mon warn on pool no redundancy``
 423
 424 :Description: Issue a ``HEALTH_WARN`` in cluster log if any pool is
 425               configured with no replicas.
 426 :Type: Boolean
 427 :Default: ``True``
 428
 429
 430 ``mon cache target full warn ratio``
 431
 432 :Description: Position between pool's ``cache_target_full`` and
 433               ``target_max_object`` where we start warning
 434
 435 :Type: Float
 436 :Default: ``0.66``
 437
 438
 439 ``mon health to clog``
 440
 441 :Description: Enable sending health summary to cluster log periodically.
 442 :Type: Boolean
 443 :Default: ``True``
 444
 445
 446 ``mon health to clog tick interval``
 447
 448 :Description: How often (in seconds) the monitor sends the health summary to
 449               the cluster log (a non-positive number disables it). If the
 450               current health summary is empty or identical to the last one,
 451               the monitor will not send it to the cluster log.
 452
 453 :Type: Float
 454 :Default: ``60.0``
 455
 456
 457 ``mon health to clog interval``
 458
 459 :Description: How often (in seconds) the monitor sends the health summary to
 460               the cluster log (a non-positive number disables it). The monitor
 461               always sends the summary to the cluster log, whether or not the
 462               summary has changed.
 463
 464 :Type: Integer
 465 :Default: ``3600``
 466
 467
 468
 469 .. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning
 470
 471 Storage Capacity
 472 ----------------
 473
 474 When a Ceph Storage Cluster gets close to its maximum capacity (i.e., ``mon osd
 475 full ratio``), Ceph prevents you from writing to or reading from Ceph OSD
 476 Daemons as a safety measure to prevent data loss. Therefore, letting a
 477 production Ceph Storage Cluster approach its full ratio is not a good practice,
 478 because it sacrifices high availability. The default full ratio is ``.95``, or
 479 95% of capacity. This is a very aggressive setting for a test cluster with a small
 480 number of OSDs.
 481
 482 .. tip:: When monitoring your cluster, be alert to warnings related to the
 483    ``nearfull`` ratio. Such a warning means that the failure of one or more
 484    OSDs could result in a temporary service disruption. Consider adding
 485    more OSDs to increase storage capacity.
 486
 487 A common scenario for test clusters involves a system administrator removing a
 488 Ceph OSD Daemon from the Ceph Storage Cluster to watch the cluster rebalance;
 489 then, removing another Ceph OSD Daemon, and so on until the Ceph Storage Cluster
 490 eventually reaches the full ratio and locks up. We recommend a bit of capacity
 491 planning even with a test cluster. Planning enables you to gauge how much spare
 492 capacity you will need in order to maintain high availability. Ideally, you want
 493 to plan for a series of Ceph OSD Daemon failures where the cluster can recover
 494 to an ``active + clean`` state without replacing those Ceph OSD Daemons
 495 immediately. You can run a cluster in an ``active + degraded`` state, but this
 496 is not ideal for normal operating conditions.
 497
 498 The following diagram depicts a simplistic Ceph Storage Cluster containing 33
 499 Ceph Nodes with one Ceph OSD Daemon per host, each Ceph OSD Daemon reading from
 500 and writing to a 3TB drive. So this example Ceph Storage Cluster has a maximum
 501 actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph
 502 Storage Cluster falls to about 5TB of remaining capacity, the cluster will not
 503 allow Ceph Clients to read and write data. So the Ceph Storage Cluster's
 504 operating capacity is about 94TB (99TB x 0.95), not 99TB.
 505
 506 .. ditaa::
 507  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
 508  | Rack 1 |  | Rack 2 |  | Rack 3 |  | Rack 4 |  | Rack 5 |  | Rack 6 |
 509  | cCCC   |  | cF00   |  | cCCC   |  | cCCC   |  | cCCC   |  | cCCC   |
 510  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
 511  | OSD 1  |  | OSD 7  |  | OSD 13 |  | OSD 19 |  | OSD 25 |  | OSD 31 |
 512  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
 513  | OSD 2  |  | OSD 8  |  | OSD 14 |  | OSD 20 |  | OSD 26 |  | OSD 32 |
 514  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
 515  | OSD 3  |  | OSD 9  |  | OSD 15 |  | OSD 21 |  | OSD 27 |  | OSD 33 |
 516  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
 517  | OSD 4  |  | OSD 10 |  | OSD 16 |  | OSD 22 |  | OSD 28 |  | Spare  |
 518  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
 519  | OSD 5  |  | OSD 11 |  | OSD 17 |  | OSD 23 |  | OSD 29 |  | Spare  |
 520  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
 521  | OSD 6  |  | OSD 12 |  | OSD 18 |  | OSD 24 |  | OSD 30 |  | Spare  |
 522  +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
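The capacity figures for the cluster pictured above can be checked with a few
lines of arithmetic. The snippet below is a worked example only; the variable
names are chosen for illustration, and the prose rounds the resulting figures.

.. code-block:: python

    NUM_OSDS = 33      # one OSD per host, as in the diagram
    DRIVE_TB = 3       # capacity of each drive in TB
    FULL_RATIO = 0.95  # default "mon osd full ratio"

    raw_tb = NUM_OSDS * DRIVE_TB         # total raw capacity
    operating_tb = raw_tb * FULL_RATIO   # usable before the cluster is "full"
    reserve_tb = raw_tb - operating_tb   # capacity held back by the full ratio

    print(f"raw={raw_tb}TB operating={operating_tb:.2f}TB reserve={reserve_tb:.2f}TB")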
 523
 524 It is normal in such a cluster for one or two OSDs to fail. A less frequent but
 525 reasonable scenario involves a rack's router or power supply failing, which
 526 brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario,
 527 you should still strive for a cluster that can remain operational and achieve an
 528 ``active + clean`` state--even if that means adding a few hosts with additional
 529 OSDs in short order. If your capacity utilization is too high, you may not lose
 530 data, but you could still sacrifice data availability while resolving an outage
 531 within a failure domain if capacity utilization of the cluster exceeds the full
 532 ratio. For this reason, we recommend at least some rough capacity planning.
 533
 534 Identify two numbers for your cluster:
 535
 536 #. The number of OSDs.
 537 #. The total capacity of the cluster.
 538
 539 If you divide the total capacity of your cluster by the number of OSDs in your
 540 cluster, you will find the mean capacity of an OSD within your cluster.
 541 Consider multiplying that number by the number of OSDs you expect will fail
 542 simultaneously during normal operations (a relatively small number). Finally,
 543 multiply the capacity of the cluster by the full ratio to arrive at a maximum
 544 operating capacity; then, subtract the amount of data held by the OSDs
 545 you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing
 546 process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at
 547 a reasonable number for a near full ratio.
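The procedure above can be sketched as a small calculation. This is one
reading of the steps, not an official formula: it assumes the final figure is
converted back to a ratio by dividing by the total capacity, and the function
name ``planned_ratio`` is hypothetical.

.. code-block:: python

    def planned_ratio(total_tb, num_osds, expected_failures, full_ratio=0.95):
        """Estimate a ratio that leaves room for the expected OSD failures."""
        if num_osds <= 0 or total_tb <= 0:
            raise ValueError("total_tb and num_osds must be positive")
        mean_osd_tb = total_tb / num_osds             # mean capacity of one OSD
        failed_tb = mean_osd_tb * expected_failures   # data on the failed OSDs
        max_operating_tb = total_tb * full_ratio      # maximum operating capacity
        return (max_operating_tb - failed_tb) / total_tb


    # 33 OSDs of 3TB each: plan for two failed OSDs, then for a rack of six.
    print(f"full ratio candidate:     {planned_ratio(99, 33, 2):.2f}")
    print(f"nearfull ratio candidate: {planned_ratio(99, 33, 6):.2f}")

For the example cluster this yields roughly 0.89 and 0.77, which you would
then round to values you are comfortable operating with.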
 548
 549 The following settings only apply on cluster creation and are then stored in
 550 the OSDMap.
 551
 552 .. code-block:: ini
 553
 554         [global]
 555
 556                 mon osd full ratio = .80
 557                 mon osd backfillfull ratio = .75
 558                 mon osd nearfull ratio = .70
 559
 560
 561 ``mon osd full ratio``
 562
 563 :Description: The percentage of disk space used before an OSD is
 564               considered ``full``.
 565
 566 :Type: Float
 567 :Default: ``0.95``
 568
 569
 570 ``mon osd backfillfull ratio``
 571
 572 :Description: The percentage of disk space used before an OSD is
 573               considered too ``full`` to backfill.
 574
 575 :Type: Float
 576 :Default: ``0.90``
 577
 578
 579 ``mon osd nearfull ratio``
 580
 581 :Description: The percentage of disk space used before an OSD is
 582               considered ``nearfull``.
 583
 584 :Type: Float
 585 :Default: ``0.85``
 586
 587
 588 .. tip:: If some OSDs are nearfull, but others have plenty of capacity, you
 589          may have a problem with the CRUSH weight for the nearfull OSDs.
 590
 591 .. tip:: These settings only apply during cluster creation. Afterwards they need
 592          to be changed in the OSDMap using ``ceph osd set-nearfull-ratio`` and
 593          ``ceph osd set-full-ratio``.
 594
 595 .. index:: heartbeat
 596
 597 Heartbeat
 598 ---------
 599
 600 Ceph monitors know about the cluster by requiring reports from each OSD, and by
 601 receiving reports from OSDs about the status of their neighboring OSDs. Ceph
 602 provides reasonable default settings for monitor/OSD interaction; however, you
 603 may modify them as needed. See `Monitor/OSD Interaction`_ for details.
 604
 605
 606 .. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization
 607
 608 Monitor Store Synchronization
 609 -----------------------------
 610
 611 When you run a production cluster with multiple monitors (recommended), each
 612 monitor checks to see if a neighboring monitor has a more recent version of the
 613 cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers
 614 higher than the most current epoch in the monitor's own map).
 615 Periodically, one monitor in the cluster may fall behind the other monitors to
 616 the point where it must leave the quorum, synchronize to retrieve the most
 617 current information about the cluster, and then rejoin the quorum. For the
 618 purposes of synchronization, monitors may assume one of three roles:
 619
 620 #. **Leader**: The `Leader` is the first monitor to achieve the most recent
 621    Paxos version of the cluster map.
 622
 623 #. **Provider**: The `Provider` is a monitor that has the most recent version
 624    of the cluster map, but wasn't the first to achieve the most recent version.
 625
 626 #. **Requester**: A `Requester` is a monitor that has fallen behind the leader
 627    and must synchronize in order to retrieve the most recent information about
 628    the cluster before it can rejoin the quorum.
 629
 630 These roles enable a leader to delegate synchronization duties to a provider,
 631 which prevents synchronization requests from overloading the leader--improving
 632 performance. In the following diagram, the requester has learned that it has
 633 fallen behind the other monitors. The requester asks the leader to synchronize,
 634 and the leader tells the requester to synchronize with a provider.
 635
 636
 637 .. ditaa::
 638            +-----------+          +---------+          +----------+
 639            | Requester |          | Leader  |          | Provider |
 640            +-----------+          +---------+          +----------+
 641                   |                    |                     |
 642                   |                    |                     |
 643                   | Ask to Synchronize |                     |
 644                   |------------------->|                     |
 645                   |                    |                     |
 646                   |<-------------------|                     |
 647                   | Tell Requester to  |                     |
 648                   | Sync with Provider |                     |
 649                   |                    |                     |
 650                   |               Synchronize                |
 651                   |--------------------+-------------------->|
 652                   |                    |                     |
 653                   |<-------------------+---------------------|
 654                   |        Send Chunk to Requester           |
 655                   |         (repeat as necessary)            |
 656                   |    Requester Acks Chunk to Provider      |
 657                   |--------------------+-------------------->|
 658                   |                    |
 659                   |   Sync Complete    |
 660                   |    Notification    |
 661                   |------------------->|
 662                   |                    |
 663                   |<-------------------|
 664                   |        Ack         |
 665                   |                    |
 666
 667
 668 Synchronization always occurs when a new monitor joins the cluster. During
 669 runtime operations, monitors may receive updates to the cluster map at different
 670 times. This means the leader and provider roles may migrate from one monitor to
 671 another. If this happens while synchronizing (e.g., a provider falls behind the
 672 leader), the provider can terminate synchronization with a requester.
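The chunk-and-acknowledge exchange between requester and provider in the
diagram above can be modeled with a toy simulation. This is a simplified
sketch, not the monitor's implementation: the store is a plain dictionary, the
helper names are hypothetical, and the default chunk limit simply reuses the
``mon sync max payload size`` default of 1048576 bytes.

.. code-block:: python

    from typing import Dict, Iterator, List, Tuple

    Pair = Tuple[str, bytes]


    def make_chunks(store: Dict[str, bytes], max_payload: int) -> Iterator[List[Pair]]:
        """Yield batches of key/value pairs, starting a new batch once adding
        the next value would push the batch past max_payload bytes.
        A single value larger than max_payload is sent in a batch of its own."""
        batch: List[Pair] = []
        size = 0
        for key in sorted(store):
            value = store[key]
            if batch and size + len(value) > max_payload:
                yield batch
                batch, size = [], 0
            batch.append((key, value))
            size += len(value)
        if batch:
            yield batch


    def synchronize(provider_store: Dict[str, bytes], max_payload: int = 1048576):
        """Copy the provider's store into a new requester store chunk by chunk.
        Returns the requester's store and the number of chunks acknowledged."""
        requester_store: Dict[str, bytes] = {}
        acks = 0
        for batch in make_chunks(provider_store, max_payload):
            requester_store.update(batch)  # requester applies the chunk
            acks += 1                      # requester acks the chunk
        return requester_store, acks


    provider = {"osdmap:1": b"aaaa", "osdmap:2": b"bbbb", "osdmap:3": b"cccc"}
    copy, acks = synchronize(provider, max_payload=8)
    print(copy == provider, acks)

The model leaves out the leader's role in choosing a provider, timeouts, and
the case where a provider terminates the sync part-way through.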
 673
 674 Once synchronization is complete, Ceph requires trimming across the cluster.
 675 Trimming requires that the placement groups are ``active + clean``.
 676
 677
 678 ``mon sync timeout``
 679
 680 :Description: Number of seconds the monitor will wait for the next update
 681               message from its sync provider before it gives up and bootstraps
 682               again.
 683
 684 :Type: Double
 685 :Default: ``60.0``
 686
 687
 688 ``mon sync max payload size``
 689
 690 :Description: The maximum size for a sync payload (in bytes).
 691 :Type: 32-bit Integer
 692 :Default: ``1048576``
 693
 694
 695 ``paxos max join drift``
 696
 697 :Description: The maximum Paxos iterations before we must first sync the
 698               monitor data stores. When a monitor finds that its peer is too
 699               far ahead of it, it will first sync with data stores before moving
 700               on.
 701
 702 :Type: Integer
 703 :Default: ``10``
 704
 705
 706 ``paxos stash full interval``
 707
 708 :Description: How often (in commits) to stash a full copy of the PaxosService state.
 709               Currently this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr``
 710               PaxosServices.
 711
 712 :Type: Integer
 713 :Default: ``25``
 714
 715
 716 ``paxos propose interval``
 717
 718 :Description: Gather updates for this time interval before proposing
 719               a map update.
 720
 721 :Type: Double
 722 :Default: ``1.0``
 723
 724
 725 ``paxos min``
 726
 727 :Description: The minimum number of paxos states to keep around
 728 :Type: Integer
 729 :Default: ``500``
 730
 731
 732 ``paxos min wait``
 733
 734 :Description: The minimum amount of time to gather updates after a period of
 735               inactivity.
 736
 737 :Type: Double
 738 :Default: ``0.05``
 739
 740
 741 ``paxos trim min``
 742
 743 :Description: Number of extra proposals tolerated before trimming
 744 :Type: Integer
 745 :Default: ``250``
 746
 747
 748 ``paxos trim max``
 749
 750 :Description: The maximum number of extra proposals to trim at a time
 751 :Type: Integer
 752 :Default: ``500``
 753
 754
 755 ``paxos service trim min``
 756
 757 :Description: The minimum number of versions to trigger a trim (``0`` disables it).
 758 :Type: Integer
 759 :Default: ``250``
 760
 761
 762 ``paxos service trim max``
 763
 764 :Description: The maximum number of versions to trim during a single proposal (``0`` disables it).
 765 :Type: Integer
 766 :Default: ``500``
 767
 768
 769 ``mon mds force trim to``
 770
 771 :Description: Force the monitor to trim mdsmaps to this point (``0`` disables
 772               it). Dangerous; use with care.
 773
 774 :Type: Integer
 775 :Default: ``0``
 776
 777
 778 ``mon osd force trim to``
 779
 780 :Description: Force the monitor to trim osdmaps to this point, even if there
 781               are PGs that are not clean at the specified epoch (``0``
 782               disables it). Dangerous; use with care.
 783
 784 :Type: Integer
 785 :Default: ``0``
 786
 787
 788 ``mon osd cache size``
 789
 790 :Description: The size of the osdmap cache, so that the monitor does not rely on the underlying store's cache.
 791 :Type: Integer
 792 :Default: ``500``
 793
 794
 795 ``mon election timeout``
 796
 797 :Description: On the election proposer, the maximum time (in seconds) to wait for all ACKs.
 798 :Type: Float
 799 :Default: ``5.00``
 800
 801
 802 ``mon lease``
 803
 804 :Description: The length (in seconds) of the lease on the monitor's versions.
 805 :Type: Float
 806 :Default: ``5.00``
 807
 808
 809 ``mon lease renew interval factor``
 810
 811 :Description: ``mon lease`` \* ``mon lease renew interval factor`` will be the
 812               interval for the Leader to renew the other monitors' leases. The
 813               factor should be less than ``1.0``.
 814
 815 :Type: Float
 816 :Default: ``0.60``
 817
 818
 819 ``mon lease ack timeout factor``
 820
 821 :Description: The Leader will wait ``mon lease`` \* ``mon lease ack timeout factor``
 822               for the Providers to acknowledge the lease extension.
 823
 824 :Type: Float
 825 :Default: ``2.00``
 826
 827
 828 ``mon accept timeout factor``
 829
 830 :Description: The Leader will wait ``mon lease`` \* ``mon accept timeout factor``
 831               for the Requester(s) to accept a Paxos update. It is also used
 832               during the Paxos recovery phase for similar purposes.
 833
 834 :Type: Float
 835 :Default: ``2.00``
 836
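The three factors above are all multiplied by ``mon lease``. As a worked
example using only the defaults listed above, the derived intervals are written
as comments in the sketch below. The ``[mon]`` placement is an assumption about
your configuration layout.

.. code-block:: ini

    [mon]
        mon lease = 5.0
        # The Leader renews leases every 5.0 * 0.6 = 3.0 seconds.
        mon lease renew interval factor = 0.6
        # The Leader waits up to 5.0 * 2.0 = 10.0 seconds for lease
        # acknowledgements.
        mon lease ack timeout factor = 2.0
        # The Leader waits up to 5.0 * 2.0 = 10.0 seconds for a Paxos
        # update to be accepted.
        mon accept timeout factor = 2.0

Because every interval scales with ``mon lease``, changing that one value
changes all three derived intervals proportionally.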
 837
 838 ``mon min osdmap epochs``
 839
 840 :Description: Minimum number of OSD map epochs to keep at all times.
 841 :Type: 32-bit Integer
 842 :Default: ``500``
 843
 844
 845 ``mon max log epochs``
 846
 847 :Description: Maximum number of Log epochs the monitor should keep.
 848 :Type: 32-bit Integer
 849 :Default: ``500``
 850
 851
 852
 853 .. index:: Ceph Monitor; clock
 854
 855 Clock
 856 -----
 857
 858 Ceph daemons pass critical messages to each other, which must be processed
 859 before daemons reach a timeout threshold. If the clocks in Ceph monitors
 860 are not synchronized, it can lead to a number of anomalies. For example:
 861
 862 - Daemons ignoring received messages (e.g., timestamps outdated)
 863 - Timeouts triggered too soon or too late when a message wasn't received in time
 864
 865 See `Monitor Store Synchronization`_ for details.
 866
 867
 868 .. tip:: You SHOULD install NTP on your Ceph monitor hosts to
 869          ensure that the monitor cluster operates with synchronized clocks.
 870
 871 Clock drift may still be noticeable with NTP even though the discrepancy is not
 872 yet harmful. Ceph's clock drift / clock skew warnings may get triggered even
 873 though NTP maintains a reasonable level of synchronization. Increasing the
 874 allowed clock drift may be tolerable under such circumstances; however, a number
 875 of factors such as workload, network latency, configuring overrides to default
 876 timeouts and the `Monitor Store Synchronization`_ settings may influence
 877 the level of acceptable clock drift without compromising Paxos guarantees.
 878
 879 Ceph provides the following tunable options to allow you to find
 880 acceptable values.
 881
 882
 883 ``mon tick interval``
 884
 885 :Description: A monitor's tick interval in seconds.
 886 :Type: 32-bit Integer
 887 :Default: ``5``
 888
 889
 890 ``mon clock drift allowed``
 891
 892 :Description: The clock drift in seconds allowed between monitors.
 893 :Type: Float
 894 :Default: ``0.05``
 895
 896
 897 ``mon clock drift warn backoff``
 898
 899 :Description: Exponential backoff for clock drift warnings
 900 :Type: Float
 901 :Default: ``5.00``
 902
 903
 904 ``mon timecheck interval``
 905
 906 :Description: The time check interval (clock drift check) in seconds
 907               for the Leader.
 908
 909 :Type: Float
 910 :Default: ``300.00``
 911
 912
 913 ``mon timecheck skew interval``
 914
 915 :Description: The time check interval (clock drift check), in seconds, for
 916               the Leader when a clock skew is present.
 917
 918 :Type: Float
 919 :Default: ``30.00``
 920
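If clock skew warnings appear even though NTP keeps the monitors reasonably
synchronized, the tolerance could be raised along the following lines. This is
only a sketch: the value ``0.1`` is a hypothetical illustration rather than a
recommendation, and the ``[mon]`` placement is an assumption.

.. code-block:: ini

    [mon]
        # Hypothetical value: tolerate up to 0.1 seconds of drift between
        # monitors instead of the default 0.05 seconds.
        mon clock drift allowed = 0.1
        # Defaults restated from above: check every 300 seconds normally,
        # and every 30 seconds while a skew is present.
        mon timecheck interval = 300.0
        mon timecheck skew interval = 30.0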
 921
 922 Client
 923 ------
 924
 925 ``mon client hunt interval``
 926
 927 :Description: The client will try a new monitor every ``N`` seconds until it
 928               establishes a connection.
 929
 930 :Type: Double
 931 :Default: ``3.00``
 932
 933
 934 ``mon client ping interval``
 935
 936 :Description: The client will ping the monitor every ``N`` seconds.
 937 :Type: Double
 938 :Default: ``10.00``
 939
 940
 941 ``mon client max log entries per message``
 942
 943 :Description: The maximum number of log entries a monitor will generate
 944               per client message.
 945
 946 :Type: Integer
 947 :Default: ``1000``
 948
 949
 950 ``mon client bytes``
 951
 952 :Description: The amount of client message data allowed in memory (in bytes).
 953 :Type: 64-bit Integer Unsigned
 954 :Default: ``100ul << 20``
 955
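The default for ``mon client bytes`` is written as a shift expression. As a
worked example, ``100 << 20`` is ``100 * 1048576 = 104857600`` bytes (100 MiB).
The sketch below writes that value as a plain integer next to the other client
defaults listed above. Placing the options under ``[global]`` is an assumption.

.. code-block:: ini

    [global]
        # 100 << 20 = 104857600 bytes (100 MiB).
        mon client bytes = 104857600
        # Try a different monitor every 3 seconds until connected.
        mon client hunt interval = 3.0
        # Ping the monitor every 10 seconds.
        mon client ping interval = 10.0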
 956
 957 Pool settings
 958 =============
 959
 960 Since v0.94, pool flags are supported, which allow or disallow changes to be made to pools.
 961
 962 Monitors can also disallow removal of pools if configured that way.
 963
 964 ``mon allow pool delete``
 965
 966 :Description: Whether the monitors should allow pools to be removed, regardless of what the pool flags say.
 967 :Type: Boolean
 968 :Default: ``false``
 969
 970
 971 ``osd pool default ec fast read``
 972
 973 :Description: Whether to turn on fast read on the pool or not. It will be used as
 974               the default setting of newly created erasure coded pools if ``fast_read``
 975               is not specified at create time.
 976
 977 :Type: Boolean
 978 :Default: ``false``
 979
 980
 981 ``osd pool default flag hashpspool``
 982
 983 :Description: Set the hashpspool flag on new pools
 984 :Type: Boolean
 985 :Default: ``true``
 986
 987
 988 ``osd pool default flag nodelete``
 989
 990 :Description: Set the nodelete flag on new pools. Prevents pools with this flag from being removed in any way.
 991 :Type: Boolean
 992 :Default: ``false``
 993
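To illustrate the two deletion safeguards described above, the sketch below
keeps pool removal disabled at the monitors and additionally sets the
``nodelete`` flag on newly created pools. Placing both options under
``[global]`` is an assumption, and enabling ``nodelete`` by default is a
hypothetical choice made for illustration.

.. code-block:: ini

    [global]
        # Monitors do not allow pools to be removed (default shown).
        mon allow pool delete = false
        # Hypothetical choice: create new pools with the nodelete flag set.
        osd pool default flag nodelete = true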
 994
 995 ``osd pool default flag nopgchange``
 996
 997 :Description: Set the nopgchange flag on new pools. Does not allow the number of PGs to be changed for a pool.
 998 :Type: Boolean
 999 :Default: ``false``
1000
1001
1002 ``osd pool default flag nosizechange``
1003
1004 :Description: Set the nosizechange flag on new pools. Does not allow the size of a pool to be changed.
1005 :Type: Boolean
1006 :Default: ``false``
1007
1008 For more information about the pool flags see `Pool values`_.
1009
1010 Miscellaneous
1011 =============
1012
1013 ``mon max osd``
1014
1015 :Description: The maximum number of OSDs allowed in the cluster.
1016 :Type: 32-bit Integer
1017 :Default: ``10000``
1018
1019
1020 ``mon globalid prealloc``
1021
1022 :Description: The number of global IDs to pre-allocate for clients and daemons in the cluster.
1023 :Type: 32-bit Integer
1024 :Default: ``10000``
1025
1026
1027 ``mon subscribe interval``
1028
1029 :Description: The refresh interval (in seconds) for subscriptions. The
1030               subscription mechanism enables obtaining the cluster maps
1031               and log information.
1032
1033 :Type: Double
1034 :Default: ``86400.00``
1035
1036
1037 ``mon stat smooth intervals``
1038
1039 :Description: Ceph will smooth statistics over the last ``N`` PG maps.
1040 :Type: Integer
1041 :Default: ``6``
1042
1043
1044 ``mon probe timeout``
1045
1046 :Description: Number of seconds the monitor will wait to find peers before bootstrapping.
1047 :Type: Double
1048 :Default: ``2.00``
1049
1050
1051 ``mon daemon bytes``
1052
1053 :Description: The message memory cap for metadata server and OSD messages (in bytes).
1054 :Type: 64-bit Integer Unsigned
1055 :Default: ``400ul << 20``
1056
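Two of the defaults above are easier to read once converted. As a worked
example, ``400 << 20`` is ``400 * 1048576 = 419430400`` bytes (400 MiB), and a
subscribe interval of ``86400`` seconds is 24 hours. The sketch below restates
both as plain numbers. The ``[global]`` placement is an assumption.

.. code-block:: ini

    [global]
        # 400 << 20 = 419430400 bytes (400 MiB).
        mon daemon bytes = 419430400
        # 86400 seconds = 24 hours.
        mon subscribe interval = 86400.0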
1057
1058 ``mon max log entries per event``
1059
1060 :Description: The maximum number of log entries per event.
1061 :Type: Integer
1062 :Default: ``4096``
1063
1064
1065 ``mon osd prime pg temp``
1066
1067 :Description: Enables or disables priming the PGMap with the previous OSDs when an out
1068               OSD comes back into the cluster. With the ``true`` setting, clients
1069               will continue to use the previous OSDs until the newly in OSDs for that
1070               PG have peered.
1071
1072 :Type: Boolean
1073 :Default: ``true``
1074
1075
1076 ``mon osd prime pg temp max time``
1077
1078 :Description: How much time in seconds the monitor should spend trying to prime the
1079               PGMap when an out OSD comes back into the cluster.
1080
1081 :Type: Float
1082 :Default: ``0.50``
1083
1084
1085 ``mon osd prime pg temp max time estimate``
1086
1087 :Description: Maximum estimate of time spent on each PG before we prime all PGs
1088               in parallel.
1089
1090 :Type: Float
1091 :Default: ``0.25``
1092
1093
1094 ``mon mds skip sanity``
1095
1096 :Description: Skip safety assertions on FSMap (in case of bugs where we want to
1097               continue anyway). The monitor terminates if the FSMap sanity check
1098               fails, but the check can be disabled by enabling this option.
1099
1100 :Type: Boolean
1101 :Default: ``False``
1102
1103
1104 ``mon max mdsmap epochs``
1105
1106 :Description: The maximum number of mdsmap epochs to trim during a single proposal.
1107 :Type: Integer
1108 :Default: ``500``
1109
1110
1111 ``mon config key max entry size``
1112
1113 :Description: The maximum size of a config-key entry (in bytes).
1114 :Type: Integer
1115 :Default: ``65536``
1116
1117
1118 ``mon scrub interval``
1119
1120 :Description: How often (in seconds) the monitor scrubs its store by comparing
1121               the stored checksums with the computed ones of all the stored
1122               keys.
1123
1124 :Type: Integer
1125 :Default: ``3600*24``
1126
1127
1128 ``mon scrub max keys``
1129
1130 :Description: The maximum number of keys to scrub each time.
1131 :Type: Integer
1132 :Default: ``100``
1133
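The default scrub interval is given as an expression: ``3600*24`` is ``86400``
seconds, or one day. A minimal sketch of the two scrub options with the
defaults written as plain integers follows. The ``[mon]`` placement is an
assumption.

.. code-block:: ini

    [mon]
        # 3600 * 24 = 86400 seconds, i.e. scrub the store once a day.
        mon scrub interval = 86400
        # Scrub at most 100 keys each time.
        mon scrub max keys = 100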
1134
1135 ``mon compact on start``
1136
1137 :Description: Compact the database used as Ceph Monitor store on
1138               ``ceph-mon`` start. A manual compaction helps to shrink the
1139               monitor database and improve its performance if the regular
1140               compaction fails to work.
1141
1142 :Type: Boolean
1143 :Default: ``False``
1144
1145
1146 ``mon compact on bootstrap``
1147
1148 :Description: Compact the database used as Ceph Monitor store
1149               on bootstrap. Monitors start probing each other to create
1150               a quorum after bootstrap. If a monitor times out before joining
1151               the quorum, it will start over and bootstrap itself again.
1152
1153 :Type: Boolean
1154 :Default: ``False``
1155
1156
1157 ``mon compact on trim``
1158
1159 :Description: Compact a certain prefix (including paxos) when we trim its old states.
1160 :Type: Boolean
1161 :Default: ``True``
1162
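As a sketch of how the three compaction switches might be combined, the
fragment below turns on compaction at start while leaving the other two options
at the defaults listed above. Enabling ``mon compact on start`` here is a
hypothetical choice for illustration, and the ``[mon]`` placement is an
assumption.

.. code-block:: ini

    [mon]
        # Hypothetical choice: compact the store each time ceph-mon starts.
        mon compact on start = true
        # Defaults restated from above.
        mon compact on bootstrap = false
        mon compact on trim = true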
1163
1164 ``mon cpu threads``
1165
1166 :Description: Number of threads for performing CPU-intensive work on the monitor.
1167 :Type: Integer
1168 :Default: ``4``
1169
1170
1171 ``mon osd mapping pgs per chunk``
1172
1173 :Description: We calculate the mapping from placement group to OSDs in chunks.
1174               This option specifies the number of placement groups per chunk.
1175
1176 :Type: Integer
1177 :Default: ``4096``
1178
1179
1180 ``mon session timeout``
1181
1182 :Description: The monitor will terminate inactive sessions that stay idle
1183               longer than this time limit (in seconds).
1184
1185 :Type: Integer
1186 :Default: ``300``
1187
1188
1189 ``mon osd cache size min``
1190
1191 :Description: The minimum number of bytes to be kept mapped in memory for osd
1192               monitor caches.
1193
1194 :Type: 64-bit Integer
1195 :Default: ``134217728``
1196
1197
1198 ``mon memory target``
1199
1200 :Description: The number of bytes pertaining to osd monitor caches and kv cache
1201               to be kept mapped in memory with cache auto-tuning enabled.
1202
1203 :Type: 64-bit Integer
1204 :Default: ``2147483648``
1205
1206
1207 ``mon memory autotune``
1208
1209 :Description: Autotune the cache memory being used for osd monitors and kv
1210               database.
1211
1212 :Type: Boolean
1213 :Default: ``True``
1214
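The byte values for the memory options above are powers of two:
``134217728`` bytes is 128 MiB and ``2147483648`` bytes is 2 GiB. A minimal
sketch that restates these defaults in a configuration file follows. The
``[mon]`` placement is an assumption.

.. code-block:: ini

    [mon]
        # Let the monitor autotune its caches within the memory target.
        mon memory autotune = true
        # 2147483648 bytes = 2 GiB.
        mon memory target = 2147483648
        # 134217728 bytes = 128 MiB.
        mon osd cache size min = 134217728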
1215
1216 .. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
1217 .. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys
1218 .. _Ceph configuration file: ../ceph-conf/#monitors
1219 .. _Network Configuration Reference: ../network-config-ref
1220 .. _Monitor lookup through DNS: ../mon-lookup-dns
1221 .. _ACID: https://en.wikipedia.org/wiki/ACID
1222 .. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons
1223 .. _Add/Remove a Monitor (ceph-deploy): ../../deployment/ceph-deploy-mon
1224 .. _Monitoring a Cluster: ../../operations/monitoring
1225 .. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg
1226 .. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap
1227 .. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address
1228 .. _Monitor/OSD Interaction: ../mon-osd-interaction
1229 .. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
1230 .. _Pool values: ../../operations/pools/#set-pool-values