import 15.2.0 Octopus source

diff --git a/ceph/doc/rados/operations/health-checks.rst b/ceph/doc/rados/operations/health-checks.rst

index 3cbf9eae4afae64adb9ec7ad37becfadb9674b2d..bd71a2ee937f8c8918d8842ed7183868866d84d6 100644
--- a/ceph/doc/rados/operations/health-checks.rst
+++ b/ceph/doc/rados/operations/health-checks.rst
@@ -1,3 +1,4 @@
+.. _health-checks:
  
  =============
  Health checks
@@ -71,10 +72,78 @@ listen for v2 connections on the new default 3300 port.
  If a monitor is configured to listen for v1 connections on a non-standard port (not 6789), then the monmap will need to be modified manually.
  
  
+MON_DISK_LOW
+____________
+
+One or more monitors are low on disk space.  This alert triggers if the
+available space on the file system storing the monitor database
+(normally ``/var/lib/ceph/mon``), as a percentage, drops below
+``mon_data_avail_warn`` (default: 30%).
+
+This may indicate that some other process or user on the system is
+filling up the same file system used by the monitor.  It may also
+indicate that the monitor's database is large (see ``MON_DISK_BIG``
+below).
+
+If space cannot be freed, the monitor's data directory may need to be
+moved to another storage device or file system (while the monitor
+daemon is not running, of course).
+
+
+MON_DISK_CRIT
+_____________
+
+One or more monitors are critically low on disk space.  This alert
+triggers if the available space on the file system storing the monitor
+database (normally ``/var/lib/ceph/mon``), as a percentage, drops
+below ``mon_data_avail_crit`` (default: 5%).  See ``MON_DISK_LOW``, above.
+
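The two thresholds above can be illustrated with a minimal Python sketch.
It is not how the monitor itself evaluates the check: the function names
``classify_mon_disk`` and ``avail_percent`` are hypothetical, the strict
"below" comparison follows the wording of the text rather than the monitor
source, and the defaults of 30 and 5 are the documented values of
``mon_data_avail_warn`` and ``mon_data_avail_crit``.

```python
import shutil


def avail_percent(path):
    """Return the available space on the file system holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def classify_mon_disk(avail_pct, warn_pct=30.0, crit_pct=5.0):
    """Map an available-space percentage to a health check name, or None.

    Assumption: a value exactly at a threshold does not trigger the alert,
    following the "drops below" wording in the text.
    """
    if avail_pct < crit_pct:
        return "MON_DISK_CRIT"
    if avail_pct < warn_pct:
        return "MON_DISK_LOW"
    return None
```

On a monitor host, ``classify_mon_disk(avail_percent("/var/lib/ceph/mon"))``
would give a rough local estimate of which alert, if any, to expect.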
+MON_DISK_BIG
+____________
+
+The database size for one or more monitors is very large.  This alert
+triggers if the size of the monitor's database is larger than
+``mon_data_size_warn`` (default: 15 GiB).
+
+A large database is unusual, but may not necessarily indicate a
+problem.  Monitor databases may grow in size when there are placement
+groups that have not reached an ``active+clean`` state in a long time.
+
+This may also indicate that the monitor's database is not properly
+compacting, which has been observed with some older versions of
+leveldb and rocksdb.  Forcing a compaction with ``ceph daemon mon.<id>
+compact`` may shrink the on-disk size.
+
+This warning may also indicate that the monitor has a bug that is
+preventing it from pruning the cluster metadata it stores.  If the
+problem persists, please report a bug.
+
+The warning threshold may be adjusted with::
+
+  ceph config set global mon_data_size_warn <size>
+
  
  Manager
  -------
  
+MGR_DOWN
+________
+
+All manager daemons are currently down.  The cluster should normally
+have at least one running manager (``ceph-mgr``) daemon.  If no
+manager daemon is running, the cluster's ability to monitor itself will
+be compromised, and parts of the management API will become
+unavailable (for example, the dashboard will not work, and most CLI
+commands that report metrics or runtime state will block).  However,
+the cluster will still be able to perform all IO operations and
+recover from failures.
+
+The down manager daemon should generally be restarted as soon as
+possible to ensure that the cluster can be monitored (e.g., so that
+the ``ceph -s`` information is up to date, and/or metrics can be
+scraped by Prometheus).
+
+
  MGR_MODULE_DEPENDENCY
  _____________________
  
@@ -140,15 +209,15 @@ The OSD can be removed from the CRUSH hierarchy with::
  OSD_OUT_OF_ORDER_FULL
  _____________________
  
-The utilization thresholds for `backfillfull`, `nearfull`, `full`,
+The utilization thresholds for `nearfull`, `backfillfull`, `full`,
  and/or `failsafe_full` are not ascending.  In particular, we expect
-`backfillfull < nearfull`, `nearfull < full`, and `full <
+`nearfull < backfillfull`, `backfillfull < full`, and `full <
  failsafe_full`.
  
  The thresholds can be adjusted with::
  
-  ceph osd set-backfillfull-ratio <ratio>
    ceph osd set-nearfull-ratio <ratio>
+  ceph osd set-backfillfull-ratio <ratio>
    ceph osd set-full-ratio <ratio>
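
The ordering requirement described above can be checked before changing any
ratio. The following Python sketch is illustrative only: the function name
``out_of_order_ratios`` is hypothetical, and the ratios are passed in by the
caller rather than read from a cluster.

```python
def out_of_order_ratios(nearfull, backfillfull, full, failsafe_full):
    """Return the adjacent threshold pairs that are not strictly ascending.

    The expected order is nearfull < backfillfull < full < failsafe_full.
    An empty list means the thresholds are consistent.
    """
    ordered = [
        ("nearfull", nearfull),
        ("backfillfull", backfillfull),
        ("full", full),
        ("failsafe_full", failsafe_full),
    ]
    problems = []
    # Compare each threshold with the one that should be larger than it.
    for (low_name, low), (high_name, high) in zip(ordered, ordered[1:]):
        if not low < high:
            problems.append((low_name, high_name))
    return problems
```

For example, ``out_of_order_ratios(0.90, 0.85, 0.95, 0.97)`` reports the
``("nearfull", "backfillfull")`` pair, which is the situation this health
check warns about.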
  
  
@@ -409,6 +478,27 @@ This warning can be disabled with::
  
    ceph config set global bluestore_warn_on_legacy_statfs false
  
+BLUESTORE_NO_PER_POOL_OMAP
+__________________________
+
+Starting with the Octopus release, BlueStore tracks omap space utilization
+by pool, and one or more OSDs have volumes that were created prior to
+Octopus.  If not all OSDs are running BlueStore with the new tracking
+enabled, the cluster will report an approximate value for per-pool omap usage
+based on the most recent deep-scrub.
+
+The old OSDs can be updated to track by pool by stopping each OSD,
+running a repair operation, and then restarting it.  For example, if
+``osd.123`` needed to be updated,::
+
+  systemctl stop ceph-osd@123
+  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+  systemctl start ceph-osd@123
+
+This warning can be disabled with::
+
+  ceph config set global bluestore_warn_on_no_per_pool_omap false
+
  
  BLUESTORE_DISK_SIZE_MISMATCH
  ____________________________
@@ -427,6 +517,20 @@ risk.  For example, if osd ``$N`` has the error,::
    ceph-volume lvm zap /path/to/device
    ceph-volume lvm create --osd-id $N --data /path/to/device
  
+BLUESTORE_NO_COMPRESSION
+________________________
+
+One or more OSDs are unable to load a BlueStore compression plugin.
+This can be caused by a broken installation, in which the ``ceph-osd``
+binary does not match the compression plugins, or a recent upgrade
+that did not include a restart of the ``ceph-osd`` daemon.
+
+Verify that the package(s) on the host running the OSD(s) in question
+are correctly installed and that the OSD daemon(s) have been
+restarted.  If the problem persists, check the OSD log for any clues
+as to the source of the problem.
+
+
  
  Device health
  -------------
@@ -729,31 +833,31 @@ recommended amount with::
  Please refer to :ref:`choosing-number-of-placement-groups` and
  :ref:`pg-autoscaler` for more information.
  
-POOL_TARGET_SIZE_RATIO_OVERCOMMITTED
+POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
  ____________________________________
  
-One or more pools have a ``target_size_ratio`` property set to
-estimate the expected size of the pool as a fraction of total storage,
+One or more pools have a ``target_size_bytes`` property set to
+estimate the expected size of the pool,
  but the value(s) exceed the total available storage (either by
  themselves or in combination with other pools' actual usage).
  
-This is usually an indication that the ``target_size_ratio`` value for
+This is usually an indication that the ``target_size_bytes`` value for
  the pool is too large and should be reduced or set to zero with::
  
-  ceph osd pool set <pool-name> target_size_ratio 0
+  ceph osd pool set <pool-name> target_size_bytes 0
  
  For more information, see :ref:`specifying_pool_target_size`.
  
-POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
+POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO
  ____________________________________
  
-One or more pools have a ``target_size_bytes`` property set to
-estimate the expected size of the pool,
-but the value(s) exceed the total available storage (either by
-themselves or in combination with other pools' actual usage).
+One or more pools have both ``target_size_bytes`` and
+``target_size_ratio`` set to estimate the expected size of the pool.
+Only one of these properties should be non-zero. If both are set,
+``target_size_ratio`` takes precedence and ``target_size_bytes`` is
+ignored.
  
-This is usually an indication that the ``target_size_bytes`` value for
-the pool is too large and should be reduced or set to zero with::
+To reset ``target_size_bytes`` to zero::
  
    ceph osd pool set <pool-name> target_size_bytes 0
  
@@ -795,7 +899,7 @@ not contain as much data have too many PGs.  See the discussion of
  *TOO_MANY_PGS* above.
  
  The threshold can be raised to silence the health warning by adjusting
-the ``mon_pg_warn_max_object_skew`` config option on the monitors.
+the ``mon_pg_warn_max_object_skew`` config option on the managers.
  
  
  POOL_APP_NOT_ENABLED
@@ -929,6 +1033,28 @@ You can manually initiate a scrub of a clean PG with::
    ceph pg deep-scrub <pgid>
  
  
+PG_SLOW_SNAP_TRIMMING
+_____________________
+
+The snapshot trim queue for one or more PGs has exceeded the
+configured warning threshold.  This indicates that either an extremely
+large number of snapshots were recently deleted, or that the OSDs are
+unable to trim snapshots quickly enough to keep up with the rate of
+new snapshot deletions.
+
+The warning threshold is controlled by the
+``mon_osd_snap_trim_queue_warn_on`` option (default: 32768).
+
+This warning may trigger if OSDs are under excessive load and unable
+to keep up with their background work, or if the OSDs' internal
+metadata database is heavily fragmented and unable to perform well.  It may
+also indicate some other performance issue with the OSDs.
+
+The exact size of the snapshot trim queue is reported by the
+``snaptrimq_len`` field of ``ceph pg ls -f json-detail``.
+
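As a rough illustration of using the ``snaptrimq_len`` field, the Python
sketch below filters the JSON printed by ``ceph pg ls -f json-detail`` for
PGs whose queue exceeds the threshold. The function name
``slow_snap_trim_pgs`` is hypothetical, and the JSON layout is an assumption:
the sketch accepts either a bare list of PG records or an object with a
``pg_stats`` list, where each record has ``pgid`` and ``snaptrimq_len`` keys.

```python
import json


def slow_snap_trim_pgs(pg_ls_json, threshold=32768):
    """Return sorted (pgid, snaptrimq_len) pairs whose queue exceeds `threshold`.

    `pg_ls_json` is the text output of `ceph pg ls -f json-detail`.
    """
    data = json.loads(pg_ls_json)
    # Assumed layouts: {"pg_stats": [...]} or a bare list of PG records.
    stats = data.get("pg_stats", []) if isinstance(data, dict) else data
    return sorted(
        (pg["pgid"], pg["snaptrimq_len"])
        for pg in stats
        if pg.get("snaptrimq_len", 0) > threshold
    )
```

The default of 32768 matches the documented default of
``mon_osd_snap_trim_queue_warn_on``; pass a different ``threshold`` if that
option has been changed.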
+
+
  Miscellaneous
  -------------
  
@@ -996,3 +1122,46 @@ To re-enable telemetry (and make this warning go away),::
  To disable telemetry (and make this warning go away),::
  
    ceph telemetry off
+
+AUTH_BAD_CAPS
+_____________
+
+One or more auth users have capabilities that cannot be parsed by the
+monitor.  This generally indicates that the user will not be
+authorized to perform any action with one or more daemon types.
+
+This error is most likely to occur after an upgrade if the
+capabilities were set with an older version of Ceph that did not
+properly validate their syntax, or if the syntax of the capabilities
+has changed.
+
+The user in question can be removed with::
+
+  ceph auth rm <entity-name>
+
+(This will resolve the health alert, but obviously clients will not be
+able to authenticate as that user.)
+
+Alternatively, the capabilities for the user can be updated with::
+
+  ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...]
+
+For more information about auth capabilities, see :ref:`user-management`.
+
+
+OSD_NO_DOWN_OUT_INTERVAL
+________________________
+
+The ``mon_osd_down_out_interval`` option is set to zero, which means
+that the system will not automatically perform any repair or healing
+operations after an OSD fails.  Instead, an administrator (or some
+other external entity) will need to manually mark OSDs that are down as 'out'
+(i.e., via ``ceph osd out <osd-id>``) in order to trigger recovery.
+
+This option is normally set to five or ten minutes--enough time for a
+host to power-cycle or reboot.
+
+This warning can be silenced by setting the
+``mon_warn_on_osd_down_out_interval_zero`` option to false::
+
+  ceph config set mon mon_warn_on_osd_down_out_interval_zero false