+.. _health-checks:
=============
Health checks
If a monitor is configured to listen for v1 connections on a non-standard port (not 6789), then the monmap will need to be modified manually.
+MON_DISK_LOW
+____________
+
+One or more monitors is low on disk space. This alert triggers if the
+available space on the file system storing the monitor database
+(normally ``/var/lib/ceph/mon``), as a percentage, drops below
+``mon_data_avail_warn`` (default: 30%).
+
+This may indicate that some other process or user on the system is
+filling up the same file system used by the monitor. It may also
+indicate that the monitor's database is large (see ``MON_DISK_BIG``
+below).
+
+If space cannot be freed, the monitor's data directory may need to be
+moved to another storage device or file system (while the monitor
+daemon is not running, of course).
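+
+For example, to check the free space on the monitor's file system and,
+if appropriate, adjust the warning threshold (the path below is the
+usual default and the 20% value is only illustrative)::
+
+ df -h /var/lib/ceph/mon
+ ceph config set mon mon_data_avail_warn 20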
+
+
+MON_DISK_CRIT
+_____________
+
+One or more monitors is critically low on disk space. This alert
+triggers if the available space on the file system storing the monitor
+database (normally ``/var/lib/ceph/mon``), as a percentage, drops
+below ``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above.
+
+MON_DISK_BIG
+____________
+
+The database size for one or more monitors is very large. This alert
+triggers if the size of the monitor's database is larger than
+``mon_data_size_warn`` (default: 15 GiB).
+
+A large database is unusual, but may not necessarily indicate a
+problem. Monitor databases may grow in size when there are placement
+groups that have not reached an ``active+clean`` state in a long time.
+
+This may also indicate that the monitor's database is not properly
+compacting, which has been observed with some older versions of
+leveldb and rocksdb. Forcing a compaction with ``ceph daemon mon.<id>
+compact`` may shrink the on-disk size.
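+
+For example, on the monitor's host, the database size can be checked
+and a compaction forced (``mon.a`` and the ``store.db`` path are
+illustrative defaults)::
+
+ du -sh /var/lib/ceph/mon/ceph-a/store.db
+ ceph daemon mon.a compact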
+
+This warning may also indicate that the monitor has a bug that is
+preventing it from pruning the cluster metadata it stores. If the
+problem persists, please report a bug.
+
+The warning threshold may be adjusted with::
+
+ ceph config set global mon_data_size_warn <size>
+
Manager
-------
+MGR_DOWN
+________
+
+All manager daemons are currently down. The cluster should normally
+have at least one running manager (``ceph-mgr``) daemon. If no
+manager daemon is running, the cluster's ability to monitor itself will
+be compromised, and parts of the management API will become
+unavailable (for example, the dashboard will not work, and most CLI
+commands that report metrics or runtime state will block). However,
+the cluster will still be able to perform all IO operations and
+recover from failures.
+
+The down manager daemon should generally be restarted as soon as
+possible to ensure that the cluster can be monitored (e.g., so that
+the ``ceph -s`` information is up to date, and/or metrics can be
+scraped by Prometheus).
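+
+For example, on a systemd-managed (non-cephadm) host, the manager can
+usually be restarted with (the instance name, typically the host name,
+is a placeholder)::
+
+ systemctl restart ceph-mgr@<hostname>
+
+Afterwards, ``ceph -s`` should again report an active manager.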
+
+
MGR_MODULE_DEPENDENCY
_____________________
OSD_OUT_OF_ORDER_FULL
_____________________
-The utilization thresholds for `backfillfull`, `nearfull`, `full`,
+The utilization thresholds for `nearfull`, `backfillfull`, `full`,
and/or `failsafe_full` are not ascending. In particular, we expect
-`backfillfull < nearfull`, `nearfull < full`, and `full <
+`nearfull < backfillfull`, `backfillfull < full`, and `full <
failsafe_full`.
The thresholds can be adjusted with::
- ceph osd set-backfillfull-ratio <ratio>
ceph osd set-nearfull-ratio <ratio>
+ ceph osd set-backfillfull-ratio <ratio>
ceph osd set-full-ratio <ratio>
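+
+The currently configured ratios can be reviewed before adjusting them;
+the exact output format may vary by release::
+
+ ceph osd dump | grep ratio
+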
ceph config set global bluestore_warn_on_legacy_statfs false
+BLUESTORE_NO_PER_POOL_OMAP
+__________________________
+
+Starting with the Octopus release, BlueStore tracks omap space utilization
+by pool, and one or more OSDs have volumes that were created prior to
+Octopus. If not all OSDs are running BlueStore with the new tracking
+enabled, the cluster will report an approximate value for per-pool omap
+usage based on the most recent deep-scrub.
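+
+The affected OSDs are normally listed in the health detail output::
+
+ ceph health detail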
+
+The old OSDs can be updated to track by pool by stopping each OSD,
+running a repair operation, and then restarting it. For example, if
+``osd.123`` needed to be updated::
+
+ systemctl stop ceph-osd@123
+ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
+ systemctl start ceph-osd@123
+
+This warning can be disabled with::
+
+ ceph config set global bluestore_warn_on_no_per_pool_omap false
+
BLUESTORE_DISK_SIZE_MISMATCH
____________________________
ceph-volume lvm zap /path/to/device
ceph-volume lvm create --osd-id $N --data /path/to/device
+BLUESTORE_NO_COMPRESSION
+________________________
+
+One or more OSDs is unable to load a BlueStore compression plugin.
+This can be caused by a broken installation, in which the ``ceph-osd``
+binary does not match the compression plugins, or a recent upgrade
+that did not include a restart of the ``ceph-osd`` daemon.
+
+Verify that the package(s) on the host running the OSD(s) in question
+are correctly installed and that the OSD daemon(s) have been
+restarted. If the problem persists, check the OSD log for any clues
+as to the source of the problem.
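+
+For example, on a package-based installation, an affected OSD can be
+restarted and its log inspected (``osd.123`` and the log path are
+illustrative)::
+
+ systemctl restart ceph-osd@123
+ less /var/log/ceph/ceph-osd.123.log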
+
+
Device health
-------------
Please refer to :ref:`choosing-number-of-placement-groups` and
:ref:`pg-autoscaler` for more information.
-POOL_TARGET_SIZE_RATIO_OVERCOMMITTED
+POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
____________________________________
-One or more pools have a ``target_size_ratio`` property set to
-estimate the expected size of the pool as a fraction of total storage,
+One or more pools have a ``target_size_bytes`` property set to
+estimate the expected size of the pool,
but the value(s) exceed the total available storage (either by
themselves or in combination with other pools' actual usage).
-This is usually an indication that the ``target_size_ratio`` value for
+This is usually an indication that the ``target_size_bytes`` value for
the pool is too large and should be reduced or set to zero with::
- ceph osd pool set <pool-name> target_size_ratio 0
+ ceph osd pool set <pool-name> target_size_bytes 0
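+
+The target sizes of all pools can be compared against the cluster's
+total capacity with the autoscaler status report::
+
+ ceph osd pool autoscale-status
+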
For more information, see :ref:`specifying_pool_target_size`.
-POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
+POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO
____________________________________
-One or more pools have a ``target_size_bytes`` property set to
-estimate the expected size of the pool,
-but the value(s) exceed the total available storage (either by
-themselves or in combination with other pools' actual usage).
+One or more pools have both ``target_size_bytes`` and
+``target_size_ratio`` set to estimate the expected size of the pool.
+Only one of these properties should be non-zero. If both are set,
+``target_size_ratio`` takes precedence and ``target_size_bytes`` is
+ignored.
-This is usually an indication that the ``target_size_bytes`` value for
-the pool is too large and should be reduced or set to zero with::
+To reset ``target_size_bytes`` to zero::
ceph osd pool set <pool-name> target_size_bytes 0
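+
+To see which of the two properties is set on a given pool (the pool
+name is a placeholder)::
+
+ ceph osd pool get <pool-name> target_size_bytes
+ ceph osd pool get <pool-name> target_size_ratio
+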
*TOO_MANY_PGS* above.
The threshold can be raised to silence the health warning by adjusting
-the ``mon_pg_warn_max_object_skew`` config option on the monitors.
+the ``mon_pg_warn_max_object_skew`` config option on the managers.
POOL_APP_NOT_ENABLED
ceph pg deep-scrub <pgid>
+PG_SLOW_SNAP_TRIMMING
+_____________________
+
+The snapshot trim queue for one or more PGs has exceeded the
+configured warning threshold. This indicates that either an extremely
+large number of snapshots were recently deleted, or that the OSDs are
+unable to trim snapshots quickly enough to keep up with the rate of
+new snapshot deletions.
+
+The warning threshold is controlled by the
+``mon_osd_snap_trim_queue_warn_on`` option (default: 32768).
+
+This warning may trigger if OSDs are under excessive load and unable
+to keep up with their background work, or if the OSDs' internal
+metadata database is heavily fragmented and performing poorly. It may
+also indicate some other performance issue with the OSDs.
+
+The exact size of the snapshot trim queue is reported by the
+``snaptrimq_len`` field of ``ceph pg ls -f json-detail``.
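+
+For example, to list the trim queue length for each PG (this assumes
+``jq`` is installed and that the JSON output contains a top-level
+``pg_stats`` array, which may vary by release) and to raise the
+warning threshold to double the default::
+
+ ceph pg ls -f json-detail | jq '.pg_stats[] | {pgid, snaptrimq_len}'
+ ceph config set mon mon_osd_snap_trim_queue_warn_on 65536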
+
+
+
Miscellaneous
-------------
To disable telemetry (and make this warning go away),::
ceph telemetry off
+
+AUTH_BAD_CAPS
+_____________
+
+One or more auth users has capabilities that cannot be parsed by the
+monitor. This generally indicates that the user will not be
+authorized to perform any action with one or more daemon types.
+
+This error is most likely to occur after an upgrade if the
+capabilities were set with an older version of Ceph that did not
+properly validate their syntax, or if the syntax of the capabilities
+has changed.
+
+The user in question can be removed with::
+
+ ceph auth rm <entity-name>
+
+(This will resolve the health alert, but obviously clients will not be
+able to authenticate as that user.)
+
+Alternatively, the capabilities for the user can be updated with::
+
+ ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...]
+
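+A user's existing capabilities can be inspected with ``ceph auth get
+<entity-name>``. As a purely illustrative example, updating a client's
+capabilities might look like::
+
+ ceph auth caps client.foo mon 'allow r' osd 'allow rw pool=foo-pool'
+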
+For more information about auth capabilities, see :ref:`user-management`.
+
+
+OSD_NO_DOWN_OUT_INTERVAL
+________________________
+
+The ``mon_osd_down_out_interval`` option is set to zero, which means
+that the system will not automatically perform any repair or healing
+operations after an OSD fails. Instead, an administrator (or some
+other external entity) will need to manually mark OSDs that are down
+as 'out' (i.e., via ``ceph osd out <osd-id>``) in order to trigger recovery.
+
+This option is normally set to five or ten minutes--enough time for a
+host to power-cycle or reboot.
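+
+For example, to restore automatic marking-out after ten minutes (600
+seconds)::
+
+ ceph config set mon mon_osd_down_out_interval 600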
+
+This warning can be silenced by setting
+``mon_warn_on_osd_down_out_interval_zero`` to ``false``::
+
+ ceph config set global mon_warn_on_osd_down_out_interval_zero false