ceph/monitoring/ceph-mixin/tests_alerts/README.md

## Alert Rule Standards

The alert rules should adhere to the following principles:
- each alert must have a unique name
- each alert should define a common structure
  - labels : must contain severity and type
  - annotations : must provide description
  - expr : must define the promql expression
  - alert : defines the alert name
- alerts that have a corresponding section within docs.ceph.com must include a
  documentation field in the annotations section
- critical alerts should declare an oid in the labels section
- critical alerts should have a corresponding entry in the Ceph MIB

&nbsp;
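
The structural standards listed above can be sketched as a few checks over a parsed rule. This is a minimal illustration only, not the logic of `validate_rules.py`: the function names, the example rule, and the problem messages are all hypothetical, and the documentation-link and MIB checks are left out.

```python
from collections import Counter

# Top-level fields named in the standards above.
REQUIRED_FIELDS = ("alert", "expr", "labels", "annotations")


def check_alert(rule):
    """Return a list of problem strings for one alert rule (a plain dict)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rule]
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    for key in ("severity", "type"):
        if key not in labels:
            problems.append(f"labels is missing: {key}")
    if "description" not in annotations:
        problems.append("annotations is missing: description")
    # Critical alerts should declare an oid in the labels section.
    if labels.get("severity") == "critical" and "oid" not in labels:
        problems.append("critical alert has no oid label")
    return problems


def find_duplicate_names(rules):
    """Return alert names that appear more than once."""
    counts = Counter(rule.get("alert") for rule in rules)
    return sorted(name for name, n in counts.items() if name is not None and n > 1)


# Hypothetical rule: structurally complete, but critical without an oid.
example_rule = {
    "alert": "ExampleAlert",
    "expr": "up == 0",
    "labels": {"severity": "critical", "type": "example_type"},
    "annotations": {"description": "An example description."},
}

print(check_alert(example_rule))  # ['critical alert has no oid label']
```
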
## Testing Prometheus Rules
Once you have updated the `ceph_default_alerts.yml` file, run the
`validate_rules.py` script, either directly or via `tox`, to ensure that your
changes conform to our rule structure guidelines. The script processes the rules,
looks for configuration anomalies, and outputs a report if it detects problems.

Here's an example run that illustrates the report format and the kinds of issues detected.

```
[paul@myhost tests]$ ./validate_rules.py

Checking rule groups
        cluster health : ..
        mon            : E.W..
        osd            : E...W......W.E..
        mds            : WW
        mgr            : WW
        pgs            : ..WWWW..
        nodes          : .EEEE
        pools          : EEEW.
        healthchecks   : .
        cephadm        : WW.
        prometheus     : W
        rados          : W

Summary

Rule file             : ../alerts/ceph_default_alerts.yml
Unit Test file        : test_alerts.yml

Rule groups processed :  12
Rules processed       :  51
Rule errors           :  10
Rule warnings         :  16
Rule name duplicates  :   0
Unit tests missing    :   4

Problem Report

  Group       Severity  Alert Name                                          Problem Description
  -----       --------  ----------                                          -------------------
  cephadm     Warning   Cluster upgrade has failed                          critical level alert is missing an SNMP oid entry
  cephadm     Warning   A daemon managed by cephadm is down                 critical level alert is missing an SNMP oid entry
  mds         Warning   Ceph Filesystem damage detected                     critical level alert is missing an SNMP oid entry
  mds         Warning   Ceph Filesystem switched to READ ONLY               critical level alert is missing an SNMP oid entry
  mgr         Warning   mgr module failure                                  critical level alert is missing an SNMP oid entry
  mgr         Warning   mgr prometheus module is not active                 critical level alert is missing an SNMP oid entry
  mon         Error     Monitor down, quorum is at risk                     documentation link error: #mon-downwah not found on the page
  mon         Warning   Ceph mon disk space critically low                  critical level alert is missing an SNMP oid entry
  nodes       Error     network packets dropped                             invalid alert structure. Missing field: for
  nodes       Error     network packet errors                               invalid alert structure. Missing field: for
  nodes       Error     storage filling up                                  invalid alert structure. Missing field: for
  nodes       Error     MTU Mismatch                                        invalid alert structure. Missing field: for
  osd         Error     10% OSDs down                                       invalid alert structure. Missing field: for
  osd         Error     Flapping OSD                                        invalid alert structure. Missing field: for
  osd         Warning   OSD Full                                            critical level alert is missing an SNMP oid entry
  osd         Warning   Too many devices predicted to fail                  critical level alert is missing an SNMP oid entry
  pgs         Warning   Placement Group (PG) damaged                        critical level alert is missing an SNMP oid entry
  pgs         Warning   Recovery at risk, cluster too full                  critical level alert is missing an SNMP oid entry
  pgs         Warning   I/O blocked to some data                            critical level alert is missing an SNMP oid entry
  pgs         Warning   Cluster too full, automatic data recovery impaired  critical level alert is missing an SNMP oid entry
  pools       Error     pool full                                           invalid alert structure. Missing field: for
  pools       Error     pool filling up (growth forecast)                   invalid alert structure. Missing field: for
  pools       Error     Ceph pool is too full for recovery/rebalance        invalid alert structure. Missing field: for
  pools       Warning   Ceph pool is full - writes blocked                  critical level alert is missing an SNMP oid entry
  prometheus  Warning   Scrape job is missing                               critical level alert is missing an SNMP oid entry
  rados       Warning   Data not found/missing                              critical level alert is missing an SNMP oid entry

Unit tests are incomplete. Tests missing for the following alerts;
  - Placement Group (PG) damaged
  - OSD Full
  - storage filling up
  - pool filling up (growth forecast)

```
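
In the example run, each rule group appears to get one status character per rule: `.` for a clean rule, `W` for a warning, and `E` for an error. That reading is inferred from the sample output rather than from the script itself. The sketch below, with the hypothetical helper `tally`, counts those characters and shows that they add up to the totals in the Summary section.

```python
from collections import Counter

# Status strings copied from the "Checking rule groups" section of the example run.
GROUP_STATUS = {
    "cluster health": "..",
    "mon": "E.W..",
    "osd": "E...W......W.E..",
    "mds": "WW",
    "mgr": "WW",
    "pgs": "..WWWW..",
    "nodes": ".EEEE",
    "pools": "EEEW.",
    "healthchecks": ".",
    "cephadm": "WW.",
    "prometheus": "W",
    "rados": "W",
}


def tally(group_status):
    """Count status characters across all rule groups."""
    counts = Counter()
    for status in group_status.values():
        counts.update(status)  # one character per rule
    return counts


counts = tally(GROUP_STATUS)
print("groups   :", len(GROUP_STATUS))        # 12
print("rules    :", sum(counts.values()))     # 51
print("errors   :", counts["E"])              # 10
print("warnings :", counts["W"])              # 16
```

The counts agree with `Rule groups processed`, `Rules processed`, `Rule errors`, and `Rule warnings` in the Summary, which supports the one-character-per-rule reading.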