## Alert Rule Standards

The alert rules should adhere to the following principles:
- each alert must have a unique name
- each alert should define a common structure
  - `labels`: must contain `severity` and `type`
  - `annotations`: must provide `description`
  - `expr`: must define the PromQL expression
  - `alert`: defines the alert name
- alerts that have a corresponding section within docs.ceph.com must include a
  `documentation` field in the annotations section
- critical alerts should declare an `oid` in the labels section
- critical alerts should have a corresponding entry in the Ceph MIB
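
The principles above can be sketched as a small structural check. This is only an illustration, assuming each rule has already been loaded into a Python dict; it is not the logic used by `validate_rules.py`, and the function names, the problem messages, and the example rule (name, expression, label values) are all hypothetical.

```python
# Illustrative sketch only; this is not the validate_rules.py implementation.
REQUIRED_FIELDS = ("alert", "expr", "labels", "annotations")
REQUIRED_LABELS = ("severity", "type")


def check_alert(rule):
    """Return a list of problems found in one alert rule (a dict)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in rule:
            problems.append(f"missing field: {field}")

    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    for label in REQUIRED_LABELS:
        if label not in labels:
            problems.append(f"missing label: {label}")
    if "description" not in annotations:
        problems.append("missing annotation: description")
    # Critical alerts should declare an oid in the labels section.
    if labels.get("severity") == "critical" and "oid" not in labels:
        problems.append("critical alert is missing an oid label")
    return problems


def duplicate_names(rules):
    """Return the alert names that appear more than once."""
    seen, dupes = set(), set()
    for rule in rules:
        name = rule.get("alert")
        if name is None:
            continue  # a missing name is reported by check_alert instead
        if name in seen:
            dupes.add(name)
        seen.add(name)
    return dupes


# Hypothetical rule: the name, expression, and label values are placeholders.
example_rule = {
    "alert": "ExampleAlert",
    "expr": "up == 0",
    "labels": {"severity": "critical", "type": "example_type"},
    "annotations": {"description": "Placeholder description."},
}

for problem in check_alert(example_rule):
    print(problem)
```

The example rule is critical but has no `oid` label, so the sketch reports that one problem. The example report later in this README also flags a missing `for` field, which suggests the real script checks more than the fields listed here.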

## Testing Prometheus Rules
After updating the `ceph_default_alerts.yml` file, run the `validate_rules.py`
script, either directly or via `tox`, to check that your changes follow the rule
structure guidelines above. The script processes the rules, looks for
configuration anomalies, and prints a report if it detects any problems.

Here's an example run that illustrates the report format and the kinds of issues detected.

```
[paul@myhost tests]$ ./validate_rules.py

Checking rule groups
cluster health : ..
mon : E.W..
osd : E...W......W.E..
mds : WW
mgr : WW
pgs : ..WWWW..
nodes : .EEEE
pools : EEEW.
healthchecks : .
cephadm : WW.
prometheus : W
rados : W

Summary

Rule file : ../alerts/ceph_default_alerts.yml
Unit Test file : test_alerts.yml

Rule groups processed : 12
Rules processed : 51
Rule errors : 10
Rule warnings : 16
Rule name duplicates : 0
Unit tests missing : 4

Problem Report

Group       Severity  Alert Name                                          Problem Description
-----       --------  ----------                                          -------------------
cephadm     Warning   Cluster upgrade has failed                          critical level alert is missing an SNMP oid entry
cephadm     Warning   A daemon managed by cephadm is down                 critical level alert is missing an SNMP oid entry
mds         Warning   Ceph Filesystem damage detected                     critical level alert is missing an SNMP oid entry
mds         Warning   Ceph Filesystem switched to READ ONLY               critical level alert is missing an SNMP oid entry
mgr         Warning   mgr module failure                                  critical level alert is missing an SNMP oid entry
mgr         Warning   mgr prometheus module is not active                 critical level alert is missing an SNMP oid entry
mon         Error     Monitor down, quorum is at risk                     documentation link error: #mon-downwah not found on the page
mon         Warning   Ceph mon disk space critically low                  critical level alert is missing an SNMP oid entry
nodes       Error     network packets dropped                             invalid alert structure. Missing field: for
nodes       Error     network packet errors                               invalid alert structure. Missing field: for
nodes       Error     storage filling up                                  invalid alert structure. Missing field: for
nodes       Error     MTU Mismatch                                        invalid alert structure. Missing field: for
osd         Error     10% OSDs down                                       invalid alert structure. Missing field: for
osd         Error     Flapping OSD                                        invalid alert structure. Missing field: for
osd         Warning   OSD Full                                            critical level alert is missing an SNMP oid entry
osd         Warning   Too many devices predicted to fail                  critical level alert is missing an SNMP oid entry
pgs         Warning   Placement Group (PG) damaged                        critical level alert is missing an SNMP oid entry
pgs         Warning   Recovery at risk, cluster too full                  critical level alert is missing an SNMP oid entry
pgs         Warning   I/O blocked to some data                            critical level alert is missing an SNMP oid entry
pgs         Warning   Cluster too full, automatic data recovery impaired  critical level alert is missing an SNMP oid entry
pools       Error     pool full                                           invalid alert structure. Missing field: for
pools       Error     pool filling up (growth forecast)                   invalid alert structure. Missing field: for
pools       Error     Ceph pool is too full for recovery/rebalance        invalid alert structure. Missing field: for
pools       Warning   Ceph pool is full - writes blocked                  critical level alert is missing an SNMP oid entry
prometheus  Warning   Scrape job is missing                               critical level alert is missing an SNMP oid entry
rados       Warning   Data not found/missing                              critical level alert is missing an SNMP oid entry

Unit tests are incomplete. Tests missing for the following alerts;
- Placement Group (PG) damaged
- OSD Full
- storage filling up
- pool filling up (growth forecast)

```
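
The per-group progress lines in the example run can be cross-checked against the Summary figures. The sketch below assumes, since the README does not say so explicitly, that `.` marks a rule with no problems, `W` a rule with a warning, and `E` a rule with an error. The `tally` helper is hypothetical and not part of `validate_rules.py`.

```python
from collections import Counter

# Progress lines copied from the example run in this README.
PROGRESS = """\
cluster health : ..
mon : E.W..
osd : E...W......W.E..
mds : WW
mgr : WW
pgs : ..WWWW..
nodes : .EEEE
pools : EEEW.
healthchecks : .
cephadm : WW.
prometheus : W
rados : W
"""


def tally(progress):
    """Count the status markers across all 'group : markers' lines."""
    counts = Counter()
    for line in progress.splitlines():
        _name, sep, marks = line.partition(":")
        if not sep:
            continue  # not a progress line
        counts.update(marks.strip())  # one marker character per rule
    return counts


totals = tally(PROGRESS)
print("rules   :", sum(totals.values()))
print("errors  :", totals["E"])
print("warnings:", totals["W"])
```

Under that assumption the tallies come to 51 rules, 10 errors, and 16 warnings, which agrees with the Summary section of the example run.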