        expr: ceph_health_status == 2
          oid: 1.3.6.1.4.1.50495.15.1.2.2.1
            Ceph has been in HEALTH_ERR state for more than 5 minutes.
            Please check "ceph health detail" for more information.
        expr: ceph_health_status == 1
          oid: 1.3.6.1.4.1.50495.15.1.2.2.2
            Ceph has been in HEALTH_WARN for more than 15 minutes.
            Please check "ceph health detail" for more information.
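      # Note (assumed metric semantics): the mgr prometheus module exports
      # ceph_health_status as a numeric gauge, 0 = HEALTH_OK, 1 = HEALTH_WARN,
      # 2 = HEALTH_ERR, hence the == 1 and == 2 matches in the health alerts above.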
      - alert: low monitor quorum count
        expr: sum(ceph_mon_quorum_status) < 3
          oid: 1.3.6.1.4.1.50495.15.1.2.3.1
            Monitor count in quorum is below three.
            Only {{ $value }} of {{ with query "count(ceph_mon_quorum_status)" }}{{ . | first | value }}{{ end }} monitors are active.
            The following monitors are down:
            {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }}
              - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
      - alert: 10% OSDs down
        expr: count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10
          oid: 1.3.6.1.4.1.50495.15.1.2.4.1
            {{ $value | humanize }}% or {{ with query "count(ceph_osd_up == 0)" }}{{ . | first | value }}{{ end }} of {{ with query "count(ceph_osd_up)" }}{{ . | first | value }}{{ end }} OSDs are down (≥ 10%).
            The following OSDs are down:
            {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
              - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
        expr: count(ceph_osd_up == 0) > 0
          oid: 1.3.6.1.4.1.50495.15.1.2.4.2
            {{ $s := "" }}{{ if gt $value 1.0 }}{{ $s = "s" }}{{ end }}
            {{ $value }} OSD{{ $s }} down for more than 15 minutes.
            {{ $value }} of {{ query "count(ceph_osd_up)" | first | value }} OSDs are down.
            The following OSD{{ $s }} {{ if eq $s "" }}is{{ else }}are{{ end }} down:
            {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
              - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
      - alert: OSDs near full
            ((ceph_osd_stat_bytes_used / ceph_osd_stat_bytes) and on(ceph_daemon) ceph_osd_up == 1)
            * on(ceph_daemon) group_left(hostname) ceph_osd_metadata
          oid: 1.3.6.1.4.1.50495.15.1.2.4.3
            OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} is
            dangerously full: {{ $value | humanize }}%
          rate(ceph_osd_up[5m])
          * on(ceph_daemon) group_left(hostname) ceph_osd_metadata
          oid: 1.3.6.1.4.1.50495.15.1.2.4.4
            OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} was
            marked down and back up {{ $value | humanize }} times a
            minute over the last 5 minutes.
      # alert on high deviation from average PG count
      - alert: high pg count deviation
            (ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)
          ) / on (job) group_left avg(ceph_osd_numpg > 0) by (job)
          ) * on(ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
          oid: 1.3.6.1.4.1.50495.15.1.2.4.5
            OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates
            by more than 30% from average PG count.
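      # Illustrative example of the deviation math: with a cluster-wide average
      # of 100 PGs per OSD, an OSD holding 135 PGs evaluates to
      # (135 - 100) / 100 = 0.35, which exceeds the 0.30 threshold and fires.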
      # alert on high commit latency... but how high is too high?
      # no mds metrics are exported yet
      # no mgr metrics are exported yet
      - alert: pgs inactive
        expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_active) > 0
          oid: 1.3.6.1.4.1.50495.15.1.2.7.1
            {{ $value }} PGs have been inactive for more than 5 minutes in pool {{ $labels.name }}.
            Inactive placement groups aren't able to serve read/write requests.
        expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_clean) > 0
          oid: 1.3.6.1.4.1.50495.15.1.2.7.2
            {{ $value }} PGs haven't been clean for more than 15 minutes in pool {{ $labels.name }}.
            Unclean PGs haven't been able to completely recover from a previous failure.
      - alert: root volume full
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 5
          oid: 1.3.6.1.4.1.50495.15.1.2.8.1
            Root volume (OSD and MON store) is dangerously full: {{ $value | humanize }}% free.
      # alert on NIC packet drop and error rates > 0.01% of packets or > 10 packets/s
      - alert: network packets dropped
            increase(node_network_receive_drop_total{device!="lo"}[1m]) +
            increase(node_network_transmit_drop_total{device!="lo"}[1m])
            increase(node_network_receive_packets_total{device!="lo"}[1m]) +
            increase(node_network_transmit_packets_total{device!="lo"}[1m])
            increase(node_network_receive_drop_total{device!="lo"}[1m]) +
            increase(node_network_transmit_drop_total{device!="lo"}[1m])
          oid: 1.3.6.1.4.1.50495.15.1.2.8.2
            Node {{ $labels.instance }} experiences packet drop > 0.01% or >
            10 packets/s on interface {{ $labels.device }}.
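      # The drop counters appear twice in the expression above: once as the
      # numerator of the drop-to-packet ratio (the "> 0.01%" condition) and once
      # on their own, presumably feeding the absolute "> 10 packets/s" guard, so
      # a near-idle link with a few stray drops does not fire.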
      - alert: network packet errors
            increase(node_network_receive_errs_total{device!="lo"}[1m]) +
            increase(node_network_transmit_errs_total{device!="lo"}[1m])
            increase(node_network_receive_packets_total{device!="lo"}[1m]) +
            increase(node_network_transmit_packets_total{device!="lo"}[1m])
            increase(node_network_receive_errs_total{device!="lo"}[1m]) +
            increase(node_network_transmit_errs_total{device!="lo"}[1m])
          oid: 1.3.6.1.4.1.50495.15.1.2.8.3
            Node {{ $labels.instance }} experiences packet errors > 0.01% or
            > 10 packets/s on interface {{ $labels.device }}.
      - alert: storage filling up
          predict_linear(node_filesystem_free_bytes[2d], 3600 * 24 * 5) *
          on(instance) group_left(nodename) node_uname_info < 0
          oid: 1.3.6.1.4.1.50495.15.1.2.8.4
            Mountpoint {{ $labels.mountpoint }} on {{ $labels.nodename }}
            will be full in less than 5 days assuming the average fill-up
            rate of the past 48 hours.
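      # predict_linear() fits a least-squares line to the last 2 days (2d) of
      # node_filesystem_free_bytes samples and extrapolates it 5 days ahead
      # (3600 * 24 * 5 seconds); a projected value below zero means the
      # filesystem is on track to run out of space within that window.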
      - alert: MTU Mismatch
        expr: node_network_mtu_bytes{device!="lo"} * (node_network_up{device!="lo"} > 0) != on() group_left() (quantile(0.5, node_network_mtu_bytes{device!="lo"}))
          oid: 1.3.6.1.4.1.50495.15.1.2.8.5
            Node {{ $labels.instance }} has a different MTU size ({{ $value }})
            than the median value on device {{ $labels.device }}.
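      # quantile(0.5, ...) in the expression above is the median MTU across all
      # non-loopback interfaces reporting the metric; any interface that is up
      # and whose MTU differs from that median triggers the alert.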
          ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail)
          * on(pool_id) group_right ceph_pool_metadata * 100 > 90
          oid: 1.3.6.1.4.1.50495.15.1.2.9.1
          description: Pool {{ $labels.name }} at {{ $value | humanize }}% capacity.
      - alert: pool filling up
            predict_linear(ceph_pool_stored[2d], 3600 * 24 * 5)
            >= ceph_pool_stored + ceph_pool_max_avail
          ) * on(pool_id) group_left(name) ceph_pool_metadata
          oid: 1.3.6.1.4.1.50495.15.1.2.9.2
            Pool {{ $labels.name }} will be full in less than 5 days
            assuming the average fill-up rate of the past 48 hours.
      - alert: Slow OSD Ops
        expr: ceph_healthcheck_slow_ops > 0
            {{ $value }} OSD requests are taking too long to process (osd_op_complaint_time exceeded).
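      # After editing, these rules can be syntax-checked with promtool
      # (the file name below is illustrative, not this file's actual name):
      #   promtool check rules ceph_default_alerts.yml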