.. _health-checks:

=============
Health checks
=============

Overview
========

There is a finite set of possible health messages that a Ceph cluster can
raise -- these are defined as *health checks* which have unique identifiers.

The identifier is a terse pseudo-human-readable (i.e., like a variable name)
string. It is intended to enable tools (such as UIs) to make sense of
health checks, and present them in a way that reflects their meaning.

This page lists the health checks that are raised by the monitor and manager
daemons. In addition to these, you may also see health checks that originate
from MDS daemons (see :ref:`cephfs-health-messages`), and health checks
that are defined by ceph-mgr python modules.

Definitions
===========

Monitor
-------

MON_DOWN
________

One or more monitor daemons is currently down. The cluster requires a
majority (more than 1/2) of the monitors in order to function. When
one or more monitors are down, clients may have a harder time forming
their initial connection to the cluster as they may need to try more
addresses before they reach an operating monitor.

The down monitor daemon should generally be restarted as soon as
possible to reduce the risk of a subsequent monitor failure leading to
a service outage.
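
As a minimal triage sketch (assuming a systemd-based deployment; the exact
unit name depends on how the cluster was deployed)::

    ceph health detail                      # identifies which mon(s) are down
    ceph mon stat                           # shows quorum and known monitors
    systemctl restart ceph-mon@<hostname>   # on the affected host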

MON_CLOCK_SKEW
______________

The clocks on the hosts running the ceph-mon monitor daemons are not
sufficiently well synchronized. This health alert is raised if the
cluster detects a clock skew greater than ``mon_clock_drift_allowed``.

This is best resolved by synchronizing the clocks using a tool like
``ntpd`` or ``chrony``.
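
To see the skew the monitors are currently observing, and to confirm the
time source on a host (the second command assumes chrony is the daemon in
use)::

    ceph time-sync-status     # per-monitor clock offsets as seen by the leader
    chronyc tracking          # on each mon host, if using chrony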

If it is impractical to keep the clocks closely synchronized, the
``mon_clock_drift_allowed`` threshold can also be increased, but this
value must stay significantly below the ``mon_lease`` interval in
order for the monitor cluster to function properly.

MON_MSGR2_NOT_ENABLED
_____________________

The ``ms_bind_msgr2`` option is enabled but one or more monitors is
not configured to bind to a v2 port in the cluster's monmap. This
means that features specific to the msgr2 protocol (e.g., encryption)
are not available on some or all connections.

In most cases this can be corrected by issuing the command::

    ceph mon enable-msgr2

That command will change any monitor configured for the old default
port 6789 to continue to listen for v1 connections on 6789 and also
listen for v2 connections on the new default port 3300.

If a monitor is configured to listen for v1 connections on a non-standard
port (not 6789), then the monmap will need to be modified manually.
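
The v1/v2 addresses each monitor is currently bound to can be inspected in
the monmap with::

    ceph mon dump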

MON_DISK_LOW
____________

One or more monitors is low on disk space. This alert triggers if the
available space on the file system storing the monitor database
(normally ``/var/lib/ceph/mon``), as a percentage, drops below
``mon_data_avail_warn`` (default: 30%).

This may indicate that some other process or user on the system is
filling up the same file system used by the monitor. It may also
indicate that the monitor's database is large (see ``MON_DISK_BIG``
below).

If space cannot be freed, the monitor's data directory may need to be
moved to another storage device or file system (while the monitor
daemon is not running, of course).
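
A quick sketch for identifying what is consuming space (assuming the
default data path)::

    df -h /var/lib/ceph/mon      # overall file system usage
    du -sh /var/lib/ceph/mon/*   # size of the monitor database itself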

MON_DISK_CRIT
_____________

One or more monitors is critically low on disk space. This alert
triggers if the available space on the file system storing the monitor
database (normally ``/var/lib/ceph/mon``), as a percentage, drops
below ``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above.

MON_DISK_BIG
____________

The database size for one or more monitors is very large. This alert
triggers if the size of the monitor's database is larger than
``mon_data_size_warn`` (default: 15 GiB).

A large database is unusual, but may not necessarily indicate a
problem. Monitor databases may grow in size when there are placement
groups that have not reached an ``active+clean`` state in a long time.

This may also indicate that the monitor's database is not properly
compacting, which has been observed with some older versions of
leveldb and rocksdb. Forcing a compaction with ``ceph daemon mon.<id>
compact`` may shrink the on-disk size.

This warning may also indicate that the monitor has a bug that is
preventing it from pruning the cluster metadata it stores. If the
problem persists, please report a bug.

The warning threshold may be adjusted with::

    ceph config set global mon_data_size_warn <size>


Manager
-------

MGR_DOWN
________

All manager daemons are currently down. The cluster should normally
have at least one running manager (``ceph-mgr``) daemon. If no
manager daemon is running, the cluster's ability to monitor itself will
be compromised, and parts of the management API will become
unavailable (for example, the dashboard will not work, and most CLI
commands that report metrics or runtime state will block). However,
the cluster will still be able to perform all IO operations and
recover from failures.

The down manager daemon should generally be restarted as soon as
possible to ensure that the cluster can be monitored (e.g., so that
the ``ceph -s`` information is up to date, and/or metrics can be
scraped by Prometheus).
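
A minimal restart sketch, assuming a systemd-based deployment (the unit
name depends on how the cluster was deployed)::

    systemctl status ceph-mgr@<hostname>    # check why the daemon stopped
    systemctl restart ceph-mgr@<hostname>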

MGR_MODULE_DEPENDENCY
_____________________

An enabled manager module is failing its dependency check. This health check
should come with an explanatory message from the module about the problem.

For example, a module might report that a required package is not installed:
install the required package and restart your manager daemons.

This health check is only applied to enabled modules. If a module is
not enabled, you can see whether it is reporting dependency issues in
the output of ``ceph mgr module ls``.


MGR_MODULE_ERROR
________________

A manager module has experienced an unexpected error. Typically,
this means an unhandled exception was raised from the module's ``serve``
function. The human-readable description of the error may be obscurely
worded if the exception did not provide a useful description of itself.

This health check may indicate a bug: please open a Ceph bug report if you
think you have encountered a bug.

If you believe the error is transient, you may restart your manager
daemon(s), or use ``ceph mgr fail`` on the active daemon to prompt
a failover to another daemon.


OSDs
----

OSD_DOWN
________

One or more OSDs are marked down. The ceph-osd daemon may have been
stopped, or peer OSDs may be unable to reach the OSD over the network.
Common causes include a stopped or crashed daemon, a down host, or a
network outage.

Verify the host is healthy, the daemon is started, and the network is
functioning. If the daemon has crashed, the daemon log file
(``/var/log/ceph/ceph-osd.*``) may contain debugging information.
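
A sketch of the typical triage steps (the systemd unit name assumes a
package-based deployment)::

    ceph health detail              # lists which OSDs are down
    ceph osd tree down              # shows down OSDs in the CRUSH hierarchy
    systemctl start ceph-osd@<id>   # on the affected host, if the daemon stopped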

OSD_<crush type>_DOWN
_____________________

(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN)

All the OSDs within a particular CRUSH subtree are marked down, for example
all OSDs on a host.

OSD_ORPHAN
__________

An OSD is referenced in the CRUSH map hierarchy but does not exist.

The OSD can be removed from the CRUSH hierarchy with::

    ceph osd crush rm osd.<id>

OSD_OUT_OF_ORDER_FULL
_____________________

The utilization thresholds for `nearfull`, `backfillfull`, `full`,
and/or `failsafe_full` are not ascending. In particular, we expect
`nearfull < backfillfull`, `backfillfull < full`, and `full <
failsafe_full`.

The thresholds can be adjusted with::

    ceph osd set-nearfull-ratio <ratio>
    ceph osd set-backfillfull-ratio <ratio>
    ceph osd set-full-ratio <ratio>


OSD_FULL
________

One or more OSDs has exceeded the `full` threshold and is preventing
the cluster from servicing writes.

Utilization by pool can be checked with::

    ceph df

The currently defined `full` ratio can be seen with::

    ceph osd dump | grep full_ratio

A short-term workaround to restore write availability is to raise the full
threshold by a small amount::

    ceph osd set-full-ratio <ratio>

New storage should be added to the cluster by deploying more OSDs, or
existing data should be deleted in order to free up space.

OSD_BACKFILLFULL
________________

One or more OSDs has exceeded the `backfillfull` threshold, which will
prevent data from rebalancing to this device. This is
an early warning that rebalancing may not be able to complete and that
the cluster is approaching full.

Utilization by pool can be checked with::

    ceph df

OSD_NEARFULL
____________

One or more OSDs has exceeded the `nearfull` threshold. This is an early
warning that the cluster is approaching full.

Utilization by pool can be checked with::

    ceph df

OSDMAP_FLAGS
____________

One or more cluster flags of interest has been set. These flags include:

* *full* - the cluster is flagged as full and cannot serve writes
* *pauserd*, *pausewr* - paused reads or writes
* *noup* - OSDs are not allowed to start
* *nodown* - OSD failure reports are being ignored, such that the
  monitors will not mark OSDs `down`
* *noin* - OSDs that were previously marked `out` will not be marked
  back `in` when they start
* *noout* - down OSDs will not automatically be marked out after the
  configured interval
* *nobackfill*, *norecover*, *norebalance* - recovery or data
  rebalancing is suspended
* *noscrub*, *nodeep-scrub* - scrubbing is disabled
* *notieragent* - cache tiering activity is suspended

With the exception of *full*, these flags can be set or cleared with::

    ceph osd set <flag>
    ceph osd unset <flag>
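
For example, to clear the *noout* flag after maintenance is complete::

    ceph osd unset noout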

OSD_FLAGS
_________

One or more OSDs or CRUSH {nodes, device classes} has a flag of interest set.
These flags include:

* *noup*: these OSDs are not allowed to start
* *nodown*: failure reports for these OSDs will be ignored
* *noin*: if these OSDs were previously marked `out` automatically
  after a failure, they will not be marked in when they start
* *noout*: if these OSDs are down they will not automatically be marked
  `out` after the configured interval

These flags can be set and cleared in batch with::

    ceph osd set-group <flags> <who>
    ceph osd unset-group <flags> <who>

For example::

    ceph osd set-group noup,noout osd.0 osd.1
    ceph osd unset-group noup,noout osd.0 osd.1
    ceph osd set-group noup,noout host-foo
    ceph osd unset-group noup,noout host-foo
    ceph osd set-group noup,noout class-hdd
    ceph osd unset-group noup,noout class-hdd

OLD_CRUSH_TUNABLES
__________________

The CRUSH map is using very old settings and should be updated. The
oldest tunables that can be used (i.e., the oldest client version that
can connect to the cluster) without triggering this health warning is
determined by the ``mon_crush_min_required_version`` config option.
See :ref:`crush-map-tunables` for more information.

OLD_CRUSH_STRAW_CALC_VERSION
____________________________

The CRUSH map is using an older, non-optimal method for calculating
intermediate weight values for ``straw`` buckets.

The CRUSH map should be updated to use the newer method
(``straw_calc_version=1``). See
:ref:`crush-map-tunables` for more information.

CACHE_POOL_NO_HIT_SET
_____________________

One or more cache pools is not configured with a *hit set* to track
utilization, which will prevent the tiering agent from identifying
cold objects to flush and evict from the cache.

Hit sets can be configured on the cache pool with::

    ceph osd pool set <poolname> hit_set_type <type>
    ceph osd pool set <poolname> hit_set_period <period-in-seconds>
    ceph osd pool set <poolname> hit_set_count <number-of-hitsets>
    ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate>

OSD_NO_SORTBITWISE
__________________

No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
been set.

The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
OSDs can start. You can safely set the flag with::

    ceph osd set sortbitwise

POOL_FULL
_________

One or more pools has reached its quota and is no longer allowing writes.

Pool quotas and utilization can be seen with::

    ceph df detail

You can either raise the pool quota with::

    ceph osd pool set-quota <poolname> max_objects <num-objects>
    ceph osd pool set-quota <poolname> max_bytes <num-bytes>

or delete some existing data to reduce utilization.

BLUEFS_SPILLOVER
________________

One or more OSDs that use the BlueStore backend have been allocated
`db` partitions (storage space for metadata, normally on a faster
device) but that space has filled, such that metadata has "spilled
over" onto the normal slow device. This isn't necessarily an error
condition or even unexpected, but if the administrator's expectation
was that all metadata would fit on the faster device, it indicates
that not enough space was provided.

This warning can be disabled on all OSDs with::

    ceph config set osd bluestore_warn_on_bluefs_spillover false

Alternatively, it can be disabled on a specific OSD with::

    ceph config set osd.123 bluestore_warn_on_bluefs_spillover false

To provide more metadata space, the OSD in question could be destroyed and
reprovisioned. This will involve data migration and recovery.

It may also be possible to expand the LVM logical volume backing the
`db` storage. If the underlying LV has been expanded, the OSD daemon
needs to be stopped and BlueFS informed of the device size change with::

    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID
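
As a sketch of that procedure (the volume group and logical volume names
here are hypothetical; substitute the LV actually backing the OSD's `db`
device)::

    systemctl stop ceph-osd@$ID
    lvextend -l +100%FREE ceph-db-vg/db-lv   # grow the hypothetical db LV
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID
    systemctl start ceph-osd@$ID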

BLUEFS_AVAILABLE_SPACE
______________________

To check how much space is free for BlueFS, do::

    ceph daemon osd.123 bluestore bluefs available

This will output up to three values: `BDEV_DB free`, `BDEV_SLOW free` and
`available_from_bluestore`. `BDEV_DB` and `BDEV_SLOW` report the amount of
space that has been acquired by BlueFS and is considered free. The value
`available_from_bluestore` denotes the ability of BlueStore to relinquish
more space to BlueFS. It is normal for this value to differ from the amount
of BlueStore free space, as the BlueFS allocation unit is typically larger
than the BlueStore allocation unit. This means that only part of the
BlueStore free space will be acceptable for BlueFS.

BLUEFS_LOW_SPACE
________________

If BlueFS is running low on available free space and there is little
`available_from_bluestore`, consider reducing the BlueFS allocation unit
size. To simulate available space when the allocation unit is different, do::

    ceph daemon osd.123 bluestore bluefs available <alloc-unit-size>

BLUESTORE_FRAGMENTATION
_______________________

As BlueStore operates, free space on the underlying storage becomes
fragmented. This is normal and unavoidable, but excessive fragmentation will
cause slowdown. To inspect BlueStore fragmentation, do::

    ceph daemon osd.123 bluestore allocator score block

The score is given in the [0-1] range:

* [0.0 .. 0.4] tiny fragmentation
* [0.4 .. 0.7] small, acceptable fragmentation
* [0.7 .. 0.9] considerable, but safe fragmentation
* [0.9 .. 1.0] severe fragmentation, may impact BlueFS's ability to get space from BlueStore

If a detailed report of free fragments is required, do::

    ceph daemon osd.123 bluestore allocator dump block

When the OSD process is not running, fragmentation can be inspected with
`ceph-bluestore-tool`. Get the fragmentation score::

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score

And dump detailed free chunks::

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump

BLUESTORE_LEGACY_STATFS
_______________________

In the Nautilus release, BlueStore tracks its internal usage
statistics on a per-pool granular basis, and one or more OSDs have
BlueStore volumes that were created prior to Nautilus. If *all* OSDs
are older than Nautilus, this just means that the per-pool metrics are
not available. However, if there is a mix of pre-Nautilus and
post-Nautilus OSDs, the cluster usage statistics reported by ``ceph
df`` will not be accurate.

The old OSDs can be updated to use the new usage tracking scheme by
stopping each OSD, running a repair operation, and then restarting it.
For example, if ``osd.123`` needs to be updated::

    systemctl stop ceph-osd@123
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
    systemctl start ceph-osd@123

This warning can be disabled with::

    ceph config set global bluestore_warn_on_legacy_statfs false

BLUESTORE_NO_PER_POOL_OMAP
__________________________

Starting with the Octopus release, BlueStore tracks omap space utilization
by pool, and one or more OSDs have volumes that were created prior to
Octopus. Until all OSDs are running BlueStore with the new tracking
enabled, the cluster will report an approximate value for per-pool omap
usage based on the most recent deep scrub.

The old OSDs can be updated to track by pool by stopping each OSD,
running a repair operation, and then restarting it. For example, if
``osd.123`` needs to be updated::

    systemctl stop ceph-osd@123
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
    systemctl start ceph-osd@123

This warning can be disabled with::

    ceph config set global bluestore_warn_on_no_per_pool_omap false


BLUESTORE_DISK_SIZE_MISMATCH
____________________________

One or more OSDs using BlueStore has an internal inconsistency between the
size of the physical device and the metadata tracking its size. This can
lead to the OSD crashing in the future.

The OSDs in question should be destroyed and reprovisioned. Care should be
taken to do this one OSD at a time, and in a way that doesn't put any data
at risk. For example, if osd ``$N`` has the error::

    ceph osd out osd.$N
    while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
    ceph osd destroy osd.$N
    ceph-volume lvm zap /path/to/device
    ceph-volume lvm create --osd-id $N --data /path/to/device

BLUESTORE_NO_COMPRESSION
________________________

One or more OSDs is unable to load a BlueStore compression plugin.
This can be caused by a broken installation, in which the ``ceph-osd``
binary does not match the compression plugins, or a recent upgrade
that did not include a restart of the ``ceph-osd`` daemon.

Verify that the package(s) on the host running the OSD(s) in question
are correctly installed and that the OSD daemon(s) have been
restarted. If the problem persists, check the OSD log for any clues
as to the source of the problem.


Device health
-------------

DEVICE_HEALTH
_____________

One or more devices is expected to fail soon, where the warning
threshold is controlled by the ``mgr/devicehealth/warn_threshold``
config option.

This warning only applies to OSDs that are currently marked "in", so
the expected response to this failure is to mark the device "out" so
that data is migrated off of the device, and then to remove the
hardware from the system. Note that the marking out is normally done
automatically if ``mgr/devicehealth/self_heal`` is enabled based on
the ``mgr/devicehealth/mark_out_threshold``.

Device health can be checked with::

    ceph device info <device-id>
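
All known devices, along with the daemons consuming them, can be listed
with::

    ceph device ls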

Device life expectancy is set by a prediction model run by
the mgr or by an external tool via the command::

    ceph device set-life-expectancy <device-id> <from> <to>

You can change the stored life expectancy manually, but that usually
doesn't accomplish anything as whatever tool originally set it will
probably set it again, and changing the stored value does not affect
the actual health of the hardware device.

DEVICE_HEALTH_IN_USE
____________________

One or more devices is expected to fail soon and has been marked "out"
of the cluster based on ``mgr/devicehealth/mark_out_threshold``, but it
is still participating in one or more PGs. This may be because it was
only recently marked "out" and data is still migrating, or because data
cannot be migrated off for some reason (e.g., the cluster is nearly
full, or the CRUSH hierarchy is such that there isn't another suitable
OSD to migrate the data to).

This message can be silenced by disabling the self heal behavior
(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
``mgr/devicehealth/mark_out_threshold``, or by addressing whatever is
preventing data from being migrated off of the ailing device.
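
For example, a sketch of adjusting those module options (the threshold is
given in seconds here; 2419200 is four weeks)::

    ceph config set mgr mgr/devicehealth/self_heal false
    ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200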

DEVICE_HEALTH_TOOMANY
_____________________

Too many devices are expected to fail soon, and the
``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
out all of the ailing devices would exceed the cluster's
``mon_osd_min_in_ratio`` ratio, which prevents too many OSDs from being
automatically marked "out".

This generally indicates that too many devices in your cluster are
expected to fail soon and you should take action to add newer
(healthier) devices before too many devices fail and data is lost.

The health message can also be silenced by adjusting parameters like
``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
but be warned that this will increase the likelihood of unrecoverable
data loss in the cluster.


Data health (pools & placement groups)
--------------------------------------

PG_AVAILABILITY
_______________

Data availability is reduced, meaning that the cluster is unable to
service potential read or write requests for some data in the cluster.
Specifically, one or more PGs is in a state that does not allow IO
requests to be serviced. Problematic PG states include *peering*,
*stale*, *incomplete*, and the lack of *active* (if those conditions do
not clear quickly).

Detailed information about which PGs are affected is available from::

    ceph health detail

In most cases the root cause is that one or more OSDs is currently
down; see the discussion for ``OSD_DOWN`` above.

The state of specific problematic PGs can be queried with::

    ceph tell <pgid> query

PG_DEGRADED
___________

Data redundancy is reduced for some data, meaning the cluster does not
have the desired number of replicas for all data (for replicated
pools) or erasure code fragments (for erasure coded pools).
Specifically, one or more PGs:

* has the *degraded* or *undersized* flag set, meaning there are not
  enough instances of that placement group in the cluster;
* has not had the *clean* flag set for some time.

Detailed information about which PGs are affected is available from::

    ceph health detail

In most cases the root cause is that one or more OSDs is currently
down; see the discussion for ``OSD_DOWN`` above.

The state of specific problematic PGs can be queried with::

    ceph tell <pgid> query
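
As a convenience, the affected PGs can also be listed directly by state::

    ceph pg ls degraded undersized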

PG_RECOVERY_FULL
________________

Data redundancy may be reduced or at risk for some data due to a lack
of free space in the cluster. Specifically, one or more PGs has the
*recovery_toofull* flag set, meaning that the cluster is unable to
migrate or recover data because one or more OSDs is above the *full*
threshold.

See the discussion for *OSD_FULL* above for steps to resolve this condition.

PG_BACKFILL_FULL
________________

Data redundancy may be reduced or at risk for some data due to a lack
of free space in the cluster. Specifically, one or more PGs has the
*backfill_toofull* flag set, meaning that the cluster is unable to
migrate or recover data because one or more OSDs is above the
*backfillfull* threshold.

See the discussion for *OSD_BACKFILLFULL* above for
steps to resolve this condition.

PG_DAMAGED
__________

Data scrubbing has discovered some problems with data consistency in
the cluster. Specifically, one or more PGs has the *inconsistent* or
*snaptrim_error* flag set, indicating an earlier scrub operation
found a problem, or has the *repair* flag set, meaning a repair
for such an inconsistency is currently in progress.

See :doc:`pg-repair` for more information.

OSD_SCRUB_ERRORS
________________

Recent OSD scrubs have uncovered inconsistencies. This error is generally
paired with *PG_DAMAGED* (see above).

See :doc:`pg-repair` for more information.

LARGE_OMAP_OBJECTS
__________________

One or more pools contain large omap objects as determined by
``osd_deep_scrub_large_omap_object_key_threshold`` (the threshold for the
number of keys that determines a large omap object) or
``osd_deep_scrub_large_omap_object_value_sum_threshold`` (the threshold for
the summed size (bytes) of all key values that determines a large omap
object), or both. More information on the object name, key count, and size
in bytes can be found by searching the cluster log for 'Large omap object
found'. Large omap objects can be caused by RGW bucket index objects that
do not have automatic resharding enabled. Please see :ref:`RGW Dynamic
Bucket Index Resharding <rgw_dynamic_bucket_index_resharding>` for more
information on resharding.

The thresholds can be adjusted with::

    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold <keys>
    ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold <bytes>
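
A sketch for locating the offending objects in the cluster log (the log
path assumes a default installation, on a monitor host)::

    grep 'Large omap object found' /var/log/ceph/ceph.log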

CACHE_POOL_NEAR_FULL
____________________

A cache tier pool is nearly full. Full in this context is determined
by the ``target_max_bytes`` and ``target_max_objects`` properties on
the cache pool. Once the pool reaches the target threshold, write
requests to the pool may block while data is flushed and evicted
from the cache, a state that normally leads to very high latencies and
poor performance.

The cache pool target size can be adjusted with::

    ceph osd pool set <cache-pool-name> target_max_bytes <bytes>
    ceph osd pool set <cache-pool-name> target_max_objects <objects>

Normal cache flush and evict activity may also be throttled due to reduced
availability or performance of the base tier, or overall cluster load.

TOO_FEW_PGS
___________

The number of PGs in use in the cluster is below the configurable
threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead
to suboptimal distribution and balance of data across the OSDs in
the cluster, and similarly reduce overall performance.

This may be an expected condition if data pools have not yet been
created.

The PG count for existing pools can be increased, or new pools can be
created. Please refer to :ref:`choosing-number-of-placement-groups` for
more information.

POOL_PG_NUM_NOT_POWER_OF_TWO
____________________________

One or more pools has a ``pg_num`` value that is not a power of two.
Although this is not strictly incorrect, it does lead to a less
balanced distribution of data because some PGs have roughly twice as
much data as others.

This is easily corrected by setting the ``pg_num`` value for the
affected pool(s) to a nearby power of two::

    ceph osd pool set <pool-name> pg_num <value>

This health warning can be disabled with::

    ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false

POOL_TOO_FEW_PGS
________________

One or more pools should probably have more PGs, based on the amount
of data that is currently stored in the pool. This can lead to
suboptimal distribution and balance of data across the OSDs in the
cluster, and similarly reduce overall performance. This warning is
generated if the ``pg_autoscale_mode`` property on the pool is set to
``warn``.

To disable the warning, you can disable auto-scaling of PGs for the
pool entirely with::

    ceph osd pool set <pool-name> pg_autoscale_mode off

To allow the cluster to automatically adjust the number of PGs::

    ceph osd pool set <pool-name> pg_autoscale_mode on

You can also manually set the number of PGs for the pool to the
recommended amount with::

    ceph osd pool set <pool-name> pg_num <new-pg-num>

Please refer to :ref:`choosing-number-of-placement-groups` and
:ref:`pg-autoscaler` for more information.

TOO_MANY_PGS
____________

The number of PGs in use in the cluster is above the configurable
threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is
exceeded, the cluster will not allow new pools to be created, pool `pg_num`
to be increased, or pool replication to be increased (any of which would
lead to more PGs in the cluster). A large number of PGs can lead
to higher memory utilization for OSD daemons, slower peering after
cluster state changes (like OSD restarts, additions, or removals), and
higher load on the Manager and Monitor daemons.

The simplest way to mitigate the problem is to increase the number of
OSDs in the cluster by adding more hardware. Note that the OSD count
used for the purposes of this health check is the number of "in" OSDs,
so marking "out" OSDs "in" (if there are any) can also help::

    ceph osd in <osd id(s)>

Please refer to :ref:`choosing-number-of-placement-groups` for more
information.

POOL_TOO_MANY_PGS
_________________

One or more pools should probably have fewer PGs, based on the amount
of data that is currently stored in the pool. This can lead to higher
memory utilization for OSD daemons, slower peering after cluster state
changes (like OSD restarts, additions, or removals), and higher load
on the Manager and Monitor daemons. This warning is generated if the
``pg_autoscale_mode`` property on the pool is set to ``warn``.

To disable the warning, you can disable auto-scaling of PGs for the
pool entirely with::

    ceph osd pool set <pool-name> pg_autoscale_mode off

To allow the cluster to automatically adjust the number of PGs::

    ceph osd pool set <pool-name> pg_autoscale_mode on

You can also manually set the number of PGs for the pool to the
recommended amount with::

    ceph osd pool set <pool-name> pg_num <new-pg-num>

Please refer to :ref:`choosing-number-of-placement-groups` and
:ref:`pg-autoscaler` for more information.

POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
____________________________________

One or more pools have a ``target_size_bytes`` property set to
estimate the expected size of the pool,
but the value(s) exceed the total available storage (either by
themselves or in combination with other pools' actual usage).

This is usually an indication that the ``target_size_bytes`` value for
the pool is too large and should be reduced or set to zero with::

    ceph osd pool set <pool-name> target_size_bytes 0

For more information, see :ref:`specifying_pool_target_size`.

POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO
____________________________________

One or more pools have both ``target_size_bytes`` and
``target_size_ratio`` set to estimate the expected size of the pool.
Only one of these properties should be non-zero. If both are set,
``target_size_ratio`` takes precedence and ``target_size_bytes`` is
ignored.

To reset ``target_size_bytes`` to zero::

    ceph osd pool set <pool-name> target_size_bytes 0

For more information, see :ref:`specifying_pool_target_size`.

TOO_FEW_OSDS
____________

The number of OSDs in the cluster is below the configurable
threshold of ``osd_pool_default_size``.

SMALLER_PGP_NUM
_______________

One or more pools has a ``pgp_num`` value less than ``pg_num``. This
is normally an indication that the PG count was increased without
also increasing ``pgp_num``, which controls placement.

This is sometimes done deliberately to separate out the `split` step
when the PG count is adjusted from the data migration that is needed
when ``pgp_num`` is changed.

This is normally resolved by setting ``pgp_num`` to match ``pg_num``,
triggering the data migration, with::

    ceph osd pool set <pool> pgp_num <pg-num-value>

MANY_OBJECTS_PER_PG
___________________

One or more pools has an average number of objects per PG that is
significantly higher than the overall cluster average. The specific
threshold is controlled by the ``mon_pg_warn_max_object_skew``
configuration value.

This is usually an indication that the pool(s) containing most of the
data in the cluster have too few PGs, and/or that other pools that do
not contain as much data have too many PGs. See the discussion of
*TOO_MANY_PGS* above.

The threshold can be raised to silence the health warning by adjusting
the ``mon_pg_warn_max_object_skew`` config option on the managers.


POOL_APP_NOT_ENABLED
____________________

A pool exists that contains one or more objects but has not been
tagged for use by a particular application.

Resolve this warning by labeling the pool for use by an application. For
example, if the pool is used by RBD::

    rbd pool init <poolname>

If the pool is being used by a custom application 'foo', you can also label
it via the low-level command::

    ceph osd pool application enable <poolname> foo

For more information, see :ref:`associate-pool-to-application`.

POOL_FULL
_________

One or more pools has reached (or is very close to reaching) its
quota. The threshold to trigger this error condition is controlled by
the ``mon_pool_quota_crit_threshold`` configuration option.

Pool quotas can be adjusted up or down (or removed) with::

    ceph osd pool set-quota <pool> max_bytes <bytes>
    ceph osd pool set-quota <pool> max_objects <objects>

Setting the quota value to 0 will disable the quota.

POOL_NEAR_FULL
______________

One or more pools is approaching its quota. The threshold to trigger
this warning condition is controlled by the
``mon_pool_quota_warn_threshold`` configuration option.

Pool quotas can be adjusted up or down (or removed) with::

    ceph osd pool set-quota <pool> max_bytes <bytes>
    ceph osd pool set-quota <pool> max_objects <objects>

Setting the quota value to 0 will disable the quota.

OBJECT_MISPLACED
________________

One or more objects in the cluster is not stored on the node the
cluster would like it to be stored on. This is an indication that
data migration due to some recent cluster change has not yet completed.

Misplaced data is not a dangerous condition in and of itself; data
consistency is never at risk, and old copies of objects are never
removed until the desired number of new copies (in the desired
locations) are present.

OBJECT_UNFOUND
______________

One or more objects in the cluster cannot be found. Specifically, the
OSDs know that a new or updated copy of an object should exist, but a
copy of that version of the object has not been found on OSDs that are
currently online.

Read or write requests to unfound objects will block.

Ideally, a down OSD that has the more recent copy of the unfound object
can be brought back online. Candidate OSDs can be identified from the
peering state for the PG(s) responsible for the unfound object::

    ceph tell <pgid> query

If the latest copy of the object is not available, the cluster can be
told to roll back to a previous version of the object. See
:ref:`failures-osd-unfound` for more information.
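
As a last resort, and as detailed in that reference, the unfound objects in
a PG can be reverted to their previous version or discarded entirely (the
*delete* form permanently loses the newer data)::

    ceph pg <pgid> mark_unfound_lost revert|delete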

SLOW_OPS
________

One or more OSD requests is taking a long time to process. This can
be an indication of extreme load, a slow storage device, or a software
bug.

The request queue on the OSD(s) in question can be queried with the
following command, executed from the OSD host::

    ceph daemon osd.<id> ops

A summary of the slowest recent requests can be seen with::

    ceph daemon osd.<id> dump_historic_ops

The location of an OSD can be found with::

    ceph osd find osd.<id>

PG_NOT_SCRUBBED
_______________

One or more PGs has not been scrubbed recently. PGs are normally
scrubbed every ``osd_scrub_max_interval`` seconds, and this warning
triggers when ``mon_warn_pg_not_scrubbed_ratio`` of that interval has
elapsed past the due time without a scrub having occurred.

PGs will not scrub if they are not flagged as *clean*, which may
happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
*PG_DEGRADED* above).

You can manually initiate a scrub of a clean PG with::

    ceph pg scrub <pgid>

PG_NOT_DEEP_SCRUBBED
____________________

One or more PGs has not been deep scrubbed recently. PGs are normally
deep scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning
triggers when ``mon_warn_pg_not_deep_scrubbed_ratio`` of that interval has
elapsed past the due time without a deep scrub having occurred.

PGs will not (deep) scrub if they are not flagged as *clean*, which may
happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
*PG_DEGRADED* above).

You can manually initiate a deep scrub of a clean PG with::

    ceph pg deep-scrub <pgid>


PG_SLOW_SNAP_TRIMMING
_____________________

The snapshot trim queue for one or more PGs has exceeded the
configured warning threshold. This indicates that either an extremely
large number of snapshots were recently deleted, or that the OSDs are
unable to trim snapshots quickly enough to keep up with the rate of
new snapshot deletions.

The warning threshold is controlled by the
``mon_osd_snap_trim_queue_warn_on`` option (default: 32768).

This warning may trigger if OSDs are under excessive load and unable
to keep up with their background work, or if the OSDs' internal
metadata database is heavily fragmented and unable to perform. It may
also indicate some other performance issue with the OSDs.

The exact size of the snapshot trim queue is reported by the
``snaptrimq_len`` field of ``ceph pg ls -f json-detail``.
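
As a minimal sketch for pulling that field out of the JSON (this assumes
``jq`` is available; the exact output layout may vary between releases)::

    ceph pg ls -f json-detail | jq '.pg_stats[] | {pgid, snaptrimq_len}'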


Miscellaneous
-------------

RECENT_CRASH
____________

One or more Ceph daemons has crashed recently, and the crash has not
yet been archived (acknowledged) by the administrator. This may
indicate a software bug, a hardware problem (e.g., a failing disk), or
some other problem.

New crashes can be listed with::

    ceph crash ls-new

Information about a specific crash can be examined with::

    ceph crash info <crash-id>

This warning can be silenced by "archiving" the crash (perhaps after
being examined by an administrator) so that it does not generate this
warning::

    ceph crash archive <crash-id>

Similarly, all new crashes can be archived with::

    ceph crash archive-all

Archived crashes will still be visible via ``ceph crash ls`` but not
``ceph crash ls-new``.

The time period for what "recent" means is controlled by the option
``mgr/crash/warn_recent_interval`` (default: two weeks).

These warnings can be disabled entirely with::

    ceph config set mgr mgr/crash/warn_recent_interval 0

TELEMETRY_CHANGED
_________________

Telemetry has been enabled, but the contents of the telemetry report
have changed since that time, so telemetry reports will not be sent.

The Ceph developers periodically revise the telemetry feature to
include new and useful information, or to remove information found to
be useless or sensitive. If any new information is included in the
report, Ceph will require the administrator to re-enable telemetry to
ensure they have an opportunity to (re)review what information will be
shared.

To review the contents of the telemetry report::

    ceph telemetry show

Note that the telemetry report consists of several optional channels
that may be independently enabled or disabled. For more information, see
:ref:`telemetry`.

To re-enable telemetry (and make this warning go away)::

    ceph telemetry on

To disable telemetry (and make this warning go away)::

    ceph telemetry off

AUTH_BAD_CAPS
_____________

One or more auth users has capabilities that cannot be parsed by the
monitor. This generally indicates that the user will not be
authorized to perform any action with one or more daemon types.

This error is most likely to occur after an upgrade if the
capabilities were set with an older version of Ceph that did not
properly validate their syntax, or if the syntax of the capabilities
has changed.

The user in question can be removed with::

    ceph auth rm <entity-name>

(This will resolve the health alert, but obviously clients will not be
able to authenticate as that user.)

Alternatively, the capabilities for the user can be updated with::

    ceph auth caps <entity-name> <daemon-type> <caps> [<daemon-type> <caps> ...]
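
The capabilities currently recorded for a user can be inspected with::

    ceph auth get <entity-name>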

For more information about auth capabilities, see :ref:`user-management`.


OSD_NO_DOWN_OUT_INTERVAL
________________________

The ``mon_osd_down_out_interval`` option is set to zero, which means
that the system will not automatically perform any repair or healing
operations after an OSD fails. Instead, an administrator (or some
other external entity) will need to manually mark down OSDs as 'out'
(i.e., via ``ceph osd out <osd-id>``) in order to trigger recovery.

This option is normally set to five or ten minutes--enough time for a
host to power-cycle or reboot.
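
A sketch of restoring the conventional behavior (600 seconds is the usual
default)::

    ceph config set mon mon_osd_down_out_interval 600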

This warning can be silenced by setting
``mon_warn_on_osd_down_out_interval_zero`` to false::

    ceph config set global mon_warn_on_osd_down_out_interval_zero false