Crash Module
============
The crash module collects information about daemon crashdumps and stores
it in the Ceph cluster for later analysis.

Daemon crashdumps are written to ``/var/lib/ceph/crash`` by default; this can
be configured with the option ``crash dir``.  Each crash directory is named
by the time and date plus a randomly generated UUID, and contains a metadata
file ``meta`` and a recent log file.  The metadata carries a ``crash_id``
that is the same as the directory name.
This module allows the metadata about those dumps to be persisted in
the monitors' storage.
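
The on-disk layout described above can be inspected with a short script. The
following is a minimal sketch, not part of the crash module: the helper names
``read_crash_meta`` and ``list_crash_metas`` are hypothetical, and the check
assumes that the ``crash_id`` stored in ``meta`` equals the crash directory
name.

```python
import json
from pathlib import Path


def read_crash_meta(crash_dir):
    """Load the ``meta`` JSON from one crash directory.

    Raises ValueError if the recorded crash_id differs from the
    directory name (an assumption about the layout, see above).
    """
    crash_dir = Path(crash_dir)
    with open(crash_dir / "meta") as f:
        meta = json.load(f)
    if meta.get("crash_id") != crash_dir.name:
        raise ValueError(
            f"crash_id {meta.get('crash_id')!r} does not match "
            f"directory name {crash_dir.name!r}"
        )
    return meta


def list_crash_metas(base="/var/lib/ceph/crash"):
    """Yield (crash_id, meta) for each subdirectory that holds a meta file."""
    for entry in sorted(Path(base).iterdir()):
        # Subdirectories without a ``meta`` file are skipped.
        if (entry / "meta").is_file():
            meta = read_crash_meta(entry)
            yield meta["crash_id"], meta
```

Reading ``/var/lib/ceph/crash`` typically requires elevated privileges, so a
script like this would normally run as root or as the ``ceph`` user.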

Enabling
--------

The *crash* module is enabled with::

  ceph mgr module enable crash

Commands
--------
::

  ceph crash post -i <metafile>

Save a crash dump.  The metadata file is a JSON blob stored in the crash
dir as ``meta``.  As usual, the ceph command can be invoked with ``-i -``,
and will read from stdin.
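
As an illustration of the ``-i -`` form, the sketch below pipes a metadata
dictionary to ``ceph crash post`` on stdin. The function name ``post_crash``
and its ``run`` parameter are hypothetical; ``run`` exists only so that the
call can be exercised without a live cluster. It assumes the ``ceph`` binary
is on ``PATH`` and that the caller has a keyring permitted to post crashes.

```python
import json
import subprocess


def post_crash(meta, run=subprocess.run):
    """Pipe a crash metadata dict to ``ceph crash post -i -``.

    ``run`` defaults to subprocess.run; a stand-in can be passed
    instead when no cluster is available.
    """
    cmd = ["ceph", "crash", "post", "-i", "-"]
    # check=True makes subprocess.run raise CalledProcessError on failure.
    return run(cmd, input=json.dumps(meta).encode(), check=True)
```

Passing the JSON on stdin avoids writing a temporary file when the metadata
has already been loaded or modified in memory.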

::

  ceph crash rm <crashid>

Remove a specific crash dump.

::

  ceph crash ls

List the timestamp/uuid crashids for all new and archived crash info.

::

  ceph crash ls-new

List the timestamp/uuid crashids for all new crash info.

::

  ceph crash stat

Show a summary of saved crash info grouped by age.

::

  ceph crash info <crashid>

Show all details of a saved crash.

::

  ceph crash prune <keep>

Remove saved crashes older than ``<keep>`` days.  ``<keep>`` must be an integer.
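
The age rule behind ``prune`` can be sketched as follows. This is an
illustration of the selection logic only, not the module's own code:
``crashes_to_prune`` is a hypothetical helper, and it assumes each crash is
already paired with a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone


def crashes_to_prune(crashes, keep_days, now=None):
    """Return the crash IDs whose timestamp is more than keep_days old.

    ``crashes`` maps crash ID -> timezone-aware datetime.
    """
    # Mirror the documented constraint that <keep> is an integer.
    if not isinstance(keep_days, int) or isinstance(keep_days, bool):
        raise TypeError("keep must be an integer number of days")
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    return sorted(cid for cid, ts in crashes.items() if ts < cutoff)
```

With ``keep_days=7``, for example, a crash recorded 30 days ago is selected
while one recorded yesterday is kept.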

::

  ceph crash archive <crashid>

Archive a crash report so that it is no longer considered for the ``RECENT_CRASH`` health check and does not appear in the ``crash ls-new`` output (it will still appear in the ``crash ls`` output).

::

  ceph crash archive-all

Archive all new crash reports.


Options
-------

* ``mgr/crash/warn_recent_interval`` [default: 2 weeks] controls what constitutes "recent" for the purposes of raising the ``RECENT_CRASH`` health warning.
* ``mgr/crash/retain_interval`` [default: 1 year] controls how long crash reports are retained by the cluster before they are automatically purged.
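
These options can be changed with ``ceph config set``. The sketch below
builds the command line for a given interval; it assumes the values are
expressed in seconds, and ``config_set_command`` is a hypothetical helper
name, not part of Ceph.

```python
from datetime import timedelta


def config_set_command(option, interval):
    """Format a ``ceph config set`` command for a crash module option.

    ``interval`` is a timedelta; it is converted to whole seconds
    (assumed to be the unit the module expects).
    """
    seconds = int(interval.total_seconds())
    return f"ceph config set mgr mgr/crash/{option} {seconds}"


# One week for the warning window, 180 days of retention.
print(config_set_command("warn_recent_interval", timedelta(weeks=1)))
print(config_set_command("retain_interval", timedelta(days=180)))
```

Printing the command rather than running it leaves the decision to apply
the change with the operator.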