Disaster recovery
=================

.. danger::

    The notes in this section are aimed at experts, making a best effort
    to recover what they can from damaged filesystems. These steps
    have the potential to make things worse as well as better. If you
    are unsure, do not proceed.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
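
If the export fails because of the corruption, a RADOS-level copy can be made
with the ordinary ``rados`` tool instead. The sketch below assumes that the
rank 0 journal is stored as objects whose names begin with ``200.`` in the
metadata pool; the pool name and backup directory are placeholders:

::

    # Copy each rank-0 journal object out of the metadata pool.
    mkdir -p journal-backup
    rados -p <metadata pool> ls | grep '^200\.' | while read obj; do
        rados -p <metadata pool> get "$obj" "journal-backup/$obj"
    done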


Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata we can like so:

::

    cephfs-journal-tool event recover_dentries summary

This command acts on MDS rank 0 by default; pass ``--rank=<n>`` to operate on other ranks.
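
For example, to run the same recovery against rank 1 of a multi-MDS filesystem
(assuming that rank exists), the invocation would be:

::

    cephfs-journal-tool --rank=1 event recover_dentries summary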

This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool journal reset
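
If you want to confirm the extent of the damage before discarding the journal,
it can first be examined non-destructively (an optional step, not part of the
reset itself):

::

    cephfs-journal-tool journal inspect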

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
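
For example, to reset only the inode table of rank 0 (the rank and table here
are chosen purely for illustration):

::

    cephfs-table-tool 0 reset inode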

MDS map reset
-------------

Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored:
as a result, it is possible for this to cause data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a two-phase process: first, *all* objects are scanned to calculate
size and mtime metadata for inodes; second, the first object of
every file is scanned to collect this metadata and inject it into
the metadata pool.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>

These commands may take a *very long* time if there are many
files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
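
One simple way to arrange this from a single shell is to launch the workers of
each phase as background jobs and ``wait`` for all of them before starting the
next phase. This is only a sketch; the data pool name is a placeholder as above:

::

    # Run all scan_extents workers, then block until every one has finished.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait

    # Only then run the scan_inodes workers.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait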

After completing the metadata recovery, you may want to run a cleanup
operation to delete the ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>

Finding files affected by lost data PGs
---------------------------------------

Losing a data PG may affect many files. Files are split into many objects,
so identifying which files are affected by loss of particular PGs requires
a full scan over all object IDs that may exist within the size of a file.
This type of scan may be useful for identifying which files require
restoring from a backup.

.. danger::

    This command does not repair any metadata, so when restoring files in
    this case you must *remove* the damaged file, and replace it in order
    to have a fresh inode. Do not overwrite damaged files in place.

If you know that objects have been lost from PGs, use the ``pg_files``
subcommand to scan for files that may have been damaged as a result:

::

    cephfs-data-scan pg_files <path> <pg id> [<pg id>...]

For example, if you have lost data from PGs 1.4 and 4.5, and you would like
to know which files under /home/bob might have been damaged:

::

    cephfs-data-scan pg_files /home/bob 1.4 4.5

The output will be a list of paths to potentially damaged files, one
per line.

Note that this command acts as a normal CephFS client to find all the
files in the filesystem and read their layouts, so the MDS must be
up and running.
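
For example, before running ``pg_files`` you can confirm that an active MDS is
available with:

::

    ceph mds stat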

Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing filesystem is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the filesystem metadata
into this new pool, leaving the old metadata in place. This could be used to
make a safer attempt at recovery since the existing metadata pool would not be
overwritten.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be deleted.
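
For reference, deleting the damaged pool once it is no longer referenced by any
filesystem is an ordinary pool removal. This is only a sketch; the pool name is
a placeholder and the monitors must permit pool deletion (``mon_allow_pool_delete``):

::

    ceph osd pool delete <damaged metadata pool> <damaged metadata pool> --yes-i-really-really-mean-it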

To begin this process, first create the fresh metadata pool and initialize
it with empty file system data structures:

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create recovery <pg-num> replicated <crush-ruleset-name>
    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
    ceph fs reset recovery-fs --yes-i-really-mean-it
    cephfs-table-tool recovery-fs:all reset session
    cephfs-table-tool recovery-fs:all reset snap
    cephfs-table-tool recovery-fs:all reset inode

Next, run the recovery toolset using the --alternate-pool argument to output
results to the alternate pool:

::

    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>

If the damaged filesystem contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
    cephfs-journal-tool --rank recovery-fs:0 journal reset --force

After recovery, some recovered directories will have incorrect link counts.
Ensure the parameter ``mds_debug_scatterstat`` is set to false (the default) to
prevent the MDS from checking the link counts, then run a forward scrub to
repair them. Ensure you have an MDS running and issue:

::

    ceph daemon mds.a scrub_path / recursive repair