.. _disaster-recovery-experts:

Advanced: Metadata repair tools
===============================

.. warning::

    If you do not have expert knowledge of CephFS internals, you will
    need to seek assistance before using any of these tools.

    The tools mentioned here can easily cause damage as well as fix it.

    It is essential to understand exactly what has gone wrong with your
    file system before attempting to repair it.

    If you do not have access to professional support for your cluster,
    consult the ceph-users mailing list or the #ceph IRC channel.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
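
One way to make such a RADOS-level copy is to export the journal objects
directly from the metadata pool with the ``rados`` tool. This is only a
minimal sketch, assuming rank 0 (whose journal objects are named
``200.<offset>``) and a metadata pool named ``cephfs_metadata``; adjust both
to match your cluster:

::

    # Copy each rank-0 journal object out of the metadata pool into the
    # current directory (one file per object).
    for obj in $(rados -p cephfs_metadata ls | grep '^200\.'); do
        rados -p cephfs_metadata get "$obj" "$obj"
    done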


Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata you can like so:

::

    cephfs-journal-tool event recover_dentries summary

This command acts on MDS rank 0 by default; pass ``--rank=<fs_name>:<rank>`` to operate on other ranks.
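
For example, to recover dentries from the journal of rank 1 (assuming a
hypothetical file system named ``mycephfs``), the invocation would look like:

::

    cephfs-journal-tool --rank=mycephfs:1 event recover_dentries summary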

This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool [--rank=<fs_name>:<rank>] journal reset

Specify the MDS rank using the ``--rank`` option when the file system has/had
multiple active MDS daemons.

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
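
For example, to reset only rank 0's session table, or to reset the snap table
across all ranks:

::

    cephfs-table-tool 0 reset session
    cephfs-table-tool all reset snap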

MDS map reset
-------------

Once the in-RADOS state of the file system (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored;
consequently, running this command can result in data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=<fs_name>:0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a three-phase process: first, scan *all* objects to calculate
size and mtime metadata for inodes; second, scan the first
object from every file to collect this metadata and inject it into
the metadata pool; third, check inode linkages and fix any
errors found.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The 'scan_extents' and 'scan_inodes' commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
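
One way to run the workers concurrently while still respecting that ordering
is with shell background jobs, waiting for every ``scan_extents`` worker to
finish before launching ``scan_inodes``. This is only a sketch (4 workers,
with ``<data pool>`` as the placeholder pool name):

::

    # Launch all scan_extents workers in the background, then wait for
    # every one of them to complete before starting the next phase.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait

    # Only now is it safe to begin the scan_inodes phase.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait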

After completing the metadata recovery, you may want to run a cleanup
operation to delete ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>



Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing file system is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the file system metadata into
this new pool, leaving the old metadata in place. This could be used to make a
safer attempt at recovery since the existing metadata pool would not be
modified.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be archived or deleted.

To begin, the existing file system should be taken down, if not done already,
to prevent further modification of the data pool. Unmount all clients and then
mark the file system failed:

::

    ceph fs fail <fs_name>
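
The client unmount step depends on how each client was mounted. As a rough
sketch, assuming a hypothetical mount point ``/mnt/cephfs`` on each client
host:

::

    # Kernel client mount
    umount /mnt/cephfs
    # ceph-fuse mount
    fusermount -u /mnt/cephfs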

Next, create a recovery file system in which we will populate a new metadata pool
backed by the original data pool.

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create cephfs_recovery_meta
    ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --allow-dangerous-metadata-overlay


The recovery file system starts with an MDS rank that will initialize the new
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down as we do not want it interacting with
the metadata pool further.

::

    ceph fs fail cephfs_recovery

Next, we will reset the initial metadata the MDS created:

::

    cephfs-table-tool cephfs_recovery:all reset session
    cephfs-table-tool cephfs_recovery:all reset snap
    cephfs-table-tool cephfs_recovery:all reset inode

Now perform the recovery of the metadata pool from the data pool:

::

    cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
    cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
    cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
    cephfs-data-scan scan_links --filesystem cephfs_recovery

.. note::

    Each scan procedure above goes through the entire data pool. This may take a
    significant amount of time. See the previous section on how to distribute
    this task among workers.

If the damaged file system contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
    cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics.
Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are
set to false (the default) to prevent the MDS from checking the statistics:

::

    ceph config rm mds mds_verify_scatter
    ceph config rm mds mds_debug_scatterstat

(Note that the config may also have been set globally or via a ``ceph.conf`` file.)
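
If the options were set globally rather than for the ``mds`` section, the
corresponding removals would be (a sketch; adjust the target to wherever the
options were actually set):

::

    ceph config rm global mds_verify_scatter
    ceph config rm global mds_debug_scatterstat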

Now, allow an MDS to join the recovery file system:

::

    ceph fs set cephfs_recovery joinable true

Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics.
Ensure you have an MDS running and issue:

::

    ceph fs status # get active MDS
    ceph tell mds.<id> scrub start / recursive repair
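
To monitor the scrub's progress, you can query its status on the same MDS
(assuming the same ``<id>`` as above):

::

    ceph tell mds.<id> scrub status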

.. note::

    Symbolic links are recovered as empty regular files. `Symbolic link recovery
    <https://tracker.ceph.com/issues/46166>`_ is scheduled to be supported in
    Pacific.

It is recommended to migrate any data from the recovery file system as soon as
possible. Do not restore the old file system while the recovery file system is
operational.

.. note::

    If the data pool is also corrupt, some files may not be restored because
    backtrace information is lost. If any data objects are missing (due to
    issues like lost Placement Groups on the data pool), the recovered files
    will contain holes in place of the missing data.