.. _disaster-recovery-experts:

Advanced: Metadata repair tools
===============================

.. warning::

    If you do not have expert knowledge of CephFS internals, you will
    need to seek assistance before using any of these tools.

    The tools mentioned here can easily cause damage as well as fix it.

    It is essential to understand exactly what has gone wrong with your
    file system before attempting to repair it.

    If you do not have access to professional support for your cluster,
    consult the ceph-users mailing list or the #ceph IRC channel.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
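
If you need the RADOS-level copy, the following is a minimal sketch. It
assumes the metadata pool is named ``cephfs_metadata`` and that the rank-0
journal objects carry the ``200.`` prefix (the per-rank journal inode);
verify both against your cluster before relying on it.

::

    # Export each rank-0 journal object from the metadata pool to a local file.
    for obj in $(rados -p cephfs_metadata ls | grep '^200\.'); do
        rados -p cephfs_metadata get "$obj" "$obj"
    done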


Dentry recovery from journal
----------------------------

If a journal is damaged, or an MDS is for any other reason incapable of
replaying it, attempt to recover whatever file metadata we can, like so:

::

    cephfs-journal-tool event recover_dentries summary

By default, this command acts on MDS rank 0; pass ``--rank=<n>`` to operate on
other ranks.
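
For example, to recover dentries from the journal of rank 1 (a sketch; recent
versions of the tool expect the rank as ``<fs_name>:<rank>``, so substitute
your file system's name):

::

    cephfs-journal-tool --rank=<fs_name>:1 event recover_dentries summary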

This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be
    self-consistent, and an online MDS scrub will be required afterwards. The
    journal contents will not be modified by this command; you should truncate
    the journal separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool [--rank=N] journal reset

Specify the MDS rank using the ``--rank`` option when the file system has/had
multiple active MDS daemons.
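
For example, to truncate only the journal of rank 1:

::

    cephfs-journal-tool --rank=1 journal reset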

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
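
For example, to reset only the snap table of rank 0:

::

    cephfs-table-tool 0 reset snap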

MDS map reset
-------------

Once the in-RADOS state of the file system (i.e. the contents of the metadata
pool) is somewhat recovered, it may be necessary to update the MDS map to
reflect the contents of the metadata pool. Use the following command to reset
the MDS map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be
ignored; consequently, this can result in data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs
new'. The key distinction is that doing a remove/new will leave rank 0 in
'creating' state, such that it would overwrite any existing root inode on disk
and orphan any existing files. In contrast, the 'reset' command will leave
rank 0 in 'active' state, such that the next MDS daemon to claim the rank will
go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a three-phase process: first, scanning *all* objects to calculate
size and mtime metadata for inodes; second, scanning the first
object from every file to collect this metadata and inject it into
the metadata pool; third, checking inode linkages and fixing any
errors found.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The 'scan_extents' and 'scan_inodes' commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
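
One way to enforce this ordering from a single shell is to background the
workers and ``wait`` between phases, as in this sketch:

::

    # Run all scan_extents workers, then block until every one has finished.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait

    # Only now is it safe to start the scan_inodes workers.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait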

After completing the metadata recovery, you may want to run a cleanup
operation to delete the ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>


Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing file system is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the file system metadata into
this new pool, leaving the old metadata in place. This could be used to make a
safer attempt at recovery since the existing metadata pool would not be
modified.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be archived or deleted.

To begin, the existing file system should be taken down, if not done already,
to prevent further modification of the data pool. Unmount all clients and then
mark the file system failed:

::

    ceph fs fail <fs_name>

Next, create a recovery file system in which we will populate a new metadata
pool backed by the original data pool.

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create cephfs_recovery_meta
    ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --allow-dangerous-metadata-overlay

The recovery file system starts with an MDS rank that will initialize the new
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down as we do not want it interacting with
the metadata pool further.

::

    ceph fs fail cephfs_recovery

Next, we will reset the initial metadata the MDS created:

::

    cephfs-table-tool cephfs_recovery:all reset session
    cephfs-table-tool cephfs_recovery:all reset snap
    cephfs-table-tool cephfs_recovery:all reset inode

Now perform the recovery of the metadata pool from the data pool:

::

    cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
    cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
    cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
    cephfs-data-scan scan_links --filesystem cephfs_recovery

.. note::

    Each scan procedure above goes through the entire data pool. This may take a
    significant amount of time. See the previous section on how to distribute
    this task among workers.
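
For example, the scan_extents phase can be distributed across 4 workers in
this mode by combining the worker flags with the alternate-pool flags shown
above (a sketch; the scan_inodes phase parallelizes the same way):

::

    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 \
            --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool> &
    done
    wait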

If the damaged file system contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
    cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
276 | |
277 | After recovery, some recovered directories will have incorrect statistics. | |
f67539c2 TL |
278 | Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are |
279 | set to false (the default) to prevent the MDS from checking the statistics: | |
280 | ||
281 | :: | |
282 | ||
283 | ceph config rm mds mds_verify_scatter | |
284 | ceph config rm mds mds_debug_scatterstat | |
285 | ||
286 | (Note, the config may also have been set globally or via a ceph.conf file.) | |
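
To confirm the effective values (a hedged check against the central config
database; settings made only in a local ceph.conf will not show up here):

::

    ceph config get mds mds_verify_scatter
    ceph config get mds mds_debug_scatterstat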

Now, allow an MDS to join the recovery file system:

::

    ceph fs set cephfs_recovery joinable true

Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics.
Ensure you have an MDS running and issue:

::

    ceph fs status # get active MDS
    ceph tell mds.<id> scrub start / recursive repair

.. note::

    Symbolic links are recovered as empty regular files. `Symbolic link recovery
    <https://tracker.ceph.com/issues/46166>`_ is scheduled to be supported in
    Pacific.

It is recommended to migrate any data from the recovery file system as soon as
possible. Do not restore the old file system while the recovery file system is
operational.
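
A minimal migration sketch, assuming a ceph-fuse client and hypothetical mount
points (``client_fs`` selects the file system to mount; older clients call
this option ``client_mds_namespace``):

::

    # Mount the recovery file system, then copy its contents elsewhere.
    ceph-fuse --client_fs cephfs_recovery /mnt/recovery
    rsync -a /mnt/recovery/ /mnt/destination/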

.. note::

    If the data pool is also corrupt, some files may not be restored because
    backtrace information is lost. If any data objects are missing (due to
    issues like lost Placement Groups on the data pool), the recovered files
    will contain holes in place of the missing data.

.. _Symbolic link recovery: https://tracker.ceph.com/issues/46166