Disaster recovery
=================

.. danger::

    The notes in this section are aimed at experts, making a best effort
    to recover what they can from damaged filesystems. These steps
    have the potential to make things worse as well as better. If you
    are unsure, do not proceed.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
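
If you do need to fall back to a RADOS-level copy, something along the following
lines may serve as a starting point. It is only a rough sketch: it assumes rank 0
and the default journal inode numbering, under which the rank 0 journal objects in
the metadata pool are named with a ``200.`` prefix. Verify the prefix with
``rados ls`` before relying on it:

::

    # List the rank 0 journal objects (the prefix is an assumption; check it first)
    rados -p <metadata pool> ls | grep '^200\.' > journal_objects.txt
    # Copy each journal object out to a local file
    while read obj; do
        rados -p <metadata pool> get "$obj" "backup_$obj"
    done < journal_objects.txt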


Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata we can like so:

::

    cephfs-journal-tool event recover_dentries summary

By default this command acts on MDS rank 0; pass ``--rank=<n>`` to operate on other ranks.
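
For example, to check how much of the journal is readable before attempting
recovery, and to recover dentries from a rank other than 0 (the rank number
below is only a placeholder):

::

    # Report overall journal integrity before recovering anything
    cephfs-journal-tool journal inspect
    # Recover dentries from rank 1 instead of the default rank 0
    cephfs-journal-tool --rank=1 event recover_dentries summary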

``recover_dentries`` will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.
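
Once an MDS is running again, such a scrub can be started through the admin
socket, as shown at the end of this document (``mds.a`` is a placeholder
daemon name):

::

    ceph daemon mds.a scrub_path / recursive repair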

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool journal reset

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.
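
``journal reset`` also acts on rank 0 by default; as with ``recover_dentries``,
pass ``--rank`` to truncate the journal of a different rank (the rank number
below is only a placeholder):

::

    cephfs-journal-tool --rank=1 journal reset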

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
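
For example, to inspect and then reset only the snap table of rank 0 (the
``show`` subcommand used here for inspection is an assumption; skip that step
if your version of the tool does not support it):

::

    # Dump the current snap table of rank 0 before wiping it
    cephfs-table-tool 0 show snap
    # Reset only the snap table of rank 0
    cephfs-table-tool 0 reset snap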

MDS map reset
-------------

Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored;
as a result, running it can cause data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.
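
As a quick sanity check after the reset, you can dump the filesystem's MDS map
and confirm that it now describes a single rank:

::

    ceph fs get <fs name>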

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a three-phase process: first, *all* objects are scanned to calculate
size and mtime metadata for inodes; second, the first object of
every file is scanned to collect this metadata and inject it into
the metadata pool; third, inode linkages are checked and any errors
found are fixed.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The 'scan_extents' and 'scan_inodes' commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
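
One way to enforce that ordering when launching the workers from a single shell
is to background each phase and ``wait`` for it to finish before starting the
next, for example (a sketch assuming a bash-like shell and the 4 workers above):

::

    # scan_extents phase: all four workers in parallel, then wait for all of them
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait
    # scan_inodes phase starts only once every scan_extents worker has finished
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait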

After completing the metadata recovery, you may want to run a cleanup
operation to delete ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>

Finding files affected by lost data PGs
---------------------------------------

Losing a data PG may affect many files. Files are split into many objects,
so identifying which files are affected by loss of particular PGs requires
a full scan over all object IDs that may exist within the size of a file.
This type of scan may be useful for identifying which files require
restoring from a backup.

.. danger::

    This command does not repair any metadata, so when restoring files in
    this case you must *remove* the damaged file and replace it, so that it
    gets a fresh inode. Do not overwrite damaged files in place.

If you know that objects have been lost from PGs, use the ``pg_files``
subcommand to scan for files that may have been damaged as a result:

::

    cephfs-data-scan pg_files <path> <pg id> [<pg id>...]

For example, if you have lost data from PGs 1.4 and 4.5, and you would like
to know which files under /home/bob might have been damaged:

::

    cephfs-data-scan pg_files /home/bob 1.4 4.5

The output will be a list of paths to potentially damaged files, one
per line.

Note that this command acts as a normal CephFS client to find all the
files in the filesystem and read their layouts, so the MDS must be
up and running.
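
If you are not sure which PGs are affected in the first place, the cluster
health output is one place to look for PGs reporting lost or unfound objects
(a hedged example; the exact wording of the health messages varies between
releases):

::

    ceph health detail | grep -i unfound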

Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing filesystem is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the filesystem metadata
into this new pool, leaving the old metadata in place. This could be used to
make a safer attempt at recovery since the existing metadata pool would not be
overwritten.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be deleted.

To begin this process, first create the fresh metadata pool and initialize
it with empty file system data structures:

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create recovery <pg-num> replicated <crush-ruleset-name>
    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
    ceph fs reset recovery-fs --yes-i-really-mean-it
    cephfs-table-tool recovery-fs:all reset session
    cephfs-table-tool recovery-fs:all reset snap
    cephfs-table-tool recovery-fs:all reset inode

Next, run the recovery toolset using the ``--alternate-pool`` argument to output
results to the alternate pool:

::

    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>
    cephfs-data-scan scan_links --filesystem recovery-fs

If the damaged filesystem contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
    cephfs-journal-tool --rank recovery-fs:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics.
Ensure that the parameters mds_verify_scatter and mds_debug_scatterstat are set
to false (the default) to prevent the MDS from checking the statistics.
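
If you need to confirm or change these settings on a running MDS, the admin
socket can be used (``mds.a`` below is a placeholder daemon name; adjust it to
your daemon):

::

    ceph daemon mds.a config get mds_verify_scatter
    ceph daemon mds.a config set mds_debug_scatterstat false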

Then run a forward scrub to repair the statistics. Ensure you have an MDS
running and issue:

::

    ceph daemon mds.a scrub_path / recursive repair
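
Once you are satisfied with the recovered filesystem, the caution above notes
that the damaged metadata pool should be deleted. One possible sequence is
sketched below; it assumes that no MDS daemons are still serving the damaged
filesystem, and double-check the filesystem and pool names before running it:

::

    # Remove the damaged filesystem so that its metadata pool is no longer in use
    ceph fs rm <original filesystem name> --yes-i-really-mean-it
    # Allow pool deletion on the monitors
    ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
    # Delete the damaged metadata pool
    ceph osd pool delete <original metadata pool> <original metadata pool> --yes-i-really-really-mean-it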