Disaster recovery
=================

.. danger::

    The notes in this section are aimed at experts, making a best effort
    to recover what they can from damaged filesystems. These steps
    have the potential to make things worse as well as better. If you
    are unsure, do not proceed.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
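
If the export fails because of the corruption, a RADOS-level copy can be made
with the ordinary ``rados`` tool instead. The sketch below assumes that the
rank 0 journal is stored as objects whose names begin with ``200.`` in the
metadata pool; the pool name and backup directory are placeholders:

::

    # Copy each rank-0 journal object out of the metadata pool.
    mkdir -p journal-backup
    rados -p <metadata pool> ls | grep '^200\.' | while read obj; do
        rados -p <metadata pool> get "$obj" "journal-backup/$obj"
    done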


Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata we can like so:

::

    cephfs-journal-tool event recover_dentries summary

This command acts on MDS rank 0 by default; pass ``--rank=<n>`` to operate on other ranks.
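
For example, to run the same recovery against rank 1 of a multi-MDS filesystem
(assuming that rank exists), the invocation would be:

::

    cephfs-journal-tool --rank=1 event recover_dentries summary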

This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool journal reset
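
If you want to confirm the extent of the damage before discarding the journal,
it can first be examined non-destructively (an optional step, not part of the
reset itself):

::

    cephfs-journal-tool journal inspect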

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
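
For example, to reset only the inode table of rank 0 (the rank and table here
are chosen purely for illustration):

::

    cephfs-table-tool 0 reset inode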

MDS map reset
-------------

Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored:
as a result, it is possible for this to cause data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a two-phase process: first, *all* objects are scanned to calculate
size and mtime metadata for inodes; second, the first object of
every file is scanned to collect this metadata and inject it into
the metadata pool.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>

These commands may take a *very long* time if there are many
files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
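
One simple way to arrange this from a single shell is to launch the workers of
each phase as background jobs and ``wait`` for all of them before starting the
next phase. This is only a sketch; the data pool name is a placeholder as above:

::

    # Run all scan_extents workers, then block until every one has finished.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait

    # Only then run the scan_inodes workers.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait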

After completing the metadata recovery, you may want to run a cleanup
operation to delete the ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>

Finding files affected by lost data PGs
---------------------------------------

Losing a data PG may affect many files. Files are split into many objects,
so identifying which files are affected by loss of particular PGs requires
a full scan over all object IDs that may exist within the size of a file.
This type of scan may be useful for identifying which files require
restoring from a backup.

.. danger::

    This command does not repair any metadata, so when restoring files in
    this case you must *remove* the damaged file, and replace it in order
    to have a fresh inode. Do not overwrite damaged files in place.

If you know that objects have been lost from PGs, use the ``pg_files``
subcommand to scan for files that may have been damaged as a result:

::

    cephfs-data-scan pg_files <path> <pg id> [<pg id>...]

For example, if you have lost data from PGs 1.4 and 4.5, and you would like
to know which files under /home/bob might have been damaged:

::

    cephfs-data-scan pg_files /home/bob 1.4 4.5

The output will be a list of paths to potentially damaged files, one
per line.

Note that this command acts as a normal CephFS client to find all the
files in the filesystem and read their layouts, so the MDS must be
up and running.
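
For example, before running ``pg_files`` you can confirm that an active MDS is
available with:

::

    ceph mds stat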

Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing filesystem is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the filesystem metadata
into this new pool, leaving the old metadata in place. This could be used to
make a safer attempt at recovery since the existing metadata pool would not be
overwritten.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be deleted.
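
For reference, deleting the damaged pool once it is no longer referenced by any
filesystem is an ordinary pool removal. This is only a sketch; the pool name is
a placeholder and the monitors must permit pool deletion (``mon_allow_pool_delete``):

::

    ceph osd pool delete <damaged metadata pool> <damaged metadata pool> --yes-i-really-really-mean-it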

To begin this process, first create the fresh metadata pool and initialize
it with empty file system data structures:

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create recovery <pg-num> replicated <crush-ruleset-name>
    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
    ceph fs reset recovery-fs --yes-i-really-mean-it
    cephfs-table-tool recovery-fs:all reset session
    cephfs-table-tool recovery-fs:all reset snap
    cephfs-table-tool recovery-fs:all reset inode

Next, run the recovery toolset using the --alternate-pool argument to output
results to the alternate pool:

::

    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>

If the damaged filesystem contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
    cephfs-journal-tool --rank recovery-fs:0 journal reset --force

After recovery, some recovered directories will have incorrect link counts.
Ensure the parameter ``mds_debug_scatterstat`` is set to false (the default) to
prevent the MDS from checking the link counts, then run a forward scrub to
repair them. Ensure you have an MDS running and issue:

::

    ceph daemon mds.a scrub_path / recursive repair