1 <html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Bad block HOWTO for smartmontools</title><meta name="generator" content="DocBook XSL Stylesheets V1.69.1"><meta name="description" content="
2 This article describes what actions might be taken when smartmontools
3 detects a bad block on a disk. It demonstrates how to identify the file
4 associated with an unreadable disk sector, and how to force that sector
5 to reallocate.
6 "></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="article" lang="en"><div class="titlepage"><div><div><h1 class="title"><a name="index"></a>Bad block HOWTO for smartmontools</h1></div><div><div class="author"><h3 class="author"><span class="firstname">Bruce</span> <span class="surname">Allen</span></h3><div class="affiliation"><div class="address"><p><br>
7       <code class="email">&lt;<a href="mailto:smartmontools-support@lists.sourceforge.net">smartmontools-support@lists.sourceforge.net</a>&gt;</code><br>
8      </p></div></div></div></div><div><div class="author"><h3 class="author"><span class="firstname">Douglas</span> <span class="surname">Gilbert</span></h3><div class="affiliation"><div class="address"><p><br>
9       <code class="email">&lt;<a href="mailto:smartmontools-support@lists.sourceforge.net">smartmontools-support@lists.sourceforge.net</a>&gt;</code><br>
10      </p></div></div></div></div><div><p class="copyright">Copyright © 2004, 2005, 2006, 2007 Bruce Allen</p></div><div><div class="legalnotice"><a name="id4710404"></a><p>
11 Permission is granted to copy, distribute and/or modify this document
12 under the terms of the GNU Free Documentation License, Version 1.1
13 or any later version published by the Free Software Foundation;
14 with no Invariant Sections, with no Front-Cover Texts, and with
15 no Back-Cover Texts.
16 </p><p>
17 For an online copy of the license see
18 <a href="http://www.fsf.org/copyleft/fdl.html" target="_top">
19 <code class="literal">www.fsf.org/copyleft/fdl.html</code></a>.
20 </p></div></div><div><p class="pubdate">2007-01-23</p></div><div><div class="revhistory"><table border="1" width="100%" summary="Revision history"><tr><th align="left" valign="top" colspan="3"><b>Revision History</b></th></tr><tr><td align="left">Revision 1.1</td><td align="left">2007-01-23</td><td align="left">dpg</td></tr><tr><td align="left" colspan="3">
21 add sections on ReiserFS and partition table damage
22 </td></tr><tr><td align="left">Revision 1.0</td><td align="left">2006-11-14</td><td align="left">dpg</td></tr><tr><td align="left" colspan="3">
23 merge BadBlockHowTo.txt and BadBlockSCSIHowTo.txt
24 </td></tr></table></div></div><div><div class="abstract"><p class="title"><b>Abstract</b></p><p>
25 This article describes what actions might be taken when smartmontools
26 detects a bad block on a disk. It demonstrates how to identify the file
27 associated with an unreadable disk sector, and how to force that sector
28 to reallocate.
29 </p></div></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="sect1"><a href="#intro">Introduction</a></span></dt><dt><span class="sect1"><a href="#rfile">Repairs in a file system</a></span></dt><dd><dl><dt><span class="sect2"><a href="#e2_example1">ext2/ext3 first example</a></span></dt><dt><span class="sect2"><a href="#e2_example2">ext2/ext3 second example</a></span></dt><dt><span class="sect2"><a href="#unassigned">Unassigned sectors</a></span></dt><dt><span class="sect2"><a href="#reiserfs_ex">ReiserFS example</a></span></dt></dl></dd><dt><span class="sect1"><a href="#sdisk">Repairs at the disk level</a></span></dt><dd><dl><dt><span class="sect2"><a href="#partition">Partition table problems</a></span></dt><dt><span class="sect2"><a href="#lvm">LVM repairs</a></span></dt><dt><span class="sect2"><a href="#bb">Bad block reassignment</a></span></dt></dl></dd></dl></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro"></a>Introduction</h2></div></div></div><p>
30 Handling bad blocks is a difficult problem as it often involves
31 decisions about losing information. Modern storage devices tend
32 to handle the simple cases automatically, for example by writing
33 a disk sector that was read with difficulty to another area on
34 the media. Even though such a remapping can be done by a disk
35 drive transparently, there is still a lingering worry about media
36 deterioration and the disk running out of spare sectors to remap.
37 </p><p>
38 Can smartmontools help? As the <span class="acronym">SMART</span> acronym
39 <sup>[<a name="id4710480" href="#ftn.id4710480">1</a>]</sup>
40 suggests, the <span><strong class="command">smartctl</strong></span> command and the
41 <span><strong class="command">smartd</strong></span> daemon concentrate on monitoring and analysis.
So apart from changing some reporting settings, smartmontools will not
modify the raw data in a device. Also, smartmontools only works with
physical devices: it does not know about partitions and file systems,
so other tools are needed. The job of smartmontools is to alert the user
that something is wrong and that user intervention may be required.
47 </p><p>
48 When a bad block is reported one approach is to work out the mapping between
49 the logical block address used by a storage device and a file or some other
component of a file system using that device. Note that there may be no such
mapping, which indicates that the bad block lies at a location not
currently used by the file system. A user may want to do this analysis to
localize and minimize the number of replacement files that are retrieved from
some backup store. This approach requires knowledge of the file system
involved and this document uses the Linux ext2/ext3 and ReiserFS file systems
for examples. The type of content may also come into play. For example, if
an area storing video has a corrupted sector, it may be easiest to accept
that a frame or two might be corrupted and instruct the disk not to retry,
as retrying may have the visual effect of turning a momentary blank into a 1
second pause (while the disk retries the faulty sector, often accompanied
by a telltale clicking sound).
62 </p><p>
63 Another approach is to ignore the upper level consequences (e.g. corrupting
64 a file or worse damage to a file system) and use the facilities offered by
a storage device to repair the damage. The SCSI disk command set is used
to elaborate on this low level approach.
67 </p></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="rfile"></a>Repairs in a file system</h2></div></div></div><p>
68 This section contains examples of what to do at the file system level
69 when smartmontools reports a bad block. These examples assume the Linux
70 operating system and either the ext2/ext3 or ReiserFS file system. The
71 various Linux commands shown have man pages and the reader is encouraged
72 to examine these. Of note is the <span><strong class="command">dd</strong></span> command which is
73 often used in repair work
74 <sup>[<a name="id4710574" href="#ftn.id4710574">2</a>]</sup>
75 and has a unique command line syntax.
76 </p><p>
77 The authors would like to thank Sergey Vlasov, Theodore Ts'o,
78 Michael Bendzick, and others for explaining this approach. The authors would
79 like to add text showing how to do this for other file systems, in
80 particular XFS, and JFS: please email if you can provide this
81 information.
82 </p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="e2_example1"></a>ext2/ext3 first example</h3></div></div></div><p>
83 In this example, the disk is failing self-tests at Logical Block
84 Address LBA = 0x016561e9 = 23421417. The LBA counts sectors in units
85 of 512 bytes, and starts at zero.
86 </p><p>
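The self-test log reports the failing LBA in hexadecimal, while the later arithmetic is done in decimal. The following is a minimal sketch of the conversion in the shell; the helper name hex_to_dec is hypothetical and is not part of smartmontools.

```shell
# Convert a hexadecimal LBA (as printed in the SMART self-test log) to decimal.
# hex_to_dec is an illustrative helper name, not a smartmontools command.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x016561e9    # prints 23421417
```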
87 </p><pre class="programlisting">
root]# smartctl -l selftest /dev/hda
89
90 SMART Self-test log structure revision number 1
91 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
92 # 1 Extended offline Completed: read failure 90% 217 0x016561e9
93 </pre><p>
94 Note that other signs that there is a bad sector on the disk can be
95 found in the non-zero value of the Current Pending Sector count:
96 </p><pre class="programlisting">
97 root]# smartctl -A /dev/hda
98 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
99 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
100 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
101 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
102 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
103 </pre><p>
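For scripted checks, the raw value of a single attribute can be extracted from this table. The sketch below assumes the ten-column layout shown above, with RAW_VALUE as the tenth field; raw_value is a hypothetical helper name, and the sample line is copied from the listing rather than read from a real disk.

```shell
# Print the RAW_VALUE column (10th field) of one SMART attribute.
# Reads `smartctl -A` style output on stdin; raw_value is a hypothetical helper.
raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Typical use on a live system (needs root):
#   smartctl -A /dev/hda | raw_value Current_Pending_Sector
printf '%s\n' \
  '197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1' \
  | raw_value Current_Pending_Sector    # prints 1
```

A non-zero result is the sign of trouble described above.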
104 </p><p>
105 First Step: We need to locate the partition on which this sector of
106 the disk lives:
107 </p><pre class="programlisting">
108 root]# fdisk -lu /dev/hda
109
110 Disk /dev/hda: 123.5 GB, 123522416640 bytes
111 255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors
112 Units = sectors of 1 * 512 = 512 bytes
113
114 Device Boot Start End Blocks Id System
115 /dev/hda1 * 63 4209029 2104483+ 83 Linux
116 /dev/hda2 4209030 5269319 530145 82 Linux swap
117 /dev/hda3 5269320 238227884 116479282+ 83 Linux
118 /dev/hda4 238227885 241248104 1510110 83 Linux
119 </pre><p>
120
121 The partition <code class="filename">/dev/hda3</code> starts at LBA 5269320 and
122 extends past the 'problem' LBA. The 'problem' LBA is offset
123 23421417 - 5269320 = 18152097 sectors into the partition
124 <code class="filename">/dev/hda3</code>.
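</p><p>
The lookup and subtraction can be sketched as a small shell function. This is only an illustration: the partition boundaries are copied by hand from the fdisk listing above, and locate_lba is a hypothetical helper name.

```shell
# Partition boundaries copied from the `fdisk -lu` listing: name start end
partitions='/dev/hda1 63 4209029
/dev/hda2 4209030 5269319
/dev/hda3 5269320 238227884
/dev/hda4 238227885 241248104'

# locate_lba LBA: print the partition holding the LBA and the sector
# offset of the LBA inside that partition.
locate_lba() {
    echo "$partitions" | while read -r name start end; do
        if [ "$1" -ge "$start" ]; then
            if [ "$1" -le "$end" ]; then
                echo "$name $(( $1 - $start ))"
            fi
        fi
    done
}

locate_lba 23421417    # prints "/dev/hda3 18152097"
```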
125 </p><p>
126 To verify the type of the file system and the mount point, look in
127 <code class="filename">/etc/fstab</code>:
128 </p><pre class="programlisting">
129 root]# grep hda3 /etc/fstab
130 /dev/hda3 /data ext2 defaults 1 2
131 </pre><p>
132 You can see that this is an ext2 file system, mounted at
133 <code class="filename">/data</code>.
134 </p><p>
135 Second Step: we need to find the block size of the file system
136 (normally 4096 bytes for ext2):
137 </p><pre class="programlisting">
138 root]# tune2fs -l /dev/hda3 | grep Block
139 Block count: 29119820
140 Block size: 4096
141 </pre><p>
In this case the block size is 4096 bytes.
</p><p>
Third Step: we need to determine which File System Block contains this
LBA. The formula is:
146 </p><pre class="programlisting">
147 b = (int)((L-S)*512/B)
148 where:
149 b = File System block number
150 B = File system block size in bytes
151 L = LBA of bad sector
152 S = Starting sector of partition as shown by fdisk -lu
153 and (int) denotes the integer part.
154 </pre><p>
155
156 In our example, L=23421417, S=5269320, and B=4096. Hence the
157 'problem' LBA is in block number
158 </p><pre class="programlisting">
b = (int)(18152097*512/4096) = (int)2269012.125
so b=2269012.
161 </pre><p>
162 </p><p>
163 Note: the fractional part of 0.125 indicates that this problem LBA is
164 actually the second of the eight sectors that make up this file system
165 block.
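</p><p>
The formula and the position of the sector inside the block can be checked with shell integer arithmetic, which truncates in the same way as (int). This is a minimal sketch; fs_block and sector_in_block are hypothetical helper names, and the arguments are L, S and B as defined above.

```shell
# fs_block L S B: file system block holding LBA L (integer division truncates).
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# sector_in_block L S B: zero-based index of the sector inside that block.
sector_in_block() {
    echo $(( ($1 - $2) % ($3 / 512) ))
}

fs_block 23421417 5269320 4096          # prints 2269012
sector_in_block 23421417 5269320 4096   # prints 1, i.e. the second sector
```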
166 </p><p>
Fourth Step: we use debugfs to locate the inode that owns this block,
and the file that this inode describes:
169 </p><pre class="programlisting">
170 root]# debugfs
171 debugfs 1.32 (09-Nov-2002)
172 debugfs: open /dev/hda3
173 debugfs: icheck 2269012
174 Block Inode number
175 2269012 41032
176 debugfs: ncheck 41032
177 Inode Pathname
178 41032 /S1/R/H/714197568-714203359/H-R-714202192-16.gwf
179 </pre><p>
180
181 In this example, you can see that the problematic file (with the mount
182 point included in the path) is:
183 <code class="filename">/data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf</code>
184 </p><p>
185 To force the disk to reallocate this bad block we'll write zeros to
186 the bad block, and sync the disk:
187 </p><pre class="programlisting">
root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
root]# sync
190 </pre><p>
191 </p><p>
192 <span class="emphasis"><em>NOTE:</em></span> This last step has <span class="emphasis"><em>permanently
193 </em></span> and irretrievably <span class="emphasis"><em>destroyed</em></span> some of
194 the data that was in this file. Don't do this unless you don't need
195 the file or you can replace it with a fresh or correct version.
196 </p><p>
197 Now everything is back to normal: the sector has been reallocated.
198 Compare the output just below to similar output near the top of this
199 article:
200 </p><pre class="programlisting">
201 root]# smartctl -A /dev/hda
202 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
203 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
204 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
205 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
206 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
207 </pre><p>
208
Note: for some disks it may be necessary to update the SMART Attribute values by using
<span><strong class="command">smartctl -t offline /dev/hda</strong></span>.
211 </p><p>
212 The disk now passes its self-tests again:
213
214 </p><pre class="programlisting">
215 root]# smartctl -t long /dev/hda [wait until test completes, then]
216 root]# smartctl -l selftest /dev/hda
217
218 SMART Self-test log structure revision number 1
219 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
220 # 1 Extended offline Completed without error 00% 239 -
221 # 2 Extended offline Completed: read failure 90% 217 0x016561e9
222 # 3 Extended offline Completed: read failure 90% 212 0x016561e9
223 # 4 Extended offline Completed: read failure 90% 181 0x016561e9
224 # 5 Extended offline Completed without error 00% 14 -
225 # 6 Extended offline Completed without error 00% 4 -
226 </pre><p>
227 </p><p>
228 and no longer shows any offline uncorrectable sectors:
229
230 </p><pre class="programlisting">
231 root]# smartctl -A /dev/hda
232 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
233 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
234 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
235 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
236 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
237 </pre><p>
238 </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="e2_example2"></a>ext2/ext3 second example</h3></div></div></div><p>
239 On this drive, the first sign of trouble was this email from smartd:
240 </p><pre class="programlisting">
241 To: ballen
242 Subject: SMART error (selftest) detected on host: medusa-slave166.medusa.phys.uwm.edu
243
244 This email was generated by the smartd daemon running on host:
245 medusa-slave166.medusa.phys.uwm.edu in the domain: master001-nis
246
247 The following warning/error was logged by the smartd daemon:
248 Device: /dev/hda, Self-Test Log error count increased from 0 to 1
249 </pre><p>
250 </p><p>
251 Running <span><strong class="command">smartctl -a /dev/hda</strong></span> confirmed the problem:
252
253 </p><pre class="programlisting">
254 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
255 # 1 Extended offline Completed: read failure 80% 682 0x021d9f44
256
257 Note that the failing LBA reported is 0x021d9f44 (base 16) = 35495748 (base 10)
258
259 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
260 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
261 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
262 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 3
263 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 3
264 </pre><p>
265 </p><p>
266 and one can see above that there are 3 sectors on the list of pending
267 sectors that the disk can't read but would like to reallocate.
268 </p><p>
269 The device also shows errors in the SMART error log:
270 </p><pre class="programlisting">
271 Error 212 occurred at disk power-on lifetime: 690 hours
272 After command completion occurred, registers were:
273 ER ST SC SN CL CH DH
274 -- -- -- -- -- -- --
275 40 51 12 46 9f 1d e2 Error: UNC 18 sectors at LBA = 0x021d9f46 = 35495750
276
277 Commands leading to the command that caused the error were:
278 CR FR SC SN CL CH DH DC Timestamp Command/Feature_Name
279 -- -- -- -- -- -- -- -- --------- --------------------
280 25 00 12 46 9f 1d e0 00 2485545.000 READ DMA EXT
281 </pre><p>
282 </p><p>
283 Signs of trouble at this LBA may also be found in SYSLOG:
284 </p><pre class="programlisting">
285 [root]# grep LBA /var/log/messages | awk '{print $12}' | sort | uniq
286 LBAsect=35495748
287 LBAsect=35495750
288 </pre><p>
289 </p><p>
290 So I decide to do a quick check to see how many bad sectors there
291 really are. Using the bash shell I check 70 sectors around the trouble
292 area:
293 </p><pre class="programlisting">
[root]# export i=35495730
[root]# while [ $i -lt 35495800 ]
&gt; do echo $i
&gt; dd if=/dev/hda of=/dev/null bs=512 count=1 skip=$i
&gt; let i+=1
&gt; done
300
301 &lt;SNIP&gt;
302
303 35495734
304 1+0 records in
305 1+0 records out
306 35495735
307 dd: reading `/dev/hda': Input/output error
308 0+0 records in
309 0+0 records out
310
311 &lt;SNIP&gt;
312
313 35495751
314 dd: reading `/dev/hda': Input/output error
315 0+0 records in
316 0+0 records out
317 35495752
318 1+0 records in
319 1+0 records out
320
321 &lt;SNIP&gt;
322 </pre><p>
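The same scan can be wrapped in a function that prints only the sectors that fail to read. This is a sketch, not a tested recovery tool: scan_sectors is a hypothetical helper name, and it relies on dd exiting with a non-zero status when a read fails.

```shell
# scan_sectors DEVICE FIRST LAST: print each 512-byte sector in the
# inclusive range that dd cannot read.
scan_sectors() {
    i=$2
    while [ "$i" -le "$3" ]; do
        # dd returns a non-zero status on an input/output error
        dd if="$1" of=/dev/null bs=512 count=1 skip="$i" 2>/dev/null || echo "$i"
        i=$(( i + 1 ))
    done
}

# Example use (needs root and a real device):
#   scan_sectors /dev/hda 35495730 35495799
```

On a healthy range the function prints nothing.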
323 </p><p>
324 which shows that the seventeen sectors 35495735-35495751 (inclusive)
325 are not readable.
326 </p><p>
327 Next, we identify the files at those locations. The partitioning
328 information on this disk is identical to the first example above, and
329 as in that case the problem sectors are on the third partition
330 <code class="filename">/dev/hda3</code>. So we have:
331 </p><pre class="programlisting">
332 L=35495735 to 35495751
333 S=5269320
334 B=4096
335 </pre><p>
336 so that b=3778301 to 3778303 are the three bad blocks in the file
337 system.
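</p><p>
The block range follows from applying the earlier formula to the first and last unreadable LBA. A minimal sketch, with block_range as a hypothetical helper name:

```shell
# block_range FIRST_LBA LAST_LBA S B: print the first file system block
# and the number of blocks spanned by the LBA range.
block_range() {
    first=$(( ($1 - $3) * 512 / $4 ))
    last=$(( ($2 - $3) * 512 / $4 ))
    echo "$first $(( $last - $first + 1 ))"
}

block_range 35495735 35495751 5269320 4096    # prints "3778301 3"
```

The two numbers printed are the values later passed to dd as seek and count.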
338
339 </p><pre class="programlisting">
340 [root]# debugfs
341 debugfs 1.32 (09-Nov-2002)
342 debugfs: open /dev/hda3
343 debugfs: icheck 3778301
344 Block Inode number
345 3778301 45192
346 debugfs: icheck 3778302
347 Block Inode number
348 3778302 45192
349 debugfs: icheck 3778303
350 Block Inode number
351 3778303 45192
352 debugfs: ncheck 45192
353 Inode Pathname
354 45192 /S1/R/H/714979488-714985279/H-R-714979984-16.gwf
355 debugfs: quit
356 </pre><p>
357 </p><p>
Next, just to confirm that this is really the damaged file:
359 </p><p>
360 </p><pre class="programlisting">
361 [root]# md5sum /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf
362 md5sum: /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf: Input/output error
363 </pre><p>
364 </p><p>
365 Finally we force the disk to reallocate the three bad blocks:
366 </p><pre class="programlisting">
[root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=3 seek=3778301
[root]# sync
369 </pre><p>
370 </p><p>
371 We could also probably use:
372 </p><pre class="programlisting">
[root]# dd if=/dev/zero of=/dev/hda bs=512 count=17 seek=35495735
374 </pre><p>
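The two dd commands do not cover exactly the same sectors. Writing three 4096-byte blocks to the partition zeroes 24 sectors of the disk, which include the 17 unreadable ones. The sketch below maps a block range back to whole-disk LBAs to show this; disk_extent is a hypothetical helper name.

```shell
# disk_extent BLOCK COUNT S B: first LBA and number of 512-byte sectors
# on the whole disk covered by COUNT file system blocks starting at BLOCK.
disk_extent() {
    per_block=$(( $4 / 512 ))
    echo "$(( $3 + $1 * $per_block )) $(( $2 * $per_block ))"
}

disk_extent 3778301 3 5269320 4096    # prints "35495728 24"
```

The range therefore runs from LBA 35495728 to 35495751, ending on the last unreadable sector.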
375 </p><p>
376 At this point we now have:
377 </p><pre class="programlisting">
378 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
379 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
380 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
381 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
382 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
383 </pre><p>
384 </p><p>
385 which is encouraging, since the pending sectors count is now zero.
Note that the drive reallocation count has not yet increased: the
drive may now have confidence in these sectors and have decided not to
reallocate them.
389 </p><p>
390 A device self test:
391 </p><pre class="programlisting">
[root]# smartctl -t long /dev/hda
393 (then wait about an hour) shows no unreadable sectors or errors:
394
395 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
396 # 1 Extended offline Completed without error 00% 692 -
397 # 2 Extended offline Completed: read failure 80% 682 0x021d9f44
398 </pre><p>
399 </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="unassigned"></a>Unassigned sectors</h3></div></div></div><p>
400 This section was written by Kay Diederichs. Even though this section
401 assumes Linux and the ext2/ext3 file system, the strategy should be
402 more generally applicable.
403 </p><p>
I read your badblocks-howto and greatly
benefited from it. One thing that's (maybe) missing is that often the
406 <span><strong class="command">smartctl -t long</strong></span> scan finds a bad sector which is
407 <span class="emphasis"><em> not</em></span> assigned to
408 any file. In that case it does not help to run debugfs, or rather
409 debugfs reports the fact that no file owns that sector. Furthermore,
410 it is somewhat laborious to come up with the correct numbers for
411 debugfs, and debugfs is slow ...
412 </p><p>
So what I suggest when Current_Pending_Sector/Offline_Uncorrectable
errors are present is to create a huge file on that file system.
</p><pre class="programlisting">
dd if=/dev/zero of=/some/mount/point/bigfile bs=4k
</pre><p>
creates the file (<code class="filename">bigfile</code> is only a placeholder name; the
target must be a file on the file system, not the mount point directory
itself). Leave it running until the partition/file system is
full, then delete the file. This will make the disk reallocate those sectors which do not
belong to a file. Check the <span><strong class="command">smartctl -a</strong></span> output after
that and make
sure that the sectors are reallocated. If any remain, use the debugfs
method. Of course the usual caveats apply - back it up first, and so
on.
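</p><p>
The fill-and-release procedure can be sketched as a shell function. The names fill_free_space and zero-fill.tmp are hypothetical, and the optional block limit is an addition that allows a rehearsal on a scratch directory without filling the file system.

```shell
# fill_free_space MOUNT_POINT [MAX_BLOCKS]
# Write zeros into a scratch file until the file system is full (or until
# MAX_BLOCKS 4k blocks have been written), flush to disk, then delete it.
fill_free_space() {
    file="$1/zero-fill.tmp"    # hypothetical scratch file name
    if [ -n "$2" ]; then
        dd if=/dev/zero of="$file" bs=4k count="$2" 2>/dev/null
    else
        # without a limit dd stops with "No space left on device"
        dd if=/dev/zero of="$file" bs=4k 2>/dev/null
    fi
    sync
    rm -f "$file"
}

# Example use (fills the file system mounted at /data):
#   fill_free_space /data
```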
426 </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="reiserfs_ex"></a>ReiserFS example</h3></div></div></div><p>
427 This section was written by Joachim Jautz with additions from Manfred
428 Schwarb.
429 </p><p>
430 The following problems were reported during a scheduled test:
431 </p><pre class="programlisting">
432 smartd[575]: Device: /dev/hda, starting scheduled Offline Immediate Test.
433 [... 1 hour later ...]
434 smartd[575]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
435 smartd[575]: Device: /dev/hda, 1 Offline uncorrectable sectors
436 </pre><p>
437 </p><p>
438 [Step 0] The SMART selftest/error log
439 (see <span><strong class="command">smartctl -l selftest</strong></span>) indicated there was a problem
440 with block address (i.e. the 512 byte sector at) 58656333. The partition
441 table (e.g. see <span><strong class="command">sfdisk -luS /dev/hda</strong></span> or
442 <span><strong class="command">fdisk -ul /dev/hda</strong></span>) indicated that this block was in the
443 <code class="filename">/dev/hda3</code> partition which contained a ReiserFS file
444 system. That partition started at block address 54781650.
445 </p><p>
446 While doing the initial analysis it may also be useful to take a copy
447 of the disk attributes returned by <span><strong class="command">smartctl -A /dev/hda</strong></span>.
Specifically, note the values associated with the "Reallocated_Sector_Ct" and
"Reallocated_Event_Count" attributes for ATA disks, or the grown list (GLIST)
length for SCSI disks. If these are incremented at the end of the procedure
it indicates that the disk has re-allocated one or more sectors.
452 </p><p>
453 [Step 1] Get the file system's block size:
454 </p><pre class="programlisting">
455 # debugreiserfs /dev/hda3 | grep '^Blocksize'
456 Blocksize: 4096
457 </pre><p>
458 </p><p>
459 [Step 2] Calculate the block number:
460 </p><pre class="programlisting">
461 # echo "(58656333-54781650)*512/4096" | bc -l
462 484335.37500000000000000000
463 </pre><p>
It is reassuring that the calculated 4 KB damaged block address in
<code class="filename">/dev/hda3</code> is less than "Count of blocks on the
device" reported in the output of <span><strong class="command">debugreiserfs</strong></span> shown above.
467 </p><p>
468 [Step 3] Try to get more info about this block =&gt; reading the block
469 fails as expected but at least we see now that it seems to be unused.
470 If we do not get the `Cannot read the block' error we should
471 check if our calculation in [Step 2] was correct ;)
472 </p><pre class="programlisting">
473 # debugreiserfs -1 484335 /dev/hda3
474 debugreiserfs 3.6.19 (2003 http://www.namesys.com)
475
476 484335 is free in ondisk bitmap
477 The problem has occurred looks like a hardware problem.
478 </pre><p>
479 </p><p>
480 If you have bad blocks, we advise you to get a new hard drive, because
481 once you get one bad block that the disk drive internals cannot hide from
482 your sight, the chances of getting more are generally said to become
483 much higher (precise statistics are unknown to us), and this disk
484 drive is probably not expensive enough for you to risk your
485 time and data on it. If you don't want to follow that
486 advice then if you have just a few bad blocks, try writing to the
487 bad blocks and see if the drive remaps the bad blocks (that means
it takes a block it has in reserve and allocates it for use for
requests of that block number). If it cannot remap the block, use
490 <span><strong class="command">badblock</strong></span> option (-B) with reiserfs utils to handle
491 this block correctly.
492 </p><pre class="programlisting">
493 bread: Cannot read the block (484335): (Input/output error).
494
495 Aborted
496 </pre><p>
497 So it looks like we have the right (i.e. faulty) block address.
498 </p><p>
499 [Step 4] Try then to find the affected file
500 <sup>[<a name="id4711397" href="#ftn.id4711397">3</a>]</sup>:
501 </p><pre class="programlisting">
502 tar -cO /mydir &gt;/dev/null
503 </pre><p>
504 If you do not find any unreadable files, then the block may be free or
505 located in some metadata of the file system.
506 </p><p>
507 [Step 5] Try your luck: bang the affected block with
508 <span><strong class="command">badblocks -n</strong></span> (non-destructive read-write mode, do unmount
509 first), if you are very lucky the failure is transient and you can provoke
510 reallocation
511 <sup>[<a name="id4711431" href="#ftn.id4711431">4</a>]</sup>:
512 </p><pre class="programlisting">
513 # badblocks -b 4096 -p 3 -s -v -n /dev/hda3 `expr 484335 + 100` `expr 484335 - 100`
514 </pre><p>
515 <sup>[<a name="id4711447" href="#ftn.id4711447">5</a>]</sup>
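</p><p>
Note the argument order: badblocks takes the last block before the first block. A minimal sketch that prints the two bounds in that order; badblocks_range is a hypothetical helper name.

```shell
# badblocks_range CENTER MARGIN: print "last first", the order in which
# badblocks expects its two optional block arguments.
badblocks_range() {
    echo "$(( $1 + $2 )) $(( $1 - $2 ))"
}

badblocks_range 484335 100    # prints "484435 484235"
```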
516 </p><p>
517 check success with <span><strong class="command">debugreiserfs -1 484335 /dev/hda3</strong></span>.
518 Otherwise:
519 </p><p>
520 [Step 6] Perform this step <span class="emphasis"><em>only</em></span> if Step 5 has failed
521 to fix the problem: overwrite that block to force reallocation:
522 </p><pre class="programlisting">
523 # dd if=/dev/zero of=/dev/hda3 count=1 bs=4096 seek=484335
524 1+0 records in
525 1+0 records out
526 4096 bytes transferred in 0.007770 seconds (527153 bytes/sec)
527 </pre><p>
528 </p><p>
529 [Step 7] If you can't rule out the bad block being in metadata, do
530 a file system check:
531 </p><pre class="programlisting">
# reiserfsck --check /dev/hda3
533 </pre><p>
This could take a long time so you had probably better go for lunch ...
535 </p><p>
536 [Step 8] Proceed as stated earlier. For example, sync disk and run a long
537 selftest that should succeed now.
538 </p></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sdisk"></a>Repairs at the disk level</h2></div></div></div><p>
539 This section first looks at a damaged partition table. Then it ignores
540 the upper level impact of a bad block and just repairs the underlying
541 sector so that defective sector will not cause problems in the future.
542 </p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="partition"></a>Partition table problems</h3></div></div></div><p>
543 Some software failures can lead to zeroes or random data being written
544 on the first block of a disk. For disks that use a DOS-based partitioning
545 scheme this will overwrite the partition table which is found at the
546 end of the first block. This is a single point of failure so after the
547 damage tools like <span><strong class="command">fdisk</strong></span> have no alternate data to use
548 so they report no partitions or a damaged partition table.
549 </p><p>
550 One utility that may help is
551 <a href="http://www.cgsecurity.org/wiki/TestDisk" target="_top">
552 <code class="literal">testdisk</code></a> which can scan a disk looking for
553 partitions and recreate a partition table if requested.
554 <sup>[<a name="id4711568" href="#ftn.id4711568">6</a>]</sup>
555 </p><p>
556 Programs that create DOS partitions
557 often place the first partition at logical block address 63. In Linux
558 a loop back mount can be attempted at the appropriate offset of a disk
559 with a damaged partition table. This approach may involve placing the
560 disk with the damaged partition table in a working computer or perhaps
an external USB enclosure. Assuming the disk with the damaged partition
table is <code class="filename">/dev/hdb</code>, the following read-only loop back
mount could be tried:
564 </p><pre class="programlisting">
# mount -r /dev/hdb -o loop,offset=32256 /mnt
566 </pre><p>
567 The offset is in bytes so the number given is (63 * 512). If the file
568 system cannot be identified then a '-t &lt;fs_type&gt;'
569 may be needed (although this is not a good sign). If this mount is
570 successful, a backup procedure is advised.
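</p><p>
If the first partition is suspected to start at a different sector, the byte offset can be recomputed. The sketch below only prints the mount command rather than running it; loop_offset is a hypothetical helper name.

```shell
# loop_offset START_SECTOR: byte offset of a partition that starts at the
# given 512-byte sector, for use with mount -o loop,offset=...
loop_offset() {
    echo $(( $1 * 512 ))
}

echo "mount -r /dev/hdb -o loop,offset=$(loop_offset 63) /mnt"
```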
571 </p><p>
572 Only the primary DOS partitions are recorded in the first block of
573 a disk. The extended DOS partition table is placed elsewhere on
574 a disk. Again there is only one copy of it so it represents another
575 single point of failure. All DOS partition information can be
576 read in a form that can be used to recreate the tables with the
577 <span><strong class="command">sfdisk</strong></span> command. Obviously this needs to be done
578 beforehand and the file put on other media. Here is how to fetch the
579 partition table information:
580 </p><pre class="programlisting">
# sfdisk -dx /dev/hda &gt; my_disk_partition_info.txt
582 </pre><p>
583 Then <code class="filename">my_disk_partition_info.txt</code> should be placed on
584 other media. If disaster strikes, then the disk with the damaged partition
585 table(s) can be placed in a working system, let us say the damaged disk is
586 now at <code class="filename">/dev/hdc</code>, and the following command restores
587 the partition table(s):
588 </p><pre class="programlisting">
# sfdisk -x -O part_block_prior.img /dev/hdc &lt; my_disk_partition_info.txt
590 </pre><p>
591 Since the above command is potentially destructive it takes a copy of the
592 block(s) holding the partition table(s) and puts it in
593 <code class="filename">part_block_prior.img</code> prior to any changes. Then it
594 changes the partition tables as indicated by
595 <code class="filename">my_disk_partition_info.txt</code>. For what it is worth the
596 author did test this on his system!
597 <sup>[<a name="id4711687" href="#ftn.id4711687">7</a>]</sup>
598 </p><p>
599 For creating, destroying, resizing, checking and copying partitions, and
600 the file systems on them, GNU's
601 <a href="http://www.gnu.org/software/parted" target="_top">
602 <code class="literal">parted</code></a> is worth examining.
603 The <a href="http://www.tldp.org/HOWTO/Large-Disk-HOWTO.html" target="_top">
604 <code class="literal">Large Disk HOWTO</code></a> is also a useful resource.
605 </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="lvm"></a>LVM repairs</h3></div></div></div><p>
606 This section was written by Frederic BOITEUX. It was titled: "HOW TO
607 LOCATE AND REPAIR BAD BLOCKS ON AN LVM VOLUME".
608 </p><p>
The self-test log printed by smartctl shows a read failure in a short test:
610 </p><pre class="programlisting">
611 # smartctl -a /dev/hdb
612 ...
613 SMART Self-test log structure revision number 1
614 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
615 # 1 Short offline Completed: read failure 90% 66 37383668
616 </pre><p>
So the disk has a bad block at LBA 37383668.
618 </p><p>
In which physical partition is the bad block?
620 </p><pre class="programlisting">
621 # sfdisk -luS /dev/hdb # or 'fdisk -ul /dev/hdb'
622
623 Disk /dev/hdb: 9729 cylinders, 255 heads, 63 sectors/track
624 Units = sectors of 512 bytes, counting from 0
625
626 Device Boot Start End #sectors Id System
627 /dev/hdb1 63 996029 995967 82 Linux swap / Solaris
628 /dev/hdb2 * 996030 1188809 192780 83 Linux
629 /dev/hdb3 1188810 156296384 155107575 8e Linux LVM
630 /dev/hdb4 0 - 0 0 Empty
631 </pre><p>
632
It is in the <code class="filename">/dev/hdb3</code> partition, an LVM2 partition.
Measured from the beginning of that LVM2 partition, the bad block has an offset of
635 </p><pre class="programlisting">
636 (37383668 - 1188810) = 36194858
637 </pre><p>
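The two steps above (finding the partition, then subtracting its start) can be
checked with a short shell fragment. This is only a sketch: the name:start:end
triples are copied by hand from the sfdisk listing, the variable names are
illustrative, and lines beginning with # are comments, not root prompts.
</p><pre class="programlisting">
lba=37383668
part=""
# name:start:end for each non-empty partition, in 512-byte sectors
for entry in hdb1:63:996029 hdb2:996030:1188809 hdb3:1188810:156296384; do
    name=${entry%%:*}
    rest=${entry#*:}
    start=${rest%%:*}
    end=${rest#*:}
    if [ "$lba" -ge "$start" ]; then
        if [ "$lba" -le "$end" ]; then
            part=$name
            offset=$((lba - start))
        fi
    fi
done
echo "LBA $lba is in /dev/$part at offset $offset"
</pre><p>
With the values above this reports /dev/hdb3 and an offset of 36194858 sectors.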
638 </p><p>
We have to find which LVM2 logical partition the block belongs to.
</p><p>
In which logical partition is the bad block?
</p><p>
<span class="emphasis"><em>IMPORTANT</em></span>: LVM2 can use different schemes for dividing
its physical partitions into logical ones: linear, striped, contiguous or
not... The following example assumes that allocation is linear!
646 </p><p>
The physical partition used by LVM2 is divided into PE (Physical Extent)
units of the same size, starting 'pe_start' 512-byte blocks from
the beginning of the physical partition.
650 </p><p>
The 'pvdisplay' command gives the size of the PE (in KB) of the
LVM partition:
653 </p><pre class="programlisting">
654 # part=/dev/hdb3 ; pvdisplay -c $part | awk -F: '{print $8}'
655 4096
656 </pre><p>
657 </p><p>
To get its size in LBA blocks (512 bytes or 0.5 KB each), we multiply this
number by 2: 4096 * 2 = 8192 blocks for each PE.
660 </p><p>
Finding the offset of the first PE from the beginning of the physical
partition is a bit more difficult. If you have a recent LVM2 version, try:
663 </p><pre class="programlisting">
664 # pvs -o+pe_start $part
665 </pre><p>
666 </p><p>
Alternatively, you can look in /etc/lvm/backup:
668 </p><pre class="programlisting">
669 # grep pe_start $(grep -l $part /etc/lvm/backup/*)
670 pe_start = 384
671 </pre><p>
672 </p><p>
Then we work out which PE contains the bad block, by dividing the bad
block's offset within the physical partition by the PE size:
physical partition's bad block number / sizeof(PE) =
676 </p><pre class="programlisting">
677 36194858 / 8192 = 4418.3176
678 </pre><p>
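The same division can be done with shell integer arithmetic, which discards
the fractional part and so gives the PE number directly. This is a sketch with
illustrative variable names; the second result subtracts pe_start first, on the
assumption that the first PE begins pe_start sectors into the partition, and
shows that the answer does not change in this example.
</p><pre class="programlisting">
pe_kb=4096                   # PE size in KB, from pvdisplay -c
pe_start=384                 # from /etc/lvm/backup
part_offset=36194858         # bad sector, relative to the start of /dev/hdb3

pe_sectors=$((pe_kb * 2))    # 512-byte sectors per PE
pe_index=$((part_offset / pe_sectors))
pe_index_adj=$(( (part_offset - pe_start) / pe_sectors ))
echo "$pe_sectors sectors per PE, bad block is in PE $pe_index ($pe_index_adj)"
</pre><p>
Both calculations give PE 4418 here, with 8192 sectors per PE.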
679 </p><p>
So we have to find which LVM2 logical partition uses PE
number 4418 (counting starts from 0):
682 </p><pre class="programlisting">
683 # lvdisplay --maps |egrep 'Physical|LV Name|Type'
684 LV Name /dev/WDC80Go/racine
685 Type linear
686 Physical volume /dev/hdb3
687 Physical extents 0 to 127
688 LV Name /dev/WDC80Go/usr
689 Type linear
690 Physical volume /dev/hdb3
691 Physical extents 128 to 1407
692 LV Name /dev/WDC80Go/var
693 Type linear
694 Physical volume /dev/hdb3
695 Physical extents 1408 to 1663
696 LV Name /dev/WDC80Go/tmp
697 Type linear
698 Physical volume /dev/hdb3
699 Physical extents 1664 to 1791
700 LV Name /dev/WDC80Go/home
701 Type linear
702 Physical volume /dev/hdb3
703 Physical extents 1792 to 3071
704 LV Name /dev/WDC80Go/ext1
705 Type linear
706 Physical volume /dev/hdb3
707 Physical extents 3072 to 10751
708 LV Name /dev/WDC80Go/ext2
709 Type linear
710 Physical volume /dev/hdb3
711 Physical extents 10752 to 18932
712 </pre><p>
713 </p><p>
So PE #4418 is in the <code class="filename">/dev/WDC80Go/ext1</code>
715 LVM logical partition.
716 </p><p>
Size of the logical block of the file system on
<code class="filename">/dev/WDC80Go/ext1</code>:
</p><p>
It is an ext3 file system, so the block size can be read like this:
721 </p><pre class="programlisting">
722 # dumpe2fs /dev/WDC80Go/ext1 | grep 'Block size'
723 dumpe2fs 1.37 (21-Mar-2005)
724 Block size: 4096
725 </pre><p>
726 </p><p>
Bad block number within the file system:
</p><p>
The logical partition begins on PE 3072:
730 </p><pre class="programlisting">
(first PE of the logical partition * sizeof(PE)) + partition offset [pe_start] =
(3072 * 8192) + 384 = 25166208
733 </pre><p>
That is the 512-byte block of the physical partition at which the logical
partition starts, so the bad block number for the file system is:
736 </p><pre class="programlisting">
737 (36194858 - 25166208) / (sizeof(fs block) / 512)
738 = 11028650 / (4096 / 512) = 1378581.25
739 </pre><p>
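The whole chain from partition offset to file system block can be reproduced
with shell arithmetic. The sketch below uses the figures from this example;
the variable names are illustrative, and the remainder is printed only to show
which 512-byte sector inside the file system block is the bad one.
</p><pre class="programlisting">
part_offset=36194858   # bad sector, relative to the start of /dev/hdb3
pe_sectors=8192        # 512-byte sectors per PE
pe_start=384           # sectors before the first PE
lv_first_pe=3072       # first PE of /dev/WDC80Go/ext1
fs_block_size=4096     # from dumpe2fs

lv_start=$((lv_first_pe * pe_sectors + pe_start))
lv_sector=$((part_offset - lv_start))
sectors_per_block=$((fs_block_size / 512))
fs_block=$((lv_sector / sectors_per_block))
sector_in_block=$((lv_sector % sectors_per_block))
echo "fs block $fs_block (sector $sector_in_block within that block)"
</pre><p>
Integer division drops the fractional .25, leaving file system block 1378581.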
740 </p><p>
Test the file system bad block:
742 </p><pre class="programlisting">
743 dd if=/dev/WDC80Go/ext1 of=block1378581 bs=4096 count=1 skip=1378581
744 </pre><p>
745 </p><p>
If this dd command succeeds, without any error message on the console or in
syslog, then the block number calculation is probably wrong! *Don't*
go any further: re-check it, and if you cannot find the error, please
give up!
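</p><p>
That rule can be written as a small guard. The function below is a
hypothetical helper, not part of any package: it reads the block with dd and
refuses to continue if the read succeeds. When the read fails, it only prints
the zeroing command, so that the final step stays a deliberate manual action.
</p><pre class="programlisting">
# check_then_zero DEVICE FS_BLOCK BLOCK_SIZE   (hypothetical helper)
check_then_zero() {
    dev=$1
    blk=$2
    bs=$3
    if dd if="$dev" of=/dev/null bs="$bs" count=1 skip="$blk"; then
        echo "block $blk reads cleanly: re-check the calculation, nothing written"
        return 1
    fi
    echo "block $blk is unreadable; the write step would be:"
    echo "dd if=/dev/zero of=$dev count=1 bs=$bs seek=$blk"
}
</pre><p>
For this example it would be called as
check_then_zero /dev/WDC80Go/ext1 1378581 4096.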
750 </p><p>
Search and correction follow the same scheme as for simple
partitions:
</p><div class="itemizedlist"><ul type="disc"><li><p>
find any affected files with debugfs (icheck &lt;fs block number&gt;,
then ncheck &lt;inode number reported by icheck&gt;).
</p></li><li><p>
reallocate the bad block by writing zeros to it, *using the fs block size*:
758 </p></li></ul></div><p>
759 </p><p>
760 </p><pre class="programlisting">
761 dd if=/dev/zero of=/dev/WDC80Go/ext1 count=1 bs=4096 seek=1378581
762 </pre><p>
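Before the block is zeroed, the file lookup from the first list item can be
done without an interactive debugfs session, since debugfs accepts a single
request through its -R option. The sketch below only builds and prints the two
command lines; INODE is a placeholder for whatever inode number icheck
reports, and the variable names are illustrative.
</p><pre class="programlisting">
fs_block=1378581
dev=/dev/WDC80Go/ext1

# first request: which inode owns this file system block
icheck_cmd="debugfs -R 'icheck $fs_block' $dev"
# second request: which path name belongs to that inode
ncheck_cmd="debugfs -R 'ncheck INODE' $dev"

echo "$icheck_cmd"
echo "$ncheck_cmd"
</pre><p>
If icheck reports no inode for the block, no file is using it and nothing
needs to be restored after the write.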
763 </p><p>
764 Et voilà !
765 </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="bb"></a>Bad block reassignment</h3></div></div></div><p>
766 The SCSI disk command set and associated disk architecture are assumed
767 in this section. SCSI disks have their own logical to physical mapping
768 allowing a damaged sector (usually carrying 512 bytes of data) to be
769 remapped irrespective of the operating system, file system or software
770 RAID being used.
771 </p><p>
772 The terms <span class="emphasis"><em>block</em></span> and <span class="emphasis"><em>sector</em></span> are
773 used interchangeably, although block tends to get used in higher level or
774 more abstract contexts such as a <span class="emphasis"><em>logical block</em></span>.
775 </p><p>
776 When a SCSI disk is formatted, defective sectors identified during
777 the manufacturing process (the so called primary list: PLIST),
778 those found during the format itself (the certification list: CLIST),
779 those given explicitly to the format command (the DLIST) and optionally
780 the previous grown list (GLIST) are not used in the logical block
781 map. The number (and low level addresses) of the unmapped sectors can be
782 found with the READ DEFECT DATA SCSI command.
783 </p><p>
784 SCSI disks tend to be divided into zones which have spare sectors and
785 perhaps spare tracks, to support the logical block address mapping
786 process. The idea is that if a logical block is remapped, the heads do not
787 have to move a long way to access the replacement sector. Note that spare
788 sectors are a scarce resource.
789 </p><p>
790 Once a SCSI disk format has completed successfully, other problems
791 may appear over time. These fall into two categories:
792 </p><div class="itemizedlist"><ul type="disc"><li><p>
793 recoverable: the Error Correction Codes (ECC) detect a problem
794 but it is small enough to be corrected. Optionally other strategies
795 such as retrying the access may retrieve the data.
796 </p></li><li><p>
797 unrecoverable: try as it may, the disk logic and ECC algorithms
798 cannot recover the data. This is often reported as a
799 <span class="emphasis"><em>medium error</em></span>.
800 </p></li></ul></div><p>
801 </p><p>
802 Other things can go wrong, typically associated with the transport and
803 they will be reported using a term other than
804 <span class="emphasis"><em>medium error</em></span>. For example a disk may decide a read
805 operation was successful but a computer's host bus adapter (HBA) checking
806 the incoming data detects a CRC error due to a bad cable or termination.
807 </p><p>
808 Depending on the disk vendor, recoverable errors can be ignored. After all,
809 some disks have up to 68 bytes of ECC above the payload size of 512 bytes
810 so why use up spare sectors which are limited in number
811 <sup>[<a name="id4712485" href="#ftn.id4712485">8</a>]</sup>
812 ?
813 If the disk can recover the data and does decide to re-allocate (reassign)
814 a sector, then first it checks the settings of the ARRE and AWRE bits in the
815 read-write error recovery mode page. Usually these bits are set
816 <sup>[<a name="id4712514" href="#ftn.id4712514">9</a>]</sup>
817 enabling automatic (read or write) re-allocation. The automatic
818 re-allocation may also fail if the zone (or disk) has run out of spare
819 sectors.
820 </p><p>
821 Another consideration with RAIDs, and applications that require a high
822 data rate without pauses, is that the controller logic may not want a
823 disk to spend too long trying to recover an error.
824 </p><p>
825 Unrecoverable errors will cause a <span class="emphasis"><em>medium error</em></span> sense
826 key, perhaps with some useful additional sense information. If the extended
827 background self test includes a full disk read scan, one would expect the
self test log to list the bad block, as shown in <a href="#rfile" title="Repairs in a file system">the section called &#8220;Repairs in a file system&#8221;</a>.
829 Recent SCSI disks with a periodic background scan should also list
830 unrecoverable read errors (and some recoverable errors as well). The
831 advantage of the background scan is that it runs to completion while self
832 tests will often terminate at the first serious error.
833 </p><p>
834 SCSI disks expect unrecoverable errors to be fixed manually using the
835 REASSIGN BLOCKS SCSI command since loss of data is involved. It is possible
836 that an operating system or a file system could issue the REASSIGN BLOCKS
command itself but the authors are unaware of any examples. The REASSIGN BLOCKS
command will reassign one or more blocks: it attempts to recover the data, at
least partially (a forlorn hope at this stage), fetches an unused spare sector
from the current zone, and adds the damaged old sector to the GLIST (hence the
name "grown" list). The contents of the GLIST may not be that interesting
842 but <span><strong class="command">smartctl</strong></span> prints out the number of entries in the grown
843 list and if that number grows quickly, the disk may be approaching the end
844 of its useful life.
845 </p><p>
846 Here is an alternate brute force technique to consider: if the data on the
847 SCSI or ATA disk has all been backed up (e.g. is held on the other disks in
848 a RAID 5 enclosure), then simply reformatting the disk may be the least
849 cumbersome approach.
850 </p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="sexample"></a>Example</h4></div></div></div><p>
851 Given a "bad block", it still may be useful to look at the
852 <span><strong class="command">fdisk</strong></span> command (if the disk has multiple partitions)
853 to find out which partition is involved, then use
854 <span><strong class="command">debugfs</strong></span> (or a similar tool for the file system in
855 question) to find out which, if any, file or other part of the file system
may have been damaged. This is discussed in <a href="#rfile" title="Repairs in a file system">the section called &#8220;Repairs in a file system&#8221;</a>.
857 </p><p>
858 Then a program that can execute the REASSIGN BLOCKS SCSI command is
859 required. In Linux (2.4 and 2.6 series), FreeBSD, Tru64(OSF) and Windows
860 the author's <span><strong class="command">sg_reassign</strong></span> utility in the sg3_utils
861 package can be used. Also found in that package is
862 <span><strong class="command">sg_verify</strong></span> which can be used to check that a block is
863 readable.
864 </p><p>
865 Assume that logical block address 1193046 (which is 123456 in hex) is
866 corrupt
867 <sup>[<a name="id4712652" href="#ftn.id4712652">10</a>]</sup>
868 on the disk at <code class="filename">/dev/sdb</code>. A long selftest command like
869 <span><strong class="command">smartctl -t long /dev/sdb</strong></span> may result in log results
870 like this:
871 </p><pre class="programlisting">
872 # smartctl -l selftest /dev/sdb
873 smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
874 Home page is http://smartmontools.sourceforge.net/
875
876
877 SMART Self-test log
878 Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
879 Description number (hours)
880 # 1 Background long Failed in segment - 354 1193046 [0x3 0x11 0x0]
881 # 2 Background short Completed - 323 - [- - -]
882 # 3 Background short Completed - 194 - [- - -]
883 </pre><p>
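The bracketed triple at the end of the failed entry is the SCSI sense key
(SK), the additional sense code and its qualifier. As a rough aid, the sense
key can be turned into a name with a small lookup. The function name is
hypothetical and only the first few sense keys are listed; the names are those
used in the SCSI Primary Commands standard.
</p><pre class="programlisting">
# decode_sense_key SK   (hypothetical helper, SK given as in the log)
decode_sense_key() {
    case "$1" in
        0x0) echo "No Sense" ;;
        0x1) echo "Recovered Error" ;;
        0x2) echo "Not Ready" ;;
        0x3) echo "Medium Error" ;;
        0x4) echo "Hardware Error" ;;
        0x5) echo "Illegal Request" ;;
        *)   echo "other" ;;
    esac
}

sk_name=$(decode_sense_key 0x3)
echo "SK 0x3: $sk_name"
</pre><p>
For the log entry above, the sense key 0x3 is a medium error.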
884 </p><p>
885 The <span><strong class="command">sg_verify</strong></span> utility can be used to confirm that there
886 is a problem at that address:
887 </p><pre class="programlisting">
888 # sg_verify --lba=1193046 /dev/sdb
889 verify (10): Fixed format, current; Sense key: Medium Error
890 Additional sense: Unrecovered read error
891 Info fld=0x123456 [1193046]
892 Field replaceable unit code: 228
893 Actual retry count: 0x008b
894 medium or hardware error, reported lba=0x123456
895 </pre><p>
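The self-test log gives the address in decimal while the sense data gives it
in hex. The shell can convert in both directions, as in this small sketch
(the variable names are illustrative):
</p><pre class="programlisting">
lba=1193046
lba_hex=$(printf '0x%x' "$lba")   # decimal to hex
back=$((0x123456))                # hex to decimal
echo "$lba = $lba_hex, and 0x123456 = $back"
</pre><p>
This confirms that both outputs refer to the same logical block.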
896 </p><p>
897 Now the GLIST length is checked before the block reassignment:
898 </p><pre class="programlisting">
899 # sg_reassign --grown /dev/sdb
900 &gt;&gt; Elements in grown defect list: 0
901 </pre><p>
902 </p><p>
903 And now for the actual reassignment followed by another check of the GLIST
904 length:
905 </p><pre class="programlisting">
906 # sg_reassign --address=1193046 /dev/sdb
907
908 # sg_reassign --grown /dev/sdb
909 &gt;&gt; Elements in grown defect list: 1
910 </pre><p>
911 </p><p>
912 The GLIST length has grown by one as expected. If the disk was unable to
913 recover any data, then the "new" block at lba 0x123456 has vendor specific
914 data in it. The <span><strong class="command">sg_reassign</strong></span> utility can also do bulk
915 reassigns, see <span><strong class="command">man sg_reassign</strong></span> for more information.
916 </p><p>
917 The <span><strong class="command">dd</strong></span> command could be used to read the contents of
918 the "new" block:
919 </p><pre class="programlisting">
920 # dd if=/dev/sdb iflag=direct skip=1193046 of=blk.img bs=512 count=1
921 </pre><p>
922 </p><p>
923 and a hex editor
924 <sup>[<a name="id4712776" href="#ftn.id4712776">11</a>]</sup>
925 used to view and potentially change the
926 <code class="filename">blk.img</code> file. An altered <code class="filename">blk.img</code>
927 file (or <code class="filename">/dev/zero</code>) could be written back with:
928 </p><pre class="programlisting">
929 # dd if=blk.img of=/dev/sdb seek=1193046 oflag=direct bs=512 count=1
930 </pre><p>
931 </p><p>
932 More work may be needed at the file system level, especially if the
933 reassigned block held critical file system information such as
934 a superblock or a directory.
935 </p><p>
936 Even if a full backup of the disk is available, or the disk has been
937 "ejected" from a RAID, it may still be worthwhile to reassign the bad
938 block(s) that caused the problem (or simply format the disk (see
939 <span><strong class="command">sg_format</strong></span> in the sg3_utils package)) and re-use the
940 disk later (not unlike the way a replacement disk from a manufacturer
941 might be used).
942 </p><p>
943 CVS $Id: badblockhowto.xml,v 1.4 2007/01/31 13:56:32 dpgilbert Exp $
944 </p></div></div></div><div class="footnotes"><br><hr width="100" align="left"><div class="footnote"><p><sup>[<a name="ftn.id4710480" href="#id4710480">1</a>] </sup>
945 Self-Monitoring, Analysis and Reporting Technology -&gt; SMART
946 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4710574" href="#id4710574">2</a>] </sup>
947 Starting with GNU coreutils release 5.3.0, the <span><strong class="command">dd</strong></span>
948 command in Linux includes the options 'iflag=direct' and 'oflag=direct'.
949 Using these with the <span><strong class="command">dd</strong></span> commands should be helpful,
950 because adding these flags should avoid any interaction
951 with the block buffering IO layer in Linux and permit direct reads/writes
952 from the raw device. Use <span><strong class="command">dd --help</strong></span> to see if your
953 version of dd supports these options. If not, the latest code for dd
954 can be found at <a href="http://alpha.gnu.org/gnu/coreutils" target="_top">
955 <code class="literal">alpha.gnu.org/gnu/coreutils</code></a>.
956 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4711397" href="#id4711397">3</a>] </sup>
957 Do not use <span><strong class="command">tar cf /dev/null</strong></span>, see
958 <span><strong class="command">info tar</strong></span>.
959 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4711431" href="#id4711431">4</a>] </sup>
Important: the range of blocks chosen is arbitrary, but do not test only a
single block, as bad blocks are often social. Do not make the range too large
either, as this test is probably not risk-free.
963 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4711447" href="#id4711447">5</a>] </sup>
964 The rather awkward `expr 484335 + 100` (note the back quotes) can be replaced
965 with $((484335+100)) if the bash shell is being used. Similarly the last
966 argument can become $((484335-100)) .
967 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4711568" href="#id4711568">6</a>] </sup>
968 <span><strong class="command">testdisk</strong></span> scans the media for the beginning of file
969 systems that it recognizes. It can be tricked by data that looks
970 like the beginning of a file system or an old file system from a
971 previous partitioning of the media (disk). So care should be taken.
972 Note that file systems should not overlap apart from the fact that
extended partitions lie wholly within an extended partition table
974 allocation. Also if the root partition of a Linux/Unix installation
975 can be found then the <code class="filename">/etc/fstab</code> file is a useful
976 resource for finding the partition numbers of other partitions.
977 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4711687" href="#id4711687">7</a>] </sup>
978 Thanks to Manfred Schwarb for the information about storing partition
979 table(s) beforehand.
980 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4712485" href="#id4712485">8</a>] </sup>
981 Detecting and fixing an error with ECC "on the fly" and not going the further
982 step and reassigning the block in question may explain why some disks have
983 large numbers in their read error counter log. Various worried users have
984 reported large numbers in the "errors corrected without substantial delay"
985 counter field which is in the "Errors corrected by ECC fast" column in
986 the <span><strong class="command">smartctl -l error</strong></span> output.
987 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4712514" href="#id4712514">9</a>] </sup>
988 Often disks inside a hardware RAID have the ARRE and AWRE bits
989 cleared (disabled) so the RAID controller can do things manually or flag
990 the disk for replacement.
991 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4712652" href="#id4712652">10</a>] </sup>
992 In this case the corruption was manufactured by using the WRITE LONG
993 SCSI command. See <span><strong class="command">sg_write_long</strong></span> in sg3_utils.
994 </p></div><div class="footnote"><p><sup>[<a name="ftn.id4712776" href="#id4712776">11</a>] </sup>
Most window managers have a handy calculator that will do hex to
decimal conversions.
997 </p></div></div></div></body></html>