Commit | Line | Data |
---|---|---|
b53ba588 MR |
1 | .. hwpoison: |
2 | ||
3 | ======== | |
4 | hwpoison | |
5 | ======== | |
6 | ||
f58ee00f | 7 | What is hwpoison? |
b53ba588 | 8 | ================= |
f58ee00f AK |
9 | |
10 | Upcoming Intel CPUs have support for recovering from some memory errors | |
b53ba588 | 11 | (``MCA recovery``). This requires the OS to declare a page "poisoned", |
f58ee00f AK |
12 | kill the processes associated with it and avoid using it in the future. |
13 | ||
14 | This patchkit implements the necessary infrastructure in the VM. | |
15 | ||
22aac857 VS |
16 | To quote the overview comment:: |
17 | ||
18 | High level machine check handler. Handles pages reported by the | |
19 | hardware as being corrupted usually due to a 2bit ECC memory or cache | |
20 | failure. | |
21 | ||
22 | This focusses on pages detected as corrupted in the background. | |
23 | When the current CPU tries to consume corruption the currently | |
24 | running process can just be killed directly instead. This implies | |
25 | that if the error cannot be handled for some reason it's safe to | |
26 | just ignore it because no corruption has been consumed yet. Instead | |
27 | when that happens another machine check will happen. | |
28 | ||
29 | Handles page cache pages in various states. The tricky part | |
30 | here is that we can access any page asynchronous to other VM | |
31 | users, because memory failures could happen anytime and anywhere, | |
32 | possibly violating some of their assumptions. This is why this code | |
33 | has to be extremely careful. Generally it tries to use normal locking | |
34 | rules, as in get the standard locks, even if that means the | |
35 | error handling takes potentially a long time. | |
36 | ||
37 | Some of the operations here are somewhat inefficient and have non | |
38 | linear algorithmic complexity, because the data structures have not | |
39 | been optimized for this case. This is in particular the case | |
40 | for the mapping from a vma to a process. Since this case is expected | |
41 | to be rare we hope we can get away with this. | |
f58ee00f AK |
42 | |
43 | The code consists of the high level handler in mm/memory-failure.c, | |
44 | a new page poison bit and various checks in the VM to handle poisoned | |
45 | pages. | |
46 | ||
47 | The main target right now is KVM guests, but it works for all kinds | |
48 | of applications. KVM support requires a recent qemu-kvm release. | |
49 | ||
50 | For the KVM use there was need for a new signal type so that | |
51 | KVM can inject the machine check into the guest with the proper | |
52 | address. This in theory allows other applications to handle | |
53 | memory failures too. The expectation is that nearly all applications | |
54 | won't do that, but some very specialized ones might. | |
55 | ||
b53ba588 MR |
56 | Failure recovery modes |
57 | ====================== | |
f58ee00f | 58 | |
b53ba588 | 59 | There are two (actually three) modes memory failure recovery can be in: |
f58ee00f AK |
60 | |
61 | vm.memory_failure_recovery sysctl set to zero: | |
62 | All memory failures cause a panic. Do not attempt recovery. | |
f58ee00f AK |
63 | |
64 | early kill | |
65 | (can be controlled globally and per process) | |
66 | Send SIGBUS to the application as soon as the error is detected. | |
67 | This lets applications that can process memory errors gracefully | |
68 | (e.g. by dropping the affected object) do so. | |
69 | This is the mode used by KVM qemu. | |
70 | ||
71 | late kill | |
72 | Send SIGBUS when the application runs into the corrupted page. | |
73 | This is the default and is best for applications unaware of memory errors. | |
74 | Note that some pages are always handled as late kill. | |
75 | ||
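As a rough sketch of how an application might tell the two kill modes apart, the handler below inspects ``si_code`` in the SIGBUS ``siginfo_t``. ``BUS_MCEERR_AO`` (action optional) is what early kill delivers, while ``BUS_MCEERR_AR`` (action required) arrives when a thread actually runs into the corrupted page. The names ``classify_sigbus``, ``install_sigbus_handler`` and ``enum mce_kind`` are made up for this illustration, and the policy shown (record the address, otherwise exit) is only an assumption about what an application might want.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Illustrative names only; none of these are kernel or libc identifiers. */
enum mce_kind { MCE_NONE, MCE_ACTION_OPTIONAL, MCE_ACTION_REQUIRED };

/* Map a SIGBUS si_code to the kind of memory error it reports. */
static enum mce_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AO:	/* early kill: page poisoned, not yet consumed */
		return MCE_ACTION_OPTIONAL;
	case BUS_MCEERR_AR:	/* late kill: this thread touched the page */
		return MCE_ACTION_REQUIRED;
	default:		/* ordinary SIGBUS, unrelated to hwpoison */
		return MCE_NONE;
	}
}

/* Most recent "action optional" report, to be picked up by the main loop. */
static void *volatile last_poisoned_addr;
static volatile sig_atomic_t poison_pending;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (classify_sigbus(info->si_code) == MCE_ACTION_OPTIONAL) {
		/* Only record the event; the main loop drops the object later. */
		last_poisoned_addr = info->si_addr;
		poison_pending = 1;
		return;
	}
	/* For a synchronous fault, returning would re-run the faulting access. */
	_exit(1);
}

/* Install the handler with SA_SIGINFO so si_code and si_addr are available. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A program would call ``install_sigbus_handler()`` once at startup and then poll ``poison_pending`` from its main loop. The handler itself does nothing beyond plain stores and ``_exit()``, so it stays async-signal-safe.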
b53ba588 MR |
76 | User control |
77 | ============ | |
f58ee00f AK |
78 | |
79 | vm.memory_failure_recovery | |
80 | See sysctl.txt | |
81 | ||
82 | vm.memory_failure_early_kill | |
83 | Enable early kill mode globally | |
84 | ||
85 | PR_MCE_KILL | |
86 | Set early/late kill mode/revert to system default | |
b53ba588 MR |
87 | |
88 | arg1: PR_MCE_KILL_CLEAR: | |
89 | Revert to system default | |
90 | arg1: PR_MCE_KILL_SET: | |
91 | arg2 defines thread specific mode | |
92 | ||
93 | PR_MCE_KILL_EARLY: | |
94 | Early kill | |
95 | PR_MCE_KILL_LATE: | |
96 | Late kill | |
97 | PR_MCE_KILL_DEFAULT | |
98 | Use system global default | |
99 | ||
3ba08129 NH |
100 | Note that if you want to have a dedicated thread which handles |
101 | the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should | |
102 | call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, | |
103 | the SIGBUS is sent to the main thread. | |
104 | ||
f58ee00f AK |
105 | PR_MCE_KILL_GET |
106 | Return the current mode | |
107 | ||
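The prctl controls listed above could be wrapped roughly as follows. This is a minimal sketch: the ``mce_*`` helper names are hypothetical, and what this document calls arg1 and arg2 are the second and third parameters of the ``prctl()`` call itself. The remaining arguments are passed as zero, which ``prctl(2)`` requires for unused arguments.

```c
#include <sys/prctl.h>

/* Hypothetical wrappers around prctl(2); the names are for illustration only. */

/* Request early kill for the calling thread. Returns 0, or -1 with errno set. */
static int mce_set_early_kill(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0UL, 0UL);
}

/* Drop the thread specific mode and revert to the system default. */
static int mce_clear_policy(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0UL, 0UL, 0UL);
}

/* Return the current mode (one of the PR_MCE_KILL_* values), or -1 on error. */
static int mce_get_policy(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

/* Human readable name for a value returned by mce_get_policy(). */
static const char *mce_policy_name(int policy)
{
	switch (policy) {
	case PR_MCE_KILL_EARLY:
		return "early";
	case PR_MCE_KILL_LATE:
		return "late";
	case PR_MCE_KILL_DEFAULT:
		return "default";
	default:
		return "unknown";
	}
}
```

Following the note above about a dedicated handler thread, ``mce_set_early_kill()`` would be called from that thread rather than from the main thread.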
b53ba588 MR |
108 | Testing |
109 | ======= | |
f58ee00f | 110 | |
b53ba588 MR |
111 | * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the |
112 | process for testing | |
f58ee00f | 113 | |
b53ba588 | 114 | * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/`` |
f58ee00f | 115 | |
b53ba588 MR |
116 | corrupt-pfn |
117 | Inject hwpoison fault at PFN echoed into this file. This does | |
118 | some early filtering to avoid corrupting unintended pages in test suites. | |
f58ee00f | 119 | |
b53ba588 MR |
120 | unpoison-pfn |
121 | Software-unpoison page at PFN echoed into this file. This way | |
122 | a page can be reused again. This only works for Linux | |
67f22ba7 | 123 | injected failures, not for real memory failures. Once any hardware |
124 | memory failure happens, this feature is disabled. | |
847ce401 | 125 | |
b53ba588 MR |
126 | Note these injection interfaces are not stable and might change between |
127 | kernel versions. | |
847ce401 | 128 | |
b53ba588 MR |
129 | corrupt-filter-dev-major, corrupt-filter-dev-minor |
130 | Only handle memory failures to pages associated with the file | |
131 | system defined by block device major/minor. -1U is the | |
132 | wildcard value. This should only be used for testing with | |
133 | artificial injection. | |
847ce401 | 134 | |
b53ba588 MR |
135 | corrupt-filter-memcg |
136 | Limit injection to pages owned by a memory cgroup (memcg), specified | |
137 | by the inode number of the memcg directory. | |
847ce401 | 138 | |
b53ba588 | 139 | Example:: |
f58ee00f | 140 | |
b53ba588 | 141 | mkdir /sys/fs/cgroup/mem/hwpoison |
7c116f2b | 142 | |
b53ba588 MR |
143 | usemem -m 100 -s 1000 & |
144 | echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks | |
7c116f2b | 145 | |
b53ba588 MR |
146 | memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') |
147 | echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg | |
4fd466eb | 148 | |
b53ba588 MR |
149 | page-types -p `pidof init` --hwpoison # shall do nothing |
150 | page-types -p `pidof usemem` --hwpoison # poison its pages | |
4fd466eb | 151 | |
b53ba588 MR |
152 | corrupt-filter-flags-mask, corrupt-filter-flags-value |
153 | When specified, only poison pages if ((page_flags & mask) == | |
154 | value). This allows stress testing of many kinds of | |
155 | pages. The page_flags are the same as in /proc/kpageflags. The | |
156 | flag bits are defined in include/linux/kernel-page-flags.h and | |
1ad1335d | 157 | documented in Documentation/admin-guide/mm/pagemap.rst |
4fd466eb | 158 | |
b53ba588 | 159 | * Architecture specific MCE injector |
4fd466eb | 160 | |
b53ba588 | 161 | x86 has mce-inject, mce-test |
4fd466eb | 162 | |
b53ba588 | 163 | Some portable hwpoison test programs in mce-test, see below. |
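
The ``madvise(MADV_HWPOISON, ....)`` interface from the list above can be exercised with a few lines of C. The sketch below is only an illustration: ``poison_one_test_page`` is a hypothetical helper, and it assumes the caller is root on a kernel built with memory failure support. It maps one anonymous page, touches it so that a physical page exists, poisons it, and unmaps it again without touching it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical test helper: poison one freshly mapped anonymous page.
 * Returns 0 on success, or -1 with errno set by the failing call.
 */
static int poison_one_test_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *page;
	int saved_errno = 0;

	if (page_size <= 0)
		return -1;

	page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;

	page[0] = 1;	/* touch it so a physical page is actually allocated */

	if (madvise(page, (size_t)page_size, MADV_HWPOISON) != 0)
		saved_errno = errno;

	/* Do not read or write the page from here on: that would raise SIGBUS. */
	munmap(page, (size_t)page_size);

	if (saved_errno) {
		errno = saved_errno;
		return -1;
	}
	return 0;
}
```

Expect the call to fail with ``EPERM`` when run without sufficient privileges, and with ``EINVAL`` on kernels that lack memory failure support. On success the underlying physical page stays poisoned after the function returns, so this is for test machines only.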
478c5ffc | 164 | |
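The ``corrupt-filter-flags-mask`` / ``corrupt-filter-flags-value`` rule described above, ``(page_flags & mask) == value``, can be checked offline against flags read from /proc/kpageflags. The following is a minimal sketch of that comparison, not the kernel's implementation. ``filter_flags_match`` is a hypothetical name, and the three ``KPF_*`` bit numbers are reproduced here on the assumption that they match kernel-page-flags.h in your tree, so verify them before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers assumed to match kernel-page-flags.h; check your kernel tree. */
#define KPF_DIRTY	4
#define KPF_LRU		5
#define KPF_ANON	12

#define KPF_BIT(nr)	(UINT64_C(1) << (nr))

/* True if a page with these /proc/kpageflags bits passes the filter. */
static bool filter_flags_match(uint64_t page_flags, uint64_t mask,
			       uint64_t value)
{
	return (page_flags & mask) == value;
}
```

For example, a mask of ``KPF_BIT(KPF_LRU) | KPF_BIT(KPF_ANON)`` with a value of ``KPF_BIT(KPF_LRU)`` selects pages that are on the LRU and are not anonymous, whatever their other flags. Bits outside the mask, such as ``KPF_DIRTY`` here, are ignored.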
b53ba588 MR |
165 | References |
166 | ========== | |
f58ee00f AK |
167 | |
168 | http://halobates.de/mce-lc09-2.pdf | |
169 | Overview presentation from LinuxCon 09 | |
170 | ||
171 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | |
172 | Test suite (hwpoison specific portable tests in tsrc) | |
173 | ||
174 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | |
175 | x86 specific injector | |
176 | ||
177 | ||
b53ba588 MR |
178 | Limitations |
179 | =========== | |
f58ee00f | 180 | - Not all page types are supported and never will be. Most kernel internal |
b53ba588 | 181 | objects cannot be recovered, only LRU pages for now. |
f58ee00f AK |
182 | |
183 | --- | |
184 | Andi Kleen, Oct 2009 |