Commit | Line | Data |
---|---|---|
b53ba588 MR |
1 | .. hwpoison: |
2 | ||
3 | ======== | |
4 | hwpoison | |
5 | ======== | |
6 | ||
f58ee00f | 7 | What is hwpoison? |
b53ba588 | 8 | ================= |
f58ee00f AK |
9 | |
10 | Upcoming Intel CPUs have support for recovering from some memory errors | |
b53ba588 | 11 | (``MCA recovery``). This requires the OS to declare a page "poisoned", |
f58ee00f AK |
12 | kill the processes associated with it and avoid using it in the future. |
13 | ||
14 | This patchkit implements the necessary infrastructure in the VM. | |
15 | ||
22aac857 VS |
16 | To quote the overview comment:: |
17 | ||
18 | High level machine check handler. Handles pages reported by the | |
19 | hardware as being corrupted usually due to a 2bit ECC memory or cache | |
20 | failure. | |
21 | ||
22 | This focusses on pages detected as corrupted in the background. | |
23 | When the current CPU tries to consume corruption the currently | |
24 | running process can just be killed directly instead. This implies | |
25 | that if the error cannot be handled for some reason it's safe to | |
26 | just ignore it because no corruption has been consumed yet. Instead | |
27 | when that happens another machine check will happen. | |
28 | ||
29 | Handles page cache pages in various states. The tricky part | |
30 | here is that we can access any page asynchronous to other VM | |
31 | users, because memory failures could happen anytime and anywhere, | |
32 | possibly violating some of their assumptions. This is why this code | |
33 | has to be extremely careful. Generally it tries to use normal locking | |
34 | rules, as in get the standard locks, even if that means the | |
35 | error handling takes potentially a long time. | |
36 | ||
37 | Some of the operations here are somewhat inefficient and have non | |
38 | linear algorithmic complexity, because the data structures have not | |
39 | been optimized for this case. This is in particular the case | |
40 | for the mapping from a vma to a process. Since this case is expected | |
41 | to be rare we hope we can get away with this. | |
f58ee00f AK |
42 | |
43 | The code consists of the high-level handler in mm/memory-failure.c, | |
44 | a new page poison bit and various checks in the VM to handle poisoned | |
45 | pages. | |
46 | ||
47 | The main target right now is KVM guests, but it works for all kinds | |
48 | of applications. KVM support requires a recent qemu-kvm release. | |
49 | ||
50 | For the KVM use there was a need for a new signal type so that | |
51 | KVM can inject the machine check into the guest with the proper | |
52 | address. This in theory allows other applications to handle | |
53 | memory failures too. The expectation is that nearly all applications | |
54 | won't do that, but some very specialized ones might. | |
55 | ||
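As an illustration of how an application might recognise these signals, the following is a minimal sketch of a SIGBUS handler that distinguishes memory failure reports from ordinary bus errors. It assumes the standard ``si_code`` values ``BUS_MCEERR_AO`` (error found in the background, action optional) and ``BUS_MCEERR_AR`` (the process consumed the corruption, action required). The names ``mce_kind``, ``classify_sigbus``, ``on_sigbus`` and ``install_mce_handler`` are hypothetical and exist only for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* ask libc for the full siginfo_t definitions */
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; not part of any kernel or libc API. */
enum mce_kind { MCE_NONE, MCE_ACTION_OPTIONAL, MCE_ACTION_REQUIRED };

static volatile sig_atomic_t last_mce_kind = MCE_NONE;
static void *volatile last_mce_addr = NULL;

/* Map a SIGBUS si_code to the kind of memory failure it reports. */
static enum mce_kind classify_sigbus(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AO: /* error detected in the background */
        return MCE_ACTION_OPTIONAL;
    case BUS_MCEERR_AR: /* the process touched the corrupted page */
        return MCE_ACTION_REQUIRED;
    default:            /* an ordinary SIGBUS, not a memory failure */
        return MCE_NONE;
    }
}

/* Only record what happened; real recovery belongs outside the handler. */
static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_mce_kind = classify_sigbus(info->si_code);
    last_mce_addr = info->si_addr; /* address inside the affected page */
}

/* Install the handler; returns 0 on success, -1 on failure (see errno). */
static int install_mce_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO; /* needed to receive si_code and si_addr */
    sa.sa_sigaction = on_sigbus;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A program would call ``install_mce_handler()`` once at startup and inspect ``last_mce_kind`` and ``last_mce_addr`` from its main loop. What to do next (for example dropping a cache object that covers the address) is application specific and not shown here.
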
b53ba588 MR |
56 | Failure recovery modes |
57 | ====================== | |
f58ee00f | 58 | |
b53ba588 | 59 | There are two (actually three) modes memory failure recovery can be in: |
f58ee00f AK |
60 | |
61 | vm.memory_failure_recovery sysctl set to zero: | |
62 | All memory failures cause a panic. Do not attempt recovery. | |
63 | (On x86 this can also be affected by the tolerant level of the | |
64 | MCE subsystem.) | |
65 | ||
66 | early kill | |
67 | (can be controlled globally and per process) | |
68 | Send SIGBUS to the application as soon as the error is detected. | |
69 | This allows applications that can process memory errors gracefully | |
70 | (e.g. by dropping the affected object) to recover. | |
71 | This is the mode used by KVM qemu. | |
72 | ||
73 | late kill | |
74 | Send SIGBUS when the application runs into the corrupted page. | |
75 | This is best for applications unaware of memory errors and is the default. | |
76 | Note that some pages are always handled as late kill. | |
77 | ||
b53ba588 MR |
78 | User control |
79 | ============ | |
f58ee00f AK |
80 | |
81 | vm.memory_failure_recovery | |
82 | See sysctl.txt | |
83 | ||
84 | vm.memory_failure_early_kill | |
85 | Enable early kill mode globally | |
86 | ||
87 | PR_MCE_KILL | |
88 | Set early/late kill mode/revert to system default | |
b53ba588 MR |
89 | |
90 | arg1: PR_MCE_KILL_CLEAR: | |
91 | Revert to system default | |
92 | arg1: PR_MCE_KILL_SET: | |
93 | arg2 defines thread specific mode | |
94 | ||
95 | PR_MCE_KILL_EARLY: | |
96 | Early kill | |
97 | PR_MCE_KILL_LATE: | |
98 | Late kill | |
99 | PR_MCE_KILL_DEFAULT: | |
100 | Use system global default | |
101 | ||
3ba08129 NH |
102 | Note that if you want to have a dedicated thread which handles |
103 | the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should | |
104 | call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) on the designated thread. Otherwise, | |
105 | the SIGBUS is sent to the main thread. | |
106 | ||
f58ee00f AK |
107 | PR_MCE_KILL_GET |
108 | Return the current mode | |
109 | ||
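The prctl options listed above can be wrapped as follows. This is a minimal sketch that assumes a Linux libc whose ``<sys/prctl.h>`` provides the ``PR_MCE_KILL*`` constants. The helper names ``mce_kill_set``, ``mce_kill_clear`` and ``mce_kill_get`` are hypothetical. Note that in the actual ``prctl()`` call the option comes first, so what this document calls arg1 and arg2 are the second and third parameters.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: set the memory failure kill policy of the calling
 * thread. mode is PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or
 * PR_MCE_KILL_DEFAULT. The unused trailing arguments must be zero.
 * Returns 0 on success, -1 on failure (see errno).
 */
static int mce_kill_set(int mode)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
                 (unsigned long)mode, 0UL, 0UL);
}

/* Hypothetical helper: revert the calling thread to the system default. */
static int mce_kill_clear(void)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR,
                 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: query the current policy. Returns one of the
 * PR_MCE_KILL_EARLY / PR_MCE_KILL_LATE / PR_MCE_KILL_DEFAULT values,
 * or -1 on failure.
 */
static int mce_kill_get(void)
{
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}
```

A dedicated handler thread would call ``mce_kill_set(PR_MCE_KILL_EARLY)`` from within that thread before waiting for SIGBUS, since the setting applies to the caller.
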
b53ba588 MR |
110 | Testing |
111 | ======= | |
f58ee00f | 112 | |
b53ba588 MR |
113 | * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the |
114 | process for testing | |
f58ee00f | 115 | |
b53ba588 | 116 | * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/`` |
f58ee00f | 117 | |
b53ba588 MR |
118 | corrupt-pfn |
119 | Inject hwpoison fault at PFN echoed into this file. This does | |
120 | some early filtering to avoid corrupting unintended pages in test suites. | |
f58ee00f | 121 | |
b53ba588 MR |
122 | unpoison-pfn |
123 | Software-unpoison page at PFN echoed into this file. This way | |
124 | a page can be reused. This only works for Linux-injected | |
125 | failures, not for real memory failures. | |
847ce401 | 126 | |
b53ba588 MR |
127 | Note these injection interfaces are not stable and might change between |
128 | kernel versions. | |
847ce401 | 129 | |
b53ba588 MR |
130 | corrupt-filter-dev-major, corrupt-filter-dev-minor |
131 | Only handle memory failures to pages associated with the file | |
132 | system defined by block device major/minor. -1U is the | |
133 | wildcard value. This should only be used for testing with | |
134 | artificial injection. | |
847ce401 | 135 | |
b53ba588 MR |
136 | corrupt-filter-memcg |
137 | Limit injection to pages owned by a memory cgroup, specified by the inode | |
138 | number of the memcg. | |
847ce401 | 139 | |
b53ba588 | 140 | Example:: |
f58ee00f | 141 | |
b53ba588 | 142 | mkdir /sys/fs/cgroup/mem/hwpoison |
7c116f2b | 143 | |
b53ba588 MR |
144 | usemem -m 100 -s 1000 & |
145 | echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks | |
7c116f2b | 146 | |
b53ba588 MR |
147 | memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') |
148 | echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg | |
4fd466eb | 149 | |
b53ba588 MR |
150 | page-types -p `pidof init` --hwpoison # shall do nothing |
151 | page-types -p `pidof usemem` --hwpoison # poison its pages | |
4fd466eb | 152 | |
b53ba588 MR |
153 | corrupt-filter-flags-mask, corrupt-filter-flags-value |
154 | When specified, only poison pages if ((page_flags & mask) == | |
155 | value). This allows stress testing of many kinds of | |
156 | pages. The page_flags are the same as in /proc/kpageflags. The | |
157 | flag bits are defined in include/linux/kernel-page-flags.h and | |
1ad1335d | 158 | documented in Documentation/admin-guide/mm/pagemap.rst. |
4fd466eb | 159 | |
b53ba588 | 160 | * Architecture specific MCE injector |
4fd466eb | 161 | |
b53ba588 | 162 | x86 has mce-inject, mce-test |
4fd466eb | 163 | |
b53ba588 | 164 | Some portable hwpoison test programs in mce-test, see below. |
478c5ffc | 165 | |
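The corrupt-filter-flags-mask / corrupt-filter-flags-value comparison described above can be sketched as a small predicate. This models only the ``(page_flags & mask) == value`` test from the text, not the rest of the kernel's filtering. The function name ``filter_flags_match`` is hypothetical, and the three bit numbers are copied from include/linux/kernel-page-flags.h for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as defined in include/linux/kernel-page-flags.h. */
#define KPF_DIRTY 4
#define KPF_LRU   5
#define KPF_ANON  12

/*
 * Hypothetical helper: true if a page with the given /proc/kpageflags
 * value would pass the flags filter. Bits set in mask are the ones that
 * are checked; value gives the required state of those bits.
 */
static bool filter_flags_match(uint64_t page_flags, uint64_t mask,
                               uint64_t value)
{
    return (page_flags & mask) == value;
}
```

For example, a mask of ``(1 << KPF_LRU) | (1 << KPF_DIRTY)`` with a value of ``(1 << KPF_LRU)`` selects pages that are on the LRU and not dirty, regardless of any other flag such as KPF_ANON.
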
b53ba588 MR |
166 | References |
167 | ========== | |
f58ee00f AK |
168 | |
169 | http://halobates.de/mce-lc09-2.pdf | |
170 | Overview presentation from LinuxCon 09 | |
171 | ||
172 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | |
173 | Test suite (hwpoison specific portable tests in tsrc) | |
174 | ||
175 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | |
176 | x86 specific injector | |
177 | ||
178 | ||
b53ba588 MR |
179 | Limitations |
180 | =========== | |
f58ee00f | 181 | - Not all page types are supported and never will be. Most kernel internal |
b53ba588 | 182 | objects cannot be recovered, only LRU pages for now. |
f58ee00f AK |
183 | - Right now hugepage support is missing. |
184 | ||
185 | --- | |
186 | Andi Kleen, Oct 2009 |