Commit | Line | Data |
---|---|---|
47402400 ZY |
1 | The PCI Express Advanced Error Reporting Driver Guide HOWTO |
2 | T. Long Nguyen <tom.l.nguyen@intel.com> | |
3 | Yanmin Zhang <yanmin.zhang@intel.com> | |
4 | 07/29/2006 | |
5 | ||
6 | ||
7 | 1. Overview | |
8 | ||
9 | 1.1 About this guide | |
10 | ||
11 | This guide describes the basics of the PCI Express Advanced Error | |
12 | Reporting (AER) driver and provides information on how to use it, as | |
13 | well as how to enable the drivers of endpoint devices to conform with | |
14 | the PCI Express AER driver. | |
15 | ||
89713422 | 16 | 1.2 Copyright (C) Intel Corporation 2006. |
47402400 ZY |
17 | |
18 | 1.3 What is the PCI Express AER Driver? | |
19 | ||
20 | PCI Express error signaling can occur on the PCI Express link itself | |
21 | or on behalf of transactions initiated on the link. PCI Express | |
22 | defines two error reporting paradigms: the baseline capability and | |
23 | the Advanced Error Reporting capability. The baseline capability is | |
24 | required of all PCI Express components providing a minimum defined | |
25 | set of error reporting requirements. Advanced Error Reporting | |
26 | capability is implemented with a PCI Express advanced error reporting | |
27 | extended capability structure providing more robust error reporting. | |
28 | ||
29 | The PCI Express AER driver provides the infrastructure to support PCI | |
30 | Express Advanced Error Reporting capability. The PCI Express AER | |
31 | driver provides three basic functions: | |
32 | ||
33 | - Gathers comprehensive error information when errors occur. | |
34 | - Reports errors to the user. | |
35 | - Performs error recovery actions. | |
36 | ||
37 | The AER driver attaches only to root ports that support the PCI Express | |
38 | AER capability. | |
39 | ||
40 | ||
41 | 2. User Guide | |
42 | ||
43 | 2.1 Include the PCI Express AER Root Driver into the Linux Kernel | |
44 | ||
45 | The PCI Express AER Root driver is a Root Port service driver attached | |
46 | to the PCI Express Port Bus driver. To use it, the driver must be | |
47 | compiled into the kernel. The CONFIG_PCIEAER option enables it and | |
48 | depends on CONFIG_PCIEPORTBUS, so set CONFIG_PCIEPORTBUS=y and | |
49 | CONFIG_PCIEAER=y. | |
50 | ||
51 | 2.2 Load PCI Express AER Root Driver | |
7ece1417 BH |
52 | |
53 | Some systems have AER support in firmware. Enabling Linux AER support at | |
54 | the same time the firmware handles AER may result in unpredictable | |
55 | behavior. Therefore, Linux does not handle AER events unless the firmware | |
56 | grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 | |
57 | Specification for details regarding _OSC usage. | |
47402400 ZY |
58 | |
59 | 2.3 AER error output | |
7ece1417 BH |
60 | |
61 | When a PCIe AER error is captured, an error message will be output to | |
62 | console. If it's a correctable error, it is output as a warning. | |
47402400 ZY |
63 | Otherwise, it is printed as an error. Users can therefore choose a |
64 | log level that filters out correctable error messages. | |
65 | ||
d4dfd727 HS |
66 | An example is shown below: |
67 | 0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID) | |
68 | 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 | |
69 | 0000:50:00.0: [20] Unsupported Request (First) | |
70 | 0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 | |
47402400 ZY |
71 | |
72 | In the example, 'Requester ID' is the ID of the device that sent | |
73 | the error message to the root port. Refer to the PCI Express | |
74 | specification for the other fields. | |
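Two of the fields in the example log can be decoded by hand. The sketch below is a userspace illustration, not kernel code: it splits the 16-bit Requester ID into bus, device, and function numbers (bits 15:8, 7:3, and 2:0) and finds the set bit in the error status word. The function and struct names are hypothetical, and the bit-name table lists only a few Uncorrectable Error Status bits.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit PCI Express Requester ID. */
struct requester_id {
    unsigned bus;       /* bits 15:8 */
    unsigned device;    /* bits 7:3  */
    unsigned function;  /* bits 2:0  */
};

/* Split a Requester ID such as the id=0500 value in the log. */
static struct requester_id decode_requester_id(uint16_t id)
{
    struct requester_id r;

    r.bus = (id >> 8) & 0xff;
    r.device = (id >> 3) & 0x1f;
    r.function = id & 0x07;
    return r;
}

/* Return the index of the lowest set bit in a status word, or -1 if none. */
static int lowest_status_bit(uint32_t status)
{
    for (int bit = 0; bit < 32; bit++)
        if (status & (UINT32_C(1) << bit))
            return bit;
    return -1;
}

/* Partial name table for Uncorrectable Error Status bits (illustrative). */
static const char *uncorrectable_bit_name(int bit)
{
    assert(bit >= 0 && bit < 32);
    switch (bit) {
    case 4:  return "Data Link Protocol";
    case 12: return "Poisoned TLP";
    case 18: return "Malformed TLP";
    case 20: return "Unsupported Request";
    default: return "(not listed in this sketch)";
    }
}
```

Applied to the log above, decode_requester_id(0x0500) gives bus 0x05, device 0x00, function 0, and lowest_status_bit(0x00100000) gives 20, which matches the "[20] Unsupported Request" line.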
75 | ||
76 | ||
77 | 3. Developer Guide | |
78 | ||
79 | Enabling AER-aware support requires a software driver to configure | |
80 | the AER capability structure within its device and to provide callbacks. | |
81 | ||
82 | To support AER well, developers first need to understand how AER | |
83 | works. | |
84 | ||
85 | PCI Express errors are classified into two types: correctable errors | |
86 | and uncorrectable errors. This classification is based on the impact | |
87 | of those errors, which may result in degraded performance or function | |
88 | failure. | |
89 | ||
90 | Correctable errors have no impact on the functionality of the | |
91 | interface. The PCI Express protocol can recover without any software | |
92 | intervention or any loss of data. These errors are detected and | |
93 | corrected by hardware. Unlike correctable errors, uncorrectable | |
94 | errors impact functionality of the interface. Uncorrectable errors | |
95 | can cause a particular transaction or a particular PCI Express link | |
96 | to be unreliable. Depending on those error conditions, uncorrectable | |
97 | errors are further classified into non-fatal errors and fatal errors. | |
98 | Non-fatal errors cause the particular transaction to be unreliable, | |
99 | but the PCI Express link itself is fully functional. Fatal errors, on | |
100 | the other hand, cause the link to be unreliable. | |
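The fatal versus non-fatal split can be illustrated with a small userspace sketch. It assumes the convention of the AER Uncorrectable Error Severity register, where a set bit marks the corresponding uncorrectable error as fatal and a clear bit marks it as non-fatal. The enum and function names are hypothetical and are not the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical severity labels for this sketch. */
enum aer_sketch_severity {
    SKETCH_SEV_NONE,      /* no uncorrectable error bit is set */
    SKETCH_SEV_NONFATAL,  /* transaction unreliable, link still functional */
    SKETCH_SEV_FATAL      /* link itself is unreliable */
};

/*
 * status:   value of the Uncorrectable Error Status register
 * severity: value of the Uncorrectable Error Severity register
 *           (bit set = the matching error is treated as fatal)
 */
static enum aer_sketch_severity
classify_uncorrectable(uint32_t status, uint32_t severity)
{
    if (status == 0)
        return SKETCH_SEV_NONE;
    if (status & severity)
        return SKETCH_SEV_FATAL;
    return SKETCH_SEV_NONFATAL;
}
```

For example, a status of 0x00100000 is classified as non-fatal when the severity register is 0 and as fatal when bit 20 of the severity register is set.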
101 | ||
102 | When AER is enabled, a PCI Express device will automatically send an | |
89713422 | 103 | error message to the PCIe root port above it when the device captures |
47402400 ZY |
104 | an error. The Root Port, upon receiving an error reporting message, |
105 | internally processes and logs the error message in its PCI Express | |
106 | capability structure. Logging the error includes storing the | |
107 | error reporting agent's requester ID in the Error Source | |
108 | Identification Registers and setting the error bits of the Root Error | |
109 | Status Register accordingly. If AER error reporting is enabled in the | |
110 | Root Error Command Register, the Root Port generates an interrupt when | |
111 | an error is detected. | |
112 | ||
113 | Note that the errors as described above are related to the PCI Express | |
114 | hierarchy and links. These errors do not include any device specific | |
115 | errors because device specific errors will still get sent directly to | |
116 | the device driver. | |
117 | ||
118 | 3.1 Configure the AER capability structure | |
119 | ||
120 | AER-aware drivers of PCI Express components need to change the device | |
121 | control registers to enable AER. They can also change AER registers, | |
122 | including the mask and severity registers. The helper function | |
123 | pci_enable_pcie_error_reporting can be used to enable AER. See | |
124 | section 3.3. | |
125 | ||
126 | 3.2 Provide callbacks | |
127 | ||
128 | 3.2.1 Callback reset_link to reset the PCI Express link | |
129 | ||
130 | This callback is used to reset the PCI Express physical link when a | |
131 | fatal error happens. The Root Port AER service driver provides a | |
132 | default reset_link function, but different upstream ports might | |
133 | have different requirements for resetting the PCI Express link, so all | |
134 | upstream ports should provide their own reset_link functions. | |
135 | ||
136 | In struct pcie_port_service_driver, a new pointer, reset_link, is | |
137 | added. | |
138 | ||
139 | pci_ers_result_t (*reset_link) (struct pci_dev *dev); | |
140 | ||
141 | Section 3.2.2.2 provides more detailed info on when to call | |
142 | reset_link. | |
143 | ||
144 | 3.2.2 PCI error-recovery callbacks | |
145 | ||
146 | The PCI Express AER Root driver uses error callbacks to coordinate | |
147 | with downstream device drivers associated with a hierarchy in question | |
148 | when performing error recovery actions. | |
149 | ||
150 | The pci_driver data structure has a pointer, err_handler, that points | |
151 | to a pci_error_handlers structure consisting of several callback function | |
152 | pointers. The AER driver follows the rules defined in | |
153 | pci-error-recovery.txt except for PCI Express specific parts (e.g. | |
154 | reset_link). Refer to pci-error-recovery.txt for detailed | |
155 | definitions of the callbacks. | |
156 | ||
157 | The sections below specify when the error callback functions are called. | |
158 | ||
159 | 3.2.2.1 Correctable errors | |
160 | ||
161 | Correctable errors have no impact on the functionality of | |
162 | the interface. The PCI Express protocol can recover without any | |
163 | software intervention or any loss of data. These errors do not | |
164 | require any recovery actions. The AER driver clears the device's | |
165 | correctable error status register accordingly and logs these errors. | |
166 | ||
167 | 3.2.2.2 Uncorrectable (non-fatal and fatal) errors | |
168 | ||
169 | If an error message indicates a non-fatal error, performing a link reset | |
170 | upstream is not required. The AER driver calls error_detected(dev, | |
171 | pci_channel_io_normal) for all drivers associated with the hierarchy in | |
172 | question. For example, | |
173 | EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. | |
174 | If Upstream Port A captures an AER error, the hierarchy consists of | |
175 | Downstream Port B and EndPoint. | |
176 | ||
177 | A driver may return PCI_ERS_RESULT_CAN_RECOVER, | |
178 | PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on | |
179 | whether it can recover. If it can, the AER driver calls mmio_enabled next. | |
180 | ||
181 | If an error message indicates a fatal error, the kernel broadcasts | |
182 | error_detected(dev, pci_channel_io_frozen) to all drivers within | |
183 | the hierarchy in question. Then, performing a link reset upstream is | |
184 | necessary. As different kinds of devices might use different approaches | |
185 | to reset the link, the AER port service driver is required to provide the | |
186 | function to reset the link. First, the kernel checks whether the upstream | |
187 | component has an AER driver. If it does, the kernel uses the reset_link | |
188 | callback of that driver. If the upstream component has no AER driver | |
89713422 HS |
189 | and the port is a downstream port, the kernel performs a hot reset as the |
190 | default by setting the Secondary Bus Reset bit of the Bridge Control | |
191 | register associated with the downstream port. As for upstream ports, | |
47402400 ZY |
192 | they should provide their own AER service drivers with a reset_link |
193 | function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and | |
194 | reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes | |
195 | to mmio_enabled. | |
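The decision described in this section can be modeled as a small pure function. This is a simplified userspace sketch, not the AER driver's code: the enum names are hypothetical stand-ins for the pci_ers_result_t and pci_channel_state values, results from multiple drivers are assumed to be already merged into one value, and the mapping of NEED_RESET to a slot_reset step is an assumption taken from the general rules in pci-error-recovery.txt rather than from this guide.

```c
#include <assert.h>

/* Hypothetical stand-ins for the PCI_ERS_RESULT_* values. */
enum sketch_ers_result {
    SKETCH_CAN_RECOVER,
    SKETCH_NEED_RESET,
    SKETCH_DISCONNECT,
    SKETCH_RECOVERED
};

/* What the recovery sequence does after error_detected (and reset_link). */
enum sketch_next_step {
    SKETCH_STEP_MMIO_ENABLED,
    SKETCH_STEP_SLOT_RESET,       /* assumption, see lead-in */
    SKETCH_STEP_RECOVERY_FAILED
};

/*
 * fatal:      nonzero for a fatal error (drivers saw pci_channel_io_frozen),
 *             zero for a non-fatal error (pci_channel_io_normal).
 * detected:   merged result of the error_detected callbacks.
 * link_reset: result of reset_link; only consulted for fatal errors.
 */
static enum sketch_next_step
next_recovery_step(int fatal, enum sketch_ers_result detected,
                   enum sketch_ers_result link_reset)
{
    assert(detected != SKETCH_RECOVERED); /* not a valid error_detected result here */

    if (detected == SKETCH_DISCONNECT)
        return SKETCH_STEP_RECOVERY_FAILED;
    if (fatal && link_reset != SKETCH_RECOVERED)
        return SKETCH_STEP_RECOVERY_FAILED;
    if (detected == SKETCH_CAN_RECOVER)
        return SKETCH_STEP_MMIO_ENABLED;
    return SKETCH_STEP_SLOT_RESET;
}
```

With these definitions, a fatal error where error_detected reports CAN_RECOVER and reset_link reports RECOVERED moves on to the mmio_enabled step, as the text states. A non-fatal error skips the link reset, so the link_reset argument is ignored.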
196 | ||
197 | 3.3 Helper functions | |
198 | ||
270c66be | 199 | 3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev); |
47402400 ZY |
200 | pci_enable_pcie_error_reporting enables the device to send error |
201 | messages to the root port when an error is detected. Note that devices | |
202 | do not enable error reporting by default, so device drivers need to | |
203 | call this function to enable it. | |
204 | ||
270c66be | 205 | 3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev); |
47402400 ZY |
206 | pci_disable_pcie_error_reporting stops the device from sending error |
207 | messages to the root port when an error is detected. | |
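As a rough illustration of what enabling and disabling error reporting means at the register level, the userspace sketch below models the PCI Express Device Control register as a plain 16-bit value. It assumes that the two helpers amount to setting or clearing the four error reporting enable bits in that register (bits 0 to 3); the macro and function names are hypothetical, and a real driver would simply call the helpers on its struct pci_dev.

```c
#include <assert.h>
#include <stdint.h>

/* Error reporting enable bits in the PCI Express Device Control register. */
#define SKETCH_DEVCTL_CERE  0x0001  /* Correctable Error Reporting Enable */
#define SKETCH_DEVCTL_NFERE 0x0002  /* Non-Fatal Error Reporting Enable */
#define SKETCH_DEVCTL_FERE  0x0004  /* Fatal Error Reporting Enable */
#define SKETCH_DEVCTL_URRE  0x0008  /* Unsupported Request Reporting Enable */

#define SKETCH_DEVCTL_ERR_FLAGS \
    (SKETCH_DEVCTL_CERE | SKETCH_DEVCTL_NFERE | \
     SKETCH_DEVCTL_FERE | SKETCH_DEVCTL_URRE)

/* Model of enabling error reporting: set all four enable bits. */
static uint16_t devctl_enable_error_reporting(uint16_t devctl)
{
    return (uint16_t)(devctl | SKETCH_DEVCTL_ERR_FLAGS);
}

/* Model of disabling error reporting: clear all four enable bits. */
static uint16_t devctl_disable_error_reporting(uint16_t devctl)
{
    uint16_t result = (uint16_t)(devctl & ~SKETCH_DEVCTL_ERR_FLAGS);

    assert((result & SKETCH_DEVCTL_ERR_FLAGS) == 0);
    return result;
}
```

Starting from a register value of 0x2810, enabling yields 0x281f and disabling that result returns 0x2810; the bits outside the low nibble are left untouched.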
208 | ||
270c66be | 209 | 3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); |
47402400 ZY |
210 | pci_cleanup_aer_uncorrect_error_status clears the uncorrectable |
211 | error status register. | |
212 | ||
213 | 3.4 Frequently Asked Questions | |
214 | ||
215 | Q: What happens if a PCI Express device driver does not provide an | |
216 | error recovery handler (pci_driver->err_handler is equal to NULL)? | |
217 | ||
218 | A: The devices attached to the driver won't be recovered. If the | |
219 | error is fatal, the kernel will print warning messages. Please refer | |
220 | to section 3 for more information. | |
221 | ||
222 | Q: What happens if an upstream port service driver does not provide | |
223 | callback reset_link? | |
224 | ||
225 | A: Fatal error recovery will fail if the errors are reported by the | |
226 | upstream ports to which the service driver is attached. | |
227 | ||
228 | Q: How does this infrastructure deal with a driver that is not PCI | |
229 | Express aware? | |
230 | ||
231 | A: This infrastructure calls the error callback functions of the | |
232 | driver when an error happens. But if the driver is not aware of | |
233 | PCI Express, the device might not report its own errors to the root | |
234 | port. | |
235 | ||
236 | Q: What modifications will that driver need to make it compatible | |
237 | with the PCI Express AER Root driver? | |
238 | ||
239 | A: It could call the helper functions to enable AER in devices and | |
240 | clear the uncorrectable error status register. Refer to section 3.3. | |
241 | ||
bfe5a740 HY |
242 | |
243 | 4. Software error injection | |
244 | ||
89713422 | 245 | Debugging PCIe AER error recovery code is quite difficult because it |
bfe5a740 | 246 | is hard to trigger real hardware errors. Software based error |
89713422 | 247 | injection can be used to fake various kinds of PCIe errors. |
bfe5a740 | 248 | |
89713422 | 249 | First, enable PCIe AER software error injection in the kernel
bfe5a740 HY |
250 | configuration; that is, the following item should be in your .config: |
251 | ||
252 | CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m | |
253 | ||
254 | After rebooting with the new kernel or inserting the module, a device | |
255 | file named /dev/aer_inject should be created. | |
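Before running the injection tool, it can be useful to confirm that the device file is actually present. The userspace sketch below uses stat() to check whether a path exists and is a character device; the function name is hypothetical, and treating /dev/aer_inject as a character device node is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Return 1 if path exists and is a character device node, 0 otherwise. */
static int is_char_device(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return 0;   /* missing path or no permission to look it up */
    return S_ISCHR(st.st_mode) ? 1 : 0;
}
```

A caller would check is_char_device("/dev/aer_inject") and, if it returns 0, verify that CONFIG_PCIEAER_INJECT is set and that the module is loaded.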
256 | ||
257 | Then, you need a user space tool named aer-inject, which can be obtained | |
258 | from: | |
2eb6a4b2 | 259 | https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ |
bfe5a740 HY |
260 | |
261 | More information about aer-inject can be found in the documentation that | |
262 | comes with its source code.