]>
Commit | Line | Data |
---|---|---|
151f4e2b | 1 | ==================== |
1da177e4 | 2 | PCI Power Management |
151f4e2b | 3 | ==================== |
1da177e4 | 4 | |
b7999570 RW |
5 | Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. |
6 | ||
7 | An overview of concepts and the Linux kernel's interfaces related to PCI power | |
8 | management. Based on previous work by Patrick Mochel <mochel@transmeta.com> | |
9 | (and others). | |
1da177e4 | 10 | |
b7999570 RW |
11 | This document only covers the aspects of power management specific to PCI |
12 | devices. For a general description of the kernel's interfaces related to device | |
66ccc64f | 13 | power management refer to Documentation/driver-api/pm/devices.rst and |
151f4e2b | 14 | Documentation/power/runtime_pm.rst. |
1da177e4 | 15 | |
151f4e2b | 16 | .. contents: |
1da177e4 | 17 | |
151f4e2b MCC |
18 | 1. Hardware and Platform Support for PCI Power Management |
19 | 2. PCI Subsystem and Device Power Management | |
20 | 3. PCI Device Drivers and Power Management | |
21 | 4. Resources | |
b7999570 RW |
22 | |
23 | ||
24 | 1. Hardware and Platform Support for PCI Power Management | |
25 | ========================================================= | |
26 | ||
27 | 1.1. Native and Platform-Based Power Management | |
28 | ----------------------------------------------- | |
151f4e2b | 29 | |
b7999570 RW |
30 | In general, power management is a feature allowing one to save energy by putting |
31 | devices into states in which they draw less power (low-power states) at the | |
32 | price of reduced functionality or performance. | |
33 | ||
34 | Usually, a device is put into a low-power state when it is underutilized or | |
35 | completely inactive. However, when it is necessary to use the device once | |
36 | again, it has to be put back into the "fully functional" state (full-power | |
37 | state). This may happen when there are some data for the device to handle or | |
38 | as a result of an external event requiring the device to be active, which may | |
39 | be signaled by the device itself. | |
40 | ||
41 | PCI devices may be put into low-power states in two ways: by using the device | |
42 | capabilities introduced by the PCI Bus Power Management Interface Specification, | |
43 | or with the help of platform firmware, such as an ACPI BIOS. In the first | |
44 | approach, which is referred to as native PCI power management (native PCI PM) | |
45 | in what follows, the device power state is changed as a result of writing a | |
46 | specific value into one of its standard configuration registers. The second | |
47 | approach requires the platform firmware to provide special methods that may be | |
48 | used by the kernel to change the device's power state. | |
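As a rough illustration of the native approach described above, the sketch below shows how the PowerState field of the Power Management Control/Status Register (PMCSR) could be updated. The offset and mask values follow the PCI PM Spec layout, with names modeled on the kernel's pci_regs.h. The helper functions are hypothetical and operate on a plain integer rather than on real configuration space.

```c
#include <stdint.h>

/* PMCSR layout per the PCI PM Spec (names modeled on Linux's pci_regs.h). */
#define PM_CTRL_OFFSET      4       /* PMCSR offset inside the PM capability */
#define PM_CTRL_STATE_MASK  0x0003  /* PowerState field, bits 1:0 */

/*
 * Hypothetical helper: compute the PMCSR value that selects the given
 * state (0..3 for D0..D3hot) while leaving all other bits untouched.
 */
static uint16_t pmcsr_with_state(uint16_t pmcsr, unsigned int state)
{
	return (uint16_t)((pmcsr & ~PM_CTRL_STATE_MASK) |
			  (state & PM_CTRL_STATE_MASK));
}

/* Hypothetical helper: extract the current power state from a PMCSR value. */
static unsigned int pmcsr_state(uint16_t pmcsr)
{
	return pmcsr & PM_CTRL_STATE_MASK;
}
```

In a real driver or in the PCI core the value would be read from and written back to the device's configuration space at the PM capability offset plus PM_CTRL_OFFSET. This sketch only models the bit manipulation.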
49 | ||
50 | Devices supporting the native PCI PM usually can generate wakeup signals called | |
51 | Power Management Events (PMEs) to let the kernel know about external events | |
52 | requiring the device to be active. After receiving a PME the kernel is supposed | |
53 | to put the device that sent it into the full-power state. However, the PCI Bus | |
54 | Power Management Interface Specification doesn't define any standard method of | |
55 | delivering the PME from the device to the CPU and the operating system kernel. | |
56 | It is assumed that the platform firmware will perform this task and therefore, | |
57 | even though a PCI device is set up to generate PMEs, it also may be necessary to | |
58 | prepare the platform firmware for notifying the CPU of the PMEs coming from the | |
59 | device (e.g. by generating interrupts). | |
60 | ||
61 | In turn, if the methods provided by the platform firmware are used for changing | |
62 | the power state of a device, usually the platform also provides a method for | |
63 | preparing the device to generate wakeup signals. In that case, however, it | |
64 | often also is necessary to prepare the device for generating PMEs using the | |
65 | native PCI PM mechanism, because the method provided by the platform depends on | |
66 | that. | |
67 | ||
68 | Thus in many situations both the native and the platform-based power management | |
69 | mechanisms have to be used simultaneously to obtain the desired result. | |
70 | ||
71 | 1.2. Native PCI Power Management | |
72 | -------------------------------- | |
151f4e2b | 73 | |
b7999570 RW |
74 | The PCI Bus Power Management Interface Specification (PCI PM Spec) was |
75 | introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a | |
76 | standard interface for performing various operations related to power | |
77 | management. | |
78 | ||
79 | The implementation of the PCI PM Spec is optional for conventional PCI devices, | |
80 | but it is mandatory for PCI Express devices. If a device supports the PCI PM | |
81 | Spec, it has an 8-byte power management capability field in its PCI | |
82 | configuration space. This field is used to describe and control the standard | |
83 | features related to the native PCI power management. | |
84 | ||
85 | The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses | |
86 | (B0-B3). The higher the number, the less power is drawn by the device or bus | |
87 | in that state. However, the higher the number, the longer the latency for | |
88 | the device or bus to return to the full-power state (D0 or B0, respectively). | |
89 | ||
90 | There are two variants of the D3 state defined by the specification. The first | |
91 | one is D3hot, referred to as the software accessible D3, because devices can be | |
92 | programmed to go into it. The second one, D3cold, is the state that PCI devices | |
93 | are in when the supply voltage (Vcc) is removed from them. It is not possible | |
94 | to program a PCI device to go into D3cold, although there may be a programmable | |
95 | interface for putting the bus the device is on into a state in which Vcc is | |
96 | removed from all devices on the bus. | |
97 | ||
98 | PCI bus power management, however, is not supported by the Linux kernel at the | |
99 | time of this writing and therefore it is not covered by this document. | |
100 | ||
101 | Note that every PCI device can be in the full-power state (D0) or in D3cold, | |
102 | regardless of whether or not it implements the PCI PM Spec. In addition to | |
103 | that, if the PCI PM Spec is implemented by the device, it must support D3hot | |
104 | as well as D0. The support for the D1 and D2 power states is optional. | |
105 | ||
106 | PCI devices supporting the PCI PM Spec can be programmed to go to any of the | |
107 | supported low-power states (except for D3cold). While in D1-D3hot the | |
108 | standard configuration registers of the device must be accessible to software | |
109 | (i.e. the device is required to respond to PCI configuration accesses), although | |
110 | its I/O and memory spaces are then disabled. This allows the device to be | |
111 | programmatically put into D0. Thus the kernel can switch the device back and | |
112 | forth between D0 and the supported low-power states (except for D3cold) and the | |
113 | possible power state transitions the device can undergo are the following: | |
114 | ||
115 | +----------------------------+ | |
116 | | Current State | New State  | | |
117 | +----------------------------+ | |
118 | | D0            | D1, D2, D3 | | |
119 | +----------------------------+ | |
120 | | D1            | D2, D3     | | |
121 | +----------------------------+ | |
122 | | D2            | D3         | | |
123 | +----------------------------+ | |
124 | | D1, D2, D3    | D0         | | |
125 | +----------------------------+ | |
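The transition table above can be expressed as a small predicate. This is only a sketch: the enum and the function name are made up for illustration, and D3 here means D3hot, since D3cold cannot be entered by programming the device.

```c
#include <stdbool.h>

/* Illustrative state numbering; PM_D3HOT stands for the "D3" table entries. */
enum pm_state { PM_D0 = 0, PM_D1 = 1, PM_D2 = 2, PM_D3HOT = 3 };

/*
 * Hypothetical check mirroring the table: from D0, D1 or D2 the device may
 * move to any deeper state, and from any low-power state it may return to D0.
 */
static bool transition_allowed(enum pm_state from, enum pm_state to)
{
	if (from == to)
		return false;	/* not a transition at all */
	if (to == PM_D0)
		return true;	/* D1, D2, D3 -> D0 */
	return to > from;	/* only moves to a deeper state are listed */
}
```

For example, D1 to D3 is accepted, while D2 to D1 is rejected because the table only allows a direct return to D0 from a low-power state.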
126 | ||
127 | The transition from D3cold to D0 occurs when the supply voltage is provided to | |
128 | the device (i.e. power is restored). In that case the device returns to D0 with | |
129 | a full power-on reset sequence and the power-on defaults are restored to the | |
130 | device by hardware just as at initial power up. | |
131 | ||
132 | PCI devices supporting the PCI PM Spec can be programmed to generate PMEs | |
133 | while in a low-power state (D1-D3), but they are not required to be capable | |
134 | of generating PMEs from all supported low-power states. In particular, the | |
135 | capability of generating PMEs from D3cold is optional and depends on the | |
136 | presence of additional voltage (3.3Vaux) allowing the device to remain | |
137 | sufficiently active to generate a wakeup signal. | |
138 | ||
139 | 1.3. ACPI Device Power Management | |
140 | --------------------------------- | |
151f4e2b | 141 | |
b7999570 RW |
142 | The platform firmware support for the power management of PCI devices is |
143 | system-specific. However, if the system in question is compliant with the | |
144 | Advanced Configuration and Power Interface (ACPI) Specification, like the | |
145 | majority of x86-based systems, it is supposed to implement device power | |
146 | management interfaces defined by the ACPI standard. | |
147 | ||
148 | For this purpose the ACPI BIOS provides special functions called "control | |
149 | methods" that may be executed by the kernel to perform specific tasks, such as | |
150 | putting a device into a low-power state. These control methods are encoded | |
151 | using a special byte-code language called the ACPI Machine Language (AML) and | |
152 | stored in the machine's BIOS. The kernel loads them from the BIOS and executes | |
153 | them as needed using an AML interpreter that translates the AML byte code into | |
154 | computations and memory or I/O space accesses. This way, in theory, a BIOS | |
155 | writer can provide the kernel with a means to perform actions depending | |
156 | on the system design in a system-specific fashion. | |
157 | ||
158 | ACPI control methods may be divided into global control methods, that are not | |
159 | associated with any particular devices, and device control methods, that have | |
160 | to be defined separately for each device supposed to be handled with the help of | |
161 | the platform. This means, in particular, that ACPI device control methods can | |
162 | only be used to handle devices that the BIOS writer knew about in advance. The | |
163 | ACPI methods used for device power management fall into that category. | |
164 | ||
165 | The ACPI specification assumes that devices can be in one of four power states | |
166 | labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM | |
167 | D0-D3 states (although the difference between D3hot and D3cold is not taken | |
168 | into account by ACPI). Moreover, for each power state of a device there is a | |
169 | set of power resources that have to be enabled for the device to be put into | |
170 | that state. These power resources are controlled (i.e. enabled or disabled) | |
171 | with the help of their own control methods, _ON and _OFF, that have to be | |
172 | defined individually for each of them. | |
173 | ||
174 | To put a device into the ACPI power state Dx (where x is a number between 0 and | |
175 | 3 inclusive) the kernel is supposed to (1) enable the power resources required | |
176 | by the device in this state using their _ON control methods and (2) execute the | |
177 | _PSx control method defined for the device. In addition to that, if the device | |
178 | is going to be put into a low-power state (D1-D3) and is supposed to generate | |
179 | wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI | |
180 | 3.0) control method defined for it has to be executed before _PSx. Power | |
181 | resources that are not required by the device in the target power state and are | |
182 | not required any more by any other device should be disabled (by executing their | |
183 | _OFF control methods). If the current power state of the device is D3, it can | |
184 | only be put into D0 this way. | |
185 | ||
186 | However, quite often the power states of devices are changed during a | |
187 | system-wide transition into a sleep state or back into the working state. ACPI | |
188 | defines four system sleep states, S1, S2, S3, and S4, and denotes the system | |
189 | working state as S0. In general, the target system sleep (or working) state | |
190 | determines the highest power (lowest number) state the device can be put | |
191 | into and the kernel is supposed to obtain this information by executing the | |
192 | device's _SxD control method (where x is a number between 0 and 4 inclusive). | |
193 | If the device is required to wake up the system from the target sleep state, the | |
194 | lowest power (highest number) state it can be put into is also determined by the | |
195 | target state of the system. The kernel is then supposed to use the device's | |
196 | _SxW control method to obtain the number of that state. It also is supposed to | |
197 | use the device's _PRW control method to learn which power resources need to be | |
198 | enabled for the device to be able to generate wakeup signals. | |
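The constraints described above can be summarized as a range of permitted D-states. The sketch below assumes the values returned by _SxD and _SxW are already available as plain integers. The function name, and the choice to pick the deepest permitted state and to return -1 on conflicting constraints, are illustrative assumptions and not the kernel's actual policy.

```c
/*
 * Hypothetical helper: pick a device state (0..3 for D0..D3) for a given
 * target system sleep state.
 *
 *   sxd    - value of _SxD: shallowest (lowest-numbered) state permitted
 *   sxw    - value of _SxW: deepest state from which wakeup still works
 *   wakeup - nonzero if the device has to be able to wake up the system
 *
 * Returns the deepest permitted state, or -1 if the constraints conflict.
 */
static int acpi_target_dstate(int sxd, int sxw, int wakeup)
{
	int d_min = sxd;
	int d_max = wakeup ? sxw : 3;

	if (d_max < d_min)
		return -1;
	return d_max;
}
```

With _SxD reporting 2 and _SxW reporting 3, a wakeup-capable device would be placed in D3 by this sketch. Without the wakeup requirement only the _SxD lower bound applies.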
199 | ||
200 | 1.4. Wakeup Signaling | |
201 | --------------------- | |
151f4e2b | 202 | |
b7999570 RW |
203 | Wakeup signals generated by PCI devices, either as native PCI PMEs, or as |
204 | a result of the execution of the _DSW (or _PSW) ACPI control method before | |
205 | putting the device into a low-power state, have to be caught and handled as | |
206 | appropriate. If they are sent while the system is in the working state | |
207 | (ACPI S0), they should be translated into interrupts so that the kernel can | |
208 | put the devices generating them into the full-power state and take care of the | |
209 | events that triggered them. In turn, if they are sent while the system is | |
210 | sleeping, they should cause the system's core logic to trigger wakeup. | |
211 | ||
212 | On ACPI-based systems wakeup signals sent by conventional PCI devices are | |
213 | converted into ACPI General-Purpose Events (GPEs) which are hardware signals | |
214 | from the system core logic generated in response to various events that need to | |
215 | be acted upon. Every GPE is associated with one or more sources of potentially | |
216 | interesting events. In particular, a GPE may be associated with a PCI device | |
217 | capable of signaling wakeup. The information on the connections between GPEs | |
218 | and event sources is recorded in the system's ACPI BIOS from where it can be | |
219 | read by the kernel. | |
220 | ||
221 | If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE | |
222 | associated with it (if there is one) is triggered. The GPEs associated with PCI | |
223 | bridges may also be triggered in response to a wakeup signal from one of the | |
224 | devices below the bridge (this also is the case for root bridges) and, for | |
225 | example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be | |
226 | handled this way. | |
227 | ||
228 | A GPE may be triggered when the system is sleeping (i.e. when it is in one of | |
229 | the ACPI S1-S4 states), in which case system wakeup is started by its core logic | |
230 | (the device that was the source of the signal causing the system wakeup to occur | |
231 | may be identified later). The GPEs used in such situations are referred to as | |
232 | wakeup GPEs. | |
233 | ||
234 | Usually, however, GPEs are also triggered when the system is in the working | |
235 | state (ACPI S0) and in that case the system's core logic generates a System | |
236 | Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI | |
237 | handler identifies the GPE that caused the interrupt to be generated which, | |
238 | in turn, allows the kernel to identify the source of the event (that may be | |
239 | a PCI device signaling wakeup). The GPEs used for notifying the kernel of | |
240 | events occurring while the system is in the working state are referred to as | |
241 | runtime GPEs. | |
242 | ||
243 | Unfortunately, there is no standard way of handling wakeup signals sent by | |
244 | conventional PCI devices on systems that are not ACPI-based, but there is one | |
245 | for PCI Express devices. Namely, the PCI Express Base Specification introduced | |
246 | a native mechanism for converting native PCI PMEs into interrupts generated by | |
247 | root ports. For conventional PCI devices native PMEs are out-of-band, so they | |
248 | are routed separately and they need not pass through bridges (in principle they | |
249 | may be routed directly to the system's core logic), but for PCI Express devices | |
250 | they are in-band messages that have to pass through the PCI Express hierarchy, | |
251 | including the root port on the path from the device to the Root Complex. Thus | |
252 | it was possible to introduce a mechanism by which a root port generates an | |
253 | interrupt whenever it receives a PME message from one of the devices below it. | |
254 | The PCI Express Requester ID of the device that sent the PME message is then | |
255 | recorded in one of the root port's configuration registers from where it may be | |
256 | read by the interrupt handler allowing the device to be identified. [PME | |
257 | messages sent by PCI Express endpoints integrated with the Root Complex don't | |
258 | pass through root ports, but instead they cause a Root Complex Event Collector | |
259 | (if there is one) to generate interrupts.] | |
260 | ||
261 | In principle the native PCI Express PME signaling may also be used on ACPI-based | |
262 | systems along with the GPEs, but to use it the kernel has to ask the system's | |
263 | ACPI BIOS to release control of root port configuration registers. The ACPI | |
264 | BIOS, however, is not required to allow the kernel to control these registers | |
265 | and if it doesn't do that, the kernel must not modify their contents. Of course | |
266 | the native PCI Express PME signaling cannot be used by the kernel in that case. | |
267 | ||
268 | ||
269 | 2. PCI Subsystem and Device Power Management | |
270 | ============================================ | |
271 | ||
272 | 2.1. Device Power Management Callbacks | |
273 | -------------------------------------- | |
151f4e2b | 274 | |
b7999570 RW |
275 | The PCI Subsystem participates in the power management of PCI devices in a |
276 | number of ways. First of all, it provides an intermediate code layer between | |
277 | the device power management core (PM core) and PCI device drivers. | |
278 | Specifically, the pm field of the PCI subsystem's struct bus_type object, | |
279 | pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing | |
151f4e2b | 280 | pointers to several device power management callbacks:: |
b7999570 | 281 | |
151f4e2b | 282 | const struct dev_pm_ops pci_dev_pm_ops = { |
b7999570 RW |
283 | .prepare = pci_pm_prepare, |
284 | .complete = pci_pm_complete, | |
285 | .suspend = pci_pm_suspend, | |
286 | .resume = pci_pm_resume, | |
287 | .freeze = pci_pm_freeze, | |
288 | .thaw = pci_pm_thaw, | |
289 | .poweroff = pci_pm_poweroff, | |
290 | .restore = pci_pm_restore, | |
291 | .suspend_noirq = pci_pm_suspend_noirq, | |
292 | .resume_noirq = pci_pm_resume_noirq, | |
293 | .freeze_noirq = pci_pm_freeze_noirq, | |
294 | .thaw_noirq = pci_pm_thaw_noirq, | |
295 | .poweroff_noirq = pci_pm_poweroff_noirq, | |
296 | .restore_noirq = pci_pm_restore_noirq, | |
297 | .runtime_suspend = pci_pm_runtime_suspend, | |
298 | .runtime_resume = pci_pm_runtime_resume, | |
299 | .runtime_idle = pci_pm_runtime_idle, | |
151f4e2b | 300 | }; |
b7999570 RW |
301 | |
302 | These callbacks are executed by the PM core in various situations related to | |
303 | device power management and they, in turn, execute power management callbacks | |
304 | provided by PCI device drivers. They also perform power management operations | |
305 | involving some standard configuration registers of PCI devices that device | |
306 | drivers need not know or care about. | |
307 | ||
308 | The structure representing a PCI device, struct pci_dev, contains several fields | |
151f4e2b | 309 | that these callbacks operate on:: |
b7999570 | 310 | |
151f4e2b | 311 | struct pci_dev { |
b7999570 RW |
312 | ... |
313 | pci_power_t current_state; /* Current operating state. */ | |
314 | int pm_cap; /* PM capability offset in the | |
315 | configuration space */ | |
316 | unsigned int pme_support:5; /* Bitmask of states from which PME# | |
317 | can be generated */ | |
318 | unsigned int pme_interrupt:1;/* Is native PCIe PME signaling used? */ | |
319 | unsigned int d1_support:1; /* Low power state D1 is supported */ | |
320 | unsigned int d2_support:1; /* Low power state D2 is supported */ | |
321 | unsigned int no_d1d2:1; /* D1 and D2 are forbidden */ | |
322 | unsigned int wakeup_prepared:1; /* Device prepared for wake up */ | |
323 | unsigned int d3_delay; /* D3->D0 transition time in ms */ | |
324 | ... | |
151f4e2b | 325 | }; |
b7999570 RW |
326 | |
327 | They also indirectly use some fields of the struct device that is embedded in | |
328 | struct pci_dev. | |
329 | ||
330 | 2.2. Device Initialization | |
331 | -------------------------- | |
151f4e2b | 332 | |
b7999570 RW |
333 | The PCI subsystem's first task related to device power management is to |
334 | prepare the device for power management and initialize the fields of struct | |
335 | pci_dev used for this purpose. This happens in two functions defined in | |
336 | drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init(). | |
337 | ||
338 | The first of these functions checks if the device supports native PCI PM | |
339 | and if that's the case the offset of its power management capability structure | |
340 | in the configuration space is stored in the pm_cap field of the device's struct | |
341 | pci_dev object. Next, the function checks which PCI low-power states are | |
342 | supported by the device and from which low-power states the device can generate | |
343 | native PCI PMEs. The power management fields of the device's struct pci_dev and | |
344 | the struct device embedded in it are updated accordingly and the generation of | |
345 | PMEs by the device is disabled. | |
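The capability checks described above amount to decoding the Power Management Capabilities (PMC) register. The following is a minimal user-space sketch of that decoding. The bit definitions follow the PCI PM Spec, with names modeled on the kernel's pci_regs.h, while struct pm_caps and decode_pmc() are hypothetical and only mimic the corresponding struct pci_dev fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* PMC register bits per the PCI PM Spec (names modeled on pci_regs.h). */
#define PM_CAP_D1         0x0200  /* D1 supported */
#define PM_CAP_D2         0x0400  /* D2 supported */
#define PM_CAP_PME_MASK   0xF800  /* PME_Support field, bits 15:11 */
#define PM_CAP_PME_SHIFT  11

/* Hypothetical mirror of the related struct pci_dev fields. */
struct pm_caps {
	bool d1_support;
	bool d2_support;
	unsigned int pme_support; /* bit n set: PME possible from state n
				     (0..3 = D0..D3hot, 4 = D3cold) */
};

/* Decode a raw PMC register value into the fields above. */
static struct pm_caps decode_pmc(uint16_t pmc)
{
	struct pm_caps caps;

	caps.d1_support = (pmc & PM_CAP_D1) != 0;
	caps.d2_support = (pmc & PM_CAP_D2) != 0;
	caps.pme_support = (pmc & PM_CAP_PME_MASK) >> PM_CAP_PME_SHIFT;
	return caps;
}
```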
346 | ||
347 | The second function checks if the device can be prepared to signal wakeup with | |
348 | the help of the platform firmware, such as the ACPI BIOS. If that is the case, | |
349 | the function updates the wakeup fields in struct device embedded in the | |
350 | device's struct pci_dev and uses the firmware-provided method to prevent the | |
351 | device from signaling wakeup. | |
352 | ||
353 | At this point the device is ready for power management. For driverless devices, | |
354 | however, this functionality is limited to a few basic operations carried out | |
355 | during system-wide transitions to a sleep state and back to the working state. | |
356 | ||
357 | 2.3. Runtime Device Power Management | |
358 | ------------------------------------ | |
151f4e2b | 359 | |
b7999570 RW |
360 | The PCI subsystem plays a vital role in the runtime power management of PCI |
361 | devices. For this purpose it uses the general runtime power management | |
151f4e2b MCC |
362 | (runtime PM) framework described in Documentation/power/runtime_pm.rst. |
363 | Namely, it provides subsystem-level callbacks:: | |
b7999570 RW |
364 | |
365 | pci_pm_runtime_suspend() | |
366 | pci_pm_runtime_resume() | |
367 | pci_pm_runtime_idle() | |
368 | ||
369 | that are executed by the core runtime PM routines. It also implements the | |
370 | entire mechanics necessary for handling runtime wakeup signals from PCI devices | |
371 | in low-power states, which at the time of this writing works for both the native | |
372 | PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in | |
373 | Section 1. | |
374 | ||
375 | First, a PCI device is put into a low-power state, or suspended, with the help | |
376 | of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call | |
377 | pci_pm_runtime_suspend() to do the actual job. For this to work, the device's | |
378 | driver has to provide a pm->runtime_suspend() callback (see below), which is | |
379 | run by pci_pm_runtime_suspend() as the first action. If the driver's callback | |
380 | returns successfully, the device's standard configuration registers are saved, | |
381 | the device is prepared to generate wakeup signals and, finally, it is put into | |
382 | the target low-power state. | |
383 | ||
384 | The low-power state to put the device into is the lowest-power (highest number) | |
385 | state from which it can signal wakeup. The exact method of signaling wakeup is | |
386 | system-dependent and is determined by the PCI subsystem on the basis of the | |
387 | reported capabilities of the device and the platform firmware. To prepare the | |
388 | device for signaling wakeup and put it into the selected low-power state, the | |
389 | PCI subsystem can use the platform firmware as well as the device's native PCI | |
390 | PM capabilities, if supported. | |
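For the native PCI PM part only, the selection rule above can be sketched as a scan over the pme_support bitmask. This is a simplified model under the assumption that platform firmware imposes no further constraints. The function name is made up, and D3cold is left out because the device cannot be programmed into it.

```c
/*
 * Hypothetical helper: return the deepest state (3 = D3hot ... 1 = D1) from
 * which the device can generate PMEs according to its pme_support bitmask
 * (bit n set means PMEs can be generated from state n). If no low-power
 * state qualifies, return 0 (D0).
 */
static unsigned int deepest_wakeup_state(unsigned int pme_support)
{
	unsigned int state = 3;

	while (state > 0 && !(pme_support & (1u << state)))
		state--;
	return state;
}
```

A device reporting PME support from D0, D3hot and D3cold (bitmask 0x19) would thus be put into D3hot, whereas one that can signal PMEs only from D1 and D2 (bitmask 0x06) would be put into D2.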
391 | ||
392 | It is expected that the device driver's pm->runtime_suspend() callback will | |
393 | not attempt to prepare the device for signaling wakeup or to put it into a | |
394 | low-power state. The driver ought to leave these tasks to the PCI subsystem | |
395 | that has all of the information necessary to perform them. | |
396 | ||
397 | A suspended device is brought back into the "active" state, or resumed, | |
398 | with the help of pm_request_resume() or pm_runtime_resume() which both call | |
399 | pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's | |
400 | driver provides a pm->runtime_resume() callback (see below). However, before | |
401 | the driver's callback is executed, pci_pm_runtime_resume() brings the device | |
402 | back into the full-power state, prevents it from signaling wakeup while in that | |
403 | state and restores its standard configuration registers. Thus the driver's | |
404 | callback need not worry about the PCI-specific aspects of the device resume. | |
405 | ||
406 | Note that generally pci_pm_runtime_resume() may be called in two different | |
407 | situations. First, it may be called at the request of the device's driver, for | |
408 | example if there are some data for it to process. Second, it may be called | |
409 | as a result of a wakeup signal from the device itself (this sometimes is | |
410 | referred to as "remote wakeup"). Of course, for this purpose the wakeup signal | |
411 | is handled in one of the ways described in Section 1 and finally converted into | |
412 | a notification for the PCI subsystem after the source device has been | |
413 | identified. | |
414 | ||
415 | The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle() | |
416 | and pm_request_idle(), executes the device driver's pm->runtime_idle() | |
417 | callback, if defined, and if that callback doesn't return an error code (or is not | |
418 | present at all), suspends the device with the help of pm_runtime_suspend(). | |
419 | Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for | |
420 | example, it is called right after the device has just been resumed), in which | |
421 | cases it is expected to suspend the device if that makes sense. Usually, | |
422 | however, the PCI subsystem doesn't really know if the device really can be | |
423 | suspended, so it lets the device's driver decide by running its | |
424 | pm->runtime_idle() callback. | |
425 | ||
426 | 2.4. System-Wide Power Transitions | |
427 | ---------------------------------- | |
428 | There are a few different types of system-wide power transitions, described in | |
66ccc64f | 429 | Documentation/driver-api/pm/devices.rst. Each of them requires devices to be handled |
b7999570 RW |
430 | in a specific way and the PM core executes subsystem-level power management |
431 | callbacks for this purpose. They are executed in phases such that each phase | |
432 | involves executing the same subsystem-level callback for every device belonging | |
433 | to the given subsystem before the next phase begins. These phases always run | |
434 | after tasks have been frozen. | |
435 | ||
436 | 2.4.1. System Suspend | |
151f4e2b | 437 | ^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
438 | |
439 | When the system is going into a sleep state in which the contents of memory will | |
440 | be preserved, such as one of the ACPI sleep states S1-S3, the phases are: | |
441 | ||
442 | prepare, suspend, suspend_noirq. | |
443 | ||
151f4e2b | 444 | The following PCI bus type's callbacks, respectively, are used in these phases:: |
b7999570 RW |
445 | |
446 | pci_pm_prepare() | |
447 | pci_pm_suspend() | |
448 | pci_pm_suspend_noirq() | |
449 | ||
450 | The pci_pm_prepare() routine first puts the device into the "fully functional" | |
451 | state with the help of pm_runtime_resume(). Then, it executes the device | |
452 | driver's pm->prepare() callback if defined (i.e. if the driver's struct | |
453 | dev_pm_ops object is present and the prepare pointer in that object is valid). | |
454 | ||
455 | The pci_pm_suspend() routine first checks if the device's driver implements | |
456 | legacy PCI suspend routines (see Section 3), in which case the driver's legacy | |
457 | suspend callback is executed, if present, and its result is returned. Next, if | |
458 | the device's driver doesn't provide a struct dev_pm_ops object (containing | |
459 | pointers to the driver's callbacks), pci_pm_default_suspend() is called, which | |
460 | simply turns off the device's bus master capability and runs | |
461 | pcibios_disable_device() to disable it, unless the device is a bridge (PCI | |
462 | bridges are ignored by this routine). Next, the device driver's pm->suspend() | |
463 | callback is executed, if defined, and its result is returned if it fails. | |
464 | Finally, pci_fixup_device() is called to apply hardware suspend quirks related | |
465 | to the device if necessary. | |
466 | ||
467 | Note that the suspend phase is carried out asynchronously for PCI devices, so | |
468 | the pci_pm_suspend() callback may be executed in parallel for any pair of PCI | |
469 | devices that don't depend on each other in a known way (i.e. none of the paths | |
470 | in the device tree from the root bridge to a leaf device contains both of them). | |
471 | ||
472 | The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has | |
473 | been called, which means that the device driver's interrupt handler won't be | |
474 | invoked while this routine is running. It first checks if the device's driver | |
475 | implements legacy PCI suspend routines (Section 3), in which case the legacy | |
476 | late suspend routine is called and its result is returned (the standard | |
477 | configuration registers of the device are saved if the driver's callback hasn't | |
478 | done that). Second, if the device driver's struct dev_pm_ops object is not | |
479 | present, the device's standard configuration registers are saved and the routine | |
480 | returns success. Otherwise the device driver's pm->suspend_noirq() callback is | |
481 | executed, if present, and its result is returned if it fails. Next, if the | |
482 | device's standard configuration registers haven't been saved yet (one of the | |
483 | device driver's callbacks executed before might do that), pci_pm_suspend_noirq() | |
484 | saves them, prepares the device to signal wakeup (if necessary) and puts it into | |
485 | a low-power state. | |
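The decision flow of pci_pm_suspend_noirq() described above can be modeled in user space as follows. This is a simplified sketch, not the kernel implementation: struct model_dev, its fields and both functions are hypothetical stand-ins, and "saving registers" or "entering a low-power state" is reduced to setting a flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state the routine looks at. */
struct model_dev {
	bool has_legacy;	/* driver implements legacy PCI suspend routines */
	int (*legacy_suspend_late)(struct model_dev *dev);
	bool has_pm_ops;	/* driver provides a struct dev_pm_ops object */
	int (*suspend_noirq)(struct model_dev *dev);
	bool state_saved;	/* standard config registers already saved */
	bool low_power;		/* subsystem put the device into a low-power state */
};

/* Example driver callback that saves the registers on its own. */
static int driver_saves_state(struct model_dev *dev)
{
	dev->state_saved = true;
	return 0;
}

static int model_suspend_noirq(struct model_dev *dev)
{
	int err;

	if (dev->has_legacy) {
		/* Legacy path: run the late suspend routine, return its result. */
		err = dev->legacy_suspend_late ? dev->legacy_suspend_late(dev) : 0;
		if (!err)
			dev->state_saved = true; /* saved here if the driver didn't */
		return err;
	}

	if (!dev->has_pm_ops) {
		dev->state_saved = true;	/* only save the registers */
		return 0;
	}

	if (dev->suspend_noirq) {
		err = dev->suspend_noirq(dev);
		if (err)
			return err;
	}

	if (!dev->state_saved) {
		/* The driver left the job to the subsystem. */
		dev->state_saved = true;
		dev->low_power = true;	/* wakeup setup + low-power transition */
	}
	return 0;
}
```

Note how a driver callback that saves the registers itself, like driver_saves_state() here, causes the model to skip the wakeup preparation and the low-power transition, which matches the assumption that such a driver has already handled them.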
486 | ||
487 | The low-power state to put the device into is the lowest-power (highest number) | |
488 | state from which it can signal wakeup while the system is in the target sleep | |
489 | state. Just like in the runtime PM case described above, the mechanism of | |
490 | signaling wakeup is system-dependent and determined by the PCI subsystem, which | |
491 | is also responsible for preparing the device to signal wakeup from the system's | |
492 | target sleep state as appropriate. | |
493 | ||
494 | PCI device drivers (that don't implement legacy power management callbacks) are | |
495 | generally not expected to prepare devices for signaling wakeup or to put them | |
496 | into low-power states. However, if one of the driver's suspend callbacks | |
497 | (pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration | |
498 | registers, pci_pm_suspend_noirq() will assume that the device has been prepared | |
499 | to signal wakeup and put into a low-power state by the driver (the driver is | |
500 | then assumed to have used the helper functions provided by the PCI subsystem for | |
501 | this purpose). PCI device drivers are not encouraged to do that, but in some | |
502 | rare cases doing that in the driver may be the optimum approach. | |
503 | ||
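
The non-legacy part of the flow described above can be modeled in a few lines of user-space C. This is only a sketch of the prose, not the kernel's implementation: struct model_dev, its fields and every function name below are invented for illustration, the legacy path is left out, and "prepare for wakeup" is recorded unconditionally even though the text says it happens only if necessary.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the text talks about. */
struct model_dev {
	int has_pm_ops;                               /* driver has a dev_pm_ops object */
	int (*suspend_noirq)(struct model_dev *dev);  /* models pm->suspend_noirq() */
	int state_saved;                              /* config registers saved */
	int wakeup_prepared;                          /* set up to signal wakeup */
	int low_power;                                /* put into a low-power state */
	int ret;                                      /* value the routine returns */
};

/* Model of the non-legacy steps attributed to pci_pm_suspend_noirq(). */
struct model_dev model_pci_pm_suspend_noirq(struct model_dev dev)
{
	if (!dev.has_pm_ops) {
		/* No dev_pm_ops: save the registers and report success. */
		dev.state_saved = 1;
		dev.ret = 0;
		return dev;
	}

	if (dev.suspend_noirq) {
		dev.ret = dev.suspend_noirq(&dev);
		if (dev.ret)
			return dev;  /* driver callback failed */
	}

	if (!dev.state_saved) {
		/* The driver left everything to the subsystem. */
		dev.state_saved = 1;
		dev.wakeup_prepared = 1;
		dev.low_power = 1;
	}
	/* Otherwise the driver is assumed to have handled the device. */
	dev.ret = 0;
	return dev;
}

/* Hypothetical driver callback that saves the registers itself. */
int demo_noirq_saves_state(struct model_dev *dev)
{
	dev->state_saved = 1;
	return 0;
}

/* Hypothetical driver callback that fails. */
int demo_noirq_fails(struct model_dev *dev)
{
	(void)dev;
	return -EIO;
}
```

In this model a driver whose callback saves the registers ends up with low_power still 0, which mirrors the statement that such a driver becomes responsible for putting the device into a low-power state itself.
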
504 | 2.4.2. System Resume | |
151f4e2b | 505 | ^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
506 | |
507 | When the system is undergoing a transition from a sleep state in which the | |
508 | contents of memory have been preserved, such as one of the ACPI sleep states | |
509 | S1-S3, into the working state (ACPI S0), the phases are: | |
510 | ||
511 | resume_noirq, resume, complete. | |
512 | ||
513 | The following PCI bus type's callbacks, respectively, are executed in these | |
151f4e2b | 514 | phases:: |
b7999570 RW |
515 | |
516 | pci_pm_resume_noirq() | |
517 | pci_pm_resume() | |
518 | pci_pm_complete() | |
519 | ||
520 | The pci_pm_resume_noirq() routine first puts the device into the full-power | |
521 | state, restores its standard configuration registers and applies early resume | |
522 | hardware quirks related to the device, if necessary. This is done | |
523 | unconditionally, regardless of whether or not the device's driver implements | |
524 | legacy PCI power management callbacks (this way all PCI devices are in the | |
525 | full-power state and their standard configuration registers have been restored | |
526 | when their interrupt handlers are invoked for the first time during resume, | |
527 | which allows the kernel to avoid problems with the handling of shared interrupts | |
528 | by drivers whose devices are still suspended). If legacy PCI power management | |
529 | callbacks (see Section 3) are implemented by the device's driver, the legacy | |
530 | early resume callback is executed and its result is returned. Otherwise, the | |
531 | device driver's pm->resume_noirq() callback is executed, if defined, and its | |
532 | result is returned. | |
533 | ||
534 | The pci_pm_resume() routine first checks if the device's standard configuration | |
535 | registers have been restored and restores them if that's not the case (this | |
536 | is only necessary in the error path during a failing suspend). Next, resume | |
537 | hardware quirks related to the device are applied, if necessary, and if the | |
538 | device's driver implements legacy PCI power management callbacks (see | |
539 | Section 3), the driver's legacy resume callback is executed and its result is | |
540 | returned. Otherwise, the device's wakeup signaling mechanisms are blocked and | |
541 | its driver's pm->resume() callback is executed, if defined (the callback's | |
542 | result is then returned). | |
543 | ||
544 | The resume phase is carried out asynchronously for PCI devices, like the | |
545 | suspend phase described above, which means that if two PCI devices don't depend | |
546 | on each other in a known way, the pci_pm_resume() routine may be executed for | |
547 | both of them in parallel. | |
548 | ||
549 | The pci_pm_complete() routine only executes the device driver's pm->complete() | |
550 | callback, if defined. | |
551 | ||
552 | 2.4.3. System Hibernation | |
151f4e2b | 553 | ^^^^^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
554 | |
555 | System hibernation is more complicated than system suspend, because it requires | |
556 | a system image to be created and written into a persistent storage medium. The | |
557 | image is created atomically and all devices are quiesced, or frozen, before that | |
558 | happens. | |
559 | ||
560 | The freezing of devices is carried out after enough memory has been freed (at | |
561 | the time of this writing the image creation requires at least 50% of system RAM | |
562 | to be free) in the following three phases: | |
563 | ||
564 | prepare, freeze, freeze_noirq | |
565 | ||
151f4e2b | 566 | that correspond to the PCI bus type's callbacks:: |
b7999570 RW |
567 | |
568 | pci_pm_prepare() | |
569 | pci_pm_freeze() | |
570 | pci_pm_freeze_noirq() | |
571 | ||
572 | This means that the prepare phase is exactly the same as for system suspend. | |
573 | The other two phases, however, are different. | |
574 | ||
575 | The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs | |
576 | the device driver's pm->freeze() callback, if defined, instead of pm->suspend(), | |
577 | and it doesn't apply the suspend-related hardware quirks. It is executed | |
578 | asynchronously for different PCI devices that don't depend on each other in a | |
579 | known way. | |
580 | ||
581 | The pci_pm_freeze_noirq() routine, in turn, is similar to | |
582 | pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq() | |
583 | routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the | |
584 | device for signaling wakeup and put it into a low-power state. Still, it saves | |
585 | the device's standard configuration registers if they haven't been saved by one | |
586 | of the driver's callbacks. | |
587 | ||
588 | Once the image has been created, it has to be saved. However, at this point all | |
589 | devices are frozen and they cannot handle I/O, while their ability to handle | |
590 | I/O is obviously necessary for the image saving. Thus they have to be brought | |
591 | back to the fully functional state and this is done in the following phases: | |
592 | ||
593 | thaw_noirq, thaw, complete | |
594 | ||
151f4e2b | 595 | using the following PCI bus type's callbacks:: |
b7999570 RW |
596 | |
597 | pci_pm_thaw_noirq() | |
598 | pci_pm_thaw() | |
599 | pci_pm_complete() | |
600 | ||
601 | respectively. | |
602 | ||
603 | The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(), | |
604 | but it doesn't put the device into the full power state and doesn't attempt to | |
605 | restore its standard configuration registers. It also executes the device | |
606 | driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq(). | |
607 | ||
608 | The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device | |
609 | driver's pm->thaw() callback instead of pm->resume(). It is executed | |
610 | asynchronously for different PCI devices that don't depend on each other in a | |
611 | known way. | |
612 | ||
613 | The complete phase is the same as for system resume. | |
614 | ||
615 | After saving the image, devices need to be powered down before the system can | |
616 | enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in | |
617 | three phases: | |
618 | ||
619 | prepare, poweroff, poweroff_noirq | |
620 | ||
621 | where the prepare phase is exactly the same as for system suspend. The other | |
622 | two phases are analogous to the suspend and suspend_noirq phases, respectively. | |
151f4e2b | 623 | The PCI subsystem-level callbacks they correspond to:: |
b7999570 RW |
624 | |
625 | pci_pm_poweroff() | |
626 | pci_pm_poweroff_noirq() | |
627 | ||
628 | work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively, | |
629 | although they don't attempt to save the device's standard configuration | |
630 | registers. | |
631 | ||
632 | 2.4.4. System Restore | |
151f4e2b | 633 | ^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
634 | |
635 | System restore requires a hibernation image to be loaded into memory and the | |
636 | pre-hibernation memory contents to be restored before the pre-hibernation system | |
637 | activity can be resumed. | |
638 | ||
66ccc64f | 639 | As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded |
b7999570 RW |
640 | into memory by a fresh instance of the kernel, called the boot kernel, which in |
641 | turn is loaded and run by a boot loader in the usual way. After the boot kernel | |
642 | has loaded the image, it needs to replace its own code and data with the code | |
643 | and data of the "hibernated" kernel stored within the image, called the image | |
644 | kernel. For this purpose all devices are frozen just like before creating | |
645 | the image during hibernation, in the | |
646 | ||
647 | prepare, freeze, freeze_noirq | |
648 | ||
649 | phases described above. However, the devices affected by these phases are only | |
650 | those having drivers in the boot kernel; other devices will still be in whatever | |
651 | state the boot loader left them. | |
652 | ||
653 | Should the restoration of the pre-hibernation memory contents fail, the boot | |
654 | kernel would go through the "thawing" procedure described above, using the | |
655 | thaw_noirq, thaw, and complete phases (that will only affect the devices having | |
656 | drivers in the boot kernel), and then continue running normally. | |
657 | ||
658 | If the pre-hibernation memory contents are restored successfully, which is the | |
659 | usual situation, control is passed to the image kernel, which then becomes | |
660 | responsible for bringing the system back to the working state. To achieve this, | |
661 | it must restore the devices' pre-hibernation functionality, which is done much | |
662 | like waking up from the memory sleep state, although it involves different | |
663 | phases: | |
664 | ||
665 | restore_noirq, restore, complete | |
666 | ||
667 | The first two of these are analogous to the resume_noirq and resume phases | |
668 | described above, respectively, and correspond to the following PCI subsystem | |
151f4e2b | 669 | callbacks:: |
b7999570 RW |
670 | |
671 | pci_pm_restore_noirq() | |
672 | pci_pm_restore() | |
673 | ||
674 | These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), | |
675 | respectively, but they execute the device driver's pm->restore_noirq() and | |
676 | pm->restore() callbacks, if available. | |
677 | ||
678 | The complete phase is carried out in exactly the same way as during system | |
679 | resume. | |
680 | ||
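
The phase sequences listed in Sections 2.4.1 through 2.4.4 can be collected into a small lookup table. The following is a minimal user-space C sketch rather than kernel code: the enum, the array and both helper names are made up for illustration, and only the phase names are taken from the text. The restore row covers the image kernel only; the boot kernel runs the freeze sequence first, as described above.

```c
#include <stddef.h>
#include <string.h>

/* System-wide transitions from Section 2.4 (identifiers are illustrative). */
enum pm_transition {
	PM_SUSPEND,
	PM_RESUME,
	PM_FREEZE,
	PM_THAW,
	PM_POWEROFF,
	PM_RESTORE,
	PM_TRANSITION_COUNT
};

#define PHASES_PER_TRANSITION 3

/* Phase order for each transition, copied from the lists in the text. */
static const char *const pm_phases[PM_TRANSITION_COUNT][PHASES_PER_TRANSITION] = {
	[PM_SUSPEND]  = { "prepare", "suspend", "suspend_noirq" },
	[PM_RESUME]   = { "resume_noirq", "resume", "complete" },
	[PM_FREEZE]   = { "prepare", "freeze", "freeze_noirq" },
	[PM_THAW]     = { "thaw_noirq", "thaw", "complete" },
	[PM_POWEROFF] = { "prepare", "poweroff", "poweroff_noirq" },
	[PM_RESTORE]  = { "restore_noirq", "restore", "complete" },
};

/* Return the name of the n-th phase (0-based), or NULL if out of range. */
const char *pm_phase_name(enum pm_transition t, size_t n)
{
	if ((unsigned int)t >= PM_TRANSITION_COUNT || n >= PHASES_PER_TRANSITION)
		return NULL;
	return pm_phases[t][n];
}

/* Nonzero if the phase name ends in "_noirq", i.e. it is one of the phases
 * during which the driver's interrupt handler is not invoked. */
int pm_phase_is_noirq(const char *phase)
{
	static const char suffix[] = "_noirq";
	size_t len = strlen(phase);
	size_t slen = sizeof(suffix) - 1;

	return len >= slen && strcmp(phase + len - slen, suffix) == 0;
}
```

Laid out this way, the symmetry is easy to see: every "going down" sequence ends with a noirq phase and every "coming up" sequence starts with one.
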
681 | ||
682 | 3. PCI Device Drivers and Power Management | |
683 | ========================================== | |
684 | ||
685 | 3.1. Power Management Callbacks | |
686 | ------------------------------- | |
151f4e2b | 687 | |
b7999570 RW |
688 | PCI device drivers participate in power management by providing callbacks to be |
689 | executed by the PCI subsystem's power management routines described above and by | |
690 | controlling the runtime power management of their devices. | |
691 | ||
692 | At the time of this writing there are two ways to define power management | |
693 | callbacks for a PCI device driver, the recommended one, based on using a | |
66ccc64f | 694 | dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the |
b7999570 RW |
695 | "legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and |
696 | .resume() callbacks from struct pci_driver are used. The legacy approach, | |
697 | however, doesn't allow one to define runtime power management callbacks and is | |
698 | not really suitable for any new drivers. Therefore it is not covered by this | |
699 | document (refer to the source code to learn more about it). | |
700 | ||
701 | It is recommended that all PCI device drivers define a struct dev_pm_ops object | |
702 | containing pointers to power management (PM) callbacks that will be executed by | |
703 | the PCI subsystem's PM routines in various circumstances. A pointer to the | |
704 | driver's struct dev_pm_ops object has to be assigned to the driver.pm field in | |
705 | its struct pci_driver object. Once that has happened, the "legacy" PM callbacks | |
706 | in struct pci_driver are ignored (even if they are not NULL). | |
707 | ||
708 | The PM callbacks in struct dev_pm_ops are not mandatory and if they are not | |
709 | defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI | |
710 | subsystem will handle the device in a simplified default manner. If they are | |
711 | defined, though, they are expected to behave as described in the following | |
712 | subsections. | |
713 | ||
714 | 3.1.1. prepare() | |
151f4e2b | 715 | ^^^^^^^^^^^^^^^^ |
b7999570 RW |
716 | |
717 | The prepare() callback is executed during system suspend, during hibernation | |
718 | (when a hibernation image is about to be created), during power-off after | |
719 | saving a hibernation image and during system restore, when a hibernation image | |
720 | has just been loaded into memory. | |
721 | ||
722 | This callback is only necessary if the driver's device has children that in | |
723 | general may be registered at any time. In that case the role of the prepare() | |
724 | callback is to prevent new children of the device from being registered until | |
725 | one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run. | |
726 | ||
727 | In addition to that the prepare() callback may carry out some operations | |
728 | preparing the device to be suspended, although it should not allocate memory | |
729 | (if additional memory is required to suspend the device, it has to be | |
730 | preallocated earlier, for example in a suspend/hibernate notifier as described | |
730c4c05 | 731 | in Documentation/driver-api/pm/notifiers.rst). |
b7999570 RW |
732 | |
733 | 3.1.2. suspend() | |
151f4e2b | 734 | ^^^^^^^^^^^^^^^^ |
b7999570 RW |
735 | |
736 | The suspend() callback is only executed during system suspend, after prepare() | |
737 | callbacks have been executed for all devices in the system. | |
738 | ||
739 | This callback is expected to quiesce the device and prepare it to be put into a | |
740 | low-power state by the PCI subsystem. It is not required (in fact it is not | |
741 | even recommended) that a PCI driver's suspend() callback save the standard | |
742 | configuration registers of the device, prepare it for waking up the system, or | |
743 | put it into a low-power state. All of these operations can very well be taken | |
744 | care of by the PCI subsystem, without the driver's participation. | |
745 | ||
746 | However, in some rare cases it is convenient to carry out these operations in | |
747 | a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and | |
748 | pci_set_power_state() should be used to save the device's standard configuration | |
749 | registers, to prepare it for system wakeup (if necessary), and to put it into a | |
750 | low-power state, respectively. Moreover, if the driver calls pci_save_state(), | |
751 | the PCI subsystem will not execute either pci_prepare_to_sleep() or | |
752 | pci_set_power_state() for its device, so the driver is then responsible for | |
753 | handling the device as appropriate. | |
754 | ||
755 | While the suspend() callback is being executed, the driver's interrupt handler | |
756 | can be invoked to handle an interrupt from the device, so all suspend-related | |
757 | operations relying on the driver's ability to handle interrupts should be | |
758 | carried out in this callback. | |
759 | ||
760 | 3.1.3. suspend_noirq() | |
151f4e2b | 761 | ^^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
762 | |
763 | The suspend_noirq() callback is only executed during system suspend, after | |
764 | suspend() callbacks have been executed for all devices in the system and | |
765 | after device interrupts have been disabled by the PM core. | |
766 | ||
767 | The difference between suspend_noirq() and suspend() is that the driver's | |
768 | interrupt handler will not be invoked while suspend_noirq() is running. Thus | |
769 | suspend_noirq() can carry out operations that would cause race conditions to | |
770 | arise if they were performed in suspend(). | |
771 | ||
772 | 3.1.4. freeze() | |
151f4e2b | 773 | ^^^^^^^^^^^^^^^ |
b7999570 RW |
774 | |
775 | The freeze() callback is hibernation-specific and is executed in two situations, | |
776 | during hibernation, after prepare() callbacks have been executed for all devices | |
777 | in preparation for the creation of a system image, and during restore, | |
778 | after a system image has been loaded into memory from persistent storage and the | |
779 | prepare() callbacks have been executed for all devices. | |
780 | ||
781 | The role of this callback is analogous to the role of the suspend() callback | |
782 | described above. In fact, they only need to be different in the rare cases when | |
783 | the driver takes the responsibility for putting the device into a low-power | |
1da177e4 LT |
784 | state. |
785 | ||
b7999570 RW |
786 | In those cases the freeze() callback should not prepare the device for system wakeup | |
787 | or put it into a low-power state. Still, either it or freeze_noirq() should | |
788 | save the device's standard configuration registers using pci_save_state(). | |
1da177e4 | 789 | |
b7999570 | 790 | 3.1.5. freeze_noirq() |
151f4e2b | 791 | ^^^^^^^^^^^^^^^^^^^^^ |
1da177e4 | 792 | |
b7999570 RW |
793 | The freeze_noirq() callback is hibernation-specific. It is executed during |
794 | hibernation, after prepare() and freeze() callbacks have been executed for all | |
795 | devices in preparation for the creation of a system image, and during restore, | |
796 | after a system image has been loaded into memory and after prepare() and | |
797 | freeze() callbacks have been executed for all devices. It is always executed | |
798 | after device interrupts have been disabled by the PM core. | |
1da177e4 | 799 | |
b7999570 RW |
800 | The role of this callback is analogous to the role of the suspend_noirq() |
801 | callback described above and it is very rarely necessary to define | |
802 | freeze_noirq(). | |
1da177e4 | 803 | |
b7999570 RW |
804 | The difference between freeze_noirq() and freeze() is analogous to the |
805 | difference between suspend_noirq() and suspend(). | |
1da177e4 | 806 | |
b7999570 | 807 | 3.1.6. poweroff() |
151f4e2b | 808 | ^^^^^^^^^^^^^^^^^ |
1da177e4 | 809 | |
b7999570 RW |
810 | The poweroff() callback is hibernation-specific. It is executed when the system |
811 | is about to be powered off after saving a hibernation image to persistent | |
812 | storage. prepare() callbacks are executed for all devices before poweroff() is | |
813 | called. | |
1da177e4 | 814 | |
b7999570 RW |
815 | The role of this callback is analogous to the role of the suspend() and freeze() |
816 | callbacks described above, although it does not need to save the contents of | |
817 | the device's registers. In particular, if the driver wants to put the device | |
818 | into a low-power state itself instead of allowing the PCI subsystem to do that, | |
819 | the poweroff() callback should use pci_prepare_to_sleep() and | |
820 | pci_set_power_state() to prepare the device for system wakeup and to put it | |
821 | into a low-power state, respectively, but it need not save the device's standard | |
822 | configuration registers. | |
1da177e4 | 823 | |
b7999570 | 824 | 3.1.7. poweroff_noirq() |
151f4e2b | 825 | ^^^^^^^^^^^^^^^^^^^^^^^ |
1da177e4 | 826 | |
b7999570 RW |
827 | The poweroff_noirq() callback is hibernation-specific. It is executed after |
828 | poweroff() callbacks have been executed for all devices in the system. | |
1da177e4 | 829 | |
b7999570 RW |
830 | The role of this callback is analogous to the role of the suspend_noirq() and |
831 | freeze_noirq() callbacks described above, but it does not need to save the | |
832 | contents of the device's registers. | |
1da177e4 | 833 | |
b7999570 RW |
834 | The difference between poweroff_noirq() and poweroff() is analogous to the |
835 | difference between suspend_noirq() and suspend(). | |
1da177e4 | 836 | |
b7999570 | 837 | 3.1.8. resume_noirq() |
151f4e2b | 838 | ^^^^^^^^^^^^^^^^^^^^^ |
1da177e4 | 839 | |
b7999570 RW |
840 | The resume_noirq() callback is only executed during system resume, after the |
841 | PM core has enabled the non-boot CPUs. The driver's interrupt handler will not | |
842 | be invoked while resume_noirq() is running, so this callback can carry out | |
843 | operations that might race with the interrupt handler. | |
1da177e4 | 844 | |
b7999570 RW |
845 | Since the PCI subsystem unconditionally puts all devices into the full power |
846 | state in the resume_noirq phase of system resume and restores their standard | |
847 | configuration registers, resume_noirq() is usually not necessary. In general | |
848 | it should only be used for performing operations that would lead to race | |
849 | conditions if carried out by resume(). | |
1da177e4 | 850 | |
b7999570 | 851 | 3.1.9. resume() |
151f4e2b | 852 | ^^^^^^^^^^^^^^^ |
21d6b7e1 | 853 | |
b7999570 RW |
854 | The resume() callback is only executed during system resume, after |
855 | resume_noirq() callbacks have been executed for all devices in the system and | |
856 | device interrupts have been enabled by the PM core. | |
21d6b7e1 | 857 | |
b7999570 RW |
858 | This callback is responsible for restoring the pre-suspend configuration of the |
859 | device and bringing it back to the fully functional state. The device should be | |
860 | able to process I/O in the usual way after resume() has returned. | |
21d6b7e1 | 861 | |
b7999570 | 862 | 3.1.10. thaw_noirq() |
151f4e2b | 863 | ^^^^^^^^^^^^^^^^^^^^ |
21d6b7e1 | 864 | |
b7999570 RW |
865 | The thaw_noirq() callback is hibernation-specific. It is executed after a |
866 | system image has been created and the non-boot CPUs have been enabled by the PM | |
867 | core, in the thaw_noirq phase of hibernation. It also may be executed if the | |
868 | loading of a hibernation image fails during system restore (it is then executed | |
869 | after enabling the non-boot CPUs). The driver's interrupt handler will not be | |
870 | invoked while thaw_noirq() is running. | |
21d6b7e1 | 871 | |
b7999570 RW |
872 | The role of this callback is analogous to the role of resume_noirq(). The |
873 | difference between these two callbacks is that thaw_noirq() is executed after | |
874 | freeze() and freeze_noirq(), so in general it does not need to modify the | |
875 | contents of the device's registers. | |
21d6b7e1 | 876 | |
b7999570 | 877 | 3.1.11. thaw() |
151f4e2b | 878 | ^^^^^^^^^^^^^^ |
21d6b7e1 | 879 | |
b7999570 RW |
880 | The thaw() callback is hibernation-specific. It is executed after thaw_noirq() |
881 | callbacks have been executed for all devices in the system and after device | |
882 | interrupts have been enabled by the PM core. | |
1da177e4 | 883 | |
b7999570 RW |
884 | This callback is responsible for restoring the pre-freeze configuration of |
885 | the device, so that it will work in the usual way after thaw() has returned. | |
1da177e4 | 886 | |
b7999570 | 887 | 3.1.12. restore_noirq() |
151f4e2b | 888 | ^^^^^^^^^^^^^^^^^^^^^^^ |
1da177e4 | 889 | |
b7999570 RW |
890 | The restore_noirq() callback is hibernation-specific. It is executed in the |
891 | restore_noirq phase of hibernation, when the boot kernel has passed control to | |
892 | the image kernel and the non-boot CPUs have been enabled by the image kernel's | |
893 | PM core. | |
894 | ||
895 | This callback is analogous to resume_noirq() with the exception that it cannot | |
896 | make any assumption on the previous state of the device, even if the BIOS (or | |
897 | generally the platform firmware) is known to preserve that state over a | |
898 | suspend-resume cycle. | |
899 | ||
900 | For the vast majority of PCI device drivers there is no difference between | |
901 | resume_noirq() and restore_noirq(). | |
902 | ||
903 | 3.1.13. restore() | |
151f4e2b | 904 | ^^^^^^^^^^^^^^^^^ |
b7999570 RW |
905 | |
906 | The restore() callback is hibernation-specific. It is executed after | |
907 | restore_noirq() callbacks have been executed for all devices in the system and | |
908 | after the PM core has enabled device drivers' interrupt handlers to be invoked. | |
909 | ||
910 | This callback is analogous to resume(), just like restore_noirq() is analogous | |
911 | to resume_noirq(). Consequently, the difference between restore_noirq() and | |
912 | restore() is analogous to the difference between resume_noirq() and resume(). | |
913 | ||
914 | For the vast majority of PCI device drivers there is no difference between | |
915 | resume() and restore(). | |
916 | ||
917 | 3.1.14. complete() | |
151f4e2b | 918 | ^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
919 | |
920 | The complete() callback is executed in the following situations: | |
151f4e2b | 921 | |
b7999570 RW |
922 | - during system resume, after resume() callbacks have been executed for all |
923 | devices, | |
924 | - during hibernation, before saving the system image, after thaw() callbacks | |
925 | have been executed for all devices, | |
926 | - during system restore, when the system is going back to its pre-hibernation | |
927 | state, after restore() callbacks have been executed for all devices. | |
151f4e2b | 928 | |
b7999570 RW |
929 | It also may be executed if the loading of a hibernation image into memory fails |
930 | (in that case it is run after thaw() callbacks have been executed for all | |
931 | devices that have drivers in the boot kernel). | |
932 | ||
933 | This callback is entirely optional, although it may be necessary if the | |
934 | prepare() callback performs operations that need to be reversed. | |
935 | ||
936 | 3.1.15. runtime_suspend() | |
151f4e2b | 937 | ^^^^^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
938 | |
939 | The runtime_suspend() callback is specific to device runtime power management | |
940 | (runtime PM). It is executed by the PM core's runtime PM framework when the | |
941 | device is about to be suspended (i.e. quiesced and put into a low-power state) | |
942 | at run time. | |
943 | ||
944 | This callback is responsible for freezing the device and preparing it to be | |
945 | put into a low-power state, but it must allow the PCI subsystem to perform all | |
946 | of the PCI-specific actions necessary for suspending the device. | |
947 | ||
948 | 3.1.16. runtime_resume() | |
151f4e2b | 949 | ^^^^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
950 | |
951 | The runtime_resume() callback is specific to device runtime PM. It is executed | |
952 | by the PM core's runtime PM framework when the device is about to be resumed | |
953 | (i.e. put into the full-power state and programmed to process I/O normally) at | |
954 | run time. | |
955 | ||
956 | This callback is responsible for restoring the normal functionality of the | |
957 | device after it has been put into the full-power state by the PCI subsystem. | |
958 | The device is expected to be able to process I/O in the usual way after | |
959 | runtime_resume() has returned. | |
960 | ||
961 | 3.1.17. runtime_idle() | |
151f4e2b | 962 | ^^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
963 | |
964 | The runtime_idle() callback is specific to device runtime PM. It is executed | |
965 | by the PM core's runtime PM framework whenever it may be desirable to suspend | |
966 | the device according to the PM core's information. In particular, it is | |
967 | automatically executed right after runtime_resume() has returned in case the | |
968 | resume of the device has happened as a result of a spurious event. | |
969 | ||
970 | This callback is optional, but if it is not implemented or if it returns 0, the | |
971 | PCI subsystem will call pm_runtime_suspend() for the device, which in turn will | |
972 | cause the driver's runtime_suspend() callback to be executed. | |
973 | ||
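
The rule stated above (no runtime_idle() callback, or a return value of 0, leads to a suspend attempt) can be written down as a tiny user-space model. It is a sketch under stated assumptions: struct model_runtime_ops, model_idle_requests_suspend() and the two demo callbacks are hypothetical names, struct device is an opaque stand-in, and returning -EBUSY is just one plausible way for a driver to say "not now".

```c
#include <errno.h>
#include <stddef.h>

struct device;  /* opaque stand-in for the kernel's struct device */

/* Only the member needed for this illustration. */
struct model_runtime_ops {
	int (*runtime_idle)(struct device *dev);
};

/*
 * Return 1 if, per the rule in the text, a runtime suspend would be
 * requested for the device after the idle check, and 0 otherwise.
 */
int model_idle_requests_suspend(const struct model_runtime_ops *ops,
				struct device *dev)
{
	if (!ops || !ops->runtime_idle)
		return 1;  /* callback not implemented */

	return ops->runtime_idle(dev) == 0;
}

/* Hypothetical driver callback: the device may be suspended. */
int demo_idle_allow(struct device *dev)
{
	(void)dev;
	return 0;
}

/* Hypothetical driver callback: the device is still in use. */
int demo_idle_veto(struct device *dev)
{
	(void)dev;
	return -EBUSY;
}
```

A driver that wants to avoid being suspended again right after a spurious resume would therefore return a nonzero value from its runtime_idle() callback until it has verified that the device is really unused.
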
974 | 3.1.18. Pointing Multiple Callback Pointers to One Routine | |
151f4e2b | 975 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
b7999570 RW |
976 | |
977 | Although in principle each of the callbacks described in the previous | |
978 | subsections can be defined as a separate function, it often is convenient to | |
979 | point two or more members of struct dev_pm_ops to the same routine. There are | |
980 | a few convenience macros that can be used for this purpose. | |
981 | ||
982 | The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one | |
983 | suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() | |
984 | members and one resume routine pointed to by the .resume(), .thaw(), and | |
985 | .restore() members. The other function pointers in this struct dev_pm_ops are | |
986 | unset. | |
987 | ||
988 | The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it | |
989 | additionally sets the .runtime_resume() pointer to the same value as | |
990 | .resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to | |
991 | the same value as .suspend() (and .freeze() and .poweroff()). | |
992 | ||
993 | The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside a declaration of struct | |
994 | dev_pm_ops to indicate that one suspend routine is to be pointed to by the | |
995 | .suspend(), .freeze(), and .poweroff() members and one resume routine is to | |
996 | be pointed to by the .resume(), .thaw(), and .restore() members. | |
997 | ||
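
To make the effect of these macros concrete, here is a user-space model of the pointer assignments described above. It deliberately does not use the kernel headers: struct model_dev_pm_ops is a cut-down stand-in for struct dev_pm_ops, MODEL_SET_SYSTEM_SLEEP_PM_OPS is a stand-in written from the description in the text, and the demo_* names are hypothetical.

```c
#include <stddef.h>

struct device;  /* opaque stand-in for the kernel's struct device */

/* Cut-down stand-in for struct dev_pm_ops (only a few members shown). */
struct model_dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
	int (*thaw)(struct device *dev);
	int (*poweroff)(struct device *dev);
	int (*restore)(struct device *dev);
	int (*suspend_noirq)(struct device *dev);
	int (*resume_noirq)(struct device *dev);
};

/* One suspend routine and one resume routine shared by six members. */
#define MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, \
	.resume = resume_fn, \
	.freeze = suspend_fn, \
	.thaw = resume_fn, \
	.poweroff = suspend_fn, \
	.restore = resume_fn,

/* Hypothetical driver routines; a real driver would quiesce or
 * reinitialize its device here. */
int demo_suspend(struct device *dev)
{
	(void)dev;
	return 0;
}

int demo_resume(struct device *dev)
{
	(void)dev;
	return 0;
}

/* Members not named by the macro stay NULL, i.e. unset. */
const struct model_dev_pm_ops demo_pm_ops = {
	MODEL_SET_SYSTEM_SLEEP_PM_OPS(demo_suspend, demo_resume)
};
```

In a real driver the object would be a struct dev_pm_ops, and a pointer to it would be assigned to the driver.pm field of the driver's struct pci_driver, as described in Section 3.1.
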
08810a41 | 998 | 3.1.19. Driver Flags for Power Management |
151f4e2b | 999 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
08810a41 RW |
1000 | |
1001 | The PM core allows device drivers to set flags that influence the handling of | |
1002 | power management for the devices by the core itself and by middle layer code | |
1003 | including the PCI bus type. The flags should be set once at the driver probe | |
1004 | time with the help of the dev_pm_set_driver_flags() function and they should not | |
1005 | be updated directly afterwards. | |
1006 | ||
1007 | The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete | |
1008 | mechanism allowing device suspend/resume callbacks to be skipped if the device | |
1009 | is in runtime suspend when the system suspend starts. That also affects all of | |
1010 | the ancestors of the device, so this flag should only be used if absolutely | |
1011 | necessary. | |
1012 | ||
1013 | The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a | |
1014 | positive value from pci_pm_prepare() if the ->prepare callback provided by the | |
1015 | driver of the device returns a positive value. That allows the driver to opt | |
1016 | out from using the direct-complete mechanism dynamically. | |
1017 | ||
c4b65157 RW |
1018 | The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's |
1019 | perspective the device can be safely left in runtime suspend during system | |
1020 | suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() | |
1021 | to skip resuming the device from runtime suspend unless there are PCI-specific | |
1022 | reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), | |
1023 | pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early | |
1024 | if the device remains in runtime suspend at the beginning of the "late" phase | |
1025 | of the system-wide transition under way. Moreover, if the device is in | |
1026 | runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime | |
1027 | power management status will be changed to "active" (as it is going to be put | |
1028 | into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), | |
1029 | the function will set the power.direct_complete flag for it (to make the PM core | |
1030 | skip the subsequent "thaw" callbacks for it) and return. | |
1031 | ||
bd755d77 RW |
1032 | Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the |
1033 | device to be left in suspend after system-wide transitions to the working state. | |
1034 | This flag is checked by the PM core, but the PCI bus type informs the PM core | |
1035 | which devices may be left in suspend from its perspective (that happens during | |
1036 | the "noirq" phase of system-wide suspend and analogous transitions) and next it | |
1037 | uses the dev_pm_may_skip_resume() helper to decide whether or not to return from | |
1038 | pci_pm_resume_noirq() early, as the PM core will skip the remaining resume | |
1039 | callbacks for the device during the transition under way and will set its | |
1040 | runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for | |
1041 | it. | |
1042 | ||
b7999570 RW |
1043 | 3.2. Device Runtime Power Management |
1044 | ------------------------------------ | |
151f4e2b | 1045 | |
b7999570 RW |
1046 | In addition to providing device power management callbacks, PCI device drivers | |
1047 | are responsible for controlling the runtime power management (runtime PM) of | |
1048 | their devices. | |
1049 | ||
1050 | The PCI device runtime PM is optional, but it is recommended that PCI device | |
1051 | drivers implement it at least in the cases where there is a reliable way of | |
1052 | verifying that the device is not used (like when the network cable is detached | |
1053 | from an Ethernet adapter or there are no devices attached to a USB controller). | |
1054 | ||
1055 | To support the PCI runtime PM the driver first needs to implement the | |
1056 | runtime_suspend() and runtime_resume() callbacks. It also may need to implement | |
1057 | the runtime_idle() callback to prevent the device from being suspended again | |
1058 | every time right after the runtime_resume() callback has returned | |
1059 | (alternatively, the runtime_suspend() callback will have to check if the | |
1060 | device should really be suspended and return -EAGAIN if that is not the case). | |
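
The second option above, in which runtime_suspend() itself refuses to suspend a busy device, can be sketched roughly as follows. This is a userspace model rather than kernel code: struct model_nic and model_runtime_suspend() are hypothetical names, and the "link_up" field is an assumed stand-in for whatever condition a real driver uses to tell that the device is still in use.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state; not a kernel structure. */
struct model_nic {
	bool link_up;	/* assumed "device is in use" condition */
	bool suspended;	/* set once the model device has been suspended */
};

/*
 * Models a driver's runtime_suspend() callback that checks whether the
 * device should really be suspended and returns -EAGAIN if it should not.
 */
static int model_runtime_suspend(struct model_nic *nic)
{
	assert(nic);	/* the model expects a valid device */

	if (nic->link_up)
		return -EAGAIN;	/* still in use, refuse to suspend */

	nic->suspended = true;
	return 0;
}
```

A driver that provides runtime_idle() instead would make the same check there, so that no suspend request is issued while the device is busy.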
1061 | ||
a8360062 RW |
1062 | The runtime PM of PCI devices is enabled by default by the PCI core. PCI |
1063 | device drivers do not need to enable it and should not attempt to do so. | |
1064 | However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid() | |
1065 | helper function. In addition to that, the runtime PM usage counter of | |
1066 | each PCI device is incremented by local_pci_probe() before executing the | |
1067 | probe callback provided by the device's driver. | |
1068 | ||
1069 | If a PCI driver implements the runtime PM callbacks and intends to use the | |
1070 | runtime PM framework provided by the PM core and the PCI subsystem, it needs | |
1071 | to decrement the device's runtime PM usage counter in its probe callback | |
1072 | function. If it doesn't do that, the counter will always be different from | |
1073 | zero for the device and it will never be runtime-suspended. The simplest | |
1074 | way to do that is by calling pm_runtime_put_noidle(), but if the driver | |
1075 | wants to schedule an autosuspend right away, for example, it may call | |
1076 | pm_runtime_put_autosuspend() instead for this purpose. Generally, it | |
1077 | just needs to call a function that decrements the device's usage counter | |
1078 | from its probe routine to make runtime PM work for the device. | |
1079 | ||
1080 | It is important to remember that the driver's runtime_suspend() callback | |
1081 | may be executed right after the usage counter has been decremented, because | |
76fc35dd | 1082 | user space may already have caused the pm_runtime_allow() helper function |
a8360062 RW |
1083 | (which unblocks the runtime PM of the device) to run via sysfs, so the driver must | |
1084 | be prepared to cope with that. | |
1085 | ||
1086 | The driver itself should not call pm_runtime_allow(), though. Instead, it | |
1087 | should let user space or some platform-specific code do that (user space can | |
1088 | do it via sysfs as stated above), but it must be prepared to handle the | |
b7999570 | 1089 | runtime PM of the device correctly as soon as pm_runtime_allow() is called |
a8360062 RW |
1090 | (which may happen at any time, even before the driver is loaded). |
1091 | ||
1092 | When the driver's remove callback runs, it has to balance the decrement of | |
1093 | the device's runtime PM usage counter done at probe time. For this reason, | |
1094 | if it has decremented the counter in its probe callback, it must run | |
1095 | pm_runtime_get_noresume() in its remove callback. [Since the core carries | |
1096 | out a runtime resume of the device and bumps up the device's usage counter | |
1097 | before running the driver's remove callback, the runtime PM of the device | |
1098 | is effectively disabled for the duration of the remove execution and all | |
1099 | runtime PM helper functions incrementing the device's usage counter are | |
1100 | then effectively equivalent to pm_runtime_get_noresume().] | |
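
The balancing rule for probe and remove described above can be illustrated with a small userspace model of the usage counter. It is only a sketch: struct model_dev and the model_* functions are hypothetical stand-ins for the device, pm_runtime_get_noresume() and pm_runtime_put_noidle(), and the model assumes the counter has already been incremented once by the core before the driver's probe callback runs.

```c
#include <assert.h>

/* Hypothetical device; usage_count models the runtime PM usage counter. */
struct model_dev {
	int usage_count;
};

/* Stand-in for pm_runtime_get_noresume(): increments the counter only. */
static void model_get_noresume(struct model_dev *dev)
{
	dev->usage_count++;
}

/* Stand-in for pm_runtime_put_noidle(): decrements the counter only. */
static void model_put_noidle(struct model_dev *dev)
{
	assert(dev->usage_count > 0);	/* catches an unbalanced put */
	dev->usage_count--;
}

/* Models a driver probe callback that opts in to runtime PM. */
static int model_driver_probe(struct model_dev *dev)
{
	/* Drop the reference taken by the core before probe was called. */
	model_put_noidle(dev);
	return 0;
}

/* Models the matching remove callback. */
static void model_driver_remove(struct model_dev *dev)
{
	/* Balance the decrement done in model_driver_probe(). */
	model_get_noresume(dev);
}
```

In this model, a device whose counter is 1 when probe is entered has a counter of 0 after probe and 1 again after remove, which is the state the core left it in before the driver was bound.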
b7999570 RW |
1101 | |
1102 | The runtime PM framework works by processing requests to suspend or resume | |
1103 | devices, or to check if they are idle (in which case it is reasonable to | |
1104 | subsequently request that they be suspended). These requests are represented | |
1105 | by work items put into the power management workqueue, pm_wq. Although there | |
1106 | are a few situations in which power management requests are automatically | |
1107 | queued by the PM core (for example, after processing a request to resume a | |
1108 | device the PM core automatically queues a request to check if the device is | |
1109 | idle), device drivers are generally responsible for queuing power management | |
1110 | requests for their devices. For this purpose they should use the runtime PM | |
1111 | helper functions provided by the PM core, discussed in | |
151f4e2b | 1112 | Documentation/power/runtime_pm.rst. |
b7999570 RW |
1113 | |
1114 | Devices can also be suspended and resumed synchronously, without placing a | |
1115 | request into pm_wq. In the majority of cases this is also done by their | |
1116 | drivers that use helper functions provided by the PM core for this purpose. | |
1117 | ||
1118 | For more information on the runtime PM of devices refer to | |
151f4e2b | 1119 | Documentation/power/runtime_pm.rst. |
b7999570 RW |
1120 | |
1121 | ||
1122 | 4. Resources | |
1123 | ============ | |
1124 | ||
1125 | PCI Local Bus Specification, Rev. 3.0 | |
151f4e2b | 1126 | |
b7999570 | 1127 | PCI Bus Power Management Interface Specification, Rev. 1.2 |
151f4e2b | 1128 | |
b7999570 | 1129 | Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b |
151f4e2b | 1130 | |
b7999570 | 1131 | PCI Express Base Specification, Rev. 2.0 |
151f4e2b | 1132 | |
66ccc64f | 1133 | Documentation/driver-api/pm/devices.rst |
151f4e2b MCC |
1134 | |
1135 | Documentation/power/runtime_pm.rst |