Intel P-State driver
--------------------

This driver provides an interface to control the P-State selection for the
SandyBridge+ Intel processors.

The following document explains P-States:
http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
As stated in the document, P-State doesn’t exactly mean a frequency. However, for
the sake of the relationship with cpufreq, P-State and frequency are used
interchangeably.

Understanding the cpufreq core governors and policies is important before
discussing more details about the Intel P-State driver. Based on what callbacks
a cpufreq driver provides to the cpufreq core, it can support two types of
drivers:
- with target_index() callback: In this mode, the drivers using cpufreq core
simply provide the minimum and maximum frequency limits and an additional
interface target_index() to set the current frequency. The cpufreq subsystem
has a number of scaling governors ("performance", "powersave", "ondemand",
etc.). Depending on which governor is in use, cpufreq core will call for
transitions to a specific frequency using the target_index() callback.
- with setpolicy() callback: In this mode, drivers do not provide the
target_index() callback, so cpufreq core can't request a transition to a
specific frequency. The driver provides minimum and maximum frequency limits
and callbacks to set a policy. The policy in cpufreq sysfs is referred to as
the "scaling governor". The cpufreq core can request the driver to operate in
either of two policies: "performance" and "powersave". The driver decides which
frequency to use based on the above policy selection, considering the minimum
and maximum frequency limits.

The Intel P-State driver falls under the latter category, which implements the
setpolicy() callback. This driver decides what P-State to use based on the
requested policy from the cpufreq core. If the processor is capable of
selecting its next P-State internally, then the driver will offload this
responsibility to the processor (aka HWP: Hardware P-States). If not, the
driver implements algorithms to select the next P-State.

Since these policies are implemented in the driver, they are not the same as
the cpufreq scaling governors implementation, even if they have the same name
in the cpufreq sysfs (scaling_governor). For example, the "performance" policy
is similar to cpufreq’s "performance" governor, but "powersave" is completely
different from the cpufreq "powersave" governor. The strategy here is similar
to cpufreq "ondemand", where the requested P-State is related to the system load.

Sysfs Interface

In addition to the frequency-controlling interfaces provided by the cpufreq
core, the driver provides its own sysfs files to control the P-State selection.
These files have been added to /sys/devices/system/cpu/intel_pstate/.
Any changes made to these files are applicable to all CPUs (even in a
multi-package system; refer to the later section "Per-CPU limits").

      max_perf_pct: Limits the maximum P-State that will be requested by
      the driver. The limit is expressed as a percentage of the available
      performance. The available (P-State) performance may be reduced by the
      no_turbo setting described below.

      min_perf_pct: Limits the minimum P-State that will be requested by
      the driver. The limit is expressed as a percentage of the max
      (non-turbo) performance level.

      no_turbo: Limits the driver to selecting P-States below the turbo
      frequency range.

      turbo_pct: Displays the percentage of the total performance that
      is supported by hardware that is in the turbo range. This number
      is independent of whether turbo has been disabled or not.

      num_pstates: Displays the number of P-States that are supported
      by hardware. This number is independent of whether turbo has
      been disabled or not.

For example, if a system has these parameters:
        Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State)
        Max non turbo ratio: 0x17
        Minimum ratio: 0x08 (Here the ratio is called max efficiency ratio)

Sysfs will show:
        max_perf_pct:100, which corresponds to 1 core ratio
        min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio
        no_turbo:0, turbo is not disabled
        num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1)
        turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates

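The arithmetic behind these example values can be sketched in Python. This is
an illustration only: the function name pstate_summary() and its parameter
names are hypothetical, and the rounding rules (truncation for min_perf_pct,
rounding up for turbo_pct) are assumptions chosen so that the results reproduce
the figures listed above. The driver's own arithmetic may differ.

```python
import math


def pstate_summary(max_turbo_ratio, max_non_turbo_ratio, min_ratio):
    """Derive the example sysfs values from the three hardware ratios."""
    # Every ratio between the minimum and the 1-core turbo maximum is a P-State.
    num_pstates = max_turbo_ratio - min_ratio + 1
    # Assumption: truncate, which gives 24 for 0x08 / 0x21.
    min_perf_pct = (100 * min_ratio) // max_turbo_ratio
    # Assumption: round up, which gives 39 for (0x21 - 0x17) / 26.
    turbo_states = max_turbo_ratio - max_non_turbo_ratio
    turbo_pct = math.ceil(100 * turbo_states / num_pstates)
    return {
        "num_pstates": num_pstates,
        "min_perf_pct": min_perf_pct,
        "turbo_pct": turbo_pct,
    }


summary = pstate_summary(0x21, 0x17, 0x08)
print(summary)
```

Feeding in the ratios of a different processor shows how the reported
percentages shift as the turbo range widens or narrows.
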
Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3: System Programming Guide" to understand ratios.

There is one more sysfs attribute in /sys/devices/system/cpu/intel_pstate/
that can be used for controlling the operation mode of the driver:

      status: Three settings are possible:
      "off"     - The driver is not in use at this time.
      "active"  - The driver works as a P-state governor (default).
      "passive" - The driver works as a regular cpufreq one and collaborates
                  with the generic cpufreq governors (it sets P-states as
                  requested by those governors).
      The current setting is returned by reads from this attribute. Writing one
      of the above strings to it changes the operation mode as indicated by that
      string, if possible. If HW-managed P-states (HWP) are enabled, it is not
      possible to change the driver's operation mode and attempts to write to
      this attribute will fail.

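A minimal Python sketch of reading and changing the status attribute is shown
below. The helper names read_status() and write_status() are hypothetical; only
the sysfs path and the three mode strings come from the text above. On a real
system the write needs root privileges, and a rejected write (for example when
HWP is enabled) would surface as an OSError.

```python
from pathlib import Path

# Directory and mode strings as described in the text above.
DEFAULT_DIR = "/sys/devices/system/cpu/intel_pstate"
STATUS_MODES = ("off", "active", "passive")


def read_status(base_dir=DEFAULT_DIR):
    """Return the current operation mode as a plain string."""
    return Path(base_dir, "status").read_text().strip()


def write_status(mode, base_dir=DEFAULT_DIR):
    """Request a new operation mode; refuse strings the attribute does not list."""
    if mode not in STATUS_MODES:
        raise ValueError("unsupported mode: %r" % (mode,))
    # A failed sysfs write is reported by Python as OSError.
    Path(base_dir, "status").write_text(mode)
```

Calling read_status() with no arguments reads the live attribute; the base_dir
parameter exists only so the helpers can be tried against a scratch directory.
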
cpufreq sysfs for Intel P-State

Since this driver registers with cpufreq, cpufreq sysfs is also presented.
There are some important differences, which need to be considered.

scaling_cur_freq: This displays the real frequency which was used during
the last sample period instead of what is requested. Some other cpufreq drivers,
like acpi-cpufreq, display what is requested (some changes are on the
way to fix this for the acpi-cpufreq driver). The same is true for frequencies
displayed at /proc/cpuinfo.

scaling_governor: This displays the currently active policy. Since each CPU has
a cpufreq sysfs, it is possible to set a scaling governor for each CPU. But this
is not possible with Intel P-States, as there is one common policy for all
CPUs. Here, the last requested policy will be applicable to all CPUs. It is
suggested that one use the cpupower utility to change the policy on all CPUs at
the same time.

scaling_setspeed: This attribute can never be used with Intel P-State.

scaling_max_freq/scaling_min_freq: This interface can be used similarly to
the max_perf_pct/min_perf_pct of Intel P-State sysfs. However, since frequencies
are converted to the nearest possible P-State, this is prone to rounding errors.
This method is not preferred to limit performance.

affected_cpus: Not used
related_cpus: Not used

For contemporary Intel processors, the frequency is controlled by the
processor itself and the P-State exposed to software is related to
performance levels. The idea that frequency can be set to a single
frequency is fictional for Intel Core processors. Even if the scaling
driver selects a single P-State, the actual frequency the processor
will run at is selected by the processor itself.

Per-CPU limits

The kernel command line option "intel_pstate=per_cpu_perf_limits" forces
the intel_pstate driver to use per-CPU performance limits. When it is set,
the sysfs control interface described above is subject to limitations.
- The following controls are not available for either reading or writing:
        /sys/devices/system/cpu/intel_pstate/max_perf_pct
        /sys/devices/system/cpu/intel_pstate/min_perf_pct
- The following controls can be used to set performance limits, as far as the
architecture of the processor permits:
        /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
        /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
        /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- The user can still observe the turbo percentage and the number of P-States
from:
        /sys/devices/system/cpu/intel_pstate/turbo_pct
        /sys/devices/system/cpu/intel_pstate/num_pstates
- The user can read and write the system-wide turbo status:
        /sys/devices/system/cpu/intel_pstate/no_turbo

Support of energy performance hints
It is possible to provide hints to the HWP algorithms in the processor,
ranging from more performance-centric to more energy-centric. When the driver
is using HWP, two additional cpufreq sysfs attributes are presented for
each logical CPU.
These attributes are:
        - energy_performance_available_preferences
        - energy_performance_preference

To get the list of supported hints:
$ cat energy_performance_available_preferences
    default performance balance_performance balance_power power

The current preference can be read or changed via the cpufreq sysfs
attribute "energy_performance_preference". Reading from this attribute
will display the current effective setting. The user can write any of the valid
preference strings to this attribute. The user can always restore the power-on
default by writing "default".

Since threads can migrate to different CPUs, it is possible that the
new CPU has a different energy performance preference than the previous
one. To avoid such issues, either pin threads to specific CPUs or set the
same energy performance preference value on all CPUs.

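Setting the same preference on every CPU can be sketched in Python as follows.
The helper name set_energy_preference() and the cpu_root parameter are
hypothetical, and the sketch assumes the two attributes live under
/sys/devices/system/cpu/cpu*/cpufreq/ as the per-CPU cpufreq attributes listed
earlier do. Writing to the real files needs root privileges.

```python
import glob
import os


def set_energy_preference(preference, cpu_root="/sys/devices/system/cpu"):
    """Write one energy performance preference to every CPU that exposes it."""
    pattern = os.path.join(
        cpu_root, "cpu[0-9]*", "cpufreq", "energy_performance_preference"
    )
    updated = []
    for path in sorted(glob.glob(pattern)):
        # Check the hint against the list this CPU reports as supported.
        available_path = os.path.join(
            os.path.dirname(path), "energy_performance_available_preferences"
        )
        with open(available_path) as f:
            available = f.read().split()
        if preference not in available:
            raise ValueError("%r not supported by %s" % (preference, path))
        with open(path, "w") as f:
            f.write(preference)
        updated.append(path)
    return updated
```

For example, set_energy_preference("balance_power") would return the list of
files it changed, which is empty when the driver is not using HWP and the
attributes are therefore absent.
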
Tuning Intel P-State driver

When the performance can be tuned using the PID (Proportional Integral
Derivative) controller, debugfs files are provided for adjusting performance.
They are presented under:
/sys/kernel/debug/pstate_snb/

The PID tunable parameters are:
        deadband
        d_gain_pct
        i_gain_pct
        p_gain_pct
        sample_rate_ms
        setpoint

To adjust these parameters, some understanding of the driver implementation is
necessary. There are some tweaks described here, but be very careful. Adjusting
them requires an expert-level understanding of the relationship between power
and performance. These limits are only useful when the "powersave" policy is
active.

- To make the system more responsive to load changes, sample_rate_ms can
be adjusted (current default is 10ms).
- To make the system use higher performance, even if the load is lower, setpoint
can be adjusted to a lower number. This will also lead to faster ramp up time
to reach the maximum P-State.

If there are no derivative and integral coefficients, the next P-State will be
equal to:
        current P-State - ((setpoint - current cpu load) * p_gain_pct)

For example, if the current PID parameters are (which are the defaults for
core processors like SandyBridge):
        deadband = 0
        d_gain_pct = 0
        i_gain_pct = 0
        p_gain_pct = 20
        sample_rate_ms = 10
        setpoint = 97

If the current P-State = 0x08 and current load = 100, this will result in the
next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State
goes up by only 1. If during the next sample interval the current load doesn't
change and is still 100, then the P-State goes up by one again. This process
will continue as long as the load is more than the setpoint until the maximum
P-State is reached.

For the same load at setpoint = 60, this will result in the next P-State
= 0x08 - ((60 - 100) * 0.2) = 16
So by changing the setpoint from 97 to 60, the next P-State increases from
9 to 16. This makes the processor execute at a higher P-State for the same
CPU load. If the load continues to be more than the setpoint during the next
sample intervals, then the P-State will go up again until the maximum P-State
is reached. But the ramp up time to reach the maximum P-State will be much
shorter when the setpoint is 60 compared to 97.

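The proportional-only formula and the two worked examples can be reproduced
with a short Python sketch. The function names next_pstate() and
samples_to_max() are hypothetical, rounding to the nearest integer is assumed
from the "rounded to 9" remark above, and the maximum P-State of 0x21 is
borrowed from the earlier sysfs example purely for illustration.

```python
def next_pstate(current_pstate, load, setpoint, p_gain_pct, max_pstate=None):
    """Proportional-only step: current - ((setpoint - load) * p_gain)."""
    target = current_pstate - (setpoint - load) * p_gain_pct / 100
    target = round(target)
    if max_pstate is not None:
        # Never request more than the highest available P-State.
        target = min(target, max_pstate)
    return target


def samples_to_max(start, load, setpoint, p_gain_pct, max_pstate):
    """Count sample intervals needed to ramp from start to max_pstate."""
    pstate, samples = start, 0
    while pstate < max_pstate:
        following = next_pstate(pstate, load, setpoint, p_gain_pct, max_pstate)
        if following <= pstate:
            raise ValueError("load is not far enough above the setpoint to ramp up")
        pstate = following
        samples += 1
    return samples


# The two cases from the text: load 100, p_gain_pct 20, starting at 0x08.
step_97 = next_pstate(0x08, 100, 97, 20)
step_60 = next_pstate(0x08, 100, 60, 20)
ramp_97 = samples_to_max(0x08, 100, 97, 20, 0x21)
ramp_60 = samples_to_max(0x08, 100, 60, 20, 0x21)
print(step_97, step_60, ramp_97, ramp_60)
```

Comparing ramp_97 with ramp_60 gives a rough count of sample intervals for each
setpoint; multiplying by sample_rate_ms turns that into an approximate ramp up
time under a constant load.
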
Debugging Intel P-State driver

Event tracing
To debug P-State transition, the Linux event tracing interface can be used.
There are two specific events, which can be enabled (provided the kernel
configs related to event tracing are enabled).

# cd /sys/kernel/debug/tracing/
# echo 1 > events/power/pstate_sample/enable
# echo 1 > events/power/cpu_frequency/enable
# cat trace
gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107
    scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618
    freq=2474476
cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2


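To post-process such output, the key=value fields of a pstate_sample event can
be pulled apart with a few lines of Python. This is a sketch: the function name
parse_pstate_sample() is hypothetical, and the sample line is the one shown
above, joined back onto a single line. In this particular sample,
100 * aperf / mperf truncates to the reported core_busy value; whether that
relationship holds in general depends on the driver version and is not claimed
here.

```python
import re

# The pstate_sample event from the trace above, joined onto one line.
SAMPLE = (
    "gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 "
    "scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 "
    "freq=2474476"
)


def parse_pstate_sample(line):
    """Return the event fields as a dict of ints, or None for other events."""
    if "pstate_sample:" not in line:
        return None
    payload = line.split("pstate_sample:", 1)[1]
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", payload)}


fields = parse_pstate_sample(SAMPLE)
# Ratio of actual to reference cycles over the sample, as a percentage.
ratio_pct = 100 * fields["aperf"] // fields["mperf"]
print(fields["from"], fields["to"], ratio_pct)
```
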
Using ftrace

If function level tracing is required, the Linux ftrace interface can be used.
For example, if we want to check how often a function to set a P-State is
called, we can set the ftrace filter to intel_pstate_set_pstate.

# cd /sys/kernel/debug/tracing/
# cat available_filter_functions | grep -i pstate
intel_pstate_set_pstate
intel_pstate_cpu_init
...

# echo intel_pstate_set_pstate > set_ftrace_filter
# echo function > current_tracer
# cat trace | head -15
# tracer: function
#
# entries-in-buffer/entries-written: 80/80   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
 gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
     gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
          <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
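
To answer the "how often" question from captured output, the function tracer
lines can be counted per CPU with a small Python sketch. The helper name
calls_per_cpu() and the regular expression are hypothetical and assume the
TASK-PID, CPU#, flags, TIMESTAMP, FUNCTION column layout shown above.

```python
import re
from collections import Counter

# One function-tracer record: TASK-PID [CPU] flags TIMESTAMP: FUNCTION
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+\S+\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<func>\w+)"
)


def calls_per_cpu(lines, function="intel_pstate_set_pstate"):
    """Count how many times `function` was traced on each CPU."""
    counts = Counter()
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # skip the tracer header
        match = TRACE_LINE.match(line)
        if match and match.group("func") == function:
            counts[int(match.group("cpu"))] += 1
    return counts


# The four records from the trace above.
sample_trace = [
    "# tracer: function",
    " Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func",
    " gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func",
    " gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func",
    " <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func",
]
counts = calls_per_cpu(sample_trace)
print(dict(counts))
```

On a live system the same function could be fed the lines of
/sys/kernel/debug/tracing/trace instead of the embedded sample.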