Commit | Line | Data |
---|---|---|
a032d2de | 1 | Intel P-State driver |
a3ea0153 RR |
2 | -------------------- |
3 | ||
a032d2de SP |
4 | This driver provides an interface to control the P-State selection for the |
5 | SandyBridge+ Intel processors. | |
6 | ||
7 | The following document explains P-States: | |
8 | http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf | |
9 | As stated in the document, P-State doesn’t exactly mean a frequency. However, for | |
10 | the sake of the relationship with cpufreq, P-State and frequency are used | |
11 | interchangeably. | |
12 | ||
13 | Understanding the cpufreq core governors and policies is important before | |
14 | discussing more details about the Intel P-State driver. Based on the callbacks | |
15 | that a cpufreq driver provides, the cpufreq core supports two types of | |
16 | drivers: | |
17 | - with target_index() callback: In this mode, the drivers using cpufreq core | |
18 | simply provide the minimum and maximum frequency limits and an additional | |
19 | interface target_index() to set the current frequency. The cpufreq subsystem | |
20 | has a number of scaling governors ("performance", "powersave", "ondemand", | |
21 | etc.). Depending on which governor is in use, cpufreq core will call for | |
22 | transitions to a specific frequency using target_index() callback. | |
23 | - with setpolicy() callback: In this mode, drivers do not provide a target_index() | |
24 | callback, so the cpufreq core can't request a transition to a specific frequency. | |
25 | The driver provides minimum and maximum frequency limits and callbacks to set a | |
26 | policy. The policy in cpufreq sysfs is referred to as the "scaling governor". | |
27 | The cpufreq core can request the driver to operate in either of two policies: | |
5bc8ac0f | 28 | "performance" and "powersave". The driver decides which frequency to use based |
a032d2de SP |
29 | on the above policy selection considering minimum and maximum frequency limits. |
30 | ||
31 | The Intel P-State driver falls under the latter category, which implements the | |
32 | setpolicy() callback. This driver decides what P-State to use based on the | |
33 | requested policy from the cpufreq core. If the processor is capable of | |
34 | selecting its next P-State internally, then the driver will offload this | |
35 | responsibility to the processor (aka HWP: Hardware P-States). If not, the | |
36 | driver implements algorithms to select the next P-State. | |
37 | ||
38 | Since these policies are implemented in the driver, they are not the same as the | |
39 | cpufreq scaling governors implementation, even if they have the same name in | |
40 | the cpufreq sysfs (scaling_governor). For example, the "performance" policy is | |
41 | similar to cpufreq’s "performance" governor, but "powersave" is completely | |
42 | different from the cpufreq "powersave" governor. The driver's "powersave" strategy is similar | |
43 | to cpufreq "ondemand", where the requested P-State is related to the system load. | |
44 | ||
45 | Sysfs Interface | |
46 | ||
47 | In addition to the frequency-controlling interfaces provided by the cpufreq | |
48 | core, the driver provides its own sysfs files to control the P-State selection. | |
49 | These files have been added to /sys/devices/system/cpu/intel_pstate/. | |
50 | Any changes made to these files are applicable to all CPUs, even in a | |
9e472f95 | 51 | multi-package system (refer to the later section on "Per-CPU limits"). |
a032d2de SP |
52 | |
53 | max_perf_pct: Limits the maximum P-State that will be requested by | |
54 | the driver. The limit is expressed as a percentage of the available performance. The | |
55 | available (P-State) performance may be reduced by the no_turbo | |
41629a82 | 56 | setting described below. |
a3ea0153 | 57 | |
a032d2de SP |
58 | min_perf_pct: Limits the minimum P-State that will be requested by |
59 | the driver. The limit is expressed as a percentage of the max (non-turbo) | |
41629a82 | 60 | performance level. |
a3ea0153 | 61 | |
a032d2de | 62 | no_turbo: Limits the driver to selecting P-States below the turbo |
a3ea0153 RR |
63 | frequency range. |
64 | ||
a032d2de SP |
65 | turbo_pct: Displays the percentage of the total performance that |
66 | is supported by hardware that is in the turbo range. This number | |
d01b1f48 KCA |
67 | is independent of whether turbo has been disabled or not. |
68 | ||
a032d2de SP |
69 | num_pstates: Displays the number of P-States that are supported |
70 | by hardware. This number is independent of whether turbo has | |
0522424e KCA |
71 | been disabled or not. |
72 | ||
a032d2de SP |
73 | For example, if a system has these parameters: |
74 | Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State) | |
75 | Max non turbo ratio: 0x17 | |
76 | Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio) | |
77 | ||
78 | Sysfs will show : | |
79 | max_perf_pct:100, which corresponds to 1 core ratio | |
80 | min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio | |
81 | no_turbo:0, turbo is not disabled | |
82 | num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1) | |
83 | turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates | |
84 | ||
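The arithmetic behind the example values above can be reproduced with a short Python sketch. The function name pstate_sysfs_values and its parameters are hypothetical and exist only for illustration. Rounding turbo_pct up is an assumption chosen so that the result matches the 39 shown in the example; it is not taken from the driver source.

```python
import math

def pstate_sysfs_values(max_turbo_ratio, max_non_turbo_ratio, min_ratio):
    """Reproduce the example's sysfs values from the three ratios (illustrative only)."""
    # Number of P-States between the max efficiency ratio and the max 1 core ratio.
    num_pstates = max_turbo_ratio - min_ratio + 1
    # Minimum ratio as a percentage of the max 1 core ratio, truncated to an integer.
    min_perf_pct = (min_ratio * 100) // max_turbo_ratio
    # Share of P-States in the turbo range. Rounding up is an assumption
    # made here to match the value 39 given in the example.
    turbo_pct = math.ceil((max_turbo_ratio - max_non_turbo_ratio) * 100 / num_pstates)
    return {
        "num_pstates": num_pstates,
        "min_perf_pct": min_perf_pct,
        "turbo_pct": turbo_pct,
    }

# Ratios from the example: max 1 core turbo 0x21, max non turbo 0x17, minimum 0x08.
values = pstate_sysfs_values(0x21, 0x17, 0x08)
print(values)
```

With the ratios from the example, the sketch yields num_pstates 26, min_perf_pct 24 and turbo_pct 39.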
85 | Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual | |
86 | Volume 3: System Programming Guide" to understand ratios. | |
87 | ||
88 | cpufreq sysfs for Intel P-State | |
89 | ||
90 | Since this driver registers with cpufreq, cpufreq sysfs is also presented. | |
91 | There are some important differences, which need to be considered. | |
92 | ||
93 | scaling_cur_freq: This displays the real frequency which was used during | |
94 | the last sample period instead of what is requested. Some other cpufreq drivers, | |
95 | such as acpi-cpufreq, display what is requested (some changes are on the | |
96 | way to fix this for the acpi-cpufreq driver). The same is true for frequencies | |
97 | displayed in /proc/cpuinfo. | |
98 | ||
99 | scaling_governor: This displays the currently active policy. Since each CPU has a | |
100 | cpufreq sysfs, it is possible to set a scaling governor for each CPU. But this | |
101 | is not possible with Intel P-States, as there is one common policy for all | |
102 | CPUs. Here, the last requested policy will be applicable to all CPUs. It is | |
103 | suggested that one use the cpupower utility to change the policy for all CPUs at the | |
104 | same time. | |
105 | ||
106 | scaling_setspeed: This attribute can never be used with Intel P-State. | |
107 | ||
108 | scaling_max_freq/scaling_min_freq: This interface can be used similarly to | |
109 | the max_perf_pct/min_perf_pct of Intel P-State sysfs. However, since frequencies | |
110 | are converted to the nearest possible P-State, this is prone to rounding errors. | |
111 | This is not the preferred method for limiting performance. | |
112 | ||
113 | affected_cpus: Not used | |
114 | related_cpus: Not used | |
115 | ||
a3ea0153 | 116 | For contemporary Intel processors, the frequency is controlled by the |
a032d2de | 117 | processor itself and the P-State exposed to software is related to |
a3ea0153 | 118 | performance levels. The idea that frequency can be set to a single |
a032d2de SP |
119 | frequency is fictional for Intel Core processors. Even if the scaling |
120 | driver selects a single P-State, the actual frequency the processor | |
a3ea0153 RR |
121 | will run at is selected by the processor itself. |
122 | ||
9e472f95 SP |
123 | Per-CPU limits |
124 | ||
125 | The kernel command line option "intel_pstate=per_cpu_perf_limits" forces | |
126 | the intel_pstate driver to use per-CPU performance limits. When it is set, | |
127 | the sysfs control interface described above is subject to limitations. | |
128 | - The following controls are not available for either reading or writing: | |
129 | /sys/devices/system/cpu/intel_pstate/max_perf_pct | |
130 | /sys/devices/system/cpu/intel_pstate/min_perf_pct | |
131 | - The following controls can be used to set performance limits, as far as the | |
132 | architecture of the processor permits: | |
133 | /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq | |
134 | /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq | |
135 | /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | |
136 | - The user can still read the turbo percentage and the number of P-States from: | |
137 | /sys/devices/system/cpu/intel_pstate/turbo_pct | |
138 | /sys/devices/system/cpu/intel_pstate/num_pstates | |
139 | - The user can read and write the system-wide turbo status: | |
140 | /sys/devices/system/cpu/intel_pstate/no_turbo | |
141 | ||
bf006e14 SP |
142 | Support of energy performance hints |
143 | It is possible to provide hints to the HWP algorithms in the processor, | |
144 | ranging from more performance-centric to more energy-centric. When the driver | |
145 | is using HWP, two additional cpufreq sysfs attributes are presented for | |
146 | each logical CPU. | |
147 | These attributes are: | |
148 | - energy_performance_available_preferences | |
149 | - energy_performance_preference | |
150 | ||
151 | To get the list of supported hints: | |
152 | $ cat energy_performance_available_preferences | |
153 | default performance balance_performance balance_power power | |
154 | ||
155 | The current preference can be read or changed via the cpufreq sysfs | |
156 | attribute "energy_performance_preference". Reading from this attribute | |
157 | will display the current effective setting. The user can write any valid | |
158 | preference string to this attribute. The user can always restore the power-on | |
159 | default by writing "default". | |
160 | ||
161 | Since threads can migrate to different CPUs, it is possible that the | |
162 | new CPU has a different energy performance preference than the previous | |
163 | one. To avoid such issues, either pin threads to specific CPUs | |
164 | or set the same energy performance preference value on all CPUs. | |
165 | ||
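Setting the same preference on all CPUs can be sketched in Python as below. The helper name set_energy_performance_preference and its cpu_root parameter are hypothetical. The attribute location is assumed to follow the per-CPU cpufreq paths shown in the "Per-CPU limits" section, and writing to it on a real system is expected to require root privileges.

```python
import glob
import os

def set_energy_performance_preference(preference, cpu_root="/sys/devices/system/cpu"):
    """Write the same preference string to every CPU's cpufreq directory.

    Returns the list of attribute files that were written. The path layout
    is an assumption based on the per-CPU cpufreq paths in this document.
    """
    pattern = os.path.join(
        cpu_root, "cpu[0-9]*", "cpufreq", "energy_performance_preference"
    )
    written = []
    for path in sorted(glob.glob(pattern)):
        # Each logical CPU exposes its own attribute, so write them one by one.
        with open(path, "w") as attr:
            attr.write(preference)
        written.append(path)
    return written
```

For example, set_energy_performance_preference("balance_performance") would apply one of the hints listed by energy_performance_available_preferences to every logical CPU that exposes the attribute.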
a032d2de SP |
166 | Tuning Intel P-State driver |
167 | ||
b8b97a42 SP |
168 | When the performance can be tuned using a PID (Proportional Integral | |
169 | Derivative) controller, debugfs files are provided for adjusting performance. | |
170 | They are presented under: | |
171 | /sys/kernel/debug/pstate_snb/ | |
a3ea0153 | 172 | |
b8b97a42 | 173 | The PID tunable parameters are: |
a3ea0153 RR |
174 | deadband |
175 | d_gain_pct | |
176 | i_gain_pct | |
177 | p_gain_pct | |
178 | sample_rate_ms | |
179 | setpoint | |
a032d2de SP |
180 | |
181 | To adjust these parameters, some understanding of the driver implementation is | |
182 | necessary. There are some tweaks described here, but be very careful. Adjusting | |
183 | them requires an expert-level understanding of the relationship between power and performance. | |
184 | These tunables are only useful when the "powersave" policy is active. | |
185 | ||
186 | - To make the system more responsive to load changes, sample_rate_ms can | |
187 | be adjusted (current default is 10ms). | |
188 | - To make the system use higher performance, even if the load is lower, setpoint | |
189 | can be adjusted to a lower number. This will also lead to a faster ramp-up time | |
190 | to reach the maximum P-State. | |
191 | If there are no derivative and integral coefficients, the next P-State will be | |
192 | equal to: | |
193 | current P-State - ((setpoint - current cpu load) * p_gain_pct) | |
194 | ||
195 | For example, if the current PID parameters are (these are the defaults for Core | |
196 | processors such as SandyBridge): | |
197 | deadband = 0 | |
198 | d_gain_pct = 0 | |
199 | i_gain_pct = 0 | |
200 | p_gain_pct = 20 | |
201 | sample_rate_ms = 10 | |
202 | setpoint = 97 | |
203 | ||
204 | If the current P-State = 0x08 and current load = 100, this will result in the | |
205 | next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State | |
206 | goes up by only 1. If during the next sample interval the current load doesn't | |
207 | change and is still 100, then the P-State goes up by one again. This process will | |
208 | continue as long as the load is more than the setpoint until the maximum P-State | |
209 | is reached. | |
210 | ||
211 | For the same load at setpoint = 60, this will result in the next P-State | |
212 | = 0x08 - ((60 - 100) * 0.2) = 16 | |
213 | So by changing the setpoint from 97 to 60, the next P-State increases | |
214 | from 9 to 16, which makes the processor run at a higher | |
215 | P-State for the same CPU load. If the load continues to be more than the | |
216 | setpoint during the next sample intervals, then the P-State will go up again until the | |
217 | maximum P-State is reached. The ramp-up time to reach the maximum P-State | |
218 | will be much shorter when the setpoint is 60 compared to 97. | |
219 | ||
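The proportional-only formula and the two worked examples above can be checked with a minimal Python sketch. The function names next_pstate and samples_to_max are hypothetical. Clamping to a minimum of 0x08 and a maximum of 0x21 reuses the ratios from the earlier sysfs example and is an assumption, as is rounding to the nearest integer; the driver's real fixed-point arithmetic is not modeled here.

```python
import math

def next_pstate(current, load, setpoint, p_gain_pct, min_pstate=0x08, max_pstate=0x21):
    """Proportional-only step: current - ((setpoint - load) * p_gain_pct)."""
    raw = current - (setpoint - load) * (p_gain_pct / 100.0)
    # Round to the nearest integer P-State (assumption), then clamp to the limits.
    rounded = int(math.floor(raw + 0.5))
    return max(min_pstate, min(max_pstate, rounded))

def samples_to_max(start, load, setpoint, p_gain_pct, max_pstate=0x21):
    """Count sample intervals needed to reach max_pstate at a constant load."""
    pstate, samples = start, 0
    while pstate < max_pstate:
        nxt = next_pstate(pstate, load, setpoint, p_gain_pct, max_pstate=max_pstate)
        if nxt <= pstate:
            return None  # The load is not high enough to ever ramp up.
        pstate = nxt
        samples += 1
    return samples

print(next_pstate(0x08, load=100, setpoint=97, p_gain_pct=20))
print(next_pstate(0x08, load=100, setpoint=60, p_gain_pct=20))
print(samples_to_max(0x08, 100, 97, 20), samples_to_max(0x08, 100, 60, 20))
```

Under these assumptions, a constant load of 100 takes 25 sample intervals to go from 0x08 to 0x21 with setpoint = 97, and 4 sample intervals with setpoint = 60.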
220 | Debugging Intel P-State driver | |
221 | ||
222 | Event tracing | |
223 | To debug P-State transitions, the Linux event tracing interface can be used. | |
224 | There are two specific events which can be enabled (provided the kernel | |
225 | configs related to event tracing are enabled). | |
226 | ||
227 | # cd /sys/kernel/debug/tracing/ | |
228 | # echo 1 > events/power/pstate_sample/enable | |
229 | # echo 1 > events/power/cpu_frequency/enable | |
230 | # cat trace | |
231 | gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 | |
232 | scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 | |
233 | freq=2474476 | |
234 | cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 | |
235 | ||
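Trace lines like the ones above are easier to compare once the key=value fields are extracted. The following Python sketch does that for the pstate_sample line shown in the example; the helper name parse_trace_fields is hypothetical. The comparison of aperf/mperf against core_busy is only an observation about this one sample, not a statement about how the driver computes the field.

```python
import re

def parse_trace_fields(line):
    """Return the numeric key=value fields of a trace line as a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}

# Field portion of the pstate_sample line from the trace output above.
sample = ("core_busy=107 scaled=94 from=26 to=26 mperf=1143818 "
          "aperf=1230607 tsc=29838618 freq=2474476")
fields = parse_trace_fields(sample)

# Ratio of aperf to mperf as a truncated percentage, for comparison with core_busy.
busy_pct = fields["aperf"] * 100 // fields["mperf"]
print(fields["from"], fields["to"], fields["core_busy"], busy_pct)
```

For this sample the truncated aperf/mperf percentage equals the reported core_busy value of 107, and the from and to fields show that the P-State stayed at 26.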
236 | ||
237 | Using ftrace | |
238 | ||
239 | If function-level tracing is required, the Linux ftrace interface can be used. | |
240 | For example, to check how often the function that sets a P-State is | |
241 | called, set the ftrace filter to intel_pstate_set_pstate. | |
242 | ||
243 | # cd /sys/kernel/debug/tracing/ | |
244 | # cat available_filter_functions | grep -i pstate | |
245 | intel_pstate_set_pstate | |
246 | intel_pstate_cpu_init | |
247 | ... | |
248 | ||
249 | # echo intel_pstate_set_pstate > set_ftrace_filter | |
250 | # echo function > current_tracer | |
251 | # cat trace | head -15 | |
252 | # tracer: function | |
253 | # | |
254 | # entries-in-buffer/entries-written: 80/80 #P:4 | |
255 | # | |
256 | # _-----=> irqs-off | |
257 | # / _----=> need-resched | |
258 | # | / _---=> hardirq/softirq | |
259 | # || / _--=> preempt-depth | |
260 | # ||| / delay | |
261 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | |
262 | # | | | |||| | | | |
263 | Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func | |
264 | gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func | |
265 | gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func | |
266 | <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func |