Commit | Line | Data |
---|---|---|
1 | CPU frequency and voltage scaling code in the Linux(TM) kernel | |
2 | ||
3 | ||
4 | L i n u x   C P U F r e q | |
5 | ||
6 | C P U F r e q   G o v e r n o r s | |
7 | ||
8 | - information for users and developers - | |
9 | ||
10 | ||
11 | Dominik Brodowski <linux@brodo.de> | |
12 | some additions and corrections by Nico Golde <nico@ngolde.de> | |
13 | Rafael J. Wysocki <rafael.j.wysocki@intel.com> | |
14 | Viresh Kumar <viresh.kumar@linaro.org> | |
15 | ||
16 | ||
17 | ||
18 | Clock scaling allows you to change the clock speed of the CPUs on the | |
19 | fly. This is a nice method to save battery power, because the lower | |
20 | the clock speed, the less power the CPU consumes. | |
21 | ||
22 | ||
23 | Contents: | |
24 | --------- | |
25 | 1. What is a CPUFreq Governor? | |
26 | ||
27 | 2. Governors In the Linux Kernel | |
28 | 2.1 Performance | |
29 | 2.2 Powersave | |
30 | 2.3 Userspace | |
31 | 2.4 Ondemand | |
32 | 2.5 Conservative | |
33 | 2.6 Schedutil | |
34 | ||
35 | 3. The Governor Interface in the CPUfreq Core | |
36 | ||
37 | 4. References | |
38 | ||
39 | ||
40 | 1. What Is A CPUFreq Governor? | |
41 | ============================== | |
42 | ||
43 | Most cpufreq drivers (except intel_pstate and longrun), and indeed | |
44 | most CPU frequency scaling algorithms, only allow the CPU frequency | |
45 | to be set to predefined fixed values. To offer dynamic frequency | |
46 | scaling, the cpufreq core must be able to pass a "target frequency" | |
47 | to these drivers. Such drivers therefore provide a | |
48 | "->target/target_index/fast_switch()" call instead of the | |
49 | "->setpolicy()" call. For setpolicy drivers, nothing changes in this | |
50 | respect. | |
51 | ||
52 | How to decide what frequency within the CPUfreq policy should be used? | |
53 | That's done using "cpufreq governors". | |
54 | ||
55 | Basically, it's the following flow graph: | |
56 | ||
57 | CPU can be set to switch independently   |     CPU can only be set | |
58 |       within specific "limits"           |    to specific frequencies | |
59 | ||
60 |                       "CPUfreq policy" | |
61 |         consists of frequency limits (policy->{min,max}) | |
62 |              and CPUfreq governor to be used | |
63 |                    /                    \ | |
64 |                   /                      \ | |
65 |                  /            the cpufreq governor decides | |
66 |                 /             (dynamically or statically) | |
67 |                /              what target_freq to set within | |
68 |               /               the limits of policy->{min,max} | |
69 |              /                            \ | |
70 |             /                              \ | |
71 | Using the ->setpolicy call,    Using the ->target/target_index/fast_switch call, | |
72 |     the limits and the             the frequency closest | |
73 |      "policy" is set.              to target_freq is set. | |
74 |                                    It is assured that it | |
75 |                                    is within policy->{min,max} | |
76 | ||
77 | ||
78 | 2. Governors In the Linux Kernel | |
79 | ================================ | |
80 | ||
81 | 2.1 Performance | |
82 | --------------- | |
83 | ||
84 | The CPUfreq governor "performance" sets the CPU statically to the | |
85 | highest frequency within the borders of scaling_min_freq and | |
86 | scaling_max_freq. | |
87 | ||
88 | ||
89 | 2.2 Powersave | |
90 | ------------- | |
91 | ||
92 | The CPUfreq governor "powersave" sets the CPU statically to the | |
93 | lowest frequency within the borders of scaling_min_freq and | |
94 | scaling_max_freq. | |
95 | ||
96 | ||
97 | 2.3 Userspace | |
98 | ------------- | |
99 | ||
100 | The CPUfreq governor "userspace" allows the user, or any userspace | |
101 | program running with UID "root", to set the CPU to a specific frequency | |
102 | by making a sysfs file "scaling_setspeed" available in the CPU-device | |
103 | directory. | |
104 | ||
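
A minimal shell sketch of driving the "userspace" governor described above. The helper name set_userspace_speed and its directory argument are hypothetical. The sketch assumes that the cpufreq directory exposes the standard scaling_governor file next to scaling_setspeed, that frequencies are written in kHz, and that the caller runs as root.

```shell
#!/bin/sh
# Hypothetical helper: select the "userspace" governor in one cpufreq
# directory and request a fixed frequency through scaling_setspeed.
#   $1: cpufreq directory, e.g. /sys/devices/system/cpu/cpu0/cpufreq
#       (assumed location; adjust for your system)
#   $2: requested frequency in kHz
set_userspace_speed() {
    policy_dir=$1
    freq_khz=$2

    # scaling_setspeed is only meaningful once "userspace" is the governor.
    echo userspace > "$policy_dir/scaling_governor" || return 1
    echo "$freq_khz" > "$policy_dir/scaling_setspeed" || return 1
}
```

It could be called as "set_userspace_speed /sys/devices/system/cpu/cpu0/cpufreq 1200000". Taking the directory as a parameter keeps the helper usable for any CPU and makes it easy to try against a scratch directory first.
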
105 | ||
106 | 2.4 Ondemand | |
107 | ------------ | |
108 | ||
109 | The CPUfreq governor "ondemand" sets the CPU frequency depending on the | |
110 | current system load. Load estimation is triggered by the scheduler | |
111 | through the update_util_data->func hook; when triggered, cpufreq checks | |
112 | the CPU-usage statistics over the last period and the governor sets the | |
113 | CPU accordingly. The CPU must have the capability to switch the | |
114 | frequency very quickly. | |
115 | ||
116 | Sysfs files: | |
117 | ||
118 | * sampling_rate: | |
119 | ||
120 | Measured in us (10^-6 seconds), this is how often the kernel looks at | |
121 | the CPU usage and decides what to do about the frequency. Typically | |
122 | this is set to values of around '10000' or more. Its default value is | |
123 | (compare with users-guide.txt): transition_latency * 1000. Be aware | |
124 | that transition latency is in ns and sampling_rate is in us, so you | |
125 | get the same sysfs value by default. The sampling rate should always | |
126 | be adjusted with the transition latency in mind. To set the sampling | |
127 | rate 750 times as high as the transition latency in bash (as said, | |
128 | 1000 is the default), do: | |
129 | ||
130 | $ echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate | |
131 | ||
132 | * sampling_rate_min: | |
133 | ||
134 | The sampling rate is limited by the HW transition latency: | |
135 | transition_latency * 100 | |
136 | ||
137 | Or by kernel restrictions: | |
138 | - If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. | |
139 | - If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is | |
140 | used, the limits depend on the CONFIG_HZ option: | |
141 | HZ=1000: min=20000us (20ms) | |
142 | HZ=250: min=80000us (80ms) | |
143 | HZ=100: min=200000us (200ms) | |
144 | ||
145 | The highest value of kernel and HW latency restrictions is shown and | |
146 | used as the minimum sampling rate. | |
147 | ||
148 | * up_threshold: | |
149 | ||
150 | This defines what the average CPU usage between the samplings of | |
151 | 'sampling_rate' needs to be for the kernel to decide that it should | |
152 | increase the frequency. For example, when it is set to its default | |
153 | value of '95', the CPU needs to be in use more than 95% of the time, | |
154 | on average, between the checking intervals before the kernel decides | |
155 | that the CPU frequency needs to be increased. | |
156 | ||
157 | * ignore_nice_load: | |
158 | ||
159 | This parameter takes a value of '0' or '1'. When set to '0' (its | |
160 | default), all processes are counted towards the 'cpu utilisation' | |
161 | value. When set to '1', the processes that are run with a 'nice' | |
162 | value will not count (and thus be ignored) in the overall usage | |
163 | calculation. This is useful if you are running a CPU intensive | |
164 | calculation on your laptop that you do not care how long it takes to | |
165 | complete as you can 'nice' it and prevent it from taking part in the | |
166 | deciding process of whether to increase your CPU frequency. | |
167 | ||
168 | * sampling_down_factor: | |
169 | ||
170 | This parameter controls the rate at which the kernel makes a decision | |
171 | on when to decrease the frequency while running at top speed. When set | |
172 | to 1 (the default) decisions to reevaluate load are made at the same | |
173 | interval regardless of current clock speed. But when set to greater | |
174 | than 1 (e.g. 100) it acts as a multiplier for the scheduling interval | |
175 | for reevaluating load when the CPU is at its top speed due to high | |
176 | load. This improves performance by reducing the overhead of load | |
177 | evaluation and helping the CPU stay at its top speed when truly busy, | |
178 | rather than shifting back and forth in speed. This tunable has no | |
179 | effect on behavior at lower speeds/lower CPU loads. | |
180 | ||
181 | * powersave_bias: | |
182 | ||
183 | This parameter takes a value between 0 and 1000. It defines the | |
184 | percentage (times 10) of the target frequency that will be shaved off | |
185 | the target. For example, when set to 100 (that is, 10%) and the | |
186 | ondemand governor would have targeted 1000 MHz, it will target | |
187 | 1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 | |
188 | (disabled) by default. | |
189 | ||
190 | When the AMD frequency sensitivity powersave bias driver | |
191 | (drivers/cpufreq/amd_freq_sensitivity.c) is loaded, this parameter | |
192 | defines the workload frequency sensitivity threshold below which a | |
193 | lower frequency is chosen instead of the ondemand governor's original | |
194 | target. The frequency sensitivity is a hardware-reported value (on AMD | |
195 | Family 16h processors and above) between 0 and 100% that tells | |
196 | software how the performance of the workload running on a CPU will | |
197 | change when the frequency changes. A workload with a sensitivity of 0% | |
198 | (memory/IO-bound) will not perform any better at a higher core | |
199 | frequency, whereas a workload with a sensitivity of 100% (CPU-bound) | |
200 | will perform better the higher the frequency. When the driver is | |
201 | loaded, this is set to 400 by default: for CPUs running workloads with | |
202 | a sensitivity value below 40%, a lower frequency is chosen. Unloading | |
203 | the driver or writing 0 will disable this feature. | |
204 | ||
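
The arithmetic behind sampling_rate and powersave_bias above can be sketched in shell. Both function names are hypothetical, and the sketch only mirrors the calculations described in the text. It does not model how the governor afterwards maps the result onto a frequency the hardware supports.

```shell
#!/bin/sh
# Hypothetical helper: sampling_rate in us for a transition latency given
# in ns and a multiplier (1000 is the default described above).
# latency_ns * multiplier is in ns; dividing by 1000 converts it to us.
sampling_rate_us() {
    latency_ns=$1
    multiplier=$2
    echo $(( latency_ns * multiplier / 1000 ))
}

# Hypothetical helper: frequency targeted once powersave_bias is applied.
#   $1: frequency the governor would otherwise target, in kHz
#   $2: powersave_bias, 0..1000 (percentage times 10)
biased_target_khz() {
    target_khz=$1
    bias=$2
    echo $(( target_khz - target_khz * bias / 1000 ))
}
```

With a bias of 100, "biased_target_khz 1000000 100" reproduces the 1000 MHz to 900 MHz example from the text, expressed in kHz.
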
205 | ||
206 | 2.5 Conservative | |
207 | ---------------- | |
208 | ||
209 | The CPUfreq governor "conservative", much like the "ondemand" | |
210 | governor, sets the CPU frequency depending on the current usage. It | |
211 | differs in behaviour in that it gracefully increases and decreases the | |
212 | CPU speed rather than jumping to max speed the moment there is any load | |
213 | on the CPU. This behaviour is more suitable in a battery powered | |
214 | environment. The governor is tweaked in the same manner as the | |
215 | "ondemand" governor through sysfs with the addition of: | |
216 | ||
217 | * freq_step: | |
218 | ||
219 | This describes what percentage steps the cpu freq should be increased | |
220 | and decreased smoothly by. By default the cpu frequency will increase | |
221 | in 5% chunks of your maximum cpu frequency. You can change this value | |
222 | to anywhere between 0 and 100 where '0' will effectively lock your CPU | |
223 | at a speed regardless of its load whilst '100' will, in theory, make | |
224 | it behave identically to the "ondemand" governor. | |
225 | ||
226 | * down_threshold: | |
227 | ||
228 | Same as the 'up_threshold' found for the "ondemand" governor but for | |
229 | the opposite direction. For example, when set to its default value of | |
230 | '20', the CPU usage needs to be below 20% between samples for the | |
231 | frequency to be decreased. | |
232 | ||
233 | * sampling_down_factor: | |
234 | ||
235 | Similar in function to the same tunable in the "ondemand" governor. | |
236 | But in "conservative", it controls the rate at which the kernel decides | |
237 | when to decrease the frequency while running at any speed. Load for | |
238 | frequency increase is still evaluated every sampling rate. | |
239 | ||
240 | ||
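
To illustrate how freq_step, up_threshold and down_threshold interact, here is a simplified model of a single "conservative" sampling decision. The function name and argument order are hypothetical. The sketch assumes the step is freq_step percent of the policy maximum, as described above, and it ignores sampling_down_factor and any rounding to frequencies the driver supports.

```shell
#!/bin/sh
# Hypothetical model of one "conservative" decision (all frequencies in kHz).
#   $1: current frequency    $2: policy min      $3: policy max
#   $4: measured load (%)    $5: up_threshold    $6: down_threshold
#   $7: freq_step (%)
conservative_next_khz() {
    cur=$1
    min=$2
    max=$3
    load=$4
    up=$5
    down=$6
    step_pct=$7

    # Step size is a percentage of the maximum frequency.
    step=$(( max * step_pct / 100 ))
    next=$cur

    if [ "$load" -gt "$up" ]; then
        next=$(( cur + step ))      # busy: go up by one step
    elif [ "$load" -lt "$down" ]; then
        next=$(( cur - step ))      # idle: go down by one step
    fi

    # Keep the result within the policy limits.
    if [ "$next" -gt "$max" ]; then next=$max; fi
    if [ "$next" -lt "$min" ]; then next=$min; fi
    echo "$next"
}
```

In this model a freq_step of 0 yields a step of 0 kHz, so the frequency never moves, which matches the "lock" behaviour described for that value.
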
241 | 2.6 Schedutil | |
242 | ------------- | |
243 | ||
244 | The "schedutil" governor aims at better integration with the Linux | |
245 | kernel scheduler. Load estimation is achieved through the scheduler's | |
246 | Per-Entity Load Tracking (PELT) mechanism, which also provides | |
247 | information about the recent load [1]. This governor currently does | |
248 | load based DVFS only for tasks managed by CFS. RT and DL scheduler tasks | |
249 | are always run at the highest frequency. Unlike all the other | |
250 | governors, the code is located under the kernel/sched/ directory. | |
251 | ||
252 | Sysfs files: | |
253 | ||
254 | * rate_limit_us: | |
255 | ||
256 | This contains a value in microseconds. The governor waits for | |
257 | rate_limit_us time before reevaluating the load again, after it has | |
258 | evaluated the load once. | |
259 | ||
260 | For an in-depth comparison with the other governors refer to [2]. | |
261 | ||
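
The effect of rate_limit_us can be pictured as a simple time gate. The function below is a hypothetical illustration of that idea only, with timestamps passed in as plain microsecond counts. It is not the scheduler's actual implementation.

```shell
#!/bin/sh
# Hypothetical illustration: succeed (exit status 0) only when at least
# rate_limit_us microseconds have passed since the last load evaluation.
#   $1: current time in us
#   $2: time of the last evaluation in us
#   $3: rate_limit_us
may_reevaluate() {
    now_us=$1
    last_us=$2
    rate_limit_us=$3
    [ $(( now_us - last_us )) -ge "$rate_limit_us" ]
}
```

For example, with a limit of 500 us, a request arriving 200 us after the previous evaluation is skipped, while one arriving 500 us or later goes through.
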
262 | ||
263 | 3. The Governor Interface in the CPUfreq Core | |
264 | ============================================= | |
265 | ||
266 | A new governor must register itself with the CPUfreq core using | |
267 | "cpufreq_register_governor". The struct cpufreq_governor, which has to | |
268 | be passed to that function, must contain the following values: | |
269 | ||
270 | governor->name - A unique name for this governor. | |
271 | governor->owner - THIS_MODULE for the governor module (if appropriate). | |
272 | ||
273 | plus a set of hooks to the functions implementing the governor's logic. | |
274 | ||
275 | The CPUfreq governor may call the CPU processor driver using one of | |
276 | these two functions: | |
277 | ||
278 | int cpufreq_driver_target(struct cpufreq_policy *policy, | |
279 | unsigned int target_freq, | |
280 | unsigned int relation); | |
281 | ||
282 | int __cpufreq_driver_target(struct cpufreq_policy *policy, | |
283 | unsigned int target_freq, | |
284 | unsigned int relation); | |
285 | ||
286 | target_freq must be within policy->min and policy->max, of course. | |
287 | What's the difference between these two functions? When your governor is | |
288 | in a direct code path of a call to governor callbacks, like | |
289 | governor->start(), the policy->rwsem is still held in the cpufreq core, | |
290 | and there's no need to lock it again (in fact, this would cause a | |
291 | deadlock). So use __cpufreq_driver_target only in these cases. In all | |
292 | other cases (for example, when there's a "daemonized" function that | |
293 | wakes up every second), use cpufreq_driver_target to take policy->rwsem | |
294 | before the command is passed to the cpufreq driver. | |
295 | ||
296 | 4. References | |
297 | ============= | |
298 | ||
299 | [1] Per-entity load tracking: https://lwn.net/Articles/531853/ | |
300 | [2] Improvements in CPU frequency management: https://lwn.net/Articles/682391/ | |
301 |