Commit | Line | Data |
---|---|---|
9f95a23c TL |
1 | .. SPDX-License-Identifier: BSD-3-Clause |
2 | Copyright(c) 2010-2014 Intel Corporation. | |
7c673cae FG |
3 | |
4 | Power Management | |
5 | ================ | |
6 | ||
7 | The DPDK Power Management feature allows user space applications to save power | |
8 | by dynamically adjusting CPU frequency or entering different C-States. | |
9 | ||
10 | * Adjusting the CPU frequency dynamically according to the utilization of the RX queue. | |
11 | ||
12 | * Entering deeper C-States, using adaptive algorithms that speculatively suspend | |
13 | the application for brief periods when no packets are received. | |
14 | ||
15 | The interfaces for adjusting the operating CPU frequency are in the power management library. | |
16 | C-State control is implemented in applications according to the different use cases. | |
17 | ||
18 | CPU Frequency Scaling | |
19 | --------------------- | |
20 | ||
21 | The Linux kernel provides a cpufreq module for CPU frequency scaling for each lcore. | |
22 | For example, for cpuX, /sys/devices/system/cpu/cpuX/cpufreq/ has the following sysfs files for frequency scaling: | |
23 | ||
24 | * affected_cpus | |
25 | ||
26 | * bios_limit | |
27 | ||
28 | * cpuinfo_cur_freq | |
29 | ||
30 | * cpuinfo_max_freq | |
31 | ||
32 | * cpuinfo_min_freq | |
33 | ||
34 | * cpuinfo_transition_latency | |
35 | ||
36 | * related_cpus | |
37 | ||
38 | * scaling_available_frequencies | |
39 | ||
40 | * scaling_available_governors | |
41 | ||
42 | * scaling_cur_freq | |
43 | ||
44 | * scaling_driver | |
45 | ||
46 | * scaling_governor | |
47 | ||
48 | * scaling_max_freq | |
49 | ||
50 | * scaling_min_freq | |
51 | ||
52 | * scaling_setspeed | |
53 | ||
54 | In the DPDK, scaling_governor is configured in user space. | |
55 | Then, a user space application can prompt the kernel by writing to scaling_setspeed to adjust the CPU frequency | |
56 | according to the strategies defined by the user space application. | |
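As a rough illustration of the sysfs interaction described above, the following sketch builds the path of a cpufreq attribute and writes a target frequency to scaling_setspeed. The helper names ``cpufreq_path`` and ``cpufreq_set_khz`` are hypothetical and are not part of the DPDK API. The sketch assumes that the userspace governor is already selected and that the process has permission to write the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build the sysfs path of one cpufreq attribute.
 * Returns 0 on success, -1 if the buffer is too small. */
int cpufreq_path(char *buf, size_t len, unsigned int cpu, const char *attr)
{
    int n;

    assert(buf != NULL && attr != NULL);
    n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/cpufreq/%s",
                 cpu, attr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: request a frequency (in kHz) for one CPU by
 * writing to scaling_setspeed. Assumes the userspace governor is active.
 * Returns 0 on success, -1 on any failure. */
int cpufreq_set_khz(unsigned int cpu, unsigned long khz)
{
    char path[128];
    FILE *f;
    int ok;

    if (cpufreq_path(path, sizeof(path), cpu, "scaling_setspeed") != 0)
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = fprintf(f, "%lu", khz) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would typically pick the value to write from the entries listed in scaling_available_frequencies for the same CPU.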
57 | ||
58 | Core-load Throttling through C-States | |
59 | ------------------------------------- | |
60 | ||
61 | Core state can be altered by speculative sleeps whenever the specified lcore has nothing to do. | |
62 | In the DPDK, if no packet is received after polling, | |
63 | speculative sleeps can be triggered according to the strategies defined by the user space application. | |
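One possible strategy can be sketched as follows: count consecutive empty polls and sleep for longer as the idle streak grows. The function names and all threshold values here are hypothetical, chosen only for illustration. A real application would tune them for its own traffic profile.

```c
#include <stdint.h>

/* Hypothetical tuning values, not taken from any DPDK application. */
#define IDLE_POLLS_BEFORE_SLEEP 10u  /* empty polls tolerated before sleeping */
#define LONG_IDLE_POLLS         100u /* idle streak that justifies a long sleep */
#define SHORT_SLEEP_US          1u
#define LONG_SLEEP_US           100u

/* Return how long to sleep (in microseconds) after `idle_polls`
 * consecutive empty polls. A return value of 0 means keep polling. */
uint32_t sleep_hint_us(uint32_t idle_polls)
{
    if (idle_polls < IDLE_POLLS_BEFORE_SLEEP)
        return 0;
    if (idle_polls < LONG_IDLE_POLLS)
        return SHORT_SLEEP_US;
    return LONG_SLEEP_US;
}

/* Update the consecutive-idle counter from the number of packets
 * returned by the latest poll: any received packet resets the streak. */
uint32_t update_idle_polls(uint32_t idle_polls, uint16_t nb_rx)
{
    if (nb_rx > 0)
        return 0;
    /* Saturate instead of wrapping around. */
    return idle_polls == UINT32_MAX ? idle_polls : idle_polls + 1;
}
```

The polling loop would call ``update_idle_polls`` after every receive attempt and sleep for ``sleep_hint_us`` microseconds whenever the result is non-zero.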
64 | ||
9f95a23c TL |
65 | Per-core Turbo Boost |
66 | -------------------- | |
67 | ||
68 | Individual cores can be allowed to enter a Turbo Boost state on a per-core | |
69 | basis. This is achieved by enabling Turbo Boost Technology in the BIOS, then | |
70 | looping through the relevant cores and enabling/disabling Turbo Boost on each | |
71 | core. | |
72 | ||
73 | Use of Power Library in a Hyper-Threaded Environment | |
74 | ---------------------------------------------------- | |
75 | ||
76 | In the case where the power library is in use on a system with Hyper-Threading enabled, | |
77 | the frequency on the physical core is set to the highest frequency of the Hyper-Thread siblings. | |
78 | So even though an application may request a scale down, the core frequency will | |
79 | remain at the highest frequency until all Hyper-Threads on that core request a scale down. | |
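The behaviour described above can be modelled as taking the maximum of the sibling requests. The function below is a toy model with a hypothetical name. It only illustrates the rule and does not query any hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model: the physical core runs at the highest frequency (kHz)
 * requested by any of its Hyper-Thread siblings. */
uint32_t physical_core_khz(const uint32_t *sibling_khz, size_t n)
{
    uint32_t max = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (sibling_khz[i] > max)
            max = sibling_khz[i];
    }
    return max;
}
```

For example, if one sibling requests 1200000 kHz while the other still requests 2400000 kHz, the model returns 2400000. The lower value only takes effect once both siblings request it.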
80 | ||
7c673cae FG |
81 | API Overview of the Power Library |
82 | --------------------------------- | |
83 | ||
84 | The main methods exported by the power library are for CPU frequency scaling and include the following: | |
85 | ||
86 | * **Freq up**: Prompt the kernel to scale up the frequency of the specific lcore. | |
87 | ||
88 | * **Freq down**: Prompt the kernel to scale down the frequency of the specific lcore. | |
89 | ||
90 | * **Freq max**: Prompt the kernel to scale up the frequency of the specific lcore to the maximum. | |
91 | ||
92 | * **Freq min**: Prompt the kernel to scale down the frequency of the specific lcore to the minimum. | |
93 | ||
94 | * **Get available freqs**: Read the available frequencies of the specific lcore from the sys file. | |
95 | ||
96 | * **Freq get**: Get the current frequency of the specific lcore. | |
97 | ||
98 | * **Freq set**: Prompt the kernel to set the frequency for the specific lcore. | |
99 | ||
9f95a23c TL |
100 | * **Enable turbo**: Prompt the kernel to enable Turbo Boost for the specific lcore. |
101 | ||
102 | * **Disable turbo**: Prompt the kernel to disable Turbo Boost for the specific lcore. | |
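To make the semantics of these operations concrete, the sketch below models a per-lcore frequency table in plain C. It is not the library's implementation, and every name in it is hypothetical. It assumes that the available frequencies are stored in descending order, so that index 0 is the highest frequency, and that the table holds at least one entry.

```c
#include <stdint.h>

#define MAX_FREQS 8u

/* Toy model of one lcore's frequency state (hypothetical type). */
struct lcore_freq_model {
    uint32_t khz[MAX_FREQS]; /* available frequencies, highest first */
    uint32_t count;          /* number of valid entries, at least 1 */
    uint32_t cur;            /* index of the current frequency */
};

/* Freq up: move one step towards index 0 (the highest frequency). */
void model_freq_up(struct lcore_freq_model *m)
{
    if (m->cur > 0)
        m->cur--;
}

/* Freq down: move one step towards the last (lowest) entry. */
void model_freq_down(struct lcore_freq_model *m)
{
    if (m->cur + 1 < m->count)
        m->cur++;
}

/* Freq max / Freq min: jump straight to either end of the table. */
void model_freq_max(struct lcore_freq_model *m) { m->cur = 0; }
void model_freq_min(struct lcore_freq_model *m) { m->cur = m->count - 1; }

/* Freq get: report the current frequency in kHz. */
uint32_t model_freq_get(const struct lcore_freq_model *m)
{
    return m->khz[m->cur];
}

/* Freq set: select an entry by index. Returns 0 on success, -1 if the
 * index is out of range. */
int model_freq_set(struct lcore_freq_model *m, uint32_t index)
{
    if (index >= m->count)
        return -1;
    m->cur = index;
    return 0;
}
```

In this model, scaling up at the highest frequency and scaling down at the lowest frequency are no-ops.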
103 | ||
7c673cae FG |
104 | Use Cases |
105 | --------- | |
106 | ||
107 | The power management mechanism is used to save power when performing L3 forwarding. | |
108 | ||
9f95a23c TL |
109 | |
110 | Empty Poll API | |
111 | -------------- | |
112 | ||
113 | Abstract | |
114 | ~~~~~~~~ | |
115 | ||
116 | For packet processing workloads such as DPDK, polling is continuous. | |
117 | This means CPU cores always show 100% busy, independent of how much work | |
118 | those cores are doing. Without an accurate measure of how busy a core is, | |
119 | the following problems arise: | |
120 | ||
121 | * There is no indication of overload conditions. | |
122 | * The user does not know how much real load is on the system, resulting | |
123 | in wasted energy because no power management is used. | |
124 | ||
125 | Compared to the original l3fwd-power design, instead of going to sleep | |
126 | after detecting an empty poll, the new mechanism just lowers the core frequency. | |
127 | As a result, the application does not stop polling the device, which leads | |
128 | to improved handling of bursts of traffic. | |
129 | ||
130 | When the system becomes busy, the empty poll mechanism can also increase the core | |
131 | frequency (including turbo) to handle intensive traffic on a best-effort basis. This gives | |
132 | more flexible and balanced traffic awareness than the standard l3fwd-power | |
133 | application. | |
134 | ||
135 | ||
136 | Proposed Solution | |
137 | ~~~~~~~~~~~~~~~~~ | |
138 | The proposed solution focuses on how many empty polls are executed. | |
139 | A low number of empty polls means the core is busy processing its | |
140 | workload, so a higher frequency is needed. A high number of empty polls | |
141 | indicates that the core is not doing any real work, so the frequency can be | |
142 | lowered to save power. | |
143 | ||
144 | In the current implementation, each core has one empty-poll counter, which assumes | |
145 | that one core is dedicated to one queue. This will need to be expanded in the future to | |
146 | support multiple queues per core. | |
147 | ||
148 | Power state definition: | |
149 | ^^^^^^^^^^^^^^^^^^^^^^^ | |
150 | ||
151 | * LOW: Not currently used, reserved for future use. | |
152 | ||
153 | * MED: the frequency used to process a modest traffic workload. | |
154 | ||
155 | * HIGH: the frequency used to process a busy traffic workload. | |
156 | ||
157 | There are two phases to establish the power management system: | |
158 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
159 | * Training phase. This phase is used to measure the optimal frequency | |
160 | change thresholds for a given system. The thresholds will differ from | |
161 | system to system due to differences in processor micro-architecture, | |
162 | cache and device configurations. | |
163 | In this phase, the user must ensure that no traffic can enter the | |
164 | system so that counts can be measured for empty polls at low, medium | |
165 | and high frequencies. Each frequency is measured for two seconds. | |
166 | Once the training phase is complete, the threshold numbers are | |
167 | displayed, normal mode resumes, and traffic can be allowed into | |
168 | the system. These threshold numbers can be passed on the command line | |
169 | when starting the application in normal mode to avoid re-training | |
170 | every time. | |
171 | ||
172 | * Normal phase. Every 10 ms the run-time counters are compared | |
173 | to the supplied threshold values, and a decision is made | |
174 | whether to move to a different power state (by adjusting the | |
175 | frequency). | |
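The comparison made in the normal phase can be sketched as below. This is a simplified illustration and not the library's detection algorithm. The enum, the function name and the two-threshold scheme are all assumptions. The sketch assumes that ``up_threshold`` is less than or equal to ``down_threshold``, so that the band between them acts as hysteresis, and it only switches between MED and HIGH because LOW is reserved.

```c
#include <stdint.h>

/* Hypothetical power states mirroring the LOW/MED/HIGH definitions. */
enum power_state { STATE_LOW, STATE_MED, STATE_HIGH };

/* Decide the next power state from the number of empty polls counted in
 * the last interval.
 *   - Few empty polls (below up_threshold): the core is busy, go HIGH.
 *   - Many empty polls (above down_threshold): the core is idle, go MED.
 *   - Otherwise: stay in the current state.
 * Assumes up_threshold <= down_threshold. */
enum power_state next_power_state(enum power_state cur,
                                  uint64_t empty_polls,
                                  uint64_t up_threshold,
                                  uint64_t down_threshold)
{
    if (empty_polls < up_threshold)
        return STATE_HIGH;
    if (empty_polls > down_threshold)
        return STATE_MED;
    return cur;
}
```

In such a scheme the two thresholds would come from the training phase or from the command line, and the per-interval counter would be reset after each decision.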
176 | ||
177 | API Overview for Empty Poll Power Management | |
178 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
179 | * **State Init**: initialize the power management system. | |
180 | ||
181 | * **State Free**: free the resources held by the power management system. | |
182 | ||
183 | * **Update Empty Poll Counter**: update the empty poll counter. | |
184 | ||
185 | * **Update Valid Poll Counter**: update the valid poll counter. | |
186 | ||
187 | * **Set the Frequency Index**: update the power state/frequency mapping. | |
188 | ||
189 | * **Detect empty poll state change**: run the empty poll state change detection algorithm and take action accordingly. | |
190 | ||
191 | Use Cases | |
192 | --------- | |
193 | The mechanism can be applied to any polling-based device, e.g. a NIC or an FPGA. | |
194 | ||
7c673cae FG |
195 | References |
196 | ---------- | |
197 | ||
9f95a23c TL |
198 | * The :doc:`../sample_app_ug/l3_forward_power_man` |
199 | chapter in the :doc:`../sample_app_ug/index` section. | |
7c673cae | 200 | |
9f95a23c TL |
201 | * The :doc:`../sample_app_ug/vm_power_management` |
202 | chapter in the :doc:`../sample_app_ug/index` section. |