]>
Commit | Line | Data |
---|---|---|
985794ad SH |
1 | .TH CBQ 8 "8 December 2001" "iproute2" "Linux" |
2 | .SH NAME | |
3 | CBQ \- Class Based Queueing | |
4 | .SH SYNOPSIS | |
5 | .B tc qdisc ... dev | |
6 | dev | |
7 | .B ( parent | |
8 | classid | |
9 | .B | root) [ handle | |
10 | major: | |
11 | .B ] cbq avpkt | |
12 | bytes | |
13 | .B bandwidth | |
14 | rate | |
15 | .B [ cell | |
16 | bytes | |
17 | .B ] [ ewma | |
18 | log | |
19 | .B ] [ mpu | |
20 | bytes | |
21 | .B ] | |
22 | ||
23 | .B tc class ... dev | |
24 | dev | |
25 | .B parent | |
26 | major:[minor] | |
27 | .B [ classid | |
28 | major:minor | |
29 | .B ] cbq allot | |
30 | bytes | |
31 | .B [ bandwidth | |
32 | rate | |
33 | .B ] [ rate | |
34 | rate | |
35 | .B ] prio | |
36 | priority | |
37 | .B [ weight | |
38 | weight | |
39 | .B ] [ minburst | |
40 | packets | |
41 | .B ] [ maxburst | |
42 | packets | |
43 | .B ] [ ewma | |
44 | log | |
45 | .B ] [ cell | |
46 | bytes | |
47 | .B ] avpkt | |
48 | bytes | |
49 | .B [ mpu | |
50 | bytes | |
51 | .B ] [ bounded isolated ] [ split | |
52 | handle | |
53 | .B & defmap | |
54 | defmap | |
55 | .B ] [ estimator | |
56 | interval timeconstant | |
57 | .B ] | |
58 | ||
59 | .SH DESCRIPTION | |
60 | Class Based Queueing is a classful qdisc that implements a rich | |
61 | linksharing hierarchy of classes. It contains shaping elements as | |
62 | well as prioritizing capabilities. Shaping is performed using link | |
63 | idle time calculations based on the timing of dequeue events and | |
64 | underlying link bandwidth. | |
65 | ||
66 | .SH SHAPING ALGORITHM | |
67 | Shaping is done using link idle time calculations, and actions taken if | |
68 | these calculations deviate from set limits. | |
69 | ||
70 | When shaping a 10mbit/s connection to 1mbit/s, the link will | |
71 | be idle 90% of the time. If it isn't, it needs to be throttled so that it | |
72 | IS idle 90% of the time. | |
73 | ||
74 | From the kernel's perspective, this is hard to measure, so CBQ instead | |
75 | derives the idle time from the number of microseconds (in fact, jiffies) | |
76 | that elapse between requests from the device driver for more data. Combined | |
77 | with the knowledge of packet sizes, this is used to approximate how full or | |
78 | empty the link is. | |
79 | ||
80 | This is rather circumspect and doesn't always arrive at proper | |
81 | results. For example, what is the actual link speed of an interface | |
82 | that is not really able to transmit the full 100mbit/s of data, | |
83 | perhaps because of a badly implemented driver? A PCMCIA network card | |
84 | will also never achieve 100mbit/s because of the way the bus is | |
85 | designed - again, how do we calculate the idle time? | |
86 | ||
87 | The physical link bandwidth may be ill defined in case of not-quite-real | |
88 | network devices like PPP over Ethernet or PPTP over TCP/IP. The effective | |
89 | bandwidth in that case is probably determined by the efficiency of pipes | |
90 | to userspace - which not defined. | |
91 | ||
92 | During operations, the effective idletime is measured using an | |
93 | exponential weighted moving average (EWMA), which considers recent | |
94 | packets to be exponentially more important than past ones. The Unix | |
95 | loadaverage is calculated in the same way. | |
96 | ||
97 | The calculated idle time is subtracted from the EWMA measured one, | |
98 | the resulting number is called 'avgidle'. A perfectly loaded link has | |
99 | an avgidle of zero: packets arrive exactly at the calculated | |
100 | interval. | |
101 | ||
102 | An overloaded link has a negative avgidle and if it gets too negative, | |
103 | CBQ throttles and is then 'overlimit'. | |
104 | ||
105 | Conversely, an idle link might amass a huge avgidle, which would then | |
106 | allow infinite bandwidths after a few hours of silence. To prevent | |
107 | this, avgidle is capped at | |
108 | .B maxidle. | |
109 | ||
110 | If overlimit, in theory, the CBQ could throttle itself for exactly the | |
111 | amount of time that was calculated to pass between packets, and then | |
112 | pass one packet, and throttle again. Due to timer resolution constraints, | |
113 | this may not be feasible, see the | |
114 | .B minburst | |
115 | parameter below. | |
116 | ||
117 | .SH CLASSIFICATION | |
118 | Within the one CBQ instance many classes may exist. Each of these classes | |
119 | contains another qdisc, by default | |
120 | .BR tc-pfifo (8). | |
121 | ||
122 | When enqueueing a packet, CBQ starts at the root and uses various methods to | |
123 | determine which class should receive the data. If a verdict is reached, this | |
124 | process is repeated for the recipient class which might have further | |
125 | means of classifying traffic to its children, if any. | |
126 | ||
127 | CBQ has the following methods available to classify a packet to any child | |
128 | classes. | |
129 | .TP | |
130 | (i) | |
131 | .B skb->priority class encoding. | |
132 | Can be set from userspace by an application with the | |
133 | .B SO_PRIORITY | |
134 | setsockopt. | |
135 | The | |
136 | .B skb->priority class encoding | |
137 | only applies if the skb->priority holds a major:minor handle of an existing | |
138 | class within this qdisc. | |
139 | .TP | |
140 | (ii) | |
141 | tc filters attached to the class. | |
142 | .TP | |
143 | (iii) | |
144 | The defmap of a class, as set with the | |
145 | .B split & defmap | |
146 | parameters. The defmap may contain instructions for each possible Linux packet | |
147 | priority. | |
148 | ||
149 | .P | |
150 | Each class also has a | |
151 | .B level. | |
152 | Leaf nodes, attached to the bottom of the class hierarchy, have a level of 0. | |
153 | .SH CLASSIFICATION ALGORITHM | |
154 | ||
155 | Classification is a loop, which terminates when a leaf class is found. At any | |
156 | point the loop may jump to the fallback algorithm. | |
157 | ||
158 | The loop consists of the following steps: | |
159 | .TP | |
160 | (i) | |
161 | If the packet is generated locally and has a valid classid encoded within its | |
162 | .B skb->priority, | |
163 | choose it and terminate. | |
164 | ||
165 | .TP | |
166 | (ii) | |
167 | Consult the tc filters, if any, attached to this child. If these return | |
168 | a class which is not a leaf class, restart loop from the class returned. | |
169 | If it is a leaf, choose it and terminate. | |
170 | .TP | |
171 | (iii) | |
172 | If the tc filters did not return a class, but did return a classid, | |
173 | try to find a class with that id within this qdisc. | |
174 | Check if the found class is of a lower | |
175 | .B level | |
176 | than the current class. If so, and the returned class is not a leaf node, | |
177 | restart the loop at the found class. If it is a leaf node, terminate. | |
178 | If we found an upward reference to a higher level, enter the fallback | |
179 | algorithm. | |
180 | .TP | |
181 | (iv) | |
182 | If the tc filters did not return a class, nor a valid reference to one, | |
183 | consider the minor number of the reference to be the priority. Retrieve | |
184 | a class from the defmap of this class for the priority. If this did not | |
185 | contain a class, consult the defmap of this class for the | |
186 | .B BEST_EFFORT | |
187 | class. If this is an upward reference, or no | |
188 | .B BEST_EFFORT | |
189 | class was defined, | |
190 | enter the fallback algorithm. If a valid class was found, and it is not a | |
191 | leaf node, restart the loop at this class. If it is a leaf, choose it and | |
192 | terminate. If | |
193 | neither the priority distilled from the classid, nor the | |
194 | .B BEST_EFFORT | |
195 | priority yielded a class, enter the fallback algorithm. | |
196 | .P | |
197 | The fallback algorithm resides outside of the loop and is as follows. | |
198 | .TP | |
199 | (i) | |
200 | Consult the defmap of the class at which the jump to fallback occured. If | |
201 | the defmap contains a class for the | |
202 | .B | |
203 | priority | |
204 | of the class (which is related to the TOS field), choose this class and | |
205 | terminate. | |
206 | .TP | |
207 | (ii) | |
208 | Consult the map for a class for the | |
209 | .B BEST_EFFORT | |
210 | priority. If found, choose it, and terminate. | |
211 | .TP | |
212 | (iii) | |
b096fa5f | 213 | Choose the class at which break out to the fallback algorithm occurred. Terminate. |
985794ad SH |
214 | .P |
215 | The packet is enqueued to the class which was chosen when either algorithm | |
216 | terminated. It is therefore possible for a packet to be enqueued *not* at a | |
217 | leaf node, but in the middle of the hierarchy. | |
218 | ||
219 | .SH LINK SHARING ALGORITHM | |
220 | When dequeuing for sending to the network device, CBQ decides which of its | |
221 | classes will be allowed to send. It does so with a Weighted Round Robin process | |
222 | in which each class with packets gets a chance to send in turn. The WRR process | |
223 | starts by asking the highest priority classes (lowest numerically - | |
224 | highest semantically) for packets, and will continue to do so until they | |
225 | have no more data to offer, in which case the process repeats for lower | |
226 | priorities. | |
227 | ||
228 | .B CERTAINTY ENDS HERE, ANK PLEASE HELP | |
229 | ||
230 | Each class is not allowed to send at length though - they can only dequeue a | |
231 | configurable amount of data during each round. | |
232 | ||
233 | If a class is about to go overlimit, and it is not | |
234 | .B bounded | |
235 | it will try to borrow avgidle from siblings that are not | |
236 | .B isolated. | |
237 | This process is repeated from the bottom upwards. If a class is unable | |
238 | to borrow enough avgidle to send a packet, it is throttled and not asked | |
239 | for a packet for enough time for the avgidle to increase above zero. | |
240 | ||
241 | .B I REALLY NEED HELP FIGURING THIS OUT. REST OF DOCUMENT IS PRETTY CERTAIN | |
242 | .B AGAIN. | |
243 | ||
244 | .SH QDISC | |
245 | The root qdisc of a CBQ class tree has the following parameters: | |
246 | ||
247 | .TP | |
248 | parent major:minor | root | |
249 | This mandatory parameter determines the place of the CBQ instance, either at the | |
250 | .B root | |
251 | of an interface or within an existing class. | |
252 | .TP | |
253 | handle major: | |
254 | Like all other qdiscs, the CBQ can be assigned a handle. Should consist only | |
255 | of a major number, followed by a colon. Optional. | |
256 | .TP | |
257 | avpkt bytes | |
258 | For calculations, the average packet size must be known. It is silently capped | |
259 | at a minimum of 2/3 of the interface MTU. Mandatory. | |
260 | .TP | |
261 | bandwidth rate | |
262 | To determine the idle time, CBQ must know the bandwidth of your underlying | |
263 | physical interface, or parent qdisc. This is a vital parameter, more about it | |
264 | later. Mandatory. | |
265 | .TP | |
266 | cell | |
267 | The cell size determines he granularity of packet transmission time calculations. Has a sensible default. | |
268 | .TP | |
269 | mpu | |
270 | A zero sized packet may still take time to transmit. This value is the lower | |
271 | cap for packet transmission time calculations - packets smaller than this value | |
272 | are still deemed to have this size. Defaults to zero. | |
273 | .TP | |
274 | ewma log | |
275 | When CBQ needs to measure the average idle time, it does so using an | |
276 | Exponentially Weighted Moving Average which smoothes out measurements into | |
277 | a moving average. The EWMA LOG determines how much smoothing occurs. Defaults | |
278 | to 5. Lower values imply greater sensitivity. Must be between 0 and 31. | |
279 | .P | |
280 | A CBQ qdisc does not shape out of its own accord. It only needs to know certain | |
281 | parameters about the underlying link. Actual shaping is done in classes. | |
282 | ||
283 | .SH CLASSES | |
284 | Classes have a host of parameters to configure their operation. | |
285 | ||
286 | .TP | |
287 | parent major:minor | |
288 | Place of this class within the hierarchy. If attached directly to a qdisc | |
289 | and not to another class, minor can be omitted. Mandatory. | |
290 | .TP | |
291 | classid major:minor | |
292 | Like qdiscs, classes can be named. The major number must be equal to the | |
293 | major number of the qdisc to which it belongs. Optional, but needed if this | |
294 | class is going to have children. | |
295 | .TP | |
296 | weight weight | |
297 | When dequeuing to the interface, classes are tried for traffic in a | |
298 | round-robin fashion. Classes with a higher configured qdisc will generally | |
299 | have more traffic to offer during each round, so it makes sense to allow | |
300 | it to dequeue more traffic. All weights under a class are normalized, so | |
301 | only the ratios matter. Defaults to the configured rate, unless the priority | |
302 | of this class is maximal, in which case it is set to 1. | |
303 | .TP | |
304 | allot bytes | |
305 | Allot specifies how many bytes a qdisc can dequeue | |
306 | during each round of the process. This parameter is weighted using the | |
307 | renormalized class weight described above. | |
308 | ||
309 | .TP | |
310 | priority priority | |
311 | In the round-robin process, classes with the lowest priority field are tried | |
312 | for packets first. Mandatory. | |
313 | ||
314 | .TP | |
315 | rate rate | |
316 | Maximum rate this class and all its children combined can send at. Mandatory. | |
317 | ||
318 | .TP | |
319 | bandwidth rate | |
320 | This is different from the bandwidth specified when creating a CBQ disc. Only | |
321 | used to determine maxidle and offtime, which are only calculated when | |
322 | specifying maxburst or minburst. Mandatory if specifying maxburst or minburst. | |
323 | ||
324 | .TP | |
325 | maxburst | |
326 | This number of packets is used to calculate maxidle so that when | |
327 | avgidle is at maxidle, this number of average packets can be burst | |
328 | before avgidle drops to 0. Set it higher to be more tolerant of | |
329 | bursts. You can't set maxidle directly, only via this parameter. | |
330 | ||
331 | .TP | |
332 | minburst | |
333 | As mentioned before, CBQ needs to throttle in case of | |
334 | overlimit. The ideal solution is to do so for exactly the calculated | |
335 | idle time, and pass 1 packet. However, Unix kernels generally have a | |
336 | hard time scheduling events shorter than 10ms, so it is better to | |
337 | throttle for a longer period, and then pass minburst packets in one | |
338 | go, and then sleep minburst times longer. | |
339 | ||
340 | The time to wait is called the offtime. Higher values of minburst lead | |
341 | to more accurate shaping in the long term, but to bigger bursts at | |
342 | millisecond timescales. | |
343 | ||
344 | .TP | |
345 | minidle | |
346 | If avgidle is below 0, we are overlimits and need to wait until | |
347 | avgidle will be big enough to send one packet. To prevent a sudden | |
348 | burst from shutting down the link for a prolonged period of time, | |
349 | avgidle is reset to minidle if it gets too low. | |
350 | ||
351 | Minidle is specified in negative microseconds, so 10 means that | |
352 | avgidle is capped at -10us. | |
353 | ||
354 | .TP | |
355 | bounded | |
356 | Signifies that this class will not borrow bandwidth from its siblings. | |
357 | .TP | |
358 | isolated | |
359 | Means that this class will not borrow bandwidth to its siblings | |
360 | ||
361 | .TP | |
362 | split major:minor & defmap bitmap[/bitmap] | |
363 | If consulting filters attached to a class did not give a verdict, | |
364 | CBQ can also classify based on the packet's priority. There are 16 | |
365 | priorities available, numbered from 0 to 15. | |
366 | ||
367 | The defmap specifies which priorities this class wants to receive, | |
368 | specified as a bitmap. The Least Significant Bit corresponds to priority | |
369 | zero. The | |
370 | .B split | |
371 | parameter tells CBQ at which class the decision must be made, which should | |
372 | be a (grand)parent of the class you are adding. | |
373 | ||
374 | As an example, 'tc class add ... classid 10:1 cbq .. split 10:0 defmap c0' | |
375 | configures class 10:0 to send packets with priorities 6 and 7 to 10:1. | |
376 | ||
377 | The complimentary configuration would then | |
378 | be: 'tc class add ... classid 10:2 cbq ... split 10:0 defmap 3f' | |
379 | Which would send all packets 0, 1, 2, 3, 4 and 5 to 10:1. | |
380 | .TP | |
381 | estimator interval timeconstant | |
382 | CBQ can measure how much bandwidth each class is using, which tc filters | |
383 | can use to classify packets with. In order to determine the bandwidth | |
384 | it uses a very simple estimator that measures once every | |
385 | .B interval | |
386 | microseconds how much traffic has passed. This again is a EWMA, for which | |
387 | the time constant can be specified, also in microseconds. The | |
388 | .B time constant | |
389 | corresponds to the sluggishness of the measurement or, conversely, to the | |
390 | sensitivity of the average to short bursts. Higher values mean less | |
391 | sensitivity. | |
392 | ||
393 | ||
394 | ||
395 | .SH SOURCES | |
396 | .TP | |
397 | o | |
398 | Sally Floyd and Van Jacobson, "Link-sharing and Resource | |
399 | Management Models for Packet Networks", | |
400 | IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995 | |
401 | ||
402 | .TP | |
403 | o | |
404 | Sally Floyd, "Notes on CBQ and Guarantee Service", 1995 | |
405 | ||
406 | .TP | |
407 | o | |
408 | Sally Floyd, "Notes on Class-Based Queueing: Setting | |
409 | Parameters", 1996 | |
410 | ||
411 | .TP | |
412 | o | |
413 | Sally Floyd and Michael Speer, "Experimental Results | |
414 | for Class-Based Queueing", 1998, not published. | |
415 | ||
416 | ||
417 | ||
418 | .SH SEE ALSO | |
419 | .BR tc (8) | |
420 | ||
421 | .SH AUTHOR | |
422 | Alexey N. Kuznetsov, <kuznet@ms2.inr.ac.ru>. This manpage maintained by | |
423 | bert hubert <ahu@ds9a.nl> | |
424 | ||
425 |