Commit | Line | Data |
---|---|---|
5f4d27d5 PS |
1 | \documentclass[12pt,twoside]{article} |
2 | ||
3 | \usepackage[hidelinks]{hyperref} % \url | |
4 | \usepackage{booktabs} % nicer tabulars | |
5 | \usepackage{fancyvrb} | |
6 | \usepackage{fullpage} | |
7 | \usepackage{float} | |
8 | ||
9 | \newcommand{\iface}{\textit} | |
10 | \newcommand{\cmd}{\texttt} | |
11 | \newcommand{\man}{\textit} | |
12 | \newcommand{\qdisc}{\texttt} | |
13 | \newcommand{\filter}{\texttt} | |
14 | ||
15 | \begin{document} | |
16 | \title{QoS in Linux with TC and Filters} | |
17 | \author{Phil Sutter (phil@nwl.cc)} | |
18 | \date{January 2016} | |
19 | \maketitle | |
20 | ||
5f4d27d5 PS |
21 | Standard practice when transmitting packets over a medium which may block (due |
22 | to congestion, e.g.) is to use a queue which temporarily holds these packets. In | |
23 | Linux, this queueing approach is where QoS happens: A Queueing Discipline | |
24 | (qdisc) holds multiple packet queues with different priorities for dequeueing to | |
25 | the network driver. The classification (i.e. deciding which queue a packet | |
26 | should go into) is typically done based on Type Of Service (IPv4) or Traffic | |
27 | Class (IPv6) header fields but depending on qdisc implementation, might be | |
28 | controlled by the user as well. | |
29 | ||
30 | Qdiscs come in two flavors, classful or classless. While classless qdiscs are | |
31 | not as flexible as classful ones, they also require much less customizing. Often | |
32 | it is enough to just attach them to an interface, without exact knowledge of | |
33 | what is done internally. Classful qdiscs are the exact opposite: flexible in | |
34 | application, they are often not even usable without insightful configuration. | |
35 | ||
36 | As the name implies, classful qdiscs provide configurable classes to sort | |
37 | traffic into. In its basic form, this is not much different from, say, the | |
38 | classless \qdisc{pfifo\_fast}, which holds three queues and classifies each | |
39 | packet based on its priority field. Typically though, classes go beyond that by | |
40 | supporting nesting and additional characteristics such as maximum traffic | |
41 | rate or quantum. | |
42 | ||
43 | When it comes to controlling the classification process, filters come into play. | |
44 | They attach to the parent of a set of classes (i.e. either the qdisc itself or | |
45 | a parent class) and specify what a packet (or its associated flow) has to look | |
46 | like in order to suit a given class. To overcome this simplification, it is | |
47 | possible to attach multiple filters to the same parent, which then consults each | |
48 | of them in turn until the first one accepts the packet. | |
49 | ||
50 | Before getting into detail about what filters there are and how to use them, a | |
51 | simple setup of a qdisc with classes is necessary: | |
52 | \begin{figure}[H] | |
53 | \begin{Verbatim} | |
54 | .-------------------------------------------------------. | |
55 | | | | |
56 | | HTB | | |
57 | | | | |
58 | | .----------------------------------------------------.| | |
59 | | | || | |
60 | | | Class 1:1 || | |
61 | | | || | |
62 | | | .---------------..---------------..---------------.|| | |
63 | | | | || || ||| | |
64 | | | | Class 1:10 || Class 1:20 || Class 1:30 ||| | |
65 | | | | || || ||| | |
66 | | | | .------------.|| .------------.|| .------------.||| | |
67 | | | | | ||| | ||| | |||| | |
68 | | | | | fq_codel ||| | fq_codel ||| | fq_codel |||| | |
69 | | | | | ||| | ||| | |||| | |
70 | | | | '------------'|| '------------'|| '------------'||| | |
71 | | | '---------------''---------------''---------------'|| | |
72 | | '----------------------------------------------------'| | |
73 | '-------------------------------------------------------' | |
74 | \end{Verbatim} | |
75 | \end{figure} | |
76 | \noindent | |
77 | The following commands establish the basic setup shown: | |
78 | \begin{Verbatim} | |
79 | (1) # tc qdisc replace dev eth0 root handle 1: htb default 30 | |
80 | (2) # tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit | |
81 | (3) # alias tclass='tc class add dev eth0 parent 1:1' | |
82 | (4) # tclass classid 1:10 htb rate 1mbit ceil 20mbit prio 1 | |
83 | (4) # tclass classid 1:20 htb rate 90mbit ceil 95mbit prio 2 | |
84 | (4) # tclass classid 1:30 htb rate 1mbit ceil 95mbit prio 3 | |
85 | (5) # tc qdisc add dev eth0 parent 1:10 fq_codel | |
86 | (5) # tc qdisc add dev eth0 parent 1:20 fq_codel | |
87 | (5) # tc qdisc add dev eth0 parent 1:30 fq_codel | |
88 | \end{Verbatim} | |
89 | A little explanation for the unfamiliar reader: | |
90 | \begin{enumerate} | |
91 | \item Replace the root qdisc of \iface{eth0} by an instance of \qdisc{HTB}. | |
92 | Specifying the handle is necessary so it can be referenced in subsequent | |
93 | calls to \cmd{tc}. The default class for unclassified traffic is set to | |
94 | 30. | |
95 | \item Create a single top-level class with handle 1:1 which limits the total | |
96 | bandwidth allowed to 95mbit/s. It is assumed that \iface{eth0} is a 100mbit/s link; | |
97 | staying a little below that helps to keep the main point of enqueueing in | |
98 | the qdisc layer instead of the interface hardware queue or at another | |
99 | bottleneck in the network. | |
100 | \item Define an alias for the common part of the remaining three calls in order | |
101 | to improve readability. This means all remaining classes are attached to the | |
102 | common parent class from (2). | |
103 | \item Create three child classes for different uses: Class 1:10 has highest | |
104 | priority but is tightly limited in bandwidth - fine for interactive | |
105 | connections. Class 1:20 has mid priority and high guaranteed bandwidth, for | |
106 | high priority bulk traffic. Finally, there's the default class 1:30 with | |
107 | lowest priority, low guaranteed bandwidth and the ability to use the full | |
108 | link in case it's unused otherwise. This should be fine for uninteresting | |
109 | traffic not explicitly taken care of. | |
110 | \item Attach a leaf qdisc to each of the child classes created in (4). Since | |
111 | \qdisc{HTB} by default attaches \qdisc{pfifo} as leaf qdisc, this step is optional. Still, | |
112 | the fairness between different flows provided by the classless \qdisc{fq\_codel} is | |
113 | worth the effort. | |
114 | \end{enumerate} | |
115 | More information about the qdiscs and fine-tuning parameters can be found in | |
116 | \man{tc-htb(8)} and \man{tc-fq\_codel(8)}. | |
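
As a quick sanity check of the numbers used above, the following sketch sums up the guaranteed rates of the three child classes and compares the result to the rate of their parent class 1:1. It is commonly recommended for \qdisc{HTB} that children do not guarantee more than their parent provides. The helper \texttt{rates\_fit} is hypothetical and only does arithmetic, it does not talk to the kernel:

```shell
#!/bin/sh
# Illustration only: check that the guaranteed rates of the child classes
# fit into the rate of their parent. The values (in mbit) are copied from the
# setup above; rates_fit is a hypothetical helper, not part of tc.
rates_fit() {
    parent=$1; shift
    sum=0
    for rate in "$@"; do
        sum=$(( sum + rate ))
    done
    printf 'children: %dmbit, parent: %dmbit\n' "$sum" "$parent"
    [ "$sum" -le "$parent" ]
}

# parent 1:1 has 95mbit; children 1:10, 1:20 and 1:30 guarantee 1, 90 and 1mbit
rates_fit 95 1 90 1 && echo "guaranteed rates fit"
```

The \texttt{ceil} values are deliberately left out here: they only limit how much a class may borrow and are not guarantees.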
117 | ||
118 | Without any additional setup done, now all traffic leaving \iface{eth0} is shaped to | |
119 | 95mbit/s and directed through class 1:30. This can be verified by looking at the | |
120 | \texttt{Sent} field of the class statistics printed via \cmd{tc -s class show dev eth0}: | |
121 | Only the root class 1:1 and its child 1:30 should show any traffic. | |
122 | ||
123 | ||
124 | \section*{Finally time to start filtering!} | |
125 | ||
126 | Let's begin with a simple one, i.e. reestablishing what \qdisc{pfifo\_fast} did | |
127 | automatically based on the TOS/Priority field. Linux internally translates the | |
128 | header field into the priority field of \texttt{struct sk\_buff}, which | |
129 | \qdisc{pfifo\_fast} uses for | |
130 | classification. \man{tc-prio(8)} contains a table listing the priority (and | |
131 | ultimately, \qdisc{pfifo\_fast} queue index) each TOS value is being translated into. | |
132 | Here is a shorter version: | |
133 | \begin{center} | |
134 | \begin{tabular}{lll} | |
135 | TOS Values & Linux Priority (Number) & Queue Index \\ | |
136 | \midrule | |
137 | 0x0 - 0x6 & Best Effort (0) & 1 \\ | |
138 | 0x8 - 0xe & Bulk (2) & 2 \\ | |
139 | 0x10 - 0x16 & Interactive (6) & 0 \\ | |
140 | 0x18 - 0x1e & Interactive Bulk (4) & 1 \\ | |
141 | \end{tabular} | |
142 | \end{center} | |
143 | Using the \filter{basic} filter, it is possible to match packets based on that skbuff | |
144 | field, which has the added benefit of being IP version agnostic. Since the | |
145 | \qdisc{HTB} setup above defaults to class ID 1:30, the Bulk priority can be | |
146 | ignored. The \filter{basic} filter allows combining matches, so we get along | |
147 | with only two filters: | |
148 | \begin{Verbatim} | |
149 | # tc filter add dev eth0 parent 1: basic \ | |
150 | match 'meta(priority eq 6)' classid 1:10 | |
151 | # tc filter add dev eth0 parent 1: basic \ | |
152 | match 'meta(priority eq 0)' \ | |
153 | or 'meta(priority eq 4)' classid 1:20 | |
154 | \end{Verbatim} | |
155 | A detailed description of the \filter{basic} filter and the ematch syntax it uses can be | |
156 | found in \man{tc-basic(8)} and \man{tc-ematch(8)}. | |
157 | ||
158 | Obviously, this first example cries for optimization. A simple one would be to | |
159 | just change the default class from 1:30 to 1:20, so filters are only needed for | |
160 | Bulk and Interactive priorities: | |
161 | \begin{Verbatim} | |
162 | # tc filter add dev eth0 parent 1: basic \ | |
163 | match 'meta(priority eq 6)' classid 1:10 | |
164 | # tc filter add dev eth0 parent 1: basic \ | |
165 | match 'meta(priority eq 2)' classid 1:30 | |
166 | \end{Verbatim} | |
167 | Given that class IDs are arbitrary, choosing them wisely allows for a direct | |
168 | mapping. So first, recreate the qdisc and classes configuration: | |
169 | \begin{Verbatim} | |
170 | # tc qdisc replace dev eth0 root handle 1: htb default 10 | |
171 | # tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit | |
172 | # alias tclass='tc class add dev eth0 parent 1:1' | |
173 | # tclass classid 1:16 htb rate 1mbit ceil 20mbit prio 1 | |
174 | # tclass classid 1:10 htb rate 90mbit ceil 95mbit prio 2 | |
175 | # tclass classid 1:12 htb rate 1mbit ceil 95mbit prio 3 | |
176 | # tc qdisc add dev eth0 parent 1:16 fq_codel | |
177 | # tc qdisc add dev eth0 parent 1:10 fq_codel | |
178 | # tc qdisc add dev eth0 parent 1:12 fq_codel | |
179 | \end{Verbatim} | |
180 | This is basically identical to above, but with changed leaf class IDs and the | |
181 | second priority class being the default. Using the \filter{flow} filter with its \texttt{map} | |
182 | functionality, a single filter command is enough: | |
183 | \begin{Verbatim} | |
184 | # tc filter add dev eth0 parent 1: handle 0x1337 flow \ | |
185 | map key priority baseclass 1:10 | |
186 | \end{Verbatim} | |
187 | The \filter{flow} filter now uses the priority value to construct a destination class ID | |
188 | by adding it to the value of \texttt{baseclass}. While this works for priority values of | |
189 | 0, 2 and 6, it will result in non-existent class ID 1:14 for Interactive Bulk | |
190 | traffic. In that case, the \qdisc{HTB} default applies so that traffic goes into class | |
191 | ID 1:10 just as intended. Please note that specifying a handle is a mandatory | |
192 | requirement by the \filter{flow} filter, although I didn't see where one would use that | |
193 | later. For more information about \filter{flow}, see \man{tc-flow(8)}. | |
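
The arithmetic behind \texttt{map} can be retraced with a few lines of shell. This is only a sketch of the addition described above, not of the kernel implementation; the helper \texttt{flow\_map\_classid} is hypothetical. Keep in mind that the minor part of a class ID is a hexadecimal number:

```shell
#!/bin/sh
# Illustration only: mimic how "flow map key priority baseclass 1:10"
# derives the destination class ID. flow_map_classid is a hypothetical
# helper, not part of tc.
flow_map_classid() {
    major=$1       # major number of the class ID, e.g. 1
    base_minor=$2  # minor number of baseclass in hexadecimal, e.g. 10
    priority=$3    # decimal packet priority
    printf '%x:%x\n' "$major" "$(( 0x$base_minor + priority ))"
}

for prio in 0 2 4 6; do
    printf 'priority %d -> class %s\n' "$prio" "$(flow_map_classid 1 10 "$prio")"
done
```

For priority 4 this yields 1:14, which is the non-existent class mentioned above.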
194 | ||
195 | While \filter{flow} and \filter{basic} filters are relatively easy to apply and understand, they | |
196 | are also quite limited to their intended purpose. A more flexible option is | |
197 | the \filter{u32} filter, which allows matching on arbitrary parts of the packet data - | |
198 | yet only on that, not any meta data associated with it by the kernel (with the | |
199 | exception of firewall mark value). So in order to continue this little | |
200 | exercise with \filter{u32}, we have to base classification directly upon the actual TOS | |
201 | value. An intuitive attempt might look like this: | |
202 | \begin{Verbatim} | |
203 | # alias tcfilter='tc filter add dev eth0 parent 1:' | |
204 | # tcfilter u32 match ip dsfield 0x10 0x1e classid 1:16 | |
205 | # tcfilter u32 match ip dsfield 0x12 0x1e classid 1:16 | |
206 | # tcfilter u32 match ip dsfield 0x14 0x1e classid 1:16 | |
207 | # tcfilter u32 match ip dsfield 0x16 0x1e classid 1:16 | |
208 | # tcfilter u32 match ip dsfield 0x8 0x1e classid 1:12 | |
209 | # tcfilter u32 match ip dsfield 0xa 0x1e classid 1:12 | |
210 | # tcfilter u32 match ip dsfield 0xc 0x1e classid 1:12 | |
211 | # tcfilter u32 match ip dsfield 0xe 0x1e classid 1:12 | |
212 | \end{Verbatim} | |
213 | The obvious drawback here is the number of filters needed. And without the | |
214 | default class, eight more filters would be necessary. This also has performance | |
215 | implications: A packet with TOS value 0xe will be checked eight times in total | |
216 | in order to determine its destination class. While there's not much to be done | |
217 | about the number of filters, at least the performance problem can be eliminated | |
218 | by using \filter{u32}'s hash table support: | |
219 | \begin{Verbatim} | |
220 | # tc filter add dev eth0 parent 1: prio 99 handle 1: u32 divisor 16 | |
221 | \end{Verbatim} | |
222 | This creates a hash table with 16 buckets. The table size is arbitrary, but not | |
223 | random: Since the lowest bit of the TOS field is not interesting, it can be | |
224 | ignored and therefore the range of values to consider is just [0;15], i.e. a | |
225 | number of 16 different values. The next step is to populate the hash table: | |
226 | \begin{Verbatim} | |
227 | # alias tcfilter='tc filter add dev eth0 parent 1: prio 99' | |
228 | # tcfilter u32 match u8 0 0 ht 1:0: classid 1:10 | |
229 | # tcfilter u32 match u8 0 0 ht 1:1: classid 1:10 | |
230 | # tcfilter u32 match u8 0 0 ht 1:2: classid 1:10 | |
231 | # tcfilter u32 match u8 0 0 ht 1:3: classid 1:10 | |
232 | # tcfilter u32 match u8 0 0 ht 1:4: classid 1:12 | |
233 | # tcfilter u32 match u8 0 0 ht 1:5: classid 1:12 | |
234 | # tcfilter u32 match u8 0 0 ht 1:6: classid 1:12 | |
235 | # tcfilter u32 match u8 0 0 ht 1:7: classid 1:12 | |
236 | # tcfilter u32 match u8 0 0 ht 1:8: classid 1:16 | |
237 | # tcfilter u32 match u8 0 0 ht 1:9: classid 1:16 | |
238 | # tcfilter u32 match u8 0 0 ht 1:a: classid 1:16 | |
239 | # tcfilter u32 match u8 0 0 ht 1:b: classid 1:16 | |
240 | # tcfilter u32 match u8 0 0 ht 1:c: classid 1:10 | |
241 | # tcfilter u32 match u8 0 0 ht 1:d: classid 1:10 | |
242 | # tcfilter u32 match u8 0 0 ht 1:e: classid 1:10 | |
243 | # tcfilter u32 match u8 0 0 ht 1:f: classid 1:10 | |
244 | \end{Verbatim} | |
245 | The parameter \texttt{ht} denotes the hash table and bucket the filter should be added | |
246 | to. Since the lowest TOS bit is ignored, the TOS value has to be divided by two in | |
247 | order to get to the bucket it maps to. E.g. a TOS value of 0x10 will therefore | |
248 | map to bucket 0x8. For the sake of completeness, all possible values are mapped | |
249 | and therefore a configurable default class is not required. Note that the match | |
250 | expression used here serves no purpose, but the syntax requires one. Therefore anything that | |
251 | matches any packet will suffice. Finally, a filter which links to the defined | |
252 | hash table is needed: | |
253 | \begin{Verbatim} | |
254 | # tc filter add dev eth0 parent 1: prio 1 protocol ip u32 \ | |
255 | link 1: hashkey mask 0x001e0000 match u8 0 0 | |
256 | \end{Verbatim} | |
257 | Here again, the actual match statement is not necessary, but syntactically | |
258 | required. All the magic lies within the \texttt{hashkey} parameter, which defines which | |
259 | part of the packet should be used directly as hash key. Here's a drawing of the | |
260 | first four bytes of the IPv4 header, with the area selected by \texttt{hashkey mask} | |
261 | highlighted: | |
262 | \begin{figure}[H] | |
263 | \begin{Verbatim} | |
264 | 0 1 2 3 | |
265 | .-----------------------------------------------------------------. | |
266 | | | | ######## | | | | |
267 | | Version| IHL | #DSCP### | ECN| Total Length | | |
268 | | | | ######## | | | | |
269 | '-----------------------------------------------------------------' | |
270 | \end{Verbatim} | |
271 | \end{figure} | |
272 | \noindent | |
273 | This may look confusing at first, but keep in mind that bit- as well as | |
274 | byte-ordering here is LSB while the mask value is written in MSB we humans use. | |
275 | Therefore reading the mask is done like so, starting from left: | |
276 | \begin{enumerate} | |
277 | \item Skip the first byte (which contains Version and IHL fields). | |
278 | \item Skip the lowest bit of the second byte (0x1e is even). | |
279 | \item Mark the four following bits (0x1e is 11110 in binary). | |
280 | \item Skip the remaining three bits of the second byte as well as the remaining two | |
281 | bytes. | |
282 | \end{enumerate} | |
283 | Before doing the lookup, the kernel right-shifts the masked value by the amount | |
284 | of zero-bits in \texttt{mask}, which implicitly also does the division by two which the | |
285 | hash table depends on. With this setup, every packet has to pass exactly two | |
286 | filters to be classified. Note that this filter is limited to IPv4 packets: Due | |
287 | to the related Traffic Class field being at a different offset in the packet, it | |
288 | would not work for IPv6. To use the same setup for IPv6 as well, a second | |
289 | entry-level filter is necessary: | |
290 | \begin{Verbatim} | |
291 | # tc filter add dev eth0 parent 1: prio 2 protocol ipv6 u32 \ | |
292 | link 1: hashkey mask 0x01e00000 match u8 0 0 | |
293 | \end{Verbatim} | |
294 | For illustration purposes, here again is a drawing of the first four bytes of | |
295 | the IPv6 header, again with masked area highlighted: | |
296 | \begin{figure}[H] | |
297 | \begin{Verbatim} | |
298 | 0 1 2 3 | |
299 | .-----------------------------------------------------------------. | |
300 | | | ######## | | | |
301 | | Version| #Traffic Class| Flow Label | | |
302 | | | ######## | | | |
303 | '-----------------------------------------------------------------' | |
304 | \end{Verbatim} | |
305 | \end{figure} | |
306 | \noindent | |
307 | Reading the mask value is analogous to IPv4 with the added complexity that | |
308 | Traffic Class spans over two bytes. Yet, for comparison there's a simple trick: | |
309 | IPv6 has the interesting field shifted by four bits to the left, and the new | |
310 | mask's value is shifted by the same amount. For further information about | |
311 | \filter{u32} and what can be done with it, consult its man page | |
312 | \man{tc-u32(8)}. | |
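
To retrace the bucket selection by hand, the masking and shifting described above can be sketched in shell. This follows the description in this text only, it is not taken from the kernel sources. The helper \texttt{u32\_bucket} is hypothetical and the sample header words are made up for illustration:

```shell
#!/bin/sh
# Illustration only: reproduce the bucket selection described above.
# u32_bucket is a hypothetical helper; the sample header words are made up.
u32_bucket() {
    word=$(( $1 ))   # first 32 bits of the header, written MSB first
    mask=$(( $2 ))   # value passed to "hashkey mask"
    [ "$mask" -ne 0 ] || return 1
    shift_by=0
    # Count the zero bits below the lowest set bit of the mask.
    while [ $(( (mask >> shift_by) & 1 )) -eq 0 ]; do
        shift_by=$(( shift_by + 1 ))
    done
    printf '%x\n' $(( (word & mask) >> shift_by ))
}

# IPv4: version 4, IHL 5, TOS 0x10, total length 84
u32_bucket 0x45100054 0x001e0000
# IPv6: version 6, Traffic Class 0x10, flow label 0
u32_bucket 0x61000000 0x01e00000
```

Both calls print 8, i.e. the bucket a TOS or Traffic Class value of 0x10 maps to in the hash table populated earlier.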
313 | ||
314 | Of course, the kernel provides many more filters than just \filter{basic}, | |
315 | \filter{flow} and \filter{u32} which have been presented above. As of now, the | |
316 | remaining ones are: | |
317 | \begin{description} | |
318 | \item[bpf] | |
319 | Filtering using Berkeley Packet Filter programs. The program's return | |
320 | code determines the packet's destination class ID. | |
321 | ||
322 | \item[cgroup] | |
323 | Filter packets based on control groups. This is only useful for packets | |
324 | originating from the local host, as control groups only exist in that | |
325 | scope. | |
326 | ||
327 | \item[flower] | |
328 | An extended variant of the flow filter. | |
329 | ||
330 | \item[fw] | |
331 | Matches on firewall mark values previously assigned to the packet by | |
332 | netfilter (or a filter action, see below for details). This allows | |
333 | exporting the classification algorithm to netfilter, which is very | |
334 | convenient if appropriate rules exist on the same system in there | |
335 | already. | |
336 | ||
337 | \item[route] | |
338 | Filter packets based on matching routing table entry. Basically | |
339 | equivalent to the \texttt{fw} filter above, to make use of an already existing | |
340 | extensive routing table setup. | |
341 | ||
342 | \item[rsvp, rsvp6] | |
343 | Implementation of the Resource Reservation Protocol in Linux, to react | |
344 | upon requests sent by an RSVP daemon. | |
345 | ||
346 | \item[tcindex] | |
347 | Match packets based on tcindex value, which is usually set by the dsmark | |
348 | qdisc. This is part of an approach to support Differentiated Services in | |
349 | Linux, which is a topic of its own. | |
350 | \end{description} | |
351 | ||
352 | ||
353 | \section*{Filter Actions} | |
354 | ||
355 | The tc filter framework provides the infrastructure for another extensible set of | |
356 | tools as well, namely tc actions. As the name suggests, they allow doing things | |
357 | with packets (or associated data). A list of actions is part of a given | |
358 | filter. If it matches, each action it contains is executed in order before | |
359 | returning the classification result. Since the action has direct access to the | |
360 | latter, it is in theory possible for an action to react upon or even change the | |
361 | filtering result - as long as the packet matched, of course. Yet none of the | |
362 | currently in-tree actions make use of this. | |
363 | ||
364 | The Generic Actions framework originally evolved out of the filters' ability to | |
365 | police traffic to a given maximum bandwidth. One common use case for that is to | |
366 | limit ingress traffic, dropping packets which exceed the threshold. A classic | |
367 | setup example is like so: | |
368 | \begin{Verbatim} | |
369 | # tc qdisc add dev eth0 handle ffff: ingress | |
370 | # tc filter add dev eth0 parent ffff: u32 \ | |
371 | match u32 0 0 \ | |
372 | police rate 1mbit burst 100k | |
373 | \end{Verbatim} | |
374 | The ingress qdisc is not a real one, but merely a point of reference for filters | |
375 | to attach to which should get applied to incoming traffic. The \filter{u32} filter added | |
376 | above matches on any packet and therefore limits the total incoming bandwidth to | |
377 | 1mbit/s, allowing bursts of up to 100kbytes. Using the new syntax, the filter | |
378 | command changes slightly: | |
379 | \begin{Verbatim} | |
380 | # tc filter add dev eth0 parent ffff: u32 \ | |
381 | match u32 0 0 \ | |
382 | action police rate 1mbit burst 100k | |
383 | \end{Verbatim} | |
384 | The important detail is that this syntax allows defining multiple actions. | |
385 | E.g. for testing purposes, it is possible to redirect exceeding traffic to the | |
386 | loopback interface instead of dropping it: | |
387 | \begin{Verbatim} | |
388 | # tc filter add dev eth0 parent ffff: u32 \ | |
389 | match u32 0 0 \ | |
390 | action police rate 1mbit burst 100k conform-exceed pipe \ | |
391 | action mirred egress redirect dev lo | |
392 | \end{Verbatim} | |
393 | The added parameter \texttt{conform-exceed pipe} tells the police action to allow for | |
394 | further actions to handle the exceeding packet. | |
395 | ||
396 | Apart from \texttt{police} and \texttt{mirred} actions, there are a few more. Here's a full | |
397 | list of the currently implemented ones: | |
398 | \begin{description} | |
399 | \item[bpf] | |
400 | Apply a Berkeley Packet Filter program to the packet. | |
401 | ||
402 | \item[connmark] | |
403 | Set the packet's firewall mark to that of its connection. This works by | |
404 | searching the conntrack table for a matching entry. If found, the mark | |
405 | is restored. | |
406 | ||
407 | \item[csum] | |
408 | Trigger recalculation of packet checksums. The supported protocols are: | |
409 | IPv4, ICMP, IGMP, TCP, UDP and UDPLite. | |
410 | ||
411 | \item[ipt] | |
412 | Pass the packet to an iptables target. This allows using iptables | |
413 | extensions directly instead of having to go the extra mile via setting | |
414 | an arbitrary firewall mark and matching on that from within netfilter. | |
415 | ||
416 | \item[mirred] | |
417 | Mirror or redirect packets. This is often combined with the ifb pseudo | |
418 | device to share a common QoS setup between multiple interfaces or even | |
419 | ingress traffic. | |
420 | ||
421 | \item[nat] | |
422 | Perform stateless Network Address Translation. This is certainly not | |
423 | complete and therefore inferior to NAT using iptables: Although the | |
424 | kernel module distinguishes between TCP, UDP and ICMP traffic, it does not | |
425 | handle typically problematic protocols such as active FTP or SIP. | |
426 | ||
427 | \item[pedit] | |
428 | Generic packet editing. This allows altering arbitrary bytes of the | |
429 | packet, either by specifying an offset into the packet or by naming a | |
430 | packet header and field name to change. Currently, the latter is | |
431 | implemented only for IPv4. | |
432 | ||
433 | \item[police] | |
434 | Apply a bandwidth rate limiting policy. Packets exceeding it are dropped | |
435 | by default, but may optionally be handled differently. | |
436 | ||
437 | \item[simple] | |
438 | This is more of an example than a real action. All it does is print a | |
439 | user-defined string together with a packet counter. Useful maybe for | |
440 | debugging when filter statistics are not available or too complicated. | |
441 | ||
442 | \item[skbedit] | |
443 | Edit associated packet data, supports changing queue mapping, priority | |
444 | field and firewall mark value. | |
445 | ||
446 | \item[vlan] | |
447 | Add/remove a VLAN header to/from the packet. This might serve as | |
448 | alternative to using 802.1Q pseudo-interfaces in combination with | |
449 | routing rules when e.g. packets for a given destination need to be | |
450 | encapsulated. | |
451 | \end{description} | |
452 | ||
453 | ||
454 | \section*{Intermediate Functional Block} | |
455 | ||
456 | The Intermediate Functional Block (\texttt{ifb}) pseudo network interface acts as a QoS | |
457 | concentrator for multiple different sources of traffic. Packets from or to other | |
458 | interfaces have to be redirected to it using the \texttt{mirred} action in order to be | |
459 | handled; regularly routed traffic will be dropped. This way, a single stack of | |
460 | qdiscs, classes and filters can be shared between multiple interfaces. | |
461 | ||
462 | Here's a simple example to feed incoming traffic from multiple interfaces | |
463 | through a Stochastic Fairness Queue (\qdisc{sfq}): | |
464 | \begin{Verbatim} | |
465 | (1) # modprobe ifb | |
466 | (2) # ip link set ifb0 up | |
467 | (3) # tc qdisc add dev ifb0 root sfq | |
468 | \end{Verbatim} | |
469 | The first step is to load the \texttt{ifb} kernel module (1). By default, this will | |
470 | create two ifb devices: \iface{ifb0} and \iface{ifb1}. After setting | |
471 | \iface{ifb0} up in (2), the root | |
472 | qdisc is replaced by \qdisc{sfq} in (3). Finally, one can start redirecting ingress | |
473 | traffic to \iface{ifb0}, e.g. from \iface{eth0}: | |
474 | \begin{Verbatim} | |
475 | # tc qdisc add dev eth0 handle ffff: ingress | |
476 | # tc filter add dev eth0 parent ffff: u32 \ | |
477 | match u32 0 0 \ | |
478 | action mirred egress redirect dev ifb0 | |
479 | \end{Verbatim} | |
480 | The same can be done for other interfaces, just replacing \iface{eth0} in the two | |
481 | commands above. One thing to keep in mind here is the asymmetrical routing this | |
482 | creates within the host doing the QoS: Incoming packets enter the system via | |
483 | \iface{ifb0}, while corresponding replies leave directly via \iface{eth0}. This can be observed | |
484 | using \cmd{tcpdump} on \iface{ifb0}, which shows the input part of the traffic only. What's | |
485 | more confusing is that \cmd{tcpdump} on \iface{eth0} shows both incoming and outgoing traffic, | |
486 | but the redirection is still effective - a simple proof is setting | |
487 | \iface{ifb0} down, | |
488 | which will interrupt the communication. Obviously \cmd{tcpdump} catches the packets to | |
489 | dump before they enter the ingress qdisc, which is why it sees them while the | |
490 | kernel itself doesn't. | |
491 | ||
492 | ||
493 | \section*{Conclusion} | |
494 | ||
edf35b88 PS |
495 | Once the steep learning curve has been mastered, the conglomerate of (classful) |
496 | qdiscs, filters and actions provides a highly sophisticated and flexible | |
497 | infrastructure to perform QoS, which plays nicely along with routing and | |
498 | firewalling setups. | |
5f4d27d5 PS |
499 | |
500 | ||
501 | \section*{Further Reading} | |
502 | ||
503 | A good starting point for novice users and experienced ones diving into unknown | |
504 | areas is the extensive HOWTO at \url{http://lartc.org}. The iproute2 package ships | |
505 | some examples (usually in /usr/share/doc/, depending on distribution) as well as | |
506 | man pages for \cmd{tc} in general, qdiscs and filters. The latter have been added | |
507 | just recently though, so if your distribution does not ship iproute2 version | |
508 | 4.3.0 yet, these are not in there. Apart from that, the internet is a spring of | |
509 | HOWTOs and scripts people wrote - though these should be taken with a grain of | |
510 | salt: The complexity of the matter often leads to copying others' solutions | |
511 | without much validation, which allows for less optimal or even obsolete | |
512 | implementations to survive much longer than desired. | |
513 | ||
514 | \end{document} |