\documentclass[12pt,twoside]{article}

\usepackage[hidelinks]{hyperref} % \url
\usepackage{booktabs} % nicer tabulars
\usepackage{fancyvrb}
\usepackage{fullpage}
\usepackage{float}

\newcommand{\iface}{\textit}
\newcommand{\cmd}{\texttt}
\newcommand{\man}{\textit}
\newcommand{\qdisc}{\texttt}
\newcommand{\filter}{\texttt}

\begin{document}
\title{QoS in Linux with TC and Filters}
\author{Phil Sutter (phil@nwl.cc)}
\date{January 2016}
\maketitle

TC, the Traffic Control utility, has been around for a very long time; forever,
in my humble perception. It is still (and always has been, if I'm not mistaken)
the only tool to configure QoS in Linux.

Standard practice when transmitting packets over a medium which may block (e.g.
due to congestion) is to use a queue which temporarily holds these packets. In
Linux, this queueing approach is where QoS happens: A Queueing Discipline
(qdisc) holds multiple packet queues with different priorities for dequeueing to
the network driver. The classification (i.e. deciding which queue a packet
should go into) is typically done based on the Type Of Service (IPv4) or Traffic
Class (IPv6) header fields but, depending on the qdisc implementation, may be
controlled by the user as well.

Qdiscs come in two flavors, classful or classless. While classless qdiscs are
not as flexible as classful ones, they also require much less customizing. Often
it is enough to just attach them to an interface, without exact knowledge of
what is done internally. Classful qdiscs are the exact opposite: flexible in
application, they are often not even usable without careful configuration.

As the name implies, classful qdiscs provide configurable classes to sort
traffic into. In its basic form, this is not much different from, say, the
classless \qdisc{pfifo\_fast}, which holds three queues and classifies each
packet based on its priority field. Typically, though, classes go beyond that by
supporting nesting and additional characteristics such as a maximum traffic
rate or quantum.

When it comes to controlling the classification process, filters come into play.
They attach to the parent of a set of classes (i.e. either the qdisc itself or
a parent class) and specify what a packet (or its associated flow) has to look
like in order to suit a given class. Since a single filter can only express a
rather simple rule, it is possible to attach multiple filters to the same
parent, which then consults each of them in turn until the first one accepts
the packet.

Before getting into detail about what filters there are and how to use them, a
simple setup of a qdisc with classes is necessary:
\begin{figure}[H]
\begin{Verbatim}
.-------------------------------------------------------.
|                                                       |
|  HTB                                                  |
|                                                       |
| .----------------------------------------------------.|
| |                                                    ||
| |  Class 1:1                                         ||
| |                                                    ||
| | .---------------..---------------..---------------.||
| | |               ||               ||               |||
| | |  Class 1:10   ||  Class 1:20   ||  Class 1:30   |||
| | |               ||               ||               |||
| | | .------------.|| .------------.|| .------------.|||
| | | |            ||| |            ||| |            ||||
| | | |  fq_codel  ||| |  fq_codel  ||| |  fq_codel  ||||
| | | |            ||| |            ||| |            ||||
| | | '------------'|| '------------'|| '------------'|||
| | '---------------''---------------''---------------'||
| '----------------------------------------------------'|
'-------------------------------------------------------'
\end{Verbatim}
\end{figure}
\noindent
The following commands establish the basic setup shown:
\begin{Verbatim}
(1) # tc qdisc replace dev eth0 root handle 1: htb default 30
(2) # tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
(3) # alias tclass='tc class add dev eth0 parent 1:1'
(4) # tclass classid 1:10 htb rate 1mbit ceil 20mbit prio 1
(4) # tclass classid 1:20 htb rate 90mbit ceil 95mbit prio 2
(4) # tclass classid 1:30 htb rate 1mbit ceil 95mbit prio 3
(5) # tc qdisc add dev eth0 parent 1:10 fq_codel
(5) # tc qdisc add dev eth0 parent 1:20 fq_codel
(5) # tc qdisc add dev eth0 parent 1:30 fq_codel
\end{Verbatim}
A little explanation for the unfamiliar reader:
\begin{enumerate}
\item Replace the root qdisc of \iface{eth0} with an instance of \qdisc{HTB}.
    Specifying the handle is necessary so it can be referenced in subsequent
    calls to \cmd{tc}. The default class for unclassified traffic is set to
    30, i.e. class 1:30.
\item Create a single top-level class with handle 1:1 which limits the total
    bandwidth allowed to 95mbit/s. It is assumed that \iface{eth0} is a
    100mbit/s link; staying a little below that helps to keep the main point
    of enqueueing in the qdisc layer instead of the interface hardware queue
    or another bottleneck in the network.
\item Define an alias for the common part of the following three calls in order
    to improve readability. This means all remaining classes are attached to
    the common parent class from (2).
\item Create three child classes for different uses: Class 1:10 has highest
    priority but is tightly limited in bandwidth, which is fine for
    interactive connections. Class 1:20 has mid priority and high guaranteed
    bandwidth, for high priority bulk traffic. Finally, there's the default
    class 1:30 with lowest priority, low guaranteed bandwidth and the ability
    to use the full link in case it's unused otherwise. This should be fine
    for uninteresting traffic not explicitly taken care of.
\item Attach a leaf qdisc to each of the child classes created in (4). Since
    \qdisc{HTB} by default attaches \qdisc{pfifo} as leaf qdisc, this step is
    optional. Still, the fairness between different flows provided by the
    classless \qdisc{fq\_codel} is worth the effort.
\end{enumerate}
More information about the qdiscs and fine-tuning parameters can be found in
\man{tc-htb(8)} and \man{tc-fq\_codel(8)}.

Without any additional setup, all traffic leaving \iface{eth0} is now shaped to
95mbit/s and directed through class 1:30. This can be verified by looking at the
\texttt{Sent} field of the class statistics printed via
\cmd{tc -s class show dev eth0}: Only the root class 1:1 and its child 1:30
should show any traffic.
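
As a rough sketch of how such a check might look in practice, the following
commands print the per-class and per-qdisc counters and, should the experiment
need to be undone, remove the whole hierarchy again. Only the interface name
\iface{eth0} is taken from the setup above; no output is shown here since the
counters depend on the actual traffic.
\begin{Verbatim}
# tc -s class show dev eth0
# tc -s qdisc show dev eth0
# tc qdisc del dev eth0 root
\end{Verbatim}
Deleting the root qdisc also removes all classes and leaf qdiscs attached to
it, after which the interface falls back to its default qdisc.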


\section*{Finally time to start filtering!}

Let's begin with a simple one, i.e. reestablishing what \qdisc{pfifo\_fast} did
automatically based on the TOS/Priority field. Linux internally translates the
header field into the priority field of \texttt{struct sk\_buff}, which
\qdisc{pfifo\_fast} uses for classification. \man{tc-prio(8)} contains a table
listing the priority (and ultimately, \qdisc{pfifo\_fast} queue index) each TOS
value is translated into. Here is a shorter version:
\begin{center}
\begin{tabular}{lll}
TOS Values & Linux Priority (Number) & Queue Index \\
\midrule
0x0 - 0x6 & Best Effort (0) & 1 \\
0x8 - 0xe & Bulk (2) & 2 \\
0x10 - 0x16 & Interactive (6) & 0 \\
0x18 - 0x1e & Interactive Bulk (4) & 1 \\
\end{tabular}
\end{center}
Using the \filter{basic} filter, it is possible to match packets based on that
\texttt{sk\_buff} field, which has the added benefit of being IP version
agnostic. Since the \qdisc{HTB} setup above defaults to class ID 1:30, the Bulk
priority can be ignored. The \filter{basic} filter allows combining matches, so
two filters suffice:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 6)' classid 1:10
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 0)' \
        or 'meta(priority eq 4)' classid 1:20
\end{Verbatim}
A detailed description of the \filter{basic} filter and the ematch syntax it
uses can be found in \man{tc-basic(8)} and \man{tc-ematch(8)}.

Obviously, this first example cries for optimization. A simple one would be to
just change the default class from 1:30 to 1:20, so filters are only needed for
the Bulk and Interactive priorities:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 6)' classid 1:10
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 2)' classid 1:30
\end{Verbatim}
Given that class IDs are arbitrary, choosing them wisely allows for a direct
mapping. So first, recreate the qdisc and class configuration:
\begin{Verbatim}
# tc qdisc replace dev eth0 root handle 1: htb default 10
# tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
# alias tclass='tc class add dev eth0 parent 1:1'
# tclass classid 1:16 htb rate 1mbit ceil 20mbit prio 1
# tclass classid 1:10 htb rate 90mbit ceil 95mbit prio 2
# tclass classid 1:12 htb rate 1mbit ceil 95mbit prio 3
# tc qdisc add dev eth0 parent 1:16 fq_codel
# tc qdisc add dev eth0 parent 1:10 fq_codel
# tc qdisc add dev eth0 parent 1:12 fq_codel
\end{Verbatim}
This is basically identical to the setup above, but with changed leaf class IDs
and the second priority class being the default. Using the \filter{flow} filter
with its \texttt{map} functionality, a single filter command is enough:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: handle 0x1337 flow \
        map key priority baseclass 1:10
\end{Verbatim}
The \filter{flow} filter now uses the priority value to construct a destination
class ID by adding it to the value of \texttt{baseclass}. While this works for
priority values of 0, 2 and 6, it will result in the non-existent class ID 1:14
for Interactive Bulk traffic. In that case, the \qdisc{HTB} default applies so
that traffic goes into class ID 1:10 just as intended. Please note that the
\filter{flow} filter requires a handle to be specified, although I didn't see
where one would use it later. For more information about \filter{flow}, see
\man{tc-flow(8)}.
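
To make the mapping concrete, here is the arithmetic as I understand it,
assuming the minor part of a class ID is read as a hexadecimal number (so class
1:10 has minor 0x10):
\begin{center}
\begin{tabular}{lll}
Linux Priority (Number) & baseclass + priority & Resulting Class \\
\midrule
Best Effort (0) & 0x10 + 0 = 0x10 & 1:10 \\
Bulk (2) & 0x10 + 2 = 0x12 & 1:12 \\
Interactive Bulk (4) & 0x10 + 4 = 0x14 & 1:14 (missing, default 1:10 applies) \\
Interactive (6) & 0x10 + 6 = 0x16 & 1:16 \\
\end{tabular}
\end{center}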

While \filter{flow} and \filter{basic} filters are relatively easy to apply and
understand, they are also quite limited to their intended purpose. A more
flexible option is the \filter{u32} filter, which allows matching on arbitrary
parts of the packet data, yet only on that, not on any meta data associated
with it by the kernel (with the exception of the firewall mark value). So in
order to continue this little exercise with \filter{u32}, we have to base
classification directly upon the actual TOS value. An intuitive attempt might
look like this:
\begin{Verbatim}
# alias tcfilter='tc filter add dev eth0 parent 1:'
# tcfilter u32 match ip dsfield 0x10 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x12 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x14 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x16 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x8 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xa 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xc 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xe 0x1e classid 1:12
\end{Verbatim}
The obvious drawback here is the number of filters needed. And without the
default class, eight more filters would be necessary. This also has performance
implications: A packet with TOS value 0xe will be checked eight times in total
in order to determine its destination class. While there's not much to be done
about the number of filters, at least the performance problem can be eliminated
by using \filter{u32}'s hash table support:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: prio 99 handle 1: u32 divisor 16
\end{Verbatim}
This creates a hash table with 16 buckets. The table size is arbitrary, but not
random: Since the lowest bit of the TOS field is not interesting, it can be
ignored and therefore the range of values to consider is just [0;15], i.e. 16
different values. The next step is to populate the hash table:
\begin{Verbatim}
# alias tcfilter='tc filter add dev eth0 parent 1: prio 99'
# tcfilter u32 match u8 0 0 ht 1:0: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:1: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:2: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:3: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:4: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:5: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:6: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:7: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:8: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:9: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:a: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:b: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:c: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:d: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:e: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:f: classid 1:10
\end{Verbatim}
The parameter \texttt{ht} denotes the hash table and bucket the filter should be
added to. Since the lowest TOS bit is ignored, the TOS value has to be divided
by two in order to get to the bucket it maps to. A TOS value of 0x10, for
example, maps to bucket 0x8. For the sake of completeness, all possible values
are mapped and therefore a configurable default class is not required. Note
that the match expression used here serves no functional purpose, but is
syntactically mandatory. Therefore anything that matches any packet will
suffice. Finally, a filter which links to the defined hash table is needed:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: prio 1 protocol ip u32 \
        link 1: hashkey mask 0x001e0000 match u8 0 0
\end{Verbatim}
Here again, the actual match statement is not necessary, but syntactically
required. All the magic lies within the \texttt{hashkey} parameter, which
defines which part of the packet should be used directly as hash key. Here's a
drawing of the first four bytes of the IPv4 header, with the area selected by
\texttt{hashkey mask} highlighted:
\begin{figure}[H]
\begin{Verbatim}
 0                1                2                3
.-----------------------------------------------------------------.
|        |        | ########  |    |                              |
| Version|  IHL   | #DSCP###  | ECN|  Total Length                |
|        |        | ########  |    |                              |
'-----------------------------------------------------------------'
\end{Verbatim}
\end{figure}
\noindent
This may look confusing at first, but keep in mind that bit as well as byte
ordering in the drawing is LSB first, while the mask value is written MSB
first, the way we humans are used to. Therefore reading the mask is done like
so, starting from the left:
\begin{enumerate}
\item Skip the first byte (which contains Version and IHL fields).
\item Skip the lowest bit of the second byte (0x1e is even).
\item Mark the four following bits (0x1e is 11110 in binary).
\item Skip the remaining three bits of the second byte as well as the remaining
    two bytes.
\end{enumerate}
Before doing the lookup, the kernel right-shifts the masked value by the number
of trailing zero bits in \texttt{mask}, which implicitly also does the division
by two which the hash table depends on. With this setup, every packet has to
pass exactly two filters to be classified. Note that this filter is limited to
IPv4 packets: Due to the related Traffic Class field being at a different offset
in the packet, it would not work for IPv6. To use the same setup for IPv6 as
well, a second entry-level filter is necessary:
\begin{Verbatim}
# tc filter add dev eth0 parent 1: prio 2 protocol ipv6 u32 \
        link 1: hashkey mask 0x01e00000 match u8 0 0
\end{Verbatim}
For illustration purposes, here again is a drawing of the first four bytes of
the IPv6 header, again with the masked area highlighted:
\begin{figure}[H]
\begin{Verbatim}
 0                1                2                3
.-----------------------------------------------------------------.
|        | ########        |                                      |
| Version| #Traffic Class  |   Flow Label                         |
|        | ########        |                                      |
'-----------------------------------------------------------------'
\end{Verbatim}
\end{figure}
\noindent
Reading the mask value is analogous to IPv4, with the added complexity that
Traffic Class spans two bytes. For comparison there's a simple trick, though:
In IPv6 the interesting field is shifted by four bits to the left, and the new
mask's value is shifted by the same amount. For further information about
\filter{u32} and what can be done with it, consult its man page
\man{tc-u32(8)}.
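
As a worked example of the key computation described above, consider a packet
with a TOS (or Traffic Class) value of 0x10. The header words used here are
made up for illustration: an IPv4 header starting with 0x45100054 and an IPv6
header starting with 0x61000000. The shift amounts 17 and 21 are simply the
numbers of trailing zero bits in the respective masks.
\[
(\mathtt{0x45100054} \mathbin{\&} \mathtt{0x001e0000}) \gg 17
    = \mathtt{0x00100000} \gg 17 = 8
\]
\[
(\mathtt{0x61000000} \mathbin{\&} \mathtt{0x01e00000}) \gg 21
    = \mathtt{0x01000000} \gg 21 = 8
\]
Both packets therefore end up in bucket 8 of hash table 1:, which holds the
filter assigning class 1:16.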

Of course, the kernel provides many more filters than just \filter{basic},
\filter{flow} and \filter{u32} which have been presented above. As of now, the
remaining ones are:
\begin{description}
\item[bpf]
    Filtering using Berkeley Packet Filter programs. The program's return
    code determines the packet's destination class ID.

\item[cgroup]
    Filter packets based on control groups. This is only useful for packets
    originating from the local host, as control groups only exist in that
    scope.

\item[flower]
    Matches flows based on a configurable set of packet header fields, such
    as MAC or IP addresses and ports.

\item[fw]
    Matches on firewall mark values previously assigned to the packet by
    netfilter (or a filter action, see below for details). This allows
    moving the classification logic into netfilter, which is very
    convenient if appropriate rules already exist there on the same
    system.

\item[route]
    Filter packets based on the matching routing table entry. Similar in
    spirit to the \texttt{fw} filter above, it makes use of an already
    existing, extensive routing table setup.

\item[rsvp, rsvp6]
    Implementation of the Resource Reservation Protocol in Linux, to react
    upon requests sent by an RSVP daemon.

\item[tcindex]
    Match packets based on the tcindex value, which is usually set by the
    dsmark qdisc. This is part of an approach to support Differentiated
    Services in Linux, which is a topic of its own.
\end{description}
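
To illustrate the \filter{fw} filter from the list above, here is a minimal
sketch. It assumes the second class setup from earlier (with the interactive
class 1:16) is still in place; the choice of SSH traffic and of mark value 6 is
purely illustrative:
\begin{Verbatim}
# iptables -t mangle -A POSTROUTING -o eth0 -p tcp --dport 22 \
        -j MARK --set-mark 6
# tc filter add dev eth0 parent 1: protocol ip handle 6 fw classid 1:16
\end{Verbatim}
The filter's handle is the firewall mark to match, so packets marked with 6 by
netfilter end up in class 1:16.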


\section*{Filter Actions}

The tc filter framework provides the infrastructure for another extensible set
of tools as well, namely tc actions. As the name suggests, they allow doing
things with packets (or associated data). A list of actions is part of a given
filter. If the filter matches, each action it contains is executed in order
before returning the classification result. Since the action has direct access
to the latter, it is in theory possible for an action to react upon or even
change the filtering result, as long as the packet matched, of course. Yet none
of the currently in-tree actions make use of this.

The Generic Actions framework originally evolved out of the filters' ability to
police traffic to a given maximum bandwidth. One common use case for that is to
limit ingress traffic, dropping packets which exceed the threshold. A classic
setup example looks like this:
\begin{Verbatim}
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        police rate 1mbit burst 100k
\end{Verbatim}
The ingress qdisc is not a real one, but merely a point of reference for filters
to attach to which should get applied to incoming traffic. The \filter{u32}
filter added above matches on any packet and therefore limits the total incoming
bandwidth to 1mbit/s, allowing bursts of up to 100kbytes. Using the newer
\texttt{action} syntax, the filter command changes slightly:
\begin{Verbatim}
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action police rate 1mbit burst 100k
\end{Verbatim}
The important detail is that this syntax allows defining multiple actions.
For testing purposes, for example, it is possible to redirect exceeding traffic
to the loopback interface instead of dropping it:
\begin{Verbatim}
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action police rate 1mbit burst 100k conform-exceed pipe \
        action mirred egress redirect dev lo
\end{Verbatim}
The added parameter \texttt{conform-exceed pipe} tells the police action to
allow further actions to handle the exceeding packet.
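
When experimenting with setups like the ones above, it helps to look at the
counters the kernel keeps per filter and action. A small sketch, again assuming
\iface{eth0} with the ingress qdisc created above:
\begin{Verbatim}
# tc -s filter show dev eth0 parent ffff:
# tc qdisc del dev eth0 ingress
\end{Verbatim}
The first command lists the attached filters together with the statistics of
their actions; the second one removes the ingress qdisc along with all filters
attached to it.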

Apart from the \texttt{police} and \texttt{mirred} actions, there are a few
more. Here's a full list of the currently implemented ones:
\begin{description}
\item[bpf]
    Apply a Berkeley Packet Filter program to the packet.

\item[connmark]
    Set the packet's firewall mark to that of its connection. This works by
    searching the conntrack table for a matching entry. If found, the mark
    is restored.

\item[csum]
    Trigger recalculation of packet checksums. The supported protocols are:
    IPv4, ICMP, IGMP, TCP, UDP and UDPLite.

\item[ipt]
    Pass the packet to an iptables target. This allows using iptables
    extensions directly instead of having to go the extra mile via setting
    an arbitrary firewall mark and matching on that from within netfilter.

\item[mirred]
    Mirror or redirect packets. This is often combined with the ifb pseudo
    device to share a common QoS setup between multiple interfaces or even
    ingress traffic.

\item[nat]
    Perform stateless Network Address Translation. This is certainly not
    complete and therefore inferior to NAT using iptables: Although the
    kernel module distinguishes between TCP, UDP and ICMP traffic, it does
    not handle typical problematic protocols such as active FTP or SIP.

\item[pedit]
    Generic packet editing. This allows altering arbitrary bytes of the
    packet, either by specifying an offset into the packet or by naming a
    packet header and field name to change. Currently, the latter is
    implemented for IPv4 only.

\item[police]
    Apply a bandwidth rate limiting policy. Packets exceeding it are dropped
    by default, but may optionally be handled differently.

\item[simple]
    This is more of an example than a real action. All it does is print a
    user-defined string together with a packet counter. It may be useful
    for debugging when filter statistics are not available or too
    complicated.

\item[skbedit]
    Edit associated packet data. It supports changing the queue mapping,
    priority field and firewall mark value.

\item[vlan]
    Add/remove a VLAN header to/from the packet. This might serve as an
    alternative to using 802.1Q pseudo-interfaces in combination with
    routing rules when e.g. packets for a given destination need to be
    encapsulated.
\end{description}
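
As a small illustration of combining a filter with one of these actions, the
following sketch uses \texttt{skbedit} to set a firewall mark on incoming SSH
packets. It assumes the ingress qdisc with handle ffff: from above exists on
\iface{eth0}; port 22 and mark value 6 are arbitrary choices for illustration:
\begin{Verbatim}
# tc filter add dev eth0 parent ffff: protocol ip u32 \
        match ip dport 22 0xffff \
        action skbedit mark 6
\end{Verbatim}
Such a mark could then be picked up by netfilter rules or by a \filter{fw}
filter later on.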


\section*{Intermediate Functional Block}

The Intermediate Functional Block (\texttt{ifb}) pseudo network interface acts
as a QoS concentrator for multiple different sources of traffic. Packets from or
to other interfaces have to be redirected to it using the \texttt{mirred} action
in order to be handled; regularly routed traffic will be dropped. This way, a
single stack of qdiscs, classes and filters can be shared between multiple
interfaces.

Here's a simple example to feed incoming traffic from multiple interfaces
through a Stochastic Fairness Queue (\qdisc{sfq}):
\begin{Verbatim}
(1) # modprobe ifb
(2) # ip link set ifb0 up
(3) # tc qdisc add dev ifb0 root sfq
\end{Verbatim}
The first step is to load the \texttt{ifb} kernel module (1). By default, this
will create two ifb devices: \iface{ifb0} and \iface{ifb1}. After setting
\iface{ifb0} up in (2), the root qdisc is replaced by \qdisc{sfq} in (3).
Finally, one can start redirecting ingress traffic to \iface{ifb0}, e.g. from
\iface{eth0}:
\begin{Verbatim}
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action mirred egress redirect dev ifb0
\end{Verbatim}
The same can be done for other interfaces, just replacing \iface{eth0} in the
two commands above. One thing to keep in mind here is the asymmetrical routing
this creates within the host doing the QoS: Incoming packets enter the system
via \iface{ifb0}, while corresponding replies leave directly via \iface{eth0}.
This can be observed using \cmd{tcpdump} on \iface{ifb0}, which shows the input
part of the traffic only. What's more confusing is that \cmd{tcpdump} on
\iface{eth0} shows both incoming and outgoing traffic, but the redirection is
still effective. A simple proof is setting \iface{ifb0} down, which will
interrupt the communication. Obviously \cmd{tcpdump} catches the packets to dump
before they enter the ingress qdisc, which is why it sees them while the kernel
itself doesn't.
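
Extending this to a second interface could look like the following sketch,
where \iface{eth1} is a hypothetical additional interface. The last command
prints the statistics of the shared \qdisc{sfq} instance on \iface{ifb0}:
\begin{Verbatim}
# tc qdisc add dev eth1 handle ffff: ingress
# tc filter add dev eth1 parent ffff: u32 \
        match u32 0 0 \
        action mirred egress redirect dev ifb0
# tc -s qdisc show dev ifb0
\end{Verbatim}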


\section*{Conclusion}

My personal impression is that although the \cmd{tc} utility is an absolute
necessity for anyone aiming at doing QoS in Linux professionally, there are way
too many loose ends and trip wires present in its environment. Contributing to
this is the fact that much of the non-essential functionality is redundantly
available in netfilter. Another problem which adds weight to the first one is a
general lack of documentation. Of course, there are many HOWTOs and guides on
the internet, but since it's often not clear how up to date these are, I prefer
the usual resources such as man or info pages. Surely nothing one couldn't fix
in hindsight, but quality certainly suffers if the original author of the code
does not or cannot contribute to that.

All that being said, once the steep learning curve has been mastered, the
conglomerate of (classful) qdiscs, filters and actions provides a highly
sophisticated and flexible infrastructure to perform QoS, which plays nicely
along with routing and firewalling setups.


\section*{Further Reading}

A good starting point for novice users and experienced ones diving into unknown
areas is the extensive HOWTO at \url{http://lartc.org}. The iproute2 package
ships some examples (usually in /usr/share/doc/, depending on distribution) as
well as man pages for \cmd{tc} in general, qdiscs and filters. The latter have
been added just recently though, so if your distribution does not ship iproute2
version 4.3.0 yet, these are not in there. Apart from that, the internet is a
rich source of HOWTOs and scripts people wrote, though these should be taken
with a grain of salt: The complexity of the matter often leads to copying
others' solutions without much validation, which allows suboptimal or even
obsolete implementations to survive much longer than desired.

\end{document}