\documentclass[12pt,twoside]{article}

\usepackage[hidelinks]{hyperref} % \url
\usepackage{booktabs} % nicer tabulars

\newcommand{\iface}{\textit}
\newcommand{\cmd}{\texttt}
\newcommand{\man}{\textit}
\newcommand{\qdisc}{\texttt}
\newcommand{\filter}{\texttt}

\title{QoS in Linux with TC and Filters}
\author{Phil Sutter (phil@nwl.cc)}
Standard practice when transmitting packets over a medium which may block (e.g.
due to congestion) is to use a queue which temporarily holds these packets. In
Linux, this queueing approach is where QoS happens: a Queueing Discipline
(qdisc) holds multiple packet queues with different priorities for dequeueing to
the network driver. The classification (i.e. deciding which queue a packet
should go into) is typically based on the Type of Service (IPv4) or Traffic
Class (IPv6) header field but, depending on the qdisc implementation, might be
controlled by the user as well.

Qdiscs come in two flavors, classful or classless. While classless qdiscs are
not as flexible as classful ones, they also require much less customizing. Often
it is enough to just attach them to an interface, without exact knowledge of
what is done internally. Classful qdiscs are the exact opposite: flexible in
application, they are often not even usable without informed configuration.

As the name implies, classful qdiscs provide configurable classes to sort
traffic into. In its basic form, this is not much different from, say, the
classless \qdisc{pfifo\_fast}, which holds three queues and classifies each
packet based on its priority field. Typically, though, classes go beyond that by
supporting nesting and additional characteristics such as a maximum traffic
rate.

When it comes to controlling the classification process, filters come into play.
They attach to the parent of a set of classes (i.e. either the qdisc itself or
a parent class) and specify what a packet (or its associated flow) has to look
like in order to suit a given class. Since a single rule is rarely enough, it is
possible to attach multiple filters to the same parent, which then consults each
of them in a row until the first one accepts the packet.

Before getting into detail about what filters there are and how to use them, a
simple setup of a qdisc with classes is necessary:
\begin{verbatim}
.-------------------------------------------------------.
| .----------------------------------------------------.|
| | .---------------..---------------..---------------.||
| | |  Class 1:10   ||  Class 1:20   ||  Class 1:30   |||
| | | .------------.|| .------------.|| .------------.|||
| | | |            ||| |            ||| |            ||||
| | | |  fq_codel  ||| |  fq_codel  ||| |  fq_codel  ||||
| | | |            ||| |            ||| |            ||||
| | | '------------'|| '------------'|| '------------'|||
| | '---------------''---------------''---------------'||
| '----------------------------------------------------'|
'-------------------------------------------------------'
\end{verbatim}
The following commands establish the basic setup shown:

\begin{verbatim}
(1) # tc qdisc replace dev eth0 root handle 1: htb default 30
(2) # tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
(3) # alias tclass='tc class add dev eth0 parent 1:1'
(4) # tclass classid 1:10 htb rate 1mbit ceil 20mbit prio 1
(4) # tclass classid 1:20 htb rate 90mbit ceil 95mbit prio 2
(4) # tclass classid 1:30 htb rate 1mbit ceil 95mbit prio 3
(5) # tc qdisc add dev eth0 parent 1:10 fq_codel
(5) # tc qdisc add dev eth0 parent 1:20 fq_codel
(5) # tc qdisc add dev eth0 parent 1:30 fq_codel
\end{verbatim}
A little explanation for the unfamiliar reader:

\begin{enumerate}
\item Replace the root qdisc of \iface{eth0} by an instance of \qdisc{HTB}.
  Specifying the handle is necessary so it can be referenced in subsequent
  calls to \cmd{tc}. The default class for unclassified traffic is set to
  1:30.
\item Create a single top-level class with handle 1:1 which limits the total
  bandwidth allowed to 95mbit/s. It is assumed that \iface{eth0} is a 100mbit/s
  link. Staying a little below that helps to keep the main point of enqueueing
  in the qdisc layer instead of the interface hardware queue or another
  bottleneck in the network.
\item Define an alias for the common part of the remaining three calls in order
  to improve readability. This means all remaining classes are attached to the
  common parent class from (2).
\item Create three child classes for different uses: Class 1:10 has highest
  priority but is tightly limited in bandwidth - fine for interactive
  connections. Class 1:20 has mid priority and high guaranteed bandwidth, for
  high priority bulk traffic. Finally, there's the default class 1:30 with
  lowest priority, low guaranteed bandwidth and the ability to use the full
  link in case it's unused otherwise. This should be fine for uninteresting
  traffic not explicitly taken care of.
\item Attach a leaf qdisc to each of the child classes created in (4). Since
  \qdisc{HTB} by default attaches \qdisc{pfifo} as leaf qdisc, this step is
  optional. Still, attaching the classless \qdisc{fq\_codel} provides fairness
  between different flows.
\end{enumerate}

More information about the qdiscs and fine-tuning parameters can be found in
\man{tc-htb(8)} and \man{tc-fq\_codel(8)}.

Without any additional setup done, all traffic leaving \iface{eth0} is now
shaped to 95mbit/s and directed through class 1:30. This can be verified by
looking at the \texttt{Sent} field of the class statistics printed via
\cmd{tc -s class show dev eth0}: only the root class 1:1 and its child 1:30
should show any traffic.

\section*{Finally time to start filtering!}

Let's begin with a simple one, i.e. reestablishing what \qdisc{pfifo\_fast} did
automatically based on the TOS/Priority field. Linux internally translates the
header field into the priority field of \texttt{struct sk\_buff}, which
\qdisc{pfifo\_fast} uses for classification. \man{tc-prio(8)} contains a table
listing the priority (and ultimately, the \qdisc{pfifo\_fast} queue index) each
TOS value is translated into. Here is a shorter version:

\begin{center}
\begin{tabular}{lll}
\toprule
TOS Values & Linux Priority (Number) & Queue Index \\
\midrule
0x0 - 0x6 & Best Effort (0) & 1 \\
0x8 - 0xe & Bulk (2) & 2 \\
0x10 - 0x16 & Interactive (6) & 0 \\
0x18 - 0x1e & Interactive Bulk (4) & 1 \\
\bottomrule
\end{tabular}
\end{center}

Using the \filter{basic} filter, it is possible to match packets based on that
\texttt{sk\_buff} field, which has the added benefit of being IP version
agnostic. Since the \qdisc{HTB} setup above defaults to class ID 1:30, the Bulk
priority can be ignored. The \filter{basic} filter allows combining matches, so
two filters are sufficient:

\begin{verbatim}
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 6)' classid 1:10
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 0)' \
        or 'meta(priority eq 4)' classid 1:20
\end{verbatim}

A detailed description of the \filter{basic} filter and the ematch syntax it
uses can be found in \man{tc-basic(8)} and \man{tc-ematch(8)}.

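The combined effect of these two filters and the \qdisc{HTB} default class can
be sketched in a few lines of Python. This is only an illustration built from
the shortened table above, not kernel code; the names
\texttt{tos\_to\_priority} and \texttt{classify} are made up for this example.

```python
# Priority numbers per TOS range, taken from the shortened table above.
# The index is the TOS value with its lowest three bits dropped.
PRIORITY_BY_TOS_RANGE = [0, 2, 6, 4]  # Best Effort, Bulk, Interactive, Interactive Bulk


def tos_to_priority(tos):
    """Map a TOS value (0x0 - 0x1e) to the Linux priority from the table."""
    return PRIORITY_BY_TOS_RANGE[(tos >> 3) & 0x3]


def classify(tos, default="1:30"):
    """Mimic the two basic filters; unmatched packets hit the HTB default."""
    prio = tos_to_priority(tos)
    if prio == 6:          # first filter: Interactive
        return "1:10"
    if prio in (0, 4):     # second filter: Best Effort or Interactive Bulk
        return "1:20"
    return default         # Bulk is left to the default class


if __name__ == "__main__":
    for tos in (0x00, 0x08, 0x10, 0x18):
        print(hex(tos), tos_to_priority(tos), classify(tos))
```

Note that the filters themselves never look at the TOS value, only at the
priority derived from it, which is what makes them IP version agnostic.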
Obviously, this first example cries for optimization. A simple one would be to
just change the default class from 1:30 to 1:20, so filters are only needed for
Bulk and Interactive priorities:

\begin{verbatim}
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 6)' classid 1:10
# tc filter add dev eth0 parent 1: basic \
        match 'meta(priority eq 2)' classid 1:30
\end{verbatim}

Given that class IDs are arbitrary, choosing them wisely allows for a direct
mapping. So first, recreate the qdisc and classes configuration:

\begin{verbatim}
# tc qdisc replace dev eth0 root handle 1: htb default 10
# tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
# alias tclass='tc class add dev eth0 parent 1:1'
# tclass classid 1:16 htb rate 1mbit ceil 20mbit prio 1
# tclass classid 1:10 htb rate 90mbit ceil 95mbit prio 2
# tclass classid 1:12 htb rate 1mbit ceil 95mbit prio 3
# tc qdisc add dev eth0 parent 1:16 fq_codel
# tc qdisc add dev eth0 parent 1:10 fq_codel
# tc qdisc add dev eth0 parent 1:12 fq_codel
\end{verbatim}

This is basically identical to the setup above, but with changed leaf class IDs
and the second priority class being the default. Using the \filter{flow} filter
with its \texttt{map} functionality, a single filter command is enough:

\begin{verbatim}
# tc filter add dev eth0 parent 1: handle 0x1337 flow \
        map key priority baseclass 1:10
\end{verbatim}

The \filter{flow} filter now uses the priority value to construct a destination
class ID by adding it to the value of \texttt{baseclass}. While this works for
priority values of 0, 2 and 6, it will result in the non-existent class ID 1:14
for Interactive Bulk traffic. In that case, the \qdisc{HTB} default applies so
that traffic goes into class ID 1:10 just as intended. Please note that
specifying a handle is mandatory for the \filter{flow} filter, although I didn't
see where one would use that later. For more information about \filter{flow},
see \man{tc-flow(8)}.

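The arithmetic behind this mapping is easy to check by hand. The following
Python sketch only models the description above (class ID minor numbers are
hexadecimal, and a non-existent target falls back to the qdisc default); the
function name \texttt{flow\_map\_classid} and its parameters are hypothetical.

```python
def flow_map_classid(priority, baseclass="1:10",
                     existing=("1:10", "1:12", "1:16"), default="1:10"):
    """Add the priority to the baseclass minor number (class IDs are hex)."""
    major, minor = baseclass.split(":")
    target = "%s:%x" % (major, int(minor, 16) + priority)
    # A class ID that does not exist cannot take the packet,
    # so the HTB default class applies instead.
    return target if target in existing else default


if __name__ == "__main__":
    for prio in (0, 2, 4, 6):
        print(prio, flow_map_classid(prio))
```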
While \filter{flow} and \filter{basic} filters are relatively easy to apply and
understand, they are also quite limited to their intended purpose. A more
flexible option is the \filter{u32} filter, which allows matching on arbitrary
parts of the packet data - yet only on that, not on any metadata associated with
it by the kernel (with the exception of the firewall mark value). So in order to
continue this little exercise with \filter{u32}, we have to base classification
directly upon the actual TOS value. An intuitive attempt might look like this:

\begin{verbatim}
# alias tcfilter='tc filter add dev eth0 parent 1:'
# tcfilter u32 match ip dsfield 0x10 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x12 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x14 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x16 0x1e classid 1:16
# tcfilter u32 match ip dsfield 0x8 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xa 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xc 0x1e classid 1:12
# tcfilter u32 match ip dsfield 0xe 0x1e classid 1:12
\end{verbatim}

The obvious drawback here is the number of filters needed. Without the default
class, eight more filters would be necessary. This also has performance
implications: a packet with TOS value 0xe will be checked eight times in total
in order to determine its destination class. While there's not much to be done
about the number of filters, at least the performance problem can be eliminated
by using \filter{u32}'s hash table support:

\begin{verbatim}
# tc filter add dev eth0 parent 1: prio 99 handle 1: u32 divisor 16
\end{verbatim}

This creates a hash table with 16 buckets. The table size is arbitrary, but not
random: since the lowest bit of the TOS field is not interesting, it can be
ignored and therefore the range of values to consider is just [0;15], i.e.
16 different values. The next step is to populate the hash table:

\begin{verbatim}
# alias tcfilter='tc filter add dev eth0 parent 1: prio 99'
# tcfilter u32 match u8 0 0 ht 1:0: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:1: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:2: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:3: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:4: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:5: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:6: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:7: classid 1:12
# tcfilter u32 match u8 0 0 ht 1:8: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:9: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:a: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:b: classid 1:16
# tcfilter u32 match u8 0 0 ht 1:c: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:d: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:e: classid 1:10
# tcfilter u32 match u8 0 0 ht 1:f: classid 1:10
\end{verbatim}

The parameter \texttt{ht} denotes the hash table and bucket the filter should be
added to. Since the lowest TOS bit is ignored, the TOS value has to be divided
by two in order to get the bucket it maps to. E.g. a TOS value of 0x10 will
therefore map to bucket 0x8. For the sake of completeness, all possible values
are mapped and therefore a configurable default class is not required. Note that
the match expression used here carries no meaning, but is syntactically
required. Therefore anything that matches any packet will suffice. Finally, a
filter which links to the defined hash table is needed:

\begin{verbatim}
# tc filter add dev eth0 parent 1: prio 1 protocol ip u32 \
        link 1: hashkey mask 0x001e0000 match u8 0 0
\end{verbatim}

Here again, the actual match statement is not necessary, but syntactically
required. All the magic lies within the \texttt{hashkey} parameter, which
defines which part of the packet should be used directly as hash key. Here's a
drawing of the first four bytes of the IPv4 header, with the area selected by
\texttt{hashkey mask} highlighted:

\begin{verbatim}
.-----------------------------------------------------------------.
| Version|  IHL   | #DSCP###  | ECN|         Total Length         |
'-----------------------------------------------------------------'
\end{verbatim}

This may look confusing at first, but keep in mind that bit- as well as
byte-ordering in the drawing is LSB first, while the mask value is written MSB
first, the way we humans are used to. Therefore reading the mask is done like
so, starting from the left:

\begin{itemize}
\item Skip the first byte (which contains Version and IHL fields).
\item Skip the lowest bit of the second byte (0x1e is even).
\item Mark the four following bits (0x1e is 11110 in binary).
\item Skip the remaining three bits of the second byte as well as the remaining
  two bytes.
\end{itemize}

Before doing the lookup, the kernel right-shifts the masked value by the number
of zero bits in \texttt{mask}, which implicitly also does the division by two
which the hash table depends on. With this setup, every packet has to pass
exactly two filters to be classified. Note that this filter is limited to IPv4
packets: since the related Traffic Class field is at a different offset in the
packet, it would not work for IPv6. To use the same setup for IPv6 as well, a
second entry-level filter is necessary:

\begin{verbatim}
# tc filter add dev eth0 parent 1: prio 2 protocol ipv6 u32 \
        link 1: hashkey mask 0x01e00000 match u8 0 0
\end{verbatim}

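To double-check both mask values, the mask-and-shift step can be reproduced in
Python. This is a sketch under stated assumptions, not the kernel
implementation: the first 32 bits of the packet are treated as one integer in
network byte order, the shift is taken to be the number of trailing zero bits
of the mask, and the helper names (\texttt{u32\_bucket},
\texttt{ipv4\_first\_word}, \texttt{ipv6\_first\_word}) are made up for this
example.

```python
def u32_bucket(first_word, mask, divisor=16):
    """Bucket index for the first 32 bits of a packet (network byte order)."""
    shift = (mask & -mask).bit_length() - 1  # trailing zero bits of the mask
    return ((first_word & mask) >> shift) % divisor


def ipv4_first_word(tos, total_length=84):
    """Version 4, IHL 5, then the TOS byte and the 16 bit total length."""
    return (4 << 28) | (5 << 24) | (tos << 16) | total_length


def ipv6_first_word(traffic_class, flow_label=0):
    """Version 6, then 8 bits of Traffic Class and 20 bits of Flow Label."""
    return (6 << 28) | (traffic_class << 20) | flow_label


if __name__ == "__main__":
    for tos in (0x00, 0x08, 0x10, 0x1e):
        print(hex(tos),
              u32_bucket(ipv4_first_word(tos), 0x001e0000),
              u32_bucket(ipv6_first_word(tos), 0x01e00000))
```

For every TOS or Traffic Class value, both masks yield the value divided by
two with the lowest bit dropped, which is exactly the bucket numbering the
hash table was populated with.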
For illustration purposes, here again is a drawing of the first four bytes of
the IPv6 header, again with the masked area highlighted:

\begin{verbatim}
.-----------------------------------------------------------------.
| Version| #Traffic Class|              Flow Label                |
'-----------------------------------------------------------------'
\end{verbatim}

Reading the mask value is analogous to IPv4 with the added complexity that
Traffic Class spans two bytes. Yet, for comparison there's a simple trick:
IPv6 has the interesting field shifted by four bits to the left, and the new
mask's value is shifted by the same amount. For further information about
\filter{u32} and what can be done with it, consult its man page
\man{tc-u32(8)}.

Of course, the kernel provides many more filters than just \filter{basic},
\filter{flow} and \filter{u32} which have been presented above. As of now, the
remaining ones are:

\begin{description}
\item[bpf]
  Filtering using Berkeley Packet Filter programs. The program's return
  code determines the packet's destination class ID.
\item[cgroup]
  Filter packets based on control groups. This is only useful for packets
  originating from the local host, as control groups only exist there.
\item[flower]
  An extended variant of the flow filter.
\item[fw]
  Matches on firewall mark values previously assigned to the packet by
  netfilter (or a filter action, see below for details). This allows
  moving the classification algorithm into netfilter, which is very
  convenient if appropriate rules already exist there on the same system.
\item[route]
  Filter packets based on the matching routing table entry. Basically
  equivalent to the \texttt{fw} filter above, to make use of an already
  existing extensive routing table setup.
\item[rsvp]
  Implementation of the Resource Reservation Protocol in Linux, to react
  upon requests sent by an RSVP daemon.
\item[tcindex]
  Match packets based on the tcindex value, which is usually set by the
  dsmark qdisc. This is part of an approach to support Differentiated
  Services in Linux, which is a topic of its own.
\end{description}

\section*{Filter Actions}

The tc filter framework provides the infrastructure for another extensible set
of tools as well, namely tc actions. As the name suggests, they allow doing
things with packets (or associated data). Actions (or rather a list of them) are
part of a given filter. If it matches, each action it contains is executed in
order before returning the classification result. Since the action has direct
access to the latter, it is in theory possible for an action to react upon or
even change the filtering result - as long as the packet matched, of course. Yet
none of the currently in-tree actions make use of this.

The Generic Actions framework originally evolved out of the filters' ability to
police traffic to a given maximum bandwidth. One common use case for that is to
limit ingress traffic, dropping packets which exceed the threshold. A classic
setup example looks like this:

\begin{verbatim}
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        police rate 1mbit burst 100k
\end{verbatim}

The ingress qdisc is not a real one, but merely a point of reference for filters
to attach to which should get applied to incoming traffic. The \filter{u32}
filter added above matches on any packet and therefore limits the total incoming
bandwidth to 1mbit/s, allowing bursts of up to 100kbytes. Using the new syntax,
the filter command changes slightly:

\begin{verbatim}
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action police rate 1mbit burst 100k
\end{verbatim}

The important detail is that this syntax allows defining multiple actions.
E.g. for testing purposes, it is possible to redirect exceeding traffic to the
loopback interface instead of dropping it:

\begin{verbatim}
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action police rate 1mbit burst 100k conform-exceed pipe \
        action mirred egress redirect dev lo
\end{verbatim}

The added parameter \texttt{conform-exceed pipe} tells the police action to
allow further actions to handle the exceeding packet.

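To get a feeling for what \texttt{rate 1mbit burst 100k} means in practice, here
is a deliberately simplified token bucket model in Python. It is not how the
police action is implemented in the kernel, merely a sketch of the general idea;
the class name \texttt{Policer} is hypothetical, and it is assumed that
\texttt{1mbit} means $10^6$ bit/s and \texttt{100k} means $100 \cdot 1024$
bytes.

```python
class Policer:
    """Simplified single token bucket; rate in bit/s, burst in bytes."""

    def __init__(self, rate_bits, burst_bytes):
        self.rate = rate_bits / 8.0       # refill rate in bytes per second
        self.burst = burst_bytes          # bucket size in bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0                   # time of the previous packet

    def conforms(self, size, now):
        """True if a packet of `size` bytes arriving at `now` (s) conforms."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        # Exceeding packet: dropped by default, or handed to the next
        # action when "conform-exceed pipe" is given.
        return False


if __name__ == "__main__":
    policer = Policer(rate_bits=1000000, burst_bytes=100 * 1024)
    passed = sum(policer.conforms(1500, 0.0) for _ in range(100))
    print("packets passed in initial burst:", passed)
```

In this model, a burst of back-to-back packets is accepted until the bucket is
empty, after which packets only conform at the configured rate.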
Apart from the \texttt{police} and \texttt{mirred} actions, there are a few
more. Here's a full list of the currently implemented ones:

\begin{description}
\item[bpf]
  Apply a Berkeley Packet Filter program to the packet.
\item[connmark]
  Set the packet's firewall mark to that of its connection. This works by
  searching the conntrack table for a matching entry and, if one is found,
  taking over its mark.
\item[csum]
  Trigger recalculation of packet checksums. The supported protocols are:
  IPv4, ICMP, IGMP, TCP, UDP and UDPLite.
\item[ipt]
  Pass the packet to an iptables target. This allows using iptables
  extensions directly instead of having to go the extra mile via setting
  an arbitrary firewall mark and matching on that from within netfilter.
\item[mirred]
  Mirror or redirect packets. This is often combined with the ifb pseudo
  device to share a common QoS setup between multiple interfaces.
\item[nat]
  Perform stateless Network Address Translation. This is certainly not
  complete and therefore inferior to NAT using iptables: although the
  kernel module distinguishes between TCP, UDP and ICMP traffic, it does
  not handle typical problematic protocols such as active FTP or SIP.
\item[pedit]
  Generic packet editing. This allows altering arbitrary bytes of the
  packet, either by specifying an offset into the packet or by naming a
  packet header and field name to change. Currently, the latter is
  only implemented for IPv4.
\item[police]
  Apply a bandwidth rate limiting policy. Packets exceeding it are dropped
  by default, but may optionally be handled differently.
\item[simple]
  This is rather an example than a real action. All it does is print a
  user-defined string together with a packet counter. It may be useful for
  debugging when filter statistics are not available or too complicated.
\item[skbedit]
  Edit associated packet data. Supports changing queue mapping, priority
  field and firewall mark value.
\item[vlan]
  Add/remove a VLAN header to/from the packet. This might serve as an
  alternative to using 802.1Q pseudo-interfaces in combination with
  routing rules, e.g. for packets to a given destination.
\end{description}

\section*{Intermediate Functional Block}

The Intermediate Functional Block (\texttt{ifb}) pseudo network interface acts
as a QoS concentrator for multiple different sources of traffic. Packets from or
to other interfaces have to be redirected to it using the \texttt{mirred} action
in order to be handled; regularly routed traffic will be dropped. This way, a
single stack of qdiscs, classes and filters can be shared between multiple
interfaces.

Here's a simple example to feed incoming traffic from multiple interfaces
through a Stochastic Fairness Queue (\qdisc{sfq}):

\begin{verbatim}
(1) # modprobe ifb
(2) # ip link set ifb0 up
(3) # tc qdisc add dev ifb0 root sfq
\end{verbatim}

The first step is to load the \texttt{ifb} kernel module (1). By default, this
will create two ifb devices: \iface{ifb0} and \iface{ifb1}. After setting
\iface{ifb0} up in (2), the root qdisc is replaced by \qdisc{sfq} in (3).
Finally, one can start redirecting ingress traffic to \iface{ifb0}, e.g. from
\iface{eth0}:

\begin{verbatim}
# tc qdisc add dev eth0 handle ffff: ingress
# tc filter add dev eth0 parent ffff: u32 \
        match u32 0 0 \
        action mirred egress redirect dev ifb0
\end{verbatim}

The same can be done for other interfaces, just replacing \iface{eth0} in the
two commands above. One thing to keep in mind here is the asymmetrical routing
this creates within the host doing the QoS: incoming packets enter the system
via \iface{ifb0}, while corresponding replies leave directly via \iface{eth0}.
This can be observed using \cmd{tcpdump} on \iface{ifb0}, which shows the input
part of the traffic only. What's more confusing is that \cmd{tcpdump} on
\iface{eth0} shows both incoming and outgoing traffic, but the redirection is
still effective - a simple proof is setting \iface{ifb0} down, which will
interrupt the communication. Obviously \cmd{tcpdump} catches the packets to dump
before they enter the ingress qdisc, which is why it sees them while the kernel
itself doesn't.

\section*{Conclusion}

Once the steep learning curve has been mastered, the conglomerate of (classful)
qdiscs, filters and actions provides a highly sophisticated and flexible
infrastructure to perform QoS, which plays nicely along with routing.

\section*{Further Reading}

A good starting point for novice users and experienced ones diving into unknown
areas is the extensive HOWTO at \url{http://lartc.org}. The iproute2 package
ships some examples (usually in /usr/share/doc/, depending on distribution) as
well as man pages for \cmd{tc} in general, qdiscs and filters. The latter have
been added just recently though, so if your distribution does not ship iproute2
version 4.3.0 yet, these are not in there. Apart from that, the internet is a
rich source of HOWTOs and scripts people wrote - though these should be taken
with a grain of salt: the complexity of the matter often leads to copying
others' solutions without much validation, which allows less optimal or even
obsolete implementations to survive much longer than desired.