\documentstyle[12pt,twoside]{article}
\def\TITLE{IPv6 Flow Labels}
\input preamble
\begin{center}
\Large\bf IPv6 Flow Labels in Linux-2.2.
\end{center}


\begin{center}
{ \large Alexey~N.~Kuznetsov } \\
\em Institute for Nuclear Research, Moscow \\
\verb|kuznet@ms2.inr.ac.ru| \\
\rm April 11, 1999
\end{center}

\vspace{5mm}

\tableofcontents

\section{Introduction.}

Every IPv6 packet carries 28 bits of flow information. RFC2460 splits
these bits into two fields: 8 bits of traffic class (or DS field, if you
prefer this term) and 20 bits of flow label. Currently there exists
no well-defined API to manage IPv6 flow information. In this document
I describe an attempt to design such an API for the Linux-2.2 IPv6 stack.
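
The split of the 28 bits can be illustrated by a few helpers. This is only
a sketch: the names \verb|flowinfo_pack|, \verb|flowinfo_tclass| and
\verb|flowinfo_label| are hypothetical and are not part of any API. It
assumes that flowinfo is kept in network byte order, with the flow label in
the low 20 bits and the traffic class in the next 8 bits.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* htonl(), ntohl() */

#define FLOWINFO_LABEL_MASK  0x000FFFFFu  /* low 20 bits: flow label */
#define FLOWINFO_TCLASS_MASK 0x0FF00000u  /* next 8 bits: traffic class */

/* Pack traffic class and flow label into a network-order flowinfo word.
 * Bits of `label` above the 20th are silently dropped. */
uint32_t flowinfo_pack(uint8_t tclass, uint32_t label)
{
    uint32_t host = ((uint32_t)tclass << 20) | (label & FLOWINFO_LABEL_MASK);
    return htonl(host);
}

/* Extract the 8-bit traffic class from a network-order flowinfo word. */
uint8_t flowinfo_tclass(uint32_t flowinfo)
{
    return (uint8_t)((ntohl(flowinfo) & FLOWINFO_TCLASS_MASK) >> 20);
}

/* Extract the 20-bit flow label (host byte order) from flowinfo. */
uint32_t flowinfo_label(uint32_t flowinfo)
{
    return ntohl(flowinfo) & FLOWINFO_LABEL_MASK;
}
```

The top 4 bits of the word are left zero here; in the packet header they
hold the IP version.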

\vskip 1mm

The API must solve the following tasks:

\begin{enumerate}

\item To allow the user to set traffic class bits.

\item To allow the user to read traffic class bits of received packets.
This feature is not as useful as the first one, however it will be
necessary f.e.\ to implement ECN [RFC2481] for datagram oriented services
or to implement the receiver side of SRP or another end-to-end protocol
using traffic class bits.

\item To assign flow labels to packets sent by the user.

\item To get flow labels of received packets. I do not know
any applications of this feature, but it is possible that a receiver will
want to use flow labels to distinguish sub-flows.

\item To allocate flow labels in a way compliant with RFC2460. Namely:

\begin{itemize}
\item
Flow labels must be uniformly distributed (pseudo-)random numbers,
so that any subset of the 20 bits can be used as a hash key.

\item
Flows with coinciding source address and flow label must have identical
destination address and non-fragmentable extension headers (i.e.\
hop-by-hop options and all the headers up to and including the routing header,
if it is present.)

\begin{NB}
There is a hole in the specs: some hop-by-hop options can be
defined only on a per-packet basis (f.e.\ jumbo payload option).
Essentially, it means that such options cannot be present in packets
with flow labels.
\end{NB}
\begin{NB}
NB notes here and below reflect only my personal opinion,
they should be read with a smile or should not be read at all :-).
\end{NB}


\item
Flow labels have a finite lifetime and the source is not allowed to reuse
a flow label for another flow until the maximal lifetime has expired,
so that intermediate nodes will be able to invalidate flow state before
the label is taken over by another flow.
Flow state, including lifetime, is propagated along the datagram path
by some application specific methods
(f.e.\ in RSVP PATH messages or in some hop-by-hop option).


\end{itemize}

\end{enumerate}
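
The reuse rule from the last task can be illustrated with a toy lease table.
It is only a sketch under strong simplifications and not the kernel
algorithm: the names \verb|fl_lease| and \verb|fl_lease_get| are
hypothetical, the table has a fixed size, extension headers are ignored and
the caller supplies both the label and the current time.

```c
#include <stdint.h>
#include <string.h>

#define FL_TABLE_SIZE 16

struct fl_lease {
    uint32_t label;     /* 20-bit label, 0 means the slot is free */
    uint8_t  dst[16];   /* destination address bound to the label */
    long     expires;   /* time (seconds) until which the binding holds */
};

/*
 * Try to lease `label` for destination `dst` at time `now`.
 * Returns 0 on success, -1 if the label is still bound to another
 * destination or the table is full.
 */
int fl_lease_get(struct fl_lease *tab, uint32_t label,
                 const uint8_t dst[16], long now, long lifetime)
{
    struct fl_lease *free_slot = NULL;
    int i;

    for (i = 0; i < FL_TABLE_SIZE; i++) {
        struct fl_lease *e = &tab[i];

        if (e->label != 0 && e->expires <= now)
            e->label = 0;                  /* binding has expired */
        if (e->label == 0) {
            if (!free_slot)
                free_slot = e;             /* remember first free slot */
            continue;
        }
        if (e->label != label)
            continue;
        if (memcmp(e->dst, dst, 16) != 0)
            return -1;                     /* reuse for another flow */
        if (now + lifetime > e->expires)
            e->expires = now + lifetime;   /* same flow: extend the lease */
        return 0;
    }
    if (!free_slot)
        return -1;
    free_slot->label = label;
    memcpy(free_slot->dst, dst, 16);
    free_slot->expires = now + lifetime;
    return 0;
}
```

A label leased for one destination is refused for another destination until
its lifetime runs out, which is the behaviour intermediate nodes rely on.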

\section{Sending/receiving flow information.}

\paragraph{Discussion.}
\addcontentsline{toc}{subsection}{Discussion}
It was proposed (Where? I do not remember any explicit statement)
to solve the first four tasks using the
\verb|sin6_flowinfo| field added to \verb|struct| \verb|sockaddr_in6|
(see RFC2553).

\begin{NB}
This method is difficult to consider as reasonable, because it
puts additional overhead on all the services, even though only
a very small subset of them (none, to be more exact) really uses it.
It contradicts both the IETF spirit and the letter. Before RFC2553
one justification existed: IPv6 address alignment left a 4 byte
hole in \verb|sockaddr_in6| in any case. Now it has no justification.
\end{NB}

We have two problems with this method. The first one is common to all OSes:
if \verb|recvmsg()| initializes \verb|sin6_flowinfo| to the flow info
of the received packet, we lose one very important property of the BSD socket API,
namely, we are not allowed to use the received address for a reply directly
and have to mangle it, even if we are not interested in flowinfo subtleties.

\begin{NB}
RFC2553 adds a new requirement: to clear \verb|sin6_flowinfo|.
Certainly, it is not a solution but rather an attempt to force applications
to do unnecessary work. Well, as usual, one mistake in design
is followed by attempts to patch the hole and more mistakes...
\end{NB}

The other problem is Linux specific. Historically Linux IPv6 did not
initialize \verb|sin6_flowinfo| at all, so that, if the kernel does not
support flow labels, this field is not zero, but a random number.
Some applications also did not take care of it.

\begin{NB}
Following RFC2553 such applications can be considered broken,
but I still think that they are right: clearing the whole address
before filling in known fields is a robust but stupid solution.
Uselessly wasting CPU cycles and
memory bandwidth is not a good idea. Such patches are acceptable
as temporary hacks, but not as a standard of the future.
\end{NB}


\paragraph{Implementation.}
\addcontentsline{toc}{subsection}{Implementation}
By default Linux IPv6 does not read the \verb|sin6_flowinfo| field,
assuming that common applications are not obliged to initialize it
and are permitted to consider it as pure alignment padding.
In order to tell the kernel that the application
is aware of this field, it is necessary to set the socket option
\verb|IPV6_FLOWINFO_SEND|.

\begin{verbatim}
  int on = 1;
  setsockopt(sock, SOL_IPV6, IPV6_FLOWINFO_SEND,
             (void*)&on, sizeof(on));
\end{verbatim}

The Linux kernel never fills the \verb|sin6_flowinfo| field when passing
a message to user space, though the kernels which support flow labels
initialize it to zero. If the user wants to get received flowinfo, he
will set the option \verb|IPV6_FLOWINFO| and after this he will receive
flowinfo as an ancillary data object of type \verb|IPV6_FLOWINFO|
(cf.\ RFC2292).

\begin{verbatim}
  int on = 1;
  setsockopt(sock, SOL_IPV6, IPV6_FLOWINFO, (void*)&on, sizeof(on));
\end{verbatim}

Flowinfo received and latched by a connected TCP socket also may be fetched
with \verb|getsockopt()| \verb|IPV6_PKTOPTIONS| together with
other optional information.

Besides that, in the spirit of RFC2292 the option \verb|IPV6_FLOWINFO|
may be used as an alternative way to send flowinfo with \verb|sendmsg()| or
to latch it with \verb|IPV6_PKTOPTIONS|.
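
Sending flowinfo with \verb|sendmsg()| could look roughly like the sketch
below, which only prepares the ancillary data object and leaves the actual
\verb|sendmsg()| call to the caller. The helper name
\verb|attach_flowinfo| is hypothetical. The fallback values of
\verb|SOL_IPV6| and \verb|IPV6_FLOWINFO| are assumptions taken from current
Linux headers, for the case when the C library does not export them.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SOL_IPV6
#define SOL_IPV6 41        /* assumption: Linux value, equal to IPPROTO_IPV6 */
#endif
#ifndef IPV6_FLOWINFO
#define IPV6_FLOWINFO 11   /* assumption: value from <linux/in6.h> */
#endif

/*
 * Attach one IPV6_FLOWINFO ancillary object to `msg`, using `cbuf`
 * (of size `cbuflen`) as the control buffer. `flowinfo` is in network
 * byte order. Returns 0 on success, -1 if the buffer is too small.
 */
int attach_flowinfo(struct msghdr *msg, void *cbuf, size_t cbuflen,
                    uint32_t flowinfo)
{
    struct cmsghdr *c;

    if (cbuflen < CMSG_SPACE(sizeof(flowinfo)))
        return -1;
    memset(cbuf, 0, cbuflen);
    msg->msg_control = cbuf;
    msg->msg_controllen = CMSG_SPACE(sizeof(flowinfo));

    c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = SOL_IPV6;
    c->cmsg_type = IPV6_FLOWINFO;
    c->cmsg_len = CMSG_LEN(sizeof(flowinfo));
    memcpy(CMSG_DATA(c), &flowinfo, sizeof(flowinfo));
    return 0;
}
```

After the call, \verb|msg| can be completed with the destination address and
the data and then passed to \verb|sendmsg()|.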

\paragraph{Note about IPv6 options and destination address.}
\addcontentsline{toc}{subsection}{IPv6 options and destination address}
If \verb|sin6_flowinfo| contains a non-zero flow label, the
destination address in \verb|sin6_addr| and non-fragmentable
extension headers are ignored. Instead, the kernel uses the values
cached at flow setup (see below). However, for connected sockets
the kernel prefers the values set at connection time.

\paragraph{Example.}
\addcontentsline{toc}{subsection}{Example}
After setting the socket option \verb|IPV6_FLOWINFO|,
flowlabel and DS field are received as an ancillary data object
of type \verb|IPV6_FLOWINFO| and level \verb|SOL_IPV6|.
In the cases when it is convenient to use \verb|recvfrom(2)|,
it is possible to replace the library variant with your own one,
sort of:

\begin{verbatim}
#include <sys/socket.h>
#include <netinet/in6.h>

ssize_t recvfrom(int fd, char *buf, size_t len, int flags,
                 struct sockaddr *addr, int *addrlen)
{
  ssize_t cc;   /* signed, so that the error return of recvmsg() is seen */
  char cbuf[128];
  struct cmsghdr *c;
  struct iovec iov = { buf, len };
  struct msghdr msg = { addr, *addrlen,
                        &iov,  1,
                        cbuf, sizeof(cbuf),
                        0 };

  cc = recvmsg(fd, &msg, flags);
  if (cc < 0)
    return cc;
  /* Default to zero, then overwrite if the kernel passed flowinfo. */
  ((struct sockaddr_in6*)addr)->sin6_flowinfo = 0;
  *addrlen = msg.msg_namelen;
  for (c=CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level != SOL_IPV6 ||
        c->cmsg_type != IPV6_FLOWINFO)
      continue;
    ((struct sockaddr_in6*)addr)->sin6_flowinfo = *(__u32*)CMSG_DATA(c);
  }
  return cc;
}
\end{verbatim}



\section{Flow label management.}

\paragraph{Discussion.}
\addcontentsline{toc}{subsection}{Discussion}
The requirements of RFC2460 are pretty tough. In particular, lifetimes
that extend across a reboot require storing allocated labels in stable
storage, so that the full implementation necessarily includes a user space flow
label manager. There are at least three different approaches:

\begin{enumerate}
\item {\bf ``Cooperative''. } We could leave flow label allocation wholly
to user space. When a user needs a label he asks the manager directly. The approach
is valid, but as any ``cooperative'' approach it suffers from security problems.

\begin{NB}
One idea is to disallow unprivileged users to allocate flow
labels, but instead to pass the socket to the manager via an \verb|SCM_RIGHTS|
control message, so that it will allocate the label and assign it to the socket
itself. Hmm... the idea is interesting.
\end{NB}

\item {\bf ``Indirect''.} The kernel redirects requests to a user level daemon
and does not install the label until the daemon has acknowledged the request.
The approach is the most promising, it is especially pleasant to recognize
the parallel with the IPsec API [RFC2367,Craig]. Actually, it may share the API with
IPsec.

\item {\bf ``Stupid''.} To allocate labels in kernel space. It is the simplest
method, but it suffers from two serious flaws: first,
we cannot lease labels with lifetimes that extend across a reboot; second,
it is sensitive to DoS attacks. The kernel has to remember all the obsolete
labels until their expiration and a malicious user may quickly eat all the
flow label space.

\end{enumerate}

Certainly, I choose the most ``stupid'' method. It is the cheapest one
for the implementor (i.e.\ me), and taking into account that flow labels
still have no serious applications it is not useful to work on a more
advanced API, especially since eventually we
will get it for no fee together with IPsec.


\paragraph{Implementation.}
\addcontentsline{toc}{subsection}{Implementation}
The socket option \verb|IPV6_FLOWLABEL_MGR| lets the application
ask the flow label manager to allocate a new flow label, to reuse
an already allocated one or to delete an old flow label.
Its argument is \verb|struct| \verb|in6_flowlabel_req|:

\begin{verbatim}
struct in6_flowlabel_req
{
        struct in6_addr flr_dst;
        __u32           flr_label;
        __u8            flr_action;
        __u8            flr_share;
        __u16           flr_flags;
        __u16           flr_expires;
        __u16           flr_linger;
        __u32           __flr_reserved;
        /* Options in format of IPV6_PKTOPTIONS */
};
\end{verbatim}

\begin{itemize}

\item \verb|dst| is the IPv6 destination address associated with the label.

\item \verb|label| is the flow label value in network byte order. If it is zero,
the kernel will allocate a new pseudo-random number. Otherwise, the kernel will try
to lease the flow label requested by the user. In this case, it is the user's task to provide
the necessary flow label randomness.

\item \verb|action| is the requested operation. Currently, only three operations
are defined:

\begin{verbatim}
#define IPV6_FL_A_GET   0   /* Get flow label */
#define IPV6_FL_A_PUT   1   /* Release flow label */
#define IPV6_FL_A_RENEW 2   /* Update expire time */
\end{verbatim}

\item \verb|flags| are optional modifiers. Currently
only \verb|IPV6_FL_A_GET| has modifiers:

\begin{verbatim}
#define IPV6_FL_F_CREATE 1   /* Allowed to create new label */
#define IPV6_FL_F_EXCL   2   /* Do not create new label */
\end{verbatim}


\item \verb|share| defines who is allowed to reuse the same flow label.

\begin{verbatim}
#define IPV6_FL_S_NONE    0   /* Not defined */
#define IPV6_FL_S_EXCL    1   /* Label is private */
#define IPV6_FL_S_PROCESS 2   /* May be reused by this process */
#define IPV6_FL_S_USER    3   /* May be reused by this user */
#define IPV6_FL_S_ANY     255 /* Anyone may reuse it */
\end{verbatim}

\item \verb|linger| is a time in seconds. After the last user releases the flow
label, it will not be reused with a different destination and options for at least
this time. If \verb|share| is not \verb|IPV6_FL_S_EXCL| the label
still can be shared by other sockets. The current implementation does not allow
an unprivileged user to set linger longer than 60 sec.

\item \verb|expires| is a time in seconds. The flow label will be kept at least
for this time, but it will not be destroyed before the user has released it explicitly
or closed all the sockets using it. The current implementation does not allow
an unprivileged user to set a timeout longer than 60 sec. Privileged applications
MAY set longer lifetimes, but in this case they MUST save allocated
labels in stable storage and restore them after reboot before the first
application allocates a new flow.

\end{itemize}

This structure is followed by optional extension headers associated
with this flow label in the format of \verb|IPV6_PKTOPTIONS|. Only
\verb|IPV6_HOPOPTS|, \verb|IPV6_RTHDR| and, if \verb|IPV6_RTHDR| is present,
\verb|IPV6_DSTOPTS| are allowed.
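
As an illustration of the request structure, prolonging a lease with
\verb|IPV6_FL_A_RENEW| might be written as in the sketch below. The helper
names \verb|make_renew_req| and \verb|renew_flow_label| are hypothetical,
and the sketch assumes that \verb|<linux/in6.h>| can be included after
\verb|<netinet/in.h>|, which may not hold with old header sets.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/in6.h>   /* struct in6_flowlabel_req, IPV6_FL_A_* */

#ifndef SOL_IPV6
#define SOL_IPV6 41      /* assumption: Linux value, equal to IPPROTO_IPV6 */
#endif

/* Build a request that prolongs the lease of `label` (network byte
 * order); `expires` and `linger` are in seconds. */
struct in6_flowlabel_req make_renew_req(__u32 label, __u16 expires,
                                        __u16 linger)
{
    struct in6_flowlabel_req freq;

    memset(&freq, 0, sizeof(freq));
    freq.flr_label = label;
    freq.flr_action = IPV6_FL_A_RENEW;
    freq.flr_expires = expires;
    freq.flr_linger = linger;
    return freq;
}

/* Submit the request on socket `fd`; returns 0, or -1 with errno set. */
int renew_flow_label(int fd, __u32 label, __u16 expires, __u16 linger)
{
    struct in6_flowlabel_req freq = make_renew_req(label, expires, linger);

    return setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                      &freq, sizeof(freq));
}
```

An unprivileged caller would keep both times within the 60 sec limit
mentioned above.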

\paragraph{Example.}
\addcontentsline{toc}{subsection}{Example}
The function \verb|get_flow_label| allocates
a private flow label.

\begin{verbatim}
int get_flow_label(int fd, struct sockaddr_in6 *dst, __u32 fl)
{
        int on = 1;
        struct in6_flowlabel_req freq;

        /* Ask for a private label bound to the destination address. */
        memset(&freq, 0, sizeof(freq));
        freq.flr_label = htonl(fl);
        freq.flr_action = IPV6_FL_A_GET;
        freq.flr_flags = IPV6_FL_F_CREATE | IPV6_FL_F_EXCL;
        freq.flr_share = IPV6_FL_S_EXCL;
        memcpy(&freq.flr_dst, &dst->sin6_addr, 16);
        if (setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                       &freq, sizeof(freq)) == -1) {
                perror ("can't lease flowlabel");
                return -1;
        }
        dst->sin6_flowinfo |= freq.flr_label;

        /* Tell the kernel that sin6_flowinfo is meaningful. */
        if (setsockopt(fd, SOL_IPV6, IPV6_FLOWINFO_SEND,
                       &on, sizeof(on)) == -1) {
                perror ("can't send flowinfo");

                /* Give the label back on failure. */
                freq.flr_action = IPV6_FL_A_PUT;
                setsockopt(fd, SOL_IPV6, IPV6_FLOWLABEL_MGR,
                           &freq, sizeof(freq));
                return -1;
        }
        return 0;
}
\end{verbatim}

A slightly more complicated example using a routing header can be found
in the \verb|ping6| utility (\verb|iputils| package). The Linux rsvpd backend
contains an example of using the operation \verb|IPV6_FL_A_RENEW|.

\paragraph{Listing flow labels.}
\addcontentsline{toc}{subsection}{Listing flow labels}
The list of currently allocated
flow labels may be read from \verb|/proc/net/ip6_flowlabel|.

\begin{verbatim}
Label S Owner  Users  Linger Expires  Dst                              Opt
A1BE5 1 0      0      6      3        3ffe2400000000010a0020fffe71fb30 0
\end{verbatim}

\begin{itemize}
\item \verb|Label| is the hexadecimal flow label value.
\item \verb|S| is the sharing style.
\item \verb|Owner| is the ID of the creator; it is zero, pid or uid, depending on
the sharing style.
\item \verb|Users| is the number of applications using the label now.
\item \verb|Linger| is the \verb|linger| of this label in seconds.
\item \verb|Expires| is the time until expiration of the label in seconds. It may
be negative, if the label is in use.
\item \verb|Dst| is the IPv6 destination address.
\item \verb|Opt| is the length of options associated with the label. Option
data are not accessible.
\end{itemize}
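
One line of this listing can be parsed with \verb|sscanf()|, as in the
sketch below. The names \verb|fl_entry| and \verb|fl_parse_line| are
hypothetical, and the column layout is assumed to be exactly the one of the
sample above; other kernel versions may print the file differently.

```c
#include <stdio.h>

struct fl_entry {
    unsigned int label;   /* hexadecimal flow label */
    int share;            /* sharing style */
    int owner;            /* 0, pid or uid */
    int users;            /* number of current users */
    int linger;           /* seconds */
    int expires;          /* seconds, may be negative */
    char dst[33];         /* 32 hex digits of the destination address */
    int optlen;           /* length of associated options */
};

/* Parse one data line; returns 0 on success, -1 on a malformed line
 * (the header line is rejected as malformed too). */
int fl_parse_line(const char *line, struct fl_entry *e)
{
    int n = sscanf(line, "%x %d %d %d %d %d %32s %d",
                   &e->label, &e->share, &e->owner, &e->users,
                   &e->linger, &e->expires, e->dst, &e->optlen);

    return n == 8 ? 0 : -1;
}
```

A reader of the file would call \verb|fl_parse_line()| on every line
returned by \verb|fgets()| and skip the lines for which it fails.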


\paragraph{Flow labels and RSVP.}
\addcontentsline{toc}{subsection}{Flow labels and RSVP}
The RSVP daemon supports IPv6 flow labels
without any modifications to the standard ISI RAPI. The sender must allocate
a flow label, fill in the corresponding sender template and submit it to the local rsvp
daemon. rsvpd will check the label and start to announce it in PATH
messages. Rsvpd on the sender node will renew the flow label, so that it will not
be reused before path state expires and all the intermediate
routers and the receiver purge flow state.

The \verb|rtap| utility is modified to parse flow labels. F.e.\ if the user allocated
flow label \verb|0xA1234|, he may write:

\begin{verbatim}
RTAP> sender 3ffe:2400::1/FL0xA1234 <Tspec>
\end{verbatim}

The receiver makes a reservation with the command:
\begin{verbatim}
RTAP> reserve ff 3ffe:2400::1/FL0xA1234 <Flowspec>
\end{verbatim}

\end{document}