Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Though there are some distinct differences between BSD and Linux kernel
filtering, when we speak of BPF or LSF in the Linux context, we mean the
very same mechanism of filtering in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code,
send it to the kernel via the SO_ATTACH_FILTER option and, if your filter
code passes the kernel check on it, you then immediately begin filtering
data on that socket.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since when you close a socket
that has a filter on it, the filter is automagically removed. The other
less common case may be adding a different filter on the same socket where
you had another filter that is still running: the kernel takes care of
removing the old one and placing your new one in its place, assuming your
filter has passed the checks. If it fails, the old filter will remain on
that socket.

The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows one process to
set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
such as the team driver, PTP code, etc., where BPF is being used.

 [1] Documentation/prctl/seccomp_filter.txt

Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures:

struct sock_filter {	/* Filter block */
	__u16	code;	/* Actual filter code */
	__u8	jt;	/* Jump true */
	__u8	jf;	/* Jump false */
	__u32	k;	/* Generic multiuse field */
};

Such a structure is assembled as an array of 4-tuples, each of which
contains a code, jt, jf and k value. jt and jf are jump offsets and k is a
generic value to be used for a provided code.

struct sock_fprog {			/* Required for SO_ATTACH_FILTER. */
	unsigned short		   len;	/* Number of filter blocks */
	struct sock_filter __user *filter;
};

For socket filtering, a pointer to this structure (as shown in the
follow-up example) is passed to the kernel through setsockopt(2).

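As a minimal sketch of how such an array of 4-tuples can be written without
raw opcode numbers, the BPF_STMT() and BPF_JUMP() helper macros and the
BPF_* opcode constants from <linux/filter.h> can be used. The program below
encodes the ARP filter shown later in this document; the names arp_filter
and arp_prog are only illustrative.

```c
#include <linux/filter.h>

/* Same program as "ldh [12]; jne #0x806, drop; ret #-1; drop: ret #0". */
static struct sock_filter arp_filter[] = {
	/* A = 16 bit EtherType at byte offset 12 of the frame */
	BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
	/* if (A == 0x0806) skip 0 instructions, else skip 1 */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0806, 0, 1),
	/* accept: keep up to 0xffffffff bytes of the packet */
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
	/* drop: keep 0 bytes */
	BPF_STMT(BPF_RET | BPF_K, 0),
};

/* The wrapper structure that would be handed to SO_ATTACH_FILTER. */
static struct sock_fprog arp_prog = {
	.len = sizeof(arp_filter) / sizeof(arp_filter[0]),
	.filter = arp_filter,
};
```

The resulting tuples are the same ones that `bpf_asm -c` prints for the ARP
example, i.e. { 0x28, 0, 0, 0x0000000c }, { 0x15, 0, 1, 0x00000806 } and so on.
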
Example
-------

#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <linux/if_ether.h>
#include <linux/filter.h>
/* ... */

/* Not provided by user space headers; defined here for the example. */
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* From the example above: tcpdump -i em1 port 22 -dd */
struct sock_filter code[] = {
	{ 0x28,  0,  0, 0x0000000c },
	{ 0x15,  0,  8, 0x000086dd },
	{ 0x30,  0,  0, 0x00000014 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0, 17, 0x00000011 },
	{ 0x28,  0,  0, 0x00000036 },
	{ 0x15, 14,  0, 0x00000016 },
	{ 0x28,  0,  0, 0x00000038 },
	{ 0x15, 12, 13, 0x00000016 },
	{ 0x15,  0, 12, 0x00000800 },
	{ 0x30,  0,  0, 0x00000017 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0,  8, 0x00000011 },
	{ 0x28,  0,  0, 0x00000014 },
	{ 0x45,  6,  0, 0x00001fff },
	{ 0xb1,  0,  0, 0x0000000e },
	{ 0x48,  0,  0, 0x0000000e },
	{ 0x15,  2,  0, 0x00000016 },
	{ 0x48,  0,  0, 0x00000010 },
	{ 0x15,  0,  1, 0x00000016 },
	{ 0x06,  0,  0, 0x0000ffff },
	{ 0x06,  0,  0, 0x00000000 },
};

struct sock_fprog bpf = {
	.len = ARRAY_SIZE(code),
	.filter = code,
};

sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (sock < 0)
	/* ... bail out ... */

ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
if (ret < 0)
	/* ... bail out ... */

/* ... */
close(sock);

The above example code attaches a socket filter for a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments,
while SO_LOCK_FILTER, which prevents the filter from being detached, takes
an integer value of 0 or 1.

Note that socket filters are not restricted to PF_PACKET sockets only,
but can also be used on other socket families.

Summary of system calls:

 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap in high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all that.

Writing a filter "by hand" can be an alternative when i) using/linking to
libpcap is not an option, ii) the required BPF filters use Linux extensions
that are not supported by libpcap's compiler, iii) a filter might be more
complex and not cleanly implementable with libpcap's compiler, or iv)
particular filter codes should be optimized differently than libpcap's
internal compiler does. For example, xt_bpf and cls_bpf users might have
requirements that could result in more complex filter code, or one that
cannot be expressed with libpcap (e.g. different return codes for various
code paths). Moreover, BPF JIT implementors may wish to manually write test
cases and thus need low-level access to BPF code as well.

BPF engine and instruction set
------------------------------

Under tools/net/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for the example scenarios mentioned in
the previous section. The asm-like syntax mentioned here has been
implemented in bpf_asm and will be used for further explanations (instead
of dealing with less readable opcodes directly; the principles are the
same). The syntax is closely modelled after Steven McCanne's and Van
Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  Element          Description

  A                32 bit wide accumulator
  X                32 bit wide X register
  M[]              16 x 32 bit wide misc registers aka "scratch memory
                   store", addressable from 0 to 15

A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned):

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for condition
"jump if true", the other one for "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.

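To make the roles of A, X, M[] and the op/jt/jf/k tuple more concrete, here
is a toy interpreter for a small subset of the instruction set. It is only a
sketch for illustration and not the kernel's implementation: the function
name toy_bpf_run() is invented, only a handful of opcodes are modelled, and
anything else is treated as "drop".

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>

/*
 * Toy interpreter for a subset of classic BPF. Returns the filter's
 * return value (number of bytes to keep, 0 meaning "drop").
 */
static uint32_t toy_bpf_run(const struct sock_filter *prog, size_t len,
			    const uint8_t *pkt, size_t pkt_len)
{
	uint32_t A = 0, X = 0, M[BPF_MEMWORDS] = { 0 };
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct sock_filter *f = &prog[pc];

		switch (f->code) {
		case BPF_LD | BPF_H | BPF_ABS:	/* ldh [k], network byte order */
			if (pkt_len < 2 || f->k > pkt_len - 2)
				return 0;
			A = (uint32_t)pkt[f->k] << 8 | pkt[f->k + 1];
			break;
		case BPF_LD | BPF_B | BPF_ABS:	/* ldb [k] */
			if (f->k >= pkt_len)
				return 0;
			A = pkt[f->k];
			break;
		case BPF_LD | BPF_MEM:		/* ld M[k] */
			if (f->k >= BPF_MEMWORDS)
				return 0;
			A = M[f->k];
			break;
		case BPF_ST:			/* st M[k] */
			if (f->k >= BPF_MEMWORDS)
				return 0;
			M[f->k] = A;
			break;
		case BPF_MISC | BPF_TAX:	/* tax */
			X = A;
			break;
		case BPF_MISC | BPF_TXA:	/* txa */
			A = X;
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:	/* jeq #k, Lt, Lf */
			/* jt/jf are relative to the next instruction */
			pc += (A == f->k) ? f->jt : f->jf;
			break;
		case BPF_RET | BPF_K:		/* ret #k */
			return f->k;
		case BPF_RET | BPF_A:		/* ret a */
			return A;
		default:			/* not modelled in this sketch */
			return 0;
		}
	}
	return 0;	/* fell off the end: treat as "drop" */
}
```

Note how a jump never stores a target address: jt and jf are small forward
offsets added to the program counter, which is why both fit into 8 bits.
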
The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes, as defined in linux/filter.h, stand for:

  Instruction      Addressing mode      Description

  ld               1, 2, 3, 4, 10       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 10          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8                 Jump on A == k
  jneq             8                    Jump on A != k
  jne              8                    Jump on A != k
  jlt              8                    Jump on A < k
  jle              8                    Jump on A <= k
  jgt              7, 8                 Jump on A > k
  jge              7, 8                 Jump on A >= k
  jset             7, 8                 Jump on A & k

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg                                   -A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 9                 Return

The next table shows addressing formats from the 2nd column:

  Addressing mode  Syntax               Description

   0               x/%x                 Register X
   1               [k]                  BHW at byte offset k in the packet
   2               [x + k]              BHW at the offset X + k in the packet
   3               M[k]                 Word at offset k in M[]
   4               #k                   Literal value stored in k
   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
   6               L                    Jump label L
   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
   8               #k,Lt                Jump to Lt if predicate is true
   9               a/%a                 Accumulator A
  10               extension            BPF extension

The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a BPF
extension is loaded into A.

Possible BPF extensions are shown in the following table:

  Extension                             Description

  len                                   skb->len
  proto                                 skb->protocol
  type                                  skb->pkt_type
  poff                                  Payload start offset
  ifidx                                 skb->dev->ifindex
  nla                                   Netlink attribute of type X with offset A
  nlan                                  Nested Netlink attribute of type X with offset A
  mark                                  skb->mark
  queue                                 skb->queue_mapping
  hatype                                skb->dev->type
  rxhash                                skb->hash
  cpu                                   raw_smp_processor_id()
  vlan_tci                              vlan_tx_tag_get(skb)
  vlan_pr                               vlan_tx_tag_present(skb)
  rand                                  prandom_u32()

These extensions can also be prefixed with '#'.
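
In C, the "overloaded" k argument can be spelled with the SKF_AD_OFF base
and the SKF_AD_* extension offsets from <linux/filter.h>. The sketch below
encodes a filter that loads the vlan_tci extension and accepts only the
value 10; the array name vlan10_filter is illustrative, and the mapping of
the bpf_asm keyword vlan_tci to SKF_AD_VLAN_TAG is stated here as an
assumption.

```c
#include <linux/filter.h>

/* Intended to match "ld vlan_tci; jneq #10, drop; ret #-1; drop: ret #0". */
static struct sock_filter vlan10_filter[] = {
	/*
	 * A word load whose k lies in the negative SKF_AD_OFF range does not
	 * read packet data; it selects an extension (assumed: vlan_tci).
	 */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 (__u32)(SKF_AD_OFF + SKF_AD_VLAN_TAG)),
	/* if (A == 10) fall through to accept, else skip to drop */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 10, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),	/* ret #-1 */
	BPF_STMT(BPF_RET | BPF_K, 0),		/* drop: ret #0 */
};
```

Since SKF_AD_OFF is negative, k ends up as a large unsigned 32 bit value,
well outside any real packet offset.
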
Examples for low-level BPF:

** ARP packets:

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

** IPv4 TCP packets:

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

** (Accelerated) VLAN w/ id 10:

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

** ICMP random packet sampling, 1 in 4:

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #1, drop
  # get a random uint32 number
  ld rand
  mod #4
  jneq #1, drop
  ret #-1
  drop: ret #0

** SECCOMP filter example:

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */

The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, output that
xt_bpf and cls_bpf understand and that can be loaded directly. Example
with the above ARP code:

$ ./bpf_asm foo
4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

In copy and paste C-like output:

$ ./bpf_asm -c foo
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  1, 0x00000806 },
{ 0x06,  0,  0, 0xffffffff },
{ 0x06,  0,  0, 0000000000 },

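The two output forms carry the same information. As a sketch, assuming the
decimal form is "<len>,<code> <jt> <jf> <k>,..." exactly as in the sample
above, it can be turned back into an array of struct sock_filter as follows.
The helper name parse_bpf_decimal() is made up for this example and is not
part of bpf_asm.

```c
#include <stdio.h>
#include <linux/filter.h>

/*
 * Parse "len,code jt jf k,code jt jf k,..." into out[], which has room
 * for max entries. Returns the number of instructions, or -1 on error.
 */
static int parse_bpf_decimal(const char *s, struct sock_filter *out, int max)
{
	unsigned int len, code, jt, jf, k;
	int i, used = 0;

	/* %n records how many characters were consumed so far */
	if (sscanf(s, "%u,%n", &len, &used) != 1 || used == 0)
		return -1;
	if (max < 0 || len > (unsigned int)max)
		return -1;
	s += used;

	for (i = 0; i < (int)len; i++) {
		if (sscanf(s, "%u %u %u %u%n", &code, &jt, &jf, &k, &used) != 4)
			return -1;
		/* code is 16 bit wide, jt and jf are 8 bit wide */
		if (code > 0xffff || jt > 0xff || jf > 0xff)
			return -1;
		out[i].code = (__u16)code;
		out[i].jt = (__u8)jt;
		out[i].jf = (__u8)jf;
		out[i].k = k;
		s += used;
		if (*s == ',')
			s++;
	}
	return (int)len;
}
```

Feeding it the ARP output line above yields the same four tuples that
`bpf_asm -c` prints, e.g. 2054 decimal is 0x806 and 4294967295 is 0xffffffff.
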
In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
for testing BPF filters against given pcap files, single stepping through the
BPF code on the pcap's packets, and doing BPF machine register dumps.

Starting bpf_dbg is trivial and just requires issuing:

# ./bpf_dbg

In case input and output do not equal stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
file "~/.bpf_dbg_init" and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote bpf_dbg shell).
The usual workflow would be to ...

> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

> load pcap foo.pcap
  Loads a standard tcpdump pcap file.

> run [<n>]
bpf passes:1 fails:9
  Runs through all packets from a pcap to account how many passes and fails
  the filter will generate. A limit of packets to traverse can be given.

> disassemble
l0:     ldh [12]
l1:     jeq #0x800, l2, l5
l2:     ldb [23]
l3:     jeq #0x1, l4, l5
l4:     ret #0xffff
l5:     ret #0
  Prints out BPF code disassembly.

> dump
/* { op, jt, jf, k }, */
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  3, 0x00000800 },
{ 0x30,  0,  0, 0x00000017 },
{ 0x15,  0,  1, 0x00000001 },
{ 0x06,  0,  0, 0x0000ffff },
{ 0x06,  0,  0, 0000000000 },
  Prints out C-style BPF code dump.

> breakpoint 0
breakpoint at: l0:      ldh [12]
> breakpoint 1
breakpoint at: l1:      jeq #0x800, l2, l5
  ...
  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is being hit (another `run` will continue from
  the currently active breakpoint executing next instructions):

  > run
  -- register dump --
  pc:       [0]                       <-- program counter
  code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
  curr:     l0: ldh [12]              <-- disassembly of current instruction
  A:        [00000000][0]             <-- content of A (hex, decimal)
  X:        [00000000][0]             <-- content of X (hex, decimal)
  M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
  -- packet dump --                   <-- Current packet from pcap (hex)
  len: 42
    0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
   16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
   32: 00 00 00 00 00 00 0a 3b 01 01
  (breakpoint)
  >

> breakpoint
breakpoints: 0 1
  Prints currently set breakpoints.

> step [-<n>, +<n>]
  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, the above register dump is issued.
  This can go forwards and backwards in time, a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

> select <n>
  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program is being evaluated against
  the user pre-selected packet. Numbering starts just as in Wireshark
  with index 1.

> quit
#
  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
ARM and s390, which can be enabled through CONFIG_BPF_JIT. The JIT compiler
is transparently invoked for each attached filter from user space or for
internal kernel users if it has been previously enabled by root:

  echo 1 > /proc/sys/net/core/bpf_jit_enable

For JIT developers, doing audits etc, each compile run can output the generated
opcode image into the kernel log via:

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg:

[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

In the kernel source tree under tools/net/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump:

# ./bpf_jit_disasm
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:   push   %rbp
   1:   mov    %rsp,%rbp
   4:   sub    $0x60,%rsp
   8:   mov    %rbx,-0x8(%rbp)
   c:   mov    0x68(%rdi),%r9d
  10:   sub    0x6c(%rdi),%r9d
  14:   mov    0xd8(%rdi),%r8
  1b:   mov    $0xc,%esi
  20:   callq  0xffffffffe0ff9442
  25:   cmp    $0x800,%eax
  2a:   jne    0x0000000000000042
  2c:   mov    $0x17,%esi
  31:   callq  0xffffffffe0ff945e
  36:   cmp    $0x1,%eax
  39:   jne    0x0000000000000042
  3b:   mov    $0xffff,%eax
  40:   jmp    0x0000000000000044
  42:   xor    %eax,%eax
  44:   leaveq
  45:   retq

Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers:

# ./bpf_jit_disasm -o
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:   push   %rbp
        55
   1:   mov    %rsp,%rbp
        48 89 e5
   4:   sub    $0x60,%rsp
        48 83 ec 60
   8:   mov    %rbx,-0x8(%rbp)
        48 89 5d f8
   c:   mov    0x68(%rdi),%r9d
        44 8b 4f 68
  10:   sub    0x6c(%rdi),%r9d
        44 2b 4f 6c
  14:   mov    0xd8(%rdi),%r8
        4c 8b 87 d8 00 00 00
  1b:   mov    $0xc,%esi
        be 0c 00 00 00
  20:   callq  0xffffffffe0ff9442
        e8 1d 94 ff e0
  25:   cmp    $0x800,%eax
        3d 00 08 00 00
  2a:   jne    0x0000000000000042
        75 16
  2c:   mov    $0x17,%esi
        be 17 00 00 00
  31:   callq  0xffffffffe0ff945e
        e8 28 94 ff e0
  36:   cmp    $0x1,%eax
        83 f8 01
  39:   jne    0x0000000000000042
        75 07
  3b:   mov    $0xffff,%eax
        b8 ff ff 00 00
  40:   jmp    0x0000000000000044
        eb 02
  42:   xor    %eax,%eax
        31 c0
  44:   leaveq
        c9
  45:   retq
        c3

For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.

BPF kernel internals
--------------------
Internally, for the kernel interpreter, a different BPF instruction set
format with similar underlying principles from BPF described in previous
paragraphs is being used. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that a better performance can be achieved (more details later).

It is designed to be JITed with one to one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized BPF code through
a BPF backend that performs almost as fast as natively compiled code.

The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile into BPF with an optional
GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
minimal performance overhead over two steps, that is, C -> BPF -> native code.

Currently, the new format is being used for running user BPF programs, which
includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the extended interpreter. For in-kernel handlers, this all works
transparently by using sk_unattached_filter_create() for setting up the
filter, resp. sk_unattached_filter_destroy() for destroying it. The macro
SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to
run the filter. 'filter' is a pointer to struct sk_filter that we got from
sk_unattached_filter_create(), and 'ctx' the given context (e.g. skb pointer).
All constraints and restrictions from sk_chk_filter() apply before a
conversion to the new layout is being done behind the scenes!

Currently, for JITing, the user BPF format is being used and current BPF JIT
compilers are reused whenever possible. In other words, we do not (yet!)
perform a JIT compilation in the new layout; however, future work will
successively migrate traditional JIT compilers into the new instruction
format as well, so that they will profit from the very same benefits. Thus,
when speaking about JIT in the following, a JIT compiler (TBD) for the new
instruction format is meant in this context.

Some core changes of the new internal format:

- Number of registers increases from 2 to 10:

  The old format had two registers A and X, and a hidden frame pointer. The
  new layout extends this to be 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs are passing arguments to functions via registers
  the number of args from BPF program to in-kernel function is restricted
  to 5 and one register is used to accept return value from an in-kernel
  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.

  Therefore, BPF calling convention is defined as:

    * R0       - return value from in-kernel function, and exit value for BPF program
    * R1 - R5  - arguments from BPF program to in-kernel function
    * R6 - R9  - callee saved registers that in-kernel function will preserve
    * R10      - read-only frame pointer to access stack

  Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and BPF calling convention maps directly to ABIs used by the kernel on
  64-bit architectures.

  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
  and may let more complex programs be interpreted.

  R0 - R5 are scratch registers and BPF program needs to spill/fill them if
  necessary across calls. Note that there is only one BPF program (== one BPF
  main routine) and it cannot call other BPF functions, it can only call
  predefined in-kernel functions, though.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit if they are being written to.
  That behavior maps directly to x86_64 and arm64 subregister definition, but
  makes other JITs more difficult.

  32-bit architectures run 64-bit internal BPF programs via interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  native instruction set and let the rest be interpreted.

  Operation is 64-bit, because on 64-bit architectures, pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
  32-bit BPF registers would otherwise require defining a register-pair
  ABI; a direct BPF register to HW register mapping would then not be
  possible, and the JIT would need to do combine/split/move operations for
  every register in and out of the function, which is complex, bug prone and
  slow. Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as "if (cond) jump_true;
  else jump_false;", they are being replaced with alternative constructs like
  "if (cond) jump_true; /* else fall-through */".

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the internal BPF program needs to
  place function arguments into R1 to R5 registers to satisfy calling
  convention, then the interpreter will take them from registers and pass
  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
  that are used for argument passing on given architecture, the JIT compiler
  doesn't need to emit extra moves. Function arguments will be in the correct
  registers and BPF_CALL instruction will be JITed as single 'call' HW
  instruction. This calling convention was picked to cover common call
  situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  a return value of the function. Since R6 - R9 are callee saved, their state
  is preserved across the call.

  For example, consider three C functions:

  u64 f1() { return (*_f2)(1); }
  u64 f2(u64 a) { return f3(a + 1, a); }
  u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64:

  f1:
    movl $1, %edi
    movq _f2(%rip), %rax
    jmp  *%rax
  f3:
    movq %rdi, %rax
    subq %rsi, %rax
    ret

  Function f2 in BPF may look like:

  f2:
    bpf_mov R2, R1
    bpf_add R1, 1
    bpf_call f3
    bpf_exit

  If f2 is JITed and the pointer is stored to '_f2', the calls f1 -> f2 -> f3
  and returns will be seamless. Without JIT, the __sk_run_filter() interpreter
  needs to be used to call into f2.

  For practical reasons all BPF programs have only one argument 'ctx' which is
  already placed into R1 (e.g. on __sk_run_filter() startup) and the programs
  can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
  are currently not supported, but these restrictions can be lifted if necessary
  in the future.

  On 64-bit architectures all registers map to HW registers one to one. For
  example, x86_64 JIT compiler can map them as ...

    R0 - rax
    R1 - rdi
    R2 - rsi
    R3 - rdx
    R4 - rcx
    R5 - r8
    R6 - rbx
    R7 - r13
    R8 - r14
    R9 - r15
    R10 - rbp

  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
  and rbx, r12 - r15 are callee saved.

  Then the following internal BPF pseudo-program:

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  After JIT to x86_64 may look like:

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x6,%esi
    mov $0x7,%edx
    mov $0x8,%ecx
    mov $0x9,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  Which is in this example equivalent in C to:

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
  registers and place their return value into '%rax' which is R0 in BPF.
  Prologue and epilogue are emitted by JIT and are implicit in the
  interpreter. R0-R5 are scratch registers, so BPF program needs to preserve
  them across the calls as defined by calling convention.

  For example the following program is invalid:

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1-R5 contain junk values and cannot be read.
  In the future a BPF verifier can be used to validate internal BPF programs.

Also in the new design, BPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Original BPF and the new format are two operand instructions,
which helps to do one-to-one mapping between BPF insn and x86 insn during JIT.

The input context pointer for invoking the interpreter function is generic,
its content is defined by a specific use case. For seccomp register R1 points
to seccomp_data, for converted BPF filters R1 points to a skb.

A program that is translated internally consists of the following elements:

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32

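As a sketch, the new layout on the right-hand side can be written down as a
C structure with bit-fields. The structure name insn_sketch is invented for
this illustration, and treating off and imm as signed is an assumption that
is not spelled out in the layout line; only the field widths are taken from
it.

```c
#include <stdint.h>

/* Illustrative only: field names and widths follow the layout line above. */
struct insn_sketch {
	uint8_t  op;		/* op:8       opcode */
	uint8_t  dst_reg:4;	/* dst_reg:4  destination register */
	uint8_t  src_reg:4;	/* src_reg:4  source register */
	int16_t  off;		/* off:16     offset (assumed signed) */
	int32_t  imm;		/* imm:32     immediate (assumed signed) */
};

/* 8 + 4 + 4 + 16 + 32 bits add up to exactly 8 bytes per instruction. */
_Static_assert(sizeof(struct insn_sketch) == 8,
	       "one instruction should occupy 8 bytes");
```

Four bits per register field are enough to address R0 to R10, and the total
matches the old op:16, jt:8, jf:8, k:32 tuple, which is also 8 bytes wide.
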
So far 87 internal BPF instructions were implemented. The 8-bit 'op' opcode
field has room for new instructions. Some of them may use 16/24/32 byte
encoding. New instructions must be a multiple of 8 bytes to preserve backward
compatibility.

Internal BPF is a general purpose RISC instruction set. Not every register and
every instruction are used during translation from original BPF to new format.
For example, socket filters are not using 'exclusive add' instruction, but
tracing filters may do to maintain counters of events, for example. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.

Internal BPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp are using it as an assembler. Tracing
filters may use it as an assembler to generate code from the kernel. In-kernel
usage may not be bounded by security considerations, since generated internal
BPF code may be optimizing an internal code path and not be exposed to user
space. Safety of internal BPF can come from a verifier (TBD). In such use cases
as described, it may be used as a safe instruction set.

Just like the original BPF, the new format runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: first step does depth-first-search to disallow
loops and other CFG validation; second step starts from the first insn and
descends all possible paths. It simulates execution of every insn and observes
the state change of registers and stack.

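The first of the two steps can be sketched as a depth-first search that
reports a loop as soon as it meets an instruction that is still on the
current search path. This is only an illustration of the idea and not kernel
code: struct cfg_insn is a hypothetical, simplified instruction that records
nothing but control flow, and cfg_has_loop() is an invented name.

```c
#include <stdbool.h>

/* Hypothetical, simplified instruction: control flow only. */
struct cfg_insn {
	bool is_exit;	/* program ends here */
	bool is_cond;	/* conditional jump: target or fall-through */
	int  jump_off;	/* relative to the next insn; 0 means "no jump" */
};

enum { UNSEEN, ON_PATH, DONE };

/*
 * Depth-first search from insn i. state[] must hold n zeroed entries.
 * Returns true if a loop or an edge leaving the program is reachable.
 */
static bool cfg_has_loop(const struct cfg_insn *p, int n, int i,
			 unsigned char *state)
{
	int next[2], cnt = 0, j;

	if (i < 0 || i >= n)
		return true;		/* edge leaves the program */
	if (state[i] == ON_PATH)
		return true;		/* back edge: a loop */
	if (state[i] == DONE)
		return false;		/* already fully explored */

	state[i] = ON_PATH;
	if (!p[i].is_exit) {
		if (p[i].jump_off == 0 || p[i].is_cond)
			next[cnt++] = i + 1;			/* fall-through */
		if (p[i].jump_off != 0)
			next[cnt++] = i + 1 + p[i].jump_off;	/* jump target */
		for (j = 0; j < cnt; j++)
			if (cfg_has_loop(p, n, next[j], state))
				return true;
	}
	state[i] = DONE;
	return false;
}
```

The second step, simulating every instruction along all paths and tracking
register and stack state, is considerably more involved and is not covered
by this sketch.
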
Testing
-------

Next to the BPF toolchain, the kernel also ships a test module that contains
various test cases for classic and internal BPF that can be executed against
the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
enabled via Kconfig:

  CONFIG_TEST_BPF=m

After the module has been built and installed, the test suite can be executed
via insmod or modprobe against 'test_bpf' module. Results of the test cases
including timings in nsec can be found in the kernel log (dmesg).

Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
Alexei Starovoitov <ast@plumgrid.com>