Jakub Kicinski [Fri, 24 Nov 2017 02:12:01 +0000 (18:12 -0800)]
bpf: split parse from program loading
Parsing command line is currently done together with potentially
loading a new eBPF program. This makes it more difficult to
provide additional parameters for loading (which may come after
the eBPF program info on the command line).
Split the two (only internally for now). Verbose parameter
has to be saved in struct bpf_cfg_in to be carried between
the stages.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:12:00 +0000 (18:12 -0800)]
bpf: allocate opcode table in struct bpf_cfg_in
struct bpf_cfg_in already carries a pointer to sock_filter ops.
It's currently set to a local variable in bpf_parse_opt_tbl(),
shared between parsing and loading stages. Move the array
entirely to struct bpf_cfg_in, this will allow us to split
parsing and loading.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:11:59 +0000 (18:11 -0800)]
bpf: keep parsed program mode in struct bpf_cfg_in
bpf_parse() will parse command line arguments to find out the
program mode. This mode will later be needed at loading time.
Instead of keeping it locally add it to struct bpf_cfg_in,
this will allow splitting parsing and loading stages.
enum bpf_mode has to be moved to the header file, because C
doesn't allow forward declaration of enums.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:11:58 +0000 (18:11 -0800)]
bpf: pass program type in struct bpf_cfg_in
Program type is needed both for parsing and loading of
the program. Parsing may also induce the type based on
signatures from __bpf_prog_meta. Instead of passing
the type around keep it in struct bpf_cfg_in.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adapts the tc command line interface to allow bandwidth limits
to be specified as a percentage of the interface's capacity.
Adding this functionality requires passing the specified device string to
each class/qdisc which changes the prototype for a couple of functions: the
.parse_qopt and .parse_copt interfaces. The device string is a required
parameter for tc-qdisc and tc-class, and when not specified, the kernel
returns ENODEV. In this patch, if the user tries to specify a bandwidth
percentage without naming the device, we return an error from userspace.
Tom Herbert [Wed, 22 Nov 2017 20:05:35 +0000 (12:05 -0800)]
ila: support to configure checksum neutral-map-auto
Configuration support in both ip ila and ip LWT for checksum
neutral-map-auto. This is a mode of ILA where checksum
neutral mapping is assumed for packets (there is no C-bit
in the identifier to indicate checksum neutral).
Jakub Kicinski [Thu, 23 Nov 2017 01:00:53 +0000 (17:00 -0800)]
bpf: initialize the verifier log
If program loading fails before verifier prints its first
message, the verifier log will not be initialized. Always
set the first character of the log buffer to zero to make
sure we don't dump non-printable characters to the terminal.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Lorenzo Colitti [Mon, 20 Nov 2017 03:57:07 +0000 (12:57 +0900)]
iproute2: fixes to compile on some systems.
1. Put the declarations of strlcpy and strlcat inside
an #ifdef NEED_STRLCPY. Their declarations were already in a
similar #ifdef.
2. In bpf_scm.h, include sys/un.h for struct sockaddr_un.
3. In utils.h, include time.h for struct timeval.
Tested: builds on ubuntu 14.04 with "make clean distclean; ./configure && make -j64"
Tested: 4.14.1 builds on Android with Android-specific #ifndefs for missing library code Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Phil Sutter [Wed, 15 Nov 2017 14:01:31 +0000 (15:01 +0100)]
tc_util: Silence spurious compiler warning
GCC version 7.2.1 complains that 'result1' may be used uninitialized in
parse_action_control_slash_spaces(). This should not be possible in
practice, so the actual value 'result1' is initialized with does not
matter.
Phil Sutter [Wed, 15 Nov 2017 14:01:30 +0000 (15:01 +0100)]
tc_util: Drop needless pointer check
The function parse_action_control_slash() returns early if 'p' is NULL,
so after the first call to action_a2n(), 'p' is guaranteed not to be
NULL. Otherwise, the assignment '*p = 0' above would dereference the
NULL pointer already anyway, so just drop this check here.
Jon Maloy [Wed, 15 Nov 2017 16:25:44 +0000 (17:25 +0100)]
tipc: change family attribute from u32 to u16
commit 28033ae4e0f ("net: netlink: Update attr validation to require
exact length for some types") introduces a stricter control on attributes
of type NLA_U* and NLA_S*.
Since the tipc tool is sending a family attribute of u32 instead of as
expected u16 the tool is now effectively broken.
We fix this by changing the type of the said attribute.
Leon Romanovsky [Mon, 13 Nov 2017 10:21:19 +0000 (12:21 +0200)]
ip: Fix compilation break on old systems
As was reported [1], the iproute2 fails to compile on old systems,
in Cong's case, it was Fedora 19, in our case it was RedHat 7.2, which
failed with the following errors during compilation:
ipxfrm.c: In function ‘xfrm_selector_print’:
ipxfrm.c:479:7: error: ‘IPPROTO_MH’ undeclared (first use in this
function)
case IPPROTO_MH:
^
ipxfrm.c:479:7: note: each undeclared identifier is reported only once
for each function it appears in
ipxfrm.c: In function ‘xfrm_selector_upspec_parse’:
ipxfrm.c:1345:8: error: ‘IPPROTO_MH’ undeclared (first use in this
function)
case IPPROTO_MH:
^ make[1]: *** [ipxfrm.o] Error 1
The reason to it is the order of headers files. The IPPROTO_MH field is
set in kernel's UAPI header file (in6.h), but only in case
__UAPI_DEF_IPPROTO_V6 is set before. That define comes from other kernel's
header file (libc-compat.h) and is set in case there are no previous
libc relevant declarations.
In ip code, the include of <netdb.h> causes to indirect inclusion of
<netinet/in.h> and it sets __UAPI_DEF_IPPROTO_V6 to be zero and prevents from
IPPROTO_MH declaration.
This patch takes the simplest possible approach to fix the compilation
error by checking if IPPROTO_MH was defined before and in case it
wasn't, it defines it to be the same as in the kernel.
Ivan Vecera [Fri, 10 Nov 2017 06:20:13 +0000 (07:20 +0100)]
lib: make resolve_hosts variable common
Any iproute utility that uses any function from lib/utils.c needs
to declare its own resolve_hosts variable instance although it does
not need/use hostname resolving functionality (currently only 'ip'
and 'ss' commands uses this).
The patch declares single common instance of resolve_hosts directly
in utils.c so the existing ones can be removed (the same approach
that is used for timestamp_short).
In order to calculate the idleSlope parameter of CBS correctly, users
must take into account the entire packet size, including the overhead
from all layers.
Add some more details to the man page to clarify that, giving one
simple example and pointing users to the correct 802.1Q section for
further clarifications if needed.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
David Ahern [Thu, 9 Nov 2017 00:46:50 +0000 (09:46 +0900)]
libnetlink: Handle extack messages for non-error case
Kernel can now return non-fatal error messages in extack facility.
Update iproute2 to dump to use if present.
- rename nl_dump_ext_err to nl_dump_ext_ack
- rename errmsg to msg
- add call to nl_dump_ext_ack in rtnl_dump_done and __rtnl_talk for
non-error path
Signed-off-by: David Ahern <dsahern@gmail.com> Tested-by: Ido Schimmel <idosch@mellanox.com>
Thomas Egerer [Mon, 30 Oct 2017 18:11:46 +0000 (19:11 +0100)]
xfrm_{state, policy}: Allow to deleteall polices/states with marks
Using 'ip deleteall' with policies that have marks, fails unless you
eplicitely specify the mark values. This is very uncomfortable when
bulk-deleting policies and states. With this patch all relevant states
and policies are wiped by 'ip deleteall' regardless of their mark
values.
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Thomas Egerer [Mon, 30 Oct 2017 18:11:45 +0000 (19:11 +0100)]
xfrm_policy: Do not attempt to deleteall a socket policy
Socket polices are added to a socket using setsockopt(2). They cannot be
deleted by iproute2. The attempt to delete them causes an error
(EINVAL).
To avoid this unnecessary error message all socket policies are skipped
in xfrm_policy_keep.
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Thomas Egerer [Mon, 30 Oct 2017 18:11:44 +0000 (19:11 +0100)]
xfrm_policy: Add filter option for socket policies
Listing policies on systems with a lot of socket policies can be
confusing due to the number of returned polices. Even if socket polices
are not of interest, they cannot be filtered. This patch adds an option
to filter all socket policies from the output.
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Amritha Nambiar [Fri, 3 Nov 2017 08:54:01 +0000 (01:54 -0700)]
flower: Represent HW traffic classes as classid values
This patch was previously submitted as RFC. Submitting this as
non-RFC now that the classid reservation scheme for hardware
traffic classes and offloads to route packets to a hardware
traffic class are accepted in net-next.
HW traffic classes 0 through 15 are represented using the
reserved classid values :ffe0 - :ffef.
Example:
Match Dst IPv4,Dst Port and route to TC1:
# tc filter add dev eth0 protocol ip parent ffff:\
prio 1 flower dst_ip 192.168.1.1/32\
ip_proto udp dst_port 12000 skip_sw\
hw_tc 1
The Credit Based Shaper (CBS) queueing discipline allows bandwidth
reservation with sub-milisecond precision. It is defined by the
802.1Q-2014 specification (section 8.6.8.2 and Annex L).
The syntax is:
tc qdisc add dev DEV parent NODE cbs locredit <LOCREDIT>
hicredit <HICREDIT> sendslope <SENDSLOPE>
idleslope <IDLESLOPE>
(The order is not important)
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Amritha Nambiar [Wed, 1 Nov 2017 07:45:42 +0000 (00:45 -0700)]
tc/mqprio: Offload mode and shaper options in mqprio
This patch was previously submitted as RFC. Submitting this as
non-RFC now that the tc/mqprio changes are accepted in net-next.
Adds new mqprio options for 'mode' and 'shaper'. The mode
option can take values for offload modes such as 'dcb' (default),
'channel' with the 'hw' option set to 1. The new 'channel' mode
supports offloading TCs and other queue configurations. The
'shaper' option is to support HW shapers ('dcb' default) and
takes the value 'bw_rlimit' for bandwidth rate limiting. The
parameters to the bw_rlimit shaper are minimum and maximum
bandwidth rates. New HW shapers in future can be supported
through the shaper attribute.
Mahesh Bandewar [Mon, 30 Oct 2017 20:57:51 +0000 (13:57 -0700)]
ip/ipvlan: enhance ability to add mode flags to existing modes
IPvlan supported bridge-only functionality prior to commits a190d04db937 ('ipvlan: introduce 'private' attribute for all
existing modes.') and fe89aa6b250c ('ipvlan: implement VEPA mode').
These two commits allow to configure the VEPA and private modes now.
This patch adds those options in ip command.
e.g.
bash:~# ip link add link eth0 name ipvl0 type ipvlan mode l2 private
-or-
bash:~# ip link add link eth0 type ipvl0 type ipvlan mode l2 vepa
Also the output will reflect the mode and the mode-flag accordingly.
e.g.
bash:~# ip -details link show ipvl0
4: ipvl0@eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc ...
link/ether 00:1a:11:44:a5:3e brd ff:ff:ff:ff:ff:ff promiscuity 0
ipvlan mode l2 private addrgenmode eui64 numtxqueues 1 ...
Stefano Brivio [Tue, 31 Oct 2017 17:47:56 +0000 (18:47 +0100)]
ss: Fix width calculations when Netid or State columns are missing
If Netid or State columns are missing, we must not subtract one
for each of these two columns from the remaining screen width,
while distributing available space to columns. This one
character corresponding to one delimiting space has to be
subtracted only if the columns are actually printed.
Further, in the existing implementation, if the screen width is
an odd number, one additional character is added to the width of
one of the two columns.
But if both are not printed, this filling character needs to be
added somewhere else, in order to have the right spacing
allowing us to fill lines completely.
Address and port fields are printed in pairs (local and remote),
so we can't distribute the space to any of them, because it
would be doubled. Instead, print this additional space to the
right of the Send-Q column, to keep code changes to a minimum.
This is particularly visible with 'ss -f netlink -Z'. Before
this patch, with an 80 column terminal, we have:
$ ss -f netlink -Z|head -n3
Recv-Q Send-Q Local Address:Port Peer Address:Port
0 0 rtnl:evolution-calen/2049 * pro
c_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0 0 rtnl:clock-applet/1944 * pro
c_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
After this patch, in both cases, the output is:
$ ss -f netlink -Z|head -n3
Recv-Q Send-Q Local Address:Port Peer Address:Port
0 0 rtnl:evolution-calen/2049 *
proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0 0 rtnl:clock-applet/1944 *
proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Stefano Brivio [Tue, 31 Oct 2017 17:47:54 +0000 (18:47 +0100)]
ss: Remove useless width specifier in process context print
Both local address and service, and remote address and service
fields are already printed out in netlink_show_one() before we
start printing process context, by calling sock_addr_print()
twice.
At this point, sock_addr_print() has already forced the remote
service field to be 'serv_width' wide -- that is, 'serv_width'
width has already been consumed, before we print process
context.
Hence, it makes no sense to force the display width of process
context to be 'serv_width' wide again: previous prints have
filled up the line already. Remove the width specifier and
prefix with a space instead, to keep this consistent with fields
which are displayed after the first output line.
Roman Mashak [Tue, 31 Oct 2017 18:24:19 +0000 (14:24 -0400)]
ip netns: use strtol() instead of atoi()
Use strtol-based API to parse and validate integer input; atoi() does
not detect errors and may yield undefined behaviour if result can't be
represented.
v2: use get_unsigned() since network namespace is really an unsigned value.
Roopa Prabhu [Sat, 28 Oct 2017 05:13:49 +0000 (22:13 -0700)]
iplink: bridge: support bridge port vlan_tunnel attribute
This config maps to IFLA_BRPORT_VLAN_TUNNEL bridge port netlink
flag attribute. This flag enables vlan to tunnel mapping on a bridge
port. It is off by default.
set vlan_tunnel attribute on bridge port vxlan0:
$ip link set dev vxlan0 type bridge_slave vlan_tunnel on
$ip link set dev vxlan0 type bridge_slave vlan_tunnel off
or via bridge command
$bridge link set dev vxlan0 vlan_tunnel on
$bridge link set dev vxlan0 vlan_tunnel off
Hangbin Liu [Thu, 26 Oct 2017 01:41:47 +0000 (09:41 +0800)]
lib/libnetlink: update rtnl_talk to support malloc buff at run time
This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
iplink_get()"). After update, we will not need to double the buffer size
every time when VFs number increased.
With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
length parameter.
With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
answer to avoid overwrite data in nlh, because it may has more info after
nlh. also this will avoid nlh buffer not enough issue.
We need to free answer after using.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Phil Sutter <phil@nwl.cc>
Hangbin Liu [Thu, 26 Oct 2017 01:41:46 +0000 (09:41 +0800)]
lib/libnetlink: re malloc buff if size is not enough
With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size")
we doubled the buffer size to support more VFs. But the VFs number is
increasing all the time. Some customers even use more than 200 VFs now.
We could not double it everytime when the buffer is not enough. Let's just
not hard code the buffer size and malloc the correct number when running.
Introduce function rtnl_recvmsg() to always return a newly allocated buffer.
The caller need to free it after using.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Phil Sutter <phil@nwl.cc>
... if we exceed rate of 1kbps (burst of 90K), do an absolute jump of 2 actions
sudo $TC actions add action police rate 1kbit burst 90k conform-exceed jump 2 / pipe
... lets add a couple of marks so we can use them to mark exceed/not exceed
sudo $TC actions add action skbedit mark 11 ok index 11
sudo $TC actions add action skbedit mark 12 ok index 12
... if we dont exceed our rate we get a mark of 11, else mark of 12
sudo $TC filter add dev $ETH parent ffff: protocol ip prio 8 u32 \
match ip dst 127.0.0.8/32 flowid 1:10 \
action police index 4 \
action skbedit index 11 \
action skbedit index 12
Ok, lets keep this thing a little busy..
sudo ping -f -c 10000 127.0.0.8
... now lets see the filters..
sudo $TC -s filter ls dev $ETH parent ffff: protocol ip
filter pref 8 u32 chain 0
filter pref 8 u32 chain 0 fh 800: ht divisor 1
filter pref 8 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10 not_in_hw (rule hit 20000 success 10000)
match 7f000008/ffffffff at 16 (success 10000 )
action order 1: police 0x4 rate 1Kbit burst 23440b mtu 2Kb action jump 2/pipe overhead 0b
ref 2 bind 1 installed 198 sec used 2 sec
Action statistics:
Sent 840000 bytes 10000 pkt (dropped 0, overlimits 9721 requeues 0)
backlog 0b 0p requeues 0
action order 2: skbedit mark 11 pass
index 11 ref 2 bind 1 installed 127 sec used 2 sec
Action statistics:
Sent 23436 bytes 279 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 3: skbedit mark 12 pass
index 12 ref 2 bind 1 installed 127 sec used 2 sec
Action statistics:
Sent 816564 bytes 9721 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
As can be seen 97.21% of the packets were marked as exceeding the allocated
rate; you could do something clever with the skb mark after this.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Michal Kubecek [Thu, 19 Oct 2017 08:21:08 +0000 (10:21 +0200)]
ip maddr: fix filtering by device
Commit 530903dd9003 ("ip: fix igmp parsing when iface is long") uses
variable len to keep trailing colon from interface name comparison. This
variable is local to loop body but we set it in one pass and use it in
following one(s) so that we are actually using (pseudo)random length for
comparison. This became apparent since commit b48a1161f5f9 ("ipmaddr: Avoid
accessing uninitialized data") always initializes len to zero so that the
name comparison is always true. As a result, "ip maddr show dev eth0" shows
IPv4 multicast addresses for all interfaces.
Instead of keeping the length, let's simply replace the trailing colon with
a null byte. The bonus is that we get correct interface name in ma.name.
Fixes: 530903dd9003 ("ip: fix igmp parsing when iface is long") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Phil Sutter <phil@nwl.cc> Acked-by: Petr Vorel <pvorel@suse.cz>
Phil Sutter [Wed, 18 Oct 2017 17:58:13 +0000 (19:58 +0200)]
ss: Distinguish between IPv4 and IPv6 wildcard sockets
Commit aba9c23a6e1cb ("ss: enclose IPv6 address in brackets") unified
display of wildcard sockets in IPv4 and IPv6 to print the unspecified
address as '*'. Users then complained that they can't distinguish
between address families anymore, so change this again to what Stephen
Hemminger suggested:
| *:80 << both IPV6 and IPV4
| [::]:80 << IPV6_ONLY
| 0.0.0.0:80 << IPV4_ONLY
Note that on older kernels which don't support INET_DIAG_SKV6ONLY
attribute, pure IPv6 sockets will still show as '*'.
Cc: Humberto Alves <hjalves@live.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Phil Sutter <phil@nwl.cc>
ip: bridge_slave: add support for per-port group_fwd_mask
This patch adds the iproute2 support for getting and setting the
per-port group_fwd_mask. It also tries to resolve the value into a more
human friendly format by printing the known protocols instead of only
the raw value.
The man page is also updated with the new option.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Petr Vorel [Fri, 13 Oct 2017 13:57:17 +0000 (15:57 +0200)]
color: Fix another ip segfault when using --color switch
Commit 959f1428 ("color: add new COLOR_NONE and disable_color function")
introducing color enum COLOR_NONE, which is not only duplicite of
COLOR_CLEAR, but also caused segfault, when running ip with --color
switch, as 'attr + 8' in color_fprintf() access array item out of
bounds. Thus removing it and restoring "magic" offset + 7.
Petr Vorel [Fri, 13 Oct 2017 13:57:16 +0000 (15:57 +0200)]
color: Fix ip segfault when using --color switch
Commit d0e72011 ("ip: ipaddress.c: add support for json output")
introduced passing -1 as enum color_attr. This is not only wrong as no
color_attr has value -1, but also causes another segfault in color_fprintf()
on this setup as there is no item with index -1 in array of enum attr_colors[].
Using COLOR_CLEAR is valid option.
Reproduce with:
$ COLORFGBG='0;15' ip -c a
NOTE: COLORFGBG is environmental variable used for defining whether user
has light or dark background.
COLORFGBG="0;15" is used to ask for color set suitable for light background,
COLORFGBG="15;0" is used to ask for color set suitable for dark background.