David Ahern [Fri, 15 Jul 2016 22:41:35 +0000 (15:41 -0700)]
ss: Fix support for device filter by index
Support was recently added for device filters. The intent was to allow
the device to be specified by name or index, and using the if%u format
(dev == if5) or the simpler and more intuitive index alone (dev == 5).
The latter case is broken since the index is not saved to the filter
after the strtoul conversion. Further, the tmp variable used for the
conversion shadows another variable used in the function. Fix both.
With this change all 3 variants work as expected:
$ ss -t 'dev == 62'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 224 10.0.1.3%mgmt:ssh 192.168.0.50:58442
$ ss -t 'dev == mgmt'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 224 10.0.1.3%mgmt:ssh 192.168.0.50:58442
$ ss -t 'dev == if62'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 36 10.0.1.3%mgmt:ssh 192.168.0.50:58442
Fixes: 2d2932125616 ("ss: Add support to filter on device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Daniel Borkmann [Mon, 18 Jul 2016 23:09:52 +0000 (01:09 +0200)]
bpf: also check elf for official e_machine value
Use the official BPF ELF e_machine value that was assigned recently [1]
and will be propagated to glibc, libelf et al. LLVM will switch to it
in 3.9 release, therefore we need to prepare tc to check for EM_ELF as
well, older version still have the EM_NONE.
Xin Long [Tue, 12 Jul 2016 13:37:58 +0000 (21:37 +0800)]
ip route: restore route entries in correct order
Sometimes we cannot restore route entries, because in kernel
[1] fib_check_nh()
[2] fib_valid_prefsrc()
cause some routes to depend on existence of others while adding.
For example, we saved all the routes, and flushed all tables
[a] default via 192.168.122.1 dev eth0
[b] 192.168.122.0/24 dev eth0 src 192.168.122.21
[c] broadcast 127.0.0.0 dev lo table local src 127.0.0.1
[d] local 127.0.0.0/8 dev lo table local src 127.0.0.1
[e] local 127.0.0.1 dev lo table local src 127.0.0.1
[f] broadcast 127.255.255.255 dev lo table local src 127.0.0.1
[g] broadcast 192.168.122.0 dev eth0 table local src 192.168.122.21
[h] local 192.168.122.21 dev eth0 table local src 192.168.122.21
[i] broadcast 192.168.122.255 dev eth0 table local src 192.168.122.21
Now start to restore them:
If we want to add [a], we have to add [b] first, as [1] and
'via 192.168.122.1' in [a].
If we want to add [b], we have to add [h] first, as [2] and
'src 192.168.122.21' in [b].
So the correct order to restore should be like:
[e][h] -> [b][c][d][f][g][i] -> [a]
This patch fixes it by traversing the file 3 times, it only restores
part of them in each run according to the following conditions, to
make sure every entry can be restored successfully.
1. !gw && (!fib_prefsrc || fib_prefsrc == cfg->fc_dst)
2. !gw && (fib_prefsrc != cfg->fc_dst)
3. gw
Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Phil Sutter <phil@nwl.cc>
Eli Cohen [Thu, 7 Jul 2016 21:09:03 +0000 (16:09 -0500)]
Add support for configuring Infiniband GUIDs
Add two NLA's that allow configuration of Infiniband node or port GUIDs
by referencing the IPoIB net device set over the physical function. The
format to be used is as follows:
ip link set dev ib0 vf 0 node_guid 00:02:c9:03:00:21:6e:70
ip link set dev ib0 vf 0 port_guid 00:02:c9:03:00:21:6e:78
David Ahern [Wed, 29 Jun 2016 18:27:02 +0000 (11:27 -0700)]
ip route: Add support for vrf keyword
Add vrf keyword to 'ip route' commands. Allows:
1. Users can list routes by VRF name:
$ ip route show vrf NAME
VRF tables have all routes including local and broadcast routes.
The VRF keyword filters LOCAL and BROADCAST routes; to see all
routes the table option can be used. Or to see local routes only
for a VRF:
$ ip route show vrf NAME type local
2. Add or delete a route for a VRF:
$ ip route {add|delete} vrf NAME <route spec>
3. Do a route lookup for a VRF:
$ ip route get vrf NAME ADDRESS
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Amir Vadai [Mon, 4 Jul 2016 07:34:11 +0000 (10:34 +0300)]
tc: flower: Add skip_{hw|sw} support
On devices that support TC flower offloads, these flags enable a filter to be
added only to HW or only to SW. skip_sw and skip_hw are mutually exclusive
flags. By default without any flags, the filter is added to both HW and SW,
but no error checks are done in case of failure to add to HW.
With skip-sw, failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip_sw and the other
with skip_hw flag.
# add ingress qdisc
tc qdisc add dev enp0s9 ingress
# enable hw tc offload.
ethtool -K enp0s9 hw-tc-offload on
# add a flower filter with skip-sw flag.
tc filter add dev enp0s9 protocol ip parent ffff: flower \
ip_proto 1 indev enp0s9 skip_sw \
action drop
# add a flower filter with skip-hw flag.
tc filter add dev enp0s9 protocol ip parent ffff: flower \
ip_proto 3 indev enp0s9 skip_hw \
action drop
Signed-off-by: Amir Vadai <amirva@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com>
Phil Sutter [Thu, 30 Jun 2016 14:47:02 +0000 (16:47 +0200)]
ip-address: constify match_link_kind arg
Since the function won't ever change the data 'kind' is pointing at, it
can sanely be made const.
Fixes: e0513807f6dbb ("ip-address: Support filtering by slave type, too") Suggested-by: Stephen Hemminger <shemming@brocade.com> Signed-off-by: Phil Sutter <phil@nwl.cc>
Andrew Vagin [Tue, 28 Jun 2016 23:27:14 +0000 (02:27 +0300)]
ip route: timeout for routes has to be set in seconds
Currently a timeout is multiplied by HZ in user-space and
then it multiplied by HZ in kernel-space.
$ ./ip/ip r add 2002::0/64 dev veth1 expires 10
$ ./ip/ip -6 r
2002::/64 dev veth1 metric 1024 linkdown expires 996sec pref medium
Cc: Xin Long <lucien.xin@gmail.com> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Stephen Hemminger <shemming@brocade.com> Fixes: 68eede250500 ("route: allow routes to be configured with expire values") Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Phil Sutter [Tue, 28 Jun 2016 13:07:16 +0000 (15:07 +0200)]
ip-address: Support filtering by slave type, too
This patch allows to query all interfaces enslaved to a bridge or bond
using the following syntax:
| ip addr show type bridge_slave
Filtering has to be done in userspace since the kernel does not support
filtering on IFLA_INFO_SLAVE_KIND.
Functionality introduced in this patch is not fully complete since it
does not allow to match on type and slave type at the same time, but it
doesn't prevent implementing a dedicated slave_type match, either.
David Ahern [Mon, 27 Jun 2016 18:34:24 +0000 (11:34 -0700)]
ss: Allow ssfilter_bytecompile to return 0
Allow ssfilter_bytecompile to return 0 for filter ops the kernel
does not support. If such an op is in the filter string then all
filtering is done in userspace.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
David Ahern [Mon, 27 Jun 2016 18:34:23 +0000 (11:34 -0700)]
ss: Refactor inet_show_sock
Extract parsing of sockstat and filter from inet_show_sock.
While moving run_ssfilter into callers of inet_show_sock enable
userspace filtering before the kill.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Jakub Sitnicki [Wed, 22 Jun 2016 11:34:13 +0000 (13:34 +0200)]
ip/tcp_metrics: Simplify process_msg a bit
On Tue, Jun 21, 2016 at 06:18 PM CEST, Phil Sutter <phil@nwl.cc> wrote:
> By combining the attribute extraction and check for existence, the
> additional indentation level in the 'else' clause can be avoided.
>
> In addition to that, common actions for 'daddr' are combined since the
> function returns if neither of the branches are taken.
>
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---
> ip/tcp_metrics.c | 45 ++++++++++++++++++---------------------------
> 1 file changed, 18 insertions(+), 27 deletions(-)
>
> diff --git a/ip/tcp_metrics.c b/ip/tcp_metrics.c
> index f82604f458ada..899830c127bcb 100644
> --- a/ip/tcp_metrics.c
> +++ b/ip/tcp_metrics.c
> @@ -112,47 +112,38 @@ static int process_msg(const struct sockaddr_nl *who, struct nlmsghdr *n,
> parse_rtattr(attrs, TCP_METRICS_ATTR_MAX, (void *) ghdr + GENL_HDRLEN,
> len);
>
> - a = attrs[TCP_METRICS_ATTR_ADDR_IPV4];
> - if (a) {
> + if ((a = attrs[TCP_METRICS_ATTR_ADDR_IPV4])) {
Copy the pointer inside the branch?
Same gain on indentation while keeping checkpatch happy.
Phil Sutter [Thu, 16 Jun 2016 14:19:40 +0000 (16:19 +0200)]
iplink: Check address length via netlink
This is a feature which was lost during the conversion to netlink
interface: If the device exists and a user tries to change the link
layer address, query the kernel for the old address first and reject the
new one if sizes differ.
This patch adds the same check when setting VF address by assuming same
length as PF device.
Note that at least for VFs the check can't be done in kernel space since
struct ifla_vf_mac lacks a length field and due to netlink padding the
exact size can't be communicated to the kernel.
Martin KaFai Lau [Sat, 18 Jun 2016 00:38:53 +0000 (17:38 -0700)]
ss: Add tcp_info fields data_segs_in/out
tcp_info fields, data_segs_in and data_segs_out, have been added to the
kernel in commit a44d6eacdaf5 ("tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In")
since kernel 4.6.
This patch renames "iptables_target" to "xtables_target" and some other
things which gets renamed and I noticed while reading iptables git log.
Functions which are not used in m_ipt.c and not exported by the header
are removed, if they still used in m_ipt.c I added a static to the function.
Reported-by: Clemens Gruber <clemens.gruber@pqgruber.com> Signed-off-by: Alexander Aring <aar@pengutronix.de>
Phil Sutter [Fri, 10 Jun 2016 11:42:02 +0000 (13:42 +0200)]
tc: m_xt: Fix indenting
By exiting early if xtables_find_target() fails, one indenting level can
be dropped. Some of the wrongly indented code then happens to sit at the
right spot by accident which is why this patch is smaller than expected.
Phil Sutter [Fri, 10 Jun 2016 11:42:01 +0000 (13:42 +0200)]
tc: m_xt: Fix segfault when adding multiple actions at once
Without this, the following call to tc would segfault:
| tc filter add dev d0 parent ffff: u32 match u32 0 0 \
| action xt -j MARK --set-mark 0x1 \
| action xt -j MARK --set-mark 0x1
The reason is basically the same as for 6e2e5ec28bad4 ("fix print_ipt:
segfault if more then one filter with action -j MARK.") but in
parse_ipt() instead of print_ipt().
strtoul() only modifies errno on overflow, so if errno is not zero
before calling the function its value is preserved and makes the
function fail for valid inputs; initialize it.
tc: f_u32: Add support for skip_hw and skip_sw flags
On devices that support TC U32 offloads, these flags enable a filter to be
added only to HW or only to SW. skip_sw and skip_hw are mutually exclusive
flags. By default without any flags, the filter is added to both HW and SW,
but no error checks are done in case of failure to add to HW.
With skip-sw, failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip_sw and the other
with skip_hw flag.
# add ingress qdisc
tc qdisc add dev p4p1 ingress
# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on
# add u32 filter with skip-sw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:1 u32 ht 800: flowid 800:1 \
skip-sw \
match ip src 192.168.1.0/24 \
action drop
# add u32 filter with skip-hw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:2 u32 ht 800: flowid 800:2 \
skip-hw \
match ip src 192.168.2.0/24 \
action drop
Phil Sutter [Mon, 30 May 2016 18:46:27 +0000 (20:46 +0200)]
man: ip, ip-link: Fix ip option location
This patch drops the redundant description of some of ip's options in
ip-link.8's description of the 'show' subcommand, preserving the
description of -iec (but appending it to the list in ip.8 with minor
fixes).
David Ahern [Tue, 24 May 2016 22:04:49 +0000 (15:04 -0700)]
Make builds default to quiet mode
Similar to the Linux kernel and perf add infrastructure to reduce the
amount of output tossed to a user during a build. Full build output
can be obtained with 'make V=1'
Jamal Hadi Salim [Tue, 24 May 2016 11:52:48 +0000 (07:52 -0400)]
tc simple action: bug fix
Failed compile
m_simple.c: In function ‘parse_simple’:
m_simple.c:154:6: warning: too many arguments for format [-Wformat-extra-args]
*argv);
^
m_simple.c:103:14: warning: unused variable ‘maybe_bind’ [-Wunused-variable]
Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Peter Heise [Mon, 30 May 2016 13:32:07 +0000 (15:32 +0200)]
Added support for selection of new HSR version
A new HSR version was added in 4.7 that can be enabled
via iproute2. Per default the old version is selected,
however, with "ip link add [..] type hsr [..] version 1"
the newer version can be enabled.
Signed-off-by: Peter Heise <peter.heise@airbus.com>
Daniel Borkmann [Wed, 18 May 2016 09:58:41 +0000 (11:58 +0200)]
f_bpf: fix filling of handle when no further arg is provided
We need to fill handle when provided by the user, even if no further
argument is provided. Thus, move the test for arg to the correct location,
so that it works correctly:
# tc filter show dev foo egress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 bpf.o:[classifier] direct-action
filter protocol all pref 1 bpf handle 0x2 bpf.o:[classifier] direct-action
# tc filter del dev foo egress prio 1 handle 2 bpf
# tc filter show dev foo egress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 bpf.o:[classifier] direct-action
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Wed, 11 May 2016 13:51:58 +0000 (06:51 -0700)]
ip link: Add support for kernel side filtering
Kernel gained support for filtering link dumps with commit dc599f76c22b
("net: Add support for filtering link dump by master device and kind").
Add support to ip link command. If a user passes master device or
kind to ip link command they are added to the link dump request message.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Jiri Pirko [Sat, 14 May 2016 13:21:02 +0000 (15:21 +0200)]
devlink: implement shared buffer occupancy control
Use kernel shared buffer occupancy control commands to make snapshot and
clear occupancy watermarks. Also, allow to show occupancy values in a
nice way.
Jiri Pirko [Sat, 14 May 2016 13:21:01 +0000 (15:21 +0200)]
devlink: implement shared buffer support
Implement kernel devlink shared buffer interface. Introduce new object
"sb" and allow to browse the shared buffer parameters and also change
configuration.
Daniel Borkmann [Sun, 15 May 2016 16:36:03 +0000 (18:36 +0200)]
ingress, clsact: don't add TCA_OPTIONS to nl msg
In ingress and clsact qdisc TCA_OPTIONS are ignored, since it's
parameterless. In tc, we add an empty addattr_l(... TCA_OPTIONS,
NULL, 0) to the netlink message nevertheless. This has the
side effect that when someone tries a 'tc qdisc replace' and
already an existing such qdisc is present, tc fails with
EINVAL here.
Reason is that in the kernel, this invokes qdisc_change() when
such requested qdisc is already present. When TCA_OPTIONS are
passed to modify parameters, it looks whether qdisc implements
.change() callback, and if not present (like in both cases here)
it returns with error. Rather than adding an empty stub to the
kernel that ignores TCA_OPTIONS again, just don't add TCA_OPTIONS
to the netlink message in the first place.
Before:
# tc qdisc replace dev foo clsact # first try
# tc qdisc replace dev foo clsact # second one
RTNETLINK answers: Invalid argument
After:
# tc qdisc replace dev foo clsact
# tc qdisc replace dev foo clsact
# tc qdisc replace dev foo clsact
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Brings it closer to more serious actions (adding branching
and allowing for late binding)
Unfortunately this breaks old syntax of the simple action.
But because simple is a pedagogical example unlikely to be used
in production environments (i.e its role is to serve as an example
on how to write actions), then this is ok.
New syntax for simple has new keyword "sdata". Example usage is:
sudo tc actions add action simple sdata "foobar" index 1
or
tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action simple sdata "foobar"
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
This action allows for a sending side to encapsulate arbitrary metadata
which is decapsulated by the receiving end.
The sender runs in encoding mode and the receiver in decode mode.
Both sender and receiver must specify the same ethertype.
At some point we hope to have a registered ethertype and we'll
then provide a default so the user doesnt have to specify it.
For now we enforce the user specify it.
Described in netdev01 paper:
"Distributing Linux Traffic Control Classifier-Action Subsystem"
Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
Also refer to IETF draft-ietf-forces-interfelfb-04.txt
Lets show example usage where we encode icmp from a sender towards
a receiver with an skbmark of 17; both sender and receiver use
ethertype of 0xdead to interop.
YYYY: Lets start with Receiver-side policy config:
xxx: add an ingress qdisc
sudo tc qdisc add dev $ETH ingress
xxx: any packets with ethertype 0xdead will be subjected to ife decoding
xxx: we then restart the classification so we can match on icmp at prio 3
sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify
xxx: on restarting the classification from above if it was an icmp
xxx: packet, then match it here and continue to the next rule at prio 4
xxx: which will match based on skb mark of 17
sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue
xxx: match on skbmark of 0x11 (decimal 17) and accept
sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
handle 0x11 fw flowid 1:1 \
action ok
xxx: Lets show the decoding policy
sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
xxx:
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0)
match 00000000/00000000 at 0 (success 0 )
action order 1: ife decode action reclassify type 0x0
allow mark allow prio
index 11 ref 1 bind 1 installed 45 sec used 45 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Observe that above lists all metadatum it can decode. Typically these
submodules will already be compiled into a monolithic kernel or
loaded as modules
YYYY: Lets show the sender side now ..
xxx: Add an egress qdisc on the sender netdev
sudo tc qdisc add dev $ETH root handle 1: prio
xxx:
xxx: Match all icmp packets to 192.168.122.237/24, then
xxx: tag the packet with skb mark of decimal 17, then
xxx: Encode it with:
xxx: ethertype 0xdead
xxx: add skb->mark to whitelist of metadatum to send
xxx: rewrite target dst MAC address to 02:15:15:15:15:15
xxx:
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \
match ip dst 192.168.122.237/24 \
match ip protocol 1 0xff \
flowid 1:2 \
action skbedit mark 17 \
action ife encode \
type 0xDEAD \
allow mark \
dst 02:15:15:15:15:15
xxx: Lets show the encoding policy
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 118 success 0)
match c0a87a00/ffffff00 at 16 (success 0 )
match 00010000/00ff0000 at 8 (success 0 )
action order 1: skbedit mark 17
index 11 ref 1 bind 1 installed 3 sec used 3 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: ife encode action pipe type 0xDEAD
allow mark dst 02:15:15:15:15:15
index 12 ref 1 bind 1 installed 3 sec used 3 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Now test by sending ping from sender to destination
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>