William Tu [Wed, 13 Dec 2017 02:22:52 +0000 (18:22 -0800)]
gre6: add collect metadata support
The patch adds 'external' option to support collect metadata
gre6 tunnel. The 'external' keyword is already used to set the
device into collect metadata mode such as vxlan, geneve, ipip,
etc. This patch extends support for ipv6 gre and gretap.
Example of L3 and L2 gre device:
bash:~# ip link add dev ip6gre123 type ip6gre external
bash:~# ip link add dev ip6gretap123 type ip6gretap external
Signed-off-by: William Tu <u9012063@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net>
Chris Mi [Thu, 14 Dec 2017 09:09:00 +0000 (18:09 +0900)]
tc: fix command "tc actions del" hang issue
If command is RTM_DELACTION, a non-NULL pointer is passed to rtnl_talk().
Then flag NLM_F_ACK is not set on n->nlmsg_flags and netlink_ack() will
not be called. Command tc will wait for the reply for ever.
Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc buff at run time") Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Chris Mi <chrism@mellanox.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Stefano Brivio [Tue, 12 Dec 2017 00:46:33 +0000 (01:46 +0100)]
ss: Implement automatic column width calculation
Group fitting fields into lines and space them equally using the
remaining screen width for each line. If columns don't fit on
one line, break them into the least possible amount of lines and
keep them aligned across lines.
This is done by:
- recording the length of the longest item in each column during
formatting and buffering (which was added in the previous patch)
- fitting as many fields as possible on each line of output
- distributing the remaining padding space equally between the
columns
Stefano Brivio [Tue, 12 Dec 2017 00:46:32 +0000 (01:46 +0100)]
ss: Buffer raw fields first, then render them as a table
This allows us to measure the maximum field length for each
column before printing fields and will permit us to apply
optimal field spacing and distribution. Structure of the output
buffer with chunked allocation is described in comments.
Output is still unchanged, original spacing is used.
Running over one million sockets with -tul options by simply
modifying main() to loop 50,000 times over the *_show()
functions, buffering the whole output and rendering it at the
end, with 10 UDP sockets, 10 TCP sockets, while throwing
output away, doesn't show significant changes in execution time
on my laptop with an Intel i7-6600U CPU:
- before this patch:
$ time ./ss -tul > /dev/null
real 0m29.899s
user 0m2.017s
sys 0m27.801s
- after this patch:
$ time ./ss -tul > /dev/null
real 0m29.827s
user 0m1.942s
sys 0m27.812s
Stefano Brivio [Tue, 12 Dec 2017 00:46:31 +0000 (01:46 +0100)]
ss: Introduce columns lightweight abstraction
Instead of embedding spacing directly while printing contents,
logically declare columns and functions to buffer their content,
to print left and right spacing around fields, to flush them to
screen, and to print headers.
This makes it a bit easier to handle layout changes and prepares
for full output buffering, needed for optimal spacing in field
output layout.
Columns are currently set up to retain exactly the same output
as before. This needs some slight adjustments of the values
previously calculated in main(), as the width value introduced
here already includes the width of left delimiters and spacing
is not explicitly printed anymore whenever a field is printed.
These calculations will go away altogether once automatic width
calculation is implemented.
We can also remove explicit printing of newlines after the final
content for a given line is printed, flushing the last field on
a line will cause field_flush() to print newlines where
appropriate.
Stefano Brivio [Tue, 12 Dec 2017 00:46:30 +0000 (01:46 +0100)]
ss: Replace printf() calls for "main" output by calls to helper
This is preparation work for output buffering, which will allow
us to use optimal spacing and alignment of logical "columns".
The new out() function is just a re-implementation of a typical
libc's printf(), except that the return value of vfprintf() is
ignored as no callers use it. This implementation will be
replaced in the next patches to provide column width adjustment
and adequate spacing.
All printf() calls that output parts of the socket list are now
replaced by calls to out(). Output of summary and version is
excluded from this.
No functional differences here, output not affected.
This allows sending GSO maximum values when configuring a device.
The values are advisory. Most devices will ignore them but for some
pseudo devices such as veth pairs they can be set.
Example:
# ip link add dev vm1 type veth peer name vm2 gso_max_size 32768
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Phil Sutter [Wed, 29 Nov 2017 17:34:09 +0000 (18:34 +0100)]
man: tc-csum.8: Fix inconsistency in example description
Commit 6bbe5e6290db5 ("man: tc-csum.8: Fix example") changed both source
and destination IP addresses in example code but missed to update the
example's description accordingly.
Fixes: 6bbe5e6290db5 ("man: tc-csum.8: Fix example") Signed-off-by: Phil Sutter <phil@nwl.cc>
Robert Shearman [Tue, 28 Nov 2017 11:16:50 +0000 (11:16 +0000)]
vxlan: Make id optional when modifying a link
Specifying the IFLA_VXLAN_LINK attribute on a vxlan link modify is
optional in the kernel, so make the id argument optional for "ip link
set ..." to avoid a user needing to specify it when changing another
attribute.
Robert Shearman [Tue, 28 Nov 2017 11:16:21 +0000 (11:16 +0000)]
gre: Fix ttl inherit option
Specifying "... ttl inherit" currently does nothing on a GRE link
modify since the previous ttl value is retrieved up front. Fix this by
explicitly setting ttl to 0 when "inherit" is specified for the
option, since 0 represents the semantics of inherit.
Jiri Pirko [Sat, 25 Nov 2017 10:07:57 +0000 (11:07 +0100)]
tc: remove action cookie len from printout
Make the output same as input and avoid printout of unnecessary len.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Fixes: fd8b3d2c1b9b ("actions: Add support for user cookies") Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Jakub Kicinski [Fri, 24 Nov 2017 02:12:07 +0000 (18:12 -0800)]
f_bpf: communicate ifindex for eBPF offload
Split parsing and loading of the eBPF program and if skip_sw is set
load the program for ifindex, to which the qdisc is attached.
Note that the ifindex will be ignored for programs which are already
loaded (e.g. when using pinned programs), but in that case we just
trust the user knows what he's doing. Hopefully we will get extack
soon in the driver to help debugging this case.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:12:06 +0000 (18:12 -0800)]
tc_filter: resolve device name before parsing filter
Move resolving device name into an ifindex before calling filter
specific callbacks. This way if filters need the ifindex, they
can read it from the request.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Both BPF filter and action will allow users to specify run
multiple times, and only the last one will be considered by
the kernel. Explicitly refuse such command lines.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:12:01 +0000 (18:12 -0800)]
bpf: split parse from program loading
Parsing command line is currently done together with potentially
loading a new eBPF program. This makes it more difficult to
provide additional parameters for loading (which may come after
the eBPF program info on the command line).
Split the two (only internally for now). Verbose parameter
has to be saved in struct bpf_cfg_in to be carried between
the stages.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:12:00 +0000 (18:12 -0800)]
bpf: allocate opcode table in struct bpf_cfg_in
struct bpf_cfg_in already carries a pointer to sock_filter ops.
It's currently set to a local variable in bpf_parse_opt_tbl(),
shared between parsing and loading stages. Move the array
entirely to struct bpf_cfg_in, this will allow us to split
parsing and loading.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:11:59 +0000 (18:11 -0800)]
bpf: keep parsed program mode in struct bpf_cfg_in
bpf_parse() will parse command line arguments to find out the
program mode. This mode will later be needed at loading time.
Instead of keeping it locally add it to struct bpf_cfg_in,
this will allow splitting parsing and loading stages.
enum bpf_mode has to be moved to the header file, because C
doesn't allow forward declaration of enums.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Fri, 24 Nov 2017 02:11:58 +0000 (18:11 -0800)]
bpf: pass program type in struct bpf_cfg_in
Program type is needed both for parsing and loading of
the program. Parsing may also induce the type based on
signatures from __bpf_prog_meta. Instead of passing
the type around keep it in struct bpf_cfg_in.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adapts the tc command line interface to allow bandwidth limits
to be specified as a percentage of the interface's capacity.
Adding this functionality requires passing the specified device string to
each class/qdisc which changes the prototype for a couple of functions: the
.parse_qopt and .parse_copt interfaces. The device string is a required
parameter for tc-qdisc and tc-class, and when not specified, the kernel
returns ENODEV. In this patch, if the user tries to specify a bandwidth
percentage without naming the device, we return an error from userspace.
Tom Herbert [Wed, 22 Nov 2017 20:05:35 +0000 (12:05 -0800)]
ila: support to configure checksum neutral-map-auto
Configuration support in both ip ila and ip LWT for checksum
neutral-map-auto. This is a mode of ILA where checksum
neutral mapping is assumed for packets (there is no C-bit
in the identifier to indicate checksum neutral).
Jakub Kicinski [Thu, 23 Nov 2017 01:00:53 +0000 (17:00 -0800)]
bpf: initialize the verifier log
If program loading fails before verifier prints its first
message, the verifier log will not be initialized. Always
set the first character of the log buffer to zero to make
sure we don't dump non-printable characters to the terminal.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Lorenzo Colitti [Mon, 20 Nov 2017 03:57:07 +0000 (12:57 +0900)]
iproute2: fixes to compile on some systems.
1. Put the declarations of strlcpy and strlcat inside
an #ifdef NEED_STRLCPY. Their declarations were already in a
similar #ifdef.
2. In bpf_scm.h, include sys/un.h for struct sockaddr_un.
3. In utils.h, include time.h for struct timeval.
Tested: builds on ubuntu 14.04 with "make clean distclean; ./configure && make -j64"
Tested: 4.14.1 builds on Android with Android-specific #ifndefs for missing library code Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Phil Sutter [Wed, 15 Nov 2017 14:01:31 +0000 (15:01 +0100)]
tc_util: Silence spurious compiler warning
GCC version 7.2.1 complains that 'result1' may be used uninitialized in
parse_action_control_slash_spaces(). This should not be possible in
practice, so the actual value 'result1' is initialized with does not
matter.
Phil Sutter [Wed, 15 Nov 2017 14:01:30 +0000 (15:01 +0100)]
tc_util: Drop needless pointer check
The function parse_action_control_slash() returns early if 'p' is NULL,
so after the first call to action_a2n(), 'p' is guaranteed not to be
NULL. Otherwise, the assignment '*p = 0' above would dereference the
NULL pointer already anyway, so just drop this check here.
Jon Maloy [Wed, 15 Nov 2017 16:25:44 +0000 (17:25 +0100)]
tipc: change family attribute from u32 to u16
commit 28033ae4e0f ("net: netlink: Update attr validation to require
exact length for some types") introduces a stricter control on attributes
of type NLA_U* and NLA_S*.
Since the tipc tool is sending a family attribute of u32 instead of as
expected u16 the tool is now effectively broken.
We fix this by changing the type of the said attribute.
Leon Romanovsky [Mon, 13 Nov 2017 10:21:19 +0000 (12:21 +0200)]
ip: Fix compilation break on old systems
As was reported [1], the iproute2 fails to compile on old systems,
in Cong's case, it was Fedora 19, in our case it was RedHat 7.2, which
failed with the following errors during compilation:
ipxfrm.c: In function ‘xfrm_selector_print’:
ipxfrm.c:479:7: error: ‘IPPROTO_MH’ undeclared (first use in this
function)
case IPPROTO_MH:
^
ipxfrm.c:479:7: note: each undeclared identifier is reported only once
for each function it appears in
ipxfrm.c: In function ‘xfrm_selector_upspec_parse’:
ipxfrm.c:1345:8: error: ‘IPPROTO_MH’ undeclared (first use in this
function)
case IPPROTO_MH:
^ make[1]: *** [ipxfrm.o] Error 1
The reason to it is the order of headers files. The IPPROTO_MH field is
set in kernel's UAPI header file (in6.h), but only in case
__UAPI_DEF_IPPROTO_V6 is set before. That define comes from other kernel's
header file (libc-compat.h) and is set in case there are no previous
libc relevant declarations.
In ip code, the include of <netdb.h> causes to indirect inclusion of
<netinet/in.h> and it sets __UAPI_DEF_IPPROTO_V6 to be zero and prevents from
IPPROTO_MH declaration.
This patch takes the simplest possible approach to fix the compilation
error by checking if IPPROTO_MH was defined before and in case it
wasn't, it defines it to be the same as in the kernel.
Ivan Vecera [Fri, 10 Nov 2017 06:20:13 +0000 (07:20 +0100)]
lib: make resolve_hosts variable common
Any iproute utility that uses any function from lib/utils.c needs
to declare its own resolve_hosts variable instance although it does
not need/use hostname resolving functionality (currently only 'ip'
and 'ss' commands uses this).
The patch declares single common instance of resolve_hosts directly
in utils.c so the existing ones can be removed (the same approach
that is used for timestamp_short).
In order to calculate the idleSlope parameter of CBS correctly, users
must take into account the entire packet size, including the overhead
from all layers.
Add some more details to the man page to clarify that, giving one
simple example and pointing users to the correct 802.1Q section for
further clarifications if needed.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
David Ahern [Thu, 9 Nov 2017 00:46:50 +0000 (09:46 +0900)]
libnetlink: Handle extack messages for non-error case
Kernel can now return non-fatal error messages in extack facility.
Update iproute2 to dump to use if present.
- rename nl_dump_ext_err to nl_dump_ext_ack
- rename errmsg to msg
- add call to nl_dump_ext_ack in rtnl_dump_done and __rtnl_talk for
non-error path
Signed-off-by: David Ahern <dsahern@gmail.com> Tested-by: Ido Schimmel <idosch@mellanox.com>