David Ahern [Mon, 15 Jan 2018 16:25:30 +0000 (08:25 -0800)]
Merge branch 'tc-batch' into net-next
Chris Mi says:
====================
Currently in tc batch mode, only one command is read from the batch
file and sent to kernel to process. With this patchset, at most 128
commands can be accumulated before sending to kernel.
We introduced a new function in patch 1 to support sending
multiple messages. In patch 2, we add this support for filter
add/delete/change/replace and actions add/change/replace commands.
But please note that kernel still processes the requests one by one.
To process the requests in parallel in kernel is another effort.
The time we're saving in this patchset is the user mode and kernel mode
context switch. So this patchset works on top of the current kernel.
Using the following script in kernel, we can generate 1,000,000 rules.
tools/testing/selftests/tc-testing/tdc_batch.py
Without this patchset, 'tc -b $file' execution time is:
real 0m15.555s
user 0m7.211s
sys 0m8.284s
With this patchset, 'tc -b $file' execution time is:
real 0m12.360s
user 0m6.082s
sys 0m6.213s
The insertion rate is improved by more than 10%.
====================
Chris Mi [Fri, 12 Jan 2018 05:13:16 +0000 (14:13 +0900)]
tc: Add batchsize feature for filter and actions
Currently in tc batch mode, only one command is read from the batch
file and sent to kernel to process. With this support, at most 128
commands can be accumulated before sending to kernel.
Now it only works for the following successive commands:
1. filter add/delete/change/replace
2. actions add/change/replace
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
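The accumulation rule described in this commit can be sketched as follows. This is an illustration in Python, not tc's C code. The names BATCH_MAX, batch_key and group_batches are hypothetical; only the limit of 128 and the list of batchable commands come from the commit message. The sketch also assumes that "successive" means a run of the same command kind, so it flushes whenever the kind changes.

```python
BATCH_MAX = 128  # upper bound stated in the commit message

# Command kinds the commit message lists as batchable.
BATCHABLE = {
    ("filter", "add"), ("filter", "delete"),
    ("filter", "change"), ("filter", "replace"),
    ("actions", "add"), ("actions", "change"), ("actions", "replace"),
}


def batch_key(line):
    """Return the (object, verb) pair if the line is batchable, else None."""
    key = tuple(line.split()[:2])
    return key if key in BATCHABLE else None


def group_batches(lines, batch_max=BATCH_MAX):
    """Yield lists of commands; each list models one send to the kernel."""
    pending, pending_key = [], None
    for line in lines:
        key = batch_key(line)
        # Flush when the command kind changes or the batch is full.
        if pending and (key != pending_key or len(pending) == batch_max):
            yield pending
            pending = []
        if key is None:
            yield [line]  # non-batchable commands are sent on their own
        else:
            pending.append(line)
            pending_key = key
    if pending:
        yield pending
```

With 130 successive "filter add" lines, this yields one batch of 128 followed by one batch of 2.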
Chris Mi [Fri, 12 Jan 2018 05:13:15 +0000 (14:13 +0900)]
lib/libnetlink: Add a new function rtnl_talk_iov
rtnl_talk can only send a single message to kernel. Add a new function
rtnl_talk_iov that can send multiple messages to kernel.
rtnl_talk_iov takes struct iovec * and iovlen as arguments.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
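The idea of handing several messages to the kernel in one call can be illustrated with Python's socket.sendmsg(), which takes a list of buffers much like a struct iovec array. This is only a sketch. It uses a Unix datagram socketpair instead of a netlink socket, the helper names nlmsg and send_many are hypothetical, and the message type and flags are arbitrary placeholders. The 16-byte header layout (len, type, flags, seq, pid) is that of struct nlmsghdr.

```python
import socket
import struct

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
NLMSG_HDR = struct.Struct("=IHHII")


def nlmsg(msg_type, flags, seq, payload=b""):
    """Build one netlink-style message: header followed by payload."""
    length = NLMSG_HDR.size + len(payload)
    return NLMSG_HDR.pack(length, msg_type, flags, seq, 0) + payload


def send_many(sock, messages):
    """Send all messages with a single sendmsg() call, one buffer each."""
    return sock.sendmsg(list(messages))


# Demo over a socketpair; a real caller would use an AF_NETLINK socket.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
# Payloads are 4 bytes, so every message stays 4-byte aligned.
msgs = [nlmsg(0x10, 0x1, seq, b"abcd") for seq in range(3)]
sent = send_many(a, msgs)
data = b.recv(4096)
a.close()
b.close()
```

The receiver sees the three messages back to back and can walk them using the length field in each header.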
Serhey Popovych [Tue, 2 Jan 2018 21:27:58 +0000 (23:27 +0200)]
link_iptnl: Print tunnel mode
Tunnel mode does not appear in the printed parameters for
iptnl-supported tunnels like ipip and sit, while it is printed
for ip6tnl.
Print tunnel mode as "proto" field name for JSON and
without any name when printing to cli to follow ip6tnl
behaviour.
For non JSON output we have:
$ ip -d link show dev sit1
Before:
-------
17: sit1@NONE: <NOARP> mtu 1480 qdisc noop state DOWN ...
link/sit X.X.X.X brd 0.0.0.0 promiscuity 0
sit remote any local X.X.X.X ...
~~~
After:
------
17: sit1@NONE: <NOARP> mtu 1480 qdisc noop state DOWN ...
link/sit X.X.X.X brd 0.0.0.0 promiscuity 0
sit any remote any local X.X.X.X ...
^^^
devlink, rdma, tipc: properly define TARGETS without HAVE_MNL
Leaving a variable with a generic name such as TARGETS undefined would lead
to Make picking up its value from the environment. Avoid this by always
defining TARGETS in the Makefiles.
Luca Boccassi [Tue, 2 Jan 2018 17:42:16 +0000 (18:42 +0100)]
man: fix small formatting errors
Lintian detected the following formatting errors:
man/man8/devlink-sb.8.gz 230: warning: macro `b' not defined
man/man8/ip-link.8.gz 1243: warning: macro `in-8' not defined
(possibly missing space after `in')
man/man8/tc-u32.8.gz `R' is a string (producing the registered sign),
not a macro.
The filesystem paths to these scripts might be different on various
distros, so don't mention it in the manpages. It is not really useful
information anyway.
Luca Boccassi [Sat, 30 Dec 2017 10:31:15 +0000 (11:31 +0100)]
man: add more keywords to ip.8 short description
A Debian user suggested adding more network-related keywords to the
ip manpage, so that manpage-scraping and indexing software like
apropos can do a better job of categorizing the programs.
Luca Boccassi [Sat, 30 Dec 2017 10:31:14 +0000 (11:31 +0100)]
man: drop references to Debian-specific paths
Documentation should be distribution-agnostic - any specific quirks
should be handled by downstream maintainers, if necessary.
Remove mentions of Debian paths and package names.
Signed-off-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Serhey Popovych [Wed, 27 Dec 2017 11:28:15 +0000 (13:28 +0200)]
gre6/tunnel: Do not submit garbage in flowinfo
We always send flowinfo to the kernel. If flowlabel/tclass
was first set to a non-inherit value and then reset to
inherit, we do not clear the flowlabel/tclass part of flowinfo,
so we send it to the kernel and can get it back from the kernel.
Even if we check for IP6_TNL_F_USE_ORIG_TCLASS and
IP6_TNL_F_USE_ORIG_FLOWLABEL when printing options,
sending an invalid flowlabel/tclass to the kernel seems like
a bad idea.
Note that ip6tnl always cleans the corresponding flowinfo
parts on inherit.
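A minimal sketch of the intended behaviour, assuming flowinfo is handled as a host-order integer (the real code works on network-order values). The masks follow the IPv6 header layout of an 8-bit traffic class and a 20-bit flow label. The names TCLASS_MASK, FLOWLABEL_MASK, set_tclass and set_flowlabel are hypothetical.

```python
TCLASS_MASK = 0x0FF00000     # 8-bit traffic class field
FLOWLABEL_MASK = 0x000FFFFF  # 20-bit flow label field


def set_tclass(flowinfo, tclass):
    """tclass is an int in 0..255, or None to mean 'inherit'."""
    flowinfo &= ~TCLASS_MASK  # always drop the stale value first
    if tclass is not None:
        flowinfo |= (tclass << 20) & TCLASS_MASK
    return flowinfo


def set_flowlabel(flowinfo, label):
    """label is an int in 0..0xFFFFF, or None to mean 'inherit'."""
    flowinfo &= ~FLOWLABEL_MASK  # always drop the stale value first
    if label is not None:
        flowinfo |= label & FLOWLABEL_MASK
    return flowinfo
```

Clearing the field before testing for inherit is what keeps an earlier value from leaking into the request.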
Serhey Popovych [Wed, 27 Dec 2017 11:28:14 +0000 (13:28 +0200)]
gre,ip6tnl/tunnel: Fix noencap- support
We must clear the given bit, not set all bits except the given one.
Fixes: 858dbb208e39 ("ip link: Add support for remote checksum offload to IP tunnels")
Fixes: 73516e128a5a ("ip6tnl: Support for fou encapsulation")
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
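The difference between the two operations can be shown with plain integers. This is a sketch with hypothetical flag names and values, and it assumes a 16-bit flags field. The buggy form is assumed to be an OR with the inverted bit, which matches "set all but given bit".

```python
# Hypothetical flag bits for illustration only.
FLAG_CSUM = 1 << 0
FLAG_CSUM6 = 1 << 1
FLAG_REMCSUM = 1 << 2

MASK16 = 0xFFFF  # assumed width of the flags field


def clear_flag_buggy(flags, bit):
    """OR with the inverted bit: turns on every other bit instead."""
    return (flags | ~bit) & MASK16


def clear_flag(flags, bit):
    """AND with the inverted bit: turns off only the given bit."""
    return flags & ~bit & MASK16
```

Starting from FLAG_CSUM | FLAG_REMCSUM, clear_flag() with FLAG_REMCSUM leaves FLAG_CSUM, while clear_flag_buggy() returns 0xFFFF.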
Leon Romanovsky [Wed, 27 Dec 2017 07:57:52 +0000 (09:57 +0200)]
rdma: Move per-device handler function to generic code
Most of the proposed objects are working in the scope "dev"
and will implement the same logic. Move the code to utils.c,
so other objects will be able to reuse the code.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Leon Romanovsky [Wed, 27 Dec 2017 07:57:51 +0000 (09:57 +0200)]
rdma: Protect dev_map_lookup from wrong input
Although all callers of dev_map_lookup ensure that a device name
is always present before calling that function, it is better
and safer to check that in dev_map_lookup itself.
Fixes: 40df8263a0f0 ("rdma: Add dev object")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Serhey Popovych [Wed, 20 Dec 2017 07:57:10 +0000 (09:57 +0200)]
ip/tunnel: No need to free answer after rtnl_talk() on error
Since rtnl_talk() never returns with answer buffer allocated
on error we do not need to release it manually. After this
initializing answer with NULL before rtnl_talk() is useless.
Serhey Popovych [Wed, 20 Dec 2017 07:57:09 +0000 (09:57 +0200)]
utils: ll_addr: Handle ARPHRD_IP6GRE in ll_addr_n2a()
ll_addr_n2a() correctly prints tunnel endpoints for gre, ipip, sit
and ip6tnl, but not for ip6gre. Fix this by adding ARPHRD_IP6GRE to
IPv6 tunnel endpoint address conversion.
Before:
-------
$ ip link show
...
18: ip6tnl0: <NOARP> mtu 1452 qdisc noop state DOWN mode DEFAULT group default
link/tunnel6 :: brd ::
19: ip6gre0: <NOARP> mtu 1456 qdisc noop state DOWN mode DEFAULT group default
link/gre6 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00 brd \
00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
After:
------
$ ip link show
...
18: ip6tnl0: <NOARP> mtu 1452 qdisc noop state DOWN mode DEFAULT group default
link/tunnel6 :: brd ::
19: ip6gre0: <NOARP> mtu 1456 qdisc noop state DOWN mode DEFAULT group default
link/gre6 :: brd ::
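The before/after difference comes down to which formatter is chosen for a 16-byte link-layer address. A rough Python sketch of that selection follows. The function name ll_addr_to_text is hypothetical, and the ARPHRD_* numbers are the values from linux/if_arp.h as best recalled here, so treat them as assumptions.

```python
import socket

# Hardware types (assumed values from linux/if_arp.h).
ARPHRD_TUNNEL = 768
ARPHRD_TUNNEL6 = 769
ARPHRD_SIT = 776
ARPHRD_IPGRE = 778
ARPHRD_IP6GRE = 823

IPV4_TUNNEL_TYPES = (ARPHRD_TUNNEL, ARPHRD_SIT, ARPHRD_IPGRE)
IPV6_TUNNEL_TYPES = (ARPHRD_TUNNEL6, ARPHRD_IP6GRE)  # IP6GRE is the addition


def ll_addr_to_text(addr, hw_type):
    """Format a link-layer address according to the device type."""
    if len(addr) == 4 and hw_type in IPV4_TUNNEL_TYPES:
        return socket.inet_ntop(socket.AF_INET, addr)
    if len(addr) == 16 and hw_type in IPV6_TUNNEL_TYPES:
        return socket.inet_ntop(socket.AF_INET6, addr)
    # Fallback: colon-separated hex bytes, as for Ethernet addresses.
    return ":".join("%02x" % b for b in addr)
```

Without ARPHRD_IP6GRE in the IPv6 list, an all-zero endpoint falls through to the hex fallback and prints sixteen "00" groups instead of "::".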
William Tu [Wed, 20 Dec 2017 02:01:06 +0000 (18:01 -0800)]
erspan: add erspan version II support
The patch adds support for configuring the erspan v2, for both
ipv4 and ipv6 erspan implementation. Three additional fields
are added: 'erspan_ver' for distinguishing v1 or v2, 'erspan_dir'
for specifying direction of the mirrored traffic, and 'erspan_hwid'
for users to set ERSPAN engine ID within a system.
As for manpage, the ERSPAN descriptions used to be under GRE, IPIP,
SIT Type paragraph. Since IP6GRE/IP6GRETAP also supports ERSPAN,
the patch removes the old one, creates a separate ERSPAN paragraph,
and adds an example.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Alexander Zubkov [Sun, 17 Dec 2017 11:09:00 +0000 (12:09 +0100)]
iproute: "list/flush/save default" selected all of the routes
When running "ip route list default" without specifying an address family,
one gets all of the routes instead of only the default route. The same
holds for "exact default" and "match default".
It behaves this way because a default route with unspecified family
has the same all-zeroes value as when no prefix is specified at all. Thus
the following code blindly ignores the fact that a prefix was actually
specified.
This patch adds the flag PREFIXLEN_SPECIFIED to the default route too,
and then checks its value when filtering routes.
Signed-off-by: Alexander Zubkov <green@msu.ru>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
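The ambiguity and its fix can be modelled in a few lines. This sketch is not the iproute2 code. The Prefix class, parse_prefix and exact_match are hypothetical, and the numeric value given to PREFIXLEN_SPECIFIED is illustrative. Only the flag name and the idea of checking it during filtering come from the commit message.

```python
from dataclasses import dataclass

PREFIXLEN_SPECIFIED = 1  # flag name from the commit; value is illustrative


@dataclass
class Prefix:
    bitlen: int = 0
    flags: int = 0


def parse_prefix(arg):
    """None means no prefix given; 'default' is an explicit /0 prefix."""
    if arg is None:
        return Prefix()
    if arg == "default":
        # Same all-zero value as above, but marked as user-specified.
        return Prefix(bitlen=0, flags=PREFIXLEN_SPECIFIED)
    raise ValueError("only 'default' is handled in this sketch")


def exact_match(route_bitlen, flt):
    """Keep every route unless the user actually specified a prefix."""
    if not flt.flags & PREFIXLEN_SPECIFIED:
        return True
    return route_bitlen == flt.bitlen
```

With routes of prefix length 0, 24 and 16, the unspecified filter keeps all three, while the "default" filter keeps only the /0 route.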
Alexander Zubkov [Sun, 17 Dec 2017 12:02:11 +0000 (13:02 +0100)]
iproute: list/flush/save filter also by metric
Metric is one of the "unique key" fields of a route in Linux, but
one still cannot use its value as a filter while running ip list.
This makes writing checks in scripts, for example, inconvenient.
Signed-off-by: Alexander Zubkov <green@msu.ru>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
All tunnels already support parsing/adding zero
endpoints and vti6 isn't an exception.
This check was added as part of commit 2a80154fde40
(vti6: fix local/remote any addr handling) and looks
too restrictive as purpose of change is to avoid
endpoint configuration from uninitialized data.
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Serhey Popovych [Mon, 18 Dec 2017 17:48:04 +0000 (19:48 +0200)]
link_ip6tnl: Use IN6ADDR_ANY_INIT to initialize local/remote endpoints
Use specialized helper to initialize endpoint addresses with
zeros instead of open coding this. This unifies initialization
style with other ipv6 tunnel variants (i.e. gre6 and vti6).
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Serhey Popovych [Mon, 18 Dec 2017 17:48:03 +0000 (19:48 +0200)]
ip/tunnel: Use tnl_parse_key() to parse tunnel key
It is added with
commit a7ed1520ee96 ("ip/tunnel: introduce tnl_parse_key()")
to avoid code duplication in ip6?tunnel.c.
Reuse it for gre/gre6 and vti/vti6 tunnel rtnl
configuration interface with the same purpose
it is used in tunnel ioctl interface in ip6?tunnel.c.
While there change type of key variables from
unsigned integer to __be32 to reflect nature of the
value they store and place error message in
tnl_parse_key() on a single line to make single
call to fprintf().
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Serhey Popovych [Wed, 13 Dec 2017 19:36:01 +0000 (21:36 +0200)]
ip/tunnel: Use get_addr() instead of get_prefix() for local/remote endpoints
Manual page ip-link(8) states that both local and remote accept
IPADDR not PREFIX. Use get_addr() instead of get_prefix() to
parse local/remote endpoint address correctly.
Force corresponding address family instead of using preferred_family
to catch weird cases as shown below.
Before this patch it is possible to create tunnel with commands:
ip li add dev ip6gre2 type ip6gre local fe80::1/64 remote fe80::2/64
ip -4 li add dev ip6gre2 type ip6gre local 10.0.0.1/24 remote 10.0.0.2/24
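The distinction between an address parser and a prefix parser, plus the forced address family, can be sketched with Python's ipaddress module. This only illustrates the validation rule. The function parse_endpoint is hypothetical and does not mirror get_addr() internals.

```python
import ipaddress


def parse_endpoint(text, version):
    """Accept a plain address of the given IP version; reject prefixes.

    ipaddress.ip_address() raises ValueError for "addr/len" input, which
    models using an address parser instead of a prefix parser.
    """
    addr = ipaddress.ip_address(text)
    if addr.version != version:
        raise ValueError("expected an IPv%d address: %s" % (version, text))
    return addr
```

For an ip6gre device the caller would pass version 6, so both "fe80::1/64" and "10.0.0.1" are rejected while "fe80::1" is accepted.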
Serhey Popovych [Wed, 13 Dec 2017 19:36:00 +0000 (21:36 +0200)]
ip/tunnel: Unify setup and accept zero address for local/remote endpoints
It is fully legal to submit zero (INADDR_ANY/IN6ADDR_ANY_INIT)
value for local and/or remote endpoints for all tunnel drivers:
there is no need to additionally check this in userspace.
Note that all tunnel specific code already can pass zero address
to the kernel.
Oliver Hartkopp [Sat, 16 Dec 2017 11:38:57 +0000 (12:38 +0100)]
ip: add vxcan/veth to ip-link man page
veth and vxcan both create a virtual tunnel between a pair of virtual network
devices. This patch adds the content for the now supported vxcan netdevices
and the documentation to create peer devices for vxcan and veth.
Additionally, remove 'can', which was accidentally on the list of link types that
can be created by 'ip link add', as 'can' devices are real network devices.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Roman Mashak [Fri, 15 Dec 2017 14:27:42 +0000 (09:27 -0500)]
ss: add missing path MTU parameter
v3:
Rebase and use out() instead of printf().
v2:
Print the path MTU immediately after the MSS, as it is easier to parse
for humans (suggested by Neal Cardwell).
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
William Tu [Wed, 13 Dec 2017 02:22:52 +0000 (18:22 -0800)]
gre6: add collect metadata support
The patch adds 'external' option to support collect metadata
gre6 tunnel. The 'external' keyword is already used to set the
device into collect metadata mode such as vxlan, geneve, ipip,
etc. This patch extends support for ipv6 gre and gretap.
Example of L3 and L2 gre device:
bash:~# ip link add dev ip6gre123 type ip6gre external
bash:~# ip link add dev ip6gretap123 type ip6gretap external
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Chris Mi [Thu, 14 Dec 2017 09:09:00 +0000 (18:09 +0900)]
tc: fix command "tc actions del" hang issue
If command is RTM_DELACTION, a non-NULL pointer is passed to rtnl_talk().
Then flag NLM_F_ACK is not set on n->nlmsg_flags and netlink_ack() will
not be called. Command tc will wait for the reply forever.
Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
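The hang can be modelled with the two netlink flag values involved. This is a sketch of the behaviour the commit message describes, not the libnetlink code. The functions prepare_flags and will_get_reply are hypothetical, and it is an assumption here that a delete request produces no data reply of its own.

```python
NLM_F_REQUEST = 0x01
NLM_F_ACK = 0x04


def prepare_flags(flags, wants_answer):
    """Request an ACK only when the caller passes no answer pointer."""
    if not wants_answer:
        flags |= NLM_F_ACK
    return flags


def will_get_reply(flags, kernel_sends_data):
    """A reader blocks forever unless data or an ACK comes back."""
    return kernel_sends_data or bool(flags & NLM_F_ACK)
```

In this model, a delete sent with an answer pointer gets neither data nor an ACK, so the caller never wakes up. Passing no answer pointer for the delete case restores the ACK.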
Stefano Brivio [Tue, 12 Dec 2017 00:46:33 +0000 (01:46 +0100)]
ss: Implement automatic column width calculation
Group fitting fields into lines and space them equally using the
remaining screen width for each line. If columns don't fit on
one line, break them into the smallest possible number of lines and
keep them aligned across lines.
This is done by:
- recording the length of the longest item in each column during
formatting and buffering (which was added in the previous patch)
- fitting as many fields as possible on each line of output
- distributing the remaining padding space equally between the
columns
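The three steps listed above can be sketched as follows. This is a simplified Python illustration, not the ss implementation. The names layout and render are hypothetical, one space of minimum separation per column is assumed, and alignment of columns across broken lines is left out.

```python
def layout(rows, screen_width):
    """Plan the output: a list of lines, each a list of (column, width)."""
    ncols = len(rows[0])
    # Step 1: longest item in each column.
    widths = [max(len(row[c]) for row in rows) for c in range(ncols)]

    # Step 2: fit as many columns as possible on each line.
    lines, current, used = [], [], 0
    for c, w in enumerate(widths):
        need = w + 1  # field plus one separating space
        if current and used + need > screen_width:
            lines.append(current)
            current, used = [], 0
        current.append(c)
        used += need
    if current:
        lines.append(current)

    # Step 3: share the leftover width equally between the columns.
    plan = []
    for cols in lines:
        used = sum(widths[c] + 1 for c in cols)
        extra = max(screen_width - used, 0) // len(cols)
        plan.append([(c, widths[c] + 1 + extra) for c in cols])
    return plan


def render(rows, screen_width):
    """Render buffered rows using the plan computed by layout()."""
    plan = layout(rows, screen_width)
    out = []
    for row in rows:
        for cols in plan:
            out.append("".join(row[c].ljust(w) for c, w in cols).rstrip())
    return "\n".join(out)
```

Because widths are only known after every row has been seen, the rows have to be buffered before anything is printed.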
Stefano Brivio [Tue, 12 Dec 2017 00:46:32 +0000 (01:46 +0100)]
ss: Buffer raw fields first, then render them as a table
This allows us to measure the maximum field length for each
column before printing fields and will permit us to apply
optimal field spacing and distribution. Structure of the output
buffer with chunked allocation is described in comments.
Output is still unchanged, original spacing is used.
Running over one million sockets with -tul options by simply
modifying main() to loop 50,000 times over the *_show()
functions, buffering the whole output and rendering it at the
end, with 10 UDP sockets, 10 TCP sockets, while throwing
output away, doesn't show significant changes in execution time
on my laptop with an Intel i7-6600U CPU:
- before this patch:
$ time ./ss -tul > /dev/null
real 0m29.899s
user 0m2.017s
sys 0m27.801s
- after this patch:
$ time ./ss -tul > /dev/null
real 0m29.827s
user 0m1.942s
sys 0m27.812s
Stefano Brivio [Tue, 12 Dec 2017 00:46:31 +0000 (01:46 +0100)]
ss: Introduce columns lightweight abstraction
Instead of embedding spacing directly while printing contents,
logically declare columns and functions to buffer their content,
to print left and right spacing around fields, to flush them to
screen, and to print headers.
This makes it a bit easier to handle layout changes and prepares
for full output buffering, needed for optimal spacing in field
output layout.
Columns are currently set up to retain exactly the same output
as before. This needs some slight adjustments of the values
previously calculated in main(), as the width value introduced
here already includes the width of left delimiters and spacing
is not explicitly printed anymore whenever a field is printed.
These calculations will go away altogether once automatic width
calculation is implemented.
We can also remove explicit printing of newlines after the final
content for a given line is printed, flushing the last field on
a line will cause field_flush() to print newlines where
appropriate.
Stefano Brivio [Tue, 12 Dec 2017 00:46:30 +0000 (01:46 +0100)]
ss: Replace printf() calls for "main" output by calls to helper
This is preparation work for output buffering, which will allow
us to use optimal spacing and alignment of logical "columns".
The new out() function is just a re-implementation of a typical
libc's printf(), except that the return value of vfprintf() is
ignored as no callers use it. This implementation will be
replaced in the next patches to provide column width adjustment
and adequate spacing.
All printf() calls that output parts of the socket list are now
replaced by calls to out(). Output of summary and version is
excluded from this.
No functional differences here, output not affected.
This allows sending GSO maximum values when configuring a device.
The values are advisory. Most devices will ignore them but for some
pseudo devices such as veth pairs they can be set.
Example:
# ip link add dev vm1 type veth peer name vm2 gso_max_size 32768
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Phil Sutter [Wed, 29 Nov 2017 17:34:09 +0000 (18:34 +0100)]
man: tc-csum.8: Fix inconsistency in example description
Commit 6bbe5e6290db5 ("man: tc-csum.8: Fix example") changed both source
and destination IP addresses in example code but did not update the
example's description accordingly.
Fixes: 6bbe5e6290db5 ("man: tc-csum.8: Fix example")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Robert Shearman [Tue, 28 Nov 2017 11:16:50 +0000 (11:16 +0000)]
vxlan: Make id optional when modifying a link
Specifying the IFLA_VXLAN_LINK attribute on a vxlan link modify is
optional in the kernel, so make the id argument optional for "ip link
set ..." to avoid a user needing to specify it when changing another
attribute.
Robert Shearman [Tue, 28 Nov 2017 11:16:21 +0000 (11:16 +0000)]
gre: Fix ttl inherit option
Specifying "... ttl inherit" currently does nothing on a GRE link
modify since the previous ttl value is retrieved up front. Fix this by
explicitly setting ttl to 0 when "inherit" is specified for the
option, since 0 represents the semantics of inherit.
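A small model of the fix, assuming the modify path starts from the device's current ttl and only overrides it when the option is given. The functions parse_ttl and modify_ttl are hypothetical names for illustration.

```python
def parse_ttl(arg):
    """Map the ttl argument to the value sent to the kernel."""
    if arg == "inherit":
        return 0  # 0 carries the semantics of inherit
    ttl = int(arg)
    if not 0 <= ttl <= 255:
        raise ValueError("invalid ttl: %s" % arg)
    return ttl


def modify_ttl(current_ttl, arg=None):
    """Keep the current ttl unless the option was given on the command line."""
    return current_ttl if arg is None else parse_ttl(arg)
```

Returning an explicit 0 for "inherit" is what makes the option take effect even though the previous value was loaded first.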
Jiri Pirko [Sat, 25 Nov 2017 10:07:57 +0000 (11:07 +0100)]
tc: remove action cookie len from printout
Make the output same as input and avoid printout of unnecessary len.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: fd8b3d2c1b9b ("actions: Add support for user cookies")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>