Mark Zhang [Tue, 4 Aug 2020 08:49:09 +0000 (11:49 +0300)]
rdma: Document the new "pid" criteria for auto mode
Document the new supported criteria of auto mode. Examples:
$ rdma statistic qp set link mlx5_2/1 auto pid on
$ rdma statistic qp set link mlx5_2/1 auto pid,type on
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Ido Kalir <idok@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Mark Zhang [Tue, 4 Aug 2020 08:49:08 +0000 (11:49 +0300)]
rdma: Add "PID" criteria support for statistic counter auto mode
With this new criterion, QPs that have different PIDs are bound to
different counters in auto mode. It can be combined with other
criteria such as "type". Examples:
$ rdma statistic qp set link mlx5_2/1 auto pid on
$ rdma statistic qp set link mlx5_2/1 auto type,pid on
$ rdma statistic qp set link mlx5_2/1 auto off
$ rdma statistic qp show link mlx5_0 qp-type UD
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Ido Kalir <idok@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
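The criteria list in the examples above ("pid", "type,pid") is a comma-separated set of keywords. A minimal sketch of how such a list could be turned into a bitmask follows; the flag names and the function are hypothetical and are not taken from the rdma tool's source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit flags, one per auto-mode criterion. */
enum {
	CRIT_TYPE = 1u << 0,
	CRIT_PID  = 1u << 1,
};

/*
 * Parse a comma-separated criteria list such as "pid" or "type,pid"
 * into a bitmask. Returns 0 on success, -1 on an unknown keyword,
 * an empty item, a trailing comma, or an empty list.
 */
int parse_auto_criteria(const char *list, unsigned int *mask)
{
	const char *p = list;

	assert(list != NULL && mask != NULL);
	*mask = 0;

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of the current item */

		if (len == 4 && strncmp(p, "type", len) == 0)
			*mask |= CRIT_TYPE;
		else if (len == 3 && strncmp(p, "pid", len) == 0)
			*mask |= CRIT_PID;
		else
			return -1;

		p += len;
		if (*p == ',') {
			p++;
			if (*p == '\0')
				return -1;	/* trailing comma */
		}
	}
	return *mask ? 0 : -1;
}
```

With this sketch, "type,pid" and "pid,type" yield the same mask, which matches the way both orderings appear in the examples.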
bridge: fdb show: fix fdb entry state output for json context
In JSON mode, bridge fdb show prints an incorrect, non-machine-readable
value. With -j (JSON output) we expect machine-readable data that does
not require special handling or parsing.
Add space after format specifier in print_string call. Fixes broken
qdisc tests within tdc testing suite. Per suggestion from Petr Machata,
remove a space and change spacing in tc/q_event.c to complete the fix.
Tested fix in tdc using:
./tdc.py -c qdisc
All qdisc RED tests return ok.
Fixes: d0e450438571 ("tc: q_red: Add support for qevents "mark" and "early_drop"")
Signed-off-by: Briana Oursler <briana.oursler@gmail.com>
Tested-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Anton Danilov [Mon, 27 Jul 2020 13:26:07 +0000 (16:26 +0300)]
bridge: fdb: the 'dynamic' option in the show/get commands
In most cases a user wants to see only the dynamic MAC addresses
in the fdb output. But currently 'fdb show' displays a large number
of assorted self entries, which only clutter the output.
The new 'dynamic' option for the 'show' and 'get' commands restricts
the output to the relevant records.
Signed-off-by: Anton Danilov <littlesmilingcloud@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
David Ahern [Thu, 23 Jul 2020 00:34:07 +0000 (00:34 +0000)]
Merge branch 'devlink-port-health' into next
Moshe Shemesh says:
====================
Implement commands for interaction with per-port devlink health
reporters. To do this, adapt devlink-health for usage of port handles
with any existing devlink-health subcommands. Add devlink-port health
subcommand as an alias for devlink-health.
Add devlink port health show subcommand which displays information about
specified port reporter or all present port reporters as in the example.
Device and port reporters can be distinguished by a handle being used.
Make other devlink-health subcommands be aliased by devlink port health.
Refactor devlink-health commands for usage of port handles in order to
interact with port reporters.
Change devlink health show output to dump information about both device
and port reporters with correct handles.
Example:
$ devlink health show
pci/0000:00:0b.0:
reporter fw
state healthy error 0 recover 0 auto_dump true
reporter fw_fatal
state healthy error 0 recover 0 grace_period 1200000 auto_recover true auto_dump true
pci/0000:00:0b.0/1:
reporter tx
state healthy error 0 recover 0 grace_period 10000 auto_recover true auto_dump true
reporter rx
state healthy error 0 recover 0 grace_period 10000 auto_recover true auto_dump true
$ devlink health show pci/0000:00:0b.0/1 reporter rx
Which is equivalent to:
$ devlink port health show pci/0000:00:0b.0/1 reporter rx
pci/0000:00:0b.0/1:
reporter rx
state healthy error 0 recover 0 grace_period 10000 auto_recover true auto_dump true
$ devlink health set pci/0000:00:0b.0/1 reporter rx grace_period 5000
Which is equivalent to:
$ devlink port health set pci/0000:00:0b.0/1 reporter rx grace_period 5000
$ devlink port health show pci/0000:00:0b.0/1 reporter rx
pci/0000:00:0b.0/1:
reporter rx
state healthy error 0 recover 0 grace_period 5000 auto_recover true auto_dump true
David Ahern [Mon, 20 Jul 2020 16:36:41 +0000 (16:36 +0000)]
Merge branch 'tc-qevent-block' into next
Petr Machata says:
====================
When a list of filters at a given block is requested, tc first validates
that the block exists before doing the filter query. Currently the
validation routine checks ingress and egress blocks. But now that blocks
can be bound to qevents as well, qevent blocks should be looked for as
well:
# ip link add up type dummy
# tc qdisc add dev dummy1 root handle 1: \
red min 30000 max 60000 avpkt 1000 qevent early_drop block 100
# tc filter add block 100 pref 1234 handle 102 matchall action drop
# tc filter show block 100
Cannot find block "100"
This patchset fixes this issue:
# tc filter show block 100
filter protocol all pref 1234 matchall chain 0
filter protocol all pref 1234 matchall chain 0 handle 0x66
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 2 ref 1 bind 1
In patch #1, the helpers and the necessary infrastructure are introduced,
including a new qdisc_util callback that implements sniffing out bound
blocks in a given qdisc.
In patch #2, RED implements the new callback.
v3:
- Patch #1:
- Do not pass &ctx->found directly to has_block. Do it through a
helper variable, so that the callee does not overwrite the result
already stored in ctx->found.
v2:
- Patch #1:
- In tc_qdisc_block_exists_cb(), do not initialize 'q'.
- Propagate upwards errors from q->has_block.
Petr Machata [Thu, 16 Jul 2020 16:47:07 +0000 (19:47 +0300)]
tc: Look for blocks in qevents
When a list of filters at a given block is requested, tc first validates
that the block exists before doing the filter query. Currently the
validation routine checks ingress and egress blocks. But now that blocks
can be bound to qevents as well, qevent blocks should be looked for as
well.
In order to support that, extend struct qdisc_util with a new callback,
has_block. It should report whether, given the attributes in TCA_OPTIONS,
a block with a given number is bound to a qevent. In
tc_qdisc_block_exists_cb(), invoke that callback when set.
Add a helper to the tc_qevent module that walks the list of qevents and
looks for a given block. This is meant to be used by the individual qdiscs.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
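The callback arrangement described above, together with the v3 note about not passing &ctx->found directly to has_block, can be sketched roughly as follows. The structures and function names are simplified, hypothetical stand-ins; the real struct qdisc_util and tc_qdisc_block_exists_cb() operate on netlink attributes rather than plain arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-qdisc operations table. */
struct qdisc_ops {
	const char *id;
	/* Optional: report whether block_idx is bound to one of the qevents. */
	int (*has_block)(const unsigned int *qevent_blocks, size_t n,
			 unsigned int block_idx, bool *found);
};

struct block_ctx {
	unsigned int block_idx;	/* block being searched for */
	bool found;		/* sticky result across all qdiscs */
};

/* Helper: walk a list of qevent block indices and look for block_idx. */
int qevents_have_block(const unsigned int *qevent_blocks, size_t n,
		       unsigned int block_idx, bool *found)
{
	size_t i;

	assert(found != NULL);
	*found = false;
	for (i = 0; i < n; i++) {
		if (qevent_blocks[i] == block_idx) {
			*found = true;
			break;
		}
	}
	return 0;
}

/* Called once per qdisc while looking for ctx->block_idx. */
int block_exists_cb(const struct qdisc_ops *q,
		    const unsigned int *qevent_blocks, size_t n,
		    struct block_ctx *ctx)
{
	bool found = false;	/* helper variable, see the v3 note */
	int err;

	if (!q || !q->has_block)
		return 0;	/* qdisc has no qevent support */

	err = q->has_block(qevent_blocks, n, ctx->block_idx, &found);
	if (err)
		return err;	/* propagate callback errors upwards */
	if (found)
		ctx->found = true;	/* never reset an earlier hit */
	return 0;
}
```

Because the callee writes only to the local helper variable, a later qdisc that does not bind the block cannot clear a match recorded for an earlier one.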
Paolo Abeni [Fri, 10 Jul 2020 13:52:35 +0000 (15:52 +0200)]
ss: mptcp: add msk diag interface support
This implements support for the MPTCP socket type, including
extended socket info. Note that we need to add an extended
attribute carrying the actual protocol number to the diag
request.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
ip xfrm: update man page on setting/printing XFRMA_IF_ID in states/policies
In commit aed63ae1acb9 ("ip xfrm: support setting/printing XFRMA_IF_ID attribute in states/policies")
I added the ability to set/print the xfrm interface ID without updating
the man page.
Fixes: aed63ae1acb9 ("ip xfrm: support setting/printing XFRMA_IF_ID attribute in states/policies")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Hoang Huu Le [Thu, 9 Jul 2020 04:25:55 +0000 (11:25 +0700)]
tipc: fixed a compile warning in tipc/link.c
Fixes: 5027f233e35b ("tipc: add link broadcast get")
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Tony Ambardar [Tue, 7 Jul 2020 07:58:33 +0000 (00:58 -0700)]
configure: support ipset version 7 with kernel version 5
The configure script checks for ipset v6 availability but doesn't test
for v7, which is backward compatible and used on kernel v5.x systems.
Update the script to test for both ipset versions. Without this change,
the tc ematch function em_ipset will be disabled.
Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Louis Peens [Fri, 19 Jun 2020 11:50:07 +0000 (13:50 +0200)]
devlink: add 'disk' to 'fw_load_policy' string validation
The 'fw_load_policy' devlink parameter has supported the 'disk' value
since kernel v5.4, but it was never added to the string validation
in iproute2. This patch fixes that oversight.
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Add the new "mpls" keyword that can be used to match MPLS fields in
arbitrary Label Stack Entries.
LSEs are introduced by the "lse" keyword and followed by LSE options:
"depth", "label", "tc", "bos" and "ttl". The depth is mandatory; the
other options are optional.
For example, the following filter drops MPLS packets having two labels,
where the first label is 21 and has TTL 64 and the second label is 22:
$ tc filter add dev ethX ingress proto mpls_uc flower mpls \
lse depth 1 label 21 ttl 64 \
lse depth 2 label 22 bos 1 \
action drop
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
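For readers unfamiliar with the fields that the "label", "tc", "bos" and "ttl" options refer to, the sketch below packs one Label Stack Entry using the standard MPLS layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack, 8-bit TTL). The helper names are made up for illustration and the value is kept in host byte order; this is not the code the flower classifier uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack one MPLS label stack entry in host byte order:
 *   label: bits 31..12, traffic class: bits 11..9,
 *   bottom-of-stack: bit 8, TTL: bits 7..0.
 */
uint32_t mpls_lse_pack(uint32_t label, uint8_t tc, uint8_t bos, uint8_t ttl)
{
	assert(label <= 0xFFFFF && tc <= 7 && bos <= 1);
	return (label << 12) | ((uint32_t)tc << 9) | ((uint32_t)bos << 8) | ttl;
}

/* Field accessors for a packed entry. */
uint32_t mpls_lse_label(uint32_t lse) { return lse >> 12; }
uint8_t mpls_lse_tc(uint32_t lse)     { return (lse >> 9) & 0x7; }
uint8_t mpls_lse_bos(uint32_t lse)    { return (lse >> 8) & 0x1; }
uint8_t mpls_lse_ttl(uint32_t lse)    { return lse & 0xFF; }
```

In the filter example above, the entry at depth 1 would carry label 21 and TTL 64, and the entry at depth 2 would carry label 22 with the bottom-of-stack bit set.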
Bareudp devices provide a generic L3 encapsulation for tunnelling
different protocols like MPLS, IP, NSH, etc. inside a UDP tunnel.
This patch is based on original work from Martin Varghese:
https://lore.kernel.org/netdev/1570532361-15163-1-git-send-email-martinvarghesenokia@gmail.com/
Examples:
- ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc
This creates a bareudp tunnel device which tunnels L3 traffic with
ethertype 0x8847 (unicast MPLS traffic). The destination port of the
UDP header will be set to 6635. The device will listen on UDP port 6635
to receive traffic.
- ip link add dev bareudp0 type bareudp dstport 6635 ethertype ipv4 multiproto
Same as the MPLS example, but for IPv4. The "multiproto" keyword allows
the device to also tunnel IPv6 traffic.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Anton Danilov [Fri, 3 Jul 2020 15:39:22 +0000 (18:39 +0300)]
tc: improve the qdisc show command
Previously it was only possible to show all queueing disciplines on an
interface. There was no way to get the qdisc info by handle or parent,
short of a full dump of the disciplines followed by grep/sed.
Now the new and old options work as expected to filter a qdisc by handle
or parent.
Full syntax of the qdisc show command:
tc qdisc { show | list } [ dev STRING ] [ QDISC_ID ] [ invisible ]
QDISC_ID := { root | ingress | handle QHANDLE | parent CLASSID }
This change doesn't require any changes in the kernel.
Signed-off-by: Anton Danilov <littlesmilingcloud@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
libnetlink.3: display section numbers in roman font, not boldface
Typeset section numbers in roman font, see man-pages(7).
###
Details:
Output is from: test-groff -b -mandoc -T utf8 -rF0 -t -w w -z
[ "test-groff" is a developmental version of "groff" ]
<./man/man3/libnetlink.3>:53 (macro BR): only 1 argument, but more are expected
<./man/man3/libnetlink.3>:132 (macro BR): only 1 argument, but more are expected
<./man/man3/libnetlink.3>:134 (macro BR): only 1 argument, but more are expected
<./man/man3/libnetlink.3>:197 (macro BR): only 1 argument, but more are expected
<./man/man3/libnetlink.3>:198 (macro BR): only 1 argument, but more are expected
David Ahern [Sun, 5 Jul 2020 18:11:49 +0000 (18:11 +0000)]
Merge branch 'rdma-raw-format-dumps' into next
Leon Romanovsky says:
====================
The following series adds support to get the RDMA resource data in RAW
format. The main motivation for doing this is to enable vendors to
return the entire QP/CQ/MR data without a need from the vendor to set
each field separately.
Maor Gottlieb [Wed, 24 Jun 2020 10:40:10 +0000 (13:40 +0300)]
rdma: Add support to get QP in raw format
Add a 'raw' argument to get the resource in raw format.
When RDMA_NLDEV_ATTR_RES_RAW is set in the netlink message, the
resource fields are in raw format and are printed as a byte array.
Example:
$ rdma res show qp link rocep0s12f0/1 lqpn 1137 -j -r
[{"ifindex":7,"ifname":"mlx5_1","port":1,
"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...]}]
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
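The "data" field in the example output is a plain array of byte values. A minimal sketch of rendering a raw buffer in that shape is shown below; the function name is hypothetical, and the rdma tool itself goes through iproute2's JSON printing helpers rather than a hand-rolled formatter like this one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a raw buffer as a JSON array of decimal byte values,
 * e.g. [0,17,255]. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if the output does not fit.
 */
int format_byte_array(const unsigned char *data, size_t len,
		      char *out, size_t outlen)
{
	size_t used, i;
	int n;

	assert(out != NULL && outlen > 0);

	n = snprintf(out, outlen, "[");
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	used = (size_t)n;

	for (i = 0; i < len; i++) {
		/* Separate elements with a comma, except before the first. */
		n = snprintf(out + used, outlen - used, "%s%u",
			     i ? "," : "", (unsigned int)data[i]);
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
	}

	n = snprintf(out + used, outlen - used, "]");
	if (n < 0 || (size_t)n >= outlen - used)
		return -1;
	used += (size_t)n;

	return (int)used;
}
```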
David Ahern [Sun, 5 Jul 2020 15:45:48 +0000 (15:45 +0000)]
Merge branch 'tc-qevents' into next
Petr Machata says:
====================
To allow configuring user-defined actions as a result of inner workings of
a qdisc, a concept of qevents was recently introduced to the kernel.
Qevents are attach points for TC blocks, where filters can be put that are
executed as the packet hits well-defined points in the qdisc algorithms.
The attached blocks can be shared, in a manner similar to clsact ingress
and egress blocks, arbitrary classifiers with arbitrary actions can be put
on them, etc.
This patch set introduces the corresponding iproute2 support. Patch #1 adds
the new netlink attribute enumerators. Patch #2 adds a set of helpers to
implement qevents, and #3 adds a generic documentation to tc.8. Patch #4
then adds two new qevents to the RED qdisc: mark and early_drop.
Petr Machata [Tue, 30 Jun 2020 10:14:50 +0000 (13:14 +0300)]
tc: Add helpers to support qevent handling
Introduce a set of helpers to make it easy to add support for qevents into
qdisc.
The idea behind this is that qevent types will be generally reused between
qdiscs, rather than each having a completely idiosyncratic set of qevents.
The qevent module holds functions for parsing, dumping and formatting of
these common qevent types, and for dispatch to the appropriate set of
handlers based on the qevent name.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Po Liu [Mon, 29 Jun 2020 02:04:20 +0000 (10:04 +0800)]
action police: make 'mtu' could be set independently in police action
Currently the police action requires 'rate' and 'burst' to be set. The
'mtu' parameter sets the maximum frame size and can be useful on its own,
without 'rate' and 'burst'. For example, when offloading to hardware,
'mtu' can limit the maximum frame size of a flow.
Signed-off-by: Po Liu <po.liu@nxp.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Sun, 5 Jul 2020 14:49:53 +0000 (14:49 +0000)]
Merge branch 'devlink-port-mac-addr' into next
Parav Pandit says:
====================
Currently ip link set dev <pfndev> vf <vf_num> <param> <value> has
the following limitations.
1. The command is limited to setting VF parameters only.
It cannot set the default MAC address for the PCI PF.
2. It can be used only on a system where PCI SR-IOV is supported.
In a SmartNIC-based system, the eswitch of a NIC resides on a different
embedded CPU, which has the VF and PF representors for the SR-IOV
support on the host system into which the SmartNIC is plugged.
3. It cannot set up the function attributes of a sub-function, described
in detail in the comprehensive RFCs [1] and [2].
This series covers the first small part to let user query and set MAC
address (hardware address) of a PCI PF/VF which is represented by
devlink port.
Parav Pandit [Tue, 23 Jun 2020 10:44:25 +0000 (10:44 +0000)]
devlink: Support setting port function hardware address
Support setting devlink port function hardware address.
Example of a PCI VF port which supports a port function:
Set hardware address of the VF's port function.
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00
$ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:11:22:33:44:55
Parav Pandit [Tue, 23 Jun 2020 10:44:24 +0000 (10:44 +0000)]
devlink: Support querying hardware address of port function
Add support to query the hardware address of function represented
by devlink port function.
Example of a PCI VF port which supports a port function:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:11:22:33:44:66
Roi Dayan [Thu, 11 Jun 2020 17:35:43 +0000 (20:35 +0300)]
ip address: Fix loop initial declarations are only allowed in C99
On some distros, e.g. RHEL 7.6, compilation fails with the following:
ipaddress.c: In function ‘lookup_flag_data_by_name’:
ipaddress.c:1260:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (int i = 0; i < ARRAY_SIZE(ifa_flag_data); ++i) {
^
ipaddress.c:1260:2: note: use option -std=c99 or -std=gnu99 to compile your code
This commit fixes the single place needed for compilation to pass.
Fixes: 9d59c86e575b ("iproute2: ip addr: Organize flag properties structurally")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
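The compiler error above is triggered by declaring the loop counter inside the for statement, which older default language modes reject. A sketch of the portable form, with the counter declared at the top of the function, is shown below. The table contents and the function name are illustrative only and are not copied from ipaddress.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Illustrative flag table; entries are examples, not the real list. */
struct flag_data {
	const char *name;
	unsigned int mask;
	int is_mutable;		/* non-zero if settable from userspace */
};

static const struct flag_data flag_table[] = {
	{ "nodad",       0x02, 1 },
	{ "dadfailed",   0x08, 0 },
	{ "homeaddress", 0x10, 1 },
};

const struct flag_data *lookup_flag_by_name(const char *name)
{
	size_t i;	/* declared before the loop: accepted in C89/gnu89 mode */

	assert(name != NULL);
	for (i = 0; i < ARRAY_SIZE(flag_table); ++i) {
		if (strcmp(name, flag_table[i].name) == 0)
			return &flag_table[i];
	}
	return NULL;
}
```

Moving the declaration out of the for statement keeps the code buildable without requiring -std=c99 or -std=gnu99 on older toolchains.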
Ian K. Coolidge [Wed, 27 May 2020 18:03:45 +0000 (11:03 -0700)]
iproute2: ip addr: Organize flag properties structurally
This creates a nice systematic way to check that the various flags are
mutable from userspace and that the address family is valid.
Mutability properties are preserved to avoid introducing any behavioral
change in this CL. However, previously, immutable flags were ignored and
fell through to this confusing error:
Error: either "local" is duplicate, or "dadfailed" is a garbage.
But now, they just warn more explicitly:
Warning: dadfailed option is not mutable from userspace
Signed-off-by: David Ahern <dsahern@gmail.com>
Andrea Claudi [Tue, 26 May 2020 16:04:11 +0000 (18:04 +0200)]
bpf: Fixes a snprintf truncation warning
gcc v9.3.1 reports:
bpf.c: In function ‘bpf_get_work_dir’:
bpf.c:784:49: warning: ‘snprintf’ output may be truncated before the last format character [-Wformat-truncation=]
784 | snprintf(bpf_wrk_dir, sizeof(bpf_wrk_dir), "%s/", mnt);
| ^
bpf.c:784:2: note: ‘snprintf’ output between 2 and 4097 bytes into a destination of size 4096
784 | snprintf(bpf_wrk_dir, sizeof(bpf_wrk_dir), "%s/", mnt);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by simply checking the snprintf return code and properly handling the error.
Fixes: e42256699cac ("bpf: make tc's bpf loader generic and move into lib")
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
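Checking the snprintf return code, as described above, usually means treating a negative value as an error and a value greater than or equal to the buffer size as truncation. The sketch below shows that pattern in isolation; the function name is hypothetical and the error handling in bpf.c may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write "<mnt>/" into dst. Returns 0 on success, or -1 if snprintf
 * failed or the result would not fit into dstlen bytes.
 */
int build_work_dir(char *dst, size_t dstlen, const char *mnt)
{
	int ret;

	assert(dst != NULL && mnt != NULL);

	ret = snprintf(dst, dstlen, "%s/", mnt);
	if (ret < 0 || (size_t)ret >= dstlen) {
		/* Do not leave a truncated path behind. */
		if (dstlen > 0)
			dst[0] = '\0';
		return -1;
	}
	return 0;
}
```

The comparison uses >= because the return value excludes the terminating NUL byte, so a return equal to the buffer size already means one character was dropped.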
This happens because asprintf allocates exactly the space needed to hold a
string in the buffer passed as its first argument, but if this buffer is later
used in strcat() or similar we have a buffer overrun.
As the aim of commit c0325b06382c is simply to fix a compiler warning, it
seems safe and reasonable to revert it.
Fixes: c0325b06382c ("bpf: replace snprintf with asprintf when dealing with long buffers")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Tuong Lien [Tue, 26 May 2020 09:40:55 +0000 (16:40 +0700)]
tipc: enable printing of broadcast rcv link stats
This commit allows printing the statistics of a broadcast-receiver link
using the same tipc command, but with additional 'link' options:
$ tipc link stat show --help
Usage: tipc link stat show [ link { LINK | SUBSTRING | all } ]
With:
+ 'LINK' : print the stats of the specific link 'LINK';
+ 'SUBSTRING' : print the stats of all the links having the 'SUBSTRING'
in name;
+ 'all' : print all the links' stats incl. the broadcast-receiver
ones;
Also, link stats can be reset in the usual way by specifying the link
name in the command.
For example:
$ tipc l st sh l br
Link <broadcast-link>
Window:50 packets
RX packets:0 fragments:0/0 bundles:0/0
TX packets:5011125 fragments:4968774/149643 bundles:38402/307061
RX naks:781484 defs:0 dups:0
TX naks:0 acks:0 retrans:330259
Congestion link:50657 Send queue max:0 avg:0
Store the number of parsed count/offset pairs in a dedicated variable
that is compared against opt.num_tc after all of the command line
arguments have been parsed. Bail out if this count is higher than
opt.num_tc and let the user know about it.
Drivers were swallowing such commands, as they iterate over
count/offset pairs using num_tc as a delimiter, so this is not
a big deal, but it is better to catch such misconfiguration at the
command line argument parsing level.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
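The check described above can be sketched as two steps: count the pairs while parsing, then compare the count against num_tc once parsing is done. The "count@offset" token syntax, the function names and the error text below are assumptions made for illustration; the actual parser is not shown in this log.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count leading well-formed "count@offset" tokens in argv[0..argc).
 * Counting stops at the first token that does not parse.
 */
int count_queue_pairs(int argc, const char *const *argv)
{
	int pairs = 0;
	int i;

	assert(argc == 0 || argv != NULL);
	for (i = 0; i < argc; i++) {
		unsigned int count, offset;
		char trailing;

		/* Exactly two conversions means no trailing garbage. */
		if (sscanf(argv[i], "%u@%u%c", &count, &offset, &trailing) != 2)
			break;
		pairs++;
	}
	return pairs;
}

/* Run after all arguments are parsed, so argument order does not matter. */
int check_queue_pairs(int pairs, int num_tc)
{
	if (pairs > num_tc) {
		fprintf(stderr,
			"Error: %d count@offset pairs given, but num_tc is %d\n",
			pairs, num_tc);
		return -1;
	}
	return 0;
}
```

Deferring the comparison until the end matters because num_tc may appear on the command line after the pairs themselves.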
Dmitry Yakunin [Sat, 9 May 2020 16:52:02 +0000 (19:52 +0300)]
ss: add checks for bc filter support
As noted by David Ahern, when a bytecode filter is not supported by the
running kernel, the printed error message is unclear. This patch attempts
to detect that case and print a correct message. It does so by providing a
checking function for new filter types. As an example, a check function for
the cgroup filter is implemented. It sends a correct lightweight request
(idiag_states = 0) with a zero cgroup condition to the kernel and checks the
returned errno. If the filter is not supported, EINVAL is returned. The
result of the check is cached to avoid extra checks if the same filter is
specified several times.
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: David Ahern <dsahern@gmail.com>
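The probe-and-cache idea described above can be sketched as a small tri-state cache around a probe callback. Everything here is hypothetical: the structure, the function names and the stand-in probes exist only for this example, and treating every result other than -EINVAL as "supported" is a simplification.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum probe_state { PROBE_UNKNOWN = 0, PROBE_SUPPORTED, PROBE_UNSUPPORTED };

struct filter_probe {
	/* Returns 0 if the kernel accepted a minimal request, else -errno. */
	int (*probe)(void);
	enum probe_state state;	/* cached result of the first probe */
};

/* Probe the kernel at most once per filter type and remember the answer. */
bool filter_supported(struct filter_probe *fp)
{
	assert(fp != NULL && fp->probe != NULL);

	if (fp->state == PROBE_UNKNOWN)
		fp->state = (fp->probe() == -EINVAL) ? PROBE_UNSUPPORTED
						     : PROBE_SUPPORTED;
	return fp->state == PROBE_SUPPORTED;
}

/* Stand-in probes that only count how often they are called. */
int probe_calls;

int fake_probe_einval(void)
{
	probe_calls++;
	return -EINVAL;
}

int fake_probe_ok(void)
{
	probe_calls++;
	return 0;
}
```

Calling filter_supported() repeatedly on the same structure triggers the probe only on the first call, which is the effect the caching in the commit is after.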
Dmitry Yakunin [Sat, 9 May 2020 16:52:01 +0000 (19:52 +0300)]
ss: add support for cgroup v2 information and filtering
This patch introduces two new features: obtaining cgroup information and
filtering sockets by cgroups. These features rely on the cgroup v2 ID
field in the socket (the kernel must be compiled with CONFIG_SOCK_CGROUP_DATA).
Cgroup information can be obtained by specifying the --cgroup flag and
currently contains only the pathname. For faster pathname lookups a cgroup
cache is implemented. This cache is filled on ss startup, and missing
entries are resolved and saved on the fly.
The cgroup filter extends EXPRESSION and allows specifying a cgroup pathname
(relative or absolute) to obtain only the sockets attached to this cgroup.
Filter syntax: ss [ cgroup PATHNAME ]
Examples:
ss -a cgroup /sys/fs/cgroup/unified (or ss -a cgroup .)
ss -a cgroup /sys/fs/cgroup/unified/cgroup1 (or ss -a cgroup cgroup1)
v2:
- style fixes (David Ahern)
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: David Ahern <dsahern@gmail.com>
Dmitry Yakunin [Sat, 9 May 2020 16:52:00 +0000 (19:52 +0300)]
ss: introduce cgroup2 cache and helper functions
This patch prepares the infrastructure for matching sockets by cgroups.
Two helper functions are added for converting between a cgroup v2 ID
and a pathname. The cgroup v2 cache is implemented as a hash table indexed
by ID. This cache is needed for faster lookups of a socket's cgroup.
v2:
- style fixes (David Ahern)
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: David Ahern <dsahern@gmail.com>
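A hash table indexed by cgroup ID, as described above, can be sketched with a fixed bucket array and chained entries. The bucket count, type names and functions below are assumptions for illustration and do not mirror the cache that ss implements; duplicate IDs and freeing of entries are left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CG_BUCKETS 256	/* illustrative size */

struct cg_entry {
	struct cg_entry *next;	/* chain of entries sharing a bucket */
	uint64_t id;		/* cgroup v2 ID */
	char *path;		/* owned copy of the pathname */
};

struct cg_cache {
	struct cg_entry *buckets[CG_BUCKETS];
};

static unsigned int cg_hash(uint64_t id)
{
	return (unsigned int)(id % CG_BUCKETS);
}

/* Store a copy of path under id. Returns 0, or -1 if allocation fails. */
int cg_cache_add(struct cg_cache *c, uint64_t id, const char *path)
{
	struct cg_entry *e;
	size_t len;

	assert(c != NULL && path != NULL);

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	len = strlen(path) + 1;
	e->path = malloc(len);
	if (!e->path) {
		free(e);
		return -1;
	}
	memcpy(e->path, path, len);
	e->id = id;

	/* Push onto the front of the bucket's chain. */
	e->next = c->buckets[cg_hash(id)];
	c->buckets[cg_hash(id)] = e;
	return 0;
}

/* Return the cached pathname for id, or NULL on a cache miss. */
const char *cg_cache_lookup(const struct cg_cache *c, uint64_t id)
{
	const struct cg_entry *e;

	assert(c != NULL);
	for (e = c->buckets[cg_hash(id)]; e; e = e->next) {
		if (e->id == id)
			return e->path;
	}
	return NULL;
}
```

On a miss, a caller could resolve the pathname by other means and then call cg_cache_add(), which corresponds to the "resolved and saved on the fly" behaviour mentioned for the cgroup filtering patch.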