Delete the vdpa device after its use:
$ vdpa dev del foo2
An example of PCI PF, VF and SF management device:
pci/0000:03.00:0
supported_classes
net
pci/0000:03.00:4
supported_classes
net
auxiliary/mlx5_core.sf.8
supported_classes
net
Parav Pandit [Wed, 10 Feb 2021 18:34:42 +0000 (20:34 +0200)]
utils: Add helper routines for indent handling
Subsequent patch needs to use 2 char indentation for nested objects.
Hence introduce a generic helpers to allocate, deallocate, increment,
decrement and to print indent block.
Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
This commit adds support for configuring HTB in offload mode. HTB
offload eliminates the single qdisc lock in the datapath and offloads
the algorithm to the NIC. The new 'offload' parameter is added to
enable this mode:
Classes are created as usual, but filters should be moved to clsact for
lock-free classification (filters attached to HTB itself are not
supported in the offload mode):
# tc filter add dev eth0 egress protocol ip flower dst_port 80
action skbedit priority 1:10
tc qdisc show and tc class show will indicate whether the offload is
enabled. Example output:
Thayne McCombs [Tue, 2 Feb 2021 03:32:10 +0000 (20:32 -0700)]
ss: always prefer family as part of host condition to default family
ss accepts an address family both with the -f option and as part of a
host condition. However, if the family in the host condition is
different than the the last -f option, then which family is actually
used depends on the order that different families are checked.
This changes parse_hostcond to check all family prefixes before parsing
the rest of the address, so that the host condition's family always has
a higher priority than the "preferred" family.
Signed-off-by: Thayne McCombs <astrothayne@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Tue, 2 Feb 2021 02:45:49 +0000 (02:45 +0000)]
Merge branch 'devlink-port-mgmt' into next
Parav Pandit says:
====================
This patchset implements devlink port add, delete and function state
management commands.
An example sequence for a PCI SF:
Set the device in switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
View ports in switchdev mode:
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 s=
plittable false
Add a subfunction port for PCI PF 0 with sfnumber 88:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfn=
um 0 sfnum 88 splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
Show a newly added port:
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf contro=
ller 0 pfnum 0 sfnum 88 splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
Set the function state to active:
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:8=
8 state active
Show the port in JSON format:
$ devlink port show pci/0000:06:00.0/32768 -jp
{
"port": {
"pci/0000:06:00.0/32768": {
"type": "eth",
"netdev": "ens2f0npf0sf88",
"flavour": "pcisf",
"controller": 0,
"pfnum": 0,
"sfnum": 88,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:88:88",
"state": "active",
"opstate": "attached"
}
}
}
}
Set the function state to active:
$ devlink port function set pci/0000:06:00.0/32768 state inactive
Delete the port after use:
$ devlink port del pci/0000:06:00.0/32768
Oliver Hartkopp [Mon, 25 Jan 2021 10:40:55 +0000 (11:40 +0100)]
iplink_can: add Classical CAN frame LEN8_DLC support
The len8_dlc element is filled by the CAN interface driver and used for CAN
frame creation by the CAN driver when the CAN_CTRLMODE_CC_LEN8_DLC flag is
supported by the driver and enabled via netlink configuration interface.
Add the command line support for cc-len8-dlc for Linux 5.11+
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David Ahern <dsahern@kernel.org>
Jarod Wilson [Fri, 15 Jan 2021 19:21:37 +0000 (14:21 -0500)]
bond: support xmit_hash_policy=vlan+srcmac
There's a new transmit hash policy being added to the bonding driver that
is a simple XOR of vlan ID and source MAC, xmit_hash_policy vlan+srcmac.
This trivial patch makes it configurable and queryable via iproute2.
Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David Ahern <dsahern@kernel.org>
David Ahern [Mon, 18 Jan 2021 04:10:27 +0000 (04:10 +0000)]
Merge branch 'dcb-app-dcbx' into next
Petr Machata says:
====================
Add support to the dcb tool for the following two DCB objects:
- APP, which allows configuration of traffic prioritization rules based on
several possible packet headers.
- DCBX, which is a 1-byte bitfield of flags that configure whether the DCBX
protocol is implemented in the device or in the host, and which version
of the protocol should be used.
Patch #1 adds a new helper for finding a name of a given dsfield value.
This is useful for APP DSCP-to-priority rules, which can use human-readable
DSCP names.
Patches #2, #3 and #4 extend existing interfaces for, respectively, parsing
of the X:Y mappings, for setting a DCB object, and for getting a DCB
object.
In patch #5, support for the command line argument -N / --Numeric is
added. The APP tool later uses it to decide whether to format DSCP values
as human-readable strings or as plain numbers.
Patches #6 and #7 add the subtools themselves and their man pages.
v2:
- Two patches dropped and sent to iproute2 branch as "dcb: Fixes".
This patch set now depends on that one.
- Patch #5:
- Make it -N / --Numeric instead of -n / --no-nice-names
- Rename the flag from no_nice_names to numeric as well
- Patch #6:
- Adjust to s/no_nice_names/numeric/ from another patch.
Petr Machata [Sat, 2 Jan 2021 00:03:41 +0000 (01:03 +0100)]
dcb: Add a subtool for the DCBX object
The Linux DCBX object is a 1-byte bitfield of flags that configure whether
the DCBX protocol is implemented in the device or in the host, and which
version of the protocol should be used. Add a tool to access the per-port
Linux DCBX object.
For example:
# dcb dcbx set dev eni1np1 host ieee
# dcb dcbx show dev eni1np1
host ieee
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:39 +0000 (01:03 +0100)]
dcb: Support -N to suppress translation to human-readable names
Some DSCP values can be translated to symbolic names. That may not be
always desirable. Introduce a command-line option similar to other tools,
-N or --Numeric, to suppress this translation.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:38 +0000 (01:03 +0100)]
dcb: Generalize dcb_get_attribute()
The function dcb_get_attribute() assumes that the caller knows the exact
size of the looked-for payload. It also assumes that the response comes
wrapped in an DCB_ATTR_IEEE nest. The former assumption does not hold for
the IEEE APP table, which has variable size. The latter one does not hold
for DCBX, which is not IEEE-nested, and also for any CEE attributes, which
would come CEE-nested.
Factor out the payload extractor from the current dcb_get_attribute() code,
and put into a helper. Then rewrite dcb_get_attribute() compatibly in terms
of the new function. Introduce dcb_get_attribute_va() as a thin wrapper for
IEEE-nested access, and dcb_get_attribute_bare() for access to attributes
that are not nested.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:37 +0000 (01:03 +0100)]
dcb: Generalize dcb_set_attribute()
The function dcb_set_attribute() takes a fully-formed payload as an
argument. For callers that need to build a nested attribute, such as is the
case for DCB APP table, this is not great, because with libmnl, they would
need to construct a separate netlink message just to pluck out the payload
and hand it over to this function.
Currently, dcb_set_attribute() also always wraps the payload in an
DCB_ATTR_IEEE container, because that is what all the dcb subtools so far
needed. But that is not appropriate for DCBX in particular, and in fact a
handful other attributes, as well as any CEE payloads.
Instead, generalize this code by adding parameters for constructing a
custom payload and for fetching the response from a custom response
attribute. Then add dcb_set_attribute_va(), which takes a callback to
invoke in the right place for the nest to be built, and
dcb_set_attribute_bare(), which is similar to dcb_set_attribute(), but does
not encapsulate the payload in an IEEE container. Rewrite
dcb_set_attribute() compatibly in terms of the new functions.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:36 +0000 (01:03 +0100)]
lib: Generalize parse_mapping()
The function parse_mapping() assumes the key is a number, with a single
configurable exception, which is using "all" to mean "all possible keys".
If a caller wishes to use symbolic names instead of numbers, they cannot
reuse this function.
To facilitate reuse in these situations, convert parse_mapping() into a
helper, parse_mapping_gen(), which instead of an allow-all boolean takes a
generic key-parsing callback. Rewrite parse_mapping() in terms of this
newly-added helper and add a pair of key parsers, one for just numbers,
another for numbers and the keyword "all". Publish the latter as well.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
Petr Machata [Sat, 2 Jan 2021 00:03:35 +0000 (01:03 +0100)]
lib: rt_names: Add rtnl_dsfield_get_name()
For formatting DSCP (not full dsfield), it would be handy to be able to
just get the name from the name table, and not get any of the remaining
cruft related to formatting. Add a new entry point to just fetch the
name table string uninterpreted. Use it from rtnl_dsfield_n2a().
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@kernel.org>
$ tc -json filter show dev eth0 ingress
...{"eth_type":"8847",
" mpls":[" lse":["depth":1,"label":100],
" lse":["depth":2,"label":200]]}...
This is invalid as the arrays, introduced by "[", can't contain raw
string:value pairs. Those must be enclosed into "{}" to form valid json
ojects. Also, there are spurious whitespaces before the mpls and lse
strings because of the indentation used for normal output.
Fix this by putting all LSE parameters (depth, label, tc, bos and ttl)
into the same json object. The "mpls" key now directly contains a list
of such objects.
Also, handle strings differently for normal and json output, so that
json strings don't get spurious indentation whitespaces.
Normal output isn't modified.
The json output now looks like:
$ tc -json filter show dev eth0 ingress
...{"eth_type":"8847",
"mpls":[{"depth":1,"label":100},
{"depth":2,"label":200}]}...
Fixes: eb09a15c12fb ("tc: flower: support multiple MPLS LSE match") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Petr Machata [Sun, 3 Jan 2021 10:57:24 +0000 (11:57 +0100)]
dcb: Change --Netns/-N to --netns/-n
This to keep compatible with the major tools, ip and tc. Also
document the option in the man page, which was neglected.
Fixes: 67033d1c1c8a ("Add skeleton of a new tool, dcb") Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Petr Machata [Sun, 3 Jan 2021 10:57:23 +0000 (11:57 +0100)]
dcb: Plug a leaking DCB socket buffer
DCB socket buffer is allocated in dcb_init(), but never freed(). Free it
in dcb_fini().
Fixes: 67033d1c1c8a ("Add skeleton of a new tool, dcb") Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Petr Machata [Sun, 3 Jan 2021 10:57:22 +0000 (11:57 +0100)]
dcb: Set values with RTM_SETDCB type
dcb currently sends all netlink messages with a type RTM_GETDCB, even the
set ones. Change to the appropriate type.
Fixes: 67033d1c1c8a ("Add skeleton of a new tool, dcb") Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Add support in rdma for extack errors to be received
in userspace when sent from kernel, so now netlink extack
error messages sent from kernel would be printed for the
user.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Ido Schimmel [Thu, 7 Jan 2021 15:23:26 +0000 (17:23 +0200)]
nexthop: Fix usage output
Before:
# ip nexthop help
Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR
ip nexthop { add | replace } id ID NH [ protocol ID ]
ip nexthop { get| del } id ID
SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]
[ groups ] [ fdb ]
NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]
[ encap ENCAPTYPE ENCAPHDR ] | group GROUP ] }
GROUP := [ id[,weight]>/<id[,weight]>/... ]
ENCAPTYPE := [ mpls ]
ENCAPHDR := [ MPLSLABEL ]
After:
# ip nexthop help
Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR
ip nexthop { add | replace } id ID NH [ protocol ID ]
ip nexthop { get | del } id ID
SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]
[ groups ] [ fdb ]
NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]
[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }
GROUP := [ <id[,weight]>/<id[,weight]>/... ]
ENCAPTYPE := [ mpls ]
ENCAPHDR := [ MPLSLABEL ]
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David Ahern <dsahern@kernel.org>
Guillaume Nault [Mon, 14 Dec 2020 17:14:22 +0000 (18:14 +0100)]
testsuite: Add mpls packet matching tests for tc flower
Match all MPLS fields using smallest and highest possible values.
Test the two ways of specifying MPLS header matching:
* with the basic mpls_{label,tc,bos,ttl} keywords (match only on the
first LSE),
* with the more generic "lse" keyword (allows matching at different
depth of the MPLS label stack).
This test file allows to find problems like the one fixed by
Linux commit 7fdd375e3830 ("net: sched: Fix dump of MPLS_OPT_LSE_LABEL
attribute in cls_flower").
Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David Ahern <dsahern@gmail.com>
Thomas Karlsson [Mon, 14 Dec 2020 10:42:18 +0000 (11:42 +0100)]
iplink:macvlan: Added bcqueuelen parameter
This patch allows the user to set and retrieve the
IFLA_MACVLAN_BC_QUEUE_LEN parameter via the bcqueuelen
command line argument
This parameter controls the requested size of the queue for
broadcast and multicast packages in the macvlan driver.
If not specified, the driver default (1000) will be used.
Note: The request is per macvlan but the actually used queue
length per port is the maximum of any request to any macvlan
connected to the same port.
For this reason, the used queue length IFLA_MACVLAN_BC_QUEUE_LEN_USED
is also retrieved and displayed in order to aid in the understanding
of the setting. However, it can of course not be directly set.
Signed-off-by: Thomas Karlsson <thomas.karlsson@paneda.se> Signed-off-by: David Ahern <dsahern@gmail.com>
Andrea Claudi [Fri, 11 Dec 2020 18:53:03 +0000 (19:53 +0100)]
tc: pedit: fix memory leak in print_pedit
keys_ex is dinamically allocated with calloc on line 770, but
is not freed in case of error at line 823.
Fixes: 081d6c310d3a ("tc: pedit: Support JSON dumping") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Andrea Claudi [Fri, 11 Dec 2020 18:53:02 +0000 (19:53 +0100)]
devlink: fix memory leak in cmd_dev_flash()
nlg_ntf is dinamically allocated in mnlg_socket_open(), and is freed on
the out: return path. However, some error paths do not free it,
resulting in memory leak.
This commit fix this using mnlg_socket_close(), and reporting the
correct error number when required.
Fixes: 9b13cddfe268 ("devlink: implement flash status monitoring") Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Andrea Claudi [Fri, 11 Dec 2020 18:26:40 +0000 (19:26 +0100)]
man: tc-flower: fix manpage
Commit 924c43778a84 ("man: tc-ct.8: Add manual page for ct tc action")
add man page for tc-ct, but it brings with it a bogus block of text
in the benning of tc-flower man page.
This commit simply removes it.
Fixes: 924c43778a84 ("man: tc-ct.8: Add manual page for ct tc action") Reported-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
David Ahern [Mon, 14 Dec 2020 16:48:38 +0000 (16:48 +0000)]
Merge branch 'dcb-pfc-buffer-maxrate' into next
Petr Machata says:
====================
Add support to the dcb tool for the following three DCB objects:
- PFC, for "Priority-based Flow Control", allows configuration of priority
lossiness, and related toggles.
- DCBNL buffer interfaces are an extension to the 802.1q DCB interfaces and
allow configuration of port headroom buffers.
- DCBNL maxrate interfaces are an extension to the 802.1q DCB interfaces
and allow configuration of rate with which traffic in a given traffic
class is sent.
Patches #1-#4 fix small issues in the current DCB code and man pages.
Patch #5 adds new helpers to the DCB dispatcher.
Patches #6 and #7 add support for command line arguments -s and -i. These
enable, respectively, display of statistical counters, and ISO/IEC mode of
rate units.
Patches #8-#10 add the subtools themselves and their man pages.
Petr Machata [Thu, 10 Dec 2020 23:02:24 +0000 (00:02 +0100)]
dcb: Add a subtool for the DCB maxrate object
DCBNL maxrate interfaces are an extension to the 802.1q DCB interfaces and
allow configuration of rate with which traffic in a given traffic class is
sent.
Add a dcb subtool to allow showing and tweaking of this per-TC maximum
rate. For example:
# dcb maxrate show dev eni1np1
tc-maxrate 0:25Gbit 1:25Gbit 2:25Gbit 3:25Gbit 4:25Gbit 5:25Gbit 6:100Gbit 7:25Gbit
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Thu, 10 Dec 2020 23:02:19 +0000 (00:02 +0100)]
dcb: Add dcb_set_u32(), dcb_set_u64()
The DCB buffer object has a settable array of 32-bit quantities, and the
maxrate object of 64-bit ones. Adjust dcb_parse_mapping() and related
helpers to support 64-bit values in mappings, and add appropriate helpers.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Thu, 10 Dec 2020 23:02:15 +0000 (00:02 +0100)]
dcb: Remove unsupported command line arguments from getopt_long()
getopt_long() currently includes "c" and "n" in the short option string.
These probably slipped in as a cut'n'paste, and are not actually accepted.
Remove them.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Moshe Shemesh [Mon, 7 Dec 2020 05:35:22 +0000 (07:35 +0200)]
devlink: Add reload stats to dev show
Show reload statistics through devlink dev show using devlink stats
flag. The reload statistics show the history per reload action type and
limit. Add remote reload statistics to show the history of actions
performed due devlink reload commands initiated by remote host.
Moshe Shemesh [Mon, 7 Dec 2020 05:35:20 +0000 (07:35 +0200)]
devlink: Add devlink reload action and limit options
Add reload action and reload limit to devlink reload command to enable
the user to select the reload action required and constrains limits on
these actions that he may want to ensure.
The following reload actions are supported:
driver_reinit: driver entities re-initialization, applying
devlink-param and devlink-resource values.
fw_activate: firmware activate.
The uAPI is backward compatible, if the reload action option is omitted
from the reload command, the driver reinit action will be used.
Note that when required to do firmware activation some drivers may need
to reload the driver. On the other hand some drivers may need to reset
the firmware to reinitialize the driver entities. Therefore, the devlink
reload command returns the actions which were actually performed.
By default reload actions are not limited and driver implementation may
include reset or downtime as needed to perform the actions. However, if
reload limit is selected, the driver should perform only if it can do it
while keeping the limit constraints.
Reload limit added:
no_reset: No reset allowed, no down time allowed, no link flap and no
configuration is lost.
Command examples:
$devlink dev reload pci/0000:82:00.0 action driver_reinit
reload_actions_performed:
driver_reinit
$devlink dev reload pci/0000:82:00.0 action fw_activate
reload_actions_performed:
driver_reinit fw_activate
David Ahern [Wed, 9 Dec 2020 02:32:17 +0000 (02:32 +0000)]
Merge branch 'rate-size-parsing-output' into next
Petr Machata says:
==================
The DCB tool will have commands that deal with buffer sizes and traffic
rates. TC is another tool that has a number of such commands, and functions
to support them: get_size(), get_rate/64(), s/print_size() and
s/print_rate(). In this patchset, these functions are moved from TC to lib/
for possible reuse and modernized.
s/print_rate() has a hidden parameter of a global variable use_iec, which
made the conversion non-trivial. The parameter was made explicit,
print_rate() converted to a mostly json_print-like function, and
sprint_rate() retired in favor of the new print_rate. Patches #1 and #2
deal with this.
The intention was to treat s/print_size() similarly, but unfortunately two
use cases of sprint_size() cannot be converted to a json_print-like
print_size(), and the function sprint_size() had to remain as a discouraged
backdoor to print_size(). This is done in patch #3.
Patch #4 then improves the code of sprint_size() a little bit.
Patch #5 fixes a buglet in formatting small rates in IEC mode.
Patches #6 and #7 handle a routine movement of, respectively,
get_rate/64() and get_size() from tc to lib.
This patchset does not actually add any new uses of these functions. A
follow-up patchset will add subtools for management of DCB buffer and DCB
maxrate objects that will make use of them.
Petr Machata [Sat, 5 Dec 2020 21:13:35 +0000 (22:13 +0100)]
lib: Move get_size() from tc here
The function get_size() serves for parsing of sizes using a handly notation
that supports units and their prefixes, such as 10Kbit. This will be useful
for the DCB buffer size parsing. Move the function from TC to the general
library, so that it can be reused.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Sat, 5 Dec 2020 21:13:34 +0000 (22:13 +0100)]
lib: Move get_rate(), get_rate64() from tc here
The functions get_rate() and get_rate64() are useful for parsing rate-like
values. The DCB tool will find these useful in the maxrate subtool.
Move them over to lib so that they can be easily reused.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Sat, 5 Dec 2020 21:13:33 +0000 (22:13 +0100)]
lib: print_color_rate(): Fix formatting small rates in IEC mode
ISO/IEC units are distinguished from the decadic ones by using a prefixes
like "Ki", "Mi" instead of "K" and "M". The current code inserts the letter
"i" after the decadic unit when in IEC mode. However it does so even when
the prefix is an empty string, formatting 1Kbit in IEC mode as "1000ibit".
Fix by omitting the letter if there is no prefix.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Sat, 5 Dec 2020 21:13:32 +0000 (22:13 +0100)]
lib: sprint_size(): Uncrustify the code a bit
Ideally this and the rate printing would both be converted to a common
helper, but unfortunately the two format differently and this would break
tests and scripts out there. So just make the code look less like a wad of
hay.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Sat, 5 Dec 2020 21:13:31 +0000 (22:13 +0100)]
lib: Move sprint_size() from tc here, add print_size()
When displaying sizes of various sorts, tc commonly uses the function
sprint_size() to format the size into a buffer as a human-readable string.
This string is then displayed either using print_string(), or in some code
even fprintf(). As a result, a typical sequence of code when formatting a
size is something like the following:
For a concept as broadly useful as size, it would be better to have a
dedicated function in json_print.
To that end, move sprint_size() from tc_util to json_print. Add helpers
print_size() and print_color_size() that wrap arount sprint_size() and
provide the JSON dispatch as appropriate.
Since print_size() should be the preferred interface, convert vast majority
of uses of sprint_size() to print_size(). Two notable exceptions are:
- q_tbf, which does not show the size as such, but uses the string
"$human_readable_size/$cell_size" even in JSON. There is simply no way to
have print_size() emit the same text, because print_size() in JSON mode
should of course just use the raw number, without human-readable frills.
- q_cake, which relies on the existence of sprint_size() in its macro-based
formatting helpers. There might be ways to convert this particular case,
but given q_tbf simply cannot be converted, leave it as is.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Sat, 5 Dec 2020 21:13:30 +0000 (22:13 +0100)]
lib: Move print_rate() from tc here; modernize
The functions print_rate() and sprint_rate() are useful for formatting
rate-like values. The DCB tool would find these useful in the maxrate
subtool. However, the current interface to these functions uses a global
variable use_iec as a flag indicating whether 1024- or 1000-based powers
should be used when formatting the rate value. For general use, a global
variable is not a great way of passing arguments to a function. Besides, it
is unlike most other printing functions in that it deals in buffers and
ignores JSON.
Therefore make the interface to print_rate() explicit by converting use_iec
to an ordinary parameter. Since the interface changes anyway, convert it to
follow the pattern of other json_print functions (except for the
now-explicit use_iec parameter). Move to json_print.c.
Add a wrapper to tc, so that all the call sites do not need to repeat the
use_iec global variable argument, and convert all call sites.
In q_cake.c, the conversion is not straightforward due to usage of a macro
that is shared across numerous data types. Simply hand-roll the
corresponding code, which seems better than making an extra helper for one
call site.
Drop sprint_rate() now that everybody just uses print_rate().
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Petr Machata [Sat, 5 Dec 2020 21:13:29 +0000 (22:13 +0100)]
Move the use_iec declaration to the tools
The tools "ip" and "tc" use a flag "use_iec", which indicates whether, when
formatting rate values, the prefixes "K", "M", etc. should refer to powers
of 1024, or powers of 1000. The flag is currently kept as a global variable
in "ip" and "tc", but is nonetheless declared in util.h.
Instead, move the declaration to tool-specific headers ip/ip_common.h and
tc/tc_common.h.
Signed-off-by: Petr Machata <me@pmachata.org> Signed-off-by: David Ahern <dsahern@gmail.com>
Paolo Lungaroni [Wed, 2 Dec 2020 13:15:51 +0000 (14:15 +0100)]
seg6: add support for vrftable attribute in SRv6 End.DT4/DT6 behaviors
We introduce the "vrftable" attribute for supporting the SRv6 End.DT4 and
End.DT6 behaviors in iproute2.
The "vrftable" attribute indicates the routing table associated with
the VRF device used by SRv6 End.DT4/DT6 for routing IPv4/IPv6 packets.
The SRv6 End.DT4/DT6 is used to implement IPv4/IPv6 L3 VPNs based on Segment
Routing over IPv6 networks in multi-tenants environments.
It decapsulates the received packets and it performs the IPv4/IPv6 routing
lookup in the routing table of the tenant.
The SRv6 End.DT4/DT6 leverages a VRF device in order to force the routing
lookup into the associated routing table using the "vrftable" attribute.
Some examples:
$ ip -6 route add 2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0
$ ip -6 route add 2001:db8::2 encap seg6local action End.DT6 vrftable 200 dev eth0
Standard Output:
$ ip -6 route show 2001:db8::1
2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0 metric 1024 pref medium
v2:
- no changes made: resubmit after pulling out this patch from the kernel
patchset.
v1:
- mixing this patch with the kernel patchset confused patckwork.
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it> Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: David Ahern <dsahern@gmail.com>
Luca Boccassi [Fri, 27 Nov 2020 18:06:51 +0000 (18:06 +0000)]
ip/netns: use flock when setting up /run/netns
If multiple ip processes are ran at the same time to set up
separate network namespaces, and it is the first time so /run/netns
has to be set up first, and they end up doing it at the same time,
the processes might enter a recursive loop creating thousands of
mount points, which might crash the system depending on resources
available.
Try to take a flock on /run/netns before doing the mount() dance, to
ensure this cannot happen. But do not try too hard, and if it fails
continue after printing a warning, to avoid introducing regressions.
First reported on Debian: https://bugs.debian.org/949235
To reproduce (WARNING: run in a VM to avoid system lockups):
for i in {0..9}
do
strace -e trace=mount -e inject=mount:delay_exit=1000000 ip \
netns add "testnetns$i" 2>&1 | tee "$i.log" &
done
wait
The strace is to ensure the problem always reproduces, to add an
artificial synchronization point after the first mount().
Vlad Buslov [Mon, 30 Nov 2020 19:32:50 +0000 (21:32 +0200)]
tc: implement support for action terse dump
Implement support for action terse dump using new TCA_ACT_FLAG_TERSE_DUMP
value of TCA_ROOT_FLAGS tlv. Set the flag when user requested it with
following example CLI (-br for 'brief'):
$ tc -s -br actions ls action tunnel_key
total acts 2
action order 0: tunnel_key index 1
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 1: tunnel_key index 2
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
In terse mode dump only outputs essential data needed to identify the
action (kind, index) and stats, if requested by the user.
Signed-off-by: Vlad Buslov <vlad@buslov.dev> Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David Ahern <dsahern@gmail.com>
With gcc-10 it complains about array subscript error.
f_u32.c: In function ‘u32_parse_opt’:
f_u32.c:1113:24: warning: array subscript 0 is outside the bounds of an interior zero-length array ‘struct tc_u32_key[0]’ [-Wzero-length-bounds]
1113 | hash = sel2.sel.keys[0].val & sel2.sel.keys[0].mask;
| ~~~~~~~~~~~~~^~~
In file included from tc_util.h:11,
from f_u32.c:26:
../include/uapi/linux/pkt_cls.h:253:20: note: while referencing ‘keys’
253 | struct tc_u32_key keys[0];
|
This is because the keys are actually allocated in the second element
of the parent structure.
Simplest way to address the warning is to assign directly to the keys
in the containing structure.
This has always been in iproute2 (pre-git) so no Fixes.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The code here was doing strncpy() in a way that causes gcc 10
warning about possible string overflow. Just use strlcpy() which
will null terminate and bound the string as expected.
This has existed since start of git era so no Fixes tag.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Gcc-10 complains about referencing a zero size array.
This occurs because the array of keys is actually in the following
structure which is part of the overall selector.
The original code was safe, but better to just use the key
array directly.
Fixes: 2d9a8dc439ee ("tc: p_ip6: Support pedit of IPv6 dsfield") Cc: petrm@mellanox.com Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Gcc-10 complains about possible string length overflow.
This can't happen Ethernet address format is always limited to
18 characters or less. Just resize the temp buffer.
Fixes: 70dfb0b8836d ("iplink: bridge: export bridge_id and designated_root") Cc: nikolay@cumulusnetworks.com Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
devlink.c: In function ‘cmd_dev’:
devlink.c:2803:12: warning: ‘val_u32’ may be used uninitialized in this function [-Wmaybe-uninitialized]
2803 | val_u16 = val_u32;
| ~~~~~~~~^~~~~~~~~
devlink.c:2747:11: note: ‘val_u32’ was declared here
2747 | uint32_t val_u32;
| ^~~~~~~
This is a false positive because it can't figure out the control flow
when the parse returns error.
Fixes: 2557dca2b028 ("devlink: Add string to uint{8,16,32} conversion for generic parameters") Cc: shalomt@mellanox.com Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
David Ahern [Wed, 25 Nov 2020 05:24:15 +0000 (22:24 -0700)]
Merge branch 'libbpf' into next
Hangbin Liu says:
====================
This series converts iproute2 to use libbpf for loading and attaching
BPF programs when it is available. This means that iproute2 will
correctly process BTF information and support the new-style BTF-defined
maps, while keeping compatibility with the old internal map definition
syntax.
This is achieved by checking for libbpf at './configure' time, and using
it if available. By default the system libbpf will be used, but static
linking against a custom libbpf version can be achieved by passing
LIBBPF_DIR to configure. LIBBPF_FORCE can be set to on to force configure
abort if no suitable libbpf is found (useful for automatic packaging
that wants to enforce the dependency), or set off to disable libbpf check
and build iproute2 with legacy bpf.
The old iproute2 bpf code is kept and will be used if no suitable libbpf
is available. When using libbpf, wrapper code ensures that iproute2 will
still understand the old map definition format, including populating
map-in-map and tail call maps before load.
The examples in bpf/examples are kept, and a separate set of examples
are added with BTF-based map definitions for those examples where this
is possible (libbpf doesn't currently support declaratively populating
tail call maps).
At last, Thanks a lot for Toke's help on this patch set.
v6:
a) print runtime libbpf version in ip -V and tc -V
v5:
a) Fix LIBBPF_DIR typo and description, use libbpf DESTDIR as LIBBPF_DIR
dest.
b) Fix bpf_prog_load_dev typo.
c) rebase to latest iproute2-next.
v4:
a) Make variable LIBBPF_FORCE able to control whether build iproute2
with libbpf or not.
b) Add new file bpf_glue.c to for libbpf/legacy mixed bpf calls.
c) Fix some build issues and shell compatibility error.
v3:
a) Update configure to Check function bpf_program__section_name() separately
b) Add a new function get_bpf_program__section_name() to choose whether to
use bpf_program__title() or not.
c) Test build the patch on Fedora 33 with libbpf-0.1.0-1.fc33 and
libbpf-devel-0.1.0-1.fc33
v2:
a) Remove self defined IS_ERR_OR_NULL and use libbpf_get_error() instead.
b) Add ipvrf with libbpf support.
Here are the test results with patched iproute2:
== Show libbpf version
$ ip -V
ip utility, iproute2-5.9.0, libbpf 0.1.0
$ tc -V
tc utility, iproute2-5.9.0, libbpf 0.1.0
== Load objs
$ /root/iproute2/ip/ip link set veth0 xdp obj bpf_graft.o sec aaa
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 4 tag 3056d2382e53f27c jited
$ ls /sys/fs/bpf/xdp/globals
jmp_tc
$ bpftool map show
1: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
4: xdp name cls_aaa tag 3056d2382e53f27c gpl
loaded_at 2020-10-22T08:04:21-0400 uid 0
xlated 80B jited 71B memlock 4096B
btf_id 5
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj bpf_map_in_map.o sec ingress
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 8 tag 4420e72b2a601ed7 jited
$ ls /sys/fs/bpf/xdp/globals
jmp_tc map_inner map_outer
$ bpftool map show
1: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
2: array name map_inner flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
3: array_of_maps name map_outer flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
8: xdp name imain tag 4420e72b2a601ed7 gpl
loaded_at 2020-10-22T08:04:23-0400 uid 0
xlated 336B jited 193B memlock 4096B map_ids 3
btf_id 10
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj bpf_shared.o sec ingress
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 12 tag 9cbab549c3af3eab jited
$ ls /sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef /sys/fs/bpf/xdp/globals
/sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef:
map_sh
/sys/fs/bpf/xdp/globals:
jmp_tc map_inner map_outer
$ bpftool map show
1: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
2: array name map_inner flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
3: array_of_maps name map_outer flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
4: array name map_sh flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
12: xdp name imain tag 9cbab549c3af3eab gpl
loaded_at 2020-10-22T08:04:25-0400 uid 0
xlated 224B jited 139B memlock 4096B map_ids 4
btf_id 15
$ /root/iproute2/ip/ip link set veth0 xdp off
== Load objs again to make sure maps could be reused
$ /root/iproute2/ip/ip link set veth0 xdp obj bpf_graft.o sec aaa
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 16 tag 3056d2382e53f27c jited
$ ls /sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef /sys/fs/bpf/xdp/globals
/sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef:
map_sh
/sys/fs/bpf/xdp/globals:
jmp_tc map_inner map_outer
$ bpftool map show
1: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
2: array name map_inner flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
3: array_of_maps name map_outer flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
4: array name map_sh flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
16: xdp name cls_aaa tag 3056d2382e53f27c gpl
loaded_at 2020-10-22T08:04:27-0400 uid 0
xlated 80B jited 71B memlock 4096B
btf_id 20
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj bpf_map_in_map.o sec ingress
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 20 tag 4420e72b2a601ed7 jited
$ ls /sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef /sys/fs/bpf/xdp/globals
/sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef:
map_sh
/sys/fs/bpf/xdp/globals:
jmp_tc map_inner map_outer
$ bpftool map show [236/4518]
1: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
2: array name map_inner flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
3: array_of_maps name map_outer flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
4: array name map_sh flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
20: xdp name imain tag 4420e72b2a601ed7 gpl
loaded_at 2020-10-22T08:04:29-0400 uid 0
xlated 336B jited 193B memlock 4096B map_ids 3
btf_id 25
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj bpf_shared.o sec ingress
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 24 tag 9cbab549c3af3eab jited
$ ls /sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef /sys/fs/bpf/xdp/globals
/sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef:
map_sh
/sys/fs/bpf/xdp/globals:
jmp_tc map_inner map_outer
$ bpftool map show
1: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
2: array name map_inner flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
3: array_of_maps name map_outer flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
4: array name map_sh flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
24: xdp name imain tag 9cbab549c3af3eab gpl
loaded_at 2020-10-22T08:04:31-0400 uid 0
xlated 224B jited 139B memlock 4096B map_ids 4
btf_id 30
$ /root/iproute2/ip/ip link set veth0 xdp off
$ rm -rf /sys/fs/bpf/xdp/7a1422e90cd81478f97bc33fbd7782bcb3b868ef /sys/fs/bpf/xdp/globals
== Testing if we can load new-style objects (using xdp-filter as an example)
$ /root/iproute2/ip/ip link set veth0 xdp obj /usr/lib64/bpf/xdpfilt_alw_all.o sec xdp_filter
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 28 tag e29eeda1489a6520 jited
$ ls /sys/fs/bpf/xdp/globals
filter_ethernet filter_ipv4 filter_ipv6 filter_ports xdp_stats_map
$ bpftool map show
5: percpu_array name xdp_stats_map flags 0x0
key 4B value 16B max_entries 5 memlock 4096B
btf_id 35
6: percpu_array name filter_ports flags 0x0
key 4B value 8B max_entries 65536 memlock 1576960B
btf_id 35
7: percpu_hash name filter_ipv4 flags 0x0
key 4B value 8B max_entries 10000 memlock 1064960B
btf_id 35
8: percpu_hash name filter_ipv6 flags 0x0
key 16B value 8B max_entries 10000 memlock 1142784B
btf_id 35
9: percpu_hash name filter_ethernet flags 0x0
key 6B value 8B max_entries 10000 memlock 1064960B
btf_id 35
$ bpftool prog show
28: xdp name xdpfilt_alw_all tag e29eeda1489a6520 gpl
loaded_at 2020-10-22T08:04:33-0400 uid 0
xlated 2408B jited 1405B memlock 4096B map_ids 9,5,7,8,6
btf_id 35
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj /usr/lib64/bpf/xdpfilt_alw_ip.o sec xdp_filter
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 32 tag 2f2b9dbfb786a5a2 jited
$ ls /sys/fs/bpf/xdp/globals
filter_ethernet filter_ipv4 filter_ipv6 filter_ports xdp_stats_map
$ bpftool map show
5: percpu_array name xdp_stats_map flags 0x0
key 4B value 16B max_entries 5 memlock 4096B
btf_id 35
6: percpu_array name filter_ports flags 0x0
key 4B value 8B max_entries 65536 memlock 1576960B
btf_id 35
7: percpu_hash name filter_ipv4 flags 0x0
key 4B value 8B max_entries 10000 memlock 1064960B
btf_id 35
8: percpu_hash name filter_ipv6 flags 0x0
key 16B value 8B max_entries 10000 memlock 1142784B
btf_id 35
9: percpu_hash name filter_ethernet flags 0x0
key 6B value 8B max_entries 10000 memlock 1064960B
btf_id 35
$ bpftool prog show
32: xdp name xdpfilt_alw_ip tag 2f2b9dbfb786a5a2 gpl
loaded_at 2020-10-22T08:04:35-0400 uid 0
xlated 1336B jited 778B memlock 4096B map_ids 7,8,5
btf_id 40
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj /usr/lib64/bpf/xdpfilt_alw_tcp.o sec xdp_filter
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 36 tag 18c1bb25084030bc jited
$ ls /sys/fs/bpf/xdp/globals
filter_ethernet filter_ipv4 filter_ipv6 filter_ports xdp_stats_map
$ bpftool map show
5: percpu_array name xdp_stats_map flags 0x0
key 4B value 16B max_entries 5 memlock 4096B
btf_id 35
6: percpu_array name filter_ports flags 0x0
key 4B value 8B max_entries 65536 memlock 1576960B
btf_id 35
7: percpu_hash name filter_ipv4 flags 0x0
key 4B value 8B max_entries 10000 memlock 1064960B
btf_id 35
8: percpu_hash name filter_ipv6 flags 0x0
key 16B value 8B max_entries 10000 memlock 1142784B
btf_id 35
9: percpu_hash name filter_ethernet flags 0x0
key 6B value 8B max_entries 10000 memlock 1064960B
btf_id 35
$ bpftool prog show
36: xdp name xdpfilt_alw_tcp tag 18c1bb25084030bc gpl
loaded_at 2020-10-22T08:04:37-0400 uid 0
xlated 1128B jited 690B memlock 4096B map_ids 6,5
btf_id 45
$ /root/iproute2/ip/ip link set veth0 xdp off
$ rm -rf /sys/fs/bpf/xdp/globals
== Load new btf defined maps
$ /root/iproute2/ip/ip link set veth0 xdp obj btf_graft.o sec aaa
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 40 tag 3056d2382e53f27c jited
$ ls /sys/fs/bpf/xdp/globals
jmp_tc
$ bpftool map show
10: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
40: xdp name cls_aaa tag 3056d2382e53f27c gpl
loaded_at 2020-10-22T08:04:39-0400 uid 0
xlated 80B jited 71B memlock 4096B
btf_id 50
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj btf_map_in_map.o sec ingress
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 44 tag 4420e72b2a601ed7 jited
$ ls /sys/fs/bpf/xdp/globals
jmp_tc map_outer
$ bpftool map show
10: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
11: array name map_inner flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
13: array_of_maps name map_outer flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
44: xdp name imain tag 4420e72b2a601ed7 gpl
loaded_at 2020-10-22T08:04:41-0400 uid 0
xlated 336B jited 193B memlock 4096B map_ids 13
btf_id 55
$ /root/iproute2/ip/ip link set veth0 xdp off
$ /root/iproute2/ip/ip link set veth0 xdp obj btf_shared.o sec ingress
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
prog/xdp id 48 tag 9cbab549c3af3eab jited
$ ls /sys/fs/bpf/xdp/globals
jmp_tc map_outer map_sh
$ bpftool map show
10: prog_array name jmp_tc flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
11: array name map_inner flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
13: array_of_maps name map_outer flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
14: array name map_sh flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
$ bpftool prog show
48: xdp name imain tag 9cbab549c3af3eab gpl
loaded_at 2020-10-22T08:04:43-0400 uid 0
xlated 224B jited 139B memlock 4096B map_ids 14
btf_id 60
$ /root/iproute2/ip/ip link set veth0 xdp off
$ rm -rf /sys/fs/bpf/xdp/globals
== Test load objs by tc
$ /root/iproute2/tc/tc qdisc add dev veth0 ingress
$ /root/iproute2/tc/tc filter add dev veth0 ingress bpf da obj bpf_cyclic.o sec 0xabccba/0
$ /root/iproute2/tc/tc filter add dev veth0 parent ffff: bpf obj bpf_graft.o
$ /root/iproute2/tc/tc filter add dev veth0 ingress bpf da obj bpf_tailcall.o sec 42/0
$ /root/iproute2/tc/tc filter add dev veth0 ingress bpf da obj bpf_tailcall.o sec 42/1
$ /root/iproute2/tc/tc filter add dev veth0 ingress bpf da obj bpf_tailcall.o sec 43/0
$ /root/iproute2/tc/tc filter add dev veth0 ingress bpf da obj bpf_tailcall.o sec classifier
$ /root/iproute2/ip/ip link show veth0
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 6a:e6:fa:2b:4e:1f brd ff:ff:ff:ff:ff:ff
$ ls /sys/fs/bpf/xdp/37e88cb3b9646b2ea5f99ab31069ad88db06e73d /sys/fs/bpf/xdp/fc68fe3e96378a0cba284ea6acbe17e898d8b11f /sys/fs/bpf/xdp/globals
/sys/fs/bpf/xdp/37e88cb3b9646b2ea5f99ab31069ad88db06e73d:
jmp_tc
Hangbin Liu [Mon, 23 Nov 2020 13:12:01 +0000 (21:12 +0800)]
examples/bpf: add bpf examples with BTF defined maps
Users should try use the new BTF defined maps instead of struct
bpf_elf_map defined maps. The tail call examples are not added yet
as libbpf doesn't currently support declaratively populating tail call
maps.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hangbin Liu <haliu@redhat.com> Signed-off-by: David Ahern <dsahern@gmail.com>
Hangbin Liu [Mon, 23 Nov 2020 13:11:59 +0000 (21:11 +0800)]
lib: add libbpf support
This patch converts iproute2 to use libbpf for loading and attaching
BPF programs when it is available, which is started by Toke's
implementation[1]. With libbpf iproute2 could correctly process BTF
information and support the new-style BTF-defined maps, while keeping
compatibility with the old internal map definition syntax.
The old iproute2 bpf code is kept and will be used if no suitable libbpf
is available. When using libbpf, wrapper code in bpf_legacy.c ensures that
iproute2 will still understand the old map definition format, including
populating map-in-map and tail call maps before load.
In bpf_libbpf.c, we init iproute2 ctx and elf info first to check the
legacy bytes. When handling the legacy maps, for map-in-maps, we create
them manually and re-use the fd as they are associated with id/inner_id.
For pin maps, we only set the pin path and let libbp load to handle it.
For tail calls, we find it first and update the element after prog load.
Other maps/progs will be loaded by libbpf directly.
Hangbin Liu [Mon, 23 Nov 2020 13:11:58 +0000 (21:11 +0800)]
lib: make ipvrf able to use libbpf and fix function name conflicts
There are directly calls in libbpf for bpf program load/attach.
So we could just use two wrapper functions for ipvrf and convert
them with libbpf support.
Function bpf_prog_load() is removed as it's conflict with libbpf
function name.
bpf.c is moved to bpf_legacy.c for later main libbpf support in
iproute2.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hangbin Liu <haliu@redhat.com> Signed-off-by: David Ahern <dsahern@gmail.com>
Hangbin Liu [Mon, 23 Nov 2020 13:11:57 +0000 (21:11 +0800)]
iproute2: add check_libbpf() and get_libbpf_version()
This patch aim to add basic checking functions for later iproute2
libbpf support.
First we add check_libbpf() in configure to see if we have bpf library
support. By default the system libbpf will be used, but static linking
against a custom libbpf version can be achieved by passing libbpf DESTDIR
to variable LIBBPF_DIR for configure.
Another variable LIBBPF_FORCE is used to control whether to build iproute2
with libbpf. If set to on, then force to build with libbpf and exit if
not available. If set to off, then force to not build with libbpf.
When dynamically linking against libbpf, we can't be sure that the
version we discovered at compile time is actually the one we are
using at runtime. This can lead to hard-to-debug errors. So we add
a new file lib/bpf_glue.c and a helper function get_libbpf_version()
to get correct libbpf version at runtime.
Signed-off-by: Hangbin Liu <haliu@redhat.com> Signed-off-by: David Ahern <dsahern@gmail.com>